Pinterest’s data scientists (DS) are primarily responsible for using data to solve complex problems, uncover insights, and propose data-driven solutions. In comparison to machine learning engineers (MLE), data scientists place more emphasis on statistical analysis and data mining, and the interview process also focuses more on understanding statistical and analytical methods.
A summary of the onsite interview is as follows:
Coding:
- SQL and Python are assessed; the difficulty is relatively easy.
A/B Testing:
- Emphasis on selecting key metrics directly related to experimental goals, such as user conversion rate, click-through rate, revenue, etc.
- Consideration of secondary metrics to comprehensively evaluate the impact of the experiment.
- If the experiment results do not show a significant difference, revisit the experimental design and execution; this may mean reassessing the goals, experiment duration, or sample size, or trying different variant designs.
Machine Learning:
- Coding may not be required on-site; the focus may be more on explaining model processes, discussing feature engineering, data processing, etc.
- Addressing the problem of identifying whether new users will become long-term users could involve using various machine learning models such as logistic regression, random forests, or deep learning models.
- The key is to choose appropriate features to describe users, such as registration information, behavioral data, etc.
- For highly imbalanced data issues, approaches like oversampling, undersampling, or ensemble learning may be employed.
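To illustrate the oversampling approach mentioned above for imbalanced data, here is a minimal pandas sketch that duplicates minority-class rows until the classes are balanced; the `label` column name and the helper function are placeholders, not a prescribed Pinterest solution:

import pandas as pd

def random_oversample(df: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    # Duplicate minority-class rows (with replacement) until both classes are the same size.
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_needed = counts.max() - counts.min()
    minority_rows = df[df[label_col] == minority_label]
    extra = minority_rows.sample(n=n_needed, replace=True, random_state=seed)
    # Shuffle so the duplicated rows are not clustered at the end.
    return pd.concat([df, extra]).sample(frac=1, random_state=seed)

In practice, class weights or stratified sampling are equally common alternatives; the right choice depends on the model and the degree of imbalance.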
The following topics appeared in Pinterest's DS online assessment (OA):
- Recall and False Positive Rate (FPR): find a confusion matrix that meets given requirements:
- Recall is a performance metric for classification models, representing the proportion of actual positives the model correctly identifies. The formula is TP/(TP+FN).
- FPR (False Positive Rate) is the proportion of actual negatives the model incorrectly classifies as positive. The formula is FP/(FP+TN).
- By adjusting the model's classification threshold, a confusion matrix that meets specific Recall and FPR requirements can be found (see the sketch after this list).
- Ensemble Modeling:
- Ensemble Modeling is a technique that improves overall performance by combining multiple models. Common methods include Bagging (e.g., Random Forest) and Boosting (e.g., AdaBoost, Gradient Boosting).
- Decision Tree:
- A decision tree is a tree-shaped model that recursively splits instances based on input features, ultimately assigning them to different categories. It is easy to understand and interpret but can be prone to overfitting.
- Loss Function:
- A Loss Function measures the difference between a model’s predictions and actual values. During model training, the goal is to minimize the loss function.
- Neural Network (NN) Classifier: Initializing the Weight Vector:
- Initializing neural network weights to all zeros may lead to symmetry issues, hindering the learning of different features. Typically, random initialization is used, drawing small random values from a Gaussian distribution.
- Bootstrap Aggregation:
- Bootstrap Aggregation (Bagging) involves building multiple models, each trained on a different subset of training data, and then combining them. This helps reduce model variance and improve stability.
- Neural Network Output:
- The output of a neural network refers to the model's predictions on input data. In classification problems, the softmax function is often used to convert the raw outputs (logits) into a probability distribution (see the sketch after this list).
- Algorithm: Finding All Local Maxima:
- For a continuous function, candidate local maxima are points where the derivative (or gradient) is zero; for an array, a local maximum is an element greater than its neighbors, and the task is to return all such positions (see the sketch after this list).
- Bootstrap Tree:
- A Bootstrap Tree is a decision tree constructed by bootstrapping, involving creating multiple decision trees through sampling from the dataset with replacement. Combining these trees can enhance model performance.
- Gaussian Naive Bayes:
- Gaussian Naive Bayes is a classification algorithm based on Bayes' theorem. It assumes features are conditionally independent given the class and that each continuous feature follows a Gaussian distribution within each class, making it suitable for continuous features. It is a variant of the naive Bayes algorithm (see the sketch after this list).
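For the Recall/FPR topic above, a small sketch of the two formulas; the example counts are made up:

def recall(tp: int, fn: int) -> float:
    # Recall (true positive rate): share of actual positives the model catches.
    return tp / (tp + fn)

def false_positive_rate(fp: int, tn: int) -> float:
    # FPR: share of actual negatives the model wrongly flags as positive.
    return fp / (fp + tn)

# Example confusion matrix: TP=80, FN=20, FP=5, TN=95 -> recall 0.80, FPR 0.05
print(recall(80, 20), false_positive_rate(5, 95))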
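For the ensemble modeling and bagging topics above, a short illustrative comparison of a bagging-style and a boosting-style model in scikit-learn; the synthetic dataset and hyperparameters are arbitrary choices, not values from the OA:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging-style ensemble: many decorrelated trees trained on bootstrap samples, predictions averaged.
bagging_model = RandomForestClassifier(n_estimators=200, random_state=0)
# Boosting-style ensemble: trees added sequentially, each correcting the previous ones' errors.
boosting_model = GradientBoostingClassifier(random_state=0)

for name, model in [("random forest", bagging_model), ("gradient boosting", boosting_model)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())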
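For the neural network output topic above, a minimal, numerically stable softmax sketch in NumPy:

import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max before exponentiating for numerical stability; the result sums to 1.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]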
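For the "finding all local maxima" topic above, one simple array version; it assumes a strictly-greater-than-neighbors definition and counts endpoints, which may differ from the exact OA wording:

def all_local_maxima(nums):
    # Return the indices i where nums[i] is strictly greater than its existing neighbors.
    n = len(nums)
    maxima = []
    for i in range(n):
        left_ok = i == 0 or nums[i] > nums[i - 1]
        right_ok = i == n - 1 or nums[i] > nums[i + 1]
        if left_ok and right_ok:
            maxima.append(i)
    return maxima

print(all_local_maxima([1, 3, 2, 5, 4, 6]))  # [1, 3, 5]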
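For the Gaussian Naive Bayes topic above, a quick scikit-learn sketch; the Iris dataset is used only because it has continuous features:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each feature is modeled as a per-class Gaussian, and features are assumed
# conditionally independent given the class.
model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))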
(1) Metric Analysis:
- What is the Homefeed: the Homefeed usually refers to the content stream shown on a user's home page in a social media platform or app, including posts, images, videos, etc.
- How to measure fresh pins (pins created within the past 7 days); useful metrics include:
- Percentage of fresh pins in the Homefeed: of the pins shown on a user's home page, how many were created within the past 7 days.
- Engagement metrics on fresh pins: analyze interactions with fresh pins, such as clicks, shares, and comments.
(2) SQL:
Calculate the percent of users that saw fresh content on each day, for all dates present in the table. Fresh: created < 7 days ago. Two tables: pins (pin_id, created_at) and events (dt, pin_id, user_id, action_type, count):
SELECT
    e.dt,
    -- Numerator: users who saw at least one pin created within the 7 days before dt;
    -- denominator: all users who saw anything that day.
    100.0 * COUNT(DISTINCT CASE WHEN p.created_at > e.dt - INTERVAL 7 DAY THEN e.user_id END)
        / COUNT(DISTINCT e.user_id) AS percent_users_saw_fresh_content
FROM events e
JOIN pins p ON e.pin_id = p.pin_id
GROUP BY e.dt;
(3) Python:
Given two lists of dates (l1, l2), return the list that is more fresh. More fresh: has more dates within 7 days of the provided date (current_dt).
from datetime import timedelta

def more_fresh_list(l1, l2, current_date):
    # Keep only dates that fall within the 7 days up to and including current_date.
    fresh_l1 = [d for d in l1 if current_date - timedelta(days=7) <= d <= current_date]
    fresh_l2 = [d for d in l2 if current_date - timedelta(days=7) <= d <= current_date]
    # Return the list with more fresh dates (l2 on ties).
    return l1 if len(fresh_l1) > len(fresh_l2) else l2
(4) Experiment:
- How to evaluate the impact of new pins on user engagement:
- Design an A/B test: introduce new pins to one group and keep the other group as the control.
- Compare engagement metrics between the two groups, such as click-through rate and retention.
- Sample size: determine it based on the experimental design and the expected effect size (see the sketch below).
- Risks: possible time effects, user selection bias, etc.
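As a rough sketch of the sample-size point above, one common way to estimate the per-group sample size for a two-proportion test with statsmodels; the baseline rate, expected lift, power, and alpha below are illustrative assumptions:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed control click-through rate
expected_rate = 0.11   # assumed treatment rate (a 10% relative lift)

# Convert the two proportions into Cohen's effect size h.
effect_size = proportion_effectsize(expected_rate, baseline_rate)

# Solve for the sample size per group at 80% power and alpha = 0.05.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(round(n_per_group))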
(5) P-value:
- What is a p-value: in statistics, the p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. When the p-value is below the chosen significance level (e.g., 0.05), we reject the null hypothesis; the smaller the p-value, the stronger the evidence against it (see the sketch below).
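To make the p-value discussion concrete, a minimal two-proportion z-test for an A/B experiment using statsmodels; the counts below are made up:

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and sample sizes for control vs. treatment.
conversions = [520, 570]
sample_sizes = [5000, 5000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=sample_sizes)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
# Reject the null hypothesis at the 0.05 level only if p_value < 0.05.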
Q1. Product Sense:
- How would you define fresh pins?
- Fresh pins can be defined as pins that were created within the last 7 days.
- How should we let users see more fresh pins?
- Implement a sorting or filtering algorithm in the user’s feed to prioritize and display recently created pins first.
- Utilize push notifications or email alerts to inform users about new and fresh content.
- How much % of fresh pins should a user see every day?
- The percentage of fresh pins a user should see every day can vary based on user preferences and engagement patterns. A/B testing different percentages can help identify the optimal balance between fresh and existing content.
Q2. SQL:
Given 2 tables, find the % of users who saw fresh pins on a given day.
Fresh is defined as a pin that has at least 2 impressions with impression type=1 within 7 days of the created date.
table1
pin_id | created_date
ABC    | 1/1/2017

table2
user_id   | pin_id | date_impression | impression_type | count
JohnSmith | ABC    | 1/1/2018        | 1               | 1
One possible query:
-- A pin is "fresh" if it received at least 2 impressions of type 1
-- within 7 days of its created_date.
WITH fresh_pins AS (
    SELECT t1.pin_id
    FROM table1 t1
    JOIN table2 t2 ON t2.pin_id = t1.pin_id
    WHERE t2.impression_type = 1
      AND t2.date_impression BETWEEN t1.created_date AND t1.created_date + INTERVAL 7 DAY
    GROUP BY t1.pin_id
    HAVING SUM(t2.count) >= 2
)
SELECT
    t2.date_impression AS dt,
    -- Share of that day's users who saw at least one fresh pin.
    100.0 * COUNT(DISTINCT CASE WHEN f.pin_id IS NOT NULL THEN t2.user_id END)
        / COUNT(DISTINCT t2.user_id) AS percent_users_saw_fresh_pins
FROM table2 t2
LEFT JOIN fresh_pins f ON f.pin_id = t2.pin_id
GROUP BY t2.date_impression;
Q3. Python:
Given two lists of dates, representing pin creation dates for two users' homefeeds, write a function that returns the list (i.e., the feed) that is more fresh.
"More fresh" here is defined as the one with the higher number (or percent) of pins that are < 7 days old.
l1 = ["2021-11-15", "2021-11-13", "2021-11-10", "2021-05-28", "2021-06-02", "2021-06-02", "2021-11-02"]
l2 = ["2021-11-11", "2021-03-02", "2021-11-05", "2021-05-20", "2021-05-01", "2021-06-01", "2021-04-08"]
dt = "2021-11-15"
from datetime import datetime, timedelta

def more_fresh_list(l1, l2, current_date):
    def count_fresh_pins(dates):
        # Count dates that fall within the 7 days up to and including current_date.
        return sum(
            1 for d in dates
            if 0 <= (current_date - datetime.strptime(d, "%Y-%m-%d")).days <= 7
        )
    # Return the list with more fresh pins (l2 on ties).
    return l1 if count_fresh_pins(l1) > count_fresh_pins(l2) else l2
list1 = ["2021-11-15", "2021-11-13", "2021-11-10", "2021-05-28", "2021-06-02"]
list2 = ["2021-11-11", "2021-03-02", "2021-11-05", "2021-05-20", "2021-05-01", "2021-06-01", "2021-04-08"]
current_dt = "2021-11-15"
result = more_fresh_list(list1, list2, datetime.strptime(current_dt, "%Y-%m-%d"))
print(result)
This Python function calculates and compares the freshness of two lists based on the number of pins created within the last 7 days.
We also provide consultation and support services for OA and VO. If needed, please feel free to contact us:
chen@csoahelp.com
