50 Data Science Interview Questions
I needed a list of questions like this when I was interviewing for data science and ML engineering roles. This list aims to help data science leaders interview DS/ML engineers, and to help DS/ML engineers study what is important and ace their interviews.
- What is the difference between supervised and unsupervised learning?
- What is the difference between regression, classification, clustering and ranking?
- What metrics will you use to evaluate a regression problem?
- What does it mean to have low MAE and high MSE?
- What metrics will you use to evaluate a classification model?
- Why is accuracy a bad metric for classification?
- How can you tackle data imbalance?
- Can you describe a situation where precision is more important than recall and F-score?
- Is the F-score a statistically significant metric?
- Can you explain what the area under the ROC curve (AUC-ROC) is?
- Is AUC-ROC immune to data imbalance?
- What is the area under the PR curve (AUC-PR) metric?
- How will you measure the association/correlation between two numerical variables?
- How will you measure the association/correlation between two categorical variables?
- How will you measure the association/correlation between one numerical variable and one categorical variable?
- What assumptions must the data satisfy to use Pearson correlation?
- If we have zero Pearson correlation, what does this imply?
- Can you explain Spearman’s correlation?
- Can you explain how a decision tree works?
- Can you explain how logistic regression works? Why is it called regression despite being a classifier?
- Can you explain the bias-variance trade-off?
- What is cross-validation and when is it important?
- What is A/B Testing?
- Can you explain the curse of dimensionality?
- Can you explain how PCA works?
- Can you explain any feature reduction methods other than PCA, such as Independent Component Analysis, Canonical Correlation Analysis, or Common Spatial Pattern?
- Can you explain how deep learning works?
- What is gradient descent?
- How can you detect outliers?
- What is the difference between a deterministic model and a stochastic model?
- Do distance-based algorithms require orthogonality of the features? Why?
- What is feature scaling? What is the difference between normalization and standardization?
- What is discretization? When is it important?
- What is hyperparameter optimization?
- What is survivorship bias, and why is it a problem?
- How can you specify the number of clusters to use in a clustering problem?
- What is regularization? Why is it important? What are the different types of regularization?
- Describe the most interesting project you worked on during your career.
- Can you explain reinforcement learning?
- How would you evaluate a clustering model?
- Can you explain self-selection bias, undercoverage bias, and survivorship bias?
- Which sampling techniques do you know (random sampling, systematic sampling, stratified sampling, cluster sampling, etc.)?
- Does correlation imply causality? Does correlation imply common causality?
- What are confounding variables?
- How do you handle missing values?
- Can you explain what discriminative bias is?
- Can you explain what under-represented segments are?
- Can you explain what data drift and concept drift are?
- What is the difference between data science, data engineering and data analytics?
- What is the difference between artificial intelligence, machine learning and deep learning?
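To make two of the questions above more concrete (what low MAE with high MSE looks like, and why accuracy can mislead on imbalanced data), here is a minimal sketch in plain Python. The toy numbers and helper function names are made up for illustration; this is not a substitute for a proper metrics library.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy regression case: nine perfect predictions and one large miss.
# Squaring the error makes the single miss dominate MSE, while MAE stays small.
reg_true = [10.0] * 10
reg_pred = [10.0] * 9 + [30.0]
toy_mae = mae(reg_true, reg_pred)
toy_mse = mse(reg_true, reg_pred)

# Toy imbalanced classification case: 95 negatives, 5 positives,
# and a "model" that always predicts the negative class.
clf_true = [0] * 95 + [1] * 5
clf_pred = [0] * 100
toy_accuracy = accuracy(clf_true, clf_pred)
toy_recall = recall(clf_true, clf_pred)

print(f"MAE={toy_mae}, MSE={toy_mse}")
print(f"accuracy={toy_accuracy}, recall={toy_recall}")
```

In this toy setup, a small MAE alongside a much larger MSE hints at a few large errors rather than many small ones, and the always-negative classifier scores high accuracy while finding none of the positives.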
Connect on LinkedIn: https://www.linkedin.com/in/hanyhossny/
Read other articles: https://hany-hossny.medium.com/
Follow on Twitter: https://twitter.com/_HanyHossny
I wrote many of these questions myself; others are taken from various websites, which are listed as references below.
References
Hany is an AI/ML enthusiast, academic researcher, and lead scientist @ Catch.com.au, Australia. He likes to make sense of data and help businesses become data-driven.