Machine Learning Interview Questions & Answers

Tech Interviews
7 min read · Jul 5, 2021

Below is a list of a few interview questions on Machine Learning. Follow us to stay updated with new questions:

How do we check if a variable follows the normal distribution?

  • Plot a histogram of the sampled data. If the bell-shaped “normal” curve fits the histogram well, the hypothesis that the underlying random variable follows the normal distribution is plausible.
  • Check the skewness and kurtosis of the sampled data. A normal distribution has zero skewness and zero excess kurtosis, so the farther these values are from 0, the less normal the distribution.
  • Use the Kolmogorov-Smirnov and/or Shapiro-Wilk tests for normality. They formally test the null hypothesis that the sample comes from a normal distribution.
  • Check the Quantile-Quantile (Q-Q) plot. It is a scatterplot created by plotting two sets of quantiles against one another; if the data are normal, the points in a normal Q-Q plot fall along a roughly straight line (see the sketch after this list).
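
A minimal sketch of these checks in Python, assuming SciPy and Matplotlib are available; the array `x` here holds synthetic data and stands in for your own sample:

```python
# Normality checks on a 1-D sample `x` (synthetic, illustrative only).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=1_000)   # replace with your own sample

# 1) Skewness and excess kurtosis: both are close to 0 for a normal sample.
print("skewness:", stats.skew(x))
print("excess kurtosis:", stats.kurtosis(x))  # Fisher definition, normal -> 0

# 2) Formal tests: the null hypothesis is "the data are normal",
#    so a small p-value is evidence against normality.
print("Shapiro-Wilk p-value:", stats.shapiro(x).pvalue)
print("Kolmogorov-Smirnov p-value:",
      stats.kstest((x - x.mean()) / x.std(), "norm").pvalue)

# 3) Visual checks: histogram and normal Q-Q plot.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(x, bins=30)
stats.probplot(x, dist="norm", plot=ax2)      # points on a line => roughly normal
plt.show()
```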

What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?

Prices are usually not normally distributed. Real-world and uncleaned datasets almost always show some skewness, and price prediction is no exception: the price of a house (or anything else under consideration) depends on many factors, so there is a high chance of skewed values, i.e. outliers in data-science terms.

Yes, you will likely need some pre-processing. Most commonly, you will need to handle the outliers and apply a transformation (a log transform is a typical choice for prices) to bring the distribution closer to normal.
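
As a rough illustration, here is what that pre-processing might look like with pandas, assuming a hypothetical DataFrame `df` with a right-skewed `price` column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a handful of house prices with one extreme outlier.
df = pd.DataFrame({"price": [120_000, 150_000, 135_000, 2_500_000, 140_000]})

# Cap outliers at the 1st/99th percentiles (winsorizing) instead of dropping rows.
low, high = df["price"].quantile([0.01, 0.99])
df["price_clipped"] = df["price"].clip(low, high)

# log1p handles zero prices safely; invert predictions later with np.expm1.
df["log_price"] = np.log1p(df["price_clipped"])
print(df)
```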

How do we choose K in K-fold cross-validation? What’s your favourite K?

There are two things to consider when choosing K: the number of models we end up training and the size of the validation set. We do not want the number of folds to be too small, like 2 or 3: averaging over at least 4 models gives a less biased estimate of the metrics. On the other hand, we want each validation set to be at least 20–25% of the entire data, so that a ratio of roughly 3–4:1 between training and validation sets is maintained.

A good rule of thumb is K = 4 for small datasets and K = 5 for larger ones; with K = 5, 20% of the data is used for validation in each fold, which is usually enough for an accurate estimate. If your dataset grows dramatically, say over 100,000 instances, 10-fold cross-validation still leaves folds of 10,000 instances, which is more than sufficient to reliably evaluate the model (see the sketch below).
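
A minimal 5-fold cross-validation sketch with scikit-learn; the Ridge model and the diabetes dataset are only placeholders:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# K = 5 -> each fold uses 80% of the data for training, 20% for validation.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2")
print("per-fold R^2:", scores.round(3))
print("mean R^2:", round(scores.mean(), 3))
```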

What is classification? Which models would you use to solve a classification problem?

Classification problems are problems in which the prediction space is discrete, i.e. the output variable can take only a finite number of values. Models that can be used to solve classification problems include logistic regression, decision trees, random forests, multi-layer perceptrons, KNN, SVM, the Naive Bayes classifier, and one-vs-all schemes built on top of binary classifiers, among others.
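
As a quick illustration, here is a sketch that fits two of the listed models on a synthetic binary-classification dataset with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with two classes, purely for illustration.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1_000), RandomForestClassifier()):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, "accuracy:", round(acc, 3))
```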

Can you cite some examples where a false negative is more important than a false positive?

Assume there is an airport ‘A’ which has received high security threats, and based on certain characteristics it identifies whether a particular passenger is a threat or not. Due to a shortage of staff, they decide to thoroughly scan only the passengers predicted as risk positives by their model.

Ex 1) What will happen if an actual threat is flagged as a non-threat by the airport’s model?

Ex 2) Another example is the judicial system. What if the jury or judge decides to let a guilty person go free?

Ex 3) What if you declined to marry a very good person based on your predictive model, and a few years later you meet them again and realize you had a false negative?

Can you cite some examples where both false positives and false negatives are equally important?

In the banking industry, giving loans is the primary source of income, but if the repayment rate is not good, you will not make any profit; rather, you will risk huge losses.

Banks don’t want to lose good customers, and at the same time they don’t want to acquire bad ones. In this scenario, both false positives and false negatives become very important to measure.

These days we hear of many cases of players using steroids in sports competitions. Every player has to go through a steroid test before the game starts. A false positive can ruin the career of a great sportsman, and a false negative can make the game unfair (see the sketch below).
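
A small sketch of how false positives and false negatives are read off a confusion matrix with scikit-learn; the labels below are toy values, purely illustrative:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = "bad customer" / "doping positive"
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

# For binary labels, ravel() returns counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```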

What is the precision-recall trade-off?

A trade-off means that increasing one quantity leads to a decrease in the other. The precision-recall trade-off occurs when we try to increase one of the two (precision or recall), typically by moving the decision threshold, while keeping the model itself the same.

In an ideal scenario with perfectly separable data, both precision and recall can reach the maximum value of 1.0. But in most practical situations there is noise in the dataset and the classes are not perfectly separable: some points of the positive class lie close to the negative class and vice versa. In such cases, shifting the decision boundary can increase either precision or recall, but not both; increasing one leads to a decrease in the other.
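
A minimal sketch of the trade-off with scikit-learn: the model stays fixed while the decision threshold is swept, and precision and recall move in opposite directions (the synthetic dataset and classifier are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~10% positives.
X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# Sweep the threshold: higher thresholds raise precision but lower recall.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
for p, r, t in list(zip(precision, recall, thresholds))[::20]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```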

In which cases is AU PR better than AU ROC?

AU ROC looks at the true positive rate (TPR) and false positive rate (FPR), while AU PR looks at the positive predictive value (PPV, i.e. precision) and the true positive rate (TPR, i.e. recall).

If true negatives are not meaningful to the problem, or you care more about the positive class, AU PR is typically going to be more useful. If you care equally about the positive and negative classes, or your dataset is fairly balanced, then AU ROC is a good choice.
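
As a rough illustration, here is a sketch that computes both metrics on a heavily imbalanced synthetic dataset with scikit-learn (AU PR is approximated by average precision):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced synthetic data: ~1% positives.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1_000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# ROC AUC often looks optimistic on imbalanced data (many easy true negatives);
# average precision focuses on performance for the rare positive class.
print("AU ROC:", round(roc_auc_score(y_test, scores), 3))
print("AU PR :", round(average_precision_score(y_test, scores), 3))
```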

Which feature selection techniques do you know?

Here are some of the feature selection techniques (a short sketch of the first one follows the list):

  • Principal Component Analysis
  • Neighborhood Component Analysis
  • Relief Algorithm
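
A minimal sketch of the first technique with scikit-learn; note that PCA is, strictly speaking, feature extraction (it builds new components) rather than a selection of the original columns:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)      # placeholder dataset
X_scaled = StandardScaler().fit_transform(X)    # PCA is scale-sensitive

pca = PCA(n_components=0.95)                    # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print("original features:", X.shape[1], "-> components kept:", X_reduced.shape[1])
```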

Which hyper-parameter tuning strategies (in general) do you know?

There are several strategies for hyper-parameter tuning, but I would argue that the three most popular nowadays are the following (a code sketch of the first two follows the list):

  • Grid Search is an exhaustive approach: for each hyper-parameter, the user manually supplies a list of values for the algorithm to try. Grid search then evaluates the algorithm on every combination of these values and returns the combination that gives the best result (e.g. the lowest MAE). Because it evaluates all combinations, grid search can be quite computationally expensive, and it can lead to sub-optimal results since the user has to specify the candidate values, which is prone to error and requires domain knowledge.
  • Random Search is similar to grid search, but instead of specifying exact values to try for each hyper-parameter, the user gives an upper and lower bound for each one. Random values within these bounds are then sampled (e.g. with uniform probability), and again the best combination is returned to the user. Although this seems less intuitive, no precise domain knowledge is necessary, and in practice much more of the parameter space can be explored.
  • In a completely different framework, Bayesian Optimization is a more statistical approach to optimization and is commonly used for tuning neural networks, since a single evaluation of a neural network can be computationally costly. In numerous research papers this method outperforms Grid Search and Random Search, and it is used on the Google Cloud Platform as well as on AWS. Because an in-depth explanation requires a solid background in Bayesian statistics and Gaussian processes, a “simple” explanation is that a much cheaper surrogate model (typically a Gaussian process) is fit to the results observed so far, and an acquisition function (such as probability of improvement or GP-UCB) intelligently chooses which hyper-parameter values to try next on the computationally expensive, original algorithm. The result of each expensive evaluation is fed back into the surrogate as prior knowledge, which is then used to choose the next set of hyper-parameters. This process continues either for a specified number of iterations or for a specified amount of time, and the combination of hyper-parameters that performs best on the expensive/original algorithm is returned.
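
A hedged sketch of the first two strategies with scikit-learn (the SVC model and breast-cancer dataset are placeholders; Bayesian Optimization typically needs an extra library such as scikit-optimize and is omitted here):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Grid search: tries every combination of the listed values.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}, cv=5)
grid.fit(X, y)
print("grid search best params:", grid.best_params_)

# Random search: samples n_iter combinations from the given distributions.
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e-1)},
    n_iter=20, cv=5, random_state=0,
)
rand.fit(X, y)
print("random search best params:", rand.best_params_)
```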

What’s the learning rate?

The learning rate is an important hyper-parameter that controls how quickly the model adapts to the problem during training. It can be seen as the “step width” of the parameter updates, i.e. how far the weights are moved in the direction of the minimum of our optimization problem.

What happens when the learning rate is too large or too small?

A large learning rate can accelerate training. However, we may “shoot” too far past the minimum of the function we want to optimize, or even diverge, and never reach the best solution. On the other hand, training with a small learning rate takes more time but can find a more precise minimum. The downside is that the solution can get stuck in a local minimum, with the weights barely updating, even though it is not the best possible global solution (see the sketch below).
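
A tiny illustrative sketch: plain gradient descent on f(w) = (w − 3)², showing how a small learning rate converges slowly while a too-large one overshoots and diverges:

```python
# Gradient descent on a toy 1-D objective, f(w) = (w - 3)^2.
def gradient_descent(lr, steps=20, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 3)      # derivative of (w - 3)^2
        w = w - lr * grad       # the learning rate scales the step width
    return w

for lr in (0.01, 0.1, 1.1):     # too small, reasonable, too large
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.3f} (optimum is 3)")
```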

Thanks for reading

Hope you found this useful. Let me know your thoughts in the comment section, and don’t forget to clap if you found the article helpful. We will be releasing more interview questions and answers on technical topics every week. To get notified, follow us on Medium.

To get access to 100+ questions and answers on Machine Learning, including answers to the questions below, please visit the following link:

Questions List

  1. What are the assumptions of linear regression?
  2. What is the normal distribution? Why do we care about it?
  3. How do we verify whether a feature follows the normal distribution or not?
  4. What is SGD (Stochastic Gradient Descent)? How is it different from gradient descent?
  5. Which metrics do we use for evaluating regression models?
  6. Share some examples where a false positive is more important than a false negative.
  7. What is the precision-recall trade-off?
  8. Can we use L1 regularization for feature selection?
  9. What happens when we have correlated features in our data?
  10. How can we incorporate implicit feedback (clicks, etc.) into recommender systems?
