Data Science Interview Questions

Q1: How to deal with unbalanced binary classification?

Ans: In binary classification, if the data set is unbalanced, plain accuracy is a misleading measure of model performance, because the smaller class contributes very little to it. For example, if just 5% of the samples belong to the smaller class and the model assigns every sample to the larger class, the accuracy is still 95%, even though the model never identifies a single sample of the smaller class. To address this, we may do the following:

  • Use alternative metrics to evaluate model performance, such as precision, recall, and F1 score.
  • Resample the data using techniques such as undersampling (lowering the sample size of the bigger class) and oversampling (raising the sample size of the smaller class using repetition, SMOTE, and other similar methods).
  • Apply K-fold cross-validation, preferably stratified so that every fold preserves the class proportions.
  • Use ensemble learning in which each decision tree is trained on the complete sample of the smaller class but only a subset of the bigger class.
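
The accuracy trap and the oversampling remedy described above can be illustrated with a minimal sketch in plain Python. The helper names `classification_metrics` and `random_oversample` are hypothetical, the data is synthetic, and simple repetition stands in for more elaborate methods such as SMOTE.

```python
import random

def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def random_oversample(features, labels, minority=1, seed=0):
    """Repeat randomly chosen minority samples until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    shortfall = (len(labels) - len(minority_idx)) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(max(shortfall, 0))]
    keep = list(range(len(labels))) + extra
    return [features[i] for i in keep], [labels[i] for i in keep]

# Synthetic data: 5% minority class, and a "model" that always predicts the majority.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # accuracy is 0.95, yet recall and F1 for the minority class are 0.0

features = [[float(i)] for i in range(100)]  # placeholder feature vectors
balanced_features, balanced_labels = random_oversample(features, y_true)
print(balanced_labels.count(1), balanced_labels.count(0))
```

Resampling of this kind should be applied to the training split only; the test split keeps its original class proportions so that the reported metrics stay honest.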

Q2: What is the bias-variance trade-off?


Bias: Bias is the error introduced into your model when the machine learning algorithm is oversimplified. It may result in underfitting. A high-bias algorithm makes simplifying assumptions during training so that the target function is easier to learn.

Low bias machine learning algorithms -> Decision Trees, k-NN and SVM

High bias machine learning algorithms -> Linear Regression, Logistic Regression

Variance: Variance is the error introduced into your model when the machine learning algorithm is too complex: the model learns noise from the training data set and performs poorly on the test data set. It may result in overfitting and excessive sensitivity to small changes in the training data.

Typically, as the complexity of your model increases, you will observe a drop in error owing to decreasing bias in the model. However, this only lasts until a certain point. As you continue to make your model more complicated, you wind up over-fitting it, and your model suffers from excessive variance.

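Bias-Variance trade-off: To obtain high prediction performance, a supervised machine learning method needs both low bias and low variance. In practice, reducing one tends to increase the other, so the goal is to find a balance between the two.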

  • The k-nearest neighbour method has a low bias and a large variance. Still, the trade-off may be adjusted by raising the value of k, which increases the number of neighbours that contribute to the prediction and, as a result, raises the model's bias while lowering its variance.
  • The support vector machine algorithm has a low bias and a large variance. Still, the trade-off may be altered through the C parameter, which controls how heavily margin violations in the training data are penalized. In the common formulation where C multiplies the penalty, lowering C permits more margin violations, increasing the bias while decreasing the variance.

In machine learning, there is no getting around the relationship between bias and variance: increasing the bias decreases the variance, and increasing the variance decreases the bias.
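
The effect of k on the k-nearest neighbour trade-off can be seen in a small sketch. This is a one-dimensional regression toy written in plain Python under stated assumptions: the data is synthetic, and `knn_predict` and `mean_squared_error` are hypothetical helper names. It only contrasts the two extremes of k.

```python
import random

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to `query`."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_squared_error(train_x, train_y, eval_x, eval_y, k):
    """Mean squared error of the k-NN predictions on (eval_x, eval_y)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2
              for x, y in zip(eval_x, eval_y)]
    return sum(errors) / len(errors)

rng = random.Random(0)
train_x = [i / 10 for i in range(30)]
train_y = [x + rng.gauss(0, 0.5) for x in train_x]  # linear signal plus noise

# k = 1: every training point is its own nearest neighbour, so the model
# reproduces the noise exactly (low bias, high variance).
train_mse_k1 = mean_squared_error(train_x, train_y, train_x, train_y, k=1)

# k = number of training points: every query receives the global mean,
# regardless of where it lies (high bias, low variance).
pred_low = knn_predict(train_x, train_y, 0.0, k=len(train_x))
pred_high = knn_predict(train_x, train_y, 2.9, k=len(train_x))

print(train_mse_k1, pred_low, pred_high)
```

With k = 1 the training error is zero because the noise is memorized; with k equal to the training set size the prediction ignores the input entirely. Useful values of k lie between these extremes and are usually chosen by cross-validation.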

Q3: What does the ROC Curve represent, and how to create it?

Ans: The ROC (Receiver Operating Characteristic) curve plots the true-positive rate against the false-positive rate at various classification thresholds. The curve represents a trade-off between sensitivity and specificity.

The ROC curve is constructed by plotting the true-positive rate (TPR, or sensitivity) against the false-positive rate (FPR, or 1 - specificity). The TPR is the proportion of positive observations correctly predicted as positive out of all positive observations. The FPR is the proportion of negative observations incorrectly predicted as positive out of all negative observations. In the case of medical testing, the TPR is the percentage of patients with a specific condition who are correctly identified as positive.
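
The construction can be sketched in a few lines of plain Python: sweep the threshold over the observed scores from highest to lowest and record one (FPR, TPR) point per threshold. The function names `roc_points` and `area_under_curve` are hypothetical, the labels and scores are made up, and the sketch assumes both classes are present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

curve = roc_points(y_true, scores)
auc = area_under_curve(curve)
print(curve)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc)    # 0.75
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.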

Q4: What is dimensionality reduction, and what are its benefits?

Ans: Dimensionality reduction is the process of reducing the number of features in a given dataset. There are several ways of reducing dimensionality, including:

  • Feature Selection Methods
  • Matrix Factorization
  • Manifold Learning
  • Autoencoder Methods
  • Linear Discriminant Analysis (LDA)
  • Principal Component Analysis (PCA)

The curse of dimensionality is one of the primary motivations for dimensionality reduction. As the number of features rises, the model grows increasingly complicated and the data becomes sparse in the feature space. If the number of data points is too small relative to the number of features, the model starts to memorize, or overfit, the training data and fails to generalize to new data. This is referred to as the curse of dimensionality.

Dimensionality reduction also has the following advantages: time and storage space are saved, the data is simpler to visualize in 2D or 3D, and space complexity is decreased.
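
As an illustration of PCA from the list above, the following sketch reduces three features to one. It finds the first principal component by power iteration on the covariance matrix. This is a teaching shortcut in plain Python under stated assumptions, not how libraries implement PCA. The data is synthetic, and `first_principal_component` and `project` are hypothetical helper names.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (feature means, unit vector of the first principal component)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiplying by the covariance matrix pulls
    # the vector toward the direction of largest variance. Assumes the start
    # vector is not orthogonal to that direction.
    vector = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return means, vector

def project(rows, means, component):
    """Map each row to its coordinate along `component` (one number per row)."""
    return [sum((row[j] - means[j]) * component[j] for j in range(len(means)))
            for row in rows]

# Synthetic data: the second feature is roughly twice the first,
# and the third feature is nearly constant.
rows = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
    [5.0, 10.1, 0.5],
]
means, component = first_principal_component(rows)
reduced = project(rows, means, component)
print(component)
print(reduced)
```

Because almost all of the variance lies along a single direction, one number per row retains most of the information that the original three features carried.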

Q5: How should you maintain a deployed model?

Ans: A model must be maintained once it has been deployed, because the data supplied to it may change over time. For example, in the case of a model forecasting home prices, property values may grow over time or vary for some other reason. The model's accuracy on fresh data should therefore be tracked. Some popular methods for checking that the model remains accurate are as follows:

  • Validate the model regularly by feeding it negative test data. It is acceptable for the model to have low accuracy on negative test data.
  • Create an autoencoder (AE) and use its reconstruction error as an anomaly detection signal. If the reconstruction error on new data is large, the new data does not follow the pattern the model previously learnt.
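
The reconstruction-error check can be sketched as a simple threshold test. The sketch below assumes the autoencoder already exists and that its per-sample reconstruction errors are available as plain lists. The "mean plus three standard deviations" rule, the function names, and the numbers are illustrative assumptions, not a prescribed method.

```python
import statistics

def drift_threshold(train_errors, num_std=3.0):
    """Error level above which a sample is treated as unlike the training data."""
    return statistics.mean(train_errors) + num_std * statistics.pstdev(train_errors)

def fraction_flagged(new_errors, threshold):
    """Share of new samples whose reconstruction error exceeds the threshold."""
    return sum(1 for error in new_errors if error > threshold) / len(new_errors)

# Hypothetical reconstruction errors measured on the training data...
train_errors = [0.10, 0.12, 0.11, 0.09, 0.13]
# ...and on a batch of data seen after deployment.
new_errors = [0.11, 0.50, 0.60, 0.12]

threshold = drift_threshold(train_errors)
flagged = fraction_flagged(new_errors, threshold)
print(threshold, flagged)  # half of the new batch exceeds the threshold
```

A persistently high flagged fraction suggests that the incoming data has drifted away from what the model was trained on and that retraining should be considered.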

If the model performs well on new data, the new data follows the pattern the model generalized from the previous data, and the model may simply be retrained using the fresh data. If the model's accuracy on new data is poor, it may be retrained on the new data alongside the old data, with additional feature engineering on the data features. If the accuracy is still poor after that, the model may need to be trained from scratch.