Data Science Interview Questions


Q6: What are the available feature selection methods for selecting the right variables for building efficient predictive models?

Ans: When building a model in data science or machine learning, not every variable in the dataset is necessarily required or helpful. Smarter feature selection approaches are needed to avoid redundant or irrelevant features and keep the model efficient. The three major approaches to feature selection are as follows:

Filter Methods:

  • These approaches rank features by their intrinsic characteristics, as measured by univariate statistics, and do not account for cross-validated model performance. Compared to wrapper techniques, they are simpler, faster, and use fewer computing resources.
  • Common filter methods include the Chi-Square test, Fisher's Score, the Correlation Coefficient, Variance Threshold, Mean Absolute Difference (MAD), and Dispersion Ratios; see the sketch after this list.
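
A minimal sketch of two of these filter methods using scikit-learn; the dataset and the thresholds below are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Variance Threshold: drop features whose variance falls below a cutoff.
X_high_variance = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-Square test: keep the k features most strongly associated with the
# target (requires non-negative feature values).
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

print(X_high_variance.shape, X_chi2.shape)
```

Note that neither selector trains the final model; both score features independently of the learning algorithm that will eventually use them.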

Wrapper Methods:

  • These approaches search, usually greedily, over candidate feature subsets and judge each subset's quality by training and evaluating a classifier on it.
  • The selection is tied to the machine learning algorithm used, which must suit the provided dataset.
    • Forward Selection: Start with a single feature and keep adding features, one at a time, until no further improvement (or a satisfactory fit) is achieved.
    • Backward Selection: Start with all of the features and eliminate the least useful ones one at a time to discover which subset works best.
    • Recursive Feature Elimination: Features are removed recursively, and model performance is re-assessed after each elimination.
  • These approaches are often computationally demanding and may require high-end computing resources for analysis. However, they typically yield stronger predictive models with higher accuracy than filter methods; see the sketch after this list.
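
A minimal sketch of a wrapper method, here Recursive Feature Elimination from scikit-learn; the dataset, the logistic-regression estimator, and the number of features kept are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# RFE trains the estimator, ranks features (here by the magnitude of the
# logistic-regression coefficients), and recursively drops the weakest
# feature until only n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)

selected = [i for i, kept in enumerate(rfe.support_) if kept]
print("Selected feature indices:", selected)
```

Because the classifier is retrained on every candidate subset, the cost grows quickly with the number of features, which is why these methods demand more compute than filters.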

Embedded Methods:

  • Embedded techniques combine the benefits of filter and wrapper methods by accounting for feature interactions while keeping computational cost low.
  • These approaches are iterative in the sense that each iteration of model training is examined, and the features that contribute most to the training in that iteration are carefully extracted.
  • LASSO Regularization (L1) and Random Forest feature importance are two examples of embedded techniques; see the sketch after this list.
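
A minimal sketch of both embedded techniques; the dataset and hyperparameters below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# LASSO (L1) regularization: coefficients of unhelpful features are shrunk
# to exactly zero, so the surviving non-zero coefficients act as a selection.
lasso = Lasso(alpha=0.5).fit(X, y)
kept_by_lasso = np.flatnonzero(lasso.coef_)

# Random Forest importance: impurity reduction attributed to each feature.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
top_by_forest = np.argsort(forest.feature_importances_)[::-1][:5]

print("LASSO keeps features:", kept_by_lasso)
print("Forest's top features:", top_by_forest)
```

Here the selection happens as a by-product of fitting the model itself, which is what distinguishes embedded methods from filters and wrappers.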

Q7: What is the difference between an error and a residual error?

Ans: An error is the discrepancy between a predicted value and the actual (true) value. Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) are the most commonly used metrics for quantifying errors in data science. A residual, by contrast, is the difference between an observed value and the estimate obtained from the sample, such as the arithmetic mean or a model's fitted value. An error is usually unobservable, because the true population value is unknown, whereas a residual can be computed and shown on a graph. In short, error reflects how the observed data differ from the actual population, while a residual reflects how the observed data differ from the estimate based on the sample.
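
As a small illustration of the three metrics named above, the following sketch computes MAE, MSE, and RMSE from made-up observed and predicted values (the residuals here are simply observed minus predicted):

```python
import numpy as np

observed = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 3.0, 8.0])

residuals = observed - predicted
mae = np.mean(np.abs(residuals))   # Mean Absolute Error
mse = np.mean(residuals ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)                # Root Mean Squared Error

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}")
```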

Q8: What is a p-value?

Ans: A p-value helps you evaluate the strength of your results when performing a hypothesis test in statistics. The p-value ranges from 0 to 1, and its value indicates how strong the evidence is. The claim under consideration is known as the Null Hypothesis.

A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, implying that we can reject it. A high p-value (≥ 0.05) indicates weak evidence against the null hypothesis, so we fail to reject it. A p-value right at 0.05 is marginal and could go either way. To put it another way, a high p-value means your data are likely consistent with a true null hypothesis, while a low p-value means your data are unlikely to have arisen if the null hypothesis were true.
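
A minimal sketch of obtaining a p-value with SciPy, using an illustrative one-sample t-test; the sample values and the hypothesized mean are made up for the example:

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4])

# Null hypothesis: the population mean equals 5.0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# If p_value falls below the chosen significance level (commonly 0.05),
# we reject the null hypothesis; otherwise we fail to reject it.
```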

Q9: Give one example where false positives and false negatives are equally important.

Ans: In the banking industry, lending is the primary source of income for banks, but if the repayment rate is poor, there is a risk of heavy losses rather than profits. Giving loans to clients is therefore a gamble: a false positive (approving a loan for a customer who later defaults) causes a direct financial loss, while a false negative (rejecting a customer who would have repaid) means losing a good customer and the associated revenue. Banks can afford neither, so this is a typical illustration of false positives and false negatives being equally important.
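
To make the false positive and false negative counts concrete, here is a minimal sketch using a confusion matrix; the labels are invented for illustration (1 = loan repaid, and the "predictions" stand in for the bank's approval decisions):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual repayment outcomes
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]  # model's approval decisions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# fp: loans approved for customers who defaulted (direct financial loss)
# fn: loans denied to customers who would have repaid (lost business)
print(f"False positives = {fp}, False negatives = {fn}")
```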

Q10: What are the differences between over-fitting and under-fitting?

Ans: One of the most common tasks in statistics and machine learning is fitting a model to a collection of training data so that it can make trustworthy predictions on unseen data.

Overfitting occurs when a statistical model describes random error or noise rather than the underlying relationship. Overfitting happens when a model is overly complicated, such as when there are too many parameters compared to the amount of data. Overfitted models have poor prediction performance because they overreact to slight changes in the training data.

When a statistical model or machine learning method fails to capture the underlying trend of the data, this is referred to as underfitting. Fitting a linear model to non-linear data, for example, would result in underfitting. A model like this would also have poor prediction performance.
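
A minimal sketch contrasting underfitting and overfitting on invented noisy non-linear data, using polynomial degree as the complexity knob; the degrees and noise level are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)

# Degree 1 underfits the sine curve, degree 4 fits reasonably,
# and degree 15 overfits, so its cross-validated error deteriorates.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"degree={degree:2d}  cross-validated MSE={-scores.mean():.3f}")
```

Comparing training error to cross-validated error is the usual way to tell the two failure modes apart: an underfit model does poorly on both, while an overfit model does well on the training data but poorly under cross-validation.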