Data Science Interview Questions


Q11: What is TF/IDF vectorization?

Ans: Term frequency–inverse document frequency (TF–IDF) is a numerical statistic that indicates how significant a word is to a document within a collection or corpus. It is frequently used as a weighting factor in information retrieval and text mining.

The TF–IDF value rises proportionally with the number of times a word appears in the document, but is offset by the frequency of the term across the corpus, which helps account for the fact that some words appear more frequently than others in general.
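The definition above can be sketched directly in plain Python. The whitespace tokenization and the two-document corpus here are illustrative assumptions, not part of the original answer:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute a TF-IDF weight for every term in every document.

    TF  = count of term t in document d / total terms in d
    IDF = log(N / number of documents containing t)
    """
    n_docs = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))

    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
w = tf_idf(corpus)
# "the" appears in every document, so its IDF (and hence TF-IDF) is zero,
# while "sat" appears in only one document and receives a positive weight.
```

Note how the IDF term does the offsetting described above: a word common to every document is weighted down to zero regardless of how often it occurs.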

Q12: What is the difference between data analytics and data science?

Ans:

  • Data science is the process of transforming data through various technical analysis methods to extract relevant insights that data analysts can apply to their business scenarios.
  • Data analytics is concerned with testing existing hypotheses and facts and answering questions to support better, more effective business decisions.
  • Data science fosters innovation by asking questions that lead to new connections and solutions to future challenges, and it centres on predictive modelling; data analytics focuses on extracting meaning from existing historical data.
  • Data science is a broad field that uses many mathematical and scientific tools and methods to solve complex problems. In contrast, data analytics is a narrower discipline that addresses specific, focused problems using fewer statistical and visualization techniques.

Q13: Why is data cleaning crucial? How do you clean the data?

Ans: To gain accurate insights while running an algorithm on any data, it is critical to have correct and clean data that contains only essential information. Dirty data frequently leads to poor or inaccurate insights and forecasts, which can have negative consequences.

For example, suppose a company launches a large campaign to sell a product. If the data analysis mistakenly recommends targeting a product for which there is no real demand, the campaign is bound to fail and the company loses revenue. This is where the necessity of accurate, clean data comes into play.

  • Cleaning data drawn from many sources aids data transformation and produces datasets that data scientists can actually work with.
  • Properly cleaned data improves model accuracy and yields extremely strong predictions.
  • Large datasets are slow to process, and the data-cleaning phase often consumes a substantial share of a project (commonly cited as around 80% of the time). Cleaning cannot happen while the model is running, so cleaning the data beforehand increases the model's speed and efficiency.
  • Data cleaning assists in identifying and correcting any structural problems in the data. It also aids in the removal of duplicates and the maintenance of data consistency.
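A minimal sketch of these cleaning steps with pandas. The sales table, its column names, and its values are made up for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical sales records with typical quality problems:
# a missing key field, inconsistent casing, an exact duplicate,
# and a missing numeric value.
df = pd.DataFrame({
    "product": ["Widget", "widget", "Gadget", "Gadget", None],
    "region":  ["north", "North", "south", "south", "east"],
    "revenue": [100.0, 100.0, np.nan, 80.0, 95.0],
})

clean = (
    df
    .dropna(subset=["product"])        # drop rows missing a key field
    .assign(product=lambda d: d["product"].str.lower(),
            region=lambda d: d["region"].str.lower())  # fix structural inconsistency
    .drop_duplicates()                 # remove exact duplicate rows
    .assign(revenue=lambda d: d["revenue"].fillna(d["revenue"].median()))
)
```

Normalizing the casing first is what lets `drop_duplicates` recognize "Widget"/"widget" rows as the same record; the remaining missing revenue is then imputed with the column median.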

Q14: How is the grid search parameter different from the random search tuning strategy?

Ans: Tuning techniques are employed to determine the optimal set of hyperparameters. Hyperparameters are model-specific settings that are fixed before the model is trained or evaluated on the dataset. Grid search and random search are both optimization approaches for finding effective hyperparameters.

Grid Search:

  • Every combination from a predefined list of hyperparameter values is tested and assessed.
  • The search pattern resembles searching in a grid: the values are laid out in a matrix, and each parameter set is tested with its accuracy recorded. After every possible combination has been tried, the model with the highest accuracy is picked as the best.
  • The primary disadvantage is that as the number of hyperparameters increases, so does the number of evaluations. With each added hyperparameter, the number of combinations can grow exponentially. In grid search, this is known as the curse of dimensionality.

Random Search:

  • In this approach, random hyperparameter combinations are tested and analyzed to identify the optimum answer: the objective function is evaluated at random configurations in the parameter space.
  • Because the sampling is random, there is a good chance of landing on near-optimal parameters without exhaustively evaluating every combination.
  • This search works best when the parameter space has fewer dimensions, since a good parameter set can then be found in fewer trials.
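The two strategies can be contrasted with a toy example in plain Python. The objective function, parameter names, and ranges below are invented for illustration, standing in for a real cross-validation score:

```python
import itertools
import random

# A toy objective standing in for cross-validated accuracy;
# by construction its optimum is at learning_rate=0.1, n_estimators=200.
def score(learning_rate, n_estimators):
    return -((learning_rate - 0.1) ** 2) - ((n_estimators - 200) / 1000) ** 2

grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.5],
    "n_estimators": [50, 100, 200, 400],
}

# Grid search: evaluate every combination (4 x 4 = 16 trials here;
# this count grows exponentially with each added hyperparameter).
grid_trials = [dict(zip(grid, values))
               for values in itertools.product(*grid.values())]
best_grid = max(grid_trials, key=lambda p: score(**p))

# Random search: sample a fixed budget of configurations, drawn from
# continuous ranges rather than a fixed lattice of values.
random.seed(0)
random_trials = [{"learning_rate": random.uniform(0.01, 0.5),
                  "n_estimators": random.randint(50, 400)}
                 for _ in range(10)]
best_random = max(random_trials, key=lambda p: score(**p))
```

The key difference visible here is the cost model: grid search's trial count is the product of the list lengths, while random search's budget (10 trials) is chosen independently of how many hyperparameters there are.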

Q15: How is feature selection performed using the regularization method?

Ans: The regularization approach applies penalties to various parameters in a machine learning model to reduce the model's flexibility and avoid overfitting.

Regularization methods include linear-model regularization, Lasso/L1 regularization, and others. Linear-model regularization applies a penalty to the coefficients that multiply the predictors. Lasso/L1 regularization has the property of shrinking some coefficients exactly to zero, allowing those features to be eliminated from the model.
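A brief sketch of L1-based feature selection with scikit-learn's `Lasso`. The synthetic dataset and the `alpha` value are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first two of five features drive the target.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks the irrelevant features' coefficients to
# exactly zero; the surviving nonzero coefficients are the selected features.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
```

In practice the strength of `alpha` controls how aggressively features are pruned, and it is usually chosen by cross-validation rather than fixed by hand.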