Ans: When doing binary classification, if the data set is unbalanced, the model's accuracy cannot be predicted properly using simply the R2 score. For example, if one of the two classes' data is relatively tiny compared to the other, conventional accuracy will take a very small proportion of the smaller class. Even if just 5% of the samples belong to the smaller class and the model identifies all outputs as the larger class, the accuracy would still be about 95%. But this is incorrect. To address this, we may perform the following:
Bias: A bias is an inaccuracy introduced in your model due to the machine learning algorithm being oversimplified. It may result in underfitting. When you train your model, it makes simplified assumptions to make the goal function clearer to comprehend.
Low bias machine learning algorithms —> Decision Trees, kNN and SVM High bias machine learning algorithms —> Linear Regression, Logistic Regression
Variance: Variance is an inaccuracy created in your model due to a complicated machine learning process; your model learns noise from the training data set and performs poorly on the test data set. It might result in overfitting and excessive sensitivity.
Typically, as the complexity of your model increases, you will observe a drop in error owing to decreasing bias in the model. However, this only lasts until a certain point. As you continue to make your model more complicated, you wind up overfitting it, and your model suffers from excessive variance.
BiasVariance tradeoff: To obtain high prediction performance, every supervised machine learning method should have low bias and variance.
In machine learning, there is no getting around the connection between bias and variation. As the bias is increased, the variance decreases. Increasing the variance reduces bias.
Ans: The ROC curve (Receiver Operating Characteristic) depicts the difference between falsepositive and truepositive rates at various thresholds. The curve represents a tradeoff between sensitivity and specificity.
The ROC curve is constructed by comparing true positive rates (TPR or sensitivity) against falsepositive rates (FPR or (1specificity). TPR denotes the proportion of positive observations accurately predicted out of all positive observations. The FPR reflects the fraction of erroneously anticipated observations among all negative observations. In the case of medical testing, the TPR reflects the percentage of patients who are accurately confirmed positive for a specific condition.
Ans: Dimensionality reduction is the process of reducing the number of characteristics in a given dataset. There are several ways for reducing dimensionality, including
Feature Selection Methods
Matrix Factorization
Manifold Learning
Autoencoder Methods
Linear Discriminant Analysis (LDA)
The curse of dimensionality is one of the primary motivations for dimensionality reduction. The model grows increasingly complicated as the number of features rises. However, if the number of data points is too small, the model will begin learning or overfitting the data. The data will not be generalized by the model. This is referred to as the curse of dimensionality.
Dimensionality reduction also has the following advantages; Time and storage space are saved; it is simpler to see and graphically depict data in 2D or 3D, and the complexity of space is decreased.
Ans: A model must be maintained once it has been deployed. The data that is being supplied may vary over time. For example, in the case of a model forecasting home prices, property values may grow over time or vary due to some other reason. The model's accuracy on fresh data can be recorded. Some popular methods for ensuring accuracy are as follows:
If the model performs well with new data, it implies that it follows the pattern of generalization acquired by the model with previous data. As a result, the model may be retrained using the fresh data. If the model's accuracy on new data is poor, it may be retrained using feature engineering on the data features alongside the old data. If the model's accuracy is poor, it may need to be trained from scratch.
