To estimate the error of a linear model, we apply the model's predictions to the test data and compute the MSE (mean squared error). The MSE is the average squared difference between the model's predictions and the actual target values in the test set, providing a measure of how well the model fits the data and predicts outcomes. However, if the dataset is not large enough, we may need resampling methods such as cross-validation and the bootstrap. In today's class, the professor showed videos of these CV techniques and explained how they help us obtain information about the prediction error on the test set as well as the standard deviation and bias of the model parameters.
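As an illustration, here is a minimal sketch of that test-MSE computation in Python with scikit-learn. The data are synthetic and stand in for a real train/test split; the variable names and the 150/50 split are assumptions made purely for this example.

```python
# Minimal sketch: fit a linear model on training data and compute test-set MSE.
# Synthetic data; X_train/X_test names are placeholders for a real dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Pretend the first 150 rows are the training data and the rest the test data.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = LinearRegression().fit(X_train, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {test_mse:.4f}")
```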
- The Validation Set Approach:
The validation set approach is a straightforward method for estimating the test error of a statistical learning model. It begins by randomly splitting the dataset into two parts: a training set and a validation (or hold-out) set. The model is fit on the training set and then used to predict responses for the validation set. The error rate observed on the validation set, often measured by the mean squared error (MSE) for quantitative responses, serves as an estimate of the model's test error rate.
While conceptually simple and easy to implement, the validation set approach has two potential drawbacks:
- The validation set approach can yield highly variable estimates of the test error, because the estimate depends on exactly which observations happen to fall into the training and validation sets.
- Because the model is trained on only a subset of the observations, the validation error tends to overestimate the test error of a model fit on the entire dataset (a minimal sketch of the approach follows this list).
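The sketch below uses scikit-learn's train_test_split on synthetic data; the 50/50 split and the three seeds are illustrative choices, and rerunning with different seeds shows the variability described in the first drawback.

```python
# Minimal sketch of the validation set approach: the test-error estimate
# changes with the random split, illustrating its variability.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Repeat the split with different seeds to see how the estimate varies.
for seed in range(3):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.5, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"split {seed}: validation MSE = {mse:.3f}")
```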
- LOOCV (Leave-One-Out Cross-Validation):
LOOCV (leave-one-out cross-validation) differs from the validation set approach by using only a single observation as the validation set while the remaining data points form the training set. The model is fit on n-1 observations, and a prediction is made for the omitted observation; repeating this for every observation and averaging the n squared errors yields an approximately unbiased, but highly variable, estimate of the test error. Despite this low bias, LOOCV can be impractical for large datasets because it requires fitting the model n times.
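A minimal LOOCV sketch, again on synthetic data: combining LeaveOneOut with cross_val_score is one common scikit-learn idiom for this, assumed here for illustration rather than as the only way.

```python
# Minimal LOOCV sketch: each of the n iterations fits the model on n-1
# observations and scores the single held-out observation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.3, size=50)

# cross_val_score returns the negative MSE per fold; negate and average.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
loocv_mse = -scores.mean()
print(f"LOOCV estimate of test MSE: {loocv_mse:.4f}")
```

Note that this fits the model once per observation (50 times here), which is exactly the computational cost the paragraph above warns about for large datasets.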
- K-fold Cross-Validation (CV):
K-fold cross-validation (CV) is an alternative to LOOCV for estimating the test error. In k-fold CV, the dataset is divided into k roughly equal-sized folds. Each fold is used as the validation set once while the remaining k-1 folds form the training set. The MSE is computed on each validation fold separately, giving k error estimates (MSE_1, MSE_2, ..., MSE_k). The overall k-fold CV estimate is their average: CV_(k) = (1/k) * (MSE_1 + MSE_2 + ... + MSE_k).
This process provides a more stable and less computationally intensive way to estimate the test error compared to LOOCV, especially for large datasets.
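A minimal k-fold sketch in the same style; k = 5 is chosen purely for illustration (k = 5 or k = 10 are common choices in practice).

```python
# Minimal k-fold CV sketch: average the per-fold MSEs to get the CV estimate.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = X @ np.array([0.8, 1.2]) + rng.normal(scale=0.5, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_squared_error")
print("per-fold MSE:", np.round(-scores, 3))
print(f"5-fold CV estimate: {-scores.mean():.4f}")  # CV_(k) = mean of MSE_i
```

Only k model fits are required here (5 instead of 200), which is why k-fold CV scales so much better than LOOCV on large datasets.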