September 27, 2023

Continuing from Monday’s class, today the Professor focused on implementing K-fold Cross-Validation on the CDC Diabetes dataset. Polynomial regression models of increasing degree (1 through 10) are fit to the 354 data points that have values for all three variables. A plot of the training error (Mean Squared Error, the average squared difference between the model’s predictions and the actual target values) as a function of polynomial degree shows the error gradually decreasing as the degree increases. This observation might lead one to believe that higher-order polynomial models improve the model’s ability to generalize to new data. However, this could instead be a case of overfitting, where the model essentially memorizes the training data and performs poorly on new, unseen data.

I attempted to build and fit these higher-degree polynomial models to the dataset in Python, plotting a graph similar to the one the Professor described. The resulting graph, shown below, confirms that the training error decreases as model complexity grows.
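A minimal sketch of this experiment in Python using scikit-learn is below. Since I can't reproduce the CDC data here, I use synthetic data as a stand-in (354 points, one predictor, a quadratic true relationship); the loop and MSE computation would be the same on the real dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the 354-point dataset (hypothetical, not the CDC data)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(354, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1, size=354)

# Fit polynomial models of degree 1 through 10 and record the training MSE
train_mse = []
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse.append(mean_squared_error(y, model.predict(X)))

# Because a degree-d polynomial nests all lower degrees, the least-squares
# training error can only go down (or stay flat) as the degree increases.
print(train_mse)
```

Plotting `train_mse` against `range(1, 11)` (e.g. with matplotlib) reproduces the monotonically decreasing training-error curve from class.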

To overcome this issue, the Professor demonstrated 5-fold Cross-Validation on this dataset, splitting the data into 5 approximately equal groups of 71, 71, 71, 71 and 70 observations. He also suggested adding a unique index to each observation to eliminate issues with duplicate records across the 3 variables. When the average test MSE from the 5-fold CV is plotted against model complexity, the test error reaches its minimum at degree 2 and then gradually increases. This shows that the training error tends to underestimate the test error, and that the K-fold CV technique gives a more accurate assessment of the model’s performance. As a next step, I plan to apply the 5-fold CV technique to the CDC diabetes dataset in Python and examine the results.
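For my own next step, a sketch of the 5-fold CV version is below, again on synthetic stand-in data rather than the real CDC dataset. With 354 observations, scikit-learn's `KFold` with 5 splits produces exactly the fold sizes mentioned in class (four folds of 71 and one of 70).

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical, not the CDC data)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(354, 1))
y = X[:, 0] ** 2 + rng.normal(0, 1, size=354)

# 5-fold split of 354 points: fold sizes 71, 71, 71, 71, 70
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Average held-out MSE for each polynomial degree
avg_test_mse = []
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    avg_test_mse.append(-scores.mean())  # negate: sklearn maximizes scores

best_degree = 1 + int(np.argmin(avg_test_mse))
print(avg_test_mse, best_degree)
```

Unlike the training-error curve, `avg_test_mse` is not forced to decrease with degree; on this quadratic stand-in it should dip near degree 2 and then drift up, mirroring the U-shape the Professor showed.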
