As part of my project analysis, I performed the next step: 5-fold Cross-Validation on the CDC Diabetes dataset. I implemented this technique in Python with the sklearn library, building a function that performs 5-fold Cross-Validation on the dataset for polynomials of up to degree 4 and records the average Mean-Squared Error (MSE) the model makes on the test folds. When a graph is plotted with polynomial degree on the x-axis and average test MSE on the y-axis, the graph below is obtained. It shows that the test error reaches its minimum at the 2nd-degree polynomial and then gradually increases.
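A minimal sketch of such a function might look like the following. This is not my exact implementation; it assumes `X` and `y` are the predictor matrix and target already loaded from the CDC Diabetes dataset, and the function and variable names are my own placeholders:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def average_test_mse_by_degree(X, y, max_degree=4, folds=5):
    """Return the k-fold cross-validated average test MSE for
    polynomial models of degree 1..max_degree."""
    avg_mse = {}
    for degree in range(1, max_degree + 1):
        # Pipeline: expand features to the given degree, then fit OLS
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        # cross_val_score returns the negated MSE, so flip the sign
        scores = cross_val_score(model, X, y, cv=folds,
                                 scoring='neg_mean_squared_error')
        avg_mse[degree] = -scores.mean()
    return avg_mse
```

The returned dictionary (degree to average test MSE) is what gets plotted, e.g. with matplotlib, to produce the graph described above.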
This contrasts with the graph obtained previously, where the training error kept decreasing as model complexity grew. The discrepancy highlights the significance of Cross-Validation as a powerful tool for estimating a model's performance on unseen data: it provides a more realistic assessment of how well the model will generalize to new data, taking both underfitting and overfitting into account. As my next step, I plan to work on Support Vector Regression (SVR) to see if I can build a better model.
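For that next step, a starting-point sketch of SVR under the same 5-fold evaluation could look like this. The kernel choice and the hyperparameters `C` and `epsilon` here are illustrative sklearn defaults, not tuned values, and `X` and `y` are the same placeholder arrays as in the sketch above:

```python
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# SVR is sensitive to feature scale, so standardize before fitting
svr_model = make_pipeline(StandardScaler(),
                          SVR(kernel='rbf', C=1.0, epsilon=0.1))
svr_scores = cross_val_score(svr_model, X, y, cv=5,
                             scoring='neg_mean_squared_error')
print(f"SVR average test MSE: {-svr_scores.mean():.4f}")
```

Using the same scoring and fold count keeps the SVR result directly comparable with the polynomial models above.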
In today’s class, the Professor clarified a query I had about the need for a Monte-Carlo permutation test on this dataset. The question arose because the Monte-Carlo permutation method is typically used to test statistical significance between two groups whose variables are highly non-normally distributed. However, the Professor explained the relevance of this test in the context of the CDC Diabetes dataset.
The rationale is that the Monte-Carlo permutation test offers a robust way to assess the statistical significance of observed results through repeated random reshuffling of the data. This approach can be particularly useful when explaining and justifying the significance of results to stakeholders, or in situations where traditional statistical tests such as t-tests may not be applicable or straightforward to interpret.
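To make the idea concrete, here is a minimal sketch of a Monte-Carlo permutation test for a difference in means between two groups. The function name and arguments are my own; `group_a` and `group_b` stand in for two numeric arrays drawn from the dataset:

```python
import numpy as np

def monte_carlo_permutation_test(group_a, group_b,
                                 n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the observed difference in
    group means by randomly reshuffling the group labels."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    observed = abs(group_a.mean() - group_b.mean())
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel observations at random
        perm_diff = abs(pooled[:n_a].mean() - pooled[n_a:].mean())
        if perm_diff >= observed:
            count += 1
    # Proportion of random relabelings at least as extreme as observed
    return count / n_permutations
```

A small p-value from this test means the observed difference is unlikely to arise from random labeling alone, which is exactly the kind of plain-language justification that can be presented to stakeholders without invoking normality assumptions.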