Based on the simple linear regression model I built earlier, I created a Q-Q plot to assess the normality of the residuals. The plot shows that the residuals from the model are not normally distributed. I also built a scatter plot with the fitted values on the x-axis and the residuals on the y-axis. The residuals fan out in a funnel shape, which indicates that the model is heteroscedastic: the variance of the residuals is not constant, so the model violates homoscedasticity, one of the critical assumptions of linear regression. A formal test for heteroscedasticity yielded a Chi-squared statistic of approximately 52.85 with a corresponding p-value of approximately 3.61e-13. Since the p-value is far below the conventional significance level of 0.05, the null hypothesis of constant residual variance is rejected, which confirms that this model exhibits heteroscedasticity.
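To make the idea behind a Chi-squared heteroscedasticity test concrete, here is a minimal sketch of one common variant, the studentized Breusch-Pagan test, written with only the Python standard library. It is an illustration under stated assumptions, not the exact procedure I ran: the data is synthetic, the helper names (`fit_line`, `r_squared`, `breusch_pagan`) are made up for this example, and the p-value formula is valid only for a single predictor (one degree of freedom).

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R-squared of the simple linear regression of y on x."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test for one predictor.

    Regress the squared residuals on x; the statistic is n * R^2 of that
    auxiliary regression and follows a Chi-squared distribution with
    1 degree of freedom under the null of constant variance.
    """
    intercept, slope = fit_line(x, y)
    squared_resid = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = max(len(x) * r_squared(x, squared_resid), 0.0)  # guard against rounding
    # Survival function of Chi-squared with 1 df: P(Z^2 > lm) = erfc(sqrt(lm / 2))
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


# Synthetic data (an assumption for illustration, not the course dataset).
random.seed(0)
x = [random.uniform(1, 10) for _ in range(300)]
y_fan = [2 + 3 * xi + random.gauss(0, 0.5 * xi) for xi in x]  # noise grows with x
y_flat = [2 + 3 * xi + random.gauss(0, 2.0) for xi in x]      # constant noise

lm_fan, p_fan = breusch_pagan(x, y_fan)
lm_flat, p_flat = breusch_pagan(x, y_flat)
print(f"funnel-shaped noise: LM = {lm_fan:.2f}, p = {p_fan:.3g}")
print(f"constant noise:      LM = {lm_flat:.2f}, p = {p_flat:.3g}")
```

With real data, a library routine such as `het_breuschpagan` in `statsmodels.stats.diagnostic` is the more practical choice; the hand-written version is here only to show where the Chi-squared statistic comes from.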
In today’s class, the professor answered several questions from my classmates about the dataset, which gave me a deeper understanding of the next steps of the analysis. He mentioned that non-linear models can be fitted to this dataset and might lead to a higher R-squared value. When I asked whether he would recommend applying transformations (log and exponential) to the variables, he gave the example that for datasets with a highly skewed distribution, a log transformation might help make the data more normally distributed. But since our dataset is almost normally distributed, he suggested not applying transformations to the variables for this dataset.
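The effect of a log transformation on a skewed variable can be sketched as follows. This is only an illustration: the sample is synthetic (drawn from a log-normal distribution, which is right-skewed by construction), and `skewness` is a helper written for this example that computes the moment-based sample skewness.

```python
import math
import random


def skewness(values):
    """Moment-based sample skewness: m3 / m2^1.5 (about 0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic right-skewed sample (an assumption, not the course dataset).
random.seed(1)
skewed = [random.lognormvariate(0, 1) for _ in range(2000)]

# The log of a log-normal variable is normal, so the skew should largely vanish.
logged = [math.log(v) for v in skewed]

print(f"skewness before log transform: {skewness(skewed):.2f}")
print(f"skewness after log transform:  {skewness(logged):.2f}")
```

When a variable's skewness is already close to zero, as with our dataset, the transformation has little to correct, which is consistent with the suggestion to skip it.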
Furthermore, the professor explained the structure of the Punchline reports that we need to follow for our project report, which was really helpful. He also mentioned that next week he will discuss collinearity between the predictor variables, which makes it hard to distinguish their unique contributions to the dependent variable and might have a negative impact on the linear model’s performance for our dataset.
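As a rough preview of how collinearity between two predictors can be quantified, the sketch below computes the Pearson correlation and the corresponding variance inflation factor (VIF). It assumes the two-predictor case, where VIF reduces to 1 / (1 - r^2); the predictors are synthetic and the helper names `pearson_r` and `vif_two_predictors` are invented for this example.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(a, b):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)


# Synthetic predictors (assumptions for illustration only).
random.seed(2)
x1 = [random.gauss(0, 1) for _ in range(500)]
x2 = [v + random.gauss(0, 0.1) for v in x1]    # nearly a copy of x1
x3 = [random.gauss(0, 1) for _ in range(500)]  # generated independently of x1

print(f"VIF for x1 vs x2 (collinear):   {vif_two_predictors(x1, x2):.1f}")
print(f"VIF for x1 vs x3 (independent): {vif_two_predictors(x1, x3):.2f}")
```

A large VIF means the coefficient estimate for that predictor has an inflated variance, which is why the separate contributions of collinear predictors are hard to tell apart.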