The Linear Regression model that I previously built yielded a very low R-squared value, and its residuals showed violations of the homoscedasticity assumption. As a result, I am considering building a Multiple Linear Regression model, with ‘% DIABETIC’ as the target variable and ‘% OBESE’ and ‘% INACTIVE’ as predictor variables. To do this, I merged the three metrics on the county FIPS code, which left a dataset of only 354 observations.
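A minimal sketch of that merge step in pandas, using small toy tables in place of the real files (the column names match the metrics above, but the values and FIPS codes here are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the three metric tables; the real data would be read from CSVs.
diabetes = pd.DataFrame({"FIPS": [1001, 1003, 1005], "% DIABETIC": [9.5, 8.2, 11.1]})
obesity = pd.DataFrame({"FIPS": [1001, 1003], "% OBESE": [33.0, 30.5]})
inactivity = pd.DataFrame({"FIPS": [1001, 1003], "% INACTIVE": [25.4, 22.1]})

# An inner join on the FIPS code keeps only counties present in all three tables,
# which is why the merged dataset can be much smaller than any single table.
merged = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")
print(merged.shape)  # → (2, 4)
```

The shrink from the full table to 354 observations comes from exactly this inner-join behavior: only counties reporting all three metrics survive.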
During today’s class, Professor provided valuable insights into Multiple Linear Regression, a type of linear regression model that involves more than one predictor variable. In Multiple Linear Regression, the goal is to find the best-fit plane using the Ordinary Least Squares (OLS) method, and this model showed a considerable improvement in the R-squared value over the simple model. The equation for Multiple Linear Regression can be expressed as follows:
y = β₀ + β₁X₁ + β₂X₂ + ε
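A short sketch of fitting this plane by OLS with NumPy, on synthetic data standing in for the merged county columns (the coefficients 0.15 and 0.20 and the value ranges are assumptions chosen only so the example runs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the merged predictors (assumed ranges, for illustration).
X1 = rng.uniform(20, 45, 200)   # playing the role of % OBESE
X2 = rng.uniform(15, 35, 200)   # playing the role of % INACTIVE
y = 1.0 + 0.15 * X1 + 0.20 * X2 + rng.normal(0, 0.5, 200)  # role of % DIABETIC

# OLS: solve for (β0, β1, β2) minimizing ||y - Xβ||²; the column of ones
# in the design matrix carries the intercept β0.
X = np.column_stack([np.ones_like(X1), X1, X2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R² = 1 - SSE/SST: the share of variance the fitted plane explains.
resid = y - X @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(beta, r2)
```

With real data, `X1` and `X2` would simply be the ‘% OBESE’ and ‘% INACTIVE’ columns of the merged dataset.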
Additionally, Professor discussed the possibility of adding an interaction term to the Multiple Linear Regression model. The equation for this model with an interaction term is:
y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + ε
Here, X₁X₂ is the interaction term between the two predictor variables, capturing their combined effect beyond their individual contributions. Interestingly, including the interaction term produced a comparatively larger improvement in the R-squared value.
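The effect of adding the interaction column can be sketched like this, again on synthetic data where I deliberately build in an interaction effect (the coefficients are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a genuine interaction effect baked in.
X1 = rng.uniform(20, 45, 300)
X2 = rng.uniform(15, 35, 300)
y = 1.0 + 0.1 * X1 + 0.1 * X2 + 0.02 * X1 * X2 + rng.normal(0, 1.0, 300)

def r_squared(design, y):
    """R² of the OLS fit for a given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones_like(X1)
r2_plain = r_squared(np.column_stack([ones, X1, X2]), y)
r2_inter = r_squared(np.column_stack([ones, X1, X2, X1 * X2]), y)
print(r2_plain, r2_inter)
```

Because the plain model is nested inside the interaction model, R² can never decrease when the extra column is added; the question is only whether the gain is large enough to matter, as it was in class.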
Following this, Professor introduced building what was referred to as a Generalized Linear Model (GLM): here, a quadratic model that adds second-order terms of the predictor variables to the regression equation. (Strictly speaking, a GLM generalizes linear regression via link functions; the model below is a second-order polynomial regression, which is still linear in its coefficients.) The equation for this model is as follows:
y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + β₄X₁² + β₅X₂² + ε
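Fitting this second-order model is still ordinary least squares; only the design matrix grows. A sketch on synthetic data (the curvature term 0.005·X₁² is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.uniform(20, 45, 300)
X2 = rng.uniform(15, 35, 300)
y = 2.0 + 0.1 * X1 + 0.1 * X2 + 0.005 * X1**2 + rng.normal(0, 0.5, 300)

# Full second-order design: intercept, linear, interaction, and squared terms,
# matching β0..β5 in the equation above.
design = np.column_stack([np.ones_like(X1), X1, X2, X1 * X2, X1**2, X2**2])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(beta.round(3), r2)
```

The model remains linear in the β coefficients even though it is quadratic in the predictors, which is why plain OLS still applies.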
However, Professor advised against using higher-degree polynomials, as they can lead to overfitting, where the model memorizes the training data but fails to generalize to new data.
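That overfitting warning can be sketched with a one-variable toy example: the true relationship below is quadratic, and a degree-12 polynomial fitted to a handful of noisy points drives the training error down while the error on fresh data typically grows (all data here is synthetic, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    # The underlying relationship is only quadratic.
    return 1.0 + 0.5 * x + 0.1 * x**2

x_train = rng.uniform(-3, 3, 15)
y_train = true_f(x_train) + rng.normal(0, 0.5, 15)
x_test = rng.uniform(-3, 3, 200)
y_test = true_f(x_test) + rng.normal(0, 0.5, 200)

def poly_mse(degree):
    """Train/test mean squared error of a degree-`degree` polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

tr2, te2 = poly_mse(2)     # matches the true model
tr12, te12 = poly_mse(12)  # far more flexibility than the data supports
print(tr2, te2)
print(tr12, te12)
```

The degree-12 fit always achieves a training error at least as low as the quadratic (its basis contains the quadratic's), but that extra flexibility is spent memorizing noise, which is exactly the failure mode Professor cautioned about.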