The Linear Regression model that I previously built yielded a very low R-squared value, and its residuals showed violations of the homoscedasticity assumption. As a result, I am considering building a Multiple Linear Regression model, with ‘% DIABETIC’ as the target variable and ‘% OBESE’ and ‘% INACTIVE’ as predictor variables. To do this, I merged the three metrics on the county FIPS code, which left a dataset of only 354 observations.
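A minimal sketch of that merge step in pandas, using small toy tables in place of the real files (the column names match the metrics above, but the values and FIPS codes here are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the three metric tables; the real data would be read from CSVs.
diabetes = pd.DataFrame({"FIPS": [1001, 1003, 1005], "% DIABETIC": [9.5, 8.2, 11.1]})
obesity = pd.DataFrame({"FIPS": [1001, 1003], "% OBESE": [33.0, 30.5]})
inactivity = pd.DataFrame({"FIPS": [1001, 1003], "% INACTIVE": [25.4, 22.1]})

# An inner join on the FIPS code keeps only counties present in all three tables,
# which is why the merged dataset can be much smaller than any single table.
merged = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")
print(merged.shape)  # → (2, 4)
```

The shrink from the full table to 354 observations comes from exactly this inner-join behavior: only counties reporting all three metrics survive.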
During today’s class, Professor provided valuable insights into Multiple Linear Regression, a type of linear regression model that involves more than one predictor variable. In Multiple Linear Regression, the goal is to find the best-fit plane using the Ordinary Least Squares (OLS) method, and this model showed a considerable improvement in the R-squared value over the simple model. The equation for Multiple Linear Regression can be expressed as follows:
y = β₀ + β₁X₁ + β₂X₂ + ε
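A short sketch of fitting this plane by OLS with NumPy, on synthetic data standing in for the merged county columns (the coefficients 0.15 and 0.20 and the value ranges are assumptions chosen only so the example runs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the merged predictors (assumed ranges, for illustration).
X1 = rng.uniform(20, 45, 200)   # playing the role of % OBESE
X2 = rng.uniform(15, 35, 200)   # playing the role of % INACTIVE
y = 1.0 + 0.15 * X1 + 0.20 * X2 + rng.normal(0, 0.5, 200)  # role of % DIABETIC

# OLS: solve for (β0, β1, β2) minimizing ||y - Xβ||²; the column of ones
# in the design matrix carries the intercept β0.
X = np.column_stack([np.ones_like(X1), X1, X2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R² = 1 - SSE/SST: the share of variance the fitted plane explains.
resid = y - X @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(beta, r2)
```

With real data, `X1` and `X2` would simply be the ‘% OBESE’ and ‘% INACTIVE’ columns of the merged dataset.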
Additionally, Professor discussed the possibility of adding an interaction term to the Multiple Linear Regression model. The equation for this model with an interaction term is:
y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + ε
Here, X₁X₂ is the interaction term between the two predictor variables, capturing their combined effect beyond their individual contributions. Interestingly, including the interaction term produced a comparatively larger improvement in the R-squared value.
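The effect of adding the interaction column can be sketched like this, again on synthetic data where I deliberately build in an interaction effect (the coefficients are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a genuine interaction effect baked in.
X1 = rng.uniform(20, 45, 300)
X2 = rng.uniform(15, 35, 300)
y = 1.0 + 0.1 * X1 + 0.1 * X2 + 0.02 * X1 * X2 + rng.normal(0, 1.0, 300)

def r_squared(design, y):
    """R² of the OLS fit for a given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones_like(X1)
r2_plain = r_squared(np.column_stack([ones, X1, X2]), y)
r2_inter = r_squared(np.column_stack([ones, X1, X2, X1 * X2]), y)
print(r2_plain, r2_inter)
```

Because the plain model is nested inside the interaction model, R² can never decrease when the extra column is added; the question is only whether the gain is large enough to matter, as it was in class.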
Following this, Professor introduced building what was referred to as a Generalized Linear Model (GLM): here, a quadratic model that adds second-order terms of the predictor variables to the regression equation. (Strictly speaking, a GLM generalizes linear regression via link functions; the model below is a second-order polynomial regression, which is still linear in its coefficients.) The equation for this model is as follows:
y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + β₄X₁² + β₅X₂² + ε
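Fitting this second-order model is still ordinary least squares; only the design matrix grows. A sketch on synthetic data (the curvature term 0.005·X₁² is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.uniform(20, 45, 300)
X2 = rng.uniform(15, 35, 300)
y = 2.0 + 0.1 * X1 + 0.1 * X2 + 0.005 * X1**2 + rng.normal(0, 0.5, 300)

# Full second-order design: intercept, linear, interaction, and squared terms,
# matching β0..β5 in the equation above.
design = np.column_stack([np.ones_like(X1), X1, X2, X1 * X2, X1**2, X2**2])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(beta.round(3), r2)
```

The model remains linear in the β coefficients even though it is quadratic in the predictors, which is why plain OLS still applies.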
However, Professor advised against using higher-degree polynomials, as they can lead to overfitting, where the model memorizes the training data but fails to generalize to new data.
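That overfitting warning can be sketched with a one-variable toy example: the true relationship below is quadratic, and a degree-12 polynomial fitted to a handful of noisy points drives the training error down while the error on fresh data typically grows (all data here is synthetic, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    # The underlying relationship is only quadratic.
    return 1.0 + 0.5 * x + 0.1 * x**2

x_train = rng.uniform(-3, 3, 15)
y_train = true_f(x_train) + rng.normal(0, 0.5, 15)
x_test = rng.uniform(-3, 3, 200)
y_test = true_f(x_test) + rng.normal(0, 0.5, 200)

def poly_mse(degree):
    """Train/test mean squared error of a degree-`degree` polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

tr2, te2 = poly_mse(2)     # matches the true model
tr12, te12 = poly_mse(12)  # far more flexibility than the data supports
print(tr2, te2)
print(tr12, te12)
```

The degree-12 fit always achieves a training error at least as low as the quadratic (its basis contains the quadratic's), but that extra flexibility is spent memorizing noise, which is exactly the failure mode Professor cautioned about.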