September 11, 2023

According to my analysis, the ‘cdc-diabetes-2018’ dataset has three separate tabs of data for three different metrics: % DIABETIC, % OBESE, and % INACTIVE, for counties across the United States for the year 2018. These metrics can be related through the unique FIPS (Federal Information Processing Standards) codes. However, there is a column-name inconsistency in the ‘Inactivity’ tab, where the column is labeled ‘FIPDS’ instead of ‘FIPS’. This inconsistency needs to be corrected to enable proper data integration on this common unique column.
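A minimal sketch of the rename-and-merge step, using toy stand-ins for the three tabs (the actual workbook would be loaded with `pd.read_excel`; the values below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the three tabs (real data would come from the workbook).
diabetes = pd.DataFrame({"FIPS": [1001, 1003], "% DIABETIC": [9.5, 10.2]})
obesity = pd.DataFrame({"FIPS": [1001, 1003], "% OBESE": [31.0, 33.5]})
# The Inactivity tab misspells the key column as 'FIPDS'.
inactivity = pd.DataFrame({"FIPDS": [1001, 1003], "% INACTIVE": [18.0, 20.5]})

# Fix the misspelled key, then merge all three tabs on FIPS.
inactivity = inactivity.rename(columns={"FIPDS": "FIPS"})
merged = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")
```

An inner merge like this keeps only counties present in all three tabs, which is exactly why the fully merged table ends up smaller than any single tab.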

Upon merging the data, it is observed that only 354 observations have all three variables present. This limited sample size might not be ideal for building a robust model with % DIABETIC as the target variable and % OBESE and % INACTIVE as predictor variables. Such a model could benefit from techniques like bootstrapping and cross-validation to address potential issues related to the small sample size. For now, as a first step, I am planning a simpler approach: a Simple Linear Regression model with % DIABETIC as the target variable and % INACTIVE as the single predictor variable. This combination yields 1370 observations, a more reasonable sample size for building such a model.
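To illustrate how bootstrapping could help gauge the stability of estimates from a sample of 354, here is a rough sketch on simulated data (all numbers here are made up; the true relationship is not from the CDC dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample of 354 paired observations (values are invented).
n = 354
x = rng.normal(16, 4, n)                  # stand-in for % INACTIVE
y = 1.5 + 0.4 * x + rng.normal(0, 1, n)   # stand-in for % DIABETIC

# Bootstrap the slope of a simple linear fit to gauge its variability.
slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, n)            # resample rows with replacement
    b, a = np.polyfit(x[idx], y[idx], 1)   # slope, intercept of refit
    slopes.append(b)

lo, hi = np.percentile(slopes, [2.5, 97.5])  # 95% bootstrap interval for the slope
```

The width of the resulting interval gives a direct sense of how much the fitted slope would wobble across samples of this size.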

Histograms of the % DIABETIC and % INACTIVE variables show that % DIABETIC is slightly right-skewed with high kurtosis, indicating a heavy-tailed distribution, while % INACTIVE is slightly left-skewed with lower kurtosis, suggesting lighter tails. Boxplots of the two variables show that % DIABETIC has outliers beyond both the upper and lower whiskers, while % INACTIVE has outliers only beyond the lower whisker.
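These shape statistics are easy to compute directly. A small sketch on a simulated heavy-tailed sample (a Student-t draw standing in for a variable like % DIABETIC):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative heavy-tailed sample (Student-t, df=5) standing in for % DIABETIC.
sample = rng.standard_t(df=5, size=5000)

skew = stats.skew(sample)
# fisher=False reports "plain" kurtosis, where a normal distribution scores 3.
kurt = stats.kurtosis(sample, fisher=False)
```

A kurtosis well above 3 flags heavier-than-normal tails, matching what the histograms suggested visually.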

In today’s class, the professor’s explanation of Simple Linear Regression was very clear. I understood the concept of the Ordinary Least Squares method, which finds the optimal slope and intercept parameters for a regression model by minimizing the Residual Sum of Squares (RSS), the sum of squared differences between actual and predicted values.
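The OLS idea can be sketched in a few lines: the closed-form slope and intercept minimize the RSS, and nudging the slope away from the optimum can only increase it (the data below are made up for illustration):

```python
import numpy as np

# Small made-up dataset to illustrate OLS.
x = np.array([10.0, 14.0, 16.0, 20.0, 24.0])
y = np.array([6.0, 7.5, 8.0, 9.5, 11.0])

# Closed-form OLS estimates: slope = cov(x, y) / var(x), intercept from the means.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Residual Sum of Squares at the OLS solution.
rss = np.sum((y - (intercept + slope * x)) ** 2)
```

Any other slope, e.g. `slope + 0.01`, produces a strictly larger RSS, which is what "least squares" means.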

We also acquired knowledge on other important statistical concepts such as:

  1. Kurtosis measures how heavy a distribution’s tails are relative to a normal distribution, with a value of 3 indicating a normal distribution.
  2. Skewness indicates the asymmetry of a distribution.
  3. A Q-Q plot helps assess the normality of the residuals, which is a critical assumption of linear models.
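The numbers behind a Q-Q plot are just sample quantiles paired with theoretical normal quantiles; SciPy’s `probplot` computes both along with a fit line (the residuals here are simulated, not from my model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(0, 1, 500)  # residuals from a hypothetical fit

# probplot returns the (theoretical, ordered-sample) quantile pairs
# plus the least-squares line through them and its correlation r.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
```

When the residuals really are normal, the quantile pairs hug a straight line and `r` sits very close to 1; curvature in the plot (and a lower `r`) signals departure from normality.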

Referring to Section 3.1.3 of the textbook, which describes methods of assessing the accuracy of the model: for Simple Linear Regression, R-squared equals the square of Pearson’s correlation between the predictor and the response.
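This identity is easy to verify numerically on simulated data (the data-generating line below is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Fit simple linear regression and compute R-squared as 1 - RSS/TSS.
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
r_squared = 1 - resid.var() / y.var()

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
```

For a single predictor the two quantities agree to floating-point precision; with multiple predictors this shortcut no longer holds.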

Another critical assumption of linear models is homoscedasticity, which means that the variance of the errors is constant across the range of predicted values. Violations of this assumption result in heteroscedasticity, which can undermine the reliability of the linear model.
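A crude way to see heteroscedasticity in numbers is to compare residual spread at the low and high ends of the predictor. This sketch deliberately simulates noise that grows with x (all parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 1000)
# Heteroscedastic errors: the noise standard deviation grows with x.
y = 1.0 + 0.5 * x + rng.normal(0, 0.2 + 0.3 * x, 1000)

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)

# Crude check: residual spread in the lower vs upper half of x.
low = resid[x < 5].std()
high = resid[x >= 5].std()
ratio = high / low  # a ratio well above 1 signals heteroscedasticity
```

In a well-behaved (homoscedastic) fit this ratio stays near 1; in practice a residuals-versus-fitted plot, or a formal test such as Breusch-Pagan, gives a more complete picture.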
