December 8, 2023

In the final stage of my analysis, I developed an ARIMA model specifically designed for forecasting the “Temp_Avg” variable in time series data.

Stationarity Check:

In this step, I performed the Augmented Dickey-Fuller (ADF) test to check whether the temperature data behaves consistently over time. The test’s p-value is used to determine the stationarity of a time series: if the p-value is less than or equal to 0.05, we reject the null hypothesis of a unit root. This indicates that there is enough evidence to conclude that the time series is stationary, i.e., that it is relatively stable over time.

In this case, with a p-value of 4.564129009307574e-14 (a very small value, close to zero), we reject the null hypothesis. This suggests that our temperature time series is stationary, which satisfies a key prerequisite for applying models like ARIMA that assume stationarity.
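A minimal sketch of this check using statsmodels’ adfuller function is shown below; the Series name temp_avg and the file name in the comment are placeholders rather than the exact code used.

```python
# Augmented Dickey-Fuller stationarity check (sketch; `temp_avg` is a placeholder Series name)
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# temp_avg = pd.read_csv("boston_monthly_temps.csv", parse_dates=["Date"], index_col="Date")["Temp_Avg"]

def adf_check(series: pd.Series, alpha: float = 0.05) -> None:
    """Run the ADF test and report whether the series looks stationary."""
    stat, p_value, *_ = adfuller(series.dropna())
    print(f"ADF statistic: {stat:.4f}, p-value: {p_value:.4g}")
    if p_value <= alpha:
        print("Reject H0: the series appears stationary.")
    else:
        print("Fail to reject H0: the series may be non-stationary.")
```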

Best Hyperparameter (p, d, q) Selection:

The next step is to find the best parameters for our ARIMA model. This involved testing different combinations of parameters (p, d, q) to see which set provides the most accurate predictions for our dataset. The combination that resulted in the lowest Root Mean Square Error (RMSE) was chosen as the best parameters. In this case, the best parameters were (4, 1, 3).
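The search itself can be sketched as follows; the hold-out size and the (p, d, q) ranges here are illustrative assumptions, not necessarily the exact grid I used.

```python
# Grid search over ARIMA (p, d, q) orders by hold-out RMSE (illustrative sketch)
import itertools
import warnings

import numpy as np
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA

def best_arima_order(series, test_size=12,
                     p_range=range(0, 5), d_range=range(0, 2), q_range=range(0, 5)):
    """Return the (p, d, q) order with the lowest RMSE on the last `test_size` points."""
    train, test = series[:-test_size], series[-test_size:]
    best_order, best_rmse = None, np.inf
    for order in itertools.product(p_range, d_range, q_range):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                fit = ARIMA(train, order=order).fit()
            preds = fit.forecast(steps=test_size)
            rmse = np.sqrt(mean_squared_error(test, preds))
            if rmse < best_rmse:
                best_order, best_rmse = order, rmse
        except Exception:
            continue  # skip orders that fail to converge
    return best_order, best_rmse
```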

Final Model on the Entire Dataset with Best Params:

With the best parameters identified, I built the final ARIMA model using the entire temperature dataset. The model was trained on the historical data and then used to forecast future temperatures. The actual and predicted values were plotted, visually showing how well the model captures the patterns in the temperature data. The red line represents the forecasted temperatures, while the blue line represents the actual observed temperatures. This final model allows us to make predictions for the entire dataset and provides a useful tool for understanding future temperature trends based on historical patterns.
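A sketch of the final fit and plot is given below; the forecast horizon, colors, and figure details are assumptions that mirror the description above.

```python
# Fit the final ARIMA(4, 1, 3) model and plot actual vs. fitted/forecast values (sketch)
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

def fit_and_plot(series, order=(4, 1, 3), horizon=12):
    model = ARIMA(series, order=order).fit()
    in_sample = model.fittedvalues          # one-step-ahead fits on the historical data
    forecast = model.forecast(steps=horizon)

    plt.figure(figsize=(10, 4))
    plt.plot(series, color="blue", label="Actual Temp_Avg")
    plt.plot(in_sample, color="green", label="In-sample fit")
    plt.plot(forecast, color="red", label="Forecast")
    plt.legend()
    plt.title("ARIMA(4, 1, 3) fit and forecast for Temp_Avg")
    plt.show()
    return model
```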

December 6, 2023

In my subsequent analysis, my focus has been on exploring data-driven time series forecasting models for the “Temp_Avg” variable. These models are crafted to discern and leverage patterns from the historical temperature data, aiding in a more profound understanding of temperature variations over time. The models under consideration encompass diverse methodologies, including linear, exponential, quadratic, and those integrating additive or multiplicative seasonality. To enhance the models’ flexibility and adaptability, I augmented the dataset with terms such as t, t_squared, and log(t), where ‘t’ represents the time index. These additions are crucial for capturing and incorporating different temporal characteristics, ensuring the models are well-equipped to comprehend and predict the nuanced patterns embedded in the temperature dataset.

Data-Driven Time Series Forecasting Models Analysis:

  1. Linear Model: This model assumes a linear relationship between time and temperature. It yielded a Root Mean Square Error (RMSE) of approximately 14.56, indicating its performance in capturing linear trends.
  2. Exponential Model: The exponential model aims to capture exponential growth or decay in temperature over time. However, its RMSE was extremely high relative to the other models, suggesting that this model is not suitable for this dataset.
  3. Quadratic Model: The quadratic model introduces a squared term to account for curvature in the temperature trend. It resulted in an RMSE of around 14.66, essentially on par with (and slightly worse than) the linear model, indicating that adding curvature alone does not improve the fit.
  4. Additive Seasonality: This model considers seasonal variations in temperature and resulted in an RMSE of approximately 2.59, showcasing its effectiveness in capturing repeating patterns over the months.
  5. Additive Seasonality with Linear Trend: Combining a linear trend with seasonal variations, this model produced an RMSE of about 2.63, a slightly higher error than the purely seasonal model but still a reasonable fit to the temperature fluctuations.
  6. Additive Seasonality with Quadratic Trend: Introducing a quadratic trend along with seasonality, this model achieved an RMSE of approximately 2.72, indicating its ability to capture more complex temperature patterns.
  7. Multiplicative Seasonality: This model considers both seasonal and overall multiplicative variations. However, the extremely high RMSE value suggests that it may not be the most suitable model for this dataset.
  8. Multiplicative Seasonality with Linear Trend: Similar to the multiplicative model, this includes a linear trend and seasonality. However, the extremely high RMSE value indicates potential limitations in accurately predicting temperature variations.
  9. Multiplicative Seasonality with Quadratic Trend: Combining a quadratic trend with multiplicative seasonality, this model yielded an extremely high RMSE value, suggesting challenges in accurately forecasting temperature patterns.
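To make these model forms concrete, below is a minimal sketch of how the trend terms and month dummies can be built and how one of the better-performing specifications (additive seasonality with a linear trend) can be fit; the DataFrame name and exact column handling are assumptions.

```python
# Build t, t_squared, log(t), and month dummies, then fit an additive-seasonality model (sketch)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add trend terms and month dummy columns to a monthly Temp_Avg DataFrame."""
    out = df.copy()
    out["t"] = np.arange(1, len(out) + 1)
    out["t_squared"] = out["t"] ** 2
    out["log_t"] = np.log(out["t"])
    months = pd.get_dummies(out.index.month, prefix="m", drop_first=True, dtype=float)
    months.index = out.index
    return pd.concat([out, months], axis=1)

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))

# Example usage (df: monthly DataFrame with a DatetimeIndex and a Temp_Avg column):
# data = prepare_features(df)
# month_terms = " + ".join(c for c in data.columns if c.startswith("m_"))
# additive_linear = smf.ols("Temp_Avg ~ t + " + month_terms, data=data).fit()
# print(rmse(data["Temp_Avg"], additive_linear.predict(data)))
```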

Conclusion: Among the models tested, those incorporating additive seasonality (alone or combined with a linear or quadratic trend) demonstrated the best performance, with the lowest RMSE values. These models effectively capture the seasonal variations in temperature, providing more accurate forecasts than the other methods. The linear and quadratic models without seasonality captured the overall level reasonably but were clearly outperformed by the seasonal models. The exponential and multiplicative models exhibited extremely high RMSE values, indicating potential limitations in their applicability to this temperature dataset.

December 4, 2023

As part of my next step, I explored Exponential Smoothing methods for time series forecasting on the “Temp_Avg” variable. This choice was motivated by the need to capture and incorporate the inherent patterns and trends present in the historical temperature data. Exponential Smoothing offers a versatile approach that considers various components, such as trends and seasonality, providing a robust framework for forecasting temperature fluctuations over time. In this analysis, I employed four specific methods – Simple Exponential Method, Holt’s Method, and two variations of Holt-Winter’s exponential smoothing with different seasonality assumptions – with the aim of comparing their effectiveness in capturing the complexities of temperature variations.

The evaluation was based on the Mean Absolute Percentage Error (MAPE) values, allowing for a comprehensive comparison of the forecasting performance across these distinct Exponential Smoothing models.

Comparison of Exponential Smoothing Methods for Temperature Forecasting:

  1. Simple Exponential Method: The Simple Exponential Method resulted in a Mean Absolute Percentage Error (MAPE) value of approximately 24.38%. This method takes a straightforward approach, forecasting with an exponentially weighted average of past temperatures without accounting for trend or seasonality.
  2. Holt’s Method: Holt’s method yielded a MAPE value of about 37.79%. This model adds a trend component to the smoothed level, allowing it to follow patterns that evolve over time.
  3. Holt-Winter’s with Additive Seasonality and Additive Trend: The Holt-Winter’s model with additive seasonality and additive trend demonstrated a lower MAPE value of around 3.85%. This approach considers both trend and seasonal variations in the temperature data, assuming that the influence of each is additive.
  4. Holt-Winter’s with Multiplicative Seasonality and Additive Trend: The Holt-Winter’s model with multiplicative seasonality and additive trend resulted in a MAPE value of approximately 3.99%. This method is similar to the previous method but assumes that the seasonal variations have a multiplicative effect, offering flexibility in accommodating different patterns in the data.
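A sketch of fitting these four variants with statsmodels and scoring them by MAPE follows; the seasonal period of 12 months reflects the monthly data, but the train/test split is an assumption.

```python
# Fit the four exponential smoothing variants and compare them by MAPE (sketch)
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing, Holt, SimpleExpSmoothing

def mape(actual, predicted):
    actual, predicted = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

def compare_smoothing(train, test):
    """Return the MAPE of each smoothing method on the hold-out set."""
    models = {
        "Simple Exponential": SimpleExpSmoothing(train).fit(),
        "Holt": Holt(train).fit(),
        "Holt-Winters additive": ExponentialSmoothing(
            train, trend="add", seasonal="add", seasonal_periods=12).fit(),
        "Holt-Winters multiplicative": ExponentialSmoothing(
            train, trend="add", seasonal="mul", seasonal_periods=12).fit(),
    }
    return {name: mape(test, fit.forecast(len(test))) for name, fit in models.items()}
```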

Conclusion:  The Holt-Winter’s model with additive seasonality and additive trend outperformed the other models, suggesting that this model may be a more suitable representation for the given temperature dataset.

December 1, 2023

Time series decomposition is a valuable approach used to dissect and analyze patterns within a sequence of data points, specifically in the context of the “Temp_Avg” variable over time. This method aims to break down the temperature time series into its essential components, enabling a more nuanced examination of the inherent structures influencing the observed variations in average temperature.

The primary components in the decomposition of the “Temp_Avg” variable are:

  1. Trend: This denotes the long-term direction or pattern in the average temperature data, providing insights into overarching changes or tendencies throughout the entire time period.
  2. Seasonality: Seasonal components capture recurring patterns or fluctuations in the average temperature that repeat at regular intervals. This allows us to discern cyclic behaviors related to specific time periods, such as seasons, months, or days.
  3. Residuals: Also referred to as errors, residuals represent the unexplained variability in the average temperature data that is not accounted for by the trend or seasonality. Analyzing residuals helps assess the accuracy of the decomposition and provides insights into random fluctuations or unexpected influences on temperature.
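A minimal sketch of producing this decomposition with statsmodels, assuming the monthly series is in a pandas Series (here called temp_avg as a placeholder):

```python
# Decompose the monthly Temp_Avg series into trend, seasonal, and residual parts (sketch)
from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_temps(temp_avg, period=12):
    """Split a monthly temperature Series into trend, seasonal, and residual components."""
    result = seasonal_decompose(temp_avg, model="additive", period=period)
    result.plot()  # stacked panels: observed, trend, seasonal, residual
    return result.trend, result.seasonal, result.resid
```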

Next, I plotted the ACF and PACF plots.

Autocorrelation Function (ACF) Plot:

The ACF plot helps us understand how the average temperature in Boston relates to its past values over time. Each point on the plot represents the similarity between the current temperature and its past values at different time intervals, ranging from no time lag (current temperature) up to 12 months back. Peaks or spikes in the plot indicate periods where the temperature tends to follow a repeating pattern, which could be related to seasonal cycles or other trends.

Partial Autocorrelation Function (PACF) Plot:

The PACF plot explores the connection between the current average temperature in Boston and its past values after removing the influence of the intervening time points. Similar to the ACF plot, spikes in this plot indicate significant associations between the current temperature and its past values at specific lags, providing insight into which time intervals most directly influence temperature patterns.
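Both plots can be produced with statsmodels as in the sketch below; the 12-month lag window matches the description above, while the PACF estimation method is an assumption.

```python
# ACF and PACF plots up to 12 monthly lags (sketch)
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

def acf_pacf_plots(temp_avg, lags=12):
    fig, axes = plt.subplots(2, 1, figsize=(8, 6))
    plot_acf(temp_avg.dropna(), lags=lags, ax=axes[0])
    plot_pacf(temp_avg.dropna(), lags=lags, ax=axes[1], method="ywm")
    plt.tight_layout()
    plt.show()
```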

November 29, 2023

For my third project, I have chosen to utilize weather data for the city of Boston sourced from the National Weather Service website (https://www.weather.gov/). This platform provides a comprehensive range of weather, water, and climate information, along with forecasts, warnings, and decision support services aimed at safeguarding life, property, and bolstering the national economy.

The dataset I extracted comprises monthly summarized data spanning from 2019 to November 2023, specifically focusing on the variable “Average Temperature” for Boston, resulting in a total of 59 entries.

In the Python processing of this data for time-series forecasting, a thorough examination revealed no duplicate records and no outliers in the “Temp_Avg” variable. Furthermore, a time plot was constructed to visualize the “Temp_Avg” variable over the years. The series shows no pronounced long-term trend, and the seasonal fluctuations remain roughly constant in magnitude, indicating additive seasonality.
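A sketch of the loading, checking, and plotting steps appears below; the file name and the date column are placeholders for the actual export from the National Weather Service site.

```python
# Load the monthly Boston summary and plot Temp_Avg over time (sketch; file name is a placeholder)
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("boston_monthly_weather_2019_2023.csv",
                 parse_dates=["Date"], index_col="Date")

print(df["Temp_Avg"].isnull().sum(), "missing values,", df.duplicated().sum(), "duplicates")

df["Temp_Avg"].plot(figsize=(10, 4), title="Monthly average temperature, Boston (2019-2023)")
plt.ylabel("Temp_Avg")
plt.show()
```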

Trend refers to the long-term general direction or pattern in a time series, indicating the underlying movement or tendency that persists over an extended period.

Seasonality pertains to recurring and predictable patterns or fluctuations in a time series that follow a specific regular interval, often corresponding to calendar seasons, months, or other repetitive cycles.

November 27, 2023

Project 3 – Time Series Forecasting for Weather Dataset
As part of my third project, I am planning to delve into the fascinating domain of time series forecasting, with a specific focus on weather datasets. Weather patterns exhibit a dynamic and sequential nature, making them ideal candidates for time series analysis. The objective of this project is to harness the power of advanced forecasting techniques to predict future weather conditions based on historical data. By employing state-of-the-art machine learning algorithms and statistical models, I aim to unravel the intricate patterns embedded within the time series of meteorological data. This undertaking not only presents a challenging computational task but also holds significant real-world implications, as accurate weather predictions are crucial for various sectors ranging from agriculture and energy to disaster management.

Methodology and Impact
To achieve this, my approach involves preprocessing and analyzing historical weather data, identifying seasonality, trends, and potential anomalies. Leveraging machine learning frameworks, I plan to implement time series forecasting models such as ARIMA (AutoRegressive Integrated Moving Average) and to evaluate the model’s ability to capture and extrapolate complex temporal dependencies within the dataset accurately. The anticipated outcomes include weather predictions and climate trends. Such predictions have the potential to revolutionize decision-making processes in agriculture, resource planning, and disaster preparedness. The project’s significance lies not only in its technical complexity but also in its potential real-world impact, as improved weather forecasting can contribute to more resilient and sustainable communities.

November 20, 2023

Model Estimation and Forecasting with ARIMA

The estimation and forecasting process for ARIMA models involves several key steps. Once a time series has been identified and analyzed, the next step is to determine the appropriate values for the model’s parameters (p, d, q). This often involves inspecting autocorrelation and partial autocorrelation plots to guide the selection of the autoregressive and moving average orders. Differencing is applied to achieve stationarity, and the order of differencing (d) is determined accordingly.

Estimation of ARIMA parameters is typically done using maximum likelihood estimation (MLE) methods. The model is then fitted to the historical data, and the residuals (the differences between the observed and predicted values) are examined to ensure that they exhibit no significant patterns, indicating a well-fitted model.

Once the ARIMA model is successfully estimated and validated, it can be used for forecasting future values of the time series. Forecasting involves propagating the model forward in time, with predicted values based on the estimated autoregressive and moving average parameters. Confidence intervals can also be computed to provide a measure of uncertainty around the point forecasts.
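For instance, with statsmodels both the point forecasts and their confidence intervals can be pulled from a fitted model, as in this sketch (fitted_model stands for any ARIMA results object):

```python
# Point forecasts plus confidence intervals from a fitted ARIMA model (sketch)
def forecast_with_intervals(fitted_model, steps=12, alpha=0.05):
    """Return point forecasts and (1 - alpha) confidence bounds for the next `steps` periods."""
    pred = fitted_model.get_forecast(steps=steps)
    point = pred.predicted_mean
    bounds = pred.conf_int(alpha=alpha)   # DataFrame with lower/upper columns
    return point, bounds
```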

Despite its widespread use, ARIMA models have limitations, such as the assumption of linearity and stationarity. In practice, other advanced time series models, like SARIMA (Seasonal ARIMA) or machine learning approaches, may be employed to address these limitations and improve forecasting accuracy. Nonetheless, ARIMA models remain a valuable and accessible tool for time series analysis and forecasting.

November 17, 2023

ARIMA Models:

As part of our Project-3, we are planning to focus on Time Series Forecasting and use models such as ARIMA. Autoregressive Integrated Moving Average (ARIMA) models are a class of statistical models widely used in time series analysis and forecasting. Developed to capture and describe the temporal dependencies present in a time series dataset, ARIMA models are a combination of autoregressive (AR) and moving average (MA) components, with an added differencing step for stationarity.

The “AR” in ARIMA refers to the autoregressive component, which implies that the current value of the time series is dependent on its previous values. The “MA” stands for the moving average component, indicating that the current value is also influenced by a stochastic term representing the past forecast errors. The “I” in ARIMA represents differencing, a crucial step to transform a non-stationary time series into a stationary one. This differencing helps stabilize the mean and variance of the time series, making it amenable to modeling.

ARIMA models are denoted as ARIMA(p, d, q), where “p” is the order of the autoregressive component, “d” is the order of differencing, and “q” is the order of the moving average component. The appropriate choice of these parameters depends on the characteristics of the specific time series being analyzed. ARIMA models have proven effective in various fields, including economics, finance, and environmental science, making them a valuable tool for researchers and analysts seeking to make accurate predictions based on historical time series data.

November 15, 2023

In today’s class, Professor had a discussion on Time Series Forecasting:

Time Series Forecasting:
Time series forecasting is a crucial aspect of data analysis that deals with the prediction of future values based on past observations. It involves analyzing a sequence of data points, typically collected at regular intervals over time. Time series data often exhibit temporal dependencies, where the current value is dependent on previous values. Forecasting methods aim to capture these patterns and trends to make accurate predictions. Techniques such as autoregressive models, moving averages, and machine learning algorithms like ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short-Term Memory) networks are commonly employed for time series forecasting. Successful forecasting can provide valuable insights for decision-making in various fields, including finance, economics, weather prediction, and supply chain management.

Challenges in Time Series Forecasting:
Despite its significance, time series forecasting poses several challenges. One major issue is the presence of seasonality, where patterns repeat at regular intervals, and handling these recurring fluctuations can be complex. Another challenge is the need to address external factors that might impact the time series data, such as holidays, economic events, or sudden changes in the environment. Additionally, time series data may exhibit non-linear trends, making it necessary to employ advanced models capable of capturing complex relationships. Data preprocessing, feature engineering, and selecting appropriate model parameters are critical steps in overcoming these challenges. In practice, the effectiveness of a forecasting model often depends on the quality of data, the chosen algorithm, and the appropriateness of the selected features.

Applications of Time Series Forecasting:
Time series forecasting finds applications in a wide range of domains. In finance, it is employed to predict stock prices, currency exchange rates, and market trends. In retail, businesses use forecasting to optimize inventory management and predict customer demand. Weather forecasting relies heavily on time series analysis to predict temperature, precipitation, and other meteorological parameters. Energy consumption and production planning, healthcare resource management, and traffic flow prediction are additional areas where accurate time series forecasting is instrumental. The ability to anticipate future trends and patterns from historical data empowers organizations to make informed decisions, reduce risks, and enhance overall efficiency in various aspects of their operations.

November 10, 2023

In class, Professor explained about Decision Trees algorithm.

Decision Trees Definition and Use Cases:

A decision tree is a powerful machine learning algorithm used for both classification and regression tasks. It is a tree-like model where each node represents a decision based on the input features, and each branch represents the possible outcome of that decision. The leaves of the tree contain the final predicted label or value. Decision trees are popular due to their simplicity, interpretability, and ability to handle both numerical and categorical data.

One common use case for decision trees is in the field of medicine for diagnosing diseases. A decision tree can be trained on patient data, considering symptoms, test results, and medical history, to predict the likelihood of a specific disease. Another application is in finance for credit scoring, where decision trees help assess the creditworthiness of individuals based on factors such as income, debt, and credit history.

Gini Index:

The Gini index is a metric used in decision tree algorithms to measure the impurity or disorder of a dataset. It quantifies how often a randomly chosen element would be incorrectly classified. In the context of decision trees, the Gini index is used to evaluate the quality of a split at a particular node. The goal is to minimize the Gini index, leading to more homogeneous subsets and, consequently, more accurate predictions.

Mathematically, the Gini index for a node is computed from the class proportions as Gini = 1 - (p_1^2 + p_2^2 + ... + p_k^2), that is, one minus the sum of the squared probabilities of each class. A Gini index of 0 indicates a pure node where all instances belong to the same class, while a higher Gini index suggests greater impurity.

Information Gain:

Information gain is a concept used in decision tree algorithms to determine the effectiveness of a feature in reducing uncertainty about the classification of a dataset. It is calculated by measuring the difference in entropy (a measure of disorder or uncertainty) before and after a dataset is split based on a particular feature. The goal is to maximize information gain, indicating that splitting the data using a specific feature results in more organized and predictable subsets.

Higher information gain implies that a feature is more relevant for making decisions. Decision tree algorithms use information gain to decide the order in which features are considered for node splits. By recursively selecting features with the highest information gain, the tree builds a hierarchy that optimally classifies the data. Information gain is a crucial aspect of decision tree training as it guides the model in selecting the most informative features to make accurate predictions.
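A small sketch of how these two impurity measures can be computed for a candidate split is shown below; the labels and the split itself are illustrative toy data.

```python
# Gini impurity, entropy, and information gain for a candidate split (illustrative sketch)
import math
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["sick", "sick", "sick", "healthy", "healthy", "healthy"]
left, right = ["sick", "sick", "sick"], ["healthy", "healthy", "healthy"]
print(gini(parent), information_gain(parent, left, right))  # 0.5 and 1.0 for this perfect split
```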

November 8, 2023

In my analysis today, I calculated a statistic known as Cohen’s d. This statistic serves a valuable purpose by providing us with insights into the practical significance of the observed difference in the average ages between two distinct racial groups, namely the white and black populations. Upon analysis, I found that the resulting Cohen’s d value was approximately 0.57. This numerical value holds a particular significance as it allows us to categorize the effect size of the observed age difference. According to well-established guidelines, a Cohen’s d value of this magnitude falls into the category of a medium effect size.

What this essentially signifies is that the approximately 7-year difference in average ages between the white and black racial groups carries meaningful weight. While it may not reach the magnitude of a large effect, it is nonetheless a noteworthy and discernible difference that merits our attention and consideration. In practical terms, this medium effect size implies that the disparity in average ages between these two racial groups is of moderate importance and relevance.
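For reference, a sketch of how Cohen’s d can be computed from the two age samples (the array names are placeholders):

```python
# Cohen's d with a pooled standard deviation (sketch; ages_white / ages_black are placeholders)
import numpy as np

def cohens_d(group_a, group_b):
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

# d = cohens_d(ages_white, ages_black)   # ~0.57 in the analysis above
```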

November 6, 2023

Statistical Analysis: Assessing Age Differences Between White and Black Individuals

In today’s analysis, two statistical methods, a two-sample t-test and a Monte Carlo simulation, were employed to evaluate potential age differences between two groups, one represented by “AgesWhite” and the other by “AgesBlack.”

Two-Sample T-Test: The two-sample t-test is a widely used statistical technique that assesses whether there is a statistically significant difference in means between two groups. In this analysis, the t-test yielded the following results:

  • T-statistic: 19.207307521141903
  • P-value: 2.28156216181107e-79
  • Negative Log (base 2) of p-value: 261.2422975351452

The t-statistic of 19.21 indicates a substantial difference in means between the ‘Black’ and ‘White’ races. The small p-value (2.28e-79) suggests strong evidence against the null hypothesis of no difference. Furthermore, the negative log (base 2) of the p-value emphasizes the significance, equating the observed age difference to the likelihood of obtaining more than 261 consecutive tails when flipping a fair coin. This very low probability reinforces the statistical significance of the age difference.

Monte Carlo Simulation: Since the ‘age’ variable is not normally distributed, the t-test might not provide an accurate representation of the observed age difference. Hence, a Monte Carlo simulation was conducted to further scrutinize the observed age difference. The simulation involved 2,000,000 iterations, with each iteration randomly drawing a sample of the same size as the ‘White’ and ‘Black’ data from the combined age distributions of both groups.

Surprisingly, none of the 2,000,000 random samples generated in the Monte Carlo simulation produced a difference in means greater than the observed 7.2-year difference between White and Black individuals. This outcome is consistent with the t-test results, and it strongly indicates a very low probability of observing such a substantial age difference if the null hypothesis (no difference in means) were accurate.
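Both procedures can be sketched as follows; the array names are placeholders, and at 2,000,000 iterations the resampling loop is deliberately simple rather than fast.

```python
# Two-sample t-test plus a Monte Carlo resampling check of the mean-age difference (sketch)
import numpy as np
from scipy import stats

def compare_ages(ages_white, ages_black, n_iter=2_000_000, seed=0):
    t_stat, p_value = stats.ttest_ind(ages_white, ages_black)
    print(f"t = {t_stat:.2f}, p = {p_value:.3g}, -log2(p) = {-np.log2(p_value):.1f}")

    observed_diff = np.mean(ages_white) - np.mean(ages_black)
    combined = np.concatenate([ages_white, ages_black])
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(n_iter):
        rng.shuffle(combined)                      # random relabelling of the combined ages
        diff = combined[:len(ages_white)].mean() - combined[len(ages_white):].mean()
        if abs(diff) >= abs(observed_diff):
            exceed += 1
    print(f"{exceed} of {n_iter} random splits matched or exceeded the observed difference")
```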

Combined Conclusion: Both the two-sample t-test and the Monte Carlo simulation converge on the same conclusion. The age difference of 7.2 years between White and Black individuals is highly statistically significant. The t-test provides strong evidence against the null hypothesis, and the Monte Carlo simulation reinforces this by demonstrating that such a substantial age difference is exceedingly unlikely to occur by random chance. This collective statistical analysis underscores the presence of a genuine and significant difference in mean ages between these two demographic groups.

November 3, 2023

In today’s analysis, I delved into understanding the age distribution from different perspectives, shedding light on its extremeness and behavior compared to a standard normal distribution.

In the first analysis, I aimed to determine what percentage of the right tail of the age distribution lies more than 2 standard deviations from the mean for both Black and White races. The first step involved calculating the mean age and standard deviation, providing crucial insights into the distribution’s central tendency and spread. A threshold is calculated by adding 2 times the standard deviation to the mean, which delineated a boundary for outliers in the right tail. Subsequently, I determined the percentage of data points in the dataset exceeding this threshold, offering a glimpse into the rarity of values in the right tail. Additionally, I used the standard normal distribution as a benchmark, enabling a comparison between our data and a theoretical normal distribution, particularly in the tail region beyond 2 standard deviations.

The percentage of values greater than 2 standard deviations above the mean (Black): 4.9275%
The percentage of values greater than 2 standard deviations above the mean (White): 3.0518%

In summary, this analysis quantified the extremeness of age values in the right tail of the distribution and contrasted it with the behavior of a standard normal distribution. This information can prove invaluable for decision-making, risk assessment, and outlier identification within the dataset. Such analyses empower data-driven insights into the tail behavior of the distribution, with applications spanning various fields.

In the second analysis, my focus shifted to assessing how many cases and what percentage of total cases fell within the range of -1 to 1 standard deviation from the mean in an age distribution for both ‘Black’ and ‘White’ races. Furthermore, I compared this percentage to the corresponding percentage for a standard normal distribution.

To begin, I calculated the mean and standard deviation of the age data, fundamental statistics that provide insight into the distribution’s characteristics. The lower and upper bounds for the specified range (-1 to 1 standard deviation from the mean) were computed, marking the boundaries for this analysis. I then determined the number of cases within this age range by identifying the values falling between the lower and upper bounds. The percentage of cases within this range was calculated, presenting a measure of the distribution’s behavior within this specific interval. Additionally, I provided context by calculating the corresponding percentage for a standard normal distribution within the same range. This allowed for a comparison between the age distribution and the behavior expected from an idealized standard normal distribution.

Black:
Number of cases within -1 to 1 standard deviation from the mean: 1199
Percentage of cases within -1 to 1 standard deviation from the mean: 14.9838%
Lower bound (age): 21.5428
Upper bound (age): 44.3135

White:
Number of cases within -1 to 1 standard deviation from the mean: 2185
Percentage of cases within -1 to 1 standard deviation from the mean: 27.3057%
Lower bound (age): 26.9653
Upper bound (age): 53.2856

In summary, this analysis quantified the cases and percentage within the -1 to 1 standard deviation range from the mean in the age distribution. It also offered a benchmark through a standard normal distribution, enabling a better understanding of how age data deviates from the idealized distribution within this specific range. Such insights can have practical applications in various decision-making processes and data-driven fields.
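A sketch of both calculations for a single group, with the standard normal benchmark taken from scipy, is shown below; the age array name is a placeholder, and each racial group would be passed in separately.

```python
# Share of ages beyond mean + 2 SD and within mean ± 1 SD, versus a normal benchmark (sketch)
import numpy as np
from scipy.stats import norm

def tail_and_core_shares(ages):
    ages = np.asarray(ages, dtype=float)
    mean, sd = ages.mean(), ages.std()

    upper_tail_pct = 100 * np.mean(ages > mean + 2 * sd)
    within_1sd = (ages >= mean - sd) & (ages <= mean + sd)

    print(f"> mean + 2 SD: {upper_tail_pct:.4f}% "
          f"(normal benchmark: {100 * (1 - norm.cdf(2)):.4f}%)")
    print(f"within ±1 SD: {within_1sd.sum()} cases, {100 * within_1sd.mean():.4f}% of this group "
          f"(normal benchmark: {100 * (norm.cdf(1) - norm.cdf(-1)):.4f}%)")
    print(f"bounds: [{mean - sd:.4f}, {mean + sd:.4f}]")
```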

November 1, 2023

As part of today’s analysis, I looked at the age distribution between the White and the Black races. A combined kernel density plot is shown below for the ‘age’ variable of the ‘Black’ and ‘White’ races. The kernel density plot of the age variable for the ‘Black’ race (in red) indicates that it is positively skewed and moderately peaked. The statistical summary reveals that the dataset contains 1,725 individuals, with an average age of approximately 32.93 years and a standard deviation of about 11.39. The age range spans from 13 to 88 years, with quartile values indicating the distribution’s spread.

The kernel density plot of the ‘age’ variable for the ‘White’ race (in blue), by contrast, indicates a slightly negatively skewed and relatively flat distribution. The statistical summary reveals that the dataset encompasses 3,244 individuals, with an average age of approximately 40.13 years and a standard deviation of around 13.16. Age values range from 6 to 91 years, with quartile values providing insights into the distribution’s variability.

Additionally, the qq-plot indicates that the age distribution for both the ‘Black’ and ‘White’ races is not normal, which is an important observation for statistical analysis, as shown below.

Overall, these analyses provide a comprehensive view of age demographics within the two racial groups, revealing differences in skewness and shape of their age distributions.
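A sketch of these plots, assuming a DataFrame df with ‘age’ and ‘race’ columns where the races are coded ‘B’ and ‘W’ (the coding is an assumption):

```python
# Overlaid kernel density plots and QQ plots for the two groups (sketch)
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

def age_distribution_plots(df):
    black_ages = df.loc[df["race"] == "B", "age"].dropna()
    white_ages = df.loc[df["race"] == "W", "age"].dropna()

    sns.kdeplot(black_ages, color="red", label="Black")
    sns.kdeplot(white_ages, color="blue", label="White")
    plt.legend()
    plt.title("Age distribution by race")
    plt.show()

    sm.qqplot(black_ages, line="s")
    plt.title("QQ plot: Black")
    plt.show()
    sm.qqplot(white_ages, line="s")
    plt.title("QQ plot: White")
    plt.show()
```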

October 30, 2023

In my recent analysis, I utilized the DBSCAN clustering algorithm to process latitude and longitude data for the state of California. However, it’s important to note that DBSCAN has its limitations, and the results were not as effective as expected. The presence of cluster IDs marked as ‘-1’ indicates instances that were considered outliers or noise in the dataset.

Consequently, I decided to switch to implementing the K-Means clustering algorithm on the same latitude and longitude data using Python. In this case, I specified the number of clusters as ‘4’. This means that K-Means divides the data into four distinct clusters based on geographical proximity and assigns unique colors to each cluster for visualization.

When visualizing these K-Means clusters on a map of the United States for the state of California, it becomes evident that the red and blue clusters appear to be denser. This observation leads to the inference that the increased population density in certain areas of California could be contributing to the prominence of these clusters.

In summary, by shifting from DBSCAN to K-Means clustering and employing a color-coded visualization on a map, we gain valuable insights into the geographical distribution of shootings in California, particularly highlighting areas with higher population density and potential clustering patterns.
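The K-Means step can be sketched as follows; the state filter column, the color map, and other figure details are assumptions.

```python
# K-Means with 4 clusters on California latitude/longitude points (sketch)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def cluster_california(df, n_clusters=4):
    ca = df[df["state"] == "CA"][["longitude", "latitude"]].dropna()
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(ca)

    plt.scatter(ca["longitude"], ca["latitude"], c=labels, cmap="rainbow", s=10)
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.title("K-Means clusters of California shooting locations")
    plt.show()
    return labels
```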

My next objective is to examine the distribution of the ‘age’ variable for Black and White races.

October 27, 2023

In today’s class, Professor mentioned the instability of DBSCAN compared to K-means. Below are the different scenarios which highlight the instability of DBSCAN.

Sensitivity to Density Variations:

DBSCAN’s stability can be influenced by variations in data point density. When data density exhibits significant discrepancies across different segments of the dataset, it can lead to the formation of clusters with varying sizes and shapes. Consequently, the task of selecting appropriate parameters (such as the maximum distance ε and minimum point thresholds) for defining clusters effectively becomes challenging.

Conversely, K-means operates under the assumption of spherical and uniformly sized clusters, thereby potentially performing more effectively when the clusters share similar densities and shapes.

Sensitivity to Parameter Choices:

DBSCAN necessitates the configuration of hyperparameters, such as ε (representing the maximum distance that defines the neighborhood of a data point) and the minimum number of data points required to establish a dense region. These parameter choices hold considerable influence over the resultant clusters.

K-means, while also requiring a parameter (the number of clusters, K), is generally more straightforward to determine since it directly reflects the desired number of clusters. In contrast, the parameters of DBSCAN are more abstract, which can introduce sensitivity to the selection of parameter values.

Boundary Points and Noise:

DBSCAN explicitly identifies noise points, which are data points that do not belong to any cluster, and is proficient at handling outliers. However, the delineation of boundary points (those located on the periphery of a cluster) within DBSCAN can sometimes exhibit an arbitrary nature.

In K-means, data points situated at the boundaries of clusters may be assigned to one of the neighboring clusters, potentially resulting in instability when a data point is proximate to the boundary shared by two clusters.

Varying Cluster Shapes:

DBSCAN excels in its ability to accommodate clusters with arbitrary shapes and to detect clusters with irregular boundaries. This stands in contrast to K-means, which presupposes the presence of roughly spherical clusters and consequently demonstrates greater stability when the data conforms to this assumption.

October 23, 2023

In today’s class, different technical concepts, including K-Means and DBSCAN clustering techniques, were explained by the Professor as outlined below:

K-medoids:
K-medoids is another partitioning clustering algorithm, but it is noted for its increased robustness compared to K-means with respect to outliers. Instead of utilizing the mean (average) as the cluster center, K-medoids employs the actual data point (medoid) within a cluster, which minimizes the sum of distances to all other points in that cluster. This approach renders K-medoids less sensitive to outliers and particularly suitable for non-Gaussian clusters.

Hierarchical Clustering:
Hierarchical clustering is characterized by the construction of a tree-like structure of clusters, wherein data points are incrementally grouped into larger clusters, thus establishing a hierarchy. Two primary methodologies exist: agglomerative (bottom-up) and divisive (top-down). In agglomerative clustering, each data point initiates as its own cluster and is subsequently merged iteratively with its closest neighboring cluster, leading to the formation of a dendrogram. The divisive method, on the other hand, starts with all data points within a single cluster and then proceeds to recursively divide them. Importantly, hierarchical clustering eliminates the need to specify the number of clusters in advance, and it offers a visual representation of the data’s inherent grouping.

Dendrograms:
A dendrogram is a diagram structured like a tree that is employed to visualize the hierarchy of clusters within hierarchical clustering. This visualization method displays the sequence of merges or splits, as well as the respective distances at which these actions occur. The height of the vertical lines within the dendrogram signifies the dissimilarity or distance between clusters. By selectively cutting the dendrogram at a particular height, one can obtain a specific number of clusters.
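A small sketch of agglomerative clustering and its dendrogram with scipy is given below; the two toy blobs are illustrative data only.

```python
# Agglomerative (Ward) clustering and a dendrogram with scipy (illustrative sketch)
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])  # two toy blobs

Z = linkage(points, method="ward")   # records the sequence of merges and their distances
dendrogram(Z)                        # tree of merges; line heights show dissimilarity
plt.title("Dendrogram (Ward linkage)")
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")  # "cut" the tree to obtain 2 clusters
```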

October 20, 2023

Today, I tried to calculate the geodesic distance between two geographic coordinates, representing Seattle, WA, and Miami, FL. I used the geodesic function from the geopy library, which accurately computes distances on the Earth’s curved surface. The resulting distance is calculated in both miles and kilometers as shown below.
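The computation can be sketched as follows; the coordinates are approximate city-center values rather than the exact ones I used.

```python
# Geodesic distance between Seattle, WA and Miami, FL (sketch; approximate coordinates)
from geopy.distance import geodesic

seattle = (47.6062, -122.3321)   # (latitude, longitude), approximate
miami = (25.7617, -80.1918)

dist = geodesic(seattle, miami)
print(f"{dist.miles:.1f} miles, {dist.km:.1f} km")   # roughly 2,700 mi / 4,400 km
```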

As part of my next step of analysis, I am planning to carry out clustering based on the geographic locations within the state of California. To achieve this, I explored two clustering algorithms: K-Means and DBSCAN.

K-Means:
K-Means is a popular clustering algorithm that partitions a dataset into ‘K’ distinct clusters. It works by iteratively assigning data points to the nearest cluster center (centroid) and recalculating the centroids until convergence. K-Means is simple to implement and computationally efficient, making it widely used for various applications, such as image segmentation and customer segmentation.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
DBSCAN is a density-based clustering algorithm that groups data points based on their density within a neighborhood. It doesn’t require specifying the number of clusters beforehand, making it suitable for discovering clusters of arbitrary shapes. DBSCAN identifies core points, which have a sufficient number of data points in their neighborhood, and border points, which are near core points but don’t have enough neighbors to be considered core. Noise points don’t belong to any cluster. DBSCAN is robust to noise and capable of uncovering clusters of varying sizes, making it valuable for data with complex structures.
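A sketch of applying DBSCAN to latitude/longitude data is shown below; the eps and min_samples values are illustrative placeholders, not tuned settings.

```python
# DBSCAN on longitude/latitude points; the label -1 marks noise (illustrative parameters)
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_locations(coords, eps=0.3, min_samples=10):
    """coords: array-like of shape (n, 2) holding [longitude, latitude] rows."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(np.asarray(coords))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"{n_clusters} clusters found, {np.sum(labels == -1)} points labelled as noise")
    return labels
```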

October 18, 2023

In the subsequent stage of my analysis, I used geospatial data and related libraries to illustrate the occurrences of police shootings across the United States. Initially, I extracted the ‘latitude’ and ‘longitude’ attributes from the dataset and filtered out any null values in these columns. I proceeded to construct a geographical map of the United States using a designated shapefile, enhancing it with the addition of red markers that create a Geospatial Scatter Plot. The resulting visualization provides a clear and geographically accurate representation of where these incidents have occurred, offering valuable insights into their distribution across the country. By plotting these incidents on a map, it becomes evident where police shootings are concentrated, allowing for a better understanding of regional trends and patterns.
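A sketch of that plotting step with geopandas is shown below; the shapefile path and the marker styling are assumptions.

```python
# Plot shooting locations over a US states shapefile (sketch; path and styling are assumptions)
import geopandas as gpd
import matplotlib.pyplot as plt

def plot_shootings(df, shapefile_path="us_states.shp"):
    states = gpd.read_file(shapefile_path)
    locations = df[["longitude", "latitude"]].dropna()

    ax = states.plot(color="lightgrey", edgecolor="white", figsize=(12, 7))
    ax.scatter(locations["longitude"], locations["latitude"], color="red", s=5)
    ax.set_title("Fatal police shootings in the United States")
    plt.show()
```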

The scatter plot can be visualized for individual states of the United States to help policymakers, researchers, and the public gain insights into the geographic aspects of police shootings, potentially leading to more informed discussions and actions aimed at addressing this important issue. I created a similar Geospatial Scatter Plot for the state of Massachusetts as shown below.

As part of my next phase of analysis, I am planning to work on GeoHistograms along with a few of the clustering algorithms which Professor mentioned in the previous class. Specifically, our professor introduced two distinct clustering techniques: K-Means and DBSCAN. Notably, it was emphasized that the K-Means algorithm requires predefining the value of K, which can be a limitation. My objective is to implement both of these algorithms in Python and assess whether they yield meaningful clusters when applied to the geographic locations of the shooting data.

October 16, 2023

Today, my focus was on analyzing the distribution of the ‘age’ variable. From the below density plot of the ‘age’ variable, it is seen that the age variable, which represents the ages of 7,499 individuals (non-null values), exhibits a positive skew, implying that the majority of the population falls on the younger side with a right-tailed distribution. The average age is approximately 37.21 years, with a moderate level of variability around this mean, as indicated by a standard deviation of 12.98. The age range spans from 2 to 92 years, with the youngest individual being 2 years old and the oldest 92 years old. The kurtosis value of 0.234 suggests that the distribution is somewhat less peaked than a normal distribution, signifying a dispersion of ages rather than a tight clustering around the mean. Additionally, the median age, at 35 years, serves as the midpoint of the dataset.

From the box plot of the age variable displayed below, the presence of outliers beyond the upper whisker is clearly visible. In a box plot, the ‘whiskers’ typically represent the range within which most of the data falls. Anything beyond these whiskers is considered an outlier, which means it lies significantly outside the typical range of values.

In this specific case, the upper whisker of the box plot extends to a certain value, typically defined as 1.5 times the interquartile range (IQR) above the third quartile (Q3). Any data point beyond this threshold is considered an outlier. Outliers beyond the upper whisker in the ‘age’ variable indicate that there are individuals in the dataset whose ages are significantly higher than the upper range of ages within the ‘typical’ or ‘normal’ population.
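A sketch of the summary statistics, the two plots, and the 1.5 * IQR outlier threshold follows; the DataFrame and column names mirror the dataset but are assumptions.

```python
# Summary statistics, density/box plots, and the 1.5*IQR outlier threshold for age (sketch)
import matplotlib.pyplot as plt
import seaborn as sns

def describe_age(df):
    age = df["age"].dropna()
    print(age.describe())
    print("skewness:", age.skew(), "kurtosis:", age.kurtosis())

    sns.kdeplot(age)
    plt.title("Density of age")
    plt.show()
    sns.boxplot(x=age)
    plt.title("Box plot of age")
    plt.show()

    q1, q3 = age.quantile([0.25, 0.75])
    upper_whisker = q3 + 1.5 * (q3 - q1)
    print("Upper outlier threshold:", upper_whisker,
          "-", int((age > upper_whisker).sum()), "outliers above it")
```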

October 13, 2023

As I begin my analysis of the ‘fatal-police-shootings-data’ dataset in Python, I have loaded the data to examine the various variables and their distributions. In this dataset, there is a numerical column, ‘age,’ which represents the age of individuals who were fatally shot by the police. Additionally, there are latitude and longitude values, indicating the precise locations of these incidents.

I also noticed that there is an ‘id’ column that may not offer significant insights, so I am considering removing it from further analysis.

In my initial assessment, I have examined the dataset for missing values in the variables. It is observed that the following variables contain missing or null values: ‘name,’ ‘armed,’ ‘age,’ ‘gender,’ ‘race,’ ‘flee,’ ‘longitude,’ and ‘latitude’ as shown below.

Furthermore, I investigated the presence of duplicate records within the dataset. It is evident that there is only one duplicate record in the entire dataset, and this particular record does not have a ‘name’ value.
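These checks can be sketched in pandas as follows; the exact file name is a placeholder for the download from The Washington Post.

```python
# Load the dataset and check missing values and duplicates (sketch; file name is a placeholder)
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")
df = df.drop(columns=["id"])            # the id column adds little analytical value

print(df.isnull().sum())                # per-column count of missing values
print("duplicate rows:", df.duplicated().sum())
print(df[df.duplicated(keep=False)])    # inspect the duplicated record(s)
```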

For my next steps, I intend to focus on exploring the distribution of the ‘age’ variable for further investigation.

In today’s class, Professor provided an introduction to the computation of geospatial distances using location information. This knowledge will enable us to create GeoHistograms, a valuable tool for visualizing and analyzing data related to geographical locations. GeoHistograms can be particularly useful for identifying spatial trends, hotspots, and clusters within geographic datasets, thereby enhancing our understanding of the underlying phenomena.

October 11, 2023

Today, we began working with a new dataset sourced from ‘The Washington Post’ website. The dataset reveals a troubling statistic: in the United States, police shootings result in the deaths of over 1,000 individuals on average each year, according to an ongoing analysis conducted by The Washington Post.

A key turning point in this ongoing investigation was the tragic 2014 killing of Michael Brown, an unarmed Black man, by police in Ferguson, Missouri. The Post’s inquiry exposed a significant issue: the data reported to the FBI regarding fatal police shootings was significantly undercounted, with more than half of such incidents going unreported. This problem has only grown worse over time; by 2021, just a third of fatal shootings were reflected in the FBI’s database. The primary reason behind this underreporting is that local police departments are not obligated to report these incidents to the federal government. Furthermore, complications arise from an updated FBI system for data reporting and confusion among local law enforcement agencies regarding their reporting responsibilities.

In response, The Washington Post initiated its own comprehensive investigation in 2015 by meticulously documenting every instance in which an on-duty police officer in the United States shot and killed someone. Over the years, their reporters have compiled a substantial dataset, which now stands at 8,770 records. The dataset includes various variables, such as date, name, age, gender, whether the person was armed, their race, the city and state in which the incident occurred, whether they were attempting to flee, if body cameras were in use, signs of mental illness, and crucially, the involved police departments.

It’s worth noting that this dataset covers incidents from 2015 onward, and The Post has recently updated it in 2022 to include the names of the police agencies connected to each shooting, which provides a means to better assess accountability at the department level.

In today’s class, we delved into some initial questions about this dataset. Notably, we discovered that there are multiple versions of the dataset available. The one accessible on GitHub provides information on police shootings by agencies, but the version directly obtained from The Washington Post’s website includes a variable called ‘police_departments_involved.’ This means there’s no need for an external relationship to discover which police stations were involved in these shootings.

As the next step in my analysis, I plan to conduct a more detailed examination of the dataset and its variables to uncover further insights.

October 6, 2023

Based on the conclusions drawn from our analysis of the CDC dataset, we have drafted an initial version of the Project report following the Punchline format, as advised by our Professor. Upon careful examination of the format and in line with our Professor’s guidance, it has become apparent that two critical aspects should be diligently addressed when constructing our report.

For the benefit of senior managers and policy makers, the key focus should be on simplicity and clarity. This entails using plain language, employing concise visual aids, and featuring an executive summary to emphasize the most significant findings. The avoidance of technical jargon is essential, with the primary goal being the communication of actionable insights that can guide decision-making and policy development.

Conversely, when addressing technical line managers, an emphasis on transparency and comprehensive detailing becomes imperative. This entails providing a thorough account of our research methodology, statistical analyses, and data preprocessing. The inclusion of proper references, supplementary materials in appendices, and the execution of sensitivity analyses to test the study’s resilience under various conditions are vital steps. This approach ensures that technical experts can assess the validity of our study and replicate the work if necessary.

By adhering to these principles, we are confident that we are aligning ourselves with the expectations of a data scientist’s role as per industry standards. We have submitted our report for review, and we are incorporating the minor corrections suggested by our Professor and Grader.

October 4, 2023

In light of our current findings, we are in the process of preparing a succinct and impactful report summarizing the results of our study conducted on the CDC Diabetes dataset.

In our study on the CDC Diabetes dataset, we employed a diverse range of statistical techniques, including exploratory data analysis, correlation analysis, simple and multiple linear regression, the Breusch-Pagan test for constant variance assessment, introduction of interaction terms, polynomial regression for investigating higher-order relationships, and support vector regression. To ensure a rigorous and robust analysis, we also utilized cross-validation to evaluate model performance and generalization.

Our findings revealed intriguing insights into the predictive power of our models. Initially, when we introduced an interaction term to the simple linear model, the explained variance was 36.5%; it increased to 38.5% with a multiple quadratic regression diabetes prediction model that incorporated ‘ % INACTIVE’ and ‘ % OBESE.’ Furthermore, when we applied Support Vector Regression, the explanatory power dropped to 30.1%.

While ‘ % INACTIVE’ and ‘ % OBESE’ do contribute significantly to diabetes prediction, it is evident that they may not fully capture the intricate dynamics involved. This underscores the necessity for a more comprehensive analysis that encompasses a broader range of influencing factors. Thus, incorporating additional variables becomes crucial for a deeper understanding and a more holistic perspective on diabetes prediction.

October 2, 2023

In continuation with my previous analysis, I tried to implement the Support Vector Regression on the CDC dataset involving the quadratic and interaction terms. Support Vector Regression (SVR) is a type of machine learning algorithm used for regression tasks. It is a variation of Support Vector Machines (SVM), which are primarily used for classification tasks. SVR, like SVM, is particularly useful when dealing with high-dimensional data, and it excels in cases where traditional linear regression models may not perform well due to complex relationships between variables.

The SVR workflow begins by initializing the SVR model with specific parameters: the RBF (Radial Basis Function) kernel, the regularization parameter (C), and the tolerance for errors (epsilon).

  • The RBF (Radial Basis Function) kernel is a mathematical function used in Support Vector Regression (SVR) to capture non-linear relationships.
  • ‘C’ is the regularization parameter controlling the trade-off between fitting the training data and preventing overfitting.
  • ‘epsilon’ is the tolerance parameter specifying the margin within which errors are acceptable in SVR.

The features used to build the SVR model from the CDC dataset are ‘INACTIVE’, ‘OBESE’, their respective squared values (‘INACTIVE_sq’ and ‘OBESE_sq’), and an interaction term (‘OBESE*INACTIVE’).

A K-Fold cross-validator with 5 folds is created to split the data into training and testing sets for cross-validation.  A new SVR model is created, fitted to the training data, and used to make predictions on the testing data. As part of the training phase, the SVR model is fitted with the ‘RBF’ kernel to the training data, enabling the model to learn the underlying relationships between the input features and the target variable. Predictions are generated on the test data using the trained SVR model.

The model’s performance is assessed through the R-squared (R2) score, which is 0.30 for the SVR model. This is much lower than the R-squared from our quadratic model built earlier, suggesting that the SVR model may not be capturing the relationships in the data as effectively as the quadratic model.
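A sketch of this cross-validated SVR appears below; the feature names follow the description above, while the target column name and the C/epsilon defaults are assumptions.

```python
# 5-fold cross-validated SVR with an RBF kernel on the quadratic/interaction features (sketch)
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

def svr_cv_r2(df, c=1.0, epsilon=0.1):
    features = ["INACTIVE", "OBESE", "INACTIVE_sq", "OBESE_sq", "OBESE*INACTIVE"]
    X, y = df[features], df["DIABETIC"]          # target column name is an assumption

    model = SVR(kernel="rbf", C=c, epsilon=epsilon)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print("mean R2 across folds:", np.mean(scores))
    return scores
```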

September 29, 2023

As part of my project analysis, I performed the next step of 5-fold Cross-Validation on the CDC Diabetes dataset. I implemented this technique in Python using the sklearn library. I built a function that performs 5-fold Cross-Validation on the dataset for polynomials of up to degree 4 and records the average Mean-Squared Error made by the model on the test folds. When a graph is plotted with polynomial degree on the x-axis and average test MSE on the y-axis, the graph below is obtained. The test error reaches its minimum at the 2nd-degree polynomial and then gradually increases.

This is in contrast with the graph obtained previously where the training error reduced with model complexity. This discrepancy highlights the significance of Cross-Validation as a powerful tool for estimating a model’s performance on unseen data. It provides a more realistic assessment of how well the model will generalize to new data, taking into account both underfitting and overfitting. As part of my next step, I am planning to work on Support Vector Regression (SVR) to see if I can build a better model.
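The cross-validation over polynomial degrees can be sketched as follows; the predictor and target arrays are placeholders for the CDC variables used earlier.

```python
# Average 5-fold test MSE for polynomial degrees 1-4 (sketch; X, y are the CDC predictors/target)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def cv_mse_by_degree(X, y, max_degree=4):
    degrees = list(range(1, max_degree + 1))
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    avg_mse = []
    for d in degrees:
        pipe = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
        scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_mean_squared_error")
        avg_mse.append(-scores.mean())

    plt.plot(degrees, avg_mse, marker="o")
    plt.xlabel("Polynomial degree")
    plt.ylabel("Average test MSE")
    plt.show()
    return avg_mse
```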

In today’s class, Professor clarified a query which I had on the requirement of Monte-Carlo permutation test on this dataset. This question arose because the Monte-Carlo permutation method is typically used to test statistical significance between two groups which has highly non-normally distributed variables. However, Professor explained the relevance of this test in the context of the CDC Diabetes dataset.

The rationale behind this is that the Monte-Carlo permutation test offers a robust way to assess the statistical significance of observed results through random data sampling. This approach can be particularly useful when explaining and justifying the significance of results to stakeholders or in situations where traditional statistical tests like t-tests may not be applicable or straightforward to interpret.

September 27, 2023

In continuation of Monday’s class, today Professor focused on the implementation of K-fold Cross-Validation on the CDC Diabetes dataset. Polynomial regression models of increasing degree (from 1 to 10) were applied to the 354 data points that have all three variables. A plot of the training error (Mean-Squared Error, the average squared difference between the model’s predictions and the actual target values in the dataset) as a function of polynomial degree shows the error gradually decreasing as the degree increases. This observation might lead one to believe that higher-order polynomial models improve the model’s ability to generalize to new data. However, this could be a case of overfitting, where the model essentially memorizes the dataset and performs poorly on new, unseen data.

I attempted to build and apply higher degree polynomial models on the dataset in Python along with plotting a graph similar to that mentioned by the Professor. The resultant graph is shown below where the training error decreases with model complexity.

To overcome this issue, Professor explained the application of 5-fold Cross-Validation on this dataset, where the data is split into 5 approximately equal groups of 71, 71, 71, 71 and 70. While doing this, he suggested adding unique indexes to the observations to eliminate the issue of duplicate records in the 3 variables. When the average MSE on the test folds from the 5-fold CV technique is plotted against model complexity, the test error reaches its minimum at degree 2 and then gradually increases. This suggests that the training error tends to underestimate the test error, and hence a K-fold CV technique helps to obtain a more accurate assessment of the model’s performance. As the next step, I am planning to apply the 5-fold CV technique to the CDC diabetes dataset in Python and look at the results.

September 25, 2023

In order to predict the errors for the linear models, we apply the model predictions on the test data and calculate the MSE (Mean-Squared Error) for the model. The MSE is the average squared difference between the model’s predictions and the actual target values in the test dataset, providing a measure of how well the model fits the data and predicts outcomes. However, if the dataset is not large enough, then we may need to apply resampling methods such as Cross-Validation and Bootstrap. In today’s class, Professor showed videos of these CV techniques and explained how these are helpful to obtain information about the prediction error on the test set, standard deviation and bias of the model parameters.

  1. The Validation Set Approach:

The validation set approach is a straightforward method for estimating the test error of a statistical learning model. It begins by randomly splitting the dataset into two parts: a training set and a validation (or hold-out) set. The model is trained on the training set and then used to predict responses for the validation set. The error observed on the validation set, often measured by a metric such as the Mean Squared Error (MSE) for quantitative responses, serves as an estimate of the model’s test error rate.

While conceptually simple and easy to implement, the validation set approach presents two potential drawbacks:

    • It can yield highly variable test error estimates, since the estimate depends on exactly which observations end up in the training and validation sets.
    • It tends to overestimate the test error rate, because the model is trained on only a subset of the data and may therefore perform worse than a model fit on the full dataset.
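
As a concrete illustration of the validation set approach, here is a minimal sketch with scikit-learn, assuming X and y hold the predictors and target loaded as in the sketch further above.

```python
# Minimal sketch of the validation set approach (assumption: X and y are the
# predictor matrix and target loaded as in the earlier sketch).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Randomly split the data into a training set and a validation (hold-out) set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)            # fit on the training set
val_mse = mean_squared_error(y_val, model.predict(X_val))   # estimate of the test MSE
print(f"Validation set MSE: {val_mse:.3f}")

# Re-running with a different random_state typically gives a noticeably
# different MSE, illustrating the variability drawback noted above.
```
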
  2. LOOCV (Leave-One-Out Cross-Validation):

LOOCV (Leave-One-Out Cross-Validation) differs from the validation set approach by using only a single observation as the validation set while the remaining data points form the training set. The model is trained on n-1 observations, a prediction is made for the omitted observation, and this process is repeated for every observation; averaging the resulting errors gives an approximately unbiased but potentially highly variable estimate of the test error. Despite this, LOOCV can be impractical for large datasets due to its computational intensity, since it requires fitting the model n times.

  3. K-fold Cross-Validation (CV):

K-fold cross-validation (CV) is an alternative to LOOCV for estimating the test error in machine learning. In K-fold CV, the dataset is divided into k roughly equal-sized folds. Each fold is used as a validation set once while the rest of the data (the remaining k-1 folds) forms the training set. The mean squared error (MSE) is computed for each validation fold separately, resulting in k error estimates (MSE₁, MSE₂, …, MSEₖ). The overall k-fold CV estimate is obtained by averaging these individual values: CV(k) = (1/k) × (MSE₁ + MSE₂ + … + MSEₖ).

This process provides a more stable and less computationally intensive way to estimate the test error compared to LOOCV, especially for large datasets.
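
For reference, a minimal sketch of how LOOCV and 5-fold CV could be run with scikit-learn, again assuming X and y are already loaded; the comment spells out the averaging formula.

```python
# Minimal sketch comparing LOOCV and 5-fold CV (assumption: X and y as before).
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# LOOCV: n model fits, each leaving out a single observation
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring='neg_mean_squared_error')
loo_mse = -loo_scores.mean()

# 5-fold CV: each fold serves once as the validation set, and the k fold-level
# MSEs are averaged: CV(k) = (1/k) * (MSE1 + MSE2 + ... + MSEk)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kf,
                               scoring='neg_mean_squared_error')
kfold_mse = -kfold_scores.mean()

print(f"LOOCV MSE: {loo_mse:.3f}, 5-fold CV MSE: {kfold_mse:.3f}")
```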

September 22, 2023

In continuation of my project analysis, and taking as a baseline Professor’s explanation of the interaction term and of the quadratic model that includes second-order terms of the predictor variables in the regression equation, I applied these concepts to the combined CDC diabetes dataset. On adding the interaction term (% OBESE * % INACTIVE) to the regression equation, the accuracy of the model improved slightly to 0.365. The regression equation for the multiple linear model with the interaction term for this dataset is expressed as:

% DIABETIC = -10.0647 + 1.1534*% INACTIVE + 0.7430*% OBESE – 0.0496*% INACTIVE*% OBESE

Furthermore, I built a quadratic model for this dataset, and the accuracy improved a little further to around 0.385. The regression equation for this model is expressed as:

% DIABETIC = -11.5906 + 0.4779*% INACTIVE + 0.4779*% OBESE + 0.0197*% INACTIVE*% OBESE – 0.0494*% OBESE^2 – 0.0197*% INACTIVE^2
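
For reproducibility, here is a minimal sketch of how these two models could be fit with the statsmodels formula API; the DataFrame name and the renamed columns are assumptions, and the printed R-squared values should roughly match the accuracies reported above.

```python
# Minimal sketch with the statsmodels formula API (assumptions: `df` is the
# 354-row merged DataFrame; columns are renamed because the originals contain
# '%' and spaces, which formulas cannot parse directly).
import statsmodels.formula.api as smf

data = df.rename(columns={'% DIABETIC': 'diabetic',
                          '% OBESE': 'obese',
                          '% INACTIVE': 'inactive'})

# Multiple linear regression with an interaction term
inter_model = smf.ols('diabetic ~ inactive + obese + inactive:obese', data=data).fit()
print(inter_model.rsquared)   # reported above as ~0.365

# Quadratic model: interaction plus squared terms of both predictors
quad_model = smf.ols('diabetic ~ inactive + obese + inactive:obese '
                     '+ I(inactive**2) + I(obese**2)', data=data).fit()
print(quad_model.rsquared)    # reported above as ~0.385
print(quad_model.params)      # coefficients of the fitted equation
```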

In today’s class, Professor answered questions on the dataset, including how the collinearity analysis between the predictor variables depends on the purpose of the linear model – whether we are using the model for prediction or for explaining the correlation between the independent variables.

Additionally, Professor explained the concepts of Paired and Unpaired T-Tests and how they are useful for interpreting the statistical significance of a difference in means between 2 groups, while ANOVA (Analysis of Variance) is used for comparing the difference in means among multiple groups.

Paired T-Test:

A paired t-test, also known as a dependent t-test or matched-pairs t-test, is used to compare the means of two related groups or conditions. It’s called ‘paired’ because it involves paired data points. These paired data points represent measurements or observations taken on the same subjects or items under two different conditions or time points.

Unpaired T-Test:

An unpaired t-test, also known as an independent t-test, is used to compare the means of two independent groups or conditions. Unlike paired data, where each data point is related to another, unpaired data involves two distinct and unrelated groups or conditions.
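
A minimal sketch of both tests with scipy.stats, using illustrative synthetic data rather than the course dataset:

```python
# Minimal sketch of paired vs. unpaired t-tests with scipy, using illustrative
# synthetic data (not the course dataset).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=10.0, scale=2.0, size=30)          # same subjects, time 1
after = before + rng.normal(loc=0.5, scale=1.0, size=30)   # same subjects, time 2
group_a = rng.normal(loc=10.0, scale=2.0, size=40)         # two unrelated groups
group_b = rng.normal(loc=11.0, scale=2.0, size=45)

# Paired (dependent) t-test: the same subjects measured under two conditions
t_paired, p_paired = stats.ttest_rel(before, after)

# Unpaired (independent) t-test: two distinct, unrelated groups
t_unpaired, p_unpaired = stats.ttest_ind(group_a, group_b)

print(f"paired: t={t_paired:.2f}, p={p_paired:.4f}")
print(f"unpaired: t={t_unpaired:.2f}, p={p_unpaired:.4f}")
```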

September 20, 2023

In today’s class, Professor used the crab molt dataset, which has 2 variables, pre-molt and post-molt sizes, as an example. A linear model is built to predict the target variable, pre-molt size, from the predictor variable, post-molt size. When such a linear model is built, the regression line appears to lie close to the actual data points and to fit most of them. However, looking at the descriptive statistics and the histograms of these variables, it is seen that both are non-normally distributed and left-skewed, and the kurtosis of the 2 variables is also very high.

To compare the means of the 2 variables and see whether the difference is statistically significant, i.e., whether there is really a difference between the means of the two groups, we need to perform a t-test; if the p-value obtained from the t-test is less than 0.05, we can reject the null hypothesis that there is no difference between the means of the 2 variables. A t-test is a statistical method used to determine if there is a significant difference between the means of two groups or populations. It calculates a t-statistic by comparing the difference in means to the variability in the data, and the resulting p-value indicates the probability of observing a difference this large by chance if there were no real difference, with a lower p-value suggesting stronger evidence of a real difference.

Since the variables are highly non-normally distributed, the assumptions of the t-test are violated, which makes its p-value an unreliable basis for hypothesis testing. For these reasons, a Monte-Carlo permutation test is carried out to obtain a reliable estimate of the p-value. This procedure involves creating a pool of 944 observations, combining the 472 pre-molt and 472 post-molt observations. The pooled data is then repeatedly and randomly re-split into two groups, and the difference in means between the two groups is recorded and plotted; the resulting distribution of mean differences is approximately normal. The p-value is calculated as p = n/N, where n is the number of random splits whose difference in means is at least as extreme as the observed difference between the actual pre-molt and post-molt groups, and N is the total number of random splits drawn from the pool. For the 10,000,000 random samplings created from the pool of crab data, the estimated p-value is 0. Hence, we can conclude that there is a statistically significant difference in means between the two groups and reject the null hypothesis.
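
A minimal sketch of this permutation procedure in Python, assuming arrays premolt and postmolt each hold the 472 observations; the resample count is reduced here for practicality.

```python
# Minimal sketch of the permutation procedure (assumption: NumPy arrays
# `premolt` and `postmolt`, each holding the 472 observations; the number of
# resamples is reduced here for practicality).
import numpy as np

rng = np.random.default_rng(0)
observed_diff = abs(postmolt.mean() - premolt.mean())

pool = np.concatenate([premolt, postmolt])   # pool of 944 observations
n_resamples = 100_000                        # the class example used 10,000,000
count_extreme = 0

for _ in range(n_resamples):
    rng.shuffle(pool)                        # random re-split of the pooled data
    diff = abs(pool[:472].mean() - pool[472:].mean())
    if diff >= observed_diff:                # at least as extreme as observed
        count_extreme += 1

p_value = count_extreme / n_resamples        # p = n / N
print(f"Permutation p-value: {p_value}")
```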

September 18, 2023

The Linear Regression model that I previously built yielded a very low R-squared value and displayed violations of the Homoscedasticity assumption in the residuals. As a result, I am considering building a Multiple Linear Regression model, with ‘% DIABETIC’ as the target variable and ‘% OBESE’ and ‘% INACTIVE’ as predictor variables. To do this, I merged the three metrics in the dataset based on the FIPS code, resulting in a dataset with only 354 observations.

During today’s class, Professor provided valuable insights into Multiple Linear Regression, which is a type of linear regression model that involves more than one predictor variable. In Multiple Linear Regression, the goal is to find the best-fit plane using the Ordinary Least Squares (OLS) method, and a considerable improvement in the R-squared value is observed with this model. The equation for Multiple Linear Regression can be expressed as follows:

y = β₀ + β₁X₁ + β₂X₂ + ε

Additionally, Professor discussed the possibility of adding an interaction term to the Multiple Linear Regression model. The equation for this model with an interaction term is:

y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + ε

Here, X₁X₂ represents the interaction term, which captures the combined effect of the two predictor variables. Interestingly, the inclusion of the interaction term resulted in a comparatively larger improvement in the R-squared value.

Following this, Professor introduced the concept of building a Generalized Linear Model (GLM), which here is a quadratic model that includes second-order terms of the predictor variables in the regression equation. The equation for this model is described as follows:

y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + β₄X₁² + β₅X₂² + ε
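
As a side note, scikit-learn’s PolynomialFeatures with degree 2 generates exactly the columns of this equation for two predictors; the sketch below uses illustrative random data rather than the project dataset.

```python
# Minimal sketch (illustrative random data): PolynomialFeatures(degree=2)
# builds exactly the columns of the equation above for two predictors.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))       # 50 illustrative rows of [X1, X2]
y = rng.normal(size=50)            # illustrative response

poly = PolynomialFeatures(degree=2)
X_quad = poly.fit_transform(X)
print(poly.get_feature_names_out(['X1', 'X2']))
# -> ['1' 'X1' 'X2' 'X1^2' 'X1 X2' 'X2^2']

# The '1' column supplies the intercept, so fit_intercept is turned off;
# the coefficients correspond to (β₀, β₁, β₂, β₄, β₃, β₅) in the column order above.
model = LinearRegression(fit_intercept=False).fit(X_quad, y)
print(model.coef_)
```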

However, Professor advised against using higher-degree polynomials, as they can lead to overfitting. Overfitting occurs when the model memorizes the dataset but struggles to generalize to new data.

September 15, 2023

Based on the Simple Linear Regression model I built earlier, I created a Q-Q plot to assess the normality of the residuals. The residuals obtained from the simple linear model are not normally distributed. Additionally, I built a scatter plot with the fitted values on the x-axis and the residuals on the y-axis. The residuals fan out in a funnel shape, which indicates that the linear model is heteroscedastic and violates one of the critical assumptions of Linear Regression, namely Homoscedasticity. The Breusch-Pagan test yielded a Chi-squared statistic of 52.846460747754506 with a corresponding p-value of 3.6066798910958464e-13. Since the p-value is less than the conventional significance level of 0.05, it suggests that this model exhibits heteroscedasticity.
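
A minimal sketch of these diagnostics, assuming model is the fitted statsmodels OLS results object for the simple linear regression:

```python
# Minimal sketch of these diagnostics (assumption: `model` is the fitted
# statsmodels OLS results object for the simple linear regression).
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

residuals = model.resid
fitted = model.fittedvalues

# Q-Q plot to check normality of the residuals
sm.qqplot(residuals, line='45', fit=True)
plt.title('Q-Q plot of residuals')

# Residuals vs. fitted values: a funnel shape suggests heteroscedasticity
plt.figure()
plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')

# Breusch-Pagan test: a small p-value indicates heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, model.model.exog)
print(f"Chi-squared statistic: {lm_stat:.3f}, p-value: {lm_pvalue:.3e}")
plt.show()
```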

In today’s class, Professor answered several questions from my classmates on the dataset, which gave me a deeper understanding of the next steps of the analysis. Professor mentioned that for this dataset, non-linear models can be implemented, which might lead to a higher R-squared value. To my question about his recommendation on applying transformations (log and exponential) to the variables, Professor gave an example that for datasets with a highly skewed distribution, a log transformation might help make them more normally distributed. But since our dataset is almost normally distributed, he suggested not applying transformations to the variables for this dataset.

Furthermore, Professor explained the structure of the Punchline reports that we need to follow for our project report, which was really helpful. He also mentioned that next week he will discuss collinearity between the predictor variables (which can make it difficult to distinguish their unique contributions to the dependent variable) and how it might negatively impact the linear model’s performance for our dataset.

September 13, 2023

Following my previous observations on the data, as a first step I applied Simple Linear Regression to the combined dataset, which has the ‘% DIABETIC’ and ‘% INACTIVE’ metrics for US counties.

Before applying the regression, I checked the correlation between the 2 variables; they have a correlation coefficient of 0.441706. Additionally, I created a scatter plot with ‘% INACTIVE’ as the x-variable and ‘% DIABETIC’ as the y-variable. The scatter plot does not show a clearly linear pattern, suggesting that any linear relationship between the 2 variables is at best moderate.

A Simple Linear Regression model is then built and fitted on the dataset with ‘% INACTIVE’ as the predictor variable and ‘% DIABETIC’ as the target variable. Upon analyzing the model summary, I found that both the intercept and the coefficient for ‘% INACTIVE’ are statistically significant (with p-values less than 0.05), but the R-squared value is only around 0.195. This means that roughly 20% of the variation in the target variable (% DIABETIC) is explained by the independent variable (% INACTIVE).

So, the Regression equation for this Linear model can be written as:

% DIABETIC = 3.7731 + 0.2331 * % INACTIVE (y = β₀ + β₁x)
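
For reference, a minimal sketch of this fit with statsmodels; the DataFrame name is an assumption, and the printed summary should roughly reproduce the values above.

```python
# Minimal sketch of the fit (assumption: `data` is the 1370-row DataFrame with
# the '% DIABETIC' and '% INACTIVE' columns).
import statsmodels.api as sm

print(data['% INACTIVE'].corr(data['% DIABETIC']))  # reported above: ~0.4417

X = sm.add_constant(data['% INACTIVE'])   # adds the intercept column for β₀
y = data['% DIABETIC']

model = sm.OLS(y, X).fit()
print(model.summary())  # reported above: intercept 3.7731, slope 0.2331, R² ≈ 0.195
```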

As a next step, I am planning to test the residuals obtained from the linear model, to see if they satisfy the assumptions of Simple Linear Regression.

In today’s class, I understood the concept of Hypothesis Testing (Null and Alternate Hypothesis) along with how the p-value helps to assess the strength of evidence against the Null Hypothesis. Subsequently, I learned about the Breusch-Pagan test for heteroscedasticity and how a Chi-squared distribution with 1 degree of freedom can be used to test whether the residuals obtained from the original Linear Regression model exhibit constant variance. The test involves fitting an auxiliary linear regression of the squared residuals on the predictor variable(s) used in the original linear model. A Chi-squared statistic (n*r^2) is calculated, where ‘n’ is the number of observations and ‘r^2’ is the R-squared value obtained from this auxiliary regression. This statistic helps to determine whether the residual variance is constant. By comparing this Chi-squared statistic to the Chi-squared distribution with 1 degree of freedom (one predictor), we can calculate a p-value. If the p-value is less than 0.05, it suggests that the model exhibits heteroscedasticity.
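
To make the mechanics concrete, here is a minimal sketch that computes the n*r^2 statistic by hand, reusing model, X and y from the fitting sketch above:

```python
# Minimal sketch of the n*r^2 mechanics (reusing `model`, `X` and `y` from the
# fitting sketch above).
import statsmodels.api as sm
from scipy import stats

squared_resid = model.resid ** 2

# Auxiliary regression: squared residuals on the original predictor (plus constant)
aux_model = sm.OLS(squared_resid, X).fit()

n = len(y)
bp_stat = n * aux_model.rsquared          # Chi-squared statistic n * r^2
p_value = stats.chi2.sf(bp_stat, df=1)    # 1 degree of freedom (one predictor)

print(f"Breusch-Pagan statistic: {bp_stat:.3f}, p-value: {p_value:.3e}")
```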

September 11, 2023

According to my analysis, the ‘cdc-diabetes-2018’ dataset has 3 separate tabs of data for 3 different metrics – % DIABETIC, % OBESE and % INACTIVE for different counties across the United States for the year 2018. These metrics can be related based on the unique FIPS (Federal Information Processing Standards) codes. However, there is a column name inconsistency in the ‘Inactivity’ tab, where the column is labeled as ‘FIPDS’ instead of ‘FIPS’. This inconsistency needs to be corrected to enable proper data integration based on this common unique column.

Upon merging the data, it’s observed that only 354 observations have all three variables present. This limited sample size might not be ideal for building a robust model with % DIABETIC as the target variable and % OBESE and % INACTIVE as predictor variables. It could benefit from techniques like bootstrapping and cross-validation to address potential issues related to the small sample size. For now, as the first step I am planning on a simpler approach which involves creating a Simple Linear Regression model with % DIABETIC as the target variable and % INACTIVE as the single predictor variable. This combination results in 1370 observations, which is a more reasonable sample size for building such a model.
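
A minimal sketch of this merge in pandas; the workbook file name and sheet names are assumptions and may need adjusting:

```python
# Minimal sketch of the merge (assumptions: the workbook file name and the
# sheet names; adjust them to the actual 'cdc-diabetes-2018' file).
import pandas as pd

diabetes = pd.read_excel('cdc-diabetes-2018.xlsx', sheet_name='Diabetes')
obesity = pd.read_excel('cdc-diabetes-2018.xlsx', sheet_name='Obesity')
inactivity = pd.read_excel('cdc-diabetes-2018.xlsx', sheet_name='Inactivity')

# Fix the column-name inconsistency in the Inactivity tab
inactivity = inactivity.rename(columns={'FIPDS': 'FIPS'})

# An inner join on FIPS keeps only counties present in all three tabs (354 rows)
merged = diabetes.merge(obesity, on='FIPS').merge(inactivity, on='FIPS')
print(merged.shape)

# A two-way merge of the Diabetes and Inactivity tabs alone yields 1370 rows
two_way = diabetes.merge(inactivity, on='FIPS')
print(two_way.shape)
```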

Histograms of the % DIABETIC and % INACTIVE variables show that % DIABETIC is slightly right-skewed with high kurtosis, indicating a heavy-tailed distribution, while % INACTIVE is slightly left-skewed with lower kurtosis, suggesting a less heavy-tailed distribution. Boxplots of the two variables show that % DIABETIC has outliers beyond both the upper and lower whiskers, while % INACTIVE has outliers beyond the lower whisker only.
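
A minimal sketch of these distribution checks, assuming the two-variable merged DataFrame (two_way) from the merge sketch above:

```python
# Minimal sketch of the distribution checks (assumption: `two_way` is the
# two-variable merged DataFrame from the sketch above).
import matplotlib.pyplot as plt

cols = ['% DIABETIC', '% INACTIVE']

print(two_way[cols].skew())       # sign indicates the direction of skew
print(two_way[cols].kurtosis())   # pandas reports excess kurtosis (normal = 0)

two_way[cols].hist(bins=30)
two_way[cols].plot(kind='box', subplots=True, layout=(1, 2), figsize=(8, 4))
plt.show()
```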

In today’s class, Professor’s explanation of Simple Linear Regression was very clear. I understood the concept of the Ordinary Least Squares method, which finds the optimal slope and intercept parameters for a regression model by minimizing the Residual Sum of Squares (RSS), where RSS is the sum of squared differences between the actual and predicted values.

We also acquired knowledge on other important statistical concepts such as:

  1. Kurtosis measures the heaviness of a distribution’s tails relative to a normal distribution, with a value of 3 corresponding to a normal distribution.
  2. Skewness indicates the asymmetry of a distribution.
  3. A Q-Q plot helps assess the normality of the residuals, which is a critical assumption of linear models.

Referring to Section 3.1.3 of the textbook, which describes methods of assessing the accuracy of the model: for Simple Linear Regression, R-squared is defined as the square of Pearson’s correlation coefficient.

Another critical assumption of linear models is Homoscedasticity, which means that the variance of errors is constant across the range of predicted values. Violations of Homoscedasticity result in Heteroscedasticity, which can undermine the reliability of the linear model.