In today’s class, the professor used the crab molt dataset, which has two variables, pre-molt size and post-molt size, as an example. The goal is to build a linear model that predicts the target variable, pre-molt size, from the predictor variable, post-molt size. When such a model is fitted, the regression line lies close to most of the actual data points. However, the descriptive statistics and histograms show that both variables are non-normally distributed and left-skewed. The kurtosis of both variables is also very high.
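As a rough illustration of the fit and the shape statistics mentioned above, here is a minimal sketch using only the Python standard library. The data are synthetic stand-ins, not the real crab measurements, and the function names (`fit_line`, `skewness`, `excess_kurtosis`) as well as the numbers used to generate the data are made up for this example.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def skewness(values):
    """Third standardized moment; a negative value means left-skewed."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3.0


# Synthetic, left-skewed stand-in data (NOT the real crab measurements).
# The constants below are arbitrary choices for illustration only.
rng = random.Random(0)
post_molt = [160.0 - rng.expovariate(1 / 12.0) for _ in range(472)]
pre_molt = [0.9 * p + rng.gauss(0.0, 2.0) for p in post_molt]

slope, intercept = fit_line(post_molt, pre_molt)
print(f"pre_molt ~ {intercept:.2f} + {slope:.3f} * post_molt")
print(f"skewness of post_molt: {skewness(post_molt):.2f}")
print(f"excess kurtosis of post_molt: {excess_kurtosis(post_molt):.2f}")
```

On the real dataset, the same three functions would be applied to the actual pre-molt and post-molt columns instead of the generated lists.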
To check whether the means of the two variables really differ, i.e., whether the difference is statistically significant, we can perform a t-test. A t-test is a statistical method used to determine whether there is a significant difference between the means of two groups or populations. It calculates a t-statistic by comparing the difference in means to the variability in the data. The resulting p-value is the probability of seeing a difference at least this large if the null hypothesis, which says that there is no difference between the means of the two variables, were true, so a lower p-value is stronger evidence of a real difference. If the p-value is less than 0.05, we reject the null hypothesis.
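To make the "difference in means relative to variability" idea concrete, here is a small sketch of the t-statistic for two independent groups. It uses Welch's unequal-variance form, which is an assumption on my part since the notes do not say which variant of the t-test was used in class; `welch_t` is a name made up for this example.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test (an assumed variant)."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Squared standard error contributed by each group.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    # Difference in means scaled by its standard error.
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Tiny made-up groups, only to show the call.
t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t, df)
```

The p-value is then read from a t distribution with `df` degrees of freedom. The standard library has no t distribution, so in practice one would typically call something like `scipy.stats.ttest_ind(a, b, equal_var=False)` to get the statistic and the p-value together.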
However, since both variables are highly non-normal, the assumptions of the t-test are violated, which makes its p-value unreliable for hypothesis testing. For this reason, a Monte Carlo permutation test is carried out to obtain a more reliable estimate of the p-value. The procedure pools all 944 observations (472 pre-molt and 472 post-molt). The pooled data are then repeatedly split at random into two groups, and the difference in means between the two groups is recorded and plotted for each random split. The resulting distribution of differences is approximately normal. The p-value is calculated as p = n/N, where n is the number of random splits whose difference in means is at least as extreme as the difference observed in the original data, and N is the total number of random splits. For the 10,000,000 random splits created from the pooled crab data, the p-value is 0, meaning that none of them produced a difference as large as the observed one. Hence, we can reject the null hypothesis and conclude that there is a statistically significant difference in means between the two groups.
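The permutation procedure described above can be sketched in a few lines of standard-library Python. This is an illustrative version under my own assumptions, not the code used in class: the function name `permutation_test`, the two-sided comparison, the fixed seed, and the toy data are all made up for the example, and the number of resamples is kept far below 10,000,000 so that it runs quickly.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Monte Carlo permutation test for a difference in means.

    Returns (observed_difference, p_value) with p = n / N, where n counts
    the random splits at least as extreme as the observed difference.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the pool and split it into two groups of the original sizes.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        # Two-sided test: compare absolute differences.
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples


# Toy data only (NOT the crab measurements): two clearly separated groups.
group_a = [100 + i for i in range(20)]
group_b = [i for i in range(20)]
observed, p_value = permutation_test(group_a, group_b, n_resamples=2_000)
print(observed, p_value)
```

When no random split is as extreme as the observed difference, the estimate comes out as exactly 0. Since only a finite number of splits was tried, this is best read as "p is smaller than 1/N" rather than as a true probability of zero.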