The central limit theorem states that, when the sample size is large, the distribution of the sample means is approximately normal, regardless of the population distribution. We will use R to simulate this.
We will start by simulating an exponential distribution (example below).

Set these basic parameters for the population distribution.
set.seed(1234) # for reproducibility
n <- 40 # large sample size
lambda <- 0.2 # determines the shape
Build a for-loop which takes the mean of a sample of 40 values drawn from the exponentially distributed population using the function rexp(), and repeats this 1,000 times. On each iteration, append the sample mean to a vector called clt_exp (which has to be declared as an empty vector before the for-loop).
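One way to write the loop described above (a sketch; the parameters from the previous step are repeated so the block runs on its own):

```r
set.seed(1234)  # for reproducibility
n <- 40         # sample size
lambda <- 0.2   # rate parameter of the exponential distribution

clt_exp <- c()                      # empty vector to collect the sample means
for (i in 1:1000) {
  smp <- rexp(n, rate = lambda)     # one sample of 40 exponential values
  clt_exp <- c(clt_exp, mean(smp))  # append this sample's mean
}
```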
These are the properties of the sampling distribution of the mean for an exponential distribution: the mean is $\mu=1/\lambda$ and the standard error is $\sigma_{\bar{x}}=(1/\lambda)/\sqrt n$ (the population standard deviation $1/\lambda$ divided by $\sqrt n$). Plot a graph showing the frequency distribution of the sample means, add a vertical line at the mean of the sample means, and add another vertical line at the population mean. Are they similar?
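A minimal sketch of such a plot, assuming clt_exp holds the 1,000 sample means (regenerated here with replicate() so the block is self-contained):

```r
set.seed(1234)
lambda <- 0.2
clt_exp <- replicate(1000, mean(rexp(40, rate = lambda)))

hist(clt_exp, main = "Distribution of sample means", xlab = "Sample mean")
abline(v = mean(clt_exp), col = "red", lwd = 2)         # mean of the sample means
abline(v = 1 / lambda, col = "blue", lwd = 2, lty = 2)  # population mean, 1/lambda = 5
```

The two vertical lines should sit almost on top of each other.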
Adjust the bin size to visualise the distribution better (for example, the breaks argument of hist() controls the number of bins).
Compute the 95% confidence interval for the sample means. What does it imply?
How does the variance of the sample means compare with the population variance?
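One reading of these two questions, sketched in code (clt_exp is regenerated so the block stands alone; the normal-approximation interval is one of several reasonable choices):

```r
set.seed(1234)
n <- 40
lambda <- 0.2
clt_exp <- replicate(1000, mean(rexp(n, rate = lambda)))

# 95% interval of the sample means: empirical quantiles,
# or a normal approximation (mean +/- 1.96 * sd)
quantile(clt_exp, c(0.025, 0.975))
mean(clt_exp) + c(-1, 1) * qnorm(0.975) * sd(clt_exp)

# Variance of the sample means vs the theoretical value sigma^2 / n
var(clt_exp)       # observed variance of the 1,000 sample means
(1 / lambda)^2 / n # theoretical: population variance divided by n
```

The variance of the sample means should be close to the population variance divided by the sample size, not to the population variance itself.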
You might want to try the same simulation with other types of distributions. You already know rt() for the t-distribution. There are others: runif() for the uniform distribution, rpois() for the Poisson distribution, and rgamma() for the gamma distribution.
<aside> 💡 You will learn these other distributions in Professor Hector’s part later in this module. You can always come back to this practical later.
</aside>
Non-parametric tests are often preferred when dealing with real-life biological data, which is most likely not normally distributed.
Researchers predict that carotenoids may affect immune system function in birds. To test this, a group of zebra finches was randomly divided into two groups, one receiving supplemental carotenoids in their diet and the other not. Two variables were measured: cell-mediated immunocompetence (PHA) and humoral immunity (SRBC).
Subset the data frame into the two treatment groups, e.g. df[df$treatment == "CAROT", ] and df[df$treatment == "NO", ]. Using var.test(), compare the two groups' variances of PHA, and then of SRBC. Data transformation can be a useful method to correct for violations of normality, and log transformation is particularly useful when the data span several orders of magnitude.
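A sketch of this comparison; df, treatment, PHA and SRBC come from the exercise, but the values below are invented purely so the block runs:

```r
# Dummy stand-in for the real data frame (invented values, illustration only)
set.seed(1)
df <- data.frame(
  treatment = rep(c("CAROT", "NO"), each = 10),
  PHA  = rnorm(20, mean = 1, sd = 0.3),
  SRBC = rnorm(20, mean = 5, sd = 1)
)

carot <- df[df$treatment == "CAROT", ]
no    <- df[df$treatment == "NO", ]

var.test(carot$PHA, no$PHA)    # F test of equal variances for PHA
var.test(carot$SRBC, no$SRBC)  # the same for SRBC
```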
Wolbachia is an endosymbiont that lives in mosquitoes and may confer immunity against viruses such as those responsible for dengue fever in humans. To examine its effects on dengue transmission, researchers infected two groups of mosquitoes with dengue: one of the WB1 type, which harbours the endosymbiont, and one of the wild type. Viral titers were measured after 14 days.
A permutation test can be a suitable alternative when the data set violates the normality assumption. There is no standard R function for this, as the details vary from case to case. Below is a procedure using a Monte Carlo p-value with correction, proposed by Davison & Hinkley (1997).
This is the dummy data set. What do you observe from just the code below?
data <- list(experiment = c(27,20,21,26,27,31,24,21,20,19,23,24,28,19,24,29,18,20,17,31,20,25,28,21,27),
             control = c(21,22,15,12,21,16,19,15,22,24,19,23,13,22,20,24,18,20))
Does this data set violate the normality assumption? Test the hypothesis with a two-sample t-test.
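One way to check normality (Shapiro-Wilk is one common choice) and run the t-test, using the dummy data set given above:

```r
data <- list(
  experiment = c(27, 20, 21, 26, 27, 31, 24, 21, 20, 19, 23, 24, 28,
                 19, 24, 29, 18, 20, 17, 31, 20, 25, 28, 21, 27),
  control    = c(21, 22, 15, 12, 21, 16, 19, 15, 22, 24, 19, 23, 13,
                 22, 20, 24, 18, 20)
)

shapiro.test(data$experiment)  # Shapiro-Wilk normality test, one group at a time
shapiro.test(data$control)

t.test(data$experiment, data$control)  # Welch two-sample t-test
```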
Conduct the permutation test. While the code has been given to you, try to understand what each line does. Most of the labels should be self-explanatory.
nrep <- 1000
all_data <- c(data$experiment, data$control)
permutations <- replicate(nrep, sample(1:length(all_data), length(data$experiment)))
result <- apply(permutations, 2, function(permutation){
mean_expt <- mean(all_data[permutation])
mean_ctrl <- mean(all_data[-permutation])
test_statistic <- mean_expt - mean_ctrl
})
n <- length(result)
# r = number of replications at least as extreme as observed event
obs <- mean(data$experiment) - mean(data$control)
r <- sum(abs(result) >= abs(obs)) # two-sided: compare magnitudes on both sides
# compute Monte Carlo p-value with correction (Davison & Hinkley, 1997)
mc.p.value <- (r + 1) / (n + 1)
hist(result, breaks = 50, prob = TRUE, main = "",
sub = paste0("MC p-value for H0: ", mc.p.value),
xlab = paste("found", r, "as extreme effects for", n, "replications"))
abline(v = obs, lty=2, col="red")
Re-run the entire code in 4c. Each time you will see a different null distribution.