The central limit theorem states that, when the sample size is large, the distribution of the sample means will be approximately normal regardless of the shape of the population distribution. We will use R to simulate this.

  1. We will start by simulating an exponential distribution (example below).

    1. Set these basic parameters for the population distribution.

      set.seed(1234) # for reproducibility
      n <- 40        # sample size for each draw
      lambda <- 0.2  # rate parameter; determines the shape
      
    2. Build a for-loop that takes the mean of a sample of 40 values drawn from the exponentially distributed population using rexp(), and repeats this 1,000 times. On each iteration, append the sample mean to a vector called clt_exp, which has to be declared as an empty vector before the for-loop (one possible approach is sketched in the first block after this list).

    3. An exponential distribution with rate $\lambda$ has mean $\mu=1/\lambda$ and standard deviation $1/\lambda$, so the standard error of the sample mean is $(1/\lambda)/\sqrt{n}$. Plot a graph showing the frequency distribution of the sample means, add a vertical line at the mean of the sample means, and add another vertical line at the population mean. Are they similar?

      • Answer for steps 2 and 3:
    4. Adjust the bin size to visualise the distribution better (for example, the breaks argument in hist() controls the number of bins).

    5. Compute the 95% confidence interval for the sample means (see the second sketch after this list). What does it imply?

    6. How does the variance of the sample means compare with the population variance?

  2. You might want to try the same simulation with other types of distributions. You already know rt() for the t-distribution; there are others: runif() for the uniform distribution, rpois() for the Poisson distribution, and rgamma() for the gamma distribution.

    <aside> 💡 You will learn these other distributions in Professor Hector’s part later in this module. You can always come back to this practical later.

    </aside>
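
For steps 2–4 above, a minimal sketch is given below. It assumes the parameters set in step 1 (n and lambda) and uses only base R functions (rexp(), hist(), abline()); treat it as one possible approach rather than the definitive answer.

    clt_exp <- c()   # empty vector to collect the sample means
    for (i in 1:1000) {
      clt_exp <- c(clt_exp, mean(rexp(n, lambda)))  # mean of 40 exponential draws
    }

    # frequency distribution of the sample means; increase breaks for finer bins
    hist(clt_exp, breaks = 30, xlab = "Sample mean", main = "")
    abline(v = mean(clt_exp), col = "red", lwd = 2)         # mean of the sample means
    abline(v = 1 / lambda, col = "blue", lwd = 2, lty = 2)  # population mean, 1/lambda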
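
For steps 5–6, one way to obtain a 95% confidence interval and to compare the variances is sketched below, assuming clt_exp from the loop above; t.test() is used for the interval, but a normal-quantile interval would also be reasonable.

    # 95% confidence interval for the mean of the sample means
    t.test(clt_exp)$conf.int

    # variance of the sample means vs the theoretical variance of the mean,
    # sigma^2 / n = (1/lambda)^2 / n
    var(clt_exp)
    (1 / lambda)^2 / n

    # the population variance itself, (1/lambda)^2
    (1 / lambda)^2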

Non-parametric tests are often preferred for real-life biological data, which are frequently not normally distributed.

  1. Researchers predict that carotenoids in birds may affect immune system function. To test this, a group of zebra finches was randomly divided into two groups, one receiving supplemental carotenoids in their diet and the other not. Two variables were measured: cell-mediated immunocompetence (PHA) and humoral immunity (SRBC).

    P4Q2

    1. Separate the dataset into two data frames, one for each group, using the conditional subsets df[df$treatment == "CAROT", ] and df[df$treatment == "NO", ] (a sketch is given after this list).
    2. Using var.test(), compare the variances of PHA between the two groups, and then do the same for SRBC.
    3. Using either a data transformation or a non-parametric test, determine whether PHA differs on average between the birds that received supplemental carotenoids and those that did not (see the second sketch after this list).
    4. What about SRBC?
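
For steps 1–2, a possible sketch is shown below. It assumes the data set has been read into a data frame called df with a treatment column (as in the question) and columns named PHA and SRBC; check the actual column names with names(df) first.

    # split the data frame into the two treatment groups
    carot <- df[df$treatment == "CAROT", ]
    no_carot <- df[df$treatment == "NO", ]

    # F-tests comparing the variances of the two groups
    var.test(carot$PHA, no_carot$PHA)
    var.test(carot$SRBC, no_carot$SRBC)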
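
For steps 3–4, if the t-test assumptions cannot be met even after transformation, a non-parametric alternative such as the Wilcoxon rank-sum (Mann–Whitney) test is one option; a sketch under the same column-name assumptions:

    # non-parametric comparison of the two groups
    wilcox.test(PHA ~ treatment, data = df)
    wilcox.test(SRBC ~ treatment, data = df)

    # or, after a suitable transformation, an ordinary two-sample t-test, e.g.
    # t.test(sqrt(PHA) ~ treatment, data = df)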

Data transformation can be a useful way to correct for violations of normality, and the log transformation is particularly useful when the data span several orders of magnitude.

  1. Wolbachia is an endosymbiont that lives in mosquitoes and may confer immunity against viruses such as the one responsible for dengue fever in humans. To examine its effect on the transmission of dengue, researchers infected two groups of mosquitoes with dengue virus, one of the WB1 type, which harbours the endosymbiont, and one of the wild type. Viral titres were measured after 14 days.

    P6Q3

    1. Plot the data with an appropriate type of chart. What trend is suggested? Why should a log transformation be considered?
    2. Apply a log transformation $Y'=\ln Y$ to the measurements. What happens?
    3. Now apply the log transformation $Y'=\ln(Y+1)$ instead. Why does it work this time? In what situations do you think adding a constant $a$, so that $Y'=\ln(Y+a)$, is necessary?
    4. Using the transformed data from step 3, calculate a 95% confidence interval for the difference in mean viral titres (a sketch is given after this list). Treat a difference between strain means of 1 or less on the log scale as a small effect, and anything larger as a large effect. Based on the confidence interval, is this difference small, large, or uncertain?
    5. Convert the lower and upper limits of this confidence interval back to the original scale. What does this imply?
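
A sketch of steps 2–5 is given below, assuming the measurements are in a data frame called dengue with a strain column (WB1 or wild type) and a titre column; these names are assumptions, so adjust them to your own data. Welch's t.test() on the transformed values supplies the confidence interval.

    # log(Y) fails when a titre is exactly zero (log(0) is -Inf), hence the + 1
    dengue$log_titre <- log(dengue$titre + 1)

    # 95% CI for the difference in mean log titres between the two strains
    ci <- t.test(log_titre ~ strain, data = dengue)$conf.int
    ci

    # back-transformed limits: on the original scale they are ratios, not differences
    exp(ci)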

A permutation test can be a suitable alternative when a data set violates the normality assumption. There is no standard R function for this, as the details vary from case to case. Below is a procedure proposed by Davison & Hinkley (1997) using a Monte Carlo p-value with correction.

  1. We will generate a dummy data set and compare the two means.
    1. This is the dummy data set. What do you observe from just the code below?

      data <- list(experiment = c(27,20,21,26,27,31,24,21,20,19,23,24,28,19,24,29,18,20,17,31,20,25,28,21,27),
                   control    = c(21,22,15,12,21,16,19,15,22,24,19,23,13,22,20,24,18,20))
      
    2. Does this data set violate the normality assumption? Test the hypothesis of equal means with a two-sample t-test anyway (a minimal check is sketched at the end of this section).

    3. Conduct the permutation test. The code is given below; try to understand what each line does. Most of the labels should be self-explanatory.

      nrep <- 1000
      
      # pool all observations, then draw nrep random relabellings of the groups
      all_data <- c(data$experiment, data$control)
      permutations <- replicate(nrep, sample(1:length(all_data), length(data$experiment)))
      
      # for each permutation, recompute the difference in group means
      result <- apply(permutations, 2, function(permutation){
        mean_expt <- mean(all_data[permutation])
        mean_ctrl <- mean(all_data[-permutation])
        mean_expt - mean_ctrl
      })
      
      n <- length(result)
      # r = number of replications at least as extreme as the observed difference
      obs <- mean(data$experiment) - mean(data$control)
      r <- sum(abs(result) >= abs(obs))
      # compute Monte Carlo p-value with correction (Davison & Hinkley, 1997)
      mc.p.value <- (r + 1) / (n + 1)
      
      hist(result, breaks = 50, prob = TRUE, main = "",
           sub = paste0("MC p-value for H0: ", mc.p.value),
           xlab = paste("found", r, "as extreme effects for", n, "replications"))
      abline(v = obs, lty = 2, col = "red")
      
    4. Re-run the entire code in step 3. Each time you will see a slightly different null distribution, because the permutations are drawn at random (unless a seed is set with set.seed()).
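
For step 2, a minimal check might combine a normality test (or a Q-Q plot) with an ordinary two-sample t-test, using the data list defined in step 1, so the permutation p-value can be compared with the parametric one:

    # quick normality checks for each group
    shapiro.test(data$experiment)
    shapiro.test(data$control)
    qqnorm(data$control); qqline(data$control)

    # two-sample t-test of the difference in means
    t.test(data$experiment, data$control)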