⌛ 30 min

Gene-environment association (GEA) analysis aims at detecting specific allele or locus genotype is associated with a specific environmental predictor, while controlling for neutral genetic structure. The advances of genomic sequencing have allowed us to identify millions of single nucleotide polymorphisms (SNPs), which enhances the power to detect these signals of adaptation (or selection).

Rellstab et al. (2015). Mol EcolI. https://onlinelibrary.wiley.com/doi/full/10.1111/mec.13322

Rellstab et al. (2015). Mol EcolI. https://onlinelibrary.wiley.com/doi/full/10.1111/mec.13322

<aside> 💡 In this activity, we will analyse SNP data from a previous study on North America gray wolves (Canis lupus) by Schweizer et al. (2015) Mol Ecol to achieve these objectives:

Delineate the neutral and adaptive population structures
Conduct and interpret a Latent Factor Mixed Model (LFMM) to detect gene-environment associations
Correct for multiple hypothesis testing for genomic data </aside>

<aside> 💌 I wish to thank Dr Brenna Forester (USFWS) for developing the data used in this practical.

</aside>

Understanding and loading the data

<aside> 📥 Download

wolf.tsv

</aside>

[ ] First, we will investigate the wolf.tsv.
```
wolf <- read.table("wolf.tsv", sep = "\\t", header = T)
dim(wolf)
```
- What are the columns and what are the rows?
[ ] Depending on the genotyping method, sometimes there is missing data at certain loci for certain individuals. We will investigate how frequent these missing data are.
```
sum(is.na(wolf))
```
[ ] We will then convert it to a lfmm genotype matrix format for testing the population structure. The only difference is a lfmm object does not contain any meta data in column and row names.
```
write.table(wolf, "wolf.lfmm", sep = " ", col.names = F, row.names = F)
```
[ ] If we want to search for signals of adaptation, we have to account for the population structure which will confound the signal. This is done through a mixed effect model. First, we will define the population structure. Instead of DAPC, we will use the sNMF algorithm, which is fast to compute but as accurate as PCA approaches.
```
wolf_snmf <- snmf("wolf.lfmm", K = 1:8, ploidy = 2, repetitions = 3, entropy = T, project = "new")
```
[ ] To determine the optimal $K$, we will plot their cross-entropies. Similar to BIC, we will pick the optimal $K$ which has the least cross-entropy.
```
plot(wolf_snmf, col = "blue", cex = 1.4, pch = 19)
```
- What is the optimal $K$?
[ ] We ran for repetitions = 3 for each $K$ in the sNMF algorithm. We will pick the best run (with the least cross-entropy) among the repetitions for $K=3$.
```
K <- 3
best <- which.min(cross.entropy(wolf_snmf, K = K))
```

[ ] We can visualise how prominent the population structure is using bar plots.

wolf_qmat <- Q(wolf_snmf, K = K, run = best)

barplot(t(wolf_qmat), col = brewer.pal(n = K, name = "Set2"), border = NA, space = 0,
        xlab = "Individuals", ylab = "Admixture coefficients")

What do you observe?

[ ] Many GEA analysis is not able to tolerate missing data. Therefore we would use the genetic clusters wolf_snmf to impute the missing data.
```
impute(wolf_snmf, "wolf.lfmm", K = 3, run = best)
```
- Oh, an error Missing data must be encoded as 9 comes up!
This will produce an output file wolf.lfmm_imputed.lfmm in your working directory.

Preparing the environmental predictor data

<aside> 📥 Download

env.csv

</aside>

[ ] We will load the reduced set of environmental data from the original paper in R.
```
env <- read.csv("env.csv")
str(env)
```
- What environmental predictors are there in the dataset? Which columns are they?