⌛ 30 min
Gene-environment association (GEA) analysis aims at detecting specific allele or locus genotype is associated with a specific environmental predictor, while controlling for neutral genetic structure. The advances of genomic sequencing have allowed us to identify millions of single nucleotide polymorphisms (SNPs), which enhances the power to detect these signals of adaptation (or selection).

Rellstab et al. (2015). Mol EcolI. https://onlinelibrary.wiley.com/doi/full/10.1111/mec.13322
<aside> 💡 In this activity, we will analyse SNP data from a previous study on North America gray wolves (Canis lupus) by Schweizer et al. (2015) Mol Ecol to achieve these objectives:
<aside> 💌 I wish to thank Dr Brenna Forester (USFWS) for developing the data used in this practical.
</aside>
<aside> 📥 Download
</aside>
[ ] First, we will investigate the wolf.tsv.
wolf <- read.table("wolf.tsv", sep = "\\t", header = T)
dim(wolf)
[ ] Depending on the genotyping method, sometimes there is missing data at certain loci for certain individuals. We will investigate how frequent these missing data are.
sum(is.na(wolf))
[ ] We will then convert it to a lfmm genotype matrix format for testing the population structure. The only difference is a lfmm object does not contain any meta data in column and row names.
write.table(wolf, "wolf.lfmm", sep = " ", col.names = F, row.names = F)
[ ] If we want to search for signals of adaptation, we have to account for the population structure which will confound the signal. This is done through a mixed effect model. First, we will define the population structure. Instead of DAPC, we will use the sNMF algorithm, which is fast to compute but as accurate as PCA approaches.
wolf_snmf <- snmf("wolf.lfmm", K = 1:8, ploidy = 2, repetitions = 3, entropy = T, project = "new")
[ ] To determine the optimal $K$, we will plot their cross-entropies. Similar to BIC, we will pick the optimal $K$ which has the least cross-entropy.
plot(wolf_snmf, col = "blue", cex = 1.4, pch = 19)
[ ] We ran for repetitions = 3 for each $K$ in the sNMF algorithm. We will pick the best run (with the least cross-entropy) among the repetitions for $K=3$.
K <- 3
best <- which.min(cross.entropy(wolf_snmf, K = K))
[ ] We can visualise how prominent the population structure is using bar plots.
wolf_qmat <- Q(wolf_snmf, K = K, run = best)
barplot(t(wolf_qmat), col = brewer.pal(n = K, name = "Set2"), border = NA, space = 0,
xlab = "Individuals", ylab = "Admixture coefficients")
[ ] Many GEA analysis is not able to tolerate missing data. Therefore we would use the genetic clusters wolf_snmf to impute the missing data.
impute(wolf_snmf, "wolf.lfmm", K = 3, run = best)
Missing data must be encoded as 9 comes up!This will produce an output file wolf.lfmm_imputed.lfmm in your working directory.
<aside> 📥 Download
</aside>
[ ] We will load the reduced set of environmental data from the original paper in R.
env <- read.csv("env.csv")
str(env)