**DAPC (Discriminant Analysis of Principal Components) combines both PCA and LDA.**
Rosewoods are the world’s most illegally trafficked wild product, accounting for ~40% of the global illegal wildlife trade (more than all animal products combined). They are notoriously known as "bloodwood", because conflicts between local poachers, often driven by extreme poverty, and forest rangers lead to bloodshed. We want to save this species and alleviate poverty at the same time. We are teaching local people how to source seeds, grow them, and generate income. NGOs and governments buy these trees for restoration projects. This all sounds fantastic. However, forest conservation programmes have traditionally ignored standing genetic diversity and created genetic bottlenecks in the germplasm, for example by sourcing seeds from closely related trees.

library(adegenet)
library(poppr)
library(mapplots)
library(maps)
library(RColorBrewer)
[ ] First, inspect rosewood.csv in a spreadsheet program (e.g. Excel). The file is in GenAlEx format.
Where does rosewood.csv store these metadata? (See columns A and B.) Also note that cells D1 (above ANG) and E1 (above BAN) contain 4 and 28, respectively. What could those numbers mean?
[ ] We will now start our session in R.
rosewood <- read.genalex("rosewood.csv")
rosewood
What does the output tell you? Does it match what you observed above?
[ ] We will use the DAPC (Discriminant Analysis of Principal Components) to identify and describe genetic clusters in this species. The first step is to use the function find.clusters() to identify clusters.
str <- find.clusters(rosewood, max.n.clust = 20)
Choose the number PCs to retain (>= 1):
Choose the number of clusters (>=2):
[ ] We will then use the function dapc() to describe the relationships between these clusters. DAPC provides an efficient description of genetic clusters using a few synthetic variables. These are constructed as linear combinations of the original variables (alleles) that have the largest between-group variance and the smallest within-group variance. The coefficients of the alleles in these linear combinations are called loadings, while the synthetic variables themselves are called discriminant functions.
rw_dapc <- dapc(rosewood, str$grp)
Choose the number PCs to retain (>= 1):
Choose the number discriminant functions to retain (>=1):
[ ] We will plot a scatterplot to see how prominent the genetic structure is.
scatter(rw_dapc)
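To make the idea behind DAPC more concrete, here is a minimal conceptual sketch of its two steps (a PCA followed by a discriminant analysis on the retained PCs). It deliberately uses the built-in iris data rather than the rosewood data, and it uses prcomp() and MASS::lda() as stand-ins; the variable names (X, grp, n_pca, pcs, fit, pred) are made up for illustration, and the sketch is not meant to reproduce what dapc() does internally.

```r
# Conceptual sketch only: toy, non-genetic data (iris), NOT the rosewood data.
library(MASS)  # provides lda(); MASS is a recommended package shipped with R

X   <- iris[, 1:4]   # four numeric variables (stand-ins for allele counts)
grp <- iris$Species  # known groups (stand-ins for the find.clusters() groups)

# Step 1: PCA, keeping only the first n_pca principal components
n_pca <- 2  # illustrative choice
pca   <- prcomp(X, center = TRUE, scale. = FALSE)
pcs   <- pca$x[, 1:n_pca, drop = FALSE]

# Step 2: linear discriminant analysis on the retained PCs
fit <- lda(pcs, grouping = grp)

# Coefficients of the PCs in each discriminant function
fit$scaling

# Discriminant scores and posterior membership probabilities per individual
pred <- predict(fit, newdata = pcs)
head(pred$x)
head(round(pred$posterior, 4))
```

Keeping fewer PCs reduces the risk of overfitting the discriminant step, which is one reason the interactive prompts ask how many PCs to retain.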
[ ] We want to see what this means for our populations. We will therefore extract the membership probabilities of each individual (i.e. how likely each individual is to belong to each of the 5 clusters).
postprobs <- as.data.frame(round(rw_dapc$posterior, 4))
head(postprobs)
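One way to read a table of membership probabilities is to look up the most likely cluster for each individual and flag individuals with no clear assignment. The sketch below uses a small made-up matrix (toy_post) rather than the real postprobs, and the 0.8 threshold is an arbitrary value chosen for illustration only.

```r
# Hypothetical posterior matrix: 4 individuals x 3 clusters (made-up values)
toy_post <- matrix(c(0.98, 0.01, 0.01,
                     0.10, 0.85, 0.05,
                     0.45, 0.50, 0.05,
                     0.02, 0.03, 0.95),
                   ncol = 3, byrow = TRUE,
                   dimnames = list(paste0("ind", 1:4), paste0("cluster", 1:3)))

# Index of the most likely cluster for each individual (row)
best_cluster <- apply(toy_post, 1, which.max)

# Flag individuals whose highest probability falls below the threshold
threshold <- 0.8  # illustrative cut-off, not a recommended value
ambiguous <- apply(toy_post, 1, max) < threshold

data.frame(best_cluster, ambiguous)
```

Here ind3 is flagged because its membership is split almost evenly between two clusters.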
[ ] We will then compute the mean of the membership probabilities across each population (i.e. the proportion of membership to each of the 5 clusters for each population).
# We will stick with K = 5
K <- 5
# Load the coordinates data
rw_coords <- read.csv("rosewood_coords.csv", row.names = 1)
# Retrieve the original population for each individual
rw_pops <- rosewood$pop
# This retrieves the number of populations, which is simply 26.
Npop <- length(unique(rw_pops))
# This creates an empty matrix with 26 populations and K = 5 clusters
qpop <- matrix(NA, ncol = K, nrow = Npop)
row.names(qpop) <- unique(rw_pops) # Name the rows with population IDs
# For each population
for (i in unique(rw_pops)) {
  # Compute the mean probability for each cluster, and store it in the matrix
  qpop[i, ] <- colMeans(postprobs[rw_pops == i, ])
}
View(qpop)
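The loop above can also be written without pre-allocating a matrix. The following is a minimal sketch on made-up data (toy_post, toy_pops, and toy_qpop are hypothetical names and values, not the rosewood results) that computes the same kind of population-by-cluster table with tapply().

```r
# Hypothetical data: 6 individuals, 2 clusters, 3 populations (made-up values)
toy_post <- data.frame(V1 = c(0.9, 0.7, 0.2, 0.0, 0.5, 0.5),
                       V2 = c(0.1, 0.3, 0.8, 1.0, 0.5, 0.5))
toy_pops <- factor(c("ANG", "ANG", "BAN", "BAN", "CAM", "CAM"))

# For each cluster (column), average the probabilities within each population
toy_qpop <- sapply(toy_post, function(p) tapply(p, toy_pops, mean))

toy_qpop           # rows = populations, columns = clusters
rowSums(toy_qpop)  # each population's proportions sum to 1
```

Because every individual's probabilities sum to 1, each row of the resulting table also sums to 1, which is a quick sanity check for qpop as well.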