**DAPC (Discriminant Analysis of Principal Components) combines both PCA and LDA.**

Rosewoods are the world's most trafficked wild product, accounting for ~40% of the global illegal wildlife trade (more than all animal products combined). They are notoriously known as "bloodwood" because conflicts between local poachers, who are often driven by extreme poverty, and forest rangers frequently end in bloodshed. We want to save rosewood and alleviate poverty at the same time, so we are teaching local people how to source seeds, grow the trees, and generate income from them. NGOs and governments buy these trees for restoration projects. This all sounds fantastic. However, forest conservation programmes have traditionally ignored standing genetic diversity and can create genetic bottlenecks in the germplasm, for example by sourcing seeds that are closely related genetically.


<aside> 📥 Download

rosewood.csv

rosewood_coords.csv

</aside>

R packages

library(adegenet)
library(poppr)
library(mapplots)
library(maps)
library(RColorBrewer)
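
If any of these packages is not installed yet, library() stops with an error. A minimal sketch for checking this up front, assuming all five packages are installed from CRAN under the names used above (the variable names here are made up for illustration):

```r
# Packages used in this practical
pkgs <- c("adegenet", "poppr", "mapplots", "maps", "RColorBrewer")

# requireNamespace() returns TRUE/FALSE instead of stopping with an error
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them (assumes a working CRAN mirror):
  # install.packages(missing_pkgs)
} else {
  message("All packages are installed.")
}
```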

Understanding and loading the data

[ ] First, open rosewood.csv in a spreadsheet program (e.g. Excel). The file is in GenAlEx format.
- This population study has microsatellite data based on 9 loci from 523 individuals across 26 populations. Where in rosewood.csv are these metadata stored?
- Take a rough look at the sample and population columns (A and B). Also note that D1 (above ANG) and E1 (above BAN) say 4 and 28 respectively. What could those numbers mean?
- Is this species haploid or diploid? How can we tell from the loci data?
[ ] We will now start our session in R.
```
rosewood <- read.genalex("rosewood.csv")
rosewood
```
What does it tell you? Does it match what you observed above?

Detecting population structure

[ ] We will use the DAPC (Discriminant Analysis of Principal Components) to identify and describe genetic clusters in this species. The first step is to use the function find.clusters() to identify clusters.
```
str <- find.clusters(rosewood, max.n.clust = 20)
```
- It will show you the first graph and prompt you with "Choose the number PCs to retain (>= 1):"
- It will then show you the second graph and prompt you with "Choose the number of clusters (>=2):"
- Keep in mind that there is often no single true $K$. How many clusters do the data really support?
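
To build some intuition for the second graph, the sketch below mimics the idea on simulated data using base R only: run a PCA, apply k-means for a range of K, and compute a BIC-style score for each K. The toy data, the number of retained PCs, and the score formula n * log(WSS / n) + K * log(n) are assumptions chosen for illustration, and this is not meant to reproduce the internals of find.clusters().

```r
set.seed(42)

# Toy data: 3 well-separated groups of 40 individuals, 10 numeric variables
n_per <- 40
X <- rbind(
  matrix(rnorm(n_per * 10, mean = 0),  ncol = 10),
  matrix(rnorm(n_per * 10, mean = 3),  ncol = 10),
  matrix(rnorm(n_per * 10, mean = -3), ncol = 10)
)

# Keep the first 5 principal components (an arbitrary choice for this sketch)
scores <- prcomp(X)$x[, 1:5]
n <- nrow(scores)

# For each candidate K, run k-means and compute a BIC-style score
ks <- 1:8
bic <- sapply(ks, function(k) {
  wss <- kmeans(scores, centers = k, nstart = 20)$tot.withinss
  n * log(wss / n) + k * log(n)
})

# Lower is better; look for where the curve stops dropping sharply
print(data.frame(K = ks, BIC = round(bic, 1)))
```

Calling plot(ks, bic, type = "b") draws the curve. With real data the curve often keeps decreasing slowly after the main drop, so an elbow is usually more informative than the strict minimum.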
[ ] We will then use the function dapc() to describe the relationships between these clusters. DAPC provides an efficient description of genetic clusters using a few synthetic variables. These are constructed as linear combinations of the original variables (alleles) which have the largest between-group variance and the smallest within-group variance. Coefficients of the alleles used in the linear combination are called loadings, while the synthetic variables are called discriminant functions.
```
rw_dapc <- dapc(rosewood, str$grp)
```
- It will show you the first graph and prompt you with "Choose the number PCs to retain (>= 1):"
- It will then show you the second graph and prompt you with "Choose the number discriminant functions to retain (>=1):"
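
The two-step idea behind DAPC (a PCA followed by a discriminant analysis on the retained PCs) can be sketched with base R and the MASS package. This is only an illustration on simulated numeric data with known groups. The variable names, the group shifts, and the choice of 5 PCs are assumptions, and dapc() handles allele data and scaling in its own way.

```r
library(MASS)  # provides lda(); normally shipped with R

set.seed(1)

# Toy data: 3 groups of 30 individuals, 20 variables standing in for alleles
grp <- factor(rep(c("A", "B", "C"), each = 30))
X <- matrix(rnorm(90 * 20), nrow = 90)
X[grp == "B", 1:5]  <- X[grp == "B", 1:5] + 2   # shift group B on some variables
X[grp == "C", 6:10] <- X[grp == "C", 6:10] + 2  # shift group C on others

# Step 1: PCA, retaining the first few principal components
n_pca <- 5
pca <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:n_pca]

# Step 2: linear discriminant analysis on the retained PCs
fit <- lda(scores, grouping = grp)
pred <- predict(fit, newdata = scores)

# With 3 groups there are at most 2 discriminant functions
dim(pred$x)

# Posterior membership probabilities, one row per individual
head(round(pred$posterior, 3))

# Loadings of the original variables on the discriminant functions,
# obtained by composing the PCA rotation with the LDA coefficients
var_loadings <- pca$rotation[, 1:n_pca] %*% fit$scaling
dim(var_loadings)
```

Running the discriminant analysis on a handful of PCs rather than on all original variables keeps the number of predictors small and uncorrelated, which matters when there are many alleles.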
[ ] We will plot a scatterplot to see how prominent the genetic structure is.
```
scatter(rw_dapc)
```
- What does it tell you?
[ ] We want to see what this means for our populations. We will thus extract the membership probabilities of each individual (i.e. how likely each individual is to belong to each of the 5 clusters).
```
postprobs <- as.data.frame(round(rw_dapc$posterior, 4))
head(postprobs)
```
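
A membership table like this can be summarised per individual by taking the most probable cluster and flagging individuals whose assignment is uncertain. The sketch below uses a small made-up matrix rather than the rosewood results, and the 0.9 cut-off is an arbitrary assumption.

```r
# Hypothetical posterior matrix: 4 individuals x 3 clusters (made-up numbers)
post <- matrix(
  c(0.98, 0.01, 0.01,
    0.10, 0.85, 0.05,
    0.40, 0.35, 0.25,
    0.00, 0.02, 0.98),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("cluster", 1:3))
)

# Each row should sum to 1 (up to rounding)
stopifnot(all(abs(rowSums(post) - 1) < 1e-8))

# Most probable cluster for each individual
best <- colnames(post)[max.col(post, ties.method = "first")]

# Flag individuals whose highest probability reaches the (arbitrary) 0.9 cut-off
max_prob <- apply(post, 1, max)
confident <- max_prob >= 0.9

data.frame(best_cluster = best, max_prob = max_prob, confident = confident)
```

Individuals with no dominant cluster (such as ind3 here) may be admixed or simply hard to assign.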

[ ] We will then compute the mean of the membership probabilities across each population (i.e. the proportion of membership to each of the 5 clusters for each population).

```
# We will stick with K = 5
# (this must match the number of clusters you chose in find.clusters())
K <- 5

# Load the coordinates data
rw_coords <- read.csv("rosewood_coords.csv", row.names = 1)

# Retrieve the original population for each individual
rw_pops <- rosewood$pop

# This retrieves the number of populations, which is simply 26.
Npop <- length(unique(rw_pops))

# This creates an empty matrix with 26 populations and K = 5 clusters
qpop <- matrix(NA, ncol = K, nrow = Npop)
row.names(qpop) <- unique(rw_pops)    # Name the rows with population IDs

# For each population
for (i in unique(rw_pops)) {
  # Compute the mean probability for each cluster, and store it in the matrix
  qpop[i, ] <- apply(postprobs[rw_pops == i, ], 2, mean)
}

View(qpop)
```
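
The same per-population averaging can be written without an explicit loop. The sketch below compares a loop with a split-and-apply version on small simulated stand-ins for rw_pops and postprobs (toy_pops and toy_post are hypothetical names and the numbers are random), and checks that every population's proportions sum to 1.

```r
set.seed(7)

# Hypothetical stand-ins: 12 individuals in 3 populations, 2 clusters
toy_pops <- factor(rep(c("P1", "P2", "P3"), times = c(4, 3, 5)))
raw <- matrix(runif(12 * 2), ncol = 2)
toy_post <- as.data.frame(raw / rowSums(raw))  # rows sum to 1, like posteriors
names(toy_post) <- c("cluster1", "cluster2")

# Loop version, following the approach used above
q_loop <- matrix(NA, nrow = nlevels(toy_pops), ncol = ncol(toy_post),
                 dimnames = list(levels(toy_pops), names(toy_post)))
for (p in levels(toy_pops)) {
  q_loop[p, ] <- colMeans(toy_post[toy_pops == p, ])
}

# Split-and-apply version: one block of rows per population, then column means
q_vec <- t(sapply(split(toy_post, toy_pops), colMeans))

# Both approaches should agree, and each population's proportions sum to 1
stopifnot(isTRUE(all.equal(q_loop, q_vec, check.attributes = FALSE)))
round(q_vec, 3)
rowSums(q_vec)
```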