P8Q1.csv from P8 to experiment the LDA.
We will have to use the library(MASS) and the lda() function. We already know the clustering a priori (Cultivar).
For lda(), the first argument is a formula which takes the form of Y ~ X1 + X2 + X3 + ... + Xn, where Y is the response grouping factor, i.e. Cultivar, and Xn are the discriminants. The second argument is data = wine. Name this lda object as wine.lda.
Proportion of trace is the percentage separation achieved by each discriminant function. How much does the first discriminant function achieve?
Inspect wine.lda, what is the first discriminant function?
We will have to calculate values of the linear discriminant functions.
wine.lda.values <- predict(wine.lda, wine[2:9])
We will make scatterplots of the first two discriminant functions, based on the values predicted above.
plot(wine.lda.values$x[,1], wine.lda.values$x[,2]) # make a scatterplot
text(wine.lda.values$x[,1], wine.lda.values$x[,2], wine$Cultivar, cex=0.7, pos=4, col="red") # add labels
<aside> 💡 Credit to Jeff Oliver. Adapted from their tutorial here.
</aside>

Read in the CSV as otter. Missing data can cause problems, therefore we will use na.omit() to remove any rows with missing data. After dropping those NA rows, we will force re-numbering of the rows with rownames(otter) <- NULL.
We will run the LDA with lda() and store this as otter.lda. This time, we will specify equal probability assignment for each of four species. Therefore, besides the formula = , and data = , we will specify prior = c(1,1,1,1)/4.
Calculate values of the linear discriminant functions and plot the first two discriminant functions, similarly to P9Q1.
We actually want to look at how well the LDA did in classifying the specimens. We will use the function lda() again, but this time add one extra argument CV = TRUE, which performs cross-validation on the model and provide posterior probabilities of group membership in the posterior object returned by lda().
We will pass two arguments to the table(): the a priori species identification of each sample, and the a posteriori assignment of each sample from the LDA.
table(otter$species, otter.lda$class, dnn = c("a priori", "assignment"))
The dnn = parameter is used to label the table. What do you observe?
We will investigate the posterior object by calling head(otter.lda$posterior. What do you observe?
Look at round(otter.lda$posterior[88, ], 3). What is concerning in this sample?
We will arbitrary decide that any sample with a posterior probability less than 0.95 as an ambiguous assignment. We will then create a matrix of assignments based on this cut-off.
minimum_prob <- 0.95
unambiguous <- otter.lda$posterior > minimum_prob
head(unambiguous)
Now look at the sample 88 again in this new matrix with unambiguous[88, ]. Is this what you expected?
We will find all ambiguous rows (i.e. all the values are FALSE) as ambiguous <- umabiguous[rowSums(umabiguous) == 0, ].
We will find what these samples originally by getting their row numbers, and match them with the original otter data set. What are the a priori known species of these ambiguous samples?
We can also plot the posterior probabilities for each sample. The base R code has been given to you below as a starting point. Feel free to draw advanced graphs with ggplot2.
posteriors <- t(otter.lda$posterior)
tick_pos <- c(0,
nrow(otter) - match(levels(otter$species)[1], rev(otter$species)) + 1,
nrow(otter) - match(levels(otter$species)[2], rev(otter$species)) + 1,
nrow(otter) - match(levels(otter$species)[3], rev(otter$species)) + 1,
nrow(otter) - match(levels(otter$species)[4], rev(otter$species)) + 1)
label_pos <- c(floor((tick_pos[2] - tick_pos[1]) / 2),
floor((tick_pos[3] - tick_pos[2]) / 2 + tick_pos[2]),
floor((tick_pos[4] - tick_pos[3]) / 2 + tick_pos[3]),
floor((tick_pos[5] - tick_pos[4]) / 2 + tick_pos[4]))
legend_cols <- c("black", "green4", "cyan3", "red3")
# Store graphics defaults
mar_default <- c(5, 4, 4, 2) + 1 # from par documentation
par(mar = mar_default + c(2, 0, 0, 0)) # add space to bottom margin
barplot(posteriors,
ylab = "Posterior Probability",
space = 0,
border = NA,
xaxt = 'n',
xlim = c(0, ncol(posteriors) + 45),
col = legend_cols,
legend.text = levels(otter$species),
args.legend = list(cex = 0.7, x = ncol(posteriors) + 55, y = 0.6, xpd = TRUE))
# Add tick marks
axis(side = 1,
at = tick_pos,
labels = FALSE)
# Add axis labels
axis(side = 1,
at = label_pos,
labels = levels(otter$species),
tick = FALSE,
par(las = 2, cex = 0.8))
Divergent Skull Morphology Supports Two Trophic Specializations in Otters (Lutrinae)