<aside> 💡 **Whenever you have a question about how a particular function works, type ?function_name in the console to open its help page.**

**All functions have this syntax:**

function_name(
	argument1,
	argument2,
	argument3 = 5,
	argument4 = TRUE)

Some arguments have default values (e.g. argument3 = 5 and argument4 = TRUE), but others do not (argument1 and argument2). At a minimum, you need to supply values for all arguments that have no default value.

When calling a function, you can omit the argument names if you supply the values in the same order as in the function syntax.

function_name(
	40,     # argument 1
	30,     # argument 2
	8,      # argument 3, overriding the default value 5
	FALSE)  # argument 4, overriding the default value TRUE

However, sometimes you will skip certain arguments (usually because you do not need to change their default values). Any argument that comes after a skipped one is then out of order, so you must state both its name and its value.

function_name(
	40,     # argument 1
	30,     # argument 2
	argument4 = FALSE)  # argument 4, overriding the default value TRUE

</aside>
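
To make the calling rules above concrete, here is a minimal sketch. `function_name` and its body are hypothetical (invented only so that the three calls can actually be run); the argument names and defaults mirror the template in the box above.

```r
# Hypothetical function with the same signature as the template above.
# The body is made up purely for illustration.
function_name <- function(argument1, argument2, argument3 = 5, argument4 = TRUE) {
  total <- argument1 + argument2 + argument3
  if (argument4) total else -total
}

# Only the required arguments: argument3 = 5 and argument4 = TRUE are used
res_defaults <- function_name(40, 30)

# All four values given positionally, in the order of the definition
res_positional <- function_name(40, 30, 8, FALSE)

# argument3 is skipped, so argument4 has to be named
res_named <- function_name(40, 30, argument4 = FALSE)

print(c(res_defaults, res_positional, res_named))
```

Calling `function_name(40)` instead would stop with an error, because argument2 has no default value.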

The three Pythagorean means (arithmetic, geometric, and harmonic) and the median are all useful for describing the central tendency of data. The scenarios below show why a particular Pythagorean mean is more suitable than the others in a given situation.

  1. Arithmetic mean works well to describe the centre of the data when there is an additive relationship between the numbers (i.e. linear).
    1. Create an arithmetic series of 7 values, where each number is produced by adding 3 to the previous number, starting with 1: a <- seq(1, 19, by = 3).
    2. The arithmetic mean can be found by either mean(a) or sum(a) / 7.
    3. Plot a graph of a: plot(a, type = "b").
    4. Add a horizontal line to indicate the arithmetic mean: abline(h = mean(a)).
    5. What is the median of a?
    6. What is the relationship between arithmetic mean and median of a?
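
Once you have attempted the steps above, the following sketch puts them together in one script so you can check your work. The dashed red median line and the variable names `a_mean` and `a_median` are additions for illustration, not part of the exercise.

```r
a <- seq(1, 19, by = 3)        # arithmetic series: 7 values, step of 3

a_mean   <- mean(a)            # same as sum(a) / length(a)
a_median <- median(a)

plot(a, type = "b")            # points joined by lines
abline(h = a_mean)             # horizontal line at the arithmetic mean
abline(h = a_median, col = "red", lty = 2)  # dashed red line at the median

print(c(mean = a_mean, median = a_median))
```

Compare where the two lines fall on the plot before answering the last two questions.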

<aside> 💡 Remember to turn the graphics off after finishing one plot and before starting a new plot: dev.off()

</aside>

  1. Geometric mean works well when there is a multiplicative relationship (i.e. exponential).

    1. Create a geometric series of 7 values, where each number is produced by multiplying the previous number by 3, starting with 1. Use the function seq() again (hint: let seq() generate the exponents) and assign this series to g.
    2. The geometric mean can be found by exp(mean(log(g))).
    3. Plot a graph of g.
    4. Add a horizontal line to indicate the arithmetic mean: abline(h = mean(g)).
    5. Add another line to indicate the geometric mean: abline(h = exp(mean(log(g))), col = "blue"), with blue colour to distinguish from the arithmetic mean.
    6. What is the median of g?
    7. What is the relationship between arithmetic mean, geometric mean, and median of g?
  2. We will use compound interest as a real-world example to better understand when the geometric mean is useful. Assume we have $100,000 that earns interest at a rate that varies each year for 5 years: rates <- c(1.01, 1.09, 1.06, 1.02, 1.15).

    1. We want a shortcut to find our average annual interest rate and thus the total amount of money after 5 years. Calculate the arithmetic mean of the rates: rates_amean <- mean(rates). Then compute the total money after 5 years using the compound interest formula: 100000 * (rates_amean^5 - 1) + 100000.

    2. We want to make sure the calculation from the previous step was correct, so we can do it the long way (year by year). After year 1, it will be year_1 <- 100000 * 0.01 + 100000; after year 2, it will be year_2 <- year_1 * 0.09 + year_1, and so on. This can be done using a for-loop:

      rates <- c(1.01, 1.09, 1.06, 1.02, 1.15)
      money <- 100000                # starting money
      for (r in rates){              # for every rate
        money <- money * (r-1) + money   # calculate the new money amount and re-write itself
      }                              # the loop will move on to next rate (year)
      print(money)                       # return the final value
      
    3. Do the two values match? Was the calculation from step 1 correct?

    4. Instead of using the arithmetic mean, calculate the geometric mean of rates and then use the compound interest formula to calculate the amount of money after 5 years again. Is it correct this time?

      rates_gmean <- exp(mean(log(rates)))
      final_money <- 100000 * (rates_gmean^5 - 1) + 100000
      
  3. Harmonic mean can easily be confused with geometric mean. A good canonical real-life example is travelling the same distance at different rates (i.e. speeds). Let’s say you want to visit the famous Blenheim Palace after this lesson.

    1. Using Google Maps, estimate the distance between Blenheim Palace and where you are now.

    2. Assume that you drive at 30 mph the entire way there and, sadly, traffic is crawling on the way back, so you drive at 10 mph over the same route. What is the arithmetic mean of your speed across the entire trip?

    3. This, of course, is not your true average speed. Why? How can we apply the arithmetic mean correctly? Consider a weighted arithmetic mean, taking into account how much time each leg of the trip takes.

    4. Now we will try using the harmonic mean instead:

      speeds <- c(30,10)
      h <- length(speeds)/sum(speeds^(-1))
      print(h)
      
    5. Et voilà! What is the relationship between the weighted arithmetic mean and the harmonic mean?
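
After working through the compound interest and round-trip questions, you can use the sketch below to check your answers. The helpers `geometric_mean()` and `harmonic_mean()` are hypothetical convenience functions (they are not part of base R), and the one-way distance of 60 miles is an assumed placeholder; substitute your own estimate from Google Maps.

```r
# Hypothetical helper functions (not part of base R)
geometric_mean <- function(x) exp(mean(log(x)))       # requires all x > 0
harmonic_mean  <- function(x) length(x) / sum(1 / x)  # requires all x != 0

# Compound interest: yearly growth factors from the exercise
rates <- c(1.01, 1.09, 1.06, 1.02, 1.15)
start <- 100000
n     <- length(rates)

year_by_year <- start * prod(rates)              # same result as the for-loop
via_amean    <- start * mean(rates)^n            # shortcut with the arithmetic mean
via_gmean    <- start * geometric_mean(rates)^n  # shortcut with the geometric mean
print(c(year_by_year = year_by_year, via_amean = via_amean, via_gmean = via_gmean))

# Round trip: speeds from the exercise
speeds   <- c(30, 10)          # mph there and back
distance <- 60                 # ASSUMED one-way distance in miles
times    <- distance / speeds  # hours spent on each leg

# Arithmetic mean of the speeds, weighted by the time spent at each speed
time_weighted <- sum(speeds * times) / sum(times)
print(c(harmonic = harmonic_mean(speeds), time_weighted = time_weighted))
```

Note that `start * x^n` is the same quantity as the formula `100000 * (x^5 - 1) + 100000` used in the exercise, just written more compactly. Try changing `distance` and see whether the comparison between the two speed averages changes.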

R has some built-in data sets that are useful for practice. We will practise drawing a boxplot and computing the interquartile range with one of them.

  1. mtcars is a built-in data set with many interesting measurements. Use ?mtcars to find out where the data come from and what each variable means.
    1. We are interested in how miles per gallon (mpg) differs between groups of cars with different numbers of cylinders (cyl). A simple boxplot can be built using:

      boxplot(mpg ~ cyl, data = mtcars, main="Car Mileage Data",
         xlab="Number of Cylinders", ylab="Miles Per Gallon")
      
    2. cyl is treated as a categorical variable here. Why?

    3. summary(mtcars$mpg) gives the quartiles of mpg, from which the interquartile range can be read off, but only for the data set as a whole. That is not what we want: we want the interquartile range for each group. Using what you have learned about subsetting and indexing, compute the IQRs for all three groups.
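
As a starting point for the last question, here is a minimal sketch for one group using logical indexing, followed by a one-line cross-check with `tapply()` that you can compare your own three answers against. The variable names (`mpg_4cyl`, `iqr_4cyl`, `iqr_by_cyl`) are chosen here for illustration only.

```r
# mpg values of the 4-cylinder cars, selected with a logical index
mpg_4cyl <- mtcars$mpg[mtcars$cyl == 4]
iqr_4cyl <- IQR(mpg_4cyl)

# The same value computed by hand from the 1st and 3rd quartiles
q <- quantile(mpg_4cyl, c(0.25, 0.75))
iqr_from_quartiles <- unname(q[2] - q[1])

# Cross-check: apply IQR() to mpg within each level of cyl
iqr_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, IQR)
print(iqr_by_cyl)
```

Repeat the first two lines for the 6- and 8-cylinder groups and confirm that your values match the `tapply()` output.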