Measures of Central Tendency

Published

October 8, 2024

1 Background

Collected data usually contain many observations, and we are rarely interested in each one individually.
Instead, we are interested in summarizing the data in a concise and informative way.
Descriptive measures are single numbers that summarize the data.
They are used to describe the center and spread of the data.
They can be calculated from the data of a sample or a population.
Descriptive measures calculated from the sample are referred to as sample statistics, while those calculated from the population are referred to as population parameters.

2 Measures of central tendency

Measures of central tendency are used to describe the center of the data (i.e., the average or typical value).
They are also referred to as measures of location.
The most common measures of central tendency are the mean, median, and mode.

2.1 Arithmetic Mean

It is calculated by summing all observations and dividing by the total number of observations.

Let $X$ be a random variable.

The mean of a finite population of size $N$ is denoted by $\mu$, which reads as mu, while the mean of a sample of size $n$ is denoted by $\bar{x}$, which reads as x-bar.

The population and sample means can be calculated as follows:

Population mean	Sample mean
$\mu = \displaystyle \frac{x_1 + x_2 + \ldots + x_N}{N} = \displaystyle \frac{\displaystyle \sum_{i=1}^{N} x_i}{N}$	$\bar{x} = \displaystyle \frac{x_1 + x_2 + \ldots + x_n}{n} = \displaystyle \frac{\displaystyle \sum_{i=1}^{n} x_i}{n}$

Example: consider the variable mpg in the mtcars dataset that we discussed in earlier chapters:
```
mtcars$mpg
```
```
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
```
- The number of observations can be calculated using the length() function:
  length(mtcars$mpg)
```
[1] 32
```
- The sum of all values can be calculated using the sum() function:
  sum(mtcars$mpg)
```
[1] 642.9
```
- $\bar{x} = \displaystyle \frac{21 + 21 + 22.8 + \ldots + 21.4}{32} = \frac{642.9}{32} \approx 20.09$
- The mean can be calculated directly using the built-in R function mean():
  mean(mtcars$mpg)
```
[1] 20.09062
```
  Note
  - If the variable contains missing values, set the na.rm argument to TRUE to exclude them before calculating the mean; otherwise, the result will be NA.
  - For example, consider the following variable with a missing value $(1,\ 2,\ 3,\ \text{NA},\ 5)$:
    
    Using the mean() function without the na.rm argument returns NA:
    
    x <- c(1, 2, 3, NA, 5)
    mean(x)
    
    [1] NA
    
    Setting the na.rm argument to TRUE excludes the missing value before the mean is calculated:
    
    mean(x, na.rm = TRUE)
    
    [1] 2.75
When the values are given along with their frequencies, the mean can be calculated as a weighted mean:
- Let $x_1, x_2, \ldots, x_k$ be the values of a variable with frequencies $f_1, f_2, \ldots, f_k$, then the weighted mean is calculated as:
\[ \bar{x} = \displaystyle \frac{f_1x_1 + f_2x_2 + \ldots + f_kx_k}{f_1 + f_2 + \ldots + f_k} = \displaystyle \frac{\displaystyle \sum_{i=1}^{k} f_ix_i}{\displaystyle \sum_{i=1}^{k} f_i} \]
- The weight of each value is $\displaystyle w_i = \frac{f_i}{\displaystyle \sum_{i=1}^{k} f_i}$; therefore, the above equation can also be written as $\bar{x} = \displaystyle \sum_{i=1}^{k} w_ix_i$.
- Example: consider the following variable with values $(1.3,\ 2.1,\ 3.5,\ 4.1,\ 5.2)$ and frequencies $(2,\ 3,\ 1,\ 4,\ 2)$:
  library(DT)
  Value <- c(1.3, 2.1, 3.5, 4.1, 5.2)
  Frequency <- c(2, 3, 1, 4, 2)
  my_data <- data.frame(
    Value,
    Frequency,
    Weight = Frequency / sum(Frequency)
  )
  datatable(
    my_data,
    options = list(
      dom = "t",
      columnDefs = list(list(className = "dt-center", targets = "_all"))
    ),
    rownames = FALSE
  )
  - The weighted mean is calculated as:
  \[ \bar{x} = \displaystyle \frac{2 \times 1.3 + 3 \times 2.1 + 1 \times 3.5 + 4 \times 4.1 + 2 \times 5.2}{2 + 3 + 1 + 4 + 2} = \]
  
  \[ \frac{2.6 + 6.3 + 3.5 + 16.4 + 10.4}{12} = \frac{39.2}{12} \approx 3.27 \]
  - It can also be calculated using the weights as follows:
    with(my_data, sum(Value * Weight))
    [1] 3.266667
  - The built-in R function weighted.mean() can be used for direct calculation:
    # w takes the weights, which can be given as frequencies or as normalized weights
    weighted.mean(x = my_data$Value, w = my_data$Frequency)
    [1] 3.266667
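
As a quick cross-check of the frequency-weighted formula above, the frequency table can be expanded back into raw observations and averaged with mean(). This is only a sketch; the object names (values, freqs, expanded) are illustrative and do not appear in the original example.

```r
# Values and frequencies from the example above
values <- c(1.3, 2.1, 3.5, 4.1, 5.2)
freqs  <- c(2, 3, 1, 4, 2)

# Repeat each value as many times as its frequency (12 observations in total)
expanded <- rep(values, times = freqs)

# Plain arithmetic mean of the expanded observations
mean_expanded <- mean(expanded)

# Frequency-weighted mean computed from the formula
mean_weighted <- sum(freqs * values) / sum(freqs)

# The two results agree up to floating-point error
all.equal(mean_expanded, mean_weighted)
```

Both routes give the same value because weighting a value by its frequency is the same as counting it that many times in the sum.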

2.1.1 Characteristics of the mean

The sum of deviations of the observations from the arithmetic mean is always zero:

\[ \displaystyle \sum_{i=1}^{n} (x_i - \bar{x}) = 0 \]

The sum of the squared deviations of the observations from a certain value is the least (i.e., minimized) when this value is the mean (this is referred to as the least squares principle).
If a constant $c$ is added to each observation, the mean is also increased by $c$, i.e., the mean of $x_i + c = \bar{x} + c$.
If the observations are multiplied by a constant $c$, then the mean is also multiplied by $c$, i.e., the mean of $cx_i = c\bar{x}$.
If the observations $x_1, x_2, \ldots x_n$ are linearly transformed, where $y_i = a + bx_i$ ($a$ and $b$ are known constants), then it holds that $\bar{y} = a + b\bar{x}$.
The mean is sensitive to extreme values (outliers), and it may not be a good measure of central tendency in the presence of outliers.
Listing 1: Impact of outliers on the mean
- Example: consider the following variable with values $(32.1,\ 34.5,\ 33.6,\ 37.2,\ 39.4)$. The mean is calculated as follows:
  
  x <- c(32.1, 34.5, 33.6, 37.2, 39.4)
  mean(x)
  
  [1] 35.36
  
  Let's replace the last value with an extreme value of $250.1$, so the values become:
  
  x[5] <- 250.1
  x
  
  [1] 32.1 34.5 33.6 37.2 250.1
  
  The mean of the new values, including the outlier, is:
  
  mean(x)
  
  [1] 77.5
  
  The outlier pulls the mean so far upward that it exceeds four of the five observations, so it is no longer a good measure of central tendency for this variable.
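
The algebraic properties listed in this section can be checked numerically. The following is a minimal sketch using the original five values; the constants a and b and the helper sse() are arbitrary choices made for illustration.

```r
x <- c(32.1, 34.5, 33.6, 37.2, 39.4)
x_bar <- mean(x)

# 1. Deviations from the mean sum to zero (up to floating-point error)
sum_dev <- sum(x - x_bar)

# 2. The sum of squared deviations is smallest at the mean:
#    optimize() searches numerically for the value that minimizes sse()
sse <- function(m) sum((x - m)^2)
best <- optimize(sse, interval = range(x))$minimum

# 3. A linear transformation y = a + b * x shifts and scales the mean
a <- 10
b <- 2.5
y <- a + b * x

all.equal(sum_dev, 0)
all.equal(best, x_bar, tolerance = 1e-3)
all.equal(mean(y), a + b * x_bar)
```

Each comparison should return TRUE; the tolerance in the second one allows for the limited precision of the numerical search.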

2.2 Median

The median, denoted by $\large \tilde{x}_{0.5}$, is the value which divides the ordered observations into two equal parts (i.e., at least 50% of the values $\ge$ the median and at least 50% of the values $\le$ the median).
The median is calculated as follows:
- If the number of observations is odd, the median is the middle value of the ordered observations.
- If the number of observations is even, the median is the arithmetic mean of the two middle observations.

\[ \tilde{x}_{0.5} = \begin{cases} x_{\frac{n+1}{2}} & \text{if $n$ is odd} \\ \\ \frac{1}{2}(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}) & \text{if $n$ is even} \end{cases} \]

Example B.3.1
Example B.3.2

Consider the following set of observations: $(3, 5, 7, 7, 7, 9, 11)$.
- There is an odd number of observations $n = 7$, so the median is the middle value of the ordered observations, which is ranked $\displaystyle \frac{n+1}{2} = \frac{7+1}{2}=4^{th}$ and its value is $7$.
- Half of the measurements $(3, 5, 7)$ are below or equal to $7$, and half of the measurements $(7, 9, 11)$ are equal to or above $7$.
- Therefore, the definition of the median given above includes “equal to” to cover the case of repeated values.

Example: the variable mpg has $32$ observations, so the median is the arithmetic mean of the two middle observations (i.e., the mean of the ordered observations ranked $\displaystyle \frac{32}{2} = 16^{th}$ and $\displaystyle \frac{32}{2} + 1 =17^{th}$):
- The ordered observations are:
  mpg_ordered <- sort(mtcars$mpg)
  mpg_ordered
```
 [1] 10.4 10.4 13.3 14.3 14.7 15.0 15.2 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7
[16] 19.2 19.2 19.7 21.0 21.0 21.4 21.4 21.5 22.8 22.8 24.4 26.0 27.3 30.4 30.4
[31] 32.4 33.9
```
- The $16^{th}$ and $17^{th}$ ordered observations are:
  mpg_ordered[16:17]
```
[1] 19.2 19.2
```
- The median is the arithmetic mean of these two values:
  mean(mpg_ordered[16:17])
```
[1] 19.2
```
- The median can also be calculated using the built-in R function median():
  median(mtcars$mpg)
```
[1] 19.2
```
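
The piecewise definition of the median can also be written out directly. The helper below, median_by_rank(), is a hypothetical name introduced only for this sketch; in practice the built-in median() should be preferred.

```r
# Hypothetical helper: median computed directly from the piecewise definition
median_by_rank <- function(x) {
  x_sorted <- sort(x)                     # order the observations
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]                 # odd n: the middle value
  } else {
    mean(x_sorted[c(n / 2, n / 2 + 1)])   # even n: mean of the two middle values
  }
}

median_by_rank(c(3, 5, 7, 7, 7, 9, 11))   # odd number of observations
median_by_rank(mtcars$mpg)                # even number of observations
```

The two calls cover the odd and even cases from the examples above and should agree with the results of median().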

2.2.1 Characteristics of the median

Unlike the mean, the median is not drastically affected by extreme values (outliers).
Consider the example discussed above (Listing 1):
- The median of the original set of observations is:
  median(c(32.1, 34.5, 33.6, 37.2, 39.4))
```
[1] 34.5
```
- The median of the set of observations with the outlier is:
  median(c(32.1, 34.5, 33.6, 37.2, 250.1))
```
[1] 34.5
```
- Because the median is concerned with the middle value, it is less affected by the presence of the outliers than the mean.
The median is a robust measure of central tendency, and it is preferred when the data is skewed or contains outliers.

2.3 Quantiles

Quantiles are a generalization of the concept of the median.
A quantile is defined as the value that divides the ordered observations into two parts such that a certain proportion of the observations are less than or equal to the quantile.
In this sense, the median is the $50\%$ quantile because at least $50\%$ of the observations are less than or equal to the median.
Similarly, the $20\%$ quantile is the value that divides the ordered observations into two parts such that at least $20\%$ of the observations are less than or equal to that quantile, while at least $80\%$ of the observations are greater than or equal to it.

General definition of quantiles

Let $\alpha$ be a number between $0$ and $1$.
The $(\alpha \times 100)\%$ quantile is denoted by $\large \tilde{x}_{\alpha}$.
It is defined as the value that splits the data into two parts such that at least $\alpha \times 100\%$ of the observations are less than or equal to the quantile, while at least $(1 - \alpha) \times 100\%$ of the observations are greater than or equal to the quantile.

Deciles divide the data into $10$ parts, corresponding to quantiles at $\alpha = 0.1, 0.2, \ldots, 0.9$.
Quintiles divide the data into $5$ parts, corresponding to quantiles at $\alpha = 0.2, 0.4, 0.6, 0.8$.
Quartiles divide the data into $4$ parts, corresponding to quantiles at $\alpha = 0.25, 0.5, 0.75$.
Percentiles: if $(\alpha \times 100)$ takes an integer value, the quantile is referred to as a percentile (i.e., percentiles divide the data into $100$ equal parts).

2.3.1 Calculation of quantiles

The quantiles can be obtained in R using the quantile() function.
For example, the quantiles of the variable mpg in the mtcars dataset can be calculated as follows:
```
quantile(mtcars$mpg)
```
```
    0%    25%    50%    75%   100% 
10.400 15.425 19.200 22.800 33.900 
```
By default, the function outputs the $0\%$, $25\%$, $50\%$, $75\%$, and $100\%$ quantiles (i.e., the minimum, first quartile, median, third quartile, and maximum).
To get a specific quantile, pass a vector of the desired probabilities (i.e., $\alpha$) to the probs argument. For example, the following code returns the $5\%$ and $95\%$ quantiles (i.e., the $5^{th}$ and $95^{th}$ percentiles):
```
quantile(
  mtcars$mpg, 
  probs = c(0.05, 0.95)
)
```
```
    5%    95% 
11.995 31.300 
```
R provides nine algorithms to calculate the quantiles, which can be specified using the type argument:
- Types $1-3$ are discontinuous sample quantile types, while types $4-9$ are continuous types that interpolate between neighboring observations.
- Types $1$ and $3$ can be used for class “Date” and for ordered factors.
- Type $7$ is the default method.
- Type $6$ is used to get results similar to SPSS, Minitab, or GraphPad Prism.
- For example, the $25^{th}$ and $75^{th}$ percentiles of the variable mpg in the mtcars dataset using types $6$ and $7$ are as follows:
  Using type $6$:
  
  quantile(mtcars$mpg, probs = c(0.25, 0.75), type = 6)
  
     25%    75% 
  15.275 22.800 
  
  Using type $7$:
  
  quantile(mtcars$mpg, probs = c(0.25, 0.75), type = 7)
  
     25%    75% 
  15.425 22.800 
- Detailed discussion of the algorithms used to calculate the quantiles can be found here.
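
To make the default rule less of a black box, the sketch below reproduces type $7$ by hand: the quantile is placed at the fractional rank $h = (n - 1)\alpha + 1$ and obtained by linear interpolation between the two neighboring ordered observations. The function name quantile_type7() is hypothetical and is introduced only for illustration.

```r
# Hypothetical helper reproducing the default (type 7) rule by hand
quantile_type7 <- function(x, p) {
  x_sorted <- sort(x)
  n <- length(x_sorted)
  h <- (n - 1) * p + 1     # fractional rank of the quantile
  lower <- floor(h)        # rank of the observation just below
  upper <- ceiling(h)      # rank of the observation just above
  # linear interpolation between the two neighboring ordered observations
  x_sorted[lower] + (h - lower) * (x_sorted[upper] - x_sorted[lower])
}

quantile_type7(mtcars$mpg, 0.25)
quantile(mtcars$mpg, probs = 0.25, type = 7, names = FALSE)
```

For mpg, $n = 32$ and $\alpha = 0.25$ give $h = 8.75$, so the result lies three quarters of the way from the $8^{th}$ ordered value $(15.2)$ to the $9^{th}$ $(15.5)$, which matches the $15.425$ reported above.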

2.4 Trimean

The trimean is defined as a weighted average of the first quartile $(Q_1)$, the median $(Q_2)$, and the third quartile $(Q_3)$:

\[ \text {Trimean} = \frac{Q_1 + (2 \times Q_2) + Q_3}{4} \]

The trimean is a robust measure of central tendency and has the advantage that it uses more information than the median, which is based on only one value and ignores the overall distribution.
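
Base R has no dedicated trimean function that this text relies on, so the following sketch simply applies the formula to the quartiles returned by quantile(). It assumes the default quantile algorithm (type $7$); other types may give slightly different quartiles.

```r
# Quartiles of mpg (default type 7), without names for cleaner arithmetic
q <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75), names = FALSE)

# Trimean: the median counts twice, the outer quartiles once each
trimean <- (q[1] + 2 * q[2] + q[3]) / 4
trimean
```

With the quartiles reported earlier $(15.425,\ 19.2,\ 22.8)$, this gives $\displaystyle \frac{15.425 + 2 \times 19.2 + 22.8}{4} \approx 19.16$.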

2.5 Mode

The mode of a set of observations is the observation that occurs most frequently compared with all other values.
A set of observations can have:
- No mode (if all observations occur with the same frequency).
- One mode (if one observation occurs most frequently).
- More than one mode (if two or more observations occur with the same highest frequency).
Example: consider the variable cyl (number of car cylinders) in the mtcars dataset:
- R has no direct built-in function to calculate the mode, but it can be calculated from the frequency table as follows:
  library(dplyr)
  # create a table of the frequencies of the values
  freq_tbl <- table(mtcars$cyl)
  # convert to a data frame
  freq_df <- as.data.frame(freq_tbl)
  names(freq_df)[1] <- "cyl"
  # arrange the values in descending order of frequency
  freq_df <- freq_df %>% arrange(desc(Freq))
  freq_df
```
  cyl Freq
1   8   14
2   4   11
3   6    7
```
  paste("The mode is", freq_df$cyl[which.max(freq_df$Freq)], "cylinders")
```
[1] "The mode is 8 cylinders"
```
- The mode can also be calculated directly using the Mode() function from the DescTools package:
  library(DescTools)
  Mode(mtcars$cyl)
```
[1] 8
attr(,"freq")
[1] 14
```
  - The first line of the output shows the mode ($8$ cylinders) and the last line shows the frequency of the mode.
Caution
- R has a built-in function mode() that returns the storage mode of an object (e.g., numeric, character, etc.) rather than the statistical mode.
  
  mode(mtcars$cyl)
  
  [1] "numeric"
- For numerical data which contains many unique values, the mode may not be a good measure of central tendency. However, if the observations are summarized in groups (e.g., frequency table or histogram), the mode can be calculated as the class interval with the highest frequency.
  - Example: consider the histogram of the variable mpg in the mtcars dataset:
    - The class interval $(15, 20]$ has the highest frequency of $12$ observations, so the mode is the interval $(15, 20]$ (others may also consider the midpoint of the interval, $17.5$, as the mode).
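
The three cases (no mode, one mode, several modes) and the modal class can be sketched in base R as follows. The helper all_modes() is a hypothetical name, and the class boundaries (width $5$, starting at $10$) are an assumption chosen to match the interval $(15, 20]$ mentioned above.

```r
# Hypothetical helper: return every value that attains the highest frequency
all_modes <- function(x) {
  freq <- table(x)
  if (length(freq) > 1 && all(freq == freq[1])) {
    return(NULL)  # all values are equally frequent: no mode
  }
  modes <- names(freq)[freq == max(freq)]
  # convert back to numbers when the input was numeric
  if (is.numeric(x)) as.numeric(modes) else modes
}

all_modes(mtcars$cyl)

# Modal class of mpg using intervals of width 5 that are closed on the right
classes <- cut(mtcars$mpg, breaks = seq(10, 35, by = 5))
class_freq <- table(classes)
names(class_freq)[which.max(class_freq)]
```

Returning NULL for the no-mode case is a design choice of this sketch; other conventions (e.g., returning NA) are equally reasonable.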

2.6 Geometric mean

Let $X$ be a quantitative random variable with strictly positive values $x_1, x_2, \ldots, x_n$; then the geometric mean $(\bar{x}_G)$ is computed by multiplying all the values and then taking the $n^{th}$ root of the product:

\[ \bar{x}_G = \large \sqrt[n]{x_1 \times x_2 \times \ldots \times x_n} = \left( \displaystyle \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}} \]
It is less sensitive to extreme values in skewed distributions than the arithmetic mean.
Geometric mean is used for averaging data that are multiplicative in nature (e.g., percentage change and rates).
- Percentage change is calculated as the difference between the final $(x_1)$ and initial $(x_0)$ values divided by the initial value, then multiplied by $100$:
  
  \[ \displaystyle \text{Percentage change} = \left( \frac{x_1 - x_0}{x_0} \right) \times 100 \]
  - If the sign of the percentage change is positive, it indicates an increase, while a negative sign indicates a decrease.
- Rate is the ratio of two quantities that describes how one quantity changes with respect to another (e.g., speed, growth rate, and interest rate).

Consider the following set of observations: $(4, 16, 64, 256)$.
The geometric mean is calculated as:

\[ \sqrt[4]{4 \times 16 \times 64 \times 256} = \sqrt[4]{1048576} = 32 \]
The geometric mean can be calculated using the Gmean() function from the DescTools package or the geometric.mean() function from the psych package:
```
library(DescTools)
library(psych)
x <- c(4, 16, 64, 256)
Gmean(x)
```
```
[1] 32
```
```
geometric.mean(x)
```
```
[1] 32
```

Assume a bank provides the following interest rates for a three-year plan: $3.3\%$ in the first year, $4.5\%$ in the second year, and $5.2\%$ in the third year.

If a client deposits $\textdollar{500}$, the change in the amount can be calculated as follows:

Time	Year	Amount $(\textdollar)$	Rate $(\%)$	Interest $(\textdollar)$ $(\text{Rate}_t \times \text{Amount}_{t-1})$	Growth factor $(\text{Amount}_t/\text{Amount}_{t-1})$
$t_0$	0	$500.0$	–	–	–
$t_1$	1	$516.5$	$3.3$	$16.5$	$1.033$
$t_2$	2	$539.7$	$4.5$	$23.2$	$1.045$
$t_3$	3	$567.8$	$5.2$	$28.1$	$1.052$

The interest rate is also known as the growth rate.
The growth rate $= (\text{growth factor} - 1) \times 100$.
The average growth factor is calculated as the geometric mean of the growth factors:
```
growth_factors <- c(1.033, 1.045, 1.052)
Gmean(growth_factors)
```
```
[1] 1.043304
```
So, the average growth rate per year $= (1.043 - 1) \times 100 = 4.3\%$.
A similar, but not identical, value is obtained by calculating the geometric mean of the interest (growth) rates themselves:
```
interest_rates <- c(3.3, 4.5, 5.2)
Gmean(interest_rates)
```
```
[1] 4.258369
```
However, this shortcut is only an approximation ($4.26\%$ versus $4.33\%$ from the growth factors), and it fails when a rate is negative (i.e., the amount decreases) or zero, because the geometric mean is defined only for positive values. The growth factors should therefore be used.
The total amount of money ($A$) can be calculated at any year using the formula: $A_t = A_{t_0} \times (\text{average growth factor}) \large ^t$. For example, the amount at the end of the third year $= 500 \times 1.043304^3 = \textdollar567.8$.
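
The reason the geometric mean is the right average here can be checked with base R alone: compounding with the average growth factor for three years reproduces exactly the amount obtained by compounding year by year. The object names below are illustrative.

```r
growth_factors <- c(1.033, 1.045, 1.052)

# Geometric mean in base R: n-th root of the product
avg_factor <- prod(growth_factors)^(1 / length(growth_factors))

# Compounding year by year versus compounding with the average factor
final_yearly  <- 500 * prod(growth_factors)
final_average <- 500 * avg_factor^3

avg_factor
all.equal(final_yearly, final_average)
```

The arithmetic mean of the growth factors would not have this property, which is why it is not used for multiplicative data.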

Calculate the average interest rate per year for the following four-year plan: $10\%$, $14\%$, $15\%$, and $-5\%$. Give the answer as a percentage to one decimal place.

The average interest rate cannot be computed directly as the geometric mean of the interest rates because the last rate is negative.
Therefore, we calculate the average growth factor as the geometric mean of the growth factors.
Growth factor $= \displaystyle \frac{\text{rate}}{100} + 1$
So, the growth factors are $(1.1, 1.14, 1.15, 0.95)$.

The average growth factor is equal to:


growth_factors <- c(1.1, 1.14, 1.15, 0.95)
Gmean(growth_factors)

[1] 1.081882

Therefore, the average interest rate per year is $8.2\%$.

2.6.1 Geometric mean and logarithmic transformation

If the observations are transformed by taking the logarithm (common or natural), the antilogarithm of the arithmetic mean of the transformed values is equal to the geometric mean of the original values.
- Let $X$ be a random variable with positive, non-zero values $x_1, x_2, \ldots, x_n$, and $Y$ is the transformed variable with values $y_1, y_2, \ldots, y_n$ such that $y_i = \log_{10}(x_i)$:
- The arithmetic mean of the transformed values is calculated as:
\[ \bar{y} = \displaystyle \frac{y_1 + y_2 + \ldots + y_n}{n} = \displaystyle \frac{\log_{10}(x_1) + \log_{10}(x_2) + \ldots + \log_{10}(x_n)}{n} = \]

\[ \log_{10} \left( \sqrt[n]{x_1 \times x_2 \times \ldots \times x_n} \right) \]
- The antilogarithm of the arithmetic mean of the transformed values is:
\[ 10^{\bar{y}} = 10^{\log_{10} \left( \sqrt[n]{x_1 \times x_2 \times \ldots \times x_n} \right)} = \sqrt[n]{x_1 \times x_2 \times \ldots \times x_n} \]
- The last equation represents the geometric mean of the original values.
Example: consider the following set of observations $(5,\ 9,\ 10,\ 22,\ 13,\ 50)$. The following table shows the original and $\log_{10}$-transformed values:

library(DT)
x <- c(5, 9, 10, 22, 13, 50)
y <- round(log10(x), 5)
my_data <- data.frame(
  Original = x,
  Log10_transformed = y
)
datatable(
  my_data,
  options = list(
    dom = "t",
    columnDefs = list(list(className = "dt-center", targets = "_all"))
  ),
  rownames = FALSE
)
- The arithmetic mean of the transformed values is:
  mean(y)
```
[1] 1.134757
```
- The antilogarithm of the arithmetic mean of the transformed values is:
  10^mean(y)
```
[1] 13.63819
```
- The geometric mean of the original values is:
  Gmean(x)
```
[1] 13.63823
```
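- The antilogarithm of the arithmetic mean of the transformed values is equal to the geometric mean of the original values (the small difference in the last digits arises because the transformed values were rounded to five decimal places).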
Note
- When the original values are log transformed, the back-transformed mean is not the same as the original mean but it is the geometric mean of the original values.
- On the other hand, back-transforming the median of the log-transformed values gives the median of the original values (exactly when $n$ is odd or the two middle values are equal; otherwise, it gives the geometric mean of the two middle values, which is usually close to the median):
  - For the mpg variable from the mtcars dataset, the median of the log-transformed values is:
    
    med_log <- median(log(mtcars$mpg))
    med_log
    
    [1] 2.95491
  - The antilogarithm of the median (back transformation) of the log-transformed values is:
    
    
    exp(med_log) # exponentiation to back transform
    
    [1] 19.2
  - The median of the original values is:
    
    
    median(mtcars$mpg)
    
    [1] 19.2
  - The antilogarithm of the median of the log-transformed values is equal to the median of the original values.
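
The identity between the geometric mean and the back-transformed mean of the logarithms can be verified in base R without any rounding of the transformed values. This is a minimal sketch; the object names are illustrative.

```r
x <- c(5, 9, 10, 22, 13, 50)

# Geometric mean from the definition
gm_direct <- prod(x)^(1 / length(x))

# Back-transformed mean of the logarithms (common and natural)
gm_log10 <- 10^mean(log10(x))
gm_ln    <- exp(mean(log(x)))

c(gm_direct, gm_log10, gm_ln)
```

All three values agree up to floating-point error, regardless of the base of the logarithm. The logarithmic route is also the safer one for long vectors, where the raw product may overflow.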

2.7 Harmonic mean

The harmonic mean $(\bar{x}_H)$ is the reciprocal of the arithmetic mean of the reciprocals of the observations:
- Let $X$ be a random variable with values $x_1, x_2, \ldots, x_n$, then the reciprocals of the observations are $\displaystyle \frac{1}{x_1}, \frac{1}{x_2}, \ldots, \frac{1}{x_n}$, and their arithmetic mean is calculated as:
\[ \displaystyle \frac{1}{n} \left( \displaystyle \frac{1}{x_1} + \displaystyle \frac{1}{x_2} + \ldots + \displaystyle \frac{1}{x_n} \right) = \displaystyle \frac{\displaystyle \sum_{i=1}^{n} \frac{1}{x_i}}{n} \]
- The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals:
\[ \bar{x}_H = \displaystyle \frac{n}{\displaystyle \sum_{i=1}^{n} \frac{1}{x_i}} \]
It is used for averaging data that are ratios or rates (e.g., speed or financial ratios).
If the observations do not contribute equally towards the calculation of the mean (i.e., the observations have different weights), the harmonic mean can be weighted:
- Let $X$ be a random variable with values $x_1, x_2, \ldots, x_n$ that have corresponding weights $w_1, w_2, \ldots, w_n$, then the weighted harmonic mean is calculated as follows (the last equality holds when the weights are normalized so that $\sum_{i=1}^{n} w_i = 1$):
\[ \text{Weighted}\ \bar{x}_H = \displaystyle \frac{w_1 + w_2 + \ldots + w_n}{\displaystyle \frac{w_1}{x_1} + \frac{w_2}{x_2} + \ldots + \frac{w_n}{x_n}} = \displaystyle \frac{\displaystyle \sum_{i=1}^{n} w_i}{\displaystyle \sum_{i=1}^{n} \frac{w_i}{x_i}} = \displaystyle \frac{1}{\displaystyle \sum_{i=1}^{n} \frac{w_i}{x_i}} \]
Example: consider the following set of observations $(2, 4, 6, 8)$:
- The harmonic mean is calculated as:
  x <- c(2, 4, 6, 8)
  n <- length(x)
  harmonic_mean <- n / sum(1 / x)
  harmonic_mean
```
[1] 3.84
```
- The harmonic mean can also be calculated using the Hmean() function from the DescTools package or harmonic.mean() function from the psych package:
  library(DescTools)
  Hmean(x)
```
[1] 3.84
```
  library(psych)
  harmonic.mean(x)
```
[1] 3.84
```
The harmonic mean is used to calculate some statistical measures, such as the F1 score (the harmonic mean of precision and recall), which is used to assess the performance of classification models such as logistic regression. In addition, it is used during pairwise multiple comparisons.

Example B.3.5
Example B.3.6

A car travels $100\ \text{km}$ at a speed of $60\ \text{km/h}$, $100\ \text{km}$ at $70\ \text{km/h}$, $100\ \text{km}$ at $90\ \text{km/h}$, and $100\ \text{km}$ at $100\ \text{km/h}$. Calculate the average speed of the car.

The arithmetic mean of the speed $= \displaystyle \frac{60 + 70 + 90 +100}{4} = \frac{320}{4} = 80$ km/h.
However, this method ignores the time taken to travel each distance.
The time taken to travel each distance is given in the following table:

Speed (km/h)	Distance (km)	Time (h) [Distance/Speed]
$60$	$100$	$1.666667$
$70$	$100$	$1.428571$
$90$	$100$	$1.111111$
$100$	$100$	$1$

The total distance traveled is $400$ km, and the total time taken is about $5.21\ \text{h}$. So, the actual average speed $= \displaystyle \frac{400}{5.206349} \approx 76.83\ \text{km/h}$.
This average can be found using the harmonic mean of the speeds:
```
speeds <- c(60, 70, 90, 100)
Hmean(speeds)
```
```
[1] 76.82927
```
The average speed can also be calculated as the weighted arithmetic mean using weights based on the time taken to travel each distance:
```
times <- c(100/60, 100/70, 100/90, 100/100) # use times as weights
weighted.mean(speeds, w = times)
```
```
[1] 76.82927
```

Note

To find the average of rates with different numerators but the same denominator, use the arithmetic mean or the weighted harmonic mean (weights are based on numerators).
To find the average of rates with the same numerator but different denominators, use the harmonic mean or the weighted arithmetic mean (weights are based on denominators).
To find the average of rates with different numerators and denominators, use the weighted arithmetic mean (weights are based on denominators) or the weighted harmonic mean (weights are based on numerators).

Suppose that you invested $\textdollar{2550}$ in the stock market. In the first year, you spent $\textdollar{1000}$ to purchase $40$ shares of stock $A$. In the second year, you spent $\textdollar{1550}$ to purchase $78$ shares of the same stock. Calculate the average price per share.

The rate here is the price $(\textdollar)$ per share.
The total number of shares bought $= 40 + 78 = 118$ shares, so the actual average price per share $= \displaystyle \frac{2550}{118} \approx \textdollar{21.61}$.
In the first year, the price per share $= \displaystyle \frac{1000}{40} = \textdollar{25}$.
In the second year, the price per share $= \displaystyle \frac{1550}{78} \approx \textdollar{19.87}$.

Year	Price per share $(\textdollar/\text{share})$	Amount invested $(\textdollar)$	Number of shares
$1$	$25.00$	$1000$	$40$
$2$	$19.87$	$1550$	$78$

The arithmetic mean of the price per share $= \displaystyle \frac{25 + 19.87}{2} = \displaystyle \frac{44.87}{2} = \textdollar{22.44}$, which is higher than the actual average price per share.
Because both the numerator (amount invested) and denominator (number of shares) are different, the weighted arithmetic or weighted harmonic mean should be used to calculate the average price per share:
- The weighted arithmetic mean using denominator (number of shares) as weights:
  weights_1 <- c(40/118, 78/118)
  price_1 <- 1000/40
  price_2 <- 1550/78
  weighted.mean(c(price_1, price_2), w = weights_1)
```
[1] 21.61017
```
- The weighted harmonic mean using numerators (amount invested) as weights:
  weights_2 <- c(1000/2550, 1550/2550)
  h_mean <- 1/sum(weights_2 / c(price_1, price_2))
  h_mean
```
[1] 21.61017
```
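
The weighted harmonic mean can be wrapped in a small helper that also accepts weights that are not normalized, following the general formula $\sum w_i / \sum (w_i / x_i)$. The function name weighted_hmean() is hypothetical and is not part of base R or of the packages used above.

```r
# Hypothetical helper: weighted harmonic mean for arbitrary positive weights
weighted_hmean <- function(x, w) {
  sum(w) / sum(w / x)
}

prices   <- c(1000 / 40, 1550 / 78)  # price per share in each year
invested <- c(1000, 1550)            # numerators: amount invested
shares   <- c(40, 78)                # denominators: number of shares

weighted_hmean(prices, w = invested)  # weights based on the numerators
weighted.mean(prices, w = shares)     # weights based on the denominators
2550 / 118                            # total amount / total number of shares
```

All three expressions evaluate to the same average price per share, which illustrates the rule stated in the note above.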

2.8 Trimmed mean

The observations are arranged based on magnitude, then a certain percentage of the observations are removed from both ends of the ordered observations.
The trimmed mean $(\bar{x}_T)$ is then calculated as the arithmetic mean of the remaining observations.
Typically, $5\%$ to $25\%$ of the observations are trimmed from each end.
The trimmed mean is less sensitive to outliers than the arithmetic mean because the extreme values are removed.
Some robust statistical tests use the trimmed mean, such as Yuen’s test, which is used to compare the trimmed means of two groups when the assumption of normality is violated.
The trimmed mean can be calculated in R using the mean() function with the trim argument set to the fraction (between $0$ and $0.5$) of observations to be trimmed from each end.
- Example: the $5\%$ trimmed mean of the variable mpg in the mtcars dataset is calculated as follows:
  mean(mtcars$mpg, trim = 0.05)
```
[1] 19.95333
```
  - The DescTools package has a function Trim() that is an excerpt from the base function mean() but returns the trimmed data without calculating the mean:
    
    library(DescTools)
    mpg_ordered <- sort(mtcars$mpg)
    Trim(mpg_ordered, 0.05)
    
     [1] 10.4 13.3 14.3 14.7 15.0 15.2 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2
    [16] 19.2 19.7 21.0 21.0 21.4 21.4 21.5 22.8 22.8 24.4 26.0 27.3 30.4 30.4 32.4
    attr(,"trim")
    [1]  1 32
    - The output shows the cleaned data after trimming $5\%$ of the observations from each end and the location of the trimmed observations (the $1^{\text{st}}$ and $32^{\text{nd}}$ observations that were removed).
    - The percentage of trimming is multiplied by the number of observations to get the number of observations to be trimmed from each end (i.e., $0.05 \times 32 = 1.6$ observations, which R rounds down to $1$).
    - Therefore, one observation has been removed from each end.
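
The steps described above (sort, drop $\lfloor n \times \text{trim} \rfloor$ observations from each end, average the rest) can be sketched in base R as follows. The helper name trimmed_mean_manual() is hypothetical, and the sketch assumes a trimming fraction below $0.5$.

```r
# Hypothetical helper mirroring the steps described above (for trim < 0.5)
trimmed_mean_manual <- function(x, trim) {
  x_sorted <- sort(x)
  n <- length(x_sorted)
  k <- floor(n * trim)               # observations dropped from each end
  mean(x_sorted[(k + 1):(n - k)])    # mean of the remaining observations
}

trimmed_mean_manual(mtcars$mpg, trim = 0.05)
mean(mtcars$mpg, trim = 0.05)
```

For mpg with $5\%$ trimming, $k = 1$, so the result is the mean of the $30$ observations left after removing $10.4$ and $33.9$.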
$50\%$ trimming is equivalent to the median:
```
mean(mtcars$mpg, trim = 0.5)
```
```
[1] 19.2
```
Discarding the extreme values leads to loss of information that can introduce bias in the estimation of the mean. Therefore, trimming should be used cautiously. It can be safely used to discard extreme values (outliers) that might be due to errors during measurement or data collection or when the extreme values are irrelevant to the rest of the data.

2.9 Winsorized mean

The observations are arranged based on magnitude, then the extreme values beyond a certain percentile threshold are replaced by these threshold values.
For example, $5\%$ Winsorization involves replacing the values smaller than the $5^{th}$ percentile with the $5^{th}$ percentile value and the values larger than the $95^{th}$ percentile with the $95^{th}$ percentile value.
The Winsorized mean $(\bar{x}_W)$ is then calculated as the arithmetic mean of the modified observations.
The Winsorized mean is less sensitive to outliers than the arithmetic mean.
To Winsorize data, the Winsorize() function from the DescTools package can be used:
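- The function has a probs argument that takes a vector of two values representing the lower and upper percentiles to be Winsorized (e.g., probs = c(0.05, 0.95) for $5\%$ Winsorization).
- In addition, there is a type argument that specifies the algorithm to be used for calculating the percentiles (discussed in detail here).
- Note that the example below instead passes the thresholds through the val argument, computing them with quantile(); which of these arguments are available may depend on the installed version of DescTools.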
- Example: the variable mpg in the mtcars dataset is $5\%$ Winsorized as follows:
  - Calculate the $5^{th}$ and $95^{th}$ percentiles of the variable mpg:
    
    quantile(mtcars$mpg, probs = c(0.05, 0.95), type = 7)
    
        5%    95% 
    11.995 31.300 
  - The quantile() function has been used to calculate the $5^{th}$ and $95^{th}$ percentiles of the variable mpg, which are $11.995$ and $31.3$, respectively. The type argument is set to $7$, which is the default algorithm for calculating the percentiles in R for continuous data.
  - Display the original values in order:
    
    mpg_ordered
    
     [1] 10.4 10.4 13.3 14.3 14.7 15.0 15.2 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7
    [16] 19.2 19.2 19.7 21.0 21.0 21.4 21.4 21.5 22.8 22.8 24.4 26.0 27.3 30.4 30.4
    [31] 32.4 33.9
  - Winsorize the variable mpg using the $5^{th}$ and $95^{th}$ percentiles:
    
    library(DescTools)
    mpg_win <- Winsorize(
      x = mpg_ordered,
      val = quantile(mpg_ordered, probs = c(0.05, 0.95))
    )
    mpg_win
    
     [1] 11.995 11.995 13.300 14.300 14.700 15.000 15.200 15.200 15.500 15.800
    [11] 16.400 17.300 17.800 18.100 18.700 19.200 19.200 19.700 21.000 21.000
    [21] 21.400 21.400 21.500 22.800 22.800 24.400 26.000 27.300 30.400 30.400
    [31] 31.300 31.300
  - The output of the Winsorize() function shows that the two smallest values $(10.4\ \text{and}\ 10.4)$, which are smaller than the $5^{th}$ percentile $(11.995)$, have been replaced by the $5^{th}$ percentile, whereas the two largest values $(32.4\ \text{and}\ 33.9)$, which are larger than the $95^{th}$ percentile $(31.3)$, have been replaced by the $95^{th}$ percentile.
  - Calculate the Winsorized mean of the variable mpg:
    
    mean(mpg_win)
    
    [1] 20.07469
  - The Winsorized mean can be calculated directly using the winsor.mean() function from the psych package:
    
    library(psych)
    winsor.mean(mtcars$mpg, trim = 0.05)
    
    [1] 20.07469
Unlike trimming, Winsorization preserves some of the original information in the data. The extreme values are not totally discarded but their weight (impact) is reduced.
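
Winsorization itself needs nothing beyond base R. The sketch below clamps the values to the $5^{th}$ and $95^{th}$ percentiles with pmax() and pmin(); it assumes the default quantile algorithm (type $7$), and the object names are illustrative.

```r
x <- mtcars$mpg

# Thresholds: the 5th and 95th percentiles (default type 7)
limits <- quantile(x, probs = c(0.05, 0.95), names = FALSE)

# pmax() raises values below the lower limit to that limit;
# pmin() lowers values above the upper limit to that limit
x_win <- pmin(pmax(x, limits[1]), limits[2])

mean(x_win)
```

The number of observations stays at $32$, which is the practical difference from trimming: extreme values are pulled in to the thresholds rather than dropped.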

