October 8, 2024
Collected data usually contain many observations, and we are rarely interested in each one individually.
Instead, we are interested in summarizing the data in a concise and informative way.
Descriptive measures are single numbers that summarize the data.
They are used to describe the center and spread of the data.
They can be calculated from the data of a sample or a population.
Descriptive measures calculated from the sample are referred to as sample statistics, while those calculated from the population are referred to as population parameters.
Measures of central tendency are used to describe the center of the data (i.e., the average or typical value).
They are also referred to as measures of location.
The most common measures of central tendency are the mean, median, and mode.
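The arithmetic mean is calculated by summing all observations and dividing by the total number of observations.
Let $x_i$ denote the $i$th observation.
The mean of a finite population of size $N$ is denoted by $\mu$, while the mean of a sample of size $n$ is denoted by $\bar{x}$.
The population and sample means can be calculated as follows:
Population mean | Sample mean |
---|---|
$\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$ | $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ |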
Example: consider the variable `mpg` in the `mtcars` dataset that we discussed in earlier chapters:
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
The number of observations can be calculated using the `length()` function, and the sum of all values using the `sum()` function.
The mean can also be calculated directly using the built-in R function `mean()`.
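The three steps described above can be sketched as follows. This is a minimal illustration that uses only base R and the `mtcars` dataset; the variable names `n`, `total`, and `manual_mean` are chosen here for illustration.

```r
# mtcars ships with base R (package 'datasets'), so nothing needs installing.
mpg <- mtcars$mpg

n <- length(mpg)          # number of observations
total <- sum(mpg)         # sum of all values
manual_mean <- total / n  # mean computed from its definition

n
total
manual_mean
mean(mpg)                 # built-in function; agrees with total / n
```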
If the variable contains missing values, the `na.rm` argument should be set to `TRUE` to exclude them before calculating the mean; otherwise, the result will be `NA`.
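The effect of `na.rm` can be seen in a short sketch. The vector `x` below is hypothetical and was made up for illustration; it is not the variable used in this chapter.

```r
# Hypothetical vector with two missing values (illustration only).
x <- c(4.2, NA, 5.1, 3.8, NA, 6.0)

mean_with_na <- mean(x)                   # NA: missing values propagate
mean_without_na <- mean(x, na.rm = TRUE)  # mean of the four observed values

mean_with_na
mean_without_na
```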
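For example, consider a variable that contains some missing values.
Calling the `mean()` function without the `na.rm` argument returns `NA`, whereas setting the `na.rm` argument to `TRUE` returns the mean of the non-missing values.
The mean can also be calculated as a weighted mean when the values $x_i$ are given along with their frequencies $f_i$:
$$\bar{x} = \frac{\sum_{i} f_i x_i}{\sum_{i} f_i}$$
The weight of each observation, $w_i$, is its frequency divided by the sum of all frequencies: $w_i = f_i / \sum_{i} f_i$.
Example: consider the following variable with values and their frequencies: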
Value | Frequency | Weight |
---|---|---|
1.3 | 2 | 0.1666666666666667 |
2.1 | 3 | 0.25 |
3.5 | 1 | 0.08333333333333333 |
4.1 | 4 | 0.3333333333333333 |
5.2 | 2 | 0.1666666666666667 |
The mean can also be calculated using the weights as $\bar{x} = \sum_{i} w_i x_i$.
The built-in R function `weighted.mean()` can be used for direct calculation.
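The three routes to the weighted mean can be compared in a short sketch. The values and frequencies are copied from the table above; the object names are chosen here for illustration.

```r
# Values and frequencies copied from the table above.
x <- c(1.3, 2.1, 3.5, 4.1, 5.2)
f <- c(2, 3, 1, 4, 2)

w <- f / sum(f)                           # weights; they sum to 1
mean_from_freq <- sum(f * x) / sum(f)     # using the frequencies
mean_from_weights <- sum(w * x)           # using the weights
mean_builtin <- weighted.mean(x, w = f)   # weighted.mean() normalises w itself

c(mean_from_freq, mean_from_weights, mean_builtin)
```

All three results should agree, because `weighted.mean()` divides by the sum of the weights internally.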
The sum of the squared deviations of the observations from a certain value is the least (i.e., minimized) when this value is the mean (this is referred to as the least squares principle).
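The least squares principle can be checked numerically. The sketch below is only an illustration: it uses the base R optimizer `optimize()` to search for the value that minimizes the sum of squared deviations for `mpg`, and the helper name `ss` is hypothetical.

```r
mpg <- mtcars$mpg

# Sum of squared deviations of the observations from a candidate value a.
ss <- function(a) sum((mpg - a)^2)

# Numerical search for the minimizing value within the range of the data.
best <- optimize(ss, interval = range(mpg))

best$minimum   # numerically very close to the mean
mean(mpg)
ss(mean(mpg))  # smaller than at any other value, e.g. the median
ss(median(mpg))
```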
If a constant
If the observations are multiplied by a constant
If the observations
The mean is sensitive to extreme values (outliers), and it may not be a good measure of central tendency in the presence of outliers.
Example: consider the following variable with values
Let's replace the last value with an extreme value.
The mean of the new values with the outlier would be:
The outlier shifts the mean so much that it is no longer a good measure of central tendency for this variable.
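The sensitivity of the mean to a single extreme value can be sketched as follows. The numbers are hypothetical and are not the values used in the original example.

```r
# Hypothetical values (illustration only).
x <- c(10, 12, 11, 13, 14)

# Replace the last value with an extreme value.
x_outlier <- x
x_outlier[length(x_outlier)] <- 140

mean_original <- mean(x)         # 12
mean_outlier <- mean(x_outlier)  # 37.2, larger than four of the five values

mean_original
mean_outlier
```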
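The median is the value that splits the ordered observations into two halves, such that half of the observations are less than or equal to it and half are greater than or equal to it.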
The median is calculated as follows:
If the number of observations is odd, the median is the middle value of the ordered observations.
If the number of observations is even, the median is the arithmetic mean of the two middle observations.
Consider the following set of observations:
There is an odd number of observations
Half of the measurements
Therefore, the previous definition of the median contains “equal to” to cover the case of repeated values.
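Example: the variable `mpg` has $n = 32$ observations, which is an even number.
After ordering the observations, the two middle values are the 16th and 17th observations, which are both $19.2$.
The median is the arithmetic mean of these two values: $(19.2 + 19.2)/2 = 19.2$.
The median can also be calculated using the built-in R function `median()`.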
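The manual calculation for an even number of observations can be sketched in base R as follows; the object names are chosen here for illustration.

```r
mpg <- mtcars$mpg
n <- length(mpg)                          # 32, an even number

ordered <- sort(mpg)                      # observations in ascending order
middle <- ordered[c(n / 2, n / 2 + 1)]    # the 16th and 17th ordered values
manual_median <- mean(middle)             # average of the two middle values

middle
manual_median
median(mpg)                               # built-in function; same result
```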
Unlike the mean, the median is not drastically affected by extreme values (outliers).
Consider the example discussed above (Listing 1):
The median of the original set of observations is:
The median of the set of observations with the outlier is:
Because the median is concerned with the middle value, it is less affected by the presence of the outliers than the mean.
The median is a robust measure of central tendency, and it is preferred when the data is skewed or contains outliers.
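The robustness of the median can be illustrated with a small sketch. The values are hypothetical and serve only to show that replacing one observation with an extreme value leaves the middle of the ordered data unchanged.

```r
# Hypothetical values (illustration only).
x <- c(10, 12, 11, 13, 14)
x_outlier <- replace(x, length(x), 140)   # last value becomes an outlier

median_original <- median(x)              # middle of 10, 11, 12, 13, 14
median_outlier <- median(x_outlier)       # middle of 10, 11, 12, 13, 140

median_original
median_outlier
```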
Quantiles are a generalization of the concept of the median.
A quantile is defined as the value that divides the ordered observations into two parts such that a certain proportion of the observations are less than or equal to the quantile.
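In this sense, the median is the $0.5$-quantile ($50\%$ quantile).
Similarly, the $0.25$-quantile ($25\%$ quantile) is the value such that at least $25\%$ of the observations are less than or equal to it.
Let $\alpha$ be a proportion with $0 < \alpha < 1$.
The $\alpha$-quantile generalizes this idea to any such proportion.
It is defined as the value that splits the data into two parts such that at least $100\alpha\%$ of the observations are less than or equal to it and at least $100(1-\alpha)\%$ are greater than or equal to it.
Deciles divide the data into $10$ equal parts.
Quintiles divide the data into $5$ equal parts.
Quartiles divide the data into $4$ equal parts.
Percentiles: if $\alpha$ is expressed as a percentage, the $\alpha$-quantile is called the $100\alpha$th percentile.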
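The quantiles can be obtained in R using the `quantile()` function.
For example, the quantiles of the variable `mpg` in the `mtcars` dataset can be calculated as follows:
By default, the function outputs the minimum, the first quartile, the median, the third quartile, and the maximum (i.e., the $0\%$, $25\%$, $50\%$, $75\%$, and $100\%$ quantiles).
To get a specific quantile, use the `probs` argument to supply a vector of the desired probabilities (i.e., values between $0$ and $1$).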
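A short sketch of both uses of `quantile()` is given below. The choice of deciles and the 90th percentile is only an example of what can be passed to `probs`.

```r
mpg <- mtcars$mpg

# Default call: 0%, 25%, 50%, 75%, and 100% quantiles.
default_q <- quantile(mpg)

# Specific quantiles via probs: the nine deciles and the 90th percentile.
deciles <- quantile(mpg, probs = seq(0.1, 0.9, by = 0.1))
p90 <- quantile(mpg, probs = 0.9)

default_q
deciles
p90
```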
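R provides nine algorithms to calculate the quantiles, which can be specified using the `type` argument:
Types 1 to 3 produce discontinuous sample quantiles, while types 4 to 9 produce continuous sample quantiles.
Type 7 is the default in R.
Type 6 is the method used by SPSS, Minitab, or GraphPad Prism.
For example, the quantiles of `mpg` in the `mtcars` dataset using the R default (type 7) and the method of SPSS, Minitab, and GraphPad Prism (type 6) are as follows: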
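The difference between two of the algorithms can be seen by computing the quartiles of `mpg` with `type = 7` and `type = 6`. This is a sketch for comparison only; the choice of the quartiles is an assumption made here for illustration.

```r
mpg <- mtcars$mpg
probs <- c(0.25, 0.5, 0.75)

q_type7 <- quantile(mpg, probs = probs, type = 7)  # default algorithm in R
q_type6 <- quantile(mpg, probs = probs, type = 6)  # alternative algorithm

# Side-by-side comparison; only the first quartile differs for these data.
rbind(type7 = q_type7, type6 = q_type6)
```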
Detailed discussion of the algorithms used to calculate the quantiles can be found here.
The mode of a set of observations is the observation that occurs most frequently compared with all other values.
A set of observations can have:
No mode (if all observations occur with the same frequency).
One mode (if one observation occurs most frequently).
More than one mode (if two or more observations occur with the same highest frequency).
Example: consider the variable `cyl` (number of car cylinders) in the `mtcars` dataset:
R has no direct built-in function to calculate the mode, but it can be calculated from the frequency table as follows:
cyl Freq
1 8 14
2 4 11
3 6 7
[1] "The mode is 8 cyliners"
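Output of this kind could be produced along the following lines. This is a sketch in base R, not necessarily the code originally used; it keeps ties, so it also covers the case of more than one mode.

```r
# Frequency table of the number of cylinders, sorted by decreasing frequency.
freq <- as.data.frame(table(cyl = mtcars$cyl))
freq <- freq[order(freq$Freq, decreasing = TRUE), ]

# Keep every value that attains the highest frequency (ties give several modes).
top <- freq[freq$Freq == max(freq$Freq), ]
mode_value <- as.numeric(as.character(top$cyl))

freq
paste("The mode is", mode_value, "cylinders")
```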
The mode can also be calculated directly using the `Mode()` function from the `DescTools` package:
[1] 8
attr(,"freq")
[1] 14
For numerical data which contains many unique values, the mode may not be a good measure of central tendency. However, if the observations are summarized in groups (e.g., frequency table or histogram), the mode can be calculated as the class interval with the highest frequency.
Example: consider the histogram of the variable `mpg` in the `mtcars` dataset:
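The modal class can also be extracted without drawing the plot. The sketch below assumes the default class intervals chosen by `hist()`; a different choice of breaks could give a different modal class.

```r
# Compute the histogram counts without plotting (default class intervals).
h <- hist(mtcars$mpg, plot = FALSE)

# The modal class is the interval with the highest count.
i <- which.max(h$counts)
modal_class <- c(lower = h$breaks[i], upper = h$breaks[i + 1])

h$breaks
h$counts
modal_class
```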
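Let $x_1, x_2, \ldots, x_n$ be positive and non-zero values.
Their geometric mean is the $n$th root of their product:
$$\bar{x}_G = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$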
It is less sensitive to extreme values in skewed distributions than the arithmetic mean.
The geometric mean is used for averaging data that are multiplicative in nature (e.g., percentage change and rates).
Percentage change is calculated as the difference between the final and initial values, divided by the initial value and multiplied by $100$.
Rate is the ratio of two quantities that describes how one quantity changes with respect to another (e.g., speed, growth rate, and interest rate).
Consider the following set of observations:
The geometric mean is calculated as:
The geometric mean can be calculated using the `Gmean()` function from the `DescTools` package or the `geometric.mean()` function from the `psych` package:
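When neither package is installed, the geometric mean can be sketched in base R in two equivalent ways. The values below are hypothetical and were picked so that the result is easy to check by hand.

```r
# Hypothetical positive values (illustration only): product is 64, n is 3.
x <- c(2, 4, 8)

gm_root <- prod(x)^(1 / length(x))  # nth root of the product
gm_log <- exp(mean(log(x)))         # same value via logarithms; avoids overflow

gm_root
gm_log
# With the packages installed, DescTools::Gmean(x) and psych::geometric.mean(x)
# are the functions named in the text for the same calculation.
```

The logarithm-based form is usually preferable for long vectors, because the product of many values can overflow.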
Assume a bank provides the following interest rates for a three-year plan:
If a client deposits
Time | Year | Amount | Rate | Interest | Growth factor |
---|---|---|---|---|---|
0 | – | – | – | ||
1 | |||||
2 | |||||
3 |
The interest rate is also known as the growth rate.
The growth rate $r$ and the growth factor are related by: growth factor $= 1 + r$.
The average growth factor is calculated as the geometric mean of the growth factors:
So, the average growth rate per year is the average growth factor minus $1$.
This average growth rate can also be obtained by calculating the geometric mean of the interest (growth) rates:
However, this approach is not recommended because the interest rates can be negative (the amount decreases), which results in an error when computing the geometric mean.
The total amount of money (
Calculate the average interest rate for the following four-year plan:
The average rate per year is
The average interest rate cannot be computed directly by calculating the geometric mean of the interest rates because the last rate is negative.
Therefore, we calculate the average growth factor as the geometric mean of the growth factors.
Growth factor $= 1 +$ interest rate.
So, the growth factors are
The average growth factor is equal to:
Therefore, the average interest rate per year is
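The same procedure can be sketched in R. The four rates below are hypothetical and are not the rates of the exercise above; they only mimic its structure, with a negative rate in the last year.

```r
# Hypothetical yearly interest rates (illustration only); the last is negative.
rates <- c(0.05, 0.08, 0.10, -0.02)

factors <- 1 + rates                               # growth factors, all positive
avg_factor <- prod(factors)^(1 / length(factors))  # geometric mean of the factors
avg_rate <- avg_factor - 1                         # average rate per year

avg_factor
avg_rate
# Check: growing by avg_factor for four years matches the actual total growth.
avg_factor^4
prod(factors)
```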
If the observations are transformed by taking the logarithm (common or natural), the antilogarithm of the arithmetic mean of the transformed values is equal to the geometric mean of the original values.
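Let $y_i = \log x_i$ be the log-transformed values of the observations $x_1, x_2, \ldots, x_n$.
The arithmetic mean of the transformed values is calculated as:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} \log x_i = \log\left[\left(\prod_{i=1}^{n} x_i\right)^{1/n}\right]$$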
Example: consider the following set of observations
Original | Log10_transformed |
---|---|
5 | 0.69897 |
9 | 0.95424 |
10 | 1 |
22 | 1.34242 |
13 | 1.11394 |
50 | 1.69897 |
The arithmetic mean of the transformed values is:
The antilogarithm of the arithmetic mean of the transformed values is:
The geometric mean of the original values is:
The antilogarithm of the arithmetic mean of the transformed values is equal to the geometric mean of the original values.
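The equality can be verified in R with the six values from the table above; the object names are chosen here for illustration.

```r
# Original values copied from the table above.
x <- c(5, 9, 10, 22, 13, 50)

mean_log10 <- mean(log10(x))        # arithmetic mean of the log10 values
back <- 10^mean_log10               # antilogarithm (back-transformation)
gm <- prod(x)^(1 / length(x))       # geometric mean of the original values

mean_log10
back
gm
mean(x)                             # arithmetic mean, larger than the two above
```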
When the original values are log transformed, the back-transformed mean is not the same as the original mean but it is the geometric mean of the original values.
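On the other hand, back-transforming the median of the log-transformed values gives the median of the original values. This holds exactly when the number of observations is odd or the two middle values are equal; otherwise, the result is the geometric mean of the two middle values.
For the `mpg` variable from the `mtcars` dataset, the median of the log-transformed values is: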
The antilogarithm of the median (back transformation) of the log-transformed values is:
The median of the original values is:
The antilogarithm of the median of the log-transformed values is equal to the median of the original values.
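A short sketch of the back-transformation for `mpg` is given below, followed by a hypothetical even-length vector `y` (an assumption for illustration) in which the two middle values differ, so the back-transformed median is not the ordinary median.

```r
mpg <- mtcars$mpg

back_median <- 10^median(log10(mpg))  # back-transformed median of the logs
back_median
median(mpg)                           # same value: both middle values are 19.2

# Hypothetical even-length vector whose two middle values differ.
y <- c(1, 10, 100, 1000)
back_median_y <- 10^median(log10(y))  # geometric mean of 10 and 100
back_median_y
median(y)                             # arithmetic mean of 10 and 100
```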
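The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the observations:
$$\bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
It is used for averaging data that are ratios or rates (e.g., speed or financial ratios).
If the observations do not contribute equally towards the calculation of the mean (i.e., the observations have different weights), the harmonic mean can be weighted:
$$\bar{x}_H = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{x_i}}$$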
Example: consider the following set of observations
The harmonic mean is calculated as:
The harmonic mean can also be calculated using the `Hmean()` function from the `DescTools` package or the `harmonic.mean()` function from the `psych` package:
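Both the unweighted and the weighted harmonic mean can be sketched in base R. The values and weights below are hypothetical and were chosen so that the results are easy to verify by hand.

```r
# Hypothetical values and weights (illustration only).
x <- c(2, 4, 4)
w <- c(1, 2, 1)

hm <- length(x) / sum(1 / x)        # 3 / (0.5 + 0.25 + 0.25)
whm <- sum(w) / sum(w / x)          # 4 / (0.5 + 0.5 + 0.25)

hm
whm
# With the packages installed, DescTools::Hmean(x) and psych::harmonic.mean(x)
# are the functions named in the text for the unweighted case.
```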
The harmonic mean is used to calculate some statistical measures such as F1 score used to assess the performance of logistic regression models. In addition, it is used during pairwise multiple comparisons.
A car travels
The arithmetic mean of the speed
However, this method ignores the time taken to travel each distance.
The time taken to travel each distance is given in the following table:
Speed (km/h) | Distance (km) | Time (h) [Distance/Speed] |
---|---|---|
The total distance traveled is
This average can be found using the harmonic mean of the speeds:
The average speed can also be calculated as the weighted arithmetic mean using weights based on the time taken to travel each distance:
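The agreement between the two approaches can be checked with a small sketch. The speeds and distances are hypothetical (two equal distances of 120 km), not the figures of the example above.

```r
# Hypothetical trip (illustration only): the same distance at two speeds.
speed <- c(40, 60)          # km/h
distance <- c(120, 120)     # km

time <- distance / speed                        # hours spent on each leg
overall <- sum(distance) / sum(time)            # total distance / total time
hm <- length(speed) / sum(1 / speed)            # harmonic mean of the speeds
wam <- weighted.mean(speed, w = time)           # arithmetic mean weighted by time

c(overall = overall, harmonic = hm, weighted = wam)
mean(speed)                                     # plain arithmetic mean overstates it
```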
To find the average of rates with different numerators but the same denominator use the arithmetic mean or weighted harmonic mean (weights are based on numerators).
To find the average of rates with same numerators but different denominators use the harmonic mean or weighted arithmetic mean (weights are based on denominators).
To find the average of rates with different numerators and denominators use the weighted arithmetic mean (weights are based on denominators) or weighted harmonic mean (weights are based on numerators).
Suppose that you invested
The rate here is the price
The total amount of the shares bought
In the first year, the price per share
In the second year, the price per share
Year | Price per share | Amount invested | Number of shares |
---|---|---|---|
The arithmetic mean of the price per share
Because both the numerator (amount invested) and denominator (number of shares) are different, the weighted arithmetic or weighted harmonic mean should be used to calculate the average price per share:
The weighted arithmetic mean using denominator (number of shares) as weights:
The weighted harmonic mean using numerators (amount invested) as weights:
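Both weighted means can be sketched in R. The prices and amounts are hypothetical and do not reproduce the figures of the example above; they only show that the two calculations give the same average price.

```r
# Hypothetical investment (illustration only).
price <- c(10, 20)          # price per share in year 1 and year 2
amount <- c(1000, 3000)     # amount invested in each year

shares <- amount / price                        # number of shares bought each year
avg_price <- sum(amount) / sum(shares)          # total paid / total shares
wam <- weighted.mean(price, w = shares)         # weights: denominators (shares)
whm <- sum(amount) / sum(amount / price)        # weights: numerators (amounts)

c(direct = avg_price, weighted_arithmetic = wam, weighted_harmonic = whm)
mean(price)                                     # unweighted mean differs
```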
The observations are arranged based on magnitude, then a certain percentage of the observations are removed from both ends of the ordered observations.
The trimmed mean is the arithmetic mean of the remaining observations.
Typically,
The trimmed mean is less sensitive to outliers than the arithmetic mean because the extreme values are removed.
Some robust statistical tests use the trimmed mean such as Yuen’s test, which is used to compare the trimmed means of two groups when the assumption of normality is violated.
The trimmed mean can be calculated in R using the `mean()` function with the `trim` argument set to the fraction (between $0$ and $0.5$) of observations to be trimmed from each end.
Example: the trimmed mean of `mpg` in the `mtcars` dataset is calculated as follows:
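A sketch of the calculation is given below. The trimming fraction of `0.05` is an assumption made here for illustration; with 32 observations it removes one observation from each end.

```r
mpg <- mtcars$mpg

trim <- 0.05                                     # assumed trimming fraction
trimmed_mean <- mean(mpg, trim = trim)

# Manual check: drop floor(n * trim) observations from each end, then average.
k <- floor(length(mpg) * trim)
kept <- sort(mpg)[(k + 1):(length(mpg) - k)]
manual_trimmed_mean <- mean(kept)

k
trimmed_mean
manual_trimmed_mean
```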
The `DescTools` package has a function `Trim()` that is an excerpt from the base function `mean()` but returns the trimmed data without calculating the mean:
[1] 10.4 13.3 14.3 14.7 15.0 15.2 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2
[16] 19.2 19.7 21.0 21.0 21.4 21.4 21.5 22.8 22.8 24.4 26.0 27.3 30.4 30.4 32.4
attr(,"trim")
[1] 1 32
The output shows the cleaned data after trimming
The trimming fraction is multiplied by the number of observations, and the result is rounded down, to get the number of observations to be trimmed from each end.
Therefore, one observation has been removed from each end.
Discarding the extreme values leads to loss of information that can introduce bias in the estimation of the mean. Therefore, trimming should be used cautiously. It can be safely used to discard extreme values (outliers) that might be due to errors during measurement or data collection or when the extreme values are irrelevant to the rest of the data.
The observations are arranged based on magnitude, then the extreme values beyond a certain percentile threshold are replaced by these threshold values.
For example,
The Winsorized mean is the arithmetic mean of the Winsorized observations.
The Winsorized mean is less sensitive to outliers than the arithmetic mean.
To Winsorize data, the `Winsorize()` function from the `DescTools` package can be used:
The function has a `probs` argument that takes a vector of two values representing the lower and upper percentiles to be Winsorized (e.g., `probs = c(0.05, 0.95)` for the 5th and 95th percentiles).
In addition, there is a `type` argument that specifies the algorithm to be used for calculating the percentiles (discussed in detail here).
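Example: the variable `mpg` in the `mtcars` dataset is Winsorized at the 5th and 95th percentiles.
Calculate the 5th and 95th percentiles of `mpg`:
The `quantile()` function has been used to calculate the 5th and 95th percentiles of `mpg`, which are $11.995$ and $31.3$ (the `type` argument is set to $7$).
Display the original values in order:
Winsorize the variable `mpg` using the 5th and 95th percentiles: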
[1] 11.995 11.995 13.300 14.300 14.700 15.000 15.200 15.200 15.500 15.800
[11] 16.400 17.300 17.800 18.100 18.700 19.200 19.200 19.700 21.000 21.000
[21] 21.400 21.400 21.500 22.800 22.800 24.400 26.000 27.300 30.400 30.400
[31] 31.300 31.300
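The output of the `Winsorize()` function shows that the two smallest values ($10.4$ and $10.4$) have been replaced by $11.995$, and the two largest values ($32.4$ and $33.9$) have been replaced by $31.3$.
Calculate the Winsorized mean of the variable `mpg`:
The Winsorized mean can also be calculated directly using the `winsor.mean()` function from the `psych` package: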
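The Winsorization steps can also be sketched in base R, without relying on either package. The sketch assumes the 5th and 95th percentiles computed with `type = 7`; the object names are chosen here for illustration.

```r
mpg <- mtcars$mpg

# Lower and upper thresholds: 5th and 95th percentiles (type 7).
limits <- unname(quantile(mpg, probs = c(0.05, 0.95), type = 7))

# Values below the lower threshold are raised to it, and values above the
# upper threshold are lowered to it; all other values are unchanged.
mpg_w <- pmin(pmax(mpg, limits[1]), limits[2])

limits
sort(mpg_w)
winsorized_mean <- mean(mpg_w)
winsorized_mean
```

Unlike trimming, all 32 observations are kept; only the four most extreme values are changed.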
Unlike trimming, Winsorization preserves some of the original information in the data. The extreme values are not totally discarded but their weight (impact) is reduced.