Variability refers to the degree of spread or dispersion of your dataset. As with central tendency, there are a variety of ways in which variability can be measured. Variability measures are typically appropriate for quantitative data only.

The Range

The easiest way to quantify the variability of your data is to calculate the range of scores in the dataset. The range is represented by the following formula:

Range = Highest\enspace Score - Lowest\enspace Score

While this measure makes a lot of intuitive sense, it suffers from two major problems:

  • It is entirely dependent on the most extreme scores in the dataset. For example, consider the range of Dataset A: {1, 23, 48, 48, 52, 52, 57, 900}. The range is 900 – 1 = 899. However, this number doesn’t reflect the fact that most of the scores in the dataset (those close to 50) are not very variable at all.
  • Most of the scores in the dataset have no effect on the range. Dataset B {1, 100, 300, 500, 600, 700, 800, 900} has exactly the same range as Dataset A, but the datapoints are clearly more variable in B.
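Both problems can be seen in a quick sketch (Python, used here purely for illustration; the article's Excel formulas appear below):

```python
# Range = highest score - lowest score, for the two example datasets.
dataset_a = [1, 23, 48, 48, 52, 52, 57, 900]
dataset_b = [1, 100, 300, 500, 600, 700, 800, 900]

def data_range(scores):
    return max(scores) - min(scores)

print(data_range(dataset_a))  # 899
print(data_range(dataset_b))  # 899 -- identical, even though B is clearly more spread out
```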

We need a measure that deals with these issues.

The Interquartile Range (IQR)

The IQR is a move in this direction. The IQR is calculated using the following formula:

IQR = 75th\enspace percentile\enspace score - 25th\enspace percentile\enspace score

The 25th percentile is the value below which 25% of the datapoints fall, and the 75th percentile is the value below which 75% of the datapoints fall. Using Dataset A (above), since there are 8 data points, the 25th percentile is the number that is greater than 2 of the 8 data points. Any number greater than 23 and less than 48 would qualify, so we use the same convention that we use for the median: take the average of the two points closest to the boundary. In this case, the 25th percentile is \frac {23+48} {2} = 35.5. Similarly, the 75th percentile is \frac {52+57} {2} = 54.5. Therefore, the IQR for this dataset is 54.5 - 35.5 = 19.

The IQR for Dataset B is 750 - 200 = 550. This is a good thing: the IQR better reflects the increased variability of Dataset B relative to Dataset A. In addition, the IQR is an improvement over the Range in that it is not as susceptible to extreme scores. It avoids this problem, though, by ignoring half of the dataset: any data below the 25th or above the 75th percentiles does not affect the value of the statistic. Ideally, we would want our variability statistic to reflect the variability of all or nearly all of our dataset.
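The median-of-halves convention used above can be sketched in Python (this sketch assumes an even number of data points, as in Datasets A and B, so the data split cleanly into halves):

```python
def median(sorted_scores):
    """Median of an already-sorted list."""
    n = len(sorted_scores)
    mid = n // 2
    if n % 2 == 1:
        return sorted_scores[mid]
    return (sorted_scores[mid - 1] + sorted_scores[mid]) / 2

def iqr(scores):
    """IQR = median of the upper half - median of the lower half.

    Assumes an even number of scores (the convention in the text).
    """
    s = sorted(scores)
    half = len(s) // 2
    return median(s[half:]) - median(s[:half])

dataset_a = [1, 23, 48, 48, 52, 52, 57, 900]
dataset_b = [1, 100, 300, 500, 600, 700, 800, 900]
print(iqr(dataset_a))  # 19.0
print(iqr(dataset_b))  # 550.0
```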

Average Deviation

In our quest for the perfect variability statistic we might try something like the average deviation of each score from the mean:

Average\enspace deviation = \frac {\sum {(X - \overline {X})}} {n}

The problem with this measure, though, is that it is always equal to zero:

\frac {\sum {(X - \overline {X})}} {n} = \frac{\sum{X}}{n} - \frac{\sum{\overline {X}}}{n} = \overline {X} - \frac{n\overline{X}}{n} = \overline{X} - \overline{X} = 0

To demonstrate this fact for Dataset A, we first calculate the mean of the sample as 147.625. We then calculate the deviation of each score from the mean and average them:

\frac{(1 - 147.625) + (23 - 147.625) + ... + (900-147.625)}{8}

= \frac{(-146.625) + (-124.625) + ... + 752.375}{8} = 0

A variability statistic that never changes is not very useful.
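A quick Python check confirms that the signed deviations cancel out for Dataset A:

```python
dataset_a = [1, 23, 48, 48, 52, 52, 57, 900]
n = len(dataset_a)
mean = sum(dataset_a) / n  # 147.625

# Sum of signed deviations, divided by n -- the "average deviation".
avg_dev = sum(x - mean for x in dataset_a) / n
print(avg_dev)  # 0.0 (in general, zero up to floating-point rounding)
```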

Mean Absolute Deviation

Instead, we could take the absolute value of each deviation and calculate the Mean Absolute Deviation:

MAD=\frac {\sum {|X - \overline {X}|}} {n}

This variability statistic does a good job of meeting our requirements. It reflects the variability of all of the scores, and it is not unduly affected by extreme scores. In fact, this variability statistic does see some use in the real world, although the Mean Squared Deviation is much more commonly used.
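For Dataset A, the MAD works out as follows (again a Python sketch for illustration):

```python
dataset_a = [1, 23, 48, 48, 52, 52, 57, 900]
n = len(dataset_a)
mean = sum(dataset_a) / n  # 147.625

# Average the absolute (unsigned) deviations so they can no longer cancel out.
mad = sum(abs(x - mean) for x in dataset_a) / n
print(mad)  # 188.09375
```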

Variance (Mean Squared Deviation)

Instead of taking the absolute value, we can square each deviation. This quantity is called variance. The variance of a population is denoted as \sigma^2:

\sigma^2=\frac {\sum {(X - \mu)}^2} {N}

\mu is the population mean, and N is the number of scores in the population.

You might think that the variance of a sample would follow the same pattern:

S^2=\frac {\sum {(X - \overline {X})}^2} {n}

However, this version of the sample variance statistic is biased. A biased sample statistic tends to under- or overestimate its corresponding population parameter. Remember that the whole point of statistics is to use samples to describe and make inferences about populations, so we want our sample statistics to be good estimates of their corresponding population parameters. In the case of the sample variance, leaving n in the denominator yields a statistic S^2 that tends to underestimate the population variance \sigma^2.

To illustrate this a little more clearly, imagine that you sample n=10 scores from a population whose variance is \sigma^2. Let’s call this Sample #1. We can calculate the variance of this sample using n in the denominator. Let’s denote it as S_1^2. Now imagine that we repeat this sampling process another 9,999 times. Samples #2 through 10,000 also have n=10 scores each, and we calculate the sample variances for each one of them as S_2^2, S_3^2, et cetera. If we take the mean of those sample variances it will tend to be slightly less than \sigma^2:

\frac{\sum{S^2}}{10,000} \approx \frac{n-1}{n}\sigma^2
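The thought experiment above can be run as an actual simulation. This sketch assumes a hypothetical normal population with \sigma^2 = 25 (any population would show the same pattern):

```python
import random

random.seed(1)  # for reproducibility
n, trials = 10, 10_000
sigma_sq = 25.0  # variance of the hypothetical population

s_squared = []
for _ in range(trials):
    sample = [random.gauss(0, sigma_sq ** 0.5) for _ in range(n)]
    mean = sum(sample) / n
    # Biased S^2, with n in the denominator:
    s_squared.append(sum((x - mean) ** 2 for x in sample) / n)

print(sum(s_squared) / trials)  # roughly 22.5, not 25
print((n - 1) / n * sigma_sq)   # 22.5, the predicted long-run average
```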

So if we want the statistic to be unbiased (we do), we can just multiply S^2 by \frac{n}{n-1}:

\frac{n}{n-1}S^2=\frac{n}{n-1}\frac {\sum {(X - \overline {X})}^2} {n}=\frac {\sum {(X - \overline {X})}^2} {n-1}

This version of the sample variance statistic is unbiased and is usually denoted as s^2:

s^2=\frac{n}{n-1}S^2=\frac {\sum {(X - \overline {X})}^2} {n-1}

The unbiased statistic s^2 is much more commonly used than the biased S^2.
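Both versions can be computed side by side for Dataset A (Python sketch):

```python
dataset_a = [1, 23, 48, 48, 52, 52, 57, 900]
n = len(dataset_a)
mean = sum(dataset_a) / n
ss = sum((x - mean) ** 2 for x in dataset_a)  # sum of squared deviations

biased = ss / n          # S^2, with n in the denominator
unbiased = ss / (n - 1)  # s^2, always a bit larger

print(biased)    # 81181.234375
print(unbiased)  # ~92778.55
```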

At any rate, the variance statistic (just like the MAD) does a nice job of meeting our requirements for a good variability measure. A minor issue with it, though, is that the statistic is in squared units rather than the original (unsquared) units that the data were measured in. The measure would be easier to interpret if it were in our original data units. We can fix this by taking the square root of the variance. This quantity is called the standard deviation.

Standard Deviation (Root Mean Squared Deviation)

The standard deviation is the square root of the variance. For populations, the formula is:

\sigma=\sqrt{\frac {\sum {(X - \mu)}^2} {N}}

and for samples, it is:

s=\sqrt{\frac {\sum {(X - \overline {X})}^2} {n-1}}

This variability measure is arguably the most commonly used in statistics.
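In Python's standard library these two statistics are `statistics.pstdev` and `statistics.stdev`; a quick check confirms that each is just the square root of the matching variance:

```python
import math
import statistics

dataset_a = [1, 23, 48, 48, 52, 52, 57, 900]

sigma = statistics.pstdev(dataset_a)  # population standard deviation (N denominator)
s = statistics.stdev(dataset_a)       # sample standard deviation (n - 1 denominator)

assert math.isclose(sigma, math.sqrt(statistics.pvariance(dataset_a)))
assert math.isclose(s, math.sqrt(statistics.variance(dataset_a)))
print(round(sigma, 2), round(s, 2))  # approx. 284.92 and 304.6
```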

Variability Formulas in Microsoft Excel 2010+

For all of the following, I am assuming that your dataset resides in cells A2 through A10 of your Excel spreadsheet. You will need to replace these values with the actual location of your dataset in order for them to work properly in your spreadsheet.

Range

=MAX(A2:A10) - MIN(A2:A10)

Interquartile Range

=PERCENTILE.EXC(A2:A10,0.75) - PERCENTILE.EXC(A2:A10,0.25)

(Note that PERCENTILE.EXC interpolates between data points, so its quartiles may differ slightly from the median-of-halves convention used above.)

Mean Absolute Deviation

First you need to calculate the deviation of each score from the mean. This formula should go in B2, to the right of your first datapoint:

=A2 - AVERAGE(A$2:A$10)

Copy that formula into cells B3 through B10.

Second, you need to calculate the absolute value of each deviation. Put this formula in cell C2:

=ABS(B2)

Copy that formula into cells C3 through C10.

Finally, calculate the MAD:

=AVERAGE(C2:C10)

Variance

For populations:

=VAR.P(A2:A10)

For samples:

=VAR.S(A2:A10)

Standard Deviation

For populations:

=STDEV.P(A2:A10)

or

=SQRT(VAR.P(A2:A10))

For samples:

=STDEV.S(A2:A10)

or

=SQRT(VAR.S(A2:A10))


Categorised in: Statistics