Lecture 7

Describing measurements II

Measures of spread



Dr Lincoln Colling

07 Nov 2022


Psychology as a Science

Outline for today

  • Measures of spread

    • Range

    • Interquartile range

    • Deviation

    • Variance

      • Sample Variance and Population Variance
    • Standard Deviation

  • The relationship between samples and populations

Measures of spread

Last week we started learning about the tools we can use to describe data. Specifically, we learned about the mean, mode, and median. And we learned about how they are different ways of describing the typical value.

But apart from describing the typical value, we might also want a way to describe how spread out the data is around this value.

Measures of spread

In Figure 1 you can see two sets of data. Both datasets have a mean of 0, but they have different amounts of spread.

Figure 1: Histogram of two distributions with equal means but different spreads. N = 10,000 in each case.

Range

  • The range of a variable is the distance between its smallest and largest values.

  • For example, if we gather a sample of 100 participants and the youngest is 17 years old, and the oldest is 67 years old, then the range of our age variable in this sample is 67 - 17 = 50 years.
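As a minimal sketch of the calculation above, the range is the largest value minus the smallest. The ages below are made-up values chosen so that the youngest is 17 and the oldest is 67; `value_range` is a hypothetical helper name.

```python
# Hypothetical ages for a small sample (made-up values for illustration).
ages = [17, 23, 35, 41, 52, 67]


def value_range(values):
    """Return the distance between the largest and smallest values."""
    return max(values) - min(values)


age_range = value_range(ages)
print(age_range)  # 67 - 17 = 50
```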

Interquartile range

  • A slightly more useful measure than the range is the interquartile range (IQR)

  • Involves splitting the data into quarters.

    • Find the median to split the data in half
    • Split each of the halves into half again
  • The IQR is the range covered by the middle two quarters (50%) of the data
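The splitting procedure above can be sketched in a few lines of Python. This is one possible convention, assumed here for illustration: when the number of values is odd, the median itself is left out of both halves. Statistical software uses several slightly different quartile conventions, so other tools may give slightly different answers. The function name `iqr` and the scores are made up.

```python
from statistics import median


def iqr(values):
    """Interquartile range using the 'split at the median' approach."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]  # the lower half of the data
    # For an odd number of values, skip the median itself (assumed convention).
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    q1 = median(lower)  # middle of the lower half
    q3 = median(upper)  # middle of the upper half
    return q3 - q1


scores = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical data
print(iqr(scores))  # 9.5 - 4.0 = 5.5
```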

Range and IQR

  • The range and the IQR only tell us limited information. Two datasets can have the same range and IQR but still look very different.

Deviation

  • The range and the IQR depend on only two points.

  • To get a more fine-grained idea of the spread we need to take every data-point into account

  • One way to do this is to take each data-point and calculate how far it is away from some reference point, such as the mean

  • This is known as the deviation

Mathematically, we can represent deviation with Equation 1, below:

\[D=x_i - \bar{x} \qquad(1)\]

Deviation

  • Once we have the deviation values, then what do we do with them?

  • If we add them up then the sum will just be bigger whenever we have more data

  • But it’s possible to have bunched up large datasets and spread out small datasets and our measure should be able to account for this

Deviation

  • Instead of adding up the deviations we could work out the average of the deviations

  • But some deviations will be negative (smaller than the mean) and some deviations will be positive (larger than the mean), so they’ll just average up to 0

Value  Mean  Deviation
94     163   -69
96     163   -67
299    163   136
  • For example: (-69 + -67 + 136) ÷ 3 = 0
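A short sketch of this calculation, using the three values from the table and computing each deviation as the value minus the mean (Equation 1). The variable names are chosen for illustration only.

```python
values = [94, 96, 299]

mean_value = sum(values) / len(values)         # 163.0
deviations = [x - mean_value for x in values]  # Equation 1: x_i minus the mean

print(deviations)                         # [-69.0, -67.0, 136.0]
print(sum(deviations) / len(deviations))  # 0.0
```

Whatever values go in, the deviations from the mean cancel out, which is why their plain average is not useful as a measure of spread.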

Squared deviations

  • We can make sure all the deviations are positive by squaring the values
Value  Mean  Deviation  Squared Deviation
94     163   -69        4761
96     163   -67        4489
299    163   136        18496
  • For example: (4761 + 4489 + 18496) ÷ 3 = 9248.67

  • The mean of the squared deviations will be the basis for our next measure of spread, the variance
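The same three values can be used to sketch the squared-deviation calculation. The variable names are illustrative.

```python
values = [94, 96, 299]
mean_value = sum(values) / len(values)

# Squaring removes the sign, so every squared deviation is positive (or zero).
squared_deviations = [(x - mean_value) ** 2 for x in values]
mean_squared_deviation = sum(squared_deviations) / len(squared_deviations)

print(squared_deviations)                # [4761.0, 4489.0, 18496.0]
print(round(mean_squared_deviation, 2))  # 9248.67
```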

Variance

  • The population variance is defined as the mean of the squared deviations from the population mean

\[\mathrm{Var}(X)=\frac{\displaystyle\sum^{N}_{i=1}{(x_i - \mu)^2}}{N} \qquad(2)\]

  • But we don’t usually know the value of the population mean, so can we just use the sample mean instead?

Let’s compare what happens when we use the population mean and the sample mean

Squared deviations from the population mean

  • We’ll start off with a population where we know the population mean and the variance of the population (225)
  • We’ll take samples from this population, and work out the average of the squared deviations from the population mean

  • We’ll plot these values for different samples in Figure 2

Figure 2: Variance (mean squared deviation from the population mean) calculated for different samples

The value we calculate varies from sample to sample, but what does it do on average?

Squared deviations from the population mean

  • We can repeat what we did with the sample mean in Lecture 6 and see what happens with the average squared deviations from the population mean

  • The running average of the average squared deviations from the population mean is shown in Figure 3

Figure 3: Running average of the mean squared deviation from the population mean

On average, the mean squared deviation from the population mean will be equal to the variance of the population

Squared deviations from the sample mean

  • Now let’s repeat the process but use the deviation from the sample mean instead

Figure 4: Variance (mean squared deviation from the sample mean) calculated for different samples

  • Nothing in Figure 4 looks strange…

  • But, remember, we want to know how our calculated value behaves on average

Squared deviations from the sample mean

  • In Figure 5 we can see the running average of average squared deviations from the sample mean

Figure 5: Running average of the mean squared deviation from the sample mean

  • Now we can see the problem of using deviation from the sample mean instead of deviation from the population mean

  • Our calculated value will on average not be the same as the variance of the population
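The comparison in Figures 2 to 5 can be approximated with a small simulation. This is a sketch under stated assumptions, not the code used to make the figures: the population is assumed to be normal with a mean of 100 (not given in the lecture) and a standard deviation of 15 (so its variance is 225), and the sample size and number of samples are arbitrary choices.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

POP_MEAN = 100.0    # assumed for illustration; not given in the lecture
POP_SD = 15.0       # a standard deviation of 15 gives a variance of 225
SAMPLE_SIZE = 5     # assumed sample size
N_SAMPLES = 20_000  # assumed number of repeated samples


def mean(values):
    return sum(values) / len(values)


def mean_squared_deviation(values, centre):
    """Average squared distance of the values from a reference point."""
    return mean([(x - centre) ** 2 for x in values])


around_pop_mean = []
around_sample_mean = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    around_pop_mean.append(mean_squared_deviation(sample, POP_MEAN))
    around_sample_mean.append(mean_squared_deviation(sample, mean(sample)))

avg_pop = mean(around_pop_mean)        # average over all samples
avg_sample = mean(around_sample_mean)  # average over all samples

print(round(avg_pop, 1))     # should land close to 225
print(round(avg_sample, 1))  # should land noticeably below 225
```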

So what’s the solution?

Sample variance

  • When we only have access to information from the sample (e.g., the sample mean) then we have to calculate a quantity known as the sample variance

  • We said variance was defined by Equation 3:

\[\frac{\displaystyle\sum^{N}_{i=1}{(x_i - \mu)^2}}{N} \qquad(3)\]

  • For the sample variance we’ll make an adjustment so that we have Equation 4

\[\frac{\displaystyle\sum^{N}_{i=1}{(x_i - \bar{x})^2}}{N - 1} \qquad(4)\]

Sample variance

  • In Figure 6 we can see the running average of the sample variance.

Figure 6: Running average of sample variance


  • Dividing by N - 1 rather than taking a simple average (dividing by N) means that on average the sample variance will be equal to the variance of the population
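A sketch of the same idea using Python's standard `statistics` module, where `statistics.pvariance` divides by N and `statistics.variance` divides by N - 1. The population settings (normal, mean 100, standard deviation 15), the sample size, and the number of samples are assumptions made for illustration.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

POP_MEAN = 100.0   # assumed for illustration
POP_SD = 15.0      # gives a population variance of 225
SAMPLE_SIZE = 5    # assumed sample size
N_SAMPLES = 5_000  # assumed number of repeated samples

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    divide_by_n.append(statistics.pvariance(sample))         # divides by N
    divide_by_n_minus_1.append(statistics.variance(sample))  # divides by N - 1

avg_divide_by_n = statistics.fmean(divide_by_n)
avg_divide_by_n_minus_1 = statistics.fmean(divide_by_n_minus_1)

print(round(avg_divide_by_n, 1))          # tends to fall short of 225
print(round(avg_divide_by_n_minus_1, 1))  # tends to land close to 225
```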

Sample variance and population variance

  • If you have access to the entire population (e.g., you can compute the population mean) then you can calculate the population variance (divide by N)

  • If you only have access to the sample characteristics (e.g., you can only calculate the sample mean) then you must calculate the sample variance (divide by N - 1)

But the confusing part is: The sample variance is an unbiased estimator of the variance of the population

  • This just means that, averaged over many samples, the sample variance will converge to the variance of the population

Using the population variance formula with sample values gives a biased estimator of the variance of the population

  • This just means that, averaged over many samples, it won’t converge to the variance of the population (it will tend to be too small)

Remember, what we really want to know are the features of the population (its mean and variance) but we need to estimate these from the sample

Standard Deviation

The variance is a good measure of spread, and it’s a commonly used measure, but it can be a little difficult to interpret

For example, think back to the salary example from Lecture 6:

  • If salary is measured in USD

  • Then the variance is measured in USD², whatever that means!

Fortunately, there’s an easy solution! Just take the square root of the variance

This measure is called the standard deviation. We can represent it mathematically with Equation 5

\[s=\sqrt{\frac{\displaystyle\sum^{N}_{i=1}{(x_i - \bar{x})^2}}{N - 1}} \qquad(5)\]
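As a small worked sketch of Equation 5, the code below computes the sample standard deviation by hand and checks it against `statistics.stdev` from the Python standard library, which also divides by N - 1. The three values are reused from the earlier deviation example.

```python
import math
import statistics

values = [94, 96, 299]
n = len(values)
mean_value = sum(values) / n

# Sample variance: sum of squared deviations divided by N - 1.
sample_variance = sum((x - mean_value) ** 2 for x in values) / (n - 1)

# Standard deviation: the square root, back in the original units.
sample_sd = math.sqrt(sample_variance)

print(sample_variance)      # 13873.0
print(round(sample_sd, 2))  # 117.78
print(math.isclose(sample_sd, statistics.stdev(values)))  # True
```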

Why the squared deviations and not the absolute value?

Before we continue, we’ll just have a brief digression…

  • When we worked out the deviations, we squared them to turn the negative values into positive values

  • But could we just take the absolute value? (that is, just ignore the sign?)

Why the squared deviations and not the absolute value?

  • Below we have two data sets made up of four data points each

  • The data in A are more spread out than the data in B

So let’s calculate the average of the squared deviations and the average of the absolute value of the deviations

Why the squared deviations and not the absolute value?

First for the data in A

value  mean  deviation
40     160   -120
140    160   -20
180    160   20
280    160   120
  • The mean of the absolute deviations is: 70

  • The mean of the squared deviations is: 7400

Why the squared deviations and not the absolute value?

And then for the data in B

value  mean  deviation
90     160   -70
90     160   -70
230    160   70
230    160   70
  • The mean of the absolute deviations is: 70

  • The mean of the squared deviations is: 4900

So even though the two sets of data have different amounts of spread, the mean of absolute deviations doesn’t pick it up, but the mean of the squared deviations does
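The comparison between the two data sets can be reproduced with a short sketch. The function names are hypothetical helpers; the values are the ones from data sets A and B above.

```python
def mean(values):
    return sum(values) / len(values)


def mean_absolute_deviation(values):
    """Average distance from the mean, ignoring the sign."""
    centre = mean(values)
    return mean([abs(x - centre) for x in values])


def mean_squared_deviation(values):
    """Average squared distance from the mean."""
    centre = mean(values)
    return mean([(x - centre) ** 2 for x in values])


data_a = [40, 140, 180, 280]
data_b = [90, 90, 230, 230]

print(mean_absolute_deviation(data_a), mean_absolute_deviation(data_b))  # 70.0 70.0
print(mean_squared_deviation(data_a), mean_squared_deviation(data_b))    # 7400.0 4900.0
```

Squaring gives extra weight to the points that sit far from the mean, which is why the two data sets are told apart by the squared deviations but not by the absolute ones.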

The relationship between samples and populations

Now that we have tools for describing the centre / typical value of a set of measurements (mean) and the spread of a set of measurements (variance / standard deviation) we can put these two ideas together.

  • In Lecture 6 we saw that individual sample means were spread out around the population mean

  • We can quantify that spread using the idea of the standard deviation


But we’re now no longer calculating the spread of our sample or even the spread of the population

We’re now calculating the spread of sample means around the population mean

This kind of standard deviation has a special name. It’s called:

The standard error of the mean

The standard error of the mean

  • The standard error of the mean in technical terms is the standard deviation of the sampling distribution of the mean

  • To fully appreciate the concept of the standard error of the mean we’ll need to understand the concept of the sampling distribution

  • And to understand the sampling distribution we’ll first need to understand what distributions are, what they look like, and why they look the way they do

But before we get to that, I want to return to the problem I left you with last week

The standard error of the mean

  • Last week I got you to start thinking about the problem of how you can know how close sample means will be to the population mean on average

  • You should now be able to recognise that this question is about the standard deviation of sample means around the population mean

  • Or more technically, it’s a question about the standard error of the mean

Without me telling you how to work out the standard error of the mean, can you work out what its formula might be?

To do this, I want you to think of two scenarios where the standard error of the mean will be very small:

  • One of these scenarios has something to do with the nature of the samples you’re collecting

  • The other scenario has something to do with the nature of the population you’re sampling from
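One way to explore these two scenarios is with a simulation. The sketch below does not give the formula. It only measures how spread out the sample means are around the population mean for whatever settings you choose. The function name `sd_of_sample_means`, the normal population, the population mean of 100, and all the default settings are assumptions made for illustration.

```python
import random
import statistics


def sd_of_sample_means(sample_size, pop_sd, pop_mean=100.0, n_samples=2_000, seed=3):
    """Spread of simulated sample means around the population mean."""
    rng = random.Random(seed)  # fixed seed so results are reproducible
    sample_means = []
    for _ in range(n_samples):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        sample_means.append(statistics.fmean(sample))
    # Standard deviation of the sample means, measured from the population mean.
    return statistics.pstdev(sample_means, mu=pop_mean)


# Try changing one setting at a time and watch what happens to the spread.
print(sd_of_sample_means(sample_size=5, pop_sd=15.0))
print(sd_of_sample_means(sample_size=50, pop_sd=15.0))
print(sd_of_sample_means(sample_size=5, pop_sd=3.0))
```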