Lecture 6

Describing measurements I

Measures of central tendency

Dr Lincoln Colling

31 Oct 2022

Psychology as a Science

Today’s lecture

So far we’ve talked about quantitative methods in the abstract
We’ve said quantitative methods is all about putting numbers to things, but we haven’t talked about what to do with the numbers

In this lecture, and the one that follows, we’re going to start talking about the techniques we can use for describing sets of measurements

We’ll use these tools when we start learning the basics of statistical models, and learn about the sampling distribution

Measures of central tendency

When we have a set of measurements the first thing we might want to do is work out what the typical value is

Figure 1: National average annual salary [source: worlddata.info]

What we mean by typical value is not always clear

What is the typical average income in the set of countries shown in Figure 1?
Lots of countries have an income below USD 30,000, but some have incomes higher than USD 100,000

Measures of central tendency

Maybe we should pick the bracket with the most countries in it?

Then the typical salary is between USD 0 and USD 10,000 per year.

Maybe we should pick the value where half the countries have a lower average salary and half the countries have a higher one?

Then the typical salary is USD 12,855 per year.

Depending on how we define the most typical value, we get different answers.

We’ll cover the three main ways of defining the typical value or average

Together, these ways of describing the typical or average value are known as measures of central tendency.

Mode

The mode is the most frequent value in a set of measurements
For this kind of average the most typical salary is between USD 0 and USD 10,000
The easiest way to spot the mode is to draw a plot like the one we did in Figure 1 and then look for the tallest bar

A set of numbers can have more than one mode. Some examples of this are shown in Figure 2.
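As a minimal sketch of finding the mode in code, the example below uses Python's standard `statistics.multimode`, which returns every value tied for most frequent. The data are made up for illustration and are not the values behind Figure 1 or Figure 2.

```python
from statistics import multimode

# Hypothetical data, invented for illustration only
one_mode = [2, 3, 3, 3, 5, 6]        # 3 occurs most often
two_modes = [1, 1, 2, 5, 5]          # 1 and 5 are tied for most frequent
eye_colours = ["blue", "green", "blue", "brown"]  # categories, not numbers

print(multimode(one_mode))     # [3]
print(multimode(two_modes))    # [1, 5]
print(multimode(eye_colours))  # ['blue']
```

Counting how often each value occurs needs no arithmetic on the values themselves, which is why the same call works on category labels.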

Mode

The mode is the only definition of typical value that works for data that is measured at the nominal/categorical level (see Lecture 4).
When it comes to truly continuous variables, such as height, the mode is often not very informative. Why?
- Because each value in the dataset is probably unique
- For this reason, the mode is rarely used for continuous variables measured at the interval or ratio levels

Average salary is a continuous variable, but we’ve turned it into a discrete variable by placing measurements into discrete ranges. By doing so, we can make the mode useful.
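The binning idea can be sketched as follows. The salaries and the bin width of USD 10,000 are assumptions chosen for illustration; they are not the data behind Figure 1.

```python
from collections import Counter

# Hypothetical salaries in USD (every raw value is unique)
salaries = [2_500, 4_800, 7_900, 9_100, 12_855, 18_300, 31_000, 104_000]
bin_width = 10_000  # assumed bracket size


def bin_label(value, width):
    """Return the (lower, upper) edges of the bracket that contains value."""
    lower = (value // width) * width
    return (lower, lower + width)


# On the raw values the mode is uninformative: nothing occurs more than once
highest_raw_count = Counter(salaries).most_common(1)[0][1]

# After binning, one bracket clearly contains the most measurements
bin_counts = Counter(bin_label(s, bin_width) for s in salaries)
modal_bin, modal_count = bin_counts.most_common(1)[0]

print(highest_raw_count)       # 1
print(modal_bin, modal_count)  # (0, 10000) 4
```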

Median

The median is the middle value where half the measurements are above that value and half the measurements are below
The easiest way to work out the median is to sort our data (e.g., from the smallest to the largest)

Let’s say we roll a 6-sided die 5 times and get the following: 3, 4, 6, 1, and 1

To calculate the median, we’ll do two steps:

Sort the data from smallest to largest: 1, 1, 3, 4, 6
Find the mid-point: We have five observations so the third one in the sorted sequence is the mid-point.

Out of the five rolls the median is 3 (and the mode is 1).
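The two steps above can be sketched in Python using the five rolls from this slide. The variable names are only illustrative.

```python
rolls = [3, 4, 6, 1, 1]

# Step 1: sort the data from smallest to largest
sorted_rolls = sorted(rolls)

# Step 2: with an odd number of observations, take the middle one
middle_index = len(sorted_rolls) // 2   # 5 // 2 = 2, i.e. the third value
median_roll = sorted_rolls[middle_index]

print(sorted_rolls)  # [1, 1, 3, 4, 6]
print(median_roll)   # 3
```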

Median

Figure 3: Average national salary per country sorted from lowest to highest

Because we have an even number of countries in our dataset (78), there are two mid-points.

To get the median annual national salary, we need to find the value half-way between Romania and Venezuela, which is USD 12,855.
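Here is a small sketch of the even-numbered case. The six salaries are hypothetical and were chosen so that the two middle values average to 12,855; they are not the real figures for any country in the dataset.

```python
from statistics import median

# Hypothetical salaries in USD, already sorted (even number of values)
salaries = [4_000, 9_500, 12_000, 13_710, 20_000, 45_000]

n = len(salaries)
lower_mid = salaries[n // 2 - 1]   # third value
upper_mid = salaries[n // 2]       # fourth value
by_hand = (lower_mid + upper_mid) / 2

print(by_hand)           # 12855.0
print(median(salaries))  # 12855.0, the standard library agrees
```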

Median

To be able to calculate a meaningful median, the variable must be measured on at least the ordinal level.
If we had categorical data like eye colour, then it wouldn’t make sense to ask for the median between a set of four blue eyes and three green eyes

Mean

The mean is what most people think of when we talk about the average
You can work out the mean by adding up all the values and then dividing this by the number of values

Mathematically, you can represent this with the formula shown in Equation 1, below:

\[\bar{x}=\frac{\displaystyle\sum^{N}_{i=1}{x_i}}{N} \qquad(1)\]

You can also write Equation 1 as Equation 2

\[\bar{x}={\displaystyle\sum^{N}_{i=1}{\frac{x_i}{N}}} \qquad(2)\]

Equation 1 might be more familiar to you, but I like Equation 2 because it makes it clear that a mean is just a special way of adding up numbers (more in Lecture 8)

Mean

\[\bar{x}=\frac{\displaystyle\sum^{N}_{i=1}{x_i}}{N} \qquad(3)\]

This formula just tells us that the mean ($\bar{x}$) is equal to the sum of all the numbers ($\sum^{N}_{i=1}{x_i}$) divided by the number of values in the dataset ($N$).
We use the symbol $\bar{x}$ (pronounced x-bar) to represent the mean of a sample of data.
We use the symbol $\mu$ (pronounced mew) to represent the mean of the population
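As a quick check that the two ways of writing the formula agree, here is a sketch that computes the mean of the five dice rolls from the median example both ways: summing first and then dividing by $N$, and dividing each value by $N$ before summing (Equation 2). `math.isclose` is used because floating-point division can introduce tiny rounding differences.

```python
import math

rolls = [3, 4, 6, 1, 1]
n = len(rolls)

# Sum all the values, then divide by N
mean_sum_then_divide = sum(rolls) / n

# Divide each value by N, then sum (Equation 2)
mean_divide_then_sum = sum(x / n for x in rolls)

print(mean_sum_then_divide)  # 3.0
print(math.isclose(mean_sum_then_divide, mean_divide_then_sum))  # True
```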

Mean vs Median

Both the mean and median have their advantages and disadvantages

The mean is easier to work with from a mathematical point of view
Means taken from different samples of the same population tend to be more similar to each other than medians taken from those samples (see Figure 4)

Figure 4: Means and medians for 5 dice rolls repeated 100 times
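A simulation along the lines of Figure 4 can be sketched as below. This is not the code that produced the figure; the function name `simulate`, the seed, and the use of the standard deviation to summarise how similar the values are to each other are all assumptions made for illustration.

```python
import random
from statistics import median, stdev


def simulate(n_reps=100, n_rolls=5, seed=1):
    """Roll a die n_rolls times, n_reps times over; return the means and medians."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means, medians = [], []
    for _ in range(n_reps):
        rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
        means.append(sum(rolls) / n_rolls)
        medians.append(median(rolls))
    return means, medians


means, medians = simulate()

# A smaller standard deviation means the values are more similar to each other
print(f"spread of means:   {stdev(means):.2f}")
print(f"spread of medians: {stdev(medians):.2f}")
```

With only 100 repetitions the exact numbers depend on the seed, but the means will typically be less spread out than the medians.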

Mean vs Median

There are also some downsides to the mean relative to the median.

The mean is sensitive to extreme values in a way the median is not
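To illustrate, the sketch below replaces one value in a small made-up dataset with an extreme value and compares how the mean and the median respond.

```python
from statistics import mean, median

# Hypothetical data, invented for illustration only
values = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 500]   # same data, but the largest value is extreme

print(mean(values), median(values))              # mean 3, median 3
print(mean(with_extreme), median(with_extreme))  # mean 102, median 3
```

The mean moves from 3 to 102 because every value contributes to the sum, whereas the median only depends on which value sits in the middle of the sorted data.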

Sample means and population means

So far we’ve just talked about describing the typical value in a set of measurements that we have—our sample
But what we want to do with statistics is to make inferences about populations from the information that we get from samples
If you’re interested in the average height of people in the UK the “easy” way to find an answer to this question is to measure all the people in the UK and then work out the average height

But if you can’t measure everyone in the UK, then what do you do?

You could instead select a smaller group, or subset, of people from the UK, measure the height of people in this group, and then try to use this information to figure out plausible values for the average height of people in the UK.

In this example, the group (or groups) you’re making claims about is the population, and the sample is a subset of this population

Theoretical populations

We often talk about populations as if they’re a set of actually existing things that we can take our sample from—for example, all living humans
But populations don’t have to be sets of actually existing things. Instead, they can be the set of possible things from which our samples can be drawn
Let’s say we want to collect a sample of 2 dice rolls
- To collect our sample, we take a die and roll it twice
- We can then work out the typical value (i.e., the mean) from these rolls

Our sample is the set of 2 dice rolls we’ve collected, but what is our population?

One way to think of our population is as the set of possible outcomes that could occur if we rolled the die twice

Theoretical populations

If our population is all possible rolls of two dice then what is the mean of this population?

We can easily draw out all the possible things that will happen if we roll a die twice:

From this, we can count up how many times we get a total of 2, 3, 4, etc. from two dice rolls

We’d find that 6 sequences lead to a total of 7 (more than any other total)
A total of 7 gives a mean of 3.5 (7 ÷ 2 = 3.5)
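Because the set of possible outcomes is small, we can enumerate it directly. The sketch below lists all 36 ordered pairs of rolls, counts how many sequences give each total, and also averages the per-sequence means over the whole population, which gives the same answer of 3.5.

```python
from collections import Counter
from itertools import product

# Every possible (first roll, second roll) sequence: 6 x 6 = 36 outcomes
outcomes = list(product(range(1, 7), repeat=2))

# How many sequences lead to each total?
total_counts = Counter(first + second for first, second in outcomes)
most_common_total, n_sequences = total_counts.most_common(1)[0]

# Mean of the two rolls for every sequence, averaged over all 36 sequences
population_mean = sum((first + second) / 2 for first, second in outcomes) / len(outcomes)

print(len(outcomes))                   # 36
print(most_common_total, n_sequences)  # 7 6
print(population_mean)                 # 3.5
```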

Theoretical populations

We can work out the population mean of two dice rolls because we know something about the data generating process
Our samples are just a set of instances of data generated by this process
Applying this idea to something like the Stroop task we say that:
- Our population isn’t all living humans but all humans that might have lived, might be living now, and might be living in the future
- Our samples are just instances of data generated by the process that goes on in people’s brains when they do the Stroop task

For the Stroop task (or any other psychological process we might be interested in) we can’t just work out exactly what that data generating process looks like

So we collect samples to try to characterise it

From samples to populations

Let’s say we have defined our population as all people in the UK

Our sample is a subset of this

We really want to know about the population. E.g., What is the average (mean) height of people in the UK?
But all we have is our sample. I.e., The average (mean) height of people in our sample.

If we want to go from our sample to the population then ideally our sample mean should resemble our population mean

But in real-life situations we don’t know the population mean, so how would we know whether our sample mean resembles it?

A sample of samples

Let’s say I happen to know that the population mean for the height of people in the UK is 170 cm (and heights range from about 78 cm to 231 cm)
I can now draw a sample of 50 people from this population and calculate the mean
I can do this over and over and plot each sample mean in Figure 5

import {sample_means} from "@ljcolling/measures-of-central-tendency"
import {viewof replay_mean} from "@ljcolling/measures-of-central-tendency"

sample_means(1200, 300)

Figure 5: Repeatedly sampling our population

viewof replay_mean
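The repeated sampling shown in Figure 5 can be approximated with a short simulation. This is a sketch and not the code behind the figure: it assumes heights are normally distributed around the population mean of 170 cm with a standard deviation of 10 cm (the standard deviation is an assumption, not a value from the lecture), and the function name `draw_sample_mean` is hypothetical.

```python
import random

POPULATION_MEAN = 170  # cm, the value given in the lecture
POPULATION_SD = 10     # cm, assumed for illustration


def draw_sample_mean(rng, n_people=50):
    """Draw one sample of heights and return its mean."""
    heights = [rng.gauss(POPULATION_MEAN, POPULATION_SD) for _ in range(n_people)]
    return sum(heights) / n_people


rng = random.Random(2022)  # fixed seed so the run is repeatable
example_means = [draw_sample_mean(rng) for _ in range(10)]

for sample_mean in example_means:
    direction = "higher" if sample_mean > POPULATION_MEAN else "lower"
    print(f"{sample_mean:.1f} cm ({direction} than the population mean)")
```

Each run of `draw_sample_mean` gives a slightly different value, some above 170 cm and some below it.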

A sample of samples

The sample means don’t always line up exactly with the population mean
Sometimes a sample mean is higher, sometimes it’s lower. Sometimes it’s closer and sometimes it’s further away
Because we don’t know the population mean, we’d never know whether any particular sample mean was close to, far from, higher than, or lower than the population mean

Even though we can’t say that a particular sample is close to the population, there is something else we can say

We can say how sample means will behave on average:

The sample mean will on average be the same as the population mean

This is an idea that we’ll make use of a lot in statistics

But what does it mean?

The average of the sample means

We can treat each sample mean from 50 people as a measurement
As we collect more samples, we average together the sample means

import {sample_means_ave} from "@ljcolling/measures-of-central-tendency"
import {viewof replay_running_mean} from "@ljcolling/measures-of-central-tendency"

sample_means_ave(1200, 300)

Figure 6: Running average of sample means

viewof replay_running_mean

This average of sample means will eventually be the same as the population mean
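The running average in Figure 6 can be sketched in the same way. Again, this is not the code behind the figure: the normal distribution, the standard deviation of 10 cm, the seed, and the function name are assumptions made for illustration.

```python
import random


def running_average_of_sample_means(n_samples, n_people=50, mu=170, sd=10, seed=6):
    """Return the running average of sample means after each new sample."""
    rng = random.Random(seed)
    running = []
    total_of_means = 0.0
    for k in range(1, n_samples + 1):
        sample = [rng.gauss(mu, sd) for _ in range(n_people)]
        total_of_means += sum(sample) / n_people  # add this sample's mean
        running.append(total_of_means / k)        # average of the first k sample means
    return running


running = running_average_of_sample_means(2000)

print(f"after    1 sample:  {running[0]:.2f}")
print(f"after   10 samples: {running[9]:.2f}")
print(f"after 2000 samples: {running[-1]:.2f}")
```

Early values of the running average can sit noticeably above or below 170 cm, but as more sample means are averaged together the running average settles closer and closer to the population mean.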

Looking forward

So far we’ve covered measures of central tendency for samples
And we’ve covered the idea of the sample mean ($\bar{x}$) and population mean ($\mu$)
We saw that although we don’t know whether a particular sample mean is the same as the population mean, we do know that on average they will be the same
In the coming lectures we’ll learn how to describe how spread out our sample is, and how spread out the population is

When we plotted the individual sample means we saw that they were spread out around the population mean

We’ll then put ideas about means and spread together to work out how to quantify this spread

But all that is for later…