Psychology as a Science
So far we’ve talked about quantitative methods in the abstract
We’ve said quantitative methods is all about putting numbers to things, but we haven’t talked about what to do with the numbers
In this lecture and the one that follows, we’re going to start talking about the techniques we can use for describing sets of measurements
We’ll use these tools when we start learning the basics of statistical models, and learn about the sampling distribution
When we have a set of measurements the first thing we might want to do is work out what the typical value is
What we mean by typical value is not always clear
What is the typical average income in the set of countries shown in Figure 1?
Lots of countries have an income below USD 30,000, but some have incomes higher than USD 100,000
Maybe we should pick the bracket with the most countries in it?
Maybe we should pick the value where half the countries have a lower average salary and half the countries have a higher one?
Depending on how we define the most typical value, we get different answers.
We’ll cover the three main ways of defining the typical value or average
Together, these ways of describing the typical or average value are known as measures of central tendency.
The mode is the most frequent value in a set of measurements
For this kind of average the most typical salary is between $0 and $10,000
The easiest way to spot the mode is to draw a plot like the one we did in Figure 1 and then look for the tallest bar
A set of numbers can have more than one mode. Some examples of this are shown in Figure 2.
The mode is the only definition of typical value that works for data that is measured at the nominal/categorical level (see Lecture 4).
When it comes to truly continuous variables, such as height, the mode is often not very informative. Why?
Because each value in the dataset is probably unique
For this reason, the mode is rarely used for continuous variables measured at the interval or ratio levels
Average salary is a continuous variable, but we’ve turned it into a discrete variable by placing measurements into discrete ranges. By doing so, we can make the mode useful.
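A minimal sketch of why binning helps, using made-up salary values (these are not the data behind Figure 1) and an assumed bracket width of USD 10,000. In the raw data every value is unique, so every value ties for the mode; once the values are grouped into brackets, a single modal bracket appears.

```python
from collections import Counter
from statistics import multimode

# Hypothetical salaries in USD, invented purely for illustration
salaries = [4200.50, 7310.25, 8120.00, 12100.00, 15400.75, 31250.10, 104300.00]

# Every raw value occurs once, so multimode returns all of them
raw_modes = multimode(salaries)

# Group values into brackets, labelling each bracket by its lower edge
bracket_width = 10_000  # assumed bracket width
brackets = [int(s // bracket_width) * bracket_width for s in salaries]
bracket_counts = Counter(brackets)

# The modal bracket is the one containing the most values
modal_bracket, modal_count = bracket_counts.most_common(1)[0]

print(len(raw_modes))              # how many values tie for the raw "mode"
print(modal_bracket, modal_count)  # lower edge of the modal bracket and its count
```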
The median is the middle value where half the measurements are above that value and half the measurements are below
The easiest way to work out the median is to sort our data (e.g., from the smallest to the largest)
Let’s say we roll a 6-sided die 5 times and get the following: 3, 4, 6, 1, and 1
To calculate the median, we’ll do two steps:
Sort the data from smallest to largest: 1, 1, 3, 4, 6
Find the mid-point: We have five observations so the third one in the sorted sequence is the mid-point.
Out of the five rolls the median is 3 (and the mode is 1).
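The two steps can be sketched in Python as follows. The five rolls come from the example above. The sixth roll added at the end is hypothetical, and is only there to show what the standard library's `statistics.median` does when there are two mid-points: it returns the value half-way between them.

```python
from statistics import median, multimode

rolls = [3, 4, 6, 1, 1]

# Step 1: sort the data from smallest to largest
sorted_rolls = sorted(rolls)

# Step 2: with an odd number of values, the mid-point is the middle position
middle_index = len(sorted_rolls) // 2
median_by_hand = sorted_rolls[middle_index]

print(sorted_rolls)       # [1, 1, 3, 4, 6]
print(median_by_hand)     # 3
print(median(rolls))      # 3, the library gives the same answer
print(multimode(rolls))   # [1]

# Hypothetical sixth roll: an even number of values has two mid-points (3 and 4)
six_rolls = rolls + [6]
print(median(six_rolls))  # 3.5
```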
Because we have an even number of countries in our dataset (78), there are two mid-points.
To get the median annual national salary, we need to find the value half-way between Romania and Venezuela, which is USD 12,855.
To be able to calculate a meaningful median, the variable must be measured on at least the ordinal level.
If we had categorical data like eye colour, then it wouldn’t make sense to ask for the median of a set of four blue eyes and three green eyes
The mean is what most people think of when we talk about the average
You can work out the mean by adding up all the values and then dividing this by the number of values
Mathematically, you can represent this with the formula shown in Equation 1, below:
\[\bar{x}=\frac{\displaystyle\sum^{N}_{i=1}{x_i}}{N} \qquad(1)\]
This formula just tells us that the mean (\(\bar{x}\)) is equal to the sum of all the numbers (\(\sum^{N}_{i=1}{x_i}\)) divided by the number of values in the dataset (\(N\)).
You can also write Equation 1 as Equation 2
\[\bar{x}={\displaystyle\sum^{N}_{i=1}{\frac{x_i}{N}}} \qquad(2)\]
Equation 1 might be more familiar to you, but I like Equation 2 because it makes it clear that a mean is just a special way of adding up numbers (more in Lecture 8)
We use the symbol \(\bar{x}\) (pronounced x-bar) to represent the mean of a sample of data.
We use the symbol \(\mu\) (pronounced mew) to represent the mean of the population
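As a quick check that the two ways of writing the mean agree, here is a small sketch that reuses the five dice rolls from the median example. It uses exact fractions so that rounding does not get in the way. The variable names are just for illustration.

```python
from fractions import Fraction

rolls = [3, 4, 6, 1, 1]
n = len(rolls)

# Add up all the values, then divide by the number of values
mean_sum_then_divide = Fraction(sum(rolls), n)

# Divide each value by the number of values, then add up the pieces (Equation 2)
mean_divide_then_sum = sum(Fraction(x, n) for x in rolls)

print(mean_sum_then_divide)         # 3
print(mean_divide_then_sum)         # 3
print(float(mean_sum_then_divide))  # 3.0
```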
Both the mean and median have their advantages and disadvantages
The mean is easier to work with from a mathematical point of view
Means taken from different samples of the same population tend to be more similar to each other than medians taken from those samples (see Figure 4)
There are also some downsides to the mean relative to the median.
The mean is sensitive to extreme values in a way the median is not
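A small sketch of this sensitivity, using invented numbers: adding one extreme value drags the mean a long way, while the median barely moves.

```python
from statistics import mean, median

# Made-up measurements, invented for illustration only
values = [20, 22, 25, 27, 30]

# The same measurements with one extreme value added
with_extreme = values + [1000]

print(mean(values), median(values))              # 24.8 25
print(mean(with_extreme), median(with_extreme))  # the mean jumps to about 187.3, the median is 26.0
```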
So far we’ve just talked about describing the typical value in a set of measurements that we have—our sample
But what we want to do with statistics is to make inferences about populations from the information that we get from samples
If you’re interested in the average height of people in the UK the “easy” way to find an answer to this question is to measure all the people in the UK and then work out the average height
But if you can’t measure everyone in the UK, then what do you do?
In this example, the group (or groups) you’re making claims about is the population, and the sample is a subset of this population
We often talk about populations as if they’re a set of actually existing things that we can take our sample from—for example, all living humans
But populations don’t have to be sets of actually existing things. Instead, they can be the set of possible things from which our samples can be drawn
Let’s say we want to collect a sample of 2 dice rolls
To collect our sample, we take a die and roll it twice
We can then work out the typical value (i.e., the mean) from these rolls
Our sample is the set of 2 dice rolls we’ve collected, but what is our population?
One way to think of our population is as the set of possible outcomes that could occur if we rolled the die twice
If our population is all possible rolls of two dice then what is the mean of this population?
We can easily draw out all the possible things that will happen if we roll a die twice:
From this, we can count up how many times we get a total of 2, 3, 4, etc. from two dice rolls
We’d find that 6 sequences lead to a total of 7 (more than any other total)
A total of 7 gives a mean of 3.5 (7 ÷ 2 = 3.5)
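The counting can also be done in code. This sketch lists all 36 equally likely sequences of two rolls of a fair 6-sided die, counts how often each total occurs, and then averages across every sequence to get the population mean. Treating every sequence as equally likely is the assumption about the data generating process here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Every possible sequence of two rolls: (1, 1), (1, 2), ..., (6, 6)
outcomes = list(product(faces, repeat=2))

# How many sequences lead to each total
total_counts = Counter(first + second for first, second in outcomes)
most_common_total, n_sequences = total_counts.most_common(1)[0]

# Population mean: the mean of the two rolls, averaged over all sequences
population_mean = Fraction(
    sum(first + second for first, second in outcomes),
    2 * len(outcomes),
)

print(len(outcomes))                   # 36
print(most_common_total, n_sequences)  # 7 6
print(float(population_mean))          # 3.5
```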
We can work out the population mean of two dice rolls because we know something about the data generating process
Our samples are just a set of instances of data generated by this process
Applying this idea to something like the Stroop task we say that:
Our population isn’t all living humans but all humans that might have lived, might be living now, and might be living in the future
Our samples are just instances of data generated by the process that goes on in people’s brains when they do the Stroop task
For the Stroop task (or any other psychological process we might be interested in) we can’t just work out exactly what that data generating process looks like
So we collect samples to try to characterise it
Let’s say we have defined our population as all people in the UK
Our sample is a subset of this
We really want to know about the population. E.g., What is the average (mean) height of people in the UK?
But all we have is our sample. I.e., The average (mean) height of people in our sample.
If we want to go from our sample to the population then ideally our sample mean should resemble our population mean
But in real-life situations we don’t know the population mean, so how would we know whether our sample mean resembles it?
Let’s say I happen to know that the population mean for the height of people in the UK is 170 cm (and heights range from about 78 cm to 231 cm)
I can now draw a sample of 50 people from this population and calculate the mean
I can do this over and over and plot each sample mean in Figure 5
The sample means don’t always line up exactly with the population mean
Sometimes it’s higher, sometimes it’s lower. Sometimes it’s closer and sometimes it’s further away
Because we don’t know the population mean, we’d never know whether any particular sample mean was close to it, far from it, higher than it, or lower than it
Even though we can’t say that a particular sample is close to the population, there is something else we can say
We can say how sample means will behave on average:
The sample mean will on average be the same as the population mean
This is an idea that we’ll make use of a lot in statistics
But what does it mean?
We can treat each sample mean from 50 people as a measurement
As we collect more samples, we average together the sample means
This average of sample means will eventually be the same as the population mean
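The idea can be sketched with a small simulation. The population mean of 170 cm and the sample size of 50 come from the example above. Everything else is an assumption made for illustration: heights are drawn from a normal distribution, the standard deviation is set to 10 cm, and 2,000 samples are collected. The running average of the sample means starts out wherever the first sample happens to land and then settles close to the population mean.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch gives the same numbers each run

POPULATION_MEAN = 170.0  # from the example in the text (cm)
POPULATION_SD = 10.0     # assumed for illustration; not given in the text
SAMPLE_SIZE = 50         # people per sample, as in the text
N_SAMPLES = 2000         # assumed number of repeated samples


def draw_sample_mean():
    """Draw one sample of heights and return its mean."""
    sample = [
        random.gauss(POPULATION_MEAN, POPULATION_SD) for _ in range(SAMPLE_SIZE)
    ]
    return mean(sample)


# Treat each sample mean as a single measurement
sample_means = [draw_sample_mean() for _ in range(N_SAMPLES)]

# Average the sample means together as more samples come in
running_average = []
total = 0.0
for count, sample_mean in enumerate(sample_means, start=1):
    total += sample_mean
    running_average.append(total / count)

print(round(running_average[0], 2))   # after one sample
print(round(running_average[-1], 2))  # after all samples
```

Individual sample means still miss the population mean by a centimetre or so in this sketch. It is only their average that homes in on 170.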
So far we’ve covered measures of central tendency for samples
And we’ve covered the idea of the sample mean (\(\bar{x}\)) and population mean (\(\mu\))
We saw that although we don’t know whether a particular sample mean is the same as the population mean, we do know that on average they will be the same
In the coming lectures we’ll learn how to describe how spread out our sample is, and how spread out the population is
When we plotted the individual sample means we saw that they were spread out around the population mean
We’ll then put ideas about means and spread together to work out how to quantify this spread
But all that is for later…