Quantitative research, measurement, and variables

In this lecture, we cover a lot of ground. First, we talk about the difference between theories and hypotheses and how theories make predictions that can be tested by quantitative research. Then we talk about the most important types of quantitative research design and how to categorise them. Next, we discuss measurement and its levels and the topics of validity and reliability. Finally, we talk about variables and their types.

Milan Valášek (University of Sussex)

Quantitative research

Quantitative methodology aims to obtain generalisable knowledge - statements that are not only “true” with respect to a given case under investigation but that can, with some level of abstraction, be applied in general. Unlike qualitative methodology, it relies on measurement, mathematical modelling, and statistical testing to test hypotheses and form theories.

These two approaches are also interested in different kinds of research questions. Quantitative research deals with questions the answer to which can be quantified, such as…

Qualitative research, on the other hand, is interested in those questions that do not have a quantifiable answer. Questions like…

So quantitative research generates knowledge by testing hypotheses and building theories.

Speaking of theories, let’s set the record straight on what this word means and how theories and hypotheses relate to each other.

When speaking about science, the word theory has a technical meaning that is different to the common parlance sense. Usually, when we say: “I have a theory that…” we mean a hunch, an idea, or a conjecture. Now that we’re learning to be #scientists, let’s forget about this meaning. Instead, from now on, let’s only use the word in its technical sense.

A scientific theory is a logically coherent framework, supported by observation, that describes, explains, and predicts some aspect of the natural world.

Let’s take, for example, the theory of evolution as currently understood by science (technically referred to as the modern evolutionary synthesis), as it is arguably one of the best supported and most robust theories people have developed. In simplified terms, it states that the genetic diversity we see in living organisms on this planet is a result of heritable mutations and natural selection.

The box below contains a very short and basic introduction to the theory of evolution. While it is a theory of biology rather than psychology, people are fundamentally biological beings that evolved through the very process the theory describes, so understanding its basics is important for any student of the human mind and behaviour. It also illustrates the relationship between an observed fact and a theory.

Theory of Evolution

Before we talk about some basic principles of the theory, it is important to understand the difference between evolution and the theory of evolution. The technical definition of evolution is the change in the allelic frequency within a gene pool over time. Let’s break this down. You have probably heard of genes before. Genes are just sections of deoxyribonucleic acid (DNA). DNA is found in almost every cell of every living organism, our red blood cells being one exception. It is a chemical that contains the “instructions” for how to build an organism and it is a huge factor in both the structure and function of a living creature. DNA also has the amazing and mind-boggling ability to self-replicate (copy itself) and it gets passed down from parent to offspring via either sexual or asexual reproduction.

Because copying is a complicated process, copies tend not to be perfect and errors creep in. These errors are known as mutations. Most mutations are neutral and won’t change the way a gene (a section of DNA) functions. Some mutations are detrimental and can cause anything from a minor disruption of the structure and function of the gene to the death of the organism. But sometimes, a mutation can change the function of a gene in a way that results in a viable, non-harmful change to the organism. For instance, it can alter the shape or colour of an animal’s body part. If a mutation occurs in a cell that takes part in the creation of offspring (in sexually reproducing organisms, such as humans, these would be sperm cells and eggs), it gets passed down to the next generation. In other words, the offspring will inherit the mutation from its parent.

Sexually reproducing organisms like us inherit half of their DNA from each parent. DNA is organised into several structures called chromosomes. Typically, humans have two sets of 23 chromosomes: one from the egg and one from the sperm that fertilised it. Most of the genes across these two sets of chromosomes will be functionally identical but, because of mutations, there can be two versions of the same gene. These alternative versions are called alleles. For example, you can have an allele that makes your body carry on producing the enzyme lactase even after you are weaned off breast milk. This will make you able to digest lactose, the sugar found in milk, even as an adult. You can, however, have an allele that switches off the production of lactase, making you lactose-intolerant. If you look at the genes in the bodies of all living people, AKA the gene pool, you might find that one of these alleles is found more often, while the other is rarer. The frequency with which an allele occurs in the gene pool is called the allelic frequency. If you look at the representation of different alleles in the population at two different points in time and find that the allelic frequency has changed, then, in the words of Eddie Vedder, that’s evolution baby! (flashing light and GenX music warning for the link)

And with that, we’ve unpicked the definition of evolution: the change in the allelic frequency within a gene pool over time.
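The definition can be illustrated with a toy calculation. Here is a minimal sketch in Python; the allele names and counts below are entirely made up for illustration:

```python
def allelic_frequency(allele_counts, allele):
    """Proportion of a given allele in the gene pool."""
    total = sum(allele_counts.values())
    return allele_counts[allele] / total

# A hypothetical gene pool sampled at two points in time
# (two alleles of the lactase gene: "on" keeps producing lactase in adulthood)
generation_1 = {"lactase_on": 120, "lactase_off": 880}
generation_50 = {"lactase_on": 430, "lactase_off": 570}

f_early = allelic_frequency(generation_1, "lactase_on")   # 0.12
f_late = allelic_frequency(generation_50, "lactase_on")   # 0.43

# The allelic frequency has changed over time - by definition, evolution
print(f"Frequency changed from {f_early:.2f} to {f_late:.2f}")
```

The observed change in frequency is the fact; explaining *why* it changed is the job of the theory, as discussed below.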

Evolution itself is an observed fact; it simply happens (all caveats from lecture 2 apply). But the observation of this fact by itself does not tell us how or why the change in allelic frequency occurs. In order to explain this fact, we need a theoretical framework. The modern evolutionary synthesis (commonly called the theory of evolution) is such a framework. It explains the mechanisms of mutation and inheritance by which genetic diversity is created and it describes the mechanism of natural selection which leads to the change in allelic frequency over time. Simply put, natural selection is a process that preserves beneficial mutations more often than unfavourable ones. If an organism inherits a mutation that makes it live longer or reproduce more often than its relatives without this mutation, then, over a long period of time, there will be more individuals with this mutation than without it.

Which mutations are favourable can be heavily influenced by the environment in which the organism lives. For example, maybe you thought that lactose intolerance is a harmful trait because of the digestive difficulties it causes, and yes, you were right. The way humans ended up living, being lactose intolerant isn’t great. But consider that other adult mammals never drink milk, so they really don’t need their bodies to keep producing an enzyme that is only good for digesting lactose. But, as humans expanded from their east African homeland to inhabit colder regions of central and northern Asia and Europe, they became more reliant on animal milk for nutrition.1 Those people who were able to digest milk had a better source of nutrients and so were healthier than those who couldn’t. And so, over time, the gene variant that preserves the production of lactase after maturity became more widespread in the population of these cold-climate inhabitants.
In other words, the environment exerted selective pressure on the mutations which led to an evolution of a novel trait – adult lactose tolerance. So, in a way, lactose intolerance is the original thang! The rest of us are mutants…

Now that we know what the theory of evolution is and how it describes and explains the observed fact of evolution, we can have a look at how a theoretical framework is able to generate hypotheses that predict future findings.

The theory explains how the diversity in life we can see happens: how it’s possible that a species can, over millions of years, evolve into both lion and dandelion. All living organisms on this Earth share a common ancestor!

Theory and prediction

OK, that’s all extremely cool but sometimes, there may be challenges to a theoretical framework. For example, based on many sources of evidence, we know that humans are a species of great apes and that our closest living evolutionary relatives are chimpanzees and bonobos. However, humans have 23 pairs of chromosomes, while the other great apes have 24 pairs. This is a problem because a chromosome cannot simply disappear from the genome. Such a mutation would certainly make the organism non-viable and thus would never propagate within a population. So, what’s up with that? Well, although a chromosome cannot just disappear, two chromosomes can sometimes fuse into one. And so, if our theoretical framework is correct and humans and other great apes share a relatively recent common ancestor, humans must have a chromosome that’s a result of a fusion of two great ape chromosomes. And as it turns out, this is true: the human chromosome 2 shows clear signs of fusion of two great ape chromosomes, named 2A and 2B after this discovery.

Comparison of human chromosome 2 with chimp chromosomes 2A and 2B; found on Quora


This amazing discovery showcases the predictive power of a good theoretical framework. Not only does it predict an as-yet-unknown finding but, without the theory, it would make no sense to even go looking for a fused chromosome in the human gene pool! In light of the theory, however, it makes perfect sense.

This is but one piece of the huge amount of evidence that supports the theory of evolution. We can leave this topic behind with a map of the prevalence of lactose intolerance in humans across the world taken from an article in the New Scientist, credited to the University of Reading. It nicely illustrates the evolution of lactase production retention in adult humans discussed above.

Research design

So, remember, theories are frameworks that describe, explain, and predict observations. Predictions generated within a framework are called hypotheses. The purpose of quantitative research studies is to test hypotheses using measurement, quantification, and mathematical or statistical modelling.

There are many different kinds of study design suited for different kinds of research questions. Let’s look at a few ways of categorising them.

Presence of manipulation

The first criterion by which we can organise different designs is whether or not we, as the researchers, are manipulating a crucial attribute of the study. Using this criterion, we can divide study designs into experimental and observational.

Experimental design

Imagine you saw a poster advertising a psychological study about reading. It is a call for participants: they are looking for native English speakers between 18 and 60 years of age, with normal or corrected-to-normal vision, and without dyslexia. You meet the criteria so you decide to take part and get in touch with the researchers. They are thrilled that you want to take part and tell you to come to the lab tomorrow at 2pm but to make sure to get a good rest tonight and to have lunch around noon.

Next day, you get to the lab and have a chat with the researcher, who explains what’s going to happen and asks you to consent to taking part in the study. Then, they lead you to a dark room and sit you down in front of a computer screen. They enter a seemingly random number into a box on the screen and leave the room. An instruction appears on the screen telling you that your task will be to focus on the ‘+’ in the middle of the screen. You will be presented with a series of words in different colours and your task is to say aloud the name of the colour in which the given word is displayed, as quickly and accurately as you can. OK, that sounds fairly straightforward.

Here we go!

Click to start/stop


After a minute or two of this, the computer tells you to take a short break and then click the mouse when you are ready for more. When you do, things are a little different…

Click to start/stop


Wow, that was substantially harder! This time, the words that appeared on the screen were colour words but they did not match the colours in which they were displayed. But you powered through, maybe made a few mistakes but that’s OK. The study is now over and the researcher returns to the room for a short chat. They tell you that the purpose of the study was to see how colour recognition can be interfered with by reading. They thank you for your participation and off you go.

On your way home, you stop for a coffee and bump into your friend. You tell them about the study you just participated in. They tell you that they also took part in a very similar study but for them, the words on the screen matched the colour in which they were shown.

You can feel the warmth from the coffee cup spreading through the palms of your hands as the realisation dawns on you: This was an EXPERIMENT!


Let’s break down what happened in this scenario. The researcher wanted to know whether there is a difference in colour-naming performance when the colour words match the colours they are shown in versus when the words and colours do not match. To find out, they designed a study that presents participants with a task (in this case, it was a version of the Stroop task). In this task, participants are presented a sequence of stimuli, the colourful words, and are instructed to respond by naming the colour in which the given stimulus is presented. One stimulus-response pair is called a trial. So, an experimental task consists of trials, where a stimulus is presented and a response is recorded.

What makes this an experiment is the introduction of an experimental manipulation: some people are shown words that match their colours, some people are shown words that do not. The two versions of the task are called conditions. In our case, we can call the conditions congruent (matching) and incongruent (clashing), respectively. This manipulation allows the researcher to compare participants’ performance in the two conditions. Because you and your friend received the same instructions and underwent the same procedure, the researcher can claim that any difference in the responses between participants in the congruent condition and participants in the incongruent condition is caused by the manipulation.
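To make the notions of conditions and trials concrete, here is a minimal sketch of how such stimuli could be generated. This is not the actual experiment software; the colour list is invented for illustration:

```python
import random

COLOURS = ["red", "green", "blue", "yellow"]

def make_trial(condition):
    """Generate one Stroop-style trial: a word and the colour it is shown in."""
    display_colour = random.choice(COLOURS)
    if condition == "congruent":
        word = display_colour  # the word matches its display colour
    else:
        # incongruent: pick any colour word other than the display colour
        word = random.choice([c for c in COLOURS if c != display_colour])
    return {"word": word, "colour": display_colour, "condition": condition}

# A participant in the incongruent condition sees a sequence of such trials;
# the correct response on every trial is the display colour, not the word.
trials = [make_trial("incongruent") for _ in range(20)]
```

A trial is thus just a stimulus (the coloured word) paired with the response it elicits.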

A well-designed experiment is the only kind of research design that can ascertain causality.

However, to make sure we can make claims about a manipulation causing an effect, we need to make sure that the manipulation really is the only thing that systematically differs between the two conditions. In other words, we need to control all external variables.

In our little story, there were several things the researcher did to control external influences on the experiment:

As you can see, designing a good experiment takes a lot of thinking about all the things you need to control!


The scenario above had the researcher type a seemingly random number into a box on the screen and then leave before you started the task. This happened for two reasons. The first one is randomisation. If it were only you and your friend taking part in the study, the researcher wouldn’t be able to say that the differences in your respective performances were caused by the manipulation. It could also be the case that it’s just the individual differences between the two of you that explain why you performed the way you did and it has nothing to do with what the stimulus words actually read. Or the result could be a mixture of the experimental manipulation and the individual differences between you and your buddy. Differences between participants thus introduce another layer of unwanted variability into our tightly controlled experiment. To deal with this, we not only need multiple participants in each condition/group but we also need to make sure we sort them into the groups randomly. If we don’t randomise participant allocation into groups, we might end up with groups that differ on some unknown characteristic which could skew the results of our study. For instance, if we put all blue-eyed people into one group and all brown-eyed people into another group, this would introduce systematic variability into the design. We don’t have to have a good explanation for why people’s eye colour would influence their performance on the task but we cannot rule out that there may be relevant differences between these two groups. That is why it is really important to randomise participant allocation into groups, if possible.

Participant allocation is random if each participant has an equal chance to be in any of the groups.
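A minimal sketch of random allocation, assuming two conditions (the participant labels are made up):

```python
import random

def allocate(participants, conditions=("congruent", "incongruent")):
    """Randomly allocate each participant to one condition.

    Each participant has an equal chance of ending up in either group,
    regardless of eye colour or any other characteristic.
    """
    return {p: random.choice(conditions) for p in participants}

allocation = allocate(["you", "your friend", "p03", "p04", "p05", "p06"])
```

Note that independent coin flips like this give every participant an equal chance of either group but don’t guarantee equally sized groups; in practice, researchers often shuffle a balanced list of condition labels instead, so the groups end up the same size.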

By typing in the code, our imaginary researcher put you into one of the two conditions (congruent vs incongruent) and made the computer run the appropriate version of the task.


While we’re discussing randomisation, notice that when you run the Stroop task demo above, it will be different each time: the order in which the words are presented as well as the colours in which they are shown will both differ. That’s because a well-designed experiment randomises both participant allocation and stimulus presentation.


The second reason for the code is blinding. Participants come in all sorts. Some of them are a little mischievous, most of them want to be helpful. Because of that, it is crucial that participants be kept unaware of the idea behind the study and the hypothesis you are testing. If they knew what results you were hoping for, it could influence their performance. This can happen both wittingly and unwittingly, so it’s best to take precautions. Apart from not telling participants what it is you are hoping to find out, it’s also important that they don’t know what condition they have been allocated to. The code our researcher typed in didn’t convey any information about the condition to you or your friend – you didn’t know whether you’d see congruent or incongruent stimuli. If participants are naïve to group allocation, the study is said to be single-blind.

Our example went further than that, though. Not even the experimenter knew what condition the code would put you in. If neither the participants nor the researcher know which condition the participants are put in, the study design is known as double-blind. Of course, the allocation is recorded but it is only revealed once the study is over and the data are being analysed. The reason to double-blind a study is pretty much the same as for single-blinding. By giving up the knowledge about group allocation, the researcher is ensuring that they will not unwittingly influence the results.
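One way such opaque codes might be prepared is sketched below; the codes and counts are hypothetical:

```python
import random

# Prepare a balanced, shuffled list of conditions and pair each with an
# opaque code. The key is "sealed away" until the data are analysed, so
# neither the participant nor the experimenter running the session can
# tell which condition a code corresponds to - the study is double-blind.
conditions = ["congruent", "incongruent"] * 3
random.shuffle(conditions)

# random.sample guarantees the codes are unique
codes = [str(n) for n in random.sample(range(10000, 100000), len(conditions))]

sealed_key = dict(zip(codes, conditions))  # revealed only at analysis time

# During a session, the experimenter types in only the code:
code_for_next_participant = codes[0]
```

The experiment software looks the condition up internally; the humans in the room only ever see the meaningless number.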


Randomised double-blind experiments are the gold standard of quantitative research in psychology!

Let’s summarise the core characteristics of this design:

It is important to note that the experimental setting with its tight controls is always at least somewhat artificial and presents a kind of abstraction from reality. Just because something is true in the lab, doesn’t necessarily mean it will be true in “the real world”.

Another potential drawback of this design is that it’s very effortful and time-consuming for both the researcher and the participants.

The experimental design provides the most rigorous methodology to investigate causal relationships (yeah, you might have misread that…) between things. Unfortunately, it is not appropriate for all research questions. There can be all sorts of methodological, logistical, and ethical obstacles to randomisation, manipulation, and controls that render designing an experiment infeasible.

Luckily, there are other types of quantitative research design we can pick from.


Quasi-experiment

A quasi-experiment is a study that conforms to all the requirements of the experimental design except for participant randomisation. This kind of study is used in situations where it is not possible to randomise the allocation of participants into groups. To stick with the example of the Stroop task, imagine we want to know whether dyslexia affects the extent to which the incongruent colour words interfere with colour-naming performance. We can simply let every participant complete both congruent and incongruent trials and look at the difference in performance between people with and without dyslexia. What we cannot do, however, is randomise who does and who doesn’t have dyslexia, so our groups are not determined by us, the researchers. In this case, we should make an effort to match our participants on all other relevant characteristics, such as age, vision, native language, perhaps gender. We cannot guarantee that our two groups don’t differ on other things that may or may not be related to dyslexia and so the purpose of this so-called matching is to make sure that the groups do not differ at least on the characteristics we think might be relevant to the research.

Natural experiment

An interesting subset of the quasi-experimental design is the natural experiment. In a natural experiment, the manipulation and randomisation occur not as a result of the researcher’s actions but through some natural or socio-political means. A good example of a natural experiment is the twin study. Identical twins are, biologically speaking, clones and so they share essentially 100% of their genes, while “fraternal” twins only share on average 50% of human-specific genes.2 Both kinds of twins, however, tend to share the same home environment, as they tend to be raised together. By comparing similarities between identical twins and similarities between fraternal twins, researchers can estimate the role of genes and environment in all sorts of things (physical/mental health, personality, cognitive ability, etc.). Other kinds of natural experiments may be made possible by policy changes (smoking bans, length of compulsory education…) or natural events such as pandemics (sorry!).

Quasi-experiments are technically not experiments and so they can’t provide evidence of causality of the same strength as experiments. They are, however, often the best design we can aspire to for questions such as the role of genes vs environment.

Observational design

Observational studies (also called correlational) are essentially the opposite of experiments. They rely on observed data, not manipulated conditions, and their aim is to assess relationships between things rather than talk about causes and effects.

By observation, we don’t mean looking at stuff. What makes data observational is the lack of experimental manipulation. These studies can still use all kinds of measurement, questionnaires, even laboratory tasks.

On the flip side, the potential lack of tight environmental controls means that researchers conducting observational studies are able to collect data under many different circumstances. This makes it easier to collect large datasets. Indeed, there are examples of studies of this kind that gathered data from tens, even hundreds of thousands of participants.

The price for this, however, is that observational studies cannot provide strong evidence of a cause-and-effect relationship. The reason for this is that merely finding a mathematical or statistical relationship between two things does not mean there’s a causal relationship between them. Sure, there can be one but there doesn’t have to be. Take, for instance, the relationship between the number of people who died by drowning in swimming pools each year between 1999 and 2009 and the number of movies starring the wonderful Nicolas Cage in the same time period (source):


Looking at the plot, the relationship is undeniable. But what to make of it? It’s hard to believe that Nicolas Cage, in his nouveau shamanic providence, is moved by people who would eventually end up drowning in a swimming pool to embark on film projects, whose cinematic release would coincide with the untimely demise of these poor souls. The claim that Nicolas Cage’s acting somehow prompts people to drown is somewhat more likely but it’s still a long shot. Perhaps both have the same cause but do not cause one another. After all, making films with Nicolas Cage in them might be indicative of some people having too much money. And maybe with too much money comes too much booze, too many drugs, and too many pool parties… Who knows? Entertaining as it may be to speculate on these far-fetched scenarios, it’s probably far more likely that this is merely a coincidental finding and that the relationship is purely statistical and doesn’t reflect any real relationship between these two occurrences. That’s what people mean when they say that observational data doesn’t indicate a causal relationship!
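We can see how easily the numbers alone can mislead with a quick calculation. The two yearly series below are invented stand-ins (not the actual drowning or filmography figures), deliberately constructed so that they happen to move together:

```python
# Two made-up yearly series for two completely unrelated things
drownings = [109, 102, 102, 98, 85, 95, 96, 98, 123, 94, 102]
cage_films = [2, 2, 2, 3, 1, 1, 2, 3, 4, 1, 4]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson_r(drownings, cage_films)
# A sizeable r only tells us the series co-vary - it is a statement
# about numbers, not about causes.
```

The coefficient comes out strongly positive for these invented numbers, yet nothing about the calculation distinguishes a causal link from a confound or sheer coincidence.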

There are certain circumstances under which these kinds of studies can suggest causation (e.g., a very strong relationship, a lot of additional information about potential confounding factors) but, at least in psychology, this is almost never the case.


The need to resort to observational studies sometimes arises from ethical concerns. You might have heard the claim that cannabis is a gateway drug, meaning that people who smoke marijuana are more likely to use other illegal drugs later in their lives. This is sometimes interpreted as cannabis use causing drug use. Even if the data at face value show a higher prevalence of drug use among people who’ve used cannabis than among those who have never tried it, there can be a number of reasons for this observed relationship, from mental health to socio-economic and socio-political factors. To firmly establish a causal relationship, we would need to assign people of all ages to experimental conditions where they are either required to use marijuana or are prevented from doing so and then those people would need to be closely monitored to see who tries other drugs and who doesn’t. Needless to say, this design would not be likely to receive ethics clearance. There may be much smarter and more sophisticated ways of investigating the gateway hypothesis than merely asking people to report their past and present drug experiences but they are all observational.

Just in case you’re thinking that, according to the criteria for experimental/observational design, quasi-experiments are closer to observational studies than to experiments, you’re not alone; many people count them as a type of observational study.

Within/between subjects

Dividing research based on whether or not there’s a manipulation present in the design is not the only way. Another meaningful way of slicing up our design pie is whether the manipulation or the measurement of interest occurs between groups of participants or within each participant’s data.

Let’s return to our colour-naming experiment. In this example, we manipulated whether the colour words presented were shown in congruent or incongruent colours. This was a between-subjects manipulation, as each participant saw either the incongruent words or the congruent ones. No-one saw both kinds.

However, there was also another kind of manipulation. Recall that the first block of stimuli didn’t actually contain any colour words. For the sake of explanation, let’s call this condition control, in addition to our two experimental conditions (congruent and incongruent). Since each participant first saw the control condition stimuli and then one of the experimental condition stimuli, this was a within-subjects manipulation. The design of the study would arguably be better if the control and experimental trials weren’t grouped into blocks but randomly interspersed with each other. This kind of randomisation would take care of order effects. For example, participants might get tired after the first, control block and be more likely to make mistakes and take longer to respond in the second block just because of this. What we want, however, is to be able to attribute the difference between the performance on control and experimental trials to the experimental manipulation and not to our participants getting tired. The opposite problem would arise if participants simply got better at the task in the second block because they had had time to train themselves in the task. Randomising the stimulus presentation order across within-subjects conditions takes care of confounding factors such as order effects due to training or fatigue.
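A sketch of this interleaving, with hypothetical trial counts:

```python
import random

# Build a trial list that intersperses control and experimental trials
# rather than presenting them in fixed blocks.
control_trials = [{"condition": "control", "item": i} for i in range(20)]
experimental_trials = [{"condition": "incongruent", "item": i} for i in range(20)]

trial_list = control_trials + experimental_trials
random.shuffle(trial_list)  # order effects (fatigue, practice) now average out
```

A participant in the congruent group would get the analogous shuffled mix of control and congruent trials.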

As you can see, one study can easily contain both between- and within-subjects design elements. In that case, the study is referred to as a mixed-design study.

In observational research, a between-subjects study could, for instance, compare people who haven’t used cannabis with those who have. A within-subjects element in a study on the gateway effect could be giving our participants a drug use questionnaire on multiple occasions, for instance every 2 years. This kind of within-subjects design, where we measure the same thing on several occasions, is also known as a repeated-measures design. In our colour-naming study, we record multiple responses to the same kind of stimulus (multiple colour words) and so the response is also a repeated measure.

Time frame

The final way of categorising study design is by the time frame in which the study takes place.


A dude with a lot of money once made himself heard claiming that the reason millennials can’t afford to buy their own homes is that they spend too much on posh coffee and smashed avocado toast. Now, you might think this sounds plausible and you’d be wrong. However, we can find a hypothesis in this statement and think about what kind of study design we could come up with to test it.

Let’s say we want to know whether millennials, a generation of people born between 1981 and 1996 (but splitting people into generations is dumb), really like their avocado toast that much more than other “generations”. So what do we do?

We could just ask people for their year of birth and how much they like avocado toast, using an online questionnaire. We could send this questionnaire out to hundreds, nay, thousands of people and collect their responses. We could then categorise our participants into generations according to their date of birth:

Once we did that, we could look at whether millennials have a higher liking for avocado toast than the other groups.
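A sketch of such a categorisation; only the millennial cut-offs (1981–1996) come from the text above, while the other boundaries are rough conventions that vary between sources:

```python
def generation(birth_year):
    """Map a birth year to a generation label.

    The millennial range (1981-1996) is the one used in the text; the
    remaining cut-offs are common but contested conventions.
    """
    if birth_year <= 1945:
        return "silent or earlier"
    elif birth_year <= 1964:
        return "baby boomer"
    elif birth_year <= 1980:
        return "generation X"
    elif birth_year <= 1996:
        return "millennial"
    else:
        return "generation Z or later"

# Hypothetical questionnaire responses: (birth year, avocado toast liking 0-10)
responses = [(1958, 4), (1985, 9), (1990, 8), (2001, 6)]

by_generation = {}
for year, liking in responses:
    by_generation.setdefault(generation(year), []).append(liking)
```

With real data, we would then compare the average liking across the resulting groups.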

This is an example of a cross-sectional study, a study that takes a “cross-section” of a sample of a population at a single time point to investigate a hypothesis. In such a study, there is no temporal dimension: we are interested in looking at how things are now, in this moment, and not how they develop over time.

The study we just imagined has several issues but an especially pertinent one is that, since we’re talking about differences between generations, a temporal dimension is pretty much inherent in the question we’re trying to explore.

Even if we find that people categorised as millennials really seem to like avocado toast more than those in the other groups, that doesn’t necessarily mean that the liking is somehow characteristic of that generation. It could be the case that people develop a specific hankering for that crispy rye sourdough topped with that green buttery goodness around the time they hit their mid-twenties but, by the time they turn 40, they lose that avo feeling. A cross-sectional design is not able to distinguish between these two hypotheses.


Enter longitudinal design. This kind of design involves repeated measurement of the same characteristic of the same participants at multiple time points. In our millennial example, this kind of study would track people throughout their lives and ask them, maybe every 5 years or so, how much they like avocado toast. After a good few decades of this study, we could resolve the issue of whether the purported larger consumption of this snack by millennials is due to something inherent to this particular generation or whether it’s simply people in their mid-20s to 40s who like it, irrespective of what generation they belong to.

You probably guessed that longitudinal studies are logistically very complex and take a lot of time, money, and effort to run. Just the task of keeping participants engaged in a study for months, let alone years or decades, is a daunting one but people have nevertheless done it. Obviously, such a study would not be run by a single researcher but would involve a coordinated team of people.

Despite these high costs, longitudinal studies are pretty much the only way to study any situation where time is a potentially important factor. That said, given the costs involved, you probably wouldn’t waste your resources on a silly study like the one we painted.

Longitudinal design is often employed in the study of ageing and development but has many other applications and has also been used to investigate, for instance, the gateway drug hypothesis of cannabis!


We’ve been talking about measuring things a lot in this lecture: we said that quantitative research relies on measurement, we talked about measuring participants’ performance, and mentioned things like the repeated measures design. We haven’t, however, really said what we mean by measurement in the context of quantitative research. When we talk about measurement, we don’t just mean gauging dimensions such as height, distance, weight, or time. Sure, all that is definitely measurement but there’s more to measurement than that.

Measurement is the recording of any quantifiable characteristic, behaviour, or attitude. The object of measurement can be pretty much anything: participants, things, even aspects of a study.

In our study examples, we’ve measured many things:

However, to be able to analyse the data from our experiment, we should also measure characteristics of each trial:


As you can see, there are a lot of things to measure but not everything can be measured in the same way.

Levels of measurement

The term levels of measurement refers to the kind of information we are working with when measuring attributes of interest. There are four levels you need to be able to distinguish:


Nominal

At this level, we are dealing only with names or labels. The only information at this level of measurement is group membership: which of the several possible categories the given observation belongs to.

Examples of the nominal level of measurement include things like:

This level of measurement can be considered qualitative and there are no quantifiable comparisons we can make between the individual categories. It makes no sense to say that green is more than blue when it comes to eye colour.


Ordinal

At this level, the individual observations of the measured attribute can be ordered in a meaningful way. For instance, the 2020 edition of Le Tour de France was won by the Slovenian cyclist Tadej Pogačar in an exciting battle against his fellow countryman Primož Roglič. That means that Pogačar finished the three-week-long race first, while Roglič came second. If we wanted, we could take every single athlete who took part in the race and sort them in the order in which they finished it. So, the measurement of position in a race is done on the ordinal level.

This level gives us the means to compare the individual observations in a quantitative way: Pogačar was faster than Roglič. However, it doesn’t tell us anything about the size of the differences. We don’t know how much faster the winner was compared to the runner-up. Importantly, the difference in performance between the 1st and 2nd rider doesn’t have to be the same as the difference between the 2nd and 3rd rider, or between any adjacent pair of riders.
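A quick sketch makes the point concrete. The finishing times below are entirely made up for illustration: ranking throws away the size of the gaps between riders.

```python
# Made-up finishing times in hours; ordinal rank preserves the order
# but discards the size of the gaps.
times = {"Rider A": 87.33, "Rider B": 87.42, "Rider C": 87.43}

# Sort riders by time and assign ranks 1, 2, 3, ...
ranking = sorted(times, key=times.get)
ranks = {rider: i + 1 for i, rider in enumerate(ranking)}

# The rank difference between 1st and 2nd is the same as between
# 2nd and 3rd (one place), but the underlying time gaps are not.
gap_1_2 = times[ranking[1]] - times[ranking[0]]  # ~0.09 h
gap_2_3 = times[ranking[2]] - times[ranking[1]]  # ~0.01 h
```

The ranks alone would suggest three evenly spaced riders; the times tell a different story.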

A classic ordinal-level measure in psychology is the Likert scale often used in questionnaires:

  1. Strongly disagree
  2. Disagree
  3. Neither agree nor disagree
  4. Agree
  5. Strongly agree

We can assign a numeric value to each of these levels (e.g., “Disagree” is 2) but that doesn’t mean that normal arithmetic applies to them. We can’t say that the difference between “Agree” and “Neither agree nor disagree” is the same as the difference between “Disagree” and “Strongly disagree”.
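To illustrate, here is a small sketch of coding Likert responses numerically (the responses are hypothetical). Because the codes only preserve order, order-based summaries such as the median are safe, whereas the mean quietly assumes the categories are equally spaced, which the ordinal level does not justify.

```python
from statistics import median

# Numeric codes for the Likert response options: order is meaningful,
# equal spacing between codes is NOT guaranteed.
likert = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neither agree nor disagree": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

# Hypothetical responses from six participants
responses = ["Agree", "Agree", "Strongly agree",
             "Disagree", "Agree", "Neither agree nor disagree"]
codes = [likert[r] for r in responses]

# The median only relies on order, so it respects the ordinal level
typical = median(codes)
```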


Interval

Unlike the ordinal level, at the interval level of measurement, the differences between pairs of adjacent values are the same. For example, the difference in temperature between 20°C and 21°C is the same as that between 35°C and 36°C. In other words, the intervals marked by the degrees are evenly spaced or equidistant. However, because values measured on the interval level do not have an absolute zero point, we cannot say that, for instance, 40°C is twice as hot as 20°C.
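A quick calculation shows why the “twice as hot” claim breaks down. Converting Celsius (interval, arbitrary zero) to Kelvin (ratio, true zero) changes the ratio completely:

```python
# Celsius has an arbitrary zero; Kelvin has a true (absolute) zero.
def to_kelvin(celsius):
    return celsius + 273.15

ratio_celsius = 40 / 20                       # 2.0 -- looks like "twice as hot"
ratio_kelvin = to_kelvin(40) / to_kelvin(20)  # ~1.07 -- it isn't
```

The same pair of temperatures gives a ratio of 2 on one scale and about 1.07 on the other, so the ratio in Celsius carries no physical meaning.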

A good example of this level in psychology is IQ: you can’t say that someone with an IQ of 200 is twice as smart as someone with an IQ of 100. It just doesn’t make sense because there is no such thing as an IQ of zero.


Ratio

Finally, the ratio level allows for expressing exactly this kind of relationship. At this level, there is a meaningful zero point, so you can say that a person who is 36 years old is three times as old as a person who is 12.

Some other examples of the ratio level of measurement in psychology include


It’s important to realise that the level of measurement is not an inherent characteristic of the measured attribute; it is a characteristic of the measurement itself.

It is possible to measure a single attribute at different levels. Take age, for example. It feels most natural to measure age on the interval level, in terms of years, months, weeks, days, and so on. But recall our avocado toast study. In that design, we measured age as membership in one of the four generations. We could decide to treat this measurement as purely categorical (nominal) and not worry about who’s older and who’s younger. But we could also order the groups along the timeline, which would make the measurement ordinal.
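The same attribute measured at three levels can be sketched as follows. The year of data collection, the generation labels, and their ordering are illustrative assumptions.

```python
# One attribute (age), three levels of measurement.
birth_year = 1990

# Interval/ratio: age in years (meaningful zero, meaningful differences)
age = 2020 - birth_year  # assuming data collected in 2020

# Ordinal: generations ordered along the timeline (oldest to youngest)
generations = ["Baby boomer", "Generation X", "Millennial", "Generation Z"]

# Nominal: just the group label, ignoring any ordering
label = "Millennial"

# The ordinal view lets us make order comparisons the nominal view cannot
is_older_group = generations.index(label) < generations.index("Generation Z")
```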

Validity and reliability

Every time we measure a real-world characteristic, we are using some kind of tool (or measure). It can be anything, really: a tape measure, a stopwatch, a questionnaire, a test, an experimental task, even fingers to count things on. Whatever tool we use to measure the characteristic in question, it is crucially important that it is valid and reliable.


Validity

A measurement tool is valid if it measures that which it is supposed to measure.

Well-calibrated bathroom scales are a valid measure of body weight, if used correctly. The body mass index (BMI), however, is not a valid measure of body weight. It includes body weight in the way it’s calculated, yes, but it confounds weight with height and so we cannot tell from it how much a person weighs. On average, a person with a higher BMI will weigh more than a person with a lower BMI, which is why it’s a reasonable measure with respect to populations but it’s not a good measure for an individual.

People who like to categorise things have come up with quite a few types of validity but for our purposes, it is important to distinguish two meanings of the word:

Construct validity

Psychology often deals with concepts or constructs that cannot be observed directly, such as personality, cognitive ability, mental health, or attitudes. You cannot take a speculum, peer into someone’s ear canal and measure how intelligent they are. Instead, we developed tools, such as questionnaires, cognitive tests, diagnostic methods, and experimental tasks to gather information about these unobservable things. Construct validity is the extent to which a tool can be justifiably trusted to actually measure the construct it is supposed to measure.

To extend the themes of lecture 2, it is important to realise that the measurement tools we develop are informed by the theoretical underpinnings of the constructs they are designed to measure. There is no such thing as an atheoretical measurement instrument.

External validity

While construct validity is a characteristic of the measurement instrument, external validity is more a feature of study design. A study has external validity if its findings can be applied to the entire population of people with relevant characteristics and if they hold up in real-world conditions. By population with relevant characteristics we mean the group of people the study claims to be studying. If a study is exploring some mechanism in, say, white women in western cultures, then it has external validity if its findings apply to white women in western cultures in general, and not just the select sample it was studying. If, however, the study claims to apply to all people, while only looking at white women in western cultures, its claim to generalisability is likely to be questioned.

The applicability of study findings in real-life conditions is referred to as ecological validity. Experiments are often accused of having low ecological validity due to their artificial scenarios and tightly controlled conditions. And it is true that just because something is true in the lab doesn’t automatically mean it is going to be true in the real world. Having a vaccine that is effective against a virus (sorry!) in a petri dish doesn’t mean it is an effective and safe drug for use in humans. After all, that is why developing a drug is such a long, laborious, and expensive process.


Reliability

An instrument is reliable if it produces accurate and stable results. Imagine you and I take a statistics test. If your knowledge and understanding of stats is better than mine, you would expect to score higher on the exam than me. If so, the exam score is an accurate reflection of the stats ability the test is assessing. Likewise, if you took the test on two separate occasions, let’s say 6 months apart, and you didn’t learn or forget any statistics in the meantime, you would expect to get pretty much the same score on both occasions. This characteristic of stability over time is called test-retest reliability.


As a real-life example, let’s take the implicit association test (IAT), an experimental task created with the aim of measuring people’s implicit bias in favour of or against a certain demographic characteristic (ethnicity, gender, etc.) by measuring the differences in reaction time when associating positive or negative words with members of different demographic groups.

This method has been criticised on the grounds of both reliability and validity. First of all, its test-retest reliability seems poor: taking the same test twice often results in different “bias” scores. The test also has questionable construct validity. There really isn’t a good theoretical reason why a person’s implicit bias against a particular demographic would be reflected in millisecond-level differences in reaction time on an association task. Finally, even if the test had reasonable construct validity, it is not clear why performance on this laboratory task should in any way translate to a person’s implicit attitudes, let alone their behaviour.


Variables

The point of measuring characteristics or attributes is to assess the value (e.g., amount, number) of the given characteristic for each subject of interest (participant, trial…). This is a trivial statement: if we knew someone’s height, we wouldn’t need to measure it. But why do we not know the value of the attribute for every subject? Well, that’s easy: because the value changes from case to case! It… varies!

A variable is a characteristic whose value varies within a set of subjects of interest.

This definition means that pretty much anything can be a variable. Prior experience of drug use? Well, not everyone has the same drug history, right? Variable!

The group to which a participant belongs? Variable!

Time point of measurement in a longitudinal study? Variable!

The number of times you’ll have to re-read this section for it to start making sense? vARiAblE!


It’s important to also appreciate that whether or not something is a variable is independent of whether or not it is being measured! Even things that we don’t measure are variables. In fact, these hidden or unobserved or external variables are the source of much methodological headache because they may or may not have an impact on the results of studies. Think back to the example from lecture 2 where nutrition was not considered relevant for IQ: arguably a problematic omission…

For the purpose of research, it is useful to divide variables into two groups: predictor variables and outcome variables.

Predictor variables are variables used to model or predict outcome variables. In our gateway drug hypothesis example, cannabis use was the predictor variable, while the use of other illegal recreational drugs was the outcome. In the avocado toast example, the generation to which a participant belonged was the predictor, and the liking of avocado toast was the outcome.

In our experimental scenario, condition was the predictor variable and performance on the Stroop task (accuracy or reaction time) was the outcome. In experimental design, we manipulate the predictor variable to elicit an effect in the outcome. The way the predictor changes is not dependent on the outcome, but the way the outcome variable (performance) changes is intended to be dependent on the predictor (condition). In other words, we design experiments such that we can say that the experimental manipulation causes the change in the response. For that reason, predictor variables IN EXPERIMENTS are called independent variables, while outcome variables are referred to as dependent variables.
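As a minimal sketch of how these roles might look in data, consider the layout below. The participants, conditions, and reaction times are entirely made up for illustration.

```python
# Each trial records the predictor (condition) and the outcome (rt_ms).
# All values here are hypothetical.
trials = [
    {"participant": 1, "condition": "congruent",   "rt_ms": 612},
    {"participant": 1, "condition": "incongruent", "rt_ms": 734},
    {"participant": 2, "condition": "congruent",   "rt_ms": 590},
    {"participant": 2, "condition": "incongruent", "rt_ms": 701},
]

# A crude comparison: mean reaction time per level of the predictor
def mean_rt(condition):
    rts = [t["rt_ms"] for t in trials if t["condition"] == condition]
    return sum(rts) / len(rts)

# The effect of the manipulation shows up as a difference in the outcome
diff = mean_rt("incongruent") - mean_rt("congruent")
```

Here condition plays the role of the independent variable and reaction time the dependent variable; the analysis asks whether changing the former shifts the latter.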

Observational research doesn’t allow us to make strong causal claims and so we stick to the terms predictor and outcome. In fact, these are basically roles we ascribe to variables, and this happens based on theory. A variable that is the predictor in one theoretical model can be the outcome in another. If a theoretical framework is missing, it may not even be clear which variable fulfils which “role”. Take, for instance, the tongue-in-cheek example of swimming pool drownings and Nicolas Cage films. We can look at the relationship between these two variables but it’s not clear which one should be considered the predictor and which one should be the outcome.

Variables of everyday

While we often talk about variables in the context of maths, stats, and research, it is best to realise that thinking about varying characteristics is just a way of systematising the world around us.

As an example, take this selection of books of fiction plucked from my bookcase: