class: center, middle, inverse, title-slide

# Lecture 8: Distributions, functions, transformations
## A little maths goes a long way
### Dr Milan Valášek
### 16 November 2020

---

## Overview

[**The shape of things**](#3)

- Histograms
- The normal curve

[**Transformations**](#10)

- Functions
- The _z_-transform

[**Comparing things with maths**](#21)

- Comparing groups
- Comparing scores across groups
- Comparing scores across variables

---

## The shape of things

For the purpose of this lecture, we will only be talking about _continuous_ variables!

- The vast majority of the measured heights are roughly in the 155-175 centimetre range
- The distribution is roughly symmetrical around its mean and has the bell shape characteristic of a _normal distribution_
- The shape isn't perfectly smooth in a finite sample

![](slides_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## Histograms

- Height is a continuous variable so no two people are **exactly the same** height
- To plot the variable on a histogram, we have to sort the values into _bins_
- Each bar on the histogram represents the number of people whose height falls within a given range

![](slides_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Ideal curves

- If we could collect an infinite number of observations, we could make the bins _infinitely_ narrow
- This would give us an idealised shape of the normal distribution: **the normal curve**.
- **Because we will mostly be talking about continuous normal variables, we can visualise them as this kind of curve**

<img src="./assets/hist.gif" width="70%" />

---

## The normal distribution

- We can describe key properties of a variable using measures of _central tendency_ and _spread_
- In a normally distributed variable, **the majority** (about 68%) of all the values are concentrated **within ±1 standard deviation of the mean**
- The larger the standard deviation, the more spread out the variable is

---

## The normal distribution

<iframe class="app" src="https://mivalek.github.io/viz/norm_dist.html" height=550px scale=70%></iframe>

---

## The normal distribution

- Mean and standard deviation are **independent of one another**
- Neither shifting the mean nor changing the standard deviation of a distribution changes its _fundamental shape_
- **The relative position of the individual points on the line with respect to each other does not change**!
- It is still true that about 68% of values are within ±1 standard deviation of the mean

---

## Same shape, different scale

<br>

![](slides_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

## Transformations

(From now on we'll be talking about **sample mean**, `\(\bar{x}\)`, and **sample standard deviation**, `\(SD\)`)

- How do we change `\(\bar{x}\)` and `\(SD\)` without changing the shape of the variable?
- Changing the values of only a selection of observations will alter the shape of the distribution - _not good_!
- We can decide to switch our measurement unit of height from centimetres to feet and inches but we have to do it **consistently for all observations**
- This _preserves the relationships between individual observations_!

---

## Functions

Let's play a game!
---

## Functions

- **CONGRATS!** You have just discovered the _identity_ function: `\(f(x) = x\)`
- A transformation is just a mathematical function that takes an input and returns an output (just like a function in `R`)
- For example, the _second power_: 2<sup>2</sup> = 4, 3<sup>2</sup> = 9, 4<sup>2</sup> = 16 and so on
- We can think of this operation as a function that takes an input, `\(x\)`, and returns the output `\(x^2\)`.

`$$f(x)=x^2$$`

---

## Graph of _f(x)_

<iframe id="transform" class="app" src="https://mivalek.github.io/viz/transform.html" height=560px></iframe>

---

## Centring and scaling

- Addition **shifts** the values of `\(x\)` up and down along the y-axis, **while keeping the distances between points unchanged**
- Multiplication **spreads or "squishes"** the values of `\(x\)` along the y-axis
- When addition and multiplication are applied to variables, they are referred to as **centring** and **scaling**, respectively.

---

## Centring

- Centring is the **subtraction** of a fixed value from each observation of a variable
- You can technically centre a variable by subtracting _any_ value from it but the most frequently used method is **mean-centring**:

`$$f(x) = x - \bar{x}$$`

- Mean-centring **does not alter the shape of the variable, nor does it change the scale at which the variable is measured**

![](slides_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

## Scaling

- Scaling is the **division** of each observation of a variable by a fixed value
- This has the effect of stretching or squishing the entire variable _in the direction of the x-axis_
- The most common way of scaling a variable is dividing it by its **standard deviation**:

`$$f(x) = \frac{x}{SD(x)}$$`

![](slides_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

## The _z_-transform

<iframe id="z-transform-app" class="app" src="https://mivalek.github.io/viz/std_norm.html" height=560px></iframe>

---

## The _z_-transform

- First mean-centring and then scaling a variable by its _SD_
- AKA, **standardisation**.
`$$z(x) = \frac{x - \bar{x}}{SD(x)}$$`

- The shape of the variable remains intact and the relative differences between any two values in the variable are preserved
- **Standardisation is a linear transformation** (like addition and multiplication)

![](slides_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

### _z_-scores

- Values of a standardised/_z_-transformed variable
- **Distance from the mean in units of standard deviation**.
- This interpretation is **independent of the actual value of _SD_** in the original variable!
- A person with a _z_-score of 1 will be _one_ _SD_ _taller than average_: 164.98 + (1 × 6) = 170.98 cm.
- Someone with a _z_-score of −0.8 will be 0.8 _SD_ **shorter** than the average person in the sample: 164.98 + (−0.8 × 6) = 160.18 cm.

---

<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800" height="600" frameBorder="0"></iframe>

---

## Comparing groups

We can compare groups by asking how different the groups are _on average_.

<img src="./assets/hist_two_groups.png" width="70%" />

`$$\begin{aligned}diff_\text{height}&= \bar{x}_\text{w} - \bar{x}_\text{nb}\\&=164.98 - 170.74\\&=-5.76\end{aligned}$$`

---

## Comparing across groups

Nyari is a 172 cm tall woman; Karim is a 179 cm tall non-binary person.

What if we wanted to know how their heights compare **relative** to their groups/populations?
We can use _z_-scores: `\(z(x) = \frac{x - \bar{x}}{SD(x)}\)`

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> \(\bar{x}\) </th>
   <th style="text-align:center;"> \(SD\) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Women </td>
   <td style="text-align:center;"> 164.98 </td>
   <td style="text-align:center;"> 6.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Non-binary </td>
   <td style="text-align:center;"> 170.74 </td>
   <td style="text-align:center;"> 7.74 </td>
  </tr>
</tbody>
</table>

```r
(172 - 164.98) / 6 # Nyari
```

```
## [1] 1.17
```

```r
(179 - 170.74) / 7.74 # Karim
```

```
## [1] 1.067183
```

---

## Comparing across variables

- We could use the same principle to compare values **of variables measured on any scale**
- Nyari earns £38,400 per year here in the UK
- She just got a job offer in Germany with an agreed salary of EUR 4,270 per month.
- Is she going to be relatively better off if she takes the job?
- Average _annual_ wage in the UK is £37,428 (_SD_ = 4,266)
- Average _monthly_ wage in Germany is EUR 3,880 (_SD_ = 351.6)

```r
(38400 - 37428) / 4266 # Nyari's UK salary z-score
```

```
## [1] 0.2278481
```

```r
(4270 - 3880) / 351.6 # Nyari's German salary z-score
```

```
## [1] 1.109215
```

---

<iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800" height="600" frameBorder="0"></iframe>

---

## Recap

- We often think about the distributions of variables in terms of the normal curve
- Mean and _SD_ reflect the position and spread of this curve
- **Transformations** are mathematical functions we can use to manipulate variables
- Some transformations, such as centring or scaling, don't change the relative distances between individual values of a variable.
- These are **linear** transformations
- Others, such as _exponentiation_ (_e.g._, x<sup>2</sup>), do change the proportions of the transformed variables
- These are **non-linear** transformations

---

## Recap

- The _z_-transform, AKA **standardisation**, is a two-step transformation consisting of _first_ mean-centring the variable and then scaling it by its _SD_
- It converts the values of any variable into units of _how far the value is from the mean of the whole variable in terms of numbers of standard deviations_
- We can compare group averages by _subtracting the means of the groups_
- We can use _z_-scores to compare values of variables **measured on different scales or in different units**

---

class: last-slide

<br><br><br><br><br>

# And that's it!