The Analyst's Toolkit: Standard Deviation

From time to time, you will hear people using the phrase 'standard deviation' in a variety of contexts. It can appear to give a seal of statistical credibility to whatever it is that they are saying. This article explains what Standard Deviation actually is and how to calculate it.

The simplest way of visualising Standard Deviation is to see it as a measure of the spread of a bunch of data: Whether the data is all tight together, like a bunch of asparagus. Or whether it is spread out flat like an omelette

The most complicated way of visualising Standard Deviation is to present it as an over-decorated mathematical equation. You will find lots of examples of this. It can be quite intimidating at first.

Don't be put off though. It's nothing like as complicated as they want you to believe

To show what Standard Deviation is, we need some data to play with. Here is a block of data from my world. It doesn't really matter what it is. Pick a block of your own if you want.

At first glance, it is hard to make any real sense of this data. We need some summary measures to help describe it more compactly.

The most obvious ones to look at are: How many items of data do we have, what is the highest value, what is the lowest value, and what is the 'average' value. You can work these out manually from the illustration above. I'm using Excel functions to speed this up a bit

Average

We could get into a whole separate discussion about what the best type of average would be. But we are mostly talking here about the spread of the data. So we will put medians etc on one side for another day.

The average used here is the 'traditional' average, also known as the 'mean' or the 'arithmetic mean'.

This is simply the total of all the individual bits of data divided by the total number of bits of data

For my example data this is 23,088.1 divided by 168 which equals 137.4

We could make this look much more complicated and impressive as follows:

You will see variations of this around.

The 'x' with the line over it is called 'x-bar' which is Martian for 'the mean'.

The capital Greek 'sigma' symbol ∑ just means 'sum' or 'total of' (you will see this in parts of Excel as a symbol indicating the SUM() function)

n is simply the total number of items of data, in my example 168

x is any actual item of data and the little subscript i refers to the position (order number) of any individual bit of data in the whole list

So the formula translates into:

Total up all the individual data items from the first to the 168th item (i.e. all of them). Then divide this by the total number of items

Spread

Ok so we've got the average, the maximum and the minimum

We can calculate the 'Range' by subtracting the minimum from the maximum:

Range = maximum - minimum

Range = 160.5 - 115.7 = 44.8

Surely this is enough to describe the extent of spread of our data? Well not really. While we have the width and the positions of the two extreme edges of the spread, we still don't know whether our data is spread evenly across this range (omelette) or whether it is mostly bunched up tightly around the average (asparagus)

What we need to do is to look at how far each individual bit of data item is away from the mean and then find a way to summarise that

Deviation

We can calculate the Deviation from the Mean (or simply the 'Deviation') for each value by subtracting the Mean from it. This gives a set of positive and negative smaller values. (For no obvious reason. statisticians sometimes refer to these calculated deviations values as 'residuals')

The negative Deviations are all cases where the original value is smaller/lower than the mean, The positive Deviation values are all cases where the original values are larger/higher than the mean

So having calculated all the individual Deviations, surely all we have to do now is to calculate an Average Deviation. And this will provide a measure for the spread of the data?

Unfortunately there is a snag......

If you try to calculate the Average Deviation by totalling up all the individual Deviations and then dividing by the Total number (still 168) something annoying happens: All the positive and negative Deviations cancel themselves out. So the Total Deviation is zero and the Average Deviation becomes zero too.

This is no use. So we need to find a way to discount the effect of the negative signs in half the Deviation values.

Mean Absolute Deviation (MAD)

One method we could use is to simply ignore all the negative signs and average the 'absolute' values of each Deviation. This is relatively easy to do (you can the Excel ABS() function if you are wanting to follow this step by step) and gives a type of Average Deviation called the Mean Absolute Deviation - also referred to as 'MAD'

We can calculate the MAD value for our sample data. The total of the absolute values of all the Deviations is 1592.7 divided by 168 gives a MAD value of 9.48

Excel provides a function for calculating MAD values, cleverly disguised under the title AVEDEV()

Square and Un-square

There's another way to get rid of the negative signs. This is based on the fact that the square of any negative number is always a positive number.

So the square of 4 is 4 x 4 = 16

And the square of -4 is -4 x -4 =16 too

So if you Square a negative number and then 'Un-square' it again you end up with a positive number. Mathematicians prefer to call the 'Un-square' of a number the 'Square Root' or even just the 'Root'

So the Square Root of the Square of 4 is 4

And the Square Root of the Square of -4 is also 4

Statisticians find this slightly more complicated way of dealing with negative numbers more attractive (they have plenty of time on their hands). So they use this trick to create another type of Average Deviation from the Mean - the 'Standard Deviation'

Calculating Standard Deviation

The essence of the calculation is

Work out the Mean
Work out all the Deviations from the Mean
Square all the Deviations
Calculate the Average of the Square of the Deviations
Un-square it again

We had already got as far as completing Step 2.

Step 3: let's calculate all the Squares of the Deviations from the Mean:

So that's got rid of all the negatives - as we hoped. Notice that some the numbers are quite big - looking even bigger than the original data - and some are quite small. There are even a few 0.0 values appearing. Don't worry - this is all just the behaviour of numbers when multiplied by themselves. The 0.0 values are not actually zero, they are just very small numbers which do not show up when the number is written to just one decimal place

Step 4 is then to calculate the average of the Squares of the Deviations. So this will be their total divided by the number (still 168)

The Total of the Squares of the Deviations = 23,345.7

The Average of the Squares of the Deviations = 23,345.7 divided by 168

The Average of the Squares of the Deviations = 138.96

The value of the Average of the Square of the Deviations also has an official statistical name - the 'Variance' (not to be confused with this same term as used by either accountants or lawyers)

Note that Excel provides a function DEVSQ() which can calculate the total of the Squares of the Deviation (23,345.7 in our example) in a single step

Step 5 - the final step - is to Un-square the previous number

The Square Root of 138.96 = 11.8

This is the Standard Deviation

The Standard Deviation can also be described as the Square Root of the Variance. And the Variance can also be described as the Square of the Standard Deviation. Try this if you want to baffle people.

If you thought that was a bit long-winded and time consuming, don't worry, you will probably never actually have to do this in full. Excel has a function that calculates this is in a single move. And many pocket calculators have had short cut ways of doing this since the 1970s

Ok, so what?

So we now have two different versions of an Average Deviation from the Mean to go with the summary measures we calculated earlier. And we have a thing called the Variance which sounds as if it should be useful but looks rather big:

Remember that the Variance is the calculation before it is un-squared. So this means that it is a different type of measurement and cannot be directly compared to the max, min, average or range.

But the Variance still gives a measure of the spread - or 'dispersion' - of the data. The smaller the Variance, the more tightly bunched the data is around the Mean (asparagus). And the bigger, the more spread out (omelette). A slight issue with the Variance is that because it uses squares, it is distorted by data at the extremes of the range.

Both the Standard Deviation and the MAD (mean absolute deviation) can be compared directly to the Average, and to the max and min. And their size also gives a direct measure of the spread (dispersion) of the data

There is a 'rule of thumb' called the 'Range Rule' (or 'Range Rule of Thumb') which says that the Range is about four times the Standard Deviation. That seems to be the case with our illustration data. 44.8 is roughly four times 11.8 (but not exactly - hence 'rule of thumb' - some days I roughly get to work on time)

Our data can be described as having an Average (Mean) of 137.4 with a Standard Deviation of 11.8.

These are absolute numbers. If you wanted to do a quick comparison with another different block of data then some kind of relative measure would be helpful

If we divide the Standard Deviation by the Mean we will get a ratio (which can also be expressed as a percentage). In our example

11.8 divided by 137.4 then times 100 = 8.6%

i.e. our Standard Deviation is 8.6% of our Mean

This measurement has an official statistical name too - the 'Coefficient of Variation'

This is often abbreviated to 'CV'. So as with the other measures, it gives an indication of the dispersion (spread) of the data. A high CV indicates dispersed data (omelette) and a low CV indicates compact data (asparagus). Ours seems quite low - so definite asparagus tendencies (but you will need to compare it to other blocks of data before you can come to this conclusion).

So one important use for Standard Deviation is as a stepping stone to calculating the Coefficient of Variation which in turn allows the 'flatness' in two different blocks of data to be compared

Time for a break

Reward yourself with a coffee - if you've got this far you have earned it

We have entered the world of the statistician and managed to follow everything without even a Martian-to-English phrasebook. Once translated into plain English their obscure gibberings suddenly make sense.

They have tried to baffle us with jargon such as

Variance
Standard Deviation
Mean Absolute Deviation
Coefficient of Variation

but we know that these are just Martian phrases for some fairly simple ideas

What next?

There are a few things still to look at:

Understanding why there are different versions of Standard Deviation and Variance and when to use each

Understanding how Standard Deviation relates to probability and a relatively abnormal thing known as the Normal Distribution

Understanding how to put Standard Deviation to practical use, such as in the special types of run charts used in Statistical Process Control (SPC). These are really useful as they help us decide whether changes are real or imaginary

The Analyst's Toolkit

Friday, 23 March 2018

Standard Deviation

No comments:

Post a Comment

Crash Course in SQL Part 2: SELECT