Saturday, 31 March 2018

Medians, quartiles, centiles and all that: An introduction to quantiles

Most of us get to hear about 'medians' fairly soon after we start dabbling with data. The median is just one member of a family of measures called 'quantiles'. We regularly bump into quantiles and the rather quaint-sounding names they usually go by.

A lot of these names sound as if they come from Arthurian mythology, Chaucer or somewhere else in the Middle Ages - a kind of Medieval Martian language surviving into modern times, and used mainly to baffle us.




The idea behind a quantile is a simple one. It is best visualised as a 'cut point' dividing a block of ranked data into equal parts, each holding the same share of the values.

The median is the single cut point dividing the data into two equal blocks.

The 'terciles' are the two cut points dividing the data into three equal blocks.

The 'quartiles' are the three cut points dividing the data into four equal blocks.





And so on.

Note that the number of blocks of data is always one more than the number of cuts (and be careful not to confuse quantile names with the blocks of data created: strictly, a 'quartile' is a cut point, not a quarter of the data).

Apart from the median, the quantiles which you will probably meet most often are the quartiles and the centiles (also now regularly called 'percentiles', although the 'per-' is not strictly necessary, unless you are a cat).

The quartiles are used a lot for describing the spread of a block of data. They have an associated visualisation - the box plot.
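
The 'one more block than cuts' rule is easy to check for yourself. Here is a minimal sketch in Python using the standard library's statistics.quantiles function. The data values and the helper name cut_points are made up for illustration, and different tools interpolate between values in slightly different ways, so the exact cut points you get elsewhere may not match these.

```python
from statistics import quantiles

# Made-up values, purely for illustration.
data = [7, 1, 12, 3, 9, 5, 15, 11, 2, 8, 4]


def cut_points(values, blocks):
    """Return the cut points dividing the ranked values into `blocks` equal blocks."""
    # statistics.quantiles sorts the values itself and returns blocks - 1 cut points.
    return quantiles(values, n=blocks)


for name, blocks in [("median", 2), ("terciles", 3), ("quartiles", 4)]:
    cuts = cut_points(data, blocks)
    print(f"{name}: {len(cuts)} cut point(s) give {blocks} blocks -> {cuts}")
```

Asking for 100 blocks in the same way gives the 99 centiles.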




The illustration above shows how the quartiles are reflected schematically in the box plot. There are several versions of the box plot, but the reference points are the same. Box plots are more usually set out as columns (vertically) rather than bars (horizontally); conceptually there is no difference.
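
The reference points a box plot is built on (min, Q1, median, Q3, max) can be sketched in a few lines of Python. This is only an illustration: the function name five_number_summary and the data are my own, and it relies on the default interpolation in statistics.quantiles, which may differ a little from what your charting tool uses.

```python
from statistics import quantiles

# Made-up values, purely for illustration.
data = [7, 1, 12, 3, 9, 5, 15, 11, 2, 8, 4]


def five_number_summary(values):
    """Return the five reference points that a box plot is drawn from."""
    q1, q2, q3 = quantiles(values, n=4)  # the three quartile cut points
    return {
        "min": min(values),
        "Q1": q1,
        "median": q2,
        "Q3": q3,
        "max": max(values),
    }


for label, value in five_number_summary(data).items():
    print(f"{label}: {value}")
```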

The 'max' is sometimes referred to as the 'Fourth Quartile' (Q4) and the 'min' as the 'Zeroth Quartile' (Q0). While these terms probably make perfect conceptual sense to a mathematician, I think they are confusing, and probably pretentious, alternatives to 'max' and 'min' - so best avoided.

The span between Q1 and Q3 is known as the 'inter-quartile range', i.e. the two quarter blocks either side of the median, which together hold the middle half of the data.

The median and inter-quartile range are more 'resistant' measures of centre and spread, respectively, than the mean and standard deviation (see separate article on these). 'Resistant' here means less prone to being distorted by values at the extremes of the spread.
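
To see what 'resistant' means in practice, here is a small sketch comparing the two pairs of measures on the same made-up data, before and after the largest value is swapped for an extreme one. The numbers and the helper name iqr are hypothetical, and the quartiles again use Python's default interpolation.

```python
from statistics import mean, median, quantiles, stdev


def iqr(values):
    """Inter-quartile range: the distance from Q1 to Q3."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Made-up values, purely for illustration.
typical = [1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 15]
# Same data, but with the largest value replaced by an extreme one.
with_outlier = typical[:-1] + [150]

for label, values in [("typical", typical), ("with outlier", with_outlier)]:
    print(
        f"{label}: median={median(values)}, IQR={iqr(values)}, "
        f"mean={mean(values):.2f}, sd={stdev(values):.2f}"
    )
```

In this example the median and inter-quartile range do not move at all, while the mean and standard deviation are both dragged upwards by the single extreme value.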

Particular centiles are sometimes used as synonyms for the quartiles and the median:
  • the 25th centile is the same as the first quartile
  • the 75th centile is the same as the third quartile
  • the 50th centile is the same as the median
It is difficult to say whether '25th centile' and '75th centile' are clearer or less clear than 'first quartile' and 'third quartile'. You will have to decide, based on your knowledge of your audience. But I am fairly sure that saying 'the 50th centile' is invariably much less clear than saying 'the median'.
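
If you want to convince yourself that these really are the same numbers under different names, the following sketch computes all 99 centiles and all three quartiles for some made-up data and compares them. It assumes both are calculated with the same interpolation method (here, the default in Python's statistics.quantiles); mixing methods or tools can give small differences.

```python
from statistics import median, quantiles

# Made-up values, purely for illustration.
data = [12, 15, 11, 19, 23, 8, 17, 14, 21, 10, 16, 13, 25, 9, 18, 20]

centiles = quantiles(data, n=100)  # 99 cut points: centiles[0] is the 1st centile
q1, q2, q3 = quantiles(data, n=4)  # the three quartiles

# The k-th centile sits at index k - 1 in the list.
print("25th centile:", centiles[24], " first quartile:", q1)
print("50th centile:", centiles[49], " median:", median(data))
print("75th centile:", centiles[74], " third quartile:", q3)
```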

Other centiles are also used. A good example is in describing the expected growth of children, where a series of 'centile' charts is published and an individual child's measurements are plotted on these.













