Hi,
Between 1999 and 2009, the number of people who drowned by falling into a swimming pool (in the US) correlated with the
number of movies Nicolas Cage appeared in, with a correlation coefficient of 0.67. But there's no causal link between the two.
Nicolas Cage wasn't actually going out and drowning people to celebrate his latest film roles, and he wasn't awarded film roles based on US drowning statistics.
This is simply an example of spurious correlation - two variables that just happen to be correlated, even though one has no bearing on the other.
It's also been observed that the number of drowning deaths
increases along with the consumption of ice cream. And it would be easy to dismiss this as another example of spurious correlation.
After all, ice cream consumption doesn't influence drowning deaths any more than Nicolas Cage does.
Yet, dig deeper and you'll
find that both are influenced by a third factor - heat.
In warmer months, people consume more ice cream and also spend more time around swimming pools.
There may not be a direct causal link between ice cream consumption and drowning deaths, but causality is
still there: heat drives both.
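If you'd like to see this for yourself, here is a minimal sketch in Python using entirely made-up numbers (the temperatures, slopes and noise levels are my assumptions for illustration, not real data). It simulates heat driving both ice cream sales and drownings, then compares the raw correlation with the correlation left over once heat has been regressed out of both variables.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after a straight-line least-squares fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data: heat is the only thing the two series have in common.
heat = [rng.gauss(20, 8) for _ in range(n)]
ice_cream = [2.0 * h + rng.gauss(0, 10) for h in heat]
drownings = [0.5 * h + rng.gauss(0, 4) for h in heat]

# Correlation between ice cream and drownings, ignoring heat.
raw = pearson(ice_cream, drownings)

# Correlation after removing the part of each series explained by heat.
partial = pearson(residuals(ice_cream, heat), residuals(drownings, heat))

print(f"raw correlation:          {raw:.2f}")
print(f"after accounting for heat: {partial:.2f}")
```

In this toy setup the raw correlation is clearly positive, while the correlation after accounting for heat sits close to zero. Real data is rarely this tidy, so treat it as an illustration of the idea rather than a recipe.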
Here's the thing...
Data scientists constantly proclaim that causality and correlation are two different things. But in our eagerness to pounce on instances where correlation is being mistaken for causation,
we can sometimes miss instances where the opposite is true.
Sometimes, things do just happen and two variables can end up correlated because of random chance. But sometimes what appears to be chance actually shows that something more is at play.
As a data scientist, your job is to dig deeper and work out which is true.
Talk again soon,
Dr Genevieve Hayes.