  • 00:00

    [MUSIC PLAYING] [Correlation & Causation: Tyler Vigen's Spurious Correlations]

  • 00:20

    I'm Tyler Vigen. [Tyler Vigen, Independent Developer] I'm a third year law student at Harvard Law School, and I wrote a book called Spurious Correlations, where I explore the connections between variables that look like they're related statistically, but really, they're not connected. Correlation's when two things vary together. So when one thing goes up, the other thing goes up, and one thing goes down, the other thing goes down.
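
As a rough illustration of what "varying together" means in numbers, the sketch below computes the Pearson correlation coefficient by hand. The function name `pearson_r` and the speed figures are made up for illustration; they are not from the talk.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how much the two series deviate from their means together.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much each series deviates on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(spread_x * spread_y)

# Hypothetical data: posted speed limits (mph) and average observed speeds.
speed_limits = [25, 35, 45, 55, 65]
average_speeds = [29, 37, 49, 58, 71]

r = pearson_r(speed_limits, average_speeds)
print(f"r = {r:.3f}")
```

A value near +1 means the two series rise and fall together, a value near -1 means one rises as the other falls, and a value near 0 means no linear relationship. The coefficient says nothing about why the series move together.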

  • 00:41

    There's a lot of really obvious examples of correlation, and there are some that aren't as obvious. So one really obvious example of correlation is the average speed of cars that are on a particular stretch of road and the speed limit on that particular stretch of road. Very obvious, because cars tend to slow down when the speed limit's lower. Cars tend to speed up when the speed limit's faster. It's not always followed.

  • 01:02

    It's not a perfect correlation, but there's a very strong correlation there. Another example of correlation would be something like the number of classes that are in session on a particular day and how busy the cafeteria is in the school. Again, there's a really good example of times where that's not true. So, for example, next week we're going to have final exams on campus. There are going to be no classes scheduled, but I can guarantee that the cafeteria is still

  • 01:23

    going to be very busy. Causation is when one variable does cause another to occur. So, for example, if we have cars on a stretch of road that are correlating their speed with the speed limits that are posted on the signs, the causal mechanism, the causation in that situation, are drivers who are choosing to follow the speed limit. Similarly, in the case of a cafeteria that gets more students coming and buying food on certain days

  • 01:46

    where there's more classes in session, the causation comes from students who are on campus and hungry. There's a lot of bad things that can happen if we mistake correlation for causation. Thankfully, in the academic world, it doesn't happen a lot in published papers because they go through peer review. However, in the business world there's a lot of situations where it can come up and we might not see it. Before I came to law school, I worked

  • 02:07

    as an internal investigator for a retail agency. And what we would see is a correlation between certain cashiers who worked at certain counters and the lack of money in those registers. So if there is missing money, often they'd be at those registers, which seems like it might be indicative that they're stealing that money. Now the thing that happened was, the problem was, that those are the registers that were used

  • 02:27

    for cash exchange by managers. And so they happen to be using the exact same registers as these particular cashiers, and so from the outset, from the spreadsheets and from the correlative analysis, it looked like they were stealing money. But the truth of the matter was that there was just managers that were using it for something completely different. There was no causal relationship between the two. And so it's very important to keep those differences in mind

  • 02:48

    when you're doing those kind of analytics. Because the more data that we use in business and in the rest of our academic life, the more we're going to run into those problems where there can be correlations that are indicative of something dire or something interesting or something we want to look into, but we need to be careful to look for a causal mechanism. Spurious Correlations started as a website and is now a book

  • 03:09

    that I put together of two variables on each page and each set that are connected. They're correlated, but they're not related. So what we should be calling them is spurious relationships. They do, statistically, go together. They vary together, just like we say they should be correlated, but one variable is not causing the other. And so it looks to us like those two things are correlated and connected when, in reality, we

  • 03:30

    should know that they're not. So I came up with the idea for Spurious Correlations, the website itself, when I looked at this chart of this New York state murder rate and this mountain. It's a cartoon mountain, so they clearly made this mountain particularly for this chart. But I said to myself when I saw it, hey, I bet there's a lot of other things out there that correlate with each other for no particular reason.

  • 03:51

    And I bet if I looked through enough data, I'd be able to find them. This is called data dredging, which is the process of going, taking one variable, and then trying to find from a set of 1,000 other variables something that correlates with it. And it's not very difficult to do, especially with the age of big data and a lot of computer processing. So I started doing that and, of course, I found tons of interesting connections in the data.

  • 04:12

    Spurious Correlations became a website because I started just looking through it and finding a bunch of different variables and I wanted somewhere to put them and share it with my friends. And so that's actually why it's hosted under my name and not under spuriouscorrelations.com. But it exists as a website, and it was fun to put together, and it started getting passed around online, which was really fun to see. And I got a lot of feedback, and I got a lot of professors that contacted me and found ways

  • 04:33

    to make the information that was provided better because I was not an expert in statistics when I did this. I was a law student, and it was a very interesting process. I got to learn a lot about the world of statistics, I got to meet a lot of really interesting people, and then I got to put together a book. There are a lot of data sets available online, and if you wrote a robot to just go around and crawl for data sets, you'd probably end up with way too much data.

  • 04:56

    You want to go out there and find something, if you're trying to, like me, create an interesting data set, you want to go out there manually and pull together what you can find. So one of the easiest ways to go out, if you want to use brand name corporations, things that you know, you can go to a lot of 10-K annual reports for corporations and find things like their profits, their revenue, and see different types of things

  • 05:17

    that you would actually be really interested in. For example, Pandora had a net loss for the entire company for a couple years. So it's kind of interesting to track that over time and correlate it with other things. That can be fun. Another good place to find data like this is the US Census. They used to publish, up until 2011, a statistical abstract containing all sorts of different federal data. They don't publish the same kind of abstract anymore,

  • 05:37

    but they still have a lot of really cool data available. The CDC is also a really good source of data. They keep a very, very detailed set of statistics. It's a little bit more morbid, but it's very detailed and it spans a lot of time. So the process to create a spurious correlation, for me, is the same process as data dredging. And the idea is first you take a variable that you want

  • 05:58

    to correlate, so pick one thing, and then you take that thing and you compare it to 1,000 other variables. And in this case, you can just use a function where you correlate it. So long as you have a computer that can do it for you, you correlate it to each one and then just choose the highest correlation scores. So we just take the correlation coefficient between these two variables a thousand times and then look for the highest rated one.

  • 06:18

    Now, if you take the highest rated one and it's not interesting, go to the next one. If you want to find one in the real world, you should be looking for something that you can explain a causal mechanism with. When an academic or a professor or most of the people that I would work with would put together a correlation chart, they'd probably do something like a scatterplot and then look for all the dots to align. And that's a better way to look at a correlation,

  • 06:39

    but it's not necessarily an easier way. And in my case, I'm trying to look at something. I'm already trying to create a spurious chart. So I'm going to try to lie with my chart on purpose. In my case, I'm looking for something that's funny and interesting. And so I put together two variables. And then we're going to put them on a chart. And this is where it gets kind of really interesting from a viewer's standpoint and a user

  • 07:00

    standpoint. So here I'm going to walk through the process of creating maybe a single chart or a single set of spurious correlation. So this is on my website if you were going to select a variable that we're going to correlate first. We're going to go with coffee that the average American consumed correlating with Americans killed by misusing a non-powered hand tool.

  • 07:22

    So this is the coffee that the average American consumed in cups, which is determined by the US Census Bureau. Looks like it goes from about 425 cups of coffee here to around 375. So the average American's consuming one cup of coffee per day. And the Americans that are killed by misusing a non-powered hand tool-- which, I think, is people using hammers and screwdrivers wrong.

  • 07:43

    It's not very many. So it goes from about three to seven. It's not a very big range. Every year somebody dies by using maybe a hacksaw or a screwdriver or a hammer. Now some things you should notice on this chart is first, my y-axes are way off. So this is going from 0 to 16.
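
The effect of the axis ranges can be shown without drawing anything, by computing how much of the plot height each line travels across. This is a minimal sketch: the function names are hypothetical, the data values are illustrative numbers loosely matching the ranges quoted in the talk (roughly 425 down to 375 cups, roughly 7 down to 3 deaths), and the shared zero-based axis running from 0 to 450 is an assumption about what an honest version might use.

```python
def height_fraction(value, axis_min, axis_max):
    """Vertical position of a value as a fraction of the plot height."""
    return (value - axis_min) / (axis_max - axis_min)

def visual_swing(series, axis_min, axis_max):
    """Share of the plot height that the series moves across."""
    positions = [height_fraction(v, axis_min, axis_max) for v in series]
    return max(positions) - min(positions)

# Illustrative values only, not the published data.
coffee_cups = [425, 415, 400, 390, 375]  # cups per person per year
tool_deaths = [7, 6, 5, 4, 3]            # deaths per year

# Separate, hand-picked axes: 350 to 450 for coffee, 0 to 16 for deaths.
print(f"coffee, axis 350-450: {visual_swing(coffee_cups, 350, 450):.0%}")
print(f"deaths, axis 0-16:    {visual_swing(tool_deaths, 0, 16):.0%}")

# One shared axis that starts at zero (assumed range 0 to 450).
print(f"coffee, axis 0-450:   {visual_swing(coffee_cups, 0, 450):.0%}")
print(f"deaths, axis 0-450:   {visual_swing(tool_deaths, 0, 450):.0%}")
```

On the hand-picked axes both lines sweep across a large part of the chart, so they appear to move in lockstep. On a shared zero-based axis the same numbers are nearly flat, which matches the description of both series as basically straight lines.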

  • 08:03

    Over here we're going from 350 to 450. There's not any reason for that, and you can see here these lines line up from what we see, and we know that there is a correlation. So there is a statistical connection between the two and they look to us super correlated. They come down over here, you go across the screen. But let's take a look now if we go in

  • 08:24

    and change how these are set up and pull up a version where the axes start at zero. So now we see the same thing, but we see that the reason that they were correlated so strongly is because they're both basically straight lines. So right here we just see that the number of cups of coffee consumed by Americans pretty much stays the same over this 10 year period, and the number

  • 08:46

    of people that are killed by non-powered hand tools each year pretty much stays the same, too. It's almost near zero all the time. But if we go back, we can see the difference between these two. The only difference between these two charts-- it's the exact same data plotted over the exact same time frame. But one of them, I'm really abusing my y-axes. I'm using this chart to lie to show you something that looks dramatically connected when, in reality, it just

  • 09:07

    looks that way. Very easy way to lie with charts. It's something you really shouldn't do if you're trying to show someone something. A lot of times this is used if you're trying to convince someone of something. So this would be something you would see in a news article, but it's something you should call out when you see it, not use yourself. When you're making a chart like this, you really should have an axis that makes sense for your chart. Usually it should start at zero and then go up

  • 09:28

    to a reasonable number on both sides. But if you're trying to lie with the chart, you're trying to make something spurious, you can adjust those y-axes and really bring things closer together. So one of my favorites is the Nicolas Cage correlation, where the number of people who drown by falling into a swimming pool correlates with the number of films that Nicolas Cage appeared in each year. And one of the reasons that's my favorite

  • 09:48

    is now if you Google Nicolas Cage swimming pool, my face pops up in Google Image search, and so that's my new claim to fame, is that Nicolas Cage and swimming pool death is me. So if you want to generate your own data sets, or if you want to come up with some of your own statistics that correlate, one way is to find a set that someone else has come up with. That's an easy way to get some data.

  • 10:09

    But sometimes the most interesting data sets come from people who generate their own data, for example, by doing a survey. And surveys can be a great way to find a whole bunch of data really quickly. You can use a survey tool like Mechanical Turk, where you can survey your friends, and get a bunch of people to take your survey. And some really important things to keep in mind are things like you want to have numerical values that you

  • 10:30

    can assign to your survey questions. So, for example, here we try to find ways to correlate what kind of employment outcomes are going to come based on a set of grades. So if you can have a numerical assignment of how grades work-- here they don't like to assign numbers to grades for some reason-- if you can find a way to assign numbers to grades, then you can correlate it to employment outcomes a lot better.

  • 10:50

    So the same thing applies elsewhere. If you're trying to correlate things or find connections between things, you're trying to regress variables against each other, if you can find a way to make them into numbers and create survey questions that give you numerical data and quantitative data, you're going to have a lot easier time finding real correlations. So one of my favorite charts is also the correlation between the per capita consumption of margarine

  • 11:11

    and the divorce rate in Maine, which, there's no particular reason that should be connected, but it's a fun connection. Like, I can't believe we're still married. I do get a lot of questions or ideas for correlations. One of the big struggles I have is that I have a really particular data set that I'm looking at. So I look at annual data from about 1990 to 2010.

  • 11:32

    And if you go through all the charts that I've published, you'll find there's a pretty strong connection where most of the charts fall within that time range. And the reason is, then, I can always look for data for that 20 year period and try to find things that are going to correlate with each other. I know if I find a new piece of data, it's going to correlate with something in my data set. That's a really narrow band to be looking in, but when you're data dredging, you need to have a common variable.

  • 11:54

    So, for me, it's years, and it's in that frame. Something that happens every year during that time. So a struggle that happens often is, for example, if you're trying to do election results. That happens every four years in the presidential election, happens every two years for other races. And so it can be correlated, you can correlate those things across time, but it's very difficult to do over, say,

  • 12:15

    a 10 year period. You're looking at a 20 year period. And it gets really interesting when you start to break it out and go over a 40 year time span or something much longer, because there's a lot of things that change about the country or about the world in that time. And so that can really have a dramatic impact, even on the biggest variables that you're looking at-- per capita consumption of milk, for example.

  • 12:35

    That's something that's going to change very much over a 100 year period, so you're not going to be able to see those kind of correlations anymore. So a lot of people think that Spurious Correlations is funny, especially data researchers. So one of the fun things I get to experience is when I go to different data analytics conferences, I'm usually the youngest guy there, and everyone's seen my work and thinks it's very hilarious, which is great.

  • 12:55

    I really enjoy that, and I get to meet a lot of really cool people. One of the most exciting things for me is getting to see how a lot of different people use data for really cool things. So, for example, I went to one conference. It was just an emetrics conference, but it's a lot of people that run a lot of different websites, and a lot of different companies where they're using a lot of data. And their goal is always to provide insights,

  • 13:16

    which is a very dangerous area for correlation and causation confusion, because you might think that all of your business is coming from a particular part of the country just because more people live there, for example. It's a very common theme among website owners. And I've also gotten to meet some really cool people in other areas. So, for example, I went to a conference last year where I met one of the people that runs the US Census

  • 13:38

    Bureau and runs the statistical abstract, who was talking with someone from the UK Office of National Statistics. And something that's really interesting there is there's a lot of political battles that happen when we're trying to count people, just in the census. Which, you would think, would be a really simple situation. You're just trying to count the number of people that live in a particular county, precinct, district,

  • 13:58

    whatever. The problem is that there's a lot of-- it's not clear what method you should use to count people. Because if you send out, for example, a letter that says how many people live in your house? Most people really don't respond to that. You can probably think of the number of times you've responded to something like that, census like that. You wouldn't think it's real, or you just wouldn't care enough to tell the Census Bureau

  • 14:19

    that you live there. And so there's a lot of problems in the United States where they don't have very accurate counts because it's not easy to get an accurate count of people that live in a particular city. And so there's a lot better methods that could be used right now to estimate how many people live in one city block, for example. But because of the politics that are associated with redistricting and with gerrymandering and with finding

  • 14:41

    ways to capture the vote for different political parties, there's a lot of pressure on the US Census to not do that and to just take, for example, the number of people that signed on a form that they live there. And so there's a lot of back and forth between how they do it in the UK and how they do it in the US and how we could be counting people better, which is a really simple thing that it seems like, but there's a lot of easy ways for that

  • 15:02

    to be a spurious problem, too. I hope that by looking at spurious correlations from my website or book, it's really just a fun concept to me, so I hope it encourages people to get more involved in statistics, to not be scared of them. It's not hard to open Stata data and learn a little bit about how to regress variables or how to correlate things together. In the future, I like to work on kind of different projects

  • 15:23

    one at a time. So right now, I'm working on, for example, how would we gerrymander the country? Or how can we ungerrymander the country? How can we think about redistricting five years from now and say how should this look? How could it look? How should we look at how we divide up the population to elect people? So those are kind of my goals. [MUSIC PLAYING]

Video Info

Publisher: SAGE Publications Ltd

Publication Year: 2017

Video Type: In Practice

Methods: Spurious correlation

Keywords: applications and contexts; change detection; estimation; gerrymandering; humor; misconceptions; politics; redistricting; web sites


Abstract

Tyler Vigen describes his web project and book, Spurious Correlations, which finds similar patterns in unrelated data sets. Though Vigen exploits these similarities for the purpose of humor, they can also be used to imply a causal relationship when there isn't one.

Video Info

Publication Info

Product:
SAGE Research Methods Video
Publication Place:
United Kingdom
SAGE Original Production Type:
SAGE In Practice
ISBN:
9781473997714
DOI
https://dx.doi.org/10.4135/9781473997714
Copyright Statement:
(c) SAGE Publications Ltd., 2017

People

Practitioner:
Tyler Vigen


Methods Map

Spurious correlation

A situation where an apparent correlation between two variables arises, not because one causes another, but because of the effect of a third variable.