• 00:01

[Wharton, University of Pennsylvania][Business Mathematics][Module 10: Statistics Part 2.1: Introduction to Statistics]

• 00:09

RICHARD WATERMAN: Today's class is going to talk about statistics. [Richard Waterman, Adjunct Professor of Statistics] And we need to start doing statistics when we don't have access to a complete probability model to describe a business process. And that's the reality, that we don't typically have that probability distribution. But what we do have is some data that we have collected.

• 00:33

RICHARD WATERMAN [continued]: And then we can use that data to create the probability model. That's the basic idea, and the difference between statistics and probability. So today's class is focusing on collecting data and making statements about that data. And they are, again, going to be probabilistic

• 00:58

RICHARD WATERMAN [continued]: Now, as you will very quickly see, we tend to make assumptions when we do statistics. And it's important that we're able to articulate those assumptions, and also to be able to check those assumptions and determine whether or not they're reasonable. It's going to be the case that if the assumptions are not

• 01:21

RICHARD WATERMAN [continued]: reasonable, then the conclusions we make are quite likely to be unreliable. So statistics is always based on assumptions. It's up to us to know the assumptions and also check those assumptions. We're going to start off by discussing how we might summarize data, and why we summarize data.

• 01:44

RICHARD WATERMAN [continued]: So you imagine yourself in a situation where you're sitting down at work and somebody, your boss, presents you with this humongous spreadsheet, potentially millions of rows, lots and lots of columns. And your job is to figure out what is going on there. Obviously you can't look at the entire spreadsheet. It's just impossible. There's too much data.

• 02:04

RICHARD WATERMAN [continued]: It's a very natural thing to want to summarize it. Now those summaries that we create can be very, very useful in terms of understanding business processes. And here are some reasons why we like to summarize data. One of them is to create benchmarks. Businesses are always trying to compare themselves against one another, sometimes to an external benchmark,

• 02:27

RICHARD WATERMAN [continued]: sometimes to an internal benchmark. Where do those benchmarks come from? Well, they can come from data summaries. So we often are interested in creating some kind of benchmark. Another thing that we need to do as a business is to monitor the business process over time-- in other words, to track changes. And the objects that we track are typically

• 02:52

RICHARD WATERMAN [continued]: statistical summaries of data. So they might be, for example, the average number of defects in a batch, how long it takes for us on average to service a help call in our call center, all these sorts of things that we're interested in tracking. Conversion rates if you're running a website.

• 03:14

RICHARD WATERMAN [continued]: Another thing that we're able to do with data summaries is, in fact, to create decision rules from these summaries. And we're going to see that today when we talk about an object called a confidence interval, and we can see how we can make a decision based off of the confidence interval, which itself is driven by a summary of underlying data.

• 03:36

RICHARD WATERMAN [continued]: And so here are some examples that you could well be familiar with, that are essentially data summaries. If you turn on the television at certain times of the month, certainly on business TV, there's a lot of discussion as to what the housing prices are doing right now. And you will see people talk about the median house price

• 04:00

RICHARD WATERMAN [continued]: as a way of judging the health of the housing market. Another summary statistic that gets a lot of visibility at the moment is the monthly unemployment rate. The Federal Reserve in the United States has decided to tie some of its policies to those monthly unemployment rates.

• 04:22

RICHARD WATERMAN [continued]: And hence, they generate a lot of interest. The monthly unemployment rate is a summary of unemployment in the United States. It is indeed a statistical summary because it's based on what is termed a sample survey. And finally, here's a summary statistic that at Wharton we're

• 04:44

RICHARD WATERMAN [continued]: sometimes interested in tracking, and that's the average GMAT of the entering class. So that's just some examples of summary statistics that you might well be familiar with. And I would finish this section by just noting that people

• 05:07

RICHARD WATERMAN [continued]: are often complaining about data overload, information overload, however you want to describe it these days. Well, there's no more natural way of dealing with a huge amount of data than by summarizing it. So we're going to see, right now, the basic statistical summaries of data.

• 05:27

RICHARD WATERMAN [continued]: I want to talk about these summaries within the context of an example. And the example that I'm going to use is driven by looking at returns on the stock of Apple-- Apple Computer-- over the period 2007 to the beginning of 2013. So in the table that you're looking at, in the first column

• 05:51

RICHARD WATERMAN [continued]: is the date, and each row corresponds to a day. And in the second column, you're looking at the return on Apple stock presented as a percentage. And so the very first row in the table, the third of January 2007, there was a return of minus 1.2258%,

• 06:12

RICHARD WATERMAN [continued]: which says approximately that if you had owned $100 worth of Apple stock at the beginning of the day, by the end of the day you would have lost $1 and approximately 23 cents. So that's how to interpret the numbers in the table. What I want to do is focus on that second column, the percentage return, and provide

• 06:33

RICHARD WATERMAN [continued]: some statistical summaries of that column. I'll start by talking about the two key graphics that we use to summarize a column or distribution of data. The first of these is termed a box plot. And it's very good to help us identify outliers, or atypical observations in the data set.

• 06:56

RICHARD WATERMAN [continued]: And the second one is the histogram, which gives us insight into the shape of the underlying distribution of the data. And shape is important because we often want to compare the shape of our distribution to what is called a normal distribution, or a bell curve. With the histogram we are also able to identify asymmetry,

• 07:19

RICHARD WATERMAN [continued]: or skewness in the data. And furthermore, it's possible to pick out outliers as well. So those are the two most common summaries of a single column of data. Here they are illustrated for the Apple data set. The first graphic at the top of the slide is called the box plot.

• 07:39

RICHARD WATERMAN [continued]: The second graphic is the relative frequency histogram. The box plot is showing you the distribution of the data and some key summaries of the distribution of the data. I'm going to explain this in a little bit more detail in just a minute, but the center of the box plot

• 07:59

RICHARD WATERMAN [continued]: is called the median of the distribution. The two edges of the box in the box plot are known as the lower and the upper quartile of the distribution. And then you see a couple of lines that are drawn outside of the boxes. Those lines are used to help identify outliers.

• 08:22

RICHARD WATERMAN [continued]: Points outside those lines are unusual and typically would warrant some further investigation. The box plot is a nice summary of the distribution of the data. Underneath the box plot is what is called the relative frequency histogram. You can see along the x-axis, the horizontal axis, the values

• 08:45

RICHARD WATERMAN [continued]: that the return takes. And the height of the bars corresponds, on a relative basis, to the frequency of the observations within any of the ranges defined by the bars. Notice that there are no numbers going up the vertical axis on the relative frequency histogram.

• 09:07

RICHARD WATERMAN [continued]: And that's because it's a relative frequency histogram. We don't really care what the heights of those bars are. They have been normalized so that the area under this graph is equal to 1. And so what we're looking at here is the relative frequency of the occurrence of different observations.

• 09:28

RICHARD WATERMAN [continued]: And this graph allows us to get a sense of the shape of the distribution. So let's talk about the creation. How would you make one of these histograms? Now I would note that in the modern age, we don't ever do this sort of thing by hand.

• 09:49

RICHARD WATERMAN [continued]: We are completely reliant on software, which is good because as humans we tend to make errors, especially when faced with large, large data sets. And we do work with a lot of data, and it's just not feasible to do any of these sorts of things manually, though it's useful to understand what

• 10:09

RICHARD WATERMAN [continued]: lies behind the construction. So how would you create one of these histograms if you had to? Well, you would create buckets, or sometimes they're called bins, for the data. For example, a bucket could be 0% return to 2.5%, then 2.5% to 5%, et cetera, and then you simply count the number of observations in each bucket.

• 10:30

RICHARD WATERMAN [continued]: The buckets are presented to you on the horizontal axis. And the height of the bar shows the relative frequency of the counts, or the number of observations within each bucket. So if you've got a higher bar, you have more observations. So the histogram is showing you where most of the data is clustered, and to some extent how extreme it can become.
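
The bucket-and-count construction described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the return values are made up (they are not the Apple data from the slide), the function name `relative_frequency_histogram` is hypothetical, and each bucket is treated as half-open, so a value exactly on a boundary falls in the bucket above it.

```python
from collections import Counter
import math


def relative_frequency_histogram(values, bin_width=2.5):
    """Count observations per bucket and normalize so the bar areas sum to 1."""
    # Bucket index k covers the half-open range [k * bin_width, (k + 1) * bin_width).
    counts = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    bars = []
    for k in sorted(counts):
        lower = k * bin_width
        upper = lower + bin_width
        rel_freq = counts[k] / n        # share of observations in this bucket
        density = rel_freq / bin_width  # bar height that makes the total area 1
        bars.append((lower, upper, counts[k], rel_freq, density))
    return bars


# Hypothetical daily percentage returns (illustrative only).
returns = [-1.2, 0.4, 2.1, -0.3, 3.0, -2.7, 0.9, 1.5, -4.1, 0.2]
bars = relative_frequency_histogram(returns)

for lower, upper, count, rel_freq, density in bars:
    print(f"[{lower:5.1f}, {upper:5.1f})  count={count}  rel_freq={rel_freq:.2f}")
```

The tallest bar marks the bucket holding the most observations. Dividing each relative frequency by the bucket width is what makes the total area equal 1, which is one common way to normalize a histogram.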

• 10:51

RICHARD WATERMAN [continued]: So on looking at the histogram for Apple's stock, what I learned about the return is that these returns tend to be centered about 0. That's where the middle of the distribution is. And that most of the returns lie within the buckets minus 2.5% to 0%, and 0% to 2.5%. So most of the time our returns are within plus or minus 2.5%.

• 11:15

RICHARD WATERMAN [continued]: But there are certainly some extreme data points that are identified, both in the histogram and the box plot. And you can see that if you have a look at some of those extreme plot points-- so let's just quickly jump back there-- there are days where Apple has lost almost as much as 20% of its value. And also days where it gained 15% in terms of value.

• 11:39

RICHARD WATERMAN [continued]: So you can identify the extreme observations as well. One final thing to note about the distribution of the Apple stock is that it's quite symmetric. So symmetric is this idea that if you took a mirror and you put it in the middle of the distribution, then what you saw in the mirror would

• 11:59

RICHARD WATERMAN [continued]: be what you would actually see if you looked at the histogram itself, so a line of symmetry. And that's an important idea. It typically means that the big pluses are going to cancel out with the extreme negatives when we come to take some form of averaging. So some observations on the construction of the histogram.

• 12:20

RICHARD WATERMAN [continued]: Now, going back to the box plot, just a recap of some of the features of the box plot. The center line in the box plot, that is defined by what is called the median of the data. And the median is the observation that you get if you sort the data and find the one in the middle, the one that essentially has

• 12:41

RICHARD WATERMAN [continued]: 50% above it and 50% below it. So that's the middle of the distribution as defined by the median and shown to you in the box plot. The edges of the box are called the lower and the upper quartile. The lower quartile, by definition, has 25% of the data beneath it. The upper quartile has 25% of the data above it.
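
The median and quartiles that define the box can be computed with Python's standard `statistics` module, as in the rough sketch below. Two assumptions are worth flagging: the data are made up, and the lecture does not say how the lines outside the box are placed, so the sketch uses the widely used convention of fences at 1.5 times the interquartile range beyond the quartiles. Software packages also differ slightly in how they interpolate quartiles; the `"inclusive"` method is just one choice.

```python
import statistics

# Hypothetical daily percentage returns, with one extreme day (illustrative only).
returns = [-4.1, -2.7, -1.2, -0.3, 0.2, 0.4, 0.9, 1.5, 2.1, 3.0, 15.0]

median = statistics.median(returns)  # middle value of the sorted data

# Cut points splitting the sorted data into four equal parts: Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(returns, n=4, method="inclusive")

# Assumed convention (not stated in the lecture): flag points lying more than
# 1.5 interquartile ranges beyond the edges of the box.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [r for r in returns if r < lower_fence or r > upper_fence]

print(f"median={median}, lower quartile={q1:.2f}, upper quartile={q3:.2f}")
print(f"flagged as outliers: {outliers}")
```

Flagging a point this way only marks it for a closer look; whether to keep it is a business judgment, not something the calculation decides.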

• 13:02

RICHARD WATERMAN [continued]: Outliers are flagged on either side of the box plot as individual data points. And in practice, an outlier isn't necessarily something you're going to throw away. In business, we often talk about 80/20 rules. For example, 80% of your profit comes from 20% of your customers.

• 13:22

RICHARD WATERMAN [continued]: What that says to me as a statistician is that there are some extreme observations in the distribution of profit per customer. And those extreme data points are actually sometimes the most informative or the most useful ones. So just because something is extreme or an outlier doesn't mean that you want to get rid of it.

• 13:44

RICHARD WATERMAN [continued]: I'm saying that from a business perspective, sometimes these are the most interesting observations in the data set, and the box plot is a good way of flagging those interesting observations that, from a statistical perspective, we might call an outlier. One final point, as I noted before, you can see from either the box plot or the histogram

• 14:05

RICHARD WATERMAN [continued]: that daily returns have been as low as almost minus 20%, losing a fifth of the value in a single day, and as high as about 15%. So, two graphical summaries of data. And again, in practice now, there's no way we would ever create these things by hand.

• 14:26

RICHARD WATERMAN [continued]: We'd always use some software to do that. Now it's not the goal of this course to introduce any statistical software, or teach you about statistical software; you have plenty of opportunity to do that later on. I simply want to present the summaries that such software would provide for you and talk about how to interpret those summaries.

• 14:49

RICHARD WATERMAN [continued]: I'm now going to present some numerical summaries of data. But before jumping into those numerical summaries, I just want to talk about the paradigm that sits behind a lot of statistics. And it's called the population sample paradigm. The idea is that there's some population out there

• 15:09

RICHARD WATERMAN [continued]: and, for example, that population might be all people who are unemployed in the United States at a given point in time. Then there's a feature of that population that we're interested in. That could be, for example, the average length of time for which they have been unemployed. So that's a number that economists

• 15:31

RICHARD WATERMAN [continued]: will track to get a gauge of the health of the labor market. Now, certainly the population of the US is large, as is the absolute number of those that are unemployed. And it's either impossible or, as is often the case, unnecessary to survey every member of the population

• 15:54

RICHARD WATERMAN [continued]: in order to get a sense of what value the feature that we're interested in takes on. And so with several million people unemployed in any given month, it's not realistic to think that you can talk to every one of them and figure out for how long they've been unemployed. So even though we're interested in the population

• 16:14

RICHARD WATERMAN [continued]: for various reasons, it's not feasible to survey every member of the population. And so what we do is take a sample. And that's exactly what the Bureau of Labor Statistics in the US does when it conducts what's called the Current Population Survey. So it takes a sample of the potential working population.

• 16:38

RICHARD WATERMAN [continued]: Now from that sample, we calculate a sample statistic. And the sample statistic's role is to estimate the unknown population value of interest. And in this particular example that I'm talking about, we could find the mean or the average length of unemployment

• 16:59

RICHARD WATERMAN [continued]: in the sample, and use that to estimate the average length of unemployment in the population. That's the basic idea. You have a question about the population, you can't see the whole population, you take a sample, you use some feature of the sample to estimate what's going on in the population.
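
A small simulation can make the population sample idea concrete. Everything in the sketch below is an assumption made for illustration: the population is synthetic (weeks of unemployment drawn from an exponential model with a mean of about 20 weeks), the sample is a simple random sample of 1,000 people, and neither reflects how the Current Population Survey is actually designed. In real work the population mean is unknown; it is computed here only so the estimate can be compared against it.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: weeks unemployed for 100,000 people (made-up model).
population = [rng.expovariate(1 / 20) for _ in range(100_000)]
mu = statistics.fmean(population)  # population mean, normally unknown

# Simple random sample without replacement (an assumed sampling scheme).
sample = rng.sample(population, 1_000)
x_bar = statistics.fmean(sample)  # sample mean, used as the estimate of mu

print(f"population mean (mu):  {mu:.2f} weeks")
print(f"sample mean (x bar):   {x_bar:.2f} weeks")
print(f"estimation error:      {x_bar - mu:+.2f} weeks")
```

Surveying 1 percent of this population typically lands close to the population value, which is why enumerating everyone is usually unnecessary.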

• 17:24

RICHARD WATERMAN [continued]: There are a number of places in the business world where you see this sample population paradigm played out. And one of those-- maybe it's surprising to some people-- is for TV ratings. And so TV ratings are obviously very important in terms of the marketing world, but those TV ratings come from sampling the US population of TV viewers.

• 17:49

RICHARD WATERMAN [continued]: You can't look at everyone in the US who's watching TV. There are too many of them. And those ratings statistics actually come from a sample of something like 50,000 households in the US. And so it's very common to answer questions about a population, here US TV viewers, by looking at a sample of them.

• 18:12

RICHARD WATERMAN [continued]: In this graphic, I am illustrating the idea of the population sample paradigm. And the idea is that there's a big population out there. That's the red box on the outside. And there's a feature of the population that we care about. And you can see I've labeled that feature with the Greek letter mu.

• 18:33

RICHARD WATERMAN [continued]: That was the concept that I introduced in last time's class known as the population mean. But the problem is that that red box is just too big for us to completely enumerate. And in reality, it's typically unnecessary to enumerate it. And so what we do is take a sample. And that's what that little orange subset

• 18:55

RICHARD WATERMAN [continued]: is meant to be identifying within the population. And based on that sample, we calculate a sample statistic. We're going to look today at the sample mean that we often write as x with a bar going across the top, otherwise known as x bar. So x bar will be the sample mean.

• 19:16

RICHARD WATERMAN [continued]: Mu will be the population mean. I don't know mu, but I'd like to talk about it. So what I do is use the sample mean as a surrogate, or a proxy, or an estimate, of the population mean. And a lot of what we have to concern ourselves with is to what extent that inference is valid,

• 19:37

RICHARD WATERMAN [continued]: to what extent is the sample that I have taken representative of the population? And I'll talk about a way of making sure that it is indeed representative. But this is the basic population sample paradigm. A population parameter mu that we don't know, but would like to make a statement about. A sample of data that we have obtained.

• 19:58

RICHARD WATERMAN [continued]: Based on that sample, we create some sample statistics. For example, the sample mean, x bar, and we use x bar to make some inference about mu. So that's the population sample paradigm. And with that in hand, I can talk about some sample statistics. The main feature of the distribution

• 20:22

RICHARD WATERMAN [continued]: that we are typically interested in is the center of the distribution. And there are different ways of measuring where the center of the distribution lies. What's a typical value? The most common one is the sample mean-- the feature that we write as x bar. And what the sample mean is doing

• 20:43

RICHARD WATERMAN [continued]: is estimating the population mean. You calculate the sample mean by adding up the observed values and dividing by how many there are. I'll present a formula in just a second. A different measure of centrality is, in fact, the sample median, which I told you is created by sorting the data and picking out the one that's in the middle.
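
The two recipes just described, add up and divide for the mean, sort and take the middle for the median, translate almost word for word into Python. This is a minimal sketch: the function names `sample_mean` and `sample_median` and the data are hypothetical, and averaging the two middle values when the count is even is a common convention that the lecture does not spell out.

```python
def sample_mean(xs):
    """Add up the observed values and divide by how many there are."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Sort the data and pick out the value in the middle."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Assumed convention for an even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical daily percentage returns, including one extreme day.
returns = [-1.2, 0.4, 2.1, -0.3, 15.0]

x_bar = sample_mean(returns)
med = sample_median(returns)
print(f"sample mean x bar = {x_bar:.2f}")
print(f"sample median     = {med:.2f}")
```

With this made-up data the single 15% day pulls the mean up to about 3.2 while the median stays at 0.4, which shows why the two measures of center can disagree when the data are not symmetric.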

• 21:05

RICHARD WATERMAN [continued]: So it is the case with statistics that there's often more than one approach to measuring some feature of interest, the center of the distribution. And that's just the nature of the beast. We provide "options" is how I would put it. But these are the two most common options-- the mean and the median.

• 21:25

RICHARD WATERMAN [continued]: After identifying where the center of the distribution is, our next feature of interest is the spread of the data. And we talked about this last time when we were discussing probability, and I talked about the variance and the standard deviation. Now that we're in the realm of statistics, where we're collecting a sample of data, we are going to refer to these summaries

• 21:47

RICHARD WATERMAN [continued]: as the sample variance, which we write as s squared, and that's going to estimate the population variance, typically unknown, which is sigma squared, and then we'll have the sample standard deviation, which we write, not surprisingly, as s, and that's going to estimate the population standard deviation sigma.
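
As a rough sketch of how s squared and s are usually computed from a sample: the lecture has not yet shown a formula at this point, so the code below assumes the standard definition with a divisor of n minus 1, which is also what Python's `statistics.variance` uses. The function name `sample_variance` and the data are hypothetical.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


# Hypothetical sample (illustrative only).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

s_squared = sample_variance(data)  # estimates the population variance sigma squared
s = math.sqrt(s_squared)           # estimates the population standard deviation sigma

print(f"sample variance s^2 = {s_squared:.4f}")
print(f"sample std dev s    = {s:.4f}")

# Cross-check against the standard library, which uses the same n - 1 divisor.
print(math.isclose(s_squared, statistics.variance(data)))
```

For this sample the mean is 5 and the squared deviations sum to 32, so s squared is 32/7, about 4.57, and s is about 2.14, expressed in the same units as the data.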

• 22:08

RICHARD WATERMAN [continued]: So-- [MUSIC-- REPEATER, BY MOBY] [BUSINESS MATHEMATICS] [Richard Waterman] [Wharton: University of Pennsylvania]

## Video Info

Episode: 11

Publisher: Wharton

Publication Year: 2014

Video Type: Tutorial

## Segment Info

Segment Num.: 1

## Abstract

In part 2.1 of his series on business mathematics, Professor Richard Waterman introduces statistics and probabilistic statements about data. Statistics involves many assumptions, and it is important to check whether the assumptions are reasonable. Waterman discusses summarizing data, histograms and box plots, and the population sample paradigm.