- 00:00
[MUSIC PLAYING]

- 00:01
GREG MARTIN: Learning statistics does not need to be difficult. Now, instead of bombarding you with a complicated formula and statistical theory, I'm going to walk you through a way of thinking. And that's going to enable you to address the most common statistical questions. When we look at sample data, for the most part, we see two things. We see differences between groups, so men are taller than women.

- 00:22
GREG MARTIN [continued]: And we see relationships between variables like taller people weigh more than shorter people. And the big question is, are those differences and are those associations or relationships real? And I'm going to talk you through what it is that we mean by the term real. Over the next few minutes, we're going to take a look at a very simple data set. And we're going to see how, by looking

- 00:42
GREG MARTIN [continued]: at various combinations of variables and variable types, we can identify very specific differences between groups and very specific relationships between variables. And I'm going to walk you through when and how to use statistical tests and how to interpret your results. Now, let's imagine that we have a research question and it's about the height and the weight

- 01:03
GREG MARTIN [continued]: of people living in Ireland. Of course, we can't measure the height and the weight of the entire population. So instead, we take a random sample of the population and we measure the weight and the height of that sample. And we collect some additional information, like gender and age group, from each of the people in that sample. And we arrange these data in a spreadsheet or a data set with the various attributes in columns.

- 01:26
GREG MARTIN [continued]: And these are called variables. And these variables will be the object of our inquiry. Now, most data sets that you work with will contain two types of variables, categorical and numeric variables. Categorical variables, like gender, contain categories, as the name suggests. Think of them as groups or buckets that the data can be arranged into.

- 01:48
GREG MARTIN [continued]: In this case, males and females. Numeric variables, like height, are numbers, as the name suggests, and can be arranged on a number line. Now, to better understand our data and to make sense of it, we summarize it and we visualize it. In the case of categorical data, we can count up the number of observations in any given category. And we can represent them in a table and on a bar chart.

- 02:10
GREG MARTIN [continued]: And to summarize numeric data, we're firstly interested in the spread or the distribution of the data. So we might describe the range of the data, the interquartile range. We could also include the standard deviation. To get a sense of the middle of the data, we use the median, which divides the data into two equal halves. And we use the mean, which is the average. The mean is probably the most commonly used summary value

- 02:30
GREG MARTIN [continued]: to represent this kind of data. We can visualize our data using a box plot, which is a visual representation of the range, the interquartile range, and the median. And of course, we can create a histogram. And this gives us the shape of the data. So I hope you can see that this process of summarizing and visualizing the data takes it from being just numbers and words on a spreadsheet and turns it into something that is meaningful to us, something

- 02:52
GREG MARTIN [continued]: that we can get our heads around, something that we can think about. Now, in this very simple data set, we've got two categorical and two numeric variables. And things start to get interesting when we start looking at combinations of variables. So for example, we can take a look at a categorical and a numeric variable, like gender and height. And so we can group the data by gender, which is the categorical variable, and create a summary of the numeric variable,

- 03:14
GREG MARTIN [continued]: in this case, height, that is separated out into those two groups. And looking at the summary, we can see that, in our sample data, men are, on average, taller than women. What I want you to see here is that we've looked at a combination of a categorical and a numeric variable. But as you can imagine, there are other possible combinations of variables that we could have looked at. We could have looked at height and weight, which are both numeric.
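
The grouping step described above can be sketched in a few lines of Python. This is only an illustration, not the method used in the video: the sample rows are invented, and the helper names `summarize` and `summarize_by_group` are hypothetical.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical sample rows: (gender, height in meters). The values are invented.
sample = [
    ("male", 1.82), ("male", 1.75), ("male", 1.79), ("male", 1.70),
    ("female", 1.63), ("female", 1.68), ("female", 1.60), ("female", 1.71),
]

def summarize(values):
    """Spread and middle of one numeric variable."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "iqr": q3 - q1,
        "sd": stdev(values),
        "median": median(values),
        "mean": mean(values),
    }

def summarize_by_group(rows):
    """Group the numeric values by category, then summarize each group."""
    groups = {}
    for category, value in rows:
        groups.setdefault(category, []).append(value)
    return {category: summarize(values) for category, values in groups.items()}

summary = summarize_by_group(sample)
for gender, stats in summary.items():
    print(gender, {name: round(value, 3) for name, value in stats.items()})
```

In this made-up sample the male group has the larger mean, which mirrors the observation in the transcript but says nothing yet about the wider population.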

- 03:35
GREG MARTIN [continued]: We could have looked at gender and age group, both categorical. And in each case, we might see either differences between groups or relationships between variables. And in each of these cases, there are specific statistical tests that we can apply to see if what we are seeing in the sample data has implications for what we think about the wider population.

- 03:55
GREG MARTIN [continued]: Can we infer anything? Is what we are seeing statistically significant? So let's take a quick look at the five most important combinations of data that we have. And we'll look at, firstly, what might we observe in our sample data, given that sort of combination of data types. And secondly, what statistical test we might apply to determine whether or not we can infer anything about the wider population.

- 04:16
GREG MARTIN [continued]: So we might look at a single categorical variable, like gender. And we could do a one-sample proportion test. For two categorical variables, we would do a chi-square test. For a single numeric variable, we do a t-test. If we have a categorical and a numeric variable, we do a t-test, or analysis of variance, or ANOVA, if there are more than two categories in our categorical variable.

- 04:36
GREG MARTIN [continued]: And for two numeric variables, we do a correlation test. Now, I'm going to come back to each of these scenarios and each of these tests. So don't panic. At this point, what I want you to see is how the data can be divided up. And in just a few minutes, we're going to take each of these scenarios and work through exactly what questions you can ask and how it is that you can apply statistical tests and, importantly, how to interpret your results.
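
The five combinations just listed amount to a small lookup from variable types to a test. A minimal sketch of that lookup follows; the function name `choose_test` and its parameters are hypothetical and serve only to restate the list in code.

```python
def choose_test(variable_types, n_categories=2):
    """Map a combination of variable types to the test named in the transcript.

    variable_types: a list with one or two entries, each "categorical" or "numeric".
    n_categories: number of groups in the categorical variable (only used
    for the categorical + numeric combination).
    """
    kinds = sorted(variable_types)
    if kinds == ["categorical"]:
        return "one-sample proportion test"
    if kinds == ["categorical", "categorical"]:
        return "chi-square test"
    if kinds == ["numeric"]:
        return "t-test"
    if kinds == ["categorical", "numeric"]:
        # More than two groups calls for analysis of variance instead of a t-test.
        return "t-test" if n_categories <= 2 else "ANOVA"
    if kinds == ["numeric", "numeric"]:
        return "correlation test"
    raise ValueError(f"unsupported combination: {variable_types}")

print(choose_test(["categorical"]))                 # gender
print(choose_test(["numeric", "categorical"]))      # height by gender
print(choose_test(["numeric", "categorical"], 3))   # height by age group
print(choose_test(["numeric", "numeric"]))          # height and weight
```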

- 04:58
GREG MARTIN [continued]: At this point, I want to say this. It's not good science to take a data set and just randomly stab around blindly, hoping to find something that's statistically significant. Before you interrogate the data, you start off by defining your question, your hypothesis. You define your null hypothesis. You identify the alpha value that you're going to use. And then, you analyze the data.

- 05:18
GREG MARTIN [continued]: So let's look at what we can do with just one categorical variable, like gender. We might ask the question, is there a difference in the number of men and women in the population? Now, we could state that, as a hypothesis, which is that there is a difference between the number of men and women in the population. And we could check to see whether or not we think that that is the case. And when we look at our sample data,

- 05:38
GREG MARTIN [continued]: well, we do in fact see that there's a difference in the proportion of men and women. So should we get excited? Well, no. Not yet. Remember, this is just sample data. We could have, by chance, selected a sample that just happened to show a difference. So let's consider the possibility that, in actual fact, there is no difference in the number of men and women in the population.

- 05:59
GREG MARTIN [continued]: And we call that our null hypothesis. And if that were true, how likely would it be, what are the chances, what is the probability that we would see the difference that we have observed, or greater difference, for that matter? And if we can show that that probability is low, then we can have a degree of confidence that the null hypothesis is wrong and we can reject it.

- 06:20
GREG MARTIN [continued]: But before we calculate this probability, which we're going to call our p-value, we must be clear about how small is small enough. Below what value of p would we reject the null? And we must decide on that cutoff before we calculate the p-value. And we call that cutoff the alpha value. And for the rest of the examples in this video, we're going to use an alpha value of 0.05, or 5%.
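
To make the p-value idea concrete for the men-and-women question, here is a rough sketch that computes it exactly from the binomial distribution. The exact binomial calculation is a swapped-in choice for illustration; the video does not say how its proportion test is computed, and the counts (60 men in a sample of 100) are invented. The function name `two_sided_p_value` is hypothetical.

```python
from math import comb

ALPHA = 0.05  # cutoff chosen before looking at the p-value

def two_sided_p_value(successes, n):
    """Probability, if both categories were equally likely, of a split
    at least as lopsided as the one observed (in either direction)."""
    observed_gap = abs(successes - n / 2)
    extreme_outcomes = sum(
        comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= observed_gap
    )
    return extreme_outcomes / 2 ** n

# Invented sample: 60 men out of 100 people.
p = two_sided_p_value(60, 100)
print(f"p-value = {p:.4f}")
print("reject the null" if p < ALPHA else "do not reject the null")
```

With these invented counts the p-value lands just above 0.05, so a 60/40 split in a sample of 100 would not be enough to reject the null at this alpha.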

- 06:43
GREG MARTIN [continued]: So we've really got two scenarios. We've got the null hypothesis, which is that there's no difference, and the alternative hypothesis, which is that there is a difference. And the next step is to apply a statistical test. And in this case, we do a one-sample proportion test. And we generate a p-value. If the p is less than the alpha, then we can reject the null hypothesis and state

- 07:04
GREG MARTIN [continued]: that the difference that we observe is statistically significant. If we add another categorical variable, in this case, age group, we may have a research question like, does the proportion of males and females differ across these groups? So our hypothesis is that the number of men and women that we observe is dependent on the age category

- 07:25
GREG MARTIN [continued]: that we look at. In other words, the proportions change, or depend on, or are dependent on, the age category. Now, we can collect our sample data. We look at it. And we can see that, yes. In fact, the proportions do change across the age groups. In other words, in our sample data, the proportions are dependent on age category. Now, is that due to chance?

- 07:45
GREG MARTIN [continued]: Well, let's test the idea that the proportions are all the same, or that they are independent of age category. That's our null hypothesis. Now, here, we can conduct a chi-square test. And that gives us a p-value. And if the p-value is less than the alpha, we can reject the null hypothesis and state that our observation is statistically significant.
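
As a rough sketch of what the chi-square test does with gender and age group, the code below compares the observed counts with the counts expected if the two variables were independent. The table is invented, and the function names are hypothetical. The p-value shortcut only works for a table with two rows and three columns (two degrees of freedom), which is an assumption made to keep the example within the standard library.

```python
from math import exp

def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

def p_value_df2(statistic):
    """Upper-tail probability of a chi-square distribution with exactly
    2 degrees of freedom, which has the closed form exp(-x / 2)."""
    return exp(-statistic / 2)

# Invented counts: rows are male/female, columns are three age groups.
observed = [
    [30, 20, 10],
    [10, 20, 30],
]
statistic, df = chi_square_statistic(observed)
p = p_value_df2(statistic)
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p:.6f}")
print("reject the null" if p < 0.05 else "do not reject the null")
```

For tables of other sizes the statistic is computed the same way, but the p-value needs the general chi-square distribution rather than this shortcut.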

- 08:07
GREG MARTIN [continued]: If we want to look at just one numeric variable on its own, like height, then we don't have any groups to look for differences between. And we don't have another numeric variable to look for some sort of association or relationship with. So what questions can we ask? Well, we might have some theoretical value that we want to compare our data to. For example, in the case of average height,

- 08:28
GREG MARTIN [continued]: we might have some historic data. We might wonder if the current population is significantly different from that historic data. So our question might be, is the average height different from a previously established height? Let's imagine that the previously established height was 1.4 meters. We want to know if the average height in our current population is different to that. Our hypothesis is that there is a difference.

- 08:50
GREG MARTIN [continued]: Again, we collect some sample data. We find that the average height is indeed different from the historic height. Is that statistically significant? Well, if there were no difference, what would the chances be that we observed the difference that we do, or a greater difference? We conduct a t-test, comparing the averages. And if the p-value is less than the alpha, then we can reject the null hypothesis and state

- 09:10
GREG MARTIN [continued]: that the observed difference is statistically significant. Now, let's consider a categorical and a numeric variable. And we may ask the question, is there a difference between the average height of men and women, in this case? Our hypothesis is that there is a difference. In our sample, we do observe a difference. Let's assume that there is no difference.
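
For the single numeric variable compared against the historic height of 1.4 meters, a one-sample t-test can be sketched as follows. The heights are invented, the function names are hypothetical, and the p-value is obtained by numerically integrating the t distribution's density, which is a choice made here to avoid external libraries rather than anything stated in the video.

```python
from math import exp, lgamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * (log_df_pi(df))
    return exp(log_coef) * (1 + x * x / df) ** (-(df + 1) / 2)

def log_df_pi(df):
    """Natural log of df * pi, split out to keep t_pdf readable."""
    from math import log
    return log(df * pi)

def two_sided_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and |t|.

    The area between 0 and |t| is found with Simpson's rule (steps must be even).
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    area = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return max(0.0, 1 - 2 * area)

def one_sample_t_test(values, reference):
    """Return (t statistic, two-sided p-value) for mean(values) vs reference."""
    n = len(values)
    t = (mean(values) - reference) / (stdev(values) / sqrt(n))
    return t, two_sided_p(t, n - 1)

# Invented sample of heights in meters, compared with the historic 1.4 m.
heights = [1.62, 1.70, 1.75, 1.68, 1.80, 1.73, 1.66, 1.77]
t, p = one_sample_t_test(heights, 1.4)
print(f"t = {t:.2f}, p-value = {p:.6f}")
print("reject the null" if p < 0.05 else "do not reject the null")
```

The same p-value logic carries over to comparing two groups, such as men and women; only the formula for the t statistic changes.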

- 09:33
GREG MARTIN [continued]: We conduct a t-test, which gives us a p-value. If the p is less than the alpha, we reject the null. And we state that the observation is statistically significant. If we had a categorical variable with more than two categories, like age group, that's got three categories, then instead of doing a t-test, we would do an analysis of variance, or ANOVA. Now, let's look at the last combination of variable types

- 09:54
GREG MARTIN [continued]: in this data set, two numeric variables, height and weight. Here, we might start with a question, is there a relationship between height and weight? Our hypothesis is that there is a relationship. We collect sample data. We look at it. And voila. We do see some sort of relationship. Is it real? Well, let's assume that it's not. Let's assume that there's no correlation between the two

- 10:14
GREG MARTIN [continued]: variables. And if it weren't real, then what are the chances that we'd see the relationship that we do? And here, we conduct a correlation test. Now, a correlation test is going to give us two things. Firstly, it's going to give you a correlation coefficient, which tells us something about the nature of the association between the two variables. And I'm going to talk about that in just a minute. But of course, it also gives us a p-value.

- 10:37
GREG MARTIN [continued]: And again, if the p-value is less than alpha, we can reject the null hypothesis and state that the correlation that we see is statistically significant. And the correlation that we see can be represented by a number that we call the correlation coefficient. So let's talk about that for a second. Correlation coefficient is a number between negative 1 and 1. And it looks at the relationship between two numeric variables.

- 11:03
GREG MARTIN [continued]: If, as the x variable gets larger, the y variable gets smaller, we say that they are negatively correlated. If they are perfectly negatively correlated, then the correlation coefficient will be negative 1. If there is no relationship between the two variables, then the correlation coefficient will be 0. And if there's a perfectly positive correlation, as x goes up, y goes up, then the correlation coefficient

- 11:25
GREG MARTIN [continued]: will be 1. And of course, you can have any value in between. And by the way, it doesn't matter which of your variables is on the x and the y-axis. The correlation coefficient will be the same. Until next time, take care.
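
The properties of the correlation coefficient described above can be checked with a short sketch. It assumes the coefficient meant is Pearson's, which the video does not state explicitly; the data and the function name `correlation` are made up for illustration.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

x = [1.0, 2.0, 3.0, 4.0]
rising = [2.0, 4.0, 6.0, 8.0]     # y goes up as x goes up
falling = [8.0, 6.0, 4.0, 2.0]    # y goes down as x goes up

# Invented heights (m) and weights (kg) with an imperfect positive relationship.
heights = [1.60, 1.65, 1.70, 1.75, 1.80]
weights = [58.0, 66.0, 63.0, 74.0, 79.0]

print(round(correlation(x, rising), 6))    # perfectly positive
print(round(correlation(x, falling), 6))   # perfectly negative
print(round(correlation(heights, weights), 3))
# Swapping the two variables leaves the coefficient unchanged.
print(correlation(heights, weights) == correlation(weights, heights))
```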

### Video Info

**Series Name:** Gregory Martin

**Episode:** 8

**Publisher:** Gregory Martin

**Publication Year:** 2019

**Video Type:** Tutorial

**Methods:** Data analysis skills, Data synthesis, Data linkage, Statistical inference, Statistical significance, Sample size, Control variables, Categorical variables, Categorical data analysis, Range, Mean scores

**Keywords:** analysis of variance; box-and-whisker plot; chi-square test; correlation; null hypothesis; population; Statistical data; Statistical methods and models; Statistical significance; Statistical testing: overview; t-test

### Segment Info

**Segment Num.:** 1

## Abstract

Greg Martin, Editor-in-Chief, Globalization and Health, discusses statistical tests in public health, including components of a data set, combinations of variables, and statistical questions that can be answered.