- 00:00
[Statistics: Making Sense of Data. Alison Gibbs and Jeffrey Rosenthal, Statistical Sciences, University of Toronto. Today's topic: Examining relationships between two categorical variables]

- 00:03
ALISON GIBBS: We've seen how we might examine the relationship between a categorical and a quantitative variable, by seeing how the summary statistics and shape of the quantitative variable differ with the different values of the categorical variable. In this video, we'll now look at how we might examine the relationship between two categorical variables.

- 00:24
ALISON GIBBS [continued]: Recall that when we refer to the distribution of a variable, we're talking about the pattern of the values in the data for that variable, showing the frequency of occurrence of the values relative to each other. For categorical variables, the distribution is given by the counts or frequencies, or the relative frequencies, of the observations for each

- 00:45
ALISON GIBBS [continued]: of the categories of the variable. In an earlier video, for the anthropology data of measurements on 400 skeletons, we saw the distribution of mass, or BMI classification, and sex. Now, what if we're interested in looking at these two categorical variables together? Our anthropologist is interested in learning

- 01:05
ALISON GIBBS [continued]: about how the error in age estimation is associated with the body mass index. But it's important to also consider the effect of sex here. If, say, the error in age estimation also differs with sex, it will be important to understand if the body mass index classification differs with sex for these observations.

- 01:26
ALISON GIBBS [continued]: As another of the first steps in understanding these data, we should investigate the joint distribution of body mass index classification and sex so that we can learn things such as: do we have equal numbers of females and males who are obese? Or are there equal numbers of males and females in the underweight category? And so on.

- 01:46
ALISON GIBBS [continued]: We can see the joint distribution of BMI classification and sex in a contingency table, sometimes called a cross tabulation, or a two-way table, since we have two categorical variables here. In the contingency table, we classify our 400 skeletons two ways: by BMI classification and by sex. The table contains the counts or percentages

- 02:09
ALISON GIBBS [continued]: of the number of observed values for males for each of the BMI classifications, and for females for each of the BMI classifications. We can see from the table that 46, or 12%, approximately, of the 400 skeletons are underweight males. 28, or about 7%, are underweight females, and so on.
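
The two percentages quoted from the table can be checked directly from the counts. The following is a minimal sketch in Python that uses only the counts stated in the video (46 underweight males, 28 underweight females, 400 skeletons in total). The function name `joint_percentage` is an illustrative choice, not something taken from the course materials.

```python
def joint_percentage(cell_count, grand_total):
    """Percentage of ALL observations that fall in one cell of the table."""
    return 100 * cell_count / grand_total


TOTAL_SKELETONS = 400  # stated in the video

# Joint distribution entries for the two cells quoted in the video.
underweight_males = joint_percentage(46, TOTAL_SKELETONS)
underweight_females = joint_percentage(28, TOTAL_SKELETONS)

print(f"Underweight males:   {underweight_males:.1f}% of all skeletons")
print(f"Underweight females: {underweight_females:.1f}% of all skeletons")
```

Each cell is divided by the grand total, which is what makes these entries of the joint distribution rather than of a conditional one.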

- 02:31
ALISON GIBBS [continued]: For a graphical display of the joint distribution, we can plot the frequencies in either side-by-side or stacked bar plots. I've constructed them so that the height of each bar is the number of skeletons in each body mass classification for each sex. From the side-by-side plot, it is clear that there are many more skeletons in our data that are males with body mass index in the normal range

- 02:54
ALISON GIBBS [continued]: than skeletons in any other category. And for the female skeletons, we can also see that normal weight is also the most common of the BMI classifications. However, by looking at the total counts of the bars for males in the stacked bar plot, it is clear that there are many more males than females in our data. In fact, there are more than twice as many males as females.

- 03:17
ALISON GIBBS [continued]: So although the normal weight bar is taller for males than it is for females, it is difficult to judge from these plots if a greater fraction or proportion of the males tend to be normal weight than the fraction of females who tend to be normal weight. And we need to do some more work to make a fair comparison. The marginal distribution of a categorical variable

- 03:37
ALISON GIBBS [continued]: can be thought of as the distribution of only one of the variables in a contingency table. We can see it in the margins of the table, by taking the row or column totals. To understand whether the BMI classification is the same for both sexes, we need the distribution of BMI classifications separately for each sex.

- 03:59
ALISON GIBBS [continued]: This is known as a conditional distribution. Given that a skeleton is male, the conditional distribution is the distribution of BMI classification just for males. And similarly, we can look at the conditional distribution of BMI classification for females. The relevant quantity that we need for the conditional distribution--

- 04:20
ALISON GIBBS [continued]: and we can calculate it from the contingency table of counts--is the column percentage. That is, it's the percentage that each count in our contingency table is of the total number of observations in each column, which we can find using the marginal distributions for the column totals. Note that for both males and females,

- 04:41
ALISON GIBBS [continued]: the conditional distribution proportions sum to be 1. Graphically, we can compare the conditional distributions of BMI classification given sex by plotting the column percentages in stacked bar plots. From these plots, we can see that the proportions of underweight and obese skeletons are higher in the females than males,

- 05:03
ALISON GIBBS [continued]: and the proportion of normal weight skeletons is higher for males than females. We say the two variables in a contingency table are independent if the conditional distribution of one variable is the same for all values of the other variable. As we've noted, the distributions of BMI classifications seem to differ between males and females.

- 05:25
ALISON GIBBS [continued]: So it seems that BMI classification and sex are not independent for these skeletons. We can also look at the conditional distributions of sex given BMI classification. These are the row percentages, or the proportion each count in the joint distribution is of its row total. We can see that each row in this case sums to be 1.
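
Row proportions of this kind can be sketched in a few lines of Python. The example below assumes a contingency table stored as a dictionary of rows and uses only the one row whose counts are quoted in the video (46 underweight males and 28 underweight females); the other BMI rows are not reproduced here. The helper name `row_proportions` is hypothetical.

```python
def row_proportions(table):
    """Divide each count by its row total.

    The result holds one conditional distribution per row,
    so the proportions within every row add up to 1.
    """
    result = {}
    for row_label, counts in table.items():
        row_total = sum(counts.values())
        result[row_label] = {col: n / row_total for col, n in counts.items()}
    return result


# Only the underweight row is quoted in the video; other rows are omitted.
counts = {"underweight": {"male": 46, "female": 28}}

sex_given_bmi = row_proportions(counts)
for bmi_class, dist in sex_given_bmi.items():
    for sex, p in dist.items():
        print(f"P({sex} | {bmi_class}) = {100 * p:.1f}%")
```

Swapping the roles of rows and columns in the same helper would give the column proportions, that is, the conditional distribution of BMI classification given sex.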

- 05:48
ALISON GIBBS [continued]: And here, for example, we can see that 62.2% of the underweight skeletons are male. We'll now go to another example: the results of a study to test the efficacy of a new vaccine for HPV. HPV, or human papillomavirus, is a common sexually transmitted infection

- 06:08
ALISON GIBBS [continued]: that can cause genital warts and some types of cancer, most notably, cervical cancer. People infected with HPV often do not have any symptoms, and thus are unaware that they are at risk of transmitting the virus to others. This results in an environment in which the virus can spread readily. In the US, for example, it is estimated

- 06:28
ALISON GIBBS [continued]: that 20 million people are infected with HPV, and 90% of these people are unaware of their infection. In response to the spread of the disease, vaccines have been developed and are currently being adopted widely. In Toronto, for example, an HPV vaccine is now administered free of charge in school, to all girls in grade eight.

- 06:50
ALISON GIBBS [continued]: There are many types of HPV. We'll only look at protection against HPV 16, the most common type, which is associated with 55% of all cases of cervical cancer. We'll look at data from the PATRICIA study, a large study in 2004-05 that recruited over 16,000 women from 15 to 25 years old, in 14 countries.

- 07:12
ALISON GIBBS [continued]: The women were randomly assigned to receive either the HPV vaccine or a hepatitis A vaccine, in a three-dose regimen, and were then followed for three years to assess their health. In our data, we'll only include the women who received all three doses of the vaccine. While there are multiple outcomes we could consider, such as markers for cervical cancer

- 07:33
ALISON GIBBS [continued]: and other consequences of HPV infection, we'll consider simply whether or not a subject contracted a persistent HPV infection in the three years of the study. Here are four tables that we can construct from the resulting data. Let's look at these four tables to see what we can learn. In particular, which numbers indicate whether or not

- 07:54
ALISON GIBBS [continued]: the vaccine seems to prevent infections? The first table is the contingency table of counts, classifying subjects by whether or not they received the HPV vaccine and whether or not they acquired an HPV 16 infection. In this table, we can see that the PATRICIA study was large, with over 12,000 participating subjects receiving all three

- 08:16
ALISON GIBBS [continued]: doses, and with approximately equal numbers, just over 6,000, receiving the HPV vaccine or the other vaccine. Of those who received the HPV vaccine, 23 acquired an HPV 16 infection by the end of the study period, while 345 subjects in the other group acquired an infection.

- 08:37
ALISON GIBBS [continued]: In the joint distribution of proportions, we see that 3% of the subjects acquired an infection, and that 2.8% of all subjects both acquired an infection and were in the group who did not receive the HPV vaccine. In the table of row proportions, we see that of participants who received the HPV vaccine, only 0.4% acquired an infection, while in the group of patients

- 08:59
ALISON GIBBS [continued]: who did not receive the HPV vaccine, 5.7% acquired an infection. And in the table of column proportions, we can learn information such as, of the subjects who acquired infections, 6.2% were in the group that received the HPV vaccine, and 93.8% were in the other group.
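
The column proportions for the infected subjects follow from the two counts quoted in the video (23 infections in the HPV vaccine group, 345 in the other group). The sketch below, in Python, recomputes them. It does not attempt the row proportions, because the exact group sizes are only described as "just over 6,000" and are not stated precisely. The variable names are illustrative.

```python
# Infection counts by vaccine group, as quoted in the video.
infected = {"HPV vaccine": 23, "other vaccine": 345}

total_infected = sum(infected.values())

# Conditional distribution of vaccine group GIVEN that a subject was infected
# (one column of the table of column proportions).
group_given_infected = {
    group: count / total_infected for group, count in infected.items()
}

for group, share in group_given_infected.items():
    print(f"{group}: {100 * share:.2f}% of infected subjects")
```

These shares describe who the infected subjects were, not how likely a vaccinated subject was to become infected. The second question needs each count divided by its own group total, which is the row proportion.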

- 09:21
ALISON GIBBS [continued]: Because the question of interest is, does the vaccine work at preventing HPV 16 infection, in this case, the row proportions, or the conditional distribution of infection status given whether or not a subject received the HPV vaccine, give the most direct interpretation. An infection rate of 5.7% versus 0.4%

- 09:44
ALISON GIBBS [continued]: certainly seems to be compelling evidence that the HPV vaccine is effective. In later videos, we'll examine whether a difference this large could have happened just by chance, or if it's statistically significant. Let's look at one more example, from a report on the findings from a 20-year follow-up

- 10:04
ALISON GIBBS [continued]: of a large-scale study of thyroid and heart disease carried out in England in the mid-1970s. We're showing a subset of the data, containing measurements on 1,314 women who were classified at the beginning of the study as current smokers or having never smoked. And we're interested in the 20-year survival status for these women.

- 10:26
ALISON GIBBS [continued]: Looking at the contingency table for these data, the column proportions tell an interesting story. Of the smokers, only 24% had died. But of the nonsmokers, 31% had died. Does this study show that smoking might lead to a greater chance of surviving 20 years? Of course, there's a twist here.

- 10:48
ALISON GIBBS [continued]: Let's look at the column proportions for the tables of smoking and survival status, broken down by age grouping. Although age is a quantitative variable, it is sometimes given in groups to illustrate a point. As we can see from either the tables of survival by smoking status broken down by age grouping, or from the side-by-side bar chart,

- 11:09
ALISON GIBBS [continued]: for all age groups, except the 25- to 34-year-olds, the opposite conclusion is reached. That is, the death rate is higher in the group of smokers than in the group of nonsmokers. How did this happen? Age is related to both smoking status and survival. The stacked bar chart shows the age distributions

- 11:31
ALISON GIBBS [continued]: for smokers and nonsmokers. The nonsmoking population includes more older women. When the study was started, few of the women over age 65 were smokers. But of course, many of them, since they were at least 65 at the start, had died by the end of the 20-year follow-up period. Moreover, this study could potentially

- 11:53
ALISON GIBBS [continued]: underestimate the harmful effects of smoking, since the observed small percentage of older smokers could have happened because smokers tend not to survive to age 65, so there were fewer smokers in the older age groups to enroll in the study. This is an example of Simpson's Paradox, in which conditional distributions within subgroups

- 12:15
ALISON GIBBS [continued]: can give the opposite conclusion to conditional distributions for the combined observations. Age, here, is a lurking variable. We need to always watch for lurking variables which, if taken into account in our analyses, might affect our conclusions. In some upcoming lectures, we'll talk about data collection

- 12:36
ALISON GIBBS [continued]: and how to design a study to mitigate the effects of lurking variables.
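
The reversal described in the smoking example can be reproduced with a small numerical sketch in Python. The counts below are HYPOTHETICAL, chosen only to make the arithmetic easy to follow. They are not the study's data, which the transcript does not give by age group. The two age groups and all names in the code are likewise illustrative assumptions.

```python
# HYPOTHETICAL counts, not the study's data: (deaths, group size).
data = {
    "younger": {"smoker": (10, 200), "nonsmoker": (4, 100)},
    "older": {"smoker": (20, 40), "nonsmoker": (80, 200)},
}


def combined_rate(table, status):
    """Death rate for one smoking status after pooling all age groups."""
    deaths = sum(group[status][0] for group in table.values())
    size = sum(group[status][1] for group in table.values())
    return deaths / size


# Conditional death rates within each age group.
within = {
    age: {status: deaths / size for status, (deaths, size) in group.items()}
    for age, group in data.items()
}

# Death rates for the combined observations, ignoring age.
overall = {status: combined_rate(data, status) for status in ("smoker", "nonsmoker")}

for age, rates in within.items():
    print(f"{age}: smokers {rates['smoker']:.1%}, nonsmokers {rates['nonsmoker']:.1%}")
print(f"combined: smokers {overall['smoker']:.1%}, nonsmokers {overall['nonsmoker']:.1%}")
```

In this made-up table the smokers have the higher death rate within each age group, yet the lower death rate once the groups are pooled, because most of the hypothetical nonsmokers sit in the older, higher-risk group. That imbalance in the lurking variable is what drives the reversal.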

### Video Info

**Series Name:** Understanding Data

**Publisher:** Alison Gibbs and Jeffrey Rosenthal

**Publication Year:** 2013

**Video Type:** Tutorial

**Methods:** Categorical variables

**Keywords:** crosstabulation; paradox; vaccines

### Segment Info

**Segment Num.:** 1

## Abstract

Alison Gibbs explores various methods for analyzing categorical data. Gibbs utilizes real-world studies to illustrate these methods and potential errors in their use.