How-to Guide for R

Introduction

In this guide, you will learn how to produce a Pearson’s Chi-Squared test with a Yates’ Correction in R software using a practical example to illustrate the process. You will find links to the example dataset, and you are encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes that you have already opened the data file in R.

Please note that, due to the nature of R, there are often many different ways to achieve the same analysis, as the R-user community develops new commands and code or refines existing ones.

Contents

- 1 Yates’ Correction
- 2 An Example in R: Level of Importance of Religion and Whether Volunteered in the Past Twelve Months
- 2.1 The R Procedure
- 2.2 Exploring the R Output
- 3 Your Turn

1 Yates’ Correction

The Yates’ Correction is not a test as such but a means to correct a problem faced when using Pearson’s Chi-Squared test on 2 × 2 contingency tables. Pearson’s Chi-Squared test approximates the distribution of its test statistic with the continuous chi-squared distribution, but the counts in a 2 × 2 table are discrete, so the approximation is imperfect. This is less of an issue in large samples. However, in small samples where some cell counts are low (e.g., below 10 or below 5), it is problematic, as Pearson’s Chi-Squared test then tends to reject the null hypothesis too often. The Yates’ Correction adjusts for this problem. Using the Yates’ Correction, we can run a Pearson’s Chi-Squared test to identify whether there is a statistically significant association between two categorical variables (e.g., ethnicity, gender), both of which are dichotomous (e.g., BME/White, male/female).
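To make the adjustment concrete, the sketch below computes the corrected statistic by hand and compares it with R's built-in function. It relies on the standard form of the correction, which subtracts 0.5 from each absolute difference between observed and expected counts before squaring. The 2 × 2 counts are hypothetical values chosen for illustration, not figures from the ANES data.

```r
# Hypothetical 2 x 2 table of counts (illustrative only, not ANES data)
observed <- matrix(c(12, 8, 7, 13), nrow = 2)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' Correction: shrink each |observed - expected| by 0.5
# (capped so the adjustment never exceeds the difference itself)
adjustment <- min(0.5, abs(observed - expected))
yates_stat <- sum((abs(observed - expected) - adjustment)^2 / expected)

# p-value from the chi-squared distribution with 1 degree of freedom
yates_p <- pchisq(yates_stat, df = 1, lower.tail = FALSE)

# The built-in test with correct = TRUE should give the same numbers
builtin <- chisq.test(observed, correct = TRUE)
all.equal(yates_stat, unname(builtin$statistic))
all.equal(yates_p, builtin$p.value)
```

Running the same table with correct = FALSE gives a larger statistic, which shows how the correction makes the test more conservative.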

2 An Example in R: Level of Importance of Religion and Whether Volunteered in the Past Twelve Months

This example is a Pearson’s Chi-Squared test with a Yates’ Correction using two variables from the 2016 American National Election Studies Survey. There are 3,620 respondents in this subset of data. The two variables that we will examine are:

- Is religion important part of R’s life (Religion)
- Has R done any volunteer work in past 12 months (Volunteer)

The first variable, Religion, is coded 1 if a respondent reports “important” and 2 if “not important.” The second variable, Volunteer, is coded 1 if “Yes, have done this in the past 12 months” and 2 if “No, have not done this.”

Data Cleaning

Before opening the CSV file used in this example in R and performing the analyses outlined in this guide, you will need to do a small amount of preliminary cleaning of the data file. The file contains two codes, -8 and -9, that represent “missing” data. You need to change these to “NA” (“Not Available”) so that R reads them as missing and excludes them from your analysis; otherwise, they could skew your results.

To change the codes, open the CSV file in a spreadsheet program (the menu names given here are those used in Microsoft Excel) and highlight all variable columns. Then on the “Home” toolbar, click on “Find & Select” and choose “Replace” from the drop-down menu. In the “Find and Replace” dialog box that opens, insert “-8” into “Find what” and “NA” into “Replace with.” Then click “Replace All.” This replaces all the -8 values with “NA.” Repeat the process for the other missing code (i.e., -9).

For the nominal variables (Religion, Veteran, and Volunteer) in this dataset, you will also need to use the “Find and Replace” function to convert the data values to text; the Codebook that accompanies this guide will help you do this. For example, Religion is currently coded 1 for “important” and 2 for “not important”; using “Find and Replace,” you should change the 1 to “important” and the 2 to “not important.” You do this in exactly the same way as you did for the “missing” codes, except that you highlight one variable column at a time as opposed to the whole dataset. Once you have done this, save the file and then proceed as outlined in this guide.
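The same cleaning can also be done inside R rather than in a spreadsheet. The following is a minimal sketch of that alternative, not the procedure this guide's script file uses. It writes a small hypothetical CSV to a temporary file so that it runs on its own; the object names (csv_path, demo) and the short Volunteer labels ("yes", "no") are assumptions made for illustration, and the Codebook remains the authority on the actual labels.

```r
# Build a tiny hypothetical CSV that mimics the raw coding (1, 2, -8, -9)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Religion,Volunteer",
             "1,2",
             "2,1",
             "-9,1",
             "1,-8",
             "2,2"),
           csv_path)

# na.strings tells read.csv() which raw values to treat as missing (NA)
demo <- read.csv(csv_path, na.strings = c("NA", "-8", "-9"))

# Convert the numeric codes to labelled factors
demo$Religion <- factor(demo$Religion,
                        levels = c(1, 2),
                        labels = c("important", "not important"))
demo$Volunteer <- factor(demo$Volunteer,
                         levels = c(1, 2),
                         labels = c("yes", "no"))  # assumed short labels

summary(demo)  # counts per category, with NA reported separately
```

With the real file, the same na.strings argument could be passed when reading dataset-anes-2016-subset1.csv, which would leave the original file unchanged.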

2.1 The R Procedure

R is a free, open-source software and computing platform. Unlike many other statistical software packages, it does not operate with drop-down menus. Instead, users submit lines of code that execute commands and functions built into R. It is a good idea to save your code in a simple text file that R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are new to using R, we suggest you start with the introduction manual (http://cran.r-project.org/doc/manuals/r-release/R-intro.html). Another useful introductory guide is Andy Field’s Discovering Statistics Using R (2012, SAGE).

For this example, we must first load the dataset into R. The code in this guide then refers to each variable through the data object using the $ operator (e.g., ANES16$Religion), so R can access the variables stored inside the data file. It is possible to import data from a variety of software packages, including SPSS, Stata, Minitab, and Excel. It is best to import data from software packages in file formats that are R-friendly, such as tab-delimited text (.txt in Excel or .dat in SPSS) or comma-separated files (.csv). This guide will use a CSV file (dataset-anes-2016-subset1.csv). R code for importing other file types is easy to find online.

If you are using the CSV file provided with this example, the code looks like this (assuming the data file is already saved in your working directory):

ANES16 <- read.csv("dataset-anes-2016-subset1.csv")

The code reads in the dataset and assigns it to an object named ANES16.

Prior to running a Pearson’s Chi-Squared test with a Yates’ Correction or indeed any statistical test, it is good practice to examine each variable on its own; this is univariate analysis. This gives us an opportunity to describe each variable and get an initial “feel” for our data. We can now generate frequency tables for the Religion and Volunteer variables; to do this, the summarytools package must first be installed (if you have not done this previously, for example with install.packages()). We can then run the R code, which is shown below followed by the results:

summarytools::freq(ANES16$Religion, order = "freq")

Figure 1: Frequency Distribution of Religion.

summarytools::freq(ANES16$Volunteer, order = "freq")

Figure 2: Frequency Distribution of Volunteer.

The frequency table for Religion is shown in Figure 1. The majority of the sample (65.5%) state that religion is “important,” whereas around a third state that it is “not important.” It should be noted that there are 27 missing cases. Figure 2 shows the frequency distribution of Volunteer. We can see that just over half of the respondents (55.6%) had not volunteered in the past year. It should be noted that there are 634 missing cases. Figures 1 and 2 show the distribution of each of these variables by itself, but they cannot tell us whether the two variables are associated.
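If the summarytools package is not available, a similar frequency table can be put together with base R alone. The sketch below uses a short hypothetical vector (religion_demo) in place of ANES16$Religion so that it runs on its own; the values are made up and do not reproduce the percentages reported above.

```r
# Hypothetical responses standing in for ANES16$Religion
religion_demo <- factor(c("important", "important", "not important",
                          NA, "important", "not important"))

# Counts per category, with missing values shown as their own row
counts <- table(religion_demo, useNA = "ifany")

# Percentages of valid (non-missing) responses, rounded to one decimal place
valid_percent <- round(100 * prop.table(table(religion_demo)), 1)

counts
valid_percent
```

Because table() drops missing values by default, the percentages here are based on valid cases only, whereas the counts use useNA = "ifany" to make the number of missing cases visible.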

2.2 Exploring the R Output

We can now run the Pearson’s Chi-Squared test with a Yates’ Correction; the code and the results are shown here:

chisq.test(ANES16$Religion, ANES16$Volunteer, correct=TRUE)

Figure 3: Results of Chi-Squared Test With Yates’ Correction.

Figure 3 shows the results of our test. The p-value is 1.563e-11; rounding it to three decimal places (e.g., with the round() function) gives 0, so we report it as p < .001.

The findings are statistically significant (p < .001). Therefore, we can reject the null hypothesis of no association between level of importance of religion in an individual’s life and whether that individual has volunteered in the past 12 months. Our analysis supports the conclusion that there is an association between level of importance of religion in an individual’s life and whether that individual has volunteered in the past 12 months.
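The individual pieces of the test result can also be pulled out of the object that chisq.test() returns, which is useful for checking expected cell counts and for reporting. The sketch below uses a hypothetical 2 × 2 table of counts (tab) rather than the ANES data, so its numbers will not match Figure 3; the object names are assumptions made for illustration.

```r
# Hypothetical 2 x 2 counts (illustrative only, not the ANES results)
tab <- matrix(c(40, 10,
                15, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(Religion  = c("important", "not important"),
                              Volunteer = c("yes", "no")))

res <- chisq.test(tab, correct = TRUE)

res$expected   # expected counts under the null hypothesis of no association
res$statistic  # Yates-corrected chi-squared statistic
res$parameter  # degrees of freedom (1 for a 2 x 2 table)

# Very small p-values print in scientific notation; format.pval() with eps
# reports anything below the threshold as an inequality instead
p_label <- format.pval(res$p.value, eps = 0.001)
p_label
```

With the real data, passing table(ANES16$Religion, ANES16$Volunteer) to chisq.test() should be equivalent to passing the two variables directly, and it lets you inspect the cross-tabulation first.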

3 Your Turn

Download this sample dataset and see whether you can replicate the results presented here for the Religion variable. The sample dataset also includes another variable called Veteran, which relates to whether the respondent had served on active duty in the armed forces. Then try running your own Pearson’s Chi-Squared test with a Yates’ Correction, substituting Veteran for Religion in the analysis.