Secondary data, often collected for billing purposes or administrative decision-making, can also be useful in the development of a program of research. Before secondary data can be utilized, however, they must be managed, understood, and made ready for analysis. The purpose of this case study is to demonstrate the cleaning and preparation of a multi-site, multi-year secondary dataset using a step-by-step process. The use of secondary data is common in health services research because such data often provide ample information to explore relationships and examine trends. Secondary data can be particularly useful when researchers have limited time and resources to conduct large-scale projects. However, secondary data are often messy, which makes working with them challenging: the analyst may not have access to detailed information about how the data were collected and coded, and there may be missing data, outliers, or extraneous data elements. The method used for this project was a modified version of a step-by-step guide to the use and management of secondary data in research. The objectives of this case are to encourage novice researchers to consider using secondary data and to provide a practical format to follow when doing so. The use of secondary data requires careful and thoughtful data management. Following a guide such as this aids in the organization of the project and ensures that none of the many steps in the process is missed, so that the data are well prepared for analysis.
By the end of this case, students should be able to
- Understand the challenges associated with the use of secondary data
- Apply a step-by-step guide to secondary data preparation
- Evaluate the readiness of dataset(s) for analysis
In my first year as a PhD student, I was given an opportunity to join a research team as part of a Research Immersion Practicum. The research group was conducting an analysis of the Nurse Practice Environment using a survey instrument called the Practice Environment Survey (PES) as part of a larger program evaluation project. The PES measures the nurse perception of the practice environment. A good practice environment is important because it has been linked to positive patient and nurse outcomes (Cummings, Hayduk, & Estabrooks, 2006). The PES is a widely used instrument; it has been used in many countries and translated into several languages (Warshawsky & Havens, 2011). This instrument has 31 individual questions (items) that make up five subscales and a composite score which measures the overall environment in which nurses practice. For each item, nurses answer questions by rating the presence of the item in their current job setting using a 4-point Likert scale. For example, nurses would rate the presence of collegial physician–nurse relationships in their work setting by stating they strongly disagree, disagree, agree, or strongly agree that nurses and physicians have collegial working relationships in their place of work (Lake, 2002). My primary responsibility on the team was to prepare the PES data for analysis. The data were obtained from the source in four separate files that consisted of 16,667 de-identified cases (individual responses). Each file represented one survey year from 10 different military hospitals (the survey was conducted once a year for 4 years). The team wanted to analyze the data in two different ways. The first objective was to analyze the practice environment of medical, surgical, and intensive care units (ICUs) within each of the 10 hospitals. The second objective was to compare the composite PES score (at each hospital) by year to view changes over time.
This meant that instead of aggregating data to the hospital level only, it was important to be able to identify responses by particular unit types within a hospital as well as responses by hospital alone.
The variables within the dataset were as follows:
- Respondent ID—this variable was a de-identified code that was assigned to each response
- Survey Year—the calendar year in which the survey was conducted
- Respondent Position—the position type of the respondent (registered nurse, licensed vocational nurse, or certified nursing assistant)
- Facility Name—name of the hospital
- Facility Location—geographic location of the hospital by region
- Unit Billing ID—four-character code used for accounting and pay purposes
- PES items 1-31—Likert scale responses to the individual items of the PES (response choices are strongly disagree, somewhat disagree, somewhat agree, or strongly agree)
As a first step, I sought guidance from the research team members and my faculty advisor and also reviewed the research literature. They all suggested following a guide that would help me through the data management process to ensure that I stayed organized and did not miss any steps. In my search for a guide, I found a resource developed by Andersen, Prause, and Silver (2011), a step-by-step guide to the management of secondary data in research, and decided to follow its steps. I began by listing the steps:
1. identify a secondary dataset
2. create a personalized dataset
   - organize the project
   - extract the meaningful variables
   - create a codebook
   - structure the data
3. create needed variables
   - composite variables
   - proxy variables
4. consider methodological/statistical implications
   - missing data
   - imputing data
   - weighting data
Now that I had a guide, I decided to review each step and determine what it meant for me. I could quickly mark the first step off the list because the dataset had been chosen before I joined the research team. However, Andersen et al. (2011) do have some words of wisdom to share regarding the selection of a secondary dataset.
An important consideration when choosing to use secondary data is ensuring that the data you have can answer the research questions you are attempting to ask. In our case, we would not want to select a dataset containing patient adverse events (e.g., patient falls) within a facility and attempt to estimate the nurse practice environment from that information, because we would not be able to answer the research question at hand. Another consideration is identifying whether or not the dataset supports the unit of analysis in your target population. For example, if our dataset only included the facility’s composite score, it would be impossible to learn anything about unit type–specific practice environments within that facility.
The PES dataset contained de-identified information from the individual respondent level, and each respondent provided their Unit Billing ID. Along with identifying what dataset to use, Andersen et al. (2011) stress the importance of establishing contacts with experts when acquiring the data. For example, in our case, the PES dataset contained the Unit Billing ID that was just mentioned. The research team identified and contacted a subject matter expert who provided valuable information about the billing codes. The expert assisted in answering questions such as the following: What do the billing codes mean? How might they help address the study’s specific aims? More information about recoding and creating variables to aid in analysis will be discussed later in the case study.
The goal of Step 2 is to create a personalized dataset. This is an extremely important step that cannot be skipped. Many secondary datasets have far more information than what is needed to answer the posed research question. These variables can muddy the analysis and slow down the data cleaning process. Personalizing the dataset focuses the attention on only the specific variables that will be used during the analysis. I began to organize the project by creating an electronic filing system which included back-up datasets for each year. These files were kept, unchanged, in the event that I made mistakes with the working file. Working files were saved under the same name, but with a version identifier and date so that I could clearly identify the most current version. Figure 1 is an example of what the data file names looked like. A practical lesson I learned was to open and save each file at the beginning of a work session to ensure I remembered to update the file name and preserve the previous file. If I waited, I would often hit save while I was working and overwrite the previous version.
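The versioning habit described above can be automated so that the previous file is preserved before any work begins. The sketch below is a minimal illustration, assuming a hypothetical naming convention of the form `PES_2010_v3_2015-01-10.csv` (base name, version number, date); the actual project's file names may have differed.

```python
import re
import shutil
from datetime import date
from pathlib import Path

def start_session_copy(working_file, today=None):
    """Copy the working file to a new version-stamped name.

    Assumes names like 'PES_2010_v3_2015-01-10.csv' (hypothetical
    convention): the version number is incremented and the date
    replaced, so the previous version is left untouched.
    """
    working_file = Path(working_file)
    today = today or date.today()
    m = re.match(r"(.+)_v(\d+)_\d{4}-\d{2}-\d{2}(\.\w+)$", working_file.name)
    if not m:
        raise ValueError(f"unrecognized file name: {working_file.name}")
    stem, version, ext = m.group(1), int(m.group(2)), m.group(3)
    new_path = working_file.with_name(
        f"{stem}_v{version + 1}_{today.isoformat()}{ext}")
    shutil.copy2(working_file, new_path)  # original stays unchanged
    return new_path
```

Running this at the start of each session, rather than saving manually mid-work, avoids exactly the overwriting problem mentioned above.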
A data log was also created to track actions and decisions made during the data management process. At the end of each work session, I would make an entry into the data log which would briefly summarize what I accomplished during that work session. In addition to tracking the reasoning behind the data management actions and decisions, this information provided me with a starting point for the next work session. By reviewing the data log, I knew where I left off and when I needed to start in the next work session. Figure 2 provides an example of what a data log could look like. Note the column headings and corresponding notes for each data management session. Other columns could be added as needed, for example, you could consider adding the contact information of a subject matter expert for a given variable.
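A data log like the one described can be kept as a simple CSV file that is appended to at the end of each work session. The sketch below is only an illustration: the column headings (`date`, `variable`, `action`, `rationale`) are assumptions, not the columns used in the actual project's log.

```python
import csv
from datetime import date
from pathlib import Path

# Assumed column headings; adapt to whatever the team agrees to track.
LOG_COLUMNS = ["date", "variable", "action", "rationale"]

def log_entry(log_path, variable, action, rationale, when=None):
    """Append one data-management decision to a CSV data log.

    Writes a header row on first use so the log can be opened
    directly in a spreadsheet at the start of the next session.
    """
    log_path = Path(log_path)
    is_new = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(LOG_COLUMNS)
        writer.writerow([(when or date.today()).isoformat(),
                         variable, action, rationale])
```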
Prior to making a master codebook, I created a variable codebook for each year’s dataset. This provided the labels for each variable and was very useful in determining whether and how the variables differed from year to year. The next step was to identify and extract meaningful variables for analysis. Team input was vital during this process, especially because I was new to research and new to the team. I did not want to drop a variable that would be needed for analysis. Based on the decisions regarding which variables were to be retained and the individual variable codebooks for each year, a master codebook was created that identified each variable, the labels for the variable, and the associated numeric codes that were assigned for analysis. An example of what a codebook may look like can be found in Figure 3. The goal was to have all variables coded in a way that was standardized across the four datasets. The last part of this step was structuring the secondary data into a file that could be read and analyzed by the statistical software program we were using. We used the comma-separated value (.csv) file format, which can be read by virtually all statistical software packages.
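One way to build the per-year variable codebooks is to scan each year's file and list every distinct value observed for each variable; comparing these summaries across years quickly reveals where labels or codes differ. A minimal sketch using only the standard library (file and column names are illustrative, not the project's actual ones):

```python
import csv

def variable_codebook(csv_path):
    """Summarize one year's file: each variable with its distinct values.

    Comparing these summaries across survey years shows whether a
    variable was labeled or coded differently from year to year.
    """
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        values = {name: set() for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                values[name].add(value)
    # Sort for stable, human-readable output.
    return {name: sorted(vals) for name, vals in values.items()}
```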
During Step 3, variables were created to add clarity and real-world meaning to the data. First, a composite variable was created to provide an overall score of the practice environment from each of the respondents. A composite variable is normally calculated from existing variables. In this case, the “Composite PES” variable was calculated from all 31 individual items from the PES instrument. Creating this composite variable took several steps. First, the responses strongly disagree, somewhat disagree, somewhat agree, and strongly agree had to be recoded into the assigned numeric values of 1, 2, 3, and 4, respectively. Then, the “Composite PES” variable was created by calculating the mean of the numeric ratings for the 31 items from each response. After creating this composite variable, we began considering proxy variables that needed to be created. A proxy variable is built from an existing variable to stand in for information of interest that is not directly recorded. For example, the variable Unit Billing ID did not seem to have much meaning for this study. However, after consulting with experts, we learned that this variable provided information regarding unit type. If the Unit Billing ID started with an A, the unit was a medical unit; B indicated a surgical unit; and C indicated an ICU. By taking time to learn the meaning and purpose behind each variable, the team realized that this variable forced units to be identified as either medical or surgical for accounting purposes. However, having worked in several hospitals, we knew that in reality many units are combined medical-surgical units; therefore, the team identified the need to create a new variable called “Unit Type” which allowed for analysis based on the actual type of patients who were cared for, not just on accounting codes.
To identify this potential pitfall in the data, time had to be taken to understand what each of the variables meant, how it was generated, and what it was used for. This type of care and attention to detail is critical when using secondary data and helps to ensure the output is meaningful (Andersen et al., 2011). Whenever I learned new information about a variable, I made an entry in the data log to ensure the information was tracked and available for the rest of the research team.
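The recoding, composite score, and proxy variable described in Step 3 can be expressed in a few lines. The sketch below assumes the response labels and billing-code prefixes exactly as described above; it is an illustration of the logic, not the team's actual code, and does not handle the combined medical-surgical case, which required the team's judgment.

```python
# Likert labels recoded to the numeric values assigned in the codebook.
LIKERT = {"strongly disagree": 1, "somewhat disagree": 2,
          "somewhat agree": 3, "strongly agree": 4}

# First character of the Unit Billing ID, per the subject matter expert.
BILLING_PREFIX = {"A": "medical", "B": "surgical", "C": "ICU"}

def composite_pes(item_responses):
    """Mean of the numeric ratings across the PES items for one case."""
    ratings = [LIKERT[r.lower()] for r in item_responses]
    return sum(ratings) / len(ratings)

def unit_type(billing_id):
    """Proxy variable derived from the Unit Billing ID's first character."""
    return BILLING_PREFIX.get(billing_id[0].upper(), "other")
```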
During Step 4, the research team began to address the methodological and statistical considerations that commonly arise when using secondary data. One such consideration was the decision regarding handling of missing data. Before removing any cases (response from a participant in the survey), we had to discuss how this may alter the sample we were working with. For this step, consultation with the whole team, and in particular our statistician, was necessary before I decided to manage the missing data.
The overall strategies for handling missing data are twofold: casewise (listwise) deletion and imputation of missing values. Choosing the right strategy depends on the underlying patterns in the data. The three main categories of missing-data patterns are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Allison, 2002; Rubin, 1976). One of the easier ways to deal with missing values is to delete all observations with missing data, but this results in the loss of potentially valuable information. Another option for handling missing data would have been to fill in, or “impute,” values. This is a statistical technique that substitutes a value for the missing data point. Imputation, or the process of replacing missing data, can be accomplished in several ways and ranges in complexity from simple (using the mean from existing cases or carrying the last value forward) to complex (using logic rules to match a similar case) (Enders, 2010).
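To make the two strategies concrete, here is a minimal sketch of listwise deletion and simple mean imputation over cases stored as dictionaries. A real project would typically use a statistical package, and mean imputation is shown only because it is the simplest of the options mentioned above, not because it was the method the team chose.

```python
from statistics import mean

def listwise_delete(cases, items):
    """Keep only cases with no missing values on the listed items."""
    return [c for c in cases if all(c.get(i) is not None for i in items)]

def mean_impute(cases, item):
    """Replace missing values on one item with the mean of observed values."""
    observed = [c[item] for c in cases if c.get(item) is not None]
    fill = mean(observed)
    for c in cases:
        if c.get(item) is None:
            c[item] = fill
    return cases
```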
Running descriptive statistics for quantitative variables and frequency tables for categorical variables provided important information for team decision-making. These statistics helped the team identify what percentage of each variable was missing, aided in identifying outliers, and helped to determine whether or not our variables were MCAR. In addition, during the analysis of some secondary survey data, teams may choose to “weight” certain variables or subgroup responses. These methods are often used with complex instruments, in complex designs, or if the instrument was originally intended to be weighted. Four types of weighting commonly used with surveys are (1) weighting as a first-stage ratio adjustment, (2) weighting for differential selection probabilities, (3) weighting for nonresponse, and (4) post-stratification weighting for sampling variance reduction. The choice to weight an instrument or responses depends on the instrument design, the sample in relation to the target population, and the variation in response rates (Groves et al., 2009). These decisions are based on the existing information about the instrument, the theoretical concepts guiding the research, the proportion of responses, and your statistician’s suggestions. As decisions were made by the team, both the decisions and the rationale behind them were tracked in the data log for future reference.
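The screening statistics mentioned above (percent missing per variable, frequency tables for categorical variables) can be computed directly. A small sketch, again treating cases as dictionaries; variable names are illustrative:

```python
from collections import Counter

def percent_missing(cases, variable):
    """Share of cases (0-100) with no value recorded for a variable."""
    n_missing = sum(1 for c in cases if c.get(variable) in (None, ""))
    return 100.0 * n_missing / len(cases)

def frequency_table(cases, variable):
    """Counts of each observed category, ignoring missing values."""
    return Counter(c[variable] for c in cases
                   if c.get(variable) not in (None, ""))
```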
Secondary data are useful in many types of research and are particularly common in health services research. Managing secondary data and preparing them for analysis are complex and time-consuming tasks. Following a step-by-step guide helps to ensure steps are not missed. Taking time to understand the nuances of the data leads to more purposeful cleaning and preparation of the data for analysis. Organization and careful documentation are necessary components of all research and are particularly important in managing secondary data.
- What steps would you take in order to prepare a dataset for analysis?
- What would you do with missing data? What must you first determine about the missingness of the data? What would you do if a case (response from a participant in the survey) was missing all 31 items on the PES? What if 2 of the 31 were missing? What if 20 of the 31 items were missing?
- Specifically, given what you know about the unit identifier codes, how would you treat “like codes” from different facilities?
- How would you separate unit types? What considerations would you have based on the scenario?
- If a new unit was created sometime during the 4 years contained in this dataset, how would you treat this unit? Would you exclude it or include it and why?