  • 00:01

    MATT DENNY: Hi, everyone. This is your instructor, Matt Denny. And today, we're going to talk about your second homework assignment. So in your second homework assignment, we're going to build on what we did in the first homework assignment, and what we've been talking about in the topic for this past week, with reading in and managing

  • 00:24

    MATT DENNY [continued]: and working with multiple, sometimes large, datasets. And so in this assignment, you're going to be working with several hundred CSV files. You're going to be reading them in and combining them together. Then we're going to be doing some collapsing of this dataset. So you're going to start out with a dataset that

  • 00:46

    MATT DENNY [continued]: has many tens of thousands of rows, and you're going to be collapsing that dataset down further and further to form datasets that are collapsed over various things. So the data that we're going to be dealing with is information about bill sections.

  • 01:06

    MATT DENNY [continued]: So one thing that will come up commonly throughout this course: I am a political scientist as my day job. And I study legislation, the process of how legislation is written and formed in the United States at the federal level. And so one of our primary data sources

  • 01:28

    MATT DENNY [continued]: here would be text data, and this is made available online. So we'll learn about this in, actually, the next assignment. But I tend to collect a lot of information about pieces of legislation. I tend to do so at a really fine, small level. So in this case, these larger pieces of legislation

  • 01:48

    MATT DENNY [continued]: that are introduced in the United States, they each have little sections. There'll be interesting stuff about those sections. And so I'm going to give you an example of a dataset, a pared-down one again, but one that I've actually used in my own research that records some metadata about individual bill sections.

  • 02:10

    MATT DENNY [continued]: And we're going to be doing some collapsing. So the first thing we're going to do is take that data, so we might have multiple rows, multiple observations for each unique piece of legislation, and we're going to collapse that data down into one observation for each bill. And then we might do things like record how many sections went along with this bill,

  • 02:31

    MATT DENNY [continued]: that sort of thing. And then we're going to go, actually, one step further and say, OK, for each unique legislator, tell me some interesting things about the legislation that they introduced. And these are actually really relevant research tasks, and something you'll have to do commonly. Oftentimes, maybe you collect a whole bunch of Twitter data, and you need to subset it down, or collapse it down,

  • 02:54

    MATT DENNY [continued]: so that you have some information at the Twitter user level. You might have a whole bunch of observations for countries, or any sort of individual over time or across contexts, and you need to do some sort of collapsing. Maybe a bunch of individuals that are in one classroom together, and you want to get their average grades and do a classroom-level analysis.

  • 03:16

    MATT DENNY [continued]: These are all things where you'd need to collapse a dataset, and where you might be working with multiple datasets. And along the way, we're going to cover--I'm going to suggest and talk to you about this, but we're going to cover a really bread-and-butter thing here--it's a little thing, but generating observation IDs with the paste function. This turns out to be super, super helpful.
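
As a minimal sketch of generating observation IDs with paste, the example below uses made-up country-year rows. The data frame `obs`, its column names, and the "-" separator are illustrative assumptions, not part of the assignment data.

```r
# Hypothetical country-year observations (illustrative values only).
obs <- data.frame(
  country = c("USA", "USA", "Canada", "Canada"),
  year    = c(2001, 2002, 2001, 2002),
  stringsAsFactors = FALSE
)

# paste() combines the fields element-wise; sep sets the delimiter.
obs$obs_id <- paste(obs$country, obs$year, sep = "-")

print(obs$obs_id)

# If the fields jointly identify a row, no ID should repeat.
stopifnot(!any(duplicated(obs$obs_id)))
```

Using a separator that cannot appear inside the fields themselves helps keep the combined IDs unambiguous.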

  • 03:39

    MATT DENNY [continued]: So you can oftentimes bring together a whole bunch of different pieces of information about an individual to form a unique identifier for each row. And that will become useful here in this assignment. So with that, I'm going to head over to my desktop. And let's see what we've got going on. So I'm here on my desktop. We should have three files. We should have the homework_2.R script,

  • 04:01

    MATT DENNY [continued]: the homework_2_solution.R script. And again, this would be the same as we had for homework 1. It'll give you my code that I wrote for trying to solve the exercises. In this case, there are three of them in the assignment. And then we're also going to have this multi_datasets.zip file.

  • 04:21

    MATT DENNY [continued]: So in general, I believe now with most Windows and Mac computers, and even most Linux computers, if you double click on a ZIP file, it should automatically extract itself to a folder. So a ZIP file's the way that we can handily give you one thing to download, and not hundreds of things to download. And then with the multi_datasets folder here,

  • 04:44

    MATT DENNY [continued]: if I actually open this up, we'll see that there are a heck of a lot. I believe there's 471 CSV files in here. And so your first task is going to be to read in and combine all of these. And we can take a look at one. So let's look at dataset_1.csv. So Excel will take forever to load.

  • 05:06

    MATT DENNY [continued]: This is another nice thing about working with R. You can probably get it read in, loaded, and ready to look at in one quarter the time it's going to take Excel to open up the CSV file. So while we're waiting on that--is it going to go for me? Here we go. So the structure of the data we have here is that we

  • 05:27

    MATT DENNY [continued]: have a session variable. So this is going to tell us about the session of Congress. This is actually going to come in useful later. So in the United States, the 103rd session of Congress would have been, I think it must have been--oh, here we go, 1993. Very helpful. So the 103rd would have been 1993 to 1995.

  • 05:49

    MATT DENNY [continued]: And in the United States federal system, a session of Congress lasts two years. The chamber variable, this is going to be either an HR or an S, although I don't know in the States if we're only going to encounter HR. So this would be House of Representatives for HR,

  • 06:10

    MATT DENNY [continued]: or Senate. So there are two chambers. It's a bicameral system in the United States. We'll also have a bill number. So we have bill number 1. And then look, we've got bill number 10. And then 100 and then 1000, and so on. So there's clearly some interesting things going on with the numbering here.
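
One possible explanation for the 1, 10, 100, 1000 ordering (an assumption; the transcript does not say how the rows were ordered) is that the bill numbers were sorted as text rather than as numbers. The sketch below uses a made-up vector `bill_numbers` to show the difference.

```r
# Hypothetical bill numbers (illustrative values only).
bill_numbers <- c(1, 2, 10, 100, 1000, 20)

# Sorting as text compares character by character, so "10" comes before "2".
# method = "radix" keeps the result independent of the locale.
as_text <- sort(as.character(bill_numbers), method = "radix")

# Sorting as numbers gives the ordering most people expect.
as_number <- sort(bill_numbers)

print(as_text)
print(as_number)
```

If the bill number column is read in as character, converting it with as.numeric before sorting would restore the numeric ordering.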

  • 06:33

    MATT DENNY [continued]: Those are things you'll want to work through. We have the bill type. So IH, in this case, means introduced in the House. There are other things that could be introduced in the Senate, reported to the floor in the House. There's all these different stages a bill can be at. Then we have the section number variable. So here, we see that there are 23 sections associated

  • 06:57

    MATT DENNY [continued]: with this first bill, introduced in the House of Representatives during the 103rd session of Congress. We have the number of co-sponsors. So this is the number of people who co-sponsored that piece of legislation. And as we can see right here, the format I've given to you--in my research, I also had a whole bunch

  • 07:17

    MATT DENNY [continued]: of other stuff, other variables, that you could get into that deal with stuff that's specific to a bill section. Like how many words did it have or what was it about, or these sorts of things. When was it inserted into the bill, or something like that? For the sake of this assignment, not

  • 07:39

    MATT DENNY [continued]: to get things too, too complicated, essentially what you'll notice is that all the metadata here is going to be the same. The only thing that's going to change is the section number for each bill. And so actually, one of the first things you'll want to do after we read in the data is going to actually be to collapse this. So we only really need essentially, like, need one row here.

  • 08:00

    MATT DENNY [continued]: And then instead of having the section number, we could record the number of sections. That might be useful. We also have the date it was introduced. We have the title of each bill. "To grant family and temporary medical leave under certain circumstances." That is super-duper informative. We then have the name of the person who sponsored

  • 08:20

    MATT DENNY [continued]: that piece of legislation. So this would be the name of the legislator. We have the major topics. So we can think about things like labor or government operations, or housing or domestic commerce, these sorts of things. And then whether that person was Democrat or Republican. So those are the two that will show up in our data.

  • 08:40

    MATT DENNY [continued]: So we've got-- and each of these things has, let's see. Oops, we've got-- is that 1,000? Scroll up here. Yeah, we've got 1,000 rows. So it's 1,001 here, because the first row in our Excel spreadsheet is actually the column names, which you'll want to remember. So we've got 1,000 rows in each of these.

  • 09:02

    MATT DENNY [continued]: So we're going to end up with about 470,000 rows in our dataset, once we've read it in. And then, let's see, about 10 columns? Looks like 10, 11 columns. So anyhow, a reasonably large dataset, once we've read all of these CSV files in.

  • 09:22

    MATT DENNY [continued]: So anyhow, let's actually open up our homework_2.R script. So I've got RStudio pulled up here. And again, this is going to talk you through the second assignment. It'll do things like, OK, let's set our working directory. And we're going to want to use the rio package. We've talked about using rio for data importing and exporting.
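
A minimal sketch of reading in and combining a folder of CSV files is shown here. It is not the solution script: it builds three tiny stand-in files in a temporary directory (the folder name `multi_datasets_demo` and the toy columns are assumptions), and it uses base R's read.csv so that it runs without extra packages. The import function from rio could be swapped in for read.csv.

```r
# Build a small stand-in for the data folder in a temporary directory.
data_dir <- file.path(tempdir(), "multi_datasets_demo")
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:3) {
  toy <- data.frame(session = 103, chamber = "HR", number = i,
                    section = 1:2, stringsAsFactors = FALSE)
  write.csv(toy, file.path(data_dir, paste0("dataset_", i, ".csv")),
            row.names = FALSE)
}

# Find every CSV file in the folder.
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file and row bind it onto the growing data frame.
combined <- NULL
for (i in seq_along(files)) {
  current <- read.csv(files[i], stringsAsFactors = FALSE)
  combined <- rbind(combined, current)
  # Progress message, useful when the loop runs for a long time.
  print(paste("Read file", i, "of", length(files)))
}

print(dim(combined))
```

Growing a data frame inside a loop is simple to follow but slow for hundreds of files; it is used here because it is the easiest version to understand and check.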

  • 09:46

    MATT DENNY [continued]: Note that this is going to take a little while. This could take up to an hour on your computer, depending on if you want to go whole hog and try and do the full assignment. You could also try and only work with, say, the first 20 files or something like that for this assignment. And then that would certainly speed it up if you don't have a whole lot of time to try this, or you just want to work with something less unwieldy

  • 10:09

    MATT DENNY [continued]: on your computer. But I would encourage you to try out something that takes a while. And you can use print statements we talked about inside of these for loops that you're going to use here, presumably to tell yourself about progress. And then try eating a sandwich or watching a movie

  • 10:30

    MATT DENNY [continued]: or doing something else while this is working on your computer. So here again is a little description of the data. It's going to tell us about the different fields in the data frame. And so in our first exercise, it's going to be all about reading in these data frames and then combining them. So you can use the rbind function,

  • 10:51

    MATT DENNY [continued]: hint, hint, hint, to actually row bind, to stick all of these 471 datasets together, and form one giant data frame. So that's going to be your first exercise. It'll take a little while. The second exercise is going to be

  • 11:11

    MATT DENNY [continued]: to create a new dataset that collapses the original dataset over bill sections, so that there's only one entry for each unique bill. And then you'll also want to remove the section variable. You won't need that. But you can now create a variable

  • 11:32

    MATT DENNY [continued]: that holds the sum of the number of sections that were in that bill. So one of the key things here is you will need to assign--the easiest way to do this is going to be to assign a unique identifier to each bill. So within each bill, you'll have different sections. But if we were to take the session, chamber, and number fields, and create a bill ID variable out

  • 11:55

    MATT DENNY [continued]: of those, where you could just paste them together--you could have 103-HR-1, for example. And you could do that, and then you can just look for all rows in your dataset that satisfy each of these unique values, each of these unique bill values, essentially. So this is an easy way to check for, to find,

  • 12:19

    MATT DENNY [continued]: rows in your dataset that satisfy these criteria. Again, then you can use the unique function. You could use the which function inside of a for loop. Once you have the unique bill IDs, you can then use a for loop and then figure out, OK, which sections, which rows in this data frame

  • 12:43

    MATT DENNY [continued]: are associated with each bill? And then you could use that to find one as an exemplar, plop it in your output data frame, and then record something like the number of sections. So this will definitely take a while. So my code ran in 15--that's maybe a little optimistic. 15 to 30 minutes for me.
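
The bill-level collapse described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the solution code: the toy data frame `sections`, its column names (`session`, `chamber`, `number`, `section`, `name`), and the output column `num_sections` are all made up for the example.

```r
# Hypothetical section-level data: one row per bill section.
sections <- data.frame(
  session = c(103, 103, 103, 103, 103),
  chamber = c("HR", "HR", "HR", "HR", "S"),
  number  = c(1, 1, 1, 10, 1),
  section = c(1, 2, 3, 1, 1),
  name    = c("A", "A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Paste session, chamber, and number together to form a bill ID.
sections$bill_id <- paste(sections$session, sections$chamber,
                          sections$number, sep = "-")

# Loop over the unique bill IDs and keep one exemplar row for each.
bill_ids <- unique(sections$bill_id)
bills <- NULL
for (i in seq_along(bill_ids)) {
  rows <- which(sections$bill_id == bill_ids[i])  # rows for this bill
  exemplar <- sections[rows[1], ]                 # first row as exemplar
  exemplar$section <- NULL                        # drop the section variable
  exemplar$num_sections <- length(rows)           # record the section count
  bills <- rbind(bills, exemplar)
}

print(bills)
```

On the full dataset this loop scans every row once per bill, which is one reason the run time can stretch to many minutes.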

  • 13:05

    MATT DENNY [continued]: So you'll just want to plan and make sure that your computer can be on for this long, or you can try it on a smaller subset of the data. So then the final thing we're going to want to do here is then collapse it even further. So once we have that dataset, that is at the bill level--so we're going to have one row for each bill. We start out with one row for each section. It's 470,000 sections. Once you go to one row for each bill,

  • 13:26

    MATT DENNY [continued]: you can have about 100,000 bills. And then we can go down even further and say, OK, now from these bills, I want to get down to one row per unique legislator. So we have their names. You can just go for one row for each uniquely-named legislator. And then we're going to do things like, how many bills did that legislator introduce?

  • 13:47

    MATT DENNY [continued]: What are the total number of sections in all the bills that they introduced? The average number? When was the earliest bill? In other words, when did they first introduce a piece of legislation? What was the most common topic? So for most common topic, you might take the unique of all the different topics associated with that legislator.

  • 14:07

    MATT DENNY [continued]: So the topics of all those different bills. And then you might count each one and say, OK, how many times did it come up? You could also try out the table function; that might also be helpful for you. But unique, and then counting, is probably going to be your easiest way to implement this. And you're going to want to-- again,

  • 14:27

    MATT DENNY [continued]: you could create one row for each unique legislator, and then rbind these together. Now, there are going to be faster ways to do this than the approach that you might take at first, with things like using for loops and growing data structures, rbinding things together. But I would encourage you to start out like this. Start out with something you understand. Try it out.
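
A rough sketch of the legislator-level collapse is given below. The toy bill-level data frame `bills` and all of its column names are assumptions made for illustration, and the most common topic is found with the table function rather than with unique plus manual counting.

```r
# Hypothetical bill-level data: one row per bill.
bills <- data.frame(
  name         = c("Smith", "Smith", "Smith", "Jones"),
  num_sections = c(3, 1, 2, 5),
  date         = as.Date(c("1993-03-01", "1993-01-05",
                           "1994-06-10", "1993-02-15")),
  topic        = c("Labor", "Housing", "Labor", "Housing"),
  stringsAsFactors = FALSE
)

# Build one summary row per uniquely-named legislator, then rbind them.
legislators <- NULL
for (leg in unique(bills$name)) {
  rows <- which(bills$name == leg)
  topic_counts <- table(bills$topic[rows])  # how often each topic appears
  current <- data.frame(
    name           = leg,
    num_bills      = length(rows),
    total_sections = sum(bills$num_sections[rows]),
    mean_sections  = mean(bills$num_sections[rows]),
    earliest_bill  = min(bills$date[rows]),
    # which.max returns the first maximum, so ties go to the first name
    # in table()'s sorted order.
    top_topic      = names(topic_counts)[which.max(topic_counts)],
    stringsAsFactors = FALSE
  )
  legislators <- rbind(legislators, current)
}

print(legislators)
```

The earliest-bill step relies on the date column being stored as a Date; if it is read in as text, it would need converting with as.Date first.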

  • 14:47

    MATT DENNY [continued]: And then of course, check your code against my example code. Remember, as always, it is OK to, A, look on the internet for help if you want to try and figure something out. It's a good opportunity for you to try and use the internet to solve a problem. This is a real problem. Getting legislator-level data is something

  • 15:09

    MATT DENNY [continued]: that I've actually done in my own work. It's sometimes useful to say, is there something weird? Do some legislators tend to sponsor a lot more legislation than others, and why is that? Do they tend to sponsor longer legislation, or does it tend to be shorter? These sorts of things. And so this is a real research exercise,

  • 15:30

    MATT DENNY [continued]: and it's OK to really use the internet to help you. And then of course, as always, you can look at my solution code, and use this to help you work your way through this second assignment. So with that, I'll leave you to it, and I will see you in the next one.

Video Info

Series Name: Practical Data Management with R

Episode: 41

Publisher: SAGE Publications Ltd

Publication Year: 2017

Video Type: Tutorial

Methods: Data management, R statistical package, Programming

Keywords: computer programming; data management; data manipulation; data processing; databases; internet; programming and scripting languages; Spreadsheets ...

Segment Info

Segment Num.: 1

Abstract

In this second homework assignment, Matt Denny provides an opportunity to practice reading in, managing, and working with multiple data sets. Solutions to the three exercises require generating observation IDs and forming a collapsed data set.

Data Management: Homework 2
