  • 00:00

    [From Wide to Long Format]

  • 00:01

    MATT DENNY: Hi, everyone. This is your instructor, Matt Denny, and welcome to this lecture on converting data now from wide format back to a long format. So this is just going to sort of follow on from our previous lecture on converting from long to wide. And just a reminder, we need to be careful,

  • 00:22

    MATT DENNY [continued]: particularly with our column names when we're doing these kinds of conversions, because we want to make sure that what we end up with for long data sort of makes sense, and the observations are in a reasonable order that we can work with. Again, we're still going to use the tidyr R package to do this, because it provides functions that allow you to go both ways,

  • 00:44

    MATT DENNY [continued]: from long to wide and then back again. And what we're going to see is that we end up with a dataset. We're going to be working with the same dataset that we just turned into a wide format dataset. We're going to see that it ends up looking the same as the original. So with that, I'm going to head over to my Desktop, and we're going to open up RStudio.

  • 01:04

    MATT DENNY [continued]: OK. So I'm here on my desktop. Again, we have this long and wide data dot R file and the long wide data dot RData file. And so we're going to open up RStudio, and we're going to continue where we left off. So if you skipped the previous lecture on converting from long to wide format data, you're going to want to do all of that. And we're going to pick up here down on line 75,

  • 01:27

    MATT DENNY [continued]: talking about converting back to long data. So we're going to create something called long_data_2 so we can compare it to our original long data that we read in and we did some transformation to. And to do that, we're going to use the GATHER function. So SPREAD takes long and makes it wide, and GATHER takes wide and makes it long. And so, again, the data argument here-- this

  • 01:48

    MATT DENNY [continued]: is a four-argument function. So the data argument, that is going to be the wide dataset that we want to convert to long. The key is going to be now, OK, we want to create a single variable. In this case, our section variable.

  • 02:08

    MATT DENNY [continued]: We're going to call it Section. That's going to contain all of these unique values of the column names in our wide data. So whereas we had all these section 1, section 2, you know, up to 10 and whatever, up to 816, we're now going to use these column names as keys as the sort of things that

  • 02:34

    MATT DENNY [continued]: help us delineate what row we're in when we go back to long data format. And the value here, so we have all of these columns that we're going to want to take, and we're going to want to squish back into a single column in long format. And so we need to name something for all the values

  • 02:56

    MATT DENNY [continued]: that get stored in here. And what we're going to do is we're going to call that Terms. So this is just going to sort of mirror this terms variable up here in our long data format. The final set of arguments, it can be any number of arguments we want. And these are going to be the column names we want to consider. And so one of the interesting things

  • 03:17

    MATT DENNY [continued]: here is that we don't have to put anything in quotation marks. We don't have to do anything else, and we can actually use-- so we can start with the first column name that we want to be dealing with. And this really works well if you have sort of a big contiguous block of column names. So we want to start with the furthest to the left column

  • 03:38

    MATT DENNY [continued]: name that you want to be dealing with, this section 001. And note that it's OK that there are some NAs here. So these NAs are recording that there was no section 001 or section 1. And that may seem weird. You say, well, hey. Look, it has a section 2 and so on, but it doesn't have a section 1. That's because I actually removed them when I was cleaning these data. So these data are actually-- this is like a little sliver of metadata

  • 03:59

    MATT DENNY [continued]: from my dissertation research, which deals with US congressional bills. So anyhow, we're going to start with the furthest left column, which is going to be the section 001. And then we can do something like call names(wide_data). And that will tell us that our last column

  • 04:23

    MATT DENNY [continued]: that we want to deal with is this section 816. So we can scroll through here, and it will give us a character vector of all these column names. And we want everything from section 001 to section 816. And so we could create a character vector of these column names. Alternatively, we can use our little colon operator,

  • 04:44

    MATT DENNY [continued]: and that will actually just give us all the columns between these in terms of how the GATHER function will actually interpret what we've given it as an argument. And so this can be really, really handy. So basically, first column name, colon, last column name you want to deal with. If it's all one block, sort of left to right of these columns. And that's a really, really handy feature

  • 05:04

    MATT DENNY [continued]: in this GATHER function. So I'm going to run this. And what do we see? Well, this will take a minute. We now have long_data_2, but it's got 5,774,016 observations of four variables. Well, that doesn't sound too promising. So we do have bill, date, and section and terms.
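
The call described above can be sketched on a toy dataset. This is a minimal illustration, not the lecture's script: the data frame contents, the column names Bill, Date, and Section_001 through Section_003, and all cell values are assumptions made up for the example. Only the gather() arguments (data, key, value, and a first:last column range) follow what is described in the lecture.

```r
library(tidyr)

# Hypothetical wide data: one row per bill, one column per section.
# NA marks a section that does not exist for that bill.
wide_data <- data.frame(
  Bill = c("103HR1IH", "103HR2IH"),
  Date = c("1993-01-05", "1993-01-05"),
  Section_001 = c(NA, 12),
  Section_002 = c(40, 7),
  Section_003 = c(15, NA),
  stringsAsFactors = FALSE
)

# key   = name of the new column that stores the old column names
# value = name of the new column that stores the cell values
# Section_001:Section_003 = every column from the first to the last,
#                           written without quotation marks
long_data_2 <- gather(wide_data, key = Section, value = Terms,
                      Section_001:Section_003)

print(long_data_2)
```

Each of the three section columns contributes one row per bill, so two bills become six rows, including the rows whose Terms value is NA.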

  • 05:27

    MATT DENNY [continued]: That's good. This is going to take a little minute to load. There we go. So first off, the ordering here is a little bit weird. And second off, we can see that we have a whole bunch of these NA terms. And if we look back in our wide data, that makes sense, because we had a whole bunch of observations of NA.

  • 05:48

    MATT DENNY [continued]: Here, we had these missing things, because there was no section 9 for this 103HR1000IH, for example. So that could be a little bit of an issue. But we can deal with this pretty easily. So one easy way to do this, and I believe there's an internal option

  • 06:09

    MATT DENNY [continued]: inside the GATHER function. But just to make it really clear is we can say, OK, which(is.na(long_data_2$Terms)). So we haven't seen this is.na function, but what it does is it will take as input a vector or an individual value. And it will tell us, is this value equal to NA?

  • 06:29

    MATT DENNY [continued]: And NA is a special not available classification for a particular value in a data frame or a vector or whatever. It says these data are missing, essentially. And so we can ask which of the rows in this long_data_2

  • 06:51

    MATT DENNY [continued]: have a terms variable that is NA? So let's run this, and we're going to save it to a vector called ToRemove. And if we can see here, ToRemove is a large integer vector. So 4, 5, 6, 7, 8, and then next one is 11. So let's just check this in our long_data_2. So we can see-- oh, no. This is going to take a while.
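
As a small sketch of how is.na and which combine to produce a vector like ToRemove: the terms vector below is made up for illustration and stands in for the Terms column of long_data_2.

```r
# Hypothetical stand-in for long_data_2$Terms.
terms <- c(12, 40, 15, NA, NA, 7, NA)

# Comparing with == does not detect missing values: every result is NA.
print(terms == NA)

# is.na() returns TRUE where a value is missing and FALSE otherwise.
missing_flags <- is.na(terms)
print(missing_flags)

# which() converts the TRUE positions into integer indices.
ToRemove <- which(missing_flags)
print(ToRemove)  # 4 5 7
```

The result is an integer vector of positions, which is why it can be used directly as a row index.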

  • 07:13

    MATT DENNY [continued]: Just bear with me for a second. So the viewer is handy, but it takes a while to load. So 1, 2, 3, are there. 4, 5, 6, 7, 8 are not. They're all NA. So those are good. We want to remove those. 9 and 10 are real numbers. They are numbers. 11 is not. Then we're OK until we get to 16 and 17.

  • 07:35

    MATT DENNY [continued]: So this sort of maps on, right? And then we can use our little-- we can just use long data. And then, remember, before the comma will give us rows that we want to index. And we want to say minus ToRemove, and that says remove all of the rows that match up to these indices in this 5,734,016 element vector.
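
The row removal can be sketched as follows, on a made-up four-row data frame standing in for long_data_2. The length check is an addition that is not in the lecture: if ToRemove were empty, a negative index on an empty vector would select zero rows rather than all of them.

```r
# Hypothetical stand-in for long_data_2 after gathering.
long_data_2 <- data.frame(
  Bill = c("103HR1IH", "103HR2IH", "103HR1IH", "103HR2IH"),
  Section = c("Section_001", "Section_001", "Section_002", "Section_002"),
  Terms = c(NA, 12, 40, NA),
  stringsAsFactors = FALSE
)

# Row indices whose Terms value is missing.
ToRemove <- which(is.na(long_data_2$Terms))

# Negative indices before the comma drop those rows;
# leaving the position after the comma blank keeps every column.
if (length(ToRemove) > 0) {
  long_data_2 <- long_data_2[-ToRemove, ]
}

print(long_data_2)
```

The option inside gather that the lecture alludes to is its na.rm argument: calling gather with na.rm = TRUE drops these rows during the conversion itself.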

  • 07:58

    MATT DENNY [continued]: And then if we do just a comma and leave everything blank, that means keep all of the columns. And so let's do this. And oh, look. We're back down to having 40,000 observations of four variables. So that's really promising. We're back down to the right number, and that should make sense. We converted from long to wide and from wide back to long,

  • 08:19

    MATT DENNY [continued]: and we got rid of all these NAs. We should have the same number of rows. So let's click on this again. And OK, this looks OK. But we are now sorting by section. So as you can see, it's all these section 001s, and it might be convenient to make it actually look like our data did before, with bills being the first thing we sort

  • 08:40

    MATT DENNY [continued]: on and then sections within bills. And to do that, we're going to use another function we haven't dealt with yet. And that's the ORDER function. So what the ORDER function does is it will return a vector of indices. Let's just run this for a second. It will return a vector of indices. So up here, it will tell us that it should go 1, then 3,747,

  • 09:06

    MATT DENNY [continued]: then 8,481, and so on. If we were to order our long_data_2 in terms of these strings by however R actually goes about deciding which string comes earliest in the alphabet, essentially, if we were to do that that way, then this would be the ordering we should use.

  • 09:30

    MATT DENNY [continued]: So we're going to reorder the rows in long_data_2 using the ORDER function. And again, after this comma, we're going to leave everything blank so that it will make all the columns go along with it. So if we do this, if we run this line of code, we're now going to get out something that looks more reasonable. Now, we're still going from 1 to 10.
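
A minimal sketch of reordering rows with order, again on made-up data. It assumes the sort key is the Bill column, which the transcript implies but does not state outright.

```r
# Hypothetical stand-in for long_data_2, currently grouped by section.
long_data_2 <- data.frame(
  Bill = c("103HR2IH", "103HR1IH", "103HR2IH", "103HR1IH"),
  Section = c("Section_001", "Section_001", "Section_002", "Section_002"),
  Terms = c(12, 40, 7, 15),
  stringsAsFactors = FALSE
)

# order() returns the row indices that would sort the Bill strings.
# Tied values keep their original relative order, so sections
# stay in sequence within each bill.
row_order <- order(long_data_2$Bill)
print(row_order)  # 2 4 1 3

# Use the indices before the comma; blank after the comma keeps all columns.
long_data_2 <- long_data_2[row_order, ]
print(long_data_2)
```

order also accepts several keys, for example order(long_data_2$Bill, long_data_2$Section), when ties on the first key should be broken explicitly.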

  • 09:50

    MATT DENNY [continued]: We'd have to do some more messing around. We'd actually have to put in some leading 0s here. But we're now 103HR1IH, and then it goes section 001 through section 023. So we're at least consistent now within bills. So that's just a quick sort of crash course on going from long to wide and back again.
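
The lecture does not show how the leading zeros would be added. One possible approach, offered only as a sketch, is to build a zero-padded sort key with sprintf. The regular expression assumes bill identifiers shaped like digits, letters, digits, letters (as in 103HR1IH), and the pad width of four is an arbitrary choice for the example.

```r
# Hypothetical bill identifiers in the style seen in the lecture.
bills <- c("103HR10IH", "103HR2IH", "103HR1IH")

# Assumed layout: congress number, bill type, bill number, version code.
pattern <- "^([0-9]+[A-Z]+)([0-9]+)([A-Z]+)$"
prefix <- sub(pattern, "\\1", bills)
number <- as.integer(sub(pattern, "\\2", bills))
suffix <- sub(pattern, "\\3", bills)

# Pad the bill number to four digits so string order matches numeric order.
padded <- paste0(prefix, sprintf("%04d", number), suffix)
print(padded)  # "103HR0010IH" "103HR0002IH" "103HR0001IH"

# Sort the original identifiers by the padded key.
print(bills[order(padded)])  # "103HR1IH" "103HR2IH" "103HR10IH"
```

The padded key is used only for sorting here, so the original identifiers are left unchanged.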

  • 10:12

    MATT DENNY [continued]: And I hope that you'll find this helpful in actually dealing with these two data formats, because this is a pretty common thing that you have to deal with when you're working with data and trying to manage it, is converting between these two formats. So thanks for watching, and I will see you in the next lecture.

Video Info

Series Name: Practical Data Management with R

Episode: 39

Publisher: SAGE Publications Ltd

Publication Year: 2017

Video Type: Tutorial

Methods: Data management, R statistical package, Programming

Keywords: data format conversion; data management; programming and scripting languages

Segment Info

Segment Num.: 1

Persons Discussed:

Events Discussed:



Matt Denny continues the explanation of data format conversion using the "gather()" function in the "tidyr" package to convert data in a wide format to a long format.


Data Management: From Wide to Long Format
