  • 00:02

    ROBERT MASTRODOMENICO: OK, so in this part, we're going to look at dealing with files. So to deal with a file, we need some kind of file, hopefully with some data--hopefully with some interesting data. And luckily, this Seattle Police Department Police Report Incident gives us exactly what we need. So what we're going to do is go ahead and download the CSV version of the data,

  • 00:22

    ROBERT MASTRODOMENICO [continued]: which, luckily for you, I've already done. And I've already put the file name into Python. So the file name is essentially a string representing the full location of where the file lives. Now, what I want to do is I want to read this file into Python and get access to all the juicy data that lives within this beautiful file. So let's see what we've got.

  • 00:43

    ROBERT MASTRODOMENICO [continued]: So what we're going to do is we're going to use this open command to open the file. Now, we pass in the file name and we pass in this r into it, as well. So what the r means is we're going to open this file in read mode. So let's open it and see all the data that we've got. Oh, we get no data.

  • 01:04

    ROBERT MASTRODOMENICO [continued]: What we've got is we've got this kind of weird message which says that we've opened up this file in mode r with some other information. So you're thinking, how do I get this data? Well, you're not far away. What we've opened is a kind of stream to the file, which allows us to go ahead and get the data within the file. So let's create a variable called data, and then

  • 01:25

    ROBERT MASTRODOMENICO [continued]: let's read from f. So we can use this method called read, which allows us to read what's in it. So we're going to read it in. Cool, let's see what's in data. Wow. We can see we've got quite a lot of data here. So what we're doing is essentially showing the data on screen.

  • 01:45

    ROBERT MASTRODOMENICO [continued]: And it appears this file is pretty big, so lots and lots of stuff. And what we can notice is that what we've got out doesn't really look like you might think this file is going to look like, it kind of just looks like a big mess. So what's the type of this? So let's do type of data. Well, it's a string.
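
The steps so far (open in read mode, look at the file object, read everything into one string, check its type) can be sketched as follows. This is a minimal sketch under stated assumptions: the real download is not available here, so a tiny made-up CSV is written to a temporary folder first, and the path, column layout, and rows are illustrative only.

```python
import os
import tempfile

# Assumption: a tiny made-up file stands in for the downloaded CSV.
sample_text = (
    "RMS CDW ID,Offense Type,Year\n"
    "1001,THEFT,2015\n"
    "1002,ASSAULT,2016\n"
)
file_name = os.path.join(tempfile.mkdtemp(), "police_data.csv")
with open(file_name, "w") as setup_file:
    setup_file.write(sample_text)

f = open(file_name, "r")    # "r" opens the file in read mode
print(f)                    # describes the open file object, not its contents
data = f.read()             # the whole file as one string
f.close()

print(type(data))           # <class 'str'>
print(repr(data[-17:]))     # repr makes the trailing '\n' visible
```

Printing with `repr` is a small convenience added here so that the end-of-line character shows up explicitly rather than as a line break.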

  • 02:06

    ROBERT MASTRODOMENICO [continued]: So what we've got here is this massive, great string which contains all this data that we want to get access to. So how are we going to do this? Well, we've got to think a bit about it. So we have to inspect the file and look at ways that we can use the really clever string stuff in Python to pull out all the information that we want. Now, if we look right at the bottom of this string here,

  • 02:29

    ROBERT MASTRODOMENICO [continued]: we have this slash n [the newline character, written \n]. Now, the slash n, essentially, is an end of line character, and if you're used to dealing with files like this, you'll have seen them before. If you're not, I'm informing you now. So what we can do is use this slash n, and the fact that we've got multiple ones within this file, to separate the end of line characters out and to create something where we can

  • 02:51

    ROBERT MASTRODOMENICO [continued]: access the elements of this file line by line. So all we do is we're going to overwrite our data and we're going to split it on the-- not the comma, because everything's separated by the comma in there. That wouldn't be too useful, because this is a comma-separated file, after all.

  • 03:12

    ROBERT MASTRODOMENICO [continued]: We're going to split it on the slash n, so this should then give us every line by line of the file in some way. Now, we know when we split, we split it into a list. So what we're expecting back is a list, now. So let's see if we got back what we would expect. Yes, we did. So now, we can access what we hope will be lines of the file

  • 03:33

    ROBERT MASTRODOMENICO [continued]: as if we were accessing elements of a list. So 0, the 0 index is the first element, so let's see what we've got. Excellent, we've got this comma-separated string which appears to be the headers of the file. Now, let's look at the last few lines. We've then got here some data that looks like it's

  • 03:53

    ROBERT MASTRODOMENICO [continued]: to do with the file, again. We can see the strings, and every new string is a new row. But we also have this empty string on the end. And when we looked before, we could see we didn't have an empty string, so how has this empty string ended up in our list? Well, this is a kind of consequence of us splitting on the slash n.

  • 04:14

    ROBERT MASTRODOMENICO [continued]: On the last slash n that we split on we ended up with an empty string on the other side of it, because there's nothing to split on at the end, because it finishes on a slash n. So what we need to do is remove this last rogue empty string within this list. And the way we can do this is quite easy. We're just going to do data.pop, and we're

  • 04:35

    ROBERT MASTRODOMENICO [continued]: going to pop off the last element of it. So let's run that last command before to show us the end of our list. Nice, we've got rid of what we didn't really want, this kind of problematic empty string. So now, what we have is a list, element by element, referring to every line of our file, which is really cool.
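
The split-and-pop step can be sketched like this. The string below is a made-up stand-in for what `f.read()` returned; only the pattern (split on `"\n"`, then drop the trailing empty string) comes from the walkthrough.

```python
# Assumption: made-up contents standing in for the string read from the file.
data = "RMS CDW ID,Offense Type,Year\n1001,THEFT,2015\n1002,ASSAULT,2016\n"

data = data.split("\n")     # one list element per line
print(data[0])              # the header line
print(data[-1] == "")       # True: nothing follows the final "\n"

data.pop()                  # remove that trailing empty string
print(data[-1])             # the last real row
print(len(data))            # header plus two rows
```

An unconditional `pop()` relies on the file ending with a newline. If that is not guaranteed, a guard such as `if data and data[-1] == "": data.pop()` avoids throwing away a real row.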

  • 04:55

    ROBERT MASTRODOMENICO [continued]: So we now want to get into the file and do something with it. So let's get what is essentially the first line of the file, our headers, back up. So let's say we're looking at this and we're like, well, there's a lot of data in there, I just want to get a smaller subset of it, not all the columns, and I just want to programmatically do this.

  • 05:16

    ROBERT MASTRODOMENICO [continued]: What we could do is we can go in and we can say, well, what I'm interested in is the offense type. So all I want to get out of this is offense type, but I need the RMS CDW ID because I need to identify, maybe based on other things, what we've got. We can do that in Python, and we can do it really easily. Now, if we're going to do this, we

  • 05:38

    ROBERT MASTRODOMENICO [continued]: need to do it to every single row within our data, or every element of our list that we now have. So let's find out how big this data is, or how many rows we've got. Wow, that's a lot of rows in there. So if we're going to do this, we can't do it by typing across every single row. We're going to need to do something a bit smarter.

  • 05:59

    ROBERT MASTRODOMENICO [continued]: We're going to need to loop across this and apply some kind of logic to every single line. Now, let's think about the logic that we're going to want to do. We could say, well, we know ID lives at element 0, if we split this part of the list. So let's say we take the first headers and we split it. Because we know it's comma-separated,

  • 06:21

    ROBERT MASTRODOMENICO [continued]: we can separate on the comma, and it's going to give us back a list of everything that we want. And we can say, well, we're interested in element 0, and then we're interested in element 1, 2, 3, 4, 5--you could do it like that, but A, it's a bit messy, might get your numbers wrong. And B, it's not very readable.

  • 06:41

    ROBERT MASTRODOMENICO [continued]: And one of the things you need from your code, especially if you're going to use it again or share it, is readability. And that's one of the beauties of Python, it reads like you might expect code to look. It's not overly complicated, which is one of the wonderful things to do with Python. So in this regard, what we're going to do is look at a single example to get the data out of one row.

  • 07:02

    ROBERT MASTRODOMENICO [continued]: And once we've got that, we're going to apply it to all other rows within our data and get out exactly what we want, which is going to be the ID and the offense type. Let's stick in the year, as well, to make it interesting. We can do three things in there, at different points of this. So what we're going to do is we're going to call this headers.
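
Splitting the header line on the comma, and the readability problem with positional lookups, might look like the sketch below. The header line is an illustrative stand-in: only the names RMS CDW ID, offense type, and year are mentioned in the walkthrough, and the other column and the exact spellings are assumptions.

```python
# Assumption: an illustrative header line, not the real file's full header.
header_line = "RMS CDW ID,General Offense Number,Offense Type,Year"

headers = header_line.split(",")        # comma-separated, so split on ","
print(headers)

# Positional access works, but the reader has to remember what 2 means.
print(headers[2])                       # Offense Type

# Looking the position up by name states the intent directly.
print(headers.index("Offense Type"))    # 2
```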

  • 07:25

    ROBERT MASTRODOMENICO [continued]: So this is the list of the headings within the file. Very nice. Now, we're going to look at how we can apply this to a row of data. So in exactly the same way we did the first point, for the second point we're going to split it on the comma, and we're going to create this thing called a row. The names are arbitrary, but it does

  • 07:47

    ROBERT MASTRODOMENICO [continued]: refer to a row in the way we think about datasets, and we can see here we've got some real data from our dataset. Now, in Python, what you can do is you can take two lists and you can stick them together as follows. We can zip them up. If I can type.
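
Zipping the two lists together can be sketched as follows, with made-up values in place of the real row. The sketch also passes the zipped pairs to `dict`, which is the step the walkthrough turns to next; the name `row_dict` is the one used later in the session.

```python
# Assumption: shortened headers and a made-up row for illustration.
headers = ["RMS CDW ID", "Offense Type", "Year"]
row = "1001,THEFT,2015".split(",")

pairs = zip(headers, row)
print(pairs)                        # a zip object; it is lazy, so no contents are shown
print(list(zip(headers, row)))      # the (header, value) pairs spelled out

row_dict = dict(zip(headers, row))  # header name -> value for this row
print(row_dict["Offense Type"])     # THEFT
```

Note that every value is still a string at this point, including the year, because nothing in the split converts types.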

  • 08:09

    ROBERT MASTRODOMENICO [continued]: What we've got there is a zip object. So a zip object takes the headers and the row and essentially sticks them together. Now, what we can do with that is we can stick this into a dictionary. So a dictionary accepts this kind of zipped version of two lists put together, and it creates us this really nice dictionary

  • 08:30

    ROBERT MASTRODOMENICO [continued]: of the data in the row with its headers associated with it. So let's assign this to something. And then what we can do is we can access any element by just knowing--

  • 08:52

    ROBERT MASTRODOMENICO [continued]: I've got really bad IDing convention, there--by just knowing the name of the header, which makes it really nice, because when you're reading the code, you know exactly what you're getting back. You don't have to guess. Say, if we did it this way, and we wanted row 0, you might be thinking, well, what's row 0? What does that actually refer to? Well, you know it, because we're saying

  • 09:14

    ROBERT MASTRODOMENICO [continued]: what the header of the file is, or what we're specifically looking for. That RMS CDW ID may mean nothing to you, but you actually know what you're getting, which is cool. So what we can do is we can apply this logic within a loop, and then apply this to every single row of our data or every element of the list that we've got now

  • 09:34

    ROBERT MASTRODOMENICO [continued]: that contains all the data. But what we really need to do is we want to write this back out to a file, so we've got another file containing everything we want. So to do that, we're going to need to define a new file. So in exactly the same way that I had the file name from earlier, I'm going to create this new file,

  • 09:55

    ROBERT MASTRODOMENICO [continued]: and I'm going to call it reduced_police_data.csv. Now, the name's arbitrary. I've put it in a different location to the other file. If I'd used the exact same name as the other one, you would get into a problem with overwriting what you've got, so that's not cool.

  • 10:16

    ROBERT MASTRODOMENICO [continued]: But now what we need to do is, in the same way that we opened the file before--and we need to be careful to keep the file name different to what we put in. So this time we're opening out_file_name, which is a string containing the output file where we want this stuff to go. We're going to essentially open it. But as opposed to opening it in mode r,

  • 10:38

    ROBERT MASTRODOMENICO [continued]: we're going to open it in mode w. And w refers to write, because we want to write to the file. And this is going to create the file, because this file doesn't exist at all. So there we go. So now we've got this, and there we go. So we can see you've got a wrapper around this path to a file opened in mode write.
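
Opening the output file in write mode can be sketched as below. The file name `reduced_police_data.csv` is the one from the walkthrough; the temporary folder it is placed in is an assumption made so the sketch can run anywhere.

```python
import os
import tempfile

# Assumption: a temporary folder stands in for the real output location.
out_file_name = os.path.join(tempfile.mkdtemp(), "reduced_police_data.csv")
existed_before = os.path.exists(out_file_name)
print(existed_before)                   # False: nothing there yet

fo = open(out_file_name, "w")           # "w" opens the file in write mode
print(fo)                               # describes the open file object
print(os.path.exists(out_file_name))    # True: opening in "w" created the file
fo.close()
```

Mode `"w"` also empties a file that already exists, which is why reusing the input file's name here would destroy the original data.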

  • 10:58

    ROBERT MASTRODOMENICO [continued]: Brilliant. We're ready to go. Now we need to start writing stuff to file. We've kind of shown the logic that we can apply to each row. We now need to stick this into a loop. So let's again create the row in the same way. So we're going to split it.

  • 11:20

    ROBERT MASTRODOMENICO [continued]: Let's create the row_dict, because you've already got the headers. row_dict = dict-- we're going to zip. OK, cool. So we've got to the point we were at before, and now we need to start writing this out to a file.
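
The loop up to this point might look like the sketch below. The transcript does not show the loop statement itself, so looping over every element of `data` (header line included) and the loop variable name `line` are assumptions; the list `offense_types` is a hypothetical helper added only to make the result inspectable.

```python
# Assumption: a made-up list of lines, as produced by the earlier split and pop.
data = [
    "RMS CDW ID,Offense Type,Year",
    "1001,THEFT,2015",
    "1002,ASSAULT,2016",
]
headers = data[0].split(",")

offense_types = []                          # hypothetical, for inspection only
for line in data:
    row = line.split(",")                   # split the line into its values
    row_dict = dict(zip(headers, row))      # pair each value with its header
    offense_types.append(row_dict["Offense Type"])

print(offense_types)
```

When the header line itself passes through the loop, each header name is paired with itself, so looking up `"Offense Type"` on that row simply gives back `"Offense Type"`.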

  • 11:41

    ROBERT MASTRODOMENICO [continued]: And we want a file to be comma-separated so we can write a string with commas in which allows us to write exactly to the file. So if we go out_string, so it's just a generic name for a string here. And then we're going to do row_dict, and we're going to put in the names of what we want to be in our file.

  • 12:02

    ROBERT MASTRODOMENICO [continued]: So RMS, this very awkward naming convention-- and I always want to type it the other way around, which wouldn't help. So that's going to give us the ID to go into the file. So what we then need to do is we need to concatenate, so a + on the string is to concatenate. And we're going to put in a string that just contains

  • 12:24

    ROBERT MASTRODOMENICO [continued]: a comma, because obviously, every variable has to be comma-separated. And then we're going to add in the next variable. What are we going to use? Offense type. And you've got to get your spelling right, because if you get your spelling wrong, you're going to get a key error, and that's

  • 12:46

    ROBERT MASTRODOMENICO [continued]: not going to be cool. And then we're going to do another one. So we've got two commas, and they're both comma-separated. And we're going to put the last thing in, which is going to be row_dict, and we're going to use--this is going to give us the year of the offense, which

  • 13:07

    ROBERT MASTRODOMENICO [continued]: we've got the type of. Now, one of the things you have to remember to put on the end is a slash n, because, like we saw when we read the file in, every line is separated by this slash n, which is kind of an underlying character. If you don't put it in, you're just going to have one long string every time, because every time we're going to run this, it's going to add a new part to our file. OK, so I'm going to do fo.write(out_string),

  • 13:34

    ROBERT MASTRODOMENICO [continued]: and then we're going to let it run. Now, what it's doing there is those numbers refer to how long the string is that it's writing out. So you can see that it's got a fair bit of work to do, because it's got to go over all the lines within our list and write them all out to the file. But it's done it pretty quickly. That's really cool. OK, so what we're going to do now is we're going to close our initial file,

  • 13:55

    ROBERT MASTRODOMENICO [continued]: because, like opening and closing doors, we like to close things behind us, because we're not born in a barn. And we're going to do the same to the output file. So we close both. Now, how do we know that what we've done has actually worked? Well, let's open a new file and let's use the out_file_name,

  • 14:19

    ROBERT MASTRODOMENICO [continued]: and let's open it in mode read. So we've already seen how to do this, so I don't need to explain it. So it's better if I don't repeat my code, there. And then say data = data.split--

  • 14:43

    ROBERT MASTRODOMENICO [continued]: So we're going to essentially split it again on a slash n. So I'm just guessing that this is going to work, hopefully this all comes together. And then if I do data--ta-da! What we've got is this new subfile which contains three elements. So you can see there, for each row there

  • 15:04

    ROBERT MASTRODOMENICO [continued]: are only three elements in it, with the RMS ID, the offense type, and the year. And we've now got this reduced data from this bigger set of data. So that's excellent, that's cool. In a little bit of code we've processed this massive set of 600,000 data points into something much smaller. And that's kind of how you deal with files.
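
Putting the whole session together, an end-to-end sketch could look like this. The names `file_name`, `out_file_name`, `f`, `fo`, `data`, `headers`, `row_dict`, and `out_string` follow the walkthrough; the sample file, its exact header spellings, the temporary folder, and the names `line`, `written`, and `reduced` are assumptions made so the sketch is runnable.

```python
import os
import tempfile

# Assumption: a small made-up input file stands in for the real download.
work_dir = tempfile.mkdtemp()
file_name = os.path.join(work_dir, "police_data.csv")
out_file_name = os.path.join(work_dir, "reduced_police_data.csv")
with open(file_name, "w") as setup_file:
    setup_file.write(
        "RMS CDW ID,General Offense Number,Offense Type,Year\n"
        "1001,201500001,THEFT,2015\n"
        "1002,201600002,ASSAULT,2016\n"
    )

f = open(file_name, "r")
data = f.read().split("\n")
data.pop()                                  # drop the trailing empty string
headers = data[0].split(",")

fo = open(out_file_name, "w")
written = []                                # hypothetical: collects write() results
for line in data:
    row = line.split(",")
    row_dict = dict(zip(headers, row))
    # A misspelled key here would raise a KeyError.
    out_string = (row_dict["RMS CDW ID"] + "," +
                  row_dict["Offense Type"] + "," +
                  row_dict["Year"] + "\n")
    written.append(fo.write(out_string))    # write() returns the character count

f.close()
fo.close()

# Read the reduced file back to check the result.
f = open(out_file_name, "r")
reduced = f.read().split("\n")
f.close()
reduced.pop()
print(reduced)
```

The numbers echoed in an interactive session come from `fo.write`, which returns how many characters it wrote; the sketch collects them in `written` instead of printing one per line.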

  • 15:26

    ROBERT MASTRODOMENICO [continued]: So let's close this off. So what we've shown here is how to read files in, how to process the stuff within those files, and how to write to files. So hopefully, you can now go in and apply this to files of your own.
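
When applying this to other files, one limitation of splitting on the comma is worth knowing: a quoted field that itself contains a comma is cut in two. The sketch below is not the method used in the video. It swaps in Python's standard `csv` module and `with` blocks (which close the files automatically) as one possible alternative; the sample row, the `Location` column, and the names `wanted`, `naive`, and `reduced` are made up for illustration.

```python
import csv
import os
import tempfile

# Assumption: a made-up file whose Location field contains a comma.
work_dir = tempfile.mkdtemp()
file_name = os.path.join(work_dir, "police_data.csv")
out_file_name = os.path.join(work_dir, "reduced_police_data.csv")
sample_row = '1001,THEFT,"1ST AV, PIKE ST",2015'
with open(file_name, "w", newline="") as setup_file:
    setup_file.write("RMS CDW ID,Offense Type,Location,Year\n" + sample_row + "\n")

naive = sample_row.split(",")
print(len(naive))           # 5 pieces for a 4-column row

wanted = ["RMS CDW ID", "Offense Type", "Year"]
with open(file_name, "r", newline="") as f, \
        open(out_file_name, "w", newline="") as fo:
    reader = csv.DictReader(f)      # yields one dict per row, keyed by header
    writer = csv.DictWriter(fo, fieldnames=wanted, extrasaction="ignore")
    writer.writeheader()
    for row_dict in reader:
        writer.writerow(row_dict)   # columns outside `wanted` are skipped

with open(out_file_name, "r", newline="") as f:
    reduced = list(csv.reader(f))
print(reduced)
```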

Video Info

Series Name: Introduction to Python for Social Scientists

Episode: 15

Publisher: SAGE Publications Ltd

Publication Year: 2017

Video Type: Tutorial

Methods: Coding

Keywords: coding; data preparation; data processing; programming and scripting languages

Segment Info

Segment Num.: 1

Persons Discussed:

Events Discussed:

Keywords:

Abstract

Robert Mastrodomenico, PhD, owner of Global Sports Statistics, explains how to work with files in Python, including reading data in from a file, processing and preparing the data, and writing data to an output file.


Module 3: Dealing with Files
