  • 00:09

    NIMA ZAHADAT: My name is Nima Zahadat, and I'm a Professor of Digital Forensics and Data Science. In this presentation, I'm going to illustrate some data mining concepts using Python and Python libraries working on baby names, which is from the US census. They put out a list of baby names for each year.

  • 00:30

    NIMA ZAHADAT [continued]: This list goes all the way back to 1880. And we have the list from 1880 to 2014. It would be interesting to analyze this list and see what baby names were popular for boys or girls during what period and what variations of a particular name are more popular than other variations of that name.

  • 00:51

    NIMA ZAHADAT [continued]: Clearly, I cannot teach you Python in this short presentation. And don't worry if you don't know Python or programming. But I will illustrate the concepts as we go through. And you can see how elegantly, with very few lines of code, you can get a lot of good information and knowledge out of this set of data. We start by looking at the data itself.

  • 01:14

    NIMA ZAHADAT [continued]: And here, I have the zip file. It's called names.zip. And we'd like to first extract all that information. I start by importing a bunch of libraries that are necessary for Python to function properly. We start with pandas. I'm renaming it as pd. That's what "as pd" means. Then matplotlib.pyplot, which I'm renaming as plt.

  • 01:36

    NIMA ZAHADAT [continued]: This is a package, pandas, which is great for mining. This is a package great for visualization, in other words graphing. We need a couple of additional packages, zipfile and os, to actually extract the information and display it. So with this line, I have extracted that zip file into a folder.

  • 01:57

    NIMA ZAHADAT [continued]: And this is called a names folder right here that basically brings out all the files. And then, I actually call the os library that I brought in, and I call one of its functions. Some people say method, and that's perfectly fine. And I list the first 10 names, the first 10 files, that are in that directory.
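
Editor's sketch of the extraction step described above. The speaker's actual code is not shown in the transcript, so this is only one plausible version. To keep it runnable anywhere, it first builds a tiny stand-in for names.zip in a temporary directory; the file contents and the variable names (`workdir`, `names_dir`, `first_files`) are assumptions made for illustration.

```python
import os
import tempfile
import zipfile

# Build a tiny stand-in for names.zip so the sketch runs anywhere.
# The rows below are invented; the real archive holds one file per year.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "names.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("yob2011.txt", "Sophia,F,100\nJacob,M,90\n")
    zf.writestr("yob2012.txt", "Sophia,F,110\nJacob,M,80\n")

# Extract every file in the archive into a "names" folder.
names_dir = os.path.join(workdir, "names")
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(names_dir)

# List the first 10 files in that folder (sorted for a stable order).
first_files = sorted(os.listdir(names_dir))[:10]
print(first_files)
```

With the real archive, `zip_path` would simply point at names.zip and the stand-in block at the top would be dropped.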

  • 02:17

    NIMA ZAHADAT [continued]: As you can see, it's a PDF plus a bunch of text files. So let's actually start by using pure Python, using a Python method to read the first 20 lines of the yob2012 file, which is the year of birth 2012. We can see that it actually pulls the information. It appears that girls are listed first.

  • 02:40

    NIMA ZAHADAT [continued]: So I have the name Sophia, female. There were 22,267 people who named their girls Sophia in 2012. However, this is not very friendly to look at. We can use pandas, which is one of the libraries we imported, and call its method or its function, read_csv, and read the same exact file. Notice when I do this one--

  • 03:01

    NIMA ZAHADAT [continued]: I'm getting the first 10, let me get the first 20-- that this one comes out much nicer. It's much cleaner. This is referred to as a pandas data frame. There was only one problem. Of course, it decided that Sophia is going to be the header, which we don't want. But that's easy to fix. I simply come here and read it again. And this time, I say, give the column names name, sex,

  • 03:23

    NIMA ZAHADAT [continued]: and number as the headings. And when I run it, now it looks beautiful. This is, of course, showing the first 10 items because I don't want to have everything show up. It would be too long. But I can always say I want to get the first five items if I wanted to, or I want to get the first 50 items if I wanted to. And it would give me exactly what I've asked for.
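
Editor's sketch of the header problem and its fix as described above. It reads from an in-memory string rather than the real yob2012.txt so that it runs on its own; only the Sophia row comes from the talk, the other rows are invented, and the column labels `name`, `sex`, `number` are my reading of the headings the speaker dictates.

```python
import io

import pandas as pd

# Stand-in for yob2012.txt. Only the Sophia row is taken from the talk.
sample = "Sophia,F,22267\nEmma,F,200\nJacob,M,150\n"

# Plain Python: print up to the first 20 raw lines.
for line in io.StringIO(sample).readlines()[:20]:
    print(line.rstrip())

# Without column names, pandas promotes the first data row to the header.
wrong = pd.read_csv(io.StringIO(sample))
print(list(wrong.columns))

# Passing names= keeps every row as data and labels the columns.
names2012 = pd.read_csv(io.StringIO(sample), names=["name", "sex", "number"])
print(names2012.head(10))  # head(5) or head(50) work the same way
```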

  • 03:46

    NIMA ZAHADAT [continued]: We can also use something called a tail function or tail method to look at the end of the list. And you can see that, for 2012, the end of the list is showing. Let me run this again to correct it. I can see that there are boy names at the very end. And these are some names like Zylin, Zymari, Zyrin.

  • 04:07

    NIMA ZAHADAT [continued]: And there are five people who named their kids that. This is great. But there is one problem that we have, namely that we have all these various files to work with. I have a lot of files here. I don't want to work with them individually. I prefer to combine them all and use that combination for my data mining.

  • 04:29

    NIMA ZAHADAT [continued]: With a few lines of code right here, I accomplish exactly that. These are a few lines that actually take all of those files and combine them together. Furthermore, I come down here. And I use another function that's built into pandas that takes all of these and combines the names together, which is really, really nice.
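
Editor's sketch of the "few lines" that combine the per-year files. The transcript does not show the code, so this assumes a loop over the files followed by `pd.concat`; the frame name `all_years` is my spelling of the name the speaker uses, and the two sample files written to a temporary folder are invented so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Write two invented per-year files so the sketch is self-contained.
names_dir = tempfile.mkdtemp()
sample_files = {
    2011: "Sophia,F,100\nJacob,M,90\n",
    2012: "Sophia,F,110\nJacob,M,80\n",
}
for year, text in sample_files.items():
    with open(os.path.join(names_dir, f"yob{year}.txt"), "w") as fh:
        fh.write(text)

# Read each year's file, tag its rows with the year, and collect the pieces.
pieces = []
for year in sorted(sample_files):  # the talk covers 1880 through 2014
    path = os.path.join(names_dir, f"yob{year}.txt")
    frame = pd.read_csv(path, names=["name", "sex", "number"])
    frame["year"] = year
    pieces.append(frame)

# Stack all the pieces into a single data frame with a fresh 0..n-1 index.
all_years = pd.concat(pieces, ignore_index=True)
print(all_years.shape)
```

Adding the `year` column before concatenating is what lets rows from different files be told apart afterwards.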

  • 04:49

    NIMA ZAHADAT [continued]: Finally, I have a data frame which I'm calling all years, which includes all the names from all the years in one single structure. Let's go down and see what we can do with this thing. First and foremost, I am going to take it, and I'm going to index it. Up here, you can see that there is an implicit index that's

  • 05:11

    NIMA ZAHADAT [continued]: put in place. And I have 1.8 million rows and four columns. But I'm going to come over here and actually change the index myself. And by doing that, I'm going to say I want to first sort by sex, then by name, then by year. And if I take a look at the first 10 items, you can see that it went by sex.

  • 05:34

    NIMA ZAHADAT [continued]: First, we have F. Then, we have M for male. And then, we have the names that are alphabetical, and then the years, which are also sorted. You might be looking at this and say, well, I don't see the year 1880. And that's because nobody named their kid Aabha in 1880. And you can see here that for Aabriella, we start at 2008. Then, nobody named their kid that until 2014.
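
Editor's sketch of the re-indexing step described above, assuming it is done with `set_index` followed by `sort_index` (the transcript does not show the call). The counts in the small sample frame are invented; only the names and the Aabriella years echo the talk.

```python
import pandas as pd

# Small invented sample in the same four-column layout as the combined frame.
all_years = pd.DataFrame({
    "name": ["Zyrin", "Aabriella", "Aabha", "Aabriella", "Aabha"],
    "sex": ["M", "F", "F", "F", "F"],
    "number": [5, 5, 7, 5, 9],
    "year": [2012, 2014, 2011, 2008, 2014],
})

# Replace the implicit 0..n-1 index with (sex, name, year) and sort on it.
indexed = all_years.set_index(["sex", "name", "year"]).sort_index()
print(indexed.head(10))
```

Years in which a name was not recorded simply have no row, which is why gaps such as 2008 to 2014 appear.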

  • 05:56

    NIMA ZAHADAT [continued]: But this is now a complete data frame that we can use. And it's indeed the way we want it. So let's actually go ahead and take a look at a couple of names. I want to first look and find the name Rachel. I'm going to call my data frame that I created here. And I'm going to use one of its functions, which is called loc for locate.

  • 06:17

    NIMA ZAHADAT [continued]: I'm looking for girls. So I put F here. And I'm looking for Rachel. And I can see that the list comes up. In 1880, 166 people named their kid Rachel. And it goes on and on all the way up to 2014 when 2,051 people named their girls Rachel.
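
Editor's sketch of the `loc` lookup described above. The two Rachel rows use the counts quoted in the talk; the third row is an invented filler, and the frame and variable names are assumptions.

```python
import pandas as pd

# Two Rachel rows use counts quoted in the talk; the Emma row is invented.
all_years = pd.DataFrame({
    "name": ["Rachel", "Rachel", "Emma"],
    "sex": ["F", "F", "F"],
    "number": [166, 2051, 300],
    "year": [1880, 2014, 1880],
})
indexed = all_years.set_index(["sex", "name", "year"]).sort_index()

# Select every row whose first two index levels are ("F", "Rachel").
# The result is indexed by the remaining level, year.
rachel_f = indexed.loc[("F", "Rachel")]
print(rachel_f)
```

Swapping "F" for "M", or "Rachel" for another name, gives the other lookups the speaker tries; a pair that never occurs raises a `KeyError`.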

  • 06:37

    NIMA ZAHADAT [continued]: I may try my own name. So I'm going to put M for male. And I'm going to put my first name, Nima. And when I run this, I can see that my name didn't appear until 1971. Five people named their kid Nima back in 1971. It climbed up a little bit, nowhere near Rachel, and climbed back down to 9 in 2014.

  • 07:01

    NIMA ZAHADAT [continued]: Now, what's interesting is, you can play with this stuff. I'm going to go back to Rachel. And I wonder if anybody named their boy Rachel. So I'm going to change the F to M. And we can see that, sure enough, there were some people who named their boys Rachel. In 1899, there were five people. Some people, I guess they wanted to have a girl,

  • 07:23

    NIMA ZAHADAT [continued]: and they got a boy. I'm not sure. But that's what they did. Let me take a look at my name and see how many people named their daughters Nima. And I can see that when I switch from M to F in my name, now it went back to 1921. In 1921, six people named their daughters Nima.

  • 07:43

    NIMA ZAHADAT [continued]: And it's actually been going on all the way to 2014. There were more people naming their girls Nima than boys. So I guess it's kind of a unisex name. OK, well, that's good to know. What I want to do now is I want to create a function, which is basically a small package of code--

  • 08:03

    NIMA ZAHADAT [continued]: packaged code, as I call it-- to plot, actually graph, a particular name. Very simple, you can see it's barely three lines of code. And here, we're going to plot Rachel. So I'm going to call my plot, feed it Rachel, and then show it. And now, I have a nice beautiful graphical representation

  • 08:24

    NIMA ZAHADAT [continued]: of what Rachel was-- actually, I don't know why this came up. This must be from before, so let me run it again. It must have cached it. I have a nice beautiful representation of Rachel at this point. I can see that, in 1880, there were not too many people naming their kids Rachel. It looked like the name Rachel peaked in the 1980s and 1990s,

  • 08:46

    NIMA ZAHADAT [continued]: climbing up to over 16,000 people naming their children, their daughters, Rachel in each of those years. So it actually peaked during those years. It would be interesting to find out why that was the case. Let's do the same thing for my name. And here's my name. Now, one thing to keep in mind is,

  • 09:07

    NIMA ZAHADAT [continued]: you can see that my name is nowhere near as popular as Rachel's is. And the scaling, when I look at Rachel, this scale goes from 0 to 16,000. When I look at Nima, the scale goes from 5 to 40. The point is that if I were to mix both of these-- I wanted to see what the popularity of Nima

  • 09:29

    NIMA ZAHADAT [continued]: is versus Rachel-- my name would flatten on the graph because Rachel has so many more values over the years. So always be careful when you're doing something like this. Having done that, let's actually do some comparison of names. So I am going to compare some popular names: Michael, Barry, Bill, and Joseph. Joseph is my son, and Michael is a very popular name.

  • 09:51

    NIMA ZAHADAT [continued]: So I just figured let's do that. Barry and Bill are two of the gentlemen who are actually making this video possible. So I'm going to run that. And here now, I have a combination graph that shows how these names compare to one another from 1880 all the way up to 2014.
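
Editor's sketch of a small plotting function like the one described above, plus the scale caveat. The function name `name_plot`, its `ax` parameter, and the sample counts are all assumptions; the counts are invented and chosen only to mimic the gap between a scale topping out near 16,000 and one topping out near 40.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented counts that mimic the scale gap described in the talk.
all_years = pd.DataFrame({
    "name": ["Rachel"] * 3 + ["Nima"] * 3,
    "sex": ["F"] * 3 + ["M"] * 3,
    "number": [200, 16000, 2000, 5, 40, 9],
    "year": [1880, 1985, 2014, 1971, 1990, 2014],
})
indexed = all_years.set_index(["sex", "name", "year"]).sort_index()


def name_plot(sex, name, ax=None):
    """Plot the yearly count for one (sex, name) pair on the given axes."""
    data = indexed.loc[(sex, name)]
    if ax is None:
        ax = plt.gca()
    ax.plot(data.index, data["number"], label=name)
    return ax


# Separate axes: each name gets a y-scale that fits its own counts.
fig, (ax_rachel, ax_nima) = plt.subplots(1, 2)
name_plot("F", "Rachel", ax=ax_rachel)
name_plot("M", "Nima", ax=ax_nima)

# Shared axes: the larger series sets the scale and the smaller one flattens.
fig2, ax_both = plt.subplots()
name_plot("F", "Rachel", ax=ax_both)
name_plot("M", "Nima", ax=ax_both)
ax_both.legend()
```

In a notebook, the `matplotlib.use("Agg")` line would be dropped and `plt.show()` called after plotting.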

  • 10:11

    NIMA ZAHADAT [continued]: It's clear that Michael has been the most popular name, followed by Joseph. And then Barry and Bill are pretty much even over here. Let's do the same thing for girls so we don't leave out the girls. So I am using the names of some of the ladies responsible for this video.

  • 10:31

    NIMA ZAHADAT [continued]: I have Julie, Kathleen, Rachel, and Jacqueline. I'm looking at this. Again, I can see some results of how these names have been doing over the years. All of them started out kind of close to 0, at least on this scale. It's probably in the hundreds. I can see that Kathleen peaked in the 1950s. Julie peaked in the 1960s and '70s.

  • 10:55

    NIMA ZAHADAT [continued]: Rachel peaked, as we saw earlier, in the 1980s and '90s. And Jacqueline has been the one that's been fairly consistent across the board. It's been more or less the same, although it peaked a little bit in the early 1960s, obviously because of Jackie Kennedy, President Kennedy's wife. So one more thing that I would like to do right here is, well, suppose that we want to find out

  • 11:18

    NIMA ZAHADAT [continued]: the variations of the name Kathleen. We have Kathleen. We have Kathy. We have Katie. We have Kat, maybe others. How are those variations playing a part, or how do they compare? Which one are people most likely to give their girl as a name?

  • 11:39

    NIMA ZAHADAT [continued]: So I'm going to come over here and actually use another for loop along with the function that I created. I've created a list, which is a Python structure, and I put in Kathleen, Kathy, Katie, and Kat as the names. And I'm going to loop through them and actually display them. So let's take a look at that.
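
Editor's sketch of the list-plus-for-loop step described above. The list of variations comes from the talk; the counts, the years, and the variable names are invented for illustration, and the plotting is inlined here rather than calling the speaker's function, which the transcript does not show.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented counts and years for the four variations named in the talk.
all_years = pd.DataFrame({
    "name": ["Kathleen", "Kathleen", "Kathy", "Kathy", "Katie", "Katie", "Kat"],
    "sex": ["F"] * 7,
    "number": [120, 900, 50, 800, 80, 600, 6],
    "year": [1880, 1950, 1915, 1955, 1880, 1985, 2005],
})
indexed = all_years.set_index(["sex", "name", "year"]).sort_index()

# A Python list holding the variations to compare.
variations = ["Kathleen", "Kathy", "Katie", "Kat"]

# Loop through the list and draw one line per variation on shared axes.
fig, ax = plt.subplots()
for name in variations:
    data = indexed.loc[("F", name)]
    ax.plot(data.index, data["number"], label=name)

ax.set_xlabel("year")
ax.set_ylabel("number of births")
ax.legend()
```

Because every variation shares one y-axis, a rare spelling such as Kat will sit close to zero next to the common ones.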

  • 12:00

    NIMA ZAHADAT [continued]: Here we are. This shows that Kathleen has been the most popular variety, followed by Kathy, which is very, very close to it. As for Katie and Kat, Kat has not even appeared until the 2000s. Apparently, nobody called their kid Kat. And this is an official name. This is not a nickname. So this is the name that shows on the birth certificate.

  • 12:22

    NIMA ZAHADAT [continued]: So Kathleen followed by Kathy followed by Katie have been very popular. Interestingly enough, Kathy doesn't appear until the 1910s, although Kathleen has been around since the beginning as well as Katie. Katie has been around since the very beginning. We can do the same thing, of course, for boys if we wanted to.

  • 12:42

    NIMA ZAHADAT [continued]: But I just wanted to have this little presentation to show you what data mining is capable of doing. This is kind of a fun data set to play with and get information out of. We can get a lot more info out of this if we wanted to. But this should be sufficient to get you intrigued about what you can do with data mining tools and techniques.

  • 13:04

    NIMA ZAHADAT [continued]: And this is all done with very, very few lines of code. Just to leave you with a thought: how would you have combined all of those text files if it wasn't with those three or four lines of code? Would you use Excel and copy and paste everything, which would be a nightmare? How would you do it? Think about that, and you understand

  • 13:25

    NIMA ZAHADAT [continued]: why these tools that are designed for data mining are so popular and so useful.

Video Info

Series Name: An Introduction to Data Mining

Episode: 4

Publisher: SAGE Publications Ltd

Publication Year: 2019

Video Type: Tutorial

Methods: Data mining, Python, Data visualization

Keywords: census data; coding; data management; data manipulation; data mining; data preparation; data processing; data visualisation; graphical presentation of data; programming and scripting languages; statistical packages

Segment Info

Segment Num.: 1

Abstract

Nima Zahadat, PhD, Professor of Digital Forensics and Data Science at George Washington University, illustrates some data mining concepts using a baby name data set and Python, including the preparation of data frames and data visualization.

Data Mining Example: Baby Names
