  • 00:00

    [MUSIC PLAYING] My name is Nima Zahadat and I'm a professor of data science and digital forensics. And we are going to discuss data mining terminology and ideas. We're going to start by talking about what classification is.

  • 00:23

    When you do data mining, one of the results might be a classification scheme. The idea here is that you take a collection of records that you have discovered in your data mining and call that a training set. And then we find a model for that. And then apply that model to data that we don't know anything about, or use

  • 00:44

    that model to classify all the data that we haven't studied yet. That's the idea behind classification. The goal here is that the previously unseen records should be assigned classes as accurately as possible so we can determine what attributes, when I say class,

  • 01:05

    I mean I'm talking about attributes, what attributes of this data might be of interest to an end user, so they can understand it and work with it in an easy fashion. That's more or less what it is. So let's take a look at a classification example. Let me discuss this. Let's say that we have a set of data about people who are either single, married, or divorced.

  • 01:28

    And they file their taxes and they have a certain level of income. And we want to know whether or not they cheated on their tax return. We can actually do a data analysis based on a set of these people and determine -- we take a sample -- we take this data and we analyze it and determine who cheated, who did not cheat.

  • 01:50

    And based on these other attributes, they were single, they were married, they were divorced, their income was in a particular range, we can come up with a classification scheme and say, well, it seems like if their income is within such and such range and they have gone through a divorce, they were more likely to cheat on their taxes.
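The training-set idea just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the income bands, and the names `TRAINING_SET`, `income_band`, `train`, and `classify` are all made up here, and a simple "majority class per bucket" rule stands in for whatever model a real analysis would fit.

```python
from collections import Counter, defaultdict

# Hypothetical training set: (marital_status, income_in_thousands, cheated).
TRAINING_SET = [
    ("single", 125, False),
    ("married", 100, False),
    ("single", 70, False),
    ("married", 120, False),
    ("divorced", 95, True),
    ("married", 60, False),
    ("divorced", 220, False),
    ("single", 85, True),
    ("married", 75, False),
    ("single", 90, True),
]


def income_band(income):
    """Bucket income into coarse ranges (the cut-offs are assumptions)."""
    if income < 80:
        return "low"
    if income < 100:
        return "middle"
    return "high"


def train(records):
    """Learn the majority class for each (status, income band) bucket."""
    votes = defaultdict(Counter)
    overall = Counter()
    for status, income, cheated in records:
        votes[(status, income_band(income))][cheated] += 1
        overall[cheated] += 1
    # Fall back to the most common class for buckets never seen in training.
    default = overall.most_common(1)[0][0]
    model = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    return model, default


def classify(model, default, status, income):
    """Assign a class to a previously unseen record."""
    return model.get((status, income_band(income)), default)


model, default = train(TRAINING_SET)
print(classify(model, default, "divorced", 90))  # flagged by this toy model
print(classify(model, default, "married", 90))   # unseen bucket, uses default
```

The model here is just a lookup table, which keeps the two steps visible: learn from records whose class is known, then apply what was learned to records whose class is not.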

  • 02:10

    And what that is used for, then, is used by the computer, if you will, by the application of the program, to flag people who are divorced, who have a certain income range, as questionable if their taxes, the way they're filed, maybe they have too many deductions. They might be flagged as an interesting party

  • 02:34

    to be audited by the IRS. Another example of classification, let's talk about, say, direct marketing. In the old days, to do marketing they would send out flyers, which came to be known as junk mail, to people without really trying to target any particular group because they couldn't.

  • 02:54

    They just didn't have the technology to deal with it. And the data collection was all manual. Nowadays, we tend to do direct marketing. People are targeted based on their particular political affiliation, their gender, their age group, whether they're married or single, where they live in the country.

  • 03:15

    That's all part of the targeting. And this can all be done very quickly by using classification schemes as part of data mining. Fraud detection is another classification example. The idea here, in the case of credit cards in particular, is whether we can detect fraud. There is a charge that's done on a credit card

  • 03:37

    and the charge is unusual, not necessarily because of its location, although that could be, but it might be just unusual because the person who has that credit card usually doesn't spend money in that particular way. And it's not necessarily the amount. I was contacted by my bank a couple of years ago

  • 03:58

    and I was driving and they said that we just had a charge of 100-some dollars with an online store. And I think it was some kind of a flower store. Did you make this charge? And I said, no I didn't. They said, well, we had a charge previously, also with the same store. Did you make that charge? And I said, no I did not.

  • 04:19

    And they said, well, we approved the first one but we blocked the second one. So they had gotten curious as to why a second charge would be made at the same store and they had blocked it. They gave me credit for both of those since I was not responsible. But the idea was that I never purchased flowers online for any reason. And I just don't buy flowers. So the machine had detected this as an anomaly.
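The flower-store story can be pictured as a check of each new charge against the cardholder's own spending history. The sketch below is an assumption-heavy toy, not how any bank actually does it: the purchase history, the `min_share` threshold, and the names `build_profile` and `is_unusual` are hypothetical.

```python
from collections import Counter


def build_profile(history):
    """Count how often the cardholder has bought in each merchant category."""
    return Counter(category for category, _amount in history)


def is_unusual(profile, category, min_share=0.02):
    """Flag a charge whose category makes up under min_share of past charges."""
    total = sum(profile.values())
    if total == 0:
        return True  # no history at all, so nothing looks normal yet
    return profile[category] / total < min_share


# Hypothetical purchase history: (merchant category, amount in dollars).
history = [
    ("groceries", 54.10),
    ("fuel", 40.00),
    ("groceries", 61.25),
    ("books", 18.99),
    ("fuel", 38.50),
]
profile = build_profile(history)

print(is_unusual(profile, "flowers"))    # never bought before
print(is_unusual(profile, "groceries"))  # a regular purchase
```

Note that the amount plays no role in this check, which matches the point that an anomaly is not necessarily about how much was spent.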

  • 04:41

    And this is an example of a classification scheme again, where you can categorize people and put them in these buckets and say, this is the way they behave. That's another classification idea. What if somebody changes their habits, maybe somebody is becoming a father and they are going to start shopping for baby stuff

  • 05:02

    and maybe they want to buy something for their wife, et cetera. Is the classification programmed to handle something like that? And the answer is, well, kind of. Maybe, maybe not. And let me explain that. First of all, it may very well be that it will block those purchases. It will signal those as an error.

  • 05:24

    The father would have to contact the company and say that these are legitimate purchases, at which point the algorithm will now take that into account going forward. On the other hand, data mining is everywhere. And data is being collected everywhere. There was a very interesting case of -- speaking of fathers -- a father who was very upset at Target because Target was sending direct mail

  • 05:48

    and information to his daughter because they were anticipating that she was pregnant and she was going to have a baby. This was a teenage girl who wasn't married, was living with mom and dad. So he was very upset at Target. And after Target showed him the information, Target, using data mining by other companies,

  • 06:10

    had actually found out that the girl was, in fact, pregnant, and was, in fact, in need of having this material probably fairly soon. That's because the girl had been searching online for things like this and had been posting messages with her friends, et cetera, about being pregnant. And Target knew it before her own dad knew.

  • 06:32

    So this does happen. This is part of data mining. And hopefully that makes sense that yes, it can be trained and retrained and get more and more advanced. As I mentioned, Target had access to all this information about this girl. And a question that comes up is, was that a violation of the girl's privacy?

  • 06:52

    And how did they gain access to all this data by all these companies? People have to understand that as soon as they put something on the web or as soon as they send a message, a text message, or a chat, or picture, or anything like that, that becomes part of the public realm.

  • 07:13

    And there is nothing illegal about someone else using it. Anybody who has a Facebook page or a LinkedIn account, who obviously has not read the terms of privacy, if they were to read the terms of privacy, they would be terrified because the terms of privacy say specifically that anything you post belongs to them.

  • 07:33

    They can do anything they want with it. They can use it for their own advantage in any possible way. That you have no control over it. And if they lose it or something terrible happens to you because of it, they are not liable. That is their privacy policy. People make the huge leap, going from the word privacy policy to thinking that somehow that means privacy protection.

  • 07:56

    Basically, there is no protection. So this is all done, of course, all this data is mined and collected and sold. And companies like Target use that to find out about how they can target and sell their products to end users. Another scheme that data mining works with

  • 08:16

    is the idea of clustering, that is, putting together -- taking all the data and clustering them into groups based on one or more attributes. These could be -- the attributes could be how close the data points are together in some form, maybe a mathematical distance.

  • 08:36

    It could be color. It could be region. It could be likes and dislikes, anything like that. That's called clustering analysis. And that's, again, part of data mining. It could be very well that certain areas of the country, as credit bureaus do, they categorize the various areas of the country as the Northeast, the Midwest, the mid-Atlantic, et cetera.

  • 08:59

    And they group these people based on their credit history and credit scores. And then you can compare yourself, depending on where you are within the country and how you measure up with everybody else. And this is available, of course, with a fee, a monthly fee that you have to pay. And you can see how you measure up.

  • 09:20

    So an example of clustering, since we talked about it and I gave you an example, but let's talk about market segmentation. The goal is to subdivide a market into distinct subsets of customers where a subset may conceivably be selected as a market to be reached by a particular organization. So we decide that we are going to target the Washington DC

  • 09:44

    area or the San Francisco area. And these are two distinct markets. They're not the same. We may want to look at the lifestyles. We look at the demographic. We may want to look at the age range. We may want to look at the types of jobs that people have in these particular areas and target differently to each one.

  • 10:04

    This is clustering. DC is different than Montgomery, Alabama. It's different than San Francisco, California. Each one has its own demographic, its own set of lifestyles. And so we cluster them together for better targeted marketing and delivery of products.
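Grouping by "a mathematical distance" can be sketched with a bare-bones k-means loop. The technique is a choice made here for illustration, not something the talk names, and the customer points (age, income in thousands), the starting centroids, and the function names are all hypothetical.

```python
import math


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(point, centroids[i]),
        )
        clusters[nearest].append(point)
    return clusters


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to the mean."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid  # leave an empty cluster's centroid alone
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


# Hypothetical customers: (age, income in thousands of dollars).
customers = [(25, 40), (27, 42), (24, 38), (52, 90), (55, 95), (50, 88)]
centroids, clusters = kmeans(customers, centroids=[(25, 40), (50, 88)])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Each resulting cluster is one market segment in this toy setup: customers who sit close together on the chosen attributes and can be targeted as a group.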

  • 10:24

    Another example would be document clustering. Google, of course, made this very famous. Google's search engine is very popular. Even if people don't like Google, they still use Google. I have never heard anybody say that I'm using MSN search or Yahoo search anymore. And the reason for that is that the Google search engine uses the idea of clustering to pull up documents on the web.

  • 10:48

    These documents are web documents, of course, that are very relevant to the search term that the user was interested in to begin with. And that's an example of clustering again. The very first page of Google where the most relevant pages show up, everybody wants to be on that when a particular term is typed in. They realize this, and of course, they turn that into their advertising model, which

  • 11:11

    is where they make their money and now they're almost a $900 billion company. Another example of a data mining approach is association rule discovery. And here is, you know, analyzing, let's say, when people go out shopping, if they are buying particular items together. Let's say that an association rule

  • 11:32

    is discovered by analyzing thousands of customer transactions. And it appears that every time a customer buys a carton of milk, they also buy Coca-Cola to go with it. Or if they buy milk and diapers, they also buy beer. That being the case, the store may decide that we are going to arrange the items in such

  • 11:56

    a way that when they pick up their diapers and milk, the beer is nearby, at least a stack of it or a gondola of it, so they are reminded not to forget to also purchase the beer. That's an example of an association rule. And it's very effective. Anyone who's been to a grocery store would know that when they want to buy the milk, they have to walk the entire length of the store

  • 12:17

    to get to it. That's done on purpose, of course, because they want to make sure that the customer walks the entire length of the store and runs across as many items as possible before they have to buy the milk because milk is one of the items that just about everybody needs and wants. Bread is the same way. Of course, that's a more overall look at it.

  • 12:38

    But next time you're at the store, take a look at how items are put together in such a way to make it convenient to pick up all the items together in one spot because of discovering these various association rules. Couple of additional terms, just in case.

  • 13:00

    There is a term called support. And there's a term called confidence when it comes to association rules. We had an example where people buying diapers and milk are more likely to also buy beer. The support, in this case, let's say is 20%. What does that mean? That means that 20% of all the transactions we analyzed had diapers and milk

  • 13:26

    and also beer. That's the support. The confidence might be 85%, meaning that of the transactions that had diapers and milk, 85% also included beer. So the whole pattern showed up in only 20% of the transactions,

  • 13:47

    that's the support. But it's a very high confidence. And that is information that the store would like to know because that helps them determine how they're going to stack -- how they're going to stock, I meant, their shelves when they have diapers and milk. They may not necessarily put beer right next to diapers and milk, but they may certainly

  • 14:10

    put a sign or something, or maybe just a display, to encourage them to also go pick up the beer. Having said that, I have to admit that I have been to our local grocery store. And very near the milk, the aisle right across from it is also the beer. So I imagine that there is some truth to this.
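Support and confidence for the rule "diapers and milk, therefore beer" come down to two counts over the transaction list. The sketch below uses the standard definitions: support is the share of all transactions containing every item in the rule, and confidence is the share of transactions with diapers and milk that also contain beer. The five baskets and the function names are made up for illustration.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in transactions if items <= set(basket))


def support(transactions, antecedent, consequent):
    """Share of ALL transactions containing antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also have the consequent."""
    base = count_containing(transactions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / base


# Hypothetical baskets, one set of items per transaction.
transactions = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"bread", "cola"},
]

rule_support = support(transactions, {"diapers", "milk"}, {"beer"})
rule_confidence = confidence(transactions, {"diapers", "milk"}, {"beer"})
print(f"support={rule_support:.0%} confidence={rule_confidence:.0%}")
```

In these made-up baskets two of five transactions contain all three items, and two of the three diapers-and-milk baskets include beer, so the denominators differ: support divides by everything, confidence only by the baskets where the rule could apply.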

  • 14:31

    So the question might be, what if prohibition was to come back, let's say, or a particular area didn't sell beer anymore for whatever reason. What would happen? What would take the place of beer, let's say. Well, we don't know. Remember that the whole idea behind the data mining is that we analyze all these transactions

  • 14:54

    and we're looking for particular patterns, particular classifications, clustering of information. And in this particular case, we get some association rules out of it when we're done. We may decide that, after doing the analysis, that beer and diaper -- if that's what's of interest to us -- and beers --

  • 15:17

    not beer and diaper, milk and diaper. When they're purchased together and beer or alcohol is not available, that they've replaced it with something else. Maybe they now buy more soda or maybe they buy more lemonade or something. Who knows, maybe they buy non-alcoholic beer. But maybe there is no pattern. There is nothing else at that point. They buy their diaper and their milk and there is nothing else.

  • 15:39

    And that would also be useful information because now they know that there is really nothing taking the place of, let's say, an item that is no longer available for purchase. And they may be looking at other patterns at that point. [MUSIC PLAYING]

Video Info

Series Name: An Introduction to Data Mining

Episode: 2

Publisher: SAGE Publications Ltd

Publication Year: 2019

Video Type: Tutorial

Methods: Data mining

Keywords: attributes (data); classification; cluster analysis; cluster detection; cluster grouping; credit card fraud; data mining; direct marketing; market segmentation; modeling (research); privacy issues; Search engines; Social media; target markets

Segment Info

Segment Num.: 1

Persons Discussed:

Events Discussed:



Nima Zahadat, PhD, Professor of Data Science and Digital Forensics at George Washington University, discusses the tasks that data mining is designed for, including classification systems (targeted marketing, fraud detection, and privacy concerns), clustering (market segmentation and search engine documents), and association rule discovery (support and confidence).


Data Mining Tasks
