  • 00:00

    [MUSIC PLAYING]

  • 00:09

    NIMA ZAHADAT: My name is Nima Zahadat, and I'm a professor of Data Science and Digital Forensics. And we are going to discuss data mining as it compares to typical database management systems, as well as statistics. The data mining process is a knowledge discovery process in data. We typically start with some raw data.

  • 00:31

    NIMA ZAHADAT [continued]: It could be stored in a database. It could be data that's just out there, that's not necessarily organized. It could be brought into a data warehouse and cleaned. That cleaning process can take up to 60% or more of the whole process of data mining. Once it's been cleaned and organized, then the data mining processes run against it.

  • 00:52

    NIMA ZAHADAT [continued]: And from there, based on visualizations that are produced, we discover knowledge. And that is referred to also as business intelligence. And that's what data mining is. So we start with the cleaning process again. We may do some reduction and transformation of the data. To give an example, we might get some data from, let's say,

  • 01:13

    NIMA ZAHADAT [continued]: a simple spreadsheet that has several million points of data. And we find out that some of the data is missing. Some of the data may be in the improper format for our analysis. That is a process of transformation, for instance. We may also decide that some rows where the data is missing, we simply don't want to use because there is data missing. So we will eliminate that.

  • 01:34

    NIMA ZAHADAT [continued]: That's the reduction process. Then, we decide how we want to do this as far as data mining goes. Do we want to do classification? Do we want to do regression, association, clustering? Or we may not know any of them. We may not know which one we want to do, in which case we start running our algorithms, and running our code, and then decide which one is really coming out.
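The transformation and reduction steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, field names, and formats below are hypothetical stand-ins for a spreadsheet export, not data from the talk.

```python
# Hypothetical raw rows, standing in for a spreadsheet export.
raw_rows = [
    {"customer": "A", "age": "34", "sales": "120.50"},
    {"customer": "B", "age": "", "sales": "80.00"},         # missing age
    {"customer": "C", "age": "51", "sales": None},          # missing sales
    {"customer": "D", "age": " 28 ", "sales": "1,040.00"},  # improper format
]


def is_complete(row):
    """Return True when no field in the row is missing or blank."""
    return all(v is not None and str(v).strip() != "" for v in row.values())


def transform(row):
    """Convert text fields into the numeric types the analysis needs."""
    return {
        "customer": row["customer"],
        "age": int(row["age"].strip()),
        "sales": float(row["sales"].replace(",", "")),
    }


# Reduction: drop rows with missing data. Transformation: fix the formats.
clean_rows = [transform(r) for r in raw_rows if is_complete(r)]
print(clean_rows)
```

Dropping incomplete rows is only one option; as the talk notes later, missing values can sometimes be replaced instead of eliminated.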

  • 01:56

    NIMA ZAHADAT [continued]: Is it a classification scheme? Is it an association scheme? Et cetera. Data mining is a search for patterns of interest. That's what it's all about-- finding patterns of interest that we can use for business intelligence. Data mining is a combination of a lot of other areas of study.

  • 02:16

    NIMA ZAHADAT [continued]: Statistics really can be used as part of data mining. It doesn't replace it. Visualization is used. Obviously, database technologies are used. Machine learning is also used as data mining or is used as part of data mining. So data mining is a combination of a lot of things. It's not just one thing by itself.

  • 02:38

    NIMA ZAHADAT [continued]: Let's actually discuss data mining versus statistical analysis. When I was going to college, of course I had to take some statistics. And I did very well in it, even though I didn't like it. Statistics takes the idea that I might have thousands of points of data, perhaps millions of points of data. We cannot analyze that many. So we take a sample.

  • 03:00

    NIMA ZAHADAT [continued]: That whole data is referred to in statistics as a population. When you take a sample of that population, and you do analysis of that, that analysis is called a statistic. And that's what the term statistics comes from. I might have millions of points of data.

  • 03:21

    NIMA ZAHADAT [continued]: And I'd take out a sample of a couple of hundred. That may not be enough. But maybe that's all I can handle with my knowledge of statistics or with my program. So I might take multiple samples of 200, or 100, or 30, or whatever. So statistics has become a science by itself over the years because it has a background that goes back a few hundred years.

  • 03:47

    NIMA ZAHADAT [continued]: It has a history, I should say, that goes back a few hundred years. The problem is when you apply statistics, that this is entirely data-driven. And a lot of times, it starts off by dealing with what we call descriptive statistics or summary statistics. Most people are familiar with taking averages or finding a median, averages being the most popular.

  • 04:09

    NIMA ZAHADAT [continued]: So we take this data. We find average values for certain attributes. And then, we apply our statistical methods to draw some conclusions. There are several problems here. First of all, how do we know that our sample was representative enough of the whole population? How do we know that? We really don't know that.
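To make the population-versus-sample point concrete, here is a minimal Python sketch. The population is synthetic (100,000 made-up values drawn with a fixed seed), so every number in it is an assumption for illustration; only the sample sizes of 200, 100, and 30 come from the talk.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so repeated runs give the same draws

# Hypothetical population: 100,000 made-up measurements.
population = [rng.gauss(50, 10) for _ in range(100_000)]
population_mean = statistics.fmean(population)

# Draw samples of different sizes and compute one statistic (the mean) for each.
sample_means = {}
for size in (200, 100, 30):
    sample = rng.sample(population, size)
    sample_means[size] = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
for size, mean in sample_means.items():
    print(f"sample of {size:>3}: mean {mean:.2f}, off by {mean - population_mean:+.2f}")
```

Each sample gives a slightly different average, and with real data the true population value is usually not available for comparison, which is the representativeness problem raised above.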

  • 04:31

    NIMA ZAHADAT [continued]: Could there be bias in there? Absolutely. There are a lot of people out there who will take a sample that will give them the result that they want. That has happened, and it continues to happen to this day. The interpretation of the results can be daunting and difficult because statistics is daunting and difficult. It really does require expert guidance and expert knowledge.

  • 04:52

    NIMA ZAHADAT [continued]: There is also no input as far as domain knowledge is concerned. And what domain knowledge refers to is someone's expertise and experience or other people's expertise and experience about the population that was statistically analyzed. Data mining, on the other hand, takes the entire population, takes the entire set of data, and says,

  • 05:14

    NIMA ZAHADAT [continued]: OK, we're going to apply some algorithms to the entire set of data. It's real world data. We might have lots of missing values. We may be able to replace some. We may be able to eliminate them. But we are working with the whole data, not just a sample, a small sample. And it does allow for domain knowledge. In other words, someone can offer their knowledge,

  • 05:35

    NIMA ZAHADAT [continued]: their expertise, their experience as part of the data analysis when it's done. Let's talk about an example or the differences between data mining versus traditional just databases. Remember that we can use traditional databases as our data source when we're doing data mining. But anyone who's worked with databases before knows that you can give instructions to a database

  • 05:57

    NIMA ZAHADAT [continued]: to give you results back. That's not data mining. That's just getting some results back. So an example of a database would be a database report. What was the last month's sales for each service type? Or the sales per service, grouped by customer sex or age. List of customers who lapsed their policy, their insurance policy, let's say-- that's not data mining.

  • 06:17

    NIMA ZAHADAT [continued]: That's just a report that says, here's the information. Or a student going and saying, I want to get a copy of my transcript. That gets fetched from a database and is given. Now, some questions that data mining could answer would be, what characteristics do customers that lapse their policy have in common? Or how do they differ from customers who renew their policy?

  • 06:38

    NIMA ZAHADAT [continued]: Which motor insurance policy holders would be potential customers for their house content insurance policy, or the renter's policy, or a home insurance policy? Those are questions that data mining can answer-- which is, it's not some simple query that you give to the database to give you data back. It actually has to be done as an analysis

  • 07:00

    NIMA ZAHADAT [continued]: after data mining has been run. Data warehousing is now a very popular term that is used. That is also not data mining. Data warehouses are databases that are extremely large, and they hold typically historical data. People who love to watch sports, such as football, they are inundated with all kinds of information

  • 07:21

    NIMA ZAHADAT [continued]: that this guy had so many passes in so many seasons. And he just surpassed a certain goal, et cetera. Where did they get all this information from? This is all coming from data warehouses. That's where this stuff is stored. That, again, is not data mining. And the idea here is, again, to take that information, mine it, and extract knowledge from it.

  • 07:43

    NIMA ZAHADAT [continued]: That is what data mining is all about. So it's a little bit different than data warehouses. Again, we use data warehouses as part of our data mining process. Let's take a look at this chart right here. It's a very simple chart. It has several columns. We have a Day column. There are 14 days here. There is an Outlook, whether it was sunny, rainy, et cetera.

  • 08:05

    NIMA ZAHADAT [continued]: We have a temperature. We have a humidity. We have a wind factor, and whether or not we played outside on that day based on these factors, let's say. Or we don't know that precisely. Whether or not we played, that's basically what the chart shows us. So this is just a very simple chart. Obviously, there are not thousands of points. But even this, with only having so many little points,

  • 08:27

    NIMA ZAHADAT [continued]: can be a little bit difficult to analyze if I'm interested in some information such as which particular factors seem to affect whether we play outside or not. Let's take a look at an example of what a database, a DBMS-- DBMS is Database Management System-- or OLAP, Online Analytical Processing, would do.

  • 08:48

    NIMA ZAHADAT [continued]: I can query this chart, this little database, and say, for instance, what was the temperature in the sunny days? And the results would come back. They were 85, 80, 72, 69, 75. I can ask which days the humidity was less than 75. I can ask which days the temperature was greater than 70. Or I can combine some of these.
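The database-style queries in this passage can be sketched in Python. The chart itself is not reproduced in the transcript, so the rows below are an assumption: they are the widely used 14-day weather example dataset, which appears to match the chart being described. The variable names are illustrative only.

```python
# Assumed data: (day, outlook, temperature, humidity, windy, played).
DAYS = [
    (1, "sunny", 85, 85, False, False),
    (2, "sunny", 80, 90, True, False),
    (3, "overcast", 83, 86, False, True),
    (4, "rainy", 70, 96, False, True),
    (5, "rainy", 68, 80, False, True),
    (6, "rainy", 65, 70, True, False),
    (7, "overcast", 64, 65, True, True),
    (8, "sunny", 72, 95, False, False),
    (9, "sunny", 69, 70, False, True),
    (10, "rainy", 75, 80, False, True),
    (11, "sunny", 75, 70, True, True),
    (12, "overcast", 72, 90, True, True),
    (13, "overcast", 81, 75, False, True),
    (14, "rainy", 71, 91, True, False),
]

# What was the temperature on the sunny days?
sunny_temps = [temp for _, outlook, temp, _, _, _ in DAYS if outlook == "sunny"]

# Which days was the humidity less than 75?
low_humidity_days = [day for day, _, _, hum, _, _ in DAYS if hum < 75]

# Which days was the temperature greater than 70?
warm_days = [day for day, _, temp, _, _, _ in DAYS if temp > 70]

# Combined: temperature greater than 70 and humidity less than 75.
warm_and_dry_days = [day for day, _, temp, hum, _, _ in DAYS if temp > 70 and hum < 75]

print(sunny_temps, low_humidity_days, warm_days, warm_and_dry_days)
```

Each query only filters and returns stored rows; none of them says which factors drive the decision to play, which is the gap data mining is meant to fill.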

  • 09:08

    NIMA ZAHADAT [continued]: Which days the temperature was greater than 70 and the humidity was less than 75? And get a result back. That's what it gives me. Now, we want to bring this one step further and look at it as an online analytical processing. So this is one step closer to what data mining does. In this particular case, I have taken that information

  • 09:29

    NIMA ZAHADAT [continued]: and created what's called a cube where I have divided my 14 days into two separate weeks. And I have both a sunny day, a cloudy day, et cetera. And looking at this, it requires a little bit of analysis on my part. For instance, I can look at week 1 and see that we had two sunny days, but we didn't play.

  • 09:52

    NIMA ZAHADAT [continued]: So the zero says we didn't play. And the 2 says that we had two sunny days. But that still leaves a good deal of analysis on the part of the end user that they have to sit there and make sense of this thing. And of course, although this is fairly simplistic in nature, I still somewhat had to sit there and count up this information and create this OLAP table.
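The counting behind such an OLAP table can be sketched as a small aggregation in Python. As before, the rows are an assumption (the standard 14-day weather example dataset, since the chart is not in the transcript), and splitting days 1 to 7 and 8 to 14 into two weeks is my reading of "two separate weeks".

```python
from collections import defaultdict

# Assumed data: (day, outlook, temperature, humidity, windy, played).
DAYS = [
    (1, "sunny", 85, 85, False, False),
    (2, "sunny", 80, 90, True, False),
    (3, "overcast", 83, 86, False, True),
    (4, "rainy", 70, 96, False, True),
    (5, "rainy", 68, 80, False, True),
    (6, "rainy", 65, 70, True, False),
    (7, "overcast", 64, 65, True, True),
    (8, "sunny", 72, 95, False, False),
    (9, "sunny", 69, 70, False, True),
    (10, "rainy", 75, 80, False, True),
    (11, "sunny", 75, 70, True, True),
    (12, "overcast", 72, 90, True, True),
    (13, "overcast", 81, 75, False, True),
    (14, "rainy", 71, 91, True, False),
]

# Each cell of the cube is keyed by (week, outlook) and holds two counts.
cube = defaultdict(lambda: {"days": 0, "played": 0})
for day, outlook, _temp, _hum, _windy, played in DAYS:
    week = 1 if day <= 7 else 2  # assumed split into two weeks
    cell = cube[(week, outlook)]
    cell["days"] += 1
    cell["played"] += int(played)

for (week, outlook), cell in sorted(cube.items()):
    print(f"week {week} {outlook:<8} days={cell['days']} played={cell['played']}")
```

Under these assumptions the week 1 sunny cell comes out as two days and zero played, matching the cell read out in the talk. The table summarizes the data, but interpreting it is still left to the reader.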

  • 10:14

    NIMA ZAHADAT [continued]: Let's take a look at what data mining does. In data mining, we can produce the following decision tree using data mining processes. I have an outlook. Outlook is sunny. Humidity is high, we don't play. Humidity is normal, we do play. If the outlook is overcast, we play every time.

  • 10:35

    NIMA ZAHADAT [continued]: It's very simple. If the outlook is rainy and there's wind, we don't play. If it's not windy, we do play. So we took that entire table and brought it down to something very simple that anyone can understand. And this is the knowledge that was discovered in that simple table.
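The decision tree described above is small enough to write down as plain rules and check against the data. This sketch does not learn the tree; it only encodes the rules as stated and tests them. The rows are again the assumed 14-day weather example dataset, and treating humidity above 75 as "high" is my assumption, since the talk only says "high" and "normal".

```python
# Assumed data: (day, outlook, temperature, humidity, windy, played).
DAYS = [
    (1, "sunny", 85, 85, False, False),
    (2, "sunny", 80, 90, True, False),
    (3, "overcast", 83, 86, False, True),
    (4, "rainy", 70, 96, False, True),
    (5, "rainy", 68, 80, False, True),
    (6, "rainy", 65, 70, True, False),
    (7, "overcast", 64, 65, True, True),
    (8, "sunny", 72, 95, False, False),
    (9, "sunny", 69, 70, False, True),
    (10, "rainy", 75, 80, False, True),
    (11, "sunny", 75, 70, True, True),
    (12, "overcast", 72, 90, True, True),
    (13, "overcast", 81, 75, False, True),
    (14, "rainy", 71, 91, True, False),
]


def predict_play(outlook, humidity, windy, high_humidity=75):
    """Apply the tree from the talk; high_humidity is an assumed cutoff."""
    if outlook == "overcast":
        return True                       # overcast: play every time
    if outlook == "sunny":
        return humidity <= high_humidity  # sunny: play only if humidity is normal
    if outlook == "rainy":
        return not windy                  # rainy: play only if it is not windy
    raise ValueError(f"unknown outlook: {outlook!r}")


# Days where the rules disagree with what actually happened.
mismatches = [
    day
    for day, outlook, _temp, humidity, windy, played in DAYS
    if predict_play(outlook, humidity, windy) != played
]
print("days the tree gets wrong:", mismatches)
```

Note that temperature never appears in the rules: under these assumptions, three attributes are enough to account for all 14 rows.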

  • 10:56

    NIMA ZAHADAT [continued]: This is the tabular form still. But it's still considered to be a visualization, which someone can look at and very quickly make sense of. And that's what data mining gives us. [MUSIC PLAYING]

Video Info

Series Name: An Introduction to Data Mining

Episode: 3

Publisher: SAGE Publications Ltd

Publication Year: 2019

Video Type: Tutorial

Methods: Data mining, Statistical tests

Keywords: algorithms; data mining; data preparation; data storage; data transformations; data visualisation; database management systems; decision tree: introduction; descriptive statistics; pattern analysis; pattern recognition; Sample size calculations and statistical power; Statistics and data mining

Segment Info

Segment Num.: 1

Abstract

Nima Zahadat, PhD, Professor of Data Science and Digital Forensics at George Washington University, discusses the differences between data mining and statistics, how data mining differs from working with databases, and the types of information that can be extracted using data mining.

Data Mining vs Statistics
