- 00:00
SPEAKER: Hi. So now we'll look at one more branch of machine learning, which is density estimation, which falls under unsupervised learning. So this is also a place where we don't know the labels for the data. We don't know how many labels should be there.

- 00:22
SPEAKER [continued]: And we saw that in clustering-- so it's similar to clustering. But here, we don't even know how many clusters we should have and what's a reasonable number and where we should look for them. And so density estimation-- what it does-- it looks for--

- 00:42
SPEAKER [continued]: and actually, we can look at the specific technique called MeanShift. What it does-- it tries to find the maximum values of a density function, given a set of data points that can be associated with that function. So it's actually looking through all the data points that we have. And it's trying to figure out what

- 01:02
SPEAKER [continued]: is the likelihood of finding something in a given area. And so the intuition is there are things that often are concentrated in some areas. And so you can identify the concentration there. And there are times when things are more scattered.
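
The idea of climbing toward the densest area can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the scikit-learn implementation used later in the video: the helper `shift_to_mode`, the flat (fixed-radius) neighbourhood, and the synthetic 2D points are all assumptions made for this example.

```python
import numpy as np

def shift_to_mode(start, points, radius, max_iter=100, tol=1e-6):
    """Hypothetical helper: move a point to the mean of its neighbours until it settles."""
    position = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Find every data point within `radius` of the current position.
        distances = np.linalg.norm(points - position, axis=1)
        neighbours = points[distances <= radius]
        if len(neighbours) == 0:
            break  # nothing nearby, so there is no direction to move in
        new_position = neighbours.mean(axis=0)
        if np.linalg.norm(new_position - position) < tol:
            break  # the shift has become negligible: we are at a density peak
        position = new_position
    return position

# Assumed toy data: one concentrated group near the origin plus scattered points.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
scattered = rng.uniform(low=-6.0, high=6.0, size=(20, 2))
points = np.vstack([dense, scattered])

# Starting away from the concentration, the point drifts toward it.
mode = shift_to_mode(start=[1.5, 1.5], points=points, radius=2.0)
print(mode)
```

Each step replaces the current position with the average of the nearby points, which pulls it toward wherever the points are most concentrated.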

- 01:23
SPEAKER [continued]: And so then it's spread out more. And maybe there are different groupings of them or functions describing them. So this is very useful for doing exploratory analysis, where you have a bunch of data and you don't know how to organize them

- 01:45
SPEAKER [continued]: and you have no good intuition to figure out how many classes they can belong to. So we're going to see this with an example in Python. So once again, let's get started with importing some of the libraries. NumPy-- we almost always need it.

- 02:07
SPEAKER [continued]: We're going to use scikit again. And sklearn dot cluster has a lot of techniques or algorithms implemented. We're going to take the MeanShift algorithm. And for this exercise, we're also

- 02:31
SPEAKER [continued]: going to just have Python create samples for us rather than us having a sample. So we're going to use the samples_generator package within sklearn. And make underscore blobs is a specific function

- 02:58
SPEAKER [continued]: that we're going to need to create a bunch of samples. You'll see this in action. It'll make more sense-- matplotlib pyplot as plt. And we can also try some 3D--

- 03:21
SPEAKER [continued]: and so let's get mpl_toolkits. Now, I'm assuming, at this point, you have these things loaded. If not, please stop here at this point and make sure you have all of these packages that we are importing here loaded for this Python.
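
The imports described so far might look like the sketch below. One assumption to flag: in recent scikit-learn releases `make_blobs` is imported from `sklearn.datasets` directly, while the `samples_generator` module path mentioned in the video belongs to older releases.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Recent scikit-learn releases expose make_blobs here; older releases used
# sklearn.datasets.samples_generator, the path mentioned in the video.
from sklearn.datasets import make_blobs

import matplotlib.pyplot as plt

# Importing Axes3D registers the 3D projection on older Matplotlib versions.
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401
```

If any of these lines raises an ImportError, the corresponding package still needs to be installed for the Python interpreter in use.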

- 03:44
SPEAKER [continued]: Let's store this. So we're going to first just define our style for plotting. So we can just-- matplotlib import style.

- 04:08
SPEAKER [continued]: So we're going to import that style function. And we're going to say style dot use ggplot. So this is going to use the familiar ggplot styling. So we're going to create some artificial data here. And what we're going to do-- we're going to define three centers

- 04:30
SPEAKER [continued]: in a three-dimensional space and have this make_blobs create us a bunch of points using those data-- so around those data. Well, I know this is kind of a self-fulfilling prophecy, that we are generating the data around three points. And then we are hoping to do this density estimation, which will give us those three points.

- 04:54
SPEAKER [continued]: But that will make our life easier right now, to just learn this. And we can still play with different parameters there. So let's-- so we're creating three-dimensional points-- just want to make sure-- it could be anything-- just want to make sure that they are different enough that they're

- 05:17
SPEAKER [continued]: not too close to each other. And now we're going to have three-dimensional-- so we're going to create points in three-dimensional space using make--

- 05:43
SPEAKER [continued]: now, make_blobs function-- it asks for how many samples you want to create, around which centers, and how wide you should be looking around those centers to be creating points. And so what we need is, really, returning this X. That's

- 06:06
SPEAKER [continued]: where all the points would be. Y we're going to ignore. But we need that because that's how the make_blobs function works. So we don't need it for now. But so we need-- let's say we will ask for 100 samples around centers, well, that are defined here and centers,

- 06:27
SPEAKER [continued]: the three different points that we have. And cluster standard deviation-- let's say 2. So what this means is we're going to start at these points. And around each of the points, use a standard deviation of 2

- 06:48
SPEAKER [continued]: and generate a bunch of sample data, 100 samples in total. So that's what it means. And just as we did before, it's very easy. You create MeanShift. This is a model. And then you fit it to data.
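
The data generation step described above could be sketched as follows. The exact center coordinates are not stated in the transcript, so the ones below are hypothetical, and `random_state` is an added assumption that only makes the run repeatable.

```python
from sklearn.datasets import make_blobs

# Hypothetical centers in 3D space; the video only requires that they are
# far enough apart not to sit on top of each other.
centers = [[1, 1, 1], [5, 5, 5], [3, 10, 10]]

# 100 points in total, spread around the centers with a standard deviation of 2.
# The second return value holds the generating center of each point; it is
# ignored here, as in the video. random_state is an assumption for repeatability.
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)

print(X.shape)
```

`X` holds one row per generated point and one column per dimension.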

- 07:09
SPEAKER [continued]: So X, capital X, has our samples that are generated using these as centers in three-dimensional space using this function. And so this MeanShift function just simply tries to fit that data. And just as we have, again, done before,

- 07:31
SPEAKER [continued]: let's try to extract the centroids. Now, in the ideal case, it should get the same thing, the things that it used to generate those samples-- and now we're doing the reverse engineering. We are trying to get to the centroids using the samples. But we'll see-- so ms dot cluster

- 07:55
SPEAKER [continued]: underscore centers underscore. That gives us those centroids. And let's see what are the labels for our 100 points. So let's go ahead and print.
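
Fitting the model and reading back the centroids and labels might look like this sketch. The centers and `random_state` are assumptions carried over for illustration, so the number of centroids printed need not match the run shown in the video.

```python
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Hypothetical centers; random_state is an assumption for repeatability.
centers = [[1, 1, 1], [5, 5, 5], [3, 10, 10]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)

# Create the model and fit it to the samples. No number of clusters is given.
ms = MeanShift()
ms.fit(X)

cluster_centers = ms.cluster_centers_  # one row per centroid that was found
labels = ms.labels_                    # one cluster index per sample

print(cluster_centers)
print(labels)
```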

- 08:18
SPEAKER [continued]: Let's go ahead and run it. So we don't need this right now. So let's just get this out and just run it. So we can see that it found two centroids, actually.

- 08:47
SPEAKER [continued]: So we used three centers to generate the data points. But when it did that density estimation, perhaps two of these points are close enough that the samples generated around them happened to fall under the same distribution, same density, similar density. So they were grouped together. So it found two centroids.

- 09:09
SPEAKER [continued]: And remember, we didn't ask for a specific number of centroids. So unlike clustering with k-means, where we ask for a certain number of clusters, here we didn't ask for that. We actually just said, well, just see what's out there. And it found two. And then these are the labels.

- 09:29
SPEAKER [continued]: And it's not printing all of them. But if you're curious, you can find it here, where you have these 100 points. And each of them has a label, 0 or 1. So that shows which cluster the points belong to.

- 09:50
SPEAKER [continued]: And so if you want to get a little idea of how many clusters-- let's say find the unique labels from those clusters.

- 10:12
SPEAKER [continued]: And let's say number of estimated clusters and n clusters underscore. And it says number of estimated clusters, which, of course,

- 10:35
SPEAKER [continued]: we counted manually. Let's play with this a little bit more to see if you bring-- if you're able to bring some other changes here. So now we have slightly different points. And now we find three different clusters here.
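
Counting the clusters from the unique labels can be sketched as below. As before, the centers and `random_state` are assumptions; changing the centers, as done in the video, may change the count.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Hypothetical centers; random_state is an assumption for repeatability.
centers = [[1, 1, 1], [5, 5, 5], [3, 10, 10]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)

ms = MeanShift()
ms.fit(X)
labels = ms.labels_

# The number of distinct labels is the number of clusters that were found.
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)
```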

- 10:55
SPEAKER [continued]: And again, we didn't ask for three. It just found three based on how the data is distributed. So I'm going to leave this here and let you play around with these different numbers. You can also play with what happens if you have a different number of points. Sometimes, it makes a difference. Other times, it doesn't.

- 11:15
SPEAKER [continued]: But this is-- and then there's some things in visualization, which I'm going to leave out. And I'll provide you material to-- if you want to play with some visualization as well. The important thing here is understanding that you're able to take data points. And remember, if you have real data, then this is where you'll make a difference.
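
The visualization is left out of the video, so the following is only one possible sketch and not the speaker's material. It assumes a 3D scatter plot coloured by cluster label, the hypothetical centers used for illustration, and a hypothetical output file name; the non-interactive `Agg` backend is chosen so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file instead of opening a window
import matplotlib.pyplot as plt
from matplotlib import style
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (needed on older Matplotlib)
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

style.use("ggplot")

# Hypothetical centers; random_state is an assumption for repeatability.
centers = [[1, 1, 1], [5, 5, 5], [3, 10, 10]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)

ms = MeanShift()
ms.fit(X)
labels = ms.labels_
found = ms.cluster_centers_

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# Samples coloured by the cluster they were assigned to.
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels, marker="o")

# Centroids found by MeanShift, drawn as large black crosses.
ax.scatter(found[:, 0], found[:, 1], found[:, 2],
           marker="x", color="black", s=150, linewidths=3)

fig.savefig("meanshift_clusters.png")  # hypothetical output file name
```

In an interactive session, `plt.show()` with the default backend would display the figure instead of saving it.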

- 11:37
SPEAKER [continued]: So you will not have this. And you will not have this. So you don't need to create artificial data. You will simply have this data frame. And it could have multiple dimensions. So you can have multiple columns. Think about a .csv file-- multiple features, multiple columns that you load up.

- 11:58
SPEAKER [continued]: This is your data frame. And then this remains the same-- everything. So you give that to MeanShift and ask it to fit this model, ask it to fit this data. And then you can see how many clusters it found and where each data point, each row, goes--

- 12:19
SPEAKER [continued]: so the membership of each data point into one of the clusters. And so this gives you an idea of how things are organized. And this is very useful, as I said before, for exploratory work because you're going in this with very little to no prior knowledge.
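
With real data, the same steps might be applied to a data frame loaded from a .csv file, as in this sketch. Everything about the data here is an assumption: the column names are hypothetical, and the CSV text is generated in memory as a stand-in for a real file that would normally be loaded with `pd.read_csv` and a file path.

```python
import io

import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift

# Stand-in for a real .csv file: two made-up groups of rows with three
# hypothetical feature columns, written to CSV text in memory.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 3))
group_b = rng.normal(loc=10.0, scale=1.0, size=(30, 3))
columns = ["feature_a", "feature_b", "feature_c"]
csv_text = pd.DataFrame(np.vstack([group_a, group_b]), columns=columns).to_csv(index=False)

# Load the data frame, as one would from a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

# The modelling step stays the same: fit MeanShift on the feature columns.
ms = MeanShift()
ms.fit(df[columns].to_numpy())

# Record the cluster membership of each row next to its features.
df["cluster"] = ms.labels_
print(df["cluster"].value_counts())
```

Storing the labels as a new column makes it easy to inspect which rows ended up together.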

- 12:41
SPEAKER [continued]: And you don't have a very strong intuition about where things could be. And so you're just throwing all this data and asking this to figure it out if there is some pattern-- there's some kind of grouping, some clusters-- that exist.

- 13:02
SPEAKER [continued]: And depending on the parameters-- and we didn't even see all the parameters. But you can see-- you can practice with them. Depending on the parameters, you could find a different number of clusters-- so whatever makes sense. And that's where the exploration comes in. There is no defined, clear way. But you are using-- so for instance, if this finds two clusters

- 13:24
SPEAKER [continued]: and you may realize, well, that's not enough, you need to have at least five-- and so you can play with some parameters to see if you can get at least five groupings of the data. And so that's it for density estimation as part of machine learning with Python.
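
The video does not name the parameters to adjust. One candidate in scikit-learn is the `bandwidth` argument of `MeanShift`, which sets how wide a neighbourhood is considered; the sketch below assumes that choice, along with hypothetical centers and arbitrary bandwidth values picked for illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Hypothetical centers; random_state is an assumption for repeatability.
centers = [[1, 1, 1], [5, 5, 5], [3, 10, 10]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)

# The bandwidth MeanShift would estimate on its own when none is given.
default_bandwidth = estimate_bandwidth(X)
print("Estimated bandwidth:", default_bandwidth)

# Try a few bandwidths (arbitrary illustrative values) and count the clusters.
counts = {}
for bandwidth in [1000.0, 4.0, 2.0]:
    ms = MeanShift(bandwidth=bandwidth)
    ms.fit(X)
    counts[bandwidth] = len(np.unique(ms.labels_))
    print("bandwidth", bandwidth, "->", counts[bandwidth], "clusters")
```

A very large bandwidth treats all points as one neighbourhood and yields a single cluster, while smaller values tend to split the data into more groups.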

### Video Info

**Series Name:** Introduction to Data Science with Python

**Episode:** 6

**Publisher:** Chirag Shah

**Publication Year:** 2018

**Video Type:** Tutorial

**Methods:** Python, Machine learning, Unsupervised learning, Clustering

**Keywords:** artificial intelligence; cluster detection; cluster grouping; computer programming; data analysis; density; estimation; Software

### Segment Info

**Segment Num.:** 1

## Abstract

Chirag Shah, PhD, explains how density estimation, a branch of unsupervised learning, can be performed using MeanShift in Python. Using an example based on artificially generated, 3-dimensional data points, the model is able to find centroids and labels. Considerations for real-world data are also provided.