  • 00:00

    SPEAKER: OK. Now we'll talk about clustering. Clustering falls under unsupervised learning within machine learning. So this is where we don't know how many classes there are and what their labels are. Of course, to make it easier, we often just guess the number of classes or we start somewhere.

  • 00:24

    SPEAKER [continued]: But we may still not know their labels. And that's why this is an unsupervised technique. One of the most popular techniques for doing this is called k-means, where k is the number of desired clusters. So again, as I said, we may just ask for a certain number of clusters even if we don't know if that's a good number

  • 00:44

    SPEAKER [continued]: or not and what are the good values or class labels for those. OK. So let's practice this. It's easy enough to understand. Let's go ahead and practice. And we'll actually look at-- we'll see an artificial example so it'll make it a little easier.

  • 01:06

    SPEAKER [continued]: OK. So we're going to import our regular stuff, NumPy for doing math-related stuff, and matplotlib.pyplot for plotting.

  • 01:28

    SPEAKER [continued]: We're also going to get a styling for plotting. So from Matplotlib, you have something called style, and this determines how your graphs look.

  • 01:50

    SPEAKER [continued]: So if you ever want a different look and feel for your plots, this is a way to do it. And here, we're going to use ggplot, a style modeled on ggplot2, a very popular plotting package for R. And we'll see ggplot in some other contexts.
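
A minimal sketch of the imports described so far, assuming NumPy and Matplotlib are installed. The aliases `np` and `plt` are the usual conventions rather than something the video spells out.

```python
import numpy as np                 # numerical arrays
import matplotlib.pyplot as plt    # plotting
from matplotlib import style       # plot styling

# Apply the ggplot look to every figure created from here on.
style.use("ggplot")
```

Other built-in style names are listed in `style.available`, so swapping the look is a one-word change.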

  • 02:12

    SPEAKER [continued]: But for now, we're going to use that as a styling here. Let's just go ahead and save it. And then we need things from our scikit-learn-- so from sklearn-- because that's where the technique is that we

  • 02:38

    SPEAKER [continued]: want to use, KMeans. So this comes from scikit-learn. So at this point, you want to make sure that you have this package installed with your Python. Of course, you also need to have NumPy, Matplotlib, and those things installed. So if you have scikit-learn, which

  • 03:00

    SPEAKER [continued]: is a very popular package for doing machine learning, from sklearn.cluster we are going to use-- there are a lot of clustering techniques-- we are going to use KMeans. Now, let's create our data. So instead of importing a data set, we're just going to make up something, because it will make it easier to understand.

  • 03:22

    SPEAKER [continued]: So we're going to define a number of points on a two-dimensional plane. So this is going to be a two-dimensional array. And the way we create it is, we'll just make up some numbers. And I'm just going to the next line, just because it makes it easier to read.

  • 03:43

    SPEAKER [continued]: But you can keep typing on the same line. So what I'm typing here are the x- and y-coordinates for different points on a Cartesian plane. It could be anything; you could even put negative values if you like. I'm just going with positive values to simplify some of the plotting.

  • 04:04

    SPEAKER [continued]: So I just created six points on a two-dimensional plane. So this one comma two means x equals 1 and y equals 2. So it will be plotted somewhere on the plane. And so now I have the six points, and they could have come from some data. So these are just data points.
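
The six-point array can be sketched as follows. Only the first three points, (1, 2), (5, 8) and (1.5, 1.8), are read out in the video; the remaining three rows are illustrative placeholders, chosen so that the data forms the same kind of groups the video describes.

```python
import numpy as np

# Six (x, y) points, one per row.
# Rows 0-2 are the values read out in the video;
# rows 3-5 are assumed values for illustration only.
X = np.array([[1, 2],
              [5, 8],
              [1.5, 1.8],
              [8, 8],
              [1, 0.6],
              [9, 11]])

print(X.shape)  # (6, 2): six points, two coordinates each
```

Splitting the list across lines is purely for readability; the array is the same if it is typed on one line.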

  • 04:26

    SPEAKER [continued]: Now, I don't know what their labels are, I don't know what the organization is. But I'm going to use clustering to organize these six points. Now, I know it doesn't seem like a lot, but we'll see what it does. But we want to take the six points and just organize them in, let's say, two groups or clusters.

  • 04:49

    SPEAKER [continued]: So we're going to use the KMeans function, and it has only this one argument that we care about, which is the number of clusters. And let's say we want two clusters. So this prepares our model, and now we need to fit that model to the data.

  • 05:13

    SPEAKER [continued]: And data is our capital X. And that's it. So it's very simple. It's just this. Now, what we want to do, we want to figure out how these clusters are organized-- where they are. So the clustering is actually done.
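
A minimal sketch of the two steps just described: preparing the model and fitting it to `X`. The last three data points are assumed values, and `n_init=10` and `random_state=0` are additions for repeatable results that the video does not mention.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points; the last three rows are assumed for illustration.
X = np.array([[1, 2], [5, 8], [1.5, 1.8],
              [8, 8], [1, 0.6], [9, 11]])

# Prepare the model: the one argument we care about is the cluster count.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# Fit the model to the data; the clustering happens here.
kmeans.fit(X)
```

The class is `KMeans` (capital K, capital M); `kmeans` is just the variable holding the fitted model and could be named anything.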

  • 05:35

    SPEAKER [continued]: But this is not enough; we actually want to perhaps even visualize it. We want to see where they are. So what happens when you create clusters, there is the center of the cluster-- so think about a cluster like a circle. So to define a circle, the minimum you need is the center of the circle.

  • 05:58

    SPEAKER [continued]: Or here it's called the centroid. So the centroid is essentially the cluster center. So let's extract the centroid information. So it's already there-- that information is already there in this model that we created. But we just want to extract that.

  • 06:18

    SPEAKER [continued]: So kmeans is where our model is stored, so kmeans dot-- and it's a special variable name, cluster underscore centers underscore. And there may be some labels--

  • 06:39

    SPEAKER [continued]: we can get those, as well, using the special variable, labels underscore. So let's just go ahead and print these things. And print-- see what we get, with just this much. Let's go ahead and run it.
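
The two attributes named above can be read off the fitted model like this. This is a sketch under the same assumptions as before: the last three data points and the `n_init` and `random_state` settings are not from the video.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points; the last three rows are assumed for illustration.
X = np.array([[1, 2], [5, 8], [1.5, 1.8],
              [8, 8], [1, 0.6], [9, 11]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

# Attributes ending in an underscore are filled in by fit().
centroids = kmeans.cluster_centers_   # one (x, y) row per cluster
labels = kmeans.labels_               # one cluster number per data point

print(centroids)
print(labels)
```

Which cluster ends up numbered 0 and which 1 is arbitrary, so the printed order may differ from run to run unless the random seed is fixed.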

  • 07:01

    SPEAKER [continued]: So as you can see, we created a six by two, so six points. Each point is essentially a two-dimensional thing, x and y. So that's why those two. And we made a little typo here.

  • 07:23

    SPEAKER [continued]: This needs to be KMeans-- the actual function is KMeans, capital K, capital M. But our model is stored in lowercase kmeans. So you can call it something else, if it makes it easier. So this is what we get-- the centroids and the labels. So these are the centroids.

  • 07:46

    SPEAKER [continued]: And what you see, these are actually two points on the Cartesian plane. This is one point, its x- and y-coordinates. And this is another point with its x- and y-coordinates. So these two points represent the two clusters that we found.

  • 08:06

    SPEAKER [continued]: And those two clusters are currently-- because we don't know the class labels, remember? This is unsupervised clustering. So we just name them something very generic, like zero and one. OK, so this is zero and this is one-- or it could be the other way, but one of them

  • 08:27

    SPEAKER [continued]: is zero and one of them is one. And what we see here are the associations of the six points that we have-- these are the six points-- to one of those clusters. So the first point, one comma two, belongs to cluster zero. The second point, five comma eight, belongs to cluster one.

  • 08:49

    SPEAKER [continued]: The third point, 1.5 comma 1.8, belongs to zero, and so on. So these are the labels. And these are artificial labels-- we don't know the meaning of them, but this is what the clustering has prepared. Now let's go ahead and do some visualization of this, and maybe that will make more sense once we have that.
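
To make the link between labels and centroids concrete, the sketch below groups the points by label and checks that each centroid is the mean of its members, which is where the "means" in k-means comes from. The last three data points and the `n_init` and `random_state` settings are assumptions, as in the earlier sketches.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points; the last three rows are assumed for illustration.
X = np.array([[1, 2], [5, 8], [1.5, 1.8],
              [8, 8], [1, 0.6], [9, 11]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

for k in range(2):
    members = X[labels == k]          # all points assigned to cluster k
    print("cluster", k, "members:", members.tolist())
    print("  mean of members:", members.mean(axis=0))
    print("  centroid:       ", centroids[k])
```

The label numbers carry no meaning of their own; they only say which points were grouped together.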

  • 09:12

    SPEAKER [continued]: So we're going to pick some colors. So I'm just going to create a color palette, to be able to use some colors-- g for green, dot-- and what this means is we're going to plot a dot. It could be in a different shape, but we're just going to go with that.

  • 09:33

    SPEAKER [continued]: r for red, cyan, yellow, and that should be enough colors. So we're just preparing a color palette. And what we're going to do is we're going to go through all of our points. We have six points, but we can put this in a for loop over X.

  • 10:02

    SPEAKER [continued]: So what this does, it's a loop where i goes through the entire length of X. And this is X, and its entire length is six. So we're going through each point. And we'll coordinate-- so we're going

  • 10:30

    SPEAKER [continued]: to print the coordinates, X i. So the i-th data point, and we'll print its label-- which is in labels, the corresponding labels array. So i goes from zero to the length,

  • 10:54

    SPEAKER [continued]: and so it goes through each of these points-- the i-th point, and the i-th label. And we'll also plot that. So a two-dimensional point, how do we plot it? Well, we take its x dimension--

  • 11:19

    SPEAKER [continued]: so this is the i-th point. And zero indicates the x dimension of it. And the i-th point and one indicates the y dimension of it. And we're going to use our color palette, and we'll pick a color that corresponds to the label.

  • 11:46

    SPEAKER [continued]: And I'll just define how big this is going to be; we'll say 10. So we're going to close this here. All right, so this will plot the points.
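
The loop described above can be sketched as follows. The palette strings combine a color letter with `.` for a dot marker; the last three data points and the `n_init` and `random_state` settings are assumptions, not values from the video.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Six 2-D points; the last three rows are assumed for illustration.
X = np.array([[1, 2], [5, 8], [1.5, 1.8],
              [8, 8], [1, 0.6], [9, 11]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Color palette: green, red, cyan, yellow, each drawn as a dot.
colors = ["g.", "r.", "c.", "y."]

for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    # X[i][0] is the x value, X[i][1] the y value of the i-th point;
    # the label picks the color out of the palette.
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
```

Calling `plt.show()` afterwards displays the figure in an interactive session. With four palette entries, the same loop works unchanged for up to four clusters.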

  • 12:09

    SPEAKER [continued]: And now we want to plot the centroids. So let's just go ahead and run this much and see what happens. So what you see here is all these different points that we have, the six points: one, two, three, four, five, and six.

  • 12:29

    SPEAKER [continued]: You see six of them. And if you look carefully, they have different colors. So labels, i. So for the i-th point, the possible labels here are zero and one. So zero is green, and one is red. So points with label equal to zero are in green,

  • 12:54

    SPEAKER [continued]: and points with label equal to one are in red. So you can see that. But let's actually finish this, and also plot the centroids. So centroids are two specific points-- there are two centroids, because we asked for two clusters.

  • 13:17

    SPEAKER [continued]: And so we are saying-- so there are two dimensions here. One is the centroid number-- so there are two centroids, centroid zero and centroid one. But whichever it is, the second index, zero, is its x-coordinate.

  • 13:38

    SPEAKER [continued]: And then centroids-- whatever the point is, zero or one. And then one means its y-coordinate. And we are going to mark it using x. Now let's define some other parameters to indicate how big it should be.

  • 14:01

    SPEAKER [continued]: You can play with this, or you can ignore this. You'll see what difference it makes. So let's see that line again. plt.scatter creates a scatterplot.

  • 14:21

    SPEAKER [continued]: And we are asking to plot these things. So when you put this colon, that means you don't care; essentially, take everything. So centroids-- let's look at centroids. So here are the centroids. You see this? So there is a first centroid, and there's a second centroid.

  • 14:44

    SPEAKER [continued]: And you're saying that you don't care if it's first or second, do both of them. And the second part here, zero, indicates the x dimension. And the second part here, one, indicates the y dimension. So essentially, this is just asking to plot two points in a scatterplot.
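
The scatter call walked through above can be sketched like this. The slice `centroids[:, 0]` means "every centroid, x column". The size settings (`s`, `linewidths`) and `zorder` are illustrative choices, since the video does not state its exact values; the last three data points and the `n_init` and `random_state` settings are also assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Six 2-D points; the last three rows are assumed for illustration.
X = np.array([[1, 2], [5, 8], [1.5, 1.8],
              [8, 8], [1, 0.6], [9, 11]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_

# ":" selects all centroids; 0 and 1 select the x and y columns.
marks = plt.scatter(centroids[:, 0], centroids[:, 1],
                    marker="x",      # draw each centroid as an x
                    s=150,           # marker area (assumed value)
                    linewidths=5,    # stroke width (assumed value)
                    zorder=10)       # draw on top of the data points
```

Changing `s` or `linewidths` only changes how prominent the marks look, not where they are placed.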

  • 15:07

    SPEAKER [continued]: Mark it with x, and these are just some size parameters. So let's go ahead and run it. Let's see what we find. OK, so now we have the centroids plotted. And you can see that they represent clusters-- so this is one cluster, and this is another cluster. And you can see there are three points that

  • 15:29

    SPEAKER [continued]: belong to this cluster, and then these three points belong to this cluster. So hopefully now it makes more sense. Let's go ahead and do one thing-- let's see what happens if you ask for three clusters. Just run this. And so now you can see, these three points

  • 15:50

    SPEAKER [continued]: belong to one cluster. These two points belong to this cluster. And this one point belongs to this third cluster. And you can see that in this output, too, here. So now we have three centroids, and your six points have three possible classes-- zero, one, and two. And you can see which point has which class label--

  • 16:13

    SPEAKER [continued]: zero, one, or two. And this is a visualization of that. So that's clustering, which is unsupervised learning. And what we did here, we used k-means clustering, where k is the number of clusters that we desire. And it's very easy to use.
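
The three-cluster run described above changes only one argument. In this sketch, the last three data points and the `n_init` and `random_state` settings are assumptions rather than values from the video.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points; the last three rows are assumed for illustration.
X = np.array([[1, 2], [5, 8], [1.5, 1.8],
              [8, 8], [1, 0.6], [9, 11]])

# Same code as before, now asking for three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)            # three (x, y) rows, one per cluster
print(labels)               # each entry is now 0, 1, or 2
print(np.bincount(labels))  # how many points landed in each cluster
```

With this data the points split into groups of three, two, and one, matching what the plot in the video shows.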

  • 16:36

    SPEAKER [continued]: You can see, now, the only thing that you will find is, this is something that right now we created artificially, just because we can control things and see easily how they associate. But you may get these values differently. So this X, essentially, is your data frame. And so that could have something very different.

  • 16:58

    SPEAKER [continued]: Also, if you are working with more than two variables, that means you're going to be in a higher dimension, not two dimensions. So here, the nice thing about this data is we're able to control it in two dimensions. And so that makes it easier to visualize. But if you have multiple-- several features, or several columns or variables

  • 17:21

    SPEAKER [continued]: for your data set, you won't have an easy way to do these kinds of visualizations. Which is fine, because ultimately, what really matters is this. Because this is what you're trying to figure out-- where does each of these data points belong?

  • 17:43

    SPEAKER [continued]: This is what you're really after. You're trying to see where each of the data points goes, how they're distributed. And so that's your objective. And that's very easy to do, and this is it. This is what you want to do-- this is essentially your model building, and you're just printing it out. So you can forget about this visualization for higher

  • 18:05

    SPEAKER [continued]: dimension data, but this code remains pretty much the same. And this is where you get your data frame. So what will change for you is how this data frame is obtained, and then you build your clustering model. And you'll find out where points belong, and that's what comes out here.
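
As a sketch of the higher-dimensional case, the example below clusters made-up data with four feature columns. The synthetic data, the group sizes, and the `n_init` and `random_state` settings are all assumptions for illustration; only the `KMeans` calls mirror the video.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data set: 20 rows, 4 feature columns, generated as two
# well-separated groups so the expected grouping is easy to check.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(10, 4))
group_b = rng.normal(loc=10.0, scale=1.0, size=(10, 4))
X = np.vstack([group_a, group_b])

# The model-building code is the same as in the 2-D example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_.shape)  # (2, 4): one 4-D centroid per cluster
print(kmeans.labels_)                 # cluster number for each of the 20 rows
```

Nothing here can be drawn on a flat plot directly, but the labels still say which rows belong together, which is the result that matters.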

  • 18:30

    SPEAKER [continued]: So that's clustering with k-means.

Video Info

Series Name: Introduction to Data Science with Python

Episode: 5

Publisher: Chirag Shah

Publication Year: 2018

Video Type: Tutorial

Methods: Python, Machine learning, Clustering, Unsupervised learning

Keywords: artificial intelligence; cluster analysis; cluster detection; cluster grouping; computer programming; data analysis; Scatterplot; Software ...

Segment Info

Segment Num.: 1


Dr. Chirag Shah, PhD, explains how clustering, an unstructured/unsupervised aspect of machine learning, can be analyzed in Python, using k-means to work with clusters and matplotlib.pyplot to create scatter plots.

Machine Learning with Python: Clustering
