  • 00:01

    SPEAKER: OK, so here we'll see clustering, which is the unsupervised learning branch of machine learning. This is where we don't really know the labels on the data points. We don't know how many classes there are, and we don't know what those classes should be.

  • 00:23

    SPEAKER [continued]: But often, we want to start somewhere, so we at least identify some initial number of classes. And that's one of the popular techniques, k-means, where k is the number of desired clusters. So the way this technique works is we look at the data and come up with some reasonable idea

  • 00:45

    SPEAKER [continued]: about how many clusters or groups there should be. And we try it with that. It may not give us as good a representation, so we can try a different number. But here, we are going in with very little to no intuition or knowledge about how the data should be grouped or where these points should go.

  • 01:07

    SPEAKER [continued]: So let's do this using R. For this exercise, we will use one of the inbuilt data sets that's available through R. And for that, we'll need to load a library called datasets.

  • 01:28

    SPEAKER [continued]: And one of the data sets it has is called iris. Let's see what it looks like. And so this is just part of the iris data. head simply prints the first few lines of the data set. And this is what it looks like. So it has length, width, and different parameters.
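The commands being demonstrated here are not shown in the transcript, but they likely look something like the following sketch:

```r
# Load the built-in datasets package (ships with R) and peek at iris
library(datasets)

# head() prints the first six rows of a data frame
head(iris)
```

The iris data has 150 rows, with sepal and petal lengths and widths plus a Species label.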

  • 01:54

    SPEAKER [continued]: And the label, species, right? So in this case, we know the label. But let's pretend that we don't know that label and all we have are these attributes for flowers. And we're going to group these flowers using these attributes -- represent them some way --

  • 02:15

    SPEAKER [continued]: and see if there is some kind of grouping that we can identify. And one reason we are using this data is because these labels, even though we're not doing classification, could help us verify our technique -- to see if what we're coming up with actually has some meaning or not.

  • 02:37

    SPEAKER [continued]: So we are going to load the ggplot2 library because we're going to do some plotting. In fact, let's just go ahead and plot this data. We're using ggplot. So, iris. Now, we didn't load this data into a data frame like we normally do because this is an inbuilt data set.

  • 02:59

    SPEAKER [continued]: So we can just refer to it directly like this. And we're going to look at Petal.Length, Petal.Width,

  • 03:20

    SPEAKER [continued]: and we'll color it using Species. And we'll do a scatterplot. This is just to see what we're dealing with here. And let's look at what it has.
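A sketch of the plotting call being described -- the exact aesthetics are an assumption, but the variables (Petal.Length, Petal.Width, Species) are all named in the narration:

```r
library(ggplot2)

# Scatterplot of petal length vs. petal width,
# colored by the known species label
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()
```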

  • 03:44

    SPEAKER [continued]: So this is what we're dealing with. And these are all data points. So we are just plotting in two dimensions, length versus width. So each data point has a corresponding length and width. And normally in clustering, you would not have these colors -- these labels.

  • 04:06

    SPEAKER [continued]: All you'd have is just data points. And then you would be looking to see if there is some kind of organization that we could find. But here we have the advantage of knowing the labels all along, so we're just doing this -- I mean, in a way, we are cheating here. But we're just doing this to see if we find any patterns. Hopefully, you can see that this is a whole different kind

  • 04:29

    SPEAKER [continued]: of a group here. And then maybe this whole thing is one group. Or maybe there are two groups here. But let's do some analysis on this. So this was just to see what we're dealing with. So let's set up some random seed. This is just to initialize the randomization process,

  • 04:54

    SPEAKER [continued]: because the way k-means works is that it needs to start with some number of clusters. And then it tries to identify where each point belongs. So, kmeans. And we are looking at only the petal length and width within the iris data.

  • 05:14

    SPEAKER [continued]: So iris will take all the rows -- all the data points -- but only the third and fourth columns. We will take -- let's see, how many -- we'll go with three clusters.

  • 05:35

    SPEAKER [continued]: And this is the seed -- I'll start with 20. So this is the kmeans function or algorithm. And let's capture it in something. And we just built the kmeans model. Using kmeans, we build a clustering model.
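Putting the narrated pieces together, the k-means call probably looks something like this. The variable name irisCluster and the use of nstart are assumptions based on the narration, not shown verbatim in the transcript:

```r
set.seed(20)  # fix the randomization so results are reproducible

# k-means on petal length and width (columns 3 and 4 of iris),
# asking for 3 clusters; nstart = 20 tries 20 random initializations
# and keeps the best result
irisCluster <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

# Printing the model shows the cluster sizes, the cluster centers
# (mean length and width per cluster), and each point's assignment
print(irisCluster)
```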

  • 05:58

    SPEAKER [continued]: And let's just print this out. So this is what it looks like. So we did k-means clustering with three clusters of these sizes. And this is something that it automatically detected. And these are the centroids or cluster centers

  • 06:18

    SPEAKER [continued]: or means, where this cluster has 1.46 as its average length and 0.24 as its average width. So this represents a two-dimensional point. And you can see there are three such points. And here we have all the data points that we have

  • 06:41

    SPEAKER [continued]: and their corresponding class labels. So several of the first data points are all in class 1 -- which is this -- or cluster 1. Most of these are in class 2, which is this. But you can see somewhere there is also something in class 3.

  • 07:04

    SPEAKER [continued]: And this is class 3. Most of these -- almost all of these -- are in class 3, which is this. But then somewhere, you also see something in class 2. So it's done all this analysis for us. And this is our clustering. You can also see how these clustering

  • 07:29

    SPEAKER [continued]: labels relate to the actual labels, since we already know those. So let's see the cluster. So this is what we kind of predicted, right? This is what we came up with. The iris cluster is of our own making.

  • 07:49

    SPEAKER [continued]: And iris species -- I mean, that's given. That's the actual truth. So let's see how well we do. So 1, 2, and 3 are our predictions. Setosa, versicolor, and virginica

  • 08:10

    SPEAKER [continued]: are the labels that are the actual truth. So whatever we predicted to be in class 1 all happens to be in the setosa group -- none from the other two groups.

  • 08:34

    SPEAKER [continued]: Of the things that we predicted to be in class 2, 48 of them happen to be in versicolor, and four happen to be in virginica. Whereas of the things that we predicted to be in class 3, two of them are in versicolor, and 46 are in virginica. So this is how we did. So you can see that we're matching quite well
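The comparison being read off is a cross-tabulation of our cluster assignments against the true species labels. A sketch, again assuming the model is stored in a variable called irisCluster:

```r
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

# Rows: our cluster numbers; columns: the true species labels.
# A near-diagonal table means the clusters track the species well.
table(irisCluster$cluster, iris$Species)
```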

  • 08:57

    SPEAKER [continued]: with what the real truth is. Now, let's try plotting this. Let me just plot this with ggplot -- iris.

  • 09:18

    SPEAKER [continued]: Again, I'll take Petal.Length, Petal.Width, and for the color, we'll use the cluster this time.

  • 09:43

    SPEAKER [continued]: And this is the scatterplot. So this is what we came up with. This is based on our own clustering. So we put all the data points like we did before.
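The second scatterplot, colored by our own cluster numbers instead of the true labels, might be produced like this. Wrapping the cluster vector in factor() so ggplot treats it as a discrete variable is an assumption:

```r
library(ggplot2)

set.seed(20)
irisCluster <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

# Same scatterplot as before, but colored by our cluster assignment
ggplot(iris, aes(x = Petal.Length, y = Petal.Width,
                 color = factor(irisCluster$cluster))) +
  geom_point()
```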

  • 10:04

    SPEAKER [continued]: But we label them using our cluster color -- our cluster number. So these are the things that are colored with the number 1, most of these are colored with number 2, and most of these are colored with number 3.

  • 10:26

    SPEAKER [continued]: And so now we can see the similarities and differences. So you can see in reality, everything that was this kind of a flower is colored with this color -- the darkest color. So this is where we are doing perfectly.

  • 10:49

    SPEAKER [continued]: Now look at this. Here, most of these are green. But there are a couple of green dots that happen to be here. You see that? And when we go here -- you see those two green dots -- instead of putting them in this class, we ended up putting them here. So we're wrong for those two. But if you didn't know the truth, which is this,

  • 11:12

    SPEAKER [continued]: and if you only knew this, you couldn't be blamed for putting these two dots in this class rather than here. Or at least they are on the borderline. And it's not easy to simply place them in this class instead of this one. So it's not too bad. Yes, we're not perfect. But the things that we messed up are really hard

  • 11:36

    SPEAKER [continued]: data points.Similarly, you can see that this pointthat belongs to this class here, the same point we determinedthat it belongs to this one.So we were wrong.But again, you can think about this.This is really on the borderline where

  • 11:57

    SPEAKER [continued]: if you had just looked at this, it would be hard to see which one it belongs to. So let's go back here and try something else. So we asked for three clusters. Let's see what happens if we ask for two. And now we plot it.

  • 12:22

    SPEAKER [continued]: So you see here now pretty much all of these belong to one cluster. And pretty much all of these still remain in the same cluster. There's this one point hanging out here, which we determined goes here. But it could also go here, right? And so this is when we ask for two clusters.

  • 12:43

    SPEAKER [continued]: And you can play with this. You can ask for, say, four clusters and see what makes sense. And so if we ask for four groupings, this remains the clearest cluster. And then there is some division happening here -- this middle part and then this top.
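Exploring other values of k just means changing the centers argument. A sketch, with the variable names being assumptions:

```r
set.seed(20)

# Try two and four clusters on the same petal measurements
twoClusters  <- kmeans(iris[, 3:4], centers = 2, nstart = 20)
fourClusters <- kmeans(iris[, 3:4], centers = 4, nstart = 20)

# How many points fall in each cluster
table(twoClusters$cluster)
table(fourClusters$cluster)
```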

  • 13:04

    SPEAKER [continued]: Now, remember, in this case, we know the truth. We know the actual labels. In reality, we don't do clustering when that's the case. In reality, when we're doing clustering, we don't know. All we know is that there are these points. And we're trying to figure out what makes sense to group them. Maybe what makes sense is these points are in one group.

  • 13:26

    SPEAKER [continued]: And all of these are in another group. Or maybe these things have some kind of a division. Maybe this is one group. And this is another group. And this is another group, right? So there is a lot of exploration that could happen because we don't know how many groups there should be and what they should be called. So clustering is a very useful technique

  • 13:48

    SPEAKER [continued]: for exploring the data -- trying to see some kind of organization within that data. And then you have to bring in a lot of your own context -- your own knowledge about the problem you're working on -- to see what makes sense. Is this two groups, four groups, five? And so the technique is one thing, but then there's

  • 14:10

    SPEAKER [continued]: all this other knowledge that you need to bring from the domain or from your own expertise. And that's clustering using R.

Video Info

Series Name: Introduction to Data Science with R

Episode: 4

Publisher: Chirag Shah

Publication Year: 2018

Video Type: Tutorial

Methods: R statistical package, Clustering, Machine learning, Unsupervised learning

Keywords: cluster analysis; cluster detection; computer programming; data analysis; prediction; Scatterplot; Statistical models ... Show More

Segment Info

Segment Num.: 1


Abstract

Dr. Chirag Shah, PhD, explains how to create a simple clustering model for unsupervised machine learning in R. With background knowledge of the usually unknown labels, it is possible to test the model and its modifications.


Machine Learning with R: Clustering

