Find Clusters of Data with K-means Clustering in Python and Scikit-learn

Instructor: Hannah Davis

Published 4 years ago
Updated 3 years ago

We’ll return to the iris dataset to see how to use k-means clustering, an unsupervised learning algorithm, to create categories for data that doesn't have labels. We'll also visualize these clusters using matplotlib.

More on K-means can be found at Scikit-learn.

Instructor: [00:00] From scikit-learn, we'll import datasets. From sklearn.cluster, we'll import KMeans. We'll import matplotlib.pyplot as plt. We'll be working with the iris dataset, which we load with datasets.load_iris. Let's print out our feature names.
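The transcript does not reproduce the on-screen code, so the following is a minimal sketch of the imports and dataset loading described here. The variable name `iris` is an assumption.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load the iris dataset that ships with scikit-learn.
iris = datasets.load_iris()

# Names of the measured features, in column order.
print(iris.feature_names)
```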

[00:28] We can see we have four features here. K-means is an unsupervised algorithm, which means that it's used on data that doesn't have labels. Even though this dataset has target labels, we're going to be ignoring them for the purpose of teaching k-means.

[00:43] We'll assign our X to be the iris feature data. For simplicity's sake, we're just going to take two features to work with. We'll take the middle two features. From here, we can say model equals KMeans.
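One way to take the middle two features is to slice columns 1 and 2 of the data array. The exact slice is an assumption, since the transcript does not show it.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Assumed slice: columns 1 and 2 are the "middle two" of the four features.
X = iris.data[:, 1:3]

print(iris.feature_names[1:3])  # names of the two selected features
print(X.shape)                  # (number of samples, number of features)
```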

[01:01] K-means is a clustering algorithm, which means that we give it a number of clusters, and it figures out how to divide the data into that many clusters. It does this by creating centroids which are set to the mean of the cluster that it's defining. Let's see how that works.
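To make the centroid idea concrete, here is a toy NumPy sketch of the usual assign-then-average loop. It is an illustration only, not scikit-learn's implementation, and the function name `toy_kmeans`, its parameters, and the sample points are all hypothetical.

```python
import numpy as np

def toy_kmeans(X, k, n_iter=20, seed=0):
    """Toy k-means: alternate between assigning points and averaging clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(points, centers):
        # Index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Move each centroid to the mean of the points assigned to it
        # (an empty cluster keeps its previous centroid).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    # Final assignment so the labels match the returned centroids.
    return assign(X, centroids), centroids

# Two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = toy_kmeans(points, k=2)
print(labels)
print(centroids)
```

Each centroid ends up at the mean of its group, which is the sense in which the centroids "are set to the mean of the cluster."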

[01:18] If we say n_clusters, our number of clusters, equals 5, and also pass in a random state, which will be 0, and then we say model.fit and pass it our X data, then we can print model.labels_. (This is supposed to be n_clusters.) We can see the k-means model has taken our X data and assigned a label from 0 to 4 to each data point.
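A sketch of this step might look like the following. The feature slice is an assumption, and `n_init=10` is an addition not mentioned in the lesson: it pins the number of restarts so the code behaves the same way across scikit-learn versions whose default differs.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data[:, 1:3]  # assumed slice for the middle two features

# n_init=10 is an assumption added here; the lesson only sets
# n_clusters and random_state.
model = KMeans(n_clusters=5, random_state=0, n_init=10)
model.fit(X)

# One integer cluster label per data point.
print(model.labels_)
```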

[01:50] Even though k-means is not a classification tool, this is the same as saying model.predict(X). We can also print model.cluster_centers_. These are the centroids of each delineated cluster. Let's visualize these.
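The relationship between the stored labels and `model.predict(X)` can be checked directly. This is a sketch under the same assumptions as before (the feature slice and `n_init=10` are not from the lesson).

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data[:, 1:3]  # assumed slice for the middle two features
model = KMeans(n_clusters=5, random_state=0, n_init=10).fit(X)

# predict assigns each point to its nearest centroid; on the training
# data this should reproduce the labels stored during fit.
same = np.array_equal(model.predict(X), model.labels_)
print(same)

# One row per cluster, one column per feature.
print(model.cluster_centers_)
```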

[02:18] We'll say plt.scatter. First, let's plot our X data, so the first variable and the second variable. We'll say the color is blue. We'll do an X label and a y label. Then we'll say This has an underscore at the end of it. This is how two features of the data look plotted. If we want to add our centroids, we can say plt.scatter(centroids), first variable, and second variable.

[03:11] We can say marker equals and then whatever shape we want it to look like, size equals 170, zorder equals 10 so that it will come to the top, and then color equals magenta. Let's also change our data color to be the model.labels. This will be the color of the cluster that k-means has put the data point into.
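Putting the plotting steps together might look like this sketch. The star marker, the axis labels taken from `feature_names`, the output filename, and the feature slice are assumptions; the size, zorder, and magenta color follow the lesson.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data[:, 1:3]  # assumed slice for the middle two features
model = KMeans(n_clusters=5, random_state=0, n_init=10).fit(X)
centroids = model.cluster_centers_

# Data points, colored by the cluster label k-means assigned to each.
plt.scatter(X[:, 0], X[:, 1], c=model.labels_)

# Centroids drawn on top of the data (marker shape is an assumption).
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker="*", s=170, zorder=10, c="m")

plt.xlabel(iris.feature_names[1])
plt.ylabel(iris.feature_names[2])

# Save to a file; in an interactive session plt.show() displays it instead.
plt.savefig("kmeans_clusters.png")
```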

[03:41] Here we can see the five groups that k-means has clustered this dataset into. If you thought this looked good, you could investigate further, look for patterns and similar variables, and try to find connections that you might not have found otherwise. Looking at this, though, it looks a little arbitrary.

[03:58] It's also worth playing around with the number of clusters. We can see that this looks a little more intuitively separated. K-means is a good tool for exploring your data and for creating classes and labels if your dataset doesn't have them.
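One simple way to play around with the number of clusters is to refit the model for several values and compare how many points land in each cluster. The values tried here and the use of `np.bincount` are assumptions for illustration; the lesson does not say which number it switches to.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data[:, 1:3]  # assumed slice for the middle two features

cluster_sizes = {}
for k in (2, 3, 4, 5):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    # Count how many data points were assigned to each cluster.
    cluster_sizes[k] = np.bincount(model.labels_, minlength=k)
    print(k, cluster_sizes[k])
```

Replotting with each fitted model's labels shows how the grouping changes as the number of clusters changes.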