Why use clustering?
- If you want to find what classes are present in the data, you can use clustering and then manually review the clusters
- If you don’t want to manually annotate your data, you can use the cluster assignments as a feature and feed it to your model
- Can help find anomalies
K-Means Clustering
K-means clustering is an unsupervised learning algorithm that assigns each data point to one of K clusters
- K is how many clusters you expect
- Tries to minimize the sum of squared distances between each point and the center of its assigned cluster
- The algorithm starts with randomly initialized centroids, then iterates to convergence
- At each step, each point is assigned to its closest centroid
- The mean of the points in each cluster becomes the new centroid
- Repeat until the assignments stop changing (a minimal sketch follows below)
Visualization of k-means clustering
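Below is a minimal from-scratch sketch of this loop in NumPy, assuming 2-D data; the function name `kmeans` and the synthetic blobs are illustrative, and in practice you would typically reach for `sklearn.cluster.KMeans` instead.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # The mean of each cluster becomes the new centroid
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Converged once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two synthetic 2-D blobs, just to exercise the loop
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)),
                    rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(points, k=2)
```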
Pros and Cons
Pros:
- Good when you want the model to work with new data
- With K-Means, you just assign a new data point to its closest centroid (see the sketch below)
- With hierarchical clustering, new data points require redoing the clustering from scratch
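A hedged sketch of that workflow with scikit-learn: after fitting, `predict` just assigns each new point to its nearest centroid without refitting. The data here is synthetic and illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(100, 2)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# New points are handled by nearest-centroid assignment, no re-clustering
new_points = np.array([[0.5, -0.2], [2.0, 2.0]])
print(km.predict(new_points))
```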
Cons:
- Assumes clusters are spherical and equally sized
- Hierarchical clustering doesn’t have this problem
- However, you can use modified versions of k-means (e.g., Gaussian mixture models) that handle clusters of different sizes and shapes
Choosing a K
- Try a bunch of different Ks
- For each K, calculate its Within-Cluster Sum of Squares (WCSS)
- WCSS = the sum of squared distances from each data point to its cluster’s centroid
- As K increases, WCSS will always decrease
- But the “elbow” of the WCSS-vs-K curve is where the optimal K will be (see the sketch below)
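A sketch of the elbow method using scikit-learn, whose `inertia_` attribute is exactly the WCSS. The three synthetic blobs are illustrative, so the elbow here lands at K = 3.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Three synthetic 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1, (50, 2)) for c in (0, 5, 10)])

# Fit k-means for a range of K and record WCSS (inertia) for each
ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("K")
plt.ylabel("WCSS (inertia)")
plt.show()  # look for the "elbow" where the curve flattens, here K = 3
```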
Hierarchical Clustering
Two types:
- Agglomerative hierarchical clustering: starts with each point as its own cluster, then repeatedly merges the closest clusters (a sketch follows below)
- Divisive hierarchical clustering: starts with the whole dataset as one cluster, then recursively splits it into smaller clusters
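A sketch of the agglomerative variant with SciPy, assuming small synthetic 2-D data: `linkage` performs the bottom-up merges, `fcluster` cuts the tree into a chosen number of clusters, and `dendrogram` draws the merge tree mentioned under Pros below.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Two synthetic 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="ward")                    # bottom-up (agglomerative) merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters

dendrogram(Z)  # the full merge hierarchy, one leaf per data point
plt.show()
```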
Pros:
- Deterministic: no random centroid initialization, so it cannot converge to a bad local optimum the way K-Means Clustering can
- Gives you a dendrogram, which gives you a more complete picture of the dataset
Cons:
- High time and space complexity (naive agglomerative clustering is roughly O(n³) time and O(n²) memory)
- As a result, it scales poorly to large datasets
- No global objective function is being optimized; merges or splits are made greedily
- Sensitive to noise and outliers since we use distance metrics
- Has difficulty handling large clusters