K-means and EM for Gaussian mixtures are two clustering algorithms commonly covered in machine learning courses. In this post, I’ll walk through my implementations of both on some sample data.
I won’t be going through much theory, as that can be easily found elsewhere. Instead I’ve focused on highlighting the following:
Pretty visualizations in ggplot, with the helper packages deldir, ellipse, and knitr for animations.
Structural similarities in the algorithms, by splitting up K-means into an E and M step.
Animations showing that EM reduces to the K-means algorithm in a particular limit.
This last point is covered in Section 9.3.2 of Bishop’s Pattern Recognition and Machine Learning, which I recommend taking a look at for additional theoretical intuition.
So let’s get started! Our data comes from Barbara Englehardt’s Spring 2013 Duke course, STA613/CBB540: Statistical methods in computational biology, homework 4.
First, we load our library functions and data points:
library(deldir)
library(ellipse)
library(pryr)
library(ggplot2)
center_title = theme(plot.title = element_text(hjust = 0.5))
no_legend = theme(legend.position="none")
points = read.table('../data/points_hw4.txt', col.names=c("x", "y"))
ggplot(points, aes(x = x, y = y)) + geom_point() +
labs(title = "Scatter plot of data") +
center_title
One of the aims of this post is to show how the common EM clustering algorithm reduces to K-means in a particular limit. To do this, we should first put both algorithms into a common form.
If you’ve worked through the algorithms before, you’ll see that both K-means and EM consist of a step where points are assigned to clusters, followed by a step where parameter updates are computed from those assignments. For K-means, the E step assigns each point to its nearest center and the M step recomputes each center as the mean of its assigned points; for EM, the E step computes each point’s responsibilities under the current mixture and the M step re-estimates the means, covariances, and weights from those responsibilities.
Our approach will be to implement these four helper functions, and then string them together using a common interface. This also cuts down on duplication. The high-level program takes an E function, an M function, and starting inputs to an E step, and alternates the two steps while keeping track of all intermediate results:
iterate_em = function(nsteps, K, points, e_step, m_step, m_params.init) {
m_list = list(m_params.init)
e_list = list()
i = 1
while (i <= nsteps) {
e_list[[i]] = e_step(K, points, m_list[[i]])
m_list[[i+1]] = m_step(K, points, e_list[[i]])
i = i + 1
}
return(list(m_list=m_list, e_list=e_list))
}
The rest of this section is pretty dry, and just consists of my R code for the two algorithms. In the E step for EM, I make use of the log-sum-exp trick, which turns out to be helpful for numerical precision; you can read more about that here.
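As a quick illustration of that trick (the log_sum_exp helper below is just for exposition and is not used in the pipeline that follows): for a vector \(x\), \(\log \sum_i \exp(x_i) = m + \log \sum_i \exp(x_i - m)\) with \(m = \max_i x_i\), so the exponentials never all underflow to zero.
# Illustration only: a stable log-sum-exp, factoring out the maximum
log_sum_exp = function(x) {
  m = max(x)
  m + log(sum(exp(x - m)))
}
log(sum(exp(c(-1000, -1001))))  # naive version underflows to -Inf
log_sum_exp(c(-1000, -1001))    # roughly -999.69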
# K-means as functions
# points: N x D matrix
# e_params: list with
# clusters: vector of assignments
# m_params: list with
# centers: matrix of cluster centers
kmeans.e = function(K, points, m_params) {
N = dim(points)[1]
D = dim(points)[2]
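# squared distance from each of the N points to each of the K centers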
distances2 = matrix(0, N, K)
for (k in 1:K) {
for (j in 1:D) {
distances2[,k] = distances2[,k] + (points[,j] - m_params$centers[k,j])^2
}
}
clusters = apply(distances2, 1, which.min)
e_params = list(clusters=clusters)
return(e_params)
}
kmeans.m = function(K, points, e_params) {
N = dim(points)[1]
D = dim(points)[2]
centers = matrix(0, K, D)
for (k in 1:K) {
centers[k,] = colMeans(points[e_params$clusters == k,])
}
m_params = list(centers=centers)
return(m_params)
}
# EM as functions
# points: N x D matrix
# m_params: list with
# mu: K x D, MoG centers
# sigma: list of length K of D x D matrices, MoG covariances
# weights: K, MoG weights
# e_params: list with
# resp: responsibilities, N x K
# ll: log-likelihood, for debugging
em.e = function(K, points, m_params) {
N = dim(points)[1]
D = dim(points)[2]
mu = m_params$mu
sigma = m_params$sigma
weights = m_params$weights
# update responsibilities
resp = matrix(rep(0, N*K), N, K)
for (k in 1:K) {
constant_k = log(weights[k]) - 0.5*log(det(sigma[[k]])) -
log(2*pi)*(D/2)
displacement = points - as.numeric(matrix(mu[k,], N, D, byrow = TRUE))
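# quadratic form (x_n - mu_k)^T Sigma_k^{-1} (x_n - mu_k), computed for all points at once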
log_probs = -1/2 * colSums(t(displacement) * (
solve(sigma[[k]]) %*% t(displacement)))
resp[,k] = log_probs + constant_k
}
# log-sum-exp trick
max_log_probs = apply(resp, 1, max)
resp = resp - matrix(max_log_probs, N, K)
resp = exp(resp)
ll = mean(log(rowSums(resp))) + mean(max_log_probs) # log likelihood
resp = resp / matrix(rowSums(resp), N, K)
e_params = list(resp=resp, ll=ll)
return(e_params)
}
em.m = function(K, points, e_params, fix_sigma=NULL, fix_weights=NULL) {
N = dim(points)[1]
D = dim(points)[2]
resp = e_params$resp
# update means
mu = matrix(0, K, D)
for (k in 1:K) {
mu[k,] = colSums(resp[,k]*points) / sum(resp[,k])
}
# update covariances
if (is.null(fix_sigma)) {
sigma = NULL
for (k in 1:K) {
sigma[[k]] = matrix(0, D, D)
displacement = points - as.numeric(matrix(mu[k,], N, D, byrow = TRUE))
for (j in 1:D) {
sigma[[k]][j,] = colSums(displacement[,j]*displacement*resp[,k]) / sum(resp[,k])
}
}
} else {
sigma = fix_sigma
}
# update component weights
if (is.null(fix_weights)) {
weights = colSums(resp) / sum(resp)
} else {
weights = fix_weights
}
m_params = list(mu=mu, sigma=sigma, weights=weights)
return(m_params)
}
Now, we can choose K = 3 and nsteps = 20 for an initial run. We randomly choose three points as our starting centers for both K-means and EM. For EM, we additionally initialize with identity covariances and equal weights over the mixture components.
# Run K means and EM
K = 3
nsteps = 20
N = dim(points)[1]
D = dim(points)[2]
set.seed(3)
centers = points[sample(1:N, K),]
row.names(centers) = NULL
m_params.init = list(centers=centers)
kmeans_results = iterate_em(nsteps, K, points, kmeans.e, kmeans.m, m_params.init)
mu = centers
sigma = NULL
for(k in 1:K) {
sigma[[k]] = diag(D) # covariances initialized to identity matrix
}
weights = rep(1, K) / K # weights initialized to uniform
m_params.init = list(mu=mu, sigma=sigma, weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, em.m, m_params.init)
Each E step produces data for every one of the N points, which is verbose to print. Instead, let’s print the results of the M step, which are more compact.
kmeans_results$m_list[1:3]
## [[1]]
## [[1]]$centers
## x y
## 1 4.236116 4.322595
## 2 3.115771 3.307241
## 3 4.474370 2.387970
##
##
## [[2]]
## [[2]]$centers
## [,1] [,2]
## [1,] 3.903263 4.585723
## [2,] 2.434461 3.479530
## [3,] 4.907858 2.282699
##
##
## [[3]]
## [[3]]$centers
## [,1] [,2]
## [1,] 3.837077 4.520047
## [2,] 2.366460 3.438372
## [3,] 4.907858 2.282699
em_results$m_list[1:3]
## [[1]]
## [[1]]$mu
## x y
## 1 4.236116 4.322595
## 2 3.115771 3.307241
## 3 4.474370 2.387970
##
## [[1]]$sigma
## [[1]]$sigma[[1]]
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
## [[1]]$sigma[[2]]
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
## [[1]]$sigma[[3]]
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
##
## [[1]]$weights
## [1] 0.3333333 0.3333333 0.3333333
##
##
## [[2]]
## [[2]]$mu
## [,1] [,2]
## [1,] 3.596083 4.280319
## [2,] 2.652853 3.548945
## [3,] 4.318030 2.668293
##
## [[2]]$sigma
## [[2]]$sigma[[1]]
## [,1] [,2]
## [1,] 0.66681006 0.05709793
## [2,] 0.05709793 0.56318406
##
## [[2]]$sigma[[2]]
## [,1] [,2]
## [1,] 0.6406161 0.1080031
## [2,] 0.1080031 0.5753433
##
## [[2]]$sigma[[3]]
## [,1] [,2]
## [1,] 1.3820866 -0.4615455
## [2,] -0.4615455 0.7275576
##
##
## [[2]]$weights
## [1] 0.3032386 0.5042118 0.1925496
##
##
## [[3]]
## [[3]]$mu
## [,1] [,2]
## [1,] 3.602189 4.367311
## [2,] 2.578687 3.553098
## [3,] 4.475524 2.552581
##
## [[3]]$sigma
## [[3]]$sigma[[1]]
## [,1] [,2]
## [1,] 0.5694985 0.1191194
## [2,] 0.1191194 0.4292038
##
## [[3]]$sigma[[2]]
## [,1] [,2]
## [1,] 0.5033149 0.1763752
## [2,] 0.1763752 0.5461334
##
## [[3]]$sigma[[3]]
## [,1] [,2]
## [1,] 1.2389029 -0.4628892
## [2,] -0.4628892 0.5710045
##
##
## [[3]]$weights
## [1] 0.3006975 0.5026309 0.1966716
Looks sensible. It’s also a good idea to check that the EM log-likelihood always increases.
lls = rep(0, nsteps)
for (i in 1:nsteps) {
lls[i] = em_results$e_list[[i]]$ll
}
ggplot(data=data.frame(x=1:nsteps, y=lls)) +
geom_line(aes(x=x, y=y)) + geom_point(aes(x=x, y=y)) +
labs(title="Log likelihood for EM", x="step", y="log likelihood") +
center_title
We’d like to visualize our results, and since my aim is to compare K-means with EM, I’ve chosen to display them side by side using ggplot’s facet_grid option. Points are colored to show the assigned cluster (K-means) or the most likely cluster (EM); an alternative visualization would use blended colors for EM. I used the deldir package to compute the K-means decision boundaries, which come from Voronoi diagrams, and the ellipse package to plot the shape of each Gaussian mixture component.
make_visualization = function(points, kmeans_data, em_data, nsteps, K) {
for (i in 1:nsteps) {
# colored points
df_points = rbind(
data.frame(x = points[,1], y = points[,2], type = "K-means",
cluster = kmeans_data$e_list[[i]]$clusters),
data.frame(x = points[,1], y = points[,2], type = "EM",
cluster = apply(em_data$e_list[[i]]$resp, 1, which.max)))
# K-means decision boundaries
centers = kmeans_data$m_list[[i]]$centers
df_voronoi = deldir(centers[,1], centers[,2])$dirsgs
df_voronoi$type = factor("K-means", levels=c("K-means", "EM"))
# ellipses
mu = em_data$m_list[[i]]$mu
sigma = em_data$m_list[[i]]$sigma
all_ellipses = NULL
for (k in 1:K) {
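# level = pchisq(1, df=D) selects the 1-standard-deviation (Mahalanobis distance 1) contour;
# note that D here comes from the global environment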
ellipse_data = ellipse(sigma[[k]], level=pchisq(1, df=D))
all_ellipses[[k]] = data.frame(
x=ellipse_data[,1] + mu[k,1], y=ellipse_data[,2] + mu[k,2],
cluster=k, type="EM")
}
df_ellipses = do.call(rbind, all_ellipses)
print(
ggplot() +
geom_point(data=df_points, aes(x=x, y=y, color=factor(cluster))) +
geom_point(data=data.frame(x=centers[,1], y=centers[,2], type="K-means"),
aes(x=x, y=y), shape=17, size=3) +
geom_segment(data=df_voronoi, linetype = 1, color= "#FFB958",
aes(x = x1, y = y1, xend = x2, yend = y2)) +
geom_path(data=df_ellipses, aes(x=x, y=y, color=factor(cluster))) +
facet_grid(. ~ type) +
ggtitle(paste0("Most likely cluster, K = ", K, ", step = ", i)) +
center_title + no_legend)
}
}
Since knitr / R Markdown supports animations, we can simply plot each frame in a for loop. In my case, I’m using ffmpeg with an .mp4 format, and I hacked knitr to add some flags for Apple support, which was necessary to get the animations (hopefully) viewable in Safari.
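For reference, a chunk header along the following lines drives the animation; I’m sketching this from knitr’s documented animation options rather than reproducing my exact (hacked) header, so treat the specific option values as an assumption:
```{r animation, fig.show='animate', interval=0.5, ffmpeg.format='mp4'}
make_visualization(points, kmeans_results, em_results, nsteps, K)
```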
With this visualization code, we can finally take a look at our results!
make_visualization(points, kmeans_results, em_results, nsteps, K)
To make the EM algorithm more like K-means, we start by limiting the M step to only change the \(\mathbf{\mu}\) parameters. The correspondence is pretty clear – the Gaussian means correspond to the K-means cluster centers.
If you look closely above, you’ll see that I added some extra arguments to the EM M step that allow for this change. Since the iterate_em function accepts a function for the M step, we can use the partial function from the pryr package to set these arguments appropriately.
fixed_sigma = partial(em.m, fix_sigma=sigma, fix_weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, fixed_sigma, m_params.init)
make_visualization(points, kmeans_results, em_results, nsteps, K)
In the above animation, you’ll see that the shapes of the Gaussians do not change; only their centers do. One can show that this leads to linear decision boundaries for the most likely cluster, just like K-means. However, the evolution of the algorithm still does not match K-means.
To finally make the two algorithms match up, we need the limit where the fixed covariance of each mixture component is \(\epsilon I\) and \(\epsilon \to 0\). Here, we take \(\epsilon = 0.01\), corresponding to a standard deviation of 0.1. The log-sum-exp trick I mentioned earlier was necessary to keep these computations from underflowing or overflowing.
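To see why this works, consider the responsibilities with equal weights and covariance \(\epsilon I\) (this is the argument from Bishop, Section 9.3.2):

\[
\gamma(z_{nk}) = \frac{\exp\left(-\lVert \mathbf{x}_n - \mathbf{\mu}_k \rVert^2 / 2\epsilon\right)}{\sum_j \exp\left(-\lVert \mathbf{x}_n - \mathbf{\mu}_j \rVert^2 / 2\epsilon\right)} \;\longrightarrow\; \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \mathbf{\mu}_j \rVert^2, \\ 0 & \text{otherwise} \end{cases} \quad \text{as } \epsilon \to 0,
\]

which is exactly the hard assignment made by the K-means E step; the responsibility-weighted means in the EM M step then reduce to plain cluster means.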
sigma001 = NULL
for(k in 1:K) {
sigma001[[k]] = diag(D)*0.01
}
m_params.init = list(mu=mu, sigma=sigma001, weights=weights)
fixed_sigma001 = partial(em.m, fix_sigma=sigma001, fix_weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, fixed_sigma001, m_params.init)
make_visualization(points, kmeans_results, em_results, nsteps, K)
The two sides match very well!
We can repeat this entire process for a different value of \(K\). With regular EM:
# Run K means and EM
K = 8
set.seed(3)
centers = points[sample(1:N, K),]
row.names(centers) = NULL
m_params.init = list(centers=centers)
kmeans_results = iterate_em(nsteps, K, points, kmeans.e, kmeans.m, m_params.init)
mu = centers
sigma = NULL
for(k in 1:K) {
sigma[[k]] = diag(D) # covariances initialized to identity matrix
}
weights = rep(1, K) / K # weights initialized to uniform
m_params.init = list(mu=mu, sigma=sigma, weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, em.m, m_params.init)
# Visualize
make_visualization(points, kmeans_results, em_results, nsteps, K)
With only \(\mathbf{\mu}\) updates, and \(I\) covariance:
fixed_sigma = partial(em.m, fix_sigma=sigma, fix_weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, fixed_sigma, m_params.init)
make_visualization(points, kmeans_results, em_results, nsteps, K)
With only \(\mathbf{\mu}\) updates, and \(0.01I\) covariance:
sigma001 = NULL
for(k in 1:K) {
sigma001[[k]] = diag(D)*0.01
}
m_params.init = list(mu=mu, sigma=sigma001, weights=weights)
fixed_sigma001 = partial(em.m, fix_sigma=sigma001, fix_weights=weights)
em_results = iterate_em(nsteps, K, points, em.e, fixed_sigma001, m_params.init)
make_visualization(points, kmeans_results, em_results, nsteps, K)
Besides the links included above, I found this link useful for plotting Voronoi diagrams using ggplot.
This blog post was generated from an R Markdown file using the knitr and blogdown packages. The original source can be downloaded from GitHub.