Ex Data, Scientia

Cluster Analysis with Auto-Encoders

While cluster analysis has traditionally been implemented with relatively simple algorithms like K-Means and Expectation-Maximization, the relatively recent emergence of Deep Neural Networks in applied data science has brought a new, more complex method to the field: the auto-encoder.

An auto-encoder is essentially an extension of a Deep Neural Network (DNN). DNNs traditionally project their input into spaces of ever-decreasing dimensionality, thereby transforming it into more and more abstract representations, with the output typically being a scalar or a (relatively low-dimensional) vector. The former case occurs when the DNN is applied to a regression task or to binary classification (which is itself only an extension of logistic regression), while the latter occurs in multi-class classification. In all cases, the DNN is fitted towards a pre-defined goal, i.e. the data are not treated in an exploratory way.

Auto-encoders, on the other hand, are purely explorative tools. The key to their application is to attach an essentially mirrored DNN to the output layer of a classic DNN. The output of the final layer of this mirrored DNN then has the same dimensionality and structure as the input to the "classic" part of the auto-encoder. An auto-encoder thus comprises two parts: the first part is a sequence of layers that projects the data into a very abstract representation, and thus resembles a classic DNN for classification. The second part projects this abstract representation into increasingly complex representations, and ultimately into the same format as that of the original input. The first part is referred to as the encoder, the second part as the decoder. The goal in fitting an auto-encoder is then to produce outputs that are as similar as possible to the input.
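As a rough illustration of the encoder/decoder structure described above, the following sketch passes data through an untrained, purely linear auto-encoder written in plain NumPy. The layer sizes, the random weights, the made-up data and the helper names (`encode`, `decode`) are assumptions chosen for illustration only; no fitting takes place here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 4   # dimensionality of the input (assumed)
n_code = 2       # dimensionality of the most abstract representation (assumed)

# randomly initialized weights; a real auto-encoder would learn these by fitting
W_enc = rng.normal(size=(n_features, n_code))
W_dec = rng.normal(size=(n_code, n_features))

def encode(x):
    """Encoder: project the input into the low-dimensional code."""
    return x @ W_enc

def decode(code):
    """Decoder: project the code back into the format of the input."""
    return code @ W_dec

x = rng.normal(size=(10, n_features))   # ten made-up data points
code = encode(x)
recon = decode(code)

# the quantity that fitting would minimize: mean-squared reconstruction error
mse = np.mean((x - recon) ** 2)

print(code.shape)   # (10, 2)
print(recon.shape)  # (10, 4)
```

With random weights the reconstruction error is of course large; fitting adjusts the encoder and decoder weights so that `recon` approaches `x`.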

By projecting the input into lower-dimensional spaces, the auto-encoder is forced to learn to extract features that are relevant for the reconstruction of the input. Ultimately, this means that the auto-encoder must learn to recognize which data are similar to each other, because in the most abstract state, i.e. the layer corresponding to the output layer of a classic DNN, there are hardly any features left to describe the input data. If that layer is e.g. a simple five-dimensional vector, then the meaning of this vector is essentially that of a list of probabilities of cluster associations. Here, each dimension represents one cluster, and the input data point is assigned to the cluster whose index matches the dimension with the highest value. The vector thus plays the same functional role as the list of Euclidean distances to the cluster means in the K-means algorithm, and as the list of probability-density values in the EM algorithm.

The most abstract data representation in the auto-encoder is thus nothing more than a quasi-one-hot encoded vector informing about the likelihood of particular cluster associations. This vector, or rather the list of vectors over all data points, is the product we want to obtain from applying the auto-encoder. Applying the argmax function to this list of vectors yields a list of cluster indices, such that we can assign each data point to one cluster. The reconstruction of the input is in most cases only a by-product, and in a purely explorative analysis it serves as a quality control: if the reconstructions are good, then we can likely trust the cluster pattern found by the auto-encoder. If they are not, then the cluster pattern will be rather random and should not be trusted (in this case, a frequent occurrence is that all data come to lie in one cluster).
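The argmax step, together with a quick check for the collapse into a single cluster mentioned above, can be sketched as follows. The values in `codes` are made up and merely stand in for the encoder output of six data points with three clusters.

```python
import numpy as np

# made-up encoder outputs for six data points and three clusters (assumed values)
codes = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.4],
    [0.0, 0.2, 0.9],
    [0.3, 0.1, 0.5],
])

# one cluster index per data point: the dimension with the highest value
cluster_idx = np.argmax(codes, axis=1)

# number of data points per cluster index
cluster_sizes = np.bincount(cluster_idx, minlength=codes.shape[1])

# True if every data point ended up in the same cluster
collapsed = cluster_sizes.max() == len(codes)

print(cluster_idx)    # [0 0 1 1 2 2]
print(cluster_sizes)  # [2 2 2]
print(collapsed)      # False
```

The `minlength` argument makes sure that cluster indices which were never assigned still show up with a count of zero.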

Unlike the K-means and EM algorithms, which take a set of randomly initialized cluster-mean estimates as an input, the auto-encoder requires no such input. Instead of moving a set of cluster means through the space occupied by the data, the auto-encoder moves a set of separating hyperplanes (a feature more common to the world of classification). This is done by the weight updates that compress the data into representations meaningful for successful reconstruction. Clusters are thus not defined by their means, but by their boundaries.
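To make the hyperplane view concrete: if the most abstract layer is linear, i.e. of the form `x @ W + b`, then the argmax decision between two clusters flips exactly where the two outputs are equal, and this set of points is a hyperplane. The sketch below checks this numerically for two clusters. The weights `W`, the biases `b` and the data are random stand-ins, not fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.normal(size=(100, 4))   # made-up data points
W = rng.normal(size=(4, 2))     # stand-in weights of a linear two-cluster layer
b = rng.normal(size=2)          # stand-in biases

# cluster assignment via argmax over the layer outputs
assigned = np.argmax(x @ W + b, axis=1)

# the same decision expressed as the side of a single hyperplane:
# its normal vector and offset are the differences between the two output units
w = W[:, 1] - W[:, 0]
c = b[1] - b[0]
side = (x @ w + c > 0).astype(int)

# both ways of assigning clusters should agree for every data point
print(np.array_equal(assigned, side))
```

Updating `W` and `b` during fitting therefore moves and rotates the boundary between the two clusters, rather than moving any cluster mean.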

Aside from addressing the ever-present question of how many clusters can be expected in the data (which is here expressed as the dimensionality of the most abstract data representation), the architect of an auto-encoder has to decide how many hidden layers of what dimensionality to incorporate. This is a challenge inherent to the design of any kind of DNN, and is not easily answered. The basic rule is that incorporating a higher number of hidden layers leads to better results if the data are of a more complex nature. After all, a neural network is said to be able to approximate any function if enough neurons are present and arranged in a meaningful way (a "neuron" is the term for one dimension of a data representation). Ultimately, a good architecture can only be approximated via experimentation, and given the high flexibility inherent to the design of DNNs, even the result of this procedure will likely only be a sufficient solution, not necessarily the optimal one.

The design of an auto-encoder is also influenced by the structure of the input data at hand. In the case of image data, which have a three-dimensional structure (height, width and colour channels), it is useful to use convolution and pooling layers in the encoding part, and deconvolution and unpooling layers in the decoding part. This way, spatial relationships between the pixels of the images are maintained in the projection process, which is known to improve performance in machine-vision tasks greatly.
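The effect of pooling in the encoder and unpooling in the decoder on the shape of an image can be sketched in plain NumPy. This is only an illustration of the shape bookkeeping, assuming 2x2 max pooling and nearest-neighbour upsampling as a simple form of unpooling. The function names are hypothetical, and no convolution weights are involved.

```python
import numpy as np

def max_pool_2x2(img):
    """Halve height and width by taking the maximum over 2x2 blocks."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_2x2(img):
    """Double height and width by repeating every pixel (nearest neighbour)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

image = np.arange(16, dtype=float).reshape(4, 4)  # made-up 4x4 grey-scale image

pooled = max_pool_2x2(image)      # encoder side: 4x4 -> 2x2
restored = upsample_2x2(pooled)   # decoder side: 2x2 -> 4x4

print(pooled)
# [[ 5.  7.]
#  [13. 15.]]
print(restored.shape)  # (4, 4)
```

The restored image has the original shape but not the original pixel values; in a fitted convolutional auto-encoder, the learned (de)convolution layers are what recover the detail.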

As in the K-means and EM clustering algorithms, the number of expected clusters must be specified by the user and is subject to some uncertainty. As with those algorithms, the auto-encoder can be run several times with different random initializations to find out whether the clustering outcomes remain approximately the same. When implementing an auto-encoder with the Keras API for TensorFlow, the DNN parameters are set randomly at each new initialization automatically. Another approach is to set a somewhat higher-than-expected number of clusters. During the fitting process, these extra dimensions will be ignored if the DNN "deems" them to be superfluous, which means that some cluster indices will never be assigned to any data points.
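Because cluster indices are arbitrary, two runs may find the same grouping but number the clusters differently, so the index lists cannot simply be compared element by element. One possible way to compare runs, sketched below, is the Rand index: the fraction of pairs of data points on which two clusterings agree (both put the pair into the same cluster, or both separate it). The function name `rand_index` and the label lists are made up for illustration.

```python
import numpy as np

def rand_index(labels_a, labels_b):
    """Fraction of data-point pairs that two clusterings treat the same way."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # same_a[i, j] is True if points i and j share a cluster in clustering a
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    # consider each unordered pair once (upper triangle without the diagonal)
    pairs = np.triu_indices(len(a), k=1)
    return np.mean(same_a[pairs] == same_b[pairs])

run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]   # same grouping, different cluster numbering
run_3 = [0, 1, 0, 1, 0, 1]   # a different grouping

print(rand_index(run_1, run_2))  # 1.0
print(rand_index(run_1, run_3))
```

A value of 1 means the two runs produced the same partition of the data; clearly lower values between repeated runs suggest that the cluster pattern depends on the random initialization.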

In my experience, applying auto-encoders to relatively simply structured data yields about the same results as applying the K-means or EM algorithms. Convolutional auto-encoders (i.e. auto-encoders incorporating convolutional layers) work well on simple image data like the MNIST set of hand-written digits, since these require only a relatively simple architecture (few layers). My attempts at auto-encoding more difficult datasets, like plankton images, via pre-trained convolutional bases and a custom reverse-construction of these bases for the decoder part, have so far not yielded much success.

Clustering of generally difficult, complex datasets appears to be an ongoing challenge. However, the benefits of a successful implementation could be immense. For example, a clustering that assigns images to groups corresponding to human-defined classes would remove the need to create hand-labelled training datasets, as required in classification. Thus, even more time could be saved in tasks involving visual recognition.

An implementation of an auto-encoder on the iris dataset using Keras in Python is given below:

"""
import the required packages
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

from keras import layers, optimizers, models

"""
load and standardize the iris data
"""
iris = datasets.load_iris()['data']

iris = (iris - np.mean(iris, axis = 0)) / np.std(iris, axis = 0) # standardize each column to zero mean and unit variance
 
"""    
set the expected number of clusters
"""
n_clusts = 4   

"""
build a simple Deep Neural Net
"""
inp = layers.Input(shape=[np.shape(iris)[1],]) # input layer; start of the encoder component
l1 = layers.Dense(64, activation = 'linear')(inp)
l2 = layers.Dense(64, activation = 'linear')(l1)

clst = layers.Dense(n_clusts, activation = 'linear')(l2) # most abstract representation of the input

l3 = layers.Dense(64, activation = 'linear')(clst) # start of the decoder component
l4 = layers.Dense(64, activation = 'linear')(l3)
outp = layers.Dense(np.shape(iris)[1], activation = 'linear')(l4) # reconstruction layer

"""
setting up the full auto-encoder, and the encoder part (the latter is part of the full
auto-encoder, but is the only part we need for the clustering in the end)
"""
autoenc = models.Model(inp, outp)
enc = models.Model(inp, clst)

"""
compile the auto-encoder for fitting: assign an optimizer and a loss function (mean-
squared error)
"""
autoenc.compile(optimizer = optimizers.RMSprop(), loss = 'mse') # the optimizer must be an instance, not the class itself

"""
fit the auto-encoder (in the context of DNNs, this is also sometimes called "training")
"""
autoenc.fit(iris, iris, batch_size = 10, epochs = 1, verbose = 1)

"""
generate predictions (i.e., cluster assignments) using the encoder component of the auto-encoder
"""
pred = enc.predict(iris)
pred = np.argmax(pred, axis = 1) # generate an array of cluster indices, one per data point

"""
calculate the means of each cluster
"""
clst_means = [np.nanmean(iris[np.asarray(pred) == k,:], axis = 0) for k in range(n_clusts)] # one mean vector per cluster index

"""
plot the clustered data and the cluster means
"""
plt.scatter(iris[:,0], iris[:,1], c = pred)
for clst_mean in clst_means:
    plt.plot(clst_mean[0], clst_mean[1], 'ro') # an empty cluster has a NaN mean and is not drawn
plt.show()

"""
generate the reconstructions using the full auto-encoder
"""
recon = autoenc.predict(iris)

"""
plot the correlation between the original input data and the reconstructions
"""
plt.scatter(iris[:,0], recon[:,0])
plt.show()

"""
plot the first two dimensions of the original input data and the reconstructions
"""
plt.scatter(iris[:,0], iris[:,1])
plt.scatter(recon[:,0], recon[:,1])
plt.show()