
Principal component analysis

Motivation

Principal component analysis (PCA) is a technique to reduce the number of features of a machine learning problem, also known as the problem dimension, while trying to maintain most of the information of the original dataset. The two main applications of dimensionality reduction by PCA are:

Visualization of high-dimensional data.
Pre-processing of data to accelerate model training and reduce memory requirements.

This blog introduces PCA, explains how it works, and applies it to an image recognition problem.

Intuitive explanation

PCA generates new features, called principal components, that are linear combinations of the original features. The principal components are designed with two objectives:

Differentiate between instances. The value of the principal component should vary as much as possible between instances. Mathematically, this objective is equivalent to maximizing the variance.
Summarize the data by attempting to (only) eliminate redundant information. It should be possible to predict, or rebuild, the original features from the main principal components. When we transform or project the features into principal components, the mathematical objective is to minimize the average squared projection error.

Surprisingly, these two objectives are equivalent. The reason is best understood through an example. Figure 1 shows a data set with two features, x1 and x2. As there are two features, we can get up to two principal components. The first principal component is depicted as a green arrow. It maximizes the variance in the following sense: of all the straight lines the instances could be projected onto, the first principal component is the one along which the projected instances are spread out the most. The projection error is the average squared distance between the instances and the line defined by the green arrow, and it is also minimized by the first principal component. The two go together because, for centered data, the squared distance of each instance from the origin is fixed and, by Pythagoras' theorem, equals the squared length of its projection plus its squared projection error. Maximizing the first term therefore minimizes the second.

Figure 1: Principal components of a data set with two features.

If we consider the first principal component a sufficiently accurate approximation of the two features, we could replace the two features by only the first principal component, effectively reducing the problem dimension.
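
The equivalence of the two objectives can be checked numerically. The sketch below is only an illustration: it uses a synthetic two-feature data set (an assumption, not the data behind Figure 1), sweeps candidate directions in one-degree steps, and compares the direction that maximizes the variance of the projections with the one that minimizes the average squared projection error.

```python
import numpy as np

# Synthetic stand-in for a two-feature data set (assumed data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.6 * x1 + 0.3 * rng.normal(size=500)
data = np.column_stack([x1, x2])
data = data - data.mean(axis=0)  # center each feature

# Candidate directions: unit vectors at angles between 0 and 180 degrees.
angles = np.linspace(0.0, np.pi, 181)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Coordinate of every instance along every candidate direction.
projections = data @ directions.T            # shape (500, 181)
variance = (projections ** 2).mean(axis=0)   # variance of the projected data

# Squared distance between each instance and its projection on the line.
squared_norms = (data ** 2).sum(axis=1, keepdims=True)
projection_error = (squared_norms - projections ** 2).mean(axis=0)

best_by_variance = angles[np.argmax(variance)]
best_by_error = angles[np.argmin(projection_error)]
print(np.degrees(best_by_variance), np.degrees(best_by_error))
```

Both printed angles should coincide, because for every direction the variance and the projection error add up to the same constant (the average squared distance of the instances from the origin).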

Image recognition example

We will apply principal component analysis to the MNIST data set. If you are not familiar with this data set, it consists of grayscale pictures of hand-written digits. Each picture is represented by a matrix of dimension 28 * 28, in which each element describes the grayscale intensity of a pixel.

For this blog, we will use the training data, which consists of 60,000 images. Similar results would be obtained if the testing data set were used instead. The following code loads the MNIST data set and plots an example instance as shown in Figure 2.

import numpy as np
import pandas as pd
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
from keras.datasets import mnist
from sklearn.decomposition import PCA
import seaborn as sns

# We will only use the training data.
# The outcome will be similar with the testing data.
(X, Y), _ = mnist.load_data()

# Plot example instance
plt.imshow(255-X[0], cmap='gray')

# Attribute matrix dtype must be float for PCA
X = X.astype(float)

To simplify the visualization in this blog example, we will only use the digits 0 to 3 (4 of the 10 digits in the data set).

n_digits = 4
X = X[Y < n_digits, :, :]
Y = Y[Y < n_digits]

The feature array has a dimension of 24,754 * 28 * 28. This is 24,754 instances (relating to digits 0 to 3) of 28 * 28 pixel images. In order to apply PCA, we need to flatten the array to obtain the 2-dimensional feature matrix with one row per instance and one column per feature.

print(X.shape)
# (24754, 28, 28)
X = X.reshape(X.shape[0], -1)
print(X.shape)
# (24754, 784)

PCA requires all features to have a mean value of zero. This is achieved by subtracting the column mean from each element. A second pre-processing step is often required: if the magnitudes of the features are very different, it is good practice to standardize the features by dividing them by their standard deviation. This is not needed for the MNIST data set, as all features are of similar magnitude.
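
As a rough illustration of these two pre-processing steps, the sketch below centers and standardizes a small synthetic matrix (an assumed stand-in, not the MNIST data). It also checks that the PCA class in scikit-learn centers its input itself, so fitting on the raw or on the manually centered matrix should give the same explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Small synthetic stand-in for the flattened feature matrix (assumed data).
rng = np.random.default_rng(0)
scales = np.linspace(1.0, 10.0, 10)
X_demo = 100.0 + rng.normal(size=(300, 10)) * scales

# Centering: subtract the column mean from each element.
X_centered = X_demo - X_demo.mean(axis=0)

# Standardizing: additionally divide each column by its standard deviation.
X_standardized = X_centered / X_centered.std(axis=0)

# PCA in scikit-learn centers its input, so the raw and the centered
# matrices yield the same explained variance.
pca_raw = PCA().fit(X_demo)
pca_centered = PCA().fit(X_centered)
print(np.allclose(pca_raw.explained_variance_, pca_centered.explained_variance_))
```

Note that dividing by the standard deviation needs a guard when a feature is constant, which can happen with image data where some pixels never change.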

Data visualization

The code below plots the value of the two main principal components for 500 instances as shown in Figure 3.

def plot(X_transformed, Y, N, p1, p2):
    """
    Plots two principal components of the instances.

    X_transformed: the attribute matrix transformed by PCA.
    Y: the output vector.
    N: number of instances to plot.
    p1 and p2: principal components to plot.
    """
    label_x = 'Principal component {}'.format(p1+1)
    label_y = 'Principal component {}'.format(p2+1)
    df = pd.DataFrame({label_x: X_transformed[:N, p1],
                       label_y: X_transformed[:N, p2],
                       'label': Y[:N]})
    sns.lmplot(data=df, x=label_x, y=label_y, hue='label', fit_reg=False)


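# Fit PCA on the flattened features, keeping all components,
# and project the instances onto the principal components.
pca = PCA()
X_transformed = pca.fit_transform(X)

plot(X_transformed, Y, 500, 0, 1)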

Figure 3: Values of two principal components for selected instances.

As seen in Figure 3, the two main principal components are sufficient to separate the majority of the instances by digit. However, some important information is clearly lost, as some digits overlap. As expected, the digits are better separated horizontally, along the first principal component, than vertically, along the second principal component.

Model preprocessing

The key difference between applying principal component analysis for visualization and for model pre-processing lies in the number of principal components retained. For visualization, typically only the 2 or 3 most important components are retained. For pre-processing (before the data is input to a model), the number of components is generally determined by the share of the original data set's variance that should be retained. As an example, the first principal component retains 17.1% of the total variance and the second retains 7.9%, or 25.0% combined. It would hence be counterproductive to select only these two principal components in the pre-processing step, as 75% of the variance (i.e. the ability to differentiate between instances) would be lost before reaching the machine learning model.

print(pca.explained_variance_ratio_[:2])
# [0.17106882 0.07874969]

Figure 4 plots the cumulative retained variance depending on the number of selected principal components. For model pre-processing, we typically select the number of components to retain 95% to 99% of the variance of the original data set. For the MNIST data set, we will need to retain 135 (95% variance) to 302 (99% variance) principal components, out of 784 (28 * 28) original features. The selected principal components can then be used as the inputs to our machine learning model.

plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.grid()
plt.xlabel('Number of selected components')
plt.ylabel('Cumulative retained variance')
plt.show()

Figure 4: Cumulative retained variance as a function of number of principal components selected.
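
The number of components needed for a target variance can also be read off programmatically instead of from the plot. The sketch below is a minimal example under stated assumptions: it uses the small 8 x 8 digits data set bundled with scikit-learn as a stand-in for MNIST, and the helper components_for is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: scikit-learn's bundled 8 x 8 digits (not MNIST itself).
X_small, _ = load_digits(return_X_y=True)

pca_full = PCA(svd_solver='full').fit(X_small)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)


def components_for(target, cumulative_variance):
    """Smallest number of components whose cumulative variance reaches target."""
    return int(np.argmax(cumulative_variance >= target)) + 1


n_95 = components_for(0.95, cumulative)
n_99 = components_for(0.99, cumulative)
print(n_95, n_99)

# scikit-learn can make the same selection when n_components is a fraction.
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_small)
print(pca_95.n_components_)
```

Passing a fraction as n_components is usually the more convenient option inside a pre-processing pipeline, while the explicit cumulative sum is useful for comparing several variance targets after a single fit.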

Limitations

Like any other machine learning technique, PCA has some known limitations:

PCA only looks for linear correlation between the features. It will not work effectively if the correlation between the features is not linear.
An underlying assumption of PCA is that the principal component with the highest variance will be the most useful for solving our machine learning problem (for example, predicting the class of an instance). This assumption, although logical, is not always correct.
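
The first limitation can be illustrated with a small synthetic example (assumed data, chosen only for illustration). Points on a circle have two features that are fully determined by a single angle, but the relationship is not linear, so PCA cannot summarize them with one component: each principal component should carry roughly half of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Points on a unit circle: two features fully determined by one angle.
theta = np.linspace(0.0, 2.0 * np.pi, 500, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])

pca_circle = PCA().fit(circle)
print(pca_circle.explained_variance_ratio_)
```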

Conclusion

This blog post has explained principal component analysis and how to apply it. Please contact us if you have any questions about the blog or any other machine learning topic.