This blog aims to explain the paper Nonlinear Principal Component Analysis Using Autoassociative Neural Networks (M. A. Kramer, 1991) by (hopefully) breaking it down into much simpler terms.
What exactly is Principal Component Analysis?¶
To understand what this paper aims to achieve, we must first understand the crux of the paper: Principal Component Analysis, which I will hence refer to as PCA.
Imagine you have a system with many sensors of various kinds, and you're trying to understand how the system works by observing the sensor data. The primary issue is that you can't tell which parts of the data are actually useful, or how the measurements relate to each other. With 5-10 measurements, you could sit down, work out the relations, and make things easier for yourself. But when you have too many measurements, doing that by hand is a huge hassle.
The number of these raw measurements, or variables, is known as the Superficial Dimensionality. More specifically, it is the number of variables (read: measurements) in each vector (read: data point) of the dataset. The Superficial Dimensionality is almost always larger than necessary. So what would an ideal dataset look like? An ideal dataset would contain only the independent (or, necessary) variables, and their count would be the Intrinsic Dimensionality.
Let's illustrate this with an example. Say you're managing a single gas pipeline, and you have data coming from 2 sensors: one measuring the inlet flow, and the other measuring the outlet flow. Your task is to identify whether there is a leak in the pipeline. You're smart, so you subtract the outlet flow from the inlet flow, and check whether the difference, let's call that net flow, is 0. If it is 0, there is no leak. If it is nonzero, there is a leak. Simple, right? But stop right there. You've just simplified the problem from 2 variables to 1 variable, moving from a Superficial Dimensionality of 2 to an Intrinsic Dimensionality of 1.
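This leak check is simple enough to sketch in a few lines of NumPy. The sensor readings below are made up purely for illustration:

```python
import numpy as np

# Hypothetical readings from the two sensors over 5 time steps:
# column 0 = inlet flow, column 1 = outlet flow.
readings = np.array([
    [10.0, 10.0],
    [12.0, 12.0],
    [11.0, 10.2],   # outlet lower than inlet: possible leak
    [ 9.5,  9.5],
    [13.0, 11.9],   # possible leak
])

# Collapse the two variables into one: net flow = inlet - outlet.
net_flow = readings[:, 0] - readings[:, 1]
leak = ~np.isclose(net_flow, 0.0)
print(leak)  # True wherever the net flow is nonzero
```

Two columns of data have been reduced to one derived variable that answers the question directly, which is exactly the 2-to-1 dimensionality reduction described above.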
But what if there were 100 sensors? 1,000? 10,000? Finding relations in such a dataset by hand would be a Herculean task. And that is where PCA comes in. PCA is a method of reducing the Superficial Dimensionality to the Intrinsic Dimensionality: it finds relations in a dataset with a large number of variables and produces the most efficient way to represent the data. It is important to note, however, that PCA assumes all the relations in the dataset are linear. With PCA, we assume the data is best represented by a straight line or a plane; if you were to visualise this on a graph, it would be exactly like finding the line of best fit.
Let's dig deeper into the math. Let's assume a matrix $Y$ of size $n \times m$, where $n$ is the number of data points (or equations), and $m$ is the number of variables. This would be our data in the form of a table. PCA decomposes it into 2 smaller matrices and an error term.
$$ Y = T \times P^{T} + E $$
Where $T$ is the scores matrix of dimensions $n \times f$, $P$ is the loadings matrix of dimensions $m \times f$, and $E$ is the error matrix of dimensions $n \times m$. Here $f$ is the number of principal components, or, the Intrinsic Dimensionality.
But what is the Scores Matrix $T$? The scores matrix gives the new coordinates of the data points in the reduced space; in other words, $T$ is the data rewritten in terms of the new variables. But what maps the old variables into the new ones? That is the job of the Loadings Matrix $P$: it represents the linear mapping from the old variables to the new ones. The columns of $P$ are mathematically defined as the eigenvectors corresponding to the $f$ largest eigenvalues of the covariance matrix of $Y$.
Quick Refresher (or guide)
What is a covariance matrix?A covariance matrix is a square matrix whose entries are the covariances between pairs of variables. The covariance between two variables $x$ and $y$ is given by:- $$ Cov(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y) $$ and the full matrix is $$ \Sigma = \begin{bmatrix} Var(x_1) & \cdots & Cov(x_1, x_n) \\ \vdots & \ddots & \vdots \\ Cov(x_n, x_1) & \cdots & Var(x_n) \end{bmatrix} $$ where $\mu_x$ and $\mu_y$ are the means of $x$ and $y$.
What are eigenvalues and eigenvectors?
An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it. An eigenvalue is the scalar that represents how much the eigenvector is stretched or shrunk during the transformation. $$ A v = \lambda v $$ where $A$ is the transformation matrix, $v$ is the eigenvector, and $\lambda$ is the eigenvalue.
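Both refresher definitions are easy to verify numerically. Here is a minimal NumPy sketch, using made-up random data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # 500 samples of 3 made-up variables

# Covariance matrix "by hand": center each column, then (1/(n-1)) X^T X.
n = X.shape[0]
Xc = X - X.mean(axis=0)
sigma = (Xc.T @ Xc) / (n - 1)
assert np.allclose(sigma, np.cov(X.T))  # matches NumPy's built-in

# Eigendecomposition: eigh is the right tool for symmetric matrices,
# and covariance matrices are always symmetric.
eigvals, eigvecs = np.linalg.eigh(sigma)
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(sigma @ v, lam * v)  # A v = lambda v holds
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so the "largest" eigenvectors PCA needs are the last columns.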
The Residual Matrix $E$ is the difference between the original matrix $Y$ and the reconstructed matrix $T \times P^{T}$. It represents the information lost during the dimensionality reduction process, which is often just classified as noise.
Let's take this step by step from here. We first obtain $T$, which maps the higher-dimensional data in $R^m$ to the lower-dimensional space $R^f$. We do this by simply multiplying the data matrix $Y$ with the loadings matrix $P$.
$$ T = Y \times P $$
Then, we decode it back by reconstructing $Y$ using:-
$$ Y' = T \times P^{T} $$
We then take the difference between $Y$ and $Y'$ to get the residual matrix $E$. We optimise this by using the least squares approach, where we try to reduce the value of $||E||$.
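The whole encode/decode/residual cycle fits in a short NumPy sketch. The helper name `pca_decompose` and the synthetic data below are my own illustrative choices:

```python
import numpy as np

def pca_decompose(Y, f):
    """Decompose centered data Y (n x m) into scores T, loadings P, residual E."""
    cov = np.cov(Y.T)                        # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    P = eigvecs[:, ::-1][:, :f]              # m x f: top-f eigenvectors
    T = Y @ P                                # n x f scores: encode
    E = Y - T @ P.T                          # n x m residual: what f components miss
    return T, P, E

rng = np.random.default_rng(1)
# Synthetic data that is essentially 1-dimensional plus a little noise.
t = rng.normal(size=(200, 1))
Y = t @ np.array([[2.0, -1.0, 0.5]]) + 0.01 * rng.normal(size=(200, 3))
Y = Y - Y.mean(axis=0)

T, P, E = pca_decompose(Y, f=1)
print(T.shape, P.shape, np.linalg.norm(E))  # (200, 1) (3, 1) and a small residual
```

Because the data really is one-dimensional (plus noise), a single principal component captures almost everything and $||E||$ stays small.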
All this means that, within the class of linear methods, PCA gives the optimal transformation in this least-squares sense.
Now, what's the problem?¶
If only PCA could effectively capture non-linear relations. Let us take an example with a dataset containing points on a circle.
If I were to apply PCA, it would only give me a linear relation between $x$ and $y$, with a value of $||E||$ large enough for the reconstruction to be very off. A linear relation is essentially a straight line, but we wanted a circle!
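You can see the failure directly by running one-component PCA on circle data. A minimal sketch, with the circle generated synthetically:

```python
import numpy as np

# Points on the unit circle: superficially 2-D, intrinsically 1-D (the angle).
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
Y = np.column_stack([np.cos(theta), np.sin(theta)])
Y = Y - Y.mean(axis=0)

# Best single linear component: project onto the top eigenvector and back.
eigvals, eigvecs = np.linalg.eigh(np.cov(Y.T))
P = eigvecs[:, -1:]           # 2 x 1 loadings
E = Y - (Y @ P) @ P.T         # residual of the best 1-D linear fit

print(np.linalg.norm(E))      # large: a straight line cannot trace a circle
```

No matter which direction PCA picks, half of the circle's "energy" ends up in the residual, so $||E||$ stays stubbornly large.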
Thus, we introduce Non-Linear Principal Component Analysis (NLPCA), using Autoassociative Neural Networks.
NLPCA¶
Since using just a single linear layer (like in a standard PCA) can't capture non-linear relationships, Mark Kramer proposed using a 5-layer Autoassociative Neural Network (AANN).
To put it simply, we're building a "bottleneck" that forces the data to squeeze through a tiny representation before trying to reconstruct itself.
The 5-Layer Architecture¶
The magic happens in how these five layers are structured. Let's break them down:
- Input Layer ($N$ nodes): Takes our raw data (the Superficial Dimensionality).
- Mapping Layer ($M$ nodes): This is a non-linear layer. It uses sigmoidal activation functions to project the data into a space where features can be extracted non-linearly.
- Bottleneck Layer ($f$ nodes): This is the "squeeze." $f$ is our Intrinsic Dimensionality (where $f < N$). The values here are our Non-Linear Principal Components.
- Demapping Layer ($D$ nodes): Another non-linear layer that takes the compressed representation and starts "un-squeezing" it back to the original size.
- Output Layer ($N$ nodes): The reconstruction of our original data.
If we were to visualize it, it would look something like this:
$$ \text{Input} \xrightarrow{\text{Non-linear}} \text{Mapping} \xrightarrow{\text{Linear/Non-linear}} \text{Bottleneck} \xrightarrow{\text{Non-linear}} \text{Demapping} \xrightarrow{\text{Linear}} \text{Output} $$
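A single forward pass through this architecture can be sketched in plain NumPy. The layer widths and the random (untrained) weights are illustrative choices, not values from the paper, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer widths: N inputs -> M mapping -> f bottleneck -> D demapping -> N outputs.
N, M, f, D = 2, 8, 1, 8
rng = np.random.default_rng(0)
W1 = rng.normal(size=(N, M))
W2 = rng.normal(size=(M, f))
W3 = rng.normal(size=(f, D))
W4 = rng.normal(size=(D, N))

def forward(Y):
    h = sigmoid(Y @ W1)   # mapping layer (non-linear)
    t = h @ W2            # bottleneck: the non-linear principal components
    g = sigmoid(t @ W3)   # demapping layer (non-linear)
    return t, g @ W4      # linear output layer: the reconstruction Y'

Y = rng.normal(size=(5, N))
t, Y_prime = forward(Y)
print(t.shape, Y_prime.shape)  # (5, 1) (5, 2)
```

With untrained weights the reconstruction is of course garbage; the point here is only the shape of the computation: $N$-dimensional data squeezed through an $f$-dimensional bottleneck and expanded back.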
Why 5 layers? Why not 3?¶
This is the part that often trips people up. Why do we need five whole layers when a simple three-layer network (Input $\to$ Hidden $\to$ Output) seems like it should work?
To understand this, we have to look at what we're trying to achieve: we want to find any non-linear function that can map high-dimensional data into a low-dimensional bottleneck, and then find another non-linear function to map it back.
The Universal Approximation Theorem¶
In neural network theory, there's a famous concept called the Universal Approximation theorem (Cybenko, 1989). It states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function, provided it has a non-linear activation function (like a sigmoid).
Now, let's look at our NLPCA objective again. We have two distinct functions to learn: 1. The Encoding Function ($G$): Maps $R^m \to R^f$ (Input to Bottleneck). 2. The Decoding Function ($H$): Maps $R^f \to R^m$ (Bottleneck to Output).
If we want the network to be able to learn any non-linear $G$, then based on the theorem, $G$ must have at least one hidden layer. That "hidden layer" for $G$ is our Mapping Layer.
Similarly, if we want the network to be able to learn any non-linear $H$, it also needs at least one hidden layer. That "hidden layer" for $H$ is our Demapping Layer.
Breaking it down:¶
If we try to use only 3 layers (Input $\to$ Bottleneck $\to$ Output):
- If the Bottleneck is linear, the network is mathematically equivalent to standard, linear PCA. It just finds the best-fitting line or plane.
- If the Bottleneck is non-linear, it still lacks the "functional capacity" to model complex, multi-modal, or "wrapped" distributions (like the circle or an S-curve). It can only represent a very limited set of non-linearities, because it's trying to do both the feature extraction and the compression in a single step.
By using 5 layers, we decouple the "Mapping" (which extracts the non-linear features) from the "Bottleneck" (which holds the compressed coordinates). This gives the network the freedom to "unroll" the data before squeezing it, and then "re-roll" it back during reconstruction.
Pro-Tip
In the literature, the 5 layers are often referred to as:
- Input Layer
- Mapping Layer (Hidden Layer 1)
- Bottleneck Layer
- Demapping Layer (Hidden Layer 2)
- Output Layer
Identity Mapping¶
The goal of this network is to achieve Identity Mapping. This means we want the output $Y'$ to be as close as possible to the input $Y$.
We train the network to minimize the reconstruction error:
$$ \mathcal{L} = \sum_{i=1}^{n} ||Y_i - Y'_i||^2 $$
When the network is successfully trained, the bottleneck layer effectively holds a compressed, non-linear representation of the data. If your data were points on a circle, the bottleneck layer would only need one node (representing the angle $\theta$) to perfectly reconstruct the $(x, y)$ coordinates. That's a reduction from 2 variables to 1!
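We can demonstrate that ideal one-node bottleneck by hand-coding the encoder and decoder a trained AANN would only approximate: encode each point to its angle, then decode the angle back to coordinates. A sketch on synthetic circle data:

```python
import numpy as np

# Circle data: 2 superficial variables (x, y), intrinsic dimensionality 1.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
Y = np.column_stack([np.cos(theta), np.sin(theta)])

# The "ideal" bottleneck: encode R^2 -> R^1 (the angle),
# then decode R^1 -> R^2 (back to coordinates).
t = np.arctan2(Y[:, 1], Y[:, 0])
Y_prime = np.column_stack([np.cos(t), np.sin(t)])

loss = np.sum((Y - Y_prime) ** 2)   # the reconstruction error from above
print(loss)  # essentially zero
```

One variable in the bottleneck, zero information lost, which is exactly the target a trained network converges toward on this data.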
Significance and Challenges¶
Kramer's NLPCA is a beast because it doesn't care if your data is linear or not. It finds the underlying "manifold" (the shape) of your data.
However, it's not all sunshine and rainbows:
- Local Minima: Unlike linear PCA (which has a single best solution), training these networks can get stuck in local minima.
- Complexity: Choosing the number of nodes in the Mapping and Demapping layers ($M$ and $D$) is more of an art than a science.
- Computation: It's way more expensive to train than just doing a singular value decomposition (SVD) for standard PCA.
Conclusion¶
NLPCA was a foundational step in what we now call Autoencoders. By forcing data through a bottleneck, we're not just compressing it; we're forcing the machine to understand the essence of the data. Whether it's a gas pipeline or a complex chemical reaction, NLPCA gives us a look at the "Intrinsic" world that linear methods simply can't see.
Recommended Resources¶
If you're looking to dive even deeper into NLPCA and Dimensionality Reduction, here are some top-tier resources:
- The Original Paper (Kramer, 1991) - The source of it all.
- Introduction to Autoencoders (DeepLearning.ai) - Great for modern context.
- Non-linear Dimensionality Reduction (Wikipedia) - A broad overview of other methods like t-SNE and Isomap.