A Beginner’s Guide to Eigenvectors, PCA, Covariance and Entropy

Content:

This post introduces eigenvectors and their relationship to matrices in plain language and without a great deal of math. It builds on those ideas to explain covariance, principal component analysis, and information entropy.

The eigen in eigenvector comes from German, and it means something like “very own.” For example, in German, “mein eigenes Auto” means “my very own car.” So eigen denotes a special relationship between two things. Something particular, characteristic and definitive. This car, or this vector, is mine and not someone else’s.

Matrices, in linear algebra, are simply rectangular arrays of numbers, a collection of scalar values between brackets, like a spreadsheet. All square matrices (e.g. 2 x 2 or 3 x 3) have eigenvectors, and they have a very special relationship with them, a bit like Germans have with their cars.

Linear Transformations

We’ll define that relationship after a brief detour into what matrices do, and how they relate to other numbers.

Matrices are useful because you can do things with them like add and multiply. If you multiply a vector v by a matrix A, you get another vector b, and you could say that the matrix performed a linear transformation on the input vector.

Av = b

It maps one vector v to another, b.

We’ll illustrate with a concrete example. (You can see how this type of matrix multiply, called a dot product, is performed here.)

So A turned v into b. In the graph below, we see how the matrix mapped the short, low line v, to the long, high one, b.

You could feed one positive vector after another into matrix A, and each would be projected onto a new space that stretches higher and farther to the right.

Imagine that all the input vectors v live in a normal grid, like this:

And the matrix projects them all into a new space like the one below, which holds the output vectors b:

Here you can see the two spaces juxtaposed:

(Credit: William Gould, Stata Blog)

And here’s an animation that shows the matrix’s work transforming one space to another:

The blue lines are eigenvectors.

You can imagine a matrix like a gust of wind, an invisible force that produces a visible result. And a gust of wind must blow in a certain direction. The eigenvector tells you the direction the matrix is blowing in.

So out of all the vectors affected by a matrix blowing through one space, which one is the eigenvector? It’s the one that doesn’t change direction; that is, the eigenvector is already pointing in the same direction that the matrix is pushing all vectors toward. An eigenvector is a like a weathervane. An eigenvane, as it were.

The definition of an eigenvector, therefore, is a vector that responds to a matrix as though that matrix were a scalar coefficient. In this equation, A is the matrix, x the vector, and lambda the scalar coefficient, a number like 5 or 37 or pi.

You might also say that eigenvectors are axes along which linear transformation acts, stretching or compressing input vectors. They are the lines of change that represent the action of the larger matrix.

Notice we’re using the plural – axes and lines. Just as a German may have a Volkswagen for grocery shopping, a Mercedes for business travel, and a Porsche for joy rides (each serving a distinct purpose), square matrices have as many eigenvectors as they have linearly independent variables; i.e. a 2 x 2 matrix could have two eigenvectors, a 3 x 3 matrix three, and an n x n matrix could have n eigenvectors, each one representing its line of action in one dimension.

Because eigenvectors distill the axes of principal force that a matrix moves input along, they are useful in matrix decomposition; i.e. the diagonalization of a matrix along its eigenvectors. Because those eigenvectors are representative of the matrix, they perform the same task as the autoencoders employed by deep neural networks.

To quote Yoshua Bengio:

Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.

For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that 12 = 2 × 2 × 3. 

From this representation we can conclude useful properties, such as that 12 is not divisible by 5, or that any integer multiple of 12 will be divisible by 3.

Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.

One of the most widely used kinds of matrix decomposition is called eigen-decomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

Principal Component Analysis (PCA)

PCA is a tool for finding patterns in high-dimensional data such as images. Machine-learning practitioners sometimes use PCA to preprocess data for their neural networks. By centering, rotating and scaling data, PCA prioritizes dimensionality (allowing you to drop some low-variance dimensions) and can improve the neural network’s convergence speed and the overall quality of results.

To get to PCA, we’re going to quickly define some basic statistical ideas – mean, standard deviation, variance and covariance – so we can weave them together later. Their equations are closely related.

Mean is simply the average value of all x’s in the set X, which is found by dividing the sum of all data points by the number of data points, n.

Standard deviation is the square root of the average square distance of data points to the mean. In the equation below, the numerator contains the sum of the differences between each datapoint and the mean, and the denominator is simply the number of data points (minus one), producing the average distance.

Variance is the measure of the data’s spread. If I take a team of Dutch basketball players and measure their height, those measurements won’t have a lot of variance. They’ll all be grouped above six feet.

But if I throw the Dutch basketball team into a classroom of psychotic kindergartners, then the combined group’s height measurements will have a lot of variance. Variance is the spread, or the amount of difference that data expresses.

Variance is simply standard deviation squared, and is often expressed as s^2.

For both variance and standard deviation, squaring the differences between data points and the mean makes them positive, so that values above and below the mean don’t cancel each other out.

Let’s assume you plotted the age (x axis) and height (y axis) of those individuals (setting the mean to zero) and came up with an oblong scatterplot:

PCA attempts to draw straight, explanatory lines through data, like linear regression.

Each straight line represents a “principal component,” or a relationship between an independent and dependent variable. While there are as many principal components as there are dimensions in the data, PCA’s role is to prioritize them.

The first principal component bisects a scatterplot with a straight line in a way that explains the most variance; that is, it follows the longest dimension of the data. (This happens to coincide with the least error, as expressed by the red lines…) In the graph below, it slices down the length of the “baguette.”

The second principal component cuts through the data perpendicular to the first, fitting the errors produced by the first. There are only two principal components in the graph above, but if it were three-dimensional, the third component would fit the errors from the first and second principal components, and so forth.

Covariance Matrix

While we introduced matrices as something that transformed one set of vectors into another, another way to think about them is as a description of data that captures the the forces at work upon it, the forces by which two variables might relate to each other as expressed by their variance and covariance.

Imagine that we compose a square matrix of numbers that describe the variance of the data, and the covariance among variables. This is the covariance matrix. It is an empirical description of data we observe.

Finding the eigenvectors and eigenvalues of the covariance matrix is the equivalent of fitting those straight, principal-component lines to the variance of the data.

Eigenvalues are simply the coefficients attached to eigenvectors, which give the axes magnitude. In this case, they are the measure of the data’s covariance. By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the principal components in order of significance.

For a 2 x 2 matrix, a covariance matrix might look like this:

The numbers on the upper left and lower right represent the variance of the x and y variables, respectively, while the identical numbers on the lower left and upper right represent the covariance between x and y. Because of that identity, such matrices are known as symmetrical. As you can see, the covariance is positive, since the graph near the top of the PCA section points up and to the right.

If two variables increase and decrease together (a line going up and to the right), they have a positive covariance, and if one decreases while the other increases, they have a negative covariance (a line going down and to the right).

(Credit: Vincent Spruyt)

Notice that when one variable or the other doesn’t move at all, and the graph shows no diagonal motion, there is no covariance whatsoever. Covariance answers the question: do these two variables dance together? If one remains null while the other moves, the answer is no.

Also, in the equation below, you’ll notice that there is only a small difference between covariance and variance.

vs.

The great thing about calculating covariance is that, in a high-dimensional space where you can’t eyeball intervariable relationships, you can know how two variables move together by the positive, negative or non-existent character of their covariance. (Correlation is a kind of normalized covariance, with a value between -1 and 1.)

To sum up, the covariance matrix defines the shape of the data. Diagonal spread along eigenvectors is expressed by the covariance, while x-and-y-axis-aligned spread is expressed by the variance.

Causality has a bad name in statistics, so take this with a grain of salt:

While not entirely accurate, it may help to think of each component as a causal force in the Dutch basketball player example above, with the first principal component being age; the second possibly gender; the third nationality (implying nations’ differing healthcare systems), and each of those occupying its own dimension in relation to height. Each acts on height to different degrees. You can read covariance as traces of possible cause.

Change of Basis

Because the eigenvectors of the covariance matrix are orthogonal to each other, they can be used to to reorient the data from the x and y axes to the axes represented by the principal components. You re-base the coordinate system for the dataset in a new space defined by its lines of greatest variance.

The x and y axes we’ve shown above are what’s called the basis of a matrix; that is, they provide the points of the matrix with x, y coordinates. But it is possible to recast a matrix along other axes; for example, the eigenvectors of a matrix can serve as the foundation of a new set of coordinates for the same matrix. Matrices and vectors are animals in themselves, independent of the numbers linked to a specific coordinate system like x and y.

In the graph above, we show how the same vector v can be situated differently in two coordinate systems, the x-y axes in black, and the two other axes shown by the red dashes. In the first coordinate system, v = (1,1), and in the second, v = (1,0), but v itself has not changed. Vectors and matrices can therefore be abstracted from the numbers that appear inside the brackets.

This has profound and almost spiritual implications, one of which is that there exists no natural coordinate system, and mathematical objects in n-dimensional space are subject to multiple descriptions. (Changing matrices’ bases also makes them easier to manipulate.)

Entropy & Information Gain

In information theory, the term entropy refers to information we don’t have (normally people define “information” as what they know, and jargon has triumphed once again in turning plain language on its head to the detriment of the uninitiated). The information we don’t have about a system, its entropy, is related to its unpredictability: how much it can surprise us.

If you know that a certain coin has heads embossed on both sides, then flipping the coin gives you absolutely no information, because it will be heads every time. You don’t have to flip it to know.

A balanced, two-sided coin does contain an element of surprise with each coin toss. And a six-sided die, by the same argument, contains even more surprise with each roll, which could produce any one of six results with equal frequency. Both those objects contain information in the technical sense.

Now let’s imagine the die is loaded, it comes up “three” on five out of six rolls, and we figure out the game is rigged. Suddenly the amount of surprise produced with each roll by this die is greatly reduced. We understand a trend in the die’s behavior that gives us greater predictive capacity.

Understanding the die is loaded is analogous to finding a principal component in a dataset. You simply identify an underlying pattern.

That transfer of information, from what we don’t know about the system to what we know, represents a change in entropy. Insight decreases the entropy of the system. Get information, reduce entropy. This is information gain. And yes, this type of entropy is subjective, in that it depends on what we know about the system at hand. (Fwiw, information gain is synonymous with Kullback-Leibler divergence, which we explored briefly in this tutorial on restricted Boltzmann machines.)

So each principal component cutting through the scatterplot represents a decrease in the system’s entropy, in its unpredictability.

It so happens that explaining the shape of the data one principal component at a time, beginning with the component that accounts for the most variance, is similar to walking data through a decision tree. The first component of PCA, like the first split in a properly formed decision tree, will be along the dimension that reduces unpredictability the most.

Other Resources

A Beginner’s Guide to Eigenvectors, PCA, Covariance and Entropy的更多相关文章

  1. A Beginner's Guide to Paxos

    Google Drive: A Beginner's Guide to Paxos The code ideas of Paxos protocol: 1) Optimistic concurrenc ...

  2. Beginner's Guide to Python-新手指导

    Refer English Version: http://wiki.python.org/moin/BeginnersGuide New to programming? Python is free ...

  3. A Beginner's Guide To Understanding Convolutional Neural Networks(转)

    A Beginner's Guide To Understanding Convolutional Neural Networks Introduction Convolutional neural ...

  4. (转)A Beginner's Guide To Understanding Convolutional Neural Networks Part 2

    Adit Deshpande CS Undergrad at UCLA ('19) Blog About A Beginner's Guide To Understanding Convolution ...

  5. (转)A Beginner's Guide To Understanding Convolutional Neural Networks

    Adit Deshpande CS Undergrad at UCLA ('19) Blog About A Beginner's Guide To Understanding Convolution ...

  6. 新手教程之:循环网络和LSTM指南 (A Beginner’s Guide to Recurrent Networks and LSTMs)

    新手教程之:循环网络和LSTM指南 (A Beginner’s Guide to Recurrent Networks and LSTMs) 本文翻译自:http://deeplearning4j.o ...

  7. Photography theory: a beginner's guide(telegraph.co.uk)

    By Diane Smyth, Tim Clark, Rachel Segal Hamilton and Lewis Bush 11:00AM BST 09 Jun 2014   Have you r ...

  8. A Beginner’s Guide to the OUTPUT Clause in SQL Server

    原文 A Beginner’s Guide to the OUTPUT Clause in SQL Server T-SQL supports the OUTPUT clause after the ...

  9. The Beginner’s Guide to iptables, the Linux Firewall

    Iptables is an extremely flexible firewall utility built for Linux operating systems. Whether you’re ...

随机推荐

  1. pandas中DataFrame的ix,loc,iloc索引方式的异同

    pandas中DataFrame的ix,loc,iloc索引方式的异同 1.loc: 按照标签索引,范围包括start和end 2.iloc: 在位置上进行索引,不包括end 3.ix: 先在inde ...

  2. C语言自评

    问卷调查:你对自己的未来有什么规划?做了哪些准备?答:做设计方面的工作:正在努力自学有关这方面的知识 你认为什么是学习?学习有什么用?现在学习动力如何?为什么?答:学习就是增长见识:学习的作用就是为了 ...

  3. 【技术向】rainmeter的设计与发现

    我们在大学期间所学的那点代码知识还远远不够,于是我就自己寻找到了一款简单易懂的软件,来丰富我的代码知识. 这款软件叫rainmeter,中文叫做雨滴,是一款可以修改桌面的软件.它可以将桌面上更改出硬盘 ...

  4. 关于虚拟机安装mac os 教程详解

    环境搭建 VMware下载 百度云盘下载:链接:http://pan.baidu.com/s/1pK8RcLl 密码:5jc5 Unlocker208 百度云盘下载:链接:http://pan.bai ...

  5. nodejs 中on 和 emit

    首先测试用例: var EventEmitter = require('events').EventEmitter var life = new EventEmitter(); // life.on( ...

  6. Scrum Meeting Beta - 2

    Scrum Meeting Beta - 2 NewTeam // 地点:新主楼F座二楼 任务反馈 团队成员 完成任务 计划任务 安万贺 了解缓存的相关内容Issue #109 设计本地存储的方案Is ...

  7. 把握曝光三要素(上):快门、光圈、ISO

    概要: 如果你还没有掌握快门.光圈和ISO,那这篇文章或许对你有所帮助! 把照片比作水池.把进光量比作水.把快门比作关闭水龙头的速度.把光圈比作水龙头的大小.把感光度ISO比作水龙头的滤网,这就变得好 ...

  8. Hadoop RPC protocol description--转

    原文地址:https://spotify.github.io/snakebite/hadoop_rpc.html Snakebite currently implements the followin ...

  9. 【Java并发编程】之二:线程中断

    [Java并发编程]之二:线程中断 使用interrupt()中断线程 ​ 当一个线程运行时,另一个线程可以调用对应的Thread对象的interrupt()方法来中断它,该方法只是在目标线程中设置一 ...

  10. 【转】Unable to load embedded resource from assembly 无法加载的程序集嵌入的资源

    转自:http://blog.sina.com.cn/s/blog_994678b90101f035.html 项目运用IbatisNet 今天更新项目,编译完点击运行,报错如下: [“/”应用程序中 ...