Comparing 4 Ways to Solve Least Squares: Speed and Accuracy

This is my final project proposal for MIT OCW 18.065 [1][2]; most of the ideas come from there.

1 What is Least Squares

Least squares is a method for solving the linear equations Ax=b. Wait a minute, you may ask, why not just use Gaussian elimination? Good question. The situation here is that in theory a solution of Ax=b may not exist, yet in practice we must still provide a solution that is as good as possible. The least squares method chooses an $\hat{x}$ to make the error as small as possible.

$$ \text{Minimize error: } ||b-A\hat{x}|| $$

Pay attention to the different symbol $\hat{x}$: the solution provided by least squares is not always the same as the original x. In other words, when Ax=b is solvable, the least squares solution $\hat{x}$ is the same as x, the minimal error is 0, and $\hat{x}$ solves the linear equations Ax=b; when Ax=b is unsolvable, the minimum error is greater than 0 and $\hat{x}$ does not satisfy the linear equations Ax=b exactly.

So let's think about the error function. Calculus tells us that a function attains its minimum where its derivatives are zero. Setting the derivatives of the squared error to zero ultimately leads to the so-called normal equations:

$$ A^TA\hat{x}=A^Tb $$

You can find more details at my appendix post: Matrix Calculus and Normal Equations

There are 4 ways to solve the least squares problem.
1. When $A^TA$ is invertible, a solution is obtained from the normal equations:

$$ \hat{x}=(A^TA)^{-1}A^Tb $$

2. The Gram-Schmidt way replaces A by A=QR to increase numerical stability, and the solution becomes:

$$ \hat{x}=R^{-1}Q^Tb $$

The other two methods always work, whether $A^TA$ is invertible or not.
3. Adding a penalty term. After adding a penalty term, the normal equations become $(A^TA + \delta^2I)\hat{x}=A^Tb$: $A^TA$ is replaced by $A^TA + \delta^2I$, which is invertible.
4. Using the pseudoinverse.

In this essay, we'll look at what the four ways are, why they are good solutions, and also analyze their computational cost, numerical stability, etc.
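
Before diving in, here is a minimal numpy sketch of the four ways side by side, assuming a small tall matrix A with independent columns. The example data and the penalty value δ are arbitrary choices for illustration, not from the course.

```python
import numpy as np

# A small overdetermined system: 4 equations, 2 unknowns (Ax = b has no exact solution).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# 1. Normal equations: solve (A^T A) x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# 2. Gram-Schmidt / QR: A = QR, then x = R^{-1} Q^T b.
Q, R = np.linalg.qr(A)              # reduced QR: Q is 4x2, R is 2x2
x_qr = np.linalg.solve(R, Q.T @ b)

# 3. Penalty (ridge): solve (A^T A + delta^2 I) x = A^T b with a small delta.
delta = 1e-3                        # illustration value only
x_ridge = np.linalg.solve(A.T @ A + delta**2 * np.eye(2), A.T @ b)

# 4. Pseudoinverse: x = A^+ b.
x_pinv = np.linalg.pinv(A) @ b

print(x_normal, x_qr, x_ridge, x_pinv)  # 1, 2 and 4 agree; ridge is slightly shifted by the penalty
```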

2 Pseudoinverse

The pseudoinverse (aka Moore-Penrose inverse) $A^+$ of a matrix A is something like the inverse of a square matrix; it is the best known generalization of the inverse. We start with a neat pseudoinverse formula and then illustrate why it's good from different points of view.

2.1 Pseudoinverse formula from SVD

The Singular Value Decomposition of any matrix A is $A=U\Sigma V^T$, and its pseudoinverse is:

$$ A^{+}=V\Sigma ^+U^T $$

When A is invertible, the diagonal matrix Σ is invertible, all its diagonal elements are non-zero, $\Sigma^+$ is the inverse of Σ, and $A^+$ is exactly the inverse of A. Everything is perfect.

When A is not invertible, or A is not even a square matrix at all, some of the diagonal elements of Σ are zero, and the corresponding diagonal elements of $\Sigma^+$ are also zero. For example:

$$ \begin{align*} \text{If} \quad \Sigma &= \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix}, \\ \text{then}\quad \Sigma ^{+} &= \begin{bmatrix} 1/2 & 0 \\ 0 & 0 \end{bmatrix} \end{align*} $$

You can see that we try our best to invert the matrix and make the pseudoinverse as close to a real inverse as possible. But we don't stop at this natural understanding; we'll also try to understand it from other perspectives.
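
As a quick sanity check, here is a minimal sketch that builds $A^+=V\Sigma^+U^T$ from numpy's SVD. The helper name `pinv_from_svd` and the tolerance `tol` deciding which singular values count as zero are my own assumptions.

```python
import numpy as np

def pinv_from_svd(A, tol=1e-12):
    """Build A^+ = V Sigma^+ U^T from the SVD; singular values below tol count as zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)              # A = U @ diag(s) @ Vt
    s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])   # invert only the nonzero sigmas
    return Vt.T @ np.diag(s_plus) @ U.T

A = np.array([[2.0, 0.0],
              [0.0, 0.0]])                                 # singular, like the Sigma example above
print(pinv_from_svd(A))                                    # [[0.5, 0], [0, 0]]
print(np.allclose(pinv_from_svd(A), np.linalg.pinv(A)))    # True: agrees with numpy's built-in pinv
```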

2.2 Pseudoinverse and the Four Fundamental Subspaces

Let's revisit a matrix multiplied by a vector: Ax=y.

If A is invertible, x is in the row space and y is in the column space. A can be thought of as a linear transformation which takes x from the row space to the column space, while the inverse of A does exactly the opposite: multiplying $A^{-1}$ by y takes y from the column space back to the original row space vector x.

$$ A^{-1}y=A^{-1}Ax=Ix=x $$

So you may ask, does the pseudoinverse have similar properties?

In general, A is an m by n matrix, $A^+$ is an n by m matrix, x is an n by 1 vector, and y is an m by 1 vector. Since A is not invertible, there are two cases here:

1. If x is in the row space
We can prove $A^+y=x$; the proof is at: Pseudoinverse Properties and Proof.

It shows that the column space vector y is brought back to the original row space vector x by multiplying the pseudoinverse times y. This is exactly the property we want a pseudoinverse to have.

2. If x is in the nullspace
A times x takes a vector x in the nullspace to zero, namely Ax=0; that's simple. What we really need to check is $A^+z$, where z is in the nullspace of $A^T$:

$$ A^+z=V\Sigma ^+U^Tz=0 $$

The proof is here: Pseudoinverse Properties and Proof.

This shows that for vectors in the nullspace of $A^T$, which are the part that can't be inverted, the pseudoinverse tries its best and sends these vectors to zero, just as the original matrix does to vectors in its nullspace. That's reasonable.

So the pseudoinverse $A^+=V\Sigma^+U^T$ makes sense from the point of view of the four fundamental subspaces. In addition, the pseudoinverse also produces a solution to least squares.
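
The two cases above are easy to verify numerically. Below is a small sketch on a rank-1 matrix chosen purely for illustration; the specific vectors x (in the row space) and z (in the nullspace of $A^T$) are hand-picked for this example.

```python
import numpy as np

# A rank-1 example (2x2): both the row space and the column space are the line through (1, 2).
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
A_plus = np.linalg.pinv(A)

# Case 1: x in the row space is recovered, A^+ (A x) = x.
x = np.array([1.0, 2.0])
print(np.allclose(A_plus @ (A @ x), x))    # True

# Case 2: z in the nullspace of A^T (spanned by (2, -1)) is sent to zero, A^+ z = 0.
z = np.array([2.0, -1.0])
print(np.allclose(A_plus @ z, 0.0))        # True
```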

2.3 Pseudoinverse and Least Squares

If A is invertible, the solution of Ax=b is $x=A^{-1}b$. If A is not invertible, will the pseudoinverse solution $x^+=A^+b$ satisfy the least squares condition? Yes, you'll see it does, and actually it does more than that, which is also the reason we use a different symbol $x^+$ here.

$U^T$ is an orthogonal matrix, so we can multiply by it without changing any length.

$$ ||b-Ax||^2 = ||b-U\Sigma V^Tx||^2 = ||U^Tb-\Sigma V^Tx||^2 $$

Set $\omega = V^Tx$ to get $||U^Tb-\Sigma\omega||^2$. The best ω is $\omega = \Sigma^+U^Tb$. You can read more details here: Pseudoinverse Properties and Proof. Finally, we get

$$ x^+ = V\Sigma ^+U^Tb=A^+b $$

So the pseudoinverse solution $A^+b$ solves the least squares problem. Moreover, it's also the minimum norm least squares solution.

What does that mean? If there is a nonzero vector z in the nullspace of A, it can be added to $x^+$, and the error $||b-A(x^+ + z)||$ stays unchanged because Az=0. However, the squared length $||x^+ + z||^2$ grows to $||x^+||^2 + ||z||^2$, because $x^+$ doesn't contain any component in the nullspace of A. You can read more details here: Pseudoinverse Properties and Proof.

In summary, $x^+$ has two properties:
1. It solves the least squares problem: the error $||b-Ax^+||$ is as small as possible.
2. Among all least squares solutions, $x^+$ is the one with minimum norm.

The pseudoinverse is great in this regard, although computing the SVD is not an easy task.
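
A short numerical check of both properties, using an arbitrary rank-deficient example matrix (the data and the nullspace vector z are my own choices):

```python
import numpy as np

# Rank-deficient A (column 2 = 2 * column 1), so the least squares solution is not unique.
A = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [0.0, 0.0]])
b = np.array([1.0, 3.0, 4.0])

x_plus = np.linalg.pinv(A) @ b                 # pseudoinverse solution

# Property 1: x_plus minimizes the error (it satisfies the normal equations A^T A x = A^T b).
print(np.allclose(A.T @ A @ x_plus, A.T @ b))  # True

# Property 2: adding a nullspace vector z keeps the error but increases the length.
z = np.array([2.0, -1.0])                      # A @ z = 0
print(np.linalg.norm(b - A @ x_plus), np.linalg.norm(b - A @ (x_plus + z)))  # equal errors
print(np.linalg.norm(x_plus), np.linalg.norm(x_plus + z))                    # x_plus is shorter
```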

3 The Normal Equations

At the very beginning, we obtained the normal equations. They produce a solution when $A^TA$ is invertible. So we want to ask: when is $A^TA$ invertible? Is this solution equivalent to the pseudoinverse solution? We'll also try to understand the normal equations from a projection perspective.

3.1 When is $A^TA$ invertible

The answer is simple: A has independent columns.

A and $A^TA$ always have the same nullspace! If $A^TAx = 0$, then $x^TA^TAx=0$, which is $||Ax||^2=0$, so Ax=0 and x is in the nullspace of A. The other direction is immediate: Ax=0 gives $A^TAx=0$.

So the rank of A equals the rank of $A^TA$.

When A has independent columns, the nullspace of A contains only the zero vector, and $x^+=A^+b$ is the only solution of the least squares problem. It has to be equivalent to the solution $\hat{x}=(A^TA)^{-1}A^Tb$ derived from the normal equations.

$$ \begin{align*} (A^TA)^{-1}A^T &= (V\Sigma ^TU^TU\Sigma V^T)^{-1}A^T \\ &= (V\Sigma ^T\Sigma V^T)^{-1}A^T \end{align*} $$

A is an m by n matrix and Σ is an m by n matrix. When A has independent columns, the rank of A is n, so the ranks of Σ and $\Sigma^T\Sigma$ are also n. Since $\Sigma^T\Sigma$ is an n by n matrix, it's invertible. And V is an orthogonal matrix, $V^{-1} = V^T$, so

$$ \begin{align*} (V\Sigma ^T\Sigma V^T)^{-1}A^T &= V(\Sigma ^T\Sigma)^{-1}V^TV\Sigma ^TU^T \\ &= V(\Sigma ^T\Sigma)^{-1}\Sigma ^TU^T \\ A^+ &= V\Sigma ^+U^T \\ \end{align*} $$

Is $\Sigma^+ = (\Sigma^T\Sigma)^{-1}\Sigma^T$? Yes, it is:

$$ \begin{align*} (\Sigma ^T\Sigma)^{-1} \Sigma ^T &=\begin{bmatrix} \frac{1}{\sigma _1^2} \\ & \ddots \\ & & \frac{1}{\sigma _n^2} \end{bmatrix} \begin{bmatrix} \sigma _1 & & 0 & \cdots & 0 \\ & \ddots \\ & & \sigma _n & \cdots & 0 \end{bmatrix} \\ &=\begin{bmatrix} \frac{1}{\sigma _1} & & 0 & \cdots & 0 \\ & \ddots \\ & & \frac{1}{\sigma _n} & \cdots & 0 \end{bmatrix}\\ &=\Sigma ^+ \end{align*} $$
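
So when A has independent columns, $(A^TA)^{-1}A^T$ and $A^+$ are the same matrix. Here is a one-line numerical confirmation, on a random tall matrix chosen just for the check:

```python
import numpy as np

# A tall matrix with independent columns (full column rank with probability 1).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))

lhs = np.linalg.inv(A.T @ A) @ A.T   # (A^T A)^{-1} A^T from the normal equations
rhs = np.linalg.pinv(A)              # A^+ = V Sigma^+ U^T from the SVD

print(np.allclose(lhs, rhs))         # True: the two formulas agree when the columns are independent
```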

3.2 The Normal Equations and Projection

We can also get the normal equations from geometry.

Suppose A is a 3 by 2 matrix with independent columns, so the rank is 2 and the column space is a plane. Ax=b having no solution means the vector b is not on that plane.

Least squares wants to find $\hat{x}$ to make the error as small as possible. The original vector b, the best vector $A\hat{x}$ in the column space, and the error vector $e = b - A\hat{x}$ form a triangle. We attain the smallest (minimum length) error vector when the error vector is perpendicular to the column space; in other words, we project the vector b onto the plane of the column space.

That's the 3-dimensional case. In higher dimensions I can't provide a geometric proof now; in fact I know little about high dimensional geometry. For now let's believe it's true whatever the dimension is.

For any vector z, Az is in the column space of A and the error vector is perpendicular to Az, so their inner product is zero:

$$ (Az)^T(b-A\hat{x})=z^TA^T(b-A\hat{x})=0 $$

Requiring this for all z forces $A^T(b-A\hat{x})=0$, and that's the normal equations: $A^TA\hat{x}=A^Tb$.

In summary,
1. Normal equation for $\hat{x}$:

$$ A^TA\hat{x} = A^Tb $$

2. Least squares solution to Ax=b:

$$ \hat{x} = (A^TA)^{-1}A^Tb $$

3. Projection of b onto the column space of A:

$$ p = A\hat{x} = A(A^TA)^{-1}A^Tb $$

4. Projection matrix:

$$ P = A(A^TA)^{-1}A^T $$

We can check that $P^2=P$: if we project a second time, the projection stays exactly the same:

$$ \begin{align*} P^2 &=A(A^TA)^{-1}A^TA(A^TA)^{-1}A^T\\ &=A(A^TA)^{-1}A^T\\ &=P \end{align*} $$
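
Here is a tiny numerical check of the projection picture; the 3 by 2 matrix A and the vector b below are illustration values only.

```python
import numpy as np

# The column space of A is a plane in R^3; b lies off the plane.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T   # projection matrix onto the column space
p = P @ b                              # projection of b
e = b - p                              # error vector

print(np.allclose(P @ P, P))           # True: projecting twice changes nothing
print(np.allclose(A.T @ e, 0.0))       # True: the error is perpendicular to every column of A
```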

The projection explanation of the normal equations is really cool!

4 Gram-Schmidt

The famous Gram-Schmidt algorithm is used to produce orthonormal columns, which is very helpful for numerical stability and saves us from roundoff errors, especially when $A^TA$ is nearly singular.

Suppose A has independent columns $a_1, a_2, a_3, \dots, a_n$. We want to replace these columns with orthonormal columns $q_1, q_2, q_3, \dots, q_n$.

4.1 Standard Gram-Schmidt

Gram-Schmidt works by iterations. Iteration i produces the ith orthonormal column $q_i$ from the original ith column $a_i$ by subtracting the projections onto all previously produced columns $q_1, \dots, q_{i-1}$ and then normalizing. For example:

1. Iteration 1

$$ \begin{align*} q_1 &= a_1 \\ q_1 &= \frac{q_1}{||q_1||} \end{align*} $$

2. Iteration 2

$$ \begin{align*} q_2 &= a_2 - (a_2^Tq_1)q_1 \\ q_2 &= \frac{q_2}{||q_2||} \end{align*} $$

3. Iteration 3

$$ \begin{align*} q_3 &= a_3 - (a_3^Tq_1)q_1 - (a_3^Tq_2)q_2 \\ q_3 &= \frac{q_3}{||q_3||} \end{align*} $$

The relationship between the original columns and the orthogonal columns can be expressed in matrix form:

$$ \begin{align*} A&=\begin{bmatrix} \\ a_1 & a_2 & a_3\\ \\ \end{bmatrix} \\ &= \begin{bmatrix} \\ q_1 & q_2 & q_3\\ \\ \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ 0 & r_{22} & r_{23} \\ 0 & 0 & r_{33} \end{bmatrix} \\ &= QR \end{align*} $$

Applying A=QR to the normal equations, we get a new formula for $\hat{x}$:

$$ \begin{align*} R^TQ^TQR\hat{x} &= R^TQ^Tb \\ R^TR\hat{x} &= R^TQ^Tb \\ \hat{x} &= R^{-1}(R^T)^{-1}R^TQ^Tb \\ \mathbf{\hat{x}} &= \mathbf{R^{-1}Q^Tb} \end{align*} $$
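
Below is a minimal sketch of classical Gram-Schmidt and the resulting least squares solution $\hat{x}=R^{-1}Q^Tb$. The function name `gram_schmidt` and the example data are my own, and production code would usually prefer modified Gram-Schmidt or Householder reflections.

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt: return Q with orthonormal columns and upper triangular R, A = QR."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        v = A[:, i].copy()
        for j in range(i):                     # subtract projections onto the earlier q's
            R[j, i] = Q[:, j] @ A[:, i]
            v -= R[j, i] * Q[:, j]
        R[i, i] = np.linalg.norm(v)            # normalize
        Q[:, i] = v / R[i, i]
    return Q, R

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

Q, R = gram_schmidt(A)
x_hat = np.linalg.solve(R, Q.T @ b)            # x_hat = R^{-1} Q^T b
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```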

4.2 Gram-Schmidt with Column Pivoting

The straightforward description of Gram-Schmidt worked with the columns of A in their original order $a_1, a_2, a_3, \dots$ This could be dangerous! We could never live with a code for elimination that doesn't allow row exchanges; roundoff error could wipe us out.

Similarly, each step of Gram-Schmidt should begin with a new column that is as independent as possible of the columns already processed. We need column exchanges to pick the largest remaining column.

I won't weary you with the details of column pivoting; it just adds a little computational cost (column exchanges) to avoid roundoff trouble. Skipping the details doesn't prevent us from understanding the comparison of the 4 ways in the later section, though pivoting is crucial for numerical computation.
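
For completeness, here is a short sketch of QR with column pivoting using scipy's built-in routine, so we don't have to code the exchanges ourselves; the matrix, with a nearly duplicated column, is an artificial example.

```python
import numpy as np
from scipy.linalg import qr

# The third column is almost a copy of the first: a dangerous order for plain Gram-Schmidt.
A = np.array([[1.0, 2.0, 1.0 + 1e-10],
              [1.0, 3.0, 1.0 - 1e-10],
              [1.0, 4.0, 1.0 + 1e-10],
              [1.0, 5.0, 1.0 - 1e-10]])

Q, R, piv = qr(A, mode='economic', pivoting=True)   # pivoted QR: A[:, piv] = Q @ R
print(piv)                                          # column order used: largest remaining column first
print(np.allclose(A[:, piv], Q @ R))                # True
```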

5 Least Squares with a Penalty Term

The fourth approach to least squares is called ridge regression.

$$ \text{Minimize:}\quad ||Ax-b||^2 + \delta^2||x||^2 \\ \text{Solve:}\quad (A^TA+\delta^2I)\hat{x}=A^Tb $$
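
Here is a minimal sketch of solving the penalized normal equations, assuming an arbitrary nearly rank-deficient A and an illustration value for δ:

```python
import numpy as np

# Nearly dependent columns make A^T A nearly singular; the penalty term fixes that.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-8],
              [1.0, 1.0 - 1e-8]])
b = np.array([1.0, 2.0, 3.0])
delta = 1e-3                                    # illustration value; in practice it is tuned

x_ridge = np.linalg.solve(A.T @ A + delta**2 * np.eye(A.shape[1]), A.T @ b)
print(x_ridge)                                  # a stable, moderately sized solution
```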

To be honest, I know relatively little about this approach. It was proposed by people trying to solve machine learning or statistical problems, and the added penalty term produced better results than ordinary least squares on those specific problems.

I won't rewrite this section until I learn more about it.

6 Comparisons

In theory, the normal equation, Gram-Schmidt, and pseudoinverse solutions are equivalent (when A has independent columns); the differences only exist in speed. In practice, it's a totally different story due to roundoff errors. Take a look at this example:

$$ \begin{align*} \Sigma &= \begin{bmatrix} 3 & 0 \\ 0 & 0.0000001 \end{bmatrix} \\ \Sigma ^+ &= \begin{bmatrix} 1/3 & 0 \\ 0 & 10000000 \end{bmatrix} \\ \\ \Sigma &= \begin{bmatrix} 3 & 0 \\ 0 & 0 \end{bmatrix} \\ \Sigma ^+ &= \begin{bmatrix} 1/3 & 0 \\ 0 & 0 \end{bmatrix} \end{align*} $$

In exact arithmetic there's a subtle yet clear difference between 0 and 0.0000001, but their pseudoinverses are wildly different (zero vs. ten million).

However, in a computer this difference (0 vs. 0.0000001) is hard to distinguish from roundoff, so it may produce very different and unacceptable inverses or pseudoinverses that kill us.
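
Numerical libraries handle this with an explicit tolerance. Here is a tiny illustration with numpy's `pinv`, whose `rcond` argument decides when a small singular value is treated as zero; the cutoff values below are arbitrary choices for the demo.

```python
import numpy as np

Sigma = np.diag([3.0, 1e-7])             # the tiny value might be real data or pure roundoff

# Keep every singular value: the tiny one blows up to 1e7.
print(np.linalg.pinv(Sigma, rcond=0.0))

# Treat singular values below 1e-6 * sigma_max as zero: the tiny one is dropped.
print(np.linalg.pinv(Sigma, rcond=1e-6))
```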

Orthogonal vectors provide numerical stability, so the Gram-Schmidt (QR) method is more accurate than the normal equations method. But the operation count of Gram-Schmidt is roughly doubled, so it is slower.
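
The usual explanation is that forming $A^TA$ squares the condition number. The sketch below illustrates this on an ill-conditioned Vandermonde-style matrix, chosen only to stress the two methods.

```python
import numpy as np

# An ill-conditioned tall matrix (polynomial columns 1, t, ..., t^7).
m, n = 50, 8
t = np.linspace(0.0, 1.0, m)
A = np.vander(t, n, increasing=True)
x_true = np.ones(n)
b = A @ x_true

print(np.linalg.cond(A), np.linalg.cond(A.T @ A))   # cond(A^T A) = cond(A)^2

x_normal = np.linalg.solve(A.T @ A, A.T @ b)  # normal equations: work with the squared condition number
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)            # QR: works with A directly

print(np.linalg.norm(x_normal - x_true))      # noticeably larger error
print(np.linalg.norm(x_qr - x_true))          # several orders of magnitude smaller
```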

The pseudoinverse method is a bit more complex. There are two ways to find the pseudoinverse: 1. compute the SVD directly; 2. use elimination, similar to finding an ordinary inverse.

If we find the pseudoinverse by the SVD, we work with the orthogonal matrices U and V, which is good for accuracy, but the computational cost is large: singular values and singular vectors cost more than elimination.

If we find the pseudoinverse by elimination, the speed is OK, but the roundoff error is unacceptable in scenarios where we can't distinguish exact zeros from small nonzeros.

Forgive me, the precise computational costs are still to be added.

7 What's more

Least squares comes from the linear equations Ax=b, which brings up many topics to explore. How about replacing the L2-norm error with the L1 norm? What if A is really huge and there are many more unknown parameters than equations (typical in deep learning)?

And we can also extend the standard least squares to weighted least squares and recursive least squares.

To be continued...

8 References

1. MIT OpenCourseWare 18.065, Prof. Gilbert Strang, 2018.
2. Strang, Gilbert. Linear Algebra and Learning from Data. Wellesley-Cambridge Press, 2019. ISBN: 9780692196380.

Written by Songziyu @China Sep. 2023