This is my final project proposal for MIT OCW 18.065 [1][2]; most of the ideas come from there.
Least squares is a method for solving the linear equations $Ax=b$. Wait a minute, you may ask: why not just use Gaussian elimination? Good question. The situation here is that in theory a solution of $Ax=b$ may not exist, yet in practice we must provide a solution that is as good as possible. The least squares method chooses an $\hat{x}$ that makes the error $||b-A\hat{x}||$ as small as possible.
Pay attention to the different symbol $\hat{x}$, which implies that the solution provided by least squares is not always the same as the original $x$. In other words, when $Ax=b$ is solvable, the least squares solution $\hat{x}$ is the same as $x$: the minimal error is 0, and $\hat{x}$ solves the linear equations $Ax=b$. Otherwise, when $Ax=b$ is unsolvable, the least squares solution $\hat{x}$ is not the same as $x$: the minimum error is greater than 0, and $\hat{x}$ doesn't satisfy the linear equations $Ax=b$.
So let's think about the error function $||Ax-b||^2$: calculus tells us that a function attains its minimum where its derivatives are zero. Setting the derivatives to zero ultimately leads to the so-called normal equation:
$$ A^TA\hat{x}=A^Tb $$

You can find more details in my appendix post: Matrix Calculus and Normal Equations.
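As a quick illustration (not part of the course material), here is a minimal NumPy sketch that solves a small overdetermined system through the normal equations; the matrix `A` and vector `b` are made-up example data:

```python
import numpy as np

# A made-up overdetermined system: 4 equations, 2 unknowns, no exact solution.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 5.0])

# Normal equations: A^T A x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Compare with NumPy's built-in least squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_hat)                          # solution from the normal equations
print(x_lstsq)                        # agrees, since A has independent columns
print(np.linalg.norm(b - A @ x_hat))  # the minimized error ||b - A x_hat||
```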
There are four ways to solve the least squares problem.

1. When $A^TA$ is invertible, a solution is obtained through the normal equation.
2. The Gram-Schmidt way replaces $A$ by $A=QR$ to increase numerical stability; the solution becomes $\hat{x}=R^{-1}Q^Tb$.

The other two methods always work whether $A^TA$ is invertible or not.
3. Adding a penalty term. After adding a penalty term to the normal equations, $(A^TA + \delta^2 I)\hat{x}=A^Tb$, the matrix $A^TA$ is replaced by $A^TA + \delta^2 I$, which is invertible.
4. Using the pseudoinverse.
In this essay, we'll look at what the four ways are and why they are good solutions, and also analyze their computation cost, numerical stability, etc.
The pseudoinverse (aka Moore-Penrose inverse) $A^+$ of a matrix $A$ is something like the inverse of a square matrix; it's the best-known generalization of the inverse. We start with a neat pseudoinverse formula and then illustrate why it's good from different points of view.
The singular value decomposition of any matrix $A$ is $A=U\Sigma V^T$; its pseudoinverse is:

$$ A^+ = V\Sigma^+U^T $$
When $A$ is invertible, the diagonal matrix $\Sigma$ is invertible: all its diagonal elements are non-zero, $\Sigma^+$ is the inverse of $\Sigma$, and $A^+$ is exactly the inverse of $A$. Everything is perfect.
When $A$ is not invertible, or $A$ is not even a square matrix at all, some of the diagonal elements of $\Sigma$ are zero; then the corresponding diagonal elements of $\Sigma^+$ are also zero. For example:

$$ \Sigma = \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad \Sigma^+ = \begin{bmatrix} 1/\sigma_1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} $$
You can see that we try our best to invert the matrix and make the pseudoinverse as close to a real inverse as possible. But we don't stop at this natural understanding; we'll also try to understand it from other perspectives.
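Before moving on, here is a minimal NumPy sketch of the recipe $A^+=V\Sigma^+U^T$; the rank-deficient matrix and the tolerance `tol` (my cutoff for treating a singular value as zero) are example choices of mine, not from the course:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Build A^+ = V Sigma^+ U^T: invert the nonzero singular values, keep zeros as zeros."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])
    return Vt.T @ np.diag(s_plus) @ U.T

# A rank-1, non-square example: one singular value is (numerically) zero.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])

A_plus = pinv_via_svd(A)
print(A_plus)
print(np.allclose(A_plus, np.linalg.pinv(A)))  # matches NumPy's pseudoinverse
```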
Let's revisit a matrix multiplying a vector: $Ax=y$.

If $A$ is invertible, $x$ is in the row space and $y$ is in the column space. $A$ can be thought of as a linear transformation that takes $x$ from the row space to the column space, while the inverse of $A$ does exactly the opposite: multiplying $A^{-1}$ by $y$ takes $y$ from the column space back to the original row space vector $x$.
So you may ask: does the pseudoinverse have similar properties?

In general, $A$ is an m by n matrix, $A^+$ is an n by m matrix, $x$ is an n by 1 vector, and $y$ is an m by 1 vector. Since $A$ is not invertible, there are two cases here:
1. If $x$ is in the row space

We can prove $A^+y=x$; the proof is at: Pseudoinverse Properties and Proof. It shows that the column space vector $y$ is brought back to the original row space vector $x$ by multiplying the pseudoinverse times $y$. This is really a desired property for a pseudoinverse.
2. If $x$ is in the null space

$A$ times $x$ takes a vector $x$ in the null space to zero, namely $Ax=0$; that's simple. What we really need to check is $A^+z$, where $z$ is in the null space of $A^T$:

$$ A^+z = 0 $$
The proof is here: Pseudoinverse Properties and Proof. This shows that for vectors in the null space of $A^T$, which are the part that can't be inverted, the pseudoinverse tries its best: it transforms these vectors to zero, just like what the original matrix does to vectors in its null space. It's reasonable.
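Here is a small numerical check of both cases; the rank-deficient matrix and the test vectors are made-up examples:

```python
import numpy as np

# A made-up 3x2 rank-1 matrix (not invertible); its row space is spanned by [1, 2].
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
A_plus = np.linalg.pinv(A)

# Case 1: x in the row space of A.
x = np.array([1.0, 2.0])
y = A @ x
print(np.allclose(A_plus @ y, x))   # True: A^+ brings y back to x

# Case 2: z in the nullspace of A^T, e.g. z = [2, -1, 0].
z = np.array([2.0, -1.0, 0.0])
print(np.allclose(A.T @ z, 0))      # True: z really is in N(A^T)
print(np.allclose(A_plus @ z, 0))   # True: the pseudoinverse sends z to zero
```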
So the pseudoinverse $A^+=V\Sigma^+U^T$ makes sense from the point of view of the four fundamental subspaces. In addition, the pseudoinverse also produces a solution to least squares.
If $A$ is invertible, the solution of $Ax=b$ is $x=A^{-1}b$. If $A$ is not invertible, will the pseudoinverse solution $x^+=A^+b$ satisfy the least squares condition? Yes, you'll see it; actually it does more than that, which is also the reason we use a different symbol $x^+$ here.
$U^T$ is an orthogonal matrix, so we can multiply by it without changing any length:

$$ ||b-Ax||^2 = ||b-U\Sigma V^Tx||^2 = ||U^Tb-\Sigma V^Tx||^2 $$

Set $\omega = V^Tx$ to get $||U^Tb-\Sigma\omega||^2$. The best $\omega$ is $\Sigma^+U^Tb$.
You can read more details here:
Pseudoinverse Properties and Proof.
Finally, we get

$$ x^+ = V\omega = V\Sigma^+U^Tb = A^+b $$
So the pseudoinverse solution $A^+b$ solves the least squares problem. Moreover, it's also the minimum norm least squares solution.
What does that mean? If there are nonzero vectors $z$ in the nullspace of $A$, they can be added to $x^+$ and the error $||b-A(x^+ + z)||$ stays unchanged because $Az=0$. However, the squared length $||x^+ + z||^2$ will grow to $||x^+||^2 + ||z||^2$, because $x^+$ doesn't contain any component in the nullspace of $A$ and is therefore orthogonal to $z$. You can read more details here: Pseudoinverse Properties and Proof.
In summary, $x^+$ has two properties:

1. $x^+=A^+b$ makes $||b - Ax||$ as small as possible.
2. Among all minimizers of $||b - Ax||$, $x^+$ has the smallest length $||x||$.
The pseudoinverse is great in this regard, although computing the SVD is not an easy task.
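To see both properties numerically, here is a sketch with a rank-deficient matrix; the data and the nullspace vector are hypothetical examples:

```python
import numpy as np

# Rank-deficient A: the second column is twice the first, so the nullspace is nontrivial.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
b = np.array([1.0, 0.0, 1.0])

x_plus = np.linalg.pinv(A) @ b            # minimum norm least squares solution
z = np.array([2.0, -1.0])                 # a nullspace vector: A @ z = 0
x_other = x_plus + z                      # another least squares solution

print(np.linalg.norm(b - A @ x_plus))     # residual of x_plus
print(np.linalg.norm(b - A @ x_other))    # same residual, since A z = 0
print(np.linalg.norm(x_plus), np.linalg.norm(x_other))  # x_plus has the smaller norm
```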
At the very beginning, we got the normal equations. They can produce a solution when $A^TA$ is invertible. So we want to ask: when is $A^TA$ invertible? Is this solution equivalent to the pseudoinverse solution? We'll also try to understand the normal equations from a projection perspective.
The answer is simple: $A$ has independent columns.

$A$ and $A^TA$ always have the same nullspace! This is because $A^TAx = 0$ always leads to $x^TA^TAx=0$, which is $||Ax||^2=0$. Then $Ax=0$ and $x$ is in the nullspace of $A$.
So the rank of $A$ equals the rank of $A^TA$.
When $A$ has independent columns, the nullspace of $A$ contains only the zero vector, so $x^+=A^+b$ is the only solution to least squares. It has to be equivalent to the solution $\hat{x}=(A^TA)^{-1}A^Tb$ derived from the normal equations.
$A$ is an m by n matrix, so $\Sigma$ is an m by n matrix. When $A$ has independent columns, the rank of $A$ is n, and the ranks of $\Sigma$ and $\Sigma^T\Sigma$ are also n. Since $\Sigma^T\Sigma$ is an n by n matrix, it's invertible. And $V$ is an orthogonal matrix with $V^{-1} = V^T$, so

$$ \hat{x}=(A^TA)^{-1}A^Tb = (V\Sigma^T\Sigma V^T)^{-1}V\Sigma^TU^Tb = V(\Sigma^T\Sigma)^{-1}\Sigma^TU^Tb $$

Is $\Sigma^+ = (\Sigma^T\Sigma)^{-1}\Sigma^T$? That's true, so the normal equation solution is exactly $V\Sigma^+U^Tb = A^+b$.
We can also get the normal equations from geometry.
Suppose $A$ is a 3 by 2 matrix with independent columns, namely the rank of the column space is 2, so the column space is a plane. $Ax=b$ having no solution means the vector $b$ is not on that plane.
Least squares wants to find $\hat{x}$ that makes the error as small as possible. The original vector $b$, the best vector in the column space $A\hat{x}$, and the error vector $e = b - A\hat{x}$ form a triangle. We attain the smallest (minimum length, in geometric terms) error vector when the error vector is perpendicular to the column space; in other words, we project the vector $b$ onto the plane of the column space.
That's the 3-dimensional case. In higher dimensions I can't provide a proof now; in fact I know nothing about high-dimensional geometry. For now, let's believe it's true whatever the dimension is.
For any vector $z$, $Az$ is in the column space of $A$ and the error vector is perpendicular to $Az$, so their inner product is zero:

$$ (Az)^T(b-A\hat{x}) = z^TA^T(b-A\hat{x}) = 0 $$

Holding for all $z$, this forces $A^T(b-A\hat{x})=0$, and that's the normal equation: $A^TA\hat{x}=A^Tb$.
In summary:

1. Normal equation for $\hat{x}$:
$$ A^TA\hat{x} = A^Tb $$
2. Least squares solution to $Ax=b$:
$$ \hat{x} = (A^TA)^{-1}A^Tb $$
3. Projection of $b$ onto the column space of $A$:
$$ p = A\hat{x} = A(A^TA)^{-1}A^Tb $$
4. Projection matrix:
$$ P = A(A^TA)^{-1}A^T $$

We can check that $P^2=P$: project a second time and the projection stays exactly the same:
$$ \begin{align*} P^2 &=A(A^TA)^{-1}A^TA(A^TA)^{-1}A^T\\ &=A(A^TA)^{-1}A^T\\ &=P \end{align*} $$

The projection explanation for the normal equation is really cool!
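We can also check the projection picture numerically; the 3 by 2 matrix and the vector $b$ below are hypothetical example data:

```python
import numpy as np

# A 3x2 matrix with independent columns: its column space is a plane in R^3.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T          # projection matrix onto the column space
x_hat = np.linalg.solve(A.T @ A, A.T @ b)     # normal equation solution

print(np.allclose(P @ P, P))                  # P^2 = P: projecting twice changes nothing
print(np.allclose(P @ b, A @ x_hat))          # P b is exactly the projection A x_hat
print(np.allclose(A.T @ (b - P @ b), 0))      # the error b - Pb is perpendicular to the columns
```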
The famous Gram-Schmidt algorithm is used to produce orthogonal columns, which is very helpful for improving numerical stability and saving us from roundoff errors, especially when $A^TA$ is nearly singular.
Suppose $A$ has independent columns $a_1, a_2, a_3, \ldots, a_n$. We want to replace these columns with orthogonal columns $q_1, q_2, q_3, \ldots, q_n$.

Gram-Schmidt works by iterations. Iteration $i$ produces the $i$th orthogonal column $q_i$ from the original $i$th column $a_i$ by subtracting the projection components onto all previously produced orthogonal columns $q_1, \ldots, q_{i-1}$.
For example:
1. Iteration 1
$$ \begin{align*} q_1 &= a_1 \\ q_1 &= \frac{q_1}{||q_1||} \end{align*} $$
2. Iteration 2
$$ \begin{align*} q_2 &= a_2 - (a_2^Tq_1)\,q_1 \\ q_2 &= \frac{q_2}{||q_2||} \end{align*} $$
3. Iteration 3
$$ \begin{align*} q_3 &= a_3 - (a_3^Tq_1)\,q_1 - (a_3^Tq_2)\,q_2 \\ q_3 &= \frac{q_3}{||q_3||} \end{align*} $$

The relationship between the original columns and the orthogonal columns can be expressed in matrix form:
$$ \begin{align*} A&=\begin{bmatrix} \\ a_1 & a_2 & a_3\\ \\ \end{bmatrix} \\ &= \begin{bmatrix} \\ q_1 & q_2 & q_3\\ \\ \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ 0 & r_{22} & r_{23} \\ 0 & 0 & r_{33} \end{bmatrix} \\ &= QR \end{align*} $$
Applying $A=QR$ to the normal equations, we get a new formula for $\hat{x}$:

$$ R^TQ^TQR\hat{x} = R^TQ^Tb \implies R\hat{x} = Q^Tb \implies \hat{x} = R^{-1}Q^Tb $$
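Here is a minimal sketch of classical Gram-Schmidt (no column pivoting yet) that builds $Q$ and $R$ and then solves $R\hat{x} = Q^Tb$; the example matrix and right-hand side are made up:

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt: return Q with orthonormal columns and upper triangular R."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        q = A[:, i].copy()
        for j in range(i):
            R[j, i] = Q[:, j] @ A[:, i]   # projection coefficient onto q_j
            q -= R[j, i] * Q[:, j]        # subtract that projection component
        R[i, i] = np.linalg.norm(q)
        Q[:, i] = q / R[i, i]
    return Q, R

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 5.0])

Q, R = gram_schmidt_qr(A)
x_hat = np.linalg.solve(R, Q.T @ b)       # R x_hat = Q^T b
print(np.allclose(A, Q @ R))              # A = QR
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))
```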
The straightforward description of Gram-Schmidt works with the columns of $A$ in their original order $a_1, a_2, a_3, \ldots$ This could be dangerous! We could never live with a code for elimination that doesn't allow row exchanges; roundoff error could wipe us out.
Similarly, each step of Gram-Schmidt should begin with a new column that is as independent as possible of the columns already processed. We need column exchanges to pick the largest remaining column.
I won't weary you with the details of column pivoting; it just adds a little computational cost (column exchanges) to avoid roundoff trouble. Skipping it doesn't prevent us from comparing the four ways in the later section, though pivoting is crucial for numerical computation.
The fourth approach to least squares is called ridge regression.
$$ Minimize\qquad ||Ax-b||^2 + \delta^2||x||^2\\ Solve\qquad (A^TA+\delta^2I)\hat{x}=A^Tb $$

To be honest, I know nearly nothing about this approach. It was proposed by people trying to solve machine learning or statistical problems, and the added penalty term produces better results than plain least squares on those specific problems.
I won't rewrite this section until I learn more about it.
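Even so, here is a minimal sketch of what solving the penalized normal equations might look like; the data and the penalty $\delta$ are hypothetical, and choosing $\delta$ well is exactly the part I haven't learned yet:

```python
import numpy as np

def ridge_solve(A, b, delta):
    """Solve (A^T A + delta^2 I) x = A^T b; the penalty term makes the matrix invertible."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + delta**2 * np.eye(n), A.T @ b)

# Even a rank-deficient A is fine once the penalty term is added.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
b = np.array([1.0, 0.0, 1.0])

print(ridge_solve(A, b, delta=0.1))
# As delta -> 0, the ridge solution approaches the minimum norm solution A^+ b.
print(ridge_solve(A, b, delta=1e-3))
print(np.linalg.pinv(A) @ b)
```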
In theory, the normal equation, Gram-Schmidt, and pseudoinverse methods give equivalent solutions (when the matrix has independent columns); the differences are only in speed. In practice, it's a totally different story due to roundoff errors. Take a look at this example:
$$ \begin{align*} \Sigma &= \begin{bmatrix} 3 & 0 \\ 0 & 0.000001 \end{bmatrix} \\ \Sigma ^+ &= \begin{bmatrix} 1/3 & 0 \\ 0 & 1000000 \end{bmatrix} \\ \\ \Sigma &= \begin{bmatrix} 3 & 0 \\ 0 & 0 \end{bmatrix} \\ \Sigma ^+ &= \begin{bmatrix} 1/3 & 0 \\ 0 & 0 \end{bmatrix} \end{align*} $$
In math, there's a small yet clear difference between $0$ and $0.000001$, and their pseudoinverses are very different (zero vs. one million). However, in a computer this difference ($0$ vs. $0.000001$) is hard to distinguish, so it may produce very different and unacceptable inverses or pseudoinverses that kill us.
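NumPy's `pinv` exposes exactly this judgment call through its `rcond` tolerance: singular values below the cutoff are treated as zeros instead of being inverted. A small sketch, reusing the nearly singular diagonal matrix from the example above:

```python
import numpy as np

# A Sigma-like diagonal matrix with one tiny singular value.
S = np.diag([3.0, 0.000001])

# The default tolerance keeps the tiny value and inverts it into a huge entry.
print(np.linalg.pinv(S))

# A larger rcond treats anything below 1e-5 * 3 as "really zero",
# giving the well-behaved pseudoinverse instead.
print(np.linalg.pinv(S, rcond=1e-5))
```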
Orthogonal vectors provide numerical stability, so the Gram-Schmidt method is more accurate than the normal equations method. But the operation count of Gram-Schmidt is roughly doubled, so it is slower.
The pseudoinverse method is a little more complex. There are two ways to find the pseudoinverse: 1. compute the SVD directly; 2. use elimination, similar to finding an inverse.
If we find the pseudoinverse by the SVD, we work with the orthogonal matrices $U$ and $V$, which is good for accuracy, but the computation cost is large: singular values and singular vectors cost more than elimination.

If we find the pseudoinverse by elimination, the speed is fine, but the roundoff error is unacceptable in scenarios where we can't distinguish exact zeros from small nonzeros.
Forgive me, the exact computation costs are still to be added.
Least squares comes from the linear equations $Ax=b$, which bring about many topics to explore. How about replacing the L2-norm error with the L1 norm? What if $A$ is really huge and the unknown parameters far outnumber the equations (typical in deep learning)?
And we can also extend the standard least squares to weighted least squares and recursive least squares.
To be continued...
1. MIT OpenCourseWare 18.065, Prof. Gilbert Strang, 2018.
2. Strang, Gilbert. Linear Algebra and Learning from Data. Wellesley-Cambridge Press, 2018. ISBN: 9780692196380.
Written by Songziyu @China Sep. 2023