These notes are mainly a record of what we discussed and are not a substitute for attending the lectures and reading books! If anything is unclear/wrong, let me know and I will update the notes.
19.1 Singular Value Decomposition
Recall that, if A is symmetric (and hence normal, i.e. A^TA = AA^T), then there exist a diagonal \Lambda and an orthogonal P (i.e. P^{-1} = P^T) such that A = P \Lambda P^T. [This is an eigenvalue decomposition of A: the entries of \Lambda are the eigenvalues and the columns of P are the associated eigenvectors.]
using LinearAlgebra

# A is symmetric and therefore normal
A = [1 -1; -1 1]
display( A'A - A*A' )
# compute the eigenvalues and eigenvectors of A
λ, P = eigen(A)
For some matrices, a decomposition of this form (with orthogonal P) is impossible:
A = [42 1 ; 0 -101]
# eigenvectors
λ, P = eigen(A)
display( A*P - P*diagm(λ) )
# but P is not orthogonal
display( P )
# in this case A is not normal!
display( A'A - A*A' )
2×2 Matrix{Float64}:
0.0 0.0
0.0 0.0
2×2 Matrix{Float64}:
-0.00699284 1.0
0.999976 0.0
2×2 Matrix{Int64}:
-1 143
143 1
A singular value decomposition (SVD) of A \in \mathbb R^{m\times n} is a factorization of the form A = P \Sigma Q^T where P \in \mathbb R^{m\times m}, \Sigma \in \mathbb R^{m\times n}, Q \in \mathbb R^{n\times n}, with P, Q orthogonal and \Sigma diagonal. We normally order the diagonal entries of \Sigma as \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r \geq 0 where r := \min\{m,n\}.
P, σ, Q = svd(A)
Σ = diagm( σ )
display( Σ )
# A can be written as a singular value decomposition
display( P * Σ * Q' )
# P orthogonal
display( P'P )
# Q orthogonal
display( Q'Q )
2×2 Matrix{Float64}:
101.006 0.0
0.0 41.9975
2×2 Matrix{Float64}:
42.0 1.0
1.11022e-16 -101.0
2×2 Matrix{Float64}:
1.0 0.0
0.0 1.0
2×2 Matrix{Float64}:
1.0 0.0
0.0 1.0
Example.
Find the SVD of A = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.
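(A candidate answer can be checked numerically; a minimal sketch, where the keyword full=true asks Julia's svd for the full rather than the thin factorization.)

A = reshape([1.0, 0.0], 2, 1)   # the 2×1 matrix (1; 0)
P, σ, Q = svd(A; full=true)
display( P )                    # 2×2 orthogonal (here the identity, up to signs)
display( σ )                    # single singular value σ₁ = 1
display( Q )                    # 1×1 orthogonal, i.e. (±1)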
Theorem.
The eigenvalues of A^TA are real, \geq 0, and the largest r = \min\{m,n\} of them are the squares of the singular values of A.
Proof.
Notice that if A = P \Sigma Q^T, then
\begin{align}
A^T A &= ( P \Sigma Q^T )^T ( P \Sigma Q^T ) \nonumber\\
%
&= Q \Sigma^T P^T P \Sigma Q^T \nonumber\\
%
&= Q \Sigma^T \Sigma Q^T
\end{align}
and so A^TA = Q (\Sigma^T\Sigma) Q^T is an eigenvalue decomposition of A^TA: its eigenvalues are the diagonal entries of \Sigma^T\Sigma, which are exactly the squares of the singular values of A (padded with zeros if n > m), and hence real and \geq 0.
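(As a quick numerical sanity check, a sketch using the matrix A from above:)

A = [42 1 ; 0 -101]
display( eigvals(Symmetric(A'A)) )   # eigenvalues of A^T A, in increasing order (real and ≥ 0)
display( sort(svdvals(A).^2) )       # squares of the singular values of A, in increasing order

Note also that the SVD expresses A as a sum of rank-one matrices: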
\begin{align}
A = P \Sigma Q^T
%
= ... = \sum_{j=1}^r \sigma_j u_j v_j^T
\end{align}
where u_j is the j^\text{th} column of P and v_j is the j^\text{th} column of Q. Since (\sigma_j) is decreasing, later terms in the sum have less importance. This fact is used for dimension reduction (principal component analysis). That is, we make the approximation
\begin{align}
A \approx A_k := \sum_{j=1}^k \sigma_j u_j v_j^T
\end{align}
where k \ll r. Since the rank of A_k is \leq k, this is a low rank approximation to A.
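(A minimal sketch of this on a random matrix; the names B, Bₖ and the choice k = 5 are just for illustration.)

B = randn(50, 40)
U, σ, V = svd(B)
k = 5
Bₖ = sum( σ[j] * U[:,j] * V[:,j]' for j in 1:k )   # rank-k truncation of B
println( rank(Bₖ) )                                # at most k
println( opnorm(B - Bₖ), "  ", σ[k+1] )            # the operator-norm error equals σ_{k+1}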
We will now see an example in image compression. First, we load a bunch of images and choose Monty:
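(A sketch of how the images might be loaded, assuming the Images.jl and FileIO.jl packages; the folder name "figures" and the index 4 are placeholders for whichever files and image are used in the lectures.)

using Images, FileIO

img = [ load(f) for f in readdir("figures"; join=true) ]   # load every image file in the folder
I = img[4]                                                 # pick one of them (Monty)
m, n = size( I )
I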
You can manipulate images just like matrices: e.g.
I'                       # transpose the image
I[:, reverse(1:end)]     # mirror it left-right by reversing the column order
I[400:600, 300:525]      # crop a rectangular region
Colours are made up of red, green, blue (RGB) channels and you can convert the image to grayscale too. (Grayscale turns out to be the weighted average 0.299 * R + 0.587 * G + 0.114 * B of the RGB channels.)
z = zeros(m, n);
R, G, B, A = Float64.(red.(I)), Float64.(green.(I)), Float64.(blue.(I)), Float64.(Gray.(I));
[ RGB.(R, G, B) RGB.(R, z, z) RGB.(z, G, z) RGB.(z, z, B) Gray.(I) ]
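(We can check the weighted-average claim directly; a minimal sketch, where any small discrepancy comes from the 8-bit quantisation of the stored pixel values.)

maximum( abs.( A .- (0.299 .* R .+ 0.587 .* G .+ 0.114 .* B) ) )   # close to zero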
A is now the grayscale image, represented as an m \times n matrix of values between 0 and 1:
We approximate the grayscale image A by its truncation A_k = \sum_{j=1}^k \sigma_j u_j v_j^T, where k \ll r. We will see that, if the singular values decay fast enough, then the error committed in this approximation is small, leading to an image file of a much smaller size.
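(A sketch of how such a truncation can be formed and displayed; clamping to [0,1] guards against small overshoots in the floating-point arithmetic.)

P, σ, Q = svd(A)                               # SVD of the grayscale image
k = 100
Aₖ = P[:, 1:k] * diagm(σ[1:k]) * Q[:, 1:k]'    # rank-k approximation of A
Gray.( clamp.(Aₖ, 0, 1) )                      # view the compressed image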
The image of Monty looks pretty good for k = 100. This corresponds to \approx 20\% of the original data storage requirements:
k = 100;
println( "pixels = $m x $n = ", m*n )
println( "PCA amount of data (with k = $k) = ", k * (m+n+1) )
println( "ratio = ", (k*(m+n+1))/(m*n) )
pixels = 839 x 1152 = 966528
PCA amount of data (with k = 100) = 199200
ratio = 0.206098529996027
One may define the error between two images A and B to be \max_{i,j} |A_{ij} - B_{ij}| (so that no pixel value differs by more than the error). There are other choices of norm which may make more sense in different contexts (in fact, it turns out that A_k is the best rank-k approximation to A with respect to the operator norm). Here, we plot the error:
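(A sketch of how these errors can be computed; actually plotting them would need a plotting package such as Plots.jl, so here the values are simply printed for a few choices of k.)

P, σ, Q = svd(A)
for k in (10, 25, 50, 100, 200)
    Aₖ = P[:, 1:k] * diagm(σ[1:k]) * Q[:, 1:k]'
    println( "k = $k:  max error = ", maximum( abs.(A - Aₖ) ) )
end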
We can then create a compressed colour image from the low rank approximations R_{k_1}, G_{k_2}, B_{k_3} corresponding to the RGB channels, where k = k_1 + k_2 + k_3:
I = img[4]
m, n = size( I )
z = zeros(m, n);
R, G, B, A = Float64.(red.(I)), Float64.(green.(I)), Float64.(blue.(I)), Float64.(Gray.(I));
[ RGB.(R, G, B) RGB.(R, z, z) RGB.(z, G, z) RGB.(z, z, B) Gray.(I) ]
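(A sketch of the reconstruction described above; the helper lowrank and the choices k_1 = k_2 = k_3 = 50 are illustrative and not from the original notes.)

# rank-k truncation of a matrix via the SVD
function lowrank(X, k)
    P, σ, Q = svd(X)
    return P[:, 1:k] * diagm(σ[1:k]) * Q[:, 1:k]'
end

k₁ = k₂ = k₃ = 50
Rₖ, Gₖ, Bₖ = lowrank(R, k₁), lowrank(G, k₂), lowrank(B, k₃)
RGB.( clamp.(Rₖ, 0, 1), clamp.(Gₖ, 0, 1), clamp.(Bₖ, 0, 1) )   # colour image rebuilt from the three channels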