10 Conjugate Gradient
using Plots, LaTeXStrings, LinearAlgebra, IterativeSolvers
Lecture 10
- Recap
- Steepest Descent
- Conjugate Gradient
I encourage you to play around with the Jupyter notebook for this lecture - you can copy the code using the notebook button on the side of this page.
These notes are mainly a record of what we discussed and are not a substitute for attending the lectures and reading books! If anything is unclear/wrong, let me know and I will update the notes.
This lecture is mostly based on Burden, Faires, and Burden (2015) section 7.6.
10.1 Recap
Last semester, we looked at direct methods for solving Ax = b which involved computing the (P)LU decomposition of A (i.e. PA = LU where P is a permutation matrix and L/U are lower/upper triangular) and doing a forwards-backwards substitution:
\begin{align} Ly &= Pb \nonumber\\ Ux &= y. \end{align}
In general, this requires O(N^3) operations. Moreover, when A is sparse, this is inefficient because L, U need not be sparse. In this chapter we are looking at methods that ideally converge in fewer than O(N^3) operations and can work with matrices stored in a sparse format (i.e. that involve only matrix-vector multiplications).
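To see why matrix-vector products are the right primitive, here is a small sketch using Julia's SparseArrays standard library (the tridiagonal matrix is purely illustrative): the product only touches the stored nonzeros, so it costs O(nnz) rather than O(N^2).

```julia
using SparseArrays, LinearAlgebra

N = 1000
# Sparse tridiagonal matrix: only the 3N - 2 nonzeros are stored
A = spdiagm(-1 => -ones(N-1), 0 => 2*ones(N), 1 => -ones(N-1))
x = ones(N)
y = A * x          # matrix-vector product in O(nnz) operations
nnz(A)             # 3N - 2 stored entries, not N^2
```

By contrast, the LU factors of a sparse matrix can fill in with many more nonzeros, which is exactly what iterative methods avoid.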
Last time, we saw simple examples of iterative techniques for solving Ax = b:
\begin{align} x^{(k+1)} = T x^{(k)} + c \tag{*} \end{align}
If \rho(T) < 1, then, for all x^{(0)} \in \mathbb R^n, (*) converges to the unique solution x = (I-T)^{-1}c as k\to\infty.
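This criterion is easy to check numerically. Below is a sketch with a small hypothetical iteration matrix T (chosen purely for illustration; `iterate_fp` is a helper name invented here):

```julia
using LinearAlgebra

T = [0.5 0.1; 0.2 0.4]
c = [1.0, 2.0]
ρ = maximum(abs.(eigvals(T)))    # spectral radius; here 0.6 < 1

# Run x^{(k+1)} = T x^{(k)} + c for K steps
function iterate_fp(T, c, x, K)
    for _ in 1:K
        x = T*x + c
    end
    return x
end

x100 = iterate_fp(T, c, zeros(2), 100)
x_exact = (I - T) \ c            # the fixed point x = (I-T)^{-1} c
```

Since ρ(T) = 0.6 < 1, the error shrinks roughly like 0.6^k, and after 100 steps x100 agrees with x_exact to machine precision.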
Examples: Jacobi, Gauss-Seidel, SOR. These are examples of so-called splitting methods: we write A = P - N where P (the preconditioner) is “easy” to invert. Then, we write Ax = b as Px = Nx + b and define the iteration as x^{(k+1)} = P^{-1}Nx^{(k)} + P^{-1} b.
Exercise 10.1 What is P and N for Jacobi, Gauss-Seidel, SOR?
Notice that this method can also be written:
\begin{align} Px^{(k+1)} &= (P-A) x^{(k)} + b \nonumber\\ &= P x^{(k)} + (b - Ax^{(k)}). \end{align}
Therefore, we can write
\begin{align} x^{(k+1)} = x^{(k)} + P^{-1} r^{(k)} \end{align}
where r^{(k)} := b - Ax^{(k)} is the residual at step k.
function splitting(A, b, x; P=nothing, Niter=100, tol=1e-10)
    X = [x]
    if isnothing(P)
        P = Diagonal(diag(A))   # default: Jacobi splitting
    end
    if isapprox(det(P), 0; atol=1e-12)
        @warn "P is singular"
        return X
    end
    iP = inv(P)
    r = b - A*x
    for i ∈ 1:Niter
        push!(X, X[end] + iP*r)
        r = b - A*X[end]
        if norm(r) < tol
            return X
        end
    end
    @warn "max iterations exceeded"
    return X
end;
Exercise 10.2 Why do we stop the iteration when the residual is small rather than the error?
Example 10.1 Applying the Jacobi and Gauss-Seidel iterations to Ax = b with
b = [6, 25, -11, 15];
A = [10 -1 2 0;
     -1 11 -1 3;
     2 -1 10 -1;
     0 3 -1 8]
L = -tril(A, -1)
D = Diagonal(diag(A))
x0 = [0., 0., 0., 0.];
gives:
X = splitting(A,b,x0)           # Jacobi
Y = splitting(A,b,x0; P=D-L)    # Gauss-Seidel
ω = 1.15
Z = splitting(A,b,x0; P=D-ω*L); # SOR
In fact, we have the following error curves, with spectral radii:
(ρJ, ρGS, ρSOR) = (0.42643661084234186, 0.08982305838804323, 0.17351995310029997)
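These spectral radii can be reproduced directly (a sketch; each splitting A = P - N has iteration matrix T = P^{-1}N = I - P^{-1}A, and `spectral_radius` is a helper name introduced here):

```julia
using LinearAlgebra

A = [10 -1 2 0; -1 11 -1 3; 2 -1 10 -1; 0 3 -1 8]
D = Diagonal(diag(A))
L = -tril(A, -1)
ω = 1.15

# ρ(T) for the iteration matrix T = I - P^{-1}A of the splitting A = P - N
spectral_radius(P) = maximum(abs.(eigvals(I - inv(Matrix(P))*A)))

ρJ   = spectral_radius(D)        # Jacobi:       P = D
ρGS  = spectral_radius(D - L)    # Gauss-Seidel: P = D - L
ρSOR = spectral_radius(D - ω*L)  # SOR:          P = D - ωL
```

All three radii are below 1, so all three iterations converge; the smaller radii of Gauss-Seidel and SOR explain their faster convergence on this example.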
Exercise 10.3 Does the iteration x^{(k+1)} = (I - A) x^{(k)} + b converge in this case? What are P and N?
10.2 Steepest Descent
We now look at gradient-based methods: Notice that, if A is symmetric and positive definite (SPD) (that is, symmetric and x^TAx >0 for all x \not=0), then Ax = b is the unique minimiser of
\begin{align} g(x) = \frac{1}{2} x^T A x - b^T x. \end{align}
People use lots of different notation for the inner product (also called the dot product):
\begin{align} x^T y = x \cdot y = (x,y) = \langle x, y \rangle = \sum_i x_i y_i. \end{align}
Exercise 10.4 Show that
\begin{align} A = \begin{pmatrix} 5 & 1 \\ 1 & 3 \end{pmatrix} \end{align}
is symmetric and positive definite.
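A quick numerical sanity check (not a proof of the exercise!) is to verify symmetry and that both eigenvalues, 4 ± √2, are positive:

```julia
using LinearAlgebra

A = [5 1; 1 3]
issymmetric(A)       # true
eigvals(A)           # ≈ [2.586, 5.414], both > 0
isposdef(A)          # Cholesky-based positive definiteness check: true
```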
The level curves of g when A is SPD look like this:
A = [5 1; 1 3]; b = [1; 3]; ξ = A \ b
g1(x,y) = dot([x;y], A*[x;y])/2 - dot([x;y], b)
x_range = range(ξ[1]-1,ξ[1]+1, length=100)
y_range = range(ξ[2]-1,ξ[2]+1, length=100)
contour( x_range, y_range, g1,
fill=true,
levels=50,
title=L"Level sets $\{x : g(x) = c\}$",
xlabel="x", ylabel="y", primary=false,
colorbar=true)
scatter!( [ξ[1]], [ξ[2]], color=:white, label=L"solution $Ax=b$")
Notice that \nabla g(x) is a vector that points in the direction of steepest increase of g. Therefore, in order to move closer to the minimiser, the first thing you might consider is
\begin{align} x^{(k+1)} = x^{(k)} - \alpha^{(k)} \nabla g(x^{(k)}). \tag{GD} \end{align}
Here, \alpha^{(k)} is the step size and \nabla g(x^{(k)}) is the search direction.
Exercise 10.5 Show that \nabla g(x) = Ax - b and so we can rewrite \text{(GD)} as x^{(k+1)} = x^{(k)} + \alpha^{(k)} r^{(k)} where r^{(k)} is the residual.
We may choose \alpha^{(k)} to be a minimiser of g along the search direction:
\begin{align} g( x^{(k)} + \alpha^{(k)} r^{(k)} ) = \min_{t} g( x^{(k)} + t r^{(k)} ). \end{align}
Since g is quadratic, one can compute \alpha^{(k)} explicitly:
\begin{align} \alpha^{(k)} = \frac {r^{(k)T} r^{(k)}} {r^{(k)T} Ar^{(k)}} =: \frac{ (r^{(k)}, r^{(k)}) }{(r^{(k)}, r^{(k)})_A }. \end{align}
Here, we have adopted the notation (x,y) = x^T y and defined (x,y)_A := x^T A y.
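For completeness, the formula for \alpha^{(k)} follows by differentiating g along the line and setting the derivative to zero:

\begin{align}
\frac{d}{dt}\, g\big( x^{(k)} + t\, r^{(k)} \big)
= r^{(k)T}\big( A(x^{(k)} + t\, r^{(k)}) - b \big)
= -(r^{(k)}, r^{(k)}) + t\, (r^{(k)}, r^{(k)})_A = 0,
\end{align}

whose unique solution is t = \alpha^{(k)} (the denominator is positive since A is SPD and r^{(k)} \not= 0).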
Here, we have implemented steepest descent:
function gradientDescent(A,b,x0; Niter=100, xlims=[-25,25], ylims=[-25,25])
g(x,y) = dot([x;y], A*[x;y])/2 - dot([x;y], b)
Dg(x,y) = A*[x;y] - b
x_range = range(xlims[1], xlims[2], length=100)
y_range = range(ylims[1], ylims[2], length=100)
P = contour(x_range, y_range, g,
fill=true,
levels=50,
title="Steepest Descent",
xlabel="x", ylabel="y", legend=false,
colorbar=true)
anim = @animate for i ∈ 0:Niter
if (i==0)
scatter!(P, [x0[1]], [x0[2]], markersize=5, color=:white)
else
r = b - A*x0
if (norm(r) < 1e-8)
break;
end
# step size
α = dot(r,r)/dot(r,A*r)
x1 = x0 - α*Dg(x0[1],x0[2])
plot!(P, [x0[1],x1[1]], [x0[2],x1[2]], color=:white, lw=2)
scatter!(P, [x1[1]], [x1[2]], markersize=5, color=:white)
x0 = x1
end
end
return anim
end
A = [5 1; 1 3]; b = [1; 3]; x0 = [25;-15]; ξ = A \ b
gif( gradientDescent(A,b,x0), fps=1 )
What’s the problem with this? The steepest descent direction is only the best search direction locally. It can happen that the iteration zig-zags:
A = [5 0 ; 0 2]; b = [10; 3]; x0 = [15;20]
gif( gradientDescent(A,b,x0), fps=1 )
In higher dimensions this problem gets worse! In fact, if the condition number is large (roughly speaking, this means the widths of the principal axes of the hyperellipsoid level sets of g vary a lot, with some directions very thin and some very wide), then the performance of steepest descent is very poor.
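To make this concrete, here is a minimal (non-plotting) steepest-descent loop that counts iterations until a small residual; the diagonal matrices, the tolerance, and the name `sd_iters` are all illustrative. The ill-conditioned system needs many more iterations.

```julia
using LinearAlgebra

# Count steepest-descent iterations until norm(r) < tol
function sd_iters(A, b; x=zero(b), tol=1e-8, maxit=100_000)
    r = b - A*x
    for k in 1:maxit
        norm(r) < tol && return k - 1
        α = dot(r, r) / dot(r, A*r)   # exact line search
        x += α*r
        r = b - A*x
    end
    return maxit
end

k_well = sd_iters([5.0 0 0; 0 2 0; 0 0 1], ones(3))    # κ(A) = 5
k_ill  = sd_iters([100.0 0 0; 0 2 0; 0 0 1], ones(3))  # κ(A) = 100
```

The classical error bound for steepest descent (stated without proof) decays like ((κ-1)/(κ+1))^k in the A-norm, so for κ = 100 each step only shrinks the error by a factor of about 0.98.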
Idea: choose better search directions.
10.3 Conjugate gradient
Consider
\begin{align} x^{(k+1)} = x^{(k)} + \alpha^{(k)} v^{(k)} \end{align}
for some step size \alpha^{(k)} and search direction v^{(k)}.
This gives Ax^{(k+1)} = Ax^{(k)} + \alpha^{(k)} Av^{(k)} and thus
\begin{align} r^{(k+1)} = r^{(k)} - \alpha^{(k)} A v^{(k)} \end{align}
where r^{(k)} = b - Ax^{(k)} is the residual.
10.3.1 What \alpha^{(k)}?
Suppose we have a search direction v^{(k)}. Again, we can choose \alpha^{(k)} that minimises g(x^{(k)} + \alpha^{(k)} v^{(k)} ) over the space \{ x^{(k)} + t v^{(k)} : t \in \mathbb R \}. Since
\begin{align} g( x^{(k)} + t v^{(k)} ) &= g(x^{(k)}) + t v^{(k)T} ( Ax^{(k)} - b ) + \frac{t^2}{2} v^{(k)T}Av^{(k)} \nonumber\\ &= g(x^{(k)}) - t (v^{(k)}, r^{(k)}) + \frac{t^2}{2} (v^{(k)},v^{(k)})_A, \end{align}
we have
\begin{align} \alpha^{(k)} := \frac {(v^{(k)}, r^{(k)})} {(v^{(k)},v^{(k)})_A}. \tag{$\alpha$} \end{align}
10.3.2 What v^{(k)}?
Exercise 10.6 Show that, if \alpha^{(k)} = \frac {(v^{(k)}, r^{(k)})} {(v^{(k)},v^{(k)})_A}, then r^{(k+1)} is orthogonal to v^{(k)} (that is, (v^{(k)},r^{(k+1)}) = 0).
Definition 10.1 We say \{ v^{(j)} \} is A-orthogonal (or A-conjugate) if (v^{(j)}, v^{(k)})_A = 0 for all j\not= k.
It turns out to be a good idea to try to choose v^{(j)} for j \leq k such that r^{(k+1)} is orthogonal to v^{(0)}, \cdots, v^{(k)}. That is, we try to choose the v^{(j)} so that the steepest descent direction points into a “previously unsearched” subspace. By Exercise 10.6 (the choice of \alpha^{(k)}), we have (v^{(k)},r^{(k+1)}) = 0. Moreover, since r^{(k+1)} = r^{(k)} - \alpha^{(k)} Av^{(k)}, we have
\begin{align} (v^{(k-1)}, r^{(k+1)}) &= (v^{(k-1)}, r^{(k)}) - \alpha^{(k)} (v^{(k-1)}, v^{(k)})_A \nonumber\\ &= - \alpha^{(k)} (v^{(k-1)}, v^{(k)})_A \end{align}
and so (v^{(k-1)}, r^{(k+1)})= 0 if and only if (v^{(k-1)}, v^{(k)})_A = 0. More generally, we have
Lemma 10.1 Suppose that \alpha^{(k)} = \frac {(v^{(k)}, r^{(k)})} {(v^{(k)},v^{(k)})_A}. Then, (v^{(j)}, r^{(k+1)}) = 0 for all j \leq k if and only if \{ v^{(j)} \} is A-orthogonal.
Any A-orthogonal choice of search directions minimises g on increasing subspaces:
Theorem 10.1 Suppose that \alpha^{(k)} = \frac {(v^{(k)}, r^{(k)})} {(v^{(k)},v^{(k)})_A} and \{v^{(j)}\} is A-orthogonal. Then, x^{(k+1)} = x^{(k)} + \alpha^{(k)} v^{(k)} minimises g(x) = \frac{1}{2}x^T Ax - b^Tx over the space
\begin{align} &x^{(0)} + \mathrm{span}\big\{ v^{(0)}, v^{(1)}, \cdots, v^{(k)} \big\} \nonumber\\ &= \left\{ x^{(0)} + c_0 v^{(0)} + \cdots + c_k v^{(k)} : c_0,\cdots,c_k \in \mathbb R \right\}. \end{align}
10.3.3 How to choose \{ v^{(j)} \}?
(If you know about Gram-Schmidt (with respect to the inner product (\cdot,\cdot)_A), then that is one option).
It turns out that you can obtain A-orthogonal search directions by writing the new search direction as a linear combination of the residual and previous search direction: suppose that
\begin{align} v^{(k+1)} = r^{(k+1)} + \beta^{(k)} v^{(k)}. \end{align}
Exercise 10.7 Show that, if \{v^{(j)}\} is A-orthogonal and v^{(k+1)} = r^{(k+1)} + \beta^{(k)} v^{(k)}, then
\begin{align} \beta^{(k)} = - \frac {(v^{(k)},r^{(k+1)})_A} {(v^{(k)}, v^{(k)})_A} \end{align}
Hint: consider (v^{(k)},v^{(k+1)})_A = 0
function CG(A,b,x0; Niter=5, xlims=[-25,25], ylims=[-25,25])
g(x,y) = dot([x;y], A*[x;y])/2 - dot([x;y], b)
x_range = range(xlims[1], xlims[2], length=100)
y_range = range(ylims[1], ylims[2], length=100)
P = contour(x_range, y_range, g,
fill=true,
levels=50,
title="Conjugate Gradient",
xlabel="x", ylabel="y", legend=false,
colorbar=true)
r = b - A*x0
v = b - A*x0
anim = @animate for i ∈ 0:Niter
if (i==0)
scatter!(P, [x0[1]], [x0[2]], markersize=5, color=:white)
else
if (norm(r) < 1e-8)
break;
end
# step size
α = dot(v,r)/dot(v,A*v)
# new point
x1 = x0 + α*v
plot!(P, [x0[1],x1[1]], [x0[2],x1[2]], color=:white, lw=2)
scatter!(P, [x1[1]], [x1[2]], markersize=5, color=:white)
# residual
r = b - A*x1
# new search direction
β = - dot(v,A*r)/dot(v,A*v)
v = r + β*v
x0 = x1
end
end
return anim
end
A = [5 1; 1 3]; b = [1; 3]; x0 = [25;25]; ξ = A \ b
gif( CG(A,b,x0), fps=1 )
A = [5 0 ; 0 2]; b = [10; 3]; x0 = [15;20]
display( gif( gradientDescent(A,b,x0), fps=1 ) )
gif( CG(A,b,x0), fps=1 )
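The CG function above is specialised to 2-D plotting. A plain sketch of the same iteration for a general SPD system, using the update formulas derived in this lecture, might look like the following (the function name is invented here; in practice you would use a tested implementation such as `cg` from IterativeSolvers.jl):

```julia
using LinearAlgebra

# Conjugate gradient for SPD A, following the α and β formulas above
function conjugate_gradient(A, b; x=zero(b), tol=1e-10, maxit=2*length(b))
    r = b - A*x
    v = copy(r)                       # first search direction: the residual
    for k in 1:maxit
        norm(r) < tol && break
        Av = A*v
        α = dot(v, r) / dot(v, Av)    # exact line search along v
        x += α*v
        r -= α*Av                     # r^{(k+1)} = r^{(k)} - α A v^{(k)}
        β = -dot(v, A*r) / dot(v, Av) # keeps the next direction A-orthogonal
        v = r + β*v
    end
    return x
end

A = [10 -1 2 0; -1 11 -1 3; 2 -1 10 -1; 0 3 -1 8]
b = [6.0, 25, -11, 15]
x = conjugate_gradient(Float64.(A), b)
```

In exact arithmetic CG terminates in at most n = 4 steps for this system, since the search directions span a strictly growing sequence of subspaces.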
- Assignment 3: Due Wednesday 11:59 PM
- Sign up for office hours if you would like
Chapter 1 reading:
- Burden, Faires, and Burden (2015) sections 5.1 - 5.6,
- Driscoll and Braun (2018) zero stability
Chapter 2 reading:
- Recap matrices: sections 7.1 - 7.2 of Burden, Faires, and Burden (2015)
- Jacobi & Gauss-Seidel: section 7.3 Burden, Faires, and Burden (2015)
- Today: section 7.6 Burden, Faires, and Burden (2015)
Next: other Krylov subspace methods