10  Conjugate Gradient

Lecture 10

Published

February 23, 2026

Overview
  1. Recap
  2. Steepest Descent
  3. Conjugate Gradient
Note

I encourage you to play around with the Jupyter notebook for this lecture - you can copy the code with the "this notebook" button on the side of this page.

Warning

These notes are mainly a record of what we discussed and are not a substitute for attending the lectures and reading books! If anything is unclear/wrong, let me know and I will update the notes.

Tip

This lecture is mostly based on section 7.6 of Burden, Faires, and Burden (2015).

using Plots, LaTeXStrings, LinearAlgebra, IterativeSolvers

10.1 Recap

Last semester, we looked at direct methods for solving Ax = b, which involve computing the (P)LU decomposition of A (i.e. PA = LU, where P is a permutation matrix and L/U are lower/upper triangular) and doing a forward and backward substitution:

\begin{align} Ly &= Pb \nonumber\\ Ux &= y. \end{align}

In general, this requires O(N^3) operations. Moreover, when A is sparse, this is inefficient because L and U need not be sparse. In this chapter we look at methods that ideally converge in fewer than O(N^3) operations and can work with matrices stored in a sparse format (i.e. methods that involve only matrix-vector multiplications).
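To see why only needing matrix-vector products matters, here is a small sketch (the tridiagonal matrix is just an illustration, not from the lecture) using Julia's standard SparseArrays: a matrix-vector product only touches the stored nonzeros, so it costs O(nnz(A)) rather than O(N^2).

```julia
using SparseArrays, LinearAlgebra

# illustrative sparse tridiagonal matrix, stored in CSC format
N = 1000
A = spdiagm(-1 => -ones(N-1), 0 => 2ones(N), 1 => -ones(N-1))
x = ones(N)

# A*x only touches the nnz(A) ≈ 3N stored entries, not all N² of them
y = A*x
```

Here nnz(A) = 3N - 2, so the matvec is ~3000 multiplications instead of a million.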

Last time, we saw simple examples of iterative techniques for solving Ax = b:

\begin{align} x^{(k+1)} = T x^{(k)} + c \tag{*} \end{align}

If \rho(T) < 1, then, for all x^{(0)} \in \mathbb R^n, (*) converges to the unique solution x = (I-T)^{-1}c as k\to\infty.

Examples: Jacobi, Gauss-Seidel, SOR. These are examples of so-called splitting methods: we write A = P - N where P (the preconditioner) is “easy” to invert. Then, we write Ax = b as Px = Nx + b and define the iteration as x^{(k+1)} = P^{-1}Nx^{(k)} + P^{-1} b.

Exercise 10.1 What are P and N for Jacobi, Gauss-Seidel, and SOR?

Notice that this method can also be written:

\begin{align} Px^{(k+1)} &= (P-A) x^{(k)} + b \nonumber\\ &= P x^{(k)} + (b - Ax^{(k)}). \end{align}

Therefore, we can write

\begin{align} x^{(k+1)} = x^{(k)} + P^{-1} r^{(k)} \end{align}

where r^{(k)} := b - Ax^{(k)} is the residual at step k.

function splitting(A,b,x; P=nothing, Niter=100, tol=1e-10)

    X = [x]

    # default preconditioner: the diagonal of A (Jacobi)
    if isnothing(P)
        P = Diagonal(diag(A))
    end

    if isapprox(det(P), 0; atol=1e-12)
        @warn "P is singular"
        return X
    end

    iP = inv(P)
    r = b - A*x
    for i in 1:Niter
        push!( X, X[end] + iP*r )
        r = b - A*X[end]
        if norm(r) < tol
            return X
        end
    end

    @warn "max iterations exceeded"
    return X
end;

Exercise 10.2 Why do we stop the iteration when the residual is small rather than the error?

Example 10.1 Applying the Jacobi and Gauss-Seidel iterations to Ax = b with

b = [6, 25, -11, 15];
A = [10 -1 2 0; 
    -1 11 -1 3; 
    2 -1 10 -1; 
    0 3 -1 8]

L = -tril(A, -1)
D = Diagonal(diag(A))

x0 = [0.,0.,0.,0.];

gives:

X = splitting(A,b,x0) # Jacobi
Y = splitting(A,b,x0; P=D-L) # Gauss-Seidel

ω=1.15
Z = splitting(A,b,x0; P=D-ω*L); # SOR

In fact, the error curves decay at rates governed by the spectral radii of the corresponding iteration matrices:

(ρJ, ρGS, ρSOR)
(0.42643661084234186, 0.08982305838804323, 0.17351995310029997)
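These values can be computed from the iteration matrix T = P^{-1}(P - A) of each splitting; a sketch (redefining the same A, D, L, ω as in the example above):

```julia
using LinearAlgebra

A = [10 -1 2 0; -1 11 -1 3; 2 -1 10 -1; 0 3 -1 8]
L = -tril(A, -1)
D = Diagonal(diag(A))
ω = 1.15

# spectral radius of the iteration matrix T = P⁻¹(P - A) = P⁻¹N
ρ(P) = maximum(abs.(eigvals(inv(P)*(P - A))))

ρJ   = ρ(D)          # Jacobi:       P = D
ρGS  = ρ(D - L)      # Gauss-Seidel: P = D - L
ρSOR = ρ(D - ω*L)    # SOR:          P = D - ωL
```

The smaller the spectral radius, the faster the convergence, which matches the error curves: Gauss-Seidel beats SOR (with this ω) beats Jacobi here.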

Exercise 10.3 Does the iteration x^{(k+1)} = (I - A) x^{(k)} + b converge in this case? What are P and N?

10.2 Steepest Descent

We now look at gradient-based methods: Notice that, if A is symmetric and positive definite (SPD) (that is, symmetric and x^TAx >0 for all x \not=0), then the solution of Ax = b is the unique minimiser of

\begin{align} g(x) = \frac{1}{2} x^T A x - b^T x. \end{align}

People use lots of different notation for the inner product (also called the dot product):

\begin{align} x^T y = x \cdot y = (x,y) = \langle x, y \rangle = \sum_i x_i y_i. \end{align}

Exercise 10.4 Show that

\begin{align} A = \begin{pmatrix} 5 & 1 \\ 1 & 3 \end{pmatrix} \end{align}

is symmetric and positive definite.

The level curves of g when A is SPD look like this:

A = [5 1; 1 3]; b = [1; 3]; ξ = A \ b 
g1(x,y) = dot([x;y], A*[x;y])/2 - dot([x;y], b)

x_range = range(ξ[1]-1,ξ[1]+1, length=100)
y_range = range(ξ[2]-1,ξ[2]+1, length=100)

contour( x_range, y_range, g1, 
    fill=true, 
    levels=50, 
    title=L"Level sets $\{x : g(x) = c\}$",
    xlabel="x", ylabel="y", primary=false,
    colorbar=true)
scatter!( [ξ[1]], [ξ[2]], color=:white, label=L"solution $Ax=b$")

Notice that \nabla g(x) is a vector that points in the direction of steepest increase of g. Therefore, in order to move closer to the minimiser, the first thing you might consider is

\begin{align} x^{(k+1)} = x^{(k)} - \alpha^{(k)} \nabla g(x^{(k)}). \tag{GD} \end{align}

Here, \alpha^{(k)} is the step size and \nabla g(x^{(k)}) is the search direction.

Exercise 10.5 Show that \nabla g(x) = Ax - b and so we can rewrite \text{(GD)} as x^{(k+1)} = x^{(k)} + \alpha^{(k)} r^{(k)} where r^{(k)} is the residual.

We may choose \alpha^{(k)} to be a minimiser of g along the search direction:

\begin{align} g( x^{(k)} + \alpha^{(k)} r^{(k)} ) = \min_{t} g( x^{(k)} + t r^{(k)} ). \end{align}

Since g is quadratic, one can compute \alpha^{(k)} explicitly:

\begin{align} \alpha^{(k)} = \frac {r^{(k)T} r^{(k)}} {r^{(k)T} Ar^{(k)}} =: \frac{ (r^{(k)}, r^{(k)}) }{(r^{(k)}, r^{(k)})_A }. \end{align}

Here, we have adopted the notation (x,y) = x^T y and defined (x,y)_A := x^T A y.
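As a quick sanity check (a small sketch, not part of the lecture code; the matrix is the SPD example from above and the starting point is arbitrary), one can verify numerically that this α really does minimise g along the residual direction:

```julia
using LinearAlgebra

A = [5 1; 1 3]; b = [1; 3]; x = [2.0; -1.0]   # SPD example; arbitrary start
g(z) = dot(z, A*z)/2 - dot(b, z)

# residual and the exact line-search step size
r = b - A*x
α = dot(r, r)/dot(r, A*r)

# g(x + α r) should be no larger than g(x + t r) for any other step t
gmin = g(x + α*r)
```

Sampling g(x + t r) over a grid of t values confirms that none of them beats gmin.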

The following implements steepest descent:

function gradientDescent(A,b,x0; Niter=100, xlims=[-25,25], ylims=[-25,25])

    g(x,y) = dot([x;y], A*[x;y])/2 - dot([x;y], b)

    x_range = range(xlims[1], xlims[2], length=100)
    y_range = range(ylims[1], ylims[2], length=100)

    P = contour(x_range, y_range, g, 
        fill=true, 
        levels=50, 
        title="Steepest Descent",
        xlabel="x", ylabel="y", legend=false,
        colorbar=true)

    anim = @animate for i in 0:Niter

        if (i==0)
            scatter!(P, [x0[1]], [x0[2]], markersize=5, color=:white)
        else 
            r = b - A*x0
            if (norm(r) < 1e-8)
                break
            end
            # exact line search: step size minimising g along r
            α = dot(r,r)/dot(r,A*r)

            x1 = x0 + α*r   # step along the residual r = -∇g(x0)
            plot!(P, [x0[1],x1[1]], [x0[2],x1[2]], color=:white, lw=2)
            scatter!(P, [x1[1]], [x1[2]], markersize=5, color=:white)
            x0 = x1 
        end
    end

    return anim
end

A = [5 1; 1 3]; b = [1; 3]; x0 = [25;-15]; ξ = A \ b 
gif( gradientDescent(A,b,x0), fps=1 )

What’s the problem with this? The steepest descent direction is only the best search direction locally. It can happen that the iteration zig-zags:

A = [5 0 ; 0 2]; b = [10; 3]; x0 = [15;20]
gif( gradientDescent(A,b,x0), fps=1 )

In higher dimensions this problem gets worse! In fact, if the condition number of A is large (roughly speaking, this means the widths of the principal axes of the hyperellipsoids (the level sets of g) vary greatly, with some directions very narrow and others very wide), then the performance of steepest descent is very poor.
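A quick illustration of this (a sketch; the diagonal matrices are chosen only to make the condition numbers obvious, and are not from the lecture): counting how many steepest-descent steps are needed to reach a fixed residual tolerance for a well-conditioned versus an ill-conditioned system.

```julia
using LinearAlgebra

# count steepest-descent steps until ‖r‖ < tol (illustration only)
function sd_iters(A, b; x = zero(b), tol = 1e-8, maxiter = 10_000)
    for k in 1:maxiter
        r = b - A*x
        norm(r) < tol && return k - 1
        α = dot(r, r)/dot(r, A*r)   # exact line search
        x = x + α*r
    end
    return maxiter
end

b = [1.0, 1.0]
n_well = sd_iters(Diagonal([2.0, 1.0]), b)     # condition number 2
n_ill  = sd_iters(Diagonal([100.0, 1.0]), b)   # condition number 100
```

The ill-conditioned problem takes vastly more iterations, consistent with the error contracting roughly by a factor (κ-1)/(κ+1) per step.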

Idea: choose better search directions.

10.3 Conjugate gradient

Consider

\begin{align} x^{(k+1)} = x^{(k)} + \alpha^{(k)} v^{(k)} \end{align}

for some step size \alpha^{(k)} and search direction v^{(k)}.

This gives Ax^{(k+1)} = Ax^{(k)} + \alpha^{(k)} Av^{(k)} and thus

\begin{align} r^{(k+1)} = r^{(k)} - \alpha^{(k)} A v^{(k)} \end{align}

where r^{(k)} = b - Ax^{(k)} is the residual.

10.3.1 What \alpha^{(k)}?

Suppose we have a search direction v^{(k)}. Again, we can choose \alpha^{(k)} that minimises g(x^{(k)} + \alpha^{(k)} v^{(k)} ) over the space \{ x^{(k)} + t v^{(k)} : t \in \mathbb R \}. Since

\begin{align} g( x^{(k)} + t v^{(k)} ) &= g(x^{(k)}) + t v^{(k)T} ( Ax^{(k)} - b ) + \frac{t^2}{2} v^{(k)T}Av^{(k)} \nonumber\\ &= g(x^{(k)}) - t (v^{(k)}, r^{(k)}) + \frac{t^2}{2} (v^{(k)},v^{(k)})_A, \end{align}

we have

\begin{align} \alpha^{(k)} := \frac {(v^{(k)}, r^{(k)})} {(v^{(k)},v^{(k)})_A}. \tag{$\alpha$} \end{align}

10.3.2 What v^{(k)}?

Exercise 10.6 Show that, if \alpha^{(k)} = \frac {(v^{(k)}, r^{(k)})} {(v^{(k)},v^{(k)})_A}, then r^{(k+1)} is orthogonal to v^{(k)} (that is, (v^{(k)},r^{(k+1)}) = 0).

Definition 10.1 We say \{ v^{(j)} \} is A-orthogonal (or A-conjugate) if (v^{(j)}, v^{(k)})_A = 0 for all j\not= k.

It turns out to be a good idea to choose v^{(j)} for j \leq k such that r^{(k+1)} is orthogonal to v^{(0)}, \cdots, v^{(k)}. That is, we try to choose the v^{(j)} so that the steepest descent direction points into a “previously unsearched” subspace. By Exercise 10.6 (the choice of \alpha^{(k)}), we have (v^{(k)},r^{(k+1)}) = 0. Moreover, since r^{(k+1)} = r^{(k)} - \alpha^{(k)} Av^{(k)}, we have

\begin{align} (v^{(k-1)}, r^{(k+1)}) &= (v^{(k-1)}, r^{(k)}) - \alpha^{(k)} (v^{(k-1)}, v^{(k)})_A \nonumber\\ &= - \alpha^{(k)} (v^{(k-1)}, v^{(k)})_A \end{align}

and so (v^{(k-1)}, r^{(k+1)})= 0 if and only if (v^{(k-1)}, v^{(k)})_A = 0. More generally, we have

Lemma 10.1 Suppose that \alpha^{(k)} = \frac {(v^{(k)}, r^{(k)})} {(v^{(k)},v^{(k)})_A}. Then, (v^{(j)}, r^{(k+1)}) = 0 for all j \leq k if and only if \{ v^{(j)} \} is A-orthogonal.
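These orthogonality relations are easy to check numerically. The following sketch (not part of the lecture code; the 3×3 SPD matrix is just an example) runs a few steps using the update v^{(k+1)} = r^{(k+1)} + \beta^{(k)} v^{(k)} from Section 10.3.3, recording the directions and residuals:

```julia
using LinearAlgebra

# run n conjugate-direction steps, recording directions V and residuals R
function cg_steps(A, b; n = 3)
    x = zero(b)
    r = b - A*x
    v = copy(r)                    # first direction: the residual
    V, R = [copy(v)], [copy(r)]
    for k in 1:n
        α = dot(v, r)/dot(v, A*v)  # exact line search
        x += α*v
        r = b - A*x
        β = -dot(v, A*r)/dot(v, A*v)
        v = r + β*v                # next direction: residual + correction
        push!(V, copy(v)); push!(R, copy(r))
    end
    return V, R
end

A = [4.0 1 0; 1 3 1; 0 1 2]; b = [1.0; 2; 3]   # an SPD example
V, R = cg_steps(A, b)

abs(dot(V[1], A*V[2]))   # ≈ 0: consecutive directions are A-orthogonal
abs(dot(V[1], R[2]))     # ≈ 0: new residual ⊥ previous direction
```

Both quantities vanish to machine precision, as Exercise 10.6 and Lemma 10.1 predict.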

Any A-orthogonal choice of search directions minimises g on increasing subspaces:

Theorem 10.1 Suppose that \alpha^{(k)} = \frac {(v^{(k)}, r^{(k)})} {(v^{(k)},v^{(k)})_A} and \{v^{(j)}\} is A-orthogonal. Then, x^{(k+1)} = x^{(k)} + \alpha^{(k)} v^{(k)} minimises g(x) = \frac{1}{2}x^T Ax - b^Tx over the space

\begin{align} &x^{(0)} + \mathrm{span}\big\{ v^{(0)}, v^{(1)}, \cdots, v^{(k)} \big\} \nonumber\\ &= \left\{ x^{(0)} + c_0 v^{(0)} + \cdots + c_k v^{(k)} : c_0,\cdots,c_k \in \mathbb R \right\}. \end{align}

Proof. Next time!

10.3.3 How to choose \{ v^{(j)} \}?

(If you know about Gram-Schmidt orthogonalisation with respect to the inner product (\cdot,\cdot)_A, then that is one option.)

It turns out that you can obtain A-orthogonal search directions by writing the new search direction as a linear combination of the residual and previous search direction: suppose that

\begin{align} v^{(k+1)} = r^{(k+1)} + \beta^{(k)} v^{(k)}. \end{align}

Exercise 10.7 Show that, if \{v^{(j)}\} is A-orthogonal and v^{(k+1)} = r^{(k+1)} + \beta^{(k)} v^{(k)}, then

\begin{align} \beta^{(k)} = - \frac {(v^{(k)},r^{(k+1)})_A} {(v^{(k)}, v^{(k)})_A} \end{align}

Hint: consider (v^{(k)},v^{(k+1)})_A = 0

function CG(A,b,x0; Niter=5, xlims=[-25,25], ylims=[-25,25])

    g(x,y) = dot([x;y], A*[x;y])/2 - dot([x;y], b)

    x_range = range(xlims[1], xlims[2], length=100)
    y_range = range(ylims[1], ylims[2], length=100)

    P = contour(x_range, y_range, g, 
        fill=true, 
        levels=50, 
        title="Conjugate Gradient",
        xlabel="x", ylabel="y", legend=false,
        colorbar=true)

    r = b - A*x0
    v = copy(r)   # initial search direction: the residual
    anim = @animate for i in 0:Niter

        if (i==0)
            scatter!(P, [x0[1]], [x0[2]], markersize=5, color=:white)
        else 
        
            if (norm(r) < 1e-8)
                break
            end

            # step size 
            α = dot(v,r)/dot(v,A*v)

            # new point
            x1 = x0 + α*v
            plot!(P, [x0[1],x1[1]], [x0[2],x1[2]], color=:white, lw=2)
            scatter!(P, [x1[1]], [x1[2]], markersize=5, color=:white)

            # residual
            r = b - A*x1
            
            # new search direction
            β = - dot(v,A*r)/dot(v,A*v)
            v = r + β*v
            
            x0 = x1
        end
    end

    return anim
end

A = [5 1; 1 3]; b = [1; 3]; x0 = [25;25];  ξ = A \ b 
gif( CG(A,b,x0), fps=1 )
A = [5 0 ; 0 2]; b = [10; 3]; x0 = [15;20]
display( gif( gradientDescent(A,b,x0), fps=1 ) )
gif( CG(A,b,x0), fps=1 )
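In practice one would call a library routine rather than the teaching code above. The IterativeSolvers package loaded at the start of this lecture provides a `cg` function (a minimal sketch; see the package documentation for tolerances, maximum iterations, and preconditioner options):

```julia
using LinearAlgebra, IterativeSolvers

A = [5.0 1; 1 3]; b = [1.0; 3]

# cg assumes A is symmetric positive definite;
# it only needs matrix-vector products, so A may also be sparse
x = cg(A, b)
```

For a 2×2 system, CG converges (in exact arithmetic) in at most two iterations, so the residual of the returned x is tiny.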
Tip

TO-DO:
  • Assignment 3: due Wednesday 11:59 PM.
  • Sign up to office hours if you would like.

Chapter 1 reading:

Chapter 2 reading:

  • Recap matrices: sections 7.1 - 7.2 of Burden, Faires, and Burden (2015)
  • Jacobi & Gauss-Seidel: section 7.3 Burden, Faires, and Burden (2015)
  • Today: section 7.6 Burden, Faires, and Burden (2015)

Next: other Krylov subspace methods.

Burden, Richard L., Douglas J. Faires, and Annette M. Burden. 2015. Numerical Analysis. 10th ed. CENGAGE Learning.
Driscoll, Tobin A, and Richard J Braun. 2018. Fundamentals of Numerical Computation. https://fncbook.com/.