These notes are mainly a record of what we discussed and are not a substitute for attending the lectures and reading books! If anything is unclear/wrong, let me know and I will update the notes.
Let us consider a rectangular linear system of equations: A \bm x = b where A \in \mathbb R^{m\times n} with m > n.
Consider
\begin{align} \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}. \end{align}
This reads: x = y = 1 and x + y = 1. Therefore there is no solution! We say the matrix equation is overdetermined.
What does Julia think about this?:
A = [ 1 0 ; 0 1 ; 1 1 ]
b = [ 1 ; 1 ; 1 ]
A \ b

2-element Vector{Float64}:
 0.6666666666666665
 0.6666666666666667
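Julia does not complain: for a tall matrix, the \ operator returns the least-squares solution that we derive below. As a quick illustrative check (the comparison vector (1,1)^T is chosen here purely for illustration), the returned vector has a smaller residual than the vector that solves the first two equations exactly:

using LinearAlgebra

x = A \ b            # ≈ (2/3, 2/3)
norm( A*x - b )      # ≈ 0.577, the smallest achievable residual
norm( A*[1,1] - b )  # = 1.0, so the "exact" solution of the first two rows does worse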
There are many situations where we might want to approximately solve A \bm x = b, e.g. in statistics:
Here we consider the population of the UK in the years since 1970:
Data from “Population Estimates Time Series Dataset - Office for National Statistics — Ons.gov.uk”
years = 1971:2024
pop = [
55928000 56096700 56222900 56235600 56225700 56216100 56189900 56178000 56240100 56329700 56357500 56290700 56315700 56409300 56554000 56683800 56804000 56916400 57076500 57237500 57438700 57584500 57713900 57862100 58024800 58164400 58314200 58474900 58684400 58886100 59113000 59365700 59636700 59950400 60413300 60827000 61319100 61823800 62260500 62759500 63285100 63710500 64138200 64618700 65087000 65605800 65964300 66286700 66627500 66739900 66978000 67636100 68526200 69281400
]
r = 1:10:54
x = years[r]
y = pop[r]
scatter( x, y/1e6 , title="population of UK", xlabel="year", ylabel="/millions", legend=false)

One way of fitting the data exactly would be to consider the polynomial interpolation of the data (see Chapter 3).
In the following, we measure time since 1970 to make the numbers smaller and improve the conditioning of the interpolation problem.
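(As an aside, one way to see this conditioning effect is to compare the Vandermonde-type matrices that underlie the degree-5 interpolation; this is only a sketch, and the names V_raw and V_shifted below are purely illustrative.)

using LinearAlgebra

V_raw     = years[r] .^ (0:5)'    # interpolation matrix built from the raw years
V_shifted = r .^ (0:5)'           # the same, but in years since 1970
cond( V_raw ), cond( V_shifted )  # the raw-years matrix is vastly worse conditioned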
p_interp = fit( r, y/1e6 )
display(p_interp)
scatter( r, y/1e6 , title="population of UK", ylabel="/millions",)
plot!( p_interp , -5 , 60, legend=false, xlabel="years since 1970" )

You probably have already noticed that this is a bad approximation in this context (e.g. the population before 1970 was not larger than in 1970 and (hopefully) the population will not drop sharply as indicated above). In statistics, this phenomenon is known as overfitting.
It turns out that I did not plot all of the data points and so we can actually measure the testing error (that is, the error on the data points that we left out when we came up with our model):
scatter( pop'/1e6 , title="population of UK", label="testing data", xlabel="years since 1970", ylabel="/millions",color="red")
println( "training error: ", norm( p_interp.(r) - y/1e6 ) )
println( "testing error: ", norm( p_interp.(1:54) - pop'/1e6 ) )
scatter!( r, y/1e6 , title="population of UK", label="training data", xlabel="years since 1970", ylabel="/millions",color="blue")
plot!( p_interp, -5, 60, label="model" )

training error: 6.355287432313019e-14
testing error: 3.947571417338935
Here, we can see that indeed the “model” does not capture the correct behaviour outside the training data (i.e. the data points we used in order to derive the model in the first place).
Instead it may be useful to “fit” the data using a low degree polynomial, for example. In order to do this, we set up an (overdetermined) linear system of equations A \bm x \approx b and try to minimise the error in this approximation. First suppose that the data can be described by the linear function y \approx p(t) = c_1 + c_2 t where p(t) is the population at time t. Suppose we have the data points (t_1,y_1),\dots,(t_m,y_m) and notice that we want
\begin{align} \bm y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} \approx \begin{pmatrix} p(t_1) \\ p(t_2) \\ \vdots \\ p(t_m) \end{pmatrix} = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_m \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}. \end{align}
In order to solve this equation approximately, we may again use the \ operator as we did in Note 18.1:
n = 1
V = r.^(0:n)'
p = Polynomial( V \ (y/1e6) )
scatter( pop'/1e6 , title="population of UK", label="testing data", xlabel="years since 1970", ylabel="/millions",color="red")
println( "training error: ", norm( p.(r) - y/1e6 ) )
println( "testing error: ", norm( p.(1:54) - pop'/1e6 ) )
scatter!( r, y/1e6 , title="population of UK", label="training data",color="blue")
plot!( p, -5, 60, label="model" )

training error: 3.206666526918118
testing error: 10.088011997522381
We can see that the testing error for this model is actually worse than the polynomial interpolation! The model is too simple to capture what is actually happening. We can add complexity to the model by choosing a higher degree polynomial: p(t) = c_1 + c_2 t + c_3 t^2 and so we want
\begin{align} \bm y \approx \begin{pmatrix} p(t_1) \\ p(t_2) \\ \vdots \\ p(t_m) \end{pmatrix} = \begin{pmatrix} 1 & t_1 & t_1^2 \\ 1 & t_2 & t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_m & t_m^2 \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}. \end{align}
Doing this indeed reduces the testing error:
n = 2
V = r.^(0:n)'
p = Polynomial( V \ (y/1e6) )
scatter( 1:54, pop'/1e6 , title="population of UK", label="testing data", xlabel="years since 1970", ylabel="/millions",color="red")
println( "training error: ", norm( p.(r) - y/1e6 ) )
println( "testing error: ", norm( p.(1:54) - pop'/1e6 ) )
scatter!( r, y/1e6 , title="population of UK", label="training data", xlabel="years since 1970", ylabel="/millions",color="blue")
plot!( p, -5, 60, label="model" )

training error: 0.7000921851544214
testing error: 2.4978049313214417
This is known as linear least squares (because we are minimising |Ax-b|^2). The choice of polynomials is somewhat arbitrary: we still get an overdetermined linear system of equations if we instead took p(t) = c_1 + c_2 t + c_3 t^3 + c_4 \cos t:
# Different basis functions:
f(t) = 1
g(t) = t
h(t) = t^3
j(t) = cos(t)
A = [f.(r) g.(r) h.(r) j.(r)]
c = A \ (y/1e6)
p′(x) = c[1] * f(x) + c[2] * g(x) + c[3] * h(x) + c[4] * j(x)
scatter( 1:54, pop'/1e6 , title="population of UK", label="testing data", xlabel="years since 1970", ylabel="/millions",color="red")
println( "training error: ", norm( p′.(r) - y/1e6 ) )
println( "testing error: ", norm( p′.(1:54) - pop'/1e6 ) )
scatter!( r, y/1e6 , label="training data", xlabel="years since 1970", ylabel="/millions",color="blue")
plot!( p′, -5, 60, label="model" )

training error: 0.5283726651045577
testing error: 3.3411811246054257
Suppose x solves the normal equations A^TA x = A^T b. Then x minimises the error |Ax - b|.
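Here is a brief sketch of why this is true: write any other candidate as x + z. The normal equations say A^T(Ax - b) = 0, so the cross term below vanishes and

\begin{align} |A(x + z) - b|^2 = |Ax - b|^2 + 2 z^T A^T (Ax - b) + |Az|^2 = |Ax - b|^2 + |Az|^2 \ge |Ax - b|^2. \end{align}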
Therefore, in order to solve the linear least squares problem, we could solve A^TA x = A^T b which is now a square matrix equation that we have seen how to deal with (i.e. (P)LU decomposition + forwards and backwards substitution). The problem is that “\kappa(A^TA) = \kappa(A)^2” (one can define the condition number of a rectangular matrix but we have not seen the definition) and so, as soon as A is even moderately ill-conditioned, the normal equations are far more ill-conditioned than the original least squares problem.
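For illustration, here is a sketch of the normal equations approach applied to the quadratic population fit above (r and y are as defined earlier; the names c_normal and c_backslash are purely illustrative):

using LinearAlgebra

V = r.^(0:2)'                            # the quadratic Vandermonde-type matrix from above
c_normal    = (V' * V) \ (V' * (y/1e6))  # square system AᵀA c = Aᵀb, solvable by (P)LU
c_backslash = V \ (y/1e6)                # Julia's built-in least-squares solve
norm( c_normal - c_backslash )           # the two agree closely for this small example
cond( V' * V ), cond( V )^2              # κ(AᵀA) = κ(A)² in the 2-norm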
Instead: the built-in \ operator in Julia uses the so-called QR factorisation, which plays the role of the LU decomposition but for solving overdetermined systems of equations. We may return to this in the last week of semester, but here are the very basics:
Suppose that A \in \mathbb R^{m \times n} can be written as A = QR with Q \in \mathbb R^{m\times n} having orthonormal columns (i.e. Q^{T} Q = I) and R \in \mathbb R^{n\times n} upper triangular. If one has such a decomposition, then
\begin{align} A^T A x = (QR)^T (QR) x = R^T Q^T Q R x = R^T R x \end{align}
and
\begin{align} A^T b = (QR)^T b = R^T Q^T b. \end{align}
It turns out that if A has linearly independent columns, then R (and hence R^T) is invertible and A^TAx = A^Tb is equivalent to R x = Q^T b. Solving this system is possible by backwards substitution.
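As a sketch (reusing the quadratic fit above and extracting the “thin” Q explicitly; the variable names are illustrative), this can be carried out with the qr function from LinearAlgebra:

using LinearAlgebra

F = qr(V)                                   # A = QR, with Q stored in a compact form
Q = Matrix(F.Q)                             # the m×n Q with orthonormal columns
R = F.R                                     # the n×n upper triangular factor
c_qr = UpperTriangular(R) \ (Q' * (y/1e6))  # solve R c = Qᵀb by backwards substitution
norm( c_qr - V \ (y/1e6) )                  # matches Julia's built-in solve up to rounding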