Understanding the Conjugate Gradient Method Hailiang Zhao * College of Computer Science and Technology, Zhejiang University hliangzhao@zju.edu.cn July 10, 2020 Contents 1 Introduction 2 2 The Quadratic Form 2 3 The Steepest Descent Method 3 3.1 The Direction and Optimal Step Size........................... 3 3.2 Eigenthings......................................... 6 3.2.1 Eigenvalues and Spectral Radius......................... 6 3.2.2 Jacobi Iterations.................................. 7 3.3 Convergence Analysis................................... 8 3.3.1 Instant Results................................... 8 3.3.2 General Convergence................................ 9 4 Conjugate Directions Methods 12 4.1 What is Conjugacy..................................... 12 4.2 Gram-Schmidt Conjugation................................ 15 4.3 Optimality Analysis.................................... 16 4.4 Conjugate Gradient Method................................ 17 4.5 Convergence Analysis................................... 18 4.5.1 Picking Perfect Polynomials............................ 18 4.5.2 Chebyshev Polynomials.............................. 21 4.6 Complexity Analysis.................................... 22 5 Conjugate Gradient Method in Action 23 5.1 Starting and Stopping Criterion.............................. 23 5.2 Preconditioning....................................... 23 5.3 Extending to General Cases................................ 24 6 Postscript 25 * Hailiang is a first-year Ph.D. student of ZJU-CS. His homepage is http://hliangzhao.me. 1
1 Introduction line search Hessian Hessian Conjugate Gridient method, CG Hessian CG CG Ax = b, (1.1) A R n n 1 b R n x R n A LU Jacobi Gauss-Seidel CG A Jacobi A CG Ax = b CG n 2 The Quadratic Form the Quadratic Form A R n n x, b R n c R x Ax x x Ax x = 2Ax x f(x) = Ax b f(x) 1 2 x Ax b x + c, (2.1) = (A + A )x b x x = b A f(x) 1 2 (A + A )x = b A A 2 x = A 1 b x f(x) = 0 x = x + e e R n f(x + e) = 1 2 (x + e) A(x + e) b (x + e) + c = f(x ) + 1 2 e Ae > f(x ). 1 A R n n x R n x Ax > 0 http://hliangzhao.me/math/math.pdf 2 λ x Ax = λx x Ax > 0 λ x 2 > 0 λ > 0 A A 2
这说明当 A 是一个对称正定矩阵时 通过求解 Ax = b 得到的就是 f (x) 的全局最小值 这相当 于把一个二次型的最优化问题转变成了线性方程组的求解问题 当 A 是一个对称正定矩阵时 f (x) 的图像是一个图形开口向上的抛物面 如果 A 非正定 那么其全局最小值点就可能不止一个 Fig. 2.1展示了不同的 A R2 2 对 f (x) 的图像的影响 Figure 2.1: 不同的 A 对 f (x) 的图像的影响 a 正定矩阵的二次型 b 负定矩阵的二次型 c 奇异矩 阵和非正定矩阵的二次型 d 不定矩阵的二次型 此时解是一个鞍点 梯度法和 CG 均无法处理该问题 3 The Steepest Descent Method 3.1 The Direction and Optimal Step Size 最速下降法即梯度法 作为一种一维搜索算法 梯度法采取函数梯度的反方向作为迭代方向 并 通过求导的方式确定迭代的最优步长 因为梯度是函数值增长最快的方向 所以梯度下降的方向是 函数值减少最快的方向 这就是为什么梯度法又被称为最速下降法的原因 在梯度法中 我们将从 任意点 x(0) 开始 经过迭代得到 x(1), x(2),... 直到达到最大迭代次数或者误差满足要求为止 在第 i 轮迭代中 梯度法选择的方向为 b Ax(i) 3 为了简化描述 我们定义如下变量 误差向量 e(i) x(i) x 表示第 i 轮迭代时 x(i) 和全局最小值点 x 的距离 残差向量 r(i) b Ax(i) 表示第 i 轮迭代时 Ax(i) 和 b 之间的距离 残差向量其实就是最速下降的方向 且 r(i) = b Ax(i) = Ax Ax(i) = Ae(i) 这个等价变换 会反复用到 务必记住 因此 x 将按照如下公式进行更新 x(i+1) x(i) + α r(i), (3.1) 3 别忘了只有当 A 是对称矩阵的时候才有 f = Ax b 不管 A 是否为对称阵 梯度法都会选取 b Ax x (i) 作为迭代 方向 因此梯度法并不能够保证收敛 我们会在后文只会给出当 A 为对称阵和对称正定阵时的收敛速度 3
α i α f(x (i+1) ) α φ(α) α α φ(α) = 0 α φ(α) = α f(x (i+1) ) = ( x(i+1) f(x (i+1) ) ) α x (i+1) = r(i+1) r (i) = 0. (3.2) (3.2) α α Fig. 3.1 Figure 3.1: a b x (0) = [ 2, 2] x 1x 2 c f(x) α α d α (3.2) r(i+1) r (i) = (b Ax (i) ) r (i) = [ b A(x (i) + α r (i) ) ] r(i) = (b Ax (i) ) r (i) α(ar (i) ) r (i) = r(i) r (i) αr(i) Ar (i) = 0, (3.3) α r (i) r (i) α = r(i) Ar. (3.4) (i) x (0) (3.4) (3.1) (3.1) x (1), x (2),... A r k = 0 Algorithm 1 4
Algorithm 1: Steepest Descent. Input: A R n n b R n Output: x Ax = b argmin x f(x) 1 r (0) = b Ax (0) 2 k = 0 3 while r k = 0 do 4 k = k + 1 5 α (k) = ( r(k 1) r ) ( ) (k 1) / r (k 1) Ar (k 1) 6 x (k) = x (k 1) + α (k) r (k 1) 7 r (k) = b Ax (k) 8 end while 9 return x (k) Algorithm 1 - step 5 step 7 - Algorithm 1 6 x (k) = x (k 1) + α (k) r (k 1) Ax (k) + b = (Ax (k 1) b) α (k) Ar (k 1) r (k) = r (k 1) α (k) Ar (k 1). (3.5) (3.5) r (k 1) α (k) Ar (k 1) b Ax (k) r (k) Ar (k 1) 7 - r (k) r (k) x (k) b Ax (k) r (k) Algorithm 2 Algorithm 2:. Input: A R n n b R n ε MaxIter Output: Ax = b 1 r (0) = b Ax (0), δ (0) = r (0) r (0) 2 k = 0 3 while k < MaxIter and δ (k) > ε 2 δ (0) do 4 q = Ar (k), α (k) = δ (k) /r (k) q 5 x (k+1) = x (k) + α (k) r (k) 6 if k is divisible by n then 7 /* Periodically update by the original formula */ 8 r (k+1) = b Ax (k+1) 9 else 10 /* Reduce complexity according to (3.5) */ 11 r (k+1) = r (k) α (k) q 12 end if 13 k = k + 1, δ (k) = r (k) r (k) 14 end while 15 return x (k) 5
3.2 Eigenthings 3.2.1 Eigenvalues and Spectral Radius v B λ Bv = λv B v v B λ < 1 lim k B k v 0; λ > 1 lim k B k v Fig. 3.2 Fig. 3.3 Figure 3.2: λ = 0.5 B k v Figure 3.3: λ = 2 B k v x B x B B R n n n 4 n B x x B n {v 1,..., v n } x = n c i v i, Bx = n c iλ i v i λ i v i i, c i = 1 x = n v i i λ i > 1 B k x 1 B k x 0 B k x 0 λ i B spectral radius ρ(b) max λ i. (3.6) i 4 http://hliangzhao.me/math/math.pdf 6
B n B (defective matrix) 3.2.2 Jacobi Iterations Ax = b Jacobi Gauss-Seidel Jacobi Jacobi Ax = b Jacobi x (k+1) i i, a ii 0 x (k+1) i x (k+1) j x (k+1) i x i = b i j i a ijx j a ii, (3.7) = b i i 1 j=1 a ijx (k) j n j=i+1 a ijx (k) j (3.8) a ii x j = b i i 1 j=1 a ijx (k+1) j n j=i+1 a ijx (k) j, (3.9) a ii Gauss-Seidel Gauss-Seidel Jacobi D = diag(a) E A F A A = D E F Ax = b (D E F)x = b D 1 (D E F)x = D 1 b x = D 1 (E + F)x + D 1 b. B D 1 (E + F) z D 1 b Jacobi (3.8) x (k+1) = Bx (k) + z. (3.10) Ax = b x (stationary point) i x (i) x e (i) (3.10) x (k+1) = Bx (k) + z = B(x + e (i) ) + z = Bx + z + Be (i) x = Bx + z = x + Be (i) e (i+1) = x (i+1) x = Be (i) Jacobi ρ(b) < 1 0 Jacobi x x (0) Jacobi B Jacobi A A 7
3.3 Convergence Analysis 3.3.1 Instant Results e (i) A λ e r (i) = Ae (i) = λ e e (i) r (i) A λ e (3.4) α Ar (i) = λ e Ae (i) = λ e r (i), (3.11) α = r (i) r (i) r(i) Ar = r (i) r (i) (i) r(i) λ (3.11) er (i) = r (i) r (i) λ e r (i) r (i) = 1 λ e. (3.12) (3.1) e (i+1) + x = e (i) + x + α r (i), e (i+1) = e (i) + α r (i). (3.13) e (i+1) = e (i) + 1 r (i) (3.12) λ e = e (i) 1 λ e λ e e (i) = 0. x Fig. 3.4 Figure 3.4: A = [ [3, 2], [2, 6] ] A v 1 = [1, 2] v 2 = [ 2, 1] λ 1 = 7 λ 2 = 2 e (i) A λ e A A A = PDP 1 = PDP, (3.14) P = {v 1,..., v n } {v 1,..., v n } D = 8
diag(λ 1,..., λ n ) n e (i) e (i) = n ξ i v i, (3.15) e (i) 2 = n ξ2 r (i) = Ae (i) = n ξ iλ i v i e (i) Ae (i) = ( n ξ i vi )( n ) n ξ i λ i v i = ξi 2 λ i, (3.16) r(i) Ar (i) = ( n ξ i λ i vi ) ( n A ξ i λ i vi ) ( n )( n = ξ i λ i v i ξ i λ 2 ) i v i = e (i+1) = e (i) + α r (i) n ξi 2 λ 3 i. (3.17) e (i+1) = e (i) + r (i) r (i) r(i) Ar r (i) (i) n = e (i) + ξ2 i λ2 i n r (i) (3.18) ξ2 i λ3 i A λ (3.18) e (i+1) = e (i) + λ2 n ξ2 i λ 3 n ( Ae (i) ) ξ2 i = e (i) + λ2 n ξ2 i λ 3 n ( λe (i) ) ξ2 i = 0. (3.19) Ae (i) = A n ξ iv i = n ξ iav i = n ξ iλv i = λe (i) A f(x) n A e (i) A λ e A 3.3.2 General Convergence energy norm e A (e Ae) 1 2 (3.20) Fig. 3.5 x f(x) 9
Figure 3.5: 能量范数相同的两个向量 根据能量范数的定义可得 e(i+1) 2A = e (i+1) Ae(i+1) = (e (i) +α (3.13) r(i) )A(e(i) + α r(i) ) 2 = e (3.4) (i) Ae(i) + 2αr(i) Ae(i) + α r(i) Ar(i) r(i) r(i) ( ) ( r(i) r(i) )2 r(i) r(i) + r(i) Ar(i) = e(i) 2A + 2 Ar r(i) Ar(i) r(i) (i) = e(i) 2A (r(i) r(i) )2 Ar r(i) (i) ( = e(i) 2A 1 (r(i) r(i) )2 ) Ar )(e Ae ) (r(i) (i) (i) (i) n ( ) 2 2 2 ( ξi λi ) 2 n = e(i) A 1 n (3.16) & (3.17) ( ξi2 λi )( ξi2 λ3i ) n ( ξi2 λ2i )2 n = e(i) 2A ω 2 ω 2 1 n ( ξi2 λi )( ξi2 λ3i ) (3.21) 我们只要分析 ω 2 的上界 就可以知道梯度法在最坏情况下的表现了 ω 越小 收敛速度越快 至于原 因 此处暂时按下不表 后文会给出解释 我们先针对一个简单的情况 A R2 2 是个正定对称阵 进行分析 前文已经说明说明过 当 A 是正定矩阵时 其特征值 λ1, λ2 均为正数 且特征向量 v1, v2 线性无关 因此不妨设 λ1 > λ2 此时 A 的谱条件数 spectral condition number κ = λ1 /λ2 1 请回顾(3.15)5 我们将误差向量 e(i) 相对于特征向量张成的空间/坐标系的斜率定义为 µ ξ2 /ξ1 5 在(3.15)中 我们将 vi 视为基 相应地 ξi 是对应基上的坐标 将 v1 视为 x 轴 v2 视为 y 轴 即可类比得到斜率 µ 10
由此可对 ω 2 作如下化简 ( ω2 = 1 ( = 1 ξ12 λ21 +ξ22 λ22 ξ12 λ22 2 2 ξi λi ξ12 λ2 )2 )( 2 2 3 ξi λi ξ12 λ32 ) (κ2 + µ2 )2. (κ + µ2 )(κ3 + µ2 ) (3.22) 上式表明我们可以将 ω 看成 κ 和 µ 的函数 只考虑正数部分 观察 Fig. 3.6左图可发现 κ 和 µ 越 小 ω 越小 此外 µ = ±κ 时 ω 最大 这个结论从图像上看很直观 也可以通过对(3.22)求导得到 分析这个图像也可以理解梯度法是如何做到一步迭代即收敛的 当 e(i) 是一个特征向量时 此时误 差向量在特征向量组成的坐标系的坐标轴上 因此斜率 µ 为 0 或 此时均有 ω = 0 当 A 的所 有特征值均相等时 κ = 1 同样有 ω = 0 此外 我们可以观察一下 ω 随 κ 和 µ 的变化情况 当 κ 很小的时候 不论 µ 如何 不论初始点如何选取 收敛速度都是比较快的 这是因为 A 各个方向的 特征值相似 所以 f (x) 接近球形 在各个点处均有合适的迭代方向 当 κ 很大的时候 不同的特征 向量的能量范数差距很大 此时 f (x) 的一极相对另一极长很多 想象一个 a 和 b 差异很大的椭圆 x2 /a2 + y 2 /b2 = 1 如果初始点在山脊处 那么迭代方向还是很好找的 收敛速度不算慢 如果初 始点在山谷处 那么梯度法只能朝着波谷方向曲折而缓慢的前进 每一步迭代作出的有效改进很小 收敛速度很慢 Fig. 3.6右图给四种情形下梯度法的迭代情况做了一个形象化的展示 Figure 3.6: 左图 ω 关于 κ 和 µ 的函数图像 右图 四种情形下梯度法的迭代情况 a κ 大 µ 小 b κ 大 µ 大 c κ 小 µ 小 d κ 小 µ 大 我们再来算一算最差的情况下 即 µ = ±κ 时 ω 的上界是多少 对于对称正定阵 A 我们将条 件数 condition number 定义为特征值之比的最大值 λmax κ. λmin 当 µ = ±κ 时有 µ2 = κ2 因此(3.22)满足 ω2 1 (3.23) ( κ 1 )2 4κ4, κ5 + 2κ4 + κ3 κ+1 进而得到 ω κ 1. κ+1 该结论在 n 2 时依然成立 将(3.24)绘制出来可得 11 (3.24)
Figure 3.7: A ω κ ω f(x (i) ) f(x 1 ) f(x (0) ) f(x ) = 2 e (i) Ae (i) 1 2 e (0) Ae (0) = e (i) 2 A e (0) 2 = ω 2, (3.25) A ω e (i) 2 A i (3.24) ( ) i κ 1 e (i) A e (0) A, (3.26) κ + 1 f(x (i) ) f(x ) f(x (0) ) f(x ) A 4 Conjugate Directions Methods 4.1 What is Conjugacy ( ) 2i κ 1. (3.27) κ + 1 Fig. 3.6 b n d (0), d (1),..., d (n 1) n n (3.15) x (i) x (i+1) x (i) + α (i) d (i). (4.1) α (i) d (i) e (0) = x x (0) = α (0) d (0) +...α (n 1) d (n 1), (4.2) e (0) 12
i + 1 i e (i+1) = e (0) + α (k) d (k) (4.3) d (i) e (i+1) = n k=i+1 = k=0 n 1 k=i+1 α (k) d (i) d (k) α (k) d (k), d (i) d (j) = 0 if i j = 0. (4.4) d (i) (e (i) + α (i) d (i) ) = 0 = α (i) = d (i) e (i) d (i) d. (4.5) (i) (4.5) e (i) i e (i) d (i) α (i) n d (0), d (1),..., d (n 1) n A- d (0), d (1),..., d (n 1) A- A-orthogonal p q p Aq = 0, (4.6) p q A- α (i) A- Fig. 4.1 a A- A p q b a b Figure 4.1: A- A- A- conjugate direc- 13
tions methods (4.4) (4.5) d (i) Ae (i+1) = 0. (4.7) α (i) = d (i) Ae (i) d (i) Ad (i) r (i) = Ae (i) = d (i) r (i) d (i) Ad (i) (4.8) (4.5) d (i) (4.8) (4.7) (3.2) (4.1) x α f(x (i+1) ) = ( x(i+1) f(x (i+1) ) ) α x (i+1) = r(i+1) d (i) = (d (i) r (i+1)) = (d (i) Ae (i+1)) = 0 (4.7) A- d (i) r (i) (3.15) e (0) n 1 e (0) = δ i d (i). (4.9) n 1 d j Ae (0) = δ i d j Ad (i) d (i) d (j) = 0 if i = j = δ i d j Ad (j), δ j = d j Ae (0) d j Ad (j) = d j Ae (0) + j 1 i=0 α (i)d (j) Ad (i) d j Ad (j) = d j A( e (0) + j 1 i=0 α ) (i)d (i) d j Ad (j) d (i) d (j) = 0 if i j j 1 e (j) = e (0) + α (i) d (i) i=0 = d j Ae (j) d j Ad = d j r (j) (j) d j Ad (j) = α (j). (4.8) (4.10) α (i) = δ i e (i) 14
i 1 e (i) = e (0) + α (j) d (j) (4.3) = = j=0 n 1 i 1 δ j d (j) δ j d (j) (4.9) & (4.10) j=1 j=0 n 1 δ j d (j). (4.11) j=i 4.2 Gram-Schmidt Conjugation d (i) (4.1) (4.8) - conjugate Gram-Schmidt process A- - - Gram-Schmidt process - {u (0),..., u (n 1) } {q (0),..., q (n 1) } q (0) = u (0), q (1) = u (1) u (1) q (0) q (0) q (0) q (0), q (n 1) = u (n 1) n 2 ( u(n 1) q (i) ) i=0 q (i) q (i) q (i). (4.12) q (1) u (1) q (0) u (1) q (0) q (0) q (0) u (1) q (0) q (1) q (2) q (1) p (i) i q (0),..., q (i 1) q (i) i, q (i) i 1 q (i) = p (i) (p (i) q (k) )q (k), q (i) q (i) q (i). (4.13) k=0 A- - q (i) d (i) i 1 ( p (i) Ad (k) i : d (i) = p (i) d (k) Ad d (k) ). (4.14) (k) k=1 p (i) Ad (k) d (k) Ad β ik - (k) i = 0 d (i) = p (i) 0 < i < n i 1 d (i) = p (i) + β ik d (k), (4.15) k=0 d (i) Ad (j) = p (i) Ad i 1 ( ) (j) + β ik d (k) Ad (j) k=0 = p (i) Ad (j) + β ij d (j) Ad (j), (4.16) 15
(4.15) (4.17) (4.14) β ij = p (i) Ad (j) d (j) Ad. (4.17) (j) p (i) CG p (i) r (i) 4.3 Optimality Analysis e (i) e (i) A 6 D i i D i span{d (0),..., d (i 1) } (4.3) j 1 e (i) = e (0) + α (i) d (i) e (0) + D i. (4.18) Fig. 4.2 e (i) A i = 2 i=0 Figure 4.2: e (0) + D 2 e (i) A e (0) + D i e (i) e (i) = argmin e e(0) +D i e A e (i) A e (i) 2 A e (i) 2 A = e (i) A e (i) (4.11) ( n 1 ) A ( n 1 ) = δ j d (j) δ k d (k) d (i) d (j) = 0 if i j j=i k=i n 1 = (δ j d (j) ) A(δ j d (j) ). (4.19) j=i e (i) 2 A e (i) A CG 6 3.3.2 i 16
(4.11) i j j > i d (i) A n 1 d (i) r (j) = d (i) Ae (j) = d (i) A δ k d (k) k=j n 1 = δ k d (i) Ad (k) k=j d (i) d (k) A- = 0. (4.20) j {i + 1,..., n 1} r (j) d (0),..., d (i) 1 A- r (j) = Ae (j) (4.15) r (j) j > i d (i) r (j) = p (i) r i 1 (j) + β ik d (k) r (j) k=0 0 = p (i) r (j), (4.21) j {i + 1,..., n 1} r (j) p (0),..., p (i) 2 (4.15) r (j) j = i 3 d (i) r (i) = p (i) r (i). (4.22) Algorithm 2 (3.5) - x (k) = x (k 1) + α (k) d (k 1) Ax (k) + b = (Ax (k 1) b) α (k) Ad (k 1) r (k) = r (k 1) α (k) Ad (k 1). (4.23) r (k 1) α (k) Ad (k 1) b Ax (k) r (k) Ad (k 1) - b Ax (k) 4.4 Conjugate Gradient Method 4.2 CG d (0),..., d (n 1) r (0),..., r (n 1) - (4.21) (4.8) (4.1) x i j : r (i) r (j) = 0. (4.24) (4.15) d (i) d (0),..., d (i 1) n p (i) r (i) 17
r (i) r (j+1) = r (i) r (j) α (j) r (i) Ad (j) (4.23) α (j) r(i) Ad (j) = r(i) r (j) r(i) r (j+1) (4.24) 1 r(i) α Ad (i) r(i) r (i), i = j (j) = 1 α (i 1) r(i) r (i), i = j + 1 0. otherwise β ij = { 1 α (i 1) r (i) r (i) d (i 1) Ad (i 1), j = i 1 0. j < i 1 (4.25) (4.17) (4.26) CG O(n 2 ) O(m) m A j = i 1 α (i 1) β ij β ij = d (i 1) Ad (i 1) d (i 1) r (i 1) r(i) r (i) d (i 1) Ad (i 1) (4.8) = r (i) r (i) d (i 1) r (i 1) (4.22) & p (i 1) r (i 1) r(i) = r (i) r(i 1) r (4.27) (i 1) β ij j = i 1 β (i 1) (4.15) d (i+1) = r (i+1) + β (i) d (i). (4.28) CG Algorithm 3 4.5 Convergence Analysis CG n (4.23) - cancellation error A- 4.5.1 Picking Perfect Polynomials - span{r (0),..., r (i) } = span{d (0),..., d (i) } = D i+1, (4.29) r (i) D (i+1) (4.23) i + 1 r (i+1) r (i) Ad (i) r (i+1) r (i) AD i D i+2 = D i+1 AD i+1. (4.30) 18
Algorithm 3: CG. Input: A R n n b R n ε MaxIter Output: Ax = b 1 k = 0, r (0) = b Ax (0) 2 d (0) = r (0) // Set d (0) as r (0) 3 δ (0) = r (0) r (0) 4 while k < MaxIter and δ (k) > ε 2 δ (0) do 5 q = Ad (k), α (k) = δ (k) /d (k) q 6 x (k+1) = x (k) + α (k) d (k) 7 if k is divisible by n then 8 /* Periodically update by the original formula */ 9 r (k+1) = b Ax (k+1) 10 else 11 /* Reduce complexity according to (4.23) */ 12 r (k+1) = r (k) α (k) q 13 end if 14 δ (k+1) = r (k+1) r (k+1), β = δ (k+1) /δ (k) 15 d (k+1) = r (k+1) + βd (k) // Set d (k+1) by (4.28) 16 k = k + 1 17 end while 18 return x (k) D i = D i 1 AD i 1 = D 1 AD 1 A 2 D 1... A i 1 D 1 = span{d (0), Ad (0),..., A i 1 d (0) } = span{r (0), Ar (0),..., A i 1 r (0) } r (i) = Ae (i) = span{ae (0), A 2 e (0),..., A i e (0) }. (4.31) Krylov i ( i e (i) = I + ψ k A k) e (0), (4.32) k=1 ψ k α (k) β (k) A P i (A) λ P i (λ) = 1 + i k=1 ψ kλ k P i (0) = 0 (4.32) e (i) = P i (A)e (0) v 1,..., v n A λ 1,..., λ n e (0) 19
e (0) = n ξ iv i n e (i) = ξ j P i (λ j )v j Ae (i) = e (i) 2 A = j=1 n ξ j P i (λ j )λ j v j j=1 P i (A)v j = P i (λ j )v j n ξj 2 ( Pi (λ j ) ) 2 λj, v j = 1 (4.33) j=1 e (0) 2 A = n j=1 ξ2 j λ j P 0 ( ) 1 4.3 CG e (i) A (4.33) CG e (i) A P i e (i) 2 A min P i = min P i max λ max λ ( Pi (λ) ) 2 n ξj 2 λ j j=1 ( Pi (λ) ) 2 e(0) 2 A. (4.34) n = 2 Fig. 3.3 λ 1 = 7, λ 2 = 2 Fig. 4.3 CG max λ ( Pi (λ) ) 2 a i = 0 e(0) 2 A 1 b i = 1 P i (0) = 1 P 1 (λ) = a λ + 1 a a = argmin max { (2a + 1) 2, (7a + 1) 2}, a a = 2/9 P 1 (λ) = 2/9λ + 1 5/9 c i = 2 (0, 1), (2, 0), (7, 0) 0 Figure 4.3: i CG P i(0) = 1 P i 0 λ 1 = λ 2 (0, 1), (λ 1, 0), (λ 2, 0) CG Fig. 4.3 d 20
λ min λ max CG [λ min, λ max ] CG 4.5.2 Chebyshev Polynomials Chebyshev polynomials i T i (ω) = 1 ( (ω + ω 2 2 1) i + (ω ω 2 1) i). (4.35) Fig. 4.4 ω [ 1, 1] T i (ω) 1-1 1 ω / [ 1, 1] P i (λ) = ( ) T λmax+λ min 2λ i λ max λ min T i ( λ max+λ min λ max λ min ) (4.36) 7 max λ ( Pi (λ) ) 2 Fig. 4.4 λmin = 2, λ max = 7 4.3 c 0.183 Figure 4.4: 7 CG 21
T i ( λmax+λmin 2λ λ max λ min ) 1 ( ) λ T max+λ min 2λ i λ max λ min e (i) A ( ) e (0) A λ T max+λ min i λ max λ min ( ) 1 λmax + λ min T i e (0) A λ max λ min ( ( κ + 1 ) i ( κ 1 ) ) i 1 = 2 + e (0) A. (4.37) κ 1 κ + 1 ( κ+1 ) i i κ 1 0 CG ( κ 1 ) i e(0) e (i) A 2 A. (4.38) κ + 1 (4.37) (3.27) i = 1 CG i > 1 Fig. (3.7) CG Fig. (4.5) CG κ 2 CG (4.38) Figure 4.5: CG κ 4.6 Complexity Analysis CG CG CG - - O(m) m ϵ e (i) = ϵ e (0) (3.27) 1 ( 1 ) i 2 κ ln. (4.39) ϵ 22
(4.38) CG 1 ( 2 ) i κ ln 2. (4.40) ϵ O(mκ) CG O(m κ) O(m) d κ O(n 2/d ) O(n 2 ) CG O(n 3/2 ) O(n 5/3 ) CG O(n 4/3 ) 5 Conjugate Gradient Method in Action 4 CG CG 5.1 Starting and Stopping Criterion x (0) CG x (0) = 0 f(x) A (4.27) Algorithm 3 Step 14 β ij 0 Algorithm 3 Step 4 5.2 Preconditioning Fig. 4.5 (4.38) κ CG CG M Ax = b M 1 Ax = M 1 b, (5.1) κ(m 1 A) κ(a) M 1 A A M 1 A CG Ax = b M M A M 1 A M E EE = M Cholesky λ M 1 A v (E 1 AE )(E v) = E 1 A(E E )v E E = E E = I = (E E )E 1 Av = E (E E 1 )Av E E 1 = M 1 = E M 1 Av M 1 Av = λv = λ(e v), (5.2) λ E 1 AE E v Ax = b { E 1 AE ˆx = M 1 b, ˆx = E x. (5.3) 23
ˆx x A E 1 AE CG ˆx Transformed Preconditioned Conjugate Gradient Method TPCG CG (5.4) ˆd (0) = ˆr (0) = E 1 b E 1 AE ˆx (0) α (i) = ˆr (i) ˆr (i) ˆd (i) E 1 AE ˆd (i) ˆx (i+1) = ˆx (i) + α (i) ˆd(i) ˆr (i+1) = ˆr (i) α (i) E 1 AE ˆd (i) β (i+1) = ˆr (i+1) ˆr (i+1) ˆr (i) ˆr (i) ˆd (i+1) = ˆr (i+1) + β (i+1) ˆd(i). TPCG E E x ˆx ˆr (i) = E 1 r (i) ˆd (i) = E d (i) ˆx = E x E E 1 = M 1 (5.4) r (0) = b Ax (0), d (0) = M 1 r (0) α (i) = r (i) M r (i) d (i) Ad (i) x (i+1) = x (i) + α (i) d (i) r (i+1) = r (i) α (i) Ad (i) β (i+1) = r (i+1) M 1 r (i+1) r (i) M 1 r (i) d (i+1) = M 1 r (i+1) + β (i+1) d (i). Untransformed Preconditioned CG Method UPCG TPCG UPCG ˆx UPCG E f(x) M = A κ(m 1 A) = 1 f(x) f(x) Mx = b M A f(x) Cholesky incomplete Cholesky preconditioning CG / Transformed/Untransformed Preconditioned Steepest Descent Method 5.3 Extending to General Cases A CG x Ax b 2 = 0 (5.4) (5.5) min x Ax b 2, (5.6) A Ax = A b. (5.7) A Ax = b A R m n m > n overconstrained (5.7) (5.6) A A Ax = b underconstrained A A CG 24
(5.7) A CG CG r (i) b Ax (i) r (i) f (x (i) ) α (i) β (i) Fletcher-Reeves FR Polak-Ribière PR β(i+1) F R = r (i+1) r (i+1) r (i) r (i) β(i+1) P R = r (i+1) (r (i+1) r (i) ) r (i) r (i) (5.8) FR f(x) PR β(i+1) P R < 0 CG { r } (i+1) (r (i+1) r (i) ) β (i+1) max r(i) r, 0, (5.9) (i) CG CG f f CG f CG 6 Postscript Jonathan Richard Shewchuk An Introduction to the Conjugate Gradient Method Without the Agonizing Pain http://hliangzhao.me/math/ CG.pdf 25