This is a dogleg method in that the approximate solution of the trust region problem lies on a piecewise linear path with the CG iterations as nodes. As long as CG is performing properly (i.e., p^T w > 0) nodes are added to the path until the path intersects the trust region boundary. If a direction of indefiniteness is found (p^T w ≤ 0), then that direction is followed to the boundary. In this way a negative curvature direction, if found in the course of the CG iteration, can be exploited.
The inputs to Algorithm trcg are the current point x, the objective f, the forcing term η, and the current trust region radius Δ. The output is the approximate solution of the trust region problem d. This algorithm is not the whole story, as once the trust region problem is solved approximately, one must use f(x_c + d) to compute ared and then make a decision on how the trust region radius should be changed. Our formulation differs from that in [247] in that the termination criterion measures relative residuals in the l^2-norm rather than in the C-norm. This change in the norm has no effect on the analysis in [247], and, therefore, we can apply the results in §2.5 directly to draw conclusions about local convergence.
Algorithm 3.3.7. trcg(d, x, f, M, η, Δ, kmax)

1. r = −∇f(x), ρ_0 = ‖r‖_2^2, k = 1, d = 0

2. Do While √ρ_{k−1} > η‖∇f(x)‖_2 and k < kmax

   (a) z = Mr
   (b) τ_{k−1} = z^T r
   (c) if k = 1 then β = 0 and p = z
       else β = τ_{k−1}/τ_{k−2} and p = z + βp
   (d) w = ∇²f(x)p
       If p^T w ≤ 0 then
       find τ such that ‖d + τp‖_C = Δ
       d = d + τp; return
   (e) α = τ_{k−1}/p^T w
   (f) r = r − αw
   (g) ρ_k = r^T r
   (h) d̂ = d + αp
   (i) If ‖d̂‖_C > Δ then
       find τ such that ‖d + τp‖_C = Δ
       d = d + τp; return
   (j) d = d̂; k = k + 1
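
The following MATLAB sketch is ours, not the trcg code from the book's software collection; it assumes the Hessian H, the preconditioner M (the action of C⁻¹), and the matrix C are available explicitly, and the helper name boundary is our own. It is meant only to make the control flow above concrete.

function d = trcg_sketch(g, H, M, C, eta, Delta, kmax)
% Approximate trust region solve: g = gradient, H = Hessian,
% M = preconditioner (action of C^{-1}), C spd, per Algorithm 3.3.7.
r = -g; rho = r'*r; k = 1; d = zeros(size(g));
while sqrt(rho) > eta*norm(g) && k < kmax
    z = M*r;
    tau_new = z'*r;
    if k == 1
        beta = 0; p = z;
    else
        beta = tau_new/tau_old; p = z + beta*p;
    end
    w = H*p;
    if p'*w <= 0                       % negative curvature: follow p to the boundary
        d = d + boundary(d, p, C, Delta)*p; return
    end
    alpha = tau_new/(p'*w);
    r = r - alpha*w;
    rho = r'*r;
    dhat = d + alpha*p;
    if sqrt(dhat'*C*dhat) > Delta      % CG iterate left the trust region
        d = d + boundary(d, p, C, Delta)*p; return
    end
    d = dhat; tau_old = tau_new; k = k + 1;
end
end

function t = boundary(d, p, C, Delta)
% positive root of ||d + t*p||_C = Delta
a = p'*C*p; b = 2*(d'*C*p); c = d'*C*d - Delta^2;
t = (-b + sqrt(b^2 - 4*a*c))/(2*a);
end

In the book's codes the action of M is available without forming C; the explicit matrices here are only for compactness of the sketch.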

Algorithm trcg does what we would expect a dogleg algorithm to do in that the piecewise linear path determined by the iteration moves monotonically away from x (in the ‖·‖_C-norm!) and the quadratic model decreases on that path [247]. Algorithm trcg will, therefore, compute the same Newton step as Algorithm fdpcg. One might think that it may be difficult to compute the C-norm if one has, for example, a way to compute the action of M on a vector that does not require computation of the matrix C. However, at the cost of storing two additional vectors we can update Cp and Cd as the iteration progresses. So, when p is updated to z + βp then Cp = r + βCp can be updated at the same time without computing the product of C with p. Then ‖p‖_C = √(p^T Cp). Similarly d = d + τp implies that Cd = Cd + τCp.
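
These recursions are easy to sanity-check numerically. In the small script below (ours), C and the vectors are random stand-ins for one CG step, and Cp and Cd are seeded with explicit products only for the purpose of the check:

rng(1); n = 5; B = randn(n); C = B'*B + n*eye(n);  % an spd C
r = randn(n,1); p_old = randn(n,1); beta = 0.7; tau = 0.3;
d = randn(n,1); Cp = C*p_old; Cd = C*d;   % carried along in the real iteration
z = C\r;                     % z = M*r with M the action of C^{-1}, so C*z = r
p = z + beta*p_old;
Cp = r + beta*Cp;            % update of C*p without multiplying by C
norm(Cp - C*p)               % ~ 1e-14: the recursion reproduces C*p
sqrt(p'*Cp)                  % the C-norm of p
Cd = Cd + tau*Cp;            % tracks C*(d + tau*p)
norm(Cd - C*(d + tau*p))     % ~ 1e-14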
Algorithm cgtrust combines the solution of the trust region problem from trcg, the trust region radius adjustment scheme from trtest, and (indirectly) the locally convergent algorithm newtcg. The result fits nicely into our paradigm algorithm trgen.



Algorithm 3.3.8. cgtrust(x, f, τ)

1. Initialize Δ, M, η, kmax.

2. Do forever

   (a) Let x_c = x. Compute ∇f(x_c).
   (b) Call trcg(d, x, f, M, η, Δ, kmax) to solve the trust region subproblem.
       Set x_t = x + d.
   (c) Call trtest(x_c, x_t, x, f, Δ), solving the trust region subproblem with Algorithm trcg.
   (d) Update η.

Theorem 3.3.10 combines several results from [247].
Theorem 3.3.10. Let f be twice Lipschitz continuously differentiable. Let M be a given positive definite matrix and let {η_n} satisfy 0 < η_n < 1 for all n. Let {x_n} be the sequence generated by Algorithm cgtrust and assume that {‖∇²f(x_n)‖} is bounded. Then

(3.49)  lim_{n→∞} ‖∇f(x_n)‖ = 0.

Moreover, if x* is a local minimizer for which the standard assumptions hold and x_n → x*, then

• if η_n → 0 the convergence is q-superlinear, and

• if η_n ≤ K_η‖∇f(x_n)‖^p for some K_η > 0 the convergence is q-superlinear with q-order 1 + p.

Finally, there are δ and Δ such that if ‖x_0 − x*‖ ≤ δ and Δ_0 ≤ Δ then x_n → x*.
One can, as we do in the MATLAB code cgtrust, replace the Hessian–vector product with a difference Hessian. The accuracy of the difference Hessian and the loss of symmetry present the potential problem that was mentioned in §2.5. Another, very different, approach is to approximate the exact solution of the trust region subproblem with an iterative method [243].


3.4 Examples
The results we report here used the MATLAB implementations of steepest descent, steep.m, damped Gauss–Newton, gaussn.m, the dogleg trust region algorithm for Newton's method, ntrust.m, and the PCG–dogleg algorithms, cgtrust.m, from the software collection.
Our MATLAB implementation of Algorithm steep guards against extremely poor scaling and very long steps by setting λ to

(3.50)  λ_0 = min(1, 100/(1 + ‖∇f(x)‖))

at the beginning of the line search. We invite the reader in Exercise 3.5.3 to attempt the control example with λ_0 = 1.
We not only present plots, which are an efficient way to understand convergence rates, but we also report counts of function, gradient, and Hessian evaluations and the results of the MATLAB flops command.




[Figure 3.1: Steepest Descent and Newton–Dogleg for Parameter ID Problem. Log-scale plots of gradient norm and function value against iterations for each method.]



[Figure 3.2: Gauss–Newton and Levenberg–Marquardt for Parameter ID Problem. Log-scale plots of gradient norm and function value against iterations for each method.]




[Figure 3.3: Steepest Descent and Dogleg–CG for Discrete Control Problem. Log-scale plots of gradient norm and function value against iterations for each method.]


3.4.1 Parameter Identification

We consider the problem from §2.6.1 except we use the initial data x_0 = (5, 5)^T. Both the Gauss–Newton and Newton methods will fail to converge with this initial data without globalization (see Exercise 3.5.14). Newton's method has particular trouble with this problem because the Newton direction is not a descent direction in the early phases of the iteration. The termination criterion and difference increment for the finite difference Hessian were the same as for the computation in §2.6.1.
In Figure 3.1 we compare the performance of the Newton dogleg algorithm with the steepest descent algorithm. Our implementation of the classical dogleg in ntrust uses the standard values

(3.51)  ω_down = .5, ω_up = 2, μ_0 = μ_low = .25, and μ_high = .75.

The plots clearly show the locally superlinear convergence of Newton's method and the linear convergence of steepest descent. However, the graphs do not completely show the difference in computational costs. In terms of gradient evaluations, steepest descent was marginally better than the Newton dogleg algorithm, requiring 50 gradients as opposed to 55 (which includes those needed for the 18 difference Hessian evaluations) for the Newton dogleg algorithm. However, the steepest descent algorithm required 224 function evaluations, while the Newton dogleg needed only 79. As a result, the Newton dogleg code was much more efficient, needing roughly 5 million floating point operations instead of the 10 million needed by the steepest descent code.
In Figure 3.2 we plot the performance of the damped Gauss–Newton and Levenberg–Marquardt algorithms. These exploit the least squares structure of the problem and are locally superlinearly convergent because this is a zero residual problem. They also show that algorithms that effectively exploit the structure of the least squares problem are much more efficient. Gauss–Newton required 6 gradient evaluations, 14 function evaluations, and 750 thousand floating point operations, and Levenberg–Marquardt required 12 gradients, 23 functions, and 1.3


million floating point operations.

3.4.2 Discrete Control Problem
We consider the discrete control problem from §1.6.1 with N = 400, T = 1, y_0 = 0,

L(y, u, t) = (y − 3)^2 + .5u^2, and φ(y, u, t) = uy + t^2.

We chose the poor initial iterate

u_0(t) = 5 + 300 sin(20πt).

This problem can be solved very efficiently with Algorithm cgtrust. In our implementation we use the same parameters from (3.51). In Figure 3.3 we compare the dogleg–CG iteration with steepest descent. We terminated both iterations when ‖∇f‖ < 10^{−8}. For the dogleg–CG code we used η = .01 throughout the entire iteration and an initial trust region radius of ‖u_0‖. The steepest descent computation required 48 gradient evaluations, 95 function evaluations, and roughly 1 million floating point operations, and dogleg–CG needed 17 gradient evaluations, 21 function evaluations, and roughly 530 thousand floating point operations. Note that the steepest descent algorithm performed very well in the terminal phase of the iteration. The reason for this is that, in this example, the Hessian is near the identity.


3.5 Exercises on Global Convergence
3.5.1. Let F be a nonlinear function from R^N to R^N. Let

f(x) = ‖F(x)‖^2/2.

What is ∇f? When is the Newton step for the nonlinear equation F(x) = 0,

d = −F′(x)^{−1}F(x),

a descent direction for f at x?

3.5.2. Prove Lemma 3.2.1.

3.5.3. Implement Algorithm steep without the scaling fixup in (3.50). Apply this crippled algorithm to the control problem example from §3.4.2. What happens and why?

3.5.4. Show that if f is a convex quadratic then f is bounded from below.

3.5.5. Verify (3.40).

3.5.6. Show that the Levenberg–Marquardt steps computed by (3.20) and (3.21) are the same.

3.5.7. Prove Theorem 3.2.7.

3.5.8. Complete the proof of Theorem 3.3.2.

3.5.9. Prove Theorem 3.3.4.

3.5.10. Look at the trust region algorithm for nonlinear equations from [218] or [84]. What are the
costs of that algorithm that are not present in a line search? When might this trust region
approach have advantages for solving nonlinear equations? Could it be implemented
inexactly?



3.5.11. The double dogleg method [80], [84] puts a new node on the dogleg path in the Newton
direction, thereby trying more aggressively for superlinear convergence. Implement this
method, perhaps by modifying the MATLAB code ntrust.m, and compare the results
with the examples in §3.4. Prove convergence results like Theorems 3.3.7 and 3.3.9 for
this method.
3.5.12. In [51] a trust region algorithm was proposed that permitted inaccurate gradient computations, with the relative accuracy being tightened as the iteration progresses. Look at [51] and try to design a similar algorithm based on the line search paradigm. What problems do you encounter? How do you solve them?
3.5.13. Suppose one modifies Algorithm trtest by not resolving the trust region problem if the trial point is rejected, but instead performing a line search from x_t, and setting Δ = ‖x_+ − x_c‖, where x_+ is the accepted point from the line search. Discuss the merits of this modification and any potential problems. See [209] for the development of this idea.
3.5.14. Write programs for optimization that take full Gauss–Newton or Newton steps (you can cripple the MATLAB codes gaussn.m and ntrust.m for this). Apply these codes to the parameter identification problem from §3.4.1. What happens?
3.5.15. Write a nonlinear CG code and apply it to the problems in §3.4. Try at least two ways to
manage the line search. How important are the (strong) Wolfe conditions?
3.5.16. Discuss the impact of using a difference Hessian in Algorithm trcg. How will the global
convergence of Algorithm cgtrust be affected? How about the local convergence?
Consider the accuracy in the evaluation of ∇f in your results.
3.5.17. Without looking at [247] describe in general terms how the proof of Theorem 3.3.1 should be modified to prove Theorem 3.3.10. Then examine the proof in [247] to see if you left anything out.








Chapter 4

The BFGS Method

Quasi-Newton methods update an approximation of ∇²f(x*) as the iteration progresses. In general the transition from current approximations x_c and H_c of x* and ∇²f(x*) to new approximations x_+ and H_+ is given (using a line search paradigm) by the following steps:

1. Compute a search direction d = −H_c⁻¹∇f(x_c).

2. Find x_+ = x_c + λd using a line search to insure sufficient decrease.

3. Use x_c, x_+, and H_c to update H_c and obtain H_+.

The way in which H_+ is computed determines the method.
The BFGS (Broyden, Fletcher, Goldfarb, Shanno) [36], [103], [124], [237] method, which is the focus of this chapter, and the other methods we will mention in §4.3 are also called secant methods because they satisfy the secant equation

(4.1)  H_+ s = y.

In (4.1)

s = x_+ − x_c and y = ∇f(x_+) − ∇f(x_c).

If N = 1, all secant methods reduce to the classical secant method for the single nonlinear equation f′(x) = 0, i.e.,

(4.2)  x_+ = x_c − f′(x_c)(x_c − x_-)/(f′(x_c) − f′(x_-)),

where x_- is the iterate previous to x_c.
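
For example (an illustration of ours, not from the text), a few steps of (4.2) for the one-variable objective f(x) = x^2/2 + sin(x), whose minimizer satisfies f′(x) = x + cos(x) = 0:

% Secant iteration (4.2) applied to f'(x) = x + cos(x) = 0
fp = @(x) x + cos(x);
xm = 0; xc = -1;               % starting iterates x_- and x_c
for k = 1:6
    xp = xc - fp(xc)*(xc - xm)/(fp(xc) - fp(xm));   % formula (4.2)
    xm = xc; xc = xp;
end
disp(xc)   % approx -0.7391, where f'(x*) = 0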
The standard quasi-Newton update for nonlinear equations is Broyden's [34] method, a rank-one update,

(4.3)  H_+ = H_c + (y − H_c s)s^T/(s^T s).

Broyden's method does not preserve the structural properties needed for line search methods in optimization, namely, symmetry and positive definiteness, and could, in fact, encourage convergence to a local maximum. For that reason quasi-Newton methods in optimization are more complex than those used for nonlinear equations. The methods of analysis and implementation are more complex as well.
In this chapter we will concentrate on the BFGS method [36], [103], [124], [237], which is the rank-two update

(4.4)  H_+ = H_c + yy^T/(y^T s) − (H_c s)(H_c s)^T/(s^T H_c s).

We will briefly discuss other updates and variations that exploit problem structure in §4.3.
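
One can check directly that (4.4) satisfies the secant equation (4.1), since the two correction terms contribute y and −H_c s when applied to s:

\[
H_+ s \;=\; H_c s + \frac{y\,(y^T s)}{y^T s} - \frac{(H_c s)\,(s^T H_c s)}{s^T H_c s} \;=\; H_c s + y - H_c s \;=\; y.
\]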


4.1 Analysis
This section begins with some simple observations on nonsingularity and positivity of the update.
It is very useful for both theory and practice to express (4.4) in terms of the inverse matrices.
The formula we use in this book is given in Lemma 4.1.1.
Lemma 4.1.1. Let H_c be spd, y^T s ≠ 0, and H_+ given by (4.4). Then H_+ is nonsingular and

(4.5)  H_+⁻¹ = (I − s y^T/(y^T s)) H_c⁻¹ (I − y s^T/(y^T s)) + s s^T/(y^T s).

Proof. See exercise 4.5.2.
Lemma 4.1.2. Let H_c be spd, y^T s > 0, and H_+ given by (4.4). Then H_+ is spd.
Proof. Positivity of H_c and y^T s ≠ 0 imply that for all z ≠ 0,

z^T H_+ z = (z^T y)^2/(y^T s) + z^T H_c z − (z^T H_c s)^2/(s^T H_c s).

Using the symmetry and positivity of H_c, we have

(z^T H_c s)^2 ≤ (s^T H_c s)(z^T H_c z),

with equality only if z = 0 or s = 0, and, therefore, since z, s ≠ 0 and y^T s > 0,

z^T H_+ z > (z^T y)^2/(y^T s) ≥ 0,

as asserted.
If y^T s ≤ 0 the update is considered a failure.
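
A quick numerical illustration of Lemmas 4.1.1 and 4.1.2 (a throwaway MATLAB script of ours, not one of the book's codes):

rng(0); n = 4; A = randn(n); Hc = A'*A + n*eye(n);     % an spd H_c
s = randn(n,1); y = s + 0.1*randn(n,1);                % makes y'*s > 0
Hp  = Hc + (y*y')/(y'*s) - (Hc*s)*(Hc*s)'/(s'*Hc*s);   % update (4.4)
P   = eye(n) - (s*y')/(y'*s);
Hpi = P*(Hc\P') + (s*s')/(y'*s);                       % formula (4.5)
norm(Hp*s - y)          % ~ 1e-15: secant equation (4.1)
norm(Hp*Hpi - eye(n))   % ~ 1e-14: (4.5) is the inverse of (4.4)
min(eig(Hp)) > 0        % true: H_+ is spd (Lemma 4.1.2)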


4.1.1 Local Theory
The local theory [37] requires accurate initial approximations to both x* and ∇²f(x*). The statement of the convergence result is easy to understand.
Theorem 4.1.3. Let the standard assumptions hold. Then there is δ such that if

‖x_0 − x*‖ ≤ δ and ‖H_0 − ∇²f(x*)‖ ≤ δ,

then the BFGS iterates are defined and converge q-superlinearly to x*.

Technical Details

The proof of Theorem 4.1.3 is technical and we subdivide it into several lemmas. Our proof is a hybrid of ideas from [37], [135], and [154]. Similar to other treatments of this topic [45] we begin with the observation (see §2.5.2) that one may assume ∇²f(x*) = I for the convergence analysis.
Lemma 4.1.4. Let the standard assumptions hold and let

f̂(y) = f(Ay),



where A = (∇²f(x*))^{−1/2}. Let x_c and H_c be given and let x̂_c = A⁻¹x_c and Ĥ_c = AH_cA. Then the BFGS updates (x_+, H_+) for f and (x̂_+, Ĥ_+) for f̂ are related by

x̂_+ = A⁻¹x_+ and Ĥ_+ = AH_+A.

In particular, the BFGS sequence for f exists (i.e., H_n is spd for all n) if and only if the BFGS sequence for f̂ does, and the convergence of {x_n} is q-superlinear if and only if the convergence of {x̂_n} is.
Proof. The proof is a simple calculation and is left for exercise 4.5.3.
Hence we can, with no loss of generality, assume that ∇²f(x*) = I, for if this is not so, we can replace f by f̂ and obtain an equivalent problem for which it is.
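
As a hint for the calculation (ours, so as not to spoil exercise 4.5.3 entirely), the result rests on the identities, with ŝ, ŷ denoting the data for f̂,

\[
\hat s = A^{-1}s,\qquad \hat y = A y,\qquad \hat y^T\hat s = y^T s,\qquad
\hat H_c\hat s = A H_c s,\qquad \hat s^T\hat H_c\hat s = s^T H_c s,
\]

so each term of (4.4) transforms consistently and Ĥ_+ = AH_+A follows.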
Keeping in mind our assumption that ∇²f(x*) = I, we denote errors in the inverse Hessian by

E = H⁻¹ − ∇²f(x*)⁻¹ = H⁻¹ − I.

These errors satisfy a simple recursion [37].
Lemma 4.1.5. Let the standard assumptions hold. Let H_c be spd and

x_+ = x_c − H_c⁻¹∇f(x_c).

Then there is δ_0 such that if

0 < ‖x_c − x*‖ ≤ δ_0 and ‖E_c‖ ≤ δ_0,

then y^T s > 0. Moreover, if H_+ is the BFGS update of H_c then

(4.6)  E_+ = (I − ww^T)E_c(I − ww^T) + Δ,

where w = s/‖s‖ and for some K_Δ > 0

(4.7)  ‖Δ‖ ≤ K_Δ‖s‖.

Proof. Let δ_0 be small enough so that ∇f(x_c) ≠ 0 if x_c ≠ x*. Theorem 1.2.1 implies that

∇f(x_c) = ∫_0^1 ∇²f(x* + t e_c)e_c dt = e_c + Δ_1 e_c,

where Δ_1 is the matrix given by

Δ_1 = ∫_0^1 (∇²f(x* + t e_c) − I) dt.

Clearly

‖Δ_1‖ ≤ γ‖e_c‖/2,

and

s = −H_c⁻¹∇f(x_c) = −(I + E_c)(I + Δ_1)e_c.

Therefore,

‖e_c‖(1 − δ_0)(1 − γδ_0/2) ≤ ‖s‖ ≤ ‖e_c‖(1 + δ_0)(1 + γδ_0/2)

and hence

(4.8)  0 < ‖e_c‖/2 ≤ ‖s‖ ≤ 2‖e_c‖

if, say,

(4.9)  δ_0 ≤ min(1/4, 1/(2γ)).


Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.




74 ITERATIVE METHODS FOR OPTIMIZATION

We will assume that (4.9) holds for the rest of this section.
The standard assumptions, our assumption that ∇²f(x*) = I, and the fundamental theorem of calculus imply that

(4.10)  y = ∇f(x_+) − ∇f(x_c) = ∫_0^1 ∇²f(x_c + ts)s dt = s + ∫_0^1 (∇²f(x_c + ts) − I)s dt = s + Δ_2 s,

where Δ_2 is the matrix given by

Δ_2 = ∫_0^1 (∇²f(x_c + ts) − I) dt.

The standard assumptions imply that ‖Δ_2‖ ≤ γ(‖e_+‖ + ‖e_c‖)/2. Hence, (4.8) implies that

(4.11)  y^T s = s^T s + (Δ_2 s)^T s ≥ ‖s‖^2(1 − 3γ‖e_c‖/2) ≥ ‖s‖^2(1 − 3γδ_0/2) > 0

provided δ_0 < 2/(3γ). We have that

(4.12)  s y^T/(y^T s) = (s s^T + s(Δ_2 s)^T)/(s^T s + (Δ_2 s)^T s) = s s^T/(s^T s) − Δ_3 = ww^T − Δ_3,

where (see exercise 4.5.4), for some C > 0,

(4.13)  ‖Δ_3‖ ≤ C‖s‖.

Subtracting (∇²f(x*))⁻¹ = I from (4.5) and using (4.12) gives us

E_+ = (I − ww^T + Δ_3)H_c⁻¹(I − ww^T + Δ_3^T) + ww^T − I

    = (I − ww^T)(E_c + I)(I − ww^T) + ww^T − I + Δ

    = (I − ww^T)E_c(I − ww^T) + Δ,

where

Δ = Δ_3 H_c⁻¹(I − ww^T + Δ_3^T) + (I − ww^T)H_c⁻¹Δ_3^T.

Therefore, if (4.9) holds then 1 + δ_0 ≤ 3/2 and

‖Δ‖ ≤ (1 + δ_0)‖Δ_3‖(2 + ‖Δ_3‖) ≤ ‖s‖ 3C(2 + C‖s‖)/2 ≤ 3C‖s‖(2 + 2Cδ_0)/2.

Reduce δ_0 if necessary so that 2Cδ_0 ≤ 1 and the proof is complete with K_Δ = 9C/2.
Lemma 4.1.5 implies that the approximate Hessians do not drift too far from the exact Hessian if the initial data are good. This property, called bounded deterioration in [37], will directly imply local q-linear convergence.
Corollary 4.1.6. Let the assumptions of Lemma 4.1.5 hold and let δ_0 be as in the statement of Lemma 4.1.5. Then

(4.14)  ‖E_+‖ ≤ ‖E_c‖ + K_Δ‖s‖ ≤ ‖E_c‖ + K_Δ(‖e_c‖ + ‖e_+‖).





Proof. The leftmost inequality follows from Lemma 4.1.5 and the fact that I − ww^T is an orthogonal projection. The final inequality follows from the triangle inequality.
We are now ready to prove local q-linear convergence. This is of interest in its own right and is a critical step in the superlinear convergence proof. Note that, unlike the statement and proof of Theorem 2.3.4, we do not express the estimates in terms of ‖H − ∇²f(x*)‖ = ‖H − I‖ but in terms of E = H⁻¹ − I. The two approaches are equivalent, since if ‖E‖ ≤ δ < 1/2, then ‖H⁻¹‖ < 3/2 and the Banach lemma implies that ‖H‖ ≤ 2. Hence

‖H_n − I‖/2 ≤ ‖H_n⁻¹ − I‖ = ‖H_n⁻¹(H_n − I)‖ ≤ ‖H_n⁻¹‖‖H_n − I‖ ≤ 3‖H_n − I‖/2.

Theorem 4.1.7. Let the standard assumptions hold and let σ ∈ (0, 1). Then there is δ such that if

(4.15)  ‖x_0 − x*‖ ≤ δ and ‖H_0⁻¹ − ∇²f(x*)⁻¹‖ ≤ δ,

then the BFGS iterates are defined and converge q-linearly to x* with q-factor at most σ.
Proof. For δ̂ sufficiently small and

(4.16)  ‖x_c − x*‖ ≤ δ̂ and ‖E_c‖ = ‖H_c⁻¹ − I‖ ≤ δ̂,

the standard assumptions imply that there is K̄ such that

(4.17)  ‖e_+‖ ≤ K̄(‖E_c‖‖e_c‖ + ‖e_c‖^2)/2 ≤ K̄δ̂‖e_c‖.

Reduce δ̂ if necessary so that K̄δ̂ ≤ σ to obtain ‖e_+‖ ≤ σ‖e_c‖. The method of proof is to select δ so that (4.16) is maintained for the entire iteration if the initial iterates satisfy (4.15).
With this in mind we set

(4.18)  δ = (δ*/2)(1 + K_Δ(1 + σ)/(1 − σ))⁻¹ < δ*/2,

where K_Δ is from Lemma 4.1.5. Now if ‖H_0 − I‖ < δ then

‖E_0‖ ≤ δ/(1 − δ) ≤ 2δ < δ*,

which is the estimate we need.
Now by Corollary 4.1.6

‖E_1‖ ≤ ‖E_0‖ + K_Δ(1 + σ)‖e_0‖.

The proof will be complete if we can show that (4.15) and (4.18) imply that ‖E_n‖ < δ* for all n. We do this inductively. If ‖E_n‖ < δ* and ‖e_{j+1}‖ ≤ σ‖e_j‖ for all j ≤ n, then (4.14) implies that

‖E_{n+1}‖ ≤ ‖E_n‖ + K_Δ(‖e_n‖ + ‖e_{n+1}‖) ≤ ‖E_n‖ + K_Δ(1 + σ)‖e_n‖ ≤ ‖E_n‖ + K_Δ(1 + σ)σ^n δ,

and hence, applying this bound repeatedly,

‖E_{n+1}‖ ≤ ‖E_0‖ + δ K_Δ(1 + σ) Σ_{j=0}^{n} σ^j ≤ 2δ(1 + K_Δ(1 + σ)/(1 − σ)).

We complete the induction and the proof by invoking (4.18) to conclude that ‖E_{n+1}‖ ≤ δ*.



Proof of Theorem 4.1.3

The Dennis–Moré condition [82], [81] is a necessary and sufficient condition for superlinear convergence of quasi-Newton methods. In terms of the assumptions we make in this section, the condition is

(4.19)  lim_{n→∞} ‖E_n s_n‖/‖s_n‖ = 0,

where {s_n} is the sequence of steps and {E_n} is the sequence of errors in the inverse Hessian. We will only state and prove the special case of the necessary condition that we need and refer the reader to [82], [81], [84], or [154] for more general proofs.
Theorem 4.1.8. Let the standard assumptions hold; let {H_n} be a sequence of nonsingular N × N matrices satisfying

(4.20)  ‖H_n‖ ≤ M

for some M > 0. Let x_0 ∈ R^N be given and let {x_n}_{n=1}^∞ be given by

x_{n+1} = x_n − H_n⁻¹∇f(x_n)

for this sequence of matrices. Then if x_n → x* q-linearly, x_n ≠ x* for any n, and (4.19) holds, then x_n → x* q-superlinearly.
Proof. We begin by invoking (4.10) to obtain

E_n s_n = (H_n⁻¹ − I)s_n = (H_n⁻¹ − I)(y_n − Δ_2 s_n) = E_n y_n + O(‖s_n‖^2).

Convergence of x_n to x* implies that s_n → 0 and hence (4.19) can be written as

(4.21)  lim_{n→∞} ‖E_n y_n‖/‖s_n‖ = 0,

where y_n = ∇f(x_{n+1}) − ∇f(x_n).
Now let σ be the q-factor for the sequence {x_n}. Clearly

(1 − σ)‖e_n‖ ≤ ‖s_n‖ ≤ (1 + σ)‖e_n‖.

Hence (4.21) is equivalent to

(4.22)  lim_{n→∞} ‖E_n y_n‖/‖e_n‖ = 0.

Since H_n⁻¹∇f(x_n) = −s_n and s_n = y_n + O(‖s_n‖^2) we have

E_n y_n = (H_n⁻¹ − I)(∇f(x_{n+1}) − ∇f(x_n))

        = H_n⁻¹∇f(x_{n+1}) + s_n − y_n = H_n⁻¹∇f(x_{n+1}) + O(‖s_n‖^2)

        = H_n⁻¹e_{n+1} + O(‖e_{n+1}‖^2 + ‖s_n‖^2) = H_n⁻¹e_{n+1} + O(‖e_n‖^2).

Therefore, (4.22) implies that

‖E_n y_n‖/‖e_n‖ = ‖H_n⁻¹e_{n+1}‖/‖e_n‖ + O(‖e_n‖) ≥ M⁻¹‖e_{n+1}‖/‖e_n‖ + O(‖e_n‖),

so ‖e_{n+1}‖/‖e_n‖ → 0 as n → ∞, proving q-superlinear convergence.
For the remainder of this section we assume that (4.15) holds and that δ is small enough so that the conclusions of Theorem 4.1.7 hold for some σ ∈ (0, 1). An immediate consequence of this is that

(4.23)  Σ_{n=0}^{∞} ‖s_n‖ < ∞.




The Frobenius norm of a matrix A is given by

(4.24)  ‖A‖_F^2 = Σ_{i,j=1}^{N} (A)_{ij}^2.

It is easy to show that (see exercise 4.5.5) for any unit vector v ∈ R^N,

(4.25)  ‖A(I − vv^T)‖_F^2 ≤ ‖A‖_F^2 − ‖Av‖^2 and ‖(I − vv^T)A‖_F^2 ≤ ‖A‖_F^2.

We have, using (4.6), (4.7), and (4.25), that

(4.26)  ‖E_{n+1}‖_F^2 ≤ ‖E_n‖_F^2 − ‖E_n w_n‖^2 + O(‖s_n‖) = (1 − θ_n^2)‖E_n‖_F^2 + O(‖s_n‖),

where w_n = s_n/‖s_n‖ and

θ_n = ‖E_n w_n‖/‖E_n‖_F if E_n ≠ 0, and θ_n = 1 if E_n = 0.

Using (4.23) we see that for any k ≥ 0,

Σ_{n=0}^{k} θ_n^2‖E_n‖_F^2 ≤ Σ_{n=0}^{k} (‖E_n‖_F^2 − ‖E_{n+1}‖_F^2) + O(1) = ‖E_0‖_F^2 − ‖E_{k+1}‖_F^2 + O(1) < ∞.

Hence θ_n‖E_n‖_F → 0. However, θ_n‖E_n‖_F = ‖E_n w_n‖ if E_n ≠ 0 and θ_n‖E_n‖_F = 0 = ‖E_n w_n‖ if E_n = 0, so in either case

θ_n‖E_n‖_F = ‖E_n w_n‖ = ‖E_n s_n‖/‖s_n‖.

Hence (4.19) holds. This completes the proof of Theorem 4.1.3.

4.1.2 Global Theory
If one uses the BFGS model Hessian in the context of Algorithm optarm, then Theorem 3.2.4
can be applied if the matrices {Hk } remain bounded and well conditioned. However, even if a
limit point of the iteration is a minimizer x— that satis¬es the standard assumptions, Theorem 3.2.4
does not guarantee that the iteration will converge to that point. The situation in which x is near
x— but H is not near ∇2 f (x— ) is, from the point of view of the local theory, no better than that
when x is far from x— . In practice, however, convergence (often superlinear) is observed. The
result in this section is a partial explanation of this.
Our description of the global theory, using the Armijo line search paradigm from Chapter 3, is
based on [43]. We also refer the reader to [221], [45], and [269] for older results with a different
line search approach. Results of this type require strong assumptions on f and the initial iterate
x0 , but the reward is global and locally superlinear convergence for a BFGS“Armijo iteration.
Assumption 4.1.1. The set

D = {x : f(x) ≤ f(x_0)}

is convex and f is Lipschitz twice continuously differentiable in D. Moreover, there are λ_+ ≥ λ_- > 0 such that

σ(∇²f(x)) ⊂ [λ_-, λ_+]

for all x ∈ D.
Assumption 4.1.1 implies that f has a unique minimizer x* in D and that the standard assumptions hold near x*.
Theorem 4.1.9. Let Assumption 4.1.1 hold and let H_0 be spd. Then the BFGS–Armijo iteration converges q-superlinearly to x*.
The results for local and global convergence do not completely mesh. An implementation must allow for the fact that Assumption 4.1.1 may fail to hold, even near the root, and that y^T s ≤ 0 is a possibility when far from the root.


4.2 Implementation
The two implementation issues that we must confront are storage of the data needed to maintain the updates and a strategy for dealing with the possibility that y^T s ≤ 0. We address the storage question in §4.2.1. For the second issue, when y^T s is not sufficiently positive, we restart the BFGS update with the identity. We present the details of this in §4.2.2. Our globalization approach, also given in §4.2.2, is the simplest possible, the Armijo rule as described in Chapter 3.
We choose to discuss the Armijo rule in the interest of simplicity of exposition. However, while the Armijo rule is robust and sufficient for most problems, more complex line search schemes have been reported to be more efficient [42], and one who seeks to write a general purpose optimization code should give careful thought to the best way to globalize a quasi-Newton method. In the case of BFGS, for example, one is always seeking to use a positive definite quadratic model, even in regions of negative curvature, and in such regions the approximate Hessian could be reinitialized to the identity more often than necessary.

4.2.1 Storage
For the present we assume that y^T s > 0. We will develop a storage-efficient way to compute the BFGS step using the history of the iteration rather than full matrix storage.
The implementation recommended here is one of many that stores the history of the iteration and uses that information recursively to compute the action of H_k⁻¹ on a vector. This idea was suggested in [16], [186], [206], and other implementations may be found in [44] and [201]. All of these implementations store the iteration history in the pairs {s_k, y_k} and we present a concrete example in Algorithm bfgsrec. A better, but somewhat less direct, way is based on the ideas in [91] and [275] and requires that only a single vector be stored for each iteration. We assume that we can compute the action of H_0⁻¹ on a vector efficiently, say, by factoring H_0 at the outset of the iteration or by setting H_0 = I. We will use the BFGS formula from Lemma 4.1.1.
One way to maintain the update is to store the history of the iteration in the sequences of vectors {y_k} and {s_k} where

s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k).

If one has done this for k = 0, . . . , n − 1, one can compute the new search direction

d_n = −H_n⁻¹∇f(x_n)

by a recursive algorithm which applies (4.5).



Algorithm bfgsrec overwrites a given vector d with H_n⁻¹d. The storage needed is one vector for d and 2n vectors for the sequences {s_k, y_k}_{k=0}^{n−1}. A method for computing the product of H_0⁻¹ and a vector must also be provided.

Algorithm 4.2.1. bfgsrec(n, {s_k}, {y_k}, H_0⁻¹, d)

1. If n = 0, d = H_0⁻¹d; return

2. α = s_{n−1}^T d / y_{n−1}^T s_{n−1}; d = d − α y_{n−1}

3. call bfgsrec(n − 1, {s_k}, {y_k}, H_0⁻¹, d)

4. d = d + (α − (y_{n−1}^T d / y_{n−1}^T s_{n−1})) s_{n−1}
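
A direct MATLAB transcription (ours; the storage layout, with s_k and y_k as columns k+1 of arrays S and Y and a function handle H0inv for the action of H_0⁻¹, is our assumption, not the book's interface):

function d = bfgsrec(n, S, Y, H0inv, d)
% Overwrite d with H_n^{-1}*d via the recursion based on (4.5).
if n == 0
    d = H0inv(d); return
end
alpha = (S(:,n)'*d)/(Y(:,n)'*S(:,n));   % S(:,n) holds s_{n-1}, Y(:,n) holds y_{n-1}
d = d - alpha*Y(:,n);
d = bfgsrec(n-1, S, Y, H0inv, d);
d = d + (alpha - (Y(:,n)'*d)/(Y(:,n)'*S(:,n)))*S(:,n);
end

For example, with H_0 = I one would call d = bfgsrec(n, S, Y, @(v) v, -g), where g = ∇f(x_n), to obtain the search direction.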

Algorithm bfgsrec has the great advantage, at least in a language that efficiently supports recursion, of being very simple. More complex, but nonrecursive, versions have been described in [16], [201], and [44].
The storage cost of two vectors per iteration can be significant, and when available storage is exhausted one can simply discard the iteration history and restart with H_0. This approach, which we implement in the remaining algorithms in this section, takes advantage of the fact that if H_0 is spd then −H_0⁻¹∇f(x) will be a descent direction, and hence useful for a line search. Another approach, called the limited memory BFGS [44], [207], [176], [201], keeps all but the oldest (s, y) pair and continues with the update. Neither of these approaches for control of storage, while essential in practice for large problems, has the superlinear convergence properties that the full-storage algorithm does.
At a cost of a modest amount of complexity in the formulation, we can reduce the storage cost to one vector for each iteration. The method for doing this in [275] begins with an expansion of (4.5) as

H_+⁻¹ = H_c⁻¹ + α_0 s_c s_c^T + β_0 ((H_c⁻¹y_c)s_c^T + s_c(H_c⁻¹y_c)^T),

where

α_0 = (y_c^T s_c + y_c^T H_c⁻¹ y_c)/(y_c^T s_c)^2 and β_0 = −1/(y_c^T s_c).

Now note that

H_c⁻¹y_c = H_c⁻¹∇f(x_+) − H_c⁻¹∇f(x_c) = H_c⁻¹∇f(x_+) + s_c/λ_c

and obtain

(4.27)  H_+⁻¹ = H_c⁻¹ + α_1 s_c s_c^T + β_0 (s_c(H_c⁻¹∇f(x_+))^T + (H_c⁻¹∇f(x_+))s_c^T),

where

α_1 = α_0 + 2β_0/λ_c.

Also

(4.28)  d_+ = −H_+⁻¹∇f(x_+)
            = −(I − s_c y_c^T/(y_c^T s_c)) H_c⁻¹ (I − y_c s_c^T/(y_c^T s_c)) ∇f(x_+) − s_c s_c^T ∇f(x_+)/(y_c^T s_c)
            = A_c s_c + B_c H_c⁻¹∇f(x_+),

where

(4.29)  A_c = y_c^T H_c⁻¹(I − y_c s_c^T/(y_c^T s_c))∇f(x_+)/(y_c^T s_c) + s_c^T ∇f(x_+)/(λ_c y_c^T s_c) − s_c^T ∇f(x_+)/(y_c^T s_c)



and

(4.30)  B_c = −1 + s_c^T ∇f(x_+)/(y_c^T s_c).

At this point we can compute d_+, and therefore λ_+ and s_+, using only H_c⁻¹∇f(x_c). We do not need H_+ at all. We can now form H_+⁻¹ with the new data for the next iterate and will show that we do not need to store the vectors {y_k}.
Since (verify this!) B_c ≠ 0, we have

H_c⁻¹∇f(x_+) = s_+/(λ_+ B_c) − A_c s_c/B_c.

Combining this with (4.27) gives

(4.31)  H_+⁻¹ = H_c⁻¹ + α_c s_c s_c^T + β_c (s_c s_+^T + s_+ s_c^T),

where

(4.32)  α_c = α_1 − 2β_0 A_c/B_c and β_c = β_0/(B_c λ_+).

This leads to the expansion

(4.33)  H_{n+1}⁻¹ = H_0⁻¹ + Σ_{k=0}^{n} (α_k s_k s_k^T + β_k (s_k s_{k+1}^T + s_{k+1} s_k^T)).

Upon reflection the reader will see that this is a complete algorithm. We can use (4.28) and H_n⁻¹ to compute d_{n+1}. Then we can compute λ_{n+1} and s_{n+1} and use them and (4.32) to compute α_n and β_n. This new data can be used to form H_{n+1}⁻¹ with (4.33), which we can use to compute d_{n+2} and continue the iteration.
In this way only the steps {s_k} and the expansion coefficients {α_k} and {β_k} need be stored. Algorithm bfgsopt is an implementation of these ideas.
Algorithm 4.2.2. bfgsopt(x, f, τ)

1. g = ∇f(x), n = 0.

2. While ‖g‖ > τ

   (a) If n = 0, d_n = −H_0⁻¹g;
       otherwise compute A, B, and d_n using (4.28), (4.29), and (4.30).
   (b) Compute λ_n, s_n, and x = x_{n+1} with the Armijo rule.
   (c) If n > 0 compute α_{n−1} and β_{n−1} using (4.32).
   (d) g = ∇f(x), n = n + 1.



4.2.2 A BFGS–Armijo Algorithm

In this section we present a simple implementation that shows how the theoretical results can be applied in algorithm design. Let H_+^{BFGS} be the BFGS update from H_c and define the two modified BFGS (MBFGS) updates by

(4.34)  H_+ = H_+^{BFGS} if y^T s > 0, and H_+ = I if y^T s ≤ 0,



and

(4.35)  H_+ = H_+^{BFGS} if y^T s > 0, and H_+ = H_c if y^T s ≤ 0.

In the MBFGS1 method, (4.34), the model Hessian is reinitialized to I if y^T s ≤ 0. In the early phase of this iteration, where ∇²f may have negative eigenvalues, y^T s ≤ 0 is certainly possible and the search direction could be the steepest descent direction for several iterations.
An MBFGS2 step (4.35) keeps the history of the iteration even if y^T s ≤ 0. One view is that this approach keeps as much information as possible. Another is that once y^T s ≤ 0, the iteration history is suspect and should be thrown away. Both forms are used in practice. Our MATLAB code bfgswopt uses MBFGS1 and maintains an approximation to H⁻¹ using Algorithm bfgsopt. We also guard against poor scaling by using (3.50).
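
In code, the MBFGS1 safeguard (4.34) amounts to a one-line test around the update; here is a schematic step of ours for the direct form, on random stand-in data:

% One MBFGS1 step (4.34), illustration only
rng(2); n = 3; H = eye(n); s = randn(n,1); y = randn(n,1);
if y'*s > 0
    H = H + (y*y')/(y'*s) - (H*s)*(H*s)'/(s'*H*s);   % BFGS update (4.4)
else
    H = eye(n);                                      % restart with the identity
end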


4.3 Other Quasi-Newton Methods
The DFP (Davidon, Fletcher, Powell) update [71], [72], [105]

(4.36)  H_+ = H_c + ((y − H_c s)y^T + y(y − H_c s)^T)/(y^T s) − ((y − H_c s)^T s) yy^T/(y^T s)^2

has similar local convergence properties to BFGS but does not perform as well in practice [224], [225].
Two updates that preserve symmetry, but not definiteness, are the PSB (Powell symmetric Broyden) update [219],

(4.37)  H_+ = H_c + ((y − H_c s)s^T + s(y − H_c s)^T)/(s^T s) − (s^T(y − H_c s)) ss^T/(s^T s)^2,

and the symmetric rank-one (SR1) [35] update,

(4.38)  H_+ = H_c + (y − H_c s)(y − H_c s)^T/((y − H_c s)^T s).
By preserving the symmetry of the approximate Hessians, but not the positive definiteness, these updates present a problem for a line search globalization but an opportunity for a trust region approach. The SR1 update has been reported to outperform BFGS algorithms in certain cases [165], [41], [64], [65], [163], [258], [118], [250], [119], [268], [164], in which either the approximate Hessians can be expected to be positive definite or a trust region framework is used [41], [64], [65].
One may update the inverse of the SR1 approximate Hessian using the Sherman–Morrison formula, (4.39), a simple relation between the inverse of a nonsingular matrix and that of a rank-one update of that matrix [93], [239], [240], [14].
Proposition 4.3.1. Let H be a nonsingular N × N matrix and let u, v ∈ R^N. Then H + uv^T is invertible if and only if 1 + v^T H⁻¹u ≠ 0. In this case

(4.39)  (H + uv^T)⁻¹ = (I − (H⁻¹u)v^T/(1 + v^T H⁻¹u)) H⁻¹.

The proof is simply a direct verification of (4.39).
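
A direct numerical verification of (4.39) on random data (a snippet of ours):

rng(3); n = 5; H = randn(n) + n*eye(n); u = randn(n,1); v = randn(n,1);
w = H\u;                                  % H^{-1}*u
lhs = inv(H + u*v');
rhs = (eye(n) - (w*v')/(1 + v'*w)) / H;   % right division applies inv(H)
norm(lhs - rhs)                           % ~ 1e-14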
The SR1 algorithm terminates in finitely many iterations for convex quadratic optimization problems [101]. Since the denominator (y − H_c s)^T s could vanish, the update could completely fail and implementations must examine the denominator and take appropriate action if it is too


small. This update does not enforce or require positivity of the approximate Hessian and has
been used effectively to exploit negative curvature in a trust region context [165], [41].
For overdetermined nonlinear least squares problems one can try to approximate the second-order term in ∇²f while computing R′^T R′ exactly. Suppose

∇²f(x) ≈ H = C(x) + A,

where the idea is that C, the computed part, is significantly easier to compute than A, the approximated part. This is certainly the case for nonlinear least squares, where C = R′^T R′. A quasi-Newton method that intends to exploit this structure will update A only; hence

H_+ = C(x_+) + A_+.

Superlinear convergence proofs require, in one way or another, that H_+ s = y. Therefore, in terms of A, one might require the update to satisfy

(4.40)  A_+ s = y^# = y − C(x_+)s.

The definition of y^# given in (4.40) is called the default choice in [87]. This is not the only choice for y^#, and one can prove superlinear convergence for this and many other choices [87], [84]. This idea, using several different updates, has been used in other contexts, such as optimal control [159], [164].
An algorithm of this type, using SR1 to update A and a different choice for y^#, was suggested in [20] and [21]. The nonlinear least squares update from [77], [78], and [84] uses a DFP update and yet another y^# to compute A_+,

(4.41)  A_+ = A_c + ((y^# − A_c s)(y^#)^T + y^#(y^# − A_c s)^T)/((y^#)^T s) − ((y^# − A_c s)^T s) y^#(y^#)^T/((y^#)^T s)^2.
The application of this idea to large-residual least squares problems is not trivial, and scaling
issues must be considered in a successful implementation.
Our proof of superlinear convergence can be applied to updates like (4.41). We state a special case of a result from [87] for the BFGS formulation

(4.42)  A_+ = A_c + y^#(y^#)^T/((y^#)^T s) − (A_c s)(A_c s)^T/(s^T A_c s).

Theorem 4.3.2. Let the standard assumptions hold and assume that

A* = ∇²f(x*) − C(x*)

is spd. Then there is δ such that if

‖x_0 − x*‖ ≤ δ and ‖A_0 − A*‖ ≤ δ,

then the quasi-Newton iterates defined by (4.42) exist and converge q-superlinearly to x*.
This result can be readily proved using the methods in this chapter (see [159]).
Quasi-Newton methods can also be designed to take into account special structure, such as
the sparsity pattern of the Hessian. One can update only those elements that are nonzero in the
initial approximation to the Hessian, requiring that the secant equation Hs = y holds. Such
updates have been proposed and analyzed in varying levels of generality in [83], [87], [185],
[238], [256], and [255].



Another approach is to use the dependency of f on subsets of the variables, a structure that is often present in discretizations of infinite-dimensional problems where coefficients of operators can be updated rather than entire matrix representations of those operators. We refer the reader to [133], [131], and [132] for an algebraic viewpoint based on finite-dimensional analysis and to [159], [157], [164], [160], and [163] for an operator theoretic description of these methods. When applied to discretizations of infinite-dimensional optimization problems, quasi-Newton methods perform best when they also work well on the infinite-dimensional problem itself. Work on BFGS in Hilbert space can be found, for example, in [135], [158], and [159].
Quasi-Newton methods have been designed for underdetermined problems [184], and Broyden's method itself has been applied to linear least squares problems [111], [148].


4.4 Examples
The computations in this section were done with the MATLAB code bfgswopt. For the small
parameter ID problem, where evaluation of f is far more expensive than the cost of maintaining
or factoring the (very small!) approximate Hessian, one could also use a brute force approach
in which H is updated and factored anew with each iteration.

4.4.1 Parameter ID Problem
We solve the parameter ID problem with the same data as in §3.4.1 using H_0 = I as the initial Hessian. We compare the BFGS solution with the Gauss–Newton iteration from §3.4.1. From Figure 4.1 one can see the local superlinear convergence and the good performance of the line search. However, as one should expect, the Gauss–Newton iteration, being designed for small residual least squares problems, was more efficient. The Gauss–Newton iteration required 14 function evaluations, 6 gradients, and roughly 1.3 million floating point operations, while the BFGS–Armijo iteration needed 29 function evaluations, 15 gradients, and 3.8 million floating point operations.

4.4.2 Discrete Control Problem
We return to the example from §3.4.2. For our first example we use the initial iterate

u_0(t) = 10.

BFGS also requires an initial approximation to the Hessian and we consider two such approximations:

(4.43)  H_p = .25I and H_g = I.

The Hessian for the continuous problem is a compact perturbation of the identity and the theory from [158] and [135] indicates that H_g is a much better approximate Hessian than H_p. The results in Figure 4.2 support that idea. For the better Hessian, one can see the concavity of superlinear convergence in the plot of the gradient norm. The computation for the better Hessian required 12 iterations and roughly 572 thousand floating point operations, while the one with the poor Hessian took 16 iterations and roughly 880 thousand floating point operations. Stepsize reductions were not required for the good Hessian and were needed four times during the iteration for the poor Hessian. However, the guard against poor scaling (3.50) was needed in both cases.
When we used the same poor initial iterate that we used in §3.4.2

u_0(t) = 5 + 300 sin(20πt)








[Figure 4.1: BFGS–Armijo and Gauss–Newton for the Parameter ID Problem. Log-scale plots of gradient norm and function value against iterations for each method.]


[Figure 4.2: log-scale plots of gradient norm and function value against iterations for the better Hessian and the poor Hessian of (4.43).]