64 ITERATIVE METHODS FOR OPTIMIZATION

This is a dogleg method in that the approximate solution of the trust region problem lies on a piecewise linear path with the CG iterations as nodes. As long as CG is performing properly (i.e., p^T w > 0) nodes are added to the path until the path intersects the trust region boundary. If a direction of indefiniteness is found (p^T w ≤ 0), then that direction is followed to the boundary. In this way a negative curvature direction, if found in the course of the CG iteration, can be exploited.

The inputs to Algorithm trcg are the current point x, the objective f, the forcing term η, and the current trust region radius ∆. The output is the approximate solution d of the trust region problem. This algorithm is not the whole story, as once the trust region problem is solved approximately, one must use f(xc + d) to compute ared and then make a decision on how the trust region radius should be changed. Our formulation differs from that in [247] in that the termination criterion measures relative residuals in the l^2 norm rather than in the C-norm. This change in the norm has no effect on the analysis in [247], and, therefore, we can apply the results in §2.5 directly to draw conclusions about local convergence.

Algorithm 3.3.7. trcg(d, x, f, M, η, ∆, kmax)

1. r = −∇f(x), ρ_0 = ‖r‖_2^2, k = 1, d = 0

2. Do While √ρ_{k−1} > η‖∇f(x)‖_2 and k < kmax

   (a) z = M r
   (b) τ_{k−1} = z^T r
   (c) if k = 1 then β = 0 and p = z
       else β = τ_{k−1}/τ_{k−2} and p = z + βp
   (d) w = ∇²f(x)p
       If p^T w ≤ 0 then
           Find τ such that ‖d + τp‖_C = ∆
           d = d + τp; return
   (e) α = τ_{k−1}/p^T w
   (f) r = r − αw
   (g) ρ_k = r^T r
   (h) d̂ = d + αp
   (i) If ‖d̂‖_C > ∆ then
           Find τ such that ‖d + τp‖_C = ∆
           d = d + τp; return
   (j) d = d̂; k = k + 1
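The logic of Algorithm trcg can be sketched in a few lines. The sketch below assumes M = I (no preconditioning), so the C-norm reduces to the l^2 norm; grad is ∇f(x) and hvp(p) returns the Hessian–vector product ∇²f(x)p. The function and variable names are illustrative, not taken from the software collection.

```python
import numpy as np

def trcg(grad, hvp, delta, eta=1e-2, kmax=100):
    # Sketch of Algorithm trcg with M = I, so the C-norm is the l2 norm.
    # grad is the gradient at x; hvp(p) returns Hessian-vector products;
    # delta is the trust region radius and eta the forcing term.
    def to_boundary(d, p):
        # positive tau with ||d + tau*p|| = delta
        a, b, c = p @ p, d @ p, d @ d - delta ** 2
        tau = (-b + np.sqrt(b * b - a * c)) / a
        return d + tau * p

    d = np.zeros_like(grad)
    r = -grad
    rho = r @ r                       # rho_{k-1} = ||r||_2^2
    tau_prev, p, k = None, None, 1
    while np.sqrt(rho) > eta * np.linalg.norm(grad) and k < kmax:
        z = r                         # z = M*r with M = I
        tau_cur = z @ r
        p = z.copy() if k == 1 else z + (tau_cur / tau_prev) * p
        w = hvp(p)
        if p @ w <= 0:                # indefiniteness: follow p to the boundary
            return to_boundary(d, p)
        alpha = tau_cur / (p @ w)
        r = r - alpha * w
        rho = r @ r
        d_hat = d + alpha * p
        if np.linalg.norm(d_hat) > delta:
            return to_boundary(d, p)
        d = d_hat
        tau_prev = tau_cur
        k += 1
    return d
```

With a large radius and a small η this reduces to plain CG for the Newton equation; with a small radius, or on a negative curvature direction, it returns a point on the trust region boundary.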

Algorithm trcg does what we would expect a dogleg algorithm to do in that the piecewise linear path determined by the iteration moves monotonically away from x (in the ‖·‖_C-norm!) and the quadratic model decreases on that path [247]. Algorithm trcg will, therefore, compute the same Newton step as Algorithm fdpcg. One might think that it may be difficult to compute the C-norm if one has, for example, a way to compute the action of M on a vector that does not require computation of the matrix C. However, at the cost of storing two additional vectors we can update Cp and Cd as the iteration progresses. So, when p is updated to z + βp then Cp = r + βCp can be updated at the same time without computing the product of C with p. Then ‖p‖_C = √(p^T Cp). Similarly d = d + τp implies that Cd = Cd + τCp.
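This bookkeeping is easy to check numerically. In the sketch below, C is a random spd matrix that is formed only so the check can be made; in practice one never forms C and only the recursions Cp = r + βCp and Cd = Cd + τCp are executed.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
C = B @ B.T + 4.0 * np.eye(4)     # an spd C, formed here only to verify
M = np.linalg.inv(C)              # preconditioner M = C^{-1}

# one simulated CG update p <- z + beta*p with z = M r
r = rng.standard_normal(4)
p = rng.standard_normal(4)
Cp = C @ p                        # carried along from earlier iterations
beta, tau = 0.7, 0.3
p_new = M @ r + beta * p
Cp_new = r + beta * Cp            # uses C z = C M r = r; C itself untouched
assert np.allclose(Cp_new, C @ p_new)
assert np.isclose(np.sqrt(p_new @ Cp_new),
                  np.sqrt(p_new @ C @ p_new))   # ||p||_C without forming C

# the d update d <- d + tau*p propagates the same way
d = rng.standard_normal(4)
Cd = C @ d
assert np.allclose(Cd + tau * Cp_new, C @ (d + tau * p_new))
```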

Algorithm cgtrust combines the solution of the trust region problem from trcg, the trust region radius adjustment scheme from trtest, and (indirectly) the locally convergent algorithm newtcg. The result fits nicely into our paradigm algorithm trgen.

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.


Algorithm 3.3.8. cgtrust(x, f, τ)

1. Initialize ∆, M, η, kmax.

2. Do forever

   (a) Let xc = x. Compute ∇f(xc).
   (b) Call trcg(d, x, f, M, η, ∆, kmax) to solve the trust region subproblem.
       Set xt = x + d.
   (c) Call trtest(xc, xt, x, f, ∆),
       solving the trust region subproblem with Algorithm trcg.
   (d) Update η.
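A minimal sketch of the cgtrust loop follows. To stay self-contained it replaces trcg with a Newton step clipped to the trust region boundary (adequate only for positive definite Hessians) and inlines a simplified stand-in for trtest using the standard parameters (shrink by .5 when ared/pred < .25, expand by 2 when ared/pred > .75); all names are illustrative.

```python
import numpy as np

def solve_tr(g, H, delta):
    # Stand-in for trcg: a Newton step clipped to the trust region boundary.
    # (Adequate only for positive definite H; trcg needs just Hessian-vector
    # products and also handles indefiniteness.)
    d = -np.linalg.solve(H, g)
    nd = np.linalg.norm(d)
    return d if nd <= delta else (delta / nd) * d

def cgtrust_sketch(x, f, grad_f, hess_f, delta=1.0, tol=1e-6, max_outer=100):
    # Minimal sketch of the cgtrust loop.  The accept/reject logic plays the
    # role of trtest: shrink the radius on rejection and re-solve, expand it
    # after a very successful step.
    for _ in range(max_outer):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        while True:
            H = hess_f(x)
            d = solve_tr(g, H, delta)
            ared = f(x) - f(x + d)
            pred = -(g @ d + 0.5 * d @ (H @ d))
            if ared < 0.25 * pred:
                delta *= 0.5              # reject: shrink and re-solve
            else:
                x = x + d                 # accept the trial point
                if ared > 0.75 * pred:
                    delta *= 2.0
                break
    return x
```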

Theorem 3.3.10 combines several results from [247].

Theorem 3.3.10. Let f be twice Lipschitz continuously differentiable. Let M be a given positive definite matrix and let {η_n} satisfy 0 < η_n < 1 for all n. Let {x_n} be the sequence generated by Algorithm cgtrust and assume that {‖∇²f(x_n)‖} is bounded. Then

(3.49) lim_{n→∞} ‖∇f(x_n)‖ = 0.

Moreover, if x* is a local minimizer for which the standard assumptions hold and x_n → x*, then

• if η_n → 0 the convergence is q-superlinear, and

• if η_n ≤ K_η‖∇f(x_n)‖^p for some K_η > 0 the convergence is q-superlinear with q-order 1 + p.

Finally, there are δ̄ and ∆̄ such that if ‖x_0 − x*‖ ≤ δ̄ and ∆_0 ≤ ∆̄, then x_n → x*.

One can, as we do in the MATLAB code cgtrust, replace the Hessian–vector product with a difference Hessian. The accuracy of the difference Hessian and the loss of symmetry present the potential problem that was mentioned in §2.5. Another, very different, approach is to approximate the exact solution of the trust region subproblem with an iterative method [243].

3.4 Examples

The results we report here used the MATLAB implementations of steepest descent, steep.m, damped Gauss–Newton, gaussn.m, the dogleg trust region algorithm for Newton's method, ntrust.m, and the PCG–dogleg algorithms, cgtrust.m, from the software collection.

Our MATLAB implementation of Algorithm steep guards against extremely poor scaling and very long steps by setting λ to

(3.50) λ_0 = min(1, 100/(1 + ‖∇f(x)‖))

at the beginning of the line search. We invite the reader in Exercise 3.5.3 to attempt the control example with λ_0 = 1.
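The safeguard (3.50) is a one-line computation; the quick check below confirms that well-scaled gradients give λ_0 = 1 while very large gradients are damped.

```python
def lambda0(grad_norm):
    # safeguard (3.50): lambda0 = min(1, 100 / (1 + ||grad f(x)||))
    return min(1.0, 100.0 / (1.0 + grad_norm))

assert lambda0(0.0) == 1.0            # well-scaled problem: full step
assert lambda0(99.0) == 1.0           # boundary case: 100/(1 + 99) = 1
assert lambda0(9999.0) == 0.01        # very long steps are damped
```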

We not only present plots, which are an efficient way to understand convergence rates, but we also report counts of function, gradient, and Hessian evaluations and the results of the MATLAB flops command.


Figure 3.1: Steepest Descent and Newton–Dogleg for Parameter ID Problem. (Each method is shown with plots of function value and gradient norm against iterations.)

Figure 3.2: Gauss–Newton and Levenberg–Marquardt for Parameter ID Problem. (Each method is shown with plots of function value and gradient norm against iterations.)


Figure 3.3: Steepest Descent and Dogleg–CG for Discrete Control Problem. (Each method is shown with plots of function value and gradient norm against iterations.)

3.4.1 Parameter Identification

We consider the problem from §2.6.1 except we use the initial data x_0 = (5, 5)^T. Both the Gauss–Newton and Newton methods will fail to converge with this initial data without globalization (see Exercise 3.5.14). Newton's method has particular trouble with this problem because the Newton direction is not a descent direction in the early phases of the iteration. The termination criterion and difference increment for the finite difference Hessian were the same as for the computation in §2.6.1.

In Figure 3.1 we compare the performance of the Newton dogleg algorithm with the steepest descent algorithm. Our implementation of the classical dogleg in ntrust uses the standard values

(3.51) ω_down = .5, ω_up = 2, µ_0 = µ_low = .25, and µ_high = .75.

The plots clearly show the locally superlinear convergence of Newton's method and the linear convergence of steepest descent. However, the graphs do not completely show the difference in computational costs. In terms of gradient evaluations, steepest descent was marginally better than the Newton dogleg algorithm, requiring 50 gradients as opposed to 55 (which includes those needed for the 18 difference Hessian evaluations) for the Newton dogleg algorithm. However, the steepest descent algorithm required 224 function evaluations, while the Newton dogleg needed only 79. As a result, the Newton dogleg code was much more efficient, needing roughly 5 million floating point operations instead of the 10 million needed by the steepest descent code.

In Figure 3.2 we plot the performance of the damped Gauss–Newton and Levenberg–Marquardt algorithms. These exploit the least squares structure of the problem and are locally superlinearly convergent because this is a zero residual problem. They also show that algorithms that effectively exploit the structure of the least squares problem are much more efficient. Gauss–Newton required 6 gradient evaluations, 14 function evaluations, and 750 thousand floating point operations, and Levenberg–Marquardt required 12 gradients, 23 functions, and 1.3 million floating point operations.

3.4.2 Discrete Control Problem

We consider the discrete control problem from §1.6.1 with N = 400, T = 1, y_0 = 0,

L(y, u, t) = (y − 3)^2 + .5u^2, and φ(y, u, t) = uy + t^2.

We chose the poor initial iterate

u_0(t) = 5 + 300 sin(20πt).

This problem can be solved very efficiently with Algorithm cgtrust. In our implementation we use the same parameters from (3.51). In Figure 3.3 we compare the dogleg–CG iteration with steepest descent. We terminated both iterations when ‖∇f‖ < 10^{−8}. For the dogleg–CG code we used η = .01 throughout the entire iteration and an initial trust region radius of ‖u_0‖. The steepest descent computation required 48 gradient evaluations, 95 function evaluations, and roughly 1 million floating point operations, and dogleg–CG needed 17 gradient evaluations, 21 function evaluations, and roughly 530 thousand floating point operations. Note that the steepest descent algorithm performed very well in the terminal phase of the iteration. The reason for this is that, in this example, the Hessian is near the identity.

3.5 Exercises on Global Convergence

3.5.1. Let F be a nonlinear function from R^N → R^N. Let

f(x) = ‖F(x)‖^2/2.

What is ∇f? When is the Newton step for the nonlinear equation F(x) = 0,

d = −F′(x)^{−1}F(x),

a descent direction for f at x?

3.5.2. Prove Lemma 3.2.1.

3.5.3. Implement Algorithm steep without the scaling fixup in (3.50). Apply this crippled algorithm to the control problem example from §3.4.2. What happens and why?

3.5.4. Show that if f is a convex quadratic then f is bounded from below.

3.5.5. Verify (3.40).

3.5.6. Show that the Levenberg–Marquardt steps computed by (3.20) and (3.21) are the same.

3.5.7. Prove Theorem 3.2.7.

3.5.8. Complete the proof of Theorem 3.3.2.

3.5.9. Prove Theorem 3.3.4.

3.5.10. Look at the trust region algorithm for nonlinear equations from [218] or [84]. What are the costs of that algorithm that are not present in a line search? When might this trust region approach have advantages for solving nonlinear equations? Could it be implemented inexactly?


3.5.11. The double dogleg method [80], [84] puts a new node on the dogleg path in the Newton direction, thereby trying more aggressively for superlinear convergence. Implement this method, perhaps by modifying the MATLAB code ntrust.m, and compare the results with the examples in §3.4. Prove convergence results like Theorems 3.3.7 and 3.3.9 for this method.

3.5.12. In [51] a trust region algorithm was proposed that permitted inaccurate gradient computations, with the relative accuracy being tightened as the iteration progresses. Look at [51] and try to design a similar algorithm based on the line search paradigm. What problems do you encounter? How do you solve them?

3.5.13. Suppose one modifies Algorithm trtest by not resolving the trust region problem if the trial point is rejected, but instead performing a line search from xt, and setting ∆ = ‖x_+ − xc‖, where x_+ is the accepted point from the line search. Discuss the merits of this modification and any potential problems. See [209] for the development of this idea.

3.5.14. Write programs for optimization that take full Gauss–Newton or Newton steps (you can cripple the MATLAB codes gaussn.m and ntrust.m for this). Apply these codes to the parameter identification problem from §3.4.1. What happens?

3.5.15. Write a nonlinear CG code and apply it to the problems in §3.4. Try at least two ways to manage the line search. How important are the (strong) Wolfe conditions?

3.5.16. Discuss the impact of using a difference Hessian in Algorithm trcg. How will the global convergence of Algorithm cgtrust be affected? How about the local convergence? Consider the accuracy in the evaluation of ∇f in your results.

3.5.17. Without looking at [247], describe in general terms how the proof of Theorem 3.3.1 should be modified to prove Theorem 3.3.10. Then examine the proof in [247] to see if you left anything out.


Chapter 4

The BFGS Method

Quasi-Newton methods update an approximation of ∇²f(x*) as the iteration progresses. In general the transition from current approximations xc and Hc of x* and ∇²f(x*) to new approximations x_+ and H_+ is given (using a line search paradigm) by the following steps:

1. Compute a search direction d = −Hc^{−1}∇f(xc).

2. Find x_+ = xc + λd using a line search to insure sufficient decrease.

3. Use xc, x_+, and Hc to update Hc and obtain H_+.

The way in which H_+ is computed determines the method.

The BFGS (Broyden, Fletcher, Goldfarb, Shanno) [36], [103], [124], [237] method, which is the focus of this chapter, and the other methods we will mention in §4.3 are also called secant methods because they satisfy the secant equation

(4.1) H_+ s = y.

In (4.1)

s = x_+ − xc and y = ∇f(x_+) − ∇f(xc).

If N = 1, all secant methods reduce to the classical secant method for the single nonlinear equation f(x) = 0, i.e.,

(4.2) x_+ = xc − f(xc)(xc − x_−)/(f(xc) − f(x_−)),

where x_− is the iterate previous to xc.
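For a scalar equation the update (4.2) can be coded directly; the root-finding example below is illustrative.

```python
def secant(f, x_prev, x_curr, tol=1e-12, maxit=50):
    # Classical scalar secant iteration (4.2) for f(x) = 0.
    for _ in range(maxit):
        fc, fm = f(x_curr), f(x_prev)
        if fc == fm:                  # degenerate divided difference: stop
            break
        x_prev, x_curr = x_curr, x_curr - fc * (x_curr - x_prev) / (fc - fm)
        if abs(f(x_curr)) < tol:
            break
    return x_curr

root = secant(lambda x: x * x - 2.0, 1.0, 2.0)   # approximates sqrt(2)
```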

The standard quasi-Newton update for nonlinear equations is Broyden's [34] method, a rank-one update,

(4.3) H_+ = Hc + (y − Hc s)s^T/(s^T s).
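A quick numerical check of (4.3) shows that the Broyden update enforces the secant equation (4.1) but, as the next paragraph observes, destroys symmetry. The vectors here are chosen only to make the check deterministic.

```python
import numpy as np

def broyden_update(H, s, y):
    # Broyden's rank-one update (4.3): H+ = H + (y - H s) s^T / (s^T s)
    return H + np.outer(y - H @ s, s) / (s @ s)

s = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
H_plus = broyden_update(np.eye(3), s, y)
assert np.allclose(H_plus @ s, y)          # the secant equation (4.1) holds
assert not np.allclose(H_plus, H_plus.T)   # but symmetry is lost
```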

Broyden's method does not preserve the structural properties needed for line search methods in optimization, namely, symmetry and positive definiteness, and could, in fact, encourage convergence to a local maximum. For that reason quasi-Newton methods in optimization are more complex than those used for nonlinear equations. The methods of analysis and implementation are more complex as well.

In this chapter we will concentrate on the BFGS method [36], [103], [124], [237], which is the rank-two update

(4.4) H_+ = Hc + yy^T/(y^T s) − (Hc s)(Hc s)^T/(s^T Hc s).

We will briefly discuss other updates and variations that exploit problem structure in §4.3.


4.1 Analysis

This section begins with some simple observations on nonsingularity and positivity of the update. It is very useful for both theory and practice to express (4.4) in terms of the inverse matrices. The formula we use in this book is Lemma 4.1.1.

Lemma 4.1.1. Let Hc be spd, y^T s ≠ 0, and H_+ given by (4.4). Then H_+ is nonsingular and

(4.5) H_+^{−1} = (I − sy^T/(y^T s)) Hc^{−1} (I − ys^T/(y^T s)) + ss^T/(y^T s).

Proof. See exercise 4.5.2.
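Formulas (4.4) and (4.5) can be checked against each other numerically; the spd test matrix below is only an illustration.

```python
import numpy as np

def bfgs_update(H, s, y):
    # BFGS update (4.4)
    return (H + np.outer(y, y) / (y @ s)
              - np.outer(H @ s, H @ s) / (s @ H @ s))

def bfgs_inverse_update(Hinv, s, y):
    # inverse form (4.5) from Lemma 4.1.1
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ Hinv @ V + rho * np.outer(s, s)

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4.0 * np.eye(4)          # an spd H_c
s = rng.standard_normal(4)
y = rng.standard_normal(4)
if y @ s <= 0:                          # Lemma 4.1.2 below needs y^T s > 0
    y = -y
H_plus = bfgs_update(H, s, y)
assert np.allclose(H_plus @ s, y)                      # secant equation (4.1)
assert np.allclose(bfgs_inverse_update(np.linalg.inv(H), s, y),
                   np.linalg.inv(H_plus))              # Lemma 4.1.1
assert np.linalg.eigvalsh(H_plus).min() > 0            # H+ stays spd
```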

Lemma 4.1.2. Let Hc be spd, y^T s > 0, and H_+ given by (4.4). Then H_+ is spd.

Proof. Positivity of Hc and y^T s > 0 imply that for all z ≠ 0,

z^T H_+ z = (z^T y)^2/(y^T s) + z^T Hc z − (z^T Hc s)^2/(s^T Hc s).

Using the symmetry and positivity of Hc, we have

(z^T Hc s)^2 ≤ (s^T Hc s)(z^T Hc z),

with equality only if z = 0 or s = 0, and, therefore, since z, s ≠ 0 and y^T s > 0,

z^T H_+ z > (z^T y)^2/(y^T s) ≥ 0,

as asserted.

If y^T s ≤ 0 the update is considered a failure.

4.1.1 Local Theory

The local theory [37] requires accurate initial approximations to both x* and ∇²f(x*). The statement of the convergence result is easy to understand.

Theorem 4.1.3. Let the standard assumptions hold. Then there is δ such that if

‖x_0 − x*‖ ≤ δ and ‖H_0 − ∇²f(x*)‖ ≤ δ,

then the BFGS iterates are defined and converge q-superlinearly to x*.

Technical Details

The proof of Theorem 4.1.3 is technical and we subdivide it into several lemmas. Our proof is a hybrid of ideas from [37], [135], and [154]. Similar to other treatments of this topic [45] we begin with the observation (see §2.5.2) that one may assume ∇²f(x*) = I for the convergence analysis.

Lemma 4.1.4. Let the standard assumptions hold and let

f̂(y) = f(Ay),

where A = (∇²f(x*))^{−1/2}. Let xc and Hc be given and let x̂c = A^{−1}xc and Ĥc = AHcA. Then the BFGS updates (x_+, H_+) for f and (x̂_+, Ĥ_+) for f̂ are related by

x̂_+ = A^{−1}x_+ and Ĥ_+ = AH_+A.

In particular, the BFGS sequence for f exists (i.e., Hn is spd for all n) if and only if the BFGS sequence for f̂ does, and the convergence of {xn} is q-superlinear if and only if the convergence of {x̂n} is.

Proof. The proof is a simple calculation and is left for exercise 4.5.3.

Hence we can, with no loss of generality, assume that ∇²f(x*) = I, for if this is not so, we can replace f by f̂ and obtain an equivalent problem for which it is.

Keeping in mind our assumption that ∇²f(x*) = I, we denote errors in the inverse Hessian by

E = H^{−1} − ∇²f(x*)^{−1} = H^{−1} − I.

These errors satisfy a simple recursion [37].

Lemma 4.1.5. Let the standard assumptions hold. Let Hc be spd and

x_+ = xc − Hc^{−1}∇f(xc).

Then there is δ_0 such that if

0 < ‖xc − x*‖ ≤ δ_0 and ‖Ec‖ ≤ δ_0,

then y^T s > 0. Moreover, if H_+ is the BFGS update of Hc then

(4.6) E_+ = (I − ww^T)Ec(I − ww^T) + ∆,

where w = s/‖s‖ and for some K_∆ > 0

(4.7) ‖∆‖ ≤ K_∆‖s‖.

Proof. Let δ_0 be small enough so that ∇f(xc) ≠ 0 if xc ≠ x*. Theorem 1.2.1 implies that

∇f(xc) = ∫_0^1 ∇²f(x* + t ec)ec dt = ec + ∆_1 ec,

where ∆_1 is the matrix given by

∆_1 = ∫_0^1 (∇²f(x* + t ec) − I) dt.

Clearly

‖∆_1‖ ≤ γ‖ec‖/2,

and

s = −Hc^{−1}∇f(xc) = −(I + Ec)(I + ∆_1)ec.

Therefore,

‖ec‖(1 − δ_0)(1 − γδ_0/2) ≤ ‖s‖ ≤ ‖ec‖(1 + δ_0)(1 + γδ_0/2)

and hence

(4.8) 0 < ‖ec‖/2 ≤ ‖s‖ ≤ 2‖ec‖

if, say,

(4.9) δ_0 ≤ min(1/4, 1/(2γ)).


We will assume that (4.9) holds for the rest of this section.

The standard assumptions, our assumption that ∇²f(x*) = I, and the fundamental theorem of calculus imply that

(4.10) y = ∇f(x_+) − ∇f(xc) = ∫_0^1 ∇²f(xc + ts)s dt = s + ∫_0^1 (∇²f(xc + ts) − I)s dt = s + ∆_2 s,

where ∆_2 is the matrix given by

∆_2 = ∫_0^1 (∇²f(xc + ts) − I) dt.

The standard assumptions imply that ‖∆_2‖ ≤ γ(‖e_+‖ + ‖ec‖)/2. Hence, (4.8) implies that

(4.11) y^T s = s^T s + (∆_2 s)^T s ≥ ‖s‖^2 (1 − 3γ‖ec‖/2) ≥ ‖s‖^2 (1 − 3γδ_0/2) > 0

provided δ_0 < 2/(3γ). We have that

(4.12) sy^T/(y^T s) = (ss^T + s(∆_2 s)^T)/(s^T s + (∆_2 s)^T s) = ss^T/(s^T s) − ∆_3 = ww^T − ∆_3,

where (see exercise 4.5.4), for some C > 0,

(4.13) ‖∆_3‖ ≤ C‖s‖.

Subtracting (∇²f(x*))^{−1} = I from (4.5) and using (4.12) gives us

E_+ = (I − ww^T + ∆_3)Hc^{−1}(I − ww^T + ∆_3^T) + ww^T − I
    = (I − ww^T)(Ec + I)(I − ww^T) + ww^T − I + ∆
    = (I − ww^T)Ec(I − ww^T) + ∆,

where

∆ = ∆_3 Hc^{−1}(I − ww^T + ∆_3^T) + (I − ww^T)Hc^{−1}∆_3^T.

Therefore, if (4.9) holds then 1 + δ_0 ≤ 3/2 and

‖∆‖ ≤ (1 + δ_0)‖∆_3‖(2 + ‖∆_3‖) ≤ 3C‖s‖(2 + C‖s‖)/2 ≤ 3C‖s‖(2 + 2Cδ_0)/2.

Reduce δ_0 if necessary so that 2Cδ_0 ≤ 1 and the proof is complete with K_∆ = 9C/2.

Lemma 4.1.5 implies that the approximate Hessians do not drift too far from the exact Hessian if the initial data are good. This property, called bounded deterioration in [37], will directly imply local q-linear convergence.

Corollary 4.1.6. Let the assumptions of Lemma 4.1.5 hold and let δ_0 be as in the statement of Lemma 4.1.5. Then

(4.14) ‖E_+‖ ≤ ‖Ec‖ + K_∆‖s‖ ≤ ‖Ec‖ + K_∆(‖ec‖ + ‖e_+‖).


Proof. The leftmost inequality follows from Lemma 4.1.5 and the fact that I − ww^T is an orthogonal projection. The final inequality follows from the triangle inequality.

We are now ready to prove local q-linear convergence. This is of interest in its own right and is a critical step in the superlinear convergence proof. Note that, unlike the statement and proof of Theorem 2.3.4, we do not express the estimates in terms of ‖H − ∇²f(x*)‖ = ‖H − I‖ but in terms of E = H^{−1} − I. The two approaches are equivalent, since if ‖E‖ ≤ δ < 1/2, then ‖H^{−1}‖ < 3/2 and the Banach lemma implies that ‖H‖ ≤ 2. Hence

‖Hn − I‖/2 ≤ ‖Hn‖^{−1}‖Hn − I‖ ≤ ‖Hn^{−1} − I‖ = ‖Hn^{−1}(Hn − I)‖ ≤ ‖Hn^{−1}‖‖Hn − I‖ ≤ 3‖Hn − I‖/2.

Theorem 4.1.7. Let the standard assumptions hold and let σ ∈ (0, 1). Then there is δ such that if

(4.15) ‖x_0 − x*‖ ≤ δ and ‖H_0^{−1} − ∇²f(x*)^{−1}‖ ≤ δ,

then the BFGS iterates are defined and converge q-linearly to x* with q-factor at most σ.

Proof. For δ̂ sufficiently small and

(4.16) ‖xc − x*‖ ≤ δ̂ and ‖Ec‖ = ‖Hc^{−1} − I‖ ≤ δ̂,

the standard assumptions imply that there is K̄ such that

(4.17) ‖e_+‖ ≤ K̄(‖Ec‖‖ec‖ + ‖ec‖^2)/2 ≤ K̄δ̂‖ec‖.

Reduce δ̂ if necessary so that K̄δ̂ ≤ σ to obtain ‖e_+‖ ≤ σ‖ec‖. The method of proof is to select δ so that (4.16) is maintained for the entire iteration if the initial iterates satisfy (4.15).

With this in mind we set δ* = δ̂ and

(4.18) δ = (δ*/2)(1 + K_∆(1 + σ)/(1 − σ))^{−1} < δ*/2,

where K_∆ is from Lemma 4.1.5. Now if ‖H_0 − I‖ < δ then

‖E_0‖ ≤ δ/(1 − δ) ≤ 2δ < δ*,

which is the estimate we need.

Now by Corollary 4.1.6

‖E_1‖ ≤ ‖E_0‖ + K_∆(1 + σ)‖e_0‖.

The proof will be complete if we can show that (4.15) and (4.18) imply that ‖En‖ < δ* for all n. We do this inductively. If ‖En‖ < δ* and ‖e_{j+1}‖ ≤ σ‖ej‖ for all j ≤ n, then (4.14) implies that

‖E_{n+1}‖ ≤ ‖En‖ + K_∆(‖en‖ + ‖e_{n+1}‖) ≤ ‖En‖ + K_∆(1 + σ)‖en‖

≤ ‖En‖ + K_∆(1 + σ)σ^n‖e_0‖ ≤ ‖En‖ + K_∆(1 + σ)σ^n δ

≤ ‖E_0‖ + δ K_∆(1 + σ) Σ_{j=0}^{n} σ^j ≤ δ(1 + K_∆(1 + σ)/(1 − σ)).

We complete the induction and the proof by invoking (4.18) to conclude that ‖E_{n+1}‖ < δ*.


Proof of Theorem 4.1.3

The Dennis–Moré condition [82], [81] is a necessary and sufficient condition for superlinear convergence of quasi-Newton methods. In terms of the assumptions we make in this section, the condition is

(4.19) lim_{n→∞} ‖En sn‖/‖sn‖ = 0,

where {sn} is the sequence of steps and {En} is the sequence of errors in the inverse Hessian. We will only state and prove the special case of the necessary condition that we need and refer the reader to [82], [81], [84], or [154] for more general proofs.

Theorem 4.1.8. Let the standard assumptions hold; let {Hn} be a sequence of nonsingular N × N matrices satisfying

(4.20) ‖Hn‖ ≤ M

for some M > 0. Let x_0 ∈ R^N be given and let {xn}_{n=1}^∞ be given by

x_{n+1} = xn − Hn^{−1}∇f(xn)

for some sequence of nonsingular matrices Hn. Then if xn → x* q-linearly, xn ≠ x* for any n, and (4.19) holds, then xn → x* q-superlinearly.

Proof. We begin by invoking (4.10) to obtain

En sn = (Hn^{−1} − I)sn = (Hn^{−1} − I)(yn − ∆_2 sn) = En yn + O(‖sn‖^2).

Convergence of xn to x* implies that sn → 0 and hence (4.19) can be written as

(4.21) lim_{n→∞} ‖En yn‖/‖sn‖ = 0,

where yn = ∇f(x_{n+1}) − ∇f(xn).

Now let σ be the q-factor for the sequence {xn}. Clearly

(1 − σ)‖en‖ ≤ ‖sn‖ ≤ (1 + σ)‖en‖.

Hence (4.21) is equivalent to

(4.22) lim_{n→∞} ‖En yn‖/‖en‖ = 0.

Since Hn^{−1}∇f(xn) = −sn and sn = yn + O(‖sn‖^2) we have

En yn = (Hn^{−1} − I)(∇f(x_{n+1}) − ∇f(xn))
      = Hn^{−1}∇f(x_{n+1}) + sn − yn = Hn^{−1}∇f(x_{n+1}) + O(‖sn‖^2)
      = Hn^{−1}e_{n+1} + O(‖en‖^2 + ‖sn‖^2) = Hn^{−1}e_{n+1} + O(‖en‖^2).

Therefore, (4.22) implies that

‖En yn‖/‖en‖ = ‖Hn^{−1}e_{n+1}‖/‖en‖ + O(‖en‖) ≥ M^{−1}‖e_{n+1}‖/‖en‖ + O(‖en‖) → 0

as n → ∞, proving q-superlinear convergence.

For the remainder of this section we assume that (4.15) holds and that δ is small enough so that the conclusions of Theorem 4.1.7 hold for some σ ∈ (0, 1). An immediate consequence of this is that

(4.23) Σ_{n=0}^∞ ‖sn‖ < ∞.


The Frobenius norm of a matrix A is given by

(4.24) ‖A‖_F^2 = Σ_{i,j=1}^N (A)_{ij}^2.

It is easy to show that (see exercise 4.5.5) for any unit vector v ∈ R^N,

(4.25) ‖A(I − vv^T)‖_F^2 ≤ ‖A‖_F^2 − ‖Av‖^2 and ‖(I − vv^T)A‖_F^2 ≤ ‖A‖_F^2.

We have, using (4.6), (4.7), and (4.25), that

(4.26) ‖E_{n+1}‖_F^2 ≤ ‖En‖_F^2 − ‖En wn‖^2 + O(‖sn‖) = (1 − θ_n^2)‖En‖_F^2 + O(‖sn‖),

where wn = sn/‖sn‖ and

θ_n = ‖En wn‖/‖En‖_F if En ≠ 0, and θ_n = 1 if En = 0.

Using (4.23) we see that for any k ≥ 0,

Σ_{n=0}^k θ_n^2 ‖En‖_F^2 ≤ Σ_{n=0}^k (‖En‖_F^2 − ‖E_{n+1}‖_F^2) + O(1) = ‖E_0‖_F^2 − ‖E_{k+1}‖_F^2 + O(1) < ∞.

Hence θ_n‖En‖_F → 0. However,

θ_n‖En‖_F = ‖En wn‖ if En ≠ 0, and θ_n‖En‖_F = 0 if En = 0; in either case

θ_n‖En‖_F = ‖En wn‖ = ‖En sn‖/‖sn‖.

Hence (4.19) holds. This completes the proof of Theorem 4.1.3.

4.1.2 Global Theory

If one uses the BFGS model Hessian in the context of Algorithm optarm, then Theorem 3.2.4 can be applied if the matrices {Hk} remain bounded and well conditioned. However, even if a limit point of the iteration is a minimizer x* that satisfies the standard assumptions, Theorem 3.2.4 does not guarantee that the iteration will converge to that point. The situation in which x is near x* but H is not near ∇²f(x*) is, from the point of view of the local theory, no better than that when x is far from x*. In practice, however, convergence (often superlinear) is observed. The result in this section is a partial explanation of this.

Our description of the global theory, using the Armijo line search paradigm from Chapter 3, is based on [43]. We also refer the reader to [221], [45], and [269] for older results with a different line search approach. Results of this type require strong assumptions on f and the initial iterate x_0, but the reward is global and locally superlinear convergence for a BFGS–Armijo iteration.

Assumption 4.1.1. The set

D = {x | f(x) ≤ f(x_0)}


is convex and f is Lipschitz twice continuously differentiable in D. Moreover, there are λ_+ ≥ λ_− > 0 such that

σ(∇²f(x)) ⊂ [λ_−, λ_+]

for all x ∈ D.

Assumption 4.1.1 implies that f has a unique minimizer x* in D and that the standard assumptions hold near x*.

Theorem 4.1.9. Let Assumption 4.1.1 hold and let H_0 be spd. Then the BFGS–Armijo iteration converges q-superlinearly to x*.

The results for local and global convergence do not completely mesh. An implementation must allow for the fact that Assumption 4.1.1 may fail to hold, even near the root, and that y^T s ≤ 0 is a possibility when far from the root.

4.2 Implementation

The two implementation issues that we must confront are storage of the data needed to maintain the updates and a strategy for dealing with the possibility that y^T s ≤ 0. We address the storage question in §4.2.1. For the second issue, when y^T s is not sufficiently positive, we restart the BFGS update with the identity. We present the details of this in §4.2.2. Our globalization approach, also given in §4.2.2, is the simplest possible, the Armijo rule as described in Chapter 3.

We choose to discuss the Armijo rule in the interest of simplicity of exposition. However, while the Armijo rule is robust and sufficient for most problems, more complex line search schemes have been reported to be more efficient [42], and one who seeks to write a general purpose optimization code should give careful thought to the best way to globalize a quasi-Newton method. In the case of BFGS, for example, one is always seeking to use a positive definite quadratic model, even in regions of negative curvature, and in such regions the approximate Hessian could be reinitialized to the identity more often than necessary.

4.2.1 Storage

For the present we assume that y^T s > 0. We will develop a storage-efficient way to compute the BFGS step using the history of the iteration rather than full matrix storage.

The implementation recommended here is one of many that stores the history of the iteration and uses that information recursively to compute the action of Hk^{−1} on a vector. This idea was suggested in [16], [186], [206], and other implementations may be found in [44] and [201]. All of these implementations store the iteration history in the pairs {sk, yk} and we present a concrete example in Algorithm bfgsrec. A better, but somewhat less direct, way is based on the ideas in [91] and [275] and requires that only a single vector be stored for each iteration. We assume that we can compute the action of H_0^{−1} on a vector efficiently, say, by factoring H_0 at the outset of the iteration or by setting H_0 = I. We will use the BFGS formula from Lemma 4.1.1.

One way to maintain the update is to store the history of the iteration in the sequences of vectors {yk} and {sk} where

sk = x_{k+1} − xk and yk = ∇f(x_{k+1}) − ∇f(xk).

If one has done this for k = 0, ..., n − 1, one can compute the new search direction

dn = −Hn^{−1}∇f(xn)

by a recursive algorithm which applies (4.5).


Algorithm bfgsrec overwrites a given vector $d$ with $H_n^{-1} d$. The storage needed is one vector for $d$ and $2n$ vectors for the sequences $\{s_k, y_k\}_{k=0}^{n-1}$. A method for computing the product of $H_0^{-1}$ and a vector must also be provided.

Algorithm 4.2.1. bfgsrec$(n, \{s_k\}, \{y_k\}, H_0^{-1}, d)$

1. If $n = 0$, $d = H_0^{-1} d$; return

2. $\alpha = s_{n-1}^T d / y_{n-1}^T s_{n-1}$; $d = d - \alpha y_{n-1}$

3. call bfgsrec$(n-1, \{s_k\}, \{y_k\}, H_0^{-1}, d)$

4. $d = d + \left(\alpha - (y_{n-1}^T d / y_{n-1}^T s_{n-1})\right) s_{n-1}$
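The recursion can be sketched in a few lines of plain Python, with vectors as lists of floats. This is an illustration, not one of the book's MATLAB codes; the helper names `dot`, `axpy`, and `h0inv` (which applies $H_0^{-1}$, here the identity) are ours:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def axpy(t, x, y):
    # componentwise y + t*x
    return [yi + t * xi for xi, yi in zip(x, y)]

def bfgsrec(n, S, Y, h0inv, d):
    # Overwrite d with H_n^{-1} d using the stored pairs (S[k], Y[k]).
    if n == 0:
        return h0inv(d)
    alpha = dot(S[n - 1], d) / dot(Y[n - 1], S[n - 1])   # step 2
    d = axpy(-alpha, Y[n - 1], d)                        # d = d - alpha*y_{n-1}
    d = bfgsrec(n - 1, S, Y, h0inv, d)                   # step 3
    beta = dot(Y[n - 1], d) / dot(Y[n - 1], S[n - 1])    # step 4
    return axpy(alpha - beta, S[n - 1], d)
```

For a single stored pair this reproduces the rank-two inverse update $(I - \rho s y^T) H_0^{-1} (I - \rho y s^T) d + \rho s s^T d$ with $\rho = 1/(y^T s)$ applied to $d$.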

Algorithm bfgsrec has the great advantage, at least in a language that efficiently supports recursion, of being very simple. More complex, nonrecursive versions have been described in [16], [201], and [44].

The storage cost of two vectors per iteration can be significant, and when available storage is exhausted one can simply discard the iteration history and restart with $H_0$. This approach, which we implement in the remaining algorithms in this section, takes advantage of the fact that if $H_0$ is spd then $-H_0^{-1}\nabla f(x)$ will be a descent direction, and hence useful for a line search. Another approach, called limited memory BFGS [44], [207], [176], [201], keeps all but the oldest $(s, y)$ pair and continues with the update. Neither of these approaches for control of storage, while essential in practice for large problems, has the superlinear convergence properties that the full-storage algorithm does.

At a cost of a modest amount of complexity in the formulation, we can reduce the storage cost to one vector for each iteration. The method for doing this in [275] begins with an expansion of (4.5) as

$$H_+^{-1} = H_c^{-1} + \alpha_0 s_c s_c^T + \beta_0\left((H_c^{-1} y_c) s_c^T + s_c (H_c^{-1} y_c)^T\right),$$

where

$$\alpha_0 = \frac{y_c^T s_c + y_c^T H_c^{-1} y_c}{(y_c^T s_c)^2} \quad\text{and}\quad \beta_0 = \frac{-1}{y_c^T s_c}.$$

Now note that

$$H_c^{-1} y_c = H_c^{-1}\nabla f(x_+) - H_c^{-1}\nabla f(x_c) = H_c^{-1}\nabla f(x_+) + s_c/\lambda_c$$

and obtain

(4.27)
$$H_+^{-1} = H_c^{-1} + \alpha_1 s_c s_c^T + \beta_0\left(s_c (H_c^{-1}\nabla f(x_+))^T + (H_c^{-1}\nabla f(x_+)) s_c^T\right),$$

where

$$\alpha_1 = \alpha_0 + 2\beta_0/\lambda_c.$$

Also

(4.28)
$$d_+ = -H_+^{-1}\nabla f(x_+)
= -\left(I - \frac{s_c y_c^T}{y_c^T s_c}\right) H_c^{-1}\left(I - \frac{y_c s_c^T}{y_c^T s_c}\right)\nabla f(x_+) - \frac{s_c s_c^T\,\nabla f(x_+)}{y_c^T s_c}
= A_c s_c + B_c H_c^{-1}\nabla f(x_+),$$

where

(4.29)
$$A_c = \frac{y_c^T H_c^{-1}\left(I - \dfrac{y_c s_c^T}{y_c^T s_c}\right)\nabla f(x_+)}{y_c^T s_c} + \frac{s_c^T\,\nabla f(x_+)}{\lambda_c\, y_c^T s_c}$$



and

(4.30)
$$B_c = -1 + \frac{s_c^T\,\nabla f(x_+)}{y_c^T s_c}.$$

At this point we can compute $d_+$, and therefore $\lambda_+$ and $s_+$, using only $H_c^{-1}\nabla f(x_c)$. We do not need $H_+^{-1}$ at all. We can now form $H_+^{-1}$ with the new data for the next iterate and will show that we do not need to store the vectors $\{y_k\}$.

Since (verify this!) $B_c \neq 0$, we have

$$H_c^{-1}\nabla f(x_+) = -\frac{s_+}{B_c \lambda_+} + \frac{A_c s_c}{B_c}.$$

Combining this with (4.27) gives

(4.31)
$$H_+^{-1} = H_c^{-1} + \alpha_c s_c s_c^T + \beta_c\left(s_c s_+^T + s_+ s_c^T\right),$$

where

(4.32)
$$\alpha_c = \alpha_1 + 2\beta_0 A_c/B_c \quad\text{and}\quad \beta_c = -\frac{\beta_0}{B_c \lambda_+}.$$

This leads to the expansion

(4.33)
$$H_{n+1}^{-1} = H_0^{-1} + \sum_{k=0}^{n}\left[\alpha_k s_k s_k^T + \beta_k\left(s_k s_{k+1}^T + s_{k+1} s_k^T\right)\right].$$

Upon reflection the reader will see that this is a complete algorithm. We can use (4.28) and $H_n$ to compute $d_{n+1}$. Then we can compute $\lambda_{n+1}$ and $s_{n+1}$ and use them and (4.32) to compute $\alpha_n$ and $\beta_n$. This new data can be used to form $H_{n+1}^{-1}$ with (4.33), which we can use to compute $d_{n+2}$ and continue the iteration.

In this way only the steps $\{s_k\}$ and the expansion coefficients $\{\alpha_k\}$ and $\{\beta_k\}$ need be stored. Algorithm bfgsopt is an implementation of these ideas.

Algorithm 4.2.2. bfgsopt$(x, f, \epsilon)$

1. $g = -\nabla f(x)$, $n = 0$.

2. While $\|g\| > \epsilon$

(a) If $n = 0$, $d_n = H_0^{-1} g$;
otherwise compute $A$, $B$, and $d_n$ using (4.28), (4.29), and (4.30).

(b) Compute $\lambda_n$, $s_n$, and $x = x_{n+1}$ with the Armijo rule.

(c) If $n > 0$ compute $\alpha_{n-1}$ and $\beta_{n-1}$ using (4.32).

(d) $g = -\nabla f(x)$, $n = n + 1$.
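The action of $H_{n+1}^{-1}$ in (4.33) on a gradient can be accumulated directly from the stored steps and coefficients. A plain-Python sketch (the names are ours, not the book's; `h0inv` applies $H_0^{-1}$):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def apply_expansion(g, S, alpha, beta, h0inv):
    # H_{n+1}^{-1} g = H_0^{-1} g
    #   + sum_k [ alpha_k s_k s_k^T + beta_k (s_k s_{k+1}^T + s_{k+1} s_k^T) ] g
    out = h0inv(g)
    for k in range(len(alpha)):
        ck = alpha[k] * dot(S[k], g) + beta[k] * dot(S[k + 1], g)
        out = [o + ck * si for o, si in zip(out, S[k])]       # s_k terms
        ck1 = beta[k] * dot(S[k], g)
        out = [o + ck1 * si for o, si in zip(out, S[k + 1])]  # s_{k+1} term
    return out
```

Each term of the sum costs only two inner products and two vector additions, so applying $H_{n+1}^{-1}$ is $O(nN)$ work with one stored vector per iteration.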

4.2.2 A BFGS–Armijo Algorithm

In this section we present a simple implementation that shows how the theoretical results can be applied in algorithm design. Let $H_+^{BFGS}$ be the BFGS update from $H_c$ and define the two modified BFGS (MBFGS) updates by

(4.34)
$$H_+ = \begin{cases} H_+^{BFGS} & \text{if } y^T s > 0, \\ I & \text{if } y^T s \le 0, \end{cases}$$



and

(4.35)
$$H_+ = \begin{cases} H_+^{BFGS} & \text{if } y^T s > 0, \\ H_c & \text{if } y^T s \le 0. \end{cases}$$

In the MBFGS1 method, (4.34), the model Hessian is reinitialized to $I$ if $y^T s \le 0$. In the early phase of this iteration, where $\nabla^2 f$ may have negative eigenvalues, $y^T s \le 0$ is certainly possible and the search direction could be the steepest descent direction for several iterations.

An MBFGS2 step (4.35) keeps the history of the iteration even if $y^T s \le 0$. One view is that this approach keeps as much information as possible. Another is that once $y^T s \le 0$, the iteration history is suspect and should be thrown away. Both forms are used in practice. Our MATLAB code bfgswopt uses MBFGS1 and maintains an approximation to $H^{-1}$ using Algorithm bfgsopt. We also guard against poor scaling by using (3.50).
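The MBFGS1 safeguard (4.34) can be sketched as follows, using the direct BFGS update of a dense model Hessian stored as a list of lists. This is an illustration with names of our choosing, not the book's bfgswopt, which works with $H^{-1}$ instead:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(H, v):
    return [dot(row, v) for row in H]

def bfgs_update(H, s, y):
    # H_+ = H + y y^T/(y^T s) - (H s)(H s)^T/(s^T H s)
    Hs = matvec(H, s)
    c1, c2 = 1.0 / dot(y, s), 1.0 / dot(s, Hs)
    n = len(H)
    return [[H[i][j] + c1 * y[i] * y[j] - c2 * Hs[i] * Hs[j]
             for j in range(n)] for i in range(n)]

def mbfgs1(H, s, y):
    # (4.34): keep the BFGS update only when the curvature condition holds
    if dot(y, s) > 0:
        return bfgs_update(H, s, y)
    return [[1.0 if i == j else 0.0 for j in range(len(H))]
            for i in range(len(H))]  # reinitialize to I
```

Whenever $y^T s > 0$ the secant condition $H_+ s = y$ holds by construction; otherwise the next search direction is the steepest descent direction.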

4.3 Other Quasi-Newton Methods

The DFP (Davidon, Fletcher, Powell) update [71], [72], [105]

(4.36)
$$H_+ = H_c + \frac{(y - H_c s)y^T + y(y - H_c s)^T}{y^T s} - \frac{\left[(y - H_c s)^T s\right] y y^T}{(y^T s)^2}$$

has similar local convergence properties to BFGS but does not perform as well in practice [224], [225].

Two updates that preserve symmetry, but not definiteness, are the PSB (Powell symmetric Broyden) update [219],

(4.37)
$$H_+ = H_c + \frac{(y - H_c s)s^T + s(y - H_c s)^T}{s^T s} - \frac{\left[s^T(y - H_c s)\right] s s^T}{(s^T s)^2},$$

and the symmetric rank-one (SR1) [35] update,

(4.38)
$$H_+ = H_c + \frac{(y - H_c s)(y - H_c s)^T}{(y - H_c s)^T s}.$$

By preserving the symmetry of the approximate Hessians, but not the positive definiteness, these updates present a problem for a line search globalization but an opportunity for a trust region approach. The SR1 update has been reported to outperform BFGS algorithms in certain cases [165], [41], [64], [65], [163], [258], [118], [250], [119], [268], [164], in which either the approximate Hessians can be expected to be positive definite or a trust region framework is used [41], [64], [65].

One may update the inverse of the SR1 approximate Hessian using the Sherman–Morrison formula, (4.39), a simple relation between the inverse of a nonsingular matrix and that of a rank-one update of that matrix [93], [239], [240], [14].

Proposition 4.3.1. Let $H$ be a nonsingular $N \times N$ matrix and let $u, v \in R^N$. Then $H + uv^T$ is invertible if and only if $1 + v^T H^{-1} u \neq 0$. In this case

(4.39)
$$(H + uv^T)^{-1} = \left(I - \frac{(H^{-1}u)v^T}{1 + v^T H^{-1} u}\right) H^{-1}.$$

The proof is simply a direct verification of (4.39).
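Formula (4.39) lets one apply $(H + uv^T)^{-1}$ to a vector using only the action of $H^{-1}$, without forming any matrix. A minimal sketch (function names are ours):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sherman_morrison_solve(hinv, u, v, b):
    # x = (H + u v^T)^{-1} b via (4.39):
    # x = H^{-1} b - (v^T H^{-1} b)/(1 + v^T H^{-1} u) * H^{-1} u
    hb, hu = hinv(b), hinv(u)
    denom = 1.0 + dot(v, hu)
    if denom == 0.0:
        raise ZeroDivisionError("H + u v^T is singular")
    t = dot(v, hb) / denom
    return [xb - t * xu for xb, xu in zip(hb, hu)]
```

As a check, with $H = I$ the returned $x$ satisfies $(I + uv^T)x = b$; the cost is two applications of $H^{-1}$ and two inner products.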

The SR1 algorithm terminates in finitely many iterations for convex quadratic optimization problems [101]. Since the denominator $(y - H_c s)^T s$ could vanish, the update could completely fail and implementations must examine the denominator and take appropriate action if it is too



small. This update does not enforce or require positivity of the approximate Hessian and has

been used effectively to exploit negative curvature in a trust region context [165], [41].
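A sketch of the SR1 update (4.38) with the kind of denominator safeguard just described. The skip tolerance `r_tol` and the relative test against $\|y - H_c s\|\,\|s\|$ are illustrative choices of ours, not prescriptions from the text:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sr1_update(H, s, y, r_tol=1e-8):
    # r = y - H s;  H_+ = H + r r^T / (r^T s),
    # skipped when the denominator is negligible relative to ||r|| ||s||
    r = [yi - dot(row, s) for yi, row in zip(y, H)]
    denom = dot(r, s)
    if abs(denom) <= r_tol * math.sqrt(dot(r, r) * dot(s, s)):
        return H  # skip the update rather than divide by a tiny number
    n = len(H)
    return [[H[i][j] + r[i] * r[j] / denom for j in range(n)] for i in range(n)]
```

When the update is taken, $H_+ s = H s + r = y$, so the secant condition holds; when it is skipped, the previous model is simply kept.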

For overdetermined nonlinear least squares problems one can try to approximate the second-order term in $\nabla^2 f$ while computing $R'^T R'$ exactly. Suppose

$$\nabla^2 f(x) \approx H = C(x) + A,$$

where the idea is that $C$, the computed part, is significantly easier to compute than $A$, the approximated part. This is certainly the case for nonlinear least squares, where $C = R'^T R'$. A quasi-Newton method that intends to exploit this structure will update $A$ only; hence

$$H_+ = C(x_+) + A_+.$$

Superlinear convergence proofs require, in one way or another, that $H_+ s = y$. Therefore, in terms of $A$, one might require the update to satisfy

(4.40)
$$A_+ s = y^\# = y - C(x_+)s.$$

(4.40)

The de¬nition of y # given in (4.40) is called the default choice in [87]. This is not the only

choice for y # , and one can prove superlinear convergence for this and many other choices [87],

[84]. This idea, using several different updates, has been used in other contexts, such as optimal

control [159], [164].

An algorithm of this type, using SR1 to update $A$ and a different choice for $y^\#$, was suggested in [20] and [21]. The nonlinear least squares update from [77], [78], and [84] uses a DFP update and yet another $y^\#$ to compute $A_+$,

(4.41)
$$A_+ = A_c + \frac{(y^\# - A_c s)\,{y^\#}^T + y^\#\,(y^\# - A_c s)^T}{{y^\#}^T s} - \frac{\left[(y^\# - A_c s)^T s\right] y^\# {y^\#}^T}{({y^\#}^T s)^2}.$$

The application of this idea to large-residual least squares problems is not trivial, and scaling issues must be considered in a successful implementation.

Our proof of superlinear convergence can be applied to updates like (4.41). We state a special case of a result from [87] for the BFGS formulation

(4.42)
$$A_+ = A_c + \frac{y^\# {y^\#}^T}{{y^\#}^T s} - \frac{(A_c s)(A_c s)^T}{s^T A_c s}.$$
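The structured update can be sketched by forming $y^\#$ with the default choice (4.40) and then applying (4.42) to the approximated part only. The names below (`Cplus` for the computed matrix $C(x_+)$, and the rest) are illustrative:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def structured_bfgs(Ac, Cplus, s, y):
    # default choice (4.40): y# = y - C(x_+) s
    ysharp = [yi - ci for yi, ci in zip(y, matvec(Cplus, s))]
    # (4.42): A_+ = A_c + y# y#^T/(y#^T s) - (A_c s)(A_c s)^T/(s^T A_c s)
    As = matvec(Ac, s)
    c1, c2 = 1.0 / dot(ysharp, s), 1.0 / dot(s, As)
    n = len(Ac)
    return [[Ac[i][j] + c1 * ysharp[i] * ysharp[j] - c2 * As[i] * As[j]
             for j in range(n)] for i in range(n)]
```

By construction $A_+ s = y^\#$, so the full model satisfies $H_+ s = C(x_+)s + A_+ s = y$, which is the secant condition the superlinear convergence proofs require.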

Theorem 4.3.2. Let the standard assumptions hold and assume that

$$A^* = \nabla^2 f(x^*) - C(x^*)$$

is spd. Then there is $\delta$ such that if

$$\|x_0 - x^*\| \le \delta \quad\text{and}\quad \|A_0 - A^*\| \le \delta,$$

then the quasi-Newton iterates defined by (4.42) exist and converge q-superlinearly to $x^*$.

This result can be readily proved using the methods in this chapter (see [159]).

Quasi-Newton methods can also be designed to take into account special structure, such as

the sparsity pattern of the Hessian. One can update only those elements that are nonzero in the

initial approximation to the Hessian, requiring that the secant equation Hs = y holds. Such

updates have been proposed and analyzed in varying levels of generality in [83], [87], [185],

[238], [256], and [255].



Another approach is to use the dependency of $f$ on subsets of the variables, a structure that is often present in discretizations of infinite-dimensional problems where coefficients of operators can be updated rather than entire matrix representations of those operators. We refer the reader to [133], [131], and [132] for an algebraic viewpoint based on finite-dimensional analysis and to [159], [157], [164], [160], and [163] for an operator theoretic description of these methods. When applied to discretizations of infinite-dimensional optimization problems, quasi-Newton methods perform best when they also work well on the infinite-dimensional problem itself. Work on BFGS in Hilbert space can be found, for example, in [135], [158], and [159].

Quasi-Newton methods have been designed for underdetermined problems [184], and Broyden's method itself has been applied to linear least squares problems [111], [148].

4.4 Examples

The computations in this section were done with the MATLAB code bfgswopt. For the small

parameter ID problem, where evaluation of f is far more expensive than the cost of maintaining

or factoring the (very small!) approximate Hessian, one could also use a brute force approach

in which H is updated and factored anew with each iteration.

4.4.1 Parameter ID Problem

We solve the parameter ID problem with the same data as in §3.4.1 using $H_0 = I$ as the initial Hessian. We compare the BFGS solution with the Gauss–Newton iteration from §3.4.1. From Figure 4.1 one can see the local superlinear convergence and the good performance of the line search. However, as one should expect, the Gauss–Newton iteration, being designed for small residual least squares problems, was more efficient. The Gauss–Newton iteration required 14 function evaluations, 6 gradients, and roughly 1.3 million floating point operations, while the BFGS–Armijo iteration needed 29 function evaluations, 15 gradients, and 3.8 million floating point operations.

4.4.2 Discrete Control Problem

We return to the example from §3.4.2. For our first example we use the initial iterate

$$u_0(t) = 10.$$

BFGS also requires an initial approximation to the Hessian and we consider two such approximations:

(4.43)
$$H_p = .25I \quad\text{and}\quad H_g = I.$$

The Hessian for the continuous problem is a compact perturbation of the identity and the theory from [158] and [135] indicates that $H_g$ is a much better approximate Hessian than $H_p$. The results in Figure 4.2 support that idea. For the better Hessian, one can see the concavity of superlinear convergence in the plot of the gradient norm. The computation for the better Hessian required 12 iterations and roughly 572 thousand floating point operations, while the one with the poor Hessian took 16 iterations and roughly 880 thousand floating point operations. Stepsize reductions were not required for the good Hessian and were needed four times during the iteration for the poor Hessian. However, the guard against poor scaling (3.50) was needed in both cases.

When we used the same poor initial iterate that we used in §3.4.2,

$$u_0(t) = 5 + 300\sin(20\pi t)$$



Figure 4.1: BFGS–Armijo and Gauss–Newton for the Parameter ID Problem (function value and gradient norm plotted against iterations for each method).

[Figure 4.2: function value and gradient norm plotted against iterations for the better Hessian and the poor Hessian.]