LOCAL CONVERGENCE 21

2.3.2 Termination of the Iteration

It is not safe to terminate the iteration when f(xc) − f(x+) is small, and no conclusions can safely be drawn by examination of the differences of the objective function values at successive iterations. While some of the algorithms for difficult problems in Part II of this book do indeed terminate when successive function values are close, this is an act of desperation. For example, if

f(xn) = −∑_{j=1}^{n} j⁻¹,

then f(xn) → −∞ but f(xn+1) − f(xn) = −1/(n + 1) → 0. The reader has been warned.
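The divergence of this harmonic sum is easy to check numerically. The sketch below is our own illustration, not from the text; it shows how small the successive differences can be while f(xn) is still far from any limit.

```python
# Successive differences of f(x_n) = -sum_{j=1}^n 1/j shrink like -1/(n+1)
# even though f(x_n) -> -infinity; a difference-based test would stop here.
def f(n):
    return -sum(1.0 / j for j in range(1, n + 1))

n = 10**6
diff = f(n + 1) - f(n)   # roughly -1e-6: looks "converged"
value = f(n)             # roughly -14.4 and still decreasing without bound
```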

If the standard assumptions hold, then one may terminate the iteration when the norm of ∇f is sufficiently small relative to ‖∇f(x0)‖ (see [154]). We will summarize the key points here and refer the reader to [154] for the details. The idea is that if ∇²f(x*) is well conditioned, then a small gradient norm implies a small error norm. Hence, for any gradient-based iterative method, termination on small gradients is reasonable. In the special case of Newton's method, the norm of the step is a very good indicator of the error and, if one is willing to incur the added cost of an extra iteration, a very sharp bound on the error can be obtained, as we will see below.

Lemma 2.3.8. Assume that the standard assumptions hold. Let δ > 0 be small enough so that the conclusions of Lemma 2.3.1 hold for x ∈ B(δ). Then for all x ∈ B(δ)

(2.22)  ‖e‖ / (4‖e0‖κ(∇²f(x*))) ≤ ‖∇f(x)‖ / ‖∇f(x0)‖ ≤ 4κ(∇²f(x*))‖e‖ / ‖e0‖.

The meaning of (2.22) is that, up to a constant multiplier, the norm of the relative gradient

is the same as the norm of the relative error. This partially motivates the termination condition

(2.12).
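A relative-gradient test of this kind can be sketched as follows; the function name and default tolerances are our own, and the absolute term guards against an accidentally tiny ‖∇f(x0)‖ (the same combined test appears in Algorithm newtcg in §2.5.2).

```python
import numpy as np

# Relative-gradient termination:
# stop when ||grad f(x)|| <= tau_r * ||grad f(x0)|| + tau_a.
def grad_converged(g, g0, tau_r=1e-6, tau_a=1e-12):
    return np.linalg.norm(g) <= tau_r * np.linalg.norm(g0) + tau_a
```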

In the special case of Newton's method, one can use the steplength as an accurate estimate of the error because Theorem 2.3.2 implies that

(2.23)  ec = s + O(‖ec‖²).

Hence, near the solution s and ec are essentially the same size. The cost of using (2.23) is that all the information needed to compute x+ = xc + s has been computed. If we terminate the iteration when ‖s‖ is smaller than our desired tolerance and then take x+ as the final result, we have attained more accuracy than we asked for. One possibility is to terminate the iteration when ‖s‖ = O(√τs) for some τs > 0. This, together with (2.23), will imply that ‖ec‖ = O(√τs) and hence, using Theorem 2.3.2, that

(2.24)  ‖e+‖ = O(‖ec‖²) = O(τs).
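For Newton's method this suggests the following loop, sketched here on the scalar model problem f(x) = cosh(x) (minimizer x* = 0); the function and all names are our own illustration, not the book's code. Taking the step before testing its size is what yields the extra accuracy noted above.

```python
import numpy as np

# Newton iteration that terminates on a small step: stop once |s| <= sqrt(tau_s),
# after taking the step, so that (2.24) gives the final error = O(tau_s).
def newton_small_step(x, fprime, fsecond, tau_s, maxit=50):
    for _ in range(maxit):
        s = -fprime(x) / fsecond(x)   # Newton step on the scalar problem
        x = x + s
        if abs(s) <= np.sqrt(tau_s):
            break
    return x

x_final = newton_small_step(1.0, np.sinh, np.cosh, 1e-8)  # f(x) = cosh(x), x* = 0
```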

For a superlinearly convergent method, termination on small steps is equally valid, but one cannot use (2.24). For a superlinearly convergent method we have

(2.25)  ‖ec‖ = ‖s‖ + o(‖ec‖) and ‖e+‖ = o(‖ec‖).

Hence, we can conclude that ‖e+‖ < τs if ‖s‖ < τs. This is a weaker, but still very useful, result.

For a q-linearly convergent method, such as the chord method, making termination decisions based on the norms of the steps is much riskier. The relative error in estimating ‖ec‖ by ‖s‖ is

|‖ec‖ − ‖s‖| / ‖ec‖ ≤ ‖ec + s‖ / ‖ec‖ = ‖e+‖ / ‖ec‖.

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

22 ITERATIVE METHODS FOR OPTIMIZATION

Hence, estimation of errors by steps is worthwhile only if convergence is fast. One can go further [156] if one has an estimate ρ of the q-factor that satisfies

‖e+‖ ≤ ρ‖ec‖.

In that case,

(1 − ρ)‖ec‖ ≤ ‖ec‖ − ‖e+‖ ≤ ‖ec − e+‖ = ‖s‖.

Hence

(2.26)  ‖e+‖ ≤ ρ‖ec‖ ≤ (ρ/(1 − ρ))‖s‖.

So, if the q-factor can be estimated from above by ρ and

‖s‖ < (1 − ρ)τs/ρ,

then ‖e+‖ < τs. This approach is used in ODE and DAE codes [32], [234], [228], [213], but requires good estimates of the q-factor and we do not advocate it for q-linearly convergent methods for optimization. The danger is that if the convergence is slow, the approximate q-factor can be a gross underestimate and cause premature termination of the iteration.
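The bound (2.26) translates into a step tolerance as follows. This is a small sketch with hypothetical names, valid only under the assumption that ρ really is an upper bound on the q-factor.

```python
# From (2.26): ||e_+|| <= rho/(1-rho) * ||s||, so requiring
# ||s|| < (1-rho)*tau_s/rho guarantees ||e_+|| < tau_s.
def step_tolerance(rho, tau_s):
    """Largest step norm that still certifies ||e_+|| < tau_s."""
    return (1.0 - rho) * tau_s / rho

def error_bound(rho, step_norm):
    """The bound rho/(1-rho) * ||s|| on ||e_+|| from (2.26)."""
    return rho / (1.0 - rho) * step_norm
```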

It is not uncommon for evaluations of f and ∇f to be very expensive, and optimizations are, therefore, usually allocated a fixed maximum number of iterations. Some algorithms, such as the DIRECT algorithm [150] that we discuss in §8.4.2, assign a limit to the number of function evaluations and terminate the iteration in only this way.

2.4 Nonlinear Least Squares

Nonlinear least squares problems have objective functions of the form

(2.27)  f(x) = (1/2) ∑_{i=1}^{M} ri(x)² = (1/2) R(x)ᵀR(x).

The vector R = (r1, . . . , rM) is called the residual. These problems arise in data fitting, for example. In that case M is the number of observations and N is the number of parameters; for these problems M > N and we say the problem is overdetermined. If M = N we have a nonlinear equation and the theory and methods from [154] are applicable. If M < N the problem is underdetermined. Overdetermined least squares problems arise most often in data fitting applications like the parameter identification example in §1.6.2. Underdetermined problems are less common, but are, for example, important in the solution of high-index differential algebraic equations [48], [50].

The local convergence theory for underdetermined problems has the additional complexity that the limit of the Gauss–Newton iteration is not uniquely determined by the distance of the initial iterate to the set of points where R(x*) = 0. In §2.4.3 we describe the difficulties and state a simple convergence result.

If x* is a local minimizer of f and f(x*) = 0, the problem min f is called a zero residual problem (a remarkable and suspicious event in the data fitting scenario). If f(x*) is small, the expected result in data fitting if the model (i.e., R) is good, the problem is called a small residual problem. Otherwise one has a large residual problem.

Nonlinear least squares problems are an intermediate stage between nonlinear equations and optimization problems and the methods for their solution reflect this. We define the M × N Jacobian R′ of R by

(2.28)  (R′(x))ij = ∂ri/∂xj, 1 ≤ i ≤ M, 1 ≤ j ≤ N.


With this notation it is easy to show that

(2.29)  ∇f(x) = R′(x)ᵀR(x) ∈ R^N.

The necessary conditions for optimality imply that at a minimizer x*

(2.30)  R′(x*)ᵀR(x*) = 0.

In the underdetermined case, if R′(x*) has full row rank, (2.30) implies that R(x*) = 0; this is not the case for overdetermined problems.
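Identity (2.29) is easy to sanity check with a finite difference; the residual below is a made-up example of ours, not one from the text.

```python
import numpy as np

# Check grad f = R'(x)^T R(x) for f = 0.5*R(x)^T R(x) on a toy residual
# R(x) = (x1^2 - 1, x2, x1*x2) with M = 3 observations and N = 2 parameters.
def R(x):
    return np.array([x[0]**2 - 1.0, x[1], x[0] * x[1]])

def Rprime(x):  # the M x N Jacobian of (2.28)
    return np.array([[2.0 * x[0], 0.0],
                     [0.0,        1.0],
                     [x[1],       x[0]]])

def f(x):
    r = R(x)
    return 0.5 * r @ r

def grad_f(x):
    return Rprime(x).T @ R(x)  # (2.29)

x = np.array([0.7, -0.3])
h = 1e-7   # forward-difference gradient for comparison
fd_grad = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(2)])
```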

The cost of a gradient is roughly that of a Jacobian evaluation, which, as is the case with nonlinear equations, is the most one is willing to accept. Computation of the Hessian (an N × N matrix)

(2.31)  ∇²f(x) = R′(x)ᵀR′(x) + ∑_{i=1}^{M} ri(x)∇²ri(x)

requires computation of the M Hessians {∇²ri} for the second-order term

∑_{i=1}^{M} ri(x)∇²ri(x)

and is too costly to be practical.

We will also express the second-order term as

∑_{i=1}^{M} ri(x)∇²ri(x) = R″(x)ᵀR(x),

where the second derivative R″ is a tensor. The notation is to be interpreted in the following way. For v ∈ R^M, R″(x)ᵀv is the N × N matrix

R″(x)ᵀv = ∇²(R(x)ᵀv) = ∑_{i=1}^{M} (v)i ∇²ri(x).

We will use the tensor notation when expanding R about x* in some of the analysis to follow.

2.4.1 Gauss–Newton Iteration

The Gauss–Newton algorithm simply discards the second-order term in ∇²f and computes a step

(2.32)  s = −(R′(xc)ᵀR′(xc))⁻¹∇f(xc) = −(R′(xc)ᵀR′(xc))⁻¹R′(xc)ᵀR(xc).

The Gauss–Newton iterate is x+ = xc + s. One motivation for this approach is that R″(x)ᵀR(x) vanishes for zero residual problems and therefore might be negligible for small residual problems.

Implicit in (2.32) is the assumption that R′(xc)ᵀR′(xc) is nonsingular, which implies that M ≥ N. Another interpretation, which also covers underdetermined problems, is to say that the Gauss–Newton iterate is the minimum norm solution of the local linear model of our nonlinear least squares problem

(2.33)  min (1/2)‖R(xc) + R′(xc)(x − xc)‖².

Using (2.33) and linear least squares methods is a more accurate way to compute the step than using (2.32) [115], [116], [127]. In the underdetermined case, the Gauss–Newton step can be computed with the singular value decomposition [49], [127], [249]. (2.33) is an overdetermined, square, or underdetermined linear least squares problem if the nonlinear problem is overdetermined, square, or underdetermined.
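A minimal sketch of the iteration built on the linear model (2.33), using numpy's SVD-based least squares solver so that the minimum norm step is returned even in the underdetermined case; the toy residual and all names are our own.

```python
import numpy as np

# One Gauss-Newton step: the minimum norm solution s of min ||R(xc) + R'(xc) s||.
def gauss_newton_step(Rfun, Jfun, xc):
    s, *_ = np.linalg.lstsq(Jfun(xc), -Rfun(xc), rcond=None)
    return s

# Zero residual toy problem with M = 3 > N = 2 and solution x* = (1, -2):
Rfun = lambda x: np.array([x[0] - 1.0, x[1] + 2.0, (x[0] - 1.0) * (x[1] + 2.0)])
Jfun = lambda x: np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [x[1] + 2.0, x[0] - 1.0]])

x = np.array([1.5, -1.5])
for _ in range(6):                # local iteration x+ = xc + s
    x = x + gauss_newton_step(Rfun, Jfun, x)
```

On this zero residual problem the errors shrink q-quadratically, as Theorem 2.4.1 below predicts.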


The standard assumptions for nonlinear least squares problems follow in Assumption 2.4.1.

Assumption 2.4.1. x* is a minimizer of ‖R‖₂², R is Lipschitz continuously differentiable near x*, and R′(x*)ᵀR′(x*) has maximal rank. The rank assumption may also be stated as

• R′(x*) is nonsingular (M = N),

• R′(x*) has full column rank (M > N),

• R′(x*) has full row rank (M < N).

2.4.2 Overdetermined Problems

Theorem 2.4.1. Let M > N. Let Assumption 2.4.1 hold. Then there are K > 0 and δ > 0 such that if xc ∈ B(δ) then the error in the Gauss–Newton iteration satisfies

(2.34)  ‖e+‖ ≤ K(‖ec‖² + ‖R(x*)‖ ‖ec‖).

Proof. Let δ be small enough so that ‖x − x*‖ < δ implies that R′(x)ᵀR′(x) is nonsingular. Let γ be the Lipschitz constant for R′.

By (2.32)

(2.35)  e+ = ec − (R′(xc)ᵀR′(xc))⁻¹R′(xc)ᵀR(xc) = (R′(xc)ᵀR′(xc))⁻¹R′(xc)ᵀ(R′(xc)ec − R(xc)).

Note that

R′(xc)ec − R(xc) = R′(xc)ec − R(x*) + R(x*) − R(xc) = −R(x*) + (R′(xc)ec + R(x*) − R(xc)).

Now,

‖R′(xc)ec + R(x*) − R(xc)‖ ≤ γ‖ec‖²/2

and, since R′(x*)ᵀR(x*) = 0,

−R′(xc)ᵀR(x*) = (R′(x*) − R′(xc))ᵀR(x*).

Hence,

(2.36)  ‖e+‖ ≤ ‖(R′(xc)ᵀR′(xc))⁻¹(R′(x*) − R′(xc))ᵀR(x*)‖ + ‖(R′(xc)ᵀR′(xc))⁻¹R′(xc)ᵀ‖ γ‖ec‖²/2
           ≤ ‖(R′(xc)ᵀR′(xc))⁻¹‖ γ‖ec‖ (‖R(x*)‖ + ‖R′(xc)ᵀ‖ ‖ec‖/2).

Setting

K = γ max_{x∈B(δ)} ( ‖(R′(x)ᵀR′(x))⁻¹‖ (1 + ‖R′(x)ᵀ‖/2) )


completes the proof.

There are several important consequences of Theorem 2.4.1. The first is that for zero residual problems, the local convergence rate is q-quadratic because the ‖R(x*)‖ ‖ec‖ term on the right side of (2.34) vanishes. For a problem other than a zero residual one, even q-linear convergence is not guaranteed. In fact, if xc ∈ B(δ) then (2.35) will imply that ‖e+‖ ≤ r‖ec‖ for some 0 < r < 1 if

(2.37)  K(δ + ‖R(x*)‖) ≤ r

and therefore the q-factor will be K‖R(x*)‖. Hence, for small residual problems and accurate initial data the convergence of Gauss–Newton will be fast. Gauss–Newton may not converge at all for large residual problems.

Equation (2.36) exposes a more subtle issue when the term

(R′(x*) − R′(xc))ᵀR(x*)

is considered as a whole, rather than estimated by

γ‖ec‖ ‖R(x*)‖.

Using Taylor's theorem and the necessary conditions (R′(x*)ᵀR(x*) = 0) we have

R′(xc)ᵀR(x*) = [R′(x*) + R″(x*)ec + O(‖ec‖²)]ᵀR(x*) = ecᵀR″(x*)ᵀR(x*) + O(‖ec‖²).

Recall that

R″(x*)ᵀR(x*) = ∇²f(x*) − R′(x*)ᵀR′(x*)

and hence

(2.38)  ‖(R′(x*) − R′(xc))ᵀR(x*)‖ ≤ ‖∇²f(x*) − R′(x*)ᵀR′(x*)‖ ‖ec‖ + O(‖ec‖²).

In a sense (2.38) says that even for a large residual problem, convergence can be fast if the problem is not very nonlinear (small R″). In the special case of a linear least squares problem (where R″ = 0) Gauss–Newton becomes simply the solution of the normal equations and converges in one iteration.

So, Gauss–Newton can be expected to work well for overdetermined small residual problems and good initial iterates. For large residual problems and/or initial data far from the solution, there is no reason to expect Gauss–Newton to give good results. We address these issues in §3.2.3.

2.4.3 Underdetermined Problems

We begin with the linear underdetermined least squares problem

(2.39)  min ‖Ax − b‖².

If A is M × N with M < N there will not be a unique minimizer but there will be a unique minimizer with minimum norm. The minimum norm solution can be expressed in terms of the singular value decomposition [127], [249] of A,

(2.40)  A = UΣVᵀ.


In (2.40), Σ is an N × N diagonal matrix. The diagonal entries of Σ, {σi}, are called the singular values. σi ≥ 0 and σi = 0 if i > M. The columns of the M × N matrix U and the N × N matrix V are called the left and right singular vectors. U and V have orthonormal columns and hence the minimum norm solution of (2.39) is

x = A†b, where A† = VΣ†Uᵀ,

Σ† = diag(σ1†, . . . , σN†),

and

σi† = σi⁻¹ if σi ≠ 0,  σi† = 0 if σi = 0.

A† is called the Moore–Penrose inverse [49], [189], [212]. If A is a square nonsingular matrix, then A† = A⁻¹; if M > N then the definition of A† using the singular value decomposition is still valid; and, if A has full column rank, A† = (AᵀA)⁻¹Aᵀ.

Two simple properties of the Moore–Penrose inverse are that A†A is a projection onto the range of A† and AA† is a projection onto the range of A. This means that

(2.41)  A†AA† = A†, (A†A)ᵀ = A†A, AA†A = A, and (AA†)ᵀ = AA†.
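These identities can be verified numerically; numpy's pinv computes A† via the SVD. The random matrix below is our own example.

```python
import numpy as np

# Verify the Moore-Penrose identities (2.41) for an underdetermined A (M < N).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))     # M = 3, N = 5; full row rank with probability one
Adag = np.linalg.pinv(A)            # A^dagger computed from the SVD

b = np.ones(3)
x_min = Adag @ b                    # minimum norm solution of min ||Ax - b||
```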

So the minimum norm solution of the local linear model (2.33) of an underdetermined nonlinear least squares problem can be written as [17], [102]

(2.42)  s = −R′(xc)†R(xc)

and the Gauss–Newton iteration [17] is

(2.43)  x+ = xc − R′(xc)†R(xc).

The challenge in formulating a local convergence result is that there is not a unique optimal point

that attracts the iterates.

In the linear case, where R(x) = Ax − b, one gets

x+ = xc − A†(Axc − b) = (I − A†A)xc + A†b.

Set e = x − A†b and note that

A†AA†b = A†b

by (2.41). Hence

e+ = (I − A†A)ec.

This does not imply that x+ = A†b, the minimum norm solution, only that x+ is a solution of the problem and the iteration converges in one step. The Gauss–Newton iteration cannot correct for errors that are not in the range of A†.

Let

Z = {x | R(x) = 0}.

We show in Theorem 2.4.2, a special case of the result in [92], that if the standard assumptions hold at a point x* ∈ Z, then the iteration will converge q-quadratically to a point z* ∈ Z. However, there is no reason to expect that z* = x*. In general z* will depend on x0, a very different situation from the overdetermined case. The hypotheses of Theorem 2.4.2, especially that of full column rank in R′(x), are less general than those in [24], [17], [25], [92], and [90].


Theorem 2.4.2. Let M ≤ N and let Assumption 2.4.1 hold for some x* ∈ Z. Then there is δ > 0 such that if

‖x0 − x*‖ ≤ δ,

then the Gauss–Newton iterates

xn+1 = xn − R′(xn)†R(xn)

exist and converge r-quadratically to a point z* ∈ Z.

Proof. Assumption 2.4.1 and results in [49], [126] imply that if δ is sufficiently small then there is ρ1 such that R′(x)† is Lipschitz continuous in the set

B1 = {x | ‖x − x*‖ ≤ ρ1}

and the singular values of R′(x) are bounded away from zero in B1. We may, reducing ρ1 if necessary, apply the Kantorovich theorem [154], [151], [211] to show that if x ∈ B1 and w ∈ Z is such that

‖x − w‖ = min_{z∈Z} ‖x − z‖,

then there is ξ = ξ(x) ∈ Z such that

‖w − ξ(x)‖ = O(‖x − w‖²) ≤ ‖x − w‖/2

and x − ξ(x) is in the range of R′(w)†, i.e.,

R′(w)†R′(w)(x − ξ(x)) = x − ξ(x).

The method of the proof is to adjust δ so that the Gauss–Newton iterates remain in B1 and R(xn) → 0 sufficiently rapidly. We begin by requiring that δ < ρ1/2.

Let xc ∈ B1 and let e = x − ξ(xc). Taylor's theorem, the fundamental theorem of calculus, and (2.41) imply that

e+ = ec − R′(xc)†R(xc)
   = ec − (R′(xc)† − R′(ξ(xc))†)R(xc) − R′(ξ(xc))†R(xc)
   = ec − R′(ξ(xc))†R(xc) + O(‖ec‖²)
   = (I − R′(ξ(xc))†R′(ξ(xc)))ec + O(‖ec‖²) = O(‖ec‖²).

If we define d(x) = min_{z∈Z} ‖x − z‖ then there is K1 such that

(2.44)  d(x+) ≤ ‖x+ − ξ(xc)‖ ≤ K1‖xc − ξ(xc)‖² ≤ K1 d(xc)².

Now let

ρ2 = min(ρ1, (2K1)⁻¹).

So if

xc ∈ B2 = {x | ‖x − x*‖ ≤ ρ2}

then

(2.45)  d(x+) ≤ d(xc)/2.

Finally, there is K2 such that

‖x+ − x*‖ ≤ ‖xc − x*‖ + ‖x+ − xc‖ = ‖xc − x*‖ + ‖R′(xc)†R(xc)‖ ≤ ‖xc − x*‖ + K2‖xc − ξ(xc)‖.


We now require that

(2.46)  δ ≤ ρ2/(2(1 + K2)).

We complete the proof by induction. If ‖x0 − x*‖ ≤ δ and the Gauss–Newton iterates {xk}_{k=0}^{n} are in B2, then xn+1 is defined and, using (2.46) and (2.44),

‖xn+1 − x*‖ ≤ ‖x0 − x*‖ + K3 ∑_{k=0}^{n} d(xk) ≤ δ + 2K3 d(x0) ≤ ρ1.

Hence, the Gauss–Newton iterates exist, remain in B1, and d(xn) → 0.

To show that the sequence of Gauss–Newton iterates does in fact converge, we observe that there is K3 such that

‖x+ − xc‖ = ‖R′(xc)†R(xc)‖ ≤ K3‖xc − ξ(xc)‖ ≤ K3 d(xc).

Therefore (2.45) implies that for any m, n ≥ 0,

‖xn+m − xn‖ ≤ ∑_{l=n}^{n+m−1} ‖xl+1 − xl‖ ≤ K3 ∑_{l=n}^{n+m−1} d(xl) ≤ 2K3 d(xn) ≤ 2⁻ⁿ⁺¹K3 d(x0).

Hence, {xk} is a Cauchy sequence and therefore converges to a point z* ∈ Z. Since

‖xn − z*‖ ≤ 2K3 d(xn),

(2.44) implies that the convergence is r-quadratic.

2.5 Inexact Newton Methods

An inexact Newton method [74] uses an approximate Newton step s = x+ − xc, requiring only that

(2.47)  ‖∇²f(xc)s + ∇f(xc)‖ ≤ ηc‖∇f(xc)‖,

i.e., that the linear residual be small. We will refer to any vector s that satisfies (2.47) with ηc < 1 as an inexact Newton step. We will refer to the parameter ηc on the right-hand side of (2.47) as the forcing term [99].

Inexact Newton methods are also called truncated Newton methods [75], [198], [199] in the context of optimization. In this book, we consider Newton–iterative methods. This is the class of inexact Newton methods in which the linear equation (2.4) for the Newton step is also solved by an iterative method and (2.47) is the termination criterion for that linear iteration. It is standard to refer to the sequence of Newton steps {xn} as the outer iteration and the sequence of iterates for the linear equation as the inner iteration. The naming convention (see [33], [154], [211]) is that Newton–CG, for example, refers to the Newton–iterative method in which the conjugate gradient [141] algorithm is used to perform the inner iteration.

Newton–CG is particularly appropriate for optimization, as we expect positive definite Hessians near a local minimizer. The results for inexact Newton methods from [74] and [154] are sufficient to describe the local convergence behavior of Newton–CG, and we summarize the relevant results from nonlinear equations in §2.5.1. We will discuss the implementation of Newton–CG in §2.5.2.


2.5.1 Convergence Rates

We will prove the simplest of the convergence results for Newton–CG, Theorem 2.5.1, from which Theorem 2.5.2 follows directly. We refer the reader to [74] and [154] for the proof of Theorem 2.5.3.

Theorem 2.5.1. Let the standard assumptions hold. Then there are δ and K_I such that if xc ∈ B(δ), s satisfies (2.47), and x+ = xc + s, then

(2.48)  ‖e+‖ ≤ K_I(‖ec‖ + ηc)‖ec‖.

Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 and Theorem 2.3.2 hold. To prove the first assertion (2.48) note that if

r = −∇²f(xc)s − ∇f(xc)

is the linear residual, then

s + (∇²f(xc))⁻¹∇f(xc) = −(∇²f(xc))⁻¹r

and

(2.49)  e+ = ec + s = ec − (∇²f(xc))⁻¹∇f(xc) − (∇²f(xc))⁻¹r.

Now, (2.47), (2.7), and (2.6) imply that

‖s + (∇²f(xc))⁻¹∇f(xc)‖ ≤ ‖(∇²f(xc))⁻¹‖ηc‖∇f(xc)‖ ≤ 4κ(∇²f(x*))ηc‖ec‖.

Hence, using (2.49) and Theorem 2.3.2, we have that

‖e+‖ ≤ ‖ec − (∇²f(xc))⁻¹∇f(xc)‖ + 4κ(∇²f(x*))ηc‖ec‖ ≤ K‖ec‖² + 4κ(∇²f(x*))ηc‖ec‖,

where K is the constant from (2.8). If we set

K_I = K + 4κ(∇²f(x*)),

the proof is complete.

Theorem 2.5.2. Let the standard assumptions hold. Then there are δ and η̄ such that if x0 ∈ B(δ), {ηn} ⊂ [0, η̄], then the inexact Newton iteration

xn+1 = xn + sn,

where

‖∇²f(xn)sn + ∇f(xn)‖ ≤ ηn‖∇f(xn)‖,

converges q-linearly to x*. Moreover

• if ηn → 0 the convergence is q-superlinear, and

• if ηn ≤ K_η‖∇f(xn)‖^p for some K_η > 0 the convergence is q-superlinear with q-order 1 + p.


The similarity between Theorem 2.5.2 and Theorem 2.3.5, the convergence result for the chord method, should be clear. Rather than require that the approximate Hessian be accurate, we demand that the linear iteration produce a sufficiently small relative residual. Theorem 2.5.3 is the remarkable statement that any reduction in the relative linear residual will suffice for linear convergence in a certain norm. This statement implies [154] that ‖∇f(xn)‖ will converge to zero q-linearly, or, equivalently, that xn → x* q-linearly with respect to ‖·‖*, which is defined by

‖x‖* = ‖∇²f(x*)x‖.

Theorem 2.5.3. Let the standard assumptions hold. Then there is δ such that if xc ∈ B(δ), s satisfies (2.47), x+ = xc + s, and ηc ≤ η < η̄ < 1, then

(2.50)  ‖e+‖* ≤ η̄‖ec‖*.

Theorem 2.5.4. Let the standard assumptions hold. Then there is δ such that if x0 ∈ B(δ), {ηn} ⊂ [0, η] with η < η̄ < 1, then the inexact Newton iteration

xn+1 = xn + sn,

where

‖∇²f(xn)sn + ∇f(xn)‖ ≤ ηn‖∇f(xn)‖,

converges q-linearly with respect to ‖·‖* to x*. Moreover

• if ηn → 0 the convergence is q-superlinear, and

• if ηn ≤ K_η‖∇f(xn)‖^p for some K_η > 0 the convergence is q-superlinear with q-order 1 + p.

Q-linear convergence of {xn} to a local minimizer with respect to ‖·‖* is equivalent to q-linear convergence of {‖∇f(xn)‖} to zero. We will use the rate of convergence of {‖∇f(xn)‖} in our computational examples to compare various methods.

2.5.2 Implementation of Newton–CG

Our implementation of Newton–CG approximately solves the equation for the Newton step with CG. We make the implicit assumption that ∇f has been computed sufficiently accurately for D²ₕf(x : w) to be a useful approximation to the Hessian–vector product ∇²f(x)w.

Forward Difference CG

Forward Difference CG

Algorithm fdcg is an implementation of the solution by CG of the equation for the Newton step (2.4). In this algorithm we take care to identify failure in CG (i.e., detection of a vector p for which pᵀHp ≤ 0). This failure either means that H is singular (pᵀHp = 0; see exercise 2.7.3) or that pᵀHp < 0, i.e., p is a direction of negative curvature. The algorithms we will discuss in §3.3.7 make good use of directions of negative curvature. The initial iterate for the forward difference CG iteration should be the zero vector. In this way the first iterate will give a steepest descent step, a fact that is very useful. The inputs to Algorithm fdcg are the current point x, the objective f, the forcing term η, and a limit on the number of iterations kmax. The output is the inexact Newton direction d. Note that in step 2b D²ₕf(x : p) is used as an approximation to ∇²f(x)p.

Algorithm 2.5.1. fdcg(x, f, η, kmax, d)

1. r = −∇f(x), ρ0 = ‖r‖², k = 1, d = 0.

2. Do While √ρ_{k−1} > η‖∇f(x)‖ and k < kmax

   (a) if k = 1 then p = r
       else β = ρ_{k−1}/ρ_{k−2} and p = r + βp

   (b) w = D²ₕf(x : p)
       If pᵀw = 0 signal indefiniteness; stop.
       If pᵀw < 0 signal negative curvature; stop.

   (c) α = ρ_{k−1}/pᵀw

   (d) d = d + αp

   (e) r = r − αw

   (f) ρ_k = ‖r‖²

   (g) k = k + 1
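A direct Python transcription of Algorithm fdcg may help make the bookkeeping concrete. Here hessvec(x, p) stands in for D²ₕf(x : p), for instance the forward difference (∇f(x + hp) − ∇f(x))/h, and the exceptions play the role of the two stop signals; the function names are ours.

```python
import numpy as np

def fdcg(x, grad, hessvec, eta, kmax):
    """CG on the Newton equation; returns the inexact Newton direction d."""
    g = grad(x)
    r = -g                       # step 1: r = -grad f(x), d = 0
    d = np.zeros_like(x)
    rho, rho_old = r @ r, None   # rho_{k-1} and rho_{k-2}
    p = r.copy()
    k = 1
    while np.sqrt(rho) > eta * np.linalg.norm(g) and k < kmax:
        if k > 1:                # step 2a
            p = r + (rho / rho_old) * p
        w = hessvec(x, p)        # step 2b: w approximates (grad^2 f(x)) p
        curv = p @ w
        if curv == 0.0:
            raise ArithmeticError("indefiniteness detected")
        if curv < 0.0:
            raise ArithmeticError("negative curvature detected")
        alpha = rho / curv       # steps 2c-2f
        d = d + alpha * p
        r = r - alpha * w
        rho_old, rho = rho, r @ r
        k += 1
    return d

# usage on a quadratic f(x) = 0.5*x^T H x - b^T x, where CG is exact:
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
d = fdcg(np.zeros(2), lambda x: H @ x - b, lambda x, p: H @ p, 1e-10, 20)
```

On this quadratic the returned d is the exact Newton step H⁻¹b = (0.2, 0.4), reached in two CG sweeps.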

Preconditioning can be incorporated into a Newton–CG algorithm by using a forward difference formulation, too. Here, as in [154], we denote the preconditioner by M. Aside from M, the inputs and output of Algorithm fdpcg are the same as those for Algorithm fdcg.

Algorithm 2.5.2. fdpcg(x, f, M, η, kmax, d)

1. r = −∇f(x), ρ0 = ‖r‖², k = 1, d = 0.

2. Do While √ρ_{k−1} > η‖∇f(x)‖ and k < kmax

   (a) z = Mr

   (b) τ_{k−1} = zᵀr

   (c) if k = 1 then β = 0 and p = z
       else β = τ_{k−1}/τ_{k−2}, p = z + βp

   (d) w = D²ₕf(x : p)
       If pᵀw = 0 signal indefiniteness; stop.
       If pᵀw < 0 signal negative curvature; stop.

   (e) α = τ_{k−1}/pᵀw

   (f) d = d + αp

   (g) r = r − αw

   (h) ρ_k = rᵀr

   (i) k = k + 1

In our formulation of Algorithms fdcg and fdpcg, indefiniteness is a signal that we are not sufficiently near a minimum for the theory in this section to hold. In §3.3.7 we show how negative curvature can be exploited when far from the solution.

One view of preconditioning is that it is no more than a rescaling of the independent variables. Suppose, rather than (1.2), we seek to solve

(2.51)  min_y f̂(y),


where f̂(y) = f(M^{1/2}y) and M is spd. If y* is a local minimizer of f̂, then x* = M^{1/2}y* is a local minimizer of f and the two problems are equivalent. Moreover, if x = M^{1/2}y and ∇x and ∇y denote gradients in the x and y coordinates, then

∇y f̂(y) = M^{1/2}∇x f(x)

and

∇²y f̂(y) = M^{1/2}(∇²x f(x))M^{1/2}.

Hence, the scaling matrix plays the role of the square root of the preconditioner for the preconditioned conjugate gradient algorithm.

Newton–CG

The theory guarantees that if x0 is near enough to a local minimizer then ∇²f(xn) will be spd for the entire iteration and xn will converge rapidly to x*. Hence, Algorithm newtcg will not terminate with failure because of an increase in f or an indefinite Hessian. Note that both the forcing term η and the preconditioner M can change as the iteration progresses.

Algorithm 2.5.3. newtcg(x, f, τ, η)

1. rc = r0 = ‖∇f(x)‖

2. Do while ‖∇f(x)‖ > τ_r r0 + τ_a

   (a) Select η and a preconditioner M.

   (b) fdpcg(x, f, M, η, kmax, d)
       If indefiniteness has been detected, terminate with failure.

   (c) x = x + d.

   (d) Evaluate f and ∇f(x).
       If f has not decreased, terminate with failure.

   (e) r+ = ‖∇f(x)‖, σ = r+/rc, rc = r+.

The implementation of Newton–CG is simple but, as presented in Algorithm newtcg, incomplete. The algorithm requires substantial modification to be able to generate the good initial data that the local theory requires. We return to this issue in §3.3.7.

There is a subtle problem with Algorithm fdpcg in that the algorithm is equivalent to the application of the preconditioned conjugate gradient algorithm to the matrix B that is determined by

Bp_i = w_i = D²ₕf(x : p_i), 1 ≤ i ≤ N.

However, since the map p → D²ₕf(x : p) is not linear in p, the quality of B as an approximation to ∇²f(x) may degrade as the linear iteration progresses. Usually this will not cause problems unless many iterations are needed to satisfy the inexact Newton condition. However, if one does not see the expected rate of convergence in a Newton–CG iteration, this could be a factor [128]. One partial remedy is to use a centered-difference Hessian–vector product [162], which reduces the error in B. In exercise 2.7.15 we discuss a more complex and imaginative way to compute accurate Hessians.


2.6 Examples

In this section we used the collection of MATLAB codes but disabled the features (see Chapter 3)

that assist in convergence when far from a minimizer. We took care to make certain that the

initial iterates were near enough to the minimizer so that the observations of local convergence

corresponded to the theory. In practical optimization problems, good initial data is usually not

available and the globally convergent methods discussed in Chapter 3 must be used to start the

iteration.

The plots in this section have the characteristics of local convergence in that both the gradient

norms and function values are decreasing. The reader should contrast this with the examples in

Chapter 3.

2.6.1 Parameter Identification

For this example, M = 100, and the observations are those of the exact solution with c = k = 1, which we computed analytically. We used T = 10 and u0 = 10. We computed the displacement and solved the sensitivity equations with the stiff solver ode15s. These results could be obtained as well in a FORTRAN environment using, for example, the LSODE code [228]. The relative and absolute error tolerances for the integrator were both set to 10⁻⁸. In view of the expected accuracy of the gradient, we set the forward difference increment for the approximate Hessian to h = 10⁻⁴. We terminated the iterations when ‖∇f‖ < 10⁻⁴. Our reasons for this are that, for the zero residual problem considered here, the standard assumptions imply that f(x) = O(‖∇f(x)‖²) for x near the solution. Hence, since we can only resolve f to an accuracy of 10⁻⁸, iteration beyond the point where ‖∇f‖ < 10⁻⁴ cannot be expected to lead to a further decrease in f. In fact we observed this in our computations.

The iterations are very sensitive to the initial iterate. We used x0 = (1.1, 1.05)ᵀ; initial iterates much worse than that caused Newton's method to fail. The more robust methods from Chapter 3 should be viewed as essential components of even a simple optimization code.

In Table 2.1 we tabulate the history of the iteration for both the Newton and Gauss–Newton methods. As expected for a small residual problem, Gauss–Newton performs well and, for this example, even converges in fewer iterations. The real benefit of Gauss–Newton is that computation of the Hessian can be avoided, saving considerable computational work by exploiting the structure of the problem. In the computation reported here, the MATLAB flops command indicates that the Newton iteration took roughly 1.9 million floating point operations and Gauss–Newton roughly 650 thousand. This difference would be much more dramatic if there were more than two parameters or the cost of an evaluation of f depended on N in a significant way (which it does not in this example).

Table 2.1: Parameter identification problem, locally convergent iterations.

                Newton                      Gauss–Newton
  n     ‖∇f(xn)‖     f(xn)          ‖∇f(xn)‖     f(xn)
  0     2.33e+01     7.88e-01       2.33e+01     7.88e-01
  1     6.87e+00     9.90e-02       1.77e+00     6.76e-03
  2     4.59e-01     6.58e-04       1.01e-02     4.57e-07
  3     2.96e-03     3.06e-08       9.84e-07     2.28e-14
  4     2.16e-06     4.15e-14

Figure 2.1 is a graphical representation of the convergence history from Table 2.1. We think

that the plots are a more effective way to understand iteration statistics and will present mostly

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.


graphs for the remainder of the book. The concavity of the plots of the gradient norms is the

signature of superlinear convergence.

[Figure 2.1: Local Optimization for the Parameter ID Problem. Four semilog plots: gradient norm and function value against iterations, for Newton's method (top row) and the Gauss–Newton method (bottom row).]

We next illustrate the difference between Gauss–Newton and Newton on a nonzero residual problem. We use the same example as before with the observations randomly perturbed. We used the MATLAB rand function for this, perturbing the samples of the analytic solution by .5 × rand(M, 1). The least squares residual is about 3.6 and the plots in Figure 2.2 indicate that Newton's method is still converging quadratically, but the rate of Gauss–Newton is linear. The linear convergence of Gauss–Newton can be seen clearly from the linear semilog plot of the gradient norms. Even so, the Gauss–Newton iteration was more efficient, in terms of floating point operations, than Newton's method. The Gauss–Newton iteration took roughly 1 million floating point operations while the Newton iteration took 1.4 million.

2.6.2 Discrete Control Problem

We solve the discrete control problem from §1.6.1 with N = 400, T = 1, y0 = 0,

L(y, u, t) = (y − 3)² + .5u², and φ(y, u, t) = uy + t²

with Newton–CG and two different choices, η = .1, .0001, of the forcing term. The initial iterate was u0 = (10, 10, . . . , 10)^T and the iteration was terminated when ‖∇f‖ < 10⁻⁸.


[Figure 2.2: Local Optimization for the Parameter ID Problem, Nonzero Residual. Gradient norm (semilog) and function value against iterations, for Newton's method and the Gauss–Newton method.]

In Figure 2.3 one can see that the small forcing term produces an iteration history with the concavity of superlinear convergence. The limiting q-linear behavior of an iteration with constant η is not yet visible. The iteration with the larger value of η is in the q-linearly convergent stage, as the linear plot of ‖∇f‖ against the iteration counter shows.

The cost of the computation is not reflected by the number of nonlinear iterations. When η = .0001, the high accuracy of the linear solve is not rewarded. The computation with η = .0001 required 8 nonlinear iterations, a total of 32 CG iterations, roughly 1.25 million floating point operations, and 41 gradient evaluations. The optimization with η = .1 needed 10 nonlinear iterations, a total of 13 CG iterations, roughly 820 thousand floating point operations, and 24 gradient evaluations.

2.7 Exercises on Local Convergence

2.7.1. Apply Newton's method with (a) analytic first and second derivatives, (b) analytic first derivatives and forward difference second derivatives, and (c) forward difference first and second derivatives to find a local minimum of

1. f(x) = sin²(x),

2. f(x) = e^(x²), and


[Figure 2.3: Newton–CG for the Discrete Control Problem: η = .1, .0001. Gradient norm (semilog) and function value against iterations, for each choice of η.]

3. f(x) = x⁴.

Use difference steps of h = 10⁻¹, 10⁻², 10⁻⁴, and 10⁻⁸. Explain your results.
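Part (c) of this exercise can be sketched in a few lines. The Python below is an illustrative implementation (the starting point, iteration count, and loop structure are arbitrary choices, not from the text); it shows how the forward difference increment h affects the computed minimizer of f(x) = sin²(x):

```python
import math

def fd_newton_1d(f, x, h, steps=30):
    """1D Newton iteration with forward difference first and second
    derivatives, both using increment h (exercise 2.7.1, part (c))."""
    for _ in range(steps):
        fp = (f(x + h) - f(x)) / h                       # ~ f'(x)
        fpp = (f(x + 2*h) - 2*f(x + h) + f(x)) / h**2    # ~ f''(x)
        if fpp == 0:                                     # breakdown for tiny h
            break
        x -= fp / fpp
    return x

f = lambda x: math.sin(x) ** 2                           # local minimum at x = 0
for h in (1e-1, 1e-2, 1e-4, 1e-8):
    print(h, fd_newton_1d(f, 0.5, h))
```

For moderate h the computed minimizer is accurate to roughly O(h); once h approaches the square root of machine precision, the second difference loses all accuracy and the iteration behaves erratically, which is the point of the exercise.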

2.7.2. Repeat part (c) of exercise 2.7.1. Experiment with

f(x) = e^(x²) + 10⁻⁴ rand(x)  and  f(x) = x² + 10⁻⁴ rand(x),

where rand denotes the random number generator in your computing environment. Explain the differences in the results.

2.7.3. Show that if A is symmetric, p ≠ 0, and p^T Ap = 0, then A is either singular or indefinite.

2.7.4. Show that if b ∈ R^N and the N × N matrix A is symmetric and has a negative eigenvalue, then the quadratic functional

m(x) = x^T Ax + x^T b

does not have a minimizer. Show that if u is an eigenvector corresponding to a negative eigenvalue of the Hessian, then u is a direction of negative curvature.

2.7.5. If N = 1, the local quadratic model could easily be replaced by a local quartic (i.e., fourth-degree) model (what would be wrong with a cubic model?). If a method is based on minimization of the local quartic model, what kind of local convergence would you expect? How would you extend this method to the case N > 1? Look at [30] for some results on this.

2.7.6. Show that if the standard assumptions hold, h is sufficiently small, and x is sufficiently near x*, the difference Hessian defined by (2.19), ∇²_h f(x), is spd.

2.7.7. Write a locally convergent Newton method code based on accurate function and gradient information and forward difference Hessians using (2.19). Be sure that your code tests for positivity of the Hessian so that you can avoid convergence to a local maximum. Is the test for positivity expensive? Apply your code to the parameter ID problem from §1.6.2. If you use an ODE solver that lets you control the accuracy of the integration, try values of the accuracy from 10⁻⁸ to 10⁻² and see how the iteration performs. Be sure that your difference Hessian reflects the accuracy in the gradient.

2.7.8. Let the standard assumptions hold and let λs > 0 be the smallest eigenvalue of ∇²f(x*). Give the best (i.e., largest) bound you can for ρ such that ∇²f(x) is positive definite for all x ∈ B(ρ).

2.7.9. Use the definition of A† to prove (2.41).

2.7.10. Fill in the missing details in the proof of Theorem 2.4.2 by showing how the Kantorovich

theorem can be used to prove the existence of ξ(x).

2.7.11. Let f(x) = x² and f̂(x) = sin(100x)/10. Using an initial iterate of x0 = 1, try to find a local minimum of f + f̂ using Newton's method with analytic gradients and Hessians. Repeat the experiment with difference gradients and Hessians (try forward differences with a step size of h = .2).

2.7.12. Solve the parameter ID problem from §2.6 with the observations perturbed randomly (for example, you could use the MATLAB rand function for this). Vary the amplitude of the perturbation and see how the performance of Newton and Gauss–Newton changes.

2.7.13. Derive sensitivity equations for the entries of the Hessian for the parameter ID objective

function. In general, if there are P parameters, how many sensitivity equations would

need to be solved for the gradient? How many for the Hessian?

2.7.14. Solve the discrete control problem from §2.6.2 using Newton–CG with forcing terms that depend on n. Consider ηn = .5/n, ηn = min(.1, ‖∇f(un)‖), and some of the choices from [99]. Vary N and the termination criteria and compare the performance with the constant η choice in §2.6.2.

2.7.15. Let F : R^N → R^M (where M and N need not be the same) be sufficiently smooth (how smooth is that?) and be such that F can also be computed for complex arguments. Show that [181], [245]

Im(F(x + ihu))/h = F′(x)u + O(h²),

where Im denotes imaginary part. What happens if there is error in F? How can you use this fact to compute better difference gradients and Hessians?
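The identity in exercise 2.7.15 is the basis of complex-step differentiation: since no subtraction of nearby values occurs, h can be taken far smaller than any real difference increment. A minimal Python sketch (the function F, the point x, the direction u, and the step h are illustrative choices, not from the text):

```python
import cmath
import math

def complex_step(F, x, u, h=1e-20):
    """Directional derivative F'(x)u via Im(F(x + i*h*u))/h.
    No subtractive cancellation occurs, so h may be taken tiny."""
    return (F(x + 1j * h * u)).imag / h

# Illustrative scalar example: F(x) = exp(x) sin(x),
# whose derivative is exp(x)(sin(x) + cos(x)).
F = lambda x: cmath.exp(x) * cmath.sin(x)
x, u = 0.7, 1.0
approx = complex_step(F, x, u)
exact = math.exp(x) * (math.sin(x) + math.cos(x))
print(approx, exact)
```

With h = 10⁻²⁰ the O(h²) truncation error is negligible, so the computed derivative agrees with the exact one to roughly machine precision, in contrast to forward differences, whose accuracy is limited by cancellation.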


Chapter 3

Global Convergence

The locally convergent algorithms discussed in Chapter 2 can and do fail when the initial iterate is not near the root. The reasons for this failure, as we explain below, are that the Newton direction may fail to be a direction of descent for f and that even when a search direction is a direction of decrease of f, as −∇f is, the length of the step can be too long. Hence, taking a Newton (or Gauss–Newton, or inexact Newton) step can lead to an increase in the function and divergence of the iteration (see exercise 3.5.14 for two dramatic examples of this). The globally convergent algorithms developed in this chapter partially address this problem by either finding a local minimum or failing in one of a small number of easily detectable ways.

These are not algorithms for global optimization. When these algorithms are applied to problems with many local minima, the results of the iteration may depend in complex ways on the initial iterate.

3.1 The Method of Steepest Descent

The steepest descent direction from x is d = −∇f(x). The method of steepest descent [52] updates the current iteration xc by the formula

(3.1)   x+ = xc − λ∇f(xc).

If we take the simple choice λ = 1, then x+ is not guaranteed to be nearer a solution than xc, even if xc is very near a solution that satisfies the standard assumptions. The reason for this is that, unlike the Newton direction, the steepest descent direction scales with f. The Newton step, on the other hand, is the same for f as it is for cf for any c ≠ 0, but need not be a direction of decrease for f.

To make the method of steepest descent succeed, it is important to choose the steplength λ. One way to do this, which we analyze in §3.2, is to let λ = β^m, where β ∈ (0, 1) and m ≥ 0 is the smallest nonnegative integer such that there is sufficient decrease in f. In the context of the steepest descent algorithm, this means that

(3.2)   f(xc − λ∇f(xc)) − f(xc) < −αλ‖∇f(xc)‖².

This strategy, introduced in [7] and called the Armijo rule, is an example of a line search in which one searches on a ray from xc in a direction in which f is locally decreasing. In (3.2), α ∈ (0, 1) is a parameter, which we discuss after we fully specify the algorithm. This strategy of repeatedly testing for sufficient decrease and reducing the stepsize if the test is failed is called backtracking for obvious reasons.


The motivation for (3.2) is that if we approximate f by the linear model

mc(x) = f(xc) + ∇f(xc)^T(x − xc),

then the reduction in the model (i.e., the predicted reduction in f) is

pred = mc(xc) − mc(x+) = λ‖∇f(xc)‖².

(3.2) says that the actual reduction in f,

ared = f(xc) − f(x+),

is at least as much as a fraction of the predicted reduction in the linear model. The parameter α is typically set to 10⁻⁴.

The reason we demand sufficient decrease instead of simple decrease (i.e., f(x+) < f(xc) or α = 0) is largely theoretical; a nonzero value of α is required within the proof to ensure that the iteration does not stagnate before convergence.

Algorithm 3.1.1. steep(x, f, kmax)

1. For k = 1, . . . , kmax

   (a) Compute f and ∇f; test for termination.

   (b) Find the least integer m ≥ 0 such that (3.2) holds for λ = β^m.

   (c) x = x + λd.

2. If k = kmax and the termination test is failed, signal failure.

The termination criterion could be based on (2.12), for example.
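Algorithm 3.1.1 can be sketched in a few lines of Python (the gradient routine, tolerance, and the choice β = 1/2 below are illustrative assumptions, not from the text):

```python
import numpy as np

def steep(x, f, grad_f, kmax=200, alpha=1e-4, beta=0.5, tol=1e-6):
    """Sketch of Algorithm 3.1.1 (steep): steepest descent with the
    Armijo rule. Parameter defaults are illustrative choices."""
    for k in range(kmax):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:                  # termination test
            return x
        lam = 1.0
        # Least m >= 0 such that (3.2) holds for lambda = beta**m.
        while f(x - lam * g) - f(x) >= -alpha * lam * (g @ g):
            lam *= beta
        x = x - lam * g                              # x = x + lam*d, d = -grad f
    raise RuntimeError("termination test failed after kmax iterations")

# Small spd quadratic test problem: f(x) = 0.5 x^T A x, minimizer x* = 0.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
xstar = steep(np.array([5.0, -3.0]),
              lambda x: 0.5 * x @ A @ x,
              lambda x: A @ x)
print(np.linalg.norm(xstar))
```

On this well-conditioned quadratic the backtracking loop accepts λ after at most one reduction; on badly conditioned problems the q-linear rate of steepest descent becomes painfully slow, which motivates the more general directions of §3.2.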

3.2 Line Search Methods and the Armijo Rule

We introduce a few new concepts so that our proof of convergence of Algorithm steep will also apply to a significantly more general class of algorithms.

Definition 3.2.1. A vector d ∈ R^N is a descent direction for f at x if

(d/dt) f(x + td) |_{t=0} = ∇f(x)^T d < 0.

Clearly the steepest descent direction d = −∇f(x) is a descent direction. A line search algorithm searches for decrease in f in a descent direction, using the Armijo rule for stepsize control, unless ∇f(x) = 0.

We will consider descent directions based on quadratic models of f of the form

m(x) = f(xc) + ∇f(xc)^T(x − xc) + (1/2)(x − xc)^T Hc(x − xc),

where Hc, which is sometimes called the model Hessian, is spd. We let d = x − xc be such that m(x) is minimized. Hence,

∇m(x) = ∇f(xc) + Hc(x − xc) = 0,

and hence

(3.3)   d = −Hc⁻¹∇f(xc).
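The direction from (3.3) is automatically a descent direction when Hc is spd, since ∇f(xc)^T d = −∇f(xc)^T Hc⁻¹∇f(xc) < 0. A quick numerical check (the random test data below are illustrative, not from the text):

```python
import numpy as np

# With an spd model Hessian Hc, the minimizer of the quadratic model
# gives d = -Hc^{-1} g and g^T d = -g^T Hc^{-1} g < 0.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
Hc = B @ B.T + 3.0 * np.eye(3)     # spd by construction
g = rng.standard_normal(3)         # stands in for grad f(xc)

d = -np.linalg.solve(Hc, g)        # direction from (3.3)

assert np.allclose(Hc @ d + g, 0)  # model gradient vanishes at xc + d
assert g @ d < 0                   # d is a descent direction
print(g @ d)
```

Solving the linear system rather than forming Hc⁻¹ explicitly is the standard practice; the same check fails when Hc is indefinite, which is exactly the Newton-direction difficulty discussed next.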


The steepest descent direction satisfies (3.3) with Hc = I. However, the Newton direction d = −∇²f(x)⁻¹∇f(x) may fail to be a descent direction if x is far from a minimizer because ∇²f may not be spd. Hence, unlike the case for nonlinear equations [154], Newton's method is not a generally good global method, even with a line search, and must be modified (see [113], [117], [231], and [100]) to make sure that the model Hessians are spd.

The algorithm we analyze in this section is an extension of Algorithm steep that allows for descent directions that satisfy (3.3) for spd H. We modify (3.2) to account for H and the new descent direction d = −H⁻¹∇f(x). The general sufficient decrease condition is

(3.4)   f(xc + λd) − f(xc) < αλ∇f(xc)^T d.

Here, as in (3.2), α ∈ (0, 1) is an algorithmic parameter. Typically α = 10⁻⁴.

The stepsize reduction scheme in step 1b of Algorithm steep is crude. If β is too large, too many stepsize reductions may be needed before a step is accepted. If β is too small, the progress of the entire iteration may be retarded. We will address this problem in two ways. In §3.2.1 we will construct polynomial models of f along the descent direction to predict an optimal factor by which to reduce the step. In §3.3.3 we describe a method which remembers the steplength from the previous iteration.

Our proofs require only the following general line search strategy. If a steplength λc has been rejected (i.e., (3.4) fails with λ = λc), construct

(3.5)   λ+ ∈ [βlow λc, βhigh λc],

where 0 < βlow ≤ βhigh < 1. The choice β = βlow = βhigh is the simple rule in Algorithm steep. An exact line search, in which λ is the exact minimum of f(xc + λd), is not only not worth the extra expense but can degrade the performance of the algorithm.

Algorithm 3.2.1. optarm(x, f, kmax)

1. For k = 1, . . . , kmax

   (a) Compute f and ∇f; test for termination.

   (b) Construct an spd matrix H and solve (3.3) to obtain a descent direction d.

   (c) Beginning with λ = 1, repeatedly reduce λ using any strategy that satisfies (3.5) until (3.4) holds.

   (d) x = x + λd.

2. If k = kmax and the termination test is failed, signal failure.
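A Python sketch of Algorithm 3.2.1 is below. The model Hessian here is the true Hessian shifted by a multiple of the identity until it is spd; this is one simple choice, standing in for the more sophisticated modifications cited in the text. The Rosenbrock test function and all parameter defaults are illustrative assumptions.

```python
import numpy as np

def optarm(x, f, grad_f, hess_f, kmax=200, alpha=1e-4, tol=1e-8):
    """Sketch of Algorithm 3.2.1 (optarm): line search along the
    direction (3.3) with an spd model Hessian."""
    n = len(x)
    for k in range(kmax):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:                    # termination test
            return x
        H, mu = hess_f(x), 0.0
        while np.linalg.eigvalsh(H + mu * np.eye(n))[0] <= 0:
            mu = max(2.0 * mu, 1e-4)                   # shift until spd
        d = -np.linalg.solve(H + mu * np.eye(n), g)    # direction from (3.3)
        lam = 1.0
        # Simple rule beta_low = beta_high = 1/2, consistent with (3.5),
        # applied until the sufficient decrease condition (3.4) holds.
        while f(x + lam * d) - f(x) >= alpha * lam * (g @ d):
            lam *= 0.5
        x = x + lam * d
    raise RuntimeError("termination test failed after kmax iterations")

# Rosenbrock function: the spd shift engages where the Hessian is indefinite.
fr = lambda x: 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
gr = lambda x: np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                         200.0 * (x[1] - x[0]**2)])
hr = lambda x: np.array([[1200.0 * x[0]**2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
                         [-400.0 * x[0], 200.0]])
xstar = optarm(np.array([-1.2, 1.0]), fr, gr, hr)
print(xstar)
```

Near the minimizer (1, 1) the shift is never needed, λ = 1 is accepted, and the iteration reduces to Newton's method, recovering the local quadratic convergence of Chapter 2.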

In the remainder of this section we prove that if the sequence of model Hessians remains uniformly bounded and positive definite and the sequence of function values {f(xk)} is bounded from below, then any limit point x* of the sequence {xk} generated by Algorithm optarm satisfies the necessary conditions for optimality. We follow that analysis with a local convergence theory that is much less impressive than that for Newton's method.

We begin our analysis with a simple estimate that follows directly from the spectral theorem for spd matrices.

Lemma 3.2.1. Let H be spd with smallest and largest eigenvalues 0 < λs < λl. Then for all z ∈ R^N,

λl⁻¹ ‖z‖² ≤ z^T H⁻¹ z ≤ λs⁻¹ ‖z‖².

The first step in the analysis is to use Lemma 3.2.1 to obtain a lower bound for the steplength.


Lemma 3.2.2. Assume that ∇f is Lipschitz continuous with Lipschitz constant L. Let α ∈ (0, 1), x ∈ R^N, and H be an spd matrix. Let λs > 0 be the smallest and λl ≥ λs the largest eigenvalues of H. Let d be given by (3.3). Assume that ∇f(x) ≠ 0. Then (3.4) holds for any λ such that

(3.6)   0 < λ ≤ 2λs(1 − α) / (Lκ(H)).

Proof. Let d = −H⁻¹∇f(x). By the fundamental theorem of calculus

f(x + λd) − f(x) = ∫₀¹ ∇f(x + tλd)^T λd dt.

Hence

(3.7)   f(x + λd) = f(x) + λ∇f(x)^T d + λ ∫₀¹ (∇f(x + tλd) − ∇f(x))^T d dt.

Therefore,

f(x + λd) = f(x − λH⁻¹∇f(x)) ≤ f(x) + λ∇f(x)^T d + (λ²L/2)‖d‖².

Positivity of H, Lemma 3.2.1, and κ(H) = λl λs⁻¹ imply that

‖d‖² = ‖H⁻¹∇f(x)‖² ≤ λs⁻² ∇f(x)^T ∇f(x) ≤ −λl λs⁻² ∇f(x)^T d = −κ(H) λs⁻¹ ∇f(x)^T d.

Hence

f(x + λd) ≤ f(x) + (λ − λ²Lλs⁻¹κ(H)/2)∇f(x)^T d,

which implies (3.4) if

α ≤ 1 − λLλs⁻¹κ(H)/2.

This is equivalent to (3.6).
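Lemma 3.2.2 can be checked numerically on a quadratic, for which the Lipschitz constant of the gradient is known exactly. In the sketch below the matrices A and H and the point x are arbitrary illustrative test data, not from the text:

```python
import numpy as np

# Check of Lemma 3.2.2 on f(x) = 0.5 x^T A x with spd A, for which
# grad f(x) = A x is Lipschitz with constant L = lambda_max(A).
rng = np.random.default_rng(1)
n = 4
A = (lambda B: B @ B.T + np.eye(n))(rng.standard_normal((n, n)))  # spd
H = (lambda C: C @ C.T + np.eye(n))(rng.standard_normal((n, n)))  # spd model Hessian

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.linalg.eigvalsh(A)[-1]

alpha = 1e-4
lam_s, lam_l = np.linalg.eigvalsh(H)[[0, -1]]
lam_bound = 2.0 * lam_s * (1.0 - alpha) / (L * (lam_l / lam_s))   # right side of (3.6)

x = rng.standard_normal(n)
d = -np.linalg.solve(H, grad(x))                                  # direction from (3.3)
for lam in (0.999 * lam_bound, 0.5 * lam_bound, 0.1 * lam_bound):
    # sufficient decrease condition (3.4) holds for each such lambda
    assert f(x + lam * d) - f(x) < alpha * lam * (grad(x) @ d)
print("(3.4) verified for tested lambda satisfying (3.6)")
```

As the proof suggests, the bound (3.6) is conservative: on typical data, sufficient decrease holds for steplengths well beyond the guaranteed interval.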

Lemma 3.2.3. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {xk} be the iteration given by Algorithm optarm with spd matrices Hk that satisfy

(3.8)   κ(Hk) ≤ κ̄

for all k. Then the steps

sk = xk+1 − xk = λk dk = −λk Hk⁻¹ ∇f(xk)

satisfy

(3.9)   λk ≥ λ̄ = 2βlow λs(1 − α) / (Lκ̄)

and at most

(3.10)   m = log( 2λs(1 − α) / (Lκ̄) ) / log(βhigh)

stepsize reductions will be required.

Proof. In the context of Algorithm optarm, Lemma 3.2.2 implies that the line search will terminate when

λ ≤ 2λs(1 − α) / (Lκ(Hk)),


if not before. The most that one can overshoot this is by a factor of βlow, which proves (3.9). The line search will require at most m stepsize reductions, where m is the least nonnegative integer such that

2λs(1 − α) / (Lκ(Hk)) > βhigh^m.

This implies (3.10).

The convergence theorem for Algorithm optarm says that if the condition numbers of the matrices Hk and the norms of the iterates remain bounded, then every limit point of the iteration is a stationary point. Boundedness of the sequence of iterates implies that there will be limit points, but there is no guarantee that there is a unique limit point.

Theorem 3.2.4. Let ∇f be Lipschitz continuous with Lipschitz constant L. Assume that the matrices Hk are spd and that there are κ̄ and λ̄l such that κ(Hk) ≤ κ̄ and ‖Hk‖ ≤ λ̄l for all k. Then either f(xk) is unbounded from below or

(3.11)   lim_{k→∞} ∇f(xk) = 0,

and hence any limit point of the sequence of iterates produced by Algorithm optarm is a stationary point.

In particular, if f(xk) is bounded from below and x_{k_l} → x* is any convergent subsequence of {xk}, then ∇f(x*) = 0.

Proof. By construction, f(xk) is a decreasing sequence. Therefore, if f(xk) is bounded from below, then lim_{k→∞} f(xk) = f* exists and

lim_{k→∞} f(xk+1) − f(xk) = 0.