By (3.4) and Lemma 3.2.3 we have

\[ f(x_{k+1}) - f(x_k) < -\alpha\lambda_k \nabla f(x_k)^T H_k \nabla f(x_k) \le -\alpha\bar\lambda\,\lambda_l^{-1}\,\|\nabla f(x_k)\|^2 \le 0. \]

Hence, by (3.12)

\[ \|\nabla f(x_k)\|^2 \le \frac{\lambda_l\,(f(x_k) - f(x_{k+1}))}{\alpha\bar\lambda} \to 0 \]

as k → ∞. This completes the proof.

The analysis of the Armijo rule is valid for other line search methods [84], [125], [272], [273]. The key points are that the sufficient decrease condition can be satisfied in finitely many steps and that the stepsizes are bounded away from zero.
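The backtracking loop behind the Armijo rule can be sketched in a few lines. The following is a minimal illustration of the sufficient decrease test (3.4), not the book's optarm code; the function name `armijo_step` and the fixed reduction factor are our own choices.

```python
def armijo_step(f, grad, x, d, alpha=1e-4, beta=0.5, max_red=30):
    """Backtracking line search: reduce lam by the factor beta until the
    sufficient decrease condition f(x + lam*d) < f(x) + alpha*lam*grad(x)^T d
    holds.  x and d are lists; d must be a descent direction."""
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(grad(x), d))  # grad(x)^T d < 0
    lam = 1.0
    for _ in range(max_red):
        xt = [xi + lam * di for xi, di in zip(x, d)]
        if f(xt) < fx + alpha * lam * slope:
            return lam, xt
        lam *= beta
    raise RuntimeError("line search failed")

# Usage on f(x) = x1^2 + x2^2 from (1, 1) with d = -grad f:
f = lambda x: x[0]**2 + x[1]**2
grad = lambda x: [2 * x[0], 2 * x[1]]
lam, xt = armijo_step(f, grad, [1.0, 1.0], [-2.0, -2.0])
```

Because the stepsize is only ever multiplied by a fixed beta in (0, 1), any accepted step is bounded away from zero in the sense used in the convergence analysis above.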

3.2.1 Stepsize Control with Polynomial Models

Having computed a descent direction d from x_c, one must decide on a stepsize reduction scheme for iterations in which (3.4) fails for λ = 1. A common approach [73], [84], [114], [197], [117] is to model

\[ \xi(\lambda) = f(x_c + \lambda d) \]

by a cubic polynomial. The data on hand initially are

\[ \xi(0) = f(x_c), \quad \xi'(0) = \nabla f(x_c)^T d < 0, \quad \text{and} \quad \xi(1) = f(x_c + d), \]

which is enough to form a quadratic model of ξ. So, if (3.4) does not hold with λ = λ_0 = 1, i.e.,

\[ \xi(1) = f(x_c + d) \ge f(x_c) + \alpha \nabla f(x_c)^T d = \xi(0) + \alpha \xi'(0), \]

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

44 ITERATIVE METHODS FOR OPTIMIZATION

we approximate ξ by the quadratic polynomial

\[ q(\lambda) = \xi(0) + \xi'(0)\lambda + (\xi(1) - \xi(0) - \xi'(0))\lambda^2 \]

and let λ_1 be the minimum of q on the interval [β_low, β_high] ⊂ (0, 1). This minimum can be computed directly since α ∈ (0, 1) and failure of (3.4) imply

\[ q''(\lambda) = 2(\xi(1) - \xi(0) - \xi'(0)) \ge 2(\alpha - 1)\xi'(0) > 0. \]

Therefore, the global minimum of q is

\[ \lambda_t = \frac{-\xi'(0)}{2(\xi(1) - \xi(0) - \xi'(0))}. \]

So

(3.13) \[ \lambda_+ = \begin{cases} \beta_{low}, & \lambda_t \le \beta_{low}, \\ \lambda_t, & \beta_{low} < \lambda_t < \beta_{high}, \\ \beta_{high}, & \lambda_t \ge \beta_{high}. \end{cases} \]

If our first reduced value of λ does not satisfy (3.4), we base additional reductions on the data

\[ \xi(0) = f(x_c), \quad \xi'(0) = \nabla f(x_c)^T d, \quad \xi(\lambda_-), \quad \xi(\lambda_c), \]

where λ_c < λ_- are the most recent values of λ to fail to satisfy (3.4). This is sufficient data to approximate ξ with a cubic polynomial

\[ q(\lambda) = \xi(0) + \xi'(0)\lambda + c_2\lambda^2 + c_3\lambda^3, \]

where c_2 and c_3 can be determined by the equations

\[ q(\lambda_c) = \xi(\lambda_c) = f(x_c + \lambda_c d), \qquad q(\lambda_-) = \xi(\lambda_-) = f(x_c + \lambda_- d), \]

which form the nonsingular linear system for c_2 and c_3

(3.14) \[ \begin{pmatrix} \lambda_c^2 & \lambda_c^3 \\ \lambda_-^2 & \lambda_-^3 \end{pmatrix} \begin{pmatrix} c_2 \\ c_3 \end{pmatrix} = \begin{pmatrix} \xi(\lambda_c) - \xi(0) - \xi'(0)\lambda_c \\ \xi(\lambda_-) - \xi(0) - \xi'(0)\lambda_- \end{pmatrix}. \]

As with the quadratic case, q has a local minimum [84] at

(3.15) \[ \lambda_t = \frac{-c_2 + \sqrt{c_2^2 - 3c_3\,\xi'(0)}}{3c_3}. \]

With λ_t in hand, we compute λ_+ using (3.13). The application of (3.13) is called safeguarding and is important for the theory, as one can see from the proof of Theorem 3.2.4. Safeguarding is also important in practice because, if the cubic model is poor, the unsafeguarded model can make steplength reductions that are so small that the iteration can stagnate or so large (i.e., near 1) that too many reductions are needed before (3.4) holds.
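The quadratic and cubic steplength computations above fit in a few lines. This is a sketch under our own naming conventions; in the cubic case we safeguard into the interval [β_low λ_c, β_high λ_c] relative to the current steplength, a common choice, while (3.13) as written applies to the first reduction from λ_0 = 1.

```python
def quad_step(xi0, xip0, xi1, blow=0.1, bhigh=0.5):
    """First reduction: minimize q(lam) = xi0 + xip0*lam +
    (xi1 - xi0 - xip0)*lam**2 and safeguard the minimizer as in (3.13)."""
    c = xi1 - xi0 - xip0          # > 0 when (3.4) fails and alpha < 1
    lam_t = -xip0 / (2.0 * c)     # global minimizer of q
    return min(max(lam_t, blow), bhigh)

def cubic_step(xi0, xip0, lam_c, xi_c, lam_m, xi_m, blow=0.1, bhigh=0.5):
    """Later reductions: fit q(lam) = xi0 + xip0*lam + c2*lam^2 + c3*lam^3
    through (lam_c, xi_c) and (lam_m, xi_m), solving the 2x2 system (3.14)
    by Cramer's rule, then return the safeguarded local minimizer (3.15)."""
    r1 = xi_c - xi0 - xip0 * lam_c
    r2 = xi_m - xi0 - xip0 * lam_m
    det = lam_c**2 * lam_m**3 - lam_m**2 * lam_c**3
    c2 = (r1 * lam_m**3 - r2 * lam_c**3) / det
    c3 = (lam_c**2 * r2 - lam_m**2 * r1) / det
    lam_t = (-c2 + (c2**2 - 3.0 * c3 * xip0) ** 0.5) / (3.0 * c3)
    # safeguard relative to the current steplength lam_c
    return min(max(lam_t, blow * lam_c), bhigh * lam_c)
```

For example, fitting the data of the cubic xi(lam) = 1 - lam + lam**3 recovers c2 = 0, c3 = 1, so lam_t = sqrt(3)/3, which the safeguard then truncates to 0.5*lam_c.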


GLOBAL CONVERGENCE 45

3.2.2 Slow Convergence of Steepest Descent

Unfortunately, methods based on steepest descent do not enjoy good local convergence properties, even for very simple functions. To illustrate this point we consider the special case of convex quadratic objective functions

\[ f(x) = \frac{1}{2} x^T A x - b^T x + a, \]

where A is spd, b ∈ R^N, and a is a scalar. We will look at a very simple example, using the method of steepest descent with H_k = I (so λ_l = λ_s = 1) and show how the speed of convergence depends on conditioning and scaling.

Lemma 3.2.5. Let f be a convex quadratic and let H_k = I for all k. Then the sequence {x_k} generated by Algorithm optarm converges to the unique minimizer of f.

Proof. In exercise 3.5.4 you are asked to show that f is bounded from below and that ∇f(x) = Ax − b. Hence ∇f(x*) vanishes only at x* = A^{-1}b. Since ∇²f(x) = A is spd, the second-order sufficient conditions hold and x* is the unique minimizer of f.

Theorem 3.2.4 implies that

\[ \lim_{k\to\infty} \nabla f(x_k) = \lim_{k\to\infty} (A x_k - b) = \lim_{k\to\infty} A(x_k - x^*) = 0, \]

and hence x_k → x*.

Since the steepest descent iteration converges to the unique minimizer of a convex quadratic, we can investigate the rate of convergence without concern about the initial iterate. We do this in terms of the A-norm. The problems can be illustrated with the simplest case a = 0 and b = 0.

Proposition 3.2.6. Let f(x) = x^T A x/2 and let {x_k} be given by Algorithm optarm with H_k = I for all k. Then the sequence {x_k} satisfies

(3.16) \[ \|x_{k+1}\|_A = (1 - O(\kappa(A)^{-2}))\,\|x_k\|_A. \]

Proof. The sufficient decrease condition, (3.4), implies that for all k

(3.17) \[ x_{k+1}^T A x_{k+1} - x_k^T A x_k = 2(f(x_{k+1}) - f(x_k)) \le 2\alpha \nabla f(x_k)^T (x_{k+1} - x_k) = 2\alpha\lambda_k \nabla f(x_k)^T d = -2\alpha\lambda_k (A x_k)^T (A x_k). \]

The Lipschitz constant of ∇f is simply λ_l = ‖A‖; hence we may write (3.9) as

(3.18) \[ \lambda_k \ge \bar\lambda = \frac{2\beta(1-\alpha)}{\lambda_l\,\kappa(A)}. \]

In terms of the A-norm, (3.17) can be written as

\[ \|x_{k+1}\|_A^2 - \|x_k\|_A^2 \le -2\alpha\bar\lambda\lambda_s\,\|x_k\|_A^2, \]

where we use the fact that

\[ \|Az\|^2 = (Az)^T(Az) \ge \lambda_s\, z^T A z = \lambda_s\,\|z\|_A^2. \]


Hence,

\[ \|x_{k+1}\|_A^2 \le (1 - 2\alpha\bar\lambda\lambda_s)\,\|x_k\|_A^2 \le (1 - 4\alpha(1-\alpha)\beta\,\kappa(A)^{-2})\,\|x_k\|_A^2. \]

This completes the proof.

Now we consider two specific examples. Let N = 1 and define

\[ f(x) = \omega x^2/2, \]

where

(3.19) \[ \omega < 2(1 - \alpha). \]

In this case x* = 0. We have ∇f(x) = f'(x) = ωx and hence for all x ∈ R,

\[ f(x - \nabla f(x)) - f(x) = \frac{\omega x^2}{2}\,((1-\omega)^2 - 1) = \frac{\omega^2 x^2}{2}\,(\omega - 2) < -\alpha|f'(x)|^2 = -\alpha\omega^2 x^2 \]

because (3.19) implies that

\[ \omega - 2 < -2\alpha. \]

Hence (3.4) holds with d = −∇f(x) and λ = 1 for all x ∈ R. The rate of convergence can be computed directly since

\[ x_+ = (1 - \omega)\,x_c \]

for all x_c. The convergence is q-linear with q-factor 1 − ω. So if ω is very small, the convergence will be extremely slow.

Similarly, if ω is large, we see that

\[ f(x - \lambda\nabla f(x)) - f(x) = \frac{\lambda\omega^2 x^2}{2}\,(\lambda\omega - 2) < -\alpha\lambda\omega^2 x^2 \]

only if

\[ \lambda < \frac{2(1-\alpha)}{\omega}. \]

So

\[ \frac{2\beta(1-\alpha)}{\omega} < \lambda = \beta^m < \frac{2(1-\alpha)}{\omega}. \]

If ω is very large, many steplength reductions will be required with each iteration and the line search will be very inefficient.

These are examples of poor scaling, where a change in f by a multiplicative factor can dramatically improve the efficiency of the line search or the convergence speed. In fact, if ω = 1, steepest descent and Newton's method are the same and only one iteration is required.
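Both failure modes are easy to reproduce numerically. The sketch below runs one Armijo-backtracked steepest descent step on f(x) = ωx²/2; the function name and parameter defaults are our own. For small ω the full step λ = 1 is accepted but the contraction factor 1 − ω is close to 1; for large ω many halvings are needed before λ < 2(1 − α)/ω.

```python
def steepest_descent_1d(omega, x0, alpha=1e-4, beta=0.5, nsteps=1):
    """Steepest descent on f(x) = omega*x**2/2 with Armijo backtracking.
    Returns the final iterate and the total number of stepsize reductions."""
    x, reductions = x0, 0
    for _ in range(nsteps):
        g = omega * x                      # f'(x)
        lam = 1.0
        # reduce lam while sufficient decrease (3.4) with d = -f'(x) fails
        while omega * (x - lam * g) ** 2 / 2 >= omega * x**2 / 2 - alpha * lam * g * g:
            lam *= beta
            reductions += 1
        x = x - lam * g
    return x, reductions
```

With omega = 0.01 the step λ = 1 is accepted immediately and x moves only from 1 to 0.99; with omega = 100 six halvings are needed before the first step is accepted, matching the bound λ < 2(1 − α)/ω ≈ 0.02.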

The case for a general convex quadratic is similar. Let λ_l and λ_s be the largest and smallest eigenvalues of the spd matrix A. We assume that b = 0 and a = 0 for this example. We let u_l and u_s be unit eigenvectors corresponding to the eigenvalues λ_l and λ_s. If

\[ \lambda_s < 2(1 - \alpha) \]

is small and the initial iterate is in the direction of u_s, convergence will require a very large number of iterations (slow). If λ_l is large and the initial iterate is in the direction of u_l, the line search will be inefficient (many stepsize reductions at each iteration).


Newton's method does not suffer from poor scaling of f and converges rapidly with no need for a line search when the initial iterate is near the solution. However, when far away from the solution, the Newton direction may not be a descent direction at all and the line search may fail. Making the transition from steepest descent, which is a good algorithm when far from the solution, to Newton's or some other superlinearly convergent method as the iteration moves toward the solution, is the central problem in the design of line search algorithms. The scaling problems discussed above must also be addressed, even when far from the solution.

3.2.3 Damped Gauss–Newton Iteration

As we showed in §2.4, the steepest descent direction for the overdetermined least squares objective

\[ f(x) = \frac{1}{2} \sum_{i=1}^M r_i(x)^2 = \frac{1}{2} R(x)^T R(x) \]

is

\[ -\nabla f(x) = -R'(x)^T R(x). \]

The steepest descent algorithm could be applied to nonlinear least squares problems with the good global performance and poor local convergence that we expect.

The Gauss–Newton direction at x

\[ d^{GS} = -(R'(x)^T R'(x))^{-1} R'(x)^T R(x) \]

is not defined if R' fails to have full column rank. If R' does have full column rank, then

\[ (d^{GS})^T \nabla f(x) = -(R'(x)^T R(x))^T (R'(x)^T R'(x))^{-1} R'(x)^T R(x) < 0, \]

and the Gauss–Newton direction is a descent direction. The combination of the Armijo rule with the Gauss–Newton direction is called damped Gauss–Newton iteration.

A problem with the damped Gauss–Newton algorithm is that, in order for Theorem 3.2.4 to be applicable, the matrices {R'(x_k)^T R'(x_k)} must not only have full column rank but also must be uniformly bounded and well conditioned, which are very strong assumptions (but if they are satisfied, damped Gauss–Newton is a very effective algorithm).

The Levenberg–Marquardt method [172], [183] addresses these issues by adding a regularization parameter ν > 0 to R'(x_c)^T R'(x_c) to obtain x_+ = x_c + s where

(3.20) \[ s = -(\nu_c I + R'(x_c)^T R'(x_c))^{-1} R'(x_c)^T R(x_c), \]

where I is the N × N identity matrix. The matrix ν_c I + R'(x_c)^T R'(x_c) is positive definite. The parameter ν is called the Levenberg–Marquardt parameter.

It is not necessary to compute R'(x_c)^T R'(x_c) to compute a Levenberg–Marquardt step [76]. One can also solve the full-rank (M + N) × N linear least squares problem

(3.21) \[ \min_s\; \frac{1}{2} \left\| \begin{pmatrix} R'(x_c) \\ \sqrt{\nu_c}\, I \end{pmatrix} s + \begin{pmatrix} R(x_c) \\ 0 \end{pmatrix} \right\|_2^2 \]

to compute s (see exercise 3.5.6). Compare this with computing the undamped Gauss–Newton step by solving (2.33).

If one couples the Levenberg–Marquardt method with the Armijo rule, then Theorem 3.2.4 is applicable far from a minimizer and Theorem 2.4.1 nearby. We ask the reader to provide the details of the proof in exercise 3.5.7.
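As a concrete illustration of (3.20), the sketch below forms the regularized normal equations ν_c I + R'(x_c)^T R'(x_c) for a two-variable problem and solves the 2×2 system by Cramer's rule. The function name, argument layout, and the two-variable restriction are our own; as the text notes, a careful implementation would instead solve the augmented least squares problem (3.21), which avoids squaring the conditioning of R'.

```python
def lm_step(J, R, nu):
    """Levenberg-Marquardt step (3.20) for a 2-variable problem:
    s = -(nu*I + J^T J)^{-1} J^T R.  J is a list of Jacobian rows (M x 2),
    R a list of the M residuals, nu >= 0 the regularization parameter."""
    # form the normal-equations matrix nu*I + J^T J and right-hand side -J^T R
    a11 = nu + sum(row[0] * row[0] for row in J)
    a12 = sum(row[0] * row[1] for row in J)
    a22 = nu + sum(row[1] * row[1] for row in J)
    b1 = -sum(row[0] * ri for row, ri in zip(J, R))
    b2 = -sum(row[1] * ri for row, ri in zip(J, R))
    det = a11 * a22 - a12 * a12   # > 0 when nu > 0: the matrix is spd
    return [(b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det]
```

With nu = 0 this reduces to the undamped Gauss–Newton step; increasing nu shortens the step and rotates it toward the steepest descent direction −J^T R.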


Theorem 3.2.7. Let R' be Lipschitz continuous. Let x_k be the Levenberg–Marquardt–Armijo iterates. Assume that {R'(x_k)} is uniformly bounded and that the sequence of Levenberg–Marquardt parameters {ν_k} is such that

\[ \kappa(\nu_k I + R'(x_k)^T R'(x_k)) \]

is bounded. Then

\[ \lim_{k\to\infty} R'(x_k)^T R(x_k) = 0. \]

Moreover, if x* is any limit point of {x_k} at which R(x*) = 0, Assumption 2.4.1 holds, and ν_k → 0, then x_k → x* q-superlinearly. If, moreover,

\[ \nu_k = O(\|R(x_k)\|) \]

as k → ∞ then the convergence is q-quadratic.

For example, if κ(R'(x_k)^T R'(x_k)) and ‖R'(x_k)‖ are bounded then ν_k = min(1, ‖R(x_k)‖) would satisfy the assumptions of Theorem 3.2.7. For a zero residual problem, this addresses the potential conditioning problems of the damped Gauss–Newton method and still gives quadratic convergence in the terminal phase of the iteration. The Levenberg–Marquardt–Armijo iteration will also converge, albeit slowly, for a large residual problem.

We will not discuss globally convergent methods for underdetermined least squares problems in this book. We refer the reader to [24], [252], and [253] for discussion of underdetermined problems.

3.2.4 Nonlinear Conjugate Gradient Methods

Operationally, the conjugate gradient iteration for a quadratic problem updates the current iteration with a linear combination of the current residual r and a search direction p. The search direction is itself a linear combination of previous residuals. Only r and p need be stored to continue the iteration. The methods discussed in this section seek to extend this idea to more general nonlinear problems.

Nonlinear conjugate gradient algorithms have the significant advantage of low storage over most of the other algorithms covered in this book, the method of steepest descent being the exception. For problems so large that the Newton or quasi-Newton methods cannot be implemented using the available storage, these methods are among the few options (see [177] and [5] for examples).

Linear conjugate gradient seeks to minimize f(x) = x^T H x/2 − x^T b. The residual r = b − Hx is simply −∇f(x), leading to a natural extension to nonlinear problems in which r_0 = p_0 = −∇f(x_0) and, for k ≥ 1,

(3.22) \[ r_k = -\nabla f(x_k) \quad\text{and}\quad p_k = r_k + \beta_k p_{k-1}. \]

The update of x

\[ x_{k+1} = x_k + \alpha_k p_k \]

can be done with a simple analytic minimization in the quadratic case, but a line search will be needed in the nonlinear case. The missing pieces, therefore, are the choice of β_k, the way the line search will be done, and convergence theory. Theory is needed, for example, to decide if p_k is a descent direction for all k.

The general form of the algorithm is very simple. The inputs are an initial iterate, which will be overwritten by the solution, the function to be minimized, and a termination vector τ = (τ_r, τ_a) of relative and absolute residuals.

Algorithm 3.2.2. nlcg(x, f, τ)


1. r_0 = ‖∇f(x)‖; k = 0
2. Do while ‖∇f(x)‖ > τ_r r_0 + τ_a
   (a) If k = 0 then p = −∇f(x) else p = −∇f(x) + βp
   (b) x = x + αp

The two most common choices for β, both of which are equivalent to the linear CG formula in the quadratic case, are the Fletcher–Reeves [106]

(3.23) \[ \beta_k^{FR} = \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2} \]

and Polak–Ribière [215], [216]

(3.24) \[ \beta_k^{PR} = \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k-1}))}{\|\nabla f(x_{k-1})\|^2} \]

formulas. The Fletcher–Reeves method has been observed to take long sequences of very small steps and virtually stagnate [112], [207], [208], [226]. The Polak–Ribière formula performs much better and is more commonly used but has a less satisfactory convergence theory.
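The recursion (3.22) with the Polak–Ribière formula (3.24) can be sketched directly. We test it on a quadratic, where the steplength can be computed by the analytic minimization mentioned above; a general nonlinear code would replace that line with a Wolfe line search. The function name and the pure-list linear algebra are our own choices.

```python
def nlcg_quadratic(H, x, tol=1e-10, max_iter=50):
    """Nonlinear CG with the Polak-Ribiere beta (3.24) applied to
    f(x) = x^T H x / 2.  For this quadratic, alpha is the exact minimizer
    along p; for general f a Wolfe line search would be required."""
    def matvec(A, v):
        return [sum(a * vi for a, vi in zip(row, v)) for row in A]
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    g = matvec(H, x)                       # gradient of x^T H x / 2
    p = [-gi for gi in g]
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 <= tol:
            break
        Hp = matvec(H, p)
        alpha = -dot(g, p) / dot(p, Hp)    # analytic line minimization
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        g_new = matvec(H, x)
        # Polak-Ribiere: g_new^T (g_new - g) / ||g||^2
        beta = dot(g_new, [a - b for a, b in zip(g_new, g)]) / dot(g, g)
        p = [-gi + beta * pi for gi, pi in zip(g_new, p)]
        g = g_new
    return x
```

On a quadratic with exact line search, (3.23) and (3.24) coincide with the linear CG formula, so the iteration terminates in at most N steps up to roundoff.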

The line search has more stringent requirements, at least for proofs of convergence, than are satisfied by the Armijo method that we advocate for steepest descent. We require that the steplength parameter satisfy the Wolfe conditions [272], [273]

(3.25) \[ f(x_k + \alpha_k p_k) \le f(x_k) + \sigma_\alpha \alpha_k \nabla f(x_k)^T p_k \]

and

(3.26) \[ \nabla f(x_k + \alpha_k p_k)^T p_k \ge \sigma_\beta \nabla f(x_k)^T p_k, \]

where 0 < σ_α < σ_β < 1. The first of the Wolfe conditions (3.25) is the sufficient decrease condition, (3.4), that all line search algorithms must satisfy. The second (3.26) is often, but not always, implied by the Armijo backtracking scheme of alternating a test for sufficient decrease and reduction of the steplength. One can design a line search method that will, under modest assumptions, find a steplength satisfying the Wolfe conditions [104], [171], [193].

The convergence result [3] for the Fletcher–Reeves formula requires a bit more. The proof that p_k is a descent direction requires the strong Wolfe conditions, which replace (3.26) by

(3.27) \[ |\nabla f(x_k + \alpha_k p_k)^T p_k| \le -\sigma_\beta \nabla f(x_k)^T p_k \]

and demand that 0 < σ_α < σ_β < 1/2. The algorithm from [193], for example, will find a point satisfying the strong Wolfe conditions.

Theorem 3.2.8. Assume that the set

\[ N = \{ x \mid f(x) \le f(x_0) \} \]

is bounded and that f is Lipschitz continuously differentiable in a neighborhood of N. Let Algorithm nlcg be implemented with the Fletcher–Reeves formula and a line search that satisfies the strong Wolfe conditions. Then

\[ \lim_{k\to\infty} \nabla f(x_k) = 0. \]


This result has been generalized to allow for any choice of β_k such that |β_k| ≤ β_k^{FR} [112]. A similar result for the Polak–Ribière method, but with more complex conditions on the line search, has been proved in [134]. This complexity in the line search is probably necessary, as there are examples where reasonable line searches lead to failure in the Polak–Ribière method [222]. One can also prove convergence if β_k^{PR} is replaced by max(β_k^{PR}, 0) [112].

There is continuing research on these methods and we point to [112], [134], [205], and [202] as good sources.

3.3 Trust Region Methods

Trust region methods overcome the problems that line search methods encounter with non-spd approximate Hessians. In particular, a Newton trust region strategy allows the use of complete Hessian information even in regions where the Hessian has negative curvature. The specific trust region methods we will present effect a smooth transition from the steepest descent direction to the Newton direction in a way that gives the global convergence properties of steepest descent and the fast local convergence of Newton's method.

The idea is very simple. We let ∆ be the radius of the ball about x_c in which the quadratic model

\[ m_c(x) = f(x_c) + \nabla f(x_c)^T (x - x_c) + (x - x_c)^T H_c (x - x_c)/2 \]

can be trusted to accurately represent the function. ∆ is called the trust region radius and the ball

\[ T(\Delta) = \{ x \mid \|x - x_c\| \le \Delta \} \]

is called the trust region.

We compute the new point x_+ by (approximately) minimizing m_c over T(∆). The trust region problem for doing that is usually posed in terms of the difference s_t between x_c and the minimizer of m_c in the trust region

(3.28) \[ \min_{\|s\| \le \Delta} m_c(x_c + s). \]

We will refer to either the trial step s_t or the trial solution x_t = x_c + s_t as the solution to the trust region problem.

Having solved the trust region problem, one must decide whether to accept the step and/or to change the trust region radius. The trust region methods that we will discuss in detail approximate the solution of the trust region problem with the minimizer of the quadratic model along a piecewise linear path contained in the trust region. Before discussing these specific methods, we present a special case of a result from [223] on global convergence.

A prototype trust region algorithm, upon which we base the specific instances that follow, is Algorithm 3.3.1.

Algorithm 3.3.1. trbasic(x, f)

1. Initialize the trust region radius ∆.
2. Do until termination criteria are satisfied
   (a) Approximately solve the trust region problem to obtain x_t.
   (b) Test both the trial point and the trust region radius and decide whether or not to accept the step, the trust region radius, or both. At least one of x or ∆ will change in this phase.

Most trust region algorithms differ only in how step 2a in Algorithm trbasic is done. There are also different ways to implement step 2b, but these differ only in minor details and the approach we present next in §3.3.1 is very representative.


3.3.1 Changing the Trust Region and the Step

The trust region radius and the new point are usually tested simultaneously. While a notion of sufficient decrease is important, the test is centered on how well the quadratic model approximates the function inside the trust region. We measure this by comparing the actual reduction in f

\[ ared = f(x_c) - f(x_t) \]

with the predicted reduction, i.e., the decrease in the quadratic model

\[ pred = m_c(x_c) - m_c(x_t) = -\nabla f(x_c)^T s_t - s_t^T H_c s_t/2. \]

pred > 0 for all the trust region algorithms we discuss in this book unless ∇f(x_c) = 0. We will introduce three control parameters

\[ \mu_0 \le \mu_{low} < \mu_{high}, \]

which are used to determine if the trial step should be rejected (ared/pred < µ_0) and/or the trust region radius should be decreased (ared/pred < µ_low), increased (ared/pred > µ_high), or left unchanged. Typical values are .25 for µ_low and .75 for µ_high. Both µ_0 = 10^{-4} and µ_0 = µ_low are used. One can also use the sufficient decrease condition (3.4) to determine if the trial step should be accepted [84].

We will contract and expand the trust region radius by simply multiplying ∆ by constants

\[ 0 < \omega_{down} < 1 < \omega_{up}. \]

Typical values are ω_down = 1/2 and ω_up = 2. There are many other ways to implement a trust region adjustment algorithm that also give global convergence. For example, the relative error |pred − ared|/‖∇f‖ can be used [84] rather than the ratio ared/pred. Finally we limit the number of times the trust region radius can be expanded by requiring

(3.29) \[ \Delta \le C_T \|\nabla f(x_c)\|, \]

for some C_T > 1, which may depend on x_c. This only serves to eliminate the possibility of infinite expansion and is used in the proofs. Many of the dogleg methods which we consider later automatically impose (3.29).

The possibility of expansion is important for efficiency in the case of poor scaling of f. The convergence theory presented here [162] also uses the expansion phase in the proof of convergence, but that is not essential. We will present the algorithm to test the trust region in a manner, somewhat different from much of the literature, that only returns once a new iterate has been accepted.

Algorithm 3.3.2. trtest(x_c, x_t, x_+, f, ∆)

1. z = x_c
2. Do while z = x_c
   (a) ared = f(x_c) − f(x_t), s_t = x_t − x_c, pred = −∇f(x_c)^T s_t − s_t^T H_c s_t/2
   (b) If ared/pred < µ_0 then set z = x_c, ∆ = ω_down ∆, and solve the trust region problem with the new radius to obtain a new trial point. If the trust region radius was just expanded, set z = x_t^{old}.
   (c) If µ_0 ≤ ared/pred < µ_low, then set z = x_t and ∆ = ω_down ∆.
   (d) If µ_low ≤ ared/pred ≤ µ_high, set z = x_t.


   (e) If µ_high < ared/pred and ‖s_t‖ = ∆ ≤ C_T ‖∇f(x_c)‖, then set z = x_c, ∆ = ω_up ∆, and solve the trust region problem with the new radius to obtain a new trial point. Store the old trial point as x_t^{old} in case the expansion fails.

3. x_+ = z.
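The four branches of the test in Algorithm trtest can be summarized as a decision on the ratio ared/pred. The sketch below only classifies the outcome; a full implementation, like trtest itself, would also re-solve the trust region problem after shrinking or expanding ∆ by ω_down or ω_up and would store x_t^{old} in case an expansion fails. The function name and return labels are our own.

```python
def tr_decide(ared, pred, mu0=1e-4, mulow=0.25, muhigh=0.75):
    """Classify a trial step by ared/pred as in Algorithm trtest:
    'reject'        -> step rejected, radius shrunk        (step 2b)
    'accept_shrink' -> step accepted, radius shrunk        (step 2c)
    'accept'        -> step accepted, radius unchanged     (step 2d)
    'try_expand'    -> model very good, expansion attempted (step 2e)"""
    r = ared / pred               # pred > 0 unless grad f(x_c) = 0
    if r < mu0:
        return 'reject'
    if r < mulow:
        return 'accept_shrink'
    if r <= muhigh:
        return 'accept'
    return 'try_expand'
```

The defaults mirror the typical values quoted above (µ_0 = 10⁻⁴, µ_low = .25, µ_high = .75).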

The loop in Algorithm trtest serves the same purpose as the loop in a line search algorithm such as Algorithm steep. One must design the solution to the trust region problem in such a way that the loop will terminate after finitely many iterations; a general way to do that is the subject of the next section.

We incorporate Algorithm trtest into a general trust region algorithm paradigm that we will use for the remainder of this section.

Algorithm 3.3.3. trgen(x, f)

1. Initialize ∆
2. Do forever
   (a) Let x_c = x. Compute ∇f(x_c) and an approximate Hessian H_c.
   (b) Solve the trust region problem to obtain a trial point x_t.
   (c) Call trtest(x_c, x_t, x, f, ∆)

Hessians and gradients are computed only in step 2a of Algorithm trgen.

3.3.2 Global Convergence of Trust Region Algorithms

While one can, in principle, solve the trust region problem exactly (see §3.3.4), it is simpler and more efficient to solve the problem approximately. It is amazing that one need not do a very good job with the trust region problem in order to get global (and even locally superlinear) convergence.

Our demands on the solutions of the trust region problem and our local quadratic models are modest and readily verifiable. The parameter σ in part 1 of Assumption 3.3.1, like the parameter C_T in (3.29), is used in the analysis but plays no role in implementation. In the specific algorithms that we discuss in this book, σ can be computed. Part 2 follows from well-conditioned and bounded model Hessians if Algorithm trtest is used to manage the trust region.

Assumption 3.3.1.

1. There is σ > 0 such that

(3.30) \[ pred = f(x_c) - m_c(x_t) \ge \sigma \|\nabla f(x_c)\| \min(\|s_t\|, \|\nabla f(x_c)\|). \]

2. There is M > 0 such that either ‖s_t‖ ≥ ‖∇f(x_c)‖/M or ‖s_t‖ = ∆_c.

The global convergence theorem based on this assumption should be compared with the similar result on line search methods, Theorem 3.2.4.

Theorem 3.3.1. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be generated by Algorithm trgen and let the solutions for the trust region problems satisfy Assumption 3.3.1. Assume that the matrices {H_k} are bounded. Then either f is unbounded from below, ∇f(x_k) = 0 for some finite k, or

(3.31) \[ \lim_{k\to\infty} \nabla f(x_k) = 0. \]


Proof. Assume that ∇f(x_k) ≠ 0 for all k and that f is bounded from below. We will show that there is M_T ∈ (0, 1] such that once an iterate is taken (i.e., the step is accepted and the trust region radius is no longer a candidate for expansion), then

(3.32) \[ \|s_k\| \ge M_T \|\nabla f(x_k)\|. \]

Assume (3.32) for the present. Since s_k is an acceptable step, Algorithm trtest and part 1 of Assumption 3.3.1 imply that

\[ ared_k \ge \mu_0\, pred_k \ge \mu_0 \sigma \|\nabla f(x_k)\| \min(\|s_k\|, \|\nabla f(x_k)\|). \]

We may then use (3.32) to obtain

(3.33) \[ ared_k \ge \mu_0 \sigma M_T \|\nabla f(x_k)\|^2. \]

Now since f(x_k) is a decreasing sequence and f is bounded from below, lim_{k→∞} ared_k = 0. Hence (3.33) implies (3.31).

It remains to prove (3.32). To begin, note that if ‖s_k‖ < ∆_k then by part 2 of Assumption 3.3.1

\[ \|s_k\| \ge \|\nabla f(x_k)\|/M. \]

Hence, we need only consider the case in which

(3.34) \[ \|s_k\| = \Delta_k \quad\text{and}\quad \|s_k\| < \|\nabla f(x_k)\|, \]

since if (3.34) does not hold then (3.32) holds with M_T = min(1, 1/M).

We will complete the proof by showing that if (3.34) holds and s_k is accepted, then

(3.35) \[ \|s_k\| = \Delta_k \ge \frac{2\sigma \min(1 - \mu_{high},\, (1 - \mu_0)\omega_{up}^{-2})}{M + L}\, \|\nabla f(x_k)\|. \]

This will complete the proof with

\[ M_T = \min\left( 1,\; 1/M,\; \frac{2\sigma \min(1 - \mu_{high},\, (1 - \mu_0)\omega_{up}^{-2})}{M + L} \right). \]

Now increase the constant M > 0 in part 2 of Assumption 3.3.1 if needed so that

(3.36) \[ \|H_k\| \le M \quad\text{for all } k. \]

We prove (3.35) by showing that if (3.34) holds and (3.35) does not hold for a trial step s_t, then the trust region radius will be expanded and the step corresponding to the larger radius will be acceptable. Let s_t be a trial step such that ‖s_t‖ < ‖∇f(x_k)‖ and

(3.37) \[ \|s_t\| = \Delta_t < \frac{2\sigma \min(1 - \mu_{high},\, (1 - \mu_0)\omega_{up}^{-2})}{M + L}\, \|\nabla f(x_k)\|. \]

We use the Lipschitz continuity of ∇f and (3.36) to obtain

\[ ared_t = -\nabla f(x_k)^T s_t - \int_0^1 (\nabla f(x_k + t s_t) - \nabla f(x_k))^T s_t\, dt = pred_t + s_t^T H_k s_t/2 - \int_0^1 (\nabla f(x_k + t s_t) - \nabla f(x_k))^T s_t\, dt \ge pred_t - (M + L)\|s_t\|^2/2. \]


Therefore, using (3.30) from Assumption 3.3.1, we have

(3.38) \[ \frac{ared_t}{pred_t} \ge 1 - \frac{(M + L)\|s_t\|^2}{2\, pred_t} \ge 1 - \frac{(M + L)\|s_t\|^2}{2\sigma \|\nabla f(x_k)\| \min(\|\nabla f(x_k)\|, \|s_t\|)}. \]

Now since ‖s_t‖ < ‖∇f(x_k)‖ by (3.34) we have

\[ \min(\|\nabla f(x_k)\|, \|s_t\|) = \|s_t\| \]

and hence

(3.39) \[ \frac{ared_t}{pred_t} \ge 1 - \frac{(M + L)\|s_t\|}{2\sigma \|\nabla f(x_k)\|} > \mu_{high} \]

by (3.37). Hence, an expansion step will be taken by replacing ∆_t by ∆_t^+ = ω_up ∆_t and s_t by s_t^+, the minimizer of the quadratic model in the new trust region.

Now (3.38) still holds and, after the expansion,

\[ \|s_t^+\| \le \omega_{up} \|s_t\| < \omega_{up} \|\nabla f(x_k)\|. \]

So

\[ \min(\|\nabla f(x_k)\|, \|s_t^+\|) > \|s_t^+\|/\omega_{up}. \]

Hence,

\[ \frac{ared_t^+}{pred_t^+} \ge 1 - \frac{(M + L)\|s_t^+\|^2}{2\sigma \|\nabla f(x_k)\| \min(\|\nabla f(x_k)\|, \|s_t^+\|)} \ge 1 - \frac{(M + L)\omega_{up}\|s_t^+\|}{2\sigma \|\nabla f(x_k)\|} \ge 1 - \frac{(M + L)\omega_{up}^2\|s_t\|}{2\sigma \|\nabla f(x_k)\|} \ge \mu_0 \]

by (3.37). Hence, the expansion will produce an acceptable step. This means that if the final accepted step satisfies (3.34), it must also satisfy (3.35). This completes the proof.

3.3.3 A Unidirectional Trust Region Algorithm

The most direct way to compute a trial point that satisfies Assumption 3.3.1 is to mimic the line search and simply minimize the quadratic model in the steepest descent direction subject to the trust region bound constraints.

In this algorithm, given a current point x_c and trust region radius ∆_c, our trial point is the minimizer of

\[ \psi_c(\lambda) = m_c(x_c - \lambda \nabla f(x_c)) \]

subject to the constraint that

\[ x(\lambda) = x_c - \lambda \nabla f(x_c) \in T(\Delta_c). \]

Clearly the solution is x(λ̂) where

(3.40) \[ \hat\lambda = \begin{cases} \dfrac{\Delta_c}{\|\nabla f(x_c)\|} & \text{if } \nabla f(x_c)^T H_c \nabla f(x_c) \le 0, \\[2ex] \min\left( \dfrac{\|\nabla f(x_c)\|^2}{\nabla f(x_c)^T H_c \nabla f(x_c)},\, \dfrac{\Delta_c}{\|\nabla f(x_c)\|} \right) & \text{if } \nabla f(x_c)^T H_c \nabla f(x_c) > 0. \end{cases} \]


x(λ̂), the minimizer of the quadratic model in the steepest descent direction, subject to the trust region bounds, is called the Cauchy point. We will denote the Cauchy point by x^{CP}.¹

Then with x^{CP} as trial point, one can use Theorem 3.3.1 to derive a global convergence theorem for the unidirectional trust region.

Theorem 3.3.2. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be generated by Algorithm trgen with x_t = x^{CP} and (3.40). Assume that the matrices {H_k} are bounded. Then either f(x_k) is unbounded from below, ∇f(x_k) = 0 for some finite k, or

\[ \lim_{k\to\infty} \nabla f(x_k) = 0. \]

Proof. We show that x_t satisfies part 2 of Assumption 3.3.1. If ‖s_t‖ = ∆_c then the assertion holds trivially. If ‖s_t‖ < ∆_c then, by definition of x^{CP},

\[ s_t = -\frac{\|\nabla f(x_c)\|^2}{\nabla f(x_c)^T H_c \nabla f(x_c)}\, \nabla f(x_c). \]

Hence, if ‖H_c‖ ≤ M,

\[ \|s_t\| \ge \|\nabla f(x_c)\|/M \]

as asserted.

We leave the proof that x_t satisfies part 1 for the reader (exercise 3.5.8).

The assumptions we used are stronger than those in, for example, [104] and [223], where

\[ \liminf_{k\to\infty} \|\nabla f(x_k)\| = 0 \]

rather than ∇f(x_k) → 0 is proved.
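The steplength formula (3.40) is a direct computation. The sketch below evaluates it with plain lists; the function name and argument layout are our own choices.

```python
def cauchy_point(g, H, delta):
    """Cauchy point step (3.40): minimize the quadratic model along -g
    subject to ||lam * g|| <= delta.  g is the gradient at x_c, H the
    model Hessian as a list of rows; returns the step s = -lam_hat * g."""
    gnorm = sum(gi * gi for gi in g) ** 0.5
    Hg = [sum(h * gj for h, gj in zip(row, g)) for row in H]
    curv = sum(gi * hi for gi, hi in zip(g, Hg))     # g^T H g
    lam_hat = delta / gnorm                          # boundary steplength
    if curv > 0:
        # positive curvature: interior minimizer may be shorter
        lam_hat = min(gnorm**2 / curv, lam_hat)
    return [-lam_hat * gi for gi in g]
```

When g^T H g ≤ 0 the model decreases all the way to the boundary, so the step has length exactly delta; otherwise the unconstrained minimizer along −g is taken unless it leaves the trust region.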

3.3.4 The Exact Solution of the Trust Region Problem

The theory of constrained optimization [117], [104] leads to a characterization of the solutions of the trust region problem. In this section we derive that characterization via an elementary argument (see also [84], [242], and [109]). This book focuses on approximate solutions, but the reader should be aware that the exact solution can be computed accurately [192], [243].

Theorem 3.3.3. Let g ∈ R^N and let A be a symmetric N × N matrix. Let

\[ m(s) = g^T s + s^T A s/2. \]

A vector s is a solution to

(3.41) \[ \min_{\|s\| \le \Delta} m(s) \]

if and only if there is ν ≥ 0 such that

\[ (A + \nu I)s = -g \]

and either ν = 0 or ‖s‖ = ∆.

Proof. If ‖s‖ < ∆ then ∇m(s) = g + As = 0, and the conclusion follows with ν = 0. To consider the case where ‖s‖ = ∆, let λ_1 ≤ λ_2 ≤ ··· ≤ λ_N be the eigenvalues of A.

¹In some of the literature, [84], for example, H_c is assumed to be positive definite and the Cauchy point is taken to be the global minimizer of the quadratic model.


Clearly, for any ν,

= g T s + sT As/2

m(s)

= g T s + sT (A + νI)s/2 ’ ν∆2 /2.

Consider the function, de¬ned for ν > ν0 = max(0, ’»1 ),

s(ν) = ’(A + νI)’1 g.

Since

lim s(ν) = 0

ν’∞

and s(ν) is a continuous decreasing function of ν ∈ (ν0 , ∞) we see that if

lim s(ν) > ∆

ν’ν0

then there is a unique ν such that s(ν) = ∆. Since ν ≥ ν0 , A + νI is positive semide¬nite;

therefore, s(ν) is a global minimizer of

g T s + sT (A + νI)s/2.

Hence, we must have

m(s) ≥ m(s(ν))

for all s such that s = ∆. Hence, s(ν) is a solution of (3.41).

The remaining case is

lim s(ν) ¤ ∆.

ν’ν0

This implies that g is orthogonal to the nontrivial space S_0 of eigenvectors corresponding to
−ν_0 (for otherwise the limit would be infinite). If we let s = s_1 + s_2, where s_2 is the projection
of s onto S_0, we have

m(s) = s_1^T g + s_1^T (A + ν_0 I) s_1 / 2 + s_2^T (A + ν_0 I) s_2 / 2 − ν_0 Δ² / 2
     = s_1^T g + s_1^T (A + ν_0 I) s_1 / 2 − ν_0 Δ² / 2.

Hence, m(s) is minimized by setting s_1 equal to the minimum norm solution of (A + ν_0 I)x = −g
(which exists by orthogonality of g to S_0) and letting s_2 be any element of S_0 such that

‖s_2‖² = Δ² − ‖s_1‖².

This completes the proof.
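The proof is constructive in the easy case: one seeks ν ≥ ν_0 with ‖s(ν)‖ = Δ, using the fact that ‖s(ν)‖ is decreasing in ν. The sketch below is illustrative only, assuming we are not in the degenerate case lim_{ν→ν_0} ‖s(ν)‖ ≤ Δ; the function name and tolerances are ours, and robust production solvers [192], [243] use more refined safeguards than plain bisection.

```python
import numpy as np

def exact_trust_region_step(A, g, delta, tol=1e-10, max_iter=200):
    """Solve min_{||s|| <= delta} g^T s + s^T A s / 2 via the
    characterization of Theorem 3.3.3: (A + nu*I)s = -g, nu >= 0,
    and nu = 0 or ||s|| = delta.  The 'hard case' (g orthogonal to
    the eigenspace of the smallest eigenvalue) is not handled."""
    lam, Q = np.linalg.eigh(A)            # A = Q diag(lam) Q^T
    gq = Q.T @ g                          # g in the eigenbasis
    nu0 = max(0.0, -lam[0])

    def s_of(nu):                         # s(nu) = -(A + nu I)^{-1} g
        return -gq / (lam + nu)

    # Interior solution: A positive definite and ||s(0)|| <= delta.
    if lam[0] > 0:
        s = s_of(0.0)
        if np.linalg.norm(s) <= delta:
            return Q @ s, 0.0

    # Otherwise find nu > nu0 with ||s(nu)|| = delta by bisection;
    # ||s(nu)|| is decreasing in nu, so first bracket, then bisect.
    lo, hi = nu0 + 1e-14, nu0 + 1.0
    while np.linalg.norm(s_of(hi)) > delta:
        hi *= 2.0
    for _ in range(max_iter):
        nu = 0.5 * (lo + hi)
        if np.linalg.norm(s_of(nu)) > delta:
            lo = nu
        else:
            hi = nu
        if hi - lo < tol * max(1.0, nu0):
            break
    nu = 0.5 * (lo + hi)
    return Q @ s_of(nu), nu
```

The returned pair (s, ν) satisfies (A + νI)s = −g by construction, so only the boundary condition ‖s‖ = Δ carries the bisection error.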

3.3.5 The Levenberg–Marquardt Parameter

The solution of the trust region problem presented in §3.3.4 suggests that, rather than controlling
Δ, one could set

s_t = −(νI + H_c)^{−1} g,

adjust ν in response to ared/pred instead of Δ, and still maintain global convergence. A natural
application of this idea is control of the Levenberg–Marquardt parameter. This results in a
much simpler algorithm than Levenberg–Marquardt–Armijo in that the stepsize control can be
eliminated. We need only vary the Levenberg–Marquardt parameter as the iteration progresses.


We present the algorithm from [190] to illustrate this point. For an inexact formulation, see

[276].

The Levenberg–Marquardt quadratic model of the least squares objective

f(x) = (1/2) Σ_{i=1}^{M} r_i(x)² = (1/2) R(x)^T R(x)

with parameter ν_c at the point x_c is

(3.42)    m_c(x) = f(x_c) + (x − x_c)^T R′(x_c)^T R(x_c)
                 + (1/2)(x − x_c)^T (R′(x_c)^T R′(x_c) + ν_c I)(x − x_c).

The minimizer of the quadratic model is the trial point

(3.43)    x_t = x_c − (R′(x_c)^T R′(x_c) + ν_c I)^{−1} R′(x_c)^T R(x_c),

the step is s = x_t − x_c, and the predicted reduction is

pred = m(x_c) − m(x_t) = −s^T R′(x_c)^T R(x_c) − (1/2) s^T (R′(x_c)^T R′(x_c) + ν_c I) s
     = −s^T R′(x_c)^T R(x_c) + (1/2) s^T R′(x_c)^T R(x_c) = −(1/2) s^T ∇f(x_c).
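The simplification of pred above can be checked numerically on random data; the matrices below are hypothetical stand-ins for R′(x_c) and R(x_c):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((5, 3))      # stands in for R'(x_c), M = 5, N = 3
r = rng.standard_normal(5)           # stands in for R(x_c)
nu = 0.1

grad = J.T @ r                       # grad f(x_c) = R'(x_c)^T R(x_c)
H = J.T @ J + nu * np.eye(3)         # R'(x_c)^T R'(x_c) + nu_c I
s = -np.linalg.solve(H, grad)        # Levenberg-Marquardt step, eq. (3.43)

pred_direct = -s @ grad - 0.5 * s @ H @ s   # from the definition of m_c
pred_short = -0.5 * s @ grad                # the simplified formula
assert np.isclose(pred_direct, pred_short)
```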

The algorithm we present below follows the trust region paradigm and decides on accepting
the trial point and on adjustments in the Levenberg–Marquardt parameter by examining the
ratio

ared/pred = (f(x_c) − f(x_t)) / (m(x_c) − m(x_t)) = −2 (f(x_c) − f(x_t)) / (s^T ∇f(x_c)).

In addition to the trust region parameters 0 < ω_down < 1 < ω_up and µ_0 ≤ µ_low < µ_high
we require a default value ν_0 of the Levenberg–Marquardt parameter.

The algorithm for testing the trial point differs from Algorithm trtest in that we decrease
(increase) ν rather than increase (decrease) a trust region radius if ared/pred is large (small).
We also attempt to set the Levenberg–Marquardt parameter to zero when possible in order to
recover the Gauss–Newton iteration's fast convergence for small residual problems.

Algorithm 3.3.4. trtestlm(x_c, x_t, x_+, f, ν)

1. z = x_c

2. Do while z = x_c

   (a) ared = f(x_c) − f(x_t), s_t = x_t − x_c, pred = −∇f(x_c)^T s_t / 2.

   (b) If ared/pred < µ_0 then set z = x_c, ν = max(ω_up ν, ν_0), and recompute the trial
       point with the new value of ν.

   (c) If µ_0 ≤ ared/pred < µ_low, then set z = x_t and ν = max(ω_up ν, ν_0).

   (d) If µ_low ≤ ared/pred, then set z = x_t.
       If µ_high < ared/pred, then set ν = ω_down ν.
       If ν < ν_0, then set ν = 0.

3. x_+ = z.


The Levenberg–Marquardt version of Algorithm trgen is simple to describe and implement.

Algorithm 3.3.5. levmar(x, R, kmax)

1. Set ν = ν_0.

2. For k = 1, . . . , kmax

   (a) Let x_c = x.

   (b) Compute R, f, R′, and ∇f; test for termination.

   (c) Compute x_t using (3.43).

   (d) Call trtestlm(x_c, x_t, x, f, ν)
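A minimal sketch combining Algorithms 3.3.4 and 3.3.5, assuming R and its Jacobian R′ are supplied as callables and that R′ has full rank whenever ν is set to zero; the function signature, parameter names, and default values here are ours, not those of [190]:

```python
import numpy as np

def levmar(x, R, Rprime, kmax=100, nu0=1e-3, tol=1e-8,
           mu0=1e-4, mulow=0.25, muhigh=0.75,
           omega_down=0.5, omega_up=2.0):
    """Levenberg-Marquardt with the parameter nu adjusted via
    ared/pred, in the spirit of Algorithms 3.3.4/3.3.5.
    R: x -> residual vector, Rprime: x -> Jacobian of R."""
    nu = nu0
    for _ in range(kmax):
        r, J = R(x), Rprime(x)
        grad = J.T @ r                    # grad f, with f = ||R||^2 / 2
        if np.linalg.norm(grad) <= tol:   # termination test
            break
        f_c = 0.5 * r @ r
        while True:
            H = J.T @ J + nu * np.eye(len(x))
            s = -np.linalg.solve(H, grad)  # trial step, eq. (3.43)
            xt = x + s
            ared = f_c - 0.5 * np.sum(R(xt) ** 2)
            pred = -0.5 * s @ grad         # predicted reduction
            if ared / pred < mu0:          # reject: increase nu, retry
                nu = max(omega_up * nu, nu0)
                continue
            if ared / pred < mulow:        # accept, but increase nu
                nu = max(omega_up * nu, nu0)
            elif ared / pred > muhigh:     # accept, decrease nu
                nu = omega_down * nu
                if nu < nu0:
                    nu = 0.0               # try pure Gauss-Newton
            x = xt
            break
    return x
```

Setting ν to zero when it falls below ν_0 is what recovers the fast local convergence of Gauss–Newton on small residual problems.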

We state a convergence result [190], [276] without proof.

Theorem 3.3.4. Let R be Lipschitz continuously differentiable. Let {x_k} and {ν_k} be the
sequences of iterates and Levenberg–Marquardt parameters generated by Algorithm levmar
with kmax = ∞. Assume that {ν_k} is bounded from above. Then either R′(x_k)^T R(x_k) = 0
for some finite k or

lim_{k→∞} R′(x_k)^T R(x_k) = 0.

Moreover, if x* is a limit point of {x_k} for which R(x*) = 0 and R′(x*) has full rank, then
x_k → x* q-quadratically and ν_k = 0 for k sufficiently large.

3.3.6 Superlinear Convergence: The Dogleg

The convergence of the unidirectional trust region iteration can be as slow as that for steepest
descent. To improve the convergence speed in the terminal phase we must allow for approximations
to the Newton direction. The power of trust region methods is the ease with which
the transition from steepest descent, with its good global properties, to Newton's method can be
managed.

We define the Newton point at x_c as

x_c^N = x_c − H_c^{−1} ∇f(x_c).

If H_c is spd, the Newton point is the global minimizer of the local quadratic model. On the other
hand, if H_c has directions of negative curvature the local quadratic model will not have a finite
minimizer, but the Newton point is still useful. Note that if H = I the Newton point and the
Cauchy point are the same if the Newton point is inside the trust region.

We will restrict our attention to a special class of algorithms that approximate the solution
of the trust region problem by minimizing m_c along a piecewise linear path S ⊂ T(Δ). These
paths are sometimes called doglegs because of the shapes of the early examples [84], [80], [218],
[217], [220]. In the case where ∇²f(x) is spd, one may think of the dogleg path as a piecewise
linear approximation to the path with parametric representation

{x − (λI + ∇²f(x))^{−1} ∇f(x) | 0 ≤ λ}.

This is the path on which the exact solution of the trust region problem lies.

The next step up from the unidirectional path, the classical dogleg path [220], has as many
as three nodes, x_c, x_c^{CP*}, and x_c^N. Here x_c^{CP*} is the global minimizer of the quadratic model in
the steepest descent direction, which will exist if and only if ∇f(x_c)^T H_c ∇f(x_c) > 0. If x_c^{CP*}
exists and

(3.44)    (x_c^N − x_c^{CP*})^T (x_c^{CP*} − x_c) > 0,


we will let x_c^N be the terminal node. If (3.44) holds, as it always will if H_c is spd, then the path
can be parameterized by the distance from x_c and, moreover, m_c decreases along the path. If
(3.44) does not hold, we do not use x_c^N as a node and revert to the unidirectional path in the
steepest descent direction.

Note that (3.44) implies

(3.45)    ∇f(x_c)^T (x_c^N − x_c) < 0.

We can express the conditions for using the three node path rather than the unidirectional path
very simply. If x_c^{CP} is on the boundary of the trust region then we accept x_c^{CP} as the trial point.
If x_c^{CP} = x_c^{CP*} is in the interior of the trust region, then we test (3.44) to decide what to do.

With this in mind our trial point for the classical dogleg algorithm will be

(3.46)    x^D(Δ) =  x_c^{CP},   if ‖x_c − x_c^{CP}‖ = Δ
                                or x_c^{CP*} exists and (3.44) fails,
                    x_c^N,      if ‖x_c − x_c^{CP}‖ < ‖x_c − x_c^N‖ ≤ Δ
                                and (3.44) holds,
                    y^D(Δ),     otherwise.

Here y^D(Δ) is the unique point between x_c^{CP} and x_c^N such that ‖x^D − x_c‖ = Δ.
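The trial point (3.46) can be sketched as follows, working with steps s = x − x_c rather than points; the function name and the fallback choices for the degenerate cases (nonpositive curvature along −∇f, Cauchy point on the boundary) are ours:

```python
import numpy as np

def dogleg_step(grad, H, delta):
    """Sketch of the classical dogleg trial step s = x^D(delta) - x_c,
    given grad = grad f(x_c) and H = H_c.  Reverts to the unidirectional
    steepest descent path when x^{CP*} does not exist or (3.44) fails."""
    curv = grad @ H @ grad
    if curv <= 0:
        # no minimizer along -grad: step to the trust region boundary
        return -(delta / np.linalg.norm(grad)) * grad
    s_cp = -(grad @ grad / curv) * grad      # Cauchy step s^{CP*}
    if np.linalg.norm(s_cp) >= delta:
        # Cauchy point on/outside the boundary: truncate to the boundary
        return -(delta / np.linalg.norm(grad)) * grad
    s_n = -np.linalg.solve(H, grad)          # Newton step s^N
    if (s_n - s_cp) @ s_cp <= 0:             # condition (3.44) fails
        return s_cp
    if np.linalg.norm(s_n) <= delta:
        return s_n                           # Newton point inside
    # otherwise y^D(delta): the point on [s_cp, s_n] with norm delta
    d = s_n - s_cp
    a, b, c = d @ d, 2.0 * s_cp @ d, s_cp @ s_cp - delta ** 2
    lam = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return s_cp + lam * d
```

The quadratic in the last case has exactly one root in [0, 1] because, as Lemma 3.3.5 shows, the distance from x_c increases monotonically along the segment.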

The important properties of dogleg methods are as follows:

• No two points on the path have the same distance from x_c; hence the path may be parameterized
as x(s), where s = ‖x(s) − x_c‖.

• m_c(x(s)) is a strictly decreasing function of s.

This enables us to show that the dogleg approximate solution of the trust region problem satisfies
Assumption 3.3.1 and apply Theorem 3.3.1 to conclude global convergence. Superlinear
convergence will follow if H_k is a sufficiently good approximation to ∇²f(x_k).

Lemma 3.3.5. Let x_c, H_c, and Δ be given. Let H_c be nonsingular,

s^N = −H_c^{−1} ∇f(x_c), and x^N = x_c + s^N.

Assume that ∇f(x_c)^T H_c ∇f(x_c) > 0 and let

s^{CP*} = x^{CP*} − x_c = −(‖∇f(x_c)‖² / ∇f(x_c)^T H_c ∇f(x_c)) ∇f(x_c).

Let S be the piecewise linear path from x_c to x^{CP*} to x^N. Then if

(3.47)    (s^N − s^{CP*})^T s^{CP*} > 0,

for any δ ≤ ‖s^N‖ there is a unique point x(δ) on S such that

‖x(δ) − x_c‖ = δ.

Proof. Clearly the statement of the result holds on the segment of the path from x_c to x^{CP*}.
To prove the result on the segment from x^{CP*} to x^N we must show that

φ(λ) = (1/2) ‖(1 − λ)s^{CP*} + λ s^N‖²


is strictly monotone increasing for λ ∈ (0, 1).

Since (3.47) implies that

‖s^N‖ ‖s^{CP*}‖ ≥ (s^N)^T s^{CP*} > ‖s^{CP*}‖²

and therefore that ‖s^N‖ > ‖s^{CP*}‖, we have

φ′(λ) = (s^N − s^{CP*})^T ((1 − λ)s^{CP*} + λ s^N)
      = −(1 − λ)‖s^{CP*}‖² + (1 − λ)(s^N)^T s^{CP*} + λ‖s^N‖² − λ(s^N)^T s^{CP*}
      > λ(‖s^N‖² − (s^N)^T s^{CP*}) ≥ λ(‖s^N‖ − ‖s^{CP*}‖)‖s^N‖ > 0.

Hence, φ is an increasing function and the proof is complete.

The next stage is to show that the local quadratic model decreases on the dogleg path S.

Lemma 3.3.6. Let the assumptions of Lemma 3.3.5 hold. Then the local quadratic model

m_c(x) = f(x_c) + ∇f(x_c)^T (x − x_c) + (1/2)(x − x_c)^T H_c (x − x_c)

is strictly monotone decreasing on S.

Proof. Since x_c^{CP*} is the minimum of the local quadratic model in the steepest descent
direction, we need only show that m_c is strictly decreasing on the segment of the path between
x^{CP*} and x^N. Set

ψ(λ) = m_c(x_c + (1 − λ)s^{CP*} + λ s^N)
     = f(x_c) + ∇f(x_c)^T ((1 − λ)s^{CP*} + λ s^N)
       + (1/2)((1 − λ)s^{CP*} + λ s^N)^T H_c ((1 − λ)s^{CP*} + λ s^N).

Noting that H_c s^N = −∇f(x_c) and s^{CP*} = −λ̂ ∇f(x_c), where

λ̂ = ‖∇f(x_c)‖² / ∇f(x_c)^T H_c ∇f(x_c),

we obtain

ψ(λ) = f(x_c) − λ̂(1 − λ)²‖∇f(x_c)‖² + λ(1 − λ/2)∇f(x_c)^T s^N
       + (1/2)(1 − λ)² λ̂² ∇f(x_c)^T H_c ∇f(x_c).

Therefore,

ψ′(λ) = 2λ̂(1 − λ)‖∇f(x_c)‖² + (1 − λ)∇f(x_c)^T s^N − (1 − λ)λ̂² ∇f(x_c)^T H_c ∇f(x_c).

Since

λ̂ ∇f(x_c)^T H_c ∇f(x_c) = ‖∇f(x_c)‖²

we have, using (3.44),

ψ′(λ) = (1 − λ)(λ̂‖∇f(x_c)‖² − ∇f(x_c)^T H_c^{−1} ∇f(x_c))
      = (1 − λ)∇f(x_c)^T (λ̂ ∇f(x_c) − H_c^{−1} ∇f(x_c))
      = ((1 − λ)/λ̂)(x_c − x_c^{CP*})^T (x_c^N − x_c^{CP*}) < 0,


completing the proof.

At this point we have shown that the approximate trust region problem has a unique solution.

To prove global convergence we need only verify that the approximate solution of the trust region
problem x^D satisfies Assumption 3.3.1.

Theorem 3.3.7. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be
generated by Algorithm trgen with the solutions of the trust region problem given by
(3.46). Assume that the matrices {H_k} are bounded. Then either f(x_k) is unbounded from
below, ∇f(x_k) = 0 for some finite k, or

(3.48)    lim_{k→∞} ‖∇f(x_k)‖ = 0.

Proof. We need to check that the solutions of the trust region problem satisfy Assumption
3.3.1. Part 2 of the assumption follows from the definition, (3.46), of x^D and the boundedness
of the approximate Hessians. Let

‖H_k‖ ≤ M

for all k. If ‖s_k‖ < Δ, then (3.46) implies that (3.44) must hold and so x_t = x_k^N is the Newton
point. Hence,

‖s_k‖ = ‖x_k − x_k^N‖ = ‖H_k^{−1} ∇f(x_k)‖ ≥ ‖∇f(x_k)‖/M,

which proves part 2.

Verification of part 1 will complete the proof. There are several cases to consider depending
on how x^D is computed.

If x^D = x^{CP} then either ‖s^{CP}‖ = Δ_c or (3.44) fails. We first consider the case where
∇f(x_c)^T H_c ∇f(x_c) ≤ 0. In that case ‖s^{CP}‖ = Δ_c and λ̂ = Δ_c/‖∇f(x_c)‖. Therefore,

pred = λ̂‖∇f(x_c)‖² − (λ̂²/2) ∇f(x_c)^T H_c ∇f(x_c)
     = Δ_c‖∇f(x_c)‖ − (Δ_c²/2) (∇f(x_c)^T H_c ∇f(x_c))/‖∇f(x_c)‖²
     ≥ Δ_c‖∇f(x_c)‖ = ‖s‖‖∇f(x_c)‖.

Hence (3.30) holds with σ = 1.

Now assume that ∇f(x_c)^T H_c ∇f(x_c) > 0 and ‖s^{CP}‖ = Δ_c. In this case

‖∇f(x_c)‖² / ∇f(x_c)^T H_c ∇f(x_c) ≥ Δ_c/‖∇f(x_c)‖

and so

pred = λ̂‖∇f(x_c)‖² − (λ̂²/2) ∇f(x_c)^T H_c ∇f(x_c)
     = Δ_c‖∇f(x_c)‖ − (Δ_c²/2) (∇f(x_c)^T H_c ∇f(x_c))/‖∇f(x_c)‖²
     ≥ Δ_c‖∇f(x_c)‖/2,

which is (3.30) with σ = 1/2.

If (3.44) fails, ∇f(x_c)^T H_c ∇f(x_c) > 0, and ‖s^{CP}‖ < Δ_c, then

λ̂ = ‖∇f(x_c)‖² / ∇f(x_c)^T H_c ∇f(x_c),


and

pred = λ̂‖∇f(x_c)‖² − (λ̂²/2) ∇f(x_c)^T H_c ∇f(x_c)
     = ‖∇f(x_c)‖⁴ / (2 ∇f(x_c)^T H_c ∇f(x_c)) = (λ̂/2)‖∇f(x_c)‖²
     = ‖s‖‖∇f(x_c)‖/2,

which is (3.30) with σ = 1/2.

The final case is if (3.44) holds and x^D ≠ x^{CP}. In that case the predicted reduction is more
than the Cauchy decrease, i.e., the decrease obtained by taking the Cauchy point, and hence

pred ≥ ‖∇f(x_c)‖⁴ / (2 ∇f(x_c)^T H_c ∇f(x_c))
     ≥ ‖∇f(x_c)‖² / (2M),

which is (3.30) with σ = 1/(2M). This completes the proof.

The last part of the proof of this theorem is very important, asserting that any solution of
the trust region problem for which pred is at least a fixed fraction of Cauchy decrease will give
global convergence. We refer the reader to [232] and [104] for a more general and detailed
treatment using this point of view.

Corollary 3.3.8. Any algorithm for solving the trust region problem that satisfies, for some
τ > 0,

pred ≥ τ (m_c(x_c) − m_c(x_c^{CP}))

satisfies (3.30) for σ = τ/2.

The trust region CG algorithm we present in §3.3.7 can be analyzed with this corollary.
If H_k = ∇²f(x_k) or a sufficiently good approximation, then the classical dogleg will become
Newton's method (or a superlinearly convergent method) as the iterations approach a minimizer
that satisfies the standard assumptions. Hence, the algorithm makes a smooth and automatic
transition into the superlinearly convergent stage.

Theorem 3.3.9. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be
generated by Algorithm trgen with the solutions of the trust region problem given by (3.46).
Assume that H_k = ∇²f(x_k) and that the matrices {H_k} are bounded. Let f be bounded from
below. Let x* be a minimizer of f at which the standard assumptions hold. Then if x* is a limit
point of {x_k}, then x_k → x* and the convergence is locally q-quadratic.

Proof. Since x* is a limit point of {x_k}, there is, for any ρ > 0, a k sufficiently large so that

‖e_k‖ < ρ,  ‖H_k‖ ≤ 2‖∇²f(x*)‖,  ‖H_k^{−1}‖ ≤ 2‖(∇²f(x*))^{−1}‖,

and x_k is near enough for the assumptions of Theorem 2.3.2 to hold. If H_k is spd, so is H_k^{−1},
and for such k, (3.44) holds. Hence, the dogleg path has the nodes x_k, x_k^{CP}, and x_k^N. Moreover,
if ρ is sufficiently small, then

‖H_k^{−1} ∇f(x_k)‖ ≤ 2‖e_k‖ ≤ 2ρ.

We complete the proof by showing that if ρ is sufficiently small, the trust region radius will be
expanded if necessary until the Newton step is in the trust region. Once we do this, the proof is
complete as then the local quadratic convergence of Newton's method will take over.


Now

pred_k ≥ ‖s_k‖‖∇f(x_k)‖/2

by the proof of Theorem 3.3.7. Using H_k = ∇²f(x_k) we have

ared_k = −∇f(x_k)^T s_t^k − ∫₀¹ (∇f(x_k + t s_t^k) − ∇f(x_k))^T s_t^k dt

       = pred_k + (s_t^k)^T ∇²f(x_k) s_t^k / 2 − ∫₀¹ (∇f(x_k + t s_t^k) − ∇f(x_k))^T s_t^k dt

       = pred_k + O(‖s_k‖‖∇f(x_k)‖ρ)

and therefore ared/pred = 1 − O(ρ). Hence, for ρ sufficiently small, the trust region radius will
be increased, if necessary, until the Newton point is inside the trust region and then a Newton
step will be taken. This completes the proof.

The classical dogleg algorithm is implemented in Algorithm ntrust, which uses the trust

radius adjustment scheme from Algorithm trtest. It is to be understood that trtest is

implemented so that xt is given by (3.46) and hence trtest only samples points on the

piecewise linear search path determined by the Cauchy point, the Newton point, and (3.44).

Algorithm 3.3.6. ntrust(x, f, τ)

1. Compute f(x) and ∇f(x)

2. τ = τ_a + τ_r‖∇f(x)‖

3. Do while ‖∇f(x)‖ > τ

   (a) Compute and factor ∇²f(x)

   (b) Compute the Cauchy and Newton points and test (3.44)

   (c) Call trtest(x, x_t, x_+, f, Δ)

   (d) Compute f(x_+) and ∇f(x_+); x = x_+

We implement Algorithm ntrust in the collection of MATLAB codes.

3.3.7 A Trust Region Method for Newton–CG

In this section we present a brief account of an algorithm from [247] (see also [257]) that combines
the trust region paradigm of §3.3.6 with the inexact Newton ideas of §2.5.2. We follow §2.5.2
and denote the preconditioner by M and let C = M^{−1}. We solve the scaled trust region problem

min_{‖d‖_C ≤ Δ} φ(d),

where the quadratic model is still

φ(d) = ∇f(x)^T d + (1/2) d^T ∇²f(x) d.

Here the C-norm is

‖d‖_C = (d^T C d)^{1/2}.
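The C-norm can be evaluated either directly from the quadratic form or via a Cholesky factor of C, since ‖d‖_C = ‖L^T d‖ when C = LL^T. A small numerical check, with a hypothetical spd matrix standing in for C:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
C = B @ B.T + 4.0 * np.eye(4)        # hypothetical spd C = M^{-1}
d = rng.standard_normal(4)

norm_C = np.sqrt(d @ C @ d)          # ||d||_C = (d^T C d)^{1/2}
L = np.linalg.cholesky(C)            # C = L L^T
assert np.isclose(norm_C, np.linalg.norm(L.T @ d))
```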

The algorithmic description of the trust region problem solver from the TR–CG method
given below is from [162]. In [247] the algorithm is expressed in terms of C rather than M.
