If {f(x_k)} is bounded from below, then, since it is a decreasing sequence, it converges and hence

(3.12)    lim_{k→∞} (f(x_k) − f(x_{k+1})) = 0.
By (3.4) and Lemma 3.2.3 we have

f(x_{k+1}) − f(x_k) < −α λ_k ∇f(x_k)^T H_k ∇f(x_k) ≤ −α λ̄ λ_l^{−1} ‖∇f(x_k)‖² ≤ 0.

Hence, by (3.12),

‖∇f(x_k)‖² ≤ λ_l (f(x_k) − f(x_{k+1})) / (α λ̄) → 0

as k → ∞. This completes the proof.
The analysis of the Armijo rule is valid for other line search methods [84], [125], [272], [273]. The key points are that the sufficient decrease condition can be satisfied in finitely many steps and that the stepsizes are bounded away from zero.
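The Armijo rule with simple backtracking can be sketched in a few lines. This is an illustrative sketch, not code from the text; the function names are made up, and the choices α = 10⁻⁴ and β = 1/2 are typical values assumed here.

```python
# Sketch of the Armijo rule: backtrack lam = 1, beta, beta^2, ... until the
# sufficient decrease condition (3.4) holds:
#   f(x + lam*d) <= f(x) + alpha * lam * grad_f(x)^T d.
# Names (armijo, alpha, beta) are illustrative, not from the text.

def armijo(f, grad_f, x, d, alpha=1e-4, beta=0.5, max_reductions=50):
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(grad_f(x), d))  # grad_f(x)^T d < 0
    lam = 1.0
    for _ in range(max_reductions):
        xt = [xi + lam * di for xi, di in zip(x, d)]
        if f(xt) <= fx + alpha * lam * slope:
            return lam, xt
        lam *= beta
    raise RuntimeError("line search failed")

# A steepest descent step on f(x) = x1^2 + 10*x2^2:
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
g = lambda x: [2.0 * x[0], 20.0 * x[1]]
x = [1.0, 1.0]
d = [-gi for gi in g(x)]
lam, x_new = armijo(f, g, x, d)
```

Because the stepsizes produced this way are bounded away from zero (Lemma 3.2.3), the loop terminates in finitely many reductions.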

3.2.1 Stepsize Control with Polynomial Models
Having computed a descent direction d from x_c, one must decide on a stepsize reduction scheme for iterations in which (3.4) fails for λ = 1. A common approach [73], [84], [114], [197], [117] is to model

ξ(λ) = f(x_c + λd)

by a cubic polynomial. The data on hand initially are

ξ(0) = f(x_c), ξ′(0) = ∇f(x_c)^T d < 0, and ξ(1) = f(x_c + d),

which is enough to form a quadratic model of ξ. So, if (3.4) does not hold with λ = λ_0 = 1, i.e.,

ξ(1) = f(x_c + d) ≥ f(x_c) + α ∇f(x_c)^T d = ξ(0) + α ξ′(0),


Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.




44 ITERATIVE METHODS FOR OPTIMIZATION

we approximate ξ by the quadratic polynomial

q(λ) = ξ(0) + ξ′(0)λ + (ξ(1) − ξ(0) − ξ′(0))λ²

and let λ_1 be the minimum of q on the interval [β_low, β_high] ⊂ (0, 1). This minimum can be computed directly since α ∈ (0, 1) and failure of (3.4) imply

q″(λ) = 2(ξ(1) − ξ(0) − ξ′(0)) > 2(α − 1)ξ′(0) > 0.

Therefore, the global minimum of q is

λ_t = −ξ′(0) / (2(ξ(1) − ξ(0) − ξ′(0))).

So

(3.13)    λ_+ = β_low,   if λ_t ≤ β_low,
          λ_+ = λ_t,     if β_low < λ_t < β_high,
          λ_+ = β_high,  if λ_t ≥ β_high.

If our first reduced value of λ does not satisfy (3.4), we base additional reductions on the data

ξ(0) = f(x_c), ξ′(0) = ∇f(x_c)^T d, ξ(λ_−), ξ(λ_c),

where λ_c < λ_− are the most recent values of λ to fail to satisfy (3.4). This is sufficient data to approximate ξ with a cubic polynomial

q(λ) = ξ(0) + ξ′(0)λ + c_2 λ² + c_3 λ³,

where c_2 and c_3 can be determined by the equations

q(λ_c) = ξ(λ_c) = f(x_c + λ_c d),

q(λ_−) = ξ(λ_−) = f(x_c + λ_− d),

which form the nonsingular linear system for c_2 and c_3

(3.14)    [ λ_c²  λ_c³ ] [ c_2 ]   [ ξ(λ_c) − ξ(0) − ξ′(0)λ_c ]
          [ λ_−²  λ_−³ ] [ c_3 ] = [ ξ(λ_−) − ξ(0) − ξ′(0)λ_− ].


As with the quadratic case, q has a local minimum [84] at

(3.15)    λ_t = ( −c_2 + √(c_2² − 3 c_3 ξ′(0)) ) / (3 c_3).

With λ_t in hand, we compute λ_+ using (3.13). The application of (3.13) is called safeguarding and is important for the theory, as one can see from the proof of Theorem 3.2.4. Safeguarding is also important in practice because, if the cubic model is poor, the unsafeguarded model can make steplength reductions that are so small that the iteration can stagnate or so large (i.e., near 1) that too many reductions are needed before (3.4) holds.
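The quadratic step with its safeguard, and the cubic step obtained from the 2×2 system, can be sketched as follows. The function names are illustrative, and the safeguard bracket [β_low, β_high] is hard-coded with made-up values.

```python
def quad_step(xi0, dxi0, xi1, beta_low=0.1, beta_high=0.5):
    """First safeguarded reduction from the quadratic model of
    xi(lam) = f(x_c + lam*d), built from xi(0), xi'(0) < 0, xi(1)."""
    # q(lam) = xi0 + dxi0*lam + (xi1 - xi0 - dxi0)*lam^2; q'' > 0 when (3.4) fails
    c = xi1 - xi0 - dxi0
    lam_t = -dxi0 / (2.0 * c)                    # global minimizer of q
    return min(max(lam_t, beta_low), beta_high)  # safeguard as in (3.13)

def cubic_step(xi0, dxi0, lam_c, xi_c, lam_m, xi_m):
    """Minimizer (3.15) of the cubic fit to xi(0), xi'(0), xi(lam_c),
    xi(lam_m); c2, c3 come from solving the 2x2 system (3.14) by hand."""
    r1 = xi_c - xi0 - dxi0 * lam_c
    r2 = xi_m - xi0 - dxi0 * lam_m
    det = lam_c ** 2 * lam_m ** 3 - lam_m ** 2 * lam_c ** 3
    c2 = (lam_m ** 3 * r1 - lam_c ** 3 * r2) / det
    c3 = (-lam_m ** 2 * r1 + lam_c ** 2 * r2) / det
    return (-c2 + (c2 ** 2 - 3.0 * c3 * dxi0) ** 0.5) / (3.0 * c3)

# Fitting the cubic to exact samples of xi(lam) = lam^3 - lam recovers its
# local minimizer 1/sqrt(3):
lam_cubic = cubic_step(0.0, -1.0, 0.25, 0.25 ** 3 - 0.25, 0.5, 0.5 ** 3 - 0.5)
```

A production implementation would also guard against c_3 ≈ 0 (fall back to the quadratic step) and a negative discriminant; the sketch omits both.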






GLOBAL CONVERGENCE 45

3.2.2 Slow Convergence of Steepest Descent
Unfortunately, methods based on steepest descent do not enjoy good local convergence properties, even for very simple functions. To illustrate this point we consider the special case of convex quadratic objective functions

f(x) = (1/2) x^T A x − b^T x + a,

where A is spd, b ∈ R^N, and a is a scalar. We will look at a very simple example, using the method of steepest descent with H_k = I (so λ_l = λ_s = 1) and show how the speed of convergence depends on conditioning and scaling.
Lemma 3.2.5. Let f be a convex quadratic and let H_k = I for all k. Then the sequence {x_k} generated by Algorithm optarm converges to the unique minimizer of f.

Proof. In exercise 3.5.4 you are asked to show that f is bounded from below and that ∇f(x) = Ax − b. Hence ∇f(x*) vanishes only at x* = A^{−1}b. Since ∇²f(x) = A is spd, the second-order sufficient conditions hold and x* is the unique minimizer of f.

Theorem 3.2.4 implies that

lim_{k→∞} ∇f(x_k) = lim_{k→∞} (A x_k − b) = lim_{k→∞} A(x_k − x*) = 0,

and hence x_k → x*.
Since the steepest descent iteration converges to the unique minimizer of a convex quadratic,
we can investigate the rate of convergence without concern about the initial iterate. We do this
in terms of the A-norm. The problems can be illustrated with the simplest case a = 0 and b = 0.
Proposition 3.2.6. Let f(x) = x^T A x / 2 and let {x_k} be given by Algorithm optarm with H_k = I for all k. Then the sequence {x_k} satisfies

(3.16)    ‖x_{k+1}‖_A = (1 − O(κ(A)^{−2})) ‖x_k‖_A.


Proof. The sufficient decrease condition, (3.4), implies that for all k

(3.17)    x_{k+1}^T A x_{k+1} − x_k^T A x_k = 2(f(x_{k+1}) − f(x_k))
                 ≤ 2α ∇f(x_k)^T (x_{k+1} − x_k)
                 = 2α λ_k ∇f(x_k)^T d = −2α λ_k (A x_k)^T (A x_k).

The Lipschitz constant of ∇f is simply λ_l = ‖A‖; hence we may write (3.9) as

(3.18)    λ_k ≥ λ̄ = 2β(1 − α) / (λ_l κ(A)).

In terms of the A-norm, (3.17) can be written as

‖x_{k+1}‖_A² − ‖x_k‖_A² ≤ −2α λ̄ λ_s ‖x_k‖_A²,

where we use the fact that

‖Az‖² = (Az)^T (Az) ≥ λ_s z^T A z = λ_s ‖z‖_A².





Hence,

‖x_{k+1}‖_A² ≤ (1 − 2α λ̄ λ_s) ‖x_k‖_A² ≤ (1 − 4α(1 − α)β κ(A)^{−2}) ‖x_k‖_A².

This completes the proof.
Now we consider two specific examples. Let N = 1 and define

f(x) = ωx²/2,

where

(3.19)    ω < 2(1 − α).
In this case x* = 0. We have ∇f(x) = f′(x) = ωx and hence for all x ∈ R,

f(x − ∇f(x)) − f(x) = (ωx²/2) ((1 − ω)² − 1)
                    = (ω²x²/2) (ω − 2)
                    < −α|f′(x)|² = −α ω² x²

because (3.19) implies that

ω − 2 < −2α.
Hence (3.4) holds with d = −∇f(x) and λ = 1 for all x ∈ R. The rate of convergence can be computed directly since

x_+ = (1 − ω)x_c

for all x_c. The convergence is q-linear with q-factor 1 − ω. So if ω is very small, the convergence will be extremely slow.
Similarly, if ω is large, we see that

f(x − λ∇f(x)) − f(x) = (λω²x²/2)(λω − 2) < −αλω²x²

only if

λ < 2(1 − α)/ω.

So

2β(1 − α)/ω < β^m = λ < 2(1 − α)/ω.
If ω is very large, many steplength reductions will be required with each iteration and the line search will be very inefficient.

These are examples of poor scaling, where a change in f by a multiplicative factor can dramatically improve the efficiency of the line search or the convergence speed. In fact, if ω = 1, steepest descent and Newton's method are the same and only one iteration is required.
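The effect of ω on the q-factor is easy to see numerically. This small sketch (names made up) iterates the full step x_+ = x_c − f′(x_c) = (1 − ω)x_c for the 1-D example above.

```python
# For f(x) = w*x^2/2 with 0 < w < 2(1 - alpha), the full steepest descent
# step is accepted and gives x_+ = (1 - w)*x_c, so the error contracts by
# exactly the q-factor 1 - w per iteration.

def steepest_descent_1d(w, x0, steps):
    x = x0
    for _ in range(steps):
        x = x - w * x          # x_+ = x_c - f'(x_c) = (1 - w) * x_c
    return x

x_small = steepest_descent_1d(1e-3, 1.0, 100)  # w tiny: q-factor 0.999, crawls
x_good = steepest_descent_1d(1.0, 1.0, 1)      # w = 1: one step, as for Newton
```

After 100 iterations with ω = 10⁻³ the error has only shrunk to (0.999)¹⁰⁰ ≈ 0.90, while ω = 1 solves the problem in a single step.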
The case for a general convex quadratic is similar. Let λ_l and λ_s be the largest and smallest eigenvalues of the spd matrix A. We assume that b = 0 and a = 0 for this example. We let u_l and u_s be unit eigenvectors corresponding to the eigenvalues λ_l and λ_s. If

λ_s < 2(1 − α)

is small and the initial iterate is in the direction of u_s, convergence will require a very large number of iterations (slow). If λ_l is large and the initial iterate is in the direction of u_l, the line search will be inefficient (many stepsize reductions at each iteration).



Newton's method does not suffer from poor scaling of f and converges rapidly with no need for a line search when the initial iterate is near the solution. However, when far away from the solution, the Newton direction may not be a descent direction at all and the line search may fail. Making the transition from steepest descent, which is a good algorithm when far from the solution, to Newton's or some other superlinearly convergent method as the iteration moves toward the solution, is the central problem in the design of line search algorithms. The scaling problems discussed above must also be addressed, even when far from the solution.


3.2.3 Damped Gauss–Newton Iteration
As we showed in §2.4, the steepest descent direction for the overdetermined least squares objective

f(x) = (1/2) Σ_{i=1}^M ‖r_i(x)‖² = (1/2) R(x)^T R(x)

is

−∇f(x) = −R′(x)^T R(x).
The steepest descent algorithm could be applied to nonlinear least squares problems with the good global performance and poor local convergence that we expect.

The Gauss–Newton direction at x

d_GS = −(R′(x)^T R′(x))^{−1} R′(x)^T R(x)

is not defined if R′ fails to have full column rank. If R′ does have full column rank, then

(d_GS)^T ∇f(x) = −(R′(x)^T R(x))^T (R′(x)^T R′(x))^{−1} R′(x)^T R(x) < 0,

and the Gauss–Newton direction is a descent direction. The combination of the Armijo rule with the Gauss–Newton direction is called damped Gauss–Newton iteration.
A problem with the damped Gauss–Newton algorithm is that, in order for Theorem 3.2.4 to be applicable, the matrices {R′(x_k)^T R′(x_k)} must not only have full column rank but also must be uniformly bounded and well conditioned, which are very strong assumptions (but if they are satisfied, damped Gauss–Newton is a very effective algorithm).
The Levenberg–Marquardt method [172], [183] addresses these issues by adding a regularization parameter ν > 0 to R′(x_c)^T R′(x_c) to obtain x_+ = x_c + s where

(3.20)    s = −(ν_c I + R′(x_c)^T R′(x_c))^{−1} R′(x_c)^T R(x_c),

where I is the N × N identity matrix. The matrix ν_c I + R′(x_c)^T R′(x_c) is positive definite. The parameter ν is called the Levenberg–Marquardt parameter.
It is not necessary to compute R′(x_c)^T R′(x_c) to compute a Levenberg–Marquardt step [76]. One can also solve the full-rank (M + N) × N linear least squares problem

(3.21)    min_s (1/2) ‖ [ R′(x_c) ; √ν_c I ] s + [ R(x_c) ; 0 ] ‖²

to compute s (see exercise 3.5.6). Compare this with computing the undamped Gauss–Newton step by solving (2.33).
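The equivalence between the regularized normal equations and the stacked least squares formulation can be checked numerically. This sketch uses NumPy with a random stand-in Jacobian; the √ν_c scaling of the identity block (written `sqrt(nu)` below) is what makes the normal equations of the stacked problem reproduce (3.20).

```python
import numpy as np

# Compute the Levenberg-Marquardt step two ways for made-up data
# J = R'(x_c) (M x N, full column rank) and R = R(x_c).
rng = np.random.default_rng(0)
J = rng.standard_normal((7, 3))
R = rng.standard_normal(7)
nu = 0.5

# Via the regularized normal equations (3.20):
s_ne = -np.linalg.solve(nu * np.eye(3) + J.T @ J, J.T @ R)

# Via the stacked (M + N) x N least squares problem (3.21); its normal
# equations are (J^T J + nu*I) s = -J^T R, the same system as above.
A = np.vstack([J, np.sqrt(nu) * np.eye(3)])
b = -np.concatenate([R, np.zeros(3)])
s_ls = np.linalg.lstsq(A, b, rcond=None)[0]
```

The stacked form avoids explicitly forming R′(x_c)^T R′(x_c), whose condition number is the square of that of R′(x_c).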
If one couples the Levenberg–Marquardt method with the Armijo rule, then Theorem 3.2.4 is applicable far from a minimizer and Theorem 2.4.1 nearby. We ask the reader to provide the details of the proof in exercise 3.5.7.



Theorem 3.2.7. Let R′ be Lipschitz continuous. Let x_k be the Levenberg–Marquardt–Armijo iterates. Assume that ‖R′(x_k)‖ is uniformly bounded and that the sequence of Levenberg–Marquardt parameters {ν_k} is such that

κ(ν_k I + R′(x_k)^T R′(x_k))

is bounded. Then

lim_{k→∞} R′(x_k)^T R(x_k) = 0.

Moreover, if x* is any limit point of {x_k} at which R(x*) = 0, Assumption 2.4.1 holds, and ν_k → 0, then x_k → x* q-superlinearly. If, moreover,

ν_k = O(‖R(x_k)‖)

as k → ∞ then the convergence is q-quadratic.
For example, if κ(R′(x_k)^T R′(x_k)) and ‖R′(x_k)‖ are bounded then ν_k = min(1, ‖R(x_k)‖) would satisfy the assumptions of Theorem 3.2.7. For a zero residual problem, this addresses the potential conditioning problems of the damped Gauss–Newton method and still gives quadratic convergence in the terminal phase of the iteration. The Levenberg–Marquardt–Armijo iteration will also converge, albeit slowly, for a large residual problem.

We will not discuss globally convergent methods for underdetermined least squares problems in this book. We refer the reader to [24], [252], and [253] for discussion of underdetermined problems.

3.2.4 Nonlinear Conjugate Gradient Methods
Operationally, the conjugate gradient iteration for a quadratic problem updates the current iteration with a linear combination of the current residual r and a search direction p. The search direction is itself a linear combination of previous residuals. Only r and p need be stored to continue the iteration. The methods discussed in this section seek to continue this idea to more nonlinear problems.

Nonlinear conjugate gradient algorithms have the significant advantage of low storage over most of the other algorithms covered in this book, the method of steepest descent being the exception. For problems so large that the Newton or quasi-Newton methods cannot be implemented using the available storage, these methods are among the few options (see [177] and [5] for examples).

Linear conjugate gradient seeks to minimize f(x) = x^T H x / 2 − x^T b. The residual r = b − Hx is simply −∇f(x), leading to a natural extension to nonlinear problems in which r_0 = p_0 = −∇f(x_0) and, for k ≥ 1,

(3.22)    r_k = −∇f(x_k) and p_k = r_k + β_k p_{k−1}.

The update of x

x_{k+1} = x_k + α_k p_k

can be done with a simple analytic minimization in the quadratic case, but a line search will be needed in the nonlinear case. The missing pieces, therefore, are the choice of β_k, the way the line search will be done, and convergence theory. Theory is needed, for example, to decide if p_k is a descent direction for all k.

The general form of the algorithm is very simple. The inputs are an initial iterate, which will be overwritten by the solution, the function to be minimized, and a termination vector τ = (τ_r, τ_a) of relative and absolute residuals.
Algorithm 3.2.2. nlcg(x, f, τ)



1. r_0 = ‖∇f(x)‖; k = 0

2. Do while ‖∇f(x)‖ > τ_r r_0 + τ_a

   (a) If k = 0 then p = −∇f(x) else p = −∇f(x) + βp

   (b) x = x + αp

   (c) k = k + 1

The two most common choices for β, both of which are equivalent to the linear CG formula in the quadratic case, are the Fletcher–Reeves [106]

(3.23)    β_k^{FR} = ‖∇f(x_k)‖² / ‖∇f(x_{k−1})‖²

and Polak–Ribière [215], [216]

(3.24)    β_k^{PR} = ∇f(x_k)^T (∇f(x_k) − ∇f(x_{k−1})) / ‖∇f(x_{k−1})‖²

formulas. The Fletcher–Reeves method has been observed to take long sequences of very small steps and virtually stagnate [112], [207], [208], [226]. The Polak–Ribière formula performs much better and is more commonly used but has a less satisfactory convergence theory.
The line search has more stringent requirements, at least for proofs of convergence, than are satisfied by the Armijo method that we advocate for steepest descent. We require that the steplength parameter satisfies the Wolfe conditions [272], [273]

(3.25)    f(x_k + α_k p_k) ≤ f(x_k) + σ_α α_k ∇f(x_k)^T p_k

and

(3.26)    ∇f(x_k + α_k p_k)^T p_k ≥ σ_β ∇f(x_k)^T p_k,

where 0 < σ_α < σ_β < 1. The first of the Wolfe conditions (3.25) is the sufficient decrease condition, (3.4), that all line search algorithms must satisfy. The second (3.26) is often, but not always, implied by the Armijo backtracking scheme of alternating a test for sufficient decrease and reduction of the steplength. One can design a line search method that will, under modest assumptions, find a steplength satisfying the Wolfe conditions [104], [171], [193].
The convergence result [3] for the Fletcher–Reeves formula requires a bit more. The proof that p_k is a descent direction requires the strong Wolfe conditions, which replace (3.26) by

(3.27)    |∇f(x_k + α_k p_k)^T p_k| ≤ −σ_β ∇f(x_k)^T p_k

and demand that 0 < σ_α < σ_β < 1/2. The algorithm from [193], for example, will find a point satisfying the strong Wolfe conditions.
Theorem 3.2.8. Assume that the set

N = {x | f(x) ≤ f(x_0)}

is bounded and that f is Lipschitz continuously differentiable in a neighborhood of N. Let Algorithm nlcg be implemented with the Fletcher–Reeves formula and a line search that satisfies the strong Wolfe conditions. Then

lim_{k→∞} ∇f(x_k) = 0.





This result has been generalized to allow for any choice of β_k such that |β_k| ≤ β_k^{FR} [112]. A similar result for the Polak–Ribière method, but with more complex conditions on the line search, has been proved in [134]. This complexity in the line search is probably necessary, as there are examples where reasonable line searches lead to failure in the Polak–Ribière method [222]. One can also prove convergence if β_k^{PR} is replaced by max(β_k^{PR}, 0) [112]. There is continuing research on these methods and we point to [112], [134], [205], and [202] as good sources.
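A minimal sketch of Algorithm nlcg with the PR+ choice max(β_k^{PR}, 0) is shown below. In place of a strong-Wolfe line search it uses a crude sufficient-decrease backtracking plus a steepest descent restart whenever p fails to be a descent direction; all parameter values and names are made up, and a serious implementation would use a Wolfe-condition search as the text requires.

```python
import numpy as np

def nlcg(f, grad, x, tau_r=1e-6, tau_a=1e-9, max_iter=2000):
    """Nonlinear CG sketch: PR+ beta, sufficient-decrease backtracking,
    restart to steepest descent if p is not a descent direction."""
    g = grad(x)
    r0 = np.linalg.norm(g)
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tau_r * r0 + tau_a:
            break
        if g @ p >= 0.0:                # safeguard: restart with -grad
            p = -g
        a, fx, slope = 1.0, f(x), g @ p
        while f(x + a * p) > fx + 1e-4 * a * slope:   # sufficient decrease
            a *= 0.5
        x = x + a * p
        g_new = grad(x)
        beta = max(g_new @ (g_new - g) / (g @ g), 0.0)  # PR+, cf. (3.24)
        p = -g_new + beta * p
        g = g_new
    return x

# Mildly ill-conditioned convex quadratic test problem:
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])
x0 = np.array([3.0, -2.0])
sol = nlcg(f, grad, x0)
```

Swapping the `beta` line for (3.23) gives the Fletcher–Reeves variant; the restart safeguard is one simple way to keep the iteration well defined when the line search is too crude for the descent theory to apply.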


3.3 Trust Region Methods
Trust region methods overcome the problems that line search methods encounter with non-spd approximate Hessians. In particular, a Newton trust region strategy allows the use of complete Hessian information even in regions where the Hessian has negative curvature. The specific trust region methods we will present effect a smooth transition from the steepest descent direction to the Newton direction in a way that gives the global convergence properties of steepest descent and the fast local convergence of Newton's method.

The idea is very simple. We let ∆ be the radius of the ball about x_c in which the quadratic model

m_c(x) = f(x_c) + ∇f(x_c)^T (x − x_c) + (x − x_c)^T H_c (x − x_c)/2

can be trusted to accurately represent the function. ∆ is called the trust region radius and the ball

T(∆) = {x | ‖x − x_c‖ ≤ ∆}

is called the trust region.
We compute the new point x_+ by (approximately) minimizing m_c over T(∆). The trust region problem for doing that is usually posed in terms of the difference s_t between x_c and the minimizer of m_c in the trust region

(3.28)    min_{‖s‖≤∆} m_c(x_c + s).

We will refer to either the trial step s_t or the trial solution x_t = x_c + s_t as the solution to the trust region problem.
Having solved the trust region problem, one must decide whether to accept the step and/or to change the trust region radius. The trust region methods that we will discuss in detail approximate the solution of the trust region problem with the minimizer of the quadratic model along a piecewise linear path contained in the trust region. Before discussing these specific methods, we present a special case of a result from [223] on global convergence.
A prototype trust region algorithm, upon which we base the specific instances that follow, is Algorithm 3.3.1.

Algorithm 3.3.1. trbasic(x, f)

1. Initialize the trust region radius ∆.

2. Do until termination criteria are satisfied

   (a) Approximately solve the trust region problem to obtain x_t.

   (b) Test both the trial point and the trust region radius and decide whether or not to accept the step, the trust region radius, or both. At least one of x or ∆ will change in this phase.

Most trust region algorithms differ only in how step 2a in Algorithm trbasic is done. There are also different ways to implement step 2b, but these differ only in minor details and the approach we present next in §3.3.1 is very representative.



3.3.1 Changing the Trust Region and the Step
The trust region radius and the new point are usually tested simultaneously. While a notion of sufficient decrease is important, the test is centered on how well the quadratic model approximates the function inside the trust region. We measure this by comparing the actual reduction in f

ared = f(x_c) − f(x_t)

with the predicted reduction, i.e., the decrease in the quadratic model

pred = m_c(x_c) − m_c(x_t) = −∇f(x_c)^T s_t − s_t^T H_c s_t / 2.

pred > 0 for all the trust region algorithms we discuss in this book unless ∇f(x_c) = 0. We will introduce three control parameters

µ_0 ≤ µ_low < µ_high,

which are used to determine if the trial step should be rejected (ared/pred < µ_0) and/or the trust region radius should be decreased (ared/pred < µ_low), increased (ared/pred > µ_high), or left unchanged. Typical values are .25 for µ_low and .75 for µ_high. Both µ_0 = 10^{−4} and µ_0 = µ_low are used. One can also use the sufficient decrease condition (3.4) to determine if the trial step should be accepted [84].
We will contract and expand the trust region radius by simply multiplying ∆ by constants

0 < ω_down < 1 < ω_up.

Typical values are ω_down = 1/2 and ω_up = 2. There are many other ways to implement a trust region adjustment algorithm that also give global convergence. For example, the relative error |pred − ared|/‖∇f‖ can be used [84] rather than the ratio ared/pred. Finally we limit the number of times the trust region radius can be expanded by requiring

(3.29)    ∆ ≤ C_T ‖∇f(x_c)‖,

for some C_T > 1, which may depend on x_c. This only serves to eliminate the possibility of infinite expansion and is used in the proofs. Many of the dogleg methods which we consider later automatically impose (3.29).
The possibility of expansion is important for efficiency in the case of poor scaling of f. The convergence theory presented here [162] also uses the expansion phase in the proof of convergence, but that is not essential. We will present the algorithm to test the trust region in a manner, somewhat different from much of the literature, that only returns once a new iterate has been accepted.
Algorithm 3.3.2. trtest(x_c, x_t, x_+, f, ∆)

1. z = x_c

2. Do while z = x_c

   (a) ared = f(x_c) − f(x_t), s_t = x_t − x_c, pred = −∇f(x_c)^T s_t − s_t^T H_c s_t / 2

   (b) If ared/pred < µ_0 then set z = x_c, ∆ = ω_down ∆, and solve the trust region problem with the new radius to obtain a new trial point. If the trust region radius was just expanded, set z = x_t^{old}.

   (c) If µ_0 ≤ ared/pred < µ_low, then set z = x_t and ∆ = ω_down ∆.

   (d) If µ_low ≤ ared/pred ≤ µ_high, set z = x_t.

   (e) If µ_high < ared/pred and ‖s_t‖ = ∆ ≤ C_T ‖∇f(x_c)‖, then set z = x_c, ∆ = ω_up ∆, and solve the trust region problem with the new radius to obtain a new trial point. Store the old trial point as x_t^{old} in case the expansion fails.

3. x_+ = z.

The loop in Algorithm trtest serves the same purpose as the loop in a line search algorithm such as Algorithm steep. One must design the solution to the trust region problem in such a way that that loop will terminate after finitely many iterations and a general way to do that is the subject of the next section.
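The control flow of Algorithm trtest can be sketched as follows. Here `solve_tr` is a caller-supplied callback that returns an (approximate) trust region minimizer for a given radius; all names, and the simple expansion fallback, are illustrative rather than a literal transcription.

```python
import numpy as np

def trtest(xc, f, grad_xc, Hc, solve_tr, delta,
           mu0=1e-4, mu_low=0.25, mu_high=0.75,
           w_down=0.5, w_up=2.0, CT=2.0):
    """Accept/adjust loop: returns (x_plus, delta) once a step is accepted."""
    xt = solve_tr(delta)
    expanded, x_old = False, None
    while True:
        s = xt - xc
        ared = f(xc) - f(xt)
        pred = -grad_xc @ s - 0.5 * s @ Hc @ s
        ratio = ared / pred
        if ratio < mu0:                          # reject the trial point
            if expanded:                         # failed expansion: fall back
                return x_old, delta / w_up
            delta *= w_down
            xt = solve_tr(delta)
        elif (ratio > mu_high and np.isclose(np.linalg.norm(s), delta)
              and delta <= CT * np.linalg.norm(grad_xc)):   # cf. (3.29)
            x_old, expanded = xt, True           # acceptable; try larger radius
            delta *= w_up
            xt = solve_tr(delta)
        else:
            if ratio < mu_low:
                delta *= w_down                  # accept step, shrink radius
            return xt, delta

# Model problem f(x) = x.x/2 with exact model and a clipped Newton solver:
f = lambda x: 0.5 * x @ x
xc = np.array([1.0, 0.0])
g, H = xc.copy(), np.eye(2)

def solve_tr(delta):
    s = -g if np.linalg.norm(g) <= delta else -delta * g / np.linalg.norm(g)
    return xc + s

x_plus, delta_new = trtest(xc, f, g, H, solve_tr, delta=2.0)
```

For the model problem the Newton point lies inside the trust region, the model is exact (ratio = 1), and the step to the minimizer is accepted without changing the radius.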
We incorporate Algorithm trtest into a general trust region algorithm paradigm that we
will use for the remainder of this section.
Algorithm 3.3.3. trgen(x, f)

1. Initialize ∆

2. Do forever

   (a) Let x_c = x. Compute ∇f(x_c) and an approximate Hessian H_c.

   (b) Solve the trust region problem to obtain a trial point x_t.

   (c) Call trtest(x_c, x_t, x, f, ∆)

Hessians and gradients are computed only in step 2a of Algorithm trgen.

3.3.2 Global Convergence of Trust Region Algorithms
While one can, in principle, solve the trust region problem exactly (see §3.3.4), it is simpler and more efficient to solve the problem approximately. It is amazing that one need not do a very good job with the trust region problem in order to get global (and even locally superlinear) convergence.

Our demands on our solutions of the trust region problem and our local quadratic models are modest and readily verifiable. The parameter σ in part 1 of Assumption 3.3.1, like the parameter C_T in (3.29), is used in the analysis but plays no role in implementation. In the specific algorithms that we discuss in this book, σ can be computed. Part 2 follows from well-conditioned and bounded model Hessians if Algorithm trtest is used to manage the trust region.
Assumption 3.3.1.

1. There is σ > 0 such that

(3.30)    pred = f(x_c) − m_c(x_t) ≥ σ ‖∇f(x_c)‖ min(‖s_t‖, ‖∇f(x_c)‖).

2. There is M > 0 such that either ‖s_t‖ ≥ ‖∇f(x_c)‖/M or ‖s_t‖ = ∆_c.

The global convergence theorem based on this assumption should be compared with the similar result on line search methods, Theorem 3.2.4.

Theorem 3.3.1. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be generated by Algorithm trgen and let the solutions for the trust region problems satisfy Assumption 3.3.1. Assume that the matrices {H_k} are bounded. Then either f is unbounded from below, ∇f(x_k) = 0 for some finite k, or

(3.31)    lim_{k→∞} ∇f(x_k) = 0.






Proof. Assume that ∇f(x_k) ≠ 0 for all k and that f is bounded from below. We will show that there is M_T ∈ (0, 1] such that once an iterate is taken (i.e., the step is accepted and the trust region radius is no longer a candidate for expansion), then

(3.32)    ‖s_k‖ ≥ M_T ‖∇f(x_k)‖.

Assume (3.32) for the present. Since s_k is an acceptable step, Algorithm trtest and part 1 of Assumption 3.3.1 imply that

ared_k ≥ µ_0 pred_k ≥ µ_0 σ ‖∇f(x_k)‖ min(‖s_k‖, ‖∇f(x_k)‖).

We may then use (3.32) to obtain

(3.33)    ared_k ≥ µ_0 σ M_T ‖∇f(x_k)‖².

Now since {f(x_k)} is a decreasing sequence and f is bounded from below, lim_{k→∞} ared_k = 0. Hence (3.33) implies (3.31).
It remains to prove (3.32). To begin, note that if ‖s_k‖ < ∆_k then by part 2 of Assumption 3.3.1

‖s_k‖ ≥ ‖∇f(x_k)‖/M.

Hence, we need only consider the case in which

(3.34)    ‖s_k‖ = ∆_k and ‖s_k‖ < ‖∇f(x_k)‖,

since if (3.34) does not hold then (3.32) holds with M_T = min(1, 1/M).

We will complete the proof by showing that if (3.34) holds and s_k is accepted, then

(3.35)    ‖s_k‖ = ∆_k ≥ (2σ min(1 − µ_high, (1 − µ_0) ω_up^{−2}) / (M + L)) ‖∇f(x_k)‖.

This will complete the proof with

M_T = min( 1, 1/M, 2σ min(1 − µ_high, (1 − µ_0) ω_up^{−2}) / (M + L) ).

Now increase the constant M > 0 in part 2 of Assumption 3.3.1 if needed so that

(3.36)    ‖H_k‖ ≤ M for all k.

We prove (3.35) by showing that if (3.34) holds and (3.35) does not hold for a trial step s_t, then the trust region radius will be expanded and the step corresponding to the larger radius will be acceptable. Let s_t be a trial step such that ‖s_t‖ < ‖∇f(x_k)‖ and

(3.37)    ‖s_t‖ = ∆_t < (2σ min(1 − µ_high, (1 − µ_0) ω_up^{−2}) / (M + L)) ‖∇f(x_k)‖.

We use the Lipschitz continuity of ∇f and (3.36) to obtain

ared_t = −∇f(x_k)^T s_t − ∫_0^1 (∇f(x_k + t s_t) − ∇f(x_k))^T s_t dt

       = pred_t + s_t^T H_k s_t / 2 − ∫_0^1 (∇f(x_k + t s_t) − ∇f(x_k))^T s_t dt

       ≥ pred_t − (M + L) ‖s_t‖² / 2.



Therefore, using (3.30) from Assumption 3.3.1, we have

(3.38)    ared_t / pred_t ≥ 1 − (M + L) ‖s_t‖² / (2 pred_t)
                         ≥ 1 − (M + L) ‖s_t‖² / (2σ ‖∇f(x_k)‖ min(‖∇f(x_k)‖, ‖s_t‖)).

Now since ‖s_t‖ < ‖∇f(x_k)‖ by (3.34) we have

min(‖∇f(x_k)‖, ‖s_t‖) = ‖s_t‖

and hence

(3.39)    ared_t / pred_t ≥ 1 − (M + L) ‖s_t‖ / (2σ ‖∇f(x_k)‖) > µ_high
by (3.37). Hence, an expansion step will be taken by replacing ∆_t by ∆_t^+ = ω_up ∆_t and s_t by s_t^+, the minimizer of the quadratic model in the new trust region.

Now (3.38) still holds and, after the expansion,

‖s_t^+‖ ≤ ω_up ‖s_t‖ < ω_up ‖∇f(x_k)‖.

So

min(‖∇f(x_k)‖, ‖s_t^+‖) > ‖s_t^+‖ / ω_up.

Hence,

ared_t^+ / pred_t^+ ≥ 1 − (M + L) ‖s_t^+‖² / (2σ ‖∇f(x_k)‖ min(‖∇f(x_k)‖, ‖s_t^+‖))

                   ≥ 1 − (M + L) ω_up ‖s_t^+‖ / (2σ ‖∇f(x_k)‖) ≥ 1 − (M + L) ω_up² ‖s_t‖ / (2σ ‖∇f(x_k)‖) ≥ µ_0

by (3.37). Hence, the expansion will produce an acceptable step. This means that if the final accepted step satisfies (3.34), it must also satisfy (3.35). This completes the proof.


3.3.3 A Unidirectional Trust Region Algorithm
The most direct way to compute a trial point that satisfies Assumption 3.3.1 is to mimic the line search and simply minimize the quadratic model in the steepest descent direction subject to the trust region bound constraints.

In this algorithm, given a current point x_c and trust region radius ∆_c, our trial point is the minimizer of

ψ_c(λ) = m_c(x_c − λ∇f(x_c))

subject to the constraint that

x(λ) = x_c − λ∇f(x_c) ∈ T(∆_c).

Clearly the solution is x(λ̂) where

(3.40)    λ̂ = ∆_c / ‖∇f(x_c)‖,  if ∇f(x_c)^T H_c ∇f(x_c) ≤ 0,
          λ̂ = min( ‖∇f(x_c)‖² / (∇f(x_c)^T H_c ∇f(x_c)), ∆_c / ‖∇f(x_c)‖ ),  if ∇f(x_c)^T H_c ∇f(x_c) > 0.

x(λ̂), the minimizer of the quadratic model in the steepest descent direction, subject to the trust region bounds, is called the Cauchy point. We will denote the Cauchy point by x_c^{CP}.¹
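The Cauchy point formula (3.40) translates directly into code. This is a sketch with made-up data; it returns the step s = −λ̂∇f(x_c) rather than the point itself, and it assumes ∇f(x_c) ≠ 0.

```python
import numpy as np

def cauchy_step(g, H, delta):
    """Step to the Cauchy point: minimize the quadratic model along -g
    subject to ||s|| <= delta, per (3.40). H need not be positive definite."""
    curv = g @ H @ g
    if curv <= 0.0:
        lam = delta / np.linalg.norm(g)      # model decreases all the way out
    else:
        lam = min((g @ g) / curv, delta / np.linalg.norm(g))
    return -lam * g

g = np.array([1.0, 0.0])
H = np.diag([2.0, 1.0])
s_int = cauchy_step(g, H, delta=10.0)   # interior case: lam = 1/2
s_bd = cauchy_step(g, H, delta=0.1)     # bound active: ||s|| = delta
```

In the interior case the unconstrained 1-D minimizer ‖g‖²/(g^T H g) wins; when ∆_c is small, the step is clipped to the trust region boundary.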
Then with x_c^{CP} as trial point, one can use Theorem 3.3.1 to derive a global convergence theorem for the unidirectional trust region.
Theorem 3.3.2. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be generated by Algorithm trgen with x_t = x_c^{CP} and (3.40). Assume that the matrices {H_k} are bounded. Then either f(x_k) is unbounded from below, ∇f(x_k) = 0 for some finite k, or

lim_{k→∞} ∇f(x_k) = 0.

Proof. We show that x_t satisfies part 2 of Assumption 3.3.1. If ‖s_t‖ = ∆_c then the assertion holds trivially. If ‖s_t‖ < ∆_c then, by definition of x_c^{CP},

s_t = −(‖∇f(x_c)‖² / (∇f(x_c)^T H_c ∇f(x_c))) ∇f(x_c).

Hence, if ‖H_c‖ ≤ M,

‖s_t‖ ≥ ‖∇f(x_c)‖/M

as asserted.

We leave the proof that x_t satisfies part 1 for the reader (exercise 3.5.8).
The assumptions we used are stronger than those in, for example, [104] and [223], where

lim inf_{k→∞} ‖∇f(x_k)‖ = 0

rather than ∇f(x_k) → 0 is proved.


3.3.4 The Exact Solution of the Trust Region Problem
The theory of constrained optimization [117], [104] leads to a characterization of the solutions of the trust region problem. In this section we derive that characterization via an elementary argument (see also [84], [242], and [109]). This book focuses on approximate solutions, but the reader should be aware that the exact solution can be computed accurately [192], [243].

Theorem 3.3.3. Let g ∈ R^N and let A be a symmetric N × N matrix. Let

m(s) = g^T s + s^T A s / 2.

A vector s is a solution to

(3.41)    min_{‖s‖≤∆} m(s)

if and only if there is ν ≥ 0 such that

(A + νI)s = −g

and either ν = 0 or ‖s‖ = ∆.
Proof. If ‖s‖ < ∆ then ∇m(s) = g + As = 0, and the conclusion follows with ν = 0. To consider the case where ‖s‖ = ∆, let λ_1 ≤ λ_2 ≤ ··· ≤ λ_N be the eigenvalues of A.

¹ In some of the literature, [84], for example, H_c is assumed to be positive definite and the Cauchy point is taken to be the global minimizer of the quadratic model.




Clearly, for any ν,

m(s) = g^T s + s^T A s / 2 = g^T s + s^T (A + νI) s / 2 − ν ∆² / 2.

Consider the function, defined for ν > ν_0 = max(0, −λ_1),

s(ν) = −(A + νI)^{−1} g.

Since

lim_{ν→∞} ‖s(ν)‖ = 0

and ‖s(ν)‖ is a continuous decreasing function of ν ∈ (ν_0, ∞) we see that if

lim_{ν→ν_0} ‖s(ν)‖ > ∆

then there is a unique ν such that ‖s(ν)‖ = ∆. Since ν ≥ ν_0, A + νI is positive semidefinite; therefore, s(ν) is a global minimizer of

g^T s + s^T (A + νI) s / 2.

Hence, we must have

m(s) ≥ m(s(ν))

for all s such that ‖s‖ = ∆. Hence, s(ν) is a solution of (3.41).

The remaining case is

lim_{ν→ν_0} ‖s(ν)‖ ≤ ∆.

This implies that g is orthogonal to the nontrivial space S_0 of eigenfunctions corresponding to
the eigenvalue −ν_0 (for otherwise the limit would be infinite). If we let s = s_1 + s_2, where s_2 is
the projection of s onto S_0, we have

m(s) = s_1^T g + s_1^T (A + ν_0 I) s_1/2 + s_2^T (A + ν_0 I) s_2/2 − ν_0 Δ²/2

     = s_1^T g + s_1^T (A + ν_0 I) s_1/2 − ν_0 Δ²/2.

Hence, m(s) is minimized by setting s_1 equal to the minimum norm solution of (A + ν_0 I)x = −g
(which exists by orthogonality of g to S_0) and letting s_2 be any element of S_0 such that

‖s_2‖² = Δ² − ‖s_1‖².

This completes the proof.
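The ν-characterization above suggests a simple numerical scheme: since ‖s(ν)‖ is decreasing on (ν_0, ∞), the boundary multiplier can be located by bisection. The sketch below (function name and tolerances are ours; robust solvers are described in [192], [243]) handles only the easy case, not the hard case in which g is orthogonal to the eigenspace of λ_1.

```python
import numpy as np

def exact_tr_step(g, A, delta, tol=1e-12, max_iter=200):
    """Solve min_{||s|| <= delta} g^T s + s^T A s / 2 via Theorem 3.3.3:
    (A + nu I)s = -g with nu >= 0 and either nu = 0 or ||s|| = delta.
    Easy case only; the hard case needs the extra work in the proof."""
    n = len(g)
    lam1 = np.linalg.eigvalsh(A)[0]              # smallest eigenvalue of A
    if lam1 > 0:                                 # try the interior solution first
        s = np.linalg.solve(A, -g)
        if np.linalg.norm(s) <= delta:
            return s, 0.0
    # Boundary solution: ||s(nu)|| decreases on (nu0, inf), so bisect.
    nu0 = max(0.0, -lam1)
    lo = nu0 + 1e-12
    hi = nu0 + np.linalg.norm(g) / delta + 1.0   # guarantees ||s(hi)|| < delta
    s_norm = lambda nu: np.linalg.norm(np.linalg.solve(A + nu * np.eye(n), g))
    for _ in range(max_iter):
        nu = 0.5 * (lo + hi)
        if s_norm(nu) > delta:
            lo = nu
        else:
            hi = nu
        if hi - lo < tol * max(1.0, nu0):
            break
    nu = hi
    return np.linalg.solve(A + nu * np.eye(n), -g), nu
```

For A = diag(2, 3), g = (−4, 0)^T, and Δ = 1 this returns ν = 2 and s = (1, 0)^T, since ‖(A + νI)^{−1}g‖ = 4/(2 + ν) = Δ there.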

3.3.5 The Levenberg–Marquardt Parameter
The solution of the trust region problem presented in §3.3.4 suggests that, rather than controlling
Δ, one could set

s_t = −(νI + H_c)^{−1} g,

adjust ν in response to ared/pred instead of Δ, and still maintain global convergence. A natural
application of this idea is control of the Levenberg–Marquardt parameter. This results in a
much simpler algorithm than Levenberg–Marquardt–Armijo in that the stepsize control can be
eliminated. We need only vary the Levenberg–Marquardt parameter as the iteration progresses.






GLOBAL CONVERGENCE 57

We present the algorithm from [190] to illustrate this point. For an inexact formulation, see
[276].
The Levenberg–Marquardt quadratic model of the least squares objective

f(x) = (1/2) Σ_{i=1}^M r_i(x)² = (1/2) R(x)^T R(x)

with parameter ν_c at the point x_c is

(3.42)    m_c(x) = f(x_c) + (x − x_c)^T R′(x_c)^T R(x_c)
                 + (1/2)(x − x_c)^T (R′(x_c)^T R′(x_c) + ν_c I)(x − x_c).

The minimizer of the quadratic model is the trial point

(3.43)    x_t = x_c − (R′(x_c)^T R′(x_c) + ν_c I)^{−1} R′(x_c)^T R(x_c),

the step is s = x_t − x_c, and the predicted reduction is

pred = m(x_c) − m(x_t) = −s^T R′(x_c)^T R(x_c) − (1/2) s^T (R′(x_c)^T R′(x_c) + ν_c I) s

     = −s^T R′(x_c)^T R(x_c) + (1/2) s^T R′(x_c)^T R(x_c) = −(1/2) s^T ∇f(x_c).
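The cancellation that produces pred = −s^T ∇f(x_c)/2 follows because s solves (3.43). It is easy to check numerically; the residual, data, and point below are illustrative choices of ours, not from the text.

```python
import numpy as np

# Illustrative residual for f(x) = R(x)^T R(x) / 2: a two-parameter
# exponential fit (hypothetical model and data).
t_data = np.array([0.0, 1.0, 2.0])
y_data = np.array([1.0, 2.0, 5.0])

def R(x):
    return x[0] * np.exp(x[1] * t_data) - y_data

def Rprime(x):  # Jacobian R'(x)
    return np.column_stack([np.exp(x[1] * t_data),
                            x[0] * t_data * np.exp(x[1] * t_data)])

xc, nu = np.array([1.0, 0.5]), 1e-2
J, r = Rprime(xc), R(xc)
grad = J.T @ r                                 # grad f = R'^T R
B = J.T @ J + nu * np.eye(2)
s = np.linalg.solve(B, -grad)                  # the step from (3.43)
pred_direct = -(s @ grad) - 0.5 * (s @ B @ s)  # m(x_c) - m(x_t) directly
pred_simple = -0.5 * (s @ grad)                # the simplified formula
```

The two values agree to rounding error, and both are positive since B is spd.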

The algorithm we present below follows the trust region paradigm and decides on accepting
the trial point and on adjustments in the Levenberg–Marquardt parameter by examining the
ratio

ared/pred = (f(x_c) − f(x_t)) / (m(x_c) − m(x_t)) = −2 (f(x_c) − f(x_t)) / (s^T ∇f(x_c)).

In addition to the trust region parameters 0 < ω_down < 1 < ω_up and μ_0 ≤ μ_low < μ_high,
we require a default value ν_0 of the Levenberg–Marquardt parameter.
The algorithm for testing the trial point differs from Algorithm trtest in that we decrease
(increase) ν rather than increasing (decreasing) a trust region radius if ared/pred is large (small).
We also attempt to set the Levenberg–Marquardt parameter to zero when possible in order to
recover the fast convergence of the Gauss–Newton iteration for small residual problems.
Algorithm 3.3.4. trtestlm(x_c, x_t, x_+, f, ν)

1. z = x_c

2. Do while z = x_c

   (a) ared = f(x_c) − f(x_t), s_t = x_t − x_c, pred = −∇f(x_c)^T s_t / 2.
   (b) If ared/pred < μ_0, then set z = x_c, ν = max(ω_up ν, ν_0), and recompute the trial
       point with the new value of ν.
   (c) If μ_0 ≤ ared/pred < μ_low, then set z = x_t and ν = max(ω_up ν, ν_0).
   (d) If μ_low ≤ ared/pred, then set z = x_t.
       If μ_high < ared/pred, then set ν = ω_down ν.
       If ν < ν_0, then set ν = 0.

3. x_+ = z.



The Levenberg–Marquardt version of Algorithm trgen is simple to describe and implement.

Algorithm 3.3.5. levmar(x, R, kmax)

1. Set ν = ν_0.
2. For k = 1, . . . , kmax
   (a) Let x_c = x.
   (b) Compute R, f, R′, and ∇f; test for termination.
   (c) Compute x_t using (3.43).
   (d) Call trtestlm(x_c, x_t, x, f, ν).
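A compact sketch of Algorithms 3.3.4/3.3.5 follows. The parameter defaults and stopping tolerance are illustrative assumptions of ours, not values from the text, and the solve assumes R′^T R′ + νI is nonsingular.

```python
import numpy as np

def levmar(x, R, Rprime, kmax=100, nu0=1e-3, tol=1e-8,
           mu0=1e-4, mulow=0.25, muhigh=0.75, wdown=0.5, wup=2.0):
    """Sketch of levmar with the trtestlm acceptance test inlined.
    Parameter names follow the text; defaults are illustrative."""
    nu = nu0
    for _ in range(kmax):
        r, J = R(x), Rprime(x)
        grad = J.T @ r                          # grad f = R'^T R
        if np.linalg.norm(grad) <= tol:
            break
        for _ in range(100):                    # adjust nu until acceptance
            s = np.linalg.solve(J.T @ J + nu * np.eye(len(x)), -grad)  # (3.43)
            xt = x + s
            ratio = (0.5 * (r @ r) - 0.5 * (R(xt) @ R(xt))) / (-0.5 * (s @ grad))
            if ratio < mu0:                     # reject: raise nu, recompute
                nu = max(wup * nu, nu0)
                continue
            if ratio < mulow:                   # accept, but raise nu
                nu = max(wup * nu, nu0)
            elif ratio > muhigh:                # very good step: lower nu
                nu = wdown * nu
                if nu < nu0:
                    nu = 0.0                    # recover Gauss-Newton
            x = xt
            break
    return x
```

On a linear least squares problem the parameter drops to zero and the iteration reduces to Gauss–Newton, which is exact in one step.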

We state a convergence result [190], [276] without proof.

Theorem 3.3.4. Let R be Lipschitz continuously differentiable. Let {x_k} and {ν_k} be the
sequence of iterates and Levenberg–Marquardt parameters generated by Algorithm levmar
with kmax = ∞. Assume that {ν_k} is bounded from above. Then either R′(x_k)^T R(x_k) = 0
for some finite k or

lim_{k→∞} R′(x_k)^T R(x_k) = 0.

Moreover, if x* is a limit point of {x_k} for which R(x*) = 0 and R′(x*) has full rank, then
x_k → x* q-quadratically and ν_k = 0 for k sufficiently large.


3.3.6 Superlinear Convergence: The Dogleg
The convergence of the unidirectional trust region iteration can be as slow as that of steepest
descent. To improve the convergence speed in the terminal phase we must allow for approx-
imations to the Newton direction. The power of trust region methods is the ease with which
the transition from steepest descent, with its good global properties, to Newton's method can be
managed.
We define the Newton point at x_c as

x^N_c = x_c − H_c^{−1} ∇f(x_c).

If H_c is spd, the Newton point is the global minimizer of the local quadratic model. On the other
hand, if H_c has directions of negative curvature the local quadratic model will not have a finite
minimizer, but the Newton point is still useful. Note that if H_c = I the Newton point and the
Cauchy point are the same if the Newton point is inside the trust region.
We will restrict our attention to a special class of algorithms that approximate the solution
of the trust region problem by minimizing m_c along a piecewise linear path S ⊂ T(Δ). These
paths are sometimes called doglegs because of the shapes of the early examples [84], [80], [218],
[217], [220]. In the case where ∇²f(x) is spd, one may think of the dogleg path as a piecewise
linear approximation to the path with parametric representation

{x − (λI + ∇²f(x))^{−1} ∇f(x) | 0 ≤ λ}.

This is the path on which the exact solution of the trust region problem lies.
The next step up from the unidirectional path, the classical dogleg path [220], has as many
as three nodes, x_c, x^{CP*}_c, and x^N_c. Here x^{CP*}_c is the global minimizer of the quadratic model in
the steepest descent direction, which will exist if and only if ∇f(x_c)^T H_c ∇f(x_c) > 0. If x^{CP*}_c
exists and

(3.44)    (x^N_c − x^{CP*}_c)^T (x^{CP*}_c − x_c) > 0,





we will let x^N_c be the terminal node. If (3.44) holds, as it always will if H_c is spd, then the path
can be parameterized by the distance from x_c and, moreover, m_c decreases along the path. If
(3.44) does not hold, we do not use x^N_c as a node and revert to the unidirectional path in the
steepest descent direction.
Note that (3.44) implies

(3.45)    ∇f(x_c)^T (x^N_c − x_c) < 0.

We can express the conditions for using the three node path rather than the unidirectional path
very simply. If x^{CP}_c is on the boundary of the trust region, then we accept x^{CP}_c as the trial point.
If x^{CP}_c = x^{CP*}_c is in the interior of the trust region, then we test (3.44) to decide what to do.
c c
With this in mind, our trial point for the classical dogleg algorithm will be

(3.46)    x^D(Δ) =
            x^{CP}_c   if ‖x_c − x^{CP}_c‖ = Δ, or if x^{CP*}_c exists and (3.44) fails,
            x^N_c      if ‖x_c − x^{CP}_c‖ < ‖x_c − x^N_c‖ ≤ Δ and (3.44) holds,
            y^D(Δ)     otherwise.

Here y^D(Δ) is the unique point between x^{CP}_c and x^N_c such that ‖x^D − x_c‖ = Δ.
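Definition (3.46) translates directly into code. The sketch below (function name ours) works with steps s = x − x_c rather than points, assumes ∇f(x_c) ≠ 0, and assumes H_c is nonsingular when the Newton solve is reached.

```python
import numpy as np

def dogleg_step(g, H, delta):
    """Classical dogleg trial step (3.46), returned as s = x^D - x_c."""
    sd = -(delta / np.linalg.norm(g)) * g   # steepest descent to the boundary
    curv = g @ H @ g
    if curv <= 0:                           # x^{CP*} does not exist
        return sd
    s_cp = -(g @ g / curv) * g              # s^{CP*}, the SD model minimizer
    if np.linalg.norm(s_cp) >= delta:       # Cauchy point on the boundary
        return sd
    s_n = np.linalg.solve(H, -g)            # Newton step
    if (s_n - s_cp) @ s_cp <= 0:            # (3.44) fails: unidirectional path
        return s_cp
    if np.linalg.norm(s_n) <= delta:        # Newton point inside: take it
        return s_n
    # Otherwise y^D(delta): the point on [s_cp, s_n] with norm delta.
    d = s_n - s_cp
    a, b, c = d @ d, 2 * (s_cp @ d), s_cp @ s_cp - delta**2
    t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return s_cp + t * d
```

For spd H the returned step always decreases the quadratic model, and its norm is Δ whenever the Newton point lies outside the trust region.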
The important properties of dogleg methods are as follows:

• No two points on the path have the same distance from x_c; hence the path may be param-
  eterized as x(s), where s = ‖x(s) − x_c‖.

• m_c(x(s)) is a strictly decreasing function of s.

This enables us to show that the dogleg approximate solution of the trust region problem sat-
isfies Assumption 3.3.1 and apply Theorem 3.3.1 to conclude global convergence. Superlinear
convergence will follow if H_k is a sufficiently good approximation to ∇²f(x_k).
Lemma 3.3.5. Let x_c, H_c, and Δ be given. Let H_c be nonsingular,

s^N = −H_c^{−1} ∇f(x_c), and x^N = x_c + s^N.

Assume that ∇f(x_c)^T H_c ∇f(x_c) > 0 and let

s^{CP*} = x^{CP*} − x_c = − (‖∇f(x_c)‖² / (∇f(x_c)^T H_c ∇f(x_c))) ∇f(x_c).

Let S be the piecewise linear path from x_c to x^{CP*} to x^N. Then if

(3.47)    (s^N − s^{CP*})^T s^{CP*} > 0,

for any δ ≤ ‖s^N‖ there is a unique point x(δ) on S such that

‖x(δ) − x_c‖ = δ.



Proof. Clearly the statement of the result holds on the segment of the path from x_c to x^{CP*}.
To prove the result on the segment from x^{CP*} to x^N we must show that

φ(λ) = (1/2) ‖(1 − λ) s^{CP*} + λ s^N‖²

is strictly monotone increasing for λ ∈ (0, 1).
Since (3.47) implies that

‖s^N‖ ‖s^{CP*}‖ ≥ (s^N)^T s^{CP*} > ‖s^{CP*}‖²

and therefore that ‖s^N‖ > ‖s^{CP*}‖, we have

φ′(λ) = (s^N − s^{CP*})^T ((1 − λ) s^{CP*} + λ s^N)

      = −(1 − λ) ‖s^{CP*}‖² + (1 − λ)(s^N)^T s^{CP*} + λ ‖s^N‖² − λ (s^N)^T s^{CP*}

      > λ (‖s^N‖² − (s^N)^T s^{CP*}) ≥ λ (‖s^N‖ − ‖s^{CP*}‖) ‖s^N‖ > 0.

Hence, φ is an increasing function and the proof is complete.
The next stage is to show that the local quadratic model decreases on the dogleg path S.

Lemma 3.3.6. Let the assumptions of Lemma 3.3.5 hold. Then the local quadratic model

m_c(x) = f(x_c) + ∇f(x_c)^T (x − x_c) + (1/2)(x − x_c)^T H_c (x − x_c)

is strictly monotone decreasing on S.
Proof. Since x^{CP*}_c is the minimum of the local quadratic model in the steepest descent
direction, we need only show that m_c is strictly decreasing on the segment of the path between
x^{CP*}_c and x^N. Set

ψ(λ) = m_c(x_c + (1 − λ) s^{CP*} + λ s^N)

     = f(x_c) + ∇f(x_c)^T ((1 − λ) s^{CP*} + λ s^N)

       + (1/2)((1 − λ) s^{CP*} + λ s^N)^T H_c ((1 − λ) s^{CP*} + λ s^N).

Noting that H_c s^N = −∇f(x_c) and s^{CP*} = −λ̂ ∇f(x_c), where λ̂ = ‖∇f(x_c)‖² / (∇f(x_c)^T H_c ∇f(x_c)), we obtain

ψ(λ) = f(x_c) − λ̂ (1 − λ)² ‖∇f(x_c)‖²

       + λ (1 − λ/2) ∇f(x_c)^T s^N

       + (1/2)(1 − λ)² λ̂² ∇f(x_c)^T H_c ∇f(x_c).

Therefore,

ψ′(λ) = 2 λ̂ (1 − λ) ‖∇f(x_c)‖² + (1 − λ) ∇f(x_c)^T s^N − (1 − λ) λ̂² ∇f(x_c)^T H_c ∇f(x_c).

Since

λ̂ ∇f(x_c)^T H_c ∇f(x_c) = ‖∇f(x_c)‖²

we have, using (3.44),

ψ′(λ) = (1 − λ)(λ̂ ‖∇f(x_c)‖² − ∇f(x_c)^T H_c^{−1} ∇f(x_c))

      = (1 − λ) ∇f(x_c)^T (λ̂ ∇f(x_c) − H_c^{−1} ∇f(x_c))

      = ((1 − λ)/λ̂) (x_c − x^{CP*}_c)^T (x^N_c − x^{CP*}_c) < 0,

completing the proof.
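Both lemmas are easy to check numerically on an spd example. The sketch below samples the segment from s^{CP*} to s^N and verifies that the distance from x_c increases while the quadratic model decreases; the matrix and vector are random illustrative data of ours.

```python
import numpy as np

# Numerical check of Lemmas 3.3.5/3.3.6 on a random spd example.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
H = B @ B.T + 4 * np.eye(4)          # spd model Hessian (illustrative)
g = rng.standard_normal(4)           # stands in for grad f(x_c)

s_n = np.linalg.solve(H, -g)                 # Newton step s^N
s_cp = -(g @ g / (g @ H @ g)) * g            # s^{CP*}
assert (s_n - s_cp) @ s_cp > 0               # (3.47); holds since H is spd

lams = np.linspace(0.0, 1.0, 101)
pts = [(1 - l) * s_cp + l * s_n for l in lams]
norms = [np.linalg.norm(s) for s in pts]               # distance from x_c
model = [g @ s + 0.5 * (s @ H @ s) for s in pts]       # m_c(x_c + s) - f(x_c)
```

Along the sampled segment the norms are strictly increasing (Lemma 3.3.5) and the model values strictly decreasing (Lemma 3.3.6).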
At this point we have shown that the dogleg approximation of the trust region problem has a unique solution.
To prove global convergence we need only verify that the approximate solution x^D of the trust region
problem satisfies Assumption 3.3.1.

Theorem 3.3.7. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be
generated by Algorithm trgen with the solutions of the trust region problem given by
(3.46). Assume that the matrices {H_k} are bounded. Then either f(x_k) is unbounded from
below, ∇f(x_k) = 0 for some finite k, or

(3.48)    lim_{k→∞} ∇f(x_k) = 0.



Proof. We need to check that the solutions of the trust region problem satisfy Assump-
tion 3.3.1. Part 2 of the assumption follows from the definition, (3.46), of x^D and the bounded-
ness of the approximate Hessians: let

‖H_k‖ ≤ M

for all k. If ‖s_k‖ < Δ, then (3.46) implies that (3.44) must hold and so x_t = x^N_k is the Newton
point. Hence,

‖s_k‖ = ‖x_k − x^N_k‖ = ‖H_k^{−1} ∇f(x_k)‖ ≥ ‖∇f(x_k)‖/M,

which proves part 2.
Verification of part 1 will complete the proof. There are several cases to consider, depending
on how x^D is computed.
If x^D = x^{CP} then either ‖s^{CP}‖ = Δ_c or (3.44) fails. We first consider the case where
∇f(x_c)^T H_c ∇f(x_c) ≤ 0. In that case ‖s^{CP}‖ = Δ_c and λ̂ = Δ_c / ‖∇f(x_c)‖. Therefore,

pred = λ̂ ‖∇f(x_c)‖² − (λ̂²/2) ∇f(x_c)^T H_c ∇f(x_c)

     = Δ_c ‖∇f(x_c)‖ − (Δ_c²/2) (∇f(x_c)^T H_c ∇f(x_c) / ‖∇f(x_c)‖²)

     ≥ Δ_c ‖∇f(x_c)‖ = ‖s‖ ‖∇f(x_c)‖.

Hence (3.30) holds with σ = 1.
Now assume that ∇f(x_c)^T H_c ∇f(x_c) > 0 and ‖s^{CP}‖ = Δ_c. In this case

‖∇f(x_c)‖² / (∇f(x_c)^T H_c ∇f(x_c)) ≥ Δ_c / ‖∇f(x_c)‖,

and so, with λ̂ = Δ_c / ‖∇f(x_c)‖,

pred = λ̂ ‖∇f(x_c)‖² − (λ̂²/2) ∇f(x_c)^T H_c ∇f(x_c)

     = Δ_c ‖∇f(x_c)‖ − (Δ_c²/2) (∇f(x_c)^T H_c ∇f(x_c) / ‖∇f(x_c)‖²)

     ≥ Δ_c ‖∇f(x_c)‖ / 2,

which is (3.30) with σ = 1/2.
If (3.44) fails, ∇f(x_c)^T H_c ∇f(x_c) > 0, and ‖s^{CP}‖ < Δ_c, then

λ̂ = ‖∇f(x_c)‖² / (∇f(x_c)^T H_c ∇f(x_c)),

and

pred = λ̂ ‖∇f(x_c)‖² − (λ̂²/2) ∇f(x_c)^T H_c ∇f(x_c)

     = ‖∇f(x_c)‖⁴ / (2 ∇f(x_c)^T H_c ∇f(x_c)) = λ̂ ‖∇f(x_c)‖² / 2

     = ‖s‖ ‖∇f(x_c)‖ / 2,

which is (3.30) with σ = 1/2.
The final case is when (3.44) holds and x^D ≠ x^{CP}. In that case the predicted reduction is more
than Cauchy decrease, i.e., the decrease obtained by taking the Cauchy point, and hence

pred ≥ ‖∇f(x_c)‖⁴ / (2 ∇f(x_c)^T H_c ∇f(x_c))

     ≥ ‖∇f(x_c)‖² / (2M),

which is (3.30) with σ = 1/(2M). This completes the proof.
The last part of the proof of this theorem is very important: it asserts that any solution of
the trust region problem for which pred is at least a fixed fraction of Cauchy decrease will give
global convergence. We refer the reader to [232] and [104] for a more general and detailed
treatment using this point of view.
Corollary 3.3.8. Any algorithm for solving the trust region problem that satisfies, for some
τ > 0,

pred ≥ τ (m_c(x_c) − m_c(x^{CP}_c))

satisfies (3.30) for σ = τ/2.
The trust region CG algorithm we present in §3.3.7 can be analyzed with this corollary.
If H_k = ∇²f(x_k) or a sufficiently good approximation, then the classical dogleg will become
Newton's method (or a superlinearly convergent method) as the iterations approach a minimizer
that satisfies the standard assumptions. Hence, the algorithm makes a smooth and automatic
transition into the superlinearly convergent stage.
Theorem 3.3.9. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {x_k} be
generated by Algorithm trgen with the solutions of the trust region problem given by (3.46).
Assume that H_k = ∇²f(x_k) and that the matrices {H_k} are bounded. Let f be bounded from
below. Let x* be a minimizer of f at which the standard assumptions hold. Then if x* is a limit
point of {x_k}, then x_k → x* and the convergence is locally q-quadratic.
Proof. Since x* is a limit point of {x_k}, there is, for any ρ > 0, a k sufficiently large so that

‖e_k‖ < ρ,  ‖H_k‖ ≤ 2 ‖∇²f(x*)‖,  ‖H_k^{−1}‖ ≤ 2 ‖(∇²f(x*))^{−1}‖,

and x_k is near enough to x* for the assumptions of Theorem 2.3.2 to hold. Since H_k is spd, so is H_k^{−1},
and for such k, (3.44) holds. Hence, the dogleg path has the nodes x_k, x^{CP}_k, and x^N_k. Moreover,
if ρ is sufficiently small, then

‖H_k^{−1} ∇f(x_k)‖ ≤ 2 ‖e_k‖ ≤ 2ρ.

We complete the proof by showing that if ρ is sufficiently small, the trust region radius will be
expanded if necessary until the Newton step is in the trust region. Once we do this, the proof is
complete, as then the local quadratic convergence of Newton's method will take over.



Now

pred_k ≥ ‖s_k‖ ‖∇f(x_k)‖ / 2

by the proof of Theorem 3.3.7. Using H_k = ∇²f(x_k) we have

ared_k = −∇f(x_k)^T s_{t_k} − ∫₀¹ (∇f(x_k + t s_{t_k}) − ∇f(x_k))^T s_{t_k} dt

       = pred_k + s_{t_k}^T ∇²f(x_k) s_{t_k}/2 − ∫₀¹ (∇f(x_k + t s_{t_k}) − ∇f(x_k))^T s_{t_k} dt

       = pred_k + O(‖s_k‖ ‖∇f(x_k)‖ ρ)

and therefore ared/pred = 1 − O(ρ). Hence, for ρ sufficiently small, the trust region radius will
be increased, if necessary, until the Newton point is inside the trust region, and then a Newton
step will be taken. This completes the proof.
The classical dogleg algorithm is implemented in Algorithm ntrust, which uses the trust
radius adjustment scheme from Algorithm trtest. It is to be understood that trtest is
implemented so that x_t is given by (3.46) and hence trtest only samples points on the
piecewise linear search path determined by the Cauchy point, the Newton point, and (3.44).

Algorithm 3.3.6. ntrust(x, f, τ)

1. Compute f(x) and ∇f(x).
2. τ = τ_a + τ_r ‖∇f(x)‖.
3. Do while ‖∇f(x)‖ > τ
   (a) Compute and factor ∇²f(x).
   (b) Compute the Cauchy and Newton points and test (3.44).
   (c) Call trtest(x, x_t, x_+, f, Δ).
   (d) Compute f(x_+) and ∇f(x_+); x = x_+.

We implement Algorithm ntrust in the collection of MATLAB codes.
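For readers without the MATLAB collection, a self-contained Python sketch of ntrust follows. It inlines a compact dogleg step so the block stands alone; the radius update constants and tolerances are illustrative assumptions in the spirit of trtest, and the Newton solve assumes an spd Hessian where it is reached.

```python
import numpy as np

def _dogleg(g, H, delta):
    # Compact classical dogleg trial step (3.46), as s = x^D - x_c.
    sd = -(delta / np.linalg.norm(g)) * g       # steepest descent to boundary
    curv = g @ H @ g
    if curv <= 0:                               # x^{CP*} does not exist
        return sd
    s_cp = -(g @ g / curv) * g                  # s^{CP*}
    if np.linalg.norm(s_cp) >= delta:           # Cauchy point on the boundary
        return sd
    s_n = np.linalg.solve(H, -g)                # Newton step
    if (s_n - s_cp) @ s_cp <= 0:                # (3.44) fails
        return s_cp
    if np.linalg.norm(s_n) <= delta:            # Newton point inside
        return s_n
    d = s_n - s_cp                              # walk the leg to the boundary
    a, b, c = d @ d, 2 * (s_cp @ d), s_cp @ s_cp - delta**2
    t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return s_cp + t * d

def ntrust(x, f, grad, hess, tau_a=1e-8, tau_r=1e-8, kmax=100, delta=1.0,
           mu0=1e-4, mulow=0.25, muhigh=0.75, wdown=0.5, wup=2.0):
    """Sketch of Algorithm ntrust: dogleg trial points plus a simple
    trtest-style radius update.  Constants are illustrative."""
    tau = tau_a + tau_r * np.linalg.norm(grad(x))
    for _ in range(kmax):
        g = grad(x)
        if np.linalg.norm(g) <= tau:
            break
        H = hess(x)
        for _ in range(50):                     # adjust delta until acceptance
            s = _dogleg(g, H, delta)
            ared = f(x) - f(x + s)
            pred = -(g @ s) - 0.5 * (s @ H @ s)
            if ared / pred < mu0:
                delta *= wdown                  # reject; shrink and retry
                continue
            if ared / pred < mulow:
                delta *= wdown                  # accept, but shrink
            elif ared / pred > muhigh and np.isclose(np.linalg.norm(s), delta):
                delta *= wup                    # good boundary step; expand
            x = x + s
            break
    return x
```

On a smooth convex test function the iteration makes the transition the text describes: boundary steps early, pure Newton steps once the radius has expanded.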

3.3.7 A Trust Region Method for Newton–CG
In this section we present a brief account of an algorithm from [247] (see also [257]) that combines
the trust region paradigm of §3.3.6 with the inexact Newton ideas of §2.5.2. We follow §2.5.2
and denote the preconditioner by M and let C = M^{−1}. We solve the scaled trust region problem

min_{‖d‖_C ≤ Δ} φ(d),

where the quadratic model is still

φ(d) = ∇f(x)^T d + (1/2) d^T ∇²f(x) d.

Here the C-norm is

‖d‖_C = (d^T C d)^{1/2}.

The algorithmic description of the trust region problem solver for the TR–CG method
given below is from [162]. In [247] the algorithm is expressed in terms of C rather than M.
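To fix ideas before the algorithmic description, an unpreconditioned (C = M = I) truncated CG in the spirit of [247] might look like the sketch below. The names and tolerances are ours, and this is a simplification, not the preconditioned algorithm of [162]: CG on ∇²f(x) d = −∇f(x) is stopped early if it leaves the trust region or detects negative curvature.

```python
import numpy as np

def _to_boundary(d, p, delta):
    # The positive tau with ||d + tau p|| = delta.
    a, b, c = p @ p, 2 * (d @ p), d @ d - delta**2
    return (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)

def tr_cg(g, H, delta, eta=1e-4, max_iter=50):
    """Truncated CG for min phi(d) = g^T d + d^T H d / 2 over ||d|| <= delta,
    with C = M = I.  A sketch of the Steihaug-type idea from [247]."""
    d = np.zeros_like(g)
    r = -g                       # residual of H d = -g at d = 0
    p = r.copy()
    rho = r @ r
    for _ in range(max_iter):
        if np.sqrt(rho) <= eta * np.linalg.norm(g):
            return d             # inner iteration converged
        Hp = H @ p
        curv = p @ Hp
        if curv <= 0:            # negative curvature: run to the boundary
            return d + _to_boundary(d, p, delta) * p
        alpha = rho / curv
        if np.linalg.norm(d + alpha * p) >= delta:
            return d + _to_boundary(d, p, delta) * p
        d = d + alpha * p        # standard CG update
        r = r - alpha * Hp
        rho_new = r @ r
        p = r + (rho_new / rho) * p
        rho = rho_new
    return d
```

For spd H and a large radius this reduces to plain CG and returns the Newton direction; for a small radius it returns a boundary step.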

