


LOCAL CONVERGENCE

2.3.2 Termination of the Iteration
It is not safe to terminate the iteration when f(x_c) − f(x_+) is small, and no conclusions can
safely be drawn by examination of the differences of the objective function values at successive
iterations. While some of the algorithms for difficult problems in Part II of this book do indeed
terminate when successive function values are close, this is an act of desperation. For example,
if

    f(x_n) = − Σ_{j=1}^{n} j^{−1},

then f(x_n) → −∞ but f(x_{n+1}) − f(x_n) = −1/(n + 1) → 0. The reader has been warned.
If the standard assumptions hold, then one may terminate the iteration when the norm of ∇f
is sufficiently small relative to ‖∇f(x_0)‖ (see [154]). We will summarize the key points here and
refer the reader to [154] for the details. The idea is that if ∇²f(x*) is well conditioned, then a
small gradient norm implies a small error norm. Hence, for any gradient-based iterative method,
termination on small gradients is reasonable. In the special case of Newton's method, the norm
of the step is a very good indicator of the error, and if one is willing to incur the added cost of an
extra iteration, a very sharp bound on the error can be obtained, as we will see below.
Lemma 2.3.8. Assume that the standard assumptions hold. Let δ > 0 be small enough so
that the conclusions of Lemma 2.3.1 hold for x ∈ B(δ). Then for all x ∈ B(δ)

(2.22)    ‖e‖ / (4 κ(∇²f(x*)) ‖e_0‖)  ≤  ‖∇f(x)‖ / ‖∇f(x_0)‖  ≤  4 κ(∇²f(x*)) ‖e‖ / ‖e_0‖.

The meaning of (2.22) is that, up to a constant multiplier, the norm of the relative gradient
is the same as the norm of the relative error. This partially motivates the termination condition
(2.12).
In the special case of Newton's method, one can use the steplength as an accurate estimate
of the error because Theorem 2.3.2 implies that

(2.23)    ‖e_c‖ = ‖s‖ + O(‖e_c‖²).

Hence, near the solution ‖s‖ and ‖e_c‖ are essentially the same size. The cost of using (2.23) is
that all the information needed to compute x_+ = x_c + s has been computed. If we terminate the
iteration when ‖s‖ is smaller than our desired tolerance and then take x_+ as the final result, we
have attained more accuracy than we asked for. One possibility is to terminate the iteration when
‖s‖ = O(√τ_s) for some τ_s > 0. This, together with (2.23), will imply that ‖e_c‖ = O(√τ_s)
and hence, using Theorem 2.3.2, that

(2.24)    ‖e_+‖ = O(‖e_c‖²) = O(τ_s).

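The content of (2.23) can be seen in one variable (the test function here is our own choice, not one from the text): for f(x) = cosh x the minimizer is x* = 0, the Newton step is s = −f′(x)/f″(x) = −tanh x, and the gap |e_c| − |s| shrinks much faster than the error |e_c| itself.

```python
import math

# f(x) = cosh(x): f'(x) = sinh(x), f''(x) = cosh(x), minimizer x* = 0.
def newton_step(x):
    return -math.sinh(x) / math.cosh(x)      # = -tanh(x)

# |e_c| - |s| is O(|e_c|^3) for this f (O(|e_c|^2) in general, per (2.23)).
gaps = [abs(x) - abs(newton_step(x)) for x in (1e-1, 1e-2, 1e-3)]
```

Each factor of 10 reduction in |e_c| shrinks the gap by roughly 10³ here, so terminating on ‖s‖ reports the error at x_c faithfully.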
For a superlinearly convergent method, termination on small steps is equally valid, but one
cannot use (2.24). For a superlinearly convergent method we have

(2.25)    ‖e_c‖ = ‖s‖ + o(‖e_c‖)  and  ‖e_+‖ = o(‖e_c‖).

Hence, we can conclude that ‖e_+‖ < τ_s if ‖s‖ < τ_s. This is a weaker, but still very useful,
result.
For a q-linearly convergent method, such as the chord method, making termination decisions
based on the norms of the steps is much riskier. The relative error in estimating ‖e_c‖ by ‖s‖ is

    | ‖e_c‖ − ‖s‖ | / ‖e_c‖  ≤  ‖e_c + s‖ / ‖e_c‖  =  ‖e_+‖ / ‖e_c‖.


Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.





Hence, estimation of errors by steps is worthwhile only if convergence is fast. One can go further
[156] if one has an estimate ρ of the q-factor that satisfies

    ‖e_+‖ ≤ ρ ‖e_c‖.

In that case,

    (1 − ρ) ‖e_c‖ ≤ ‖e_c‖ − ‖e_+‖ ≤ ‖e_c − e_+‖ = ‖s‖.

Hence

(2.26)    ‖e_+‖ ≤ ρ ‖e_c‖ ≤ (ρ / (1 − ρ)) ‖s‖.

So, if the q-factor can be estimated from above by ρ and

    ‖s‖ < (1 − ρ) τ_s / ρ,

then ‖e_+‖ < τ_s. This approach is used in ODE and DAE codes [32], [234], [228], [213],
but requires good estimates of the q-factor and we do not advocate it for q-linearly convergent
methods for optimization. The danger is that if the convergence is slow, the approximate q-factor
can be a gross underestimate and cause premature termination of the iteration.
It is not uncommon for evaluations of f and ∇f to be very expensive and optimizations are,
therefore, usually allocated a ¬xed maximum number of iterations. Some algorithms, such as
the DIRECT, [150], algorithm we discuss in §8.4.2, assign a limit to the number of function
evaluations and terminate the iteration in only this way.
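The termination tests above can be collected into a small helper. This is a Python sketch with names and default tolerances of our own choosing (the book's own codes are in MATLAB): a relative gradient test that is reasonable for any gradient-based method, plus the step test ‖s‖ ≤ √τ_s, which by (2.23)–(2.24) is appropriate for Newton's method.

```python
import numpy as np

def terminate(grad, grad0, step=None, tau_grad=1e-6, tau_s=None):
    """Hypothetical termination helper.

    grad, grad0 : current and initial gradients.
    step        : latest Newton step s (optional).
    Stop when ||grad|| <= tau_grad * ||grad0|| (relative gradient test),
    or, for Newton's method only, when ||s|| <= sqrt(tau_s), which by
    (2.24) leaves an error of O(tau_s) at the accepted iterate x+.
    """
    if np.linalg.norm(grad) <= tau_grad * np.linalg.norm(grad0):
        return True
    if step is not None and tau_s is not None:
        return bool(np.linalg.norm(step) <= np.sqrt(tau_s))
    return False
```

As the text warns, the step test should not be trusted for merely q-linearly convergent methods such as the chord method.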


2.4 Nonlinear Least Squares
Nonlinear least squares problems have objective functions of the form

(2.27)    f(x) = (1/2) Σ_{i=1}^{M} r_i(x)²  =  (1/2) R(x)^T R(x).

The vector R = (r_1, . . . , r_M) is called the residual. These problems arise in data fitting, for
example. In that case M is the number of observations and N is the number of parameters;
for these problems M > N and we say the problem is overdetermined. If M = N we have a
nonlinear equation and the theory and methods from [154] are applicable. If M < N the problem
is underdetermined. Overdetermined least squares problems arise most often in data fitting
applications like the parameter identification example in §1.6.2. Underdetermined problems are
less common, but are, for example, important in the solution of high-index differential algebraic
equations [48], [50].
The local convergence theory for underdetermined problems has the additional complexity
that the limit of the Gauss–Newton iteration is not uniquely determined by the distance of the
initial iterate to the set of points x* where R(x*) = 0. In §2.4.3 we describe the difficulties and
state a simple convergence result.
If x* is a local minimizer of f and f(x*) = 0, the problem min f is called a zero residual
problem (a remarkable and suspicious event in the data fitting scenario). If f(x*) is small, the
expected result in data fitting if the model (i.e., R) is good, the problem is called a small residual
problem. Otherwise one has a large residual problem.
Nonlinear least squares problems are an intermediate stage between nonlinear equations and
optimization problems, and the methods for their solution reflect this. We define the M × N
Jacobian R′ of R by

(2.28)    (R′(x))_{ij} = ∂r_i/∂x_j,  1 ≤ i ≤ M,  1 ≤ j ≤ N.







With this notation it is easy to show that

(2.29)    ∇f(x) = R′(x)^T R(x) ∈ R^N.

The necessary conditions for optimality imply that at a minimizer x*

(2.30)    R′(x*)^T R(x*) = 0.

In the underdetermined case, if R′(x*) has full row rank, (2.30) implies that R(x*) = 0; this is
not the case for overdetermined problems.
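Formula (2.29) is easy to check numerically. The sketch below (the residual is a toy example of our own devising, not one from the text) compares R′(x)^T R(x) against a centered-difference gradient of f.

```python
import numpy as np

def R(x):                                   # toy residual, M = 3, N = 2
    return np.array([x[0]**2 - x[1], x[0] + x[1], np.sin(x[0])])

def Rprime(x):                              # 3 x 2 Jacobian, entries as in (2.28)
    return np.array([[2*x[0],       -1.0],
                     [1.0,           1.0],
                     [np.cos(x[0]),  0.0]])

def f(x):                                   # f = (1/2) R^T R, as in (2.27)
    return 0.5 * R(x) @ R(x)

x = np.array([0.7, -0.3])
grad = Rprime(x).T @ R(x)                   # gradient via (2.29)
h = 1e-6                                    # centered-difference check of grad f
fd = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(2)])
```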
The cost of a gradient is roughly that of a Jacobian evaluation, which, as is the case with
nonlinear equations, is the most one is willing to accept. Computation of the Hessian (an N × N
matrix)

(2.31)    ∇²f(x) = R′(x)^T R′(x) + Σ_{i=1}^{M} r_i(x) ∇²r_i(x)

requires computation of the M Hessians {∇²r_i} for the second-order term

    Σ_{i=1}^{M} r_i(x) ∇²r_i(x)

and is too costly to be practical.
We will also express the second-order term as

    Σ_{i=1}^{M} r_i(x) ∇²r_i(x) = R″(x)^T R(x),

where the second derivative R″ is a tensor. The notation is to be interpreted in the following
way. For v ∈ R^M, R″(x)^T v is the N × N matrix

    R″(x)^T v = ∇²(R(x)^T v) = Σ_{i=1}^{M} (v)_i ∇²r_i(x).

We will use the tensor notation when expanding R about x* in some of the analysis to follow.
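For fixed v the tensor expression is just the Hessian of the scalar function g(x) = R(x)^T v, so it can be verified with second differences. The two-component residual below is our own toy example, not one from the text.

```python
import numpy as np

def R(x):                                    # toy residual, M = N = 2
    return np.array([x[0]**2 * x[1], np.sin(x[0]) + x[1]**2])

def hess_r(x):                               # exact Hessians of r_1, r_2
    return [np.array([[2*x[1], 2*x[0]], [2*x[0], 0.0]]),
            np.array([[-np.sin(x[0]), 0.0], [0.0, 2.0]])]

x = np.array([0.4, 1.3])
v = np.array([2.0, -1.0])
lhs = sum(vi * Hi for vi, Hi in zip(v, hess_r(x)))   # sum_i (v)_i Hess(r_i)(x)

def g(z):                                    # g(x) = R(x)^T v
    return R(z) @ v

h, I = 1e-5, np.eye(2)                       # second-difference Hessian of g
fd = np.array([[(g(x + h*I[i] + h*I[j]) - g(x + h*I[i] - h*I[j])
                 - g(x - h*I[i] + h*I[j]) + g(x - h*I[i] - h*I[j])) / (4*h*h)
                for j in range(2)] for i in range(2)])
```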

2.4.1 Gauss–Newton Iteration
The Gauss–Newton algorithm simply discards the second-order term in ∇²f and computes a
step

(2.32)    s = −(R′(x_c)^T R′(x_c))^{−1} ∇f(x_c)
            = −(R′(x_c)^T R′(x_c))^{−1} R′(x_c)^T R(x_c).

The Gauss–Newton iterate is x_+ = x_c + s. One motivation for this approach is that R″(x)^T R(x)
vanishes for zero residual problems and therefore might be negligible for small residual problems.
Implicit in (2.32) is the assumption that R′(x_c)^T R′(x_c) is nonsingular, which implies that
M ≥ N. Another interpretation, which also covers underdetermined problems, is to say that the
Gauss–Newton iterate is the minimum norm solution of the local linear model of our nonlinear
least squares problem

(2.33)    min (1/2) ‖R(x_c) + R′(x_c)(x − x_c)‖².

Using (2.33) and linear least squares methods is a more accurate way to compute the step than
using (2.32) [115], [116], [127]. In the underdetermined case, the Gauss–Newton step can
be computed with the singular value decomposition [49], [127], [249]. Problem (2.33) is an
overdetermined, square, or underdetermined linear least squares problem if the nonlinear problem
is overdetermined, square, or underdetermined.
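A compact Python sketch of a Gauss–Newton loop follows (the book's own codes are MATLAB; the zero residual test problem and all names here are ours). Solving the local linear model (2.33) with a least squares routine returns the minimum norm step, so the same loop covers the overdetermined, square, and underdetermined cases.

```python
import numpy as np

# Hypothetical zero residual fitting problem: r_i(x) = exp(x[0]*t_i) + x[1] - y_i,
# with data generated from the "true" parameters x* = (0.5, 1).
t = np.array([0.0, 1.0, 2.0])
y = np.exp(0.5 * t) + 1.0

def R(x):
    return np.exp(x[0] * t) + x[1] - y

def Rprime(x):                              # M x N Jacobian, as in (2.28)
    return np.column_stack((t * np.exp(x[0] * t), np.ones_like(t)))

def gauss_newton(x, maxit=20, steptol=1e-12):
    for _ in range(maxit):
        # minimum norm solution of the local linear model (2.33)
        s, *_ = np.linalg.lstsq(Rprime(x), -R(x), rcond=None)
        x = x + s
        if np.linalg.norm(s) < steptol:     # step-based termination
            break
    return x

x = gauss_newton(np.array([0.4, 1.2]))
```

For this zero residual problem the local convergence is q-quadratic (Theorem 2.4.1 below), and the iterates settle on x* rapidly.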







The standard assumptions for nonlinear least squares problems follow in Assumption 2.4.1.
Assumption 2.4.1. x* is a minimizer of ‖R‖²_2, R is Lipschitz continuously differentiable
near x*, and R′(x*)^T R′(x*) has maximal rank. The rank assumption may also be stated as

• R′(x*) is nonsingular (M = N),

• R′(x*) has full column rank (M > N),

• R′(x*) has full row rank (M < N).

2.4.2 Overdetermined Problems

Theorem 2.4.1. Let M > N. Let Assumption 2.4.1 hold. Then there are K > 0 and δ > 0
such that if x_c ∈ B(δ) then the error in the Gauss–Newton iteration satisfies

(2.34)    ‖e_+‖ ≤ K( ‖e_c‖² + ‖R(x*)‖ ‖e_c‖ ).

Proof. Let δ be small enough so that ‖x − x*‖ < δ implies that R′(x)^T R′(x) is nonsingular.
Let γ be the Lipschitz constant for R′.
By (2.32)

(2.35)    e_+ = e_c − (R′(x_c)^T R′(x_c))^{−1} R′(x_c)^T R(x_c)
              = (R′(x_c)^T R′(x_c))^{−1} R′(x_c)^T (R′(x_c)e_c − R(x_c)).

Note that

    R′(x_c)e_c − R(x_c) = R′(x_c)e_c − R(x*) + R(x*) − R(x_c)
                        = −R(x*) + (R′(x_c)e_c + R(x*) − R(x_c)).

Now,

    ‖R′(x_c)e_c + R(x*) − R(x_c)‖ ≤ γ ‖e_c‖² / 2

and, since R′(x*)^T R(x*) = 0,

    −R′(x_c)^T R(x*) = (R′(x*) − R′(x_c))^T R(x*).

Hence,

(2.36)    ‖e_+‖ ≤ ‖(R′(x_c)^T R′(x_c))^{−1}‖ ‖(R′(x*) − R′(x_c))^T R(x*)‖
                     + ‖(R′(x_c)^T R′(x_c))^{−1} R′(x_c)^T‖ γ ‖e_c‖² / 2
                ≤ ‖(R′(x_c)^T R′(x_c))^{−1}‖ γ ‖e_c‖ ( ‖R(x*)‖ + ‖R′(x_c)^T‖ ‖e_c‖ / 2 ).

Setting

    K = γ max_{x∈B(δ)} ‖(R′(x)^T R′(x))^{−1}‖ ( 1 + ‖R′(x)^T‖ / 2 )

completes the proof.
There are several important consequences of Theorem 2.4.1. The first is that for zero residual
problems, the local convergence rate is q-quadratic because the ‖R(x*)‖ ‖e_c‖ term on the right
side of (2.34) vanishes. For a problem other than a zero residual one, even q-linear convergence
is not guaranteed. In fact, if x_c ∈ B(δ) then (2.34) will imply that ‖e_+‖ ≤ r ‖e_c‖ for some
0 < r < 1 if

(2.37)    K(δ + ‖R(x*)‖) ≤ r

and therefore the q-factor will be K ‖R(x*)‖. Hence, for small residual problems and accurate
initial data the convergence of Gauss–Newton will be fast. Gauss–Newton may not converge at
all for large residual problems.
Equation (2.36) exposes a more subtle issue when the term

    (R′(x*) − R′(x_c))^T R(x*)

is considered as a whole, rather than estimated by

    γ ‖e_c‖ ‖R(x*)‖.

Using Taylor's theorem and the necessary conditions (R′(x*)^T R(x*) = 0) we have

    R′(x_c)^T R(x*) = [R′(x*) + R″(x*)e_c + O(‖e_c‖²)]^T R(x*)
                    = e_c^T R″(x*)^T R(x*) + O(‖e_c‖²).

Recall that

    R″(x*)^T R(x*) = ∇²f(x*) − R′(x*)^T R′(x*)

and hence

(2.38)    ‖(R′(x*) − R′(x_c))^T R(x*)‖ ≤ ‖∇²f(x*) − R′(x*)^T R′(x*)‖ ‖e_c‖ + O(‖e_c‖²).

In a sense (2.38) says that even for a large residual problem, convergence can be fast if the problem
is not very nonlinear (small R″). In the special case of a linear least squares problem (where
R″ = 0) Gauss–Newton becomes simply the solution of the normal equations and converges in
one iteration.
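The one-iteration claim for linear problems is easy to verify numerically (the random data below are our own): a single Gauss–Newton step from any x_0 lands exactly on the normal equations solution.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))              # R(x) = Ax - b, so R' = A, R'' = 0
b = rng.standard_normal(6)
x0 = rng.standard_normal(3)

s = np.linalg.solve(A.T @ A, -(A.T @ (A @ x0 - b)))   # step (2.32)
x1 = x0 + s
xstar = np.linalg.solve(A.T @ A, A.T @ b)             # normal equations solution
```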
So, Gauss–Newton can be expected to work well for overdetermined small residual problems
and good initial iterates. For large residual problems and/or initial data far from the solution,
there is no reason to expect Gauss–Newton to give good results. We address these issues in
§3.2.3.

2.4.3 Underdetermined Problems
We begin with the linear underdetermined least squares problem

(2.39)    min ‖Ax − b‖².

If A is M × N with M < N there will not be a unique minimizer but there will be a unique
minimizer with minimum norm. The minimum norm solution can be expressed in terms of the
singular value decomposition [127], [249] of A,

(2.40)    A = UΣV^T.






In (2.40), Σ is an N × N diagonal matrix. The diagonal entries of Σ, {σ_i}, are called the singular
values. σ_i ≥ 0 and σ_i = 0 if i > M. The columns of the M × N matrix U and the N × N
matrix V are called the left and right singular vectors. U and V have orthonormal columns and
hence the minimum norm solution of (2.39) is

    x = A†b,

where

    A† = VΣ†U^T,    Σ† = diag(σ_1†, . . . , σ_N†),

and

    σ_i† = σ_i^{−1} if σ_i ≠ 0,    σ_i† = 0 if σ_i = 0.

A† is called the Moore–Penrose inverse [49], [189], [212]. If A is a square nonsingular matrix,
then A† = A^{−1}; if M > N then the definition of A† using the singular value decomposition is
still valid; and, if A has full column rank, A† = (A^T A)^{−1} A^T.
Two simple properties of the Moore–Penrose inverse are that A†A is a projection onto the
range of A† and AA† is a projection onto the range of A. This means that

(2.41)    A†AA† = A†,  (A†A)^T = A†A,  AA†A = A,  and  (AA†)^T = AA†.
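The identities in (2.41) can be checked numerically with the SVD-based pseudoinverse (a sketch; any full-rank random A of our choosing will do):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))          # M = 3 < N = 5, underdetermined
Adag = np.linalg.pinv(A)                 # Moore-Penrose inverse via the SVD

checks = (
    np.allclose(Adag @ A @ Adag, Adag),          # A† A A† = A†
    np.allclose(A @ Adag @ A, A),                # A A† A = A
    np.allclose((Adag @ A).T, Adag @ A),         # (A† A)^T = A† A
    np.allclose((A @ Adag).T, A @ Adag),         # (A A†)^T = A A†
)
```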

So the minimum norm solution of the local linear model (2.33) of an underdetermined
nonlinear least squares problem can be written as [17], [102]

(2.42)    s = −R′(x_c)† R(x_c)

and the Gauss–Newton iteration [17] is

(2.43)    x_+ = x_c − R′(x_c)† R(x_c).

The challenge in formulating a local convergence result is that there is not a unique optimal point
that attracts the iterates.
In the linear case, where R(x) = Ax − b, one gets

    x_+ = x_c − A†(Ax_c − b) = (I − A†A)x_c + A†b.

Set e = x − A†b and note that

    A†AA†b = A†b

by (2.41). Hence

    e_+ = (I − A†A)e_c.

This does not imply that x_+ = A†b, the minimum norm solution, only that x_+ is a solution of
the problem and the iteration converges in one step. The Gauss–Newton iteration cannot correct
for errors that are not in the range of A†.
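A numeric sketch of this linear underdetermined case (random data of our own): one step of (2.43) produces a solution of Ax = b, but the component of x_c that A† cannot "see" survives, so x_+ need not be the minimum norm solution A†b.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 4))              # M = 2 < N = 4, full row rank a.e.
b = rng.standard_normal(2)
Adag = np.linalg.pinv(A)
xc = rng.standard_normal(4)

xp = xc - Adag @ (A @ xc - b)                # one Gauss-Newton step, R(x) = Ax - b
xmin = Adag @ b                              # minimum norm solution of (2.39)
```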
Let

    Z = {x | R(x) = 0}.

We show in Theorem 2.4.2, a special case of the result in [92], that if the standard assumptions
hold at a point x* ∈ Z, then the iteration will converge r-quadratically to a point z* ∈ Z.
However, there is no reason to expect that z* = x*. In general z* will depend on x_0, a very
different situation from the overdetermined case. The hypotheses of Theorem 2.4.2, especially
that of full column rank in R′(x), are less general than those in [24], [17], [25], [92], and [90].







Theorem 2.4.2. Let M ≤ N and let Assumption 2.4.1 hold for some x* ∈ Z. Then there is
δ > 0 such that if

    ‖x_0 − x*‖ ≤ δ,

then the Gauss–Newton iterates

    x_{n+1} = x_n − R′(x_n)† R(x_n)

exist and converge r-quadratically to a point z* ∈ Z.
Proof. Assumption 2.4.1 and results in [49], [126] imply that if δ is sufficiently small then
there is ρ_1 such that R′(x)† is Lipschitz continuous in the set

    B_1 = {x | ‖x − x*‖ ≤ ρ_1}

and the singular values of R′(x) are bounded away from zero in B_1. We may, reducing ρ_1 if
necessary, apply the Kantorovich theorem [154], [151], [211] to show that if x ∈ B_1 and w ∈ Z
is such that

    ‖x − w‖ = min_{z∈Z} ‖x − z‖,

then there is ξ = ξ(x) ∈ Z such that

    ‖w − ξ(x)‖ = O(‖x − w‖²) ≤ ‖x − w‖/2

and x − ξ(x) is in the range of R′(w)†, i.e.,

    R′(w)† R′(w)(x − ξ(x)) = x − ξ(x).

The method of the proof is to adjust δ so that the Gauss–Newton iterates remain in B_1 and
R(x_n) → 0 sufficiently rapidly. We begin by requiring that δ < ρ_1/2.
Let x_c ∈ B_1 and let e = x − ξ(x_c). Taylor's theorem, the fundamental theorem of calculus,
and (2.41) imply that

    e_+ = e_c − R′(x_c)† R(x_c)
        = e_c − (R′(x_c)† − R′(ξ(x_c))†) R(x_c) − R′(ξ(x_c))† R(x_c)
        = e_c − R′(ξ(x_c))† R(x_c) + O(‖e_c‖²)
        = (I − R′(ξ(x_c))† R′(ξ(x_c))) e_c + O(‖e_c‖²) = O(‖e_c‖²).

If we define d(x) = min_{z∈Z} ‖x − z‖ then there is K_1 such that

(2.44)    d(x_+) ≤ ‖x_+ − ξ(x_c)‖ ≤ K_1 ‖x_c − ξ(x_c)‖² ≤ K_1 d(x_c)².

Now let

    ρ_2 = min(ρ_1, (2K_1)^{−1}).

So if

    x_c ∈ B_2 = {x | ‖x − x*‖ ≤ ρ_2}

then

(2.45)    d(x_+) ≤ d(x_c)/2.

Finally, there is K_2 such that

    ‖x_+ − x*‖ ≤ ‖x_c − x*‖ + ‖x_+ − x_c‖ = ‖x_c − x*‖ + ‖R′(x_c)† R(x_c)‖
               ≤ ‖x_c − x*‖ + K_2 ‖x_c − ξ(x_c)‖.

We now require that

(2.46)    δ ≤ ρ_2 / (2(1 + K_2)).

We complete the proof by induction. There is K_3 such that ‖x_+ − x_c‖ = ‖R′(x_c)† R(x_c)‖ ≤
K_3 d(x_c) for all x_c ∈ B_1. If ‖x_0 − x*‖ ≤ δ and the Gauss–Newton iterates {x_k}_{k=0}^{n}
are in B_2, then x_{n+1} is defined and, using (2.46), (2.45), and (2.44),

    ‖x_{n+1} − x*‖ ≤ ‖x_0 − x*‖ + K_3 Σ_{k=0}^{n} d(x_k) ≤ δ + 2K_3 d(x_0) ≤ ρ_2.

Hence, the Gauss–Newton iterates exist, remain in B_2, and d(x_n) → 0.
To show that the sequence of Gauss–Newton iterates does in fact converge, we observe that
(2.45) implies that for any m, n ≥ 0,

    ‖x_{n+m} − x_n‖ ≤ Σ_{l=n}^{n+m−1} ‖x_{l+1} − x_l‖ ≤ K_3 Σ_{l=n}^{n+m−1} d(x_l)
                   ≤ K_3 d(x_n) (1 − 2^{−m}) / (1/2)
                   ≤ 2K_3 d(x_n) ≤ 2^{−n+1} K_3 d(x_0).

Hence, {x_k} is a Cauchy sequence and therefore converges to a point z* ∈ Z. Since

    ‖x_n − z*‖ ≤ 2K_3 d(x_n),

(2.44) implies that the convergence is r-quadratic.


2.5 Inexact Newton Methods
An inexact Newton method [74] uses an approximate Newton step s = x_+ − x_c, requiring only
that

(2.47)    ‖∇²f(x_c)s + ∇f(x_c)‖ ≤ η_c ‖∇f(x_c)‖,

i.e., that the linear residual be small. We will refer to any vector s that satisfies (2.47) with
η_c < 1 as an inexact Newton step. We will refer to the parameter η_c on the right-hand side of
(2.47) as the forcing term [99].
Inexact Newton methods are also called truncated Newton methods [75], [198], [199] in the
context of optimization. In this book, we consider Newton–iterative methods. This is the class of
inexact Newton methods in which the linear equation (2.4) for the Newton step is also solved by
an iterative method and (2.47) is the termination criterion for that linear iteration. It is standard
to refer to the sequence of Newton steps {x_n} as the outer iteration and the sequence of iterates
for the linear equation as the inner iteration. The naming convention (see [33], [154], [211])
is that Newton–CG, for example, refers to the Newton–iterative method in which the conjugate
gradient [141] algorithm is used to perform the inner iteration.
Newton–CG is particularly appropriate for optimization, as we expect positive definite Hes-
sians near a local minimizer. The results for inexact Newton methods from [74] and [154]
are sufficient to describe the local convergence behavior of Newton–CG, and we summarize
the relevant results from nonlinear equations in §2.5.1. We will discuss the implementation of
Newton–CG in §2.5.2.







2.5.1 Convergence Rates
We will prove the simplest of the convergence results for Newton–CG, Theorem 2.5.1, from
which Theorem 2.5.2 follows directly. We refer the reader to [74] and [154] for the proof of
Theorem 2.5.3.
Theorem 2.5.1. Let the standard assumptions hold. Then there are δ and K_I such that if
x_c ∈ B(δ), s satisfies (2.47), and x_+ = x_c + s, then

(2.48)    ‖e_+‖ ≤ K_I (‖e_c‖ + η_c) ‖e_c‖.

Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 and Theorem 2.3.2
hold. To prove the first assertion (2.48) note that if

    r = −∇²f(x_c)s − ∇f(x_c)

is the linear residual, then

    s + (∇²f(x_c))^{−1} ∇f(x_c) = −(∇²f(x_c))^{−1} r

and

(2.49)    e_+ = e_c + s = e_c − (∇²f(x_c))^{−1} ∇f(x_c) − (∇²f(x_c))^{−1} r.

Now, (2.47), (2.7), and (2.6) imply that

    ‖s + (∇²f(x_c))^{−1} ∇f(x_c)‖ ≤ ‖(∇²f(x_c))^{−1}‖ η_c ‖∇f(x_c)‖
                                  ≤ 4 κ(∇²f(x*)) η_c ‖e_c‖.

Hence, using (2.49) and Theorem 2.3.2, we have that

    ‖e_+‖ ≤ ‖e_c − (∇²f(x_c))^{−1} ∇f(x_c)‖ + 4 κ(∇²f(x*)) η_c ‖e_c‖
          ≤ K ‖e_c‖² + 4 κ(∇²f(x*)) η_c ‖e_c‖,

where K is the constant from (2.8). If we set

    K_I = K + 4 κ(∇²f(x*)),

the proof is complete.
Theorem 2.5.2. Let the standard assumptions hold. Then there are δ and η̄ such that if
x_0 ∈ B(δ) and {η_n} ⊂ [0, η̄], then the inexact Newton iteration

    x_{n+1} = x_n + s_n,

where

    ‖∇²f(x_n)s_n + ∇f(x_n)‖ ≤ η_n ‖∇f(x_n)‖,

converges q-linearly to x*. Moreover

• if η_n → 0 the convergence is q-superlinear, and

• if η_n ≤ K_η ‖∇f(x_n)‖^p for some K_η > 0 the convergence is q-superlinear with q-order
1 + p.







The similarity between Theorem 2.5.2 and Theorem 2.3.5, the convergence result for the
chord method, should be clear. Rather than require that the approximate Hessian be accurate,
we demand that the linear iteration produce a sufficiently small relative residual. Theorem 2.5.3
is the remarkable statement that any reduction in the relative linear residual will suffice for linear
convergence in a certain norm. This statement implies [154] that ‖∇f(x_n)‖ will converge to
zero q-linearly, or, equivalently, that x_n → x* q-linearly with respect to ‖·‖_*, which is defined
by

    ‖x‖_* = ‖∇²f(x*) x‖.

Theorem 2.5.3. Let the standard assumptions hold. Then there is δ such that if x_c ∈ B(δ),
s satisfies (2.47), x_+ = x_c + s, and η_c ≤ η < η̄ < 1, then

(2.50)    ‖e_+‖_* ≤ η̄ ‖e_c‖_*.

Theorem 2.5.4. Let the standard assumptions hold. Then there is δ such that if x_0 ∈ B(δ)
and {η_n} ⊂ [0, η] with η < η̄ < 1, then the inexact Newton iteration

    x_{n+1} = x_n + s_n,

where

    ‖∇²f(x_n)s_n + ∇f(x_n)‖ ≤ η_n ‖∇f(x_n)‖,

converges q-linearly with respect to ‖·‖_* to x*. Moreover

• if η_n → 0 the convergence is q-superlinear, and

• if η_n ≤ K_η ‖∇f(x_n)‖^p for some K_η > 0 the convergence is q-superlinear with q-order
1 + p.

Q-linear convergence of {x_n} to a local minimizer with respect to ‖·‖_* is equivalent to
q-linear convergence of {‖∇f(x_n)‖} to zero. We will use the rate of convergence of {‖∇f(x_n)‖}
in our computational examples to compare various methods.

2.5.2 Implementation of Newton–CG
Our implementation of Newton–CG approximately solves the equation for the Newton step with
CG. We make the implicit assumption that ∇f has been computed sufficiently accurately for
D²_h f(x : w) to be a useful approximation to the Hessian–vector product ∇²f(x)w.


Forward Difference CG
Algorithm fdcg is an implementation of the solution by CG of the equation for the Newton step
(2.4). In this algorithm we take care to identify failure in CG (i.e., detection of a vector p for
which p^T Hp ≤ 0). This failure either means that H is singular (p^T Hp = 0; see exercise 2.7.3)
or that p^T Hp < 0, i.e., p is a direction of negative curvature. The algorithms we will discuss
in §3.3.7 make good use of directions of negative curvature. The initial iterate for the forward
difference CG iteration should be the zero vector. In this way the first iterate will give a steepest
descent step, a fact that is very useful. The inputs to Algorithm fdcg are the current point x,
the objective f, the forcing term η, and a limit on the number of iterations kmax. The output is
the inexact Newton direction d. Note that in step 2b D²_h f(x : p) is used as an approximation to
∇²f(x)p.
Algorithm 2.5.1. fdcg(x, f, η, kmax, d)

1. r = −∇f(x), ρ_0 = ‖r‖², k = 1, d = 0.

2. Do While √ρ_{k−1} > η ‖∇f(x)‖ and k < kmax

   (a) if k = 1 then p = r
       else
       β = ρ_{k−1}/ρ_{k−2} and p = r + βp
   (b) w = D²_h f(x : p)
       If p^T w = 0 signal indefiniteness; stop.
       If p^T w < 0 signal negative curvature; stop.
   (c) α = ρ_{k−1}/p^T w
   (d) d = d + αp
   (e) r = r − αw
   (f) ρ_k = ‖r‖²
   (g) k = k + 1
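A Python sketch of Algorithm fdcg follows (the book's implementations are MATLAB; the difference increment h and all names here are our choices). The Hessian–vector product of step 2b is realized as the forward difference D²_h f(x : p) = (∇f(x + h p/‖p‖) − ∇f(x)) ‖p‖/h.

```python
import numpy as np

def fdcg(x, grad, eta, kmax, h=1e-7):
    """CG for the Newton step, H d = -grad(x), with forward-difference
    Hessian-vector products; stops on the inexact Newton condition (2.47),
    on kmax, or when indefiniteness / negative curvature is detected."""
    g = grad(x)
    r = -g
    d = np.zeros_like(x)
    rho, rho_old = r @ r, None
    tol2 = (eta * np.linalg.norm(g)) ** 2     # sqrt(rho) > eta*||grad f(x)||
    k, p = 1, None
    while rho > tol2 and k < kmax:
        p = r.copy() if k == 1 else r + (rho / rho_old) * p
        pnorm = np.linalg.norm(p)
        w = (grad(x + h * p / pnorm) - g) * (pnorm / h)   # D_h^2 f(x : p)
        curv = p @ w
        if curv <= 0:
            break       # indefiniteness (= 0) or negative curvature (< 0)
        alpha = rho / curv
        d = d + alpha * p
        r = r - alpha * w
        rho_old, rho = rho, r @ r
        k += 1
    return d
```

On a quadratic f(x) = ½ x^T Hx − b^T x with spd H the difference products are exact up to rounding, so fdcg recovers the Newton step to the requested tolerance.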

Preconditioning can be incorporated into a Newton–CG algorithm by using a forward dif-
ference formulation, too. Here, as in [154], we denote the preconditioner by M. Aside from M,
the inputs and output of Algorithm fdpcg are the same as those for Algorithm fdcg.

Algorithm 2.5.2. fdpcg(x, f, M, η, kmax, d)

1. r = −∇f(x), ρ_0 = ‖r‖², k = 1, d = 0.

2. Do While √ρ_{k−1} > η ‖∇f(x)‖ and k < kmax

   (a) z = Mr
   (b) τ_{k−1} = z^T r
   (c) if k = 1 then β = 0 and p = z
       else
       β = τ_{k−1}/τ_{k−2}, p = z + βp
   (d) w = D²_h f(x : p)
       If p^T w = 0 signal indefiniteness; stop.
       If p^T w < 0 signal negative curvature; stop.
   (e) α = τ_{k−1}/p^T w
   (f) d = d + αp
   (g) r = r − αw
   (h) ρ_k = r^T r
   (i) k = k + 1

In our formulation of Algorithms fdcg and fdpcg, indefiniteness is a signal that we are
not sufficiently near a minimum for the theory in this section to hold. In §3.3.7 we show how
negative curvature can be exploited when far from the solution.
One view of preconditioning is that it is no more than a rescaling of the independent variables.
Suppose, rather than (1.2), we seek to solve

(2.51)    min_y f̂(y),

where f̂(y) = f(M^{1/2}y) and M is spd. If y* is a local minimizer of f̂, then x* = M^{1/2}y* is
a local minimizer of f and the two problems are equivalent. Moreover, if x = M^{1/2}y and ∇_x
and ∇_y denote gradients in the x and y coordinates, then

    ∇_y f̂(y) = M^{1/2} ∇_x f(x)

and

    ∇²_y f̂(y) = M^{1/2} (∇²_x f(x)) M^{1/2}.

Hence, the scaling matrix plays the role of the square root of the preconditioner for the precon-
ditioned conjugate gradient algorithm.
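The two rescaling identities can be checked numerically. The sketch below uses a quadratic f(x) = ½ x^T Hx (our example, with a random spd preconditioner M), for which centered differences of f̂ are exact up to rounding.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
H = B @ B.T + 4 * np.eye(4)                  # spd Hessian of f(x) = 0.5 x^T H x
C = rng.standard_normal((4, 4))
M = C @ C.T + 4 * np.eye(4)                  # spd preconditioner
lam, Q = np.linalg.eigh(M)
Mhalf = Q @ np.diag(np.sqrt(lam)) @ Q.T      # symmetric square root of M

def fhat(y):                                 # fhat(y) = f(M^{1/2} y)
    x = Mhalf @ y
    return 0.5 * x @ (H @ x)

y = rng.standard_normal(4)
x = Mhalf @ y
I, h = np.eye(4), 1e-6
grad_fd = np.array([(fhat(y + h*I[i]) - fhat(y - h*I[i])) / (2*h)
                    for i in range(4)])      # finite-difference grad_y fhat
grad_pred = Mhalf @ (H @ x)                  # M^{1/2} grad_x f(x)
hess_pred = Mhalf @ H @ Mhalf                # M^{1/2} (Hess_x f) M^{1/2}
h2 = 1e-4                                    # check one entry of Hess_y fhat
hess00_fd = (fhat(y + h2*I[0]) - 2*fhat(y) + fhat(y - h2*I[0])) / h2**2
```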


Newton–CG

The theory guarantees that if x_0 is near enough to a local minimizer then ∇²f(x_n) will be spd
for the entire iteration and x_n will converge rapidly to x*. Hence, Algorithm newtcg will not
terminate with failure because of an increase in f or an indefinite Hessian. Note that both the
forcing term η and the preconditioner M can change as the iteration progresses.

Algorithm 2.5.3. newtcg(x, f, τ, η)

1. r_c = r_0 = ‖∇f(x)‖

2. Do while ‖∇f(x)‖ > τ_r r_0 + τ_a

   (a) Select η and a preconditioner M.
   (b) fdpcg(x, f, M, η, kmax, d)
       If indefiniteness has been detected, terminate with failure.
   (c) x = x + d.
   (d) Evaluate f and ∇f(x).
       If f has not decreased, terminate with failure.
   (e) r_+ = ‖∇f(x)‖, σ = r_+/r_c, r_c = r_+.

The implementation of Newton–CG is simple, but, as presented in Algorithm newtcg,
incomplete. The algorithm requires substantial modification to be able to generate the good
initial data that the local theory requires. We return to this issue in §3.3.7.
There is a subtle problem with Algorithm fdpcg in that the algorithm is equivalent to the
application of the preconditioned conjugate gradient algorithm to the matrix B that is determined
by

    Bp_i = w_i = D²_h f(x : p_i),  1 ≤ i ≤ N.

However, since the map p → D²_h f(x : p) is not linear in p, the quality of B as an approximation
to ∇²f(x) may degrade as the linear iteration progresses. Usually this will not cause problems
unless many iterations are needed to satisfy the inexact Newton condition. However, if one does
not see the expected rate of convergence in a Newton–CG iteration, this could be a factor [128].
One partial remedy is to use a centered-difference Hessian–vector product [162], which reduces
the error in B. In exercise 2.7.15 we discuss a more complex and imaginative way to compute
accurate Hessians.







2.6 Examples
In this section we used the collection of MATLAB codes but disabled the features (see Chapter 3)
that assist in convergence when far from a minimizer. We took care to make certain that the
initial iterates were near enough to the minimizer so that the observations of local convergence
corresponded to the theory. In practical optimization problems, good initial data is usually not
available and the globally convergent methods discussed in Chapter 3 must be used to start the
iteration.
The plots in this section have the characteristics of local convergence in that both the gradient
norms and function values are decreasing. The reader should contrast this with the examples in
Chapter 3.


2.6.1 Parameter Identification
For this example, M = 100, and the observations are those of the exact solution with c = k = 1,
which we computed analytically. We used T = 10 and u_0 = 10. We computed the displacement
and solved the sensitivity equations with the stiff solver ode15s. These results could be
obtained as well in a FORTRAN environment using, for example, the LSODE code [228]. The
relative and absolute error tolerances for the integrator were both set to 10^{−8}. In view of the
expected accuracy of the gradient, we set the forward difference increment for the approximate
Hessian to h = 10^{−4}. We terminated the iterations when ‖∇f‖ < 10^{−4}. Our reasons for this
are that, for the zero residual problem considered here, the standard assumptions imply that
f(x) = O(‖∇f(x)‖²) for x near the solution. Hence, since we can only resolve f to an accuracy
of 10^{−8}, iteration beyond the point where ‖∇f‖ < 10^{−4} cannot be expected to lead to a further
decrease in f. In fact we observed this in our computations.
The iterations are very sensitive to the initial iterate. We used x_0 = (1.1, 1.05)^T; initial
iterates much worse than that caused Newton's method to fail. The more robust methods from
Chapter 3 should be viewed as essential components of even a simple optimization code.
In Table 2.1 we tabulate the history of the iteration for both the Newton and Gauss–Newton
methods. As expected for a small residual problem, Gauss–Newton performs well and, for this
example, even converges in fewer iterations. The real benefit of Gauss–Newton is that com-
putation of the Hessian can be avoided, saving considerable computational work by exploiting
the structure of the problem. In the computation reported here, the MATLAB flops com-
mand indicates that the Newton iteration took roughly 1.9 million floating point operations and
Gauss–Newton roughly 650 thousand. This difference would be much more dramatic if there
were more than two parameters or the cost of an evaluation of f depended on N in a significant
way (which it does not in this example).


Table 2.1: Parameter identification problem, locally convergent iterations.

          Newton                       Gauss–Newton
  n   ‖∇f(x_n)‖    f(x_n)        ‖∇f(x_n)‖    f(x_n)
  0   2.33e+01     7.88e-01      2.33e+01     7.88e-01
  1   6.87e+00     9.90e-02      1.77e+00     6.76e-03
  2   4.59e-01     6.58e-04      1.01e-02     4.57e-07
  3   2.96e-03     3.06e-08      9.84e-07     2.28e-14
  4   2.16e-06     4.15e-14

Figure 2.1 is a graphical representation of the convergence history from Table 2.1. We think
that the plots are a more effective way to understand iteration statistics and will present mostly
graphs for the remainder of the book. The concavity of the plots of the gradient norms is the
signature of superlinear convergence.

[Figure 2.1 shows four semilog panels: gradient norm and function value plotted against the iteration counter, for Newton's method (top row) and the Gauss–Newton method (bottom row).]

Figure 2.1: Local Optimization for the Parameter ID Problem

We next illustrate the difference between Gauss–Newton and Newton on a nonzero residual
problem. We use the same example as before with the observations randomly perturbed. We
used the MATLAB rand function for this, perturbing the samples of the analytic solution by
.5*rand(M, 1). The least squares residual is about 3.6 and the plots in Figure 2.2 indicate
that Newton's method is still converging quadratically, but the rate of Gauss–Newton is linear.
The linear convergence of Gauss–Newton can be seen clearly from the linear semilog plot of the
gradient norms. Even so, the Gauss–Newton iteration was more efficient, in terms of floating
point operations, than Newton's method. The Gauss–Newton iteration took roughly 1 million
floating point operations while the Newton iteration took 1.4 million.

2.6.2 Discrete Control Problem
We solve the discrete control problem from §1.6.1 with N = 400, T = 1, y0 = 0,

L(y, u, t) = (y − 3)^2 + .5u^2, and φ(y, u, t) = uy + t^2

with Newton–CG and two different choices, η = .1, .0001, of the forcing term. The initial
iterate was u0 = (10, 10, . . . , 10)^T and the iteration was terminated when ‖∇f‖ < 10^-8. In




[Figure 2.2 shows four semilog panels: gradient norm and function value plotted against the iteration counter, for Newton's method (top row) and the Gauss–Newton method (bottom row); the function values level off near the nonzero residual value.]

Figure 2.2: Local Optimization for the Parameter ID Problem, Nonzero Residual


Figure 2.3 one can see that the small forcing term produces an iteration history with the concavity
of superlinear convergence. The limiting q-linear behavior of an iteration with constant η is not
yet visible. The iteration with the larger value of η is in the q-linearly convergent stage, as the
linear plot of ‖∇f‖ against the iteration counter shows.
The cost of the computation is not reflected by the number of nonlinear iterations. When
η = .0001, the high accuracy of the linear solve is not rewarded. The computation with η = .0001
required 8 nonlinear iterations, a total of 32 CG iterations, roughly 1.25 million floating point
operations, and 41 gradient evaluations. The optimization with η = .1 needed 10 nonlinear
iterations, a total of 13 CG iterations, roughly 820 thousand floating point operations, and 24
gradient evaluations.


2.7 Exercises on Local Convergence
2.7.1. Apply Newton's method with (a) analytic first and second derivatives, (b) analytic first
derivatives and forward difference second derivatives, and (c) forward difference first and
second derivatives to find a local minimum of

1. f(x) = sin^2(x),
2. f(x) = e^(x^2), and



[Figure 2.3 shows four semilog panels: gradient norm and function value plotted against the iteration counter, for Newton–CG with η = .1 (top row) and η = .0001 (bottom row).]

Figure 2.3: Newton–CG for the Discrete Control Problem: η = .1, .0001


3. f(x) = x^4.
Use difference steps of h = 10^-1, 10^-2, 10^-4, and 10^-8. Explain your results.
2.7.2. Repeat part (c) of exercise 2.7.1. Experiment with

f(x) = e^(x^2) + 10^-4 rand(x) and f(x) = x^2 + 10^-4 rand(x),

where rand denotes the random number generator in your computing environment. Explain
the differences in the results.
2.7.3. Show that if A is symmetric, p ≠ 0, and p^T Ap = 0, then A is either singular or indefinite.
2.7.4. Show that if b ∈ R^N and the N × N matrix A is symmetric and has a negative eigenvalue,
then the quadratic functional

m(x) = x^T Ax + x^T b

does not have a minimizer. Show that if u is an eigenvector corresponding to a negative
eigenvalue of the Hessian, then u is a direction of negative curvature.
2.7.5. If N = 1, the local quadratic model could easily be replaced by a local quartic (i.e.,
fourth-degree) model (what would be wrong with a cubic model?). If a method is based
on minimization of the local quartic model, what kind of local convergence would you



expect? How would you extend this method to the case N > 1? Look at [30] for some
results on this.
2.7.6. Show that if the standard assumptions hold, h is sufficiently small, and x is sufficiently
near x*, the difference Hessian defined by (2.19), ∇²_h f(x), is spd.

2.7.7. Write a locally convergent Newton method code based on accurate function and gradient
information and forward difference Hessians using (2.19). Be sure that your code tests for
positivity of the Hessian so that you can avoid convergence to a local maximum. Is the
test for positivity expensive? Apply your code to the parameter ID problem from §1.6.2.
If you use an ODE solver that lets you control the accuracy of the integration, try values
of the accuracy from 10^-8 to 10^-2 and see how the iteration performs. Be sure that your
difference Hessian reflects the accuracy in the gradient.
2.7.8. Let the standard assumptions hold and let λs > 0 be the smallest eigenvalue of ∇²f(x*).
Give the best (i.e., largest) bound you can for ρ such that ∇²f(x) is positive definite for
all x ∈ B(ρ).
2.7.9. Use the definition of A† to prove (2.41).

2.7.10. Fill in the missing details in the proof of Theorem 2.4.2 by showing how the Kantorovich
theorem can be used to prove the existence of ξ(x).
2.7.11. Let f(x) = x^2 and g(x) = sin(100x)/10. Using an initial iterate of x0 = 1, try to find
a local minimum of f + g using Newton's method with analytic gradients and Hessians.
Repeat the experiment with difference gradients and Hessians (try forward differences
with a step size of h = .2).
2.7.12. Solve the parameter ID problem from §2.6 with the observations perturbed randomly (for
example, you could use the MATLAB rand function for this). Vary the amplitude of the
perturbation and see how the performance of Newton and Gauss–Newton changes.

2.7.13. Derive sensitivity equations for the entries of the Hessian for the parameter ID objective
function. In general, if there are P parameters, how many sensitivity equations would
need to be solved for the gradient? How many for the Hessian?
2.7.14. Solve the discrete control problem from §2.6.2 using Newton–CG with forcing terms that
depend on n. Consider η_n = .5/n, η_n = min(.1, ‖∇f(u_n)‖), and some of the choices
from [99]. Vary N and the termination criteria and compare the performance with the
constant η choice in §2.6.2.
2.7.15. Let F : R^N → R^M (where M and N need not be the same) be sufficiently smooth (how
smooth is that?) and be such that F can also be computed for complex arguments. Show
that [181], [245]

Im(F(x + ihu))/h = F′(x)u + O(h^2),

where Im denotes imaginary part. What happens if there is error in F? How can you use
this fact to compute better difference gradients and Hessians?
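The complex-step identity in exercise 2.7.15 can be tried directly in a few lines. The sketch below (illustrative Python; the sample function F and the step h = 10^-20 are our own choices, not from the text) shows why the trick is attractive: there is no subtraction of nearby function values, so h can be taken near machine precision without cancellation error.

```python
import cmath, math

def complex_step_grad(F, x, h=1e-20):
    """Gradient of scalar F via Im(F(x + i*h*e_j))/h = dF/dx_j + O(h^2)."""
    g = []
    for j in range(len(x)):
        # perturb coordinate j along the imaginary axis
        xp = [xk + (1j * h if k == j else 0.0) for k, xk in enumerate(x)]
        g.append(F(xp).imag / h)
    return g

def F(x):  # sample smooth function that accepts complex arguments
    return cmath.exp(x[0]) * cmath.sin(x[1])

g = complex_step_grad(F, [0.5, 1.2])
# analytic gradient for comparison: (e^x0 sin x1, e^x0 cos x1)
```

A forward difference of the same F could do no better than roughly square-root-of-machine-precision accuracy; the complex step reproduces the analytic gradient to full precision.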








Chapter 3

Global Convergence

The locally convergent algorithms discussed in Chapter 2 can and do fail when the initial iterate
is not near the root. The reasons for this failure, as we explain below, are that the Newton
direction may fail to be a direction of descent for f and that even when a search direction is a
direction of decrease of f, as −∇f is, the length of the step can be too long. Hence, taking a
Newton (or Gauss–Newton, or inexact Newton) step can lead to an increase in the function and
divergence of the iteration (see exercise 3.5.14 for two dramatic examples of this). The globally
convergent algorithms developed in this chapter partially address this problem by either finding
a local minimum or failing in one of a small number of easily detectable ways.
These are not algorithms for global optimization. When these algorithms are applied to
problems with many local minima, the results of the iteration may depend in complex ways on
the initial iterate.


3.1 The Method of Steepest Descent
The steepest descent direction from x is d = −∇f(x). The method of steepest descent [52]
updates the current iteration xc by the formula

x+ = xc − λ∇f(xc).     (3.1)

If we take the simple choice λ = 1, then x+ is not guaranteed to be nearer a solution than xc,
even if xc is very near a solution that satisfies the standard assumptions. The reason for this is
that, unlike the Newton direction, the steepest descent direction scales with f. The Newton step,
on the other hand, is the same for f as it is for cf for any c ≠ 0 but need not be a direction of
decrease for f.
To make the method of steepest descent succeed, it is important to choose the steplength λ.
One way to do this, which we analyze in §3.2, is to let λ = β^m, where β ∈ (0, 1) and m ≥ 0 is
the smallest nonnegative integer such that there is sufficient decrease in f. In the context of the
steepest descent algorithm, this means that

f(xc − λ∇f(xc)) − f(xc) < −αλ‖∇f(xc)‖².     (3.2)

This strategy, introduced in [7] and called the Armijo rule, is an example of a line search in
which one searches on a ray from xc in a direction in which f is locally decreasing. In (3.2),
α ∈ (0, 1) is a parameter, which we discuss after we fully specify the algorithm. This strategy
of repeatedly testing for sufficient decrease and reducing the stepsize if the test is failed is called
backtracking for obvious reasons.


The motivation for (3.2) is that if we approximate f by the linear model

mc(x) = f(xc) + ∇f(xc)^T (x − xc),

then the reduction in the model (i.e., the predicted reduction in f) is

pred = mc(xc) − mc(x+) = λ‖∇f(xc)‖².

(3.2) says that the actual reduction in f,

ared = f(xc) − f(x+),

is at least a fraction α of the predicted reduction in the linear model. The parameter α
is typically set to 10^-4.
The reason we demand sufficient decrease instead of simple decrease (i.e., f(x+) < f(xc)
or α = 0) is largely theoretical; a nonzero value of α is required within the proof to ensure that
the iteration does not stagnate before convergence.
Algorithm 3.1.1. steep(x, f, kmax)
1. For k = 1, . . . , kmax
(a) Compute f and ∇f; test for termination.
(b) Find the least integer m ≥ 0 such that (3.2) holds for λ = β^m.
(c) x = x − λ∇f(x).
2. If k = kmax and the termination test is failed, signal failure.

The termination criterion could be based on (2.12), for example.
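A transliteration of Algorithm steep into code is short. The sketch below is illustrative Python, not the book's code; the quadratic test function, the gradient-norm termination test, and the parameter values α = 10^-4, β = 1/2 are our own choices.

```python
def steep(x, f, grad, kmax=200, alpha=1e-4, beta=0.5, tol=1e-8):
    """Steepest descent with Armijo backtracking, in the pattern of
    Algorithm steep."""
    for _ in range(kmax):
        g = grad(x)
        gnorm2 = sum(gi * gi for gi in g)
        if gnorm2 ** 0.5 < tol:                 # termination test
            return x
        fx, lam = f(x), 1.0
        trial = [xi - lam * gi for xi, gi in zip(x, g)]
        # backtrack: reduce lambda until sufficient decrease (3.2) holds
        while f(trial) - fx >= -alpha * lam * gnorm2:
            lam *= beta
            trial = [xi - lam * gi for xi, gi in zip(x, g)]
        x = trial
    raise RuntimeError("steep: termination test failed after kmax iterations")

# mildly ill-conditioned quadratic: f(x) = x0^2 + 2*x1^2, minimizer (0, 0)
xstar = steep([4.0, -3.0],
              lambda x: x[0] ** 2 + 2 * x[1] ** 2,
              lambda x: [2 * x[0], 4 * x[1]])
```

Note that the full step λ = 1 is rejected on this example even though the function is a benign quadratic, which is exactly the scaling problem with steepest descent described above.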


3.2 Line Search Methods and the Armijo Rule
We introduce a few new concepts so that our proof of convergence of Algorithm steep will
also apply to a significantly more general class of algorithms.
Definition 3.2.1. A vector d ∈ R^N is a descent direction for f at x if

df(x + td)/dt |_{t=0} = ∇f(x)^T d < 0.

Clearly the steepest descent direction d = −∇f(x) is a descent direction. A line search
algorithm searches for decrease in f in a descent direction, using the Armijo rule for stepsize
control, unless ∇f(x) = 0.
We will consider descent directions based on quadratic models of f of the form

m(x) = f(xc) + ∇f(xc)^T (x − xc) + (1/2)(x − xc)^T Hc (x − xc),

where Hc, which is sometimes called the model Hessian, is spd. We let d = x − xc be such that
m(x) is minimized. Hence

∇m(x) = ∇f(xc) + Hc(x − xc) = 0,

and therefore

d = −Hc^{-1}∇f(xc).     (3.3)



The steepest descent direction satisfies (3.3) with Hc = I. However, the Newton direction
d = −∇²f(x)^{-1}∇f(x) may fail to be a descent direction if x is far from a minimizer because
∇²f may not be spd. Hence, unlike the case for nonlinear equations [154], Newton's method is
not a generally good global method, even with a line search, and must be modified (see [113],
[117], [231], and [100]) to make sure that the model Hessians are spd.
The algorithm we analyze in this section is an extension of Algorithm steep that allows for
descent directions that satisfy (3.3) for spd H. We modify (3.2) to account for H and the new
descent direction d = −H^{-1}∇f(x). The general sufficient decrease condition is

f(xc + λd) − f(xc) < αλ∇f(xc)^T d.     (3.4)

Here, as in (3.2), α ∈ (0, 1) is an algorithmic parameter. Typically α = 10^-4.
The stepsize reduction scheme in step 1b of Algorithm steep is crude. If β is too large, too
many stepsize reductions may be needed before a step is accepted. If β is too small, the progress
of the entire iteration may be retarded. We will address this problem in two ways. In §3.2.1 we
will construct polynomial models of f along the descent direction to predict an optimal factor
by which to reduce the step. In §3.3.3 we describe a method which remembers the steplength
from the previous iteration.
Our proofs require only the following general line search strategy. If a steplength λc has
been rejected (i.e., (3.4) fails with λ = λc), construct

λ+ ∈ [β_low λc, β_high λc],     (3.5)

where 0 < β_low ≤ β_high < 1. The choice β = β_low = β_high is the simple rule in
Algorithm steep. An exact line search, in which λ is the exact minimum of f(xc + λd), is not only
not worth the extra expense but can degrade the performance of the algorithm.
Algorithm 3.2.1. optarm(x, f, kmax)
1. For k = 1, . . . , kmax
(a) Compute f and ∇f; test for termination.
(b) Construct an spd matrix H and solve (3.3) to obtain a descent direction d.
(c) Beginning with λ = 1, repeatedly reduce λ using any strategy that satisfies (3.5)
until (3.4) holds.
(d) x = x + λd.
2. If k = kmax and the termination test is failed, signal failure.
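A sketch of Algorithm optarm for N = 2 follows (illustrative Python; the leading-minor positivity test, the fallback H = I when the true Hessian is not spd, and the convex test function are our own choices, and β_low = β_high = 1/2 gives the simple reduction rule).

```python
def solve2(H, b):
    # solve the 2x2 system H z = b by Cramer's rule
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(b[0] * H[1][1] - b[1] * H[0][1]) / det,
            (H[0][0] * b[1] - H[1][0] * b[0]) / det]

def optarm(x, f, grad, hess, kmax=100, alpha=1e-4, beta=0.5, tol=1e-8):
    for _ in range(kmax):
        g = grad(x)
        if (g[0] * g[0] + g[1] * g[1]) ** 0.5 < tol:
            return x
        H = hess(x)
        # keep the model Hessian spd: fall back to H = I (steepest descent)
        # when the symmetric 2x2 matrix fails a leading-minor positivity test
        if not (H[0][0] > 0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0):
            H = [[1.0, 0.0], [0.0, 1.0]]
        d = solve2(H, [-g[0], -g[1]])          # descent direction (3.3)
        slope = g[0] * d[0] + g[1] * d[1]      # grad f(x)^T d < 0
        fx, lam = f(x), 1.0
        trial = [x[0] + lam * d[0], x[1] + lam * d[1]]
        while f(trial) - fx >= alpha * lam * slope:   # condition (3.4)
            lam *= beta
            trial = [x[0] + lam * d[0], x[1] + lam * d[1]]
        x = trial
    return x

# convex example with an always-spd Hessian; minimizer (0, 0)
xs = optarm([2.0, 3.0],
            lambda x: x[0] ** 4 + x[0] ** 2 + x[1] ** 2,
            lambda x: [4 * x[0] ** 3 + 2 * x[0], 2 * x[1]],
            lambda x: [[12 * x[0] ** 2 + 2, 0.0], [0.0, 2.0]])
```

On this example the Hessian is always spd, so every step is a damped Newton step and the full step λ = 1 is accepted throughout.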

In the remainder of this section we prove that if the sequence of model Hessians remains
uniformly bounded and positive definite and the sequence of function values {f(xk)} is bounded
from below, then any limit point of the sequence {xk} generated by Algorithm optarm is a
point x* that satisfies the necessary conditions for optimality. We follow that analysis
with a local convergence theory that is much less impressive than that for Newton's method.
We begin our analysis with a simple estimate that follows directly from the spectral theorem
for spd matrices.
Lemma 3.2.1. Let H be spd with smallest and largest eigenvalues 0 < λs < λl. Then for
all z ∈ R^N,

λl^{-1}‖z‖² ≤ z^T H^{-1} z ≤ λs^{-1}‖z‖².



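Lemma 3.2.1 is easy to confirm numerically. The check below (illustrative Python with a hand-picked 2×2 spd matrix and vector, not an example from the text) computes both sides of the sandwich from the closed-form eigenvalues and inverse:

```python
import math

H = [[3.0, 1.0], [1.0, 2.0]]          # assumed spd example
tr = H[0][0] + H[1][1]
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
disc = math.sqrt(tr * tr - 4 * det)
lam_s, lam_l = (tr - disc) / 2, (tr + disc) / 2   # eigenvalues, 0 < lam_s < lam_l

z = (1.0, -2.0)
# z^T H^{-1} z via the explicit 2x2 inverse H^{-1} = adj(H)/det
Hinv = [[H[1][1] / det, -H[0][1] / det],
        [-H[1][0] / det, H[0][0] / det]]
w = (Hinv[0][0] * z[0] + Hinv[0][1] * z[1],
     Hinv[1][0] * z[0] + Hinv[1][1] * z[1])
quad = z[0] * w[0] + z[1] * w[1]
znorm2 = z[0] * z[0] + z[1] * z[1]

# the bounds of Lemma 3.2.1
assert znorm2 / lam_l <= quad <= znorm2 / lam_s
```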
The first step in the analysis is to use Lemma 3.2.1 to obtain a lower bound for the steplength.



Lemma 3.2.2. Assume that ∇f is Lipschitz continuous with Lipschitz constant L. Let
α ∈ (0, 1), x ∈ R^N, and let H be an spd matrix. Let λs > 0 be the smallest and λl ≥ λs the
largest eigenvalues of H. Let d be given by (3.3). Assume that ∇f(x) ≠ 0. Then (3.4) holds for
any λ such that

0 < λ ≤ 2λs(1 − α)/(Lκ(H)).     (3.6)

Proof. Let d = −H^{-1}∇f(x). By the fundamental theorem of calculus,

f(x + λd) − f(x) = ∫₀¹ ∇f(x + tλd)^T λd dt.

Hence

f(x + λd) = f(x) + λ∇f(x)^T d + λ ∫₀¹ (∇f(x + tλd) − ∇f(x))^T d dt.     (3.7)

Therefore,

f(x + λd) = f(x − λH^{-1}∇f(x)) ≤ f(x) + λ∇f(x)^T d + (λ²L/2)‖d‖².

Positivity of H, Lemma 3.2.1, and κ(H) = λl λs^{-1} imply that

‖d‖² = ‖H^{-1}∇f(x)‖² ≤ λs^{-2} ∇f(x)^T ∇f(x) ≤ −λl λs^{-2} ∇f(x)^T d = −κ(H) λs^{-1} ∇f(x)^T d.

Hence

f(x + λd) ≤ f(x) + (λ − λ²Lλs^{-1}κ(H)/2) ∇f(x)^T d,

which implies (3.4) if

α ≤ 1 − λLλs^{-1}κ(H)/2.

This is equivalent to (3.6).
Lemma 3.2.3. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {xk} be the
iteration given by Algorithm optarm with spd matrices Hk that satisfy

κ(Hk) ≤ κ̄     (3.8)

for all k. Then the steps

sk = xk+1 − xk = λk dk = −λk Hk^{-1}∇f(xk)

satisfy

λk ≥ λ̄ = 2β_low λs(1 − α)/(Lκ̄)     (3.9)

and at most

m = log(2λs(1 − α)/(Lκ̄)) / log(β_high)     (3.10)

stepsize reductions will be required.
Proof. In the context of Algorithm optarm, Lemma 3.2.2 implies that the line search will
terminate when

λ ≤ 2λs(1 − α)/(Lκ(Hk)),

if not before. The most that one can overshoot this is by a factor of β_low, which proves (3.9).
The line search will require at most m stepsize reductions, where m is the least nonnegative
integer such that

2λs(1 − α)/(Lκ(Hk)) > β_high^m.

This implies (3.10).
The convergence theorem for Algorithm optarm says that if the condition numbers of the
matrices Hk and the norms of the iterates remain bounded, then every limit point of the iteration
is a stationary point. Boundedness of the sequence of iterates implies that there will be limit
points, but there is no guarantee that there is a unique limit point.
Theorem 3.2.4. Let ∇f be Lipschitz continuous with Lipschitz constant L. Assume that
the matrices Hk are spd and that there are κ̄ and λ̄l such that κ(Hk) ≤ κ̄ and ‖Hk‖ ≤ λ̄l for
all k. Then either f(xk) is unbounded from below or

lim_{k→∞} ∇f(xk) = 0     (3.11)

and hence any limit point of the sequence of iterates produced by Algorithm optarm is a
stationary point.
In particular, if f(xk) is bounded from below and x_{k_l} → x* is any convergent subsequence
of {xk}, then ∇f(x*) = 0.
Proof. By construction, f(xk) is a decreasing sequence. Therefore, if f(xk) is bounded
from below, then lim_{k→∞} f(xk) = f* exists and

lim_{k→∞} (f(xk+1) − f(xk)) = 0.