





5.6.1 Infinite-Dimensional Problems
The results in this part of this book do not extend in a direct way to infinite-dimensional problems. One reason for this is that infinite-dimensional problems often have countably infinitely many constraints or even a continuum of constraints; hence Ω is not compact in the norm topology of the Banach space in which the problem is posed, and appeals to various types of weak continuity must be made (see [122] for an example of such arguments and [122] and [10] for applications). Moreover, identification of an active set in finitely many iterations is not always possible. A more complete account of this issue may be found in [254], [162], [161].
These are not the only complications that can arise in infinite dimensions. Even the projected gradient method presents challenges, especially if the minima fail to be nondegenerate in the sense of this book [94], [95]. Convergence behavior for discretized problems can differ from that for the continuous problem [97]. Nonequivalence of norms makes convergence results difficult to formulate and analyze for both line search [96], [254], [98] and trust region [140], [162] methods.
The functional analytic structure of many control problems can be exploited with fast multilevel methods. Both second kind multigrid methods from [138] and variants of the Atkinson–Brakhage method [9], [31] have been applied to fixed point formulations of parabolic boundary control problems in one space dimension [136], [137], [153], [162], [161].


5.7 Examples
The computations in this section were done with the MATLAB code bfgsbound. In this code the storage is limited to five pairs of vectors, and β = .1 was used in the line search.

5.7.1 Parameter ID Problem
We consider the parameter ID problem from §3.4.1 with bounds L = (2, 0)^T and U = (20, 5)^T. The initial iterate x_0 = (5, 5)^T is feasible, but the global minimum of (1, 1)^T is not. As one might expect, the lower bound constraint on (x)_1 is active at the optimal point x* ≈ (2, 1.72)^T. The termination criterion for both the gradient projection and projected BFGS algorithms was ‖u − u(1)‖ ≤ 10^{−6}.
The gradient projection algorithm failed. While the value of the objective function was correct, the projected gradient norm failed to converge and the active set was not identified. The projected BFGS iteration converged in 35 iterations. One can see the local superlinear convergence in Figure 5.1 from the plot of the projected gradient norms. The cost of the BFGS iteration was 121 function evaluations, 36 gradients, and roughly 5.3 million floating point operations.

5.7.2 Discrete Control Problem
We base the two control problem examples on the example from §1.6.1.
Our first example takes N = 2000, T = 1, y_0 = 0,

   L(y, u, t) = (y − 3)^2 + .1 × u^2 and φ(y, u, t) = uy + t^2,

with the bound constraints

   .5 ≤ u ≤ 2,

and the initial iterate u_0 = 2. We terminated the iteration when ‖u − u(1)‖ ≤ 10^{−5}. In Figure 5.2 we plot the solution of this problem. Clearly the active set is not empty for the constrained problem.




[Figure 5.3: Constrained Discrete Control Problem I. Four panels plot the projected gradient norm and the function value against iterations for the projected BFGS iteration (top row) and the gradient projection iteration (bottom row).]


[Figure 5.4: Constrained Discrete Control Problem II. Four panels plot the projected gradient norm and the function value against iterations for the projected BFGS iteration (top row) and the gradient projection iteration (bottom row).]





We solve the constrained problem with Algorithm gradproj and Algorithm bfgsoptb. In Figure 5.3 we plot the function value and the norm of the projected gradient ‖u − u(1)‖. The projected BFGS iteration required 71 function evaluations, 36 gradient evaluations, and roughly 5.6 million floating point operations, while the gradient projection iteration needed 183 function evaluations, 92 gradient evaluations, and roughly 10.4 million floating point operations.
Our second control problem example solves the same problem as in §3.4.2 using the constraints

   −206 ≤ u ≤ 206.

We terminate the iteration when ‖u − u(1)‖ ≤ 10^{−6}, which is exactly the condition used in §3.4.2 when the active set is empty. The solution to the unconstrained problem is feasible, the active set is empty, and the initial iterate is feasible. Both the gradient projection iteration and the projected BFGS iteration converge to the solution of the unconstrained problem. The constraints are not active at either the initial iterate or the final solution but are active inside the line search for the first iterate and for the second iterate. As is clear from a comparison of Figures 5.4 and 3.3, this small change has a dramatic effect on the cost of the optimization, eliminating the need for the scaling fixup (3.50). The gradient projection method, requiring 15 function evaluations, 8 gradient evaluations, and roughly 167 thousand floating point operations, is far more efficient than the steepest descent iteration reported in §3.4.2. The projected BFGS iteration was somewhat worse, needing 223 thousand operations, but only 13 function evaluations and 7 gradient evaluations. In this example the cost of maintaining the BFGS update was not compensated by a significantly reduced iteration count.


5.8 Exercises on Bound Constrained Optimization
5.8.1. Suppose that f is continuously differentiable, that x* is a nondegenerate local minimizer for problem (5.4), and all constraints are active. Show that there is δ > 0 such that
   1. if x ∈ B(δ) then x* = P(x − ∇f(x)), and
   2. the gradient projection algorithm converges in one iteration if x_0 ∈ B(δ).
5.8.2. Show that if H = I then (5.31) and (5.13) are equivalent.
5.8.3. Prove Theorem 5.5.2.
5.8.4. Verify (5.42).
5.8.5. Suppose the unconstrained problem (1.2) has a solution x* at which the standard assumptions for unconstrained optimization hold. Consider the bound constrained problem (5.3) for u and l such that x* ∈ Ω and A(x*) is not empty. Is x* a nondegenerate local minimizer? If not, how are the results in this chapter changed? You might try a computational example to see what's going on.
5.8.6. Prove Theorem 5.5.4.
5.8.7. Verify (5.51).
5.8.8. Verify (5.52).
5.8.9. Formulate a generalization of (4.33) for updating A†.
5.8.10. What would happen in the examples if we increased the number of (y, s) pairs that were stored? By how much would the BFGS cost be increased?








Part II

Optimization of Noisy Functions








Chapter 6

Basic Concepts and Goals

The algorithms in Part I cannot be implemented at all if the gradient of f is not available, either analytically or via a difference. Even if gradients are available, these algorithms are not satisfactory if f has many local minima that are not of interest. We limit our coverage to deterministic sampling algorithms which are generally applicable and are more or less easy to implement. Of these algorithms, only the DIRECT algorithm [150] covered in §8.4.2 is truly intended to be a global optimizer.
The study of optimization methods that do not require gradients is an active research area (see [227] for a survey of some of this activity), even for smooth problems [61], [62]. Even though some of the methods, such as the Nelder–Mead [204] and Hooke–Jeeves [145] algorithms, are classic, most of the convergence analysis in this part of the book was done after 1990.
The algorithms and theoretical results that we present in this part of the book are for objective functions that are perturbations of simple, smooth functions. The surfaces in Figure 6.1 illustrate this problem. The optimization landscape on the left of Figure 6.1, taken from [271], arose in a problem in semiconductor design. The landscape on the right is a simple perturbation of a convex quadratic.

[Figure 6.1: Optimization Landscapes. Left: the landscape from the semiconductor design problem of [271]; right: a simple perturbation of a convex quadratic.]

We do not discuss algorithms that explicitly smooth the objective function or apply a filter, such as the ones in [168] and [187]. For general problems, these must sample the variable space in some way, for example by performing high-dimensional integration, and are too costly. However, in some special cases these integrals can be performed analytically, and impressive results for special-purpose filtering algorithms for computational chemistry have been reported in, for example, [196] and [277]. Nor do we discuss analog methods (see [149] for a well-known


example).
We also omit stochastic methods like the special-purpose methods discussed in [38] and [39], or more radical general-purpose global optimization algorithms, such as simulated annealing [166] (see [1] and [265] for surveys of recent work), interval methods [152], or genetic algorithms [143], [144] (see [246] or [123] for a survey), all of which are random search algorithms or random to some extent. These probabilistic methods, however, should be considered when the more conservative algorithms such as the ones in this part of the book fail.


6.1 Problem Statement
Consider an objective function f that is a perturbation of a smooth function fs by a small function φ:

(6.1)   f(x) = fs(x) + φ(x).

Small oscillations in φ could cause f to have several local minima that would trap any conventional gradient-based algorithm. The perturbation φ can, in general, be random or based on the output of an experiment [250], and may not return the same value when called twice with the same argument. Hence φ need not even be a function. We assume that φ is everywhere defined and bounded to make the statement of the results simpler.


6.2 The Simplex Gradient
Most of the algorithms in this part of the book examine a simplex of points in R^N at each iteration and then change the simplex in response. In this section we develop the tools needed to describe and analyze these algorithms. The fundamental idea is that many sampling algorithms require enough information to approximate the gradient by differences and that the accuracy of that difference approximation can be used to analyze the convergence. However, for problems of the form (6.1), one must take care not to make the difference increments so small as to attempt to differentiate the noise.
The ideas in this section were originally used in [155] to analyze the Nelder–Mead [204] algorithm, which we discuss in §8.1. However, the ideas can be applied to several classes of algorithms, and we follow the development in [29] in this section.
Definition 6.2.1. A simplex S in R^N is the convex hull of N + 1 points {x_j}_{j=1}^{N+1}; x_j is the jth vertex of S. We let V (or V(S)) denote the N × N matrix of simplex directions

   V(S) = (x_2 − x_1, x_3 − x_1, . . . , x_{N+1} − x_1) = (v_1, . . . , v_N).

We say S is nonsingular if V is nonsingular. The simplex diameter diam(S) is

   diam(S) = max_{1 ≤ i,j ≤ N+1} ‖x_i − x_j‖.

We will refer to the l2 condition number κ(V) of V as the simplex condition. We let δ(f : S) denote the vector of objective function differences

   δ(f : S) = ( f(x_2) − f(x_1), f(x_3) − f(x_1), . . . , f(x_{N+1}) − f(x_1) )^T.

We will not use the simplex diameter directly in our estimates or algorithms. Rather we will use two oriented lengths

   σ_+(S) = max_{2 ≤ j ≤ N+1} ‖x_1 − x_j‖ and σ_−(S) = min_{2 ≤ j ≤ N+1} ‖x_1 − x_j‖.

Clearly,

   σ_+(S) ≤ diam(S) ≤ 2σ_+(S).




6.2.1 Forward Difference Simplex Gradient

Definition 6.2.2. Let S be a nonsingular simplex with vertices {x_j}_{j=1}^{N+1}. The simplex gradient D(f : S) is

   D(f : S) = V^{−T} δ(f : S).
Note that the matrix of simplex directions and the vector of objective function differences depend on which of the vertices is labeled x_1. Most of the algorithms we consider in this part of the book use a vertex ordering or sample on a regular stencil. In this way the algorithms, in one way or another, use a simplex gradient.
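The computation itself is a single linear solve against V^T. The following MATLAB fragment is a minimal sketch, not part of the software collection that accompanies the book; the function name simpgrad and the convention that the vertices are stored as the columns of an N × (N + 1) array are our own.

    function D = simpgrad(f, X)
    % SIMPGRAD  Forward simplex gradient D(f : S) = V^{-T} delta(f : S).
    % X is N x (N+1); column j holds the vertex x_j and column 1 holds x_1.
    [N, M] = size(X);                       % M = N + 1 vertices
    V = X(:, 2:M) - X(:, 1) * ones(1, N);   % simplex directions v_j
    fx = zeros(M, 1);
    for j = 1:M
        fx(j) = f(X(:, j));                 % sample f at the vertices
    end
    delta = fx(2:M) - fx(1);                % objective function differences
    D = V' \ delta;                         % solve V^T D = delta(f : S)

For the right simplices S(x, h) used by implicit filtering in Chapter 7, V = hI and the solve reduces to a vector of difference quotients.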
This definition of simplex gradient is motivated by the first-order estimate in Lemma 6.2.1.

Lemma 6.2.1. Let S be a simplex. Let ∇f be Lipschitz continuous in a neighborhood of S with Lipschitz constant 2K_f. Then there is K > 0, depending only on K_f, such that

(6.2)   ‖∇f(x_1) − D(f : S)‖ ≤ Kκ(V)σ_+(S).

Proof. Our smoothness assumptions on f and Taylor's theorem imply that for all 2 ≤ j ≤ N + 1,

   |f(x_1) − f(x_j) + v_j^T ∇f(x_1)| ≤ K_f ‖v_j‖^2 ≤ K_f σ_+(S)^2.

Hence

   ‖δ(f : S) − V^T ∇f(x_1)‖ ≤ N^{1/2} K_f σ_+(S)^2

and hence, setting K = N^{1/2} K_f,

   ‖∇f(x_1) − D(f : S)‖ ≤ K ‖V^{−T}‖ σ_+(S)^2.

The conclusion follows from the fact that σ_+(S) ≤ ‖V‖.
Search algorithms are not intended, of course, for smooth problems. Minimization of objective functions of the form (6.1) is one of the applications of these methods. A first-order estimate that takes perturbations into account is our next result.
We will need to measure the perturbations on each simplex. To that end we define for any set T

   ‖φ‖_T = sup_{x ∈ T} |φ(x)|.

A first-order estimate also holds for the simplex gradient of an objective function that satisfies (6.1).
Lemma 6.2.2. Let S be a nonsingular simplex. Let f satisfy (6.1) and let ∇fs be Lipschitz continuous in a neighborhood of S with Lipschitz constant 2K_s. Then there is K > 0, depending only on K_s, such that

(6.3)   ‖∇fs(x_1) − D(f : S)‖ ≤ Kκ(V) ( σ_+(S) + ‖φ‖_S / σ_+(S) ).

Proof. Lemma 6.2.1 (applied to fs) implies

   ‖∇fs(x_1) − D(fs : S)‖ ≤ K_s N^{1/2} κ(V) σ_+(S).



Now, since ‖δ(φ : S)‖ ≤ 2N^{1/2} ‖φ‖_S and σ_+(S) ≤ ‖V‖,

   ‖D(f : S) − D(fs : S)‖ = ‖V^{−T} ( δ(f : S) − δ(fs : S) )‖ = ‖V^{−T} δ(φ : S)‖
      ≤ 2N^{1/2} ‖V^{−T}‖ ‖φ‖_S ≤ 2N^{1/2} κ(V) ‖φ‖_S / σ_+(S).

This completes the proof with K = N^{1/2} K_s + 2N^{1/2}.
The constants K in (6.2) and (6.3) depend on S only through the Lipschitz constants of fs and ∇fs in a neighborhood of S. We will express that dependence as K = K(S) when needed.
The algorithms in this section are most profitably applied to problems of the form (6.1), and the goal is to extract as much information as possible from the smooth part fs of f without wasting effort in a futile attempt to minimize the noise. In order to formulate our goal for convergence clearly, we explore, in Lemma 6.2.3, the consequences of a small simplex gradient in the special (and not uncommon) case that the amplitude of the noise is small.
Lemma 6.2.3. Let f satisfy (6.1) and let ∇fs be continuously differentiable in a compact set Ω ⊂ R^N. Assume that fs has a unique critical point x* in Ω. Then there is K_Ω > 0 such that for any simplex S ⊂ Ω with vertices {x_j}_{j=1}^{N+1},

   ‖x_1 − x*‖ ≤ K_Ω ( ‖D(f : S)‖ + κ(V) ( σ_+(S) + ‖φ‖_S / σ_+(S) ) ).

Proof. The compactness of Ω and our smoothness assumptions on fs imply that there is β_0 > 0 such that

   ‖∇fs(x)‖ ≥ β_0 ‖x − x*‖

for all x ∈ Ω. We apply (6.3) to obtain

   ‖x_1 − x*‖ ≤ β_0^{−1} ‖∇fs(x_1)‖ ≤ β_0^{−1} ( ‖D(f : S)‖ + Kκ(V) ( σ_+(S) + ‖φ‖_S / σ_+(S) ) ).

This completes the proof with K_Ω = β_0^{−1} max(1, K).
By sampling in an organized way, simplex-based algorithms, some directly and some implicitly, attempt to drive the simplex gradient to a small value by changing the size of the simplices over which f is sampled. The motion of the simplices and the scheme for changing their size (especially the reduction in size) account for the differences among the algorithms. Theorem 6.2.4, a direct consequence of Lemma 6.2.3, quantifies this. We will consider a sequence of uniformly well-conditioned simplices. Such simplices are generated by several of the algorithms we will study later.
Theorem 6.2.4. Let f satisfy (6.1) and let ∇fs be continuously differentiable in a compact set Ω ⊂ R^N. Assume that fs has a unique critical point x* in Ω. Let S^k be a sequence of simplices having vertices {x_j^k}_{j=1}^{N+1}. Assume that there is M such that

   S^k ⊂ Ω and κ(V(S^k)) ≤ M for all k.

Then,
1. if

   lim_{k→∞} σ_+(S^k) = 0,  lim_{k→∞} ‖φ‖_{S^k} / σ_+(S^k) = 0,

and lim sup_{k→∞} ‖D(f : S^k)‖ = ε for some ε > 0, then there is K_S > 0 such that

   lim sup_{k→∞} ‖x* − x_1^k‖ ≤ K_S ε;

2. if, for some ε > 0,

   lim sup_{k→∞} ‖φ‖_{S^k} ≤ ε^2,  lim inf_{k→∞} σ_+(S^k) ≥ ε,  and  lim inf_{k→∞} ‖D(f : S^k)‖ ≤ ε,

then there is K_S > 0 such that

   lim sup_{k→∞} ‖x* − x_1^k‖ ≤ K_S ( ε + lim sup_{k→∞} σ_+(S^k) ).


6.2.2 Centered Difference Simplex Gradient
In this section we define the centered difference simplex gradient and prove a second-order estimate. We will then prove two variants of Theorem 6.2.4, one to show how the role of the noise φ differs from that in the one-sided derivative case and a second to quantify how the values of f on the stencil can be used to terminate an iteration.
Definition 6.2.3. Let S be a nonsingular simplex in R^N with vertices {x_j}_{j=1}^{N+1} and simplex directions v_j = x_{j+1} − x_1. The reflected simplex R = R(S) is the simplex with vertices x_1 and

   r_j = x_1 − v_j for j = 1, . . . , N.

The central simplex gradient D_C(f : S) is

   D_C(f : S) = ( D(f : S) + D(f : R) ) / 2 = V^{−T} ( δ(f : S) − δ(f : R) ) / 2.

For example, if N = 1 and x_2 = x_1 + h, then r_1 = x_1 − h. Hence

   D(f : S) = ( f(x_1 + h) − f(x_1) ) / h  and  D(f : R) = ( f(x_1 − h) − f(x_1) ) / (−h).

Therefore,

   D_C(f : S) = D_C(f : R) = ( f(x_1 + h) − f(x_1 − h) ) / (2h)

is the usual central difference.
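A MATLAB sketch of the central simplex gradient follows. It is ours, not code from the book's software collection; it samples f at x_1 and x_1 ± v_j exactly as the definition requires.

    function DC = censimpgrad(f, x, V)
    % CENSIMPGRAD  Central simplex gradient D_C(f : S) at the base vertex x,
    % with the simplex directions stored in the columns of V.
    N = size(V, 2);
    dS = zeros(N, 1);
    dR = zeros(N, 1);
    fx = f(x);
    for j = 1:N
        dS(j) = f(x + V(:, j)) - fx;        % delta(f : S)
        dR(j) = f(x - V(:, j)) - fx;        % delta(f : R)
    end
    DC = V' \ ((dS - dR) / 2);              % V^{-T} (delta(f:S) - delta(f:R)) / 2

With V = hI this reduces to the vector of central differences ( f(x + h v_j) − f(x − h v_j) ) / (2h).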
Lemmas 6.2.5 and 6.2.6 are the second-order analogues of Lemmas 6.2.1 and 6.2.2.
Lemma 6.2.5. Let S be a nonsingular simplex and let ∇^2 f be Lipschitz continuous in a neighborhood of S ∪ R(S) with Lipschitz constant 3K_C. Then there is K > 0 such that

(6.4)   ‖∇f(x_1) − D_C(f : S)‖ ≤ Kκ(V)σ_+(S)^2.

Proof. The Lipschitz continuity assumption implies that for all 1 ≤ j ≤ N,

   |f(x_{j+1}) − f(r_j) − 2∇f(x_1)^T v_j| ≤ K_C ‖v_j‖^3 ≤ K_C σ_+(S)^3.

As in the proof of Lemma 6.2.1 we have

   ‖δ(f : S) − δ(f : R) − 2V^T ∇f(x_1)‖ ≤ N^{1/2} K_C σ_+(S)^3,

and hence the result follows with K = N^{1/2} K_C.
Lemma 6.2.6. Let S be a nonsingular simplex. Let f satisfy (6.1) and let ∇^2 fs be Lipschitz continuous in a neighborhood of S ∪ R(S) with Lipschitz constant 3K_{Cs}. Then there is K > 0, depending only on K_{Cs}, such that

(6.5)   ‖∇fs(x_1) − D_C(f : S)‖ ≤ Kκ(V) ( σ_+(S)^2 + ‖φ‖_S / σ_+(S) ).

Proof. This proof is very similar to that of Lemma 6.2.2 and is left to the reader.
The quality of the information that can be obtained from the central simplex gradient is
higher than that of the forward. The difference in practice can be dramatic, as the examples
in §7.6 illustrate. The consequences of a small central simplex gradient follow directly from
Lemma 6.2.6.
Lemma 6.2.7. Let f satisfy (6.1) and let ∇^2 fs be continuously differentiable in a compact set Ω ⊂ R^N. Assume that fs has a unique critical point x* in Ω. Then there is K_Ω > 0 such that if a simplex S and its reflection R(S) are both contained in Ω then

   ‖x_1 − x*‖ ≤ K_Ω ( ‖D_C(f : S)‖ + κ(V) ( σ_+(S)^2 + ‖φ‖_S / σ_+(S) ) ).

Lemma 6.2.7 is all one needs to conclude convergence from a sequence of small central
simplex gradients.
Theorem 6.2.8. Let f satisfy (6.1) and let ∇^2 fs be continuously differentiable in a compact set Ω ⊂ R^N. Assume that fs has a unique critical point x* in Ω. Let S^k be a sequence of simplices having vertices {x_j^k}_{j=1}^{N+1}. Assume that there is M such that

   S^k, R(S^k) ⊂ Ω and κ(V(S^k)) ≤ M for all k.

Then,
1. if

   lim_{k→∞} σ_+(S^k) = 0,  lim_{k→∞} ‖φ‖_{S^k} / σ_+(S^k) = 0,

and lim sup_{k→∞} ‖D_C(f : S^k)‖ = ε for some ε > 0, then there is K_S > 0 such that

   lim sup_{k→∞} ‖x* − x_1^k‖ ≤ K_S ε;

2. if, for some ε > 0,

   lim sup_{k→∞} ‖φ‖_{S^k} ≤ ε^3,  lim inf_{k→∞} σ_+(S^k) ≥ ε,  and  lim inf_{k→∞} ‖D_C(f : S^k)‖ ≤ ε^2,

then there is K_S > 0 such that

   lim sup_{k→∞} ‖x* − x_1^k‖ ≤ K_S ( ε + lim sup_{k→∞} σ_+(S^k) )^2.





Theorem 6.2.8, like Theorem 6.2.4, motivates using a small simplex gradient as a test for convergence. Suppose ‖φ‖_∞ ≤ ε and an algorithm generates sequences of simplices whose vertices are intended to approximate a minimizer of fs. We can use the results in §2.3.1 to conclude that simplices with σ_+(S) << ε^{1/2} will result in inaccurate forward difference gradients and those with σ_+(S) << ε^{2/3} in inaccurate central difference gradients. This indicates that the central simplex gradient will be less sensitive to noise than the forward. While this is not usually critical in computing a difference Hessian, where the loss of accuracy may cause slow convergence, it can cause failure of the iteration if one is computing a difference gradient.
If one wants to terminate the algorithm when the simplex gradient is small, say, ≤ τ, a rough estimate of the minimal possible value of τ is τ = O(ε^{1/2}) for a forward difference simplex gradient and τ = O(ε^{2/3}) for a central simplex gradient. For example, if the noise has amplitude ε = 10^{−6}, one cannot expect simplex gradient norms much below 10^{−3} with forward differences or 10^{−4} with central differences.
Moreover, if one is using a centered difference, one has information on the values of f at enough points to make an important qualitative judgment. In order to evaluate a central simplex gradient, f must be sampled at x_1 and x_1 ± v_j for 1 ≤ j ≤ N. If

   f(x_1) ≤ f(x_1 ± v_j) for all 1 ≤ j ≤ N,

then one can question the validity of using the simplex gradient as a descent direction or as a measure of stationarity. We call this stencil failure. We will use stencil failure as a termination criterion in most of the algorithms we discuss in this part of the book. Our basis for that is a result from [29], which only requires differentiability of fs.
Theorem 6.2.9. Let S be a nonsingular simplex such that for some µ_− ∈ (0, 1) and κ_+ > 0,

(6.6)   κ(V) ≤ κ_+ and x^T V V^T x ≥ µ_− σ_+(S)^2 ‖x‖^2 for all x.

Let f satisfy (6.1) and let ∇fs be Lipschitz continuous in a ball B of radius 2σ_+(S) about x_1. Assume that

(6.7)   f(x_1) < min_j { f(x_1 ± v_j) }.

Then, if K is the constant from Lemma 6.2.2,

(6.8)   ‖∇fs(x_1)‖ ≤ 8µ_−^{−1} Kκ_+ ( σ_+(S) + ‖φ‖_B / σ_+(S) ).


Proof. Let R(S), the reflected simplex, have vertices x_1 and {r_j}_{j=1}^N. (6.7) implies that each component of δ(f : S) and δ(f : R) is positive. Now since

   V = V(S) = −V(R),

we must have

(6.9)   0 < δ(f : S)^T δ(f : R) = ( V^T V^{−T} δ(f : S) )^T ( V(R)^T V(R)^{−T} δ(f : R) ) = −D(f : S)^T V V^T D(f : R).

We apply Lemma 6.2.2 to both D(f : S) and D(f : R) to obtain

   D(f : S) = ∇fs(x_1) + E_1 and D(f : R) = ∇fs(x_1) + E_2,

where, since κ(V) = κ(V(R)) ≤ κ_+,

   ‖E_k‖ ≤ Kκ_+ ( σ_+(S) + ‖φ‖_B / σ_+(S) ).

Since ‖V‖ ≤ 2σ_+(S) we have by (6.9)

(6.10)   ∇fs(x_1)^T V V^T ∇fs(x_1) ≤ 4σ_+(S)^2 ‖∇fs(x_1)‖ ( ‖E_1‖ + ‖E_2‖ ) + 4σ_+(S)^2 ‖E_1‖ ‖E_2‖.

The assumptions of the theorem give a lower estimate of the left side of (6.10),

   ∇fs(x_1)^T V V^T ∇fs(x_1) ≥ µ_− σ_+(S)^2 ‖∇fs(x_1)‖^2.

Hence

   ‖∇fs(x_1)‖^2 ≤ b ‖∇fs(x_1)‖ + c,

where, using (6.10),

   b = 8µ_−^{−1} Kκ_+ ( σ_+(S) + ‖φ‖_B / σ_+(S) )

and

   c = 4µ_−^{−1} (Kκ_+)^2 ( σ_+(S) + ‖φ‖_B / σ_+(S) )^2 = (µ_−/16) b^2.

So b^2 − 4c = b^2 (1 − µ_−/4), and the quadratic formula then implies that

   ‖∇fs(x_1)‖ ≤ ( b + √(b^2 − 4c) ) / 2 = b ( 1 + √(1 − µ_−/4) ) / 2 ≤ b,

as asserted.


6.3 Examples
Our examples are selected to represent a variety of problems that can be attacked by the methods in this part of the book and, at the same time, are easy for the reader to implement. Many of the problems to which these methods have been applied have complex objective functions and have been solved as team efforts [107], [250], [121], [70], [69]. In many such cases the objective function is not even available as a single subroutine, as the optimizer, simulator, and design tool are one package. Hence, the examples we present in this part of the book are even more artificial than the ones in the first part. The cost of an evaluation of f is much less in these examples than it is in practice.

6.3.1 Weber's Problem
Our discussion of this problem is based on [182]. Weber's problem is to locate a central facility (a warehouse or factory, for example) so that the total cost associated with distribution to several demand centers is minimized. The model is that the cost is proportional to the distance from the facility. The proportionality constant may be positive, reflecting transportation costs, or negative, reflecting environmental concerns.
If the locations of the demand centers are {z_i} ⊂ R^2 and the corresponding weights are {w_i}, then the objective function is

(6.11)   f(x) = Σ_i w_i ‖x − z_i‖ = Σ_i w_i √( [(x)_1 − (z_i)_1]^2 + [(x)_2 − (z_i)_2]^2 ).

We will assume that

   Σ_i w_i > 0,

so that a global optimum exists. If Σ_i w_i < 0 then inf f = −∞ and there is no global optimum.
Weber's problem is not differentiable at x = z_i because the square root function is not differentiable at 0. A gradient-based algorithm, applied in a naive way, will have difficulty with this problem. There are special-purpose algorithms (see [182] for a survey) for Weber's problem, especially if all the weights are positive. Our main interest is in the case where at least one weight is negative. In that case there may be multiple local minima.
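The objective (6.11) is essentially one line of MATLAB. The encoding below is our sketch, not code from the book's software collection; the demand centers are stored as the columns of a 2 × m array Z and w is a row vector of weights.

    function fw = weber(x, w, Z)
    % WEBER  Objective (6.11): weighted sum of distances from x to the
    % demand centers stored in the columns of Z.
    m = size(Z, 2);
    fw = w * sqrt(sum((Z - x * ones(1, m)).^2, 1))';

For the first example below one would take w = [2 4 -5] and Z = [2 90 43; 42 11 88].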
We will consider three examples. The first, and simplest, is from [182]. This example has three demand centers with

   w = (2, 4, −5)^T and z_1 = (2, 42)^T, z_2 = (90, 11)^T, z_3 = (43, 88)^T.

The global minimum is at x* = (90, 11)^T, at which the gradient is not defined. The complex contours near the minimizer in Figure 6.2 illustrate the difficulty of the problem.
[Figure 6.2: Contour/Surface for Weber's Function: First Example]


Our second example has two local minimizers, at (−10, −10)^T and (25, 30)^T, with the global minimizer at (25, 30)^T. There are four demand centers with

   w = (2, −4, 2, 1)^T and z_1 = (−10, −10)^T, z_2 = (0, 0)^T, z_3 = (5, 8)^T, z_4 = (25, 30)^T.
See Figure 6.3.
Our third example adds the oscillatory function

   φ(x) = sin(.0035 x^T x) + 5 sin(.003 (x − y)^T (x − y))

to the second example, where y = (−20, 0)^T. This complicates the optimization landscape significantly, as the surface and contour plots in Figure 6.4 show.

6.3.2 Perturbed Convex Quadratics
The sum of a simple convex quadratic and a low-amplitude, high-frequency perturbation will serve as a model problem for all the algorithms in this section. For example, the function graphed on the right in Figure 6.1,

   f(x) = 2x^2 ( 1 + .75 cos(80x)/12 ) + cos(100x)^2 / 24,

is one of the examples in [120]. Our general form will be

(6.12)   f(x) = (x − ξ_0)^T H (x − ξ_0) ( 1 + a_1 cos( b_1^T (x − ξ_1) + c_1 (x − ξ_1)^T (x − ξ_1) ) )
                + a_2 ( 1 + cos( b_2^T (x − ξ_2) + c_2 (x − ξ_2)^T (x − ξ_2) ) ) + a_3 |rand|,





[Figure 6.3: Contour/Surface for Weber's Function: Second Example]

[Figure 6.4: Contour/Surface plots for Weber's Function: Third Example]


where {ξ_j}, {a_j}, {b_j}, {c_j} are given and rand is a random number generator. f has been designed so that the minimum value is O(a_1 + a_2 + a_3). The unperturbed case a_1 = a_2 = a_3 = 0 is also of interest for many of the algorithms in this part of the book.
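As an illustration, the scalar example from [120] shown on the right of Figure 6.1 and one instance of (6.12) can be coded as follows. This is our sketch; the parameter values in f2 are illustrative choices, not ones used in the text.

    % Scalar example from [120] (right side of Figure 6.1).
    f1 = @(x) 2*x.^2 .* (1 + .75*cos(80*x)/12) + cos(100*x).^2/24;

    % An instance of (6.12) with H = I, xi_0 = xi_1 = xi_2 = 0, b_1 = b_2 = 0,
    % and c_1 = c_2 = 1; the |rand| term makes f2 nondeterministic, so two
    % calls with the same argument need not return the same value.
    a = [.01, .01, .001];
    f2 = @(x) (x'*x) * (1 + a(1)*cos(x'*x)) ...
         + a(2)*(1 + cos(x'*x)) + a(3)*abs(rand);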


6.3.3 Lennard–Jones Problem
The objective function is a simple model of the potential energy in a molecule of identical atoms. Assume that there are M atoms and that ξ_i ∈ R^3 is the position of the ith atom. Letting

   d_ij = ‖ξ_i − ξ_j‖

and

   x = (ξ_1^T, . . . , ξ_M^T)^T ∈ R^N,

where N = 3M, we have that the Lennard–Jones energy function is

(6.13)   f(x) = Σ_{i=1}^M Σ_{j=1}^{i−1} ( d_ij^{−12} − 2 d_ij^{−6} ).

f has many local minimizers (O(e^{M^2}) is one conjecture [142]) and the values at the minimizers are close. Hence, the Lennard–Jones function does not conform to the noisy perturbation of a smooth function paradigm. The reader is asked in some of the exercises to see how the methods perform.
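A direct MATLAB transcription of (6.13) follows (our sketch; it works with squared distances to avoid a square root):

    function E = lennard_jones(x)
    % LENNARD_JONES  Energy (6.13) for x = (xi_1^T, ..., xi_M^T)^T in R^(3M).
    M = length(x) / 3;
    Xi = reshape(x, 3, M);                      % column i is the position xi_i
    E = 0;
    for i = 1:M
        for j = 1:i-1
            d2 = sum((Xi(:, i) - Xi(:, j)).^2); % d_ij^2
            E = E + d2^(-6) - 2 * d2^(-3);      % d_ij^{-12} - 2 d_ij^{-6}
        end
    end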



6.4 Exercises on Basic Concepts
6.4.1. Show that if w_i > 0 for all i then Weber's problem has a unique local minimum.
6.4.2. Prove Lemma 6.2.6.
6.4.3. Try to minimize the Lennard–Jones functional using some of the algorithms from the first part of the book. Vary the initial iterate and M. Compare your best results with those in [142], [40], and [210].








Chapter 7

Implicit Filtering

7.1 Description and Analysis of Implicit Filtering
The implicit filtering algorithm was originally formulated in [270], [251], and [271] as a difference-gradient implementation of the gradient projection algorithm [18] in which the difference increment is reduced in size as the iteration progresses. A different formulation for unconstrained problems with certain convexity properties was introduced at about the same time in [279]. From the point of view of this book, the simplex gradient is used in a direct way. The algorithmic description and analysis in this chapter use the results from §6.2 directly. We will focus on unconstrained problems and derive the convergence results that implicit filtering shares with the search algorithms in Chapter 8.
Implicit filtering, by using an approximate gradient directly, offers the possibility of improved performance with quasi-Newton methods and can be easily applied to bound constrained problems. We explore these two possibilities in §§7.2 and 7.4.
In its simplest unconstrained form, implicit filtering is the steepest descent algorithm with difference gradients, where the difference increment varies as the iteration progresses. Because the gradient is only an approximation, the computed steepest descent direction may fail to be a descent direction and the line search may fail. In this event, the difference increment is reduced.
For a given x ∈ R^N and h > 0 we let the simplex S(x, h) be the right simplex from x with edges having length h. Hence the vertices are x and x + hv_i for 1 ≤ i ≤ N, where v_i is the unit vector in the ith coordinate direction, so V = hI and κ(V) = 1. The performance of implicit filtering with a central difference gradient is far superior to that with the forward difference gradient [120], [187], [250]. We will, therefore, use centered differences in the discussion. We illustrate the performance of forward difference gradients in §7.6.
We set

   ∇_h f(x) = D_C(f : S(x, h)).

We use a simple Armijo [7] line search and demand that the sufficient decrease condition

(7.1)   f(x − λ∇_h f(x)) − f(x) < −αλ ‖∇_h f(x)‖^2

holds (compare with (3.4)) for some α > 0.
Our central difference steepest descent algorithm fdsteep terminates when

(7.2)   ‖∇_h f(x)‖ ≤ τh

for some τ > 0, when more than pmax iterations have been taken, after a stencil failure, or when the line search fails by taking more than amax backtracks. Even the failures of fdsteep


can be used to advantage by triggering a reduction in h. The line search parameters α, β and the parameter τ in the termination criterion (7.2) do not affect the convergence analysis that we present here but can affect performance.
Algorithm 7.1.1. fdsteep(x, f, pmax, τ, h, amax)
1. For p = 1, . . . , pmax
   (a) Compute f and ∇_h f; terminate if (6.7) or (7.2) hold.
   (b) Find the least integer 0 ≤ m ≤ amax such that (7.1) holds for λ = β^m. If no such m exists, terminate.
   (c) x = x − λ∇_h f(x).

Algorithm fdsteep will terminate after finitely many iterations because of the limits on the number of iterations and the number of backtracks. If the set {x | f(x) ≤ f(x_0)} is bounded then the iterations will remain in that set. Implicit filtering calls fdsteep repeatedly, reducing h after each termination of fdsteep. Aside from the data needed by fdsteep, one must provide a sequence of difference increments, called scales in [120].
Algorithm 7.1.2. imfilter1(x, f, pmax, τ, {h_k}, amax)
1. For k = 0, . . .
   Call fdsteep(x, f, pmax, τ, h_k, amax)
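The loop structure is compact in MATLAB. The sketch below is a simplified stand-in for the imfil.m distributed with the book, not that code itself; censimpgrad is the hypothetical helper from §6.2.2, the stencil-failure test (6.7) is omitted for brevity, and alpha = 1.e-4 is an illustrative choice for the parameter in (7.1).

    function x = imfilter1(x, f, pmax, tau, hscales, amax)
    % IMFILTER1  Implicit filtering: one call to fdsteep for each scale.
    for k = 1:length(hscales)
        x = fdsteep(x, f, pmax, tau, hscales(k), amax);
    end

    function x = fdsteep(x, f, pmax, tau, h, amax)
    % FDSTEEP  Steepest descent with central difference gradients at scale h.
    alpha = 1.e-4; beta = .5;
    N = length(x);
    for p = 1:pmax
        g = censimpgrad(f, x, h * eye(N));   % nabla_h f(x); kappa(V) = 1
        if norm(g) <= tau * h, return; end   % termination test (7.2)
        lambda = 1; m = 0;                   % Armijo backtracking for (7.1)
        while f(x - lambda * g) - f(x) >= -alpha * lambda * norm(g)^2
            lambda = beta * lambda; m = m + 1;
            if m > amax, return; end         % line search failure: move to next scale
        end
        x = x - lambda * g;
    end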

The convergence result follows from the second-order estimate (6.5), the consequences of a stencil failure, Theorem 6.2.9, and the equalities h_k = σ_+(S^k) and κ(V^k) = 1. A similar result for forward differences would follow from (6.3).
Theorem 7.1.1. Let f satisfy (6.1) and let ∇fs be Lipschitz continuous. Let h_k → 0, let {x^k} be the implicit filtering sequence, and let S^k = S(x^k, h_k). Assume that (7.1) holds (i.e., there is no line search failure) for all but finitely many k. Then if

(7.3)   lim_{k→∞} ( h_k + h_k^{−1} ‖φ‖_{S^k} ) = 0

holds, any limit point of the sequence {x^k} is a critical point of fs.
Proof. If either (7.1) or (6.7) hold for all but finitely many k then, as is standard,

   ∇_{h_k} f(x^k) = D_C(f : S^k) → 0.

Hence, using (7.3) and Lemma 6.2.2,

   ∇fs(x^k) → 0,

as asserted.


7.2 Quasi-Newton Methods and Implicit Filtering
The unique feature of implicit filtering is the possibility, for problems that are sufficiently smooth near a minimizer, of obtaining faster convergence in the terminal phase of the iteration by using a quasi-Newton update of a model Hessian. This idea was first proposed in [250] and [120].
We begin with a quasi-Newton form of Algorithm fdsteep. In this algorithm a quasi-Newton approximation to the Hessian is maintained and the line search is based on the quasi-Newton direction

   d = −H^{−1} ∇_h f(x),



terminating when either

(7.4)   f(x + λd) − f(x) < αλ ∇_h f(x)^T d

holds or too many stepsize reductions have been taken. With the application to implicit filtering in mind, Algorithm fdquasi replaces the quasi-Newton H with the identity matrix when the line search fails.
Algorithm 7.2.1. fdquasi(x, f, H, pmax, τ, h, amax)
1. For p = 1, . . . , pmax
   (a) Compute f, ∇_h f, and d = −H^{−1} ∇_h f; terminate if (7.2) holds.
   (b) Find the least integer 0 ≤ m ≤ amax such that (7.4) holds for λ = β^m.
   (c) x = x + λd.
   (d) Update H with a quasi-Newton formula.

In the context of implicit filtering, where N is small, the full quasi-Newton Hessian or its inverse is maintained throughout the iteration. Our MATLAB codes store the model Hessian.

Algorithm 7.2.2. imfilter2(x, f, pmax, τ, {h_k}, amax)
1. H = I.
2. For k = 0, . . .
   Call fdquasi(x, f, H, pmax, τ, h_k, amax).

In [250] and [120] the SR1 method was used because it performed somewhat better than the BFGS method in the context of a particular application. The examples in §7.6 show the opposite effect, and both methods have been successfully used in practice.


7.3 Implementation Considerations
Implicit filtering has several iterative parameters and requires some algorithmic decisions in its implementation. The parameters pmax, amax, and β play the same role that they do in any line search algorithm. In our MATLAB code imfil.m, which we used for all the computations reported in this book, we set pmax = 200 × n, amax = 10, and β = 1/2.
The performance of implicit filtering can be sensitive to the value of τ [250], with small values of τ leading to stagnation and values of τ that are too large leading to premature termination of fdquasi. Using stencil failure as a termination criterion reduces the sensitivity to small values of τ, and we use τ = .01 in the computations.
The sequence of scales is at best a guess at the level of the noise in the problem. If several of the scales are smaller than the level of the noise, the line search will fail immediately and work at these scales will be wasted. Our implementation attempts to detect this by terminating the optimization if x is unchanged for three consecutive scales.
The simplex gradient may be a very poor approximation to the gradient. In some such cases the function evaluation at a trial point may fail to return a value [250], and one must either trap this failure and return an artificially large value, impose bound constraints, or impose a limit on the size of the step. In our computations we take the latter approach and limit the stepsize to 10h by setting

(7.5)   d = −H^{−1} ∇_h f(x)                                      if ‖H^{−1} ∇_h f(x)‖ ≤ 10h,
        d = −10h H^{−1} ∇_h f(x) / ‖H^{−1} ∇_h f(x)‖              otherwise.
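In MATLAB the safeguard (7.5) is two lines. The fragment below is a sketch of the logic, not an excerpt from imfil.m; g holds ∇_h f(x).

    d = -(H \ g);                       % quasi-Newton direction -H^{-1} nabla_h f(x)
    if norm(d) > 10 * h
        d = (10 * h / norm(d)) * d;     % limit the step to length 10h, as in (7.5)
    end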



The choice of a quasi-Newton method to use with implicit filtering is an area of active research [56], [55]. Both SR1 and BFGS have been used, with SR1 performing modestly better in some applications with bound constraints [270], [251], [271], [250], [55]. The implementation of implicit filtering in the collection of MATLAB codes imfil.m uses BFGS as the default but has SR1 as an option. We found BFGS with central differences to be consistently better in the preparation of the (unconstrained!) computational examples in this book.


7.4 Implicit Filtering for Bound Constrained Problems
Implicit filtering was initially designed as an algorithm for bound constrained problems [250], [120]. The bound constrained version we present here is simply a projected quasi-Newton algorithm like the one presented in §5.5.3. There are other approaches to the implementation and no best approach has emerged. We refer the reader to [120] and [55] for discussions of the options.
We begin with scaling and the difference gradient. Central differences perform better, but we do not evaluate f outside of the feasible region. Hence, if a point on the centered difference stencil is outside of the feasible region, we use a one-sided difference in that direction. In order to guarantee that at least one point in each direction is feasible, we scale the variables so that L_i = 0, U_i = 1, and h_0 ≤ 1/2.
The sufficient decrease condition is (compare with (5.31))

(7.6)   f(x(λ)) − f(x) ≤ α ∇_h f(x)^T (x(λ) − x),

where

   x(λ) = P(x − λ∇_h f(x)).

One could terminate the iteration at a given scale when the analogue to (7.2),

(7.7)   ‖x − x(1)‖ ≤ τh,

holds or when

(7.8)   f(x_c) < f(x ± r_j) for all x ± r_j feasible,

which is the analogue to (6.7) for bound constrained problems.
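With the scaling above, the projection P is componentwise, so x(λ) is one line of MATLAB (a sketch in our notation; g holds ∇_h f(x) and L, U are the vectors of bounds):

    xlam = max(L, min(U, x - lambda * g));   % x(lambda) = P(x - lambda nabla_h f(x))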
Quasi-Newton methods for bound constraints can be constructed more simply for small problems, like the ones to which implicit filtering is applied, where it is practical to store the model of the inverse of the reduced Hessian as a full matrix. By using full matrix storage, the complexity of bfgsrecb is avoided. One such alternative [53], [54], [55] to the updates in §5.5.3 is to update the complete reduced Hessian and then correct it with information from the new active set. This results in a two-stage update in which a model for the inverse of the reduced Hessian is updated with (4.5) to obtain

(7.9)   R_{1/2}^{−1} = ( I − s y^T / y^T s ) R_c^{−1} ( I − y s^T / y^T s ) + s s^T / y^T s.

Then the new reduced Hessian is computed using the active set information at the new point,

(7.10)   R_+^{−1} = P_{A+} + P_{I+} R_{1/2}^{−1} P_{I+}.
It is easy to show that Theorems 5.5.4 and 5.5.5 hold for this form of the update.
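A full-matrix MATLAB sketch of the two-stage update follows, under the assumption that Rinv holds the current model R_c^{−1}, s and y are the step and gradient difference, and PA and PI are the projection matrices onto the new active and inactive sets (our variable names, not those of the book's codes):

    % Stage 1: BFGS update (7.9) of the inverse reduced Hessian.
    n = length(s);
    rho = y' * s;                       % assumed positive (curvature condition)
    Rhalf = (eye(n) - s * y' / rho) * Rinv * (eye(n) - y * s' / rho) ...
            + s * s' / rho;
    % Stage 2: correct with the new active set information, (7.10).
    Rplus = PA + PI * Rhalf * PI;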
A FORTRAN implementation [119] of implicit filtering for bound constrained problems is in the software collection. In the original version of that implementation a projected SR1 update was used and a Cholesky factorization of the matrix R_+ was performed to verify positivity. The model Hessian was reinitialized to the identity whenever the scale or the active set changed.



7.5 Restarting and Minima at All Scales
No algorithm in this part of the book is guaranteed to find even a local minimum, much less a global one. One approach to improving the robustness of these algorithms is to restart the iteration after one sweep through the scales. A point x that is not changed after a call to Algorithm imfilter1 (or imfilter2, or the bound constrained form of either) is called a minimum at all scales.
If f satisfies (6.1), fs has a unique critical point that is also a local minimum satisfying the standard assumptions (and hence is a global minimum for fs), and certain (strong!) technical assumptions on the decay of φ near the minimum hold, then [120] a minimum at all scales is near that global minimum of fs. In the unconstrained case this statement follows from the termination criteria ((7.2) and (6.7)) for implicit filtering, Lemma 6.2.3 (or 6.2.7), and, if central differences are used, Theorem 6.2.9. The analysis in [120] of the bound constrained case is more technical.
In practice, restarts are expensive and need not be done for most problems. However, restarts have been reported to make a difference in some cases [178]. It is also comforting to know that one has a minimum at all scales, and the author of this book recommends testing potential optima with restarts before one uses the results in practice, but not at the stage where one is tuning the optimizer or doing preliminary evaluation of the results.


7.6 Examples
Many of these examples are from [56]. For all the examples we report results with and without a quasi-Newton Hessian. We report results for both forward and central differences. In the figures the solid line corresponds to the BFGS Hessian, the dashed-dotted line to the SR1 Hessian, and the dashed line to H = I, the steepest descent form of implicit filtering.
Unlike the smooth problems considered earlier, where convergence of the gradient to zero was supported by theory, convergence of the simplex gradient to zero is limited by the noise in the objective. We illustrate performance by plotting both the objective function value and the norm of the simplex gradient. From these examples it is clear that the graphs of function value against the count of function evaluations are a better indicator of the performance of the optimizer.
In all cases we terminated the iteration when either fdquasi had been called for each scale or a budget of function evaluations had been exhausted. Once the code completes an iteration and the number of function evaluations is greater than or equal to the budget, the iteration is terminated.
The examples include both smooth and nonsmooth problems, with and without noise. A serious problem for some algorithms of this type is their failure on very easy problems. For most of the algorithms covered in this part of the book, we will present examples that illustrate performance on this collection of problems.

7.6.1 Weber's Problem
The three Weber's function examples all have minimizers at points at which the objective is nondifferentiable. For the computations we used an initial iterate of (10, −10)^T, a budget of 200 function evaluations, and {10 × 2^{−n}}_{n=−2}^{8} as the sequence of scales.
In each of the examples the performance of the two quasi-Newton methods was virtually identical and far better than that without a quasi-Newton model Hessian. Forward and central differences for the first two problems (Figures 7.1 and 7.2) perform almost equally well, with forward having a slight edge. In Figure 7.3, however, the forward difference version of implicit filtering finds a local minimum different from the global minimum that is located by central differences.





[Figure 7.1: First Weber Example. Four panels plot the simplex gradient norm and the function value against function evaluations for central differences (top row) and forward differences (bottom row).]




[Figure 7.2: Second Weber Example. Four panels plot the simplex gradient norm and the function value against function evaluations for central differences (top row) and forward differences (bottom row).]





[Figure 7.3: Third Weber Example.]