Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

106 ITERATIVE METHODS FOR OPTIMIZATION

5.6.1 Infinite-Dimensional Problems

The results in this part of this book do not extend in a direct way to infinite-dimensional problems. One reason for this is that infinite-dimensional problems often have countably infinitely many constraints or even a continuum of constraints; hence Ω is not compact in the norm topology of the Banach space in which the problem is posed, and appeals to various types of weak continuity must be made (see [122] for an example of such arguments and [122] and [10] for applications). Moreover, identification of an active set in finitely many iterations is not always possible. A more complete account of this issue may be found in [254], [162], [161].

These are not the only complications that can arise in infinite dimensions. Even the projected gradient method presents challenges, especially if the minima fail to be nondegenerate in the sense of this book [94], [95]. Convergence behavior for discretized problems can differ from that for the continuous problem [97]. Nonequivalence of norms makes convergence results difficult to formulate and analyze for both line search [96], [254], [98] and trust region [140], [162] methods.

The functional analytic structure of many control problems can be exploited with fast multilevel methods. Both second kind multigrid methods from [138] and variants of the Atkinson–Brakhage method [9], [31] have been applied to fixed point formulations of parabolic boundary control problems in one space dimension [136], [137], [153], [162], [161].

5.7 Examples

The computations in this section were done with the MATLAB code bfgsbound. In this code the storage is limited to five pairs of vectors, and β = .1 was used in the line search.

5.7.1 Parameter ID Problem

We consider the parameter identification problem from §3.4.1 with bounds L = (2, 0)^T and U = (20, 5)^T. The initial iterate x_0 = (5, 5)^T is feasible, but the global minimum of (1, 1)^T is not. As one might expect, the lower bound constraint on (x)_1 is active at the optimal point x* ≈ (2, 1.72)^T. The termination criterion for both the gradient projection and projected BFGS algorithms was ‖u − u(1)‖ ≤ 10^{−6}.

The gradient projection algorithm failed. While the value of the objective function was correct, the projected gradient norm failed to converge and the active set was not identified. The projected BFGS iteration converged in 35 iterations. One can see the local superlinear convergence in Figure 5.1 from the plot of the projected gradient norms. The cost of the BFGS iteration was 121 function evaluations, 36 gradients, and roughly 5.3 million floating point operations.

5.7.2 Discrete Control Problem

We base the two control problem examples on the example from §1.6.1. Our first example takes N = 2000, T = 1, y_0 = 0,

L(y, u, t) = (y − 3)^2 + .1u^2  and  φ(y, u, t) = uy + t^2,

with the bound constraints

.5 ≤ u ≤ 2,

and the initial iterate u_0 = 2. We terminated the iteration when ‖u − u(1)‖ ≤ 10^{−5}. In Figure 5.2 we plot the solution of this problem. Clearly the active set is not empty for the constrained problem.
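To make this example concrete, the sketch below evaluates such a discretized control objective. It is our own illustration, not the book's bfgsbound code: we assume a forward Euler discretization of the dynamics y' = φ(y, u, t), y(0) = y_0, and a rectangle rule for the integral of L, and the function names are our own.

```python
def control_objective(u, T=1.0, y0=0.0):
    """Sketch of a discretized control objective (assumed forward Euler
    for y' = phi(y,u,t) and a rectangle rule for the integral of L;
    this is not necessarily the book's exact discretization)."""
    L = lambda y, uu, t: (y - 3.0) ** 2 + 0.1 * uu ** 2   # running cost
    phi = lambda y, uu, t: uu * y + t ** 2                # dynamics
    N = len(u)
    h = T / N
    y, total = y0, 0.0
    for j in range(N):
        t = j * h
        total += h * L(y, u[j], t)
        y += h * phi(y, u[j], t)    # Euler step
    return total

u0 = [2.0] * 2000          # initial iterate u0 = 2 (already inside [.5, 2])
val = control_objective(u0)
print(0.0 < val)
```

Projecting a trial control into the bounds [.5, 2] componentwise is then a simple `min(max(u_j, .5), 2)` before each evaluation.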


Figure 5.3: Constrained Discrete Control Problem I. [Figure: four panels plotting the projected gradient norm and the function value against iterations for the projected BFGS and gradient projection algorithms.]

Figure 5.4: Constrained Discrete Control Problem II. [Figure: four panels plotting the projected gradient norm and the function value against iterations for the projected BFGS and gradient projection algorithms.]


We solve the constrained problem with Algorithm gradproj and Algorithm bfgsoptb. In Figure 5.3 we plot the function value and the norm of the projected gradient ‖u − u(1)‖. The projected BFGS iteration required 71 function evaluations, 36 gradient evaluations, and roughly 5.6 million floating point operations, while the gradient projection needed 183 function evaluations, 92 gradient evaluations, and roughly 10.4 million floating point operations.

Our second control problem example solves the same problem as in §3.4.2 using the constraints

−206 ≤ u ≤ 206.

We terminate the iteration when ‖u − u(1)‖ ≤ 10^{−6}, which is exactly the condition used in §3.4.2 when the active set is empty. The solution to the unconstrained problem is feasible, the active set is empty, and the initial iterate is feasible. Both the gradient projection iteration and the projected BFGS iteration converge to the solution of the unconstrained problem. The constraints are not active at either the initial iterate or the final solution but are active inside the line search for the first iterate and for the second iterate. As is clear from a comparison of Figures 5.4 and 3.3, this small change has a dramatic effect on the cost of the optimization, eliminating the need for the scaling fixup (3.50). The gradient projection method, requiring 15 function evaluations, 8 gradient evaluations, and roughly 167 thousand floating point operations, is far more efficient than the steepest descent iteration reported in §3.4.2. The projected BFGS iteration was somewhat worse, needing 223 thousand operations, but only 13 function evaluations and 7 gradient evaluations. In this example the cost of maintaining the BFGS update was not compensated by a significantly reduced iteration count.

5.8 Exercises on Bound Constrained Optimization

5.8.1. Suppose that f is continuously differentiable, that x* is a nondegenerate local minimizer for problem (5.4), and all constraints are active. Show that there is δ > 0 such that

1. if x ∈ B(δ) then x* = P(x − ∇f(x)), and

2. the gradient projection algorithm converges in one iteration if x_0 ∈ B(δ).

5.8.2. Show that if H = I then (5.31) and (5.13) are equivalent.

5.8.3. Prove Theorem 5.5.2.

5.8.4. Verify (5.42).

5.8.5. Suppose the unconstrained problem (1.2) has a solution x* at which the standard assumptions for unconstrained optimization hold. Consider the bound constrained problem (5.3) for u and l such that x* ∈ Ω and A(x*) is not empty. Is x* a nondegenerate local minimizer? If not, how are the results in this chapter changed? You might try a computational example to see what's going on.

5.8.6. Prove Theorem 5.5.4.

5.8.7. Verify (5.51).

5.8.8. Verify (5.52).

5.8.9. Formulate a generalization of (4.33) for updating A†.

5.8.10. What would happen in the examples if we increased the number of (y, s) pairs that were stored? By how much would the BFGS cost be increased?


Part II

Optimization of Noisy Functions


Chapter 6

Basic Concepts and Goals

The algorithms in Part I cannot be implemented at all if the gradient of f is not available, either analytically or via a difference. Even if gradients are available, these algorithms are not satisfactory if f has many local minima that are not of interest. We limit our coverage to deterministic sampling algorithms which are generally applicable and are more or less easy to implement. Of these algorithms, only the DIRECT algorithm [150] covered in §8.4.2 is truly intended to be a global optimizer.

The study of optimization methods that do not require gradients is an active research area (see [227] for a survey of some of this activity), even for smooth problems [61], [62]. Even though some of the methods, such as the Nelder–Mead [204] and Hooke–Jeeves [145] algorithms, are classic, most of the convergence analysis in this part of the book was done after 1990.

The algorithms and theoretical results that we present in this part of the book are for objective functions that are perturbations of simple, smooth functions. The surfaces in Figure 6.1 illustrate this problem. The optimization landscape on the left of Figure 6.1, taken from [271], arose in a problem in semiconductor design. The landscape on the right is a simple perturbation of a convex quadratic.

Figure 6.1: Optimization Landscapes. [Figure: two surface plots; left, the landscape from the semiconductor design problem of [271]; right, a perturbation of a convex quadratic.]

We do not discuss algorithms that explicitly smooth the objective function or apply a filter, such as the ones in [168] and [187]. For general problems, these must sample the variable space in some way, for example by performing high-dimensional integration, and are too costly. However, in some special cases these integrals can be performed analytically, and impressive results for special-purpose filtering algorithms for computational chemistry have been reported in, for example, [196] and [277]. Nor do we discuss analog methods (see [149] for a well-known


example).

We also omit stochastic methods like the special-purpose methods discussed in [38] and [39], or more radical general-purpose global optimization algorithms, such as simulated annealing [166] (see [1] and [265] for surveys of recent work), interval methods [152], or genetic algorithms [143], [144] (see [246] or [123] for a survey), which are random to some extent or are random search algorithms. These probabilistic methods, however, should be considered when the more conservative algorithms such as the ones in this part of the book fail.

6.1 Problem Statement

Consider an objective function f that is a perturbation of a smooth function f_s by a small function φ:

(6.1)    f(x) = f_s(x) + φ(x).

Small oscillations in φ could cause f to have several local minima that would trap any conventional gradient-based algorithm. The perturbation φ can, in general, be random or based on the output of an experiment [250], and may not return the same value when called twice with the same argument. Hence φ need not even be a function. We assume that φ is everywhere defined and bounded to make the statement of the results simpler.

6.2 The Simplex Gradient

Most of the algorithms in this part of the book examine a simplex of points in R^N at each iteration and then change the simplex in response. In this section we develop the tools needed to describe and analyze these algorithms. The fundamental idea is that many sampling algorithms require enough information to approximate the gradient by differences and that the accuracy in that difference approximation can be used to analyze the convergence. However, for problems of the form (6.1), one must take care not to make the difference increments so small as to attempt to differentiate the noise.

The ideas in this section were originally used in [155] to analyze the Nelder–Mead [204] algorithm, which we discuss in §8.1. However, the ideas can be applied to several classes of algorithms, and we follow the development in [29] in this section.

Definition 6.2.1. A simplex S in R^N is the convex hull of N + 1 points {x_j}_{j=1}^{N+1}; x_j is the jth vertex of S. We let V (or V(S)) denote the N × N matrix of simplex directions

V(S) = (x_2 − x_1, x_3 − x_1, . . . , x_{N+1} − x_1) = (v_1, . . . , v_N).

We say S is nonsingular if V is nonsingular. The simplex diameter diam(S) is

diam(S) = max_{1 ≤ i,j ≤ N+1} ‖x_i − x_j‖.

We will refer to the l^2 condition number κ(V) of V as the simplex condition. We let δ(f : S) denote the vector of objective function differences

δ(f : S) = ( f(x_2) − f(x_1), f(x_3) − f(x_1), . . . , f(x_{N+1}) − f(x_1) )^T.

We will not use the simplex diameter directly in our estimates or algorithms. Rather we will use two oriented lengths

σ_+(S) = max_{2 ≤ j ≤ N+1} ‖x_1 − x_j‖  and  σ_−(S) = min_{2 ≤ j ≤ N+1} ‖x_1 − x_j‖.

Clearly,

σ_+(S) ≤ diam(S) ≤ 2σ_+(S).
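These quantities are easy to compute directly from a vertex list. The sketch below is our own illustration (the function names are not from any optimization library): it forms the simplex directions, diam(S), and the oriented lengths for a simplex in R², and checks the inequality above.

```python
import math

def simplex_quantities(vertices):
    """Given vertices [x1, ..., x_{N+1}] as tuples, return the simplex
    directions v_j = x_{j+1} - x_1, diam(S), sigma_plus(S), sigma_minus(S)."""
    x1 = vertices[0]
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    V = [tuple(xj[i] - x1[i] for i in range(len(x1))) for xj in vertices[1:]]
    diam = max(dist(a, b) for a in vertices for b in vertices)
    sigma_p = max(dist(x1, xj) for xj in vertices[1:])
    sigma_m = min(dist(x1, xj) for xj in vertices[1:])
    return V, diam, sigma_p, sigma_m

S = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # a nonsingular simplex in R^2
V, diam, sp, sm = simplex_quantities(S)
print(sp <= diam <= 2 * sp)                # True
```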


6.2.1 Forward Difference Simplex Gradient

Definition 6.2.2. Let S be a nonsingular simplex with vertices {x_j}_{j=1}^{N+1}. The simplex gradient D(f : S) is

D(f : S) = V^{−T} δ(f : S).

Note that the matrix of simplex directions and the vector of objective function differences depend on which of the vertices is labeled x_1. Most of the algorithms we consider in this part of the book use a vertex ordering or sample on a regular stencil. In this way the algorithms, in one way or another, use a simplex gradient.
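For small N the simplex gradient can be computed by solving V^T D = δ(f : S) directly. The sketch below is our own illustration (a hand-rolled 2×2 solve by Cramer's rule, not library code): it computes D(f : S) for f(x) = x_1² + x_2² and can be compared with the true gradient ∇f(x_1) = (2x_1, 2x_2).

```python
def simplex_gradient_2d(f, x1, v1, v2):
    """Forward difference simplex gradient in R^2: solve V^T D = delta,
    where delta_j = f(x1 + v_j) - f(x1), by Cramer's rule."""
    d1 = f((x1[0] + v1[0], x1[1] + v1[1])) - f(x1)
    d2 = f((x1[0] + v2[0], x1[1] + v2[1])) - f(x1)
    det = v1[0] * v2[1] - v1[1] * v2[0]     # rows of V^T are v1 and v2
    D1 = (d1 * v2[1] - d2 * v1[1]) / det
    D2 = (v1[0] * d2 - v2[0] * d1) / det
    return (D1, D2)

f = lambda x: x[0] ** 2 + x[1] ** 2
h = 1e-3
D = simplex_gradient_2d(f, (1.0, 1.0), (h, 0.0), (0.0, h))
print(D)   # close to the true gradient (2, 2); the error is O(sigma_+(S))
```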

This definition of the simplex gradient is motivated by the first-order estimate in Lemma 6.2.1.

Lemma 6.2.1. Let S be a simplex. Let ∇f be Lipschitz continuous in a neighborhood of S with Lipschitz constant 2K_f. Then there is K > 0, depending only on K_f, such that

(6.2)    ‖∇f(x_1) − D(f : S)‖ ≤ K κ(V) σ_+(S).

Proof. Our smoothness assumptions on f and Taylor's theorem imply that for all 1 ≤ j ≤ N,

|f(x_1) − f(x_{j+1}) + v_j^T ∇f(x_1)| ≤ K_f ‖v_j‖^2 ≤ K_f σ_+(S)^2.

Hence

‖δ(f : S) − V^T ∇f(x_1)‖ ≤ N^{1/2} K_f σ_+(S)^2

and hence, setting K = N^{1/2} K_f,

‖∇f(x_1) − D(f : S)‖ ≤ K ‖V^{−T}‖ σ_+(S)^2.

The conclusion follows from the fact that σ_+(S) ≤ ‖V‖.

Search algorithms are not intended, of course, for smooth problems. Minimization of objective functions of the form in (6.1) is one of the applications of these methods. A first-order estimate that takes perturbations into account is our next result.

We will need to measure the perturbations on each simplex. To that end we define for any set T

‖φ‖_T = sup_{x ∈ T} |φ(x)|.

A first-order estimate also holds for the simplex gradient of an objective function that satisfies (6.1).

Lemma 6.2.2. Let S be a nonsingular simplex. Let f satisfy (6.1) and let ∇f_s be Lipschitz continuous in a neighborhood of S with Lipschitz constant 2K_s. Then there is K > 0, depending only on K_s, such that

(6.3)    ‖∇f_s(x_1) − D(f : S)‖ ≤ K κ(V) ( σ_+(S) + ‖φ‖_S / σ_+(S) ).

Proof. Lemma 6.2.1 (applied to f_s) implies

‖∇f_s(x_1) − D(f_s : S)‖ ≤ K_s N^{1/2} κ(V) σ_+(S).


Now, since ‖δ(φ : S)‖ ≤ 2√N ‖φ‖_S and σ_+(S) ≤ ‖V‖,

‖D(f : S) − D(f_s : S)‖ = ‖V^{−T} ( δ(f : S) − δ(f_s : S) )‖ = ‖V^{−T} δ(φ : S)‖

≤ 2N^{1/2} ‖V^{−T}‖ ‖φ‖_S ≤ 2N^{1/2} κ(V) ‖φ‖_S / σ_+(S).

This completes the proof with K = N^{1/2} K_s + 2N^{1/2}.

The constants K in (6.2) and (6.3) depend on S only through the Lipschitz constants of f_s and ∇f_s in a neighborhood of S. We will express that dependence as K = K(S) when needed.

The algorithms in this section are most profitably applied to problems of the form (6.1), and the goal is to extract as much information as possible from the smooth part f_s of f without wasting effort in a futile attempt to minimize the noise. In order to formulate our goal for convergence clearly, we explore the consequences of a small simplex gradient in the special (and not uncommon) case that the amplitude of the noise is small in Lemma 6.2.3.

Lemma 6.2.3. Let f satisfy (6.1) and let ∇f_s be continuously differentiable in a compact set Ω ⊂ R^N. Assume that f_s has a unique critical point x* in Ω. Then there is K_Ω > 0 such that for any simplex S ⊂ Ω with vertices {x_j}_{j=1}^{N+1},

‖x_1 − x*‖ ≤ K_Ω ( ‖D(f : S)‖ + κ(V) ( σ_+(S) + ‖φ‖_S / σ_+(S) ) ).

Proof. The compactness of Ω and our smoothness assumptions on f_s imply that there is β_0 > 0 such that

‖∇f_s(x)‖ ≥ β_0 ‖x − x*‖

for all x ∈ Ω. We apply (6.3) to obtain

‖x_1 − x*‖ ≤ β_0^{−1} ‖∇f_s(x_1)‖ ≤ β_0^{−1} ( ‖D(f : S)‖ + K κ(V) ( σ_+(S) + ‖φ‖_S / σ_+(S) ) ).

This completes the proof with K_Ω = β_0^{−1} max(1, K).

By sampling in an organized way, simplex-based algorithms attempt, some directly and some implicitly, to drive the simplex gradient to a small value by changing the size of the simplices over which f is sampled. The motion of the simplices and the scheme for changing their size (especially the reduction in size) account for the differences among the algorithms. Theorem 6.2.4, a direct consequence of Lemma 6.2.3, quantifies this. We will consider a sequence of uniformly well-conditioned simplices. Such simplices are generated by several of the algorithms we will study later.

Theorem 6.2.4. Let f satisfy (6.1) and let ∇f_s be continuously differentiable in a compact set Ω ⊂ R^N. Assume that f_s has a unique critical point x* in Ω. Let S^k be a sequence of simplices having vertices {x_j^k}_{j=1}^{N+1}. Assume that there is M such that

S^k ⊂ Ω and κ(V(S^k)) ≤ M for all k.

Then,

1. if

lim_{k→∞} σ_+(S^k) = 0,  lim_{k→∞} ‖φ‖_{S^k} / σ_+(S^k) = 0,

and lim sup_{k→∞} ‖D(f : S^k)‖ = ε, for some ε > 0, then there is K_S > 0 such that

lim sup_{k→∞} ‖x* − x_1^k‖ ≤ K_S ε;

2. if, for some ε > 0,

lim sup_{k→∞} ‖φ‖_{S^k} ≤ ε^2,  lim inf_{k→∞} σ_+(S^k) ≥ ε,  and  lim inf_{k→∞} ‖D(f : S^k)‖ ≤ ε,

then there is K_S > 0 such that

lim sup_{k→∞} ‖x* − x_1^k‖ ≤ K_S ( ε + lim sup_{k→∞} σ_+(S^k) ).

6.2.2 Centered Difference Simplex Gradient

In this section we define the centered difference simplex gradient and prove a second-order estimate. We will then prove two variants of Theorem 6.2.4, one to show how the role of the noise φ differs from that in the one-sided derivative case and a second to quantify how the values of f on the stencil can be used to terminate an iteration.

Definition 6.2.3. Let S be a nonsingular simplex in R^N with vertices {x_j}_{j=1}^{N+1} and simplex directions v_j = x_{j+1} − x_1. The reflected simplex R = R(S) is the simplex with vertices x_1 and

r_j = x_1 − v_j for j = 1, . . . , N.

The central simplex gradient D_C(f : S) is

D_C(f : S) = V^{−T} ( δ(f : S) − δ(f : R) ) / 2 = ( D(f : S) + D(f : R) ) / 2.

For example, if N = 1 and x_2 = x_1 + h, then r_2 = x_1 − h. Hence

D(f : S) = ( f(x_1 + h) − f(x_1) ) / h  and  D(f : R) = ( f(x_1 − h) − f(x_1) ) / (−h).

Therefore,

D_C(f : S) = D_C(f : R) = ( f(x_1 + h) − f(x_1 − h) ) / (2h)

is the usual central difference.
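The reflected simplex only reuses the directions with a sign flip, so D_C(f : S) costs 2N extra evaluations beyond f(x_1). The sketch below is our own illustration (a hand-rolled 2×2 Cramer solve, not library code): it computes D_C(f : S) for a quadratic in R², where the central difference is exact up to rounding.

```python
def central_simplex_gradient_2d(f, x1, v1, v2):
    """Central simplex gradient in R^2: solve V^T D = (delta(f:S) -
    delta(f:R)) / 2, where R(S) has the reflected vertices x1 - v_j."""
    d = []
    for v in (v1, v2):
        fp = f((x1[0] + v[0], x1[1] + v[1]))   # f(x1 + v_j)
        fm = f((x1[0] - v[0], x1[1] - v[1]))   # f(x1 - v_j), reflected vertex
        d.append((fp - fm) / 2.0)
    det = v1[0] * v2[1] - v1[1] * v2[0]
    return ((d[0] * v2[1] - d[1] * v1[1]) / det,
            (v1[0] * d[1] - v2[0] * d[0]) / det)

f = lambda x: x[0] ** 2 + x[1] ** 2
h = 1e-3
DC = central_simplex_gradient_2d(f, (1.0, 1.0), (h, 0.0), (0.0, h))
print(DC)   # (2, 2) up to O(h^2) (and exactly, for this quadratic)
```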

Lemmas 6.2.5 and 6.2.6 are the second-order analogues of Lemmas 6.2.1 and 6.2.2.

Lemma 6.2.5. Let S be a nonsingular simplex and let ∇²f be Lipschitz continuous in a neighborhood of S ∪ R(S) with Lipschitz constant 3K_C. Then there is K > 0 such that

(6.4)    ‖∇f(x_1) − D_C(f : S)‖ ≤ K κ(V) σ_+(S)^2.

Proof. The Lipschitz continuity assumption implies that for all 1 ≤ j ≤ N,

|f(x_{j+1}) − f(r_j) − 2∇f(x_1)^T v_j| ≤ K_C ‖v_j‖^3 ≤ K_C σ_+(S)^3.

As in the proof of Lemma 6.2.1 we have

‖δ(f : S) − δ(f : R) − 2V^T ∇f(x_1)‖ ≤ N^{1/2} K_C σ_+(S)^3,

and hence the result follows with K = N^{1/2} K_C.

Lemma 6.2.6. Let S be a nonsingular simplex. Let f satisfy (6.1) and let ∇²f_s be Lipschitz continuous in a neighborhood of S ∪ R(S) with Lipschitz constant 3K_{Cs}. Then there is K > 0, depending only on K_{Cs}, such that

(6.5)    ‖∇f_s(x_1) − D_C(f : S)‖ ≤ K κ(V) ( σ_+(S)^2 + ‖φ‖_S / σ_+(S) ).

Proof. This proof is very similar to that of Lemma 6.2.2 and is left to the reader.

The quality of the information that can be obtained from the central simplex gradient is higher than that of the forward. The difference in practice can be dramatic, as the examples in §7.6 illustrate. The consequences of a small central simplex gradient follow directly from Lemma 6.2.6.

Lemma 6.2.7. Let f satisfy (6.1) and let ∇²f_s be continuously differentiable in a compact set Ω ⊂ R^N. Assume that f_s has a unique critical point x* in Ω. Then there is K_Ω > 0 such that if a simplex S and its reflection R(S) are both contained in Ω then

‖x_1 − x*‖ ≤ K_Ω ( ‖D_C(f : S)‖ + κ(V) ( σ_+(S)^2 + ‖φ‖_S / σ_+(S) ) ).

Lemma 6.2.7 is all one needs to conclude convergence from a sequence of small central simplex gradients.

Theorem 6.2.8. Let f satisfy (6.1) and let ∇²f_s be continuously differentiable in a compact set Ω ⊂ R^N. Assume that f_s has a unique critical point x* in Ω. Let S^k be a sequence of simplices having vertices {x_j^k}_{j=1}^{N+1}. Assume that there is M such that

S^k, R(S^k) ⊂ Ω and κ(V(S^k)) ≤ M for all k.

Then,

1. if

lim_{k→∞} σ_+(S^k) = 0,  lim_{k→∞} ‖φ‖_{S^k} / σ_+(S^k) = 0,

and lim sup_{k→∞} ‖D_C(f : S^k)‖ = ε, for some ε > 0, then there is K_S > 0 such that

lim sup_{k→∞} ‖x* − x_1^k‖ ≤ K_S ε;

2. if, for some ε > 0,

lim sup_{k→∞} ‖φ‖_{S^k} ≤ ε^3,  lim inf_{k→∞} σ_+(S^k) ≥ ε,  and  lim inf_{k→∞} ‖D_C(f : S^k)‖ ≤ ε^2,

then there is K_S > 0 such that

lim sup_{k→∞} ‖x* − x_1^k‖ ≤ K_S ( ε + lim sup_{k→∞} σ_+(S^k) )^2.


Theorem 6.2.8, like Theorem 6.2.4, motivates using a small simplex gradient as a test for convergence. Suppose ‖φ‖_∞ ≤ ε and an algorithm generates sequences of simplices whose vertices are intended to approximate a minimizer of f_s. We can use the results in §2.3.1 to conclude that simplices with σ_+(S) << ε^{1/2} will result in inaccurate forward difference gradients and those with σ_+(S) << ε^{2/3} in inaccurate central difference gradients. This indicates that the central simplex gradient will be less sensitive to noise than the forward. While this is not usually critical in computing a difference Hessian, where the loss of accuracy may cause slow convergence, it can cause failure of the iteration if one is computing a difference gradient.

If one wants to terminate the algorithm when the simplex gradient is small, say, ≤ τ, a rough estimate of the minimal possible value of τ is τ = O(ε^{1/2}) for a forward difference simplex gradient and τ = O(ε^{2/3}) for a central simplex gradient.
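These scalings are easy to observe numerically. The sketch below is our own illustration, not one of the book's examples: with a deterministic "noise" of amplitude ε, a forward difference with increment ε^{1/2} keeps its total error O(ε^{1/2}), and a central difference with increment ε^{1/3} keeps its total error O(ε^{2/3}); much smaller increments would differentiate the noise instead.

```python
import math

eps = 1e-6
fs = lambda x: x * x                             # smooth part, fs'(1) = 2
f = lambda x: fs(x) + eps * math.sin(x / eps)    # noise bounded by eps

def forward(h):
    return (f(1.0 + h) - f(1.0)) / h

def central(h):
    return (f(1.0 + h) - f(1.0 - h)) / (2 * h)

err_fwd = abs(forward(eps ** 0.5) - 2.0)           # O(eps^{1/2})
err_cen = abs(central(eps ** (1.0 / 3.0)) - 2.0)   # O(eps^{2/3})
print(err_fwd, err_cen)
```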

Moreover, if one is using a centered difference, one has information on the values of f at enough points to make an important qualitative judgment. In order to evaluate a central simplex gradient, f must be sampled at x_1 and x_1 ± v_j for 1 ≤ j ≤ N. If f(x_1) ≤ f(x_1 ± v_j) for all 1 ≤ j ≤ N, then one can question the validity of using the simplex gradient as a descent direction or as a measure of stationarity. We call this stencil failure. We will use stencil failure as a termination criterion in most of the algorithms we discuss in this part of the book. Our basis for that is a result from [29], which only requires differentiability of f_s.
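A stencil failure test needs only the 2N values f(x_1 ± v_j) already computed for the central simplex gradient. A minimal sketch (our own function names):

```python
def stencil_failure(f, x1, directions):
    """Return True if f(x1) <= f(x1 + v) and f(x1) <= f(x1 - v) for
    every simplex direction v, i.e., the center beats the whole stencil."""
    fc = f(x1)
    for v in directions:
        plus = tuple(a + b for a, b in zip(x1, v))
        minus = tuple(a - b for a, b in zip(x1, v))
        if f(plus) < fc or f(minus) < fc:
            return False
    return True

f = lambda x: x[0] ** 2 + x[1] ** 2
V = [(0.1, 0.0), (0.0, 0.1)]
print(stencil_failure(f, (0.0, 0.0), V))   # True: at the minimizer
print(stencil_failure(f, (1.0, 1.0), V))   # False: a descent point exists
```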

Theorem 6.2.9. Let S be a nonsingular simplex such that for some μ_− ∈ (0, 1) and κ_+ > 0,

(6.6)    κ(V) ≤ κ_+  and  x^T V V^T x ≥ μ_− σ_+(S)^2 ‖x‖^2  for all x.

Let f satisfy (6.1) and let ∇f_s be Lipschitz continuous in a ball B of radius 2σ_+(S) about x_1. Assume that

(6.7)    f(x_1) < min_j { f(x_1 ± v_j) }.

Then, if K is the constant from Lemma 6.2.2,

(6.8)    ‖∇f_s(x_1)‖ ≤ 8 μ_−^{−1} K κ_+ ( σ_+(S) + ‖φ‖_B / σ_+(S) ).

Proof. Let R(S), the reflected simplex, have vertices x_1 and {r_j}_{j=1}^N. (6.7) implies that each component of δ(f : S) and δ(f : R) is positive. Now since

V = V(S) = −V(R),

we must have

(6.9)    0 < δ(f : S)^T δ(f : R) = ( V^T V^{−T} δ(f : S) )^T ( V(R)^T V(R)^{−T} δ(f : R) ) = −D(f : S)^T V V^T D(f : R).

We apply Lemma 6.2.2 to both D(f : S) and D(f : R) to obtain

D(f : S) = ∇f_s(x_1) + E_1  and  D(f : R) = ∇f_s(x_1) + E_2,

where, since κ(V) = κ(V(R)) ≤ κ_+,

‖E_k‖ ≤ K κ_+ ( σ_+(S) + ‖φ‖_B / σ_+(S) ).


Since ‖V‖ ≤ 2σ_+(S), we have by (6.9)

(6.10)    ∇f_s(x_1)^T V V^T ∇f_s(x_1) ≤ 4σ_+(S)^2 ‖∇f_s(x_1)‖ ( ‖E_1‖ + ‖E_2‖ ) + 4σ_+(S)^2 ‖E_1‖ ‖E_2‖.

The assumptions of the lemma give a lower estimate of the left side of (6.10),

w^T V V^T w ≥ μ_− σ_+(S)^2 ‖w‖^2 with w = ∇f_s(x_1).

Hence,

‖∇f_s(x_1)‖^2 ≤ b ‖∇f_s(x_1)‖ + c,

where, using (6.10),

b = 8 μ_−^{−1} K κ_+ ( σ_+(S) + ‖φ‖_B / σ_+(S) )

and

c = 4 μ_−^{−1} (K κ_+)^2 ( σ_+(S) + ‖φ‖_B / σ_+(S) )^2 = μ_− b^2 / 16.

So b^2 − 4c = b^2 (1 − μ_−/4) and the quadratic formula then implies that

‖∇f_s(x_1)‖ ≤ ( b + (b^2 − 4c)^{1/2} ) / 2 = b ( 1 + (1 − μ_−/4)^{1/2} ) / 2 ≤ b

as asserted.

6.3 Examples

Our examples are selected to represent a variety of problems that can be attacked by the methods in this part of the book and, at the same time, are easy for the reader to implement. Many of the problems to which these methods have been applied have complex objective functions and have been solved as team efforts [107], [250], [121], [70], [69]. In many such cases the objective function is not even available as a single subroutine, as the optimizer, simulator, and design tool are one package. Hence, the examples we present in this part of the book are even more artificial than the ones in the first part. The cost of an evaluation of f is much less in these examples than it is in practice.

6.3.1 Weber's Problem

Our discussion of this problem is based on [182]. Weber's problem is to locate a central facility (a warehouse or factory, for example) so that the total cost associated with distribution to several demand centers is minimized. The model is that the cost is proportional to the distance from the facility. The proportionality constant may be positive, reflecting transportation costs, or negative, reflecting environmental concerns.

If the locations of the demand centers are {z_i} ⊂ R^2 and the corresponding weights are {w_i}, then the objective function is

(6.11)    f(x) = Σ_i w_i ‖x − z_i‖ = Σ_i w_i ( [(x)_1 − (z_i)_1]^2 + [(x)_2 − (z_i)_2]^2 )^{1/2}.

We will assume that

Σ_i w_i > 0,


so that a global optimum exists. If Σ_i w_i < 0 then inf f = −∞ and there is no global optimum.

Weber's problem is not differentiable at x = z_i because the square root function is not differentiable at 0. A gradient-based algorithm, applied in a naive way, will have difficulty with this problem. There are special-purpose algorithms (see [182] for a survey) for Weber's problem, especially if all the weights are positive. Our main interest is in the case where at least one weight is negative. In that case there may be multiple local minima.

We will consider two examples. The first, and simplest, is from [182]. This example has three demand centers with

w = (2, 4, −5)^T  and  (z_1, z_2, z_3) = (  2  90  43
                                           42  11  88 ).

The global minimum is at x* = (90, 11)^T, at which the gradient is not defined. The complex contours near the minimizer in Figure 6.2 illustrate the difficulty of the problem.
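The objective (6.11) with the data above is a one-liner; the sketch below (our own code) evaluates it and checks that the reported minimizer x* = (90, 11)^T does beat a nearby point.

```python
import math

def weber(x, w, z):
    """Weber objective (6.11): sum of w_i * ||x - z_i|| over demand centers."""
    return sum(wi * math.hypot(x[0] - zi[0], x[1] - zi[1])
               for wi, zi in zip(w, z))

# First example: three demand centers, one negative weight.
w = (2.0, 4.0, -5.0)
z = ((2.0, 42.0), (90.0, 11.0), (43.0, 88.0))
print(weber((90.0, 11.0), w, z) < weber((89.0, 12.0), w, z))   # True
```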

Figure 6.2: Contour/Surface for Weber's Function: First Example. [Figure: contour and surface plots.]

Our second example has two local minimizers, at (−10, −10) and (25, 30), with the global minimizer at (25, 30). There are four demand centers with

w = (2, −4, 2, 1)^T  and  (z_1, z_2, z_3, z_4) = ( −10  0  5  25
                                                   −10  0  8  30 ).

See Figure 6.3.

Our third example adds the oscillatory function

φ(x) = sin(.0035 x^T x) + 5 sin(.003 (x − y)^T (x − y))

to the second example, where y = (−20, 0)^T. This complicates the optimization landscape significantly, as the surface and contour plots in Figure 6.4 show.

6.3.2 Perturbed Convex Quadratics

The sum of a simple convex quadratic and a low-amplitude, high-frequency perturbation will serve as a model problem for all the algorithms in this section. For example, the function graphed on the right in Figure 6.1,

f(x) = 2x^2 (1 + .75 cos(80x)/12) + cos(100x)^2/24,

is one of the examples in [120]. Our general form will be

(6.12)    f(x) = (x − ξ_0)^T H (x − ξ_0) ( 1 + a_1 cos( b_1^T (x − ξ_1) + c_1 (x − ξ_1)^T (x − ξ_1) ) )
                 + a_2 ( 1 + cos( b_2^T (x − ξ_2) + c_2 (x − ξ_2)^T (x − ξ_2) ) ) + a_3 |rand|,


Figure 6.3: Contour/Surface for Weber's Function: Second Example. [Figure: contour and surface plots.]

Figure 6.4: Contour/Surface plots for Weber's Function: Third Example. [Figure: contour and surface plots.]

where {ξ_j}, {a_j}, {b_j}, {c_j} are given and rand is a random number generator. f has been designed so that the minimum value is O(a_1 + a_2 + a_3). The unperturbed case a_1 = a_2 = a_3 = 0 is also of interest for many of the algorithms in this part of the book.
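The one-dimensional example above is easy to explore directly. The scan below is our own sketch: it counts grid points that beat both neighbors, each of which marks a local minimum basin created by the perturbation, even though the smooth part f_s(x) = 2x² has exactly one minimizer.

```python
import math

f = lambda x: (2 * x * x * (1 + 0.75 * math.cos(80 * x) / 12)
               + math.cos(100 * x) ** 2 / 24)

# Count interior grid points that beat both neighbors: each marks a
# local minimum that would trap a conventional gradient-based method.
xs = [-0.4 + 1e-3 * k for k in range(801)]
vals = [f(x) for x in xs]
n_min = sum(1 for i in range(1, 800)
            if vals[i] < vals[i - 1] and vals[i] < vals[i + 1])
print(n_min)   # many local minima on [-0.4, 0.4]
```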

6.3.3 Lennard–Jones Problem

The objective function is a simple model of the potential energy in a molecule of identical atoms. Assume that there are M atoms and that ξ_i ∈ R^3 is the position of the ith atom. Letting

d_ij = ‖ξ_i − ξ_j‖

and

x = (ξ_1^T, . . . , ξ_M^T)^T ∈ R^N,

where N = 3M, we have that the Lennard–Jones energy function is

(6.13)    f(x) = Σ_{i=1}^M Σ_{j=1}^{i−1} ( d_ij^{−12} − 2 d_ij^{−6} ).

f has many local minimizers (O(e^{M^2}) is one conjecture [142]) and the values at the minimizers are close. Hence, the Lennard–Jones function does not conform to the noisy perturbation of a smooth function paradigm. The reader is asked in some of the exercises to see how the methods perform.
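A direct translation of (6.13) is short (our own sketch; the coordinates below are illustrative). For two atoms the pair term d^{−12} − 2d^{−6} is minimized at d = 1, where it equals −1.

```python
import math

def lennard_jones(x):
    """Lennard-Jones energy (6.13): x is a flat list of 3M coordinates;
    f(x) is the sum over pairs i > j of d_ij^{-12} - 2 d_ij^{-6}."""
    M = len(x) // 3
    xi = [x[3 * i: 3 * i + 3] for i in range(M)]
    total = 0.0
    for i in range(M):
        for j in range(i):
            d = math.dist(xi[i], xi[j])
            total += d ** -12 - 2.0 * d ** -6
    return total

# Two atoms at unit distance: the single pair term attains its minimum, -1.
print(lennard_jones([0.0, 0.0, 0.0, 1.0, 0.0, 0.0]))   # -1.0
```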


6.4 Exercises on Basic Concepts

6.4.1. Show that if wi > 0 for all i then Weber™s problem has a unique local minimum.

6.4.2. Prove Lemma 6.2.6.

6.4.3. Try to minimize the Lennard–Jones functional using some of the algorithms from the first part of the book. Vary the initial iterate and M. Compare your best results with those in [142], [40], and [210].


Chapter 7

Implicit Filtering

7.1 Description and Analysis of Implicit Filtering

The implicit filtering algorithm was originally formulated in [270], [251], and [271] as a difference-gradient implementation of the gradient projection algorithm [18] in which the difference increment is reduced in size as the iteration progresses. A different formulation for unconstrained problems with certain convexity properties was introduced at about the same time in [279]. From the point of view of this book, the simplex gradient is used in a direct way. The algorithmic description and analysis in this chapter use the results from §6.2 directly. We will focus on unconstrained problems and derive the convergence results that implicit filtering shares with the search algorithms in Chapter 8.

Implicit filtering, by using an approximate gradient directly, offers the possibility of improved performance with quasi-Newton methods and can be easily applied to bound constrained problems. We explore these two possibilities in §§7.2 and 7.4.

In its simplest unconstrained form, implicit filtering is the steepest descent algorithm with difference gradients, where the difference increment varies as the iteration progresses. Because the gradient is only an approximation, the computed steepest descent direction may fail to be a descent direction and the line search may fail. In this event, the difference increment is reduced.

For a given x ∈ R^N and h > 0 we let the simplex S(x, h) be the right simplex from x with edges having length h. Hence the vertices are x and x + hv_i for 1 ≤ i ≤ N with V = I, so κ(V) = 1. The performance of implicit filtering with a central difference gradient is far superior to that with the forward difference gradient [120], [187], [250]. We will, therefore, use centered differences in the discussion. We illustrate the performance of forward difference gradients in §7.6.

We set

∇_h f(x) = D_C(f : S(x, h)).
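Since V = I, the central difference simplex gradient reduces to an ordinary central difference in each coordinate direction. A minimal Python sketch (the names are ours, not the book's):

```python
import numpy as np

def central_simplex_gradient(f, x, h):
    """Sketch of the central simplex gradient D_C(f : S(x, h)) on the right
    simplex with V = I: component i is (f(x + h e_i) - f(x - h e_i)) / (2h)."""
    n = len(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        g[i] = (f(x + h * e) - f(x - h * e)) / (2.0 * h)
    return g
```

Central differences are exact for quadratics, so for f(x) = x^T x the sketch returns 2x for any h > 0.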

We use a simple Armijo [7] line search and demand that the sufficient decrease condition

(7.1)   f(x − λ∇_h f(x)) − f(x) < −αλ ‖∇_h f(x)‖^2

holds (compare with (3.4)) for some α > 0.

Our central difference steepest descent algorithm fdsteep terminates when

(7.2)   ‖∇_h f(x)‖ ≤ τh

for some τ > 0, when more than pmax iterations have been taken, after a stencil failure, or when the line search fails by taking more than amax backtracks. Even the failures of fdsteep


can be used to advantage by triggering a reduction in h. The line search parameters α, β and the parameter τ in the termination criterion (7.2) do not affect the convergence analysis that we present here but can affect performance.

Algorithm 7.1.1. fdsteep(x, f, pmax, τ, h, amax)

1. For p = 1, . . . , pmax

(a) Compute f and ∇_h f; terminate if (6.7) or (7.2) hold.

(b) Find the least integer 0 ≤ m ≤ amax such that (7.1) holds for λ = β^m. If no such m exists, terminate.

(c) x = x − λ∇_h f(x).

Algorithm fdsteep will terminate after finitely many iterations because of the limits on the number of iterations and the number of backtracks. If the set {x | f(x) ≤ f(x_0)} is bounded then the iterations will remain in that set. Implicit filtering calls fdsteep repeatedly, reducing h after each termination of fdsteep. Aside from the data needed by fdsteep, one must provide a sequence of difference increments, called scales in [120].

Algorithm 7.1.2. imfilter1(x, f, pmax, τ, {h_k}, amax)

1. For k = 0, . . .

Call fdsteep(x, f, pmax, τ, h_k, amax)
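Algorithms 7.1.1 and 7.1.2 can be sketched together in Python. This is a simplified reading, not the book's MATLAB code: it assumes central differences with V = I, omits the stencil failure test (6.7), and the parameter names α and β are the line search constants above:

```python
import numpy as np

def fdsteep(x, f, pmax, tau, h, amax, alpha=1e-4, beta=0.5):
    """Sketch of Algorithm fdsteep: difference-gradient steepest descent
    with the Armijo condition (7.1) and termination test (7.2)."""
    def grad_h(f, x, h):                     # central difference gradient, V = I
        g = np.zeros(len(x))
        for i in range(len(x)):
            e = np.zeros(len(x)); e[i] = 1.0
            g[i] = (f(x + h * e) - f(x - h * e)) / (2.0 * h)
        return g

    for _ in range(pmax):
        g = grad_h(f, x, h)
        if np.linalg.norm(g) <= tau * h:     # termination test (7.2)
            return x
        fx, lam = f(x), 1.0
        for _ in range(amax + 1):
            if f(x - lam * g) - fx < -alpha * lam * (g @ g):   # (7.1)
                break
            lam *= beta
        else:                                # line search failure ends this scale
            return x
        x = x - lam * g
    return x

def imfilter1(x, f, pmax, tau, scales, amax):
    """Sketch of Algorithm imfilter1: one call to fdsteep per scale h_k."""
    for h in scales:
        x = fdsteep(x, f, pmax, tau, h, amax)
    return x
```

On a smooth convex quadratic the sketch behaves like exact steepest descent, since central differences reproduce the gradient exactly.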

The convergence result follows from the second-order estimate (6.5), the consequences of a stencil failure, Theorem 6.2.9, and the equalities h_k = σ_+(S^k) and κ(V^k) = 1. A similar result for forward differences would follow from (6.3).

Theorem 7.1.1. Let f satisfy (6.1) and let ∇f_s be Lipschitz continuous. Let h_k → 0, let {x_k} be the implicit filtering sequence, and let S^k = S(x_k, h_k). Assume that (7.1) holds (i.e., there is no line search failure) for all but finitely many k. Then if

(7.3)   lim_{k→∞} ( h_k + h_k^{−1} ‖φ‖_{S^k} ) = 0,

then any limit point of the sequence {x_k} is a critical point of f_s.

Proof. If either (7.1) or (6.7) hold for all but finitely many k then, as is standard,

∇_{h_k} f(x_k) = D_C(f : S^k) → 0.

Hence, using (7.3) and Lemma 6.2.2,

∇f_s(x_k) → 0,

as asserted.

7.2 Quasi-Newton Methods and Implicit Filtering

The unique feature of implicit filtering is the possibility, for problems that are sufficiently smooth near a minimizer, to obtain faster convergence in the terminal phase of the iteration by using a quasi-Newton update of a model Hessian. This idea was first proposed in [250] and [120].

We begin with a quasi-Newton form of Algorithm fdsteep. In this algorithm a quasi-Newton approximation to the Hessian is maintained and the line search is based on the quasi-Newton direction

d = −H^{−1}∇_h f(x),


terminating when either

(7.4)   f(x + λd) − f(x) < αλ∇_h f(x)^T d

or too many stepsize reductions have been taken. With the application to implicit filtering in mind, Algorithm fdquasi replaces the quasi-Newton H with the identity matrix when the line search fails.

Algorithm 7.2.1. fdquasi(x, f, H, pmax, τ, h, amax)

1. For p = 1, . . . , pmax

(a) Compute f, ∇_h f, and d = −H^{−1}∇_h f; terminate if (7.2) holds.

(b) Find the least integer 0 ≤ m ≤ amax such that (7.4) holds for λ = β^m.

(c) x = x + λd.

(d) Update H with a quasi-Newton formula.

In the context of implicit filtering, where N is small, the full quasi-Newton Hessian or its inverse is maintained throughout the iteration. Our MATLAB codes store the model Hessian.

Algorithm 7.2.2. imfilter2(x, f, pmax, τ, {h_k}, amax)

1. H = I.

2. For k = 0, . . .

Call fdquasi(x, f, H, pmax, τ, h_k, amax).

In [250] and [120] the SR1 method was used because it performed somewhat better than the BFGS method in the context of a particular application. The examples in §7.6 show the opposite effect, and both methods have been successfully used in practice.

7.3 Implementation Considerations

Implicit filtering has several iterative parameters and requires some algorithmic decisions in its implementation. The parameters pmax, amax, and β play the same role that they do in any line search algorithm. In our MATLAB code imfil.m, which we used for all the computations reported in this book, we set pmax = 200 × n, amax = 10, and β = 1/2.

The performance of implicit filtering can be sensitive to the value of τ [250], with small values of τ leading to stagnation and values of τ that are too large leading to premature termination of fdquasi. Using stencil failure as a termination criterion reduces the sensitivity to small values of τ, and we use τ = .01 in the computations.

The sequence of scales is at best a guess at the level of the noise in the problem. If several of the scales are smaller than the level of the noise, the line search will fail immediately and work at these scales will be wasted. Our implementation attempts to detect this by terminating the optimization if x is unchanged for three consecutive scales.

The simplex gradient may be a very poor approximation to the gradient. In some such cases the function evaluation at a trial point may fail to return a value [250], and one must either trap this failure and return an artificially large value, impose bound constraints, or impose a limit on the size of the step. In our computations we take the latter approach and limit the stepsize to 10h by setting

(7.5)   d = { −H^{−1}∇_h f(x)                                    if ‖H^{−1}∇_h f(x)‖ ≤ 10h,
            { −10h H^{−1}∇_h f(x) / ‖H^{−1}∇_h f(x)‖             otherwise.
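The step limit (7.5) is a simple trust-region-like safeguard and can be sketched in a few lines of Python (the function name is ours; the argument is the unscaled quasi-Newton step H^{−1}∇_h f(x)):

```python
import numpy as np

def limited_direction(Hinv_g, h):
    """Sketch of (7.5): use the quasi-Newton step -H^{-1} grad_h f(x)
    unless its length exceeds 10h, in which case scale it back to 10h."""
    norm = np.linalg.norm(Hinv_g)
    if norm <= 10.0 * h:
        return -Hinv_g
    return -(10.0 * h / norm) * Hinv_g   # same direction, length exactly 10h
```

Either branch leaves the direction of the step unchanged; only its length is capped.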


The choice of a quasi-Newton method to use with implicit filtering is an area of active research [56], [55]. Both SR1 and BFGS have been used, with SR1 performing modestly better in some applications with bound constraints [270], [251], [271], [250], [55]. The implementation of implicit filtering in the collection of MATLAB codes imfil.m uses BFGS as the default but has SR1 as an option. We found BFGS with central differences to be consistently better in the preparation of the (unconstrained!) computational examples in this book.

7.4 Implicit Filtering for Bound Constrained Problems

Implicit filtering was initially designed as an algorithm for bound constrained problems [250], [120]. The bound constrained version we present here is simply a projected quasi-Newton algorithm like the one presented in §5.5.3. There are other approaches to the implementation, and no best approach has emerged. We refer the reader to [120] and [55] for discussions of the options.

We begin with scaling and the difference gradient. Central differences perform better, but we do not evaluate f outside of the feasible region. Hence, if a point on the centered difference stencil is outside of the feasible region, we use a one-sided difference in that direction. In order to guarantee that at least one point in each direction is feasible, we scale the variables so that L_i = 0, U_i = 1, and h_0 ≤ 1/2.
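The feasibility-aware difference gradient just described can be sketched as follows, assuming the variables have already been scaled so the feasible box is [0, 1]^N and h ≤ 1/2 (a sketch with hypothetical names, not the book's implementation):

```python
import numpy as np

def box_difference_gradient(f, x, h):
    """Sketch of the bound constrained difference gradient: a central
    difference in direction i when both stencil points x +- h e_i lie in
    [0, 1]^N, and a one-sided difference toward the interior otherwise."""
    n = len(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        plus_ok = x[i] + h <= 1.0
        minus_ok = x[i] - h >= 0.0
        if plus_ok and minus_ok:                       # both points feasible
            g[i] = (f(x + h * e) - f(x - h * e)) / (2.0 * h)
        elif plus_ok:                                  # at (or near) lower bound
            g[i] = (f(x + h * e) - f(x)) / h
        else:                                          # at (or near) upper bound
            g[i] = (f(x) - f(x - h * e)) / h
    return g
```

Since h ≤ 1/2 after scaling, at least one of the two stencil points in each direction is always feasible, so one of the three branches always applies.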

The sufficient decrease condition is (compare with (5.31))

(7.6)   f(x(λ)) − f(x) ≤ α∇_h f(x)^T (x(λ) − x),

where

x(λ) = P(x − λ∇_h f(x)).

One could terminate the iteration at a given scale when the analogue to (7.2),

(7.7)   ‖x − x(1)‖ ≤ τh,

holds or when

(7.8)   f(x_c) < f(x ± r_j) for all x ± r_j feasible,

which is the analogue to (6.7) for bound constrained problems.
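The projected step and the tests (7.6) and (7.7) can be sketched in Python for the scaled box [0, 1]^N (names are ours; the projection P is componentwise clipping onto the box):

```python
import numpy as np

def project(x):
    """Projection P onto the scaled feasible box [0, 1]^N."""
    return np.clip(x, 0.0, 1.0)

def projected_armijo_step(f, x, g, alpha=1e-4, beta=0.5, amax=10):
    """Sketch of the projected line search: accept x(lam) = P(x - lam*g)
    when the sufficient decrease condition (7.6) holds,
    f(x(lam)) - f(x) <= alpha * g^T (x(lam) - x).
    Returns the accepted point, or None on line search failure."""
    fx, lam = f(x), 1.0
    for _ in range(amax + 1):
        xl = project(x - lam * g)
        if f(xl) - fx <= alpha * (g @ (xl - x)):
            return xl
        lam *= beta
    return None

def scale_converged(x, g, tau, h):
    """Termination test (7.7): ||x - x(1)|| <= tau * h."""
    return np.linalg.norm(x - project(x - g)) <= tau * h
```

Note that the decrease in (7.6) is measured against the projected step x(λ) − x, not against −λg, which is what makes the condition meaningful at active bounds.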

Quasi-Newton methods for bound constraints can be constructed more simply for small problems, like the ones to which implicit filtering is applied, where it is practical to store the model of the inverse of the reduced Hessian as a full matrix. By using full matrix storage, the complexity of bfgsrecb is avoided. One such alternative [53], [54], [55] to the updates in §5.5.3 is to update the complete reduced Hessian and then correct it with information from the new active set. This results in a two-stage update in which a model for the inverse of the reduced Hessian is updated with (4.5) to obtain

(7.9)   R_{1/2}^{−1} = ( I − sy^T / y^T s ) R_c^{−1} ( I − ys^T / y^T s ) + ss^T / y^T s.

Then the new reduced Hessian is computed using the active set information at the new point:

(7.10)   R_+^{−1} = P_{A+} + P_{I+} R_{1/2}^{−1} P_{I+}.

It is easy to show that Theorems 5.5.4 and 5.5.5 hold for this form of the update.
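Under the assumption that (4.5) is the standard BFGS inverse update, the two-stage update (7.9)–(7.10) can be sketched as follows, with P_{A+} and P_{I+} taken as diagonal 0–1 projections onto the new active and inactive index sets (names hypothetical):

```python
import numpy as np

def two_stage_inverse_update(Rc_inv, s, y, active_new):
    """Sketch of (7.9)-(7.10): BFGS update of the inverse reduced Hessian
    model, then correction with the projections for the new active set.
    active_new is a boolean mask marking the new active indices."""
    n = len(s)
    ys = y @ s                         # y^T s; assumed positive (curvature)
    I = np.eye(n)
    # (7.9): R_{1/2}^{-1} = (I - s y^T/y^T s) Rc^{-1} (I - y s^T/y^T s)
    #        + s s^T / y^T s
    A = I - np.outer(s, y) / ys
    R_half_inv = A @ Rc_inv @ A.T + np.outer(s, s) / ys
    # (7.10): R_+^{-1} = P_{A+} + P_{I+} R_{1/2}^{-1} P_{I+}
    PA = np.diag(active_new.astype(float))
    PI = I - PA
    return PA + PI @ R_half_inv @ PI
```

With R_c^{−1} = I, s = y (consistent with a quadratic whose Hessian is the identity), and no active constraints, the update reproduces the identity, as one expects.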

A FORTRAN implementation [119] of implicit filtering for bound constrained problems is in the software collection. In the original version of that implementation a projected SR1 update was used, and a Cholesky factorization of the matrix R_+ was performed to verify positivity. The model Hessian was reinitialized to the identity whenever the scale or the active set changed.


7.5 Restarting and Minima at All Scales

No algorithm in this part of the book is guaranteed to find even a local minimum, much less a global one. One approach to improving the robustness of these algorithms is to restart the iteration after one sweep through the scales. A point x that is not changed after a call to Algorithm imfilter1 (or imfilter2, or the bound constrained form of either) is called a minimum at all scales.

If f satisfies (6.1), f_s has a unique critical point that is also a local minimum that satisfies the standard assumptions (and hence is a global minimum for f_s), and certain (strong!) technical assumptions on the decay of φ near the minimum hold, then [120] a minimum at all scales is near that global minimum of f_s. In the unconstrained case this statement follows from the termination criteria ((7.2) and (6.7)) for implicit filtering, Lemma 6.2.3 (or 6.2.7), and, if central differences are used, Theorem 6.2.9. The analysis in [120] of the bound constrained case is more technical.

In practice, restarts are expensive and need not be done for most problems. However, restarts have been reported to make a difference in some cases [178]. It is also comforting to know that one has a minimum at all scales, and the author of this book recommends testing potential optima with restarts before one uses the results in practice, but not at the stage where one is tuning the optimizer or doing preliminary evaluation of the results.

7.6 Examples

Many of these examples are from [56]. For all the examples we report results with and without a quasi-Newton Hessian. We report results for both forward and central differences. In the figures the solid line corresponds to the BFGS Hessian, the dashed-dotted line to the SR1 Hessian, and the dashed line to H = I, the steepest descent form of implicit filtering.

Unlike the smooth problems considered earlier, where convergence of the gradient to zero was supported by theory, convergence of the simplex gradient to zero is limited by the noise in the objective. We illustrate performance by plotting both the objective function value and the norm of the simplex gradient. From these examples it is clear that the graphs of function value against the count of function evaluations are a better indicator of the performance of the optimizer.

In all cases we terminated the iteration when either fdquasi had been called for each scale

or a budget of function evaluations had been exhausted. Once the code completes an iteration

and the number of function evaluations is greater than or equal to the budget, the iteration is

terminated.

The examples include both smooth and nonsmooth problems, with and without noise. A

serious problem for some algorithms of this type is their failure on very easy problems. For

most of the algorithms covered in this part of the book, we will present examples that illustrate

performance on this collection of problems.

7.6.1 Weber's Problem

The three Weber's function examples all have minimizers at points at which the objective is nondifferentiable. For the computations we used an initial iterate of (10, −10)^T, a budget of 200 function evaluations, and {10 × 2^{−n}}_{n=−2}^{8} as the sequence of scales.

In each of the examples the performance of the two quasi-Newton methods was virtually identical and far better than that without a quasi-Newton model Hessian. Forward and central differences for the first two problems (Figures 7.1 and 7.2) perform almost equally well, with forward having a slight edge. In Figure 7.3, however, the forward difference version of implicit filtering finds a local minimum different from the global minimum that is located by central


Figure 7.1: First Weber Example (central and forward difference panels: simplex gradient norm and function value vs. function evaluations)

Figure 7.2: Second Weber Example (central and forward difference panels: simplex gradient norm and function value vs. function evaluations)


[Figure panels (truncated): central differences, simplex gradient norm vs. function value]