[Figure 7.3: Third Weber Example — function value and simplex gradient norm plotted against function evaluations; forward difference results.]

differencing. This, of course, is consistent with the theory, which does not claim that implicit filtering is a global optimizer.

7.6.2 Parameter ID

We consider the parameter ID example from §1.6.2 using the data from §2.6.1. Recall that in this example we use as data the values of the exact solution for c = k = 1 at the points t_i = i/100 for 1 ≤ i ≤ 100. The initial iterate was (5, 5)^T; the sequence of scales was {2^{-k}}_{k=1}^{12}. Implicit filtering, like the globally convergent algorithms in the first part of the book, is fairly insensitive to the choice of initial iterate, as we will see when we revisit this example in §8.5.2.

We report on both low (rtol = atol = 10^{-3}, Figure 7.4) and high (rtol = atol = 10^{-6}, Figure 7.5) accuracy computations. Note that after 200 function evaluations the function reduction from the central difference BFGS form of implicit filtering flattens out in both plots at roughly the expected level of O(tol) while the other methods have not. This effect, which is not uncommon, is one reason for our preference for the BFGS central difference form of the algorithm.

7.6.3 Convex Quadratics

The performance of the central difference BFGS form of implicit filtering should be very good, since (see exercises 7.7.1 and 7.7.2) the difference approximation of the gradient is exact. We would expect that good performance to persist in the perturbed case. We illustrate this with results on two problems, both given by (6.12). One is an unperturbed problem (a_j = b_j = c_j = 0 for all j) where H is a diagonal matrix with (H)_{ii} = 1/(2i) for 1 ≤ i ≤ N. The other is a perturbed problem with

ξ_0 = (sin(1), sin(2), . . . , sin(N))^T, ξ_1 = 0, ξ_2 = (1, . . . , 1)^T,

a_1 = a_2 = .01, a_3 = 0, b_1 = (1, . . . , 1)^T, b_2 = 0, and c_1 = c_2 = 10π.

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

130 ITERATIVE METHODS FOR OPTIMIZATION

[Figure 7.4: Parameter ID, tol = 10^{-3} — function value and simplex gradient norm against function evaluations, for central (top) and forward (bottom) differences.]

[Figure 7.5: Parameter ID, tol = 10^{-6} — function value and simplex gradient norm against function evaluations, for central and forward differences.]


[Figure 7.6: Unperturbed Quadratic, N = 4 — function value and simplex gradient norm against function evaluations, for central and forward differences.]

[Figure 7.7: Unperturbed Quadratic, N = 32 — function value and simplex gradient norm against function evaluations, for central and forward differences.]


[Figure 7.8: Perturbed Quadratic, N = 4 — function value and simplex gradient norm against function evaluations, for central and forward differences.]

[Figure 7.9: Perturbed Quadratic, N = 32 — function value and simplex gradient norm against function evaluations, for central and forward differences.]


If a_3 ≠ 0 then f may not return the same value when called with the same argument twice. The reader is invited to explore the consequences of this in exercise 7.7.3.

The performance of the algorithms in this part of the book sometimes depends on the size of the problem much more strongly than the Newton-based methods in Part I. In the case of implicit filtering, that dependence is mostly a result of the cost of evaluation of the simplex gradient. To illustrate this we consider our quadratic problems for N = 4 (Figures 7.6 and 7.8) and N = 32 (Figures 7.7 and 7.9).

For all the quadratic examples the initial iterate was

x_0 = (1, 2, . . . , N)^T / (10N)

and the sequence of scales was {2^{-k}}_{k=0}^{10}.
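The setup above is easy to reproduce. The sketch below builds the initial iterate and scale sequence exactly as stated, together with the unperturbed quadratic, under the assumption (not restated here) that (6.12) reduces to f(x) = (1/2)x^T H x when all a_j, b_j, c_j vanish; the function names are ours.

```python
def quad_setup(N):
    # Initial iterate x0 = (1, 2, ..., N)^T / (10 N) and scales {2^-k}, k = 0, ..., 10
    x0 = [i / (10.0 * N) for i in range(1, N + 1)]
    scales = [2.0 ** (-k) for k in range(11)]
    return x0, scales

def f_unperturbed(x):
    # Assumed reduced form of (6.12): f(x) = (1/2) x^T H x with H diagonal,
    # (H)_ii = 1/(2i) -- the a_j = b_j = c_j = 0 case described in the text.
    return 0.5 * sum(xi * xi / (2.0 * i) for i, xi in enumerate(x, start=1))
```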

7.7 Exercises on Implicit Filtering

7.7.1. Let S be a nonsingular simplex. Show that D_C(f : S) = ∇f(x_1) if f is a quadratic function.

7.7.2. How would you expect forward and centered difference implicit filtering to perform when applied to f(x) = x^T x? Would the performance be independent of dimension? Test your expectation with numerical experimentation.

7.7.3. Use implicit filtering to minimize the perturbed quadratic function with nonzero values of a_3.

7.7.4. Try to solve the Lennard–Jones problem with implicit filtering for various values of M and various initial iterates. Compare your best results with those in [142], [40], and [210]. Are you doing any better than you did in exercise 6.4.3?

7.7.5. Show that Theorems 5.5.4 and 5.5.5 hold if the projected BFGS update is implemented using (7.9) and (7.10). How would these formulas affect an implementation like bfgsrecb, which is designed for problems in which full matrices cannot be stored?


Chapter 8

Direct Search Algorithms

In this chapter we discuss the class of direct search algorithms. These methods use values of f taken from a set of sample points and use that information to continue the sampling. Unlike implicit filtering, these methods do not explicitly use approximate gradient information. We will focus on three such methods: the Nelder–Mead simplex algorithm [204], the multidirectional search method [85], [261], [262], and the Hooke–Jeeves algorithm [145]. Each of these can be analyzed using the simplex gradient techniques from Chapter 6. We will not discuss the very general results based on the taxonomies of direct search methods from [263], [174], and [179] or the recent research on the application of these methods to bound [173] or linear [175] constraints.

We include at the end of this chapter a short discussion of methods based on surrogate models and a brief account of a very different search method, the DIRECT algorithm [150]. These two final topics do not lead to algorithms that are easy to implement, and our discussions will be very general with pointers to the literature.

8.1 The Nelder–Mead Algorithm

8.1.1 Description and Implementation

The Nelder–Mead [204] simplex algorithm maintains a simplex S of approximations to an optimal point. In this algorithm the vertices {x_j}_{j=1}^{N+1} are sorted according to the objective function values

f(x_1) ≤ f(x_2) ≤ · · · ≤ f(x_{N+1}).   (8.1)

x_1 is called the best vertex and x_{N+1} the worst. If several vertices have the same objective value as x_1, the best vertex is not uniquely defined, but this ambiguity has little effect on the performance of the algorithm.

The algorithm attempts to replace the worst vertex x_{N+1} with a new point of the form

x(µ) = (1 + µ)x̄ − µx_{N+1},   (8.2)

where x̄ is the centroid of the convex hull of {x_i}_{i=1}^{N},

x̄ = (1/N) Σ_{i=1}^{N} x_i.   (8.3)

The value of µ is selected from a sequence

−1 < µ_ic < 0 < µ_oc < µ_r < µ_e


by rules that we formally describe in Algorithm nelder. Our formulation of the algorithm allows for termination if either f(x_{N+1}) − f(x_1) is sufficiently small or a user-specified number of function evaluations has been expended.

Algorithm 8.1.1. nelder(S, f, τ, kmax)

1. Evaluate f at the vertices of S and sort the vertices of S so that (8.1) holds.

2. Set fcount = N + 1.

3. While f(x_{N+1}) − f(x_1) > τ

(a) Compute x̄, (8.3), x(µ_r), (8.2), and f_r = f(x(µ_r)). fcount = fcount + 1.

(b) Reflect: If fcount = kmax then exit. If f(x_1) ≤ f_r < f(x_N), replace x_{N+1} with x(µ_r) and go to step 3g.

(c) Expand: If fcount = kmax then exit. If f_r < f(x_1) then compute f_e = f(x(µ_e)). fcount = fcount + 1. If f_e < f_r, replace x_{N+1} with x(µ_e); otherwise replace x_{N+1} with x(µ_r). Go to step 3g.

(d) Outside Contraction: If fcount = kmax then exit. If f(x_N) ≤ f_r < f(x_{N+1}), compute f_c = f(x(µ_oc)). fcount = fcount + 1. If f_c ≤ f_r replace x_{N+1} with x(µ_oc) and go to step 3g; otherwise go to step 3f.

(e) Inside Contraction: If fcount = kmax then exit. If f_r ≥ f(x_{N+1}) compute f_c = f(x(µ_ic)). fcount = fcount + 1. If f_c < f(x_{N+1}), replace x_{N+1} with x(µ_ic) and go to step 3g; otherwise go to step 3f.

(f) Shrink: If fcount ≥ kmax − N, exit. For 2 ≤ i ≤ N + 1: set x_i = x_1 − (x_i − x_1)/2; compute f(x_i).

(g) Sort: Sort the vertices of S so that (8.1) holds.

A typical sequence [169] of candidate values for µ is

{µ_r, µ_e, µ_oc, µ_ic} = {1, 2, 1/2, −1/2}.

Figure 8.1 is an illustration of the options in two dimensions. The vertices labeled 1, 2, and

3 are those of the original simplex.
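A compact Python sketch may make the bookkeeping in Algorithm nelder concrete. This is an illustration, not the implementation that accompanies the book: the two contraction cases are condensed into one branch, a standard midpoint shrink toward x_1 is used, and the exit tests are folded into the loop condition.

```python
def nelder_mead(f, simplex, tau=1e-8, kmax=200,
                mu_r=1.0, mu_e=2.0, mu_oc=0.5, mu_ic=-0.5):
    # Sketch of Algorithm nelder with the typical parameters
    # {mu_r, mu_e, mu_oc, mu_ic} = {1, 2, 1/2, -1/2}.
    S = [list(v) for v in simplex]
    fv = [f(v) for v in S]
    fcount = len(S)
    N = len(S) - 1

    def sort():
        order = sorted(range(N + 1), key=lambda i: fv[i])
        S[:], fv[:] = [S[i] for i in order], [fv[i] for i in order]

    sort()
    while fv[-1] - fv[0] > tau and fcount < kmax:
        xbar = [sum(v[i] for v in S[:-1]) / N for i in range(N)]

        def x_of(mu):
            # (8.2): x(mu) = (1 + mu) * xbar - mu * x_{N+1}
            return [(1 + mu) * xbar[i] - mu * S[-1][i] for i in range(N)]

        xr = x_of(mu_r); fr = f(xr); fcount += 1
        if fv[0] <= fr < fv[-2]:                       # reflect
            S[-1], fv[-1] = xr, fr
        elif fr < fv[0]:                               # expand
            xe = x_of(mu_e); fe = f(xe); fcount += 1
            S[-1], fv[-1] = (xe, fe) if fe < fr else (xr, fr)
        else:                                          # contract (outside/inside)
            mu = mu_oc if fr < fv[-1] else mu_ic
            xc = x_of(mu); fc = f(xc); fcount += 1
            if fc < min(fr, fv[-1]):
                S[-1], fv[-1] = xc, fc
            else:                                      # midpoint shrink toward x_1
                for i in range(1, N + 1):
                    S[i] = [(S[0][j] + S[i][j]) / 2 for j in range(N)]
                    fv[i] = f(S[i]); fcount += 1
        sort()
    return S[0], fv[0], fcount
```

On a smooth convex quadratic in two variables this sketch converges quickly, consistent with the good practical behavior reported below; the stagnation examples of §8.1.3 show that such behavior is not guaranteed.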

The Nelder–Mead algorithm is not guaranteed to converge, even for smooth problems [89], [188]. The failure mode is stagnation at a nonoptimal point. In §8.1.3 we will present some examples from [188] that illustrate this failure. However, the performance of the Nelder–Mead algorithm in practice is generally good [169], [274]. The shrink step is rare in practice and we will assume in the analysis in §8.1.2 that shrinks do not occur. In that case, while a Nelder–Mead iterate may not result in a reduction in the best function value, the average value

f̄ = (1/(N + 1)) Σ_{j=1}^{N+1} f(x_j)

will be reduced.


[Figure 8.1: Nelder–Mead Simplex and New Points — the original simplex with vertices 1, 2, 3 and the candidate points r (reflection), e (expansion), oc (outside contraction), and ic (inside contraction).]

8.1.2 Sufficient Decrease and the Simplex Gradient

Our study of the Nelder–Mead algorithm is based on the simple ideas in §3.1. We will denote the vertices of the simplex S^k at the kth iteration by {x_j^k}_{j=1}^{N+1}. We will simplify notation by suppressing explicit mention of S^k in what follows by denoting

V^k = V(S^k), δ^k = δ(f : S^k), K^k = K(S^k), and D^k(f) = D(f : S^k).

If V^0 is nonsingular then V^k is nonsingular for all k > 0 [169]. Hence if S^0 is nonsingular so is S^k for all k and hence D^k(f) is defined for all k.

We formalize this by assuming that our sequence of simplices satisfies the following assumption.

Assumption 8.1.1. For all k,

• S^k is nonsingular.

• The vertices satisfy (8.1).

• f^{k+1} < f^k.

Assumption 8.1.1 is satisfied by the Nelder–Mead sequence if no shrink steps are taken and the initial simplex directions are linearly independent [169]. The Nelder–Mead algorithm demands that the average function value improve, but no control is possible on which value is improved, and the simplex condition number can become unbounded.

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.


We can define a sufficient decrease condition for search algorithms that is analogous to the sufficient decrease condition for steepest descent and related algorithms (3.2). We will ask that the k + 1st iteration satisfy

f^{k+1} − f^k < −α‖D^k f‖².   (8.4)

Here α > 0 is a small parameter. Our choice of sufficient decrease condition is motivated by the smooth case and steepest descent, where (3.2) and the lower bound λ̄ on λ from Lemma 3.2.3 lead to

f(x_{k+1}) − f(x_k) ≤ −λ̄α‖∇f(x_k)‖²,

which is a smooth form of (8.4). Unlike the smooth case, however, we have no descent direction and must incorporate λ̄ into α. This leads to the possibility that if the simplex diameter is much smaller than ‖D^k f‖, (8.4) could fail on the first iterate. We address this problem with the scaling

α = α_0 σ_+(S^0) / ‖D^0 f‖.

A typical choice in line search methods, which we use in our numerical results, is α_0 = 10^{-4}.
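The test (8.4) and the scaling of α translate directly into code; the function names below are ours.

```python
def scaled_alpha(alpha0, sigma_plus_S0, norm_D0f):
    # alpha = alpha0 * sigma_+(S^0) / ||D^0 f||, with alpha0 = 1e-4 typical
    return alpha0 * sigma_plus_S0 / norm_D0f

def sufficient_decrease(f_k, f_k1, norm_Dkf, alpha):
    # (8.4): accept the iteration if f^{k+1} - f^k < -alpha * ||D^k f||^2
    return f_k1 - f_k < -alpha * norm_Dkf ** 2
```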

The convergence result for smooth functions follows easily from Lemma 6.2.1.

Theorem 8.1.1. Let a sequence of simplices satisfy Assumption 8.1.1 and let the assumptions of Lemma 6.2.1 hold, with the Lipschitz constants K^k uniformly bounded. Assume that {f^k} is bounded from below. Then if (8.4) holds for all but finitely many k and

lim_{k→∞} σ_+(S^k)κ(V^k) = 0,

then any accumulation point of the simplices is a critical point of f.

Proof. The boundedness from below of {f^k} and (8.4) imply that f^k − f^{k+1} → 0. Assumption 8.1.1 and (8.4) imply that lim_{k→∞} ‖D^k f‖ = 0. Hence (6.2) implies

lim_{k→∞} ‖∇f(x_1^k)‖ ≤ lim_{k→∞} (Kκ(V^k)σ_+(S^k) + ‖D^k f‖) = 0.

Hence, if x* is any accumulation point of the sequence {x_1^k} then ∇f(x*) = 0. This completes the proof since κ(V^k) ≥ 1 and therefore σ_+(S^k) → 0.

The result for the noisy functions that satisfy (6.1) with f_s smooth reflects the fact that the resolution is limited by the size of φ. In fact, if σ_+(S^k) is much smaller than ‖φ‖_{S^k}, no information on f_s can be obtained by evaluating f at the vertices of S^k and once σ_+(S^k) is smaller than ‖φ‖_{S^k}^{1/2} no conclusions on ∇f_s can be drawn. If, however, the noise decays to zero sufficiently rapidly near the optimal point, the conclusions of Theorem 8.1.1 still hold.

Theorem 8.1.2. Let a sequence of simplices satisfy Assumption 8.1.1 and let the assumptions of Lemma 6.2.2 hold with the Lipschitz constants K_s^k uniformly bounded. Assume that {f^k} is bounded from below. Then if (8.4) holds for all but finitely many k and if

lim_{k→∞} κ(V^k) (σ_+(S^k) + ‖φ‖_{S^k} / σ_+(S^k)) = 0,   (8.5)

then any accumulation point of the simplices is a critical point of f_s.

Proof. Our assumptions, as in the proof of Theorem 8.1.1, imply that ‖D^k f‖ → 0. Recall that Lemma 6.2.2 implies that

‖D^k f_s‖ ≤ ‖D^k f‖ + K^k κ(V^k) (σ_+(S^k) + ‖φ‖_{S^k} / σ_+(S^k)),   (8.6)

and the sequence {K^k} is bounded because {K_s^k} is. Hence, by (8.5), ‖D^k f_s‖ → 0 as k → ∞.


[Figure 8.2: Unmodified Nelder–Mead, (τ, θ, φ) = (1, 15, 10) — function differences, maximum oriented length, simplex gradient norm, and simplex condition plotted against the iteration index.]

8.1.3 McKinnon's Examples

In this set of three examples from [188], N = 2, and

f(x) = { θφ|(x)_1|^τ + (x)_2 + (x)_2²,  (x)_1 ≤ 0,
       { θ(x)_1^τ + (x)_2 + (x)_2²,     (x)_1 > 0.

The examples in [188] consider the parameter sets

(τ, θ, φ) ∈ {(3, 6, 400), (2, 6, 60), (1, 15, 10)}.

The initial simplex was

x_1 = (1, 1)^T, x_2 = (λ_+, λ_−)^T, x_3 = (0, 0)^T, where λ_± = (1 ± √33)/8.
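McKinnon's function and the initial simplex are simple to code; this sketch (our naming) evaluates both branches of the definition above.

```python
import math

def mckinnon_f(x, tau, theta, phi):
    # McKinnon's example: smooth in (x)_2, with controlled smoothness in
    # (x)_1 across zero governed by tau, theta, phi.
    x1, x2 = x
    if x1 <= 0.0:
        return theta * phi * abs(x1) ** tau + x2 + x2 ** 2
    return theta * x1 ** tau + x2 + x2 ** 2

def mckinnon_initial_simplex():
    # x1 = (1,1)^T, x2 = (lam_+, lam_-)^T, x3 = (0,0)^T,
    # where lam_+- = (1 +- sqrt(33)) / 8
    lam_p = (1 + math.sqrt(33)) / 8
    lam_m = (1 - math.sqrt(33)) / 8
    return [(1.0, 1.0), (lam_p, lam_m), (0.0, 0.0)]
```

Note that the origin, where the unmodified iteration stagnates, is not a minimizer: moving in the negative (x)_2 direction decreases f.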

With this data, the Nelder–Mead iteration will stagnate at the origin, which is not a critical point for f. The stagnation mode is repeated inside contractions that leave the best point (which is not a minimizer) unchanged.

We terminated the iteration when the difference between the best and worst function values was < 10^{-8}.

We illustrate the behavior of the Nelder–Mead algorithm in Figures 8.2, 8.3, and 8.4. In all the figures we plot, as functions of the iteration index, the difference between the best and worst function values, σ_+, the maximum oriented length, the norm of the simplex gradient, and the l² condition number of the matrix of simplex directions. In all three problems stagnation is evident from the behavior of the simplex gradients. Note also how the simplex condition number is growing rapidly.


[Figure 8.3: Unmodified Nelder–Mead, (τ, θ, φ) = (2, 6, 60) — function differences, maximum oriented length, simplex gradient norm, and simplex condition against the iteration index.]

[Figure 8.4: Unmodified Nelder–Mead, (τ, θ, φ) = (3, 6, 400) — function differences, simplex diameter, simplex gradient norm, and simplex condition against the iteration index.]


8.1.4 Restarting the Nelder–Mead Algorithm

When the Nelder–Mead iteration stagnates, a restart with the same best point and a different set of directions can sometimes help. In order to formulate a restart scheme, one must first develop a strategy for detecting stagnation. One might think that a large simplex condition number would suffice for this. However [204], the ability of the Nelder–Mead simplices to drastically vary their shape is an important feature of the algorithm and looking at the simplex condition alone would lead to poor results. Failure of (8.4), however, seems to indicate that something is wrong, and we will use that as our stagnation detector.

Having detected stagnation, one must modify the simplex. Simply performing a shrink step is not effective. The method we advocate here, from [155], is the oriented restart. The motivation is that if the simplex gradient can be trusted to be in the correct orthant in R^N, a new, smaller simplex with orthogonal edges oriented with that orthant should direct the iteration in a productive direction.

We propose performing an oriented restart when (8.4) fails but f^{k+1} − f^k < 0. This means replacing the current simplex with vertices {x_j}_{j=1}^{N+1}, ordered so that (8.1) holds, with a new smaller simplex having vertices (before ordering!) {y_j}_{j=1}^{N+1} with y_1 = x_1 and

y_j = y_1 − β_{j−1} e_{j−1} for 2 ≤ j ≤ N + 1,   (8.7)

where, for 1 ≤ l ≤ N, e_l is the lth coordinate vector,

β_l = { (1/2) σ_−(S^k) sign((D^k f)_l),  (D^k f)_l ≠ 0,
      { (1/2) σ_−(S^k),                  (D^k f)_l = 0,

and (D^k f)_l is the lth component of D^k f. If D^k f = 0 we assume that the Nelder–Mead iteration would have been terminated at iteration k because there is no difference between best and worst values.

So, before ordering, the new simplex has the same first point as the old. The diameter of the new simplex has not been increased since the diameter of the new simplex is at most σ_+(S^k). Moreover all edge lengths have been reduced. So after reordering σ_+(S^{k+1}) ≤ σ_−(S^k). As for κ, after the oriented shrink, but before reordering, κ(V) = 1. After reordering, of course, the best point may no longer be x_1. In any case the worst-case bound on κ is

κ(V^{k+1}) ≤ (1 + √N)².   (8.8)

In any case, the new simplex is well conditioned.
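The oriented restart (8.7) is only a few lines of code; the sketch below (our naming) returns the new, unordered vertex set.

```python
def oriented_restart(x1, Dkf, sigma_minus):
    # (8.7): y_1 = x_1 and y_j = y_1 - beta_{j-1} e_{j-1}, j = 2, ..., N+1,
    # with beta_l = 0.5 * sigma_-(S^k) * sign((D^k f)_l) if (D^k f)_l != 0
    # and beta_l = 0.5 * sigma_-(S^k) otherwise.
    N = len(x1)
    new_simplex = [list(x1)]
    for l in range(N):
        sgn = -1.0 if Dkf[l] < 0 else 1.0
        y = list(x1)
        y[l] -= 0.5 * sigma_minus * sgn
        new_simplex.append(y)
    return new_simplex
```

By construction the edges y_j − y_1 are (scaled) coordinate vectors, so the direction matrix of the new simplex is orthogonal before reordering.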

Returning to the McKinnon examples, we find that an oriented restart did remedy stagnation for the smooth examples. The graphs in Figures 8.5, 8.6, and 8.7 report the same data as for the unmodified algorithm, with stars on the plots denoting oriented restarts.

For the smoothest example, (τ, θ, φ) = (3, 6, 400), the modified form of Nelder–Mead took a single oriented restart at the 21st iteration. For the less smooth of these two, (τ, θ, φ) = (2, 6, 60), a single restart was taken on the 19th iteration. As one can see from Figures 8.6 and 8.7 the restart had an immediate effect on the simplex gradient norm and overcame the stagnation.

For the nonsmooth example, (τ, θ, φ) = (1, 15, 10), in Figure 8.5, the modified algorithm terminated with failure after restarting on the 44th, 45th, and 46th iterations. Since the objective is not smooth at the stagnation point, this is the best we can expect and is far better than the behavior of the unmodified algorithm, which stagnates with no warning of the failure.


[Figure 8.5: Modified Nelder–Mead, (τ, θ, φ) = (1, 15, 10) — function differences, maximum oriented length, simplex gradient norm, and simplex condition against the iteration index; stars denote oriented restarts.]

[Figure 8.6: Modified Nelder–Mead, (τ, θ, φ) = (2, 6, 60) — function differences, maximum oriented length, simplex gradient norm, and simplex condition against the iteration index; stars denote oriented restarts.]


[Figure 8.7: Modified Nelder–Mead, (τ, θ, φ) = (3, 6, 400) — function differences, maximum oriented length, simplex gradient norm, and simplex condition against the iteration index; stars denote oriented restarts.]

8.2 Multidirectional Search

8.2.1 Description and Implementation

One way to address the possible ill-conditioning in the Nelder–Mead algorithm is to require that the condition numbers of the simplices be bounded. The multidirectional search (MDS) method [85], [261], [262] does this by making each new simplex congruent to the previous one. The results in this section, mostly taken from [29], show that MDS has convergence properties like those of implicit filtering.

In the special case of equilateral simplices, V^k is a constant multiple of V^0 and the simplex condition number is constant. If the simplices are not equilateral, then κ(V) may vary depending on which vertex is called x_1, but (6.6) will hold in any case.

Figure 8.8 illustrates the two-dimensional case for two types of simplices. Beginning with the ordered simplex S^c with vertices x_1, x_2, x_3 one first attempts a reflection step, leading to a simplex S^r with vertices x_1, r_2, r_3.

If the best function value of the vertices of S^r is better than the best f(x_1) in S^0, S^r is (provisionally) accepted and expansion is attempted. The expansion step differs from that in the Nelder–Mead algorithm because N new points are needed to make the new, larger simplex similar to the old one. The expansion simplex S^e has vertices x_1, e_2, e_3 and is accepted over S^r if the best function value of the vertices of S^e is better than the best in S^r. If the best function value of the vertices of S^r is not better than the best in S^c, then the simplex is contracted and the new simplex has vertices x_1, c_2, c_3. After the new simplex is identified, the vertices are reordered to create the new ordered simplex S^+.

Similar to the Nelder–Mead algorithm, there are expansion and contraction parameters µ_e and µ_c. Typical values for these are 2 and 1/2.

Algorithm 8.2.1. mds(S, f, τ, kmax)


[Figure 8.8: MDS Simplices and New Points — reflection (r), expansion (e), and contraction (c) points for a right simplex and an equilateral simplex.]

1. Evaluate f at the vertices of S and sort the vertices of S so that (8.1) holds.

2. Set fcount = N + 1.

3. While f(x_{N+1}) − f(x_1) > τ

(a) Reflect: If fcount = kmax then exit. For j = 2, . . . , N + 1: r_j = x_1 − (x_j − x_1); compute f(r_j); fcount = fcount + 1. If f(x_1) > min_j {f(r_j)} then goto step 3b else goto step 3c.

(b) Expand:
i. For j = 2, . . . , N + 1: e_j = x_1 − µ_e(x_j − x_1); compute f(e_j); fcount = fcount + 1.
ii. If min_j {f(r_j)} > min_j {f(e_j)} then for j = 2, . . . , N + 1: x_j = e_j; else for j = 2, . . . , N + 1: x_j = r_j.
iii. Goto step 3d.

(c) Contract: For j = 2, . . . , N + 1: x_j = x_1 + µ_c(x_j − x_1); compute f(x_j).

(d) Sort: Sort the vertices of S so that (8.1) holds.

If the function values at the vertices of S^c are known, then the cost of computing S^+ is 2N additional evaluations. Just as with the Nelder–Mead algorithm, the expansion step is optional but has been observed to improve performance.

The extension of MDS to bound constrained and linearly constrained problems is not trivial. We refer the reader to [173] and [175] for details.
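One sweep of the reflect/expand-or-contract logic in Algorithm mds can be sketched as follows (a simplified step with our naming; the termination test and fcount bookkeeping of the full algorithm are omitted). Because every new vertex is of the form x_1 − µ(x_j − x_1), each candidate simplex is similar to the old one.

```python
def mds_step(f, S, fv, mu_e=2.0, mu_c=0.5):
    # One step of MDS.  S is an ordered simplex: fv[0] = f(S[0]) is the best
    # value.  Returns the new (unsorted) simplex and its function values.
    x1 = S[0]
    n = len(x1)

    def move(mu):
        # x_1 - mu * (x_j - x_1) for each non-best vertex x_j
        return [[x1[i] - mu * (xj[i] - x1[i]) for i in range(n)] for xj in S[1:]]

    R = move(1.0)                      # reflect
    fR = [f(r) for r in R]
    if min(fR) < fv[0]:
        E = move(mu_e)                 # expand
        fE = [f(e) for e in E]
        if min(fE) < min(fR):
            return [x1] + E, [fv[0]] + fE
        return [x1] + R, [fv[0]] + fR
    C = move(-mu_c)                    # contract: x_1 + mu_c * (x_j - x_1)
    return [x1] + C, [fv[0]] + [f(c) for c in C]
```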

8.2.2 Convergence and the Simplex Gradient

Assume that the simplices are either equilateral or right simplices (having one vertex from which all N edges are at right angles). In those cases, as pointed out in [262], the possible vertices created by expansion and reflection steps form a regular lattice of points. If the MDS simplices remain bounded, only finitely many reflections and expansions are possible before every point on that lattice has been visited and a contraction to a new maximal simplex size must take place. This exhaustion of a lattice takes place under more general conditions [262] but is most clear for the equilateral and right simplex cases.

Theorem 6.2.9 implies that infinitely many contractions and convergence of the simplex diameters to zero imply convergence of the simplex gradient to zero. The similarity of Theorem 6.2.9 to Lemma 6.2.2 and of Theorem 8.2.1, the convergence result for multidirectional search, to Theorem 8.1.2 is no accident. The Nelder–Mead iteration, which is more aggressive than the multidirectional search iteration, requires far stronger assumptions (well conditioning and sufficient decrease) for convergence, but the ideas are the same. Theorems 6.2.9 and 8.2.1 can be used to extend the results in [262] to the noisy case. The observation in [85] that one can apply any heuristic or machine-dependent idea to improve performance, say, by exploring far away points on spare processors (the "speculative function evaluations" of [46]) without affecting the analysis is still valid here.

Theorem 8.2.1. Let f satisfy (6.1) and assume that the set

{x | f(x) ≤ f(x_1^0)}

is bounded. Assume that the simplex shape is such that

lim_{k→∞} σ_+(S^k) = 0.   (8.9)

Let B^k be a ball of radius 2σ_+(S^k) about x_1^k. Then if

lim_{k→∞} ‖φ‖_{B^k} / σ_+(S^k) = 0

then every limit point of the vertices is a critical point of f_s.

Recall that if the simplices are equilateral or right simplices, then (8.9) holds (see exercise 8.6.2).

8.3 The Hooke–Jeeves Algorithm

8.3.1 Description and Implementation

The Hooke–Jeeves algorithm is like implicit filtering in that the objective is evaluated on a stencil and the function values are used to compute a search direction. However, unlike implicit filtering, there are only finitely many possible search directions and only qualitative information about the function values is used.

The algorithm begins with a base point x and pattern size h, which is like the scale in implicit filtering. In the next phase of the algorithm, called the exploratory move in [145], the function is sampled at successive perturbations of the base point in the search directions {v_j}, where v_j is the jth column of a direction matrix V. In [145] and our MATLAB implementation V = I. The current best value f_cb = f(x_cb) and best point x_cb are recorded and returned. x_cb is initialized to x. The sampling is managed by first evaluating f at x_cb + v_j and only testing x_cb − v_j if f(x_cb + v_j) ≥ f(x_cb). The exploratory phase will either produce a new base point or fail (meaning that x_cb = x). Note that this phase depends on the ordering of the coordinates of x. Applying a permutation to x could change the output of the exploration.
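The exploratory move with V = I can be sketched as follows (our naming; the MATLAB implementation that accompanies the book differs in detail).

```python
def explore(f, x, fx, h):
    # Hooke-Jeeves exploratory move with V = I: perturb the current best point
    # one coordinate at a time; try +h first and test -h only if +h fails.
    xcb, fcb = list(x), fx
    for j in range(len(xcb)):
        xp = list(xcb); xp[j] += h
        fp = f(xp)
        if fp < fcb:
            xcb, fcb = xp, fp
        else:
            xm = list(xcb); xm[j] -= h
            fm = f(xm)
            if fm < fcb:
                xcb, fcb = xm, fm
    return xcb, fcb          # xcb == x signals failure of the phase
```

Because each coordinate is perturbed from the current best point, not from the original x, the result depends on the ordering of the coordinates, as noted above.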

If the exploratory phase has succeeded, the search direction is
$$d_{HJ} = x_{cb} - x \qquad (8.10)$$

and the new base point is x_cb. The subtle part of the algorithm begins here. Rather than center the next exploration at x_cb, which would use some of the same points that were examined in

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

146 ITERATIVE METHODS FOR OPTIMIZATION

the previous exploration, the Hooke–Jeeves pattern move step is aggressive and tries to move further. The algorithm centers the next exploratory move at
$$x_C = x + 2 d_{HJ} = x_{cb} + d_{HJ}.$$
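The identity above, that stepping twice the search direction from the old base point lands one step beyond the new one, can be checked in a few lines. This is an illustrative Python fragment with made-up numbers, not part of the book's MATLAB suite:

```python
# Hypothetical data: old base point x and the best point x_cb
# returned by the exploratory move.
x = [0.0, 0.0]
x_cb = [1.0, 0.5]
d_hj = [x_cb[i] - x[i] for i in range(2)]                  # d_HJ = x_cb - x, as in (8.10)
xc_from_base = [x[i] + 2.0 * d_hj[i] for i in range(2)]    # x + 2 d_HJ
xc_from_best = [x_cb[i] + d_hj[i] for i in range(2)]       # x_cb + d_HJ
assert xc_from_base == xc_from_best == [2.0, 1.0]          # the two forms agree
```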

If this second exploratory move fails to improve upon f(x_cb), then an exploratory move with x_cb as the center is tried. If that fails, h is reduced, x is set to x_cb, and the process is started over. Note that when h has just been set, the base point and the center of the stencil for the exploratory moves are the same, but afterward they are not.

If, after the first exploratory move, x_cb = x (as it will be if x is the best point in the pattern), then x is left unchanged and h is reduced.

Therefore, whenever h is reduced, the stencil centered at x has x itself as the best point. This is exactly the situation that led to a shrink in the MDS algorithm and, as you might expect, will enable us to prove a convergence result like those in the previous sections. In [145] h was simply multiplied by a constant factor. Our description in Algorithm hooke follows the model of implicit filtering and uses a sequence of scales. The choice of perturbation directions could be generalized to any simplex shape, not just the right simplices used in [145].

Figure 8.9 illustrates the idea for N = 2. The base point x lies at the center of the stencil. If
$$f(x_1^+) < f(x), \quad f(x_2^+) < f(x), \quad f(x_1^-) \ge f(x), \quad \text{and} \quad f(x_2^-) \ge f(x),$$
then the new base point x_b will be located above and to the right of x. The next exploratory move will be centered at x_C, which is the center of the stencil in the upper right corner of the figure.

The reader, especially one who plans to implement this method, must be mindful that points may be sampled more than once. For example, in the figure, if the exploratory move centered at x_C fails, f will be evaluated for the second time at the four points in the stencil centered at x_b unless the algorithm is implemented to avoid this. The MDS method is also at risk of sampling points more than once. The implementations of Hooke–Jeeves and MDS in our suite of MATLAB codes keep the most recent 4N iterations in memory to guard against this. This reevaluation is much less likely for the Nelder–Mead and implicit filtering methods. One should also be aware that the Hooke–Jeeves algorithm, like Nelder–Mead, does not have the natural parallelism that implicit filtering and MDS do.
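One simple guard against resampling is to wrap the objective in a bounded cache of recently evaluated points. The sketch below is a hypothetical Python analogue of the bookkeeping described above, not the MATLAB codes' actual mechanism; the capacity 4*N echoes the "most recent 4N iterations" remark:

```python
from collections import OrderedDict

def with_history(f, capacity):
    """Wrap f so recently evaluated points are served from memory."""
    history = OrderedDict()          # maps point (as a tuple) -> f value
    stats = {"true_evals": 0}        # counts genuine calls to f
    def fc(x):
        key = tuple(x)
        if key in history:
            history.move_to_end(key)     # refresh recency; no new evaluation
            return history[key]
        stats["true_evals"] += 1
        val = f(x)
        history[key] = val
        if len(history) > capacity:      # forget the oldest point
            history.popitem(last=False)
        return val
    fc.stats = stats
    return fc

# With N = 2 the wrapper remembers the last 4 * 2 = 8 distinct points.
g = with_history(lambda x: x[0] ** 2 + x[1] ** 2, capacity=8)
g([1.0, 1.0]); g([0.5, 1.0]); g([1.0, 1.0])   # third call is a cache hit
```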

One could implement a variant of the Hooke–Jeeves iteration by using x_C = x + d_HJ instead of x_C = x + 2d_HJ and shrinking the size of the simplex on stencil failure. This is the discrete form of the classical coordinate descent algorithm [180] and can also be analyzed by the methods of this section (see [279] for a different view).

Our implementation follows the model of implicit filtering as well as the description in [145]. We begin with the exploratory phase, which uses a base point x_b, base function value f_b = f(x_b), and stencil center x_C. Note that in the algorithm x_b = x_C for the first exploration and x_C = x_b + d_HJ thereafter. Algorithm hjexplore takes a base point and a scale and returns a direction and the value at the trial point x + d. We let V = I be the matrix of coordinate directions, but any nonsingular matrix of search directions could be used. The status flag sf is used to signal failure and trigger a shrink step.

Algorithm 8.3.1. hjexplore(x_b, x_C, f, h, sf)

1. f_b = f(x_b); d = 0; sf = 0; x_cb = x_b; f_cb = f(x_b); x_t = x_C

2. for j = 1, ..., N: p = x_t + h v_j; if f(p) ≥ f_b then p = x_t − h v_j;
   if f(p) < f_b then x_t = x_cb = p; f_b = f(x_cb)

3. if x_cb ≠ x_b; sf = 1; x_b = x_cb
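A Python transliteration of Algorithm hjexplore may help (the book's suite is in MATLAB; this sketch adopts the conventions that V = I and that sf and the best point are returned rather than passed by reference):

```python
def hjexplore(xb, xc, f, h):
    """Exploratory move: sample the stencil of size h centered at xc.
    Returns (sf, x_cb): sf = 1 on success, plus the current best point."""
    n = len(xb)
    fb = f(xb)                         # base function value
    xcb = list(xb)                     # current best point, initialized to xb
    xt = list(xc)                      # working point, moves with improvements
    sf = 0
    for j in range(n):
        # v_j is the jth coordinate direction (V = I)
        p = list(xt); p[j] += h
        if f(p) >= fb:                 # forward step no good: try backward
            p = list(xt); p[j] -= h
        if f(p) < fb:                  # accept the improvement
            xt = xcb = list(p)
            fb = f(p)
            sf = 1                     # the exploration has succeeded
    return sf, xcb

# A quadratic example: from the base point (1, 1) with h = 1/2 the
# exploration moves down and to the left, to (1/2, 1/2).
sf, best = hjexplore([1.0, 1.0], [1.0, 1.0],
                     lambda z: z[0] ** 2 + z[1] ** 2, 0.5)
```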


[Figure 8.9: Hooke–Jeeves Pattern and New Points — the stencil about the base point x (points x_1^±, x_2^±), the new base point x_b, and the next stencil about x_C (points x_{C1}^±, x_{C2}^±).]

The exploration is coupled to the pattern move to complete the algorithm for a single value of the scale. The inputs for Algorithm hjsearch are an initial iterate x, the function, and the scale. On output, a point x is returned for which the exploration has failed. There are other considerations, such as the budget for function evaluations, that should trigger a return from the exploratory phase in a good implementation. In our MATLAB code hooke.m we pay attention to the number of function evaluations and the change in the function value as part of the decision to return from the exploratory phase.

Algorithm 8.3.2. hjsearch(x, f, h)

1. x_b = x; x_C = x; sf = 1

2. Call hjexplore(x, x_C, f, h, sf)

3. While sf = 1

   (a) d = x − x_b; x_b = x; x_C = x + d

   (b) Call hjexplore(x, x_C, f, h, sf);
       If sf = 0; x_C = x; Call hjexplore(x, x_C, f, h, sf)
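Putting the pieces together, here is a self-contained Python sketch of hjsearch. It is a transliteration of the MATLAB-style pseudocode above under the same conventions as before (V = I, values returned rather than passed by reference), and the exploratory move is repeated so the fragment runs on its own:

```python
def hjexplore(xb, xc, f, h):
    """Algorithm 8.3.1 with V = I; returns (sf, current best point)."""
    fb, xcb, xt, sf = f(xb), list(xb), list(xc), 0
    for j in range(len(xb)):
        p = list(xt); p[j] += h        # forward perturbation
        if f(p) >= fb:
            p = list(xt); p[j] -= h    # backward perturbation
        if f(p) < fb:
            xt = xcb = list(p); fb = f(p); sf = 1
    return sf, xcb

def hjsearch(x, f, h):
    """Algorithm 8.3.2: Hooke-Jeeves search for a single scale h."""
    xb = list(x)
    sf, x = hjexplore(xb, xb, f, h)    # first exploration: base = center
    while sf == 1:
        d = [x[i] - xb[i] for i in range(len(x))]   # d_HJ = x_cb - x
        xb = x
        # center the next exploration at x + d (= old base + 2 d_HJ)
        xc = [xb[i] + d[i] for i in range(len(x))]
        sf, x = hjexplore(xb, xc, f, h)
        if sf == 0:                    # pattern move failed: recenter at xb
            sf, x = hjexplore(xb, xb, f, h)
    return x                           # the exploration at x has failed

# Minimizing x_1^2 + x_2^2 from (1, 1) with h = 1/2 reaches the origin.
result = hjsearch([1.0, 1.0], lambda z: z[0] ** 2 + z[1] ** 2, 0.5)
```

Note how the failure branch mirrors step 3(b): when the aggressive pattern-centered exploration fails, the algorithm falls back to an exploration centered at the base point before declaring stencil failure for this scale.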