Contents

Preface

How to Get the Software

I Optimization of Smooth Functions

1 Basic Concepts
  1.1 The Problem
  1.2 Notation
  1.3 Necessary Conditions
  1.4 Sufficient Conditions
  1.5 Quadratic Objective Functions
    1.5.1 Positive Definite Hessian
    1.5.2 Indefinite Hessian
  1.6 Examples
    1.6.1 Discrete Optimal Control
    1.6.2 Parameter Identification
    1.6.3 Convex Quadratics
  1.7 Exercises on Basic Concepts

2 Local Convergence of Newton's Method
  2.1 Types of Convergence
  2.2 The Standard Assumptions
  2.3 Newton's Method
    2.3.1 Errors in Functions, Gradients, and Hessians
    2.3.2 Termination of the Iteration
  2.4 Nonlinear Least Squares
    2.4.1 Gauss–Newton Iteration
    2.4.2 Overdetermined Problems
    2.4.3 Underdetermined Problems
  2.5 Inexact Newton Methods
    2.5.1 Convergence Rates
    2.5.2 Implementation of Newton–CG
  2.6 Examples
    2.6.1 Parameter Identification
    2.6.2 Discrete Control Problem
  2.7 Exercises on Local Convergence


Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.


3 Global Convergence
  3.1 The Method of Steepest Descent
  3.2 Line Search Methods and the Armijo Rule
    3.2.1 Stepsize Control with Polynomial Models
    3.2.2 Slow Convergence of Steepest Descent
    3.2.3 Damped Gauss–Newton Iteration
    3.2.4 Nonlinear Conjugate Gradient Methods
  3.3 Trust Region Methods
    3.3.1 Changing the Trust Region and the Step
    3.3.2 Global Convergence of Trust Region Algorithms
    3.3.3 A Unidirectional Trust Region Algorithm
    3.3.4 The Exact Solution of the Trust Region Problem
    3.3.5 The Levenberg–Marquardt Parameter
    3.3.6 Superlinear Convergence: The Dogleg
    3.3.7 A Trust Region Method for Newton–CG
  3.4 Examples
    3.4.1 Parameter Identification
    3.4.2 Discrete Control Problem
  3.5 Exercises on Global Convergence

4 The BFGS Method
  4.1 Analysis
    4.1.1 Local Theory
    4.1.2 Global Theory
  4.2 Implementation
    4.2.1 Storage
    4.2.2 A BFGS–Armijo Algorithm
  4.3 Other Quasi-Newton Methods
  4.4 Examples
    4.4.1 Parameter ID Problem
    4.4.2 Discrete Control Problem
  4.5 Exercises on BFGS

5 Simple Bound Constraints
  5.1 Problem Statement
  5.2 Necessary Conditions for Optimality
  5.3 Sufficient Conditions
  5.4 The Gradient Projection Algorithm
    5.4.1 Termination of the Iteration
    5.4.2 Convergence Analysis
    5.4.3 Identification of the Active Set
    5.4.4 A Proof of Theorem 5.2.4
  5.5 Superlinear Convergence
    5.5.1 The Scaled Gradient Projection Algorithm
    5.5.2 The Projected Newton Method
    5.5.3 A Projected BFGS–Armijo Algorithm
  5.6 Other Approaches
    5.6.1 Infinite-Dimensional Problems
  5.7 Examples
    5.7.1 Parameter ID Problem
    5.7.2 Discrete Control Problem


  5.8 Exercises on Bound Constrained Optimization

II Optimization of Noisy Functions

6 Basic Concepts and Goals
  6.1 Problem Statement
  6.2 The Simplex Gradient
    6.2.1 Forward Difference Simplex Gradient
    6.2.2 Centered Difference Simplex Gradient
  6.3 Examples
    6.3.1 Weber's Problem
    6.3.2 Perturbed Convex Quadratics
    6.3.3 Lennard–Jones Problem
  6.4 Exercises on Basic Concepts

7 Implicit Filtering
  7.1 Description and Analysis of Implicit Filtering
  7.2 Quasi-Newton Methods and Implicit Filtering
  7.3 Implementation Considerations
  7.4 Implicit Filtering for Bound Constrained Problems
  7.5 Restarting and Minima at All Scales
  7.6 Examples
    7.6.1 Weber's Problem
    7.6.2 Parameter ID
    7.6.3 Convex Quadratics
  7.7 Exercises on Implicit Filtering

8 Direct Search Algorithms
  8.1 The Nelder–Mead Algorithm
    8.1.1 Description and Implementation
    8.1.2 Sufficient Decrease and the Simplex Gradient
    8.1.3 McKinnon's Examples
    8.1.4 Restarting the Nelder–Mead Algorithm
  8.2 Multidirectional Search
    8.2.1 Description and Implementation
    8.2.2 Convergence and the Simplex Gradient
  8.3 The Hooke–Jeeves Algorithm
    8.3.1 Description and Implementation
    8.3.2 Convergence and the Simplex Gradient
  8.4 Other Approaches
    8.4.1 Surrogate Models
    8.4.2 The DIRECT Algorithm
  8.5 Examples
    8.5.1 Weber's Problem
    8.5.2 Parameter ID
    8.5.3 Convex Quadratics
  8.6 Exercises on Search Algorithms


Bibliography

Index

Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.

Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.

Preface

This book on unconstrained and bound constrained optimization can be used as a tutorial for

self-study or a reference by those who solve such problems in their work. It can also serve as a

textbook in an introductory optimization course.

As in my earlier book [154] on linear and nonlinear equations, we treat a small number of

methods in depth, giving a less detailed description of only a few (for example, the nonlinear

conjugate gradient method and the DIRECT algorithm). We aim for clarity and brevity rather

than complete generality and confine our scope to algorithms that are easy to implement (by the

reader!) and understand.

One consequence of this approach is that the algorithms in this book are often special cases

of more general ones in the literature. For example, in Chapter 3, we provide details only

for trust region globalizations of Newton's method for unconstrained problems and line search

globalizations of the BFGS quasi-Newton method for unconstrained and bound constrained

problems. We refer the reader to the literature for more general results. Our intention is that

both our algorithms and proofs, being special cases, are more concise and simple than others in

the literature and illustrate the central issues more clearly than a fully general formulation.

Part II of this book covers some algorithms for noisy or global optimization or both. There

are many interesting algorithms in this class, and this book is limited to those deterministic

algorithms that can be implemented in a more-or-less straightforward way. We do not, for

example, cover simulated annealing, genetic algorithms, response surface methods, or random

search procedures.

The reader of this book should be familiar with the material in an elementary graduate level

course in numerical analysis, in particular direct and iterative methods for the solution of linear

equations and linear least squares problems. The material in texts such as [127] and [264] is

sufficient.

A suite of MATLAB* codes has been written to accompany this book. These codes were

used to generate the computational examples in the book, but the algorithms do not depend

on the MATLAB environment and the reader can easily implement the algorithms in another

language, either directly from the algorithmic descriptions or by translating the MATLAB code.

The MATLAB environment is an excellent choice for experimentation, doing the exercises, and

small-to-medium-scale production work. Large-scale work on high-performance computers is

best done in another language. The reader should also be aware that there is a large amount of

high-quality software available for optimization. The book [195], for example, provides pointers

to several useful packages.

Parts of this book are based upon work supported by the National Science Foundation over

several years, most recently under National Science Foundation grants DMS-9321938, DMS-9700569, and DMS-9714811, and by allocations of computing resources from the North Carolina

Supercomputing Center. Any opinions, findings, and conclusions or recommendations expressed

* MATLAB is a registered trademark of The MathWorks, Inc., 24 Prime Park Way, Natick, MA 01760, USA, (508)

653-1415, info@mathworks.com, http://www.mathworks.com.


in this material are those of the author and do not necessarily reflect the views of the National

Science Foundation or of the North Carolina Supercomputing Center.

The list of students and colleagues who have helped me with this project, directly, through

collaborations/discussions on issues that I treat in the manuscript, by providing pointers to the

literature, or as a source of inspiration, is long. I am particularly indebted to Tom Banks, Jim

Banoczi, John Betts, David Bortz, Steve Campbell, Tony Choi, Andy Conn, Douglas Cooper, Joe

David, John Dennis, Owen Eslinger, Jörg Gablonsky, Paul Gilmore, Matthias Heinkenschloß,

Laura Helfrich, Lea Jenkins, Vickie Kearn, Carl and Betty Kelley, Debbie Lockhart, Casey Miller,

Jorge Moré, Mary Rose Muccie, John Nelder, Chung-Wei Ng, Deborah Poulson, Ekkehard

Sachs, Dave Shanno, Joseph Skudlarek, Dan Sorensen, John Strikwerda, Mike Tocci, Jon Tolle,

Virginia Torczon, Floria Tosca, Hien Tran, Margaret Wright, Steve Wright, and Kevin Yoemans.

C. T. Kelley

Raleigh, North Carolina


How to Get the Software

All computations reported in this book were done in MATLAB (version 5.2 on various SUN

SPARCstations and on an Apple Macintosh Powerbook 2400). The suite of MATLAB codes that

we used for the examples is available by anonymous ftp from ftp.math.ncsu.edu in the directory

FTP/kelley/optimization/matlab

or from SIAM's World Wide Web server at

http://www.siam.org/books/fr18/

One can obtain MATLAB from

The MathWorks, Inc.

3 Apple Hill Drive

Natick, MA 01760-2098

(508) 647-7000

Fax: (508) 647-7001

E-mail: info@mathworks.com

WWW: http://www.mathworks.com


Part I

Optimization of Smooth Functions


Chapter 1

Basic Concepts

1.1 The Problem

The unconstrained optimization problem is to minimize a real-valued function f of N variables.

By this we mean to find a local minimizer, that is, a point x∗ such that

f(x∗) ≤ f(x) for all x near x∗.   (1.1)

It is standard to express this problem as

min_x f(x)   (1.2)

or to say that we seek to solve the problem min f . The understanding is that (1.1) means that we

seek a local minimizer. We will refer to f as the objective function and to f(x∗) as the minimum
or minimum value. If a local minimizer x∗ exists, we say a minimum is attained at x∗.

We say that problem (1.2) is unconstrained because we impose no conditions on the
independent variables x and assume that f is defined for all x.

The local minimization problem is different from (and much easier than) the global
minimization problem in which a global minimizer, a point x∗ such that

f(x∗) ≤ f(x) for all x,   (1.3)

is sought.

The constrained optimization problem is to minimize a function f over a set U ⊂ R^N. A
local minimizer, therefore, is an x∗ ∈ U such that

f(x∗) ≤ f(x) for all x ∈ U near x∗.   (1.4)

Similar to (1.2) we express this as

min_{x∈U} f(x)   (1.5)

or say that we seek to solve the problem min_U f. A global minimizer is a point x∗ ∈ U such
that

f(x∗) ≤ f(x) for all x ∈ U.   (1.6)

We consider only the simplest constrained problems in this book (Chapter 5 and §7.4) and refer

the reader to [104], [117], [195], and [66] for deeper discussions of constrained optimization

and pointers to software.

Having posed an optimization problem one can proceed in the classical way and use methods

that require smoothness of f. That is the approach we take in this first part of the book. These


4 ITERATIVE METHODS FOR OPTIMIZATION

methods can fail if the objective function has discontinuities or irregularities. Such nonsmooth

effects are common and can be caused, for example, by truncation error in internal calculations

for f , noise in internal probabilistic modeling in f , table lookup within f , or use of experimental

data in f . We address a class of methods for dealing with such problems in Part II.

1.2 Notation

In this book, following the convention in [154], vectors are to be understood as column vectors.

The vector x∗ will denote a solution, x a potential solution, and {xk}k≥0 the sequence of iterates.

We will refer to x0 as the initial iterate. x0 is sometimes timidly called the initial guess. We will

denote the ith component of a vector x by (x)i (note the parentheses) and the ith component

of xk by (xk )i . We will rarely need to refer to individual components of vectors. We will let

∂f/∂xi denote the partial derivative of f with respect to (x)i. As is standard [154], e = x − x∗
will denote the error, en = xn − x∗ the error in the nth iterate, and B(r) the ball of radius r
about x∗,

B(r) = {x | ‖e‖ < r}.

For x ∈ R^N we let ∇f(x) ∈ R^N denote the gradient of f,

∇f(x) = (∂f/∂x1, . . . , ∂f/∂xN),

when it exists.

We let ∇²f denote the Hessian of f,

(∇²f)ij = ∂²f/∂xi∂xj,

when it exists. Note that ∇²f is the Jacobian of ∇f. However, ∇²f has more structure than
a Jacobian for a general nonlinear function. If f is twice continuously differentiable, then the
Hessian is symmetric ((∇²f)ij = (∇²f)ji) by equality of mixed partial derivatives [229].

In this book we will consistently use the Euclidean norm

‖x‖ = ( Σ_{i=1}^{N} (x)i² )^{1/2}.

When we refer to a matrix norm we will mean the matrix norm induced by the Euclidean norm,

‖A‖ = max_{x≠0} ‖Ax‖ / ‖x‖.

In optimization definiteness or semidefiniteness of the Hessian plays an important role in
the necessary and sufficient conditions for optimality that we discuss in §1.3 and 1.4 and in our

choice of algorithms throughout this book.

Definition 1.2.1. An N × N matrix A is positive semidefinite if xᵀAx ≥ 0 for all x ∈ R^N.
A is positive definite if xᵀAx > 0 for all x ∈ R^N, x ≠ 0. If A has both positive and negative
eigenvalues, we say A is indefinite. If A is symmetric and positive definite, we will say A is spd.

We will use two forms of the fundamental theorem of calculus, one for the function–gradient
pair and one for the gradient–Hessian.

Theorem 1.2.1. Let f be twice continuously differentiable in a neighborhood of a line
segment between points x∗, x = x∗ + e ∈ R^N; then

f(x) = f(x∗) + ∫_0^1 ∇f(x∗ + te)ᵀ e dt


and

∇f(x) = ∇f(x∗) + ∫_0^1 ∇²f(x∗ + te) e dt.

A direct consequence (see Exercise 1.7.1) of Theorem 1.2.1 is the following form of Taylor's

theorem we will use throughout this book.

Theorem 1.2.2. Let f be twice continuously differentiable in a neighborhood of a point
x∗ ∈ R^N. Then for e ∈ R^N and ‖e‖ sufficiently small

f(x∗ + e) = f(x∗) + ∇f(x∗)ᵀe + eᵀ∇²f(x∗)e/2 + o(‖e‖²).   (1.7)
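As a concrete check of (1.7), the sketch below (illustrative Python, not one of the book's MATLAB codes; the cubic f and the point x∗ are invented for the example) compares f(x∗ + e) with the quadratic model on the right side of (1.7); the remainder, divided by ‖e‖², vanishes as e → 0.

```python
# Illustrative test function (not from the book): f(x1, x2) = x1^2*x2 + x2^3,
# with gradient (2*x1*x2, x1^2 + 3*x2^2) and Hessian [[2*x2, 2*x1], [2*x1, 6*x2]].
def f(x):
    return x[0] ** 2 * x[1] + x[1] ** 3

def grad(x):
    return [2 * x[0] * x[1], x[0] ** 2 + 3 * x[1] ** 2]

def hess(x):
    return [[2 * x[1], 2 * x[0]], [2 * x[0], 6 * x[1]]]

def quadratic_model(xstar, e):
    # f(x*) + grad f(x*)^T e + e^T (hess f(x*)) e / 2, the right side of (1.7)
    g, H = grad(xstar), hess(xstar)
    lin = sum(g[i] * e[i] for i in range(2))
    quad = sum(e[i] * H[i][j] * e[j] for i in range(2) for j in range(2)) / 2.0
    return f(xstar) + lin + quad

def remainder_ratio(xstar, e):
    # |f(x* + e) - model| / ||e||^2 should vanish as e -> 0 if (1.7) holds
    x = [xstar[0] + e[0], xstar[1] + e[1]]
    return abs(f(x) - quadratic_model(xstar, e)) / (e[0] ** 2 + e[1] ** 2)

xstar = [1.0, 1.0]
for h in (1e-1, 1e-2, 1e-3):
    print(h, remainder_ratio(xstar, [h, h]))
```

For this cubic the remainder is exactly the third-order Taylor term, so the printed ratio shrinks linearly with h.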

1.3 Necessary Conditions

Let f be twice continuously differentiable. We will use Taylor's theorem in a simple way to
show that the gradient of f vanishes at a local minimizer and the Hessian is positive semidefinite.

These are the necessary conditions for optimality.

The necessary conditions relate (1.1) to a nonlinear equation and allow one to use fast
algorithms for nonlinear equations [84], [154], [211] to compute minimizers. Therefore, the
necessary conditions for optimality will be used in a critical way in the discussion of local
convergence in Chapter 2. A critical first step in the design of an algorithm for a new optimization
problem is the formulation of necessary conditions. Of course, the gradient vanishes at a
maximum, too, and the utility of the nonlinear equations formulation is restricted to a neighborhood
of a minimizer.

Theorem 1.3.1. Let f be twice continuously differentiable and let x∗ be a local minimizer
of f. Then

∇f(x∗) = 0.

Moreover ∇²f(x∗) is positive semidefinite.

Proof. Let u ∈ R^N be given. Taylor's theorem states that for all real t sufficiently small

f(x∗ + tu) = f(x∗) + t∇f(x∗)ᵀu + (t²/2) uᵀ∇²f(x∗)u + o(t²).

Since x∗ is a local minimizer we must have for t sufficiently small 0 ≤ f(x∗ + tu) − f(x∗) and
hence

∇f(x∗)ᵀu + (t/2) uᵀ∇²f(x∗)u + o(t) ≥ 0   (1.8)

for all t sufficiently small and all u ∈ R^N. So if we set t = 0 and u = −∇f(x∗) we obtain

‖∇f(x∗)‖² = 0.

Setting ∇f(x∗) = 0, dividing by t, and setting t = 0 in (1.8), we obtain

(1/2) uᵀ∇²f(x∗)u ≥ 0

for all u ∈ R^N. This completes the proof.

The condition that ∇f(x∗) = 0 is called the first-order necessary condition and a point

satisfying that condition is called a stationary point or a critical point.


1.4 Sufficient Conditions

A stationary point need not be a minimizer. For example, the function φ(t) = −t⁴ satisfies the
necessary conditions at 0, which is a maximizer of φ. To obtain a minimizer we must require that
the second derivative be nonnegative. This alone is not sufficient (think of φ(t) = t³) and only
if the second derivative is strictly positive can we be completely certain. These are the sufficient
conditions for optimality.

Theorem 1.4.1. Let f be twice continuously differentiable in a neighborhood of x∗. Assume
that ∇f(x∗) = 0 and that ∇²f(x∗) is positive definite. Then x∗ is a local minimizer of f.

Proof. Let 0 ≠ u ∈ R^N. For sufficiently small t we have

f(x∗ + tu) = f(x∗) + t∇f(x∗)ᵀu + (t²/2) uᵀ∇²f(x∗)u + o(t²)
           = f(x∗) + (t²/2) uᵀ∇²f(x∗)u + o(t²).

Hence, if λ > 0 is the smallest eigenvalue of ∇²f(x∗) we have

f(x∗ + tu) − f(x∗) ≥ (λ/2) ‖tu‖² + o(t²) > 0

for t sufficiently small. Hence x∗ is a local minimizer.
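The pair of theorems suggests a simple numerical recipe: solve ∇f = 0 for a stationary point, then test the Hessian for positive definiteness. The sketch below (illustrative Python, not one of the book's codes; the quadratic and all numbers are invented) does this with Cramer's rule for the 2 × 2 stationarity system and Sylvester's criterion for definiteness.

```python
# Hypothetical example: f(x1, x2) = x1^2 + 2*x2^2 + x1*x2 - x1
def f(x):
    return x[0] ** 2 + 2 * x[1] ** 2 + x[0] * x[1] - x[0]

def grad(x):
    return [2 * x[0] + x[1] - 1, x[0] + 4 * x[1]]

# First-order condition: solve grad f = 0, here the 2x2 linear system
#   2*x1 +   x2 = 1
#     x1 + 4*x2 = 0
det = 2 * 4 - 1 * 1
xstar = [4 / det, -1 / det]                        # Cramer's rule: (4/7, -1/7)
assert max(abs(g) for g in grad(xstar)) < 1e-14    # stationary point

# Second-order sufficient condition: the Hessian [[2,1],[1,4]] is positive
# definite by Sylvester's criterion (positive leading principal minors).
H = [[2.0, 1.0], [1.0, 4.0]]
assert H[0][0] > 0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0

# Theorem 1.4.1 then says xstar is a local minimizer: nearby values are larger.
for dx in (-1e-3, 1e-3):
    for dy in (-1e-3, 1e-3):
        assert f([xstar[0] + dx, xstar[1] + dy]) > f(xstar)
print("x* =", xstar, "f(x*) =", f(xstar))
```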

1.5 Quadratic Objective Functions

The simplest optimization problems are those with quadratic objective functions. Here

f(x) = −xᵀb + (1/2) xᵀHx.   (1.9)

We may, without loss of generality, assume that H is symmetric because

xᵀHx = xᵀ ((H + Hᵀ)/2) x.   (1.10)

Quadratic functions form the basis for most of the algorithms in Part I, which approximate an

objective function f by a quadratic model and minimize that model. In this section we discuss

some elementary issues in quadratic optimization.

Clearly,

∇²f(x) = H

for all x. The symmetry of H implies that

∇f(x) = −b + Hx.

Definition 1.5.1. The quadratic function f in (1.9) is convex if H is positive semidefinite.


1.5.1 Positive Definite Hessian

The necessary conditions for optimality imply that if a quadratic function f has a local minimum
x∗, then H is positive semidefinite and

Hx∗ = b.   (1.11)

In particular, if H is spd (and hence nonsingular), the unique global minimizer is the solution of

the linear system (1.11).

If H is a dense matrix and N is not too large, it is reasonable to solve (1.11) by computing

the Cholesky factorization [249], [127] of H

H = LLᵀ,

where L is a nonsingular lower triangular matrix with positive diagonal, and then solving (1.11)

by two triangular solves. If H is indefinite the Cholesky factorization will not exist and the

standard implementation [127], [249], [264] will fail because the computation of the diagonal

of L will require a real square root of a negative number or a division by zero.
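As an illustration (a Python sketch with an invented 2 × 2 matrix; the book's own codes are MATLAB), the following solves a small spd system by the Cholesky factorization and two triangular solves, and exhibits the breakdown just described when H is indefinite.

```python
import math

def cholesky(H):
    # H = L L^T for symmetric H; raises if H is not positive definite
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = H[i][i] - s
                if d <= 0:   # indefinite H: no real square root available
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = (H[i][j] - s) / L[j][j]
    return L

def solve_spd(H, b):
    # solve H x = b by forward solve L y = b, then back solve L^T x = y
    L = cholesky(H)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

print(solve_spd([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

Calling `cholesky` on an indefinite matrix such as [[1, 2], [2, 1]] raises at the point where a diagonal entry of L would need the square root of a negative number.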

If N is very large, H is sparse, or a matrix representation of H is not available, a more
efficient approach is the conjugate gradient iteration [154], [141]. This iteration requires only
matrix–vector products, a feature which we will use in a direct way in §§2.5 and 3.3.7. Our
formulation of the algorithm uses x as both an input and output variable. On input x contains
x0, the initial iterate, and on output the approximate solution. We terminate the iteration if the
relative residual is sufficiently small, i.e.,

‖b − Hx‖ ≤ ε‖b‖

or if too many iterations have been taken.

Algorithm 1.5.1. cg(x, b, H, ε, kmax)

1. r = b − Hx, ρ0 = ‖r‖², k = 1.

2. Do While √ρ_{k−1} > ε‖b‖ and k < kmax

   (a) if k = 1 then p = r
       else
       β = ρ_{k−1}/ρ_{k−2} and p = r + βp

   (b) w = Hp

   (c) α = ρ_{k−1}/pᵀw

   (d) x = x + αp

   (e) r = r − αw

   (f) ρ_k = ‖r‖²

   (g) k = k + 1

Note that if H is not spd, the denominator in α = ρ_{k−1}/pᵀw may vanish, resulting in

breakdown of the iteration.
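Algorithm cg translates nearly line for line into code. Below is an illustrative Python transcription (the book's accompanying codes are MATLAB; names such as `matvec` are ours): the only access to H is through matrix–vector products, exactly as the text emphasizes.

```python
import math

def cg(x, b, matvec, eps, kmax):
    # Algorithm cg(x, b, H, eps, kmax) with H supplied as a matrix-vector product
    n = len(b)
    r = [b[i] - v for i, v in enumerate(matvec(x))]
    rho = [sum(ri * ri for ri in r)]           # rho[k-1] = ||r||^2
    nb = math.sqrt(sum(bi * bi for bi in b))
    k = 1
    p = None
    while math.sqrt(rho[k - 1]) > eps * nb and k < kmax:
        if k == 1:
            p = r[:]
        else:
            beta = rho[k - 1] / rho[k - 2]
            p = [r[i] + beta * p[i] for i in range(n)]
        w = matvec(p)
        # the denominator below may vanish if H is not spd (breakdown)
        alpha = rho[k - 1] / sum(p[i] * w[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * w[i] for i in range(n)]
        rho.append(sum(ri * ri for ri in r))
        k += 1
    return x, k

H = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
matvec = lambda v: [sum(H[i][j] * v[j] for j in range(3)) for i in range(3)]
x, k = cg([0.0, 0.0, 0.0], b, matvec, 1e-10, 100)
print(x, k)
```

For this 3 × 3 spd example the iteration reaches the tolerance in at most N = 3 sweeps, consistent with the finite termination property discussed next.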

The conjugate gradient iteration minimizes f over an increasing sequence of nested subspaces
of R^N [127], [154]. We have that

f(x_k) ≤ f(x) for all x ∈ x0 + K_k,

where K_k is the Krylov subspace

K_k = span(r0, Hr0, . . . , H^{k−1} r0)

for k ≥ 1.

While in principle the iteration must converge after N iterations and conjugate gradient can
be regarded as a direct solver, N is, in practice, far too many iterations for the large problems to
which conjugate gradient is applied. As an iterative method, the performance of the conjugate
gradient algorithm depends both on b and on the spectrum of H (see [154] and the references
cited therein). A general convergence estimate [68], [60], which will suffice for the discussion
here, is

‖x_k − x∗‖_H ≤ 2 ‖x0 − x∗‖_H ((√κ(H) − 1)/(√κ(H) + 1))^k.   (1.12)

In (1.12), the H-norm of a vector is defined by

‖u‖_H² = uᵀHu

for an spd matrix H. κ(H) is the l2 condition number

κ(H) = ‖H‖ ‖H⁻¹‖.

For spd H

κ(H) = λ_l λ_s⁻¹,

where λ_l and λ_s are the largest and smallest eigenvalues of H. Geometrically, κ(H) is large if
the ellipsoidal level surfaces of f are very far from spherical.

The conjugate gradient iteration will perform well if κ(H) is near 1 and may perform very
poorly if κ(H) is large. The performance can be improved by preconditioning, which transforms
(1.11) into one with a coefficient matrix having eigenvalues near 1. Suppose that M is spd and
a sufficiently good approximation to H⁻¹ so that

κ(M^{1/2} H M^{1/2})

is much smaller than κ(H). In that case, (1.12) would indicate that far fewer conjugate gradient
iterations might be needed to solve

M^{1/2} H M^{1/2} z = M^{1/2} b   (1.13)

than to solve (1.11). Moreover, the solution x∗ of (1.11) could be recovered from the solution
z∗ of (1.13) by

x∗ = M^{1/2} z∗.   (1.14)

In practice, the square root of the preconditioning matrix M need not be computed. The
algorithm, using the same conventions that we used for cg, is

Algorithm 1.5.2. pcg(x, b, H, M, ε, kmax)

1. r = b − Hx, ρ0 = ‖r‖², k = 1

2. Do While √ρ_{k−1} > ε‖b‖ and k < kmax

   (a) z = Mr

   (b) τ_{k−1} = zᵀr

   (c) if k = 1 then β = 0 and p = z
       else
       β = τ_{k−1}/τ_{k−2}, p = z + βp

   (d) w = Hp

   (e) α = τ_{k−1}/pᵀw

   (f) x = x + αp

   (g) r = r − αw

   (h) ρ_k = rᵀr

   (i) k = k + 1

Note that only products of M with vectors in R^N are needed and that a matrix representation

of M need not be stored. We refer the reader to [11], [15], [127], and [154] for more discussion

of preconditioners and their construction.
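As an illustration of pcg with the simplest choice of preconditioner, the Python sketch below (ours, not the book's MATLAB interface; the badly scaled matrix is invented) uses Jacobi scaling, M = diag(H)⁻¹. Only products Mr appear; M^{1/2} is never formed.

```python
import math

def pcg(x, b, matvec, apply_M, eps, kmax):
    # Algorithm pcg(x, b, H, M, eps, kmax); H and M enter only via products
    n = len(b)
    r = [b[i] - v for i, v in enumerate(matvec(x))]
    rho = sum(ri * ri for ri in r)
    nb = math.sqrt(sum(bi * bi for bi in b))
    k, p, tau_prev = 1, None, None
    while math.sqrt(rho) > eps * nb and k < kmax:
        z = apply_M(r)
        tau = sum(z[i] * r[i] for i in range(n))
        if k == 1:
            p = z[:]
        else:
            beta = tau / tau_prev
            p = [z[i] + beta * p[i] for i in range(n)]
        w = matvec(p)
        alpha = tau / sum(p[i] * w[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * w[i] for i in range(n)]
        rho = sum(ri * ri for ri in r)
        tau_prev = tau
        k += 1
    return x, k

H = [[100.0, 1.0], [1.0, 1.0]]        # spd but badly scaled
b = [1.0, 1.0]
matvec = lambda v: [H[0][0] * v[0] + H[0][1] * v[1],
                    H[1][0] * v[0] + H[1][1] * v[1]]
jacobi = lambda r: [r[0] / H[0][0], r[1] / H[1][1]]   # M = diag(H)^{-1}
x, k = pcg([0.0, 0.0], b, matvec, jacobi, 1e-12, 100)
print(x, k)
```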

1.5.2 Indefinite Hessian

If H is indefinite, the necessary conditions, Theorem 1.3.1, imply that there will be no local

minimum. Even so, it will be important to understand some properties of quadratic problems

with indefinite Hessians when we design algorithms with initial iterates far from local minimizers

and we discuss some of the issues here.

If

uᵀHu < 0,

we say that u is a direction of negative curvature. If u is a direction of negative curvature, then

f(x + tu) will decrease to −∞ as t → ∞.
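A small numerical illustration (all numbers invented for the example): for the quadratic (1.9) with an indefinite H, f decreases without bound along a direction of negative curvature.

```python
H = [[1.0, 0.0], [0.0, -2.0]]   # eigenvalues 1 and -2, so H is indefinite
b = [0.0, 0.0]

def f(x):
    # f(x) = -x^T b + x^T H x / 2, as in (1.9)
    xb = x[0] * b[0] + x[1] * b[1]
    xHx = (x[0] * (H[0][0] * x[0] + H[0][1] * x[1])
           + x[1] * (H[1][0] * x[0] + H[1][1] * x[1]))
    return -xb + 0.5 * xHx

u = [0.0, 1.0]                  # u^T H u = -2 < 0: negative curvature
x0 = [1.0, 0.0]
vals = [f([x0[0] + t * u[0], x0[1] + t * u[1]]) for t in (0.0, 1.0, 10.0, 100.0)]
print(vals)
```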

1.6 Examples

It will be useful to have some example problems to solve as we develop the algorithms. The

examples here are included to encourage the reader to experiment with the algorithms and play

with the MATLAB codes. The codes for the problems themselves are included with the set of

MATLAB codes. The author of this book does not encourage the reader to regard the examples

as anything more than examples. In particular, they are not real-world problems, and should not

be used as an exhaustive test suite for a code. While there are documented collections of test

problems (for example, [10] and [26]), the reader should always evaluate and compare algorithms

in the context of his/her own problems.

Some of the problems are directly related to applications. When that is the case we will cite

some of the relevant literature. Other examples are included because they are small, simple, and

illustrate important effects that can be hidden by the complexity of more serious problems.

1.6.1 Discrete Optimal Control

This is a classic example of a problem in which gradient evaluations cost little more than function

evaluations.

We begin with the continuous optimal control problem and discuss how gradients are computed and then move to the discretizations. We will not dwell on the functional analytic issues surrounding the rigorous definition of gradients of maps on function spaces, but the reader should be aware that control problems require careful attention to this. The most important results can


be found in [151]. The function space setting for the particular control problems of interest in this section can be found in [170], [158], and [159], as can a discussion of more general problems.

The infinite-dimensional problem is

$\min_u f,$   (1.15)

where

$f(u) = \int_0^T L(y(t), u(t), t)\, dt,$   (1.16)

and we seek an optimal point $u \in L^\infty[0, T]$. $u$ is called the control variable or simply the control. The function $L$ is given and $y$, the state variable, satisfies the initial value problem (with $\dot y = dy/dt$)

$\dot y(t) = \phi(y(t), u(t), t), \quad y(0) = y_0.$   (1.17)

One could view the problem (1.15)–(1.17) as a constrained optimization problem or, as we

do here, think of the evaluation of f as requiring the solution of (1.17) before the integral on the

right side of (1.16) can be evaluated. This means that evaluation of f requires the solution of

(1.17), which is called the state equation.

$\nabla f(u)$, the gradient of $f$ at $u$ with respect to the $L^2$ inner product, is uniquely determined, if it exists, by

$f(u + w) - f(u) - \int_0^T (\nabla f(u))(t)\, w(t)\, dt = o(\|w\|)$   (1.18)

as $w \to 0$, uniformly in $w$. If $\nabla f(u)$ exists then

$\int_0^T (\nabla f(u))(t)\, w(t)\, dt = \left. \frac{d f(u + \xi w)}{d\xi} \right|_{\xi = 0}.$

If $L$ and $\phi$ are continuously differentiable, then $\nabla f(u)$, as a function of $t$, is given by

$\nabla f(u)(t) = p(t)\phi_u(y(t), u(t), t) + L_u(y(t), u(t), t).$   (1.19)

In (1.19) $p$, the adjoint variable, satisfies the final-value problem on $[0, T]$

$-\dot p(t) = p(t)\phi_y(y(t), u(t), t) + L_y(y(t), u(t), t), \quad p(T) = 0.$   (1.20)

So computing the gradient requires u and y, hence a solution of the state equation, and p, which

requires a solution of (1.20), a final-value problem for the adjoint equation. In the general case,

(1.17) is nonlinear, but (1.20) is a linear problem for p, which should be expected to be easier

to solve. This is the motivation for our claim that a gradient evaluation costs little more than a

function evaluation.

The discrete problems of interest here are constructed by solving (1.17) by numerical integration. After doing that, one can derive an adjoint variable and compute gradients using a discrete form of (1.19). However, as noted in [139], the equation for the adjoint variable of the discrete problem is usually not a discretization of (1.20). For the forward Euler method, however, the discretization of the adjoint equation is the adjoint equation for the discrete problem and we use that discretization here for that reason.

The fully discrete problem is $\min_u f$, where $u \in R^N$ and

$f(u) = \sum_{j=1}^N L((y)_j, (u)_j, j),$

and the states $\{(y)_j\}$ are given by the Euler recursion

$(y)_{j+1} = (y)_j + h\phi((y)_j, (u)_j, j) \quad \text{for } j = 0, \dots, N-1,$

where $h = T/(N-1)$ and $(y)_0$ is given. Then

$(\nabla f(u))_j = (p)_j \phi_u((y)_j, (u)_j, j) + L_u((y)_j, (u)_j, j),$

where $(p)_N = 0$ and

$(p)_{j-1} = (p)_j + h\left((p)_j \phi_y((y)_j, (u)_j, j) + L_y((y)_j, (u)_j, j)\right) \quad \text{for } j = N, \dots, 1.$
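The forward and adjoint sweeps translate directly into code. In the Python sketch below the dynamics $\phi$, the running cost $L$, and the index conventions are made up for illustration and chosen to be self-consistent (the text's indexing differs slightly); checking one gradient component against a finite difference is the standard way to validate an adjoint gradient.

```python
import math

# Discrete optimal control sketch: forward Euler recursion for the states,
# then one backward (adjoint) sweep for the gradient.  phi and L are
# hypothetical, not from the text.
N, T, y0 = 50, 1.0, 1.0
h = T / N

phi   = lambda y, u: -y + math.sin(u)        # state dynamics (made up)
phi_y = lambda y, u: -1.0
phi_u = lambda y, u: math.cos(u)
L     = lambda y, u: 0.5 * (y * y + u * u)   # running cost (made up)
L_y   = lambda y, u: y
L_u   = lambda y, u: u

def objective_and_gradient(u):
    # forward sweep: Euler recursion y_{j+1} = y_j + h*phi(y_j, u_j)
    y = [y0]
    for j in range(N):
        y.append(y[j] + h * phi(y[j], u[j]))
    f = sum(h * L(y[j], u[j]) for j in range(N))
    # backward sweep: lam[j] = df/dy_j, the discrete adjoint variable
    lam = [0.0] * (N + 1)
    for j in range(N - 1, -1, -1):
        lam[j] = h * L_y(y[j], u[j]) + lam[j + 1] * (1.0 + h * phi_y(y[j], u[j]))
    grad = [h * L_u(y[j], u[j]) + lam[j + 1] * h * phi_u(y[j], u[j])
            for j in range(N)]
    return f, grad

u = [0.1] * N
f0, g = objective_and_gradient(u)

# validate one component of the adjoint gradient by a forward difference
du = 1e-7
u[3] += du
f1, _ = objective_and_gradient(u)
fd = (f1 - f0) / du
```

The gradient costs one extra backward sweep, which is linear in the adjoint variable, illustrating the claim that a gradient evaluation costs little more than a function evaluation.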

1.6.2 Parameter Identification

This example, taken from [13], will appear throughout the book. The problem is small with

N = 2. The goal is to identify the damping c and spring constant k of a linear spring by

minimizing the difference of a numerical prediction and measured data. The experimental

scenario is that the spring-mass system will be set into motion by an initial displacement from

equilibrium and measurements of displacements will be taken at equally spaced increments in

time.

The motion of an unforced harmonic oscillator satisfies the initial value problem

$u'' + cu' + ku = 0; \quad u(0) = u_0, \; u'(0) = 0,$   (1.21)

on the interval $[0, T]$. We let $x = (c, k)^T$ be the vector of unknown parameters and, when the dependence on the parameters needs to be explicit, we will write $u(t : x)$ instead of $u(t)$ for the solution of (1.21). If the displacement is sampled at $\{t_j\}_{j=1}^M$, where $t_j = (j-1)T/(M-1)$, and the observations for $u$ are $\{u_j\}_{j=1}^M$, then the objective function is

$f(x) = \frac{1}{2} \sum_{j=1}^M |u(t_j : x) - u_j|^2.$   (1.22)

This is an example of a nonlinear least squares problem.

$u$ is differentiable with respect to $x$ when $c^2 - 4k \ne 0$. In that case, the gradient of $f$ is

$\nabla f(x) = \begin{pmatrix} \sum_{j=1}^M (u(t_j : x) - u_j)\, \partial u(t_j : x)/\partial c \\ \sum_{j=1}^M (u(t_j : x) - u_j)\, \partial u(t_j : x)/\partial k \end{pmatrix}.$   (1.23)

We can compute the derivatives of $u(t : x)$ with respect to the parameters by solving the sensitivity equations. Differentiating (1.21) with respect to $c$ and $k$ and setting $w_1 = \partial u/\partial c$ and $w_2 = \partial u/\partial k$, we obtain

$w_1'' + u' + cw_1' + kw_1 = 0; \quad w_1(0) = w_1'(0) = 0,$

$w_2'' + cw_2' + u + kw_2 = 0; \quad w_2(0) = w_2'(0) = 0.$   (1.24)

If c is large, the initial value problems (1.21) and (1.24) will be stiff and one should use a good

variable step stiff integrator. We refer the reader to [110], [8], [235] for details on this issue. In

the numerical examples in this book we used the MATLAB code ode15s from [236]. Stiffness

can also arise in the optimal control problem from §1.6.1 but does not in the specific examples

we use in this book. We caution the reader that when one uses an ODE code the results may only

be expected to be accurate to the tolerances input to the code. This limitation on the accuracy

must be taken into account, for example, when approximating the Hessian by differences.
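The sensitivity computation can be sketched by integrating (1.21) and (1.24) together as one first-order system. In the Python sketch below, a fixed-step RK4 integrator stands in for the variable step stiff integrator the text recommends (adequate here because the made-up $c$ is small); the sensitivity $w_1 = \partial u/\partial c$ is checked against a centered difference of $u(T)$.

```python
def simulate(c, k, u0=1.0, T=1.0, n=1000):
    """Integrate (1.21) and (1.24) as one first-order system with RK4.

    State s = (u, u', w1, w1', w2, w2'); parameter values are made up."""
    dt = T / n
    def rhs(s):
        u, up, w1, w1p, w2, w2p = s
        return (up, -c * up - k * u,              # u'' + c u' + k u = 0
                w1p, -up - c * w1p - k * w1,      # w1'' + u' + c w1' + k w1 = 0
                w2p, -c * w2p - u - k * w2)       # w2'' + c w2' + u + k w2 = 0
    s = (u0, 0.0, 0.0, 0.0, 0.0, 0.0)             # zero sensitivity data
    for _ in range(n):
        k1 = rhs(s)
        k2 = rhs(tuple(x + 0.5 * dt * d for x, d in zip(s, k1)))
        k3 = rhs(tuple(x + 0.5 * dt * d for x, d in zip(s, k2)))
        k4 = rhs(tuple(x + dt * d for x, d in zip(s, k3)))
        s = tuple(x + dt / 6.0 * (a + 2 * b + 2 * cm + d)
                  for x, a, b, cm, d in zip(s, k1, k2, k3, k4))
    return s                                      # values at t = T

uT, _, w1T, _, w2T, _ = simulate(1.0, 4.0)

# w1(T) should match a centered difference of u(T) with respect to c
dc = 1e-5
fd = (simulate(1.0 + dc, 4.0)[0] - simulate(1.0 - dc, 4.0)[0]) / (2 * dc)
```

For a stiff choice of $c$ the fixed-step integrator above would need an impractically small step; that is exactly the situation in which the text's advice to use a stiff integrator applies.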


1.6.3 Convex Quadratics

While convex quadratic problems are, in a sense, the easiest of optimization problems, they present surprising challenges to the sampling algorithms presented in Part II and can illustrate fundamental problems with classical gradient-based methods like the steepest descent algorithm from §3.1. Our examples will all take $N = 2$, $b = 0$, and

$H = \begin{pmatrix} \lambda_s & 0 \\ 0 & \lambda_l \end{pmatrix},$

where $0 < \lambda_s \le \lambda_l$. The function to be minimized is

$f(x) = x^T H x$

and the minimizer is $x^* = (0, 0)^T$.

As $\lambda_l/\lambda_s$ becomes large, the level curves of $f$ become elongated. When $\lambda_s = \lambda_l = 1$, $\min_x f$ is the easiest problem in optimization.

1.7 Exercises on Basic Concepts

1.7.1. Prove Theorem 1.2.2.

1.7.2. Consider the parameter identification problem for $x = (c, k, \omega, \phi)^T \in R^4$ associated with the initial value problem

$u'' + cu' + ku = \sin(\omega t + \phi); \quad u(0) = 10, \; u'(0) = 0.$

For what values of $x$ is $u$ differentiable? Derive the sensitivity equations for those values of $x$ for which $u$ is differentiable.

1.7.3. Solve the system of sensitivity equations from exercise 1.7.2 numerically for c = 10,

k = 1, ω = π, and φ = 0 using the integrator of your choice. What happens if you use a

nonstiff integrator?

1.7.4. Let N = 2, d = (1, 1)T , and let f (x) = xT d + xT x. Compute, by hand, the minimizer

using conjugate gradient iteration.

1.7.5. For the same f as in exercise 1.7.4 solve the constrained optimization problem

min f (x),

x∈U

where U is the circle centered at (0, 0)T of radius 1/3. You can solve this by inspection;

no computer and very little mathematics is needed.


Chapter 2

Local Convergence of Newton's Method

By a local convergence method we mean one that requires that the initial iterate $x_0$ is close to a local minimizer $x^*$ at which the sufficient conditions hold.

2.1 Types of Convergence

We begin with the standard taxonomy of convergence rates [84], [154], [211].

Definition 2.1.1. Let $\{x_n\} \subset R^N$ and $x^* \in R^N$. Then

• $x_n \to x^*$ q-quadratically if $x_n \to x^*$ and there is $K > 0$ such that

$\|x_{n+1} - x^*\| \le K \|x_n - x^*\|^2.$

• $x_n \to x^*$ q-superlinearly with q-order $\alpha > 1$ if $x_n \to x^*$ and there is $K > 0$ such that

$\|x_{n+1} - x^*\| \le K \|x_n - x^*\|^\alpha.$

• $x_n \to x^*$ q-superlinearly if

$\lim_{n \to \infty} \frac{\|x_{n+1} - x^*\|}{\|x_n - x^*\|} = 0.$

• $x_n \to x^*$ q-linearly with q-factor $\sigma \in (0, 1)$ if

$\|x_{n+1} - x^*\| \le \sigma \|x_n - x^*\|$

for $n$ sufficiently large.

Definition 2.1.2. An iterative method for computing $x^*$ is said to be locally (q-quadratically, q-superlinearly, q-linearly, etc.) convergent if the iterates converge to $x^*$ (q-quadratically, q-superlinearly, q-linearly, etc.) given that the initial data for the iteration is sufficiently good.

We remind the reader that a q-superlinearly convergent sequence is also q-linearly convergent with q-factor $\sigma$ for any $\sigma > 0$. A q-quadratically convergent sequence is q-superlinearly convergent with q-order of 2.
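Q-quadratic convergence roughly doubles the number of correct digits at each step, which is easy to observe numerically. The scalar Newton iteration for the root $\sqrt{2}$ of $x^2 - 2$ (a made-up illustration, not an example from the text) shows the constant $K$ in $\|x_{n+1} - x^*\| \le K \|x_n - x^*\|^2$ settling to a modest value:

```python
# Scalar Newton iteration for x^2 = 2; errors shrink q-quadratically.
target = 2.0 ** 0.5
x = 2.0
errors = []
for _ in range(5):
    x = x - (x * x - 2.0) / (2.0 * x)   # Newton step
    errors.append(abs(x - target))

# estimate K in e_{n+1} <= K e_n^2 (avoid the last, roundoff-level error)
ratios = [errors[n + 1] / errors[n] ** 2 for n in range(3)]
```

The error sequence drops from about $10^{-1}$ to below $10^{-10}$ in four steps, while the ratios $e_{n+1}/e_n^2$ stay bounded.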


In some cases the accuracy of the iteration can be improved by means that are external

to the algorithm, say, by evaluation of the objective function and its gradient with increasing

accuracy as the iteration progresses. In such cases, one has no guarantee that the accuracy of

the iteration is monotonically increasing but only that the accuracy of the results is improving at

a rate determined by the improving accuracy in the function and gradient evaluations. The concept

of r-type convergence captures this effect.

Definition 2.1.3. Let $\{x_n\} \subset R^N$ and $x^* \in R^N$. Then $\{x_n\}$ converges to $x^*$ r-(quadratically, superlinearly, linearly) if there is a sequence $\{\xi_n\} \subset R$ converging q-(quadratically, superlinearly, linearly) to 0 such that

$\|x_n - x^*\| \le \xi_n.$

We say that $\{x_n\}$ converges r-superlinearly with r-order $\alpha > 1$ if $\xi_n \to 0$ q-superlinearly with q-order $\alpha$.

2.2 The Standard Assumptions

We will assume that local minimizers satisfy the standard assumptions which, like the standard assumptions for nonlinear equations in [154], will guarantee that Newton's method converges q-quadratically to $x^*$. We will assume throughout this book that $f$ and $x^*$ satisfy Assumption 2.2.1.

Assumption 2.2.1.

1. $f$ is twice differentiable and

$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \gamma \|x - y\|.$   (2.1)

2. $\nabla f(x^*) = 0$.

3. $\nabla^2 f(x^*)$ is positive definite.

We sometimes say that $f$ is twice Lipschitz continuously differentiable with Lipschitz constant $\gamma$ to mean that part 1 of the standard assumptions holds.

If the standard assumptions hold then Theorem 1.4.1 implies that $x^*$ is a local minimizer of $f$. Moreover, the standard assumptions for nonlinear equations [154] hold for the system $\nabla f(x) = 0$. This means that all of the local convergence results for nonlinear equations can be applied to unconstrained optimization problems. In this chapter we will quote those results from nonlinear equations as they apply to unconstrained optimization. However, these statements must be understood in the context of optimization. We will use, for example, the fact that the Hessian (the Jacobian of $\nabla f$) is positive definite at $x^*$ in our solution of the linear equation for the Newton step. We will also use this in our interpretation of the Newton iteration.

2.3 Newton's Method

As in [154] we will define iterative methods in terms of the transition from a current iteration $x_c$ to a new one $x_+$. In the case of a system of nonlinear equations, for example, $x_+$ is the root of the local linear model of $F$ about $x_c$

$M_c(x) = F(x_c) + F'(x_c)(x - x_c).$

Solving $M_c(x_+) = 0$ leads to the standard formula for the Newton iteration

$x_+ = x_c - F'(x_c)^{-1} F(x_c).$   (2.2)

One could say that Newton's method for unconstrained optimization is simply the method for nonlinear equations applied to $\nabla f(x) = 0$. While this is technically correct if $x_c$ is near a minimizer, it is utterly wrong if $x_c$ is near a maximum. A more precise way of expressing the idea is to say that $x_+$ is a minimizer of the local quadratic model of $f$ about $x_c$,

$m_c(x) = f(x_c) + \nabla f(x_c)^T (x - x_c) + \frac{1}{2}(x - x_c)^T \nabla^2 f(x_c)(x - x_c).$

If $\nabla^2 f(x_c)$ is positive definite, then the minimizer $x_+$ of $m_c$ is the unique solution of $\nabla m_c(x) = 0$. Hence,

$0 = \nabla m_c(x_+) = \nabla f(x_c) + \nabla^2 f(x_c)(x_+ - x_c).$

Therefore,

$x_+ = x_c - (\nabla^2 f(x_c))^{-1} \nabla f(x_c),$   (2.3)

which is the same as (2.2) with $F$ replaced by $\nabla f$ and $F'$ by $\nabla^2 f$. Of course, $x_+$ is not computed by forming an inverse matrix. Rather, given $x_c$, $\nabla f(x_c)$ is computed and the linear equation

$\nabla^2 f(x_c) s = -\nabla f(x_c)$   (2.4)

is solved for the step $s$. Then (2.3) simply says that $x_+ = x_c + s$.

However, if $x_c$ is far from a minimizer, $\nabla^2 f(x_c)$ could have negative eigenvalues and the quadratic model will not have local minimizers (see exercise 2.7.4), and $M_c$, the local linear model of $\nabla f$ about $x_c$, could have roots which correspond to local maxima or saddle points of $m_c$. Hence, we must take care when far from a minimizer in making a correspondence between Newton's method for minimization and Newton's method for nonlinear equations. In this chapter, however, we will assume that we are sufficiently near a local minimizer for the standard assumptions for local optimality to imply those for nonlinear equations (as applied to $\nabla f$). Most of the proofs in this chapter are very similar to the corresponding results in [154] for nonlinear equations. We include them in the interest of completeness.

We begin with a lemma from [154], which we state without proof.

Lemma 2.3.1. Assume that the standard assumptions hold. Then there is $\delta > 0$ so that for all $x \in B(\delta)$

$\|\nabla^2 f(x)\| \le 2 \|\nabla^2 f(x^*)\|,$   (2.5)

$\|(\nabla^2 f(x))^{-1}\| \le 2 \|(\nabla^2 f(x^*))^{-1}\|,$   (2.6)

and

$\|(\nabla^2 f(x^*))^{-1}\|^{-1} \|e\|/2 \le \|\nabla f(x)\| \le 2 \|\nabla^2 f(x^*)\| \|e\|.$   (2.7)

As a first example, we prove the local convergence for Newton's method.

Theorem 2.3.2. Let the standard assumptions hold. Then there are $K > 0$ and $\delta > 0$ such that if $x_c \in B(\delta)$, the Newton iterate from $x_c$ given by (2.3) satisfies

$\|e_+\| \le K \|e_c\|^2.$   (2.8)

Proof. Let $\delta$ be small enough so that the conclusions of Lemma 2.3.1 hold. By Theorem 1.2.1

$e_+ = e_c - \nabla^2 f(x_c)^{-1} \nabla f(x_c) = (\nabla^2 f(x_c))^{-1} \int_0^1 (\nabla^2 f(x_c) - \nabla^2 f(x^* + t e_c))\, e_c \, dt.$

By Lemma 2.3.1 and the Lipschitz continuity of $\nabla^2 f$,

$\|e_+\| \le (2 \|(\nabla^2 f(x^*))^{-1}\|) \gamma \|e_c\|^2 / 2.$

This completes the proof of (2.8) with $K = \gamma \|(\nabla^2 f(x^*))^{-1}\|$.

As in the nonlinear equations setting, Theorem 2.3.2 implies that the complete iteration is locally quadratically convergent.

Theorem 2.3.3. Let the standard assumptions hold. Then there is $\delta > 0$ such that if $x_0 \in B(\delta)$, the Newton iteration

$x_{n+1} = x_n - (\nabla^2 f(x_n))^{-1} \nabla f(x_n)$

converges q-quadratically to $x^*$.

Proof. Let $\delta$ be small enough so that the conclusions of Theorem 2.3.2 hold. Reduce $\delta$ if needed so that $K\delta = \eta < 1$. Then if $n \ge 0$ and $x_n \in B(\delta)$, Theorem 2.3.2 implies that

$\|e_{n+1}\| \le K \|e_n\|^2 \le \eta \|e_n\| < \|e_n\|$   (2.9)

and hence $x_{n+1} \in B(\eta\delta) \subset B(\delta)$. Therefore, if $x_n \in B(\delta)$ we may continue the iteration. Since $x_0 \in B(\delta)$ by assumption, the entire sequence $\{x_n\} \subset B(\delta)$. (2.9) then implies that $x_n \to x^*$ q-quadratically.

Newton's method, from the local convergence point of view, is exactly the same as that for nonlinear equations applied to the problem of finding a root of $\nabla f$. We exploit the extra structure of positive definiteness of $\nabla^2 f$ with an implementation of Newton's method based on the Cholesky factorization [127], [249], [264]

$\nabla^2 f(u) = LL^T,$   (2.10)

where $L$ is lower triangular and has a positive diagonal.

We terminate the iteration when $\nabla f$ is sufficiently small (see [154]). A natural criterion is to demand a relative decrease in $\nabla f$ and terminate when

$\|\nabla f(x_n)\| \le \tau_r \|\nabla f(x_0)\|,$   (2.11)

where $\tau_r \in (0, 1)$ is the desired reduction in the gradient norm. However, if $\|\nabla f(x_0)\|$ is very small, it may not be possible to satisfy (2.11) in floating point arithmetic and an algorithm based entirely on (2.11) might never terminate. A standard remedy is to augment the relative error criterion and terminate the iteration using a combination of relative and absolute measures of $\nabla f$, i.e., when

$\|\nabla f(x_n)\| \le \tau_r \|\nabla f(x_0)\| + \tau_a.$   (2.12)

In (2.12) $\tau_a$ is an absolute error tolerance. Hence, the termination criterion input to many of the algorithms presented in this book will be in the form of a vector $\tau = (\tau_r, \tau_a)$ of relative and absolute residuals.

Algorithm 2.3.1. newton(x, f, τ)

1. $r_0 = \|\nabla f(x)\|$

2. Do while $\|\nabla f(x)\| > \tau_r r_0 + \tau_a$

(a) Compute $\nabla^2 f(x)$

(b) Factor $\nabla^2 f(x) = LL^T$

(c) Solve $LL^T s = -\nabla f(x)$

(d) $x = x + s$

(e) Compute $\nabla f(x)$.

Algorithm newton, as formulated above, is not completely satisfactory. The value of the objective function $f$ is never used and step 2b will fail if $\nabla^2 f$ is not positive definite. This failure, in fact, could serve as a signal that one is too far from a minimizer for Newton's method to be directly applicable. However, if we are near enough (see Exercise 2.7.8) to a local minimizer, as we assume in this chapter, all will be well and we may apply all the results from nonlinear equations.
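Algorithm newton can be sketched in a few lines of Python (the book's own codes are in MATLAB): the Cholesky factorization both solves (2.4) by two triangular solves and, by failing, signals an indefinite Hessian. The convex test function is made up for illustration.

```python
import numpy as np

def newton(x, grad, hess, tau_r, tau_a, maxit=50):
    """Sketch of Algorithm 2.3.1: Newton's method with a Cholesky
    factorization and the combined termination criterion (2.12)."""
    g = grad(x)
    r0 = np.linalg.norm(g)
    for _ in range(maxit):
        if np.linalg.norm(g) <= tau_r * r0 + tau_a:
            break
        L = np.linalg.cholesky(hess(x))   # raises if Hessian is not SPD
        # solve L L^T s = -grad(x) by two triangular solves
        y = np.linalg.solve(L, -g)
        s = np.linalg.solve(L.T, y)
        x = x + s
        g = grad(x)
    return x

# made-up smooth convex test function f(x) = x0^4 + x0^2 + 2 x1^2
grad = lambda x: np.array([4 * x[0] ** 3 + 2 * x[0], 4 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2 + 2, 0.0], [0.0, 4.0]])
xstar = newton(np.array([1.0, 1.0]), grad, hess, 1e-12, 1e-12)
```

A production code would catch the factorization failure and switch to one of the globalization strategies of Chapter 3 rather than simply stopping.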

2.3.1 Errors in Functions, Gradients, and Hessians

In the presence of errors in functions and gradients, however, the problem of unconstrained optimization becomes more difficult than that of root finding. We discuss this difference only briefly here and for the remainder of this chapter assume that gradients are computed exactly, or at least as accurately as $f$, say, either analytically or with automatic differentiation [129], [130]. However, we must carefully study the effects of errors in the evaluation of the Hessian just as we did those of errors in the Jacobian in [154].

A significant difference from the nonlinear equations case arises if only functions are available and gradients and Hessians must be computed with differences. A simple one-dimensional analysis will suffice to explain this. Assume that we can only compute $f$ approximately. If we compute $\hat f = f + \epsilon_f$ rather than $f$, then a forward difference gradient with difference increment $h$

$D_h f(x) = \frac{\hat f(x+h) - \hat f(x)}{h}$

differs from $f'$ by $O(h + \epsilon_f/h)$ and this error is minimized if $h = O(\sqrt{\epsilon_f})$. In that case the error in the gradient is $\epsilon_g = O(h) = O(\sqrt{\epsilon_f})$. If a forward difference Hessian is computed using $D_h$ as an approximation to the gradient, then the error in the Hessian will be

$\Delta = O(\sqrt{\epsilon_g}) = O(\epsilon_f^{1/4})$   (2.13)

and the accuracy in the difference Hessian will be much less than that of a Jacobian in the nonlinear equations case.

If $\epsilon_f$ is significantly larger than machine roundoff, (2.13) indicates that using numerical Hessians based on a second numerical differentiation of the objective function will not be very accurate. Even in the best possible case, where $\epsilon_f$ is the same size as machine roundoff, finite difference Hessians will not be very accurate and will be very expensive to compute if the Hessian is dense. If, as on most computers today, machine roundoff is (roughly) $10^{-16}$, (2.13) indicates that a forward difference Hessian will be accurate to roughly four decimal digits.

One can obtain better results with centered differences, but at a cost of twice the number of function evaluations. A centered difference approximation to $\nabla f$ is

$D_h f(x) = \frac{\hat f(x+h) - \hat f(x-h)}{2h}$

and the error is $O(h^2 + \epsilon_f/h)$, which is minimized if $h = O(\epsilon_f^{1/3})$, leading to an error in the gradient of $\epsilon_g = O(\epsilon_f^{2/3})$. Therefore, a central difference Hessian will have an error of

$\Delta = O((\epsilon_g)^{2/3}) = O(\epsilon_f^{4/9}),$

which is substantially better. We will find that accurate gradients are much more important than accurate Hessians and one option is to compute gradients with central differences and Hessians with forward differences. If one does that the centered difference gradient error is $\epsilon_g = O(\epsilon_f^{2/3})$ and therefore the forward difference Hessian error will be

$\Delta = O(\sqrt{\epsilon_g}) = O(\epsilon_f^{1/3}).$

More elaborate schemes [22] compute a difference gradient and then reuse the function evaluations in the Hessian computation.
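The increment recommendations $h = O(\sqrt{\epsilon_f})$ and $h = O(\epsilon_f^{1/3})$ are easy to check numerically. In the illustrative experiment below $f = \sin$ is computed to roughly machine roundoff, so $\epsilon_f \approx 2^{-52}$; forward and centered differences are compared at near-optimal and badly chosen increments.

```python
import math

# Forward difference error behaves like h + eps_f/h (optimal h near
# sqrt(eps_f)); centered differences behave like h^2 + eps_f/h (optimal
# h near eps_f**(1/3)).  Illustrative experiment with f = sin.
f, fprime, x = math.sin, math.cos, 1.0
eps_f = 2.0 ** -52                    # f is accurate to machine roundoff

def forward(h):
    return abs((f(x + h) - f(x)) / h - fprime(x))

def centered(h):
    return abs((f(x + h) - f(x - h)) / (2 * h) - fprime(x))

h_fwd = math.sqrt(eps_f)              # near-optimal forward increment
h_ctr = eps_f ** (1.0 / 3.0)          # near-optimal centered increment

bad = forward(1e-2)                   # truncation error dominates
good = forward(h_fwd)                 # roughly sqrt(eps_f) accuracy
```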

In many optimization problems, however, accurate gradients are available. When that is the

case, numerical differentiation to compute Hessians is, like numerical computation of Jacobians

for nonlinear equations [154], a reasonable idea for many problems and the less expensive

forward differences work well.

Clever implementations of difference computation can exploit sparsity in the Hessian [67],

[59] to evaluate a forward difference approximation with far fewer than N evaluations of ∇f .

In the sparse case it is also possible [22], [23] to reuse the points from a centered difference

approximation to the gradient to create a second-order accurate Hessian.

Unless $\epsilon_g(x_n) \to 0$ as the iteration progresses, one cannot expect convergence. For this reason estimates like (2.14) are sometimes called local improvement [88] results. Theorem 2.3.4 is a typical example.

Theorem 2.3.4. Let the standard assumptions hold. Then there are $\bar K > 0$, $\delta > 0$, and $\delta_1 > 0$ such that if $x_c \in B(\delta)$ and $\|\Delta(x_c)\| < \delta_1$ then

$x_+ = x_c - (\nabla^2 f(x_c) + \Delta(x_c))^{-1} (\nabla f(x_c) + \epsilon_g(x_c))$

is defined (i.e., $\nabla^2 f(x_c) + \Delta(x_c)$ is nonsingular) and satisfies

$\|e_+\| \le \bar K (\|e_c\|^2 + \|\Delta(x_c)\| \|e_c\| + \|\epsilon_g(x_c)\|).$   (2.14)

Proof. Let $\delta$ be small enough so that the conclusions of Lemma 2.3.1 and Theorem 2.3.2 hold. Let

$x_+^N = x_c - (\nabla^2 f(x_c))^{-1} \nabla f(x_c)$

and note that

$x_+ = x_+^N + \left((\nabla^2 f(x_c))^{-1} - (\nabla^2 f(x_c) + \Delta(x_c))^{-1}\right) \nabla f(x_c) - (\nabla^2 f(x_c) + \Delta(x_c))^{-1} \epsilon_g(x_c).$

Lemma 2.3.1 and Theorem 2.3.2 imply

$\|e_+\| \le K \|e_c\|^2 + 2 \|(\nabla^2 f(x_c))^{-1} - (\nabla^2 f(x_c) + \Delta(x_c))^{-1}\| \, \|\nabla^2 f(x^*)\| \, \|e_c\| + \|(\nabla^2 f(x_c) + \Delta(x_c))^{-1}\| \, \|\epsilon_g(x_c)\|.$   (2.15)

If

$\|\Delta(x_c)\| \le \|(\nabla^2 f(x^*))^{-1}\|^{-1}/4,$

then Lemma 2.3.1 implies that

$\|\Delta(x_c)\| \le \|(\nabla^2 f(x_c))^{-1}\|^{-1}/2$

and the Banach Lemma [12], [154] states that $\nabla^2 f(x_c) + \Delta(x_c)$ is nonsingular and

$\|(\nabla^2 f(x_c) + \Delta(x_c))^{-1}\| \le 2 \|(\nabla^2 f(x_c))^{-1}\| \le 4 \|(\nabla^2 f(x^*))^{-1}\|.$

Hence,

$\|(\nabla^2 f(x_c))^{-1} - (\nabla^2 f(x_c) + \Delta(x_c))^{-1}\| \le 8 \|(\nabla^2 f(x^*))^{-1}\|^2 \|\Delta(x_c)\|.$

We use these estimates and (2.15) to obtain

$\|e_+\| \le K \|e_c\|^2 + 16 \|(\nabla^2 f(x^*))^{-1}\|^2 \|\nabla^2 f(x^*)\| \, \|\Delta(x_c)\| \, \|e_c\| + 4 \|(\nabla^2 f(x^*))^{-1}\| \, \|\epsilon_g(x_c)\|.$

Setting

$\bar K = K + 16 \|(\nabla^2 f(x^*))^{-1}\|^2 \|\nabla^2 f(x^*)\| + 4 \|(\nabla^2 f(x^*))^{-1}\|$

completes the proof.

As is the case with equations, (2.14) implies that one cannot hope to find a minimizer with more accuracy than one can evaluate $\nabla f$. In most cases the iteration will stagnate once $\|e\|$ is (roughly) the same size as $\epsilon_g$. The speed of convergence will be governed by the accuracy in the Hessian.

The result for the chord method illustrates this latter point. In the chord method we form and compute the Cholesky factorization of $\nabla^2 f(x_0)$ and use that factorization to compute all subsequent Newton steps. Hence,

$x_+ = x_c - (\nabla^2 f(x_0))^{-1} \nabla f(x_c)$

and

$\|\Delta(x_c)\| \le \gamma \|x_0 - x_c\| \le \gamma (\|e_0\| + \|e_c\|).$   (2.16)

Algorithmically the chord iteration differs from the Newton iteration only in that the computation and factorization of the Hessian is moved outside of the main loop.

Algorithm 2.3.2. chord(x, f, τ)

1. $r_0 = \|\nabla f(x)\|$

2. Compute $\nabla^2 f(x)$

3. Factor $\nabla^2 f(x) = LL^T$

4. Do while $\|\nabla f(x)\| > \tau_r r_0 + \tau_a$

(a) Solve $LL^T s = -\nabla f(x)$

(b) $x = x + s$

(c) Compute $\nabla f(x)$.
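A sketch of Algorithm chord, parallel to the earlier Newton sketch: only the computation and factorization of the Hessian move outside the loop, so each step costs two triangular solves. The convex test function is again made up for illustration.

```python
import numpy as np

def chord(x, grad, hess, tau_r, tau_a, maxit=200):
    """Sketch of Algorithm 2.3.2: factor the Hessian once at x0 and
    reuse the factorization for every subsequent step."""
    g = grad(x)
    r0 = np.linalg.norm(g)
    L = np.linalg.cholesky(hess(x))      # factor once, outside the loop
    for _ in range(maxit):
        if np.linalg.norm(g) <= tau_r * r0 + tau_a:
            break
        y = np.linalg.solve(L, -g)       # two triangular solves per step
        s = np.linalg.solve(L.T, y)
        x = x + s
        g = grad(x)
    return x

# made-up convex test function f(x) = x0^4 + x0^2 + 2 x1^2
grad = lambda x: np.array([4 * x[0] ** 3 + 2 * x[0], 4 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2 + 2, 0.0], [0.0, 4.0]])
xstar = chord(np.array([0.5, 0.5]), grad, hess, 1e-10, 1e-10)
```

Run side by side with the Newton sketch, the chord iterates need many more steps (q-linear rather than q-quadratic convergence), which is exactly the trade of factorizations for iterations described by (2.17).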

The convergence theory follows from Theorem 2.3.4 with $\epsilon_g = 0$ and $\|\Delta\| = O(\|e_0\|)$.

Theorem 2.3.5. Let the standard assumptions hold. Then there are $K_C > 0$ and $\delta > 0$ such that if $x_0 \in B(\delta)$ the chord iterates converge q-linearly to $x^*$ and

$\|e_{n+1}\| \le K_C \|e_0\| \|e_n\|.$   (2.17)

Proof. Let $\delta$ be small enough so that the conclusions of Theorem 2.3.4 hold. Assume that $x_n, x_0 \in B(\delta)$. Combining (2.16) and (2.14) implies

$\|e_{n+1}\| \le \bar K (\|e_n\|(1 + \gamma) + \gamma \|e_0\|) \|e_n\| \le \bar K (1 + 2\gamma) \delta \|e_n\|.$

Hence, if $\delta$ is small enough so that

$\bar K (1 + 2\gamma) \delta = \eta < 1,$

then the chord iterates converge q-linearly to $x^*$. Q-linear convergence implies that $\|e_n\| < \|e_0\|$ and hence (2.17) holds with $K_C = \bar K (1 + 2\gamma)$.


The Shamanskii method [233], [154], [211] is a generalization of the chord method that updates Hessians after every $m + 1$ nonlinear iterations. Newton's method corresponds to $m = 1$ and the chord method to $m = \infty$. The convergence result is a direct consequence of Theorems 2.3.3 and 2.3.5.

Theorem 2.3.6. Let the standard assumptions hold and let $m \ge 1$ be given. Then there are $K_S > 0$ and $\delta > 0$ such that if $x_0 \in B(\delta)$ the Shamanskii iterates converge q-superlinearly to $x^*$ with q-order $m + 1$ and

$\|e_{n+1}\| \le K_S \|e_n\|^{m+1}.$   (2.18)

As one more application of Theorem 2.3.4, we analyze the effects of a difference approximation of the Hessian. We follow the notation of [154] where possible. For example, to construct a Hessian matrix, whose columns are $\nabla^2 f(x) e_j$, where $e_j$ is the unit vector with $j$th component 1 and other components 0, we could approximate the matrix–vector products $\nabla^2 f(x) e_j$ by forward differences and then symmetrize the resulting matrix. We define

$\nabla_h^2 f(x) = (A_h + A_h^T)/2,$   (2.19)

where $A_h$ is the matrix whose $j$th column is $D_h^2 f(x : e_j)$. $D_h^2 f(x : w)$, the difference approximation of the action of the Hessian $\nabla^2 f(x)$ on a vector $w$, is defined to be the quotient

$D_h^2 f(x : w) = \begin{cases} 0, & w = 0, \\ \dfrac{\nabla f(x + h w/\|w\|) - \nabla f(x)}{h/\|w\|}, & w \ne 0. \end{cases}$   (2.20)

Note that we may also write

$D_h^2 f(x : w) = D_h(\nabla f)(x : w),$

where the notation $D_h$, taken from [154], denotes numerical directional derivative. If $\|x\|$ is very large, then the error in computing the sum $x + h w/\|w\|$ will have to be taken into consideration in the choice of $h$.

We warn the reader, as we did in [154], that $D_h^2 f(x : w)$ is not a linear map and that $D_h^2 f(x : w)$ is not, in general, the same as $\nabla_h^2 f(x) w$.
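Formula (2.20) can be coded directly. In the Python sketch below the gradient comes from a made-up function whose Hessian is known in closed form, so the difference approximation of the Hessian action can be compared with the exact product $\nabla^2 f(x) w$.

```python
import math

def D2h(gradf, x, w, h):
    """Difference approximation (2.20) of the Hessian action on w.

    The increment is scaled by ||w|| so the actual step has length h."""
    nw = math.sqrt(sum(wi * wi for wi in w))
    if nw == 0.0:
        return [0.0] * len(x)
    xp = [xi + h * wi / nw for xi, wi in zip(x, w)]
    g1, g0 = gradf(xp), gradf(x)
    return [(a - b) / (h / nw) for a, b in zip(g1, g0)]

# made-up f(x) = x0^2 + x0 x1 + x1^4, so Hessian = [[2, 1], [1, 12 x1^2]]
gradf = lambda x: [2 * x[0] + x[1], x[0] + 4 * x[1] ** 3]
x, w, h = [1.0, 2.0], [1.0, -1.0], 1e-6
approx = D2h(gradf, x, w, h)
exact = [2 * w[0] + 1 * w[1], 1 * w[0] + 12 * x[1] ** 2 * w[1]]  # H(x) w
```

Because the function is not quadratic, $D_h^2 f(x : w)$ agrees with $\nabla^2 f(x) w$ only up to the truncation error in $h$, which is the nonlinearity the warning above refers to.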

If we compute $\nabla f(x) + \epsilon_g(x)$ and the gradient errors satisfy an estimate of the form

$\|\epsilon_g(x)\| \le \bar\epsilon$

for all $x$, then the computed difference Hessian is $\nabla_h(\nabla f + \epsilon_g)$ and satisfies

$\|\nabla^2 f(x) - \nabla_h(\nabla f + \epsilon_g)(x)\| = O(h + \bar\epsilon/h).$   (2.21)

So, as in [154], the choice $h \approx \sqrt{\bar\epsilon}$ is optimal in the sense that it minimizes the quantity in the $O$-term in (2.21).

The local convergence theorem in this case is [88], [154], [278], as follows.

Theorem 2.3.7. Let the standard assumptions hold. Then there are $\delta$, $\bar\epsilon$, and $K_D > 0$ such that if $x_c \in B(\delta)$, $\|\epsilon_g(x)\| \le \bar\epsilon_c$ for all $x \in B(\delta)$, and

$h \ge M \sqrt{\bar\epsilon_c}$

then

$x_+ = x_c - \left(\nabla_{h_c}(\nabla f + \epsilon_g)(x_c)\right)^{-1} (\nabla f(x_c) + \epsilon_g(x_c))$

satisfies

$\|e_+\| \le K_D (\bar\epsilon_c + (\bar\epsilon_c + h) \|e_c\|).$
