for our mill. Traders sometimes use optimizers to discover rule combinations that

trade profitably. In Part II, we will demonstrate how a genetic optimizer can evolve

profitable rule-based entry models. More commonly, traders call upon optimizers

to determine the most appropriate values for system parameters; almost any kind

of optimizer, except perhaps an analytic optimizer, may be employed for this pur-

pose. Various kinds of optimizers, including powerful genetic algorithms, are

effective for training or evolving neural or fuzzy logic networks. Asset allocation

problems yield to appropriate optimization strategies. Sometimes it seems as if the

only limit on how optimizers may be employed is the user's imagination, and

therein lies a danger: It is easy to be seduced into “optimizer abuse” by the great

and alluring power of this tool. The correct and incorrect applications of opti-

mizers are discussed later in this chapter.

TYPES OF OPTIMIZERS

There are many kinds of optimizers, each with its own special strengths and

weaknesses, advantages and disadvantages. Optimizers can be classified along such

dimensions as human versus machine, complex versus simple, special purpose

versus general purpose, and analytic versus stochastic. All optimizers, regardless

of kind, efficiency, or reliability, execute a search for the best of many potential

solutions to a formally specified problem.

Implicit Optimizers

A mouse cannot be used to click on a button that says “optimize.” There is no spe-

cial command to enter. In fact, there is no special software or even machine in

sight. Does this mean there is no optimizer? No. Even when there is no optimizer

apparent, and it seems as though no optimization is going on, there is. It is known

as implicit optimization and works as follows: The trader tests a set of rules based

upon some ideas regarding the market. Performance of the system is poor, and so

the trader reworks the ideas, modifies the system's rules, and runs another

simulation. Better performance is observed. The trader repeats this process a few

times, each time making changes based on what has been learned along the way.

Eventually, the trader builds a system worthy of being traded with real money.

Was this system an optimized one? Since no parameters were ever explicitly

adjusted and no rules were ever rearranged by the software, it appears as if the

trader has succeeded in creating an unoptimized system. However, more than one

solution from a set of many possible solutions was tested and the best solution was

selected for use in trading or further study. This means that the system was opti-

mized after all! Any form of problem solving in which more than one solution is

examined and the best is chosen constitutes de facto optimization. The trader has

a powerful brain that employed mental problem-solving algorithms, e.g., heuris-

tically guided trial-and-error ones, which are exceptionally potent optimizers.


This means that optimization is always present: optimizers are always at work.

There is no escape!

Brute Force Optimizers

A brute force optimizer searches for the best possible solution by systematically

testing all potential solutions, i.e., all definable combinations of rules, parameters,

or both. Because every possible combination must be tested, brute force opti-

mization can be very slow. Lack of speed becomes a serious issue as the number

of combinations to be examined grows. Consequently, brute force optimization is

subject to the law of “combinatorial explosion.” Just how slow is brute force opti-

mization? Consider a case where there are four parameters to optimize and where

each parameter can take on any of 50 values. Brute force optimization would

require that 50^4 (about 6 million) tests or simulations be conducted before the

optimal parameter set could be determined: if one simulation was executed every

1.62 seconds (typical for TradeStation), the optimization process would take about

4 months to complete. This approach is not very practical, especially when many

systems need to be tested and optimized, when there are many parameters, when

the parameters can take on many values, or when you have a life. Nevertheless,

brute force optimization is useful and effective. If properly done, it will always

find the best possible solution. Brute force is a good choice for small problems

where combinatorial explosion is not an issue and solutions can be found in min-

utes, rather than days or years.

Only a small amount of programming code is needed to implement brute

force optimization. Simple loop constructs are commonly employed. Parameters

to be optimized are stepped from a start value to a stop value by some increment

using a For loop (C, C++, Basic, Pascal/Delphi) or a Do loop (FORTRAN). A

brute force optimizer for two parameters, when coded in a modern dialect of

Basic, might appear as follows:
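In place of the original Basic listing, here is a sketch of the same nested-loop scheme in C++; `run_simulation` is a hypothetical stand-in for a full trading simulation, here a toy surface with a known peak at (6, 20):

```cpp
#include <cassert>

// Hypothetical stand-in for a trading simulation: maps a parameter pair
// to a fitness figure. Toy surface with a known peak at (6, 20).
double run_simulation(int len1, int len2) {
    double a = len1 - 6.0, b = len2 - 20.0;
    return 1000.0 - a * a - b * b;
}

struct Best { int len1, len2; double fitness; };

// Brute force: step each parameter from a start value to a stop value
// by a fixed increment, testing every combination along the way.
Best brute_force(int start1, int stop1, int start2, int stop2, int step) {
    Best best = {0, 0, -1.0e30};
    for (int len1 = start1; len1 <= stop1; len1 += step)
        for (int len2 = start2; len2 <= stop2; len2 += step) {
            double fit = run_simulation(len1, len2);
            if (fit > best.fitness) best = {len1, len2, fit};
        }
    return best;
}
```

On the toy surface, stepping both parameters by 2 finds the (6, 20) peak. With four parameters of 50 values each, the same loops would run 50^4, about 6 million, times, which is the combinatorial explosion described above.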

Because brute force optimizers are conceptually simple and easy to program,

they are often built into the more advanced software packages that are available

for traders.

As a practical illustration of brute force optimization, TradeStation was used

to optimize the moving averages in a dual moving-average crossover system.

Optimization was for net profit, the only trading system characteristic that

TradeStation can optimize without the aid of add-on products. The Easy Language code

for the dual moving-average trading model appears below:

The system was optimized by stepping the length of the first moving average

(LenA) from 2 to 10 in increments of 2. The length of the second moving average

(LenB) was advanced from 2 to 50 with the same increments. Increments were set

greater than 1 so that fewer than 200 combinations would need to be tested

(TradeStation can only save data on a maximum of 200 optimization runs). Since

not all possible combinations of values for the two parameters were explored, the

optimization was less thorough than it could have been; the best solution may have

been missed in the search. Notwithstanding, the optimization required 125 tests,

which took 3 minutes and 24 seconds to complete on 5 years of historical, end-of-

day data, using an Intel 486 machine running at 66 megahertz. The results gener-

ated by the optimization were loaded into an Excel spreadsheet and sorted for net

profit. Table 3-1 presents various performance measures for the top 25 solutions.

In the table, LENA represents the period of the shorter moving average,

LENB the period of the longer moving average, NetPrft the total net profit,

L:NetPrft the net profit for long positions, S:NetPrft the net profit for short posi-

tions, PFact the profit factor, ROA the total (unannualized) return-on-account,

MaxDD the maximum drawdown, #Trds the total number of trades taken, and

%Prft the percentage of profitable trades.

Since optimization is a problem-solving search procedure, it frequently

results in surprising discoveries. The optimization performed on the dual moving-

average crossover system was no exception to the rule. Conventional trading wis-

dom says that "the trend is your friend." However, with a second moving average

that is faster than the first, the most profitable solutions in Table 3-1 trade against

the trend. These profitable countertrend solutions might not have been discovered

without the search performed by the optimization procedure.


TABLE 3-1

User-Guided Optimization

Successful user-guided optimization calls for skill, domain knowledge, or

both, on the part of the person guiding the optimization process. Given adequate

skill and experience, not to mention a tractable problem, user-guided optimization

can be extremely efficient and dramatically faster than brute force methods. The

speed and efficiency derive from the addition of intelligence to the search process:

Zones with a high probability of paying off can be recognized and carefully exam-

ined, while time-consuming investigations of regions unlikely to yield good

results can be avoided.

User-guided optimization is most appropriate when ballpark results have

already been established by other means, when the problem is familiar or well

understood, or when only a small number of parameters need to be manipulated.

As a means of "polishing" an existing solution, user-guided optimization is an

excellent choice. It is also useful for studying model sensitivity to changes in rules

or parameter values.

Genetic Optimizers

Imagine something powerful enough to solve all the problems inherent in the

creation of a human being. That something surely represents the ultimate in

problem solving and optimization. What is it? It is the familiar process of evo-

lution. Genetic optimizers endeavor to harness some of that incredible prob-

lem-solving power through a crude simulation of the evolutionary process. In

terms of overall performance and the variety of problems that may be solved,

there is no general-purpose optimizer more powerful than a properly crafted

genetic one.

Genetic optimizers are stochastic optimizers in the sense that they take

advantage of random chance in their operation. It may not seem believable that

tossing dice can be a great way to solve problems, but, done correctly, it can be!

In addition to randomness, genetic optimizers employ selection and recombina-

tion. The clever integration of random chance, selection, and recombination is

responsible for the genetic optimizer's great power. A full discussion of genetic

algorithms, which are the basis for genetic optimizers, appears in Part II.
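To make the three ingredients concrete, here is a minimal genetic optimizer sketch (not the C-Trader implementation); the fitness function is a hypothetical stand-in with its peak at (6, 20):

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Hypothetical fitness stand-in for a trading simulation; peak at (6, 20).
double ga_fitness(int len1, int len2) {
    double a = len1 - 6.0, b = len2 - 20.0;
    return 1000.0 - a * a - b * b;
}

struct Genome { int len1, len2; double fit; };

// Minimal genetic optimizer: random initialization, tournament selection,
// recombination of parent genes, random mutation, and elitism.
Genome evolve(int pop_size, int generations, unsigned seed) {
    std::mt19937 rng(seed);
    auto rand_gene = [&]() { return 2 + (int)(rng() % 49); };   // 2..50
    auto rand_unit = [&]() { return rng() / 4294967296.0; };    // [0,1)
    auto by_fit = [](const Genome& a, const Genome& b) { return a.fit < b.fit; };

    std::vector<Genome> pop(pop_size);
    for (auto& g : pop) {
        g.len1 = rand_gene(); g.len2 = rand_gene();
        g.fit = ga_fitness(g.len1, g.len2);
    }
    for (int gen = 0; gen < generations; ++gen) {
        std::vector<Genome> next;
        next.push_back(*std::max_element(pop.begin(), pop.end(), by_fit)); // elitism
        while ((int)next.size() < pop_size) {
            // Tournament selection: the fitter of two random genomes breeds.
            auto tourney = [&]() {
                const Genome& a = pop[rng() % pop.size()];
                const Genome& b = pop[rng() % pop.size()];
                return a.fit > b.fit ? a : b;
            };
            Genome child = { tourney().len1, tourney().len2, 0.0 }; // recombination
            if (rand_unit() < 0.2) child.len1 = rand_gene();        // mutation
            if (rand_unit() < 0.2) child.len2 = rand_gene();
            child.fit = ga_fitness(child.len1, child.len2);
            next.push_back(child);
        }
        pop.swap(next);
    }
    return *std::max_element(pop.begin(), pop.end(), by_fit);
}
```

Because of elitism, the best genome is never lost, so the best fitness can only improve from one generation to the next; on this toy surface, a run of a few dozen generations typically lands at or near the (6, 20) peak.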

Genetic optimizers have many highly desirable characteristics. One such

characteristic is speed, especially when faced with combinatorial explosion. A

genetic optimizer can easily be many orders of magnitude faster than a brute force

optimizer when there are a multiplicity of rules, or parameters that have many pos-

sible values, to manipulate. This is because, like user-guided optimization, genetic

optimization can focus on important regions of solution space while mostly ignor-

ing blind alleys. In contrast to user-guided optimization, the benefit of a selective

search is achieved without the need for human intervention.

Genetic optimizers can swiftly solve complex problems, and they are also

more immune than other kinds of optimizers to the effects of local maxima in the


fitness surface or, equivalently, local minima in the cost surface. Analytic methods

are worst in that they almost always walk right to the top of the nearest hill or bot-

tom of the nearest valley, without regard to whether higher hills or lower valleys

exist elsewhere. In contrast, a good genetic optimizer often locates the globally

best solution, quite an impressive feat when accomplished for cantankerous

fitness surfaces, such as those associated with matrices of neural connection weights.

Another characteristic of genetic optimization is that it works well with fit-

ness surfaces marked by discontinuities, flat regions, and other troublesome irreg-

ularities. Genetic optimization shares this characteristic with brute force,

user-guided, annealing-based, and other nonanalytic optimization methods.

Solutions that maximize such items as net profit, return on investment, the Sharpe

Ratio, and others that define difficult, nonanalytic fitness landscapes can be found

using a genetic optimizer. Genetic optimizers shine with difficult fitness functions

that lie beyond the purview of analytic methods. This does not mean that they can-

not be used to solve problems having more tractable fitness surfaces; although

perhaps slower than the analytic methods, they have the virtue of being more resistant to

the traps set by local optima.

Overall, genetic optimizers are the optimizers of choice when there are many

parameters or rules to adapt, when a global solution is desired, or when arbitrarily

complex (and not necessarily differentiable or continuous) fitness or cost functions

must be handled. Although special-purpose optimizers can outperform genetic opti-

mizers on specific kinds of problems, for general-purpose optimization, genetic

optimizers are among the most powerful tools available.

What does a genetic optimizer look like in action? The dual moving-average

crossover system discussed earlier was translated to C++ so that the genetic opti-

mizer in the C-Trader toolkit could be used to solve for the two system parame-

ters, LenA and LenB. LenA, the period of the first moving average, was examined

over the range of 2 through 50, as was LenB, the period of the second moving aver-

age. Optimization was for net profit so that the results would be directly compa-

rable with those produced earlier by brute force optimization. Below is the C++

code for the crossover system:

CHAPTER 3 Optimizers and Optimization

// take no trades in lookback period

if(clt[cb] < 910302) { eqcls[cb] = 0.0; continue; }
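Only a fragment of that listing survives here. As an illustrative sketch rather than the actual C-Trader code, the core crossover logic, computing a long, short, or held position for each bar, might be written as:

```cpp
#include <cassert>
#include <vector>

// Simple moving average of the last len closes, ending at bar cb.
double sma(const std::vector<double>& close, int cb, int len) {
    double sum = 0.0;
    for (int i = cb - len + 1; i <= cb; ++i) sum += close[i];
    return sum / len;
}

// Dual moving-average crossover: go long when the LenA average crosses
// above the LenB average, go short on the opposite cross, else hold.
std::vector<int> crossover_positions(const std::vector<double>& close,
                                     int lenA, int lenB) {
    int longer = lenA > lenB ? lenA : lenB;
    std::vector<int> pos(close.size(), 0);
    for (int cb = longer; cb < (int)close.size(); ++cb) {
        double fast  = sma(close, cb, lenA),     slow  = sma(close, cb, lenB);
        double fast0 = sma(close, cb - 1, lenA), slow0 = sma(close, cb - 1, lenB);
        if (fast0 <= slow0 && fast > slow)      pos[cb] = 1;   // cross up
        else if (fast0 >= slow0 && fast < slow) pos[cb] = -1;  // cross down
        else                                    pos[cb] = pos[cb - 1];
    }
    return pos;
}
```

Trade simulation, equity tracking, and a lookback filter like the fragment above would be layered on top of these bar-by-bar positions.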

To solve for the best parameters, brute force optimization would require that 2,041

tests be performed; in TradeStation, that works out to about 56 minutes of com-

puting time, extrapolating from the earlier illustration in which a small subset of

the current solution space was examined. Only 1 minute of running time was

required by the genetic optimizer; in an attempt to put it at a significant disadvan-

tage, it was prematurely stopped after performing only 133 tests.

The output from the genetic optimizer appears in Table 3-2. In this table, P1 rep-

resents the period of the faster moving average, P2 the period of the slower moving

average, NET the total net profit, NETLNG the net profit for long positions, NETSHT

the net profit for short positions, PFAC the profit factor, ROA% the annualized return

on account, DRAW the maximum drawdown, TRDS the number of trades taken by the

system, WIN% the percentage of winning trades, AVGT the profit or loss resulting

from the average trade, and FIT the fitness of the solution (which, in this instance, is

merely the total net profit). As with the brute force data in Table 3-1, the genetic data

have been sorted by net profit (fitness) and only the 25 best solutions were presented.

TABLE 3-2

Top 25 Solutions Found Using Genetic Optimization in C-Trader Toolkit

Comparison of the brute force and genetic optimization results (Tables 3-1 and 3-2,

respectively) reveals that the genetic optimizer isolated a solution with a greater net

profit ($172,725) than did the brute force optimizer ($145,125). This is no surprise

since a larger solution space, not decimated by increments, was explored. The sur-

prise is that the better solution was found so quickly, despite the handicap of a pre-

maturely stopped evolutionary process. Results like these demonstrate the incredible

effectiveness of genetic optimization.

Optimization by Simulated Annealing

Optimizers based on annealing mimic the thermodynamic process by which liq-

uids freeze and metals anneal. Starting out at a high temperature, the atoms of a

liquid or molten metal bounce rapidly about in a random fashion. Slowly cooled,

they arrange themselves into an orderly configuration, a crystal, that represents

a minimal energy state for the system. Simulated in software, this thermodynamic

process readily solves large-scale optimization problems.
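As a sketch of the mechanics (again with a hypothetical stand-in fitness function peaked at (6, 20)), simulated annealing reduces to a perturb-accept-cool loop:

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical fitness stand-in; peak at (6, 20).
double sa_fitness(int len1, int len2) {
    double a = len1 - 6.0, b = len2 - 20.0;
    return 1000.0 - a * a - b * b;
}

struct State { int len1, len2; double fit; };

// Simulated annealing: perturb the current state, always accept
// improvements, accept deteriorations with probability exp(delta / T),
// and cool T gradually; track the best state ever visited.
State anneal(int start1, int start2, unsigned seed) {
    std::mt19937 rng(seed);
    auto unit = [&]() { return rng() / 4294967296.0; };   // [0,1)
    auto clamp = [](int v) { return v < 2 ? 2 : (v > 50 ? 50 : v); };

    State cur = { start1, start2, sa_fitness(start1, start2) };
    State best = cur;
    for (double T = 100.0; T > 0.01; T *= 0.95) {   // cooling schedule
        for (int k = 0; k < 20; ++k) {
            State cand = cur;
            cand.len1 = clamp(cand.len1 + (int)(rng() % 7) - 3); // random step
            cand.len2 = clamp(cand.len2 + (int)(rng() % 7) - 3);
            cand.fit = sa_fitness(cand.len1, cand.len2);
            double delta = cand.fit - cur.fit;
            if (delta >= 0.0 || unit() < std::exp(delta / T)) cur = cand;
            if (cur.fit > best.fit) best = cur;
        }
    }
    return best;
}
```

Early on, when T is large, exp(delta/T) is near 1 even for sizable deteriorations, so the search roams widely; as T falls, downhill moves become rare and the search settles into a good region.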

As with genetic optimization, optimization by simulated annealing is a very

powerful stochastic technique, modeled upon a natural phenomenon, that can find

globally optimal solutions and handle ill-behaved fitness functions. Simulated

annealing has effectively solved significant combinatorial problems, including


the famous “traveling salesman problem,” and the problem of how best to arrange the

millions of circuit elements found on modern integrated circuit chips, such as

those that power computers. Methods based on simulated annealing should not be

construed as limited to combinatorial optimization; they can readily be adapted to

the optimization of real-valued parameters. Consequently, optimizers based on

simulated annealing are applicable to a wide variety of problems, including those

faced by traders.

Since genetic optimizers perform so well, we have experienced little need to

explore optimizers based on simulated annealing. In addition, there have been a

few reports suggesting that, in many cases, annealing algorithms do not perform

as well as genetic algorithms. For these reasons, we have not provided

examples of simulated annealing and have little more to say about the method.

Analytic Optimizers

Analysis (as in “real analysis” or “complex analysis”) is an extension of classical

college calculus. Analytic optimizers involve the well-developed machinery of

analysis, specifically differential calculus and the study of analytic functions, in

the solution of practical problems. In some instances, analytic methods can yield

a direct (noniterative) solution to an optimization problem. This happens to be the

case for multiple regression, where solutions can be obtained with a few matrix

calculations. In multiple regression, the goal is to find a set of regression weights

that minimize the sum of the squared prediction errors. In other cases, iterative

techniques must be used. The connection weights in a neural network, for exam-

ple, cannot be directly determined. They must be estimated using an iterative pro-

cedure, such as back-propagation.

Many iterative techniques used to solve multivariate optimization prob-

lems (those involving several variables or parameters) employ some variation

on the theme of steepest ascent. In its most basic form, optimization by steep-

est ascent works as follows: A point in the domain of the fitness function (that

is, a set of parameter values) is chosen by some means. The gradient vector at

that point is evaluated by computing the derivatives of the fitness function with

respect to each of the variables or parameters; this defines the direction in n-

dimensional parameter space for which a fixed amount of movement will pro-

duce the greatest increase in fitness. A small step is taken up the hill in fitness

space, along the direction of the gradient. The gradient is then recomputed at

this new point, and another, perhaps smaller, step is taken. The process is

repeated until convergence occurs.
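The loop just described, choose a point, evaluate the gradient, step uphill, repeat, can be sketched directly; the smooth, single-peaked fitness function below is an illustrative assumption, with its peak at (3, -1):

```cpp
#include <cassert>
#include <cmath>

// Illustrative smooth fitness surface with a single peak at (3, -1).
double hill(double x, double y) {
    return -(x - 3.0) * (x - 3.0) - (y + 1.0) * (y + 1.0);
}

// Steepest ascent with a numerically estimated gradient: at each step,
// compute the partial derivatives by central differences and take a
// short step uphill, proportional to the gradient.
void steepest_ascent(double& x, double& y, int iterations, double step) {
    const double h = 1.0e-6;
    for (int i = 0; i < iterations; ++i) {
        double gx = (hill(x + h, y) - hill(x - h, y)) / (2.0 * h);
        double gy = (hill(x, y + h) - hill(x, y - h)) / (2.0 * h);
        x += step * gx;   // move along the gradient, the direction of
        y += step * gy;   // greatest local increase in fitness
    }
}
```

From (0, 0) with a step of 0.1, each iteration shrinks the distance to the (3, -1) peak by a factor of 0.8, so the iterate converges rapidly; on a multimodal surface the same loop would simply climb the nearest hill.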

A real-world implementation of steepest ascent optimization has to specify

how the step size will be determined at each iteration, and how the direction

defined by the gradient will be adjusted for better overall convergence of the opti-

mization process. Naive implementations assume that there is an analytic fitness

surface (one that can be approximated locally by a convergent power series) hav-

ing hills that must be climbed. More sophisticated implementations go further,

commonly assuming that the fitness function can be well approximated locally by

a quadratic form. If a fitness function satisfies this assumption, then much faster

convergence to a solution can be achieved. However, when the fitness surface has

many irregularly shaped hills and valleys, quadratic forms often fail to provide a

good approximation. In such cases, the more sophisticated methods break down

entirely or their performance seriously degrades.

Worse than degraded performance is the problem of local solutions. Almost

all analytic methods, whether elementary or sophisticated, are easily trapped by

local maxima: they generally fail to locate the globally best solution when there

are many hills and valleys in the fitness surface. Least-squares, neural network

predictive modeling gives rise to fitness surfaces that, although clearly analytic,

are full of bumps, troughs, and other irregularities that lead standard analytic tech-

niques (including back-propagation, a variant on steepest ascent) astray. Local

maxima and other hazards that accompany such fitness surfaces can, however, be

sidestepped by cleverly marrying a genetic algorithm with an analytic one. For

fitness surfaces amenable to analytic optimization, such a combined algorithm can

provide the best of both worlds: fast, accurate solutions that are also likely to be

globally optimal.

Some fitness surfaces are simply not amenable to analytic optimization.

More specifically, analytic methods cannot be used when the fitness surface has

flat areas or discontinuities in the region of parameter space where a solution is to

be sought. Flat areas imply null gradients, hence the absence of a preferred direc-

tion in which to take a step. At points of discontinuity, the gradient is not defined;

again, a stepping direction cannot be determined. Even if a method does not

explicitly use gradient information, such information is employed implicitly by

the optimization algorithm. Unfortunately, many fitness functions of interest to

traders, including, for instance, all functions that involve net profit, drawdown,

percentage of winning trades, risk-to-reward ratios, and other like items, have

plateaus and discontinuities. They are, therefore, not tractable using analytic methods.

Although the discussion has centered on the maximization of fitness, every-

thing said applies as well to the minimization of cost. Any maximization technique

can be used for minimization, and vice versa: Multiply a fitness function by -1 to

obtain an equivalent cost function; multiply a cost function by -1 and a fitness func-

tion is the result. If a minimization algorithm takes your fancy, but a maximization

is required, use this trick to avoid having to recode the optimization algorithm.
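The trick is a one-liner. Below, a toy one-dimensional minimizer (a fixed-step descent, assumed purely for illustration) is turned into a maximizer by negating the function handed to it:

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Toy one-dimensional minimizer: fixed-step descent on a numeric gradient.
double minimize(const std::function<double(double)>& cost, double x0) {
    double x = x0;
    const double h = 1.0e-6;
    for (int i = 0; i < 500; ++i) {
        double grad = (cost(x + h) - cost(x - h)) / (2.0 * h);
        x -= 0.1 * grad;   // walk downhill on the cost surface
    }
    return x;
}

// Maximization via minimization: multiply the fitness function by -1
// and the minimizer's answer is the maximizer's answer.
double maximize(const std::function<double(double)>& fitness, double x0) {
    return minimize([&](double x) { return -fitness(x); }, x0);
}
```

The same line works in reverse: negate a cost function to feed it to a maximizer, with no change to the optimization algorithm itself.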

Linear Programming

The techniques of linear programming are designed for optimization problems

involving linear cost or fitness functions, and linear constraints on the parameters

or input variables. Linear programming is typically used to solve resource allo-

cation problems. In the world of trading, one use of linear programming might be

to allocate capital among a set of investments to maximize net profit. If risk-

adjusted profit is to be optimized, linear programming methods cannot be used:

Risk-adjusted profit is not a linear function of the amount of capital allocated to

each of the investments; in such instances, other techniques (e.g., genetic algo-

rithms) must be employed. Linear programming methods are rarely useful in the

development of trading systems. They are mentioned here only to inform readers

of their existence.

HOW TO FAIL WITH OPTIMIZATION

Most traders do not seek failure, at least not consciously. However, knowledge of

the way failure is achieved can be of great benefit when seeking to avoid it. Failure

with an optimizer is easy to accomplish by following a few key rules. First, be sure

to use a small data sample when running simulations: The smaller the sample, the

greater the likelihood it will poorly represent the data on which the trading model

will actually be traded. Next, make sure the trading system has a large number of

parameters and rules to optimize: For a given data sample, the greater the number

of variables that must be estimated, the easier it will be to obtain spurious results.

It would also be beneficial to employ only a single sample on which to run tests;

annoying out-of-sample data sets have no place in the rose-colored world of the

ardent loser. Finally, do avoid the headache of inferential statistics. Follow these

rules and failure is guaranteed.

What shape will failure take? Most likely, system performance will look

great in tests, but terrible in real-time trading. Neural network developers call this

phenomenon “poor generalization”; traders are acquainted with it through the

experience of margin calls and a serious loss of trading capital. One consequence

of such a failure-laden outcome is the formation of a popular misconception: that

all optimization is dangerous and to be feared.

In actual fact, optimizers are not dangerous and not all optimization should be

feared. Only bad optimization is dangerous and frightening. Optimization of large

parameter sets on small samples, without out-of-sample tests or inferential statis-

tics, is simply a bad practice that invites unhappy results for a variety of reasons.

Small Samples

Consider the impact of small samples on the optimization process. Small samples

of market data are unlikely to be representative of the universe from which they

are drawn: consequently, they will probably differ significantly from other sam-

ples obtained from the same universe. Applied to a small development sample, an

optimizer will faithfully discover the best possible solution. The best solution for


the development sample, however, may turn out to be a dreadful solution for the

later sample on which genuine trades will be taken. Failure ensues, not because

optimization has found a bad solution, but because it has found a good solution to

the wrong problem!

Optimization on inadequate samples is also good at spawning solutions that

represent only mathematical artifact. As the number of data points declines to the

number of free (adjustable) parameters, most models (trading, regression, or other-

wise) will attain a perfect fit to even random data. The principle involved is the

same one responsible for the fact that a line, which is a two-parameter model, can

always be drawn through any two distinct points, but cannot always be made to

intersect three arbitrary points. In statistics, this is known as the degrees-of-freedom

issue; there are as many degrees of freedom as there are data points beyond that

which can be fitted perfectly for purely mathematical reasons. Even when there are

enough data points to avoid a totally artifact-determined solution, some part of the

model fitness obtained through optimization will be of an artifact-determined

nature, a by-product of the process.

For multiple regression models, a formula is available that can be used to

estimate how much “shrinkage” would occur in the multiple correlation coeffi-

cient (a measure of model fitness) if the artifact-determined component were

removed. The shrinkage correction formula, which shows the relationship

between the number of parameters (regression coefficients) being optimized, sam-

ple size, and decreased levels of apparent fitness (correlation) in tests on new sam-

ples, is shown below in FORTRAN-style notation:
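In the FORTRAN-style notation referred to, the standard shrinkage correction is:

```
RC = SQRT(1.0 - (1.0 - R**2) * ((N - 1) / (N - P - 1)))
```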

In this equation, N represents the number of data points, P the number of model

parameters, R the multiple correlation coefficient determined for the sample by the

regression (optimization) procedure, and RC the shrinkage-corrected multiple cor-

relation coefficient. The inverse formula, one that estimates the optimization-

inflated correlation (R) given the true correlation (RC) existing in the population

from which the data were sampled, appears below:
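Solving the shrinkage correction for R gives the inverse form:

```
R = SQRT(1.0 - (1.0 - RC**2) * ((N - P - 1) / (N - 1)))
```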

These formulas, although legitimate only for linear regression, are not bad

for estimating how well a fully trained neural network model, which is nothing

more than a particular kind of nonlinear regression, will generalize. When work-

ing with neural networks, let P represent the total number of connection weights

in the model. In addition, make sure that simple correlations are used when work-

ing with these formulas; if a neural network or regression package reports the

squared multiple correlation, take the square root.


Large Parameter Sets

An excessive number of free parameters or rules will impact an optimization effort

in a manner similar to an insufficient number of data points. As the number of ele-

ments undergoing optimization rises, a model's ability to capitalize on

idiosyncrasies in the development sample increases along with the proportion of the

model's fitness that can be attributed to mathematical artifact. The result of

optimizing a large number of variables, whether rules, parameters, or both, will be

a model that performs well on the development data, but poorly on out-of-sample

test data and in actual trading.

It is not the absolute number of free parameters that should be of concern,

but the number of parameters relative to the number of data points. The shrinkage

formula discussed in the context of small samples is also heuristically relevant

here: It illustrates how the relationship between the number of data points and the

number of parameters affects the outcome. When there are too many parameters,

given the number of data points, mathematical artifacts and capitalization on

chance (curve-fitting, in the bad sense) become reasons for failure.

No Verification

One of the better ways to get into trouble is by failing to verify model performance

using out-of-sample tests or inferential statistics. Without such tests, the spurious

solutions resulting from small samples and large parameter sets, not to mention

other less obvious causes, will go undetected. The trading system that appears to

be ideal on the development sample will be put “on-line,” and devastating losses

will follow. Developing systems without subjecting them to out-of-sample and sta-

tistical tests is like flying blind, without a safety belt, in an uninspected aircraft.

HOW TO SUCCEED WITH OPTIMIZATION

Four steps can be taken to avoid failure and increase the odds of achieving suc-

cessful optimization. As a first step, optimize on the largest possible representative

sample and make sure many simulated trades are available for analysis. The sec-

ond step is to keep the number of free parameters or rules small, especially in rela-

tion to sample size. A third step involves running tests on out-of-sample data, that

is, data not used or even seen during the optimization process. As a fourth and final

step, it may be worthwhile to statistically assess the results.

Large, Representative Samples

As suggested earlier, failure is often a consequence of presenting an optimizer with

the wrong problem to solve. Conversely, success is likely when the optimizer is


presented with the right problem. The conclusion is that trading models should be

optimized on data from the near future, the data that will actually be traded; do that

and watch the profits roll in. The catch is where to find tomorrow's data today.

Since the future has not yet happened, it is impossible to present the opti-

mizer with precisely the problem that needs to be solved. Consequently, it is nec-

essary to attempt the next-best alternative: to present the optimizer with a broader

problem, the solution to which should be as applicable as possible to the actual,

but impossible-to-solve, problem. One way to accomplish this is with a data sam-

ple that, even though not drawn from the future, embodies many characteristics

that might appear in future samples. Such a data sample should include bull and

bear markets, trending and nontrending periods, and even crashes. In addition, the

data in the sample should be as recent as possible so that it will reflect current pat-

terns of market behavior. This is what is meant by a representative sample.

As well as representative, the sample should be large. Large samples make it

harder for optimizers to uncover spurious or artifact-determined solutions.

Shrinkage, the expected decline in performance on unoptimized data, is reduced

when large samples are employed in the optimization process.

Sometimes, however, a trade-off must be made between the sample's size

and the extent to which it is representative. As one goes farther back in history to

bolster a sample, the data may become less representative of current market con-

ditions. In some instances, there is a clear transition point beyond which the data

become much less representative: For example, the S&P 500 futures began trad-

ing in 1983, effecting a structural change in the general market. Trade-offs become

much less of an issue when working with intraday data on short time frames,

where tens of thousands or even hundreds of thousands of bars of data can be gath-

ered without going back beyond the recent past.

Finally, when running simulations and optimizations, pay attention to the

number of trades a system takes. Like large data samples, it is highly desirable that

simulations and tests involve numerous trades. Chance or artifact can easily be

responsible for any profits produced by a system that takes only a few trades,

regardless of the number of data points used in the test!

Few Rules and Parameters

To achieve success, limit the number of free rules and parameters, especially when

working with small data samples. For a given sample size, the fewer the rules or

parameters to optimize, the greater the likelihood that a trading system will main-

tain its performance in out-of-sample tests and real-time trading. Although sever-

al dozen parameters may be acceptable when working with several thousand

trades taken on 100,000 1-minute bars (about 1 year for the S&P 500 futures),

even two or three parameters may be excessive when developing a system using a

few years of end-of-day data. If a particular model requires many parameters, then

significant effort should be put into assembling a mammoth sample (the legendary

Gann supposedly went back over 1,000 years in his study of wheat prices). An

alternative that sometimes works is optimizing a trading model on a whole port-

folio, using the same rules and parameters across all markets-a technique used

extensively in this book.

Verification of Results

After optimizing the rules and parameters of a trading system to obtain good

behavior on the development or in-sample data, but before risking any real money,

it is essential to verify the system's performance in some manner. Verification of

system performance is important because it gives the trader a chance to veto fail-

ure and embrace success: Systems that fail the test of verification can be discard-

ed; ones that pass can be traded with confidence. Verification is the single most

critical step on the road to success with optimization or, in fact, with any other

method of discovering a trading model that really works.

To ensure success, verify any trading solution using out-of-sample tests or

inferential statistics, preferably both. Discard any solution that fails to be profitable

in an out-of-sample test: It is likely to fail again when the rubber hits the road.

Compute inferential statistics on all tests, both in-sample and out-of-sample. These

statistics reveal the probability that the performance observed in a sample reflects

something real that will hold up in other samples and in real-time trading. Inferential

statistics work by making probability inferences based on the distribution of prof-

itability in a system™s trades or returns. Be sure to use statistics that are corrected for

multiple tests when analyzing in-sample optimization results. Out-of-sample tests

should be analyzed with standard, uncorrected statistics. Such statistics appear in

some of the performance reports that are displayed in the chapter on simulators. The

use of statistics to evaluate trading systems is covered in depth in the following chap-

ter. Develop a working knowledge of statistics; it will make you a better trader.
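
To make the idea of correcting for multiple tests concrete, here is a small sketch in Python. It uses the Bonferroni correction, one common and conservative method, and a normal approximation in place of the exact t distribution; the trade statistics are purely illustrative.

```python
import math

def t_test_p_value(mean, std, n):
    """One-tailed p-value that the true mean is <= 0, using a normal
    approximation to the t distribution (adequate for large n)."""
    t = mean / (std / math.sqrt(n))
    # survival function of the standard normal via the complementary error function
    return 0.5 * math.erfc(t / math.sqrt(2))

def bonferroni(p, n_tests):
    """Correct a p-value for the number of parameter combinations tested."""
    return min(1.0, p * n_tests)

# A mean trade profit that looks significant in isolation...
p = t_test_p_value(mean=50.0, std=400.0, n=200)
# ...may not survive correction for 100 optimization runs.
p_corrected = bonferroni(p, n_tests=100)
```

The uncorrected p-value here is roughly 0.04, but after accounting for 100 tested parameter combinations it is no longer significant at all, which is precisely why in-sample optimization results demand corrected statistics.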

Some suggest checking a model for sensitivity to small changes in parame-

ter values. A model highly tolerant of such changes, it is said, is more “robust”

than one that is not. Do not pay too much attention to these claims. In truth,

parameter tolerance cannot be relied upon as a gauge of model robustness. Many

extremely robust models are highly sensitive to the values of certain parameters.

The only true arbiters of system robustness are statistical and, especially, out-of-

sample tests.

ALTERNATIVES TO TRADITIONAL OPTIMIZATION

There are two major alternatives to traditional optimization: walk-forward opti-

mization and self-adaptive systems. Both of these techniques have the advantage

that any tests carried out are, from start to finish, effectively out-of-sample.


Examine the performance data, run some inferential statistics, plot the equity

curve, and the system is ready to be traded. Everything is clean and mathemati-

cally unimpeachable. Corrections for shrinkage or multiple tests, worries over

excessive curve-fitting, and many of the other concerns that plague traditional

optimization methodologies can be forgotten. Moreover, with today's modern

computer technology, walk-forward and self-adaptive models are practical and not

even difficult to implement.

The principle behind walk-forward optimization (also known as walk-for-

ward testing) is to emulate the steps involved in actually trading a system that

requires periodic optimization. It works like this: Optimize the system on the data

points 1 through M. Then simulate trading on data points M + 1 through M + K.

Reoptimize the system on data points K + 1 through K + M. Then simulate trad-

ing on points (K + M) + 1 through (K + M) + K. Advance through the data series

in this fashion until no more data points are left to analyze. As should be evident,

the system is optimized on a sample of historical data and then traded. After some

period of time, the system is reoptimized and trading is resumed. The sequence of

events guarantees that the data on which trades take place is always in the future

relative to the optimization process; all trades occur on what is, essentially, out-of-

sample data. In walk-forward testing, M is the look-back or optimization window

and K the reoptimization interval.
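
The procedure just described can be sketched as a simple loop in Python; the optimize and simulate callables below are hypothetical placeholders for a real optimizer and trading simulator.

```python
def walk_forward(data, M, K, optimize, simulate):
    """Walk-forward test: optimize on an M-bar window, then trade the next K bars.

    `optimize(window)` returns the best parameters for that window;
    `simulate(segment, params)` returns the result of trading the segment.
    Every traded segment lies in the future relative to the data on which
    its parameters were optimized, so all trades are effectively out-of-sample.
    """
    results = []
    start = 0
    while start + M + K <= len(data):
        window = data[start:start + M]            # look-back (optimization) window
        params = optimize(window)
        segment = data[start + M:start + M + K]   # traded, out-of-sample segment
        results.append(simulate(segment, params))
        start += K                                # advance by the reoptimization interval
    return results

# Toy demonstration: "optimize" picks the window mean as a threshold,
# "simulate" counts how many bars in the traded segment exceed it.
data = list(range(100))
profits = walk_forward(
    data, M=20, K=10,
    optimize=lambda w: sum(w) / len(w),
    simulate=lambda seg, p: sum(1 for x in seg if x > p),
)
```

With a 100-bar series, a 20-bar look-back, and a 10-bar reoptimization interval, the loop produces eight optimize-then-trade cycles, exactly mirroring the periodic reoptimization a trader would perform in practice.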

Self-adaptive systems work in a similar manner, except that the optimization

or adaptive process is part of the system, rather than the test environment. As each

bar or data point comes along, a self-adaptive system updates its internal state (its

parameters or rules) and then makes decisions concerning actions required on the

next bar or data point. When the next bar arrives, the decided-upon actions are car-

ried out and the process repeats. Internal updates, which are how the system learns

about or adapts to the market, need not occur on every single bar. They can be per-

formed at fixed intervals or whenever deemed necessary by the model.
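
The bar-by-bar update cycle can be sketched as follows. The exponentially smoothed mean and the trivial long/flat rule are illustrative placeholders, not a published adaptive model; the point is only the structure: carry out last bar's decision, update internal state, then decide for the next bar.

```python
class SelfAdaptiveSystem:
    """Minimal sketch of a self-adaptive system: on every bar the model
    updates its internal state (here, an exponentially smoothed mean),
    then decides the action to carry out on the NEXT bar."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha      # adaptation rate
        self.mean = None        # internal state, learned as data arrives
        self.pending = "flat"   # decision to execute on the next bar

    def on_bar(self, price):
        action = self.pending                 # execute last bar's decision
        if self.mean is None:
            self.mean = price                 # initialize state on first bar
        else:
            self.mean += self.alpha * (price - self.mean)  # adapt to the market
        self.pending = "long" if price > self.mean else "flat"
        return action

system = SelfAdaptiveSystem()
actions = [system.on_bar(p) for p in [100, 101, 103, 102, 99]]
```

Note that decisions always lag the information used to make them by one bar, so the system never trades on data it has not yet "seen," preserving the out-of-sample character of the test.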

The trader planning to work with self-adapting systems will need a power-

ful, component-based development platform that employs a strong language, such

as C++, Object Pascal, or Visual Basic, and that provides good access to third-

party libraries and software components. Components are designed to be incorpo-

rated into user-written software, including the special-purpose software that

constitutes an adaptive system. The more components that are available, the less

work there is to do. At the very least, a trader venturing into self-adaptive systems

should have at hand genetic optimizer and trading simulator components that can

be easily embedded within a trading model. Adaptive systems will be demonstrat-

ed in later chapters, showing how this technique works in practice.

There is no doubt that walk-forward optimization and adaptive systems will

become more popular over time as the markets become more efficient and diffi-

cult to trade, and as commercial software packages become available that place

these techniques within reach of the average trader.


OPTIMIZER TOOLS AND INFORMATION

Aerodynamics, electronics, chemistry, biochemistry, planning, and business are

just a few of the fields in which optimization plays a role. Because optimization is

of interest to so many problem-solving areas, research goes on everywhere, infor-

mation is abundant, and optimization tools proliferate. Where can this information

be found? What tools and products are available?

Brute force optimizers are usually buried in software packages aimed pri-

marily at tasks other than optimization; they are usually not available on their own.

In the world of trading, products like TradeStation and SuperCharts from Omega

Research (800-292-3453), Excalibur from Futures Truth (828-697-0273), and

MetaStock from Equis International (800-882-3040) have built-in brute force opti-

mizers. If you write your own software, brute force optimization is so trivial to

implement using in-line programming code that the use of special libraries or

components is superfluous. Products and code able to carry out brute force opti-

mization may also serve well for user-guided optimization.
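
As an illustration of how little in-line code brute force optimization actually requires, here is a sketch in Python. The toy moving-average crossover system and the synthetic price series are hypothetical stand-ins for a real model and real market data.

```python
import itertools

def simulate(prices, fast, slow):
    """Toy back-test: hold long when the fast moving average is above
    the slow one; returns total profit in price points."""
    profit, position = 0.0, 0
    for i in range(slow, len(prices)):
        fast_ma = sum(prices[i - fast:i]) / fast
        slow_ma = sum(prices[i - slow:i]) / slow
        if position:                      # accrue profit while long
            profit += prices[i] - prices[i - 1]
        position = 1 if fast_ma > slow_ma else 0
    return profit

def brute_force(prices, fast_range, slow_range):
    """Exhaustively test every parameter combination and keep the best."""
    best_params, best_score = None, float("-inf")
    for fast, slow in itertools.product(fast_range, slow_range):
        if fast >= slow:
            continue                      # skip degenerate combinations
        score = simulate(prices, fast, slow)
        if score > best_score:
            best_params, best_score = (fast, slow), score
    return best_params, best_score

# Synthetic oscillating series with a mild uptrend, for illustration only.
prices = [100 + 0.1 * i + (3 if i % 20 < 10 else -3) for i in range(300)]
params, score = brute_force(prices, range(2, 10), range(10, 40, 5))
```

The entire optimizer is one nested loop, which is why dedicated libraries are superfluous for this task; the same skeleton, with a human choosing which combinations to try next, also serves for user-guided optimization.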

Although sometimes appearing as built-in tools in specialized programs,

genetic optimizers are more often distributed in the form of class libraries or soft-

ware components, add-ons to various application packages, or stand-alone research

instruments. As an example of a class library written with the component paradigm

in mind, consider OptEvolve, the C++ genetic optimizer from Scientific Consultant

Services (516-696-3333): This general-purpose genetic optimizer implements sev-

eral algorithms, including differential evolution, and is sold in the form of highly

portable C++ code that can be used in UNIX/LINUX, DOS, and Windows envi-

ronments. TS-Evolve, available from Ruggiero Associates (800-211-9785), gives

users of TradeStation the ability to perform full-blown genetic optimizations. The

Evolver, which can be purchased from Palisade Corporation (800-432-7475), is a

general-purpose genetic optimizer for Microsoft's Excel spreadsheet; it comes with

a dynamic link library (DLL) that can provide genetic optimization services to user

programs written in any language able to call DLL functions. GENESIS, a stand-

alone instrument aimed at the research community, was written by John Grefenstette

of the Naval Research Laboratory; the product is available in the form of generic C

source code. While genetic optimizers can occasionally be found in modeling tools

for chemists and in other specialized products, they do not yet form a native part of

popular software packages designed for traders.

Information about genetic optimization is readily available. Genetic algo-

rithms are discussed in many books, magazines, and journals and on Internet

newsgroups. A good overview of the field of genetic optimization can be found in

the Handbook of Genetic Algorithms (Davis, 1991). Price and Storn (1997)

described an algorithm for “differential evolution,” which has been shown to be an

exceptionally powerful technique for optimization problems involving real-valued

parameters. Genetic algorithms are currently the focus of many academic journals

and conference proceedings. Lively discussions on all aspects of genetic opti-

mization take place in several Internet newsgroups, of which comp.ai.genetic is the

most noteworthy.

A basic exposition of simulated annealing can be found in Numerical

Recipes in C (Press et al., 1992), as can C functions implementing optimizers for

both combinatorial and real-valued problems. Neural, Novel & Hybrid Algorithms

for Time Series Prediction (Masters, 1995) also discusses annealing-based opti-

mization and contains relevant C++ code on the included CD-ROM. Like genet-

ic optimization, simulated annealing is the focus of many research studies,

conference presentations, journal articles, and Internet newsgroup discussions.

Algorithms and code for conjugate gradient and variable metric optimiza-

tion, two fairly sophisticated analytic methods, can be found in Numerical Recipes

in C (Press et al., 1992) and Numerical Recipes (Press et al., 1986). Masters (1995)

provides an assortment of analytic optimization procedures in C++ (on the CD-

ROM that comes with his book), as well as a good discussion of the subject.

Additional procedures for analytic optimization are available in the IMSL and the

NAG library (from Visual Numerics, Inc., and Numerical Algorithms Group,

respectively) and in the optimization toolbox for MATLAB (a general-purpose

mathematical package from The MathWorks, 508-647-7000, that has gained pop-

ularity in the financial engineering community). Finally, Microsoft's Excel spread-

sheet contains a built-in analytic optimizer-the Solver-that employs conjugate

gradient or Newtonian methods.

As a source of general information about optimization applied to trading sys-

tem development, consult Design, Testing and Optimization of Trading Systems by

Robert Pardo (1992). Among other things, this book shows the reader how to opti-

mize profitably, how to avoid undesirable curve-fitting, and how to carry out walk-

forward tests.

WHICH OPTIMIZER IS FOR YOU?

At the very least, you should have available an optimizer that is designed to make

both brute force and user-guided optimization easy to carry out. Such an optimiz-

er is already at hand if you use either TradeStation or Excalibur for system devel-

opment tasks. On the other hand, if you develop your systems in Excel, Visual

Basic, C++, or Delphi, you will have to create your own brute force optimizer.

As demonstrated earlier, a brute force optimizer is simple to implement. For many

problems, brute force or user-guided optimization is the best approach.

If your system development efforts require something beyond brute force, a

genetic optimizer is a great second choice. Armed with both brute force and genet-

ic optimizers, you will be able to solve virtually any problem imaginable. In our

own efforts, we hardly ever reach for any other kind of optimization tool!

TradeStation users will probably want TS-Evolve from Ruggiero Associates. The

Evolver product from Palisade Corporation is a good choice for Excel and Visual

Basic users. If you develop systems in C++ or Delphi, select the C++ Genetic

Optimizer from Scientific Consultant Services, Inc. A genetic optimizer is the

Swiss Army knife of the optimizer world: Even problems more efficiently solved

using such other techniques as analytic optimization will yield, albeit more slowly,

to a good genetic optimizer.

Finally, if you want to explore analytic optimization or simulated annealing,

we suggest Numerical Recipes in C (Press et al., 1992) and Masters (1995) as

good sources of both information and code. Excel users can try out the built-in

Solver tool.

CHAPTER 4

Statistics

Many trading system developers have little familiarity with inferential statistics.

This is a rather perplexing state of affairs since statistics are essential to assessing

the behavior of trading systems. How, for example, can one judge whether an

apparent edge in the trades produced by a system is real or an artifact of sampling

or chance? Think of it-the next sample may not merely be another test, but an

actual trading exercise. If the system's “edge” was due to chance, trading capital

could quickly be depleted. Consider optimization: Has the system been tweaked

into great profitability, or has the developer only succeeded in the nasty art of

curve-fitting? We have encountered many system developers who refuse to use

any optimization strategy whatsoever because of their irrational fear of curve-fit-

ting, not knowing that the right statistics can help detect such phenomena. In short,

inferential statistics can help a trader evaluate the likelihood that a system is cap-

turing a real inefficiency and will perform as profitably in the future as it has in

the past. In this book, we have presented the results of statistical analyses when-

ever doing so seemed useful and appropriate.

Among the kinds of inferential statistics that are most useful to traders are

t-tests, correlational statistics, and such nonparametric statistics as the runs test.

T-tests are useful for determining the probability that the mean or sum of any

series of independent values (derived from a sampling process) is greater or less

than some other such mean, is a fixed number, or falls within a certain band. For

example, t-tests can reveal the probability that the total profits from a series of

trades, each with its individual profit/loss figure, could be greater than some thresh-

old as a result of chance or sampling. These tests are also useful for evaluating sam-

ples of returns, e.g., the daily or monthly returns of a portfolio over a period of

years. Finally, t-tests can help to set the boundaries of likely future performance


(assuming no structural change in the market), making possible such statements as

“the probability that the average profit will be between x and y in the future is

greater than 95%.”
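
A t-test of the kind just described can be computed in a few lines; the trade profits below are invented for illustration, and in practice the resulting statistic would be referred to a t table with n - 1 degrees of freedom.

```python
import math

def t_statistic(profits, threshold=0.0):
    """t statistic for the hypothesis that the mean trade profit exceeds
    `threshold`; returns the statistic and its degrees of freedom."""
    n = len(profits)
    mean = sum(profits) / n
    var = sum((x - mean) ** 2 for x in profits) / (n - 1)  # sample variance
    return (mean - threshold) / math.sqrt(var / n), n - 1

# Hypothetical per-trade profit/loss figures from a simulated system.
profits = [120, -80, 45, 200, -60, 95, 30, -40, 150, 70]
t, df = t_statistic(profits)
```

Here the mean profit is $53 per trade, but with only 10 trades the t statistic is about 1.8, which is suggestive rather than conclusive; the same mean over hundreds of trades would be far more significant, a point taken up again in the discussion of sample size.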

Correlational statistics help determine the degree of relationship between

different variables. When applied inferentially, they may also be used to assess

whether any relationships found are “statistically significant,” and not merely due

to chance. Such statistics aid in setting confidence intervals or boundaries on the

“true” (population) correlation, given the observed correlation for a specific sam-

ple. Correlational statistics are essential when searching for predictive variables to

include in a neural network or regression-based trading model.

Correlational statistics, as well as such nonparametric statistics as the runs test,

are useful in assessing serial dependence or serial correlation. For instance, do prof-

itable trades come in streaks or runs that are then followed by periods of unprofitable

trading? The runs test can help determine whether this is actually occurring. If there

is serial dependence in a system, it is useful to know it because the system can then

be revised to make use of the serial dependence. For example, if a system has clear-

ly defined streaks of winning and losing, a metasystem can be developed. The meta-

system would take every trade after a winning trade until the first losing trade comes

along, then stop trading until a winning trade is hit, at which point it would again

begin taking trades. If there really are runs, this strategy, or something similar, could

greatly improve a system's behavior.
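
A sketch of the runs test (in its Wald-Wolfowitz form, using the customary normal approximation) applied to a win/loss sequence follows; the strongly streaky sequence is contrived so that the result is clearly significant.

```python
import math

def runs_test(outcomes):
    """Wald-Wolfowitz runs test on a win/loss sequence (True = win).
    Returns the z score under the normal approximation; |z| > 1.96
    suggests serial dependence at roughly the 5% level."""
    n1 = sum(outcomes)                    # number of wins
    n2 = len(outcomes) - n1               # number of losses
    runs = 1 + sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    mean = 2 * n1 * n2 / (n1 + n2) + 1    # expected runs under independence
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / (
        (n1 + n2) ** 2 * (n1 + n2 - 1))
    return (runs - mean) / math.sqrt(var)

# Contrived streaky record: ten wins in a row, then ten losses.
z = runs_test([True] * 10 + [False] * 10)
```

The sequence contains only 2 runs where about 11 would be expected by chance, giving a z score near -4: far fewer runs than chance would produce, exactly the streaky behavior the metasystem idea is designed to exploit.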

WHY USE STATISTICS TO EVALUATE TRADING

SYSTEMS?

It is very important to determine whether any observed profits are real (not arti-

facts of testing), and what the likelihood is that the system producing them will

continue to yield profits in the future when it is used in actual trading. While out-

of-sample testing can provide some indication of whether a system will hold up on

new (future) data, statistical methods can provide additional information and esti-

mates of probability. Statistics can help determine whether a system™s perfor-

mance is due to chance alone or if the trading model has some real validity.

Statistical calculations can even be adjusted for a known degree of curve-fitting,

thereby providing estimates of whether a chance pattern, present in the data sam-

ple being used to develop the system, has been curve-fitted or whether a pattern

present in the population (and hence one that would probably be present in future

samples drawn from the market being examined) has been modeled.

It should be noted that statistics generally make certain theoretical assumptions

about the data samples and populations to which they may be appropriately applied.

These assumptions are often violated when dealing with trading models. Some vio-

lations have little practical effect and may be ignored, while others may be worked

around. By using additional statistics, the more serious violations can sometimes be


detected, avoided, or compensated for; at the very least, they can be understood. In

short, we are fully aware of these violations and will discuss our acts of hubris and

their ramifications after a foundation for understanding the issues has been laid.

SAMPLING

Fundamental to statistics and, therefore, important to understand, is the act of

sampling, which is the extraction of a number of data points or trades (a sample)

from a larger, abstractly defined set of data points or trades (a population). The

central idea behind statistical analysis is the use of samples to make inferences

about the populations from which they are drawn. When dealing with trading

models, the populations will most often be defined as all raw data (past, present,

and future) for a given tradable (e.g., all 5-minute bars on all futures on the S&P

500), all trades (past, present, and future) taken by a specified system on a given

tradable, or all yearly, monthly, or even daily returns. All quarterly earnings

(past, present, and future) of IBM are another example of a population. A sample

could be the specific historical data used in developing or testing a system, the

simulated trades taken, or monthly returns generated by the system on that data.

When creating a trading system, the developer usually draws a sample of

data from the population being modeled. For example, to develop an S&P 500 sys-

tem based on the hypothesis “If yesterday's close is greater than the close three

days ago, then the market will rise tomorrow,” the developer draws a sample of

end-of-day price data from the S&P 500 that extends back, e.g., 5 years. The hope

is that the data sample drawn from the S&P 500 is representative of that market,

i.e., will accurately reflect the actual, typical behavior of that market (the popula-

tion from which the sample was drawn), so that the system being developed will

perform as well in the future (on a previously unseen sample of population data)

as it did in the past (on the sample used as development data). To help determine

whether the system will hold up, developers sometimes test systems on one or

more out-of-sample periods, i.e., on additional samples of data that have not been

used to develop or optimize the trading model. In our example, the S&P 500 devel-

oper might use 5 years of data (e.g., 1991 through 1995) to develop and tweak

the system, and reserve the data from 1996 as the out-of-sample period on which

to test the system. Reserving one or more sets of out-of-sample data is strongly

recommended.

One problem with drawing data samples from financial populations arises

from the complex and variable nature of the markets: today's market may not be

tomorrow's. Sometimes the variations are very noticeable and their causes are

easily discerned, e.g., when the S&P 500 changed in 1983 as a result of the intro-

duction of futures and options. In such instances, the change may be construed as

having created two distinct populations: the S&P 500 prior to 1983 and the S&P

500 after 1983. A sample drawn from the earlier period would almost certainly

not be representative of the population defined by the later period because it was

drawn from a different population! This is, of course, an extreme case. More

often, structural market variations are due to subtle influences that are sometimes

impossible to identify, especially before the fact. In some cases, the market may

still be fundamentally the same, but it may be going through different phases;

each sample drawn might inadvertently be taken from a different phase and be

representative of that phase alone, not of the market as a whole. How can it be

determined that the population from which a sample is drawn for the purpose of

system development is the same as the population on which the system will be

traded? Short of hopping into a time machine and sampling the future, there is no

reliable way to tell if tomorrow will be the day the market undergoes a system-

killing metamorphosis! Multiple out-of-sample tests, conducted over a long peri-

od of time, may provide some assurance that a system will hold up, since they

may show that the market has not changed substantially across several sampling

periods. Given a representative sample, statistics can help make accurate infer-

ences about the population from which the sample was drawn. Statistics cannot,

however, reveal whether tomorrow's market will have changed in some funda-

mental manner.

OPTIMIZATION AND CURVE-FITTING

Another issue found in trading system development is optimization, i.e., improv-

ing the performance of a system by adjusting its parameters until the system per-

forms its best on what the developer hopes is a representative sample. When the

system fails to hold up in the future (or on out-of-sample data), the optimization

process is pejoratively called curve-fitting. However, there is good curve-fitting

and bad curve-fitting. Good curve-fitting is when a model can be fit to the entire

relevant population (or, at least, to a sufficiently large sample thereof), suggesting

that valid characteristics of the entire population have been captured in the model.

Bad curve-fitting occurs when the system only fits chance characteristics, those

that are not necessarily representative of the population from which the sample

was drawn.

Developers are correct to fear bad curve-fitting, i.e., the situation in which

parameter values are adapted to the particular sample on which the system was

optimized, not to the population as a whole. If the sample was small or was not

representative of the population from which it was drawn, it is likely that the sys-

tem will look good on that one sample but fail miserably on another, or worse, lose

money in real-time trading. However, as the sample gets larger, the chance of this

happening becomes smaller: Bad curve-fitting declines and good curve-fitting

increases. All the statistics discussed reflect this, even the ones that specifically

concern optimization. It is true that the more combinations of things optimized,

the greater the likelihood good performance may be obtained by chance alone.

However, if the statistical result was sufficiently good, or the sample on which it

was based large enough to reduce the probability that the outcome was due to

chance, the result might still be very real and significant, even if many parameters

were optimized.

Some have argued that size does not matter, i.e., that sample size and the

number of trades studied have little or nothing to do with the risk of overopti-

mization, and that a large sample does not mitigate curve-fitting. This is patently

untrue, both intuitively and mathematically. Anyone would have less confidence in

a system that took only three or four trades over a 10-year period than in one that

took over 1,000 reasonably profitable trades. Think of a linear regression model in

which a straight line is being fit to a number of points. If there are only two points,

it is easy to fit the line perfectly every time, regardless of where the points are

located. If there are three points, it is harder. If there is a scatterplot of points, it is

going to be harder still, unless those points reveal some real characteristic of the

population that involves a linear relationship.
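
The point about fitting a line to two points can be demonstrated directly with ordinary least squares; the random data below are purely illustrative.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def r_squared(xs, ys):
    """Proportion of variance in ys explained by the fitted line."""
    a, b = fit_line(xs, ys)
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

random.seed(0)
# Two random points: the line fits perfectly every time, wherever they fall.
r2_two = r_squared([1, 2], [random.random(), random.random()])
# Twenty patternless points: a perfect fit is essentially impossible.
r2_many = r_squared(range(20), [random.random() for _ in range(20)])
```

With two points the fit is always perfect regardless of the data, which is pure curve-fitting to chance; with a scatter of patternless points the fit is poor, and only a real linear relationship in the population could make it good again.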

The linear regression example demonstrates that bad curve-fitting does

become more difficult as the sample size gets larger. Consider two trading sys-

tems: One system had a profit per trade of $100, took only 2 trades, and had a stan-

dard deviation of $100 per trade; the other system took 1,000 trades, with

similar means and standard deviations. When evaluated statistically, the system

with 1,000 trades will be a lot more “statistically significant” than the one with

the 2 trades.

In multiple linear regression models, as the number of regression parameters

(beta weights) being estimated is increased relative to the sample size, the amount

of curve-fitting increases and statistical significance lessens for the same

degree of model fit. In other words, the greater the degree of curve-fitting, the

harder it is to get statistical significance. The exception is if the improvement in fit

when adding regressors is sufficient to compensate for the loss in significance due

to the additional parameters being estimated. In fact, an estimate of shrinkage (the

degree to which the multiple correlation can be expected to shrink when computed

using out-of-sample data) can even be calculated given sample size and number of

regressors: Shrinkage increases with regressors and decreases with sample size. In

short, there is mathematical evidence that curve-fitting to chance characteristics of

a sample, with concomitant poor generalization, is more likely if the sample is

small relative to the number of parameters being fit by the model. In fact, as n (the

sample size) goes to infinity, the probability that the curve-fitting (achieved by

optimizing a set of parameters) is nonrepresentative of the population goes to zero.

The larger the number of parameters being optimized, the larger the sample

required. In the language of statistics, the parameters being estimated use up the

available “degrees of freedom.”

All this leads to the conclusion that the larger the sample, the more likely its

“curves” are representative of characteristics of the market as a whole. A small

sample almost certainly will be nonrepresentative of the market: It is unlikely that

its curves will reflect those of the entire market that persist over time. Any model

built using a small sample will be capitalizing purely on the chance of sampling.

Whether curve-fitting is “good” or “bad” depends on if it was done to chance or

to real market patterns, which, in turn, largely depends on the size and representa-

tiveness of the sample. Statistics are useful because they make it possible to take

curve-fitting into account when evaluating a system.

When dealing with neural networks, concerns about overtraining or general-

ization are tantamount to concerns about bad curve-fitting. If the sample is large

enough and representative, curve-fitting some real characteristic of the market is

more likely, which may be good because the model should fit the market. On the

other hand, if the sample is small, the model will almost certainly be fit to pecu-

liar characteristics of the sample and not to the behavior of the market generally.

In neural networks, the concern about whether the neural network will generalize

is the same as the concern about whether other kinds of systems will hold up in

the future. To a great extent, generalization depends on the size of the sample on

which the neural network is trained. The larger the sample, or the smaller the num-

ber of connection weights (parameters) being estimated, the more likely the net-

work will generalize. Again, this can be demonstrated mathematically by

examining simple cases.

As was the case with regression, an estimate of shrinkage (the opposite of

generalization) may be computed when developing neural networks. In a very real

sense, a neural network is actually a multiple regression, albeit a nonlinear one, and

the correlation of a neural net's output with the target may be construed as a multiple

correlation coefficient. The multiple correlation obtained between a net's output

and the target may be corrected for shrinkage to obtain some idea of how the net

might perform on out-of-sample data. Such shrinkage-corrected multiple correla-

tions should routinely be computed as a means of determining whether a network

has merely curve-fit the data or has discovered something useful. The formula for

correcting a multiple correlation for shrinkage is as follows:

RC = SQRT(1.0 - (1.0 - R*R) * (N - 1) / (N - P - 1))

A FORTRAN-style expression was used for reasons of typesetting. In this for-

mula, SQRT represents the square root operator; N is the number of data points

or, in the case of neural networks, facts; P is the number of regression coeffi-

cients or, in the case of neural networks, connection weights; R represents the

uncorrected multiple correlation; and RC is the multiple correlation corrected

for shrinkage. Although this formula is strictly applicable only to linear multi-

ple regression (for which it was originally developed), it works well with neur-

al networks and may be used to estimate how much performance was inflated on

the in-sample data due to curve-fitting. The formula expresses a relationship

between sample size, number of parameters, and deterioration of results. The

statistical correction embodied in the shrinkage formula is used in the chapter on

neural network entry models.
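In code, the correction works as sketched below (Python is used for illustration; we assume the standard Wherry-style form of the formula, with P counting all estimated coefficients, and the function name is ours):

```python
import math

def shrinkage_corrected_r(r, n, p):
    """Correct a multiple correlation for shrinkage.

    r -- uncorrected multiple correlation
    n -- number of data points (facts, for a neural network)
    p -- number of regression coefficients (connection weights)
    """
    adjusted = 1.0 - (1.0 - r * r) * (n - 1.0) / (n - p)
    # With many parameters and few data points, the adjusted value can
    # go negative; report a corrected correlation of zero in that case.
    return math.sqrt(adjusted) if adjusted > 0.0 else 0.0

# A network with 200 facts and 40 weights achieving R = 0.60 in-sample
# shrinks to roughly R = 0.45 when corrected.
print(round(shrinkage_corrected_r(0.60, 200, 40), 4))
```

Note how quickly the corrected correlation collapses as the number of weights approaches the number of facts; this is the mathematical face of curve-fitting.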

SAMPLE SIZE AND REPRESENTATIVENESS

Although, for statistical reasons, the system developer should seek the largest sam-

ple possible, there is a trade-off between sample size and representativeness when

dealing with the financial markets. Larger samples mean samples that go farther

back in time, which is a problem because the market of years ago may be funda-

mentally different from the market of today; remember the S&P 500 in 1983?

This means that a larger sample may sometimes be a less representative sample,

or one that confounds several distinct populations of data! Therefore, keep in mind

that, although the goal is to have the largest sample possible, it is equally impor-

tant to try to make sure the period from which the sample is drawn is still repre-

sentative of the market being predicted.

EVALUATING A SYSTEM STATISTICALLY

Now that some of the basics are out of the way, let us look at how statistics are

used when developing and evaluating a trading system. The examples below

employ a system that was optimized on one sample of data (the in-sample data)

and then run (tested) on another sample of data (the out-of-sample data). The out-

of-sample evaluation of this system will be discussed before the in-sample one

because the statistical analysis was simpler for the former (which is equivalent to

the evaluation of an unoptimized trading system) in that no corrections for mul-

tiple tests or optimization were required. The system is a lunar model that trades

the S&P 500; it was published in an article we wrote (see Katz with McCormick,

June 1997). The TradeStation code for this system is shown below:


Example 1: Evaluating the Out-of-Sample Test

Evaluating an optimized system on a set of out-of-sample data that was never used

during the optimization process is identical to evaluating an unoptimized system.

In both cases, one test is run without adjusting any parameters. Table 4-1 illus-

trates the use of statistics to evaluate an unoptimized system: It contains the out-

of-sample or verification results together with a variety of statistics. Remember, in

this test, a fresh set of data was used; this data was not used as the basis for

adjustments in the system's parameters.

The parameters of the trading model have already been set. A sample of data

was drawn from a period in the past, in this specific case, 1/1/95 through 1/1/97;

this is the out-of-sample or verification data. The model was then run on this out-

of-sample data, and it generated simulated trades. Forty-seven trades were taken.

This set of trades can itself be considered a sample of trades, one drawn from the

population of all trades that the system took in the past or will take in the future;

i.e., it is a sample of trades taken from the universe or population of all trades for

that system. At this point, some inference must be made regarding the average

profit per trade in the population as a whole, based on the sample of trades. Could

the performance obtained in the sample be due to chance alone? To find the

answer, the system must be statistically evaluated.

To begin statistically evaluating this system, the sample mean (average) for

n (the number of trades or sample size) must first be calculated. The mean is

simply the sum of the profit/loss figures for the trades generated divided by n (in

this case, 47). The sample mean was $974.47 per trade. The standard deviation

(the variability in the trade profit/loss figures) is then computed by subtracting

the sample mean from each of the profit/loss numbers for all 47 trades in the

sample; this results in 47 (n) deviations. Each of the deviations is then squared,

and then all squared deviations are added together. The sum of the squared devi-

ations is divided by n - 1 (in this case, 46). By taking the square root of the

resultant number (the mean squared deviation), the sample standard deviation is

obtained. Using the sample standard deviation, the expected standard deviation

of the mean is computed: The sample standard deviation (in this case, $6,091.10)

is divided by the square root of the sample size. For this example, the expected

standard deviation of the mean was $888.48.

To determine the likelihood that the observed profitability is due to chance

alone, a simple t-test is calculated. Since the sample profitability is being compared

with no profitability, zero is subtracted from the sample mean trade profit/loss (com-

puted earlier). The resultant number is then divided by the expected standard devia-

tion of the mean to obtain the value of the t-statistic, which in this case worked out to be 1.0968.

Finally the probability of getting such a large t-statistic by chance alone (under the

assumption that the system was not profitable in the population from which the sam-

ple was drawn) is calculated: The cumulative t-distribution for that t-statistic is com-

puted with the appropriate degrees of freedom, which in this case was n - 1, or 46.
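The arithmetic just described is easy to verify. The sketch below (Python for illustration) reproduces the t-statistic from the summary values reported above; the significance would then be obtained from the cumulative t-distribution with 46 degrees of freedom, e.g., via Excel or the incomplete beta function:

```python
import math

# Summary values reported in the text for the out-of-sample trades.
n = 47                      # number of trades
sample_mean = 974.4681      # average profit per trade
sample_sd = 6091.1028       # sample standard deviation of trade P/L

# Expected standard deviation of the mean (the standard error).
se = sample_sd / math.sqrt(n)

# t-statistic against the null hypothesis of zero average profit.
t = (sample_mean - 0.0) / se

print(round(se, 2))   # 888.48
print(round(t, 4))    # 1.0968
```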

CHAPTER 4 Statistics

TABLE 4-1

Trades from the S&P 500 Data Sample on Which the Lunar Model Was
Verified

Entry Date | Exit Date | Profit/Loss | Cumulative
(individual trade rows not shown)

Statistical Analyses of Mean Profit/Loss:

Sample Size: 47.0000
Sample Mean: 974.4681
Sample Standard Deviation: 6091.1028
Expected SD of Mean: 888.4787
T Statistic (P/L > 0): 1.0968
Probability (Significance): 0.1392
Serial Correlation (lag=1): 0.2120
Associated T Statistic: 1.4301
Probability (Significance): 0.1572
Number of Wins: 16.0000
Percentage of Wins: 0.3404
Upper 99% Bound: 0.5319
Lower 99% Bound: 0.1702

Additional rows follow but are not shown in the table.

(Microsoft's Excel spreadsheet provides a function to obtain probabilities based on

the t-distribution. Numerical Recipes in C provides the incomplete beta function,

which is very easily used to calculate probabilities based on a variety of distribu-

tions, including Student's t.) The cumulative t-distribution calculation yields a figure

that represents the probability that the results obtained from the trading system were

due to chance. Since this figure was small, it is unlikely that the results were due to

capitalization on random features of the sample. The smaller the number, the more

likely the system performed the way it did for reasons other than chance. In this

instance, the probability was 0.1392; i.e., if a system with a true (population) profit

of $0 was repeatedly tested on independent samples, only about 14% of the time

would it show a profit as high as that actually observed.

FIGURE 4-1

Frequency and Cumulative Distribution for In-Sample Trades

Although the t-test was, in this example, calculated for a sample of trade prof-

it/loss figures, it could just as easily have been computed for a sample of daily

returns. Daily returns were employed in this way to calculate the probabilities

referred to in discussions of the substantive tests that appear in later chapters. In

fact, the annualized risk-to-reward ratio (ARRR) that appears in many of the tables

and discussions is nothing more than a rescaled t-statistic based on daily returns.

Finally, a confidence interval on the probability of winning is estimated. In

the example, there were 16 wins in a sample of 47 trades, which yielded a per-

centage of wins equal to 0.3404. Using a particular inverse of the cumulative bino-

mial distribution, upper 99% and lower 99% boundaries are calculated. There is a

99% probability that the percentage of wins in the population as a whole is

between 0.1702 and 0.5319. In Excel, the CRITBINOM function may be used in

the calculation of confidence intervals on percentages.
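This calculation can also be sketched directly (Python for illustration; the function below mirrors Excel's CRITBINOM, i.e., it returns the smallest number of successes whose cumulative binomial probability reaches a given level):

```python
import math

def binom_cdf(k, n, p):
    """Cumulative probability of k or fewer successes in n Bernoulli trials."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def critbinom(n, p, alpha):
    """Smallest k whose cumulative binomial probability is >= alpha
    (mirrors Excel's CRITBINOM function)."""
    for k in range(n + 1):
        if binom_cdf(k, n, p) >= alpha:
            return k
    return n

# 16 wins in 47 trades: bound the population win percentage at 99%.
n, wins = 47, 16
p_hat = wins / n
lower = critbinom(n, p_hat, 0.005) / n
upper = critbinom(n, p_hat, 0.995) / n
print(lower, upper)
```

Dividing the critical win counts by n yields bounds on the win percentage that come out near the figures reported above; the exact values depend on the inversion convention used.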

The various statistics and probabilities computed above should provide the

system developer with important information regarding the behavior of the trad-

ing model; that is, if the assumptions of normality and independence are met and


if the sample is representative. Most likely, however, the assumptions underlying

the t-tests and other statistics are violated; market data deviates seriously from the

normal distribution, and trades are usually not independent. In addition, the sam-

ple might not be representative. Does this mean that the statistical evaluation just

discussed is worthless? Let's consider the cases.

What if the Distribution Is Not Normal? An assumption in the t-test is that the

underlying distribution of the data is normal. However, the distribution of

profit/loss figures of a trading system is anything but normal, especially if there

are stops and profit targets, as can be seen in Figure 4-1, which shows the distrib-

ution of profits and losses for trades taken by the lunar system. Think of it for a

moment. Rarely will a profit greater than the profit target occur. In fact, a lot

of trades are going to bunch up with a profit equal to that of the profit target. Other

trades are going to bunch up where the stop loss is set, with losses equal to that;

and there will be trades that will fall somewhere in between, depending on the exit

method. The shape of the distribution will not be that of the bell curve that describes

the normal distribution. This is a violation of one of the assumptions underlying the

t-test. In this case, however, the Central Limit Theorem comes to the rescue. It states

that as the number of cases in the sample increases, the distribution of the sample

mean approaches normal. By the time there is a sample size of 10, the errors result-

ing from the violation of the normality assumption will be small, and with sample

sizes greater than 20 or 30, they will have little practical significance for inferences

regarding the mean. Consequently, many statistics can be applied with reasonable

assurance that the results will be meaningful, as long as the sample size is adequate,

as was the case in the example above, which had an n of 47.
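The Central Limit Theorem at work is easy to demonstrate by simulation. The sketch below (Python; the trade distribution is invented for illustration, not taken from the lunar system) draws 47-trade samples from a lumpy stop-and-target distribution and shows that the sample means nevertheless cluster tightly around the true expectation:

```python
import random
import statistics

random.seed(2)

def one_trade():
    """Illustrative, decidedly non-normal trade P/L: most trades hit a
    $2,500 stop, many hit a $5,000 target, a few exit in between."""
    u = random.random()
    if u < 0.55:
        return -2500.0                      # stopped out
    if u < 0.85:
        return 5000.0                       # profit target hit
    return random.uniform(-2500.0, 5000.0)  # exit in between

# Distribution of the *mean* of 47-trade samples.
means = [statistics.fmean(one_trade() for _ in range(47)) for _ in range(2000)]

# Despite the lumpy trade distribution, the sample means cluster
# around the true expectation of the trade distribution.
expected = 0.55 * -2500 + 0.30 * 5000 + 0.15 * 1250
print(round(expected, 1), round(statistics.fmean(means), 1))
```

A histogram of these means would look close to the normal bell curve even though the individual trade distribution has two sharp spikes.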

What if There Is Serial Dependence? A more serious violation, which makes

the above-described application of the t-test not quite cricket, is serial depen-

dence, which is when cases constituting a sample (e.g., trades) are not statistical-

ly independent of one another. Trades come from a time series. When a series of

trades that occurred over a given span of dates is used as a sample, it is not quite

a random sample. A truly random sample would mean that the 100 trades were

randomly taken from the period when the contract for the market started (e.g.,

1983 for the S&P 500) to far into the future; such a sample would not only be less

likely to suffer from serial dependence, but be more representative of the popula-

tion from which it was drawn. However, when developing trading systems, sam-

pling is usually done from one narrow point in time; consequently, each trade may

be correlated with those adjacent to it and so would not be independent.

The practical effect of this statistically is to reduce the effective sample size.

When trying to make inferences, if there is substantial serial dependence, it may

be as if the sample contained only half or even one-fourth of the actual number of

trades or data points observed. To top it off, the extent of serial dependence can-

not definitively be determined. A rough "guesstimate," however, can be made. One

such guesstimate may be obtained by computing a simple lag/lead serial correla-

tion: A correlation is computed between the profit and loss for Trade i and the

profit and loss for Trade i + 1, with i ranging from 1 to n - 1. In the example, the

serial correlation was 0.2120, not very high, but a lower number would be prefer-

able. An associated t-statistic may then be calculated along with a statistical sig-

nificance for the correlation. In the current case, these statistics reveal that if there

really were no serial correlation in the population, a correlation as large as the one

obtained from the sample would only occur in about 16% of such tests.
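The lag-1 serial correlation and its associated t-statistic can be sketched as follows (Python for illustration; the 0.2120 figure is the example's, computed here over 46 trade pairs, and the exact t depends on the degrees-of-freedom convention):

```python
import math

def lag1_serial_correlation(pl):
    """Correlation between the P/L of Trade i and Trade i+1,
    with i ranging from 1 to n - 1, as described above."""
    x, y = pl[:-1], pl[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t_stat(r, pairs):
    """t-statistic testing a correlation r computed from `pairs` cases."""
    return r * math.sqrt((pairs - 2) / (1.0 - r * r))

# Strictly alternating wins and losses: strong negative lag-1 dependence.
print(round(lag1_serial_correlation([500.0, -200.0] * 10), 4))   # -1.0

# The example's correlation of 0.2120 over 46 trade pairs:
print(round(correlation_t_stat(0.2120, 46), 2))   # about 1.44
```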

Serial dependence is a serious problem. If there is a substantial amount of it,

it would need to be compensated for by treating the sample as if it were smaller

than it actually is. Another way to deal with the effect of serial dependence is to

draw a random sample of trades from a larger sample of trades computed over a

longer period of time. This would also tend to make the sample of trades more rep-

resentative of the population.

What if the Markets Change? When developing trading systems, a third assump-

tion of the t-test may be inadvertently violated. There are no precautions that can

be taken to prevent it from happening or to compensate for its occurrence. The rea-

son is that the population from which the development or verification sample was

drawn may be different from the population from which future trades may be taken.

This would happen if the market underwent some real structural or other change.

As mentioned before, the population of trades of a system operating on the S&P

500 before 1983 would be different from the population after that year since, in

1983, the options and futures started trading on the S&P 500 and the market

changed. This sort of thing can devastate any method of evaluating a trading sys-

tem. No matter how much a system is back-tested, if the market changes before

trading begins, the trades will not be taken from the same market for which the sys-

tem was developed and tested; the system will fall apart. All systems, even cur-

rently profitable ones, will eventually succumb to market change. Regardless of the

market, change is inevitable. It is just a question of when it will happen. Despite

this grim fact, the use of statistics to evaluate systems remains essential, because if

the market does not change substantially shortly after trading of the system com-

mences, or if the change is not sufficient to grossly affect the system™s performance,

then a reasonable estimate of expected probabilities and returns can be calculated.

Example 2: Evaluating the In-Sample Tests

How can a system that has been fit to a data sample by the repeated adjustment of

parameters (i.e., an optimized system) be evaluated? Traders frequently optimize

systems to obtain good results. In this instance, the use of statistics is more impor-

tant than ever since the results can be analyzed, compensating for the multiplicity

of tests being performed as part of the process of optimization. Table 4-2 contains

the profit/loss figures and a variety of statistics for the in-sample trades (those

taken on the data sample used to optimize the system). The system was optimized

on data from 1/1/90 through 1/2/95.

Most of the statistics in Table 4-2 are identical to those in Table 4-1, which

was associated with Example 1. Two additional statistics (that differ from those in

the first example) are labeled "Optimization Tests Run" and "Adjusted for

Optimization.” The first statistic is simply the number of different parameter com-

binations tried, i.e., the total number of times the system was run on the data, each

time using a different set of parameters. Since the lunar system parameter, L1, was

stepped from 1 to 20 in increments of 1, 20 tests were performed; consequently,

there were 20 t-statistics, one for each test. The number of tests run is used to make

an adjustment to the probability or significance obtained from the best t-statistic

TABLE 4-2

Trades from the S&P 500 Data Sample on Which the Lunar Model

Was Optimized

Entry Date | Exit Date | Profit/Loss | Cumulative
900417 | 900501 | 5750 | 5750
900501 | 900516 | 11700 | 17450
900516 | 900522 | -2500 | 14950
(remaining trade rows not shown)

computed on the sample: Take 1, and subtract from it the statistical significance

obtained for the best-performing test. Take the resultant number and raise it to the

mth power (where m = the number of tests run). Then subtract that number from

1. This provides the probability of finding, in a sample of m tests (in this case, 20),

at least one t-statistic as good as the one actually obtained for the optimized solu-

tion. The uncorrected probability that the profits observed for the best solution were

due to chance was less than 2%, a fairly significant result. Once adjusted for mul-

tiple tests, i.e., optimization, the statistical significance does not appear anywhere

near as good. Results at the level of those observed could have been obtained for

such an optimized system 31% of the time by chance alone. However, things are

not quite as bad as they seem. The adjustment was extremely conservative and

assumed that every test was completely independent of every other test. In actual

fact, there will be a high serial correlation between most tests since, in many trad-

ing systems, small changes in the parameters produce relatively small changes in

the results. This is exactly like serial dependence in data samples: It reduces the

effective population size, in this case, the effective number of tests run. Because

many of the tests are correlated, the 20 actual tests probably correspond to about 5

to 10 independent tests. If the serial dependence among tests is considered, the

adjusted-for-optimization probability would most likely be around 0.15, instead of

the 0.3104 actually calculated. The nature and extent of serial dependence in the

multiple tests are never known, and therefore, a less conservative adjustment for

optimization cannot be directly calculated, only roughly reckoned.
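The correction just described amounts to one line of arithmetic. In code (Python for illustration):

```python
def adjusted_significance(best_p, m):
    """Probability of seeing, among m independent optimization tests, at
    least one t-statistic as good as the best one actually obtained."""
    return 1.0 - (1.0 - best_p) ** m

# Best of 20 parameter sets, with an uncorrected significance just under 2%:
print(round(adjusted_significance(0.0184, 20), 4))   # 0.3103

# If correlated tests behave like roughly 8 independent ones, the
# corrected figure is far less discouraging:
print(round(adjusted_significance(0.0184, 8), 4))    # about 0.14
```

This reproduces the adjusted figure quoted above to within rounding.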

Under certain circumstances, such as in multiple regression models, there are

exact mathematical formulas for calculating statistics that incorporate the fact that

parameters are being fit, i.e., that optimization is occurring, making corrections for

optimization unnecessary.

Interpreting the Example Statistics

In Example 1, the verification test was presented. The in-sample optimization run

was presented in Example 2. In the discussion of results, we are returning to the nat-

ural order in which the tests were run, i.e., optimization first, verification second.

Optimization Results. Table 4-2 shows the results for the in-sample period. Over

the 5 years of data on which the system was optimized, there were 118 trades (n

= 118). The mean or average trade yielded about $740.97, and the trades were

highly variable, with a sample standard deviation of around ±$3,811; i.e., there

were many trades that lost several thousand dollars, as well as trades that made

many thousands. The degree of profitability can easily be seen by looking at the

profit/loss column, which contains many $2,500 losses (the stop got hit) and a sig-

nificant number of wins, many greater than $5,000, some even greater than

$10,000. The expected standard deviation of the mean suggests that if samples of

this kind were repeatedly taken, the mean would vary only about one-tenth as

much as the individual trades, and that many of the samples would have mean

profitabilities in the range of $740 ± $350.

The t-statistic for the best-performing system from the set of optimization

runs was 2.1118, which has a statistical significance of 0.0184. This was a fairly

strong result. If only one test had been run (no optimizing), this good a result would

have been obtained (by chance alone) only twice in 100 tests, indicating that the

system is probably capturing some real market inefficiency and has some chance of

holding up. However, be warned: This analysis was for the best of 20 sets of para-

meter values tested. If corrected for the fact that 20 combinations of parameter val-

ues were tested, the adjusted statistical significance would only be about 0.31, not

very good; the performance of the system could easily have been due to chance.

Therefore, although the system may hold up, it could also, rather easily, fail.

The serial correlation between trades was only 0.0479, a value small enough

in the present context, with a significance of only 0.6083. These results strongly

suggest that there was no meaningful serial correlation between trades and that the

statistical analyses discussed above are likely to be correct.

There were 58 winning trades in the sample, which represents about a 49%

win rate. The upper 99% confidence boundary was approximately 61% and the

lower 99% confidence boundary was approximately 37%, suggesting that the true

percentage of wins in the population has a 99% likelihood of being found between

those two values. In truth, the confidence region should have been broadened by

correcting for optimization; this was not done because we were not very con-

cerned about the percentage of wins.

Verification Results. Table 4-1, presented earlier, contains the data and statistics

for the out-of-sample test for the model. Since all parameters were already fixed,

and only one test was conducted, there was no need to consider optimization or its

consequences in any manner. In the period from 1/1/95 to 1/1/97, there were 47

trades. The average trade in this sample yielded about $974, which is a greater

average profit per trade than in the optimization sample! The system apparently

did maintain profitable behavior.

At slightly over $6,000, the sample standard deviation was almost double

that of the standard deviation in the optimization sample. Consequently, the stan-

dard deviation of the sample mean was around $890, a fairly large standard error

of estimate; together with the small sample size, this yielded a lower t-statistic

than found in the optimization sample and, therefore, a lowered statistical signifi-

cance of only about 14%. These results were neither very good nor very bad:

There is better than an 80% chance that the system is capitalizing on some real

(non-chance) market inefficiency. The serial correlation in the test sample, however,

was quite a bit higher than in the optimization sample and approached significance, with a

probability of 0.1572; i.e., as large a serial correlation as this would only be

expected about 16% of the time by chance alone, if no true (population) serial cor-

relation was present. Consequently, the t-test on the profit/loss figures has likely


overstated the statistical significance to some degree (maybe between 20 and

30%). If the sample size was adjusted downward the right amount, the t-test prob-

ability would most likely be around 0.18, instead of the 0.1392 that was calculat-

ed. The confidence interval for the percentage of wins in the population ranged

from about 17% to about 53%.

Overall, the assessment is that the system is probably going to hold up in the

future, but not with a high degree of certainty. Considering the two inde-

pendent tests (one showing about a 31% probability, corrected for optimization,

that the profits were due to chance, the other showing a statistical significance of

approximately 14%, corrected to 18% due to the serial correlation), there is a good

chance that the average population trade is profitable and, consequently, that the

system will remain profitable in the future.

OTHER STATISTICAL TECHNIQUES AND THEIR USE

The following section is intended only to acquaint the reader with some other sta-

tistical techniques that are available. We strongly suggest that a more thorough study

be undertaken by those serious about developing and evaluating trading systems.

Genetically Evolved Systems

We develop many systems using genetic algorithms. A popular fitness function (cri-

terion used to determine whether a model is producing the desired outcome) is the

total net profit of the system. However, net profit is not the best measure of system

quality! A system that only trades the major crashes on the S&P 500 will yield a

very high total net profit with a very high percentage of winning trades. But who

knows if such a system would hold up? Intuitively, if the system only took two or

three trades in 10 years, the probability seems very low that it would continue to

perform well in the future or even take any more trades. Part of the problem is that

net profit does not consider the number of trades taken or their variability.

An alternative fitness function that avoids some of the problems associated

with net profit is the t-statistic or its associated probability. When using the t-sta-

tistic as a fitness function, instead of merely trying to evolve the most profitable

systems, the intention is to genetically evolve systems that have the greatest like-

lihood of being profitable in the future or, equivalently, that have the least likeli-

hood of being profitable merely due to chance or curve-fitting. This approach

works fairly well. The t-statistic factors in profitability, sample size, and number

of trades taken. All things being equal, the greater the number of trades a system

takes, the greater the t-statistic and the more likely it will hold up in the future.

Likewise, systems that produce more consistently profitable trades with less vari-

ation are more desirable than systems that produce wildly varying trades and will

yield higher t-statistic values. The t-statistic incorporates many of the features that

define the quality of a trading model into one number that can be maximized by a

genetic algorithm.
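As an illustration, a minimal t-statistic fitness function might look like this (Python sketch; a real genetic optimizer would call it on the trades each candidate system generates):

```python
import math

def t_stat_fitness(trades):
    """Fitness function sketch: the t-statistic of the trades' profit/loss
    figures against a mean of zero. Returns a large negative value for
    degenerate samples so the optimizer discards systems that barely trade."""
    n = len(trades)
    if n < 2:
        return -1e9
    mean = sum(trades) / n
    var = sum((x - mean) ** 2 for x in trades) / (n - 1)
    if var == 0.0:
        return -1e9
    return mean / math.sqrt(var / n)

# Many modest, consistent trades beat a few spectacular ones:
consistent = [300.0, 250.0, 400.0, 150.0, 350.0] * 20   # 100 trades
lumpy = [30000.0, -1000.0, -1000.0]                     # huge net profit, 3 trades
print(t_stat_fitness(consistent) > t_stat_fitness(lumpy))  # True
```

Note the guard clauses: samples with too few trades or zero variability are given a hopeless fitness rather than raising an error.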

Multiple Regression

Another statistical technique frequently used is multiple regression. Consider

intermarket analysis: The purpose of intermarket analysis is to find measures of

behaviors in other markets that are predictive of the future behavior of the market

being studied. Running various regressions is an appropriate technique for ana-

lyzing such potential relationships; moreover, there are excellent statistics to use

for testing and setting confidence intervals on the correlations and regression

(beta) weights generated by the analyses. Due to lack of space and the limited

scope of this chapter, no examples are presented, but the reader is referred to

Myers (1986), a good basic text on multiple regression.

A problem with most textbooks on multiple regression analysis (including

the one just mentioned) is that they do not deal with the issue of serial correlation

in time series data, and its effect on the statistical inferences that can be made from

regression analyses using such data. The reader will need to take the effects of

serial correlation into account: Serial correlation in a data sample has the effect of

reducing the effective sample size, and statistics can be adjusted (at least in a

rough-and-ready manner) based on this effect. Another trick that can be used in

some cases is to perform some transformations on the original data series to make

the time series more “stationary” and to remove the unwanted serial correlations.
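A single-predictor regression of this kind can be sketched as follows (Python for illustration; the return series are invented):

```python
import math

def linear_regression(x, y):
    """Least-squares fit y = a + b*x plus the correlation r: a
    single-predictor sketch of the kind of regression run in
    intermarket analysis (x: predictor market, y: target market)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

# Hypothetical daily percent returns: the target loosely follows the predictor.
bond_returns = [0.2, -0.1, 0.4, 0.0, -0.3, 0.5, 0.1, -0.2]
stock_returns = [0.3, 0.0, 0.5, 0.1, -0.2, 0.4, 0.2, -0.1]
a, b, r = linear_regression(bond_returns, stock_returns)
print(round(b, 3), round(r, 3))
```

With serially correlated daily data, the nominal significance of b and r will be overstated; the effective-sample-size adjustment discussed above applies here as well.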

Monte Carlo Simulations

One powerful, unique approach to making statistical inferences is known as the

Monte Carlo Simulation, which involves repeated tests on synthetic data that are

constructed to have the properties of samples taken from a random population.

Except for randomness, the synthetic data are constructed to have the basic char-

acteristics of the population from which the real sample was drawn and about

which inferences must be made. This is a very powerful method. The beauty of

Monte Carlo Simulations is that they can be performed in a way that avoids the

dangers of assumptions (such as that of the normal distribution) being violated,

which would lead to untrustworthy results.
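A minimal randomization-style Monte Carlo sketch (Python; the trade figures are invented) builds synthetic zero-edge samples by recentering and resampling the observed trades:

```python
import random
import statistics

random.seed(7)

def monte_carlo_p_value(trades, n_runs=2000):
    """Monte Carlo sketch: construct synthetic samples that share the
    observed trades' characteristics but, by construction, have no real
    edge (the observed P/L values recentered on zero), then count how
    often chance alone produces a mean as large as the one observed."""
    observed_mean = statistics.fmean(trades)
    centered = [t - observed_mean for t in trades]   # zero-edge "population"
    hits = 0
    for _ in range(n_runs):
        synthetic = [random.choice(centered) for _ in range(len(trades))]
        if statistics.fmean(synthetic) >= observed_mean:
            hits += 1
    return hits / n_runs

# Hypothetical trade history with a genuine edge: the estimated
# probability that chance alone matches its mean is very small.
trades = [2000.0, 1500.0, -500.0, 1800.0, -700.0, 2500.0, 1200.0, -300.0] * 5
print(monte_carlo_p_value(trades))
```

Because the synthetic samples inherit the lumpy shape of the real trade distribution, no normality assumption is needed.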

Out-of-Sample Testing

Another way to evaluate a system is to perform out-of-sample testing. Several time

periods are reserved to test a model that has been developed or optimized on some

other time period. Out-of-sample testing helps determine how the model behaves


on data it had not seen during optimization or development. This approach is

strongly recommended. In fact, in the examples discussed above, both in-sample

and out-of-sample tests were analyzed. No corrections to the statistics for the

process of optimization are necessary in out-of-sample testing. Out-of-sample and

multiple-sample tests may also provide some information on whether the market

has changed its behavior over various periods of time.

Walk-Forward Testing

In walk-forward testing, a system is optimized on several years of data and then

traded the next year. The system is then reoptimized on several more years of data,

moving the window forward to include the year just traded. The system is then

traded for another year. This process is repeated again and again, “walking for-

ward” through the data series. Although very computationally intensive, this is an

excellent way to study and test a trading system. In a sense, even though opti-

mization is occurring, all trades are taken on what is essentially out-of-sample test

data. All the statistics discussed above, such as the t-tests, can be used on walk-

forward test results in a simple manner that does not require any corrections for

optimization. In addition, the tests will very closely simulate the process that

occurs during real trading: first optimization occurs, next the system is traded on

data not used during the optimization, and then every so often the system is reop-

timized to update it. Sophisticated developers can build the optimization process

into the system, producing what might be called an “adaptive” trading model.

Meyers (1997) wrote an article illustrating the process of walk-forward testing.
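Structurally, walk-forward testing is a simple loop. The sketch below (Python; `optimize` and `trade` are hypothetical placeholders standing in for a real system's parameter search and trading simulator) shows how every recorded trade ends up effectively out-of-sample:

```python
# Walk-forward sketch: optimize on a multi-year window, trade the next
# year, then slide the window forward and repeat.

def optimize(data):
    """Pick the best parameter on in-sample data (placeholder: the
    window's mean stands in for a real parameter search)."""
    return sum(data) / len(data)

def trade(data, param):
    """Simulate one out-of-sample year with the frozen parameter
    (placeholder logic)."""
    return [x - param for x in data]

def walk_forward(yearly_data, train_years=3):
    out_of_sample_trades = []
    for i in range(train_years, len(yearly_data)):
        window = [x for year in yearly_data[i - train_years:i] for x in year]
        param = optimize(window)                 # fit on the window only
        out_of_sample_trades += trade(yearly_data[i], param)
    return out_of_sample_trades                  # all trades are out-of-sample

years = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0], [5.0, 6.0]]
print(len(walk_forward(years)))   # 4: two trades from each walked-forward year
```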

CONCLUSION

In the course of developing trading systems, statistics help the trader quickly reject

models exhibiting behavior that could have been due to chance or to excessive

curve-fitting on an inadequately sized sample. Probabilities can be estimated, and

if it is found that there is only a very small probability that a model™s performance

could be due to chance alone, then the trader can feel more confident when actu-

ally trading the model.

There are many ways for the trader to use and calculate statistics. The cen-

tral theme is the attempt to make inferences about a population on the basis of

samples drawn from that population.

Keep in mind that when using statistics on the kinds of data faced by traders,

certain assumptions will be violated. For practical purposes, some of the violations

may not be too critical; thanks to the Central Limit Theorem, data that are not nor-

mally distributed can usually be analyzed adequately for most needs. Other viola-

tions that are more serious (e.g., ones involving serial dependence) do need to be

taken into account, but rough-and-ready rules may be used to reckon corrections

to the probabilities. The bottom line: It is better to operate with some information,

even knowing that some assumptions may be violated, than to operate blindly.

We have glossed over many of the details, definitions, and reasons behind the

statistics discussed above. Again, the intention was merely to acquaint the reader

with some of the more frequently used applications. We suggest that any commit-

ted trader obtain and study some good basic texts on statistical techniques.

PART II

The Study of Entries

Introduction

In this section, various entry methods are systematically evaluated. The focus is

on which techniques provide good entries and which do not. A good entry is

important because it can reduce exposure to risk and increase the likelihood that a

trade will be profitable. Although it is sometimes possible to make a profit with a

bad entry (given a sufficiently good exit), a good entry gets the trade started on the

right foot.

WHAT CONSTITUTES A GOOD ENTRY?

A good entry is one that initiates a trade at a point of low potential risk and high

potential reward. A point of low risk is usually a point from which there is little

adverse excursion before the market begins to move in the trade's favor. Entries

that yield small adverse excursions on successful trades are desirable because they

permit fairly tight stops to be set, thereby minimizing risk. A good entry should

also have a high probability of being followed quickly by favorable movement in