

CHAPTER 3 Optimizers and Optimization

for our mail. Traders sometimes use optimizers to discover rule combinations that
trade profitably. In Part II, we will demonstrate how a genetic optimizer can evolve
profitable rule-based entry models. More commonly, traders call upon optimizers
to determine the most appropriate values for system parameters; almost any kind
of optimizer, except perhaps an analytic optimizer, may be employed for this pur-
pose. Various kinds of optimizers, including powerful genetic algorithms, are
effective for training or evolving neural or fuzzy logic networks. Asset allocation
problems yield to appropriate optimization strategies. Sometimes it seems as if the
only limit on how optimizers may be employed is the user's imagination, and
therein lies a danger: It is easy to be seduced into “optimizer abuse” by the great
and alluring power of this tool. The correct and incorrect applications of opti-
mizers are discussed later in this chapter.

There are many kinds of optimizers, each with its own special strengths and
weaknesses, advantages and disadvantages. Optimizers can be classified along such
dimensions as human versus machine, complex versus simple, special purpose
versus general purpose, and analytic versus stochastic. All optimizers, regardless
of kind, efficiency, or reliability, execute a search for the best of many potential
solutions to a formally specified problem.

Implicit Optimizers
A mouse cannot be used to click on a button that says “optimize.” There is no spe-
cial command to enter. In fact, there is no special software or even machine in
sight. Does this mean there is no optimizer? No. Even when there is no optimizer
apparent, and it seems as though no optimization is going on, there is. It is known
as implicit optimization and works as follows: The trader tests a set of rules based
upon some ideas regarding the market. Performance of the system is poor, and so
the trader reworks the ideas, modifies the system's rules, and runs another
simulation. Better performance is observed. The trader repeats this process a few
times, each time making changes based on what has been learned along the way.
Eventually, the trader builds a system worthy of being traded with real money.
Was this system an optimized one? Since no parameters were ever explicitly
adjusted and no rules were ever rearranged by the software, it appears as if the
trader has succeeded in creating an unoptimized system. However, more than one
solution from a set of many possible solutions was tested and the best solution was
selected for use in trading or further study. This means that the system was opti-
mized after all! Any form of problem solving in which more than one solution is
examined and the best is chosen constitutes de facto optimization. The trader has
a powerful brain that employed mental problem-solving algorithms, e.g., heuris-
tically guided trial-and-error ones, which are exceptionally potent optimizers.

This means that optimization is always present: optimizers are always at work.
There is no escape!

Brute Force Optimizers
A brute force optimizer searches for the best possible solution by systematically
testing all potential solutions, i.e., all definable combinations of rules, parameters,
or both. Because every possible combination must be tested, brute force opti-
mization can be very slow. Lack of speed becomes a serious issue as the number
of combinations to be examined grows. Consequently, brute force optimization is
subject to the law of “combinatorial explosion.” Just how slow is brute force opti-
mization? Consider a case where there are four parameters to optimize and where
each parameter can take on any of 50 values. Brute force optimization would
require that 50^4 (about 6 million) tests or simulations be conducted before the
optimal parameter set could be determined: if one simulation was executed every
1.62 seconds (typical for TradeStation), the optimization process would take about
4 months to complete. This approach is not very practical, especially when many
systems need to be tested and optimized, when there are many parameters, when
the parameters can take on many values, or when you have a life. Nevertheless,
brute force optimization is useful and effective. If properly done, it will always
find the best possible solution. Brute force is a good choice for small problems
where combinatorial explosion is not an issue and solutions can be found in min-
utes, rather than days or years.
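The arithmetic behind the four-parameter example is easy to check directly, using the figures quoted above:

```python
# Verify the combinatorial-explosion arithmetic from the text:
# 4 parameters, 50 possible values each, 1.62 seconds per simulation.
combinations = 50 ** 4
seconds = combinations * 1.62
days = seconds / 86400.0

print(combinations)   # 6250000
print(round(days))    # about 117 days, i.e., roughly 4 months
```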
Only a small amount of programming code is needed to implement brute
force optimization. Simple loop constructs are commonly employed. Parameters
to be optimized are stepped from a start value to a stop value by some increment
using a For loop (C, C++, Basic, Pascal/Delphi) or a Do loop (FORTRAN). A
brute force optimizer for two parameters, when coded in a modern dialect of
Basic, might appear as follows:
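The Basic listing itself did not survive this reproduction. A sketch of the same nested-loop logic, rendered here in Python, conveys the idea; the evaluate function is a hypothetical stand-in for running a full trading simulation and returning its net profit.

```python
# Brute force optimization of two parameters with nested loops,
# stepping each parameter from a start to a stop value by an increment.
# evaluate() is a hypothetical stand-in for a trading simulation;
# here it is a toy fitness surface with a known peak at (4, 20).
def evaluate(len_a, len_b):
    return -((len_a - 4) ** 2) - ((len_b - 20) ** 2)

best_fit = float("-inf")
best_a = best_b = None
for len_a in range(2, 11, 2):        # LenA stepped from 2 to 10 by 2
    for len_b in range(2, 51, 2):    # LenB stepped from 2 to 50 by 2
        fit = evaluate(len_a, len_b)
        if fit > best_fit:
            best_fit, best_a, best_b = fit, len_a, len_b

print(best_a, best_b)  # → 4 20
```

Every combination is visited exactly once, which is why the method is exhaustive but slow.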
Because brute force optimizers are conceptually simple and easy to program,
they are often built into the more advanced software packages that are available
for traders.
As a practical illustration of brute force optimization, TradeStation was used
to optimize the moving averages in a dual moving-average crossover system.
Optimization was for net profit, the only trading system characteristic that
TradeStation can optimize without the aid of add-on products. The Easy Language code
for the dual moving-average trading model appears below:
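The Easy Language listing is likewise omitted from this reproduction. A minimal Python sketch of the dual moving-average crossover logic (hypothetical price series; simple moving averages assumed) illustrates the model:

```python
# Dual moving-average crossover: signal "buy" when the fast average
# crosses above the slow one, "sell" on the reverse cross.
def sma(prices, length, i):
    # simple moving average of the `length` bars ending at bar i
    return sum(prices[i - length + 1:i + 1]) / length

def crossover_signals(prices, len_a, len_b):
    signals = []
    start = max(len_a, len_b)
    for i in range(start, len(prices)):
        fast_now, slow_now = sma(prices, len_a, i), sma(prices, len_b, i)
        fast_prev, slow_prev = sma(prices, len_a, i - 1), sma(prices, len_b, i - 1)
        if fast_prev <= slow_prev and fast_now > slow_now:
            signals.append((i, "buy"))
        elif fast_prev >= slow_prev and fast_now < slow_now:
            signals.append((i, "sell"))
    return signals

prices = [10, 11, 12, 13, 12, 11, 10, 9, 10, 11, 12, 13]
print(crossover_signals(prices, 2, 4))  # → [(5, 'sell'), (9, 'buy')]
```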

The system was optimized by stepping the length of the first moving average
(LenA) from 2 to 10 in increments of 2. The length of the second moving average
(LenB) was advanced from 2 to 50 with the same increments. Increments were set
greater than 1 so that fewer than 200 combinations would need to be tested
(TradeStation can only save data on a maximum of 200 optimization runs). Since
not all possible combinations of values for the two parameters were explored, the
optimization was less thorough than it could have been; the best solution may have
been missed in the search. Notwithstanding, the optimization required 125 tests,
which took 3 minutes and 24 seconds to complete on 5 years of historical, end-of-
day data, using an Intel 486 machine running at 66 megahertz. The results gener-
ated by the optimization were loaded into an Excel spreadsheet and sorted for net
profit. Table 3-1 presents various performance measures for the top 25 solutions.
In the table, LENA represents the period of the shorter moving average,
LENB the period of the longer moving average, NetPrft the total net profit,
L:NetPrft the net profit for long positions, S:NetPrft the net profit for short posi-
tions, PFact the profit factor, ROA the total (unannualized) return-on-account,
MaxDD the maximum drawdown, #Trds the total number of trades taken, and
%Prft the percentage of profitable trades.
Since optimization is a problem-solving search procedure, it frequently
results in surprising discoveries. The optimization performed on the dual moving-
average crossover system was no exception to the rule. Conventional trading wisdom
says that "the trend is your friend." However, with a second moving average that is
faster than the first, the most profitable solutions in Table 3-1 trade against the trend.
These profitable countertrend solutions might not have been discovered
without the search performed by the optimization procedure.

User-Guided Optimization
Successful user-guided optimization calls for skill, domain knowledge, or
both, on the part of the person guiding the optimization process. Given adequate
skill and experience, not to mention a tractable problem, user-guided optimization
can be extremely efficient and dramatically faster than brute force methods. The
speed and efficiency derive from the addition of intelligence to the search process:
Zones with a high probability of paying off can be recognized and carefully exam-
ined, while time-consuming investigations of regions unlikely to yield good
results can be avoided.
User-guided optimization is most appropriate when ballpark results have
already been established by other means, when the problem is familiar or well
understood, or when only a small number of parameters need to be manipulated.
As a means of "polishing" an existing solution, user-guided optimization is an
excellent choice. It is also useful for studying model sensitivity to changes in rules
or parameter values.

Genetic Optimizers
Imagine something powerful enough to solve all the problems inherent in the
creation of a human being. That something surely represents the ultimate in
problem solving and optimization. What is it? It is the familiar process of evo-
lution. Genetic optimizers endeavor to harness some of that incredible prob-
lem-solving power through a crude simulation of the evolutionary process. In
terms of overall performance and the variety of problems that may be solved,
there is no general-purpose optimizer more powerful than a properly crafted
genetic one.
Genetic optimizers are stochastic optimizers in the sense that they take
advantage of random chance in their operation. It may not seem believable that
tossing dice can be a great way to solve problems, but, done correctly, it can be!
In addition to randomness, genetic optimizers employ selection and recombina-
tion. The clever integration of random chance, selection, and recombination is
responsible for the genetic optimizer's great power. A full discussion of genetic
algorithms, which are the basis for genetic optimizers, appears in Part II.
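The three ingredients just named can be conveyed in a few lines of code. The sketch below is a deliberately minimal genetic optimizer for two integer parameters; the fitness function is a hypothetical stand-in for a trading simulation, and real implementations (such as the toolkit used later in this chapter) are far more elaborate.

```python
import random

# Minimal genetic optimizer for two integer parameters in [2, 50].
# fitness() is a hypothetical stand-in; here, a toy surface peaking at (7, 30).
def fitness(a, b):
    return -((a - 7) ** 2) - ((b - 30) ** 2)

random.seed(42)
pop = [(random.randint(2, 50), random.randint(2, 50)) for _ in range(20)]
for generation in range(60):
    pop.sort(key=lambda p: fitness(*p), reverse=True)
    survivors = pop[:10]                         # selection: keep the fittest half
    children = []
    while len(children) < 10:
        (a1, b1), (a2, b2) = random.sample(survivors, 2)
        child = [random.choice((a1, a2)), random.choice((b1, b2))]  # recombination
        if random.random() < 0.2:                # mutation: random variation
            i = random.randrange(2)
            child[i] = min(50, max(2, child[i] + random.randint(-3, 3)))
        children.append(tuple(child))
    pop = survivors + children

best = max(pop, key=lambda p: fitness(*p))
print(best)
```

Because the fittest members always survive, the best solution found can only improve from one generation to the next.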
Genetic optimizers have many highly desirable characteristics. One such
characteristic is speed, especially when faced with combinatorial explosion. A
genetic optimizer can easily be many orders of magnitude faster than a brute force
optimizer when there are a multiplicity of rules, or parameters that have many pos-
sible values, to manipulate. This is because, like user-guided optimization, genetic
optimization can focus on important regions of solution space while mostly ignor-
ing blind alleys. In contrast to user-guided optimization, the benefit of a selective
search is achieved without the need for human intervention.
Genetic optimizers can swiftly solve complex problems, and they are also
more immune than other kinds of optimizers to the effects of local maxima in the

fitness surface or, equivalently, local minima in the cost surface. Analytic methods
are worst in that they almost always walk right to the top of the nearest hill or bot-
tom of the nearest valley, without regard to whether higher hills or lower valleys
exist elsewhere. In contrast, a good genetic optimizer often locates the globally
best solution, quite an impressive feat when accomplished for cantankerous
fitness surfaces, such as those associated with matrices of neural connection weights.
Another characteristic of genetic optimization is that it works well with fit-
ness surfaces marked by discontinuities, flat regions, and other troublesome irreg-
ularities. Genetic optimization shares this characteristic with brute force,
user-guided, annealing-based, and other nonanalytic optimization methods.
Solutions that maximize such items as net profit, return on investment, the Sharpe
Ratio, and others that define difficult, nonanalytic fitness landscapes can be found
using a genetic optimizer. Genetic optimizers shine with difficult fitness functions
that lie beyond the purview of analytic methods. This does not mean that they
cannot be used to solve problems having more tractable fitness surfaces: While
perhaps slower than the analytic methods, they have the virtue of being more
resistant to the traps set by local optima.
Overall, genetic optimizers are the optimizers of choice when there are many
parameters or rules to adapt, when a global solution is desired, or when arbitrarily
complex (and not necessarily differentiable or continuous) fitness or cost functions
must be handled. Although special-purpose optimizers can outperform genetic opti-
mizers on specific kinds of problems, for general-purpose optimization, genetic
optimizers are among the most powerful tools available.
What does a genetic optimizer look like in action? The dual moving-average
crossover system discussed earlier was translated to C++ so that the genetic
optimizer in the C-Trader toolkit could be used to solve for the two system
parameters, LenA and LenB. LenA, the period of the first moving average, was examined
over the range of 2 through 50, as was LenB, the period of the second moving aver-
age. Optimization was for net profit so that the results would be directly compa-
rable with those produced earlier by brute force optimization. Below is the Cl 1
code for the crossover system:

// take no trades in lookback period
if(dt[cb] < 910302) { eqcls[cb] = 0.0; continue; }

To solve for the best parameters, brute force optimization would require that 2,401
tests be performed; in TradeStation, that works out to about 65 minutes of com-
puting time, extrapolating from the earlier illustration in which a small subset of
the current solution space was examined. Only 1 minute of running time was
required by the genetic optimizer; in an attempt to put it at a significant disadvan-
tage, it was prematurely stopped after performing only 133 tests.
The output from the genetic optimizer appears in Table 3-2. In this table, P1 rep-
resents the period of the faster moving average, P2 the period of the slower moving
average, NET the total net profit, NETLNG the net profit for long positions, NETSHT
the net profit for short positions, PFAC the profit factor, ROA% the annualized return
on account, DRAW the maximum drawdown, TRDS the number of trades taken by the
system, WIN% the percentage of winning trades, AVGT the profit or loss resulting
from the average trade, and FIT the fitness of the solution (which, in this instance, is
merely the total net profit). As with the brute force data in Table 3-1, the genetic data
have been sorted by net profit (fitness) and only the 25 best solutions were presented.
Top 25 Solutions Found Using Genetic Optimization in C-Trader Toolkit

Comparison of the brute force and genetic optimization results (Tables 3-1 and 3-2,
respectively) reveals that the genetic optimizer isolated a solution with a greater net
profit ($172,725) than did the brute force optimizer ($145,125). This is no surprise
since a larger solution space, not decimated by increments, was explored. The sur-
prise is that the better solution was found so quickly, despite the handicap of a pre-
maturely stopped evolutionary process. Results like these demonstrate the incredible
effectiveness of genetic optimization.

Optimization by Simulated Annealing
Optimizers based on annealing mimic the thermodynamic process by which liq-
uids freeze and metals anneal. Starting out at a high temperature, the atoms of a
liquid or molten metal bounce rapidly about in a random fashion. Slowly cooled,
they arrange themselves into an orderly configuration (a crystal) that represents
a minimal energy state for the system. Simulated in software, this thermodynamic
process readily solves large-scale optimization problems.
As with genetic optimization, optimization by simulated annealing is a very
powerful stochastic technique, modeled upon a natural phenomenon, that can find
globally optimal solutions and handle ill-behaved fitness functions. Simulated
annealing has effectively solved significant combinatorial problems, including

the famous “traveling salesman problem,” and the problem of how best to arrange the
millions of circuit elements found on modern integrated circuit chips, such as
those that power computers. Methods based on simulated annealing should not be
construed as limited to combinatorial optimization; they can readily be adapted to
the optimization of real-valued parameters. Consequently, optimizers based on
simulated annealing are applicable to a wide variety of problems, including those
faced by traders.
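A minimal simulated annealing loop makes the analogy concrete. The cost function below is a hypothetical stand-in (lower is better); the cooling schedule and acceptance rule are the textbook Metropolis form, not a production implementation.

```python
import math
import random

# Minimal simulated annealing for two real-valued parameters.
# cost() is a hypothetical stand-in for something a trader would minimize;
# here, a toy surface with its minimum at (1.5, -2.0).
def cost(x, y):
    return (x - 1.5) ** 2 + (y + 2.0) ** 2

random.seed(1)
x, y = 10.0, 10.0
temp = 10.0
while temp > 1e-4:
    # propose a random move whose size shrinks as the system cools
    nx, ny = x + random.gauss(0, temp), y + random.gauss(0, temp)
    delta = cost(nx, ny) - cost(x, y)
    # always accept improvements; accept some uphill moves while hot,
    # which is what lets the search escape local minima
    if delta < 0 or random.random() < math.exp(-delta / temp):
        x, y = nx, ny
    temp *= 0.99                               # slow cooling

print(round(x, 1), round(y, 1))
```

As the temperature falls, uphill moves become rare and the search settles into a low-cost configuration, just as cooling atoms settle into a crystal.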
Since genetic optimizers perform so well, we have experienced little need to
explore optimizers based on simulated annealing. In addition, there have been a
few reports suggesting that, in many cases, annealing algorithms do not perform
as well as genetic algorithms. For these reasons, we have not provided
examples of simulated annealing and have little more to say about the method.

Analytic Optimizers
Analysis (as in “real analysis” or “complex analysis”) is an extension of classical
college calculus. Analytic optimizers involve the well-developed machinery of
analysis, specifically differential calculus and the study of analytic functions, in
the solution of practical problems. In some instances, analytic methods can yield
a direct (noniterative) solution to an optimization problem. This happens to be the
case for multiple regression, where solutions can be obtained with a few matrix
calculations. In multiple regression, the goal is to find a set of regression weights
that minimize the sum of the squared prediction errors. In other cases, iterative
techniques must be used. The connection weights in a neural network, for exam-
ple, cannot be directly determined. They must be estimated using an iterative pro-
cedure, such as back-propagation.
Many iterative techniques used to solve multivariate optimization prob-
lems (those involving several variables or parameters) employ some variation
on the theme of steepest ascent. In its most basic form, optimization by steep-
est ascent works as follows: A point in the domain of the fitness function (that
is, a set of parameter values) is chosen by some means. The gradient vector at
that point is evaluated by computing the derivatives of the fitness function with
respect to each of the variables or parameters; this defines the direction in n-
dimensional parameter space for which a fixed amount of movement will pro-
duce the greatest increase in fitness. A small step is taken up the hill in fitness
space, along the direction of the gradient. The gradient is then recomputed at
this new point, and another, perhaps smaller, step is taken. The process is
repeated until convergence occurs.
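The procedure just described can be sketched directly. Here the gradient of a toy fitness function is known analytically, and the step size is held fixed for simplicity, a naive choice:

```python
# Steepest ascent with a fixed step size on a toy fitness function
# f(a, b) = -(a - 3)^2 - (b - 5)^2, whose gradient is known analytically.
def gradient(a, b):
    return (-2.0 * (a - 3.0), -2.0 * (b - 5.0))

a, b = 0.0, 0.0          # starting point chosen by some means
step = 0.1
for _ in range(200):
    ga, gb = gradient(a, b)
    a, b = a + step * ga, b + step * gb   # small step uphill along the gradient

print(round(a, 3), round(b, 3))  # → 3.0 5.0, the maximum of f
```

On this smooth, single-peaked surface the iteration contracts toward the optimum geometrically; the sections that follow explain why such behavior cannot be expected on rougher surfaces.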
A real-world implementation of steepest ascent optimization has to specify
how the step size will be determined at each iteration, and how the direction
defined by the gradient will be adjusted for better overall convergence of the opti-
mization process. Naive implementations assume that there is an analytic fitness
surface (one that can be approximated locally by a convergent power series) hav-
ing hills that must be climbed. More sophisticated implementations go further,
commonly assuming that the fitness function can be well approximated locally by
a quadratic form. If a fitness function satisfies this assumption, then much faster
convergence to a solution can be achieved. However, when the fitness surface has
many irregularly shaped hills and valleys, quadratic forms often fail to provide a
good approximation. In such cases, the more sophisticated methods break down
entirely or their performance seriously degrades.
Worse than degraded performance is the problem of local solutions. Almost
all analytic methods, whether elementary or sophisticated, are easily trapped by
local maxima: they generally fail to locate the globally best solution when there
are many hills and valleys in the fitness surface. Least-squares, neural network
predictive modeling gives rise to fitness surfaces that, although clearly analytic,
are full of bumps, troughs, and other irregularities that lead standard analytic tech-
niques (including back-propagation, a variant on steepest ascent) astray. Local
maxima and other hazards that accompany such fitness surfaces can, however, be
sidestepped by cleverly marrying a genetic algorithm with an analytic one. For
fitness surfaces amenable to analytic optimization, such a combined algorithm can
provide the best of both worlds: fast, accurate solutions that are also likely to be
globally optimal.
Some fitness surfaces are simply not amenable to analytic optimization.
More specifically, analytic methods cannot be used when the fitness surface has
flat areas or discontinuities in the region of parameter space where a solution is to
be sought. Flat areas imply null gradients, hence the absence of a preferred direc-
tion in which to take a step. At points of discontinuity, the gradient is not defined;
again, a stepping direction cannot be determined. Even if a method does not
explicitly use gradient information, such information is employed implicitly by
the optimization algorithm. Unfortunately, many fitness functions of interest to
traders (including, for instance, all functions that involve net profit, drawdown,
percentage of winning trades, risk-to-reward ratios, and other like items) have
plateaus and discontinuities. They are, therefore, not tractable using analytic methods.
Although the discussion has centered on the maximization of fitness, every-
thing said applies as well to the minimization of cost. Any maximization technique
can be used for minimization, and vice versa: Multiply a fitness function by -1 to
obtain an equivalent cost function; multiply a cost function by -1 and a fitness
function is the result. If a minimization algorithm takes your fancy, but a maximization
is required, use this trick to avoid having to recode the optimization algorithm.
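The trick is trivial to apply in code: wrap the fitness function in a negation and hand the result to the minimizer. The functions below are illustrative only.

```python
# Turning a maximization problem into a minimization problem:
# multiply the fitness function by -1 to obtain a cost function.
def fitness(x):
    return -(x - 4.0) ** 2          # toy fitness, maximum at x = 4

def cost(x):
    return -fitness(x)              # equivalent cost; minimum also at x = 4

# a generic minimizer (here, a crude grid scan) now solves the
# original maximization problem
best = min((cost(v / 100.0), v / 100.0) for v in range(1001))[1]
print(best)  # → 4.0
```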

Linear Programming
The techniques of linear programming are designed for optimization problems
involving linear cost or fitness functions, and linear constraints on the parameters
or input variables. Linear programming is typically used to solve resource allo-
cation problems. In the world of trading, one use of linear programming might be
to allocate capital among a set of investments to maximize net profit. If risk-
adjusted profit is to be optimized, linear programming methods cannot be used:
Risk-adjusted profit is not a linear function of the amount of capital allocated to
each of the investments; in such instances, other techniques (e.g., genetic algo-
rithms) must be employed. Linear programming methods are rarely useful in the
development of trading systems. They are mentioned here only to inform readers
of their existence.

How to Fail with Optimization
Most traders do not seek failure, at least not consciously. However, knowledge of
the way failure is achieved can be of great benefit when seeking to avoid it. Failure
with an optimizer is easy to accomplish by following a few key rules. First, be sure
to use a small data sample when running simulations: The smaller the sample, the
greater the likelihood it will poorly represent the data on which the trading model
will actually be traded. Next, make sure the trading system has a large number of
parameters and rules to optimize: For a given data sample, the greater the number
of variables that must be estimated, the easier it will be to obtain spurious results.
It would also be beneficial to employ only a single sample on which to run tests;
annoying out-of-sample data sets have no place in the rose-colored world of the
ardent loser. Finally, do avoid the headache of inferential statistics. Follow these
rules and failure is guaranteed.
What shape will failure take? Most likely, system performance will look
great in tests, but terrible in real-time trading. Neural network developers call this
phenomenon “poor generalization”; traders are acquainted with it through the
experience of margin calls and a serious loss of trading capital. One consequence
of such a failure-laden outcome is the formation of a popular misconception: that
all optimization is dangerous and to be feared.
In actual fact, optimizers are not dangerous and not all optimization should be
feared. Only bad optimization is dangerous and frightening. Optimization of large
parameter sets on small samples, without out-of-sample tests or inferential statis-
tics, is simply a bad practice that invites unhappy results for a variety of reasons.

Small Samples
Consider the impact of small samples on the optimization process. Small samples
of market data are unlikely to be representative of the universe from which they
are drawn; consequently, they will probably differ significantly from other samples
obtained from the same universe. Applied to a small development sample, an
optimizer will faithfully discover the best possible solution. The best solution for

the development sample, however, may turn out to be a dreadful solution for the
later sample on which genuine trades will be taken. Failure ensues, not because
optimization has found a bad solution, but because it has found a good solution to
the wrong problem!
Optimization on inadequate samples is also good at spawning solutions that
represent only mathematical artifact. As the number of data points declines to the
number of free (adjustable) parameters, most models (trading, regression, or otherwise)
will attain a perfect fit to even random data. The principle involved is the
same one responsible for the fact that a line, which is a two-parameter model, can
always be drawn through any two distinct points, but cannot always be made to
intersect three arbitrary points. In statistics, this is known as the degrees-of-freedom
issue; there are as many degrees of freedom as there are data points beyond that
which can be fitted perfectly for purely mathematical reasons. Even when there are
enough data points to avoid a totally artifact-determined solution, some part of the
model fitness obtained through optimization will be of an artifact-determined
nature, a by-product of the process.
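The line-through-two-points principle is easy to demonstrate: a two-parameter model fits any two distinct points exactly, so a perfect in-sample fit with as many parameters as data points proves nothing about the data.

```python
import random

# A line (a two-parameter model) can be made to pass exactly through
# any two distinct points, even points that are pure random noise.
random.seed(0)
(x1, y1), (x2, y2) = (1.0, random.random()), (2.0, random.random())

slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1

# the "model" reproduces both random data points perfectly
assert abs(slope * x1 + intercept - y1) < 1e-12
assert abs(slope * x2 + intercept - y2) < 1e-12
print("perfect in-sample fit to random data")
```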
For multiple regression models, a formula is available that can be used to
estimate how much “shrinkage” would occur in the multiple correlation coeffi-
cient (a measure of model fitness) if the artifact-determined component were
removed. The shrinkage correction formula, which shows the relationship
between the number of parameters (regression coefficients) being optimized, sam-
ple size, and decreased levels of apparent fitness (correlation) in tests on new sam-
ples, is shown below in FORTRAN-style notation:
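The formula itself did not survive this reproduction. The classic shrinkage correction consistent with the surrounding description is shown below; note that conventions for counting P (whether the intercept is included) vary between texts, so treat the exact denominator as an assumption.

```
RC = SQRT(1.0 - (1.0 - R**2) * (N - 1) / (N - P))
```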

In this equation, N represents the number of data points, P the number of model
parameters, R the multiple correlation coefficient determined for the sample by the
regression (optimization) procedure, and RC the shrinkage-corrected multiple correlation
coefficient. The inverse formula, one that estimates the optimization-
inflated correlation (R) given the true correlation (RC) existing in the population
from which the data were sampled, appears below:
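This formula is also missing from the reproduction. Solving the standard shrinkage relationship for R gives the following (again subject to the convention used for counting P):

```
R = SQRT(1.0 - (1.0 - RC**2) * (N - P) / (N - 1))
```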

These formulas, although legitimate only for linear regression, are not bad
for estimating how well a fully trained neural network model (which is nothing
more than a particular kind of nonlinear regression) will generalize. When working
with neural networks, let P represent the total number of connection weights
in the model. In addition, make sure that simple correlations are used when work-
ing with these formulas; if a neural network or regression package reports the
squared multiple correlation, take the square root.

Large Parameter Sets
An excessive number of free parameters or rules will impact an optimization effort
in a manner similar to an insufficient number of data points. As the number of
elements undergoing optimization rises, a model's ability to capitalize on
idiosyncrasies in the development sample increases along with the proportion of the
model's fitness that can be attributed to mathematical artifact. The result of optimizing
a large number of variables, whether rules, parameters, or both, will be
a model that performs well on the development data, but poorly on out-of-sample
test data and in actual trading.
It is not the absolute number of free parameters that should be of concern,
but the number of parameters relative to the number of data points. The shrinkage
formula discussed in the context of small samples is also heuristically relevant
here: It illustrates how the relationship between the number of data points and the
number of parameters affects the outcome. When there are too many parameters,
given the number of data points, mathematical artifacts and capitalization on
chance (curve-fitting, in the bad sense) become reasons for failure.

No Verification
One of the better ways to get into trouble is by failing to verify model performance
using out-of-sample tests or inferential statistics. Without such tests, the spurious
solutions resulting from small samples and large parameter sets, not to mention
other less obvious causes, will go undetected. The trading system that appears to
be ideal on the development sample will be put “on-line,” and devastating losses
will follow. Developing systems without subjecting them to out-of-sample and sta-
tistical tests is like flying blind, without a safety belt, in an uninspected aircraft.
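The danger is easy to simulate. In the sketch below, the "best" of 50 purely random systems is selected on development data and then examined on data the selection process never saw; all names and figures are hypothetical.

```python
import random

# Why verification matters: pick the "best" of 50 random systems on
# development (in-sample) data, then check it on fresh data.
# Each system's daily returns here are pure noise, so any apparent
# edge is an artifact of the selection process itself.
random.seed(11)
n_systems, n_days = 50, 500

def mean(xs):
    return sum(xs) / len(xs)

in_sample  = [[random.gauss(0, 1) for _ in range(n_days)] for _ in range(n_systems)]
out_sample = [[random.gauss(0, 1) for _ in range(n_days)] for _ in range(n_systems)]

best = max(range(n_systems), key=lambda i: mean(in_sample[i]))
print("in-sample mean of best system:     ", round(mean(in_sample[best]), 3))
print("out-of-sample mean of same system: ", round(mean(out_sample[best]), 3))
# the in-sample figure is inflated by selection; out-of-sample,
# it typically collapses toward zero
```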

How to Succeed with Optimization
Four steps can be taken to avoid failure and increase the odds of achieving suc-
cessful optimization. As a first step, optimize on the largest possible representative
sample and make sure many simulated trades are available for analysis. The sec-
ond step is to keep the number of free parameters or rules small, especially in rela-
tion to sample size. A third step involves running tests on out-of-sample data, that
is, data not used or even seen during the optimization process. As a fourth and final
step, it may be worthwhile to statistically assess the results.

Large, Representative Samples
As suggested earlier, failure is often a consequence of presenting an optimizer with
the wrong problem to solve. Conversely, success is likely when the optimizer is

presented with the right problem. The conclusion is that trading models should be
optimized on data from the near future, the data that will actually be traded; do that
and watch the profits roll in. The catch is where to find tomorrow's data today.
Since the future has not yet happened, it is impossible to present the opti-
mizer with precisely the problem that needs to be solved. Consequently, it is nec-
essary to attempt the next-best alternative: to present the optimizer with a broader
problem, the solution to which should be as applicable as possible to the actual,
but impossible-to-solve, problem. One way to accomplish this is with a data sam-
ple that, even though not drawn from the future, embodies many characteristics
that might appear in future samples. Such a data sample should include bull and
bear markets, trending and nontrending periods, and even crashes. In addition, the
data in the sample should be as recent as possible so that it will reflect current pat-
terns of market behavior. This is what is meant by a representative sample.
As well as representative, the sample should be large. Large samples make it
harder for optimizers to uncover spurious or artifact-determined solutions.
Shrinkage, the expected decline in performance on unoptimized data, is reduced
when large samples are employed in the optimization process.
Sometimes, however, a trade-off must be made between the sample's size
and the extent to which it is representative. As one goes farther back in history to
bolster a sample, the data may become less representative of current market con-
ditions. In some instances, there is a clear transition point beyond which the data
become much less representative: For example, the S&P 500 futures began trad-
ing in 1983, effecting a structural change in the general market. Trade-offs become
much less of an issue when working with intraday data on short time frames,
where tens of thousands or even hundreds of thousands of bars of data can be gath-
ered without going back beyond the recent past.
Finally, when running simulations and optimizations, pay attention to the
number of trades a system takes. As with large data samples, it is highly desirable that
simulations and tests involve numerous trades. Chance or artifact can easily be
responsible for any profits produced by a system that takes only a few trades,
regardless of the number of data points used in the test!

Few Rules and Parameters
To achieve success, limit the number of free rules and parameters, especially when
working with small data samples. For a given sample size, the fewer the rules or
parameters to optimize, the greater the likelihood that a trading system will main-
tain its performance in out-of-sample tests and real-time trading. Although sever-
al dozen parameters may be acceptable when working with several thousand
trades taken on 100,000 1-minute bars (about 1 year for the S&P 500 futures),
even two or three parameters may be excessive when developing a system using a
few years of end-of-day data. If a particular model requires many parameters, then
significant effort should be put into assembling a mammoth sample (the legendary
Gann supposedly went back over 1,000 years in his study of wheat prices). An
alternative that sometimes works is optimizing a trading model on a whole port-
folio, using the same rules and parameters across all markets, a technique used
extensively in this book.

Verification of Results
After optimizing the rules and parameters of a trading system to obtain good
behavior on the development or in-sample data, but before risking any real money,
it is essential to verify the system's performance in some manner. Verification of
system performance is important because it gives the trader a chance to veto failure
and embrace success: Systems that fail the test of verification can be discarded;
ones that pass can be traded with confidence. Verification is the single most
critical step on the road to success with optimization or, in fact, with any other
method of discovering a trading model that really works.
To ensure success, verify any trading solution using out-of-sample tests or
inferential statistics, preferably both. Discard any solution that fails to be profitable
in an out-of-sample test: It is likely to fail again when the rubber hits the road.
Compute inferential statistics on all tests, both in-sample and out-of-sample. These
statistics reveal the probability that the performance observed in a sample reflects
something real that will hold up in other samples and in real-time trading. Inferential
statistics work by making probability inferences based on the distribution of prof-
itability in a system™s trades or returns. Be sure to use statistics that are corrected for
multiple tests when analyzing in-sample optimization results. Out-of-sample tests
should be analyzed with standard, uncorrected statistics. Such statistics appear in
some of the performance reports that are displayed in the chapter on simulators. The
use of statistics to evaluate trading systems is covered in depth in the following chap-
ter. Develop a working knowledge of statistics; it will make you a better trader.
Some suggest checking a model for sensitivity to small changes in parame-
ter values. A model highly tolerant of such changes is more “robust” than a model
not as tolerant, it is said. Do not pay too much attention to these claims. In truth,
parameter tolerance cannot be relied upon as a gauge of model robustness. Many
extremely robust models are highly sensitive to the values of certain parameters.
The only true arbiters of system robustness are statistical and, especially, out-of-
sample tests.

There are two major alternatives to traditional optimization: walk-forward opti-
mization and self-adaptive systems. Both of these techniques have the advantage
that any tests carried out are, from start to finish, effectively out-of-sample.

Examine the performance data, run some inferential statistics, plot the equity
curve, and the system is ready to be traded. Everything is clean and mathemati-
cally unimpeachable. Corrections for shrinkage or multiple tests, worries over
excessive curve-fitting, and many of the other concerns that plague traditional
optimization methodologies can be forgotten. Moreover, with today's modern
computer technology, walk-forward and self-adaptive models are practical and not
even difficult to implement.
The principle behind walk-forward optimization (also known as walk-for-
ward testing) is to emulate the steps involved in actually trading a system that
requires periodic optimization. It works like this: Optimize the system on the data
points 1 through M. Then simulate trading on data points M + 1 through M + K.
Reoptimize the system on data points K + 1 through K + M. Then simulate trad-
ing on points (K + M) + 1 through (K + M) + K. Advance through the data series
in this fashion until no more data points are left to analyze. As should be evident,
the system is optimized on a sample of historical data and then traded. After some
period of time, the system is reoptimized and trading is resumed. The sequence of
events guarantees that the data on which trades take place is always in the future
relative to the optimization process; all trades occur on what is, essentially, out-of-
sample data. In walk-forward testing, M is the look-back or optimization window
and K the reoptimization interval.
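The walk-forward procedure just described can be sketched in a few lines of Python. The `optimize` and `simulate` routines below are placeholders for whatever optimizer and trading simulator the developer actually uses; only the windowing logic is the point of the sketch.

```python
def walk_forward(data, M, K, optimize, simulate):
    """Walk-forward test: optimize on a window of M data points, then
    simulate trading on the next K points with the chosen parameters.

    `optimize(window)` returns the best parameters for that window;
    `simulate(segment, params)` returns the trades (or returns) produced.
    Both are supplied by the user; the names here are illustrative.
    """
    results = []
    start = 0
    # Advance through the series until no K out-of-sample points remain.
    while start + M + K <= len(data):
        in_sample = data[start:start + M]            # optimization window
        out_sample = data[start + M:start + M + K]   # traded out-of-sample
        params = optimize(in_sample)
        results.extend(simulate(out_sample, params))
        start += K                                    # reoptimization interval
    return results
```

Because every simulated trade falls after the window on which the parameters were fit, the concatenated results are, from start to finish, effectively out-of-sample.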
Self-adaptive systems work in a similar manner, except that the optimization
or adaptive process is part of the system, rather than the test environment. As each
bar or data point comes along, a self-adaptive system updates its internal state (its
parameters or rules) and then makes decisions concerning actions required on the
next bar or data point. When the next bar arrives, the decided-upon actions are car-
ried out and the process repeats. Internal updates, which are how the system learns
about or adapts to the market, need not occur on every single bar. They can be per-
formed at fixed intervals or whenever deemed necessary by the model.
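A minimal skeleton of such a self-adaptive system appears below. The class name, the moving-average "parameter," and the update interval are all illustrative assumptions, not a prescription; the point is only that the adaptive step lives inside the system, not in the test harness.

```python
class SelfAdaptiveSystem:
    """Skeleton of a self-adaptive system: the adaptation step is part of
    the system itself. All parameter choices here are illustrative."""

    def __init__(self, lookback=20, update_interval=5):
        self.lookback = lookback
        self.update_interval = update_interval
        self.threshold = 0.0   # internal parameter the system adapts
        self.bars_seen = 0

    def on_bar(self, history):
        """Called once per bar with all data up to and including the
        current bar; returns the action decided for the NEXT bar."""
        self.bars_seen += 1
        # The internal update need not occur on every bar; here it is
        # performed at fixed intervals, as the text suggests.
        if self.bars_seen % self.update_interval == 0 and len(history) >= self.lookback:
            window = history[-self.lookback:]
            self.threshold = sum(window) / len(window)  # re-estimate parameter
        return "long" if history[-1] > self.threshold else "flat"
```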
The trader planning to work with self-adapting systems will need a power-
ful, component-based development platform that employs a strong language, such
as C++, Object Pascal, or Visual Basic, and that provides good access to third-
party libraries and software components. Components are designed to be incorpo-
rated into user-written software, including the special-purpose software that
constitutes an adaptive system. The more components that are available, the less
work there is to do. At the very least, a trader venturing into self-adaptive systems
should have at hand genetic optimizer and trading simulator components that can
be easily embedded within a trading model. Adaptive systems will be demonstrat-
ed in later chapters, showing how this technique works in practice.
There is no doubt that walk-forward optimization and adaptive systems will
become more popular over time as the markets become more efficient and diffi-
cult to trade, and as commercial software packages become available that place
these techniques within reach of the average trader.

Aerodynamics, electronics, chemistry, biochemistry, planning, and business are
just a few of the fields in which optimization plays a role. Because optimization is
of interest to so many problem-solving areas, research goes on everywhere, infor-
mation is abundant, and optimization tools proliferate. Where can this information
be found? What tools and products are available?
Brute force optimizers are usually buried in software packages aimed pri-
marily at tasks other than optimization; they are usually not available on their own.
In the world of trading, products like TradeStation and SuperCharts from Omega
Research (800-292-3453), Excalibur from Futures Truth (828-697-0273), and
MetaStock from Equis International (800-882-3040) have built-in brute force opti-
mizers. If you write your own software, brute force optimization is so trivial to
implement using in-line programming code that the use of special libraries or
components is superfluous. Products and code able to carry out brute force opti-
mization may also serve well for user-guided optimization.
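To illustrate just how trivial brute force optimization is to implement, here is a generic sketch in Python. The objective function stands in for whatever fitness measure (e.g., the net profit of a simulated system) the developer cares about; the function and parameter names are illustrative.

```python
import itertools

def brute_force_optimize(objective, param_ranges):
    """Exhaustively evaluate every parameter combination and keep the best.

    `objective(params)` is any user-supplied fitness function;
    `param_ranges` maps parameter names to lists of candidate values.
    """
    best_params, best_score = None, float("-inf")
    names = list(param_ranges)
    # Iterate over the full Cartesian product of all parameter values.
    for combo in itertools.product(*(param_ranges[n] for n in names)):
        params = dict(zip(names, combo))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The same loop, written in whatever language the trading platform supports, is all that user-guided optimization requires as well; the user simply narrows the ranges by hand between runs.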
Although sometimes appearing as built-in tools in specialized programs,
genetic optimizers are more often distributed in the form of class libraries or soft-
ware components, add-ons to various application packages, or stand-alone research
instruments. As an example of a class library written with the component paradigm
in mind, consider OptEvolve, the C++ genetic optimizer from Scientific Consultant
Services (516-696-3333): This general-purpose genetic optimizer implements sev-
eral algorithms, including differential evolution, and is sold in the form of highly
portable C++ code that can be used in UNIX/LINUX, DOS, and Windows envi-
ronments. TS-Evolve, available from Ruggiero Associates (800-211-9785), gives
users of TradeStation the ability to perform full-blown genetic optimizations. The
Evolver, which can be purchased from Palisade Corporation (800-432-7475), is a
general-purpose genetic optimizer for Microsoft's Excel spreadsheet; it comes with
a dynamic link library (DLL) that can provide genetic optimization services to user
programs written in any language able to call DLL functions. GENESIS, a stand-
alone instrument aimed at the research community, was written by John Grefenstette
of the Naval Research Laboratory; the product is available in the form of generic C
source code. While genetic optimizers can occasionally be found in modeling tools
for chemists and in other specialized products, they do not yet form a native part of
popular software packages designed for traders.
Information about genetic optimization is readily available. Genetic algo-
rithms are discussed in many books, magazines, and journals and on Internet
newsgroups. A good overview of the field of genetic optimization can be found in
the Handbook of Genetic Algorithms (Davis, 1991). Price and Storn (1997)
described an algorithm for “differential evolution,” which has been shown to be an
exceptionally powerful technique for optimization problems involving real-valued
parameters. Genetic algorithms are currently the focus of many academic journals
and conference proceedings. Lively discussions on all aspects of genetic opti-
mization take place in several Internet newsgroups, of which comp.ai.genetic is the
most noteworthy.
A basic exposition of simulated annealing can be found in Numerical
Recipes in C (Press et al., 1992), as can C functions implementing optimizers for
both combinatorial and real-valued problems. Neural, Novel & Hybrid Algorithms
for Time Series Prediction (Masters, 1995) also discusses annealing-based opti-
mization and contains relevant C++ code on the included CD-ROM. Like genet-
ic optimization, simulated annealing is the focus of many research studies,
conference presentations, journal articles, and Internet newsgroup discussions.
Algorithms and code for conjugate gradient and variable metric optimiza-
tion, two fairly sophisticated analytic methods, can be found in Numerical Recipes
in C (Press et al., 1992) and Numerical Recipes (Press et al., 1986). Masters (1995)
provides an assortment of analytic optimization procedures in C++ (on the CD-
ROM that comes with his book), as well as a good discussion of the subject.
Additional procedures for analytic optimization are available in the IMSL and the
NAG library (from Visual Numerics, Inc., and Numerical Algorithms Group,
respectively) and in the optimization toolbox for MATLAB (a general-purpose
mathematical package from The MathWorks, 508-647-7000, that has gained pop-
ularity in the financial engineering community). Finally, Microsoft's Excel spread-
sheet contains a built-in analytic optimizer, the Solver, which employs conjugate
gradient or Newtonian methods.
As a source of general information about optimization applied to trading sys-
tem development, consult Design, Testing and Optimization of Trading Systems by
Robert Pardo (1992). Among other things, this book shows the reader how to opti-
mize profitably, how to avoid undesirable curve-fitting, and how to carry out walk-
forward tests.

At the very least, you should have available an optimizer that is designed to make
both brute force and user-guided optimization easy to carry out. Such an optimiz-
er is already at hand if you use either TradeStation or Excalibur for system devel-
opment tasks. On the other hand, if you develop your systems in Excel, Visual
Basic, C++, or Delphi, you will have to create your own brute force optimizer.
As demonstrated earlier, a brute force optimizer is simple to implement. For many
problems, brute force or user-guided optimization is the best approach.
If your system development efforts require something beyond brute force, a
genetic optimizer is a great second choice. Armed with both brute force and genet-
ic optimizers, you will be able to solve virtually any problem imaginable. In our
own efforts, we hardly ever reach for any other kind of optimization tool!
TradeStation users will probably want TS-Evolve from Ruggiero Associates. The
Evolver product from Palisade Corporation is a good choice for Excel and Visual
Basic users. If you develop systems in C++ or Delphi, select the C++ Genetic
Optimizer from Scientific Consultant Services, Inc. A genetic optimizer is the
Swiss Army knife of the optimizer world: Even problems more efficiently solved
using such other techniques as analytic optimization will yield, albeit more slowly,
to a good genetic optimizer.
Finally, if you want to explore analytic optimization or simulated annealing,
we suggest Numerical Recipes in C (Press et al., 1992) and Masters (1995) as
good sources of both information and code. Excel users can try out the built-in
Solver tool.


CHAPTER 4 Statistics

Many trading system developers have little familiarity with inferential statistics.
This is a rather perplexing state of affairs since statistics are essential to assessing
the behavior of trading systems. How, for example, can one judge whether an
apparent edge in the trades produced by a system is real or an artifact of sampling
or chance? Think of it: the next sample may not merely be another test, but an
actual trading exercise. If the system's “edge” was due to chance, trading capital
could quickly be depleted. Consider optimization: Has the system been tweaked
into great profitability, or has the developer only succeeded in the nasty art of
curve-fitting? We have encountered many system developers who refuse to use
any optimization strategy whatsoever because of their irrational fear of curve-fit-
ting, not knowing that the right statistics can help detect such phenomena. In short,
inferential statistics can help a trader evaluate the likelihood that a system is cap-
turing a real inefficiency and will perform as profitably in the future as it has in
the past. In this book, we have presented the results of statistical analyses when-
ever doing so seemed useful and appropriate.
Among the kinds of inferential statistics that are most useful to traders are
t-tests, correlational statistics, and such nonparametric statistics as the runs test.
T-tests are useful for determining the probability that the mean or sum of any
series of independent values (derived from a sampling process) is greater or less
than some other such mean, is a fixed number, or falls within a certain band. For
example, t-tests can reveal the probability that the total profits from a series of
trades, each with its individual profit/loss figure, could be greater than some thresh-
old as a result of chance or sampling. These tests are also useful for evaluating sam-
ples of returns, e.g., the daily or monthly returns of a portfolio over a period of
years. Finally, t-tests can help to set the boundaries of likely future performance
(assuming no structural change in the market), making possible such statements as
“the probability that the average profit will be between x and y in the future is
greater than 95%.”
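For instance, the t statistic behind such a test can be computed directly from a sample of trade profits. The function below is an illustrative sketch (the function name and interface are assumptions); the probability itself would then be read from a t distribution with the returned degrees of freedom.

```python
import math

def t_test_mean_profit(trades, threshold=0.0):
    """One-sample t statistic: could the mean trade profit exceed
    `threshold` merely by chance or sampling?

    Returns the t statistic and the degrees of freedom (n - 1).
    """
    n = len(trades)
    mean = sum(trades) / n
    # Sample variance (n - 1 in the denominator) and standard error.
    var = sum((x - mean) ** 2 for x in trades) / (n - 1)
    sem = math.sqrt(var / n)
    t = (mean - threshold) / sem
    return t, n - 1
```

A large t relative to the t distribution for n - 1 degrees of freedom means the observed mean profit is unlikely to be a fluke of sampling, assuming the trades are reasonably independent.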
Correlational statistics help determine the degree of relationship between
different variables. When applied inferentially, they may also be used to assess
whether any relationships found are “statistically significant,” and not merely due
to chance. Such statistics aid in setting confidence intervals or boundaries on the
“true” (population) correlation, given the observed correlation for a specific sam-
ple. Correlational statistics are essential when searching for predictive variables to
include in a neural network or regression-based trading model.
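As an illustration of setting such boundaries, a confidence interval for the population correlation can be obtained from a sample correlation via the Fisher z transform. The sketch below assumes approximate bivariate normality; the function name and the default 95% critical value are illustrative choices.

```python
import math

def correlation_confidence_interval(r, n, z_crit=1.96):
    """Confidence interval for the population correlation, given a
    sample correlation r from n paired observations, via the Fisher z
    transform (z_crit = 1.96 gives roughly 95% coverage)."""
    z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)             # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    # Back-transform the bounds to the correlation scale.
    back = lambda v: (math.exp(2 * v) - 1) / (math.exp(2 * v) + 1)
    return back(lo), back(hi)
```

If the interval excludes zero, the observed relationship is unlikely to be due to chance alone at the chosen confidence level.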
Correlational statistics, as well as such nonparametric statistics as the runs test,
are useful in assessing serial dependence or serial correlation. For instance, do prof-
itable trades come in streaks or runs that are then followed by periods of unprofitable
trading? The runs test can help determine whether this is actually occurring. If there
is serial dependence in a system, it is useful to know it because the system can then
be revised to make use of the serial dependence. For example, if a system has clearly
defined streaks of winning and losing, a metasystem can be developed. The meta-
system would take every trade after a winning trade until the first losing trade comes
along, then stop trading until a winning trade is hit, at which point it would again
begin taking trades. If there really are runs, this strategy, or something similar, could
greatly improve a system's behavior.

It is very important to determine whether any observed profits are real (not arti-
facts of testing), and what the likelihood is that the system producing them will
continue to yield profits in the future when it is used in actual trading. While out-
of-sample testing can provide some indication of whether a system will hold up on
new (future) data, statistical methods can provide additional information and esti-
mates of probability. Statistics can help determine whether a system's perfor-
mance is due to chance alone or if the trading model has some real validity.
Statistical calculations can even be adjusted for a known degree of curve-fitting,
thereby providing estimates of whether a chance pattern, present in the data sam-
ple being used to develop the system, has been curve-fitted or whether a pattern
present in the population (and hence one that would probably be present in future
samples drawn from the market being examined) has been modeled.
It should be noted that statistics generally make certain theoretical assumptions
about the data samples and populations to which they may be appropriately applied.
These assumptions are often violated when dealing with trading models. Some vio-
lations have little practical effect and may be ignored, while others may be worked
around. By using additional statistics, the more serious violations can sometimes be

detected, avoided, or compensated for; at the very least, they can be understood. In
short, we are fully aware of these violations and will discuss our acts of hubris and
their ramifications after a foundation for understanding the issues has been laid.

Fundamental to statistics and, therefore, important to understand, is the act of
sampling, which is the extraction of a number of data points or trades (a sample)
from a larger, abstractly defined set of data points or trades (a population). The
central idea behind statistical analysis is the use of samples to make inferences
about the populations from which they are drawn. When dealing with trading
models, the populations will most often be defined as all raw data (past, present,
and future) for a given tradable (e.g., all 5-minute bars on all futures on the S&P
500), all trades (past, present, and future) taken by a specified system on a given
tradable, or all yearly, monthly, or even daily returns. All quarterly earnings
(past, present, and future) of IBM is another example of a population. A sample
could be the specific historical data used in developing or testing a system, the
simulated trades taken, or monthly returns generated by the system on that data.
When creating a trading system, the developer usually draws a sample of
data from the population being modeled. For example, to develop an S&P 500 sys-
tem based on the hypothesis “If yesterday's close is greater than the close three
days ago, then the market will rise tomorrow,” the developer draws a sample of
end-of-day price data from the S&P 500 that extends back, e.g., 5 years. The hope
is that the data sample drawn from the S&P 500 is representative of that market,
i.e., will accurately reflect the actual, typical behavior of that market (the popula-
tion from which the sample was drawn), so that the system being developed will
perform as well in the future (on a previously unseen sample of population data)
as it did in the past (on the sample used as development data). To help determine
whether the system will hold up, developers sometimes test systems on one or
more out-of-sample periods, i.e., on additional samples of data that have not been
used to develop or optimize the trading model. In our example, the S&P 500 devel-
oper might use 5 years of data, e.g., 1991 through 1995, to develop and tweak
the system, and reserve the data from 1996 as the out-of-sample period on which
to test the system. Reserving one or more sets of out-of-sample data is strongly
recommended.
One problem with drawing data samples from financial populations arises
from the complex and variable nature of the markets: today's market may not be
tomorrow's. Sometimes the variations are very noticeable and their causes are
easily discerned, e.g., when the S&P 500 changed in 1983 as a result of the intro-
duction of futures and options. In such instances, the change may be construed as
having created two distinct populations: the S&P 500 prior to 1983 and the S&P
500 after 1983. A sample drawn from the earlier period would almost certainly
not be representative of the population defined by the later period because it was
drawn from a different population! This is, of course, an extreme case. More
often, structural market variations are due to subtle influences that are sometimes
impossible to identify, especially before the fact. In some cases, the market may
still be fundamentally the same, but it may be going through different phases;
each sample drawn might inadvertently be taken from a different phase and be
representative of that phase alone, not of the market as a whole. How can it be
determined that the population from which a sample is drawn for the purpose of
system development is the same as the population on which the system will be
traded? Short of hopping into a time machine and sampling the future, there is no
reliable way to tell if tomorrow will be the day the market undergoes a system-
killing metamorphosis! Multiple out-of-sample tests, conducted over a long peri-
od of time, may provide some assurance that a system will hold up, since they
may show that the market has not changed substantially across several sampling
periods. Given a representative sample, statistics can help make accurate infer-
ences about the population from which the sample was drawn. Statistics cannot,
however, reveal whether tomorrow's market will have changed in some funda-
mental manner.

Another issue found in trading system development is optimization, i.e., improv-
ing the performance of a system by adjusting its parameters until the system per-
forms its best on what the developer hopes is a representative sample. When the
system fails to hold up in the future (or on out-of-sample data), the optimization
process is pejoratively called curve-fitting. However, there is good curve-fitting
and bad curve-fitting. Good curve-fitting is when a model can be fit to the entire
relevant population (or, at least, to a sufficiently large sample thereof), suggesting
that valid characteristics of the entire population have been captured in the model.
Bad curve-fitting occurs when the system only fits chance characteristics, those
that are not necessarily representative of the population from which the sample
was drawn.
Developers are correct to fear bad curve-fitting, i.e., the situation in which
parameter values are adapted to the particular sample on which the system was
optimized, not to the population as a whole. If the sample was small or was not
representative of the population from which it was drawn, it is likely that the sys-
tem will look good on that one sample but fail miserably on another, or worse, lose
money in real-time trading. However, as the sample gets larger, the chance of this
happening becomes smaller: Bad curve-fitting declines and good curve-fitting
increases. All the statistics discussed reflect this, even the ones that specifically
concern optimization. It is true that the more combinations of things optimized,
the greater the likelihood good performance may be obtained by chance alone.
However, if the statistical result was sufficiently good, or the sample on which it
was based large enough to reduce the probability that the outcome was due to
chance, the result might still be very real and significant, even if many parameters
were optimized.
Some have argued that size does not matter, i.e., that sample size and the
number of trades studied have little or nothing to do with the risk of overopti-
mization, and that a large sample does not mitigate curve-fitting. This is patently
untrue, both intuitively and mathematically. Anyone would have less confidence in
a system that took only three or four trades over a 10-year period than in one that
took over 1,000 reasonably profitable trades. Think of a linear regression model in
which a straight line is being fit to a number of points. If there are only two points,
it is easy to fit the line perfectly every time, regardless of where the points are
located. If there are three points, it is harder. If there is a scatterplot of points, it is
going to be harder still, unless those points reveal some real characteristic of the
population that involves a linear relationship.
The linear regression example demonstrates that bad curve-fitting does
become more difficult as the sample size gets larger. Consider two trading sys-
tems: One system had a profit per trade of $100, it took 2 trades, and the stan-
dard deviation was $100 per trade: the other system took 1,000 trades, with
similar means and standard deviations. When evaluated statistically, the system
with 1,000 trades will be a lot more “statistically significant” than the one with
the 2 trades.
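The point is easy to verify. For a null hypothesis of zero mean profit, the t statistic grows with the square root of the number of trades, so the two hypothetical systems above, identical in their per-trade statistics, differ enormously in significance:

```python
import math

def t_statistic(mean, sd, n):
    """t statistic for the hypothesis that the true mean profit is zero,
    given the sample mean, standard deviation, and number of trades."""
    return mean / (sd / math.sqrt(n))

# The two hypothetical systems from the text: same mean ($100) and
# standard deviation ($100) per trade, very different trade counts.
t_small = t_statistic(100.0, 100.0, 2)     # about 1.41: easily chance
t_large = t_statistic(100.0, 100.0, 1000)  # about 31.6: far beyond chance
```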
In multiple linear regression models, as the number of regression parameters
(beta weights) being estimated is increased relative to the sample size, the amount
of curve-fitting increases and statistical significance lessens for the same
degree of model fit. In other words, the greater the degree of curve-fitting, the
harder it is to get statistical significance. The exception is if the improvement in fit
when adding regressors is sufficient to compensate for the loss in significance due
to the additional parameters being estimated. In fact, an estimate of shrinkage (the
degree to which the multiple correlation can be expected to shrink when computed
using out-of-sample data) can even be calculated given sample size and number of
regressors: Shrinkage increases with regressors and decreases with sample size. In
short, there is mathematical evidence that curve-fitting to chance characteristics of
a sample, with concomitant poor generalization, is more likely if the sample is
small relative to the number of parameters being fit by the model. In fact, as n (the
sample size) goes to infinity, the probability that the curve-fitting (achieved by
optimizing a set of parameters) is nonrepresentative of the population goes to zero.
The larger the number of parameters being optimized, the larger the sample
required. In the language of statistics, the parameters being estimated use up the
available “degrees of freedom.”
All this leads to the conclusion that the larger the sample, the more likely its
“curves” are representative of characteristics of the market as a whole. A small
sample almost certainly will be nonrepresentative of the market: It is unlikely that
its curves will reflect those of the entire market that persist over time. Any model
built using a small sample will be capitalizing purely on the chance of sampling.
Whether curve-fitting is “good” or “bad” depends on whether it was done to chance or
to real market patterns, which, in turn, largely depends on the size and representa-
tiveness of the sample. Statistics are useful because they make it possible to take
curve-fitting into account when evaluating a system.
When dealing with neural networks, concerns about overtraining or general-
ization are tantamount to concerns about bad curve-fitting. If the sample is large
enough and representative, curve-fitting some real characteristic of the market is
more likely, which may be good because the model should fit the market. On the
other hand, if the sample is small, the model will almost certainly be fit to pecu-
liar characteristics of the sample and not to the behavior of the market generally.
In neural networks, the concern about whether the neural network will generalize
is the same as the concern about whether other kinds of systems will hold up in
the future. To a great extent, generalization depends on the size of the sample on
which the neural network is trained. The larger the sample, or the smaller the num-
ber of connection weights (parameters) being estimated, the more likely the net-
work will generalize. Again, this can be demonstrated mathematically by
examining simple cases.
As was the case with regression, an estimate of shrinkage (the opposite of
generalization) may be computed when developing neural networks. In a very real
sense, a neural network is actually a multiple regression, albeit nonlinear, and the
correlation of a neural net's output with the target may be construed as a multiple
correlation coefficient. The multiple correlation obtained between a net's output
and the target may be corrected for shrinkage to obtain some idea of how the net
might perform on out-of-sample data. Such shrinkage-corrected multiple correla-
tions should routinely be computed as a means of determining whether a network
has merely curve-fit the data or has discovered something useful. The formula for
correcting a multiple correlation for shrinkage is as follows:

RC = SQRT( 1.0 - ((N - 1.0) / (N - P - 1.0)) * (1.0 - R*R) )

A FORTRAN-style expression was used for reasons of typesetting. In this for-
mula, SQRT represents the square root operator; N is the number of data points
or, in the case of neural networks, facts; P is the number of regression coeffi-
cients or, in the case of neural networks, connection weights; R represents the
uncorrected multiple correlation; and RC is the multiple correlation corrected
for shrinkage. Although this formula is strictly applicable only to linear multi-
ple regression (for which it was originally developed), it works well with neur-
al networks and may be used to estimate how much performance was inflated on
the in-sample data due to curve-fitting. The formula expresses a relationship
between sample size, number of parameters, and deterioration of results. The
statistical correction embodied in the shrinkage formula is used in the chapter on
neural network entry models.
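Assuming the standard adjusted-correlation formula the text describes (the same one used in linear multiple regression), the shrinkage correction is a one-liner; the function name is illustrative, and the clamp to zero handles the degenerate case where the correction drives the squared correlation negative.

```python
import math

def shrinkage_corrected_r(r, n, p):
    """Shrinkage-corrected multiple correlation.

    r -- uncorrected multiple correlation
    n -- number of data points (facts, for a neural network)
    p -- number of regression coefficients (connection weights)
    """
    rc_sq = 1.0 - ((n - 1.0) / (n - p - 1.0)) * (1.0 - r * r)
    # A negative corrected square means the fit is no better than chance.
    return math.sqrt(rc_sq) if rc_sq > 0 else 0.0
```

Note how the correction behaves exactly as the text predicts: for a fixed r, shrinkage grows as p approaches n (more curve-fitting) and vanishes as n becomes large.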

Although, for statistical reasons, the system developer should seek the largest sam-
ple possible, there is a trade-off between sample size and representativeness when
dealing with the financial markets. Larger samples mean samples that go farther
back in time, which is a problem because the market of years ago may be funda-
mentally different from the market of today. Remember the S&P 500 in 1983?
This means that a larger sample may sometimes be a less representative sample,
or one that confounds several distinct populations of data! Therefore, keep in mind
that, although the goal is to have the largest sample possible, it is equally impor-
tant to try to make sure the period from which the sample is drawn is still repre-
sentative of the market being predicted.

Now that some of the basics are out of the way, let us look at how statistics are
used when developing and evaluating a trading system. The examples below
employ a system that was optimized on one sample of data (the in-sample data)
and then run (tested) on another sample of data (the out-of-sample data). The out-
of-sample evaluation of this system will be discussed before the in-sample one
because the statistical analysis was simpler for the former (which is equivalent to
the evaluation of an unoptimized trading system) in that no corrections for mul-
tiple tests or optimization were required. The system is a lunar model that trades
the S&P 500; it was published in an article we wrote (see Katz with McCormick,
June 1997), which also presented the TradeStation code for the system.

Example 1: Evaluating the Out-of-Sample Test
Evaluating an optimized system on a set of out-of-sample data that was never used
during the optimization process is identical to evaluating an unoptimized system.
In both cases, one test is run without adjusting any parameters. Table 4-1 illus-
trates the use of statistics to evaluate an unoptimized system: It contains the out-
of-sample or verification results together with a variety of statistics. Remember, in
this test, a fresh set of data was used; this data was not used as the basis for
adjustments in the system's parameters.
The parameters of the trading model have already been set. A sample of data
was drawn from a period in the past, in this specific case, 1/1/95 through 1/1/97;
this is the out-of-sample or verification data. The model was then run on this out-
of-sample data, and it generated simulated trades. Forty-seven trades were taken.
This set of trades can itself be considered a sample of trades, one drawn from the
population of all trades that the system took in the past or will take in the future;
i.e., it is a sample of trades taken from the universe or population of all trades for
that system. At this point, some inference must be made regarding the average
profit per trade in the population as a whole, based on the sample of trades. Could
the performance obtained in the sample be due to chance alone? To find the
answer, the system must be statistically evaluated.
To begin statistically evaluating this system, the sample mean (average) for
n (the number of trades or sample size) must first be calculated. The mean is
simply the sum of the profit/loss figures for the trades generated divided by n (in
this case, 47). The sample mean was $974.47 per trade. The standard deviation
(the variability in the trade profit/loss figures) is then computed by subtracting
the sample mean from each of the profit/loss numbers for all 47 trades in the
sample; this results in 47 (n) deviations. Each of the deviations is then squared,
and then all squared deviations are added together. The sum of the squared devi-
ations is divided by n - 1 (in this case, 46). By taking the square root of the
resultant number (the mean squared deviation), the sample standard deviation is
obtained. Using the sample standard deviation, the expected standard deviation
of the mean is computed: The sample standard deviation (in this case, $6,091.10)
is divided by the square root of the sample size. For this example, the expected
standard deviation of the mean was $888.48.
To determine the likelihood that the observed profitability is due to chance
alone, a simple t-test is calculated. Since the sample profitability is being compared
with no profitability, zero is subtracted from the sample mean trade profit/loss (com-
puted earlier). The resultant number is then divided by the expected standard devi-
ation of the mean to obtain the value of the t-statistic, which in this case worked out to be 1.0968.
Finally the probability of getting such a large t-statistic by chance alone (under the
assumption that the system was not profitable in the population from which the sam-
ple was drawn) is calculated: The cumulative t-distribution for that t-statistic is com-
puted with the appropriate degrees of freedom, which in this case was n - 1, or 46.
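The arithmetic just described can be reproduced from the summary statistics alone. The sketch below (in Python rather than TradeStation, purely for illustration) uses the sample figures quoted above:

```python
import math

# Summary figures for the 47 out-of-sample trades, as quoted in the text
n = 47                # sample size (number of trades)
mean_pl = 974.4681    # sample mean profit/loss per trade, in dollars
sd_pl = 6091.1028     # sample standard deviation of the trade profit/losses

# Expected standard deviation of the mean (the standard error)
sem = sd_pl / math.sqrt(n)

# t-statistic for the null hypothesis of zero mean profit in the population
t_stat = (mean_pl - 0.0) / sem
df = n - 1  # degrees of freedom for the cumulative t-distribution

print(round(sem, 2))     # ~888.48
print(round(t_stat, 4))  # ~1.0968
```

The probability (significance) then comes from the cumulative t-distribution with df degrees of freedom, as described in the text.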


TABLE 4-1

Trades from the S&P 500 Data Sample on Which the Lunar Model Was Verified

Statistical Analyses of Mean Profit/Loss

Sample Size                       47.0000
Sample Mean                       974.4681
Sample Standard Deviation         6091.1028
Expected SD of Mean               888.48
T Statistic (P/L > 0)             1.0968
Probability (Significance)        0.1392

Serial Correlation (lag=1)        0.2120
Associated T Statistic            1.4301
Probability (Significance)        0.1572

Number of Wins                    16
Percentage of Wins                0.3404
Upper 99% Bound                   0.5319
Lower 99% Bound                   0.1702

Entry dates, exit dates, and individual profit/loss figures for the 47 trades are
not shown in the table.

(Microsoft's Excel spreadsheet provides a function to obtain probabilities based on
the t-distribution. Numerical Recipes in C provides the incomplete beta function,
which is very easily used to calculate probabilities based on a variety of distribu-
tions, including Student's t.) The cumulative t-distribution calculation yields a figure
that represents the probability that the results obtained from the trading system were
due to chance. Since this figure was small, it is unlikely that the results were due to
capitalization on random features of the sample. The smaller the number, the more
likely the system performed the way it did for reasons other than chance. In this
instance, the probability was 0.1392; i.e., if a system with a true (population) profit
of $0 was repeatedly tested on independent samples, only about 14% of the time
would it show a profit as high as that actually observed.

FIGURE 4-1 Frequency and Cumulative Distribution for In-Sample Trades
Although the t-test was, in this example, calculated for a sample of trade prof-
it/loss figures, it could just as easily have been computed for a sample of daily
returns. Daily returns were employed in this way to calculate the probabilities
referred to in discussions of the substantive tests that appear in later chapters. In
fact, the annualized risk-to-reward ratio (ARRR) that appears in many of the tables
and discussions is nothing more than a rescaled t-statistic based on daily returns.
Finally, a confidence interval on the probability of winning is estimated. In
the example, there were 16 wins in a sample of 47 trades, which yielded a per-
centage of wins equal to 0.3404. Using a particular inverse of the cumulative bino-
mial distribution, upper 99% and lower 99% boundaries are calculated. There is a
99% probability that the percentage of wins in the population as a whole is
between 0.1702 and 0.5319. In Excel, the CRITBINOM function may be used in
the calculation of confidence intervals on percentages.
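A sketch of such a confidence interval, using the exact (Clopper-Pearson) construction rather than Excel's CRITBINOM (the function names are ours), might be:

```python
from math import comb

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, conf=0.99):
    """Exact (Clopper-Pearson) confidence bounds on a win percentage,
    found by bisection on the binomial tail probabilities."""
    alpha = 1 - conf
    # Lower bound: the p at which P(X >= x | p) equals alpha/2
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - binom_cdf(x - 1, n, mid) < alpha / 2:
            lo = mid
        else:
            hi = mid
    lower = lo if x > 0 else 0.0
    # Upper bound: the p at which P(X <= x | p) equals alpha/2
    lo2, hi2 = 0.0, 1.0
    for _ in range(60):
        mid = (lo2 + hi2) / 2
        if binom_cdf(x, n, mid) > alpha / 2:
            lo2 = mid
        else:
            hi2 = mid
    upper = hi2 if x < n else 1.0
    return lower, upper

# 16 wins in 47 trades, 99% confidence
low, high = clopper_pearson(16, 47, 0.99)
print(round(low, 4), round(high, 4))
```

The exact construction is conservative, so its bounds may differ slightly from those produced by an inverse-binomial spreadsheet function, but they bracket the same win percentage.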
The various statistics and probabilities computed above should provide the
system developer with important information regarding the behavior of the trad-
ing model; that is, if the assumptions of normality and independence are met and

if the sample is representative. Most likely, however, the assumptions underlying
the t-tests and other statistics are violated; market data deviates seriously from the
normal distribution, and trades are usually not independent. In addition, the sam-
ple might not be representative. Does this mean that the statistical evaluation just
discussed is worthless? Let's consider the cases.

What if the Distribution Is Not Normal? An assumption in the t-test is that the
underlying distribution of the data is normal. However, the distribution of
profit/loss figures of a trading system is anything but normal, especially if there
are stops and profit targets, as can be seen in Figure 4-1, which shows the distrib-
ution of profits and losses for trades taken by the lunar system. Think of it for a
moment. Rarely will a profit greater than the profit target occur. In fact, a lot
of trades are going to bunch up with a profit equal to that of the profit target. Other
trades are going to bunch up where the stop loss is set, with losses equal to that;
and there will be trades that will fall somewhere in between, depending on the exit
method. The shape of the distribution will not be that of the bell curve that describes
the normal distribution. This is a violation of one of the assumptions underlying the
t-test. In this case, however, the Central Limit Theorem comes to the rescue. It states
that as the number of cases in the sample increases, the distribution of the sample
mean approaches normal. By the time there is a sample size of 10, the errors result-
ing from the violation of the normality assumption will be small, and with sample
sizes greater than 20 or 30, they will have little practical significance for inferences
regarding the mean. Consequently, many statistics can be applied with reasonable
assurance that the results will be meaningful, as long as the sample size is adequate,
as was the case in the example above, which had an n of 47.

What if There Is Serial Dependence? A more serious violation, which makes
the above-described application of the t-test not quite cricket, is serial depen-
dence, which is when cases constituting a sample (e.g., trades) are not statistical-
ly independent of one another. Trades come from a time series. When a series of
trades that occurred over a given span of dates is used as a sample, it is not quite
a random sample. A truly random sample would mean that the 100 trades were
randomly taken from the period when the contract for the market started (e.g.,
1983 for the S&P 500) to far into the future; such a sample would not only be less
likely to suffer from serial dependence, but be more representative of the popula-
tion from which it was drawn. However, when developing trading systems, sam-
pling is usually done from one narrow point in time; consequently, each trade may
be correlated with those adjacent to it and so would not be independent.
The practical effect of this statistically is to reduce the effective sample size.
When trying to make inferences, if there is substantial serial dependence, it may
be as if the sample contained only half or even one-fourth of the actual number of
trades or data points observed. To top it off, the extent of serial dependence can-
not definitively be determined. A rough “guestimate,” however, can be made. One
such guestimate may be obtained by computing a simple lag/lead serial correla-
tion: A correlation is computed between the profit and loss for Trade i and the
profit and loss for Trade i + 1, with i ranging from 1 to n - 1. In the example, the
serial correlation was 0.2120, not very high, but a lower number would be prefer-
able. An associated t-statistic may then be calculated along with a statistical sig-
nificance for the correlation. In the current case, these statistics reveal that if there
really were no serial correlation in the population, a correlation as large as the one
obtained from the sample would only occur in about 16% of such tests.
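This lag-1 correlation and its associated t-statistic can be sketched as follows (the helper names are ours):

```python
import math

def lag1_serial_correlation(x):
    """Pearson correlation between the series (x[0..n-2]) and (x[1..n-1]),
    i.e., between the profit/loss of Trade i and Trade i + 1."""
    a, b = x[:-1], x[1:]
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = math.sqrt(sum((u - ma) ** 2 for u in a))
    vb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (va * vb)

def corr_t_stat(r, m):
    """Standard t statistic for a correlation r computed from m pairs
    (df = m - 2); assumes |r| < 1."""
    return r * math.sqrt((m - 2) / (1 - r * r))
```

Feeding the resulting t-statistic into the cumulative t-distribution with m - 2 degrees of freedom gives the significance of the serial correlation.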
Serial dependence is a serious problem. If there is a substantial amount of it,
it would need to be compensated for by treating the sample as if it were smaller
than it actually is. Another way to deal with the effect of serial dependence is to
draw a random sample of trades from a larger sample of trades computed over a
longer period of time. This would also tend to make the sample of trades more rep-
resentative of the population.

What if the Markets Change? When developing trading systems, a third assump-
tion of the t-test may be inadvertently violated. There are no precautions that can
be taken to prevent it from happening or to compensate for its occurrence. The rea-
son is that the population from which the development or verification sample was
drawn may be different from the population from which future trades may be taken.
This would happen if the market underwent some real structural or other change.
As mentioned before, the population of trades of a system operating on the S&P
500 before 1983 would be different from the population after that year since, in
1983, the options and futures started trading on the S&P 500 and the market
changed. This sort of thing can devastate any method of evaluating a trading sys-
tem. No matter how much a system is back-tested, if the market changes before
trading begins, the trades will not be taken from the same market for which the sys-
tem was developed and tested; the system will fall apart. All systems, even cur-
rently profitable ones, will eventually succumb to market change. Regardless of the
market, change is inevitable. It is just a question of when it will happen. Despite
this grim fact, the use of statistics to evaluate systems remains essential, because if
the market does not change substantially shortly after trading of the system com-
mences, or if the change is not sufficient to grossly affect the system's performance,
then a reasonable estimate of expected probabilities and returns can be calculated.

Example 2: Evaluating the In-Sample Tests
How can a system that has been fit to a data sample by the repeated adjustment of
parameters (i.e., an optimized system) be evaluated? Traders frequently optimize
systems to obtain good results. In this instance, the use of statistics is more impor-
tant than ever since the results can be analyzed, compensating for the multiplicity
of tests being performed as part of the process of optimization. Table 4-2 contains
the profit/loss figures and a variety of statistics for the in-sample trades (those
taken on the data sample used to optimize the system). The system was optimized
on data from l/1/90 through l/2/95.
Most of the statistics in Table 4-2 are identical to those in Table 4-1, which
was associated with Example 1. Two additional statistics (that differ from those in
the first example) are labeled "Optimization Tests Run" and "Adjusted for
Optimization.” The first statistic is simply the number of different parameter com-
binations tried, i.e., the total number of times the system was run on the data, each
time using a different set of parameters. Since the lunar system parameter, L1, was
stepped from 1 to 20 in increments of 1, 20 tests were performed; consequently,
there were 20 t-statistics, one for each test. The number of tests run is used to make
an adjustment to the probability or significance obtained from the best t-statistic


TABLE 4-2

Trades from the S&P 500 Data Sample on Which the Lunar Model Was Optimized

Statistical Analyses of Mean Profit/Loss

Sample Size                       118.0000
Sample Mean                       740.97
Sample Standard Deviation         3811 (approximate)
T Statistic (P/L > 0)             2.1118
Probability (Significance)        0.0184
Optimization Tests Run            20
Adjusted for Optimization         0.3104

Serial Correlation (lag=1)        0.0479
Probability (Significance)        0.6083

Number of Wins                    58
Percentage of Wins                0.49 (approximate)
Upper 99% Bound                   0.61 (approximate)
Lower 99% Bound                   0.37 (approximate)

Entry dates, exit dates, and individual profit/loss figures for the 118 trades are
not shown in the table.
computed on the sample: Take 1, and subtract from it the statistical significance
obtained for the best-performing test. Take the resultant number and raise it to the
mth power (where m = the number of tests run). Then subtract that number from
1. This provides the probability of finding, in a sample of m tests (in this case, 20),
at least one t-statistic as good as the one actually obtained for the optimized solu-
tion. The uncorrected probability that the profits observed for the best solution were
due to chance was less than 2%, a fairly significant result. Once adjusted for mul-
tiple tests, i.e., optimization, the statistical significance does not appear anywhere
near as good. Results at the level of those observed could have been obtained for
such an optimized system 31% of the time by chance alone. However, things are
not quite as bad as they seem. The adjustment was extremely conservative and
assumed that every test was completely independent of every other test. In actual
fact, there will be a high serial correlation between most tests since, in many trad-
ing systems, small changes in the parameters produce relatively small changes in
the results. This is exactly like serial dependence in data samples: It reduces the
effective population size, in this case, the effective number of tests run. Because
many of the tests are correlated, the 20 actual tests probably correspond to about 5
to 10 independent tests. If the serial dependence among tests is considered, the
adjusted-for-optimization probability would most likely be around 0.15, instead of
the 0.3104 actually calculated. The nature and extent of serial dependence in the
multiple tests are never known, and therefore, a less conservative adjustment for
optimization cannot be directly calculated, only roughly reckoned.
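The adjustment described above (sometimes called a Šidák correction for multiple tests) is a one-liner, shown here with the figures from the text:

```python
# Probability that the best of m independent tests looks this good by
# chance, given the best single-test significance p_best.
p_best = 0.0184   # significance of the best optimization run (from the text)
m = 20            # number of parameter combinations tested

p_adjusted = 1.0 - (1.0 - p_best) ** m
print(round(p_adjusted, 2))  # ~0.31
```

If the tests are serially correlated, m can be replaced with a rough guess at the effective number of independent tests (here, perhaps 5 to 10) to obtain a less conservative figure.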
Under certain circumstances, such as in multiple regression models, there are
exact mathematical formulas for calculating statistics that incorporate the fact that
parameters are being fit, i.e., that optimization is occurring, making corrections for
optimization unnecessary.

Interpreting the Example Statistics
In Example 1, the verification test was presented. The in-sample optimization run
was presented in Example 2. In the discussion of results, we are returning to the nat-
ural order in which the tests were run, i.e., optimization first, verification second.

Optimization Results. Table 4-2 shows the results for the in-sample period. Over
the 5 years of data on which the system was optimized, there were 118 trades (n
= 118). The mean or average trade yielded about $740.97, and the trades were
highly variable, with a sample standard deviation of around ±$3,811; i.e., there
were many trades that lost several thousand dollars, as well as trades that made
many thousands. The degree of profitability can easily be seen by looking at the
profit/loss column, which contains many $2,500 losses (the stop got hit) and a sig-
nificant number of wins, many greater than $5,000, some even greater than
$10,000. The expected standard deviation of the mean suggests that if samples of
this kind were repeatedly taken, the mean would vary only about one-tenth as
much as the individual trades, and that many of the samples would have mean
profitabilities in the range of $740 ± $350.
The t-statistic for the best-performing system from the set of optimization
runs was 2.1118, which has a statistical significance of 0.0184. This was a fairly
strong result. If only one test had been run (no optimizing), this good a result would
have been obtained (by chance alone) only twice in 100 tests, indicating that the
system is probably capturing some real market inefficiency and has some chance of
holding up. However, be warned: This analysis was for the best of 20 sets of para-
meter values tested. If corrected for the fact that 20 combinations of parameter val-
ues were tested, the adjusted statistical significance would only be about 0.31, not
very good; the performance of the system could easily have been due to chance.
Therefore, although the system may hold up, it could also, rather easily, fail.
The serial correlation between trades was only 0.0479, a value small enough
in the present context, with a significance of only 0.6083. These results strongly
suggest that there was no meaningful serial correlation between trades and that the
statistical analyses discussed above are likely to be correct.
There were 58 winning trades in the sample, which represents about a 49%
win rate. The upper 99% confidence boundary was approximately 61% and the
lower 99% confidence boundary was approximately 37%, suggesting that the true
percentage of wins in the population has a 99% likelihood of being found between
those two values. In truth, the confidence region should have been broadened by
correcting for optimization; this was not done because we were not very con-
cerned about the percentage of wins.

Verification Results. Table 4-1, presented earlier, contains the data and statistics
for the out-of-sample test for the model. Since all parameters were already fixed,
and only one test was conducted, there was no need to consider optimization or its
consequences in any manner. In the period from 1/1/95 to 1/1/97, there were 47
trades. The average trade in this sample yielded about $974, which is a greater
average profit per trade than in the optimization sample! The system apparently
did maintain profitable behavior.
At slightly over $6,000, the sample standard deviation was almost double
that of the standard deviation in the optimization sample. Consequently, the stan-
dard deviation of the sample mean was around $890, a fairly large standard error
of estimate; together with the small sample size, this yielded a lower t-statistic
than found in the optimization sample and, therefore, a lowered statistical signifi-
cance of only about 14%. These results were neither very good nor very bad:
There is better than an 80% chance that the system is capitalizing on some real
(non-chance) market inefficiency. The serial correlation in the test sample, however,
was quite a bit higher than in the optimization sample and was significant, with a
probability of 0.1572; i.e., as large a serial correlation as this would only be
expected about 16% of the time by chance alone, if no true (population) serial cor-
relation was present. Consequently, the t-test on the profit/loss figures has likely

overstated the statistical significance to some degree (maybe between 20 and
30%). If the sample size was adjusted downward the right amount, the t-test prob-
ability would most likely be around 0.18, instead of the 0.1392 that was calculat-
ed. The confidence interval for the percentage of wins in the population ranged
from about 17% to about 53%.
Overall, the assessment is that the system is probably going to hold up in the
future, but not with a high degree of certainty. Considering there were two inde-
pendent tests, one showing about a 31% probability (corrected for optimization)
that the profits were due to chance, the other showing a statistical significance of
approximately 14% (corrected to 18% due to the serial correlation), there is a good
chance that the average population trade is profitable and, consequently, that the
system will remain profitable in the future.

The following section is intended only to acquaint the reader with some other sta-
tistical techniques that are available. We strongly suggest that a more thorough study
be undertaken by those serious about developing and evaluating trading systems.

Genetically Evolved Systems
We develop many systems using genetic algorithms. A popular fitness function (cri-
terion used to determine whether a model is producing the desired outcome) is the
total net profit of the system. However, net profit is not the best measure of system
quality! A system that only trades the major crashes on the S&P 500 will yield a
very high total net profit with a very high percentage of winning trades. But who
knows if such a system would hold up? Intuitively, if the system only took two or
three trades in 10 years, the probability seems very low that it would continue to
perform well in the future or even take any more trades. Part of the problem is that
net profit does not consider the number of trades taken or their variability.
An alternative fitness function that avoids some of the problems associated
with net profit is the t-statistic or its associated probability. When using the t-sta-
tistic as a fitness function, instead of merely trying to evolve the most profitable
systems, the intention is to genetically evolve systems that have the greatest like-
lihood of being profitable in the future or, equivalently, that have the least likeli-
hood of being profitable merely due to chance or curve-fitting. This approach
works fairly well. The t-statistic factors in profitability, sample size, and number
of trades taken. All things being equal, the greater the number of trades a system
takes, the greater the t-statistic and the more likely it will hold up in the future.
Likewise, systems that produce more consistently profitable trades with less vari-
ation are more desirable than systems that produce wildly varying trades and will
yield higher t-statistic values. The t-statistic incorporates many of the features that
define the quality of a trading model into one number that can be maximized by a
genetic algorithm.
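As an illustration, a t-statistic fitness function for a genetic optimizer might be sketched as follows (in Python rather than any particular GA package; the function name is ours):

```python
import math

def t_stat_fitness(trade_pls):
    """Fitness for a genetic optimizer: the t-statistic of the trade
    profit/loss figures against a population mean of zero. Rewards
    profitability, consistency, and number of trades all at once.
    Returns 0.0 for degenerate (zero-variance) trade lists."""
    n = len(trade_pls)
    mean = sum(trade_pls) / n
    var = sum((x - mean) ** 2 for x in trade_pls) / (n - 1)
    sem = math.sqrt(var / n)  # standard error of the mean
    return mean / sem if sem > 0 else 0.0
```

A genetic algorithm would simply maximize this value over candidate rule sets; higher values mean performance less likely to be attributable to chance.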

Multiple Regression
Another statistical technique frequently used is multiple regression. Consider
intermarket analysis: The purpose of intermarket analysis is to find measures of
behaviors in other markets that are predictive of the future behavior of the market
being studied. Running various regressions is an appropriate technique for ana-
lyzing such potential relationships; moreover, there are excellent statistics to use
for testing and setting confidence intervals on the correlations and regression
(beta) weights generated by the analyses. Due to lack of space and the limited
scope of this chapter, no examples are presented, but the reader is referred to
Myers (1986), a good basic text on multiple regression.
A problem with most textbooks on multiple regression analysis (including
the one just mentioned) is that they do not deal with the issue of serial correlation
in time series data, and its effect on the statistical inferences that can be made from
regression analyses using such data. The reader will need to take the effects of
serial correlation into account: Serial correlation in a data sample has the effect of
reducing the effective sample size, and statistics can be adjusted (at least in a
rough-and-ready manner) based on this effect. Another trick that can be used in
some cases is to perform some transformations on the original data series to make
the time series more “stationary” and to remove the unwanted serial correlations.

Monte Carlo Simulations
One powerful, unique approach to making statistical inferences is known as the
Monte Carlo Simulation, which involves repeated tests on synthetic data that are
constructed to have the properties of samples taken from a random population.
Except for randomness, the synthetic data are constructed to have the basic char-
acteristics of the population from which the real sample was drawn and about
which inferences must be made. This is a very powerful method. The beauty of
Monte Carlo Simulations is that they can be performed in a way that avoids the
dangers of assumptions (such as that of the normal distribution) being violated,
which would lead to untrustworthy results.
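One common variant, sketched below, re-centers the observed trades to build a zero-mean null population, then resamples from it to see how often chance alone produces a mean as large as the one observed (the names and details here are our own, not a method from the text):

```python
import random

def monte_carlo_p_value(trade_pls, n_sims=10_000, seed=1):
    """Monte Carlo test sketch: estimate how often a zero-mean synthetic
    sample (the real trades re-centered, then resampled with replacement)
    shows a mean at least as large as the observed mean."""
    rng = random.Random(seed)
    n = len(trade_pls)
    observed = sum(trade_pls) / n
    centered = [x - observed for x in trade_pls]  # null: population mean = 0
    hits = 0
    for _ in range(n_sims):
        sample = [rng.choice(centered) for _ in range(n)]
        if sum(sample) / n >= observed:
            hits += 1
    return hits / n_sims
```

Because the synthetic samples are built from the actual trade distribution, no normality assumption is needed; the price paid is simulation time.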

Out-of-Sample Testing
Another way to evaluate a system is to perform out-of-sample testing. Several time
periods are reserved to test a model that has been developed or optimized on some
other time period. Out-of-sample testing helps determine how the model behaves

on data it had not seen during optimization or development. This approach is
strongly recommended. In fact, in the examples discussed above, both in-sample
and out-of-sample tests were analyzed. No corrections to the statistics for the
process of optimization are necessary in out-of-sample testing. Out-of-sample and
multiple-sample tests may also provide some information on whether the market
has changed its behavior over various periods of time.

Walk-Forward Testing
In walk-forward testing, a system is optimized on several years of data and then
traded the next year. The system is then reoptimized on several more years of data,
moving the window forward to include the year just traded. The system is then
traded for another year. This process is repeated again and again, “walking for-
ward” through the data series. Although very computationally intensive, this is an
excellent way to study and test a trading system. In a sense, even though opti-
mization is occurring, all trades are taken on what is essentially out-of-sample test
data. All the statistics discussed above, such as the t-tests, can be used on walk-
forward test results in a simple manner that does not require any corrections for
optimization. In addition, the tests will very closely simulate the process that
occurs during real trading: first optimization occurs, next the system is traded on
data not used during the optimization, and then every so often the system is reop-
timized to update it. Sophisticated developers can build the optimization process
into the system, producing what might be called an “adaptive” trading model.
Meyers (1997) wrote an article illustrating the process of walk-forward testing.
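The walk-forward loop itself is simple to sketch (the optimize and trade callables are hypothetical placeholders for a real optimizer and trading simulator):

```python
def walk_forward(data_years, train_len, optimize, trade):
    """Sketch of walk-forward testing. `optimize(window)` returns the best
    parameters for a training window of years; `trade(year, params)` returns
    the out-of-sample trades for one year using those parameters. All trades
    collected here are effectively out-of-sample."""
    all_trades = []
    for i in range(train_len, len(data_years)):
        params = optimize(data_years[i - train_len:i])   # fit trailing window
        all_trades.extend(trade(data_years[i], params))  # trade the next year
    return all_trades
```

The pooled trades can then be analyzed with the t-tests discussed earlier, with no correction for optimization required.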

In the course of developing trading systems, statistics help the trader quickly reject
models exhibiting behavior that could have been due to chance or to excessive
curve-fitting on an inadequately sized sample. Probabilities can be estimated, and
if it is found that there is only a very small probability that a model™s performance
could be due to chance alone, then the trader can feel more confident when actu-
ally trading the model.
There are many ways for the trader to use and calculate statistics. The cen-
tral theme is the attempt to make inferences about a population on the basis of
samples drawn from that population.
Keep in mind that when using statistics on the kinds of data faced by traders,
certain assumptions will be violated. For practical purposes, some of the violations
may not be too critical; thanks to the Central Limit Theorem, data that are not nor-
mally distributed can usually be analyzed adequately for most needs. Other viola-
tions that are more serious (e.g., ones involving serial dependence) do need to be
taken into account, but rough-and-ready rules may be used to reckon corrections
to the probabilities. The bottom line: It is better to operate with some information,
even knowing that some assumptions may be violated, than to operate blindly.
We have glossed over many of the details, definitions, and reasons behind the
statistics discussed above. Again, the intention was merely to acquaint the reader
with some of the more frequently used applications. We suggest that any commit-
ted trader obtain and study some good basic texts on statistical techniques.

The Study of Entries

In this section, various entry methods are systematically evaluated. The focus is
on which techniques provide good entries and which do not. A good entry is
important because it can reduce exposure to risk and increase the likelihood that a
trade will be profitable. Although it is sometimes possible to make a profit with a
bad entry (given a sufficiently good exit), a good entry gets the trade started on the
right foot.

A good entry is one that initiates a trade at a point of low potential risk and high
potential reward. A point of low risk is usually a point from which there is little
adverse excursion before the market begins to move in the trade's favor. Entries
that yield small adverse excursions on successful trades are desirable because they
permit fairly tight stops to be set, thereby minimizing risk. A good entry should
also have a high probability of being followed quickly by favorable movement in

