Introduction
We are pleased to announce the availability of a new online tool to demonstrate and analyze the phenomenon of backtest overfitting. It is available HERE. It was developed by researchers at the Scientific Data Management Group at Lawrence Berkeley National Laboratory, with contributions and suggestions from several other persons. A complete list of contributors is given below.
In finance, “backtest overfitting” means using historical market data (i.e., a “backtest”) to develop an investment strategy, where too many variations of the strategy are tried, relative to the amount of data available. Overfit strategies typically work well when tested against the historical data, but then give disappointing performance when fielded in practice. Backtest overfitting appears to be quite widespread in the financial field, and it is hard to detect, since the number of variations attempted when developing a strategy is seldom disclosed to the users of the strategy.
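The effect of trying many variations can be seen in a small simulation. The following is only a minimal sketch, not part of the online tool: it assumes each "variation" is an independent strategy with no skill at all (zero-mean Gaussian daily returns), a 252-day trading year, and an arbitrary 1% daily volatility. The function names are hypothetical.

```python
import random
import statistics

TRADING_DAYS_PER_YEAR = 252  # assumed annualization convention


def annualized_sharpe(daily_returns):
    """Annualized Sharpe ratio of a daily return series (zero risk-free rate)."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return mean / stdev * TRADING_DAYS_PER_YEAR ** 0.5


def best_of_n_sharpe(n_variations, n_days, rng):
    """Best in-sample Sharpe ratio among n_variations skill-free strategies."""
    return max(
        # Each variation is pure noise: zero mean, 1% daily volatility (assumed).
        annualized_sharpe([rng.gauss(0.0, 0.01) for _ in range(n_days)])
        for _ in range(n_variations)
    )


rng = random.Random(0)
n_days = 5 * TRADING_DAYS_PER_YEAR  # five years of daily data
for n in (1, 10, 100):
    print(n, "variations -> best Sharpe", round(best_of_n_sharpe(n, n_days, rng), 2))
```

Because every simulated strategy has a true Sharpe ratio of zero, any high value reported for the "best" variation is purely an artifact of selecting the winner after the fact.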
For example, as shown in the research paper cited in the Credits below, if only five years of daily stock market data are available for a backtest, then no more than 45 variations of a strategy should be tried on this data, or the resulting strategy is likely to be overfit (in the specific sense that a strategy with no real predictive power is likely to show a backtest Sharpe ratio, a standard measure of investment performance, of 1.0 or greater).
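The figure of 45 variations can be roughly reproduced with a few lines of Python. This sketch assumes the minimum-backtest-length approximation from the paper cited in the Credits: the expected maximum of N independent standard normal draws is approximated by (1 - γ) Z⁻¹[1 - 1/N] + γ Z⁻¹[1 - 1/(N e)], where γ is the Euler-Mascheroni constant and Z⁻¹ is the inverse standard normal CDF. It also assumes that the annualized Sharpe ratio of a skill-free strategy estimated from y years of data has a standard deviation of about 1/√y. The function names are illustrative only.

```python
import math
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329


def expected_max_sharpe(n_trials):
    """Approximate expected maximum of n_trials standard normal draws.

    Requires n_trials > 1, since the inverse CDF is undefined at 0.
    """
    z = NormalDist().inv_cdf
    return ((1 - EULER_MASCHERONI) * z(1 - 1 / n_trials)
            + EULER_MASCHERONI * z(1 - 1 / (n_trials * math.e)))


def min_backtest_length(n_trials, target_sharpe=1.0):
    """Approximate years of data below which the best of n_trials skill-free
    strategies is expected to reach target_sharpe in the backtest."""
    return (expected_max_sharpe(n_trials) / target_sharpe) ** 2


print(round(min_backtest_length(45), 2))
```

With 45 trials and a target Sharpe ratio of 1.0, the result comes out close to 5 years, which matches the example above. Under the same approximation, the required length is bounded above by 2 ln(N), so it grows only slowly as more variations are tried.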
If you create or invest in a systematic investment strategy (or in an exchange traded fund based on such a strategy), understanding the degree to which the strategy is overfit can help avoid disappointment and financial losses. The tool below allows one to see how easy it is to overfit an investment system, and how this can impact the financial bottom-line performance.
The tool employs a simplified version of the process many financial analysts use to create trading strategies, namely to use a computer program to find the “best” strategy based on historical stock market data, by adjusting variables such as the holding period and the stop loss level. If care is not taken to avoid backtest overfitting, such strategies may look great on paper (based on tests using historical stock market data), but then give rather disappointing results when actually deployed.
How the tool operates
This online tool operates as follows:
- It first constructs a time series simulating stock market data (an In Sample dataset), using gauss, the Gaussian pseudorandom number function in the standard random module of the Python language. The Gaussian-distributed data on which this time series is based has zero mean and unit standard deviation.
- It then develops an “optimal” trading strategy based on this data, by successively adjusting several variables, including:
  - Stop loss: This is the maximum percent of profit lost that can be sustained before the stock is sold.
  - Holding period: This is the maximum number of trading days that a stock can be held before it is sold, where a trading month is assumed to have 22 trading days.
  - Entry day: This is the day on which one enters the market in each trading month (again assumed to be 22 days long).
  - Side: The trading strategy that is employed, either long (profits are made when stocks are rising) or short (profits are made when stocks are falling).
- If a change to these variables yields a higher Sharpe ratio than all previous combinations, then a new strategy is output. The program then continues to try different combinations until it has exhausted the parameter space.
- The program then generates a second pseudorandom simulated stock market time series (an Out of Sample dataset), and the “optimal” strategy generated above is then applied to this second time series.
- The program then outputs a result page. The left-hand side shows a “movie” of the progression of the optimal strategy as it is generated on the In Sample (backtest) data, and the right-hand side shows a graph of the performance of the optimal strategy on the Out of Sample data.
In most runs using the tool, the Sharpe ratio of the right-hand graph (i.e., the final strategy on Out of Sample data) is either negative or much lower than the Sharpe ratio of the final left-hand graph (i.e., the final strategy on In Sample data), indicating that the strategy has been overfit on the In Sample (backtest) data.
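The steps above can be sketched in a few dozen lines of Python. This is not the code behind the online tool. It is a simplified illustration under stated assumptions: the parameter grid is small and hypothetical, the stop loss is expressed in the units of the simulated series rather than as a percentage, one trade is made per 22-day month, and all function names are invented for this example.

```python
import itertools
import random
import statistics

DAYS_PER_MONTH = 22  # trading days per month, as assumed in the text


def random_walk(n_days, rng):
    """Cumulative sum of zero-mean, unit-variance Gaussian steps."""
    level, series = 0.0, []
    for _ in range(n_days):
        level += rng.gauss(0.0, 1.0)
        series.append(level)
    return series


def monthly_pnl(series, entry_day, holding_period, stop_loss, side):
    """Profit or loss of one trade per complete 22-day month."""
    results = []
    for start in range(0, len(series) - DAYS_PER_MONTH + 1, DAYS_PER_MONTH):
        month = series[start:start + DAYS_PER_MONTH]
        entry = month[entry_day]
        last_day = min(entry_day + holding_period, DAYS_PER_MONTH - 1)
        pnl = 0.0
        for day in range(entry_day + 1, last_day + 1):
            pnl = side * (month[day] - entry)  # side is +1 (long) or -1 (short)
            if pnl <= -stop_loss:  # stop loss hit: exit early
                break
        results.append(pnl)
    return results


def sharpe(pnl):
    """Annualized Sharpe ratio of monthly P&L (zero risk-free rate)."""
    spread = statistics.stdev(pnl)
    return 0.0 if spread == 0 else statistics.fmean(pnl) / spread * 12 ** 0.5


def optimize(series):
    """Exhaustive search of a small, hypothetical parameter grid."""
    grid = itertools.product(
        (0, 5, 10, 15),    # entry day
        (1, 5, 10, 20),    # holding period in trading days
        (2.0, 5.0, 10.0),  # stop loss, in units of the series (assumption)
        (1, -1),           # side: long or short
    )
    return max(grid, key=lambda params: sharpe(monthly_pnl(series, *params)))


rng = random.Random(42)
in_sample = random_walk(1000, rng)
out_of_sample = random_walk(1000, rng)
best = optimize(in_sample)
print("best parameters:", best)
print("In Sample Sharpe:", round(sharpe(monthly_pnl(in_sample, *best)), 2))
print("Out of Sample Sharpe:", round(sharpe(monthly_pnl(out_of_sample, *best)), 2))
```

Since both series are pure noise, the "optimal" parameters carry no information about the second series. Changing the seed and rerunning gives a feel for how often the Out of Sample Sharpe ratio falls well short of the In Sample value.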
Give it a try!
We welcome all to give the tool a try. It is available at the LBNL website. Please send us comments.
Additional background on backtest overfitting
Some additional background on backtest overfitting and the degree to which it can compromise investment performance is given in these articles:
- Article by Brendan Conway of Barron’s
- Article by Saijel Kishan of Bloomberg
- Article by Jason Zweig of the Wall Street Journal
- Institutional Investor Journal interview with Marcos Lopez de Prado
Credits
The online tool and program were constructed by Stephanie Ger, Alex Sim, John Wu and David H. Bailey, based on an earlier Python program developed by Marcos Lopez de Prado. This program in turn is based on the following research paper:
- David H. Bailey, Jonathan M. Borwein, Marcos Lopez de Prado, and Qiji Jim Zhu, “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance,” Notices of the American Mathematical Society, May 2014, pp. 458-471.
We gratefully acknowledge the helpful comments and suggestions from colleagues and friends in shaping this web site. In particular, the suggestions from Mr. David Witkin of StatisTrade and Bin Dong of LBNL were extensive and very helpful in improving the readability of the web pages. Thanks!