The failure of anomaly indicators in finance

A black swan; credit: Wikimedia

The replicability crisis in science

Recent public reports have underscored a crisis of replicability in numerous fields of science:

  • In 2012, Amgen researchers reported that they were able to replicate only six of 53 landmark cancer studies.
  • In March 2014, physicists announced with fanfare that they had detected evidence of gravitational waves from the “inflation” epoch of the big bang. However, other researchers were unable to verify this conclusion. The current consensus is that the twisting patterns in the data are due to dust in the Milky Way, not inflation.
  • In 2015, in a study by the Reproducibility Project, only 39 of 100 psychology studies could be replicated, even after taking extensive steps such as consulting with the original authors.
  • Also in 2015, a study by the U.S. Federal Reserve was able to replicate only 29 of 67 economics studies.
  • In an updated 2018 study by the Reproducibility Project, only 14 out of 28 classic and contemporary psychology experimental studies were successfully replicated.
  • In 2018, the Reproducibility Project was able to replicate only five of ten key studies in cancer research, with three inconclusive and two negative; eight more studies are in the works but incomplete.

Part of the reason for these disappointing results is the all-too-widespread practice of p-hacking: adjusting data and altering hypotheses until a combination is found that meets, say, the p = 0.05 level of significance, so that the study has a good chance of being published. Note that this is a classic multiple-testing fallacy of statistics: perform enough tests and one is bound to pass any given level of statistical significance. For additional details and discussion, see this Math Scholar blog and this Mathematical Investor blog.
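
As a concrete illustration of the multiple-testing fallacy, the short Python sketch below (a toy example, not drawn from any of the studies above; the sample sizes and random seed are arbitrary) runs 1,000 significance tests on pure noise. Even though no real effect exists anywhere, roughly five percent of the tests clear the p = 0.05 bar.

```python
# Toy illustration of the multiple-testing fallacy: run many significance tests
# on pure noise and count how many clear p = 0.05 by luck alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # arbitrary seed
n_tests, n_obs = 1000, 100        # arbitrary sizes, chosen only for illustration

false_positives = 0
for _ in range(n_tests):
    sample = rng.normal(loc=0.0, scale=1.0, size=n_obs)    # the true effect is zero
    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} pure-noise tests are 'significant' at p < 0.05")
# Expect roughly 50 spurious "discoveries", none reflecting a real effect.
```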

The replicability crisis in finance

In the past few years, several authors in the mathematical finance field have drawn attention to the nagging problem of backtest overfitting. See, for instance, this 2014 paper by the present author and three colleagues. Indeed, backtest overfitting is now thought to be a principal reason why investment funds and strategies that look good on paper often fail in practice: the impressive performance in backtest studies, to put it mildly, is frequently not replicated when the fund or strategy is actually fielded.

Along this line, in a separate 2017 JOIM paper (see also this synopsis), the present author and two colleagues demonstrated that it is straightforward to design a stock fund, based on backtests, that achieves virtually any desired goal: say, steady growth of 1% per month, month after month, for ten or more years. However, when such designs are presented with new data, they invariably prove to be brittle artifacts, often failing catastrophically or, at the least, utterly failing to replicate their stated goal.
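
The effect is easy to reproduce in miniature. The sketch below is only a stand-in for the fund construction in the JOIM paper: it simulates many candidate strategies whose monthly returns are pure noise, selects the one with the most impressive backtest, and then re-tests that same strategy on fresh data.

```python
# Toy stand-in for backtest-driven fund design: pick the best-looking of many
# pure-noise "strategies" in a backtest, then re-test the chosen one on new data.
import numpy as np

rng = np.random.default_rng(1)
n_strategies, n_months = 10_000, 120        # arbitrary sizes for illustration

backtest = rng.normal(0.0, 0.04, size=(n_strategies, n_months))   # in-sample returns
live = rng.normal(0.0, 0.04, size=(n_strategies, n_months))       # unseen "live" returns

def sharpe(r):
    """Annualized Sharpe ratio of a (batch of) monthly return series."""
    return np.sqrt(12) * r.mean(axis=-1) / r.std(axis=-1)

best = np.argmax(sharpe(backtest))          # strategy chosen purely for its backtest
print(f"Backtest Sharpe of chosen strategy: {sharpe(backtest[best]):.2f}")
print(f"Live Sharpe of the same strategy:   {sharpe(live[best]):.2f}")
# Expect a backtest Sharpe well above 1 and a live Sharpe near zero: the selection
# step alone manufactures an impressive but unreplicable track record.
```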

Today, if anything, the situation has worsened, with the rise of computerized optimization techniques for investment funds and strategies. After all, it is an increasingly simple matter to explore thousands, millions or even billions of alternative component weightings or parameter settings for an investment fund or strategy, and select only the highest-scoring combination to be published or fielded. As we have shown (see HERE for instance), such computer explorations (typically never disclosed to editors, readers or customers) almost always render the resulting study or financial product hopelessly overfit and, in many cases, subject to serious failure.
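
The sketch below (volatilities and sample lengths are made up for illustration) shows why larger searches make matters worse: among N strategies whose returns are pure noise, the best backtest Sharpe ratio keeps climbing as N grows, roughly on the scale of sqrt(2 ln N), so a bigger search is all but guaranteed to surface a more impressive, and more misleading, winner.

```python
# Why bigger searches look better on paper: the best in-sample Sharpe ratio among
# N pure-noise strategies keeps climbing as N grows, even though every strategy
# is worthless. sqrt(2 ln N) gives the rough scale of the largest of N standard
# normal draws, rescaled here to an annualized Sharpe ratio.
import numpy as np

rng = np.random.default_rng(2)
n_months = 120   # ten years of monthly returns, an arbitrary choice

for n_trials in (10, 100, 10_000, 100_000):
    returns = rng.normal(0.0, 0.04, size=(n_trials, n_months))
    sharpe = np.sqrt(12) * returns.mean(axis=1) / returns.std(axis=1)
    scale = np.sqrt(12) * np.sqrt(2 * np.log(n_trials)) / np.sqrt(n_months)
    print(f"N = {n_trials:>7,}: best backtest Sharpe = {sharpe.max():.2f} "
          f"(noise-only scale ~ {scale:.2f})")
```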

After reviewing these and other recent developments in the field, Marcos Lopez de Prado and Michael Lewis sadly concluded that most investment strategies uncovered by practitioners and academics are false. In a similar vein, the present author and Lopez de Prado, in a 2018 Forbes interview with Brett Steenbarger, discussed the situation in these terms:

Imagine that a pharmaceutical company develops 1000 drugs and tests these on 1000 groups of volunteer patients. When a few dozen of the tests prove “significant” at the .05 level of chance, those medications are marketed as proven remedies. Believing the “scientific tests”, patients flock to the new wonder drugs, only to find that their conditions become worse as the medications don’t deliver the expected benefit. Some consumers become quite ill and several die.

Clearly, there would be a public outcry over such deceptive practice. Indeed, that is precisely the reason we have a regulatory agency and laws to help ensure that medications have been properly tested before they are offered to the public. … [But] no such protections are offered to financial consumers, leaving them vulnerable to unproven investment strategies. … These false positives are particularly misleading, as they are promoted by researchers with seemingly impeccable research backgrounds—and who do not employ the scientific tools needed to detect such false findings.

New anomaly indicator study

In an attempt to better assess the state of replicability in the finance field, Kewei Hou, Chen Xue and Lu Zhang have published an in-depth study on the replicability of anomaly indicators in finance (the first and third authors are with the Fisher College of Business, Ohio State University; the second is with the Lindner College of Business, University of Cincinnati). The full technical article is available HERE.

The Hou-Xue-Zhang paper addressed 452 anomaly variables taken from a large set of published papers on anomaly indicators in the academic finance field. Unlike some other studies, they kept micro-cap stocks (those whose market equity is below the 20th percentile of the New York Stock Exchange) in their sample, but curbed their outsized influence by using NYSE breakpoints and value-weighted returns. The paper is very detailed (over 100 pages in length), with extensive tables and data analysis.
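
For readers curious what a single replication step involves, here is a hedged Python sketch of sorting stocks into decile portfolios on an anomaly signal using NYSE breakpoints and value-weighted returns, the device Hou, Xue and Zhang use to rein in micro-caps. The column names and data layout are hypothetical; this is not code from their paper.

```python
# Hypothetical sketch of one replication step: sort stocks on an anomaly signal
# using NYSE breakpoints and compute value-weighted decile returns, which keeps
# micro-caps in the sample but limits their influence. All column names
# (exchange, signal, market_equity, next_month_return) are assumptions.
import numpy as np
import pandas as pd

def decile_returns(month: pd.DataFrame) -> pd.Series:
    """Value-weighted next-month return of each signal decile for one month."""
    nyse_signal = month.loc[month["exchange"] == "NYSE", "signal"]
    breakpoints = nyse_signal.quantile(np.arange(0.1, 1.0, 0.1)).to_numpy()
    month = month.assign(decile=np.searchsorted(breakpoints, month["signal"]) + 1)
    value_weighted = lambda g: np.average(g["next_month_return"],
                                          weights=g["market_equity"])
    return month.groupby("decile").apply(value_weighted)

# Applied month by month, the anomaly's high-minus-low decile spread and the
# t-statistic of that spread across months are what the replication tests examine.
```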

In the end, these authors soberly concluded that most of these anomaly indicators fail to replicate. Out of the 452 studied, 65% did not even clear the single-test threshold of t = 1.96 or greater. With a more stringent criterion that partially compensates for multiple testing, namely t = 2.78 at the 5% significance level, the failure rate increases to 82%.
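
To put those hurdles in perspective, the sketch below (the sample length and simulation count are assumptions, not the paper's data) simulates 452 anomaly return series containing no real effect at all and counts how many clear each bar purely by chance.

```python
# How much protection the stricter cutoff buys: simulate 452 pure-noise anomaly
# return series (no real effect anywhere) and count chance passes at each bar.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_anomalies, n_months, n_sims = 452, 600, 200   # ~50 years of months, assumed

passes_196, passes_278 = [], []
for _ in range(n_sims):
    returns = rng.normal(0.0, 1.0, size=(n_anomalies, n_months))
    t_stats = np.abs(stats.ttest_1samp(returns, popmean=0.0, axis=1).statistic)
    passes_196.append((t_stats > 1.96).sum())
    passes_278.append((t_stats > 2.78).sum())

print(f"Chance passes among 452 noise anomalies: "
      f"{np.mean(passes_196):.1f} at |t| > 1.96 vs {np.mean(passes_278):.1f} at |t| > 2.78")
# Expect about 23 lucky passes at the lower bar and 2-3 at the higher one.
```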

Some of the heaviest casualties of the Hou-Xue-Zhang study were from the “trading frictions” literature. In the category covering liquidity, market microstructure and other trading-friction indicators, fully 102 of 106 variables failed to replicate, even at the single-test threshold. The authors listed a few of the prominent failures (dates are reference keys in the Hou-Xue-Zhang paper):

[T]he Jegadeesh (1990) short-term reversal; the Datar-Naik-Radcliffe (1998) share turnover; the Chordia-Subrahmanyam-Anshuman (2001) coefficient of variation for dollar trading volume; the Amihud (2002) absolute return-to-volume; the Easley-Hvidkjaer-O’Hara (2002) probability of informed trading; the Pastor-Stambaugh (2003) liquidity beta; the Acharya-Pedersen (2005) liquidity betas; the Ang-Hodrick-Xing-Zhang (2006) idiosyncratic volatility, total volatility, and systematic volatility; the Liu (2006) number of zero daily trading volume; the Bali-Cakici-Whitelaw (2011) maximum daily return; the Corwin-Schultz (2012) high-low bid-ask spread; the Adrian-Etula-Muir (2014) financial intermediary leverage beta; and the Kelly-Jiang (2014) tail risk.

Some other fairly well-known anomaly indicators that Hou, Xue and Zhang were unable to replicate statistically include the following:

[T]he Bhandari (1988) debt-to-market; the Lakonishok-Shleifer-Vishny (1994) five-year sales growth; the La Porta (1996) long-term analysts’ forecasts; several of the Abarbanell-Bushee (1998) fundamental signals; the O-score and Z-score studied in Dichev (1998); the Piotroski (2000) fundamental score; the Diether-Malloy-Scherbina (2002) dispersion in analysts’ forecasts; the Gompers-Ishii-Metrick (2003) corporate governance index; the Francis-LaFond-Olsson-Schipper (2004) earnings attributes, including persistence, smoothness, value relevance, and conservatism; the Francis-LaFond-Olsson-Schipper (2005) accrual quality; the Richardson-Sloan-Soliman-Tuna (2005) total accruals; the Campbell-Hilscher-Szilagyi (2008) failure probability; and the Fama-French (2015) operating profitability.

Hou, Xue and Zhang conclude, “In all, capital markets are more efficient than previously recognized.”

There is some concern that, although Hou, Xue and Zhang cited some multiple-testing statistics for their data, they may not have fully compensated for this phenomenon, given the sheer scale of their study. If so, their results are, if anything, conservative: even fewer of the indicators in their study are statistically replicable, or they replicate only with very marginal statistical confidence.
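
For readers unfamiliar with how such compensation works, here is a minimal sketch of one standard correction, the Benjamini-Hochberg false-discovery-rate procedure, applied to a handful of made-up t-statistics. It illustrates the general technique only; it is not the adjustment used in the Hou-Xue-Zhang paper.

```python
# Minimal Benjamini-Hochberg false-discovery-rate control on a set of anomaly
# t-statistics; the values below are made up purely for illustration.
import numpy as np
from scipy import stats

t_stats = np.array([3.9, 3.1, 2.8, 2.3, 2.0, 1.7, 1.2, 0.6])   # hypothetical
p_values = 2 * stats.norm.sf(np.abs(t_stats))                   # two-sided p-values

def benjamini_hochberg(p: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of the hypotheses kept at false-discovery rate alpha."""
    order = np.argsort(p)
    ranked = p[order]
    m = len(p)
    below = ranked <= alpha * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if below.any():
        keep[order[: np.flatnonzero(below).max() + 1]] = True
    return keep

print("Significant at plain p < 0.05:", int((p_values < 0.05).sum()))
print("Survive Benjamini-Hochberg:   ", int(benjamini_hochberg(p_values).sum()))
# With these made-up values, one of the five nominally significant results drops out.
```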

What’s more, it should be kept in mind that their data extend only through 2016. Since then, computerized, big-data-based trading activity has increased significantly in markets worldwide, which very likely means that some of the remaining anomaly indicators that once had genuine merit are no longer effective.

The pressure to publish new results

One springboard for the Hou-Xue-Zhang article was a 2016 study by Harvey, Liu and Zhu, who, after analyzing 296 anomalies from published papers, found that between 80 and 158 of them (i.e., up to 53%) are likely false discoveries.

Harvey, Liu and Zhu identified a fundamental structural bias in the field as the likely culprit: whereas replication studies routinely appear in top journals of most other scientific fields, very few such studies are published in finance and economics. In other words, there is a prevailing preference for publishing new results rather than rigorously verifying previous ones.

Such problems have also been noted by Campbell R. Harvey, past president (2016) of the American Finance Association, who lamented that journal editors compete for citation-based impact numbers. But because replication studies and others that do not report remarkable new results tend to generate fewer citations, such papers are less likely to be published. On the other hand, Harvey observes that researchers also contribute to publication bias. Knowing journals’ preference for papers with significant new results, authors may not submit papers with more mundane results — a bias known in other fields as the “file drawer effect.” Publication bias may also be induced by authors cherry-picking the most significant results for journal submissions, which is a form of p-hacking.

Restoring replicability

Fortunately, there are specific tools that can detect and mitigate backtest overfitting and the related forms of multiple-testing bias that lead to poor replicability. For example, this 2014 JPM paper provides some solid ways to detect and prevent false discoveries. Also, this 2017 JCF paper provides a theoretical framework to calculate the probability of backtest overfitting. Many of these techniques and others are discussed in Lopez de Prado’s recently published book Advances in Financial Machine Learning.
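
As a flavor of what these tools do, the sketch below loosely follows the cross-validation idea behind the probability of backtest overfitting: split a backtest into blocks, repeatedly pick the best strategy on half of the blocks, and measure how often that pick lands in the bottom half on the remaining blocks. The block count and return model are assumptions chosen for illustration; this is not the 2017 JCF algorithm verbatim.

```python
# Rough sketch of a "probability of backtest overfitting" (PBO) estimate: how
# often does the strategy that wins in-sample fall below the median out-of-sample?
import itertools
import numpy as np

def pbo(returns: np.ndarray, n_blocks: int = 8) -> float:
    """returns: (n_periods, n_strategies) matrix of backtest returns."""
    blocks = np.array_split(np.arange(returns.shape[0]), n_blocks)
    below_median = []
    for combo in itertools.combinations(range(n_blocks), n_blocks // 2):
        is_idx = np.concatenate([blocks[i] for i in combo])
        oos_idx = np.concatenate([blocks[i] for i in range(n_blocks) if i not in combo])
        best = np.argmax(returns[is_idx].mean(axis=0))      # winner chosen in-sample
        oos_mean = returns[oos_idx].mean(axis=0)
        oos_rank = (oos_mean < oos_mean[best]).mean()       # relative rank out-of-sample
        below_median.append(oos_rank < 0.5)
    return float(np.mean(below_median))

# Example: 500 pure-noise strategies over ten years of monthly data.
rng = np.random.default_rng(4)
noise = rng.normal(0.0, 0.04, size=(120, 500))
print("Estimated PBO for noise strategies:", round(pbo(noise), 2))
# Expect a value near 0.5: the in-sample winner is a coin flip out of sample.
```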

In the Forbes article mentioned above, the present author argues that additional government regulation and oversight may be required, in particular because retail investors are largely unaware of the statistical pitfalls in the studies behind the design of investment products.

But in the end, the only long-term solution is education: all researchers and investment practitioners in finance need to be rigorously trained in modern statistics and in how best to use these tools. Special attention should be paid to showing how statistical tests can mislead when used naively or carelessly. Note that this training is needed not only for students and others entering the work force, but also for many who are already practitioners in the field. This will not be easy, but it is needed if the field is to move forward in a high-tech world.
