Reproducibility in scientific research
In the past year or two, the reproducibility of research results in finance and economics has come under serious question.
If it is any comfort, similar difficulties have emerged in numerous other scientific fields. In 2011, a team of Bayer researchers attempted to reproduce a set of key published pharmaceutical studies. They were only able to validate 11 out of 67 of these studies. Similarly, in 2012, Amgen attempted to reproduce a set of studies in oncology (cancer). They were only able to reproduce 6 out of 53 (11%). A commentary in Nature remarked, “Even knowing the limitations of preclinical research, this was a shocking result.”
In the field of psychology, in the wake of several recent cases of downright fraud, a team of Virginia researchers formed a Reproducibility Project to attempt to reproduce the findings of recent peer-reviewed papers in the field. After a concerted effort to reproduce 100 studies in three leading journals, the team found that less than half stood up when retested.
Concerns about reproducibility have even arisen in a discipline that several of us have practiced, namely mathematical and scientific computing. A recent workshop concluded that the field has failed to foster a “culture of reproducibility.” In many cases, researchers have not kept careful records or repositories of algorithm statements, computer code, and data, so that even the original research team cannot reproduce their own work. In other cases, the numerical reproducibility of the conclusions is in question.
Reproducibility in finance and economics
In the field of finance, concerns have been building for years that many financial strategies and commercially marketed financial products are based on research that is not reproducible or, more specifically, does not stand up to rigorous statistical standards. For example, several of us have argued that backtest overfitting is widespread, both in peer-reviewed research studies and among financial practitioners. In a 2014 paper, we concluded:
We strongly suspect that such backtest overfitting is a large part of the reason why so many algorithmic or systematic hedge funds do not live up to the elevated expectations generated by their managers.
We are hardly alone in making this observation. In an October 2014 study, a team of researchers, not affiliated with us in any way, concluded that “most claimed research findings in financial economics are likely false.”
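To make the mechanism concrete, here is a minimal sketch, in Python, of how backtest overfitting can manufacture an impressive-looking result from pure noise. The strategy count, the return model, and the Sharpe-ratio calculation are our illustrative assumptions, not the procedure of any study cited above.

```python
# A minimal sketch of backtest overfitting: simulate many strategies with no
# true edge on the same data, then report only the best one. All names and
# parameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(seed=42)

n_strategies = 1000   # number of candidate strategies tried on the same sample
n_days = 252          # one year of daily returns
# Each "strategy" is pure noise: zero expected return, 1% daily volatility.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

# Annualized Sharpe ratio of each strategy over the backtest sample.
sharpe = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)

print(f"Best in-sample Sharpe ratio out of {n_strategies} trials: {sharpe.max():.2f}")
# Typically around 3 -- an impressive-looking backtest, even though every
# strategy has zero true edge and can be expected to earn nothing out of sample.
```

Reporting the single best of many trials as if it were the only strategy tried is precisely the practice that inflates expectations in published backtests.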
New Federal Reserve study
A new study by Andrew Chang (Board of Governors of the Federal Reserve System) and Phillip Li (Office of the Comptroller of the Currency) investigated the reproducibility of recent studies in the field of economics. They attempted to replicate 67 empirical papers recently published in 13 highly regarded economics journals, including the American Economic Journal: Economic Policy, the American Economic Review, the Canadian Journal of Economics, Econometrica, and the Review of Economics and Statistics.
The authors of the Federal Reserve study were very diligent and determined in their efforts to obtain working data and other information. They first tried to obtain this data from online sources. If this failed, they then queried each individual author of the paper in turn, beginning with the corresponding author, until they either succeeded in obtaining the needed information or failed altogether.
In spite of their efforts, they were able to obtain data and code replication files for only 29 of 35 papers that were published in journals that required the authors to provide such information as a condition of publication. For those papers published in a journal that did not have such a requirement, they were only able to obtain the requisite data in 11 out of 26 cases.
For those studies for which they were able to obtain the requisite data, the Federal Reserve authors then made a determined attempt to reproduce the calculations and analysis presented in the papers. They even tried to install the specific versions of various software environments and application programs (e.g., Matlab) that were used in the various studies.
In the end, they were able to replicate the results of only 29 of the 59 papers that did not involve proprietary data. They concluded:
Because we are able to replicate less than half of the papers in our sample even with help from the authors, we assert that economics research is usually not replicable.
The Federal Reserve report’s recommendations
In spite of their discouraging conclusion, the authors of the Federal Reserve study offered a number of helpful, constructive suggestions, which, we might add, are applicable to a wide range of scientific studies, not just in economics:
- Mandatory data and code files should be a condition of publication.
- If a paper has no replication files in the journal’s data and code archive, an entry in the archive should indicate whether the paper is exempt from the journal’s replication policy.
- Readme files should indicate the operating system and software versions used in the analysis.
- Readme files should contain an expected model estimation time [i.e., how long the computer runs take].
- Code that relies on random number generators should set seeds and specify the random number generator (see the sketch after this list).
- Readme files should clearly delineate which files should be executed in what order to produce the desired results.
- Authors should provide raw data in addition to transformed series [i.e., in addition to the transformed data].
- Programs that replicate estimation results should carry out the estimation [i.e., if a model requires parameters to be estimated, these parameters should be provided, or else the computer code that provides the estimates should be included].
[Some clarifying comments are added by us in brackets above.]
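To illustrate the seed, generator, and software-version recommendations above, here is a minimal Python sketch of a replication-friendly script header; the particular generator, seed value, and logged packages are illustrative assumptions on our part, not requirements taken from the Federal Reserve study.

```python
# Minimal sketch of a replication-friendly script header: fix the random seed,
# name the generator, and record the software versions used for the run.
# The specific choices below are illustrative, not taken from any cited paper.
import platform
import sys

import numpy as np

SEED = 20160101                    # documented so others can rerun the analysis
rng = np.random.default_rng(SEED)  # NumPy's default PCG64 generator, stated explicitly

print("Python:", sys.version.split()[0])
print("OS:", platform.platform())
print("NumPy:", np.__version__)
print("RNG:", type(rng.bit_generator).__name__, "with seed", SEED)

# ... estimation code would follow, drawing all randomness from `rng` ...
print("First draws:", rng.standard_normal(3))
```

A header like this, referenced from the readme, lets another researcher reproduce both the software environment and the exact sequence of random draws.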
Conclusions
Reproducibility has long been an aspiration of scientific research, but one that is too seldom achieved.
But in spite of the discouraging findings of recent meta-studies, we should see these developments in a positive light. After all, they are evidence that the field of scientific research in general, and of finance and economics in particular, is finally both recognizing and addressing the need to improve reproducibility. We fully expect (and applaud) additional efforts both to identify previous studies that are not reproducible and to enact stricter standards so that a higher percentage of future studies are reproducible.
As Lisa Feldman Barrett, a professor of psychology at Northeastern University, observed:
[T]he failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works. … Failure to replicate is not a bug; it is a feature. It is what leads us along the path — the wonderfully twisty path — of scientific discovery.