Articles about reproducible research and p-values

David Steinberg. 

I want to devote this issue to some articles, both in and outside the statistical literature, that discuss the use of p-values and related questions of selective inference. The p-value (and in fact much of the standard statistical paradigm for science) has been a focal point for controversy in recent years. The article that probably touched off most of the debate, by John Ioannidis, is one of those discussed below. Ioannidis’ article claimed that most published results in scientific journals are false positives and that the accepted demand that these results achieve statistical significance was not a sufficient safeguard. Other scientists also took up the theme of “reproducible research”. Of particular note is the reproducibility project in psychology (see for details), in which a number of psychologists set out to repeat, as accurately as possible, a number of published research projects. For a summary of their work, see the article “Over Half of Psychology Studies Fail Reproducibility Test” in Nature from 27 August 2015. The leading journal Science devoted the entire issue of 2 December 2011 to the topic of reproducibility. Of particular interest for our profession is the article from that issue by Roger Peng, which is summarized below.

The American Statistical Association decided that an institutional response from our profession was needed and set up a committee to study the problem and to draft a statement specifically directed to the use of p-values. That statement will be published soon in The American Statistician and is available online at the TAS web site. It has already caught the notice of the scientific community, though not always with a full understanding of the messages that the ASA committee wished to convey. For an example, see the article “Statisticians Issue Warning Over Misuse of P Values” in Nature on 7 March 2016.

The topic has generated substantial controversy and I think all statisticians should be aware of it, regardless of their area of specialty or application. To bring things closer to home, I conclude by describing several articles on how the selectivity issues that underlie much of the reproducible science debate arise in financial statistics, in particular in the estimation of Sharpe ratios. Thanks to my friend Eric Berger for bringing these articles to my attention.

Why Most Published Research Findings Are False. John Ioannidis, PLoS Medicine 2(8): e124. doi:10.1371/journal.pmed.0020124

John Ioannidis is a professor of medicine and of health research and policy, and also a professor of statistics at Stanford. The stimulus for this article was a disturbing trend that he noticed in medical research in which results published in one paper failed to achieve support in subsequent research. In this article he develops a theoretical argument explaining why one should expect to find many false positives in the scientific literature. The ubiquity of declaring a positive finding whenever a p-value is less than 0.05 is a central part of his theory. He cites numerous instances in the medical literature in which scientists have failed to replicate earlier findings.
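The core of the theoretical argument can be illustrated with a short calculation. The sketch below, in Python, computes the probability that a statistically significant finding is actually true, using the simplest version of the formula in the article (the one without bias or multiple research teams). The function and argument names are my own, and the input values are illustrative only.

```python
def positive_predictive_value(prior_odds, alpha=0.05, power=0.8):
    """Probability that a statistically significant finding is true.

    prior_odds: ratio R of true to null relationships among those tested.
    alpha:      significance threshold (type I error rate).
    power:      1 - beta, the chance of detecting a true relationship.
    """
    true_positives = power * prior_odds   # proportional to true findings declared significant
    false_positives = alpha               # proportional to null findings declared significant
    return true_positives / (true_positives + false_positives)


# Illustrative settings: well-powered versus under-powered studies,
# for fields where true relationships are common or rare.
for odds in (1.0, 0.1, 0.01):
    for power in (0.8, 0.2):
        ppv = positive_predictive_value(odds, power=power)
        print(f"R = {odds:<5} power = {power:<4} PPV = {ppv:.3f}")
```

When true relationships are rare among those tested and power is low, the computed probability drops below one half even though every finding passed the 0.05 threshold.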

Reproducible Research in Computational Science. Roger Peng, Science, 334, 2 December 2011, 1226-1227.

Roger Peng begins this article in the special issue of Science on reproducible research by noting the remarkable research achievements in which scientific computing has played a fundamental role. He then goes on to express concern over the need for standards of reproducibility in computational science. As a simple example, consider the need to be able to reproduce the findings of another researcher’s simulation study. Peng makes a number of concrete suggestions as to how we can promote reproducibility. These include making code available to others and, when possible, data that are used to reach results. He also advises that journals adopt clear policies in asking authors of published papers to make such information available.

The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-normality. Bailey, D. H., and M. López de Prado. 2014, Journal of Portfolio Management 40:94–107.

“Backtesting” refers to testing a trading strategy on data from earlier time periods. The idea is simple: apply your 2016 strategy as if you had adopted it in 2012, say, and see how it performs. As Bailey and López de Prado point out, it is currently possible to run millions of such backtests, thanks to the availability of large financial data sets, machine learning methods and high-performance computing. Moreover, analysts can backtest alternative investment strategies and compare their results. Naturally, one is tempted to then adopt the winning strategy. But this is almost guaranteed to lead to backtest overfitting, with actual results that are less successful than what was found in the backtest.
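A small simulation shows why picking the winner is so misleading. The sketch below is my own illustration, not taken from the article: it assumes that every candidate strategy has no skill at all (independent Normal returns with mean zero), backtests many of them, and reports the best in-sample Sharpe ratio alongside the Sharpe ratio of one fresh period of returns, which is all the winner can deliver out of sample under this assumption.

```python
import random
import statistics


def sharpe_ratio(returns):
    """Per-period Sharpe ratio, taking the risk-free rate as zero."""
    return statistics.mean(returns) / statistics.stdev(returns)


def simulate_returns(rng, n_periods):
    """Returns of a strategy with no skill: mean zero, 1% volatility (assumed)."""
    return [rng.gauss(0.0, 0.01) for _ in range(n_periods)]


def best_backtest(n_strategies, n_periods, seed=0):
    """Best in-sample Sharpe ratio among skill-less strategies, and a fresh draw."""
    rng = random.Random(seed)
    in_sample = [
        sharpe_ratio(simulate_returns(rng, n_periods))
        for _ in range(n_strategies)
    ]
    # The winner has no real edge, so its future returns are just another draw.
    out_of_sample = sharpe_ratio(simulate_returns(rng, n_periods))
    return max(in_sample), out_of_sample


best, fresh = best_backtest(n_strategies=200, n_periods=250)
print(f"best in-sample Sharpe ratio: {best:.3f}")
print(f"out-of-sample Sharpe ratio:  {fresh:.3f}")
```

The more strategies are tried, the larger the best in-sample value becomes, even though nothing has been learned about future performance.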

The problem of performance inflation extends beyond backtesting. More generally, researchers and investment managers may report only positive outcomes, leading to selection bias. Realistic assessment of performance expectations must take these selection effects into consideration.

Bailey and López de Prado advocate accounting for these selection effects when estimating Sharpe ratios by using the Deflated Sharpe Ratio (DSR), which corrects for two leading sources of performance inflation: selection bias under multiple testing and non-Normally distributed returns. They show that the DSR helps separate legitimate empirical findings from statistical flukes.
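For readers who want to see the mechanics, here is a minimal sketch of the DSR calculation as I understand it from the paper; please check it against the original before relying on it. The observed Sharpe ratio is compared with the maximum Sharpe ratio one would expect from skill-less trials, and the comparison is standardized using the skewness and kurtosis of the returns. All names are my own, the Sharpe ratio and its variance across trials are per period (not annualized), the kurtosis is the raw kurtosis (3 for Normal returns), and the inputs in the example are illustrative.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate expected maximum Sharpe ratio among n_trials (>= 2) skill-less trials."""
    z = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe_ratio(sharpe, n_obs, skew, kurt, n_trials, sharpe_variance):
    """Probability that the true Sharpe ratio is positive after deflation.

    sharpe:          observed per-period Sharpe ratio of the selected strategy
    n_obs:           number of return observations
    skew, kurt:      skewness and raw kurtosis of the returns
    n_trials:        number of independent strategies tried
    sharpe_variance: variance of per-period Sharpe ratios across the trials
    """
    threshold = expected_max_sharpe(n_trials, sharpe_variance)
    spread = math.sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    statistic = (sharpe - threshold) * math.sqrt(n_obs - 1) / spread
    return NormalDist().cdf(statistic)


# Illustrative inputs: an annualized Sharpe ratio of 2.5 from 1250 daily
# observations, chosen as the best of 100 trials, with fat-tailed returns.
dsr = deflated_sharpe_ratio(
    sharpe=2.5 / math.sqrt(250), n_obs=1250, skew=-3.0, kurt=10.0,
    n_trials=100, sharpe_variance=0.5 / 250,
)
print(f"deflated Sharpe ratio: {dsr:.4f}")
```

Increasing `n_trials` raises the threshold, so the same observed Sharpe ratio looks less convincing the more strategies were searched.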

Evaluating Trading Strategies. Campbell R. Harvey and Yan Liu. 2014, Journal of Portfolio Management 40:108-118.

Harvey and Liu also examine the problem of how to evaluate trading strategies. They are particularly concerned with work in which many strategies and combinations of strategies have been tried, arguing that evaluation methods then require adjustment for these multiple tests. For example, Sharpe ratios and other statistics will be overstated as a result of the search. They propose methods that are simple to implement and allow for the real-time evaluation of candidate trading strategies.
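To give a flavor of such an adjustment, here is a minimal sketch using the Bonferroni correction, the simplest multiple-testing adjustment. It is my own simplified illustration rather than the authors' code: it assumes a positive annualized Sharpe ratio, treats the test statistic as the Sharpe ratio times the square root of the number of years, and uses a Normal approximation for the p-value. The function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def bonferroni_haircut_sharpe(annual_sharpe, n_years, n_tests):
    """Sharpe ratio implied by a Bonferroni-adjusted p-value (Normal approximation)."""
    z = NormalDist()
    t_stat = annual_sharpe * math.sqrt(n_years)
    p_single = 2 * (1 - z.cdf(abs(t_stat)))    # two-sided p-value for one test
    p_adjusted = min(1.0, n_tests * p_single)  # Bonferroni adjustment
    if p_adjusted <= 0.0:
        # p-value too small to represent; the adjustment changes nothing.
        return annual_sharpe
    t_adjusted = z.inv_cdf(1 - p_adjusted / 2)  # back from p-value to test statistic
    return t_adjusted / math.sqrt(n_years)


observed = 1.0
for n_tests in (1, 10, 200):
    adjusted = bonferroni_haircut_sharpe(observed, n_years=10, n_tests=n_tests)
    haircut = 1 - adjusted / observed
    print(f"tests = {n_tests:<4} adjusted Sharpe = {adjusted:.3f} haircut = {haircut:.0%}")
```

With a single test the Sharpe ratio is unchanged; as the number of strategies tried grows, the adjusted value shrinks toward zero.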

Backtesting. Campbell R. Harvey and Yan Liu. 2015, Journal of Portfolio Management Fall:12-28.

This article by Harvey and Liu looks at a common practice in evaluating backtests of trading strategies – to discount reported Sharpe ratios by 50%. Harvey and Liu argue that there are good economic and statistical reasons for reducing the Sharpe ratios. In particular, they relate the discount to the use of data mining. This mining may manifest itself in academic researchers searching for asset-pricing factors that explain the behavior of equity returns, or by researchers at firms that specialize in quantitative equity strategies trying to develop profitable systematic strategies.
