Monday, September 28, 2015

Why most published data are not reproducible

Public confidence in science has suffered another setback from recent reports that the results of most published studies are not reproducible (Baker 2015, Bartlett 2015, Begley et al. 2015, Jump 2015).  The statistical implication of this is unavoidable, and troubling to say the least: it means that the results of at least half of all research that has ever been published, probably in all fields of study, are inconclusive at best.  They may be reliable and useful, but maybe not.  Mounting evidence in fact leans toward the latter (Ioannidis 2005, Lehrer 2010, Hayden 2013).

Moreover, these inconclusive reports are likely to involve mostly studies that had been regarded as especially promising contributions, lauded as particularly novel and ground-breaking.  In contrast, the smaller group that passed the reproducibility test is likely to consist mostly of esoteric research that few people care about, or of so-called ‘safe research’: studies that report merely confirmatory results, designed to generate data that were already categorically expected, i.e. studies that aimed to provide just another example of support for well-established theory, or support for something that was already an obvious bet and easily believable anyway, even without data collection (or theory).  [A study that anticipates only positive results in advance is pointless; there is no reason for doing the science in the first place, because it just confirms what one already knows must be true.]

Are there any remedies for this reproducibility problem?  Undoubtedly some, and researchers are scrambling, ramping up efforts to identify them [see the Nature Special (2015) on Challenges in irreproducible research].  Implementing them effectively (if that is possible at all) will require nothing short of a complete restructuring of the culture of science, with new and revised manuals of ‘best practice’ (e.g. see Nosek et al. 2015, Sarewitz 2015; and see the Center for Open Science: Transparency and Openness Promotion (TOP) Guidelines).


Some of the reasons for irreproducibility, however, will not go away easily.  In addition to outright fraud, there are at least six more — some more unscrupulous than others:


(1) Page space restrictions of some journals.  For some studies, results cannot be reproduced because the authors were required to limit the length of the paper.  Hence, important details required for repeating the study are missing.   


(2) Sloppy record keeping / data storage / accessibility.  Researchers are not all equally meticulous by nature.  In some cases, methodological details are missing inadvertently because the authors simply forgot to include them, or the raw data were not stored or backed up with sufficient care.


(3) Practical limitations that prevent ‘controls’ for everything that might matter.  For many study systems, there are variables that simply cannot be controlled.  In some cases, the authors are aware of these and acknowledge them (and hence also the inconclusive nature of their results).  In other cases, important variables could have been controlled but were innocently overlooked, and in still other cases there are variables that simply could not have been known or even imagined.  The impact of these ‘ghost variables’ can severely limit the chances of reproducing the results of the earlier study.


(4) Pressure to publish a lot of papers quickly.  Success in academia is measured by counting papers.  Researchers are therefore often anxious to publish a ‘minimum publishable unit’ (MPU), and as quickly as possible, without first repeating the study to bolster confidence that the results can be replicated and were not just a fluke.  Inevitably, of course, some (perhaps a lot) of the time, results (especially MPUs) will be a fluke, but it is generally better for one’s career not to take the time and effort to find out (time and effort taken away from cranking out more papers).  When others do take the time and effort to check, however, more instances of irreproducible results make the news headlines; such news would be a lot less common if the culture of academia encouraged researchers to replicate their own studies before publishing them.


(5) Using secrecy (omissions) to retain a competitive edge.   As Collins and Tabak (2014) note:  “…some scientists reputedly use a 'secret sauce' to make their experiments work — and withhold details from publication or describe them only vaguely to retain a competitive edge.”


(6) Pressure to publish in ‘high-end’ journals.  Successful careers in academia are measured not just by counting papers, but especially by counting papers in ‘high-end’ journals: those that generate high Impact Factors because of their mission to publish only the most exciting findings and their lack of interest in publishing negative ones.  Researchers are thus addicted to chasing Impact Factor (IF) as a status symbol within a culture that breeds elitism, and the high-end journals feed that addiction (many of them while cashing in on it).  The traditional argument for defending the value of 'high-end branding' for journals (supposedly measured by high IF) is that it provides a convenient filtering mechanism, allowing one to quickly find and read the most significant research contributions within a field of study.  In fact, however, the IF of a ‘high-end’ journal says little or nothing about the eventual relative impact (citation rates) of the vast majority of papers published in it (Leimu and Koricheva 2005).  A high journal IF, in most cases, is driven by a small handful of 'superstar' articles (or articles by a few 'superstar' authors).  Journal 'brand' (IF) therefore has only marginal value at best as a filtering mechanism for readers and researchers.

Moreover, the addiction to chasing Impact Factor, despite not delivering what its gate-keepers proclaim, is ironically at the heart of the irreproducibility problem, for at least two reasons.  First, it fuels incentives for researchers to be biased in the selection of study material (e.g. using a certain species) that they already have reason to suspect, in advance, is particularly likely to provide support for the 'favoured hypothesis’, the 'exciting story'.  Any data collected for different study material that fail to support the 'exciting story’ must of course be shelved (the so-called ‘file drawer’ problem), because high-end journals won’t publish them.

Second, addiction to chasing IF can motivate researchers to report their findings selectively, excluding certain data or failing to mention the results of certain statistical analyses that do not fit neatly with the ‘exciting story’.  This may, for example, include ‘p-hacking’ — searching for and choosing to emphasize only analyses that give small p-values.  And obviously there is no incentive here to repeat one’s experiment, ‘just to be sure’; self-replication would run the risk that the ‘exciting story’ might go away.
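
To see why p-hacking is so corrosive, consider a minimal simulation sketch (in Python; the sample sizes, the 'peek every 10 observations' rule and the 0.05 threshold are assumptions chosen purely for illustration, not taken from any study cited here).  It mimics one common form of p-hacking, checking the p-value repeatedly as data accumulate and stopping as soon as it dips below 0.05, and it shows how this inflates the false-positive rate well beyond the nominal 5% even when there is no real effect at all:

    # Hypothetical 'p-hacked' experiment: the true effect is zero, so every
    # 'significant' result is a false positive. Data are collected in batches,
    # and the researcher stops (and 'publishes') the moment p < 0.05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def hacked_experiment(max_n=100, peek_every=10, alpha=0.05):
        data = []
        while len(data) < max_n:
            data.extend(rng.normal(0, 1, peek_every))   # true mean is 0
            t, p = stats.ttest_1samp(data, 0)
            if p < alpha:
                return True                             # 'exciting' result found
        return False                                    # honest negative result

    runs = 5000
    false_positives = sum(hacked_experiment() for _ in range(runs))
    print("Nominal false-positive rate: 0.05")
    print(f"Rate with optional stopping: {false_positives / runs:.3f}")  # roughly 0.15-0.20

The same inflation arises when many outcome measures or analyses are tried and only the one yielding the smallest p-value is reported.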
  


All of this means that the research community and the general public are commonly duped, led to believe that support for the 'exciting story’ is stronger than it really is.  This is revealed when later research attempts, unsuccessfully, to replicate the effect sizes of earlier supporting studies.  Negative findings in this context then, ironically, become more ‘publishable’ (including for study material that was already used earlier and that ended up in a file drawer somewhere).  Hence, empirical support for an exciting new idea commonly accumulates rapidly at first, but eventually starts to fall off ('regression to the mean') as more replications are conducted, the so-called ‘decline effect’ (Lehrer 2010).
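
The arithmetic behind the decline effect can be illustrated with another small simulation sketch (again in Python, with an assumed true effect of 0.2 standard deviations and 20 subjects per group; these numbers are arbitrary and are not drawn from any of the studies cited here).  When only the studies that happen to reach statistical significance get published, the published effect sizes are inflated by selection, and unbiased replications of the same design then 'regress' back toward the true value:

    # Hypothetical illustration of publication bias producing a 'decline effect'.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    true_effect, n, alpha = 0.2, 20, 0.05     # assumed values, for illustration only

    published, replications = [], []
    for _ in range(20000):
        a = rng.normal(true_effect, 1, n)     # treatment group
        b = rng.normal(0, 1, n)               # control group
        t, p = stats.ttest_ind(a, b)
        observed = a.mean() - b.mean()
        if p < alpha and observed > 0:        # only the 'exciting' outcomes get published
            published.append(observed)
            # an unbiased replication of the same design
            a2 = rng.normal(true_effect, 1, n)
            b2 = rng.normal(0, 1, n)
            replications.append(a2.mean() - b2.mean())

    print(f"True effect:              {true_effect:.2f}")
    print(f"Mean published effect:    {np.mean(published):.2f}")     # inflated (roughly 0.7-0.8)
    print(f"Mean replication effect:  {np.mean(replications):.2f}")  # close to 0.2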

Progress in science happens when research results reject a null hypothesis, thus supporting the existence of a relationship between two measured phenomena, or a difference among groups, i.e. a ‘positive result’.  But progress is also supposed to happen when research produces a 'negative result', i.e. results that fail to reject a null hypothesis, thus failing to find a relationship or difference.  Science done properly then, with precision and good design but without bias, should commonly produce negative results, perhaps even as much as half of the time.  But negative results are mostly missing from the published literature.  Instead, they are hidden in file drawers, destroyed altogether, or never given a chance to be discovered.  Because positive results are more exciting to authors, and especially to journal editors, researchers commonly rig their study designs and analyses to maximize the chances of reporting positive results.
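
A back-of-the-envelope calculation makes the point (the numbers here are assumptions for illustration only): if half of all well-designed studies happen to test hypotheses that are in fact wrong, then with a conventional 5% false-positive rate and 80% statistical power, more than half of all honestly reported results should be negative, about 57% in this example:

    # Hypothetical expected share of negative results under unbiased reporting.
    p_true_null = 0.5           # assumed fraction of studies where the null hypothesis is true
    alpha, power = 0.05, 0.80   # conventional error rates (assumed)

    negative_rate = p_true_null * (1 - alpha) + (1 - p_true_null) * (1 - power)
    print(f"Expected fraction of negative results: {negative_rate:.2%}")   # 57.50%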

The absurdity of this contemporary culture of science is now being unveiled by the growing evidence of failure to reproduce the results of most published research.  The results of new science submitted for publication today, in the vast majority of cases, conform to researchers' preconceived expectations, and always so for those published in high-end journals.  This is good reason to suspect a widespread misappropriation of science.

There is a lot that needs fixing here.


References

Baker M (2015) Over half of psychology studies fail reproducibility test: Largest replication study to date casts doubt on many published positive results. Nature.

Bartlett T (2015) The results of the reproducibility project are in: They’re not good. The Chronicle of Higher Education.  http://chronicle.com/article/The-Results-of-the/232695/

Begley CG, Buchan AM, Dirnagl U (2015) Robust research: Institutions must do their part for reproducibility. Nature.  http://www.nature.com/news/robust-research-institutions-must-do-their-part-for-reproducibility-1.18259?WT.mc_id=SFB_NNEWS_1508_RHBox

Collins FS, Tabak LA (2014) Policy: NIH plans to enhance reproducibility. Nature.

Hayden EC (2013) Weak statistical standards implicated in scientific irreproducibility: One-quarter of studies that meet commonly used statistical cutoff may be false. Nature.

Ioannidis JPA (2005) Why most published research findings are false. PLoS Medicine.

Jump P (2015) Reproducing results: how big is the problem? Times Higher Education. https://www.timeshighereducation.co.uk/features/reproducing-results-how-big-is-the-problem?nopaging=1

Lehrer J (2010) The truth wears off: Is there something wrong with the scientific method?  The New Yorker.  http://www.newyorker.com/magazine/2010/12/13/the-truth-wears-off

Leimu R, Koricheva J (2005) What determines the citation frequency of ecological papers?  Trends in Ecology and Evolution 20: 28-32. 

Nosek et al. (2015) Promoting an open research culture.  Science 348: 1422-1425.

Sarewitz D (2015) Reproducibility will not cure what ails science. Nature.