Friday, June 14, 2019

Why Many Scientific Articles are Wrong

Suppose you are a professional academic who wants to publish a journal article in order to improve your chances of getting a job offer, getting tenure, getting a raise. One way to do so is to produce and write up research that provides support for a novel theory. One problem is that, if the theory is true, it is quite likely that someone else in your field, over the past century or so, has already discovered it and published it, making your result not novel, hence likely to be rejected by the journal you submit it to.

If, on the other hand, your theory is false, the odds are much better that nobody else will have come up with it, found evidence to support it, and published. So if you can produce what looks like good evidence for a false theory, the odds that it will be novel, hence publishable, hence will contribute to your career, are much higher than for a true theory.

How do you produce evidence good enough to be publishable for a result that is not true? 

One solution is a specification search, aka p-hacking. Your theory is that eating onions reduces the risk of Alzheimer's disease. To test it, you find a sample of old people who have been tested for symptoms of cognitive decline and survey them on their dietary habits. As a first crude test, you run a regression with degree of cognitive decline as the dependent variable and estimated previous onion consumption as the independent variable.

Unfortunately, that doesn't work—there is no significant correlation between the two. You rerun the regression, this time doing it separately for men and women. Then separately by race. Then by race and gender. Then limited to people over 80. Then to people over 90. Then making your independent variable not estimated onion consumption but whether they report eating onions frequently, occasionally, or not at all. Then do that version for all your racial and gender categories. Then ...

When you are done, you have run a hundred different regressions, testing different variants of the theory that onions are protective against Alzheimer's. You are gratified to discover that three of them are significant at the .05 level, with the right sign. You pick the best one and publish it: "Cognitive Effect of Self-Reported Onion Consumption on Elderly Afro-American Women."

The fact that a regression result is significant at the .05 level means that, if your theory is not true, the chance that the evidence in favor of it will, by pure chance, be as good as what you found is only .05. It follows that, if your theory is false, a hundred separate experiments can be expected to produce about five that support it at the .05 level.

In this version of the story, the researcher is deliberately trying multiple experiments and only reporting the results that support his theory. The same effect could occur via multiple experiments by multiple researchers. If a hundred different researchers produce one experiment each, all for false theories, about five will show evidence for the theory at the .05 level. 

And get published.
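The arithmetic above is easy to check by simulation. The sketch below is a hypothetical illustration (the function name and sample sizes are my own, not from the post): it runs a hundred studies of a theory that is false by construction, with "onion-eaters" and "non-eaters" drawn from the same distribution, and counts how many clear the .05 bar by pure chance.

```python
import math
import random

def false_theory_study(n=100, rng=random):
    """One 'study' of a theory that is false by construction: both groups
    are drawn from the same distribution, so any significant difference
    between them is pure chance."""
    eaters = [rng.gauss(0, 1) for _ in range(n)]
    non_eaters = [rng.gauss(0, 1) for _ in range(n)]
    mean_diff = (sum(eaters) - sum(non_eaters)) / n
    se = math.sqrt(2 / n)  # standard error of the difference (known variance 1)
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-tailed p-value

rng = random.Random(0)
pvals = [false_theory_study(rng=rng) for _ in range(100)]
hits = sum(p < 0.05 for p in pvals)
print(f"{hits} of 100 false-theory studies significant at the .05 level")
```

Over many runs the count hovers around five, as the argument predicts; any particular run of one hundred may come in a little higher or lower.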


Lawrence Kesteloot said...

Obligatory xkcd:

John C Goodman said...

One of your very best posts.

Ricardo Cruz said...

Very good point, I never realized that.
Furthermore, a lot of research is made by Ph.D. students who have the most to gain to get publications accepted since they have to finish their Ph.D. program (which is 3 years in a lot of countries) and are starting their careers. They also have the least to lose if caught, and even good supervisors cannot follow the experiments too closely. Sometimes it's not even deliberately, you just spend more time trying more combinations of whatever you propose. For example, in AI research, which is a very empirical field (unlike what a lot of people think), it's common to hear people suggesting "just try a different metric" or "try a different dataset".

Anonymous said...

Dear David,

Does the following theory have sufficient documentation to warrant further investigation ? Perhaps an audit of the client accounts (of government funds) by the GAO ?

Have ...Wall Street as the initiators of such oppressive actions which are now being brought home to the U.S. ?

The pasted writing details the precise method Wall Street utilizes the Federal Reserve to embezzle money from the US government. The FRBNY maintains exclusive control over the disbursement and handling of funds from the auction accounts of Treasury securities. The accounts currently handle >$10T annually and have never been audited. They are client accounts; not operational accounts. Ref. 31 CFR 375.3.

It is a tribute to my professor who taught a graduate course of Money and Banking. His inspiration has resulted in a mathematical conclusion from TreasuryDirect documentation, GAO reports, inherent economic parameters, and other sources.

The “ultimate goal” for Wall Street --- to collect on the [fraudulent] $21 trillion U.S. national debt in the manner of Greece --- is identified by others in footnotes 20 and 21.

Perhaps the research may be of interest.
Jim Carter


“What difference does an increase in the National Debt make? We owe it to ourselves.” virtually every congress-critter and every Modern Monetary Theorist has declared. Such a paraphrased statement, reflecting on the exoskeleton structure of the Federal Reserve, ignores the inner historic mechanisms of Rothschild banking, the intense subterfuge and arm-twisting of the Fed’s creation, and the proven destructive forces inherent but hidden therein. 1

The medieval Rothschild Banks established a line of credit for the King provided the King issued a written promise to pay gold, with interest, to the bank at a time in the future. The book-entry Rothschild credit was used to pay for obligations incurred by the king. The credit continued to be circulated in the kingdom between merchants. The bankers sold the king’s interest bearing promise of gold to investors. The promise was renewed by the king on its maturing date and became perpetually rolled-over. 2

VOILA !!! The king made the suppliers of services happy with Rothschild credit; the bankers had the gold from investors; the investors had a promise the king would eventually pay them in gold—which would never happen. 3 Everything went smoothly as long as the bankers could sell the promise and the investors did not demand the gold. 4 As Benjamin Ginsburg has lamented in FATAL EMBRACE; (bankers) AND THE STATE, eventually the schemes, which stole the wealth from the people with book-entry fiat money, would come to a catastrophic climax. 6

The Federal Reserve system, claimed

Continuation and Footnotes are available at Ref. if they are deleted by software [or limitations] or they can be emailed to an address.

Fortnit said...

A striking epiphany in the original article. Besides explaining the current proliferation of daily headlines about absurd fabricated studies, this has implications beyond scientific journals. Again, it is a basic lesson from economics, where the moral of the story is that an incentive creates bad behavior and crime where there was none before. In this case, the "cheat" is able to make up lies and trick others for his own personal gain. What other fields and industries do you think this also happens in?

@Ricardo Cruz - your reply is far beyond the scope of this page. Is there anywhere else where others can discuss that topic? It is extraordinary and frightening how far the corruption goes; the American public wouldn't put up with it if they knew the truth. Unfortunately, the media is also privately owned, and it prefers to start its own incited riots and organized protests over fake and exaggerated news, while real crime is highly censored. It is next to conspiracy theory, where others would mock or outright ignore anyone bringing up the subject. I too have personally been researching the history of fractional banking and the privately owned US Federal Reserve and found many troubling correlations no one else on the internet has found. I am very interested to learn what you know.

Chris said...

The effects of mobile phone electromagnetic radiation seem to be rife with this. There are almost endless variables. Divide animals into mice and rats, male and female - different power strengths, check for literally all types of illnesses and changes in trace element content in teeth, etc. The major study that's cited for blocking 5G rollout by NTB appears to me to be suffering from this, and it is not helped by the fact that it has been peer reviewed by a panel where the majority, but not all, agree that the statistics do actually show a statistical correlation - because they do...

Fortnit said...

@Anonymous Chris, yeah, the reports are endless. While some are obviously good laws, such as banning micro-abrasives in facial cleansers and certain air-conditioning refrigerants, there are other questionable things, such as the banning of straws and the banning of LED street lighting because so-called "research" says they lead to cancer and the death of the planet. It is a danger in the way socialism is, where the politician will pass laws with good will that end up doing extreme damage to citizens. The false reports of these false studies also do damage to society and the world, when their only benefit was to the original author who just wanted to get published and get a raise. There are hundreds of government regulations and agencies created to oversee private industry, but there is no such thing for fake-science reports. It's supposed to be peer-reviewed, but there is little incentive for others to fact check, as it takes more energy to check and review than it is worth to prove. Normally the free hand of economics would settle the truth, but again it is governments that interfere and ban any attempts to oppose it. Remember how marijuana is the Schedule 1 drug that cannot even be tested medically? It is so dangerous to America; we should make it illegal across the globe so that tobacco will have less competition.

Fortnit said...

Is anyone else disappointed that we cannot edit our own posts here at the blogger? We write things in haste only to come back, reread them, and see massive misspellings, bad grammar, and run-ons. I surmise this forum would only attract the most intelligent and like-minded out of the very few who are lucky enough to know it exists, so I want to be clear in sharing my ideas and thoughts.
I wanted to add a few more examples of studies and tests. For years it was touted that coconut oil was healthy and should be used as much as possible despite its high price; now, this year, a new study says it is the most dangerous possible oil to consume, as it is pure saturated fat. And how many times have you seen a product for sale on TV or in a magazine ad, where it makes a statement that you need this type of oil in your daily intake, using its own paid studies or paid lobby groups to sell something? Even infomercials routinely use statistics and medical claims in order to advertise.
In the worst possible case, can't fake studies be used to destroy certain companies or competition, to pass new laws, or to ban certain things, in order to benefit the individuals who knew it was untrue information based on biased studies? Has anyone ever been held accountable for being caught publishing fake or false research studies? It seems there is so much to unfairly gain for very little work, with no risk for any type of evil or sin committed; sounds very much like Communism and Socialism.

Eric Rasmusen said...

A related idea: most business opportunities are unprofitable, because if they weren't, you could pick an opportunity randomly and make money. As a result, if an opportunity looks profitable to you, you probably have positive estimation error, and you'd better take that into account. This boils down to Bayes' Rule. The same goes for research ideas.

"Managerial Conservatism and Rational Information Acquisition," Journal of Economics and Management Strategy (Spring 1992), 1(1): 175-202. Conservative managerial behavior can be rational and profit-maximizing. If the valuation of innovations contains white noise and the status quo would be preferred to random innovation, then any innovation that does not appear to be substantially better than the status quo should be rejected. The more successful the firm, the higher the threshold for accepting innovation should be, and the greater the conservative bias. Other things equal, more successful firms will spend less on research, adopt fewer innovations, and be less likely to advance the industry's best practice.
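Rasmusen's selection effect can be illustrated with a small simulation (all names and numbers below are hypothetical, chosen just for the sketch): give every option the same true value, observe each with noise, and pick the one that looks best. On average the pick overestimates its true value, which is why an apparently profitable opportunity should be discounted.

```python
import random

def apparent_best_bias(n_options=100, noise_sd=1.0, trials=2000, rng=random):
    """Every option has true value 0; each is observed with Gaussian noise.
    Choosing the option with the highest *estimate* selects for positive
    estimation error, so the chosen option looks better than it is."""
    total_overestimate = 0.0
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise_sd) for _ in range(n_options)]
        # estimate minus true value (true value is 0 for every option)
        total_overestimate += max(estimates)
    return total_overestimate / trials

print(f"average overestimate of the apparently best of 100 options: "
      f"{apparent_best_bias(rng=random.Random(0)):.2f}")
```

With 100 options the bias is roughly the expected maximum of 100 standard normal draws, about 2.5 noise standard deviations, even though every option is actually worth the same.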

naivetheorist said...


there is actually one reason that so many scientific research articles are wrong. it is because most scientific research is wrong (at least in my field of theoretical physics). this is a variant of what i heard the head of the science fiction writers association say in response to being asked why 95% of science fiction was garbage; his response was that 95% of everything is garbage.

'naive theorist'

RKN said...

First of all, what's your estimate - how many is "Many"?

Regardless, I doubt your examples account for even a tiny fraction of actual false papers. Competent anonymous peer review would usually reject manuscripts of the kind you describe. If you believe an association study (onions & Alzheimer's) would be published merely because it showed a significant result (p<=0.05), then you've been badly misled about the real high bar scientists must get over to publish their experimental results in credible scientific journals. No doubt some false papers occasionally slip through, but not "many" times, certainly not by way of your cynical examples.

Ricardo Cruz said...

@RKN, p-hacking is a serious problem. Sometimes it's not even intentional. In fact, many researchers themselves are skeptical of the literature in their own field. Some journals now require pre-registration of experiments to mitigate the statistical abuse mentioned by Professor Friedman.

Ricardo Cruz said...

But you're right that examples as absurd as a link between onions and Alzheimer's would be unlikely to be published. I think it was an exaggeration to drive the point home. Usually, the spurious publications are more subtle.

RKN said...


How serious? This is the same thing I asked David, how many? What is his (or your) estimate of how many scientific papers are wrong - 1%, 11%, 70%, what? It's a bold claim and bold claims require strong evidence.

Your cite talks about the problem of reproducibility of results. That problem (and I agree it is a problem) is not the same problem as p-hacking. The latter is an intentional effort to mine for spurious statistical associations and then try to publish them. The failure of one lab to reproduce the results of another lab can happen for a wide variety of reasons. That alone does not mean the published result was spurious or ill-intentioned.

I have some experience in this area, having published in the peer-reviewed literature on proteogenomic approaches to biomarker discovery. I have also served as a peer reviewer for ~12 biomedical journals. In that role, where I was skeptical of a result or its reproducibility, at a minimum I conditioned acceptance of the paper on the authors providing replicates. This is now de rigueur for publication in most credible scientific journals, at least in the biomedical space. (Not saying this alone solves the problem of reproducibility.)

Ricardo Cruz said...

@RKN, Again, p-hacking may not be intentional.

"What is his (or your) estimate of how many scientific papers are wrong - 1%, 11%, 70%, what? It's a bold claim and bold claims require strong evidence."

This paper found evidence of p-hacking in over 50% of papers in several disciplines. When they asked the researchers, over 50% admitted not revealing all statistical tests that were performed.

RKN said...

The term clearly carries a negative connotation. Honest mistakes made in analysis or reporting results deserve a different classification and can be reduced by improving peer review.

I'll need to look more carefully but a quick read of the introduction and figures leaves me skeptical of the 50% claim. Evidently I'm not alone.

Unless I missed it, the followup with researchers was restricted to only the field of psychology (Box 3).

BC said...

So, what should be done? The financial industry's version of the replication problem is that strategies that backtest well often don't subsequently perform well in live trading. The adage in finance is that no one has ever seen a bad backtest (because those backtests don't get shown to anyone). However, when someone invests in a strategy with a good backtest but not good subsequent live results, that person loses money. The result is that very few (experienced) investors will be persuaded to invest by a backtest.

What is the cost to authors, journals, and reviewers that publish a paper which subsequently doesn't replicate? David mentions that authors actually benefit, at least in the short run. Even in the long run, authors probably do not suffer a cost from replication failure unless some gross misconduct is revealed. Journals also do not seem to suffer much cost nor do reviewers who are of course anonymous. That may explain why p-hacking continues.

Also, researchers seem to be rewarded for the absolute number of accomplishments rather than percentage success. For example, if a researcher publishes 10 papers, 9 of which subsequently fail to replicate but 1 of which turns out to be true, the 9 false positives don't take away from the researcher's 1 long-term success. That also seems to incentivize publication of false results, again as long as there is no gross misconduct. By incentivize, I mean that the researchers don't do everything they can to disprove their hypotheses before publication. That doesn't imply misconduct, just that they are not incentivized to intensely self-scrutinize.

BC said...

Maybe someone should establish betting markets for which papers will replicate in the future. In the short term, the markets' implied probabilities of replication are another form of review, one where the reviewers are incentivized to make accurate forecasts. Most likely, such markets would need to be subsidized, both to incentivize market participants --- why would people participate in a zero-sum or negative-sum market? --- and to incentivize researchers to do replication studies. However, we already subsidize research, so why not use some of those funds to support these markets?

David Friedman said...


Your point is essentially the same logic as the winner's curse in auctions.

Several others:

My point does not depend on anyone deliberately cheating, although I expect some people do. You could try multiple versions of your theory in the hope of discovering which one is true, without thinking about the statistical problem of doing so. You could run a single version while other people in the field are running other versions. If they are all nonsense, about five percent will show significance at the .05 level.

My central point is that there is a bias towards testing false theories, because true theories are much more likely to have already been proposed, tested, and confirmed.

I do not know how many scientific articles are wrong, although the replication crisis suggests that it is substantial. My point is not how many but why.

RKN said...

My central point is that there is a bias towards testing false theories, because true theories are much more likely to have already been proposed, tested, and confirmed.

Theories and hypotheses are different things. The aim of most academic publications is to provide experimental evidence to support a hypothesis, a small incremental advance in our understanding of a larger question. I see no reason to believe these hypotheses should be biased toward being false.

I do not know how many scientific articles are wrong, although the replication crisis suggests that it is substantial.

As I alluded to up-thread, there are lots of reasons experimental results in a given publication may fail to be reproduced by other labs. That doesn't mean those publications are evidence of p-hacking, intentional or not.