(Double) Blind Faith

Keywords: double-blind trial, dropout rate, drug testing, methodology, peer review, placebo effect, publication bias, scientific fraud.

FRED LEAVITT

Psychology Department, Cal State University, Hayward, CA, USA (fleavitt@bay.csuhayward.edu)

Received June 4, 2003; published June 6, 2003

Every research method is imperfect. Each method has possibly incorrect assumptions built in, and plausible alternatives invariably exist for any interpretation of data. For example, many plausible alternatives can be imagined in clinical studies when people receive a treatment and then experience a change in their condition. Three obvious ones are that (1) the condition treated would have changed naturally with the passage of time (most people recover from the common cold even if untreated); (2) the treatment had no specific effects and acted only by modifying the recipient's expectations (the placebo effect); and (3) the posttreatment change occurred by chance (some heavy smokers live to be 90).

Good scientists maximize the probability that their explanation is correct by eliminating as many plausible alternatives as possible. So they typically use not one but many subjects and randomly assign some to a control group. Experimental and control subjects are treated exactly the same except for the treatment itself. Thus, subsequent differences between the groups can reasonably be attributed to the treatment. To prevent bias, researchers do not let subjects know which group they are in; nor do they themselves know until after the data have been collected. The procedure, called the double-blind experiment, is the most rigorous design available to most medical and psychological researchers. It has become the standard design in many fields.
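
To make the logic concrete, here is a minimal simulation of a randomized, blinded comparison (a sketch with hypothetical effect sizes, not data from any study). Both groups are subject to the same natural recovery and expectation effects, so the difference in group means isolates the specific effect of the treatment.

    # Minimal sketch of randomized assignment (all effect sizes hypothetical).
    import random

    random.seed(42)

    def run_trial(n_subjects=200, treatment_effect=1.0):
        subjects = list(range(n_subjects))
        random.shuffle(subjects)  # random assignment to groups
        treated = subjects[:n_subjects // 2]
        control = subjects[n_subjects // 2:]

        def outcome(gets_drug):
            improvement = random.gauss(2.0, 1.0)   # natural recovery with time
            improvement += random.gauss(0.5, 0.5)  # expectation (placebo) effect
            if gets_drug:
                improvement += treatment_effect    # specific effect of the drug
            return improvement

        treated_mean = sum(outcome(True) for _ in treated) / len(treated)
        control_mean = sum(outcome(False) for _ in control) / len(control)
        return treated_mean - control_mean  # estimates the specific effect

    print(f"Estimated treatment effect: {run_trial():.2f}")

Because every alternative influence (passage of time, expectation, chance variation) acts on both groups alike, a persistent between-group difference is hard to attribute to anything but the treatment.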

During a long teaching career I've explained to thousands of students in introductory psychology, research methods, and critical thinking classes that results from double-blind studies published in peer-reviewed journals are the most trustworthy type of evidence. I've written textbooks praising the double-blind design as the best way to get clear answers to various research questions. I've taught students to disregard their feelings and intuitions as methods for evaluating the effectiveness of treatments and urged them to devalue their experiences.

But I've recently become an apostate, although still aware that publishing in a peer-reviewed journal remains the best way for scientists to persuade colleagues of the validity of their work. However, intelligent laypeople can make informed judgments about treatments administered to them and people they know. Laypeople may be unaware of the need for comparison subjects and other controls, but they are motivated primarily by hopes of benefiting. They seek the truth. Somebody who takes a drug designed to increase height and grows 6 inches during the following week can reasonably assume that the drug was at least partly responsible (unless he is a baby giraffe). If, after failing to improve for many years, a subject receives a new treatment and improves dramatically, she should consider the treatment at least partly responsible.

Unlike laypeople, researchers have complex motives. Prestige, job security, and commitment to a particular theory or perspective often play important roles in how scientists collect and interpret data. Huge financial incentives may create a strong inducement to improperly manipulate results. Clinical researchers often receive subsidized international trips, board memberships, speaking fees, and consulting deals from drug companies; some researchers earn more than $1 million per year from such arrangements. Continued funding is contingent on results that cast company products in a favorable light. Newspaper and television reporters have their own reasons for exaggerating research findings. Thus, significant plausible alternatives exist to inferences from media accounts of double-blind experiments in peer-reviewed journals.

Methodological concerns: Single studies

Editors of scientific journals send submitted manuscripts to experts to determine worthiness for publication. This peer review process was designed to screen out manuscripts that fail to meet minimal methodological standards. It has not worked as intended. Participants at international conferences have pointed out many inadequacies. Reviewers are unpaid and do not always take their work seriously. The reliability of their assessments is low, they miss major flaws, and they are strongly influenced by the extent to which manuscripts support their theoretical biases. Harvard physician John Darsee published nearly 100 papers in two years, most based on forged data. All but two papers had errors detectable from the text alone, and 12 of them had at least ten errors. Yet his papers passed the scrutiny of reviewers and editors.

Thornley and Adams rated the methodological quality of 2,000 journal reports, spanning fifty years, of clinical trials on people with schizophrenia, serious or chronic mental illness, psychosis, or movement disorders. Only twenty trials received the maximum rating of 5, slightly more than one-third received a score of 3 or more, and 1,280 scored 2 or less. Scores did not improve with time: from 1950 to 1997 the mean quality score was consistently under 2.5. (1) Bogardus and colleagues reported that 38% of articles on molecular genetics published in prestigious medical journals were methodologically inadequate. (2)

Journal editors and reviewers virtually never see the raw data that provide the foundation for articles; in an unknown but probably substantial number of instances, the data or the computations on them are inaccurate. For example, when psychologist Leroy Wolins requested data from 37 authors of published studies, only seven complied. Three of the data sets had substantial errors. (3)

Many researchers with limited training in statistics use the wrong tests for their data. Others interpret nonsignificant differences between groups to mean that the groups are statistically equivalent, which is a claim that the null hypothesis is true. It is axiomatic among statisticians that the null hypothesis cannot be proven. Reuven Dar and his colleagues documented a variety of common, serious statistical errors and concluded that much current research that uses statistical tests is flawed. (4) McGuigan (5) found statistical errors in 40% of articles in the British Journal of Psychiatry, and Welch and Gabbe (6) reported a 19% error rate in use of statistics in the American Journal of Obstetrics and Gynecology.
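
A small simulation shows why a nonsignificant result cannot establish the null hypothesis (a sketch with assumed numbers: a real half-standard-deviation effect and only ten subjects per group). Underpowered studies miss real effects most of the time.

    # Sketch: a real effect, yet most small trials fail to reach p < 0.05.
    import math
    import random
    from statistics import mean, stdev

    random.seed(0)

    def nonsignificant(n=10, true_diff=0.5):
        a = [random.gauss(0.0, 1.0) for _ in range(n)]
        b = [random.gauss(true_diff, 1.0) for _ in range(n)]
        se = math.sqrt(stdev(a) ** 2 / n + stdev(b) ** 2 / n)
        t = (mean(b) - mean(a)) / se
        return abs(t) < 2.1  # approximate two-sided critical value at p = 0.05

    misses = sum(nonsignificant() for _ in range(2000))
    print(f"Real effect, yet nonsignificant in {misses / 2000:.0%} of trials")

With these numbers, roughly four trials in five come out nonsignificant; concluding "no difference" from any one of them would simply be wrong.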

Researchers who test multiple end points increase the probability that at least one statistically significant result will appear as a false positive. Dar and colleagues reported that more than 75% of studies with multiple end points published in the Journal of Consulting and Clinical Psychology did not use a necessary compensation procedure. (4) The authors of one study tested 857 end points. (7)
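
The arithmetic of multiple end points is simple (the numbers below are illustrative): with twenty independent tests and no true effects, a "significant" finding is more likely than not. Dividing alpha by the number of tests (the Bonferroni correction, one standard compensation procedure) restores the intended error rate.

    # Sketch: false-positive inflation with 20 end points, and one correction.
    alpha, k = 0.05, 20

    p_any = 1 - (1 - alpha) ** k
    print(f"P(at least one false positive over {k} tests): {p_any:.2f}")  # ~0.64

    corrected = alpha / k  # Bonferroni-adjusted threshold
    p_any_corrected = 1 - (1 - corrected) ** k
    print(f"After Bonferroni (alpha = {corrected}): {p_any_corrected:.3f}")  # ~0.049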

The end points often lack validity. Some variables cannot be measured easily, so surrogate end points are used instead. For example, high blood pressure is a surrogate end point for subsequent heart disease. A good surrogate end point correlates strongly with the more important clinical end point and can reduce the length and cost of a research project. Much can be said for streamlining research to reduce cost, but it is both scientifically unsound and dangerous to approve treatments solely on the basis of their effects on surrogate end points. Just because a surrogate normally correlates strongly with a clinical outcome does not ensure that a treatment-induced change in the surrogate will produce a corresponding change in the clinical outcome. For example, obesity is a risk factor for many diseases. The drug dexfenfluramine helps people lose weight, so the FDA approved dexfenfluramine on the grounds that weight loss was a good surrogate end point for improvement in health. But dexfenfluramine caused side effects that outweighed the benefits of weight loss and was eventually withdrawn.

The number of end points used to evaluate treatments for psychological disorders runs into the hundreds. Practitioners don't agree on which end points are best, and consumers should be wary. Whereas Freud sought to help his patients gain greater self-understanding, many modern therapists try to foster positive illusions in their patients. The two goals are largely incompatible. Almost all U.S. programs for treating drug abuse have total abstinence as the goal, but a few de-emphasize reductions in drug use in favor of promoting their clients' overall ability to function in society. Comparisons of program effectiveness are difficult when one program measures success by global lifestyle changes and another by cleanliness of urine.

Many researchers focus only on data that supports their claims. Spiro described a study on a drug for peptic ulcers: "At the end of two weeks, the ulcer crater had healed in more than half the patients given the active drug and in only a third of patients taking placebos; that one observation point provided the desired statistical significance to permit the claim that the active drug, in this case cimetidine, 'speeded the healing' of peptic ulcer. . . . But at every other period of assessment, cimetidine and the placebo proved equally effective. . . . " (8)

In the 1980s, pharmaceutical giant Upjohn sponsored a large-scale study to assess the efficacy of Xanax in treating panic disorder. (9) Even before all the data were in, Upjohn used the research to promote Xanax at company-sponsored conferences and symposia and in supplements to journals sent to thousands of psychiatrists. The results were then published in four separate articles in the May 1988 issue of the Archives of General Psychiatry (which accepts advertising for Upjohn drugs). The results, which were promoted in ads for Xanax, indicated that panic attack sufferers who took Xanax were much more likely to be free of panic attacks than those given placebo. The ads were true as far as they went. But they did not mention three key points:

The drug phase of the study was supposed to last eight weeks. Many subjects dropped out before the end.

Although the Xanax recipients had fewer panic attacks during the first four weeks than those given placebo, by the eighth week the two groups were virtually identical. (The ads cited only the four-week results.)

At the end of the eight weeks the Xanax subjects were gradually withdrawn from the drug. They suffered rebound panic attacks at a slightly greater frequency than they had been experiencing at the beginning. They were worse off. By contrast, the number of panic attacks in placebo subjects declined significantly throughout the study.

Methodological concerns: Publication bias

Corporate support for U.S. science increased from less than $5 million in 1974 to hundreds of millions of dollars in the early 1990s, and 70% of the money for testing drugs comes from industry. Funding sources can exert great pressure on scientists to suppress or minimize inconclusive results and reports of adverse effects. Easterbrook, Dickersin, and their colleagues analyzed the fates of approved research proposals. Of 285 studies approved by one committee between 1984 and 1987 that had been completed by 1990, 138 had been published. A smaller proportion of drug company-sponsored studies was published than of studies supported by government or voluntary organizations. The analysts cited data management by the sponsoring companies as a major reason for nonpublication. (10, 11)

Researchers, reviewers for journals, and journal editors justifiably prefer positive to negative studies. A discovery that improved memory, cured cancer, or enabled people to fly would be newsworthy; a "discovery" that failed to do anything would not be. But the asymmetry, called publication bias, has serious consequences. Suppose that fifty published studies indicate that recipients of a certain treatment do 25% better than controls, whereas fifty equally well-done studies produce negative results and hence go unpublished. A review of the published research will overestimate the treatment effect at 25%. Although sixteen published studies indicated a significant survival advantage of combination chemotherapy, the advantage was found to be illusory when several unpublished studies were located and added to the database. Publication bias in reviews of studies on the treatment of obesity resulted in substantial overestimates of treatment effectiveness.
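
The fifty-study example above can be written out directly (the 25% figure is the article's hypothetical, not data from any review):

    # Sketch: how the file drawer inflates a literature-based estimate.
    published = [0.25] * 50    # positive studies that reached journals
    unpublished = [0.0] * 50   # equally sound null studies, never published

    literature = sum(published) / len(published)
    all_studies = sum(published + unpublished) / (len(published) + len(unpublished))

    print(f"Effect inferred from the published literature: {literature:.0%}")   # 25%
    print(f"Effect over all studies actually conducted:    {all_studies:.1%}")  # 12.5%

A reviewer who sees only the journals concludes that the treatment is twice as effective as it actually is.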

The problem is exacerbated by the large dropout rate from individual studies, often as high as 50% of the subjects. People are likely to drop out if a treatment is producing unpleasant side effects or not working. About 50% of clinical trials funded by the drug industry are negative and go unpublished.

Some researchers conduct a single study and then publish their results in more than one journal. Reviewers may be fooled into thinking the studies are independent, especially if different authors are listed. The result is an overestimation of treatment effects. Huston and Moher identified twenty articles describing randomized, double-blind trials of the antipsychotic drug risperidone. But all the data came from seven small and two large trials. Huston and Moher wrote, "Multiple renditions of the same information is self-serving, wasteful, abuses the volunteer time of peer reviewers, and can be profoundly misleading; it brings into question the integrity of medical research." (12)

Scientific fraud

As can be inferred from the above, the line between unintentional error and fraud is often blurred. Gould documented many instances of bias, some possibly unintentional and some clearly intentional, perpetrated by leading psychologists in the intelligence-testing field. (13) Rosenthal found an error rate in published data of about 1%, with more than 70% of the errors favoring the researcher's hypothesis. (14) Write-ups do not always reflect the actual conduct of double-blind experiments. Somebody must keep a record of how subjects were assigned to treatments. Schulz and his colleagues visited laboratories and found that, when the codes indicating group assignment were poorly concealed from the primary researcher, the experimental treatment was reported effective 30% more often than when codes were kept strictly confidential. Schulz quizzed 400 researchers after promising them anonymity. More than half admitted to opening unsealed envelopes containing the assignments, cracking simple codes meant to hide the identity of the two groups, searching for a master list of codes, or holding sealed envelopes up to the light. (15)

Subjects are often assigned to groups nonrandomly, which undermines the purpose of the double-blind, introduces serious bias, and leads to inflated measures of effectiveness. But summaries of the research nevertheless falsely use the word randomized. Two studies identified by a MEDLINE search as "randomized controlled trials" had one patient each. (16)

Barnes and Bero identified 106 reviews of the health effects of passive smoking published from 1980 to 1995. Of the 39 reviews that concluded that passive smoking is not harmful to health, 31 had been written by authors with affiliations to the tobacco industry. Three-quarters of the articles failed to disclose the sources of funding for the research. The authors inferred that "the tobacco industry may be attempting to influence scientific opinion by flooding the scientific literature with large numbers of review articles supporting its position that passive smoking is not harmful to health." (17)

Scientists who commit fraud are unlikely to be caught. They restrict access to their laboratories and show raw data only to collaborators and friends. Dishonest psychologists, pharmacologists, and biologists have additional cover, because biological organisms are variable. Thus, identical procedures do not always give identical results, and failures to replicate can be explained away. Fraud occurs most frequently in the biomedical sciences, not in physics, astronomy, or geology.

A conference held under the auspices of the National Depressive and Manic-Depressive Association (NDMDA), a patient advocacy group, concluded with a lengthy consensus statement that began as follows: "There is overwhelming evidence that individuals with depression are being seriously undertreated. Safe, effective and economical treatments are available. The cost to individuals and society of this undertreatment is substantial." (18)

The "overwhelming evidence" came from double-blind experiments. But most were conducted by scientists paid by drug companies. Before calling your friendly psychiatrist, consider that the conference was funded by Bristol-Myers Squibb, the maker of the antidepressant nefazodone. Cochair Dr. Martin Keller was on the Executive Committee of the Scientific Advisory Board of the NDMDA, and he earned about $1 million over a two-year period from drug companies that market antidepressants.

After several cases of scientific fraud had been documented and analyzed in the early 1980s, establishment spokespeople grudgingly acknowledged that scientists cheat occasionally but added that the vast majority are accurate and truthful. Cheaters were considered aberrant and as rare as serial killers. But in 1992, in response to concerns about fraud, the government established the Office of Research Integrity (ORI). Since then, ORI has found more than 100 scientists guilty of misconduct. (ORI deals only with cases involving research supported by U.S. Public Health Service funds or applications for such funds.) Broad and Wade noted the powerful incentives to cheat and the very small risk of getting caught and guessed that for each case of major fraud uncovered (several per year in recent years), about 100,000 major and minor ones go undetected. (19) Philip Fulford, the editor of the Journal of Bone and Joint Surgery, said, "Fraud now seems to be endemic in many scientific disciplines and in most countries. Recent cases have attracted media attention, but these are probably only the tip of the iceberg." (20)

Generalizability of results

Results from tests on mice and even single-celled organisms often accurately predict effects on humans. However, to maximize generalizability, subjects in treatment evaluation studies should be similar to the patients who will eventually be treated. But test subjects are typically younger and healthier. Although 63% of people in the United States with cancer are over 65, only 25% of subjects in tests of cancer drugs are over 65. Bodenheimer cited a study of drugs used primarily by elderly patients. Even though the drugs produce a high incidence of side effects in the elderly, only 2.1% of the test subjects were 65 years of age or older. (21)

Some researchers exclude patients who have ever taken medication to treat their conditions, for those are the patients least likely to be helped by the test drug. Had they responded to prior treatments, they would probably not have volunteered for a new study. Zimmerman and colleagues evaluated 346 patients with major depression for coexisting conditions such as suicide risk, unstable medical illnesses, eating disorders, obsessive-compulsive disorder, panic disorder, anxiety disorder, and a history of manic episodes or drug abuse. Any one condition is sufficient to exclude a person from most clinical trials of antidepressants (ADs). About 85% of the 346 patients would not have been eligible for inclusion in most clinical trials. Yet more than 90% of them for whom prescribing information was available were being treated with ADs. The authors concluded that subjects in AD trials represent a minority of patients treated for major depression in routine clinical practice. The effectiveness of ADs for most of the patients who receive them is unknown. (22)

People with only mild depression are also typically excluded, for drug company researchers worry that they will respond as well to a placebo as to the drug. (Depression is the only condition for which people with the disorder are routinely excluded from clinical trials because they are not sick enough.)

Experimenters routinely initiate studies of ADs and certain other psychiatric drugs by giving placebos for seven to ten days to all potential subjects. Those who markedly improve are disqualified from further participation. The washout procedure eliminates placebo responders, who constitute a significant portion of the psychiatric population and would be expected to show the smallest drug/placebo differences.

The number of subjects in clinical trials of drugs rarely exceeds 4,000, which is too small to reliably detect rare adverse reactions. To have a 95% chance of detecting an adverse reaction that affects one per 10,000 patients, 30,000 subjects would be needed. Personal experiences can reveal more than published reports on a limited number of subjects. A person whose sibling reacts adversely to a drug would be wise to avoid it even if dozens of peer-reviewed reports document its safety. Kalow described a particularly tragic case in which a young boy died following general anesthesia. Some time later his sister required an operation, and the anesthesiologist assured the parents that the boy's death had been an extremely rare occurrence. He said that no special precautions were needed. He was mistaken, and she died under the same circumstances. (23)
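
The 30,000 figure follows from simple probability (a back-of-the-envelope check, not a formal power calculation): the chance of observing at least one event of probability p among n independent patients is 1 - (1 - p)^n, and solving for a 95% chance gives roughly 30,000.

    # Sketch: subjects needed for a 95% chance of observing at least one
    # adverse reaction whose incidence is 1 in 10,000.
    import math

    p = 1 / 10_000
    n = math.log(1 - 0.95) / math.log(1 - p)
    print(f"Subjects needed: {math.ceil(n):,}")  # about 29,957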

Generalizations are made along other dimensions besides subjects. Data are collected at a particular time and place with particular instructions to subjects and specific equipment. Variables are measured in specific ways. Patients in clinics are generally told which treatment they are getting, whereas volunteers in experiments should (if proper methodological and ethical procedures are followed) be told that they may get the treatment but may be put in the control group. Even a seemingly trivial change in procedure can profoundly affect results. Weller and Livingston asked college students to read three vignettes, each describing a murder or rape. Then the students answered questions about their reactions to the crimes. The vignettes were identical for all students, but the questionnaires were printed on pink, blue, or white paper. Several differences were found in anger aroused by the crimes and judgments about probable guilt and appropriate punishments, depending on the color of the paper used. (24)

The media

Most people don't read original articles in scientific journals -- they learn about the articles from the media, which have a huge impact on laypersons' views of science. Surveyors found that 64% of people get their information about cancer prevention from magazines, 60% from newspapers, and 58% from television. (25) Most probably assume that science stories are bias free. But scientific discoveries and controversies often have political or economic ramifications, and media outlets with leftist, rightist, business, or environmental leanings slant their coverage accordingly. Brian Trench concluded, after comparing reports of the same story in major Irish, French, German, Spanish, and British newspapers, that readers of different papers might not realize they were reading about the same research. (26)

Articles published in scientific journals are read primarily by trained professionals, whereas stories about the articles in newspapers and magazines and on television reach a much greater audience. Conventional wisdom is not newsworthy, so the media play on public fears while emphasizing the frontiers and fringes of science. Reporters with limited scientific training and under severe time constraints sometimes use just one source and thus present just one point of view. They fail to distinguish among preliminary findings, conference reports, small pilot studies, and more substantial works. They compress lengthy, complex articles into brief stories or sixty-second time slots, omitting important details in the process. They ask interviewees for definitive answers, which leads to speculations beyond what the data warrant.

Vendors of scientific products capitalize on reporters' quotas and deadlines. The vendors' public relations departments produce and distribute press kits including entire scripts for television science reporters. Their dramatic, slick video news releases are essentially lengthy commercials but are more compelling than most reporters' best efforts. So they are often shown in their entirety, masquerading as news. Schwitzer cited a survey of 2,500 editors and reporters that found that 90% of ideas for health articles had originated with a public relations person. (27)

A research team led by Ray Moynihan analyzed more than two hundred stories on new drugs that had been published in newspapers or presented on television. (28) Most of the accounts overstated benefits, did not mention risks, and ignored the source of funding. Forty percent of the reports that quoted an expert failed to mention that the expert had financial ties to the drug's manufacturer.

Conclusions

Unless an ulterior motive is obvious, as when a scientist testifies as an expert witness in a courtroom and favors the side paying him, scientists are largely exempt from accusations of playing loose with the truth. By contrast, when salespeople expound on their products' virtues, or lawyers proclaim their clients' innocence, or politicians tell us that all hope of global peace and prosperity will vanish unless they are elected, we recognize that their arguments may be colored by self-interest. But we should not exempt scientific research from skepticism. Methodological deficiencies are common, and crucial aspects of many studies are directed by parties motivated by considerations other than the search for truth. Personal experiences that conflict with media accounts of double-blind experiments should not be ignored. There are often plausible alternative explanations for the experimental results.

References

(1) Thornley, B. and C. Adams. 1998. Content and quality of 2000 controlled trials in schizophrenia over 50 years. Br. Med. J. 317:1181-1184.

(2) Bogardus, S. et al. 1999. Clinical epidemiological quality in molecular genetic research: the need for methodological standards. J. Am. Med. Assoc. 281:1919-1926.

(3) Wolins, L. 1962. Responsibility for raw data. Am. Psychol. 17:657-658.

(4) Dar, R. et al. 1994. Misuse of statistical tests in three decades of psychotherapy research. J. Consult. Clin. Psychol. 62:75-82.

(5) McGuigan, S. 1995. The use of statistics in the British Journal of Psychiatry. Br. J. Psychiatry. 167:683-688.

(6) Welch, G. and S. Gabbe. 1996. Review of statistics usage in the American Journal of Obstetrics and Gynecology. Am. J. Obstet. Gynecol. 175: 1138-1141.

(7) Gotzsche, P. 1989. Methodology and overt and hidden bias in reports of 196 double-blind trials of nonsteroidal anti-inflammatory drugs in rheumatoid arthritis. Controlled Clin. Trials 10:31-56.

(8) Spiro, H. 1997. Clinical reflections on the placebo phenomenon. In The Placebo Effect: an Interdisciplinary Exploration. Ed. A. Harrington. Harvard University Press, Cambridge, MA, pp 37-55.

(9) Anonymous. 1993. High anxiety. Consumer Reports, January, pp 19-24.

(10) Easterbrook, P. et al. 1991. Publication bias in clinical research. Lancet 337:867-872.

(11) Dickersin, K. et al. 1992. Factors influencing publication of research results: follow-up of applications submitted to two institutional review boards. J. Am. Med. Assoc. 267:374-378.

(12) Huston, P. and D. Moher. 1996. Redundancy, disaggregation, and the integrity of medical research. Lancet 347:1024-1026.

(13) Gould, S.J. 1981. The mismeasure of man. W.W. Norton, New York.

(14) Rosenthal, R. 1966. Experimenter effects in behavioral research. Appleton-Century-Crofts, New York.

(15) Schulz, K. 1995. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. J. Am. Med. Assoc. 273:408-412.

(16) Bero, L. and D. Rennie. 1996. Influences on the quality of published drug studies. Int. J. Technol. Assess. Health Care 12:209-237.

(17) Barnes, D. and L. Bero. 1998. Why review articles on the health effects of passive smoking reach different conclusions. J. Am. Med. Assoc. 279:1566-1570.

(18) Hirschfield, R. et al. 1997. The National Depressive and Manic-Depressive Association consensus statement on the undertreatment of depression. J. Am. Med. Assoc. 277:333.

(19) Broad, W. and N. Wade. 1982. Betrayers of the truth. Simon and Schuster, New York.

(20) Fulford, P. 1998. Fraud and plagiarism. In The COPE Report. http://www.bmj.com/misc/cope/tex6.shtml.

(21) Bodenheimer, T. 2000. Uneasy alliance: clinical investigators and the pharmaceutical industry. N. Engl. J. Med. 342:1539-1544.

(22) Zimmerman, M. et al. 2002. Are subjects in pharmacological treatment trials of depression representative of patients in routine clinical practice? Am. J. Psychiatry 159:469-473.

(23) Kalow, W. 1967. Pharmacogenetics and the predictability of drug responses. In Drug Response in Man. Eds. G. Wolstenholme and R. Porter. Little, Brown, Boston.

(24) Weller, L. and R. Livingston. 1988. Effect of color of questionnaire on emotional responses. J. Gen. Psychol. 115:433-440.

(25) Nelkin, D. 1987. Selling science. W.H. Freeman, New York.

(26) Lowe, I. 1998. Tell it like it is. New Scientist, October 24.

(27) Schwitzer, G. 1992. The magical mystery media tour. J. Am. Med. Assoc. 267:1969-1971.

(28) Moynihan, R. et al. 2000. Coverage by the news media of the benefits and risks of medications. N. Engl. J. Med. 342:1645-1651.

About the author

Fred Leavitt received his Ph.D. in psychopharmacology from the U. of Michigan. Since 1970, he has taught at Cal State U. in Hayward, California. He has been a visiting professor at universities in Nairobi, Kenya; Vancouver, Canada; Palmerston North, New Zealand; Bushey, England; Utrecht, the Netherlands; and Istanbul, Turkey. He is the author of five textbooks and The REAL Drug Abusers, which has just been published by Rowman & Littlefield.

