* yellowbrickstats.com *

** ~ a loose collection of statistical and quantitative research material for fun and enrichment ~ **

how significance tests distort our thinking


Hypothesis testing is admirably suited for the allaying of insecurities which might have been better left unallayed.

Jerome Cornfield

In my first few years studying statistics, I thought significance tests were fascinating, and remarkably useful. How nifty that one could take a real-life result in practically any field and test it to see how likely it would be to occur by chance alone. And I eagerly learned all manner of methods for conducting these tests -- to compare group averages, to evaluate runs or streaks, to assess the strength of associations between variables, and so on.

In those days I never imagined I would become so disenchanted with the way people use these tests. But I find that people overwhelmingly focus on significance/nonsignificance and fail to address the natural, more important next question, of How much? or How strong? or How different? It's remarkable how often people stop at the significance test and say nothing about the effect size, the parameter of interest. All over the mass media, in the vast majority of articles reporting quantitative results, this kind of thinking impoverishes discourse, in ways that might not be apparent at first glance.

Just a few examples of such articles from a sea of parametrically-challenged, effect-size-barren reports that washes over us:

- "Key study links childhood weight to adult heart disease"^{1} takes up almost 1/4 page of a major newspaper. It provides all manner of details about the study in question as well as portions of an interview with the investigators, but we never learn how much more likely heart disease is among the obese. We are left to imagine whether it is 60% more or 2% more, and as I'll explain below, we usually do a poor job of guessing.

- "Study counters seafood advice: Finds eating fish in pregnancy cuts the risk of low IQ" offers excellent detail on the study's methods and sample characteristics; it presents alternative speculations about why this connection might exist; but it never tells by how much fish cuts the risk of low IQ.^{2}

- "Disappointing jobs data weighs on stocks" is the headline, followed by the explanation: "A surprisingly poor signal on the jobs market sent stocks lower yesterday as investors remained worried about a lack of hiring." (The drop was 1/20 of 1%.)^{3}

I'd like to go into one such article in more detail.

"News flash: New research concludes that the sensationalism sweeping local news is bad for ratings"

So began an extensive piece covering nearly 1 1/2 pages of a Boston Sunday Globe.^{4} I'll grant that this author, after exploring the nuances, implications, and complications involved in the (unquestioned) finding about TV news ratings differences, finally did reveal the magnitude of this difference -- in a sidebar, paragraph 34 of 35. Before I present that difference, I'd like you to form a picture in your mind of what it might look like. Try to imagine two bell curves, one for the ratings of quality news stories and one for those of sensational stories. One curve would be shifted to the right of the other, reflecting its significantly higher average.

***

The answer: the "strong correlation between high quality scores and high ratings" seems to have translated into a mean difference of .25 ratings points, or 1/4 of 1% of viewing households. My graph below is true to the size of this difference, with the dotted lines showing the group means; the data depicted are simulated.

Yes, the upper and lower halves look the same. There is much, much more variability within each group than between the two groups. Now, if you ask why I haven't labeled the numbers on the y-axis, well, that gets exactly to my point. For the magnitude and thus the meaningfulness of the group difference, it doesn't matter whether the y-scale runs from 0 to 10 or 0 to 10,000. The distributions' shapes -- the look of the group difference -- would be the same.^{5}
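To put rough numbers on that picture, here is a minimal Python sketch. The 0.25-point mean difference is the figure quoted above; the within-group standard deviation of 2.5 ratings points is purely an assumption for illustration (no spread is reported here), and the variable names are hypothetical.

```python
from statistics import NormalDist

# Figure quoted in the essay: a mean difference of 0.25 ratings points.
MEAN_DIFF = 0.25
# ASSUMPTION: the within-group spread is not reported; 2.5 ratings points
# is a hypothetical value chosen only to illustrate the idea.
ASSUMED_SD = 2.5

quality = NormalDist(mu=MEAN_DIFF, sigma=ASSUMED_SD)
sensational = NormalDist(mu=0.0, sigma=ASSUMED_SD)

# Standardized difference (Cohen's d): the mean difference in SD units.
cohens_d = MEAN_DIFF / ASSUMED_SD

# Share of area the two bell curves have in common (1.0 = identical curves).
overlap = quality.overlap(sensational)

# Chance that a randomly chosen "quality" value exceeds a randomly chosen
# "sensational" value. The difference of two independent normals is normal
# with mean MEAN_DIFF and SD equal to sqrt(2) * ASSUMED_SD.
difference = NormalDist(mu=MEAN_DIFF, sigma=ASSUMED_SD * 2 ** 0.5)
prob_quality_higher = 1.0 - difference.cdf(0.0)

print(f"Cohen's d:            {cohens_d:.2f}")
print(f"curve overlap:        {overlap:.1%}")
print(f"P(quality > sensat.): {prob_quality_higher:.1%}")
```

Under that assumed spread the two curves share roughly 96% of their area, and a randomly picked quality story out-rates a randomly picked sensational one only about 53% of the time. Because only the ratio of the difference to the spread enters the calculation, rescaling the axis leaves these numbers unchanged.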

***

People tend to overestimate the magnitude of differences or correlations that are reported merely as "significant." The mind's tendency -- need, really -- to create patterns has been extremely well documented by anthropologists and psychologists. You'll see this in the excellent literature on perception of statistics and probability, exemplified by the writings of Tversky, Kahneman, Oakes, and Gigerenzer. If we are given a small tidbit such as "this is greater than that," we naturally cook up a convincing picture involving a substantial difference. If our mind's eye happens to create paired histograms in response to such a factoid, well, the distributions don't overlap very much. Similarly, when we hear in shorthand about a correlational finding such as "fish diet linked with higher IQ," we don't imagine some barely visible correlation such as 0.2, but something that would clearly show up on a scatterplot -- a 0.7 or higher. I can't prove this last point, but I'd be willing to bet. (And I'm interested to hear about any findings you may have encountered on this question.) I surprise myself by how often I fall into the same kind of thinking, even about my own research results. I'll come out with some finding, say, a correlation of .25; I'll begin a conversation soberly characterizing it as "a slight but statistically significant correlation"; and within 10 minutes I'll have slid into taking it for granted as a "meaningful" or even an "important" connection.
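One conventional way to see how far apart a 0.2 and a 0.7 really are is to square them: the squared correlation is the share of variance in one variable that a straight-line relationship with the other accounts for. The short sketch below does just that; the function name is hypothetical, and squaring is only one of several possible yardsticks.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance linearly accounted for by a correlation r."""
    return r ** 2

# The three correlations mentioned in the paragraph above.
for r in (0.2, 0.25, 0.7):
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r):.1%}")
```

By this measure a correlation of 0.2 corresponds to 4% shared variance and 0.7 to 49%, so the imagined relationship is about twelve times stronger than the reported one.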

***

Sometimes, our training in significance testing can lead to a curious contradiction in thinking. Very informally, I asked the following question of five experienced educational/social science researchers I know:

What are the approximate odds that Candidate A will beat Candidate B if, in a quintessentially scientific poll, Candidate A leads by 1 percentage point, with a margin of error of +/- 5 points for the lead?

A) The odds are even. The sample lead is not statistically significant and therefore we should discount it.

B) 6 to 5

C) 3 to 2

D) 2 to 1

Most of us are familiar with the kind of reasonable thinking that goes, "without a significant result, we can't put too much faith into there being much of a lead, and we couldn't reliably call it greater than zero." From there, it's unfortunately not such a big slide into "without a significant result, we can't say anything about what the lead is." In other words, if our p-value is inconclusive enough (large enough), by this latter distortion we couldn't even say whether Candidate A had anything other than a 50-50 chance of winning. Whereas, if we merely and naively noted the small lead that we observed in the sample, we would strongly tend to believe she has at least a somewhat better chance. Most people untainted with hypothesis testing would affirm this. Who would be willing to give Candidate B even odds in a bet? Or, more telling, to give even odds across a large number of comparable situations? Those who would are under the sway of the same kind of thinking that produces that especially insidious phrase, "statistical dead heat."

But many a competent introductory statistics student could

- start with the observed lead of +1;
- draw a bell curve with that as best estimate and 2.5 (half the margin of error, assuming a "default" CI_{95}) as the standard deviation for the estimate; and
- show that in the resulting curve about 2/3 of the area is to the right of zero.
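The three steps above can be sketched in a few lines of Python. This is only an illustration under the stated assumptions: the +/- 5 margin is read as a 95% interval, the estimate is treated as normally distributed, and the function and constant names are hypothetical.

```python
from statistics import NormalDist

OBSERVED_LEAD = 1.0    # Candidate A's lead, in percentage points
MARGIN_OF_ERROR = 5.0  # +/- 5 points on the lead, read as a 95% interval


def prob_lead_positive(lead: float, sd: float) -> float:
    """Area to the right of zero under a bell curve centered on the lead."""
    return 1.0 - NormalDist(mu=lead, sigma=sd).cdf(0.0)


# The rounding used in the list above: SD = half the margin of error = 2.5.
area_rounded = prob_lead_positive(OBSERVED_LEAD, MARGIN_OF_ERROR / 2)

# A slightly more exact version: a 95% margin spans about 1.96 SDs, not 2.
z95 = NormalDist().inv_cdf(0.975)
area_exact = prob_lead_positive(OBSERVED_LEAD, MARGIN_OF_ERROR / z95)

print(f"area right of zero (SD = 2.5):      {area_rounded:.3f}")
print(f"area right of zero (SD = 5 / 1.96): {area_exact:.3f}")
```

Both versions put a little under two-thirds of the curve to the right of zero, so the choice between 2 and 1.96 makes no practical difference here.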

Alternatively, one could use a standard introductory textbook's Appendix with Area under the Normal Curve to show that

- Candidate A's lead is equivalent to 1/2.5 = 0.4 standard errors from zero (Z = 0.4);
- being 0.4 Z from an hypothesized mean (zero, under the null) places a result at the 66th percentile of the total distribution;
- the one-tailed p-value is 0.34, and
- .66/(.34) is about 2 to 1, which is the best estimate we have of Candidate A's odds of winning.

(It may not be immediately obvious why this should be a one-tailed test. But it wouldn't make much sense to conduct a two-tailed test, hypothesizing that Candidate A's lead is exactly zero.)
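For anyone without a normal-curve appendix at hand, the same chain from Z to percentile to one-tailed p-value to odds can be followed in Python. This is a minimal sketch that takes the list's reading at face value (the percentile is treated as the chance that A is truly ahead); the variable names and the comparison against the four answer choices are illustrative additions.

```python
from statistics import NormalDist

lead = 1.0            # observed lead, in percentage points
standard_error = 2.5  # half the +/- 5 margin, as in the list above

z = lead / standard_error         # distance from zero in standard errors
percentile = NormalDist().cdf(z)  # area below z on the standard normal curve
p_one_tailed = 1.0 - percentile   # area above z
odds = percentile / p_one_tailed  # odds in favor of Candidate A

# The four answer choices from the informal survey, expressed as odds ratios.
choices = {"A": 1 / 1, "B": 6 / 5, "C": 3 / 2, "D": 2 / 1}
closest = min(choices, key=lambda k: abs(choices[k] - odds))

print(f"Z = {z:.1f}, percentile = {percentile:.2f}, one-tailed p = {p_one_tailed:.2f}")
print(f"odds = {odds:.2f} to 1, closest choice: {closest}")
```

Carried to more decimal places the odds come out slightly under 2 to 1, and of the four choices offered, D is the nearest.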

What 4 out of my 5 researcher-subjects said in essence was "I was going to favor Candidate A's odds, but of course I realize the correct answer is A, 'Call the odds even, we have no information.'" Which I'm afraid is tantamount to "I can plainly see that A's odds are better than 50-50, but I will yield to the way I was taught the Significance Test." I was struck by how often the words "of course" came up in deference to the 'no information' option.

The contradiction I mentioned? People often use a statement about probability to support the argument that they can make no statement about probability. Or only the most rigid, artificial kind of statement. They see the nonsignificant p-value and deny the ability to further interpret the probability or odds of their result. Of course, this becomes less and less tenable the closer the p-value comes to their cutoff of, e.g., .05. Few people would say that they have no information when p = .06. And yet they still might be reluctant to assign any odds--even though, in this "quintessentially scientific" election poll example, "p = .06" means that the odds of A having a lead are 94:6. Under the distortion I'm talking about, it is only when p slips below .05 that suddenly some odds other than 50-50 seem to spring into being. And at this point many would be willing to specify those odds as exactly 19:1.

***

The ironic thing is, before people take their first course in statistics, the meaningful questions come naturally. It would seem strange to reduce inherently interesting research topics to artificial, either/or probabilistic statements. Yet in the process of learning significance tests, we have that natural interest beaten down, and what we come to report and discuss about research are the nearly sterile and often misleading results of hypothesis testing, often mixed with misconceptions.

There are prominent voices that for some time have been advocating a change. If you look up the writings or lectures of Gene Glass, Jacob Cohen, Robert Rosenthal, Laurence Phillips, or Michael Oakes, you'll encounter eloquent arguments for deemphasizing or marginalizing significance testing in favor of better estimating what some quantity is likely to be. This group began making these arguments as early as the 1970s. They have had some effect among those passionate about statistics, as is shown by the recent rise of Bayesian methods among some of the most creative and advanced quantitative thinkers. But among the rank-and-file, and especially among those who draw on statistics only pragmatically, just enough to make their case in psychology or nutrition or what have you, hypothesis testing with its attendant thinking is still considered the essence of quantitative methods.

Do I offer any recommendations? For one thing, when you report research results, try to keep thinking about what sort of information people can best use. Seldom will that be simply a significant/nonsignificant classification. Often it will turn out to be your best estimate of a quantity, accompanied by some sort of interval containing its likely upper and lower limits. That could mean a traditional confidence interval if you're lucky enough to be analyzing random samples, or comparing two groups who have been randomly assigned and who are solidly representative of their respective populations. If you're not, hopefully you will use your judgment to create some kind of modified confidence interval -- perhaps using the Bayesian method to arrive at a "credible interval."^{6}

Most important, don't let anyone drum your curiosity out of you! If your course in significance testing is not making sense, it's quite possible the instructor has glossed over, or failed to thoroughly consider, some of the implications of the method. Teaching statistics is almost universally difficult. I have met intelligent, dedicated people for whom teaching hypothesis testing is so problematic that they avoid engaging in conceptual discussion with students along the way. If you run into this, stay committed to real understanding, and do what it takes to get your questions answered.
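As a concrete version of the earlier recommendation to report a best estimate with an interval rather than a significant/nonsignificant label, here is a minimal Python sketch. The data are made up, the function name is hypothetical, and the interval uses a simple normal approximation, which is only one of several reasonable choices.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(a, b, level=0.95):
    """Estimate mean(a) - mean(b) with a normal-approximation interval."""
    diff = mean(a) - mean(b)
    # Standard error of the difference between two independent sample means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# HYPOTHETICAL scores for two groups, invented purely for illustration.
group_a = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.6]
group_b = [3.9, 4.8, 4.0, 5.2, 4.1, 4.9, 3.7, 5.0]

estimate, low, high = mean_difference_ci(group_a, group_b)
print(f"estimated difference: {estimate:.2f} (95% interval {low:.2f} to {high:.2f})")
```

A report built on this output says how big the difference appears to be and how uncertain that estimate is. With samples as small as these, a t-based interval would be somewhat wider than the normal approximation used here.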

"Seize the moment of excited curiosity on any subject to solve your doubts; for if you let it pass, the desire may never return, and you may remain in ignorance."

-- William Wirt (1772 - 1834)

***

I welcome any comments you'd like to make privately or publicly about this piece.

Technically-minded people who enjoyed this essay may want to follow up by reading Jacob Cohen's "The Earth is Round (p < .05)."

Notes

1. Rob Stein, Washington Post, December 6, 2007.

2. Bloomberg News, December 2007. Here I won't go into the important correlation/causation issue that headlines like these raise.

3. Boston Globe, August 6, 2010.

4. Drake Bennett, Boston Globe, October 14, 2007.

5. On rereading the Bennett article after writing this piece, I discovered I had sold him a little short, as the mean difference was in some cases .25 rating points but in others 1.25. So I have exaggerated the problem to some degree.

6. Nowadays when I report statistical results based on other-than-random samples, I use a disclaimer such as "While, strictly speaking, inferential statistics are only applicable in the context of random sampling, we follow convention in reporting significance levels and/or confidence intervals as convenient yardsticks even for nonrandom samples." See Michael Oakes's Statistical inference: A commentary for the social and behavioural sciences (NY: Wiley, 1986). If you're getting the feeling that I idolize Michael Oakes, you're right. You can read Amazon.com reviews of his book here.


copyright 2008-12 by roland b. stark.