The Perils of Tiny Effect Sizes
Or, how social scientists continue to vastly mislead the general public (and probably themselves) about lots of stuff...
This past week, for some reason, a meta-analysis from 2018 suddenly started making the social media rounds. The gist of this research was that printed books are better than digital books for reading comprehension. Why is this 6-year-old meta getting attention now? Probably because we’re in the midst of yet another of our “screens = bad” cycles, and these kinds of “hey, the things old people do are better than the things young people do” findings cater to an audience of a certain age.
I don’t particularly have a dog in the print/tablet book fight. Myself, I prefer “real” books and never got into the digital books craze. But a quick glance at the meta-analysis in question revealed a glaring problem: for all the back-slapping in certain circles regarding this victory for regular print books, the meta-analysis only found an effect size of g = .21 (equivalent to roughly r = .10) for the difference between “real” and digital books.
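If you want to check that conversion yourself, here’s a back-of-the-envelope sketch using the standard formula for turning a standardized mean difference into a correlation, assuming roughly equal group sizes (the meta’s authors may have weighted things a bit differently):

```python
import math

g = 0.21                     # standardized mean difference reported in the meta-analysis
r = g / math.sqrt(g**2 + 4)  # standard d/g-to-r conversion, assuming ~equal group sizes
print(round(r, 3))           # ~0.104, i.e., roughly r = .10
print(round(r**2, 3))        # ~0.011: about 1% of the variance in comprehension scores
```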
Now, those who weren’t graced with college statistics (or have blessedly forgotten them) may wonder: What does that mean? Or even better: Why should I care? It means that, in fact, digital books are just fine, and you’re being lied to about the benefits of physical books. The effect size in that meta is so small that it’s indistinguishable from statistical noise and should never be used to support a hypothesis, even if “statistically significant.”
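Here’s one way to put that number in human terms. This is my own rough illustration, assuming normal, equal-variance score distributions rather than anything reported in the paper: the so-called common-language effect size, i.e., the probability that a randomly chosen print reader outscores a randomly chosen digital reader.

```python
from math import sqrt
from scipy.stats import norm

g = 0.21
# Common-language effect size: P(random print reader > random digital reader),
# assuming roughly normal, equal-variance score distributions.
prob_superiority = norm.cdf(g / sqrt(2))
print(round(prob_superiority, 3))   # ~0.559: barely better than a coin flip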
Now, when I say you’re being lied to, I don’t mean that the study authors are liars…they’ve probably been lied to as well, or maybe are just self-deceptive. Much of social science is a house of cards built on a foundation of self-deception passed down to our students. Established professors sucker poor grad students with a lot of nonsense about how every effect size is sacred, just like the sperm in The Meaning of Life. But social science is stuck in this mire where we all kind of know these kinds of findings are unreliable, yet carry on anyway. The precision of our research designs and measurement tools just isn’t good enough to distinguish noise from signal at this minuscule level. But we ignore this so long as we can pump sample sizes high enough (as with meta-analysis) to reach that magical finish line of p < .05, even though we all know statistical significance is a bad way to assess hypothesis support.
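To see how much work the sample size is doing, here’s some generic textbook arithmetic (not the meta-analysis authors’ actual computation): hold r = .10 fixed and watch the p-value collapse as n grows.

```python
import math
from scipy import stats

r = 0.10
for n in (50, 100, 400, 1000, 10_000):
    t = r * math.sqrt((n - 2) / (1 - r**2))   # t-statistic for a Pearson correlation
    p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tailed p-value
    print(f"n = {n:>6}   r = {r}   p = {p:.4f}")
```

The effect never changes from row to row; only the sample size does, and somewhere around n = 400 the same trivial correlation sails past p < .05.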
I’m not even talking about practical significance (though there is also that). I’m simply talking about distinguishing signal from noise, which we can’t do with the level of precision necessary to take r = .10 seriously. That’s an issue most social scientists just ignore. Of course, for practical significance, social scientists have invented a whole other host of deluded arguments for why any effect size, no matter how obviously trivial, is still somehow important. Usually scholars self-servingly argue that by sprinkling tiny effects like pixie dust over a large population, they magically become meaningful. That’s just not how effect sizes work1.
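To make the signal-versus-noise point concrete, here’s a toy simulation, my own illustration and not a claim about any particular study: take two variables with zero true relationship, give each a modest shared nuisance component (same self-report method, same rater, whatever) accounting for about 10% of its variance, and you manufacture r = .10 out of nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x_true = rng.standard_normal(n)   # two constructs with zero true relationship
y_true = rng.standard_normal(n)
bias = rng.standard_normal(n)     # shared nuisance: same method, same rater, etc.

c = 1 / 3                         # nuisance accounts for ~10% of each measure's variance
x_obs = x_true + c * bias
y_obs = y_true + c * bias

print(round(np.corrcoef(x_obs, y_obs)[0, 1], 3))   # close to r = .10, despite zero true effect
```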
Honestly, the field of psychology and the rest of social science have simply failed to grapple with the absolute triviality and unreliability of the vast majority of research we produce2. I don’t think this is because of ignorance. The podcast More a Comment Than a Question covered this nicely in a recent episode, and as they suggest, if you were to ask most social scientists what proportion of research in their field is rubbish, you’d generally get pretty high estimates3.
Why do we keep promoting it then? That’s probably complicated. First, I understand that, when you’ve chosen to spend your life researching something like psychology, there’s understandable cognitive dissonance when you realize most of what we’ve been telling people is unreliable, untrue, or (as in the case of DEI or implicit bias theories, as noted by the More a Comment panel) outright harmful. In that sense social science isn’t really different from any other industry being defensive about its product. That doesn't mean people are evil, just human4. We all kind of know psychology is mostly crap, but saying that out loud is still threatening.
Second, I think we’re all easily caught by the allure of a good story. I often pick on really, truly bad fields like DEI or the moral panic over social media. But, if I’m being honest, I’m far from immune myself. For instance, in the last few years of racial/identity panic, one paper made the rounds suggesting that liberal/progressive youth are more likely to have mental health problems than conservative youth. I found this interesting, and I’m sure I’ve referenced it a few times based on my casual (and imperfect) understanding of it. But, recently taking a deeper dive into the central claims, I found that they are once again based on an effect size of…you guessed it!...r = .10. So this finding, which I thought was kind of cool/interesting, is just as unreliable as many of those I was skeptical of to begin with.
This, I think, is the issue. It’s not bad to be skeptical of research that doesn’t sound right to us. Tin foil hats aside, skepticism is mainly good. But too often we toss it aside in praise of an interesting or clever finding, even if it’s based on weak evidence.
The reality is that the overwhelming majority of social science studies are junk, and they would still be junk even if they were methodologically sound (which too often they are not), based on these trivial/noise effect sizes alone. Until our fields come to grapple with this central fact, we’ll be as guilty as any pseudoscience of foisting nonsense on an unsuspecting general public. In the meantime, don’t throw your reading tablets away just yet.
Assuming an absence of noise (which is a fantasy), an effect size is better thought of as the degree of relationship between two variables that may be noticeable for any given individual. It does not tell us how many individuals in a population may exceed some critical clinical value of importance.
And I include the Open Science folks here as well as everyone else. Open Science is good, but still of only limited value if we remain unconcerned with junk effect sizes (which is what most studies produce).
I am sympathetic to the argument, raised I think by Daniel Lakens, that some small proportion of social science research is “gold” even if the rest is junk. I think he might have used the figure of 2%, but my memory is notoriously faulty, so I’ll apologize in advance if I got that number wrong. But this creates a situation in which social science is basically a needle in a haystack. Having 2% or 20% gold science is still useless if nobody can tell what is gold and what is junk.
And, to be clear, my intention is not to level accusations against any specific study authors, including those I highlight for this article. My concern is that this is business as usual for social science, up to its highest levels in our professional guilds.