Association for Psychological Science takes a hard stand against criminal justice reform

Here’s the full quote, from an article to be published in one of our favorite academic journals:

The prescriptive values of highly educated groups (such as secularism, but also libertarianism, criminal justice reform, and unrestricted sociosexuality, among others) may work for groups that are highly cognitively sophisticated and self-controlled, but they may be injurious to groups with lower self-control and cognitive ability. Highly educated societies with global esteem have more influence over global trends, and so the prescriptive values promulgated by these groups are likely to influence others who may not share their other cognitive characteristics. Perhaps then highly educated and intelligent groups should be humble about promoting the unique and relatively novel values that thrive among them and perhaps should be cautious about mocking certain cultural narratives and norms that are perceived as having little value in their own society.

I have a horrible feeling that I’m doing something wrong, in that by writing this post I’m mocking certain cultural narratives and norms that are perceived as having little value in my own society.

But, hey, in my own subculture, comedy is considered to be a valid approach to advancing our understanding. So I’ll continue.

The Association for Psychological Science (a “highly educated and intelligent group,” for sure) has decided that they should be humble about promoting the unique and relatively novel values that thrive among them. And these unique and relatively novel values include . . . “criminal justice reform and unrestricted sociosexuality.” If members of the Association for Psychological Science want criminal justice reform and unrestricted sociosexuality for themselves, that’s fine. Members of the APS should be able to steal a loaf of bread without fear of their hands getting cut off, and they should be able to fool around without fear of any other body parts getting cut off . . . but for groups with lower self-control and cognitive ability—I don’t know which groups these might be, you’ll just have to imagine—anyway, for those lesser breeds, no criminal justice reform or unrestricted sociosexuality for you. Gotta bring down the hammer—it’s for your own good!

You might not agree with these positions, but in that case you’re arguing against science. What next: are you going to disbelieve in ESP, air rage, himmicanes, ovulation and voting, ego depletion, and the amazing consequences of having an age ending in 9?

OK, here’s the background. I got this email from Keith Donohue:

I recently came across the paper “Declines in Religiosity Predicted Increases in Violent Crime—But Not Among Countries with Relatively High Average IQ,” by Clark and colleagues, which is available on ResearchGate and is in press at Psychological Science. Some of the authors have also written about this work in the Boston Globe.

I [Donohue] will detail some of the issues that I think are important below, but first a couple of disclaimers. First, two of the authors, Roy Baumeister and Bo Winegard, are (or were) affiliated with Florida State University (FSU). I got my PhD from FSU, but I don’t have any connection with these authors. Second, the research in this paper uses estimates of intelligence quotients (IQs) for different nations. Psychology has a long and ignoble history of using dubious measures of intellectual ability to make general claims about differences in average intelligence between groups of people – racial/ethnic groups, immigrant groups, national groups, etc. Often, these claims have aligned with prevailing prejudices or supported frankly racist social policies. This history disturbs me, and seeing echoes of it in a flagship journal for my field disturbs me more. I guess what I am trying to say is that my concerns about this paper go beyond its methodological issues, and I need to be honest about that.

With that in mind, I have tried to organize and highlight the methodological issues that might interest you or might be worth commenting on.

1.) Data source for estimates of national IQ and how to impute missing values in large datasets

a. The authors used a national IQ dataset (NIQ: Becker, 2019) that is available online, as well as two other datasets (LV12GeoIQ, NIQ_QNWSAS – I can only guess at what these initialisms stand for). These datasets seem to be based on work by Richard Lynn, Tatu Vanhanen, and (more recently) David Becker for their books IQ and the Wealth of Nations, Intelligence: A Unifying Construct for the Social Sciences, and The Intelligence of Nations. The data seem to be collections of IQ estimates made from proxies for intelligence, such as school achievement or scores on other tests of mental abilities. Based on my training and experience with intelligence testing, this research decision seems problematic. Also problematic is the decision to impute missing values for some nations based on data from neighboring nations. Lynn and colleagues’ work has received a lot of criticism from within academia, as well as from various online sources. I find this criticism compelling, but I am curious about your thoughts on how researchers ought to impute missing values for large datasets (see the sketch below). Let’s suppose that a researcher was working on a (somewhat) less controversial topic, like average endorsement of a candidate or agreement with a policy, across different voting areas. If they had incomplete data, what are some concerns that they ought to have when trying to guess at missing values?
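To make the imputation worry concrete, here is a toy sketch in Python with entirely fabricated numbers (nothing here comes from the actual NIQ data): filling each missing value with a single neighbor-based guess treats the guess as if it were observed data, while multiple imputation at least propagates some of the uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fifty "countries" with a true score; twenty go unobserved.
true_scores = rng.normal(100, 10, size=50)
observed = true_scores.copy()
missing = rng.choice(50, size=20, replace=False)
observed[missing] = np.nan

# Single imputation: one fixed guess per missing value (here, a crude
# stand-in for "borrow the neighboring countries' average"), which the
# downstream analysis then treats as if it were real data.
neighbor_guess = np.nanmean(observed)
single = np.where(np.isnan(observed), neighbor_guess, observed)

# Multiple imputation: draw many plausible values per missing entry and
# watch how the answer varies across the imputed datasets.
means = []
for _ in range(200):
    imputed = observed.copy()
    n_miss = int(np.isnan(imputed).sum())
    imputed[np.isnan(imputed)] = rng.normal(
        neighbor_guess, np.nanstd(observed), size=n_miss)
    means.append(imputed.mean())

print("single-imputation mean:", single.mean())
print("multiple-imputation mean:", np.mean(means), "+/-", np.std(means))
```

The single-imputation answer reports no extra uncertainty at all, which is exactly the problem.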

2.) Testing hypotheses about interactions, when the main effect for one of the variables isn’t in the model

a. In Study 1, the authors tested their hypothesis that the negative relationship between religiosity and violent crime (over time) is moderated by national IQ, by using fixed-effects, within-country linear regression. There are some elements to these analyses that I don’t really understand, such as the treatment of IQ (within each country) as a time-stable predictor variable and religiosity as a time-varying predictor variable (both should change over time, right?), or even the total number of models used. However, I am mostly curious about your thoughts on testing an interaction using a model that does not include the main effects for each of the predictor variables that are in the interaction effect (which appears to be what the authors are doing). I don’t have my copy of Cohen, Cohen, West, & Aiken in front of me, but I remember learning to test models with interactions in a hierarchical fashion, by first entering main effects for the predictor variables and then looking for changes in model fit when the interaction effect was added (see the sketch below). I can appreciate that the inclusion of time in these analyses (some of the variables – but not IQ? – are supposed to change) makes these analyses more complicated, but I wonder if this is an example of incongruity between research hypotheses and statistical analyses.
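To illustrate what I mean by the hierarchical approach, here is a toy sketch in Python with simulated data (the variable names are invented for illustration, not taken from the paper): fit the main-effects model first, then add the interaction and test the change in fit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"religiosity": rng.normal(size=n),
                   "iq": rng.normal(size=n)})
# Simulated outcome with main effects and a small interaction.
df["crime"] = (0.3 * df["religiosity"] - 0.2 * df["iq"]
               + 0.1 * df["religiosity"] * df["iq"]
               + rng.normal(size=n))

# Step 1: main effects only.
m1 = smf.ols("crime ~ religiosity + iq", data=df).fit()
# Step 2: religiosity * iq expands to both main effects plus the interaction.
m2 = smf.ols("crime ~ religiosity * iq", data=df).fit()

# F-test on the nested models: does the interaction improve fit
# beyond the main effects?
print(anova_lm(m1, m2))
```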

b. Also, I wonder if this is an example of over-interpreting the meaning of moderation in statistical analyses. I think that this happens quite a lot in psychology – researchers probe a dataset for interactions between predictor variables, and when they find them, they make claims about underlying mechanisms that might explain relationships between those predictors and the outcome variable(s). At the risk of caricaturing this practice, it seems a bit like IF [statistically significant interaction is detected] THEN [claims about latent mechanisms or structures are allowed].

3.) The use of multiverse analyses to rule out alternative explanations

a. In Study 2, the authors use multiverse analysis to further examine the relationship between religiosity, national IQ, and violent crime. I have read the paper that you coauthored on this technique (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) and I followed some of the science blogging around Orben and Przybylski’s (2019) use of it in their study on adolescent well-being and digital technology use. If I understand it correctly, the purpose of multiverse analysis is to organize a very large number of analyses that could potentially be done with a group of variables, so as to better understand whether or not the researchers’ hypotheses (e.g., fertility affects political attitudes, digital technology use affects well-being) are generally supported – or, to be more Popperian, whether they are generally refuted. In writing up the results of a multiverse analysis, it seems like the goal is to detail how different analytic choices (e.g., the inclusion/exclusion of variables, how these variables are operationalized, etc.) influence these results (see the sketch below). With that in mind, I wonder if this is a good (or bad?) example of how to conduct and present a multiverse analysis. In reading it, I don’t get much of a sense of the different analytic decisions that the authors considered, and their presentation of their results seems a little hand-wavey – “we conducted a bunch of analyses, but they all came out the same…” But given my discomfort with the research topic (i.e., group IQ differences) and my limited understanding of multiverse analyses, I don’t really trust my judgement.
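For concreteness, here is a toy sketch of that kind of bookkeeping, in Python with simulated data and made-up covariate names: enumerate the analytic choices, fit every specification, and report the full distribution of estimates rather than a one-line verbal summary.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({"religiosity": rng.normal(size=n),
                   "iq": rng.normal(size=n),
                   "gdp": rng.normal(size=n),
                   "gini": rng.normal(size=n)})
df["crime"] = -0.2 * df["iq"] + rng.normal(size=n)

# One "universe" per combination of optional control variables.
optional = ["gdp", "gini"]
rows = []
for k in range(len(optional) + 1):
    for controls in itertools.combinations(optional, k):
        formula = ("crime ~ religiosity * iq"
                   + "".join(f" + {c}" for c in controls))
        fit = smf.ols(formula, data=df).fit()
        rows.append({"controls": " + ".join(controls) or "none",
                     "interaction": fit.params["religiosity:iq"],
                     "p": fit.pvalues["religiosity:iq"]})

# The write-up should show this whole table, not just "they all agreed."
print(pd.DataFrame(rows))
```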

4.) Dealing with Galton’s Problem and spatial autocorrelation

a. The authors acknowledge that the data for the different nations in their analyses are likely dependent, because of geographic or cultural proximity, an issue that they identify as Galton’s Problem or spatial autocorrelation. This issue seems important, and I appreciate the authors’ attempts to address it (note: it must be interesting to be known as a “Galton’s Problem expert”). I guess that I am just curious as to your thoughts about how to handle this issue. Is this a situation in which multilevel modeling makes sense?

There are a few issues here.

First, I wouldn’t trust anything by Roy Baumeister: he has a track record of hyping problematic research claims, and he has endorsed the process of extracting publishable findings from noise. Baumeister’s a big fan of research that is “interesting.” As I wrote when this came up a few years ago:

Interesting to whom? Daryl Bem claimed that Cornell students had ESP abilities. If true, this would indeed be interesting, given that it would cause us to overturn so much of what we thought we understood about the world. On the other hand, if false, it’s pretty damn boring, just one more case of a foolish person believing something he wants to believe.

Same with himmicanes, power pose, ovulation and voting, alchemy, Atlantis, and all the rest.

The unimaginative hack might find it “less broadly interesting” to have to abandon beliefs in ghosts, unicorns, ESP, and the correlation between beauty and sex ratio. For the scientists among us, on the other hand, reality is what’s interesting and the bullshit breakthroughs-of-the-week are what’s boring.

To the extent that phenomena such as power pose, embodied cognition, ego depletion, ESP, ovulation and clothing, beauty and sex ratio, Bigfoot, Atlantis, unicorns, etc., are real, then sure, they’re exciting discoveries! A horse-like creature with a big horn coming out of its head—cool, right? But, to the extent that these are errors, nothing more than the spurious discovery of patterns from random noise . . . then they’re just really “boring” (in the words of Baumeister) stories, low-grade fiction.

Also, Baumeister has a political axe to grind. This alone doesn’t mean his work is wrong—someone can have strong political views and do fine research; indeed, sometimes the strong views can motivate careful work, if you really care about getting things right. Rather, the issue is that we have good technical reasons to not take Baumeister’s research seriously. (For more on problems with his research agenda, see the comment on p. 37 by Smaldino here.) Given that Baumeister’s work may still have some influence, it’s good to understand his political angle.

And, no, it’s not “bullying” or an “ad hominem attack” to consider someone’s research record when evaluating his published claims.

Second, yeah, it’s my impression that you have to be careful with these cross-national IQ comparisons; see this paper from 2010 by Wicherts, Borsboom, and Dolan. Relatedly, I’m amused by the claim in the abstract by Clark et al. that “Many have argued that religion reduces violent behavior within human social groups.” I guess it depends on the religion, as is illustrated by the graph shown at the top of this page (background here).

Third, ok, sure, the paper will be published in Psychological Science, flagship journal bla bla bla. I’ve written for those journals myself, so I guess I like some of what they do—hey, they published our multiverse paper!—but the Association for Psychological Science is also a bit of a members’ club, run for the benefit of the insiders. In some ways, I have to admire that they’d publish a paper on such a politically hot topic as IQ differences between countries. I’d actually have guessed that such a paper would push too many buttons for it to be considered publishable by the APS. But Baumeister is well connected—he’s “one of the world’s most prolific and influential psychologists”—so I guess that in the battle between celebrity and political correctness, celebrity won.

And, yes, that paper is politically incorrect! Check out these quotes:

Educated societies might promote secularization without considering potentially disproportionately negative consequences for more cognitively disadvantaged groups. . . .

We suspect that similar patterns might emerge for numerous cultural narratives. The prescriptive values of highly educated groups (such as secularism, but also libertarianism, criminal justice reform, and unrestricted sociosexuality, among others) may work for groups that are highly cognitively sophisticated and self-controlled, but they may be injurious to groups with lower self-control and cognitive ability.

OK, I got it. If you’re from a group with higher self-control and cognitive ability, we won’t throw you in jail: you can handle a bit of “libertarianism, criminal justice reform, and unrestricted sociosexuality, among others.” Hey, it worked for Jeffrey Epstein! But if you’re from one of the bad groups, no such luck.

Fourth, there are questions about the statistical model. I won’t lie: I’m happy they did a multiverse analysis and fit some multilevel regressions. I’m not happy with all the p-values and statistical significance, nor am I happy with some of their arbitrary modeling decisions (“the difference led us to create two additional dummy variables, whether a country was majority Christian or not and whether a country was majority Muslim or not, and to test whether either of these dummy variables moderated the nine IQ by religiosity interactions (in the base models, without controls). None of the 18 three-way interactions were statistically significant, and so we do not interpret this possible difference between Christian majority countries and Muslim majority countries”) or new statistical methods they seemed to make up on the spot (“We arbitrarily decided that a semipartial r of .07 or higher for the IQ by religiosity interaction term would be a ‘consistent effect’ . . .”), but, hey, baby steps.

Donohue’s questions above about multiverse analysis and spatial correlations are good questions, but it’s hard for me to answer them in general, or in the context of this sort of study, where there are so many data issues.
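Still, to gesture at the multilevel question: here is a minimal sketch, in Python with entirely simulated data (the regional grouping and variable names are invented), of a varying-intercepts model that absorbs some of the shared regional variation. Genuine spatial autocorrelation would call for an explicitly spatial error structure on top of this.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Made-up panel: 40 countries in 8 regions, 10 years each.
country = np.repeat(np.arange(40), 10)
region = country % 8
df = pd.DataFrame({"country": country,
                   "region": region,
                   "religiosity": rng.normal(size=400)})

# Shared regional shocks create exactly the dependence Galton worried about.
region_effect = rng.normal(0, 0.5, size=8)
df["crime"] = (region_effect[df["region"]]
               - 0.1 * df["religiosity"]
               + rng.normal(size=400))

# Varying intercepts by region soak up the shared regional variation.
fit = smf.mixedlm("crime ~ religiosity", data=df, groups=df["region"]).fit()
print(fit.summary())
```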

To see where I’m coming from, consider an example that I’ve thought about a lot: the relation between income, religious attendance, geography, and vote choice. My colleagues and I wrote a whole book about this!

The red-state-blue-state project and the homicide/religion/IQ project have a lot of similarities, in that we’re understanding social behavior through demographics and looking at geographic variation in this relationship. We’re interested in individual and average characteristics (individual and state-average incomes in our case; individual and national-average IQs in theirs). Clark et al. have data issues with national IQ measurements, but it’s not like our survey measurements of income are super-clean.

So, if we take Red State Blue State as a template for this sort of analysis, how does the Clark et al. paper differ? The biggest difference is that we have individual-level data—survey responses on income, religious attendance, and voting—whereas Clark et al. only have averages. So they have a big ecological-correlation problem. Indeed, one of the big themes of Red State Blue State is that you can’t directly understand individual correlations by looking at correlations among aggregates. The second difference between the two projects is that we had enough data to analyze each election year separately, whereas Clark et al. pool across years, which makes results much harder to understand and interpret. The third difference is that we developed our understanding through lots of graphs. I can’t imagine us figuring out much, had we just looked at tables of regression coefficients, statistical significance, correlations, etc.
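To see how misleading aggregate correlations can be about individuals, here is a small simulation with entirely made-up numbers, in the spirit of the red-state-blue-state pattern: the individual-level correlation comes out positive while the correlation among the state averages comes out strongly negative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Fifty "states"; within each state, income predicts the outcome
# positively, but richer states have lower average outcomes.
n_states, n_per = 50, 200
state_income = rng.normal(0, 1, size=n_states)
income_list, outcome_list = [], []
for s in range(n_states):
    within = rng.normal(0, 2, size=n_per)   # individual variation
    income = state_income[s] + within
    outcome = (0.5 * within                  # positive within-state slope
               - 1.0 * state_income[s]       # negative between-state slope
               + rng.normal(0, 1, size=n_per))
    income_list.append(income)
    outcome_list.append(outcome)

all_income = np.concatenate(income_list)
all_outcome = np.concatenate(outcome_list)
avg_income = np.array([x.mean() for x in income_list])
avg_outcome = np.array([y.mean() for y in outcome_list])

print("individual-level correlation:",
      np.corrcoef(all_income, all_outcome)[0, 1])
print("state-average correlation:   ",
      np.corrcoef(avg_income, avg_outcome)[0, 1])
```

Run the aggregate analysis alone and you would get the sign of the individual relationship exactly backwards.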

This is not to say that their analysis is necessarily wrong, just that it’s hard for me to make sense of this sort of big regression; there are just too many moving parts. I think the first step in trying to understand this sort of data would be some time series plots of trends of religiosity and crime, with a separate graph for each country, ordering the countries by per-capita GDP or some similar measure of wealth. Just to see what’s going on in the data before going forward.
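Here is a rough sketch of the kind of display I have in mind, in Python with matplotlib and fabricated data (the column names are placeholders, not the variables in the actual paper): small multiples, one panel per country, ordered richest to poorest.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
years = np.arange(2000, 2020)

# Fabricated long-format data: 20 countries, 20 years each.
frames = []
for i, gdp in enumerate(rng.uniform(1, 60, size=20)):
    frames.append(pd.DataFrame({
        "country": f"C{i:02d}",
        "year": years,
        "religiosity": 50 + rng.normal(0, 2, len(years)).cumsum(),
        "homicide_rate": 5 + rng.normal(0, 0.3, len(years)).cumsum(),
        "gdp_per_capita": gdp,
    }))
df = pd.concat(frames, ignore_index=True)

# Small multiples: one panel per country, ordered by wealth.
order = (df.groupby("country")["gdp_per_capita"].first()
           .sort_values(ascending=False).index)
fig, axes = plt.subplots(4, 5, figsize=(15, 10), sharex=True)
for ax, name in zip(axes.flat, order):
    d = df[df["country"] == name]
    ax.plot(d["year"], d["religiosity"], label="religiosity")
    ax.plot(d["year"], d["homicide_rate"], label="homicide rate")
    ax.set_title(name, fontsize=8)
axes.flat[0].legend(fontsize=7)
fig.tight_layout()
plt.show()
```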

Fifth, I’m kinda running out of energy to keep staring at this paper, but let me point out one more thing, which is the extreme stretch from the empirical findings of this paper, such as they are, to its sociological and political conclusions. Set aside for a moment the problems with the data and statistical analysis, and suppose that the data show exactly what the authors claimed: that time trends in religious attendance correlate with time trends in homicide rates in low-IQ countries but not in high-IQ countries. Suppose that’s all as they say. How can you, from that pattern, draw the conclusion that “The prescriptive values of highly educated groups (such as secularism, but also libertarianism, criminal justice reform, and unrestricted sociosexuality, among others) may work for groups that are highly cognitively sophisticated and self-controlled, but they may be injurious to groups with lower self-control and cognitive ability”? You can’t. To make such a claim is not a gap in logic, it’s a chasm. Aristotle is spinning in his goddam grave, and Lewis Carroll, Georg Cantor, and Kurt Gödel ain’t so happy either. This is story time run amok. I’d say it’s potentially dangerous for the sorts of reasons discussed by Angela Saini in her book, but I guess nobody takes the Association for Psychological Science seriously anymore.

I’m surprised Psych Science would publish this paper, given its political content and given that academic psychology is pretty left-wing and consciously anti-racist. I’m guessing that it’s some combination of: (a) for the APS editors, support of the in-group is more important than political ideology, and Baumeister’s in the in-group; (b) nobody from the journal ever went to the trouble of reading the article from beginning to end (I know I didn’t enjoy the task!); or (c) if they did read the paper, they’re too clueless to have understood its political implications.

But, hey, if the APS wants to take a stance against criminal justice reform, it’s their call. Who am I to complain? I’m not even a member of the organization.

I’d love to see the lords of social psychology be forced to take a position on this one—but I can’t see this ever happening, given that they’ve never taken a position on himmicanes, ESP, air rage, etc. At one point one of those lords tried to take a strong stand in favor of that ovulation-and-voting paper, but then I asked him flat-out if he thought that women were really three times more likely to wear red during certain days of the month, and he started dodging the question. These people pretty much refuse to state a position on any scientific issue, but they very strongly support the principle that anything published in their journals should not be questioned by an outsider. How they feel about scientific racism, we may never know.

P.S. One more thing, kinda separate from everything else but it’s a general point so I wanted to share it here. Clark et al. write:

Note also that noise in the data, if anything, should obscure our hypothesized pattern of results.

No no no no no. Noise can definitely obscure true underlying patterns, but it won’t necessarily obscure your hypothesized patterns. Noise can just give you more opportunities to find spurious, “statistically significant” patterns. The above quote is an example of the “What does not kill my statistical significance makes it stronger” fallacy. It’s an easy mistake to make; famous econometricians have done it. But a mistake it is.
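A quick simulation makes the point; the numbers are invented. Suppose a small true effect measured with a lot of noise. The noise does make statistical significance rare, but the estimates that do clear the threshold are badly exaggerated, which is the opposite of noise merely obscuring the hypothesized pattern.

```python
import numpy as np

rng = np.random.default_rng(6)

true_effect = 0.1      # small true effect
se = 0.2               # noisy study: standard error twice the effect
n_sims = 100_000

# Each simulated study yields a noisy estimate of the true effect.
estimates = rng.normal(true_effect, se, size=n_sims)
significant = np.abs(estimates / se) > 1.96

print("share of studies reaching significance:", significant.mean())
print("mean |estimate| among the significant ones:",
      np.abs(estimates[significant]).mean(), "(true effect is 0.1)")
```

Conditioning on statistical significance in a noisy setting, the surviving estimates average several times the true effect. Noise did not obscure the hypothesized pattern; it manufactured a much more dramatic one.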