The trouble with posting graphs and statistics on social media

Reflecting on a salutary lesson in how not to post statistics on social media, Jonathan Portes discusses the limitations of posting statistics and data visualizations online and how simple visualizations can often take on unintended meanings.


When I tweeted this, as part of a 14-tweet thread on the November migration statistics (reproduced in full here), I didn’t think much of it, mostly because there was nothing new: neither the chart itself, nor the linked blog (which contains the chart), nor the research paper which the blog summarised, which had been published a month before. Neither of these publications had attracted much attention outside a narrow circle of academics and researchers interested in migration issues.

Not so this tweet, which within a few days had been viewed more than a million times, and “quote-tweeted” several hundred times, mostly with unfavourable comments. These ranged from the usual negative responses to anything that appears to be “pro-immigration”, to people claiming that the chart does not in fact show an association at all, to those arguing that it was fundamentally wrong or misleading to post any sort of chart showing a correlation without posting basic statistical tests (such as the correlation coefficient or the R-squared value, which are commonly used, although sometimes misleading). Much of this was encouraged when the post was picked up by Dominic Cummings, who suggested that I was using “fake data” (as the blog explained, the data were sourced directly from the ONS website).

So what did I get wrong? To understand, it helps to briefly summarise the original research. We looked at the relationship between EU and non-EU migration, on the one hand, and productivity, at the level of UK regions and industrial sectors, on the other. We first plotted some simple charts of the form above, showing changes in the proportion of workers from abroad against changes in productivity. We then undertook some standard regression analysis, finding a positive and (mostly) statistically significant association between non-EU migration and productivity, and a weaker, generally insignificant, and negative association between EU migration and productivity.

XKCD #552 Correlation, via xkcd.com

How to interpret this? In writing the paper, I had one key concern – not to claim that this shows that non-EU migration increases productivity, but merely that there was a reasonably clear statistical association. Non-EU migration might be good for productivity, but equally the causal link might run in reverse, or something else might be the key driver. For what it’s worth, my best guess is that the causality runs both ways – but we don’t claim that the paper shows that. And the same applies to the blog, which is careful in its wording:

“There are a number of possible explanations…we do not claim to have established causality in our analysis.”

When I tweeted the chart, I thought I was merely illustrating in simple, graphical terms what the paper shows (and says) with standard regression analysis. I could have tweeted Table 1 with the actual regression results – but this would be incomprehensible to anyone who doesn’t know at least some econometrics, and not terribly user-friendly for those who do, but hadn’t read the actual paper. I thought the chart communicated the key result much more clearly – and that those who wanted the detail would click through the link.

Moreover, in my preoccupation with the distinction between causality and correlation – an obsession of empirical economists, who spend most of their time coming up with clever ways of showing that their favourite correlation is indeed causal, and much of the rest showing that other economists’ favourite ones aren’t – I missed the bigger picture. Does the chart show a correlation at all? Well, yes, but only a weak one. And it’s notable that the vast majority of my (serious) Twitter critics were not economists but rather “hard” scientists – who, because of the nature of their experiments, typically expect much clearer and more convincing correlations in their data than we social scientists, working with messy “real world” social and economic data, normally get.

Why, then, did I not do what many of the responses criticised me for omitting, and report the correlation of the points on the chart (or the R-squared)? This would at least have attached a veneer of statistical sophistication and knowledge to my claim. I could easily have done just that, and avoided much of the flak. It turns out that the correlation shown in the chart is in fact (just about) “statistically significant” (a linear regression has a t-value of about 2); and the R-squared value, at 0.2, is not bad for this sort of data. The many people on Twitter who claimed that (just by looking at the chart) they could tell there was no association, or one which did not approach significance, were simply wrong.
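As a rough consistency check on those two figures, the sketch below uses the standard identity linking the slope t-statistic and R-squared in a one-predictor linear regression, R² = t² / (t² + n − 2). It assumes n = 16 points (the number of points on the chart) and a simple regression with an intercept; the function names are illustrative and not taken from the paper.

```python
import math


def r_squared_from_t(t_value: float, n_points: int) -> float:
    """R-squared implied by the slope t-statistic in a one-predictor OLS fit."""
    df = n_points - 2  # residual degrees of freedom (slope + intercept estimated)
    return t_value ** 2 / (t_value ** 2 + df)


def t_from_r_squared(r_squared: float, n_points: int) -> float:
    """Absolute slope t-statistic implied by R-squared in the same setting."""
    df = n_points - 2
    return math.sqrt(r_squared * df / (1.0 - r_squared))


n = 16  # assumption: one observation per point on the scatterplot

print(round(r_squared_from_t(2.0, n), 3))  # 0.222
print(round(t_from_r_squared(0.2, n), 3))  # 1.871
```

Under these assumptions, a t-value of about 2 and an R-squared of about 0.2 are two descriptions of the same weak but visible association, not independent pieces of evidence.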

But that’s not really the point. The truth is that I hadn’t even bothered to do these calculations at all, because I didn’t – and don’t – think that the chart in itself was “proof” of anything, with or without statistical tests, as opposed to being suggestive. I fundamentally don’t think that statistical tests on scatterplots with 16 points are very meaningful, and I thought that including them would imply that I did. The “proof” of the positive association, and the explanation of what I’d done and what it meant, was in the paper. In that sense, the critics were right – the tweet on its own was something of a non sequitur.

In retrospect, then, I made two mistakes. First, I absorbed, only too well, what the evidence says about communicating research on social media – that you need to present it accessibly, ideally backed up with an easily understood chart or graphic. But my chart did this only too well – the message was clear, but too simplistic, understandably confusing some, and annoying others. I should have foreseen that. But, and this was my second and fatal error, I thought that those who questioned the implicit conclusion, or who – correctly – thought that the chart in itself was far from convincing, would click through the links if they were genuinely interested in the data and evidence.

As it is, I can see that while the tweet has been viewed more than a million times, the paper has registered about 2,000 downloads, so we know how that worked out. It turns out that for most people on Twitter – even those with a scientific or quantitative background – it’s much more fun to “dunk” or criticise than to engage in the detail. Again – and being far from innocent in this regard myself – I should have known.

So what have I learned? Most importantly, that when you present information in a chart or graphic, you cannot be too clear about what you’re saying (or not saying) – without expecting people to click a link. Always assume that if something can be taken out of context, it will. If it’s backed up with a regression or a statistical test, say so. And finally, to paraphrase the words of Principal Skinner, perhaps it’s out of touch to assume more than superficial engagement on social media, but it’s also the children who are wrong.

 


The content generated on this blog is for information purposes only. This Article gives the views and opinions of the authors and does not reflect the views and opinions of the Impact of Social Science blog (the blog), nor of the London School of Economics and Political Science. Please review our comments policy if you have any concerns on posting a comment below.

Image Credit: Alphavector on Shutterstock

