Is data surfacing the future of empirical social research?

Taking a broad sweep of the history of empirical social research and its use by the state, Steve Fuller argues that as data analysis moves increasingly to commercial non-state actors, researchers and civil servants might learn from and harness the empiricist style of ‘data surfacing’ deployed by data analytics companies.

Civil servants were the original audience for the systematic gathering of data about people’s health and wealth. But until the early nineteenth century, these data were collected in the spirit of stock-taking for purposes of tax, trade or war. It took the English Utilitarians to see that today’s data might be read more dynamically, with an eye to future data trending in a ‘progressive’ direction through strategic state interventions. And it took their French Positivist counterparts to introduce the complementary idea of arranging data around a norm and its various deviations, which could then be treated as signs of a society’s health or pathology. By 1870, this dual approach to ‘statistics’ had become widespread, resulting in general beliefs that, say, Bismarck’s united Germany was ‘ascendant’ in Europe, France was in ‘decline’ and Britain had ‘overextended’ its imperial commitments.

Empirical social research has since changed markedly. Although the word ‘statistics’ was coined to capture the state’s overriding interest in data, the state is no longer the dominant gatherer and utiliser of data on a mass scale. Moreover, its data practices may not even be more sophisticated or reliable than those of ideological thinktanks or profit-driven firms. The exception is China, whose version of state capitalism relies on the sharing of data between public and private sectors.

There is a general sense that we live in ‘post-truth’ times. Post-truth is not about the rejection of facts, let alone truth. On the contrary, it’s about the proliferation of those capable of producing, gathering and distributing the data on which facts are based and truth is inferred. It has resulted in a loss of any default understanding of what is real and fake. Publics are sufficiently informed about how researchers work to strategically withhold information, if not outright lie and deceive, simply to confound research findings. Reflecting on the rise of mass media in Weimar Germany, Elisabeth Noelle-Neumann outlined a precedent for this condition, when she described a ‘spiral of silence’, whereby the dominant voices drive those holding opposing views underground, until the time comes to cast a secret ballot, when they reveal themselves to be what US President Richard Nixon called the ‘silent majority’.

Public opinion and data integration – the original challenges to empirical social research

The success of empirical social research has historically depended on an asymmetrical relationship of knowledge and power between the researcher and the researched. The tools for both constructing and deconstructing this asymmetry were forged at the dawn of the twentieth century. When the power contained in public opinion polls and surveys was first laid out in Human Nature in Politics in 1908, its author, Graham Wallas (one of LSE’s founding politics lecturers), argued that these research instruments can tap into unconscious tendencies in the collective mind that would not be normally elicited in votes or sales. Once crystallised as data, they could be used as evidence for views that people have about things they have never been asked about or perhaps even never explicitly thought about. For Wallas, this knowledge constituted a new form of unelected power that can be wielded to either channel or unleash the latent energies of liberal democratic societies. Twenty years later, Edward Bernays celebrated this newfound expression of people power as ‘public relations’, the backbone of modern advertising.

As it happens, Wallas’ book is dedicated to a member of his student audience at Harvard, where its ideas were first aired. That student, Walter Lippmann, went on to become the most influential US journalist of the twentieth century. Lippmann famously concluded that public opinion was a ‘phantom’ construct propagated by those conducting – and funding – the polls and surveys, but no less powerful because of it. Thus, he called for the state licensing of all such activities, which he characterised (sixty years before Noam Chomsky) as the ‘manufacture of consent’. This call has fallen on deaf ears. Consequently, we live in a world where ‘public opinion’ looms as an entity, fake or real, that both the state and commercial sectors must appease, nudge and sometimes dodge. In this context, academic researchers function as wildcards who provide high-grade ammunition of potential use to all sides.

Around the same time (1907), another watershed moment occurred. H.G. Wells put himself forward for the first Sociology chair at the LSE. Wells saw himself in the line of Auguste Comte, Karl Marx and Herbert Spencer, who combined many diverse data streams to project the future of society. Wells was not appointed to the chair, and the only self-described ‘sociologist’ who has taken his vision of the field seriously was Harvard’s Pitirim Sorokin, who spent the middle third of the twentieth century seeking empirical indicators that could point the world towards altruism.

Nowadays we regard the efforts of Wells as ‘science fiction’. Nevertheless, they left their mark as a style of assembling and presenting data that remains present in public intellectual discussions. I have referred to this style derogatorily as ‘intellectual asset stripping’, specifically with evolutionary psychology in mind. The style involves de-contextualising data from the theories and methods that originally gave meaning to them. Thus, evolutionary psychologists and like-minded ‘futurists’ throw together data drawn from statistics, experiments, surveys, ethnographies, histories, as well as the testimony of journalists and the opinions of academics, into an indiscriminate pile of ‘evidence’ for whatever case they wish to make. The conclusions drawn clearly depend on the relative weighting assigned to these heterogeneous bodies of data, yet readers of these works are typically left no wiser about the weighting principle in play. In medical and psychological practice, the field of ‘meta-analysis’ has been developed to tackle this problem. The vast array of ‘conspiracy theories’ that have come to the fore in our post-truth world are arguably a grassroots version of the same tendency.

The difference made by ‘big’ data

The advent of big data has changed matters considerably. Whereas in the past subjects generated data by explicitly engaging in the research process, big data mainly consists of information that subjects generate as a by-product of some other activity, such as online browsing, clicking and liking. It constitutes a scaled-up version of what the social psychologist Donald Campbell called ‘unobtrusive measures’, but where the traces now are digital rather than physical. In this context, the term ‘metadata’ is increasingly used to signal that interest in the data is not intrinsic but relational. The big data analyst is less interested in exactly what you purchased at Amazon than whether it can be used to predict other, more interesting features about you.

It is here that the bigness of the data matters, especially when understood as a platform for clients to draw connections relevant to their marketing campaigns. On the one hand, this has made the transactions recorded on Google, Amazon, Facebook and Twitter/X the engine of wealth production in the big tech economy. On the other hand, it has led to calls for those unwitting users of such platforms to secure greater knowledge, if not control, over how their data is used. One proposal, made by Wired magazine founder Kevin Kelly, which he calls ‘coveillance’, would allow platform users to acquire the same access to their data as the big tech clients, on whose behalf their data had been gathered.

To be sure, the clients for big data can usually do more with user data than the users themselves. Nevertheless, client utilisation itself tends to be very targeted. ‘Data mining’ is an apt phrase for the algorithms deployed by clients to get exactly what they need for, say, their marketing campaigns and then disregard the rest as noise. In that respect, data mining techniques apply confirmation bias to the data available, successfully so from the client’s standpoint. And that may even provide some consolation to those concerned that the advent of big data might eventuate in a mass surveillance society. At the same time, however, it does raise serious questions about whether the methods of empirical social research are up to the challenge of making the most of big data.

Palantir and data surfacing

Into the breach steps the Silicon Valley data analytics firm Palantir, which has explicitly set itself against data mining. Known mainly as a cybersecurity provider to the US Defense Department, it was awarded a half-billion-pound contract to design a ‘federated data platform’ for the UK National Health Service. It promotes the strategy of ‘data surfacing’, which starts by presuming that all the available data is potentially valuable, and then presents it in a way that its full value might be seen. The client is taught how to interpret these visualisations, with an eye to identifying emergent phenomena that may escape preconceived ways of reading the data. The value of this proactive approach in the context of national security and public health should be obvious, but it requires a recalibration of how researchers normally deal with data, in effect relaxing prior expectations and intensifying focus on anomalies to get an overall sense of the data’s direction of travel.

When I first presented this argument as a talk, a civil servant in the audience astutely compared this approach to the social research method of grounded theory, whereby the researcher is focused on capturing the categories by which subjects spontaneously conceptualise their experience rather than verifying a hypothesis driven by a particular theoretical agenda. Indeed, this is fairly seen as distinguishing data surfacing from data mining at the level of qualitative research. Palantir regards big data, which after all is generated by many humans, as itself one big subject for interpretation. The difference from when Wallas and Lippmann pondered the mysteries of public opinion a century ago is that now the relevant data no longer needs to be explicitly elicited in polls and surveys. Social media has proven to be the great mass disinhibitor of human expression, unleashing enormous data streams that permit a deep investigation of the collective unconscious without direct contact with the subjects. It should perhaps come as no surprise that Palantir’s CEO Alex Karp completed a PhD on the Frankfurt School, which probably did the most to leverage the one-to-one encounters of Freudian psychoanalysis into an all-purpose diagnostic for modern capitalist society.

If Palantir and other big data analytics firms deliver on their promise to provide comprehensive and efficient access to all that is exploitable in the data, then they have the potential to redefine researchers’ relationship to data more generally. Commercial clients aren’t the only ones who are prone to ignore most of the data in the name of data mining. Academic knowledge production displays a disturbingly similar path dependency, whereby most published work is actively ignored because it doesn’t conform to the dominant research agendas. This remainder, which may be as much as four-fifths of all academic publications, has been dubbed by the library scientist Don Swanson as ‘undiscovered public knowledge’. Swanson himself showed how tapping into this epistemic reserve could be used to solve problems that cross disciplinary boundaries, and thereby save the cost of commissioning ‘original’ research. But perhaps the ultimate data miners are politicians, policymakers and their favoured thinktanks. It is here that the civil service can provide a useful service as complement, if not foil, by cultivating the interpretive skills involved in data surfacing.


This paper originated as the keynote address to the annual social research away day at the UK Department for Education on 14 November 2023 in London. I would like to thank Andreas Zacharia for the invitation.

