How big data is revolutionizing social sciences

The traces left by individuals on the internet and social media constitute a considerable source of digital data, known as big data. Some predicted the death of the social sciences with the emergence of this massive data set. On the contrary, it seems that the social sciences are transforming and refining their research methods thanks to digital data. Caution remains, however, due to the non-representative nature of the samples used and the opacity of the algorithms—not to mention the invasions of privacy linked to data collection.


The traces we leave on search engines, social networks, online shopping sites, as well as the growing number of connected objects (smartphones, watches, cameras, thermostats, speakers, sensors), feed a fabulous deposit of digital data. It illuminates even the micro-details of our daily behaviors, our movements, our consumption patterns, our health, our leisure activities, our interests, our social networks, our political and religious opinions, without us always being aware of it. The accelerated digitization of archives and documents, previously inaccessible, carried out by administrations, companies, political parties, newspapers, and libraries also contributes to this.

The result is data that is extraordinary in terms of its volume, variety, and velocity (the “3 Vs”), commonly referred to as “big data.” And the means to extract, code, quantify, and analyze it in just a few clicks have developed alongside it, thanks to advances in artificial intelligence (AI). As Dominique Boullier points out in his latest book, this process is revolutionizing the social sciences landscape, for better or for worse.

On this point, two theses have clashed since the birth of the Web. In an article with the provocative title “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Chris Anderson, editor-in-chief of Wired, a magazine dedicated to new technologies, sees this as the programmed death of the social sciences: correlations will replace causality, there is no need for an explanatory model or a unified theory, and “the numbers speak for themselves”. In total disagreement, researchers such as Burt Monroe and Gary King welcome the potential these data offer for renewing theories and methods, and advocate the hybridization of the social sciences and “data science”.

Along the same lines, I will give some examples illustrating the contribution of big data, particularly on sensitive subjects such as racism or sexuality, which are difficult to grasp in surveys or interviews because of “social desirability” bias, i.e. the temptation for respondents to hide their opinion if it does not conform to prevailing social norms.

Big data and research on racism

Research on racism, and on anti-Black racism in particular, is especially well developed in the United States, and several surveys have naturally sought to measure its potential impact on votes for Barack Obama in the 2008 and 2012 presidential elections. These surveys did not produce conclusive results, so a researcher, Seth Stephens-Davidowitz, had the idea of using an indirect indicator of racism: the proportion of Google searches containing the word “nigger(s)” during the four years preceding the election, which he linked to votes for Obama in 2008 and 2012, state by state. Despite the taboo on this term, he found that the “N-word” is Googled on average 7 million times per year. Alone in front of their screen, people have no reason to self-censor. The results, after controls, are conclusive. They show that the states where the term is most frequently searched for on Google extend well beyond the traditionally more racist Southern states. And use of the word is negatively correlated with voting for Obama, costing him an average of four percentage points in both elections. Anti-Black racism is significantly underreported in surveys, and it has had a significant impact on voter choice, a phenomenon that had until now flown under the radar.

In France, the National Consultative Commission on Human Rights (CNCDH) reports annually to the Prime Minister on the state of racism, antisemitism, and xenophobia, drawing in particular on the Racism Barometer for opinions and on statistics provided by the relevant ministries for acts. But hate speech on social media remained outside its scope. Hence its decision, in 2020, to ask the Sciences Po Media Lab, together with the Center for European Studies and Comparative Politics (Sciences Po) and the Interdisciplinary Laboratory for Sciences-Innovations-Societies (Lisis, Gustave Eiffel University), to launch a study on online antisemitism.

The team chose to analyze the comments posted over a period of one year on the 628 main news and current affairs channels on YouTube. A corpus of nearly two million comments was extracted and an algorithm was trained to detect antisemitism, including in its most allusive forms. Antisemitic remarks turn out to be relatively rare (0.65% of all comments). Far-right channels contain the largest proportion, followed by counter-information and alternative health channels. The themes of conspiracy and Judeophobia appear more prevalent than anti-Zionism. The results therefore qualify the thesis of a “new” antisemitism, based on anti-Zionism, that has replaced the old one and moved from the far right to the far left. The investigation has since been extended to other forms of racism, notably anti-Muslim racism, as well as to masculinism and conspiracy theories.

Big data and sexuality research

Big data is also valuable for addressing issues of gender and sexuality. French universities are regularly presented as being plagued by gender studies and intersectionality, including by ministers.

The meticulous investigation conducted by sociologist Étienne Ollion and his colleagues shows that this is not the case. Using a large language model to analyze the place of gender in 120 social science journals over a quarter of a century, a corpus of 58,000 article abstracts, the study shows that the share of articles addressing gender rose from 9% in 2001 to 11.4% in 2022. The results vary from one discipline to another: the proportion of articles dealing with gender rose from 33.7% to 36.6% in demography journals in the broad sense, but from 3.3% to 5.8% in political science. These articles are still predominantly written by women, while intersectional approaches crossing gender and race and/or class remain marginal (4% at the end of the period).

Marie Bergström, a sociologist at INED, used big data to shed light on the causes of the age gap observed in heterosexual couples, where the man is generally older than the woman. She cross-referenced the results of the “Study of Individual and Conjugal Paths” (Epic) survey, conducted by INED and INSEE in 2012-2014 among 7,800 people who were asked about their age-gap preferences, with data from the dating site Meetic (400,000 profiles and 25 million emails), which provide information on actual practices. The comparison highlights the gap between what is said and what is done, and how that gap differs by gender.

At the declarative level, women are the most attached to an age gap in favor of the male partner, all the more so the younger they are, while men say they are indifferent to age. Thus, 79% of men say they would accept an older woman, while only 53% of women would consider a younger partner. But on the dating site it is a different story: the gap is particularly marked among men, who clearly prefer younger women, all the more so as they get older.

Dangers of big data

The dangers of big data are no less great: non-representativeness and instability of samples not constructed for the needs of research, opacity and failure of algorithms and models, difficulties in accessing data, ethical problems, invasions of privacy, security problems (theft, misuse of data), exorbitant energy costs, political domination of the North over the South, and of the United States over the rest of the planet. Caution is necessary and the need for regulation is clear. But we cannot deprive ourselves of such a resource. And the new generations of doctoral students have immediately seized on it.

A growing number of doctoral students are now using big data for their theses and are attracting followers. Whether they are interested in the positioning of European parties on climate or immigration, European energy policies, or the media framing of target groups, they are able to build gigantic corpora of several million texts (reports, legislative texts, posts on social networks, images, press articles, parliamentary speeches, press releases), covering several countries and long periods. To analyze them, they use supervised learning, training AI models to code these texts according to their research question and hypotheses. This allows them to revisit classic objects of political science with a fresh perspective and on a completely different scale, as part of the booming trend of “augmented social sciences”.

Author Bio: Nonna Mayer is the Research Director at the CNRS/Center for European Studies at Sciences Po
