Fact-checking and media literacy specialists thought they had found ways to combat “deepfakes”, video manipulations based on artificial intelligence, through verification tools such as InVID-WeVerify and through training in image analysis skills (visual literacy) with programs such as Youverify.eu.
But a few recent cases show that a new form of cyberattack has just been added to the panoply of disinformation actors: deepfake audio.
In the United States, in January 2024, a robocall generated by artificial intelligence and imitating the voice of Joe Biden reached residents of New Hampshire, urging them not to vote, just days before the Democratic primary in that state. Behind the attack was Steve Kramer, a consultant working for a Biden opponent, Dean Phillips.
In Slovakia, in September 2023, a fake AI-generated conversation featured journalist Monika Tódová and Progressive Slovakia party leader Michal Šimečka plotting electoral fraud. The recordings, which circulated on social media, may have influenced the outcome of the election.
In October 2023, in England, a so-called leak posted on X purported to capture Keir Starmer, then leader of the Labour opposition, insulting members of his team, on the very day his party conference opened. The deepfake was viewed more than a million times online within a few days.
A single deepfake can cause several kinds of damage, with complete impunity. The use of this technology has implications for the integrity of information and of the electoral process. Analyzing how deepfakes are generated, interpreting why they are inserted into destabilization campaigns and reacting to protect against them are all part of Media and Information Literacy.
Analyze: A phenomenon linked to the new era of synthetic media
Deepfake audio is a component of synthetic media, that is, media synthesized by artificial intelligence and increasingly removed from real, authentic sources. AI-synthesized audio manipulation is a type of deep imitation that can clone a person’s voice and make them say things they never said.
This is possible thanks to advances in voice synthesis and voice cloning algorithms. Working from snippets of recorded speech, of which a few minutes or even a few seconds are enough, they can produce a fake voice that is difficult to distinguish from the person’s authentic speech.
The rapid evolution of deep learning methods, in particular generative adversarial networks (GANs), has contributed to this improvement. The public availability of these low-cost, accessible and efficient technologies has made it possible either to convert text into sound or to carry out deep voice conversion. Current neural vocoders are capable of producing synthetic voices that imitate the human voice, both in timbre (phonation) and in prosody (accentuation, amplitude, etc.).
Audio deepfakes are remarkably effective and deceptive because they also draw on major advances in psychoacoustics, the study of human perception of sounds, particularly in terms of cognition. From the auditory signal to the meaning, by way of the transformation of this stimulus into a nerve impulse, hearing is an activity of voluntary and selective attention. Added to this are sociocognitive and interpretative operations, such as listening to and understanding the speech of others, which help us extract information from our environment.
Then there is the role of orality in our digital cultures, supported by online and mobile uses, as evidenced by the popularity of podcasts. Social media have seized on this human reality to build artificial tools that exploit the voice as a narrative device, with applications such as FakeYou. Voice and speech belong to the register of the intimate, the private, the confidential… and are the last frontier of trust in others. For example, radio is the medium that people trust the most, according to the latest Kantar trust barometer published by La Croix!
Interpret: influence operations facilitated by artificial intelligence
Voice cloning has enormous potential to destroy public trust and to allow malicious actors to manipulate private phone calls. Audio deepfakes can be used to generate audio spoofs and to spread disinformation and hate speech, disrupting the functioning of various sectors of society, from finance to politics. They can also be used to defame people, damaging their reputations and causing them to fall in the polls.
The deployment of audio deepfakes poses multiple risks, including the spread of false information and “fake news”, identity theft, invasion of privacy and malicious alteration of content. These risks are not particularly new but they are nevertheless real, and they contribute to a worsening political climate, according to the Alan Turing Institute in the United Kingdom.
The industrial-scale amplification that artificial intelligence makes possible should therefore not be underestimated. Audio deepfakes are harder to detect than video deepfakes, while being cheaper and faster to produce: they can easily be grafted onto recent news and onto the fears of certain well-identified sectors of the population. In addition, they are a valuable part of the arsenal of extremists during peacetime interference campaigns, for example around elections.
React: from fraud detection to regulation and education
There are several approaches to identifying different types of audio spoofing. Some measure the silent segments of each speech signal and note where frequencies are unusually high or low, in order to filter and localize manipulations. Others train AI models to distinguish between natural, authentic samples and synthetic ones. However, existing technical solutions do not fully solve the problem of synthetic speech detection.
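To make the first family of approaches more concrete, here is a minimal sketch in Python of the measuring step alone: locating silent segments in a signal and summarizing their durations. It is an illustration under stated assumptions, not the method of any particular detection tool. The function names, the 20 ms frame length and the 0.01 energy threshold are all hypothetical choices, and the signal is assumed to be a list of samples scaled between -1 and 1. Deciding whether the resulting pattern of pauses points to a synthetic voice would require a further step, such as comparison with reference recordings, which is not shown.

```python
import math


def frame_rms(samples, frame_size):
    """Return the RMS energy of each complete, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return energies


def silent_segments(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Return (start_sec, end_sec) pairs for runs of low-energy frames.

    frame_ms and threshold are illustrative defaults (assumptions),
    not values taken from a published detector.
    """
    frame_size = max(1, int(sample_rate * frame_ms / 1000))
    energies = frame_rms(samples, frame_size)
    segments = []
    run_start = None  # index of the first frame of the current silent run
    for i, energy in enumerate(energies):
        if energy < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            segments.append((run_start * frame_size / sample_rate,
                             i * frame_size / sample_rate))
            run_start = None
    if run_start is not None:  # the signal ends during a silent run
        segments.append((run_start * frame_size / sample_rate,
                         len(energies) * frame_size / sample_rate))
    return segments


def silence_stats(segments):
    """Summarize silent segments: count, mean duration, standard deviation."""
    durations = [end - start for start, end in segments]
    if not durations:
        return {"count": 0, "mean": 0.0, "stdev": 0.0}
    mean = sum(durations) / len(durations)
    variance = sum((d - mean) ** 2 for d in durations) / len(durations)
    return {"count": len(durations), "mean": mean, "stdev": math.sqrt(variance)}
```

A summary of this kind is only one possible input to a detector. On its own it says nothing about authenticity, which is consistent with the limits of technical solutions noted above.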
Detection remains a challenge because manipulators try to erase the traces of their forgeries (with filters, added noise, etc.), using deepfake audio generators that are increasingly sophisticated. Faced with these democratic vulnerabilities, various human solutions therefore remain, ranging from self-regulation to regulation and involving various types of actors.
Journalists and fact-checkers have stepped up their cross-checking techniques to take this new situation into account. They rely on their strategies for verifying sources and validating the context in which the content was broadcast. But they are also calling, via Reporters Without Borders and in the name of the protection of journalists, on the legal profession to establish a “deepfake offence” capable of deterring manipulators.
The social media platforms (Google, Meta, Twitter and TikTok) that carry deepfakes and amplify them through their recommendation algorithms are subject to the new EU Code of Practice on Disinformation. Strengthened in June 2022, it prohibits deepfakes and requires platforms to use their tools (moderation, deplatforming, etc.) to enforce this.
Teachers and trainers in Media and Information Literacy must in turn be informed, and even trained, so that they can alert their students to this type of risk. The youngest are the most targeted. To their visual literacy skills, they must now add skills in sound literacy.
Teaching resources are lacking in this regard and need to be developed. This can be done by choosing good examples, such as those involving political figures, and by paying attention to the 5Ds of disinformation (discredit, distort, distract, deflect, dissuade). Looking at the context and timing of these cyberattacks is also fruitful.
For politicians, who are directly concerned but very poorly trained, the Alan Turing Institute offers a strategy that everyone can share, the 3Is: inform, intercept, insularize. In the pre-election phase, this consists of informing people about the risks of audio deepfakes; in the campaign phase, it involves intercepting deepfakes and dismantling the underlying threat scenarios; in the post-election phase, it requires strengthening strategies for mitigating the incidents identified and making them known to the public.
All these approaches must be combined to ensure the integrity of information and of elections. In any case, listen carefully and take in some AIR: analyze, interpret, react!
Author Bio: Divina Frau-Meigs is Associate Professor and Professor at the Sorbonne Nouvelle University