Information search, content production, translation, detection of hate speech… generative artificial intelligence (AI) promises significant productivity gains in the media world.
The media are part of our daily lives and a pillar of democracy: they are free to present different points of view and ideas, to denounce corruption and discrimination, but also to reflect social and cultural cohesion.
While the public turns to the media for information, culture and entertainment, the media cannot escape the economic pressures and profitability demands of an industry measured by audience and sales figures. In this context, generative AI brings powerful new tools that will be used more and more.
But it is crucial to remember that generative AI systems have no ideas of their own: they recombine existing statements in ways that can be as interesting as they are absurd (these are the so-called “hallucinations” of AI systems). These systems do not know what is possible or impossible, true or false, moral or immoral.
The journalist's profession must therefore remain central to investigating and reasoning about complex social and geopolitical situations. So how can media outlets take advantage of AI tools while avoiding their pitfalls?
In July, the French National Pilot Committee for Digital Ethics (CNPEN) delivered to the Minister responsible for the Digital Transition a general opinion on the ethical issues of generative AI, which I co-coordinated. It details, in particular, the risks of these systems.
Powerful tools for journalists
The media can use AI to improve the quality of information, fight fake news, and identify harassment and incitement to hatred, but also to help advance knowledge and better understand complex realities, such as sustainable development or migratory flows.
Generative AI systems are remarkable tools that can produce results we could not obtain without them, because they compute at levels of representation that are not ours, over gigantic quantities of data and at a speed no human brain can match. If we equip ourselves with the right safeguards, these systems will save us time in searching for information, reading and producing content, and will help us fight stereotypes and optimize processes.
These tools are not arriving now by chance. At a time when we are drowning in a continuous flood of information from traditional channels and online content, tools like ChatGPT allow us to consult and produce summaries, programs, poems and so on, drawn from a body of information too gigantic for a human brain to process in human time. They can therefore be extremely useful for many tasks, but they also contribute to a flow of unsourced information. We must therefore tame them and understand how they work and what risks they carry.
How generative AI learns
The performance of generative AI depends on the self-supervised learning capacity of its internal models, called “foundation models” (self-supervised meaning without being guided by a human hand, which is distinct from real-time adaptation). These models are trained on enormous corpora of billions of images, texts or sounds, drawn very often from the cultures that dominate the Internet: GPT-3.5, which underlies ChatGPT, is mainly fed with data in English. The other two types of learning were also used: before its release at the end of 2022, ChatGPT was optimized by supervised learning and then by reinforcement learning from human feedback in order to refine its outputs and eliminate undesirable responses.
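To illustrate what “self-supervised” means in practice, here is a minimal sketch (a toy example, not how a real foundation model is implemented): the training targets come from the text itself, with no human annotation, since each word of the corpus serves as the label for the context that precedes it.

```python
# Minimal sketch of the idea behind self-supervised learning on text:
# the "labels" are simply the next tokens of the raw corpus, so no human
# annotation is required. A real foundation model would train a neural
# network on billions of such pairs.

corpus = "the media can use AI to improve the quality of information".split()

# Build (context, next-word) training pairs directly from the text itself.
training_pairs = [
    (corpus[:i], corpus[i])          # everything seen so far -> word to predict
    for i in range(1, len(corpus))
]

for context, target in training_pairs[:3]:
    print(f"context = {context!r:55} target = {target!r}")
```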
This optimization by humans has itself been widely criticized. How are these workers trained? Who are these underpaid “click workers”? Moreover, what counts as “undesirable” is decided not by an ethics committee or by legislators, but by the company alone.
Learning that forgets the sources
When foundation models are trained on text, the system learns what are called “lexical embedding vectors” (of size 512 in GPT-3.5). This is the “transformer” architecture. The training principle of the foundation model rests on the distributional hypothesis put forward by the British linguist John Rupert Firth in 1957: the meaning of a word can be grasped from the words that habitually occur around it (“You shall know a word by the company it keeps”).
These units (“tokens” in English) are on average four characters long in GPT-3.5; a token may be as short as a single character or a single space. They can therefore be parts of words or whole words, with the advantage that they can be flexibly combined to reconstruct words and sentences without any linguistic knowledge (apart from what is implicit in the sequence of tokens), and the obvious disadvantage of being less interpretable. Each token is encoded by a vector that captures information about all the contexts in which that token has been seen, thanks to attention mechanisms. Thus, two tokens that share the same neighbourhoods will be considered close by the AI system.
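To make this notion of closeness concrete, here is a minimal sketch with invented toy vectors (real models use hundreds or thousands of learned dimensions): tokens seen in similar contexts end up with similar vectors, and that similarity is typically measured with a cosine score.

```python
import numpy as np

# Toy embedding vectors, invented for illustration only; in a real model
# these are learned from huge corpora via attention mechanisms.
embeddings = {
    "journalist": np.array([0.9, 0.1, 0.3]),
    "reporter":   np.array([0.8, 0.2, 0.35]),  # appears in similar contexts
    "banana":     np.array([0.1, 0.9, 0.0]),   # appears in very different contexts
}

def cosine_similarity(u, v):
    """Similarity between two embedding vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["journalist"], embeddings["reporter"]))  # high
print(cosine_similarity(embeddings["journalist"], embeddings["banana"]))    # low
```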
A text-based generative AI system thus learns a production model whose mechanisms have nothing to do with human production, which is embodied and situated, yet it is able to imitate human production from the texts it was trained on. A direct consequence of this process is that the sources from which the identified neighbourhoods were extracted are lost, which poses a fundamental problem for verifying the content produced. There is no easy way to check the veracity of its statements: we have to find the sources ourselves, and when we ask the system to do so, it can invent them!
When you give ChatGPT a prompt, it predicts the next token, then the next, and so on. A key parameter is the “temperature”, which expresses the degree of randomness in the choice of tokens. At a high temperature, the model is more “creative” because it can generate more varied outputs; at a low temperature, it tends to choose the most likely outputs, making the generated text more predictable. Microsoft’s Bing conversational tool (based on GPT-4) offers three temperature settings (more precise, more balanced, more creative). System hyperparameters are often not disclosed, for cybersecurity or confidentiality reasons, as is the case with ChatGPT; but the temperature is what allows different answers to the same question.
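As a rough illustration of the role of temperature, here is a minimal sketch with invented scores for three candidate tokens (not taken from any real model): scores are divided by the temperature before the softmax, so a low temperature concentrates the choice on the most likely token, while a high temperature spreads it across the alternatives.

```python
import numpy as np

# Invented scores (logits) for three candidate next tokens.
candidates = ["reports", "invents", "sings"]
logits = np.array([2.0, 1.0, 0.1])

def sample_next_token(logits, temperature, rng):
    """Temperature-scaled softmax sampling over candidate tokens."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
for t in (0.2, 1.0, 2.0):
    picks = [candidates[sample_next_token(logits, t, rng)] for _ in range(10)]
    print(f"temperature={t}: {picks}")
```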
“Hallucinations” and other risks
It is therefore easy to imagine some of the risks of generative AI for the media. Others will certainly appear as they are used.
It seems urgent to find ways to minimize these risks by adopting good-practice guides while we await the adoption of the European Union’s AI Act. The CNPEN’s opinion on generative AI and its ethical issues includes 10 recommendations for research and 12 for governance. Here are some of the risks identified for the media:
- Trusting too much in what the machine says without cross-checking it against other sources. Cross-referencing several data sources and the need to investigate are becoming fundamental for all professions: journalists, scientists, teachers and others. It also seems essential to teach how to use these systems at school and university, and to cultivate the art of debate in order to develop ideas.
- Understanding that ChatGPT is built on predominantly English-language data and that its cultural influence may therefore be significant.
- Using ChatGPT massively and lazily in the media, producing large amounts of new, unverified artificial data on the Internet that could in turn be used to train new AI models. It would be tragic if there were no longer any guarantee of truth for this machine-reconstituted data. Two American lawyers, for example, were caught out when, on the algorithm’s advice, they cited non-existent case law during legal proceedings.
- Replacing certain tasks in many media-related professions with AI systems. Some jobs will disappear and others will appear. Interfaces with built-in trust measures need to be created to support cooperation between humans and AI systems.
- Using AI systems and demystifying them is becoming an absolute necessity, while taking care not to unlearn our own skills and to remain able to do without them.
- It is necessary to understand that ChatGPT makes many mistakes: for example, it has no notion of history and no understanding of space. The devil is in the details, but also in the choice of data used to build the model. The EU’s AI Act calls for more transparency about these AI systems, in order to verify their robustness, the absence of manipulation, and their energy consumption.
- It must be verified that the content produced does not infringe copyright and that the data used by the system is used legitimately. If “synthetic” data replaces our knowledge tomorrow in the training of future foundation models, it will become increasingly difficult to disentangle fact from fiction.
- Providing access to AI systems (for example DALL-E or Stable Diffusion) that can be used to create deepfake images. This phenomenon reminds us of the importance of checking the reliability not only of an article’s sources but also of its images and videos. The answer is to embed watermarks in the texts, images or videos produced, so that we can tell whether they were made by AI, or to label data as “organic” (i.e. produced by humans), as sketched below.
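As a deliberately oversimplified sketch of the watermarking idea mentioned in the last point (a toy zero-width-character marker, invented purely for illustration; real schemes rely on statistical watermarks over token choices or robust image watermarks), one could imagine:

```python
# Toy illustration of watermarking: hide an invisible marker in generated
# text so its origin can be checked later. Real watermarking schemes are
# statistical and far more robust; this is only meant to show the principle.

ZERO_WIDTH_MARK = "\u200b"  # invisible character used here as a toy marker

def add_watermark(text: str) -> str:
    """Append an invisible marker to AI-generated text (toy scheme)."""
    return text + ZERO_WIDTH_MARK

def is_watermarked(text: str) -> bool:
    """Check whether a text carries the toy marker."""
    return text.endswith(ZERO_WIDTH_MARK)

generated = add_watermark("This paragraph was produced by a language model.")
print(is_watermarked(generated))                                       # True
print(is_watermarked("This paragraph was written by a journalist."))   # False
```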
An AI laboratory on the emergence and limits of generative AI
The arrival of ChatGPT was a tsunami for everyone. It has amazed experts and non-experts alike with its abilities in text production, translation and even computer programming.
The precise scientific explanation of the “spark” of emergent abilities in foundation models is a current research topic and depends on the data and hyperparameters of the models. It is important to massively develop multidisciplinary research on the emergence and limits of generative AI, and on the measures to be deployed to control them.
Finally, schools must teach the risks and ethics of AI as much as programming, and we must also train people in, and demystify, AI systems so that we can use them and innovate responsibly, while remaining aware of their ethical, economic, societal and environmental costs.
France could play a major role within Europe with the ambition of being an AI laboratory for the media by studying ethical and economic issues in the service of the common good and democracies.
Author Bio: Laurence Devillers is Professor of Artificial Intelligence at Sorbonne University