Open up AI models so they don’t remain the preserve of web giants

Large language models, such as the one behind ChatGPT, are “closed”: we do not know how they are developed, on what data and with what parameters. Even so-called open models are only very partially open, which poses obvious problems of transparency and sovereignty. Developing open models is a realistic and desirable alternative in the medium term.

From machine translation to content generation, language models rely on massive datasets and complex algorithms. One of the big questions for the AI community is whether these models should remain closed – controlled by only a few large companies – or be open and accessible to the public, especially to researchers, developers, and public institutions.

An open model has several advantages. First, it allows for greater transparency: users can see how the model was trained, what data was used, and what algorithmic decisions underlie its predictions. This promotes trust in the results produced and allows the scientific community to identify and correct biases that may be present. Second, an open model encourages innovation. By allowing other researchers, developers, and companies to work with these models, we can accelerate the development of new applications and solve complex problems more collaboratively.

Closed models, on the other hand, pose significant problems. Their opacity makes it difficult to establish legal liability, as it is almost impossible to determine what data was used during training or how the system’s decisions were made. This opacity creates potential risks of algorithmic discrimination, misinformation, and misuse of personal data. In addition, closed models reinforce technological monopolies, leaving little room for competition and limiting the development of alternative solutions.

While truly open source language models are still relatively marginal today, they remain a viable option in the medium term. For them to flourish, not only will technical hurdles need to be overcome, but funding and regulatory models will also need to be rethought to ensure that innovation is not the preserve of a handful of tech giants. The future of open AI and its potential to benefit society as a whole is at stake.

Lobbying and business strategies

Intensive lobbying is being conducted with governments and regulatory bodies to advance the argument that fully opening up LLMs could lead to abuse. The fear of misuse, whether the mass dissemination of false information, cyberattacks, or even the fantasy of a takeover by super-intelligent machines, is put forward to justify keeping these models closed.

OpenAI, along with others, claims that opening up models would be dangerous for humanity. The debate is often difficult to follow: some talk about danger, or even call for a moratorium on this type of research, yet continue, in parallel, to invest massively in the sector.

For example, Elon Musk signed the Future of Life Institute’s letter in March 2023 calling for a six-month pause in AI research, while launching xAI, a competitor to OpenAI, in July 2023; Sam Altman, who heads OpenAI, also frequently talks about danger while aiming for multi-billion-dollar fundraising rounds to develop ever more powerful models.

While some people probably genuinely believe there is a danger here (though exactly what it is would have to be defined), others seem to be manoeuvring according to their interests and the immense sums invested.

So-called “open” models that aren’t so open after all

Faced with this, other companies, such as Meta with its Llama models, or Mistral in France, offer so-called “open” models. But are these models really open?

Openness is indeed most often limited to access to the model’s “weights”, i.e. the billions of parameters that are adjusted during its training on data. But the code used to train these models, and the training data (the masses of data, crucial to the model’s ability to analyze and produce text), generally remain well-kept secrets, out of reach of users and even researchers, limiting the transparency of these models. In this respect, can we really talk about an open model if only the weights are available, and not the other essential components?
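
To make this concrete, here is a minimal sketch, in Python with the Hugging Face transformers library, of what “weights-only” openness looks like in practice. The model identifier is hypothetical; any open-weights checkpoint would behave the same way: the parameters can be downloaded, run, and inspected locally, but nothing in them reveals the training code or data.

```python
# A sketch of "weights-only" openness, assuming the Hugging Face transformers
# library; the model identifier below is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "example-org/open-weights-7b"  # stands in for any open-weights model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # downloads the weights

# The weights can be run locally, with no API in between...
inputs = tokenizer("Open language models allow", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# ...and inspected parameter by parameter,
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
# but nothing here reveals the training code or the training data.
```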

However, opening up the weights offers certain advantages. Developers can adapt the model to specific data (through “fine-tuning”) and, above all, these models offer better control than completely closed ones. They can be integrated into other applications without being a black box accessible only through “prompt engineering”, where the way a query is phrased can influence the results without anyone really knowing why.
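
As an illustration, here is a minimal fine-tuning sketch, again assuming the Hugging Face transformers and datasets libraries; the checkpoint name and the domain_corpus.txt file are placeholders. The point is that open weights can be further trained on one’s own data, something a closed, API-only model does not allow.

```python
# A minimal fine-tuning sketch; model name and corpus file are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "example-org/open-weights-7b"  # hypothetical open-weights checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A small domain-specific corpus, one document per line.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# For causal-LM fine-tuning, the collator builds labels from the inputs.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # adjusts the open weights on the new domain
```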

Access to the weights also promotes model optimization, particularly through techniques such as “quantization”, which reduces the size of models while largely preserving their performance. This allows them to be run on more modest machines: laptops or even phones.
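
The idea behind quantization can be shown with a toy example: replace 32-bit floating-point weights with 8-bit integers plus a scale factor. Real schemes (per-channel scales, 4-bit formats, calibration data) are more sophisticated, but the memory arithmetic is the same. This sketch uses NumPy, with a random matrix standing in for a weight tensor.

```python
# A toy illustration of 8-bit quantization: mapping 32-bit floats to int8.
import numpy as np

weights = np.random.randn(4096, 4096).astype(np.float32)  # dummy weight matrix

scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

dequantized = q.astype(np.float32) * scale     # approximate reconstruction
error = np.abs(weights - dequantized).mean()

print(f"memory: {weights.nbytes/1e6:.0f} MB -> {q.nbytes/1e6:.0f} MB")  # ~4x less
print(f"mean absolute error: {error:.5f}")
```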

By making their models partially open, these companies benefit from the interest and contributions of thousands of developers, which allows for potentially faster progress than closed models, necessarily developed by smaller teams.

Towards truly open source models?

But can we envisage the creation, tomorrow, of truly open source language models, where not only the weights but also the training data and the training code would be accessible to all? Such an approach raises significant technical and economic challenges.

The main obstacle remains the computing power needed to train these models, which is currently the preserve of companies with colossal resources (Google, Meta, Microsoft, etc.); OpenAI, or Mistral in France, rely on computing power provided by various players, including the aforementioned IT giants. It is partly to cover these costs of access to computing power that these companies must regularly raise significant funds. The cost of energy, equipment, and human resources is prohibitive for most players.

Yet, initiatives exist. Research communities and non-profit organizations are seeking to develop open and ethical models, based on accessible, or at least transparent, datasets.

For example, Allen AI (a private non-profit research center, originally funded by Paul Allen, the co-founder of Microsoft who died in 2018) has developed the OLMo and Molmo models (a language model and a multimodal model, respectively), which are completely open.

SiloAI, a Finnish company, in collaboration with the University of Turku, has developed a completely open multilingual model, Poro, which performs well for Scandinavian languages.

In France, Linagora and others are also working to develop open systems, following the example of Bloom (a completely open model, developed by a collective of researchers at the initiative of the company Hugging Face in 2022).

The economic model of these initiatives remains to be determined, as does the long-term return on the colossal sums currently being invested in this field internationally.

In practice, these models are often trained on public infrastructure (Lumi in Finland for Poro, Genci in France for Bloom), typically through collaborations between academics and private companies, which can then commercialize the solutions developed. An open model is not synonymous with entirely free: additional services, such as adapting models to specific needs, can help finance such initiatives.

Another avenue lies in developing specialized language models that are less costly in terms of data and infrastructure but meet specific needs, which would allow more modest companies and players to stand out.

Author Bio: Thierry Poibeau is a CNRS Research Director (DR CNRS) at the École normale supérieure (ENS) – PSL
