
If you’ve ever used a language model like ChatGPT or Mistral, you probably remember the initial impression: impeccable spelling, fluid grammar, and sentences that make sense. Yet, under the hood, these systems do only one very simple thing: predict the next word in a sentence. They use statistics learned from a vast corpus of texts, and that’s how they “speak” French, English, and many other languages.
A fruitful idea then germinated among geneticists: what if we trained the same class of models to learn the language of life, the sequence of letters A, T, G, C inscribed in our genomes? This is the promise of genomic language models: they learn the hidden grammar of DNA, offering researchers a valuable ally for exploring, proposing, and testing scientific hypotheses more quickly.
What does an AI model do?
An artificial intelligence (AI) algorithm is, essentially, a number-processing machine. Input data, whether images, sounds, or text, is first encoded into numbers. The algorithm then performs simple operations (additions, multiplications by the network's internal parameters, and thresholding) and returns more numbers as output. At scale, this very simple mechanism is enough to play Go, drive a car… or understand genomes.
The trick is not just the encoding: it’s mainly the learning. The model adjusts its internal parameters for each example (association between an input and a target output), a bit like tuning an instrument: with each note played, the string is tightened or loosened until the melody sounds right.
The applications of this simple principle are numerous and varied. In the game of Go, the AI looks at the position of the stones (a table of numbers) and proposes the next move; in a sentence, the model suggests the next word. In genomics, it reads ATGC… and predicts the next base. If its predictions are correct, it has learned something about the hidden structure of the problem it is solving.
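To make the prediction task concrete, here is a minimal toy sketch: a frequency model that counts which base tends to follow each short context. This is nothing like the neural architecture of a real genomic model (the sequence and context length below are invented for illustration), but the question it answers is the same one.

```python
from collections import Counter, defaultdict

# Toy next-base predictor: an order-k frequency model, NOT a neural
# network. It only illustrates the task the real models are trained on.
K = 3  # context length; a model like Evo 2 reads up to a million bases

def train(sequence, k=K):
    """Count which base follows each k-letter context."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - k):
        counts[sequence[i:i + k]][sequence[i + k]] += 1
    return counts

def predict_next(counts, context):
    """Most likely next base after `context`, or None if unseen."""
    if context not in counts:
        return None
    return counts[context].most_common(1)[0][0]

genome = "ATGCGATGCAATGCGT"  # made-up scrap of sequence
model = train(genome)
print(predict_next(model, "ATG"))  # every "ATG" here is followed by "C"
```

A neural model replaces the counting table with millions of adjustable parameters, but it is graded on exactly this guessing game.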
The first genomic language models
It was by following this principle that the first genomic language models were trained, using genomes instead of text corpora. One of the most recent, Evo 2, was developed by a large team at the Arc Institute, a research center in Silicon Valley. The model was trained on numerous genomes totaling nearly 10 trillion bases (the famous letters A, C, G, T), about 3,000 times the size of our own genome.
The model reads a million bases at each step, and the calculation always boils down to the same very simple question: among the four possible letters (A, C, G, or T), which is the most likely to follow the ones just read? The enormous size of its “reading window” allows it to grasp both local rules and distant dependencies (regulation of genes at a distance). This leap in scale is not just a technical feat: it changes the way we can ask questions in biology, particularly in those non-coding regions (those not translated into proteins) that often remain poorly understood and constitute the “dark matter” of the genome.
In practice, learning resembles a guessing game: each time the model correctly guesses a hidden letter within a sequence, it reinforces the internal pathways that led it there; when it makes a mistake, it corrects these pathways. Over time, it identifies recurring patterns: certain motifs often precede the beginning of a gene, others signal the end, and certain motifs in the sequence reveal how the cell cuts RNA (splicing) or assembles the machinery for translating RNA into proteins.
Learning begins on a global scale. The model reads a wide variety of genomes and learns a general grammar of living organisms. Then, it can potentially be adapted to a family of organisms or to a specific question (for example, by specializing it on a group of viruses or bacteria).
AI learns the hidden grammar of DNA
This is where the research gets exciting: by simply learning to complete the sequences, the models recognize biological signatures without them being pointed out to them.
They rediscover the three-letter periodicity of the genetic code: the text of life is read in triplets (codons), and the models “hear” this rhythm, like a musical measure. They also identify gene starts and stops, with strong constraints on the most important letters, where errors are expected to be rare. They detect signals useful to the cellular machinery: in bacteria, ribosome binding sites; in eukaryotes, the boundaries between exons (conserved) and introns (sequences to be removed), as if the model were distinguishing between paragraphs and spaces in a text.
Even more surprisingly, they also reveal mobile elements (for example, viruses integrated into the genome during evolution) and even imprints linked to the 3D shapes of proteins (α helices, β sheets) and RNAs. The model then outlines the contours of the final sculpture. For it is indeed a sculpture.
The genome doesn’t just contain instructions: it encodes shapes. A protein or an RNA molecule isn’t simply a string of letters: it folds, twists, and knots itself in space to adopt a precise architecture, upon which its function depends. It is this shape that allows one molecule to recognize another, to bind to it, to trigger a reaction. The contacts that stabilize this shape sometimes occur between regions very far apart in the sequence, and yet the models seem capable of capturing them, as if, by repeatedly reading the text, they deduced which letters correspond despite the distance separating them.
What may be surprising is that these discoveries weren’t taught; they emerged spontaneously from learning. And sometimes, paradoxically, when you try to refine the model by showing it well-known examples, it loses some of what it had discovered on its own. It’s as if guiding the student too much makes them forget what they had intuitively understood.
To make this “black box” more readable, researchers use “sparse autoencoders,” which break down the model’s internal representations into understandable features. Each feature lights up like a lamp above a sequence element (an exon, a motif, a mobile element). These features act as a guide: they indicate where the model has seen a signal, what type of signal it is, and how it varies from one organism to another. They can even be transferred to poorly studied genomes, paving the way for multi-species functional atlases built faster and more cheaply than with traditional approaches.
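The “lamp” metaphor can be sketched in a few lines. The three-feature dictionary below is entirely invented for illustration (real sparse autoencoders learn thousands of feature directions from the model's activations by gradient descent); the sketch only shows the inference step, where a rectified projection keeps most features at exactly zero.

```python
# Conceptual sketch of a sparse autoencoder's encoding step.
# The dictionary is hand-made for illustration; real SAEs learn it.

def relu(x):
    return max(0.0, x)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Pretend these are directions the SAE found in the model's hidden
# space, each associated with a biological signal it has labeled.
dictionary = {
    "exon boundary":  [1.0, 0.0, 0.0],
    "start codon":    [0.0, 1.0, 0.0],
    "mobile element": [0.0, 0.0, 1.0],
}

def encode(activation, bias=0.2):
    """Project a hidden activation onto each feature direction;
    the ReLU keeps the code sparse: most lamps stay dark (zero)."""
    return {name: relu(dot(activation, d) - bias)
            for name, d in dictionary.items()}

# A hidden state pointing mostly along the "start codon" direction:
code = encode([0.1, 0.9, 0.0])
# Only the "start codon" lamp lights up; the others read exactly 0.
```

The sparsity is the whole point: each position in the sequence activates only a handful of named features, which is what makes the model's internal state readable.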
In our own research, Evo 2 is primarily a point of comparison: it demonstrates how far a very large model can go when given enormous amounts of data and computing power. It’s also important to recognize that this demonstration serves as a showcase for Nvidia, the largest manufacturer of AI processors, which lent its computing power to the Arc Institute to develop Evo 2. The underlying idea is to show that gigantic models and exceptional computing infrastructures are necessary to unlock the secrets of life. The result is impressive, but it’s not necessarily the only path to advancing biology.
We launched the PLANETOID project, funded under the France 2030 program, to explore a complementary strategy: building models that are much smaller, faster, easier to train, and deployable in academic laboratories. The goal is to leverage the rich biodiversity data produced by our partners, particularly the National Museum of Natural History and marine stations, to annotate genomes and metagenomes (collections of genomes) across the entire tree of life, including for so-called “non-model” species, which represent the vast majority of life but often remain poorly understood.
PLANETOID also aims to produce reusable resources and tools, so that these approaches do not remain reserved for a few actors capable of mobilizing industrial resources, but can feed into public research, and then ultimately into health and the environment.
The future: estimating the effect of a mutation or writing new genomes
Because a language model assigns a likelihood score to each sequence, it becomes possible to compare a reference version with a mutated version. If the mutation lowers the likelihood score, the mutation is suspect. This score acts as a map to guide researchers: it shows where a variation risks disrupting a function and suggests which experiments to prioritize.
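The comparison can be sketched with the same kind of toy stand-in as before. Here a tiny first-order Markov model (trained on an invented repetitive "reference" sequence) plays the role of the genomic language model: both assign a log-likelihood to any sequence, and a substitution that makes the text less "grammatical" lowers the score.

```python
import math
from collections import Counter, defaultdict

# Sketch of mutation scoring. A first-order Markov model stands in for
# a real genomic language model; the principle of comparing likelihood
# scores of reference vs. mutated sequences is the same.

def count_transitions(sequences):
    """Count how often each base follows each other base."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def log_likelihood(counts, seq, alphabet="ACGT"):
    """Sum of log P(next base | previous base), add-one smoothed."""
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        total = sum(counts[a].values()) + len(alphabet)
        score += math.log((counts[a][b] + 1) / total)
    return score

model = count_transitions(["ATGATGATGATG"] * 5)  # invented reference data

ref_score = log_likelihood(model, "ATGATG")
mut_score = log_likelihood(model, "ATCATG")  # single substitution G -> C
# The mutated sequence breaks the learned pattern, so it scores lower.
```

A real model computes the same kind of score over a far larger context, which is what lets it flag variants in regulatory regions thousands of bases away from the gene they affect.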
Another application is gaining momentum: the generation of “functional” sequences in silico. Researchers have shown that it is possible to compose genetic text that has all the characteristics of natural genomes. However, this practice raises important ethical questions (eugenic risks, the possibility of synthetic viruses, etc.) and must remain strictly regulated; it is more of a societal issue than an immediate research challenge.
Author Bios: Julien Mozziconacci is Professor of Computational Biology at the National Museum of Natural History (MNHN) and Élodie Laine is Professor of Computational Biology at Sorbonne University