Decoding the Mystery of Speech: How Our Brain Turns Sound into Meaning

When we read the sentence “He left for work”, we clearly distinguish the words that make it up, because they are separated by spaces. But if, instead of reading, we hear the same sentence spoken by someone, its different parts, which we call “discrete linguistic units” (words or syllables, for example), are not so directly and easily accessible.

Indeed, what reaches the listener’s ear, the “speech signal”, is not organized in discrete, clearly separated units, but arrives as a continuous and uninterrupted flow. How then do we transform this continuous signal into distinct linguistic units? This question has driven research on speech perception for several decades, and we address it in an original mathematical model recently presented in the journal Frontiers in Systems Neuroscience.

Different models of speech perception

In the literature, there are two major classes of speech perception models. Models in the first class, such as TRACE, the classic model in the field, consider that speech segmentation arises naturally from decoding the acoustic content of speech: listeners can directly decode the continuous stream of speech from the acoustic information contained in the signal, using their knowledge of words and sounds. Segmentation would then simply be a by-product of decoding.

By contrast, the second class of models assumes that there is a genuine segmentation process, which detects the boundaries of linguistic units, distinct from another process that associates the resulting segments with lexical units. This segmentation would rely on detecting events that mark the boundaries between segments. The two separate processes would work in an integrated way to facilitate the understanding and processing of the continuous flow of speech.

Such mechanisms can be observed in babies who, although they have not yet acquired the vocabulary of their language, are nevertheless able, up to a point, to segment speech into distinct units.

In line with this second conception of segmentation, developments in neuroscience in the last 15 years have led to new proposals concerning the processes of speech flow segmentation, in connection with the processes of synchronization and neural oscillations. These processes refer to coordinated brain activities that occur at different frequencies in our brain. When we listen to speech, our brain must synchronize and organize the different acoustic information that arrives at our ears to form a coherent perception of language. Neurons in the auditory areas of the brain oscillate at specific frequencies, and this rhythmic oscillation facilitates the segmentation of the speech stream into discrete units.

A flagship model in this field is the neurobiological model TEMPO. TEMPO focuses on the temporal detection of amplitude maxima in the speech signal to determine the boundaries between segments.

This approach is based on neurophysiological data showing that the neurons of the auditory cortex are sensitive to the temporal structure of speech, and more specifically on the fact that there are synchronization processes between the neuronal oscillations and the syllabic rhythm.
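
The idea of locating syllabic events at amplitude maxima can be illustrated with a toy sketch. The code below is not the TEMPO model; it is a simple peak picker written under stated assumptions: the function name `amplitude_maxima`, the threshold, the 10 ms frame length and the envelope values are all hypothetical and chosen only for illustration.

```python
# Toy illustration only: this is NOT the TEMPO model, just a peak picker.
FRAME_MS = 10  # assumed duration of one envelope frame, in milliseconds


def amplitude_maxima(envelope, threshold=0.5):
    """Return the indices of local maxima whose amplitude reaches `threshold`."""
    peaks = []
    for i in range(1, len(envelope) - 1):
        rises = envelope[i - 1] < envelope[i]
        falls = envelope[i] >= envelope[i + 1]
        if rises and falls and envelope[i] >= threshold:
            peaks.append(i)
    return peaks


# Hypothetical amplitude envelope with three syllable-like bumps.
envelope = [0.1, 0.4, 0.9, 0.5, 0.2, 0.6, 1.0, 0.7, 0.1, 0.3, 0.8, 0.4]
peaks = amplitude_maxima(envelope)
events_ms = [i * FRAME_MS for i in peaks]
print(events_ms)  # [20, 60, 100]
```

In a clean signal like this one, each bump yields exactly one event. Noise would add extra bumps or flatten real ones, which is where a purely signal-driven detector starts to fail.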

How to understand a sentence amid the hubbub

However, while these models provide a finer and more accurate picture of how our brain analyzes and processes complex acoustic speech signals, they still do not explain all the mechanisms involved in speech perception. An outstanding question concerns the role of higher-level knowledge, such as lexical knowledge (i.e., knowledge of the words one knows), in speech segmentation. More specifically, researchers are still studying how this knowledge is passed on and combined with the cues extracted from the speech signal to achieve the most robust segmentation possible.

Suppose, for example, that a speaker named Bob says the French sentence “il est parti au travail” (“he’s gone to work”) to Alice. If there is not too much ambient noise, and if Bob articulates well and does not speak too quickly, Alice has no difficulty understanding the message. Without apparent effort, she recognizes that Bob spoke the words il, E, paRti, o, tRavaj (the phonetic transcription of the spoken words in the SAMPA transcription system). In such an “ideal” situation, a model based only on the amplitude fluctuations of the signal, without calling on additional knowledge, would be sufficient for segmentation.

However, in everyday life, the acoustic signal is “polluted”, for example by the noise of car engines, birdsong, or the music of the neighbor next door. Under these conditions, Alice will have more difficulty understanding Bob when he says the same sentence. In this situation, Alice would likely use her knowledge of the language to get an idea of what Bob is or is not likely to say. This knowledge would allow her to complement the information provided by the acoustic cues and segment the speech more effectively.

Indeed, Alice knows many things about the language. She knows that words are linked together in syntactically and semantically acceptable sequences, and that words are made up of syllables, which are themselves made up of smaller linguistic units. Since she speaks the same language as Bob, she even knows very precisely the typical durations involved in producing the speech signal herself. She therefore knows the expected durations of syllables, and can rely on this information to help her segmentation process, in particular in a difficult situation such as hubbub. If the ambient noise “suggests” syllabic boundaries that do not correspond to her expectations, she can ignore them; conversely, if a noise masks a boundary actually produced by Bob, she can recover it from those same expectations.

In our article published in the scientific journal “Frontiers in Systems Neuroscience”, we explore these different theories of speech perception. The model we developed comprises a module that decodes the spectral content of the speech signal and a temporal control module that guides the segmentation of the continuous speech stream. This temporal control module combines, in an original way, information coming from the signal itself (in accordance with the principles of neural oscillations) with the listener’s lexical knowledge about the durations of syllabic units, and it does so whichever way the speech signal is disturbed (a spurious event or a missed event). We thus developed several fusion models that make it possible either to eliminate irrelevant events caused by acoustic noise, when they are inconsistent with prior knowledge, or to recover missing events thanks to linguistic predictions. Simulations with the model confirm that using lexical predictions of syllable durations produces a more robust perception system. A variant of the model also accounts for behavioral observations obtained in a recent experiment in which the durations of syllables in sentences were manipulated so that they did, or did not, match the naturally expected durations.
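
The two behaviors described above, discarding spurious events and recovering missed ones, can be sketched in a few lines. This is a deliberately simplified, rule-based illustration and not the fusion models of the published article: the function `fuse_events`, the tolerance window and all timing values are assumptions introduced here for the example.

```python
# Simplified sketch, NOT the published model: a rule-based stand-in for
# combining signal-driven events with expected syllable durations.
def fuse_events(detected_ms, expected_durations_ms, start_ms=0, tolerance_ms=50):
    """Return one boundary per expected syllable.

    A detected event is kept only if it falls within `tolerance_ms` of the
    boundary predicted from the expected duration; otherwise the predicted
    boundary itself is used.
    """
    boundaries = []
    last = start_ms
    for duration in expected_durations_ms:
        predicted = last + duration
        candidates = [t for t in detected_ms if abs(t - predicted) <= tolerance_ms]
        if candidates:
            # An event agrees with the prediction: trust the signal.
            last = min(candidates, key=lambda t: abs(t - predicted))
        else:
            # No event near the prediction: fall back on the expectation.
            last = predicted
        boundaries.append(last)
    return boundaries


# Hypothetical noisy detection: 300 ms is spurious, and the boundary
# expected near 400 ms was masked by noise.
detected = [210, 300, 610]
expected = [200, 200, 200]  # assumed syllable durations known to the listener
fused = fuse_events(detected, expected)
print(fused)  # [210, 410, 610]
```

The spurious event at 300 ms is ignored because it matches no prediction, while the masked boundary is filled in at 410 ms from the expected duration alone.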

In conclusion, in a real communication situation where the speech signal is free of disturbance, relying on the signal alone is probably enough to access the syllables and the words that make up the message. When the signal is degraded, on the other hand, our modeling work explains how the brain could use additional knowledge, such as what we know about the usual durations of the syllables we produce, to help speech perception.

Author Bio: Mamady Nabe is a Doctor in Computer Science at Grenoble Alpes University (UGA)