Learn sign language with generative AI and 3D avatars


To address the accessibility barriers faced by a growing number of deaf people, researchers are turning to new technologies for learning sign languages. In the digital field, research now focuses on artificial intelligence (AI) systems that automatically translate written or spoken vocal language into sign language and vice versa, using avatars that sign in 3D.

The targeted applications include automatic translation, which can produce sign language content from text (SMS, web content, etc.) or speech and generate text subtitles for sign language videos, as well as educational applications and videobooks for deaf children. The primary objective of this work is to facilitate communication for deaf people and their access to information, and thus to promote their inclusion in society.

Constraints specific to sign languages

Sign languages are languages in their own right and pillars of deaf identity and culture. There are as many sign languages as there are deaf communities, each with its national and regional variants and its own specificities. They are used not only by deaf people, but also by the people in their family, educational and professional circles. French Sign Language (LSF) is the sign language used by deaf people in France. Banned for a long time, it was only officially recognized in 2005, which may explain why it lacks resources such as lexicographic, grammatical or encyclopedic reference works, as well as annotated digital data.

The few existing sign language databases generally consist of videos, sometimes accompanied by translations into written vocal language. This is particularly true of LSF, whose resources are essentially lexical dictionaries accessible on the Web. Furthermore, the international community working on sign languages has reached no consensus on a written form. It is therefore difficult to align a sign language video with the corresponding text, as is done between speech and text for spoken languages. The development of digital tools dedicated to sign languages therefore raises significant linguistic and technological challenges.

Linguistic specificities

With their own grammars, sign languages challenge theories built around vocal or written languages. Conveyed through visual and gestural modalities, they are characterized by the parallelism and simultaneity of hand movements and hand configurations, but also of non-manual movements, including torso orientation, facial expressions and gaze direction.

In addition, they rely on iconic and spatial dynamics. The signer unfolds the discourse in the surrounding space: through gestures, the signer positions the entities of the discourse, animate or not, at specific locations in space, which allows these entities to be referenced and recalled later.

Iconicity, a relationship of resemblance between a gesture and what it means, manifests itself in different ways. Certain meaning-bearing hand configurations can represent lexical signs (a plane, a person) or be inserted into an utterance. For example, the sentence “the plane is taking off” is represented in LSF by a single two-handed sign: one hand, flat and static, represents the runway, while the other, dynamic hand represents the plane, its movement following the plane's trajectory relative to the runway.

These iconic and spatial characteristics, omnipresent in sign languages, rest on grammatical mechanisms that exploit the geometry of space as well as the timing of movements. Reproducing them is a technical challenge for movement synthesis.
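To make the idea of spatial referencing concrete, here is a minimal Python sketch of how a synthesis system might keep track of where discourse entities have been placed. The class name `SigningSpace`, its methods and the coordinate convention are all assumptions made for illustration; they do not describe any particular system.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# (x, y, z) in metres, relative to the signer (assumed convention)
Point = Tuple[float, float, float]


@dataclass
class SigningSpace:
    """Hypothetical registry of discourse entities and their loci in space."""
    loci: Dict[str, Point] = field(default_factory=dict)

    def place(self, entity: str, position: Point) -> None:
        """Anchor (or re-anchor) an entity at a location in the signing space."""
        self.loci[entity] = position

    def refer(self, entity: str) -> Point:
        """Return the locus to point or gaze at when the entity is recalled."""
        if entity not in self.loci:
            raise KeyError(f"{entity!r} has not been placed in the signing space")
        return self.loci[entity]


space = SigningSpace()
space.place("runway", (0.0, 1.0, 0.4))
space.place("plane", (-0.2, 1.0, 0.4))
plane_locus = space.refer("plane")  # a later reference reuses the same location
```

An avatar controller could use the returned locus as a target for a pointing gesture or for gaze direction, which is one way the geometry of space enters the synthesis problem.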

Create animated signing avatars with motion capture

From a digital point of view, video, which remains the preferred medium for deaf people, does not guarantee anonymity and imposes strong constraints on data storage and transmission. Signing avatars, 3D animated virtual characters capable of signing utterances, offer a suitable alternative for sign language production. With interactive 3D interfaces that integrate these avatars, it becomes possible to edit the animations, slow them down, replay them, change the point of view, zoom in on specific parts of the avatar’s body, or even modify its graphic appearance. Designing such signing avatars first requires recording the movements of a deaf person using motion capture (mocap) techniques and then adapting this data to a 3D character model.
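As an illustration of why avatar animations are easier to edit than video, the following Python sketch slows down a motion by resampling its frames. It assumes that an animation is stored as a list of poses, each pose being a flat list of numbers (for example joint coordinates). The function name `slow_down` is hypothetical, and linear interpolation is a simplification: production systems typically interpolate joint rotations rather than raw coordinates.

```python
from typing import List

Pose = List[float]  # one frame: flattened joint values (assumed representation)


def slow_down(frames: List[Pose], factor: float) -> List[Pose]:
    """Stretch an animation by `factor` using linear interpolation between frames."""
    if factor < 1:
        raise ValueError("factor must be >= 1 to slow the motion down")
    if len(frames) < 2:
        return [list(f) for f in frames]
    last = len(frames) - 1
    n_out = round(last * factor) + 1
    result = []
    for i in range(n_out):
        t = min(i / factor, float(last))   # position in source-frame units
        lo = min(int(t), last - 1)         # index of the frame just before t
        alpha = t - lo                     # blend weight toward the next frame
        result.append([(1 - alpha) * a + alpha * b
                       for a, b in zip(frames[lo], frames[lo + 1])])
    return result


clip = [[0.0], [1.0], [2.0]]
slowed = slow_down(clip, 2)
```

With the same number of frames per second, the resampled clip plays at half speed while passing through the original poses.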

Over the past decade, text has been translated into signed animation using algorithms based on mocap data. These algorithms synthesize new utterances by concatenating existing movements. While these approaches are satisfactory in terms of the precision and quality of the sign language produced, they require long, tedious processing of mocap data that is difficult to automate. Furthermore, the creation of new utterances remains constrained by the vocabulary and grammatical structures of the utterances initially recorded. With this type of technology, it seems difficult to build large corpora of sentences with semantic variations across diverse thematic domains.
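The principle of synthesis by concatenation can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method used in any specific system: the names `concatenate` and `synthesize`, the toy lexicon and its labels, and the linear cross-fade at the junction are all assumptions. Each recorded clip is assumed to be a list of poses of equal dimension.

```python
from functools import reduce
from typing import Dict, List

Pose = List[float]
Clip = List[Pose]


def concatenate(clip_a: Clip, clip_b: Clip, blend: int) -> Clip:
    """Join two clips, cross-fading the last `blend` frames of A with the first of B."""
    if blend < 1 or blend > min(len(clip_a), len(clip_b)):
        raise ValueError("blend must be between 1 and the shorter clip length")
    mixed = []
    for k in range(blend):
        w = (k + 1) / (blend + 1)              # weight moves from A toward B
        pose_a = clip_a[len(clip_a) - blend + k]
        pose_b = clip_b[k]
        mixed.append([(1 - w) * a + w * b for a, b in zip(pose_a, pose_b)])
    return clip_a[:-blend] + mixed + clip_b[blend:]


def synthesize(labels: List[str], lexicon: Dict[str, Clip], blend: int = 2) -> Clip:
    """Build an utterance from recorded clips; unknown labels raise KeyError."""
    clips = [lexicon[label] for label in labels]
    return reduce(lambda acc, clip: concatenate(acc, clip, blend), clips)


# Toy lexicon: one-dimensional poses stand in for full skeletons.
lexicon = {"SIGN-A": [[0.0]] * 4, "SIGN-B": [[1.0]] * 4}
utterance = synthesize(["SIGN-A", "SIGN-B"], lexicon)
```

The `KeyError` raised for a label missing from the lexicon reflects the limitation noted above: only movements that were recorded beforehand can be recombined.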

Generate sign language content with generative AI

Recent advances in generative AI for text, image and video generation open up clear prospects for translation between vocal language (spoken or written) and sign language. Current research aims to integrate these advances by developing deep neural network architectures adapted to natural language processing. Using machine learning algorithms trained on large databases, these networks learn to identify trends and correlations hidden in the data, producing large language models that are increasingly efficient and versatile.

In the context of sign languages, neural networks learn, in parallel and in context, linguistic features (e.g. a manual configuration) extracted from sentences and kinematic features (e.g. a sequence of 2D or 3D skeletal postures) extracted from movements.
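As a rough illustration of what such paired data might look like before it reaches a neural network, the Python sketch below couples a sequence of linguistic tokens with a sequence of 3D skeletal postures and converts both to numeric form. The `Sample` structure, the function names and the example values are assumptions chosen for clarity; real corpora and models are considerably richer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]  # 3D position of one skeleton joint
Pose = List[Joint]                  # one skeleton per frame


@dataclass
class Sample:
    tokens: List[str]   # linguistic side, e.g. words or hand-configuration labels
    poses: List[Pose]   # kinematic side, e.g. a mocap recording of the utterance


def build_vocab(samples: List[Sample]) -> Dict[str, int]:
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for sample in samples:
        for token in sample.tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sample: Sample, vocab: Dict[str, int]) -> Tuple[List[int], List[List[float]]]:
    """Turn a sample into token ids and one flat feature vector per frame."""
    ids = [vocab[token] for token in sample.tokens]
    frames = [[coord for joint in pose for coord in joint] for pose in sample.poses]
    return ids, frames


sample = Sample(
    tokens=["plane", "take-off"],
    poses=[[(0.0, 1.0, 0.4), (0.1, 1.0, 0.4)],
           [(0.0, 1.0, 0.4), (0.2, 1.2, 0.4)]],
)
vocab = build_vocab([sample])
token_ids, frame_vectors = encode(sample, vocab)
```

A translation model would then be trained to map between the two sequences, from token ids to frame vectors for text-to-sign, or in the opposite direction for sign-to-text.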

AI for the integration of deaf people

While the performance of these systems, particularly the precision of the animations produced, can still be improved in the near future, they already make it possible to considerably increase the available resources, in particular by replacing videos with high-resolution 3D avatar animations. In addition, so-called end-to-end methods will make it possible to design a single, global neural network for complex translation tasks from one language to another.

In the short term, it becomes possible to include this type of automatic translation tool in educational applications that exploit the respective logics of vocal and signed language grammars, thus promoting coordinated learning of the two languages. These language resources and digital tools are essential to give deaf people access to educational pathways, which are key to acquiring the knowledge and skills needed for successful integration into society.

Author Bio: Sylvie Gibet is University Professor of Computer Science at the Université Bretagne Sud