Ylva Ferstl, a Trinity PhD student and ADAPT Researcher, is exploring how to make virtual human avatars seem more lifelike. As humans, we have evolved to process both spoken language and body language when having a conversation with someone. Ylva introduces us to her research that seeks to make it easier for computers to generate natural gestures to accompany their speech, thereby making our interactions with them more natural.
Virtual humans are reaching widespread adoption, whether in video games, customer service interfaces, or as virtual tour guides in museums. Often, however, they lack presence; they don't quite hold our attention as intended, and they appear rigid and shallow. A key component of an engaging presence is nonverbal behavior, which encompasses a range of body movements such as head nods, eye blinks, and gesture, all naturally produced in human-to-human interaction. In my research, I aim to enable our virtual counterparts to mimic some of this nonverbal behavior, specifically the component of gesture. Gesture is an integral part of our communication, so ingrained that we often produce gestures even on the phone, when the other person cannot see us. I am working on a system to automatically create the gesture motion for an avatar's speech.
In developing a gesture production system able to imitate our natural behavior, I focus on machine learning. The system takes lots of example snippets of speech and the accompanying gesture, and tries to infer the way the two modalities relate. For training such a system, the essential first step is creating an appropriate set of training data, the examples of speech with gesture. For this, I recorded hours and hours of a person speaking, and captured their exact gesture movements with a motion tracking system, as in the picture below: Reflective markers are placed on the actor, and infrared cameras track their position.
Creating and publishing this dataset of speech and gesture was a significant step in the field of gesture generation. Previously, there was no large-scale dataset available for speech and concurrent gesture motion. Various recent gesture generation works by other researchers have relied entirely on our dataset.
After creating the dataset of speech and gesture, we analyze the speaker's speech using features such as the pitch of their voice, which can indicate, for example, emphasis and emotion. This processed speech is then the input to the machine learning gesture generation model.
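As a rough illustration of what extracting pitch from speech can look like, the sketch below estimates the fundamental frequency of a single audio frame by autocorrelation. It is a minimal example under stated assumptions, not the feature pipeline used in this research: the function name `estimate_pitch_hz`, the 75 to 400 Hz search range, and the synthetic 200 Hz test tone are all illustrative choices.

```python
import math

def estimate_pitch_hz(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the pitch of one audio frame using plain autocorrelation.

    fmin/fmax bound the search to a typical speaking range (an assumption).
    """
    min_lag = int(sample_rate / fmax)  # shortest period we consider
    max_lag = int(sample_rate / fmin)  # longest period we consider
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the frame and a copy of itself shifted by `lag`.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic stand-in for a voiced speech frame: a pure 200 Hz tone.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(800)]
pitch = estimate_pitch_hz(tone, sample_rate)
print(f"estimated pitch: {pitch:.1f} Hz")
```

Running this frame by frame over a recording would give a pitch contour over time, which is the kind of prosodic signal a gesture model can take as input. Real speech is far noisier than a pure tone, so practical systems use more robust pitch trackers.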
With the captured motion, the gesture movements are now represented through joint positions in 3D space over time, and this is the output of the model we train. Through many training iterations, the model gets its output closer to ground truth, the gesture the speaker actually produced.
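The idea of a model's output moving closer to the ground truth over many training iterations can be shown with a deliberately tiny example. The sketch below fits a straight line from one speech feature to one joint coordinate by gradient descent on the mean squared error. All values are made up for illustration, and the linear model is a stand-in assumption; the actual gesture model is a much larger neural network.

```python
# Toy data (made up): one speech feature per frame, e.g. normalised pitch,
# and one joint coordinate per frame as the "ground truth" motion.
speech = [0.0, 0.25, 0.5, 0.75, 1.0]
truth = [0.10, 0.20, 0.30, 0.40, 0.50]  # follows truth = 0.4 * speech + 0.1

def mse(w, b):
    """Mean squared error between the model output and the ground truth."""
    return sum((w * x + b - y) ** 2 for x, y in zip(speech, truth)) / len(speech)

w, b = 0.0, 0.0        # model parameters, starting from nothing
learning_rate = 0.5
initial_loss = mse(w, b)

for _ in range(2000):  # training iterations
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(speech, truth)) / len(speech)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(speech, truth)) / len(speech)
    # Step the parameters a little in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.8f}")
```

Each iteration nudges the parameters so that the predicted joint coordinate sits a little closer to the recorded one, which is the same principle that drives training on full 3D joint trajectories.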
Capturing the speech-to-gesture relationship in the model, however, is difficult due to the inherent ambiguity. For starters, a single speaker may produce different gestures for the same or similar speech utterance at different times, meaning the mapping from speech to gesture is non-deterministic. When extending gesture modelling to different speakers and different speaking styles, the variety of gesture 'output' increases hugely again; think of the very different speaker appearance of an extroverted person aiming to draw attention, versus an introverted person slightly uncomfortable in the spotlight.
Currently, we are still at the beginning stages of being able to capture such variety in speech-accompanying gesture. For automatic gesture generation, advances have been made recently by addressing the non-deterministic nature of the speech-gesture relationship. This means moving away from standard machine learning approaches that are tuned to strong input-output mapping rules (think, for example, of mapping speech to the same speech in a different language). As one alternative, generative adversarial networks (GANs) have become popular in many applications, and have been a significant step forward in speech-to-gesture research. Instead of incrementally approximating the ground truth gesture output during training, a GAN aims to produce output that 'looks real' rather than output that is exactly right, allowing more freedom and creativity in the model's output. We presented one such model recently (https://doi.org/10.1016/j.cag.2020.04.007).
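One common way to see why a strict input-output objective struggles with a non-deterministic mapping is the toy example below. It assumes, purely for illustration, that the same speech input was recorded with two different gestures (the wrist moving left or right), and it checks which single prediction minimises the mean squared error. The data and the candidate predictions are hypothetical.

```python
from statistics import mean

# Hypothetical recordings: identical speech input, two different gestures.
# Values are wrist x-offsets in metres (left = negative, right = positive).
observed_wrist_x = [-0.30, 0.30, -0.30, 0.30]

def mse(prediction, targets):
    """Mean squared error of one fixed prediction against all recordings."""
    return mean((prediction - t) ** 2 for t in targets)

# Candidate outputs: gesture left, no movement at all, gesture right.
candidates = [-0.30, 0.0, 0.30]
best = min(candidates, key=lambda p: mse(p, observed_wrist_x))

for p in candidates:
    print(f"prediction {p:+.2f} -> error {mse(p, observed_wrist_x):.3f}")
print(f"lowest error: {best:+.2f}")
```

The lowest-error answer is the average of the two gestures, which here means barely moving at all, even though the speaker never did that. An objective that rewards output for looking like a plausible real gesture, as a GAN's does, is not pulled toward this average in the same way.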
Notably, so far, gesture generation research has largely focused on producing so-called beat gestures, gestures that follow the rhythm and emphasis of speech; these gestures relate to prosodic markers of the speech, such as the pitch. Presenting only formless beat gestures, however, can appear unsatisfactory, lacking depth and variety. Varied gestures that describe a specific shape or meaning are much harder to model as they require a deeper understanding of the semantic content of the speech. A key concern for future work is finding ways to generate a variety of meaningful gestures, supplementing the more shapeless, rhythmic beat gestures with gestures that carry meaning. Meaningful gestures can both aid the semantic understanding of an agent's speech and create a richer, more enjoyable experience for us.
Improving the automatic generation of appropriate gesture behavior can have a significant impact on making lifelike agents widely available. While the production of realistic virtual gesture behavior is currently reserved for high-cost movie or game productions, an automatic gesture system would make engaging virtual agents accessible even to individual developers outside of big-budget companies. This will mean that interacting with virtual agents will more often be an engaging, pleasurable experience for us.
Ylva Ferstl is a PhD candidate in the area of machine learning and animation. She received her MSc in Cognitive Science in 2015 at the University of Tübingen. Her PhD project focuses on automatic speech gesture generation, supervised by Dr Rachel McDonnell, and funded by ADAPT Centre. Currently, she is completing a PhD internship at Facebook Reality Labs, with the AR/VR team.