Data-driven Communicative Behaviour Generation: A Survey

The development of data-driven behaviour generating systems has recently become the focus of considerable attention in the fields of human–agent interaction and human–robot interaction. Although rule-based approaches were dominant for years, these proved inflexible and expensive to develop. The difficulty of developing production rules, as well as the need for manual configuration to generate artificial behaviours, places a limit on how complex and diverse rule-based behaviours can be. In contrast, actual human–human interaction data collected using tracking and recording devices makes humanlike multimodal co-speech behaviour generation possible using machine learning and specifically, in recent years, deep learning. This survey provides an overview of the state of the art of deep learning-based co-speech behaviour generation models and offers an outlook for future research in this area.


INTRODUCTION
Recent years have seen an increase in the development of systems for the generation of human-like communicative behaviour. This is driven by the need for socially interactive virtual and robotic agents in various domains. For instance, artificial agents may range from household service robots to museum guide avatars and social robots in education and medicine, whose primary function is not only to assist people but also to connect with them by effectively producing social signals [13].
Research has long established a rule-based approach as an advantageous one in human behaviour generation [12,109,141]. However, in light of state-of-the-art developments, major issues in the rule-based approach have been identified. While it is efficient in producing human behaviours for a single or a limited number of modalities, it is hampered by the need to formulate rules explicitly, resulting in a practical limit on the number of rules, which in turn curbs the expressiveness of behaviour [62]. Additionally, rule-based systems typically fall short of producing multimodal behaviours, as the number of rules increases rapidly when new modalities are added [170]. Recent evidence suggests that rule-based models tend to fail when producing natural variations of human behaviour, often because they do not cover the entire range of behaviour or because their naturalness is found to be lacking [125].
In contrast, models that are trained by learning from available corpora of speech, text, audio, and multimodal data allow for a more robust human-agent interaction, as they can learn correlated behaviour which is difficult or labour-intensive to capture in rules. For example, it is believed that computational models based on data hold promise in uncovering the complex relationships between verbal and non-verbal human behaviours [124,218]. Advances in deep learning and machine learning models and the availability of large datasets have led to a growing interest in data-driven systems for behaviour generation [85,111,228], dialogue systems [173], and speech synthesis systems [197,211]. The data-driven approach to interaction design is deemed to improve on the labour-intensive rule-based approach. Human behaviours are generally produced through various modes that make communication multimodal [7].
Those are primarily speech and different types of bodily gestures such as facial gestures, movements of the head, and manual (hand, arm, shoulder) gestures [7]. These all play an integral role in conveying social signals and information [147]. Moreover, the affective states of an interlocutor are consciously or unconsciously communicated by means of these verbal and non-verbal communicative channels [7]. Data from several studies suggest that robots and virtual agents able to cause affect in human users are perceived as more vivid and human-like [54,160].
Compared to other recent reviews [127,226], this survey intends to take stock of the dynamically expanding field of co-speech gesture and behaviour generation for anthropomorphic agents, and of the methodological approaches used for the evaluation of such models. We review existing research on data-driven approaches in verbal and non-verbal human behaviour generation and cover progress in data-driven communicative behaviour generation from the last five to six years. Furthermore, this work attempts to identify challenges and directions, and in doing so sets a road-map for future research in this field.
Section 2 explains the methodology for the review. Sections 3, 4, 5, 6 and 7 are dedicated to reviewing data-driven models that generate various communicative behaviours occurring in human-human interactions and that are designed for human-agent and human-robot interaction scenarios. Section 7 finishes the review and focuses on speech synthesis, the communicative behaviour in which most resources have been invested for arguably the longest period of time and which therefore holds essential lessons for data-driven behaviour generation. Section 8 provides an outlook for the field and concludes the paper.

MATERIALS AND METHODS
This paper reviews empirical studies published within the past five to six years (2014-2021), with some exceptions for studies published between 2011 and 2012 that were considered relevant for this survey. Moreover, reference lists of the selected articles and significant review papers were examined to identify other relevant studies for inclusion. A list of the research keywords used in this work is summarized in Table 7 (Appendix A).
A total of 825 records were retrieved from various publication databases. The search result statistics across databases (i.e., Google Scholar, Scopus, Web of Science, ACM, IEEE) can be seen in Figure 1. After retrieving meta-data about the papers, the titles and abstracts of all 825 articles were screened to identify the journal articles and conference papers deserving a full-text review. Papers were retained when they contained appropriate keywords and model descriptions.
The number of articles was reduced to 534 after the exclusion of overlapping titles and abstracts. Thus, a total of 291 publications were carried over to the full-text review stage.
During the full-text review, publications were included only if the work met the following criteria:
• it introduced a model with the capability of training (which in most cases was a neural network);
• it relied on a corpus or dataset for training;
• it presented clear evaluation metrics;
• it presented test-bed platforms for the proposed models.
A paper was excluded if:
• it was focused solely on rule-based systems;
• it did not describe the evaluation metrics;
• it did not provide information on the dataset and corpora for training and validation.
As a result, of 291 works that were considered in the full-text review, 231 works with no evaluation metrics or corpora were excluded. Among them were articles describing rule-based models, which were out of the scope of this survey and hence were removed from the review. The final list of publications thus contained 53 papers meeting the eligibility criteria. The selected papers are organized according to the type of behaviour presented in separate sections in this survey. Note that we are agnostic about the form of the agent on which the behaviour is produced: this survey focuses on the generation of behaviours for both humanoid and non-humanoid robots as well as virtual conversational agents and avatars.

HEAD GESTURES
Head gestures constitute an important part of human body language during communication and co-occur with speech.
Speech-driven head gesture synthesis through data-driven approaches has attracted attention over the last decade.
Unlike rule-based models for gesture synthesis, data-driven models can learn dependencies between data so as to map a sequence of speech features to meaningful head animations. The related literature shows different frameworks employing Deep Neural Networks (DNNs) [184], Bi-directional Long Short-Term Memory (BLSTM) networks [172], and deep generative models [72,179], which are capable of learning the temporal and cross-modal dependencies of continuous signals.
Ding et al. [45] discussed a Deep Neural Network (DNN) for synthesizing head motion from speech features. To this end, they pre-trained a Deep Belief Network (DBN) [89], using stacked Restricted Boltzmann Machines (RBMs) [178] with a target layer for fine-tuning the DBN model parameters, creating a DNN model. The objective evaluation relied on three measures of the difference between predicted and ground truth head movements: Canonical Correlation Analysis (CCA) [83], Average Correlation Coefficient (ACC) [159], and Mean Square Error (MSE) [6]. The results show that the generative pre-trained DNN model outperformed the randomly initialized network trained through back propagation. Furthermore, Ding et al. [47] showed that this DNN model outperformed a traditional Hidden Markov Model (HMM) approach for head motion synthesis from speech [91] in the CCA analysis.
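To make two of these objective measures concrete, the following is a minimal sketch of MSE and an average correlation coefficient between predicted and ground truth head motion. It assumes that each frame is a list of head rotation values and that the average correlation is the mean per-dimension Pearson correlation; the function names are illustrative, and the exact formulations used in [45] may differ.

```python
import math

def mean_squared_error(pred, truth):
    """MSE over all frames and dimensions (pred/truth: lists of equal-length frames)."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, truth):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (undefined for constant series)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_correlation(pred, truth):
    """Assumed definition: mean Pearson correlation across motion dimensions."""
    dims = len(pred[0])
    per_dim = [
        pearson([f[d] for f in pred], [f[d] for f in truth])
        for d in range(dims)
    ]
    return sum(per_dim) / dims

# Toy example: three frames of (pitch, yaw, roll); prediction is offset by 1.
truth = [[1.0, 1.0, 1.0], [2.0, 3.0, 4.0], [3.0, 5.0, 7.0]]
pred = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
print(mean_squared_error(pred, truth), average_correlation(pred, truth))
```

A constant offset leaves the correlation high while the MSE is non-zero, which illustrates why such measures are usually reported together.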
Ding et al. [46] compared two types of neural network models, BLSTM and feed-forward networks, to learn the correspondences between speech and head motion. The results show that the BLSTM model significantly reduced the root mean squared error (RMSE) of predicted movements with respect to ground truth movements compared to the feed-forward model, which does not converge when the number of hidden layers is greater than two. Furthermore, the BLSTM model, with different numbers of hidden layers, achieves better performance than the feed-forward model in the Canonical Correlation Analysis (CCA) [83]. Moreover, a hybrid network composed of two BLSTM layers with one feed-forward layer in between shows higher performance in objective evaluations and in subjective evaluation (measuring the naturalness of head motion) than a separate BLSTM model and the other stacked network architectures.
Haag and Shimodaira [82] presented a bottleneck Deep Neural Network (DNN) architecture. Bottleneck features, which result from a DNN model containing a hidden bottleneck layer and trained on the features of speech and head motion, are used together with speech features as input to another DNN model with a BLSTM layer in a forward pass in order to synthesize head motion. These bottleneck features can capture the dependencies between the features of speech and head motion curves, which allows for improving the accuracy of generating head movements. They report that bottleneck features enhanced the performance of the DNN-BLSTM architecture and achieved better scores in the Canonical Correlation Analysis (CCA) [83] than when they were not present in the architecture.
Greenwood et al. [77] introduced a Bi-directional Long Short-Term Memory (BLSTM) model to predict head motion from speech and further extended the model through conditioning by a prior motion input in order to limit the possible head motion predictions for speech. Moreover, they proposed a generative Conditional Variational Autoencoder (CVAE) [179] using BLSTM models as encoder and decoder to map speech to head motion. This last model allows for predicting a variety of output head motion curves for the same speech input by sampling from the Gaussian latent space and conditioning on speech features.
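The one-to-many property of such a conditional generative model can be illustrated with a toy sketch. Everything here is an assumption for illustration: `toy_decoder` is a hypothetical hand-written stand-in for a trained BLSTM decoder, the speech features are two made-up values per frame, and the latent dimension is arbitrary. Only the sampling pattern (draw a latent vector from a standard Gaussian, then decode it together with the speech features) reflects the idea described above.

```python
import random

def toy_decoder(z, speech_frame):
    """Hypothetical stand-in for a trained decoder: maps a latent vector z and
    one speech feature frame (energy, pitch) to three head rotation angles."""
    energy, pitch = speech_frame
    return [
        0.5 * energy + 0.8 * z[0],
        0.2 * pitch + 0.8 * z[1],
        0.1 * energy - 0.3 * z[0] + 0.5 * z[1],
    ]

def sample_head_motion(speech, latent_dim=2, rng=None):
    """Draw one latent vector from N(0, I) and decode the whole speech sequence."""
    rng = rng or random.Random()
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return [toy_decoder(z, frame) for frame in speech]

# The same speech input yields different motion curves for different latent samples.
speech = [(0.1, 0.3), (0.4, 0.2), (0.6, 0.1)]
curve_a = sample_head_motion(speech, rng=random.Random(0))
curve_b = sample_head_motion(speech, rng=random.Random(1))
print(curve_a[0], curve_b[0])
```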
Sadoughi and Busso [165] presented a conditional Generative Adversarial Network (GAN) [72] with BLSTM cells for generating head movements for speech segments. It learns, during training, the conditional distributions of head motion curves and prosodic features of speech. The performance of the proposed model was compared with a Dynamic Bayesian Network (DBN) [132] and a BLSTM model [46]. The results show that the proposed conditional GAN model outperformed the baseline DBN and BLSTM models in terms of the log-likelihood measures as well as in subjective evaluation.
Table 1 (surviving row fragment; column labels inferred): corpus: the IEMOCAP database [20]; training data: 38 mins from one actor; test data: 14 mins from the actor; objective evaluation: log-likelihood measures [64]; subjective evaluation: questionnaire, a pairwise comparison.
Table 1 summarizes the information on the corpora and evaluation approaches used in the studies covered in this survey. While most of these studies considered objective measures to evaluate the proposed models, some of them had subjective evaluations. It is noteworthy that the sizes of the corpora and the scale of evaluations are often small; therefore, measuring how appropriate the generated head gestures are is not always possible, and new metrics supplementing the existing objective metrics might be needed.

Summary: Head Gestures
• Different data-driven models can be used for successfully generating expressive head motion from speech, and all are likely to achieve a satisfactory level of subjective and objective performance.
• Defining a credible metric for the quality and appropriateness of the generated head motion is still an open challenge.
• The size of the training and test corpora is generally limited, which could affect the quality of the generated gestures. Creating larger corpora for head gesture generation is likely to be a good investment.

FACIAL EXPRESSIONS
The human face is an important channel for non-verbal communication [61]. Most research has focused on facial animation to express facial affect (or emotions) [146], and typically uses the facial Action Units (AU) schema by Ekman et al. to present facial animations in a numerical manner [50]. Along with the basic emotional model suggested by Ekman, the Facial Action Coding System (FACS) [51], a systematic method for describing and measuring facial movements in response to emotions, is leveraged as a common representation of facial affect in most of the works on facial expression generation. Researchers consider facial modalities such as gaze, eyebrow actions, and head motion [132]; eye behaviour, mouth, eyebrows, nose, the shape of the face, cheeks, wrinkles, neck and even hair [190]; and lip motion [130] to contribute to facial behaviour and expression generation. While the majority of studies consider facial expressions in close relation to emotions [25,164], elsewhere research focuses on facial units regardless of emotions, using the term facial gestures [53,61]. Generally, facial expression generating models are based on Dynamic Bayesian Networks (DBN) [132], Generative Adversarial Networks [72] and Long Short-Term Memory (LSTM) [90]. In this survey, facial expression generation is discussed in two subsections, distinguishing natural facial behaviours (such as blinking, lip-syncing, etc.) and affective facial expressions.

0, 1 The reporting of dataset durations for training and test splits from different works in this table and hereinafter was constrained by their availability.
2 Each of the following datasets has been processed by the authors to extract the characteristics of speech and head motion in order to train the proposed models, except in Ding et al. [46] and Sadoughi and Busso [165] where audio-visual data and features are provided [20,158].
3 Not applicable: with respect to the evaluation metric, a particular metric is not applied in the work.
4 The authors did not provide clear information on the size of the training and testing data.
5 Dataset sizes are not available.
6 Greenwood et al. [77] did not use any objective or subjective measures. Instead, they discussed the characteristics of the generated head motion with respect to the ground truth.

Natural Facial Expressions
The following works center around the facial expressions deemed "independent of facial expressions of emotions" such as raising an eyebrow, winking, shaking the head [53] or blinking and frowning [206].
Taylor et al. [188] proposed to use a Sliding Window Deep Neural Network (SW-DNN) [103] to generate lip movements using the Mel-frequency Cepstral Coefficients (MFCCs) of the speech input from the audio-visual KB-2k [189] speech dataset. The model was benchmarked against the HMM inversion (HMMI) [66] and was also evaluated subjectively for perceived realism alongside ground truth (GT) and HMMI, determining the average response rate. As a result, the SW-DNN model achieved the best results in generating lip movements and mouth shapes.
van der Struijk et al. [202] developed a generative FACSvatar7 framework for modelling virtual avatars' facial animation based on Facial Action Coding System (FACS) [161] data. The framework enables a data-driven generation of facial animation through a simple Gated Recurrent Unit (GRU) neural network implemented with Keras8. Input data was obtained through OpenFace2, which, from FACS-based [51] input, sent AU, eye gaze, and head rotation data to ZeroMQ in real time. The subjective evaluation results regarding the generation of facial configurations demonstrated that the DNN model in the machine learning module requires further improvements. Moreover, the performance of the FACSvatar framework was tested on several modules, such as CSV offline, Bridge, AU to Blend Shapes, Visualisation in Unity 3D and Machine Learning. The main limitation of this framework is the shortage of datasets with different AU intensities, which seems to impede the machine learning process.
Jonell et al. [99] proposed a probabilistic method to generate interlocutor-aware facial expressions using four modalities: an interlocutor's acoustic features and facial features as well as the avatar's acoustic features and existing facial features. Although the model resembles MoGlow [87,105], it differs by using multiple modalities and encoding each modality by separate networks, such as Multi-layer Perceptrons (MLPs), Recurrent Neural Networks (RNNs) and 1D-convolution networks (CNNs). As an objective measurement, the authors used log-likelihood and its ablations as well as mismatched sequences. As for the subjective evaluation metrics, a user study used a single question across five experiments with the participants on their perceptions of the system. The experimental results demonstrated the significance of multimodal input in generating appealing facial expressions in response to the interlocutor.

The network takes two types of input: half a second of audio and a description of an emotional state. The former (audio) is used to output the 3D vertex positions of a fixed-topology mesh that correspond to the center of the audio window, while the latter (emotional state) disambiguates facial expressions and speaking styles.

Affective Facial Expressions
This subsection focuses on expressive facial animation generation. Research into affective facial expression generation in the domain of Embodied Conversational Agents (ECA) has produced some seminal works, such as [101,164], to name but a few. In the following paragraphs, we elaborate on works that consider emotion information, such as the six universally recognized emotions suggested by [52] (happiness, sadness, disgust, anger, fear, and surprise), in the design of facial expression generation models.
Karras et al. [101] presented a model based on a deep neural network to generate expressive 3D facial animations from speech audio (Fig. 2). The emotional states were presented as -dimensional vectors 9 fed to the network as a secondary input. The performance of the proposed model was compared in a subjective user study against video-based performance capture from the DI4D 10 system and dominance model-based animation produced by FaceFX 11 [39] as baselines. While the proposed model was outperformed in the naturalness of the output facial animations by the video-based performance capture model, it clearly outperformed the dominance model. The major shortcoming of the proposed model was its inability to represent eye motion due to mismatches with the audio. Therefore, combining the proposed approach with generative neural networks would provide a better synthesis of such details. While the model succeeded in producing plausible results for several emotional states (e.g., amused, surprised), a larger dataset might be useful to advance the model further.
Huang and Khan [94]  However, the authors emphasize directions for further improvements of the model in terms of using a larger dataset with multiple interviewers to enable the generalisation to different identities. Another possible enhancement would be combining the proposed model with a temporal recurrent network, namely an LSTM [90], to obtain video frames of facial expressions.
Sadoughi and Busso [164] presented a BLSTM [232] trained with speech features (i.e., Mel-frequency Cepstral Coefficients (MFCCs)) and the extended Geneva minimalistic acoustic parameter set eGeMAPS [57] for emotional speech-driven lip motion generation designed specifically for conversational agents. The proposed approach relied on multitask learning (MTL) 12, which created shared representations for the tasks. The study results were measured objectively through single task learning (STL) 13 and MTL comparison and benchmarked against state-of-the-art baselines [163,188]. Moreover, the subjective evaluation used Tukey's multiple comparisons test to assess the naturalness of the lip movements. The results demonstrated the advantage of MTL in the generation of lip movements corresponding to the original sequences, achieving the naturalness of animation. It is noteworthy that the MTL-based framework can be trained on partial information (i.e., without necessitating the full labelling of data).
Sadoughi and Busso [167] proposed a Conditional Sequential Generative Adversarial Network (CSG) model that learns the relationships between emotion, lexical content and lip movements using the spectral and emotional speech features as conditioning inputs to generate expressive and naturalistic lip movements. Compared against three DNN-based baselines [59,163,188] with the Parzen estimator [72], the model displayed higher log-likelihood and outperformed the other baselines in the objective evaluation. The subjective evaluation results showed a better performance for the CSG model in terms of the naturalness of the generated lip motions. The generated lip movements were also evaluated for their ability to convey emotional cues, showing that the CSG model conveys expressive cues close to the original recordings.
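A Parzen-window log-likelihood of the kind used to compare generative models can be sketched as follows. This is a minimal illustration assuming an isotropic Gaussian kernel with a hand-picked bandwidth `sigma`; the function name and the toy data are hypothetical, and the feature representation and bandwidth selection used in [167] are not described in this survey. A density is fitted to samples generated by the model, and the mean log-likelihood of held-out real data under that density is reported.

```python
import math

def parzen_log_likelihood(test_points, generated_samples, sigma):
    """Mean log-likelihood of test_points under a Gaussian Parzen window
    density centred on generated_samples (isotropic kernel, bandwidth sigma)."""
    n = len(generated_samples)
    d = len(generated_samples[0])
    # log of the normalising constant: N * (2 * pi * sigma^2)^(d / 2)
    log_norm = math.log(n) + 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for x in test_points:
        exponents = [
            -sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * sigma ** 2)
            for s in generated_samples
        ]
        # log-sum-exp for numerical stability
        m = max(exponents)
        log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
        total += log_sum - log_norm
    return total / len(test_points)

# Samples close to the real data score higher than samples far from it.
real = [[0.0, 0.0]]
near = parzen_log_likelihood(real, [[0.1, 0.0], [0.0, 0.1]], sigma=0.5)
far = parzen_log_likelihood(real, [[3.0, 3.0], [4.0, 4.0]], sigma=0.5)
print(near, far)
```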
Table 2 presents the summary of the corpora and evaluation metrics used in natural and affective facial expression generation. Corpora-wise, there seems to be large diversity in datasets to train models. In terms of representations, while some opted for Action Units [25], others relied on readily available large databases of facial expressions [61,94,202].

Summary: Facial Expressions
• Data-driven production of facial expressions, also known as facial gestures, has focused on creating natural (neutral) and affective facial expressions.
• Application domains vary significantly and range from the games industry to HRI.
• In terms of representation, some approaches opt for high-level Facial Action Units and audio-visual features [25], while others rely on readily available large databases of facial expressions [61,94,202]. Yet, there is an overall lack of more sophisticated datasets, i.e., emotional audio-visual data with a high spatial and temporal resolution.
• There is a lack of sophisticated expressive animation rendering toolkits for off-the-shelf production of facial expressions [167].

HAND GESTURES
As a natural mode of interaction, hand gestures carry important functions in human-human communication, such as maintaining an image of a concrete or abstract object and idea (iconic and metaphoric gestures), pointing and giving directions (deictic gestures), or emphasizing some parts of the speech (beat gestures) [134]. Hand gestures, including fingers and arms, also act as an independent modality or part of modalities designed for various virtual agents and robots, adding expressivity to their motions. This versatility of hand gestures served as an incentive for their application in such domains as human-computer interaction (HCI) [207] and its related fields, human-robot interaction (HRI) [128] and human-agent interaction (HAI). In HRI, hand gestures are applied to socially assistive robots (SARs) because of the expressivity they add to robots' verbal and non-verbal communication with humans [170]. Besides, hand gestures are believed to ease the interaction between humans and robotic agents [142].
A considerable amount of research has been conducted on the data-driven generation of hand gestures, utilizing various databases and displaying a range of architectural choices [113,194,228]. For example, the earliest work by Chiu and Marsella [29] in 2011 made use of Hierarchical Factored Conditional Restricted Boltzmann machines (HFCRBMs) [30], whereas the most recent works resorted to models such as Long Short-Term Memory networks [85,186] and a Variational Autoencoder (VAE) [111], to mention a few. Despite their purely communicative nature, sign language gestures are not covered in this survey as they rely almost entirely on the visual modality. Thus, in the paragraphs that follow, we cover the hand gestures that are characteristic of co-speech communication of information.
Chiu and Marsella [29] relied on Hierarchical Factored Conditional Restricted Boltzmann machines (HFCRBMs) [30], an extension of the Deep Belief Network [89], to generate hand gestures that are tied to prosodic information. In particular, the gesture generator function learns the relationship between previous motion frames, audio features

Bozkurt et al. [17] presented a speaker-independent framework for joint analysis of hand gestures with continuous affect attributes, such as activation, valence, and dominance, and speech prosody using Hidden semi-Markov models (HSMMs) [230]. Moreover, during the synthesis step, prosody feature extraction and continuous affect attributes are followed by the HSMM-Viterbi algorithm. Gestures in motion capture data were represented by joint angles of arms and forearms. Consequently, the animation is generated via unit selection applied on a gesture pool with regard to a multi-objective cost function. Their system was trained on the multimodal USC CreativeIT database [135]. Phrase-level gesture sequences for 1) affect and prosody feature fusion, 2) prosody only, and 3) affect only configurations were evaluated based on Canonical Correlation Analysis (CCA) scores [83] and symmetric Kullback-Leibler (KL) divergence.
Their findings suggest that affect and prosody fusion provides the best correlation with the original gesture trajectories, and has the best gesture and gesture duration modeling. On the other hand, the affect only configuration has the least kinetic energy difference with the original sequence. Subjective evaluations were planned for their future work.
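As an illustration of the symmetric KL divergence mentioned above, the sketch below compares two discrete distributions, for example histograms of gesture durations in generated and original sequences. It assumes the common definition as the sum of the two directed divergences, with a small smoothing constant to avoid division by zero; the function names and the histogram values are made up, and the exact variant and the quantities compared in [17] are not specified in this survey.

```python
import math

def normalise(counts, eps=1e-12):
    """Turn non-negative counts into a smoothed probability distribution."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]

def kl_divergence(p, q):
    """Directed KL divergence KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def symmetric_kl(counts_p, counts_q):
    """Assumed definition: KL(p || q) + KL(q || p)."""
    p, q = normalise(counts_p), normalise(counts_q)
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical duration histograms for original and generated gestures.
original = [5, 20, 40, 25, 10]
generated = [10, 30, 30, 20, 10]
print(symmetric_kl(original, generated))
```

Lower values indicate that the generated distribution is closer to the original one, and identical histograms give zero.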
Takeuchi et al. [186] used deep neural networks with Bi-directional Long Short-Term Memory (BLSTM) [232] to study the production of metaphoric hand gestures from speech features of audio. During the data pre-processing, the hand gestures were represented as rotations of bone joints. The network is composed of three non-recurrent layers, a BLSTM layer, and a final output layer. The first non-recurrent layer takes Mel-frequency Cepstral Coefficients (MFCCs) features of audio as input, while other non-recurrent layers take independent data. On the other hand, the final output layer takes the backward and forward recurrence units from the BLSTM layer as input. Thus, the model output, the vector of prediction, is represented in a BioVision Hierarchy (BVH) format. The objective evaluation, conducted by comparing

Hasegawa et al. [85] presented the BLSTM model integrating it with Bi-directional Recurrent Neural Networks (RNN) [75] to generate co-speech 3D metaphoric hand gestures from speech audio. Specifically, speech audio features were converted to Mel-frequency Cepstral Coefficients (MFCC) features and the joint positions of a whole body were used to represent the gestures. The network learns the relationship between speech and audio with backward and forward consistencies. Similar to the model proposed by Takeuchi et al. [186], the architecture consists of five layers shown in Figure 3. The objective evaluation was performed through Average Position Error (APE) 19 [117], which displayed insignificant errors in the left and right wrists in terms of accuracy. Moreover, the user study revealed that the generated gestures among the three gesture conditions (original, mismatched, and generated) were perceived as significantly more natural but significantly less consistent in time and semantics than the original gestures.
Kucherenko et al. [112] presented a novel speech-input and gesture-output Deep Neural Network (DNN) framework consisting of two steps. First, the network learns the lower dimensional representation of human motion with a denoising autoencoder neural network. Then, an encoder network SpeechE learns a mapping between speech and a corresponding motion representation. Kucherenko et al. [112] applied representation learning on top of the DNN model to make learning from speech and speech-to-motion mapping easier. The objective evaluation compared the proposed network with the baseline BLSTM model presented in Hasegawa et al. [85] using Average Position Error (APE) 20 [117] and Motion Statistics 21 as metrics for the average distance between the generated and original motion as well as the average values and distributions of acceleration and jerk, respectively. The proposed model achieved better results compared to the baseline and demonstrated the plausibility of the generated gestures. A further validation of the results through a user study confirmed the model's performance in terms of producing natural gestures.
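The two objective metrics can be sketched as follows, under stated assumptions: motion is a list of frames, each frame a list of 3D joint positions; APE is taken as the mean Euclidean distance between predicted and original joint positions; and acceleration and jerk are estimated by repeated finite differences at a fixed frame rate. The function names are illustrative, and the implementations in [85,112] may differ in detail (for example, in how joints are normalised or averaged).

```python
import math

def average_position_error(pred, truth):
    """Mean Euclidean distance between predicted and original joint positions."""
    distances = [
        math.dist(p_joint, t_joint)
        for p_frame, t_frame in zip(pred, truth)
        for p_joint, t_joint in zip(p_frame, t_frame)
    ]
    return sum(distances) / len(distances)

def finite_difference(seq, dt):
    """Frame-to-frame derivative of a sequence of frames of 3D vectors."""
    return [
        [tuple((b - a) / dt for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]

def mean_magnitude(seq):
    """Average vector norm over all frames and joints."""
    norms = [math.hypot(*vec) for frame in seq for vec in frame]
    return sum(norms) / len(norms)

def motion_statistics(positions, fps):
    """Average acceleration and jerk magnitudes (needs at least four frames)."""
    dt = 1.0 / fps
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return mean_magnitude(acceleration), mean_magnitude(jerk)

# Toy example: one joint moving along x with constant acceleration.
original = [[(float(t * t), 0.0, 0.0)] for t in range(5)]
generated = [[(x, y + 3.0, z + 4.0) for (x, y, z) in frame] for frame in original]
print(average_position_error(generated, original))
print(motion_statistics(original, fps=1.0))
```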
19 APE compares the predicted positions with the original ones that accompany speech and calculates the Euclidean distance.
20 Ibid., p. 10
21 The average values and distributions of acceleration and jerk for the produced motion.

Ginosar et al. [70] presented a model based on a Convolutional Neural Network with a Generative Adversarial Network (CNN-GAN) and log-mel spectrogram input, which can predict and generate hand gestures from a large dataset of speech audio [70]. For gesture representation, the authors used skeletal keypoints corresponding to the neck, shoulders, elbows, wrists and hands, which were obtained through OpenPose [24]. The network learns to map speech to gesture using L1 regression, while the adversarial discriminator D ensures that the produced motion is plausible. Using the L1 regression loss and percent of correct keypoints (PCK) [225] as objective evaluation metrics, the authors found that the proposed model outperformed an RNN-based baseline [176] in gesture generation. In addition, the extent to which the produced gestures were convincing was measured through a perceptual study, using the percentage of generated sequences labelled as real as the metric. The comparison between fake (produced by an algorithm) and real pose sequences did not show a statistically significant difference.
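The L1 regression loss and PCK used to evaluate keypoint-based gesture generation can be illustrated with a minimal sketch. It assumes 2D keypoints per frame and a fixed absolute distance threshold for PCK; in practice the threshold is often defined relative to the size of the person's bounding box, and the exact setting used in [70] is not given in this survey. The function names and the toy keypoints are illustrative only.

```python
import math

def l1_loss(pred, truth):
    """Mean absolute difference over all keypoint coordinates."""
    diffs = [
        abs(p - t)
        for p_frame, t_frame in zip(pred, truth)
        for p_kp, t_kp in zip(p_frame, t_frame)
        for p, t in zip(p_kp, t_kp)
    ]
    return sum(diffs) / len(diffs)

def pck(pred, truth, threshold):
    """Fraction of predicted keypoints within `threshold` of the ground truth."""
    hits = [
        math.dist(p_kp, t_kp) <= threshold
        for p_frame, t_frame in zip(pred, truth)
        for p_kp, t_kp in zip(p_frame, t_frame)
    ]
    return sum(hits) / len(hits)

# One frame with two keypoints: the first is 1 unit off, the second 5 units off.
truth_kp = [[(0.0, 0.0), (10.0, 0.0)]]
pred_kp = [[(1.0, 0.0), (10.0, 5.0)]]
print(l1_loss(pred_kp, truth_kp), pck(pred_kp, truth_kp, threshold=2.0))
```

The L1 loss penalises every coordinate error in proportion to its size, whereas PCK only counts whether each keypoint falls within the tolerance, so the two measures can rank models differently.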
Yoon et al. [228] deployed a Bi-directional Recurrent Neural Network (RNN) model consisting of an encoder and decoder for co-speech gesture generation from speech text input. More specifically, the encoder takes the input text, while the decoder RNN with pre- and post-linear layers generates gestures. The model was trained on the TED Gesture Dataset [228] to produce four common types of gestures (iconic, metaphoric, deictic, and beat gestures) from both trained and untrained speech texts. A gesture is represented as a sequence of human poses, namely, joint configurations of the upper body. The speech text is represented as a sequence of words, and each word is encoded as a one-hot vector that indicates the word index in a dictionary. The subjective evaluation indicated that anthropomorphism and speech-gesture correlation were the most crucial factors for participants' perception of the generated gestures. The results were also significant relative to the three baseline methods, as measured with BLEU²² [149]. Because the study used only speech text, the gestures were only weakly coupled with the audio; adding audio input could improve this.

Tuyen et al. [194] employed a conditional extension of the Generative Adversarial Network (CGAN) [72] with an additional input condition. The GAN includes convolutional Generator (G) and Discriminator (D) networks. Altogether, the model generates communicative gestures by synthesizing the verbal content of speech. Here, the gestures were represented as human joint configurations. The objective evaluation was carried out through covariance with temporal hierarchical construction [95]. Overall, the results illustrated the successful training of the model to imitate hand gestures that corresponded to the meaning of an utterance, which matched the iconic gestures by definition [134].
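The one-hot word encoding described for the text input of Yoon et al. [228] can be sketched as follows. This is an illustration only: the helper names, the whitespace tokenisation and the toy sentences are assumptions, and the source does not describe how the authors built their dictionary.

```python
def build_vocab(sentences):
    """Assign each distinct word an index in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's dictionary index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

# Hypothetical toy corpus.
vocab = build_vocab(["look at this big idea", "this idea is big"])
encoded = [one_hot(w, vocab) for w in "this is big".split()]
for word, vector in zip("this is big".split(), encoded):
    print(word, vector)
```

Each vector is as long as the dictionary, so a realistic vocabulary yields very sparse inputs; this is one reason such encodings are commonly followed by a learned embedding layer.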
Lee et al. [118] introduced a temporal neural network, trained with Inverse Kinematics (IK) loss to generate finger motions and hand gestures taking upper body joint angles and audio as input from a multimodal 16.2-million-frame (16.2M) dataset [118], created alongside the model. The audio features included frequency (e.g., pitch, jitter), energy, amplitude (e.g., shimmer, loudness), and spectral features. The IK was applied to LSTM [90], Variational Recurrent Neural Network (VRNN) [35], and Temporal Convolutional Network (TCN) [198] to incorporate kinematic structural knowledge. The ablation study results demonstrated the advantages of IK loss function contrary to joint angle loss, whereas the subjective evaluation yielded positive results with respect to the proposed model and its capability to generate natural human-like finger gestures.

| Study | Corpus | Training set | Test set | Objective metrics | Subjective metrics |
|---|---|---|---|---|---|
| Hasegawa et al. [85] | Gesture-speech dataset [187] | 143 minutes²⁵ (767 sentences) | 16 minutes²⁶ (90 sentences) | Average Position Error (APE) [117] | Questionnaire (naturalness, time consistency, and semantic consistency) |
| Kucherenko et al. [112] | Gesture-speech dataset [187] | 171 minutes | 20 minutes | Average Position Error (APE) [117] | Rating of statements on 7-point Likert-scale (naturalness, time consistency, and semantic consistency) |
| Ginosar et al. [70] | Person-specific video dataset [70] | 115.2 hours | 14.4 hours (2048 intervals) | L1 Regression Loss²⁷ and percent of correct keypoints (PCK) [224] | Questionnaire (real vs. fake), pairwise comparison |
| Yoon et al. [228] | TED Gesture Dataset [228] | 52 hours | N/A²⁸ | N/A | Questionnaire (anthropomorphism by Godspeed, likeability, speech-gesture correlation) |
| Ferstl et al. [63] | Natural speech and 3D motion dataset [63] | 3.75 hours (226 minutes) | 6.5 minutes | Accuracy of the binary cross-entropy objective | N/A |
| Tuyen et al. [194] | KIT whole-body motion database [131]; 20 optical markers in 3D | | | | |

Table 3 presents the summary of the corpora and evaluation metrics employed in the studies above. The majority of studies relied on both objective and subjective evaluation criteria, while a few studies either used objective [194] or subjective evaluation criteria [96,228]. To sum up, the works reviewed here demonstrate the prevalence of speech input data among data modalities used for hand gesture generation. Model-wise, recent research [63,85] shows a comprehensive exploration of recurrent networks to capture the dynamics of human motion, which excel at solving gesture generation tasks. That being said, an omnipresent limitation of such models lies in the dearth of gesture-rich datasets required to enable a robot to produce a wide range of hand gestures as opposed to certain predefined gestures produced with sparse datasets [29]. Interestingly, the training and test sets used in [29] seem arguable considering the training and test set sizes used in other works. Thus, the following section reviews the existing state-of-the-art on models that consider other body parts along with hands, hence outputting appropriate behaviours.

Summary: Hand Gestures
• Data-driven generative models for hand gestures aim to generate four types of gestures (beat, deictic, iconic and metaphoric) but struggle with the latter two as semantics are often poorly modelled.
• The generated gestures often look natural, but the match to the spoken content is not yet good enough. Generating semantically matched hand gestures remains a challenge.
• Two important limitations are the scope of datasets and the lack of diversity. Most studies use single-speaker datasets, with English being the dominant language across corpora. Interactive applications would benefit from dyadic or multiparty datasets. Cultural diversity and appropriateness would benefit from datasets from other languages and cultures.

MULTIMODAL GESTURES
In this survey, we use the term multimodal gestures to refer to the multimodality of the output. In particular, we follow the interpretation of multimodal output by Rojc et al. [160], who emphasized the importance of synchronising the generated non-verbal gesture types (facial expressions, head, hands, and body) with the verbal output (speech audio or video) in an attempt to make the interaction more natural and fluent. Therefore, this section discusses the generation of multimodal outputs such as head and facial movements synchronized with speech [26,48,58,132], or body behaviours involving the shoulders and torso along with facial movements [31,49,113], accompanied by speech.
An audiovisual model by Mariooryad and Busso [132] relied on three joint Dynamic Bayesian Networks (jDBNs) to generate facial gestures, involving head and eyebrow movements, by mapping the acoustic speech data from the IEMOCAP database [20] to Facial Animation Parameters [145].The model was trained by adapting the algorithms used for HMM and FHMM [68].Using the Canonical Correlation Analysis (CCA) [44,83], the joint DBN model was compared to similar models used to synthesize head and eyebrow motions separately.Overall, the objective evaluation results revealed that the jDBN models can cope with speaker variability, while the subjective results showed an increase in the quality of jointly modeled eyebrow and head gestures as well as their naturalness.
Ding et al. [48] proposed an animation model of a virtual agent, based on a fully parameterized Hidden Markov Model (FPHMM), which produces head and eyebrow movements in synchronisation with speech. As an extension of the contextual HMM, in FPHMM [216], contextual variables control and parametrize the means, covariance matrices, transition probabilities, and initial state distribution. The model was evaluated objectively and subjectively on the Biwi 3D AudioVisual Corpus of Affective Communication database [60], considering facial motion and speech features. An objective evaluation against the baseline proposed by [132], using the mean squared error (MSE) [6], demonstrated the best performance by the HMM-based joint model. Overall, the proposed model demonstrated an ability to capture the link between speech prosody and head and eyebrow motions. Subjectively, the perceptual questionnaire struggled to validate the objective evaluation, as the results were only marginally significant, showing nearly identical performance in terms of expressiveness.
Ding et al. [49] presented a multimodal behaviour generation model based on the contextual Gaussian model and a Proportional-Derivative (PD) controller. They leveraged the AVLaughter database [196] to produce multiple outputs (lip, jaw, head, eyebrow, torso and shoulder motions) synchronized with laughter audio. Using pseudo-phonemes and speech features as input, motion synthesis was carried out in three steps: first, the lip and jaw motions were synthesized by a contextual Gaussian module (CGM); second, speech features were extracted to predict head and eyebrow movements; third, torso and shoulder motions were synthesized from the output of the previous step by concatenation. The subjective evaluation of the generated laughter and bodily behaviours, using a questionnaire adapted from [143] and Likert-scale ratings, showed users' preference for an agent that produces synchronized speech and laughter animations.
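The source does not detail how the PD controller is used in [49]. As a generic illustration of the technique only, the sketch below shows a one-dimensional PD controller driving a joint value towards a sequence of target values, which yields a smooth trajectory rather than abrupt jumps. The gains `kp` and `kd`, the time step and the unit-mass dynamics are all assumptions chosen for illustration.

```python
def pd_track(targets, kp=40.0, kd=12.0, dt=0.01, steps_per_target=10):
    """Follow a sequence of target values with a PD controller.

    The acceleration is proportional to the position error (kp) and is
    damped by the current velocity (kd); unit mass is assumed.
    """
    position, velocity = targets[0], 0.0
    trajectory = []
    for target in targets:
        for _ in range(steps_per_target):
            acceleration = kp * (target - position) - kd * velocity
            velocity += acceleration * dt
            position += velocity * dt
        trajectory.append(position)
    return trajectory

# Hypothetical step change in the target: the output approaches it smoothly.
trajectory = pd_track([0.0] + [1.0] * 100)
print(trajectory[1], trajectory[-1])
```

With these illustrative gains the response is close to critically damped, so the trajectory settles on the target with negligible overshoot.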
Chiu and Marsella [31] introduced a combined model to learn a twofold mapping: from speech to a gestural annotation using Conditional Random Fields (CRFs) and from gestural annotation to gesture motion by applying Gaussian Process Latent Variable Models (GPLVMs) [208].The model was subjectively evaluated against the approach by [29], which used direct mapping.The subjective evaluation was followed up by an objective assessment to establish the performance of the model against support vector machines (SVMs) [42].As a result, the proposed method performed significantly better in generating and coupling the gestures with speech, despite the hurdles of the inference model that requires temporal information.
Fan et al. [58] discussed the use of a deep Bi-directional Long Short-Term Memory (DBLSTM) network [232] to model the temporal and long-range dependencies of audio/visual stereo data for a photo-real talking head animation from audio, video, and text input. To train the network, the study used the back-propagation through time (BPTT) algorithm [214,215]. The study demonstrated the advantages of two BLSTM layers sitting on top of one feed-forward layer on the datasets. As a result of objective (RMSE [73,162,209] and CORR [215]) and subjective evaluation (A/B preference test [108]), the proposed deep BLSTM model showed higher performance compared with the previous HMM-based approach.
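The two objective measures mentioned for this study, RMSE and correlation (CORR), can be illustrated with a short sketch on a single parameter trajectory. It assumes CORR denotes the Pearson correlation coefficient; the function names and the toy trajectories are illustrative and not taken from [58].

```python
import math

def rmse(pred, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def pearson(pred, truth):
    """Pearson correlation coefficient between two sequences."""
    n = len(truth)
    mean_p, mean_t = sum(pred) / n, sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    spread_t = math.sqrt(sum((t - mean_t) ** 2 for t in truth))
    return cov / (spread_p * spread_t)

# Hypothetical trajectory: the prediction has the right shape but a constant offset.
truth = [0.0, 1.0, 2.0, 3.0]
pred = [0.5, 1.5, 2.5, 3.5]
print(rmse(pred, truth), pearson(pred, truth))
```

The example shows why the two measures are reported together: a constant offset is penalised by RMSE but leaves the correlation at its maximum, since the shape of the trajectory is preserved.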
Li et al. [123] adopted a deep Bi-directional Long Short-Term Memory (DBLSTM) [232] recurrent neural network as a regression method to generate audiovisual animation of an expressive talking face.This method was devised to overcome the shortcomings of the previous state-of-the-art models in incorporating lip movements with emotional facial expressions.Thus, Li et al. [123] proposed five methods based on DBLSTM trained using a large corpus of neutral data and a smaller scale corpus of emotional data.Specifically, in method (a), the DBLSTM network is trained with emotional corpus only; method (b) and (c) capture neutral and emotional information simultaneously by training a single DBLSTM network; while method (d) and (e) capture neutral information by a separate DBLSTM network in addition to emotional DBLSTM.To evaluate the proposed approaches, the authors adopted root mean squared error (RMSE) between the predicted Facial Animation Parameters (FAP) and ground truth.This revealed how different regression models worked for different emotions.Notably, information from the neutral dataset was found more valuable for peaceful expressions (e.g., sadness) than exaggerated expressions (e.g., surprise and disgust).A further frame-wise comparison of RMSE values displayed the effectiveness of the proposed methods in modelling the interaction between emotional states, facial expressions and lip movements.Finally, the subjective evaluation results confirmed the effectiveness of using the neutral dataset as it can improve the performance of an expressive talking avatar.
Suwajanakorn et al. [183] used recurrent neural networks to learn the mapping from raw audio input (MFCC audio features) to lip landmarks (PCA), synthesizing lip textures and then merging them into the 3D face to output a realistic talking head with clear lip motions synced with the input audio.The network consisted of LSTM nodes and was trained using backpropagation through time with 100 time steps.When compared against AAM approach [41] and Face2Face algorithm [191] in an objective evaluation, the proposed method synthesized cleaner and more convincing lip movements.
Chung et al. [37] proposed an encoder-decoder CNN-based Speech2Vid model, taking still images and audio speech segments to output a video of the face, with the lips synchronized with the audio. The architecture comprises three modules, an audio encoder, an identity encoder, and an image decoder, which were trained together. Learning the joint embedding of the target face and speech segments is central to this approach in generating a talking face. Evaluations, conducted to qualitatively measure the quality using the alignment and the Poisson editing [150] techniques, determined the ability of Speech2Vid to generate videos of talking faces with certain identities.
Chen et al. [26] developed a method that takes speech audio and one lip image of a target identity as input and generates multiple lip images accompanying the speech audio. The model combines correlation networks with an audio encoder and an optical flow encoder, implemented on a 3D RNN to mitigate delayed correlation problems. The generated lip movements were evaluated quantitatively and qualitatively on the GRID corpus [40] and the LRW [36] and LDC [157] datasets, not used previously for training purposes, with several metrics: LMD, CPBD [140], Structural Similarity (SSIM), and Peak Signal-to-Noise Ratio (PSNR) [213]. The proposed model generated realistic lip movements and proved robust to view angles, lip shapes, and facial characteristics. However, its main limitations stem from learning from a single image, which resulted in difficulties in capturing lip deformations.
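Two of the metrics listed for this study can be sketched in a few lines. PSNR follows its standard definition, 10·log10(MAX² / MSE). The landmark distance is written here as the mean Euclidean distance between corresponding landmarks over all frames, which is an assumed reading of LMD rather than a definition given in the source. The tiny greyscale "images" and landmark lists are hypothetical.

```python
import math

def psnr(image_a, image_b, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-sized images."""
    flat_a = [v for row in image_a for v in row]
    flat_b = [v for row in image_b for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def landmark_distance(pred_frames, true_frames):
    """Mean Euclidean distance between matching landmarks over all frames."""
    total, count = 0.0, 0
    for pred, true in zip(pred_frames, true_frames):
        for p, t in zip(pred, true):
            total += math.dist(p, t)
            count += 1
    return total / count

real = [[100, 100], [100, 100]]
fake = [[110, 100], [100, 90]]
print(psnr(real, fake))
print(landmark_distance([[(0, 0), (3, 4)]], [[(0, 0), (0, 0)]]))
```

Higher PSNR means a generated frame is closer to the reference at pixel level, while a lower landmark distance means the lip shape is closer; the two can disagree, which is why image-quality and landmark metrics are typically reported side by side.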
Plappert et al. [153] introduced a model based on deep Recurrent Neural Networks (RNNs), and sequence-to-sequence learning [182], which learns a bi-directional mapping between whole-body motion and natural language.One model is fed the encoded motion sequences obtained from motion capture recordings during training, and the other is trained on natural language descriptions to generate whole-body motions.Based on the quantitative comparison with the baseline model, the language-to-motion model demonstrated the capability of generating proper human motion, achieving higher performance rates.The performance of the model was also measured by BLEU scores [149], which suggested minimal overfit and generalisation to previously unseen motions.The model showed a capability to generate whole body motions given proper descriptions in natural language.
Alexanderson et al. [5] adapted a deep learning-based MoGlow [87] for a probabilistic speech-driven model to output full-body gestures synced with speech.Particularly, the normalising flows were used the same way as GANs to generate output by a nonlinear transformation of latent noise variables.Thus, four models were trained on a speech-only condition, while the other four were conditioned on style control.The model was compared against three baselines taking the same speech representation as input: unidirectional LSTM [90], conditional variational autoencoder (CVAE) [77], and the audio-to-representation system (ARP) [112].While the subjective evaluation of the style control experiment yielded significant results in favor of the MoGlow-based model for the human-likeness of the gesticulation, the model trained on speech only achieved better results compared to the second baseline.
Dahmani et al. [43] used a conditional generative model based on a variational auto-encoder (VAE) framework for expressive text-to-audiovisual speech synthesis. The proposed model learns from textual input, which provides the VAE with an embedded representation to further capture emotion characteristics (Fig. 4). Although the experimental results showed a high recognition rate for almost all emotions in audiovisual animations, sadness and fear turned

Yoon et al. [227] discussed an end-to-end model that takes speech text, audio, and speaker identity to generate upper-body gestures, co-occurring with speech and its rhythm. The proposed method is based on Bi-directional GRU [32] along with recurrent neural networks used for encoding three different input modalities. The ablation study demonstrated that all three modalities had a positive effect on the generation of gestures. Overall, the proposed model performed well as identified by a novel objective evaluation metric called Fréchet Gesture Distance (FGD) [88], and subjectively outperformed the state-of-the-art approaches for gesture generation and provided a path towards performing gesture style transfer across multiple speakers. Perceptual studies also showed that the animations generated by the proposed model were more natural whilst being able to retain or transfer style.
Wang et al. [210] introduced an integrated deep learning architecture for speech and gesture synthesis (ISG) model to synthesize two modalities in a single model, compatible with both social robots and embodied conversational agents (ECAs).The proposed model is adapted from Tacotron 2 [174] and Glow-TTS [102], with Tacotron 2 being auto-regressive and non-probabilistic and Glow-TTS being parallel and probabilistic, and takes text as input to generate speech and gesture.Subjective tests performed separately for each modality demonstrated that one of the proposed ISG models (ST-Tacotron2-ISG) performs comparably to the current state-of-the-art pipeline system while being faster and having much fewer parameters.
Huang et al. [93] proposed a fine-grained Audio-to-Video-to-Words framework, called AVWnet, designed to produce videos of a talking face in a coarse-to-fine manner while maintaining audio-lip motion consistency. The framework combines a tree-like architecture with a GAN-based [72] neural architecture for synthesizing realistic talking face frames directly from audio clips and an input image. The GAN framework is conditioned on image features to enable further fusion of facial features and audio information in generating the face video. The model was evaluated objectively with the Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Landmark Distance Error (LMD). Compared with the state-of-the-art approaches [27,37], AVWnet excelled on all three adopted metrics and datasets in the objective evaluation. A perceptual user study comparing the proposed model with the model by Chen et al. [27] found the former to be as good as the existing model.
Zhou et al. [236] presented a model that learns from disentangled audio-video representations to generate a talking face corresponding to speech.Both talking video and audio were used to train the Disentangled Audio-Visual System (DAVS).The DAVS network demonstrated several advantages over the previous baseline [36], which encompass the improvement of lip-reading performance, unification of audio-visual speech recognition and synchronisation in an end-to-end framework, and the achievement of a high-quality and temporally accurate talking face generation as a result of both subjective user study and effectiveness verification by Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [213].
Sadoughi and Busso [166] presented a Constrained Dynamic Bayesian Network (CDBN) [132] to overcome the individual limitations of rule-based and data-driven approaches in gesture generation. The authors aimed to build a generative model to produce believable hand gestures along with head gestures with bimodal audio-speech and video data synchronisation. The model was evaluated by two objective metrics: canonical correlation analysis (CCA [21,83]) and log-likelihood rate (LLR) [136]. Based on the results of the subjective evaluation, the CDBN model was perceived to generate more appropriate and natural gestures than the baseline models. Overall, the hand gestures generated by the constrained model showed 85% accuracy for certain types of gestures.
Vougioukas et al. [206] discussed a GAN-based talking face generator, consisting of a temporal generator and multiple discriminators, which takes a single image and raw audio signals as input. The quality of the generated video output was evaluated on the GRID [40] corpus, TCD TIMIT [84] corpus, CREMA-D [23] and LRW [36] datasets by applying reconstruction (Peak Signal-to-Noise Ratio and Structural Similarity [213]), sharpness (cumulative probability blur detection (CPBD) measure [139]), content (average content distance (ACD) [193] and word error rate (WER)), and audio-visual synchrony metrics. When assessed subjectively, the results of the Turing test³⁰ showed the naturalness of the generated faces. Moreover, compared to baselines [37,183], the model demonstrated an ability not only to capture and maintain identity but also to generate facial expressions matching the speaker's tone and speech.
Sinha et al. [177] approached the generation of identity-preserving and audio-visually synchronized 2D facial animation through a GAN, utilizing DeepSpeech features, given an audio input of speech, and facial landmarks from benchmark corpora such as GRID [40] and TCD-TIMIT [84]. The same objective evaluation metrics as in [26] were used in the study. Moreover, a qualitative evaluation compared the model with the state-of-the-art baselines of [26], [206], and [236]. These evaluations yielded overall positive results regarding identity preservation, superior image quality and texture clarity, and smooth audio-visual synchronisation.
Tables 4 and 5 summarize the state-of-the-art in multimodal gesture generation, concerning the corpora and evaluation metrics used. Even though studies emphasize objective evaluation as a challenging task, the existing literature shows effective and nuanced exploitation of objective metrics along with subjective ones. Note that objective metrics are often the same as the cost functions used to optimise the generative models, with authors assuming that optimising the cost functions equates with improving the model's performance. However, for now subjective measures remain the gold standard for assessing the quality of the generated behaviour, and this is recognised across the field.

Summary: Multimodal Gestures
• Multimodal gesture generation creates an opportunity for a holistic approach to generating social behaviour, and improves over generating isolated behaviours (e.g., hand gestures, speech synthesis). Early demonstrations exist combining speech and hand gestures, and speech and body behaviours, to mention but a few.
• Future developments are expected to broaden the scope of multimodal gesture generation. Potential low-hanging fruit is using or predicting emotional states, e.g. from audio, to produce corresponding communicative behaviour [183], and moving towards gestures driven by semantic content [5,113].
• In most multimodal generative systems, the different modalities are still considered in isolation. Building a flexible system that is able to jointly generate whole-body gestures, from and with verbal cues, remains a challenge [183,227].

SPEECH SYNTHESIS
Speech is often a prime aspect of interactive communication, and in embodied systems often co-occurs with gestures.
Tacotron: Wang et al. [211] presented a system based on a sequence-to-sequence (seq2seq) model [11,182] with an encoder that encodes input character embeddings into context vectors, an attention-based decoder [11,204] that turns the final encoder representation into a Mel-scale spectrogram, and a CBHG⁴¹-based post-processing net that converts the Mel-scale spectrogram into a linear-scale spectrogram, from which the waveform is synthesized with the Griffin-Lim reconstruction algorithm [78]. The results show that the Tacotron model achieved a better Mean Opinion Score (MOS) [156] in terms of speech naturalness than the parametric speech synthesis system [231], and a marginally lower score than the concatenative speech synthesis system [71], which is a promising result considering the audible artifacts produced by the Griffin-Lim synthesis approach. This opened the door to an improved version of the system, Tacotron 2 [175], which combines convolutional and recurrent neural networks with a WaveNet vocoder (derived from the WaveNet architecture [197]). This model outperformed the parametric, concatenative, Tacotron (Griffin-Lim), and WaveNet text-to-speech systems in subjective evaluation.
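Since the Mean Opinion Score recurs throughout this section, a minimal sketch of how it is typically computed may help. It assumes listener ratings on a 1 to 5 scale and reports a 95% confidence interval using a normal approximation. The ratings are made up, and the interval method is an assumption rather than the procedure of any cited study.

```python
import math
import statistics

def mean_opinion_score(ratings, z=1.96):
    """Return the MOS and the half-width of an approximate 95% interval."""
    mos = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Hypothetical naturalness ratings from eight listeners.
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean matters when two systems are described as performing "marginally" differently: overlapping intervals suggest that the gap may not be meaningful.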
Deep Voice: Arik et al. [8] discussed a system for speech synthesis, where each model of the system is based on an independently trained deep neural network. The main sub-models of the system have the following functions: segmenting voice for calculating phoneme boundaries, in the training pipeline only, using a recurrent architecture with connectionist temporal classification loss [74]; and, in both the training and inference pipelines, converting graphemes (text) to phonemes using an encoder and decoder with Gated Recurrent Units (GRU) [32], predicting phoneme duration and fundamental frequency, and synthesizing audio based on the WaveNet architecture [197] with a bi-directional Quasi-RNN (QRNN) conditioning network [18]. The results show relatively lower (but promising) Mean Opinion Scores (MOS) [156] for the synthesized audio with respect to ground truth recordings. This opened the door to other improved/novel⁴² multi-speaker versions of the system: Deep Voice 2 [69], whose synthesized audio quality surpasses that of the Deep Voice synthesis system, and Deep Voice 3 [151], which outperforms Deep Voice 2 and Tacotron (Griffin-Lim) and performs similarly to Tacotron 2 when both use the WaveNet vocoder.

⁴¹ CBHG is an efficient module for calculating sequence representation. It consists of a bank of one-dimensional convolutional filters, highway networks [181], and a Bi-directional Gated Recurrent Unit (GRU) net [34].
VoiceLoop: Taigman et al. [185] introduced an approach for speech synthesis inspired by a working memory model, the phonological loop [10]. An input sentence (text) to the model is represented as a set of phonemes, where each phoneme is represented through an embedding vector. These vectors are weighted and summed to create a context vector using attention weights. The model uses a memory buffer, which is updated at each time step by a new, speaker-dependent representation vector, calculated with a shallow fully connected network whose inputs are the context vector with the speaker embedding, and both the output and buffer vectors at the previous time step. The output of the model is calculated through another network of the same architecture whose input is the buffer vector at the current time step with the speaker embedding. The results show that the VoiceLoop model outperformed the Tacotron and Char2Wav [180] models in Mean Opinion Scores (MOS) [156] (subjective evaluation) and Mel Cepstral Distortion (MCD) scores (objective evaluation) in single- and multi-speaker speech synthesis.
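Mel Cepstral Distortion, the objective measure named above, is commonly computed per frame as (10 / ln 10) · sqrt(2 · Σ (c_d − ĉ_d)²) and then averaged over frames. The sketch below follows that common form. It assumes that the reference and synthesized frames are already time-aligned and that the 0th (energy) coefficient is excluded; neither detail is specified in the source, and the toy coefficient vectors are hypothetical.

```python
import math

def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB over aligned frames of mel-cepstral coefficients."""
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = []
    for ref, syn in zip(ref_frames, syn_frames):
        # Skip index 0, which usually carries overall energy (an assumption).
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        per_frame.append(scale * math.sqrt(squared))
    return sum(per_frame) / len(per_frame)

reference = [[1.0, 0.0, 0.0], [1.0, 0.5, 0.5]]
synthesized = [[5.0, 3.0, 4.0], [2.0, 0.5, 0.5]]
print(mel_cepstral_distortion(reference, synthesized))
```

Lower values indicate spectra closer to the reference recording. In practice the alignment step (for example with dynamic time warping) has a large effect on the score, so MCD values are only comparable when the alignment procedure is the same.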
WaveGlow: Prenger et al. [155] proposed a flow-based network capable of generating high-quality speech from mel-spectrograms. Following the examples of Glow [106] and WaveNet [197], WaveGlow produces high-quality audio efficiently without the need for auto-regression. An experimental study subjectively compared the proposed model against two baselines, the Griffin-Lim algorithm [79] and WaveNet [197], using Mean Opinion Scores (MOS) [156] as the metric. The results showed that WaveGlow delivers audio quality as good as the best publicly available WaveNet implementation trained on the same dataset.
WaveGrad: Chen et al. [28] presented a conditional speech synthesis model of waveform samples that estimates the gradients of the data log-density as opposed to the density itself. It is non-autoregressive as it requires only a constant number of generation steps during inference. In particular, starting from Gaussian noise, gradient-based sampling is applied using as few as 6 iterations to achieve accurate audio. The experiments demonstrated that WaveGrad is capable of generating high-fidelity audio samples, outperforming adversarial non-autoregressive models [15,116,222,223] in an objective evaluation and matching one of the best autoregressive baseline models [100] in terms of subjective naturalness.

Affective Speech Synthesis Systems
Lee et al. [120] introduced an altered version of Tacotron, injecting an emotional embedding e into the attention RNN to generate speech with specifications of emotion and personality of a human. The model was trained and evaluated on two Korean emotional speech datasets, one from Acriil and the other from ETRI, the former containing speech, audio, emotional label pairs, while the latter containing a drama script. Through quantitative experiments, the authors identified two areas of improvement concerning attention alignment. First, due to the scarcity of the frame of a spectrogram, the authors opted to concatenate attention text to the attention RNN's input to achieve an alignment of the speech with pronunciation. Second, they applied residual connections to the Convolution Bank + Highway + bi-GRU (CBHG) module [119] for a sharper and clearer attention alignment. Overall, the results showed that the quality of the generated speech was highly correlated with the sharpness of the attention alignment, despite the limited emotional representation in the speech.

⁴² Deep Voice 2 has a modified architecture with respect to Deep Voice through separating between the phoneme duration and frequency models and adding batch normalisation and residual connections in the convolutional layers in the segmentation model. Deep Voice 3 is a novel fully convolutional attention-based speech synthesis system. It consists of an encoder that maps textual features to an internal representation, a decoder that maps the encoder representation to an audio representation, and a converter as a post-processing net. It is a fully convolutional system (unlike Tacotron), which makes computation and training very fast.
Um et al. [195] developed a text-to-speech system based on the intra-category distance that generates emotional speech and controls the intensity of emotion representation. In doing so, they first proposed an inter-to-intra distance ratio algorithm to enable the inclusion of a wider range of emotions simultaneously and enhance their clarity utilizing the ratio between intra- and inter-cluster embedding vectors. Then an interpolation technique was introduced to control the intensity of the emotions effectively. During training, the global style token Tacotron (GST-Tacotron) model [212] was used as a baseline, taking a large number of neutral utterances as input. The effectiveness of the method was assessed subjectively using Mean Opinion Score (MOS) tests [156] in terms of the quality of the synthesized speech, while the preference test measured the expressiveness of sadness, anger, and happiness against the mean-based method. As a result, the proposed approach outperformed the conventional mean-based method in both criteria.
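The idea of controlling emotion intensity by interpolation can be illustrated with a simple sketch. The source states only that an interpolation technique was used; the linear interpolation between a neutral embedding and an emotion embedding shown here, the function name and the two-dimensional toy vectors are all assumptions made for illustration and are not the method of [195].

```python
def interpolate_embedding(neutral, emotion, intensity):
    """Blend a neutral and an emotion embedding.

    intensity = 0.0 returns the neutral embedding, 1.0 the full emotion
    embedding, and values in between give a weaker emotion.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return [n + intensity * (e - n) for n, e in zip(neutral, emotion)]

# Hypothetical two-dimensional style embeddings.
neutral = [0.0, 1.0]
angry = [2.0, -1.0]
for intensity in (0.0, 0.5, 1.0):
    print(intensity, interpolate_embedding(neutral, angry, intensity))
```

The resulting vector would then condition the synthesizer in place of a fixed emotion embedding, so a single scalar controls how strongly the emotion is expressed.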
Byun and Lee [22] proposed a multi-conditional emotional speech synthesizer through the Tacotron [211] model by providing it with an emotional embedding from a multiple-speaker Korean emotional speech database [22].For the Tacotron to synthesize multi-conditional speech, the authors injected the embedding vector into the Decoder RNN, which enables the generation of mel-spectrogram frames.In addition, the Attention module of the Tacotron was trained using both the emotional speech dataset and a large set of speech data for TTS.The extent to which the model was emotionally expressive and clear was evaluated by the Mean Opinion Score (MOS) test [156] in a subjective study, which resulted in the superiority of the proposed method of emotional speech synthesis generating four emotions as output: happiness, anger, neutrality and sadness.
Li et al. [122] introduced a novel reference-based approach for emotional speech synthesis based on Tacotron to synthesize speech with neutral and six basic emotions [52]. Specifically, the model integrates four losses: the basic Tacotron MSE loss, two emotion classification losses, and the style loss [67,98]. As input, the model takes Chinese text, which is first converted into a character sequence; the CBHG module [119] then converts the pre-net output into the final encoder representation, and finally the mel-spectrogram is transformed by the CBHG post-net to obtain a linear spectrogram. The model's ability to transfer emotion was evaluated through ablation studies, while the emotion strength control was measured by a strength ordering test against the RA-Tacotron [237] in a subjective evaluation. The results showed that the speech synthesized with the proposed method was more accurate and expressive, displaying less emotion confusion.
Lei et al. [121] proposed a fine-grained emotion transfer (FET), control, and prediction approach for expressive speech synthesis that shares its architecture with Tacotron [211] and Tacotron2 [175], generating mel-spectrograms through a CBHG-based text encoder and an attention-based auto-regressive acoustic decoder. As regards emotion expression, emotional information is learned from reference audio in emotion transfer, from manual labels in emotion control, and from the input text in emotion prediction. To control the emotion category, the authors adopted emotional embeddings, which are further treated as the global render of speech in the seq2seq model for emotion transfer. The emotion prediction, on the other hand, learns directly from the phoneme sequences without any reference audio or labels. Finally, the FET was compared subjectively with the GST model [212] and the utterance-level emotion transfer model (UET) [237], trained by ground-truth mel-spectrogram, using mel-cepstral distortion (MCD) [110] and A/B preference tests [108] as metrics. For objective evaluation, Dynamic Time Warping (DTW) [137] was adopted to evaluate the predicted features. The FET model demonstrated better performance compared to the baselines in terms of coarse emotional expressions and its flexibility in synthesizing emotional speech with the six basic emotions: happiness, anger, fear, sadness, disgust and surprise [52].
Liu et al. [126] proposed a novel training strategy for Tacotron-based speech synthesis which does not require prosody annotation for training. Instead, the model unifies frame and style reconstruction loss. It is then implemented on speech emotion recognition (SER) and used as a style descriptor for extracting high-level prosody representations. The proposed strategy is called Tacotron-PL due to the use of perception loss (PL) [98] for style reconstruction loss. In a comparative study, five Tacotron-based text-to-speech systems were developed: the baseline Tacotron and four variants, among them the proposed Tacotron-PL. Three different evaluation metrics were used for an objective performance evaluation with regard to spectral modeling, F0 modeling, duration modeling, and deep style features. Subjective evaluations were conducted through Mean Opinion Score (MOS) [156], A/B preference tests [108], and Best Worst Scaling (BWS) [65]. By outperforming the other baselines, Tacotron-PL demonstrated the advantages of the proposed training strategy in terms of expressiveness and feasibility in synthesizing four emotional categories: sad, happy, angry and neutral.
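Two of the objective measures named above, mel-cepstral distortion (MCD) and Dynamic Time Warping (DTW), are often used together: DTW aligns a synthesized sequence with a reference of different length, and MCD is averaged over the aligned frame pairs. The sketch below follows the commonly used per-frame definition MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2). It is a simplified illustration rather than the evaluation code of any cited study; the function names are hypothetical, and the caller is assumed to have already removed the energy (0th) coefficient.

```python
import math


def frame_mcd(ref_frame, syn_frame):
    """Mel-cepstral distortion (in dB) between two cepstral frames."""
    squared = sum((r - s) ** 2 for r, s in zip(ref_frame, syn_frame))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * squared)


def dtw_path(ref_seq, syn_seq, cost):
    """Classic DTW: return the minimum-cost alignment as (ref_idx, syn_idx) pairs."""
    n, m = len(ref_seq), len(syn_seq)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = cost(ref_seq[i - 1], syn_seq[j - 1])
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack from the end of both sequences to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def dtw_mcd(ref_seq, syn_seq):
    """Average MCD over the DTW-aligned frame pairs."""
    path = dtw_path(ref_seq, syn_seq, frame_mcd)
    return sum(frame_mcd(ref_seq[i], syn_seq[j]) for i, j in path) / len(path)


# Toy cepstral sequences (hypothetical): the second repeats its first frame.
reference = [[1.0, 2.0], [3.0, 4.0]]
stretched = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]
print(dtw_mcd(reference, stretched))
```

Because the stretched sequence differs from the reference only in timing, the alignment absorbs the difference; a frame-by-frame comparison without DTW would not be possible for sequences of unequal length.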
Wu et al. [220] integrated two descriptors, Capsule Network (CapNet) and Residual Error Network (RENet), into a sequence-to-sequence (seq2seq) architecture for an end-to-end emotive speech synthesizer which synthesizes speech with anger, happiness, sadness and other emotions. CapNet is employed for speech emotion recognition (SER) by outputting a set of probabilities that correspond to the emotions, while RENet is considered advantageous for deriving latent emotive representations. Unlike existing methods, this method utilizes an utterance exemplar for emotion specification. Specifically, exemplary descriptors are integrated into the seq2seq model to control the synthesis. Thus, this work proposed five E-TTS systems based on categorical descriptors (emotion code vector (EC-TTS), various emotions (EP-TTS), and a logit-based descriptor (EL-TTS) from SER) and automatically derived descriptors (EA-TTS and EAli-TTS from RENet). An experimental study evaluated the emotion similarity and speech quality objectively by calculating the mean squared error (MSE) [6] and subjectively through mean opinion score (MOS) tests [156] on an audio-book corpus from the 2011 Blizzard Challenge [104]. The five proposed E-TTS systems (EC-TTS, EP-TTS, EL-TTS, EA-TTS, and EAli-TTS) performed significantly better than the two baselines (Tacotron [211] and GST-Tacotron [212]), with EA-TTS achieving the best performance in emotion similarity.
The systems reviewed here are advanced speech synthesis systems for both neutral and affective speech, primarily based on Tacotron [211], whose performance and quality were demonstrated through objective and subjective measures (see Table 6 for details) and by benchmarking against state-of-the-art models. Nonetheless, a few shortcomings have been encountered during training. For instance, Lee et al. [120] pointed out the scarcity of emotional representations in speech as a significant limitation. It can also be observed from Table 6 that subjective evaluations are more prevalent than objective evaluations.
43 Dataset sizes are not available.
44 Not applicable, ibid., p. 5.
45 This is an approximation based on the details provided in the article, where the authors report each file lasting from two to three hours for each of the four actors.
46 As a quantitative measure, the authors computed MSE values.
Taigman et al. [185]; CSTR VCTK corpus [203], LJ database [97], The Nancy corpus [104], English audiobook [154]; N/A; N/A; Mel-cepstral distortion (MCD) [110]; MOS [156]
Lee et al. [
Chen et al. [28]; Proprietary speech dataset [28], LJ database [97]; 385 hours, 23 hours; 1,000 sentences; Log-mel spectrogram mean squared error metrics (LS-MSE), MCD [110], F0 Frame Error (FFE) [33]; Listening test (5-point MOS scale) [156]
Summary: Speech Synthesis
• Speech production, known as text-to-speech synthesis, has benefited considerably from data-driven approaches, and is the most mature data-driven social behaviour available, with some artificial speech being almost indistinguishable from human speech.
• Commercial vendors have invested considerably in data-driven models, which far outperform academic products, especially for neutral speech. Still, there is considerable spread in quality between languages.
• Most speech synthesis engines are unable to adaptively overlay affect and emotion, with most voices sounding neutral.This, currently, is a limitation for the field of Human-Robot Interaction (HRI), which calls for rich affective speech.
• Last but not least, the high fidelity of artificial speech might not always suit the needs of HRI: studies [22,185] suggest that a human-like voice might not fit the robotic appearance and that a more robotic voice might be more appropriate to the context of interaction.

OUTLOOK
It is clear that data-driven methods relying on connectionist architectures are an important and perhaps definitive answer to the question of how to generate human-like communicative behaviour. Never before have models produced such rich and varied behaviour without the need for explicit programming. However, there are a number of challenges that still face the relatively young field of data-driven behaviour generation.
Multimodal behaviour generation. Most models take a single signal and map it onto a modality: text to speech, emotion to facial expression, speech to gesture. However, in human-to-human communication all modalities are intertwined: emotion colours speech and gestures, gestures have an impact on speech, context influences eye gaze, and so on. Current data-driven generation methods gloss over the fact that communication is a highly interdependent process, mainly because modelling several modalities jointly demands far more data. Still, in future systems we would expect more modalities to be taken into consideration. In the speech generation community, for example, emotion has long been the subject of study, and research systems are able to generate speech modulated by emotion. However, the flipside is that a data-driven approach will need more data. The amount of data required to train systems is already expensive to collect for two connected modalities; adding other modalities is likely to increase the size of the required training data exponentially. How this will be overcome is as yet unclear.
Dyadic and multiparty communication. The large majority of data-driven models do not take the receiver into account. Instead, they are trained to produce communicative behaviour as if it were a monologue in which the receiver of the message does not respond. In human-to-human communication, most interactions are multiparty interactions and our communicative behaviour is finely tuned to the reactions and responses of others. We watch for signals showing understanding or misunderstanding, monitor for affective responses and are sensitive to bids for turn-taking. All these elements are largely missing from current data-driven methods, as they are exclusively trained on data that does not take into account the interactive nature of communication. Again, it seems likely that more data could resolve this problem, but collecting this data comes at a great cost and might be beyond the means of most R&D labs.
Measuring quality of generated behaviour. Assessing the quality of generated behaviour relies on objective and subjective measures. Objective measures are the workhorse of data-driven methods, as they form the cost function against which the models are optimised. Unfortunately, these objective measures only weakly correlate with subjective measures (see for example [114]). Subjective measures, in which people (or simulated subjective raters) judge the quality of the generated behaviour, remain the gold standard in evaluation. However, using human raters is expensive and time-consuming, and as such subjective measures cannot be used during training, when many millions of evaluations are needed to drive the model ever closer to generating behaviour that is human-like. Recent work on gesture generation showed that subjective measures are still better for measuring the quality of models, and that objective measures often fall short as they only optimise a quantitative metric which is often a poor representation of qualitative assessment [217,219]. Simulated subjective raters might be a way forward, as in GAN models in which one part of the model is trained to discriminate between artificial and human-like output, pushing the generated behaviour ever closer to being indistinguishable from human behaviour. Another challenge is the lack of common standards to evaluate models. Sometimes this is informed by the need to evaluate very specific elements of the generated behaviour, or because the accepted standard has outlived its usefulness. Benchmarks often form the focus of intense research investment and are often reached in just a few years, at which point they become useless as a target to aim for. Challenges, where different models are pitted against each other, have proven useful in this context: co-speech gesture generation, for example, has benefited from a series of challenges that pushed the field forward and also improved the way in which models are evaluated [114,229].
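One simple way to check how well an objective measure tracks human judgement is to rank-correlate the two across a set of systems. The sketch below computes Spearman's rank correlation in plain Python. The per-system error values and MOS ratings are invented purely for illustration and do not come from any study cited here; the function names are likewise hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        tied_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = tied_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five systems: an objective error and a human MOS.
objective_error = [0.12, 0.10, 0.15, 0.09, 0.11]
human_mos = [3.1, 3.9, 3.4, 3.2, 3.6]
print(spearman(objective_error, human_mos))
```

For an error metric, a value close to -1 would indicate that lower error reliably goes with higher human ratings; values near zero would reflect the weak agreement described above.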
Common datasets and evaluation methods. From the survey it appears that there are few common datasets on which models are trained and evaluated. Researchers and engineers prefer taking a pragmatic approach when choosing data to train and evaluate against. Factors such as availability, ease of use, feature availability, cost and appropriateness for the task at hand are deemed important and are often used as a reason not to use datasets which have been used by others. One corollary is that the field would benefit from agreed datasets and evaluation standards, something which happens for some modalities (such as speech synthesis) and is slowly being adopted for other modalities (such as gesture generation [114]).
Semantics of multimodal communication. Communication serves to change the mind of others. As such, any communicative act carries semantics. However, this is usually glossed over in data-driven models. In some cases, this is not too much of a problem. Speech generation, for example, generates speech from text. Text has a well-agreed notation and speech generation maps this orthography to sound. However, speech generation is largely context-free, and the production of human-like speech is possible without requiring much access to the semantics of the text and without access to the internal affective state of the agent. For exceptions to this, the context of the neighbouring text is sufficient to disambiguate the required speech sounds. For example, disambiguating "bass" as a fish (/bas/) or a musical instrument (/beIs/) can often be done by relying on other words nearby. Other modalities are different in that what they convey is tightly linked with affect, emotion and semantics of the message. Current data-driven methods do not have access to these, and while the models can with sufficient data pick up semantic correlations, the training cost at which this comes is prohibitive.
Fine-tuning models. One promising benefit of data-driven neural models is the potential for fine-tuning (a form of transfer learning) of a pre-trained model. In this, a model is first trained using a large amount of data; training then continues, often on a smaller dataset, so that the pre-trained model becomes more relevant for a specific task. While few behaviour generation models have been made available for fine-tuning, the practice is already well established in other fields, such as Large Language Models, where models can be relatively easily fine-tuned for other language-based generative tasks (e.g., [233]).
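The two-stage procedure can be illustrated with a deliberately tiny example. The sketch below pre-trains a one-variable linear model on a larger dataset and then continues training on a small, related dataset for only a few steps, comparing the result with training from scratch under the same budget. Everything here is an assumption for illustration: the linear model stands in for a neural network, and the datasets, learning rate and step counts are invented.

```python
def mse(params, data):
    """Mean squared error of the linear model y = w * x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, lr, steps):
    """Plain batch gradient descent on the MSE, starting from the given params."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


# "Large" generic dataset: 101 points on y = 2x, x from -5.0 to 5.0.
large = [(k / 10.0, 2.0 * (k / 10.0)) for k in range(-50, 51)]
# Small task-specific dataset: three points on the related function y = 2x + 1.
small = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

pretrained = train((0.0, 0.0), large, lr=0.05, steps=200)
finetuned = train(pretrained, small, lr=0.05, steps=5)   # continue from pretrained
scratch = train((0.0, 0.0), small, lr=0.05, steps=5)     # same budget, no pretraining

print("pretrained on small task:", mse(pretrained, small))
print("fine-tuned on small task:", mse(finetuned, small))
print("from scratch on small task:", mse(scratch, small))
```

With such a short training budget the fine-tuned model starts closer to the target and ends with a lower error than the model trained from scratch. On a problem this simple, training from scratch catches up once enough steps are allowed, so the example only illustrates the mechanism, not the size of the benefit in realistic settings.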
Hardware does not match the dynamics of software generated behaviour. Most social robots rely on actuation technology, such as electric motors and planetary gears, which do not offer the velocity, acceleration and jerk typically seen in the human body. This leads to multimodal social behaviour that appears unnaturally slow. Some solutions exist: some robots, such as Keepon, rely on simpler, smaller and lighter bodies which allow low-cost actuators to generate high-velocity dynamics. Others, such as Engineered Arts' Ameca or RoboThespian animatronic robots, rely on alternative actuation technology, often using pneumatics, to produce high-velocity animations matching human dynamics. However, human-like dynamics are for the moment still out of scope for most commercial and research social robots.
Despite these challenges, data-driven methods for the time being look to be the way forward. But to achieve near-human multimodal behaviour, a number of important obstacles will need to be overcome. One striking observation is that a developing child does not have access to thousands or perhaps millions of hours of training opportunities. Instead, children learn to interact multimodally through a combination of observation and online learning, and innate biases and constraints. This combination allows them to become skilled multimodal communicators in just a few short years. Perhaps future data-driven models should, instead of taking a tabula rasa approach, also start with biases and constraints to make the training process more efficient.

Fig. 2. An illustration of a deep neural model used for generating facial expressions using speech as input, from Karras et al. [101]. The network takes two types of input: half a second of audio and a description of an emotional state. The former (audio) is used to output the 3D vertex positions of a fixed-topology mesh that correspond to the center of the audio window, while the latter (emotional state) disambiguates facial expressions and speaking styles.

(inputs) and current motion frame (output) to generate hand gesture animations. The model was trained on motion capture and audio data from human conversation. In particular, the motion capture data contained joint rotation vectors with 21 degrees of freedom, whereas the audio features used prosodic information such as pitch and intensity values. During the subjective evaluation, three animation types (Original, Generated, and Unmatched) were compared against each other in a user study. The results demonstrated the naturalness of the movements of the generated gesture animations and the consistency of the motion dynamics with the utterances.

Fig. 3. The outline of the network architecture presented by Hasegawa et al. [85] consisting of five layers.

the final loss results from the proposed model with a simple Recurrent Neural Network (RNN) implementation, resulted in significantly better performance of the proposed model. The subjective evaluation of the original, mismatched, and generated gestures demonstrated significantly lower ratings of the generated gestures than the former two (original and mismatched) in terms of naturalness, matching in timing, and context. This result, as the authors explain, might be affected by the frequent movement in the generated gesture motion.

Ferstl et al. [63] attempted to map speech to 3D gestures through training networks with multiple adversaries to generate co-speech gestures. The authors extracted MFCC and pitch emphasis (F0) from the recorded speech and used upper-body joint positions to represent the gestures. The model architecture consists of a two-layer recurrent network composed of Long Short-Term Memory [90] cells and a feed-forward layer for input processing. Moreover, a Gated Recurrent Unit (GRU) [32] propagates the input for faster training purposes in producing joints. The novelty of the model lies in the training of the recurrent network with multiple generative adversaries instead of a standard regression loss. Drawing on the objective evaluation measured by the accuracy of the binary cross-entropy objective for each discriminator, the authors report the effectiveness of discriminators in solving a distinct sub-problem in the gesture generation task.

Fig. 4. The architecture of the audiovisual model for animation generation by Dahmani et al. [43].

user study and in comparison to other state-of-the-art models. Despite the superiority of the proposed model over baselines, the main disadvantage still remains the demand for a large dataset, as the generated motion quality and upper-body gestures were limited to the dataset used in the study. Additionally, the gesture generation process lacks controllability. Other limitations regard the FGD, which made it atypical to analyze mixed measurements of motion quality and diversity.
Ahuja et al. [3] presented a Mixture-Model guided Style and Audio for Gesture Generation (Mix-StAGE) model which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures in an end-to-end manner. A novelty of Mix-StAGE is to learn a mixture of generative models which allows for conditioning on the unique gesture style of each speaker. The model used a Temporal Convolution Network (TCN) module for both content and style encoders. It is trained on a custom-made dataset, Pose-Audio-Transcript-Style (PATS), designed specifically for this work. In the experimental study, the Mix-StAGE model was compared against existing baselines capable of generating similar co-speech gestures (i.e., single speaker models Speech2Gesture [70], CMix-GAN and multi-speaker models MUNIT [92], StAGE). The results of the objective evaluation revealed that the Mix-StAGE model significantly

Table 1. Corpora and evaluation used in the head gesture generation literature

Table 2. Corpora and evaluation used in the facial expression generation literature

Table 3. Corpora and evaluation used in the hand gesture generation literature

Table 4. Corpora and evaluation used in the multimodal gesture generation literature

Table 5. Corpora and evaluation used in the multimodal gesture generation literature (continued)

Table 6. Corpora and evaluation used in the speech synthesis literature