Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters

Engaging embodied conversational agents need to generate expressive behavior in order to be believable in socializing interactions. We present a system that can generate spontaneous speech with supporting lip movements. The neural conversational TTS voice is trained on a multi-style speech corpus that has been prosodically tagged (pitch and speaking rate) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm where articulatory effort can be adjusted. The facial animation is driven by time-stamped phonemes and prominence estimates from the synthesised speech waveform to modulate the lip and jaw movements accordingly. In objective evaluations we show that the system is able to generate speech and facial animation that vary in articulation effort. In subjective evaluations we compare our conversational TTS system's capability to deliver jokes with a commercial TTS. Both systems performed equally well.


INTRODUCTION
There is a rich history in the development of embodied conversational agents (ECAs) designed to engage in spoken interactions with users [9]. Early examples include the astronomy guide Gandalf [50], the desktop agent PPP persona [1], the virtual tutor Steve [39], the publicly available August [17] and the real estate agent REA [8]. In recent years, ECAs have been employed in various applications, such as museum guides [5,26,44], educators [3,19,60], computer game characters [16,40], and virtual companions [6,7,52]. There is also a recent trend to explore humor in virtual agents and social robots [34], where some even work towards a robot theatre where robots can track the audience response to their jokes [22]. This trend stems from humor being found to promote engagement and increase motivation to follow the advice of coaching ECAs [33]. Zhang et al. investigated how humor styles influence the perception of joke-delivering robots [62]. They selected 20 jokes in five styles (Affiliative, Self-enhancing, Self-defeating, and Aggressive) from famous comedians and the Jester joke dataset, finding that robots telling self-defeating jokes received higher scores and elicited more detected laughter from the users.
Numerous studies have examined the perception of different voice types for interactive agents [41]. Social robots and ECAs typically use the same kind of TTS as voice assistants, which are designed for simple, transactional interactions where users ask questions or issue commands, and the agent responds verbally or performs actions. As a result, these TTS voices aim to emulate a neutral, warm, and informative speaking style. However, to develop more amusing and opinionated conversational characters, it is crucial to incorporate more engaging vocal performances in their TTS voices [2]. In real-world scenarios, conversational agents must adapt their speaking styles according to the situation, such as speaking more clearly to convey messages reliably or varying vocal effort to deliver dramatic or engaging content. Lindblom's H&H theory posits that human speech production is influenced by physiological economy constraints [30], with hypo-articulated speech requiring minimal articulation effort and hyper-articulated speech maximizing clarity. One study found that neural TTS underperformed in speech intelligibility in noisy environments compared to the clearer concatenative TTS [11].
In the current study, we have developed a neural conversational TTS system capable of controlling the articulatory effort of its synthesized speech. Furthermore, we have integrated an avatar with lip movements that are coherent with the generated speech, considering both phonetic content and articulation effort. We conducted objective evaluations where we generated 100 sentences in different manners of speaking and measured the results automatically. To assess its effectiveness subjectively, we chose a very demanding task, the Ebert test, as detailed below. We utilized the large language model GPT-4 [35] as a joke generator. Subsequently, we generated 16 word-pun jokes with a self-defeating twist that required our amusing conversational character to transition from hyper-articulation to hypo-articulation.
The Ebert Test: "If the computer can successfully tell a joke, and do the timing and delivery as well as Henny Youngman, then that's the voice I want!" - Roger Ebert, 2011

RELATED WORK
In a study comparing the likability of human speech with TTS voices of the robot Sophia and IBM Watson [27], TTS voices were perceived negatively, often characterized as too smooth or lacking comprehension. Conversely, natural voices garnered positive feedback for their prosody, paralinguistic, and extralinguistic cues, such as audible breathing and smiling voice. A recent study investigated the perceived personality in a virtual agent controlled by human speech and gestures or using TTS and state-machine-like animation [49]. Extroversion was mainly communicated through motion, while speech influenced agreeableness and emotional stability.
To develop style-specific TTS, researchers often use corpora with specific speaking styles. This method was employed to create unit-selection TTS voices with distinct personalities for animated characters in a speech-enabled computer game [18]. Another approach involves using a large corpus containing varied speaking styles, automatically detecting a given number of Global Style Tokens (GST), and then categorizing these by listening to them [58]. Style tokens have also been used for emotional TTS [53], training the system on voice actors who read drama scripts manually tagged for emotions (happiness, sadness, anger, or neutral). However, as argued by Marge et al. [31], it is not obvious that style tokens can be extended from book reading or acted emotions to the kinds of communicative functions [59] needed in human-machine interactions. Efforts have been made to create a TTS corpus closer to a conversational speaking style by recording an actor reading chatbot scripts [61]. However, spontaneous conversational speech has been found to be more varied in pitch and speaking rate than scripted conversational speech [28]. Some systems are trained on a large number of speakers speaking in a range of styles, where the manner of speaking is controlled using a reference audio that is given as input along with the text [10,29]. However, it can be difficult to choose the set of reference audios to use given a specific situation. The neural conversational TTS system presented in this paper was trained on a corpus of a male speaker who either read texts in a clearly articulated manner or was the moderator in casual three-party interactions [25]. This combined corpus of read and spontaneous speech of the same speaker contains a range of verbal behaviours. We have then developed a method where we can both mix the speaking styles and control the prosodic realization.
Embodied conversational agents need to have lip movements that are synchronized and coherent with the synthesized speech. Taylor et al. trained a deep neural network on a large audio-visual dataset containing a single actor reciting 2543 phonetically diverse sentences in neutral tone [48]. Given speech and the phonetic transcription, the system generates lip movements that represent the phonetic content of the speech, but not the manner of speaking. JALI is an animator-centric workflow for the automatic creation of lip-synchronized animations [14]. It introduces the JALI viseme field with a lip and a jaw axis, which is used to capture speaking styles like mumbling, screaming and normal conversation. In recent years, there have been notable advancements in multimodal systems capable of generating speech accompanied by non-verbal behaviors such as co-speech gestures [57] and facial expressions with synchronized lip movements [20]. Additionally, high-fidelity talking systems based on neural radiance fields have emerged, exemplified by the work of Guo et al. [15]. However, one limitation of these systems is their rendering time, which currently takes approximately 12 seconds per frame on an RTX 3090 GPU. This rendering delay poses a challenge for real-time conversational applications where instantaneous generation is crucial. While real-time alternatives like NVIDIA Audio2Face [51] exist, they typically require extensive training data of around 60 minutes per actor. Fortunately, recent advancements by Pan et al. [36] present a promising lip sync system that can be trained on smaller datasets. In our paper, we introduce a flexible lip sync system capable of adapting to various facial rigs. This means that the system can drive any blendshape-based rig used in virtual agents or robots. Moreover, our method offers explicit control over articulation effort, which means that we can make the lip movements hyper- or hypo-articulated in order to match different speech styles. This flexibility also enables manipulation in research settings. Notably, our approach is training-free and addresses a common use-case that is often overlooked in virtual agent and social robot applications.

SYSTEM DESCRIPTION

3.1 Conversational Text-To-Speech system
The conversational TTS was trained using Tacotron 2 with an added utterance-level prosody control method, similar to [38], and a speaking style control using an 8-dimensional speaker-like embedding, similar to [54]. The moderator turns of a three-party dialogue corpus were used as the TTS corpus [25]. These were segmented into breath groups (stretches of speech delineated by breath events) using a deep learning-based breath detector [46]. The breath groups were then transcribed using Whisper ASR [37] and subsequently corrected to ensure accurate transcription of all fillers and repetitions. For each breath group, we measured the mean f0 and speech rate (approximated by the peaks of the wavelet matrix) using the Wavelet Prosody Analyzer [43]. The mean f0 and speech rates of the breath groups were normalized by aligning the 1st and the 99th percentile points of the data to -1 and 1, respectively, while allowing outliers to extend beyond that range. At inference it is possible to extrapolate on the features by going beyond this normalised range, enabling the model to generate hyper- and hypo-articulated speech using a limited set of actual training data in this range while relying on the full corpus for robustness. Normalized values for these two features were appended to each utterance's encoded text and passed to the attention and decoder blocks from the pretrained model. The model is first initialised from a pre-trained read speech model and trained for 70k iterations on the corpus with two embeddings, indicating whether the utterance is from the read or spontaneous part of the corpus. This model is then further trained including the prosodic features for an additional 100k iterations. We used a HiFi-GAN [24] vocoder fine-tuned on the same corpus for 383k iterations on top of the published model. Data collection is further described in Sec. 4.1.
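The percentile-based normalization of the prosodic features can be illustrated with a minimal sketch. The function names, the linear percentile interpolation and the f0 values below are assumptions made for illustration only; they are not taken from the system's implementation.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def normalize_prosody(values, low_q=1.0, high_q=99.0):
    """Map the low_q percentile to -1 and the high_q percentile to +1.

    Values outside the percentile range are not clipped, so outliers
    end up below -1 or above +1.
    """
    s = sorted(values)
    p_lo, p_hi = percentile(s, low_q), percentile(s, high_q)
    if p_hi == p_lo:
        raise ValueError("feature has no spread between the chosen percentiles")
    return [2.0 * (v - p_lo) / (p_hi - p_lo) - 1.0 for v in values]


# Hypothetical mean f0 values (Hz), one per breath group: 80, 81, ..., 180.
mean_f0 = [80.0 + i for i in range(101)]
norm_f0 = normalize_prosody(mean_f0)
```

Because nothing is clipped, control values outside [-1, 1] remain meaningful, which is what makes extrapolation at inference possible.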

3.2 Speech Animation with Adjustable Effort
We introduce a new, pseudo-biomechanical algorithm for generating speech animation. This algorithm offers a straightforward yet effective approach to account for co-articulation with adjustable articulatory effort by minimizing energy while maximizing adherence to articulatory targets. By varying the weight between these (often conflicting) goals, it is possible to produce speech animation with varying degrees of clarity, in line with Lindblom's H&H theory [30] where articulatory effort (here: the required energy) stands in proportion to informational requirements (here: adherence to targets). The algorithm requires no training data and can be applied to different facial rigs. The input to the speech animation algorithm is a time-stamped phoneme sequence. In our experiments, this is derived from a phoneme recognizer based on wav2vec 2.0 [4]. We use five high-level parameters to describe visual speech targets and articulatory motion. Drawing inspiration from Öhman's model of coarticulation [32], which proposes to view articulation as the superposition of continuous vowel motion and rapid consonant articulations, we use two parameters for vowel articulation: jaw [0..1] (degree of jaw-opening) and retraction-rounding [−1..1] (negative values correspond to lip retraction, positive values to lip rounding), and three parameters for consonant articulations: bilabial [0..1] (where 1 means lip closure, regardless of state of jaw), labiodental [0..1] (lower lip/upper front teeth contact + raised upper lip) and dental [0..1] (parted lips + elevated tongue), see Fig. 2.

For each phoneme, and for each parameter, a tuple $(T, w)$ describes the articulatory target position and a weight that dictates the importance of the target. As an example, the consonant k may be either rounded or retracted and will therefore have a weight of $w = 0$ for the retraction-rounding parameter. In the case of the bilabial consonant b, the bilabial parameter target $T = 1$ is paired with a weight of $w = \infty$, ensuring that the target will always be reached for this phoneme. To synthesize a new animation $x_t$, a target sequence $T_t$ is formed by placing targets on a timeline according to a provided time-stamped transcription, along with the corresponding target weights $w_t$. We approximate articulatory effort by the total acceleration summed over an articulatory parameter trajectory $x_t$ as $E_1 = \sum_t |x_{t-1} + x_{t+1} - 2x_t|$. We approximate information content loss as a weighted sum of deviation from articulatory targets, $E_2 = \sum_t w_t |x_t - T_t|$, and calculate the final parameter trajectory $x_t$ that minimizes the sum $E = E_1 + E_2$, where the $E_1$ term effectively tries to straighten the track while $E_2$ tries to adhere to the defined targets as closely as possible. In order to model varying levels of prominence in the articulation, we can do two things in this model: 1) increase the weight of targets belonging to prominent syllables, thereby forcing the trajectory closer to the target; and 2) shift the target to a more extreme position; this applies to the vowel parameters (jaw opening and retraction-rounding), which may simply be scaled up or down. In practice, we use prominence estimates to modulate both the target weights and the vowel target scaling, along with a global scaling for hyper-hypo articulation, see Fig. 1.
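The trade-off between the acceleration term and the weighted target-deviation term can be sketched for a single articulatory parameter. This is not the paper's implementation: the sketch assumes squared rather than absolute deviations so that the minimum follows from one linear system, it stands in a large finite weight for an infinite one, and the names `articulation_track`, `solve_linear` and `effort_cost` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: at least two targets need weight > 0")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def articulation_track(targets, weights, effort_cost=1.0):
    """Minimize effort_cost * sum_t (x[t-1] - 2 x[t] + x[t+1])^2
    + sum_t weights[t] * (x[t] - targets[t])^2 (squared-error assumption)."""
    n = len(targets)
    a = [[0.0] * n for _ in range(n)]
    # Acceleration (effort) term, built from interior second differences.
    for t in range(1, n - 1):
        idx, coef = (t - 1, t, t + 1), (1.0, -2.0, 1.0)
        for i, ci in zip(idx, coef):
            for j, cj in zip(idx, coef):
                a[i][j] += effort_cost * ci * cj
    # Target-adherence term; frames with weight 0 impose no target.
    for t in range(n):
        a[t][t] += weights[t]
    b = [weights[t] * targets[t] for t in range(n)]
    return solve_linear(a, b)


# Hypothetical jaw track: closed, open vowel at frame 4, closed again.
targets = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
weights = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
hyper = articulation_track(targets, weights, effort_cost=0.1)   # cheap effort
hypo = articulation_track(targets, weights, effort_cost=10.0)   # costly effort
# A very large weight approximates a target that must always be reached.
pinned = articulation_track(targets, [1.0, 0, 0, 0, 1e6, 0, 0, 0, 1.0])
```

In this sketch a low `effort_cost` lets the track reach close to the vowel target (hyper-articulation), while a high value flattens the trajectory (hypo-articulation). Raising individual weights for prominent syllables has the same local effect.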
In our experiments, the generated articulation tracks are used to drive the Furhat social robot or its digital twin simulator, but they can be easily implemented on other facial animation rigs. Videos of 16 sentences generated with our conversational TTS in high or low articulatory effort and with the standard Furhat lipsync (baseline in our experiments) and the Amazon Polly TTS voice Matthew can be found at https://www.speech.kth.se/tts-demos/iva2023/.

METHOD

4.1 Speech synthesis corpus
Developers of conversational systems should ensure that the TTS voices they use are trained on ecologically valid data [2]. Our longstanding goal is to build social robots capable of engaging in multiparty interactions. To achieve this, we require data to train models that generate appropriate speech, facial gestures, and gaze behaviors. Consequently, we have recorded a corpus in which the same male American speaker acted as a moderator in 15 one-hour, three-party interactions [25].
In these recordings, the moderator and two participants were assigned the task of decorating an apartment using a GUI on a large touch screen. The recordings took place in a motion capture lab, where all participants wore headset microphones, eye-tracking glasses, and gloves, and were filmed by three video cameras. All channels were synchronized and timestamped using the Farmi framework [21], ensuring that all aspects of the multi-party interaction were captured effectively, see Fig. 3. In each interaction, the moderator first engaged in small talk with the participants before introducing the task at hand. He then assumed the role of an interior decorator, offering suggestions on how to decorate the apartment and providing instructions on using the GUI for this purpose. Occasionally, he adopted a self-directed speaking style while contemplating design options or commenting on the users' progress. As the moderator switched between small talk, instructions, advice-giving, and casual commentary, the resulting corpus encompassed a wide range of spontaneous speaking styles. In total, the conversational TTS corpus contains 5 hours and 40 minutes of moderator speech. To facilitate the generation of hyper-articulated speech, the TTS corpus was supplemented with 2 hours and 30 minutes of clear speech, in which the moderator read sentences from the CMU Arctic [23] and newspaper texts. The total TTS corpus spans approximately 8 hours.

4.2 Joke delivery generation
Typically, TTS voices are evaluated using mean opinion scores (MOS), where generic sentences suitable for reading aloud are synthesized. However, this approach tends to favor neutral, warm, and informative speaking styles, which are best suited for reading news or engaging in transactional interactions with voice assistants. In this paper, our goal is to evaluate a conversational TTS voice capable of expressing different attitudes while speaking. As mentioned earlier, the ultimate test for an expressive TTS voice would involve delivering jokes with the appropriate timing and intonation. We challenged ourselves by selecting joke delivery as the speech synthesis evaluation task. Instead of using existing jokes from corpora like the Jester joke dataset, we decided to generate the jokes to be synthesized using the large language model GPT-4. The prompt used for generating the joke candidates was: "Can you invent words that do not exist and then describe what they mean in a fun and entertaining manner?". We also generated self-defeating comments for each joke, using the following prompt: "Can you give a sarcastic comment as a response to this joke?". Examples of the jokes are listed in Table 2. We used the recently proposed So-to-Speak system [47] to generate three different levels of articulation (see Tab. 1). This interface allows users to generate and interact with hundreds of synthetic speech samples using multi-dimensionally controllable TTS. The design displays prosodic feature variations on the axes of an interactive grid, where samples can be played by selecting them. The style function can be varied interactively, with a slider enabling users to scroll through grids exhibiting various levels of conversational and read speech styles. The samples displayed on the grid are playable upon clicking, and they are marked and colored according to an automatically generated naturalness MOS score using [12]. The scores range from 1-5 (with 5 being "completely natural"), and the corresponding colors range from red (1) to green (5). This provides users with an estimate of how the settings on the controllable features affect the quality of synthetic speech. The control interface and an example grid with TTS samples are illustrated in Fig. 4. Using this interactive tool, one of the authors selected specific ranges of the controllable features to create three perceptually distinct manners of articulation (hyper-, normal and hypo-articulation), as presented in Tab. 1. Since the TTS engine is built on Tacotron 2 [42], which is probabilistic at inference, the samples are synthesized with natural variation within each setting.

Table 2 (excerpt):
Sarcastic comment: "Just what we need, more deep thoughts that make no sense."
Invented word: AcciDelight Cake. Description: "It is a cake that didn't turn out as expected but still tastes delicious, reminding us that life's imperfections can still bring joy." Sarcastic comment: "Nothing like setting the bar low and still managing to trip over it."

OBJECTIVE EVALUATION
In [45] a tool called Starmap is introduced for visualizing and exploring the variety of prosodic styles across a corpus using the dimensionality reduction method t-SNE [55] and normalized utterance-level means of prosodic features extracted with the Wavelet Prosody Analyzer [43]. As an objective evaluation, we apply this method to validate the system's ability to produce varying degrees of clarity in articulation. With Starmap, it is possible to estimate speech rate based solely on acoustics, using peaks in the maximum energy scale, which correlate with the locations of syllables. However, in conversational speech, particularly hypo-articulated speech, syllables are often reduced or dropped entirely. This can result in a difference between the number of syllables identified in the speech samples and the number of syllables in the written prompt used as input to the TTS. We can use this metric, the ratio of estimated versus written syllables, as a measure of how much the prosodic features (f0, speech rate, and energy) influence the clarity of articulation. The same 100 utterances are synthesized in different articulatory styles, namely normal (middle of the distribution of all features), hyper-articulated (high f0, slower speech rate, high energy), and hypo-articulated (low f0, faster speech rate, low energy). The dropped syllable ratio (DSR) of each style is shown in Fig. 5.

Figure 5: Density graphs for dropped syllable ratio: the ratio of the number of syllables in the input text to the estimated number of syllables from the signal (using the peaks of the energy scale) for the 100 synthesis samples in three styles (hyper-articulated, hypo-articulated, normal).
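A toy sketch of the dropped syllable ratio follows. It uses the direction given in the Fig. 5 caption (syllables in the text divided by syllables estimated from the signal). The vowel-group syllable counter, the simple peak picker, the threshold and the envelopes are all illustrative assumptions and much cruder than the wavelet-based analysis used in the evaluation.

```python
import re


def count_text_syllables(text):
    """Rough syllable count: one syllable per group of consecutive vowels."""
    total = 0
    for word in re.findall(r"[a-z]+", text.lower()):
        total += max(1, len(re.findall(r"[aeiouy]+", word)))
    return total


def count_energy_peaks(envelope, threshold):
    """Count local maxima above threshold in an energy envelope."""
    peaks = 0
    for i in range(1, len(envelope) - 1):
        is_peak = envelope[i - 1] < envelope[i] >= envelope[i + 1]
        if is_peak and envelope[i] > threshold:
            peaks += 1
    return peaks


def dropped_syllable_ratio(text, envelope, threshold=0.5):
    """Syllables in the text divided by syllable peaks found in the signal."""
    estimated = count_energy_peaks(envelope, threshold)
    if estimated == 0:
        raise ValueError("no syllable peaks found in the envelope")
    return count_text_syllables(text) / estimated


prompt = "a banana is yellow"  # 7 syllables
# Hypothetical energy envelopes: the reduced one has two weak syllables.
clear = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
reduced = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0.2, 0, 0.3, 0]
dsr_clear = dropped_syllable_ratio(prompt, clear)
dsr_reduced = dropped_syllable_ratio(prompt, reduced)
```

With this definition a value of 1 means every written syllable produced a peak, and values above 1 indicate reduced or dropped syllables.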
Our hypothesis that hyper- and hypo-articulated speech both significantly alter the DSR of synthetic speech is confirmed by pairwise t-tests on the distribution (hyper vs. normal p ≪ 0.001, and hypo vs. normal p ≪ 0.001). The measured prosodic features, as well as the DSR of the evaluation utterances, are visualized in a t-SNE in Fig. 6.
A two-dimensional Kolmogorov-Smirnov test is performed to verify that the distributions of hyper- and hypo-articulated utterances are different from the normal utterances and from each other. The results confirm that both the hyper- (p ≪ 0.001) and hypo-articulated (p ≪ 0.001) synthesis result in a different distribution of prosodic representation compared to the normal population. The same holds true between the hyper- and hypo-articulated populations (p ≪ 0.001).

SUBJECTIVE EVALUATION
To investigate the effect of the proposed methods, we carried out two online perceptual tests, looking at joke-delivery and audiovisual speech matching.

Method
We generated 16 utterances, according to the procedure described in Sec. 4.2. We synthesized the new words in a hyper-articulated speaking style, the funny descriptions with an expressive prosodic realisation and the self-defeating comments in a hypo-articulated speaking style. In addition to the conversational TTS voice, the utterances were also synthesized using a commercial TTS voice. In the joke-delivery task, we asked subjects to listen to synthesized jokes (audio only) and rate how well the joke was delivered on a 5-point scale from poor delivery to great delivery. Each subject received the 16 jokes, 8 in each voice. The pairing of joke and voice was randomized between subjects, as was the presentation order. At the end of the experiment, we asked follow-up questions about their experience with speaking machines such as Alexa or Siri, what they based their ratings on, if they believe computers should have human traits such as humor or sarcasm, as well as if they had general comments on the study. For the audiovisual speech matching task, we presented animations rendered with the virtual Furhat robot of the same 16 utterances, and asked subjects to rate how well the lip movements match the speech on a 5-point scale from not matching at all to perfect match. The animations were generated by the speech animation method presented in Sec. 3.2, and using a baseline method (the Furhat system's built-in lipsync). We used the same two voices as in the joke-delivery test, and generated videos representing all four configurations of speech animation method (new vs baseline) and voice (commercial vs conversational). Each subject was presented with 16 animations, four in each configuration. Pairing of joke and configuration was randomized between subjects, as was the presentation order. For each of the tests, we recruited 70 subjects on the Prolific crowd-worker platform. An attention check was used in the middle of the sequence. Median completion time was 5:30 minutes and subjects received a compensation of 1.50 GBP.
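The balanced, per-subject randomization described above could be produced along the following lines. This is a sketch under stated assumptions: the function name, the condition labels and the seed are hypothetical, and the actual experiment software is not described in the paper.

```python
import random


def assign_conditions(item_ids, conditions, rng):
    """Return shuffled (item, condition) trials with equal items per condition."""
    if len(item_ids) % len(conditions) != 0:
        raise ValueError("items must divide evenly across conditions")
    items = list(item_ids)
    rng.shuffle(items)                    # random item-to-condition pairing
    per_condition = len(items) // len(conditions)
    trials = [(item, conditions[i // per_condition])
              for i, item in enumerate(items)]
    rng.shuffle(trials)                   # random presentation order
    return trials


# One subject in the joke-delivery task: 16 jokes, 8 per voice.
joke_trials = assign_conditions(
    range(16), ["conversational", "commercial"], random.Random(7))

# One subject in the audiovisual task: 16 jokes, 4 per configuration.
configs = [(voice, anim)
           for voice in ("conversational", "commercial")
           for anim in ("new", "baseline")]
av_trials = assign_conditions(range(16), configs, random.Random(11))
```

Using a separate seeded generator per subject keeps each subject's assignment reproducible while still varying pairing and order between subjects.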

Results
Ratings from the two experiments were analysed by means of a one-way ANOVA and a post-hoc Tukey multiple-comparisons test for statistical significance. In the joke delivery task, the mean rating and 95% confidence interval for the commercial and conversational voice was 2.5 ± 0.1 and 2.6 ± 0.1 respectively, but the difference was not statistically significant. Results from the audiovisual speech matching task are shown in Fig. 7 top. The conversational TTS + new animation configuration got the highest rating, and the commercial TTS + baseline animation the lowest. All differences were significant ($p < 0.05$). The joke-delivery test also contained a set of open questions. For the question "What did you base your rating on?", intonation was most frequently mentioned, followed by timing, funniness, human-likeness and clarity, see Fig. 7 bottom. Subjects often gave several of these, where the most common combinations were funniness and human-likeness, or intonation and timing. When computing the average scores per reason, both voices got the same score for all reasons except clarity, where the commercial TTS got 3.3 and the conversational 2.7. The lowest score for both (2.2) was from the users who based their scores on how funny the actual joke was, and not how it was read. Otherwise the average scores for both TTS voices were 2.7. In response to the question "Do you think computers should have human traits, like humor and sarcasm?", 44 said "yes", 19 "no", and 6 "I don't know".
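For readers who want to reproduce figures of the form "mean ± 95% confidence interval", one common way to compute them is sketched below. The paper does not state how its intervals were obtained, so the normal approximation is an assumption, and the ratings are made-up values rather than study data.

```python
from statistics import NormalDist, mean, stdev


def mean_with_ci(ratings, confidence=0.95):
    """Return (mean, half-width) of a normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # about 1.96 for 95%
    standard_error = stdev(ratings) / len(ratings) ** 0.5
    return mean(ratings), z * standard_error


# Hypothetical 5-point ratings for one voice (not data from the study).
ratings = [1, 2, 2, 3, 3, 3, 4, 2, 3, 2, 1, 4]
avg, half_width = mean_with_ci(ratings)
```

The half-width shrinks with the square root of the number of ratings, so the large number of ratings collected per voice is what allows intervals as narrow as those reported.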

DISCUSSION
Our work aims to advance the development of social robots and embodied conversational agents which can serve as companions or conversational peers. To achieve this goal, we have created an audiovisual speech generation system for expressive conversational characters and developed methods to control the manner of speaking with accompanying facial animation. As pointed out by Wagner et al., when we evaluate our TTS systems we need "to assess and take into account listeners' application-specific needs and expectations" [56]. Furthermore, the TTS evaluations should be as contextualized as possible to the participants. In our study, we chose a companion agent which could provide social company as the context and joke-delivery as the test case. We found that our conversational TTS voice performed on par with a state-of-the-art neural commercial TTS voice in the joke delivery task. Notably, several participants' comments revealed that many were impressed by the unique human-like attributes of the conversational TTS voice. Some particularly noteworthy general comments include: "One of the audios sounded very well like a human, I did not expect this technology to be this far in human mannerisms.", "Some audio tracks sounded quite close to the way humans deliver a joke! You can still tell they're computer-generated but I'm floored by how much more advanced they sound compared with Siri", and "As time went on, the jokes got funnier despite the quality of the jokes not getting any better. The absurdity may have had me rating a bit higher." This tells us that while we are not up for an Ebert test just yet, we might have managed to build a rather capable conversational voice. This is in line with our long-standing goal of building more human-like conversational systems [13]. The ability to change the intonation and articulation effort is crucial in situated interaction and during error handling and grounding. Finally, we found that our new speech animation method consistently outperformed the baseline in the audiovisual speech matching test for both TTS voices. We also note that the conversational TTS voice, which is considerably more varied in speech rate and articulatory effort than the commercial TTS, also received a higher rating in the multimodal setting.

CONCLUSIONS
We presented a system capable of producing conversational synthetic speech and accompanying facial animation with an adjustable degree of clarity of articulation. Since the TTS is probabilistic, the generated speech has an added natural variation. With this functionality we hope to enable virtual agents to exhibit refined social behaviors such as mumbling, muttering, attracting attention, being engaging or talking more clearly during error resolution. A novel speech animation algorithm that allows control over articulatory effort, for varying prominence and hyper-hypo speech production, pairs particularly well with the conversational TTS voice.
In the objective evaluations we show that the system indeed was able to generate speech that varies in articulation effort with accompanying lip movements. In the subjective evaluations we compared our conversational TTS system's capability to deliver jokes with a commercial TTS in an audio-only setting. Both systems performed moderately well at this task, indicating that today's TTS technology is not yet at comedian level. In a multi-modal context, we found that the conversational TTS, combined with our novel speech animation algorithm, provided the best overall subjective audiovisual coherence. These findings suggest that our system has potential for creating more natural and engaging conversational agents.
A key contribution in this paper is the development of a TTS voice that is not only grounded in ecologically valid data but also capable of generating spontaneous speech with accompanying lip movements for ECAs. This was achieved by constructing a voice for conversational systems that allows for the manipulation of speaking style, articulatory effort, and prosodic realization. The TTS voice was trained on a diverse speech corpus, which included slow, clear read speech as well as conversational interactions from the same speaker. This training enabled the blending of read speech with spontaneous conversation, while also providing control over pitch and speaking rate. Furthermore, the system's facial animation is driven by time-stamped phonemes and prominence estimates derived from the synthesized speech waveform, allowing for the modulation of lip and jaw movements in sync with the speech. This adds a layer of realism and expressiveness to the ECAs. In addition, we developed a GUI for VUI designers, which facilitates the control of the blend between read and conversational speech, as well as prosody. This GUI is instrumental in pre-generating system prompts with precise prosodic realization and offers insights into the capabilities of the voice in terms of speaking style mix and prosodic realization. VUI designers can utilize this tool to learn the optimal mixes and ranges of pitch and speaking rate to pair with system prompt text for achieving specific pragmatic functions. Lastly, we introduced a novel evaluation paradigm that transitions from relying solely on Mean Opinion Scores (MOS) for naturalness to evaluating the multimodal speech generation system within an application context. A notable application demonstrated is joke delivery. In these evaluations we used ChatGPT to generate the jokes, which enhanced realism, and the system was found to perform on par with a commercial TTS voice. This showcases the system's potential for applications such as stand-up robot performances, and highlights the importance of evaluating ECAs in real-world contexts.

Figure 1: Animation tracks (solid lines) and targets (dots) for jaw opening and retraction-rounding for a conversational TTS utterance, showing larger movements in the first (hyper-articulated) part and smaller in the last (hypo-articulated) part.

Figure 2: The 5 parameter model for the lip synchronization.

Figure 3: Picture from the data collection where the moderator and two participants are decorating an apartment.

Table 1: The style and prosody controls used for the articulation efforts. Input values are based on normalized utterance-level averages for f0 and speech rate, where -1 corresponds to the 1st percentile in the corpus and 1 to the 99th percentile.

Articulation  Read/Conversational  Pitch         Speech rate
hyper         80/20                0 to 1        -2.0 to -1.0
normal        20/80                -0.5 to 0.5   -0.5 to 0.5
hypo          0/100                -2.0 to -1.0  1.0 to 2.0
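The settings in Table 1 can be written down as a small configuration from which control values are drawn. The dictionary layout, the `read_share` fraction and the uniform sampling within each range are assumptions for illustration; the paper only states the ranges, not how values inside them were picked.

```python
import random

# Ranges copied from Table 1; read_share is the read part of the
# read/conversational style mix expressed as a fraction (assumed encoding).
ARTICULATION_SETTINGS = {
    "hyper":  {"read_share": 0.8, "pitch": (0.0, 1.0),   "speech_rate": (-2.0, -1.0)},
    "normal": {"read_share": 0.2, "pitch": (-0.5, 0.5),  "speech_rate": (-0.5, 0.5)},
    "hypo":   {"read_share": 0.0, "pitch": (-2.0, -1.0), "speech_rate": (1.0, 2.0)},
}


def sample_controls(style, rng):
    """Draw one set of normalized control values for the given style."""
    setting = ARTICULATION_SETTINGS[style]
    return {
        "read_share": setting["read_share"],
        "pitch": rng.uniform(*setting["pitch"]),
        "speech_rate": rng.uniform(*setting["speech_rate"]),
    }


rng = random.Random(0)
controls = {style: sample_controls(style, rng) for style in ARTICULATION_SETTINGS}
```

Note that the hyper and hypo speech-rate ranges lie outside [-1, 1], that is, beyond the 1st and 99th percentiles of the training corpus.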

Figure 4: An example grid with a sentence synthesized with 7 style settings and 11 steps for each prosodic feature, totalling 847 unique speech samples (7 × 11 × 11). The audios play upon clicking on a cell. The style slider on top updates the grid to the requested style. Colors correspond to estimated MOS scores.

Figure 6: t-SNE visualization of the distribution of the prosody of 100 synthesized utterances in three styles, based on utterance-level normalized mean values of duration, f0, energy, speech rate (syl/s) and the dropped syllable ratio (DSR).

Figure 7: Subjective evaluation results. Top: scores from the audiovisual speech matching task. Bottom: joke delivery task, summary of responses to "What did you base your rating on?".

Table 2: Examples of the GPT-4 invented words, funny descriptions and sarcastic self-mockery.