Large language models in textual analysis for gesture selection

Gestures perform a variety of communicative functions that powerfully influence human face-to-face interaction. How this communicative function is achieved varies greatly between individuals and depends on the role of the speaker and the context of the interaction. Approaches to automatic gesture generation vary not only in the degree to which they rely on data-driven techniques but also the degree to which they can produce context and speaker specific gestures. However, these approaches face two major challenges: The first is obtaining sufficient training data that is appropriate for the context and the goal of the application. The second is related to designer control to realize their specific intent for the application. Here, we approach these challenges by using large language models (LLMs) to show that these powerful models of large amounts of data can be adapted for gesture analysis and generation. Specifically, we used ChatGPT as a tool for suggesting context-specific gestures that can realize designer intent based on minimal prompts. We also find that ChatGPT can suggests novel yet appropriate gestures not present in the minimal training data. The use of LLMs is a promising avenue for gesture generation that reduce the need for laborious annotations and has the potential to flexibly and quickly adapt to different designer intents.


INTRODUCTION
Human co-speech gestures have been extensively studied.Research has demonstrated that gestures convey a range of meaning and can have a powerful impact on face to face interaction [12,27,37,48].Gestures positively impact the speaker's persuasiveness as well as an addressee's comprehension, recall, engagement, and trust in the speaker [7,35,68].However these impacts are dependent on the particular gestures being used and the context in which they occur [e.g., 33].Broadly speaking, people differ in the degree they use gestures, what types of gestures they use and the impact those gestures have [17,38,49,57].This is especially clear when one considers the situation and role of the gesturer.A clinician may use gestures to establish a safe, trusted place for the interaction with a client as they discuss difficult issues [24].A politician may seek to be dismissive of opponents [32], or arouse fear and anger [13].A comedian may use entirely different gestures to generate laughter at their own foibles or ridicule others [65].A lack of gestures or use of inappropriate gestures can appear uncanny [55,66].
This potential power of gestures can be exploited in the creation of virtual agents.Selecting gestures for virtual agents is, however, a complex problem due to the multidimensionality of the gesture space.To solve this problem, researchers have focused on two main approaches.The first approach is to predict appropriate gestures based on analysis of the virtual human's utterances [e.g., 16,44].Because of a lack of data sets that include both language and cospeech gestures, work in this area has sometimes exploited existing language tools [16,60,64] to analyze utterance content and suggest suitable gestures.In particular, machine learning derived tools have been used to analyze prosody and parse rhetorical structure of utterances [36] while ontology tools like Wordnet and Trains [4,50] have been used to analyze lexical and semantic content.In essence, these approaches aim to generalize example gestures to semantically related utterances.However, the emphasis is increasingly shifting to use of state of the art deep learning approaches to generate gestures directly from the utterance, in part driven by the fact that larger language+gesture corpora have become available (see [54] for a comprehensive review).
With any approach, the data that informs gesture selection is critical to the appropriateness of the output gestures.For example, it may be undesirable for a virtual human clinician to gesture like a politician, a comedian, or someone who rarely gestures.Similarly, it may be counterproductive to base the virtual human's gestures on an average over these different people and roles.Tailoring a virtual human's gestures to the role the virtual human inhabits and the designer's intent for the gesture's impact has potential to greatly improve the virtual human's impact [61,62].Given this context-dependency of gestures, the central question in gesture generation becomes what gestural data can realize designer intent.
Here, we explore this question from a large language model (LLM) perspective.LLMs have the key advantage that they comprise enough data to perform textual analysis.However, the issue remains how to map that analysis to gestures, in a way consistent with design intentions.In this paper, we demonstrate that LLMs, specifically OpenAI's ChatGPT and GPT-4, are a powerful tool for selecting appropriate gestures [3,10].We address the critical issue of design intention through the use of speaker annotations (see [41] for related work) consistent with design intentions to achieve mappings from the the utterance to gesture.
We illustrate the use of LLMs for context-relevant gesture selection based on several examples, starting with the popular but less recent model BERT (Bidirectional Encoder Representations from Transformers [20]) before moving on to explore GPT in more detail.These examples demonstrate the potential power of the approach to not only select individual gestures, but also suggest novel gestures outside the data the system was provided.With this, we aim to demonstrate the transformative potential of LLM's for gesture generation research.However, we also note that the evaluation of these gestures is still a fundamental challenge in the field [63,69] and lay out a road map to formally developing the use of LLM's for gesture generation.

RELATED WORK
Generating co-speech gestures faces two distinct challenges: The specification problem and the animation problem [52].While the specification problem deals with what gesture should be generated, the animation problem is concerned with how to achieve the animation of that gesture.In this paper, we focus on the specification problem.For a review of the work on the animation problem see [54].The specification problem can be approached with rule-based or data-driven approaches which the following sections discuss.

Rule-based gesture generation
As the name suggests, rule-based approaches rely on a knowledge base along with inference procedures to determine gesture selection, where gestures are often chosen from a gesture dictionary, i.e., a corpus of gestures [14].This approach frequently relies on manual annotations of gestures and co-speech.These annotations are then used to curate a gesture lexicon as well as synthesize rules for gesture selection.For example, the Behavior Expression Animation Toolkit relies on heuristics derived from extensive behavioral research to select pre-recorded gestures based on linguistic and contextual information [15].The system uses a bespoke language tagging module and WordNet [50] to automatically extract this information from text.Similarly, Marsella et al. [46] determined the communicative function of utterances using a collection of machine learning based prosodic and syntactic analyses as well as semantic database tools.The communicative functions were then mapped to gestures based on rules codified in an indexed gesture dictionary.
In contrast, Ravenet et al. [60] based their heuristics on image schemas -recurring cognitive processes that aid in establishing meaning [39,47].Example image schemas include 'big', 'full', or 'object'.The system uses WordNet to identify synonyms for in-text words that correspond to image schemas which they then map to gestures via a gesture dictionary compiled from the literature [60].
While such rule-based gesture generation systems offer a high degree of flexibility when it comes to adjusting rules for gestureselection, and thereby changing designer intent, they have several key limitations.Firstly, their reliance on gesture dictionaries, i.e., pre-determined sets of gestures, limits the range of gestures that can be produced.As a result, the gestures lack diversity and, by extension, these systems typically only produce a limited number of different gestures for the same speech input.Secondly, generating such extensive heuristics and corresponding gesture dictionaries is time consuming, relies on expert-knowledge and is therefore not scaleable.Because of these limitations, a growing body of work is focusing instead on data-driven gesture generation.

Data-driven gesture generation
Early approaches to data-driven gesture generation often relied on probabilistic mapping from semantic content to gesture type [34,53].For example, several approaches used annotated gesture data to produce gesture profiles to which they could apply statistical analyses to ultimately predict plausible gestures for new, annotated input-text [40,53].Similarly, Bergmann and Kopp [8] built a Bayesian network to determine whether and what type of gesture should be generated, based on variables such as the communicative action the gesture aims to achieve.While these approaches are more flexible than rule-based approaches, they still often rely on laboriously annotated data and a relatively small set of variables to apply the statistical analyses to.
Because of this, data-driven gesture generation increasingly uses machine learning techniques, and specifically deep learning [1,54].Many of these approaches use encode-decoder recurrent neural networks that learn the mapping from utterance text to gestures [6,72,73].However, these approaches often still rely on gesture lexica [6] and in all cases, these data-driven approaches are dependent on the data they are trained on, with limited data sets available.Additionally, much of the semantic richness is often lost [54].

Large Language Models
In contrast, the extensive pre-training of LLMs lends itself as a natural next step in automatic gesture generation.Currently, there are multiple available LLMs, such as OpenAI's GPTs, Meta's LLaMA, Google's Bard, and Anthropic's Claude, to name a few (for a recent review, see [71]) [3,5,10,28,67].One LLM that has found much traction in the research community is BERT [20].BERT consists of transformer encoder layers and is designed for language modeling and next-sentence prediction.It has been shown to powerfully predict semantic similarities between word-and sentence-pairs [51,58] and this makes it a potentially valuable tool for gesture generation as mappings from words to gestures can be generalized over synonyms [23,43].However, BERT does not have in-built explainable AI, limiting its application to realizing designer intent.For this reason, we explore BERT only briefly before we primarily focus on models from OpenAI.
We chose OpenAI's models for two key reasons: First, they provide easy access API and second, at the time of writing, GPT-4 is the most recent and powerful model [59].Specifically, we use two of OpenAI's Generative Pre-trained Transformer models or GPT: ChatGPT (gpt-3.5-turbo)and GPT-4 [3,10].GPTs are an autoregressive decoder-only transformer-based large language models.They are pretrained on large data sets to predict next words and fine-tuned using Reinforcement Learning with Human Feedback to make the models align to human preferences and common tasks [56].One of the key capabilities of these LLMs in relation to this work is in-context learning -their ability to learn a new task given only a few examples without having to fine-tune the model [10,70].
Recent work has begun to demonstrate the ability of GPTs' to approximate human cognitive processes.For example, Binz and Schulz [9] studied GPT-3's decision-making, information search, and causal reasoning using common cognitive psychological tests such as the conjunction and base-rate fallacies.They found that GPT-3 can solve these problems at a similar or better than human level but fall for some common human biases.Similarly, Kosinski [42] examined Theory of Mind in LLMs using standard false-belief tasks and found that ChatGPT and GPT-4 can solve most of these tasks.Similarly, Bubeck et al. [11] tested GPT-4 in various domains including mathematics, coding, medicine, law, and psychology, including Theory of Mind and emotion understanding.They showed that GPT-4 can perform these tasks well and outperforms Chat-GPT.Taken together, this body of works shows GPTs' promise for applications that normally require human cognitive input.
One such application is gesture generation.GPTs' extensive textual analysis training has the potential to yield gestures that are semantically rich and flexible while not relying on manual annotations.The first foray into this approach used ChatGPT to identify the intent of an utterance, based on this their system then selected a gesture based on a pre-compiled gesture dictionary [25].The work demonstrated that ChatGPT can successfully aid in gesture generation but did not yet fully realize its potential which lies in ChatGPT's ability to a) suggest new gestures with little input and b) adapt gestures to designer intent.

Gesture generation driven by designer intent
Work that aimed to realize designer intent has often focused on producing gestures for a specific novel target, i.e., a single gesturer.For example, Ahuja et al. [2] were able to generate gestures for a novel target by adapting a pre-trained source model based on just two minutes of target data.Their model did this by learning the difference in the crossmodal grounding, i.e., binding between language and physical gestures, between the source model and the target data.In contrast, other work has focused on learning different speaker styles that can then be used in a zero-shot style transfer to generate gestures for targets not present in training data [22,26].These approaches are often very efficient but are also tailored to generate gestures for specific targets.

EXAMPLES
One approach to applying modern LLMs to gesture generation is to use them to replace the generalization from examples.Earlier rule-based work discussed in previous sections relied on analyses using a range of ontology based tools [4,29,50].Such tools can be used to analyze text to find similarities with, and thereby generalize from, a set of annotated text-gesture pairings.For example, consider the phrase "important concept".A rule based ontological approach may codify the metaphoric mapping from "important" to big and the semantic mapping from "concept" to idea which in turn can be metaphorically mapped to a container gesture.As the following example shows, language models have the potential to replace this codification.

BERT
The pre-trained BERT model can realize a mapping from text to gesture directly through its cosine similarity measures, even without re-training.Given some phrase-gesture example mappings, derived for example from some annotations of a speaker, and a novel utterance, BERT can measure the cosine similarity between each example utterance and a novel utterance.Additionally it provides a metric for this similarity which can be used to choose potential alternative gestures.Consider the following example where BERT gives similarity measures between important concept and five phrases of varying relatedness.
Seed Phrase: important concept • big idea correlation: 0.861 • great thought correlation: 0.81 • red napkin correlation: 0.239 • weak idea correlation: 0.469 • silly idea correlation: 0.674 We see a strong relation between the seed important concept and big idea as well as great thought.As expected the relation to red napkin is the smallest as they are semantically unrelated.The seed's relation with weak idea as one might expect is somewhat stronger but still weaker than big idea and great thought.In contrast, the seed's relation to silly idea is slightly stronger than weak idea, arguably because silliness has both positive and negative attributes.Critically, however, the exact reason for this lower correlation is hidden from the user at this level of probing the model.
In short, BERT's pre-trained model could substitute for the lexical, semantic, metaphoric analyses discussed above.One would expect BERT to realize a much broader generalization than an ontological approach, though this depends on the boundary condition set on the cosine similarity.A key benefit of BERT is that it provides a clear, overall metric on this mapping as the above examples show.However, the derivation of this metric is opaque and outside the designer's control.
In contrast, the ontology based approach is more transparent as to how this mapping of a novel utterance to gesture suggestion is realized.The semantic, lexical and metaphoric elements are well-defined, which potentially supports finer grain designer manipulations.For example, specific analysis elements could be eliminated or discounted, such as mapping "important" to size, or specific analyses could be generalized, such as generalizing abstract concepts to any concept so any object, abstract or physical could be conveyed by a container.
Whereas BERT is arguably similar to prior approaches, ChatGPT, specifically its prompting ability, provides another, novel approach to achieving generalization from example annotations while allowing for designer control through the prompting mechanism.

ChatGPT and GPT-4
To illustrate the power of LLMs for gesture selection, we prompted ChatGPT and GPT-4 with text-gesture pairings which in practice would be drawn from annotations of an actual speaker or speakers.In the following example, we intentionally used opaque terms for the gesture that ChatGPT could not interpret, such as "lcg" meaning "large container gesture" and "ng" for "negation gesture".Based on just three example prompts when given the phrase "it is an excellent idea", ChatGPT responded with "lcg", shown in bold italics.

ChatGPT.
• Example prompt: It is an important idea.lcg  • Output: lcg This method is very similar to the BERT approach -it identifies the similarity between "important", "big", and "excellent" and produces the corresponding gesture.However, if we supply gesture descriptions that are interpretable to ChatGPT, then it produces novel gestural suggestions based on its interpretation of the gestural prompts.The following example demonstrate this.

ChatGPT.
• • Output: thumbs up gesture ChatGPT suggests a gesture that is appropriate for illustrating the utterance but was never prompted: "thumbs-up gesture".Such novel suggestions raise the question as to the degree to which the prompting is reflecting an underlying relation between utterance and gesture.To illustrate the depth of GPT's ability in this regard, consider an utterance sequence adapted from an interaction between a clinician and woman suffering from PTSD.This utterance and gestural sequence is adapted1 from an 90 minute corpus in which a real world clinician is going through a PTSD protocol with a woman actor playing the role of a client suffering from PTSD.She is talking to the clinician because of her husband who wants her to change behavior.A key moment in he interaction is as follows: The clinician is asking a client whether (1) there is anything the client wants to work on, adding the proviso of ( 2) besides what her husband wants.She uses the following gestures: As she says says (1) "anything", she performs a container gesture suggesting a set of all the possible topics they could discuss.Specifically, this container gesture depicts a domed region of space, perhaps suggesting a protected space (see Figure 1).Next, she clarifies what topic she does not want to discuss, (2; "besides what he wants") using a gesture that sweeps to the right, suggesting removing that topic from the container formed by the previous gesture.The "he" is a reference to the client's husband.See Figure 1.These gestures constitute what Calbris [12] calls an ideational unit, a sequence of related gestures.Such gestures tend to set up gestural space in a consistent fashion, such as depicting sets of abstract things and then removing or putting items into the set.
To explore GPT's ability to form ideational units, we first prompted ChatGPT to provide an explanation for the use of the container gesture in the clinician example.This yielded the following explanation that is consistent with a prior analysis of the clinician's gesture done by gesture researchers [19]: ChatGPT.
• Prompt: The clinician said we can talk about anything.She used a container gesture when she said anything.• ChatGPT: The clinician's use of a container gesture when saying "anything" may suggest that she wants you to feel safe and comfortable to discuss any topic that comes to mind, but also to keep those topics contained within the therapeutic space.
The container gesture is a nonverbal cue that can signal a sense of containment and boundaries, creating a safe space for exploration and self-discovery.Thus, ChatGPT captured the relation of the container to the possible topic.The response also unexpectedly captured the notion of a protected space which aligns with the specialized dome-shaped container gesture used by the clinician which we did not include in the prompt.GPT-4 yielded similar results but did not capture the subtlety of a "safe space" as explicitly.It instead suggested a container gesture to show inclusiveness and make the patient comfortable.Importantly, the following example shows that GPT-4 can provide a good prediction of what gesture the clinician could perform next consistent with the gesture sequence the clinician actually used (ChatGPT yielded similar results) .This suggests an ability to construct an ideational unit.

GPT-4.
• • GPT-4: After using the container gesture to illustrate "talk about anything", the clinician could use a swiping or brushing away gesture with one hand to represent "except what your husband wants us to talk about".This gesture would visually convey the idea of removing or excluding a specific topic from the otherwise open and inclusive container of conversation topics.
Finally, we prompted GPT-4 to visualize the entire action: • Prompt: Can you visualize it in some way?
• GPT-4: Sure! Here's a step-by-step visualization of the gestures: ( Note GPT-4 has gone beyond the hand gestures to suggest facial actions.These examples illustrate that there is important, and often surprising, potential for gesture selection and analysis in LLMs.
The next sections explore this potential more systematically.

APPROACH
In this section, we detail two approaches to investigating the use of LLMs to a) predict gestures from annotated examples and b) suggest gestures without example prompts.Note that these do not form an exhaustive exploration of using LLMs for gesture selection but are aimed at giving initial insight into the power of these tools.In the discussion, we set out a road map for developing these approaches and the use of LLMs in gesture generation.Note that all analyses in this paper are based on GPT versions from March 2023.

Data collection
To demonstrate GPT's potential for gesture selection, two researchers jointly annotated the first four minutes of Barack Obama's 2020 speech to the Democratic National Convention.Specifically, we focused on three gesture categories that were easily identifiable and semantically meaningful: sweep, span, and container.In total, we identified thirty-seven such gestures.We further distinguished between six unique physical gesture types: span, container, and four types of sweeps with different orientations of the palm (up, down, in, and forward).We also classified these gestures in terms of the semantic meaning conveyed, such as temporal span, negative sweep, and enumeration sweep.The final dataset contained seventeen unique semantic gestures (semantic gesture description + gesture category) and fifteen unique semantic gesture descriptions.Code and data are available at: https: //osf.io/c82tq/?view_only=4e0c9c1923cc432abd2c82d0ed8f05c8

GPT gesture prompting
For this work, we used ChatGPT and GPT-4.In all of our experiments, we set the temperature parameter to 0, making the outputs mostly deterministic to maximize the replicability of our results.All other parameters were set to default.
To elicit gesture suggestions, we prompted ChatGPT and GPT-4 in the following ways.First, we provided each model with a number of examples (2, 4, 6, or 33, i.e., leave-one-out) for each gesture category.For example, we provided two container gestures, two sweep gestures, and two span gestures.Each prompt was structured in three parts: (1) The following contextual statement: Barrack Obama is giving a speech at the Democratic National Convention.( 2) Examples of gestures Obama used.Each example consisted of three elements: The segment of the speech the utterance occurred in, the utterance associated with the gesture, and the gesture.For example: He said "I want to talk as plainly as I can about the stakes in this election because what we do these next 76 days will echo through generations to come."When he said "these next 76 days", he used the following gesture: "span".(3) The target utterance excluding the gesture.The models filled in the gesture at this point.We investigate four gesture specifications: Gesture category (span, sweep, container), physical gesture descriptions (e.g., sweep with palm facing up/down), semantic gesture descriptions (e.g., negative sweep), and semantic description only (e.g., negative).
The semantic specification in particular yielded a variety of gestures.We summarise an illustrative selection of these in Table 1 for the reader's convenience.
Finally  1: ChatGPT and GPT-4 gesture suggestions for semantic specifications measured pairwise cosine similarity between each original gesture and GPT-4's suggestions based on word embeddings derived using BERT (ChatGPT excluded for brevity, results are similar).This analysis accounted for cases in which GPT's description of the gesture differed from the annotation, therefore appearing inaccurate, but nevertheless showed high semantic similarity.Figure 2 shows the results.
Specifically, in the top row, each plot shows the prediction accuracy (y-axis) for each level of examples (x-axis), where 'LOO' denotes leave-one-out, for ChatGPT (purple) and GPT-4 (yellow), separately.The dotted line indicates chance performance (1/number of unique gesture types of that category in the dataset).In the bottom row, each plot shows the distribution of pairwise cosine similarities, ranging from 0 (very different) to 1 (very similar), for each level of examples (x-axis) for GPT-4.
Results show that both models predicted the correct gesture category and the correct physical gesture description with above chance accuracy when provided with four or more example prompts.Moreover, both models correctly predicted gesture types with semantic description and on only semantic descriptions with above chance accuracy from as little as two example prompts.Across all four gesture specifications, increasing the number of example prompts improved prediction performance and we observed the highest performance during gesture category and semantic description prediction.However, two examples appear to be too few to predict specific physical gestures.Instead, the models frequently generated gesture descriptions that did not map directly onto the gesture types we examined, thus causing the low performance.Notably, GPT-4 outperformed ChatGPT when predicting gesture categories and physical descriptions but not when predicting specifications involving semantic information such as 'negative' or 'inclusive'.
Figure 3 shows that misclassifications were primarily driven by confusions between span and sweep gestures.GPT-4 performed particularly poorly in this regard (shown in Figure 3).This indicates that the models used these gestures interchangeably.
Finally, to explore GPT-4's ability to generate appropriate cospeech gestures without any example input, we categorized the models' gesture output when no examples were provided based on its appropriateness.Specifically, we applied one of the following labels to each output gesture: Similar (to the original gesture), Different appropriate (the gesture was different but still appropriate in the context of the speech), Different inappropriate (the gesture was different and not appropriate in the context of the speech), No gesture (the model did not supply a gesture).Two researchers categorised all 37 output gestures of GPT-4, for brevity, and discussed any ambiguous cases until an agreement was reached.Figure 4 shows the resulting appropriateness levels.Red denotes inappropriate suggestions.
Results showed that in 43.2% of cases GPT-4 suggested a gesture similar to the true gesture.Importantly, when the produced gesture was not similar to the true gesture, in most cases (43.2%), the novel gesture was nevertheless appropriate in the context of Obama's speech.We found GPT-4 only produced inappropriate gestures in 13.5% of cases.
Notably, the model not only made appropriate suggestions for gestures but frequently supplied relevant and appropriate justifications for these suggestions.For example, GPT-4 described the gesture for 'unleashed' as "releasing something or setting free".This description very closely mirrored the actual gesture observed which was a quick motion of the right hand from the palm facing toward the body to the palm facing upward and the hand moving forward.

DISCUSSION
Here, we illustrated several key benefits of using LLMs for gesture selection.We showed that LLMs (1) can predict gestures used in a real-world context with above chance accuracy from as few as two example prompts and (2) can generalize over annotations to suggest novel, previously unseen gestures that are context-appropriate.Combined, these results demonstrate a key benefit of using LLMs for gesture selection -designer control, which is enabled by the selection of the speaker and context.For example, if the design goal for a virtual human is a clinical application where nonverbal behavior can play a critical role [31], the annotation of a trained clinician with highly effective gesturing skills interacting in a clinical session provides an ideal seed.Indeed, the clinician example highlighted the LLM's powerful ability to adapt to specific contexts as it appeared sensitive to the clinical context when providing an interpretation of the clinician's gestures.
Note that we provided the models with very few example prompts and required outputs that matched relatively strict criteria (e.g., 'sweep', 'palm down sweep', 'negat sweep', 'negat'; depending on gesture specification).Additionally, the differentiation between spans, sweeps, and containers is nuanced because each suggests a region in space.This further adds to the difficulty of the task.Finally, the models were not fine-tuned to achieve gesture selection.Taken these challenges into account, the difficulty of the task   should not be underestimated and the observed prediction accuracies of well above chance represent an impressive performance.Moreover, both models not only predicted gesture types and physical gesture descriptions but also semantic gesture specifications.This is especially impressive as we did not provide the models with information about different gesture types or how they relate semantically.Nevertheless, the models were capable of suggesting plausible gestures.
However, a key challenge to applying LLMs to gesture generation is the annotation process itself.For one, manual annotation is laborious.Additionally, designing a gesture naming scheme can be difficult due to the wide variety of real-world gestures and the lack of an agreed upon coding system for gestures, unlike the facial action coding system used in facial expression research [21].Any coding scheme used needs to reliably distinguish the speaker's gestures and there can be significant individual and situational differences [33].
Additionally, coding schemes need to distinguish physical properties across semantically related sequences of gestures, i.e., ideational units [12].As discussed earlier, in an ideational unit, the physical properties of a gesture can be carried over to subsequent gestures.For example, Obama uses a beat gesture with a precision grip to emphasize a specific point and then holds that prevision grip as he does a sweep gesture to illustrate a related issue.The precision grip is not intrinsic to this sweeping gesture but rather is carried over from the previous beat gesture.During annotation, the annotator must ideally distinguish such carryovers of form.
These shortcomings can be circumvented by not using annotations but instead prompting LLMs with just the context and utterance.However, the unconstrained gestural suggestions provided by the models can be hard to parse by machine making automated gesture generation on this basis difficult.Another approach is automatic annotation which would minimize laborious user input.However, this would also cede some degree of designer control.On the upside, we showed that leveraging an LLM's powerful generalization capability does not require a large amount of annotations.The important requirement that remains is to have sufficient annotations to span the speaker's gesture repertoire.
Another important issue is timing, specifically what parts of an utterance are accompanied by a co-speech gesture.Gestures are often associated with what the speaker seeks to emphasize [18].As a result, altering which part of an utterance is emphasized by a gesture can transform the meaning conveyed.It is therefore critical for co-speech gesture generation to derive emphasis information about the utterance.A common approach to this is to use prosodic cues such as pitch and pauses [22,30].However, this presumes that the spoken utterance that is driving gesture generation includes prosodic cues appropriate for the context in which the utterance occurs.A more flexible method is discourse analysis.For example, discourse analysis can identify the rheme of the utterance.This is the part of the utterance commenting on and providing new information about the topic or background of the conversation, what is referred to as the theme.The rheme tends to be more associated with gesturing [16,48].It is likely that LLMs can also be leveraged to identify rhemes.
However, in addition to temporal disconnects between gesture and speech, co-speech gestures often convey additional information that is not transmitted by the utterance at all.For example, in his speech, Obama says "I have sat in the Oval Office with both of the men who are running for president", accompanying the 'oval office' with a sweep of his right hand.This may simply function as an emphasis but, equally, alludes to the length of time that he has spent in that office -8 years -and the experience he has therefore accumulated.It is a temporal sweep not conveyed by the semantic content of the speech and thus difficult to identify not only for humans but also for machine approaches.
Finally, the models used in this paper are proprietary and not open source.As a result, any update to the models may change the results without recourse for the researcher.This poses an issue for both reproducibility and reliability of any applications based on these models.
It is clear then from the work in this paper that LLMs hold great potential for gesture selection and, with that, generation.Yet, a number of challenges remain to be addressed.In the following we set out our current research efforts to formally and systematically explore the use of LLMs for gesture selection and implementation for gesture generation.
(1) Expansion of analysis to several types of speakers, including, for example, politicians, psychotherapists, and orators.(2) Annotation of these videos, going beyond the restricted set of gestures used in this paper and instead annotating all gestures.
(3) Expansion of the types of prompts used to explore a) which prompts most effectively aid GPT in suggesting gestures and b) the minimal approach necessary to elicit appropriate and interpretable gestures.(4) Extension of analysis of the gestures produced to include rheme, theme, and ideational units.(5) Evaluation of the appropriateness of the suggested gestures using human observer studies in addition to expert opinion.
(6) Implementation of gesture generation by extracting gestures selected by GPT, matching to a gesture dictionary, and generating gestures for virtual human avatars.

CONCLUSION
In this paper we showcased the use of LLMs to realize a generative mapping between verbal utterances and accompanying co-speech gestures.We explored approaches to using LLMs for gesture selection based on minimal to no example prompts.LLMs consistently suggested context-appropriate gestures.While this approach largely relies on laborious manual annotation, GPT required very few annotated examples to suggest suitable gestures.Nevertheless, it's important to note that the effectiveness of these gesture suggestions hinges on annotations derived from performances by a skilled speaker in a context and role that mirror the designer's intent for a virtual agent.Throughout our exploration of LLMs in this context, we not only highlighted their potent ability to select gestures but also underscored the challenges that still need to be surmounted for the complete implementation of LLMs for gesture generation.We outlined our ongoing research endeavors aimed at overcoming these obstacles.
In conclusion, we view this novel approach as a valuable and versatile tool for gesture researchers to use.However, we recognize that its suitability and alignment with goals may vary, as it either necessitates annotation (when example prompts are used) or interpretation (when example prompts are absent).We nevertheless believe that it will be a fruitful avenue for many gesture researchers.

Figure 1 :
Figure 1: Ideational Unit showing a container gesture depicted a protected space followed by a sweep gesture suggesting something discarded from the container