Uncovering the Causes of Emotions in Software Developer Communication Using Zero-shot LLMs

Understanding and identifying the causes behind developers' emotions (e.g., Frustration caused by ‘delays in merging pull requests’) can be crucial towards finding solutions to problems and fostering collaboration in open-source communities. Effectively identifying such information in the high volume of communications across the different project channels, such as chats, emails, and issue comments, requires automated recognition of emotions and their causes. To enable this automation, large-scale software engineering-specific datasets that can be used to train accurate machine learning models are required. However, such datasets are expensive to create with the variety and informal nature of software projects' communication channels. In this paper, we explore zero-shot LLMs that are pretrained on massive datasets but without being fine-tuned specifically for the task of detecting emotion causes in software engineering: ChatGPT, GPT-4, and flan-alpaca. Our evaluation indicates that these recently available models can identify emotion categories when given detailed emotions, although they perform worse than the top-rated models. For emotion cause identification, our results indicate that zero-shot LLMs are effective at recognizing the correct emotion cause with a BLEU-2 score of 0.598. To highlight the potential use of these techniques, we conduct a case study of the causes of Frustration in the last year of development of a popular open-source project, revealing several interesting insights.


INTRODUCTION
Emotions play a crucial role in software engineering, influencing individual and team performance, communication, and decision-making. Numerous software engineering tasks have been found to be impacted by developer emotions, e.g., bug fixing efficiency [1,2] and build success of continuous integration [3]. Previous research has also studied the link between developers' emotions, their productivity, and their attrition in software development projects [4][5][6]. Open source projects' success depends on attracting and retaining volunteer participants. Automatically detecting the emotions in the communication channels of a software project, e.g., pull request and issue comments, chats, and discussion boards, can provide valuable insights into improving the outcomes of open-source software projects.
Going beyond detecting the occurrence of different emotions in developer communication channels, identifying the causes of those emotions is key for many uses. Simply knowing the existence of an emotion is often insufficient for understanding what or whom the expressed emotion is towards and for determining the appropriate reaction [7]. However, when the emotion cause is known, along with the type of emotion, it becomes possible to reliably assess potential implications. This could allow for understanding developer opinion towards different aspects of the project like technical debt [8] or code reviews [9]. For instance, a developer made the following comment on an open issue in the flutter/flutter GitHub project, "this is a really severe issue, the ux is pretty awful when you have a splash and then a landing page to simulate splash because it is very obvious that is a different view than the splash", which expresses Frustration (a sub-emotion of Anger). The cause of this emotion is the text span, "the ux is pretty awful". Extracting emotion causes automatically is challenging because of the distinct nature of software engineering communication (e.g., it includes domain-specific idioms like 'spaghetti code'), the variety of different channels (e.g., chats vs. issue comments), and the informal nature of developer communication (e.g., often containing informal abbreviations like 'AFAIK' [10]). It is likely that emotion-cause extraction requires a large amount of software engineering-specific training data that can capture this variability, in both emotion and language [11,12].
Large Language Models (LLMs) have recently emerged as a powerful new type of deep learning technique. These models are built by unsupervised pre-training on a very large dataset, followed by supervised fine-tuning on a smaller dataset. During pre-training, the model learns to predict the next word, given a sequence of words. During fine-tuning, the model is provided with labeled data relevant to a specific task. Some of the largest and most powerful LLMs, such as ChatGPT [13] and GPT-4 [14], are now widely available but do not disclose details about their dataset, training process, or model weights. Consequently, fine-tuning them for a specific task or dataset, such as detecting emotion causes in software engineering text, is not possible. However, these LLMs can still be used as "zero-shot" models, where no task-specific fine-tuning is performed. Since constructing a large training dataset for the emotion-cause extraction task in software engineering communication is expensive, using a zero-shot setup is an attractive option. Models used in a zero-shot setup are often referred to as "zero-shot LLMs" in the recent literature [15][16][17][18]. In this paper, we use the same phrase to refer to a model used in a zero-shot setup.
This paper applies three models in a zero-shot setup, ChatGPT [13], GPT-4 [14], and the open-source flan-alpaca [19], to the problem of emotion-cause extraction in software engineering. We first examine the ability of such models to detect emotions in software engineering text, relative to state-of-the-art techniques and to LLMs fine-tuned for detecting emotions. Next, we examine the effectiveness of the zero-shot LLMs for the emotion-cause extraction task. The results indicate that these models are promising, achieving a BLEU-2 score of 0.598 on a manually curated dataset of 450 utterances. Finally, we perform a case study on the causes of Frustration, an undesirable emotion within a large open-source software project [20], to further highlight the utility of emotion-cause extraction for software engineering. The main contributions of this paper are:
• application and evaluation of zero-shot LLMs to the problem of emotion-cause extraction in software engineering.
• manually-curated emotion-cause extraction dataset of 450 GitHub comments.
• case study highlighting the usefulness and purpose of automatic emotion-cause extraction in software engineering.
• evaluation of zero-shot LLMs compared to state-of-the-art techniques, including fine-tuned LLMs, on the well-known problem of classifying emotions in software engineering communication.
We publish the source code and annotated dataset to facilitate the replication of our study at: https://anonymous.4open.science/r/SE-Emotion-Cause-Replication-0C01.

PRELIMINARY STUDY: DETECTING EMOTION TYPES
Detecting the causes of emotions in text requires a reliable model that can accurately identify the type of emotion expressed. Therefore, before proceeding, we conduct a preliminary investigation to determine if zero-shot LLMs can accurately detect emotions in software engineering texts. We compare the performance of these models with 1) existing state-of-the-art emotion classification models in software engineering, and 2) fine-tuned LLMs. The models are evaluated on three different types of datasets: a) GitHub comments dataset [21], b) Stack Overflow comments dataset [22], and c) JIRA comments dataset [2].

Datasets
GitHub Dataset. For training and testing with each dataset, we employ an 80%-20% stratified sampling approach.
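The 80%-20% stratified sampling mentioned above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name `stratified_split`, the fixed seed, and the toy label list are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each label group separately."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Each emotion contributes the same fraction to the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels (hypothetical), only to show that class proportions are preserved.
labels = ["Anger"] * 10 + ["Joy"] * 5 + ["Neutral"] * 20
train_idx, test_idx = stratified_split(labels)
```

Splitting per label keeps rare emotions represented in both partitions, which a plain random split does not guarantee.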

Emotion Model
All three datasets rely on the well-known Shaver's tree-structured emotion model [23]. In Shaver's model, for each of the six basic emotions, there are secondary and tertiary-level emotions, which refine the granularity of the previous level. GoEmotions is an alternative emotion model used in the literature that was proposed by researchers at Google with the focus on emotions that can be observed in written text [24]. In their recent work, Imran et al. [21] extended Shaver's model by incorporating a few emotions from GoEmotions' [24] taxonomy in order to study emotions present in GitHub communications. Out of 27 emotions in GoEmotions' list, 26 are in the extended model by Imran et al. The only emotion that is not on the list is Gratitude.
In order to study both GoEmotions and Shaver's models, in this paper, we map the remaining emotion, Gratitude, into Shaver's tree-structured emotion model. To do so, we compare how Gratitude is defined in GoEmotions [24] with the emotion definitions of Shaver et al. [23]. GoEmotions defines Gratitude as "a feeling of thankfulness and appreciation", while Shaver et al. describe Love as "involving the appreciation of someone." Therefore, we map Gratitude as a secondary emotion of the basic emotion Love in this study.
The extended model is shown in Table 1, with blue-colored emotions also appearing in the GoEmotions' listing.

Compared Models
Existing SE-specific models. We use three existing SE-specific models that have been shown to produce state-of-the-art performance in emotion classification.
• ESEM-E: A tool proposed by Murgia et al. [25] that uses unigrams and bigrams as features and an SVM as the ML model. It has been widely used in the literature for software engineering emotion classification tasks [21,25,26].
• SEntiMoji: A model proposed by Chen et al. [26] and built on top of the DeepMoji [28] model. This flexible model can identify different emotion categorization schemes, including Shaver's categorization.
For EMTk and SEntiMoji, the authors published the model implementations. We use the provided code for training and testing. As for ESEM-E, we carefully read the instructions provided by the authors and implemented the model ourselves.
Fine-tuned LLMs. We fine-tune two popular LLMs, BERT and RoBERTa, that have been widely used for emotion and sentiment analysis, including in software engineering [29][30][31][32]. We leverage the pre-trained model weights from HuggingFace [33].
• BERT: Bidirectional Encoder Representations from Transformers is a widely-used LLM developed by Google. BERT is pre-trained using English Wikipedia and BooksCorpus [34].
• RoBERTa: Robustly Optimized BERT, a variant of BERT, is developed by Meta. It is pre-trained using English Wikipedia, BooksCorpus, news articles, Web text, and stories [35].
Zero-shot LLMs. We use three (two commercial and one open-source) recent pre-trained and instruction-tuned models in a zero-shot setting, i.e., the models are not tuned for the task of emotion (cause) detection in software engineering.
• ChatGPT [13]: We use the gpt-3.5-turbo model by OpenAI. The model was instruction-tuned (from a large dataset of instructions with desired output) using Reinforcement Learning from Human Feedback (RLHF) [36].
• GPT-4 [14]: We use the gpt-4 API by OpenAI. GPT-4 is a transformer-style model pre-trained using both publicly available data and data licensed from third-party providers; details of the training data are not released at the time of writing. GPT-4 introduced a rule-based reward model (RBRM) approach on top of RLHF.
• flan-alpaca [19]: This is a variation of the Alpaca [37] fine-tuned model. Alpaca was developed by Stanford, based on Meta's LLaMA [38] model using 52K instruction-based data instances. Due to licensing issues, the original Alpaca model is not accessible at the time of our experiment. Instead, using the Alpaca instructions dataset, Chia et al. [19] fine-tuned Google's instruction-tuned Flan-T5 [39] model and released the weights. We use the flan-alpaca-xl version from Hugging Face [40].

Metrics
The F1-score is a widely used metric for assessing the effectiveness of a (multi-class) classification model. It is the harmonic mean of precision and recall, which takes into account both false positives and false negatives: $F_1 = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}$. To calculate the average score across all classes, i.e., emotions, we use the micro-averaged variant, which has been widely used in related tasks [21,32,41].
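As an illustration of the micro-averaged variant, the sketch below pools true positives, false positives, and false negatives over a chosen set of classes before computing precision, recall, and F1. The function name `micro_f1`, the toy labels, and the choice of which classes to pool (here the emotions without Neutral) are assumptions for the example, not details taken from the evaluation scripts.

```python
def micro_f1(gold, predicted, classes):
    """Micro-averaged F1: pool TP/FP/FN over `classes`, then compute one F1."""
    tp = fp = fn = 0
    for cls in classes:
        for g, p in zip(gold, predicted):
            if p == cls and g == cls:
                tp += 1  # predicted this class and it was correct
            elif p == cls and g != cls:
                fp += 1  # predicted this class, gold label differs
            elif g == cls and p != cls:
                fn += 1  # gold is this class, prediction missed it
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical labels): pooled TP=2, FP=1, FN=2.
gold = ["Anger", "Joy", "Love", "Joy", "Neutral"]
pred = ["Anger", "Joy", "Joy", "Neutral", "Neutral"]
score = micro_f1(gold, pred, classes=["Anger", "Joy", "Love"])
```

In this toy case precision is 2/3 and recall is 1/2, so the pooled F1 is 4/7. Because counts are pooled, frequent classes dominate the score, which differs from macro-averaging.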

Basic Emotion Prompting
The zero-shot LLMs we are considering are all instruction- (or prompt-) tuned. This recent category of LLMs uses a fine-tuning process with instructional data, which helps the LLMs to better comprehend and respond to user-composed prompts. To our knowledge, there is no prior work on how to formulate prompts for emotion recognition in software engineering text using these LLMs.
A recent study by Kocon et al. [42] evaluated the performance of ChatGPT on various natural language processing tasks by designing over 38k prompts that covered 25 different tasks, including emotion classification using the GoEmotions dataset [24]. Inspired by this study, we designed a prompt for emotion classification that we used on all three datasets. More specifically, we asked the models to act as a user in a specific platform, i.e., GitHub, Stack Overflow, and JIRA, and provided the utterances and a list of the basic (top-level) emotions: Anger, Fear, Love, Joy, Sadness, and Surprise. The prompt is the following:
Your task is to detect whether there is one of the following emotions aroused in you while reading the utterance. Emotions List: Anger, Fear, Love, Joy, Sadness, Surprise. Utterance: <insert utterance>. If there is no emotion in the text, write Neutral. Otherwise write exactly one word, the exact emotion from the emotions list.
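The prompt template above could be filled in programmatically along the following lines. This is a hedged sketch: the helper names `build_emotion_prompt` and `parse_emotion_reply` are hypothetical, no model API is called, and the reply-cleaning rules are an assumption. The emotion list is a parameter so that a dataset lacking some categories can pass a shorter list.

```python
BASIC_EMOTIONS = ["Anger", "Fear", "Love", "Joy", "Sadness", "Surprise"]


def build_emotion_prompt(utterance, emotions=BASIC_EMOTIONS):
    """Fill the basic-emotion prompt template with an utterance and emotion list."""
    return (
        "Your task is to detect whether there is one of the following emotions "
        "aroused in you while reading the utterance. "
        f"Emotions List: {', '.join(emotions)}. "
        f"Utterance: {utterance}. "
        "If there is no emotion in the text, write Neutral. "
        "Otherwise write exactly one word, the exact emotion from the emotions list."
    )


def parse_emotion_reply(reply, emotions=BASIC_EMOTIONS):
    """Return the matching label, or None when the reply is outside the list."""
    cleaned = reply.strip().strip(".\"'").lower()
    for label in list(emotions) + ["Neutral"]:
        if cleaned == label.lower():
            return label
    return None  # an out-of-list answer, e.g. "Apology"


prompt = build_emotion_prompt("Any updates on this?")
```

Returning `None` for out-of-list answers makes it easy to count replies that fall outside the prompted emotions.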
Since the JIRA dataset does not contain Fear and Surprise, we do not list these two emotions in the prompt when evaluating with this dataset.
Results and Discussion. Table 2 shows the results for the three emotion classification datasets and for all the models. It is clear from the results that the zero-shot LLMs performed poorly across all datasets, lagging behind the SE-specific models and the fine-tuned LLMs. The fine-tuned LLMs performed best, e.g., RoBERTa achieved the best micro-averaged F1-scores overall, with 0.592, 0.735, and 0.818 on the GitHub, Stack Overflow, and JIRA datasets, respectively. Among the SE-specific models, the deep learning-based SEntiMoji model performed best with an average F1-score of 0.529.
In order to understand where the zero-shot LLMs are making mistakes, next, we conduct an error analysis.
Error analysis. One of the most common errors we observed is that zero-shot LLMs are misclassifying Love utterances as Joy for all datasets. For example, on the Stack Overflow dataset, the F1-score for Love is 0.0, 0.116, and 0.078 for flan-alpaca, ChatGPT, and GPT-4 respectively. Compared to this, BERT, RoBERTa, ESEM-E, EMTk, and SEntiMoji obtained an F1-score of 0.840, 0.861, 0.757, 0.811, and 0.829 respectively. This is also evident in the number of false positive (FP) utterances in the Joy category, i.e., for the Stack Overflow dataset, the number of FPs for BERT, RoBERTa, ESEM-E, EMTk, and SEntiMoji are 34, 21, 29, 17, and 17 respectively, whereas, for flan-alpaca, ChatGPT, and GPT-4, the FPs are 259, 72, and 91.
Another common type of error was that the models often predicted Neutral. For instance, on the GitHub dataset GPT-4 identified 269 (67%) utterances as Neutral. In many cases, a secondary or tertiary emotion from Shaver's categorization most closely describes the annotated utterances. However, those emotions were not provided to the model. For example, consider the following sentence from the GitHub dataset: "Any updates on this? I'm implementing a flutter application with barcode scanners, the soft keyboard on screen is really annoying.", annotated as Anger and, on a more granular level, as Annoyance. All zero-shot LLMs predicted it as Neutral.
As another example, the following sentence is annotated as Worry, which is a tertiary-level emotion of Fear: "My concern is that more new atributes may appear [...] it may break their behavior.", while flan-alpaca and ChatGPT classified it as Neutral.
We also observed a number of hallucinations in the zero-shot LLMs' output [43], where the models generated responses that were outside of what was asked. This led to situations where the models outputted emotions such as Apology and Appreciation, despite them not being in the prompted emotions list. For example, GPT-4 predicted the following sentence as Apology: "Doh. Sorry for wasting your time." even though the set of basic emotions provided in the prompt does not contain this emotion.
In order to address these issues, we experiment with constructing prompts with a more granular level of emotions, i.e., by considering the secondary and tertiary-level emotions in Shaver's extended taxonomy. This is also motivated by the study of Kocon et al. [42], who used all of GoEmotions' 27 emotions in their prompting experiments with ChatGPT.

Granular-level Emotion Prompting
In order to experiment with more granular emotions, we require a labeled dataset that includes these emotions. Therefore, we specifically conducted these experiments with Imran et al.'s [21] dataset, which provides secondary and tertiary-level emotion annotations while the other datasets do not. First, we conducted prompt experiments using a part of Imran et al.'s training set (note that the zero-shot LLMs are not using the training data), varying the information used in the prompts for each instruction-tuned language model. More specifically, we randomly selected 400 comments from the training dataset using stratified sampling and tested with granular-level prompting using the following strategies: 1) all emotions (basic, secondary, and tertiary) from the extended Shaver's categories, a total of 140 emotions; 2) only the basic and secondary emotions from the extended Shaver's categories, a total of 34 emotions; 3) GoEmotions' list of 27 emotions.
We mapped the output emotion from the secondary and tertiary emotions to the corresponding basic emotions as shown in Table 1 and compared the results of the models at this level (as the SE-specific models can only produce results at the basic emotion level). We also found during the granular-level prompting that the models sometimes produced minor wording variations of the provided emotions, such as Confused instead of Confusion, or Excited instead of Excitement. While mapping the outputs of the zero-shot LLMs to the basic emotions, we made adjustments so as not to penalize the models for these minor differences.
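The mapping step described above might look like the following sketch. Only the parent emotions that are stated in this paper's text are included, so the table is a small excerpt rather than the full Table 1; the placement of Excitement under Joy is an assumption, as are the names `TO_BASIC`, `VARIANTS`, and `to_basic`.

```python
BASIC = ["Anger", "Fear", "Love", "Joy", "Sadness", "Surprise"]

# Excerpt of the granular-to-basic mapping, limited to pairs mentioned in the text.
TO_BASIC = {
    "annoyance": "Anger", "frustration": "Anger",
    "worry": "Fear",
    "gratitude": "Love",
    "approval": "Joy", "enthusiasm": "Joy", "relief": "Joy", "amusement": "Joy",
    "excitement": "Joy",  # assumption: parent not stated in the text
    "neglect": "Sadness",
    "confusion": "Surprise", "curiosity": "Surprise",
}

# Minor wording variations that should not count against the model.
VARIANTS = {"confused": "confusion", "excited": "excitement"}


def to_basic(model_output):
    """Map a model's emotion label to a basic emotion, or None if it is unknown."""
    key = model_output.strip().strip(".").lower()
    key = VARIANTS.get(key, key)  # normalize wording variants first
    if key == "neutral":
        return "Neutral"
    for basic in BASIC:
        if key == basic.lower():
            return basic
    return TO_BASIC.get(key)  # None signals a label outside the taxonomy
```

For example, `to_basic("Confused")` resolves to Surprise through the Confusion entry, while an out-of-taxonomy label such as "Apology" yields `None`.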
During our experimentation, we observed that all three models tend to suffer more strongly from the issue of hallucination [43] when the complete emotion list (all of the basic, secondary, and tertiary-level emotions) is provided. For example, GPT-4 suffered 50% more hallucinations when the complete list was provided compared to the basic list of emotions. In particular, the models tended to generate extrinsic hallucinations [44], i.e., information beyond what is asked in the given prompt. This led to situations where the models generated emotions such as Concern, Apology, and Appreciation, despite them not being in the prompted emotions. This suggests that providing a very large list of emotions may not be optimal.
Out of the strategies we attempted, providing GoEmotions' list of 27 emotions produced the best performance. For example, on the sample of the training dataset, ChatGPT achieved an F1-score of 0.201 when all emotions from Table 1 were provided, 0.341 when basic and secondary emotions were provided, and 0.419 when GoEmotions' emotions were provided. As noted earlier, the GoEmotions taxonomy was developed specifically for text-based emotion recognition [24]. This can explain why it performed better than emotions selected directly from Shaver's taxonomy, which was developed based on psychological evidence and not specifically for text [23].
Therefore, we opt to use GoEmotions' list of emotions when prompting the zero-shot LLMs for emotion classification. Next, we report the results on Imran et al.'s [21] held-out test dataset.
Results and Discussion. Table 3 shows the results for emotion classification on Imran et al.'s GitHub dataset for all the models. Overall, BERT and RoBERTa still achieve the best results with an average F1-score of 0.588 and 0.592 respectively, while among the SE-specific models the deep learning-based SEntiMoji achieved the highest average F1-score of 0.529. From the table, it is clear that all three zero-shot LLMs improve in most categories of emotions and in the overall micro-averaged F1-score. It is also noticeable that they improved in distinguishing Love and Joy utterances. However, the zero-shot LLMs still perform badly for Fear. Overall, surprisingly, the open-source model flan-alpaca achieved the best performance among the zero-shot LLMs with an average F1-score of 0.506, an improvement of 19.33% over the basic emotion-level prompting, while the proprietary model GPT-4 achieved 0.482, an improvement of 35.77%. Both of these are improvements over the three SE-specific models and the proprietary ChatGPT (gpt-3.5-turbo) model.
The results again point out that, despite major advancements in instruction-tuned LLMs, the fine-tuned deep learning models still perform better for specific, well-defined tasks that require domain-specific knowledge. To understand more about where the zero-shot LLMs make mistakes, we again analyze their errors.
As also noticeable in Table 3, Fear is the most often misclassified category with (35/140) instances. The errors in this category are especially discernible with GPT-4. With basic emotions only, GPT-4 achieved an F1-score of 0.353 in the Fear category, while at the granular level the F1-score went down to 0.049. The primary reason is that GPT-4 generated hallucinated output with labels such as Worry, Concern, etc., which are missing in the GoEmotions list. However, some of these emotions are present in Shaver's extended list and in Imran et al.'s annotation. In the annotated data, most of the Fear utterances are due to the tertiary-level emotion Worry. For example, the utterance "Isn't this a breaking change? Can we get away with it?" is annotated as Worry (3rd level of Fear) in the ground truth. Another example is the utterance: "I guess my concern is that it sets a precedent where somebody could see it and think that it would be fine to use in 'core'."
The second most misclassified emotion category is Joy with (33/140) instances. Many of these errors occur because the models predict conservatively, i.e., they predict Neutral instead of a specific emotion. For example, the utterance "Anyway, the syntax change is fine." is annotated as Approval (2nd level of Joy). Another example, "[USER] can you assign this ticket to me, I can help in this.", is annotated as Enthusiasm (2nd level of Joy). Notably, Enthusiasm is not in GoEmotions' emotion list. Another type of error in the Joy category is that utterances are often misclassified as Surprise. For example, the utterance "This was actually causing this test-case not to be executed!" is annotated as Relief (2nd level of Joy), but the flan-alpaca and GPT-4 models predicted Surprise.
The third most misclassified category is Sadness with (31/140) instances. We observed that these utterances are often misclassified as Anger, Surprise, or Neutral. For example, flan-alpaca and GPT-4 predicted Surprise for this utterance: "Ah sorry I thought 'ScaleUpdateDetails' was constructed in '_update' nvm."
We observed hallucinated emotions as well (5 for GPT-4, 13 for flan-alpaca, 2 for ChatGPT), especially Concern and Worry among Fear utterances, and Appreciation among Love utterances.
Overall, the error analysis points out the need for a more specialized emotion taxonomy for text-based emotion detection, in particular for software-engineering-related text. As noted earlier, Shaver's [23] taxonomy, developed in Psychology, includes many additional emotions that do not appear in text and confuse the zero-shot LLMs. Meanwhile, although the GoEmotions list focuses on text-based emotions, it is still missing some emotions commonly observed in software engineering, such as Worry and Frustration.

EMOTION-CAUSE EXTRACTION
The results of the preliminary study suggest that zero-shot LLMs are capable of detecting emotion categories when provided with granular-level emotions, performing slightly worse than the best evaluated models (i.e., in Table 3, flan-alpaca's F1-score is 0.506 relative to SEntiMoji's 0.529 and RoBERTa's 0.592). In this section, we examine their feasibility for the more challenging task of emotion-cause extraction.
The use of LLMs for emotion-cause extraction has experienced a notable uptick in interest in recent years [45,46]. Emotion-cause extraction seeks to identify the cause or event that instigates a specific emotion in a given text, providing essential insights into human behavior and deepening our comprehension of the underlying emotions behind text-based communication. Researchers have explored the potential of LLMs in detecting emotion causes across multiple domains, such as social media and news articles [45][46][47].
Despite the growing interest in emotion-cause extraction in different domains, there is a lack of research on this problem in software engineering communication text. This research gap inspires our study, which aims to investigate the effectiveness of zero-shot LLMs in detecting emotion causes in GitHub comments.
To this end, we first manually annotate emotion causes in a subset of Imran et al.'s [21] data, identifying the text span that represents the cause of emotion in the comment. We then use zero-shot LLMs to extract emotion causes and compare their performance against the annotated emotion causes using the BLEU score [48], a standard metric in machine translation to evaluate text sequence similarity. Below, we present a detailed description of our annotation process, zero-shot LLMs, and the comparison of BLEU scores across different models and configurations.

Annotation
To create a dataset for the emotion-cause extraction task, we begin by selecting 75 utterances for each of the 6 basic emotion categories (Anger, Love, Fear, Joy, Sadness, Surprise) from Imran et al.'s training dataset, totaling 450 utterances. Two senior undergraduate students (with 3+ years of experience in programming) are then tasked with annotating the dataset by identifying emotion causes, if any, based on the previously annotated basic, secondary, and tertiary emotions by Imran et al. We provide them with the following instructions:
For each instance containing an emotion (Anger, Love, Fear, Joy, Sadness, Surprise), find the span of text (if any) that contributes to the annotated emotion. Each instance then should be annotated with its corresponding causes if existing. Emotion can sometimes be associated with more than one cause, in such a case, both causes should be marked. Since in some cases, more than one emotion can be present in an instance, the causes for emotion should be mapped as <emotion, cause span>.
The above instructions are adapted from Chen et al.'s seminal work on detecting emotion causes [49]. We also provide the annotators with definitions and examples of different types of emotion causes. After the annotation task is completed, one of the authors of the paper manually reviewed both sets of annotations and noted disagreements in 44 of the 450 instances. To resolve these discrepancies, the annotators are asked to meet on Zoom and discuss and resolve their differences. This process ensures the annotated dataset's reliability and consistency.

Model Selection
For the automated emotion-cause extraction task, we evaluate the same three instruction-tuned models (ChatGPT, GPT-4, and flan-alpaca) that we used for emotion detection in Section 2, i.e., the preliminary study. We do not use BERT or RoBERTa as those models require a large amount of domain-specific training data [50], which we lack.

Prompt Design
The structure of our emotion-cause extraction prompt is intended to mimic a real-world scenario where a GitHub user is going through issues and pull requests, experiencing various emotions, and trying to pinpoint the cause of a specific emotion in a given utterance. We use a two-step prompt that asks the model to first detect the emotion in the utterance using the procedure outlined in Section 2. Then, we prompt the model to identify the cause of this emotion, using the following prompt:
You are a GitHub user. You are reading GitHub comments. Your task is to extract the span that is causing the emotion <insert emotion> in the following GitHub utterance: <insert utterance>. Write the span of the cause within a double quote. Do not write anything else.
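The second step of the two-step prompt could be assembled and post-processed roughly as sketched below. The helper names (`build_cause_prompt`, `parse_cause_reply`, `is_verbatim_span`) and the parsing rules are assumptions for illustration; the model call itself is omitted.

```python
import re


def build_cause_prompt(emotion, utterance):
    """Fill the cause-extraction prompt with a detected emotion and an utterance."""
    return (
        "You are a GitHub user. You are reading GitHub comments. "
        f"Your task is to extract the span that is causing the emotion {emotion} "
        f"in the following GitHub utterance: {utterance}. "
        "Write the span of the cause within a double quote. Do not write anything else."
    )


def parse_cause_reply(reply):
    """Take the text between the first and last double quote, else the whole reply."""
    match = re.search(r'"(.+)"', reply, flags=re.DOTALL)
    span = match.group(1) if match else reply
    return span.strip()


def is_verbatim_span(span, utterance):
    """True when the extracted span actually occurs in the utterance."""
    return span.lower() in utterance.lower()


utterance = ("this is a really severe issue, the ux is pretty awful "
             "when you have a splash")
prompt = build_cause_prompt("Frustration", utterance)
span = parse_cause_reply('the span: "the ux is pretty awful"')
```

Checking that the span occurs verbatim in the utterance is one simple way to flag replies that describe the cause instead of quoting it.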

Results
To ensure consistency in our evaluation, we preprocess all comments, annotated causes, and model-extracted causes by removing punctuation, lemmatizing, and stemming. After preprocessing, the average length of the 450 utterances is 28.08 words, while the average length of the manually annotated emotion cause spans is 7.43 words. We find that the emotion cause spans extracted by GPT-4, ChatGPT, and flan-alpaca have average lengths of 8.85, 8.64, and 13.12 words, respectively.

BLEU score. The BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate the quality of machine-generated text by comparing it to human-generated reference text [48]. The BLEU score measures the similarity between the machine-generated text and the reference text based on the n-gram overlap between them. The higher the BLEU score, the closer the machine-generated text is to the reference text. The formula for the BLEU score is:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where,
• $BP$ is the brevity penalty, which is 1 if the machine-generated text is longer than the reference text and $e^{1 - r/c}$ otherwise, with $c$ and $r$ denoting the lengths of the machine-generated and reference texts.
• $N$ is the maximum n-gram order.
• $p_n$ is the precision score for n-grams.
• $w_n$ is the weight for n-grams, which is usually set to $1/N$ for uniform weighting of all n-gram orders.
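To make the definition concrete, the following sketch computes a sentence-level BLEU score against a single reference with uniform weights and no smoothing, following the standard formulation in [48]. It is an illustration under assumptions: the function names are hypothetical, tokens are obtained by whitespace splitting, and the lemmatization and stemming used in the paper's preprocessing are omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU with one reference, uniform weights, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, a zero precision gives BLEU = 0
        log_sum += math.log(clipped / total) / max_n  # uniform weight 1/N
    # Brevity penalty: 1 if the candidate is longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_sum)


# Lowercased, punctuation-free spans; the candidate has one extra leading word.
score = bleu("well i like it even more", "i like it even more")
```

Here the unigram precision is 5/6 and the bigram precision is 4/5, so BLEU-2 is the square root of 2/3, about 0.816, with no brevity penalty because the candidate is longer than the reference.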
BLEU Score Interpretation. The interpretation of BLEU scores can vary depending on the specific domain and language being evaluated. In the software engineering domain, the BLEU score is commonly used to evaluate the quality of generated bug reports, code comments, and code summaries. Denkowski and Lavie [51] suggest that BLEU scores above 0.30 generally indicate that the generated text is understandable, while scores above 0.50 are indicative of good and fluent results. Previous research [51,52], including studies in software engineering [53], has used this scale to interpret the results of BLEU scores. It is important to note that the choice of n-gram order used to calculate the BLEU score can impact the final score; typically, 4-gram is used for BLEU score calculation [48,51]. In the case of our study, however, the emotion cause spans are often short, making the bigram a more suitable choice for BLEU score calculation, i.e., BLEU-2.
The scores reported in Table 4 range from 0.450 to 0.637, which indicates that all models are generally able to extract the right emotion causes to some extent, especially GPT-4 and flan-alpaca, as both models' BLEU scores are always above 0.5.
When considering BLEU-2, GPT-4 obtains the highest score of 0.598, followed by flan-alpaca with 0.543 and ChatGPT with 0.489. Out of 450 utterances, 107 cases are identified where all three models' BLEU-2 scores are higher than 0.5. We observe that these 107 utterances are relatively short, with an average length of 15.26 words, while the annotated cause spans have an average length of 7.02 words. The three models, GPT-4, ChatGPT, and flan-alpaca, extract similar length spans on average, which are 7.89, 7.79, and 7.95 words, respectively. For example, in the following utterance, "I'm not sure how to fix this, nor if this is acceptable in this test case. Namespaces in TS are magic to me [emoji]", the annotated cause of Amusement (3rd level Joy) is "Namespaces in TS are magic to me". GPT-4 also extracted the same span as the cause. However, it is not always the case that the annotated cause span completely overlaps with the spans extracted by the models.
For example, in this utterance, "Oh, you didn't add composes and values. Well, I like it even more. Those features are hard to maintain.", the annotated cause span is "I like it even more", and the extracted cause span by GPT-4 is "Well, I like it even more."
Out of 450 utterances, we observe that in 41 cases, all three models' BLEU scores are less than 0.30. These comments are relatively long, containing an average of 44.17 words, while the annotated cause spans contain an average of 5.05 words. The average lengths of the spans extracted by GPT-4, ChatGPT, and flan-alpaca are 10.10, 13.14, and 22.83 words, respectively.

Error Analysis.
To gain insight into the models' mistakes, we analyze the 41 utterances where all three models had a BLEU score of less than 0.30. Our examination reveals that the errors fall into a few primary categories, which are elaborated below.
Incorrect Emotion. The main source of error for all three models is the misidentification of the emotion expressed in the utterance (24/41 utterances). This misidentification leads to the detection of an incorrect cause event. For instance, consider the utterance "Oh right! [emoji] This started as a Mac issue, I forgot to add the rest." The annotated emotion for this utterance is Neglect (2nd level Sadness) and the annotated cause span is "I forgot to add the rest." However, ChatGPT identifies the emotion as Confusion (2nd level Surprise) and extracts "[emoji]" as the cause event instead. GPT-4 detects Amusement in the utterance and extracts the cause span "Oh right! [emoji]." Meanwhile, flan-alpaca identifies Curiosity (2nd level Surprise) and extracts the cause span "Oh right!" This error category emphasizes the importance of accurately detecting the emotion expressed in the text before extracting emotion causes.
Incorrect Cause. This error occurs when the models correctly classify the emotion but detect a different cause than the ground truth (12/41 utterances). For example, in the utterance "[USER] yep, it is bug, we will fix it, so we have it in 'experiments' :+1:", the annotated emotion is Approval, and the annotated cause span is "it is a bug", while GPT-4 detected the cause span "we will fix it". This error category highlights the difficulty of identifying the exact cause of events in conversational text, especially in longer, multi-part comments.
Hallucinations. In addition to the two error categories described above, we also observe instances of hallucinations in the cause event extraction process. In some cases, the models output "the entire sentence.", "the span: <followed by the span>", "span starting from word X to word Y", and other nonsensical text. We observe that ChatGPT produces more hallucinated output than the other two models, which is one reason why its BLEU score is lower. This highlights the need for continued research into developing more accurate and reliable models that can follow the prompt exactly.
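One simple way to flag outputs like those listed above is to check whether the model's answer occurs verbatim in the utterance. The sketch below is an illustrative heuristic, not a step described in the study; the helper names `normalize` and `is_verbatim_span` and the normalization rules are assumptions.

```python
import re


def normalize(text):
    # Collapse whitespace, ignore case, and drop surrounding quotes and full stops.
    return re.sub(r"\s+", " ", text).strip().strip("\"'.").lower()


def is_verbatim_span(utterance, model_output):
    """Return True when the model output is a contiguous span of the utterance."""
    span = normalize(model_output)
    return bool(span) and span in normalize(utterance)


utterance = "[USER] yep, it is bug, we will fix it, so we have it in 'experiments' :+1:"
outputs = ["we will fix it", "the entire sentence.", "span starting from word X to word Y"]
flags = [is_verbatim_span(utterance, output) for output in outputs]
```

Outputs that fail this check are not necessarily wrong (a paraphrased cause also fails), so the check is better suited to triage before manual review than to automatic rejection.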

INVESTIGATING THE CAUSES OF FRUSTRATION IN THE TENSORFLOW REPOSITORY: A CASE STUDY
Frustration is a pervasive emotion in software development [20].

Data Collection and Cause Extraction
To conduct our analysis, we collect all publicly available issue and pull request comments made on the TensorFlow repository, hosted on GitHub, between March 30, 2022, and March 30, 2023. We choose this time period to ensure that our analysis covers a recent and substantial range of comments. Most GitHub repositories, including TensorFlow, distinguish types of comment authors based on their relationship to the project: Contributors, Collaborators, Members, and None. A Collaborator is a GitHub user invited to work on the repository, a Contributor has committed code before, a Member belongs to the owning organization, and None has no affiliation with the repository. Collaborators, Contributors, and Members are active developers, while None comprises user commenters. To analyze software developer Frustration, we exclude comments from the None category.
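The filtering step can be sketched as follows. This is an illustration, not the collection script used in the study. It assumes each comment is a dictionary shaped like a GitHub REST API comment object, with `author_association` and `created_at` fields, and it treats both window boundaries as midnight UTC; the constant and function names are hypothetical.

```python
from datetime import datetime, timezone

# Author categories treated as active developers in the text.
DEVELOPER_ASSOCIATIONS = {"COLLABORATOR", "CONTRIBUTOR", "MEMBER"}

# Assumption: the study window is bounded at midnight UTC on both dates.
START = datetime(2022, 3, 30, tzinfo=timezone.utc)
END = datetime(2023, 3, 30, tzinfo=timezone.utc)


def parse_timestamp(value):
    # GitHub timestamps look like "2022-05-01T12:00:00Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def filter_developer_comments(comments):
    """Keep comments inside the window whose author is affiliated with the project."""
    kept = []
    for comment in comments:
        created = parse_timestamp(comment["created_at"])
        if START <= created <= END and comment["author_association"] in DEVELOPER_ASSOCIATIONS:
            kept.append(comment)
    return kept
```

Comments with the `NONE` association are dropped, which mirrors the exclusion of user commenters described above.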
Following the emotion-cause extraction procedure described in Section 3, we extract the emotions and causes of each comment. We use the flan-alpaca model for this purpose, as it performed reasonably well in both the emotion detection and emotion-cause extraction tasks compared to the proprietary zero-shot LLMs. Another advantage of flan-alpaca is that it is open-source and its weights are publicly available, which supports the reproducibility of our results. In contrast, closed-source LLMs may become unavailable, e.g., OpenAI's Codex LLM was deprecated in March 2023.
We collect only the utterances that the model identified as expressing Frustration, resulting in a dataset of 1275 comments.

Clustering
To identify common themes among the causes of Frustration, we employ the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm [59]. It has been effectively used in previous software engineering studies involving clustering textual data [60,61]. The main advantage of the DBSCAN algorithm is that it does not require a pre-specified number of clusters, which can be difficult to estimate in advance. This is particularly useful when identifying common themes among the causes of Frustration, as it is difficult to know beforehand what the common themes are. Another advantage of the DBSCAN algorithm is its ability to automatically handle noise and outliers, which is relevant because the causes extracted by flan-alpaca can contain errors, as discussed in the previous sections.
While DBSCAN does not require specifying the number of clusters, it requires two key parameters [62]: 1) ε, a positive real value giving the maximum distance between two samples for them to be considered part of the same dense region, and 2) MinPts, a small positive integer giving the minimum number of samples required to consider a dense region a cluster. We performed a manual parameter sweep, testing ε values from 0.1 to 0.8 in increments of 0.05 and MinPts values from 2 to 6, following standard guidelines for parameter tuning in machine learning and data mining [63]. Based on the number of clusters, the average number of elements per cluster, and cluster composition, we selected ε = 0.3 and MinPts = 4, which yielded 23 clusters. Before applying the DBSCAN algorithm, we perform standard text pre-processing on the list of causes, such as removing punctuation and URLs and lemmatizing. We use the scikit-learn library's implementation of the DBSCAN algorithm with cosine similarity and sentence-level embeddings (all-mpnet-base-v2 model [64]).
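To make the roles of the two parameters concrete, the sketch below implements a small DBSCAN over cosine distance in plain Python and runs the same grid of parameter values. It is an illustration only: the study uses scikit-learn's implementation, for which the analogous call would presumably be `DBSCAN(eps=0.3, min_samples=4, metric="cosine")`. The toy 2-D vectors stand in for sentence embeddings, and the names `dbscan`, `sweep`, `eps`, and `min_pts` are assumptions.

```python
import math
from itertools import product


def cosine_distance(u, v):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # eps-neighbourhood of every point (a point is its own neighbour).
    neighbours = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbours[j])
    return labels


def sweep(points):
    """Map (eps, min_pts) to (number of clusters, number of noise points)."""
    eps_values = [k / 20 for k in range(2, 17)]  # 0.10, 0.15, ..., 0.80
    results = {}
    for eps, min_pts in product(eps_values, range(2, 7)):
        labels = dbscan(points, eps, min_pts)
        results[(eps, min_pts)] = (len(set(labels) - {-1}), labels.count(-1))
    return results


# Toy stand-ins for embeddings: two tight groups and one outlier.
toy_vectors = [
    (1.0, 0.0), (1.0, 0.05), (1.0, 0.1), (1.0, -0.05),
    (0.0, 1.0), (0.05, 1.0), (0.1, 1.0), (-0.05, 1.0),
    (-1.0, -1.0),
]
labels = dbscan(toy_vectors, eps=0.3, min_pts=4)
```

On the toy data, the two groups become clusters and the outlier is labelled as noise, which is the behaviour the text relies on to absorb erroneous extracted causes. The sweep only reports cluster and noise counts; the selection described above also considered cluster composition, which requires manual inspection.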
To focus our analysis on the most common causes of Frustration, we limit our discussion to the top 6 clusters in terms of the number of comments in each cluster.The clusters are presented in Table 5, along with their description, size, and examples.We read the GitHub comments and the emotion causes to identify the underlying theme in each cluster that leads to Frustration.
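Selecting the largest clusters from a list of cluster labels can be sketched as below. The sketch assumes one integer label per extracted cause, with -1 marking noise as in scikit-learn's DBSCAN output; the function name `top_clusters` is hypothetical.

```python
from collections import Counter


def top_clusters(labels, k=6):
    """Return the k largest clusters as (label, size) pairs, ignoring noise (-1)."""
    sizes = Counter(label for label in labels if label != -1)
    return sizes.most_common(k)


example_labels = [0, 0, 1, -1, 1, 1, 2, -1]
largest = top_clusters(example_labels, k=2)
```

Ranking by size only identifies which clusters to read; naming each cluster's theme still depends on the manual analysis of the comments.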

Causes of Frustration
We utilized thematic analysis to identify the themes of the clusters [65]. Specifically, one of the authors of this paper read each comment and coded the initial themes. Another author then reviewed the themes, and both authors discussed and resolved discrepancies and finalized the themes until the analysis reached saturation, with no new themes emerging [66]. Each cluster theme is described below:
TensorFlow Version and Dependency Issues: This cluster primarily includes project participants struggling with incompatibility issues due to version mismatches between TensorFlow and its related dependencies. They express Frustration over difficulties in configuring TensorFlow to operate correctly on their systems and over transitioning from legacy versions to newer versions. One possible way to address these issues is to provide more comprehensive documentation on version compatibility between TensorFlow and its dependencies.
Failing Tests: This cluster highlights the Frustration felt due to failing tests, possibly flaky tests [68]. The project participants report two main sources of Frustration: first, the inability to identify the root cause of test failures that seem unrelated to their code changes; second, unexpected test failures leading to their PRs being reverted.
Too Fine-Grained Commits: This cluster reflects developers' Frustration with commits that capture incomplete changes or partial progress on a task and that need to be squashed.

RELATED WORK
The related work can be divided into three parts: prompt engineering for zero-shot LLMs, automated emotion-cause extraction in NLP, and the role of emotions in software engineering.
Prompt Engineering for Zero-Shot LLMs. Zero-shot learning, a task where a model is trained to recognize and classify unseen classes without any explicit training data for those classes, has been a recent focus among researchers and practitioners for a variety of tasks, including image and text classification, question answering, language generation, and data augmentation [50,71-73]. Recently, researchers have focused on leveraging LLMs for zero-shot learning [74-77]. In the context of zero-shot learning, prompt engineering with LLMs has emerged as an area of interest in recent years [18,75,76,78]. One approach that has been explored is the use of task-specific prompts, which are designed to elicit the desired response from the model. These prompts can be constructed manually or generated automatically and can be tailored to the specific task at hand [18]. For example, Brown et al. used an LLM to perform text classification using task-specific prompts [79].
Another approach is the use of general-purpose prompts, which are designed to be broadly applicable across a range of tasks [14,76]. The recent advancements in language models such as ChatGPT [13], GPT-4 [14], BARD [80], LLaMA [38], and Alpaca [37] have made the general-purpose prompt approach increasingly popular. These models have achieved impressive performance across a range of tasks and continue to push the boundaries of NLP.
To the best of our knowledge, there has been no research on the automated detection of emotion-causes in software engineering. To fill this gap, in this study, we examine the efficacy of existing state-of-the-art large language models in automatically extracting emotion-causes. We also perform a case study to demonstrate how these models can be applied in real-world scenarios.

THREATS TO VALIDITY
In this section, we discuss the potential threats to the validity of our study grouped into three categories: construct validity, internal validity, and external validity.
Construct validity. Construct validity is the extent to which our study accurately measures the concepts and constructs it aims to measure. One potential threat to construct validity is the use of automated zero-shot LLMs to extract emotion causes from domain-specific comments. These models are designed to perform general-purpose tasks and are not fine-tuned to extract emotion causes in software engineering communication text. To address this threat, we perform multiple error analyses to understand where these models make mistakes. Additionally, there could be a threat in the construction of the prompts. To mitigate this threat, we followed existing literature and validated various versions of the prompt with labeled data in order to find a suitable prompt for the zero-shot LLMs. Another threat to construct validity comes from our manual labeling of the causes, which may introduce some subjectivity and bias, potentially impacting the accuracy of the reported results. We reduced this threat by using multiple annotators and by resolving discrepancies to achieve 100% agreement.
Internal validity. The concept of internal validity relates to the degree to which the manipulation of an independent variable is responsible for the outcomes of a study. In our examination of an open-source project, Frustration causes represent an independent variable. However, there are potential threats to internal validity, such as unaccounted factors like prior experience with the project or technical expertise that could contribute to software developers' Frustration. Moreover, the use of flan-alpaca for extracting Frustration causes could result in the misclassification of some utterances, leading to the potential omission of certain clusters that could provide alternative explanations for Frustration, or to the identification of some clusters that do not in fact represent this emotion. Nevertheless, the use of DBSCAN reduces the effect of random noise, and the list of Frustration causes provided in Table 5 is consistent with the software engineering literature on common problems developers face during open-source software development [101-103].
External validity. External validity pertains to the generalization of our study's findings to other settings and contexts. For emotion detection, we used the categories from extended Shaver's taxonomy as well as GoEmotions' taxonomy from previous research [21,23,24]. However, our findings may not necessarily be transferable to other emotion categories. Another potential threat to external validity is the specific nature of the open-source project we studied, i.e., TensorFlow. The project's characteristics, such as its size, development stage, and community culture, may not be representative of other open-source projects. Additionally, the programming language and technology stack used in the project may have influenced the types of causes of Frustration observed. Therefore, it is important to interpret our findings in the context of the specific project we studied and exercise caution when generalizing them to other open-source projects. Further investigation is needed to generalize these results beyond the three specific models and the data and projects we have used in our study.

CONCLUSIONS
In this paper, we presented an approach for automated emotion-cause extraction in software developer communication using three zero-shot LLMs, namely ChatGPT, GPT-4, and (the open-source) flan-alpaca, through a prompting approach. We first conducted a preliminary study to evaluate the models' performance in emotion classification tasks on an existing recent dataset, and we found that they perform well compared to state-of-the-art models. We then showed the feasibility of using these models for emotion-cause extraction on a subset of 450 utterances from the same dataset by manually annotating the emotion causes of these utterances and automatically extracting the causes using prompts. We compared the BLEU scores of the models and found that GPT-4 achieved the highest BLEU-2 score of 0.598, followed by flan-alpaca with 0.543 and ChatGPT with 0.489. To demonstrate the possible real-world applications of emotion-cause extraction, we conducted a case study on the causes of Frustration in a large GitHub open-source project, TensorFlow.
There are several avenues for future work. First, our case study focused on only one emotion and one open-source project. Future studies that use emotion-cause extraction should investigate other emotions and a broader range of projects to generalize our findings. Second, further work is needed to improve the accuracy of emotion-cause extraction from text in software engineering communication. This could involve few-shot prompting, fine-tuning language models, or developing domain-specific models tailored for software engineering communication. Overall, our study provides a starting point for future research to explore the potential of emotion-cause extraction in software engineering communication.

3.4.3 Discussion. The BLEU scores for the three models using unigram, bigram, trigram, and four-gram are shown in Table 4.
Imran et al. curated a diverse collection of 2000 data points sourced from GitHub issue and pull request comments [21]. The dataset is manually annotated with six distinct emotion classes: Anger, Love, Fear, Joy, Sadness, and Surprise. Among the comments, 17% convey Anger, 11% Love, 9.90% Fear, 21.10% Joy, 13.70% Sadness, and 16.40% Surprise. The remaining comments are not associated with any emotion.
Stack Overflow Dataset. Novielli et al. annotated a rich multilabel dataset comprising 4800 Stack Overflow questions, answers, and comments [22]. Within this dataset, 18.1% of the samples are labeled with Anger, 25.4% with Love, 2.2% with Fear, 10.2% with Joy, 4.8% with Sadness, and 0.9% with Surprise. The remaining contents of the dataset are neutral.
JIRA Dataset. Ortu et al. annotated a comprehensive collection of 4000 comments extracted from JIRA, classifying them into four distinct emotional categories: Love, Joy, Anger, and Sadness (1000 comments each) [2]. Within each category, Love, Joy, Sadness, and Anger account for 16.6%, 12.4%, 32.4%, and 30.2% respectively, while the remaining comments are neutral.
EMTk is proposed by Calefato et al. [27]. EMTk uses unigram, bigram, emotion lexicon, politeness, and mood as features and SVM as the ML model. Similar to ESEM-E, it has been widely used in the SE community [21,22,26].
• SEntiMoji: This deep learning-based model is proposed by

Table 2: Micro-averaged F1-score of emotion classification models for three different datasets.

Table 3: Micro-averaged F1-score of emotion classification for different models using Imran et al.'s dataset. The zero-shot LLMs use the GoEmotions list of 27 emotions.
Frustration is particularly relevant in the context of open-source projects [54]. Prior work noted that Frustration is the most commonly felt emotion during software development [55]. Collaborative work, lack of control over external contributors' code, and the complexity of software development processes can all contribute to the Frustration of developers and end-users. In contrast to other emotions, such as Confusion or Excitement, Frustration is more strongly associated with obstacles, challenges, and difficulties. It is also often accompanied by other negative emotions, such as Anger, Disappointment, or Helplessness [56]. Given the complexity and collaborative nature of open-source software development, Frustration is undesirable but likely to be a common experience for many contributors and users. Therefore, understanding the causes of Frustration in open-source development can give project maintainers valuable insights into the key issues that impede collaboration and the productivity of project participants. TensorFlow is a popular open-source platform for developing machine learning models and has a large number of developers and a huge user base, which makes it an interesting case study for investigating the causes of Frustration in open-source software development. For instance, monitoring the causes of Frustration among TensorFlow contributors can aid in the construction of project maintainer dashboards that help attract and retain open-source contributors [57,58].

Table 5: Clusters of causes of Frustration in TensorFlow project participants in GitHub.
[USER] there was failed ci. Is there anything to do? (2) CI failure does not look related to these changes, seeing the same failure on #56345 [...] so I assume this is noise.
[...] Unfortunately this change needs to be rolled back, it seems it breaks JAX build under CUDA 11.4 and CuDNN 8.2 (2) [...] -Did you downgrade the CUDA to 11.2? Looking at Nvidia docs it looks like the display driver and cuda driver do not match [...]

Automated Emotion-Cause Extraction in NLP. Automatically extracting emotion-causes has gained attention in recent years in NLP [45-47,81-83]. Emotion-cause extraction is challenging, as both emotions and their causes can be expressed in various ways, including but not limited to explicit statements, implicit suggestions, and contextual cues. Several techniques have been proposed to address this challenge, including rule-based approaches, machine learning-based approaches, deep learning-based approaches, and LLM approaches [45,84-86]. In recent years, the focus has been on LLM approaches [45-47]. Researchers have explored this area with prompting as well [87]. Wang et al. noted that ChatGPT achieves comparable performance on the emotion-cause extraction task in news articles [88]. In this study, we apply prompt-based emotion-cause extraction with three state-of-the-art LLMs, namely ChatGPT, GPT-4, and flan-alpaca [13,14,39].
The Role of Emotions in Software Engineering. Emotions play a crucial role in software engineering, as software development is a complex and collaborative process that often involves multiple stakeholders with different perspectives and priorities [4,6,89-91]. Researchers have explored the role of emotions in software engineering through qualitative analyses, quantitative analyses, and surveys [4,6,25-27,55,92-96]. Gachechiladze et al. conducted a study on where Anger is directed, i.e., towards self, others, and objects [7]. Ford et al. conducted a survey with 256 software developers to identify common sources of Frustration [20]. Graziotin et al. investigated the causes of unhappiness among software developers, using a survey of 2,220 participants [94]. Later, Graziotin et al. conducted a study of the effects of unhappiness [95]. More recently, there has also been a focus on studying conflicts, toxicity, and incivility in open-source communities [91,97-100].