Is ChatGPT Capable of Crafting Gamification Strategies for Software Engineering Tasks?

Gamification has gained significant attention in the last decade for its potential to enhance engagement and motivation in various domains. During the last year, ChatGPT, a state-of-the-art large language model, has received even more attention, both in scientific research and in common use by individuals and companies. In this study, we investigate the possibility of adopting ChatGPT as a tool for designing gamification platforms in the Software Engineering domain. Leveraging the capabilities of ChatGPT, we assess how good it is at generating effective suggestions and ideas for designers and developers. To evaluate ChatGPT's potential as a gamification platform creator, we narrowed the context to one particular Software Engineering activity, asking for possible aspects of the activity to be gamified. Each proposed aspect was subsequently elaborated by ChatGPT, queried in both a shared and a separate context, first following the conversational nature of the model, then applying a validated design framework. The study assesses ChatGPT's ability to select and integrate game elements to build a thriving gamification environment by framing the design of the platform within a state-of-the-art conceptual framework. To evaluate the quality of the design choices made, we relied both on the Octalysis framework and on personal experience. The findings of the paper show that ChatGPT can only create simple playful experiences that are not very effective. However, by instructing the model with more specific desired mechanics and dynamics, it is possible to guide it toward the application of the ideas suggested. We argue that ChatGPT is not capable of building a gamified environment on its own, but it could still be used to build the foundation of a gamification platform, as long as the designers refine the advice obtained into a user-centered solution.


INTRODUCTION
During the last decade, a positive trend in the use of gamification techniques in several domains can be observed. Its wide adoption spans from marketing to education [19] [16], from healthcare to software testing [2] [10]. Gamification is a technique that can be defined as "the use of game design elements in non-game contexts" [6], and its main objective can be the improvement of users' efficiency or effectiveness at performing particular tasks, the promotion of a particular habit, or even an increase in engagement with the gamified activity.
Although gamification can bring several positive aspects, this technique is not exempt from flaws and drawbacks due to its high dependence on the human factor [18] [10]. Developing a successful gamified environment is a complex task, which largely depends on both the type of activity to be gamified and the features of the involved users.
Since the release of ChatGPT in November 2022, the usage of this kind of Large Language Model (LLM) has spread from IT experts to the general public, reaching one million users in only 5 days [8], mostly thanks to the textual interface that enables users to interact with the Artificial Intelligence directly in Natural Language (NL).
Given ChatGPT's ability to quickly provide implementation solutions that fit into a context that is made known to it, the question arises as to whether ChatGPT is capable of dealing with a task as complex as the development of a gamification platform. In particular, the design phase is the most critical, being subject not only to structured methodologies but also to a creative component that cannot be attributed to machine learning algorithms by their definition.
The massive popularity gained by ChatGPT in recent months has led the research community to question the adoption of this tool in several application domains; other examples of its usage are available in the literature [13].
In this paper, we address this question by designing a gamified platform that handles one Software Engineering activity with the aid of ChatGPT. The remainder of the paper is organized as follows: in Section II we provide some background regarding gamification principles and large language models; Section III illustrates the methodology we followed to perform our investigation; Section IV provides a discussion of our findings; and finally, Section V concludes the paper with a final summary and some remarks for those who will approach its use in the future.

BACKGROUND
2.1 Gamification in Software Engineering
To approach the topic of gamification, it is necessary to introduce a rigorous vocabulary which, however, is not uniformly and consistently adopted in the literature. In this paper, we will stick to the definitions proposed in the MDE framework by Robson et al. [23], distinguishing between game mechanics, game dynamics, and emotions.
Delving into the definitions, game mechanics are the basic building blocks that form a game experience; they are defined by the designer and do not vary from player to player, i.e. for a particular game experience they are fixed once. Examples of mechanics are points, achievements, and leaderboards. Game dynamics, instead, are the desired behaviors that emerge during play: they depend on how players follow the mechanics set by the designer. Emotions are the mental states that players experience during the gamified experience, i.e. they are the product of how mechanics are followed and of the dynamics that have been generated. The task of a designer is to select and integrate the game mechanics that best suit the context to be gamified, to stimulate the dynamics that best trigger the desired emotions.
In the context of Software Engineering, a design framework for creating gamified experiences has been proposed by Dal Sasso et al. [5]. The framework is based on Activities that are to be gamified. Each activity is decomposed into three aspects: (1) Analysis, which presents the rationale and the emotional goals to be reached. (2) Implementation, which includes the actors involved in the activity, the dynamics they are subjected to, the possible hazards that may occur, and the game mechanics, which Dal Sasso et al. refer to as metas. In our paper we will refer to this aspect as game mechanics, adopting a stricter nomenclature. (3) Testing, which includes the target entities, i.e. metrics, the methodology that can be used to perform the testing, and lastly both the expected and the actual results.
Another framework, which allows characterizing the produced playful environment based on the selected game mechanics, the way they can be implemented, their effects, and their relations, is Octalysis, developed by Yu-kai Chou [4]. The framework identifies eight Core Drives, which are various dimensions of gamification mechanics that target different aspects of players' engagement (such as rewarding, social interaction, and ownership).
To emphasize the importance of balancing and integrating different categories, he divided them into two main groups. The first group, known as "Right Brain"/"Left Brain", distinguishes mechanics that are based on logical reasoning from those based on emotional aspects. The former are more commonly used due to their ease of implementation and quick feedback; however, overuse can lead to stagnation and drive away users. The latter are more delicate, as they must be tailored to suit the involved users in order to bond them with the environment.
The second group, "Black Hat"/"White Hat", focuses on the emotions the mechanics evoke in users. White Hat emotions are the positive feelings that the user wants to achieve while playing. The negative outcomes that users should avoid in the environment are instead referred to as Black Hat. Some mechanics may yield quick results but have negative long-term effects, while others provide more balanced and positive sensations in the long run.
Gamification techniques have been applied in several activities of Software Engineering, spanning from Requirements Engineering to Software Testing [24] [11], both for conducting and for teaching the target activity [9] [3]. The degree of maturity in this research area has grown to the point where a tertiary study was published in 2020 [12], with other newer secondary studies released in later years [20] [10] [17].

2.2 Large Language Models
LLMs are a family of Artificial Intelligence (AI) techniques that are trained on large amounts of textual data and have a huge number of parameters. LLMs are particularly suitable for decoding the structures and patterns of NL; some state-of-the-art examples of pre-trained language models are GPT (Generative Pre-trained Transformer) [21], T5 (Text-To-Text Transfer Transformer) [22], and BERT (Bidirectional Encoder Representations from Transformers) [7].
GPT 3.5 (the model which ChatGPT relies on) has shown a high capability for In-Context Learning (ICL), i.e. the model's ability to understand and provide customized answers based on an input context, instead of purely relying on the internal knowledge obtained during the pre-training phase [14]. Another technique used with LLMs is Chain of Thought (CoT) [26], which provides multiple demonstrations of the reasoning chain as examples within the prompt, guiding the model's reasoning process. The concept of self-consistency extends the CoT technique [1] [25]: a majority voting mechanism on the answers allows reaching higher reliability, keeping consistency throughout the conversation between the AI and the user.
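The majority voting step behind self-consistency can be illustrated with a minimal sketch. It assumes that several answers to the same prompt have already been sampled from the model; the function name `majority_answer` and the sample answers are hypothetical and are not part of our experiment.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent answer and the share of votes it received.

    `answers` is a list of strings, assumed to be final answers sampled
    independently from the model for the same prompt.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    # Normalize trivially different spellings before counting the votes.
    normalized = [answer.strip().lower() for answer in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Hypothetical sampled answers to "which mechanic fits this activity best?"
sampled = ["Leaderboard", "leaderboard ", "badges", "leaderboard"]
print(majority_answer(sampled))
```

The vote share gives a rough indication of how stable the answer is across samples; a low share suggests that the prompt should be made more specific.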

RESEARCH METHODOLOGY
Our research objective is to verify whether ChatGPT can be used as a means to suggest effective gamification strategies in the context of Software Engineering. To do so, we queried the model through the conversational interface using the following methodology.
To assess the viability of using ChatGPT as a design assistant, we started by randomly extracting one target Software Engineering activity: to this end, we asked ChatGPT itself to list all the known activities and to randomly select one of them. The choice fell on the Debugging and troubleshooting activity.
Once we had extracted the target process for which to design a gamified environment, we followed two distinct paths: first, we tried a boundless approach; second, we relied on a stricter methodological design framework. We define the boundless approach as a free question-and-answer approach in which the user explores the design aspects of the platform to build, making the ideas and their implementation emerge throughout the conversation. The adoption of this methodology was mainly driven by the conversational nature of ChatGPT, which allows the AI to bring out the details of the proposed solution when requested. We define the structured approach as an iterative question-and-answer approach in which the user requests aspects that adhere to particular frameworks with ad-hoc definitions. To this end, we selected the design framework provided by Dal Sasso et al. as a reference methodology to formalize the design choices [5]. To follow this latter approach, it was necessary to instruct the AI about the design framework, as it was not able to produce answers compliant with it by default.
For both approaches, all the questions asking for design solutions were asked twice: once using shared context queries, and once using separate context queries. We define a shared context as a query scenario in which all sub-questions are asked in one ChatGPT session, in one single conversation, relying upon the ICL and CoT features of ChatGPT. We define a separate context, instead, as a query scenario in which each subject to be gamified is asked about within a separate and independent conversation. The above definitions are adapted from Jalil et al. [15] to fit our application context.
The shared context allows having short prompts which are refined progressively by adding more details in subsequent inputs, constructing the information step by step. Conversely, the separate context requires the input to be more precise and verbose, enveloping the context, the methodology, and the request in one single input. Starting from several inputs used in a shared context, it is possible to construct the corresponding input to be used in the separate one, although the outputs are far from equivalent, given the stochastic nature of the model.
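The difference between the two querying scenarios can be sketched as follows. This is only an illustration under our own assumptions: the function names and the prompt texts are hypothetical, and the actual prompts used in the study are those in the published question-and-answer sequences.

```python
def shared_context_turns(context, methodology, requests):
    """Shared context: one conversation, with short prompts sent in sequence.

    The context and the methodology are stated once; each later request
    relies on what the conversation already contains.
    """
    return [context, methodology, *requests]


def separate_context_prompts(context, methodology, requests):
    """Separate context: one self-contained prompt per independent conversation.

    Every prompt envelops the context, the methodology, and the request.
    """
    return [f"{context}\n\n{methodology}\n\n{request}" for request in requests]


# Hypothetical inputs, for illustration only.
context = "A consulting company wants to gamify debugging and troubleshooting."
methodology = "Describe game mechanics, dynamics, and rules separately."
requests = ["Gamify bug reproduction.", "Gamify root-cause analysis."]

print(len(shared_context_turns(context, methodology, requests)))
print(separate_context_prompts(context, methodology, requests)[0])
```

Even when the separate prompts are built mechanically from the shared ones, as here, the answers are not expected to match, since the model is stochastic and the conversation history differs.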
We finally evaluated the outcomes in two ways. First, with the Octalysis framework, to characterize the proposed solution from a game design point of view, relying also on the evaluation score (which spans from 0 to 800) and on the comments that are produced when the corresponding octagon is shaped on the official website (footnote 1). These results are plotted in the Octalysis in Figure 1. Second, by assessing the adherence to the Dal Sasso and MDE frameworks, mainly relying on our experience in the field. A summary of this evaluation can be found in Table 1.
The original complete sequences of questions and answers are publicly available (footnote 2).
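To give an intuition of how a score in the 0 to 800 range can arise, the sketch below assumes that each of the eight Core Drives is rated from 0 to 10 and that the score is the sum of the squared ratings. This formula is our assumption, chosen because it is consistent with the stated range; it is not taken from the tool's documentation, and the ratings used are hypothetical.

```python
# Names of the eight Core Drives of the Octalysis framework.
CORE_DRIVES = (
    "Epic Meaning & Calling",
    "Development & Accomplishment",
    "Empowerment of Creativity & Feedback",
    "Ownership & Possession",
    "Social Influence & Relatedness",
    "Scarcity & Impatience",
    "Unpredictability & Curiosity",
    "Loss & Avoidance",
)


def octalysis_score(ratings):
    """Sum of squared ratings (assumed formula), with each rating in 0..10."""
    total = 0
    for drive in CORE_DRIVES:
        rating = ratings[drive]
        if not 0 <= rating <= 10:
            raise ValueError(f"rating for {drive!r} must be between 0 and 10")
        total += rating ** 2
    return total


# Hypothetical profiles: one balanced, one polarized on accomplishment.
balanced = {drive: 5 for drive in CORE_DRIVES}
polarized = {drive: 1 for drive in CORE_DRIVES}
polarized["Development & Accomplishment"] = 8

print(octalysis_score(balanced), octalysis_score(polarized))
```

Under this assumption, a design that relies on a single moderately rated drive scores far lower than a balanced one, which matches the intuition that a polarized but weak profile yields a shallow experience.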

RESULTS
In this section, we will describe separately what we obtained from the queries made, and then compare and analyze the results obtained.

4.1 Boundless Approach
4.1.1 Shared Context. The querying process started by asking the model how to create a gamified environment in the context of debugging and troubleshooting. The first responses suggested features mainly related to the training aspect of the context, with playful tools to learn how to debug and maintain the code. We then instructed the model to consider a real scenario in which the people involved use the gamified platform to perform their daily job, e.g. a consulting company. From this input, ChatGPT extracted a list of [...] We asked the model to elaborate on each item of the list, identifying the basic game mechanics to implement, the rules that govern them, and the features of the activity that had to be involved. At this point, the model started providing contradictory responses, suggesting both competitive and collaborative mechanics, and confusing features with game mechanics and vice versa. After we clarified the concepts of rules, game mechanics, dynamics, and technical features, the model slightly improved its accuracy, but partially lost the bigger picture given by the context described in the first questions, returning to proposals devoted to learning instead of conducting the activities.
1 https://yukaichou.com/octalysis-tool/
2 https://www.doi.org/10.6084/m9.figshare.23709747
Upon closer examination of the game mechanics, we noticed that the suggestions consistently pertain to scores, rankings, and achievements. Additionally, the game incorporates time-based competitions, rewards, and generic challenges. It is important to ensure that these elements are properly assigned and not confused with other artifacts. It is worth noting that a focus on collaborative activities was registered in the answers, even if no idea emerged about how to make such activities cooperative.
4.1.2 Separate Context. Querying ChatGPT using separate contexts requires more thorough questions than in the shared context. For this reason, we condensed into a single question the description of the application context and the subject to be gamified, taking each item from the list.
Although in most cases the model misclassified game mechanics, rules, and features, it generated some fascinating outputs. The rules it suggested often included cooperative elements that aimed to produce high-quality work, promote player progress, manage time effectively, and offer rewards for achievements.
After we provided an unambiguous description of what game mechanics, dynamics, rules, and technical features are, the model reacted by refining its classification of the proposed elements. However, as happened in the shared context, the output started becoming somewhat contradictory, e.g. showing an unjustified cooperative/competitive dichotomy.
We finally tried pointing out that the proposed elements for a gamification platform were just the basics and that more complex dynamics were needed. When we asked for suggestions to set up more complex dynamics, the model generated a list of items related to debugging and troubleshooting. Unfortunately, most of the items were either approaches for learning about the topic or elements that did not fall under the definition of game dynamics. When we repeated the same question while giving some examples of more complex dynamics (e.g. push your luck, betting, auctions), the model made suggestions that fit the provided examples, applying the correct definition of the specified dynamic but only with generic advice.

4.2 Structured Approach
4.2.1 Shared Context. Following this approach, we started from scratch, selecting the target sub-activities and features to gamify. Once the list was obtained, we needed to instruct the model with the context of the application and the methodological framework to follow. A first attempt was made by presenting both in a single input, but the resulting output lacked adherence to both the context and the methodological framework. We then asked the model to perform the design for each activity, proceeding "horizontally", i.e. considering one layer of the framework at a time for each activity.
Each time we queried the model on an activity in the list, we reported a summary of the output produced previously, to avoid responses with low internal consistency due to the possible information overload of considering several activities at the same time.
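The "horizontal" querying order described above can be sketched as a pair of nested loops. This is a simplified illustration, not the procedure as executed: in the study the queries were typed manually into the conversational interface, whereas here `ask` and `summarize` are hypothetical callables standing in for the model and for our manual summarization, and the prompt wording is invented.

```python
LAYERS = ("Analysis", "Implementation", "Testing")


def horizontal_design(activities, ask, summarize):
    """Query one framework layer at a time across all activities.

    `ask(prompt)` returns the model's answer as a string, and
    `summarize(previous_summary, answer)` returns the updated summary
    that is reported in the next prompt.
    """
    results = {activity: {} for activity in activities}
    summary = ""
    for layer in LAYERS:              # outer loop: one layer at a time
        for activity in activities:   # inner loop: every activity
            prompt = (
                f"Summary of the previous output: {summary or 'none'}\n"
                f"Design the {layer} layer for the activity: {activity}"
            )
            answer = ask(prompt)
            results[activity][layer] = answer
            summary = summarize(summary, answer)
    return results


# Stand-ins for the model and for the summarization step.
def echo_model(prompt):
    return f"answer to: {prompt.splitlines()[-1]}"


def keep_last(previous_summary, answer):
    return answer


designs = horizontal_design(["bug triage", "log analysis"], echo_model, keep_last)
print(designs["log analysis"]["Testing"])
```

Carrying only a summary forward, rather than the full previous answers, mirrors the intent of limiting the amount of information the model has to consider at each step.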
Querying ChatGPT in this way produced appropriate output for the Analysis layer, listing good rationales and emotional goals for most of the activities. The Implementation layer is the one lacking the most, as the precision of the model at correctly classifying game dynamics is still rather low, and the variety of game mechanics suggested is low as well. Nonetheless, the hazards found were actual risks worth raising, and the Testing layer proposed good metrics and methodology for verifying the implemented solutions. We consider the generated expected results too optimistic and peremptory, as ChatGPT only highlights positive results.

4.2.2 Separate Context. Using the separate context querying approach, we needed to summarize several questions in only one, as we could not rely on ICL. We started each conversation with one question aimed at instructing the model with the general application context, the sub-activity of Debugging to consider, and the general purpose of creating a gamification platform.
One first difference from the shared context is that the model always listed more stakeholders than were actually needed. This overestimation recurs for many of the questions asked. Additionally, in some cases, ChatGPT assigned the wrong roles to some stakeholders, failing to identify the actual target of the activity.
Considering the Analysis layer, we note that the output obtained was rather generic. In fact, the rationale was almost the same each time, mostly referring to the general platform instead of focusing on the specific task at hand. The same happened to the emotional goal, which was repeated as well, involving aspects of cooperation, excitement, and satisfaction.
The Implementation layer, instead, was more precise at correctly identifying game dynamics and mechanics, even though little or no clue was given on how to integrate all the proposed elements. On the other hand, the hazards were successfully identified by the model, which proposed plausible threats that may occur in the gamified environment.
Finally, considering the Testing layer, ChatGPT overestimated the number of metrics to measure, assessing dimensions related to other debugging sub-activities and ending up out of context in some cases. The methodology to be employed sometimes lacks feasibility (e.g. continuous monitoring) and adherence to the case study (e.g. surveys to collect feedback, which are difficult to run assuming continuous use of the platform). As in the shared context, we felt the expected results were too optimistic.
A summary of the obtained results can be seen in Table 1.

DISCUSSION
Upon examining the obtained output, one might at first glance think that ChatGPT is able to recognize the application context of gamification and apply playful elements to it. However, while the solutions proposed may describe ways to implement gamification platforms, their effectiveness has yet to be proven. Once we had created the Octalysis of the gameful environment proposed by each approach, we gathered the scores and the comments from the website. In all four cases, the evaluation was negative, highlighting the lack of identity in the design and classifying the combination as a weak experience that is unable to produce the desired outcomes.
As Figure 1 shows, the proposed game elements are mainly polarized in the accomplishment core drive. This trend in the model leads to an unwanted situation where the game elements create an unbalanced playful experience. The result of this is the rapid abandonment of the platform by users. Additionally, this polarization is not even strong enough to avoid a shallow experience: a situation where there is neither balance nor a strong identity in any of the four dimensions gets the designer nowhere.
Considering the adherence to the MDE and Dal Sasso frameworks, we verified that the model by default has very low accuracy in classifying the output. Even when the concepts are made explicit throughout the conversation, the model often makes classification mistakes. None of the applied paths can be considered clearly better. We would nevertheless advise anyone approaching the use of ChatGPT as a design process assistant to use the structured approach for more consistent results. We summarized in Table 1 the obtained output, mapping the results to Dal Sasso's framework.

| Layer | Boundless approach, shared context | Boundless approach, separate context | Structured approach, shared context | Structured approach, separate context |
|---|---|---|---|---|
| Analysis | The model suggested mainly educational aspects, providing no clues about the rationale. | The model identified a simple but sound logic, aiming at enhancing enjoyment and fostering collaboration. | The model set the emotional goals of empowerment, pride in the work done, and collaboration between people, with the rationale of enhancing users' ability and satisfaction. | The model recognized performance improvement as the rationale, with the emotional goal of promoting collaboration between stakeholders with a sense of satisfaction. |
| Implementation | Only the main actor was identified. The model had very low accuracy in proposing dynamics and mechanics, providing only the basic ones. Hazards were never mentioned. | Only the main actor was identified. A broader set of mechanics was suggested, mainly based on accomplishment, with an accurate description of how to implement them. Only simple and conflictual dynamics emerged, without any mention of potential risks. | The model correctly identified the actors and suggested a wide set of mechanics, mainly related to left-brain gamification elements; a special focus was put on collaboration. The model warned about potential common risks that may arise. | The model overestimated the actors, involving too many of them. A wide set of game mechanics was proposed, not deviating from the base set of left-brain gamification elements. The main dynamics proposed were collaboration, competition, and progression; potential drawbacks were correctly listed. |
| Testing | The model did not generate any information on how to evaluate the gamified environment produced. Only some generic potential benefits were highlighted as potential outcomes. | The model did not propose any method to evaluate the environment. Some potential benefits were generated as hypothetical outcomes of the use of the platform. | The model suggested proper metrics and methodology for evaluating the outcome of applying the playful environment, except for a few sporadic cases in which it suggested evaluation methods that are unfeasible given the way the platform is implemented. However, the expected results produced are too optimistic. | The model generated an overly broad set of metrics, including out-of-scope measurements. The methodology to employ was realistic, while the expected outcomes overestimated the benefits, assuming overly optimistic results. |

We contend that the proposed design solutions are far too naive to rely upon, as none of the querying paths was able to produce an environment that was both satisfactory and comprehensive with respect to the starting problem. We emphasize that creating a successful gamification environment is a complex process that involves not just the activity being targeted but also the people who will be using it.
It is expected that ChatGPT provides suggestions based on its training. Therefore, it is not surprising that the gamification elements recommended by ChatGPT match those most widespread in the literature. However, we know that the success of an implemented solution does not depend solely on individual elements but on how well they are integrated and adapted to the specific context.
On that basis, it is crucial to adopt a human-centered design approach based on the study of the personas, tailoring the platform to their needs instead of following a predefined structure, as ChatGPT does.
In conclusion, we believe that ChatGPT, as well as other AI tools, is currently unable to generate valid solutions all on its own. However, such tools can serve as a fascinating and efficient way to obtain feedback when assessing a proposed solution or identifying potential risks in utilizing a platform. They can also be used to improve designers' concepts, as long as the user has a clear understanding of their requirements and explicitly specifies them.

THREATS TO VALIDITY
In the adopted methodology, we followed what we called the boundless and structured approaches, which are ad-hoc procedures that we developed for the occasion. Given the novelty of ChatGPT, at present no comprehensive and clearly defined methodology is widely recognized in the research literature for researchers to adhere to. This methodology has not yet been validated, nor are the results themselves generalizable, as the output of ChatGPT is based on stochastic processes. Regarding the separate and shared contexts, we relied on a procedure previously used by Jalil et al. [15].
Although none of the four methodologies has demonstrated clear benefits or drawbacks, we argue that this validity threat was partially mitigated by performing the design task four times in semi-independent ways (a slight dependence was present due to the shared list of features to be developed, which was necessary to compare the obtained results).
Additionally, we used two different sets of tasks to be designed, one in the boundless and one in the structured approach: this could constitute a validity threat as, with different inputs, the reasoning of the model might have been affected. However, our aim was not to compare the results themselves but to evaluate their soundness. A viable way to mitigate this threat could be blending the outputs and retaining only what the designer considers useful.
Other threats include construct validity, as the whole experiment was carried out using the GPT 3.5 model, which at the time of writing is free for everyone. The GPT 4 model, which is recognized by OpenAI itself as its most advanced model, may have produced different and better results compared to GPT 3.5. Since GPT 4 comes with a fee for users, we opted for the free version to provide readers with an overview of the results that anyone can achieve simply by using ChatGPT.
Finally, we highlight an external validity threat regarding the generalizability of the results, which may not be applicable in situations where gamification is used for activities outside of Software Engineering. While we found that the shortcomings of ChatGPT stem from issues with game design application rather than from a lack of knowledge about the target domain, it is possible that the model could be more effective in other domains with appropriate training and development. However, this does not detract from the fact that the use of ChatGPT remains more akin to imitating pre-existing solutions than to developing new customized ones.

CONCLUSION
This paper explored the application of gamification in the context of software engineering, focusing on the design of gamified experiences using ChatGPT as a tool for generating solutions. For this study, we developed two different approaches: a boundless approach and an approach based on the design framework by Dal Sasso et al. The results and discussion shed light on the capabilities and limitations of ChatGPT in providing gamification solutions for the specific activity of debugging and troubleshooting.
The findings indicate that ChatGPT can suggest gamification elements, such as scores, rankings, and achievements, but often lacks coherence and understanding of the larger context in which these elements should be integrated. The suggested solutions are generally naive and not comprehensive enough to create a successful gamification environment, emphasizing the need for a human-centered design approach that takes into account the specific needs and goals of the target users.
We found that ChatGPT can be valuable for providing feedback on a completed solution and identifying potential risks, but it is not yet capable of generating fully valid solutions by itself without human supervision.
The research highlights the importance of human expertise and understanding in the design of gamification platforms. As AI continues to advance, it is essential to leverage its capabilities while also recognizing its limitations and integrating human judgment to create effective and meaningful gamified experiences in software engineering and beyond.

Figure 1: Comparison of Octalysis evaluation of the gamified environment proposed by each approach

Table 1: Summary of the findings of each applied method, mapped to the design framework used