Gamification in Test-Driven Development Practice

Developing and sustaining high-performance professional development practices is a persistent challenge in software engineering education. Test-driven development (TDD), a key professional practice, is strongly linked to such high-performance practices. To examine the effects of gamification (the use of game design elements in a non-game context) on motivating students to develop and sustain TDD practice, we conducted an experiment and analyzed the data using ordinary least squares (OLS) regression. The experiment showed that gamification motivates students to engage in high-performing TDD practice. More specifically, gamification changes individuals' TDD behavior and increases engagement in development activity, and the effect persists after gamification has ceased. Furthermore, the data support a positive association between gamification and the maintainability of the team codebase.


INTRODUCTION
Gamification, defined as the use of game design elements in non-gaming contexts [16], has gained increasing interest from the research community as a way to improve individual and group performance [29] [33]. In this sense, gamification does not imply "play" or "having fun" but means optimizing behaviors and activities by following a set of rules. Prior work [26] indicates that gamification can influence users' behavioral outcomes, increasing engagement and producing positive effects. Gamification has also been investigated in software engineering. Recent literature proposes that gamification affects software engineering practice in several ways, for example by changing user behavior, increasing engagement in software engineering activity, and improving software quality [15] [37].
Recent research has called for empirical evidence to support the effectiveness of gamification [37]. While current studies have focused on simple, low-order development activities such as code conventions [43], its application to professional development practice needs further exploration. Moreover, previous literature has mainly explored gamification as a way to enhance engagement in non-development activities such as requirements engineering [14], and research on its effects on development activities is limited. Additionally, data establishing a direct link between gamification and software quality are lacking [15]. Therefore, more empirical evidence is needed to understand the efficacy of gamification. Taken together, these issues raise the question, "Can gamification encourage students to engage in high-performing practice?" More specifically, "Can gamification change students' development activity, improve engagement in development activity, and positively impact code quality?" Before answering these questions, we need to identify the following: What development activity needs to be changed? How do we measure the engagement level in development activity? What type of code quality should be the focus?
Agile software development, such as eXtreme Programming (XP), has been identified as a high-performance practice [8]. Test-driven development (TDD), also known as test-first programming, is a core approach of the XP methodology [42] and has been a focus of research. Fucci et al. [21] state that TDD behaviors can lead to increased engagement levels in development activities. Additionally, a strong correlation between TDD behaviors and software maintainability has been reported [51]. Maintainability is one of the most important metrics for code quality and can reduce costs over the system's life cycle [12], so it is of paramount importance. However, high-performance development practices such as TDD are challenging for students to develop and maintain [24].
The purpose of this study was to investigate the impact of gamification on the establishment and maintenance of high-performing development practices (TDD) among students. Specifically, the study aimed to employ gamification strategies to promote TDD behaviors, increase student engagement in development activities, and enhance code maintainability. The hypothesis underlying this research was that gamification would prove an effective means of promoting TDD, given the empirical evidence supporting its effectiveness:
H0: Gamification motivates students to develop high-performing professional practice.
Two sub-hypotheses follow from it:
H1: Gamification develops students' TDD behaviors and increases their engagement in development activity.
H2: Gamification is positively correlated with software maintainability.
The present study was carried out with two development teams, consisting of 6 and 20 students respectively, enrolled in the same software development module. The teams were tasked with developing a complex Android application, and the experimental period was 45 days. To assess the efficacy of gamification and gain a deeper understanding of its impact on the development process, we conducted an empirical analysis using graphical representations and ordinary least squares (OLS) regression.
This study makes several contributions to the existing literature:
- It proposes the integration of gamification into professional software development practice (TDD) by constructing suitable gamification rules aimed at fostering student adoption and retention of high-performing TDD practices.
- It provides empirical evidence that the gamification approach is capable of transforming and sustaining student TDD behaviors, increasing engagement in development activities, and positively impacting software maintainability.
- It aims to impart valuable insights to universities and industry stakeholders, demonstrating the potential positive impact of gamification on software development practices.
- The experimental design employed in this study is replicable and can be generalized to wider contexts, making it a valuable contribution to the literature.
This paper proceeds as follows: Section 2 documents the related work, and Section 3 introduces our experimental design. Section 4 reports the method, and the results are presented in Section 5. Validity is discussed in Section 6, and the paper concludes with a discussion in Section 7.

RELATED WORK

Gamification
In recent years, gamification has garnered significant attention and has been widely adopted across various domains including business [25], medicine [41], education [35], energy, and government services [1]. For example, gamification has been utilized in the workplace to enhance employee productivity and has been incorporated into digital marketing strategies to increase customer engagement [16]. Additionally, gamification has been increasingly adopted in the field of software engineering education [5].
The concept of gamification in computer science was raised at CHI 2011, where it was defined as using game design elements in a non-game context for a non-playful objective [16]. To encourage competitiveness among users and thereby achieve its goals, gamification creates a series of rules built from design elements that have been shown to succeed in games. Gamification design includes selecting appropriate gamification strategies and setting structured rules, embedded in a real-world situation, to promote engagement in the desired behaviors.
In recent years, the potential advantages of gamification in software engineering education have been discussed. The use of game elements in the software development process has been argued to be beneficial [17], and incorporating gamification has been suggested to make software engineering tasks more entertaining [23]. Recent literature has further indicated that gamification can be used to engage students in teaching activities, enhance learning outcomes, and foster collaboration [37] [5]. More specifically, gamification principles have also been applied to software testing, particularly to enhance engagement during the test creation phase [22]. These outcomes are of particular relevance to software engineering education.
The positive outcomes observed may be attributed to changes in behaviors and engagement. For instance, gamification can encourage users to adopt appropriate software development methods and to participate in collaborative efforts, both of which can bolster software quality [15]. Although gamification is unlikely to affect software quality directly, it could influence mediators such as Test-Driven Development (TDD) behaviors and the engagement level in development activities. Thus, to be effective in a software engineering context, gamification rules and elements should be designed to incentivize behaviors and engagement that improve software quality. However, there is a lack of research on the effects of gamification on development activities such as TDD and on students' engagement in development activities. Furthermore, empirical evidence is scarce, and quantitative studies in this area are few.

TDD Behaviors
TDD is an agile development practice that gained popularity after being described as a key component of XP. TDD was introduced as a software development practice in the early 1960s during NASA's Mercury project [7]. Nowadays, it has become one of the most widely used agile practices in the software development industry [8]. Beck defined TDD as writing new product or source code only when automated unit tests are incrementally written [6].
Previous research has indicated that employing Test-Driven Development (TDD) is beneficial for improving software quality [8]. For instance, adopting TDD has been reported to result in greater defect detection than the traditional approach, particularly for experienced and skilled developers [55], and Beck [6] asserts that TDD can reduce code defects by 40%. Previous studies have also suggested that adhering to TDD principles can enhance users' engagement in the development process [48].
According to [21], the positive effect of Test-Driven Development (TDD) on software quality may be attributed to developers' efforts to increase the number of TDD cycles performed in a given time. TDD behaviors, as proposed by Beck [6], promote coding in rapid, small iterations in which a unit is divided into the smallest testable software components. High-performing TDD therefore includes fast iterative development, and students are encouraged to adopt TDD behavior of this kind.

Engagement in Development Activity
In recent years, engagement in the development activity, including the coding and testing phases, has been shown to improve code quality [48]. In the educational software engineering space, engagement has been examined from three perspectives: behavioral, cognitive, and emotional engagement [57]. Of these, behavioral engagement has been used most widely because of its simplicity, as behavioral patterns can be easily defined, observed, and interpreted [34].
Behavioral engagement with school is recognized as an essential factor for realizing positive academic outcomes [45]. Recently, researchers have started to quantify engagement from a behavioral perspective [50]. For instance, engagement scores have been evaluated based on the number of activities taking place during class [39]. Other metrics used to measure engagement include the duration of students' stay in school [32] and the total mouse movement distance [34]. Engagement in open-source GitHub projects has been measured by assessing the interval between commit submissions [46].

Maintainability
Maintainability has been identified as a metric for evaluating the efficacy of Test-Driven Development (TDD) practice, as high-performing TDD can help reduce the number of introduced defects and subsequently boost the software's maintainability [55]. Maintainability, which "must be specified, reviewed, and managed during the software development processes to reduce maintenance costs," has grown in importance and demand over the past few years [18]. Maintainability is one of the eight product quality characteristics defined in the ISO/IEC 25010:2011 standard [18]. With the increasing demand for maintainability, researchers have explored various approaches to enhance it, such as new models, tools, design patterns, and initiatives to raise developer motivation [59].

EXPERIMENT SETUP
The objective of this experiment is to examine whether the gamification method helps students to establish and uphold professional software engineering practices, such as adhering to TDD behaviors and improving engagement levels, which in turn benefit maintainability. The experiment is conducted in the Software Design and Implementation module of an Irish university; the participants are students in the final semester of their Bachelor's degree, who possess basic programming skills and are ready to enter the job market. Hence, the results and observations of this experiment can be assumed to be representative of both academic settings and entry-level engineers in industry.
The module's assignment requires the teams to produce a complex application for the Android platform. Our trial period is 45 days. Prior to the experiment, we spend two weeks teaching the students the fundamentals of unit testing and its application via an iterative development approach. They are also instructed in test-first dynamic development (writing the unit test first and then the code) in TDD style, so they have fundamental knowledge of GitHub and TDD, although they are not expected to be experts. The students selected Java as their main programming language for developing the application.
Participation in the experiment is voluntary, with no grading rewards associated, thus allowing students the autonomy to choose whether or not to participate. This is in accordance with the recommendation of [11] that, when gamification is used in educational settings, students should be given the option to volunteer.
The experiment was conducted across two development teams. The treatment group comprised six third-year undergraduate students from the 2020-2021 academic year, and the control group comprised twenty third-year undergraduate students from the 2021-2022 academic year. Both groups took the same module, with the same content and the same teaching form (online), in different academic years. The number of participants in the treatment group was limited because students had to be entirely voluntary and show no hesitation in receiving the gamification treatment. To better demonstrate the effectiveness of gamification, a control group was also established, which did not receive any gamification treatment during the experiment period. The details of the treatment are reported in Section 4.
The experiment design follows the Pretest-Posttest Control Group Design. On the first day, students in the treatment group were introduced to the gamification strategies and how to 'play'. To compare performance with and without gamification, the experiment was divided into two stages. From day 1 to day 22, the treatment group (O1) did not receive any gamification treatment, whereas from day 23 to day 45, the treatment group (O2) received a gamification treatment every 7 days, on days 23, 30, and 37. The gamification treatment consisted of scores, leaderboard ranking, and feedback.
Midway through the experiment, the gamification intervention is introduced so that shifts in the students' performance can be monitored before and after the intervention. This design gives a clearer picture of how gamification affects their performance, and using the same group of students throughout is expected to yield more convincing results.
A more detailed description of the gamification methods can be found in Section 4. The students in the control group did not receive any information regarding gamification for the duration of the experiment, as represented by O3 and O4 in Table 1. Meanwhile, all students in both groups were given a clear understanding of the task and its deadline. Data were collected one week prior to the deadline to eliminate any confounding effects [13].

METHOD
This study examines whether gamification motivates students to develop and maintain high-performing professional practice (H0).
More precisely, we use statistical and graphical analysis to examine whether gamification leads students to follow TDD behaviors, increases their engagement in development activity, and has a positive impact on maintainability. The sample of 250 observations (the data used in the regression models) is collected from Git repositories. To assess H1, that gamification changes TDD behaviors and improves engagement levels, an indicator variable for the gamification treatment (gamification) serves as the independent variable, with TDD behaviors (Cycle) and engagement levels (NC and FEQ) as the dependent variables. Similarly, to assess H2, that gamification is positively correlated with maintainability, we use gamification as the independent variable and maintainability (MI and CC) as the dependent variables. We collect digital footprints (commits) from GitHub repositories and use the data extracted from the codebase to compute maintainability at different points in the repository's timeline.

Measurement of TDD Behaviors
The nature of TDD is test-first dynamic development, in which test cases are written before the function code. In addition, ideal TDD behaviors include reducing the time of each development cycle, that is, generating a greater number of cycles in the project [21], which in turn improves code quality [55]. Because the participants face the same task and the same time limit, a higher number of development cycles represents a shorter cycle duration. Thus, we proxy good TDD behaviors by whether the test cases are written first and by the number of development cycles (Cycle). We define a development cycle as the process of creating a new function and passing all existing tests, and we identify cycles by manually analyzing the commits and code changes. A test compilation error renders a cycle incomplete, so a cycle is considered valid only when the tests compile successfully. Commit granularity generally provides a sufficient basis for observing the correct order. Where the commit granularity is insufficient, we manually review the code to determine the number of development cycles. If the students write the test cases before the production code, the cycle is classified as test-first.
To track behavioral changes, we assign points to various TDD behaviors on a daily basis. One development cycle is counted as one point. If the development cycle follows the test-first procedure, this behavior is counted as two points.
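The daily behavior score described above can be sketched as follows. This is a minimal illustration, not the study's tooling: the `Cycle` record and the function name are hypothetical, and it assumes a test-first cycle earns two points in total rather than two on top of the base point.

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    """One completed development cycle (hypothetical record type)."""
    day: int          # experiment day on which the cycle was completed
    test_first: bool  # True if the test was written before the production code


def daily_behavior_score(cycles):
    """Sum points per day: 1 per cycle, 2 if the cycle was test-first.

    Assumption: a test-first cycle is worth 2 points in total.
    """
    scores = {}
    for c in cycles:
        scores[c.day] = scores.get(c.day, 0) + (2 if c.test_first else 1)
    return scores


# Illustrative data only, not taken from the experiment.
cycles = [Cycle(24, True), Cycle(24, False), Cycle(25, True)]
print(daily_behavior_score(cycles))  # {24: 3, 25: 2}
```

Keeping the score keyed by day matches the daily basis on which the behaviors are tracked.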
Despite being an essential aspect of TDD, refactoring is not addressed in this study due to the students' limited exposure to small-scale projects, which restricts their opportunities for engaging in refactoring activities.Additionally, the complexity of learning refactoring techniques poses a challenge for students, as it is a skill that is better suited for experienced developers.Consequently, this study does not consider motivating refactoring.

Measurement of Engagement Level in Development Activities
The existing literature assesses student engagement from various perspectives, including in-class activities, length of school attendance, total mouse cursor movement distance, and the time interval between commit submissions [39] [32] [34] [46].
To gauge the level of engagement in development activities, this study employs the number of commits (NC) as a metric. Commits are critical in understanding software development, as they represent snapshots of the entire Git repository at specified moments [56] [57]. For instance, the number of commits is used to reflect the continuous integration process [4], maintenance activities [36], and bugginess [19]. A higher number of commits suggests more development activity [48] and a higher engagement level. To eliminate the impact of invalid commits, the study excludes duplicated changes, comments in the code, and commits that do not affect the code.
The study also incorporates the frequency of commit updates (FEQ) as an additional proxy. The FEQ equation is defined below, and the period is calculated based on working days (excluding weekends and public holidays) from the start to the end of the specified period. As GitHub is updated whenever students update the code, a high FEQ value suggests frequent updates to the code functions and heightened engagement in development activity.
In addition, FEQ enables a deeper insight into the influence of gamification on the overall work dynamics of students. Given the varying degrees of development intensity, monitoring the frequency offers a valuable perspective on how gamification affects the students' collective work patterns.
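One plausible reading of FEQ is sketched below. It assumes FEQ is the number of valid commits divided by the number of working days in the period; that ratio is an assumption drawn from the description of the period, not a formula taken from the study. The function names, dates, and holiday set are hypothetical.

```python
from datetime import date, timedelta


def working_days(start, end, holidays=frozenset()):
    """Count weekdays from start to end (inclusive), skipping listed holidays."""
    count = 0
    day = start
    while day <= end:
        if day.weekday() < 5 and day not in holidays:  # Monday=0 ... Friday=4
            count += 1
        day += timedelta(days=1)
    return count


def feq(num_valid_commits, start, end, holidays=frozenset()):
    """Assumed definition: valid commits per working day in the period."""
    days = working_days(start, end, holidays)
    return num_valid_commits / days if days else 0.0


# Illustrative week (Monday 2021-03-01 to Sunday 2021-03-07): 5 working days.
print(feq(10, date(2021, 3, 1), date(2021, 3, 7)))  # 2.0
```

Invalid commits (duplicated changes, comment-only changes, and commits that do not affect the code) would need to be filtered out before counting.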

Gamification Design
In the field of software engineering, the objective of gamification is not to entertain engineers but to serve as a mechanism for altering their behaviors. We aim to examine whether gamification strategies change students' development behaviors to those that can improve software maintainability, and whether they improve engagement levels in development activity. The central aspect of gamification design is the selection of appropriate strategies [44]. Previous literature suggests that incorporating points, leaderboards, and rewards simultaneously can modify user behavior [2]. This can be explained by the fact that students are more likely to conform to the behaviors encouraged by researchers in pursuit of higher scores, positions, or financial rewards. The desired behavior in this paper is TDD style. It is worth noting that using scores and ranking in conjunction may demotivate those at the bottom of the ranking [2]. However, we expect the small size of the gamification group (six students) to limit this risk. Additionally, continuous feedback can encourage users to remain engaged in the project, further enhancing their level of engagement [49]. Thus, the design of gamification strategies in this study incorporates points, leaderboards, rewards, and feedback as the primary components.
Points and Leaderboard - The gamification rules are crafted to provide users with points, which enable them to advance their position on the leaderboard and, eventually, receive rewards. Points are assigned based on whether students adopt Test-Driven Development (TDD) behaviors and are more engaged in development activities. These assessments are obtained through manual analysis of the commits. The first part of the gamification rules is as follows:
1. For generating a failing unit test before the function code, the participant gets 1 point when the unit test passes after the function code is finished.
2. For finishing a development cycle (creating new function code and related test cases), the student gets 2 points.
3. For each commit that updates production code, the student gets an extra 1 point.
4. For each commit that updates testing code, the student gets an extra 1 point.
As students earn these points, it is important to ensure that new changes do not affect existing code. If the new code causes a bug, the student does not receive points until the issue is resolved. Additionally, testing is a crucial aspect of the study. Although individual test cases may not pass, the effort students put into conducting these tests is still recognized and rewarded. This approach is intended as positive reinforcement, motivating students to engage in further test case execution.
Students' contributions to project administration are also factored into the evaluation. Effective project management entails comprehensive documentation and administration [53]. Because this study prioritizes TDD behaviors and engagement levels in development activities, it assigns fewer points for documentation and administration work than for code production. The second part of the gamification rules is as follows:
5. For each commit that updates documentation (e.g., an XML file), the student gets 0.5 points.
6. For each commit related to administration (e.g., a merge), the student gets 0.5 points.
To encourage students to self-coordinate, they are informed of their ranking on the leaderboard, and they receive leaderboard bonus points that depend on their ranking. The criterion for obtaining the bonus is as follows:
7. The highest-ranked student earns 6 points, the second highest earns 5 points, and so on. The ranking is refreshed every week, and the points are accumulated.
Rewards - At the end of the experiment, students get real-world rewards (shopping vouchers) to motivate their engagement in development activities.
Feedback - The feedback is a suggestion about how to get more points based on the student's most recent activity, and it is distributed together with the points and the leaderboard.
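The point rules and the weekly leaderboard bonus can be sketched as below. This is an illustrative sketch under stated assumptions, not the scoring tool used in the study: the event labels and function names are hypothetical, events are assumed to have been classified manually beforehand, and ties on the leaderboard are broken by input order, which the rules do not specify.

```python
# Points per classified event, following rules 1-6 (labels are hypothetical).
POINTS = {
    "test_first": 1.0,             # rule 1: failing test written first, then passes
    "cycle": 2.0,                  # rule 2: completed development cycle
    "production_commit": 1.0,      # rule 3
    "test_commit": 1.0,            # rule 4
    "documentation_commit": 0.5,   # rule 5
    "administration_commit": 0.5,  # rule 6
}


def activity_points(events):
    """Total points for a list of classified events."""
    return sum(POINTS[e] for e in events)


def leaderboard_bonus(weekly_points, top_bonus=6):
    """Rule 7: rank 1 earns top_bonus points, rank 2 one fewer, and so on.

    Assumption: ties keep input order (sorted() is stable).
    """
    ranked = sorted(weekly_points, key=weekly_points.get, reverse=True)
    return {student: max(top_bonus - i, 0) for i, student in enumerate(ranked)}


# Illustrative week with three hypothetical students.
week = {
    "s1": activity_points(["test_first", "cycle", "production_commit", "test_commit"]),
    "s2": activity_points(["cycle", "documentation_commit"]),
    "s3": activity_points(["administration_commit"]),
}
bonus = leaderboard_bonus(week)
totals = {s: week[s] + bonus[s] for s in week}
print(totals)  # {'s1': 11.0, 's2': 7.5, 's3': 4.5}
```

Because the ranking is refreshed weekly and points accumulate, the weekly totals would be added to each student's running score.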

Measurement of Gamification
Assessing the efficacy of gamification is a central component of our research design. We treat the gamification intervention as a 'shock' during the experiment. We set up three effect windows, 1 day, 3 days, and 5 days after the gamification treatment, and accordingly create three dummy variables: gamification_1 (1 day), gamification_2 (3 days), and gamification_3 (5 days). For instance, if students receive the gamification treatment on day 0, gamification_1 is equal to 1 on day 1, and 0 otherwise; gamification_2 is equal to 1 on days 1 and 3, and 0 otherwise; and gamification_3 is equal to 1 on days 1, 3, and 5, and 0 otherwise. Because the treatment is repeated every 7 days, we observe its impact up to the fifth day. We start from day 1 rather than day 0 to better examine the changes in maintainability after introducing gamification and to allow one day for students to react. Similarly, we construct the variables at one-day intervals (days 1, 3, and 5) rather than on every consecutive day, since the effect on maintainability is not immediate. The rationale for establishing three distinct observation windows is to examine the impact of the gamified stimuli on students at different points in time, covering both the immediate and the more prolonged effects of gamification.
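The construction of the three dummy variables can be sketched as follows. This sketch takes the window days literally as stated (days 1, 3, and 5 after each treatment) and uses the treatment days 23, 30, and 37 from the experiment setup; the function and constant names are hypothetical.

```python
TREATMENT_DAYS = (23, 30, 37)  # treatment days reported in the experiment setup

# Offsets (in days after a treatment) on which each dummy equals 1.
WINDOW_OFFSETS = {
    "gamification_1": (1,),
    "gamification_2": (1, 3),
    "gamification_3": (1, 3, 5),
}


def gamification_dummies(day):
    """Return the three 0/1 indicators for a given experiment day."""
    return {
        name: int(any(day == t + k for t in TREATMENT_DAYS for k in offsets))
        for name, offsets in WINDOW_OFFSETS.items()
    }


print(gamification_dummies(24))  # {'gamification_1': 1, 'gamification_2': 1, 'gamification_3': 1}
print(gamification_dummies(26))  # {'gamification_1': 0, 'gamification_2': 1, 'gamification_3': 1}
print(gamification_dummies(28))  # {'gamification_1': 0, 'gamification_2': 0, 'gamification_3': 1}
```

If the windows are instead meant to cover every day up to the third or fifth day, only the offset tuples need to change.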

Measurement of Maintainability
To reflect the maintainability of different time periods, we measure the maintainability of every source code file across the repository on a daily basis. Previous research has measured software maintainability from three aspects: process, architecture, and code level. The metrics used for this purpose, such as Mean Time to Repair (MTTR), Mean Preventive Maintenance Time (MPMT), Mean Corrective Maintenance Time (MCMT), and Maximum Corrective Maintenance Time (MaCMT), have been adopted from prior studies [3]. In addition to these, more advanced metrics, such as Halstead Software Science, McCabe's Cyclomatic Complexity (CC), and the Maintainability Index (MI), have been employed to evaluate maintainability [3]. In this study, we use two common metrics, CC and MI.
CC is the total number of linearly independent paths through a program's source code, and a higher CC leads to lower code quality [3]. MI was first proposed by Oman and Hagemeister in 1992 [38]; it is widely used in industry and has been successfully applied to a wide range of software systems, including in Microsoft Visual Studio [52]. Recently, Microsoft has proposed a new approach to calculating MI [47]. The equation specification for MI is described below, wherein V denotes Halstead Volume, G denotes McCabe's Cyclomatic Complexity, and LOC is defined as the total count of code lines.
To elucidate the effect of gamification on group performance, we construct multiple maintainability indicators, including CC_mean, CC_o, MI_mean, and MI_o. CC_mean reflects the mean value of CC over the course of the entire experiment, while CC_o reflects the CC value upon completion of the project, thereby providing an overall view of the influence of gamification. Similarly, MI_mean and MI_o are used to evaluate the effects of gamification from a global perspective. We employ the fixed effects model (FEM) to explore the correlation between gamification and software maintainability, taking gamification_1, gamification_2, and gamification_3 as independent variables, and CC, CC_mean, CC_o, MI, MI_mean, and MI_o as dependent variables [27]. Additionally, we consider control variables, including lines of code (LOC) and comment ratio (CR), as they may influence maintainability [40]. The formula of CR is shown below, and C represents the total number of comment lines.
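A hedged sketch of the two metrics follows. For MI it assumes the rescaled variant documented for Microsoft Visual Studio, MI = max(0, (171 - 5.2 ln V - 0.23 G - 16.2 ln LOC) * 100 / 171), which uses the same symbols V, G, and LOC as the text; whether the study used exactly this variant is an assumption. For CR it assumes the ratio of comment lines C to LOC, which is likewise an assumption. Function names are hypothetical.

```python
import math


def maintainability_index(v, g, loc):
    """MI on a 0-100 scale (assumed Visual Studio variant).

    v: Halstead Volume, g: cyclomatic complexity, loc: lines of code.
    Requires v > 0 and loc > 0 because of the logarithms.
    """
    raw = 171 - 5.2 * math.log(v) - 0.23 * g - 16.2 * math.log(loc)
    return max(0.0, raw * 100 / 171)


def comment_ratio(c, loc):
    """Assumed definition of CR: comment lines divided by lines of code."""
    return c / loc if loc else 0.0


# Illustrative inputs only, not measurements from the study.
mi = maintainability_index(v=1000, g=10, loc=100)
print(round(mi, 2))            # 34.02
print(comment_ratio(20, 100))  # 0.2
```

Daily values per file could then be aggregated into CC_mean or MI_mean by averaging over the experiment, with CC_o and MI_o taken from the final snapshot.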

Diagnostic
To assess the reliability of the statistical analysis, we consider model robustness, heteroscedasticity, multicollinearity, and bias due to unobserved variables. Whenever variables such as CC [31] had highly skewed distributions, we log-transformed and winsorized [9] the continuous variable at the 1% and 99% levels to increase model robustness [30]. Additionally, we used robust standard errors in all models to address heteroscedasticity [54]. Multicollinearity may distort the estimated coefficients and p-values [20] and can inflate apparent model fit (R-squared), so we employed post-estimation diagnostics to evaluate the Variance Inflation Factor (VIF). A mean VIF higher than 4.0 is regarded as indicating a multicollinearity issue, although some scholars suggest using 10.0 as the threshold [28]. To address possible endogeneity due to unobserved variables, we applied the impact threshold of a confounding variable (ITCV) to the models to assess their sensitivity to omitted variable bias [10].
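The log transformation and winsorization steps can be sketched in plain Python as follows. This is a minimal sketch that covers only these two steps (not the robust standard errors, VIF, or ITCV); it assumes linear-interpolation percentiles, which may differ slightly from the statistical package used in the study, and the function names are hypothetical.

```python
import math


def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) of a pre-sorted list."""
    k = (len(sorted_vals) - 1) * p / 100
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return sorted_vals[lo]
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def winsorize(values, lower=1, upper=99):
    """Clamp values below the lower or above the upper percentile to the cutoffs."""
    s = sorted(values)
    lo_cut, hi_cut = percentile(s, lower), percentile(s, upper)
    return [min(max(v, lo_cut), hi_cut) for v in values]


def log_transform(values):
    """Natural log of each value; requires strictly positive inputs (e.g., CC >= 1)."""
    return [math.log(v) for v in values]


# Illustrative skewed sample: 1..100 plus one extreme value.
sample = list(range(1, 101)) + [1000]
clipped = winsorize(sample)
print(min(clipped), max(clipped))  # 2 100
```

Winsorizing keeps every observation in the sample while limiting the leverage of extreme values, which is why it is often preferred to dropping outliers in small samples.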

RESULT
To assess the influence of gamification on Test-Driven Development (TDD) behavior and engagement levels in development activity, we use graphical representations to depict the trend. Subsequently, we examine the association between gamification strategies and maintainability using regression analysis. For this purpose, three independent variables (gamification_1, gamification_2, and gamification_3) denote the gamification intervention, while six dependent variables (CC, CC_mean, CC_o, MI, MI_mean, and MI_o) measure maintainability. To avoid alternative explanations, two control variables (LOC and CR) that might affect maintainability are also incorporated.

Gamification and Behaviors
Figure 1 shows that TDD behaviors are altered by the gamification intervention. In Panel A, a higher behavior score means more development cycles generated by an individual (students 1, 2, 3 ... 6) in accordance with TDD. The value plotted for a day is the average value for the period. Panel A shows that the gamification intervention increases individuals' behavior scores and that the effect is sustained even after the intervention is withdrawn. Although the score on day 41 drops slightly, the number of development cycles had already reached a high level by then, which makes further improvement difficult to maintain. Because the numbers of participants in the treatment and control groups differ, the number of development cycles is averaged in the analysis. Panel B of Figure 1 compares the group averages. Without gamification treatment, both groups show similar TDD behaviors. However, after the gamification intervention (day 23), the treatment group outperforms the control group. This comparison provides evidence that the gamification strategies are effective. Thus, the results confirm Hypothesis 1 that gamification changes development behaviors to follow TDD.

Gamification and Engagement
Figure 2 illustrates the engagement level in development activity from the perspective of the number of commits (NC). Panel A displays the average value of NC for each day. Following the start of the gamification intervention on day 23, there is a notable increase in NC by day 30 for all students except Student 6, who had already completed his tasks prior to day 37. Panel B shows a comparable trend for the treatment and control groups before day 23; however, the treatment group experiences a greater improvement in NC after the gamification intervention. After the gamification intervention ends on day 41, NC remains unchanged.
Figure 3 illustrates the engagement level in development activity from a frequency perspective. Each plotted point is the average frequency of engagement (FEQ) during the period. For instance, the value for day 5 is the average FEQ from day 1 to day 5. Panel A of Figure 3 indicates that each individual's FEQ value increased after the gamification intervention. To further evaluate the frequency changes after the gamification intervention, we compare the average team frequency of the treatment and control groups, as shown in Panel B of Figure 3. From day 1 to day 23, the treatment group and the control group display a similar trend, while the treatment group performs better after the gamification strategy is introduced (after day 23), suggesting that the gamification strategy has a positive impact on engagement level. Additionally, the FEQ in the treatment group remained at a similar level even after the gamification intervention ended (day 41), indicating that the effect of the gamification intervention can be sustained for an extended period. This further supports Hypothesis 1, that gamification is beneficial in altering behavior towards Test-Driven Development (TDD) and increasing engagement levels.
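The period averaging used for the plotted points can be sketched as follows. This is a minimal illustration, not the study's code: the function name and the choice of consecutive, non-overlapping periods that each end on a plotted day are assumptions, consistent with the day 5 example in the text.

```python
def period_averages(daily_values, checkpoints):
    """Average daily values over consecutive periods ending at each checkpoint.

    daily_values[0] is day 1. Each checkpoint is the last day of a period,
    and a period starts the day after the previous checkpoint (assumption).
    """
    averages = {}
    start = 1
    for end in checkpoints:
        window = daily_values[start - 1:end]  # days start..end inclusive
        averages[end] = sum(window) / len(window)
        start = end + 1
    return averages

# Hypothetical daily FEQ values for ten days, plotted at day 5 and day 10.
feq = [1, 2, 3, 4, 5, 10, 10, 10, 10, 10]
points = period_averages(feq, [5, 10])
```

With these hypothetical inputs, the point for day 5 summarizes days 1 to 5 and the point for day 10 summarizes days 6 to 10.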

Gamification and Maintainability
The observations tested span the entire experimental period of the treatment group, approximately 45 days. Table 2 presents the results of our analysis of the correlation between gamification and maintainability. Models 1-3 examine the correlation between gamification and Cyclomatic Complexity (CC) in the short term. Contrary to expectation, a significant positive correlation was found between CC and both gami_1 and gami_3 in Models 1 and 3. This suggests that gamification leads to higher CC, and thus worse maintainability, in the short term. However, a strongly significant negative correlation was found between CC_o and both gamification_1 and gamification_2 in Models 4 and 5, indicating that gamification leads to lower CC, and thus better maintainability, in the long term. Models 7-9 examine the correlation between gamification and the Maintainability Index (MI) in the long term, and the results suggest that gamification leads to better maintainability. In conclusion, these results support our hypothesis (H2) that gamification is positively related to software quality in the long term. Furthermore, they indicate that gamification is negatively related to maintainability in the short term.
The remaining results of our analysis show that none of the models has a multicollinearity problem, with mean VIF values between 1.3 and 2. In each model, the ITCV value exceeds those of all the remaining independent variables, and all RESET test p-values are greater than 0.05. Thus, our results appear robust and are unlikely to suffer from endogeneity problems.

THREATS TO VALIDITY
In this section, we discuss the threats to the validity of our study, ordered according to the recommendations of Wohlin et al. (2012) [58]:

Internal Validity
In this study, various factors that could threaten internal validity were considered, and efforts were made to mitigate their impact. The potential threats to internal validity include history, maturation, testing, instrumentation, statistical regression, and selection. With regard to selection, the sample was constructed to minimize the selection effect by choosing participants from the same college, grade, and course, and by assigning them the same time limits and task. This ensured that the participants had similar skills and faced similar task difficulty levels.
It is important to acknowledge the potential influence of self-selection bias. The participants who volunteered to be part of the gamification intervention may possess characteristics or motivations that differ from those who did not participate. This could bias the results, as the participants who self-selected may already have been more motivated or interested in the topic, potentially inflating the observed effectiveness of the gamification intervention. Consequently, the findings may not be fully generalizable to all development teams or software development contexts. Nevertheless, the results provide insights into the effects of gamification within the specific sample studied.
Additionally, the history threat was mitigated as all data was collected from the same college. The maturation threat was reduced as the project lasted only 45 days, which is not a significant time period for skill level changes. While efforts were made to minimize the testing threat by teaching the TDD technique before the experiment, some participants may have had prior experience with TDD. The instrumentation threat was eliminated as all data was collected from GitHub, and the students in the group had similar abilities to use GitHub. To further reduce the statistical regression threat, extreme values were excluded from the regression test.

External Validity
This study's participants do not represent the entire population of developers, particularly senior engineers. However, the nature and scale of the assigned task do not require a high degree of industrial experience; thus, we consider the student population to be appropriate participants.
The limited nature and scale of the assigned task may not accurately reflect real-world scenarios, such as complex software development or advanced system design.
To reduce the impact of external factors, such as a pandemic or environmental disasters, all data was collected from the online platform GitHub.

Construct Validity
The measurements of TDD behaviors and software quality are based on transparent, previously published definitions that have been argued and tested.
This study is not exposed to the threat of evaluation apprehension because the students were not evaluated based on the results they obtained in the experiment.

Conclusion Validity
The limited statistical power of the test due to the small sample size may reduce the accuracy of the inferences drawn from our results. Nonetheless, using panel data with 250 observations lessens the effect on power. We do not consider fishing to be a potential risk. The data analysis was conducted according to the same standards throughout to avoid favoring particular results. The error rate, or significance level, was kept unchanged since the data analysis was conducted by a single researcher. As all subjects were administered benchmarked treatments concurrently, the execution of the treatments was consistent. Theoretically, the heterogeneity of the subjects should be diminished due to their similar backgrounds. Empirically, we adopt robust standard errors to address this concern.

CONCLUSION AND DISCUSSION
The research questions addressed in this study center on the impact of gamification on students' TDD practices and the improvement of software maintainability. This is motivated by prior literature suggesting that gamification can enhance user performance and promote personal focus.
The findings of this study support the hypothesis that gamification can effectively improve students' TDD practices and software maintainability. Gamification was found to positively influence TDD behaviors and the engagement level in development activities, and to be positively related to software maintainability. These results align with the notion that gamification can enhance software engineering practice performance.
This study is the first of its kind to explore the potential of gamification in enhancing complex software development activities. Moreover, its findings have important implications for educators, software developers, and project managers who are seeking to improve the efficiency of software development. By incorporating gamification into their curriculum, educators can enhance the TDD practices of their students. Similarly, software developers can use gamification strategies to motivate team members and improve project outcomes, while project managers can increase the engagement levels of their team members and improve overall project quality through gamification. The study also proposes gamification as a promising and cost-effective alternative to traditional methods such as introducing new methodologies or enhancing expertise.
Relatedly, our findings are applicable to student and novice developers but are not necessarily generalizable to senior engineers. Therefore, a direct extension of our work would be to examine our proposed engagement metric in a broader context, that is, with experienced engineers and sophisticated projects. Further, future studies could explore the potential role of gamification in different development patterns such as refactoring. Additionally, beyond examining the application of gamification to group projects, it would be of special interest to explore whether gamification impacts the development process at the individual level.

Figure 1: Panel A. Gamification and Behaviors

Figure 2: Gamification and Number of Commits