Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education

Educators are increasingly concerned about the use of Large Language Models (LLMs) such as ChatGPT in programming education, particularly regarding the potential exploitation of imperfections in Artificial Intelligence Generated Content (AIGC) Detectors for academic misconduct. In this paper, we present an empirical study that examines whether code generated by an LLM can bypass detection by AIGC Detectors. This is achieved by generating code in response to a given question using different prompt variants. We collected a dataset comprising 5,069 samples, with each sample consisting of a textual description of a coding problem and its corresponding human-written Python solution code. These samples were obtained from various sources, including 80 from Quescol, 3,264 from Kaggle, and 1,725 from LeetCode. From the dataset, we created 13 sets of code problem variant prompts, which were used to instruct ChatGPT to generate the outputs. Subsequently, we assessed the performance of five AIGC Detectors. Our results demonstrate that existing AIGC Detectors perform poorly in distinguishing between human-written code and AI-generated code.


INTRODUCTION
In recent years, an increase in capability has been observed in the development of LLMs, enabling them to generate high-quality, coherent paragraphs, answer questions, and even produce human-like code [20,26]. LLMs are designed to understand and generate human text, and they are trained using vast amounts of data scraped from the internet. Most of these LLMs, developed by major corporations, are generally made accessible to the public, allowing anyone with Internet access to utilize them [13]. This accessibility enables individuals to swiftly obtain valuable and reference-worthy answers.
At the educational level, this accessibility can enhance learning and research efficiency while potentially influencing educational assessment and evaluation [30]. The use of AI-based tools and resources has gradually led to a situation where students increasingly rely on these tools to quickly access answers and information. This growing dependence on AI-driven solutions has a noticeable impact on academic dishonesty [15].
As a consequence, educators find themselves compelled to utilize AIGC Detectors to ascertain whether students are involved in academic dishonesty [33]. While existing AIGC Detectors have proven their proficiency in identifying AI-generated text, their effectiveness in recognizing AI-generated code remains uncertain due to the intricate nature of programming code [35]. This discrepancy can lead to disparities in the evaluation of students' academic submissions, potentially resulting in unfair grading.
The paper aims to conduct an empirical study to evaluate the performance of different AIGC Detectors in detecting AIGC across diverse contextual and syntactical variations.
Overall, the main contributions of our paper are summarized as follows:
• We conducted a comprehensive empirical study to assess the performance of five AIGC Detectors using 13 variant prompts. To the best of our knowledge, this is the first study specifically evaluating the performance of different AIGC Detectors on AIGC generated with various question prompts.
• We constructed 13 large datasets, each containing 5,069 samples representing specific code problem prompt variants. Each dataset comprises code problems, human-generated code for each problem, and AI-generated code for the same problem.

BACKGROUND AND MOTIVATIONS
In this section, we discuss some of the background related to our work, particularly in the domain of software engineering education and the use of generative AI technologies to help in teaching and learning processes.

Impact of AIGC on SE and CS Education
Software Engineering (SE) and Computer Science (CS) education form the bedrock of technological progress [41]. These fields equip students with essential skills, such as problem-solving, logical thinking, and creativity. However, the cornerstone of SE and CS education is programming. Programming is not just a skill; it embodies the essence of SE and CS education. It empowers students to create innovative software solutions, analyze complex problems, and make significant contributions to the digital world [24]. The emergence of generative AI and the subsequent proliferation of AIGC have brought about a transformative era in education. AIGC represents a paradigm shift where automated content creation is revolutionizing learning experiences for both educators and learners [18]. This shift, while revolutionary, introduces a complex interplay of challenges and opportunities in the realm of education [6]. AIGC can personalize and enhance learning, but it also poses questions about the authenticity of content and the methods of assessment in the digital age [5].
The widespread adoption of AIGC has led to a growing need for reliable detection mechanisms. One notable development in this area is the AIGC Detector detailed in the work by Wang et al. [35]. These detectors have demonstrated impressive accuracy in identifying AI-generated content. However, understanding the implications of the accuracy of AIGC Detectors within educational settings, particularly in CS and SE learning environments, is essential.
In an era where these detectors ensure the authenticity of educational content, it is imperative to examine their effectiveness and limitations comprehensively. Relying solely on their accuracy might create a false sense of security, potentially allowing students to deceive these detectors. This raises questions about the robustness of educational assessments and the integrity of the learning process [31].
An intriguing aspect of this inquiry lies in potential scenarios where students might outsmart AIGC Detectors. For example, in specific situations, students may find ways to manipulate the system, deceiving even sophisticated AIs like ChatGPT [9]. This raises concerns about the reliability of automated evaluations, especially if educators heavily depend on a single detector. Understanding these vulnerabilities is vital for ensuring the authenticity of educational assessments in the digital age [22].
In summary, this research is motivated by the transformative impact of AIGC on SE and CS education. Our study aims to significantly contribute by unraveling the complexities surrounding AIGC Detectors, exploring their accuracy, vulnerabilities, and implications for educational practices. By shedding light on these critical aspects, our work strives to enhance the understanding of AIGC Detection within the context of CS and SE education. Ultimately, our research seeks to ensure the integrity of educational assessments, thereby fostering a secure and genuine learning environment for students and educators alike.

Selection of AIGC Detectors
The integration of AIGC Detection in SE education significantly enhances academic integrity and ensures the authenticity of educational content. Educators can leverage these detectors to identify instances of plagiarism, unauthorized use of AI-generated code, and academic dishonesty among students, thereby fostering fairness and promoting originality within SE coursework. Moreover, AIGC Detectors play a pivotal role in verifying the authenticity of educational materials, allowing educators to maintain the credibility of resources utilized in SE courses. This technological integration not only strengthens the overall educational ecosystem but also stimulates meaningful discussions among students about responsible AI usage, emphasizing the importance of ethical practices in SE education.
Five AIGC Detectors are instrumental in achieving these objectives. GPTZero [1], Sapling [2], GPT-2 Detector [32], DetectGPT [7,25], and Giant Language Model Test Room (GLTR) [17,19,28] offer educators reliable tools to preserve academic integrity and encourage ethical AI usage. However, it is crucial for educators to thoroughly evaluate these tools and consider their limitations to ensure their effective use in SE education.

EMPIRICAL STUDY DESIGN AND METHODOLOGY
This section discusses the research questions, methodology, and process of our empirical study. Figure 1 offers an overview of the empirical study process, highlighting the key phases of data collection, prediction, and analysis. We have made the replication package publicly available. The data collection phase is essential for ensuring the integrity and robustness of our study. We gathered coding problems from various sources and introduced different variants based on the collected coding problems. Then, by utilizing the OpenAI API service, we collected AI-generated code and performed manual validation to ensure the correctness of the dataset. Moving on to the data prediction phase, using AI-generated code and human-written code as input, we employed the selected AIGC Detectors to perform AIGC Detection and collected the prediction results. In the data analysis phase, using the compiled prediction results, we evaluated the efficacy of each AIGC Detector via predetermined performance metrics and used the analyzed results to answer our research questions.

Research Questions and Experimental Setup
In this section, we discuss the research questions and the experimental setup.
(1) RQ1: How accurate are existing AIGC Detectors at detecting AI-generated code?
(2) RQ2: What are the limitations of existing AIGC Detectors when it comes to detecting AI-generated code?
The objectives of RQ1 and RQ2 are, respectively, to assess the performance of the chosen AIGC Detectors in identifying AI-generated code and to investigate the limitations of those detectors in doing so. Both research questions share a similar experimental setup.
(1) Data Evaluation: Evaluation of the AIGC Detectors' performance is based on the 13 variants of the dataset we have meticulously collected.
(2) Experimental Configuration: For the selected AIGC Detectors, we have established a discriminative threshold of 0.5. This threshold serves as a criterion to distinguish AI-generated code samples. Specifically, if the output probability surpasses the 0.5 threshold, it signifies the presence of AI-generated content within the input samples.
(3) Performance Assessment Metrics: The performance of the detectors is measured using a range of metrics such as accuracy, precision, true positive rate (TPR), false positive rate (FPR), true negative rate (TNR) and false negative rate (FNR). These metrics provide a comprehensive understanding of how effectively the detectors identify AI-generated code.
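The 0.5 decision rule described above can be sketched as a small helper. The function name is our own; the threshold and the labelling convention (1 = human-written, 0 = AI-generated, matching the confusion matrix used later in the paper) follow the study's setup.

```python
def label_from_probability(p_ai: float, threshold: float = 0.5) -> int:
    """Map a detector's output probability of AI authorship to a label.

    Following the paper's convention, 1 denotes human-written code and
    0 denotes AI-generated code; only a probability strictly above the
    threshold is treated as AI-generated.
    """
    return 0 if p_ai > threshold else 1
```

For example, a detector score of 0.73 would be labelled 0 (AI-generated), while a borderline score of exactly 0.5 stays labelled 1, since only probabilities that surpass the threshold count as AI-generated.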

Data Collection
We gathered fundamental code problems along with the corresponding solution code from various online sources, using existing datasets from Kaggle [40] and a web scraping technique to crawl data from Quescol [39]. Kaggle is an online community of data scientists and machine learning engineers that allows users to find datasets used in building AI models, publish datasets, collaborate with others, and enter competitions to solve data science challenges.
Quescol is an educational website that provides a large collection of previous years' and most important exam questions and answers. For the AIGC dataset, we generated different variants of AI-generated code to explore different scenarios and identify situations in which students might be able to fool AIGC Detectors. The details of the different variants are discussed in Section 3.4.
For the compilation of the AIGC dataset, we developed an automation script leveraging the OpenAI API service. The script uses the collected code problems as inputs and prompts ChatGPT to generate the outputs, which are subsequently labelled and stored as AI-generated code. The script was executed 13 times to generate the different code variants, such as removing stop words or adding dead code, ensuring that a dataset of initial size 70,966 was collected for thorough analysis.
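A minimal sketch of such a script's structure is shown below. The variant prefixes and function names are illustrative assumptions, not the study's actual prompt wording (which is in the replication package); the call to the OpenAI API would slot in between the two helpers.

```python
def build_variant_prompt(problem: str, variant: int) -> str:
    """Compose the prompt sent to ChatGPT for a given variant.

    The prefixes below are hypothetical examples of variant
    instructions; variant 1 forwards the problem unmodified.
    """
    prefixes = {
        1: "",
        11: "Provide a longer solution as a single function. ",
        12: "Provide a shorter solution as a single function. ",
    }
    return (prefixes.get(variant, "") + problem).strip()


def record_sample(problem: str, code: str, variant: int) -> dict:
    """Label and store one generated sample (0 = AI-generated)."""
    return {"problem": problem, "code": code,
            "variant": variant, "label": 0}
```

Running such a loop once per variant over the 5,069 problems yields 13 AI-generated datasets alongside the human-written one, matching the 70,966-sample total reported above.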
The coding problems and human-written code are obtained from three primary sources:
1.) Python Coding Question [38]: this source from Quescol contributed 80 samples to our dataset, which include some common interview questions for Python;
2.) Coding Problems and Solution Python Code [37]: this source from Kaggle contributed approximately 3,264 samples to our dataset, which encompass a wide range of Python code solutions to various coding problems;
3.) LeetCode Solutions and Content KPIs [36]: this source from Kaggle contributed around 1,725 samples to our dataset, which include solutions to coding problems from the popular online coding platform, LeetCode, along with additional content and key performance indicators.
The collected dataset comprises 5,069 samples, where each sample consists of a textual description of a coding problem and the corresponding solution code, which were labeled as human-written code. The dataset covers diverse programming concepts, algorithms, and coding challenges. Based on the collected code problems, we have created 13 variants of AIGC content.

Workflow Explanation
To ensure the replicability and clarity of our empirical study, we present the procedure from Figure 1 in a systematic, step-by-step algorithm format to eliminate misinterpretation. As shown in Algorithms 1 to 4, four procedures describe how the variant datasets are collected and how the metrics are calculated.

Data Variations
Table 1 displays the dataset sizes used for each AIGC Detector across the various variants. The sizes of the datasets for Variants 2 and 3 (we will discuss the specific details of each variant in the following paragraph) are smaller across all AIGC Detectors due to the removal of data resulting from invalid outputs by ChatGPT. Specifically, when applying the Sapling detector on Variant 2, the dataset size is further reduced because our initial approach has a minimum word requirement for code detection (which will be discussed in detail in the following paragraph). Similarly, the DetectGPT detector excludes data with fewer than 100 words due to its minimum word requirement, leading to a reduced dataset size.
In the process of preparing our dataset, we introduced 13 variations of AI-generated code. These variations were achieved by altering the prompts provided to ChatGPT or the solution code received from ChatGPT. Our aim was to find out the limitations and performance of each AIGC Detector under different variants of programming code. Detailed descriptions of each variation are available on Figshare.
1. Without modification: Adapted the prompt based on Sam Altman's example. Prompted ChatGPT with unmodified code problems, adding preliminary conditions [8]. Mimics users' typical approach to resolving coding problems.
2. Removal of stopwords: Preprocessed coding problems by removing common stopwords using the NLTK library. Aims to simulate users modifying questions to enhance output by eliminating stopwords.
8. Replace variable names: Using the AI-generated code from variant 1, we used the AST library to replace all variable names with single-character letters (a to z). Simulates cases where students violate basic programming rules in variable naming.
9. Replace function names: Using the AI-generated code from variant 1, we developed a Python script with the AST library to replace all function names with single-character letters (a to z). Simulates cases where students violate basic programming rules in function naming.
10. Replace variable and function names: Using the AI-generated code from variant 1, we developed a Python script with the AST library to replace all variable and function names with single-character letters (a to z). Simulates extreme cases where programmers violate basic programming rules in naming conventions.
11. Long method: Long method, a term associated with code smells, indicates design weaknesses that raise the risk of bugs. Asked ChatGPT for a longer output in function block format to simulate scenarios with the long method code smell, reducing code readability and maintainability.
12. Short method: Contrary to the long method variant, prompted ChatGPT to generate a shorter output in function block format. Aims to simulate scenarios adhering to best programming practices, with concise methods focusing on a single functionality.
13. Adding 5 snippets of dead code: Injected 5 snippets of dead code into the output as noise for the AIGC Detector. Aims to explore the impact of dead code on the detector's performance, simulating scenarios where useless code is inadvertently included in the solution during development.
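Variants 8 to 10 can be approximated with Python's built-in `ast` module. The sketch below is our own simplified reconstruction, not the study's actual script: it renames function names, parameters, and variables to single letters, skips builtins, and assumes a solution uses at most 26 distinct identifiers (attributes and more exotic scoping cases are not handled).

```python
import ast
import builtins
import string


class _Renamer(ast.NodeTransformer):
    """Rename user-defined identifiers to single letters (a, b, c, ...)."""

    def __init__(self):
        self.mapping = {}
        self.letters = iter(string.ascii_lowercase)  # assumes <= 26 names

    def _rename(self, old):
        if old not in self.mapping:
            self.mapping[old] = next(self.letters)
        return self.mapping[old]

    def visit_FunctionDef(self, node):   # function names (variant 9)
        node.name = self._rename(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):           # function parameters
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node):          # variables (variant 8)
        if node.id not in dir(builtins):  # keep print, range, ...
            node.id = self._rename(node.id)
        return node


def rename_identifiers(source: str) -> str:
    """Return `source` with all user identifiers replaced by letters."""
    return ast.unparse(_Renamer().visit(ast.parse(source)))  # Python 3.9+
```

Applied to a solution such as `def add_numbers(first, second): ...`, the output keeps the same behaviour but reads like `def a(b, c): ...`, mimicking a student who ignores naming conventions.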

Evaluation Metrics
For this empirical study, we have selected a few metrics [23,34] to evaluate the performance of the selected AIGC Detectors. Table 2 shows the confusion matrix used in the paper, where 1 indicates human-written code and 0 indicates AI-generated code.
TPR/Recall: True Positive Rate, also known as Recall, calculated as TPR = TP / (TP + FN), where TP is the number of human-written codes correctly labelled as human-written, FN is the number of human-written codes incorrectly labelled as AI-generated, and TP + FN represents the total number of human-written codes.
FNR: False Negative Rate, calculated as FNR = FN / (TP + FN), where FN is the number of human-written codes incorrectly labelled as AI-generated, TP is the number of human-written codes correctly labelled as human-written, and TP + FN represents the total number of human-written codes.
TNR: True Negative Rate, calculated as TNR = TN / (TN + FP), where TN is the number of AI-generated codes correctly labelled as AI-generated, FP is the number of AI-generated codes incorrectly labelled as human-written, and TN + FP represents the total number of AI-generated codes.
FPR: False Positive Rate, calculated as FPR = FP / (TN + FP), where FP is the number of AI-generated codes incorrectly labelled as human-written, TN is the number of AI-generated codes correctly labelled as AI-generated, and TN + FP represents the total number of AI-generated codes.
Accuracy (ACC): Calculated as ACC = (TP + TN) / (TP + TN + FP + FN), where TP is the number of human-written codes correctly labelled as human-written, TN is the number of AI-generated codes correctly labelled as AI-generated, and TP + TN + FP + FN is the total number of codes.
Precision: Calculated as Precision = TP / (TP + FP), where TP is the number of human-written codes correctly labelled as human-written, FP is the number of AI-generated codes incorrectly labelled as human-written, and TP + FP is the total number of codes labelled as human-written.
F1 Score: Calculated as F1 = 2 × (Precision × Recall) / (Precision + Recall), which represents the harmonic mean of precision and recall.
AUC: Area Under Curve, which is used to evaluate how well the model is capable of distinguishing between classes. A higher AUC score means that it has a better capability to distinguish between positive and negative classes. When the AUC score is around 0.5, it indicates the model performs no better than random choice when predicting the samples.
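The definitions above can be collected into a small helper. The function name and dictionary layout are our own, but each formula follows the metric definitions given above, with human-written code as the positive class.

```python
def detector_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute the evaluation metrics from confusion-matrix counts,
    where the positive class is human-written code (label 1)."""
    tpr = tp / (tp + fn)              # recall on human-written code
    fnr = fn / (tp + fn)
    tnr = tn / (tn + fp)              # detection rate on AI-generated code
    fpr = fp / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "FNR": fnr, "TNR": tnr, "FPR": fpr,
            "ACC": acc, "Precision": precision, "F1": f1}
```

For instance, with hypothetical counts TP = 40, FN = 10, TN = 30, FP = 20, the helper yields TPR = 0.8, TNR = 0.6, and ACC = 0.7.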

RESULTS
In this section, we showcase our experimental results and provide an analysis to address each research question. We have uploaded the full set of results on Figshare.

RQ1: How accurate are existing AIGC Detectors at detecting AI-generated code?
Table 3 shows the performance of the GPT-2 Detector and GPTZero, where both detectors exhibit poor performance when compared with the baseline variant 1. Their Accuracy (ACC) results hover around 0.5, suggesting a lack of effectiveness in distinguishing AI-generated code from human-written code. They tend to classify a significant portion of input code as human-generated rather than machine-generated. This performance issue could stem from their primary training on natural language data or potential overfitting to natural language.
In the assessment of AIGC Detectors, both the GPT-2 Detector and GPTZero exhibited challenges in effectively distinguishing between human-written and AI-generated code across all evaluated variants. In Table 3, their performance was characterized by high TPR and low TNR while having an ACC of around 0.5, indicating a tendency to classify both human-written and AI-generated code as human-written code.
DetectGPT, akin to the aforementioned AIGC Detectors, showcases an ACC near 0.5 in Figure 2, indicating its struggle to distinguish AI-generated from human-written code. Notably, DetectGPT tends to misclassify the majority of human-generated code as AI-generated. However, a more positive aspect emerges when we compare DetectGPT to the baseline (Variant 1) and other variants like Function Name (Variant 9), Variable and Function Name (Variant 10), Long Method (Variant 11), and Short Method (Variant 12). DetectGPT demonstrates an improvement in TNR ranging from 5% to 9%. This improvement signifies its competence in accurately identifying AIGC.
The ACC performance of GLTR exhibits variability across different AIGC variants. Approximately half of the variants surpass the 0.6 Accuracy (ACC) threshold, while the remainder approach a more modest 0.5. In our evaluation of GLTR in Figure 3, we conducted a comparative analysis involving the baseline Variant 1 and several other variants, including "No Comment" (Variant 4), "Assertion" (Variant 5), "Test Case" (Variant 6), "Unittest Test Case" (Variant 7), and "Dead Code" (Variant 13). The aim was to assess the system's performance in detecting GPT-generated code as AI-generated content.
The results uncovered a significant enhancement in GLTR's ability to identify AI-generated code when compared to the baseline variant. This improvement ranged impressively from 18% to 53% across the diverse set of variants. This suggests that the incorporation of these new variants, such as "No Comment", "Assertion", "Test Case", "Unittest Test Case", and "Dead Code", substantially bolstered the system's proficiency in AI-generated code detection. Additionally, it is noteworthy that GLTR exhibited a notable 6% increase in the TNR when tasked with evaluating the "Long Method" variant 11. This improvement signifies an augmented capability to accurately categorize AI-generated code, particularly in the context of extensive and intricate code segments.
The Sapling detector outperforms the other AIGC Detectors, consistently achieving ACC values above 0.6 in ten out of fourteen AIGC variants. In our evaluation of Sapling's AI Detector in Figure 4, we compared its performance to the baseline (Variant 1) across various code variants. Our findings highlight significant improvements, particularly in TNR, when the Sapling detector is applied to specific variants such as "Variable Name" (Variant 8), "Variable and Function Name" (Variant 10), and "Dead Code" (Variant 13). These variants exhibited a substantial increase in TNR, ranging from 9% to 15%, indicating an enhanced ability to detect AI-generated code accurately.
Answer to RQ1: Existing AIGC Detectors perform poorly in distinguishing between human-written code and AI-generated code, indicating the inherent weaknesses of current detectors. This underscores the need for further research and development in this domain to enhance their efficacy.

RQ2: What are the limitations of existing AIGC Detectors when it comes to detecting AI-generated code?
From the results in Table 3, we have revealed significant sensitivity to code variants, particularly with GLTR, as demonstrated by a paired-samples t-test [14] comparing ACC differences between the original (Variant 1) and all other variants (Variants 2 to 13), where the p-values for the five AIGC Detectors are 0.0296 (GLTR), 0.1032 (GPT-2 Detector), 0.2479 (GPTZero), 0.3816 (Sapling), and 0.5714 (DetectGPT). GLTR demonstrated the lowest p-value, signifying a significant performance difference across the variants. This underscores GLTR's heightened sensitivity to specific variant patterns, resulting in misclassifications, as evident in the wide range of accuracy from 0.4841 (Variant 3) to 0.7693 (Variant 7). This finding highlights a potential vulnerability of AIGC Detectors, particularly when students prompt ChatGPT to mimic a human's response when generating the code, as GLTR tends to treat most AIGC incorrectly as human-written, potentially enabling academic plagiarism. Moreover, Variants 2, 3, and 5 induced a drop in performance (both ACC and TNR) across multiple models when compared to that of Variant 1, underlining the need for further research to address these specific vulnerabilities and ensure the integrity of AI-generated content detection methods in educational settings.
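The paired-samples t statistic underlying this comparison can be computed with the standard library alone. This is a generic sketch, not the study's analysis code; the p-value would then come from the t distribution with n - 1 degrees of freedom, for example via `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(xs, ys):
    """Paired-samples t statistic for two matched lists of scores.

    Computes t = mean(d) / (stdev(d) / sqrt(n)) over the pairwise
    differences d; assumes at least two pairs and non-zero variance
    of the differences.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

In this setting, one list would hold a detector's baseline (Variant 1) ACC repeated for pairing and the other its per-variant ACC values; a large |t| (and hence a small p-value, as for GLTR) indicates a systematic accuracy shift across variants.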
After a thorough analysis of each AIGC Detector's performance across each variant, we discuss some of the limitations that we observed for each AIGC Detector.
1. GPTZero Limitations: GPTZero may struggle to accurately detect AI-generated code due to the distinct syntax and structure of programming languages. Code follows strict rules and conventions, deviating from the linguistic patterns GPTZero is designed to identify. This mismatch can result in limited effectiveness, leading to potential false positives or false negatives when applied to code detection tasks.
2. Sapling Limitations: Sapling is designed for analyzing text generated by language models like GPT-3 and ChatGPT, excelling in identifying such content. However, it may not be optimized for detecting programming code, as AI-generated code can have distinct characteristics not aligned with natural language patterns. The limitations of the Sapling AI Detector include its focus on language-model-generated text and potential challenges in code detection.

3. GPT-2 Detector Limitations: The GPT-2 Detector relies on a dataset of text generated by GPT-2 models, making it effective for identifying GPT-2 outputs. However, it may struggle with other AI models or programming code, as their distinct characteristics might not be represented in the dataset. The detector's ability to detect code from different models is limited, given the variability in complexity and structure of programming code.
4. DetectGPT Limitations: DetectGPT employs perturbations in input text for innovative text classification but may face limitations in detecting programming code. The intricate logic and syntax of code may not exhibit the same perturbation patterns as natural language, leading to challenges in adaptability and the potential for false or missed detections in this context.
5. GLTR Limitations: GLTR, relying on models like GPT-2, enhances text interpretability through word rankings and color coding. However, its limitations are apparent in programming code detection, where unique linguistic patterns and nuances are not effectively captured by these methods. GLTR may struggle to provide accurate detections for programming code.
Our findings indicate a common limitation shared among current AIGC Detectors in their ability to detect AI-generated code: they consistently perform poorly. In our comprehensive assessment of various AI detection tools, we observed a lack of accuracy and reliability in their capacity to identify AI-generated code content.
These limitations encompass several key areas:
1. Detection Accuracy: Across the board, these tools struggled to accurately identify AI-generated code, often producing high rates of false negatives and false positives, leading to inaccuracies.
2. Lack of Specificity: Existing tools encountered difficulties distinguishing between human-generated and AI-generated code, resulting in misclassifications and a lack of precision.
3. Generalization Challenges: While fine-tuning improved performance within specific code domains, these tools faced difficulty in generalizing their capabilities. They often faltered when confronted with code from diverse domains or when AI-generated code closely resembled human coding styles.
Answer to RQ2: The limitations of AIGC Detectors such as GPTZero, Sapling AI Detector, GPT-2 Detector, DetectGPT, and GLTR become evident when applied to the detection of AI-generated code. Variants 2, 3, and 5 enable students to deceive most models, leading to a significant decrease in accuracy and TNR across the majority of models. These limitations stem from the fundamental differences between programming code and natural language text. Code adheres to strict rules, follows distinct patterns, and may not exhibit the linguistic characteristics that these models are designed to detect.

Suggestions to SE and CS Educators
AIGC has emerged as a transformative force in education, particularly in the domains of CS/SE. This section aims to distill insights from recent research papers to provide educators with strategies and best practices for the successful integration of AI into CS/SE education.
Researchers from Hong Kong introduced the IDEE Framework, which serves as a guiding framework for utilizing generative AI in education [29]. This framework emphasizes the importance of identifying desired outcomes, determining the appropriate level of automation, ensuring ethical considerations, and evaluating effectiveness. Moreover, the work by Kaplan et al. explores teachers' perspectives on generative AI technology and its potential implementation in education [21]. The authors suggest that teachers exhibit positive perspectives towards generative AI, with more frequent usage leading to increased positivity. Educators perceive generative AI as a tool for enhancing professional development and student learning. This suggests that embracing generative AI in education has the potential to positively impact both educators and students, fostering continuous growth and improved learning experiences.
On the other hand, the work by Chan et al. delves into the experiences and perceptions of Gen Z students and Gen X/Y teachers regarding the use of generative AI in higher education [12]. Students express optimism about the benefits of generative AI, while teachers emphasize the need for guidelines and policies to ensure responsible use. Furthermore, an AI Ecological Education Policy Framework for higher education has been proposed [11]. It addresses three dimensions: Pedagogical, Governance, and Operational, providing a comprehensive structure to navigate the implications of AI integration.
Based on the literature and research findings, it is imperative for educators to acknowledge the limitations inherent in current AIGC Detectors, especially when dealing with code-based content. Our study emphasizes the critical need for ongoing exploration and advancement in specialized tools, algorithms, and frameworks. This becomes particularly evident when considering the predominant use of existing AIGC Detectors for text-based content. Our results highlight a significant gap in their applicability to code-based AIGC, calling for the development of more refined tools and algorithms tailored to this specific context. This underscores the importance of staying abreast of technological advancements to ensure the effective detection and evaluation of AIGC within educational materials. In the absence of dedicated detectors tailored for code-based AIGC, we propose several key recommendations for educators:
1. Define Objectives: Precisely outline the educational objectives that are in harmony with the application of generative AI. This step ensures a seamless integration of technology into educational purposes, fostering alignment between technology utilization and educational aspirations. For example, in a Python programming course, educators could utilize generative AI to enhance students' algorithmic creativity. Objectives might include collaborative design and optimization of sorting algorithms using AI assistance, fostering a deeper understanding of algorithmic principles.
2. Automation Level: When incorporating generative AI into education, it is essential to carefully consider the level of automation to employ. This decision hinges on whether to pursue full automation, where AI systems handle educational tasks entirely on their own, or to opt for a supplementary approach that blends AI capabilities with human involvement. This choice plays a pivotal role in shaping the educational landscape, as it determines the extent to which technology should be seamlessly integrated into the learning process, aligning with the unique goals and requirements of each educational scenario. Take, for instance, a data science course: when integrating a code generation AI tool, consider the automation level. Decide whether to fully automate coding assignments, letting AI generate solutions, or to adopt a supplementary approach, blending AI capabilities with human involvement. This choice influences how students learn programming, aligning with the curriculum's goals, whether to emphasize algorithmic logic through manual coding or to leverage AI for rapid prototyping and problem-solving.
3. Ethical Focus: Give paramount importance to ethical concerns. Develop comprehensive guidelines and policies to safeguard the responsible and ethical usage of AI in the educational context. Educators can, for example, develop guidelines to ensure responsible and ethical usage of AI algorithms in assignments, particularly those involving sensitive data. This may involve creating policies addressing issues such as data privacy, bias mitigation, and transparency, fostering a learning environment that upholds ethical standards in AI applications.
4. Continuous Evaluation: It is essential to maintain an ongoing process of assessing the effectiveness of generative AI in education. This involves systematically monitoring how well the technology serves educational objectives, identifying areas for improvement, and adapting strategies as needed, ultimately contributing to improved outcomes and enhanced educational experiences. For example, in a data science course using generative AI for predictive modelling, maintain ongoing assessment. Systematically monitor how well the technology contributes to achieving predictive modelling objectives, identify areas for improvement, and adapt strategies as needed, ultimately enhancing students' data science skills.
5. Comprehensive Policies: In the integration of generative AI within educational settings, it is imperative to develop comprehensive guidelines and policies based on empirical evidence and best practices. By adopting an evidence-based approach, educational institutions can ensure that their policies are not only robust and compliant but also rooted in real-world outcomes and experiences, safeguarding both the educational mission and the well-being of all stakeholders. In a Software Engineering curriculum incorporating a code generation AI tool, establish comprehensive policies. Develop guidelines based on empirical evidence and best practices to govern the use of AI-generated code in assignments. These policies may address code review processes, plagiarism detection, and the ethical implications of AI-assisted programming, ensuring a fair, transparent, and secure learning environment for both students and instructors.
6. Stay Informed: Promote ongoing research and evaluation of AI integration in educational settings to stay informed about advancements, benefits, and risks associated with AI technology. For instance, in a university's Computer Science department, educators actively engage in continuous research on AI integration in coursework. This commitment ensures they stay informed about the latest advancements, benefits, and potential risks associated with AI technology, allowing for adaptive and optimized teaching methods based on the field's latest insights.

Internal Validity.
The study faces challenges related to the varied prompts used to generate AIGC with ChatGPT. While these prompts aim to simulate average user inputs, they may not fully mirror real-world scenarios. Additionally, ChatGPT is non-deterministic and can return different responses to the same prompt, which could impact the reproducibility of our results.
Another challenge pertains to verifying that the source code labelled as human-written is authentic. For instance, datasets sourced from platforms such as Kaggle could include content generated by LLM tools, blurring the distinction between human and AI contributions. However, since these datasets were released to the public before LLMs went mainstream, we estimate the impact to be minimal.
In specific scenarios, vague queries that elicit responses such as "I'm sorry, as an AI language model, I am unable to provide code as the question lacked specific details about the expression to be evaluated" pose a risk to internal validity, since ambiguous or open-ended coding questions can result in inaccurate AIGC. The question-prompt datasets therefore underwent preprocessing: rows containing responses such as the one above were removed, ensuring the integrity and accuracy of the dataset used in our study.
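As a sketch of this preprocessing step, generated responses can be screened for refusal phrases before they enter the evaluation set. The marker list and record layout below are illustrative assumptions, not the study's actual code:

```python
# Illustrative sketch of the preprocessing step: rows whose generated
# "response" contains a refusal phrase are dropped before evaluation.
# The phrase list and record layout are assumptions, not the paper's code.
REFUSAL_MARKERS = [
    "as an ai language model",
    "i am unable to provide code",
]

def drop_refusals(rows):
    """Return only the rows whose generated response is actual code."""
    def is_refusal(text):
        lowered = text.lower()
        return any(marker in lowered for marker in REFUSAL_MARKERS)
    return [row for row in rows if not is_refusal(row["response"])]

rows = [
    {"id": 1, "response": "def add(a, b):\n    return a + b"},
    {"id": 2, "response": "I'm sorry, as an AI language model, "
                          "I am unable to provide code."},
]
clean = drop_refusals(rows)
```

A substring check like this is deliberately conservative; in practice one would also drop empty responses and responses that fail to parse as Python.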

Construct Validity.
Construct validity concerns the interplay between theoretical constructs and empirical data. A potential source of bias is that platforms such as Kaggle curate and share datasets, and ChatGPT may have been trained on data similar to that used in this study. However, because ChatGPT generates diverse and contextually relevant content rather than simply replicating existing snippets, we consider this threat limited, safeguarding the objectivity and integrity of the research findings.

External Validity.
It is crucial to acknowledge that the conclusions drawn in this study might be specific to the datasets analyzed. To enhance generalizability, we meticulously curated our dataset by gathering data from diverse sources, aiming to replicate real-world software development scenarios as closely as possible. Furthermore, we concentrated on the Python programming language, a deliberate choice made to maintain consistency and control within our study.

RELATED WORK
This section discusses several studies and research articles that explore the detection of AI-generated content, particularly in academic and educational contexts.
The study by Otterbacher [27] highlights the need to develop a culture that promotes responsible and ethical use of generative AI in various domains, including science and education. Rather than relying solely on technical solutions to combat AI-generated content, the authors argue for a holistic approach that considers the broader implications of AI in academia. The work by Chaka [10], on the other hand, evaluated AI content detection tools and highlighted their limitations in accurately detecting AI-generated content, especially in academic contexts, where such limitations can lead to academic integrity issues, including plagiarism. Chaka's study [10] also emphasizes the need for ongoing research to evaluate the accuracy and reliability of AI content detection tools; improving their ability to detect AI-generated content is crucial for combating AI-generated plagiarism in academia. AIGC, notably code produced by ChatGPT, shows promise in software-related tasks but raises concerns in education, particularly regarding plagiarism [35].
The study by Adnan compared generated abstracts with original ones and assessed the impact of the instructions given to GPT-3.5 models on abstract quality [3]. The results warn against misuse and emphasize the risks of deploying ML tools in academia given a 1% false positive rate: at scale, such misclassification could wrongly reject significant research, potentially impacting humanity.
Similar to many related studies pointing out AI's inherent risks, Farrelly and Baker propose that AI, particularly LLMs, holds considerable disruptive potential in education and society [16]. Minority and international students face increased allegations of academic integrity breaches due to these technologies. Nonetheless, the technologies also offer valuable benefits, particularly for international students and individuals with disabilities.
In relation to our study, these works provide valuable insights into the limitations of existing AI content detection tools, especially in the context of AI-generated code.They underscore the need for a comprehensive evaluation of AIGC Detectors' performance and the development of ethical guidelines to ensure responsible AI-generated content use in education.

CONCLUSION AND FUTURE WORK
The rise of generative AI models presents both opportunities and challenges, particularly in education and programming. Our study assessed the effectiveness of AI content detection tools in identifying AI-generated code, revealing their limitations and offering insights for educators and students. We examined various AI content detection models using a dataset containing human-written and AI-generated code, evaluating them on metrics such as recall, precision, F1 score, accuracy, and AUC.
We found significant sensitivity in GLTR, as indicated by its remarkably low p-value of 0.0296 in the paired-samples t-test. This sensitivity led to notable accuracy discrepancies ranging from 0.4841 to 0.7693, exacerbated by specific code variants such as variants 2, 3, and 5. These findings underscore the imperative for immediate research efforts to enhance the reliability of AI-generated content detectors, safeguarding academic integrity in educational contexts. Addressing these challenges is pivotal if educators and institutions are to navigate the complexities introduced by AI-generated code and preserve the integrity of programming education.
Our findings suggest that these tools hold promise in distinguishing AIGC from human-written code but struggle with code complexity and variations in writing style, and that ethical guidelines for integrating AIGC into education are essential. Moreover, GLTR proved highly sensitive to the different variants introduced into the AIGC produced by ChatGPT, as evidenced by the smallest p-value when the paired-samples t-test is applied to the detectors' accuracies, and by its wide accuracy range.
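The paired-samples t-test used here compares matched measurements, e.g. two detectors' accuracies evaluated on the same prompt variants. Below is a minimal standard-library sketch of the t statistic; the accuracy values are illustrative, not the study's data:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of a paired-samples t-test: the two lists hold matched
    measurements, e.g. two detectors' accuracies on the same variants."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Illustrative accuracies for two detectors across four prompt variants
# (not the study's data).
detector_a = [0.77, 0.71, 0.65, 0.58]
detector_b = [0.75, 0.64, 0.60, 0.52]
t = paired_t_statistic(detector_a, detector_b)
# The p-value follows from the t distribution with n - 1 degrees of
# freedom (e.g. scipy.stats.ttest_rel computes both in one call).
```

A small p-value from this test indicates that a detector's accuracy shifts systematically across variants, which is how the sensitivity of GLTR is established here.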
Educators and institutions should adopt strategies for responsible AI usage, considering curriculum design, assessment methods, and ethical directives. Additionally, the long-term impact of AIGC Detector tools on student skill development and creativity in programming warrants further exploration. In conclusion, our research highlights the evolving landscape of AI-generated code and the role of AIGC Detectors in education. Emphasizing responsible AI usage, ethical guidelines, and ongoing tool refinement can empower students in a technology-driven world while preserving academic integrity and fostering creativity.
Future research should prioritize enhancing AI content detection models to handle a wider range of code variations and writing styles, bolstering their reliability in educational contexts. Investigating the long-term impact of AIGC Detector tools on students' learning and creative engagement in programming can provide valuable insights for educators. Additionally, exploring the adaptability of AI-driven content detection models to architectural design, software design, and UML diagrams will contribute to a more comprehensive understanding of their applicability across diverse domains, fostering the development of improved educational tools and strategies.

DATA AVAILABILITY
The replication package, along with the associated data, has been made publicly available.

Figure 1: Workflow of the AIGC and their variants' generation process, AIGC prediction with AIGC Detectors, and comparative analysis of AIGC Detector outputs with accuracy metrics.

Figure 4: TPR and TNR performance for Sapling across the different prompt variants.

Algorithm 1, DynCodeMetrics, plays a central role in coordinating the sub-algorithms that make up the workflow: dynamic prompt generation, AI-generated code post-processing, code detection, and metric computation. Taking as input the prompt datasets, the AIGC Detectors, and the variant modification method, it initializes collections for the variant prompts, the detection results, and the human and AI-generated codes. The procedure first applies CollectVariants to modify the prompts and stores the resulting variant prompts. It then employs GenerateCodes to create AI-generated code from the modified prompts and post-processes the resulting collection to clean and validate the dataset. DetectCodes next applies the AIGC Detectors to classify the codes as human- or AI-written, updating the results collection. Finally, the algorithm computes accuracy, precision, true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), and false negative rate (FNR) from the detection results and returns these metrics as its output.
• CollectVariants (Algorithm 2) takes the collection of prompt datasets, the variant modification method, and an empty collection of variant prompts. It iterates over each coding problem in the dataset, applies the variant modification method to the prompt, adds the modified prompt to the collection of variant prompts, and returns that collection.
• GenerateCodes (Algorithm 3) takes the collection of variant prompts and an empty collection of human and AI-generated codes. It iterates over each variant prompt, generates the specified number of AI codes for that prompt, and adds them to the collection, returning the combined collection of human- and AI-written code.
• DetectCodes (Algorithm 4) takes the AIGC Detectors, the post-processed collection of human and AI-generated codes, and an empty collection for the detection results (1 for human, 0 for AI). It iterates over each AIGC Detector, classifies each code in the post-processed data with the corresponding model, adds the results to the collection, and returns it.
Algorithm 1 thus combines Algorithms 2 to 4 to dynamically vary prompts, generate AI codes, perform code detection, and calculate the metrics used to evaluate the effectiveness and accuracy of each AIGC Detector.
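The workflow coordinated by DynCodeMetrics can be sketched in Python. The callables `modify`, `generate_code`, and `detect` below are hypothetical stand-ins for the variant modification method, the ChatGPT API, and an AIGC Detector, so this is an illustrative outline rather than the study's implementation:

```python
# Illustrative sketch of the DynCodeMetrics workflow; the callables
# passed in (modify, generate_code, detect) are hypothetical stand-ins.

def collect_variants(prompts, modify):
    """Algorithm 2: apply the variant modification method to every prompt."""
    return [modify(p) for p in prompts]

def generate_codes(variant_prompts, generate_code):
    """Algorithm 3: produce one AI-generated solution per variant prompt."""
    return [generate_code(p) for p in variant_prompts]

def detect_codes(codes, detect):
    """Algorithm 4: classify each sample (1 = human-written, 0 = AI)."""
    return [detect(c) for c in codes]

def compute_metrics(predictions, labels):
    """Accuracy, TPR, and TNR from predictions against ground truth."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    return {
        "accuracy": (tp + tn) / len(labels),
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "tnr": tn / (tn + fp) if tn + fp else 0.0,
    }

# Toy end-to-end run with stub callables.
variants = collect_variants(["Reverse a string."],
                            lambda p: p + " Please mimic a human response.")
codes = generate_codes(variants, lambda p: "def solve(): pass")
preds = detect_codes(codes, lambda c: 0)  # stub detector flags everything as AI
metrics = compute_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```

The metric definitions mirror the confusion matrix of Table 2: TPR measures how often human-written code is correctly recognized, TNR how often AI-generated code is flagged.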
3. Mimic Human: Appended "Please mimic a human response." to each prompt, guiding ChatGPT to generate contextually appropriate code that mimics human-written responses.
4. Solution Without Comment: Instructed ChatGPT to exclude comments from the solution, simulating cases where students omit comments in violation of good programming practice.
5. Assertion Test Code: Instructed ChatGPT to include assertion test code, assessing the AIGC Detectors' ability to identify syntactically valid code containing test assertions, a common approach to verifying program correctness.
6. Solution with Test Case: Instructed ChatGPT to include test cases, likewise evaluating the AIGC Detectors' ability to identify syntactically valid code that contains them.
7. Unittest Test Case: Instructed ChatGPT to include unittest test cases, again evaluating the AIGC Detectors' ability to identify syntactically valid code that contains them.
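Each of these variant modifications amounts to appending an instruction to the base prompt. In the minimal sketch below, only the "mimic human" suffix is the study's verbatim wording; the other suffixes and the dictionary keys are paraphrased assumptions:

```python
# Variant modification as prompt suffixes. Only the "mimic human" wording
# is taken verbatim from the study; the rest are paraphrased assumptions.
VARIANTS = {
    "mimic_human": "Please mimic a human response.",
    "no_comments": "Do not include any comments in the solution.",
    "assertion_tests": "Include assertion test code for the solution.",
    "with_test_cases": "Include test cases for the solution.",
    "unittest_cases": "Include unittest test cases for the solution.",
}

def build_variant_prompts(problem):
    """Expand one coding problem into all of its variant prompts."""
    return {name: f"{problem} {suffix}" for name, suffix in VARIANTS.items()}

prompts = build_variant_prompts("Write a function that reverses a string.")
```

Keeping the variants as data rather than code makes it straightforward to add or drop a variant without touching the generation pipeline.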

Table 1: Post-processed dataset size for each Detector across variants

Table 2: Confusion matrix used in this study

Table 3: Accuracy, TPR, and TNR of all five AIGC Detectors. Note: green indicates an increase compared to Variant 1, while red indicates a decrease.