CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X

Large pre-trained code generation models, such as OpenAI Codex, can generate syntactically and functionally correct code, making programmers more productive. In this paper, we introduce CodeGeeX, a multilingual model with 13 billion parameters for code generation. CodeGeeX is pre-trained on 850 billion tokens of 23 programming languages as of June 2022. Our extensive experiments suggest that CodeGeeX outperforms multilingual code models of similar scale on both code generation and translation tasks on HumanEval-X. Building upon HumanEval (Python only), we develop the HumanEval-X benchmark for evaluating multilingual models by hand-writing solutions in C++, Java, JavaScript, and Go. In addition, we build CodeGeeX-based extensions for Visual Studio Code, JetBrains, and Cloud Studio, generating 8 billion tokens per week for tens of thousands of active users. Our user study demonstrates that CodeGeeX helps increase coding efficiency for 83.4% of its users. Finally, CodeGeeX has been publicly accessible since Sep. 2022; we open-source its code, model weights, API, extensions, and HumanEval-X at https://github.com/THUDM/CodeGeeX.


Introduction
Given a description of human intent, such as "write a factorial function", can a machine automatically generate an executable program that addresses this need? This is the problem of automatic program writing, which has been explored since the early days of computer science in the 1960s (Waldinger and Lee, 1969; Summers, 1977). From LISP-based pioneering deductive synthesis approaches (Waldinger and Lee, 1969; Summers, 1977) to modern program synthesis systems (Solar-Lezama, 2008; Polozov and Gulwani, 2015), to end-to-end code generation via deep neural networks (Mou et al., 2015; Svyatkovskiy et al., 2020; Sun et al., 2020), tremendous efforts have been made to enable machines to automatically write correct programs as part of the quest for artificial general intelligence.
By treating programs as language sequences, neural sequential architectures, such as recurrent neural networks and the transformer (Vaswani et al., 2017), can be naturally applied to code generation. In fact, transformer-based techniques (Svyatkovskiy et al., 2020; Sun et al., 2020) began to generate code that is both syntactically correct and consistent by 2020, showing the potential of automatic program writing. This progress was significantly furthered when large language models (transformers with billions of parameters) met massive open-source code data. Notably, the OpenAI Codex (Chen et al., 2021) model (Python only) with 12 billion (12B) parameters pioneered and demonstrated the potential of large code generation models pre-trained on billions of lines of public code. By using the generative pre-training (GPT) strategy, Codex can solve introductory-level programming problems in Python with a high probability. Research studies (Ziegler et al., 2022) also show that 88% of users of GitHub Copilot, a paid service powered by Codex, feel more productive when coding with it. Since then, large pre-trained code models have been extensively developed, including DeepMind AlphaCode (Li et al., 2022), Salesforce CodeGen (Nijkamp et al., 2022), Meta InCoder (Fried et al., 2022), and Google PaLM-Coder-540B (Chowdhery et al., 2022).
In this work, we present CodeGeeX, a multilingual code generation model with 13 billion parameters, pre-trained on a large code corpus of 23 programming languages. It was trained on more than 850 billion tokens on a cluster of 1,536 Ascend 910 AI Processors between April and June 2022, and was publicly released in Sep. 2022 (cf. the GitHub repo). CodeGeeX has the following properties. First, unlike Codex (Chen et al., 2021), both CodeGeeX itself and the recipe for pre-training code models at this scale are open-sourced, facilitating the understanding and advancement of pre-trained code generation models. CodeGeeX also supports cross-platform inference on both Ascend and NVIDIA GPUs. Second, in addition to code generation and code completion as in Codex and others, CodeGeeX supports code explanation and code translation between language pairs (cf. Figure 1 (a)). Third, it offers consistent performance advantages over well-known multilingual code generation models of similar scale, including CodeGen-16B, GPT-NeoX-20B, InCoder-6.7B, and GPT-J-6B (cf. Figure 1 (b) and (c)).
We also build free CodeGeeX extensions for several IDEs, currently including Visual Studio Code, JetBrains, and Tencent Cloud Studio (a Web IDE). The extensions support several modes, including code completion, function-level generation, code translation, code explanation, and customizable prompting, to help users' programming tasks in real time. Since its release, CodeGeeX has seen tens of thousands of daily active users, each of whom makes on average 250+ API calls per weekday. As of this writing, the CodeGeeX model generates 4.7 billion tokens per week. Our user survey suggests that 83.4% of users feel the CodeGeeX extensions improve their programming efficiency.
Finally, we develop the HumanEval-X benchmark for evaluating multilingual code models because 1) HumanEval (Chen et al., 2021), developed by OpenAI for evaluating Codex, and other benchmarks (Austin et al., 2021; Hendrycks et al., 2021; Nijkamp et al., 2022) only consist of programming problems in a single language, and 2) existing multilingual datasets (Ren et al., 2020; Lu et al., 2021; Zhu et al., 2022) use string similarity metrics like BLEU (Papineni et al., 2002) for evaluation rather than actually verifying the functional correctness of generated code. Specifically, for each problem in HumanEval, defined only for Python, we manually rewrite its prompt, canonical solution, and test cases in C++, Java, JavaScript, and Go. In total, HumanEval-X covers 820 hand-written problem-solution pairs (164 problems, each having solutions in 5 languages). Importantly, HumanEval-X supports the evaluation of both code generation and code translation between different languages.
The contributions of this work can be summarized as follows:
• We develop and release CodeGeeX, a 13B pre-trained, 23-language code generation model that demonstrates consistent outperformance on code generation and translation over its multilingual baselines of the same scale.
• We build the CodeGeeX extensions for VS Code, JetBrains, and Tencent Cloud Studio. Compared to Copilot, they support more diverse functions, including code completion, generation, translation, and explanation. According to the user survey, CodeGeeX improves coding efficiency for 83.4% of its users.
• We hand-craft the HumanEval-X benchmark to evaluate multilingual code models for the tasks of code generation and translation in terms of functional correctness, facilitating the understanding and development of pre-trained (multilingual) code models.

The CodeGeeX Model
CodeGeeX is a multilingual code generation model with 13 billion (13B) parameters, pre-trained on a large code corpus of 23 programming languages. As of June 22, 2022, CodeGeeX had been trained on more than 850 billion tokens on a cluster of 1,536 Ascend 910 AI Processors for over two months.
We introduce the CodeGeeX model and its design choices. Although architectural choices define a model's inductive bias, the consensus is that it is computationally unaffordable to test different architectural designs for large pre-trained models (Brown et al., 2020; Chowdhery et al., 2022; Zhang et al., 2022; Zeng et al., 2022).

CodeGeeX's Architecture
The Transformer Backbone. Similar to recent pre-trained models, such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), and Codex (Chen et al., 2021), CodeGeeX follows the generative pre-training (GPT) architecture (Radford et al., 2018) with the decoder-only style for autoregressive (programming) language modeling. The core architecture of CodeGeeX is a 39-layer transformer decoder. In each transformer layer (Figure 2), we apply a multi-head self-attention mechanism (Vaswani et al., 2017) followed by MLP layers, together with layer normalization (Ba et al., 2016) and residual connections (He et al., 2016). We use an approximation of the GELU (Gaussian Error Linear Units) operation (Hendrycks and Gimpel, 2016), namely FastGELU, which is more efficient on the Ascend 910 AI Processor:

FastGELU(X_i) = X_i · exp(0.851 · (X_i − |X_i|)) / (1 + exp(−1.702 · |X_i|))

Generative Pre-Training Objective. By adopting the GPT paradigm (Radford et al., 2019; Chen et al., 2021), we train the model on a large amount of unlabeled code data. The principle is to iteratively take code tokens as input, predict the next token, and compare it with the ground truth. Specifically, for any input sequence {x_1, x_2, ..., x_n} of length n, the output of CodeGeeX is a probability distribution over the next token P(x_{n+1}|x_1, x_2, ..., x_n, Θ) = p_{n+1} ∈ [0, 1]^{1×v}, where Θ represents all parameters of the model and v is the vocabulary size. By comparing it with the real distribution, i.e., a one-hot vector y_{n+1} ∈ {0, 1}^{1×v} of the ground-truth token, we can optimize the cumulative cross-entropy loss:

L = − Σ_{n=1}^{N−1} y_{n+1} log P(x_{n+1}|x_1, x_2, ..., x_n, Θ)

The Top Query Layer and Decoding. The original GPT model uses a pooler function to obtain the final output. We instead use an extra query layer (Zeng et al., 2021) on top of all other transformer layers to obtain the final embedding through attention. As shown in Figure 2, the input of the top query layer replaces the query input X_in by the query embedding of position n + 1. The final output is multiplied by the transpose of the word embedding matrix to obtain the output probability. For decoding strategies, CodeGeeX supports greedy decoding, temperature sampling, top-k sampling, top-p sampling, and beam search. Finally, detokenization turns the selected token ID into an actual word.
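As a reference, the FastGELU activation above can be sketched in plain Python (a minimal scalar version; the model of course applies it elementwise to tensors). It is mathematically equivalent to the common sigmoid approximation x·σ(1.702x), but keeps every exponent non-positive, which avoids overflow for large inputs:

```python
import math

def fast_gelu(x: float) -> float:
    # Numerically stable form: for x >= 0 the numerator's exp factor is 1;
    # for x < 0 every exp argument stays non-positive, avoiding overflow.
    return x * math.exp(0.851 * (x - abs(x))) / (1.0 + math.exp(-1.702 * abs(x)))

def sigmoid_gelu(x: float) -> float:
    # The familiar sigmoid approximation: GELU(x) ~ x * sigmoid(1.702 * x).
    return x / (1.0 + math.exp(-1.702 * x))
```

For moderate inputs the two functions agree to machine precision; `fast_gelu` additionally handles very large negative inputs without overflowing.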
Figure 3: Language distribution and tags of CodeGeeX's data.

Pre-Training Setup
Code Corpus. The training corpus contains two parts. The first part comes from open-source code datasets: the Pile (Gao et al., 2020) and CodeParrot. The Pile contains a subset of public repositories with more than 100 stars on GitHub, from which we select files of 23 popular programming languages, including C++, Python, Java, JavaScript, C, Go, and so on. We identify the programming language of each file based on its suffix and the major language of the repository it belongs to. CodeParrot is another public Python dataset from BigQuery. The second part is supplementary data for Python, Java, and C++ scraped directly from GitHub public repositories that do not appear in the first part. We choose repositories that have at least one star and a total size within 10MB, and then filter out files that: 1) have more than 100 characters per line on average, 2) are automatically generated, 3) have a ratio of alphabetic characters below 40%, or 4) are bigger than 100KB or smaller than 1KB. We format Python code according to the PEP 8 standard.
Figure 3 shows the composition of the 158B-token training data, containing 23 programming languages. We divide the training data into segments of equal length. To help the model distinguish between multiple languages, we add a language-specific tag before each segment in the form of [Comment sign]language: [LANG], e.g., # language: Python.
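The tagging step can be sketched as follows (the helper name is ours; only the tag format, `[Comment sign]language: [LANG]`, comes from the text above):

```python
def tag_segment(segment: str, lang: str, comment_sign: str = "#") -> str:
    """Prepend the language-specific tag, e.g. '# language: Python',
    so the model can distinguish languages in the training data."""
    return f"{comment_sign} language: {lang}\n{segment}"
```

For a C++ segment one would pass `comment_sign="//"`, yielding a leading `// language: C++` line.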
Tokenization. The first step is to convert code snippets into numerical vectors. Considering that 1) there is a large number of natural language comments in code data, and 2) the names of variables, functions, and classes are often meaningful words, we treat code data the same as text data and apply the GPT-2 tokenizer (Radford et al., 2019). It is a BPE (Byte Pair Encoding) (Sennrich et al., 2015) tokenizer that deals with the open-vocabulary problem using a fixed-size vocabulary of variable-length character sequences. The initial vocabulary size is 50,000; following Chen et al. (2021), we encode runs of whitespace as extra tokens to increase encoding efficiency. Specifically, L whitespaces are represented by <|extratoken_X|>, where X = 8 + L. Since the vocabulary contains tokens from various natural languages, it allows CodeGeeX to process tokens in languages other than English, such as Chinese, French, Russian, Japanese, and more. The final vocabulary size is v = 52,224.
After tokenization, any code snippet or text description can be transformed into a vector of integers. More details can be found in Appendix A.2.
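The whitespace encoding can be illustrated with the sketch below. Only the mapping X = 8 + L comes from the text; the cap on run length (`MAX_RUN`) and the splitting of longer runs are our assumptions, since the actual set of extra tokens is fixed by the tokenizer's vocabulary:

```python
import re

MAX_RUN = 30  # assumed cap on run length; longer runs are split (our assumption)

def encode_whitespace(text: str) -> str:
    """Replace each run of L spaces (L >= 2) with <|extratoken_X|>,
    where X = 8 + L, as described above. Single spaces are left as-is."""
    def repl(m: re.Match) -> str:
        run = len(m.group(0))
        out = []
        while run >= 2:
            chunk = min(run, MAX_RUN)
            out.append(f"<|extratoken_{8 + chunk}|>")
            run -= chunk
        if run == 1:
            out.append(" ")
        return "".join(out)
    return re.sub(r" {2,}", repl, text)
```

For example, a 4-space indent becomes the single marker `<|extratoken_12|>` instead of four separate space tokens.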
The Input Word and Positional Embeddings. Given the tokens, the next step is to associate each token with a word embedding. By looking up the token ID in a word embedding matrix W_word ∈ R^{v×h}, where h = 5120 is the hidden size, a learnable embedding x_word ∈ R^h is obtained for each token. To capture positional information, we also adopt a learnable positional embedding that maps the current position ID to a learnable embedding x_pos ∈ R^h from W_pos ∈ R^{n_max×h}, where n_max = 2048 is the maximum sequence length. The two embeddings are then added to obtain the input embedding x_in = x_word + x_pos for each token. Finally, the entire sequence is turned into input embeddings X_in ∈ R^{n×h}, where n is the input sequence length.
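The two lookups and the sum can be sketched in NumPy (toy dimensions for illustration; CodeGeeX uses v = 52,224, h = 5,120, and n_max = 2,048, and the matrices are learned rather than random):

```python
import numpy as np

# Toy dimensions; the real model uses v = 52224, h = 5120, n_max = 2048.
v, h, n_max = 100, 8, 16
rng = np.random.default_rng(0)
W_word = rng.normal(size=(v, h))     # word embedding matrix, R^{v x h}
W_pos = rng.normal(size=(n_max, h))  # positional embedding matrix, R^{n_max x h}

def input_embeddings(token_ids):
    """x_in = x_word + x_pos at every position, giving X_in in R^{n x h}."""
    n = len(token_ids)
    x_word = W_word[np.array(token_ids)]  # lookup by token ID, shape (n, h)
    x_pos = W_pos[np.arange(n)]           # lookup by position ID, shape (n, h)
    return x_word + x_pos
```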

CodeGeeX Training
Parallel Training on Ascend 910. CodeGeeX was trained on a cluster of Ascend 910 AI Processors (32GB) with MindSpore (v1.7.0). We faced and addressed numerous unknown technical and engineering challenges during pre-training, as Ascend and MindSpore are relatively new compared to NVIDIA GPUs and PyTorch/TensorFlow. The entire pre-training process took two months on 192 nodes with 1,536 AI processors, during which the model consumed 850B tokens, equivalent to 5+ epochs (213,000 steps). Detailed configurations can be found in Table 2. Table 3 compares training efficiency before and after our optimization. The overall efficiency is measured in trained tokens per day. We observe that the efficiency per processor improved 3× compared to the non-optimized implementation, and the overall token throughput of the 1,536 processors improved by 224%.

Fast Inference
To serve the pre-trained CodeGeeX, we implement a pure PyTorch version of CodeGeeX that supports inference on NVIDIA GPUs. To achieve fast and memory-efficient inference, we apply both quantization and acceleration techniques to the pre-trained CodeGeeX.
Quantization. We apply post-training quantization techniques to decrease the memory consumption of CodeGeeX during inference. We transform the weights W of all linear transformations from FP16 to INT8 using common absolute-maximum quantization:

W_q = Round(W / λ),  λ = Max(|W|) / (2^{b−1} − 1),

where b is the bitwidth (b = 8) and λ is the scaling factor. This quantization transforms FP16 values in W to INT8 values in W_q. As shown in Table 4, the memory consumption of CodeGeeX decreases from ~26.9GB to ~14.7GB (down by 45.4%), allowing CodeGeeX inference on a single RTX 3090 GPU. Importantly, Figure 4 shows that the quantization only slightly affects performance on the code generation task (cf. Section 3.2 for details about HumanEval-X).

Acceleration. After quantization, we further implement a faster version of CodeGeeX using NVIDIA FasterTransformer (FastTrans). It supports highly-optimized operations such as layer fusion, GEMM autotuning, and hardware-accelerated functions. For the INT8 quantized version, we also implement a custom kernel that accelerates mixed-precision matrix multiplication between INT8 weights and FP16 activation vectors. According to Table 4, the INT8 quantization plus FastTrans implementation achieves the fastest inference speed and the lowest GPU memory consumption on a single GPU. The inference time per token is within 13ms (1.61 seconds / 128 tokens). We also compare the inference speed with implementations in LLM.int8() (Dettmers et al., 2022) and OneFlow (Yuan et al., 2021).
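Absolute-maximum quantization can be sketched in NumPy as follows (a per-tensor sketch; real deployments typically quantize per row or per channel, and the function names are ours):

```python
import numpy as np

def absmax_quantize(W: np.ndarray, b: int = 8):
    """Absolute-maximum quantization: scale by lambda = max|W| / (2^(b-1) - 1),
    then round to the nearest integer and store as INT8."""
    lam = float(np.abs(W).max()) / (2 ** (b - 1) - 1)  # scaling factor
    W_q = np.round(W.astype(np.float32) / lam).astype(np.int8)
    return W_q, lam

def dequantize(W_q: np.ndarray, lam: float) -> np.ndarray:
    """Recover an approximation of W; the rounding error is at most lambda/2."""
    return W_q.astype(np.float32) * lam
```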

The HumanEval-X Benchmark
We develop the HumanEval-X benchmark for evaluating multilingual code models. It contains 164 code problems defined for five major languages: C++, Java, JavaScript, Go, and Python, resulting in 164×5=820 problem-solution pairs. Each problem supports both code generation and code translation. Examples of the problems can be found in Appendix A.5.

HumanEval-X: A Multilingual Benchmark
HumanEval (Chen et al., 2021) was developed by OpenAI to evaluate Codex. However, similar to MBPP (Austin et al., 2021) and APPS (Hendrycks et al., 2021), it only consists of handcrafted programming problems in Python, and thus cannot be directly applied to systematically evaluate the performance of multilingual code generation.
To this end, we propose a multilingual variant of HumanEval, referred to as HumanEval-X. This is not trivial. For each problem in HumanEval, defined only for Python, we manually rewrite its prompt, canonical solution, and test cases in the other four languages: C++, Java, JavaScript, and Go. Altogether, HumanEval-X contains 820 problem-solution pairs, each comprising the following parts:
• task_id: programming language and numerical problem id, e.g., Java/0 represents the 0-th problem in Java;
• declaration: function declaration including necessary libraries or packages;
• docstring: description that specifies the functionality and example input/output;
• prompt: function declaration plus docstring;
• canonical_solution: a verified solution to the problem;
• test: test program including test cases.
Each problem-solution pair in HumanEval-X supports both code generation and code translation. An illustrative example is shown in Figure 5. We take the following steps to make sure that the rewritten code conforms to the programming style of the corresponding language. First, we use the customary naming styles, like CamelCase in Java, Go, and JavaScript, and snake_case in C++. Second, we put the docstrings before the function declaration in Java, JavaScript, C++, and Go. Symbols in docstrings are modified accordingly, e.g., single quotes are replaced by double quotes in some languages, and keywords like True/False and None are also replaced. Third, we refine test cases according to language-specific behaviors, rather than forcing the programs to return the same result for different languages. For example, when converting an integer to a binary string, the Python function bin adds a prefix "0b" before the string while the Java method Integer.toBinaryString does not, so we remove this prefix in Java test cases. Last, we also take care of the rounding function. In Python, round rounds half to the closest even number, unlike in other languages. Thus, we change the test cases to match the rounding implementation of each language.
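The two language-specific behaviors mentioned above can be demonstrated directly in Python:

```python
# Python's bin() keeps a "0b" prefix that Java's Integer.toBinaryString omits:
assert bin(10) == "0b1010"    # Java would produce "1010"
assert bin(10)[2:] == "1010"  # stripping the prefix matches the Java output

# Python's round() uses banker's rounding (half to the closest even number),
# unlike the half-away-from-zero behavior common in other languages:
assert round(0.5) == 0
assert round(1.5) == 2
assert round(2.5) == 2  # not 3
```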

HumanEval-X: Tasks
In HumanEval-X, we evaluate two tasks: code generation and code translation.
Code Generation. The task of code generation takes a problem description (e.g., "write a factorial function") as input and generates a solution in the selected language (cf. Figure 1 (a)). Specifically, the model takes in the prompt, including the declaration and docstring, and generates the implementation of the function. Note that HumanEval-X uses the same problem set for all five languages; thus, each problem can be solved in one single language or in multiple languages simultaneously.
Code Translation. The task of code translation takes the implementation of a problem in the source language and generates a counterpart implementation in the target language. Precisely, its input includes the function declaration and a canonical solution in the source language (e.g., Python).
The model should translate the solution into the target language. Including the declaration in the target language constrains function names and variable types, making the evaluation easier, especially under the zero-shot setting. To prevent the models from directly solving the problem rather than translating, we do not include the docstrings. HumanEval-X supports translation between all pairs of the 5 languages, i.e., 20 source-target language pairs in total.

Metric. For both tasks, we use test cases to evaluate the exact functional correctness of the generated code, measuring performance with pass@k (Kulal et al., 2019). This makes the evaluation practically meaningful and completely different from string similarity metrics like BLEU (Papineni et al., 2002) and CodeBLEU (Ren et al., 2020; Lu et al., 2021; Zhu et al., 2022). Specifically, we use the unbiased method to estimate pass@k (Chen et al., 2021):

pass@k = E_problems[1 − C(n−c, k) / C(n, k)],

where n is the total number of generations (n=200 in this work), k is the sampling budget (typically k ∈ {1, 10, 100}), and c is the number of samples that pass all test cases. The term 1 − C(n−c, k) / C(n, k) is the estimated pass@k for a single problem; in practice, we average the single-problem pass@k over all test-set problems to get the expectation E.
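The unbiased per-problem estimator of Chen et al. (2021) is commonly computed in a numerically stable product form, since the binomial coefficients overflow for large n. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k),
    where n samples were generated and c of them pass all test cases.
    Uses the stable identity C(n-c, k)/C(n, k) = prod_{i=n-c+1}^{n} (1 - k/i)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: some draw must succeed
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with n=2 generations of which c=1 passes, pass@1 = 1 − C(1,1)/C(2,1) = 0.5.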
Multilingual Metric with Budget Allocation. Unlike monolingual models, multilingual code models can solve problems by allocating the generation budget across languages to increase sampling diversity and improve the solve rate. Given a budget k, we can distribute part of it, n_i, to each language with the assignment π = (n_1, n_2, ..., n_m), where n_i is the generation budget assigned to language i and m is the number of candidate languages. Under an assignment π, for a problem p, pass@k_π can be estimated by:

pass@k_π = E_problems[1 − Π_{i=1}^{m} C(n−c_i, n_i) / C(n, n_i)],

where n is the total number of generations per language, n_i is the sampling budget, and c_i is the number of samples that pass all test cases for language i. We show in Section 4.3 that multilingual models can benefit from budget allocation strategies and achieve higher solve rates than with any single language.
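The budget-allocated estimator multiplies per-language failure probabilities: the problem stays unsolved only if, for every language i, none of the n_i drawn samples (out of n generated, c_i passing) is correct. A sketch (function name is ours):

```python
from math import comb

def pass_at_k_budget(n, cs, ns):
    """pass@k under assignment pi = (n_1, ..., n_m), with sum(ns) = k.
    n:  samples generated per language, cs[i]: passing samples for language i,
    ns[i]: budget drawn for language i."""
    fail = 1.0
    for c_i, n_i in zip(cs, ns):
        # Probability that all n_i draws for language i fail;
        # comb(n - c_i, n_i) is 0 when n_i > n - c_i, i.e. a pass is forced.
        fail *= comb(n - c_i, n_i) / comb(n, n_i)
    return 1.0 - fail
```

With a single language the formula reduces to the ordinary pass@k above.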
Evaluating CodeGeeX on HumanEval-X

We evaluate CodeGeeX on the code generation and translation tasks of the multilingual benchmark HumanEval-X. Since HumanEval-X inherits from HumanEval, its Python results are equivalent to evaluation on HumanEval.

Baselines. We compare CodeGeeX with five competitive open-source baselines: GPT-J-6B (Wang and Komatsuzaki, 2021), GPT-NeoX-20B (Black et al., 2022), InCoder-6.7B (Fried et al., 2022), and CodeGen-Multi-6B/16B (Nijkamp et al., 2022). These models are all trained on multilingual code data but were previously evaluated only on HumanEval (Python). They are close to the scale of CodeGeeX or even larger; smaller models in the literature are not considered. For all baselines, we use the versions available on HuggingFace (Wolf et al., 2019). We follow the experimental settings of HumanEval-X in Section 3.2. Further details can be found in Appendix A.3.
Environment. Experiments are conducted on NVIDIA A100-SXM-40GB GPUs running Linux. We design a distributed generation framework based on ZeroMQ to balance GPU loads. All generated code is tested in language-specific environments with the necessary packages installed.

Results of Code Generation and Translation
Multilingual Code Generation. Table 5 and Figure 6 report the code generation results in terms of pass@k, k ∈ {1, 10, 100}, for CodeGeeX and five baseline models on five programming languages. CodeGeeX significantly outperforms models trained on mixed corpora (GPT-J-6B and GPT-NeoX-20B), even though GPT-NeoX-20B has many more parameters. Among models trained on code, CodeGeeX outperforms those of smaller scale (InCoder-6.7B, CodeGen-Multi-6B) by a large margin, and is competitive with the larger CodeGen-Multi-16B. CodeGeeX achieves the best average performance among all models, slightly better than the larger CodeGen-Multi-16B on all three metrics (0.37%~1.67% improvements). When considering individual languages, each model's preferences are highly related to its training set distribution. We also find that CodeGeeX gains performance when the sampling budget is properly distributed across multiple languages.
Table 6: Results of the code translation task in HumanEval-X.
Figure 7: Left: the proportions of running results of four models for each language. Right: the average result ratios across four models, with lines representing minimum and maximum values. For each model and each language, we study 200 samples generated under t = 0.8 and p = 0.95.

Figure 7 shows the proportions of running results of the four models. For all languages, the most common error type is a wrong answer, with a ratio ranging from 0.44 to 0.75 except for Go, showing that code generation models at the current stage mainly suffer from incorrect code logic rather than incorrect syntax. Go samples have a high syntax error rate, which may be because Go imposes strict restrictions on syntax and forbids unused variables and imports, so many logically correct programs fail to compile.
CodeGeeX is less likely than the baselines to generate code with runtime, syntax, or semantic errors.

The Multilingual Pre-Training Helps Problem Solving
We perform studies to understand whether and how multilingual pre-training benefits the problem-solving ability of CodeGeeX.
Exploration vs. Exploitation under Fixed Budgets. Given a fixed budget k, pass@k evaluates the ability of a model to generate at least one correct solution within k generations. Previous works (Chen et al., 2021; Li et al., 2022) have discovered a trade-off between exploration and exploitation: when the budget is small, it is better to use a low temperature to ensure accuracy on easy problems; when the budget is large, a higher temperature is vital, as it makes the model more likely to find at least one solution for difficult problems.
Pass Rate Distribution vs. Languages. Unlike monolingual models, multilingual models can solve problems using various programming languages. In Figure 8, we observe that the pass rate distributions of problems across different languages are diverse. This inspires us to use budget allocation methods to improve the diversity of the generated solutions.
Budget Allocation Strategies. We compare three basic strategies: Best Single chooses the single language with the best performance; Uniform allocates the budget uniformly; Weighted allocates the budget to languages according to their proportions in the training corpus (detailed weights can be found in Appendix Table 9). Table 7 illustrates how budget allocation improves multilingual generation. Both Uniform and Weighted outperform Best Single by promoting more diverse generation, which gives a higher chance of solving problems. Weighted is slightly better due to its prior knowledge of the model. In the model-wise comparison, CodeGeeX shows a clear advantage over the other baselines under both strategies, which suggests that it may have a more diverse solution set across languages. Programming languages are created with specific purposes and unique designs; in real-world scenarios, multilingual models might exploit this advantage for certain tasks.

Negative Correlations in Pair-Language Translation. When evaluating translation ability on HumanEval-X, an interesting observation is that the performance of A-to-B and B-to-A translation is usually negatively correlated, as shown in Figure 9. This asymmetry suggests that multilingual code generation models may have an imbalanced focus on source and target languages during code translation. We offer two possible explanations. First, language distributions in the training corpus differ a lot, resulting in different levels of generation ability. For example, the ratio of Python is 26.6% (vs. 4.7% for Go) in the CodeGeeX training corpus, and the average pass@100 of Others-to-Python reaches ~90% (vs. only ~50% for Others-to-Go). Second, some languages are inherently harder to write automatically with syntactic and semantic accuracy due to language-dependent features, which affects their performance as translation targets. For instance, Go, into which models translate poorly, has more syntax-level constraints, forbidding unused variables and imports.

The CodeGeeX Tools and Users
Based on CodeGeeX, we build open-source extensions for IDEs including VS Code, JetBrains, and Cloud Studio. The extensions support code generation, completion, translation, and explanation, aiming to improve the development efficiency of programmers. As of this writing, CodeGeeX has served tens of thousands of users, with an average of 250+ API calls per active user per weekday. It currently generates 4.7+ billion tokens per week, a number that has been growing steadily since its release. We survey CodeGeeX's user experience with 168 users, covering front-end developers, back-end developers, full-stack engineers, algorithm engineers, students, researchers, and other programmers. Figure 10 illustrates the users' profession distribution and satisfaction scores. We evaluate satisfaction along five dimensions: "Ease of Use", "Reliability", "Feature", "Visual", and "Speed", each scored from 0 to 5. Figure 10 shows that the majority of users have positive experiences with CodeGeeX, especially researchers and students, while there is still room for improvement for professional developers. This can be explained by our training corpus: open-source repositories contain many introductory or research projects, while production code is often closed-source. To increase CodeGeeX's capability in professional domains, such code will be needed in the future.
We further investigate how the multilinguality of CodeGeeX helps coding. Figure 11 illustrates how users evaluate the helpfulness of CodeGeeX during development. On average, 83.4% of users think CodeGeeX improves or slightly improves their coding efficiency, especially for mainstream programming languages like Go, C++, Python, C, and C#. Note that these well-performing programming languages also appear more frequently in the training data (Figure 3), which encourages us to train CodeGeeX on more language-specific data to further enhance its capability.

Conclusion
We introduce CodeGeeX, a 13B pre-trained, 23-language code generation model, together with HumanEval-X, a benchmark built to fill the gap in multilingual code generation evaluation. CodeGeeX consistently outperforms open-source multilingual baselines of the same scale on code generation and translation tasks.
The extensions built on CodeGeeX bring significant benefits in coding efficiency. We open-source CodeGeeX to help researchers and developers widely benefit from large pre-trained models for code generation.
The multilingual ability of CodeGeeX shows the potential of solving problems using a ubiquitous set of formalized languages. Here, we share three of our observations as future directions.
First, we find that model capacity is essential for multilingual programming ability. It is not trivial for a model to benefit from learning multiple languages. Human programmers can abstract high-level programming concepts, so learning one language can help them master others. In contrast, the model seems to require a large capacity to store the knowledge of each language concurrently. How to help the model extract the most essential knowledge of programming remains a research challenge.
Second, similar to others, CodeGeeX shows reasoning potential as a model, though it lacks strong generality. We demonstrate that CodeGeeX can solve problems in different languages. However, its pass rate distribution varies a lot across languages, i.e., it is sometimes unable to solve the same problem in different languages. We assume this could be related to language-specific features (e.g., some problems are easier to solve in Python), or could simply be due to the appearance of a similar language-specific implementation in the training data. In either case, there is a long way to go before the model has reliable reasoning ability.
Third, the few-shot ability of CodeGeeX is worth exploring. Instead of using costly fine-tuning approaches, we may prime the model with a few examples and achieve comparable performance. Recent works like chain-of-thought (CoT) prompting (Wei et al., 2022) have shown impressive results with such approaches, inspiring us to examine CoT in code models.

Figure 1 :
Figure 1: Summary of CodeGeeX. (a): In supported IDEs, users can interact with CodeGeeX by providing prompts. Different models are used to support three tasks: code generation, code translation, and code explanation. (b) and (c): In HumanEval and our newly-proposed HumanEval-X, CodeGeeX shows promising multilingual abilities and consistently outperforms other multilingual code generation models.

Figure 2 :
Figure 2: CodeGeeX's model architecture. CodeGeeX is a code generation model with 13B parameters, consisting of 39-layer left-to-right transformer decoders and a top query layer. It takes text/code tokens as input and outputs the probability of the next token autoregressively.
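The autoregressive decoding loop described in the caption can be illustrated with a toy stand-in model: the network maps a token prefix to a distribution over the next token, which is appended and fed back in. The `toy_logits` function below is a deterministic placeholder for the real 39-layer decoder, not CodeGeeX itself.

```python
import numpy as np

VOCAB_SIZE = 8  # toy vocabulary; CodeGeeX's real vocabulary is far larger

def toy_logits(tokens):
    """Stand-in for the decoder: deterministic logits from the prefix."""
    rng = np.random.default_rng(sum(tokens) + len(tokens))
    return rng.normal(size=VOCAB_SIZE)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy_decode(prompt, steps):
    """Append the most likely next token, one step at a time."""
    tokens = list(prompt)
    for _ in range(steps):
        probs = softmax(toy_logits(tokens))  # P(next token | prefix)
        tokens.append(int(probs.argmax()))   # greedy choice
    return tokens

out = greedy_decode([1, 2, 3], steps=5)
print(out)
```

Sampling-based decoding (used for pass@k evaluation) replaces the `argmax` with a draw from `probs`, typically after temperature and top-p adjustments.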

Figure 4 :
Figure 4: CodeGeeX vs. its quantized version on code generation of HumanEval-X.

Figure 5 :
Figure 5: An illustration of the code generation and translation tasks in HumanEval-X. Declarations, docstrings, solutions, and test cases are marked in red, green, blue, and purple, respectively. Generation uses the declaration and docstring as input to generate the solution. Translation uses the declarations in both languages and the solution in the source language as input to generate the solution in the target language (the docstring is not used, to prevent models from directly solving the problem).
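The two task inputs in this caption can be assembled mechanically from a problem's parts. The field names and example snippets below are illustrative; the actual HumanEval-X data files use their own schema.

```python
def generation_prompt(declaration, docstring):
    """Code generation: declaration + docstring; the model writes the solution."""
    return declaration + docstring

def translation_prompt(src_declaration, src_solution, tgt_declaration):
    """Code translation: source declaration + solution, then the target
    declaration. The docstring is deliberately omitted so the model must
    translate the code rather than re-solve the problem from the description."""
    return src_declaration + src_solution + "\n" + tgt_declaration

# Hypothetical example problem in Python and C++:
py_decl = "def add(a, b):\n"
py_doc = '    """Return the sum of a and b."""\n'
py_sol = "    return a + b\n"
cpp_decl = "int add(int a, int b) {\n"

print(generation_prompt(py_decl, py_doc))
print(translation_prompt(py_decl, py_sol, cpp_decl))
```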

Figure 6 :
Figure 6: Results of the code generation task in HumanEval-X. Left: Detailed pass@k performance in five languages. Right: CodeGeeX achieves the highest average performance compared with other open-sourced multilingual baselines. We also find that it gains performance when the sampling budget is properly distributed across multiple languages.
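The pass@k metric reported in these results is commonly computed with the unbiased estimator introduced by Chen et al. (2021) for Codex: given n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k)/C(n, k). A minimal implementation:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples, drawn without replacement from n generations
    of which c are correct, passes all tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With 200 samples and 50 correct, pass@1 equals the raw success rate:
print(pass_at_k(200, 50, 1))  # 0.25
```

Averaging this quantity over all problems in a benchmark gives the reported pass@k score.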

Figure 9 :
Figure 9: The performance of translating A-to-B is negatively correlated with that of B-to-A. Such asymmetry indicates that multilingual models still lack a high-level understanding across languages.

Figure 11 :
Figure 11: Survey on "Has CodeGeeX improved your coding efficiency?". 83.4% of users gave positive answers.

Figure 17 :
Figure 17: Solutions (Problem 95 in HumanEval-X) translated by CodeGeeX. Prompt and generated codes are separated by the 'Translation' line (added after the generation as an indicator).

Figure 18 :
Figure 18: Solutions (Problem 109 in HumanEval-X) generated by CodeGeeX. Prompt and generated codes are separated by the 'Generation' line (added after the generation as an indicator).

Figure 19 :
Figure 19: Solutions (Problem 13 in HumanEval-X) generated by CodeGeeX. Prompt and generated codes are separated by the 'Generation' line (added after the generation as an indicator).

Figure 20 :
Figure 20: Solutions (Problem 142 in HumanEval-X) generated by CodeGeeX. Prompt and generated codes are separated by the 'Generation' line (added after the generation as an indicator).

Figure 21 :
Figure 21: Solutions (Problem 33 in HumanEval-X) translated by CodeGeeX. Prompt and generated codes are separated by the 'Translation' line (added after the generation as an indicator).

Figure 22 :
Figure 22: Examples of CodeGeeX generation with prompts in Chinese, French, Russian, and Japanese. Prompt and generated codes are separated by multiple '#'s (added after the generation as an indicator).

Table 1 :
Large pre-trained language models related to programming languages in the literature.

Table 2 :
Training configurations of CodeGeeX. To increase training efficiency, we adopt 8-way model parallelism together with 192-way data parallelism, with the ZeRO-2 optimizer (Rajbhandari et al., 2020) enabled to further reduce the memory consumption of optimizer states. The micro-batch size is 16 per node and the global batch size reaches 3,072. Specifically, we use the Adam optimizer (Kingma and Ba, 2014) to optimize the loss in Equation 2. The model weights are in FP16 format, except that we use FP32 for layer-norm and softmax for higher precision and stability. The model takes about 27 GB of GPU memory. We start from an initial learning rate of 1e-4 and apply cosine learning rate decay. During the two-month training, the training loss of CodeGeeX continues to decrease, and evaluating the checkpoints on the HumanEval-X code generation task shows continuously increasing performance. See Figures 13 and 14 in Appendix A.3 for details. Training Efficiency Optimization. Over the course of training, we actively optimized the MindSpore framework to release the power of the Ascend 910. Notably, we adopt the following techniques that significantly improve training efficiency: • Kernel fusion: We fuse several element-wise operators to improve calculation efficiency on the Ascend 910, including Bias+LayerNorm, BatchMatmul+Add, FastGeLU+Matmul, Softmax, etc. We also optimize the LayerNorm operator to support multi-core calculation. • Auto Tune optimization: When loading models, MindSpore first compiles them into static computational graphs. It uses the Auto Tune tool to optimize the choice of operators (e.g., matrix multiplication in different dimensions), and applies graph optimization techniques for operator fusion and constant folding.
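The cosine decay schedule mentioned above has a standard closed form; the sketch below uses the stated initial rate of 1e-4, while the minimum rate and total step count are illustrative placeholders, not the paper's actual values.

```python
import math

def cosine_lr(step, total_steps, lr_init=1e-4, lr_min=1e-6):
    """Standard cosine learning-rate decay from lr_init down to lr_min.
    lr_init = 1e-4 matches the initial rate in the text; lr_min and
    total_steps here are illustrative, not the paper's values."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0, 1000))     # starts at lr_init
print(cosine_lr(1000, 1000))  # ends at lr_min
```

In practice such a schedule is usually preceded by a short linear warmup phase before the cosine curve takes over.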

Table 3 :
Training efficiency (before and after optimization).

Table 4 :
GPU memory and inference time of CodeGeeX w/ and w/o quantization on different GPUs and frameworks.

Table 5 :
Results of code generation task in HumanEval-X.

Table 7 :
Results for fixed-budget multilingual generation on HumanEval-X.