xASTNN: Improved Code Representations for Industrial Practice

The application of deep learning techniques in software engineering has become increasingly popular. One key problem is developing high-quality and easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, due to deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, aiming to push this technique to industrial practice. The proposed xASTNN has three advantages. First, xASTNN is completely based on widely-used ASTs and does not require complicated data pre-processing, making it applicable to various programming languages and practical scenarios. Second, three closely-related designs are proposed to guarantee the effectiveness of xASTNN, including statement subtree sequence for code naturalness, gated recursive unit for syntactical information, and gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. Two code comprehension downstream tasks, code classification and code clone detection, are adopted for evaluation. The results demonstrate that our xASTNN can improve the state-of-the-art while being faster than the baselines.

Despite advancement, there are still restrictions that prevent the widespread adoption of existing code representation approaches in industry. Effectiveness, efficiency, and applicability are of particular concern. A recent study [50] has shown that the state-of-the-art approach ASTNN [52] incurs much computation time to embed the source code as vectors for code clone detection (e.g., 5.48× over SCDetector [50] and 3.09× over RtvNN [49]). Kang et al. [30] conduct an empirical study to demonstrate that the popular code representation approach code2vec [9] lacks generalizability and cannot be readily leveraged for downstream tasks. GNN-CDFG [11], GGNN [6], and HPG+HGT [53] require their elaborate graphs to represent code segments, leading to a strong dependency on the characteristics of programming languages. flow2vec [44], inst2vec [10], and DeepSim [54] are designed based on specific compilers such as LLVM [34] and WALA [1]. The above restrictions hinder the industrial applications of existing code representation approaches.
In the industrial practice of neural code representation, we often face a variety of production issues, such as constrained computing resources, limited response time, and abnormal data inputs. Therefore, code representation approaches are supposed to address the following three imperative challenges: (1) Design a model with notable effectiveness. The quality of code representations can directly influence the effectiveness of the model on downstream tasks. (2) Build a model that is as fast and lightweight as possible. It is unacceptable in the industry that software services have severe runtime delays or memory overflow problems. (3) Make the model applicable in various scenarios, e.g., independent of parsers to cater for different programming languages, and able to handle code segments of various sizes so as not to be trapped in gradient problems.
To this end, we propose an eXtreme abstract syntax tree (AST)-based neural network (xASTNN) for neural code representations in industrial practice. xASTNN is entirely based on common ASTs, which are always accessible and available with the source code. To guarantee the effectiveness of xASTNN, we first perform a preorder traversal upon the AST to convert it into a statement subtree sequence. Further, we present gated recursive unit (GRvU) for capturing the syntactical information of each subtree and gated recurrent unit (GRtU) for capturing the sequential information of the subtree sequence. The introduction of gating mechanism not only improves the generalizability of our approach, but also prevents our approach from the gradient problems encountered by the previous approach [52]. To optimize the computation time of our xASTNN, we describe a novel and highly efficient dynamic batching algorithm for the processing of subtrees, which enables parallel operations on the contents of the same tree depth within a batch of data samples. With the above designs, our xASTNN is effective and efficient, and can be widely used in industrial practice.
Extensive experiments have been carried out to validate the performance of our xASTNN. Two comprehension downstream tasks including code classification and code clone detection are applied for evaluation. The results demonstrate that our xASTNN is highly competitive as a code representation model for practical usage. Specifically, our xASTNN improves the state-of-the-art performance and achieves superb efficiency at the same time. For example, our xASTNN achieves an accuracy of 0.985 for the POJ dataset in the code classification task. The computation time of our xASTNN for representing one code segment is over 10× faster than the previous approach ASTNN.
To summarize, we have made the following contributions:
• (Applicability) Based on easily accessible ASTs, we present an innovative neural code representation approach named xASTNN for industrial practice.
• (Effectiveness, Applicability) We introduce gating mechanism to effectively capture the syntactical and sequential information of statement subtree sequence while avoiding gradient problems encountered by the previous approach.
• (Efficiency) We describe a highly efficient dynamic batching algorithm that greatly optimizes the time complexity of operations over trees.
• Our representation approach is evaluated on two common comprehension tasks: code classification and code clone detection. The results demonstrate that the proposed approach can outperform the state-of-the-art approaches and is more practical.
The remainder of the paper is organized as follows. Section 2 introduces the related work. Section 3 presents the proposed approach, including statement subtree sequence, gated recursive unit, gated recurrent unit, and dynamic batching algorithm. Section 4 evaluates the competing approaches. Section 5 illustrates the threats to validity. Section 6 discusses the lessons learned from the practice, and we conclude in Section 7.

RELATED WORK
The emergence of code representation learning mainly borrows concepts from natural language processing (NLP), and the source code is treated as a sequence of tokens. Raychev et al. [40] describe a recurrent neural network (RNN) and N-gram model for code completion. Allamanis et al. [7] train an N-gram model on the GitHub Java corpus for mining source code repositories. Further, they [4] propose a neural context model to suggest class and method names. CODE-NN [28] exploits LSTM [26] and neural attention to summarize code. However, the above approaches ignore the inherent syntactical features of programming languages, thus failing to produce good code representations.
To capture syntactical information of programming languages, recent research introduces ASTs as input to the representation models. AutoenCODE [49] adopts a recursive autoencoder over ASTs to learn the unsupervised program embeddings. TBCNN [38] proposes a tree-based convolution over the AST to represent the code segment. CDLH [48] introduces TreeLSTM [46] to generate the embeddings for code clone detection. Different from previous work that directly captures the syntactical information from ASTs, code2vec [9] first extracts a bag of leaf-to-leaf paths from the AST and aggregates these paths using the attention mechanism. code2seq [8] improves code2vec by introducing LSTM [26] to encode paths, which is applied in code summarization. A similar approach to ours is ASTNN [52], which encodes the source code by capturing both the lexical and syntactical knowledge of statements. Since code naturalness is well modeled, ASTNN achieves superb performance in code tasks. Nevertheless, ASTNN has been shown to be inefficient when applied to practical scenarios [50].
There are also many code representation approaches based on pre-training techniques. CodeBERT [22] considers the source code as a token sequence and applies the successful BERT [20] to pre-train the model. GraphCodeBERT [24] improves CodeBERT by introducing the data flow of the source code. InferCode [12] describes a self-supervised pre-training approach by predicting subtrees of ASTs. Mastropaolo et al. [36] pre-train a T5 model based on a dataset composed of English natural language and source code for code-related tasks. SPT-Code [39] presents a sequence-to-sequence pre-training pipeline for source code representations. However, recent evidence [31] points out that existing pre-training code representation approaches may fail in some code tasks and even achieve worse performance than BERT [20]. Besides, the pre-training techniques require a large-scale dataset and a large amount of computing resources, which can be a huge challenge for industrial practice.
Therefore, an effective, efficient, and general-purpose code representation approach is urgently needed. In this work, we propose xASTNN, aiming to resolve the challenges that existing approaches face in practical scenarios.

APPROACH
In this section, we present the proposed approach xASTNN. As shown in Figure 1, xASTNN consists of two phases. In the first phase, the code fragment is first transformed into an AST by using a common AST parser. A preorder traversal over the AST is applied to obtain a statement subtree sequence, in which each subtree corresponds to a statement of the code fragment. The first phase can be computed in advance when training xASTNN. In the second phase, we first adopt the popular word2vec [37] to embed the subtree sequence into a distributed space. To ensure the effectiveness and applicability of xASTNN, gating mechanism is introduced to incorporate the syntactical information of subtrees and sequential information of subtree sequence. Lastly, we exploit a max pooling layer to combine the subtree embeddings into the code vector.

Figure 1: Overview of the proposed approach. The first phase parses the code segment into a statement subtree sequence to introduce the code naturalness. The second phase combines the syntactical information and sequential information to enhance the embeddings of the subtree sequence and leverages a max pooling layer to produce code representations.

Statement Subtrees for Code Naturalness
The code naturalness is a hypothesis stating that software corpora have similar statistical properties to natural language corpora since the software is a form of human communication [5]. To capture the code naturalness, we first consider the source code as an AST and then transform it into a combination of statement subtrees as input to the neural network. AST is a widely used structure in the field of code representation approaches and can be easily accessed by common AST parsers. In this work, we exploit javalang [2] and pycparser [3] for the experiments of Java and C, respectively. Based on our empirical study, we opt to extract the subtrees at the statement level for the following two reasons.
• Previous work [52] has shown that statement subtree sequence can effectively capture code naturalness, thereby improving the quality of code representations. Besides, a subtree at the statement level is a good trade-off between the size of the subtree and the richness of syntactical information.
• The size of statement subtrees is approximately equal in comparison to subtrees of other granularities. For example, a subtree at the block level is made up of a varying number of statement subtrees, which can easily lead to an unstable subtree size and thus introduce efficiency concerns in practice (see Section 3.4).
Algorithm 1 illustrates how we transform an AST into a statement subtree sequence. This algorithm takes the root of an AST and the set of subtree root identifiers as input, and outputs the subtree sequence extracted from the AST. The subtree root identifiers help the preorder traversal recognize statements; they are generally logical tokens in programming languages, such as the While and If nodes that appear in Figure 2. They can be easily obtained when it comes to other programming languages.
The main body of Algorithm 1 consists of three procedures. It starts with initializing an empty sequence for storing subtrees (line 10) and then applies a preorder traversal to generate the statement subtree sequence (line 11), which is finally returned (line 12). The function preorderTraversal accepts the root node of an AST as input (line 1). When this root node belongs to the subtree root identifiers (line 2), we construct a statement subtree based on this root (line 3) and append this statement subtree to the subtree sequence (line 4). Then, for each child node of this root that does not belong to the statement subtree (line 6), we recursively perform preorderTraversal (line 7) to generate the other statement subtrees, i.e., statements inside program branches. To help understand how the source code is transformed into a statement subtree sequence, we present an example in Figure 2. We can observe that every subtree in the sequence corresponds to a statement in the code segment, which is obtained by preorder traversal and is in the order of: While, If, Compound, Assign(3), Assign(4), Assign(5), End, Else, Assign(8), Assign(9).
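The traversal described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Node` class, the helper `build_subtree`, and the toy identifier set are hypothetical and stand in for the node types of a real parser such as javalang or pycparser. Here a statement subtree keeps every descendant of its root up to, but not including, the next node whose token is a subtree root identifier.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Minimal stand-in for an AST node (hypothetical, not a parser class)."""
    token: str
    children: List["Node"] = field(default_factory=list)


def build_subtree(root: Node, identifiers: Set[str]) -> Node:
    """Copy `root`, cutting off descendants that start their own statement."""
    kept = [build_subtree(child, identifiers)
            for child in root.children
            if child.token not in identifiers]
    return Node(root.token, kept)


def preorder_traversal(root: Node, identifiers: Set[str],
                       sequence: List[Node]) -> None:
    """Append one statement subtree per identifier node, in preorder."""
    if root.token in identifiers:
        sequence.append(build_subtree(root, identifiers))
    for child in root.children:
        preorder_traversal(child, identifiers, sequence)


def statement_subtree_sequence(root: Node, identifiers: Set[str]) -> List[Node]:
    sequence: List[Node] = []
    preorder_traversal(root, identifiers, sequence)
    return sequence


# Toy AST for: while (x < 3) { x = ...; if (x > 1) y = ...; }
ast = Node("While", [
    Node("x<3"),
    Node("Compound", [
        Node("Assign", [Node("x")]),
        Node("If", [Node("x>1"), Node("Assign", [Node("y")])]),
    ]),
])
identifiers = {"While", "Compound", "Assign", "If"}
sequence = statement_subtree_sequence(ast, identifiers)
print([subtree.token for subtree in sequence])
```

In this sketch the While subtree keeps only its condition, because its body starts a new statement subtree of its own.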

Gated Recursive Unit over Tree Structures
Through the statement subtree sequence extracted from the AST, the code naturalness is incorporated. To better capture the syntactical information of the subtrees, we propose gated recursive unit (GRvU) based on previous work [14-16, 32, 46]. In contrast to these works, we simplify the complicated components of their mechanisms and make the recursive neural network sufficiently efficient for industrial practice. That is, the position-aware fully connected layer is removed to reduce time complexity and enable full parallelization.
Figure 3 introduces the workflow of the proposed GRvU. GRvU assigns each node of the subtree a hidden state, which records the bottom-up information of its child nodes. We further apply gating mechanism to incorporate the information of hidden states and inputs. Specifically, given a subtree node, its hidden state is the interpolation of the previously calculated hidden states of its children and its candidate hidden state. The interpolation is controlled by the update gate, which keeps a part of the hidden states of the children and the other part of the candidate hidden state. The update gate is calculated with two fully-connected layers followed by the sigmoid function, which maps the inputs to the interval from 0 to 1. Besides, we also introduce a reset gate to selectively filter the hidden states of the children. The reset gate is applied to choose important elements from the hidden states of children, and it is also activated by the sigmoid function. Combining the input of the current root and the gated hidden states of children, we obtain the candidate hidden state of the node, which contributes to generating its final hidden state. The candidate hidden state is activated by the hyperbolic tangent function. We consider the hidden state of the subtree root as its distributed representation. Therefore, by applying the proposed GRvU over each subtree of the subtree sequence, we can acquire the representation of the subtree sequence, which preserves the effective syntactical information of each statement and can significantly improve the generalizability of our xASTNN.
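As a rough sketch, a child-sum gated unit consistent with the description above could be written as follows. The notation is an assumption introduced here for illustration: node $j$ has input embedding $x_j$, its children are indexed by $k = 1, \dots, K$ with hidden states $h_k$, $W_\ast$ and $U_\ast$ are the weights of fully-connected layers, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise product. The placement of $z_j$ versus $1 - z_j$ is likewise assumed and may differ from the authors' formulation.

```latex
\begin{aligned}
\bar{h}_j   &= \sum_{k=1}^{K} h_k
            && \text{(child-sum of hidden states)} \\
z_j         &= \sigma\!\left(W_z x_j + U_z \bar{h}_j\right)
            && \text{(update gate)} \\
r_j         &= \sigma\!\left(W_r x_j + U_r \bar{h}_j\right)
            && \text{(reset gate)} \\
\tilde{h}_j &= \tanh\!\left(W_h x_j + U_h \left(r_j \odot \bar{h}_j\right)\right)
            && \text{(candidate hidden state)} \\
h_j         &= \left(1 - z_j\right) \odot \bar{h}_j + z_j \odot \tilde{h}_j
            && \text{(interpolated hidden state)}
\end{aligned}
```

Because the children enter only through their sum, the gates in this sketch do not depend on child position, which matches the stated removal of the position-aware layer.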

Gated Recurrent Unit for Subtree Sequence
We also adopt the well-acknowledged gated recurrent unit (GRtU or GRU) [17] to capture the sequential information of subtree sequence, which constitutes the code naturalness together with the aforementioned syntactical information. Given the outputs from GRvU, we adopt a standard bidirectional GRtU to learn the relation between these subtrees. The hidden states computed in the two directions are concatenated, which combines the subtree representations of both directions. Finally, both syntactical information and sequential information have been introduced into the subtree embeddings, thereby ensuring that code naturalness is well modeled. To obtain the final code representation, we feed the enhanced subtree embeddings into a max pooling layer, whose output is the code representation produced by our xASTNN. The max operation is utilized to choose the most important semantics of the enhanced subtree representations.
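The concatenation and pooling steps can be illustrated with a small, dependency-free sketch. The helper names `concat_directions` and `max_pool` and the toy vectors are hypothetical; the bidirectional GRtU itself is not implemented here, and its forward and backward outputs are simply given as lists of numbers.

```python
from typing import List

Vector = List[float]


def concat_directions(forward: List[Vector],
                      backward: List[Vector]) -> List[Vector]:
    """Concatenate forward and backward hidden states for each subtree."""
    return [f + b for f, b in zip(forward, backward)]


def max_pool(embeddings: List[Vector]) -> Vector:
    """Element-wise maximum over all subtree embeddings of one code segment."""
    return [max(column) for column in zip(*embeddings)]


# Two subtrees, hidden size 2 per direction (toy values, not model outputs).
forward_states = [[0.1, 0.5], [0.3, 0.2]]
backward_states = [[0.4, 0.0], [0.1, 0.9]]

enhanced = concat_directions(forward_states, backward_states)
code_vector = max_pool(enhanced)
print(code_vector)  # [0.3, 0.5, 0.4, 0.9]
```

The resulting code vector has a fixed size regardless of how many subtrees the sequence contains, which is what allows code segments of different lengths to be compared.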

Dynamic Batching Algorithm
In order to make our xASTNN competent in industrial practice, we propose an acceleration method named dynamic batching algorithm for GRvU, aiming to greatly improve the efficiency of the proposed approach. This algorithm allows completely parallel computation on subtree nodes of the same depth in a batch of data samples. The previous approach [52] also makes efforts to speed up the recursive network; however, it still suffers from incomplete parallel computation. Figure 4 contrasts the previous study [52] and ours. Given a batch of data samples, the previous approach [52] processes the children of subtrees one by one. In contrast, our dynamic batching algorithm supports full parallelism even in the width dimension, which is much faster than the previous approach.
The workflow of dynamic batching algorithm is shown in Algorithm 2. It takes a batch of subtree sequences as input, and outputs the corresponding batch of subtree embedding sequences that contain syntactical information. The main body of Algorithm 2 starts from line 14. We first initialize two empty lists for storing the length of each subtree sequence (line 14) and all subtrees of the batch (line 15), respectively. For each subtree sequence (line 16), we cache the number of subtrees within the subtree sequence (line 17) and flatten all subtrees into a single list (lines 18-20). By performing the bottomUp function (lines 1-13), we enhance the subtree embeddings with syntactical information (line 22) and recover them to the original form of a batch (lines 23-29). The recovery procedure starts with the initialization of a list to store the final result (line 23) and a record variable (line 24). Then, we extract the corresponding subtree embeddings (lines 26-29) according to the length of each subtree sequence (line 25).
The function bottomUp is designed for processing a batch of subtrees simultaneously. It accepts a list of subtrees as input (line 1) and outputs their embeddings learned by GRvU (line 13). This function initializes two lists for storing the root (line 2) and children (line 3) of each subtree. For each subtree in the list of subtrees (line 4), we extract the root (line 5) and its children (lines 6-8), which enables the algorithm to process the contents of the same tree depth within a batch of data samples. We apply embed to produce the embeddings of roots (line 10) and bottomUp to acquire the hidden states of the child subtrees (line 11). At last, the hidden states of the current roots can be calculated using GRvU (line 12). This recursion ultimately returns the embeddings of the subtrees that were initially provided to the function bottomUp (line 13).
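The level-by-level idea can be sketched in plain Python as follows. This is not the authors' implementation: the `Node` class, the `owner` bookkeeping list, and the toy `embed` and `cell` functions (scalars instead of vectors, a sum instead of GRvU) are assumptions chosen so that the example runs without a deep learning library. What it shows is that the cell is invoked once per tree depth for the whole batch rather than once per node.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    token: str
    children: List["Node"] = field(default_factory=list)


Embed = Callable[[List[str]], List[float]]
Cell = Callable[[List[float], List[float]], List[float]]


def bottom_up(subtrees: List[Node], embed: Embed, cell: Cell) -> List[float]:
    """Return one hidden state per root, using one batched call per depth."""
    if not subtrees:
        return []
    roots = [tree.token for tree in subtrees]
    children: List[Node] = []
    owner: List[int] = []  # owner[i] is the index of the parent of children[i]
    for index, tree in enumerate(subtrees):
        for child in tree.children:
            children.append(child)
            owner.append(index)
    inputs = embed(roots)                              # batched embedding
    child_states = bottom_up(children, embed, cell)    # next depth level
    summed = [0.0] * len(subtrees)                     # child-sum per parent
    for parent, state in zip(owner, child_states):
        summed[parent] += state
    return cell(inputs, summed)                        # batched cell call


def dynamic_batching(batch: List[List[Node]], embed: Embed,
                     cell: Cell) -> List[List[float]]:
    """Flatten the batch, run bottom_up once, then restore the batch shape."""
    lengths = [len(sequence) for sequence in batch]
    flat = [tree for sequence in batch for tree in sequence]
    states = bottom_up(flat, embed, cell)
    result, start = [], 0
    for length in lengths:
        result.append(states[start:start + length])
        start += length
    return result


calls: List[int] = []  # records the batch size of every cell invocation


def toy_embed(tokens: List[str]) -> List[float]:
    return [float(len(token)) for token in tokens]


def toy_cell(inputs: List[float], summed: List[float]) -> List[float]:
    calls.append(len(inputs))
    return [x + h for x, h in zip(inputs, summed)]


batch = [
    [Node("a", [Node("bb"), Node("c")])],
    [Node("dd"), Node("e", [Node("f", [Node("g")])])],
]
embeddings = dynamic_batching(batch, toy_embed, toy_cell)
print(embeddings, calls)
```

With this toy batch, seven nodes are processed in three cell calls, one for each depth level. In a real model each call would be a single tensor operation over all nodes at that depth.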
From the complexity analysis of the dynamic batching algorithm, we can conclude that the overall time complexity of the proposed algorithm is linear in the batch size and the AST size. Here, the batch size is a hyperparameter set by developers, whereas the AST size and the other quantities in the analysis are influenced by the data and by the granularity used to transform the AST into subtrees. Reducing one of these quantities may enlarge the others in the experiments. In our empirical analysis, we found that transforming the AST into statement subtrees can advance the generalizability while guaranteeing the efficiency. The rationale might be that statement subtree sequences can well capture the code naturalness and have stable subtree depth. Moreover, the computation time is also affected by a few exceptional circumstances. When some subtrees within a batch have a significantly large depth, our dynamic batching algorithm must conduct extra executions for these subtrees. In practice, we could put the processing of these subtrees together to make the subtree depth as balanced as possible, which we leave for future work.

Differences from ASTNN
Our xASTNN spares no efforts to improve the previous code representation approach for industrial practice. The differences between the previous ASTNN and our xASTNN are as follows.
We adopt gating mechanism throughout the encoding pipeline. Specifically, a child-sum gated recursive unit named GRvU is proposed to encode the syntactical information of statement subtrees. We deliberately make the gates position-insensitive, reducing the space and time complexity for industrial practice. The gates refine the latent information of the code segment, consequently guaranteeing the effectiveness of our xASTNN.
We introduce the gating mechanism also for applicability. The previous approach simply leverages a fully-connected layer to capture the information of subtrees, which can be easily trapped in gradient problems. When we perform back propagation to train the model, the gradient of the loss function with respect to the parameters of this fully-connected layer depends on the hidden state of a child of the current node, on the hidden state of a child of that child, and so on down the subtree.
We can observe that the gradient is composed of multiplications of many terms due to the recursive processing of child nodes. The accumulation of these terms may easily lead to a gradient of 0 or infinity, making it hard to train the model. In our scenarios, when the length of the code segment is large (e.g., 100), the gradient problems of ASTNN become obvious. By contrast, the introduction of gating mechanism in our approach can relieve the gradient problems. A proof can be found in [26].
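The effect of multiplying many terms can be seen with a toy calculation. The per-step factors 0.9 and 1.1 and the chain length of 100 are illustrative assumptions only; they are not measurements taken from ASTNN or xASTNN.

```python
def product_of_factors(factor: float, steps: int) -> float:
    """Multiply the same factor `steps` times, as a chain of gradients would."""
    result = 1.0
    for _ in range(steps):
        result *= factor
    return result


shrinking = product_of_factors(0.9, 100)  # factors slightly below 1
growing = product_of_factors(1.1, 100)    # factors slightly above 1

print(f"{shrinking:.2e}")  # 2.66e-05
print(f"{growing:.2e}")    # 1.38e+04
```

Even factors that are close to 1 drive the product toward 0 or toward very large values after 100 steps, which is the vanishing or exploding behavior described above.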
Besides, we spare no efforts to optimize the time and space efficiency of our xASTNN. We implement a highly efficient and memory-friendly model compared with the previous approach. That is, our model can still achieve time acceleration and space reduction even without any dedicated acceleration algorithm. To further accelerate our approach, in this work, we present a dynamic batching algorithm for the processing of subtrees. It adopts more parallel operations and can speed up the model when we feed a batch of data samples.

EVALUATION
We validate the proposed approach in three aspects.
• How effective is the proposed approach? In industrial applications, the effectiveness of code representation is of importance. A high-quality representation can lead to better performance in code-related downstream tasks. By comparing metrics of our approach and the baselines with two program comprehension tasks, we evaluate the effectiveness of the proposed approach.
• What is the efficiency of our approach? When applying the representation model in practice, the efficiency is a huge challenge. From the perspective of time and space efficiency, we measure the practical usability of each approach.
• What are the effects of different designs for the proposed xASTNN? This research aspect plays a key role in the refinement of the model. We explore the performance of alternative designs of our approach by ablating or adjusting some designed modules. The results are organized and analyzed.

Experimental Setup
4.1.1 Datasets. We conduct two downstream tasks including code classification and code clone detection for evaluation. Code classification measures the fundamental ability to comprehend programs, and code clone detection measures the ability to compare two code segments. Table 1 gives a detailed description of the datasets used, with the statistics of code segments, categories, tokens, AST depth, and AST nodes.
In the code classification task, we adopt a widely used [10,12,25,38,43,51,52] public dataset named POJ [38] to measure the quality of code representations. This dataset is collected from a pedagogical programming open judge system, which consists of a large number of programming problems. Students submit their source code as solutions; the judge system automatically validates the correctness of the solutions. POJ contains 104 programming problems, which are considered as the categories to be predicted. Each problem contains 500 C programs, which are considered to belong to the same class. We randomly divide the total 52,000 programs into training, validation, and testing sets with a proportion of 3:1:1.
We exploit two widely used datasets for the code clone detection task: BigCloneBench [45] and OJClone, which are also used in [12,25,48,52,54]. BigCloneBench is a handcrafted dataset that consists of known true and false positive clones. It was built by first mining and then manually checking clones of ten common functionalities, with 3 judges over 216 hours of manual validation efforts. BigCloneBench is collected from 25,000 systems and covers 10 functionalities including 6,000,000 true clone pairs. Similar to previous work [12,48,52], we randomly select 100 thousand samples for the convenience of evaluation. We have manually checked that most of the code segments within BigCloneBench are methods, which is different from another code clone detection dataset, OJClone, where code segments are generally functionalities implemented by multiple methods. OJClone derives from POJ automatically. As [12,25,48,52,54] did, we choose the first 15 programming problems from POJ, which produces 15 × 500 = 7500 code segments. In OJClone, two segments from the same programming problem form a clone pair; otherwise, they form a non-clone pair. This yields about 28 million pairs of code segments, making it immensely time-consuming to conduct experiments. As in [12,48,52], we randomly select 50 thousand samples instead. The OJClone dataset is generated completely automatically, without manual checking by experts. Most of the clone pairs within OJClone are syntactically dissimilar, so the comparative ability of the competing approaches can be well measured.
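The dataset sizes quoted above follow from simple counting, which the short sketch below reproduces. The variable names are hypothetical, and the split sizes assume that the 3:1:1 proportion divides the programs exactly; the source does not state the resulting set sizes.

```python
from math import comb

# POJ: 104 problems with 500 programs each, split 3:1:1.
poj_programs = 104 * 500
train_size = poj_programs * 3 // 5
valid_size = poj_programs * 1 // 5
test_size = poj_programs * 1 // 5

# OJClone: first 15 problems, every unordered pair of segments is a sample.
oj_segments = 15 * 500
all_pairs = comb(oj_segments, 2)
clone_pairs = 15 * comb(500, 2)      # both segments from the same problem
non_clone_pairs = all_pairs - clone_pairs

print(poj_programs, train_size, valid_size, test_size)
print(all_pairs, clone_pairs, non_clone_pairs)
```

Under this counting, roughly 28 million pairs exist in total, of which fewer than 2 million are clone pairs, so sampling 50 thousand pairs keeps the experiments tractable.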

Metrics.
In the code classification task, we choose accuracy on the test set as the evaluation metric. It represents the proportion of data samples that are correctly classified.
In the code clone detection task, we apply precision, recall, and F1 score as the metrics. Precision indicates how many of the predicted clone pairs are really clone pairs. Recall indicates how many of the true clone pairs are correctly predicted. F1 score is the harmonic mean of precision and recall.
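These metrics can be computed directly from labels and predictions, as in the minimal sketch below. The function names and the toy labels are hypothetical; label 1 denotes a clone pair (or a correct class match for accuracy) and 0 denotes a non-clone pair.

```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Proportion of samples whose prediction equals the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def clone_metrics(y_true: Sequence[int],
                  y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 0, 0, 1, 0]
precision, recall, f1 = clone_metrics(labels, predictions)
print(accuracy(labels, predictions), precision, recall, f1)
```

In this toy example the classifier finds 2 of the 4 true clone pairs and raises 1 false alarm, giving a precision of 2/3, a recall of 1/2, and an F1 score of 4/7.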

Baselines.
We consider the following approaches as the baselines for code classification. Some of them can also be applied in the code clone detection task.
• SVM+N-gram [19]: A machine learning-based approach that incorporates SVM and N-gram for code classification.
• Transformer [47]: A recent popular approach that introduces the parallel attention mechanism and deep residual block to encode the tokens.
• CodeBERT [22]: A bimodal pre-training approach for natural language and programming language based on the Transformer neural architecture.
• TreeLSTM [46]: A novel approach that applies LSTM over ASTs to produce code representations.
• TBCNN [38]: A novel approach that applies convolutions over ASTs to produce code representations.
• code2vec [9]: A recent popular code representation approach that first extracts AST paths from the AST and then applies attentional aggregation upon these paths to produce the code representation.
• ASTNN [52]: A novel superior approach that follows a pipeline similar to ours, with lower generalizability and a serious efficiency problem.
• InferCode [12]: A self-supervised pre-training approach, which pre-trains a TBCNN model by predicting subtrees.
• inst2vec [10]: A novel approach that first constructs the conteXtual flow graph and applies RNN to produce the code representation.
• GGNN [6]: A well-designed graph neural approach, which first adds edges to ASTs and adopts GGNN [35] to represent the source code.
• GraphCodeBERT [24]: An upgraded version of CodeBERT that considers data flows of programs.
We also select three representative approaches designed particularly for code clone detection.
• Deckard [29]: A traditional approach that utilizes subtrees to identify code clones efficiently.
• SourcererCC [41]: A traditional token-based code clone detection approach, which can detect exact and near-miss clones efficiently.
• CDLH [48]: A deep learning-based clone detector that adds a hash function to TreeLSTM for code clone detection.
All the experiments are conducted on a 64-bit platform equipped with a 12-core Intel(R) i7-12700KF CPU @ 3.60GHz, 128GB of RAM, and a 24GB RTX 3090 GPU.

Effectiveness Assessment with Two Tasks
4.2.1 Code Classification. We exploit the code classification task to measure the fundamental ability of the models to comprehend the programs. In this task, three categories of baselines are considered, including two token-based approaches, five AST-based approaches, and two graph-based approaches. Among them, CodeBERT and InferCode are representative pre-training techniques designed for programming languages. Table 2 reports the test-set accuracy of the competing approaches.
It can be seen that our xASTNN outperforms all baselines, achieving the highest accuracy of 0.985. Its advantages in accuracy over the baselines are significant, improving token-based approaches by 1.03% to 16.29%, AST-based approaches by 0.31% to 14.42%, and graph-based approaches by 0.31% to 3.6%. This validates that our approach is effective in learning program semantics and can produce good code representations.
We find that the performance of the token-based approaches differs in test accuracy. SVM+N-gram achieves the lowest accuracy in comparison with other approaches, showing that the use of tokens alone is not sufficient for code representations. Transformer has a very strong fitting ability among token-based approaches, achieving an accuracy of 0.907. Further, by pre-training the Transformer neural architecture based on a large-scale bimodal task, CodeBERT improves the accuracy of the vanilla Transformer to 0.975. To improve the comprehension ability of models, many approaches incorporate ASTs for capturing the syntactical information of the source code. As three representative code feature learning approaches, TreeLSTM, TBCNN, and code2vec do not reach high accuracy, with 0.860, 0.940, and 0.913, respectively. This is because they do not have well-designed modules for processing ASTs and simply incorporate the syntactical information by LSTM, convolutions, and leaf-to-leaf paths. In contrast, the accuracy achieved by ASTNN and InferCode is high. ASTNN is an AST-based approach that introduces code naturalness by statement subtree sequence, and InferCode extensively pre-trains its TBCNN model by predicting subtrees.
As for the graph-based approaches, it can be seen that both inst2vec and GGNN achieve relatively high accuracy. They both represent the source code by constructing flow graphs. The difference is that inst2vec adopts RNN to encode flow graphs while GGNN uses a graph neural network. The pre-trained approach GraphCodeBERT achieves a high accuracy of 0.982, showing that the introduction of syntactical information can improve the performance of the token-based CodeBERT.

Code Clone Detection.
To sufficiently evaluate the effectiveness of approaches, we also introduce code clone detection, a task that is widely used in code refactoring and vulnerability detection.We consider two lines of approaches as baselines, including code representation approaches and specialized clone detectors.The results are shown in Table 3.
Comparing with eight baselines, we can find that the effectiveness of our xASTNN is the most superior on two datasets.xASTNN achieves an F1 score of 0.966 on BigCloneBench and 0.992 on OJ-Clone, improving the leading baseline GraphCodeBERT by 1.68% on BigCloneBench and 0.40% on OJClone.
Two traditional approaches, Deckard and SourcererCC, are unable to handle syntactically dissimilar code clones, as revealed by their recall. Their precision is high, however, which illustrates that they can precisely distinguish syntactically similar code clones. As for deep learning-based approaches, code2vec, TBCNN, and CDLH all achieve relatively high F1 scores compared with the traditional approaches. The advantages of these approaches come from the deep learning features. Nevertheless, ASTNN, CodeBERT, and GraphCodeBERT perform remarkably well in this task, which shows that their fitting ability is superior to that of the other deep learning-based approaches. An interesting phenomenon is that ASTNN, CodeBERT, and GraphCodeBERT have high precision and low recall in most cases, suggesting that they tend to make confident predictions.
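The discussion above turns on the relation between precision, recall, and F1. As a minimal sketch, the standard definitions can be written as follows. The function name and the confusion counts are hypothetical and are not taken from Table 3; they only illustrate how a detector with high precision and low recall ends up with a moderate F1 score.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from clone-pair confusion counts.

    tp: clone pairs correctly reported as clones
    fp: non-clone pairs wrongly reported as clones
    fn: clone pairs that the detector missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for a "confident" detector: few false alarms, many misses.
p, r, f1 = precision_recall_f1(tp=90, fp=2, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, so low recall limits the F1 score even when precision is near 1.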

Conclusion 1:
By evaluating our approach and the baselines on two code comprehension tasks, we find that our xASTNN generalizes well and is effective.

Efficiency of Models
To evaluate the time efficiency of models, we measure their computation time from the acceptance of a code segment to the output of its representation. The reciprocal of this metric is known as the prediction rate, which reflects the number of data samples a model can process in one second. This experiment is conducted on a broad corpus of real-world programming languages. A batch of data samples is fed into the code representation model, and the average computation time for representing one code segment is reported. Figure 5 shows the computation time of the baselines along with our approach and its variants (the batch size is set to 64). Our approach takes 0.23 ms to represent a code segment, which is faster than all the baselines. It is 1.8× faster than code2vec, 2.2× faster than CodeBERT, 2.7× faster than GraphCodeBERT, 9.3× faster than ASTNN, and 21.9× faster than TreeLSTM.
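The measurement described above can be sketched as follows. This is a minimal illustration, not the authors' benchmarking code: `avg_ms_per_sample`, `prediction_rate`, `speedup`, and the `encode` callable are hypothetical names, and the timing loop assumes a CPU-side model (on a GPU, the device would have to be synchronized before reading the clock).

```python
import time
from typing import Callable, Sequence


def avg_ms_per_sample(encode: Callable[[Sequence[str]], object],
                      batch: Sequence[str],
                      repeats: int = 10) -> float:
    """Average time in milliseconds to represent one code segment.

    `encode` is a hypothetical stand-in for any code representation model
    that maps a batch of code segments to their representations.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        encode(batch)
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / (repeats * len(batch))


def prediction_rate(ms_per_sample: float) -> float:
    """Reciprocal of the computation time: samples processed per second."""
    return 1000.0 / ms_per_sample


def speedup(baseline_ms: float, ours_ms: float) -> float:
    """How many times faster `ours_ms` is than `baseline_ms`."""
    return baseline_ms / ours_ms


# The reported 0.23 ms per code segment corresponds to
# about 4348 samples per second.
print(round(prediction_rate(0.23)))
```

Under these definitions, a speedup ratio is simply the baseline's per-sample time divided by that of xASTNN, measured at the same batch size.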
The computation time of the baselines is higher, ranging from 0.41 ms to 5.02 ms. code2vec is fast because it represents source code by encoding a bag of AST paths in parallel. CodeBERT processes the tokens of source code in parallel, and GraphCodeBERT adds the processing of data flows to CodeBERT. The computation time of ASTNN and TreeLSTM is long because they process syntactical information without an algorithm designed to do so efficiently. The two variants with different subtree granularity also differ in computation time. Here, we consider the two extreme cases: the largest subtree, based on the whole AST, named xASTNN (AST), and the smallest subtree, based on individual AST nodes, named xASTNN (token). As the subtree size gets smaller, the computation time for representing one code segment decreases. The reason is that smaller subtrees need fewer recursive operations, and these operations take up a large portion of the overall computation time.
In addition to the experiment on average computation time, we conduct another experiment to measure the effect of batch size on the time and space efficiency of the previous approach and our approach, aiming to validate the performance of the proposed dynamic batching algorithm. Figure 6 illustrates the results. Note that when the batch size is larger than 128, the previous approach ASTNN suffers from a memory overflow problem. Hence, we do not report the corresponding results.
In terms of time efficiency, our approach is faster than the previous approach at all batch sizes, indicating the superiority of our dynamic batching algorithm. As the batch size increases, the computation time of both approaches decreases. However, their rates of decline are different. The speedup ratio first increases and then decreases as the batch size grows. The peak speedup ratio (i.e., 10.5×) occurs when the batch size is around 32. The rationale for this phenomenon can be inferred from their time complexity (see Figure 4). As the batch size increases, many subtrees of different depths appear in the batch and dominate the time complexity. Our approach then has to spend considerable time on these special subtrees, which attenuates the speedup ratio. Theoretically, if similar-sized subtrees are processed together, this problem will be significantly alleviated. We leave it for future work.
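The idea of processing similar-sized subtrees together can be illustrated with a small sketch. This is not part of xASTNN and not the authors' algorithm. It assumes a hypothetical tree layout of `(label, children)` tuples and a simplified cost model in which the number of sequential recursive steps for a batch equals the depth of its deepest subtree.

```python
from typing import List, Sequence, Tuple

# Hypothetical subtree layout: (node label, list of child subtrees).
Tree = Tuple[str, list]


def depth(tree: Tree) -> int:
    """Depth of a subtree; a single node has depth 1."""
    _label, children = tree
    return 1 + max((depth(child) for child in children), default=0)


def batch_in_order(subtrees: Sequence[Tree], batch_size: int) -> List[List[Tree]]:
    """Form batches in arrival order, mixing shallow and deep subtrees."""
    items = list(subtrees)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def batch_by_depth(subtrees: Sequence[Tree], batch_size: int) -> List[List[Tree]]:
    """Sort by depth first so that each batch holds similar-sized subtrees."""
    return batch_in_order(sorted(subtrees, key=depth), batch_size)


def sequential_steps(batches: Sequence[Sequence[Tree]]) -> int:
    """Assumed cost model: each batch costs as many steps as its deepest subtree."""
    return sum(max(depth(t) for t in batch) for batch in batches)


def chain(n: int) -> Tree:
    """Build a degenerate subtree of depth n for the example."""
    tree: Tree = ("leaf", [])
    for _ in range(n - 1):
        tree = ("node", [tree])
    return tree


subtrees = [chain(1), chain(5), chain(1), chain(5)]
print(sequential_steps(batch_in_order(subtrees, 2)))  # mixed depths per batch
print(sequential_steps(batch_by_depth(subtrees, 2)))  # similar depths per batch
```

In this toy example, the mixed batches each pay for a depth-5 subtree, while the depth-sorted batches pay for it only once, which is the effect the paragraph above anticipates.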
In terms of space efficiency, our approach uses less memory than the previous approach at all batch sizes. The GPU usage of both approaches increases quickly as the batch size increases. When the batch size is equal to 1, our approach still has the advantage in GPU usage, showing its superiority in processing a single data sample. A similar phenomenon can be found in terms of time efficiency. This is because we optimize the implementation of our xASTNN to make it as lightweight as possible for industrial practice. Hence, when the batch size is large, the improvement in GPU usage of our approach becomes more pronounced. Additionally, an interesting phenomenon is that the GPU usage does not change noticeably when the batch size is less than 4. The model size is the main influencing factor for GPU usage at this point.

Conclusion 2:
The efficiency of our xASTNN is promising, which is reflected in both time and space. The computation time on the order of 10⁻⁴ seconds allows our approach to be used in a wide range of industrial applications.

Effect of Alternative Designs
In this experiment, we investigate the effect of alternative designs for the proposed approach xASTNN, aiming to explain the efficacy of each designed component. We use POJ as the representative dataset, on which all the results are produced. First, we evaluate the alternative designs from three perspectives: program subtree or token subtree from the perspective of subtree granularity, removing or replacing GRvU from the perspective of syntactical information, and removing or replacing GRtU from the perspective of sequential information. The performance of these variants is reported in Table 4.
Our approach outperforms all the variants, showing that its current design is effective. When we adjust the subtree granularity, the accuracy decreases by 0.033 to 0.952 for the program subtree and by 0.014 to 0.971 for the token subtree. This result shows that an appropriate subtree granularity is important for capturing code naturalness. If the granularity of the subtree is too large or too small, the syntactical or the sequential information will be lost during the modeling process.
When we modify the encoder for syntactical or sequential information, the variants perform differently. When we replace GRvU with RvNN or replace GRtU with RtNN, the accuracy drops by 0.004 to 0.981. If we remove one of the encoders instead, the accuracy declines more sharply, by 0.158 or 0.009, respectively. This demonstrates that both the syntactical information and the sequential information introduced by our xASTNN play a key role in code representations.
In addition, we introduce another alternative design, namely xASTNN of different sizes, to investigate the effect of model size. We vary the embedding dimension in xASTNN and report the results in Table 5. The GPU usage increases from 1.4 GB to 9.3 GB as the embedding dimension increases. It grows slowly at first because the running buffer takes up a large amount of space. Once the embedding dimension reaches a certain size (e.g., 64), its impact on GPU usage becomes significant. As for the computation time, a similar phenomenon appears when the embedding dimension reaches 512. The accuracy is strongly influenced by the embedding dimension. It starts at 0.065, increases rapidly to over 0.976, and then converges. This demonstrates that the quality of code representations is highly related to the model size. Therefore, space, time, and accuracy should be balanced in practice.

Conclusion 3:
The design of our xASTNN is reasonable. Each module of our approach has a different effect on its performance, requiring developers to carefully tune the parameters according to the production requirements.

THREATS TO VALIDITY
In this section, we discuss the threats to our work. The first limitation is that some of our experiments are not based on a real-world programming language corpus. For the convenience of extensive evaluations of the baselines and our approach, we choose to adopt the widely-used benchmark datasets instead of our non-public data. Additionally, a recent study [33] suggests that BigCloneBench is harmful for evaluating machine learning approaches. Despite these defects, we believe that our experiments can still serve as a reference for the superiority of our xASTNN. The second limitation is the insufficient investigation of the robustness of the competing approaches. The results reported in this work are in the common form of average or maximum performance. The stability of model performance, however, is not measured. This measurement can illustrate whether a model has performance jitter and makes abnormal predictions. We leave this to future work.

LESSONS LEARNED
From the work of developing code representations for industrial practice, we have learned three significant lessons:
Making the code representation model applicable to various scenarios is an important issue. Developing high-quality source code representations has attracted much interest recently. Some approaches rely on complicated characteristics of programming languages to improve the effectiveness of their models, making them hard to apply in practice. Besides, an inappropriate model design and implementation can also introduce many practical problems, such as memory overflow, runtime delay, and difficult objective fitting.
Reaching a trade-off between effectiveness and efficiency is a very high priority for industrial practice. In different industrial scenarios, the application of a code representation model may encounter different computing environments and business requirements. Our experimental results show that the effectiveness of our approach can be improved by increasing the model size, at the cost of time and space efficiency. Hence, adjusting the model to the actual needs is advisable in industry.
Unusual data inputs have a dramatic impact on the model performance. In practice there are a few extremely long or short code segments, which are often deleted in academic experiments. These data samples can easily lead to gradient vanishing or explosion problems. Moreover, a batch of size-unbalanced data samples costs more time to process, so the approach should reserve computing resources for those special data samples. Hence, if we do not pay attention to such unusual data in industry, the performance of the model will be weakened.

CONCLUSION
In this paper, we propose an eXtreme AST-based neural network named xASTNN for producing code representations in industry. The design of xASTNN concentrates on unlocking the potential of AST-based neural networks. Our approach is completely based on common ASTs, which are easily obtained with AST parsers. Besides, we introduce techniques such as the gating mechanism and the dynamic batching algorithm to advance the performance, reduce the computation time, and alleviate the gradient problems. Extensive experiments on two code comprehension tasks have been conducted to demonstrate the effectiveness and efficiency of our xASTNN. According to the results, our xASTNN outperforms the state-of-the-art while achieving a speedup of over 10× compared with the previous approach ASTNN. Therefore, the proposed approach is lightweight, effective, and efficient, with the promising possibility of being applied in a wide range of industrial applications.

Figure 2: Illustration of an example code segment and its corresponding statement subtree sequence. The subtrees are extracted according to the statements in the code segment. In subfigure (b), the objects marked in grey represent subtree roots and the objects marked in green represent children of roots. The numbers following Assign indicate the line numbers of the code segment.

Figure 3: Illustration of the proposed GRvU. GRvU performs bottom-up recursive aggregation to learn the syntactical information. By introducing hidden states and the gating mechanism to the recursive neural network, we are able to acquire high-quality subtree representations.
Figure 4 illustrates the comparison between the previous batching algorithm and our batching algorithm.

Figure 4: Comparison between the previous batching algorithm and our batching algorithm. A cuboid represents one execution of the CPU/GPU. Our dynamic batching algorithm supports full parallelism in the width dimension, which reduces the time complexity of GRvU.

Figure 5: Average computation time of models for representing one code segment. Each approach is tagged with the time (left) and the speedup ratio achieved by our xASTNN (right).

Figure 6: Correlation of time and space efficiency with batch size. The subfigures indicate by how many times our approach improves on the previous approach.

Table 1: Statistics of three datasets. POJ is used for the code classification task; BigCloneBench and OJClone are used for the code clone detection task.

Table 2: Results of code classification.

Table 4: Results of alternative designs of our xASTNN.

Table 5: Correlation of GPU usage, time, and accuracy with embedding dimension for our xASTNN, obtained by varying the embedding dimension in xASTNN.