BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection

As various forms of fraud proliferate on Ethereum, it is imperative to safeguard against these malicious activities to protect susceptible users from being victimized. While current studies solely rely on graph-based fraud detection approaches, it is argued that they may not be well-suited for dealing with highly repetitive, skew-distributed and heterogeneous Ethereum transactions. To address these challenges, we propose BERT4ETH, a universal pre-trained Transformer encoder that serves as an account representation extractor for detecting various fraud behaviors on Ethereum. BERT4ETH features the superior modeling capability of Transformer to capture the dynamic sequential patterns inherent in Ethereum transactions, and addresses the challenges of pre-training a BERT model for Ethereum with three practical and effective strategies, namely repetitiveness reduction, skew alleviation and heterogeneity modeling. Our empirical evaluation demonstrates that BERT4ETH outperforms state-of-the-art methods with significant improvements on the phishing account detection and de-anonymization tasks. The code for BERT4ETH is available at: https://github.com/git-disl/BERT4ETH.


INTRODUCTION
As a decentralized computing platform, Ethereum empowers developers to create a variety of decentralized applications (DApps). Despite the substantial engagement garnered within the cryptocurrency sphere, Ethereum has also become a hub for a wide range of fraudulent activities, such as phishing scams [31], pump-and-dump schemes [15], Ponzi schemes [6], ICO scams [3], money laundering [30], and bot arbitrage [9].
Many recent studies [2, 4, 20-22, 31] employ graph representation learning techniques for fraud detection on Ethereum. Although it is intuitive to represent the interactions between accounts as a graph, we argue that these approaches have the following limitations: (i) A graph is not appropriate for capturing the sequential pattern inherent in transactions. Ethereum transactions are highly repetitive, indicating the presence of multi-edges between nodes. Current graph-learning methods [24, 25, 31] merge multi-edges into a single edge to facilitate graph computations. However, the discarded sequential information is essential for characterizing user behaviors in tasks such as de-anonymization.
(ii) Graph Neural Networks (GNNs), especially on the highly skew-distributed Ethereum data [19], can suffer from noise as the number of convolution hops increases [14], given that Ethereum accounts are often connected to highly popular accounts. However, limiting the number of convolution hops restricts the capabilities of GNNs, as the number of hops is typically equal to the depth of layers in conventional GNNs [11, 17, 20, 29].
(iii) Existing studies primarily target individual fraud detection tasks in an end-to-end training manner. In light of the successes of Transformer pre-training techniques in NLP [10, 28], we believe that a pre-trained Transformer can support various fraud detection tasks with minimal adaptation.
To address the limitations discussed above, we introduce a pre-trained model that offers a universal solution for various fraud detection tasks on Ethereum. BERT4ETH features the superior sequential modeling capability of Transformer [28] and the pre-training paradigm of BERT [10]. In this paper, we first present the architecture of BERT4ETH, with a specific focus on the integration of Transformer into the Ethereum context. BERT4ETH serves as a sequential encoder, capable of extracting representation vectors for user accounts based on their transaction histories. Second, we introduce the Masked Address Prediction (MAP) task, which involves randomly masking addresses (accounts) in transaction sequences and requiring the model to predict the masked addresses. The MAP task forces the model to learn the relationships between addresses (accounts) in transaction sequences.
However, three characteristics of Ethereum pose challenges for pre-training: (i) Repetitiveness: High repetitiveness prevents BERT4ETH from learning meaningful representations through the MAP task, because label information is very likely to leak from unmasked addresses to masked addresses. (ii) Skewed distribution: The frequency of occurrence of addresses follows a power-law distribution [19], with a small number of popular addresses proliferating in the majority of transaction sequences. This reduces the distinctiveness of representations, which is exactly what fraud detection covets most. (iii) Heterogeneity: Ethereum transactions include various types of interactions (Ether/token transfers, contract calls) between different types of accounts, creating a challenge in modeling the heterogeneity and uncovering meaningful patterns behind transactions.
We tackle the above challenges by equipping BERT4ETH with the following three strategies:
• Repetitiveness reduction: To counter the label leakage problem in pre-training, we first aggregate continuously repetitive transactions while preserving the sequential order. Second, we propose two alternative effective strategies: adopting a high masking ratio (80%) or a high dropout ratio (80%) during pre-training. These tactics create a task that cannot be easily extrapolated from the high repetitiveness.
• Skew alleviation: We enhance the distinctiveness of representations by sampling high-frequency addresses as negative samples in a contrastive loss function adopted for pre-training. Optimizing the contrastive loss amounts to alleviating the negative impact of high-frequency addresses. Additionally, we propose an intra-batch sharing strategy for negative samples, which allows an extremely high negative-to-positive ratio to alleviate the skewness, and decreases the overlap of negative sets caused by frequency-aware sampling.
• Heterogeneity modeling: We model the heterogeneous transaction information with an in/out separation strategy and an ERC-20 transfer log encoder.
Extensive experiments conducted on two crucial fraud detection tasks show that BERT4ETH significantly advances the state-of-the-art performance, achieving an $F_1$ improvement of 21.61 absolute percentage (AP) for phishing detection, and Hit Ratio@1 improvements of 13.54 and 21.57 AP for de-anonymization on the ENS and Tornado (0.1 ETH) datasets, respectively.
Contributions: To summarize, the contributions are as follows:
• We present BERT4ETH, a pre-trained Transformer that provides a universal solution for various Ethereum fraud detection tasks.
• We equip BERT4ETH with three effective strategies to generate robust and expressive representations from repetitive, skew-distributed and heterogeneous Ethereum transactions.
• BERT4ETH significantly advances the state of the art on two important fraud detection tasks. As a side contribution, we make the code and dataset available.

RELATED WORK AND BACKGROUND

Ethereum Representation Learning
Previous studies have primarily focused on graph-based methods for Ethereum account representation learning, which can be classified as DeepWalk-based and GNN-based methods.
DeepWalk-based methods: Trans2Vec [31] is proposed for the phishing account detection task; it integrates the temporal and amount information of transactions into its random walk process, so that the proximity of the learned node representations reflects the relationship between accounts. Other works, such as [21, 22], also take inspiration from Trans2Vec. For the task of de-anonymization, which aims to identify two accounts belonging to a single user based on the proximity of account representations, Beres et al. [2] evaluate 11 graph learning methods on ground-truth pairs collected from the ENS and Tornado coin-mixers. Among them, Diff2Vec [25] and Role2Vec [1] are considered the state-of-the-art methods.
GNN-based methods: Shen et al. [26] utilize a Graph Convolutional Network (GCN) [17] to classify accounts into "normal," "phisher," and "bot" categories based on inferred identities. Zhou et al. [34] propose HGATE, a hierarchical graph attention encoder that integrates features from both the node level and the subgraph level to enhance phishing detection performance. Li et al. [20] propose TTAGNN, a GNN that fuses multiple temporal edges using an LSTM network and learns node embeddings through a Graph Attention Network (GAT) [29]. A graph auto-encoder is employed to generate a self-supervised signal for representation learning, with a LightGBM model adopted for the phishing account detection task.

Transformer & BERT
Transformer [28] is a sequence-to-sequence machine translation model that introduced the groundbreaking self-attention mechanism for capturing the relationships between word tokens. BERT [10] proposes the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks to pre-train a Transformer encoder in a bi-directional context. It advanced the state of the art on eleven NLP tasks by a significant margin, and has inspired a number of variants such as ALBERT [18], RoBERTa [23] and XLNet [32]. Moreover, the pre-training paradigm of BERT allows it to be easily extended to various downstream tasks. In light of the success of Transformer and BERT pre-training, our research aims to take advantage of their superior capabilities for Ethereum fraud detection.

Terminology
Externally Owned Account (EOA): EOAs are accounts controlled by users who own their private keys, allowing them to initiate external transactions that transfer cryptocurrency or trigger smart contracts. This study focuses on modeling the transactions initiated by EOAs, as they are under human control.
Contract Account: Contract accounts are self-executing computer programs deployed on the Ethereum network. Ethereum allows encoding of arbitrary contract functionality. While contract accounts cannot issue external transactions, they can initiate internal transactions.
External Transaction: An external transaction is initiated exclusively by an EOA; it either transfers cryptocurrency to other accounts or calls a contract account to trigger its execution. In comparison, internal transactions are initiated by smart contracts to execute complex logic. In this study, the term "transaction" specifically refers to external transactions, unless specified otherwise.
Token: Tokens are digital assets that can be programmed to serve various functions, such as acting as a currency, granting access, voting, or providing identity and utility. Currently, the majority of tokens are built upon the ERC-20 standard [5].

MOTIVATION
We introduce three challenges/characteristics of Ethereum that motivate us to design a new BERT-based model.
Repetitiveness: Ethereum transactions are highly repetitive. Statistics indicate that 48.4% of transactions share the same receiver as the prior transaction initiated by the same sender, suggesting that Ethereum users tend to repeatedly interact with the same accounts. However, the pre-training task of BERT4ETH is vulnerable to high repetitiveness: label information can leak from unmasked tokens to masked but repetitive ones, hindering BERT4ETH from capturing meaningful co-occurrences between addresses, a phenomenon we refer to as the label leakage problem.
Skew-distributed: As shown in Figure 1, the frequency of occurrence of accounts (addresses) follows a power-law distribution [19, 33], which means a small number of high-frequency accounts proliferate in the majority of transactions. This characteristic presents a difficulty for representation learning, as it can diminish the distinctiveness of representations: two accounts that interacted with popular accounts like Uniswap are likely to be located close together in the latent space, even if they are completely irrelevant to each other.
Heterogeneous: In Ethereum, there exist different types of accounts, transactions, and functionalities associated with calling contract accounts. Compared to human languages, the heterogeneity present in transactions makes it more challenging to discern meaningful patterns and determine the most important elements of information. Given that various downstream fraud detection tasks depend on different aspects of information, we aim to preserve heterogeneity as much as possible during the pre-training phase.

BERT4ETH
In this section, we present the design of BERT4ETH, along with three strategies aimed at addressing the above-mentioned challenges, dubbed Repetitiveness Reduction (RR), Skew Alleviation (SA) and Modeling Heterogeneity (MH).

Transaction Sequence
4.1.1 Data Collection. We deployed an Ethereum node using Geth and utilized Ethereum-ETL to extract structured tabular data from the archived raw data. The table schema used in this paper is available at https://ethereum-etl.readthedocs.io/en/latest/schema, where transaction.csv is the external transaction file and trace.csv is the log file.

Sequence Generation.
For an EOA with address $a_0$, we collect all the transactions in which it was either the initiator or the receiver, and sort them in descending order by timestamp. For each transaction, we collect the following features: address, timestamp, amount, account type, and in/out type. The in/out type feature indicates whether the transaction is received or initiated by $a_0$, account type indicates whether the counterparty account is an EOA or a contract, amount is the value of the transferred amount, and timestamp denotes the transaction time. Subsequently, we insert a dummy self-transaction at the head of the sequence. The address feature of the self-transaction is set to $a_0$ (self-address), and all the other features are set to "Null" to differentiate it from normal transactions.
Transaction De-duplication (RR#1): The first strategy of repetitiveness reduction is transaction de-duplication, which aims to reduce continuous repetitiveness, i.e., transactions that interact with the same address continuously in a sequence. First, we eliminate failed transactions, as a user may initiate several failed transactions before a final one executes successfully. Second, we aggregate continuous repetitive transactions that have the same address and the same in/out type, and that were initiated within 72 hours, into a single transaction, by summing up their transaction amounts and tracking the number of transactions. The timestamp of the aggregated transaction is set to the first timestamp of the original transactions. By adopting the de-duplication strategy, we lower the repetitiveness ratio from 48.0% to 14.3%, while still preserving the order of the original sequence.
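The aggregation rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation; field names such as `ts` (a timestamp in hours) and `failed` are our own stand-ins for the real schema.

```python
# Sketch of de-duplication (RR#1): drop failed transactions, then merge each
# run of consecutive transactions sharing the same address and in/out type
# that start within `window_hours` of the first one in the run. Amounts are
# summed, a count is tracked, and the run keeps its first timestamp.
def deduplicate(txs, window_hours=72):
    merged = []
    for tx in txs:
        if tx.get("failed"):  # eliminate failed transactions first
            continue
        if (merged
                and merged[-1]["address"] == tx["address"]
                and merged[-1]["in_out"] == tx["in_out"]
                and tx["ts"] - merged[-1]["ts"] <= window_hours):
            merged[-1]["amount"] += tx["amount"]  # sum up transaction amounts
            merged[-1]["count"] += 1              # track number of merged txs
        else:
            merged.append({**tx, "count": 1})     # start a new run
    return merged
```

Note that the merge is confined to *consecutive* transactions, matching the paper's focus on continuous repetitiveness; discontinuous repeats are handled separately by the high masking/dropout ratios.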

Model Architecture
4.2.1 Embedding Layer. As illustrated in Figure 2, seven features are generated for each transaction: address, account type, in/out type, amount, count, timestamp and position. Since amount and count are non-categorical features, we use binning to categorize them. The position index ranges from 0 to $N-1$, where $N$ is the sequence length.
First, we adopt the embedding technique to encode the features, making the model aware of the transaction information. Specifically, for the $i$-th transaction in the sequence, its transaction features are passed through embedding layers to generate the corresponding feature embeddings, which are then summed to obtain its initial transaction representation $h_i^{(0)} \in \mathbb{R}^d$. Next, we stack the initial transaction representations to form a matrix $H^{(0)}$ that encompasses all the information of the transaction sequence.
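As a sketch of this embedding step, the snippet below sums per-feature embeddings into an initial transaction representation and stacks them into the input matrix. The vocabulary sizes are toy values of our choosing, and random NumPy lookup tables stand in for trained embedding layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (the paper's default)

# Illustrative vocabulary sizes; the real address vocabulary is far larger.
vocab = {"address": 1000, "account_type": 3, "in_out": 3,
         "amount_bin": 32, "count_bin": 16, "time_bin": 64, "position": 100}
tables = {name: rng.normal(size=(size, d)) for name, size in vocab.items()}

def embed_transaction(features):
    """Sum the seven per-feature embeddings to form one transaction's
    initial representation (a vector in R^d)."""
    return sum(tables[name][idx] for name, idx in features.items())

def embed_sequence(seq):
    """Stack the initial transaction representations into the input matrix."""
    return np.stack([embed_transaction(f) for f in seq])
```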

4.2.2 Transformer Encoder. For a sequence, BERT4ETH takes $H^{(0)}$ as the input and passes it through the Transformer encoder consisting of $L$ Transformer layers. Each Transformer layer contains two sub-layers, an Attention sub-layer and a Position-wise Feed-Forward sub-layer. We formalize a Transformer layer as follows:
$$H^{(l)} = \mathrm{FFN}(\mathrm{Attention}(H^{(l-1)}))$$
$$\mathrm{Attention}(H) = \mathrm{softmax}\left(\frac{HW^Q (HW^K)^\top}{\sqrt{d}}\right) HW^V$$
$$\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2$$
where the projection matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$, $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$ are trainable parameters for the $l$-th Transformer layer. To facilitate description, we omit the multi-head mechanism [10], residual connections [13] and batch normalization [16], but they are adopted in practice.
After the $L$-layer successive calculation, the Transformer encoder produces a matrix $H^{(L)}$, whose $i$-th row $h_i^{(L)}$ is the final representation of the $i$-th transaction; it encodes not only the transaction's own information but also the bi-directional context information.

4.2.3 Masked Address Prediction. MAP is derived from the Masked Language Modeling (MLM) task of BERT, which involves a Cloze test that requires the model to predict the masked addresses in a transaction sequence, as shown in Figure 2. In BERT4ETH, a certain percentage ($\alpha$%) of transactions within the sequence are selected and their addresses are replaced with the special token [MASK]. The masked sequence is then passed through the embedding and Transformer layers as described before. For a masked transaction, its final transaction representation $h_m^{(L)}$, which encodes its bi-directional contextual information, is used to predict its masked address.
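A minimal sketch of the masking step, using the high default ratio discussed later. The helper name and seeding are ours; BERT4ETH's actual corruption procedure may differ in detail.

```python
import random

MASK = "[MASK]"

def mask_addresses(addresses, ratio=0.8, seed=0):
    """Replace a fraction (alpha = 80% here) of the addresses with [MASK].
    Returns the corrupted sequence and the indices the model must predict."""
    rnd = random.Random(seed)
    n = max(1, int(round(len(addresses) * ratio)))
    masked = set(rnd.sample(range(len(addresses)), n))
    corrupted = [MASK if i in masked else a for i, a in enumerate(addresses)]
    return corrupted, sorted(masked)
```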
The original BERT predicts masked word tokens by calculating probabilities across all tokens (around 30K in number). However, when applied to Ethereum, it is infeasible to calculate softmax(·) across all the addresses, as there are up to billions of addresses on Ethereum. Therefore, we adopt a contrastive loss calculated over a positive address and randomly sampled negative addresses as the objective function for pre-training:
$$\mathcal{L}_{\mathrm{MAP}} = -\sum_{m \in \mathcal{M}} \log \frac{\exp(s_m^\top e_m^+)}{\exp(s_m^\top e_m^+) + \sum_{j \in \mathcal{N}} \exp(s_m^\top e_j^-)} \qquad (5)$$
where $\mathcal{M}$ is the set of masked addresses in a sequence and $s_m$ is $h_m^{(L)}$, which encodes the unmasked contextual information. For each sequence, we sample a negative set $\mathcal{N}$. $e_m^+$ is the address embedding of the masked address, which we refer to as the positive embedding, and $e_j^-$ is the address embedding of a negative address from the negative set $\mathcal{N}$. Here we reuse the address embedding layer to avoid introducing new parameters. Optimizing Eq. 5 is essentially equivalent to encouraging $s_m$ to be closer to $e_m^+$ and away from $e_j^-$ in the hidden space.

Figure 3: Testing $F_1$ of phishing account detection w.r.t. different masking ratios. A high masking ratio (80%) works significantly better than the original ratio (15%).
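The sampled-softmax contrastive objective can be sketched in NumPy as below. The function name and array layout are our own; in practice the negatives come from the frequency-aware sampling strategies described later, and the loss is computed inside the training framework.

```python
import numpy as np

def map_contrastive_loss(h, e_pos, e_neg):
    """Contrastive pre-training loss for one sequence (cf. Eq. 5).
    h:     (m, d) final representations of the m masked transactions
    e_pos: (m, d) address embeddings of the true (masked) addresses
    e_neg: (k, d) address embeddings of the k sampled negatives (shared)"""
    pos = np.einsum("md,md->m", h, e_pos)      # positive logits
    neg = h @ e_neg.T                          # (m, k) negative logits
    logits = np.concatenate([pos[:, None], neg], axis=1)
    log_denom = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(log_denom - pos))     # -log softmax of the positive
```

Minimizing this pulls each masked transaction's representation toward its positive address embedding and away from the sampled negatives.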
Multi-hop modeling: From a graph perspective, the address embedding captures up to two-hop neighborhood information during MAP pre-training ($a_0$ and $a_1$ are one-hop neighbors, while $a_1$ and $a_3$ are two-hop neighbors). After pre-training, the Transformer encoder extracts the account representation of $a_0$ using its transaction sequence as input. This explicitly includes the address embeddings of its one-hop neighbors and implicitly captures the information from the two-hop neighbors of $a_0$'s immediate neighbors. In total, BERT4ETH captures three-hop neighborhood information, as will be discussed in our experiments (Section 5.5).

Repetitiveness Reduction.
The issue of high repetitiveness poses a risk of label leakage for the MAP task, which can negatively impact the effectiveness of pre-training. For instance, if BERT4ETH follows the original masking ratio (15%) and uses the 85% unmasked addresses to predict the 15% masked addresses, the masked addresses have a high likelihood of also being present among the unmasked ones, leading to an overly easy prediction task [12]. This, in turn, results in small loss values and gradients, causing the parameters to be inadequately trained and impeding the model from capturing the meaningful occurrence patterns between addresses.
Despite the proposed de-duplication strategy for reducing continuous repetitiveness, the remaining discontinuous repetitiveness is still substantial. To mitigate the adverse effects of high repetitiveness, we put forth two effective strategies that introduce no additional operations:
High Masking Ratio (RR#2): A straightforward solution is to increase the masking ratio $\alpha$ to a very high value, thereby creating a task that cannot be easily extrapolated. Figure 3 shows the testing $F_1$ of BERT4ETH on the phishing account detection task (with the fixed-training strategy described in Section 5.3) as the masking ratio is switched from 10% to 90%. Accordingly, the $F_1$ score increases from 0.1350 to 0.3245, a performance gap of up to 18.95 AP. Among these settings, $\alpha$=80% achieves superior performance; when $\alpha$>80%, we observe that the performance starts to decrease because the unmasked information becomes too limited for the task.
High Dropout Ratio (RR#3): An alternative approach is to adopt a high dropout ratio for pre-training, which shares the same idea as raising the masking ratio. Empirical results show that, with a low masking ratio of 15%, a similar performance can be reached by adopting a high dropout ratio of 80%. However, the benefits of increasing the dropout and masking ratios are not cumulative, because they achieve the same effect. Therefore, given a masking ratio of 80%, a dropout ratio of 20% is set as the default value based on empirical hyper-parameter tuning.

Skew Alleviation.
As previously shown in Figure 1, the occurrence frequency of Ethereum accounts follows a power-law distribution, meaning that a small number of popular accounts are highly likely to appear in the majority of transaction sequences, causing two irrelevant accounts to be close to each other in the hidden space simply because they interact with the same popular accounts.
A good encoder is expected to identify rare activities among the majority of transactions, as transactions that interact with low-frequency addresses can be important signals for fraud detection. We present two strategies to alleviate the negative impact of the skewed distribution:
Frequency-aware Negative Sampling (SA#1): Given that the masking ratio is high (80%), high-frequency addresses are more likely than low-frequency addresses to be masked and have their address embeddings selected as $e_m^+$ for Eq. 5. Optimizing Eq. 5 encourages $s_m$, which encodes the unmasked transaction sequence, to be closer to $e_m^+$ in the hidden space. As a result, addresses that co-occur with high-frequency addresses become closer to them, and thus the sequence representations also become closer, which is undesirable for fraud detection. An effective solution is to take high-frequency addresses as negative samples to counteract the impact of these addresses being trained frequently as positive samples. Specifically, we introduce two frequency-aware sampling strategies, Zipfan sampling and Frequent sampling, as follows:
• Zipfan sampling: $p(a_j) \propto \frac{1}{r(a_j)}$, where $r(\cdot)$ is the rank of $a_j$ based on the descending frequency.
• Frequent sampling: $p(a_j) \propto f(a_j)^{\beta}$, where $f(\cdot)$ is the frequency of account/address $a_j$ and $\beta$ is an adjustable hyper-parameter. In the experiments, we set $\beta$=0.5 and 1.0. If $\beta$=0, it degrades into uniform sampling.
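The two sampling distributions can be sketched as follows. The helper names are ours; `freqs` holds each candidate address's occurrence count, and the returned vectors are probabilities one would feed to a weighted sampler.

```python
import numpy as np

def zipfan_probs(freqs):
    """Zipfan sampling: p(a) proportional to 1/rank(a), with ranks assigned
    by descending frequency (ties broken by input order)."""
    order = np.argsort(-np.asarray(freqs), kind="stable")
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(freqs) + 1)
    p = 1.0 / ranks
    return p / p.sum()

def frequent_probs(freqs, beta=0.5):
    """Frequent sampling: p(a) proportional to f(a)^beta.
    beta = 0 recovers uniform sampling."""
    p = np.asarray(freqs, dtype=float) ** beta
    return p / p.sum()
```

Either distribution concentrates negatives on popular addresses, counteracting how often those addresses are trained as positives.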
Intra-batch Sharing (SA#2): Given that we sample high-frequency addresses as negative samples, the negative sets of different transaction sequences would be highly overlapped, because they all concentrate on sampling high-frequency addresses. To reduce this waste, we share a single negative set across all sequences within a batch, which enables an extremely high negative-to-positive ratio at little extra cost.
Table 1 presents the results of the skew alleviation strategies on the phishing account detection task. It is obvious that: 1) BERT4ETH achieves better performance when the degree of frequent negative sampling increases (Zipfan > freq(1.0) > freq(0.5) > uniform), with an $F_1$ gap of up to 9.94 AP; 2) For Zipfan sampling, when the negative-to-positive ratio increases from 20 (without intra-batch sharing) to 1,000, 5,000 and 10,000 (with intra-batch sharing), $F_1$ increases to 0.5044, a gap of up to 8.05 AP. As a result, the negative-to-positive ratio of 5,000 is adopted as the default setting.
Figure 4 demonstrates two attention distributions received by addresses in the first Transformer layer. It is obvious that the attention scores assigned to high-frequency addresses decrease after Zipfan sampling, enabling BERT4ETH to pay more attention to low-frequency addresses and enhance the distinctiveness.

Advanced Features
We designate the above-described model as the basic BERT4ETH. In this section, we present two advanced techniques aimed at tackling the heterogeneity of Ethereum transactions.
In/Out Separation (MH#1): It is observed that fraudulent EOAs exhibit special patterns of in-type and out-type transactions. For phishing EOAs, the in-to-out ratio is 1.250, whereas it is only 0.385 for normal accounts. The difference arises from the nature of fraud activities: a phishing EOA receives transfers from its victims, making the in-type transactions dominant, whereas the out-type transactions might indicate the flow of stolen funds to accomplices or other accounts controlled by the attacker. Because MAP is a self-supervised task, no fraud label is available to highlight these crucial yet minority transactions. As a result, these transactions may be overlooked during the self-attention computation.
A solution is to generate two additional sub-sequences by separating the original sequence based on the transaction's in/out type feature. Accordingly, another two Transformer encoders are employed to generate the in-type and out-type representations, respectively, and self-attention computations are confined within each sub-sequence. Parameters in the embedding layers are shared to prevent a large increase in the number of parameters, as the address embedding layer is very large in size.
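The separation itself is straightforward; a sketch with illustrative field names (the two sub-sequences would then be fed to their own encoders):

```python
def split_in_out(txs):
    """MH#1 sketch: derive in-type and out-type sub-sequences from the
    original sequence so self-attention stays within each sub-sequence."""
    ins = [t for t in txs if t["in_out"] == "in"]
    outs = [t for t in txs if t["in_out"] == "out"]
    return ins, outs
```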
ERC-20 Transfer Log Encoder (MH#2): Token transfer is implemented at the contract level rather than at the protocol level. Therefore, a token transfer to a receiver is not recorded as an external transaction. To capture the transfer relationship, we analyze ERC-20 transfer logs from internal transactions, and select the transfer behaviors that happened between EOAs to prevent noise. For an external transaction that invokes an ERC-20 token transfer, we attach all the EOAs that receive ERC-20 tokens to the recipient contract address. It should be noted that the number of recipient EOAs is uncertain, as a single transaction may transfer tokens to multiple EOAs simultaneously.
Before passing through the Transformer encoder, we first mean-pool the address embeddings of the attached EOAs to encode their information. Second, we employ a gate mechanism to integrate the embeddings of the contract account and the recipient EOAs:
$$g = \sigma(W_g[e_c; \bar{e}_r]), \qquad e = g \odot e_c + (1 - g) \odot \bar{e}_r$$
where $e_c$ is the address embedding of the recipient contract, $\bar{e}_r$ is the mean-pooled embedding of the recipient EOAs, and $g$ is the gate vector that is adaptive to $e_c$ and $\bar{e}_r$. The gate mechanism prevents introducing noise in instances where transfer actions, such as token airdrops, do not necessarily indicate a relationship between the initiator and receiver.
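A sketch of such a gate follows. The exact parameterization is not fully specified above, so this assumes a single sigmoid gate over the concatenated embeddings, with `W_g` a hypothetical trainable matrix.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(e_contract, e_eoas, W_g):
    """Fuse the contract embedding with the mean-pooled recipient-EOA
    embeddings via an element-wise gate, so noisy transfers (e.g. airdrops)
    can be down-weighted adaptively."""
    e_r = e_eoas.mean(axis=0)                            # mean pooling
    g = sigmoid(W_g @ np.concatenate([e_contract, e_r]))  # gate vector
    return g * e_contract + (1.0 - g) * e_r              # element-wise mix
```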

Apply to Downstream Tasks
The pre-trained BERT4ETH functions as a representation extractor for a given transaction sequence. To represent the entire sequence, we pick the representation of the self-transaction $h_0^{(L)}$, because it encodes the global information. In the case where BERT4ETH adopts the in/out separation strategy, the final representation is obtained by concatenating the three self-transaction representations extracted from the original, in-type and out-type sequences. If a sequence exceeds the maximum length, it is split into multiple sequences and multiple representations are generated, which are then mean-pooled to produce the final representation.

EXPERIMENTS

Task Description
5.1.1 Phishing Account Detection. Phishing attacks are the most prevalent form of fraud on Ethereum. Attackers send victims fake airdrop messages through emails or social networks to lure them into logging into accounts on phishing websites [7] or transferring cryptocurrency to designated phishing accounts [8]. Unlike conventional phishing scams, Ethereum transactions are publicly accessible, allowing us to identify phishing accounts and alert susceptible users before they are victimized. In our experiments, the task of phishing account detection is framed as a binary classification problem, where the goal is to determine whether an EOA is a phishing account. We adopt Precision, Recall and $F_1$ as the metrics.
5.1.2 De-anonymization. De-anonymization aims to identify two different EOAs controlled by the same user. One application of de-anonymization is tracing the flow of money laundering. For example, Tornado Cash [2, 27] provides coin-mixing services on Ethereum: a participant deposits a certain amount of ether into a Tornado mixer contract, and uses another account to withdraw the deposited coins after a period of time. In our experiments, given a ground-truth pair of EOAs, we use the representation of the query EOA to query its top-$k$ closest neighbors in the hidden space. If the target EOA is present among them, de-anonymization is considered successful. We adopt Hit Ratio@$k$ (HR@$k$) to measure the performance, and use Euclidean distance as the proximity metric because it yields slightly better results than cosine similarity.
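The HR@$k$ evaluation described above can be sketched as follows. The helper is hypothetical; `target_idx[i]` is the index of query $i$'s paired EOA within the candidate representation matrix.

```python
import numpy as np

def hit_ratio_at_k(queries, candidates, target_idx, k):
    """HR@k with Euclidean distance: the fraction of query EOAs whose paired
    target EOA appears among the k nearest candidates in the hidden space.
    queries: (n, d), candidates: (m, d), target_idx: length-n indices."""
    dists = np.linalg.norm(queries[:, None, :] - candidates[None, :, :], axis=-1)
    topk = np.argsort(dists, axis=1)[:, :k]   # k nearest candidates per query
    hits = [t in row for t, row in zip(target_idx, topk)]
    return float(np.mean(hits))
```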

Dataset:
For phishing account detection, we collect 7,057 accounts labeled by Etherscan, 97% of which are EOAs. For de-anonymization, we use ground-truth pairs collected by Beres et al. [2] from two sources: the Ethereum Name Service (ENS) and Tornado Cash. For pre-training, we randomly sample 1,000,000 EOAs and filter out accounts labeled as "phishing", "exchange", "miner" and "mining pool". These normal EOAs are also used as negative samples for phishing account detection and as the candidate set for de-anonymization on the ENS dataset. We collect all the transactions these EOAs have been involved in, covering the period from Jan. 1, 2017 to May 1, 2022, and filter out EOAs that have fewer than 3 or more than 10,000 transactions. The dataset statistics are presented in Table 2.

Implementation:
For BERT4ETH, the number of Transformer layers is set to 8 and the number of attention heads is set to 2. For DeepWalk-based methods [24], we set the number of walks per node to 10, the walk length to 20, and the context size to 5. For GNN-based methods, the number of GNN layers is set to 2 with a neighbor sample size of 50. For all the competitors, the negative-to-positive ratio is set to 20, the hidden dimension to 64, the batch size to 256 and the dropout ratio to 20%. Other parameters of BERT4ETH follow the previously mentioned default settings.
Table 3 summarizes the results of the fixed-training strategy. From the table we can observe that: 1) GNN-based methods, especially GAT, outperform DeepWalk-based methods. 2) The basic BERT4ETH achieves a significant performance boost compared to the other baselines: the $F_1$ gap is up to 21.61 AP compared to GNNs, and 36.94 AP compared to the original BERT with an $F_1$ of 0.1350 (Figure 3). The performance boost mainly comes from a better pre-training process, incorporating fine-grained sequential and transaction information, as well as the superior modeling ability of Transformer. 3) By applying the two advanced features, BERT4ETH† and BERT4ETH§ further bring 3.44 and 1.68 AP gains in $F_1$, respectively. The account representations generated by several baselines, as visualized in Figure 5, show that the phishing account representations generated by BERT4ETH are more dense and separable.

Performance Comparison
Table 4 presents the results after fine-tuning, with graph-based methods omitted as they perform worse and some of them cannot be fine-tuned. The first three rows of Table 4 show the results of fine-tuning the pre-trained BERT4ETH models, revealing a substantial improvement in performance compared to Table 3 and indicating the effectiveness of fine-tuning. The last three rows are the results of ablating the pre-training; the performance decreases markedly, suggesting the importance of pre-training. Among these results, BERT4ETH† achieves the best performance.
BERT4ETH cannot be fine-tuned for the de-anonymization task due to the limited number of ground-truth pairs: 288 pairs for the ENS dataset and 182 pairs for the Tornado dataset. Table 5 presents the comparison results for the ENS dataset. Among the DeepWalk-based methods, Diff2Vec demonstrates the best performance. Among GNNs, GraphSAGE outperforms both GCN and GAT by a large margin, which can be attributed to the fact that homogeneous GNNs can introduce a large amount of noise in multi-hop aggregation, to which de-anonymization is susceptible, especially given that Ethereum account nodes are highly likely to be linked to popular account nodes. GraphSAGE is more resistant to noise [14] because of its skip-connection mechanism. Notably, the basic BERT4ETH can exactly de-anonymize 16.32% of account pairs, offering a significant improvement of 9.7 AP on HR@1. Additionally, the in/out separation strategy brings a further considerable improvement upon the basic BERT4ETH.
Table 11 presents the results for the Tornado dataset, where Rank is the average rank of the deposit account within the candidate set. The size of the candidate set varies due to the unique withdrawal time of each pair.

Ablation Study
We investigate the impact of the proposed strategies by ablating five key elements of BERT4ETH: transaction de-duplication, the high masking ratio, frequent negative sampling, intra-batch sharing, and transaction features.
Table 7 presents the results of the ablation study conducted on the phishing account detection task with fixed training. First, we observe that removing any one of the five elements results in a noticeable performance decline. Second, switching the masking ratio to BERT's original setting (15%) leads to the largest performance decrease, indicating that repetitiveness can largely hurt the effectiveness of pre-training for the phishing detection task.
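The masking step being ablated can be sketched as follows (a simplified illustration; the actual implementation masks transactions and predicts their addresses, and its selection rule may differ):

```python
import random

MASK = "[MASK]"

def mask_sequence(addresses, ratio=0.8, seed=0):
    """Replace a `ratio` fraction of addresses in a transaction sequence
    with [MASK]. BERT4ETH uses 0.8 instead of BERT's 0.15 so that
    highly repetitive addresses cannot trivially be copied from context."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(addresses) * ratio))
    idx = set(rng.sample(range(len(addresses)), n_mask))
    masked = [MASK if i in idx else a for i, a in enumerate(addresses)]
    labels = {i: addresses[i] for i in idx}  # prediction targets
    return masked, labels

seq = ["0xA", "0xB", "0xA", "0xC", "0xA"]
masked, labels = mask_sequence(seq, ratio=0.8)
assert masked.count(MASK) == 4 and len(labels) == 4
```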
In contrast, the conclusion drawn from the ablation study on the de-anonymization task is entirely different. Ablating the repetitiveness reduction strategies (de-duplication & the 80% masking ratio) actually leads to a slight performance increase when k ≤ 5, suggesting that the de-anonymization task is not particularly susceptible to high repetitiveness. Another noteworthy finding is that the two skew alleviation strategies are crucial for this task. It is worth mentioning that skew alleviation is especially important when the masking ratio is high, because high-frequency addresses are then masked more often and dominate the prediction targets.

Multi-hop Modeling
Despite modeling at the transaction-sequence level, BERT4ETH can still capture up to three-hop neighborhood information from a graph perspective. As illustrated in Figure 6, we represent five accounts as nodes and the transactions between them as edges. When conducting MAP pre-training on the sequence of node C, the address embedding of node B moves closer to the address embeddings of nodes C and D, suggesting that address embeddings trained during MAP implicitly incorporate information from up to a two-hop neighborhood. After pre-training, we use the Transformer encoder to extract the account representation of node A by taking its transaction sequence as input, which explicitly incorporates the address embedding of B and, implicitly, the information from nodes C and D. By combining MAP and the Transformer, BERT4ETH captures a total of three-hop neighborhood information.
After pre-training, we extract the address embeddings to represent accounts (addresses) directly and then test them on the two tasks. Table 9 and Table 10 present the corresponding results, from which we notice that, while the performance declines compared to the BERT4ETH representations, the address embeddings remain effective for both tasks. This indicates that the address embeddings pre-trained during MAP already capture multi-hop information. Furthermore, we observe a more significant performance drop on phishing account detection than on de-anonymization, implying that multi-hop information is more crucial for the former task.

Case Study
We identify the case of hot-to-cold queries for de-anonymization, where graph-based methods may fail. In this case, the cold account has far fewer transactions than the hot account. Take Case 1 for example: when using the cold account to query the hot account, both BERT4ETH and Diff2Vec rank the hot account 8th out of 595,396. However, when using the hot account to query the cold account, BERT4ETH ranks it 20th, while Diff2Vec ranks it 47,641st. The reason is that graph-based methods introduce a large amount of noise when a node's neighborhood is large. In contrast, BERT4ETH preserves the first-order neighborhood with transaction-level sequential information and emphasizes important information via self-attention, keeping it effective for hot-to-cold queries.

CONCLUSION
We present BERT4ETH, a pre-trained Transformer that offers a universal solution for fraud detection tasks on Ethereum.BERT4ETH features the superb modeling ability of the Transformer and incorporates three effective strategies to tackle the challenges of pre-training a BERT model for Ethereum.These strategies, namely repetitiveness reduction, skew alleviation and heterogeneity modeling, result in substantial improvements and operate cohesively and harmoniously.The significant improvements achieved on phishing account detection and de-anonymization tasks suggest that BERT4ETH is well suited for practical applications.

Figure 1 :
Figure 1: The frequency of occurrence of accounts (addresses) follows a power-law distribution.

Figure 2 :
Figure 2: The framework of BERT4ETH pre-training. After a transaction sequence is generated, we select a portion of transactions, replace their addresses with [MASK], and feed the sequence to the model to predict the masked addresses.

Figure 4 :
Figure 4: Attention distribution with uniform (a) or Zipfian (b) negative sampling. Address IDs are ranked by descending frequency.

We force masked transactions in the same batch to share all the negative samples. Another advantage of this strategy is that, with the total number of negative samples unchanged, the negative-to-positive ratio increases substantially from |N| : 1 to b · |N| : 1, providing a greater degree of skew alleviation, where |N| is the size of the negative set and b is the batch size of transaction sequences. Table 1 presents the results of the skew alleviation strategies on the phishing account detection task. It is clear that: 1) BERT4ETH achieves better performance as the degree of frequent negative sampling increases (Zipfian > freq(1.0) > freq(0.5) > uniform), and the F1 gap is up to 9.94 AP; 2) for Zipfian sampling, when the negative-to-positive ratio increases from 20 (without intra-batch sharing) to 1,000, 5,000, and 10,000 (with intra-batch sharing), F1 increases to 0.5044, and the gap is up to 8.05 AP. Accordingly, a negative-to-positive ratio of 5,000 is adopted as the default setting. Figure 4 shows the two attention distributions received by addresses in the first Transformer layer. The attention scores assigned to high-frequency addresses decrease after Zipfian sampling, enabling BERT4ETH to pay more attention to low-frequency addresses and thereby enhance distinctiveness.
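Frequency-based negative sampling with intra-batch sharing can be sketched as below (function names and the exponent parameterization are illustrative; the paper compares uniform, freq(0.5), freq(1.0), and Zipfian variants):

```python
import numpy as np

def sample_shared_negatives(addr_freq, num_neg, alpha=1.0, seed=0):
    """Draw a single negative set with probability proportional to
    frequency**alpha (alpha=0 is uniform, alpha=1 is freq(1.0)), and
    share it across all masked transactions in the batch, so the
    effective negative-to-positive ratio grows from |N|:1 to b*|N|:1."""
    rng = np.random.default_rng(seed)
    addrs = np.array(list(addr_freq))
    probs = np.array([addr_freq[a] for a in addrs], dtype=float) ** alpha
    probs /= probs.sum()
    return rng.choice(addrs, size=num_neg, p=probs, replace=True)

# Skewed, power-law-like address frequencies (toy data).
freqs = {"0xA": 1000, "0xB": 10, "0xC": 1}
negs = sample_shared_negatives(freqs, num_neg=100)
# High-frequency addresses dominate the shared negative set.
assert (negs == "0xA").sum() >= (negs == "0xC").sum()
```

Sampling negatives proportionally to frequency means high-frequency addresses are contrasted against far more often, which is what suppresses the attention they receive in Figure 4(b).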
where W ∈ R^{d×k} and b ∈ R^d are parameters optimized during training. The address embedding e' is encoded into the initial transaction embedding h^(0).
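This encoding step is a simple affine map; a sketch with illustrative dimensions (the paper's actual d and k are not given in this fragment):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 16               # hidden dim d and input dim k (illustrative values)
W = rng.normal(size=(d, k)) # learned projection, W in R^{d x k}
b = rng.normal(size=d)      # learned bias, b in R^d
e_addr = rng.normal(size=k) # an address embedding
h0 = W @ e_addr + b         # initial transaction embedding h^(0)
assert h0.shape == (d,)
```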

5.3.1
Phishing Account Detection: We evaluate BERT4ETH w.r.t. two strategies: fixed-training and fine-tuning. For fixed-training, the pre-trained model is used as a feature extractor to generate representations, followed by separately training an MLP for classification. For fine-tuning, the model is trained together with a cascaded MLP. Each experiment is repeated five times, and the best F1 score is reported with the classification threshold set to 0.3.
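The reporting step can be made concrete with a small sketch of F1 at a fixed threshold (toy scores below, not real model outputs):

```python
def f1_at_threshold(probs, labels, threshold=0.3):
    """F1 for the phishing class when predicting positive above
    `threshold` (0.3 is the reporting threshold used in the evaluation)."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

probs = [0.9, 0.4, 0.2, 0.35, 0.1]
labels = [1, 1, 0, 0, 0]
# preds = [1, 1, 0, 1, 0] -> tp=2, fp=1, fn=0 -> P=2/3, R=1 -> F1=0.8
assert abs(f1_at_threshold(probs, labels) - 0.8) < 1e-9
```

A fixed threshold of 0.3 rather than 0.5 trades precision for recall, which is sensible when phishing accounts are the rare positive class.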

Figure 5 :
Figure 5: T-SNE visualization of phishing (orange) and normal (blue) accounts for several competitors.

5.3.2 De-anonymization: For the ENS dataset, we construct a candidate set including ENS and normal EOAs (595,373 in total), which is shared by all the ground-truth pairs. For the Tornado dataset, we use ground-truth pairs collected from the 0.1 ETH and 1 ETH coin-mixers. As each ground-truth pair consists of a deposit EOA and a withdraw EOA, we construct a candidate set including EOAs that deposited Ether to the mixers prior to the withdrawal time. The withdraw EOA is used to query the deposit EOA within the candidate set.

Figure 6 :
Figure 6: A toy example illustrating that BERT4ETH can capture three-hop neighborhood information.

Table 3 :
Comparison for phishing detection w/ fixed-training.

Table 4 :
Comparison for phishing detection w/ fine-tuning. BERT4ETH w/o pre-training is equivalent to a vanilla Transformer.

Table 7 :
Ablation study for phishing detection w/ fixed-training.

Table 9 :
Comparison for phishing detection w/ fixed-training.

Table 11 :
Case study of hot-to-cold query for de-anonymization.