PhishReplicant: A Language Model-based Approach to Detect Generated Squatting Domain Names

Domain squatting is a technique used by attackers to create domain names for phishing sites. In recent phishing attempts, we have observed many domain names that use multiple techniques to evade existing methods for detecting domain squatting. These domain names, which we call generated squatting domains (GSDs), are quite different in appearance from legitimate domain names and do not contain brand names, making them difficult to associate with phishing. In this paper, we propose a system called PhishReplicant that detects GSDs by focusing on the linguistic similarity of domain names. We analyzed newly registered and observed domain names extracted from certificate transparency logs, passive DNS, and DNS zone files. We detected 3,498 domain names acquired by attackers in a four-week experiment, of which 2,821 were used for phishing sites within a month of detection. We also confirmed that our proposed system outperformed existing systems in both detection accuracy and number of domain names detected. As an in-depth analysis, we examined 205k GSDs collected over 150 days and found that phishing using GSDs was distributed globally. However, attackers intensively targeted brands in specific regions and industries. By analyzing GSDs in real time, we can block phishing sites before or immediately after they appear.


INTRODUCTION
Phishing sites use various techniques to trick victims into believing they are legitimate. Attackers create fake login pages by copying the web content of branded websites [37]. They also acquire domain names similar to legitimate ones [15]. Users may mistakenly identify these domain names displayed in emails or short message service (SMS) messages as legitimate, click on the links, and unknowingly get redirected to phishing sites [25,40]. As a result, users may enter sensitive information, such as credit card numbers and login credentials, under the false belief that they are on a legitimate website [32].
Previous studies have investigated domain squatting since the 2000s [41]. Researchers proposed methods for detecting domain names that differ slightly from legitimate domain names (e.g., typosquatting [15,39], bitsquatting [29], and IDN homograph [38]) and those that include brand names (e.g., combosquatting [22]). In recent phishing attempts, many domain names are created by using multiple squatting techniques to evade existing detection methods. For example, these domain names set typosquatting strings of the legitimate domain names as their subdomains and append other words connected by hyphens. The edit distance between legitimate domain names and most of these domain names is large. These domain names do not directly contain brand names, making it difficult to associate them with legitimate ones. Also, there are strong similarities between these domain names, which appear to be generated by using common techniques. We refer to these domain names as generated squatting domains (GSDs). Attackers generate a large number of similar domain names by using multiple squatting techniques, such as adding, deleting, or substituting words, letters, or random strings, based on legitimate domain names. Although humans readily perceive these domain names as related, their similarity is difficult to express as rules, and new patterns emerge frequently. Therefore, creating rules to cover all possible patterns requires a significant amount of effort.
In this paper, we propose a system called PhishReplicant that detects domain names registered by attackers. Our focus is on identifying GSDs that bear linguistic similarities to known phishing domain names. To achieve this, we leverage a Transformer-based language model and automatically generate matching rules. Since PhishReplicant only analyzes domain name strings, we can utilize various types of data without being restricted by data format. By extracting GSDs from the latest known malicious domain names, PhishReplicant can efficiently detect similar GSDs from the stream of newly registered domain names in real time. As a result, we can keep up with the latest attacks without the need for manually adding newly targeted domain names or training the model on newly emerged squatting techniques. By applying PhishReplicant to these data, we can collect GSDs before they are used as phishing sites or early in the emergence of phishing sites. The source code for PhishReplicant is available at https://github.com/tkoide398/PhishReplicant.
We conducted a four-week real-time evaluation experiment using Certificate Transparency (CT) logs [3], lists of registered domain names, and passive DNS traffic observed on 66 DNS cache servers in 18 countries. We found 3,498 GSDs registered by attackers, of which 2,821 were used as phishing sites within one month of detection. We also conducted experiments comparing PhishReplicant with existing systems, including rule-based and machine learning-based systems for detecting phishing-related domain names. The results showed that PhishReplicant had the highest detection accuracy and detected the largest number of domain names that did not directly contain exact brand names. Moreover, as an in-depth analysis, we clustered 205k GSDs collected over 150 days from the results of our proposed system and threat intelligence to reveal phishing tactics using GSDs. Since most GSDs in the same clusters are used for multiple days (with a median duration of 41 days), identifying GSDs based on the similarity of known phishing domain names can prevent the spread of phishing. We also found that phishing using GSDs targets brands in 35 countries and is biased toward certain geographic regions. Attackers used GSDs to deploy phishing sites targeting 265 brands, including many financial institutions such as banks and credit card services, to gain a monetary benefit.
In summary, we make the following contributions:
• We propose a system named PhishReplicant that utilizes language models and automated matching rules to detect GSDs by analyzing the similarity between phishing domain names.
• We conducted a real-time experiment using PhishReplicant for four weeks and successfully discovered 3,498 GSDs that attackers were likely to have acquired. Among them, 2,821 GSDs were used as phishing sites within one month after detection.
• We performed a comparative experiment between PhishReplicant and existing systems for detecting phishing-related domain names. The experiment showed that PhishReplicant achieved the highest detection accuracy and identified the most domain names not containing exact brand names.
• We analyzed 205k GSDs collected over a period of 150 days. The analysis revealed that phishing attacks using GSDs targeted 265 brands in 35 countries. Additionally, we clarified that each cluster of GSDs was observed over multiple days, the distribution of phishing attacks was biased, and attackers frequently targeted financial brands.

GENERATED SQUATTING DOMAIN
Attackers often obtain domain names that imitate legitimate services by employing domain squatting techniques. These domain names are then utilized to create phishing sites or embed links in emails and text messages. Previous studies proposed various methods to find domain squatting, e.g., using dictionaries such as "0" to "o" to create typosquatting [18,35,39,42] from legitimate domain names, generating similar domain names by machine learning models [27,36], and detecting combosquatting [22,43], which involves combining a popular brand or trademark with other words or phrases. We have observed that attackers frequently generate numerous similar domain names, which we refer to as Generated Squatting Domains (GSDs), to circumvent existing countermeasures. Figure 1 illustrates examples of GSDs employed for phishing sites resembling www.amazon.co[.]jp. These GSDs are generated through three primary techniques: combosquatting, typosquatting, and the use of deceptive subdomains [31], which position legitimate domain names within subdomains. These GSDs have different second-level domains comprised of six different characters. They consist of strings with a maximum edit distance of three from amazon and strings with a maximum edit distance of three for

PHISHREPLICANT
We propose a system called PhishReplicant for detecting GSDs from newly registered or observed domain names as input data. PhishReplicant analyzes the similarity between domain names of known phishing sites. Using a Transformer-based language model, the system can capture the linguistic characteristics of domain names and detect GSDs regardless of individual squatting techniques.
An overview of PhishReplicant is shown in Figure 2. The system consists of two steps. Step 1 takes known domain names used for phishing sites as input and extracts similar domain names through clustering. At this time, PhishReplicant converts the domain name strings into embedding vectors, which capture the linguistic features of GSDs. These extracted domain names are subsequently used in Step 2 for comparison purposes. This allows the system to extract new GSD patterns that frequently emerge from the latest phishing activity. In addition, as new brands are targeted by attackers, the system can handle them without manually adding them. PhishReplicant can even detect GSDs that do not explicitly contain the brand name itself or its variations by extracting patterns from a variety of GSDs without directly comparing them to legitimate domain names. Step 2 takes newly registered or observed domain names as input. The domain names that are similar to those extracted in Step 1 are output as GSDs. By detecting GSD candidates at the domain name registration and certificate issuance stages, before phishing sites are deployed, we can proactively prevent phishing attacks.

Step 1. Extracting Sets of Similar Domain Names
We obtain a set of similar domain names from Phishing Threat Intelligence (phishing TI). The phishing TI we used in this study comprises three sources: PhishTank [9], OpenPhish [8], and Twitter posts. To gather domain names related to phishing from the Twitter stream, we used CrowdCanary [28]. CrowdCanary extracts 104-dimensional features from the text and images of Twitter posts searched by keywords such as "phishing" and "email". Then, CrowdCanary classifies these posts into two classes, phishing-reports and non-reports, using a supervised machine learning model. We received data on domain names in phishing-reports from the author of CrowdCanary. To extract domain names used for phishing sites from these domain names, we automatically access the domain names using our crawler as described in Section 4.2. If the accessed web page contains a logo image of a specific brand and its domain name is not that of a legitimate site, our crawler adds the domain name to the phishing TI. In the following, we explain ways of excluding unnecessary domain names, extracting features from them, and clustering.
3. To address this issue, we explore ways to analyze textual similarities between domain names using a language model such as BERT [19]. BERT is a Transformer-based language model that has shown great effectiveness in various natural language processing tasks, such as sentence classification. When identifying GSDs using BERT, one approach is to add a fully connected layer on top of BERT to perform a classification task, such as binary classification of GSDs and benign domain names, or multi-class classification of targeted brands. However, this approach has two problems. First, analyzing a vast number of domain names and classifying them into the GSD class or brand-specific classes will result in numerous false positives. Second, new patterns of GSDs emerge frequently, requiring periodic fine-tuning of the BERT model. Similar issues have been discussed in a previous study [24], which computes the representative vector of the logo image of the phishing site for brand identification and examines the cosine similarity.
Therefore, PhishReplicant uses a language model as a feature extractor rather than a classifier and identifies GSDs by calculating the similarity of embedding vectors obtained from domain names. Specifically, PhishReplicant uses Sentence-BERT (SBERT) [34], a modification of the pre-trained BERT network, to represent the structure of domain names in text embeddings. The original BERT is unsuitable for semantic textual similarity tasks and clustering because it incurs significant computational overhead. To address this issue, SBERT uses siamese and triplet network structures to derive semantically meaningful text embeddings. We fine-tune SBERT on a dataset containing GSDs by using the triplet loss function so that the model learns domain name similarities. Please refer to Section 3.3 for the specific training process of SBERT.
3.1.3 Clustering. PhishReplicant clusters domain names using DBSCAN to extract sets of similar domain names. DBSCAN is a clustering algorithm that determines clusters by specifying the maximum distance of the neighborhood and the minimum number of data points within the clusters. DBSCAN outputs clusters on the basis of neighborhood density so that similar domain names can be grouped into the same cluster. Also, unlike clustering algorithms such as k-means, DBSCAN does not depend on the shape of the clusters or require specifying the number of clusters. Domain names not assigned to any cluster can be excluded as noise. We use cosine distance (1 minus cosine similarity) as the distance metric of DBSCAN. We set the minimum number of data points ($minPts$) to three and the maximum distance ($\epsilon$) to 0.04 on the basis of the experiment in Section 4.1.
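To make the clustering step concrete, the following is a minimal pure-Python sketch of DBSCAN over cosine distance. It is an illustration under stated assumptions rather than the system's implementation: the function names (`cosine_distance`, `dbscan`) are hypothetical, the pairwise neighborhood search is brute force, and the toy 2-D vectors stand in for the SBERT embeddings of domain names.

```python
import math


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dbscan(vectors, eps=0.04, min_pts=3):
    """Return one cluster label per vector; -1 marks noise."""
    n = len(vectors)
    # Brute-force neighborhoods; each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Toy vectors (assumed): three nearly parallel vectors and one outlier.
toy = [[1, 0], [1, 0.01], [1, 0.02], [0, 1]]
labels = dbscan(toy, eps=0.04, min_pts=3)
```

With these toy inputs, the three nearly parallel vectors form one cluster and the fourth is labeled as noise, which mirrors how domain names outside any cluster are excluded.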

Generating Matching Rules.
GSDs may share common characteristics, including the use of identical top-level domains (TLDs), effective second-level domains (e2LDs), such as example.co[.]uk, or the same number of characters. To minimize false positives, PhishReplicant leverages these characteristics to generate matching rules for each cluster. Specifically, PhishReplicant creates three types of rules by examining whether all domain names within the cluster share the same TLD, e2LD, or number of characters. If any of these conditions are met, the system outputs a matching rule that is common to all domain names within the cluster. For instance, suppose there is a cluster containing domain names example000.test, example001.test, and example002.test. In this case, the output rule would be {"tld":".test","num":15}.
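The rule generation described above can be sketched as follows. This is a simplified illustration, not the released implementation: the helper names (`naive_e2ld`, `generate_rule`, `matches`) and the `"e2ld"` key are assumptions, and the e2LD is approximated by the last two labels, whereas a real system would need the Public Suffix List to handle suffixes such as co.uk.

```python
def naive_e2ld(domain):
    # Assumption: approximate the e2LD by the last two labels.
    # This is wrong for multi-label suffixes such as "co.uk".
    return ".".join(domain.split(".")[-2:])


def generate_rule(cluster):
    """Build a rule from the properties shared by every domain name in a cluster."""
    rule = {}
    tlds = {"." + d.rsplit(".", 1)[-1] for d in cluster}
    e2lds = {naive_e2ld(d) for d in cluster}
    lengths = {len(d) for d in cluster}
    if len(tlds) == 1:
        rule["tld"] = tlds.pop()
    if len(e2lds) == 1:
        rule["e2ld"] = e2lds.pop()  # hypothetical key name
    if len(lengths) == 1:
        rule["num"] = lengths.pop()
    return rule


def matches(domain, rule):
    """Check a candidate domain name against every condition in the rule."""
    if "tld" in rule and not domain.endswith(rule["tld"]):
        return False
    if "e2ld" in rule and naive_e2ld(domain) != rule["e2ld"]:
        return False
    if "num" in rule and len(domain) != rule["num"]:
        return False
    return True


rule = generate_rule(["example000.test", "example001.test", "example002.test"])
```

For the cluster in the text, the sketch yields a rule with the shared TLD and the shared length of 15 characters; a candidate such as example777.test satisfies it, while a longer name under the same TLD does not.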

Step 2. Detecting GSDs from Newly Registered and Observed Domain Names
In Step 2, PhishReplicant receives newly registered and observed domain names and detects GSDs similar to known phishing domain names extracted in Step 1.

Extracting
For each vector $n_i$ in $N$ (the embedding vectors of new domain names), we calculate its cosine similarity with all vectors in $P$ (the embedding vectors of known phishing domain names) and count the vectors in $P$ whose similarity exceeds a threshold value $\theta$. This threshold value is defined as 1 minus the cosine distance value ($\epsilon$) used in Section 3.1.3, so it is set to 0.96. If the number of vectors in $P$ that exceed the threshold value $\theta$ is equal to or greater than $k$, we output the corresponding domain name as a GSD candidate. Since we set $minPts$ to 3 for clustering, we set $k$ to a smaller value of 2.
If this calculation is performed on a brute-force basis, its time complexity is $O(nm)$, where $n$ is the number of new domain names (more than 1M) and $m$ is the number of known phishing domain names (more than 50k) for one day of our experiment. To reduce this enormous amount of calculation, we use Faiss [5], a library for efficient similarity search of dense vectors. Faiss efficiently finds k-nearest neighbors of a query vector in large, high-dimensional vector databases through approximate search. After PhishReplicant finds candidate domain names, the system applies the matching rules to each candidate if the most similar domain name has corresponding rules. Our system excludes the domain names that do not match the rules and outputs the remaining ones as GSDs.
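As a reference point for the candidate search, the following sketch implements the brute-force variant described above: for each new vector, it counts the known phishing vectors whose cosine similarity exceeds the threshold and keeps the vector if the count reaches the minimum. It deliberately does not use Faiss, so it illustrates the decision rule only and not the approximate search. The function names and the toy 2-D vectors are assumptions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def detect_candidates(new_vectors, phishing_vectors, threshold=0.96, min_count=2):
    """Return indices of new vectors that have at least `min_count` phishing
    vectors with cosine similarity above `threshold` (brute force, O(n*m))."""
    candidates = []
    for i, n_vec in enumerate(new_vectors):
        count = sum(
            1 for p_vec in phishing_vectors
            if cosine_similarity(n_vec, p_vec) > threshold
        )
        if count >= min_count:
            candidates.append(i)
    return candidates


# Toy vectors (assumed) standing in for embeddings of domain names.
known_phishing = [[1, 0], [1, 0.05], [0, 1]]
new_domains = [[1, 0.02], [0.5, 0.5], [0, 1]]
hits = detect_candidates(new_domains, known_phishing)
```

Only the first toy vector has two sufficiently similar known phishing vectors; the third matches a single one and is therefore not reported, which shows why the minimum count suppresses isolated matches.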

Training
We explain how to train the SBERT model used in Sections 3.1 and 3.2 to represent the similarity of domain names. In this paper, we use the SentenceTransformer [10] framework and fine-tune a pre-trained model (all-mpnet-base-v2). The advantage of using SentenceTransformer is that it provides simple ways to fine-tune pre-trained models and compute dense vector representations for sentences. The pre-trained transformer model uses a tokenizer to split sentences into vocabulary tokens and assigns IDs to each of them. The model that we fine-tune adopts a method called WordPiece tokenizer, which splits some words into sub-words. For example, the tokenizer splits www.mastercard[.]com into www|.|master|##card|.|com (## represents the suffix following words). Thus, some brand names may not be represented. Therefore, we added common TLDs and brand names often abused for phishing sites to the tokenizer's vocabulary. We chose 1,292 TLDs that appeared in the two months of phishing TI. To collect brand names, we used the 277 brands that are registered in PhishPedia [24], and we also extracted 105 brands that were recently used in phishing attacks from the OpenPhish and PhishTank feeds. In total, we added 382 brands to the tokens.
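The effect of extending the vocabulary can be illustrated with a toy greedy longest-match tokenizer in the WordPiece style. This is a didactic sketch and not the tokenizer shipped with the pre-trained model: the five-entry vocabulary, the pre-tokenization regular expression, and the function names are assumptions chosen to reproduce the split shown in the text.

```python
import re


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word; non-initial pieces carry '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry covers this position
        pieces.append(piece)
        start = end
    return pieces


def tokenize(domain, vocab):
    """Split on punctuation first, then apply WordPiece to each chunk."""
    tokens = []
    for chunk in re.findall(r"[a-z0-9]+|[^a-z0-9\s]", domain.lower()):
        tokens.extend(wordpiece(chunk, vocab))
    return tokens


# Toy vocabulary (assumed), before and after adding the brand name.
base_vocab = {"www", ".", "master", "##card", "com"}
before = tokenize("www.mastercard.com", base_vocab)
after = tokenize("www.mastercard.com", base_vocab | {"mastercard"})
```

Before the brand name is added, it is split into two sub-word tokens; afterwards it is kept as a single token, which is the motivation for adding TLDs and brand names to the vocabulary.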
The purpose of this training is to create a text embedding model that maps similar squatting domain names close together in the vector space. Domain names can be shorter than the sentences that language models usually handle and have their own structure delimited by dots. By learning the structure of domain names from training data, the model can accurately identify the similarity of GSDs. We apply Triplet Loss [34] to fine-tune the SBERT model. Triplet Loss is a loss function computed over the embedding vectors of three input sentences (anchor, positive, negative). The objective of this loss function is to minimize the distance (inverse of cosine similarity) between the anchor and positive embedding vectors and to maximize the distance between the anchor and negative embedding vectors. In other words, the model learns that the anchor and positive domain names are semantically close, while the anchor and negative domain names are semantically distant. We create a structure called a triplet network that processes three sentences (triplets) with the same parameters. The output of the network is passed to the Triplet Loss function to calculate the loss value. We update the parameters to minimize this loss value. To prepare a triplet dataset, we extracted 42,311 GSDs from two months of phishing TI (the way of selecting GSDs is described in Section 4.1) and created 100k pairs of anchors and positives from similar GSDs. We randomly sampled 100k benign domain names as negatives from CT logs and created 100k triplets. We used this fine-tuned model, which was trained on the triplet data, in the following experiments.
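The objective can be written down in a few lines. The sketch below assumes the common hinge form of triplet loss with cosine distance, max(d(a, p) - d(a, n) + margin, 0); the margin value of 0.5 is an arbitrary illustrative assumption, since the text does not state the margin used, and the toy vectors stand in for embeddings produced by the triplet network.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss; `margin` is an assumed hyperparameter."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    # Zero once the negative is farther than the positive by at least the margin.
    return max(d_pos - d_neg + margin, 0.0)


# Toy embeddings (assumed): the positive is close to the anchor, the negative is not.
anchor, positive, negative = [1, 0], [1, 0.1], [0, 1]
well_separated = triplet_loss(anchor, positive, negative)
swapped = triplet_loss(anchor, negative, positive)
```

When the positive is already much closer to the anchor than the negative, the loss is zero and the triplet contributes no gradient; swapping the roles yields a large loss, which is the situation training is meant to correct.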

Deploying
We explain how to deploy PhishReplicant to detect the latest daily GSDs. Since the domain names in phishing TI are regularly updated, we should run Step 1 with updated data periodically. In this paper, we collected domain names listed in phishing TI during the preceding two months and ran Step 1 once a day. Step 2 should be executed as soon as new domain names are collected. For our experiment, we scheduled a job to detect GSDs once a day, as in Step 1. If we monitored the stream from CT logs and passive DNS in real time, we could detect GSDs immediately after domain registration and certificate issuance.

EVALUATION
In this section, we describe the evaluation experiments conducted on PhishReplicant. First, we evaluated the clustering results of Step 1 using a dataset. Then, we validated the output of the entire system during a certain period. Additionally, we conducted an experiment comparing PhishReplicant with existing tools for detecting domain names related to phishing.

Evaluation of Clustering
We performed an evaluation experiment to confirm that we can correctly find phishing domain names from phishing TI by clustering in Step 1.

Creating Dataset.
Since no existing data includes only GSDs and no existing methods identify GSDs, we need to create a dataset for evaluation. We labeled 34,095 domain names in the phishing TI for 20 days in October 2022 as positive (GSD) or negative. As mentioned in Section 2, GSDs have various patterns and are difficult to identify by specific rules. Therefore, using the guidelines below, we manually extracted GSDs on the basis of characteristics whose similarities become noticeable when the GSDs are listed side by side.
We create sets of at least three similar domain names that satisfy all of the following four conditions.
(1) Domain names contain the common strings: the entire or part of a brand name, or its typosquatting.
(2) The difference in length of the domain name excluding TLD or e2LD is less than three.
(3) If there are more than two subdomains, the number of subdomains is the same.
(4) The number and position of dots and hyphens (if they exist) are the same.
There are some GSDs that we cannot determine on the basis of the above conditions.We further consider the following information to add similar domain names to the sets.
• The date of domain registration and certificate issuance.
• Web content of the phishing sites when accessing the domain names.
• IP addresses. We exclude some hosting services' IP addresses because they may be shared among multiple users.
As a result, we labeled 8,411 (24.7%) domain names belonging to 1,011 sets for the positive data and 25,684 domain names for the negative data. We confirmed that the dataset had been correctly labeled with a review by security experts. Examples of the dataset are shown in Appendix A.
4.1.2 Clustering. We calculated embedding vectors from each domain name in the dataset by using the SBERT model trained in Section 3.3 and clustered them by DBSCAN. We performed clustering seven times, changing $\epsilon$, the threshold for the distance between vectors, in steps of 0.01 from 0.01 to 0.07. Domain names belonging to one of the clusters are identified as positive, and domain names not belonging to any cluster are identified as negative. Table 1 shows results for each $\epsilon$ for the three indicators: precision, recall, and accuracy. The higher the $\epsilon$, the more domain names belong to clusters, i.e., the fewer the false negatives. However, the fraction of true positives among the clustered instances (precision) is reduced. In the following experiments, we set $\epsilon$ to 0.04, the value with the highest accuracy (94.6%), to perform the clustering in Step 1. Also, we use this value for the threshold of the cosine distance in Step 2 to detect GSDs. As true positives, we were able to identify similar domain names where words were substituted and some letters were changed or added. For example, we found phishing domain names targeting Apple (apple-event-portal-support-online[.]com) and targeting Yahoo (yahmailllllll.godaddysites[.]com and yahmailll0.godaddysites[.]com). We also identified domain names with different e2LDs but similar subdomains. On the other hand, domain names that consist mostly of numbers and meaningless strings were identified as false positives. Examples are 3dq3e8b20gln5j27tro7bk7d24155gk8.web[.]app and hmzl0b2af7b7eed3176e826c7a.web[.]app. This is because the SBERT model splits long numbers into two- or three-character tokens; thus, domain names containing those tokens are incorrectly clustered. However, we correctly identified domain names that consisted of a brand name or its squatting followed by a hyphen and random numbers and strings. Examples are phishing domain names targeting a governmental organization in Germany (de-agbsession-q4ki42v[.]xyz and de-agb-session-q3j4na[.]xyz) and Facebook (business-meta-team-123908233.web[.]app and meta-business-form-1298067198.web[.]app).
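For reference, the three indicators used in this evaluation can be computed as in the sketch below, where a domain name is predicted positive if it was assigned to any cluster. The function name and the toy label lists are assumptions for illustration and are unrelated to the values reported in Table 1.

```python
def evaluate(predicted, actual):
    """Return (precision, recall, accuracy) for boolean predictions and labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(actual) if actual else 0.0
    return precision, recall, accuracy


# Toy example (assumed): True means "belongs to a cluster" / "labeled as GSD".
in_cluster = [True, True, False, False, True]
is_gsd = [True, False, False, True, True]
precision, recall, accuracy = evaluate(in_cluster, is_gsd)
```

Repeating this computation for each candidate distance threshold and keeping the value with the highest accuracy corresponds to the selection procedure described above.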

Evaluation of Real-time GSD Detection
We used PhishReplicant to detect GSDs from the input of newly registered or observed domain names and validated that the GSDs were used for phishing sites.

Experimental Setup. We explain how to collect domain names for input and how to determine if a detected GSD was later used as a phishing site. We collected domain names from the entire CT logs and from Zonefiles [13], a list of newly registered domain names, for four weeks (28 days) starting in November 2022. Note that we can only extract domain names of phishing sites using server certificates from CT logs. However, HTTPS phishing attacks have been reported to account for most (85.1%) phishing campaigns in recent years [21]. In addition, we confirmed that 48.7% of entries in phishing TI for a specific three-month period were HTTPS URLs.
We also observe passive DNS traffic as a data source of newly appeared domain names. The passive DNS traffic is gathered from 66 DNS cache servers in 18 countries on a global Tier 1 network. From this traffic, where 13M domain names are newly resolved daily, we extract domain names that have not appeared before. Domain names observed in passive DNS may have already been used for phishing sites. Unlike a previous method [35], which analyzes the lifetime and traffic of each domain name, our proposed system only analyzes strings, allowing us to take action in the early stages of emerging phishing sites.
To confirm whether the detected GSD was actually used as a phishing site, we conducted the following five types of verification.
URL Inspection Service: We use APIs of VirusTotal, URLScan, and Google Safe Browsing to verify that the detected GSD was reported as a phishing attempt. This confirmation process is performed for a maximum of one month.
Phishing TI: After detecting a GSD, we verify whether it is listed on phishing TI for up to one month.
Web Crawling: We built a web crawler to identify phishing sites on the basis of web content. The crawler accesses GSDs daily for up to one month, starting immediately after detection. We implemented the crawler using NodeJS and Google Chrome as a browser. Although Selenium is the most well-known browser automation tool, it could be detected by anti-bot systems [16]. Thus, we used ChromeDevToolsProtocol [4] to automate the browser. Some phishing sites do not allow access to web content due to a cloaking technique [30,44] if not accessed with the proper browser environment. For example, we observed that some phishing sites delivered via SMS use JavaScript to check whether the browser's UserAgent is mobile (iOS or Android) and either respond with phishing sites or redirect to other URLs (e.g., https://www.gooogle[.]com/). We set the UserAgent to Google Chrome on iOS and the viewport size to 390x844 pixels. We used Phishpedia [24] to identify a logo image in the screenshot to determine whether web content is related to phishing. We trained a Phishpedia model with logo images of 382 brands (the same as Section 3.3). Also, we visually check screenshots not identified by this model and confirm whether they display login forms or statements asking for money, such as requests for credit card numbers.
Passive DNS: Phishing sites that target users with specific attributes (e.g., region, device) or are accessible for only a short time may be missed in the above way. Therefore, we use passive DNS and confirm IP addresses registered as A records of phishing TI's domain names for the past two months. We queried Farsight DNSDB [6] and our passive DNS to retrieve IP addresses. If a GSD uses an IP address used for phishing, we determine that the GSD has been used as a phishing site. We exclude some web hosting services (e.g., firebaseapp[.]com, square[.]site, and godaddysites[.]com) because their IP addresses are shared with other users.
Manual Validation: We manually check that GSDs are similar to known phishing domain names in the same way as Section 4.1. GSDs that the security vendors missed may have been running as phishing sites. Also, some GSDs not used as phishing sites may have been registered by attackers as part of phishing campaigns. To complement those domain names, we performed this manual check.

Results. We analyzed newly registered and observed domain names for four weeks using our proposed system and detected 3,784 domain names. We confirmed through a manual check that 3,498 (92.4%) domain names were true positives, indicating they were related to known phishing domain names and were most likely obtained by attackers. Therefore, there were 286 false positives, accounting for 7.6% of the total. There are 2,821 (74.6%) domain names that were used as phishing sites after we detected them, which is the total number of those identified through validations, excluding manual checks. A total of 1,221 domain names, including 934 (VirusTotal), 433 (URLScan), and 564 (Google Safe Browsing) domain names, were actually used for phishing. In addition, we identified 430 domain names with phishing TI and 757 domain names by web crawling. When we crawled a set of similar domain names (resolved to a small number of common IP addresses) from a single source IP address, we confirmed a few phishing sites, followed by 404 responses from all the remaining domain names. Thus, some phishing sites limited the number of accesses from the same IP address. Of the domain names that were successfully accessed and were not confirmed to have phishing content, websites of 631 domain names exposed the cgi-bin directory [35]. Although attackers may have once used these websites for phishing, they had apparently abandoned them after removing the phishing content. As a result of analyzing IP address sharing using passive DNS, we confirmed that 2,106 domain names used the same IP addresses as those used by known phishing domain names. Unless the IP addresses used for phishing sites were coincidentally assigned, attackers retained control of the IP addresses and used them for other GSDs.

Comparative Evaluation with Baseline Systems
In the above analysis, we conducted evaluations to determine how many of the GSDs detected using PhishReplicant were actual phishing sites, including those not contained in existing phishing TI.
Here, we compared the performance of PhishReplicant with baseline systems for detecting domain names related to phishing. We conducted this comparison by running them on the same types of input feeds corresponding to the period from March 1 to March 31, 2023, and then comparing their detection results with phishing TI. We used four existing systems as baselines in our study: dnstwist [7], Phishing Catcher [1], StreamingPhish [2], and Ctl-pipeline [20]. StreamingPhish [2] is a machine learning (ML)-based system that identifies phishing domain names by extracting features from various certificate fields and domain name strings and applying a classifier (Logistic Regression). We used the attached dataset to train the classifier. Ctl-pipeline [20] is an ML-based system that identifies phishing domain names from server certificates by analyzing various certificate fields, domain name strings, and keywords contained in domain names. It extracts 126 types of features for its classifier. To train the Ctl-pipeline classifier with the latest data, we created a new dataset from CT logs and collected certificates from November 2022 to February 2023. We extracted certificates whose domain names were listed on OpenPhish or PhishTank as phishing-related certificates. A total of 27,153 certificates were labeled as phishing. We sampled the same number of certificates from the CT logs for benign label data. We used the ExtraTreesClassifier as the classifier and trained it on the dataset. After training the classifier, we obtained a model capable of detecting certificates containing phishing domain names with an Accuracy Score of 0.931 and an ROC AUC Score of 0.981. We set the threshold of the probability score for phishing labels to the default setting of 0.925. If the score exceeded this threshold, the domain name of the input certificate was determined to be phishing.

Results. Table 2 displays a comparison of detection results among PhishReplicant and the baseline systems. PhishReplicant, with 1,923 out of 7,358 domain names matching the phishing TI or Google Safe Browsing, achieved the highest percentage of matching domain names at 26.1%. This is significantly higher than the 0.4% to 1.9% achieved by the baseline systems. Note that the difference between 7,358 and 1,923 does not necessarily mean that all the remaining domain names were false positives. In fact, when further investigating the detection results of PhishReplicant using VirusTotal and URLScan, we found that 2,823 (38.4%) domain names matched in total. Among the matched domain names detected by PhishReplicant, 1,682 domain names were not detected by any other system. This accounted for 87.5% of the 1,923 matched domain names, demonstrating that the proposed system was able to detect many domain names that could not be detected by the baseline systems.
Although the number of matched domain names by the proposed system was lower than that of dnstwist and StreamingPhish, the results of these baseline systems include a large number of domain names that directly contain brand names without alteration, such as www.apple.ifindmy-id[.]com and paypal.em-inff[.]xyz. Specifically, dnstwist's output showed that out of the domain names that matched the phishing TI or GSB, 3,165 (86.8%) contained exact brand names, and StreamingPhish output 2,533 (67.2%) matched domain names that contained exact brand names. These domain names could be easily found by searching for brand names. When examining the number of matched domains that did not include exact brand names, PhishReplicant had the highest count at 1,800.
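To make the "contains an exact brand name" criterion concrete, the following is a minimal sketch of such a filter. The brand list is hypothetical, the helper names are ours, and the plain substring test is an assumption; the text does not specify how exact brand names were matched.

```python
def refang(domain: str) -> str:
    """Undo the '[.]' defanging used when printing malicious domain names."""
    return domain.replace("[.]", ".").lower()


def contains_exact_brand(domain: str, brands: set[str]) -> bool:
    """Return True if any brand string occurs unaltered in the domain name.

    A plain substring test is an assumption of this sketch; it also fires on
    accidental matches such as 'apple' inside 'pineapple'.
    """
    name = refang(domain)
    return any(brand in name for brand in brands)


# Hypothetical brand list; the domain names are examples quoted in the paper.
brands = {"apple", "paypal"}
matched = [
    "www.apple.ifindmy-id[.]com",
    "paypal.em-inff[.]xyz",
    "www.macesaoeod.of6jh4[.]icu",
]
without_brand = [d for d in matched if not contains_exact_brand(d, brands)]
print(without_brand)
```

Only the third name survives the filter, which is the kind of domain name counted in the "did not include exact brand names" comparison.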
Rule-based systems, such as dnstwist and Phishing Catcher, require the input of legitimate domain names that serve as the basis for detecting squatting domain names. Therefore, when a new brand becomes a target of phishing attacks, these systems require manual updates to include the new legitimate domain names. Additionally, ML-based baseline systems, such as StreamingPhish and Ctl-pipeline, need to update and modify their classifiers to align with the latest phishing trends. In contrast, PhishReplicant automatically extracts domain names as GSDs from the latest phishing TI and detects similar domain names among new domain names based on them. This eliminates the need for the frequent model updates required by baseline systems.

IN-DEPTH ANALYSIS OF GSDS
We analyzed the characteristics of sets of similar GSDs in detail, revealing the tactics and the infrastructure used by attackers. We clustered domain names detected by the proposed system and domain names extracted from phishing TI for 150 days from August 2022 in the same way as Section 3.1.3. We extracted 205,158 domain names (including the 3,784 GSDs detected in Section 4.2) grouped into 2,842 clusters. Each cluster consists of an average of 72.2 domain names. The cluster with the most domain names comprised 79,547 domain names. There are four clusters containing more than 10k domain names.

Duration of GSDs
We analyzed how long GSDs assigned to the same cluster were in use. We used the two types of passive DNS described in Section 4.2 to confirm the dates when domain names were first seen and identify the time difference between them in each cluster, as shown in Figure 3. The difference between each cluster's earliest and latest first-seen dates was 262 days on average, with a median of 41 days. If all similar domain names appeared in a short period (e.g., a few hours), our proposed system may not work effectively. However, 2,779 (97.8%) clusters of GSDs are used for more than one day, indicating that we can almost completely prevent access to phishing sites by detecting GSDs on the basis of their similarity to the first ones.
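The per-cluster duration measure can be sketched as follows. The cluster IDs and dates are hypothetical toy data, and treating "used for more than one day" as a span of at least one day is our assumption; the passive DNS lookup that supplies the first-seen dates is not shown.

```python
from datetime import date
from statistics import mean, median


def first_seen_spans(clusters):
    """Map each cluster ID to the days between its earliest and latest first-seen dates."""
    return {
        cid: (max(dates) - min(dates)).days
        for cid, dates in clusters.items()
        if dates
    }


# Hypothetical first-seen dates per cluster (one date per domain name).
clusters = {
    "a": [date(2022, 8, 1), date(2022, 8, 5), date(2022, 8, 9)],
    "b": [date(2022, 8, 1), date(2022, 8, 1)],
    "c": [date(2022, 8, 1), date(2022, 11, 9)],
}
spans = first_seen_spans(clusters)
average_span = mean(spans.values())
median_span = median(spans.values())
# Assumption: a cluster counts as long-lived when its span is at least one day.
long_lived = sum(1 for days in spans.values() if days >= 1)
print(spans, average_span, median_span, long_lived)
```

As in the reported figures, a few long-running clusters pull the mean well above the median.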
We found that some domain names have been used intermittently for phishing sites over several years. While GSDs created using multiple squatting techniques are often out of use within a few dozen days, typosquatting domains with close edit distance to legitimate domain names are used for a long time. For example, steamcomnulty[.]com has been used frequently to host phishing websites over the past decade and is considered a drop catch, that is, the reacquisition of an expired domain name.
Case Study: Cluster A. We observed a cluster consisting of 4,639 domain names that first appeared over a period of eight days. Examples are www.macesaoeod.of6jh4[.]icu, www.maceseoeod.r7302f[.]icu, and www.macesarrod.mfxzq4[.]icu. These domain names appear to be squatting on MasterCard; however, they were used for phishing sites of three different credit card brands, which shared a single IP address of ASN-QUADRANET-GLOBAL (AS8100).
Case Study: Cluster B. We found another cluster consisting of 593 domain names that first appeared over a period of three days.

IP Address Sharing
As explained in the case studies, similar GSDs reuse IP addresses. We analyzed the IP addresses used for domain names of the 2,374 clusters for which A records existed in passive DNS out of 2,874 total clusters. There are 1,554 (65.5%) clusters whose domain names share one or two IP addresses. On the other hand, a few domain names in some clusters shared a large number of IP addresses. For example, four domain names in the same cluster, which targeted Sparkasse, a German financial institution, shared 265 IP addresses. There are 426 clusters where IP addresses associated with domain names outnumbered domain names. Although some GSDs can be detected by finding domain names associated with IP addresses of known phishing domain names in passive DNS, it is not a comprehensive method because not all GSDs share IP addresses. Our proposed system, which analyzes domain name strings alone, can comprehensively detect GSDs, including domain names that share many IPs with a few domain names.
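The two cluster-level counts used in this analysis can be sketched as below. The input layout (cluster ID to domain name to set of IP addresses), the function name, and the toy records are assumptions for illustration; the IP addresses come from documentation ranges.

```python
def ip_sharing_stats(clusters):
    """Summarize IP reuse for clusters given as {cluster_id: {domain: set_of_ips}}.

    Returns (clusters sharing one or two IPs, clusters with more IPs than domains).
    """
    few_ips = 0
    more_ips_than_domains = 0
    for domains in clusters.values():
        ips = set().union(*domains.values())  # distinct IPs seen in the cluster
        if 1 <= len(ips) <= 2:
            few_ips += 1
        if len(ips) > len(domains):
            more_ips_than_domains += 1
    return few_ips, more_ips_than_domains


# Hypothetical passive DNS A-record data.
clusters = {
    "x": {"d1.example": {"192.0.2.10"}, "d2.example": {"192.0.2.10"},
          "d3.example": {"192.0.2.10"}},
    "y": {"d4.example": {"198.51.100.1", "198.51.100.2", "198.51.100.3"}},
    "z": {"d5.example": {"203.0.113.1"}, "d6.example": {"203.0.113.2"}},
}
few_ips, more_ips_than_domains = ip_sharing_stats(clusters)
print(few_ips, more_ips_than_domains)
```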

Edit Distance
We used the Damerau-Levenshtein distance as the edit distance between domain names for each cluster to analyze the similarity of their appearance. We calculated the edit distance for all domain name pairs in each cluster and show their average in Figure 4. There are 693 clusters (4,759 domain names) whose average edit distance was less than two. Even when trying to find domain names with close edit distance to known phishing domain names, the number of GSDs we can detect is limited. Our proposed system can detect GSDs by focusing on the similarity of domain names, even if the edit distance increases due to letters and words being replaced, deleted, or added.
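The average pairwise edit distance per cluster can be sketched as follows. The code implements the restricted Damerau-Levenshtein distance (optimal string alignment), which counts insertions, deletions, substitutions, and adjacent transpositions; whether the paper used this restricted variant or the unrestricted one is not stated, so treat that choice as an assumption.

```python
from itertools import combinations


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


def average_pairwise_distance(names):
    """Mean distance over all unordered pairs; 0.0 for fewer than two names."""
    pairs = list(combinations(names, 2))
    if not pairs:
        return 0.0
    return sum(osa_distance(x, y) for x, y in pairs) / len(pairs)


toy_average = average_pairwise_distance(["abc", "acb", "abd"])
# Example domain names quoted in the Cluster A case study.
cluster_a = ["www.macesaoeod.of6jh4.icu", "www.maceseoeod.r7302f.icu",
             "www.macesarrod.mfxzq4.icu"]
print(toy_average, average_pairwise_distance(cluster_a))
```

The computation is quadratic in cluster size, so very large clusters would likely need sampling in practice.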

Phishing Targeted Brands
We analyzed brands targeted by phishing campaigns. We extracted brand information associated with 205k domain names from phishing TI and screenshots of crawling results. As a result, we found that 265 brands were imitated in 165,643 domain names, excluding those that could not be identified. We investigated these brands; Table 5 shows the top 10 brands and the countries to which they belong. Domain names targeting brands in Japan are the most common, accounting for 90.3% of the total. We investigated brands of all domain names reported on PhishTank and OpenPhish during the same period and found that the United States accounted for 69.7%, while Japan accounted for 8.4%. Phishing using GSDs has been observed worldwide. However, compared with the overall phishing trend, GSDs are particularly prevalent in Japan.
We analyzed 1,594 clusters whose domain names were associated with at least one brand and found that 88% of the clusters have a single brand. The average number of brands per cluster was 1.27. There were some clusters whose domain names included strings similar to a single brand but targeted multiple brands. In other words, attackers created GSDs to imitate a brand and used them for phishing for unrelated brands.
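The per-cluster brand statistics can be sketched as below. The input layout, the function name, and the brand labels are hypothetical; the sketch only shows how the single-brand share and the average number of brands per cluster would be derived once each domain name carries a brand label.

```python
def brand_stats(cluster_brands):
    """Compute brand statistics from {cluster_id: [brand label or None per domain]}.

    Returns (clusters with at least one brand, share with exactly one brand,
    average number of distinct brands per such cluster).
    """
    brand_sets = [
        {b for b in labels if b is not None} for labels in cluster_brands.values()
    ]
    branded = [s for s in brand_sets if s]  # drop clusters with no identified brand
    single_share = sum(1 for s in branded if len(s) == 1) / len(branded)
    average_brands = sum(len(s) for s in branded) / len(branded)
    return len(branded), single_share, average_brands


# Hypothetical labels; None marks a domain name whose brand was not identified.
cluster_brands = {
    "a": ["CardA", "CardA", None],
    "b": ["CardA", "CardB", "CardC"],
    "c": [None],
    "d": ["BankX"],
    "e": ["BankY", "BankY"],
}
branded_count, single_share, average_brands = brand_stats(cluster_brands)
print(branded_count, single_share, average_brands)
```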

LIMITATIONS
In our evaluation of the proposed system, we utilized a verified dataset and conducted appropriate experiments; however, there are certain limitations to consider. The system identifies GSDs potentially owned by attackers by examining their similarity to known phishing domain names. It is crucial to recognize that not all GSDs prepared by attackers are used for actual phishing sites. Unlike previous studies, which required accessing websites to detect phishing content, our proposed system can identify domain names without the need for direct access. By solely analyzing domain name strings, our system can efficiently detect domain names prepared by attackers, even before the phishing sites become active.
The proposed system can only detect GSDs that are similar to known phishing domain names, which means that it cannot detect entirely new GSDs, such as those that use new squatting techniques. As described in Section 5.1, many of the similar GSDs remain active for more than one day. However, if we can observe the early stages of the emergence of phishing sites using new GSDs, we can detect the remaining GSDs that are similar to the domain names. Furthermore, similar issues occur with the existing systems shown in Section 4.3. These systems require additional manual effort for brand selection and training when new brands are targeted. In contrast, the proposed system automatically detects the latest phishing-related domain names based on the similarities extracted from phishing TI. Therefore, our system can respond to the emergence of new brands without the need for frequent additional training. We conducted experiments using various phishing TIs and URL inspection services as ground truth to validate a diverse set of phishing domain names. However, due to the lack of a comprehensive phishing feed, the potential for false negatives remains. Also, the output of our system may be limited to domain names similar to those listed in the phishing TI.
In this study, we did not perform an analysis of GSDs employing internationalized domain names (IDNs) due to the limited number of observations available for model training. For instance, within a span of 150 days of phishing TI, only approximately 400 IDNs were encountered, and none of these instances involved squatting domain names that led to brand name misidentification. Nevertheless, should GSDs with IDNs arise in the future, our proposed system can detect them by converting Unicode characters to the corresponding ASCII characters (e.g., ã to a) before extracting features. Furthermore, the detection of GSDs incorporating non-English brand names within IDNs can be achieved by fine-tuning a multilingual pre-trained model.
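One possible way to perform the Unicode-to-ASCII conversion mentioned above is sketched here using only Python's standard library. This is an assumption about how such preprocessing could look, not the authors' implementation: Punycode labels are decoded with the built-in IDNA codec, diacritics are removed via NFKD decomposition, and the example domain name is hypothetical.

```python
import unicodedata


def to_unicode(domain: str) -> str:
    """Decode Punycode ('xn--') labels into Unicode using the built-in IDNA codec."""
    return domain.encode("ascii").decode("idna")


def fold_to_ascii(text: str) -> str:
    """Strip diacritics by NFKD decomposition, then drop remaining non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "ignore").decode("ascii")


# Hypothetical IDN; its Punycode form is produced by the codec, not typed by hand.
punycode_form = "pãypal.example".encode("idna").decode("ascii")
folded = fold_to_ascii(to_unicode(punycode_form))
print(punycode_form, folded)
```

Characters from other scripts that merely look like ASCII letters (for example, Cyrillic homoglyphs) have no such decomposition and are simply dropped by this sketch; mapping them would need a separate confusables table.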

RELATED WORK
In recent years, domain squatting has emerged as a significant threat to Internet security, with attackers exploiting various techniques to conduct malicious activities. Consequently, numerous studies have been undertaken to investigate and address this issue. For example, Agten et al. [15] conducted a comprehensive longitudinal study on typosquatting by creating similar strings from legitimate domain names on the basis of edit distance. They found that strict policies and accessible dispute-resolution procedures can effectively reduce typosquatting abuse. Similarly, Nikiforakis et al. [29] explored the prevalence of bitsquatting in domain squatting and highlighted its common use in malicious activities such as drive-by download attacks and deception-based software installations.
Furthermore, Simpson et al. [36] investigated visually impersonating domain names (VIDNs) in business email compromise (BEC) frauds, demonstrating that their implementation of countermeasures has led to a decline in new VIDN registrations by criminals. Kintis et al. [22] investigated combosquatting, a technique that combines recognizable brand names with other keywords to create malicious domains. Their findings revealed that combosquatting domains are far more prevalent than typosquatting domains, emphasizing the need for more robust detection mechanisms. While these studies primarily focus on identifying domain names that are easily mistaken for legitimate ones, our research aims to detect domain names generated by multiple squatting techniques to evade existing detection methods, based on their similarity to known phishing domain names.
Several studies have also targeted the detection of homograph domains, including IDNs, using pairs of Unicode and ASCII characters [33,38]. Chiba et al. [17] proposed a system for detecting and scoring deceptive IDNs, incorporating measurement studies and online surveys to evaluate the effectiveness of their scoring metric and suggest practical countermeasures. Other research has focused on phishing attacks that target specific industries or events, such as online banking and phishing campaigns related to COVID-19 [18,23,42].
Previous studies have also explored the identification of phishing domain names using features specific to certificates in Certificate Transparency (CT) logs and passive DNS traffic. Drichel et al. [20] proposed a detection pipeline capable of performing retrospective analysis and live classification of certificates published in CT logs, using machine learning models to identify phishing domain names. Sabah et al. [35] developed an approach to detect phishing by extracting various features from CT logs, certificate-based characteristics, passive DNS traffic, and lexical characteristics. Their live experiments identified phishing domains days before they were detected by Google Safe Browsing and VirusTotal. Our proposed system aims to provide a comprehensive solution for detecting GSDs by analyzing linguistic characteristics using a language model, without relying on specific input formats. This approach ensures a more robust and versatile detection mechanism in the evolving landscape of domain squatting techniques.
Several systems have been proposed for detecting phishing sites based on their appearance, such as VisualPhishNet [14], PhishPedia [24], and PhishIntention [26]. These systems use deep learning-based image recognition techniques to compare screenshot images obtained by web crawlers with those of legitimate web pages. By leveraging such visual comparison and logo recognition, these systems can accurately detect phishing sites that resemble legitimate ones. However, the effectiveness of these systems is limited by their dependence on web crawlers. Specifically, some phishing sites use cloaking techniques that make them inaccessible to web crawlers, making it difficult to identify all phishing sites. In this study, we used PhishPedia to validate whether the proposed system detected actual phishing sites. Consequently, it is essential to implement the proposed system in conjunction with these web crawling systems to effectively mitigate phishing attacks.

CONCLUSION
In conclusion, the increasing sophistication of domain squatting techniques requires a more advanced approach to detecting and preventing phishing attacks. In response to this challenge, we have proposed a system called PhishReplicant, which focuses on detecting generated squatting domains (GSDs) that employ a combination of squatting techniques. By leveraging linguistic similarities between known malicious domain names, PhishReplicant can efficiently identify GSDs without the need for frequent model updates. We have demonstrated the effectiveness of our proposed system by analyzing certificate transparency (CT) logs, lists of newly registered domain names, and passive DNS data observed on 66 DNS cache servers across 18 countries. Over a four-week period, PhishReplicant successfully detected 3,498 GSDs, with 2,821 of these being used for phishing sites within one month of detection. Furthermore, our system outperformed baseline systems in terms of both detection accuracy and the number of identified domain names that did not include exact brand names. An in-depth analysis of 205k GSDs collected over a 150-day period revealed that these domains targeted 265 brands in 35 countries, with a particular focus on financial institutions and specific geographic regions. This underscores the importance of employing advanced detection systems such as PhishReplicant to protect users from evolving phishing threats. By continuously monitoring and identifying GSDs, we can proactively prevent phishing attacks before they have the opportunity to cause harm or shortly after they emerge, thereby enhancing overall web security.

Figure 3: Comparison of time differences between the earliest and latest first-seen dates of domain names for each cluster. The inner bars highlight data within the first 100 days, while the outer bars display the overall data.

Figure 4: Average edit distance between domain names in each cluster.

Table 1: Clustering results for each EPS.
Baseline Systems. We describe the baseline systems we used to evaluate the performance of PhishReplicant. dnstwist [7] is a system that generates candidate squatting domain names based on a given domain name, using its fuzzing algorithms and dictionaries. It supports multiple types of squatting, such as typosquatting, bitsquatting, and homoglyph. Although dnstwist has an online phishing inspection function for web pages, we used it as a generator of squatting domain names in this study. Seed domain names used as input were the domain names of 382 brands described in Section 3.3.
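To illustrate what generating squatting candidates from a seed label involves, here is a minimal sketch covering three simple mutation types: character omission, adjacent transposition, and single-bit flips (bitsquatting). It is not dnstwist's implementation or API; all function names are hypothetical, and real generators cover many more techniques (homoglyphs, dictionaries, and so on).

```python
import string

VALID = set(string.ascii_lowercase + string.digits + "-")  # hostname characters


def omissions(label: str) -> set[str]:
    """Drop one character at each position."""
    return {label[:i] + label[i + 1:] for i in range(len(label))}


def transpositions(label: str) -> set[str]:
    """Swap each pair of adjacent, distinct characters."""
    return {
        label[:i] + label[i + 1] + label[i] + label[i + 2:]
        for i in range(len(label) - 1)
        if label[i] != label[i + 1]
    }


def bitsquats(label: str) -> set[str]:
    """Flip one bit of one character, keeping results that are valid hostname characters."""
    out = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in VALID:
                out.add(label[:i] + flipped + label[i + 1:])
    return out


def squatting_candidates(label: str) -> set[str]:
    """Union of the three mutation sets, without labels that start or end with a hyphen."""
    candidates = omissions(label) | transpositions(label) | bitsquats(label)
    return {c for c in candidates if c and not c.startswith("-") and not c.endswith("-")}


candidates = squatting_candidates("apple")
print(len(candidates), sorted(candidates)[:5])
```

Because every candidate is derived from a seed label, a generator of this kind can only cover brands that were supplied as input.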
Phishing Catcher [1] is a system that extracts domain names related to phishing by analyzing common names and SAN fields in server certificates. It calculates a score based on rule-based methods, such as keyword matching and Levenshtein distance comparison with legitimate domain names, and identifies the domain names as malicious.

Table 2: Comparison of detection results among PhishReplicant and baseline systems.
Examples are www.vianiocercenure.visoreoecssvxarercmsvi.baflrzt[.]rest, www.vianioceorcenure.visoreoecssiercmsvi.xppqjxo[.]rest, and www.vianiocenure.visoresiercmsvi.pefczzw[.]rest. These domain names have different e2LDs as in the clusters above and six patterns in each subdomain. They were used for phishing sites for two credit card brands related to Visa, sharing a single IP address of ColoCrossing (AS36352).

Table 4: Top 10 countries.

The top 10 categories are shown in Table 3. The Other category includes consumer electronics, gaming, streaming, gambling, and energy. The Credit Card category includes 20 brands and accounts for 70.9% of all domain names. Phishing sites targeting

Table 5: Top 10 brands.

Most domain names in the government category were related to tax payments, such as the revenue services and tax agencies in the United States, Australia, Japan, Turkey, France, and the United Kingdom. Table 4 lists the top 10 countries (35 countries in total) to which each brand belongs, ranked by the number of domain names.