Poster: Revealing Hidden Secrets: Decoding DNS PTR records with Large Language Models

Geolocating network devices is essential for various research areas. Yet, despite notable advancements, it continues to be one of the most challenging issues for experimentalists. An approach to geolocation that has proved effective is leveraging location hints in PTR records associated with network devices. We argue that Large Language Models (LLMs), rather than humans, are better equipped to identify patterns in DNS PTR records and can significantly scale the coverage of tools like Hoiho. We introduce an approach that leverages LLMs to classify PTR records, generate regular expressions for these classes, and produce hint-to-location mappings. We present preliminary results showing the applicability of LLMs as a scalable approach to leveraging PTR records for infrastructure geolocation.


INTRODUCTION
Geolocating network devices is essential for various research areas (e.g., [3-5, 8, 13]). Yet, after two decades and despite notable advancements, it continues to be one of the most challenging issues for experimentalists [9].
Geolocating devices can be divided into two distinct problems: geolocating end-hosts and geolocating network infrastructure. While end-host geolocation has advanced significantly due to its commercial value, infrastructure geolocation remains significantly underdeveloped, and techniques commonly used for geolocating end-hosts do not always translate well to routers and servers. For instance, while latency-based geolocation is generally effective for end-hosts, routers often ignore ICMP echo requests.
An approach for geolocating infrastructure that has proved effective is leveraging geolocation hints in PTR records associated with network devices. Network operators encode physical location hints in DNS hostname strings of network devices to help with troubleshooting and operation [1], and previous work has shown the potential value of leveraging this information [2,6,10,11].
As early as 2002, Rocketfuel [11] used manually assembled collections of regular expressions (regexes) to extract PTR geolocation hints. More recently, several efforts have tried to automate the task of extracting these location hints. Extracting and interpreting geo-hints from PTR records is challenging. For starters, the labels are primarily designed for human interpretation rather than computational processing. In addition, there is a lack of standardization across operators in what geographic information is encoded and how, which leads to an ad hoc approach for each codification. Even within a single operator, legacy infrastructure from rebranding, mergers, and acquisitions results in multiple standards that can take decades to converge. For example, although the merger was executed almost 20 years ago, AT&T still uses Southwestern Bell Corporation (SBC) Global labels, such as 99-170-164-205.lightspeed.tukrga.sbcglobal.net. Such inconsistency often appears in networks with large geographic spans managed by multiple teams and divisions, as seen in companies like Google.
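To make the extraction task concrete, the following minimal sketch applies one hand-written regex to the AT&T hostname quoted above. The pattern, the name `LIGHTSPEED_RE`, and the helper `extract_hint` are hypothetical; the assumed structure (a dashed IPv4 address, the `lightspeed` label, then a location label) is inferred from that single example and is not a documented AT&T convention.

```python
import re
from typing import Optional

# Hypothetical pattern inferred from one example hostname:
# <dashed IPv4>.lightspeed.<hint>.sbcglobal.net
LIGHTSPEED_RE = re.compile(
    r"^(?:\d{1,3}-){3}\d{1,3}\.lightspeed\.(?P<hint>[a-z0-9]+)\.sbcglobal\.net$"
)


def extract_hint(hostname: str) -> Optional[str]:
    """Return the location label embedded in the hostname, or None if it does not match."""
    match = LIGHTSPEED_RE.match(hostname.lower())
    return match.group("hint") if match else None


print(extract_hint("99-170-164-205.lightspeed.tukrga.sbcglobal.net"))  # tukrga
print(extract_hint("router1.example.net"))  # None
```

The regex only isolates the label; turning a label such as `tukrga` into a place name is a separate mapping step, which is why each naming convention traditionally required its own manual effort.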
Huffaker et al. [2] try to automate part of the task by searching for geographic encodings based on a previously populated dictionary of geography-related strings. More recently, Luckie et al. [6] automatically extract and interpret geo-hints embedded in hostnames using regexes informed by a dictionary (which includes strings such as airport codes and city, state, and country names), and learn simple deviations from geo-hints such as prefixes (e.g., "ash" for "Ashburn") and partial matches (e.g., "ftcollins" for "Fort Collins").
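The two kinds of deviation mentioned above can be illustrated with a small sketch. This is not Hoiho's algorithm: the two-entry dictionary, the function names, and the matching rules (a prefix test, and a same-first-letter subsequence test for partial matches) are assumptions chosen only to reproduce the two examples from the text.

```python
import re
from typing import List, Optional, Tuple

# Toy dictionary; a real one would hold airport codes and city, state, and country names.
CITIES = ["Ashburn", "Fort Collins"]


def normalize(name: str) -> str:
    """Lowercase the name and keep letters only."""
    return re.sub(r"[^a-z]", "", name.lower())


def is_subsequence(short: str, full: str) -> bool:
    """True if the characters of `short` appear in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def match_city(token: str, cities: List[str] = CITIES,
               min_len: int = 3) -> Optional[Tuple[str, str]]:
    """Return (city, rule) for the first dictionary entry the token matches, else None."""
    token = normalize(token)
    if len(token) < min_len:
        return None
    for city in cities:
        full = normalize(city)
        if token == full:
            return city, "exact"
        if full.startswith(token):
            return city, "prefix"
        # Assumed rule for partial matches: same first letter, letters in order.
        if token[0] == full[0] and is_subsequence(token, full):
            return city, "partial"
    return None


print(match_city("ash"))        # ('Ashburn', 'prefix')
print(match_city("ftcollins"))  # ('Fort Collins', 'partial')
print(match_city("xyz"))        # None
```

Rules this permissive would produce many false positives against a full dictionary, which hints at why scaling dictionary-driven approaches is hard.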
While these approaches are highly effective, their coverage, and that of the associated tools and datasets, is limited, largely due to the challenge of scaling up some of the steps needed to create candidate regular expressions. Hoiho [6], the software component implementing Luckie et al.'s approach, resorts to the MaxMind database for the large majority of IPs it cannot geolocate based on PTR records. For a CAIDA ITDK dataset (itdk-2023-03, using traces collected between March 8 and 13, 2023), it was able to extract records from 0.041% of the IP addresses, although about half of the records have associated PTR record information.
Our work is based on the observation that Large Language Models (LLMs), rather than humans, may be better equipped to identify patterns in DNS PTR records and create extraction rules, offering a path to significantly scale the coverage of tools like Hoiho. Our approach uses LLMs to (1) classify PTR records into distinct groups based on their structure and potential geographic hints, (2) generate regular expressions based on these classifications, identifying patterns and consistent naming conventions, and (3) map the identified classifications and regex patterns to geographic locations by linking encoded hints with actual place names.
The following paragraphs describe our approach and present some preliminary results.

APPROACH & DESIGN
The emergence of Large Language Models (LLMs), e.g., GPT-4, has redefined the automation of information extraction (IE) tasks, including Named Entity Recognition (NER) for specialized fields, such as identifying network infrastructure encodings in our case. These LLMs leverage few-shot learning (FSL) [12] to learn from limited data, simplifying the development of new frameworks without re-training. We adopt this model to develop pipelines that employ modern LLMs to learn patterns from examples and create extraction rules from limited cases.
Instead of a one-shot approach, we divide the process of generating regular expressions and geo-hints into multiple intermediate steps to maximize their precision. Our approach decodes PTR records using a multi-step process involving three distinct LLMs. Each is specialized for a particular task and works with a subset of records from a given provider, as follows:
Classification. We use an LLM to categorize PTR records into classes. The prompts in this stage guide the model to accurately identify and label each record according to its class, considering various features and patterns within the data.
Regex Generation. Following classification, we employ a regex generation LLM to create regular expressions from the patterns observed in the classified records. The prompts for this model generate regex patterns that can match and extract relevant information from the records. This is critical for precise parsing and interpretation, as it allows the system to handle a diverse range of record formats and structures with high accuracy.
Hint Map Generation. The final component uses an LLM for hint map generation. This model correlates specific hints to geographic locations or other relevant attributes. The prompts are designed to produce mappings that enhance the accuracy of decoding network information. By providing context-specific hints, this step aids in interpreting complex data and improves the system's overall efficacy.
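The three stages above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the authors' implementation: the prompts, the JSON reply formats, the named group `hint`, and all function names are hypothetical, and each model is abstracted as a plain callable that takes a prompt and returns text. Canned stand-ins replace the models so the sketch runs offline.

```python
import json
import re
from typing import Callable, Dict, List, Optional

LLM = Callable[[str], str]  # prompt in, text out; assumed wrapper around any model client


def classify(llm: LLM, records: List[str]) -> Dict[str, List[str]]:
    """Stage 1: group records into named classes (assumed reply: JSON class -> records)."""
    prompt = "Group these PTR records by naming convention. Reply as JSON.\n"
    return json.loads(llm(prompt + "\n".join(records)))


def generate_regex(llm: LLM, class_name: str, examples: List[str]) -> "re.Pattern":
    """Stage 2: request a regex with a named group 'hint' and sanity-check the reply."""
    prompt = f"Write one regex with a named group 'hint' for class {class_name}:\n"
    pattern = re.compile(llm(prompt + "\n".join(examples)).strip())
    # Reject patterns that lack the group or fail to match the examples they came from.
    if "hint" not in pattern.groupindex or not all(pattern.search(e) for e in examples):
        raise ValueError(f"regex rejected for class {class_name}")
    return pattern


def generate_hint_map(llm: LLM, hints: List[str]) -> Dict[str, str]:
    """Stage 3: map each distinct hint to a place name (assumed reply: JSON hint -> place)."""
    prompt = "Map each hint to a location. Reply as JSON.\n"
    return json.loads(llm(prompt + "\n".join(sorted(set(hints)))))


def decode(llm_classify: LLM, llm_regex: LLM, llm_map: LLM,
           records: List[str]) -> Dict[str, Optional[str]]:
    """Run the three stages and return record -> place name (None if the hint is unmapped)."""
    results: Dict[str, Optional[str]] = {}
    for class_name, members in classify(llm_classify, records).items():
        pattern = generate_regex(llm_regex, class_name, members)
        hints = [pattern.search(m).group("hint") for m in members]
        hint_map = generate_hint_map(llm_map, hints)
        for record, hint in zip(members, hints):
            results[record] = hint_map.get(hint)
    return results


# Canned stand-ins for the three models; hostnames are made up for illustration.
records = ["a.ash.example.net", "b.ash.example.net"]
stub_classify = lambda prompt: json.dumps({"site-code": records})
stub_regex = lambda prompt: r"^[a-z]+\.(?P<hint>[a-z]+)\.example\.net$"
stub_map = lambda prompt: json.dumps({"ash": "Ashburn"})

print(decode(stub_classify, stub_regex, stub_map, records))
```

Checking each generated regex against its own examples before use is one inexpensive guard against malformed model output; a real pipeline would likely need stronger validation.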

PRELIMINARY RESULTS
To evaluate our approach, we selected a subset of 680 ASNs to analyze. We chose all large cloud providers, tier-1 ISPs (i.e., ISPs without customer-to-provider relationships), the top 390 ASes from APNIC's Internet population data (together representing over 80% of the Internet population), and the top AS per country according to APNIC's AS Internet population data. In total, the dataset includes 680 ASNs and 854,317,370 PTR records. From our classification model, we find that 23% of these ASNs encode geographic information with a finer granularity than country level in their records. Additionally, the analysis identified 1,409 unique regular expressions (regexes) used for geolocation. We apply our approach to a dataset containing 51,840 ASNs and 1,282,817,253 PTR records collected by OpenIntel [7]. Out of these, we generate regular expressions and hint mappings for 680 ASes.
First Attempts at Validation. We extracted geo-hints from AT&T (AS7018) records in the CAIDA ITDK dataset (itdk-2023-03); AT&T is a particularly challenging operator due to its non-standard encodings. Hoiho extracts data from 563 out of 239,796 AT&T records. Our LLM-based approach is able to extract data from 38,883 records. Our validation involves pinging the geolocated AT&T devices from RIPE Atlas probes in the same city, a different city in the same country, a different country, and a different continent. Figure 1 shows clearly separated latency distributions, suggesting that the majority of devices were correctly geolocated.
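For scale, the record counts above can be converted into coverage fractions. The short calculation below uses only the three numbers reported in this paragraph; the variable names are ours.

```python
total_records = 239_796  # AT&T records in itdk-2023-03
hoiho_hits = 563         # records Hoiho extracts data from
llm_hits = 38_883        # records the LLM-based approach extracts data from

hoiho_cov = hoiho_hits / total_records
llm_cov = llm_hits / total_records
gain = llm_hits / hoiho_hits

print(f"Hoiho coverage:     {hoiho_cov:.2%}")  # about 0.23%
print(f"LLM-based coverage: {llm_cov:.1%}")    # about 16.2%
print(f"Ratio:              {gain:.1f}x")      # about 69.1x
```

These fractions describe how many records yield a geo-hint, not how many of those hints are correct; correctness is what the latency comparison addresses.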

CONCLUSION
We make the case for an LLM-based approach to extracting geo-hints for network devices, reducing the reliance of current approaches on manual tasks. We extract geographic information and perform an initial validation with records from AT&T, finding that our extracted geo-hints correspond to lower RTTs when that information is used to select probes.

Figure 1: CDFs of RTTs from probes at different distances from the geolocated device.

Table 1 summarizes the key parameters of this dataset and our mapping results. Our generated expressions and hint mappings cover 190 countries and identify 2,117 unique cities through 5,096

Table 1: Evaluation dataset and mapping results