Pushing Alias Resolution to the Limit

In this paper, we show that utilizing multiple protocols offers a unique opportunity to improve IP alias resolution and dual-stack inference substantially. Our key observation is that prevalent protocols, e.g., SSH and BGP, reply to unsolicited requests with a set of values that can be combined to form a unique device identifier. More importantly, this is possible by just completing the TCP handshake. Our empirical study shows that utilizing readily available scans and our active measurements can double the number of discovered IPv4 alias sets and increase the number of dual-stack sets more than 30-fold compared to state-of-the-art techniques. We provide insights into our method's accuracy and performance compared to popular techniques.


INTRODUCTION
Uncovering the Internet's topology is crucial for Internet measurement and analysis. Common topology mapping tools, such as Traceroute, only provide partial information by revealing interface-level links. Alias resolution, the process of mapping IP addresses to the underlying hardware, enhances the accuracy and completeness of the observed topology [17]. Moreover, it can aid researchers in the development of novel measurement techniques [29]. The identification of dual-stack hosts, i.e., hosts with both IPv4 and IPv6 enabled, presents a conceptually similar challenge to alias resolution. Due to its large address space, however, measuring IPv6 networks remains a challenging task. Nevertheless, identifying dual-stack hosts is an important step in understanding network performance [7], policy [8], and security posture [9].
Prior work introduced many techniques to resolve aliases, with the common source address technique [5] as the earliest approach. This technique operates by sending a packet to a closed port on a router, which triggers an ICMP port unreachable message. If the source address of the ICMP message differs from the probed address (the interface where the packet is received), the IP pair is inferred to be aliases. However, detecting aliases using this method becomes challenging as many routers always respond from the probed address or may not respond at all, rendering the technique impractical.
Other techniques utilize the IPID field in the IP header. IPID-based techniques are predicated on the fact that many routers maintain a monotonic IPID counter that increments with each generated packet and is shared across interfaces. IPID-based tools attempt to sample the IPID value of candidate IPs over a short time frame and perform a monotonic bounds test on the IPID sequences. If an IP pair shares the same sequence, then the two addresses are likely to be aliases. RadarGun [3], Rocketfuel [28], and MIDAR [20] are a few examples of tools utilizing this technique for IPv4 addresses, and Speedtrap [22] for IPv6 addresses. If a router does not use a monotonically increasing IPID counter, such techniques fail to identify potential aliases. Additionally, these techniques require sending a large number of packets, rendering them less suitable for large-scale measurements.
Recent work took a protocol-centric approach and exploited a unique identifier in the response to an unsolicited SNMPv3 request [1]. This approach can infer aliases by grouping addresses that share the same unique identifier. One drawback to this approach is that it requires the target IP to respond to a specific service, i.e., SNMPv3. Firewalls and access control lists can limit the number of identifiable aliases for a given host if the service is configured to respond only on selected addresses.
The techniques mentioned above mainly address the alias resolution problem; however, the protocol-centric approach also solves dual-stack identification. Further, researchers have developed use-case-specific solutions for dual-stack identification [1,4,24,26], along with generic techniques utilizing DNS PTR records [9,23].
In this paper, we take a protocol-centric approach and introduce a technique that improves both IP alias and dual-stack resolution. Our main contributions can be summarized as follows:
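The monotonic bounds test described above can be illustrated with a minimal sketch. This is a simplification and not the implementation of MIDAR or any other cited tool: the function name `shares_counter` and the `max_step` bound are hypothetical, and real tools estimate the counter velocity instead of using a fixed bound.

```python
from typing import List, Tuple

IPID_MODULUS = 1 << 16  # the IPv4 IPID field is 16 bits wide


def shares_counter(samples: List[Tuple[float, int]], max_step: int = 1000) -> bool:
    """Check whether merged IPID samples are consistent with a single counter.

    samples:  (timestamp, ipid) pairs collected from both candidate addresses.
    max_step: hypothetical bound on how far a shared counter may advance
              between two consecutive samples.
    """
    ordered = sorted(samples)  # order by timestamp
    for (_, prev), (_, curr) in zip(ordered, ordered[1:]):
        # Forward distance between consecutive samples, tolerant of wraparound.
        step = (curr - prev) % IPID_MODULUS
        if step == 0 or step > max_step:
            return False
    return True
```

Interleaved samples from two interfaces of the same router would pass this check, whereas samples from two independent counters would typically produce a large apparent jump and fail it.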

METHODOLOGY
Scanning for active services is a widely used technique in Internet measurement and security analysis [9,13]. In this paper, we show that utilizing service scanning results for two popular protocols, namely, SSH and BGP, enables large-scale alias and dual-stack inference. By analyzing these protocols and their specifications [21,27], we identify unique host identifiers that can be used to group IP addresses belonging to the same host in both IPv4 and IPv6.

Service Scan Data
We perform active service scans for SSH and BGP in two phases: (1) an Internet-wide TCP scan sending a single SYN packet to ports 22 and 179 using ZMap [13], and (2) a service scan using ZGrab2 [16] targeting the IPs that are responsive to the Internet-wide ZMap scan. In the service scan, specifically for SSH, we complete the TCP handshake and subsequently send a protocol-specific payload to solicit banner information from the target IP. For BGP, the target IP sends an OPEN message after we complete the TCP handshake without the need for any additional data exchange.
To complement our view of active services, we leverage the Censys dataset [12], in addition to our own active measurements. Censys performs service scans across all 65k ports. However, we only consider hosts that are running SSH and BGP on the default ports, i.e., TCP/22 for SSH and TCP/179 for BGP.

SSH Identifier
The Secure Shell (SSH) protocol, initially introduced in RFC 4253 [21], provides a mechanism to establish a secure network connection. We utilize ZGrab2's SSH module, which handles the SSH handshake, to perform our service scan. Upon completion of the TCP handshake, the server and the client send their respective service string banners and then proceed to exchange a series of plain-text messages before transitioning to an encrypted session. During this exchange, both the server and the client communicate their respective capabilities, which enables both endpoints to convey to the other the algorithms they support. RFC 4253 [21] states that each supported algorithm MUST be listed in order of preference, from most to least. This requirement results in a signature that can be used to identify the client and the server implementation [11,31]. We use this information and the service banner as the first part of our SSH host identifier. An SSH server requires a pair of host keys. These keys are typically generated during the service setup. The client and server exchange the public key components during the connection setup phase. We use the server public key as the second part of our SSH identifier. While the SSH public key itself is likely to be unique per host, our active scan shows that 0.4% of non-singleton hosts communicate different algorithmic capabilities. Therefore, combining the key with the host's algorithmic capabilities can enhance the uniqueness of the SSH identifier. We highlight (in blue) the various parts of our SSH identifier in a snippet of SSH connection setup in Figure 1.
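As an illustration of how the two parts could be combined, the following sketch hashes the banner, the ordered algorithm lists, and the host public key into a single digest. The function name `ssh_identifier`, the dictionary layout, and the use of SHA-256 are assumptions made for this example; the text above does not prescribe a particular encoding.

```python
import hashlib
from typing import Dict, List


def ssh_identifier(banner: str, algorithms: Dict[str, List[str]], host_key: bytes) -> str:
    """Combine banner, ordered algorithm lists, and host public key into one digest.

    algorithms maps a capability field name (e.g. "kex_algorithms") to the
    list of algorithm names in the order the server advertised them.
    """
    digest = hashlib.sha256()
    digest.update(banner.strip().encode())
    # Field names are sorted for a stable encoding, but each list keeps its
    # advertised order, since the preference order is part of the signature.
    for field in sorted(algorithms):
        digest.update(b"\x00" + field.encode() + b"=" + ",".join(algorithms[field]).encode())
    digest.update(b"\x00" + host_key)
    return digest.hexdigest()
```

Two addresses would then be treated as candidates for the same host only if all three components agree, so a shared key with differing capabilities yields different identifiers.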

BGP Identifier
The BGP protocol is used to facilitate the exchange of routing information between BGP-speaking routers. To that end, BGP speakers establish and maintain a TCP session, typically over port 179. When scanning for hosts running BGP, we complete the TCP handshake and wait for data. We simply close the connection after a 2-second timeout or after receiving any data. We find that more than 5.8M BGP speakers close the connection immediately after completing the TCP handshake. However, 364k IPs close the connection after sending an OPEN and a Notification message stating that the connection is rejected. Figure 2 shows an example of a dissected BGP OPEN message from our service scan.
The OPEN message of a BGP speaker contains multiple fields that, when combined, can serve as a globally unique identifier. The first notable field is the BGP identifier. The BGP identifier is used as part of a loop and collision prevention mechanism and is defined in RFC 4271 [27] as a 4-octet unsigned integer that uniquely identifies the BGP speaker within an Autonomous System (AS). Moreover, it should have the same value for every local interface. The OPEN message also contains the Autonomous System Number (ASN) of a BGP speaker's network. The ASN is a globally unique number that is associated with a single AS [18]. Some OPEN messages may contain an optional parameters field that indicates the supported capabilities [6]. The additional fields within the OPEN message, such as Length, Version, and Hold Time, are host-wide and shared across all interfaces. Combining the values of those fields results in a unique identifier that we use to group alias and dual-stack addresses. We highlight (in blue) the relevant parts of the identifier in a dissected BGP message in Figure 2.
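The fields named above can be extracted from a raw OPEN message with a short parser. The sketch below follows the fixed message layout of RFC 4271 (16-octet marker, Length, Type, Version, My Autonomous System, Hold Time, BGP Identifier, optional parameters); the names `BgpOpen` and `parse_bgp_open` are hypothetical, and treating the whole tuple as the identifier is an assumption about how the fields are combined.

```python
import struct
from typing import NamedTuple


class BgpOpen(NamedTuple):
    length: int
    version: int
    asn: int
    hold_time: int
    bgp_id: str
    opt_params: bytes  # kept as raw bytes; capabilities are not decoded here


def parse_bgp_open(data: bytes) -> BgpOpen:
    """Parse the fixed part of a BGP OPEN message into its fields."""
    if len(data) < 29 or data[:16] != b"\xff" * 16:
        raise ValueError("not a BGP message")
    length, msg_type = struct.unpack("!HB", data[16:19])
    if msg_type != 1:  # type 1 is OPEN
        raise ValueError("not an OPEN message")
    version, asn, hold_time, raw_id, opt_len = struct.unpack("!BHH4sB", data[19:29])
    bgp_id = ".".join(str(octet) for octet in raw_id)
    return BgpOpen(length, version, asn, hold_time, bgp_id, data[29:29 + opt_len])
```

Because `BgpOpen` is a tuple, two parsed messages compare equal exactly when every field matches, which makes the parsed value directly usable as a grouping key.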

Alias and Dual-Stack Inference
For every IP that is responsive to the BGP and SSH service scans, we extract the respective identifier. We group IP addresses that share the same identifier into SSH and BGP alias sets, respectively. We group IPv4 and IPv6 addresses that share the same identifier into dual-stack sets.
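The grouping step can be sketched as follows. The function names and the `(ip, identifier)` record layout are assumptions for illustration; identifiers are treated as opaque hashable values.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def group_by_identifier(records: Iterable[Tuple[str, Hashable]]) -> Dict[Hashable, Set[str]]:
    """Map each identifier to the set of IP addresses that returned it."""
    groups: Dict[Hashable, Set[str]] = defaultdict(set)
    for ip, identifier in records:
        groups[identifier].add(ip)
    return groups


def alias_sets(groups: Dict[Hashable, Set[str]], version: int) -> List[Set[str]]:
    """Non-singleton sets of same-family addresses sharing an identifier."""
    result = []
    for ips in groups.values():
        family = {ip for ip in ips if ipaddress.ip_address(ip).version == version}
        if len(family) > 1:
            result.append(family)
    return result


def dual_stack_sets(groups: Dict[Hashable, Set[str]]) -> List[Set[str]]:
    """Sets that contain at least one IPv4 and one IPv6 address."""
    return [
        ips for ips in groups.values()
        if {ipaddress.ip_address(ip).version for ip in ips} == {4, 6}
    ]
```

Running this once per protocol yields the per-protocol alias sets; an identifier seen on a single address produces neither an alias set nor a dual-stack set.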

Datasets
We leverage two different types of datasets. First, we use active measurement data in the IPv4 and IPv6 Internet. In IPv4, we perform Internet-wide scans for the SSH and BGP protocols using ZMap [13] and ZGrab2 [16]. In IPv6, we use an IPv6 Hitlist [15,32] to identify potentially active addresses in the vast IPv6 address space. The active measurement data was collected on April 18, 2023, utilizing a single vantage point located in a data center in Germany. Our dataset, including our analysis, is publicly available [2]. Second, we use data obtained from Censys [12] to identify additional hosts responsive to SSH or BGP. We selected a Censys snapshot, dated March 28, 2023, that closely matches the date of our active measurement.
In Table 1 we show an overview of these two datasets as well as the union, where applicable, of both sources. In IPv4, we find that both Censys and our active scans cover a similar number of ASes for both SSH and BGP. Censys does, however, find around 6M more IPs for SSH and 35k more IPs for BGP. This might be linked to Censys performing distributed measurements, which reduces the likelihood of triggering rate-limiting or intrusion detection system filters [30]. Further, Censys also finds an additional 5.6M IPs running SSH on 60,806 different ports. We do not consider non-standard ports from Censys since our active scan only covers port 22. The union of both IPv4 data sources provides additional coverage compared to just a single source, both with respect to the number of covered IPs and the number of covered ASes. Therefore, unless explicitly stated otherwise, we use the union of both data sources in the remainder of the paper for our IPv4 analysis.
In IPv6, our active scans find more than 1M SSH IPs and 67k BGP IPs. In contrast, Censys reports only 944 SSH IPs and no IPs for BGP. Further, these SSH IPs are running the service on non-standard ports, namely 80 and 443. We believe that this variation is attributable to the IPv6 hitlists used. Due to its limited coverage, we exclude Censys IPv6 data from our analysis. However, as of August 15, 2023, the Censys IPv6 snapshot reports more than 415k IPv6 addresses running SSH on port 22. We expect this number to increase over time as Censys scans IPv6 more rigorously.
In addition to SSH and BGP services, we conduct an SNMPv3 scan for both IPv4 and IPv6. We utilize an already established methodology [1] to identify alias and dual-stack sets. We then use the results for validation purposes and as a supplement to our results. The SNMPv3 data also serves as a baseline for comparison. We note that Censys data primarily reports SNMPv2 hosts and does not seem to include any information on SNMPv3. Consequently, we do not include it as an additional source.

Validation
We take a cross-protocol validation approach and compare sets derived from IP addresses responsive to different protocol pairs. We also utilize MIDAR [20] as an additional source for validation. Specifically, we test a random sample of 61k alias sets using MIDAR and check whether the resulting sets perfectly match the ones we identify with SSH. We ensure that each sample set contains at most ten IPv4 addresses so that the MIDAR run completes within a time frame close to that of the SSH service scan. We provide a summary of our validation results in Table 2, where we report the test sample size, the number of sets that exactly match, and the number of sets with mismatching IPs.
In cross-protocol validation, we initially compare the alias sets obtained from SSH and BGP. Our active scan data contains a total of 7.8k responsive addresses common to both protocols. We identify 1.34k alias sets using SSH and 1.35k alias sets using BGP. The validation between the SSH and BGP protocols shows that 96% of the SSH sets have a perfect match with the BGP sets.
Next, we examine the results of SSH and SNMPv3 pairs. Our active scan data contains a total of 63k addresses responsive to both protocols, resulting in 13.6k alias sets using SSH and 14.5k alias sets using SNMPv3. The validation between the SSH and SNMPv3 protocols shows a 97% agreement.
Finally, we compare the BGP and SNMPv3 pairs with 37k addresses responsive to both protocols. We identify 1.84k alias sets using BGP and 1.9k alias sets using SNMPv3. The validation between BGP and SNMPv3 shows a 95% agreement.
When comparing our results with MIDAR, we focus solely on SSH-based alias sets due to the time required to run MIDAR against all alias sets. We find that only 13% of the sampled sets can be verified with MIDAR. This low coverage can be attributed to two [...] We suspect that the disagreement can be attributed to IP churn, given that the MIDAR run took three weeks to complete. It is also possible that some of these sets share the same host key. In summary, the validation results confirm that our technique has at least a 95% agreement with the state of the art.
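The cross-protocol comparison above can be sketched as an exact-match rate between two collections of alias sets. This is one possible reading of "perfect match" and not necessarily the exact procedure used: the sketch assumes that each set is first restricted to the addresses covered by both protocols and that sets reduced to a single address are dropped. The function names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set


def restrict(sets: Iterable[Set[str]], common: Set[str]) -> Set[FrozenSet[str]]:
    """Keep only addresses seen by both protocols; drop resulting singletons."""
    kept = set()
    for alias_set in sets:
        overlap = frozenset(alias_set) & common
        if len(overlap) > 1:
            kept.add(overlap)
    return kept


def exact_match_rate(sets_a: List[Set[str]], sets_b: List[Set[str]]) -> float:
    """Fraction of protocol A's sets that appear identically in protocol B."""
    common = set().union(*sets_a) & set().union(*sets_b)
    a = restrict(sets_a, common)
    b = restrict(sets_b, common)
    if not a:
        return 0.0
    return len(a & b) / len(a)
```

The rate is asymmetric by design: it is computed relative to the first protocol's sets, mirroring a statement of the form "X% of the SSH sets have a perfect match with the BGP sets".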

Limitations
Our methodology provides the largest sets of alias and dual-stack addresses to date. However, we do note a few limitations:
• First, our methodology relies on application-level data. As such, it is only applicable to IPs responsive to SSH and BGP. Firewalls and access control may block or restrict access to these services, which can limit the alias inference.
• Second, in the case of BGP, BGP speakers can have a non-unique BGP identifier due to misconfiguration, which can lead to incorrect inferences.
• Third, our defined SSH identifier might not be unique in all cases. It is in fact possible for multiple hosts to share the same identifier, e.g., SSH servers can be shipped with factory-default keys [14,19]. It is unlikely, however, for two different hosts to generate the exact same host key, unless an administrator chose to use the same key pair across multiple hosts.
• Lastly, our validation is limited by the relatively small number of overlapping sets with other techniques, the responsiveness of a service on all IPs in a given set, and the possibility of IP churn.

ETHICAL CONSIDERATIONS
For our active experiments, we do our best to minimize additional load or harm on the destination devices. BGP, SSH, and SNMPv3 load is very low (only a few packets per destination). Moreover, we randomly distribute our measurements over the address space for our experiment, ensuring that at most one packet reaches a target IP each second. Furthermore, we coordinate with local network administrators to ensure that our scanning efforts do not harm the local or upstream network. For the active scanning, we use best current practices [10,13,25] to ensure that our prober IP address has a meaningful DNS PTR record. Additionally, we show information about our measurements and opt-out possibilities on a website of our scanning servers. During our active experiments, we did not receive any complaints or opt-out requests.

ANALYSIS
In this section we present our results, consisting of alias resolution and dual-stack statistics as well as AS-level analyses.

Alias Resolution
To identify alias sets, we group IP addresses with identical unique identifiers for SSH and BGP. We also supplement our findings with SNMPv3 as described in [1]. In Table 3 we report the number of non-singleton alias sets and the contribution of each individual protocol, data source, and the union of all. In IPv4, the SSH active scan results in 505k alias sets, which cover over 3.2M unique IPv4 addresses. Similarly, the Censys dataset results in 699k alias sets, covering more than 4.6M IPv4 addresses. Censys data provide a notable increase of 70% and 80% in the number of IPv4 addresses and resulting alias sets compared to the active measurement alone.
With BGP, both Censys and the active scan produce similar results, with 12k alias sets covering 175k IPv4 addresses. In contrast, our SNMPv3 scan results in 557k alias sets covering 6.1M IPv4 addresses. By consolidating these findings, we can effectively cover more than 11.8M IPv4 addresses.
Interestingly, a substantial majority of 97% of these addresses only respond to a single service, while only 3% are responsive to two or three services. Consequently, this stark difference increases the number of resulting alias sets to more than 1.4M, of which 40% can only be identified with SNMPv3 and 60% (which is more than double what can be achieved by SNMPv3 alone) with SSH or BGP. We note, however, that the majority of these sets come from SSH. In Figure 3 we show the distribution of IPv4 addresses per alias set. We find that the majority of the sets contain fewer than 100 addresses. Additionally, more than 60% of SSH alias sets contain only two addresses compared to less than 30% for BGP and SNMPv3. BGP sets are also more likely to contain more addresses compared to sets derived from SSH and SNMPv3. We also note a similar set size regardless of the data source.
For IPv6, the active SSH scan results in 47k alias sets that cover 266k unique IPv6 addresses. Moreover, we find 8.3k and 16.7k alias sets, covering 48k and 71k IPv6 addresses with BGP and SNMPv3, respectively. Merging these results, we obtain over 66k IPv6 alias sets, with a coverage of more than 340k unique IPv6 addresses. Similar to our IPv4 results, a majority of 94% of these addresses are only responsive to a single service, while 6% are responsive to two or three services. This results in 25% of the IPv6 alias sets being identifiable only with SNMPv3, while 75% can be identified with SSH and BGP. In Figure 4 we show the distribution of IPv6 addresses per alias set. Similar to IPv4, the majority of sets contain fewer than 100 addresses. Additionally, SSH sets are more likely to contain fewer IPv6 addresses compared to BGP and SNMPv3. We also note a similar set size for BGP and SNMPv3.

Dual-Stack Inference
Next, we shift our attention to the results of dual-stack identification, as summarized in Table 4. We merge alias sets from IPv4 and IPv6 if they share the same unique identifier. The SSH active scan results in more than 634k dual-stack alias sets, which cover 1.05M IPv4 addresses and 771k IPv6 addresses. With BGP, we identify 4.2k dual-stack sets, covering 78k IPv4 addresses and 16.3k IPv6

AS-Level Analysis
Figure 5 shows the distribution of Autonomous System Numbers (ASNs) per IPv4 alias set. We find that less than 10% of SSH and SNMPv3 sets contain addresses associated with two or more ASes. In contrast, over 35% of BGP sets contain addresses associated with multiple ASes. This outcome aligns with expectations, as BGP speakers are typically border routers that connect different ASes.
In Figure 6, we show the distribution of the number of alias and dual-stack sets per AS. We find that over 37k ASes contain at least one set. The majority of ASes have fewer than 100 sets, and only 3% of ASes have more than 100 alias sets.
To better understand the main contributors of alias sets, we now focus on the top 10 ASes. In Table 5, we report the largest ASes based on different protocols as well as the union of all three protocols for IPv4. We expect SSH to be prevalent predominantly in cloud provider networks, and BGP and SNMPv3 to be more prevalent in ISP networks. Indeed, among the top 10 ASes for SSH, 8 are cloud service providers, including DigitalOcean (rank 1, AS14061), Amazon (rank 3, AS16509; rank 6, AS14618), and OVH (rank 4, AS16276). Surprisingly, however, we also observe two major ISPs: Telefonica de Argentina (rank 2, AS22927) and China Telecom (rank 8, AS4134). Shifting our focus to the top 10 ASes in the BGP and SNMPv3 data, we find that 8 of them are ISPs, while the remaining 2 are cloud service providers. The top three ASes for BGP are Zenlayer (AS21859), Verizon (AS701), and Glide (AS42689); the top three for SNMPv3 are Telecom Italia (AS3269), Vodafone Italy (AS30722), and Deutsche Telekom (AS3320). Lastly, we consider the union of all data sources. We find this to be dominated by similar ASes as in the SSH data set, with a split of 6 cloud service providers and 4 ISPs.
We conclude our analysis by considering the largest 10 ASes with IPv6 alias sets and dual-stack alias sets. Table 6 shows the union results of all three protocols for IPv6 and IPv4-IPv6 dual-stack alias sets. The IPv6 alias sets spread over 7k ASes in total. The top 10 are split between 7 ISPs (e.g., Hurricane Electric, AS6939; China Unicom, AS4837; Chinanet, AS4134) and 3 cloud service providers (e.g., Akamai, AS63949; Dreamhost, AS26347). Finally, our dual-stack alias sets cover more than 9.5k ASes. Note that this includes sets with at least a single IPv4 and a single IPv6 address. We find that the top 3 ASes are cloud service providers (DigitalOcean, AS14061; Linode, AS63949; OVH, AS16276) and cover more than 54% of the total dual-stack sets. The remaining 7 are ISPs and cover only 10% of all dual-stack alias sets.
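The per-set AS count behind such a distribution can be computed with a few lines. The sketch below assumes an IP-to-ASN mapping is already available as a dictionary (how that mapping is obtained is not covered here), and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def ases_per_set(alias_sets: Iterable[Set[str]], ip_to_asn: Dict[str, int]) -> List[int]:
    """Number of distinct ASNs among the mapped addresses of each alias set."""
    return [
        len({ip_to_asn[ip] for ip in alias_set if ip in ip_to_asn})
        for alias_set in alias_sets
    ]


def multi_as_share(alias_sets: List[Set[str]], ip_to_asn: Dict[str, int]) -> float:
    """Fraction of alias sets whose addresses map to two or more ASNs."""
    counts = ases_per_set(alias_sets, ip_to_asn)
    if not counts:
        return 0.0
    return sum(1 for count in counts if count >= 2) / len(counts)
```

Addresses without a known ASN are ignored in this sketch, so a set whose addresses are mostly unmapped may be undercounted.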

CONCLUSION
In this paper, we introduced a multi-protocol approach to improve IP alias resolution and dual-stack identification. Our key observation is that a unique identifier for each protocol can be used to group different subsets of alias sets. We evaluated our method with two popular protocols, namely, SSH and BGP, and we showed that our technique substantially increases both the number of alias sets and the number of dual-stack sets, compared to similar protocol-centric techniques such as SNMPv3. Our results showed that we can supplement previous work and identify up to 1.4 million non-singleton IPv4 alias sets, i.e., double what can be achieved with previously known techniques. Our results also showed that we can identify more than 650 thousand dual-stack alias sets. By a large margin (30×), this is the largest set reported to date. As part of our future research agenda, we plan to investigate whether other popular protocols are associated with unique identifiers that will further increase the IP coverage of alias and dual-stack sets. We also plan to inspect SSH identifiers more in depth, specifically in terms of consistency and stability. Moreover, we plan to use updated IPv6 hitlists, as we were limited to those publicly available in this paper. Our initial results are very encouraging, and we plan to perform additional measurements from multiple vantage points (VPs) to understand the effect of geographical VP location.

Figure 2: A Dissected BGP OPEN Message

Figure 6: Distribution of the number of alias sets per AS.

Table 1: Service Scanning Dataset Overview

Table 2: Alias Sets Validation

Table 3: Alias Sets Overview

Table 5: Top 10 ASes for IPv4 alias sets for each protocol separately and for the union. Each cell shows the ASN as well as the number of alias sets in parentheses.

Table 6: Top 10 ASes for IPv6 alias and dual-stack sets. Each cell shows the ASN as well as the number of alias sets in parentheses.