Aggressive Internet-Wide Scanners: Network Impact and Longitudinal Characterization

Aggressive network scanners, i.e., ones with immoderate and persistent behaviors, ubiquitously search the Internet to identify insecure and publicly accessible hosts. These scanners generally fall into two main categories: i) benign research-oriented probers; ii) nefarious actors that forage for vulnerable victims and host exploitation. However, the origins, characteristics and the impact on real networks of these aggressive scanners are not well understood. In this paper, via the vantage point of a large network telescope, we provide an extensive longitudinal empirical analysis of aggressive IPv4 scanners that spans a period of almost two years. Moreover, we examine their network impact using flow and packet data from two academic ISPs. To our surprise, we discover that a non-negligible fraction of packets processed by ISP routers can be attributed to aggressive scanners. Our work aims to raise the network community's awareness of these "heavy hitters", especially the miscreant ones, whose invasive and rigorous behavior i) makes them more likely to succeed in abusing the hosts they target and ii) imposes a network footprint that can be disruptive to critical network services by incurring consequences akin to denial of service attacks.


INTRODUCTION
Intensive and incessant Internet-wide scanning activities have evolved significantly over the past several years primarily due to two orthogonal factors: the development and wide adoption of research tools such as ZMap [20] and Masscan [22] that have been enabling researchers to examine a plethora of security and networking questions; and the independent explosion of botnets and malware that target Internet-of-Things (IoT) applications and hosts (e.g., Mirai and others [33,3,32,35,47]). While the utility of innocuous research scanners has been indispensable for many applications (e.g., understanding the risk profile and security posture of networks and protocols [15,37,29,6], detecting network outages [26,42,44], disclosing and assessing new vulnerabilities [19], identifying IP space usage and address exhaustion [7,40], studying censorship [41,49,43] and understanding botnets [3] and cybersecurity flaws [18,16,27,48,1]), their collective impact on the overall network traffic, their origins, the profile of the applications/ports they target, etc. are currently not well understood, nor have they been systematically quantified. A similar gap exists in understanding the network impact and characteristics of malicious network scanners (e.g., botnets [3] or adversaries that forage for insecure Internet hosts [10]) that are heavily probing the Internet. In this paper, we attempt to shed some light on the behavior of both families of scanners through the lens of i) a large network telescope and ii) traffic data (i.e., flows and packet streams) from several vantage points of a large academic ISP, namely Merit Network, and a campus university network, i.e., University of Colorado; we collectively refer to these probers as aggressive scanners (AH, for short, for "aggressive hitters") due to their defining characteristic of exhibiting some sort of "excessive" behavior.
Large network telescopes or Darknets [38,36] provide a unique perspective for understanding macroscopic Internet-wide activities, such as scanning [17]. Network telescopes are instrumented to receive and record Internet-wide traffic destined to large swaths of unused (but routed!) IP space. In this paper, we longitudinally study a large network telescope operated by Merit Network, namely the ORION Network Telescope (ORION NT) [36], covering about 500,000 contiguous "dark" (i.e., unused) IPs for a period spanning 22 months (January 1st, 2021 to October 15th, 2022) to obtain up-to-date insights into the characteristics of aggressive Internet-wide scanners that reach our Darknet. We consider three separate modalities to examine intensive scanning behavior (see Section 3). For example, following the definition of "large scans" from [17], we consider hosts that scan 10% or more of the dark IP space to be aggressive. Using this definition, we identify 155,010 unique IPs associated with aggressive scanning in 2022 across a total of 57,334,643 unique IPs reaching the Darknet. They contribute 540 billion packets, amounting to 65% of all packets captured in the Darknet for 2022.
To understand the network impact ascribed to these "heavy hitters", we integrate into our analysis flow data from Merit, which serves upwards of one million users. Further, we examine live streams of packets at one monitoring station at the same ISP and another station at the University of Colorado campus network. We join the ISP datasets with the identified hitters to measure the impact of the AH activities on the network in terms of packet volume. We found that AH packets contribute between 0.1% and 5.85% of the total ingress/egress packets processed by core routers on a typical day; this is a non-negligible fraction. Our main contributions include i) the up-to-date longitudinal profiling of Internet-wide "aggressive" scanners and ii) measurable evidence that the aggregate network footprint of these scanners is not as inconspicuous as researchers and operators generally assume. This traffic can be disruptive to network operators, especially traffic from sources that never disclose their intents (as opposed to the seemingly benign "Acknowledged" lists [9] that do reveal the scanning purpose). Scanners of unspecified intent are the vast majority of probers we categorize as "aggressive", and can be associated with botnet propagation and nefarious reconnaissance (e.g., see [10]). We plan to produce and share daily lists of such scanners (using all three definitions) that the network and "threat exchange" communities [50,34] could subscribe to, hoping that they can be utilized by operators to block and mitigate this disruptive Internet background noise.

DESCRIPTION OF DATASETS
A. Darknet data. We analyze data from the ORION NT to identify and then study the aggressive hitters. To study yearly trends, we split the Darknet dataset into two parts: Darknet-1 (spanning the entire 2021) and Darknet-2 (January 1st, 2022-October 15th, 2022). See Table 1.
Central to our analysis of Darknet data is the notion of a darknet event. For this study, a darknet event represents a "logical scan" such as those defined in [17,45]. Following [17], a logical scan summarizes the scanning activities of a source IP appearing in the Darknet. TCP-SYN packets, UDP packets, or ICMP "Echo Request" packets are the three traffic types we consider as "scanning packets" [17]. A logical scan represents the activity of a source IP associated with a particular Darknet destination port and traffic type. For each darknet event / logical scan we record its start and end timestamps; an event is considered to have ended when no packets have been seen in the Darknet from the event's source IP to the event's targeted destination port and traffic category for more than a "timeout" period of around 10 minutes. For each event, we record total packets, number of unique Darknet destinations contacted and metadata [36].
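The event construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used at ORION NT: the `Packet` record layout, the field names, and the exact 600-second timeout are hypothetical choices made for the example.

```python
from collections import namedtuple

# Hypothetical packet record; the field names are assumptions for this sketch.
Packet = namedtuple("Packet", "ts src dst dport proto")

TIMEOUT_S = 600  # "around 10 minutes", taken here as exactly 600 seconds


def build_events(packets, timeout=TIMEOUT_S):
    """Group packets into logical scans keyed by (source IP, port, traffic type).

    An event is closed once its key has been silent for more than `timeout`
    seconds; the next packet with the same key then opens a new event.
    """
    open_events = {}  # key -> event still in progress
    finished = []     # events closed by the timeout rule
    for p in sorted(packets, key=lambda p: p.ts):
        key = (p.src, p.dport, p.proto)
        ev = open_events.get(key)
        if ev is not None and p.ts - ev["end"] > timeout:
            finished.append(ev)  # silence exceeded the timeout: close it
            ev = None
        if ev is None:
            ev = {"key": key, "start": p.ts, "end": p.ts,
                  "packets": 0, "dsts": set()}
            open_events[key] = ev
        ev["end"] = p.ts
        ev["packets"] += 1
        ev["dsts"].add(p.dst)
    # Events still open at the end of the trace are emitted as-is.
    finished.extend(open_events.values())
    return finished
```

Each returned event carries the start and end timestamps, the packet count, and the set of distinct dark destinations, which are the per-event quantities the definitions of aggressiveness rely on.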
B. ISP flows. To quantify the scanners' network impact, we utilize ISP flows from Merit. The flows are in Netflow format and collected with a packet sampling rate of 1:1000 at three core Merit routers. The Netflow collectors are configured to only sample ingress and egress traffic to/from the ISP, i.e., internally facing router interfaces are not included in the flow data. We employ two datasets: Flows-1 (January 15th, 2022 to January 21st, 2022) and Flows-2 (October 1st, 2022).
C. Packet streams. To further validate the network impact results, we also performed measurements on mirrored packet streams at Merit and the campus network at the University of Colorado (referred to as CU). CU is not associated with Merit (i.e., Merit does not provide upstream/transit services to CU and the IP spaces of both networks are different), and serves a population of 100,000 users. These non-sampled packet streams include the majority of ingress/egress traffic observed at a major core router at Merit (one of the three routers we have flow data from) and all campus traffic at CU. We examine 72 hours starting on 2022-11-28. During that period, at Merit, the monitoring station processed traffic exceeding 8 Mpps (million packets per second) and ≈ 80 Gbps. At CU, we observed peak rates at 5 Mpps and ≈ 40 Gbps.
D. Acknowledged scanners. To obtain insights into the seemingly benign/research scanners while also partially validating our lists of detected aggressive scanners, we employ the publicly available list of "Acknowledged Scanners" [9]. The list curator considers a scanning IP as an "Acknowledged Scanner" ("ACKed" scanner, in short) if the scanner makes any effort to disclose its intentions (e.g., research purposes). At the time our analysis was performed, the list [9] made available the source IPs of 36 unique organizations.
E. Honeypot data. To cross-validate the lists of non-ACKed scanners (i.e., the likely miscreant ones) and shed light on their behaviors, we employ data from GreyNoise [23]. GreyNoise (GN) operates distributed honeypot sensors at multiple cloud providers meticulously placed throughout the world. The IPs observed contacting their sensors are tagged by the GN team via an internal process. An IP is annotated as benign, malicious or unknown; more specific tags are also available for some IPs. We examined GN data (with 2,962,153 unique IPs) for the whole month of June 2022.
Ethical considerations. Working with real-world traces requires ethical and responsible data handling. Our measurement infrastructure was designed with careful consideration and follows best practices imposed by the security/privacy boards and network managers of the organizations that operate the corresponding instrumentation. For instance, all of our datasets are passively collected and we never interact with or probe any of the identified IPs present in our datasets. The data were analyzed in a secure manner only by the authors. Moreover, we followed the "code-to-data" paradigm for analyzing the live packet streams in which our code was shared with and executed by authorized personnel with access to the mirrored data. We do not collect nor examine any device MAC addresses or user payload, and we merely performed packet counting (i.e., total packets originating from AH) when examining the packet streams.
Darknet data are generally considered to pose minimal privacy risks; however, we take measures to not expose any identifiable information that might endanger networks or individuals. For example, in the analyses that follow we elected to not publicly disclose the actual ASN and organization names that originate AH to protect the reputation of these networks.

AGGRESSIVE NETWORK SCANNERS
Definition 1: Address Dispersion. We classify a source IP appearing in our Darknet as aggressive whenever it is involved in a darknet event that targets 10% or more dark IPs. This definition was also employed in [17] to identify "large scans". We found 2,977,242 scanning events in Darknet-1 and 2,075,485 events in Darknet-2. We identified 158,681 distinct IPs satisfying this condition in the Darknet-1 dataset and 155,010 IPs in 2022.
Definition 2: Packet Volume. The second definition is based on packet volume. For each Darknet dataset, we compile the Empirical Cumulative Distribution Function (ECDF) for the number of packets sent per event. Using the empirical distribution, we calculate the (1 − α)th percentile, and declare a scanner as "aggressive" whenever it participates in an event with total packets transmitted crossing the critical threshold. We utilized α = 0.0001.
The thresholds that correspond to the top-0.01% events were found to be 64,810 packets and 23,491 packets for Darknet-1 and Darknet-2, respectively. The number of identified aggressive source IPs found from this definition in 2021 was 159,159. We noticed that these numbers are very similar to those obtained using the address dispersion rule; indeed, the Jaccard similarity score for the two sets of hitters is found to be 0.8. Due to the high similarity between the two populations, in the sequel we mostly focus our attention on scanners identified using the address dispersion definition.
Definition 3: Number of Distinct Destination Ports. Our final definition is based on the number of distinct ports that a scanning IP contacts in the Darknet in a given day. We again use our data to obtain the ECDFs for the number of unique ports for both years. We use the same α = 0.0001 to find the critical threshold. The ECDFs for Darknet-1 and Darknet-2 differ, indicating a shift towards more scanned ports (see Izhikevich et al. [30] for a possible explanation). For Darknet-1, we classified the IPs scanning 6542 or more ports per day as aggressive, whereas for 2022 the threshold is 57,410 ports.
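To make the threshold-based definitions concrete, the following Python sketch shows one way to derive a percentile cutoff from an empirical distribution, apply the address dispersion and packet volume rules, and compare the resulting sets with the Jaccard score. It is an illustration only: the event fields (`src`, `unique_dsts`, `packets`), the function names, and the choice of an inclusive `>=` comparison at the threshold are assumptions of this example rather than details stated in the text.

```python
import math


def percentile_threshold(values, tail=0.0001):
    """Smallest observed value whose empirical CDF reaches 1 - tail."""
    ordered = sorted(values)
    rank = math.ceil((1 - tail) * len(ordered))  # 1-based rank in sorted data
    return ordered[max(rank, 1) - 1]


def aggressive_by_dispersion(events, dark_size, frac=0.10):
    """Sources with at least one event touching >= frac of the dark space."""
    return {e["src"] for e in events if e["unique_dsts"] >= frac * dark_size}


def aggressive_by_volume(events, tail=0.0001):
    """Sources with at least one event at or above the packet-count cutoff."""
    cutoff = percentile_threshold([e["packets"] for e in events], tail)
    return {e["src"] for e in events if e["packets"] >= cutoff}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

The same `percentile_threshold` helper would apply to the third definition by feeding it the per-IP daily counts of distinct destination ports instead of per-event packet counts.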

NETWORK IMPACT
Having the lists of AH available, we now shift focus to understanding the impact that these scanners pose to networks. First, we utilize flow data from Merit to measure the collective packet volume generated by the identified AH and processed by the ISP's routers as they transit the network. We start by individually checking flow data from three core Merit routers. These routers collectively process more than 50% of all packets transiting Merit's network.
Table 2 showcases the network impact imposed by aggressive scanners for definition #1 (we omit results for the second definition since that scanning population is very similar to the one identified with the first definition; results for definition #3 show a less pronounced impact, albeit non-negligible, but we omit them for brevity). We report on the total number of packets observed at a specific vantage point originating from a source IP belonging to an identified AH.
In addition, we also include the portion of traffic that these packets amount to with regard to all the packets that a given router processes for the days examined. The table highlights a somewhat unexpected result: the daily fraction of aggressive scanners' packet volume lies between 1.1% and 5.85%; this is a relatively high percentage and indicates that the impact of aggressive scanners on network traffic is not negligible. To rephrase, we see evidence that, on average, at least one out of every hundred ingress or egress packets that a router processes is a packet originating from an AH.
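The impact metric above is a simple ratio, sketched below in Python under stated assumptions: flows are represented as hypothetical `(source IP, sampled packet count)` pairs, and the 1:1000 sampling rate is used only to scale absolute counts back to estimated packets. Note that the sampling factor cancels in the fraction, so the reported percentage does not depend on it as long as sampling is uniform.

```python
def ah_impact(flows, ah_ips, sampling_rate=1000):
    """Estimate AH packets, total packets, and the AH fraction for one router.

    flows: iterable of (src_ip, sampled_packets) pairs (hypothetical layout).
    ah_ips: set of source IPs previously identified as aggressive.
    """
    ah_sampled = 0
    total_sampled = 0
    for src, pkts in flows:
        total_sampled += pkts
        if src in ah_ips:
            ah_sampled += pkts
    fraction = ah_sampled / total_sampled if total_sampled else 0.0
    # Scale sampled counts by the sampling rate to estimate true volumes.
    return ah_sampled * sampling_rate, total_sampled * sampling_rate, fraction
```

For instance, a router that samples 3 AH packets out of 100 sampled packets would be credited with an estimated 3,000 AH packets out of 100,000, i.e., a 3% impact.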
Table 2 illustrates that the peering arrangements in place at the ISP directly affect the fraction of AH packets recorded on a given router.For instance, we remark that router-1 endures the highest impact with regards to hitters identified with the address dispersion metric; this can be explained by the fact that definition #1 AH frequently originate from Europe and Asia, as shown in Table 5, and router-1's routing policies (e.g., upstream tier-1 peers) dictate that such traffic would enter Merit at that point-of-presence.
We next reflect further on interpreting and validating this surprising result. We note that the higher percentages occur on weekends, namely when the overall Merit traffic is lower. We also speculate that content caching [21] plays a critical role in "amplifying" the effect of network scanning. Merit has put in place careful traffic engineering considerations to have their users benefit from content caches (e.g., videos, etc.) that reside within the ISP. User traffic to/from these content caches does not traverse the 3 border routers we study here, so these packets do not contribute to the calculated ratio.
To further validate our results, and to eliminate the possibility that the high network impact might be due to some bias arising from the sampled flow data, we next examine the mirrored packet streams at both Merit and CU. Figure 1 illustrates the results, offering some interesting findings: i) This non-sampled dataset confirms that the network impact at Merit (and router-1, specifically) lies around 2% (see left panel, top row; footnote 3); ii) the network impact at CU is also high, but an order of magnitude less than Merit (see right panel, top row), hovering just shy of 0.10%. We hypothesized that this could be an artifact of the lack of content caching at CU, which means that the monitoring station at CU sees more video-related traffic compared to the Merit station. Indeed, we checked with the network engineers at CU and they verified that no content caching is present within their network and off-net caching is provided by their upstream ISP; iii) the instantaneous impact from AH could even exceed 7% on certain occasions (middle row panels) on both networks, reaching even 12% at Merit; iv) as we observe on the bottom row panels, on several 1-second intervals (shown in red color) when the AH impact is high, overall network traffic could also reach high levels (e.g., exceeding 6 Mpps). This implies that AH are overwhelming the network even during its "busy" times, and consequently network performance might suffer due to potentially incurred packet drops and network delays. In short, these AH collectively exhibit behavior akin to denial-of-service attacks.
Figure 2 further corroborates the hypothesis that the network impact difference between Merit and CU can be explained by the presence of content caching (or lack thereof). The figure illustrates the instantaneous packet rates ascribed to the identified AH at Merit (left) and CU (right) when we normalize by their total number of /24 networks (28561 /24 nets for Merit and 291 for CU). As observed, CU is in fact more adversely affected by the collective impact of these scanners on a per /24 basis.
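The per-/24 normalization can be illustrated with a short Python sketch. The /24 counts are the ones quoted in the text; the function names and any packet rates passed to them are hypothetical and serve only to show the arithmetic.

```python
MERIT_SLASH24 = 28561  # number of /24 networks at Merit (from the text)
CU_SLASH24 = 291       # number of /24 networks at CU (from the text)


def per_slash24(rate_pps, n_slash24):
    """AH packet rate normalized by the number of /24 networks."""
    return rate_pps / n_slash24


def cu_more_affected(merit_ah_pps, cu_ah_pps):
    """True when CU receives more AH packets per /24 than Merit does."""
    return (per_slash24(cu_ah_pps, CU_SLASH24)
            > per_slash24(merit_ah_pps, MERIT_SLASH24))
```

Because 291 / 28561 is roughly 0.0102, CU is the more affected network on a per-/24 basis whenever its AH packet rate exceeds about 1% of the AH rate seen at Merit.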
Table 3 allows us to understand the protocol behavior of these AH, as observed at both the Darknet and Flow data at Merit. The table illustrates the protocol distributions with
Footnote 3: The (cumulative) fraction declines over time since we transition from a weekend day to a weekday. Further, we performed this 3-day analysis using AH for Nov. 27th, 2022, and due to DHCP churn (see [50]) some AH IPs might have become obsolete by the second and third days of the analysis.
Figure 1: Network impact (for def. #1 AH) observed using packet data. Left: Merit impact. Right: CU impact. The top row shows the fraction of packets observed at the monitoring station when packets are counted in a cumulative manner (i.e., from start of experiment). The center row shows the instantaneous impact. The bottom row shows the instantaneous rates; note that on certain occasions (instances highlighted in red), high AH network impact coincides with instances of high overall network traffic rates (in Mpps). Plot labels: Rate (pps); 1-sec intervals when AH impact above 0.37% (99.9-pctl).
Table 4 shows the network impact that scanners that can be classified as "Acknowledged" impose on the network. The tabulated data suggest that "seemingly benign" scanning activities take a relatively high toll on the routers. The results are for the Flows-2 dataset (October 1st, 2022).

SCANNERS CHARACTERIZATION
Next, we longitudinally study the identified scanners and attempt to characterize them (e.g., their origins, top ports targeted, etc.). Figure 3 shows time-series for definition #1. The left panel shows the number of active AH per day (which includes AH that may have started scanning prior to that day), the number of unique daily AH (i.e., ones that started their scanning efforts during that day), and the number of all active and daily scanners. The lines for the latter two scanner numbers seem to coincide because their values are very similar; their average difference is only 8,471 IPs. The right panel shows the number of packets transmitted by the daily scanners on a given day, juxtaposed with the aggregate Darknet scanning packets. Due to the darknet events data format, we can only calculate packet statistics for daily scanners.
Table 4: Network impact attributed to ACKed scanners. We report total packets sent by ACKed scanners (in billions) and their fraction amongst all ingress/egress packets.
Router-1 Router-2 Router-3
Definition #1 3.17
The plot shows that the number of aggressive scanners increases over time. On average, we found 1452 (3876) daily (active) hitters per day in 2021, whereas there are 1779 (5349) daily (active) hitters per day in 2022. Figure 3 (right) depicts that the identified hitters contribute the vast majority of packets seen in the Darknet. We observe that on average around 0.1% of scanning IPs appearing in the Darknet and corresponding to AH are responsible for over 63% of the total packets captured per day in ORION NT.
Next, we discuss the origins of AH. We characterize the type of Autonomous Systems (AS) that originate these scanners, and the country of origin. Table 5 tabulates the top-10 networks and the countries associated with definition #1 AH. (Numbers in parentheses indicate ACKed scanners.) We also studied the origins of AH based on the other two definitions; to save space, we omit these tables, but we point out that the origins for the first two definitions are very similar, echoing the previous observations that scanners from the first two definitions (address dispersion and packet volume) largely overlap. On the other hand, the origins for the third group differ, and we even see the presence of research institutions. Notably, a certain US-based cloud provider ranks top  in all six definitions/datasets (except once), indicating strong preference from scanning organizations for its use.
Next, we validate our inferences using the publicly available lists of "Acknowledged Scanners" [9], aiming to shed light on organizations that are seemingly benign and perform aggressive scanning for research purposes. We consider an identified AH as an ACKed scanner if i) its IP is within the list of IPs available in [9]; or ii) we find a match via reverse DNS checks. Specifically, we compiled a list of 48 "keywords" (see list [2]) based on the reverse DNS records of the IPs in [9].
Table 6 summarizes the matching results. For example, we find that 4706 IPs from 27 distinct organizations using definition #1 and Darknet-1 are indeed AH. We note that we discovered several IPs (around 7600 in total) belonging to organizations considered as "ACKed scanners" that were not included in [9]. Overall, we identified 7,974 IPs from 29 unique ACKed scanning organizations (out of 36 in [9]) during the full 22-month period across all definitions.
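The two-step ACKed matching (IP list membership, then reverse DNS keywords) could look roughly like the Python sketch below. It is a minimal illustration under assumptions: reverse DNS names are taken as already resolved and supplied by the caller, matching is a case-insensitive substring test, and the sample keyword and addresses used in any call are placeholders, not entries from [9] or [2].

```python
def is_acked(ip, rdns_name, acked_ips, keywords):
    """Return True if an AH matches the ACKed list by IP or by rDNS keyword.

    ip: source IP of the aggressive hitter.
    rdns_name: reverse DNS name for that IP, or None if no record exists.
    acked_ips: set of IPs published in the acknowledged-scanner list.
    keywords: lowercase substrings derived from rDNS names of listed IPs.
    """
    if ip in acked_ips:
        return True
    name = (rdns_name or "").lower()
    return any(keyword in name for keyword in keywords)


def split_acked(hitters, acked_ips, keywords):
    """Partition {ip: rdns_name} into (acked, non_acked) sets of IPs."""
    acked = {ip for ip, name in hitters.items()
             if is_acked(ip, name, acked_ips, keywords)}
    return acked, set(hitters) - acked
```

The keyword step is what allows IPs of an acknowledged organization to be recognized even when they are missing from the published IP list.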
We next characterize the aggressive hitters in terms of the top applications they target (with regard to packets received). We also break down the attempts against each port based on whether the ZMap, Masscan or "Other" fingerprints have been observed (see [17] for the ZMap and Masscan fingerprints). Figure 4 shows the top ports/protocols for definition #1. We notice that 20 of the top 25 ports are present both in 2021 and 2022, and that AH send a large number of packets to TCP ports. Out of the top 25 services that received the most packets in 2021, only 4 UDP-based services are targeted. ICMP (Echo Requests) completes the top-25 set.
Next, we take a moment to compare this behavior with prior work [17], which also employed Merit's Darknet. Figure 2 in [17] shows the same type of AH (i.e., large scans targeting more than 10% of the dark IP space) and offers a baseline for comparison. Indeed, AH's profile has dramatically changed since the Durumeric et al. 2014 study. SSH was the top-targeted port by AH back then, but it now ranks 3rd in both 2021 and 2022. The top-ranked aimed ports currently, [47]). Further, Redis vulnerabilities have recently been popularly mined for cryptojacking [8] and other application-level attacks [25]. Looking at Figure 3 in [17], we also notice that ZMap/Masscan currently play a prominent role in Internet-wide scanning whereas in 2014 their presence was minimal (as expected, since they were relatively unknown tools then).
Comparing with the study of Richter et al. [45], we do observe some similarities in the top-ranked ports (see Figure 10 in [45]) as well as some notable differences. For example, Telnet was the top-scanned port among the scanners identified in Richter et al. [45], agreeing with current trends (i.e., Telnet is the 2nd most scanned port in our datasets). However, we notice that Redis/6379 was absent from the rankings of Richter et al. [45]. Interestingly, we also see that TCP/445, one of the most scanned ports in Richter et al. [45], is not preferred by AH. This agrees with the results in Durumeric et al. [17] where we see TCP/445 mostly associated with "small scans" (i.e., scanning less than 10% of the Darknet space; see Figure 2 in [17]).
We also validate our results using lists of scanners obtained from GreyNoise [23] in which nefarious aggressive scanners are included. Using the month of June 2022 as a basis for comparison, we found a significant overlap between the two vantage points; namely, on average 99.3% of AH identified in our Darknet are also found in GN on a given day. Since GreyNoise operates a "distributed" honeypot in several regions worldwide, this suggests that most of our identified hitters are not performing localized scans, but rather engage in macroscopic Internet-wide behaviors.
Our study is closest to the works of Durumeric et al. [17] and Richter et al. [45].Scanning trends have changed since these studies were conducted (2014 and 2019, respectively), and we document some differences in Section 5. To the best of our knowledge, this study is the first that quantifies the network impact of aggressive Internet-wide scanners.We note though that we have not examined IPv6 scanners [11,46] nor their impact.The recent work in [46] studies such scanners through the lens of a large Content Delivery Network and available firewall logs.We leave analysis of AH IPv6 scanners as future work.

CONCLUSIONS
The paper studies a germane sub-population of Internet-wide IPs, namely the AH observed at the ORION NT.The impact on the network of these AH, as shown in the paper, is surprisingly high.Thus, understanding their behavior is important, with the tangible goal of potentially blocking malicious ones (e.g., the non-ACKed ones) either at the "edge" of an ISP or as they transit the Internet.An important security implication of these AH, which are intense and persistent, is that they are more likely to succeed in finding the vulnerabilities they seek.Further, from a network performance perspective, a critical consequence is that high packet rates (see Figure 1) from these AH could lead to service degradation akin to ones occurring during DoS attacks.Thus, raising awareness towards them is important; we plan to share curated lists of these AH with the community on a regular basis.
We offer three concrete methodologies on how to identify AH. With the proposed methodologies we aim at obtaining "quality lists" of scanners, minimizing false positives due to spoofing or misconfigurations. Further, succinct AH lists have practical implications: engineers who would consider blocking Internet-wide scanners are likely to focus on the top ones anyway in order to minimize the risk of blocking legitimate traffic due to DHCP IP churn and NAT considerations [50]. In fact, as Figure 6 (right, Zipf-like distribution) in the Appendix shows, even blocking a small number of AH ameliorates a large fraction of the problem.
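The blocking argument can be made concrete with a small Python sketch that computes the share of AH packets removed by blocking the k heaviest sources. The per-source packet counts fed to it are hypothetical; the function only illustrates why a heavily skewed (Zipf-like) distribution makes short block lists effective.

```python
def blocked_share(packets_per_ah, k):
    """Fraction of all AH packets sent by the k heaviest sources.

    packets_per_ah: dict mapping source IP -> packets attributed to it.
    k: number of top sources an operator is willing to block.
    """
    counts = sorted(packets_per_ah.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0
```

With made-up counts of 70, 20 and 10 packets for three sources, blocking only the heaviest one already removes 70% of the AH packets, and blocking two removes 90%.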
Future plans include further investigating the impact of the aggressive hitters on more networks beyond the academic ones studied here.In addition, by examining AH observed at additional vantage points (e.g., other large Darknets), we are aiming to further validate that there is no bias in our existing results.The fact that we identified AH using Merit's "dark" IP space and that these AH contribute an important traffic portion at a completely different network (i.e., CU campus) points towards no selection bias.We leave analysis of heavy IPv6 scanners as part of future work, along with further characterizations of the IPv4 AH population.

Table 1: Description of Datasets.

Table 2: Network impact attributed to active AH (definition #1) as seen at the top-3 routers at Merit. We report the total packets sent by these scanners (in billions) and the percentage of these packets amongst all routed packets.

Table 5: Origins of aggressive scanners for definition #1.
al. 2014 study.SSH was the top-targeted port by AH back then, but it now ranks 3rd in both 2021 and 2022.The top-ranked aimed ports currently,