Sequential Pattern Mining: A Proposed Approach for Intrusion Detection Systems

Technological advancements have played a pivotal role in the rapid proliferation of the fourth industrial revolution (4IR) through the deployment of Internet of Things (IoT) devices in large numbers. COVID-19 caused serious disruptions across many industries, with lockdowns and travel restrictions imposed across the globe. As a result, conducting business as usual became increasingly untenable, necessitating the adoption of new approaches in the workplace. For instance, virtual doctor consultations, remote learning, and virtual private network (VPN) connections for employees working from home became more prevalent. This paradigm shift has brought benefits; however, it has also increased the number of attack vectors and the attack surface, creating lucrative opportunities for cyber-attacks. Consequently, more sophisticated attacks have emerged, including Botnet attacks, which typically lead to Distributed Denial of Service (DDoS) attacks. These pose a serious threat to businesses and organisations worldwide. This paper proposes a system for detecting malicious activities in network traffic using sequential pattern mining (SPM) techniques. The proposed approach utilises SPM as an unsupervised learning technique to extract intrinsic communication patterns from network traffic, enabling the discovery of rules for detecting malicious activities and generating security alerts accordingly. By leveraging this approach, businesses and organisations can enhance the security of their networks, detect malicious activities, including emerging ones, and thus respond proactively to potential threats. The performance evaluation of the proposed approach reveals a True Positive Rate (TPR) of over 99% and a False Positive Rate (FPR) of 0%.


INTRODUCTION
4IR has played a pivotal role in the digital transformation of businesses and industries. The COVID-19 pandemic has further accelerated this trend, forcing us to rely more heavily on technology for daily activities such as accessing government services and transportation. This paradigm shift has revolutionised how employees work, promoting remote work and increasing the use of online communication platforms such as Teams and Zoom. This shift has brought numerous benefits, including cost savings and increased productivity and efficiency. However, it has also increased the attack surface for adversaries, creating a lucrative opportunity for cyber-attacks due to the deployment of a large number of smart technologies that operate without human intervention. These technologies have also increased the risk of sophisticated attacks, such as multi-stage attacks (MSAs) [2,3,14,22,23]. Botnet attacks are one example of MSAs; they are often used to launch DDoS attacks [8,9] at a later stage and have been a serious threat in recent years. As businesses and industries continue to embrace digital transformation, it is important to remain vigilant and take proactive measures to mitigate the risks of cyber-attacks.
The stages of a cyber-attack typically begin with reconnaissance, which involves gathering information about the target organisation to map its security posture. This is followed by a scanning attack, which is a pre-attack stage that adversaries use to identify potential attack vectors that can be exploited to gain access to the network. During this stage, port scanners, ping scanners, and related tools are employed to discover open ports and obtain information about the network services running on them, as well as details about the operating systems and versions in use. The output of this stage usually consists of a list of attack vectors that can be used to penetrate the target organisation's defences.
Port scanning involves sending packets to the target host to initiate a TCP connection through a three-way handshake. Through this process, a scanner can determine the state of the port on the target network hosts by sending a packet with the SYN flag set and analysing the response from the host being scanned. There are various types of port scans, such as the SYN scan, TCP connect scan, and stealth scan. The stealth scan is particularly effective as it limits the noise generated during the scan by not completing the full three-way handshake, thus making it relatively more difficult to detect.
Network monitoring tools, such as Zeek [29] and Snort [7,13,17], are equipped with pre-defined rules and signatures that enable the detection of common scanning attacks. Additionally, firewalls are typically deployed to secure networks, employing different sets of rules to filter out malicious traffic while allowing only legitimate traffic into the network. Given the availability of these intrusion detection tools and technologies, the likelihood that attackers can successfully execute scanning techniques is low.
As technology continues to advance and security measures become more sophisticated, attackers are constantly developing new techniques to gain access to target networks. In addition to standard scanning methods, adversaries create custom scans that involve sending packets with combinations of TCP flags that are not typically used in normal communication. This allows them to map firewall rules and gain a better understanding of the traffic filtering rules implemented on the firewall. It then helps them develop attack strategies that send traffic probing the network in a manner that evades the implemented rules. By doing so, attackers can identify attack vectors and exploit them to gain unauthorised access to the target network. It is imperative for organisations to detect these malicious activities at an early stage to prevent ultimate attacks. Timely detection and appropriate countermeasures can protect organisations from severe financial and reputational damage.
Intrusion Detection Systems (IDSs) are security measures that can either be devices or software designed to monitor hosts or networks proactively [20,32]. Their primary objective is to detect and report malicious activities to the network security team. IDSs can be classified into two categories based on their behaviour: host-based IDSs (HIDSs) and network-based IDSs (NIDSs). NIDSs analyse network traffic collected from devices such as routers and switches, whereas HIDSs process and analyse log files to detect attacks on a particular host [32].
IDSs can also be classified based on the techniques they utilise, such as signature-based IDSs and anomaly-based IDSs. Signature-based IDSs identify threats by matching traffic against predefined signatures of known malicious activities, while anomaly-based IDSs monitor and identify unusual network behaviour that deviates from the norm. In summary, IDSs are an essential component of a robust cybersecurity strategy, adding another layer of security that helps detect and prevent potential security threats and attacks.
This paper proposes an approach for intrusion detection of malicious activities in network traffic that utilises SPM techniques. As a proof of concept, this work focuses on detecting the second phase of a typical attack life cycle, which is the scanning phase. SPM is an unsupervised learning technique that extracts intrinsic communication patterns from network traffic. The patterns discovered through SPM are then used to detect scanning activities on the monitored network. Additionally, a rule-based approach is proposed as part of the system for the classification of scanning traffic based on the discovered sequential patterns.
The rest of this paper is organised as follows: Section 2 discusses related work, Section 3 presents the proposed methodology, Section 4 discusses the experimental setup, the dataset used and the results, and Section 5 concludes the paper.

RELATED WORK
Ananin et al. [1] conducted a comprehensive review of various port scan types, including scanning attacks, and developed a mathematical model for detecting anomalies related to these attacks. They evaluated their approach by implementing an algorithm derived from the mathematical models to test their detection model.
Birkinshaw et al. [5] proposed an Intrusion Detection and Prevention System (IDPS) designed to detect port scanning attacks and Denial of Service (DoS) attacks. The authors stress the importance of early detection, such as during port scanning, to prevent the potentially devastating impact of ultimate attacks such as DoS. Their proposed approach utilises Software Defined Network (SDN) technology and is capable of real-time detection. Moreover, the approach can be extended to include the detection of other types of malicious activities. The authors reported a low False Positive Rate (FPR) for their approach.
Husák et al. [18] conducted a study highlighting the underutilisation of data mining techniques in the cybersecurity domain. They provided an in-depth discussion of rule mining and SPM use cases, particularly in the context of cyber alert analysis. Moreover, they conducted a survey on alert correlation and attack prediction. The authors evaluated pattern mining techniques, including their speed, using a real dataset of alerts. Finally, they presented a comparison of different methods and shared lessons learned, thus demonstrating the importance of exploring the full potential of data mining techniques in the cybersecurity domain.
Tıktıklar et al. [31] conducted a study that investigated the existing SPM algorithms. The study analysed the underlying principles of the algorithms and performed a comparative analysis across various domains such as cybersecurity, telecommunications, air quality monitoring, and user behaviour analysis. The evaluation of the algorithms was based on a real-life telecommunications dataset. The study compared three SPM algorithms, namely GSP, PrefixSpan, and CMRules, and concluded that their performance may vary depending on the dataset analysed.
Fournier-Viger [11] conducted a comprehensive survey on SPM and identified its trends for discovering patterns in sequential data. SPM algorithms have found numerous applications in different domains ranging from bioinformatics to e-commerce. One of the prominent applications of SPM is natural language processing, particularly in text analysis. In addition, SPM algorithms have been used in market analysis to analyse customers' purchasing patterns, which helps in recommending products to customers. The study discusses some popular SPM algorithms such as PrefixSpan, highlighting their strengths and weaknesses.
Jafarian et al. [19] proposed a DNS-based technique for detecting network scanning attacks aimed at enterprise networks, both internal and external. Their approach involves monitoring the network subnet's ingress and egress flow and correlating it with the preceding DNS query/response. This method has been shown to effectively detect scans with less overhead.
In their study, Yue et al. [33] analysed the Train Ethernet Consist Network (ECN), which is responsible for transmitting train control signals. They identified intrusion threats to the data security of railway vehicles due to the increased interaction between the train network and the external environment. To address these challenges, they proposed an ensemble-based IDS that can detect ECN attacks such as IP Scan, Port Scan, DoS and Man-in-the-Middle (MITM) attacks. Their proposed IDS employs Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to detect such attacks. The authors evaluated their IDS on an ECN testbed and reported a high accuracy of 0.975.
In their study, Sagatov et al. [28] emphasised the significance of protecting networks against scanning attacks, which are often the first steps in exploiting network vulnerabilities. These attacks exploit protocol behaviour to gather information about open ports and the services running on a target network, which can then be used to exploit any discovered vulnerabilities. The researchers proposed a method to detect the initial stages of attacks in TCP and UDP, which could help address the challenges of defending against these attacks. They tested their method on a testbed they created and evaluated its effectiveness.
Aparicio-Navarro et al. [4] proposed an IDS that uses Fuzzy Cognitive Map (FCM) and Pattern-of-Life (PoL) techniques to detect malicious activities. The IDS is designed to address the increasing complexities of cyber-attacks. In their evaluation, the team reported a high detection rate of 99.76% with a low FPR of 6.33%. Other intrusion detection approaches for some of the malicious activities that have recently been a serious threat include machine learning approaches [6,15,16,34]. Specifically, feature selection approaches contribute to improved performance, evidenced by high TPR and low FPR [21,24].

PROPOSED METHODOLOGY
The proposed methodology is illustrated in Fig. 1. The proposed system takes in network traffic as input, which is then processed by the Network Traffic Filter Module. This module extracts key features from the packets transmitted between hosts, such as ICMP type and code IDs or TCP header flags. These key features are then organised into a sequence that accurately represents the activities between the hosts. The extracted features can be related to different communication activities between two hosts communicating through TCP, User Datagram Protocol (UDP) or any other protocol. The output of this module is filtered traffic containing only the relevant features, organised as a database of sequences. This sequence database is passed to the Sequential Pattern Miner Module for further processing.
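As an illustration of the Network Traffic Filter Module described above, the following minimal Python sketch groups packet records into one sequence of TCP flag values per pair of endpoints. The record layout (tuples of source IP, source port, destination IP, destination port and flags), the function name build_sequences and the choice to place both directions of a connection under one key are assumptions made for this example; they are not taken from the described system.

```python
from collections import defaultdict


def build_sequences(packets):
    """Group packets into one flag sequence per pair of endpoints.

    `packets` is an iterable of (src_ip, src_port, dst_ip, dst_port, flags)
    tuples in capture order. This layout is a hypothetical simplification.
    """
    sequences = defaultdict(list)
    for src_ip, src_port, dst_ip, dst_port, flags in packets:
        # Sort the two endpoints so both directions share a single key.
        key = tuple(sorted([(src_ip, src_port), (dst_ip, dst_port)]))
        sequences[key].append(flags)
    return dict(sequences)


# Illustrative records only; they are not drawn from the evaluated dataset.
packets = [
    ("10.0.0.5", 40000, "10.0.0.9", 80, 0x002),  # SYN
    ("10.0.0.9", 80, "10.0.0.5", 40000, 0x012),  # SYN, ACK
    ("10.0.0.5", 40000, "10.0.0.9", 80, 0x004),  # RST
    ("10.0.0.5", 40001, "10.0.0.9", 22, 0x002),  # SYN
]
sequence_db = build_sequences(packets)
```

Each value in sequence_db is one sequence of the kind that would be handed to the Sequential Pattern Miner Module.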
The Sequential Pattern Miner Module extracts frequent sequential patterns that are passed to the Detection Rules Generator Module and the deployed Malicious Activity Detection Module. At this point, the system analyses sequential patterns through both modules. The Malicious Activity Detection Module is designed to identify any instances of malicious activity based on the detection rules that have been implemented within the module. The Detection Rules Generator Module, on the other hand, is responsible for supporting the development of new detection rules. This is done by forwarding unknown patterns to the Network Security Team for a thorough analysis, which is then used to create new detection rules. These newly created rules are then evaluated and ultimately deployed in the Malicious Activity Detection Module.
Sequential pattern mining is a technique used to extract valuable insights from sequential data in various domains. For instance, it is used in recommender systems to analyse sequences of products purchased together or subsequently, revealing crucial insights about customer buying behaviour. The discovered sequential patterns are then used to recommend products to customers based on their purchasing patterns [30]. Apart from the retail domain, sequential pattern mining has also been successfully employed in other domains, such as cybersecurity [18], to analyse sequential patterns.
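The division of work between the Malicious Activity Detection Module and the Detection Rules Generator Module can be sketched as follows. This is a minimal illustration under stated assumptions: rules are modelled as exact matches on a pattern, and the names classify_patterns and rules, as well as the rule label, are hypothetical. The paper does not specify how its rules are represented.

```python
def classify_patterns(patterns, rules):
    """Split mined patterns into alerts and unknown patterns.

    `rules` maps a known pattern (a tuple of integers) to a label.
    Exact matching is an assumption made for this sketch.
    """
    alerts, unknown = [], []
    for pattern in patterns:
        pattern = tuple(pattern)
        label = rules.get(pattern)
        if label is not None:
            # Known pattern: the detection module raises an alert.
            alerts.append((pattern, label))
        else:
            # Unknown pattern: forward to the security team for analysis.
            unknown.append(pattern)
    return alerts, unknown


# Hypothetical rule: [SYN] -> [SYN, ACK] -> [ACK] -> [RST].
rules = {(0x002, 0x012, 0x010, 0x004): "stealth scan"}
mined = [[0x002, 0x012, 0x010, 0x004], [0x029]]
alerts, unknown = classify_patterns(mined, rules)
```

In this sketch, a rule approved by the security team would simply be added to the rules dictionary, after which the corresponding pattern produces alerts instead of being forwarded.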
The generation and analysis of sequences are a crucial part of the proposed system. Sequential Pattern Mining Framework (SPMF), a data mining software package, is used to extract sequential patterns from the sequences [10,12]. The sequences generated from network traffic are preprocessed to transform them into a format compatible with SPMF. Since SPMF only takes integer values as input, the preprocessing includes converting feature values for the sequences into integers. Once the sequences of activities are ready, they are passed into the Sequential Pattern Miner Module, which uses SPMF to extract sequential patterns from the network traffic sequences for further analysis, revealing insights into the malicious activities taking place on the network.
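The preprocessing step can be sketched as follows. SPMF's sequence database text format places one sequence per line, with positive integer items, -1 closing each itemset and -2 closing the sequence. The function name to_spmf_lines and the policy of assigning identifiers in order of first appearance are assumptions made for this example, as is treating each packet feature as its own single-item itemset.

```python
def to_spmf_lines(sequences):
    """Encode sequences of hashable feature values as SPMF text lines."""
    item_ids = {}
    lines = []
    for seq in sequences:
        parts = []
        for value in seq:
            # Assign positive integer IDs in order of first appearance.
            item_id = item_ids.setdefault(value, len(item_ids) + 1)
            # Each feature value forms its own itemset, closed by -1.
            parts.append(f"{item_id} -1")
        parts.append("-2")  # -2 marks the end of the sequence
        lines.append(" ".join(parts))
    return lines, item_ids


lines, mapping = to_spmf_lines([["SYN", "SYN-ACK", "RST"], ["SYN", "RST"]])
```

The returned mapping is kept so that mined integer patterns can be translated back into feature values when they are analysed.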
The sequential pattern mining algorithm utilised is PrefixSpan [11,27], as implemented in SPMF. PrefixSpan uses two techniques for mining frequent sequential patterns: projection of the sequence database onto the subsequences that follow a given prefix, and depth-first search for traversing the sequence database. This process of finding different sequential patterns is done recursively. To mine sequential patterns, the algorithm requires an input sequence database and a minimum support. The support of a sequential pattern is its frequency of occurrence, that is, how many sequences contain it, and the minimum support is the threshold that this frequency must reach.
Upon receiving input, the algorithm scans the entire database and counts the support of each single-item sequential pattern in the set of sequences. The support of each of these sequential patterns is then evaluated against the minimum support. Any sequential pattern with support less than the minimum support is considered infrequent and is consequently eliminated. The process is repeated to find the next sequential patterns, each comprising the occurrence of one item followed by another item; this is performed for each of the sequences in the sequence database. Again, the support of these subsequences is compared against the minimum support, and those found to be infrequent are eliminated. This process continues, yielding progressively longer frequent sequential patterns, until no further frequent patterns can be found [27]. One of the benefits of the PrefixSpan algorithm is that it considers only the observed sequence database, as opposed to creating new ones like other algorithms do, and it is easy to extend.
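The recursive procedure described above can be illustrated with a minimal Python sketch of PrefixSpan, restricted to sequences in which every element is a single item. It is not the SPMF implementation used in this work; the function name prefixspan, the use of an absolute count for the minimum support and the toy database are assumptions made for this example.

```python
def prefixspan(db, min_support):
    """Return {pattern: support} for all frequent sequential patterns.

    `db` is a list of sequences (lists of hashable items) and
    `min_support` is an absolute number of sequences.
    """
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence index, start offset) pairs, i.e. the
        # suffixes that remain after matching `prefix` in each sequence.
        counts = {}
        for sid, start in projected:
            # Count each item at most once per sequence.
            for item in set(db[sid][start:]):
                counts[item] = counts.get(item, 0) + 1
        for item, support in counts.items():
            if support < min_support:
                continue  # infrequent extension, eliminated
            pattern = prefix + (item,)
            results[pattern] = support
            # Project: keep the suffix after the first occurrence of item.
            new_projected = []
            for sid, start in projected:
                seq = db[sid]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        new_projected.append((sid, pos + 1))
                        break
            mine(pattern, new_projected)  # depth-first extension

    mine((), [(sid, 0) for sid in range(len(db))])
    return results


# Toy database of integer-coded flag sequences (illustrative only).
db = [[2, 18, 16, 4], [2, 18, 4], [2, 4]]
patterns = prefixspan(db, min_support=2)
```

With this toy database and a minimum support of 2, item 16 occurs in only one sequence and is eliminated in the first pass, so no pattern containing it is ever explored.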

EVALUATION RESULTS
This section discusses the evaluation of the proposed approach. It is split into two subsections: Section 4.1 covers the dataset description and the steps followed, and Section 4.2 presents the analysis and discussion.

Experimental Setup
To evaluate the effectiveness of the proposed system, the reconnaissance dataset [25] consisting of port scanning activities is utilised. Specifically, TCP three-way handshake traffic relating to TCP flags for setting up communication connections is derived from this dataset. As a proof of concept for the performance evaluation of the proposed system, the publicly available dataset by the Canadian Institute for Cybersecurity, based at the University of New Brunswick, is utilised [25]. This dataset consists of network traffic generated from 105 IoT devices, with 33 different attacks executed, including reconnaissance activities and, more specifically, port scanning. The experiment was performed following the steps illustrated in Fig. 1. The process begins with the extraction of relevant features. The feature extraction process focuses on the TCP three-way handshake negotiation process between two hosts communicating through TCP. This approach provides a detailed evaluation of the network traffic and its patterns. Specifically, for each TCP connection setup, a sequence is generated for network packets, with a particular emphasis on TCP flags for each connection setup. This enables an in-depth and critical analysis, which leads to insights into how these malicious activities work and what their goals are. This in turn supports the development of countermeasures to combat these malicious activities.

Analysis and Discussion
This section provides a detailed discussion of the results of the experimental setup. Fig. 3 shows a sample of frequent sequential patterns within the port scanning traffic uncovered by the SPM process. While existing signature-based detection approaches can already detect this type of scan and related ones, advanced cyber-attackers do not confine themselves to the standard communication patterns. Instead, they experiment with different custom scans that are not necessarily aimed at determining whether a particular port is open; the goal is rather to map firewall rules [26]. Once the firewall rules on the target network are well understood, the adversaries can then develop a successful strategy to breach firewalls and further probe the network for running services. This leads to the discovery of version numbers of these services and ultimately vulnerabilities, which are exploited to gain access. With the proposed SPM system, these custom patterns will be detected by rules generated for the detection of such malicious activities.

Figure 2: Samples of Network Activity Sequences
The proposed approach is evaluated by analysing the TCP handshake traffic and labelling it for horizontal scanning. To create ground truths for horizontal port scans, the approach considers a scenario where a source IP address scans multiple IP addresses on the same port. The number of IP addresses scanned can be set to a value large enough to constitute a horizontal scan. A detection rule is then developed to identify similar patterns across multiple devices, which is indicative of the same type of malicious activity. Beyond just detecting a scan, these frequent sequential patterns detected on multiple hosts are forwarded to the network security team for further insights into the goals of the malicious activity. This approach can reveal firewall rules that the malicious activity is attempting to circumvent. Once the goals of the sequential patterns are determined, specific rules can be developed to detect similar patterns more quickly. The confusion matrix in
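The ground-truth labelling for horizontal scans described in this subsection can be sketched as follows. This is a minimal illustration assuming connection records of the form (source IP, destination IP, destination port). The function name label_horizontal_scans, the parameter min_targets and the example threshold are hypothetical, since the threshold value used in the evaluation is not stated here.

```python
from collections import defaultdict


def label_horizontal_scans(connections, min_targets):
    """Return the (src_ip, dst_port) pairs labelled as horizontal scans.

    `connections` is an iterable of (src_ip, dst_ip, dst_port) tuples.
    A source is labelled when it contacts at least `min_targets`
    distinct destination IP addresses on the same port.
    """
    targets = defaultdict(set)
    for src_ip, dst_ip, dst_port in connections:
        targets[(src_ip, dst_port)].add(dst_ip)
    return {key for key, hosts in targets.items() if len(hosts) >= min_targets}


# Illustrative records only; they are not drawn from the evaluated dataset.
connections = [
    ("10.0.0.5", "10.0.0.10", 23),
    ("10.0.0.5", "10.0.0.11", 23),
    ("10.0.0.5", "10.0.0.12", 23),
    ("10.0.0.7", "10.0.0.10", 443),
    ("10.0.0.7", "10.0.0.10", 443),  # repeat contact with the same host
]
scanners = label_horizontal_scans(connections, min_targets=3)
```

Using a set per (source, port) pair means repeated connections to the same destination do not inflate the count, so only the number of distinct hosts matters.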

CONCLUSION
This paper presents an approach for IDSs that utilises SPM for detecting malicious activities in network traffic. The proposed system uses SPM to identify sequential patterns from the network traffic, which are then utilised to detect malicious traffic using a rule-based engine. The system is evaluated on a publicly available reconnaissance dataset for detecting port scanning activities and is capable of detecting advanced custom scans and stealth scans. The proposed system also facilitates the generation of security rules by forwarding unknown sequential patterns related to new advanced custom scans to the network security team for further analysis. This approach provides efficient and realistic labelling of the scanning attack and improves network security. Future work will focus on developing and adding more rules to the system to enable the detection of other malicious activities in addition to port scanning.

Figure 1: Proposed Methodology for the detection of malicious activities.
Fig. 2 provides a sample of sequences generated for each pair of source IP address and source port and destination IP address and destination port while setting up network communication. For example, for two communicating hosts, namely the scanner and the target, the sequence generated might be 0x0002 - 0x0012 - 0x0010 - 0x0004. This communication sequence translates to [SYN] -> [SYN, ACK] -> [ACK] -> [RST], a type of scanning activity known as a stealth scan or a half-open scan.
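The translation from hexadecimal flag values to flag names in the example above can be reproduced with a short Python sketch. The bit values are the standard TCP control bits; the function names decode_flags and decode_sequence are hypothetical and used only for illustration.

```python
# Standard TCP control bits, least significant first.
TCP_FLAGS = [
    (0x01, "FIN"),
    (0x02, "SYN"),
    (0x04, "RST"),
    (0x08, "PSH"),
    (0x10, "ACK"),
    (0x20, "URG"),
]


def decode_flags(value):
    """Return the names of the flags set in a TCP flags field."""
    return [name for bit, name in TCP_FLAGS if value & bit]


def decode_sequence(values):
    """Render a sequence of flag values as '[A] -> [B, C] -> ...'."""
    return " -> ".join("[" + ", ".join(decode_flags(v)) + "]" for v in values)


decoded = decode_sequence([0x0002, 0x0012, 0x0010, 0x0004])
# decoded == "[SYN] -> [SYN, ACK] -> [ACK] -> [RST]"
```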