Identifying Industry Devices via Time Delay in Dataflow

In networks with critical industrial processes where operational integrity is paramount, device identification is crucial for security and effective management. Without such identification, the potential for mismanagement and security breaches increases. Active scanning for network device identification poses risks, especially in industrial settings. Such scanning can disrupt operations or even cause damage. Therefore, finding non-invasive identification methods that bypass active scanning is imperative. Passive scanning, owing to its non-intrusive approach, is favored for industrial devices. Modern statistical learning techniques combined with passive scanning can mitigate risks of active methods. Our research harnesses time delay data in network communications to accurately identify specific industrial PLC models. We derive our data from timestamp details of the OPC UA protocol, widely recognized as a standard in industrial communication. Statistical variables from time delay data enhance the accuracy of passive device identification in industrial settings.


INTRODUCTION
Convergence between information technology (IT) and operational technology (OT) systems has been underway for more than a decade. The cybersecurity implications of this convergence are receiving increasing attention [2,4,7].
In this article, we focus on one area of cybersecurity, the scanning of industrial networks, where the approach differs from that used in IT networks.
Scanning computer networks is a process that allows the discovery of active devices on a network. The purpose may be to gather information about these devices (model, type, quantity, communication method, etc.) or to detect potential vulnerabilities. Network scanning is an important part of network management and security; typical reasons for its use include inventory management, troubleshooting, and security assessment. Network scanning can be divided into two approaches, the passive scanning method and the active scanning method:
• Passive scanning - Passive scanning analyzes traffic without actively sending packets to, or actively probing, target devices. It is performed using network sniffers that capture and analyze data packets traversing the network, focusing on traffic patterns, protocols, and communication behavior. A well-known tool for passive scanning is Wireshark.
• Active scanning - Active scanning involves sending packets to target systems to obtain responses, and thus involves direct interaction with the target systems.
Active scanning is commonly used for network management and can provide the administrator with a large amount of information in a short period of time. Well-known tools for active scanning are Nmap and Nessus. Within network scanning, we focus on the identification of industrial devices. Because industrial devices emphasize real-time communication and minimal delay, scanning and device identification must be approached passively. Active scanning can affect the entire production process: even a small delay on a device can cause a large deviation at the final production stage, which can mean a large financial loss and can also endanger employees or the environment. Active scanning should therefore be avoided during normal production; it can be used during a pre-planned shutdown, where it has less impact on production, but the shutdown itself is costly for the company. It is therefore important to pursue passive scanning and device identification, and modern statistical learning methods and approaches can help us to do so [12].
There are several possible data sources for statistical learning models. In this paper, we chose to explore network communication as a data source, since it is relatively easy to capture. A considerable challenge, however, is how to work with network communication so that the model generalizes rather than depending on a specific environment. We therefore focused on packet-level exploration, and the parameter most suitable for generalization appeared to be the time delay of the device itself. Obtaining meaningful timestamps from the device is itself a challenge, so we decided to use the OPC UA protocol, which carries these timestamps internally. Our identification approach can thus be characterized as a clock-skew approach based on time series. Next, we focused on feature engineering for the timing parameters, based on statistical variables; this gives the model good insight into the time-related behavior of each device. As the processing model, we considered mainly tree-based models, as they generally perform very well on tabular and time-series problems, and for this reason we chose the XGBoost algorithm.
The paper is structured as follows. In Related Works we summarize the current state of the art of industrial device identification. In Section 3, we describe the methodology and our data processing approach to device identification. This is followed by the Results section, which presents the results and the identification success rate. The penultimate section is the Discussion, where we discuss the results, limitations, and possible problems. Finally, the Conclusion summarizes the whole paper.

RELATED WORKS
There are several works dealing with device identification based on network traffic and timing information, i.e., the clock-skew approach [5, 6, 8-10, 13]. However, none of these articles focus on the industrial environment and PLC devices; most identification articles focus primarily on IoT. Some articles are dedicated to industry [1, 3], but they use a different approach. Lanze et al. [5] deal with the identification of wireless access points and use information from network communication (beacon frames) for clock skew; they use classical statistical methods for identification rather than machine learning (ML) models. Sharma et al. [13] focused on the identification of classical IT devices such as computers, phones, and laptops; they use ICMP timestamp information for clock skew and likewise take a classical statistical approach. A similar approach was chosen by Polcak et al. [9, 10]. Oser et al. [8] address the identification of IoT devices with machine learning, using timestamp information from TCP communication to compute clock skew. A similar approach is taken by Le et al. [6]. In the industrial context, Formby et al. [3] identified industrial devices from network communication, using packet header analysis as a data source and focusing particularly on response delay time. Boyes et al. [1] focused primarily on Industrial IoT devices, but only from an analysis perspective. Our work is mainly concerned with the possibilities of passive identification in industry, as such an approach is important for industrial facilities. It also deals with purely industrial devices, i.e., PLCs, and the possibility of identifying the device model. We focus on using the clock-skew data source to create statistical input parameters for better, passive identification of individual device models.

METHODOLOGY
3.1 Data collection
Since this work concerns model identification of industrial devices, we used six physical PLC devices for data collection. Three devices were used to collect data for the training set, and three other devices (same models) were used to collect test data. The device models, and whether each was used for training or testing, can be seen in Table 1. The three training devices (ET200SP_Tr, S7_1200_Tr, and S7_1500_Tr) were not placed in any scenario or simulation, i.e., they had no inputs or outputs connected; they were pure PLCs configured for OPC UA communication only. The other three devices (ET200SP_Te, S7_1200_Te, and S7_1500_Te) were connected in different scenarios, i.e., used in normal operation. This configuration was chosen for the validity of the results, since a realistic test set that is independent of the training set is important. The ET200SP_Te device was wired in the lab with inputs and outputs (LEDs and buttons) connected for testing and configuration by students; its data was captured over a virtual private network (VPN), which we mention because of the possible delay and large data flow that could affect the results. The S7_1200_Te device is used in a motor control polygon, a test rig used for teaching (laboratory exercises) that also comprises an HMI, a motor, a driver, buttons, and LEDs. The third device, S7_1500_Te, is used as the control unit of a beer production polygon, a so-called mini brewery, which has a more complex circuit and of course includes an HMI and a number of inputs and outputs. More about the devices and test environments, especially the brewery test polygon, can be found in [11].

3.1.1 The essence of the data.
In investigating the possibility of identification, we focused on the clock delay of the device relative to real time (clock skew) as a data source, and on obtaining this information from network traffic, i.e., a clock-skew approach based on network communication. The problem here is how to get timestamp information from the device within the network communication: for the information to be valid, it is important to obtain the timestamp added by the device itself when sending. We therefore chose communication within the OPC UA protocol, in which the timestamp added by the device itself at send time is visible. The data was captured using Wireshark, so the output was .pcap files. Within the .pcap files we filtered only OPC UA traffic and focused on only five types of OPC UA messages: ReadRequest, ReadResponse, PublishRequest, PublishResponse, and ServiceFault. We collected eight hours of communication for the training devices and over three hours of communication for the testing devices. We extracted the communication into CSV files and then processed them in Python using the pandas library.
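The message-type filtering step can be sketched in pandas as follows. This is a minimal sketch, not the authors' code; the column name `MessageType` is an assumption about the CSV layout produced by the export.

```python
import pandas as pd

# The five OPC UA service message types kept in the study. The column name
# "MessageType" is hypothetical; the real CSV layout depends on the export.
KEPT_TYPES = ["ReadRequest", "ReadResponse", "PublishRequest",
              "PublishResponse", "ServiceFault"]

def filter_opcua_messages(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose OPC UA message type is one of the five used."""
    return df[df["MessageType"].isin(KEPT_TYPES)].reset_index(drop=True)
```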

Data preparation
From the whole captured traffic, we were interested in only two values: OPCUA_Timestamp, which contains the timestamp value set directly by the device, and Time_PC_Observ, which contains the time from a synchronized observing computer. We created a new clock_skew column by subtracting OPCUA_Timestamp from Time_PC_Observ. For the datasets, we first had to filter out the individual devices so that we could process the datasets more accurately. Next, we needed to define the time period over which we would identify the devices; we decided to perform the identification within one hour, making this a time-series problem. According to the captured communication, one hour corresponded to 874 packets for the ET200SP, 874 packets for the S7_1500, and 893 packets for the S7_1200. We averaged these values and parsed 880 packets for each hour. For the training set, we generated eight samples for each device. For the test set we had fewer samples: three each for the ET200SP_Te and S7_1200_Te devices, and five for the S7_1500_Te device. Thus, the training set comprised a total of 24 samples and the test set 11 samples.

Feature engineering
In our model, we employed time series data to derive several statistical features as input.Specifically, for every 880 packets, we established new parameters including the linear regression coefficient (LR_coefficient), standard deviation (std_dev), coefficient of variation (coef_of_var), mean skewness (mean_of_skew), and range (range).
The first input parameter we created was the slope of the linear regression (coefficient of linear regression, 'LR_coefficient'). For this, we worked with the input parameters 'Time' and the time lag 'clock_skew': for each sample we fitted a linear regression to these parameters, which gave us the slope of the regression, i.e., its coefficient. We then flattened the curve by removing the fitted regression line, so that we could work with the other input parameters without bias; the remaining input parameters were calculated on this flattened curve. As the second input feature, we calculated the standard deviation ('std_dev') for each sample on the flattened curve. We then calculated the coefficient of variation ('coef_of_var') for all samples, and as the fourth feature the skewness ('mean_of_skew'). The last input variable we created is the range (maximum value minus minimum value). We prepared the features in this way for both the training and test datasets. In the results, we found that the coefficient of variation was not important for our model, so Figure 1 shows only the four parameters that affected the results, plotting their distribution in the training set for each sample. The labels are as follows: Label 1 = ET200SP_Tr, Label 2 = S7_1200_Tr, and Label 3 = S7_1500_Tr. One sample was an outlier for the S7_1200_Tr device, but we left it in the dataset because we were interested in its impact; with more data it would be reasonable to remove this outlier.
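The feature extraction described above can be sketched per sample as follows. This is an illustrative reconstruction, not the authors' code: the exact definitions (e.g., whether the coefficient of variation is computed on the raw or the detrended series) are not spelled out in the text, so here it is computed on the raw series, since the detrended residuals have near-zero mean.

```python
import numpy as np

def _skewness(x: np.ndarray) -> float:
    """Sample skewness: third central moment over std cubed."""
    m, s = x.mean(), x.std()
    return float(((x - m) ** 3).mean() / s ** 3) if s > 0 else 0.0

def extract_features(time_s, clock_skew) -> dict:
    """Compute the five statistical features for one 880-packet sample."""
    t = np.asarray(time_s, dtype=float)
    y = np.asarray(clock_skew, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)       # linear regression fit
    resid = y - (slope * t + intercept)          # "flattened" (detrended) curve
    return {
        "LR_coefficient": float(slope),
        "std_dev": float(resid.std()),
        # Assumption: coefficient of variation on the raw series.
        "coef_of_var": float(y.std() / y.mean()) if y.mean() != 0 else float("nan"),
        "mean_of_skew": _skewness(resid),
        "range": float(resid.max() - resid.min()),
    }
```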

Algorithm
Since this paper is concerned with device identification, the machine learning task can be defined as a classification problem. Given that it is a tabular-data and time-series problem, the use of tree-based algorithms seems appropriate. We therefore chose the nowadays very popular XGBoost algorithm, which has good results for this type of data. Table 2 lists the hyperparameter settings of the XGBoost algorithm.

RESULTS
The results on the test set were very promising: the algorithm misidentified only one device, confusing an ET200SP with an S7_1200. In total, the test set contained the 11 samples of hourly traffic mentioned above, of which 10 were identified correctly. The model reached an accuracy of 91%, a precision of 93%, a recall of 91%, and an F1-score of 91%. A summary can be seen in Table 3.
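The reported metrics are consistent with the single confusion described above. The following sketch reproduces them from the test-set composition (3 ET200SP, 3 S7_1200, 5 S7_1500 samples), under the assumption that weighted averaging was used for the multi-class precision, recall, and F1, which the text does not state explicitly.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Labels: 1 = ET200SP, 2 = S7_1200, 3 = S7_1500.
# One ET200SP test sample was predicted as S7_1200.
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3]

acc  = accuracy_score(y_true, y_pred)                       # 10/11 ~ 0.91
prec = precision_score(y_true, y_pred, average="weighted")  # ~ 0.93
rec  = recall_score(y_true, y_pred, average="weighted")     # ~ 0.91
f1   = f1_score(y_true, y_pred, average="weighted")         # ~ 0.91
```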
Figure 2 shows the confusion matrix, illustrating the model's success rate on the test set; only one misidentification can be seen. The importance of the input features is shown in Figure 3. The two most significant input features were the standard deviation and the coefficient of the linear regression. The range and skewness features are also worth mentioning, although their importance was weaker, and the least important input feature was the coefficient of variation.

DISCUSSION
The results bode very well for further research, as the device identification success rate was 91% from an hourly run. The confusion matrix shows that only one sample was not recognized, which could also be due to the single outlier within the training set; the algorithm predicted that sample to be an S7_1200 device when it was actually an ET200SP. It is also worth mentioning that, at the level of model identification, two different variants were used for training and testing within the PLC S7_1200 family: the 1212C for training and the 1215C for testing. Even so, excellent results were obtained, i.e., the identification generalizes quite well within single models. Another interesting point concerning identification accuracy is that the test data for the ET200SP device was captured via VPN, as we wanted to test the effect of a VPN on model identification performance. Identification remained very good, so no negative effect of the VPN was observed in this setup.
The main limitation of the work is the small number of training and test samples; on the other hand, they are meaningful samples that do not influence each other, so it can be assumed that such a model would perform well in real deployment. It would also be good to include more PLC model types to increase the recognition range, but since Siemens currently offers these three models as its main products (the S7_1200, the ET200SP, and the top-class S7_1500), the model is still representative for Siemens devices. We would have liked to include older device types such as the S7-300 and S7-400, but these devices do not currently support OPC UA communication and so were not included in this scenario.

CONCLUSION
This paper dealt with the possibility of identifying industrial devices based on statistical parameters calculated from the time delay of the device clock, with this information collected from network communication. It used a modern machine learning approach to develop a model that identifies three PLC models. Two independent datasets were created, one for training and one for testing, each collected from different devices. Using these features for device identification appears to be a good direction and is particularly suitable for generalizing the whole solution, i.e., the approach is not environment dependent.
In future work, we plan to enlarge the datasets with more samples for individual devices. Furthermore, our goal is to reduce the time needed for device identification as much as possible, i.e., to split the samples over shorter time periods and evaluate whether the model still performs well. Time is probably the most important commodity, and we would therefore like to reach a shorter identification window.

Figure 1: Sample distribution in the training set.

Figure 3: Importance of features for device identification.

Table 1: Devices used for data collection.

Table 3: Evaluation metrics for the XGBoost model.