SensorLoader: Bridging the Gap in Cyber-Physical Reverse Engineering Across Embedded Peripheral Devices

Safety-critical cyber-physical systems, such as autonomous vehicles and medical devices, are often driven by notions of state provided by sensor information translated through embedded firmware. This sensor pipeline is often a fragmented supply chain across vendors, and analyzing the associated security properties entails semantic reverse engineering of third-party software, i.e., mapping low-level software representations to cyber-physical models without access to source code. This mapping is a manual, time-consuming, and error-prone process. This paper introduces SensorLoader, a tool designed to automate mapping sensor semantics across all layers of closed-source software representations. SensorLoader exploits open-source knowledge, potentially derived from structured vendor description files or unstructured vendor datasheets, to extract and infer sensor semantics. We leverage large language models to extract sensor semantics from unstructured sources and map the semantics to memory maps and structures used by the Ghidra reverse engineering framework. We formalize the limitations of this automatic extraction and demonstrate how our approach can streamline the reverse engineering process for embedded systems. Preliminary evaluations suggest that SensorLoader can effectively and scalably aid in identifying vulnerabilities and deviations from expected behaviors, offering a more efficient pathway to secure cyber-physical systems.


INTRODUCTION
Internet of Things (IoT) devices are an integral part of our world today, and the primary way they interact with the world is through sensors and peripheral devices. The raw data collected by sensors are fed into embedded systems to compute and estimate notions of state about the physical world in order to drive safety-critical applications. However, the sense-to-actuate pipeline is susceptible to a range of vulnerabilities, from traditional cyber vulnerabilities and external false data injection attacks to out-of-band vulnerabilities that target imperfections in the sensor analog-to-digital signal processing [1]. Moreover, security analyses of these devices often entail semantic reverse engineering of embedded firmware, i.e., mapping traditional program analyses to the semantics of the cyber-physical application domain.
Prior works in semantic reverse engineering [4,7,11] aim to identify critical control algorithms, typically starting by analyzing how sensor data propagates from inputs to actuators and the software control flow in between. However, identifying where sensor data is mapped in memory often requires rigorous manual analysis, including cross-referencing peripheral datasheets with the target microcontroller's datasheets to understand the interdependence between signals, and often validating that interdependence through manual hardware analyses. This paper aims to explore the extent to which this process of manual semantic information extraction can be automated. Automating this extraction is challenging because the peripherals within a single device come from various vendors with different standards and datasheets that may or may not be available, and the lack of ground truth for closed-source firmware hinders the ability to validate the information extraction process.
An additional challenge for semantic reverse engineering frameworks is building priors for the semantics. While previous works start with assumptions about access to high-level control algorithm information [3,7,9,10], the semantics of the low-level peripheral implementations are either abstracted away or assumed to be provided through knowledge of domain-specific development environments. Recent works [2] have leveraged tools that map vendor-provided description files of device peripherals to disassembled code [6]. However, these frameworks focus only on the microcontroller's (MCU) peripheral communication without incorporating semantics about the sensor and actuator chips on the other end of the MCU's communication bus. This paper presents SensorLoader, a reverse engineering framework that automates the extraction and mapping of sensor semantics to reverse-engineered code. SensorLoader opportunistically leverages both structured and unstructured information about peripheral mappings as well as devices connected to peripherals. Ultimately, our research aims to understand and pinpoint how and where sensor data is being sent back to the MCU. SensorLoader takes a two-pronged approach to answer the how and the where, respectively. First, it automates the extraction of peripheral semantic information and overlays it onto System View Description (SVD) files that map peripheral information. Second, it uses that data to identify functions handling sensor data in the decompiled firmware of the target MCU. SensorLoader opportunistically fuses existing structured semantic information for target MCUs and leverages large language models (LLMs) to extract semantic information from unstructured vendor datasheets, overlaying that information on the SVD files. Currently, SensorLoader completes the first step of this approach: extracting semantic sensor information and mapping it into the SVD file.
Our research focuses on answering this question: How can we automate the opportunistic integration of open-source peripheral information across multiple chips from different vendors? This paper explores the challenges that come with trying to solve this problem as well as the preliminary results of our tool, SensorLoader. Ultimately, we hope that this paper can shed light on how automating reverse engineering processes can help ensure sensor security in embedded systems and also promote further research in this area.

PRELIMINARIES
In this section we explore various semantic reverse engineering approaches and reverse engineering frameworks, and introduce our tool, SensorLoader.

Sensor Abstraction Layers
Embedded systems rely on sensors to interact with the world and carry out their given task. Sensors take in raw data, which is then processed and sent to the MCU. Raw data is transformed by Sensor Abstraction Layers (SALs), as shown in Figure 1. SALs translate heterogeneous data inputs into a standardized format suitable for subsequent processing. However, raw data from sensors, even when standardized, often carries noise and inaccuracies. Here, sensor fusion steps in as a critical process, combining data from multiple sensors to improve accuracy and reliability and compensating for individual sensor shortcomings. This consolidated data stream is then channeled into state estimation techniques, with the Kalman filter being a prime example. The Kalman filter continuously estimates the state of a system over time by weighing its predicted state against new sensor measurements, thus offering a more reliable and precise representation of the real-world scenario. For example, a Kalman filter can reduce the noise in raw accelerometer data and can combine acceleration data with data from other sensors to perform state estimation. SALs and state estimation libraries help make embedded systems more responsive and adaptive to their environments.
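To make the state estimation step concrete, the following is a minimal sketch of a scalar Kalman filter smoothing noisy accelerometer readings. It assumes a constant-state process model; the class name, the noise variances, and the sample readings are illustrative assumptions and are not taken from PX4 or from any firmware analyzed in this paper.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """One-dimensional Kalman filter with a constant-state process model."""
    x: float = 0.0    # current state estimate (e.g., acceleration in m/s^2)
    p: float = 1.0    # variance of the estimate
    q: float = 1e-4   # process noise variance (assumed value)
    r: float = 0.25   # measurement noise variance (assumed value)

    def update(self, z: float) -> float:
        # Predict: the state is assumed constant, so only uncertainty grows.
        self.p += self.q
        # Update: weigh the prediction against the new measurement z.
        gain = self.p / (self.p + self.r)
        self.x += gain * (z - self.x)
        self.p *= 1.0 - gain
        return self.x


def smooth(samples):
    """Run the filter over a list of readings, seeded with the first one."""
    kf = ScalarKalman(x=samples[0])
    return [kf.update(z) for z in samples]


# Illustrative noisy readings around 9.8 m/s^2.
readings = [9.6, 10.1, 9.7, 9.9, 10.2, 9.8]
estimates = smooth(readings)
```

Each estimate is a weighted average of the previous estimate and the new reading, so the filtered sequence varies less than the raw one. A full sensor fusion pipeline would use a multi-dimensional state and several sensors, which this sketch does not attempt.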

Semantic Reverse Engineering and Embedded Systems
Semantic reverse engineering is a common approach to understanding binary code. It is the addition of semantic information to a decompiled binary, making it more readable for engineers. An example of semantic information is an annotation of which peripherals or sensors an embedded device is using. Cyber-physical systems (CPS) are of increasing importance in numerous applications. However, their intricate communication methods leave them open to vulnerabilities. Traditional reverse engineering primarily focuses on binary code execution semantics for vulnerability analysis. Our project enhances this by automating semantic information extraction from open-source peripheral information. This automation streamlines the vulnerability assessment process and aids engineers in understanding the nuanced communication methods between peripheral devices. The motivation for our research is derived from prior semantic reverse engineering projects, notably the MISMO framework by Sun et al. [7], which focused on reverse engineering in the context of industrial control systems (ICS) and Internet of Things (IoT) devices. A common assumption in many reverse engineering projects is pre-existing knowledge of the communication protocols, locations, and other semantic metadata of peripheral devices. We aim to instantiate these assumptions by leveraging the information extracted from sensor datasheets, which serve as open-source intelligence connecting reverse engineering representations to semantic information. A recent tool, SVD-Loader [6], enables the extraction of peripheral information from MCU hardware description files in the System View Description (SVD) format. When integrated into reverse engineering tools such as angr, BinaryNinja, or Ghidra, SVD-Loader populates the memory maps with potential peripheral communication protocols and pins, thereby enhancing the binary analysis process. Several recent reverse engineering research projects have utilized SVD-Loader, including Shimware by Gustafson et al. [2], which explored retrofitting monolithic firmware images with novel security measures from an adversarial perspective. A component of this project, IOFinder, aimed at identifying malicious input insertion points in firmware, employed a hybrid static-symbolic approach to identify IO functions and taint tracking to analyze the usage of sensor data values. The evaluation indicated that advancements in "binary type recovery related to structs" could significantly automate this process.
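As an illustration of how an SVD description can be turned into memory-map labels, the sketch below parses a small SVD-style snippet and computes the absolute address of each register as the peripheral base address plus the register offset. It is a simplified stand-in written for this explanation, not SVD-Loader's code; the snippet, the function name build_memory_map, and the address values are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative SVD-style fragment; real SVD files contain many more elements.
SVD_SNIPPET = """
<device>
  <peripherals>
    <peripheral>
      <name>SPI1</name>
      <baseAddress>0x40013000</baseAddress>
      <registers>
        <register><name>CR1</name><addressOffset>0x0</addressOffset></register>
        <register><name>DR</name><addressOffset>0xC</addressOffset></register>
      </registers>
    </peripheral>
  </peripherals>
</device>
"""


def build_memory_map(svd_xml: str) -> dict:
    """Return {absolute_address: "PERIPHERAL.REGISTER"} for every register."""
    root = ET.fromstring(svd_xml)
    labels = {}
    for periph in root.iter("peripheral"):
        periph_name = periph.findtext("name")
        base = int(periph.findtext("baseAddress"), 0)  # base 0 accepts "0x..."
        for reg in periph.iter("register"):
            offset = int(reg.findtext("addressOffset"), 0)
            labels[base + offset] = f"{periph_name}.{reg.findtext('name')}"
    return labels


memory_map = build_memory_map(SVD_SNIPPET)
```

A reverse engineering tool can then attach each label to the corresponding address in its memory map, so that a load or store to that address in the disassembly is shown with a peripheral register name instead of a raw constant.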
Bridging the semantic gap with SensorLoader. Given the widespread utilization of SVD-Loader in reverse engineering research, we sought to augment this tool and extend its capabilities. Our methodology for SensorLoader was developed around the following central question: How can SVD-Loader assist in navigating through the SAL and identifying low-level representations of data structs in decompiled firmware? This reflects a strategic approach to advancing the domain of semantic reverse engineering, leveraging existing tools and pioneering new methodologies to enhance the security posture of cyber-physical systems.

SENSORLOADER OVERVIEW
In this section we discuss the details of our research and outline our approach for SensorLoader. The intention of SensorLoader is to automate a portion of the reverse engineering process, particularly the parsing and extraction of semantic details from sensor datasheets. The goal is to map this extracted information about peripherals to their low-level binary representations, providing a translation between high-level sensor specifications and the binary code that implements them. By automating these traditionally manual, error-prone, and time-consuming steps, SensorLoader helps improve accuracy and enables reverse engineers to allocate their expertise more effectively, accelerating the overall reverse engineering cycle.
Assumptions. We begin by introducing the assumptions behind our approach. Although we noted earlier that sensor datasheets are not always readily available, for our tool we assume that they are easily accessible. We also assume that the SVD file for the target MCU is available and that we can easily extract semantic information from it. Additionally, we assume that we have access to the firmware and original source code of the target MCU chip.

System Model
The design of this tool has two components. First, the tool uses Large Language Models (LLMs) to parse and extract semantic information from diverse sensor datasheets. This module is designed to recognize the patterns, terminology, and specifications embedded within datasheets so that the extraction is as complete as possible. The second component of SensorLoader's architecture is reverse engineering: analyzing the firmware with which the microcontroller unit (MCU) is flashed. By analyzing the firmware's structure and operation, SensorLoader identifies insertion points where the previously extracted data can be incorporated. This two-pronged strategy combines LLM-based extraction with reverse engineering practices to capture pertinent data and incorporate it into the analysis. Designed as a plug-in for binary analysis tools, SensorLoader integrates the information from sensor datasheets directly into the reverse engineering framework.
Sensor datasheet extraction. To parse and extract open-source peripheral information from the datasheets of the target board's sensors, we use an LLM because it is adaptable and better suited to detecting and collecting information from unstructured data. Using custom queries and prompts, we can essentially "talk" with the datasheet and ask it specific questions. The extracted information helps us understand the communication methods of these peripheral devices, for example, which buses and pins are used to read and write data. To confirm that the extracted semantic data is correct, we manually check it against the datasheet.
Reverse Engineering MCU firmware. To find the locations of the low-level binary representations of the functions and data structures of interest, we perform static binary analysis on the target MCU firmware. Decompiling the firmware into high-level source code allows us to analyze the control algorithms and identify sensor data structures and values. Because this process is not linear, we first go through it manually to determine the best path to automating the workflow. Then, leveraging the peripherals list generated by SVD-Loader, we can target specific peripherals of interest based on the extracted semantic information and jump to relevant portions of code. We can further explore the code by using function cross-references to navigate deeper into the decompiled code. Establishing ground truth in reverse engineering is important; hence, we plan to leverage IOFinder to confirm the locations of sensor functions and data structures. We also decompiled the Executable and Linkable Format (ELF) file, because it contains debugging symbols, and compared the output to our suspected variables to further validate our findings.

IMPLEMENTATION
SensorLoader was implemented by combining multiple existing technologies to extract relevant peripheral communication information. The program provides a simple menu-based user interface, where users can either analyze a list of preexisting PDF files or upload and analyze an external PDF file. Our initial target for this project was the Pixhawk 3 Pro board, by PX4 Autopilot. While the board can support multiple different sensor devices, we chose to focus on a target sensor device and how it communicates with the system processor. In all of our test cases, the target microcontroller (MCU) was the STM32. It is the most common MCU for this board and has a readily available System View Description (SVD) file that outlines general communication methods of potential onboard sensor devices. The specifics of how SensorLoader was implemented are described in the following steps:

Peripheral Device Identification
Based on the user's choice, the corresponding device's datasheet PDF file is parsed by SensorLoader and categorized according to the table of contents. This optimizes the retrieval side of the question-answer (QA) information extraction by reducing the information, and thus the tokens, passed into the language model (LM). Categorizing the datasheet boosts the efficiency and quality of the model's response, as fewer resources are devoted to parsing the entirety of the PDF to pinpoint relevant information. This mitigates some of the difficulty in analyzing unstructured files.
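The categorization step can be sketched as follows, assuming the datasheet text has already been extracted from the PDF and the section titles are known from its table of contents. The function names, the sample text, and the keyword list are hypothetical and serve only to show how narrowing the input to relevant sections reduces what is passed to the LM.

```python
import re


def split_by_headings(text, headings):
    """Split datasheet text into {heading: body} using known section titles."""
    pattern = "|".join(re.escape(h) for h in headings)
    # A heading is matched only when it occupies a full line by itself.
    matches = list(re.finditer(rf"^(?:{pattern})[ \t]*$", text, flags=re.MULTILINE))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(0).strip()] = text[m.end():end].strip()
    return sections


def select_sections(sections, keywords):
    """Keep only sections whose heading mentions one of the keywords."""
    return {
        heading: body
        for heading, body in sections.items()
        if any(k.lower() in heading.lower() for k in keywords)
    }


# Illustrative stand-in for text extracted from a datasheet PDF.
DATASHEET = """Electrical Characteristics
Supply voltage is 1.8 V.
Digital Interface
The device supports SPI and I2C.
Register Map
0x75 WHO_AM_I
"""
TOC = ["Electrical Characteristics", "Digital Interface", "Register Map"]

sections = split_by_headings(DATASHEET, TOC)
relevant = select_sections(sections, ["interface", "register"])
```

Only the text in relevant would then be passed to the model for communication-related queries, which is the token reduction the paragraph above describes.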

Sensor Semantics Information Extraction
Information extraction is driven by LangChain, backed by OpenAI's "text-davinci-002" language model [5]. The user has two options: run predefined queries or supply their own queries. The predefined option automates the extraction of communication information by iterating over a list of targeted prompts that are each sent to the LM. The list contains optimized queries that tend to yield the most specific responses. The full list of queries can be found here: (REPO LINK). A few examples are:
• "What buses does (device's name) use to communicate to the system processor?"
• "What are the registers associated with communication between (device's name) and the system processor via (targetBus1)?"
• "What are the register fields that support communication across (targetRegister1)?"
After the first iteration over all of the predefined queries, variables are initialized as strings according to the model's responses. Each variable is checked for accuracy and specificity, and if it passes, the corresponding prompt is skipped in the next iteration. The prompts are the same on the second and third iterations to account for error, but if there are still variables that do not pass the quality check, the unanswered prompts are paraphrased in an attempt to achieve the desired specificity of response. The loop ends once all of the variables are populated with relevant information. The iterative nature of the predefined queries is extensive, but it helps generalize the approach to several use cases. The user can exit the loop at any time and choose the second option. They may then write their own prompt, either to rephrase one of the leftover prompts from a failed iteration or to extract different information. The user may also choose to start with option two altogether if they are analyzing different files.
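The iterative query loop described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than SensorLoader's code: the model is represented by a plain callable (fake_llm here is a canned stand-in, not a LangChain API), the quality check is a trivial placeholder, and the prompts, device name, and register names are made up for the example.

```python
def extract_variables(ask, prompts, is_acceptable, max_rounds=5):
    """Populate variables by re-asking until each answer passes the check.

    prompts maps a variable name to [original phrasing, paraphrase, ...].
    The original phrasing is used for the first three rounds; later rounds
    fall back to the paraphrases.
    """
    answers = {}
    for round_idx in range(max_rounds):
        pending = [var for var in prompts if var not in answers]
        if not pending:
            break  # every variable is populated
        for var in pending:
            phrasings = prompts[var]
            idx = 0 if round_idx < 3 else min(round_idx - 2, len(phrasings) - 1)
            response = ask(phrasings[idx])
            if is_acceptable(response):
                answers[var] = response  # this prompt is skipped from now on
    return answers


calls = []  # records every prompt sent to the stand-in model


def fake_llm(prompt):
    """Canned stand-in for the language model (assumption for this sketch)."""
    calls.append(prompt)
    canned = {
        "What buses does SENSOR use?": "SPI and I2C",
        "Which registers does SENSOR expose over SPI?": "WHO_AM_I, ACCEL_XOUT_H",
    }
    return canned.get(prompt, "I don't know")


def is_acceptable(answer):
    """Placeholder quality check: non-empty and not an admission of failure."""
    return bool(answer.strip()) and "don't know" not in answer.lower()


PROMPTS = {
    "bus": ["What buses does SENSOR use?"],
    "registers": [
        "List the SPI registers of SENSOR.",
        "Which registers does SENSOR expose over SPI?",
    ],
}

answers = extract_variables(fake_llm, PROMPTS, is_acceptable)
```

In this run the bus query succeeds immediately, while the register query fails three times with its original phrasing and succeeds once the paraphrase is tried. A real quality check would need to test specificity against the datasheet, which this placeholder does not do.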

Sensor Semantic RE Integration
Peripheral Communication Information Translation to SVD Format. The SVD file contains significant information about how the MCU communicates with potential onboard sensors. The file has separate blocks for every bus that the MCU communicates through, for all the potential registers, and for all of their potential register fields. In the past, reverse engineers analyzing an MCU's firmware would manually identify which registers were pertinent to their project. SensorLoader automates the process of identifying the relevant communication methods and overlaying that information onto the MCU's SVD file. The variables associated with the pinpointed buses, registers, and register fields automatically replace their placeholder counterparts in the file. This extracted information populates the corresponding generic fields and updates them with the correct bit size, bit width, and hexadecimal addresses in memory. The remaining potential communication methods are ignored and removed, streamlining the decompilation process and semantic analysis of the target MCU firmware. SVD files are therefore vital for extracting semantic meaning from the target firmware.
The SVD-Loader script is open-source and is particularly useful for our decompilation framework. SVD-Loader takes in a standard SVD file and converts each field into usable values for the decompilation process. On its own, SVD-Loader offers a baseline level of semantic information, overlaying all of the potential registers and register fields for the many different types of sensors that can be connected to the MCU. When SVD-Loader instead receives the augmented SVD file from SensorLoader, the information it loads becomes more specific and more usable. Instead of generic registers and register fields for a particular bus on potential onboard sensor devices, the file now contains sensor-specific communication information and omits the irrelevant sections of that bus. The augmented SVD file is then converted into readable values for the decompilation process, which narrows down and classifies the relevant registers and register fields of a particular communication bus. Together, SensorLoader, which overlays semantic information onto the SVD file, and SVD-Loader, which converts the information in the SVD file to usable data, provide the decompilation framework with comprehensible information about the communication methods.
Mapping semantics to software representations. We used Ghidra, an open-source reverse engineering tool developed by the NSA, to perform binary analysis on the target MCU firmware. The open-source nature of the PX4 drone firmware provided us with a transparent view, allowing us to predetermine functions and data values of interest and thereby making our tracking and analysis efforts easier. We were able to navigate to the driver (IMU) source code in the PX4 repository and find functions that correlated with the acceleration values we were interested in. For example, we could see how the x_accel value was processed in the function ProcessAccel() and also view which other functions called that function. To find these functions in Ghidra, we modified the firmware by injecting strings and logging comments into it, essentially leaving ourselves a breadcrumb trail to follow. Once we decompiled the modified firmware, we were able to quickly locate the trail by searching the Defined Strings. While our trail led us to some promising functions and potential data structures, we have not yet been able to establish solid ground truth. Had we been able to confirm our findings about the low-level representations, we would have added a block of code to the SVD-Loader script that replicates this manual process and overlays any useful extracted information, thereby automating our entire approach.
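The overlay step can be illustrated with a small sketch that annotates the registers identified as relevant and removes the rest from one peripheral block. It is a simplified approximation written for this explanation, not SensorLoader's implementation; the SVD fragment, the function name overlay, and the sensor description text are assumptions.

```python
import xml.etree.ElementTree as ET

# Illustrative SVD-style fragment with generic register descriptions.
GENERIC_SVD = """
<device>
  <peripherals>
    <peripheral>
      <name>SPI1</name>
      <registers>
        <register><name>CR1</name><description>control register 1</description></register>
        <register><name>DR</name><description>data register</description></register>
        <register><name>CRCPR</name><description>CRC polynomial register</description></register>
      </registers>
    </peripheral>
  </peripherals>
</device>
"""


def overlay(svd_xml, peripheral, keep):
    """Annotate registers listed in keep and drop all others.

    keep maps a register name to the sensor-specific description that
    should replace the generic one.
    """
    root = ET.fromstring(svd_xml)
    for periph in root.iter("peripheral"):
        if periph.findtext("name") != peripheral:
            continue
        registers = periph.find("registers")
        for reg in list(registers):  # copy, since we remove while iterating
            name = reg.findtext("name")
            if name in keep:
                desc = reg.find("description")
                if desc is None:
                    desc = ET.SubElement(reg, "description")
                desc.text = keep[name]
            else:
                registers.remove(reg)
    return ET.tostring(root, encoding="unicode")


augmented = overlay(
    GENERIC_SVD,
    "SPI1",
    {"DR": "SENSOR accelerometer samples read over SPI (illustrative)"},
)
```

The resulting document keeps only the annotated register, which mirrors the idea of replacing generic entries with sensor-specific ones and discarding entries that are not relevant to the attached device.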
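The breadcrumb search can be approximated outside of Ghidra with a short script that scans a firmware image for an injected marker and reports where each marked string starts. This is a hypothetical sketch: the marker prefix "SLDR:", the sample image bytes, and the function name are assumptions, and in practice the same lookup is done through Ghidra's Defined Strings view.

```python
def find_breadcrumbs(image: bytes, marker: bytes):
    """Return (offset, string) for every NUL-terminated string starting with marker."""
    hits = []
    start = image.find(marker)
    while start != -1:
        end = image.find(b"\x00", start)
        if end == -1:
            end = len(image)  # unterminated string at end of image
        hits.append((start, image[start:end].decode("ascii", errors="replace")))
        start = image.find(marker, start + 1)
    return hits


# Illustrative stand-in for a firmware image with two injected strings.
MARKER = b"SLDR:"
IMAGE = b"\x00\x01SLDR:ProcessAccel\x00\xff\xfeSLDR:x_accel\x00"

hits = find_breadcrumbs(IMAGE, MARKER)
```

Each reported offset is a starting point for following cross-references from the string to the code that uses it, which is how the breadcrumb trail leads to candidate functions.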

Results
The initial phase of information extraction from the sensor datasheets was a success, leading to the enhancement of the SVD file by populating some of its fields with more descriptive content, as shown in Figure 3. We tested the accuracy of this process by analyzing different sensor datasheets from different vendors. However, challenges arose when attempting to overlay the extracted information onto the lower-level binary representations. Precisely locating the appropriate regions within the decompiled binary for this overlay proved to be a complex task, and as such, this aspect of our objectives remains unfulfilled. Nonetheless, the progress made with SensorLoader in datasheet information extraction and SVD file augmentation underscores its potential and lays the groundwork for further refinements in subsequent iterations.

Limitations
In the current implementation, SensorLoader has only been evaluated on ARM-based microcontrollers whose SVD files are provided. The database used by SVD-Loader currently contains description files from several vendors of processors widely adopted in IoT devices. Thus, while the current iteration is limited to ARM processors, future work can generalize this approach to non-ARM platforms by leveraging LLMs to translate the associated datasheets to a target SVD representation supported by SVD-Loader.

EVALUATION
This section assesses SensorLoader's performance in extracting communication information from sensor datasheets, focusing on the communication between the STM32 microcontroller and five sensors for integration into the Pixhawk 3 Pro board. The five sensors used to gauge the accuracy are the ICM-20689, the IIM-426552, the

Iterative Query Approach
To evaluate SensorLoader's efficacy, we center our assessment on ten pertinent variables in the information overlay process. Each variable represents a specific aspect of communication. These variables are the device's name, the target bus for MCU communication, the target bus description, a list of registers for the target bus, a list of descriptions for each register, a list of hexadecimal addresses for each register, a list of register fields for the target register, a list of descriptions of each register field, a list of the bit offset addresses for each register field, and a list of the bit sizes for each register field.
The evaluation proceeds cyclically, iterating over the ten queries at least twice. If all variables are found after the first iteration over the queries, the second iteration begins with the same queries. If variables were not found after the first iteration, the corresponding queries are rephrased to fine-tune the extraction and retrieval process. Even though the initial query has enough descriptive language to prompt the model for the requested information, instances may arise where the model fails to provide a correct response. For instance, a query before rephrasing might be "How does [device's name] communicate to the system processor?" while the rephrased version could be "What are all of the ways [device's name] sends signals to the system processor?". Both queries seek the same information but differ syntactically, potentially leading to varying model responses.
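One way to represent the ten variables is as a single record, which makes it straightforward to see which ones are still unpopulated after a round of queries. The sketch below is an assumed representation for illustration; the class name, the field names, and the missing() helper are hypothetical and are not taken from SensorLoader's code.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class SensorSemantics:
    """The ten variables extracted per sensor (names are illustrative)."""
    device_name: str = ""
    target_bus: str = ""
    target_bus_description: str = ""
    registers: List[str] = field(default_factory=list)
    register_descriptions: List[str] = field(default_factory=list)
    register_addresses: List[int] = field(default_factory=list)
    register_fields: List[str] = field(default_factory=list)
    register_field_descriptions: List[str] = field(default_factory=list)
    register_field_bit_offsets: List[int] = field(default_factory=list)
    register_field_bit_sizes: List[int] = field(default_factory=list)

    def missing(self) -> List[str]:
        """Names of variables that are still empty and need another query."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example: only the first two variables have been extracted so far.
partial = SensorSemantics(device_name="SENSOR", target_bus="SPI")
still_needed = partial.missing()
```

The list returned by missing() corresponds to the queries that would be repeated or rephrased in the next iteration.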

Response Evaluation
In evaluating model responses, we categorize them as either "correct" (true positives) or "incorrect" (false negatives). A correct response contains the expected information that the query requests. An incorrect response fails to return information that is available in the datasheet. Combining the iterative query approach with the requirement of two correct responses per query is intended to ensure precise data extraction from sensor datasheets, and it provides the basis for evaluating SensorLoader's accuracy. The results for each target sensor are in the table below. The evaluation of the target sensor devices illustrates the accuracy of SensorLoader. While it has only been tested on a few datasheets, the relative accuracy of the model across the individual devices is in line with expectations. The BMP-390 had the most accurate information extraction, because it is the simplest sensor among the target devices and has the shortest datasheet. The discrepancy in accuracy among the remaining devices, all of which are IMUs, can be attributed to errors in prompting and/or the model's retrieval.
While this section assesses the model's accuracy in retrieving relevant information from a prompt, it does not evaluate SensorLoader's accuracy in completely overlaying information onto the SVD file. This evaluation serves as a representation of SensorLoader's ability to extract relevant communication information. To holistically understand the scalability of SensorLoader, the sample pool of datasheets needs to grow and become more diverse in order to yield better quantitative results. Further analysis will also help us formalize dependencies across registers, as some registers map directly to abstractions while others can be more nuanced.

DISCUSSION
We examined our initial research question, which focused on maximizing the utility of open-source peripheral data. We introduced SensorLoader to process sensor datasheets and extract relevant details, enriching our comprehension of sensor interactions with other peripherals. This effort aimed to support those in the reverse engineering domain and enhance the safety of IoT applications. Currently, SensorLoader relies on SVD files, which are primarily provided for ARM-based chips. However, SVD is simply a target representation format, and future work can analyze how to map additional chip providers to SVD representations. Future work can also focus on scaling SensorLoader to analyze multiple datasheets in a single run, as well as improving the approach to mapping semantic information to more complex software representations.

RELATED WORK
The domain of semantic reverse engineering (SRE) has seen substantial contributions that have shaped our framework, SensorLoader, which aims to bridge the semantic gap between peripherals and reverse engineering. A cornerstone among existing tools is the open-source script SVD-Loader [6], which can be integrated into reverse engineering platforms such as angr, BinaryNinja, or Ghidra, thereby enriching the memory maps with potential peripheral communication protocols and pins. This tool has been instrumental in formulating the methodology and implementation of SensorLoader. Moreover, its utility extends further, as seen in Shimware [2], which retrofitted monolithic firmware images with novel security measures; FANDEMIC [8], which investigated the adversarial injection of malicious firmware into power management integrated circuits within supply chains; and the study by Wouters et al. [12], which highlighted the susceptibility of Texas Instruments SimpleLink MCUs to adversarial attacks. While these works have undertaken efforts toward automating segments of the reverse engineering process, none have leveraged AI techniques to aid in the automation of peripheral semantic extraction and integration. Our initiative strives to address this gap by not only instantiating the assumptions regarding peripheral information but also exploring AI-driven methodologies to automate the semantic reverse engineering process.

CONCLUSION
Our research introduces the SensorLoader framework, which automatically extracts semantic information from sensor datasheets, thereby facilitating a more nuanced understanding of potential vulnerabilities within embedded systems. By focusing on bridging the gap between peripherals' datasheets and cyber-physical reverse engineering, we present an approach to accelerate the efforts of reverse engineers. Their work, essential for locating and rectifying cyber-physical vulnerabilities, can now be conducted with a greater degree of precision and efficiency. This advancement not only streamlines the reverse engineering process but also enhances the overall security posture of embedded devices. The initial prototype serves as a step towards further innovations in automating semantic reverse engineering of cyber-physical systems.

Acknowledgements
This research was done through the University of Southern California, Information Sciences Institute's Research Experiences for Undergraduates (REU) Site "SURF-I: Safe, Usable, Reliable and Fair Internet". We thank the National Science Foundation for supporting the REU through NSF grant award #2051101, as well as Award #2220312. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NSF or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

Figure 1: Sensor abstraction layers propagated across connected devices on a CPS.

Figure 3: Results of overlaying extracted information onto the System View Description file.