Tiny-Impute: A Framework for On-device Data Quality Validation, Hybrid Anomaly Detection, and Data Imputation at the Edge

In the landscape of Internet of Things (IoT) systems, data quality degradation can occur continuously for several reasons, such as sensor malfunctions, intermittent network availability, device maintenance, or incomplete data collection. This paper proposes three efficient data quality validation and imputation algorithms that identify and replace noisy and missing values with better-quality data. We extensively stress-tested and evaluated our algorithms by deploying them on microcontrollers and small CPU-based IoT boards with memory limited to as little as 32KB. We simulated a sensor data stream using five real-world datasets, including data collected from the Newcastle Urban Observatory. In this setup, each algorithm excelled in different areas, consistently demonstrating high performance across 100 samples in terms of energy efficiency (0.014 J), computation time (85.94 ms), and error rates (0.0019 MAE, 0.0027 RMSE). Remarkably, we found that, on average, our algorithms running on hardware costing less than $10 showed performance on par with state-of-the-art methods on high-end devices. The results also demonstrated that our algorithms enabled on-device cleansing of live streaming sensor data, eliminating the dependency on cloud services and allowing for real-time data quality validation and processing at the edge.


INTRODUCTION
In legacy IoT systems, devices send data to the computational clouds where dedicated Machine Learning (ML) models provide insight. However, due to the advancements in edge computing and Tiny Machine Learning (TinyML) algorithms, data analysis has moved closer to the source of data, on the network edge, and directly to low-power tiny devices [23]. This fosters the proliferation of smart IoT devices as it reduces not only the network throughput but also processing latency while increasing data privacy [19].
A substantial portion of the ML process is dedicated to data preparation [27]. ML models are susceptible to the quality of data fed to their inputs, as it directly impacts their performance and might lead to inaccurate predictions, decreasing interpretability and reducing trust in ML-driven systems [30]. Therefore, data quality is critically important in machine learning, leading organizations to invest in data governance, data quality assessment, and data preprocessing techniques [21]. This includes addressing the problems of missing values, outliers, duplicate records, and biases in data. The problem of assuring the quality of data fed to ML models deployed on the edge and low-power devices is more complex due to the limited maintenance and observability of the devices resulting from their limited computational and networking capabilities [7]. Therefore, there is a need for efficient data validation and preprocessing algorithms targeting resource-constrained devices. In IoT systems, data quality loss can occur for several reasons, such as incorrect data entry, sensor malfunctions, intermittent network availability, device maintenance, or incomplete data collection. Our experience with real-world data from the Urban Observatory [5] shows that it is common for datasets collected from real-world edge computing systems to have missing data points and sequences and sporadic erroneous sensor readings.
In this paper, we propose the Tiny-Impute framework, which contains efficient data validation and imputation algorithms. As depicted in Figure 1, our algorithms identify and replace noisy values with better-quality data. Data imputation requires estimating or predicting missing or erroneous values in a dataset. It is a crucial step in pre-processing and cleaning data, as it can significantly affect the accuracy and reliability of the results, analysis, or ML model produced. Although imputed values are educated estimates, they should still be used cautiously because they could introduce bias or inaccuracies into the analysis. Proper validation and evaluation of imputation techniques are required to guarantee the validity and reliability of imputed data.
Figure 2 presents the usage of our proposed Tiny-Impute framework. Here, the red-arrow path delineates the conventional process of transferring data to the cloud for data cleaning and decisions. In contrast, the green-arrow path represents our approach, providing immediate access to high-quality data in real time and obviating the necessity for cloud services. The main contributions of this paper are summarized as follows:
• This is a novel study that performs TinyML-based data imputation at the embedded system level and demonstrates real-time (in milliseconds) and high-performance (low RMSE, MAE) data imputation on a ≈$3 ESP32 chipset.
• Our framework provides three sophisticated yet resource-friendly algorithms that can execute on MCU and small CPU-based edge devices. The algorithms enable on-device cleansing of streaming sensor data, eliminating the dependency on cloud services.
• When the data cleaned by our novel algorithms are used as input for popular edge AI models from various model zoos and hubs, they show superior inference performance with higher metric scores.
• Our algorithms are statistical and unsupervised ML implementations (unlike NN approaches), which means users can deploy them off-the-shelf on edge devices, without the need for inference engines or third-party embedded ML frameworks like TensorFlow Lite Micro.
• Our algorithms are made freely available for users to reproduce our results on a wider range of IoT development boards and datasets.
Paper Outline. The subsequent sections of this paper are organized as follows: in Section 2, we provide an extensive overview of the state-of-the-art approaches in the field. Section 3 outlines the primary contributions of this paper, specifically focusing on data quality validation and our proposed trio of imputation algorithms. Section 4 is dedicated to the evaluation of these algorithms, including a comparative analysis of the results. Finally, Section 5 serves as the conclusion of this paper.

RELATED WORK
Since our framework enables data quality validation and data imputation on MCU-based tiny edge devices, for comprehensiveness, our review consists of the two following subsections:

Data Quality Validation in Edge Computing
The research landscape of data collection and quality was studied in [27], particularly in Deep Learning (DL). They emphasized the growing importance of data collection due to the increased reliance on large datasets in modern DL, reducing the need for extensive feature engineering. A framework was applied in [13] to 92 real-world water quality datasets from various hardware setups. It automated data quality assurance in three steps: rule-based assessments for each sensor, cross-correlation for spatiotemporal relationships, and multi-sensor data quality validation, bridging the gap between data and information. What sets our approach apart is its applicability fully at the edge level, aligning with the demands of real-time processing and resource-constrained environments. Our approach can be used to clean and build a high-quality sensor-based dataset for DL model training and also pass the cleaned data in real time for edge AI inference.
In [28], the authors constructed ontologies to validate data quality through class-based constraint definitions, reducing rule duplication and lowering labor costs. Their method incorporated semantic inference, improving logical reasoning to detect data anomalies and enhancing completeness and correctness in validation. The authors of [8] assessed the data quality of probe measurements. Their data validation included comparisons with models, other missions, and ground observations. They also continuously upgraded swarm products based on quality control and user feedback. The work in [1] calculated error detection costs in IoT edge devices based on a constructed energy dissipation model. This model was used to evaluate the impact of various data validation methods on edge device energy consumption. Their analysis emphasized the crucial role of choosing the right data validation scheme at the edge level. The work presented in [2] involved using VSNL, a real-time algorithm to validate sensor data at the node level within IoT/WSN sensor networks. VSNL employed adaptive thresholding techniques to identify and differentiate various types of errors in sensor data from authentic events. Similar to these use cases [1,2,8,28], our approach to data quality validation is adaptable to different use cases and industries. Whether in healthcare, environmental monitoring, or industrial settings, our four data quality validation stages can be customized to suit the specific requirements of each domain.
Study [25] analyzed sensor data quality, emphasizing the detection and correction of common errors like outliers, missing data, bias, drift, repeated values, uncertainty, and stuck-at-zero readings. They highlighted the importance of validating outliers and addressing various error sources using additional sensors. In [24], a rule-based approach was developed to assess sensor data quality by examining features such as data spikes and missing values. This research primarily focused on evaluating sensor data quality, which then needs to be followed by data imputation when dealing with removed or missing data. The issue of detecting outliers in training data was addressed in [15] by introducing a statistical training-data cleaning method for principal component analysis-based sensor fault detection. This approach resulted in an enhancement of data quality and a notable improvement in the accuracy of outlier detection. Finally, in [20], the OCPCC model was used for real-time outlier detection; however, the study recognized the energy consumption associated with the training phase and addressed only outliers within the training data. In comparison to the above, the uniqueness of our approach is its holistic perspective, which not only detects and handles outliers but also encompasses a broader spectrum of data quality concerns by examining various other factors.

Data Imputation in Edge Computing
In [19], supervised K-Nearest Neighbor (KNN) was used on edge devices to impute missing data in cases with different missing mechanisms. After imputing, they applied classification with Naive Bayes to analyze the accuracy of handling missing data. In [22], four state-of-the-art benchmarks were utilized to optimize and evaluate appropriate data points for imputing missing values. Their imputation techniques were tested on medical datasets from hospitals, attempting to classify different illnesses according to the imputed datasets. FedTMI, presented in [29], is a federated transfer learning imputation technique tailored for edge-cloud imputation in factory settings. FedTMI enhances the target edge model by transferring knowledge from selected helper models alongside its own edge data. An estimation and simulation strategy was presented in [26] to handle missing network data from 14 schools. This study addressed various types of missingness due to the complex study design and introduced practical solutions. In contrast to these studies, this work is the first to propose a hybrid approach that autonomously detects data anomalies and missingness, which are then imputed using our framework algorithms. Also, this work is the first to study this entire process directly on the most resource-constrained MCU and small CPU-based edge devices. Various TinyML-based anomaly detection studies on MCUs [10] exist for on-device inference, but not yet for imputation.
Rigorous analysis was conducted in [4] to determine suitable imputation strategies for continuous and categorical variables. In their investigation, missForest and KNN emerged as particularly apt choices for imputing data, exhibiting effectiveness across diverse scenarios. Another comprehensive study conducted by [16] evaluated various imputation techniques across five numerical datasets. Their assessment revealed that the KNN-based imputation method demonstrated superior performance to six alternative approaches, such as mean and median imputation, as well as more sophisticated techniques including predictive mean matching, Bayesian linear regression, non-Bayesian linear regression, and random imputation methods. One of our imputers is designed with our highly optimized version of unsupervised KNN as its base, given that the effectiveness of KNN has already been demonstrated in various studies [4,16,19].
The authors of [3] transformed time series data into images and applied subsequent imputation using their pix2pix conditional generative adversarial network (cGAN). The work in [14] outlined a novel univariate imputation method which combines decomposition and imputation techniques. The authors decompose time series into seasonal, trend, and residual components, using support vector regression for trend and residual imputation and self-similarity decomposition for seasonal components. Lastly, in [7], large gaps of missing values were simultaneously imputed by identifying analogous sub-sequences adjacent to the gaps and filling them with the closest matching sub-sequence. They employed a Dynamic Time Warping algorithm for sub-sequence comparison and integrated a shape-feature extraction algorithm to improve results. Unlike these and other sophisticated approaches, our algorithms are self-contained statistical and unsupervised machine learning designs (eliminating compute-expensive network layers, libraries, etc.), which makes them most suitable for low-power and low-resource hardware platforms.

TINY-IMPUTE FRAMEWORK DESIGN
Figure 3 presents this work's proposed two-phase framework. First, in Phase 1 (Section 3.1), data quality validation measures are introduced to ensure the quality of the data. Based on these measures, a decision will be made regarding whether data imputation is needed. Afterwards, in Phase 2 (Section 3.2), additional anomaly detection and data imputation are performed, with three different imputation algorithms provided in this work. Finally, after successfully completing both phases, users can immediately use the cleaned and high-quality data produced by the framework to train their ML models at the edge and apply them for accurate inference. The following sections provide detailed descriptions of each phase.

Phase 1: Data Quality Validation
In this phase, four stages are implemented to validate the data batch received at the edge device, which is collected directly from the sensors. The results of this phase will be used to determine whether Phase 2, data imputation, is necessary, depending on the data's quality and cleanliness during this phase.

Stage 1: Data Acquisition and Regularity. In the edge computing landscape, data must be generated and processed as close as possible to the location where it's needed, stemming from the need to minimize latency. Given this context, the process of data acquisition becomes even more intricate. Consistency and regularity in data acquisition are pivotal, especially in AI and ML-driven tasks.
Irregular data, such as those with missing values or incomplete records, significantly inhibit on-the-edge computations and predictions.If unaddressed, these data anomalies introduce potential errors and analytical biases.To combat this, this stage proposes that the acquisition process be meticulously validated for the uniformity of data recording intervals, such as sampling data at consistent time intervals and verifying against it.This validation acts as a checkpoint to diagnose disruptions from the communication channels or the edge-based data acquisition systems, like compromised sensors or timing discrepancies due to localized processing.
Stage 2: Value Range and Data Scaling. Data scaling, especially in edge environments, should be approached with precision, given the constrained computational resources often present on the edge, and especially on TinyML boards. Inaccurate scaling could inadvertently lead to data clusters that do not genuinely reflect the underlying patterns, thereby producing misleading insights. The challenge is magnified on the edge due to potential variations in data inputs from diverse edge devices. This stage detects noisy or meaningless data while ensuring that data fields adhere to their predefined measurement scales and types. As an example, edge-based parameters like CO level readings should be free from anomalies, such as negative values or alphanumeric combinations, considering the real-time processing requirements of edge systems.

Stage 3: Trend Analysis in Time Series Data. In the dynamic world of edge computing, real-time insights often depend on the seamless interpretation of time series data. Detecting genuine data trend shifts as opposed to temporary fluctuations becomes paramount. Edge devices, being closer to the source sensors generating data, have the potential to capture micro-fluctuations, which makes this validation even more significant. Ensuring that changes in data trends align with realistic expectations is crucial, which is what this stage does. An example to consider would be edge devices tracking vehicle speeds on roads. Utilizing the real-time processing capabilities of the edge, sudden accelerations or other irregularities should be immediately recognized.

Stage 4: Data Consistency Over Time. Temporal consistency, especially in edge environments, is crucial for the continuous and reliable functioning of localized systems. Given that edge devices often operate in diverse environments with varying external factors, data inconsistencies might emerge more frequently than in centralized systems. As edge computing prioritizes real-time responses, inconsistencies in data and data windows, if left unchecked, could lead to misguided instantaneous decisions. Edge-driven solutions should incorporate algorithms and checkpoints to identify and rectify shifts from pure time series data to those that could be adulterated with cross-sectional or pooled datasets, ensuring that real-time decision-making remains informed and accurate. During this stage, data streams are rigorously examined to ensure they maintain their expected temporal structure and that there are no deviations from data gathered at regular intervals.
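To make the four stages concrete, the following is a minimal, illustrative Python sketch of a Phase-1 validator. The function name, sampling period, tolerances, and value range are hypothetical placeholders rather than part of the framework; it simply mirrors the checks described above for a univariate stream of (timestamp, value) pairs.

```python
def validate_batch(batch, period_s=60, tol_s=5, lo=0.0, hi=50.0, max_jump=10.0):
    """Return a list of issues found; an empty list means the batch passes Phase 1."""
    issues = []
    # Stage 1: acquisition regularity -- sampling intervals must stay near the nominal period.
    for (t0, _), (t1, _) in zip(batch, batch[1:]):
        if abs((t1 - t0) - period_s) > tol_s:
            issues.append(("irregular_interval", t1))
    # Stage 2: value range and type -- e.g. CO readings must be numeric and within bounds.
    for t, v in batch:
        if not isinstance(v, (int, float)) or not (lo <= v <= hi):
            issues.append(("out_of_range", t))
    # Stage 3: trend analysis -- flag jumps larger than a plausible step change.
    for (_, v0), (t1, v1) in zip(batch, batch[1:]):
        if abs(v1 - v0) > max_jump:
            issues.append(("implausible_trend_shift", t1))
    # Stage 4: temporal consistency -- timestamps must be strictly increasing.
    if any(t1 <= t0 for (t0, _), (t1, _) in zip(batch, batch[1:])):
        issues.append(("non_monotonic_time", None))
    return issues
```

In this sketch, a batch that returns an empty issue list passes Phase 1; otherwise Phase 2 anomaly detection and imputation would be triggered.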

Phase 2: Hybrid Anomaly Detection and Data Imputation
This section presents the design of our three hybrid anomaly detection and data imputation algorithms.

Moving Average with Simple Linear Regression (MA-SLR).
This algorithm is designed for MCU and small CPU devices (like Arduino boards), considering their hardware limitations.
In this algorithm, we developed and employed a hybrid system that seamlessly integrates moving averages with Z-score thresholding to accurately pinpoint and remove anomalous data points within the dataset $D$. This is further augmented by utilizing a modified linear regression method for data imputation.
The algorithm commences by establishing the length $N$ of the dataset $D$. This crucial initialization step sets the stage for subsequent iterations. After initializing, we proceed to calculate the moving average for each $w$-sized window in $D$. This is mathematically expressed as:

$$MA[i] = \frac{1}{w} \sum_{j=i}^{i+w-1} D[j]$$

The moving average forms the backbone of the outlier detection mechanism, providing a reliable point of reference for each window.

Next, we calculate the variance $\sigma^2[i]$ for each window, employing the following formula:

$$\sigma^2[i] = \frac{1}{w} \sum_{j=i}^{i+w-1} \left(D[j] - MA[i]\right)^2$$

Subsequently, we derive the standard deviation $\sigma[i]$ by computing the square root of $\sigma^2[i]$. With the moving average and standard deviation in hand, we delve into the core of our outlier detection mechanism. For each data point $D[i+w-1]$ within the window, we invoke the following criterion for outlier identification:

$$\left|D[i+w-1] - MA[i]\right| > z \cdot \sigma[i]$$

Should the left-hand side exceed the right-hand side and $\sigma[i]$ not be zero, the data point $D[i+w-1]$ is flagged as an outlier, its index is stored, and its value is reset to None in $D$.

Post-outlier detection, we segment $D$ into known data points and missing values, the latter including the outliers we have identified. For each segment consisting of known data points, we ascertain the linear regression coefficients $\beta_0$ and $\beta_1$. To calculate them we use the known data points $x_\mathrm{known}$ and $y_\mathrm{known}$, which represent the x and y coordinates of the known data points respectively. Firstly, $n$ is set to the number of known data points, which is len($x_\mathrm{known}$). The means $\bar{x}$ and $\bar{y}$ for the $x$ and $y$ coordinates are calculated as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

The sum-of-squares terms $ss_{xx}$ and $ss_{xy}$ for both coordinates are then calculated using the following formulas:

$$ss_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad ss_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Finally, the coefficients $\beta_1$ and $\beta_0$ are calculated as:

$$\beta_1 = \frac{ss_{xy}}{ss_{xx}}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$

These coefficients serve to impute each missing value in $D$ at index $m$ through the equation:

$$D[m] = \beta_0 + \beta_1 m$$

After all data is imputed, we obtain the final dataset in which outliers have been supplanted and missing data has been filled via linear regression-based imputation. Through the application of this multifaceted approach, we enhance the reliability and hence the utility of the dataset, preparing it for more rigorous analytical processes. This MA-SLR design is summarized in Algorithm 1.

Algorithm 1: Moving Average with Simple Linear Regression (MA-SLR) for anomaly detection and data imputation.
MA-SLR time complexity: reading data is $O(N)$, while moving average calculation, standard deviation computation, and outlier detection are $O(N \cdot w)$, where $w$ is the window size for anomaly detection. Similarly, SLR imputation involves known data extraction, linear regression coefficient computation, and data imputation, each with a time complexity of $O(N)$. So, the overall time complexity of the code is approximately $O(N \cdot w)$.
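For illustration, the following is a minimal Python sketch of the MA-SLR logic described above; it is not the MCU implementation, and the default parameters (w = 5, z = 3.0) are assumptions for the example.

```python
import math

def ma_slr_impute(d, w=5, z=3.0):
    """Sketch of MA-SLR: moving-average/z-score outlier removal, then
    simple-linear-regression imputation over the remaining known points."""
    d = list(d)
    n = len(d)
    # Outlier detection: flag the trailing point of each window if it deviates
    # more than z standard deviations from the window's moving average.
    for i in range(n - w + 1):
        window = [x for x in d[i:i + w] if x is not None]
        if len(window) < 2:
            continue
        ma = sum(window) / len(window)
        sd = math.sqrt(sum((x - ma) ** 2 for x in window) / len(window))
        x = d[i + w - 1]
        if x is not None and sd != 0 and abs(x - ma) > z * sd:
            d[i + w - 1] = None  # treat the outlier as missing
    # Imputation: fit y = b0 + b1*x over known points, predict the missing ones.
    xs = [i for i, v in enumerate(d) if v is not None]
    if not xs:
        return d
    ys = [d[i] for i in xs]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    ss_xx = sum((xi - x_bar) ** 2 for xi in xs)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b1 = ss_xy / ss_xx if ss_xx else 0.0
    b0 = y_bar - b1 * x_bar
    return [b0 + b1 * i if v is None else v for i, v in enumerate(d)]
```

Note that, as in the description above, only the trailing point of each window is tested against the criterion, so the first w − 1 points of a batch are never flagged in this sketch.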
Algorithm 2: K-Nearest Neighbors with Expectation-Maximization (KNN-EM) for anomaly detection and data imputation.

K-Nearest Neighbors with Expectation-Maximization (KNN-EM).
This algorithm is designed for edge devices (like gateways, AIoT boards, SBCs) with processing and memory capabilities higher than MCUs, making it more suitable for processing large data batches. The design of this algorithm combines our highly optimized unsupervised KNN and EM for anomaly detection and data imputation, respectively. The algorithm's objective is to identify and remove anomalies in a given dataset and impute the missing values. This two-step process leverages the strengths of both the KNN and EM methodologies, providing a robust solution for data pre-processing.
KNN-EM produces an imputed dataset, $I$, devoid of anomalies and enriched with imputed values. The algorithm commences with the initialization of an empty list $A$ (list of detected anomalies). As we traverse each data point $x_i$ within the input data batch $D$ (data batch), we initialize an empty list denoted as distances. This list is designed to temporarily store distances between the iterating data point $x_i$ and all other data points $x_j$ in $D$. Each distance is calculated using the formula:

$$d(x_i, x_j) = \sqrt{(x_i - x_j)^2}$$

Upon completion of each inner loop over $x_j$, distances will contain the distances between $x_i$ and all $x_j$ where $j \neq i$. The list distances is then sorted in ascending order. From the sorted list, we select the $k$-th smallest distance, where $k$ is the predefined number of nearest neighbors. If this $k$-th smallest distance exceeds a threshold $T$ (distance threshold for outlier detection), $x_i$ is classified as an anomaly and is added to $A$.

The next segment of the algorithm is aimed at data imputation. Initially, we define a list $D_M$ (data with missingness) by excluding all detected anomalies $A$ (list of detected anomalies) from $D$: $D_M$ is constructed by iterating over each data point $x_i$ in $D$. If $x_i$ is found in the list $A$, it is replaced with the value None, which signifies that it is removed and will be imputed. Otherwise, $x_i$ is taken as-is. This operation can be formalized as:

$$D_M[i] = \begin{cases} \text{None} & \text{if } x_i \in A \\ x_i & \text{otherwise} \end{cases}$$

From $D_M$, another list $D_{NM}$ (non-missing data) is derived, which consists of all elements in $D_M$ excluding None. The mean ($\mu$) and variance ($\sigma^2$) are calculated as:

$$\mu = \frac{1}{|D_{NM}|} \sum_{x \in D_{NM}} x, \qquad \sigma^2 = \frac{1}{|D_{NM}|} \sum_{x \in D_{NM}} (x - \mu)^2$$

These estimates serve as the initial parameters for the ensuing Expectation-Maximization (EM) iterations. The algorithm conducts $E$ (number of EM iterations) cycles of EM iterations. For each iteration, a list $\hat{D}$ (estimated data) is formulated. This list contains elements from $D_M$, but replaces occurrences of None with the current $\mu$. Subsequently, both $\mu$ and $\sigma^2$ are recalibrated based on $\hat{D}$ using the formula:

$$\mu = \frac{1}{|\hat{D}|} \sum_{x \in \hat{D}} x, \qquad \sigma^2 = \frac{1}{|\hat{D}|} \sum_{x \in \hat{D}} (x - \mu)^2$$

Following the $E$ iterations, the final dataset $I$ (imputed dataset) is constituted by replacing None values in $D_M$ with the last calculated $\mu$. The algorithm eventually returns $I$, a refined dataset purged of anomalies and missing values, optimized for further analytical procedures.
By adhering to these methodical steps, our hybrid algorithm accomplishes the dual objectives of anomaly detection and data imputation with computational efficiency and statistical rigor. This KNN-EM design is summarized in Algorithm 2.
KNN-EM time complexity: data reading is $O(N)$, the KNN stage involves data sorting at $O(N \log N)$ and Euclidean distance calculations at $O(N^2)$, outlier removal is $O(N)$, and the EM imputation has mean and variance computations at $O(N)$ and iterative updates at $O(N \cdot E)$, where $N$ is the dataset size and $E$ is the number of EM iterations. The dominant factors impacting runtime are the sorting and KNN distance calculations, resulting in an overall time complexity of $O(N^2 \cdot \log N)$.
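A minimal Python sketch of the KNN-EM steps above, assuming univariate samples so that the Euclidean distance reduces to an absolute difference; the parameter defaults (k = 3, t = 2.0, e = 10) are illustrative only.

```python
def knn_em_impute(d, k=3, t=2.0, e=10):
    """Sketch of KNN-EM: distance-based anomaly detection, then EM-style
    mean/variance refinement used to fill the removed and missing values."""
    # Anomaly detection: a point is an outlier if its k-th nearest neighbour
    # (by absolute distance) lies farther away than the threshold t.
    anomalies = set()
    for i, xi in enumerate(d):
        if xi is None:
            continue
        dists = sorted(abs(xi - xj) for j, xj in enumerate(d)
                       if j != i and xj is not None)
        if len(dists) >= k and dists[k - 1] > t:
            anomalies.add(i)
    dm = [None if (i in anomalies or x is None) else x for i, x in enumerate(d)]
    known = [x for x in dm if x is not None]
    if not known:
        return dm
    mu = sum(known) / len(known)
    var = sum((x - mu) ** 2 for x in known) / len(known)
    # EM iterations: fill gaps with the current mean, then re-estimate mean/variance.
    # The variance is tracked to mirror the description, though only mu drives the fill.
    for _ in range(e):
        est = [mu if x is None else x for x in dm]
        mu = sum(est) / len(est)
        var = sum((x - mu) ** 2 for x in est) / len(est)
    return [mu if x is None else x for x in dm]
```

Because every point is compared against every other point, the quadratic distance step dominates, matching the complexity analysis above.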

Optimized Laplacian Convolutional Representation (LCR-Opt).
Here, we deeply modify and optimize a top-performing but resource-hungry method, Laplacian Convolutional Representation (LCR) [9], that imputes missing data using a low-rank approximation model complemented by regularization techniques. Our optimized version of LCR is LCR-Opt, which shows the same performance characteristics as LCR but can comfortably execute on MCUs and small CPUs.

Algorithm 3: Optimized Laplacian Convolutional Representation (LCR-Opt) for anomaly detection and data imputation.

Initialization and data preprocessing are the first steps. The algorithm initiates by scanning the input batch $Y$ to identify missing data, represented by $Y_0$. Additionally, auxiliary matrices are computed based on $Y$: one records its size (shape), and the other two are initialized as copies of it. We then move to the domain of Laplacian transformation and Fourier transformation. The Laplacian matrix is padded to the nearest power of 2, denoted as $\ell_{pad}$. This is an optimization step to make the Fast Fourier Transform (FFT) computations more efficient. Subsequently, we find the FFT of the padded Laplacian $\ell_{pad}$ using:

$$\hat{\ell}[k] = \sum_{n=0}^{N-1} \ell_{pad}[n] \, e^{-i 2\pi k n / N}$$

which can be summarized for $\ell_{pad}$ as $\hat{\ell} = \mathcal{F}(\ell_{pad})$. The iterative optimization loop represents the core of the algorithm.
Here, we iterate from 1 to the maximum number of iterations $M$. During each iteration, the FFT is used to isolate the real ($\Re$) and imaginary ($\Im$) components of the data. Transformations are applied to these components, factoring in the regularization parameters and proximity mappings. After these transformations, we employ the Inverse FFT (IFFT) to convert back to the spatial domain. The IFFT can be formally expressed as:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, e^{i 2\pi k n / N}$$

which for our work we can summarize as $x = \mathcal{F}^{-1}(X)$. The real values are truncated to an appropriate length, leading to the determination of the proximal mapping $z$ from the current iterate, the Laplacian, and the regularization parameters. After that, the values used in the next iteration are refreshed by calculating the step size $\eta$ and updating the estimate over the observed (training) entries, iterating the index from 1 to len($z$) − 1.

Afterwards, the remaining auxiliary variable is updated in turn. Upon completion of the iterations, the imputed data batch is assembled, replacing missing values in $Y_0$ with their imputed counterparts. The imputed data matrix is returned as the primary output.
In conclusion, LCR-Opt offers a rigorous and computationally efficient methodology for high-quality data imputation. By employing low-rank approximations along with regularization techniques and leveraging the FFT and IFFT, the algorithm achieves computational efficiency and data accuracy. This LCR-Opt design is summarized in Algorithm 3.
LCR-Opt time complexity: data reading is $O(N)$, and the algorithm iterates for a maximum of $M$ iterations, with each iteration costing $O(N \cdot \log N)$ for the FFT/IFFT, where $N$ is the length of the input data vector. The FFT is the primary factor affecting time complexity, resulting in an overall complexity of $O(M \cdot N \cdot \log N)$.
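The exact LCR-Opt update rules are given in Algorithm 3; as a hedged illustration of just the FFT-related optimization mentioned above (padding the Laplacian kernel to a power-of-two length before transforming it), a small NumPy sketch could look as follows. The kernel values and the function name are placeholders, not the framework's API.

```python
import numpy as np

def fft_of_padded_laplacian(kernel, data_len):
    """Zero-pad the Laplacian kernel to the next power of two that covers the
    data length, then return its FFT together with the padded length."""
    n = 1 << (max(len(kernel), data_len) - 1).bit_length()  # next power of two
    padded = np.zeros(n)
    padded[:len(kernel)] = kernel
    return np.fft.fft(padded), n

# Example: an illustrative kernel padded for a batch of 100 points (n becomes 128).
ell_hat, n = fft_of_padded_laplacian(np.array([-2.0, 1.0, 1.0]), 100)
```

Operating on a power-of-two length keeps the per-iteration FFT/IFFT cost at the $O(N \log N)$ figure used in the complexity estimate above.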

FRAMEWORK EVALUATION
This section covers a comprehensive evaluation of the proposed algorithms, using different devices and datasets summarized in Table 1. For statistical validation, the reported results correspond to the average of 10 runs. This evaluation aims to answer:
• Can the proposed algorithms run on MCUs and small CPU-based IoT development boards?
• What is the minimum Flash and SRAM space required for the proposed algorithms?
• How does the outlier detection and imputation performance of our algorithms vary across datasets and hardware types?
• How much time and energy are consumed to perform imputation on the selected devices?
• What is the impact on imputation accuracy when running the algorithms on resource-limited hardware in comparison to running on standard laptops?

Setup
This section presents the loss mask, evaluation metrics, and algorithm parameters setup. Further experimentation details are provided in the Tiny-Impute repository.

4.1.1 Loss Mask. To assess the robustness and efficiency of the algorithms, we introduce controlled missingness to datasets using our custom-designed loss mask. This method systematically injects missing values into a data batch $D$, simulating real-world data incompleteness for evaluation. We start by taking two inputs: $D$ (data batch) and $P$ (missingness percentage), determining $m$ (number of missing values) as $m = \frac{|D| \times P}{100}$. $m$ distinct values from $D$ are replaced with None, tracked in MI (Missing Indices). $D$ is modified accordingly, resulting in $D'$ with $m$ missing elements.
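A minimal Python sketch of this loss mask, assuming a uniform-random choice of indices; the seed argument is an illustrative addition for reproducibility, not a detail stated in the paper.

```python
import random

def apply_loss_mask(d, p, seed=0):
    """Inject p% missingness into data batch d, returning the masked copy and
    the missing indices (MI) that were replaced with None."""
    rng = random.Random(seed)
    m = int(len(d) * p / 100)          # number of values to remove
    mi = rng.sample(range(len(d)), m)  # distinct missing indices
    d_masked = list(d)
    for i in mi:
        d_masked[i] = None
    return d_masked, mi
```

For example, apply_loss_mask(batch, 10) removes 10% of the values and returns their indices so the imputed results can later be scored against the originals.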

4.1.2 Evaluation Metrics.
To assess the quality of imputed data in comparison to the original data, we compute the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). We take two sets of data as input: the original data O and the imputed data I (results produced by our algorithms), and return the calculated metrics as indicators of the algorithm's imputation performance.

4.1.3 Algorithm Parameters. All three algorithms have minimal tuneable parameters impacting their behaviors, which we have set with default values stemming from the best performances during our test trials. In MA-SLR, $w$ (window size) controls the smoothness of the moving average, with larger values filtering out short-term fluctuations; $z$ (z-score threshold) sets the outlier detection threshold, with higher values being less sensitive to deviations and lower values capturing smaller fluctuations. In KNN-EM, a smaller $k$ (number of neighbors) increases sensitivity but can add noise, while a larger $k$ yields smoother decisions but may miss details; $T$ (distance threshold) defines anomaly detection, and $E$ (EM iterations) refines the EM estimates, balancing precision and computation time. In LCR-Opt, the Laplacian kernel size controls spatial smoothness, with smaller values limiting neighbor influence; the Laplacian weight determines the constraint strength, with higher values increasing smoothness; the regularization parameter balances data fitting and simplicity, with higher values favoring simplicity; and the maximum iteration count $M$ manages computation time.
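Returning to the metrics of Section 4.1.2, the following sketch computes RMSE and MAE; it assumes the errors are evaluated only over the masked indices, which is an interpretation rather than a detail stated in the paper.

```python
import math

def imputation_errors(original, imputed, missing_idx):
    """RMSE and MAE computed over the indices that were masked and then imputed."""
    diffs = [original[i] - imputed[i] for i in missing_idx]
    mae = sum(abs(e) for e in diffs) / len(diffs)
    rmse = math.sqrt(sum(e * e for e in diffs) / len(diffs))
    return rmse, mae
```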

Time Efficiency
Figure 4 provides a detailed performance view of the algorithms across the various datasets and boards. Results analysis: (i) On both TinyML boards, MA-SLR consistently had the shortest execution times, highlighting its efficiency. KNN-EM took more time, particularly with larger batches, suggesting MA-SLR to be the most time-efficient choice; (ii) LCR-Opt consistently takes ≈0.001 ms across datasets, but due to the variable processes in Windows 10, time consumption does not scale linearly with batch size; (iii) ESP32 is on average ≈7 ms faster than the MKR1000. Execution time reduced as we switched to the Raspberry Pi and Laptop: the former achieved times of ≈1 ms while the latter achieved nanosecond-level imputation, across all algorithms; (iv) LCR-Opt's execution times varied but were generally 15 times higher than those of the other algorithms.

For brevity, in Figure 4, we omitted the MKR1000 results, which were consistently ≈10% slower than the ESP32. Also, LCR-Opt's execution times are unavailable for various batch sizes on the ESP32 and MKR1000 due to SRAM overflows. We also exclude MA-SLR and KNN-EM on a laptop since they execute in nanoseconds for all batch sizes.
The interesting finding is that MA-SLR completes its tasks faster on the ESP32 than LCR-Opt on the Raspberry Pi. This implies that despite the ESP32's hardware limitations, our streamlined MA-SLR design achieves computational efficiency comparable to the high-end Raspberry Pi. So, in Figure 5, we analyze the MA-SLR results in detail: (i) On the Raspberry Pi, MA-SLR showed execution times of just 0.3 ms for the Gesture Phase Segmentation and 0.28 ms for the Iris Flowers datasets. These sub-millisecond results attest to the achievable hyper-speeds; (ii) Smaller batches show minuscule differences in speed between the Raspberry Pi and the ESP32, ≈20 ms for a batch size of 20; (iii) Even for larger batches, the difference between the two platforms is still relatively comparable, e.g. with a batch size of 100, the difference averages ≈85 ms (less than a tenth of a second). This shows that even with TinyML constraints, compared to higher-end devices, fast and accurate data quality validation and missing data imputation are still achievable.

Energy Efficiency
Table 2 shows the average energy used (in Joules) by the boards to impute data for batch sizes ranging from 20 to 100. Energy consumption is calculated by multiplying the current (Amperes) drawn (observed using a precision multimeter) by the supply voltage (Volts) to determine the power, and then multiplying that by the task time (seconds) to find the energy used. The energy efficiency analysis guides algorithm selection for low-power IoT applications: (i) MA-SLR is the most energy-efficient, even on TinyML boards. It consumes the least energy across various batch sizes, making it a favorable choice for low-power applications; (ii) KNN-EM tends to have ≈16 times higher energy usage than MA-SLR (on the ESP32), especially on the ESP32 and Raspberry Pi, which makes it a suitable choice when computational accuracy is a priority over energy; (iii) LCR-Opt's energy usage varies, making it impractical for some configurations and devices, due to its higher energy usage (≈105 times higher than MA-SLR on the ESP32). However, it exhibits significantly improved accuracy with larger batch sizes, especially in the thousands.
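As a worked form of that calculation, with an illustrative supply voltage and current draw (assumed values, not measurements from the paper):

$$E = V \cdot I \cdot t \;\approx\; 3.3\,\mathrm{V} \times 0.05\,\mathrm{A} \times 0.086\,\mathrm{s} \;\approx\; 0.014\,\mathrm{J}$$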

Accuracy
Table 3 shows the algorithms' imputation accuracy across various platforms and datasets, excluding LCR-Opt on the MKR1000 due to the algorithm's SRAM needs exceeding the board's memory. From the table, the following is observed: (i) Imputation accuracy remains consistently uncompromised across diverse hardware platforms, even when running on IoT boards. For example, the MA-SLR algorithm running over the Iris Flowers dataset shows low MAEs of 0.048 and 0.025 on the MKR1000 and ESP32 boards with just 32KB and 520KB SRAM respectively. Those results compare against a Raspberry Pi with 4GB of RAM and a Laptop with 8GB of RAM, which achieved scores of 0.045 and 0.031 respectively; (ii) Despite the differences between the devices used, the ESP32 shows comparable or even lower error scores. For instance, when running KNN-EM on Daily Sports and Activities, it achieved 0.00149 MAE, whereas the Raspberry Pi had 0.00177 MAE and the Laptop had 0.00195 MAE, despite the ESP32's inherent limitations; (iii) Irrespective of the type of data, our algorithms show the same performance. For example, for both the Mammographic Mass (int) and Iris Flowers (float) datasets, the accuracy results are very comparable, with just a 0.02 to 0.04 error difference.

CONCLUSION
This paper introduces a data imputation framework designed for low-power microcontrollers in IoT devices. Our comprehensive evaluation across datasets and devices demonstrates strong performance. Notably, the MA-SLR algorithm achieves high accuracy with minimal energy use (e.g., 0.025 MAE on ESP32), while KNN-EM offers precision (0.00149 MAE on ESP32). Surprisingly, IoT boards match or outperform higher-end devices in accuracy. These findings highlight the potential for efficient and accurate data imputation in resource-constrained IoT and edge computing environments, enabling real-time data quality validation and enhancement. Future work will explore adaptive window sizes and parameter-tuning strategies tailored to specific device capabilities and data characteristics, which could improve anomaly detection accuracy. We will also investigate incorporating online, incremental, and lifelong learning approaches to facilitate continuous model refinement in dynamic IoT settings. Lastly, as IoT deployments continue to grow in scale and complexity, scalability becomes a crucial concern; we could explore distributed and parallel computing approaches to ensure these algorithms can handle the large volumes of data generated by extensive sensor networks.

Figure 2: Tiny-Impute empowers on-device sensor data stream cleaning, eliminating the need for high-latency cloud services. It delivers high-quality data, enabling state-of-the-art ML models to achieve precise, real-time control of real-world entities.

Figure 3: Tiny-Impute for data quality validation plus hybrid anomaly detection and data imputation at the IoT device level.


Figure 4: Proposed algorithms' on-device anomaly detection and data imputation time consumption across various datasets.



Table 1: Devices (left) and datasets (right) used for the evaluation of the proposed algorithms.

Table 2: On-device imputation energy used by the algorithms.

Table 3: Proposed algorithms' anomaly detection and data imputation performance (RMSE and MAE) across various devices and datasets. For each device/dataset, the best results (lowest errors) across algorithms are in bold.