Research Article | Open Access

DNN Is Not All You Need: Parallelizing Non-neural ML Algorithms on Ultra-low-power IoT Processors

Published: 19 April 2023


Abstract

Machine Learning (ML) functions are becoming ubiquitous in latency- and privacy-sensitive IoT applications, prompting a shift toward near-sensor processing at the extreme edge and a consequent increasing adoption of Parallel Ultra-low-power (PULP) IoT processors. These compute- and memory-constrained parallel architectures need to efficiently run a wide range of algorithms, including key Non-neural ML kernels that compete favorably with Deep Neural Networks in terms of accuracy under severe resource constraints. In this article, we focus on enabling efficient parallel execution of Non-neural ML algorithms on two RISCV-based PULP platforms: GAP8, a commercial chip, and PULP-OPEN, a research platform running on an FPGA emulator. We optimized the parallel algorithms through fine-grained analysis and intensive optimization to maximize the speedup, considering two alternative Floating-point (FP) emulation libraries on GAP8 and the native FPU support on PULP-OPEN. Experimental results show that a target-optimized emulation library leads to an average 1.61× runtime improvement and a 37% energy reduction compared to a standard emulation library, while the native FPU support reaches up to a 32.09× improvement and a 99% reduction, respectively. In terms of parallel speedup, our design improves sequential execution by 7.04× on average on the targeted octa-core platforms, reducing energy and latency by up to 87%. Last, we present a comparison with the ARM Cortex-M4 microcontroller, a widely adopted commercial solution for edge deployments, which is 12.87× slower than PULP-OPEN.


1 INTRODUCTION

Driven by recent progress in computing power, communication technologies, and big data, Machine Learning (ML) has enabled cutting-edge breakthroughs in a broad range of domain-specific applications. As a crucial factor for the widespread use of ML systems, Internet-of-Things (IoT) devices have recently experienced explosive growth, reaching 50 billion connected devices in 2020 [1]. Spanning from Autonomous Driving [2] to Non-intrusive Load Monitoring [3], ML has become ubiquitous, spurring a boom in Artificial Intelligence (AI) services and applications [4]. Due to the proliferation of edge devices, the amount of data generated at the network edge has increased dramatically and is projected to reach 850 ZB by 2025 [5]. So far, the limited computational capabilities of resource-constrained Microcontroller Unit (MCU)-based systems have favored offloading data to the cloud for analytics, where computational resources are flexible and virtually unbounded. However, the cloud-computing paradigm suffers from scalability issues concerning communication latency, bandwidth, and privacy [6, 7].

Latency- (e.g., Autonomous Vehicles) and privacy-sensitive IoT applications (e.g., Health Monitoring Wearable Devices) are prompting a paradigm shift [8, 9, 10] toward near-sensor processing at the extreme edge to unleash the potential of ML. Such applications demand fast and accurate automated decision-making capabilities while handling highly confidential and sensitive customer data. Pushing the ML frontiers closer to the information sources promises several benefits, including energy efficiency, data privacy protection, reduced bandwidth costs, and low-latency response [11].

Unfortunately, moving intelligence to the edge is non-trivial due to the limited computational capabilities and energy efficiency of resource-constrained IoT devices. As shown in Table 1, modern ML inference tasks run on cloud servers and mobile platforms featuring a peak processing power of up to 38.7 TFLOP/s and 155 GFLOP/s, respectively. By contrast, the ARM Cortex-M4 MCU, a widely used platform for edge deployments, offers a 461,000× lower computational capability. Off-the-shelf Deep Neural Network (DNN) inference demands billions of FLOPs, largely exceeding typical timing requirements for most applications when executing on state-of-the-art (SoA) single-core MCUs. Requiring 3.8 GFLOPs per inference, ResNet [12] demands 44.19 s on the ARM Cortex-M4 platform, while executing EfficientNet-B0 [13] and MobileNet-V2 [14] requires 8.45 s and 2.33 s per inference, respectively.

Platform                               Compute power (FLOP/s)
Cloud ML (NVIDIA A100, Ampere)         38.7 T
Mobile ML (iPhone, Apple A13)          155 G   (250× below cloud)
Edge ML (STM32F401, ARM Cortex-M4)     84 M    (1,845× below mobile)

Table 1. Computational Capabilities of ML Inference Platforms from Cloud to Edge Deployment

Emerging Parallel Ultra-low-power (PULP) processors [15, 16] represent an appealing target for TinyML applications, since they can meet ML computational constraints within a power envelope of a few milliwatts. The PULP paradigm builds upon near-threshold computing while leveraging data- and thread-level parallelism to overcome the performance reduction at low operating voltages [17]. By integrating an I/O-dedicated core with a multi-core Cluster (CL) of processors, this platform offers flexible software-oriented acceleration for ML and Digital Signal Processing (DSP) tasks. In this work, we leverage two RISCV-based PULP MCUs to provide adequate computing capabilities for ML at the edge. GAP8 [18] is a commercial off-the-shelf chip delivering up to 10 GMAC/s (90 MHz, 1.0 V) at an energy efficiency of 600 GMAC/s/W within a worst-case power envelope of 75 mW. PULP-OPEN, instead, is a research platform running on an FPGA emulator, whose most recent silicon embodiment features a 32.2 GOPS peak performance with a maximum power envelope of 49.4 mW [19].

Standard edge-class MCUs usually trade programmability for silicon area and energy efficiency, limiting HW resources to the bare minimum to improve the power envelope [20]. At the same time, ML applications demand FP workload processing, since FP support satisfies dynamic-range and precision requirements without intensive numerical tuning. Due to such tight design and power constraints, small, low-cost IoT cores cannot always afford the cost of a full-fledged HW Floating-point Unit (FPU). Several industry-standard STM1 and NXP2 System-on-Chips (SoCs) integrate FPU-less ARM Cortex-M family cores3 to enable low-power operation. Commercial devices such as 16-bit PIC and MSP4304 MCUs, along with the Xtensa L106 core embedded in ESP8266 SoCs,5 follow the same trend. These FPU-less devices implement FP computation with SW FP emulation. Deriving the fixed-point variant of an FP algorithm is highly time-consuming [21] and requires additional analysis that takes up 30% of the overall development time [22]. In addition, fixed-point computations are highly susceptible to quantization effects, making FP conversion error-prone and challenging [23, 24, 25]. Edge applications constrained by tight resource budgets and short time-to-market would thus be negatively impacted by adopting fixed-point arithmetic. In this scenario, fast FP SW emulation libraries bring several benefits, decreasing development time and enabling fast time-to-market. Parallelizing emulated FP workloads on multi-core ULP devices can dramatically reduce the runtime overhead introduced by FP SW emulation while still meeting the power budget of TinyML applications. In this article, we consider two alternative FP emulation libraries on GAP8, since this target does not offer native FPU support: libgcc, which provides a set of standard low-level routines to handle arithmetic operations not natively supported by the target platform, and RVfplib, a library optimized for FP arithmetic emulation on 32-bit RISCV processors [26].

In recent years, academic and industrial researchers have focused on DNNs, introducing novel topologies to improve accuracy and efficiency and customizing hardware designs and Instruction Set Architectures (ISAs) for DNN execution [27]. At the same time, Non-neural ML kernels have been partially neglected by the TinyML research community. Nevertheless, for a wide range of applications, these algorithms achieve accuracy comparable with SoA DNNs while demanding lower computing capabilities. Greeshma et al. [28] achieve near-SoA accuracies on the Fashion-MNIST dataset [29] by deploying a set of Non-neural ML algorithms: linear Support Vector Machine (SVM) and Random Forest (RF) attain up to 97.3% accuracy, while Logistic Regression (LR) and k-Nearest Neighbor (kNN) reach 91.7% and 95.9%, respectively. Thus, Non-neural ML algorithms represent an important target for optimized deployment on PULP-class devices for TinyML. In this scenario, the primary goal of our work is to optimize the parallel design of a set of Non-neural ML algorithms to run efficiently on two RISCV-based PULP MCUs.

The main contributions of this article are:

  • We optimize the sequential and parallel design of six widely utilized Non-neural ML algorithms, minimizing the Cycles per Instruction (CPI) metric on two RISCV-based PULP MCUs. We provide a detailed experimental assessment that explains the architectural factors limiting performance at the core and system level. We compute the floating-point operation (FLOP) intensity for each kernel to describe in depth the performance achieved with alternative FP emulation supports and with native FPU support. We also report the theoretical speedup given by Amdahl’s law to explain the structural limits on parallel performance.

  • We compare the kernel execution time when running on a single-core configuration, leveraging alternative floating-point (FP) emulation libraries on GAP8 and the FPU-native support on PULP-OPEN. We also report code size, energy consumption, and latency for each algorithm and platform configuration. The experimental evaluation shows that the target-optimized RVfplib library achieves an average 1.61× speedup and 6.24% code size reduction compared to the standard libgcc emulation support. Adopting the fast SW emulation library also enables a 37% energy reduction. The FPU-native support reaches up to 32.09× speedup and 41.71% code size decrease compared to libgcc emulation.

  • We examine the one- versus eight-core parallel speedup achieved on the targeted PULP platforms, considering FP emulation on GAP8 and FPU-native support on PULP-OPEN. The results reveal that our optimized parallel design allows achieving near-ideal speedups for Non-neural ML kernels, ranging from 6.56× to 7.64× compared to a single-core execution. We also report an energy and latency reduction of up to 87%.

  • We compare the Non-neural ML algorithms’ execution time on PULP-OPEN and the ARM Cortex-M4 MCU. The experimental results demonstrate that a single-core PULP-OPEN configuration yields speedups ranging from 1.36× to 2.39× compared to the Cortex-M4 deployment, along with 85%–89% average energy and latency reductions. Fully leveraging the PULP-OPEN eight-core CL reduces computing time by more than one order of magnitude, between 9.27× and 15.85×. The parallel design’s energy and latency improvements reach up to 98% compared to Cortex-M4.
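The Amdahl's-law bound mentioned above relates a kernel's parallel fraction to its attainable speedup on n cores. A minimal helper illustrating the bound (a sketch for intuition, not code from our library):

```c
/* Amdahl's law: upper bound on speedup for a kernel whose runtime
 * fraction p is perfectly parallelizable across n cores.
 * Illustrative helper, not part of the article's codebase. */
static double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, a kernel that is 95% parallel is capped below 6× on eight cores; conversely, the 6.56×–7.64× speedups reported above imply parallel fractions of roughly 97% or more.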


2 RELATED WORK

2.1 NN Tools and Libraries

The current generation of SW frameworks and tools for TinyML mainly focuses on deploying neural ML algorithms on SoA single-core MCUs. A significant representative of this trend is CMSIS-NN [30], a software library including a set of kernels developed to maximize the performance and minimize the memory footprint of NNs on ARM Cortex-M family cores. X-CUBE-AI [31] from STMicroelectronics6 converts pre-trained NNs exported from common DL frameworks into a pre-compiled library optimized for computation and memory on STM32 MCUs. By addressing optimal memory tiling and efficient data transfers, the AutoTiler tool from GreenWaves Technologies7 generates code from pre-trained DNNs for execution on the RISCV-based multi-core MCU GAP8.

2.2 Non-neural ML Libraries

While the aforementioned solutions enable deploying NN workloads on several MCUs, they do not support generating code for pre-trained Non-neural ML algorithms. Consequently, several works have been proposed recently by industry and the open-source community to support Non-neural kernel inference at the edge. CMSIS-DSP is a software library including a comprehensive set of DSP functions optimized by ARM for various Cortex-M processors with FP support. Recent versions of CMSIS-DSP add support for Non-neural ML algorithms, including alternative SVM kernels, a Naive Bayes estimator, and distance functions for clustering algorithms. The TinyML paradigm includes a set of techniques to integrate ML algorithms within resource-constrained MCUs [8]. Yazici et al. [32] implement SVM and RF models on a Raspberry Pi platform, reporting accuracy between 82% and 96% and an execution time of around 5 s to perform inference on 100 instances. However, the Raspberry Pi platform has a power envelope of 2–5 Watts [33], which far exceeds the few-milliwatt power budget of TinyML applications. Furthermore, Reference [32] does not provide any insight into the algorithm design. Edge Machine Learning (ELM) [34] is an open-source ML framework targeting STM32 edge devices, implementing linear-kernel SVM, RF, Decision Tree (DT), and kNN. MicroML [35] and emlearn [36], instead, are Python modules that extend the Scikit-learn library to generate Non-neural ML algorithms for edge MCUs, including SVM, RF, DT, and Gaussian naive Bayes algorithms. These libraries provide platform-independent C implementations for a wide range of target MCUs, without dependencies on external libraries and with integer/FP arithmetic support. However, these solutions do not provide the platform-specific optimizations necessary to achieve peak performance at the edge and do not support parallel execution on multi-core ULP processors.

2.3 Non-neural ML Parallelization

In recent years, several works have tackled the efficient parallelization of Non-neural ML algorithms on many- and multi-core architectures [37, 38, 39]. However, such approaches target high-end platforms leveraging resources unavailable on MCU-class devices and fail to meet the limited TinyML budget. They also primarily focus on accelerating the algorithms’ training phase by deploying multi-level parallelism with the complex memory hierarchies provided by these architectures. In Reference [40], the authors designed a highly efficient parallel SVM training on x86-based many-core architectures, achieving up to 84× and 47× speedups w.r.t. LIBSVM on the Intel Xeon Phi co-processor and Ivy Bridge CPU. However, the design relies on task- and data-level parallelism via multiple threads and a Vector Processing Unit (VPU) to reach this performance. Parallel Ultra-low-power platforms usually limit the HW resources to meet a power envelope of a few milliwatts, thus supporting neither standard multi-threading programming models nor large vector units. Zhu et al. [41] compared OpenMP- and OpenCL-based parallel learning-to-rank SVM for multi-core CPUs and GPUs, showing that OpenCL reaches 7.8× and 19.3× speedups on such platforms. However, the OpenCL parallel programming model leverages features not supported by MCU-class devices, such as shared virtual memory and dynamic parallelism. By conducting a comprehensive study of parallel LR training, Ma et al. [42] reduced the computing time by 200× and 500× on an Intel multi-core CPU and an NVIDIA GPU. The approach relies on techniques generally not supported by our edge devices, such as multi-threading, load balancing to allocate virtual threads, and minimization of thread creation/destruction events.

2.4 HW/SW Optimizations

In the past decade, researchers have proposed specialized designs to reduce the inference costs of ML algorithms. Microsoft released the EdgeML8 library, which consists of novel Non-neural ML algorithms suitable for severely resource-constrained edge and IoT devices. For example, ProtoNN [43] is a kNN-based algorithm designed to reduce model size and execution time on IoT devices with less than 32-kB memory and a frequency of 16 MHz. While ProtoNN efficiently handles large datasets with SoA accuracy, its related optimization problem is non-convex, requiring stochastic gradient descent (SGD) with iterative hard thresholding to perform training. Bonsai [44] is a tree-based algorithm designed to guarantee efficient prediction on IoT devices such as the Arduino Uno board, operating at 16 MHz with no native FPU support, 2-kB RAM, and 32-kB read-only flash. Bonsai learns a single, shallow, sparse tree in which both internal and leaf nodes make non-linear predictions: the overall prediction is computed as the sum of the individual predictions along the path traversed by an input sample. This approach reduces the model size compared to solutions that employ independent classifiers in the leaf nodes. Since MCU-based devices for IoT applications often do not integrate an FPU, Gopinath et al. [45] proposed a framework that generates efficient fixed-point code for ML inference at the edge. However, this approach requires expressing the ML algorithm in a domain-specific language and using a custom compiler. Mahajan et al. [46] describe a template-based framework to accelerate a set of learning algorithms (including LR and SVM) on FPGA. FPGA acceleration is a viable approach in many domains, but its power budget is too high for ULP processing at the edge of the IoT.

In this article, we optimize the parallel design of six very common Non-neural ML kernels [47, 48], achieving peak performance on two RISCV-based multi-core PULP MCUs. We designed the algorithms in standard C while integrating low-level platform-dependent optimizations into the runtime. In the following sections, we detail the design through a fine-grained analysis of the parallelization patterns and memory access optimizations adopted.


3 BACKGROUND

This section briefly describes the target MCUs and the software ecosystem deployed in this work, starting with a discussion of our motivations in Section 3.1. We present the PULP platform in Section 3.2 and GAP8 and PULP-OPEN in Sections 3.3 and 3.4, respectively. Section 3.5 reports the two FP emulation libraries deployed to enable FP computations on architectures with no native FPU support. Finally, Section 3.6 introduces the software stack and parallel programming model used to achieve fine-grained data- and thread-level parallelism.

3.1 Motivations

SoA DNNs achieve the highest accuracy in many application fields, including Keyword Spotting, Computer Vision, and Anomaly Detection. However, their higher performance comes at the price of computational complexity, hampering their application on many resource-constrained platforms, such as MCU-based IoT devices. Moreover, DNNs perform only marginally better than tree-based models in some application fields (e.g., energy prediction models [49]). For these reasons, Non-neural ML techniques remain widely used for ultra-low-power and tightly resource-constrained near-sensor processing applications. In fact, a few commercial smart sensors, such as the LSM6DSOX system-in-package by STMicroelectronics, feature an embedded hardware processing engine accelerating DTs for “in-sensor” processing and classification.

To quantitatively assess the complexity versus accuracy tradeoff on open benchmarks, we analyzed the accuracy achieved by Non-neural ML algorithms and SoA DNNs while comparing the computational complexity at inference time in terms of Multiply-and-Accumulate (MAC) operations. The study has been conducted on three widespread industrial and commercial use cases: Keyword Spotting, Image Classification, and Anomaly Detection. Using the well-known MLPerf Tiny benchmark suite [50], we considered Speech Commands, CIFAR-10, and ToyADMOS datasets, and DS-CNN, ResNet-8, and FC-Autoencoder (FC-AE) as SoA DNN references.

As shown in Figure 1, we executed GEMM-based Non-neural ML algorithms on the Speech Commands dataset for the Keyword Spotting task. The DS-CNN architecture reaches 90% accuracy but at a cost of 2.9 MMACs per inference, as depicted in Figure 2. Leveraging Non-neural ML models lowers the computational complexity to only 6 kMACs, a 490× reduction, while still reaching an acceptable 77% accuracy. Note that the accuracy of DNNs on these tasks keeps increasing, but Non-neural ML approaches are improving as well. In recent years, academic researchers have also focused on leveraging custom feature extractors on top of SVM and LR. On the Speech Commands dataset, Huh et al. [51] reached 98% accuracy by changing the loss function from the classification loss to a range of metric learning objectives and then training a one-versus-one SVM kernel. On the NOSS benchmark suite, Shor et al. [52] trained an LR classifier on time-averaged representations, achieving 96%.

Fig. 1.

Fig. 1. Non-neural ML vs. SoA DNNs Top-1 accuracy. Abbreviations: Feature Extractor (FE).

Fig. 2.

Fig. 2. Non-neural ML vs. SoA DNNs computational complexity.

To assess Non-neural ML algorithm performance in image classification, we trained RF and NB models on CIFAR-10, achieving up to 50% accuracy, while the ResNet-8 architecture leads to 85%. However, adopting Non-neural ML kernels decreases the computational complexity by up to \(318\times\), requiring only 40.3 kMACs per inference against the 12.8 MMACs demanded by ResNet-8. Furthermore, many works have investigated CNN-based feature extractors to pre-process image pixels, achieving strong performance when coupled with Non-neural ML kernels. Liu et al. [53] reached 87.2% accuracy on CIFAR-10 by training a set of DTs on features extracted from the last fully connected layer of a ResNet; using NB, they achieved 86.6% accuracy.

Last, we evaluated performance in the Anomaly Detection scenario by comparing kNN and kMeans kernels against the FC-Autoencoder architecture on the ToyADMOS dataset. The SoA DNN achieves a 0.85 AUC score, requiring 270 kMACs to detect abnormal input data. At the same time, Non-neural ML algorithms reduce computing time by \(6.2\times\) with merely 43 kMACs per inference while still reaching an acceptable 0.75 AUC. Several works have also studied alternative feature extractors to improve Non-neural ML performance in Anomaly Detection: Durkota et al. [54] reach up to 0.94 AUC by deploying a Siamese Network to extract features for a kNN model, while the Mutual Information technique enables reaching 0.95 AUC with k-Means [55].

To summarize the discussion, SoA works on alternative feature extractors have shown that Non-neural ML algorithms can still compete with SoA DNNs in terms of accuracy in several industrial scenarios, often achieving significant reductions in computational and memory footprints. Since low-cost IoT devices are subject to tight memory and compute constraints, the efficient acceleration of these kernels is a practically relevant target and will remain so in the near future. This article focuses on enabling efficient parallel execution of Non-neural ML algorithms on two RISCV-based PULP platforms.

3.2 PULP Platform

PULP is a RISCV-based open-source platform9 built on the near-threshold computing paradigm [17]. The ultra-low-power design allows outstanding energy efficiency, while data- and thread-level parallelism overcomes the performance reduction at low operating voltages.

Figure 3 depicts the PULP System-on-Chip (SoC) top-level design. The microarchitecture is divided into two isolated voltage and frequency domains, managed by DC/DC and Frequency-locked Loops (FLLs): the Fabric Controller (FC) and the Cluster (CL). The PULP CL consists of a configurable number of RI5CY cores, a RISCV-based processor featuring a four-stage in-order single-issue pipeline, and supporting the RV32IMCXpulpV2 Instruction Set Architecture (ISA). The standard RV32IMC ISA provides support for integer, compressed, and multiply/divide instructions. Instead, the XpulpV2 extension enables highly energy-efficient computations with custom ML- and DSP-centric instructions. For that purpose, XpulpV2 includes hardware loops, post-incrementing load/store, multiply-add instructions, fixed-point, bit-manipulation, and single instruction multiple data (SIMD) support down to 8-bit packed data.

Fig. 3.

Fig. 3. Top-level view of the PULP platform System-on-Chip.

The PULP CL replaces traditional data caches with a Tightly Coupled Data Memory (TCDM) to reduce energy and area consumption while leveraging DSP data access pattern predictability. The memory acts as a size-configurable multi-banked scratchpad memory (SPM) with a banking factor of two (i.e., eight banks for the four-core configuration), enabling shared-memory parallel programming models such as OpenMP [56]. A single-cycle-latency, word-level interleaved logarithmic interconnect allows data sharing between TCDM and cores with a low average contention rate. The CL features a hierarchical instruction cache (I$) consisting of a first private level and a second shared one. This design provides optimal performance and energy efficiency when fetching data-parallel code, reducing instruction misses and leveraging the SIMD nature of most near-sensor processing applications.
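As a concrete sketch of the interleaving scheme, the bank serving a given address is simply the word index modulo the number of banks; the bank count below is an illustrative power-of-two constant, not taken from a specific configuration:

```c
#include <stdint.h>

/* Word-level interleaved TCDM mapping: consecutive 32-bit words fall
 * into consecutive banks, so cores streaming adjacent data rarely hit
 * the same bank. Bank count is an assumed power of two (banking
 * factor 2 for an eight-core cluster). */
#define TCDM_NUM_BANKS 16u

static inline uint32_t tcdm_bank(uint32_t byte_addr) {
    return (byte_addr >> 2) & (TCDM_NUM_BANKS - 1u); /* word index mod #banks */
}
```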

A custom Hardware Synchronization Unit (Event Unit) implements low-overhead support for fine-grained parallelism, providing fast event management, parallel thread dispatching, and synchronization. The Event Unit also provides high-energy efficiency by utilizing power-saving policies when cores are in the idle state. The cores waiting for a synchronization barrier or an event are taken to a fully clock-gated state, thus zeroing the dynamic energy consumption.

On the SoC level, PULP features a RI5CY core and a multi-channel I/O \(\mu\)DMA to manage data transfers and minimize the core workload when performing I/O. A 15-cycle latency multi-banked SPM memory acts as an L2 hierarchy level that serves the CL data bus, the I$ refills, and the CL DMA unit. The SoC also features a comprehensive set of peripherals enabling parallel capture of images, sounds, and vibrations, for use in smart applications such as speech recognition and object detection.

3.3 GAP8

GAP8 [18] is a commercial SoC for IoT applications, embedding a RISC-V multi-core processor derived from the PULP open-source computing platform. The SoC couples a single-core FC with an octa-core CL, enabling AI workloads at the edge.

The single-core system acts as an advanced MCU in charge of controlling all SoC operations, fetching instructions from a 4-kB I$. The FC domain features a 512-kB L2 memory reachable by each core, a private 16-kB L1 memory, and a ROM storing the primary boot code. An 800-Mbit/s Double-data Rate (DDR) Hyperbus interface enables extending the on-chip memory, while a multi-channel \(\mu\)DMA hides the L3 data transfer cost. A set of peripherals (i.e., QuadSPI, I2C, 4× I2S, CAM, UART, PWM, GPIOs, JTAG) enables the acquisition of several signals with high bandwidth and efficiency.

On the CL side, the SoC integrates eight identical RI5CY cores with a 16-kB two-level shared I$ and a 64-kB multi-banked TCDM. Offloading highly compute-intensive kernels allows up to 10 GMAC/s (90 MHz, 1.0 V) at the energy efficiency of 600 GMAC/s/W within a worst-case power envelope of 75 mW. Furthermore, the extremely energy-efficient design enables 3.6 \(\mu \text{W}\) power consumption when in deep-sleep mode.

3.4 PULP-OPEN

PULP-OPEN is a research-oriented platform based on the PULP project, tailored for applications in the domain of near-sensor computing. The platform reflects the GAP8 architecture and microarchitecture, with the addition of native FPU support.

The PULP-OPEN CL integrates FPnew [57], a parametric open-source FPU that allows inserting any number of pipeline stages and supports a wide variety of standard and custom FP formats. In this work, we deploy four FPnew instances shared among the eight CL cores, each with one pipeline stage. The shared FPU supports IEEE 754 single- (FP32) and half-precision floats (FP16), along with custom 16-bit bfloats (FP16alt). Moreover, the architecture implements SIMD vectorization, vectorial conversions, and data packing/unpacking.

Figure 4 depicts the top-level design of the shared FPU exploited in this work. A logarithmic tree interconnect links each FPU instance with two cores, making FPU sharing fully transparent to software. The static mapping of FPUs lets cores always access the same physical FPU instance. On the core side, the interconnect interface overrides the FPU during the execution stage, simulating a core-private block. An Auxiliary Processing Unit (APU) interface connects the FPU instances to the cores, leveraging a ready/valid protocol with a round-robin policy and communicating with the processor’s execute pipeline stage. In case of simultaneous accesses to the FPU, the system propagates the ready signal to only one processor and stalls the pipeline of the competing core. The FPU utilizes a connection scheme with interleaved allocation to decrease access contention in unbalanced workloads.

Fig. 4.

Fig. 4. Top-level design of the PULP FPU sub-system.

3.5 FP Emulation Libraries

In this work, we deploy FP32 as the standard data format for computations. To enable the execution of FP32-based algorithms on GAP8, we perform FP computations employing a standard and a custom FP emulation library.

The GNU Compiler Collection (GCC) provides a low-level runtime library called libgcc. The routines integrated into the library handle arithmetic operations not natively supported by the target processor. The GCC compiler automatically creates calls to libgcc routines or inlines the code when the target benchmark includes operations with no HW-native support. In particular, libgcc includes a set of FP IEEE-754 compliant routines supporting single- and double-precision data formats, with a wide variety of arithmetic, conversion, comparison, and advanced software-emulated operations.
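These routines operate on the raw IEEE-754 bit pattern: a float addition compiled for an FPU-less target, for instance, becomes a call to libgcc's __addsf3, which unpacks the operands' fields, computes with integer arithmetic, and repacks the result. A minimal sketch of the unpacking step (for illustration only, not libgcc's actual code):

```c
#include <stdint.h>
#include <string.h>

/* Unpack the sign, biased exponent, and fraction fields of an IEEE-754
 * single-precision value, as an FP emulation routine would before
 * operating on them with integer arithmetic. Illustrative sketch. */
typedef struct { uint32_t sign, exp, mant; } fp32_fields;

static fp32_fields fp32_unpack(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);      /* reinterpret bits without UB */
    fp32_fields r;
    r.sign = bits >> 31;                 /* 1 sign bit */
    r.exp  = (bits >> 23) & 0xFFu;       /* 8 exponent bits, bias 127 */
    r.mant = bits & 0x7FFFFFu;           /* 23 fraction bits */
    return r;
}
```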

To reduce the overhead of executing FP-based kernels on GAP8, we also use RVfplib [26], a custom RISCV IEEE-754-compliant library optimized for FP arithmetic on 32-bit integer processors. The library provides two versions compatible with RV32IMC processors, targeting code size and performance optimization, respectively; in this work, we use the version optimized for faster code execution. Supporting the standard FP32 and FP64 data formats, RVfplib provides target-optimized software routines for conversion, arithmetic, and comparison operations.

3.6 Programming Model and Compilation Toolchain

An efficient and low-overhead software stack is mandatory to fully leverage the CL compute power. In this work, we use the PULP open-source software ecosystem,10 which provides a parallel programming model and compiler support for both targets.

The PULP toolchain provides compiler support for GAP8 and PULP-OPEN platforms. It includes an extended version of GCC 7.1 supporting the XpulpV2 extension along with a set of custom relocation schemes supported by the linker. After loading the code program into L2 memory, the FC executes the application from the entry point and offloads compute-intensive kernels to the CL.

A Hardware Abstraction Layer (HAL) exposes low-level resources and makes the parallel computing paradigm explicit. The core identifier allows scheduling the parallel workload among the workers, leveraging data- and thread-level parallelism. Inter-core synchronization is mandatory to ensure correct results in the shared-memory programming model. Thus, the CL architecture provides specialized HW support for optimized synchronization primitives, such as barriers and critical sections, to orchestrate the execution flow. The OpenMP programming model is also available but implies higher overhead than the HAL primitives. Since this work focuses on maximizing the execution performance of Non-neural ML algorithms, we used the lower-level HAL for our experimental assessment.


4 ALGORITHM DESIGN

In this section, we present the design of six key Non-neural ML algorithms optimized for parallel execution on the two RISCV-based PULP platforms. After giving an introductory description of the mathematical fundamentals, we thoroughly detail the parallelization strategy used to dispatch the CL workload efficiently. We also report the fine-grained analysis and intensive optimization to maximize the speedup. For simplicity, we grouped the algorithms based on their mathematical formulation and parallelization nature:

  • General Matrix Multiply-based (GEMM-based): LR and SVM.

  • Gaussian Naive Bayes (GNB).

  • Metric Space-based (MS-based): kNN and K-Means.

  • Independent Tasks-based (IT-based): RF.

To break the TinyML memory bottleneck on resource-constrained devices, the research community usually leverages techniques such as optimal double-buffering and memory tiling [58, 59]. We optimized the algorithms as stand-alone kernels, finely tuned to process data placed in L1 memory in parallel. An external double-buffering wrapper enables using L2 memory when data do not fit in L1, overlapping L1–L2 memory transfers with kernel processing at almost zero-cycle overhead. Last, we find an optimal tiling strategy for each algorithm, fine-tuning the memory accesses to maximize data reuse and performance.

In this section, we detail the design of the stand-alone kernels optimized to run efficiently in parallel on the octa-core CL. The colors used in the following figures depend on the data associated with each core, as depicted in Figure 5: we use a specific color for the memory data read by a particular core. Since sequential operations execute on a single core, we arbitrarily selected core 0 for them and colored its read memory data in red. For each algorithm, we consider a training dataset A consisting of \(N_{train}\) d-dimensional samples and \(N_{class}\) classes. To describe the parallelization schemes, we use bold capital and bold lowercase letters to represent matrices and vectors, respectively, while plain lowercase symbols denote scalar variables.

Fig. 5. Cores coloring used to mark related processing data.

4.1 Parallelization Approach

The OpenMP [60, 61] paradigm is a widely adopted parallel programming model for shared-memory multi-core MCU platforms, and it has already been demonstrated in the context of embedded systems [62, 63, 64] and TinyML applications [65, 66, 67]. However, this programming model leads to unavoidable overheads in distributing the workload and orchestrating communication and synchronization among the workers [68]. Minimizing such runtime overheads is crucial to enabling fine-grained parallelism on ULP multi-core platforms. Furthermore, TinyML applications have small workloads implying relatively short parallel regions (just a few tens of cycles), making it challenging to amortize overheads. The SPMD parallel paradigm [69] is an alternative approach demanding more programmer effort than OpenMP, since it requires modifying the source code and dealing with low-level details (e.g., inter-core synchronization, critical sections, and shared/private variable allocation). Nevertheless, the SPMD paradigm enables fine-grained parallelism thanks to tighter runtime control, leading to less overhead than a traditional OpenMP runtime. Montagna et al. [70] compared the two paradigms and showed that a bare-metal SPMD runtime achieves a 178% runtime improvement over a baseline OpenMP on multi-core ULP MCUs. Based on this evidence, our work focuses on providing an optimized SPMD version of the code.

To further improve the parallel runtime and approach ideal performance, we leverage HW-specific optimizations for core idling and synchronization. The GAP8 and PULP-OPEN Clusters integrate a multi-core Event Unit (EU) optimized to accelerate key data-parallel patterns, such as barriers and locks, while supporting power-saving policies to put cores in an idle state. The EU is a lightweight HW block designed to enable fine-grained parallelism with minimum synchronization overhead in terms of cycles and energy. Due to its efficient HW design, executing barriers and critical sections with the eight-core Cluster configuration requires 6 and 50 cycles, respectively. The barrier and mutex extensions correspond to the parallel and critical-section constructs fundamental to most parallel programming models. Thus, leveraging the EU HW-specialized support is key to drastically reducing the synchronization overhead of parallel programming primitives. In our work, we access these low-level resources through a Hardware Abstraction Layer (HAL).
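The resulting SPMD execution pattern can be sketched in C as follows. The sketch is purely illustrative: `worker` stands for the code each CL core runs, the comments mark where the EU-backed primitives would appear on real HW, and the eight cores are emulated sequentially so the snippet is runnable anywhere (the names are ours, not the PULP HAL API):

```c
#define N_CORES 8

/* Illustrative SPMD sketch: worker() is the code each CL core would run.
 * On real HW, the commented points map to Event Unit primitives; here the
 * eight workers are emulated sequentially. */
static int shared_sum;

static void worker(int core_id, const int *data, int len) {
    int chunk = len / N_CORES;            /* len assumed divisible by 8 */
    int lb = core_id * chunk, ub = lb + chunk;
    int local = 0;
    for (int i = lb; i < ub; i++)
        local += data[i];
    /* EU critical section (~50 cycles on the eight-core CL): one core at
     * a time updates the shared accumulator */
    shared_sum += local;
    /* EU barrier (~6 cycles): all cores wait here before proceeding */
}

int parallel_sum(const int *data, int len) {
    shared_sum = 0;
    for (int c = 0; c < N_CORES; c++)     /* sequential CL emulation */
        worker(c, data, len);
    return shared_sum;
}
```

With the critical section costing roughly 50 cycles and the barrier 6, this pattern amortizes well even for the short parallel regions typical of TinyML workloads.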

4.2 Horizontal and Vertical Workload Distribution

We introduce two data partitioning schemes adopted in the rest of this section to achieve optimal performance on multi-core platforms, namely, horizontal and vertical workload distribution.

As a common pattern, ML workloads include an operation between an \(r\,\times \,c\) matrix M and a c-dimensional input vector x, such as a matrix-vector product yielding an r-dimensional output y. In this scenario, programs can conveniently exploit data-level parallelism: a workload distribution strategy splits the data into chunks, and each core executes the same code on a different chunk. This method has an associated overhead, since it implies the computation of core-dependent loop bounds. Since this overhead is constant, its impact decreases as the chunk size increases.

Depending on the r and c dimensions, selecting a partitioning strategy mapped onto horizontal or vertical stripes of the matrix operand can significantly improve CL utilization. Having \(r \gt \gt c\) favors a vertical decomposition: the strategy partitions the r rows into \(n_{cores}\) chunks of \(r/n_{cores}\) rows each. Instead, \(c \gt \gt r\) promotes a horizontal decomposition: each core computes on r vectors of dimension \(c/n_{cores}\).
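As a minimal illustration (our sketch, not the paper's code), the following C fragment applies the vertical scheme to a matrix-vector product: each core derives its loop bounds from its core identifier and produces a contiguous slice of the output. Sizes are assumed divisible by the number of cores for brevity:

```c
#define N_CORES 8

/* Core-dependent bounds: chunk n loop iterations among the cores. */
static void chunk_bounds(int core_id, int n, int *lb, int *ub) {
    int chunk = n / N_CORES;              /* n assumed divisible by 8 */
    *lb = core_id * chunk;
    *ub = *lb + chunk;
}

/* Vertical decomposition of y = M x: core `core_id` fully produces the
 * output rows y[lb..ub). M is r x c, row-major. */
void matvec_vertical(int core_id, const float *M, const float *x,
                     float *y, int r, int c) {
    int lb, ub;
    chunk_bounds(core_id, r, &lb, &ub);
    for (int i = lb; i < ub; i++) {
        float acc = 0.0f;
        for (int j = 0; j < c; j++)
            acc += M[i * c + j] * x[j];
        y[i] = acc;
    }
}
```

The horizontal scheme applies the same `chunk_bounds` helper to the c columns of every row instead, which then requires a reduction step to combine the per-core partial sums.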

4.3 GEMM-based Algorithms

Below, we describe the algorithms based on the GEMM function, a Basic Linear Algebra Subprograms (BLAS) routine largely deployed in statistics and ML. As reported in Equation (1), GEMM-based algorithms leverage the product between two input matrices A and B, while C represents a pre-existing matrix overwritten by the output: (1) \(\begin{equation} C^{m\,\times \,n} = \alpha \cdot A^{m\,\times \,k} \times B^{k\,\times \,n} + \beta \cdot C^{m\,\times \,n}. \end{equation}\) \(\alpha\) and \(\beta\) are scalar inputs that enable the plain product \(A \times B\) and the output matrix C accumulation.

LR and SVM present an analogous inference scheme consisting of a GEMM computation performed between the input vector x and the matrix W while alternative activation functions process the output.

4.3.1 Logistic Regression (LR).

LR is a supervised ML algorithm for binary classification, which leverages a logistic function to model output probabilities [71]. While Linear Regression applies an interpolation between points by avoiding distinguishing classes, LR deploys the logistic function to squeeze the linear output between 0 and 1, thus returning the class probability. Due to its high classification performance and straightforward interpretability, the model has been widely adopted across several real-world scenarios, such as intrusion detection [72] and anomaly detection [73].

As reported in Equation (2), the LR binary decision function leverages the weighted sum between x and the real-valued d-dimensional weight vector w, with the addition of a bias term b. Each weight \(w_i\) directly relates to the input feature \(x_i\) and characterizes how relevant the ith dimension is for discriminating the classes. As a further contribution, b spatially shifts the decision boundary away from the origin. Last, LR employs the sigmoid function \(S(x) = 1 / (1 + \exp (-x))\) to map real-valued numbers into the range \([0,1]\), thus retrieving the class probability.

To support multi-class classification, we leverage the one-versus-all approach, which consists of training \(N_{class}\) distinct binary classifiers, each designed to recognize a specific class against the others. Thus, the learned vector W becomes a matrix of size \(N_{class}\,\times \,d\), while b is a \(N_{class}\)-dimensional vector. Each classifier output is a real value representing the predicted score of the target class. The Softmax function shown in Equation (3) normalizes the result to a probability distribution over the output classes. Last, the ArgMax operator Equation (4) selects the class characterized by the largest predicted probability: (2) \(\begin{equation} f(x) = S(wx + b), \end{equation}\) (3) \(\begin{equation} \sigma (x_{i}) = \frac{\exp (x_i)}{\sum _j \exp (x_j)},\,\,\,i\in [0,N_{class}-1], \end{equation}\) (4) \(\begin{equation} y = {\mathrm{ArgMax}}\,[\sigma (Wx + b)]. \end{equation}\)

4.3.2 Support Vector Machine (SVM).

SVM is a linear ML model that provides a robust theoretical foundation and generalization performance [74]. Several domain-specific applications rely on SVM due to its ability to handle high-dimensional data and solve non-linear tasks. Yi-Hung et al. [75] proposed an SVM-based face recognition system, while Siddharth et al. [76] introduced an EEG-based focal seizure detection algorithm that deploys SVM with 100% accuracy.

In the binary classification setting, SVM consists of an optimal \((d-1)\)-dimensional hyperplane determined by the d-dimensional normal vector w and the offset b that separates the training set A into classes by the largest margin. The nearest data points to the hyperplane represent the Support Vectors (SVs), while their distance corresponds to the margin. Although the general formulation of the algorithm enables classifying non-linearly separable data via high-dimensional mapping, we only focus on a linear kernel in this work.

SVM inference involves processing x with the decision function described in Equation (5), where sign refers to the function extracting the sign of its argument. Thus, \(wx + b\) indicates on which side of the hyperplane the testing input x resides, while the sign function extracts this information, providing the output class. Moving to the multi-class configuration, we again leverage the one-versus-all approach, learning a hyperplane per class: (5) \(\begin{equation} y = sign(wx + b). \end{equation}\)
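The binary decision of Equation (5) reduces to a dot product and a sign test; a minimal sketch (ours, for illustration):

```c
/* Linear binary SVM decision, Equation (5): the sign of w.x + b selects
 * the halfspace, hence the class (+1 or -1). */
int svm_predict(const float *w, float b, const float *x, int d) {
    float acc = b;
    for (int i = 0; i < d; i++)
        acc += w[i] * x[i];
    return acc >= 0.0f ? 1 : -1;
}
```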

4.3.3 GEMM-based Algorithms Parallelization Scheme.

In Figure 6, we present the parallel design of GEMM-based algorithms, optimized to maximize the speedup on multi-core shared-memory platforms. To offload the compute-intensive matrix-vector multiplication between \({ {\boldsymbol x}}\) and \({ {\boldsymbol W}}\) onto the CL, we assign to the cores the processing of \(chunk_0\) elements of each \({ {\boldsymbol W}}\) row, following the horizontal decomposition scheme. By using the offline-determined \(chunk_0\) size and the \(core_{id}\), the cores compute at runtime the lower (\(lb_0\)) and upper (\(ub_0\)) data indexes for the first computation. OP1 consists of a partial matrix-vector multiplication where each core processes a \({ {\boldsymbol W}}\) row chunk, multiplying and accumulating with the corresponding chunk of the input \({ {\boldsymbol x}}\). Iterating over the \({ {\boldsymbol W}}\) rows, we store core-dependent intermediate results in a \(N_{class}\,\times \,n_{cores}\)-sized shared global array \({ {\boldsymbol R}}\). After a synchronization barrier, we obtain the effective matrix-vector multiplication result by combining the intermediate results \({ {\boldsymbol R}}\) with the vector \({ {\boldsymbol b}}\), switching to a vertical parallel scheme in OP2. Namely, the computation consists of accumulating \({ {\boldsymbol R}}\) elements by row with the corresponding \({ {\boldsymbol b}}\) value. By leveraging a fresh \(chunk_1\), we calculate the core-dependent \(lb_1\) and \(ub_1\) bounds, which define the \({ {\boldsymbol b}}\) elements and \({ {\boldsymbol R}}\) rows assigned to each core. Thus, each core iterates over its \(chunk_1\), accumulating \({ {\boldsymbol R}}\) rows with \({ {\boldsymbol b}}\) elements and producing the \(N_{class}\)-sized result vector \({ {\boldsymbol y}}\). A CL synchronization barrier forces the cores to wait until all of them finish OP2 to avoid L1 data coherency issues.
Then, the master core executes a sequential activation function (OP3) depending on the specific GEMM-based algorithm. LR requires the Softmax function to normalize the result, while SVM includes the sign routine to retrieve the argument sign. Last, OP3 ends with the ArgMax to return the class with the highest score.

Fig. 6. GEMM-based algorithms parallelization scheme. OP1: Partial matrix-vector multiplication; OP2: Intermediate results and bias combination; OP3: Activation function + ArgMax; \({ {\boldsymbol b}}\) : Bias vector; \({ {\boldsymbol R}}\) : Matrix-vector multiplication intermediate result matrix; d: Dimension; \(c=N_{class}-1\) , \(n=n_{cores}-1\) , \(chunk_0=d/n_{cores},\,lb_0=core_{id}\,\times \,chunk_0,\,ub_0=lb_0+chunk_0\) , \(chunk_1=N_{class}/n_{cores},\,lb_1=core_{id}\,\times \,chunk_1,\;\text{and}\;ub_1=lb_1+chunk_1\) .
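The two parallel phases of this scheme can be sketched in C as follows (our illustration: the inter-phase synchronization barrier is elided, and \(d\) and \(N_{class}\) are assumed divisible by the core count):

```c
#define N_CORES 8

/* OP1: each W row is split horizontally, so core `id` accumulates a
 * partial dot product for every row into R[row][id]. W is n_class x d,
 * row-major; R is n_class x N_CORES. A barrier follows on the real CL. */
void op1_partial_matvec(int id, const float *W, const float *x,
                        float *R, int n_class, int d) {
    int chunk0 = d / N_CORES, lb0 = id * chunk0, ub0 = lb0 + chunk0;
    for (int row = 0; row < n_class; row++) {
        float acc = 0.0f;
        for (int j = lb0; j < ub0; j++)
            acc += W[row * d + j] * x[j];
        R[row * N_CORES + id] = acc;
    }
}

/* OP2: switch to a vertical split; core `id` reduces its rows of R with
 * the bias b into the result vector y. */
void op2_reduce(int id, const float *R, const float *b,
                float *y, int n_class) {
    int chunk1 = n_class / N_CORES, lb1 = id * chunk1, ub1 = lb1 + chunk1;
    for (int row = lb1; row < ub1; row++) {
        float acc = b[row];
        for (int c = 0; c < N_CORES; c++)
            acc += R[row * N_CORES + c];
        y[row] = acc;
    }
}
```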

4.4 Gaussian Naive Bayes (GNB)

Naive Bayes (NB) consists of a family of simple probabilistic classifiers based on Bayes’ theorem along with the strong assumption of conditional independence among features given the class [77]. The model simplicity and high accuracy levels make the method attractive in several tasks, such as anomaly detection in industrial IoT [78] and vehicle accident detection [79].

Considering a multi-class problem while attempting to classify an input x, the minimum classification error is ensured by picking the class \(c_i\) with the largest posterior probability \(P(c_i|x)\). As shown in Equation (6), Bayes’ theorem enables calculating the posterior probabilities \(P(c_i|x)\) from the prior probabilities \(P(c_i)\) and the class-conditional likelihood \(P(x|c_i)\). Since the marginal probability \(P(x)\) does not depend on the class \(c_i\) and x is constant, NB ignores \(P(x)\), keeping only the joint probability \(P(x,c_i)\) in the numerator. By using the chain rule to expand the definition of \(P(x,c_i)\) along with the strong conditional independence assumption, the joint probability model can be expressed as reported in Equation (7): (6) \(\begin{equation} P(c_i|x) = \frac{P(x|c_i)P(c_i)}{P(x)} \propto P(x|c_i)P(c_i) = P(x,c_i),\,\,\,i\in [0,N_{class}-1], \end{equation}\) (7) \(\begin{equation} P(c_i|x) \propto P(c_i) \prod _{k = 0}^{d-1} P(x_k|c_i),\,\,\,i\in [0,N_{class}-1]. \end{equation}\)

We derive the NB classifier by combining the model above with the ArgMax decision rule, Equation (8): (8) \(\begin{equation} y = \underset{i\,\in \,[0,N_{class}-1]}{\mathrm{ArgMax}}\,P(c_i) \prod _{k = 0}^{d-1} P(x_k|c_i). \end{equation}\) NB classifiers differ mainly by the assumptions made regarding the distribution of the class-conditional likelihood \(P(x|c_i)\). In this work, we leverage a Gaussian (normal) distribution, Equation (9), to estimate the statistical parameters of the features. By performing Maximum-Likelihood training, we learn the \(N_{class}\,\times \,d\)-sized mean (\(\mu\)) and variance (\(\sigma\)) matrices, while the \(N_{class}\)-dimensional prior probability \(P(c_i)\) vector is estimated directly on the dataset: (9) \(\begin{equation} P(x|c_i) = \frac{1}{\sqrt {2\pi \sigma ^{2}_{i}}} \exp \left(-\frac{(x-\mu _{i})^2}{2\sigma ^{2}_{i}}\,\right),\,\,\,i\in [0,N_{class}-1]. \end{equation}\)

4.4.1 GNB Parallelization Scheme.

To perform the NB decision function, Equation (8), while fully leveraging the CL compute power, we designed the parallelization scheme shown in Figure 7. The GNB per-class key operation consists of computing the feature-dependent class-conditional likelihoods \(P(x_k|c_i)\) and combining them in a sequence product with the prior probability \(P(c_i)\). In OP1, we horizontally split this compute-intensive workload, assigning each CL core a partial sequence product by leveraging an optimal \(chunk_0\) data size computed offline. At runtime, each core calculates the core-dependent \(lb_0\) and \(ub_0\) data index boundaries to retrieve the \(chunk_0\) per-row \(\boldsymbol {\mu }\) and \(\boldsymbol {\sigma }\) elements necessary to compute \(P(x_k|c_i)\). By applying the Gaussian distribution formula, Equation (9), to each \(\mu - \sigma\) pair in the core-dependent \(chunk_0\) and multiplying the results, we place the OP1 outcome in an intermediate \(N_{class}\,\times \,n_{cores}\)-sized shared array \({ {\boldsymbol R}}\). To combine the intermediate results into the final result, we merge \({ {\boldsymbol R}}\) with the \({ {\boldsymbol p}}\) vector in OP2 by leveraging a vertical decomposition scheme. Thus, we define at compile time a fresh \(chunk_1\) data size, determining the number of \({ {\boldsymbol p}}\) elements and \({ {\boldsymbol R}}\) rows assigned to each core. By calculating the \(lb_1\) and \(ub_1\) bounds, the cores iterate vertically on \(chunk_1\) rows, multiplying \({ {\boldsymbol p}}\) with the core-related partial sequence products and producing the \(N_{class}\)-sized result vector \({ {\boldsymbol y}}\). Since OP3 consists of a sequential computation on \({ {\boldsymbol y}}\), we deploy a CL synchronization barrier to wait until all CL cores finish OP2. Last, the master core retrieves the class y with the highest score by performing the ArgMax function.

Fig. 7. GNB parallelization scheme. OP1: Partial \(P(x|c)\) sequence product; OP2: Intermediate results and \({ {\boldsymbol p}}\) combination; OP3: ArgMax; \({ {\boldsymbol p}}\) : Prior probabilities vector; \({ {\boldsymbol R}}\) : Sequence product intermediate result matrix; d: Dimension; \(c=N_{class}-1\) , \(n=n_{cores}-1\) \(\,chunk_0=d/n_{cores},\,lb_0=core_{id}\,\times \,chunk_0,\,ub_0=lb_0+chunk_0\) \(\,chunk_1=N_{class}/n_{cores},\,lb_1=core_{id}\,\times \,chunk_1,\;\text{and}\;ub_1=lb_1+chunk_1\) .

4.5 Metric Space-based Algorithms

MS-based algorithms involve arranging data points in order of proximity based on computed distances. In this work, we consider the Euclidean metric shown in Equation (10). In addition, we provide a time-complexity analysis of alternative sorting algorithms when running on sequential and parallel platforms: (10) \(\begin{equation} \Vert p - q \Vert = \sqrt {\sum _{i=0}^{d-1} (p_{i} - q_{i})^2}. \end{equation}\)

4.5.1 k-Nearest Neighbor (kNN).

kNN is a non-parametric instance-based supervised learning algorithm widely used in classification problems [80]. Due to its simplicity and classification performance, the model has been adopted in gesture recognition ML systems [81] and bone cancer detection approaches [82].

Without learning a discriminative function from the training set A, kNN stores the whole set and delays computations until inference. Given a testing input x and a distance function, kNN computes the distance between x and every instance of A. The model then orders the instances of A by increasing distance. Finally, kNN classifies x as the most prevalent class among the k nearest neighbors of the query point.

4.5.2 k-Means.

k-Means [83] is a well-known unsupervised learning algorithm widely deployed in several domains, such as data mining [84] and pattern recognition [85]. Without requiring labeled data, the clustering method relies on an iterative procedure that partitions the training set A space into disjoint regions covering the original input space. Considering dividing A into k clusters \(U_{j\in [0,\,k-1]}\), each represented by an arbitrarily initialized d-dimensional centroid \(u_{j\in [0,\,k-1]}\), the iterative procedure consists of the following steps:

  • Distance calculation: compute the Euclidean distance \(\Vert p - q \Vert\) between A and clusters centroids \(u_j\), as indicated in Equation (11): (11) \(\begin{equation} d_{j\,+\,k\,\times \,i} = \Vert x_i - u_j \Vert \qquad j\in [0,k-1],\ i\in [0,N_{train}-1]. \end{equation}\)

  • Clusters allocation: assign data instances to the nearest centroid \(u_j\) according to Equation (12), where i represents the ith A instance and \(id_i\) the assigned cluster: (12) \(\begin{equation} id_i = \arg \min _{j} d_{j\,+\,k\,\times \,i} \qquad j\in [0,k-1],\ i\in [0,N_{train}-1]. \end{equation}\)

  • Centroids update: compute new centroid \(u^{new}_j\) coordinates by averaging the instances belonging to the corresponding cluster \(u^{old}_j\), as reported in Equation (13): (13) \(\begin{equation} u^{new}_j = \frac{\sum _{i = 0}^{N - 1} I\lbrace id_i = j\rbrace \ x_i}{\sum _{i = 0}^{N - 1} I\lbrace id_i = j\rbrace }, \qquad j\in [0,k-1]. \end{equation}\)

k-Means iterates the three steps until the distance between the previous centroids \(u^{old}_j\) and the current centroids \(u^{new}_j\) falls below a pre-fixed threshold, i.e., until the centroids no longer move significantly between iterations. In this work, we pick the first k elements of the training set A as the initial centroids of the k-Means clusters.
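One iteration of the three steps above can be sketched sequentially in C as follows (our illustration; squared distances suffice for the nearest-centroid test, so the square root of Equation (10) is skipped):

```c
#include <string.h>

/* One k-Means iteration (Equations (11)-(13)) on a d-dimensional
 * training set A of n samples; centroids is k x d, row-major, and is
 * updated in place; id receives the cluster assignment of each sample. */
void kmeans_step(const float *A, float *centroids, int *id,
                 int n, int d, int k) {
    /* Distance calculation + cluster allocation, Eqs. (11)-(12) */
    for (int i = 0; i < n; i++) {
        int best = 0;
        float best_d = 1e30f;
        for (int j = 0; j < k; j++) {
            float dist = 0.0f;
            for (int f = 0; f < d; f++) {
                float diff = A[i * d + f] - centroids[j * d + f];
                dist += diff * diff;      /* squared Euclidean distance */
            }
            if (dist < best_d) { best_d = dist; best = j; }
        }
        id[i] = best;
    }
    /* Centroids update, Eq. (13): average each cluster's members */
    float sum[k * d];                     /* C99 VLAs: small k and d */
    int count[k];
    memset(sum, 0, sizeof(float) * (size_t)(k * d));
    memset(count, 0, sizeof(int) * (size_t)k);
    for (int i = 0; i < n; i++) {
        count[id[i]]++;
        for (int f = 0; f < d; f++)
            sum[id[i] * d + f] += A[i * d + f];
    }
    for (int j = 0; j < k; j++)
        if (count[j] > 0)
            for (int f = 0; f < d; f++)
                centroids[j * d + f] = sum[j * d + f] / (float)count[j];
}
```

The caller iterates `kmeans_step` until the centroid displacement drops below the chosen threshold.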

4.5.3 Sorting Algorithms.

MS-based algorithms require arranging data points based on the computed distances. Traditional efficient sorting routines feature a favorable time complexity when dealing with complete sorting problems. However, kNN and k-Means demand a partial sort returning the k smallest elements and the single smallest one, respectively. Considering an n-sized input array, retrieving the lowest k elements without sorting the remaining \(n - k\) elements can lead to a significant speedup. For that purpose, we present a brief time-complexity analysis of two well-known sorting routines, highlighting their advantages and drawbacks when running on sequential and parallel platforms.

Quick Sort (QS) is a highly efficient in-place sorting algorithm based on a divide-and-conquer procedure. By selecting a pivot element, the routine partitions the input array into two sub-arrays and reorders them based on comparisons with the pivot. The procedure is then re-iterated recursively on the sub-arrays until the whole input array is ordered. The QS routine has an average time complexity of \(O(n\log _2{n})\) when executing on a single-core platform. Due to its divide-and-conquer nature, QS cannot exploit the partial-sorting requirement: the routine still orders the whole input array, making its adoption highly inefficient for MS-based algorithms.

Selection Sort (SS) is a simple in-place comparison-based sorting algorithm that separates the input array into two sub-arrays. Initially, the sorted sub-array is empty, while the unsorted sub-array consists of the whole input array. By finding the smallest element in the unsorted sub-array, the algorithm swaps it with the leftmost unsorted element and moves the sub-array boundary. Although SS offers a worse average time complexity (\(O(n^2)\)), it enables saving computations when tackling partial sorting problems. To return the k smallest elements, SS demands \(O(nk)\) comparisons, making its adoption in MS-based algorithms favorable compared to QS when \(k\lt \log _2{n}\). Deploying SS with k-Means is highly efficient, since the algorithm determines the closest centroid for each data instance, corresponding to \(k = 1\). Regarding kNN, the most efficient sorting algorithm strictly depends on the dataset dimension n and the hyperparameter k. In this work, we deploy a dataset consisting of 1k instances for kNN and k-Means, favoring SS when \(k \lt 10\).

When moving to a multi-core CL composed of c cores, the operating array is divided into c sub-arrays. Each core performs the sorting routine on its local sub-array, requiring \(O(\frac{n}{c}\log _2{(\frac{n}{c})})\) and \(O(\frac{n}{c}k)\) comparisons for QS and SS, respectively. To bring together the local results, an additional set of comparisons between the k smallest local elements is mandatory, requiring \(O(ck)\) comparisons. In Equation (14), we report the time complexity of the two sorting algorithms, noting that the parallelization introduces an equal overhead on both routines. Thus, running on a multi-core platform makes SS favorable compared to QS when \(k\lt \log _2{(\frac{n}{c})}\). As in the sequential execution, SS is still highly efficient for k-Means, while for kNN, the hyperparameter k determines the most efficient sorting algorithm. Considering the 1k-instance dataset used for kNN and k-Means, SS is favorable when \(k \lt 7\): (14) \(\begin{equation} QS = O\left(\frac{n}{c}\log _2{\left(\frac{n}{c}\right)}\right) + O(ck),\;\;\;\: SS = O\left(\frac{n}{c}k\right) + O(ck). \end{equation}\)
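A partial Selection Sort returning the k smallest distances together with their sample indices, as used conceptually by the MS-based kernels, can be sketched as (illustrative code):

```c
/* Partial Selection Sort: move the k smallest distances (with their
 * indices) to the front of the arrays in O(nk) comparisons, leaving the
 * remaining n-k elements unsorted. */
void partial_selection_sort(float *dist, int *idx, int n, int k) {
    for (int i = 0; i < k; i++) {
        int min = i;
        for (int j = i + 1; j < n; j++)
            if (dist[j] < dist[min])
                min = j;
        float td = dist[i]; dist[i] = dist[min]; dist[min] = td;
        int   ti = idx[i];  idx[i]  = idx[min];  idx[min]  = ti;
    }
}
```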

4.5.4 MS-based Algorithms Parallelization.

Figure 8 shows the parallelization approach designed to dispatch kNN inference onto the eight-core CL. The first operation (OP1) consists of computing the Euclidean distance between the query point \({ {\boldsymbol x}}\) and \({ {\boldsymbol A}}\), thus \(N_{train}\) distance operations. To fully leverage the CL compute power, we use a vertical decomposition scheme to split the workload and determine offline the chunk size on which each core works. At runtime, the cores calculate individual lower (lb) and upper (ub) bounds based on the \(core_{id}\) and perform the Euclidean distance computation on the corresponding chunk of \({ {\boldsymbol A}}\) rows. After filling an intermediate \(N_{train}\)-sized global array \({\boldsymbol e}\) with the results, the cores execute a k-elements Local Selection Sort (OP2) on the related chunk, saving the local k neighbors in an \(N_{train}\)-dimensional global buffer \({ {\boldsymbol l}}\). A CL synchronization barrier forces the cores to wait until all of them finish OP2. To bring together the intermediate results, the master core performs a k-elements Global Selection Sort (OP3) and returns the most voted class among the k neighbors via the ArgMax function.

Fig. 8. kNN parallelization approach. OP1: Euclidean Distance; OP2: k-elements Local Selection Sort; OP3: k-elements Global Selection Sort + ArgMax; \({ {\boldsymbol A}}\) : Training set; \({ {\boldsymbol e}}\) : Euclidean distance vector; \({ {\boldsymbol l}}\) : Local k-nearest neighbors vector; d: Dimension; k: Nearest-neighbors hyperparameter; \(N=N_{train},\,chunk=N/n_{cores},\,lb=core_{id}\,\times \,chunk,\; \text{and} \;ub=lb+chunk\) .

While kNN inference consists of a single procedure step, k-Means iterates a set of routines until the distance between \({ {\boldsymbol U}}_{new}\) and \({ {\boldsymbol U}}_{old}\) is smaller than a threshold. In this regard, we present the optimized design of a k-Means iteration to achieve peak performance when running on a multi-core platform.

As shown in Figure 9, the algorithm begins by calculating the Euclidean distance (\(OP1\)) between \({ {\boldsymbol A}}\) elements and each centroid \({ {\boldsymbol u}}_i\), thus demanding \(N \times k\) distance computations. To dispatch the workload efficiently onto the CL, we divide \({ {\boldsymbol A}}\) vertically by determining offline \(chunk_0\), which defines the number of \({ {\boldsymbol A}}\) rows assigned to each core. At runtime, we offload the distance computation to each core, using \(lb_0\) and \(ub_0\) to tag core-dependent data indexes. Since a core computes k distances for each \(chunk_0\) element, OP1 leads to an \(N \times k\)-dimensional result that we store in the global shared buffer \({ {\boldsymbol e}}\).

Fig. 9. kmeans parallelization approach. OP1: Euclidean distance calculation; OP2: Cluster ID allocation; OP3: Local centroids update; OP4: Global centroids update; \({ {\boldsymbol A}}\) : Training set; \({ {\boldsymbol e}}\) : Euclidean distance vector; \({ {\boldsymbol id}}\) : Cluster ID vector; \({ {\boldsymbol U}}_{old}\) : Initial cluster centroids; \({ {\boldsymbol U}}_{local}\) : Local cluster centroids; \({ {\boldsymbol U}}_{new}\) : New cluster centroids; \(N = N_{train}\) , \(chunk_0=N/n_{cores},\,lb_0=core_{id}\,\times \,chunk_0,\,ub_0=lb_0+chunk_0\) \(chunk_1=(N\,\times \,k)/n_{cores},\,lb_1=core_{id}\,\times \,chunk_1,\;\text{and}\;ub_1=lb_1+chunk_1\) .

In OP2, the increased vertical dimension \((N \times k)\) demands expanding the data chunk to \(chunk_1\), making each core work on k distances per \(chunk_0\) element. Thus, the cores find the closest centroid \({ {\boldsymbol u}}_i\) to each \(chunk_0\) element and assign the cluster ID. The results are saved in an \(N_{train}\)-sized array \({ {\boldsymbol id}}\) containing the cluster ID of each \({ {\boldsymbol A}}\) data sample. OP3 consists of a Local Centroids Update where each core accumulates and counts the \({ {\boldsymbol A}}\) instances belonging to the same centroid \({ {\boldsymbol u}}_i\), operating on its \(chunk_0\) elements. The operation ends with a CL synchronization barrier to ensure each core finishes its workload before moving to the following computation step. Last, we perform a Global Centroids Update (OP4) to pull together the local results \({ {\boldsymbol U}}_{local}\). Each core takes charge of computing the global value of the centroid \({ {\boldsymbol u}}_i\) corresponding to its \(core_{id}\), working on non-contiguous elements. Thus, the core accumulates the \({ {\boldsymbol U}}_{local}\) and count variables, using the \(core_{id}\) to retrieve data from the chunks, and divides them to obtain the new global centroid in \({ {\boldsymbol U}}_{new}\).

4.6 Random Forest

RF is a robust ML algorithm leveraging an ensemble of low-correlated randomized Decision Trees (DTs) to split the training set using feature space subsets [86]. Due to the low-variance nature and the capability to handle various data types effectively, the model has been largely deployed in several domain-specific applications such as Non-intrusive Load Monitoring [87] and anomaly detection [88].

Starting from the root node, DTs consist of several splitting nodes where an input feature \(x_i\) is evaluated against a test condition to determine the branch to be followed. Repeating the decision procedure over the entire structure, the DT reaches a leaf containing the predicted class. Last, RF returns the input prediction by aggregating the DT votes and picking the class with the highest number of votes.

To optimize the model execution on edge devices, we designed a custom DT implementation representing the model structure with arrays. This approach saves all tree structures into four arrays: feature, threshold, left child, and right child. By using the feature and threshold arrays, we evaluate the node comparison; based on the result, we pick the following node from the left- or right-child array. Last, we mark leaf nodes by writing a negative integer value in the corresponding ith element of the feature array.
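A possible sketch of this array-based inference follows. Where the predicted class of a leaf is stored is not specified above, so this example assumes, as an illustrative choice, that it is kept in the leaf's threshold slot:

```c
/* Array-based DT inference sketch: four parallel arrays encode the tree.
 * A negative `feature` entry marks a leaf; by assumption (ours), the
 * predicted class is stored in the leaf's threshold slot. */
int dt_predict(const int *feature, const float *threshold,
               const int *left, const int *right, const float *x) {
    int node = 0;                            /* start from the root */
    while (feature[node] >= 0) {             /* splitting node */
        if (x[feature[node]] <= threshold[node])
            node = left[node];
        else
            node = right[node];
    }
    return (int)threshold[node];             /* leaf: class label */
}
```

RF inference would then call `dt_predict` once per tree and aggregate the votes.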

4.6.1 RF Parallelization Approach.

The DT algorithmic structure prevents a priori knowledge of the pathway taken toward the leaf at compile time. The model reveals the taken branches only by evaluating the input x at runtime, and this unpredictability complicates DT parallelization. In this regard, we adopt a parallelization scheme that assigns the execution of each whole DT to a single core, statically distributing the DTs among the available cores.

In Figure 10, we illustrate the parallel algorithm designed to offload the RF execution onto multi-core platforms, maximizing compute power utilization. To efficiently dispatch the RF model onto the CL, we determine offline a chunk size representing the number of DTs assigned to each core. By computing the core-dependent lb and ub, each core retrieves its assigned \(DT_{id}\)s and computes the results for the assigned DTs. A Critical Section (CS) prevents multiple cores from accessing the Vote Update section simultaneously; there, we aggregate the DT results atomically by incrementing the predicted class in a vote array. Last, a CL Synchronization Barrier ensures that each core finishes its workload before moving to the ArgMax function, which retrieves the final prediction.

Fig. 10.

Fig. 10. RF parallelization approach. \(DT_i\) : ith Decision Tree; CS: Critical Section; d: Dimension; \(chunk=N_{trees}/n_{cores}, \, lb=core_{id}\,\times \,chunk, \;\text{and}\; ub=lb+chunk\) .
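Using the chunking formulas from the caption of Figure 10, the per-core work sharing and the final vote aggregation can be sketched as follows. The helper names are illustrative, the per-tree predictions are passed in precomputed instead of calling a DT inference routine, and the cores are emulated by a plain loop:

```c
#define RF_TREES 8     /* number of DTs in the forest (illustrative) */
#define RF_CORES 4     /* cores in the cluster (illustrative) */
#define RF_CLASSES 3

/* Static work sharing from Figure 10: each core receives
 * chunk = RF_TREES/RF_CORES trees and accumulates their votes.
 * On the real platform the vote increment is guarded by a CS. */
static void rf_core_work(int core_id, const int *dt_class, int *votes) {
    int chunk = RF_TREES / RF_CORES;
    int lb = core_id * chunk, ub = lb + chunk;
    for (int t = lb; t < ub; t++)
        votes[dt_class[t]]++;   /* CS-protected increment on the real HW */
}

/* Final sequential step after the synchronization barrier:
 * argmax over the vote array returns the forest prediction. */
static int rf_argmax(const int *votes) {
    int best = 0;
    for (int c = 1; c < RF_CLASSES; c++)
        if (votes[c] > votes[best]) best = c;
    return best;
}
```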


5 EXPERIMENTAL EVALUATION

This section presents the results of our design, optimized for parallel execution through fine-grained analysis and intensive optimization. We report the execution time of the Non-neural ML algorithms, considering two alternative FP emulation libraries and native FPU support. By comparing the single-core kernel executions, we point out the performance improvement obtained by switching from a standard to a custom RISCV-based emulation library, and then to an FPU-native platform. We also compare the speedups achieved on each target platform by leveraging the eight-core CL compute power and the optimized parallel algorithm design. To clarify the achieved results, we conducted an analysis to determine the non-ideality sources and architectural factors at play when performance is sub-optimal.

Section 5.1 describes the adopted experimental setup and the ML framework deployed to train the Non-neural ML kernels. A comparison of the sequential execution overhead between alternative FP emulation supports and an FPU-native platform is discussed in Section 5.2. After presenting in Section 5.3 the achieved speedups by fully exploiting the CL compute power, we illustrate an in-depth comparison of the execution time between PULP-OPEN and ARM Cortex-M4 in Section 5.4.

5.1 Setup

The experimental analysis has been conducted using two different target platforms. The GAPUINO development board11 represents a commercial solution integrating GAP8 with a rich set of peripheral interfaces for fast prototyping of embedded applications. A JTAG bridge allows programming the onboard FLASH memory and debugging GAP8 code. In addition, the hardware design includes a set of Special-purpose Registers (SPRs) to store the count of hardware-related events at the core level. These non-intrusive per-core performance counters enable fine-grained performance analyses, measuring events related to instructions (executed instructions, total and active cycles) and memory accesses (I$ misses, TCDM contentions, and L2/TCDM memory stalls). In this work, we use the GAPUINO board to profile the Non-neural ML algorithms performance on GAP8 with both a standard and a custom software FP library. We set the FC clock frequency to 250 MHz, while the CL runs at 150 MHz.

We also performed experiments on the PULP-OPEN architecture, thus leveraging native FPU support. To emulate the microarchitecture, we used a hardware emulator running on a Xilinx UltraScale+ VCU118 FPGA board.12 The architecture emulation enables faster experiments than RTL-equivalent simulations while providing cycle-accurate results. In addition to the performance counters provided by GAP8, the PULP-OPEN design supports recording FPU pipeline-related events (FPU stalls, contentions, and write-back stalls). Using the Vivado Design Suite, we generate and load the microarchitecture bitstream on the FPGA. An OpenOCD interface with GDB support mapped on GPIO pins allows uploading the application binary code into the L2 memory and running the program. A virtual UART mapped on a dedicated USB port enables reading results from an emulated terminal. In this work, the FPGA clock frequency has been set to 20 MHz.

To characterize performance, we selected three datasets widely adopted by the TinyML community and contained in the MLPerf Tiny benchmark suite [50]. Speech Commands is an audio dataset of spoken words designed to build Keyword Spotting systems, consisting of 105k utterances from 2.6k different speakers. The dataset supports 35 English words and a collection of background noises, where each speech sample is 1 s long. Following the MLPerf Tiny reference implementation, we deployed a subset of the dataset consisting of 10 words. We used the remaining words to approximate the “unknown” label, which, along with “silence,” results in 12 output classes. As pre-processing, we used 10 Mel-frequency cepstral coefficients (MFCC) features extracted from a 40 ms long speech frame with a stride of 20 ms, resulting in 490 features for 1 s of audio. We used Speech Commands to benchmark the GEMM-based algorithms in this work. To test the MS-based algorithms, we deployed the ToyADMOS dataset for anomaly detection in machine operating sounds. According to the MLPerf Tiny benchmark suite, we used only the Toy-car machine type among the available ones. For training, we deployed 7k normal sound samples from seven Toy-cars, each delivering 1k machine sound samples mixed with environmental noise. We pre-processed the audio into a log-mel-spectrogram with 128 bands and a sliding window of five frames, leading to a 640 input size. Regarding k-Means, we adopted two 640-dimensional clusters to divide the training set, and four nearest neighbors for kNN. CIFAR-10 is a multi-class labeled dataset consisting of 60k \(32\times 32\) RGB images, divided into 50k training instances and a 10k testing set. The dataset represents the de-facto standard for TinyML benchmarking, since the low image resolution makes CIFAR-10 the most suited data source for training tiny image classification models. We used CIFAR-10 to benchmark the IT-based algorithm and GNB in this work.

We performed the training of the algorithms entirely relying on the Scikit-Learn ML framework, leveraging its front-end to dump model parameters and structures. Whenever the model parameters do not fit the L1 memory, we place the data into the L2 level and use the double-buffering wrapper to optimally overlap DMA operations with kernel processing. To guarantee efficient runtimes, we initially optimized the sequential version of the Non-neural ML algorithms on each platform. We thoroughly investigated the kernel execution using non-intrusive performance counters to optimize the instruction-level scheduling for the four-stage in-order single-issue pipeline adopted by both target cores. We used the L1 load stall counter to limit hazards due to data dependencies while monitoring branch stalls to minimize pipeline flushing. We also leveraged the I$ misses counter to investigate cache locality issues. This in-depth analysis led to the highest attainable CPU utilization, achieving near-optimal Clocks per Instruction (CPI) for most algorithms. In the parallel version, we focused on reducing TCDM contentions to limit the cycles wasted when multiple cores attempt to read data from the same memory bank. Furthermore, we reduced the use of parallel programming primitives to the bare minimum to limit synchronization overheads. Last, we conducted extensive benchmarking considering all FP emulation supports and platforms, measuring the execution cycles and other statistics for each variant.

5.2 Benchmarking Floating-point Emulation Libraries versus FPU-Native Support

In Figure 11, we show the cycles, latency, and energy required by the Non-neural ML algorithms for a sequential execution on the two RISCV-based PULP MCUs, with the two alternative FP emulation libraries for GAP8. On top of the cycles columns, we report the speedup achieved compared to the baseline, which consists of executing the kernels on GAP8 with libgcc support for FP emulation. For the energy and latency values, we indicate the percentage decrease compared to the baseline. Table 3 reports the code size of the algorithms and its percentage reduction when moving from the baseline to RVfplib emulation and then to the FPU-native system. Last, we present in Table 2 the execution statistics for each kernel and platform configuration, along with the architectural non-idealities retrieved from the performance counters. Pipeline Non-idealities (N.I.) refers to the sum of architectural factors owed to the core pipeline (stalls related to memory load latency and taken branches), while FPU N.I. accounts for the FPU-related events limiting efficiency (write-backs, contentions, and dependencies). libgcc emulation leads to the highest CPI, ranging from 1.28 to 1.45, due to the high usage of branching conditions and of global variables placed into L2 memory by the GCC toolchain. Moving from the baseline to the custom RISCV-based RVfplib emulation library reduces the execution times, achieving 1.36–1.9× speedups on GAP8 and a lower 1.18–1.33 CPI. Employing fast SW FP emulation routines on FPU-less processors brings several further benefits for TinyML: latency decreases by up to 47.34%, while energy consumption is reduced by 26.27%–47.34%. Adopting the FPU-native PULP-OPEN platform decreases pipeline N.I. and FPU factors to 1% of the execution time, reaching down to a 1.12 CPI and up to a 32.09× performance improvement compared to the baseline. Consequently, native FP support yields larger latency and energy reductions, ranging from 59.74% to 99.1% compared to libgcc adoption on GAP8.

Fig. 11.

Fig. 11. Non-neural ML algorithms cycles, latency, and energy on a single-core GAP8 and PULP-OPEN configuration.

Table 2.
Kernel | Platform | FP Instr. (%) | Cycles | Instr. | CPI | Speedup | Pipeline N.I. | I$ Misses | Ext. LD | FPU N.I.
SVM | GAP8 + libgcc | 89.98 | 757k | 548k | 1.38 | – | 146k | 7.6k | 4.5k | –
SVM | GAP8 + RVfplib | 69.06 | 447k | 335k | 1.33 | 1.69 | 92.7k | 16.3k | 1 | –
SVM | PULP-OPEN | 24.89 | 29.6k | 23.7k | 1.25 | 25.56 | 5.9k | 25 | 1 | 0
LR | GAP8 + libgcc | 90.16 | 796k | 570k | 1.40 | – | 150k | 24.8k | 4.60k | –
LR | GAP8 + RVfplib | 68.65 | 463k | 351k | 1.32 | 1.72 | 96.8k | 37 | 1 | –
LR | PULP-OPEN | 24.98 | 30.9k | 24.6k | 1.26 | 25.75 | 6.10k | 51 | 1 | 84
GNB | GAP8 + libgcc | 92.42 | 86.4M | 67.4M | 1.28 | – | 15.9M | 3.38M | 16.1k | –
GNB | GAP8 + RVfplib | 57.67 | 62.0M | 50.1M | 1.24 | 1.39 | 11M | 387k | 1 | –
GNB | PULP-OPEN | 27.25 | 3.05M | 2.72M | 1.12 | 28.34 | 279k | 37.9k | 1 | 30.7k
RF | GAP8 + libgcc | 54.23 | 1.01M | 695k | 1.45 | – | 344k | 39.9k | 1 | –
RF | GAP8 + RVfplib | 29.98 | 742k | 629k | 1.18 | 1.36 | 78.8k | 18.5k | 1 | –
RF | PULP-OPEN | 6.39 | 405k | 350k | 1.16 | 2.48 | 70.5k | 19.9k | 1 | 0
kNN | GAP8 + libgcc | 90.49 | 117M | 80.7M | 1.45 | – | 29.1M | 1.57M | 554k | –
kNN | GAP8 + RVfplib | 69.68 | 61.6M | 46.5M | 1.32 | 1.9 | 13.3M | 635k | 15 | –
kNN | PULP-OPEN | 45.5 | 3.64M | 2.85M | 1.28 | 32.09 | 735k | 36.6k | 1 | 50
kMEANS | GAP8 + libgcc | 74.82 | 625k | 466k | 1.34 | – | 89.4k | 8.39k | 515 | –
kMEANS | GAP8 + RVfplib | 48.27 | 395k | 315k | 1.25 | 1.58 | 45.4k | 525 | 1 | –
kMEANS | PULP-OPEN | 40.64 | 20.5k | 18.3k | 1.26 | 30.44 | 2.8k | 41 | 1 | 44

Table 2. Runtime Statistics and Architectural Factors Executing the Non-neural ML Algorithms on a Single-core GAP8 and PULP-OPEN Configuration, Leveraging Libgcc and RVfplib for FP Emulation on GAP8

Table 3.
Platform | SVM | LR | GNB | RF | kNN | k-Means
GAP8+libgcc | 21.4 kB | 23.11 kB | 25.59 kB | 21.22 kB | 23.17 kB | 22.9 kB
GAP8+RVfplib | 19.9 kB (\(\downarrow\)7.3%) | 21.3 kB (\(\downarrow\)7.9%) | 23.7 kB (\(\downarrow\)7.3%) | 20.4 kB (\(\downarrow\)3.9%) | 21.5 kB (\(\downarrow\)7%) | 21.3 kB (\(\downarrow\)7%)
PULP-OPEN | 13 kB (\(\downarrow\)39%) | 13.5 kB (\(\downarrow\)42%) | 15.4 kB (\(\downarrow\)40%) | 13 kB (\(\downarrow\)39%) | 14.2 kB (\(\downarrow\)39%) | 13.8 kB (\(\downarrow\)40%)

Table 3. Non-neural ML Kernels Code Size on a Single-core GAP8 and PULP-OPEN Configuration, Leveraging Libgcc and RVfplib for FP Emulation on GAP8

The GEMM-based algorithms demand a matrix-vector multiplication, which requires a sequence of FP mul and add operations at the low level. When executing these kernels on the baseline, the libgcc __mulsf3 and __addsf3 emulation routines (multiplication and addition between single-precision FP variables, respectively) slow down the runtime, requiring about 800 kcycles per inference. Compiling the GAP8 code with the RISCV-based emulation library decreases the execution time to almost 450 kcycles, thanks to the lower latency of the RVfplib routines obtained by leveraging the PULP ISA extensions. With its native support for single-cycle FP arithmetic instructions, PULP-OPEN decreases the execution time further, leading to a 25.56–25.75× speedup compared to the baseline.

In the GNB model, the normal Gaussian distribution calculation requires executing high-latency transcendental functions (i.e., expf and logf), thus making the algorithm compute-intensive. As a result, running the kernel on the baseline setup demands an order of magnitude higher execution time than the previous algorithms, namely, 86.4 Mcycles. By deploying RVfplib on GAP8, the execution time decreases to 62 Mcycles, with a 0.3× speedup drop compared to the performance of GEMM-based kernels. Transcendental functions involve a high usage of the __divsf3 routine, which slows down the execution when passing from libgcc to RVfplib emulation support. As a consequence, the expf and logf routines present only a 1.2× average speedup with respect to the baseline. Overall, transcendental functions severely limit the RVfplib speedup, since they account for 20% of the GNB execution time. Furthermore, taken branches (TBs) account for 17.78% of the GNB computational time and decrease by up to 5% less than in GEMM-based kernels, thus limiting the runtime improvement. Moving the execution onto PULP-OPEN further reduces the running time to 3.05 Mcycles, thus reaching a 28.34× speedup compared to the baseline. The reduction of load stalls to almost 0% of the execution time enables a 3× relative speedup increase compared to GEMM-based kernels, where load stalls represent 19% of the computation time.

Due to its limited usage of FP computations, RF shows lower gains when switching the FP emulation support and moving to an FPU-native platform. On the baseline, RF demands about 1.01 Mcycles, deploying only the __lesf2 libgcc emulation routine to compare feature values with thresholds. With 54.23% FP instructions, RVfplib can improve only a limited fraction of the workload, thus leading to 742 kcycles with a 1.36× speedup compared to the baseline. Leveraging the PULP-OPEN FPU reduces the execution time to about 405 kcycles with a reduced speedup of 2.48×, owing to the 6.39% kernel FLOP intensity.

Running kNN on GAP8 with libgcc FP emulation support requires 117 Mcycles per inference. Since the algorithm leverages the same FP emulation routines as the GEMM-based kernels, with the addition of __subsf3, the 1.9× speedup achieved with RVfplib is mainly due to architectural factors. While TBs increase by 2.01% of the execution time in GEMM-based kernels, kNN presents a TBs decrease of almost 3% of the computing time when moving from libgcc to RVfplib. The previous algorithms feature 24.89–27.25% FP instructions, while kNN reaches up to 45.5% due to 21.2M FP instructions out of a total of 46.5M instructions. As a result, the kernel gains more from the FPU compute power, leading to a 32.09× speedup compared to the baseline when deploying PULP-OPEN.

kMEANS takes about 625 kcycles on the baseline, while leveraging RVfplib on GAP8 reaches a 1.58× speedup, reducing the runtime to 395 kcycles. The lower FP rate of kMEANS compared to kNN explains the 0.3× performance drop when switching from libgcc to RVfplib FP support: while kNN spends 90.49% of its instructions emulating FP computations, kMEANS uses only 74.82% of the overall workload, thus leading to a speedup decrease. Running the kernel on PULP-OPEN, the execution time decreases to almost 20.5 kcycles, improving performance by 30.44× compared to the baseline. Presenting a reduced FLOP intensity of 40.64% and a larger LD stalls increase compared to kNN, the kernel achieves a slightly lower speedup.

Adopting SW-optimized FP emulation libraries on FPU-less IoT platforms also brings several advantages in terms of latency and energy efficiency. GEMM- and MS-based algorithms are almost dominated by FP computations, featuring 75% to 90% FP instructions. Leveraging the small, optimized RVfplib routines leads to a 36.7%–47.34% energy usage reduction, demanding about 190 \(\mu\)J per GEMM-based and k-Means inference and 26.4 mJ for kNN. Consequently, such Non-neural ML kernels present higher latency percentage reductions, enabling inferences on GAP8 in about 352 ms for kNN and 2.5 ms for the remaining kernels. The high usage of transcendental routines in GNB and the reduced FP computation ratio in RF limit their energy and latency improvements to 26.2%–28.8% compared to libgcc deployment on GAP8. Instead, leveraging the PULP-OPEN native FPU support reduces such resources by up to 99%, requiring down to 3.7 \(\mu\)J and 75 \(\mu\)s per GEMM-based inference.

Adopting RVfplib on GAP8 to execute RF reduces the code size by only 3.9% due to the low FP computation ratio, while the other kernels reach up to a 7.9% reduction. Last, the PULP-OPEN native FPU support decreases the code size by up to 42% compared to libgcc support.

5.3 Parallel Performance

In Figure 12, we report the cycles, latency, and energy required by the Non-neural ML kernels, comparing sequential and parallel execution on PULP-OPEN and GAP8. To assess the parallelization performance, we also report the one- versus eight-core parallel speedup in Figure 13 and indicate the percentage loss between the achieved and the ideal speedup on top of each column. Furthermore, Table 4 gives a deeper insight into the results by providing measurements of the architectural factors limiting the speedup, retrieved from the platform performance counters. The considered ML kernels consist of a workload divided into fully parallelizable sections and inherently sequential portions. Therefore, the table also reports the theoretical speedup of the Non-neural ML kernels when using multiple processors. We profiled the execution time of the sequential code sections for each platform configuration and applied Amdahl’s law, given in Equation (15): (15) \(\begin{equation} Speedup = \frac{1}{(1-p) + \frac{p}{N}}. \end{equation}\) Amdahl’s law has two parameters: p is the fraction of parallelizable code, and N is the total number of available cores. This formula provides an ideal bound for the theoretical speedup, since it does not take the parallelization overheads into account. The optimized parallel design introduced in Section 4 enables reaching near-ideal speedups ranging from 6.56× to 7.64× compared to a single-core execution. By reducing TCDM contentions to at most 4.25% of the execution time and improving the instruction scheduling, we achieve parallel CPIs ranging from 1.32 to 1.72.
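Equation (15) is straightforward to evaluate. As an illustrative (not article-reported) example, a kernel whose profiling indicates roughly 99.3% parallelizable code caps at about 7.6× on eight cores, in line with the theoretical speedups reported in Table 4:

```c
/* Amdahl's law (Equation (15)): upper bound on the parallel speedup
 * given the parallelizable fraction p and the core count N. */
static double amdahl_speedup(double p, int n_cores) {
    return 1.0 / ((1.0 - p) + p / (double)n_cores);
}
```

A fully parallel kernel (p = 1) reaches the ideal 8× on the eight-core CL; any sequential residue, such as the argmax step, pulls the bound below 8×.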

Fig. 12.

Fig. 12. One- versus eight-core non-neural ML algorithms cycles, latency, and energy comparison. Abbreviations: RVfp (RVfplib).

Fig. 13.

Fig. 13. Non-neural ML kernels parallel performance on GAP8 and PULP-OPEN. Abbreviations: G8 (GAP8), RVfp (RVfplib), PULP-O (PULP-OPEN).

Table 4.
Kernel | Platform | Cores | Cycles | Instr. | CPI | Speedup | Theor. Speedup | Pipeline N.I. | I$ Misses | TCDM | Ext. LD | FPU N.I.
SVM | GAP8 + libgcc | 1 | 757k | 548k | 1.38 | – | – | 146k | 7.6k | 0 | 4.5k | –
SVM | GAP8 + libgcc | 8 | 108k | 62.6k | 1.72 | 7.03 | 7.94 | 19.7k | 4.86k | 115 | 67 | –
SVM | GAP8 + RVfplib | 1 | 447k | 335k | 1.33 | – | – | 92.7k | 16.3k | 0 | 1 | –
SVM | GAP8 + RVfplib | 8 | 65.5k | 45.2k | 1.45 | 6.83 | 7.94 | 12.3k | 2.67k | 16 | 2 | –
SVM | PULP-OPEN | 1 | 29.6k | 23.7k | 1.25 | – | – | 5.9k | 25 | 0 | 1 | 0
SVM | PULP-OPEN | 8 | 4.20k | 3.17k | 1.32 | 7.05 | 7.83 | 740 | 46 | 165 | 2 | 4
LR | GAP8 + libgcc | 1 | 796k | 570k | 1.4 | – | – | 150k | 24.8k | 0 | 4.60k | –
LR | GAP8 + libgcc | 8 | 112k | 66.5k | 1.69 | 7.07 | 7.88 | 19.9k | 6.26k | 165 | 78 | –
LR | GAP8 + RVfplib | 1 | 463k | 351k | 1.32 | – | – | 96.8k | 37 | 0 | 1 | –
LR | GAP8 + RVfplib | 8 | 67.8k | 45.5k | 1.49 | 6.83 | 7.95 | 11.6k | 3.88k | 12 | 4 | –
LR | PULP-OPEN | 1 | 30.9k | 24.6k | 1.26 | – | – | 6.10k | 50 | 1 | 1 | 84
LR | PULP-OPEN | 8 | 4.66k | 3.34k | 1.39 | 6.63 | 7.88 | 766 | 283 | 198 | 3 | 80
GNB | GAP8 + libgcc | 1 | 86.4M | 67.4M | 1.28 | – | – | 15.9M | 3.38M | 0 | 16.1k | –
GNB | GAP8 + libgcc | 8 | 11.5M | 8.22M | 1.4 | 7.49 | 7.89 | 1.99M | 785k | 453 | 2.07k | –
GNB | GAP8 + RVfplib | 1 | 62.0M | 50.1M | 1.24 | – | – | 11M | 387k | 0 | 1 | –
GNB | GAP8 + RVfplib | 8 | 8.09M | 6.09M | 1.33 | 7.64 | 7.96 | 1.37M | 299k | 507 | 62 | –
GNB | PULP-OPEN | 1 | 3.05M | 2.72M | 1.12 | – | – | 279k | 37.9k | 0 | 1 | 30.7k
GNB | PULP-OPEN | 8 | 463k | 345k | 1.34 | 6.56 | 7.91 | 34.7k | 16.8k | 1.49k | 62 | 44.1k
RF | GAP8 + libgcc | 1 | 1.01M | 695k | 1.45 | – | – | 344k | 39.9k | 0 | 1 | –
RF | GAP8 + libgcc | 8 | 151k | 89.5k | 1.69 | 6.66 | 7.92 | 43.3k | 11.4k | 420 | 60 | –
RF | GAP8 + RVfplib | 1 | 742k | 629k | 1.18 | – | – | 78.8k | 18.5k | 0 | 1 | –
RF | GAP8 + RVfplib | 8 | 111k | 81.2k | 1.36 | 6.7 | 7.9 | 10.4k | 2.46k | 600 | 60 | –
RF | PULP-OPEN | 1 | 405k | 350k | 1.16 | – | – | 70.5k | 19.9k | 0 | 1 | 0
RF | PULP-OPEN | 8 | 59.4k | 44.1k | 1.35 | 6.82 | 7.81 | 9.16k | 1.32k | 1.08k | 60 | 0
kNN | GAP8 + libgcc | 1 | 117M | 80.7M | 1.45 | – | – | 29.1M | 1.57M | 0 | 554k | –
kNN | GAP8 + libgcc | 8 | 15.4M | 10.1M | 1.52 | 7.59 | 7.94 | 3.64M | 808k | 1.58k | 69.5k | –
kNN | GAP8 + RVfplib | 1 | 61.6M | 46.5M | 1.32 | – | – | 13.3M | 635k | 0 | 15 | –
kNN | GAP8 + RVfplib | 8 | 8.2M | 5.84M | 1.4 | 7.51 | 7.93 | 1.67M | 608k | 1.69k | 225 | –
kNN | PULP-OPEN | 1 | 3.64M | 2.85M | 1.28 | – | – | 735k | 36.6k | 0 | 5 | 0
kNN | PULP-OPEN | 8 | 548k | 377k | 1.45 | 6.65 | 7.59 | 91.4k | 7.09k | 858 | 225 | 253
kMEANS | GAP8 + libgcc | 1 | 625k | 466k | 1.34 | – | – | 89.4k | 8.39k | 0 | 515 | –
kMEANS | GAP8 + libgcc | 8 | 83.6k | 59.3k | 1.41 | 7.47 | 8 | 12.7k | 3.66k | 99 | 8 | –
kMEANS | GAP8 + RVfplib | 1 | 395k | 315k | 1.25 | – | – | 45.4k | 525 | 0 | 1 | –
kMEANS | GAP8 + RVfplib | 8 | 54.2k | 39.9k | 1.36 | 7.29 | 8 | 6.83k | 2.62k | 10 | 1 | –
kMEANS | PULP-OPEN | 1 | 20.5k | 18.3k | 1.26 | – | – | 2.8k | 41 | 0 | 1 | 44
kMEANS | PULP-OPEN | 8 | 2.94k | 2.17k | 1.35 | 6.98 | 8 | 353 | 41 | 41 | 1 | 0

Table 4. Runtime Statistics and Architectural Factors Executing the Non-neural ML Algorithms on a One- and Eight-core Configurations

To retrieve the highest predicted probability, the GEMM-based kernels rely on the sequential argmax routine. Thus, the theoretically achievable speedup decreases to 7.83×–7.95×, depending on the deployed platform and FP emulation support. The parallel algorithm design achieves speedups between 6.63× and 7.07× across the configurations. When emulating FP computations on GAP8, I$ misses do not scale linearly with the number of cores, increasing from almost zero to 5.72% of the parallel execution time for LR with RVfplib support. While the other non-idealities are negligible, I$ misses limit the speedup to \(7.07\times\) for libgcc and \(6.63\times\) for RVfplib when running GEMM-based kernels on GAP8. On the PULP-OPEN platform, the parallel computing time decreases to 4.20–4.66 kcycles, leaving only minor non-ideality sources to affect the performance. Among the most significant, TCDM contentions represent 3.92–4.25% of the PULP-OPEN eight-core execution time, bounding the speedup. Moreover, I$ misses increase when offloading the kernel computation onto the CL; in particular, LR shows an I$ misses rise from nearly zero to 6.08% of the parallel runtime. The FPU non-idealities explain up to 1.74% of the parallel execution time, thus not limiting CL utilization. Despite the above-mentioned architectural factors, the optimized algorithm design reaches a 6.63×–7.05× parallel speedup on PULP-OPEN.

Emulating the GNB FP computations on the GAP8 eight-core CL improves the sequential execution by 7.49× with libgcc FP support and 7.64× with the custom RVfplib library. The architectural factor limiting the speedup with both emulation supports is I$ misses, since they decrease only slowly when moving to the parallel execution. Performing the kernel on PULP-OPEN leads to non-negligible FPU non-idealities, which double compared to the sequential execution and account for almost 10% of the parallel runtime. Concurrently, several architectural factors contribute to limiting the CL compute efficiency; in particular, I$ misses do not scale linearly and cover 3.63% of the parallel execution time. Overall, leveraging the eight-core PULP-OPEN CL decreases the GNB inference to 463 kcycles, reaching a 6.56× speedup compared to a single-core execution.

Architectural non-idealities have their most significant impact on the CL efficiency when dispatching the RF kernel onto the eight-core engine. Deploying libgcc to emulate the FP comparison operations, the runtime reduces to 151 kcycles with a speedup of 6.66×. Accordingly, RVfplib decreases the computing time from 742 to 111 kcycles, enabling a 6.7× performance improvement. In addition to the sequential argmax routine limiting the gain to a 7.9× theoretical speedup, I$ misses and TCDM contentions bound the performance, accounting for 3%–7% of the parallel execution time. PULP-OPEN achieves a 6.82× computation time improvement compared to a single-core execution, against a theoretical speedup of 7.81×. The reduced kernel FLOP intensity (6.39%) involves a low FPU usage, thus leading to zero FPU pipeline non-idealities. I$ misses and TCDM contentions are the main architectural factors limiting the performance, impacting almost 4% of the parallel computation time.

Offloading the kNN computations to the GAP8 eight-core CL with libgcc emulation support reduces the execution time from 117 to 15.4 Mcycles, thus reaching a \(7.59\times\) speedup. Leveraging the optimized RVfplib library, the kNN parallel design improves the single-core running time by \(7.51\times\). In both implementations, I$ misses limit the CL compute power utilization, since they scale sub-linearly with the number of cores while accounting for 5.24%–7.41% of the parallel execution time. Running the kernel on PULP-OPEN improves the runtime from 3.64 Mcycles to 548 kcycles, leading to a 6.65× speedup. Due to the reduced PULP-OPEN execution time, the sequential code weighs more on the computation and limits the theoretical speedup to 7.59×, with 28 kcycles executed by a single core. Furthermore, architectural factors such as I$ misses, TCDM contentions, and Ext-LD restrict the runtime reduction when offloading the kNN computations to the PULP-OPEN eight-core CL.

Considering the remaining MS-based algorithm, kMEANS features a 7.47× and 7.29× runtime improvement compared to a sequential execution when deploying libgcc and RVfplib on GAP8, respectively. While the theoretical speedup attains almost 8×, architectural non-idealities limit the speedup when leveraging the eight-core CL: I$ misses account for a large portion of the parallel execution time, decreasing only slowly with libgcc and growing from nearly zero to 4.83% of the parallel computing time with RVfplib emulation support. Switching to the PULP-OPEN platform, the FPU-native system decreases the 20.5 kcycles single-core execution time to 2.94 kcycles on the eight-core CL. Along with I$ misses, architectural factors such as TCDM contentions and Ext-LD contribute to bounding the speedup improvement to 6.98×.

Adopting the optimized parallel designs of the Non-neural ML kernels on top of PULP processors also offers several benefits for latency and energy efficiency, which are crucial in the TinyML domain. By fully leveraging the eight-core CL compute power, we perform the kernels with latency and energy decreases ranging from 85% to 87%. Executing parallel k-Means and the GEMM-based algorithms on the PULP-OPEN platform requires only 7.35–11 \(\mu\)s and 0.36–0.55 \(\mu\)J per inference. While RF demands 149 \(\mu\)s and 7.34 \(\mu\)J, dispatching GNB and kNN onto the eight-core CL reduces the latency and energy usage to 1.2–1.4 ms and 57–67 \(\mu\)J.

5.4 Comparison with Cortex-M4

This section compares the execution time of the Non-neural ML kernels between PULP-OPEN and the ARM Cortex-M413 architecture. This comparison focuses on single-core sequential execution, because the techniques proposed for code parallelization require minimal runtime support and are, to a large degree, orthogonal to the ISA and the core micro-architecture. We used for the comparison an STM32F414 MCU, since it belongs to a widespread, commercially successful ultra-low-power MCU family. The STM32F4 features the Adaptive Real-time (ART) memory accelerator to speed up instruction fetches, along with DSP and FPU instruction support. For the experimental evaluation, we optimized the Non-neural ML algorithms for the Cortex-M4 target using CMSIS-DSP routines and custom-coded functions for those not provided in the library. In particular, we leveraged the CMSIS-DSP GNB and linear SVM implementations, while the LR design for Cortex-M4 uses the optimized dot product included in the library. The CMSIS-DSP Euclidean distance routine embeds the square root calculation; we thus improved the distance metric by removing this multi-cycle operation in the MS-based algorithms. Since there is no CMSIS-DSP support for RF, we coded the kernel for the STM32F4 target using the same optimization strategies devised for the sequential implementation on PULP.

Figure 14 reports the cycles required for the sequential execution of the ML benchmarks on Cortex-M4 and PULP-OPEN. The figure also reports the results executing on the eight-core CL as a further reference. We report the achieved speedup w.r.t. the Cortex-M4 on top of the bars. Focusing on the sequential execution, PULP-OPEN achieves speedups ranging from 1.36× to 2.39× compared to Cortex-M4. While RF execution on PULP-OPEN achieves a 1.36× execution time decrease, GEMM-based kernels reach up to a 2.39× runtime improvement. Along with GNB, MS-based algorithms attain an intermediate improvement result with a 1.74×–1.94× speedup.

Fig. 14.

Fig. 14. ARM Cortex-M4 vs. PULP-OPEN comparison.

Both architectures execute kernels optimized explicitly for their ISA, and execution time is expressed in cycles (i.e., it is independent of frequency). This gap is due to three main factors: single-cycle load operations, hardware loop support, and fused multiply-and-add FP operations. Load operations are executed in a single cycle when programmers adopt techniques to reduce data dependencies inside the loop body (e.g., loop unrolling). Adopting hardware loops saves one register, removes the overhead of updating the loop counter, and avoids pipeline stalls when the branch is taken. Finally, multiply-and-accumulate operations require two cycles on PULP-OPEN, but they are pipelined so that the throughput is close to 1 op/cycle when the compiler avoids data dependencies on the output register.
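The accumulator-splitting technique behind the last two factors can be illustrated with a two-way unrolled dot product. This is a generic C sketch, not the article's kernel:

```c
/* Dot product unrolled with two independent accumulators. Splitting the
 * reduction removes the data dependency on a single output register,
 * letting a pipelined FP multiply-add sustain close to 1 op/cycle;
 * n is assumed even for brevity. */
static float dot_unrolled(const float *a, const float *b, int n) {
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        acc0 += a[i] * b[i];         /* two independent chains: the   */
        acc1 += a[i + 1] * b[i + 1]; /* second MAC never waits on the */
    }                                /* first one's result            */
    return acc0 + acc1;
}
```

With a single accumulator, each multiply-add would stall on the previous iteration's result; the two independent chains let the compiler schedule loads and MACs back to back.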


6 CONCLUSION

This article presents the parallel design of six relevant Non-neural ML algorithms to fit ML computational requirements onto edge-based PULP MCUs. We developed the algorithm design targeting efficient execution on GAP8, a commercial chip, and PULP-OPEN, a research platform running on an FPGA emulator. We determined efficient memory access patterns and parallelization schemes achieving peak performance by optimizing the runtime through fine-grained analysis and extensive optimization. Since IoT-class MCUs often limit the HW resources to improve energy efficiency, we leveraged two alternative FP emulation libraries to perform FP computations on the FPU-less GAP8.

By comparing the Non-neural ML kernels execution time on a single-core GAP8 configuration, we show that the target-optimized RVfplib library achieves an average 1.61× speedup compared to the standard libgcc emulation support. Instead, leveraging the FPU-native support on a single-core PULP-OPEN allows up to 32.09× speedup compared to libgcc emulation. We also examined the parallel performance on the adopted PULP platforms, comparing the single-core execution time with the eight-core CL runtime. The parallel design enables near-ideal speedups ranging from 6.56× to 7.64×, considering the two PULP platforms and GAP8 FP emulation supports. We support the discussion with a comprehensive runtime analysis providing core- and SoC-level architectural factors limiting the speedup in each platform configuration and algorithm. Last, we present a comparison between PULP-OPEN and ARM Cortex-M4. By leveraging PULP-OPEN in a single-core configuration, we achieve 1.36×–2.39× speedup compared to Cortex-M4 deployment. At the same time, using the eight-core CL of PULP-OPEN reduces the runtime drastically, leading to a 9.27×–15.85× performance improvement.

Future work will include the design of an automatic tool to deploy Non-neural ML algorithms on PULP-based MCUs, targeting optimal tiling and double-buffering operations to achieve peak performance. Furthermore, we will expand the developed parallel library by integrating further Non-neural ML kernels and supporting new emerging PULP architectures.

Footnotes

1. www.st.com/en/microcontrollers-microprocessors/stm32-32-bit-arm-cortex-mcus.html
2. www.nxp.com/products/processors-and-microcontrollers/arm-processors:ARM-PROCESSORS
3. developer.arm.com/Processors/Cortex-M0
4. www.ti.com/microcontrollers-mcus-processors/microcontrollers/msp430-microcontrollers/products.html
5. www.espressif.com/en/products/socs/esp8266
6. https://www.st.com/content/st_com/en.html
7. https://greenwaves-technologies.com/
8. https://github.com/microsoft/EdgeML
9. https://github.com/pulp-platform
10. https://github.com/pulp-platform/pulp-sdk
11. https://greenwaves-technologies.com/product/gapuino/
12. https://www.xilinx.com/products/boards-and-kits/vcu118.html
13. https://developer.arm.com/Processors/Cortex-M4
14. https://www.st.com/en/microcontrollers-microprocessors/stm32f4-series.html

REFERENCES

[1] D. Evans. 2011. The Internet of Things: How the Next Evolution of the Internet Is Changing Everything. Technical Report. Cisco. Retrieved from https://www.cisco.com/c/dam/en_us/about/ac79/docs/innov/IoT_IBSG_0411FINAL.pdf.
[2] Ürün Dogan, Johann Edelbrunner, and Ioannis Iossifidis. 2011. Autonomous driving: A comparison of machine learning techniques by means of the prediction of lane change behavior. In Proceedings of the IEEE International Conference on Robotics and Biomimetics. IEEE, 1837–1843.
[3] Enrico Tabanelli, Davide Brunelli, Andrea Acquaviva, and Luca Benini. 2022. Trimming feature extraction and inference for MCU-based edge NILM: A systematic approach. IEEE Trans. Industr. Inform. 18, 2 (2022), 943–952.
[4] Pradeep Kumar and Arvind Tiwari. 2017. Ubiquitous Machine Learning and Its Applications (1st ed.). IGI Global.
[5] Cisco. 2016. Global Cloud Index: Forecast and Methodology, 2016–2021. Technical Report. Cisco. Retrieved from https://www.cisco.com/c/en/us/solutions/collateral/service-provider/globalcloud-index-gci/white-paper-c11-738085.html.
[6] Marco V. Barbera, Sokol Kosta, Alessandro Mei, and Julinda Stefa. 2013. To offload or not to offload? The bandwidth and energy costs of mobile cloud computing. In Proceedings of the IEEE Infocom. IEEE, 1285–1293.
[7] Yunchuan Sun, Junsheng Zhang, Yongping Xiong, and Guangyu Zhu. 2014. Data security and privacy in cloud computing. Int. J. Distrib. Sensor Netw. 10, 7 (2014), 190903.
[8] Ramon Sanchez-Iborra and Antonio F. Skarmeta. 2020. TinyML-enabled frugal smart objects: Challenges and opportunities. IEEE Circ. Syst. Mag. 20, 3 (2020), 4–18.
[9] Colby R. Banbury, Vijay Janapa Reddi, Max Lam, William Fu, Amin Fazel, Jeremy Holleman, Xinyuan Huang, Robert Hurtado, David Kanter, Anton Lokhmotov, et al. 2020. Benchmarking TinyML systems: Challenges and direction. Retrieved from arXiv:2003.04821.
[10] TinyML Foundation. [n.d.]. TinyML research community. Retrieved from https://www.tinyml.org/.
[11] Wei Yu, Fan Liang, Xiaofei He, William Grant Hatcher, Chao Lu, Jie Lin, and Xinyu Yang. 2017. A survey on the edge computing for the internet of things. IEEE Access 6 (2017), 6900–6919.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[13] Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 6105–6114.
[14] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[15] Greenwaves Technologies. [n.d.]. GAP Processors. Retrieved from https://greenwaves-technologies.com/gap8_gap9/.
[16] Sony. [n.d.]. Spresense development board. Retrieved from https://developer.sony.com/develop/spresense/.
[17] Sparsh Mittal. 2015. A survey of architectural techniques for near-threshold computing. ACM J. Emerg. Technol. Comput. Syst. 12, 4 (2015), 1–26.
[18] E. Flamand, D. Rossi, F. Conti, I. Loi, A. Pullini, F. Rotenberg, and L. Benini. 2018. GAP-8: A RISC-V SoC for AI at the edge of the IoT. In Proceedings of the International Conference on Application-specific Systems, Architectures and Processors (ASAP’18). IEEE, 1–4.
[19] Davide Rossi, Francesco Conti, Manuel Eggiman, Stefan Mach, Alfio Di Mauro, Marco Guermandi, Giuseppe Tagliavini, Antonio Pullini, Igor Loi, Jie Chen, Eric Flamand, and Luca Benini. 2021. 4.4 A 1.3TOPS/W @ 32GOPS fully integrated 10-core SoC for IoT end-nodes with 1.7 μW cognitive wake-up from MRAM-based state-retentive sleep mode. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC’21), Vol. 64. 60–62.
[20] Mark Gottscho, Irina Alam, Clayton Schoeny, Lara Dolecek, and Puneet Gupta. 2017. Low-cost memory fault tolerance for IoT devices. ACM Trans. Embed. Comput. Syst. 16, 5s (2017), 1–25.
[21] Doris Chen and Deshanand Singh. 2013. Profile-guided floating- to fixed-point conversion for hybrid FPGA-processor applications. ACM Trans. Architect. Code Optim. 9, 4, Article 43 (2013), 25 pages.
[22] Daniel Menard, Daniel Chillet, and Olivier Sentieys. 2006. Floating-to-fixed-point conversion for digital signal processors. EURASIP J. Adv. Signal Process. 2006 (2006), 1–19.
[23] Michael Christensen and Fred J. Taylor. 2006. Fixed-point-IIR-filter challenges. EDN Netw. 51, 23 (2006), 111–122.
[24] Daniel Menard, Romain Serizel, Romuald Rocher, and Olivier Sentieys. 2008. Accuracy constraint determination in fixed-point system design. EURASIP J. Embed. Syst. 2008 (2008), 1–12.
[25] Wei-Hsin Chang and Truong Q. Nguyen. 2008. On the fixed-point accuracy analysis of FFT algorithms. IEEE Trans. Signal Process. 56, 10 (2008), 4673–4682.
[26] Matteo Perotti, Giuseppe Tagliavini, Stefan Mach, Luca Bertaccini, and Luca Benini. 2022. RVfplib: A fast and compact open-source floating-point emulation library for tiny RISC-V processors. In Proceedings of the International Conference on Embedded Computer Systems. Springer, 16–32.
[27] Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, and Muhammad Shafique. 2020. Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead. IEEE Access 8 (2020), 225134–225180.
[28] K. V. Greeshma and K. Sreekumar. 2019. Fashion-MNIST classification based on HOG feature descriptor using SVM. Int. J. Innov. Technol. Explor. Eng. 8, 5 (2019), 960–962.
[29] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[30] Liangzhen Lai, Naveen Suda, and Vikas Chandra. 2018. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. Retrieved from arXiv:1801.06601.
[31] STMicroelectronics. [n.d.]. X-Cube-AI: AI Expansion Pack for STM32CubeMX. Retrieved from https://www.st.com/en/embedded-software/x-cube-ai.htm.
[32] Mahmut Taha Yazici, Shadi Basurra, and Mohamed Medhat Gaber. 2018. Edge machine learning: Enabling smart internet of things applications. Big Data Cogn. Comput. 2, 3 (2018), 26.
[33] Girish Bekaroo and Aditya Santokhee. 2016. Power consumption of the Raspberry Pi: A comparative analysis. In Proceedings of the IEEE International Conference on Emerging Technologies and Innovative Business Practices for the Transformation of Societies (EmergiTech’16). 361–366.
[34] Fouad Sakr, Francesco Bellotti, Riccardo Berta, and Alessandro De Gloria. 2020. Machine learning on mainstream microcontrollers. Sensors 20, 9 (2020), 2638.
[35] Eloquent Arduino blog. [n.d.]. MicroML. Retrieved from https://github.com/eloquentarduino/micromlgen.
[36] Jon Nordby. [n.d.]. emlearn: Machine learning inference engine for microcontrollers and embedded devices. Retrieved from https://github.com/emlearn/emlearn.
[37] Mohamed Almansoor, Mohamed Alaradi, and Abdulla Alqaddoumi. 2020. Parallel programming for classification algorithms using logistic regression and artificial neural networks: Framework and applications. In Proceedings of the International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI’20). IEEE, 1–6.
[38] Kennedy Senagi and Nicolas Jouandeau. 2022. Parallel construction of random forest on GPU. J. Supercomput. (2022), 1–21.
[39] Peng Liu, Hui-han Zhao, Jia-yu Teng, Yan-yan Yang, Ya-feng Liu, and Zong-wei Zhu. 2019. Parallel naive Bayes algorithm for large-scale Chinese text classification based on Spark. J. Central South Univ. 26, 1 (2019), 1–12.
[40] Yang You, Shuaiwen Leon Song, Haohuan Fu, Andres Marquez, Maryam Mehri Dehnavi, Kevin Barker, Kirk W. Cameron, Amanda Peters Randles, and Guangwen Yang. 2014. MIC-SVM: Designing a highly efficient support vector machine for advanced modern multi-core and many-core architectures. In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 809–818.
[41] Huming Zhu, Pei Li, Peng Zhang, and Zheng Luo. 2019. A high performance parallel ranking SVM with OpenCL on multi-core and many-core platforms. Int. J. Grid High Perform. Comput. 11, 1 (2019), 17–28.
[42] Yujing Ma, Florin Rusu, and Martin Torres. 2019. Stochastic gradient descent on modern hardware: Multi-core CPU or GPU? Synchronous or asynchronous? In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS’19). IEEE, 1063–1072.
[43] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. 2017. ProtoNN: Compressed and accurate kNN for resource-scarce devices. In Proceedings of the International Conference on Machine Learning. PMLR, 1331–1340.
[44] Ashish Kumar, Saurabh Goyal, and Manik Varma. 2017. Resource-efficient machine learning in 2 KB RAM for the internet of things. In Proceedings of the International Conference on Machine Learning. PMLR, 1935–1944.
[45] Sridhar Gopinath, Nikhil Ghanathe, Vivek Seshadri, and Rahul Sharma. 2019. Compiling KB-sized machine learning models to tiny IoT devices. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 79–95.
[46] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. 2016. TABLA: A unified template-based framework for accelerating statistical machine learning. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’16). IEEE, 14–26.
[47] Mohammad Saeid Mahdavinejad, Mohammadreza Rezvan, Mohammadamin Barekatain, Peyman Adibi, Payam Barnaghi, and Amit P. Sheth. 2018. Machine learning for internet of things data analysis: A survey. Digital Commun. Netw. 4, 3 (2018), 161–175.
[48] Massimo Merenda, Carlo Porcaro, and Demetrio Iero. 2020. Edge machine learning for AI-enabled IoT devices: A review. Sensors 20, 9 (2020), 2533.
[49] Muhammad Waseem Ahmad, Monjur Mourshed, and Yacine Rezgui. 2017. Trees vs. neurons: Comparison between random forest and ANN for high-resolution prediction of building energy consumption. Energy Build. 147 (2017), 77–89.
[50] Colby Banbury, Vijay Janapa Reddi, Peter Torelli, Jeremy Holleman, Nat Jeffries, Csaba Kiraly, Pietro Montino, David Kanter, Sebastian Ahmed, Danilo Pau, et al. 2021. MLPerf tiny benchmark. Retrieved from arXiv:2106.07597.
[51] Jaesung Huh, Minjae Lee, Heesoo Heo, Seongkyu Mun, and Joon Son Chung. 2021. Metric learning for keyword spotting. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT’21). IEEE, 133–140.
[52] Joel Shor, Aren Jansen, Wei Han, Daniel Park, and Yu Zhang. 2022. Universal paralinguistic speech representations using self-supervised conformers. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’22). IEEE, 3169–3173.
[53] Xueliang Liu, Rongjie Zhang, Zhijun Meng, Richang Hong, and Guangcan Liu. 2019. On fusing the latent deep CNN feature for image classification. World Wide Web 22, 2 (2019), 423–436.
[54] Karel Durkota, Michal Linda, M. Ludvik, and Jan Tozicka. 2020. Neuron-Net: Siamese Network for Anomaly Detection. Technical Report. DCASE2020 Challenge.
[55] Minglu Zhao, Hiroyuki Takizawa, and Tomoya Soma. 2021. Spatiotemporal anomaly detection for large-scale sensor data. In Proceedings of the 12th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP’21). IEEE, 162–168.
[56] L. Dagum and R. Menon. 1998. OpenMP: An industry standard API for shared-memory programming. IEEE Comput. Sci. Engineer. 5, 1 (1998), 46–55.
[57] S. Mach, F. Schuiki, F. Zaruba, and L. Benini. 2021. FPnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing. IEEE Trans. Very Large Scale Integr. Syst. 29, 4 (2021), 774–787.
[58] Xiaoyan Zhuo, Iman Nandi, Taha Azzaoui, and Seung Woo Son. 2020. A neural network-based optimal tile size selection model for embedded vision applications. In Proceedings of the IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS’20). 607–612.
[59] Alessio Burrello, Angelo Garofalo, Nazareno Bruschi, Giuseppe Tagliavini, Davide Rossi, and Francesco Conti. 2021. DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs. IEEE Trans. Comput. (2021).
[60] Rohit Chandra, Leo Dagum, David Kohr, Ramesh Menon, Dror Maydan, and Jeff McDonald. 2001. Parallel Programming in OpenMP. Morgan Kaufmann.
[61] Giuseppe Tagliavini, Daniele Cesarini, and Andrea Marongiu. 2018. Unleashing fine-grained parallelism on embedded many-core accelerators with lightweight OpenMP tasking. IEEE Trans. Parallel Distrib. Syst. 29, 9 (2018), 2150–2163.
[62] Adrian Munera, Sara Royuela, and Eduardo Quiñones. 2020. Towards a qualifiable OpenMP framework for embedded systems. In Proceedings of the Design, Automation, and Test in Europe Conference and Exhibition (DATE’20). IEEE, 903–908.
[63] Barbara Chapman, Lei Huang, Eric Biscondi, Eric Stotzer, Ashish Shrivastava, and Alan Gatherer. 2009. Implementing OpenMP on a high performance embedded multicore MPSoC. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing. IEEE, 1–8.
[64] Spiros N. Agathos, Vassilios V. Dimakopoulos, Aggelos Mourelis, and Alexandros Papadogiannakis. 2013. Deploying OpenMP on an embedded multicore accelerator. In Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS’13). IEEE, 180–187.
[65] Sumit Patel, M. B. Potdar, and Bhadreshsinh Gohil. 2015. A survey on image processing techniques with OpenMP. Int. J. Eng. Dev. Res. 3, 4 (2015), 837–839.
[66] Dionis A. Padilla, Ramon Alfredo I. Pajes, and Jerome T. De Guzman. 2020. Detection of corn leaf diseases using convolutional neural network with OpenMP implementation. In Proceedings of the IEEE 12th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM’20). IEEE, 1–6.
[67] Lei Huang, Eric Stotzer, Hangjun Yi, Barbara Chapman, and Sunita Chandrasekaran. 2012. Parallelizing ultrasound image processing using OpenMP on multicore embedded systems. In Proceedings of the IEEE Global High Tech Congress on Electronics. IEEE, 131–138.
[68] Karl Fürlinger and Michael Gerndt. 2006. Analyzing overheads and scalability characteristics of OpenMP applications. In Proceedings of the International Conference on High Performance Computing for Computational Science. Springer, 39–51.
[69] Frederica Darema. 2001. The SPMD model: Past, present and future. In Proceedings of the European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting. Springer, 1.
[70] Fabio Montagna, Giuseppe Tagliavini, Davide Rossi, Angelo Garofalo, and Luca Benini. 2021. Streamlining the OpenMP programming model on ultra-low-power multi-core MCUs. In Proceedings of the International Conference on Architecture of Computing Systems. Springer, 167–182.
[71] J. S. Cramer. 2002. The origins of logistic regression. In Tinbergen Institute Discussion. Tinbergen Institute.
[72] Christiana Ioannou and Vasos Vassiliou. 2018. An intrusion detection system for constrained WSN and IoT nodes based on binary logistic regression. In Proceedings of the 21st ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems. ACM, 259–263.
[73] Mahmudul Hasan, Md. Milon Islam, Md. Ishrak Islam Zarif, and M. M. A. Hashem. 2019. Attack and anomaly detection in IoT sensors in IoT sites using machine learning approaches. Internet Things 7 (2019), 100059.
[74] C. Cortes and V. Vapnik. 1995. Support-vector networks. Mach. Learn. 20, 1 (1995), 273–297.
[75] Yi-Hung Liu and Yen-Ting Chen. 2007. Face recognition using total margin-based adaptive fuzzy support vector machines. IEEE Trans. Neural Netw. 18, 1 (2007), 178–192.
[76] T. Siddharth, Pranjali Gajbhiye, Rajesh Kumar Tripathy, and Ram Bilas Pachori. 2020. EEG-based detection of focal seizure area using FBSE-EWT rhythm and SAE-SVM network. IEEE Sensors J. 20, 19 (2020), 11421–11428.
[77] Nir Friedman, Dan Geiger, and Moises Goldszmidt. 1997. Bayesian network classifiers. Mach. Learn. 29 (1997), 131–163.
[78] Di Wu, Zhongkai Jiang, Xiaofeng Xie, Xuetao Wei, Weiren Yu, and Renfa Li. 2020. LSTM learning with Bayesian and Gaussian processing for anomaly detection in industrial IoT. IEEE Trans. Industr. Inform. 16, 8 (2020), 5244–5253.
[79] Nikhil Kumar, Debopam Acharya, and Divya Lohani. 2021. An IoT-based vehicle accident detection and classification system using sensor fusion. IEEE Internet Things J. 8, 2 (2021), 869–880.
[80] T. Cover and P. Hart. 1967. Nearest neighbor pattern classification. IEEE Trans. Info. Theory 13, 1 (1967), 21–27.
[81] W. K. Wong, Filbert H. Juwono, and Brendan Teng Thiam Khoo. 2021. Multi-features capacitive hand gesture recognition sensor: A machine learning approach. IEEE Sensors J. 21, 6 (2021), 8441–8450.
[82] M. M. Ranjitha, N. L. Taranath, C. N. Arpitha, and C. K. Subbaraya. 2019. Bone cancer detection using K-means segmentation and KNN classification. In Proceedings of the 1st International Conference on Advances in Information Technology (ICAIT’19). 76–80.
[83] J. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability: Weather Modification. University of California, 281–296.
[84] Wenbin Wu and Mugen Peng. 2017. A data mining approach combining K-means clustering with bagging neural network for short-term wind power forecasting. IEEE Internet Things J. 4, 4 (2017), 979–986.
[85] Xiaosheng Peng, Chengke Zhou, Donald M. Hepburn, Martin D. Judd, and W. H. Siew. 2013. Application of K-means method to pattern recognition in on-line cable partial discharge monitoring. IEEE Trans. Dielectr. Electr. Insul. 20, 3 (2013), 754–761.
[86] L. Breiman. 2001. Random forests. Mach. Learn. 45, 1 (2001), 5–32.
[87] Enrico Tabanelli, Davide Brunelli, and Luca Benini. 2020. A feature reduction strategy for enabling lightweight non-intrusive load monitoring on edge devices. In Proceedings of the IEEE 29th International Symposium on Industrial Electronics (ISIE’20). IEEE, 805–810.
[88] Tzu-Hsuan Lin and Jehn-Ruey Jiang. 2020. Anomaly detection with autoencoder and random forest. In Proceedings of the International Computer Symposium (ICS’20). 96–99.

Published in

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 3 (May 2023), 546 pages.
ISSN: 1539-9087. EISSN: 1558-3465. DOI: 10.1145/3592782. Editor: Tulika Mitra.

Publisher

Association for Computing Machinery, New York, NY, United States.

Publication History

• Received: 17 July 2021
• Revised: 18 October 2022
• Accepted: 29 October 2022
• Published: 19 April 2023