An Approach to the Systematic Characterization of Multitask Accelerated CNN Inference in Edge MPSoCs

Deep Learning is ubiquitous today and is increasingly moving from the cloud down to the edge of networked infrastructures, where it enables embedded applications to perform complex inference tasks close to the data sources, reducing long-distance data movement and alleviating the need for a powerful cloud infrastructure. Edge-class multi-processor system on chip (MPSoC) devices featuring an on-chip FPGA fabric offer key advantages for Deep Learning inference tasks, especially for complex applications where multiple models may be run concurrently in the same platform. In this work, we propose an approach and a practical framework for the systematic characterization of multithreaded Deep Learning inference on edge FPGA MPSoCs. We instantiate the framework on a real-world MPSoC platform, targeting Xilinx Vitis-AI as a representative example of a commercial Deep Learning acceleration toolkit for edge environments. We design a comprehensive experimental campaign and apply it to the platform for several convolutional neural networks, each trained on three different datasets. We show that our approach can be used for both hardware- and software-level analysis of a target system. Among other findings, the analysis revealed a suboptimal behavior of the underlying toolkit runtime in its utilization of the accelerator cores, as well as an uneven software latency of the support library that is influenced by the shapes of the input tensors.


INTRODUCTION
Deep Neural Networks (DNNs) now offer state-of-the-art performance in various domains, from computer vision to natural language processing. DNN tasks, particularly inference, are increasingly moving from the cloud down to edge systems. Moving processing tasks closer to data sources, in fact, helps optimize storage, networking, and computational requirements; reduce energy consumption; and address potential data security concerns. Interestingly, high-end edge systems may be required to run demanding, multitasking DNN workloads. As an example, consider a video surveillance application, processing various data sources for different purposes concurrently (multiple data streams undergoing various types of analysis like object recognition, intrusion detection, etc.). An additional important use case for multitasking in Artificial Intelligence (AI) edge platforms includes the case of multiple DNNs cooperating in an ensemble and combining their results to increase the accuracy and robustness of a given task.
The preceding scenarios clearly challenge the capabilities of an edge-class system in terms of computation and storage resources, hindering the deployment of deep models at the edge of a distributed infrastructure. Modern Multi-Processor System on Chip (MPSoC) platforms can address these limitations by offering comprehensive compute environments that can be deployed at the edge. They can run fully fledged Linux-like operating systems and even hypervisors on application-class multicore processors. MPSoCs are also often equipped with an on-chip Field Programmable Gate Array (FPGA) fabric. In fact, tightly integrated hardware reconfiguration capabilities can enhance runtime performance and reduce energy consumption of the MPSoC, by limiting the need for off-chip data movement. Furthermore, FPGAs typically run at relatively low frequencies, yielding considerable power savings for many applications [33]. FPGA-based Deep Learning acceleration has been widely addressed by the industry [28,29,39,40] and academia [4,34,35]. It is currently an active field of exploration and development in both the server domain [4,27] and embedded and edge platforms [2,15].
Several technologies and toolkits for FPGA-accelerated DNNs exist. Some rely on high-level synthesis [29] and/or DNN quantization [39], as well as other model compression techniques, whereas others are based on specialized co-processor designs [40]. In that respect, FPGA-based platforms lend themselves to extreme hardware-level optimizations, for example, by closely customizing the data representations driven by the actual application requirements. A few academic works address multitasking and multiuser AI accelerators. Ghodrati et al. [9] used architectural fission, dividing their DNN accelerator into multiple smaller DNN engines on demand. Jiang et al. [21] addressed quality of experience in DNN acceleration by leveraging the dynamic reconfiguration capabilities of an FPGA platform.
However, although AI toolchains and edge platforms supporting multitasking AI are becoming increasingly important, only a few previous works address the characterization of their performance. Xilinx Vitis-AI [40] is the most mature of these toolkits. It features a tracing and profiling tool, namely the Vitis-AI profiler, which only supports its own accelerator and, albeit open source and configurable, offers a narrow interface aimed at software and memory profiling. In academia, Zhu et al. [44] focused on the high-level behavior of scheduling mechanisms for inference jobs and proposed different scheduling policies for different Convolutional Neural Networks (CNNs). Restuccia and Biondi [34] investigated real-time aspects of AI acceleration, introducing a hardware module to increase time predictability with prefetching. Qasaimeh et al. [32] compared the vision kernels of Vitis-AI with embedded processors and embedded GPUs. Nevertheless, to the best of our knowledge, no general framework exists that enables the systematic characterization of toolchains and platforms for multitasking AI inference. In this work, we propose an approach and a practical framework for the characterization of multitask inference acceleration in edge-class MPSoCs, leveraging the computing power and features of a fully fledged Operating System (OS) and the hardware reconfigurability of the FPGA fabric. We demonstrate the methodology on a state-of-the-art AI toolkit, Xilinx Vitis-AI [40], as a representative example of Deep Learning acceleration in edge-class FPGA-based MPSoCs. By applying our methodology, we design a noninvasive, extensible measurement framework leveraging the Linux tracing infrastructure and use it to evaluate the performance of the Deep Learning Processing Unit (DPU) across various hardware configurations and different DNN models and datasets. An extensive experimental campaign is performed to stimulate single- and multithreaded behaviors of both the software support layer of the toolkit and the hardware accelerator itself. The proposed methodology allows us to deliver an in-depth empirical analysis of the AI framework and reveal nontrivial behaviors and suboptimal design of the software support layer, providing insights on possible design choices from both the framework designer's and the user's perspective. As a by-product, we perform a thorough analysis of some of the toolkit's undocumented features, including some debug performance counters in the DPU hardware IP, providing insights on their behavior and semantics. Last, we show that a single experimental campaign designed with the proposed methodology can be repurposed to perform additional analysis and estimation on quantities and details that were not explicitly included in the design, such as energy consumption.
The rest of the article is organized as follows. Section 2 covers related work on DNNs of interest in this work and Deep Learning acceleration. Section 3 provides an overview of the proposed characterization methodology and framework. Section 4 describes the application of the methodological framework to a concrete edge AI toolchain and MPSoC platform. Section 5 presents the experimental results alongside a comprehensive analysis of the collected data. Section 6 concludes the article with a few final remarks.

BACKGROUND
This section outlines the background of the techniques and the methods we adopted in this work. It also describes the technological scenario of Deep Learning acceleration toolkits focusing on FPGA technologies and discusses related work in the field.

Techniques for Efficient DNN Inference
The technical literature comprises numerous works addressing the efficient deployment of Deep Learning models. According to some works [11,18], the proposed solutions can be generally classified into two categories. The first one aims at directly defining smaller networks (e.g., MobileNets [18]) with more efficient blocks, whereas the second one focuses on the compression of pre-trained models (i.e., model compression). Cheng et al. [3] further categorized model compression techniques into the following four groups:
• Parameter pruning and sharing attempts to reduce the number of parameters by removing the redundant and noncritical ones [10].
• Low-rank factorization identifies the informative parameters by performing matrix/tensor decomposition [42].
• Transferred/compact convolutional filtering designs convolutional filters that are efficient in terms of storage and computational complexity.
• Knowledge distillation (KD) trains a compact neural network by distilling the knowledge of a more complex and accurate model [11,17,26,31].
Network quantization, which falls into the first group of model compression techniques, consists of reducing the precision of the data types for the network weights, biases, and activations from floating-point to fixed-point, short integers, and even binary values (i.e., binarized neural networks [5]). Compressing the initial model massively reduces the volume of data to be moved in and out of memory, allowing energy reduction and speed improvement, despite some degradation in performance [14]. Quantization can be applied both on the pretrained model ready for inference and during the training phase. The former case targets the DNN parameters and is referred to as Post-Training Quantization (PTQ). The latter is known as quantization-aware training and can also involve gradients of the backpropagation algorithm.

Deep Learning Acceleration Technologies
General-purpose processors, such as Intel architectures for server-class systems, feature dedicated support for the computing requirements of Deep Learning (e.g., matrix extensions [19]). In addition, NVIDIA has introduced several frameworks for Deep Learning, from the cuDNN library [6] to TensorRT [38]. The latter is an SDK supporting several optimization techniques, like quantization, pruning, layer merging, and parameter sharing, to reduce the computational cost and the energy consumption of the network. Furthermore, efforts toward Deep Learning-specific accelerators have been made by both industry (e.g., Google TPUs [8]) and academia. FPGA technology has also entered the Deep Learning realm, mainly for inference tasks. Hardware reconfiguration capabilities and ease of integration with classical processing systems make FPGA technology a good candidate for synthesizing Deep Learning inference co-processors. Recurrent neural networks [4,12,27] as well as CNNs [2,15] have been the target of FPGA-acceleration research. The interested reader can refer to other works [1,13,36] for a detailed discussion. Modern FPGA-based DNN acceleration is based on full-stack frameworks. Such a choice aims to increase the usability of FPGA technology, given the hardware design skills required for FPGA development, and to broaden its potential user base. The most requested features are the following:
(1) integration with popular Deep Learning toolkits (TensorFlow, PyTorch, ONNX, etc.);
(2) easy deployment on target platforms, completely hiding FPGA complexity.
Next, we introduce the most common toolkits for FPGA-accelerated Deep Learning.

NVIDIA NVDLA.
Alongside its GPUs, NVIDIA also offers a free and open source microarchitecture for Deep Learning acceleration, namely NVDLA [28]. It has been designed for Internet of Things (IoT) devices and is shipped as a Verilog design with complete compiler, simulation, runtime, and driver support. For training and quantization, it is compatible with TensorRT [38].

Intel OpenVINO.
OpenVINO [29] is an open source toolkit from Intel for the heterogeneous deployment of computer vision and neural network kernels, mainly inference, on Intel hardware. It supports a variety of platforms, from edge-class CPUs to integrated GPUs and server-class FPGA cards.

Xilinx FINN.
FINN [39] is an experimental, open source toolkit by Xilinx for the quantization and deployment of DNNs on FPGAs. It is based on high-level synthesis technology and offers a parameterizable dataflow to synthesize a pipelined design of SIMD arrays mimicking the structure of the target DNN. It is composed of low-precision matrix-vector engines and other processing elements. It supports 2-bit and 1-bit precision and only CNNs. The downside of FINN is that the whole synthesized design of the quantized DNN has to fit the target FPGA at once. This can be a potential limitation for particularly deep networks and resource-limited devices. Seyoum et al. [35] leveraged Dynamic Partial Reconfiguration (DPR), also recently referred to as Dynamic Function eXchange (DFX) in Xilinx products, to time-multiplex different layers of the quantized model in an FPGA.

Xilinx Vitis-AI.
Vitis-AI [40] is a partially open source toolkit, which can interoperate with TensorFlow, PyTorch, and Caffe (only up to Vitis-AI 2.0). It performs DNN quantization to 8-bit fixed-point precision on the floating-point model and outputs an executable file for a statically configurable co-processor, namely the DPU [30]. The core is then integrated as a hardware IP and synthesized on the FPGA fabric. Zhu et al. [44] analyzed the high-level behavior of the scheduling mechanism of inference jobs on the DPU core, proposing alternative scheduling policies to improve performance for different types of DNNs. Restuccia and Biondi [34] studied the time predictability of the IP, performing a response-time analysis and revealing a regular latency for any given DNN. Furthermore, they proposed a custom hardware module to further improve the time predictability of the DPU. Qasaimeh et al. [32] evaluated the DPU against Cortex A57 CPUs and NVIDIA Jetson TX2 GPUs. Their experimental results show that the DPU IP can outperform GPUs and CPUs in terms of energy consumption per frame and inference runtime.

Target Models
In this subsection, we present some background about the models we used for demonstration purposes. Three CNNs were selected for the experimental campaign, namely VGG-16 [37], ResNet-50 [16], and MobileNet [18]. The choice of the aforementioned networks was motivated by their employment in recent works regarding DNN quantization [20,22,41,43] and their diversity in terms of depth and number of parameters.
The selected CNNs were defined with different purposes. VGG-16 [37] was proposed in 2014 with the key idea of using very small convolutional filters (3 × 3) at each layer. Two or three such convolutional layers are stacked, producing the same receptive field as a single convolutional layer with larger filters, but requiring fewer parameters.
ResNets (Residual Networks) were designed in 2016 by He et al. [16], with the aim of mitigating the problems of convergence for deep networks by design, based on identity mapping.Moreover, to reduce training time for deep ResNets, the authors modified the building block with a bottleneck design.In our experiments, we employed ResNet-50, the version of ResNet consisting of 50 layers.
MobileNets were proposed in 2017 by Howard et al. [18].They are designed for mobile and embedded applications.Their main novelty is the employment of depthwise separable convolutions, which factorize standard convolution into (1) a depthwise convolution, which applies a single filter separately to each input channel, and (2) a pointwise (1 × 1) convolution, used to linearly combine the output of the previous operation.This factorization allows the reduction of both computational costs and model size.
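For reference, the cost reduction achieved by this factorization can be quantified as in the original MobileNet paper [18]. Using its notation, for a $D_K \times D_K$ kernel, $M$ input channels, $N$ output channels, and a $D_F \times D_F$ output feature map, the ratio between the cost of the depthwise separable factorization and that of a standard convolution is

$$\frac{D_K^2\, M\, D_F^2 + M\, N\, D_F^2}{D_K^2\, M\, N\, D_F^2} \;=\; \frac{1}{N} + \frac{1}{D_K^2},$$

which, for the 3 × 3 kernels used by MobileNet, amounts to roughly an 8 to 9 times reduction in multiply-accumulate operations.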

MOTIVATION AND HIGH-LEVEL APPROACH
This section discusses the main motivation and describes the architecture of the proposed framework for the systematic characterization of multitask accelerated CNN inference in edge-class MPSoCs. Figure 1 depicts a block diagram with the activities needed by our methodology.

Workload
A key feature of a characterization campaign is the choice of the workload. Any measurement framework or characterization infrastructure requires a realistic workload to test the target system against. Such a workload can be either extracted from a real-world application, like those presented in Section 1, or purposely designed for the experimental campaign. Despite the benefits of a real-world workload, such a choice tends to be biased by on-field interference and, generally, is more difficult to implement. A synthetic workload, conversely, needs to be carefully designed to be realistic, but it offers a wider design space and allows the target platform to be stimulated with the maximum degree of freedom. The designer is provided with several design dimensions, namely the choice of the DNNs and datasets; the frequency of the inference requests; the distribution of the input samples; the size of the working set for each thread and user; and, most importantly for our multithreaded study, the number and configuration of software threads and application-level user environments. Such design choices serve as a configuration for a task scheduling environment, where a task can be either a whole user process or a single thread. During the execution of the experiments, the scheduler enforces the workload characteristics on the platform.
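To make this design space concrete, the dimensions above could be captured in a configuration record along the following lines; the struct and field names are hypothetical and not tied to any toolkit.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical configuration record for one experiment run of the synthetic
// workload; each field corresponds to one of the design dimensions above.
struct WorkloadConfig {
  std::vector<std::string> models;      // compiled DNNs (e.g., .xmodel files)
  std::vector<std::string> datasets;    // datasets paired with the models
  double request_rate_hz;               // frequency of inference requests
  std::string input_distribution;       // how input samples are drawn
  std::size_t working_set_size;         // working set per thread/user
  unsigned num_threads;                 // software threads per user process
  unsigned num_users;                   // application-level user environments
};
```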

FPGA-Based MPSoC Edge Platforms
The capabilities of the target platform may impact the experimental campaign, in that they provide more or less sophisticated support for the deployment of the activities foreseen by the characterization framework. For edge-class platforms, the computing resources typically allow deploying fully fledged OSes or hypervisors to manage such resources, enabling the best tradeoff between invasiveness and accuracy of the characterization activities. High-end edge MPSoCs often feature an FPGA fabric, offering the possibility to determine at runtime the design of the hardware accelerator. A set of design configurations, for the same accelerator or for different designs, can then be chosen for deployment. This adds another design dimension to the characterization of the platform.

Edge Deep Learning Toolkit
Deep Learning requires a toolkit to deploy and run inference on the target hardware. In our case, the platforms running inference functions are edge-class devices. Typical Deep Learning training environments are based on high-level languages and full-precision models. The training phase runs on powerful workstations with an abundance of resources, high-precision datatypes, and a large number of operations. Since such computing requirements are not met by edge devices, the standard approach is to train DNNs on high-end systems and then convert them into edge-compatible formats. An edge-specific toolkit, such as a DNN compiler, is necessary to convert the complex and high-level description of a DNN model into a set of simplified and edge-compatible operations. To reduce the model size and complexity, these tools leverage the model compression techniques introduced in Section 2.1. The simplified model is then used to perform DNN inference on such devices. Alongside the compilation toolchain, a runtime environment featuring low-level drivers and user APIs is deployed on the platform to let the user program communicate with the hardware, particularly the FPGA accelerator.

Measurement Framework
The measurement framework is central to the characterization of a platform. It should minimize interference and bias on the measurements while also offering a set of simple tools that make it versatile and extensible. The key tasks to be addressed are how and what to measure, and how to collect data.

How to Measure?
The first task requires selecting or designing a measurement technology that matches the needs of accuracy and versatility for the target application. The chosen technology does not need to offer the highest possible precision if the granularity of the figures of interest, such as latency and power consumption, is significantly larger. In fact, taking measurements with an unneeded degree of precision may add overheads and cause other effects to interfere with the measurements themselves. Versatility, extensibility, and usability are also key requirements for the selection of the measurement technology. As a matter of fact, if the technology is too complex to use and repurpose (i.e., it is not extensible or versatile), then most of the work spent in mastering such complexity will be lost when designing new experiments or moving to another platform. The computing capabilities of MPSoC platforms can offer effective tools to satisfy such requirements. Our proposal aims to leverage the existing software and hardware infrastructures of the platform, at the MPSoC or OS level. Tracing and debugging technologies can be implemented with both software and hardware modules. For instance, in the case of software probes, the OS/hypervisor running on the main processing system may have a tracing or probing technology already in place. Leveraging such tools simplifies the deployment of software probes, removing the need for aggressive code instrumentation. Likewise, hardware probes, monitors, and performance counters are already supported by edge platforms and can be used for the target characterization. Moreover, FPGAs offer the possibility to introduce and customize monitoring features in the hardware design itself. These capabilities can be implemented either as a custom design or by leveraging the available vendor IP libraries.

What to Measure?
The next step is to decide the target physical or logical quantities to measure. The latency of a given operation is the most obvious example, whereas other quantities of interest include energy consumption, memory usage, and system activities like page faults, cache misses, or interrupt counts. An important aspect is to ensure isolation and reduce interference (up to the required degree of accuracy, as discussed earlier) between the implementation of the measurement probes and the target quantities. This interference and bias may turn out to be non-negligible. If this is the case, we either need to select a less invasive probing technology or perform a calibration on the platform to correct the collected measures.

How to Collect Measures?
The last step in the measurement framework is the selection of the technology used to collect the desired measures. External tools, like oscilloscopes, are less invasive and offer the highest accuracy, yet they introduce a nonnegligible complexity in the whole infrastructure. Once again, MPSoCs offer the possibility to deploy effective data collection and even in-field preprocessing. Such a capability removes the need for an external workstation, since the platform itself has the required storage capacity and computing power to collect data and store it for further analysis. Nevertheless, using the platform for data collection may introduce a source of interference, depending on the selected technologies, which needs to be evaluated and mitigated.

Data Analytics
Once data is collected, one can perform analysis with advanced technologies and statistical techniques to extract the desired information. MPSoC platforms would, in theory, be capable of performing some basic elaboration of the collected measures and even provide data visualization capabilities. Performing data analytics in the field, however, would steal most of the compute resources available in the MPSoC. Consequently, high-quality data analytics is normally performed offline on a powerful workstation.

Extensibility
The previous sections provided a high-level view of the operations foreseen in the proposed framework, and here we discuss its potential for extensibility. The flow revolves around the MPSoC platform with its computing capabilities and resources. The measurement component is the element that is most influenced by these aspects, as it relies on the resources provided by the platform at the board, SoC, or OS level, whereas the edge AI toolkit has a more limited impact. In Section 5.1 and Section 5.3.1, we show that even in the case where the hardware design of the accelerator is closed or there are undocumented features in the AI toolkit, it is possible to use our methodology to extract useful insights. Finally, all activities related to the workload design and task scheduling, together with data analytics, are independent of the platform itself. In fact, designers are free to implement any kind of DNN workload on the platform, reusing the existing measurement probes or designing new ones, as in Section 5.

IMPLEMENTATION OF THE FRAMEWORK
This section presents an instantiation of the approach described in Section 3 for our experimental setup, along with the technologies we employed. The workload used for demonstration includes the three models already described in Section 2, whereas the physical platform is a Xilinx Zynq UltraScale+ MPSoC on a ZCU102 board, featuring a large FPGA fabric as well as an ARM quad-core system running a Linux image built with PetaLinux, as discussed in Section 4.3.
The evaluation of the proposed methodology was performed on a specific platform for demonstration purposes. We do not aim here at a comparison of different platforms. Nevertheless, we highlight that the general methodology of Section 3 may be instantiated with any AI toolkit and hardware platform, as no technical choice in the flow is constrained to platform- or vendor-specific technologies. Implementers may then decide to customize the flow for board- or MPSoC-specific debug or monitoring hardware, or vendor-specific FPGA IPs from Intel FPGA or Xilinx-AMD to be synthesized in the reconfigurable fabric. However, such an approach would limit the user's ability to extend the implementation and deploy or repurpose it on another platform.

Edge Deep Learning Framework
Here we present the target Deep Learning toolkit for our edge platform and the details for the practical deployment of the proposed characterization methodology. The choices presented in this section do not limit the extensibility of our proposal to other edge-AI frameworks [29,38], or hardware accelerators [28,39] and architectures [9,21].

AI Toolkit.
For our experimental campaign, we used Xilinx Vitis-AI [40]. At the time of writing and to the best of our knowledge, it is the most mature framework for FPGA acceleration of DNNs. It is composed of a set of tools to convert a predefined DNN model into an executable format targeted for a Xilinx-provided co-processor, namely the DPU. The tools are shipped as a set of environments to be integrated with several Deep Learning frameworks, namely TensorFlow, PyTorch, and Caffe (up to Vitis-AI 2.0). For each environment, the first tool is the quantizer. This program supports both PTQ and quantization-aware training, but in this work we only employ the former. More in detail, PTQ takes as input a floating-point model, that is, the full-precision DNN model defined and trained in one of the preceding Deep Learning frameworks. The second input is a calibration dataset, used to limit the accuracy loss during the quantization process. This dataset can be a subset of the training set without labels, as there is no need for backpropagation at this stage. PTQ outputs a quantized model with an 8-bit fixed-point data representation. The latter can then be compiled with the Vitis-AI compiler, which is not open source, for the target DPU architecture, or configuration. The resulting artifact is an .xmodel file, which contains the instruction stream for the DPU and metadata, such as the shapes of the input and output tensors of the quantized DNN model.

AI Accelerator.
The DPU accelerator used in this work targets Xilinx's edge-class MPSoCs. Its full name is DPUCZDX8G [30]. This edge-class version of the DPU only supports CNNs. Recurrent neural network capable versions of the DPU IP are only supported by Xilinx cloud cards (i.e., Alveo cards), which are outside the scope of this work. The DPU is shipped as an encrypted IP and integrated with Xilinx Vivado or Vitis, which obfuscates all the microarchitectural details of the IP and prevents any modification to its hardware description. Nevertheless, the IP can be configured with a set of parameters at synthesis time. The most interesting of these parameters, which we focus on in this work, is the number of DPU cores. Up to four of these cores are supported, and they actually behave more like hardware threads than independent cores. The ZCU102 can fit up to three DPU cores in the default configuration, allowing us to use this parameter as a design dimension for our multithreading analysis. Further parameters are (1) the possible presence of an additional hardware core for the acceleration of the softmax function, (2) the DPU architecture (i.e., the width of the SIMD operations the DPU supports), and (3) the utilization intensity of DSP and RAM blocks on the FPGA fabric. Some of these parameters change the DPU configuration footprint for the Vitis-AI compiler, and some do not. For this work, we set all parameters but the number of DPU cores to their default values. Each DPU core has three AXI master interfaces, one for instructions and two for data, and an AXI slave interface through which it is configured by the main processing system (i.e., the ARM quad-core). Each DPU core executes as an asynchronous co-processor: once set up through its registers, it runs independently, fetching its instructions and data from memory and writing back the results. On termination, the DPU core raises an interrupt to notify the main processing system.
The DPU is exposed to the processing system as a character device, whose driver is available in the Xilinx fork of the Linux kernel [25] as a kernel module. Although OpenCL is also supported, in this work we used the plain driver flow, to reduce system complexity and interference and thus simplify our analysis. The interaction with the driver, or the OpenCL runtime, is performed by the Vitis-AI library, which is part of the toolkit [40]. This library offers several APIs at various levels of abstraction. To maximize control over the execution flow, we chose the lowest-level available APIs. We can create a vart::Runner object by parsing the .xmodel file. This operation places the instructions for the DPU in memory. Moreover, this object exposes all metadata about the input and output tensor shapes and sizes to the application to prepare the memory layout for the DPU. The execute_async() method of the vart::Runner class launches the execution of the instructions on a DPU core (i.e., the inference job), and the method wait() stalls the program until the interrupt from the target DPU core is received.
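A condensed sketch of this flow using the low-level Vitis-AI C++ APIs named above is shown below. Header paths and the DPU-subgraph lookup follow the layout of the Vitis-AI 2.0 samples and are assumptions; the preparation of the input and output vart::TensorBuffer objects (memory layout and quantized data) is elided and assumed to be done by the caller.

```cpp
#include <memory>
#include <string>
#include <vector>
#include <vart/runner.hpp>        // assumed header path (Vitis-AI 2.0 layout)
#include <xir/graph/graph.hpp>

// Locate the DPU subgraph inside the compiled .xmodel graph.
static const xir::Subgraph* find_dpu_subgraph(const xir::Graph* graph) {
  for (auto* sub : graph->get_root_subgraph()->children_topological_sort()) {
    if (sub->has_attr("device") && sub->get_attr<std::string>("device") == "DPU")
      return sub;
  }
  return nullptr;
}

// Run one inference job: launch it on a DPU core and block until the DPU
// interrupt is propagated back through the driver and the Vitis-AI library.
int run_inference_job(vart::Runner* runner,
                      const std::vector<vart::TensorBuffer*>& inputs,
                      const std::vector<vart::TensorBuffer*>& outputs) {
  auto job = runner->execute_async(inputs, outputs);       // kick off the DPU core
  return runner->wait(static_cast<int>(job.first), -1);    // block until completion
}

int main() {
  auto graph = xir::Graph::deserialize("model.xmodel");    // placeholder file name
  auto* dpu = find_dpu_subgraph(graph.get());
  auto runner = vart::Runner::create_runner(dpu, "run");
  // Input/output TensorBuffer preparation (shapes obtained from
  // runner->get_input_tensors() / get_output_tensors()) is omitted in this sketch:
  // std::vector<vart::TensorBuffer*> inputs = ..., outputs = ...;
  // run_inference_job(runner.get(), inputs, outputs);
  return 0;
}
```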

CNN Training and Quantization.
As mentioned in Section 2, we employ PTQ to reduce the model size and computational cost of pretrained CNNs. In particular, we trained the CNNs using TensorFlow (version 2.9.2) and the Keras APIs, which are supported by the toolkit of choice, Xilinx Vitis-AI. We conducted training on Google Colaboratory. Broadly speaking, given a CNN, the number of parameters, and thus the model size, depends on the considered network and the employed dataset. The latter modifies the dimensions of the input and output layers, based on the input size and on the number of classes, respectively.
Table 1 shows the size of both the floating-point model and the quantized and compiled model, respectively stored in the .h5 and .xmodel formats. For each of the considered datasets, the table clearly shows that MobileNet outperforms the other networks in terms of model size. Indeed, MobileNet is by design an efficient network not only in terms of computational cost but also of memory consumption. Conversely, VGG-16 always requires the largest number of parameters, which explodes in the case of the ImageNet dataset (i.e., 138.4 million parameters). Indeed, unlike the other considered CNNs, VGG-16 employs three fully connected layers, whose number of parameters becomes significantly higher as the number of classes and the dimensions of the input increase. As a result, the model size increases too, for both the floating-point and the quantized model, as reported in Table 1.
The table also points out that the quantization process, as expected, decreases the model size, since it converts the model weights from floating-point to fixed-point values. The resulting fixed-point model requires less memory bandwidth, and also provides improved power efficiency and computation time. Conversely, the number of parameters of the fixed-point network is approximately the same as that of the floating-point one, as the Vitis-AI quantizer does not alter the network structure, but only removes nodes that are not required once the model is quantized.
The overall deployment flow is the following. The full-precision model is defined and trained in a Deep Learning framework (e.g., TensorFlow 2) and stored as a .h5 file. This model is then quantized and compiled with the Vitis-AI tools, resulting in an .xmodel file. In parallel, the user application is developed in C++ or Python, using the APIs of the Vitis-AI library, and (cross-)compiled for the target board to an ELF file. The .xmodel and ELF files are copied to the board, and the Vitis-AI library shared objects (.so) are installed on it. The ELF file can now run, loading the shared libraries, reading the .xmodel and the target inputs (e.g., images from disk, camera, or other sources), and launching DPU jobs.

Measurement Framework
This section presents the details of the measurement framework we designed starting from the proposed flow. Furthermore, challenges and technical considerations are discussed and analyzed throughout the section.

Measurement Technology.
For our experimental campaign, we set up a measurement infrastructure based on Kprobes [23], a Linux tracing technology, leveraging the existing support on the platform. In particular, we employed dynamic probes with the aim of also improving reusability and extensibility. Such probes are defined and inserted in the Linux tracing framework after the target user programs or kernel modules have been compiled. With reference to naming, kprobes are kernel-space probes and uprobes are probes for user programs and libraries. This approach provides a noninvasive tool for tracing and minimizes source code instrumentation and recompilation. As a matter of fact, we can probe into a precompiled Vitis-AI library without any modification to its source code. Kprobes or uprobes trap during the execution of any user or kernel program, hooking to a kernel symbol (e.g., a function) or an offset in a user program, and tracing the entry or exit of a function call. The main challenge when employing kprobes and uprobes is that the target symbols (i.e., function names) must be visible in the compiled object file. Automatic code inlining may remove symbols that were originally in the source code. Moreover, with aggressive compiler optimizations in user-level programs and high-level languages, like C++, symbols are harder to extract. Once extracted, they might even point to different locations in the assembled code than one may expect. Hence, the choice of kprobes required that the application-level code calling the AI framework APIs be compiled without optimizations. Probes can access various architectural and contextual details of the target program, for example, function arguments and return values, processor register values, and stack and memory contents. The selected data is then automatically recorded to the system trace, alongside the system uptime timestamp of the probed event. This mechanism provided us with microsecond resolution. Furthermore, by using the system uptime as a common timebase, we also managed to reduce possible biases between kernel- and user-space measurement precision or variability.
Linux probes are widely used for performance analysis and system debug. Linux perf-tools [24] use this technology and even allow users to place custom probes in the system. Furthermore, the Vitis-AI profiler tool uses kprobes to trace DPU system activities. It also offers a GUI interface but is aimed specifically at Xilinx's DPU IP. Although it is open source and configurable, it is difficult to extend and offers a narrow interface aimed at software tracing and memory profiling. Our aim was, rather, to demonstrate how to design and implement a lean characterization framework for custom measurements and analysis, with variable granularity and no dependence on technology-specific tools.
We did not employ hardware probes in the experimental evaluation of this work. We neither designed our own hardware to monitor the system nor used the available DPU IP performance counters as probes, because in this work they are part of the system under test. As a matter of fact, we employed kprobes also to sample these registers. Future work will use the findings of this work to leverage such registers for further analysis and integrate them with a custom hardware design.

Measurement Target.
Figure 3 depicts a simplified timeline of the measurement for an inference job. We were interested in four timestamps: (1) the call to execute_async(), (2) DPU kickoff (i.e., writing to the target DPU core start register, which launches the inference processing), (3) DPU interrupt, as soon as the IP completes its job, and (4) return from the wait() API call. We name the three intertimes between these four timepoints, respectively, software left, hardware, and software right. The first timeframe, software left, measures the delay introduced by the Vitis-AI library, before starting the actual inference. The second interval, hardware, covers the runtime of the inference job on the IP. The last timeframe, software right, covers the latency introduced by the Vitis-AI library, before the control is returned to the user program.
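A minimal sketch of how the three intertimes are derived from the four probed timestamps is given below; the struct and field names are ours and merely mirror the definitions above.

```cpp
#include <cstdint>

// Timestamps (microseconds of system uptime) probed for one inference job.
struct JobTimestamps {
  std::uint64_t execute_async_call;  // (1) call to execute_async()
  std::uint64_t dpu_kickoff;         // (2) write to the DPU core start register
  std::uint64_t dpu_interrupt;       // (3) interrupt raised by the DPU core
  std::uint64_t wait_return;         // (4) return from wait()
};

// The three intertimes defined in the text.
struct JobIntertimes {
  std::uint64_t software_left;   // library latency before the inference starts
  std::uint64_t hardware;        // inference runtime on the DPU core
  std::uint64_t software_right;  // library latency before control returns to the user
};

inline JobIntertimes split_timeline(const JobTimestamps& t) {
  return {t.dpu_kickoff - t.execute_async_call,
          t.dpu_interrupt - t.dpu_kickoff,
          t.wait_return - t.dpu_interrupt};
}
```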
Considering interferences, Figure 3 shows another short timeframe, namely the interrupt latency between the raising of the interrupt signal by the DPU and the actual activation of the kernel driver, which is probed by the target kprobe. Although interrupt latency is a possible source of nondeterminism and randomness, our experimental results revealed that, in the target setup and under the target workload, this timeframe is relatively short compared with the length of the whole measurement timeline and exhibits reduced variability, as explained in Section 5.1.1.
Although the details in this section are vendor specific (e.g., the execute_async() and wait() APIs are specific to Xilinx Vitis-AI), the timeline definition in Figure 3 can be applied to any call to an asynchronous co-processor, as on- or off-chip GPUs or FPGAs typically are.

Data Collection.
Linux probes are used to instrument the system, and the selected data is reported in the system trace file. Once the measurement is completed, we can extract the probed information by simply reading the resulting trace file offline. Reusing the Linux tracing infrastructure also provides a general low-interference and low-complexity mechanism for the target measures, as it is based on system calls and hardware breakpoints [23]. Data is collected in the system trace file, which resides in privileged kernel address space, and not in user directories used by the target application. With such an approach, we sought to improve the safety of the measurement infrastructure and minimize interference from the data collection performed on the platform file system.

Platform Environment
For our experiments, we deployed the DPUCZDX8G IP v3.4 on the XCZU9EG MPSoC. We used Vivado 2021.2 for synthesizing three configurations of the IP, with up to three DPU cores. PetaLinux 2021.2 was used for building the kernel image and root filesystem starting from the PetaLinux BSP from Xilinx [40]. We built our experimental application using the C++ APIs of the Vitis-AI library, version 2.0, and the cross-compiler environment, also from Xilinx [40]. CNN training was conducted using TensorFlow 2.9.2 and Google Colab. For quantization and compilation, we employed the tools of the Vitis-AI 2.0 environment for TensorFlow 2.

EXPERIMENTAL EVALUATION
In this section, we perform an analysis of the collected data by applying the flow described in the previous sections to the target CNNs and datasets, alongside an analysis of the DPU performance counters introduced in Section 5.1. We track the execution flow of our inference jobs from the system trace, collect system snapshots and timestamps, and retrieve and compute the target quantities (i.e., latencies and the values of the DPU performance counters). The next sections describe our findings on the collected data.

Debug Performance Counter Registers
The DPU IP features several debug performance counters for each DPU core. These are not documented in the product guide [30] and, in a few cases, required some reverse engineering. We described in Section 4.2.2 how to sample the values of these registers. We performed such measurements for multiple runs of all possible combinations of the selected CNNs and datasets. We describe our findings next.

Cycle Counter Register.
The first register we analyzed is a cycle counter, namely cycle_cnt. By crosschecking with our measurements, we inferred that it keeps track of the number of AXI clock cycles taken by a single inference job on each DPU core. The value of this register varies for each inference job and can be divided by the AXI bus frequency in megahertz, which is 100 in the hardware design we used, to derive the inference execution runtime in microseconds. We selected three representative CNN-dataset configurations, namely MobileNet trained on CIFAR-10, ResNet-50 on CIFAR-100, and VGG-16 on ImageNet. Such a choice is motivated by the fact that, over the nine possible CNN-dataset combinations, the three that we selected respectively exhibit the fastest, average, and slowest hardware runtime. This selection proved to be representative also for the other CNNs and datasets. Furthermore, this choice allows us to use all of the three target CNNs and three target datasets. Figure 4 depicts the distribution of runtime values for 1,000 runs of the three representative CNN-dataset configurations, computed from cycle_cnt, against the distribution of kprobe intertimes, derived as described in Figure 3. This plot shows how the average difference between these distributions settles around 22 μs, which is compatible with the interrupt latency and measuring overhead of the target technologies.
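As a concrete illustration of this conversion, a helper along the following lines (the function name is ours) turns a sampled cycle_cnt value into microseconds:

```cpp
#include <cstdint>

// Convert a sampled cycle_cnt value into microseconds by dividing by the AXI
// clock frequency expressed in MHz (100 MHz in our hardware design).
inline double dpu_runtime_us(std::uint64_t cycle_cnt, double axi_freq_mhz = 100.0) {
  return static_cast<double>(cycle_cnt) / axi_freq_mhz;
}
```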
Table 2 lists the average cycle_cnt values for all combinations of the selected CNNs and datasets. As expected, MobileNet always outperforms the other networks in runtime, since it is designed to be efficient in terms of computational cost. VGG-16 and ResNet-50 provide similar values for CIFAR-10 and CIFAR-100, whereas on ImageNet, VGG-16 is much slower than ResNet-50, even though the former has fewer convolutional layers. Indeed, because of its bottleneck design, the inference time of ResNet-50 remains moderate. As for VGG-16, employing stacks of 3 × 3 convolutions reduces the number of parameters, but it increases the number of convolutions, making the network less efficient in terms of computational cost and time.

Operation Counter Registers.
Our evaluation relies on two additional groups of registers: start and end counters for four types of operations, namely misc, load, save, and conv. These operations respectively map to miscellaneous operations (e.g., pooling and activation), loads and stores from and to memory (for instructions, network parameters, and input and output tensors), and convolutions. Unlike cycle_cnt, the values of these registers are constant for each .xmodel file and do not vary across different runs. This suggests that their values are related to the stream of instructions the DPU executes during inference, which is the same for each .xmodel file. Although the semantics of the start and end counters is not documented, from our experiments we speculate that start is the number of started operations, whereas end keeps track of those that completed successfully. In fact, the two values coincide in the case of a successful inference job, whereas for unsuccessful jobs (e.g., under high system load), the end counter is always lower than the corresponding start value. Table 3, Table 4, and Table 5 report the debug performance counters of the different CNNs on CIFAR-10, CIFAR-100, and ImageNet, respectively. Such values depend on both the network structure and the quantization process. Furthermore, to understand the impact of both the convolutional and the dense layers of the CNNs, we ran each CNN with and without its final dense layers.
As can be observed from the tables, dense layers do not affect the values of misc_cnt. Indeed, such a counter accounts for the number of miscellaneous operations performed by the MISC Engine [7] in the DPU architecture, namely the pooling and activation operations executed for each convolutional layer. As for conv_cnt and load_cnt, the resulting values are mainly determined by the convolutional layers, except for VGG-16. Indeed, unlike the other CNNs, VGG-16 has three dense layers, which clearly impacts the number of performed convolutions and load operations, and thus the computational cost. It is worth noting that the store_cnt value obtained by running only the convolutional layers is often higher than the value obtained when the entire network is considered. Indeed, when only the convolutional layers are executed, the output of the network is the output of the last convolutional layer, which can be larger than the output of the entire network (e.g., a vector of 10 or 100 elements for CIFAR-10 and CIFAR-100, respectively). Therefore, writing this intermediate output to memory may need more store operations, resulting in a higher store_cnt value.
Moreover, the highest performance counter values are those of ResNet-50, as it is the deepest network. Conversely, the lowest values are those of MobileNet, as it is optimized for computational cost. The only case where MobileNet provides a higher value than VGG-16 is misc_cnt. This is because MobileNet comprises more convolutional layers, each of which requires at least an activation, and therefore the misc_cnt value for this network is higher.
Furthermore, the larger the input size, the higher the performance counter values. Indeed, although the performance counters are comparable for CIFAR-10 and CIFAR-100 (where the inputs are 32 × 32 RGB images), they are significantly higher for ImageNet (consisting of 224 × 224 RGB images). The reason behind this behavior is that the larger the image size, the larger the input of each convolutional layer. As a result, the operational complexity also increases, leading to higher values not only of misc_cnt and conv_cnt but also of store_cnt. As for load_cnt, we note that, for each CNN, the variability of this value among different datasets is lower than for the other performance counters. This suggests that the number of load operations depends more on the structure of the network and the number of parameters than on the input size or the number of classes.

Timeline Analysis
In this section, we present the measurement results for the software left and software right timeframes depicted in Figure 3. As explained in Section 4.2, such intervals account for the delay introduced by the Vitis-AI library, respectively before starting the inference and before control is returned to the user program after the inference. From the top plot in Figure 5(a), one can clearly recognize that the software left distribution presents two modes, which are directly connected to two dataset groups: CIFAR-10 and CIFAR-100 in the first, and ImageNet in the second. This phenomenon is caused by the shapes of the input tensors, which are the same for both CIFAR datasets (3 × 32 × 32) but much larger for ImageNet (3 × 224 × 224). Although the allocation of the input tensors in memory is performed before the software left timeframe, the effect of a larger shape on the Vitis-AI library latency is considerable for this interval. Conversely, the bottom plot shows the latency in the software right frame. We point out that there is no observable difference between the two aforementioned dataset groups. However, it is interesting to note that each distribution exhibits two modes, which are most likely caused by memory access patterns, like page faults in virtual memory systems.
Figure 5(b) shows the same distributions as Figure 5(a) but highlighted per CNN. The plots suggest an opposite clustering for the two timeframes. In particular, in the top plot, one can recognize that software left is not subject to any observable difference in latency with respect to the CNN it will run. Software right, instead, exhibits an appreciable difference among the CNNs, although all distributions are concentrated around the same values; beyond this, there is no significant difference between CNNs besides interference due to noncontrollable factors in our setup.

Multithreading Analysis
This section presents the analysis of multithreading, as supported by the target AI toolkit. We inspect the behavior of software threads, which we implemented with the C++ std::thread APIs, against those that we call hardware threads (i.e., DPU cores).
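A minimal sketch of the multithreaded setup used in these experiments is shown below: each software thread owns its own vart::Runner created from the same compiled .xmodel. The run_inference_job() wrapper around execute_async()/wait() is the hypothetical helper sketched earlier in Section 4, and the header paths are assumptions.

```cpp
#include <thread>
#include <vector>
#include <vart/runner.hpp>        // assumed header path
#include <xir/graph/graph.hpp>

// Sketch: num_threads software threads, each owning one vart::Runner created
// from the same DPU subgraph and issuing jobs_per_thread inference jobs.
void run_multithreaded(const xir::Subgraph* dpu_subgraph,
                       unsigned num_threads, unsigned jobs_per_thread) {
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([dpu_subgraph, jobs_per_thread] {
      // The library statically binds this runner to a DPU core at construction
      // time, as analyzed in the next subsection.
      auto runner = vart::Runner::create_runner(dpu_subgraph, "run");
      for (unsigned j = 0; j < jobs_per_thread; ++j) {
        // Input/output buffer preparation elided, as in the earlier sketch:
        // run_inference_job(runner.get(), inputs, outputs);
      }
    });
  }
  for (auto& w : workers) w.join();
}
```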

Software Thread Allocation.
The first interesting property to inspect is the allocation of software threads over DPU cores. Figure 6 shows the assignment of the vart::Runner objects belonging to each software thread to DPU cores during inference tasks. One can observe that with a single vart::Runner, and regardless of the number of available DPU cores, the core with index 0 runs 100% of the inference tasks, whereas the other available cores remain idle. Likewise, having two runners in two software threads evenly balances the load between the first two cores, but leaves the third idle. Finally, the assignment of software threads over hardware threads is not balanced unless the number of software threads matches an integer multiple of the number of available DPU cores. The most unbalanced examples are the case of two DPU cores and three software threads, and the case of three DPU cores and four software threads. The former causes core 0 to serve 66.6% (i.e., two thirds) of the inference tasks, with core 1 only accounting for the remaining 33.3%. The latter case requires core 0 to serve 50% of the requests, leaving only 25% to each of the remaining two cores. This is caused by the static allocation of each vart::Runner object to a given DPU core at construction time by the Vitis-AI library. The assignment is performed in a round-robin fashion, but this denies the possibility for a vart::Runner to execute on an arbitrary DPU core, even when it is free. Such a design choice in the Vitis-AI library cannot be motivated by any thread-core affinity policy, even when the instruction stream of the DPU exhibits some locality, because each AXI master interface of each DPU core is not necessarily connected to a coherent AXI slave port of the processing system of our target MPSoC. Such an allocation policy effectively causes unfair scheduling at runtime. On one hand, the performance of a software thread can be degraded, given a vart::Runner, just because the statically assigned core is busy, whereas others might be idle. On the other hand, exclusive access to a DPU core might possibly be granted to a single thread, or a reduced number of threads. In a multitask system, like our target edge-class MPSoC platform, scheduling fairness is a desired property, and a simple dynamic allocation policy may improve the preceding behavior.

Multithreading Latency.
Figure 7 shows the latency distribution of six software threads. Since all CNN-dataset configurations exhibit the same characteristics, data for a single one is shown: MobileNet on CIFAR-10. The timeframe considered is the one in Figure 3, alongside the memory layout preparation performed by the software thread. Clearly, a single thread cannot benefit from multiple DPU cores, and all single-thread plots show similar latencies for each hardware configuration. The first row depicts the distribution for a single DPU core. Software multithreading offers a significant speedup when moving from one thread to two, with a 1.5× improvement, thanks to the overlapping of the DPU idle time of one thread with the DPU runtime of another. However, allocating more than two software threads does not yield better performance. In the second row, we plotted the results for two DPU cores. The first thing to note is that with two software threads, having two DPU cores enables a larger speedup, almost 2×, suggesting that there were still stalls with a single DPU core. Moreover, it is interesting to note that for a number of threads matching an integer multiple of the number of cores, namely 2, 4, and 6, the distributions show a single mode, with relative speedups of about 2×. This is due to the aforementioned balanced allocation of software threads over DPU cores. Conversely, for a number of threads that is not a multiple of 2 (i.e., 3 and 5), the plot clearly shows two modes per distribution. The third row of Figure 7 depicts the distributions for three DPU cores, where similar considerations can be made. Allocating two software threads does not yield any latency improvement with respect to the case of two DPU cores. A single mode is apparent for three threads, with a 2.4× speedup with respect to one thread. For four and five threads, two bimodal distributions can be recognized, as expected, whereas for six threads, a balanced, single-mode distribution arises.

Thread Versus Fork2join Latency.
Figure 8(a) shows the average runtime of single software threads also for the other CNN-dataset configurations. Computing the average partially hides the details discussed previously but shows that the other configurations behave similarly. All plots point out that for a single DPU core, the speedup of software multithreading quickly saturates at two threads. With two threads, performance does not significantly change between two and three DPU cores. Finally, both multi-DPU core configurations can better benefit from software multithreading, although they both show a saturation effect.
Figure 8(b) plots the average runtime of the whole multithreading run, namely the fork2join latency, effectively taking into account the maximum service time of the single threads. The changes from Figure 8(a) are apparent, and the suboptimal scheduling policies manifest themselves. For a single software thread or two, the plot is similar, since no unfairness has appeared yet. As soon as we consider three software threads, for the middle plot (i.e., for ResNet-50-CIFAR-100), the fork2join latency actually increases with a two-DPU core design. The same happens when moving from four threads to five. This effect is also present in the top and bottom plots, albeit less evident. An analogous behavior can be recognized for three DPU cores: the latency reaches a local minimum at three software threads, then increases at four threads and decreases again at five threads. A particular case is shown in the top plot, where the fast-running MobileNet-CIFAR-10 configuration, with three DPU cores and four software threads, actually performs worse than with three cores and three threads. This last phenomenon manifests itself in the VGG-16-ImageNet configuration only at six software threads, due to the fact that its high latency absorbs this effect for a smaller number of threads.

Multiuser Analysis
In this section, we characterize the target system under multiuser workloads. In this scenario, we are interested in using the MPSoC as a CNN inference engine available for multiple external agents (i.e., users). We first describe the design choices and challenges of the synthetic workload generation. Subsequently, experimental results are presented and discussed.

Multiuser Task Scheduling.
Figure 1 includes a task scheduler module in the workflow architecture. The previous experiments assumed a simple implementation of the scheduling module; this section discusses its implications in more detail.
In Section 5.3.1, we learned how software runner objects are statically assigned to hardware DPU cores. Section 5.3 showed how to take advantage of the multithreaded nature of the DPU IP with software multithreading. In this section, we do not employ the latter, because we decided to isolate each user request in a separate single-threaded process. Such a configuration, with the current Vitis-AI APIs, cannot exploit the multithreading potential of the platform, as each process has no knowledge of the runtime utilization of the DPU cores. However, based on our previous experiments, we designed a scheduling policy that fairly and deterministically utilizes DPU cores. We implement a simple and noninvasive scheduling module enabling the efficient use of hardware multithreading.
First, we let each process allocate a number of vart::Runner objects equal to the number of available DPU cores. Note that the properties of these objects depend on the compiled CNN provided at construction (i.e., the .xmodel). In general, this model is user specific and provides the desired DNN diversity in a CNN inference server. Subsequently, we use a simple context-aware round-robin scheduling policy to instruct each process about which of these runners to select, as sketched below.
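The following C++ sketch illustrates this per-process runner pool. It is a minimal sketch under stated assumptions: it relies on the standard VART/XIR headers shipped with Vitis-AI, whereas the find_dpu_subgraph helper, the RunnerPool class, and the way the slot index reaches the process are illustrative choices of ours and not part of the toolkit API.

// Per-process runner pool with round-robin selection (illustrative sketch).
#include <memory>
#include <string>
#include <vector>
#include <vart/runner.hpp>
#include <xir/graph/graph.hpp>

// Illustrative helper: returns the first DPU subgraph of a compiled .xmodel.
static const xir::Subgraph* find_dpu_subgraph(xir::Graph* graph) {
  for (auto* s : graph->get_root_subgraph()->children_topological_sort()) {
    if (s->has_attr("device") && s->get_attr<std::string>("device") == "DPU") {
      return s;
    }
  }
  return nullptr;
}

class RunnerPool {
 public:
  RunnerPool(const std::string& xmodel_path, unsigned num_dpu_cores)
      : graph_(xir::Graph::deserialize(xmodel_path)) {
    const auto* dpu_subgraph = find_dpu_subgraph(graph_.get());
    // One runner per DPU core: each runner is statically bound to a core at
    // construction time, so the pool as a whole covers every core.
    for (unsigned i = 0; i < num_dpu_cores; ++i) {
      runners_.push_back(vart::Runner::create_runner(dpu_subgraph, "run"));
    }
  }

  // The task scheduling module hands each process a slot index; selecting
  // runners round-robin over slots keeps the DPU cores fairly utilized.
  vart::Runner* select(unsigned slot) {
    return runners_[slot % runners_.size()].get();
  }

 private:
  std::unique_ptr<xir::Graph> graph_;
  std::vector<std::unique_ptr<vart::Runner>> runners_;
};

In our setup, the slot value is provided by the external task scheduling module, which applies the round-robin policy across user processes.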
Such a design minimizes invasiveness on the AI toolkit and adds only a constant overhead to the inference service. We suggest a simple extension of the Vitis-AI library APIs to provide users with a similar interface for instantiating or calling runner objects: letting the user explicitly select which DPU core to address would allow for an effective use of hardware multithreading with a minimal increase in API complexity. A more elegant and efficient alternative would be to completely remove the static binding between runner objects and DPU cores and include a round-robin, or any other, scheduling policy in the AI toolkit runtime itself. Although the target toolkit of our experiments is open source, we discarded the implementation of this idea to minimize the impact on the toolkit itself, which is part of the target system to be characterized and not part of the characterization framework.

Synthetic Workload
We synthesize several multiuser workloads by issuing inference requests from different processes (i.e., from different users), with a single thread per request. For simplicity, within each run we issue user requests at a fixed rate, that is, with a fixed interarrival time, which is then varied across runs. To isolate the behavior of CNN inference from other system modules, requests are generated from user processes hosted by the OS and not across a network connection; this removes any effect of the network (e.g., topology, bandwidth, latency). We synthesized workloads with 20 user processes and 20 images each, with the exception of one case (i.e., VGG-16 trained on ImageNet), where the target system could not sustain such a load and the numbers of user processes and images were reduced to 15 and 5, respectively. We tested three hardware designs, with one, two, and three DPU cores. Allocation of user requests was performed as described in Section 5.4.1, with a simple round-robin scheduling policy enforced by a task scheduling module. For the CNN-dataset configurations in the workload (i.e., the compiled .xmodel files), we took an incremental approach to request diversity: we first used the three representative CNN-dataset configurations introduced in Section 5.1.1, and finally conducted an experiment, labeled mixed CNNs, sampling among all nine CNN-dataset configurations for workload generation.
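A minimal sketch of the request-generation loop follows. The constants, the run_user_process placeholder, and the decision to apply the fixed interarrival time to process launches are illustrative assumptions rather than the exact implementation used in the campaign.

// Fixed-rate multiuser workload generator (illustrative sketch).
#include <sys/wait.h>
#include <unistd.h>

// Placeholder for the per-user job: a single-threaded process issuing
// `images` inference requests through its own runner pool.
static void run_user_process(int user_id, int images) {
  (void)user_id;
  (void)images;
  // ... load the user-specific .xmodel and run the requests ...
}

int main() {
  const int kUsers = 20;                      // user processes per run
  const int kImagesPerUser = 20;              // requests issued by each user
  const useconds_t kInterarrivalUs = 500000;  // fixed interarrival time, varied across runs

  for (int u = 0; u < kUsers; ++u) {
    if (fork() == 0) {                        // each user is an isolated process
      run_user_process(u, kImagesPerUser);
      _exit(0);
    }
    usleep(kInterarrivalUs);                  // enforce the fixed request rate
  }
  while (wait(nullptr) > 0) {                 // join all user processes
  }
  return 0;
}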
The following sections present and analyze the results of these experiments in terms of average user request latency.

Representative CNNs
Figure 9(a) depicts the latency plot of each of the compiled CNN-dataset configurations for a different number of DPU cores. Notably, the user request rate, represented on the x-axis by the request interarrival time, is different for each workload, as the workloads differ in computing and memory demand. We pushed the system to the highest rate it can sustain for each workload: interarrival times range from 0.05 seconds for the fastest configuration (MobileNet-CIFAR-10) to 2.5 seconds for the slowest (VGG-16-ImageNet). Despite the different rates, the first two plots show the same two trends: (1) the typical rising latency/load curve and (2) the presence of crosspoints between the curves for different numbers of DPU cores. The curve for a single DPU core always starts with the lowest latencies at the lowest request rates (i.e., the largest interarrival times), indicating that multithreading only brings overhead at low rates; indeed, the curve for two DPU cores shows lower latencies than the one for three cores. Interestingly, this order flips at higher rates, as soon as the advantages of multithreading become predominant. The latency of inference jobs performed on a single DPU core increases as more and more user requests start queuing in the system; its curve first crosses the one corresponding to two DPU cores, then the one for three cores. Likewise, the latencies for requests served by two cores start lower than those served by three cores, but at some point surpass them, while staying below the curve for single-core requests. The workload synthesized for the VGG-16-ImageNet configuration shows no advantage for hardware multithreading. The runtime and memory characteristics of this configuration are radically different from the previous ones: it only allowed low-rate user requests, as the system could not sustain higher rates, so we could only observe the first part of the latency curve. Here the overhead of hardware multithreading was significant, and the crosspoint effect did not appear.
In conclusion, the similarities between the latency plots of the first two workloads, MobileNet on CIFAR-10 and ResNet-50 on CIFAR-100, in Figure 9(a) for different interarrival times show that the influence of the request rate is consistent across datasets and workloads. The VGG-16 on ImageNet workload, however, yields substantially different results when the system is characterized with multiple DPU cores.

Mixed CNNs.
The last experiment enforced the maximum diversity in the workload (i.e., in both CNNs and datasets). The generated multiuser workload ran all nine available CNN-dataset configurations, randomly sampling among them. Figure 9(b) shows the results of this experiment. We observed the expected latency curve, and the same crosspoint effect observed in Figure 9(a) appeared, with the plot for a single DPU core starting as the most efficient configuration. A notable difference is the order of the crosspoints between plots, which is not the same as in Figure 9(a). This difference arises from the randomness introduced by the CNN-dataset configuration sampling used in this experiment.

Energy Consumption Estimation
As described in Section 4, we only employed probes for timing and did not perform direct energy measurements. Nevertheless, the data collected during the experimental campaign enabled a fine-grained estimation of the energy consumption of the DNN workload of Section 5.3. We leveraged the power estimation reports of the hardware design tool, namely Xilinx Vivado, for the three hardware configurations we synthesized. Dynamic power, particularly the FPGA component, is the dominant contribution and scales linearly with the number of cores. The ARM-based main processing system accounts for only 28.9%, 17.3%, and 12.6% of the total power consumption for the hardware configurations with one, two, and three DPU cores, respectively.
We estimated the energy requirements of the various CNNs, in microjoules per frame, from the power consumption data for the various components of the architecture. We did not include the data movement and preparation performed before the timeline in Figure 3, as it was not part of the measurement target, and only considered the energy consumed during the hardware, software left, and software right timeframes. The latter two have latencies orders of magnitude shorter than the hardware one, and their energy consumption is related to the main processing system, which accounts for a small power contribution compared to the FPGA subsystem. It follows that, in the energy estimation analysis, the software left and software right contributions are minor, at most 3% to 4% each for the fastest CNN configurations (e.g., MobileNet on CIFAR-10).

Concerning the energy consumption of hardware configurations with multiple DPU cores, the DPU documentation [30] reports the possibility of performing clock gating for each DPU core. Since we had enabled this feature in the synthesized hardware designs, we estimate the power consumption of configurations with multiple cores by considering only the number of active DPU cores. The findings in Section 5.3.1 show that the number of active DPU cores depends on the number of active runners, and hence on the number of software threads in our synthetic workload from Section 5.3. We highlight that we did not use any power-specific synthesis strategy for any of the synthesized hardware designs.

Figure 10 plots the estimated energy consumption of the hardware timeframe for the three representative CNN-dataset configurations introduced in Section 5.1.1, for the three hardware configurations with up to three DPU cores, and for up to three software threads. Different energy consumption values for different CNN-dataset configurations are apparent. Linear trends can easily be recognized when increasing the number of threads while DPU cores are still available. Conversely, using a single thread is always the most energy-consuming configuration when only one DPU core is synthesized. As an interesting finding, the most energy-hungry hardware-software configuration is the one where the number of software threads matches the number of available DPU cores. This is reasonable, as all of the available hardware is in use, although these configurations also have the fastest runtime, as observed in Section 5.3.
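Under these assumptions, the per-frame estimate can be summarized as follows; the symbols are shorthand introduced here for the quantities named in the text, not notation taken from the Vivado reports:

$$E_{\text{frame}} \approx n_{\text{active}} \, P_{\text{DPU}} \, t_{\text{hw}} + P_{\text{PS}} \left( t_{\text{sw,left}} + t_{\text{sw,right}} \right),$$

where $P_{\text{DPU}}$ is the per-core dynamic power of the FPGA subsystem, $n_{\text{active}}$ the number of DPU cores not clock gated during the hardware timeframe, $P_{\text{PS}}$ the power of the main processing system, and $t_{\text{hw}}$, $t_{\text{sw,left}}$, and $t_{\text{sw,right}}$ the measured durations of the three timeframes of Figure 3.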

It is also possible to estimate the energy consumption of the overall execution, taking into account the fork2join latency measurements and the idle time caused by thread synchronization. Although the difference in power consumption between the FPGA subsystem and the main processing system is not very large, the difference in duration between the timeframes in Figure 3 and the idle time is considerable. We estimate that the idle timeframes account for less than 0.1% of the total energy consumption overall, reaching 3% only for the worst hardware and software thread configurations (e.g., two DPU cores and three software threads). Therefore, the conclusions for the energy consumption estimation of the overall execution are the same as those drawn from Figure 10.

CONCLUSION
In this work, we proposed a characterization and evaluation approach for accelerated Deep Learning inference in FPGA-based MPSoCs. We described the activities and design challenges required for an accurate evaluation. Furthermore, we applied the proposed methodology to a real-world platform, namely the ZCU102 MPSoC and the Vitis-AI toolkit, as a representative instance of Deep Learning acceleration tools for edge-class FPGA MPSoCs. We designed a practical measurement framework based on Linux probes, aimed at minimizing invasiveness, source code instrumentation, and performance interference, while valuing versatility and extensibility as key requirements. Moreover, we designed and performed an experimental campaign based on several design dimensions, namely the number of DPU cores in the hardware accelerator configuration, the number of software threads of a multithreaded application, and the different CNN-dataset training configurations.

Based on the collected data, we performed a latency analysis of the software support library, before and after the interaction with a DPU core. Our findings show that the duration of these timeframes is more sensitive to the shape of the input tensors than to the output tensors or the network structure. Our multithreading analysis uncovered an unfair allocation policy of runner objects, the typical API used for software multithreading, which are used to launch inference jobs on DPU cores. The observed unfairness stems from the static and fixed allocation mechanism of the library runners: allocation takes place at construction time, affecting the performance of the runner object until destruction. With the help of our framework, we designed a simple and fair task scheduling module for multiuser scenarios, effectively overcoming the limitations of the AI framework APIs with minimal invasiveness. This allowed us to perform a multiuser workload analysis on the target system, unveiling the tradeoffs between workload characteristics and the hardware multithreading of the target AI accelerator. Finally, we showed that the proposed flow can be repurposed for other types of analysis. In particular, we delivered an estimation of the energy consumption of the target DNN workloads, leveraging the data collected during the experimental campaign without any additional measurement.
Figure 2 summarizes the AI toolkit development flow, Xilinx Vitis-AI in our setup. A DNN model is trained in one of the supported Deep Learning frameworks.

Fig. 5. Stacked distributions of software left and software right intertimes in microseconds, highlighted per dataset (a) and CNN (b).

Fig. 8. Mean latency in microseconds for the number of threads.

Fig. 10. Estimation of energy consumption per frame of the hardware accelerator for representative CNN-dataset configurations.

Table 1. Model Size of the Floating-Point Model (Stored as a .h5 File) Produced by TensorFlow, and Model Size of the Quantized and Compiled Model (Stored as a .xmodel File) Produced by the Sequence of Vitis-AI Quantizer and Vitis-AI Compiler

The CNN models (pretrained on ImageNet) have been downloaded as Keras Applications, then retrained on the target dataset, except for ImageNet, for which the best weights are well known and can be downloaded through the Keras APIs. The trained models are the first input to the Vitis-AI Quantizer (vitis_quantize.VitisQuantizer), which converts the 32-bit floating-point weights and activations to 8-bit fixed-point format. The second input, as explained in Section 2, is the calibration set (i.e., a subset of the considered training set). The obtained quantized model can be saved and compiled through the Vitis-AI Compiler, here vai_c_tensorflow2, which is also aware of the target DPU architecture, encoded in a .json file. The output of the compiler is the DPU-executable model, in .xmodel format. As target CNNs, we chose three different models, namely ResNet-50, VGG-16, and MobileNet, introduced in Section 2. The training was conducted for each CNN with three different datasets:
• CIFAR-10, consisting of 60,000 32 × 32 color images divided into 10 classes, with 6,000 images per class;
• CIFAR-100, including 60,000 32 × 32 color images divided into 100 classes, with 600 images per class; and
• ImageNet, the image classification and localization dataset of ILSVRC 2012-2017, spanning 1,000 object classes and containing more than a million images.

Table 2. Average cycle_cnt Values for All Combinations of the Selected CNNs and Datasets

Table 3. Debug Performance Counters for the CIFAR-10 Dataset

Table 4. Debug Performance Counters for the CIFAR-100 Dataset

Table 5. Debug Performance Counters for the ImageNet Dataset