Abstract
Matrix Algebra and Deep Neural Networks represent foundational classes of computational algorithms across multiple emerging applications such as Augmented Reality and Virtual Reality, autonomous navigation (cars, drones, robots), data science, and various artificial intelligence-driven solutions. An accelerator-based architecture can provide performance and energy efficiency by supporting fixed functions through customized data paths. However, constrained Edge systems, which must efficiently support multiple applications and diverse matrix operations, cannot afford numerous custom accelerators. In this article, we present MxCore, a unified architecture comprising tightly coupled vector and programmable cores that share data through highly optimized interconnects, along with a configurable hardware scheduler managing the co-execution. We present MxCore as a generalized approach to the flexible acceleration of multiple Matrix Algebra and Deep-learning applications across a range of sparsity levels. Unified compute resources improve overall resource utilization and performance per unit area. Aggressive and novel microarchitecture techniques, along with block-level sparsity support, optimize compute and data reuse to minimize bandwidth and power requirements, enabling ultra-low-latency applications for low-power and cost-sensitive Edge deployments. MxCore requires a small silicon footprint of 0.2068 mm² in a modern 7-nm process at 1 GHz, achieves 0.15 FP32 and 0.62 INT8 TMAC/mm², and dissipates only 11.66 μW of leakage power. At iso-technology and iso-frequency, MxCore provides an energy efficiency of 651.4×, 159.9×, 104.8×, and 124.2× compared to Nvidia’s 128-core Maxwell GPU for dense General Matrix Multiply, sparse Deep Neural Network, Cholesky decomposition, and triangular matrix solve, respectively.
1 INTRODUCTION
Emerging computer applications of Artificial Intelligence (AI) [63, 70] based on Deep learning [49] and other Machine learning–based [45] approaches use computationally intensive tensor algebra across multiple domains such as Robotics [26], Drones [72], Augmented/Virtual/Mixed Reality [80], Data science, Physical Sciences, Graph Analytics [77], and Computational Imaging. Computer vision and Deep learning algorithms deployed in Edge systems are rich in matrix operations, where pixel or operand data are processed through multiple stages. Several algorithms used in computer vision and AI demand efficient matrix processing. In such algorithms, matrix-multiplication stages and scalar computation stages are often interleaved, with the output of one stage fed as input to the other. Cholesky decomposition [35] and Triangular Matrix Solve [79] are examples of such matrix processing kernels, where square-root or division operations are used as scalar operations to compute the final values of diagonal and non-diagonal elements, respectively. These values are fed to the matrix-multiplication kernel to partially compute the next set of rows and columns, followed by the scalar operations for calculating the final values. Likewise, in Deep Learning, convolution/General Matrix Multiply (GEMM) kernels can be mapped as matrix multiplications onto Single Instruction Multiple Data (SIMD) engines, while operations such as normalization and activation functions are performed on scalar or Single Instruction Single Data (SISD) engines. The output of these operations is further fed as input to a matrix-multiplication kernel for subsequent layer computations. Vector operations such as matrix multiplication are often offloaded to a dedicated engine for performance and energy efficiency. Such fine-grain data coupling between vector and scalar operations poses significant mapping challenges due to frequent data movement and synchronization issues.
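To make this interleaving concrete, the sketch below writes out a lower-triangular Cholesky factorization so that the dot-product stages (SIMD-friendly) and the square-root/division stages (SISD) are explicit. This is an illustrative scalar reference for the dependency pattern, not MxCore’s actual mapping:

```python
import math

def cholesky_interleaved(A):
    """Lower-triangular Cholesky (A = L * L^T), written to expose the
    interleaving of vector (dot-product) and scalar (sqrt/div) stages.
    Illustrative only; a real SIMD engine would parallelize the dot products."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # SIMD-style stage: dot product over previously computed columns.
        s = sum(L[j][k] * L[j][k] for k in range(j))
        # SISD stage: square root finalizes the diagonal element.
        L[j][j] = math.sqrt(A[j][j] - s)
        for i in range(j + 1, n):
            # SIMD-style stage: dot product for the off-diagonal element.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            # SISD stage: division finalizes it, feeding later dot products.
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L
```

Note how each scalar result is immediately consumed by the next vector stage: this is the fine-grain data coupling discussed above.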
Execution efficiency is also sensitive to operand dependencies while processing successive stages of the computation as in matrix decomposition equations.
Current processor-based solutions struggle to meet tight real-time latency requirements, nor are they sufficiently energy efficient. Dedicated Application Specific Integrated Circuit (ASIC) accelerator solutions may meet these targets but are not ideal for System on Chip (SoC) integration from a form-factor, cost, and power perspective if the solution is limited to serving only a single algorithm or application. Such dedicated accelerator solutions limit scalability and flexibility. Constrained Edge deployments today require a highly programmable processor that maximizes compute reuse while minimizing interaction with the host to optimize energy efficiency. Such a processor needs to achieve low latency while allowing flexibility in the mapping of diverse matrix operations to optimize performance per unit area. A highly customized hardware design for a specific, often narrow, user requirement is not attractive for amortizing the cost of research and development. We propose a unified Matrix Processor called MxCore that simultaneously supports a large set of Image Processing, Computer Vision, and AI-based compute requirements. We present MxCore as an ideal solution for accelerating tensor operations that require both matrix and scalar compute while allowing high programmability and threaded execution. Integrated Matrix Algebra-specific tiling and scheduling logic enables MxCore to execute computationally heavy tensor operations at extremely low latency within a power and area budget suitable for diverse Edge applications.
MxCore makes the following unique contributions:
To the best of our knowledge, we propose the first Programmable Matrix Processor platform for general dense and sparse matrix algebra using tensor objects. When compared with Nvidia’s Maxwell GPU, MxCore provides an energy efficiency of 651.4×, 159.9×, 104.8×, and 128.2× for dense GEMM, sparse Deep Neural Network (DNN), Cholesky decomposition, and triangular matrix solve, respectively, at iso-frequency and iso-technology.
We describe a general abstraction as shown in Table 1 for mapping a rich set of matrix algebraic equations to MxCore’s compute engines using an integrated tiling and scheduling logic.
The Matrix Processor skips non-contributing operands to enhance compute efficiency by automatically detecting and exploiting sparsity in the input operands. It also skips irrelevant compute tiles by limiting the tile boundaries to only valid regions.
MxCore supports high programming flexibility using microkernels, mainly for (a) sparse compute on the dense compute result matrix and other possible operand matrices, (b) data reformatting to ensure compatibility with the consumer block, and (c) duplicating the result matrix to various memory regions as per system programmer requirements.
Workloads can be scheduled to execute on only the SIMD core, only the SISD core, or on both, where the SIMD results are consumed by the SISD kernel. If a workload falls outside the scope of natively supported matrix equations, computed results can be integrated appropriately on the host side to map complex equations using MxCore software Application Programming Interfaces (APIs).
2 BACKGROUND AND MOTIVATION
2.1 Trends in Computing with Artificial Intelligence
The rapid re-emergence of Artificial Intelligence [63, 70] over the last decade has created a paradigm shift in computing. Machines are getting equipped, as never before, with capabilities to sense, analyze, react, learn, and adapt, mimicking human intelligence with unprecedented results. The advent of Machine Learning–based approaches [23, 45] allows computers to perform tasks without explicit programming. These approaches leverage mathematical models to predict or make decisions based on a sample dataset. The performance of machine learning algorithms heavily depends on the representation of the data they are given.
In classical approaches, Artificial Intelligence tasks are performed by designing the right set of data representations for the task and using estimation techniques that operate over these representations. Statistical Bayesian Estimation [15], Kalman Filtering [46], and Regression Analysis [20, 64] are among the many prominent estimation algorithms that operate over hand-curated data representations. Bayesian machine learning methods draw conclusions by analyzing the posterior distribution of the quantities of interest, which are treated as random variables. The Kalman filter uses a series of measurements observed over time, containing statistical noise or other inaccuracies, and predicts values of unknown variables with greater accuracy than those based on a single measurement by estimating the joint probability distribution of the variables. Regression analysis estimates the parameters of a mathematical model that best predicts the dependent variables observed in the data based on the independent variables, taking statistical random noise into account. These algorithms typically solve a set of linear equations or employ convergence techniques (e.g., the gradient descent method [8, 69]) for non-linear systems. Compute patterns for several of these effective methods require various matrix operations (or Tensor Algebra) such as Matrix Multiply, Transpose, Decomposition, and Inverse that operate over high-dimensional data [14, 28, 58]. Language identification [67, 75], spam detection [78], anomaly detection [11], sentiment analysis [59, 65], navigation [10, 57, 62], and target tracking [53, 68] are among the key use-case applications of classical machine learning.
Modern Deep Learning–based approaches [49] advance the effectiveness of Artificial Intelligence by introducing layers of data representation expressed in terms of other, simpler representations and self-learning these data representations [23]. They provide neuromorphic (brain-like) intelligence to modern-age machines by constructing complex concepts over simpler ones. With multi-layer perceptrons, Convolutional Neural Networks (CNNs), Long Short-Term Memory, Recurrent Neural Networks, Attention Networks, Generative Adversarial Networks (GANs), and others, Deep Learning has evolved greatly across domains and applications. All these advancements rely on the basic abstraction of the neuron function \( y = f\left(\sum _{i}{w_{i} \times x_{i}}\right) \).
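As a minimal illustration, the neuron abstraction above can be written directly; the choice of tanh for the activation \( f \) here is arbitrary:

```python
import math

def neuron(weights, inputs, f=math.tanh):
    """The basic neuron abstraction y = f(sum_i w_i * x_i).
    f is the activation function; tanh is used purely as an example."""
    return f(sum(w * x for w, x in zip(weights, inputs)))
```

Every deep network ultimately reduces to large numbers of such weighted sums, which is why MAC throughput dominates hardware design for Deep Learning.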
Deep Learning has revolutionized the computing industry with multiply-and-accumulate (MAC) as the most fundamental compute primitive along with a few handful element-wise and other simple functions (e.g., pooling) to perform various AI tasks. Deep learning–based AI tasks include but are not limited to classification, object detection, machine translation, speech recognition, and recommendation systems. Due to the high dimensionality of data and SIMD nature of compute, tensor processing–based hardware architectures optimized for compute primitives such as GEMM, vector dot product, and convolution are well suited for Deep Learning applications.
Classical machine learning–based approaches work better on small and controlled data, are financially and computationally cost-effective, and are often easier to interpret. Deep Learning methods, however, scale well with data size, provide best-in-class performance, and are adaptable and transferable without the need for data-representation (feature) engineering. While in practice the better of the two methods is chosen based on the use-case scenario, recent computing trends have also included approaches [47, 51, 56, 84] that adopt both classical Machine Learning and modern Deep Learning in performing a given real-world task. Having the flexibility to map both classical and Deep learning approaches in a hardware solution enables a broad range of AI applications.
2.2 Importance of Edge Intelligence
The world’s most valuable resource is no longer oil but data [21]. Rapid advancements in sensor technologies and a surge in the use of endpoint devices have equipped the eco-system with trillions of data bytes being generated at the Edge [19]. Meanwhile, several breakthroughs in AI-based algorithms, improvements in semiconductor technology, and rapid development in communication technology have propelled the growth in cloud-based applications across domains. However, the edge-to-cloud information processing model for AI leads to unsustainable and inefficient zettabyte-scale data movement [43]. In addition, data security, high turn-around time for latency-critical applications, data communication cost, and autonomous use cases for the swarm of endpoint devices (Figure 1 (right)) are driving the AI research and deployment towards a new computing paradigm with Edge Intelligence [41]. It is expected that Edge AI device shipments will reach 2.6 billion units annually by 2025 (Figure 1 (left), source: Reference [76]).
Fig. 1. Left: Trends and forecast AI edge device shipments (source: Tractica, [76]); Right: AI tasks and benefits with edge computing across few key domains.
Fig. 2. Prominent Hardware Solutions for AI at the Edge.
Catering to the surge in demand for edge AI devices with low cost and power constraints, advanced software techniques have been proposed to reduce computational complexity and memory footprint requirements through model compression [39]. These methods include one or more approaches such as model distillation [36], network architecture search [9, 18, 22], dynamic pruning with network regrowth during training [17, 30, 31, 61], operator-factorization [83], low-precision value quantization [27], data compression [29, 44], parameter sharing [16], and sparsification strategies [24, 33, 52]. However, many software techniques rely heavily on architectural support in hardware for efficient execution. The severe constraints on Edge Intelligence in terms of energy, form factor, and cost require novel hardware architectures to be co-designed with and exploit AI software and algorithmic innovations. A tight integration between logic and memory will have a major impact on the efficiency and footprint of Edge AI [41].
2.3 Edge Computing for AI: Architectures, Challenges, and Opportunities
Interleaving of SIMD and SISD operations: Compute primitives of Edge Intelligence, as mentioned in the previous section, often consist of two classes of operations: (1) SIMD MAC operations and (2) SISD operations that are interleaved with the SIMD operations. The interleaving of these operations in the overall algorithm causes the output data from one class of operations to be fed as input to the other class. For example, in Cholesky decomposition, a square-root or division operation (SISD) is performed to compute the final values of diagonal or non-diagonal elements. These computed elements are then required for vector dot-product operations (SIMD) while computing the successive row/column elements.
Similarly, a computational graph in Deep Learning contains multiple nodes corresponding to different compute primitives such as GEMM, convolution, activation functions, quantization, pooling, normalization [71], and other element-wise operations. While GEMM or convolution kernels are performed as SIMD operations for efficient mapping, other primitives are executed as SISD operations. As these computational nodes feed forward into other nodes, which often belong to the other primitive class, interleaving between the two classes of primitives becomes inevitable. Since tile-based architectures are usually adopted for efficient execution, owing to high data reuse under limited on-chip buffer/cache capacity, the data-dependent interleaving creates significant challenges in achieving high performance.
These tight operand (data) dependencies can significantly degrade the computational efficiency of a hardware architecture due to the high data movement cost and the round-trip latencies associated with input/output buffer sharing through memory.
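Triangular matrix solve by forward substitution exhibits exactly this dependency chain: each division result immediately feeds the next row’s dot product. The following scalar sketch (a generic reference, not MxCore-specific) labels the two stages:

```python
def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L. Each step interleaves a
    dot product over already-solved unknowns (SIMD-friendly) with a
    single division (SISD); the division result feeds the next row's
    dot product, creating the tight operand dependency discussed above."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))  # vector stage
        y[i] = (b[i] - s) / L[i][i]                # scalar stage
    return y
```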
Recent works [32, 55, 81, 82] offer automated frameworks for efficient task graph creation, handling data dependencies at the system level in multi-core architectures. Self-optimizing and self-programming computing systems (SOSPCS) [81] presents a self-optimizing and self-programming resource allocator and run-time resource manager for multi-core heterogeneous architectures. Ma et al. [55] proposed automated graph-cut algorithms to efficiently partition the work among many clusters on Non-Uniform Memory Access architectures. Reference [82] presented a framework that generates graph abstractions through program analysis, finds an optimal parallelization strategy partitioning the work into interdependent clusters, and finally uses reinforcement learning to map these clusters onto processing elements of a heterogeneous architecture with an irregular NoC. He et al. [32] proposed an energy-aware design methodology to maximize parallel execution while minimizing inter-core communication for applications in unmanned aerial vehicles. Though these frameworks offer significant improvements in network latency, application performance, and power consumption at the system level, highly cost-sensitive and power-constrained edge applications also require a low-complexity, hard-wired implementation that resolves acute data dependencies at the microarchitectural level without host intervention.
Additionally, for complex data-dependency patterns (such as those required in Cholesky decomposition), the tile walk order becomes quite challenging, especially with host-CPU-based tile scheduling. In such architectures, tight operand dependency can lead to inefficient resource utilization even if the individual compute primitives were accelerated through custom ASICs. In the case of modern AI algorithms, various activation functions require per-element computation on the result of dense MAC operations. Architectural flexibility and efficiency in filtering MAC results through downstream activation functions, while retaining higher-level programmability, is highly desirable, especially for constrained Edge deployments.
Challenges with state-of-the-art Hardware Solutions for AI at the Edge: Intel’s AMX (Advanced Matrix Extensions) ISA [48] provides specialized instructions for matrix multiplication using large tile registers. Non-MAC operations, such as square root and division for matrix decomposition or matrix solve, or the non-linear activation, pooling, and other custom functions for DNN layers, can be performed through the regular ISA. However, this requires costly data movement between AMX’s tile registers and the general-purpose registers, creating bandwidth and efficiency bottlenecks. Furthermore, the bandwidth of data accesses through CPU caches limits peak utilization of the compute resources provisioned in AMX. Intel’s DL Boost ISA-based technologies (AVX/VNNI [42, 54]) are also known to be compute-limited for matrix-multiply operations.
GPU-based solutions provide a large number of cores for highly parallelizable workloads. For matrix processing functions (such as decomposition or solve), acute data dependencies between vector operations (highly parallelizable) and scalar operations (with limited parallelization opportunities) cause synchronization issues. Varying ratios between the two classes of operations at different stages of end-to-end application processing pose additional challenges for thread scheduling and for global resource utilization and efficiency.
Hybrid solutions such as Intel Movidius’s NCS (DPUs and SHAVE cores) or Cadence DNA (NNE and DSP cores) offer a heterogeneous architecture for matrix operations and DNN applications, where vector MAC operations are mapped onto an ASIC accelerator core while special control-related functions and custom layers are executed on programmable DSP cores. In these architectures, shared memory is used for data sharing between the MAC and DSP compute paths. However, this leads to bandwidth bottlenecks and scheduling and synchronization overheads, significantly impacting performance and energy efficiency.
Accelerator-based architectures offer customized data paths between compute units for efficient execution. However, even within a single application, and across the many use cases a product must support efficiently, different flavors of functions/kernels are needed, requiring many custom accelerators to be built. For example, Intel VIO [58] has dedicated fixed functions for Matrix Multiply, Cholesky Decomposition, and Matrix Solve. Such solutions are therefore very inefficient in terms of chip-area cost, resource utilization, and energy.
In this work, we address these challenges through a unified architecture comprising a tightly coupled vector engine and programmable threaded scalar cores that share data through highly optimized interconnects, with a configurable hardware scheduler managing the co-execution.
Exploiting Sparsity: Applications that use matrix processing often use different types of matrix operations with different data formats, for either dense or sparse computing. A solution that builds a dedicated hardware accelerator for each distinct matrix operation is attractive neither from a cost nor from a power standpoint for constrained Edge AI applications. Overall penalties related to compute inefficiencies in the Cloud arising from data sparsity are higher compared to those suffered at an Edge device when the overheads of transferring a workload across the network are fully comprehended. Similarly, sub-optimal execution of a dense workload at the Edge also hurts system efficiency. The investment in a matrix processor is rewarding and competitive only when high efficiency is achieved on both dense and sparse compute. Input data sparsity provides opportunities for the processor to drastically reduce computation time. Sparsity-driven efficiencies are achieved through the deliberate skipping of non-contributing computations based on an inspection of input data values. They are also achieved by directing the computation pattern to limit itself to only lower- or upper-triangle computation, as dictated by the algorithm. An optimally designed architecture should also optimize memory usage, using compression algorithms when storing and operating on operands with high sparsity. Compute precision requirements also vary according to the application, as seen in emerging use cases: DL training, control systems, or mission-critical applications demand high-precision floating-point operations, whereas for most AI inference tasks low-precision integer compute suffices.
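The tile-skipping idea can be sketched as a tiled matrix multiply driven by per-tile occupancy maps: tile products where either operand tile is all-zero contribute nothing and are skipped before any operand fetch or MAC is issued. This is a behavioral illustration of the principle, not MxCore’s hardware mechanism:

```python
def tile_nonzero_map(M, T):
    """Per-tile occupancy map: True where a T x T tile has any nonzero."""
    nt = len(M) // T
    return [[any(M[r][c] for r in range(i * T, (i + 1) * T)
                          for c in range(j * T, (j + 1) * T))
             for j in range(nt)] for i in range(nt)]

def sparse_tiled_matmul(A, B, T=2):
    """Tiled C = A @ B (square, n divisible by T) that skips tile products
    where either operand tile is all-zero, mimicking sparsity-driven
    tile skipping. Returns the result and the number of skipped tiles."""
    n = len(A)
    nt = n // T
    mapA, mapB = tile_nonzero_map(A, T), tile_nonzero_map(B, T)
    C = [[0.0] * n for _ in range(n)]
    skipped = 0
    for i in range(nt):
        for j in range(nt):
            for k in range(nt):
                if not (mapA[i][k] and mapB[k][j]):
                    skipped += 1          # non-contributing tile: no MACs
                    continue
                for r in range(i * T, (i + 1) * T):
                    for c in range(j * T, (j + 1) * T):
                        C[r][c] += sum(A[r][x] * B[x][c]
                                       for x in range(k * T, (k + 1) * T))
    return C, skipped
```

On a block-sparse operand pair most of the nt³ tile products can be skipped, which is where the bandwidth and cycle savings come from.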
Need for Format Conversion: Matrix operands stored in memory may follow different storage formats, either for memory savings through data compression or as required by the desired computation ordering. An AI system is inefficient if the host processor must translate operand formats, while interfacing results from the different algorithms required to execute a use case, merely to achieve compatibility with the formats required by a specific co-processor. Hence, we consider the ability to gather operand data from various storage formats critical for efficient Edge AI computing. The processor’s support for various operand formats without sacrificing compute efficiency is very valuable.
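As one example of such format handling, an operand-gather stage might expand a tile of a Compressed Sparse Row (CSR) matrix into the dense layout a SIMD array expects. The sketch below shows generic CSR-tile gathering, not MxCore’s actual gather logic:

```python
def csr_to_dense_tile(values, col_idx, row_ptr, r0, c0, T):
    """Gather a T x T dense tile starting at (r0, c0) from a CSR-encoded
    matrix, as an operand-gather stage might do before feeding a dense
    SIMD array. CSR layout: row_ptr has n+1 entries; values/col_idx hold
    the nonzeros of each row in ascending column order."""
    tile = [[0.0] * T for _ in range(T)]
    for r in range(T):
        # Walk only the stored nonzeros of this row.
        for p in range(row_ptr[r0 + r], row_ptr[r0 + r + 1]):
            c = col_idx[p]
            if c0 <= c < c0 + T:
                tile[r][c - c0] = values[p]
    return tile
```

Doing this gather in hardware, close to the compute array, avoids burning host cycles on pure data movement.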
Summing up: We discussed multiple architectural requirements for achieving efficient Edge AI in realistic SoC contexts. In general, the power and performance efficiency of an accelerator-based solution is influenced heavily by compute utilization. Novel and integrated tile-based dataflows need to ensure an optimal balance between available operand bandwidth and compute resources (scalar, vector, and matrix) while significantly reducing the load on the host processor. The tiling algorithms need to flexibly and effectively skip non-contributing computations, and must do so as early and efficiently as possible. Memory access is usually the bottleneck in constrained systems, especially when the matrix is larger than cache capacity. Optimal tile traversal orders are necessary to maximize compute within the available bandwidth envelope. For real-time Edge applications, AI algorithm-specific, operand-dependency-aware tiling is required to enable optimized operand dataflow from memory and to achieve maximum compute utilization, system performance, and power efficiency. Software running on the Edge’s host processor should play a lightweight but effective role in decomposing complex algorithms into chunks of SIMD, SISD, or combined SIMD-and-SISD workloads and efficiently orchestrate their execution.
3 PROPOSED MATRIX PROCESSOR
The proposed Matrix Processor, MxCore, combines a vector engine with a tightly coupled programmable superscalar core. Vector compute elements are arranged as a two-dimensional (2D) compute array for performing parallel dense vector compute, forming a SIMD processing engine. Matrix operations targeted for acceleration generally consist of dense vector operations followed by sparse scalar computation. All dense operations are executed in the SIMD compute array, and the remaining operations are executed in a programmable core, the SISD compute path, using natively supported instruction sets. The SISD core can also be used for modifying the result matrices based on user-defined algorithms, for any other per-element computation, or for operations such as data formatting and duplicating the result matrix to multiple locations. The tightly fused programmable superscalar core, together with the vector compute, allows efficient mapping of various matrix algebra for diverse acceleration scenarios with minimal memory interaction.
The matrix processor’s integrated tiling and tile sequencing logic maps various matrix equations to its compute paths, avoiding host-processor overhead and benefiting system-level performance. The tiling algorithm divides input matrices into blocks of computational tiles that fit into the SIMD compute array and optimally sequences the tiles for execution. The sequencing logic also controls execution order based on operand dependencies, as tile walk patterns can change the overall performance of both the vector and scalar compute units of the matrix processor. Tiles are generated output-stationary, retaining partial results locally for overall system power efficiency by avoiding external memory interaction. The integrated tiling logic can also skip non-contributing tiles based on the sparsity of the input operands. MxCore’s auto-detect logic generates sparsity information for the operands from the compute path. The integrated tiling and scheduling logic consumes the auto-generated sparsity information and skips non-contributing tiles in successive iterations that use the same operands.
The Inline buffer interfaces the SIMD and SISD compute engines and formats the operands referenced in the SISD microkernel. The Inline buffer generates a thread packet with operands and thread information for identifying transactions from the microkernel. In a thread packet, operands are arranged as tiles, and the tile resolution can be selected based on microkernel complexity and register space requirements. The matrix processor supports the SISD4 \( \times \) 4 thread tiling format for a 4 \( \times \) 4 tile of output elements. The SISD4 \( \times \) 8 thread packing format is used for a 4 \( \times \) 8 output tile, whereas SISD8 \( \times \) 8 contains operands for an 8 \( \times \) 8 tile’s worth of output elements.
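A software model of the thread-packet formation might look as follows; the field names and packet structure here are illustrative assumptions, not MxCore’s actual packet layout:

```python
def pack_thread(simd_results, r_operands, tile_id, fmt="SISD4x4"):
    """Form a thread packet for the SISD core: the MAC results for one
    output tile, optional scalar R[] operands, and thread metadata used
    to identify the transaction from the microkernel. Field names are
    hypothetical, chosen only for this sketch."""
    rows, cols = {"SISD4x4": (4, 4), "SISD4x8": (4, 8),
                  "SISD8x8": (8, 8)}[fmt]
    assert len(simd_results) == rows
    assert all(len(r) == cols for r in simd_results)
    return {"tile_id": tile_id,   # identifies the transaction
            "format": fmt,        # selected tile resolution
            "mac": simd_results,  # SIMD partial/inner-product results
            "R": r_operands}      # optional scalar operands
```

Selecting a larger tile resolution trades register space in the SISD core for fewer packet-handling transactions per output matrix.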
MxCore’s operand data gathering blocks achieve higher compute utilization by supporting various data formats and a custom arbitration fabric. Multiple references to the same operands are very common in matrix operations such as convolution, matrix multiplication, and similar matrix algebraic equations. The caches and buffers of the data network judiciously cache operand data to optimize bandwidth, working harmoniously with the tiling logic of the active matrix algebra kernel. Operand Gather can also perform a data availability check before performing a read operation, thereby enabling out-of-order execution of SISD threads and smart scheduling of tiles that use result matrices from a thread in flight as operands.
3.1 Matrix Processor Architecture
The proposed architecture of the MxCore matrix processor is divided into sub-modules as shown in Figure 3(a), which also details potential integration into an example SoC platform. MxCore can be easily integrated through configuration registers and interrupt protocols. The host processor may set up and kick off processing tasks by programming the memory-mapped control registers and monitor the task by polling status registers or via task-completion interrupts.
Fig. 3. MxCore top-level architecture and configuration.
MxCore can be configured at design time to achieve the compute power required by the application. MxCore’s framework allows scaling of the design for higher vector compute capacity, the number of scalar threads and scalar compute resources, various matrix operations, and data-gathering features, making it an agile platform for matrix algebra. Figure 3(b) shows the simulated configuration of the MxCore design, and details of the microarchitecture are explained in Figure 4. The building blocks of MxCore can be grouped as SIMD MAC Compute, Programmable SISD core, Operand Gather and Caches, L1 Memory Fabric, Tiling and Sequencer logic, and Auto Sparsity detection. A matrix kernel execution request from the host interface is broken into multiple tiles by the Tiling and Sequencer logic. Executable tiles are routed to Operand Gather for fetching and formatting the relevant operands. These operands are fed into the SIMD MAC path as tiles for vector computation. MAC results from the SIMD MAC path are packetized with other optional scalar operands from external memory by the thread packing block. These thread packets are fed into the SISD core for running microcode written using natively supported instructions. The microcode can implement the matrix kernel’s computation or user-defined algorithms. For example, in the case of GEMM, we use microcode to perform blending of MAC results with scalar values. In parallel with the operand gather cycles, the Auto Sparsity detector generates sparsity information per operand tile, which is retained in the sparsity buffer. The sparsity information of the operands is used to opportunistically optimize the tile sequence flow in successive iterations. Detailed block-level descriptions are given in the following sections.
Fig. 4. Micro-Architecture of MxCore.
3.2 Tiling and Sequencer
The MxCore matrix processor has integrated tiling and sequencing logic for mapping frequently used matrix equations onto the compute elements. A tile is the unit of data elements of the A or B operands that is processed in a cycle. The Tiling and Sequencer state machines divide input operands into tiles for calculating a fixed output tile size, based on the column (or channel) count of the vector compute path. A and B operands are tagged for dense compute, and R[0] to R[N] operands are tagged for sparse or scalar compute. MxCore’s low-level abstraction for computing an output tile can be represented as \( Out[4,4x4] = SISD\_kernel[4,4x4](SIMD\_MAC[4,4x4](A(8xN),B(Nx8)), R[][4,4x4]); \) where SISD instructions are used for mapping different algorithms. Depending on the specific matrix algebra to be performed, multiple tiles of A and B operands are generated for dense compute, and corresponding R tiles are generated for sparse compute, as shown in Figure 5. At the tiling stage, the necessary information is generated for the data gather logic to fetch operands from memory, as well as for executing the tile through the last stage of the compute path. In the simulated configuration, MxCore supports 32 single-precision MACs in the vector pipe, structured as a 4 \( \times \) 8 2D array, and the tiling logic divides the input operands A and B for calculating a 1 \( \times \) 8 vector of the output matrix. To exploit the computational advantage of spatial operand reuse, an A-operand tile of size 1 \( \times \) 4 is broadcast to all channels of the SIMD4 \( \times \) 8 compute array, and the corresponding 4 \( \times \) 8 tile of B operands is mapped to the appropriate execution ports, producing partial results for 1 \( \times \) 8 elements of the output matrix per cycle.
The SIMD engine sustains its peak throughput within an 8 \( \times \) 8 block using the available memory port bandwidth of 16 B per operand, by changing the 1 \( \times \) 4 row elements of the A operand while keeping the B operands fixed for 8 successive cycles. Inner-product results of size 8 \( \times \) 8 from the SIMD vector compute path are taken through the SISD core, where microkernels execute user-defined algorithms on the SIMD results along with the R[] operands. To maximize compute utilization and memory bandwidth, the tiling logic works on an output tile size of 8 \( \times \) 8 in the 32-MAC single-precision configuration and retains the partial results produced each cycle in the processor’s local buffer for successive accumulation. The same tiling logic is shared for 32-bit integer compute, since both formats represent operand data with the same number of bits. The tiling logic uses the same state machine to generate executable tiles and map operands onto the SIMD engine for all supported operand formats. Irrespective of the operand format, the compute path always generates a 1 \( \times \) 8 tile of output results per cycle, maintaining interface compatibility, at an 8 \( \times \) 8 tile size, with the programmable core for subsequent execution.
Fig. 5. Operand Tiling and Memory View.
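The schedule above can be illustrated with a minimal Python sketch (ours, not the RTL; `simd_8x8_tile` and its arguments are hypothetical names): each of 8 cycles broadcasts a 1 \( \times \) 4 slice of an A row to the 8 channels holding a resident 4 \( \times \) 8 B tile, and partial sums stay output-stationary in local accumulators.

```python
# Sketch of how one 8x8 output tile is produced by the 4x8 SIMD MAC array.
# Each "cycle" broadcasts a 1x4 slice of an A row to 8 channels holding a
# 4x8 B tile; the 1x8 partial result accumulates locally (output-stationary).

def simd_8x8_tile(A, B, row0, col0, K):
    """A: MxK, B: KxN as lists of lists. Returns the 8x8 tile of A@B
    whose top-left corner is (row0, col0)."""
    out = [[0.0] * 8 for _ in range(8)]
    for k in range(0, K, 4):                 # accumulation advances by K = 4
        for r in range(8):                   # 8 cycles: A rows change,
            a = A[row0 + r][k:k + 4]         # B tile stays resident
            for c in range(8):               # 8 parallel channels
                acc = 0.0
                for i in range(4):           # 4 MACs per channel
                    acc += a[i] * B[k + i][col0 + c]
                out[r][c] += acc
    return out
```

With K advanced in steps of 4, the B tile stays resident for 8 cycles while the A rows change, matching the burst behavior described above.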
It is also possible to bypass the integrated tiling logic by dividing the operand matrices of the selected matrix equation into 8 \( \times \) 8 blocks for batched execution, while maintaining the execution abstraction described above to exploit the flexible compute capabilities of MxCore. Operand dependency-aware tile walking reduces execution stalls caused by dependencies on previous tiles in flight, hides operand fetching latency, and opportunistically enables parallel execution. Such dependencies are common where previous execution results feed successive calculations, as in control-system loops. Tiles are generated output-stationary, retaining partial results locally for overall system power efficiency: buffering partial results locally reduces memory traffic and saves system power. Tiles are judiciously ordered to minimize buffering while balancing available memory bandwidth and compute utilization. Generated tiles can be spatially separated if their operands come from adjacent execution tiles, reducing stalls due to operand dependency. For sparse input matrices, the state machines can skip tiles to avoid non-contributing memory fetches and computation cycles. Input sparsity can be expressed through predefined patterns such as lower or upper triangles, or it can be completely random. Similar logic also controls the computation pattern of the output results by limiting the output calculation coordinates.
3.3 Block Level Sparsity
The MxCore matrix processor schedules operand tiles according to the compute capability of the processor configuration. MxCore’s tiling algorithm ensures that sufficient operand gathering is performed to feed the vector compute path every cycle. For each matrix equation, the Tile Walking Algorithm picks a walking pattern that maximizes operand reuse and minimizes external memory traffic for both the input operands and the computation results. Because MxCore’s scheduler decomposes execution into tiles, it can easily skip tiles that do not contribute to the output results; this occurs in sparse workloads where all elements of the selected operand tile are zero. MxCore’s auto-detection of block-level sparsity efficiently selects only the non-zero tiles for subsequent computation in sparse workloads. To retain the block-level sparsity information, MxCore uses internal buffers and deallocated input operand space as temporary storage for resource sharing. The auto sparsity detector sits on the data path of the matrix processor and generates metadata for the scheduler to consume in subsequent iterations of the matrix equation being processed, as shown in Figure 6. Matrix equations such as Single-Precision General Matrix Multiply (SGEMM) or Convolution (CONV) make multiple references to the same operand coordinates due to the inherent nature of the operation, so this feature also works opportunistically even for workloads not categorized as sparse.
Fig. 6. Operand Gather, Formatting, and Sparsity detect.
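The block-level sparsity metadata can be approximated with the following Python sketch (our simplification; the tile sizes and helper names are assumptions). It flags every all-zero operand tile so the sequencer can skip that tile's fetch and compute cycles in later iterations of the same equation.

```python
# Sketch of block-level sparsity detection: flag every 4x8 operand tile
# that is all zero; the scheduler then walks only the contributing tiles.

def build_sparsity_map(M, tile_rows=4, tile_cols=8):
    """M: list-of-lists matrix. Returns {(tile_r, tile_c): is_zero}."""
    rows, cols = len(M), len(M[0])
    zmap = {}
    for tr in range(0, rows, tile_rows):
        for tc in range(0, cols, tile_cols):
            zmap[(tr // tile_rows, tc // tile_cols)] = all(
                M[r][c] == 0
                for r in range(tr, min(tr + tile_rows, rows))
                for c in range(tc, min(tc + tile_cols, cols)))
    return zmap

def contributing_tiles(zmap):
    # Tiles the scheduler still has to fetch and execute.
    return [t for t, zero in sorted(zmap.items()) if not zero]
```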
3.4 Operand Gather And Caches
Operand Gather, shown in Figure 6, fetches and formats the operands feeding the SIMD and SISD engines. Each scalar-core operand R[] has a vector dimension that matches the 8 \( \times \) 8 tile dimension of the dense MAC results. R[] operand fetch requests are scheduled concurrently with the SIMD dense calculation cycles, hiding the fetch latency. Being on the slow path, these requests are arbitrated through a shared memory interface. Inline buffers are placed between operand fetch, SIMD results, and the operand packing logic, isolating the complexities of thread packing and the SISD core’s interface.
Depending on the tile size of the operands, the accumulation stage may take multiple cycles to create operand data. Reading SIMD operands from memory is costly, as it involves multiple memory cycles and data path toggles to assemble a tile’s worth of operand data. The operand cache retains data at the respective tile size, allowing quick reference on successive iterations of the same tile by the scheduler. In these scenarios, the caches provide buffered tile data for the compute path to consume, benefiting both power and performance. Caching these operands therefore improves overall efficiency, especially when MxCore is integrated onto the shared interconnect of a larger SoC. Depending on the matrix equation, the number of successful references to these caches can vary, and the tile walk order greatly influences the hit pattern. In matrix multiplication, an A operand matrix row is referenced multiple times to multiply with B matrix columns while calculating different columns of the output matrix; similarly, in convolution, the activation layer is referenced multiple times while convolving with the available feature weights. The pipelined SIMD execution path requires operands every clock cycle to achieve high utilization. The design’s caches (SIMD operand cache, SISD instruction cache, and optional L1 caches) reduce memory fetch latency and benefit overall system power and performance. Buffering and caching of operands in tight coordination with the tiling algorithm reduces the cost of operand fetch, producing a tile’s worth of operands every cycle. The operand cache consists of a lookup structure keyed by the tile’s tag parameters and supports buffering to cover variable latencies at the interface ports.
3.5 Operand Format Conversion
Operand data may be stored in row-major or column-major format. It is well understood that the operand data gathering approach can greatly influence the computation efficiency of a processor in a balanced system. MxCore can convert operand layout formats on the fly, e.g., row major to column major, to align with the compute dimensions of the matrix processor for higher utilization. The format converter, shown in Figure 6, buffers the unused elements of a fetched memory line for subsequent reference if the request appears in consecutive cycles. The tiling algorithm used for the various matrix equations operates at the 8 \( \times \) 8 output tile boundary, ensuring successful references to these buffers and producing format-converted data every clock. The buffer depth is limited to the number of elements per memory line, and the tiled memory address generator flips the tile address while converting the format, before looking up the local buffer. Tiled memory fetches enable transposition of formats in line with operand gathering, rendering the computation unhindered by the complexities of memory storage patterns and compression formats.
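A minimal Python sketch of the on-the-fly conversion idea (ours; the helper names are hypothetical): at an 8 \( \times \) 8 tile boundary, the transposed tile is obtained by flipping the tile's row/column coordinates in the address generator and transposing the buffered lines.

```python
# Sketch of layout conversion at an 8x8 tile boundary: fetching the
# transposed tile is the original tile address with its coordinates
# flipped, plus a transpose of the buffered memory lines, so the compute
# path always sees operands in its native orientation.

def fetch_tile(M, tr, tc, n=8):
    return [row[tc * n:(tc + 1) * n] for row in M[tr * n:(tr + 1) * n]]

def fetch_tile_converted(M, tr, tc, n=8):
    # Flip the tile coordinates, then transpose the buffered lines.
    t = fetch_tile(M, tc, tr, n)
    return [[t[r][c] for r in range(n)] for c in range(n)]
```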
3.6 Memory Arbiter
MxCore’s Memory Arbiter arbitrates read and write requests on four 16-B ports for efficient bandwidth utilization and performs operand validity checks to ensure reads are ordered only after writes. Ports are mapped to a programmable base address for flexibility, and transactions are spawned from the tiling stage based on the matrix operation. Memory read requests include instruction reads, thread packet reads from scratch space, and operand reads. The arbiter receives write requests for the result matrix from SISD microcode and from thread packet cache traffic. Port-to-operand associations are generated during tiling and carried as metadata along with the execution packets. While tiling the operands, MxCore’s Tiling Algorithm can choose among the available ports, directing the arbiters to fetch from various base memory addresses through port identifiers. The fill path of the memory arbiter opportunistically maps data access requests to any available cycle of the memory ports and delivers data out of order; it is used for low-bandwidth sparse accesses, instruction fetches, and scratch space accesses. The memory arbiter has the necessary buffers to hide fetch latency and ensures in-order responses to the processor on all supported client ports, even when client requests are routed to the memory ports out of order. Since operand gathering is tile based, MxCore issues requests for a full tile’s worth of operand data to the arbitration network. MxCore employs address translation logic to convert a tile address to a physical linear address based on the stride value (the column-element storage space before successive rows) of the input operand memory image. The address generator can also compress or decompress matrices whose valid elements lie only in the lower or upper triangle, saving memory space.
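The tile-to-linear address translation can be sketched as follows (our simplification; the function name and the 4-byte element size are assumptions): a tile coordinate plus the operand's stride yields the physical byte address of each element row in the tile.

```python
# Sketch of tile-address translation: a tile coordinate plus the operand's
# stride (column elements stored before the next row) gives the linear
# byte address of each element row of one 4x8 tile.

def tile_row_addresses(base, stride, tile_r, tile_c,
                       tile_rows=4, tile_cols=8, elem_bytes=4):
    """Linear byte address of the first element of each row of one tile."""
    addrs = []
    for r in range(tile_rows):
        row = tile_r * tile_rows + r
        col = tile_c * tile_cols
        addrs.append(base + (row * stride + col) * elem_bytes)
    return addrs
```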
While operands are accessed through the Memory Arbiter, qualifier logic certifies operand validity for data correctness where necessary. This ensures that result matrix data are written to memory before un-gating the fetch requests for subsequent processing, which is needed mainly for equations where row results are looped back into subsequent calculations, as is common in control feedback systems. As tiles are generated, operand information is carried as metadata along with the execution packet to enable such checks, and the memory qualifier status is updated as results are stored from the respective microcode kernels. The memory arbiter interfaces to the memory hierarchy, where space is reserved for operand data, kernel instructions, and scratch space for optionally storing thread packets temporarily during multithreaded execution. MxCore never writes partial results to memory; external memory is used only for reading input operand matrices and instructions. Multi-banked L1 memory is accessed through four 16-B memory ports, and MxCore can consume a memory bandwidth of 48 bytes per cycle. The optional L1 memory of MxCore can be organized as multiple memory banks; the associated memory controller should resolve bank conflicts and guarantee per-port in-order responses.
3.7 SIMD MAC
The SIMD vector compute path is a pipelined dense array of multipliers, adders, and comparators (for the Min/Max Pool feature) supporting single-precision and integer formats. The compute elements are organized as a 2D grid, with adders summing the multipliers’ outputs within each column. Loaded with a (1 \( \times \) N) tile of A operands and an (N \( \times \) P) tile of B operands, this structure produces 1 \( \times \) P output elements per cycle. Each SIMD operand has a 16-B memory fetch port, and the A operand is broadcast to the compute nodes of all P channels to maximize spatial reuse of operands. To balance the compute-to-bandwidth ratio, the Tiling Algorithm carves out operands to produce an 8 \( \times \) 8 result matrix in bursts. In the 32-MAC dense compute configuration, the SIMD path works on an 8 \( \times \) 8 output tile size with N = 4 and P = 8, receiving a 1 \( \times \) 4 tile of A operands and a 4 \( \times \) 8 tile of B operands per cycle. Four elements of the A operand are broadcast to the 8 independent channels of the 32-MAC 2D compute array where the B operand vectors are wired, enabling systolic movement of operands as shown in Figure 7. The array also contains data registers to store partial results (a C output tile of P \( \times \) P size) along with P accumulators. The last adder in each channel’s pipeline has a storage endpoint with an N-stage accumulator, as shown in Figure 7. Channel data are accumulated back-to-back over multiple clocks, up to the final contributing tile of the operand matrix, before being handed over to the next stage for further processing.
Fig. 7. SIMD Vector Compute and Accumulator.
The SIMD compute path can also support 8-bit integer operation, where a 2D compute array of size 16 \( \times \) 8 performs 128 MAC operations per cycle. Data gather can fetch 1 \( \times \) 16 elements of the 8-bit A operand and 16 \( \times \) 8 elements of the 8-bit B operand, producing a 1 \( \times \) 8 matrix result. For 32-bit integer operation, 32 MACs are arranged as 4 \( \times \) 8, as in single-precision compute, again producing a 1 \( \times \) 8 matrix result. The 2D compute array always produces a 1 \( \times \) 8 result matrix irrespective of operand precision, providing a compatible endpoint that interfaces seamlessly with the scalar compute stages of the matrix processor. The execution dataflow logic and buffers are shared across all supported formats, except for the compute path itself, achieving higher area efficiency per matrix kernel feature. The agile framework of the MxCore architecture can seamlessly integrate a fused compute path to support additional formats, tensor operations, and algorithms, gaining performance, power, and area efficiency.
3.8 SISD Programmable Super Scalar Core
The SISD superscalar core, shown in Figure 8, executes microcode from instruction memory written in MxCore SISD’s native instruction set. The SISD core is integrated with the SIMD vector compute path through intervening shared thread packet buffers, realizing the fundamental unified compute platform for mapping diverse matrix algebra equations. At the tiling stage, the Tiling Algorithm generates the information necessary for the microcode and its input operands. The SISD core supports concurrent scheduling of instructions, multithreaded execution, and parent-child thread constructs for sharing data within a thread group. The SISD programmable compute core consists of instruction fetch, a parallel instruction decoder, an arithmetic unit, and interfaces to the thread packets carrying both SIMD results and other operands from memory. It uses a 64 \( \times \) 32-bit register bank per thread for programming, as the operand registers are multidimensional tensors. Its ALU has four multipliers, four adders, one square root, and one inverse as compute elements, flexibly accommodating diverse algorithmic requirements.
Fig. 8. SISD Super Scalar Core.
The SISD core supports out-of-order concurrent execution of active threads as well as parallel instruction scheduling for efficiency. The capability to execute multiple threads concurrently greatly improves performance and compute utilization, especially when executing workloads with high operand spatial sparsity and dependency. The SISD core supports parallel instruction scheduling because the operands are mostly tensors and the microcode contains a high degree of instruction-level parallelism. The ALU has four adder and multiplier units supporting float and integer formats. With concurrent instruction scheduling, the SISD ALU can achieve an effective throughput of four MACs per cycle.
The SISD core also supports parent and child thread states for sharing data between threads. This feature helps MxCore extend the scope of the register space beyond the allowed size limit. Parent-child thread grouping also enables the application developer to break larger threads into groups of smaller kernel size; at such fine granularity, the completion times of the smaller threads align nicely, keeping utilization high. At the execution boundary of a family thread group, the programmable registers are not invalidated, allowing data sharing between successive child threads. In this way, threads of the same family group can easily share data between newly spawned child threads and the originating parent thread. MxCore guarantees that all threads of the same group execute on the same thread register space by allocating threads against the parent thread’s identifier. Matrix operands and SISD kernels are stored in memory as shown in Figure 5; MxCore uses base pointers to fetch from the respective memory regions. Scratch space is also allocated per thread during concurrent thread execution. It stores child thread packets of a thread family group while MxCore advances to prepare the next family of threads; the thread packets are read back for execution after the originating thread finishes, handing over execution to the subsequent child thread.
The Operand Packet Cache retains the thread packet image to initialize threads at the start of execution and to enable sharing of operand data between child threads. The instruction cache prefetches memory lines and extracts instructions for efficient supply of microinstructions to the parallel instruction decoders. Thread image data are stored through a dedicated operand cache interface prior to thread execution requests, using thread image pointers. This gives the flexibility to interface the processor with other accelerators that may be present in the host processor, as only pointers to the thread packets are used across pipeline stages, minimizing data movement. Read and write ports interface the operand registers to the execution pipe for loading and storing operand data, while the thread execution scheduler synchronizes and coordinates overall instruction execution. The parallel instruction decoder provides the controls for advancing instructions to execution and for flushing the thread pipe at the end of a thread. The SISD core supports a 32-bit single-phase and an optional 256-bit multi-phase instruction format, as illustrated in Figure 8(b).
3.9 Thread Packing and Sequencing
MxCore supports various thread packing formats to optimize microcode execution efficiency based on program complexity while calculating a defined output tile size. The SISD core executes microcode on the 8 \( \times \) 8 tile elements produced by the SIMD engine and the corresponding 8 \( \times \) 8 tile of R[] operand values. The thread packing logic creates thread packets from these operands and ensures that packets are loaded according to the thread packing format expected by the corresponding microcode. A packet contains thread operand images for selectively initializing the programmable registers of the processor, based on the matrix algebra operation to be performed. Thread packets also carry control information to clear selected register contents before the targeted micro-program starts. Information about the tile dimensions, the associated microcode header for instruction fetch, and the memory port identifiers for directing the results is also packed along with the operand data. Three types of thread packing are supported, based on compute complexity and programming register space requirements: a SISD4 \( \times \) 4 thread packet generates an output tile size of 4 \( \times \) 4, whereas the SISD4 \( \times \) 8 and SISD8 \( \times \) 8 thread packets generate 4 \( \times \) 8 and 8 \( \times \) 8 output tiles, respectively. For complex microcode where the algorithm requires more register space, the programmer can select SISD4 \( \times \) 4 thread packing: the thread then deals with only 16 elements of the result matrix, and the rest of the register space is available for programming.
MxCore can also be configured to execute four (4 \( \times \) 4) tiles as one family of threads, enabling data sharing between successive child threads. In thread family execution, the first (4 \( \times \) 4) tile is executed as the parent thread and the remaining three (4 \( \times \) 4) tiles as child threads. The native instruction set of the SISD core supports sharing data from the thread register space of the immediately preceding thread in the queue within a family of threads. These features can also be used to offload instructions to adjacent threads within a family when a thread’s instruction count exceeds the limit of 256 instructions per thread.
4 MAPPING OF MATRIX KERNELS ONTO MXCORE
MxCore supports various execution primitives through which the host processor accesses its features, as shown in Table 1. The supported APIs allow execution of 8 \( \times \) 8 tile blocks on the vector or scalar compute platforms independently, in addition to an option for executing a tile block on both compute resources. MxCore also provides integrated tiling logic for some of the fundamental and frequently used matrix equations, without exposing the low-level compute abstraction. The overall efficiency of MxCore while executing an equation understandably depends on how optimally the matrix equation is tiled and allocated for execution, the inherent operand dependencies, and the available memory bandwidth. MxCore uses an 8 \( \times \) 8 output matrix tiling granularity across all natively supported matrix equations and APIs. Tiling for dense compute blocks follows special walking patterns to balance the available compute and memory bandwidth and produces output tiles of size 1 \( \times \) 8. Due to the tile walking style, the dense compute path produces bursts of 1 \( \times \) 8 output tiles for 8 consecutive clocks, resulting in an 8 \( \times \) 8 accumulated output matrix. The SISD scalar operands R[N] that combine with a dense compute result matrix are likewise generated at a tile size of 1 \( \times \) 8 over 8 consecutive clocks, yielding an 8 \( \times \) 8 output for subsequent processing. The thread packing block can further group two 8 \( \times \) 8 tile blocks into four sub-blocks of 4 \( \times \) 4 tile size to align with the thread register space and the microcode’s tiling format. Admittedly, there can be scenarios where operands are read from the result matrix of earlier tiles during execution, creating dependencies on operand gathering. Since both the dense and scalar compute follow tile-based execution, MxCore’s sequencing logic judiciously selects and optimizes tile walking patterns to reduce such runtime operand dependencies.
The following sections explain how the tile ordering is designed for the diverse matrix equations presented in this article.
4.1 SGEMM, GEMV, SpMM, SpMV, and SDDMM
Matrix-matrix multiplication (SGEMM) multiplies two dense matrices to produce an output matrix after per-element blending operations; it is a fundamental building block of many algorithms. In an SGEMM operation, the input operands are available without dependencies on the result matrix. Therefore, tiles can be selected naively, and outputs are calculated horizontally row first, followed by vertical flow. The tiling algorithm for GEMM is detailed in Figure 9. Computations are divided into 8 \( \times \) 8 blocks; in each iteration, 1 \( \times \) K (K = 4) elements of the A operand are broadcast to P (P = 8) independent channels of the SIMD compute array, where a K \( \times \) P tile of B operands is wired to operate with the A operands. While A moves vertically (row wise) for P consecutive clock cycles, the B operand values are reused to calculate a P \( \times \) P tile of partial results. The same iterative execution is invoked by the P \( \times \) P tiling logic to cover the entire output matrix dimensions of the equation. The tile offset (tileStepOffset, tSO) is always one for dense multiplication, where all blocks are considered for computation, and accumulation steps advance in units of K (K = 4). For sparse matrix multiplication, MxCore optimizes execution cycles in chunks of K \( \times \) P (4 \( \times \) 8) blocks: non-contributing blocks (all zeros) are skipped by dynamically advancing the tile offset value in the tiling algorithm, based on the sparsity information received from the compute data path.
Fig. 9. SGEMM Tiling Algorithm.
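Our reading of the Figure 9 walk order can be sketched in Python (an illustration, not the RTL state machine; `zero_kblock` stands in for the sparsity metadata): output blocks are walked row first, the accumulation loop advances in steps of K = 4, and K-blocks flagged all-zero are skipped, which is the dynamic tSO update used for sparse inputs.

```python
# Sketch of the GEMM tile walk: 8x8 output blocks, row first then down,
# accumulation in K = 4 steps; K-blocks flagged all-zero are skipped
# (the dynamic tileStepOffset update for sparse inputs).

def sgemm(A, B, zero_kblock=lambda r0, kb: False):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for r0 in range(0, M, 8):            # horizontal row of blocks first
        for c0 in range(0, N, 8):        # then advance vertically
            for kb in range(0, K, 4):    # accumulation advances by K = 4
                if zero_kblock(r0, kb):  # tSO skips non-contributing blocks
                    continue
                for r in range(r0, min(r0 + 8, M)):
                    for c in range(c0, min(c0 + 8, N)):
                        C[r][c] += sum(A[r][kb + i] * B[kb + i][c]
                                       for i in range(min(4, K - kb)))
    return C
```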
Matrix-vector multiplication (GEMV) multiplies a matrix with a vector, producing a vector result. GEMV uses the same tiling algorithm as SGEMM (Figure 9), with the dimension of A configured as 1 \( \times \) N, which results in an output matrix dimension of 1 \( \times \) 8 for any blend operations. GEMV is a memory-bound operation, and MxCore’s Memory Arbiter uses all available ports (4 \( \times \) 16 B) to fetch and stream operand data for compute efficiency. The same block-level skipping algorithm used in SpGEMM to update tSO is also applied to sparse matrix-vector multiplication to enhance performance. The SDDMM kernel performs element-wise multiplication between a matrix C and the result of a matrix multiplication between A and B; that is, SDDMM computes a filtered matrix-matrix product, a Hadamard product between a sparse matrix and the product of two smaller dense rectangular matrices. SDDMM uses the same tiling algorithm as SGEMM (Figure 9), with the R[N] operands used for the element-wise multiplication in microcode, executed on the accumulated result from the SIMD path.
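The SDDMM mapping reduces to the following short sketch (ours): the dense A·B tile comes from the SIMD path, and the element-wise multiply with the sampling matrix C (carried as R[] operands) is applied by the SISD microcode.

```python
# Sketch of SDDMM: Hadamard product of C with the dense product A @ B.
# The inner sum is the SIMD MAC portion; the C[r][c] multiply is the
# per-element SISD microcode step.

def sddmm(C, A, B):
    K = len(B)
    return [[C[r][c] * sum(A[r][k] * B[k][c] for k in range(K))
             for c in range(len(C[0]))] for r in range(len(C))]
```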
4.2 Convolution and Max/Min Pool
Convolutional Neural Networks (CNNs) are an increasingly important workload in emerging applications deployed on modern Edge systems. CNNs use dense kernels that differ from traditional dense linear algebra routines. In a convolution operation, the input operands are tiled and the output is calculated in typical scan-line order: result matrix tiles are computed horizontally row first, followed by vertical flow, as illustrated in Figure 10. The activation layer is mapped to the A operand port of MxCore, and the B operand port is connected to the feature weights. In this way, filter weights are convolved with the activation layers on each channel of the SIMD compute path, in parallel. Activation functions such as ReLU and SoftMax are enabled through SISD micro-kernels using MxCore’s native instruction set. The convolution operation can gather operand data by traversing the length, breadth, and depth dimensions of the operands. MxCore performs better when dense operands lie along the inner direction of the accumulation loop, because inner-loop operands stored across columns of the operand memory can create alignment issues between the compute size and the memory data access width, depending on the filter dimensions. Hence, we keep the dense operand with the larger dimension value as the inner dimension of the memory layout to maximize compute utilization. MxCore supports A and B operand memory layouts in any of three formats: (1) (Depth, Length, Breadth), (2) (Breadth, Length, Depth), or (3) (Length, Breadth, Depth). MxCore’s tiling and operand gather logic can traverse any of these formats (picked optimally) to gather operands while performing convolutions. Activation functions are supported through a SISD kernel operating on the correlation results from MxCore’s vector engine.
Fig. 10. Convolution Tiling Algorithm.
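For intuition, the mapping of a convolution onto the A/B ports can be sketched as an im2col-style GEMM (our simplification: single channel, stride 1, valid padding, DNN-style correlation): each output pixel's activation patch becomes a row of A, and the flattened filter becomes a column of B.

```python
# Sketch: map a 2D convolution (correlation, as used in DNNs) to the
# A/B-port GEMM view. Each output pixel's activation patch is a row of A;
# the flattened filter is the B column; every output is an inner product.

def conv2d_as_gemm(act, wt):
    fh, fw = len(wt), len(wt[0])
    oh = len(act) - fh + 1
    ow = len(act[0]) - fw + 1
    A = [[act[y + i][x + j] for i in range(fh) for j in range(fw)]
         for y in range(oh) for x in range(ow)]     # im2col of activations
    b = [wt[i][j] for i in range(fh) for j in range(fw)]
    flat = [sum(a * w for a, w in zip(row, b)) for row in A]
    return [flat[y * ow:(y + 1) * ow] for y in range(oh)]
```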
4.3 Cholesky Decomposition
The Cholesky decomposition of A is a decomposition of the form \( A = LL^T \), where L is a lower triangular matrix and \( L^T \) denotes its conjugate transpose. Consecutive rows in this operation are calculated using the previous rows and columns of the output matrix L, creating a serialized operand-gather dependency during execution. A diagonal walking order is selected for compute efficiency, given the dependency on the top row and left column of the output matrix. The compute requirement increases as execution progresses from left to right; hence, the most compute-intensive tile is the rightmost tile on the diagonal. The proposed tile walking algorithm selects the leftmost tile first, followed by the rightmost tile, before scheduling the remaining tiles from the left tile to the last tile on the diagonal, as detailed in Figure 11. This walking order ensures that the sequentially connected dense and scalar compute blocks are filled with work efficiently while reducing operand dependency. Figure 11 also shows the mapping of the triangular region of the Cholesky equation onto the SISD core as an extended function. Similarly, the other regions of the Cholesky equation are written using MxCore’s instruction set, enabling a complete mapping.
Fig. 11. Cholesky Tiling Algorithm.
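Our reading of the diagonal walk in Figure 11 can be captured in a few lines of Python (an illustration; the function name is hypothetical): issue the leftmost diagonal tile, then the rightmost (the most compute-intensive one), then the remaining tiles left to right.

```python
# Sketch of the Cholesky diagonal tile issue order: leftmost first, then
# rightmost, then the remaining diagonal tiles left to right, keeping the
# dense and scalar stages busy despite row/column dependencies.

def cholesky_diag_order(n_tiles):
    if n_tiles <= 2:
        return list(range(n_tiles))
    return [0, n_tiles - 1] + list(range(1, n_tiles - 1))
```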
4.4 Matrix Solve
The matrix solve operation solves for X in \( AX = LL^T X = Y, \) where L is an invertible triangular matrix, \( L^T \) is its transpose, and Y is the other input matrix. Matrix solve depends on the previous rows’ X solutions to solve consecutive rows, and the compute requirement grows proportionally as the solve progresses along the rows. Therefore, the tile walking algorithm schedules all tiles in the same row in sequential order before moving vertically to calculate consecutive rows. As with the other matrix operations, such as decomposition, GEMM, and CNN, the tiling logic for matrix solve maps the dense MAC operations onto the vector engine and the remainder of the execution onto the scalar core running microcode.
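The row-serial dependency can be seen in a per-element forward-substitution sketch (ours, for a lower-triangular L and a vector right-hand side): each row needs all previously solved rows, the dense MAC portion maps to the vector engine, and the final divide maps to the SISD core.

```python
# Sketch of the row-serial dependency in triangular solve: rows are solved
# top to bottom; the running sum is the dense MAC portion and the divide
# is the scalar (SISD) step (forward substitution for L x = y).

def tri_solve_lower(L, y):
    n = len(L)
    x = [0.0] * n
    for r in range(n):                                 # rows in order
        s = sum(L[r][c] * x[c] for c in range(r))      # dense MAC portion
        x[r] = (y[r] - s) / L[r][r]                    # scalar divide
    return x
```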
5 EVALUATION
5.1 Experimental Setup
Baseline Architectures: We compare our design against five baseline architectures spanning CPU, GPU, and ASICs. For the ASIC baselines, we use Eyeriss [13] for dense DNNs, Tensaurus [73] for sparse DNNs, and Intel’s VIO [58] accelerator for EKF acceleration, which includes Cholesky decomposition and triangular matrix solve. Table 2 summarizes the key architectural and design parameters of these baseline architectures along with MxCore.
Table 2. Architecture and Design Parameters for Baseline HW Solutions and MxCore along with a Comparison of Achieved SIMD Compute Efficiency across AI Workloads
(1) CPU: We use a Jetson Nano-4GB board [6] with a quad-core ARM Cortex-A57 CPU. For dynamic power measurement on the board, we use the jetson_stats tool. To run the benchmarks on the CPU, we use the ARM Compute Library for the dense computations and the Eigen-3.3.9 [25] library for processing the sparse CNNs, Cholesky decomposition, and triangular matrix solve.
(2) GPU: We use the 128-core Maxwell GPU on the Jetson Nano-4GB board with CUDA 10. We use the cuDNN and cuSPARSE libraries provided by the NVIDIA JetPack SDK to process the dense and sparse CNNs, respectively. For the GEMM, Cholesky, and Solve benchmarks, we use the cuBLAS library. For the power measurement, we use the jetson_stats tool.
(3) Accelerators: For the dense AlexNet and VGG-16, along with the CPU and GPU, we also compare our work against the state-of-the-art Eyeriss [13] CNN accelerator. We use the 8-bit version of the accelerator for our comparative assessment. For energy comparison, we use the publicly available nn_dataflow simulator. For the sparse CNNs, along with the CPU and GPU, we compare our work against the Tensaurus [73] accelerator. For the Cholesky and Solve benchmarks, we compare our work against the VIO [58] accelerator.
Datasets: We evaluate MxCore on both dense and sparse workloads. The dense computations include GEMM, dense versions of the AlexNet and VGG-16 CNNs, and dense linear algebra routines, namely the Cholesky and matrix solve benchmarks. For GEMM, we use matrix sizes from the DeepBench [2] framework. The sparse computations include the sparse CNNs, namely AlexNet and VGG-16. The sparse network models are taken from [73]. The sparse convolutions are mapped as sparse matrix multiply operations using the im2col transformation, similar to Tensaurus.
Power and Area Scaling: Since the selected baseline architectures were implemented in different process technology nodes, we scale their power and area numbers to MxCore’s 7-nm node. We use the scaling equations of Reference [74] to perform the technology scaling.
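Reference [74] provides detailed node-to-node scaling equations; as a simplified illustration of the idea, the sketch below uses first-order approximations that we assume for exposition (area tracks the square of the feature size; dynamic power tracks C·V²·f) rather than the paper’s exact model:

```python
def scale_area(area_mm2, node_from_nm, node_to_nm):
    # First-order approximation: area scales with the square of feature size.
    return area_mm2 * (node_to_nm / node_from_nm) ** 2

def scale_dyn_power(power_w, cap_ratio, v_from, v_to, f_from_hz, f_to_hz):
    # Dynamic power ~ C * V^2 * f; cap_ratio is the capacitance ratio
    # between the target and source nodes (all values illustrative).
    return power_w * cap_ratio * (v_to / v_from) ** 2 * (f_to_hz / f_from_hz)

# Example: project a hypothetical 10 mm^2 block from 65 nm down to 7 nm.
a7 = scale_area(10.0, 65, 7)
```

Real scaling models also account for wire dominance and voltage floors, which is why the paper relies on Reference [74] instead of these closed-form ratios.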
5.2 FPGA Mapping
We synthesize our design at a frequency of 100 MHz on an Arria10 FPGA in the Intel Devcloud setup [4]. We use SystemVerilog to design our hardware. We use the Altera megafunction IPs, in both floating-point and fixed-point versions, to map our design’s compute onto the FPGA. For synthesis, we use Quartus Pro version 19.2 [5]. All reported hardware characteristics of the design are obtained post place-and-route. Figure 12 summarizes the FPGA synthesis results.
Fig. 12. Synthesis results of MxCore RTL on an Arria10 FPGA. The figure on the right shows the mapping of MxCore RTL on the Arria10 FPGA using the Intel OPAE framework.
The MxCore RTL is embedded as an Accelerator Functional Unit (AFU) inside the Open Programmable Acceleration stack, part of the Devcloud infrastructure. The CPU host interacts with the AFU through a 64-bit C-based API. The AFU connects to the FPGA Interface Unit, part of the FPGA Interface Manager, with a cache coherence interface protocol (CCIP) [3].
Host Interface: The MxCore AFU internally implements a 256-KB, 8-byte-aligned MMIO address space. This address space is used for data exchanges between the AFU and the host. The C0 signals in the CCIP protocol are used to send the MMIO address and data from the host to the AFU, using a 64-bit write transaction. Host-to-AFU transactions are initiated by the corresponding C-based write API. The transactions are executed out of order and are tracked through a unique transaction ID. Responses from the MxCore AFU are written to the MMIO address space internally and communicated back to the host using the outgoing C2 channel of the CCIP interface. Data on the C2 channel is read by the host using the 64-bit read API through an out-of-order read transaction.
Memory Hierarchy: The MxCore AFU implements a two-level memory hierarchy. The 4-GB FPGA DDR4 external memory on the programmable acceleration card housing the FPGA forms the L2 memory. Incoming addresses on the CCIP C0 Rx channel that are beyond the configuration space are routed to this memory using an outgoing Avalon Memory-Mapped interface [1]. The MxCore AFU implements a CCIP-to-Avalon signal translator for this purpose. The L1 memory is realized using the on-chip FPGA Block RAMs, and the working-set data are maintained in sixteen 128-bit-wide RAM blocks.
5.3 ASIC Implementation
We implement MxCore in three configurations varying the data-precision format and compute size: (1) thirty-two single-precision floating-point MACs, (2) thirty-two 32-bit integer MACs, and (3) one hundred twenty-eight 8-bit integer MACs. As shown in Figure 3(b), MxCore can be configured per design requirements for compute size and data precision. We further build a scaled-up configuration of MxCore, the MxCore-S version, with two hundred fifty-six 32-bit integer MACs for an iso-compute evaluation against Tensaurus. To project the die area for the scaled-up MxCore-S, we analytically scale the area cost of the compute (SIMD engine) and storage (data gather block) resources while adding constant area overheads for the other hardware resources. The data gathering block is upscaled in proportion to compute capacity to balance the compute-to-memory-bandwidth ratio. In the scaled-up SIMD engine (Figure 7), additional compute resources are added along both the horizontal and vertical directions of the 2D compute array. Adding compute resources along the horizontal direction of the 2D compute array demands higher B-operand read bandwidth and larger buffers to store partial results. Scaling up compute resources along the vertical direction increases the number of accumulation operations required per clock and, to achieve peak compute throughput, demands a proportional increase in A-operand read bandwidth.
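The bandwidth implications of scaling the 2D compute array can be captured in a small back-of-envelope model. This is our sketch, assuming one operand per lane per clock; the constants are not measured MxCore parameters:

```python
def array_requirements(h, v, elem_bytes=4):
    """Per-clock operand/result requirements of an h x v MAC array
    (simplified model matching the horizontal/vertical scaling discussion)."""
    return {
        "macs_per_clock": h * v,
        # Each horizontal lane streams its own B operand every clock.
        "b_read_bytes": h * elem_bytes,
        # Each vertical row consumes an A operand every clock.
        "a_read_bytes": v * elem_bytes,
        # One partial result is buffered per horizontal lane.
        "partial_buffers": h,
        # Every MAC performs an accumulate each clock.
        "accumulates_per_clock": h * v,
    }

base = array_requirements(8, 4)    # 32 MACs
wide = array_requirements(16, 4)   # doubling horizontally doubles B bandwidth
tall = array_requirements(8, 8)    # doubling vertically doubles A bandwidth
```

The model makes the trade-off explicit: widening the array stresses B-operand bandwidth and partial-result buffering, while deepening it stresses A-operand bandwidth and accumulation throughput.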
The proposed matrix processor is implemented in a TSMC 7-nm process at the \( S\_M40.cworst\_CCworst\_T/0.67V/-40T \) corner, and the design quality report along with TOPS/mm\( ^2 \) is detailed in Figure 13(b) for various design configurations. Table 3 contains the power and area projection from the actual TSMC 7-nm design database to a 256-MAC configuration, assuming a utilization of 67%. MxCore has compute-sensitive blocks whose area scales in proportion to the number of MACs and the memory bandwidth, whereas fixed resources such as the host interface, configuration path, protocol-handling units, and shared caches are insensitive to compute scaling. Functional blocks that are sensitive to compute scaling include the compute path, data gathering, thread packing, and the buffers that receive the higher throughput from the vector engine. The SISD core can also be configured for the number of threads, ALU resources, programmable register-bank size, register-bank read/write bandwidth, and parallel instruction decoders. Depending on the target scalar compute requirement, MxCore allows flexible provisioning of the right number of SISD cores. MxCore’s Thread Packing block provides architectural separation and allows independent, flexible, and modular provisioning across the scalar and vector compute engines.
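The projection methodology behind Table 3 can be sketched as a split between compute-sensitive and fixed area, divided by utilization. The per-MAC and fixed-area constants below are hypothetical placeholders, not values from the TSMC design database; only the 67% utilization figure comes from the text:

```python
def project_area(n_macs, area_per_mac_mm2, fixed_area_mm2, utilization=0.67):
    """Project die area for a scaled configuration: compute-sensitive blocks
    grow with MAC count; fixed blocks (host interface, configuration path,
    protocol handling, shared caches) do not. Utilization accounts for
    placement overhead."""
    logic_area = n_macs * area_per_mac_mm2 + fixed_area_mm2
    return logic_area / utilization

# Hypothetical numbers: 0.0004 mm^2 per MAC plus 0.05 mm^2 of fixed blocks.
a256 = project_area(256, 0.0004, 0.05)
a32 = project_area(32, 0.0004, 0.05)
```

Because the fixed blocks are amortized, the 256-MAC configuration costs far less than 8× the 32-MAC one, which is the economic argument for the scaled-up MxCore-S.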
Fig. 13. MxCore Physical Synthesis.
5.4 Power, Performance, and Area Evaluation
MxCore, a unified programmable matrix processor for matrix algebra and DNNs, is compared for performance, area, and energy against custom accelerators as well as generic processors, as shown in Figures 14, 15, and 16. For the GEMM workload, MxCore is 3.47\( \times \) faster than the GPU. For the dense convolution workload, MxCore is 4.2\( \times \) faster than Eyeriss and 1.2\( \times \) faster than the GPU. For the sparse CNN workload, MxCore is 1.27\( \times \) faster than Tensaurus and 7.9\( \times \) faster than the GPU. Figure 16 compares the GOPs, GOPs/mm\( ^2 \), and GOPs/W normalized to 200 MHz for all the architectures. As evident from the charts, MxCore outperforms almost all the baseline architectures on these three metrics by a factor of 2\( \times \) to 14\( \times \) and achieves performance per unit area on par with or better than the ASIC solutions, while providing the flexible programmability to support a variety of compute primitives. MxCore also provides significant power savings over the CPU, GPU, and ASIC classes of solutions at iso-technology.
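The 200-MHz normalization used in Figure 16 amounts to a linear frequency scaling of throughput before dividing by area and power. The sketch below shows this under our assumption that throughput scales linearly with clock; the function names are ours:

```python
def normalize_gops(gops, freq_hz, ref_freq_hz=200e6):
    # Scale raw throughput linearly to the 200-MHz reference frequency.
    return gops * (ref_freq_hz / freq_hz)

def efficiency_metrics(gops, freq_hz, area_mm2, power_w):
    """Frequency-normalized GOPs, GOPs/mm^2, and GOPs/W for one design point."""
    g = normalize_gops(gops, freq_hz)
    return {"GOPs": g, "GOPs/mm2": g / area_mm2, "GOPs/W": g / power_w}

# Hypothetical design point: 64 GOPs at 1 GHz, 0.2 mm^2, 3 mW.
m = efficiency_metrics(64, 1e9, 0.2, 0.003)
```

Linear scaling is a common simplification; it ignores memory-bound effects, so it is only a fair comparison when all designs are compute-bound at both frequencies.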
Fig. 14. Relative speedup achieved by MxCore, GPU, Tensaurus, Eyeriss and VIO over the ARM CPU.
Fig. 15. Relative energy savings achieved by MxCore, GPU, Tensaurus, Eyeriss and VIO over the ARM CPU.
Fig. 16. The graphs plot the normalized GOPs, GOPs/W, and GOPs/mm\( ^2 \) improvement of MxCore over other architecture and workload combinations. Higher is better. In the charts, E stands for Eyeriss, T for Tensaurus, S for sparse, D for dense, A for AlexNet, and V for VGG-16.
MxCore’s SISD instruction set as well as its microcode complexity influence the execution performance of a matrix equation. For the Cholesky and Matrix Solve algorithms, the respective microcode’s execution latency is sensitive to MAC operations, which are written using basic multiplier and adder instructions. MxCore could instead perform faster if a fused MAC instruction were supported, as a large percentage of each algorithm’s microcode consists of MAC operations. Being a programmable and scalable tile architecture, MxCore allows seamless integration of new instructions and compute resources per design requirements.
MxCore’s integrated equation-specific tiling logic optimizes the dataflow based on operand dependencies, memory bandwidth, and compute capacity. The execution efficiency of CNN, GEMM, Cholesky, or Matrix Solve is primarily controlled by the respective tiling algorithm. Even though CNN workloads can algorithmically be executed through the GEMM API, MxCore supports convolution-specific tiling to gain maximum compute efficiency. The convolution tiling algorithm maps activation-layer operands to feature filters, which are in turn mapped to channels of the SIMD compute, enabling systolic computation. This way, MxCore extracts compute and operand-bandwidth efficiency from spatial operand reuse. MxCore’s configuration balances compute and bandwidth, and any reduction in available bandwidth directly reduces the performance of a CNN layer. This can happen in some CNN workloads where operand throughput from the memory fabric varies due to memory-line-width misalignments resulting from vertical and horizontal strides. MxCore’s operand cache reduces this penalty when operands are spread across multiple memory lines.
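The effect of tiling on operand reuse can be illustrated with a generic blocked GEMM. This is our sketch, not MxCore’s actual tiling algorithm (which additionally accounts for operand dependencies and available bandwidth); the tile size stands in for the buffer-capacity decision the tiling logic would make:

```python
import numpy as np

def tiled_matmul(a, b, tile=4):
    """Blocked GEMM: each (tile x tile) block of A is held and reused against
    a full row of B tiles before being evicted, cutting operand re-fetches
    from the memory fabric."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for p in range(0, k, tile):
            a_blk = a[i:i + tile, p:p + tile]   # kept in the operand cache
            for j in range(0, n, tile):
                c[i:i + tile, j:j + tile] += a_blk @ b[p:p + tile, j:j + tile]
    return c

a = np.random.rand(8, 8).astype(np.float32)
b = np.random.rand(8, 8).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-5)
```

Reordering the `p`/`j` loops changes which operand is reused and which bandwidth is stressed, which is exactly the equation-specific choice the tiling logic makes.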
While executing a sparse workload, MxCore gains significant efficiency relative to the corresponding dense version, depending on the level of sparsity. Since execution is controlled from the tiling stage, MxCore avoids unnecessary memory transactions and redundant computations. MxCore extracts high efficiency for sparse workloads by spending the initial cycles in dense mode to extract sparsity information and then opportunistically optimizing the execution on successive iterations. This distinctive ability of MxCore to auto-detect and exploit sparsity enables acceleration even in workloads where sparsity is introduced dynamically and variably and where the higher-level software layers do not specifically employ sparse APIs (as is necessary for GPUs). While this version of MxCore supports only block-level sparsity, future versions will tap further efficiency by exploiting fine-grained sparsity. MxCore can achieve even higher efficiency by supporting operand compression, to save storage space, and optional metadata that holds sparsity information encoded by the host processor, to bypass the auto sparsity detection.
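The auto-detection scheme can be sketched as a first dense-mode pass that records which blocks are live, after which only those blocks are touched. This is our simplified model of the behavior described above; the block size and helper names are hypothetical:

```python
import numpy as np

def nonzero_block_map(mat, blk=4):
    """First 'dense-mode' pass: record which blk x blk blocks contain any
    nonzero element, so later iterations can skip zero blocks entirely."""
    m, n = mat.shape
    return {(i, j)
            for i in range(0, m, blk)
            for j in range(0, n, blk)
            if np.any(mat[i:i + blk, j:j + blk])}

def block_sparse_matvec(mat, x, live, blk=4):
    # Subsequent iterations: only blocks flagged live are fetched and computed.
    y = np.zeros(mat.shape[0], dtype=mat.dtype)
    for (i, j) in live:
        y[i:i + blk] += mat[i:i + blk, j:j + blk] @ x[j:j + blk]
    return y

w = np.zeros((8, 8), dtype=np.float32)
w[0:4, 4:8] = 1.0                     # one live block out of four
live = nonzero_block_map(w)           # computed once, reused on later passes
y = block_sparse_matvec(w, np.ones(8, dtype=np.float32), live)
```

With one live block out of four, three quarters of the memory transactions and MACs are skipped on every pass after the first, which is where the sparsity-dependent efficiency gain comes from.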
We also mapped SR-GAN [50], a neural network for generating super-resolution images, onto MxCore, demonstrating its flexibility in supporting a wide range of edge AI applications. Table 4 shows the achieved latency, compute efficiency, and power consumption.
| SR-GAN Network | Data Precision | Peak GOPS | Latency (seconds) | Compute Efficiency | Power (mW) |
|---|---|---|---|---|---|
| Discriminator | FP32 | 64 | 31.301 | 98.79% | 3.03 |
| Discriminator | INT8 | 256 | 8.272 | 93.45% | 4.21 |
| Generator | FP32 | 64 | 0.028 | 76.32% | 3.03 |
| Generator | INT8 | 256 | 0.007 | 73.41% | 4.21 |
Table 4. MxCore Latency, Compute Efficiency, and Power Consumption with SR-GAN [50], a Network Architecture for Super-resolution Image Generation
6 RELATED WORK
Hardware solutions for mobile, IoT, and other Edge devices are evolving at a rapid pace, extending their capabilities for the acceleration of AI computing. Qualcomm first applied AI acceleration in its Snapdragon Neural Processing Engine [66]. In contrast to Qualcomm, HiSilicon’s 900-series chips [38] rely on a Neural Processing Unit (NPU) instead of performing AI tasks on the GPU. The DaVinci architecture [37] in the NPU contains a 3D matrix unit for matrix multiply and a vector unit for special functions, with a shared C buffer to store intermediate results. Though MxCore has a similar data path, the programmability of the SISD engine and support for complex tiling patterns uniquely enable efficient execution of various other matrix-processing primitives such as decomposition, triangular solve, and inverse, which are essential for navigation and tracking applications. MediaTek’s Helio P60 [60] uses both a GPU and an AI Processing Unit (APU) for AI tasks, though only quantized models run on the APU, while floating-point networks are executed on the Mali GPU [40]. In comparison, the MxCore architecture provides design configurability supporting multiple precision formats within a unified core. The Myriad X architecture, as present in Intel’s VPU chips [7], contains multiple DSP cores and a Neural Compute Engine (NCE) for AI tasks. While the DSP cores provide programmability, data sharing with the NCE, which performs the efficient convolution/GEMM operations, may introduce bandwidth bottlenecks for compute primitives with interleaved matrix multiplies and special operators. MxCore overcomes these challenges through tightly coupled vector SIMD and programmable SISD cores along with a highly optimized internal data path, thus saving on round-trip latency to memory and on power.
MIT’s Eyeriss v1 architecture [13] focuses on optimizing the internal dataflow for performance and power efficiency. Eyeriss v2 [12] provides a scale-up configuration through hierarchical mesh interconnects and improves efficiency by exploiting the sparsity present in sparse DNN models. However, it lacks support for classical matrix-processing kernels such as decomposition and inverse, which continue to be an integral part of modern Edge intelligence. Intel’s VIO [58] accelerates Edge computing for navigation and tracking applications, but it comprises multiple custom accelerators, one for each variant of matrix-processing kernel, and thus compromises on area and power efficiency. Though sparse accelerators such as Tensaurus [73] and ExTensor [34] provide customized ASIC solutions for sparse GEMM and DNN workloads, they cater to the needs of high-performance computing and are not optimized for energy- and cost-constrained Edge applications. Also, many sparse accelerators such as Tensaurus are often not equally efficient for dense workloads. MxCore balances performance for dense and sparse workloads through block-level sparsity support, which does not severely degrade performance for dense applications.
7 CONCLUSION
The unified matrix processor MxCore presented here achieves substantial improvements in performance, area, and power efficiency across dense and sparse matrix operations, including the highly operand-dependent Cholesky execution, when compared to the current state of the art. MxCore also offers a programmable and flexible platform, allowing modular relative provisioning of the scalar and vector compute engines. We demonstrated the efficient mapping of various high-dimensional matrix and tensor algebraic operations that demand both dense and sparse scalar compute. The MxCore architecture can be enhanced in future versions to support features such as operand compression and grouping of valid elements for sparse workloads, for higher efficiency.
- [1] [n.d.]. Avalon Interface Specifications. Retrieved from https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl_avalon_spec.pdf.
- [2] [n.d.]. DeepBench Benchmark Suite. Retrieved from https://github.com/baidu-research/DeepBench.
- [3] [n.d.]. Intel Acceleration Stack for Intel Xeon CPU with FPGAs Core Cache Interface (CCI-P) Reference Manual. Retrieved from https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl-ias-ccip.pdf.
- [4] [n.d.]. Intel Devcloud. Retrieved from https://software.intel.com/content/www/us/en/develop/tools/devcloud/fpga.html.
- [5] [n.d.]. Intel Quartus Prime Pro Edition. Retrieved from https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/rn/archives/rn-qts-pro-dev-support-20-1.pdf.
- [6] [n.d.]. NVIDIA Jetson Nano-4GB. Retrieved from https://developer.nvidia.com/embedded/jetson-nano-developer-kit.
- [7] [n.d.]. Intel Announces Movidius Myriad X VPU, Featuring ‘Neural Compute Engine’. Retrieved July 15, 2021 from https://www.anandtech.com/show/11771/intel-announces-movidius-myriad-x-vpu.
- [8] 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT’2010). Springer, 177–186.
- [9] 2018. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv:1812.00332. Retrieved from https://arxiv.org/abs/1812.00332.
- [10] 2010. EKF-SLAM and machine learning techniques for robot navigation. In Proceedings of the 20th International Conference on Pattern Recognition. IEEE, 396–399.
- [11] 2009. Anomaly detection: A survey. ACM Comput. Surv. 41, 3 (2009), 1–58.
- [12] 2018. Eyeriss v2: A flexible and high-performance accelerator for emerging deep neural networks. arXiv:1807.07928. Retrieved from http://arxiv.org/abs/1807.07928.
- [13] 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 367–379.
- [14] 2009. Rapid prototyping of an improved Cholesky decomposition based MIMO detector on FPGAs. In Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems. 369–375.
- [15] 2007. Bayesian Statistical Modelling. Vol. 704. John Wiley & Sons.
- [16] 2020. Parameters sharing in residual neural networks. Neural Process. Lett. 51, 2 (2020), 1393–1410.
- [17] 2019. NeST: A neural network synthesis tool based on a grow-and-prune paradigm. IEEE Trans. Comput. 68, 10 (2019), 1487–1497.
- [18] 2019. ChamNet: Towards efficient network design through platform-aware model adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11398–11407.
- [19] 2020. Edge intelligence: The confluence of edge computing and artificial intelligence. IEEE IoT J. 7, 8 (2020), 7457–7469.
- [20] 1998. Applied Regression Analysis. Vol. 326. John Wiley & Sons.
- [21] [n.d.]. The World’s Most Valuable Resource Is No Longer Oil, But Data. Retrieved May 6, 2017 from https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data.
- [22] 2019. Neural architecture search: A survey. J. Mach. Learn. Res. 20, 1 (2019), 1997–2017.
- [23] 2016. Deep Learning. MIT Press.
- [24] 2018. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1586–1595.
- [25] 2010. Eigen v3. Retrieved from http://eigen.tuxfamily.org.
- [26] 2019. By leaps and bounds: An exclusive look at how Boston Dynamics is redefining robot agility. IEEE Spectrum 56, 12 (2019), 34–39.
- [27] 2018. A survey on methods and theories of quantized neural networks. arXiv:1808.04752. Retrieved from https://arxiv.org/abs/1808.04752.
- [28] 2010. Algorithmic Cholesky factorization fault recovery. In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (IPDPS’10). IEEE, 1–10.
- [29] 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149. Retrieved from https://arxiv.org/abs/1510.00149.
- [30] 2020. TUTOR: Training neural networks using decision rules as model priors. arXiv:2010.05429. Retrieved from https://arxiv.org/abs/2010.05429.
- [31] 2021. SCANN: Synthesis of compact and accurate neural networks. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. (2021).
- [32] 2021. A design methodology for energy-aware processing in unmanned aerial vehicles. ACM Trans. Des. Autom. Electr. Syst. 27, 1 (2021), 1–20.
- [33] 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1389–1397.
- [34] 2019. ExTensor: An accelerator for sparse tensor algebra. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52). Association for Computing Machinery, New York, NY, 319–333.
- [35] 1990. Analysis of the Cholesky Decomposition of a Semi-definite Matrix. Oxford University Press.
- [36] 2015. Distilling the knowledge in a neural network. arXiv:1503.02531. Retrieved from https://arxiv.org/abs/1503.02531.
- [37] [n.d.]. DaVinci: A Scalable Architecture for Neural Network Computing. Retrieved July 12, 2021 from https://www.cmc.ca/wp-content/uploads/2020/03/Zhan-Xu-Huawei.pdf.
- [38] [n.d.]. Kirin 980, the World’s First 7nm Process Mobile AI Chipset. Retrieved July 12, 2021 from https://consumer.huawei.com/en/campaign/kirin980/.
- [39] 2021. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv:2102.00554. Retrieved from https://arxiv.org/abs/2102.00554.
- [40] 2018. AI benchmark: Running deep neural networks on Android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV’18) Workshops.
- [41] [n.d.]. AI at the Edge—A Roadmap. Retrieved from https://thertoinnovationsummit.eu/sites/default/files/inline-files/Whitepaper_AIattheEdge_FINAL.pdf.
- [42] [n.d.]. Intel’s Deep Learning Boost. Retrieved July 12, 2021 from https://www.intel.com/content/dam/www/public/us/en/documents/product-overviews/dl-boost-product-overview.pdf.
- [43] 2017. Energy efficient computing and sensing in the Zettabyte era: From silicon to the cloud. In Proceedings of the IEEE International Electron Devices Meeting (IEDM’17). IEEE, 1–2.
- [44] 2019. DeepSZ: A novel framework to compress deep neural networks by using error-bounded lossy compression. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 159–170.
- [45] 2015. Machine learning: Trends, perspectives, and prospects. Science 349, 6245 (2015), 255–260.
- [46] 1960. A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82, 1 (1960), 35–45.
- [47] 2020. Ensembling classical machine learning and deep learning approaches for morbidity identification from clinical notes. IEEE Access 9 (2020), 7107–7126.
- [48] 2018. Advanced vector extensions. In Modern X86 Assembly Language Programming. Springer, 87–107.
- [49] 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
- [50] 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4681–4690.
- [51] 2019. Evolutionary deep learning with extended Kalman filter for effective prediction modeling and efficient data assimilation. J. Comput. Civil Eng. 33, 3 (2019), 04019014.
- [52] 2020. Train big, then compress: Rethinking model size for efficient training and inference of transformers. In Proceedings of the International Conference on Machine Learning. PMLR, 5958–5968.
- [53] 2008. EKF-based adaptive sensor scheduling for target tracking. In Proceedings of the International Symposium on Information Science and Engineering, Vol. 2. IEEE, 171–174.
- [54] 2011. Introduction to Intel Advanced Vector Extensions. Intel White Paper, Vol. 23 (2011).
- [55] 2021. A distributed graph-theoretic framework for automatic parallelization in multi-core systems. Proc. Mach. Learn. Syst. 3 (2021).
- [56] 2020. A linear regression and deep learning approach for detecting reliable genetic alterations in cancer using DNA methylation and gene expression data. Genes 11, 8 (2020), 931.
- [57] 2010. EKF-SLAM for AUV navigation under probabilistic sonar scan-matching. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 4404–4411.
- [58] 2019. Inertial odometry at the edge: A hardware-software co-design approach for ultra-low latency and power. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE’19). 960–963.
- [59] 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams Eng. J. 5, 4 (2014), 1093–1113.
- [60] [n.d.]. MediaTek Helio P60. Retrieved July 15, 2021 from https://i.mediatek.com/p60.
- [61] 2019. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the International Conference on Machine Learning. PMLR, 4646–4655.
- [62] 2017. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 33, 5 (2017), 1255–1262.
- [63] 2014. Principles of Artificial Intelligence. Morgan Kaufmann.
- [64] 2002. An introduction to logistic regression analysis and reporting. J. Educ. Res. 96, 1 (2002), 3–14.
- [65] 2009. Sentiment analysis: A combined approach. J. Informetr. 3, 2 (2009), 143–157.
- [66] [n.d.]. Qualcomm Neural Processing SDK for AI. Retrieved from https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk.
- [67] 2009. Language identification on the web: Extending the dictionary method. In Computational Linguistics and Intelligent Text Processing. Springer, Berlin, 357–368.
- [68] 2014. EKF/UKF maneuvering target tracking using coordinated turn models with polar/cartesian velocity. In Proceedings of the 17th International Conference on Information Fusion (FUSION’14). IEEE, 1–8.
- [69] 2016. An overview of gradient descent optimization algorithms. arXiv:1609.04747. Retrieved from https://arxiv.org/abs/1609.04747.
- [70] 2002. Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, NJ.
- [71] 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv:1602.07868. Retrieved from http://arxiv.org/abs/1602.07868.
- [72] 2018. On-demand ultra-dense cloud drone networks: Opportunities, challenges and benefits. IEEE Commun. Mag. 56, 8 (2018), 85–91.
- [73] 2020. Tensaurus: A versatile accelerator for mixed sparse-dense tensor computations. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’20). IEEE, 689–702.
- [74] 2017. Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm. Integration 58 (2017), 74–81.
- [75] 2020. Machine-oriented NMT adaptation for zero-shot NLP tasks: Comparing the usefulness of close and distant languages. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects. 36–46.
- [76] [n.d.]. AI Edge Device Shipments by Device Category, World Markets: 2017-2025. Retrieved July 12, 2021 from https://twitter.com/tractica/status/1040255585189601286/photo/1.
- [77] 2009. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks. 37–42.
- [78] 2010. Don’t follow me: Spam detection in Twitter. In Proceedings of the International Conference on Security and Cryptography (SECRYPT’10). IEEE, 1–10.
- [79] 2017. Communication-avoiding parallel algorithms for solving triangular systems of linear equations. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS’17). IEEE, 678–687.
- [80] 2018. Virtual, augmented, and mixed reality for human-robot interaction. In Companion of the ACM/IEEE International Conference on Human-Robot Interaction (HRI’18). Association for Computing Machinery, New York, NY, 403–404.
- [81] 2019. Self-optimizing and self-programming computing systems: A combined compiler, complex networks, and machine learning approach. IEEE Trans. VLSI Syst. 27, 6 (2019), 1416–1427.
- [82] 2021. Plasticity-on-chip design: Exploiting self-similarity for data communications. IEEE Trans. Comput. 70, 6 (2021), 950–962.
- [83] 2019. Learning efficient tensor representations with ring-structured networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’19). IEEE, 8608–8612.
- [84] 2020. Inertial velocity estimation for indoor navigation through magnetic gradient-based EKF and LSTM learning model. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’20). IEEE, 4545–4550.