Research Article | Open Access

Leveraging Computational Storage for Power-Efficient Distributed Data Analytics

Published: 18 October 2022


Abstract

This article presents a family of computational storage drives (CSDs) and demonstrates their performance and power improvements due to in-storage processing (ISP) when running big data analytics applications. CSDs are an emerging class of solid state drives that are capable of running user code while minimizing data transfer time and energy. Applications that can benefit from in situ processing include distributed training, distributed inferencing, and databases. To achieve the full advantage of the proposed ISP architecture, we propose software solutions for workload balancing before and at runtime for training and inferencing applications. Other applications such as sharding-based databases can readily take advantage of our ISP structure without additional tooling. Experimental results on different capacity and form factors of CSDs show up to 3.1× speedup in processing while reducing the energy consumption and data transfer by up to 67% and 68%, respectively, compared to regular enterprise solid state drives.


1 INTRODUCTION

With the advent of data centers, which fuel the demand for virtually unlimited storage, it has been estimated that 2.5 EB of data is being created every day [11], from text to images, music, and videos. The deep learning revolution has given rise to running computationally intensive algorithms on these huge amounts of data. Different technologies have been developed to support this trend toward high storage and computation demands, and they shift the bottleneck as a result. This work investigates one of the opportunities made possible by this shift in bottlenecks, namely the elimination of data transfer from the storage unit to the host system for processing.

1.1 Motivation

The storage system where data originally resides plays a crucial role in the performance of data-intensive applications. To be processed, data must first be read from the storage units into the memory of the application servers. As the size of the data increases, the role of the storage subsystem becomes more important. The computational storage drive (CSD) is a class of storage drives that can meet this growing demand by augmenting storage drives with processing resources, thereby eliminating unnecessary data transmission to the host's CPU. The main concept that distinguishes CSDs from conventional SSDs is the in-storage processing (ISP) capability. ISP is the computing paradigm of moving computation to data storage, as opposed to moving data to compute engines. ISP is considered by some to be the ultimate form of near-data processing, as it must strike a balance between the storage and computing sides.

On the storage side, hard disk drives (HDDs) have been increasing in capacity with high reliability and very competitive cost. However, their performance has plateaued, and their power consumption is ultimately lower-bounded by their mechanical motion. Solid state drives (SSDs), which are based on NAND flash memory and offer better performance at lower power consumption, were not competitive in pricing until recently, but they are now finding their way into data centers. The market share of enterprise SSDs jumped from almost 0% in 2010 to 12% [16].

On the computing side, general-purpose graphics processing units (GPUs) have turned conventional PCs into supercomputers in terms of their ability to execute parallel operations. However, GPUs are power-hungry and ill-suited to storage systems, where the power budget is a primary concern: power draw incurs significant cost, and the cooling it necessitates adds further cost to operate and maintain and can itself become a source of unreliability. As a result, general-purpose GPUs are not well suited for ISP. Moreover, GPU workflows typically assume the data is already local to the compute node; if the data instead resides on a storage system rather than in memory, the communication network becomes the bottleneck, as moving data from storage to the processor consumes 5,000× more energy and incurs 2,500× more latency than loading it from volatile memory [20].

One growing class of applications in wide use is data analytics, including artificial intelligence (AI) and search engines. In these applications, the processing nodes require huge volumes of raw data, and the transfer of raw data can easily become the bottleneck in conventional architectures with separate host and storage. ISP can minimize or eliminate this bottleneck by moving part or all of the computation to the storage unit and sending only the output back to the host. To support ISP, the storage system needs to be capable of not only storage but also processing, and CSDs are being used to build servers such as that shown in Figure 1.

Fig. 1.

Fig. 1. Handling queries on a server with generic storage systems (a) vs. a server with CSDs (b).

1.2 Contributions

In this work, we introduce a family of CSDs based on our in-house storage controller named Newport. The Newport family includes three CSDs codenamed Newport, Laguna, and Solana, all of which incorporate variants of the original Newport controller to implement storage drives in different form factors and capacities. In the ensuing text, we simply refer to our CSD design as Newport. Newport is the first commercially available CSD built with both the SSD controller and the ISP engine on the same custom chip. To support ISP, we have developed a custom software stack that enables Newport to run a full-fledged operating system (OS) (e.g., Linux) and provides seamless access to data stored on the flash, which in turn enables application developers to run general-purpose application binaries on the storage drive without modification. To support efficient communication with the ISP engine, we have developed a TCP/IP-based tunneling system that runs over the regular NVMe/PCIe (Peripheral Component Interconnect express) bus and allows the ISP engine to communicate with other processing nodes and the Internet. We then deploy these CSDs on datacenter-grade storage servers and demonstrate the effectiveness of distributing data analytics applications, such as distributed training and inferencing of machine learning models or database search, over a heterogeneous group of host processors and ISP engines. To do so, we have developed frameworks and tools to efficiently distribute general applications such as these over a heterogeneous configuration of a host system and a cluster of CSDs.

The rest of the article is organized as follows. Section 2 provides a background and related work on the newly emerged data transfer bottleneck, the concept of computational storage, CSDs proposed to date, and their use in heterogeneous systems. Section 3 describes our proposed CSD architecture and its design, including both hardware and software. Section 4 explains our developed application for efficient and robust distribution of deep neural network (DNN) training over heterogeneous systems, and Section 5 covers the other classes of applications that we enhanced and deployed on CSDs. Section 6 evaluates our approach with experimental results on our CSD design. Finally, Section 7 concludes the article with directions for future work.


2 RELATED WORK

2.1 Data Transfer Bottleneck

The trend toward SSDs has also brought major changes in the architecture of storage area networks (SANs). In traditional SANs, multiple HDDs are attached to a server via SCSI or SATA, and the servers are connected to each other via a high-speed protocol such as Fibre Channel (FC). Although SATA's 750-MBps bandwidth sufficed for the data transfer rate of HDDs, it cannot keep up with the performance that can be extracted from today's NAND flash media. New industry-standard protocols such as NVMe have resolved this bottleneck not only with a much higher transfer rate but also by connecting to the CPU more directly via PCIe. However, even such advances cannot remove the bottlenecks on the long path from the non-volatile data storage (flash) to the processing units (CPU, GPU), which continue to hinder the full utilization of data read/write speed in storage servers. In I/O-intensive applications such as DNN training or natural language processing (NLP), this bandwidth mismatch can easily become the bottleneck of the entire operation. For example, the datasets used for training DNNs are orders of magnitude larger than the DRAM size of the computing systems; the embedding tables of an NLP application such as a recommendation engine can exceed tens of gigabytes, much more than the typical DRAM capacity of such systems. Studies show that even with efficient methods such as pipelining and prefetching, up to 70% of the training time for DNNs can be wasted on blocking I/O operations that bring the raw, unprocessed training data to the processor. These challenges signify the importance of near-data processing, or ISP.
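The pipelining and prefetching mentioned above can be sketched in a few lines: a background thread keeps reading batches ahead of the consumer so that storage I/O overlaps with computation. This is a generic illustration of the technique, not the exact loaders used in the cited studies:

```python
import queue
import threading

def prefetching_loader(read_batch, num_batches, depth=2):
    """Yield batches while a background thread reads ahead, so blocking
    storage I/O overlaps with host-side computation on earlier batches."""
    q = queue.Queue(maxsize=depth)   # bounded buffer of prefetched batches
    sentinel = object()

    def producer():
        for i in range(num_batches):
            q.put(read_batch(i))     # the blocking storage read happens here
        q.put(sentinel)              # signal end of data

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            break
        yield batch

# Example: the "read" simply returns the batch index.
batches = list(prefetching_loader(lambda i: i, 5))
print(batches)  # → [0, 1, 2, 3, 4]
```

The bounded queue caps memory use; a deeper queue hides more I/O latency at the cost of buffering more raw data in DRAM.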

2.2 Computational Storage

Computational storage architectures enable improvements in application performance or infrastructure efficiency or both through the integration of compute resources outside the traditional compute-and-memory architecture. The compute resources may be either directly integrated with storage or between the host and the storage. The goal of these architectures is to enable parallel computation while alleviating constraints on existing compute, memory, storage, and I/O resources. Figure 2 shows different classes of computational storage, or CSx [3], where x may be the processor, drive, or array.

Fig. 2.

Fig. 2. Classes of computational storage systems.

2.2.1 Computational Storage Processor.

A computational storage processor (CSP) is a component capable of executing one or more computational storage functions (CSFs) for an associated storage system without providing persistent data storage. The CSP contains computational storage resources (CSRs) and device memory. The mechanism by which the CSP is associated with the storage system is implementation-specific.

2.2.2 Computational Storage Drive.

A CSD is a component that can execute one or more CSFs while providing persistent data storage. The CSD contains a storage controller, CSR, device memory, and persistent data storage. A CSD may continue to function as a standard storage drive with existing host interfaces and drive functions. As such, the system can have a storage controller with associated storage memory along with storage addressable by the host through standard management and I/O interfaces.

2.2.3 Computational Storage Array.

A computational storage array (CSA) is a storage array capable of executing one or more CSFs. As a storage array, a CSA contains control software, which provides virtualization to storage services, storage devices, and CSRs for the purpose of aggregating, hiding complexity, or adding new capabilities to lower-level storage resources. The CSRs in the CSA may be centrally located or distributed across CSDs or CSPs within the array.

All CSx classes are similar in design, as they all consist of the following components:

(1) CSRs, which contain a resource repository to store blocks such as CSFs and/or computational storage engine environments (CSEEs), a function data memory (FDM) that can be partitioned into allocated function data memory (AFDM), and one or more computational storage engines (CSEs)

(2) A storage controller (for a CSD) or an array controller (for a CSA)

(3) A device memory

(4) A device storage (only for CSD and CSA)

Of these three classes, the CSD is the most efficient approach for several reasons. First, it has a shorter data path than the CSP does. Second, it offers finer granularity and fewer complications than the CSA.

2.3 Computational Storage Drives

CSDs can be structured in different ways. The computing resource may use the existing SSD controller, a separate processing engine, or an integrated SSD+ISP engine. We also discuss issues with accelerators such as FPGAs or GPUs.

2.3.1 Utilizing the SSD Controller’s Processing Resources.

The easiest way to develop a CSD is to use the processing engines already available in the SSD [29, 37, 40, 43, 59, 66]. Lee et al. [43] implement a CSD based on the Jasmine OpenSSD platform [5] to perform external sorting. Since the platform has a single ARM7TDMI-S core running at up to 87.5 MHz, they used the same processor for ISP, which yields a modest performance improvement of 39% over the traditional external sorting algorithm. RecSSD [66] proposes a near-data processing solution that improves the performance of the underlying SSD storage for embedding table operations in recommendation applications. It utilizes the internal SSD bandwidth and reduces data communication overhead between the host processor and the SSD by offloading the entire embedding table operation, including gather and aggregation computations, to the SSD. The hardware is the commercial Cosmos+ OpenSSD evaluation platform [4] with a dual-core 1-GHz ARM Cortex-A9 featuring a custom software stack on the flash translation layer (FTL). RecSSD reduces end-to-end neural recommendation inference latency by 4× compared to off-the-shelf SSD systems, at the cost of reduced SSD performance, since the ISP engine shares its processing cores with the conventional SSD controller. Biscuit [29] is a near-data processing framework for running applications distributed across the host system and the storage device. Biscuit uses two ARM R7 cores that are originally dedicated to conventional SSD controller tasks; by offloading parts of the computation to the SSD, the overall performance improves. However, the SSD's original read/write functionality cannot be suspended while a user application runs in place, and as with the previous projects, it is unclear how much Biscuit's performance degrades when host I/O requests and in-place user applications occur simultaneously. The Biscuit authors also proposed an innovative flow-based programming model for dynamically offloading tasks to the embedded processing engine, but designing complex user applications on top of this model is potentially time-consuming.
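The embedding-table offload that RecSSD performs is, at its core, a gather followed by an aggregation; doing it in storage means only the reduced vector crosses the host interface instead of every gathered row. A simplified, framework-free sketch of that computation (the table contents and shapes here are made up):

```python
def gather_aggregate(table, indices):
    """Gather rows of an embedding table and sum them element-wise.
    Offloading this to the CSD means only the final vector (one row's
    worth of data) crosses the host interface, not len(indices) rows."""
    dim = len(table[0])
    result = [0.0] * dim
    for i in indices:           # gather: fetch each referenced row
        row = table[i]
        for d in range(dim):    # aggregate: element-wise sum
            result[d] += row[d]
    return result

table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = gather_aggregate(table, [0, 2])
print(pooled)  # → [6.0, 8.0]
```

The data-reduction ratio is roughly len(indices):1, which is why this pattern benefits so strongly from in-storage execution.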

2.3.2 Coupling an External Processing Engine.

The second approach is to couple an external processing engine with the storage drive [24, 38, 42, 62]. Jun et al. [38] proposed a scalable architecture, code-named BlueDBM, that is a NAND-based storage system coupled with an external FPGA accelerator. It shows up to a 10-fold speedup when running pre-defined tasks such as nearest-neighbor search or graph traversal compared to a system using no CSD. However, such systems have drawbacks in design and implementation time and in reconfigurability. Deploying an FPGA-based unit requires register transfer level (RTL) design, which significantly complicates the deployment of new tasks [52, 54]. Likewise, reconfiguring an FPGA requires loading a compressed bitstream into configuration memory, which can take hundreds of milliseconds [51, 53, 55]. Our early prototype, CompStor, also followed this concept, with an off-chip ISP engine attached to the main FPGA-based flash storage controller. Such a design may suffer from low efficiency due to constantly transferring data off-chip to the ISP engine [62]. Several research works have implemented data analytics on SmartSSD, a well-known CSD platform that uses an external FPGA as the ISP engine [14, 24, 39, 42, 56]. SmartSSD is a 4-TB NVMe storage drive with an external FPGA as the ISP engine. Chapman et al. [24] investigated a computational storage platform based on SmartSSD for big data analytics. The FPGA can communicate directly with the storage by transferring data in P2P mode, and with their software stack it runs user binaries without modification, making it easy to port applications. Evaluation shows that this system can achieve a 6× improvement in query times and an average of 60% lower CPU usage. However, no power or energy results are reported, even though the FPGA and the off-chip accelerator are likely to be power-hungry in this setup. As in other research works, FPGA-based ISP requires extensive effort to implement new applications. Moreover, this design does not represent a true ISP platform, as the data has to migrate from the SSD to the ISP module; hence, the data transfer bottleneck still persists in high-speed applications.

2.3.3 An Integrated SSD + ISP Engine.

The last approach is to build an integrated ISP+SSD controller to eliminate off-chip communication for ISP. To the best of our knowledge, there are only two designs that incorporate such integration: ScaleFlux and Catalina. ScaleFlux [1] is an FPGA-based CSD where both the SSD controller and the ISP engine are implemented on an FPGA. Like other FPGA-based ISPs, the critical functions that need to be accelerated must be implemented in RTL and offloaded onto the FPGA. Catalina [63] is a CSD in the AIC (add-in card) form factor that uses a Xilinx Zynq UltraScale+ MPSoC to implement the ISP and SSD controller functionality. Unlike ScaleFlux, Catalina uses two ARM Cortex-R5 embedded real-time processors plus FPGA programmable resources for the SSD controller and implements the ISP engine on a quad-core ARM Cortex-A53 application processor.

We take ISP a step further by designing an ASIC to include an SSD controller and a dedicated ARM processor for running general applications. To the best of our knowledge, Newport is the first and only single ASIC-based CSD controller with a multi-purpose processing engine. The ARM processor can run Linux executables without source code modification or recompiling. We have also implemented a TCP/IP-based tunneling system that allows the ISP engine to connect to a network, including the Internet and other processing nodes. Table 1 summarizes the technical characteristics of some of the well-known CSDs. The next section details the hardware and software that enable our system to function both as a regular storage system and as a high-capacity stand-alone processing node.
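Because the tunneling system exposes the ISP engine as an ordinary TCP/IP endpoint, host software talks to it with standard sockets rather than a vendor-specific API. The sketch below illustrates the idea with a local stand-in for the ISP engine; the addresses, query string, and wire format are hypothetical, not the actual Newport protocol:

```python
import socket
import threading

def isp_stub(server_sock):
    """Stand-in for the ISP engine: receive a query, send back a small
    result. On a real CSD, the same socket code would run on the drive's
    Linux, reached through the NVMe/PCIe tunnel."""
    conn, _ = server_sock.accept()
    with conn:
        query = conn.recv(1024).decode()
        conn.sendall(f"matches for {query!r}: 3".encode())

def send_query(addr, query):
    """Host side: plain TCP client, no vendor-specific API needed."""
    with socket.create_connection(addr) as s:
        s.sendall(query.encode())
        return s.recv(1024).decode()

server = socket.socket()
server.bind(("127.0.0.1", 0))   # any free port; a tunnel-assigned address
server.listen(1)                # would appear here on a real deployment
threading.Thread(target=isp_stub, args=(server,), daemon=True).start()

reply = send_query(server.getsockname(), "grep error *.log")
print(reply)
```

Only the (small) reply crosses the link; the bulk data the query touched stays on the drive.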

Table 1.
Name | Capacity | SSD Controller | CSR | ISP Engine | Support for OSs | Programming Model
OpenSSD Jasmine [5] | 64 GB | ARM7TDMI-S | Shared | ARM7TDMI-S | No | Bare-metal software
OpenSSD Cosmos+ [4] | 512 GB | XC7Z045-3FFG900 Zynq-7000 FPGA | Shared | Dual-core ARM Cortex-A9 and dual Neon DSP co-processors | No | RTL design and bare-metal software
Samsung SmartSSD [14] | 4 TB | ASIC | Dedicated | Kintex XCKU15P UltraScale+ FPGA | No | RTL design
ScaleFlux [1] | 8 TB | FPGA | Dedicated | FPGA | No | RTL design
Newport | 32 TB | ASIC | Dedicated | Quad-core ARM A53 processor and four Neon DSP co-processors | Ubuntu and Debian | Software on Linux OS and TCP/IP networking

Table 1. Technical Specifications of Commercial CSDs

2.4 CSDs in Heterogeneous Computing Architectures

CSD is a new concept that is making its way into the data storage market. In some ways, a CSD resembles a GPU in that both are hardware accelerators that communicate with the main processor through PCIe. Due to the long expected lifetime, high quality-of-service (QoS) requirements, and limited working conditions (available space, temperature, etc.) of storage servers, a suitable replacement for conventional SSDs in enterprise-level storage servers must fit in the same power, size, and working-condition envelope. These constraints rule out alternatives such as GPUs, TPUs, or FPGAs as accelerators in storage servers. What distinguishes the CSD is its ultra-low-power footprint and minimal size overhead, which make it indistinguishable from a regular SSD. In addition, sitting behind the (relatively) slow front-end (FE) interface, the ISP engine has much higher-speed access to the data stored on the flash, making it well suited to low-computation, I/O-intensive applications, where it can be more efficient than the host.

Compared to FPGA platforms, our CSD design benefits from a central ASIC that hosts both the SSD controller and the ISP engine. Although FPGAs are more flexible than ASICs and can massively parallelize an architecture, they consume considerably more power, require more silicon area, and operate at lower frequencies and throughput [17, 41]. An FPGA platform is ideal for rapid prototyping and engineering development, but it rarely scales to mass production. As a result, the CSD is not meant to compete with high-end GPUs or FPGAs. Instead, it introduces a new paradigm that enhances the performance of storage systems by augmenting the host with extra processing power at minimal overhead and by running the I/O-intensive parts of the workload with less data transmission. A comparison is shown in Figure 3.

Fig. 3.

Fig. 3. CSD as a complementary component in a high-performance architecture.


3 DESIGN OF A CSD

This section describes the stages of development of the Newport family of CSDs and how our early off-chip CSD architecture led to a more mature and complete solution. First, we describe the fundamentals of a general SSD architecture and how computational storage concepts fit in this domain. Then, we present our early CSD architecture and discuss the shortcomings of this off-chip design. Finally, we explain the detailed hardware and software architecture of the Newport family of CSDs.

3.1 Solid State Drives

SSDs are becoming popular both in personal computers and in enterprise-level storage systems. They deliver significantly higher performance than traditional HDDs. In addition, due to the lack of mechanical components, SSDs can be smaller in size and consume less power while accommodating much higher capacity. A modern SSD is composed of two main components: the SSD controller and the non-volatile storage media [25]. The controller unit can be further broken down into the FE, back-end (BE), and central controlling units.

3.1.1 NAND Flash Media.

The structure of a flash memory chip is shown in Figure 4. A NAND flash memory chip is a package containing multiple dies. A die is the smallest unit of flash memory that can independently execute I/O commands and report status. Each die is composed of a few planes, and each plane contains multiple blocks, where a block is the smallest erasure unit. Inside each block are multiple pages, which are the smallest programming (writing) units. The key point in this hierarchical architecture is the distinction between the programmable unit and the erasable unit: NAND flash memory can be programmed at the page level, usually 4 to 16 KB, whereas the erase operation cannot be performed on any segment smaller than a block, which can be several megabytes. Although the cost of SSDs is higher than that of HDDs, the difference is rapidly shrinking, thanks to new flash technologies such as QLC (quadruple-level cell) flash and the upcoming PLC (penta-level cell) flash, at the cost of an increased bit error rate (i.e., lower reliability). Such advancements have paved the way for the adoption of NAND flash-based storage technology everywhere, from consumer goods to the cloud and the edge [27, 36, 47].
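The die/plane/block/page hierarchy implies that a flat page number decomposes into coordinates at each level. A toy decoding with illustrative geometry parameters (not the geometry of any specific NAND part):

```python
def decode_page_address(lpn, pages_per_block=256, blocks_per_plane=1024,
                        planes_per_die=4):
    """Split a flat page number into the die/plane/block/page hierarchy
    described above. All geometry parameters are illustrative."""
    page = lpn % pages_per_block       # smallest programmable unit
    lpn //= pages_per_block
    block = lpn % blocks_per_plane     # smallest erasable unit
    lpn //= blocks_per_plane
    plane = lpn % planes_per_die
    die = lpn // planes_per_die        # smallest independently commanded unit
    return die, plane, block, page

print(decode_page_address(0))    # → (0, 0, 0, 0)
print(decode_page_address(300))  # 300 = 1 * 256 + 44 → (0, 0, 1, 44)
```

With 16-KB pages and 256 pages per block, this geometry gives 4-MB erase blocks, which is why a one-page update is so much cheaper than an erase.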

Fig. 4.

Fig. 4. Flash chip organization in an SSD.

3.1.2 Controller.

As the brain of the SSD, the controller runs various functions to efficiently manage the NAND flash media. Examples include garbage collection, which reclaims blocks containing invalid data, and wear leveling, which ensures that flash blocks are used evenly to prolong the SSD's life. An FTL maps logical addresses to physical addresses so that the logical view of storage is separated from the inner management of the physical memory. A major responsibility of the controller is to handle data writes. Data cannot be overwritten in place on flash memory and may only be written to erased blocks. This means that if a page within a block must be updated, the SSD controller has to read the whole block of data, update the page content, and write it back to an erased block. To mitigate this write overhead, the garbage collection routine erases blocks during off-peak times to maintain optimal write speeds [60].
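The read-modify-write cycle described above is the source of flash write amplification: updating even one page can force a whole block to be rewritten. A simplified worst-case model (real FTLs avoid much of this with log-structured writes and background garbage collection):

```python
def update_page_cost(pages_per_block=256, dirty_pages=1):
    """Write amplification when `dirty_pages` pages of a full block are
    updated and the block must be relocated whole: pages physically
    programmed per page the host actually changed."""
    physical_writes = pages_per_block       # the whole block is rewritten
    return physical_writes / dirty_pages    # amplification factor

print(update_page_cost())                # → 256.0: 1-page update rewrites 256
print(update_page_cost(dirty_pages=64))  # → 4.0: batching updates amortizes it
```

Batching dirty pages within a block before relocation is exactly the kind of amortization the FTL's off-peak garbage collection provides.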

3.1.3 Front-End Interface.

Protocols for transferring data between the host and storage devices include SATA [9], SAS [8], and NVMe over PCIe [6]. SSDs can execute multiple I/O commands simultaneously and with low latency. Among these interfaces, NVMe is best suited for enterprise SSDs and has impressive performance in transferring data between SSDs and the host. This makes NVMe the protocol of choice for our proposed CSD architectures. The full technical specification of the NVMe-over-PCIe protocol is complicated and outside the scope of this article; the rest of this section briefly reviews the protocol.

PCIe [44] is a high-speed bus standard that uses a set of unidirectional pairs of serial and point-to-point links, called lanes. A PCIe slot can have 1, 4, 8, or 16 lanes, denoted as ×1, ×4, ×8, and ×16, respectively. The PCIe protocol is composed of three layers, namely the transaction layer, data link layer, and physical layer, and currently there are four generations of the PCIe bus protocol. Each lane of PCIe Gen 1, Gen 2, Gen 3, and Gen 4 provides a data bandwidth of 250, 500, 985, and 1,970 MBps, respectively. PCIe links can be used to connect different peripherals to hosts, such as video cards, expansion cards, and storage units. Our proposed CSD architectures contain a host interface based on PCIe Gen 3 × 4, which can provide a bandwidth of up to 3,940 MBps.
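The per-lane figures above make link bandwidth a one-line calculation; for example, the Gen 3 × 4 interface used by our CSDs comes out to 3,940 MBps:

```python
# Per-lane bandwidth in MBps for PCIe generations 1-4, as listed above.
PCIE_LANE_MBPS = {1: 250, 2: 500, 3: 985, 4: 1970}

def link_bandwidth_mbps(gen, lanes):
    """Aggregate one-direction bandwidth of a PCIe link: per-lane rate
    for the generation times the number of lanes."""
    return PCIE_LANE_MBPS[gen] * lanes

print(link_bandwidth_mbps(3, 4))   # → 3940, the Gen 3 x4 figure above
print(link_bandwidth_mbps(4, 16))  # → 31520, a Gen 4 x16 accelerator slot
```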

The NVMe protocol uses the PCIe data link to transfer data between a host and an SSD. Traditional data transfer protocols, developed for HDDs, have just one queue for submitting I/O commands and thus cannot sustain SSDs' much higher bandwidth. To support SSDs' ability to run multiple I/O commands at the same time, NVMe provides up to 64K data transmission queues, each of which supports up to 64K parallel I/O commands.

3.1.4 Standard Form Factors.

Several industry-standard form factors for flash-based storage devices exist. Some target commodity machines, whereas others are designed for data centers and edge infrastructure [19].

Among all form factors, EDSFF (Enterprise and Data Center SSD Form Factor) is the gold standard that also takes into account the thermal condition and the harsh environment of enterprise systems. It offers a set of flexible specifications of form factors, including E1.L, E1.S, E3.L, and E3.S, among others. The proposed CSD, Solana, is designed and manufactured for edge systems, and the E1.S form factor has been chosen to address thermal challenges and also to provide enough room for a large number of NAND flash chips. The width and length of Solana are 31.5 mm and 111.49 mm, respectively. Moreover, a heat sink can be mounted on drives in E1.S form factor for passive thermal dissipation.

The U.2 form factor is 2.5 inches and is the most common for SSDs, offered with PCIe (with NVMe), SAS, or SATA interfaces. Depending on capacity and interface, the 69.85 × 100 mm U.2 drive can have a thickness of 7 or 15 mm. U.2 can be deployed in a wide range of systems, from laptops and desktops to enterprise storage servers. U.2 is defined by compliance with the PCI Express SFF-8639 Module specification, rather than by referencing SAS or SATA SSDs as was typically done previously.

The M.2, formerly known as Next Generation Form Factor (NGFF), is widely used for internally mounted SSDs. It supports PCIe, SATA, and USB interfaces and comes in various widths and lengths. It also has keying notches on the edge connector to designate various interface or PCIe lane configurations. M.2 is smaller than the typical 2.5-inch SSD and is typically removable.

We used our Newport controller to prototype CSDs in all three form factors. Our M.2 CSD is a Gen 3 × 4 storage device with a capacity of up to 8 TB and 4 GB of on-board DRAM. Note that due to space limitations, the DRAM is smaller than in the U.2 form factor. The CSD in the E1.S form factor is similar to the M.2 version but has more PCB area and a wider power range; it offers up to 12 TB of NAND flash and 4 GB of DRAM. Table 2 shows the specifications of our CSDs in the different form factors, and Figure 7 shows the actual prototypes.

Table 2.
Name | Form Factor | Interface | DRAM | Capacity | Dimensions (mm)
Newport | U.2 | NVMe over Gen3 PCIe × 4 | 8 GB | Up to 32 TB | 69.85 × 100 × 15
Laguna | M.2 | NVMe over Gen3 PCIe × 4 | 4 GB | Up to 8 TB | 22.15 × 110 × 3.88
Solana | E1.S | NVMe over Gen3 PCIe × 4 | 4 GB | Up to 12 TB | 31.5 × 111 × 5.9

Table 2. Hardware Specifications of Newport Family CSDs

3.2 CompStor: Early Off-Chip CSD Solution

The first CSD that we designed and prototyped was CompStor [62] ("computational storage"), the first CSD with a dedicated quad-core application processor as the ISP engine. In this early prototype, we chose to separate the development of the conventional flash management functionality from the innovative ISP capabilities to avoid potential errors and extra complications. Therefore, CompStor is composed of two subsystems on separate boards: an SSD subsystem implementing the flash management routines and an ISP engine.

3.2.1 SSD Subsystem.

The conventional subsystem contains a controller, 2 GB of DRAM, and an array of flash packages. The controller is composed of two MicroBlaze processors [10], an ECC unit, a host NVMe-over-PCIe interface, a memory controller, and a flash memory interface. All of these components are implemented in the FPGA. The two MicroBlaze processors serve as the FE and BE processors that run the SSD controller firmware and control the other modules.

An internal data bus in the conventional subsystem transfers data between its components. This data bus is attached to the ISP engine, which is responsible for running user applications. In other words, the ISP engine is attached as an external unit that augments the storage subsystem with ISP capabilities, making CompStor an off-chip computational storage solution. An FMC (FPGA mezzanine card) connector [21] provides the connection between the two subsystems. In addition, an Ethernet connection allows a TCP/IP connection between the ISP engine and the host. Figure 5 shows the prototype.

Fig. 5.

Fig. 5. CompStor prototype.

3.2.2 ISP Engine.

The CSD controller is implemented by attaching an ISP engine via an FMC connector to the data bus that also connects the conventional subsystems. For the implementation of the conventional subsystem, we used a Xilinx Virtex-7 2000T FPGA, whereas the ISP engine was implemented using a Xilinx Zynq UltraScale+, an MPSoC chip containing an FPGA together with a quad-core 64-bit ARM Cortex-A53 processor.

CompStor can considerably improve the system performance and energy efficiency of I/O- and compute-intensive applications [62], but it is limited by off-chip data transfer. As an ISP device, CompStor still needs to move data to/from the conventional subsystem. Although this transfer is less expensive than one over the complex NVMe interface, it is off-chip and thus incurs higher latency and energy consumption than on-chip solutions.

3.3 Newport’s Hardware Architecture

To address the shortcomings of CompStor, we developed a more integrated, single-chip architecture for CSD controllers. We first prototyped it on an FPGA platform called Catalina [63]; however, due to inherent FPGA limitations, it became clear that FPGA-based CSD controllers cannot reach the full potential of ISP capabilities. Therefore, the next family of single-chip, ASIC-based CSD solutions, called Newport, was introduced. The Newport ASIC was designed by NGD Systems in a 14-nm CMOS FinFET process. The EDA tools, RTL design, and layout methodologies are mainstream for this process node. The ARM processors, PCIe, DDR, embedded SRAMs, and chip I/Os are IPs licensed from third-party IP providers. The rest of the logic was coded in Verilog and synthesized with tools from Cadence Design Systems.

Figure 6 shows the high-level architecture of the CSD. The proposed CSD is composed of three main components, namely the front-end (FE), back-end (BE), and ISP subsystems. The FE and BE subsystems are similar to those in off-the-shelf flash-based storage devices (i.e., SSDs). Table 2 summarizes the specifications of our three prototypes, whereas Figure 7 shows the actual prototypes.

Fig. 6.

Fig. 6. High-level architecture of Newport CSD.

Fig. 7.

Fig. 7. CSD prototypes in different form factors.

To manage the NAND flash modules, the BE subsystem contains three ARM Cortex-M7 processing cores together with a fast-released buffer (FRB), an error-correction (ECC) unit, and a memory controller interface unit (MIC). The ARM Cortex-M7 cores run the FTL processes, including logical-to-physical address translation, garbage collection, and wear-leveling. The ECC unit detects and corrects errors originating in the flash memory modules. The FRB transfers data between the MIC and the other components, whereas the MIC issues low-level I/O commands to the NAND flash modules. These modules are organized in 16 channels, so 16 I/O transfers can be performed simultaneously.

3.3.1 Front-End.

The FE is responsible for communicating with the host via the NVMe protocol. It receives the I/O commands from the hosts, interprets them, checks the integrity of the commands, and populates the internal registers to notify the BE that a new I/O command is received. The FE also packetizes/depacketizes the data transferred to/from the host and the CSD. This subsystem consists of a single-core ARM Cortex-M7 processor and an ASIC-based NVMe/PCIe interface.

3.3.2 ISP Engine.

Besides the FE and BE subsystems, there is an ISP engine inside Newport, composed of a quad-core ARM Cortex-A53 processor running at 1 GHz and a software stack that provides a seamless environment for running user applications. The ISP engine has access to the shared 8 GB of memory. A full-fledged Linux OS has been ported to the ISP engine, so the ISP engine supports a vast spectrum of programming languages and models. The ISP engine has a low-latency, power-efficient direct link to the BE subsystem to read and write the flash memory. In other words, the data transferred to the ISP engine bypasses the whole FE subsystem and the power-hungry NVMe/PCIe interface.

3.4 Newport Software Architecture

The ISP’s dedicated embedded processors require different software layers to provide the underlying environment for executing various types of applications. By deploying a conventional embedded Linux-based OS, we enable a compatible environment to support a wide spectrum of applications, programming languages, and command lines without modifications. Considering the unique architecture of our CSD, the embedded Linux is extended with our own custom features, including device drivers, file systems, and communication paths.

3.4.1 Customized Block Device Driver.

We developed a customized block device driver (CBDD) to enable access to the storage units, optimized for the specific on-chip communication links and protocols, which differ from common protocols between processing units and storage devices such as PCIe and SATA. The CBDD uses a command-based mechanism to communicate with the BE subsystem to initiate data transfers and receive their completions. Using a scatter-gather mechanism, the BE subsystem handles data transfers through the DDR addresses exposed by the ISP’s OS.

3.4.2 Shared File System between Host and ISP.

One unique feature of our CSD is that the CBDD supports file system access by both the ISP applications and the host. The ability to mount partitions inside the ISP engine and access files through file systems makes it not only easier to port applications but also more efficient to access the data. Moreover, there is a software layer for file system synchronization between the OSs of the host and the Newport ISP engine. These two OSs can access the data stored on the flash memory array at the file system level and concurrently mount the same storage media, which can be problematic without a synchronization mechanism [64]. We implemented the Oracle Cluster File System version 2 (OCFS2) [7] between the host and the CSD. Using OCFS2, both the host and the ISP engine can issue flash I/O commands and mount the shared flash memory natively and simultaneously.

3.4.3 Communication Paths.

Figure 8 depicts the different layers of the Newport software stack. It supports three communication paths over the different interfaces on our CSD: flash-to-host, flash-to-ISP, and TCP/IP tunneling.

Fig. 8.

Fig. 8. The Newport software stack provides three different communication paths. (a) The conventional data path through the NVMe driver to the host. (b) The path through on-chip connection to the ISP subsystem with file system abstraction. (c) The TCP/IP tunneling over the PCIe/NVMe.

The conventional data path (shown as the path in Figure 8(a)) is the flash-to-host interface that goes through the NVMe protocol over PCIe to allow the host to access the data stored in the NAND flash memory. To do so, our NVMe controller handles NVMe commands by responding to them appropriately and managing data movement through communication with our flash media controller.

The path in Figure 8(b) is the flash-to-ISP interface, which provides file system based data access to the ISP engine through on-chip connection. This file system based data access interface is implemented by the CBDD through communication with our flash media controller. In fact, the flash media controller handles requests from both the ISP engine and the host.

The path in Figure 8(c) is for the TCP/IP tunneling over the PCIe/NVMe. From the user’s point of view, it is crucial to have a standard communication link between the host and the ISP engine. The user on the host side can use this communication link to initiate the execution of the application inside the ISP engine and monitor the outcome of the executions. To provide such a standard link, a TCP/IP tunnel through the NVMe/PCIe is supplied. In other words, host-side applications can communicate to the applications running inside the CSD via a TCP/IP link. This link is also essential for distributed processing applications, where processes running on multiple nodes require a TCP/IP link to communicate [61].

In addition to enabling communication between the host (and the wide area network) and the ISP system, this tunneling feature eliminates the need for unwieldy network setup. In other words, many cables and switches would otherwise be required to connect the many tightly assembled storage units, which would be impractical to maintain and scale. The proposed TCP/IP tunnel uses two shared buffers on the on-board DDR to provide the data communication. Two user-level applications, one on the host and one on the ISP running in the background, are responsible for managing NVMe-based data movement and encapsulating/decapsulating the TCP/IP data through NVMe packets.
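As an illustration of the encapsulation step, the sketch below frames a TCP payload into fixed-size slots like those carved out of the shared DDR buffers, and reassembles it on the other side. The 4-KB frame size and the length-prefix header are hypothetical choices for illustration, not the actual Newport wire format:

```python
import struct

FRAME_SIZE = 4096             # hypothetical shared-buffer slot size
HEADER = struct.Struct("<I")  # 4-byte little-endian payload length per frame

def encapsulate(payload: bytes) -> list:
    """Split a TCP payload into fixed-size frames, each prefixed with its length."""
    chunk = FRAME_SIZE - HEADER.size
    return [HEADER.pack(len(payload[i:i + chunk]))
            + payload[i:i + chunk].ljust(chunk, b"\x00")  # zero-pad the last frame
            for i in range(0, len(payload), chunk)]

def decapsulate(frames: list) -> bytes:
    """Reassemble the original payload from a list of frames."""
    out = bytearray()
    for frame in frames:
        (length,) = HEADER.unpack(frame[:HEADER.size])
        out += frame[HEADER.size:HEADER.size + length]
    return bytes(out)
```

The background applications on the host and the ISP would then only need to copy such frames into and out of the two shared DDR buffers via NVMe commands.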

Skip 4DISTRIBUTED TRAINING ON CSDS Section

4 DISTRIBUTED TRAINING ON CSDS

Distributed training of deep learning models can be done efficiently on CSDs while preserving data privacy. This section first reviews the types of parallelization applicable to DNNs so they can run in a distributed environment with heterogeneous or homogeneous nodes. We propose a framework named Stannis for distributing training workload onto a network of CSDs and a method called Hypertune for adapting the hyperparameters for quicker convergence.

4.1 Model and Data Parallelization

One practical approach to reducing the training time is to parallelize the training task onto multiple processors. Two common methods for parallelizing such tasks are model parallelism and data parallelism. In the model-parallel approach, each processing node is responsible for training a part of the neural network, whereas in the data-parallel approach, all processing nodes hold a replica of the entire network and update it locally during training; since different copies of the network receive different updates, a synchronization method is needed to merge them. Figure 9 shows the architecture of these two approaches. Several well-known AI frameworks, such as TensorFlow [15], PyTorch [50], Theano [58], and Horovod [57], have implemented distributed training of DNNs. Even though all of these frameworks can run distributed training on multiple heterogeneous nodes, they work most efficiently in a homogeneous environment such as a cluster of GPUs and cannot efficiently handle a heterogeneous mix of low-power nodes such as CSDs.

Fig. 9.

Fig. 9. Model parallel vs. data parallel [65].

4.2 Stannis: Framework for Training Neural Networks in Storage

To address the needs of our approach, we developed a framework named Stannis based on Horovod. Stannis, for System for TrAining of Neural Networks In Storage, is a framework for the data-parallel distribution of neural network training on homogeneous and heterogeneous systems. Horovod achieves great speedup on homogeneous systems, but on a heterogeneous system, the slowest processor becomes the bottleneck due to the synchronization required in training; in other words, a faster processing engine must wait for the slower ones. To overcome Horovod’s inability to work efficiently on heterogeneous systems, Stannis scales the workload with consideration for transmission delay and data privacy.

4.2.1 Time Equalization by Benchmarking.

To minimize stalling across processors, Stannis tries to equalize the processing time by setting a different batch size for each processing engine. In other words, a slower processor gets a smaller batch size and thus ideally finishes the epoch in the same elapsed time as the more powerful processors. Algorithm 1 shows the pseudocode for Stannis. It starts by running a series of benchmarks on all processing engines to assess their processing speeds and find the optimal batch size for each processor. Based on the results, the best batch size is selected for the slowest engine first, since the slower engines outnumber the others and consequently have more impact on the system’s performance. In addition, slower engines are more sensitive to batch size changes than the more powerful host processor.
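The benchmarking step can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: `train_step` stands in for the real per-engine training kernel, and the candidate list is a hypothetical search space.

```python
import time

def measure_speed(train_step, batch_size, n_steps=5):
    """Time a few training steps and return throughput in samples/sec."""
    start = time.perf_counter()
    for _ in range(n_steps):
        train_step(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * n_steps / elapsed

def find_best_batch_size(train_step, candidates):
    """Benchmark each candidate batch size and keep the fastest one."""
    speeds = {bs: measure_speed(train_step, bs) for bs in candidates}
    return max(speeds, key=speeds.get), speeds
```

Running this once per engine yields the speed profile that the rest of the tuning (and later HyperTune) works from.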

4.2.2 Batch Size Adaptation.

Having the batch size for one device, we calculate the time it takes to finish one batch and find the best batch size on the other processing engines that consume comparable elapsed time. We increase the batch size by a fraction (\( \frac{1}{C} \)) of the difference in time for the two processors until they consume similar times. C is a constant that determines how much to adjust the batch size after each test. Larger C means more fine-grained batch size updates. We also consider the slowdown that occurs due to the synchronization process and allocate a (\( \frac{time}{E} \)) margin to the final time, where E is determined by observing the slowdown pattern for the newly added processing nodes and is set to allow a fixed 20% margin in the results.
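The adaptation rule above can be sketched as follows. This is an illustrative interpretation, assuming a timing probe `time_of(bs)` that returns seconds per step, a stopping criterion of "within one sample's worth of time," and the margin applied by shrinking the target time by 1/E (E = 5 gives the 20% margin from the text):

```python
def equalize_batch_size(time_of, t_target, bs, C=4, E=5, max_iter=50):
    """
    Adjust an engine's batch size until its step time matches the slowest
    engine's target time. Each iteration moves by roughly 1/C of the
    remaining time gap, converted to a number of samples; a time/E margin
    is reserved for synchronization overhead.
    """
    t_target *= (1 - 1 / E)            # reserve the synchronization margin
    for _ in range(max_iter):
        t = time_of(bs)
        per_sample = t / bs            # seconds per training sample
        gap = t_target - t
        if abs(gap) <= per_sample:     # within one sample's worth: done
            break
        step = max(1, round(abs(gap) / per_sample / C))
        bs += step if gap > 0 else -step
    return max(1, bs)
```

Larger `C` gives smaller, finer-grained moves per iteration, at the cost of more benchmark runs.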

4.2.3 Data Localization.

Stannis also considers the access permissions of the data and assigns the private data to the local ISP engine while sharing public data with the ISP engine and the host processor. This method eliminates the transmission of private data over the network and to the host, thereby increasing the data protection level. Figure 10 shows the overall architecture of running Stannis on a data server.

Fig. 10.

Fig. 10. Distribution of training workload onto CSDs and host using Stannis.

4.3 HyperTune: Adaptive Scheduler for Time Equalization

HyperTune is the runtime mechanism that ensures minimal stalling across processing nodes. Even though Stannis attempts to assign the workload proportionally to the estimated performance of each CSD, that alone cannot guarantee equal execution times. Because the processors are time-shared with other tasks, those tasks can take processing time away from the training session. Although the batch size may be updated based on feedback, some hysteresis must be built in to ensure stable operation. Stannis compensates for workload interruptions on nodes using our implemented function, HyperTune.

4.3.1 Tuning Mechanisms.

HyperTune reschedules the portion of the operation assigned to each node based on the availability of processing cycles on that node. The rescheduling is done by measuring the local processing speed and available processing power and, consequently, updating the batch size list, either by decreasing the batch size on the busy node or by increasing it on the other nodes. Since changing the batch sizes also requires a recalculation of the dataset assignment, Stannis reassigns the dataset based on the new batch sizes to prevent rank stall in the training session. New batch sizes, dataset indexes, and lengths are passed to each node using the MPI_scatter() function. We implemented a monitoring session after each step within the epoch. The speed measurements from all nodes are gathered on one arbitrary node using MPI_gather() and passed to a decision-making function. For better decision making, the speed change along with the percentage progress of the current epoch are passed to the function and converted to a decline index based on the weighted sum in Equation (1): (1) \( \begin{equation} {index}_i = {0.7}\times \frac{{SP}-{SP_i}}{{SP}}+{0.3}\times \left(\frac{{N_\text{step}}-{step_i}}{N_\text{step}}\right), \end{equation} \) where SP is the normal speed obtained from the \( \it batchsize\_to\_speed \)() function, \( SP_i \) and \( step_i \) are the current speed and step, respectively, and \( N_\text{step} \) is the number of steps per epoch.
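The decline index of Equation (1) is straightforward to compute per node; a minimal sketch:

```python
def decline_index(sp_normal, sp_current, step, n_steps):
    """
    Weighted decline index per Eq. (1): 70% weight on the relative speed
    drop, 30% on the fraction of the epoch still remaining.
    """
    return (0.7 * (sp_normal - sp_current) / sp_normal
            + 0.3 * (n_steps - step) / n_steps)
```

At full speed on the last step of an epoch the index is 0; a node running at half speed at the very start of an epoch scores 0.65, making early-epoch slowdowns weigh more than late ones.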

4.3.2 Hysteresis in Adaptation.

To avoid chattering, a hysteresis-like algorithm is implemented to ignore glitches and mis-measurements in speed. When the decline index exceeds 25%, the step is flagged as underutilized, and this report is saved in a separate array. Five consecutive underutilization flags terminate the current epoch and trigger the batchsize_controller() function to determine a new batch size.
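The flag-and-terminate logic amounts to a small state machine; a sketch (the class name and interface are illustrative, not the paper's implementation):

```python
class UnderutilizationDetector:
    """Flag re-tuning only after 5 consecutive steps with decline index > 25%."""

    def __init__(self, threshold=0.25, patience=5):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0        # consecutive underutilized steps seen so far

    def update(self, index):
        """Feed one step's decline index; return True when the epoch
        should be terminated and the batch size recomputed."""
        self.streak = self.streak + 1 if index > self.threshold else 0
        return self.streak >= self.patience
```

A single good measurement resets the streak, which is what filters out transient glitches.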

Our initial approach to determining the new batch size was to use the inverse of the batchsize_to_speed() function we got from the tuning section and find the new batch size based on the current processing speed of the interrupted node. Although this method sounds promising, our evaluations showed non-negligible errors that could worsen the performance.

As a second solution, we decided to use a weighted average of the two benchmark points nearest to the current speed. We use Equation (2) to calculate the new batch size: (2) \( \begin{equation} {BS}_i={BS}_n \times \frac{{SP}_{n+1} - {SP_i}}{{SP}_{n+1}-{SP_n}} + {BS}_{n+1} \times \frac{{SP_i} - {SP_n}}{{SP}_{n+1}-{SP_n}}, \end{equation} \) where \( {BS}_i \) is the new optimal batch size, \( {SP_i} \) is the current speed between \( {SP_{n}} \) and \( {SP}_{n+1} \), and \( {BS_n} \) and \( {BS}_{n+1} \) are the batch sizes corresponding to \( {SP_{n}} \) and \( {SP}_{n+1} \). With these weights, \( {BS}_i \) reduces to \( {BS}_n \) when \( {SP_i} = {SP_n} \) and to \( {BS}_{n+1} \) when \( {SP_i} = {SP}_{n+1} \).
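This weighted averaging is a linear interpolation between the two bracketing benchmark points; a sketch with the weights arranged so that the result reduces to \( {BS}_n \) when the current speed equals \( {SP_n} \) (names are illustrative):

```python
def interpolate_batch_size(sp_i, sp_n, sp_n1, bs_n, bs_n1):
    """
    Linearly interpolate the new batch size between the two benchmark
    points (sp_n, bs_n) and (sp_n1, bs_n1) that bracket the current
    speed sp_i.
    """
    w = (sp_i - sp_n) / (sp_n1 - sp_n)   # 0 at sp_n, 1 at sp_n1
    return round(bs_n * (1 - w) + bs_n1 * w)
```

For example, with benchmark points (10 img/s, batch 16) and (20 img/s, batch 32), a current speed of 15 img/s maps to a batch size of 24.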

A third method that yields almost the same result with no further processing cost is to monitor the CPU usage of the training session on each node. For this approach, we implemented a sliding window that tracks the CPU usage over the last 10 steps. The new batch size is scaled by the ratio between the average utilization of the last five steps (the second half of the window, reflecting the declined CPU usage) and the normal utilization stored in the first half of the window. The window size should be large enough to ignore possible glitches but small enough to detect the utilization decline pattern promptly.
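A minimal sketch of this sliding-window approach, with a window of 10 steps split into two halves of 5 as described above (the class interface is hypothetical):

```python
from collections import deque

class CpuWindow:
    """Track CPU utilization over the last 10 steps and rescale the batch
    size by the ratio of recent (declined) to baseline utilization."""

    def __init__(self, size=10):
        self.window = deque(maxlen=size)

    def push(self, cpu_pct):
        self.window.append(cpu_pct)

    def new_batch_size(self, current_bs):
        if len(self.window) < self.window.maxlen:
            return current_bs                  # not enough history yet
        half = self.window.maxlen // 2
        samples = list(self.window)
        baseline = sum(samples[:half]) / half  # first half: normal usage
        recent = sum(samples[half:]) / half    # second half: declined usage
        return max(1, round(current_bs * recent / baseline))
```

Because the ratio can also exceed 1, the same rule lets the batch size grow back when the contending task releases its cycles, which the speed-only monitor cannot do.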

4.3.3 Hyperparameter Tuning.

Note that parameters such as the size of the sliding window and the margin for speed decline detection are experimental and can be changed based on the required precision. A benefit of using CPU utilization as a gauge is that the system can also increase the batch size in case extra processing cycles are available. When the system frees the occupied processing cycles, the training session can claim it back by increasing the batch size, whereas the speed-monitoring approach cannot detect the availability of extra processing cycles.

By rapidly monitoring the performance and updating the hyperparameters, Stannis assigns a larger portion of the processing load to the free nodes to mitigate the overall performance decline caused by the interruption. One concern is that by terminating an epoch early, the network loses part of the dataset in that training pass, and multiple occurrences of such epoch terminations might lead to completely missing that portion of the dataset. The solution is to shuffle the data when creating the input batches, so that the randomness statistically ensures that all input data will go through training after a sufficient number of epochs.

Another concern is that the training might not converge as a result of constant changing of batch size. To address this concern, we restrict the batch size change to a limited range such that it will not affect the convergence. Careful design of the model’s architecture and dynamic selection of hyperparameters such as batch size and learning rate result in better convergence rate and higher achieved accuracy [18]. Dynamic selection of the learning rate by the decision-making function will be investigated for future work.

Skip 5APPLICATIONS Section

5 APPLICATIONS

In this section, we explain the applications we used to benchmark the CSD capabilities in the context of data analytics. Other than distributed training described in the previous section, we also benchmarked distributed inferencing, database lookup, and the search and analytics engine.

5.1 Distributed Inferencing of NLP

Distributed inferencing is somewhat similar to distributed training (Section 4), but it is simpler as there is no closed-loop feedback for updating the parameters. Just like distributed training, distributed inferencing can be done in either a model-parallel or data-parallel scheme. In model-parallel, each part of the model is mapped to one node in a way that improves the overall performance. For instance, dense computations such as fully connected layers or convolutions can be mapped to GPUs, whereas embedding table lookups or sparse operations can be mapped to CPUs. This method is particularly efficient for large models that cannot fit in one node’s memory, but it requires careful mapping of operations to each individual node. In contrast, data-parallel distribution is useful for smaller models with a large number of requests or queries, where each node has a full copy of the model and can work independently of the other nodes.

5.1.1 Natural Language Processing.

One class of applications that fits this parallelization scheme is NLP. NLP is the automatic manipulation of natural language in the form of speech and text for applications ranging from speech-to-text conversion and text autocorrection to content recommendation and filtering. It is already in wide use, from automated call centers and search engines to voice assistants such as Apple Siri and Amazon Alexa. Such NLP services demand scalable solutions for their high workloads. For example, Google’s search engine handles 63,000 search queries per second, and Netflix filters more than 3,000 titles at a time using 1,300 recommendation clusters based on user preferences. Handling such numbers of queries in a timely manner requires efficient algorithms and powerful hardware [12, 45].

Many NLP applications are I/O intensive, as they use considerably large models of up to tens of gigabytes, thus requiring loading and swapping between the main memory (DRAM) and disk. In these cases, the data transfer between storage and host can grow quickly, leading to long latency and high energy consumption. The model size is taken up mostly by the embedding tables or layers that contain many parameters describing each entry. In most cases, finding the final answer to a query entails dispatching the processing to the node closest to the input data. Virtually all applications perform preprocessing, including but not limited to tokenization (breaking down text into smaller semantic units), word tagging (marking up words as nouns, verbs, adjectives, adverbs, pronouns, etc.), stemming or lemmatization (standardizing words by reducing them to their root forms), and stop-word removal (filtering out common words that add little or no unique information, such as prepositions and articles: at, to, a, the). Given these characteristics, we believe that ISP can greatly improve the performance of NLP and many applications with similar characteristics through both model-parallel and data-parallel schemes.

In the model-parallel method, CSD can use the data residing on the flash and partially process it before sending it to other nodes such as the host CPU or GPU for further processing. This approach helps those neural network applications that require preprocessing on the raw input data. It also enhances the performance of those neural networks that saturate a system’s DRAM with huge embedding tables that require high I/O intensity but relatively small computation. Such processing models have been developed in Facebook’s deep learning recommendation model (DLRM) project, where huge embedding tables exceeding tens or hundreds of gigabytes are processed on specific nodes and only the output is sent to the upper nodes for the remainder of the processing [30, 46, 48].

In the data-parallel method, each processing node is responsible for running the entire model on its local input data before outputting back to the supervising node. For this application, we deployed a data-parallel scheme to maximize the benefits of ISP.

Note that ISP offers a general solution to increasing system performance for any general application that can be run in parallel, including distributed training of DNN [32, 33]. NLP happens to be one such representative application that we use to demonstrate the advantages of ISP. We deploy several well-known NLP applications on a data server and compare the results between using CSDs and regular SSDs.

5.1.2 Query Scheduler for Workload Distribution.

To efficiently divide the processing over the host system and the CSDs with minimum overhead, we developed a scheduler that distributes applications onto multiple nodes. This scheduler is MPI-based and developed in Python. It can redirect requests or queries based on the availability of the nodes. The pseudocode of the scheduler is shown in Algorithm 2 [34].

The scheduler starts by running the tuning algorithm, which is a small instance of the application, on the CSD to determine the best batch size in terms of query-processing speed. Generally, as the batch size increases, the latency of processing each query increases; the overall processing speed increases as well, until the communication becomes the new bottleneck. Hence, we look for the smallest batch size whose processing speed is closest to the maximum. To find this batch size, we compare the speedup for each newly tested batch size, and once the improvement falls below a constant margin, we conclude the tuning; if no such batch size is found, the largest tested batch size is selected.
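The stopping rule of this tuning loop can be sketched as follows. Here `speed_of` is a stand-in for running the small application instance at a given batch size, and the 2% margin is an illustrative constant:

```python
def tune_batch_size(speed_of, candidates, margin=0.02):
    """
    Return the smallest batch size whose throughput is within `margin`
    of the next larger candidate's, i.e. where the speedup has flattened.
    Falls back to the largest candidate if speed keeps improving.
    """
    candidates = sorted(candidates)
    prev = speed_of(candidates[0])
    for bs, nxt in zip(candidates, candidates[1:]):
        speed = speed_of(nxt)
        if (speed - prev) / prev < margin:   # no meaningful speedup: stop
            return bs
        prev = speed
    return candidates[-1]
```

Choosing the smallest saturating batch size keeps per-query latency down without sacrificing throughput.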

We then run the same test on the host to find the optimal batch size for it. The key parameters for the scheduler are the two optimal batch sizes for the host and the CSDs, along with the batch ratio (BR), which is the ratio between the two batch sizes. The BR is calculated based on the difference in the processing performance in a heterogeneous system and can be used for later runs. Since the host has a more powerful CPU (Xeon) than the CSD does (ARM A53), the BR is considerably large, ranging from 20 to 30 based on the experimental results. Any ratio other than the optimal BR results in underutilization of the system. Using the BR decreases the workload of the scheduler and also increases the host’s performance due to lower scheduling overhead and larger chunks of data to process at a time.

After determining the optimal batch sizes, each node is assigned one batch of queries to process. Upon completion, the node replies to the scheduler with an ack, which also acts as a request for the next batch. The scheduler runs in a separate thread on the host and wakes up every 0.2 seconds to check whether there is a new completion message from the worker nodes. By putting the scheduler to sleep, the thread releases the host processor and thus increases the available processing capacity. Our setup uses OCFS2 as the shared-disk file system to enable both the host and the ISP to access the same shared data stored on the flash. As a result, the scheduler sends only the data indexes or addresses to the ISP engine. This method significantly reduces the communication overhead and eliminates one of the greatest bottlenecks in parallel systems; it also increases the read/write speed, as all nodes access the data at a much higher speed when communicating directly with the flash. In other words, the host’s and the ISP engine’s data access speeds are on the order of gigabytes per second, compared to megabytes per second over TCP/IP.
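The dispatch loop can be sketched synchronously as follows. In the real system the workers run asynchronously over MPI and the scheduler sleeps 0.2 s between polls; in this simplified sketch completions are immediate and the worker callables are stand-ins:

```python
def schedule(num_queries, workers):
    """
    Hand out batches of query indexes to worker nodes; a completion ack
    doubles as the request for the next batch. Only index ranges are
    dispatched: the data itself is read via the shared file system.
    `workers` is a list of (name, batch_size, process_fn) tuples.
    """
    next_idx = 0
    results = {name: [] for name, _, _ in workers}
    while next_idx < num_queries:
        for name, batch_size, process in workers:
            if next_idx >= num_queries:
                break
            batch = range(next_idx, min(next_idx + batch_size, num_queries))
            next_idx += len(batch)
            # in this synchronous sketch, the returned result is the 'ack'
            results[name].extend(process(batch))
    return results
```

With a host batch size 20 to 30 times the CSD batch size (the BR), the host naturally absorbs the bulk of the queries while every node stays busy.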

5.2 Database (MongoDB)

MongoDB is a document-oriented database management system released in 2009. It is considered part of a new generation of “NoSQL” databases [31] designed to overcome the poor scalability of relational databases, which are widely used in many modern applications. Relational databases can handle a limited amount of data efficiently, but they perform poorly under the demands of today’s data-intensive applications, including big data, data analytics, IoT, AI, machine learning, multimedia, and social media. MongoDB stores data as JSON-like documents with dynamic schemas (the format is called BSON). MongoDB focuses on four characteristics: flexibility, power, speed, and ease of use [22]. Two common challenges for databases are scaling and high availability.

5.2.1 Sharding for Scalability.

Sharding is a common way of scaling databases by partitioning a database into several chunks, called shards, each residing on a separate storage drive. Scaling can be done vertically or horizontally. The vertical scaling method uses servers with more capabilities, such as greater computing power and added memory. However, upgrading a server’s memory and computing power requires non-trivial effort and is not practical for rapid scaling. In contrast, horizontal scaling distributes big data across multiple servers and relies on parallel processing to achieve database scalability.

MongoDB can shard data across multiple drives, giving it the ability to exploit the efficiency and performance of parallel processing. It provides as many shards per server as the number of available storage drives. For example, thirty-two 8-TB CSDs in a 1U server can create up to 32 shards with 128 processing cores, effectively making a node out of each CSD while providing 256 TB of integrated capacity. Figure 11 shows a sharding scenario for a host and nine NGD CSDs.

Fig. 11.

Fig. 11. A sharding scenario for a host and N CSDs.

5.2.2 Replication for Availability.

Replica sets provide redundancy and high availability in MongoDB. High availability indicates that a system is designed for durability, redundancy, and automatic failover. The applications supported by a high-availability system can operate continuously for long periods without downtime. With multiple copies of data on different database servers, replication provides a level of fault tolerance against the loss of a single database server. If the primary node becomes unavailable, an eligible secondary node will hold an election to elect itself as the new primary. The CSD’s ISP capability provides horizontal scaling and high availability in a single machine. Figure 12 shows how two CSDs emulate a data storage server within another storage server and hence reduce the number of server deployments from three servers to only one. Creating multiple replicas using a single server requires far fewer resources for each cluster, while setting up multiple replicas per storage server maximizes redundancy.

Fig. 12.

Fig. 12. Data storage servers can be substituted by CSDs in replication mode.

5.2.3 Sharding and Replication.

Sharding and replication are “best practice” configurations used to improve performance and data security. Until now, these functions required the use of two storage servers for each replica set and one or more servers per shard; with CSDs, this is no longer the case, and one or both can be configured into a single server. Using a single storage server with multiple shards will reduce the data center footprint and overall cost while providing more performance per host and less latency while replicating. Figure 13 shows a sharding and replication scenario of three shards and two replicas for each shard based on CSDs.

Fig. 13.

Fig. 13. Sharding and replication scenario of three shards and two replicas for each shard based on CSDs.

Skip 6EXPERIMENTAL RESULTS Section

6 EXPERIMENTAL RESULTS

6.1 System Setup

To evaluate our CSD designs, we chose three data servers that support all three form factors. All servers run the Ubuntu 18.04.6 OS.

6.1.1 U.2 Form Factor.

For the U.2 form factor, we used an AIC 2U-FB201-LX server with an Intel Xeon Silver 4108 CPU with 32 GB of DRAM and 24 Newport CSDs, each with 32 TB of flash memory, making a 2U-class storage server with a total capacity of 768 TB. A second AIC server with Intel Xeon Gold 6240 CPU and 96 GB of DRAM was used to evaluate the database benchmarks.

6.1.2 M.2 Form Factor.

For the M.2 form factor, we chose a FlacheSAN1N36M-UN server equipped with the same Intel Xeon Silver 4108 CPU and 64 GB of DRAM. We put 36 Laguna CSDs, each with 8-TB capacity on the server to make a 1U class server with 288-TB storage capacity, as shown in Figure 14.

Fig. 14.

Fig. 14. A data storage server with 36 Laguna CSDs in M.2 form factor.

6.1.3 E1.S Form Factor.

For the E1.S form factor, we used an AIC FB128-LX equipped with the same Intel Xeon Silver 4108 CPU running at 2.1 GHz and 64 GB of DRAM. This server can support up to 36 E1.S drives in the front bay, 12 TB each, for a total capacity of 432 TB on a 1U class server. Table 3 shows the overall spec of the system setup for running the tests.

Table 3.
| Brand | Model | Storage Bay | CPU | DRAM | Dimensions | Max Storage Capacity |
|---|---|---|---|---|---|---|
| AIC | 2U-FB201-LX | 24 × U.2 | Intel Xeon Silver 4108 | 32 GB | 41.3” × 23.4” × 12.5” | 24 × 32 TB = 768 TB |
| AIC | 2U-FB201-LX | 24 × U.2 | Intel Xeon Gold 6240 | 96 GB | 41.3” × 23.4” × 12.5” | 24 × 32 TB = 768 TB |
| AIC | FB128-LX | 36 × E1.S | Intel Xeon Silver 4108 | 64 GB | 31.5” × 17.2” × 1.7” | 36 × 12 TB = 432 TB |
| EchoStreams | FlacheSAN1N36M-UN | 36 × M.2 | Intel Xeon Silver 4108 | 64 GB | 27.5” × 19” × 1.75” | 36 × 8 TB = 288 TB |

Table 3. Specifications of the Servers Used for the Experiments

To measure the power and energy consumption for each test, we use an HPM-100A power meter that sits between the power plug and the server and measures the power consumption of the entire system, including the power of the host processor unit, the storage systems, and peripherals such as the cooling system. Since there is no comparable storage drive on the market with similar capacity and form factor, in all tests except one, we choose the baseline test system to be the server with the same CSD drives but with the ISP engines disabled, acting solely as a storage drive.

6.2 Distributed Training

6.2.1 Stannis.

To evaluate Stannis, we used a dataset of 72,000 images as public data and 12,000 images as private data distributed over 24 Newport CSDs. We chose MobileNetV2 with 3.47 million parameters and 56 million multiply-and-accumulate (MAC) operations as our main neural network. To compare the speedup on different neural networks, we ran the same test for NASNet, InceptionV3, and SqueezeNet. The only concern in choosing a network and the batch size is the available DRAM on the systems. A large batch size on big networks can saturate the DRAM and thus stall the entire training process. The 6 GB of DRAM on Newport has proved sufficient for most of the test cases. The solution to DRAM saturation is to choose a smaller batch size. Since the processing speed converges beyond a certain batch size, this reduction in batch size would not affect the processing speed. For instance, the speed for MobileNetV2 on Newport is about 3 images per second for all batch sizes greater than 16. This happens as a result of the full utilization of the processing engine when the task becomes computation-intensive rather than communication-intensive.

Stannis first determines the optimal batch size for the host and the CSDs by running the tuning algorithm for each network. The tuning results are presented in Table 4. After tuning, it runs the main training session on different numbers of CSDs.

| Network | Param | FLOPs | MAC | Batch Size (Host/CSD) | Speed, img/sec (Host/CSD) |
|---|---|---|---|---|---|
| MobileNetV2 | 3.47M | 7.16M | 56M | 315/25 | 31.05/3.08 |
| NASNet | 5.3M | 10.74M | 564M | 325/15 | 47.31/2.80 |
| InceptionV3 | 23.83M | 47.82M | 5.72G | 370/16 | 30.80/1.85 |
| SqueezeNet | 1.25M | 2.46M | 861M | 850/50 | 219.0/16.3 |

Table 4. Parameter Tuning from Algorithm 1
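The batch-size search in the tuning step can be sketched as follows. This is a minimal illustration of the idea (keep increasing the batch size until throughput stops improving), not the actual Stannis implementation; `measure_speed` stands in for a short profiling run of the training step, and the toy profile mimics the Newport behavior described above, where speed saturates around 3 images/second beyond a batch size of 16.

```python
def tune_batch_size(measure_speed, start=8, limit=1024, tol=0.05):
    """Double the batch size until throughput (img/s) stops improving.

    measure_speed(batch) -> images/second for a short profiling run.
    Returns the smallest batch size whose speed is within `tol` of the
    next larger one, i.e., the point where throughput has converged.
    """
    batch = start
    speed = measure_speed(batch)
    while batch * 2 <= limit:
        next_speed = measure_speed(batch * 2)
        if next_speed <= speed * (1 + tol):  # no meaningful gain: converged
            return batch
        batch, speed = batch * 2, next_speed
    return batch

# Toy profile: throughput grows with batch size, then saturates at 3 img/s.
profile = lambda b: min(3.0, 0.2 * b)
print(tune_batch_size(profile))  # → 16
```

In practice the search also has to respect the DRAM limit discussed above, capping `limit` at the largest batch that fits in the node's memory.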

Figure 15 shows the processing speed for different numbers of CSDs in each network training session. All processing nodes slow down in distributed training mode because of partial stalls while the nodes synchronize their parameters. This slowdown tapers off, and each node's performance converges to a steady speed once the number of nodes grows beyond five or six devices. The relative speedup for the different networks is shown in Figure 16. The results show that smaller networks achieve better speedup than larger ones: the more parameters there are to update, the longer it takes to synchronize the nodes. Another important factor is the number of MACs. As Figure 16 shows, SqueezeNet (2.46M FLOPS) achieves less speedup than MobileNetV2 (7.16M FLOPS) because it has 15× more MACs.

Fig. 15.

Fig. 15. Experimental results for distributed training for different neural networks.

Fig. 16.

Fig. 16. Normalized results of distributed training for different neural networks.

It is difficult to compare the power consumption of our server against a similar setup with non-CSD drives, as there is no equivalent product with the same storage capacity on the market. The closest product to Newport that we could find was the 11-TB Micron MTFDHAL11TATCW-1AR1ZAB SSD. We therefore evaluate the power consumption of an AIC server with twenty-four 11-TB Micron SSDs against the same system with twenty-four 32-TB Newport CSDs. Table 5 shows the energy per processed image for MobileNetV2. Measurements show up to 69% savings in energy per processed image and 2× FLOPS per watt with 24 Newport CSDs compared to the system with no CSDs. We omit the results for the other networks, as the measurements are almost identical.

| Number of CSDs | 0 | 4 | 8 | 16 | 24 |
|---|---|---|---|---|---|
| Energy per image (J) | 13.10 | 8.30 | 6.84 | 5.05 | 4.02 |
| Energy saving (%) | 0% | 37% | 48% | 62% | 69% |
| FLOPS per watt | 5.87M | 7.05M | 8.18M | 10.37M | 12.26M |

Table 5. Energy Consumption of Distributed Training Using Stannis

6.2.2 HyperTune.

To evaluate the HyperTune algorithm, we ran a training session on three nodes (AIC 2U-FB201-LX servers) with similar processing performance, since similar performance best highlights the significance of HyperTune. We chose the MobileNetV2 neural network, with 300,000 images as the input to the system. We simulated external workloads using the Gzip compression application, which lets us occupy a desired number of cores on the processors. Without loss of generality, we also generated interrupts to one node at a time to simplify the experimental procedure: since the master node collects the interrupt reports from all nodes, it can easily detect whether a slowdown is due to a local workload or an external node.

Similar to the previous experiment, Stannis starts by finding the best batch size for each node, which is 180 for all three nodes, since they all use the same Intel Xeon processor. Then, in two separate steps on separate nodes, we set Gzip to occupy four and then six of the eight cores and monitored the change in speed. We define performance as the average number of processed images per second during training. As Figure 17 shows, the overall processing speed of the three nodes is 93.4 images/second in normal operation. As the workload increases, the speed drops to 75.6 and 53.3 images/second for 50% and 75% external workloads, respectively, and remains nearly constant for as long as the workload persists. In contrast, when HyperTune detects the interrupts from the new workload and determines that the decline in speed is not transient, it recomputes the batch size for the busy node and updates the hyperparameters: 140 for the four-core workload and 100 for the six-core workload. With the new batch sizes, the other two nodes regain their initial processing speed, which shows that the algorithm is effective in determining the new batch size. The overall processing speed declines only to 85.8 and 83.7 images/second for the 50% and 75% external workloads, respectively. In other words, HyperTune is 14% and 57% faster than the respective baselines at no cost in power or accuracy and with negligible computation overhead.
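The runtime adjustment can be sketched as follows. This is a simplified stand-in for the actual HyperTune logic, under the assumption that the master scales a slowed node's batch size in proportion to its sustained throughput drop; the throughput numbers in the example are hypothetical.

```python
def rebalance(batch, baseline_speed, recent_speeds, window=5, tol=0.1):
    """Sketch of runtime batch-size adjustment (not the actual HyperTune
    code): if a node's throughput shows a sustained (non-transient) drop,
    shrink its batch size proportionally so the faster nodes are not
    stalled waiting for it at each synchronization point."""
    recent = recent_speeds[-window:]
    observed = sum(recent) / len(recent)
    if observed < baseline_speed * (1 - tol):  # sustained slowdown
        return max(1, int(batch * observed / baseline_speed))
    return batch  # transient blip: keep the tuned batch size

# A node tuned to batch 180 whose speed drops from 31 to 24 img/s
# (hypothetical values) is reassigned a proportionally smaller batch.
print(rebalance(180, 31.0, [24.0] * 5))  # → 139
```

A short averaging window filters out transient dips, so the batch size is only recomputed for workloads that persist, matching the non-transient check described above.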

Fig. 17.

Fig. 17. Experimental results for Hypertune.

6.3 Distributed Inferencing: NLP

We have chosen three common NLP applications, namely sentiment analysis, a movie recommender, and a speech-to-text converter, to run on our server. Although the models are intentionally chosen to fit in the CSD’s internal memory, they cover a wide range of characteristics in terms of model size, input size, and input type. A summary of all three applications is presented in Table 6.

| Application | Input Type | Input Size | Model Size | Processing Time (base 100 \( \mu \)s) |
|---|---|---|---|---|
| Sentiment analysis | Text | Small | 1.5 MB | 1× (base) |
| Movie recommender | Text | Very small | 3.4 GB | 20× |
| Speech-to-text converter | Audio | Large | 50 MB | 1000× |

Table 6. Summary of the NLP Application Characteristics

6.3.1 Speech-to-Text Benchmark.

This benchmark is developed based on Vosk, an offline speech-recognition toolkit. Vosk supports 17 different languages and has multiple models, as small as 50 MB, that can be deployed on ARM-based and other lightweight devices [2]. To test our speech-to-text benchmark, we use a public-domain speech dataset named LJ [35], which consists of 13,100 short audio clips of a single speaker reading passages from seven non-fiction books. The dataset is about 24 hours long and contains 225,715 words. When running a small benchmark, the host can process 102 words/second, whereas a single CSD node can process 5.3 words/second. This ratio of around 20 depends on the nodes’ processing performance and is almost the same for all three applications, as their common bottleneck is computation. We use it as the batch size ratio in the scheduler.
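As a rough illustration of how a scheduler can use this ratio, the sketch below implements a minimal weighted round-robin dispatcher that hands out batches in proportion to each node's measured speed. This is an illustrative stand-in, not our actual scheduler implementation; the node names and speeds are taken from the single-node measurements above.

```python
def make_schedule(node_speeds, n_batches):
    """Assign batches to nodes in proportion to their measured speeds
    (e.g., the host is roughly 20x faster than one CSD for speech-to-text).
    A minimal weighted round-robin: at each step, the next batch goes to
    the node that is furthest behind its fair share."""
    total = sum(node_speeds.values())
    weights = {n: s / total for n, s in node_speeds.items()}
    assignments = {n: 0 for n in node_speeds}
    for i in range(n_batches):
        node = max(node_speeds,
                   key=lambda n: weights[n] * (i + 1) - assignments[n])
        assignments[node] += 1
    return assignments

# Host at 102 words/s, two CSD nodes at 5.3 words/s each.
print(make_schedule({"host": 102.0, "csd0": 5.3, "csd1": 5.3}, 112))
```

Because every node stays within one batch of its proportional share, no node's input queue grows faster than it can drain it, which is what keeps the host and the CSDs finishing at roughly the same time.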

Figure 18(b) shows the overall output of this benchmark in terms of the number of transcribed words per second, based on the batch size and the number of engaged CSDs. It shows that by augmenting the host with the processing power of all 36 CSDs, the output rate increases from 96 words/second to 296 words/second for a batch size of 6, a 3.1× improvement over the host alone. In addition, the single-node performance for different batch sizes, shown in Figure 18(a), indicates that the processing speed does not change much when varying the batch size. In terms of I/O transfers, the CSD engines process 68% of the input data (\( (296-96) \div 296\simeq 0.68 \)); in other words, 2.58 GB of the 3.8-GB dataset never leaves the storage units, and the only data transferred to the host is the output text, about 1.2 MB. This reduction in the number of I/Os and in data transfer size can drastically reduce power consumption, network congestion, and usage of the host’s CPU memory bandwidth. It also enhances the security and privacy of the data, as the data never leaves the storage device.

Fig. 18.

Fig. 18. Individual node’s benchmarks (a) and performance results (b) for the speech-to-text application.

6.3.2 Movie Recommender Benchmark.

Our second benchmark is a movie recommendation system based on the work of Ng [49]. The recommender creates a metadata entry for each movie, including the title, genres, director, main actors, and storyline keywords. It then uses the cosine similarity function to generate a similarity matrix to be used later for content suggestion. An extra step uses the ratings and popularity indexes to filter the top results. For each query, the target movie title is sent to the recommender function, which returns the top-10 similar movies. To train and test the application, we use the MovieLens dataset [13], which consists of 27 million ratings and 1.1 million tag applications applied to 58,000 movie titles by 280,000 users. We ran the training process once and stored the matrix on flash for later use. To simulate the queries, we made a list of all movie titles and randomly shuffled them into a new list. The results in Figure 19(a) show that, just like the speech-to-text application, the processing speed changes little (about 3%) across different batch sizes. Therefore, smaller batch sizes are preferable due to the shorter worst-case latency.
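The similarity step can be illustrated with a small bag-of-words cosine similarity over merged metadata tokens. This is a toy sketch with hypothetical movie entries, not the MovieLens pipeline or Ng's exact implementation:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(title, metadata, top_k=2):
    """Rank all other titles by cosine similarity of their metadata
    (genres, director, actors, keywords merged into one token bag)."""
    bags = {t: Counter(m.lower().split()) for t, m in metadata.items()}
    query = bags[title]
    scores = {t: cosine(query, bag) for t, bag in bags.items() if t != title}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical metadata entries, for illustration only.
movies = {
    "Heat": "crime thriller mann pacino deniro heist",
    "Ronin": "crime thriller frankenheimer deniro heist",
    "Up":   "animation family docter adventure balloons",
}
print(recommend("Heat", movies, top_k=1))  # → ['Ronin']
```

In the real system, the pairwise similarities are computed once into a matrix and stored on flash, so each query reduces to a lookup and a top-k filter over the precomputed row.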

Fig. 19.

Fig. 19. Individual node’s benchmarks (a) and performance results (b) for the movie recommendation application.

Figure 19(b) shows the results of movie recommendation queries for different configurations. With 36 drives, the system can serve 1,506 queries/second compared to the 579 queries/second of the stand-alone host, a 2.6× improvement. The measurements show that the processing speed scales linearly with the number of CSDs and that the overhead of distributing the task is almost zero. In fact, in some cases, the distributed result scales superlinearly because the ISP engine accesses data faster than the host does. In the host-only configuration, the host’s CPU never reaches 100% utilization due to other I/O bottlenecks. Increasing the number of CSD drives and further distributing the input data increases the I/O bandwidth, which shifts the host’s bottleneck to computation, pushes the CPU to higher utilization levels, and improves the host’s output. In this way, the added processing capability of the new CSDs creates a virtuous cycle.

6.3.3 Sentiment Analysis Benchmark.

The last benchmark is a tweet sentiment analysis application based on Python’s Natural Language Toolkit (NLTK). We modified this application from DigitalOcean [26] to fit our parallelization goals. It uses labeled data to train a model to detect the positivity or negativity of a tweet’s content and then uses the model to predict the sentiment of incoming tweets. The input data undergoes a series of preprocessing steps in which the words are tokenized, lemmatized, stripped of noise and meaningless words, and eventually converted to a dictionary of words ready to be fed to the model. Our test dataset consists of 1.6 million tweets [28] and is duplicated when we need a larger number of queries. We ran a single-node test to evaluate the host’s and the CSDs’ performance for different batch sizes. Unlike the other two benchmarks, performance changes considerably with the batch size.

Figure 20(a) plots the performance in terms of queries per second for different batch sizes on a logarithmic scale. It shows that both the CSD and the host achieve better performance with larger batch sizes. For instance, the CSD’s performance increases from 107 queries/second for a batch size of 2k to 364 queries/second for a batch size of 40k, a \( 3.4\times \) increase. However, the trade-off for large batch sizes is increased latency for some of the queries. Because the input is fed sequentially to the nodes, once a query is assigned to a processing node, it must wait for the prior queries on that node to finish and cannot migrate to another idle node. For example, if the first N queries are assigned to node 1 and the next M queries \( (N+1 \ldots N+M) \) are assigned to node 2, then the \( N^\text{th} \) query has to wait for all queries from 1 to \( N-1 \) to finish before it can be processed on node 1, whereas with smaller batch sizes, those queries could be assigned to other available nodes. Based on these numbers, we set the batch size ratio to \( 9496\div 364\simeq 26 \) and ran a benchmark with 8 million tweets. Figure 20(b) shows the performance for different batch sizes and numbers of CSDs. The best result is for a batch size of 40k, where the number of queries per second increases from 9,496 to 20,994, a \( 2.2\times \) improvement.
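The worst-case latency argument above reduces to simple arithmetic: with sequential assignment, the last query in a node's batch waits roughly the batch size divided by that node's service rate. A quick check with the measured CSD rates:

```python
def worst_case_latency(batch_size, node_rate):
    """With sequential assignment, the last query in a batch must wait for
    every earlier query on the same node, so its latency is roughly
    batch_size / node_rate (node_rate in queries per second)."""
    return batch_size / node_rate

# At the measured ~364 queries/s for one CSD, a 40k batch implies far
# longer worst-case waits than a 2k batch: the trade-off noted above.
print(worst_case_latency(40_000, 364.0))  # ~110 s
print(worst_case_latency(2_000, 364.0))   # ~5.5 s
```

This is why throughput-oriented deployments favor the 40k batches while latency-sensitive ones would choose the smaller batch sizes despite the lower aggregate rate.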

Fig. 20.

Fig. 20. Individual node’s benchmarks (a) and performance results (b) for the sentiment analysis application.

6.3.4 Power and Energy Analysis.

For all NLP test cases, the server in idle mode consumes 167 W with no storage drives and 405 W with 36 CSDs. Thus, each CSD consumes an average of 6.6 W, compared to 10 to 15 W in normal operation for commercial E1.S units with lower storage capacity. When we ran the benchmarks, the power consumption of the entire system rose to 482 W without enabling ISP (i.e., the CSDs acting as storage only) compared to 492 W with all 36 ISP engines running. In other words, each ISP engine consumes 0.28 W on top of the storage-only operation. Note that our datasets are small and all three benchmarks are compute-bound rather than I/O-bound, which means that most of the power is drawn by CPU activity. As a result, all three benchmarks yield the same power measurements, and only the energy per query varies by application. Figure 21 shows the energy consumption per query, or per word in the case of the speech-to-text benchmark. Figure 22 shows the normalized energy consumption to better demonstrate the energy-saving trend. Note that these measurements are averaged over the entire test session. The improvement in power consumption can be attributed to (1) the low-power processors of the ISP engines and (2) the reduction in data transfer size, as minimal raw data needs to be moved out of the storage drives. For future work, we plan to measure the energy consumption of each step to better demonstrate the importance of data transfer reduction. Finally, Table 7 summarizes the experimental data for all three applications.
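The per-unit figures above follow directly from the measured totals; recomputing them as a quick check:

```python
# Recomputing the per-unit power figures from the measured totals above.
n_csds = 36
idle_no_drives, idle_with_csds = 167.0, 405.0  # watts, idle mode
busy_isp_off, busy_isp_on = 482.0, 492.0       # watts, under benchmark load

per_csd_idle = (idle_with_csds - idle_no_drives) / n_csds
per_isp_engine = (busy_isp_on - busy_isp_off) / n_csds

print(round(per_csd_idle, 1))    # → 6.6 (watts per CSD)
print(round(per_isp_engine, 2))  # → 0.28 (watts per active ISP engine)
```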

Fig. 21.

Fig. 21. Energy consumption per query for different configurations of the three applications.

Fig. 22.

Fig. 22. Energy per query, normalized to a host-only setup.

| Test Case | Speech to Text | Recommender | Sentiment Analysis |
|---|---|---|---|
| Accuracy | Same | Same | Same |
| Max speedup | 3.1× | 2.8× | 2.2× |
| Energy per query (host) (mJ) | 5,021 | 832 | 51 |
| Energy per query (w/CSD) (mJ) | 1,662 | 327 | 23 |
| Energy saving per query (%) | 67% | 61% | 54% |
| Data processed on host (%) | 32% | 36% | 44% |
| Data processed in CSDs (%) | 68% | 64% | 56% |

Table 7. Summary of the Experimental Results

6.4 Database: MongoDB

Database benchmarks tend to face different bottlenecks and can be tricky. To cover these bottlenecks across different configurations, we investigate two benchmarking scenarios: a direct configuration and a sharded cluster configuration. For database benchmarking and data generation, we used the Yahoo! Cloud Serving Benchmark (YCSB) [23] suite, which measures the performance of both NoSQL and SQL database management systems with simple database operations on synthetic data. Since YCSB requires considerable resources and power to generate data requests, we use a separate system (HP ProLiant DL380p Gen8) solely to generate the benchmarks and queries. All systems are connected through a regular 1-Gbps router. Note that we use the same YCSB parameters in all cases (record count = 100k, operation count = 100k, read operations = 0.5, write operations = 0.5).
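The YCSB parameters listed above correspond to a core-workload properties file along these lines. This is an illustrative sketch: the property names follow YCSB's core workload format, and the exact file used in our runs may differ.

```properties
# 100k records, 100k operations, 50/50 read/update mix,
# matching the parameters used in all database tests.
workload=site.ycsb.workloads.CoreWorkload
recordcount=100000
operationcount=100000
readproportion=0.5
updateproportion=0.5
```

A run then invokes the MongoDB binding against this file, for example `./bin/ycsb run mongodb -P <workload_file> -p mongodb.url=mongodb://<host>:<port>/ycsb`, with the URL pointing at either the host's agent or a CSD's agent.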

6.4.1 Direct Configuration.

Figure 23 depicts the direct mode configuration, in which each CSD acts as a stand-alone database and queries are sent directly to each CSD. As seen earlier, ISP provides the opportunity to run multiple databases on a single data server: each database runs an independent MongoDB agent, either on a CSD’s ISP engine or on the host. Since a single database could not saturate the storage nodes, we placed two databases on each CSD, one managed by the ISP engine and the other by the host.

Fig. 23.

Fig. 23. Direct mode configuration in MongoDB.

Figure 24 shows the overall throughput in terms of operations per second. As can be seen, the host’s performance increases sub-linearly but monotonically, whereas the utilization of each CSD degrades as the number of CSDs increases.

Fig. 24.

Fig. 24. Experimental results for MongoDB in direct mode.

Although our experimental server maxed out at 24 CSDs due to inherent technical limitations of the storage server design, utilization can go higher by deploying more CSDs. However, based on the measurements, we estimate that scaling is upper-bounded at around 56 drives.

Compared to a data server with the same configurations but with regular SSDs, the overall throughput of the system with 24 CSDs is higher by 2.78×.

To investigate the power and energy consumption, we measured the overall power consumption of the server with 20 CSDs in idle, host-only, CSD-only, and hybrid modes. As Figure 25 shows, when running the application in the host-only setup with 20 CSDs with inactive ISP engines, the total power consumption rises from 354 W to 456 W, yielding an average energy consumption of 798 µJ per operation. In the CSD-only case, where the host processor is inactive, the increase in power consumption is only 29 W, which yields an average of 416 µJ per operation. When both the CSDs and the host are active, the overall power rises to 421 W, and the average energy consumption drops to as low as 339 µJ per operation. These results show that deploying CSDs for I/O-intensive applications can reduce energy consumption by up to 57% while increasing performance by up to 3.4×.

Fig. 25.

Fig. 25. Power analysis of MongoDB in direct mode.

6.4.2 Sharded Cluster Configuration.

In a sharded cluster, each database is sharded over several storage nodes, or CSDs in our case. To handle data forwarding, a set of MongoDB routers, known as mongos, run on the host system and redirect queries to the appropriate storage node. Figure 26 shows the system configuration for the sharded cluster test. Each router connects to a MongoDB agent that runs either on the host or on one of the CSDs. Note that just like the previous test, two external systems run YCSB and generate the queries. Figure 27 shows the throughput of the sharded cluster with 24 CSDs in host-only, CSD-only, and hybrid modes. Since the router application is lightweight and its overhead is negligible, the host’s performance remains the same after engaging the CSDs. Each CSD handles an average of 765 operations per second. As a result, our data storage server with 24 CSDs can increase performance by up to 1.7× compared to the same storage server with regular SSDs.

Fig. 26.

Fig. 26. Sharded cluster configuration in MongoDB.
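The sharded-cluster wiring shown in Figure 26 can be sketched with MongoDB's standard shell commands. The hostnames below are illustrative, and the sketch assumes a config-server replica set and the mongod agents (on the host and inside the CSDs) are already running:

```javascript
// Issued from a mongos router on the host; hostnames are illustrative.
sh.addShard("host-agent:27018");   // MongoDB agent running on the host
sh.addShard("csd-0:27018");        // agents running inside the CSDs
sh.addShard("csd-1:27018");
sh.enableSharding("ycsb");         // shard the benchmark database
sh.shardCollection("ycsb.usertable", { _id: "hashed" });  // spread records
```

Hashed sharding on `_id` spreads YCSB's `usertable` records evenly across the host and CSD shards, so each mongos router forwards a query only to the shard that owns the record.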

Fig. 27.

Fig. 27. Experimental results for MongoDB in sharded cluster mode.

Another observation is the performance drop on each CSD in sharded cluster mode compared to direct mode. Although sharding enables handling much larger datasets in a distributed fashion, it can degrade performance because the routers’ operations add latency to the system. We also investigated a sharded cluster with replication, where each data item is replicated on multiple CSDs to provide resiliency. In this scenario, the system performance is the same as in the regular sharded cluster but with less overall capacity, as each data item is stored on several CSDs. For this reason, we do not include those results, to avoid redundancy with the previous setup.


7 CONCLUSION

CSDs are a promising building block for servers that run the next-generation data-dominated computing applications. They remove the bottleneck of moving huge amounts of data, resulting in significant gains in processing speed and reduction in power. However, hardware and software must work together to achieve the potential benefits.

Our hardware contribution is a new integrated controller-processor ASIC that enables the SSD to run code in storage with minimal overhead. It runs conventional Linux executables without modification and is more practical than FPGA solutions that require RTL design and reconfiguration, which are difficult to integrate in an OS-based, multitasking server environment. Our ASIC also works well in the storage system environment, where GPU solutions would not work due to the high power demand.

To realize the full potential, the software must identify and overcome synchronization and other limits on parallelization. We proposed Stannis to minimize the synchronization stalling by equalizing the workload distribution for neural network training applications based on pre-runtime test runs, and our HyperTune mechanism adapts the hyperparameters to keep the workload balanced at runtime. We also developed an online scheduler for distributing inferencing workloads, as represented by several public NLP applications. In the case of databases, the existing sharding structure already implemented in MongoDB can readily take advantage of the CSD organization and benefit from both the power and performance advantages without modification, as we have demonstrated in our experiments.

Directions for future work include methods to handle data-aware and hardware-aware workload distribution as well as their evaluation. First, data-aware distribution will enable better utilization of CSDs by exploiting not only temporal locality but also spatial locality of data, by classifying data requests and queries into categorical groups and redirecting them to associated nodes. Another direction will be to explore hosts with different performance characteristics, including the use of FPGAs and GPUs, in achieving the most streamlined configuration in conjunction with our CSD setups. Finally, a more in-depth comparison with other CSD implementations can better demonstrate the merits of each design.

REFERENCES

[1] ScaleFlux. n.d. Home Page. Retrieved April 28, 2022 from https://www.scaleflux.com/.
[2] Alpha Cephei. n.d. Vosk. Retrieved April 28, 2022 from https://alphacephei.com/vosk/.
[3] SNIA. n.d. Home Page. Retrieved April 28, 2022 from https://www.snia.org/.
[4] OpenSSD Project. n.d. Cosmos OpenSSD Platform. Retrieved April 28, 2022 from http://www.openssd-project.org/wiki/Cosmos_OpenSSD_Platform.
[5] OpenSSD Project. n.d. Jasmine OpenSSD Platform. Retrieved April 28, 2022 from http://www.openssd-project.org/wiki/Jasmine_OpenSSD_Platform.
[6] NVM Express. n.d. Non-Volatile Memory Express Project Web Page. Retrieved June 5, 2019 from https://nvmexpress.org.
[7] Oracle. n.d. Oracle Cluster Filesystem Second Version Web Page. Retrieved February 5, 2019 from https://oss.oracle.com/projects/ocfs2.
[8] TechTarget. n.d. Retrieved June 5, 2019 from https://searchstorage.techtarget.com/definition/serial-attached-SCSI.
[9] SATA. n.d. SATA Ecosystem Web Page. Retrieved June 5, 2019 from https://sata-io.org.
[10] Xilinx. n.d. MicroBlaze Processor Reference Guide. Retrieved June 19, 2019 from https://www.xilinx.com/support/documentation-navigation/design-hubs/dh0020-microblaze-hub.html.
[11] InsideBIGDATA. 2017. How Data Is Stored and What We Do With It. Retrieved April 28, 2022 from https://insidebigdata.com/2017/11/12/how-data-is-stored-and-what-we-do-with-it/.
[12] SERPwatch. 2020. How Many Google Searches Per Day? Retrieved April 28, 2022 from https://serpwatch.io/blog/how-many-google-searches-per-day/.
[13] GroupLens. 2021. MovieLens. Retrieved April 28, 2022 from https://grouplens.org/datasets/movielens/.
[14] Samsung. 2021. SmartSSD. Retrieved April 28, 2022 from https://samsungsemiconductor-us.com/smartssd/.
[15] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 265–283.
[16] Thomas Alsop. 2020. Global Enterprise Byte Shipment Share: HDD and SSD 2010–2025. Retrieved April 28, 2022 from http://www.statista.com/statistics/815308/worldwide-enterprise-byte-shipment-share-hdd-ssd/.
[17] Amara Amara, Frédéric Amiel, and Thomas Ea. 2006. FPGA vs. ASIC for low power applications. Microelectronics Journal 37, 8 (2006), 669–677.
[18] Reza Asadi and Amelia C. Regan. 2020. A spatio-temporal decomposition based deep neural network for time series forecasting. Applied Soft Computing 87 (2020), 105963.
[19] StorageReview. 2019. What Is EDSFF Form Factor? Retrieved April 28, 2022 from https://www.storagereview.com/what-is-edsff-form-factor.
[20] H. Bahn and K. Cho. 2020. Implications of NVM based storage on memory subsystem management. Applied Sciences 10 (2020), 999.
[21] Dave Barker. n.d. Introducing the FPGA Mezzanine Card: Emerging VITA 57 (FMC) Standard Brings Modularity to FPGA Designs. VITA Technologies. Retrieved June 15, 2019 from http://vita.mil-embedded.com/articles/introducing-fpga-brings-modularity-fpga-designs.
[22] Alexandru Boicea, Florin Radulescu, and Laura Ioana Agapin. 2012. MongoDB vs Oracle—Database comparison. In Proceedings of the 2012 3rd International Conference on Emerging Intelligent Data and Web Technologies. IEEE, Los Alamitos, CA, 330–335.
[23] GitHub. n.d. brianfrankcooper/YCSB: Yahoo! Cloud Serving Benchmark. Retrieved April 28, 2022 from https://github.com/brianfrankcooper/YCSB.
[24] Keith Chapman, Mehdi Nik, Behnam Robatmili, Shahrzad Mirkhani, and Maysam Lavasani. 2019. Computational storage for big data analytics. In Proceedings of the 10th International Workshop on Accelerating Analytics and Data Management Systems (ADMS’19).
[25] Michael Cornwell. 2012. Anatomy of a solid-state drive. Communications of the ACM 55, 12 (2012), 59–63.
[26] Shaumik Daityari. 2021. How to Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK). Retrieved April 28, 2022 from https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk.
[27] Jaeyoung Do, Victor C. Ferreira, Hossein Bobarshad, Mahdi Torabzadehkashi, Siavash Rezaei, Ali Heydarigorji, Diego Souza, et al. 2020. Cost-effective, energy-efficient, and scalable storage computing for large-scale AI applications. ACM Transactions on Storage 16, 4 (Oct. 2020), Article 21, 37 pages.
[28] Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. Processing 150 (2009), 1–6.
[29] Boncheol Gu, Andre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, et al. 2016. Biscuit: A framework for near-data processing of big data workloads. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 153–165.
[30] Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, et al. 2020. The architectural implications of Facebook’s DNN-based personalized recommendation. arXiv:1906.03109 [cs.DC] (2020).
[31] Cornelia Győrödi, Robert Győrödi, George Pecherle, and Andrada Olah. 2015. A comparative study: MongoDB vs. MySQL. In Proceedings of the 2015 13th International Conference on Engineering of Modern Electric Systems (EMES’15). IEEE, Los Alamitos, CA, 1–6.
[32] Ali HeydariGorji, Siavash Rezaei, Mahdi Torabzadehkashi, Hossein Bobarshad, Vladimir Alves, and Pai H. Chou. 2020. HyperTune: Dynamic hyperparameter tuning for efficient distribution of DNN training over heterogeneous systems. In Proceedings of the 2020 IEEE/ACM International Conference on Computer Aided Design (ICCAD’20). 1–8.
[33] Ali HeydariGorji, Mahdi Torabzadehkashi, Siavash Rezaei, Hossein Bobarshad, Vladimir Alves, and Pai H. Chou. 2020. Stannis: Low-power acceleration of DNN training using computational storage devices. In Proceedings of the 2020 57th ACM/IEEE Design Automation Conference (DAC’20). 1–6.
[34] Ali HeydariGorji, Mahdi Torabzadehkashi, Siavash Rezaei, Hossein Bobarshad, Vladimir Alves, and Pai H. Chou. 2021. In-storage processing of I/O intensive applications on computational storage drives. arXiv preprint arXiv:2112.12415 (2021).
[35] Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. Retrieved April 28, 2022 from https://keithito.com/LJ-Speech-Dataset/.
[36] Jeff Janukowicz. 2018. How New QLC SSDs Will Change the Storage Landscape. Technical Report. Micron, Framingham, MA. https://www.micron.com/-/media/client/global/documents/products/white-paper/how_new_qlc_ssds_will_change_the_storage_landscape.pdf?la=en.
[37] Yanqin Jin, Hung-Wei Tseng, Yannis Papakonstantinou, and Steven Swanson. 2017. KAML: A flexible, high-performance key-value SSD. In Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA’17). 373–384.
[38] Sang-Woo Jun, Ming Liu, Sungjin Lee, Jamey Hicks, John Ankcorn, Myron King, Shuotao Xu, and Arvind. 2015. BlueDBM: An appliance for big data analytics. SIGARCH Computer Architecture News 43, 3S (June 2015), 1–13.
[39] Yangwook Kang, Yang-Suk Kee, Ethan L. Miller, and Chanik Park. 2013. Enabling cost-effective data processing with smart SSD. In Proceedings of the 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST’13). 1–12.
[40] Sungchan Kim, Hyunok Oh, Chanik Park, Sangyeun Cho, Sang-Won Lee, and Bongki Moon. 2016. In-storage processing of database scans and joins. Information Sciences 327, C (Jan. 2016), 183–200.
[41] Ian Kuon and Jonathan Rose. 2007. Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26, 2 (2007), 203–215.
[42] Joo Hwan Lee, Hui Zhang, Veronica Lagrange, Praveen Krishnamoorthy, Xiaodong Zhao, and Yang Seok Ki. 2020. SmartSSD: FPGA accelerated near-storage data analytics on SSD. IEEE Computer Architecture Letters 19, 2 (2020), 110–113.
[43] Young-Sik Lee, Luis Cavazos Quero, Youngjae Lee, Jin-Soo Kim, and Seungryoul Maeng. 2014. Accelerating external sorting via on-the-fly data merge in active SSDs. In Proceedings of the 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’14).
[44] David Mayhew and Venkata Krishnan. 2003. PCI express and advanced switching: Evolutionary path to building next generation interconnects. In Proceedings of the 2003 11th Symposium on High Performance Interconnects. IEEE, Los Alamitos, CA, 21–29.
[45] Rachel Meltzer. n.d. How Netflix Uses Machine Learning and Algorithms. Retrieved April 28, 2022 from https://www.lighthouselabs.ca/en/blog/how-netflix-uses-data-to-optimize-their-product.
[46] Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, et al. 2021. High-performance, distributed training of large-scale deep learning recommendation models. arXiv:2104.05158 [cs.DC] (2021).
[47] Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, and Kushagra Vaid. 2016. SSD failures in datacenters: What? When? And why? In Proceedings of the 9th ACM International Systems and Storage Conference. ACM, New York, NY.
[48] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, et al. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv:1906.00091 [cs.IR] (2021).
[49] James Ng. 2019. Content-Based Recommender Using Natural Language Processing (NLP). Retrieved April 28, 2022 from https://towardsdatascience.com/content-based-recommender-using-natural-language-processing-nlp-159d0925a649.
[50] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703 [cs.LG] (2019).
[51] Seyyed Ahmad Razavi and Morteza Saheb Zamani. 2013. Improving bitstream compression by modifying FPGA architecture. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 167–170.
[52] Siavash Rezaei. 2020. Field Programmable Gate Array (FPGA) Accelerator Sharing. Ph.D. Dissertation. University of California, Irvine.
  53. [53] Rezaei Siavash, Bozorgzadeh Eli, and Kim Kanghee. 2019. UltraShare: FPGA-based dynamic accelerator sharing and allocation. In Proceedings of the 2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig’19). IEEE, Los Alamitos, CA, 15.Google ScholarGoogle ScholarCross RefCross Ref
  54. [54] Rezaei Siavash, Kim Kanghee, and Bozorgzadeh Eli. 2018. Scalable multi-queue data transfer scheme for FPGA-based multi-accelerators. In Proceedings of the 2018 IEEE 36th International Conference on Computer Design (ICCD’18). IEEE, Los Alamitos, CA, 374380.Google ScholarGoogle ScholarCross RefCross Ref
  55. [55] Sadeghi Misha, Razavi Seyyed Ahmad, and Zamani Morteza Saheb. 2019. Reducing reconfiguration time in FPGAs. In Proceedings of the 2019 27th Iranian Conference on Electrical Engineering (ICEE’19). IEEE, Los Alamitos, CA, 18441848.Google ScholarGoogle ScholarCross RefCross Ref
  56. [56] Salamat Sahand, Aboutalebi Armin Haj, Khaleghi Behnam, Lee Joo Hwan, Ki Yang Seok, and Rosing Tajana. 2021. NASCENT: Near-storage acceleration of database sort on SmartSSD. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’21). ACM, New York, NY, 262272. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. [57] Sergeev Alexander and Balso Mike Del. 2018. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).Google ScholarGoogle Scholar
  58. [58] The Theano Development Team: Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frederic Bastien, et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. arxiv:1605.02688 [cs.SC] (2016).Google ScholarGoogle Scholar
  59. [59] Tiwari Devesh, Boboila Simona, Vazhkudai Sudharshan, Kim Youngjae, Ma Xiaosong, Desnoyers Peter, and Solihin Yan. 2013. Active Flash: Towards energy-efficient, in-situ data analytics on extreme-scale machines. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13). 119132. https://www.usenix.org/conference/fast13/technical-sessions/presentation/tiwari.Google ScholarGoogle Scholar
  60. [60] Torabzadehkashi Mahdi. 2019. SoC-Based In-Storage Processing: Bringing Flexibility and Efficiency to Near-Data Processing. University of California Irvine.Google ScholarGoogle Scholar
  61. [61] Torabzadehkashi Mahdi, Heydarigorji Ali, Rezaei Siavash, Bobarshad Hosein, Alves Vladimir, and Bagherzadeh Nader. 2019. Accelerating HPC applications using computational storage devices. In Proceedings of the 2019 IEEE 21st International Conference on High Performance Computing and Communications, the IEEE 17th International Conference on Smart City, and the IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS’19). IEEE, Los Alamitos, CA, 18781885.Google ScholarGoogle Scholar
  62. [62] Torabzadehkashi Mahdi, Rezaei Siavash, Alves Vladimir, and Bagherzadeh Nader. 2018. CompStor: An in-storage computation platform for scalable distributed processing. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’18). IEEE, Los Alamitos, CA, 12601267.Google ScholarGoogle ScholarCross RefCross Ref
  63. [63] Torabzadehkashi Mahdi, Rezaei Siavash, Heydarigorji Ali, Bobarshad Hosein, Alves Vladimir, and Bagherzadeh Nader. 2019. Catalina: In-storage processing acceleration for scalable big data analytics. In Proceedings of the 2019 27th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP’19). IEEE, Los Alamitos, CA, 430437.Google ScholarGoogle ScholarCross RefCross Ref
  64. [64] Torabzadehkashi Mahdi, Rezaei Siavash, HeydariGorji Ali, Bobarshad Hosein, Alves Vladimir, and Bagherzadeh Nader. 2019. Computational storage: An efficient and scalable platform for big data and HPC applications. Journal of Big Data 6, 1 (2019), 129.Google ScholarGoogle ScholarCross RefCross Ref
  65. [65] TORRES.AI Jordi. 2021. Scalable Deep Learning on Parallel and Distributed Infrastructures. Retrieved April 28, 2022 from https://towardsdatascience.com/scalable-deep-learning-on-parallel-and-distributed-infrastructures-e5fb4a956bef.Google ScholarGoogle Scholar
  66. [66] Wilkening Mark, Gupta Udit, Hsia Samuel, Trippel Caroline, Wu Carole-Jean, Brooks David, and Wei Gu-Yeon. 2021. RecSSD: Near Data Processing for Solid State Drive Based Recommendation Inference. ACM, New York, NY, 717729. Google ScholarGoogle ScholarDigital LibraryDigital Library


Published in ACM Transactions on Embedded Computing Systems, Volume 21, Issue 6 (November 2022), 498 pages. ISSN: 1539-9087. EISSN: 1558-3465. DOI: 10.1145/3561948. Editor: Tulika Mitra.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication history: Received 16 July 2021; revised 15 March 2022; accepted 23 March 2022; online AM 20 April 2022; published 18 October 2022.

Qualifiers: refereed research article.