An In-Storage Processing Architecture with 3D NAND Heterogeneous Integration for Spectra Open Modification Search

Spectra open modification search (OMS) is a critical step in mass spectrometry (MS) analysis and proteomics for identifying the peptides underlying protein samples. However, large-scale spectra OMS is a data-intensive workload that takes hours to days. In this work, we propose a reconfigurable architecture based on 3D NAND in-storage processing (ISP) with heterogeneous integration to accelerate mass spectrum data processing. We present two types of encoding designs for optimization. We then design scalable and reconfigurable 3D NAND ISP tiles to further optimize the performance. The experiments show that the 3D NAND ISP architecture with a proper hardware configuration achieves 14.3× to 24.2× speedup over the GPU baseline [10]. Energy efficiency is also improved by four orders of magnitude because massive data movement is avoided. The proposed design is an energy-efficient and high-performance ISP solution for emerging large-scale spectra OMS.


INTRODUCTION
Proteomics is key to understanding the molecular processes of proteins, which are responsible for a variety of activities in cell life. Proteomics scientists use a powerful technique called mass spectrometry (MS) to recognize and measure the peptides and proteins in biological samples. Figure 1 illustrates the standard flow for identifying the peptide sequences contained in a protein digest. First, tandem mass spectrometry (MS/MS) produces a large amount of unknown query spectra. Second, the key step is to compare the experimental query spectra against a pre-built spectral reference library of known peptides, using the spectral library searching method [12]. The algorithmic challenge of spectral library search is that a large share of the acquired query spectra cannot be identified directly using popular similarity metrics alone (such as cosine similarity or inner product) [5]. This is due to the mismatch between experimental and reference spectra. The analyzed protein samples may carry multiple post-translational modifications (PTMs) that modify the inherent mass and the MS/MS fragmentation patterns, whereas the reference spectra in pre-built spectral libraries are mainly unmodified peptides. A more advanced search algorithm is therefore needed to address PTMs. Open modification searching (OMS) is a promising solution for accurately identifying modified spectra [14]. Unlike the standard spectral library search, which only compares query spectra to references with a similar precursor mass, OMS accepts reference spectra from a much wider range, so that modified query spectra are searched against their unmodified reference variants with different precursor masses. Spectra OMS enables the study of more complex protein interactions in virus-host systems and the proteomics analysis of non-model organisms [8]. However, OMS workloads create three major challenges in terms of algorithm and data analysis acceleration.
1. OMS is a memory-intensive workload that exhibits very low searching speed and efficiency even with careful optimizations [2], since OMS drastically increases the search space.
2. The increasingly available spectra data in public databases [15] promote research development, but the massive spectral libraries created from repository-scale MS data [25] further increase the OMS time from hours to days. For example, UCSD MassIVE contains 5.6 billion spectra, which corresponds to 448TB in size [25].
Several tools have been presented to shorten the OMS time [3,10,13]. These tools use advanced nearest-neighbor search algorithms with optimized metrics to accelerate OMS. Among the state-of-the-art accelerators, HOMS-TC [10], aided by hyperdimensional computing (HD), demonstrates the best runtime performance and memory efficiency because it leverages HD to reduce the required operations to hardware-friendly Boolean operations while maintaining good search quality. Although HD-based HOMS-TC significantly speeds up OMS workloads, it still incurs a large memory footprint due to the memory-intensive HD primitives. As shown in Figure 2, HD encoding and database search dominate the overall runtime even on an NVIDIA RTX 4090 GPU with 1TB/s memory bandwidth.
In-storage processing (ISP) [17,21,22] is considered an effective solution to extend the available bandwidth and reduce data movement cost. Meanwhile, high-density 3D NAND Flash provides a cost-effective way to store spectra data of GB to TB sizes. In this work, we combine heterogeneous integration techniques [18] with 3D NAND ISP to develop an architecture that accelerates the HD-based OMS workloads of HOMS-TC [10] with high data parallelism and energy efficiency. Several tiles are required to accommodate the entire reference dataset, which in turn makes the 3D NAND ISP architecture reconfigurable. We simulate the hardware performance with industry-grade 3D NAND parameters [21] and implement the encoding and search circuits in a 7nm FinFET technology node with the ASAP7 PDK [6]. The 3D NAND peripheral circuits are extracted from NeuroSim [24]. Our in-house simulator shows that the 3D NAND ISP achieves 14.3× to 24.2× speedup over HOMS-TC. The energy efficiency is also improved by four orders of magnitude because massive data movement is avoided.

BACKGROUND ON MS AND ISP

HD-based Spectra Open Modification Search
Spectra data contain the mass-to-charge ratio (m/z) and ion signal intensity of proteins. We call them peak indices and peak intensities, respectively. Hyperdimensional computing-based (HD-based) OMS improves the efficiency of the conventional spectra OMS pipeline (Figure 1) in two aspects: 1. encoding and 2. Hamming similarity search. In this work, we use an HD-based OMS similar to that of [9,10] as the OMS algorithm.
HD Encoding for Spectra. Figure 3 shows the encoding step that transforms the raw spectra data into hyperdimensional space, where the spectra are expressed as high-dimensional binary vectors called hypervectors (HVs). To model the peak shifts and intensity changes due to PTMs, HD encoding [9,10] considers both spatial locality (for peak shift) and value locality (for peak intensity change). Each index in the spectrum vector is assigned an associated position HV such that $\mathbf{F}_i$ corresponds to index $i$, and $\mathbf{F} \in \{\mathbf{F}_1, \mathbf{F}_2, \ldots, \mathbf{F}_f\}$, where $f$ denotes the spectrum vector dimension. Likewise, level HVs $\mathbf{L}$ are used to model the intensity value at each index. The intensity values are quantized to $Q$ levels, and $\mathbf{L}_j$ is assigned to the associated level $j$, where $j \in [0, Q)$.
With the two sets of encoding HVs, namely $\mathbf{F}$ and $\mathbf{L}$, the preprocessed spectrum vector with multiple pairs of peak indices and intensities is encoded into an HV $\mathbf{I}$ as:
$$\mathbf{I} = \sum_{(i,j) \in P} \mathbf{F}_i \odot \mathbf{L}_j$$
where $P$ denotes all pairs $(i, j)$ of peak index and quantized intensity level, and $\odot$ represents element-wise multiplication. Note that the resulting aggregated HV $\mathbf{I}$ is non-binary. We binarize it for better computation and memory efficiency.
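The encoding step above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the number of spectrum bins (`NUM_BINS`), the number of levels (`NUM_LEVELS`), the use of fully random position and level HVs (real HD encoders correlate neighboring HVs to capture locality), and the majority-vote binarization are all assumptions made for illustration. Binding is written as XOR, the binary-domain counterpart of element-wise multiplication.

```python
import random

D = 8192          # hypervector dimension (the value used in the evaluation)
NUM_BINS = 64     # hypothetical number of spectrum vector indices
NUM_LEVELS = 16   # hypothetical number of quantized intensity levels

rng = random.Random(0)


def random_hv(d=D):
    """Return a random binary hypervector as a list of 0/1 bits."""
    return [rng.getrandbits(1) for _ in range(d)]


# Assumption: independent random HVs; locality-preserving HVs are omitted.
position_hvs = [random_hv() for _ in range(NUM_BINS)]
level_hvs = [random_hv() for _ in range(NUM_LEVELS)]


def encode(peaks):
    """Encode a spectrum given as a list of (index i, level j) pairs.

    Each pair is bound with XOR, the bound HVs are summed per dimension,
    and the sum is binarized by majority vote (ties map to 0).
    """
    counts = [0] * D
    for i, j in peaks:
        f, l = position_hvs[i], level_hvs[j]
        for k in range(D):
            counts[k] += f[k] ^ l[k]
    threshold = len(peaks) / 2
    return [1 if c > threshold else 0 for c in counts]


query_hv = encode([(3, 5), (10, 2), (40, 15)])
```

With a single peak the majority vote reduces to the bound HV itself, which gives a quick sanity check of the binding step.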
Hamming Similarity Search. After the encoding step, HD-based OMS uses Hamming similarity as the search metric to identify the reference peptides, in HV format, that best match the query HV. The search step therefore computes the Hamming similarity between query and reference HVs. Each spectrum has its own spectrum charge (+2, +3, ...) and precursor m/z value. In addition to Hamming similarity, the matched reference HVs must satisfy constraints on the spectrum charge and precursor m/z. The final search results satisfy both: (1) having the same spectrum charge as the query and (2) falling within the valid range of precursor m/z difference between query and reference. We apply cascade search [11] to reduce the misidentification rate: a narrow precursor m/z tolerance is first used for the standard search and FDR filtration is applied (step 1 in Figure 3(b)). In the second phase, the remaining unidentified spectra are searched using a larger precursor m/z tolerance (step 2 in Figure 3(b)).
The advantage of HD-based OMS lies in its binary HV representation, in contrast to the high-precision formats of existing OMS tools [3,13]; it requires only simple Hamming similarity operations during OMS. The simplified data format and computations dramatically reduce the circuit complexity of an ISP implementation.
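The constrained Hamming similarity search and the two-phase cascade described above can be sketched as follows. This is an illustrative sketch, not HOMS-TC code: hypervectors are packed into Python integers, the precursor tolerance is expressed as an absolute m/z difference, and the `accept` callback is a hypothetical stand-in for FDR filtration, which in practice operates on the whole batch of matches.

```python
from dataclasses import dataclass

D = 8192  # hypervector dimension


@dataclass
class Spectrum:
    hv: int              # binary hypervector packed into a Python int
    charge: int          # spectrum charge (+2, +3, ...)
    precursor_mz: float  # precursor m/z value


def hamming_similarity(a, b, d=D):
    """Number of matching bit positions between two d-bit hypervectors."""
    return d - bin(a ^ b).count("1")


def search(query, references, tol_mz):
    """Best (index, similarity) among references with the same charge and a
    precursor m/z difference within tol_mz; None if no reference qualifies."""
    best = None
    for idx, ref in enumerate(references):
        if ref.charge != query.charge:
            continue
        if abs(ref.precursor_mz - query.precursor_mz) > tol_mz:
            continue
        sim = hamming_similarity(query.hv, ref.hv)
        if best is None or sim > best[1]:
            best = (idx, sim)
    return best


def cascade_search(query, references, narrow_tol, wide_tol, accept):
    """Phase 1: narrow tolerance (standard search). Phase 2: wide tolerance
    (open search) for queries that phase 1 left unidentified."""
    hit = search(query, references, narrow_tol)
    if hit is not None and accept(hit):
        return ("standard", hit)
    hit = search(query, references, wide_tol)
    if hit is not None and accept(hit):
        return ("open", hit)
    return None
```

A query whose best narrow-window match fails the acceptance test falls through to the open search, where a reference with a shifted precursor mass can still be matched.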

3D NAND In-Storage Processing (ISP)
Large datasets beyond several GB in scale often require Solid State Drives (SSDs) to accommodate the entire dataset. While SSDs offer high read throughput, accessing the entire dataset can still incur significant latency and energy consumption. To address this issue, in-storage processing (ISP) has been proposed as a promising paradigm [17,21,22] to eliminate the overhead caused by data movement. Figure 4 illustrates the configuration of 3D NAND ISP. In this design, an additional set of Analog-to-Digital Converters (ADCs) is attached to the separated source line (SL) of each block in the mature 3D NAND Flash configuration. The weight matrix or the reference data is stored in the 3D NAND Flash, while the input vector or the query is applied to the 3D NAND as bit line (BL) voltages. The result of either the vector-matrix multiplication of the input vector and the weight matrix or the dot product of the reference data and the query equals the summed currents along the SLs. The ADC then converts this current into the digital domain for post-ISP processing. Without the need for GB-level data movement, 3D NAND ISP reduces overall latency and lowers energy consumption. As a result, in-storage processing holds great potential for optimizing the performance of systems that deal with large datasets on SSDs.
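The dot-product view of the SL current can be illustrated with a small behavioral sketch. It is a purely functional model under stated assumptions, not the paper's circuit model: each conducting cell contributes one unit of current, the ADC is an ideal uniform quantizer, and the decomposition of Hamming similarity into two dot products (one on the stored bits, one on their complements) is a mathematical identity shown for illustration; the text does not state that the hardware maps the search this way.

```python
def sl_dot_product(stored_bits, input_bits):
    """Idealized SL current: count of cells where both stored bit and BL input are 1."""
    return sum(s & x for s, x in zip(stored_bits, input_bits))


def adc_quantize(value, full_scale, bits=6):
    """Ideal uniform ADC: map value in [0, full_scale] to a code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    code = round(value / full_scale * levels)
    return max(0, min(levels, code))


def hamming_similarity_via_dots(ref_bits, query_bits):
    """Matching positions = (ref . query) + (~ref . ~query) for binary vectors."""
    inv_ref = [1 - b for b in ref_bits]
    inv_query = [1 - b for b in query_bits]
    return sl_dot_product(ref_bits, query_bits) + sl_dot_product(inv_ref, inv_query)


ref = [1, 0, 1, 1]
query = [1, 1, 0, 1]
similarity = hamming_similarity_via_dots(ref, query)
```

The quantizer makes explicit why ADC precision matters: a 6-bit code can only distinguish 64 levels of the summed current, so nearby similarity values may collapse to the same code.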

Heterogeneous Integration
To further boost the performance, heterogeneous integration techniques have been proposed to stack peripheral circuits on the top or bottom of the 3D NAND Flash array. By incorporating Cu-Cu hybrid bonding [19] and CMOS under array (CUA) [20], ISP achieves a compact form factor. CUA places the memory peripherals under the array, reducing the area of a single tier. Meanwhile, high-density inter-chip Cu-Cu bonding connects the processing elements on the CMOS wafer to the 3D NAND wafer, ensuring seamless integration. The CMOS wafer can be fabricated in an advanced technology node to yield a smaller area and better performance. The combination of compute-in-memory (CIM) with heterogeneous integration [18] offers a compact solution for large-scale data processing with enhanced performance. This approach opens new possibilities for low-power, high-performance, and compact data processing systems across various applications.

PROPOSED 3D NAND ISP ARCHITECTURE
Mass spectrometry datasets contain reference spectra numbering in the millions. In this work, we propose a reconfigurable architecture based on 3D NAND ISP with heterogeneous integration for mass spectrometry applications. The 3D NAND ISP tile can perform both the query encoding and the Hamming similarity search of HyperOMS. This section discusses the architecture of the 3D NAND ISP and its reconfigurability.

3D NAND ISP Tile with Heterogeneous Integration
Figure 5 shows the proposed 3D NAND ISP tile with heterogeneous integration. The peripheral circuits are folded onto the top and bottom of the 3D NAND tile. Notably, the high-voltage circuits, including the word line (WL)/string select line (SSL) switch matrix (SW) and the pass transistors, are fabricated underneath the 3D NAND array using the CUA approach, with a transistor size equivalent to 65 nm technology to sustain the high-voltage program/erase operations of 3D NAND Flash. The low-voltage circuits, including digital circuits, buffers, decoders, and ADCs, are fabricated on a separate CMOS wafer in an advanced 7 nm technology node and later face-to-face bonded on top of the 3D NAND wafer using Cu-Cu hybrid bonding. The inter-tier Cu-Cu bonding has a tight pitch of 1 µm [23] to guarantee high-bandwidth data communication across tiers. With heterogeneous integration, the 3D NAND ISP can accommodate both encoding circuits and search circuits, and therefore performs both encoding and OMS in a single compact tile.

In-Memory Encoding vs. Near-Memory Encoding
The XOR encoding can also be implemented in an in-storage fashion. Unlike the previous ISP approach, which computes dot products on SLs, in-memory encoding performs bit-wise dot products on each BL. Figure 6 illustrates both the near-memory and in-memory encoding hardware designs. The near-memory encoding method deploys a set of XOR gates after the sense amplifiers (SAs) in the page buffer. The position HVs are read from the 3D NAND Flash and fed into the XOR gates alongside the cached level HVs. In the in-memory encoding design, on the other hand, the position HVs are also stored in the 3D NAND array, at the cost of additionally storing the complemented position HVs, and the level HVs are sent in as the BL voltages. The XOR operation can be replaced by the OR of two bit-wise dot products as:
$$\mathbf{F} \oplus \mathbf{L} = (\mathbf{F} \cdot \overline{\mathbf{L}}) + (\overline{\mathbf{F}} \cdot \mathbf{L})$$
By integrating a set of OR gates after the two sense amplifiers, in-memory encoding requires less logic area, since an OR gate is simpler than an XOR gate. The tradeoff is discussed in the Evaluation section.
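The replacement of XOR by an OR of two bit-wise products rests on the Boolean identity a XOR b = (a AND NOT b) OR (NOT a AND b). The short sketch below checks that identity on bit vectors. The function name and the list-of-bits representation are illustrative assumptions; the two AND terms stand in for the two in-array reads, and the final OR stands in for the gate after the sense amplifiers.

```python
def xor_via_or_of_ands(position_bits, level_bits):
    """Compute position XOR level using only AND, NOT, and OR per bit."""
    encoded = []
    for f, l in zip(position_bits, level_bits):
        read_a = f & (1 - l)        # first read: stored bit AND complemented input
        read_b = (1 - f) & l        # second read: complemented stored bit AND input
        encoded.append(read_a | read_b)
    return encoded


position = [0, 0, 1, 1]
level = [0, 1, 0, 1]
result = xor_via_or_of_ands(position, level)
```

The two reads per encoded bit are the source of the doubled read operations that the evaluation attributes to in-memory encoding.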

Reconfigurability
Since the 3D NAND ISP tile performs both encoding and search, multiple tiles can be partitioned for specific tasks, e.g., into encoding tiles and search tiles. This versatility makes the chip reconfigurable, so that specified tasks can be accelerated with optimized tile designs. Figure 7 shows the reconfigurable architecture of the 3D NAND ISP tiles. The tiles communicate with the memory controllers through H-tree routing on the top CMOS tier. This H-tree routing supports inter-tile communication, including tile-to-tile data transmission and broadcasting. The reconfigurable architecture provides a design space for optimization when dealing with datasets that have different parameters.

Data Flow
Figure 8 illustrates the data flow of the architecture. First, the preprocessed spectral data is fetched sequentially through the IO. The designated encoding tiles encode the preprocessed spectral data into query hypervectors, which are subsequently broadcast to the search tiles for parallel searching. Finally, the Hamming similarities are sorted after the entire search space has been explored, and the top-k results are sent out serially through the IO interface.
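The broadcast-and-merge part of this data flow can be sketched in software as follows. This is a functional sketch under assumptions, not a model of the hardware: the round-robin partitioning of references across search tiles, the per-tile local top-k, and the function names are all hypothetical, and the tiles are evaluated in a sequential loop, whereas in hardware they would operate concurrently. The query hypervector is assumed to have been produced by an encoding tile beforehand.

```python
import heapq


def partition(references, num_tiles):
    """Assign (ref_id, hv) entries to search tiles round-robin (an assumption)."""
    return [references[t::num_tiles] for t in range(num_tiles)]


def tile_search(tile_refs, query_hv, k, d):
    """Local top-k of (similarity, ref_id) within one search tile."""
    scored = ((d - bin(query_hv ^ hv).count("1"), ref_id) for ref_id, hv in tile_refs)
    return heapq.nlargest(k, scored)


def run_query(query_hv, references, num_search_tiles, k, d=8192):
    """Broadcast one query to all search tiles, then merge the local top-k lists."""
    tiles = partition(references, num_search_tiles)
    local_hits = [tile_search(tile, query_hv, k, d) for tile in tiles]
    return heapq.nlargest(k, (hit for hits in local_hits for hit in hits))


references = [(0, 0b0000), (1, 0b0111), (2, 0b1111), (3, 0b0011)]
top2 = run_query(0b1111, references, num_search_tiles=2, k=2, d=4)
```

Keeping only k candidates per tile bounds the amount of data that must cross the inter-tile routing before the final sort.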

EVALUATION

Methodology
Datasets. We use two real-world datasets: 1. the small-scale iPRG2012 dataset [4] (total spectra: 15,867) as the query, with the yeast spectral dataset [16] and the human HCD spectral library (total spectra: 1,162,392) as the reference; 2. the large-scale HEK293 (Human Embryonic Kidney 293) dataset [5] (total spectra per query: 46,665 on average) as the query, with the human spectral library [1,26] (total spectra: 2,992,672) as the reference. The query and reference spectra follow the preprocessing flow of existing works [2,3,10]. The preprocessing configurations for query and reference spectra are listed in Table 1. Low-quality spectra with fewer than ten peaks and a 250 m/z mass range, as well as peaks within a 0.05 m/z window around the precursor m/z, were removed. All MS data, spectral libraries, preprocessed spectra, and identification results are available on the MassIVE repository with the dataset identifier MSV000091183.
Benchmarking. The software baselines are evaluated on an Intel i7-11700K CPU with 64GB of RAM and an NVIDIA GeForce RTX 4090 with 24GB of VRAM. We measure the energy consumption of the CPU and GPU using Intel Power Gadget and nvidia-smi, respectively. We count the number of identifications to compare the search quality. All search results are evaluated at a fixed 1% FDR threshold using Pyteomics [7].
Hardware Modeling. The hardware parameters of the proposed 3D NAND are listed in Table 2. The HD encoder and search circuits are implemented in Verilog and synthesized with the ASAP 7nm PDK [6]. The peripheral circuits of the 3D NAND array are extracted from NeuroSim [24]. The clock frequency is set to 1GHz. To estimate the performance and energy efficiency of the proposed ISP designs, we develop an in-house simulator that runs the trace extracted from the HOMS-TC [10] software.
In-memory encoding vs near-memory encoding. For the 3D NAND ISP hardware evaluation, we first compare the performance of the two hardware implementations of encoding. Figure 10 shows the simulation results of in-memory encoding and near-memory encoding. Note that the BL number is set to 1KB (8192) for a fair comparison. Although in-memory encoding can reduce the circuit complexity, the doubled read operations for position HVs yield longer latency and larger energy consumption for the specific XOR encoding approach. In-memory encoding would outperform near-memory encoding for more complex encoding methods. The later simulations are based on near-memory encoding.
Page size scaling. The latency and energy consumption of a 3D NAND memory array are dominated by WL charging/discharging. Therefore, the page size offers a degree of freedom to further optimize the performance. Figure 11 shows the hardware simulation results for various page sizes, i.e., numbers of BLs. We selectively simulate 1KB (8192), 2KB (16384), 4KB (32768), and 8KB (65536). Since the hypervector dimension is 8192, the minimum number of BLs is set to 8192 to avoid additional partial-sum overhead.
The simulation results show that a larger number of BLs yields worse performance. This is because the latency and energy consumption of WL operations scale accordingly. We propose to design the 3D NAND ISP with a minimum page size equal to the hypervector dimension for agile operations.
Tile scaling. The reconfigurable design also provides scalability for further speedup. Figure 12 shows the hardware simulation results for scaled tile numbers. As the number of tiles scales, the latency decreases. However, the latency does not decrease in inverse proportion to the tile count due to the digital processing overhead. We propose to scale the tile number by 2× to obtain an optimized result with a reasonable area of 14.4 and 35.6 mm² for iPRG2012 and HEK293, respectively.

Table 3: Speedup over the state-of-the-art OMS library on GPU, HOMS-TC [10]. The HEK293 runtime is the average runtime for each query file. Workload: Spectra OMS.

| Dataset      | iPRG2012       | HEK293         |
| HOMS-TC [10] | 2.08s (1×)     | 10.4s (1×)     |
| This work    | 0.145s (14.3×) | 0.429s (24.2×) |

Speedup versus GPU. With the optimized configuration of 3D NAND ISP, we compare the performance versus CPU and GPU. Table 3 compares the latency of HOMS-TC, which accelerates HyperOMS on GPU, with that of HyperOMS on 3D NAND ISP. The proposed 3D NAND ISP achieves 14.3× and 24.2× speedup on the respective datasets. The simulated energy consumptions are 0.067 J and 0.491 J. Considering the average GPU power of 450 W, 3D NAND ISP improves the energy efficiency by four orders of magnitude.
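The speedup and energy figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below rests on two assumptions that the text implies but does not spell out: the GPU energy is estimated as the average power (450 W) multiplied by the HOMS-TC runtime, and the two simulated ISP energies (0.067 J and 0.491 J) correspond to iPRG2012 and HEK293 in that order.

```python
GPU_POWER_W = 450.0  # average GPU power quoted in the text

# Runtimes from Table 3; ISP energies assumed to follow the dataset order.
results = {
    "iPRG2012": {"gpu_s": 2.08, "isp_s": 0.145, "isp_j": 0.067},
    "HEK293": {"gpu_s": 10.4, "isp_s": 0.429, "isp_j": 0.491},
}


def summarize(entry):
    """Return (speedup, estimated GPU energy in J, GPU-to-ISP energy ratio)."""
    speedup = entry["gpu_s"] / entry["isp_s"]
    gpu_energy_j = GPU_POWER_W * entry["gpu_s"]  # assumption: power x runtime
    energy_ratio = gpu_energy_j / entry["isp_j"]
    return speedup, gpu_energy_j, energy_ratio


summary = {name: summarize(entry) for name, entry in results.items()}
for name, (speedup, gpu_energy_j, energy_ratio) in summary.items():
    print(f"{name}: {speedup:.1f}x speedup, GPU ~{gpu_energy_j:.0f} J, "
          f"energy ratio ~{energy_ratio:.0f}x")
```

Under these assumptions the energy ratios come out at roughly 1.4 × 10^4 for iPRG2012 and 9.5 × 10^3 for HEK293, consistent with the stated improvement of about four orders of magnitude.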

CONCLUSION
In this work, we propose a 3D NAND ISP architecture to accelerate memory-intensive spectral open modification search (OMS) workloads. We also present two types of encoding design and select near-memory encoding for the state-of-the-art HD-based OMS algorithm [9,10]. The proposed 3D NAND ISP provides reconfigurability and scalability for further optimization. Without the need to move massive data between SSD and memory, energy efficiency is improved by four orders of magnitude and 14.3× to 24.2× speedup is achieved over the GPU baseline [10]. Our design is an energy-efficient and high-performance ISP solution for emerging large-scale spectra OMS.

Figure 3: Two major steps in HD-based OMS [9]: (a) encoding and (b) search. The encoding step converts spectra peaks into hypervectors. The search step uses Hamming similarity to efficiently find the matched peptides.

Figure 4: Overview of the 3D NAND in-storage processing (ISP) architecture: (a) Configuration of 3D NAND ISP. An additional set of ADCs is deployed after the separated SLs to convert the VMM or dot product results into the digital domain. (b) Data mapping scheme of 3D NAND ISP. Taking OMS as an example, the reference dataset is mapped to the 3D NAND array and the query for the Hamming similarity calculation is sent into the array as BL voltages. The summed currents on the SLs represent the dot product results, which are sorted after the ADCs.

Figure 5: 3D NAND ISP tile with heterogeneous integration. The high-voltage circuits are stacked underneath the 3D NAND array using CUA. The low-voltage circuits and digital circuits are fabricated on a separate CMOS wafer in an advanced technology node. The 3D NAND wafer and CMOS wafer are bonded using Cu-Cu bonds, offering high-bandwidth inter-tier communication.

Figure 6: Block diagrams of in-memory encoding and near-memory encoding: (a) In-memory encoding. The position HVs and their complements are stored in the 3D NAND array. The XOR encoding is obtained as the OR of two dot products. (b) Near-memory encoding. The position HVs are read from the 3D NAND array and XORed with the cached level HVs to complete the encoding.

Figure 7: Reconfigurable 3D NAND ISP architecture. Each tile can perform both encoding and search operations. Combining several tiles with H-tree routing provides the flexibility to assign encoding or search to specific tiles for optimization.

Figure 8: Data flow in the 3D NAND ISP architecture. The preprocessed spectra are fetched and encoded by the encoding tiles. The encoded query is broadcast to the search tiles for the Hamming similarity search in parallel. Finally, the sorted top-k results are sent out.

Figure 9: Impact of ADC precision on the OMS search quality in terms of identified peptides.

Figure 10: Hardware simulation results of in-memory encoding versus near-memory encoding. Note that the BL number is 8192.

Figure 12: Hardware simulation results of scaled tile numbers for the near-memory encoding implementation. Note that the BL number is 8192.

Table 2: Hardware Simulation Parameters

Figure 9 demonstrates the impact of ADC precision on the OMS search quality. The quantization error is negligible when the ADC precision is 6-bit. Therefore, we design the ADCs as 6-bit SAR ADCs.