Towards LDPC Read Performance of 3D Flash Memories with Layer-induced Error Characteristics

3D flash memories have been widely developed to further increase the storage capacity of SSDs by vertically stacking multiple layers. However, this special physical structure brings new error characteristics. Existing studies have discovered significant Raw Bit Error Rate (RBER) variations among different layers and RBER similarity inside the same layer, both due to the manufacturing process. These error characteristics introduce a new data reliability issue. Currently, Low-Density Parity-Check (LDPC) codes are widely used to ensure the data reliability of flash memories; they provide stronger error correction capability for high RBERs at the cost of longer read latency. Traditional LDPC codes designed for planar flash memories do not consider the layer RBER characteristics of 3D flash memories, which may induce sub-optimal read performance. This article first investigates the effect of the RBER characteristics of 3D flash memories on read performance and makes two observations. On one hand, LDPC read latencies diverge largely across flash layers and increase at diverse speeds as data retention grows, which is caused by the inter-layer RBER variation. On the other hand, by comparing RBERs between different pages of the same flash layer, we observe that read latencies with LDPC codes are quite similar, which is caused by the intra-layer RBER similarity. By exploiting these two observations, this article proposes a Multi-Granularity LDPC (MG-LDPC) read method to adapt to the read latency increase characteristics across 3D flash layers. In detail, we design five LDPC decoding engines with varied read level increase granularities (a higher level induces a higher latency) and assign these engines to each layer either dynamically, according to prior information, or in a fixed way. A series of experimental results demonstrates that the fixed and dynamic MG-LDPC methods reduce SSD read response time by 21% and 51% on average, respectively.

We evaluate MG-LDPC in Disksim with SSD extension under nine real-world workloads. Compared with existing progressive LDPC methods, experimental results show that the proposed method improves 3D flash read performance by up to 51% and reduces the energy overhead by 53% on average. The contributions of this article are listed as follows:
• We observe that the read latency of 3D flash layers increases at significantly diverse speeds, so existing single-granularity LDPC read methods induce sub-optimal read performance.
• We propose a multi-granularity LDPC read method, MG-LDPC, by exploiting the above observation. Five multi-granularity decoding engines are designed and assigned to adapt to the read latency increase characteristics across 3D flash layers.
• We propose two implementations of MG-LDPC: FixAdp and DynAdp. FixAdp directly separates flash layers into five parts, while DynAdp decides the granularity according to previous read levels stored in extra storage space, which gives MG-LDPC better layer error adaptability.
• We evaluate MG-LDPC with real-world workloads in Disksim with SSD extension and verify the effectiveness of the proposed strategies on the read performance improvement of 3D flash memories.
The rest of this article is organized as follows. Section 2 introduces the background of 3D flash and LDPC. The motivation of this article is presented in Section 3. Section 4 describes detailed designs of MG-LDPC. The performance of MG-LDPC is evaluated with basic experiments and sensitivity studies in Section 5. Section 6 presents related works, and Section 7 concludes this article.

BACKGROUND
This section first presents the basics of 3D flash memories and then introduces the concept of read level. Finally, the current progressive LDPC read method is presented.

Layer-stacked 3D Flash Memories
Due to scaling limitations and severe reliability issues, the storage capacity of planar flash memories can hardly be increased further, which promotes the emergence and development of 3D flash memories. Their storage capacity is significantly improved by increasing the number of vertically stacked layers [10,31]. However, data reliability concerns still exist and even worsen, which requires advanced ECCs such as LDPC codes to guarantee data correctness, as introduced in Section 2.3.1.
Most existing 3D flash memory designs use charge trap (CT) transistors, which replace the floating gate. The CT-based flash layer contains an insulator, in which electrons can be stored in a more stable state [32,43]. A section of the layer-stacked structure of 3D flash is shown in Figure 1(a), which has three dimensions: wordlines, bitlines, and layers [18]. It can be seen that a block contains data across all the layers, which largely increases the block size and induces the big-block problem. Besides, as the manufacturing process may induce varied characteristics in each layer, the data storage strength also differs among layers.
As shown in Figure 1(b), stacked layers are vertically connected through cylindrical channels in 3D flash memories. However, due to process variation and the etching technologies used to fabricate 3D flash memories [17,31], the stacked channels have a larger diameter at the top layers and a smaller diameter at the bottom layers. As a result, manufacturers are unable to produce identical stacked 3D flash layers. This asymmetric feature leads to significant variation in the error characteristics of flash cells residing in different stacked layers.

Read Operation of Flash Memory
NAND flash memories use transistors (flash cells) to store charges, which determine the data values stored in the flash cells. Like phase-change memory, NAND flash memory is a non-volatile memory device [16,52]. For a flash memory with n bits per cell, the threshold voltage (V_th) range is divided into 2^n different voltage states. Each state represents a different value and is assigned a voltage window within the range of all possible threshold voltages. Figure 2 illustrates the threshold voltage distribution of MLC NAND flash memories. The x-axis shows the threshold voltage, and the y-axis shows the probability density of each voltage level. The four V_th distributions represent the four possible cell states: "11," "01," "00," and "10." We refer to the left bit as the most significant bit (MSB) and the right bit as the least significant bit (LSB). All the most significant bits and least significant bits make up the upper pages and lower pages in MLC flash, respectively.
The boundaries between neighboring threshold voltage windows are labeled V_a, V_b, and V_c for the MLC distribution in Figure 2, and we refer to them as read reference voltages. To read the LSB of MLC flash, we only need to apply a single read reference voltage (V_b) to distinguish the states where the LSB value is 1 from those where it is 0. To read the MSB, we need to apply V_a and V_c to distinguish value 1 from value 0, so the MSB is determined by two consecutive read operations. We call this read method, which applies a single reference voltage in a read operation, hard-decision read.
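To illustrate the hard-decision read just described, the following is a minimal Python sketch assuming the state-to-bits mapping of Figure 2 ("11," "01," "00," "10" from left to right) and hypothetical reference voltages; the function names are ours, not from the article.

```python
# Minimal sketch of hard-decision MLC reads (illustrative values only).
# Assumed mapping from Figure 2: MSB = left bit, LSB = right bit,
# with read reference voltages Va < Vb < Vc.

Va, Vb, Vc = 1.0, 2.0, 3.0  # hypothetical read reference voltages (volts)

def read_lsb(vth: float) -> int:
    """LSB needs one sensing step at Vb: states left of Vb store LSB = 1."""
    return 1 if vth < Vb else 0

def read_msb(vth: float) -> int:
    """MSB needs two sensing steps (Va and Vc): the outer states store MSB = 1."""
    return 1 if (vth < Va or vth >= Vc) else 0

for vth in (0.5, 1.5, 2.5, 3.5):  # one sample per voltage state
    print(f"Vth={vth}: bits={read_msb(vth)}{read_lsb(vth)}")
# -> 11, 01, 00, 10, matching the four states of Figure 2
```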
However, under the influence of Program/Erase (P/E) cycles and various interference sources (retention interference, read interference, etc.) [5,6,39,40,46], the voltage distributions of adjacent states form an overlap area, as shown in Figure 3. When we access flash memory with hard-decision reads that apply a single read reference voltage to the control gate (i.e., a single read level), the overlap area of adjacent states causes errors [27,53,54] and can lead to a failed access. Therefore, read operations that apply multiple voltage levels to the control gate, called soft-decision reads, are introduced to improve the accuracy of accessing data [3,11,12].
Each overlap area is split into two regions. In the example shown in Figure 3, each bit can be read two times, and the higher the number of read levels, the more accurate the obtained information is. This article defines the read level as the number of read reference voltages applied between adjacent threshold voltage states of flash memory. When the read level increases, more read voltages are applied and thus higher error correction capability can be obtained. However, the read latency also increases, as more voltage sensing latency is involved; the total access latency is linearly proportional to the read level. For example, for MLC flash memory, only one read voltage is applied in the sensing process at the first read level, but two extra read voltages (one on the left side and one on the right side) are sensed at the second read level, which induces higher read latency. Note that we refer to hard sensing as the first read level and soft sensing as the read levels larger than one. The higher the read level, the higher the read latency and the more iterations of LDPC decoding, as described in detail in Section 2.3.

LDPC Reading and Decoding
2.3.1 Progressive LDPC Reading. As advanced ECCs, LDPC codes can provide multiple levels of error correction strength. Higher read levels induce flash read latencies that grow quickly with the level. As shown in the first three columns of Table 1 [1], the first read level takes 85 μs and can correct RBERs below 0.005. Each extra read level takes 24 additional μs, and the ECC strength is also enhanced due to the increased read accuracy. Thus, the latency of read level n is defined as L(n) = 85 + 24 * (n − 1), and the accumulated latency of read levels is defined as AL(n) = Σ_{i=1}^{n} L(i). For example, according to the read-retry strategy of progressive LDPC reading, read level 3 is used only after read levels 1 and 2 fail. Therefore, the total latency of a successful decoding at read level 3 is the sum of the latencies of read levels 1, 2, and 3: AL(3) = L(1) + L(2) + L(3) = 85 μs + 109 μs + 133 μs = 327 μs, which corresponds to the fourth column of Table 1, namely, the accumulated latency. The "ECC strength" column shows the RBER ranges in which data can be successfully decoded at the corresponding read level.

Besides, as read levels cannot be decided in advance, it is not trivial to achieve the ideal read latency. The existing progressive LDPC read method [58] balances error correction strength and read latency by progressively applying increasing read levels in a read-retry way, as shown in Figure 4. From Figure 4, we can see that once LDPC decoding succeeds, the read process ends; otherwise, a retried read happens with a higher read level, and the iterative read-retries continue until the maximal read level is reached. Note that the read level is incremented by one whenever LDPC decoding fails. In this way, current RBERs can be adapted with suitable read latencies and relatively low read levels. However, high accumulated latencies are induced for high RBERs. As shown in Table 1, when the read level is 1, the latency is 85 μs and the accumulated latency is the lowest; when the read level is 7, the latency increases to 229 μs, and the accumulated latency from read levels 1 to 7 reaches up to 1,099 μs. This article studies the progressive LDPC read method under the layer-stacked structure of 3D flash memories.
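As a quick check of these latency formulas, the following minimal Python sketch reproduces the Table 1 numbers cited above; the function names are ours.

```python
# Worked example of the Table 1 latencies (values from the article):
# L(n) = 85 + 24 * (n - 1) us per read level, AL(n) = sum of L(1..n).

def level_latency(n: int) -> int:
    """Sensing-plus-decoding latency of read level n, in microseconds."""
    return 85 + 24 * (n - 1)

def accumulated_latency(n: int) -> int:
    """Total latency when progressive reading succeeds at read level n."""
    return sum(level_latency(i) for i in range(1, n + 1))

for n in (1, 3, 7):
    print(f"level {n}: L={level_latency(n)} us, AL={accumulated_latency(n)} us")
# -> level 1: L=85, AL=85
#    level 3: L=133, AL=327
#    level 7: L=229, AL=1099
```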

LDPC Decoding Process.
There are two types of nodes in LDPC decoding: variable nodes and check nodes. The decoding process iteratively passes information between these nodes and consists of three steps: initialization, check node update, and variable node update. In an additive white Gaussian noise channel, each variable node v is initialized with an a priori log-likelihood ratio (LLR) calculated from the channel output y_v and the channel noise variance σ², as shown in Equation (1):

LLR_v^in = log(P(b_v = 0 | y_v) / P(b_v = 1 | y_v)) = 2 * y_v / σ²,    (1)

where P(b_v = b | y_v) is the conditional probability that the stored bit b_v equals b (b is either 0 or 1), given that the channel output y_v is received. After the initialization, the LLRs are assigned to the messages that variable nodes send to the check nodes along the edges of the Tanner graph, called the variable-to-check messages M_v2c = LLR_v^in. Each check node calculates new messages to be sent to the corresponding variable nodes using the min-sum approximation [15], as shown in Equation (2):

M_c2v^i = (∏_{v'∈V_c\{v}} sign(M_{v'2c}^{i−1})) * min_{v'∈V_c\{v}} |M_{v'2c}^{i−1}|,    (2)

where V_c represents the set of all variable nodes connected to the check node c. Each variable node then updates its associated LLR (also referred to as the intrinsic LLR, or the soft output for variable node v) with its posteriori value, as shown in Equation (3), and computes new variable-to-check messages as in Equation (4):

LLR_{v,apost}^i = LLR_v^in + Σ_{c∈C_v} M_{c2v}^i,    (3)

M_{v2c}^i = LLR_{v,apost}^i − M_{c2v}^i,    (4)

where C_v represents the set of all check nodes connected to the variable node v and i is the index of the current iteration. Note that in the first iteration, the old posteriori LLR value is the priori channel LLR, i.e., LLR_{v,apost}^0 = LLR_v^in, and M_{c2v}^0 = 0.
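To make the message passing concrete, below is a minimal min-sum decoding sketch following Equations (1) through (4); the tiny parity-check matrix H, the noise variance, and the received vector are illustrative choices of ours, not values from the article.

```python
import numpy as np

# Toy parity-check matrix (3 checks, 6 variables), chosen for illustration.
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]])

def min_sum_decode(y, sigma2, max_iter=20):
    m, n = H.shape
    llr_in = 2.0 * y / sigma2          # Eq. (1): a priori channel LLRs (AWGN)
    M_v2c = H * llr_in                 # first-iteration messages: M_v2c = LLR_in
    for _ in range(max_iter):
        M_c2v = np.zeros((m, n))
        for c in range(m):
            vs = np.flatnonzero(H[c])  # variable nodes connected to check c
            for v in vs:
                others = vs[vs != v]
                sign = np.prod(np.sign(M_v2c[c, others]))
                M_c2v[c, v] = sign * np.min(np.abs(M_v2c[c, others]))  # Eq. (2)
        llr_post = llr_in + M_c2v.sum(axis=0)   # Eq. (3): posteriori LLRs
        M_v2c = H * llr_post - M_c2v            # Eq. (4)
        bits = (llr_post < 0).astype(int)       # negative LLR decodes to bit 1
        if not np.any(H @ bits % 2):            # all parity checks satisfied
            return bits
    return bits

# All-zero codeword sent as +1.0 over AWGN, with one noisy sample:
y = np.array([0.9, 1.1, -0.2, 1.0, 0.8, 1.2])
print(min_sum_decode(y, sigma2=0.5))  # -> [0 0 0 0 0 0]
```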

The Min-sum LDPC Decoder.
The min-sum decoder architecture [37] is shown in Figure 5. Variable-to-check messages are calculated in variable node units (VNUs) from the posteriori LLRs as in Equation (4). These messages are buffered in the M_v2c FIFO for later use in the LLR update calculation. To establish proper connections between the variable and check nodes, the LLRs are cyclically shifted, before the variable node calculation, by the value equal to the corresponding shift of the identity submatrix in the parity-check matrix. Variable-to-check messages are passed to check node units (CNUs), where new check-to-variable messages are calculated using Equation (2), along with the product of the signs of the variable-to-check messages. The new check-to-variable messages are used to calculate the new intrinsic LLRs, which are then cyclically shifted in the opposite direction. The LLR memory always contains up-to-date LLRs.
Many existing works optimize the min-sum algorithm. Jiang et al. [33] introduce a more efficient adjustment for the check-node update computation in view of different minimum values: according to the optimal correction factor of the normalized min-sum algorithm, an adaptive offset item can be determined. Li et al. [23] propose a low-complexity LDPC decoding algorithm with simplified check node updating. Zhao et al. [57] show that in many cases only four quantization bits suffice to obtain close-to-ideal performance over a wide range of signal-to-noise ratios, and propose modifications to the min-sum algorithm that improve the performance by a few tenths of a decibel with just a small increase in decoding complexity. Some advanced technologies have also been applied in LDPC decoders. For a recording system with a run-length-limited constraint, Chou and Sham [8] impose hard errors by flipping bits before recording. Ma et al. [29] present a data packing technique for the quasi-cyclic LDPC decoder applied to the NAND flash controller.

MOTIVATION
This section first introduces the inter-layer error variation: through a preliminary experiment, we uncover that LDPC read levels increase at varied speeds in different 3D flash layers. Then, based on the study results, we further present the intra-layer error similarity and observe that read latencies with LDPC show similar increase curves. Finally, by exploiting these 3D flash error characteristics, this article proposes a multi-granularity progressive LDPC read method called MG-LDPC.

Table 3. RBER Variations of 3D Flash Layers [28]
Layer index                  3      9      15     21     27
RBER (relative to layer 3)   1.0X   2.6X   6.6X   5.3X   4.0X

Observation on Inter-layer Error Variation
Due to the manufacturing process variations (PVs, mainly from circuit etching technologies [28]) of 3D flash memories, the RBERs of each layer are significantly different: the RBERs of the middle layers can reach more than 6X those of the top layers. By carefully calibrating the RBER results at a P/E cycle of 10,000 in [28], the RBERs of five representative flash layers can be obtained, as shown in Table 3. These five layers correspond to the 3rd, 9th, 15th, 21st, and 27th layers of 30 flash layers. It can be seen from the table that the RBERs of the data in the other four layers are 2.6X, 6.6X, 5.3X, and 4.0X that of the third layer, respectively. The RBERs of data in the middle layers are the highest, while those in the top layers are the lowest.
We use the error model from Luo et al. [28] to compute the RBERs of 3D flash memories, as shown in Equation (5). This model takes the P/E cycle and data retention time as variables, while the other error factors are reflected in constants. In Equation (5), A = α * PEC + β and B = γ * PEC + δ, where PEC represents the P/E cycle and t represents the retention time. The parameters α, β, γ, and δ are constant for a fixed page type and are set with the values shown in Table 2. Note that these parameters were obtained by testing real chips [28]; in reality, they would change for different chip types. The RBERs in this article are obtained by taking the average RBERs of lower pages and upper pages.
We draw the RBER increase curves of three typical layers along with data retention. As shown in Figure 6(a), solid lines represent the upper pages of a flash layer, and dashed lines represent the lower pages. As Equation (5) was defined without special treatment for different flash layers, there is currently no simulation model for stacked-layer 3D flash memories. In our experiment, the RBER results of the top layer are obtained from the planar flash error model based on Equation (5), and those of the other two layers are computed by multiplying the corresponding ratios from Table 3. Taking the upper pages as an example, we can see from the figure that the RBER of the top layer does not change noticeably with data retention and stays at a relatively low level, while the RBERs of the middle and bottom layers increase sharply as data retention grows. Note that RBERs in different layers vary significantly with data retention. By combining the ECC strengths of LDPC read levels in Table 1, the required read levels of the three typical layers are obtained, as shown in Figure 6(b). We can see that the top layer only requires Level 1 during the investigated period, while the middle and bottom layers require multiple high read levels. Besides, the read levels increase sharply, especially for the middle layers, which would induce large accumulated latencies on the middle and bottom layers. These observations show that the traditional progressive LDPC method, with its granularity of one extra level per read-retry, is no longer effective.
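The mapping from a layer's RBER to its required read level in Figure 6(b) can be sketched as below. Only the level-1 strength (RBERs below 0.005) is given in Section 2.3.1; the remaining thresholds and the sample RBERs are hypothetical stand-ins for the "ECC strength" column of Table 1.

```python
# Hedged sketch: minimal LDPC read level for a given RBER. The threshold
# list is hypothetical except for level 1 (< 0.005, from Section 2.3.1);
# the real values come from the "ECC strength" column of Table 1.
ECC_STRENGTH = [0.005, 0.008, 0.011, 0.014, 0.017, 0.020, 0.023]  # levels 1..7

def required_read_level(rber: float) -> int:
    """Smallest read level whose correction strength covers this RBER."""
    for level, strength in enumerate(ECC_STRENGTH, start=1):
        if rber <= strength:
            return level
    raise ValueError("RBER exceeds the maximal correction strength")

# A middle layer at 6.6X the RBER of the top layer (Table 3 ratio):
top_rber = 0.002  # hypothetical top-layer RBER
print(required_read_level(top_rber))        # -> 1 (top layer stays at level 1)
print(required_read_level(top_rber * 6.6))  # -> 4 under these assumed thresholds
```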

Observation on Intra-layer Error Similarity
Intra-layer error similarity was first introduced in [41]. Despite process variation between different stacked layers, 3D flash memory possesses significant process similarity in the same flash layer. That is, the flash pages on the same layer present similar reliability characteristics. This is because the same layer encounters the same etching environment when etching technologies are applied to 3D flash memory.
In Figure 6(a), we can see that the upper pages and lower pages of the top layer show almost the same RBER. For the middle and bottom layers, although there is a certain difference between the RBERs of the upper and lower pages, they show a similar growth trend. Therefore, although the inter-layer error variation is large, intra-layer errors grow similarly. Since the read latency with LDPC greatly depends on the reliability characteristics of flash, we can assume that different pages in the same layer have similar read latency characteristics.
To investigate the read latency characteristics within the same layer, we studied the LDPC read level required to successfully read the data of pages on the same layer. Figure 6(b) shows that the upper and lower pages of the top layer only require Level 1 during the investigated period; they exhibit the same read level. For the middle and bottom layers, the upper and lower pages require the same read level when the retention time is less than 10 days; although they exhibit different read levels when the retention time exceeds 10 days, the difference is small and the growth trend is similar. Thus, the pages in the same layer have similar read latencies.

Exploiting Layer-induced Error Characteristics in 3D Flash
According to the above observations and analysis, different 3D flash layers show significant RBER variations and diverse read level increase speeds, so the current LDPC read method with a single level-increasing granularity cannot satisfy the low-latency requirements of different layers at the same time. Directly using the current read method induces high accumulated latencies and sub-optimal read performance.
Meanwhile, the pages in the same layer of flash memory show similar reliability characteristics, which indicates that we can employ layer reliability estimation instead of page reliability estimation. Thus, by exploiting these two observation results, this article proposes a multi-granularity progressive LDPC read method called MG-LDPC, which applies varied level-increase granularities when reading different flash layers and an identical level-increase granularity within the same flash layer.

MG-LDPC READING METHOD
This section first presents the architectural overview of MG-LDPC and then illustrates the details of its two implementations in 3D flash memories. Next, the storage of the LDPC read granularity is presented. Finally, the overhead and complexity of MG-LDPC are discussed.

Overview of MG-LDPC
The system architecture with MG-LDPC integrated into the 3D flash controller is presented in Figure 7. When the host requests data from 3D flash memory, the physical addresses of the data are first located through the mapping table in the flash translation layer of the controller. Then, the data are read out and transferred into the MG-LDPC component to recover the original data. MG-LDPC provides five LDPC decoding engines with different progressive granularities, the details of which are illustrated in Section 4.2. An important task of MG-LDPC is to assign LDPC decoding engines to different flash layers. Two assignment implementations are provided, as shown in Figure 7. The first, named FixAdp, distributes the layers among the LDPC decoding engines in a fixed way according to the layer read level characteristics in Section 3.1. The other, called DynAdp, adapts the LDPC decoding engines dynamically according to previous read levels stored in extra flash space. Details of these two implementations are presented in the following sections.

Fixed MG-LDPC Decoding Engines
This section presents the detailed design of the five LDPC decoding engines, as well as FixAdp, the fixed decoding engine assignment. Since the current LDPC method cannot satisfy the low-latency requirements of different layers at the same time, different layers require varied level-increasing granularities in LDPC reading. From Figure 6(b) in Section 3, it can be seen that while a low read level is enough to ensure the correctness of data in the top layers, the required read levels in the middle and bottom layers can become very high due to the high RBERs. To adapt to the variety of level-increasing speeds, the five LDPC decoding engines are designed with granularities of 1, 2, 3, 4, and 6, respectively. The beginning level and maximal level are set to 1 and 7, respectively, in the current seven-level LDPC read method.
FixAdp divides all flash layers into five parts, each of which adopts the corresponding LDPC decoding engine based on an analysis of the RBERs. As shown in Figure 8, once a read request comes in, the layer information is first obtained from metadata and then used to determine which LDPC decoding engine should be used, according to the RBER characteristics obtained from simulations. For example, consider a 3D flash block with 30 layers at a P/E cycle of 10,000: the data RBERs of top layers 1 to 9 are relatively low, so FixAdp chooses the first LDPC decoding engine with a granularity of 1. The other decoding engines are assigned to the remaining layers as follows: layers 10 to 13 are assigned the second decoding engine with a granularity of 2; layers 14 to 18 are considered middle layers and assigned the third decoding engine with the highest granularity of 6; the fourth LDPC decoding engine with a granularity of 4 is used for layers 19 to 23; and layers 24 to 30 are considered bottom layers and read by the decoding engine with a granularity of 3. A sketch of this assignment is given below.
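The following minimal sketch encodes the five engines and the FixAdp layer assignment described above; the helper names are ours, and we assume a progressive engine always ends its level sequence at the maximal level 7.

```python
# Sketch of the five decoding engines and the FixAdp layer assignment for a
# 30-layer block at 10,000 P/E cycles; granularities and layer ranges follow
# Section 4.2, the helper names are ours.
MAX_LEVEL = 7

def fixadp_granularity(layer: int) -> int:
    """Layer index is 1-based; the ranges follow the text above."""
    if layer <= 9:   return 1   # top layers: lowest RBERs
    if layer <= 13:  return 2
    if layer <= 18:  return 6   # middle layers: highest RBERs
    if layer <= 23:  return 4
    return 3                    # bottom layers

def read_level_sequence(granularity: int):
    """Read levels tried by a progressive engine with this granularity."""
    levels = list(range(1, MAX_LEVEL + 1, granularity))
    if levels[-1] != MAX_LEVEL:
        levels.append(MAX_LEVEL)  # assumed: always end at the maximal level 7
    return levels

print(read_level_sequence(fixadp_granularity(16)))  # middle layer -> [1, 7]
print(read_level_sequence(fixadp_granularity(26)))  # bottom layer -> [1, 4, 7]
```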
As the fixed LDPC decoding engine assignment may no longer be applicable when the actual RBERs deviate largely from the simulated RBER characteristics, a dynamic decoding engine assignment method is provided by storing read-level information, the details of which are presented as follows.

Dynamic MG-LDPC Decoding Engines
DynAdp also uses the five LDPC decoding engines mentioned above to read 3D flash layers and dynamically decides which decoding engine to use according to previous read levels of each layer stored in extra flash space. Hence, DynAdp is more flexible when RBER patterns change. In this section, we first illustrate the read process with DynAdp and the situations that increase the LDPC read granularity. Then, we present the LDPC read granularity decay scheme.

LDPC Reading with DynAdp and Read Granularity Increase.
As shown in Figure 9, when a read request comes in, DynAdp first determines which layer the data belongs to and then performs progressive LDPC reading, initially with the first LDPC decoding engine with a granularity of 1. The final read level is then recorded in the extra storage space; we denote the recorded read level as R. For each layer, the initial value of R is 1, and R is updated whenever the final read level of the current layer changes. The layer is then assigned to the corresponding decoding engine with a granularity of R, which means that each read-retry increases the read level by R at one time.
The read process with DynAdp is briefly summarized in Algorithm 1 (a Python sketch follows below). First, we determine the validity of the access address: if the layer number is less than 0, the access stops. When accessing a flash layer for the first time, DynAdp sets the read level to 1 and then reads data using the LDPC decoding engine with a granularity of 1, as shown in lines 5 to 7 of Algorithm 1, in which r denotes the LDPC read level. Note that the maximum LDPC read level in this article is 7, and thus r ≤ 7. Once the data are successfully read out, the read level R is recorded immediately, as shown in lines 8 to 10 of Algorithm 1, in which ERR_RATE[r] denotes the maximum ECC strength of LDPC when the read level is r. Besides, the stored read level R is updated again when the layer read level changes to R′, as shown in lines 15 to 17 of Algorithm 1; the flash controller will use the LDPC decoding engine with a granularity of R′ to read the data in the current layer next time. In this way, DynAdp can adapt to the varied RBER increase speeds of flash layers more accurately.
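The following sketch mirrors Algorithm 1 under stated assumptions: ldpc_read(layer, r) stands in for one sensing-plus-decoding attempt at read level r, and the recorded final level is reused directly as the layer's retry step rather than being mapped onto the five engine granularities.

```python
MAX_LEVEL = 7
read_granularity = {}  # per-layer R, kept in the mapping table (Section 4.4)

def dynadp_read(layer: int, ldpc_read) -> bool:
    """Progressive read with a per-layer granularity (sketch of Algorithm 1)."""
    if layer < 0:
        return False                              # invalid access address
    step = read_granularity.setdefault(layer, 1)  # first access: granularity 1
    r = 1
    while True:
        if ldpc_read(layer, r):                   # decoding succeeds at level r
            read_granularity[layer] = r           # record the final level as R'
            return True
        if r == MAX_LEVEL:
            return False                          # even the maximal level failed
        r = min(r + step, MAX_LEVEL)              # retry R levels higher

# Toy demo: on this layer, decoding only succeeds at level >= 5.
print(dynadp_read(16, lambda layer, r: r >= 5))   # tries 1,2,3,4,5 -> True
print(read_granularity[16])                       # -> 5 (next reads step by 5)
```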

LDPC Read Granularity Decay.
DynAdp also includes a mechanism to reduce the stored LDPC read granularity, named decay. In a real-world flash-based SSD, RBERs can increase for many transient reasons, which induces a quick RBER increase on a small proportion of pages inside the same layer. There are also situations where the RBERs of some pages drop: for example, retention errors introduce high RBERs, but once the data are updated, the RBERs drop back. In these situations, a stored high read granularity results in unnecessary read latency on these pages. Our read granularity decay mechanism is designed to deal with such situations and aims at better read performance. The key idea is to reduce the LDPC read granularity after a sufficient number of read successes.
As DynAdp shares the LDPC read granularity at the unit of a flash layer, the granularity decay mechanism is not invoked when RBER changes happen on only parts of the pages. Instead, we design the granularity decay around the number of read successes: once it reaches the preset maximum read success number N, a one-level read granularity decay happens. As shown in Figure 10, when LDPC reads succeed with granularity R for N times in the same layer, DynAdp lowers the read granularity of the corresponding layer to R − 1. Once the decay is finished, the read success counter is reset to 0. For example, assume N is 4 and the read granularity of layer 16 is 3. When the number of successful reads of the current layer reaches 4, the read granularity is reduced to 2, so subsequent reads on this layer use a granularity of 2 (see the sketch below). Note that the maximum read success number N directly affects the efficiency of DynAdp; we evaluate this effect in the sensitivity study in Section 5.5.
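A minimal sketch of the decay mechanism follows, using a per-layer table like the one in Section 4.4; N = 4 matches the example above, and the dictionary contents are illustrative.

```python
# Sketch of the read granularity decay (Figure 10). N is the preset maximum
# read success number; the two dictionaries mirror the per-layer mapping
# table described in Section 4.4.
N = 4
read_granularity = {16: 3}  # layer 16 currently reads with granularity 3
success_count = {}

def on_read_success(layer: int) -> None:
    """Count successes; after N of them, decay the layer's granularity by 1."""
    success_count[layer] = success_count.get(layer, 0) + 1
    if success_count[layer] >= N and read_granularity.get(layer, 1) > 1:
        read_granularity[layer] -= 1  # one-level granularity decay
        success_count[layer] = 0      # reset the counter after the decay

for _ in range(4):
    on_read_success(16)
print(read_granularity[16])  # -> 2: reads on layer 16 now step by 2 levels
```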

Storage of Read Level Granularity.
In DynAdp, the current read granularity of each layer, as well as the read success number under that granularity, needs to be stored. As shown in Figure 11, a mapping table is built to store the mapping between flash layers and their LDPC read granularities as well as the read success numbers. In order to reduce the access overhead

Discussion
About the Settings of LDPC Decoding Engines. According to the RBER distributions in 3D flash memories [28], this method uses five commonly used step sizes to form the LDPC decoding engines; the number of decoding engines can be increased or decreased according to actual use. The impact of the decoding engine number and the level increase steps on read performance is investigated in Section 5.
Algorithm Complexity. The complexity of Algorithm 1 is O(1) in selecting which granularity to use. Since the current maximal LDPC read level is 7, the complexity of traditional LDPC methods is also not high. As some unnecessary intermediate reading steps are removed in MG-LDPC, DynAdp does not increase computing complexity; instead, it reduces LDPC reading iterations by using extra storage for the previous read level and the read success number of each layer. Besides, due to the uniqueness of the codeword, our method does not affect the accuracy of data reading.
Storage Overhead. The storage overhead of DynAdp is estimated based on our experimental settings. For the 64 GB flash with 32 layers, the previous read level of each layer takes only 1 byte. As the granularity decay also needs extra storage, we store the read success number with 1 byte. Thus, the overall storage overhead is 33 bytes, which is negligible. As the FixAdp implementation directly assigns LDPC decoding engines, no extra complexity is induced, and its storage overhead is also a negligible 32 bytes. In summary, the proposed MG-LDPC method can decrease LDPC read latency and optimize flash read performance with low complexity and negligible extra storage.

EVALUATION
This section first introduces the experiment settings to evaluate MG-LDPC. Then experimental results with basic settings and sensitivity studies are presented and analyzed.

Experimental Setup
We introduce the experimental setup in three aspects: the experiment platform and workloads, the compared methods, and the parameter settings for the basic experiment and sensitivity study.

Experiment Platform and Workloads.
The evaluation of MG-LDPC is performed on Disksim with SSD extensions [2,4], which we use to simulate a 64 GB 3D MLC flash. The simulated SSD has four channels, with two chips per channel and four planes per chip, and 32 layers in each flash block. The page size is 16 KB and there are 384 pages in each block. The detailed configuration is listed in Table 4, which is mainly based on the specifications of the 3D NAND chip manufactured by Samsung [7]; the max queue depth in the simulation is set to 8. As shown in Table 5, nine workloads [35,47] with obvious differences in read/write ratios are chosen to verify the effectiveness of the proposed method.

Compared Methods.
Four LDPC methods have been implemented and compared, which are listed as follows:
• PrLDPC is the current progressive LDPC method [58].
• LaLDPC is the method that estimates the starting read level of pages that exist in a mapping cache [13].
• FixAdp is one implementation of MG-LDPC with fixed LDPC decoding engines.
• DynAdp is the other implementation of MG-LDPC with dynamic decoding engine assignment.

Parameter Settings in the Basic Experiment and Sensitivity Study.
In our evaluation of MG-LDPC, we study the results with basic settings and perform a sensitivity study on four parameters: P/E cycles, retention time, the number of LDPC decoding engines, and the number of read successes that invokes the LDPC granularity decay of DynAdp. These parameter settings are shown in Table 6, in which the Basic column presents the basic settings, while the Sensitivity column presents the extra parameters of the sensitivity study.
For the P/E cycle, we set 4,000 and 6,000 P/E cycles to represent normal SSDs, and 7,000 and 8,000 P/E cycles to represent SSDs in extreme situations with high RBERs. For the retention time, in order to verify the effectiveness of MG-LDPC under high RBERs, data are considered to be kept for a long time, i.e., the retention time values listed in Table 6. For the number of LDPC decoding engines, each decoding engine has a fixed read granularity, so this parameter determines how many engines are involved in LDPC decoding. The basic setting is five engines with granularities of 1, 2, 3, 4, and 6, respectively. To evaluate its effect, the sensitivity study adds two further engine number settings: four engines with granularities of 1, 2, 4, and 6, and six engines with granularities of 1, 2, 3, 4, 5, and 6. For the number of read successes involved in the granularity decay mechanism, three values are chosen to evaluate its effect on the read performance of DynAdp; details are shown in Table 6.

Flash Read Latency
Experimental results of flash read latency for the four methods are shown in Figure 12. Flash read latency is the total latency involved in flash sensing, data transferring, and LDPC decoding. We take PrLDPC as the baseline to compute the normalized read latency in all result figures. Compared with PrLDPC, the two proposed implementations reduce flash read latency to different degrees under the nine tested workloads, decreasing PrLDPC read latency by 20% and 45% on average, respectively. Compared with LaLDPC, FixAdp only reduces the latency by 4%, but DynAdp greatly decreases the read latency by 29% because of its RBER adaptability. These results verify that the proposed MG-LDPC method works effectively.
To explain the results in Figure 12 more clearly, Figure 13 plots the cumulative read-retry level distribution of the four methods under the workload hm0. It can be seen that the read level distributions under PrLDPC and LaLDPC are similar and mainly concentrated in levels less than 4. The read levels of FixAdp and DynAdp are obviously higher than those of the other two methods, which means that when reading data with MG-LDPC, the read levels are shifted to higher levels, avoiding many unnecessary read-retries at low levels and further reducing the read latency.
Meanwhile, for some workloads, such as proj4, the flash read latency of the proposed method cannot outperform LaLDPC. This is because LaLDPC utilizes a mapping cache to optimize flash read performance, which requires high read level locality in workloads [13]. Therefore, we performed a locality analysis of the workload proj4 under the LaLDPC method. Figure 14 compares the cache read hit ratios of LaLDPC under the evaluated workloads. We can see that proj4 shows the highest cache read hit ratio and thus possesses better read level locality, so LaLDPC exhibits better read performance on workloads with high hit ratios.

SSD Read Response Time
Experimental results of SSD read response time, which is the sum of the read latency and the queuing time in the controller, are shown in Figure 15. Three observations can be made. First, FixAdp and DynAdp both perform better than PrLDPC, improving read response time by up to 51% on average across the workloads. This is because the proposed method removes unnecessary latency in the middle and bottom layers; meanwhile, the decay mechanism largely avoids the high read delays caused by unnecessarily high read levels. Second, DynAdp performs better than LaLDPC, decreasing read response time by 31%. The access patterns of the workloads largely decide the performance variations, because both methods utilize extra storage to record previous read levels, which closely relate to workload patterns. Finally, FixAdp sometimes performs worse than LaLDPC but is more stable across workloads, improving the average response time by 3%; this is because the fixed assignment does not rely on workload patterns, as no extra storage is used. In practice, FixAdp or DynAdp can be chosen according to actual requirements.

Energy Overhead
This section analyzes the energy overhead during MG-LDPC operation. The energy overhead mainly comes from the data access process of MG-LDPC at the software level. To obtain the energy overhead results, we accumulate the energy overhead of each step in the read process when simulating MG-LDPC reads of the flash memory data [21,34]. In the simulation process, the energy overhead increases proportionally with the LDPC read level. The energy overhead for data access comprises flash sensing, data transferring, and LDPC decoding.
We model the energy consumption of reading NAND flash memory as the sum of the energy for array access (E_ac) and LDPC decoding (E_do) [21]. The array access includes flash sensing and data transferring. The total energy consumption E_r is calculated in Equation (6):

E_r = E_ac + E_do.    (6)

Since the read operation is performed simultaneously for both LSB and MSB data, the energy consumption of both LSB and MSB pages is considered. Let E_LSB and E_MSB be the read energy for an LSB page and an MSB page, respectively. In MLC flash memory, reading an MSB page uses two times more sensing reference voltages than an LSB page access; hence, the energy consumption of the array access operations for an LSB page and an MSB page can be modeled as E_ac/3 and (2/3) * E_ac, respectively. Because two pages of data are delivered simultaneously in the LSB and MSB concurrent access scheme, the decoding energy of each page is modeled as E_do/2. Therefore, the energy consumption of each page can be represented as Equations (7) and (8):

E_LSB = E_ac/3 + E_do/2,    (7)

E_MSB = (2/3) * E_ac + E_do/2.    (8)

Note that we are only concerned with the active energy and ignore the idle energy. In this simulation, NAND flash memory operating at 100 MHz with a V_ccq of 1.8 V in the synchronous mode consumes the minimum read energy. The energy parameters are shown in Table 7. Let the number of sensing reference voltages be N_s; then N_b = log_2(N_s + 1) bits are needed to represent the threshold voltage. Table 8 shows the estimated energy consumption when N_s is 3, 6, 9, and 15.
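As a worked example of Equations (6) through (8), the sketch below computes the per-page energies; the numeric E_ac and E_do inputs are placeholders rather than the measured values of Tables 7 and 8.

```python
# Worked example of the page energy model in Equations (6)-(8); the numeric
# inputs are placeholders, not the Table 7/8 measurements.
def page_read_energy(E_ac: float, E_do: float):
    E_lsb = E_ac / 3 + E_do / 2        # Eq. (7): one of three sensing steps
    E_msb = 2 * E_ac / 3 + E_do / 2    # Eq. (8): two of three sensing steps
    E_r = E_lsb + E_msb                # Eq. (6): E_r = E_ac + E_do
    return E_lsb, E_msb, E_r

print(page_read_energy(E_ac=3.0, E_do=1.0))  # -> (1.5, 2.5, 4.0)
```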
The experimental results of energy overhead are shown in Figure 16. We take PrLDPC as the baseline to compute the normalized energy overhead in the result figures. FixAdp and DynAdp both perform far better than PrLDPC and LaLDPC under the nine tested workloads. Compared with PrLDPC, the two proposed implementations decrease the energy overhead by 35% and 53% on average, respectively. The reason is that MG-LDPC skips over many low read levels and thus avoids unnecessary flash sensing, data transferring, and so forth. Compared with LaLDPC, the two proposed implementations decrease the energy overhead by 21% and 39% on average, respectively. This is because LaLDPC's performance is constrained by its cache size; if the cache is small, LaLDPC cannot perform better than MG-LDPC.

Sensitivity Study
This section presents the sensitivity study results on four factors related to the read performance of the above LDPC reading methods: retention time, P/E cycle, the number of LDPC decoding engines, and the read success number that invokes decays in the read granularity decay scheme.

The Effect of Retention Time and P/E Cycle.
The sensitivity study results on the two parameters of initial data retention time and P/E cycle are shown in Figure 17. It can be observed that when the retention time increases, both FixAdp and DynAdp still perform better than the other two methods, improving the performance of the existing LDPC methods by up to 36% and 59% on average, respectively. However, when the retention time is short, the fixed method performs worse, as shown in Figure 17(a). This is because short retention only induces low read levels, whose intermediate-level latencies are small, so a fixed LDPC decoding engine assignment is no longer efficient. When the P/E cycle increases, the read performance improvements of the proposed method become more obvious, as shown in Figure 17(b); DynAdp behaves better and reduces the response time significantly at high P/E cycles. In summary, FixAdp improves read performance for long-kept data, while DynAdp behaves well under different settings but requires extra storage overhead.

The Effect of LDPC Decoding Engine Number.
It can be observed from Figure 18 that when the number of decoding engines increases, DynAdp achieves better read performance under all nine tested workloads. Meanwhile, we notice that the improvement of DynAdp with five decoding engines over four is obvious, reaching up to 20% on average, whereas DynAdp with six decoding engines does not significantly improve read performance over five engines: only 2% of read performance is gained. Therefore, considering the design complexity, we adopt five decoding engines in the flash read method.

The Effect of the Read Success Number in the Read Granularity Decay Mechanism.
The sensitivity study results on the read success number N that invokes the read granularity decay are shown in Figure 19. It can be observed that as the read success number increases, DynAdp performs better: compared with a read success number of 2, a read success number of 8 improves read performance by up to 33%. This is mainly because increasing the parameter reduces the frequency of decays, which in turn reduces the probability of read-retries and thereby lowers the read latency. However, if the parameter is too large, the read performance does not improve much, because it may keep unnecessarily high read levels for longer, which worsens read performance. Figure 19 also shows that compared with a read success number of 4, the optimization achieved by a read success number of 8 is not obvious, improving read performance by only 7%.

RELATED WORKS
Existing works related to this article can be classified into two aspects: LDPC read methods and access performance optimization of 3D flash memories.
The first aspect aims to optimize LDPC read schemes for flash memories. Zhao et al. [58] proposed a fine-grained progressive LDPC reading method that increases one extra level when reading with a lower level fails, which induces latencies matched to the current RBERs. Sun et al. [44] proposed intra-cell data placement interleaving and intra-cell data-dependence-aware LDPC decoding to efficiently improve the LDPC decoding efficiency. Zhang et al. [55] exploited the error correlations between different flash pages to accelerate LDPC decoding convergence. Du et al. [13] proposed a latency-aware LDPC read method by exploiting read-level characteristics along with data retention. Shao et al. [38] proposed a construction method of dispersed array LDPC (DA-LDPC) codes based on an array square for flash memories; DA-LDPC codes benefit not only from the array property but also from a hybrid and efficient storage architecture due to their stair-like structure. Zhang et al. [56] proposed a PBE-aware LDPC (PAL) decoding scheme based on an observation of the pair-bit error (PBE) characteristics of MLC NAND flash memory, in which PBEs provide promotion information for LDPC decoding to reduce decoding latency.
Li et al. [24] proposed asymmetric error-aware read voltage placement based on the error characteristics of 3D flash memory, and Li et al. [25] proposed a smart sensing level placement scheme to reduce the LDPC decoding latency; both take the perspective of read voltage placement. Our work instead focuses on the read level increase granularity in the progressive decoding process, i.e., on the number of read levels. Zuolo et al. [59] exploited hardware resources in NAND flash chips to propose an optimized LDPC decoding approach, named NAND-Assisted Soft Decision (NASD), which optimizes read performance by reducing the number of data transfers. That method mainly reduces the number of data transfers in the LDPC hardware decoder, while our work focuses on the number of data sensings in the memory controller. These works are orthogonal to the proposed MG-LDPC.
The second aspect aims to optimize the access performance of 3D flash memories. Luo et al. [28] exhibited a new phenomenon of layer-to-layer process variation in the 3D flash chip and designed several methods to estimate optimal sensing levels to improve read latency. Wang et al. [49] presented a process-variation-tolerant reliability management strategy for 3D flash memories, which enhances the reliability and reduces access latency effectively. Chen et al. [7] presented a performance-boosting strategy to optimize the access performance of CT-based 3D flash by utilizing its asymmetric page access speed feature. Wu et al. [51] proposed a one-shot programming-aware data allocation policy, called OSPADA, to improve the read performance of 3D CT flash-based SSDs by enhancing read parallelism. Compared with these works, this article mainly focuses on read performance improvement of 3D flash memories by exploiting the LDPC read level variations among layers based on the RBER variations [28].

CONCLUSION
In order to investigate and optimize the read performance of LDPC codes under 3D flash error characteristics, this article proposes a multi-granularity progressive LDPC read method that applies varied read level increase granularities when reading different flash layers. Two implementations with fixed and dynamic granularity assignment are provided: the fixed strategy directly separates the layers into five parts, while the dynamic strategy decides the granularity according to previous read levels stored in extra storage space. The experimental results show that the proposed method effectively improves the read performance and reduces the energy overhead of 3D flash memories compared with existing LDPC methods. This article is the first work to re-design the LDPC read scheme for 3D flash memories.