Improving Performance of Network-on-Memory Architectures via (De-)/Compression-in-DRAM

Network-on-chips (NoCs) are envisioned to be a scalable communication substrate for Network-on-Memory (NoM) architectures. However, modern data-intensive workloads continue to overwhelm the NoC link capacity, dramatically increasing memory service latency and causing significant performance loss. We introduce DECORAM, a data (de-)/compression scheme implemented within a DRAM-based NoM architecture. DECORAM uses a lookup table (LUT) to store compressed codes of common data patterns, and exploits this LUT during LLC misses to transmit these codes via the NoC instead of the original uncompressed data. We formulate compression and decompression mechanisms as a combination of LUT-based pattern matching and prefix concatenation, which are implemented using low-latency DRAM row activations and by exploiting analog properties of the DRAM cell. To support DECORAM, we introduce a minimal design change of adding isolation transistors in a subarray to activate inter-subarray data movement based on the content of its row buffer. Our DECORAM controller reduces the compression and decompression latency by exploiting subarray-level parallelism to compress/decompress several CPU data misses simultaneously. We evaluate DECORAM using data-intensive workloads from SPEC, APACHE, PARSEC, and in-memory computing benchmark suites. Our results show that compared to a baseline NoM, DECORAM significantly improves performance (by 30% on average) and reduces energy (by 32% on average). Compared to a conventional NoC compression mechanism, DECORAM reduces memory area by 27% and energy by 12%, while delivering 7% higher performance improvement.


INTRODUCTION
Networks-on-Chip (NoCs) have become an integral part of modern systems-on-chip (SoCs) to support several connected components (mainly processing elements) with high area and energy efficiency, and strong quality-of-service (QoS) guarantees [4,8,10]. Inspired by this, an emerging trend is the Network-on-Memory (NoM) architecture, whose key idea is to use a NoC as the main interconnect for the memory modules (e.g., DRAM banks) of a memory subsystem [20]. NoM architectures can be used to build high-performance memory systems such as the Hybrid Memory Cube (HMC) [19] and wafer-scale memory [3]. A NoM architecture can reduce the latency of data block accesses between the CPU and memory, thereby increasing the performance of memory-intensive workloads such as machine learning, business analytics, and image/video processing applications [18]. On the other hand, if data blocks are already inside memory, a NoM can facilitate parallel (and faster) data movement from one memory location to another, speeding up in-memory bulk data copy and initialization operations [23].
We analyze the last-level cache (LLC) misses of a wide range of data-intensive workloads from SPEC, APACHE, PARSEC, and in-memory computing benchmark suites. Our results (see Sec. 2.4) show that even with the best performing NoC, these workloads easily overwhelm the NoC link capacity, increasing the memory service latency and data communication energy. The essence of this paper revolves around the idea of enabling data compression on the underlying NoC of a NoM architecture for gains in performance and power. Intuitively, data compression reduces NoC traffic and, in effect, lowers memory service latency and power consumption, leading to application-level performance improvement.
We propose DECORAM, a practical, lightweight data (de-)/compression scheme implemented in DRAM to reduce the data traffic in NoMs. DECORAM uses a lookup table (LUT) to store compressed codes for commonly-used data patterns of LLC misses. During workload execution, if DECORAM encounters one of those patterns, it transmits the short compressed code over the NoC instead of the long uncompressed data. We store the LUT inside a DRAM subarray. We implement compression and decompression mechanisms as a combination of LUT-based pattern matching and prefix concatenation. These operations are implemented by scheduling DRAM row activation commands and by exploiting analog properties of the DRAM cell. The following are our key contributions.
• DECORAM directly compresses and decompresses LLC misses within the DRAM banks of a NoM architecture, thus requiring no extra (separate) hardware to execute these operations. By moving compressed data over the NoC, DECORAM reduces NoC traffic, which improves application performance and energy.
• DECORAM leverages existing DRAM infrastructure to perform data compression and decompression by issuing DRAM row activation commands and exploiting analog properties of the DRAM cell to achieve bitwise operations such as row copy and majority. Thus, DECORAM minimizes the design changes needed to the existing DRAM infrastructure.
• DECORAM introduces a minimal design change to implement bitwise multiplexing, a new in-memory computing operation and a key constituent of our approach.
• We engineer the DECORAM controller to explore the trade-off between compression ratio and compression/decompression latency, reducing the critical path delay of memory accesses.

DECORAM does not store compressed data in DRAM, which would lead to a very high OS page management overhead, nor does it require the CPU to operate on compressed data, which would require changing the CPU design. We evaluate DECORAM using data-intensive workloads from SPEC, APACHE, PARSEC, and in-memory computing benchmark suites. Our results show that compared to a baseline NoM design, DECORAM significantly improves performance (by 30% on average) and reduces energy (by 32% on average). Compared to a conventional NoC compression mechanism, DECORAM reduces memory area by 27% and energy by 12%, while delivering 7% higher performance improvement.

BACKGROUND AND MOTIVATION
We provide the background necessary to understand DECORAM, along with the observations that lead to its design.

Data Compression
To understand the scope of compressing LLC miss data, consider the example of sign-bit extension, a technique to represent smaller integers (say, 8-bit) within a 64-bit word of a cache block. All the information in the word is stored in the least significant few bits. Now imagine this word being part of an LLC miss that is transmitted over the NoC. Without compression, the entire 64-bit word is transmitted either from the DRAM banks to the memory channel (for a read access) or vice versa (for a write access). With compression, the word can be compressed into 12 bits: 8 bits for the integer and 4 bits for the prefix. The compression ratio achieved for this data is 64/12 ≈ 5.3. Our studies show that there are several other frequently-used data patterns. We summarize them below.
• All Zeros: This is one of the most frequently-used data patterns, where all data bits are zeros [2]. All-zero data blocks are common due to initialization and null pointers.
• Zero Padding: These patterns include narrow data in the MSBs, with the remaining bits padded with zeros to increase the width of the pattern to the closest power of two. Zero padding is commonly used in audio and video processing applications [13].
• Others: There are application-specific patterns such as character repetition and low color gradient image data [12].

Figure 1 illustrates the compression and decompression mechanisms. Commonly-used patterns and their prefixes are stored in a table called the Pattern Match Table (PMT). Compression is performed on a data block by looking up the PMT and using the short prefix instead of the original data pattern.
Figure 1: Compression/decompression using a PMT.

For compression, the incoming uncompressed data $D$ (from a DRAM bank or the memory channel) is analyzed to extract the pattern $P$. This pattern is searched in the PMT. Let the associated prefix of this pattern be $f$. The compressed data is formed by concatenating the prefix $f$ with the remaining data $R$ as $f \circ R$. This compressed data is then transmitted via the NoC. The decompression mechanism follows the exact same procedure: the prefix $f$ is searched in the PMT, and the corresponding data pattern $P$ is concatenated with the remaining bits $R$ to form the uncompressed data $D$. Therefore, compression and decompression can be formulated as a combination of pattern matching and prefix concatenation.
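As a concrete illustration, the minimal Python sketch below emulates PMT-based compression and decompression for three common patterns; the prefix assignments and field widths are our assumptions, not DECORAM's actual tables.

```python
# Minimal software model of PMT-based (de-)compression for 64-bit words.
def compress(word: int):
    """Return (prefix, payload_bits, payload) if `word` matches a pattern,
    or None if it is uncompressible."""
    if word == 0:                        # All Zeros
        return (0b0001, 0, 0)
    if word >> 8 == 0:                   # 8-bit sign extension, sign-bit = 0
        return (0b0010, 8, word & 0xFF)
    if word >> 8 == (1 << 56) - 1:       # 8-bit sign extension, sign-bit = 1
        return (0b0011, 8, word & 0xFF)
    return None

def decompress(prefix: int, payload: int) -> int:
    if prefix == 0b0001:
        return 0
    if prefix == 0b0010:
        return payload
    if prefix == 0b0011:
        return (((1 << 56) - 1) << 8) | payload
    raise ValueError("unknown prefix")

word = 0x0000_0000_0000_007B             # sign-extended 8-bit integer 123
prefix, nbits, payload = compress(word)
assert decompress(prefix, payload) == word
# Only 4 (prefix) + 8 (payload) = 12 of 64 bits travel: ratio 64/12 ~ 5.3
```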

Network-on-Memory (NoM)
To the best of our knowledge, DECORAM is the first work that implements data compression inside DRAM for Network-on-Memory (NoM) architectures. Our goal is to leverage memory-centric computing approaches in implementing the compression scheme. To understand the data compression challenges and implementation scope, we briefly introduce the network-on-memory (NoM) design [20] in this section. Figure 2 illustrates a NoM design. Here, each DRAM module consists of multiple chips (Fig. 2(a)), each of which contains multiple banks (e.g., 8 banks per chip for DDR3 and 16 banks per chip for DDR4). A bank is internally divided into subarrays (Fig. 2(b)), each of which has its own local row buffer. A subarray is a two-dimensional array of capacitor-based DRAM cells (Fig. 2(c)). Subarrays are the smallest memory structures that can be accessed in parallel with respect to each other [14]. We elaborate on the internal details of a subarray when we discuss DECORAM in Section 3.
In a NoM design, all DRAM banks of all chips are physically organized as NoC tiles that are arranged in a mesh, i.e., a Manhattan-like structure (Fig. 2(d)). However, they continue to maintain the logical hierarchy of channels and ranks [20]. In this way, no change is needed to the interface between the CPU and DRAM. A tile consists of a router which connects to its four neighboring tiles along the North (N), East (E), South (S), and West (W) directions (Fig. 2(e)). A DRAM bank is connected to the router on its Local (L) direction via the network interface (NI). The NI hardware consists of injection and ejection buffers to manage the data communication through the router (Fig. 2(f)). Figure 2(g) shows the architecture of a router. A few virtual channels (VCs) are assigned to each link to increase transmission concurrency. Inside the NoC, data is split into flits (typically 64 bits wide), with the head flit carrying the required metadata (e.g., for routing computation).

Motivation
(1) Common patterns appear frequently in the data of memory accesses. Figure 3 shows the frequency of all common patterns (as discussed in Section 2.1) in LLC misses of a wide range of data-intensive workloads that are used to evaluate DECORAM (see Section 6). Our observations show that common data patterns constitute a significant fraction (91%, on average) of LLC misses of data-intensive workloads. If we compress these common patterns, we can significantly reduce the NoC traffic.

(2) The hardware overhead of data compression engines can be eliminated using the internal architecture and analog properties of DRAM. Previous work [7,22,23] has proposed using the dense and parallel internal architecture and analog properties of DRAM to implement fast and bulk (at the granularity of a DRAM row) bitwise operations inside DRAM, such as bitwise majority and bulk copy. We show that data compression can be implemented inside a DRAM bank using these bitwise operations, eliminating the need for extra hardware to execute data compression.
(3) The significant reduction in NoM traffic compensates for the compression and decompression delays in the overall performance of the data-intensive workloads.
Typically, a 64-byte cache block is converted into eight 64-bit flits in NoM. Without compression, the end-to-end latency of this data inside memory ($L_{uncomp}$) includes the transmission latency of 8 flits over the NoC and the service latency (i.e., read or write) inside the DRAM bank. If this data is compressed to 4 flits, the end-to-end latency inside memory ($L_{comp}$) includes the transmission latency of 4 flits over the NoC, the service latency inside the DRAM bank, the compression delay (at the source), and the decompression delay (at the destination). We show that $L_{comp}$ is significantly lower than $L_{uncomp}$ when considering all data accesses of a given workload.
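The following back-of-the-envelope sketch makes this trade-off concrete; the per-flit latency, DRAM service time, and codec delays below are assumed placeholder values, not DECORAM's measured parameters.

```python
# Illustrative latency arithmetic; all constants are assumed placeholders,
# not measured DECORAM parameters.
FLIT_CYCLES = 4        # cycles to move one flit along the NoC path (assumed)
DRAM_SERVICE = 30      # read/write service latency inside the bank (assumed)
COMP, DECOMP = 10, 10  # in-DRAM compression/decompression delays (assumed)

l_uncomp = 8 * FLIT_CYCLES + DRAM_SERVICE                  # 62 cycles
l_comp = 4 * FLIT_CYCLES + DRAM_SERVICE + COMP + DECOMP    # 66 cycles

# In isolation a compressed access can even be slower, but queuing delay
# grows with traffic: halving the flit count of ~91% of misses cuts the
# congestion every access experiences, so total latency drops workload-wide.
print(l_uncomp, l_comp)
```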

Observations Leading to DECORAM
Figure 4 illustrates the increase in the execution time (reported in million cycles) of a microbenchmark as we increase the number of memory requests from 1K to 256K, using the evaluation setup described in Section 6. We observe that the execution time increases significantly with the number of memory requests. This is due to an increase in network congestion, which increases the buffer utilization in each router, thereby increasing the routing delay. We further analyze this for real workloads in Section 7.

DECORAM: DATA (DE-)/COMPRESSION INSIDE DRAM
Compression in DECORAM is triggered when a read request accesses a row inside a DRAM bank (by a DRAM activation command). DECORAM does not compress the requested cache block after reading it out of the DRAM bank, which would incur a latency penalty and hardware overhead. Instead, it compresses the accessed row in place and transmits the compressed data of the requested cache block. DECORAM implements compression and decompression by taking advantage of DRAM's analog properties and by scheduling activation commands, minimizing the need for extra hardware inside DRAM. We provide a simple example to illustrate the exact compression/decompression mechanism of DECORAM. In the following, we describe the compression mechanism, which is a combination of pattern matching using a PMT followed by concatenating the prefix of the matched pattern with the data. The decompression mechanism follows the exact same procedure but with different patterns and prefixes.

Pattern Matching in DECORAM
Without loss of generality, we illustrate matching data $D$ against the 8-bit sign extension pattern with sign-bit = 0. A data word that matches this pattern has a hexadecimal representation of the form 0000_0000_0000_00XY. Our compression mechanism replaces the leading zeros from the MSB with a prefix (of fewer bits), while retaining the low-order byte. We define mask $M$ for the pattern as
$$M = 1_{56} \circ 0_{8}, \quad (1)$$
where $1_{56}$ is a 56-bit binary number with all bits set to 1, $0_{8}$ is an 8-bit binary number with all bits set to 0, and $\circ$ denotes bit concatenation. We define the complementary mask $M'$ such that the bitwise majority of $M$, $M'$, and $D$ passes the upper bits of $D$ through while forcing the low 8 bits to 0; for this pattern,
$$M' = 0_{56} \circ 0_{8} = 0_{64}. \quad (2)$$
We represent $D$ with two components $D_H$ (the 56 most significant bits) and $D_L$ (the 8 least significant bits) as
$$D = D_H \circ D_L. \quad (3)$$
We define pattern matching of $D$ with $M$ as a combination of the following three sub-operations.
• Matching (MAT): We perform a bitwise majority operation on $M$, $M'$, and $D$ to match $D$ against $M$. The result of this operation is
$$X = \mathrm{MAJ}(M, M', D) = D_H \circ 0_{8}. \quad (4)$$
Bitwise majority can be achieved by simultaneously activating three DRAM rows, containing $M$, $M'$, and $D$, respectively [23]. This triple row activation exploits the analog capacitor behavior of DRAM cells, which we elaborate on in Section 4. In Equation 4, if $D$ matches $P$, then the 56 most significant bits of $X$ are $0_{56}$, i.e., $X = 0_{64}$. We rewrite Equation 4 as
$$X = \begin{cases} 0_{64} & \text{if } D \text{ matches } P \\ Y_{64} & \text{otherwise,} \end{cases} \quad (5)$$
where $Y_{64}$ is some combination of 1's and 0's. Our objective is to encode a selector $S$ as
$$S = \begin{cases} 1_{64} & \text{if } D \text{ matches } P \\ 0_{64} & \text{otherwise.} \end{cases} \quad (6)$$
This encoding simplifies the prefix concatenation step of DECORAM, which we describe in Section 3.2. To obtain Equation 6 from Equation 5, we introduce the following two sub-operations.

• Inversion (INV):
We perform bitwise NOT on $X$ to obtain an intermediate encoding $E$ as follows:
$$E = \overline{X}. \quad (7)$$
This sub-operation makes $E = 1_{64}$ for a match. Otherwise, $E$ is some combination of 0's and 1's.
• Zero-Take-All (ZTA): We introduce a minimal design change involving adding a few transistors to propagate a 0 to all bits of $E$ when at least one of the bits in $E$ is a 0 (i.e., when $D$ does not match $P$). Essentially, a single zero in $E$ takes over the entire result in the case of a mismatch. On the contrary, for a match, all bits in $E$ are set to 1, so $E$ remains unchanged by this sub-operation. We describe this sub-operation as
$$S = \mathrm{ZTA}(E) = \begin{cases} 1_{64} & \text{if } E = 1_{64} \\ 0_{64} & \text{otherwise.} \end{cases} \quad (8)$$
We describe our design changes in Section 4.3.
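To make the dataflow concrete, the following bit-level Python sketch emulates the MAT, INV, and ZTA sub-operations for this pattern. The majority primitive mirrors what triple row activation computes; the mask values follow our reconstruction of Equations 1-2, and the helper names are ours, not DECORAM's.

```python
# Software model of pattern matching (MAT -> INV -> ZTA) for the
# 8-bit sign-extension pattern with sign-bit = 0.
W = 64
ONES = (1 << W) - 1

def maj(a: int, b: int, c: int) -> int:
    """Bitwise majority, the primitive that triple row activation provides."""
    return (a & b) | (b & c) | (a & c)

M  = ((1 << 56) - 1) << 8   # M  = 1_56 . 0_8  (Eq. 1)
Mp = 0                      # M' = 0_64        (Eq. 2)

def selector(d: int) -> int:
    x = maj(M, Mp, d)                 # MAT: X = D_H . 0_8          (Eq. 4)
    e = ~x & ONES                     # INV: E = 1_64 iff D matches (Eq. 7)
    return ONES if e == ONES else 0   # ZTA: one 0 bit takes over S (Eq. 8)

assert selector(0x0000_0000_0000_007B) == ONES  # matches the pattern
assert selector(0x0000_0000_0001_007B) == 0     # upper bits not all zero
```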

Prefix Concatenation in DECORAM
Once a match is found, DECORAM concatenates the corresponding prefix with the data. Without loss of generality, we let $f$ be a 4-bit prefix corresponding to the 8-bit sign extension pattern with mask $M$. We represent the prefix as a 64-bit mask $F$ as
$$F = 0_{52} \circ f \circ 0_{8}. \quad (9)$$
The complementary prefix mask $F'$ is defined as
$$F' = 0_{52} \circ f \circ 1_{8}. \quad (10)$$
To formulate the prefix concatenation problem, we further represent $D$ from Equation 3 as
$$D = D_{HH} \circ D_{HL} \circ D_L, \quad (11)$$
where $D_{HH}$ and $D_{HL}$ are the 52 and 4 most significant bit groups of $D_H$, respectively. Prefix concatenation is a combination of two sub-operations:
• Encoding (ENC): We perform bitwise majority on $F$, $F'$, and $D$ to insert the prefix into the data. The result is
$$C = \mathrm{MAJ}(F, F', D) = 0_{52} \circ f \circ D_L. \quad (12)$$
If $D$ matches the pattern $P$, then $C$ is its prefix-encoded version. Otherwise, we need to discard the content in $C$.
• Multiplexing (MUX): The last step of DECORAM is, therefore, multiplexing between the original data $D$ and its prefix-encoded version $C$. We note that $S = 1_{64}$ for a match and $0_{64}$ otherwise. Therefore, we perform bitwise multiplexing between $C$ and $D$ with $S$ as the selector to obtain the result $R$ as
$$R = (S \wedge C) \vee (\overline{S} \wedge D). \quad (13)$$
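Continuing the example, a self-contained sketch of ENC and MUX is shown below; the prefix value and the field layout of Equations 9-12 are reconstructions for illustration.

```python
# Software model of prefix concatenation (ENC) and multiplexing (MUX).
ONES = (1 << 64) - 1

def maj(a: int, b: int, c: int) -> int:
    return (a & b) | (b & c) | (a & c)

f  = 0b0010                   # 4-bit prefix for the pattern (assumed value)
F  = f << 8                   # F  = 0_52 . f . 0_8  (Eq. 9)
Fp = (f << 8) | 0xFF          # F' = 0_52 . f . 1_8  (Eq. 10)

D = 0x0000_0000_0000_007B     # data word that matches the pattern
S = ONES                      # ZTA selector: 1_64 because D matched

C = maj(F, Fp, D)             # ENC: C = 0_52 . f . D_L  (Eq. 12)
R = (S & C) | (~S & ONES & D) # MUX: R = C on a match, D otherwise (Eq. 13)
assert C == (f << 8) | 0x7B and R == C
```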

Additional Pattern-Specific Processing
Our workload study shows a few additional patterns that are commonly observed in memory accesses (see Table 2). DECORAM can match these patterns by disabling the INV sub-operation. To illustrate this, consider the pattern 8-bit sign extension with sign-bit = 1. A data word matching this pattern has a hexadecimal representation of the form FFFF_FFFF_FFFF_FFXY. Our compression mechanism replaces all leading 1's with a prefix. We rewrite Equations 1-8 (with the mask bits in the pattern region set to 1's instead of 0's) such that
$$X = \begin{cases} 1_{64} & \text{if } D \text{ matches } P \\ \text{some combination of 1's and 0's} & \text{otherwise.} \end{cases} \quad (14)$$
We do not use prefixes that are all 0's or all 1's for any pattern in DECORAM, since these bit arrangements are used by the source NI to identify whether a received data word is compressed.
We use a total of 9 patterns (5 baseline and 4 additional) to evaluate DECORAM. As Figure 3 shows, the selected 9 patterns constitute a significant fraction (91% on average) of all data patterns in the evaluated data-intensive workloads. Other application-specific data patterns may exist in these workloads; our evaluations show that such application-specific patterns are significantly fewer (less than 10% of all data patterns). DECORAM can be extended to support such application-specific patterns if the reduction in NoM traffic compensates for the delay of in-memory compression in the overall performance of the application. Our future work will thoroughly analyze the performance benefit of using these application-specific patterns in DECORAM while evaluating hardware and latency overheads.

Final Data Processing
A typical 64-byte cache block consists of eight 64-bit data words and is received from the requested DRAM bank as eight results $R_i$ (see Equation 13), with $i = 0$ to $7$. Once an $R_i$ is received at the source NI of the NoC, the NI identifies whether it contains compressed data based on the leading all-zeros (for base patterns) or leading all-ones (for additional patterns) and removes these bits starting from the MSB. The remaining bits from all eight $R_i$ are packetized into 64-bit flits, which are then routed over NoC links towards their destination.
The destination NI locates flits that contain compressed data using DECORAM metadata, which is stored in the unused portion of the header flit of the packet. DECORAM metadata consists of one bit per flit of the packet, which is 1 when the flit in the corresponding position contains compressed data and 0 when it contains uncompressed data. The size of DECORAM metadata depends on the number of flits inside the packet and is at most 8 bits; the unused portion of the header flit is much larger than 8 bits [9]. The destination NI then transmits each compressed and uncompressed data word of the packet to the DRAM bank. DECORAM decompression is triggered inside the destination DRAM bank by a DRAM write command. Once a data word is decompressed, DECORAM writes it to the destination row in the DRAM bank. Uncompressed data is written to the destination row without any change, since it does not match any DECORAM pattern.
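The sketch below illustrates, under assumed names and layout, how a destination NI could walk the metadata bitmap to decide which flits need in-DRAM decompression:

```python
# Sketch: pair each flit of a packet with its 'compressed' flag taken
# from the per-flit metadata bitmap in the header flit.
def split_packet(header_meta: int, flits: list[int]):
    return [(flit, bool((header_meta >> i) & 1)) for i, flit in enumerate(flits)]

# A 4-flit packet whose flits 0 and 2 carry compressed data:
for flit, compressed in split_packet(0b0101, [0xA, 0xB, 0xC, 0xD]):
    # compressed flits trigger in-DRAM decompression at the destination
    # bank; uncompressed flits are written to the destination row as-is
    print(hex(flit), "decompress" if compressed else "write-through")
```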
We present implementation details of DECORAM next.

DECORAM IMPLEMENTATION DETAILS
DECORAM leverages the subarray architecture of DRAM and the analog properties of its cells to implement all sub-operations.

Implementing Multiplexer
Figure 5 shows the internal architecture of a DRAM module, organized into subarrays. Each subarray contains a row buffer, which includes a set of sense amplifiers. Each sense amplifier senses the content of a DRAM cell located at the intersection of a bitline and a wordline, as shown on the right. When a wordline is enabled, the content of a row is copied into the row buffer, which can then be read or overwritten with new content. Modern DRAMs support inter-subarray links to quickly move row buffer content between adjacent subarrays [7]. An inter-subarray link consists of an isolation transistor, which can be enabled or disabled using the content of a latch; the circuitry is shown on the left. Link latches can be programmed externally using a special DRAM command issued from the memory controller. Additionally, these latches can be programmed with the row buffer content.
We leverage inter-subarray links to implement the MUX sub-operation of DECORAM as follows. Consider multiplexing between data $A$ and $B$ based on the select signal $s$. We load $A$ and $B$ into the row buffers of subarrays $i$ and $(i+1)$, respectively, and load $s$ into the latch array of subarray $i$. If $s = 1$, the inter-subarray links are enabled, which copies $A$ from subarray $i$'s row buffer into the inputs of the sense amplifiers in subarray $(i+1)$. Subsequently, when these sense amplifiers are enabled, the old content $B$ is overwritten with $A$. On the other hand, if $s = 0$, the inter-subarray links are disabled and the old content $B$ is retained in subarray $(i+1)$'s row buffer.
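A functional model of this latch-gated copy is sketched below (one list element per bitline); it abstracts away the circuit-level details.

```python
def intersubarray_mux(rb_i: list[int], rb_i1: list[int], latch: list[int]) -> list[int]:
    """Per bitline: if the latch bit is 1, the bit from subarray i's row
    buffer overwrites subarray i+1's; otherwise the old bit is kept."""
    return [a if s else b for a, b, s in zip(rb_i, rb_i1, latch)]

A = [1, 0, 1, 1]          # row buffer of subarray i
B = [0, 1, 1, 0]          # row buffer of subarray i+1
s = [1, 1, 1, 1]          # latch array (all ones: A overwrites B everywhere)
assert intersubarray_mux(A, B, s) == A
assert intersubarray_mux(A, B, [0, 0, 0, 0]) == B
```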

Implementing Bitwise Majority Logic
DRAM operations consist of ACTIVATE (A)-PRECHARGE (P) cycles. The ACTIVATE command opens a row and copies the content of its DRAM cells into the row buffer. The PRECHARGE command closes the row after copying the row buffer content back into its DRAM cells. In Figure 6(a), we illustrate the bitwise-majority logic implementation in a subarray by issuing three back-to-back ACTIVATE commands to three rows (A, B, and C, respectively), without issuing PRECHARGE commands [23]. The voltage at the input of a sense amplifier is the result of charge sharing among the three DRAM cells (belonging to the three open rows) on its bitline. On the leftmost bitline, the voltage is $\approx \frac{2}{3}V_{DD}$, while that on the rightmost bitline is $\approx \frac{1}{3}V_{DD}$. Due to the sense amplifier action, an input voltage above $V_{DD}/2$ is reinforced to $V_{DD}$, while one below $V_{DD}/2$ is depleted to 0. Activating more than one row destroys their content. To preserve the content, bitwise majority logic is implemented by issuing a sequence of ACTIVATE-ACTIVATE-PRECHARGE (AAP) primitives.
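The idealized model below captures this decision rule; the voltage values are simplifications of the analog behavior.

```python
# Toy charge-sharing model of triple-row activation: three cells share
# charge on a bitline, and the sense amplifier snaps the resulting
# voltage to VDD or 0 around VDD/2, yielding a bitwise majority.
VDD = 1.0

def triple_activate(bit_a: int, bit_b: int, bit_c: int) -> int:
    v = (bit_a + bit_b + bit_c) * VDD / 3.0   # charge sharing of three cells
    return 1 if v > VDD / 2 else 0            # sense amplifier decision

# Two cells charged, one empty -> ~2/3 VDD -> reinforced to VDD (logic 1)
assert triple_activate(1, 1, 0) == 1
# One cell charged -> ~1/3 VDD -> depleted to 0
assert triple_activate(1, 0, 0) == 0
```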

Implementing Zero-Take-All Circuitry
We introduce the Zero-Take-All (ZTA) circuitry, a simple design that propagates input bits to output bits such that a single zero among the input bits takes over all output bits. If all input bits are one, they are propagated to the output bits without any change. Figure 7 shows the Zero-Take-All circuitry for the 64-bit input $E$ (i.e., $E_0$ to $E_{63}$) and output $S$ (i.e., $S_0$ to $S_{63}$). The circuit consists of ZTA units, each of which takes two input bits and generates two output bits.
Inside each ZTA unit, each input bit enables a PMOS transistor to propagate its zero value, or an NMOS transistor to propagate a feedback from the preceding transistor. If at least one input bit is zero, it is propagated as feedback to all bits of $S$ through the NMOS transistors. The Zero-Take-All circuitry is enabled by the memory controller by (1) raising the $En_1$ signal and (2) resetting $En_1$ and raising the $En_2$ signal. $En_1$ connects $V_{dd}$ to the first ZTA unit in order to propagate a one to all bits of $S$ if all bits of $E$ are ones. $En_2$ enables the propagation of a zero to the entire $S$. SPICE simulations show that, from the time the $En_1$ signal is enabled, it takes approximately 0.5 ns for zero input bits to take over $S$.

DECORAM CONTROLLER: THE COMPRESS-PREFETCH CYCLE
Figure 8 shows our implementation of the DECORAM controller to perform compression and decompression inside DRAM. We exploit the following DRAM capabilities.
(1) Subarrays are the smallest memory structures within a bank that can be accessed in parallel with respect to each other [14]. Each dual-contact DRAM cell is connected to both the bitline and the negated bitline ($\overline{\textit{bitline}}$). The content of a row is captured in the row buffer by enabling the regular wordline (called the d-wordline), while its negation, i.e., the bitwise-NOT of the content, is captured in the row buffer by enabling the negation (n) wordline.

(2) To support fast bitwise operations within a DRAM subarray, the entire row address space is divided into three groups: the B-group (bitwise group), the C-group (control group), and the D-group (data group) [23]. The C and D groups consist of rows that are decoded using a regular row decoder, while the decoder for the B-group incorporates additional functionality to simultaneously activate multiple rows from this group.

We use two adjacent subarrays $i$ and $(i+1)$ to perform DECORAM sub-operations within a bank. Adjacent subarrays allow fast copying of data from one subarray to another. Subarray $i$ performs the pattern matching operation (sub-operations MAT, INV, and ZTA), while subarray $(i+1)$ performs the ENC sub-operation of prefix concatenation. Finally, both subarrays jointly implement the MUX.
We formulate DECORAM as a COMPRESS-PREFETCH cycle. We use subarray $i$ to store the pattern masks $M_k$ and their complements $M'_k$ in its D-group (Eqs. 1 and 2). If there are $N$ patterns to compare, we use $2N$ rows to store them. Henceforth, we call this subarray the PM subarray. We use subarray $(i+1)$ to store the prefix masks $F_k$ and their complements $F'_k$ in its D-group (Eqs. 9 and 10). Due to the one-to-one mapping between patterns and prefixes, we also use $2N$ rows in this subarray, which we call the PA subarray. A typical DRAM bank consists of at least 32K rows [1] (64 subarrays, each with 512 rows). DECORAM uses 9 patterns (5 baseline in Table 1 and 4 additional in Table 2), which require 36 rows in a DRAM bank to store the patterns and prefixes. This shows that the storage space overhead of DECORAM is negligible (0.1% or lower in typical DRAM banks). We now describe sub-operation scheduling using the two subarrays.

The COMPRESS Operation
Table 3 reports the content and addresses of the B-group rows that are used to store DECORAM patterns and prefixes and to execute the bitwise operations for pattern matching and prefix concatenation. Equation 15 shows the steps that the DECORAM controller takes to perform pattern matching and prefix concatenation for one of the 9 patterns.
Figure 9 shows the step-by-step actions of our DECORAM controller to implement Equation 15. The AAP primitive in step 1 performs bitwise-majority on rows T0, T1, and T2 of the PM subarray and stores its bitwise-NOT ($E$ in Eq. 7) in DCC0. The first A of this primitive activates B12 (i.e., the three source rows, as described in Table 3). Upon activating the three rows, the result is the bitwise-majority ($X$ from Eq. 4) and is available at the subarray's row buffer (RB) ❶. We describe this in Figure 6(a). The next A in this AAP primitive activates B5, which copies the row buffer's content to DCC0 ❷.
Step 2 loads the row buffer content of the PM subarray into the latch array (LA) to perform the zero-take-all functionality ❸. Step 3 performs an AAP on three rows (T0, T1, T2) of the PA subarray and saves the result ($C$ in Eq. 12) in a B-group row. The result is also available in the subarray's row buffer ❹. Step 4 performs a RowClone operation to copy this result into the PM subarray's row buffer ❺. Step 5 performs an AAP in the PA subarray to copy $D$ back to B2; its row buffer is also updated with content $D$ ❻. Finally, step 6 performs ZTA to overwrite the content of the PA's row buffer ($D$) with the PM's row buffer ($C$) if the data $D$ matches the target pattern ❼.
If there are $N$ patterns, the DECORAM controller repeats Equation 15 for all $N$ patterns until a match is found for data $D$. If there is no match for $D$, it is uncompressible and its original content remains in the PA's row buffer. Figure 10 shows how the DECORAM controller performs Equation 15 for compressible and uncompressible 64-bit data words stored in a row. We assume that the row consists of $M{+}1$ 64-bit data words ($D_0$ to $D_M$), of which $D_0$ and $D_1$ match patterns $P_0$ and $P_5$, respectively, while $D_M$ is uncompressible; the same steps are applied to the other data words in the row. When the data row is accessed, the DECORAM controller starts by pattern-matching $D_0$ to $D_M$ against $P_0$, executing steps 1 and 2 of Equation 15 in the PM subarray ❶. The output of the pattern match is $1_{64}$ (64-bit all ones) for $D_0$, while it is $Y_{64}$ (a combination of zeros and ones) for the remaining data words. The PA subarray performs prefix concatenation (step 3 of Equation 15) to compress the data words with prefix mask $F_0$ and its complement ❷.
We use $C_k^i$ to represent the compressed data of $D_i$ that matches $P_k$. The DECORAM controller copies the compressed words $C_0^0$ to $C_0^M$ to the PM's row buffer and the original data words $D_0$ to $D_M$ to the PA's row buffer ❸. Finally, the DECORAM controller performs ZTA, which overwrites only the copy of $D_0$ in the PA's row buffer with $C_0^0$, since it matches $P_0$, while $D_1$ to $D_M$ retain their original content ❹. The same steps are repeated for patterns $P_1$ to $P_4$, and the PA's row buffer keeps $C_0^0$ and $D_1$ to $D_M$ because no pattern match is found.
The DECORAM controller then performs the pattern matching of $D_0$ to $D_M$ against pattern $P_5$ and finds a match only for $D_1$ ❺. The output of the pattern matching is $1_{64}$ for $D_1$ and $Y_{64}$ for the other data words. The PA subarray repeats step 3 of Equation 15 to perform prefix concatenation with prefix mask $F_5$ and its complement ❻. The DECORAM controller copies the compressed words $C_5^0$ to $C_5^M$ to the PM's row buffer and the last compressed row (from ❷) to the PA's row buffer ❼. Upon executing ZTA, the original $D_1$ in the PA's row buffer is overwritten with its compressed data $C_5^1$ ❽. The DECORAM controller repeats the same steps for the remaining patterns and updates the PA's row buffer with the compressed data of the words for which it finds a match. Since the DECORAM controller does not find any match for $D_M$, this data word retains its original content in the PA's row buffer ❾.
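At a high level, the COMPRESS loop over a row can be summarized by the following sketch. The match/encode callbacks stand in for the per-pattern AAP and ZTA command sequences; they and the pattern table are assumptions of this illustration.

```python
# Functional summary of the per-row COMPRESS loop: try each pattern in
# turn; a matching word is overwritten (ZTA-gated) with its prefix-encoded
# form, and unmatched words keep their original content.
def compress_row(row: list[int], patterns: list[dict]) -> list[int]:
    out = list(row)                        # models the PA row buffer
    matched = [False] * len(row)
    for p in patterns:                     # repeat Eq. 15 for each pattern
        for k, word in enumerate(row):
            if not matched[k] and p["match"](word):
                out[k] = p["encode"](word) # ZTA-gated overwrite
                matched[k] = True
    return out

patterns = [
    {"match": lambda w: w == 0,      "encode": lambda w: 0b0001 << 8},        # All Zeros
    {"match": lambda w: w >> 8 == 0, "encode": lambda w: (0b0010 << 8) | w},  # 8-bit sign-ext.
]
row = [0x7B, 0xDEADBEEF, 0x0]
assert compress_row(row, patterns) == [0x27B, 0xDEADBEEF, 0x100]  # middle word stays uncompressed
```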

The PREFETCH Operation
The PREFETCH operation of DECORAM brings the PM and PA subarrays to their prefetch states, as shown in Fig. 8. In this state, we copy a target pattern and prefix, along with their complements, into the B-groups of their respective subarrays. DECORAM places one copy of each in the D- and B-groups of the PM and PA subarrays. Overall, prefetching requires three B-group rows in the PM subarray (for $M_k$, $M'_k$, and the data $D$) and three in the PA subarray (for $F_k$, $F'_k$, and $D$).

EVALUATION METHODOLOGY

Experimental Setup
To evaluate DECORAM, we develop a full-system simulator; Table 4 lists the configuration parameters.

Evaluated Designs
We evaluate the following NoM designs.
• Baseline [20]: This is our baseline NoM design, where each 64-byte cache block is packetized into 64-bit flits at the NI and then routed via NoC routers to its destination.
• BaselineCompress [24]: This is our baseline where the NI design is modified to include the PMT and compression and decompression engines. A cache block is compressed at the source router using the PMT. The compressed data is packetized into flits and communicated over the NoC. At the destination router, flits are depacketized and the data is decompressed. By moving compressed data over the NoC, BaselineCompress improves application performance. However, due to its NI design, it has a high area and power overhead.

Performance

Figure 11 plots the execution time for each evaluated workload, normalized to baseline. The absolute execution time (in million cycles) for the baseline design is reported for easier comparison. The results for BaselineCompress and DECORAM include the delay of their compression and decompression operations. We make the following observations. First, when compression is enabled inside the NI of NoC routers, cache blocks are compressed into fewer flits than in baseline, which routes uncompressed cache blocks via the NoC. So, the NoC traffic is lower in BaselineCompress, which improves performance over baseline. Second, DECORAM further improves performance with respect to BaselineCompress by implementing the compression and decompression mechanisms using in-memory operations, which are faster and more area- and energy-efficient (we present area and energy analysis in Sections 7.5 and 7.4, respectively). We observe on average 8% lower workload execution time for SPEC workloads, 4% lower for mixed workloads, 11% lower for the apache workload, 12% lower for PARSEC workloads, and 2% lower for in-memory workloads.
Although the compression and decompression operations have delays, we observe significant improvement in workload performance because of a considerable reduction in NoC traffic and congestion (NoC traffic results are reported in Section 7.2). Overall, considering all evaluated workloads, DECORAM delivers higher performance than both baseline (average 30% higher) and BaselineCompress (average 7% higher).

NoC Traffic
Figure 12 plots the buffer utilization of NoC routers for our evaluated workloads, normalized to baseline. Buffer utilization can be used to estimate the NoC traffic in a NoM design, as we analyzed in Section 2.4. We observe that the buffer utilization in DECORAM is on average 55% lower than baseline for SPEC workloads, 67% lower for mixed workloads, 37% lower for the apache workload, 63% lower for PARSEC workloads, and 84% lower for in-memory workloads. Overall, the average buffer utilization of DECORAM is 72% lower than baseline considering all evaluated workloads. These improvements arise because DECORAM sends fewer flits per cache block for each memory access, which reduces congestion.

Compression Ratio
DECORAM uses 9 patterns (5 baseline and 4 additional) by default (DECORAM-default). The first bar in Figure 13 shows the average data compression ratio achieved using DECORAM-default for the evaluated workloads. The compression ratios are for the entire workload, including both (1) data that matches these 9 patterns and (2) data that does not match any pattern. The average compression ratio is 5.0 for SPEC workloads, 4.8 for mixed workloads, 4.3 for the apache workload, 3.6 for PARSEC workloads, and 6.4 for in-memory workloads. The average compression ratio achieved across all evaluated workloads is 6. We also report two additional results: the compression ratio of DECORAM using only the baseline patterns in Table 1 (DECORAM-baseline) and that using only the additional patterns in Table 2 (DECORAM-additional). The average compression ratios using DECORAM-baseline and DECORAM-additional are 5.7 and 3.3, respectively, across all evaluated workloads.

DRAM Energy
Figure 14 plots the total energy consumption of the evaluated workloads, normalized to baseline. For each workload, the total energy consumption is measured by adding the energy consumed inside the NoC and the DRAM banks. We make the following observations. First, BaselineCompress has lower total energy (on average, 22% lower) than baseline, due to the reduction of NoC traffic enabled by compression. Second, DECORAM has even lower energy (on average, 32% lower than baseline and 12% lower than BaselineCompress). This improvement is because DECORAM implements compression/decompression inside the DRAM, whereas BaselineCompress implements a LUT for pattern storage and logic for compression/decompression inside the NI.

Latency Breakdown
Figure 16 plots the total latency, broken down into queue, router, and routing delays. We observe that these delays constitute 86%, 12%, and 2% of the total delay, respectively.

CONCLUSIONS
We propose DECORAM, a data compression and decompression mechanism implemented inside DRAM by leveraging the dense subarray architecture of a modern bank and exploiting the analog properties of DRAM cells. DECORAM formulates these mechanisms as a sequence of sub-operations, which are scheduled on the subarrays by exploiting subarray-level parallelism. We evaluate DECORAM for network-on-memory (NoM) architectures using data-intensive workloads from SPEC, APACHE, PARSEC, and in-memory computing benchmark suites. Our results show that compared to a baseline NoM design, DECORAM significantly improves performance (by 30% on average) and reduces energy consumption (by 32% on average). We also implement a conventional NoC compression mechanism and integrate it into the baseline. Compared to this compression-enabled baseline, DECORAM reduces memory area by 27% and energy consumption by 12%, while delivering 7% higher performance improvement.

Figure 2 :
Figure 2: Architecture of the Network-on-Memory (NoM) design. (a) DRAM organization using the Network-on-Chip (NoC). (b) A DRAM bank with subarrays. (c) A subarray with DRAM cells. (d) A NoM with memory tiles in a Manhattan-like structure. (e) A memory tile with a router and DRAM bank. (f) Network interface connecting a DRAM bank to a router. (g) Router architecture.

Figure 3 :
Figure 3: Frequency of common patterns in data of LLC misses of data-intensive workloads.

Figure 4 :
Figure 4: Execution time with different network loads.

Figure 5 :
Figure 5: Subarray organization with inter-subarray links to copy row buffer content between adjacent subarrays.A DRAM cell consists of a capacitor and access transistor.

Figure 8 :
Figure 8: Formulating DECORAM as a COMPRESS-PREFETCH cycle. These are implemented using two adjacent subarrays $i$ and $(i+1)$, called the PM and PA subarrays, respectively.

Figure 9 :
Figure 9: DECORAM controller for compression and decompression inside DRAM.

Figure 10 :
Figure 10: DECORAM controller for compression and decompression of an entire row inside DRAM. $D_0$ and $D_1$ match with $P_0$ and $P_5$, and $D_M$ is uncompressible.

Figure 11 :
Figure 11: Execution time of the evaluated workloads, normalized to baseline. The absolute execution time (in million cycles) for the baseline design is reported for easier comparison.

Figure 12 :
Figure 12: Router buffer utilization.

Figure 16 :
Figure 16: Latency, distributed into queue and NoC delay.

Table 1
reports all baseline patterns that DECORAM supports.

Table 3 :
Table 3: The B-group addressing in PM and PA subarrays.

Table 5
reports workloads we use to evaluate DECORAM.