As a result of these system, application, and technology trends and the resulting requirements, it is our position that researchers and designers need to fundamentally rethink the way we design memory systems today in order to 1) overcome scaling challenges with DRAM, 2) enable the use of emerging memory technologies, and 3) design memory systems that provide predictable performance and quality of service to applications and users. The rest of the paper describes our solution ideas in these three directions, with pointers to specific techniques where possible. Since the scaling challenges themselves arise from the difficulty of enhancing memory components at solely one level of the computing stack (e.g., the device and/or circuit levels in the case of DRAM scaling), we believe effective solutions to the above challenges will require cooperation across different layers of the computing stack, from algorithms to software to microarchitecture to devices, as well as between different components of the system, including processors, memory controllers, memory chips, and the storage subsystem.
Challenge 1: New DRAM architectures
DRAM has been the technology of choice for implementing main memory due to its relatively low latency and low cost. DRAM process technology scaling has long enabled lower cost per unit area by enabling reductions in DRAM cell size. Unfortunately, further scaling of DRAM cells has become costly [4, 47, 37, 1] due to increased manufacturing complexity/cost, reduced cell reliability, and potentially increased cell leakage leading to high refresh rates. Several key issues to tackle include:
1) reducing the negative impact of refresh on energy, performance, QoS, and density scaling,
2) improving DRAM parallelism/bandwidth, latency, and energy efficiency [33, 41, 44],
3) improving reliability of DRAM cells at low cost,
4) reducing the significant amount of waste present in today’s main memories in which much of the fetched/stored data can be unused due to coarse-granularity management [49, 74],
5) minimizing data movement between DRAM and processing elements, which causes high latency, energy, and bandwidth consumption.
Traditionally, DRAM devices have been separated from the rest of the system by a rigid interface, and DRAM has been treated as a passive slave device that simply responds to the commands given to it by the memory controller. We believe the above key issues can be solved more easily if we rethink the DRAM architecture and functions, and redesign the interface such that DRAM, controllers, and processors closely cooperate. We call this high-level solution approach system-DRAM co-design. We believe key technology trends, e.g., the 3-D stacking of memory and logic [45, 2] and the increasing cost of scaling DRAM solely via circuit-level approaches, make such a co-design increasingly feasible. We proceed to provide several examples from our recent research that tackle the problems of refresh, parallelism, latency, and energy efficiency.
Reducing refresh impact
With higher DRAM capacity, more cells need to be refreshed, likely at higher rates than today. Our recent work indicates that refresh rate limits DRAM density scaling: a hypothetical 64-Gb DRAM device would spend 46% of its time and 47% of all DRAM energy refreshing its rows, whereas a typical 4-Gb device of today spends 8% of its time and 15% of its energy on refresh. Today's DRAM devices refresh all rows at the same worst-case rate (e.g., every 64 ms). However, only a small number of weak rows require a high refresh rate [30, 43] (e.g., only approximately 1000 rows in a 32-GB DRAM system need to be refreshed more frequently than every 256 ms).
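To see where such overheads come from, note that a device is unavailable for tRFC out of every refresh interval tREFI, and that tRFC grows with the number of rows per device. The Python sketch below computes this first-order time overhead; the tRFC values are our own illustrative extrapolations with density, not vendor-specified timings.

```python
# First-order refresh overhead: the device is unavailable for tRFC out of
# every tREFI. tREFI is a typical DDR value; the tRFC figures are assumed
# extrapolations with density, not datasheet numbers. This simple model
# omits effects captured by the detailed analysis cited in the text, so
# its percentages differ somewhat from the figures above.
T_REFI_NS = 7800  # average interval between refresh commands (7.8 us)

assumed_trfc_ns = {
    "4 Gb": 260,    # in the range of today's devices
    "16 Gb": 900,   # assumed growth with row count
    "64 Gb": 3600,  # hypothetical future device
}

for density, trfc_ns in assumed_trfc_ns.items():
    print(f"{density}: busy refreshing {trfc_ns / T_REFI_NS:.0%} of the time")
```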
Retention-Aware Intelligent DRAM Refresh (RAIDR) exploits this observation: it groups DRAM rows into bins (implemented as Bloom filters to minimize hardware overhead) based on the retention time of the weakest cell within each row. Each row is refreshed at a rate corresponding to its retention time bin. Since only a few rows need a high refresh rate, very few bins suffice to achieve large reductions in refresh counts: our results show that RAIDR with three bins (1.25-KB hardware cost) reduces refresh operations by approximately 75%, leading to significant improvements in system performance and energy efficiency, as described by Liu et al.
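To make the binning idea concrete, the sketch below models RAIDR-style refresh decisions with two Bloom filters for the short-retention bins (64 ms and 128 ms) and a default 256-ms rate for all other rows. The filter sizes, hash scheme, and function names are illustrative choices of ours, not the hardware design; note that a Bloom filter false positive merely refreshes a row more often than necessary, which is safe.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; a false positive only causes an extra (safe) refresh."""
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _indices(self, row):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{row}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, row):
        for idx in self._indices(row):
            self.bits[idx] = 1

    def maybe_contains(self, row):
        return all(self.bits[idx] for idx in self._indices(row))

# Hypothetical bins in the spirit of RAIDR: profiled weak rows are inserted
# into short-interval bins; all other rows default to the longest interval.
bin_64ms = BloomFilter(size_bits=8192, num_hashes=4)
bin_128ms = BloomFilter(size_bits=8192, num_hashes=4)

def needs_refresh(row, tick):
    """Decide, once per row per 64-ms refresh tick, whether to refresh `row`."""
    if bin_64ms.maybe_contains(row):
        return True                      # weakest rows: refresh every 64 ms
    if bin_128ms.maybe_contains(row):
        return tick % 2 == 0             # refresh every other tick (128 ms)
    return tick % 4 == 0                 # default rows: every fourth tick (256 ms)
```

Because the vast majority of rows fall into the default bin, the controller state reduces to the two small filters plus a tick counter, which is what keeps the hardware cost low.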
Note that approaches like RAIDR, which exploit non-uniform retention times across DRAM, require accurate retention time profiling mechanisms. Understanding the retention time and error behavior of DRAM devices is a critical research topic, which we believe can enable other mechanisms that tolerate refresh impact and errors at low cost. Liu et al. provide an experimental characterization of retention times in modern DRAM devices to aid such understanding.
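As a thought experiment, a naive profiling loop might look like the sketch below: write a pattern, pause refresh for a test interval, and record which rows still hold their data. The `dram` control interface and the single test pattern are hypothetical simplifications; as the characterization by Liu et al. shows, real retention behavior depends on data pattern and temperature and can vary over time, which is exactly what makes accurate profiling a hard research problem.

```python
PATTERN = 0xAA  # a single test pattern; real cells are data-pattern dependent

def profile_retention(dram, rows, intervals_ms=(64, 128, 256)):
    """Assign each row the longest tested interval it survives without refresh.
    `dram` is a hypothetical control interface offering write/read/pause_refresh;
    rows left at 0 failed even the shortest tested interval."""
    survives = {row: 0 for row in rows}
    for interval in intervals_ms:               # test short intervals first
        for row in rows:
            dram.write(row, PATTERN)
        dram.pause_refresh(interval)            # let cells leak for `interval` ms
        for row in rows:
            if dram.read(row) == PATTERN:
                survives[row] = interval        # row held its data this long
    return survives
```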
Improving DRAM parallelism
A key limiter of DRAM parallelism is bank conflicts. We have recently developed mechanisms, called SALP (subarray-level parallelism), that exploit the internal subarray structure of a DRAM bank to largely parallelize two requests that access the same bank. The key idea is to slightly reduce the hardware shared between DRAM subarrays such that accesses to the same bank but different subarrays can be initiated in a pipelined manner. This mechanism requires exposing the internal subarray structure of DRAM to the controller and designing the controller to take advantage of this structure. Our results show significant improvements in the performance and energy efficiency of main memory due to the parallelization of requests and improved row buffer hit rates (as the row buffers of different subarrays can be kept active), at a low DRAM area overhead of 0.15%. SALP achieves most of the benefits of increasing the number of banks at much lower area and power overhead. Exposing the subarray structure of DRAM to other parts of the system, e.g., to system software or memory allocators, can enable data placement and partitioning mechanisms that improve performance and efficiency even further.
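The row-buffer benefit can be illustrated with a small Python model that contrasts a conventional bank (a single open row) with a SALP-enabled bank (one open row per subarray). The subarray geometry and the access trace are invented for illustration, and the model counts row-buffer hits only, ignoring timing.

```python
ROWS_PER_SUBARRAY = 512  # illustrative geometry, not a real device parameter

def row_buffer_hits(accesses, salp=True):
    """Count row-buffer hits within one bank. With SALP, each subarray keeps
    its own row buffer open; a conventional bank has a single open row, so
    an access to any other row is a conflict (precharge + activate)."""
    open_rows = {}  # key: subarray id (SALP) or 0 (conventional) -> open row
    hits = 0
    for row in accesses:
        key = row // ROWS_PER_SUBARRAY if salp else 0
        if open_rows.get(key) == row:
            hits += 1             # served from the already-active row buffer
        else:
            open_rows[key] = row  # conflict: activate the new row
    return hits

# Two interleaved request streams to different subarrays of the same bank:
trace = [0, 1024, 0, 1024, 0, 1024]
print(row_buffer_hits(trace, salp=False))  # 0 hits: every access conflicts
print(row_buffer_hits(trace, salp=True))   # 4 hits: each subarray stays open
```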