In the age of broadband Internet, 4G smartphones, and untethered tablet computing, there is a relentless demand for ever-increasing computing performance. Over the years, processing performance has progressed rapidly, initially via increasing clock speeds and later courtesy of architectural innovations such as instruction-level parallelism, pipelining, and the issuing of multiple instructions per cycle. Memory performance, on the other hand, has not kept pace, thus creating the traditional processor-memory gap.
Despite attempts to temper that gap with huge increases in on-chip memory capacity and the advent of multicore architectures (once again increasing the effective processing performance), system on chip (SoC) architects and designers continue to struggle to meet the performance requirements of today’s data-hungry applications. Memory technology is long overdue for an innovation that can increase performance by an order of magnitude. One promising technology, algorithmic memory, combines existing embedded memories with the capabilities of algorithms to increase embedded memory performance by a factor of 10. While not a panacea, it offers a new and innovative approach to alleviating the disparity between processor and memory performance in SoCs.
Traditionally, the processor-memory performance gap referred to the difference between the performance of processors and that of external memories, which took hundreds of cycles or more to access. The obvious solution was to eliminate off-chip memory delay by integrating the processors with the memory and other components on the same chip, leading to the advent of the SoC approach. SoCs have emerged as the architecture of choice for delivering ever-higher levels of computing performance. Have SoCs really solved the processor-memory performance gap, though, or have they just pushed it to a lower level and recreated it within the microcosm of the chip?
SoCs are typically designed with their processors primarily accessing the embedded memory, and accessing external memory only when absolutely required. SoC architects embed cache memory for frequently requested data, for example, or implement dedicated on-chip memories where possible. Memory used for these purposes can be accessed within a few clock cycles, and is typically placed immediately next to the processing cores to minimize latency. However, while latency remains a major concern, these memories must also respond to back-to-back sustained access requests issued by the processor(s), which in many applications have been increasing dramatically. Once more, systems architects are up against a processor-memory gap, this time with embedded memory (figure 1).
Figure 1: Over the years, processing performance (red line) has rapidly progressed. Memory performance, on the other hand, has not kept pace (green and blue lines), thus creating a processor-memory gap.
Before tackling the problem of how to increase memory performance, we need a way to measure memory performance that accurately reflects real-life requirements. Note that, colloquially, memory bandwidth has often been used to describe memory performance. Memory bandwidth is the rate at which data can be read from or written to a memory, and it can easily be increased by widening the data bus of the embedded memory. An increase in the data bus width does not allow more unique accesses to memory, however.
Consider a processor, or a set of multiprocessor cores, that together make an aggregate of 500 million unique accesses to memory in a second. Suppose there is a single-port memory, supporting one memory access per clock cycle, that runs at a frequency of 250 MHz. This memory supports exactly 250 million unique accesses per second. Doubling the memory bandwidth of this memory by widening the data bus would only deliver more data for each of the 250 million unique accesses—it would not support the processor's 500 million unique requests. A more inclusive measure of memory performance, then, is the memory operations per second (MOPS) metric.
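The arithmetic above can be sketched in a few lines. This is an illustrative calculation using the article's numbers (250 MHz single-port memory, 500 million unique accesses per second demanded); the 64-bit bus width is an assumption chosen for the example, not a figure from the text.

```python
# Worked example: unique accesses supported vs. demanded.
clock_hz = 250e6       # single-port memory clock
accesses_per_cycle = 1 # one access per cycle (single port)
demand_ops = 500e6     # unique accesses the processor issues per second

supported_ops = clock_hz * accesses_per_cycle  # 250M unique ops/s
shortfall = demand_ops - supported_ops         # 250M ops/s unmet

# Doubling the data bus width doubles bandwidth but not unique accesses.
bus_bits = 64  # assumed bus width for illustration
bandwidth_narrow = supported_ops * bus_bits        # bits/s
bandwidth_wide = supported_ops * (bus_bits * 2)    # doubled bandwidth...
wider_bus_supported_ops = supported_ops            # ...same unique ops/s

print(f"supported: {supported_ops:.0f} ops/s, demanded: {demand_ops:.0f} ops/s")
```

The point the sketch makes is that `supported_ops` is fixed by the clock rate and port count, so no amount of bus widening closes the 250-million-access shortfall.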
MOPS refers to the rate at which unique accesses can be performed to a memory system. The relation between the bandwidth and MOPS is:
Memory Bandwidth = MOPS × Data Bus Width.
In other words, doubling the MOPS of a memory while keeping everything else the same doubles the total memory bandwidth. The use of MOPS for measuring memory performance mirrors the trend of using input/output operations per second (IOPS) for measuring the performance of computer storage devices.
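The relation above can be expressed as a small helper. This is a minimal sketch; the function name and the 64-bit bus width are illustrative assumptions, not part of the article.

```python
def memory_bandwidth_bits_per_s(mops: float, bus_width_bits: int) -> float:
    """Memory Bandwidth = MOPS x data bus width.

    mops is in millions of unique memory operations per second;
    the result is in bits per second.
    """
    return mops * 1e6 * bus_width_bits

# 250 MOPS on an assumed 64-bit bus:
base = memory_bandwidth_bits_per_s(250, 64)
# Doubling MOPS with everything else unchanged doubles bandwidth:
doubled = memory_bandwidth_bits_per_s(500, 64)
```

A usage note: going the other way, dividing a quoted bandwidth figure by the bus width recovers the MOPS the memory actually sustains, which is the number that matters when counting unique accesses.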