
Design Article

Breaking through the embedded memory bottleneck, part 1

Sundar Iyer, Memoir Systems

7/30/2012 12:48 PM EDT

In the age of broadband Internet, 4G smart phones, and untethered tablet computing, there is a relentless demand for ever-increasing computing performance. Over the years, processing performance has rapidly progressed, initially via increasing clock speeds and then later courtesy of architectural innovations such as instruction-level parallelism, pipelining, and the issuing of multiple instructions per cycle. Memory performance, on the other hand, has not kept pace, thus creating the traditional processor-memory gap.

Despite attempts to temper that gap with huge increases in on‐chip memory capacity and the advent of multicore architectures (once again increasing the effective processing performance), system on chip (SoC) architects and designers continue to struggle to meet the performance requirements of today’s data‐hungry applications. Memory technology is long overdue for an innovation that can increase performance by an order of magnitude. One promising technology, algorithmic memory, combines existing embedded memories with the capabilities of algorithms to increase embedded memory performance by a factor of 10. While not a panacea, it offers a new and innovative approach to alleviating the disparity between processor and memory performance in SoCs.

Traditionally, the processor-memory performance gap referred to the difference between the performance of processors and that of external memories, which took hundreds of cycles or more to access. The obvious way to close this gap was to reduce off-chip memory delay by integrating the processors, memory, and other components on the same chip, which led to the SoC approach. SoCs have emerged as the architecture of choice for delivering higher and higher levels of computing performance. Have SoCs really solved the processor-memory performance gap, though, or have they just pushed it to a lower level and recreated it within the microcosm of the chip?

SoCs are typically designed so that their processors primarily access embedded memory and go to external memory only when absolutely required. SoC architects embed cache memory for frequently requested data, for example, or implement dedicated on-chip memories where possible. Memory used for these purposes can be accessed within a few clock cycles, and is typically placed immediately next to the processing cores to minimize latency. However, while latency remains a major concern, these memories must also respond to back-to-back sustained access requests issued by the processor(s), and in many applications the rate of those requests has been increasing dramatically. Once more, systems architects are up against a processor-memory gap, this time with embedded memory (Figure 1).



Figure 1: Over the years, processing performance (red line) has rapidly progressed. Memory performance, on the other hand, has not kept pace (green and blue lines), thus creating a processor-memory gap.

Measuring memory
Before tackling the problem of how to increase memory performance, we need a way to measure memory performance that accurately reflects real-life requirements. Colloquially, memory bandwidth has often been used to describe memory performance. Memory bandwidth is the rate at which data can be read from or stored into a memory, and it can easily be increased by expanding the data bus width of the embedded memory. An increase in the data bus width does not allow more unique accesses to memory, however.

Consider a processor, or a set of multiprocessor cores, that makes an aggregate of 500 million unique accesses to memory per second. Suppose that there is a single-port memory, supporting one memory access per clock cycle, that runs at a frequency of 250 MHz. This memory supports exactly 250 million unique accesses per second. Doubling the bandwidth of this memory by widening the data bus would only deliver more data for each of the 250 million unique accesses; it would not support the processor's 500 million unique requests. A more inclusive measure of memory performance, then, would be the memory operations per second (MOPS) metric.
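The arithmetic in this example can be checked with a short Python sketch. The function and variable names are illustrative assumptions and do not come from the article; only the 250 MHz clock, the one access per cycle, and the 500 million requests per second are taken from the text.

```python
def unique_accesses_per_second(clock_hz, accesses_per_cycle=1):
    """Unique accesses a memory can serve each second.

    Hypothetical helper: a single-port memory serves one access per
    clock cycle, so its access rate equals its clock frequency.
    """
    return clock_hz * accesses_per_cycle


# Figures from the example in the text.
required = 500_000_000                                # processor requests per second
available = unique_accesses_per_second(250_000_000)   # 250 MHz, single port

# Widening the data bus is not an input to the access rate at all,
# so the shortfall is the same for any bus width.
shortfall = required - available

print(f"available: {available // 1_000_000} million accesses/s")
print(f"shortfall: {shortfall // 1_000_000} million accesses/s")
```

The bus width never appears in the calculation, which is the point of the example: it changes how much data each access carries, not how many accesses are served.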

MOPS refers to the rate at which unique accesses can be performed to a memory system. The relation between the bandwidth and MOPS is:

Memory bandwidth = MOPS × data bus width.

In other words, doubling the MOPS of a memory while keeping everything else the same doubles the total memory bandwidth. The use of MOPS for measuring memory performance mirrors the trend of using input/output operations per second (IOPS) for measuring the performance of computer storage devices.
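As a rough illustration of this relation, the following Python sketch compares two ways of doubling bandwidth. The 64-bit and 128-bit bus widths and the function name are assumptions chosen for the example; the article does not specify a bus width.

```python
def bandwidth_bits_per_second(mops, bus_width_bits):
    """Memory bandwidth = MOPS x data bus width.

    `mops` is given in millions of operations per second, so it is
    scaled by 1,000,000 to yield bits per second.
    """
    return mops * 1_000_000 * bus_width_bits


# Assumed baseline: 250 MOPS on a 64-bit data bus.
base = bandwidth_bits_per_second(250, 64)

# Two ways to double the bandwidth:
wider_bus = bandwidth_bits_per_second(250, 128)   # same MOPS, wider bus
more_mops = bandwidth_bits_per_second(500, 64)    # double MOPS, same bus

print(wider_bus == more_mops == 2 * base)
```

Both options report the same bandwidth, but only the second raises the number of unique accesses per second, which is why bandwidth alone can hide a shortfall in MOPS.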




DaveWyland

8/2/2012 2:04 PM EDT

We are being reminded that a CPU is a memory controller. Its function is to read data, combine it and write it back, using an instruction stream from the same (von Neumann) or a different (Harvard) memory. The performance of the system is ultimately determined by the memory, once the CPU has been optimally designed for its task universe. And CPU architectures have stabilized at the Pentium style of ~2.5 instructions/clock.

Given the CPU design, system performance is limited by MOPS x Number of memories. An N-port memory looks like N memories, but the performance starts dropping off for N greater than 2. I have some experience with this, having worked on dual and quad port memory designs. Multi-port is useful, not a panacea.

IMO, we are being dragged kicking and screaming into the land of data flow processing. This is where you have chains of processing nodes that crunch data that flow through them, assembly line style. You have small nodes, each with its small memory, and lots of them. This lets you multiply memories (N much greater than 1) and thereby multiply system performance. The fact that each node is small in both memory size and processing logic helps, too.

The pain is that your algorithm is now in the wiring of nano-sized processing chunks. And you may want some chunks to be different than others. Also, you have to have a system of hardware that lets you do this flexibly and tools that let you create and debug this wired-chunk algorithm design.

Two thoughts come to mind. FPGAs are now good candidates for the hardware. They now have HUGE capability and software for wiring them up, by definition. And graphic data flow systems such as MATLAB/Simulink and LabVIEW are successful in making such systems.

We will have to change our way of designing computer systems if we want more performance. OTOH, it is possible.
