
Design Article

Breaking through the embedded memory bottleneck, part 1

Sundar Iyer, Memoir Systems

7/30/2012 12:48 PM EDT

Current solutions to the MOPS problem
SoC architects and designers are well aware of the MOPS bottleneck of embedded memory. Unfortunately, today’s embedded memories that offer more MOPS through circuit techniques alone require so much die area that they are often impractical. Achieving a 4X MOPS increase for a memory built using circuit techniques alone, for example, typically takes 400% to 800% more physical memory area than a corresponding memory providing 1X MOPS. As a result, architects and designers must use a variety of other techniques to achieve the necessary performance.

A common approach is to break up memory into multiple banks. Each memory bank can be accessed independently, so if two accesses in the same clock cycle go to different banks, both can be serviced in parallel, effectively doubling the MOPS supported by the memory as a whole. What happens when multiple accesses go to the same bank, however? This is called a bank conflict, and when one occurs, the memory stalls. Subsequent memory accesses must be queued in FIFOs, which increases memory latency and, because accesses are no longer guaranteed to be read from or written to memory in a fixed time, complicates coherency management. The combination leads to processor stalls that propagate as backpressure to earlier stages of the system pipeline. As a result, system performance can no longer be guaranteed.
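The effect of a bank conflict can be illustrated with a small behavioral model. The following Python sketch is only an illustration under stated assumptions, not a description of any particular memory design: it assumes each bank services one access per clock cycle, that addresses map to banks by low-order interleaving (address modulo the number of banks), and that each bank has an unbounded FIFO. The function name simulate_banked_memory and its return values are hypothetical.

```python
from collections import deque


def simulate_banked_memory(requests_per_cycle, num_banks):
    """Model a multi-banked memory in which each bank serves one access per cycle.

    requests_per_cycle: list of lists; entry i holds the addresses issued in cycle i.
    Returns (total_cycles, max_latency). Latency counts cycles from issue to
    service, so an access served in the cycle it was issued has latency 1.
    Assumption: bank = address % num_banks (low-order interleaving).
    """
    queues = [deque() for _ in range(num_banks)]  # one FIFO per bank
    cycle = 0
    pending = 0
    max_latency = 0
    while cycle < len(requests_per_cycle) or pending:
        # Enqueue the accesses issued this cycle into their target banks.
        if cycle < len(requests_per_cycle):
            for addr in requests_per_cycle[cycle]:
                queues[addr % num_banks].append(cycle)
                pending += 1
        # Each bank services at most one queued access per cycle.
        for q in queues:
            if q:
                issued = q.popleft()
                pending -= 1
                max_latency = max(max_latency, cycle - issued + 1)
        cycle += 1
    return cycle, max_latency


# Two accesses per cycle that always hit different banks: no stalls.
print(simulate_banked_memory([[0, 1], [2, 3]], num_banks=2))  # (2, 1)

# Two accesses per cycle that all hit bank 0: the FIFO backs up.
print(simulate_banked_memory([[0, 2], [4, 6]], num_banks=2))  # (4, 3)
```

In the conflict-free case the two banks sustain two operations per cycle at a fixed latency. When every access lands in the same bank, the same four accesses take twice as many cycles and the worst-case latency grows with queue depth, which is the non-deterministic behavior the rest of the system then has to absorb.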

Multi-banked solutions are relatively inexpensive to implement in terms of memory area and power. However, the technique increases design complexity, because additional logic is required to manage non-deterministic memory output results. The added design verification complexity also significantly increases SoC development time. In the end, system performance still suffers whenever bank conflicts occur. An ideal memory solution should guarantee the required MOPS 100% of the time and avoid non-deterministic output results.

Rethinking memory performance
It is time to take a fresh perspective on how to increase memory performance. Today, a single-port embedded memory can perform one memory operation per clock cycle. Embedded memory performance has traditionally been closely tied to memory clock speed, and is therefore ultimately limited by it. The question to consider is whether it is possible to increase memory performance without increasing memory clock speeds.

Historically, advances in embedded memories have been limited to maximizing the number of transistors on a chip and cranking up the clock speed. This has been successful up to a point, but as transistors approach atomic dimensions, manufacturers are running into fundamental physical barriers. For this reason, the industry needs to rethink its approach to embedded memory design. As an analogy, increases in processor performance have come not only because of advances in circuitry, but also because of architecture improvements, such as pipelined execution and exploitation of instruction-level parallelism. What if embedded memories could be designed to take advantage of architectural and parallel mechanisms similar to processor architectures to increase memory performance? A new approach called algorithmic memory technology does exactly that.




DaveWyland

8/2/2012 2:04 PM EDT

We are being reminded that a CPU is a memory controller. Its function is to read data, combine it and write it back, using an instruction stream from the same (von Neumann) or a different (Harvard) memory. The performance of the system is ultimately determined by the memory, once the CPU has been optimally designed for its task universe. And CPU architectures have stabilized at the Pentium style of ~2.5 instructions/clock.

Given the CPU design, system performance is limited by MOPS x Number of memories. An N-port memory looks like N memories, but the performance starts dropping off for N greater than 2. I have some experience with this, having worked on dual and quad port memory designs. Multi-port is useful, not a panacea.

IMO, we are being dragged kicking and screaming into the land of data flow processing. This is where you have chains of processing nodes that crunch data that flow through them, assembly line style. You have small nodes, each with its small memory, and lots of them. This lets you multiply memories (N much greater than 1) and thereby multiply system performance. The fact that each node is small in both memory size and processing logic helps, too.
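As a rough software analogy for the assembly-line idea above, here is a minimal Python sketch in which each node is a generator that holds only its own small local state and passes results downstream. It is a sequential model, not parallel hardware, and the node names (scale, moving_sum) and the particular chain are hypothetical, chosen only for illustration.

```python
from collections import deque


def scale(stream, factor):
    """Stateless node: multiply each sample by a constant."""
    for x in stream:
        yield x * factor


def moving_sum(stream, window):
    """Node with a small private memory: sum of the last `window` samples."""
    buf = deque(maxlen=window)  # this node's local memory, never shared
    for x in stream:
        buf.append(x)
        yield sum(buf)


def run_pipeline(samples):
    """Wire the nodes into a chain; the 'algorithm' is the wiring itself."""
    return list(moving_sum(scale(samples, 2), 3))


print(run_pipeline([1, 2, 3, 4]))  # [2, 6, 12, 18]
```

Each node touches only its own buffer, so in a hardware version every node's memory could be accessed in the same clock cycle without contention. Changing the algorithm means rewiring or replacing nodes rather than editing one central program.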

The pain is that your algorithm is now in the wiring of nano-sized processing chunks. And you may want some chunks to be different than others. Also, you have to have a system of hardware that lets you do this flexibly and tools that let you create and debug this wired-chunk algorithm design.

Two thoughts come to mind. FPGAs are now good candidates for the hardware. They now have HUGE capability and software for wiring them up, by definition. And graphic data flow systems such as MATLAB/Simulink and LabVIEW are successful in making such systems.

We will have to change our way of designing computer systems if we want more performance. OTOH, it is possible.
