Design Article
Breaking through the embedded memory bottleneck, part 1
Sundar Iyer, Memoir Systems
7/30/2012 12:48 PM EDT
Parallel architecture performance boosts
Algorithmic memories introduce architectural improvements by adding logic to existing embedded memory macros that enables the memories to operate much more efficiently. Within the memories, algorithms intelligently read, write, and manage data in parallel using a variety of techniques such as buffering, virtualization, pipelining, and data encoding. Woven together, these techniques create a new memory that internally processes memory operations an order of magnitude faster and with guaranteed performance. This increased performance capability is made available to the system through additional memory ports such that many more memory access requests can be processed in parallel within a single clock cycle (see figure 2). The concept of using multi-port memories as a means of multiplying memory performance mirrors the trend of using multicore processors to increase performance over uniprocessors. In both cases, it is parallel architecture rather than faster clock speeds that drives performance gains.

Figure 2: Physical memory (left) can deliver up to 500 million memory operations per second (MOPS), while algorithmic memory (right) can deliver 2000 million.
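The arithmetic behind figure 2 can be sketched in a few lines. This is a minimal model, assuming each port completes one memory operation per clock cycle; the function name peak_mops and the four-port configuration are illustrative assumptions chosen to match the 2000 million figure, not details taken from the figure itself.

```python
def peak_mops(clock_mhz: float, ports: int) -> float:
    """Peak memory operations per second, in millions.

    Assumes every port completes exactly one operation per clock cycle.
    """
    return clock_mhz * ports


# Single-port physical memory at 500 MHz.
physical = peak_mops(500, 1)

# Hypothetical four-port algorithmic memory at the same clock.
algorithmic = peak_mops(500, 4)

print(f"physical: {physical:.0f} million ops/s")
print(f"algorithmic: {algorithmic:.0f} million ops/s")
print(f"gain: {algorithmic / physical:.0f}X")
```

The clock frequency is the same in both cases; only the number of operations accepted per cycle changes.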
Algorithmic memory technology is implemented as soft RTL. The resulting solutions appear exactly as standard multi-port embedded memories. A system architect can specify the level of memory performance that is required from a customized algorithmic memory. As will be described later, an algorithmic memory can also significantly lower layout area and reduce memory power in certain instances. Using this approach requires no change to existing memory interfaces or ASIC design flows. Algorithmic memory technology is independent of both process node and foundry. In essence, the approach allows system architects to rapidly and reliably create customized memory solutions that are optimized for specific applications. For example, the extra area overhead required to implement a 2X MOPS increase is typically around 15% of the total physical memory area. In one implementation in a networking SoC, the performance of a 32Mb (128K deep x 256 bits wide) ultra-high-density SRAM running at 500 MHz (500 million MOPS) in a 32-nm process was increased to 1000 million MOPS with 13% area overhead.
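The numbers in the networking SoC example can be checked with a short calculation. The sketch below only restates the quoted figures; the function names and the "throughput per unit area" ratio are illustrative assumptions, not metrics used in the article.

```python
def capacity_mbit(depth_words: int, width_bits: int) -> float:
    """Memory capacity in megabits (1 Mb taken as 2**20 bits)."""
    return depth_words * width_bits / 2**20


def throughput_per_area_gain(speedup: float, area_overhead: float) -> float:
    """Throughput gain divided by relative area (1.0 = original macro)."""
    return speedup / (1.0 + area_overhead)


# 128K deep x 256 bits wide, as quoted in the text.
size = capacity_mbit(128 * 1024, 256)

# 500 million -> 1000 million operations per second with 13% extra area.
speedup = 1000 / 500
gain = throughput_per_area_gain(speedup, 0.13)

print(f"capacity: {size:.0f} Mb")
print(f"speedup: {speedup:.1f}X at 1.13X area")
print(f"throughput per unit area: {gain:.2f}X")
```

Under these assumptions the 2X speedup costs far less than a second copy of the memory, which would double the area.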
If one is prepared to trade off some area, memories can be made significantly faster, and an increase in performance of up to 10X is possible. In practice, the majority of applications benefit from up to 4X in memory performance. In some cases, algorithmic memory technology can also be used to lower memory area and power consumption without sacrificing performance.
Developing higher-performance memory using circuits alone imposes a significant area and power penalty. Algorithmic memory technology combines a lower-performance memory circuit (which typically has lower area and power requirements) with memory algorithms to synthesize a new memory. This algorithmic memory achieves the same MOPS as a high-performance memory built using circuits alone, but can lower area and power by up to 50%.
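To give a feel for how logic wrapped around simple memory banks can add ports, the sketch below models one well-known data-encoding idea: two single-port banks plus an XOR parity bank, which together serve two reads per cycle even when both addresses fall in the same bank. This is an illustrative technique chosen for the example, not necessarily the scheme the author's technology uses. All class and method names are hypothetical, and writes are modelled functionally rather than cycle-accurately.

```python
class SinglePortBank:
    """A memory bank that allows at most one read per cycle."""

    def __init__(self, depth):
        self.cells = [0] * depth
        self.busy = False

    def new_cycle(self):
        self.busy = False

    def read(self, row):
        if self.busy:
            raise RuntimeError("bank conflict: second access in one cycle")
        self.busy = True
        return self.cells[row]


class XorTwoReadMemory:
    """Two data banks plus a parity bank holding bank0[row] ^ bank1[row]."""

    def __init__(self, depth_per_bank):
        self.depth = depth_per_bank
        self.banks = [SinglePortBank(depth_per_bank) for _ in range(2)]
        self.parity = SinglePortBank(depth_per_bank)

    def _locate(self, addr):
        # Returns (bank index, row within bank).
        return divmod(addr, self.depth)

    def write(self, addr, value):
        # Simplification: writes bypass the per-cycle port budget.
        bank, row = self._locate(addr)
        self.banks[bank].cells[row] = value
        self.parity.cells[row] = (
            self.banks[0].cells[row] ^ self.banks[1].cells[row]
        )

    def read2(self, addr_a, addr_b):
        """Serve two reads in one cycle, one access per bank at most."""
        for bank in (*self.banks, self.parity):
            bank.new_cycle()
        bank_a, row_a = self._locate(addr_a)
        bank_b, row_b = self._locate(addr_b)
        value_a = self.banks[bank_a].read(row_a)
        if bank_b != bank_a:
            value_b = self.banks[bank_b].read(row_b)
        else:
            # Same bank: rebuild the value from the other bank and parity.
            other = self.banks[1 - bank_a].read(row_b)
            value_b = other ^ self.parity.read(row_b)
        return value_a, value_b


mem = XorTwoReadMemory(depth_per_bank=4)
for address in range(8):
    mem.write(address, address * 3 + 1)

print(mem.read2(0, 5))  # different banks
print(mem.read2(1, 2))  # same bank, second value rebuilt from parity
```

The cost of the second read port here is one extra bank and some XOR logic, rather than a full multi-port bit cell, which is the kind of area-for-performance trade the text describes.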
In part 2 of this article, we’ll take a closer look at the specifics of how algorithmic memories work and how the techniques can be integrated into a system.
About the author:
Sundar Iyer is co-founder and CTO at Memoir Systems, a start-up specializing in semiconductor intellectual property (SIP) for algorithmic memories. Previously, Iyer was CTO and co-founder of Nemo (“Network Memory”) Systems, acquired by Cisco Systems in 2005. Iyer was a founding member at SwitchOn Networks (acquired by PMC-Sierra in 2000), where he developed algorithms for associative memory and deep packet classification. In 2008, Iyer was awarded the MIT Technology Review TR35 young innovator award for his work on network memory. He received his Ph.D. in Computer Science from Stanford University in 2008.
Sundar can be reached at sundaes@memoir-systems.com.
_________________________


DaveWyland
8/2/2012 2:04 PM EDT
We are being reminded that a CPU is a memory controller. Its function is to read data, combine it and write it back, using an instruction stream from the same (von Neumann) or a different (Harvard) memory. The performance of the system is ultimately determined by the memory, once the CPU has been optimally designed for its task universe. And CPU architectures have stabilized at the Pentium style of ~2.5 instructions/clock.
Given the CPU design, system performance is limited by MOPS x Number of memories. An N-port memory looks like N memories, but the performance starts dropping off for N greater than 2. I have some experience with this, having worked on dual and quad port memory designs. Multi-port is useful, not a panacea.
IMO, we are being dragged kicking and screaming into the land of data flow processing. This is where you have chains of processing nodes that crunch data that flows through them, assembly line style. You have small nodes, each with its small memory, and lots of them. This lets you multiply memories (N much greater than 1) and thereby multiply system performance. The fact that each node is small in both memory size and processing logic helps, too.
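A rough software sketch of that assembly-line idea, assuming Python generators stand in for processing nodes. The node names, the three-stage chain, and the sample data are all hypothetical, and a real data flow machine would run the nodes concurrently in hardware rather than one after another.

```python
from collections import deque


def scale(stream, factor):
    """Stateless node: multiplies each sample as it flows through."""
    for sample in stream:
        yield sample * factor


def moving_sum(stream, window):
    """Node with a small private memory: the last `window` samples."""
    local_memory = deque(maxlen=window)
    for sample in stream:
        local_memory.append(sample)
        yield sum(local_memory)


def threshold(stream, limit):
    """Stateless node: flags samples above a limit."""
    for sample in stream:
        yield sample > limit


def run_pipeline(samples):
    # The "algorithm" is the wiring between nodes.
    wired = threshold(moving_sum(scale(samples, 2), window=3), limit=10)
    return list(wired)


print(run_pipeline([1, 2, 3, 4]))  # [False, False, True, True]
```

Only moving_sum holds any state, and that state is a few words private to the node, so no node competes with another for a shared memory.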
The pain is that your algorithm is now in the wiring of nano-sized processing chunks. And you may want some chunks to be different than others. Also, you have to have a system of hardware that lets you do this flexibly and tools that let you create and debug this wired-chunk algorithm design.
Two thoughts come to mind. FPGAs are now good candidates for the hardware. They now have HUGE capability and software for wiring them up, by definition. And graphic data flow systems such as MATLAB/Simulink and LabVIEW are successful in making such systems.
We will have to change our way of designing computer systems if we want more performance. OTOH, it is possible.