TORONTO – Memory efficiency continues to lag, according to one startup that believes it has a memory-agnostic answer.
Performance-IP recently introduced its Memory Request Optimizer, a block of IP able to improve memory efficiency and increase the performance of a system on chip (SoC) by reducing latency between the memory subsystem and the SoC client. In a telephone interview with EE Times, Chief Technology Officer Gregg Recupero said the embedded IP manages widely divergent request streams to create a virtual locality of reference that makes requests appear more linear.
This improves memory bandwidth, he said, as most memory subsystems operate at less than 80 percent efficiency. This inefficiency slows pipeline performance in the communication between the SoC client and memory. “At the end of the day you need to keep the pipeline busy, and not just the CPU," said Recupero. Other parts of the subsystem need better efficiency as well, such as the graphics processing unit (GPU), codec or video processor.
The company's benchmarks show its Memory Request Optimizer reduces read latency by between 71 percent and 78 percent. Unlike a memory scheduler, the IP is a memory prefetch engine that works with memory schedulers by grouping similar requests together. Recupero said it analyzes multiple concurrent request streams from clients and determines which requests should be optimized, or prefetched, and which should not. The result is high hit rates with ultra-low false fetch rates.
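Performance-IP has not published its algorithm, but the behavior described, prefetching only streams that look linear to keep false fetches low, can be sketched with a simple per-client stride tracker. The class and threshold below are illustrative assumptions, not the company's implementation:

```python
# Toy sketch (not Performance-IP's actual algorithm): a per-client stride
# tracker that only prefetches once a request stream has proven linear,
# which keeps the false-fetch rate low for random traffic.

class StrideTracker:
    """Tracks one client's read stream and decides what to prefetch."""

    def __init__(self, confidence_needed=2):
        self.last_addr = None
        self.stride = None
        self.confidence = 0                  # consecutive same-stride requests
        self.confidence_needed = confidence_needed

    def observe(self, addr):
        """Record a read request; return any addresses worth prefetching."""
        prefetch = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride and stride != 0:
                self.confidence += 1
            else:
                self.stride = stride
                self.confidence = 1
            # Only prefetch once the stream looks genuinely linear.
            if self.confidence >= self.confidence_needed:
                prefetch = [addr + self.stride]
        self.last_addr = addr
        return prefetch

tracker = StrideTracker()
decisions = [tracker.observe(a) for a in [0x100, 0x140, 0x180, 0x1C0]]
# The first requests establish the stride; later ones trigger prefetch.
```

A random address stream never builds confidence in this sketch, so nothing is fetched speculatively, which is one plausible way to achieve the "ultra-low false fetch rates" the company claims.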
"As new memory standards evolve, you need even more efficiency out the memory subsystem," Recupero said. “We like to say we recover lost system performance."
When a client request has been optimized, it is stored in a request optimization buffer, a small micro-cache that holds optimized client requests until they are needed. Recupero said a multi-client interface supporting both the AXI and OCP protocols can manage up to 16 clients, a number the designer specifies when configuring the technology.
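The request optimization buffer behaves like a small, consumable cache: prefetched responses sit in it until a client claims them. A minimal sketch of that idea, with an assumed FIFO eviction policy and entry count that the article does not specify:

```python
# Hypothetical sketch of a request optimization buffer: a small
# FIFO-evicting micro-cache holding prefetched responses until a
# client asks for them. Capacity and eviction policy are assumptions.
from collections import OrderedDict

class RequestOptimizationBuffer:
    def __init__(self, entries=8):
        self.entries = entries
        self.buf = OrderedDict()             # addr -> prefetched data

    def fill(self, addr, data):
        """Store a prefetched response, evicting the oldest if full."""
        if len(self.buf) >= self.entries:
            self.buf.popitem(last=False)     # drop oldest entry
        self.buf[addr] = data

    def lookup(self, addr):
        """Serve a client request; a hit avoids a trip to main memory."""
        return self.buf.pop(addr, None)      # entry is consumed on use

rob = RequestOptimizationBuffer(entries=2)
rob.fill(0x100, b"A")
rob.fill(0x140, b"B")
rob.fill(0x180, b"C")                        # evicts 0x100
hit, miss = rob.lookup(0x140), rob.lookup(0x100)
```

A hit returns data at on-chip buffer latency rather than DRAM latency, which is where the claimed 71 to 78 percent read-latency reduction would come from.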
Performance-IP's stream benchmarks for the Memory Request Optimizer reduced latency by 71 percent, 77 percent and 78 percent across its three modes. Source: Performance-IP
The configuration tool automatically builds the specified number of client interfaces, each functioning independently and supporting concurrent operation. This allows the IP to service multiple concurrent client requests with responses issued from the request optimization buffers. Consequently, the IP can supply a higher peak burst bandwidth than the underlying memory subsystem provides.
The IP can be implemented anywhere in the memory hierarchy, said Recupero. “It could be sitting right in front of your DDR controller. It reaps the benefits to any client that is trying to get to the memory subsystem." As the requests pass through the Memory Request Optimizer, they're analyzed by the trackers, which determine which requests should be prefetched and placed into the request optimization buffers, and which should not. “It allows you to dynamically tune the power-performance profile you'd like to operate at."
The mode of operation can be determined via the control port, said Recupero, varying from low optimization to aggressive optimization and in between. “That implements a different algorithm in the trackers to determine which requests it should prefetch and which ones it should not," Recupero said. “The trackers can optimize any client in the system." For example, optimization could be set only for requests to the CPU or video acceleration.
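One plausible reading of "different algorithm in the trackers" per mode is that each mode trades prefetch reach against accuracy, for example by varying prefetch depth and the confidence needed before fetching. The mode names, depths and thresholds below are assumptions for illustration:

```python
# Assumed illustration of mode selection via a control port: each mode
# maps to a different prefetch depth and confidence threshold, trading
# accuracy (fewer false fetches) against reach (more data staged early).
MODES = {
    "low":        {"depth": 1, "confidence_needed": 3},
    "moderate":   {"depth": 2, "confidence_needed": 2},
    "aggressive": {"depth": 4, "confidence_needed": 1},
}

def prefetch_plan(mode, addr, stride, confidence):
    """Return the addresses a tracker in this mode would prefetch."""
    cfg = MODES[mode]
    if confidence < cfg["confidence_needed"]:
        return []                            # stream not yet trusted
    return [addr + stride * i for i in range(1, cfg["depth"] + 1)]

# Same stream, different modes: aggressive fetches deeper, sooner.
plan_low = prefetch_plan("low", 0x200, 0x40, confidence=2)
plan_aggr = prefetch_plan("aggressive", 0x200, 0x40, confidence=2)
```

Restricting optimization to selected clients, such as the CPU or a video accelerator, would then amount to assigning those clients a mode and leaving the rest unoptimized.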
The Memory Request Optimizer helps get the most bandwidth to the client at the lowest possible latency, said Recupero. “If we can reduce the average latency, we can improve performance." Another potential benefit is reduced power consumption, he said, as clock rates can be lower while still maintaining the same user experience. Voltage can also be scaled down for additional power savings.
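The power argument follows from the standard dynamic-power relation for CMOS, roughly proportional to capacitance times voltage squared times frequency. A back-of-the-envelope sketch with purely illustrative numbers (not Performance-IP figures) shows why a modest clock and voltage reduction compounds:

```python
# Back-of-the-envelope sketch: dynamic CMOS power scales roughly with
# C * V^2 * f, so if lower average memory latency lets a design do the
# same work at a lower clock (and a correspondingly lower voltage),
# the savings compound. All numbers below are illustrative assumptions.
def dynamic_power(voltage, freq_mhz, cap=1.0):
    return cap * voltage**2 * freq_mhz

baseline = dynamic_power(voltage=1.0, freq_mhz=800)
# Suppose reduced latency allows 700 MHz at 0.95 V for the same workload.
tuned = dynamic_power(voltage=0.95, freq_mhz=700)
savings = 1 - tuned / baseline               # roughly a 21% reduction
```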
In addition, many customers have designs and products they want to keep in production as long as possible using DDR3 or DDR4 because it's the most cost-effective system, said Recupero. “You can stay with current memory subsystem designs." Even more advanced, high-performance systems can benefit from the reduced latency and increased memory bandwidth, he said.
Performance-IP's Memory Request Optimizer allows the number of cache banks to be set during operation. Source: Performance-IP
So far, Performance-IP has seen the most uptake in the advanced driver assistance systems (ADAS) market, said Recupero. “There's all of these video accelerators that need to process information in real time." The next emerging market is the networking space to support the movement of data, he said, and the Internet of Things segment shows promise, too.
Ultimately, Recupero said, the whole point of the Memory Request Optimizer is to reduce latency, and it's designed to be memory agnostic, whether it's talking to DRAM or flash.
Jim Handy, principal analyst with Objective Analysis, said what Performance-IP is doing is a smart idea that was once used only in supercomputers, and is similar to the prefetching technique used in hard drives. “It's migrated down a lot," he said.
Handy said what the Memory Request Optimizer is doing is basic prefetching. “The more transistors you have the more sophisticated of a prefetch you can do," he said.
Performance-IP can support 16 clients and 16 memory channels, which means you can have a program that is running a number of sub-routines and a number of loops on a number of data streams, said Handy. “As long as the total is 16 or less, which is a good number, you should be able to accelerate whatever you are doing," he said.
Handy also sees the value for customers that want to use the technology to extend the life of existing systems that might still use DDR3, for example. “That's smart. If you've got a system that works fast enough and you can speed it up by adding this particular IP to the design, you might not have to speed up any other part of the system."
He said the ability to put it anywhere in the memory hierarchy makes a lot of sense, including between the memory and processor. “You could use it within a chip. You could have a cache memory inside the chip being communicated with through this."
Handy said today people throw transistors at a problem because they are so cheap. “They don't worry about how efficient they're being with them," he said. “This is an example of how you can use more transistors more efficiently to get a performance improvement without actually increasing the clock speed or increasing the number of transistors dramatically."
Handy sees Performance-IP's Memory Request Optimizer as a general-purpose option rather than something that is application specific. “This can be used in any computing application that either uses sub-routines or has data outside of the code space, which also means all computing."
—Gary Hilson is a general contributing editor with a focus on memory and flash technologies for EE Times.