Part 1 of this article introduced the concept of algorithmic memory and the benefits of memory operations per second (MOPS) as a metric. In part 2, we delve into the specifics of how algorithmic memory works and how it is implemented in embedded systems.
The existence of the processor-memory performance gap is well known in the industry. Up until now, advances in embedded memories have focused on maximizing the number of transistors on a chip and cranking up the clock speed. As transistors approach atomic dimensions, however, manufacturers are running into fundamental physical barriers. For this reason, the industry needs to rethink its approach to embedded memory. What if embedded memories could be designed to take advantage of architectural and parallel mechanisms similar to those used to enhance processor architectures? Algorithmic memory technology provides a way to do exactly that.
Inside an algorithmic memory
Every algorithmic memory consists of a number of individual memory macros, each of which can be accessed in parallel. Each memory macro has its own physical address and data bus, so four external accesses that address four different macros can take place in parallel in a single clock cycle. Matters get more complex when all four accesses target the same memory macro. In this case, the logic temporarily buffers the accesses in an internal cache, or directs the conflicting accesses to other macros within the memory.
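To make the conflict case concrete, here is a minimal Python sketch of how a group of same-cycle accesses might be sorted by target macro. The macro count, the macro depth, and the names `macro_of` and `find_conflicts` are assumptions chosen for illustration; the article does not specify how addresses map to macros.

```python
from collections import defaultdict

NUM_MACROS = 4           # assumed number of macros
WORDS_PER_MACRO = 1024   # assumed depth of each macro


def macro_of(addr):
    """Map a flat address to the index of the macro that holds it (assumed layout)."""
    return addr // WORDS_PER_MACRO


def find_conflicts(addresses):
    """Group the accesses issued in one cycle by macro.

    Returns a dict {macro index: [addresses]} containing only the macros
    that received more than one access, i.e. the accesses that cannot all
    be served directly in the same cycle.
    """
    by_macro = defaultdict(list)
    for addr in addresses:
        by_macro[macro_of(addr)].append(addr)
    return {m: addrs for m, addrs in by_macro.items() if len(addrs) > 1}
```

With these assumed sizes, `find_conflicts([0, 1024, 2048, 3072])` returns an empty dict because every access lands in a different macro, while `find_conflicts([0, 1, 2048, 5])` reports three accesses competing for macro 0. Those competing accesses are the ones the algorithmic logic would have to buffer or redirect.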
The addresses of these alternative locations are a form of virtual addressing: a scratchpad memory keeps track of them so that each virtual address can be correlated with its intended address. Since reads and writes can come in rapid succession and in all kinds of combinations, the logic in the algorithmic memory core has to manage all the patterns of hot spots and multiple accesses to the same macro intelligently. When there is time, the algorithm can move data to its intended location in main memory and perform cleanup. The logic must also handle a worst-case-for-life scenario, however, and intelligently rearrange things so that the operations continue to be posted. In fact, it is possible to prove mathematically that with the right scheme of data caching, virtualization, and data rearranging, all sequences of write operations can be posted.
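One simple way to picture redirected writes is a memory with one spare macro and a remap table that plays the role of the scratchpad. The Python sketch below is an illustrative model under those assumptions, not the scheme a commercial algorithmic memory uses. The class `TwoWriteMemory` and its methods are hypothetical, and the model ignores the fact that a hardware remap table would itself need multi-port access.

```python
class TwoWriteMemory:
    """Toy model: two writes per cycle on single-port macros plus one spare."""

    def __init__(self, num_banks=4, rows=8):
        self.num_banks = num_banks
        self.rows = rows
        # Physical storage: the logical macros plus one spare macro.
        self.phys = [[0] * rows for _ in range(num_banks + 1)]
        # remap[row][logical bank] -> physical bank currently holding that word.
        self.remap = [list(range(num_banks)) for _ in range(rows)]
        # free[row] -> the one physical bank holding no valid word in this row.
        self.free = [num_banks] * rows

    def _split(self, addr):
        """Assumed layout: consecutive addresses fill one bank row by row."""
        return addr // self.rows, addr % self.rows

    def write_pair(self, a1, d1, a2, d2):
        """Post two writes in one cycle; return the physical banks used."""
        b1, r1 = self._split(a1)
        b2, r2 = self._split(a2)
        p1 = self.remap[r1][b1]
        self.phys[p1][r1] = d1
        p2 = self.remap[r2][b2]
        if p2 == p1:
            # Conflict: redirect the second write to the free slot of its row.
            # That slot can never be in bank p1, because the word's old copy
            # occupies bank p1 in this row.
            spare = self.free[r2]
            self.remap[r2][b2] = spare
            self.free[r2] = p2      # the old location becomes the free slot
            p2 = spare
        self.phys[p2][r2] = d2
        return p1, p2

    def read(self, addr):
        """Follow the remap table to wherever the word currently lives."""
        b, r = self._split(addr)
        return self.phys[self.remap[r][b]][r]
```

In this model, two writes aimed at the same logical macro always land in two different physical macros, and a later read finds the redirected word through the table. Real designs also have to bound the size of that table and combine this with the caching and cleanup steps described above, which the sketch leaves out.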
Memory read operations are a little more complex than writes. If two simultaneous read accesses want data from the same memory macro, for example, the two addresses cannot be accessed in parallel. Accessing them sequentially, on the other hand, would add latency and hurt performance. To avoid these problems, all of the data stored in the physical memory is encoded, using a variety of schemes, so that the algorithm can reconstruct the requested data from the contents of other macros. Multiple read accesses can therefore proceed simultaneously.
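The article does not say which encodings are used, so the following Python sketch assumes the simplest possible one: a single extra macro that stores the XOR of the words held at the same row of every data macro. The class `CodedReadMemory` and its methods are hypothetical names. The point is only to show how a second read can be served without touching the busy macro.

```python
from functools import reduce
from operator import xor


class CodedReadMemory:
    """Toy model: two reads per cycle using one XOR parity macro (assumed code)."""

    def __init__(self, num_banks=4, rows=8):
        self.num_banks = num_banks
        self.rows = rows
        self.banks = [[0] * rows for _ in range(num_banks)]
        # parity[r] is the XOR of banks[0][r], banks[1][r], ... for that row.
        self.parity = [0] * rows

    def _split(self, addr):
        """Assumed layout: consecutive addresses fill one bank row by row."""
        return addr // self.rows, addr % self.rows

    def write(self, addr, value):
        b, r = self._split(addr)
        # Remove the old word from the parity and fold in the new one.
        self.parity[r] ^= self.banks[b][r] ^ value
        self.banks[b][r] = value

    def read_pair(self, a1, a2):
        """Return (value1, value2, reconstructed) for two same-cycle reads."""
        b1, r1 = self._split(a1)
        b2, r2 = self._split(a2)
        v1 = self.banks[b1][r1]
        if b1 != b2:
            return v1, self.banks[b2][r2], False   # different macros: no conflict
        if r1 == r2:
            return v1, v1, False                   # same word: one read serves both
        # Same macro, different rows: rebuild the second word from the other
        # data macros and the parity macro, leaving the busy macro untouched.
        others = (self.banks[b][r2] for b in range(self.num_banks) if b != b2)
        v2 = reduce(xor, others, self.parity[r2])
        return v1, v2, True
```

In the conflict case, the first read uses the contested macro once, and the second read uses each of the remaining macros plus the parity macro once, so no single-port macro is accessed twice in the cycle. The price is extra storage and an extra parity update on every write, which is the kind of area and power trade-off an encoding scheme has to balance.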
A key characteristic of algorithmic memory is that the increased performance (MOPS) is completely deterministic; this performance guarantee has been mathematically proven using adversarial analysis models. Algorithmic memory even resolves all row, address, and bank conflicts that may arise due to simultaneous accesses from multiple interfaces. Since algorithmic memory is not subject to memory bank conflicts or memory stalls, system-on-chip (SoC) designs can be greatly simplified because there is no need to deal with the possibility of system backpressure.
It seems like the details of the encoding scheme are invisible to the user. Is this correct or are there dependencies the user needs to understand in order to guarantee simultaneous access?
Also, it's been a few years since I worked with Galois fields, but I seem to remember they took many cycles to compute. Are there some that have single-cycle (or very low) latency without using too much logic?