This article first appeared as a series of blogs on All Programmable Planet, which was a thriving community website devoted to all things programmable. Sadly, the site is no longer with us, but many friendships were forged there that will last for years to come.
Hi, everyone. I'm Tobias from Munich, and I would like to talk about a subject I've been working on for the last few years. I call this technology System Hyper Pipelining (SHP), which is an extended version of C-Slow Retiming (CSR).
I guess everybody knows what pipelining is, and that CPUs are pipelined to optimize their throughput. And we can all agree that inserting registers at the right places in the data path can improve throughput. It's certainly not a unique idea, and this technique is already used in various designs. Having said this, I think this method has a lot more potential -- especially in the multicore era -- than many people realize. Let's talk about it.
The idea is all about reusing logic in a time-sliced fashion, scheduling, and synchronization. The contrary asynchronous approach might appear sexy, but using asynchronous techniques will result in a total mess (to be a little bit provocative).
Amdahl's Law: Bad synchronization ruins everything.
My history with C-Slow Retiming
As a student, I worked on a chip for a keyboard manufacturer with many different kinds of interfaces: PS2, I2C, RS232, smart card IF, matrix scan, bar code, and magnetic card decoder. The magnetic card decoder consumed a relatively huge chunk of the chip and had four identical designs -- one for each track. This led me to the idea of multiplying the track decoder by inserting registers. I ended up writing my diploma thesis on this technique.
As a field application engineer for LSI Logic's MIPS processors, I enjoyed a few months working with the MIPS team in Milpitas, Calif. I realized that the multiplication of processors by inserting registers cannot be performed realistically by a design team using hand-crafting techniques. An automated approach is mandatory. I created an EDA tool that performs timing estimation on RTL and can automatically modify the RTL for things like register insertion.
In 2010, I decided to revive my student project and to further develop my EDA tool. I have spent a substantial amount of time on this subject. The more I dig into it, the more excited I become, especially with regard to multicore system architectures.
The basics of C-Slow Retiming
Don't worry if you don't fully grasp the concept of CSR straight away. I once explained it to two engineers in front of a whiteboard. One of the engineers got it immediately, but it took both of us an inordinate amount of time to explain it to the other engineer. Eventually, we gave up. It's like the classic picture shown below. Either you can see both the old woman and the young woman, or you can't. Once you do see both, you realize how simple it is.
Do you see both the old and the young woman in this image?
A theoretical example
Let's start by considering a simple circuit involving a two-input AND gate driving a second two-input AND gate. The output feeds the input of a register, as illustrated below.
Solving an equation in one cycle.
We can think of the two levels of AND as implementing a simple equation or algorithm. In the image above, this equation is solved (or evaluated) in a single clock cycle. An alternative approach would be to add some register elements and solve the equation in two clock cycles, as illustrated below.
Solving the same equation in two cycles.
The logical result is identical for both circuits. However, in this new circuit, we can start a completely independent calculation on the second clock cycle. Also, if we assume that each AND gate represents multiple levels of combinatorial logic, splitting the path in half means we can theoretically run the clock at twice the speed, so the time required to solve a single equation (two short cycles instead of one long cycle) does not change. Looking at this another way, by adding registers, we can solve the same equation twice as often.
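As a rough illustration, the two circuits can be modeled cycle by cycle in Python. This is only a sketch under stated assumptions, not the original design: the function names, the third input c, and the extra register that delays c alongside the first AND result are all made up for the example. It feeds two independent input streams into the two-cycle circuit on alternating clock cycles and splits the results back apart.

```python
def single_cycle(inputs):
    """Original circuit: both AND gates sit in front of one output register."""
    out_reg = 0
    trace = []
    for a, b, c in inputs:
        out_reg = (a & b) & c       # clock edge: register captures the full result
        trace.append(out_reg)
    return trace


def two_cycle(inputs):
    """Pipelined circuit: a register stage is inserted between the two gates."""
    mid_ab = mid_c = out_reg = 0    # assumed reset state of all registers
    trace = []
    for a, b, c in inputs:
        # All registers load on the same clock edge, so every next value is
        # computed from the current register contents before any is updated.
        next_out = mid_ab & mid_c   # second AND gate, fed by the middle registers
        next_ab, next_c = a & b, c  # first AND gate; c is delayed to stay aligned
        mid_ab, mid_c, out_reg = next_ab, next_c, next_out
        trace.append(out_reg)
    return trace


def run_interleaved(stream_x, stream_y):
    """Feed two independent, equal-length streams on alternating clock cycles."""
    interleaved = [item for pair in zip(stream_x, stream_y) for item in pair]
    interleaved.append((0, 0, 0))          # one extra cycle to flush the pipeline
    results = two_cycle(interleaved)[1:]   # drop the first output (reset value)
    return results[0::2], results[1::2]    # split back into the two streams


if __name__ == "__main__":
    stream_x = [(1, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
    stream_y = [(0, 1, 1), (1, 1, 1), (1, 1, 1), (0, 0, 0)]
    xs, ys = run_interleaved(stream_x, stream_y)
    # Each stream should match what a dedicated single-cycle circuit produces.
    print(xs == single_cycle(stream_x), ys == single_cycle(stream_y))
```

In this model, the two-cycle circuit needs twice as many clock edges to process both streams, which is only a win if the shorter logic path really does allow the clock to run at about twice the speed, as assumed above.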