# Introducing C-Slow Retiming & System Hyper Pipelining

System Hyper Pipelining allows a large number of processor cores to be instantiated in the same FPGA.

**A design example**

Let's apply this technique to a slightly more sophisticated design. Any single-clock design can be defined as a set of inputs, a set of outputs, and a graph of logic elements and registers.

CSR can automatically perform appropriate register insertion on our more sophisticated design, as illustrated below.

In this case, it takes two clock cycles to achieve the same behavior as the original design, but we now have a second, totally independent design that uses the combinatorial logic in a time-sliced fashion.

Whether the original design is already pipelined (as in a CPU) is totally irrelevant. If we follow the rule to insert the same number of registers in any of the original logic paths, we multiply the functionality of the design/core. If the registers are placed using a timing-driven algorithm, the performance of a single core remains almost the same. More register levels can be inserted as required, and the functionality multiplies accordingly. Performing this automatic register insertion on the RTL simplifies the entire implementation and verification process.
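To make the time-slicing idea concrete, here is a minimal behavioral sketch in Python. It is not the author's tool or RTL flow. It assumes a toy design (an 8-bit accumulator with a single state register) and models only the C-slow step, where that register becomes a chain of `c` registers. The retiming step that would move those registers into the combinational logic is not modeled. The function names are hypothetical.

```python
def run_original(inputs, state=0):
    """Original design: one state register fed by combinational logic (an 8-bit adder)."""
    outputs = []
    for x in inputs:
        state = (state + x) & 0xFF  # combinational add, then register update
        outputs.append(state)
    return outputs


def run_c_slowed(interleaved_inputs, c=2):
    """C-slowed design: the state register is replaced by a chain of c registers.

    The single adder is shared. On each clock cycle it serves a different,
    independent thread, so every thread advances once every c cycles.
    """
    regs = [0] * c  # regs[-1] feeds the adder, regs[0] captures its result
    outputs = []
    for x in interleaved_inputs:
        result = (regs[-1] + x) & 0xFF
        regs = [result] + regs[:-1]  # shift the register chain by one stage
        outputs.append(result)
    return outputs


if __name__ == "__main__":
    thread_a = [1, 2, 3, 4]
    thread_b = [10, 20, 30, 40]
    # Feed the two threads on alternating clock cycles.
    interleaved = [v for pair in zip(thread_a, thread_b) for v in pair]
    out = run_c_slowed(interleaved, c=2)
    print("thread A:", out[0::2])
    print("thread B:", out[1::2])
```

De-interleaving the output shows that each thread behaves exactly like the original design, just spread over two clock cycles per step.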

**Timing estimation on RTL**

No matter which Altera or Xilinx component families I use (e.g., Flex10k or Virtex), this is the central observation that facilitates my work on the CSR technology on RTL. Now, what do you think Johann Carl Friedrich Gauss might have to do with timing estimation on a Virtex 5 FPGA in 2014?

Mr. Gauss (1777-1855) was a German mathematician and physical scientist who contributed significantly to many fields, including number theory, algebra, statistics, analysis, differential geometry, geodesy, geophysics, electrostatics, astronomy, optics, and -- unbeknownst to him -- FPGAs.

I think we are all familiar with the concept of a lookup table (LUT), which forms the basis for the programmable fabric in an FPGA. Assuming each LUT has one output with an associated net, let's call this an LUT net pair. Now, let's take a fairly big design -- a 32-bit MIPS processor -- and place it unconstrained with low utilization in a Virtex 5 FPGA. If we extract enough data out of the static timing analysis (STA) report, we will see that the individual LUT net pair delays follow a χ² (chi-squared) distribution.

I don't want to go too much into the math (as if I could), but if you have multiple behaviors following a χ² distribution (with high *k*), they can be estimated using a Gaussian (normal) distribution. If you extract enough empirical data, you see the following distribution of consecutive LUT net pairs in your timing report file.
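The convergence toward a normal distribution for high *k* can be checked numerically. The following is a small sketch using only the Python standard library. It uses synthetic random numbers, not timing data from any FPGA, and the helper names are made up for illustration. It compares the skewness of χ² samples for a low and a high number of degrees of freedom; a normal distribution has a skewness of 0.

```python
import random
import statistics


def chi_squared_sample(k, rng):
    """One chi-squared variate with k degrees of freedom:
    the sum of k squared standard-normal values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


def skewness(samples):
    """Sample skewness: about 0 for a symmetric (e.g. normal) distribution."""
    mean = statistics.mean(samples)
    sd = statistics.pstdev(samples)
    return sum(((s - mean) / sd) ** 3 for s in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for k in (2, 100):
        data = [chi_squared_sample(k, rng) for _ in range(5000)]
        print(f"k={k:3d}  mean={statistics.mean(data):7.2f}  "
              f"skewness={skewness(data):5.2f}")
```

The theoretical skewness of a χ² distribution is sqrt(8/k), so it shrinks as *k* grows and the histogram looks increasingly like a bell curve.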

Based on the empirical data, we can say that, with a certain probability, one LUT net pair delay can be estimated as µLN = 820 ps on a Virtex 5 without constraining the design, where µLN is the mean delay of one LUT net pair. The delay of a path through multiple LUTs can be estimated using (lut * µLN), where lut equals the number of LUTs in the path. You may be prompted to say, "So, what? Every FPGA engineer does this quite naturally." However, I believe this indicator deserves more attention. It is definitely useful for timing estimation on FPGAs using CSR-based designs. So let's discuss some points:

- Apart from anything else, it is a lot of fun -- predicting a certain statistical behavior, extracting empirical data, and finding a good match. A good match saves your day.
- Special hard-coded logic in the FPGA (DSP blocks, fast carry chains) also follows a normal distribution with an FPGA-specific mean (e.g., µSN=1.582 ns for a Virtex 5).
- The normal distribution becomes 1 for a high number of LUTs on the path, and constraining the critical path affects the path delay. In any case, the µLN indicator can be used for fast static timing estimations. In fact, timing optimizations start improving paths with delays greater than (lut * µLN). It is obvious that timing optimizations get more costly as soon as the worst-case delay = (lut * µLN).
- CSR-based designs usually don't have more than four LUTs on a critical path. This is why this estimation works so well. It must be performed on higher-level representations (RTL or higher) where the concrete value of µLN is not important.
- It is an indicator that lets you compare two individual FPGAs independent of the design.
- It provides you with a design- and synthesis-independent indicator to compare two different technologies.
- It lets you predict the timing of your design on a new technology (and it doesn't look good for FPGAs).
- It lets you predict how far you are from a realistically achievable timing for your design.
- The indicator works for FPGAs and ASICs alike. Having said this, I personally haven't run the analysis on an ASIC database for more than 10 years as of this writing.
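The (lut * µLN) estimate discussed above is simple enough to sketch in a few lines of Python. The 820 ps value is the Virtex 5 figure quoted in this article. The function names and the clock-frequency conversion are assumptions added for illustration; the conversion in particular ignores register setup and clock-to-output times, so treat it as a rough upper bound rather than a timing report.

```python
MU_LN_PS = 820.0  # mean LUT net pair delay quoted above (Virtex 5, unconstrained)


def estimate_path_delay_ps(lut_count, mu_ln_ps=MU_LN_PS):
    """Estimate a path delay as lut * µLN, in picoseconds."""
    if lut_count < 0:
        raise ValueError("lut_count must be non-negative")
    return lut_count * mu_ln_ps


def estimate_fmax_mhz(lut_count, mu_ln_ps=MU_LN_PS):
    """Rough clock-frequency bound from the estimated path delay.

    Assumption: register setup and clock-to-output delays are ignored.
    """
    delay_ps = estimate_path_delay_ps(lut_count, mu_ln_ps)
    if delay_ps == 0:
        raise ValueError("path must contain at least one LUT")
    return 1e6 / delay_ps  # a period of 1e6 ps corresponds to 1 MHz


if __name__ == "__main__":
    for luts in (2, 4, 8):
        print(f"{luts} LUTs: {estimate_path_delay_ps(luts):.0f} ps, "
              f"about {estimate_fmax_mhz(luts):.0f} MHz")
```

For the four-LUT critical path mentioned above, this gives 4 * 820 ps = 3,280 ps.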

You could argue that if you have a completely random design with enough statistical data, the static timing diagram would follow a Gaussian distribution. I guess you sometimes assume that by looking at the STA histogram. Do you have any thoughts on this?

I haven't found these curves in any of the IEEE papers I scanned. Do you know about any reports with similar observations? I'm very much looking forward to hearing your thoughts on this one. Also, are there any other statistical behaviors you know of in the FPGA timing estimation field?