As frequencies continue to increase in complex FPGA designs, finding the optimal point for pipeline stage insertion so as to manage routing delay issues may not be so easy. Register retiming comes in very handy in these situations, and this article outlines recommended practices that show you how to qualify an FPGA-based design as a good candidate for register retiming, along with specific examples for optimal performance results.
As an increasing number of high-performance designs are now being realized using programmable logic platforms, designers need to figure out how working with these platforms differs from traditional cell-based design processes. The effect of routing delay is a case in point.
As one of the discerning designers among the growing legions of engineers tackling FPGA designs today, you will inevitably find yourself increasingly restricted within the fixed programmable interconnect network. Not having complete freedom with respect to signal routing can become a rather tricky proposition.
In many cases, achieving performance requirements hinges on sequential elements being optimally placed so as to minimize and balance path delays. Historically, when the routing performance did not comfortably meet requirements, you simply inserted additional pipeline stages manually, at strategic points within large combinatorial logic paths, to reduce and balance path delays. Accommodating these extra pipeline stages was generally not an issue, as most programmable logic architectures offered ample sequential elements.
As design frequencies continue to increase, however, determining the optimal point for pipeline stage insertion may not be so easy. One way to overcome this is to make the most of the algorithms offered in today's EDA tools, such as register retiming, which is an optimization strategy that leverages positive slack on one side of a sequential element to address or balance negative slack on the other.
The register retiming algorithm works by literally "moving" registers across portions of combinatorial logic so that the worst-case combinatorial delays on the input and output sides of the register are better balanced. Of course, in order to move the register, the algorithm must take great care to preserve all reset, preset, and enable functionality associated with the register's original position within the circuit.
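The balancing idea can be sketched in a few lines of Python. This is a toy model rather than how any synthesis tool implements retiming: it assumes a single linear chain of gates between two fixed registers, ignores fan-out and reset/enable constraints, and uses made-up gate delays and a hypothetical 5.0 ns clock period.

```python
# Toy model: a chain of gates sits between two fixed registers, and one
# movable register splits the chain into an input side and an output side.
# All delay values and the clock period are hypothetical.

def worst_stage_delay(gate_delays_ns, register_index):
    """Worst combinatorial delay when the register sits before gate `register_index`."""
    before = sum(gate_delays_ns[:register_index])
    after = sum(gate_delays_ns[register_index:])
    return max(before, after)

def best_register_index(gate_delays_ns):
    """Position that best balances the delay on either side of the register."""
    positions = range(len(gate_delays_ns) + 1)
    return min(positions, key=lambda i: worst_stage_delay(gate_delays_ns, i))

DELAYS_NS = [1.0, 1.5, 2.0, 0.5, 3.0, 1.0]
CLOCK_PERIOD_NS = 5.0

original = worst_stage_delay(DELAYS_NS, 1)   # register after the first gate
best = best_register_index(DELAYS_NS)
retimed = worst_stage_delay(DELAYS_NS, best)

print("slack before retiming:", CLOCK_PERIOD_NS - original)
print("slack after retiming: ", CLOCK_PERIOD_NS - retimed)
```

In this sketch the original placement leaves 1.0 ns of logic on the input side and 8.0 ns on the output side; moving the register two gates forward gives 4.5 ns on each side, so the positive slack on one side pays for the negative slack on the other.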
It's important to be aware that each synthesis tool has its own implementation of a register retiming algorithm. A good implementation, such as that used in the Precision Synthesis tool from Mentor Graphics, will allow you to move registers either forward or backward across combinatorial logic in order to reduce negative path slack. As seen in Fig 1, register retiming can lead to either an increase or a decrease in the number of flip-flops in the design. If an increase occurs, accommodating these extra flip-flops is generally not an issue, as most programmable logic architectures today still offer an ample supply of sequential elements. Moreover, while the number of flip-flops may change, the number of pipeline stages does not; in fact, register retiming is constrained to operate only in ways that preserve design functionality at the top-level design ports. Hence, the algorithm will only use the pipeline stages that are described in the circuit.
Fig 1. Look for an implementation of the retiming algorithm that allows you to move registers either forward or backward across combinatorial logic in order to reduce negative path slack.
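To see why the flip-flop count can change while the behavior at the ports does not, consider a minimal cycle-by-cycle simulation in Python. It assumes a hypothetical two-input AND gate with a register on each input, which retiming replaces with a single register on the gate output; all registers are assumed to power up at 0.

```python
# Hypothetical circuit: y = AND(a, b) with one pipeline stage.
# "Before" has a flip-flop on each gate input; "after" has a single
# flip-flop on the gate output (the registers were moved forward).

def simulate_before(a_seq, b_seq):
    """Two flip-flops, one on each AND-gate input."""
    reg_a = reg_b = 0                 # assumed power-up values
    outputs = []
    for a, b in zip(a_seq, b_seq):
        outputs.append(reg_a & reg_b)  # value at the output port this cycle
        reg_a, reg_b = a, b            # clock edge captures the inputs
    return outputs

def simulate_after(a_seq, b_seq):
    """One flip-flop on the AND-gate output."""
    reg_y = 0                         # must equal AND of the old power-up values
    outputs = []
    for a, b in zip(a_seq, b_seq):
        outputs.append(reg_y)
        reg_y = a & b
    return outputs

A = [1, 1, 0, 1]
B = [1, 0, 1, 1]
print(simulate_before(A, B) == simulate_after(A, B))
```

Both versions have one pipeline stage and produce the same output sequence, but the second uses one flip-flop instead of two. The comment on the initial value of reg_y hints at why reset and preset conditions constrain which moves are legal.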
Register retiming can take either a level-driven or a timing-driven approach. In a level-driven approach, the algorithm counts the number of logic levels (usually look-up tables or other cells) between sequential elements. Typical level-driven approaches lack the accuracy required for effective performance, as not all logic levels will incur the same delay penalty. For instance, a fast carry-chain multiplexer has a significantly different delay than a look-up table, which – in turn – has a significantly different delay than a combinatorial multiplier cell. Thus, simply counting logic levels in between sequential elements is not accurate enough to guide register retiming decisions.
Surround yourself with slackers
In a timing-driven approach, a full static timing analysis is performed and the path slack data is used to guide the retiming algorithm. Given the complexity of current programmable logic architectures, a timing-driven approach is recommended.
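The difference between the two approaches can be illustrated with a small Python sketch. The cell names and delay values below are invented for illustration and do not describe any particular device; delays are in picoseconds to keep the arithmetic exact.

```python
# Hypothetical per-cell delays in picoseconds (not from any real device).
CELL_DELAY_PS = {"carry_mux": 50, "lut": 500, "multiplier": 3000}

def logic_levels(path):
    """Level-driven metric: number of cells between two registers."""
    return len(path)

def path_delay_ps(path):
    """Timing-driven metric: sum of the individual cell delays."""
    return sum(CELL_DELAY_PS[cell] for cell in path)

PATHS = {
    "carry_chain": ["carry_mux"] * 8,       # many levels, each very fast
    "mult_then_lut": ["multiplier", "lut"],  # few levels, one very slow
}

worst_by_levels = max(PATHS, key=lambda name: logic_levels(PATHS[name]))
worst_by_delay = max(PATHS, key=lambda name: path_delay_ps(PATHS[name]))

print("level-driven picks: ", worst_by_levels)
print("timing-driven picks:", worst_by_delay)
```

With these assumed delays, the eight-level carry chain takes 400 ps while the two-level multiplier path takes 3500 ps, so a level count singles out the wrong path as the one most in need of help.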
So how can you determine whether or not register retiming is appropriate for your design? You can predict the effectiveness of register retiming by examining the critical paths, along with their adjacent paths. For retiming to be effective, look for positive slack on one or both of the timing paths adjacent to each negative slack path. A critical path with a slack of -2.0 ns is shown in Fig 2. Using the Precision Synthesis tool in this example, you may use the "View Critical Path" capability, and then "hover" the mouse pointer over either the input pin of the launching register or the output pin of the capturing register. This way, you can check to see if there is positive slack on one or both of the adjacent timing paths. Positive slack on one or more adjacent timing paths indicates a potentially good candidate path for register retiming. Use the "Advanced Report Timing" capability to explore the timing of the other negative slack paths.
Fig 2. For retiming to be effective, look for positive slack on one or both of the timing paths adjacent to the negative slack path(s).
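The same check can be scripted if you export path slack data from your timing report. The following Python sketch assumes a hypothetical list of register-to-register paths with made-up register names and slack values; it is not tied to any tool's report format.

```python
from typing import NamedTuple

class TimingPath(NamedTuple):
    launch: str      # launching register (hypothetical name)
    capture: str     # capturing register (hypothetical name)
    slack_ns: float

def retiming_candidates(paths):
    """Negative-slack paths with positive slack on at least one adjacent side."""
    candidates = []
    for path in paths:
        if path.slack_ns >= 0:
            continue
        # Slack seen at the launching register's input pin: worst path ending there.
        upstream = [p.slack_ns for p in paths if p.capture == path.launch]
        # Slack seen at the capturing register's output pin: worst path starting there.
        downstream = [p.slack_ns for p in paths if p.launch == path.capture]
        up = min(upstream) if upstream else None
        down = min(downstream) if downstream else None
        if (up is not None and up > 0) or (down is not None and down > 0):
            candidates.append((path, up, down))
    return candidates

PATHS = [
    TimingPath("r0", "r1", 1.5),
    TimingPath("r1", "r2", -2.0),   # critical path with slack on both sides
    TimingPath("r2", "r3", 0.8),
    TimingPath("a0", "a1", -0.1),
    TimingPath("a1", "a2", -0.5),   # critical path with no slack to borrow
    TimingPath("a2", "a3", -0.3),
]

for path, up, down in retiming_candidates(PATHS):
    print(path.launch, "->", path.capture, path.slack_ns, "adjacent:", up, down)
```

Here only the r1-to-r2 path is reported. The three paths in the a0-to-a3 chain all have negative slack, but none of their neighbors has positive slack to give, so retiming alone is unlikely to help there.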
Take the proactive route
You may also take a more proactive approach, by increasing the number of pipeline stages made available to the register retiming algorithm in areas where they may be useful. This approach is best adopted early in the design cycle, where there is still freedom to change implementation details. Prior to functional sign-off, run a quick synthesis pass of the design and examine the most critical paths. You can configure the static timing engine (known as PreciseTime within Precision Synthesis) to list any number of timing paths in its setup path slack summary report (Fig 3).
Fig 3. Using a more proactive approach, you can increase the number of pipeline stages made available to the register retiming algorithm in areas where they are more useful.
By examining the source and destination registers from the setup path slack summary, you can determine the precise locations where adding the extra pipeline stages will be most effective (Fig 4). When adding pipeline stages, it is best to describe these without an asynchronous reset or preset, because these conditions can prevent registers from being moved by retiming, especially in the backward direction.
Fig 4. Examine the source and destination registers from the setup path slack summary to determine the precise locations where adding the extra pipeline stages will be most effective.
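If the slack summary is long, a short script can help you spot where failing paths cluster. The Python sketch below assumes the summary has been reduced to (source register, destination register, slack) entries, with hierarchical names separated by "/"; the names, values, and separator are all hypothetical.

```python
import heapq
from collections import Counter

def worst_setup_paths(paths, count):
    """Return the `count` paths with the smallest setup slack, worst first."""
    return heapq.nsmallest(count, paths, key=lambda p: p[2])

def failing_block_pairs(paths):
    """Count negative-slack paths per (source block, destination block) pair."""
    tally = Counter()
    for source, destination, slack_ns in paths:
        if slack_ns < 0:
            # Assumes the top-level block name precedes the first "/".
            tally[(source.split("/")[0], destination.split("/")[0])] += 1
    return tally.most_common()

SUMMARY = [
    ("fir/tap_reg[3]", "fir/acc_reg[17]", -1.2),
    ("fir/tap_reg[5]", "fir/acc_reg[17]", -0.9),
    ("ctrl/state_reg[1]", "fir/coef_reg[0]", 0.4),
    ("fir/tap_reg[0]", "fir/acc_reg[12]", -0.3),
    ("dma/addr_reg[4]", "ctrl/state_reg[0]", -0.1),
]

print(worst_setup_paths(SUMMARY, 2))
print(failing_block_pairs(SUMMARY))
```

In this made-up summary, three of the four failing paths run from the tap registers to the accumulator inside the fir block, which suggests that an extra pipeline stage there would give the retiming algorithm the most to work with.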
Finally, it is valuable to examine the effects of placement. Since pre-layout timing data is generally less accurate than post-layout timing data due to variances in interconnect delay modeling, it is worth taking note when negative slack paths emanate from – or terminate with – a dedicated resource such as an embedded RAM or DSP block. As these dedicated resources tend to be situated only in specific locations on the die, one should expect a wider variance in the accuracy of pre-layout delay modeling of the interconnect associated with these cells. With a pre-layout register retiming approach, manual effort may be required in these situations to coach the pre-layout timing analysis to see the same negative slack paths as seen in the post-layout timing analysis.
Alternatively, Mentor Graphics has built into its physical synthesis technology full physical register retiming capabilities that use actual post-layout timing data for greater accuracy. This and other innovative physical synthesis algorithms have achieved a high level of maturity over the product's three-year history; they offer the same benefits you would expect from a retiming algorithm used in RTL synthesis, coupled with the added predictability associated with using actual post-layout timing data rather than simpler pre-layout models.
Achieving performance requirements hinges on sequential elements being optimally placed so as to minimize and balance path delay. You can effectively use advanced analysis capabilities, such as those offered within Mentor's Precision Synthesis environment, to estimate the effectiveness of using a register retiming strategy on a given design. You can also identify circuit areas where adding pipeline stages to a new design will be most effective.
Take particular care when a dedicated resource such as an embedded RAM or DSP block is in the critical path. In these situations, a physical register retiming algorithm can offer more predictable performance improvement by making intelligent use of post-layout timing data.
Darren Zacher is a Technical Marketing Engineer with Mentor Graphics Design Creation and Synthesis Division, where he is focused on device and flow support for Mentor Graphics leading synthesis products. Prior to Mentor, Darren came from a design background with a breadth of experience in ASIC & FPGA design and verification, and embedded software design, working in a variety of application areas including networking, GPS, digital video, and USB. Darren holds a B.A.Sc. in Computer Engineering from the University of Waterloo, Ontario, in Canada. Darren can be reached at email@example.com.