Software has come to dominate system-on-chip (SoC) development. It is increasingly common for the software effort to be on the critical path of the project schedule. Only FPGA-based prototyping provides both the speed and accuracy necessary to develop and validate complex software integration prior to silicon. The main benefits of an FPGA-based prototype are:
- Quick fine tuning of hardware/software integration and software validation pre-silicon
- In-system device validation with real-time interfaces and in the end application
- Extended register transfer level (RTL) testing and debugging
Prototyping next-generation SoCs, which contain more functionality than the capacity offered by a single FPGA, means spreading that functionality across multiple FPGAs, leading to challenging partitioning and timing closure issues. Traditional prototyping solutions manage device under test (DUT) partitioning either at the RTL level or at the gate level, but they fail to offer a predictable and efficient flow that would allow the FPGA-based SoC prototype to be brought up quickly.
This post examines FPGA-based prototyping challenges and presents an innovative methodology unifying the benefits of gate-level partitioning and RTL partitioning, providing a short, automated, and predictable path to prototype.
Multi-FPGA partitioning challenges
Multi-FPGA partitioning is a complex optimization problem that must handle multiple constraints and concurrent objectives. The partitioning challenges that have to be overcome to make FPGA-based prototyping effective are:
- Heterogeneous FPGA logic resources management
- Unbalanced interconnect management and pin multiplexing
- Timing closure issues and timing constraints generation
- Incremental flow for fast turnaround
- Full system verification and simulation
- Bug hunting methodologies
In Figure 1, we compare the two traditional methodologies for multi-FPGA partitioning. Gate-level partitioning takes a full-chip synthesized netlist as input and therefore has the advantage of providing accurate logic and timing information that lets you run optimized partitioning automatically.
Figure 1. Gate-level flow vs. RTL flow.
However, as shown in the table below, generating netlist partitions at the gate level complicates the partitioned DUT simulation, reduces debugging capabilities, and prevents incremental updates when the RTL DUT changes locally.
By comparison, RTL partitioning has the advantage of providing the user with a good level of confidence at each step of the flow by continuously using the initial testbench on the various versions of the DUT. In addition, this is an incremental flow allowing local updates when changes are made to the RTL. However, the high-level RTL representation prevents accurate evaluation of the resource utilization of the DUT components. Also, precise analysis of the DUT's timing aspects (combinatorial paths, clock domains, etc.) is very difficult.
We will now discuss partitioning challenges and the pros and cons of the various methodologies in more detail.
Heterogeneous FPGA logic resource management
Accurate estimation of logic utilization: FPGA devices contain heterogeneous logic resources such as LUTs, registers, RAMs, and DSPs. The limited capacity of each resource must be treated as a hard constraint during the partitioning phase. As shown in Figure 2, an FPGA can be modeled as a device with multiple resource layers that must be managed in parallel during partitioning.
Figure 2. Hardware platform: heterogeneous logic resources and unbalanced interconnect distribution.
In general, LUTs are the most critical resource with the highest utilization/capacity ratio. DSP and RAM blocks can also be critical for some designs. All this makes the partitioning problem multi-constrained and therefore very complex. Furthermore, FPGA logic utilization must be managed carefully, since it impacts the routing congestion and timing constraints and thus the required FPGA place-and-route (P&R) effort.
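The multi-constraint nature of the problem can be sketched in a few lines of Python. This is only an illustration, not the method of any particular tool: the `Block` type, the `fits` helper, and all capacity and usage numbers are made up for the example. It shows that a candidate partition must pass every resource layer at once, so a partition that fits comfortably in LUTs can still be infeasible because of DSPs.

```python
from dataclasses import dataclass

# Resource layers named in the text; the identifiers are hypothetical.
RESOURCES = ("lut", "reg", "ram", "dsp")


@dataclass
class Block:
    """A DUT component with its per-resource usage (illustrative numbers)."""
    name: str
    usage: dict


def utilization(blocks, capacity):
    """Return the utilization/capacity ratio of each resource layer."""
    totals = {r: sum(b.usage.get(r, 0) for b in blocks) for r in RESOURCES}
    return {r: totals[r] / capacity[r] for r in RESOURCES}


def fits(blocks, capacity, max_fill=1.0):
    """A partition is feasible only if every layer stays within max_fill."""
    return all(u <= max_fill for u in utilization(blocks, capacity).values())


# Made-up capacity of one FPGA and made-up usage of two DUT blocks.
fpga = {"lut": 1000, "reg": 2000, "ram": 10, "dsp": 20}
cpu = Block("cpu", {"lut": 600, "reg": 900, "ram": 4, "dsp": 0})
dsp_core = Block("dsp_core", {"lut": 300, "reg": 500, "ram": 2, "dsp": 24})

ratios = utilization([cpu, dsp_core], fpga)
print(ratios["lut"], ratios["dsp"])        # LUTs fit (0.9), DSPs do not (1.2)
print(fits([cpu, dsp_core], fpga))         # False: the DSP layer overflows
print(fits([cpu], fpga, max_fill=0.5))     # False: 60% LUT fill exceeds the 50% target
```

The `max_fill` parameter is an assumption added here to reflect the point that high utilization increases routing congestion and P&R effort; in practice a team might target a fill ratio well below 100%.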
With RTL partitioning, it is very difficult to estimate how much RTL code will fit into an FPGA. The high-level RTL representation is unsuitable for computing the resource utilization of DUT components. This is why partitioning in an RTL flow is a painful, hands-on process that requires repeated FPGA synthesis runs to check whether the solution meets the FPGAs' logic resource capacities. In most cases, one must iterate in this way to obtain a feasible solution, which can take days or weeks. This effort and the resulting implementation delay make time to prototype with RTL partitioning unpredictable and the overall validation process unstable.
Partitioning at the gate level provides accurate information regarding logic resources. This makes finding a feasible partitioning solution that meets the logic resource capacities more straightforward, without requiring multiple iterations.
Logic and signal optimization: Over time, SoCs have evolved into complex assemblies of multiple configurable intellectual property (IP) cores, which may be developed internally or provided by third parties. For a specific SoC implementation, IP components may have unconnected outputs or inputs tied to constant logic values. The problem is that partitioning adds hierarchical boundaries that may mask unconnected logic when FPGA local synthesis is run.
Execution and propagation of potential simplifications is often ineffective at the RTL level. As illustrated in Figure 3, signals may be cut and cross partition boundaries after RTL partitioning, thereby disabling the inter-FPGA pruning propagation during FPGA synthesis. Consequently, useless logic and inter-FPGA signals may be preserved and consume costly logic and interconnect resources.
Gate-level partitioning takes advantage of netlist pruning and constant propagation performed during full-chip synthesis.
Figure 3. Inefficient logic pruning and interconnect optimization.
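The pruning effect described above can be reproduced on a toy netlist. The following Python sketch is an assumption-laden illustration, not a model of any real synthesis tool: the `prune` function, the three gate types, and the five-gate netlist are all invented for the example. It runs constant propagation and dead-logic removal once on the full chip, and once per FPGA after a partition has turned cut signals into ports that local synthesis must preserve.

```python
def prune(gates, outputs, constants=None):
    """Return the gates that survive constant propagation and dead-logic removal.

    gates:     dict mapping an output signal to (op, [input signals])
    outputs:   signals observed outside this netlist (must be preserved)
    constants: dict mapping tied-off input signals to 0 or 1
    """
    known = dict(constants or {})
    changed = True
    while changed:  # forward pass: propagate constants until stable
        changed = False
        for out, (op, ins) in gates.items():
            if out in known:
                continue
            vals = [known.get(s) for s in ins]
            value = None
            if op == "AND":
                if 0 in vals:
                    value = 0
                elif all(v == 1 for v in vals):
                    value = 1
            elif op == "OR":
                if 1 in vals:
                    value = 1
                elif all(v == 0 for v in vals):
                    value = 0
            elif op == "BUF":
                value = vals[0]
            if value is not None:
                known[out] = value
                changed = True
    live = set()  # backward pass: keep only logic that reaches an output
    stack = [s for s in outputs if s not in known]
    while stack:
        sig = stack.pop()
        if sig in live or sig not in gates:
            continue
        live.add(sig)
        stack.extend(i for i in gates[sig][1] if i not in known)
    return live


GATES = {
    "a1": ("AND", ["en", "x"]),  # FPGA A; input en is tied to 0
    "a2": ("OR", ["x", "w"]),    # FPGA A; only feeds unused logic in B
    "b1": ("OR", ["a1", "y"]),   # FPGA B
    "b2": ("AND", ["b1", "z"]),  # FPGA B; the only primary output
    "b3": ("BUF", ["a2"]),       # FPGA B; output left unconnected
}

# Full-chip view: pruning sees through the future partition boundary.
full_chip = prune(GATES, outputs=["b2"], constants={"en": 0})

# Per-FPGA view: cut signals a1 and a2 are ports of A and must be kept.
part_a = {k: GATES[k] for k in ("a1", "a2")}
part_b = {k: GATES[k] for k in ("b1", "b2", "b3")}
local_a = prune(part_a, outputs=["a1", "a2"], constants={"en": 0})
local_b = prune(part_b, outputs=["b2"])

print(sorted(full_chip))          # ['b1', 'b2']
print(sorted(local_a | local_b))  # ['a2', 'b1', 'b2']
```

In the full-chip run, `a1` collapses to a constant and `a2` is recognized as dead, so neither needs an inter-FPGA connection. In the per-FPGA run, FPGA A cannot know that `a2` is unused on the other side, so the gate and both boundary signals are retained.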