Board interconnect management
Inter-FPGA unbalanced interconnect management: Multi-FPGA hardware systems provide physical connections between FPGA input/outputs (I/Os). These inter-FPGA routing resources are very limited due to the limited number of FPGA I/Os. In addition, the distribution of connections between FPGA pairs can be unbalanced due to floor plan constraints. In Figure 4, FPGA 1 has 375 bidirectional physical connections with FPGA 2 and no connection with FPGA 4. Thus, the number of DUT signals cut between two FPGAs must be correlated to available physical connections to obtain a balanced routing congestion distribution.
Unbalanced interconnect must be considered during partitioning to reduce the max mux ratio.
This complex problem requires a large exploration effort. Decisions about which design modules to separate and their concurrent effect on resulting cut signals are not obvious. A local improvement of signals cut between two FPGAs may hurt the solution quality of other FPGA pairs.
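The interaction between FPGA pairs can be illustrated with a minimal sketch. The netlist, the module names (`A`, `B`, `C`), and the function `cuts_per_pair` are hypothetical and chosen only for illustration. As a simplifying assumption, a net that spans several FPGAs is counted once for every pair it touches.

```python
from collections import Counter
from itertools import combinations


def cuts_per_pair(nets, placement):
    """Count cut signals for each FPGA pair.

    nets: dict mapping a net name to the modules it connects.
    placement: dict mapping a module name to an FPGA id.
    """
    cuts = Counter()
    for modules in nets.values():
        # Distinct FPGAs touched by this net, in a stable order.
        fpgas = sorted({placement[m] for m in modules})
        for pair in combinations(fpgas, 2):
            cuts[pair] += 1
    return cuts


# Hypothetical four-net design spread over three FPGAs.
nets = {"n1": ["A", "B"], "n2": ["A", "B"], "n3": ["B", "C"], "n4": ["A", "C"]}

before = cuts_per_pair(nets, {"A": 1, "B": 2, "C": 3})
after = cuts_per_pair(nets, {"A": 1, "B": 3, "C": 3})  # module B moved to FPGA 3

print(dict(before))
print(dict(after))
```

In this toy case, moving module B removes both cut signals between FPGA 1 and FPGA 2, but raises the count between FPGA 1 and FPGA 3 from one to three.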
The main partitioning objective is to find a tradeoff between FPGA logic and board interconnect resource utilization. The inaccuracy of logic resource estimation at the RTL level makes partitioning a tweaking process. Thus, unanticipated effort and time are spent trying to find a high-quality partitioning solution where cut signals match available physical tracks.
Accurate gate-level resource estimation simplifies the partitioning process and avoids wasting time and effort exploring unfeasible and performance-penalizing solutions.
Logic replication: The goal of replication is to minimize the use of I/Os for each FPGA. Sometimes it is very useful simply to replicate chunks of logic in each FPGA, thereby paying an area penalty but reducing the number of inter-FPGA DUT signals. In Figure 5, by replicating the control block (CTL) in FPGA 2 and using only two signals, we can avoid the ADDR 32-bit bus connection between FPGA 1 and FPGA 2.
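The trade-off in this example can be written down as a small sketch. The helper names and the area figures are assumptions made for illustration; only the 32-bit ADDR bus and the two replacement signals come from the example above.

```python
def replication_gain(outputs_cut, inputs_to_route):
    """Inter-FPGA signals saved by replicating a module in the receiving FPGA.

    outputs_cut: module outputs that currently cross the FPGA boundary.
    inputs_to_route: signals that must cross instead to feed the replica.
    """
    return outputs_cut - inputs_to_route


def fits(module_area, used_area, capacity):
    """Check that the replica does not exceed the target FPGA's logic capacity."""
    return used_area + module_area <= capacity


# CTL example: the 32-bit ADDR bus is replaced by two signals.
gain = replication_gain(outputs_cut=32, inputs_to_route=2)

# Hypothetical area numbers, in arbitrary logic-cell units.
ok = fits(module_area=1_200, used_area=85_000, capacity=100_000)

print(gain, ok)
```

Replication is only worthwhile when the gain is positive and the replica still fits in the target FPGA.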
With an RTL methodology, replicating a logic module increases logic utilization in the target FPGA, but the size of that increase cannot be estimated accurately at the RTL level. Replication may therefore lead to an infeasible partition that exceeds the FPGA's logic capacity. In addition, logic replication may leave some outputs of the replicated module unconnected in the target FPGA. Post-partitioning synthesis can prune logic locally, but it cannot propagate optimization across FPGA boundaries. Consequently, useless logic and connections may be preserved, consuming costly logic and interconnect resources.
At the gate level, accurate estimation of logic resources lets you select candidate modules to replicate without exceeding FPGA logic capacity. Since synthesis is not run after partitioning and replication, the partitioning flow must provide optimization features that prune logic starting from the unconnected outputs of replicated modules.
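One common way to implement such pruning is a backward traversal from the outputs that are still connected: any cell that is never reached drives only unconnected outputs and can be removed. The sketch below assumes a simplified netlist given as a fan-in map. The cell names and the function `cells_to_keep` are hypothetical and do not describe any particular tool.

```python
def cells_to_keep(fanin, live_outputs):
    """Return the cells reachable backward from the outputs still in use.

    fanin: dict mapping a cell to the cells that drive it.
    live_outputs: cells whose outputs are connected after replication.
    """
    keep = set()
    stack = list(live_outputs)
    while stack:
        cell = stack.pop()
        if cell in keep:
            continue
        keep.add(cell)
        stack.extend(fanin.get(cell, ()))
    return keep


# Hypothetical replicated control block: only addr_dec is used locally.
fanin = {
    "addr_dec": ["state"],
    "status_out": ["cnt"],
    "cnt": ["state"],
    "state": [],
}

keep = cells_to_keep(fanin, live_outputs=["addr_dec"])
pruned = set(fanin) - keep

print(sorted(keep), sorted(pruned))
```

Here `status_out` and `cnt` feed only an unconnected output, so they are pruned, while `state` is kept because `addr_dec` still depends on it.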
Pin multiplexing: Unlike logic resource constraints, physical connection constraints can be relaxed by allowing a set of signals to share the same physical track. An example is shown in Figure 6, where the output signals S1 and S2 are multiplexed and driven by the same output pin of FPGA 1.
Example of multiplexed signals sharing a physical connection.
The downside of this technique is that it reduces the overall system clock frequency, since DUT signals must be transmitted serially in separate time slots. To obtain a high system frequency, the multiplexing ratio (inter-FPGA signals/physical connections) must be reduced. The partitioning process has to control the distribution of cut signals and match it to the physical connections available between each FPGA pair. This makes partitioning a complex optimization problem that cannot be solved by hand.
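As a rough sketch, the multiplexing ratio can be computed per FPGA pair and the worst case used as the figure to minimize. The connection counts for the FPGA 1/FPGA 2 and FPGA 1/FPGA 4 pairs come from the Figure 4 example; every other number and the function name `mux_ratios` are assumptions for illustration. The ratio is rounded up because a physical connection carries a whole number of time slots.

```python
import math


def mux_ratios(cut_signals, physical_links):
    """Multiplexing ratio (cut signals / physical connections) per FPGA pair."""
    ratios = {}
    for pair, cut in cut_signals.items():
        links = physical_links.get(pair, 0)
        if cut == 0:
            ratios[pair] = 0
        elif links == 0:
            # No direct connection: these signals cannot be routed directly.
            ratios[pair] = math.inf
        else:
            ratios[pair] = math.ceil(cut / links)
    return ratios


# (1, 2) and (1, 4) follow Figure 4; the (2, 3) entry is hypothetical.
physical_links = {(1, 2): 375, (1, 4): 0, (2, 3): 100}
# Hypothetical cut-signal counts produced by some partitioning.
cut_signals = {(1, 2): 1500, (1, 4): 10, (2, 3): 250}

ratios = mux_ratios(cut_signals, physical_links)
worst = max(ratios.values())

print(ratios, worst)
```

An infinite ratio flags a partition that cuts signals between two FPGAs with no direct connection, such as FPGA 1 and FPGA 4 in Figure 4.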
To improve FPGA I/O bandwidth, RTL and gate-level methodologies can manage multiplexing IP insertion to transmit various signals serially through the same physical connection. Nevertheless, optimizing the multiplexing ratio, ordering the signals to be multiplexed, and selecting signals not to multiplex according to the board interconnect topology remain complex tasks that can be better managed with an automatic or semi-automatic partitioning at the gate level. We'll discuss this in more detail later.
Multi-FPGA partitioning timing challenges: Not all inter-FPGA DUT signals are valid candidates for grouping through pin multiplexing. Thus, a careful timing analysis must be carried out on the DUT to identify non-muxable signals, which include the following.
- Combinatorial hops: As shown in Figure 7, partitioning may lead to combinatorial paths being cut multiple times. The resulting combinatorial hop signals drastically reduce overall system frequency when they are multiplexed.
- Half cycle paths: These paths, as illustrated in Figure 8, must also be identified, and signals located in the data path from the launch flop to the capture flop must not be multiplexed. Otherwise, timing constraints may not be met.
- Asynchronous control signals: Control signals may be connected to multiple modules, and partitioning may lead to cutting them. Thus, they must be managed carefully in the multiplexing phase. For example, signals connected to asynchronous reset flops must not be multiplexed.
- Gated clocks: Most modern SoCs integrate power management techniques like gated clocks. Partitioning may lead to spreading registers driven by the same gated clock into multiple FPGAs. As shown in Figure 9, gated clock logic must be detected and replicated into the relevant FPGAs. In this way, the post-partitioning synthesis tool can optimize and convert gated clocks locally in each FPGA.
Inter-FPGA combinatorial hop analysis and the effect on signal multiplexing.
Inter-FPGA half-cycle path detection.
Gated-clock replication to enable conversion during each FPGA synthesis process.
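The non-muxable categories listed above can be turned into a simple filter, sketched below. The `CutSignal` fields, the signal names, and the rule that any path crossing FPGA boundaries more than once is excluded are assumptions for illustration; a real flow would derive these flags from timing analysis. Gated clocks are not part of the filter because they are handled by replication rather than multiplexing.

```python
from dataclasses import dataclass


def count_hops(path, placement):
    """Count how many times a combinatorial path crosses an FPGA boundary."""
    return sum(1 for a, b in zip(path, path[1:]) if placement[a] != placement[b])


@dataclass
class CutSignal:
    name: str
    hops: int = 1                # boundary crossings on its combinatorial path
    half_cycle: bool = False     # lies on a half-cycle path
    async_control: bool = False  # e.g. drives asynchronous reset flops


def is_muxable(sig):
    """Assumed rule: exclude multi-hop, half-cycle, and async control signals."""
    return sig.hops <= 1 and not sig.half_cycle and not sig.async_control


# Hypothetical path from ff_a to ff_b that leaves FPGA 1 and comes back.
placement = {"ff_a": 1, "g1": 1, "g2": 2, "g3": 1, "ff_b": 1}
hops = count_hops(["ff_a", "g1", "g2", "g3", "ff_b"], placement)

signals = [
    CutSignal("data_bus", hops=1),
    CutSignal("feedthrough", hops=hops),
    CutSignal("ddr_strobe", half_cycle=True),
    CutSignal("rst_n", async_control=True),
]
muxable = [s.name for s in signals if is_muxable(s)]

print(hops, muxable)
```

Signals rejected by the filter would each need a dedicated physical connection, which reduces the number of connections left for multiplexed traffic.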