United Business Media EE Times



Programmable Logic

FPGA Design: Early Implications In Partitioning

Partitioning an FPGA design is more involved than simply executing a post-processing step. Often, the original design must be significantly transformed to accommodate dozens of real-world partitioning issues.

by Stephen L. Wasson


Dozens of factors affect FPGA partitioning. Some are obvious, while others are subtle. The obvious factors include electrical, mechanical, manufacturing, cost, and schedule constraints. The subtle influences derive from tool, design, and device dependencies. Successful partitioning depends on early recognition of these constraints and dependencies as well as the subsequent forethought to construct an appropriate implementation that balances restrictions with requirements.

Definitions

Presently, partitioning has two definitions. In the not-so-distant days of single-chip design, partitioning meant "to assign netlist logic to physical resources within a device." This process is currently referred to as "mapping." The prevailing definition of partitioning arrived with multi-chip design, that being "to divide netlist logic among several devices (to assign netlist modules to particular devices)."

Figure 1. On the left, six logic modules of various sizes are interconnected by buses (a-d) and by miscellaneous control. Given a requirement to use 20-kgate FPGAs at about 75 percent capacity, the logic modules are "scooped up" into three netlist piles of approximately equal size and assigned to three devices. With no particular regard given to device dependencies, significant routing difficulties and performance degradation can be expected.

FPGA design is more than synthesizing a functional concept into target devices. Design is both a concept and a process. As a concept, the design has a virtual structure representing some desired functionality. As a process, to design is to transform the virtual structure into a workable physical implementation.

Implementation involves both concept and process. The implementation process involves the exploration of various permutations of the original design in an attempt to reconcile various constraints and dependencies. The implementation itself should be the selected permutation that achieves an optimal balance.

"Optimal," of course, generally means "fastest, smallest, and cheapest." As it relates to partitioning, an optimal design is one that is "mappable, placeable, and routeable." The optimal partition is derived by considering the downstream impact of upstream decisions.

Discussions

In the simplest case, partitioning a multi-module design is as easy as assigning whole logic functions to individual FPGAs (the Bus interface to one, the DMA to another, etc.). For prototyping or high-margin products, this is convenient, flexible, and expedient. However, as margins shrink, so must inefficiencies. Thus, with a little more thought, several modules should be assigned to the same FPGA(s). Figure 1 demonstrates an oversimplified case of partitioning six modules into three devices based solely on logic resource and I/O considerations--the traditional partitioning method.
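The resource-only method of Figure 1 can be sketched as a simple bin-packing pass. The Python below is a minimal illustration under stated assumptions, not a tool described in this article: the module names and gate counts are hypothetical, and first-fit-decreasing is just one plausible way to "scoop up" logic. Only the 20-kgate device size and the 75 percent utilization target come from Figure 1.

```python
from typing import Dict, List


def partition_by_capacity(modules: Dict[str, int],
                          device_gates: int,
                          utilization: float) -> List[List[str]]:
    """First-fit-decreasing packing of modules into devices.

    Each device accepts modules until device_gates * utilization is reached.
    Interconnect, pin counts, and timing are deliberately ignored.
    """
    budget = device_gates * utilization
    devices: List[List[str]] = []   # module names assigned to each device
    loads: List[int] = []           # gates already placed in each device

    # Place the largest modules first so the small ones can fill the gaps.
    for name, gates in sorted(modules.items(), key=lambda kv: kv[1], reverse=True):
        if gates > budget:
            raise ValueError(f"{name} does not fit in a single device")
        for i, load in enumerate(loads):
            if load + gates <= budget:
                devices[i].append(name)
                loads[i] += gates
                break
        else:
            # No existing device has room, so open a new one.
            devices.append([name])
            loads.append(gates)
    return devices


# Hypothetical gate counts for six modules (not taken from the article).
modules = {"bus_if": 9000, "dma": 8000, "fifo": 7000,
           "alu": 6000, "ctrl": 5000, "timer": 4000}
devices = partition_by_capacity(modules, device_gates=20_000, utilization=0.75)
print(devices)
```

Note that the sketch never looks at the buses connecting the modules. Two modules that share a wide bus can land in different devices, which is exactly the routing and performance hazard that a resource-only partition invites.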

Figure 2. Not all design implementations are created equal. That is, some will perform better than others. Targeting FPGAs, any given design concept may be implemented in a variety of ways. Which way performs best will depend upon how optimally the implementation is mapped and partitioned. Performance analysis should precede commitment to any particular proposal.

This traditional method--doing things by brute force--is an "after-the-fact" method. It is a "design first, impose partition later" sequence. On the sequential surface, this two-dimensional thinking may appear to make some sense. However, effective exploration of differing design proposals is less like a sequential list traversal and more like a tree search, in that the consequences of every decision branch out in all directions.

Figure 2 represents five proposed implementations (A-E), for the same design. Each proposal contains different adaptations to the same design constraints and device dependencies. A pre-implementation analysis of the anticipated logic and route resource consumption and allocation will provide an early indication of probable performance. Thus forewarned, the engineer may proceed in implementing the proposal most conducive to map, place, and route success.

Rather than being treated as a passive post-processing step, partitioning should be conducted throughout the design, from a priori partitioning in anticipation of consequences to post-facto partitioning as enforcement of earlier decisions. In doing so, consider the following question: which has more flexibility, device constraints or implementation constructs? "The implementation" is the answer: the device is fixed, and only the design has the flexibility to evolve. Hence, successful FPGA design survival depends upon adopting an "early adapter" strategy.

Familiar factors

List 1 is a partial listing of factors that typically affect FPGA partitioning. Foremost are physical space issues such as board form factor and density, which determine how many FPGAs may fit in a specific location. Board topology defines general dataflow, and will govern the pattern and sequence of device interconnection (especially in wide datapath applications). The number of board layers required is a function of board routing density, which in turn is a function of device package pin density. High pin density packages such as fine-pitch QFPs and large BGAs require significant routing channels, typically 50/in./layer. Considered in reverse, available routing channel space pre-determines viable device package options (see "Case study: a video pattern generator").

Acceptable package options are also governed by component mounting style. Obviously, through-hole devices limit what can be mounted directly opposite on the board's other side. Oppositely mounted SMDs may have incompatible footprints with conflicting via pattern requirements. For example, care must be taken when mounting a square, four-sided PQ160 opposite a pair of rectangular, two-sided SO64s.

Production quantities certainly affect device partitioning options. Low quantities generally indicate less price sensitivity, thus allowing for more expensive parts (that is, larger devices). Conversely, huge quantities are more price-sensitive and compel a lower board layer count and lower density parts. Higher quantities also have higher interest in optimizing panel yield. After calculating various board geometries and panel partitioning schemes, shaving one-fourth of an inch off the board length may increase panel yield, but doing so could possibly limit device options.
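The panel yield remark can be made concrete with a little arithmetic. The following is a rough sketch only: the panel and board dimensions are hypothetical, and real panelization also has to account for routing gaps, tooling strips, and fabricator rules that are ignored here.

```python
def boards_per_panel(panel_w: int, panel_l: int, board_w: int, board_l: int) -> int:
    """Count boards that fit on a panel in a plain grid, trying both orientations.

    All dimensions are integers in mils. Spacing between boards is ignored.
    """
    upright = (panel_w // board_w) * (panel_l // board_l)
    rotated = (panel_w // board_l) * (panel_l // board_w)
    return max(upright, rotated)


# Hypothetical 18 in. x 24 in. panel and a 6 in. wide board.
PANEL_W, PANEL_L = 18_000, 24_000

before = boards_per_panel(PANEL_W, PANEL_L, 6_000, 8_250)  # 8.25 in. long board
after = boards_per_panel(PANEL_W, PANEL_L, 6_000, 8_000)   # one-fourth inch shaved off

print(before, after)
```

With these assumed numbers the shorter board gains one extra board per panel, but the quarter inch given up is board area that is no longer available for a larger device package.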

Of course, device affordability is a major consideration. But what constitutes "affordable"? Device speed and array size have significant bearing on this question. Larger, faster devices are always more expensive and frequently more necessary for the "less-than-optimal" design implementation. If a device comes in several size and speed grades, the largest and fastest may be many times more expensive than the smallest and slowest. Against additional costs in a 100,000-piece buy, another week in the development schedule to get the partitioning right so the design fits in smaller, slower, cheaper parts is an affordable bargain.

Device packaging should be a main partitioning consideration not only for mounting style, but for migration to larger array sizes. If preliminary resource calculations indicate high utilization of the target part, it is wise to ensure that a footprint-compatible upgrade exists in the event of excessive resource demands. Package size also matters. There certainly is no sense in targeting parts that are physically too large to fit anywhere onboard. Besides, using several smaller packages often helps distribute board routing pressures, which then may help reduce board layers.

Significant constraints are inherited from specification restrictions and I/O connector pin assignments. A current example of this is the PCI specification, which strongly suggests that VLSI pinouts interconnect with the PCI connector in a contiguous manner. Unfortunately, this suggestion is less than optimal for FPGA implementations. The specification also defines maximum trace lengths. These constraints govern device placement, limit pin assignment choices, and bias implementation proposals. Schedule pressures may adversely affect partitioning options by precluding a thorough exploration of differing implementation proposals.

Power points

List 2 is a partial listing of the subtler partitioning influences that, if taken advantage of, will produce more powerful designs. Foremost on this list are the dependencies of downstream tools. While simple-minded synthesis appeals to rapid front-end design entry and satisfying simulation, the netlist subsequently produced will still be at the mercy of the FPGA target tools (the map, place, and route post-processors). This point is especially worth appreciating if these downstream processes do not perform timing-driven mapping and placement. Then, more than ever, map and place manipulations must be embedded within up-front entry. Still, it isn't enough to simply produce something that is mappable and placeable. The target tools also have to permit map and place enforcement.

Another upstream-downstream miscue is on logic block pin assignments. In some medium- and coarse-grain FPGAs, not all logic block pins have the same degree of routing freedom, and logic pin locking may become essential to meet route performance requirements. For example, in a simple pipeline sequence, it may appear to be optimal to make direct connections from flip-flop Q-out to D-in. In the Xilinx 3K architecture, this can be done using the direct input DI pin, thus bypassing (and saving) the logic block function generator. However, the DI pin cannot source from the nearest long line, whereas the two function generator inputs, B and C, can. Unfortunately, these input types are considered (by the route optimizer) to be non-equivalent resources. Therefore, once assigned to DI, a node cannot be swapped to the more routeable B or C pins.

A different kind of partitioning advantage is gained by using a device viewer, a target tool for peeking inside the device architecture. Originally intended to enhance post-route analysis, this tool presents an excellent opportunity for determining the effects of various pre-route implementation proposals. To be gained is knowledge regarding the effects of mapping on placement, and the effects of placement on routing. The engineer may then reverse these implications and propose a routing-driven placement, a placement-driven mapping, and a mapping-driven implementation.

Another class of partitioning subtleties falls under the domain of design parameters. Parameters such as databus width may, by sheer magnitude of I/O and logic requirements, compel multi-chip partitioning from one to two to four or more devices.

Dataflow dominance may compel a partitioning focused on device pin assignment. It is often a significant advantage to make adjustments accommodating pinout of intraconnected VLSI (taking care not to compromise FPGA internal patterns). Such accommodations may suggest an interleaved pin assignment that in turn will suggest an interleaved mapping. Be advised that some VLSI present a chaotic, haystack pin order that, if accommodated, will inflict serious performance degradation on the FPGA's routed results.

A design's performance parameters have significant bearing on resource requirements, especially in conjunction with dataflow considerations. In high-speed designs, the databus may need several pipelines. At one flip-flop per bus-bit per pipeline, resource requirements quickly multiply and may force the design into multiple devices. Very high-speed pipelines will additionally require perfect mapping and ordered placement of both logic and I/O.
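The multiplication described above is easy to tabulate. This is a back-of-the-envelope sketch: the 64-bit width and five stages are borrowed from the case study purely as sample inputs, and the figure of two flip-flops per logic block is an assumption that should be checked against the data sheet of the actual target family.

```python
import math


def pipeline_flip_flops(bus_bits: int, stages: int) -> int:
    """One flip-flop per bus bit per pipeline stage."""
    return bus_bits * stages


def logic_blocks_for_flip_flops(flip_flops: int, ffs_per_block: int = 2) -> int:
    """Logic blocks consumed by registers alone (ffs_per_block is an assumption)."""
    return math.ceil(flip_flops / ffs_per_block)


ffs = pipeline_flip_flops(bus_bits=64, stages=5)
blocks = logic_blocks_for_flip_flops(ffs)
print(ffs, blocks)
```

These counts cover pipeline registers only. Any logic between stages adds to them, so the result is a lower bound on the resources a pipelined datapath will demand.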

Case study: a video pattern generator
Background

The design task was to insert pattern generation logic into the 64-bit datapath of an existing 90-MHz video board. The original concept called for mask registers to impose the pattern, a FIFO to feed the mask, 32-bit parity trees to alter the pattern, and muxes to swap the parity into the video stream. Because of the FIFO RAM, a Xilinx 4K FPGA was originally targeted. Initial I/O and resource estimates called for ~382 flip-flops, ~356 four-input function generators, 64 CLBs of RAM, and ~176 I/O.

Partition #1

Based purely on logic requirements of ~380 CLBs, the whole design could be squeezed into a single, 400-CLB 4010. Based on the requirement for ~176 I/O, the smallest 4010 package meeting this requirement was the BG225--a package deemed unsuitable for this board because of anticipated routing difficulties. Reducing the I/O would allow use of a more suitable package. The FIFO write-data bus had originally been specified as 32-bit. Reducing the bus size to 8 bits (in exchange for writing the FIFO four times as often) dropped the I/O count to 152, which then permitted consideration of the 4010PQ208. To splice in this single-chip partition, an ECO was created and applied to the original board. Analysis indicated that the board height would have to grow--not a viable option for this form factor. Therefore, proposal #1 was abandoned.

Partition #2

In the quest for a smaller package, the design was split over two 32-bit devices, each with a full copy of the FIFO function. New logic and I/O estimates called for ~226 CLBs and ~85 I/O each, and two 4006TQ144s were proposed. Because of the design's high speed and dataflow dominance, input timing calculations called for contiguous databus pin assignment along the full length of each device edge. However, the TQ144 does not have 32 available I/O per side. Furthermore, internal timing calculations suggested that a larger array would be needed. So, partition #2 was retargeted for two 4008PQ160s. A second ECO was created and applied to the original board. Even though board routing pressures were distributed over two devices, it still appeared that board height would have to grow.

Partition #3

Evidently even smaller packages were needed. The 64-bit databus was running to eight VRAM, two devices per 16 bits. It seemed natural to split the databus into four 16-bit slices and place each FPGA slice within the immediate vicinity of its associated VRAM. The resource estimates for this four-chip partition initially called for ~145 CLBs and ~40 I/O. For this, four 4005PQ100s were tentatively chosen. However, having split the bus into 16-bit slices introduced some new complications in the 32-bit parity generator. Intermediate parities would have to be exchanged between adjacent 16-bit slices. An analysis of the worst-case path from one device to the other indicated that to make timing, five pipelines were needed not only in the worst-case path, but in the video data stream, as well. This increased the logic requirement to ~223 CLBs (that is, a 4006--which doesn't come in a PQ100).

Partition #4

At this point, it seemed clear that carrying around full copies of the FIFO function in each device was just too expensive and that implementing it externally in a discrete device would be more cost-effective. Doing so would reduce the CLB requirement to ~169, which would fit back in the 4005 again. Externalizing the RAM also obviated the need for using the Xilinx 4K family. Hence, resource, I/O, and timing analyses were re-performed for the Xilinx 3K family. Two viable devices were tentatively identified: the cheaper, smaller array, larger package 3142PQ100; and the more expensive, larger array, smaller package 3164TQ144. After considering FPGA costs, configuration file sizes, configuration PROM costs, and FPGA price breaks, it was determined that it would actually be cheaper (and simpler) to spread the design across five 3142s.

Partition #5

Finally, a partition that physically fit on the board without changing the form factor appeared to be board routeable. By I/O and CLB counts, it also fit within the selected targets: five 3142PQ100s, four for datapath manipulations and one for control logic. From here, it was verified that package shape and orientation were not in conflict with die orientation and dataflow transport. Floorplanning revealed several route congestion areas that were eliminated by interleaving data structures in one device and by merging pipelines in another. Logic block pin locking was absolutely required to obtain timing on critical routing. And, after mapping and placing 100 percent of the logic in the four datapath devices, timing would be made, and a viable implementation was ready for post-processing.
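The device choices in the 4K-family partitions above can be cross-checked with a small lookup. This is an illustrative sketch rather than the procedure actually used in the case study. Only the 400-CLB capacity of the 4010 is stated in the text; the capacities listed for the 4005, 4006, and 4008 are assumed from the Xilinx XC4000 family data and should be verified against the data sheet. The helper names are hypothetical.

```python
from typing import Dict, Optional

# CLB capacity per device. Only the 4010 figure appears in the case study;
# the other three values are assumptions to be checked against the data sheet.
XC4000_CLBS: Dict[str, int] = {"4005": 196, "4006": 256, "4008": 324, "4010": 400}


def smallest_fit(clbs_needed: int) -> Optional[str]:
    """Return the smallest listed device whose CLB count covers the estimate."""
    for name, capacity in sorted(XC4000_CLBS.items(), key=lambda kv: kv[1]):
        if clbs_needed <= capacity:
            return name
    return None  # nothing in the table is large enough


def io_after_narrowing(io_count: int, old_bus_bits: int, new_bus_bits: int) -> int:
    """I/O remaining after a bus is narrowed, one pin per removed bit."""
    return io_count - (old_bus_bits - new_bus_bits)


# Per-device CLB estimates quoted in the case study.
estimates = {
    "partition 1": 380,
    "partition 2": 226,
    "partition 3, before pipelining": 145,
    "partition 3, after pipelining": 223,
    "partition 4, external FIFO": 169,
}
choices = {label: smallest_fit(clbs) for label, clbs in estimates.items()}
io_partition_1 = io_after_narrowing(176, old_bus_bits=32, new_bus_bits=8)

print(choices)
print(io_partition_1)
```

A capacity check of this kind reproduces the initial device selections, but it cannot see the package, timing, and board routing issues that forced each retargeting. That gap is the point of the case study.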

The largest class of partitioning constraints and dependencies derives from the target device itself. Foremost in device concerns are the cost, speed, size, and resource issues previously mentioned. More subtle influences derive from shape and orientation. Package shape is an issue when die orientation matters. For example, PQ100s are rectangular, not square, and board placement may mandate a particular orientation. The mandated package orientation then defines the die orientation, which in turn impacts the FPGA internal structure, mapping, placement, and routing.

Figure 3. There are many ways to transform original concept, yet preserve original functionality. The goal is to implement the permutation which produces the optimal result. This is done by reconciling constraints and dependencies before implementation.

A more critical consequence of package orientation is the subsequently implied routing axis in devices with asymmetric route resources. In the Xilinx 3K and 4K devices, 3-state resources are along only one axis. In the PQ100, this axis is from package short-side to short-side. Therefore, dataflow transport on these 3-state lines implies a short-side to short-side package orientation. This is not optimal from a board layout point of view.

Available I/O per side of a device is an important consideration when trying to assign dataflow nets to contiguous I/O. If a partition calls for a 32-bit, high-speed databus on a single device, then a PQ144 would not be suitable. Be aware of dual-function I/Os, such as boundary scan pins, which may "punch a hole" in an otherwise perfectly contiguous pin assignment. More subtly, unbonded I/O may punch a hole in high-speed data structure patterns that ought to be contiguously placed within the logic array.
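Checking a package side for a usable contiguous run is a simple scan. The sketch below assumes a hypothetical 36-pin side in which a few positions are taken by power, ground, or dual-function pins. The positions are invented for illustration and do not describe any real package pinout.

```python
from typing import Sequence


def longest_contiguous_run(available: Sequence[bool]) -> int:
    """Length of the longest unbroken run of usable I/O along one package side."""
    best = run = 0
    for usable in available:
        run = run + 1 if usable else 0   # a blocked pin resets the run
        best = max(best, run)
    return best


def bus_fits(available: Sequence[bool], bus_bits: int) -> bool:
    """True if the bus can be assigned to contiguous pins on this side."""
    return longest_contiguous_run(available) >= bus_bits


# Hypothetical 36-pin side: these positions are NOT usable as general I/O.
blocked = {0, 9, 17, 18, 35}
side = [pin not in blocked for pin in range(36)]

print(longest_contiguous_run(side))
print(bus_fits(side, 32), bus_fits(side, 16))
```

In this made-up example, 31 of the 36 pins are usable, yet the longest contiguous run is only 16. A raw count of available I/O per side can therefore overstate what a contiguous databus assignment can actually use.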

Device I/O setup and hold parameters (for devices with flip-flops in the I/O) should be compared to internal flip-flop setup and hold parameters. If I/O setup exceeds internal setup, high-speed designs may profit by bypassing I/O flip-flops and going directly to internal flip-flops. However, the additional internal route will have to be kept as short as possible. Once again, this requirement necessitates absolute mapping and placement control.
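The comparison above amounts to a route delay budget. The following sketch uses a deliberately simplified model in which the setup time seen at the pin through the internal path is the internal flip-flop setup plus the pad-to-flip-flop route delay. Clock distribution differences and hold requirements are ignored, and all timing values are hypothetical rather than data sheet numbers.

```python
def route_budget_ns(io_setup_ns: float, internal_setup_ns: float) -> float:
    """Route delay that can be spent before bypassing stops paying off."""
    return io_setup_ns - internal_setup_ns


def should_bypass_io_flip_flop(io_setup_ns: float,
                               internal_setup_ns: float,
                               route_delay_ns: float) -> bool:
    """Bypass only if internal setup plus the extra route beats the I/O setup."""
    return internal_setup_ns + route_delay_ns < io_setup_ns


# Hypothetical timing values in nanoseconds.
IO_SETUP, INTERNAL_SETUP = 6.0, 2.0

budget = route_budget_ns(IO_SETUP, INTERNAL_SETUP)
short_route = should_bypass_io_flip_flop(IO_SETUP, INTERNAL_SETUP, route_delay_ns=3.0)
long_route = should_bypass_io_flip_flop(IO_SETUP, INTERNAL_SETUP, route_delay_ns=5.0)

print(budget, short_route, long_route)
```

The budget shows why placement control matters here: if the receiving flip-flop is placed far from its pad, the route delay consumes the whole advantage.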

One of the more esoteric partitioning considerations centers around the size and number of configuration streams. If two FPGA partitions are almost identical, they will still have different configuration streams. If loaded from onboard PROM, these devices may have to be loaded sequentially at the cost of time and PROM space. But if two FPGAs can be made identical (with perhaps some extraneous but ignored logic in one), they may be configured simultaneously. The PROM size may then be reduced to a smaller, cheaper part.
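The PROM saving can be estimated by counting distinct configuration streams rather than devices. This is a minimal sketch: the device names, configuration labels, and the 30,000-bit stream size are all hypothetical placeholders, and real PROM sizing would also depend on stream headers and the loading scheme.

```python
from typing import Dict, Tuple


def prom_bits_needed(devices: Dict[str, Tuple[str, int]]) -> int:
    """Total PROM bits when each distinct configuration is stored only once.

    devices maps a device name to (configuration label, stream size in bits).
    Devices sharing a label are assumed to be loaded simultaneously from one copy.
    """
    distinct: Dict[str, int] = {}
    for config, bits in devices.values():
        distinct[config] = bits
    return sum(distinct.values())


STREAM_BITS = 30_000  # hypothetical stream size, not a data sheet value

almost_identical = {"fpga0": ("cfg_a", STREAM_BITS), "fpga1": ("cfg_b", STREAM_BITS)}
identical = {"fpga0": ("cfg_a", STREAM_BITS), "fpga1": ("cfg_a", STREAM_BITS)}

print(prom_bits_needed(almost_identical), prom_bits_needed(identical))
```

Under these assumptions, making the two partitions truly identical halves the PROM requirement, which may be enough to drop to the next smaller PROM.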

Price breaks should influence partitioning. Even though a partition fits within a smaller device, it may be that going up in device size creates a price break opportunity, which results in reduced materials stocking and costs. If so, it may be that some logic can be moved out of some of the other FPGAs. Such a move may result in reductions in design complexity and the number of different configurations.
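A quick calculation shows how a price break can favor the larger part. The price tables below are entirely hypothetical and exist only to show the mechanism. Actual pricing and break quantities have to come from the vendor or distributor.

```python
from typing import List, Tuple

PriceBreaks = List[Tuple[int, int]]  # (minimum quantity, unit price in dollars)


def unit_price(quantity: int, breaks: PriceBreaks) -> int:
    """Unit price at the highest break whose minimum quantity has been reached."""
    reached = [(min_qty, price) for min_qty, price in breaks if quantity >= min_qty]
    if not reached:
        raise ValueError("quantity is below the smallest price break")
    return max(reached)[1]


def total_cost(quantity: int, breaks: PriceBreaks) -> int:
    return quantity * unit_price(quantity, breaks)


# Hypothetical price tables for a smaller and a larger device.
SMALL: PriceBreaks = [(1, 20), (100_000, 18)]
LARGE: PriceBreaks = [(1, 25), (100_000, 22), (200_000, 18)]

# Two FPGAs per board over a 100,000-board build.
mixed = total_cost(100_000, SMALL) + total_cost(100_000, LARGE)   # one of each
uniform = total_cost(200_000, LARGE)                              # two large parts

print(mixed, uniform)
```

With these invented numbers, buying 200,000 of the larger part costs less than buying 100,000 of each, and it leaves only one part number to stock.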

Summary

Dozens of obvious and subtle factors have early implications in partitioning. Much more than a resource and pin count exercise, partitioning requires more than a "design first, ask questions later" approach. Partitioning is a design methodology that produces a mappable and partitionable implementation by considering the downstream consequences of upstream actions. Figure 3 summarizes the main thrust of the partitioning story: FPGA design success depends upon the early recognition and consideration of design constraints and device dependencies, as well as the application of forethought to transform the original design into an optimal implementation. That is, pre-pushbutton partitioning appreciates that FPGA silicon is crystalline, that design functionality is plastic, and that the implementation methodology is fluid.

Stephen L. Wasson is a principal of HighGate Design Inc. (Saratoga, CA), consultants specializing in FPGA design.

Stephen will be presenting an in-depth case study on partitioning in the tutorial "FPGA Design for Inherited Constraints and Dependencies" scheduled for presentation session S113 on January 21 in the On-Chip Systems Design Conference of Design SuperCon97 in Santa Clara, CA.

To voice an opinion on this or any Integrated System Design article, please e-mail your message to michael@asic.com.


integrated system design  February 1997



Copyright © 1997 Integrated System Design Magazine
