United Business Media EE Times





Programmable Logic

Top-Down FPGA Design: A 12-Step Program

FPGA design flow benefits from a structured design approach.

by Stephen L. Wasson


As engineers contend with a growing variety of new architectures, vendors experiment with an increasing assortment of new tools, and the future promises more permutations of both. This article presents a current view of FPGA design in a 12-step, top-down process.

Every process has three major elements: (1) context, (2) steps, and (3) flow. Context defines boundaries of applicability; steps identify functions to perform; and flow describes process interconnectivity. Understanding these elements and their relationships will benefit hardware engineers in efficient navigation, software engineers in cohesive coding, and vendors in coherent packaging of the FPGA design process.

Figure 1 depicts the current FPGA design process context: reconfigurable devices with coarse-grain architectures, designed from text or schematic entry using multiple point tools. The 12 steps, from Grok to Archive, span the whole design cycle from inception through delivery. The flow is a simple, serial sequence with infinite recursion opportunities. Successful and expedient flow navigation will depend upon the foresight applied in each step: step by step, from top to bottom.





Figure 1. Expediently navigating the FPGA design process with the fewest iterations while producing the highest quality results will depend upon the foresight applied in each step, top to bottom. The greatest design leverage is gained in pre-pushbutton forethought.


1. Grok

At the top of every process is the need to understand the problem, identify the goal, and chart the course between. To "grok" (from Heinlein's Stranger in a Strange Land) means to arrive at a consummate understanding of the whole: the market, the technology, the tools, the people, the resources, the before and after.

In the beginning is the Marketing Request Document (MRD): the big picture. According to product creationists, the MRD asks and answers all the relevant questions: Who are we? Where are we going? How many channels do we really need? These answers are all "cast in paper" and thrown "over the wall" to Engineering.

2. Spec

Specification is the step where reality is applied to the MRD, where all the high-level design work is done: researching unknowns, identifying choices, weighing options, selecting tools, evaluating technologies, prioritizing features, making tradeoffs, and planning schedules. Together, these make up the overall design strategy. Although it's been demonstrated in a thousand projects that time spent in specification is time well used, specification is often compromised by calendar-driven schedules.

The specification should allow an implementor to "bang out" a design. The most important part of the specification is the block diagram identifying all design inputs, outputs, and functions. At this point, it should hierarchically specify functionality down to the data object level (counters, registers, adders, etc.) but not lower. Such specification greatly enhances resource estimations for partitioning and floorplanning efforts in mapping.

3. Partition/Map

Partitioning is the allocation of logic among multiple devices; mapping is the allocation of logic within a single device. Partitioning logic between devices often involves isolating macro functions: bus interfaces, video controllers, comm modules, etc. Mapping logic within a device means aligning micro functions (counters, registers, adders, etc.) with logic and route resources. From the top-level block diagram, bounding boxes, sketched around macro functions, determine I/O requirements (across boundaries) and logic resource needs (within boundaries). Partitioning proposals should be considered preliminary and dependent upon inferences obtained from single-chip mapping analysis.
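As a rough illustration of the bounding-box exercise, the sketch below counts the signals that cross each device boundary for a proposed partition. The block names, bus widths, and the `io_pins_per_device` helper are hypothetical, invented for this example; a real estimate would also add clocks, configuration pins, power, and ground.

```python
def io_pins_per_device(nets, assignment):
    """Count signal pins each device needs for nets that cross a device boundary.

    nets:       list of (width_in_bits, blocks_on_the_net)
    assignment: maps each block name to the device (bounding box) it sits in
    """
    pins = {}
    for width, blocks in nets:
        devices = {assignment[block] for block in blocks}
        if len(devices) > 1:  # the net leaves at least one bounding box
            for device in devices:
                pins[device] = pins.get(device, 0) + width
    return pins


# Hypothetical block diagram: three macro functions and a board connector.
nets = [
    (16, ("connector", "bus_if")),
    (8, ("bus_if", "video_ctl")),
    (4, ("video_ctl", "comm")),
]
assignment = {
    "connector": "board",
    "bus_if": "fpga_a",
    "video_ctl": "fpga_a",
    "comm": "fpga_b",
}
pin_estimate = io_pins_per_device(nets, assignment)
```

Moving a block from one bounding box to another and re-running the count gives a quick feel for how a partitioning proposal changes the pin budget.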

Mapping analysis involves matching block diagram functions to target device architectures. Understanding device logic and route resource constraints, the designer can readily infer two important estimates: (1) logic consumption and (2) route consumption. These are best obtained by the floorplanning exercise (described in "Floorplanning for High Performance FPGA Designs," Integrated System Design, March 1995). An important design point is that device constraints take precedence over design constraints.

During partitioning and mapping, the block diagram should be modified to accommodate device architecture constraints. Only after partitioning and mapping can the device, die, and package selections be confidently made. Once the device types and number are known, board size and costs can be more accurately estimated. And, most importantly, after floorplanning each device, subsequent block diagrams can be expediently implemented. From here on, all steps are single-chip.

4. Implement

Only after partitioning, mapping, and floorplanning indicate that a chosen target will accommodate a design should the designer proceed with final implementation. Specifically, implementation is the completion of the block diagram lower-level hierarchy, followed by design entry and ending with behavioral netlist generation. In this step, function block design and interconnect are completed and a syntactically correct netlist is produced.

Design Note: In hierarchical implementation, it is good practice to maintain identical module sizes. That is, in schematics, when the chosen sheet size fills up, additional logic should be placed in lower-level modules; in HDLs, when the page fills up, additional code should be called in lower-level routines.

Whether from text or schematic, floorplanning should pre-determine most significant implementation details: counter styles (binary vs. LFSR), carry types (dedicated vs. lookup), multiplexing (combinatorial vs. 3-state), clock enables (dedicated vs. combinatorial), reset styles (asynchronous vs. synchronous), etc. To date, schematic entry has the advantage of greater topological similarity between design specification and device architecture and therefore has greater success in producing high-utilization, high-performance implementations.
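To see why the LFSR is listed as an alternative counter style, consider this minimal Python model of a 4-bit Fibonacci LFSR. It assumes taps at bits 4 and 3 (the polynomial x^4 + x^3 + 1); the function name and the software model are illustrative only, not a vendor macro.

```python
def lfsr_states(width, taps, seed=1):
    """Return the states a Fibonacci LFSR visits before it first repeats one.

    taps are 1-indexed bit positions (1 = LSB) XORed into the feedback bit.
    """
    mask = (1 << width) - 1
    state = seed & mask
    seen = []
    while state not in seen:
        seen.append(state)
        feedback = 0
        for tap in taps:
            feedback ^= (state >> (tap - 1)) & 1
        # Shift left by one and bring the feedback bit in at the LSB.
        state = ((state << 1) | feedback) & mask
    return seen


states = lfsr_states(width=4, taps=(4, 3))
```

With these taps the register steps through all 15 nonzero states using a single two-input XOR as feedback, so there is no carry chain to route. The price is that the count sequence is pseudo-random rather than binary, and the all-zeros state must be avoided.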

Other high-performance implementation guidelines are as follows: design synchronously, implement state-per-bit state machines, use linear-feedback shift registers, prescale binary counters, consider combinatorial clock enables, and duplicate critical logic. Also, perform a route-budget analysis to determine the practical maximum number of intermediate logic levels between sequential elements for a given clock period in the target device. (See "Design Tips for High Performance FPGA Designs," ASIC & EDA, October 1994.) All functions should be implemented with this maximum in mind.
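One simple way to frame a route-budget analysis is sketched below. It assumes a register-to-register path with n logic levels and n + 1 routing hops, so that clock-to-out + (n + 1) x route + n x level + setup must fit within the clock period. The model, the function name, and every delay value are assumptions for illustration; they are placeholders, not datasheet figures for any device.

```python
import math


def max_logic_levels(period_ns, clk_to_out_ns, setup_ns, level_ns, route_ns):
    """Largest n with clk_to_out + (n + 1) * route + n * level + setup <= period.

    Returns 0 when not even one logic level fits in the period.
    """
    budget_ns = period_ns - clk_to_out_ns - setup_ns - route_ns
    return max(0, math.floor(budget_ns / (level_ns + route_ns)))


# Placeholder delays (hypothetical, chosen only to exercise the formula).
PLACEHOLDER_DELAYS = dict(clk_to_out_ns=3.0, setup_ns=2.0, level_ns=4.5, route_ns=5.0)

levels_at_40ns = max_logic_levels(40.0, **PLACEHOLDER_DELAYS)
levels_at_80ns = max_logic_levels(80.0, **PLACEHOLDER_DELAYS)
```

Substituting worst-case figures from the target device's data sheet, plus a realistic per-hop routing allowance, turns the same arithmetic into a usable design rule.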

5. Simulate

After generating a syntactically correct netlist, the first of three simulation steps should be performed. Simulation, at this point, proves that functions are behaviorally correct, both as pieces and as a whole. In contrast to the overall design process, this functional simulation is most efficiently performed bottom up to maintain the narrowest focus on the possible sources of misbehavior. Also, simulation is greatly assisted if all nets and instances are labeled, not with arbitrary system-assigned names, but with meaningful, user-assigned names.

First, demonstrate that counters count, registers register, decoders decode, and adders add, then proceed up to the next hierarchical level and show that decoders enable registers to load the counters, etc. Next, establish that state machines properly traverse, arbiters arbitrate, and data moves between the pieces, and finally, up at the device pins, prove that data flows properly in and out of the whole.

6. Compile

Compilation maps the behavioral netlist into target device logic elements "automatically." Depending upon the rigor of the vendor design-rule check, with the touch of a button, voluminous reports requiring many hours of manual interpretation may be produced. Without an intelligent post-processor, repeatedly wading through and weeding out syntax and binding errors with each design modification may be necessary. Tersely implemented functions (those without unnecessary logic to trim) will expedite this step.

Completing compilation produces (1) the logic-resource utilization calculation, and (2) a syntactically correct, target-mapped netlist. Depending upon clock speed, design structure, and the potential for future modifications, the utilization calculation will indicate how well the design is partitioned. Depending upon the entry method, signal naming discipline, and partitioning and mapping techniques, the compiled netlist appearance may deviate from the original implementation.

7. Simulate

The target-mapped netlist should be simulated functionally to verify that nothing was lost during compilation. If possible, re-use the same input vectors as before and compare results. If the results do not match, problems will likely have to be traced into the target-mapped netlist. This means that deviant netlists, such as those synthesized from text entry, may present considerable challenges in determining which gates and nets represent which functions and signals.
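The compare step can be pictured as a cycle-by-cycle diff of the two simulation runs. The sketch below assumes each run has been exported as a list of per-cycle dictionaries keyed by signal name; that data layout, the signal names, and the `compare_runs` helper are hypothetical and not tied to any simulator's file format.

```python
def compare_runs(reference, candidate):
    """Return (cycle, signal, expected, actual) for every mismatching sample."""
    mismatches = []
    for cycle, (ref, cand) in enumerate(zip(reference, candidate)):
        for signal in sorted(ref):
            actual = cand.get(signal)  # None if the signal name did not survive
            if actual != ref[signal]:
                mismatches.append((cycle, signal, ref[signal], actual))
    return mismatches


# Hypothetical results: the pre-compile run and the target-mapped run.
functional = [{"q": 0, "tc": 0}, {"q": 1, "tc": 0}, {"q": 2, "tc": 1}]
mapped = [{"q": 0, "tc": 0}, {"q": 1, "tc": 0}, {"q": 2, "tc": 0}]

diffs = compare_runs(functional, mapped)
```

A diff like this is only useful when signal names survive compilation, which is one more argument for meaningful, user-assigned net names.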

On the other hand, a thoughtfully partitioned, mapped, and floorplanned implementation will compile into a simulation-friendly netlist. Implementations approximating device granularity more readily retain their identities. That is, function block and signal names will be recognizable, which in turn will expedite simulation. Successful completion of simulation yields a functionally-correct, target-mapped netlist ready for place and route (P&R).

8. Place

Placement, the assignment of target-mapped functions to specific logic resources within a device, may be either automatic or manual. Auto-tools are good at placing unstructured logic; humans are better at placing structured objects. A structured object is one that exhibits high-interconnect uniformity. All datapath objects (such as counters, registers, and adders) are structured and should be considered excellent candidates for manual placement or floorplan enforcement.

Empirical evidence in hundreds of FPGA implementations suggests that most designs are about two-thirds structure. The inference is that nearly every design will benefit by some human-aided placement. The purpose of placement is to align the target-mapped micro-structures within the device such that route resource consumption will be both minimized and distributed. This most often involves arranging structured objects so that the major datapaths can flow uniformly across the chip.

Of course, to perform manual placement, several things have to be true: (1) the designer must be familiar with the device architecture, (2) the entry tools must accept naming, mapping, and placement attributes, and (3) the target P&R tools must permit placement manipulations. Without these benefits, automatic placement will likely produce a less-than-optimal starting point for the router.

9. Route

Although routing is the most automatic step in the FPGA design flow, there may still be opportunity for significant user guidance, such as route-effort parameters, guided routing, and manual pre-routing. Route-effort parameters suggest how hard to try. For incremental design modifications, guided routing "borrows" information from successful prior routes. And manual pre-routing, requiring a target-device editor and an intimate knowledge of the device architecture, "seeds" critical lines before turning control over to the router.

But the most significant user routing guidance comes from timing specifications. Minimally, the designer should specify pad-to-clock, clock-to-clock, clock-to-pad, and pad-to-pad default delay parameters. More importantly, specific parameters should be applied to all special cases: for instance, between sequential elements with clock enables, in different time domains, looping through I/O, or crossing through RAM. Cooperative design environments accept timing specifications in initial entry. Intelligent target tools will utilize timing-driven specifications throughout map, place, and route.

10. Analyze

After route completion, the next goal is performance. For fully synchronous designs, this static timing analysis is mostly a matter of querying the routed netlist for timing specification failures (no failures means the target netlist is routed at-speed; conversely, timing failures mean that some end-to-end path exceeds the route budget). Rectifying timing failures may involve re-placing logic into closer proximity, adding pipeline stages, or compressing intermediate logic levels between sequential elements. (See "High-Speed State Machine Design," Integrated System Design, June 1995.)
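Conceptually, the query reduces to comparing each path delay against the specification that governs it and reporting negative slack. The following sketch assumes the path delays have already been extracted from a timing report; the path names, spec names, and delay numbers are all hypothetical.

```python
def failing_paths(paths, specs_ns):
    """Return (path_name, slack_ns) for paths that miss their spec, worst first.

    paths:    list of (path_name, spec_name, delay_ns)
    specs_ns: maps a spec name to its allowed delay in nanoseconds
    """
    failures = []
    for name, spec_name, delay_ns in paths:
        slack_ns = specs_ns[spec_name] - delay_ns
        if slack_ns < 0:
            failures.append((name, slack_ns))
    return sorted(failures, key=lambda failure: failure[1])


# Hypothetical timing specs and extracted path delays.
specs_ns = {"fast_domain": 40.0, "slow_domain": 80.0}
paths = [
    ("addr_cnt -> ram_we", "fast_domain", 36.5),
    ("ctl_in -> gbuf", "fast_domain", 43.0),
    ("ram_out -> out_reg", "slow_domain", 61.0),
]

failures = failing_paths(paths, specs_ns)
```

An empty result corresponds to a netlist routed at-speed; anything else is a list of candidates for re-placement, pipelining, or level compression.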

If the design contains asynchronisms (asynchronous sections, multiple time domains, or back-to-back internal 3-state cycles), post-route analysis should include full time-delay simulation. Here, the previous discussions about "simulation-friendly" netlists have even greater applicability, because tracking down setup violations or bus conflicts will likely require visiting every node along each suspect path. If possible, each violation should be corrected in a way that does not depend on routing; otherwise, every re-route will require re-analysis.

11. Bring-up

Once a design passes post-route analysis (and assuming the proto-board FPGA pinouts are still usable), it's time for in-circuit test. On the proto-board, include all the necessary connections for reconfiguring the FPGA, such as mode jumpers and configuration headers. Depending upon FPGA tools, it may be advantageous to initially install a larger-than-necessary device that includes additional debug functionality. If the tools permit post-route editing, complex trigger functions can be added, downloaded, tested, and re-modified, all in minutes.

For bring-up, knowledge of device architecture is a considerable advantage not only in post-route editing but in pre-route implementation. It enables the designer to determine where to leave logic for circuit expansion and where to leave I/O available for test pins. After bring-up verifies design functionality at-speed in-circuit, the design, device, and pinouts may finally be frozen for production ship and project archival.

12. Archive

Proper archival includes capturing all information necessary to manufacture, maintain, and modify a design. Besides the obvious specification documents and implementation source, it's prudent to archive all state diagrams, truth tables, route-budget calculations, dataflow drawings, placement worksheets, constraint information, critical-path notes, time-domain analyses, simulation command and waveform files, and debug circuits, as well as all software versions (of all P&R, schematic and text entry, simulation and synthesis tools), compilation options, command-line parameters, make files, data files, and PROM files: absolutely anything and everything that may be of assistance to the next person to come along. Keep in mind that the next person may be you.

One archival technique is to embed design comments and explanations within the implementation source. Again, schematic tools have another advantage over HDLs: they double as drawing tools. Including bubble diagrams, truth tables, and waveform drawings with text descriptions significantly improves initial simulation and bring-up efforts as well as future maintenance and modification efforts throughout the FPGA design process.

Final observations

Currently, the advantages of text-based entry tools are rapid initial entry, prodigious synthesis, powerful simulation, and ostensible design portability. However, the advantages of schematics are efficient map-and-place control, friendly post-route analysis, and optimal device targeting. Presently, schematic implementations are typically achieving 2x performance improvements over synthesis implementations.

Successful traversal of the above 12-step FPGA design flow depends on many factors: target device familiarity, implementation discipline, entry method, floorplanning efforts, entry and target tools, simulation coverage, bring-up technique, design experience, and schedule pressure. Expedient traversal of the design flow depends upon minimizing step reiterations. In the fantasy, design flow is navigated by pushing buttons; in the reality, the designer must interact with the tools at every step, often looping back to make new design adjustments. The further down adjustments are made, the greater their cost. Remember: the greatest design leverage is gained in pre-pushbutton forethought, from top to bottom.



A case history: Xilinx, napkin to netlist

A video reformatter was to be designed using FPGAs (for the rapid prototyping capability) and using a Xilinx 4K device (for the RAM feature). The specification called for 16-bit data to be input at 25MHz, transformed, and output as 32 bits at 12.5MHz. For the major design element, a line buffer, a "dual-port" RAM structure was proposed. For each 512-word input line, 8K bits of RAM would require 256 configurable logic blocks (CLBs).
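The sizing arithmetic in this paragraph can be checked in a few lines. The sketch assumes each CLB contributes 32 bits of RAM (two 16x1 lookup tables), which is the figure that makes the numbers above work out; the helper names are made up for the example.

```python
def clbs_for_ram(words, bits_per_word, ram_bits_per_clb=32):
    """CLBs needed to hold a RAM of the given size (rounded up)."""
    total_bits = words * bits_per_word
    return -(-total_bits // ram_bits_per_clb)  # ceiling division


def throughput_mbit_per_s(width_bits, clock_mhz):
    """Raw data rate of a parallel port clocked at clock_mhz."""
    return width_bits * clock_mhz


line_buffer_bits = 512 * 16                          # one 512-word line of 16-bit data
line_buffer_clbs = clbs_for_ram(512, 16)
input_rate = throughput_mbit_per_s(16, 25)           # 16 bits at 25MHz
output_rate = throughput_mbit_per_s(32, 12.5)        # 32 bits at 12.5MHz
```

The input and output rates come out equal, which is consistent with the reformatter only repacking each pair of 16-bit words into one 32-bit word at half the clock rate.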

This initial estimate suggested the use of the XC4010. Viewlogic schematic entry was chosen because of the need for mapping and placement control. ViewSim would be the simulation tool. Since incoming video would be constant, two line buffers would be required, so the design was partitioned over two devices.

Both devices, containing the identical configuration, would share output data and control pins, each 3-stated appropriately. From this, the initial pinlist was drafted and, from the pin count, the PQ208 package was selected. Proceeding from the specification block diagram, the major data structure (a RAM array) was mapped into the 4010 via the paper floorplanning exercise.

From studies of the route distribution pressures and probable delays, the line-buffer RAM structure was modified to a quad-pipeline, two-bank structure. This modification also permitted optimal use of global buffer distribution as well as synchronous generation of RAM write enables (WEs). However, the additional pipelines increased the resource requirements up to a 4013.

With the floorplan subsequently readjusted and the RAM structure centered within the 4013 array, the dataflow pins were assigned: inputs and outputs on the left and right device rails; controls along the top and bottom rails. These assignments were reflected in the preliminary constraints (CST) file. Once the floorplan suggested that the modified block diagram would both fit and route at-speed in the selected target, implementation of the lower-level hierarchy could proceed.

From the route budget analysis, in a 4013-5, the 25MHz logic should be limited to three intermediate logic levels. But, because of the two-bank design, the RAM address generators could be built with slow ripple-carry counters. With the RAM floorplanned as a 16x16 array, there would be 16 different output sources (one from each column); therefore, 3-state multiplexers were implemented. Both the input and output time domains were autonomously 100% synchronous, and all I/Os were registered in the pins.

Because of the contiguous, internal 3-state cycles, a shift register was used to generate 16 non-overlapping 3-state output enables (OEs). All schematic objects were labeled and the CST file expanded to include placement of the RAM, pipeline, 3-state buffers, and 3-state control objects. In the CST, all other logic was confined to the CLB rows below the RAM array.
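A cycle-level model of such an enable generator is a one-hot ring: a single 1 circulates through a 16-bit shift register, so at most one output enable is asserted in any clock cycle. The Python sketch below is an assumption about the general technique, not the actual schematic, and it does not model 3-state buffer turn-on and turn-off delays.

```python
def one_hot_ring(width, cycles, start=1):
    """Return the register value on each cycle of a one-hot ring shift register."""
    mask = (1 << width) - 1
    state = start & mask
    history = []
    for _ in range(cycles):
        history.append(state)
        # Rotate left by one: the MSB wraps around to the LSB.
        state = ((state << 1) | (state >> (width - 1))) & mask
    return history


enables = one_hot_ring(width=16, cycles=32)
exactly_one_active = all(bin(value).count("1") == 1 for value in enables)
```

Because every state has exactly one bit set, no two enables overlap at the cycle level; whether adjacent drivers overlap within a cycle still has to be confirmed in full timing simulation.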

Because of schedule pressures (to parallel board layout), "final" pin assignments were made: input controls at lower left and output controls at lower right. With lower-level schematics completed, simulation was first performed on the address counters, OE controls, WE generators, and pipeline structures. Then, input synchronization, RAM write cycles, RAM read cycles, and output control generation were simulated.

Finally, an end-to-end, full-line buffer simulation was performed using an input pattern that would be easy to verify when observed after transformation onto the outputs. After this functional simulation, the netlist was compiled in the Xilinx environment using Xmake with the 'stop-after-DRC' option. The Xnfprep PRP file was checked for errors and trimmed logic, and the schematic-to-compilation flow retraversed until all errors and trimmed logic were accounted for. Xmake was re-run with PPR's 'just-place-don't-route' option to check the CST file for syntax errors.

The CST's veracity was then confirmed in EditLCA by a quick inspection of the placed-but-unrouted design. The PPR.log file confirmed initial resource estimates (about 70 percent utilization). With the design successfully mapped into Xilinx, it was converted back to Viewlogic for unit-delay simulation (LCA to XNF to WIR to VSM). The same vectors used in the functional simulation were re-used with the same results.

Returning to the Xilinx environment, PPR was run to completion: structured objects were placed according to floorplan and random logic was auto-placed in the lower CLB rows, all according to the CST file. Structured objects were arranged with relative location (RLOC) schematic attributes and placed with "place instance" CST constraints.

Combinatorial logic was mapped for route-budget level control with FMAP schematic primitives and BLKNM schematic attributes and then placed with "place block" CST constraints. All I/Os were placed according to their schematic labels and corresponding "place instance" CST constraints.

After placement, PPR proceeded to route. In this design, two time specs were used, one for each domain: 40ns clock-to-setup in the 25MHz domain, and 80ns clock-to-setup in the 12.5MHz domain. RAM address timing, with such a large route delay allocation, was 'ignored.' Again, the Xnfprep PRP file was checked for errors and logic trimming. PPR.log was consulted to verify parameters used, route results (0 unroutes), and the time-spec summary.
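Each time spec is simply one clock period of its domain, as this small check shows. The function name is illustrative.

```python
def clock_period_ns(frequency_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / frequency_mhz


fast_spec_ns = clock_period_ns(25)    # 25MHz input domain
slow_spec_ns = clock_period_ns(12.5)  # 12.5MHz output domain
```

Setting the clock-to-setup spec equal to the full period leaves no explicit margin, so any allowance for clock skew or jitter would have to be subtracted from these values.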

The RPT file was inspected for final resource utilization ("packed CLB" count), and the OUT file was checked for compilation errors. The completed route was then analyzed. Static timing analysis was performed in the Xilinx environment using Xdelay to check for failures of the two time specs. Failed specs were corrected by logic duplication and mapping for additional level compression and by placement refinements in the CST.

One troublesome path, from a left-side control input through one intermediate logic level to a right-side global buffer, required considerable attention, all because the pins were fixed before final timing analysis. With the static timing issues resolved, the routed LCA was converted back to a Viewlogic VSM file for full timing simulation. Once again, using the same test vectors, dataflow in and out of the device was verified. Moreover, the contiguous, internal 3-state enables were inspected to confirm non-overlapping operation. After full-timing verification, the design was ready for bring-up.

From the routed LCA file, Makebits produced the BIT file for download via the download cable. In the lab, using EditLCA, the routed LCA file was hand-edited to change counter decodes (to modify pulse widths), to add delays to synch inputs (to increase margins), and to add test pins. All these edits were thoroughly documented and incorporated into the original schematic source and, at appropriate intervals, re-compiled, re-routed, and re-tested. In a matter of days, bring-up confirmed full, in-circuit operation, and the design was ready to archive.

All Viewlogic INI, SCH, SYM, CMD, DAT, and SEE files as well as all Xilinx PRO, PRP, RPT, LOG, OUT, LCA, BIT, and MBO files were zipped and saved. Lastly, the fundamental circuit operation was written up and may be found in "Floorplanning Xilinx FPGA Designs for High Performance," Design SuperCon '95, On-Chip Systems Design Conference proceedings, pp. 12-1 to 12-27.



Stephen L. Wasson is a principal of HighGate Design, Inc. (Saratoga, CA), consultants specializing in FPGA implementations.

To voice an opinion on this or any Integrated System Design article, please e-mail your message to michael@asic.com.


Integrated System Design, January 1996



Copyright © 1996 - Integrated System Design Magazine

All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.