To keep their products portable across different vendors' devices, designers are increasingly turning to hardware description languages (HDLs) and FPGAs. StateCAD from Visual Software Solutions is one product that has found its way into the state-machine design procedure at Chrysalis-ITS. It was used to design an FPGA memory/DMA controller on a PCI-based cryptographic accelerator card. Here we focus on this design example, which was started in a QuickLogic device, migrated and mutated into a Xilinx device, and finally moved back to a new QuickLogic one.
The design is based on its predecessor, Luna 2, which is a PC-Card format cryptographic accelerator. Luna 2's breakthrough performance was achieved by using a 32-bit 233-MHz RISC CPU, Pentium Level 2 cache memory and DES hardware acceleration through an FPGA. The goal of Luna VPN, the new design, is to preserve the functionality of Luna 2 but give the performance a boost by moving to a higher-bandwidth bus (PCI) and by using a standalone cryptographic accelerator IC.
The design consists of a 32.5-MHz local bus interface between the StrongARM processor and its memory and peripherals (SSRAM, flash, DRAM and a HiFn cryptographic accelerator). For the PCI interface, the AMCC 5920 was chosen, since only a target interface was required.
For the Luna 2 FPGA we chose a QL2007, partly because of the security that an antifuse device provides. Antifuse devices are exceedingly difficult to reverse-engineer, since exposing the die doesn't reveal the interconnections. An antifuse device also provides assurance that the design cannot be stolen, unlike an SRAM device, which requires a bit stream for configuration.
Other requirements for this design were low cost and low power, both of which antifuse devices address well.
Because of the extremely tight schedule of the PCI version of the product, an SRAM-based technology was chosen for Luna VPN, since FPGA changes could be made much more quickly and easily. We chose a large part, the XC4028, to make sure we didn't run out of gates and didn't become speed-limited because of routing. Configuration was done through a parallel EPROM, but there were more soldering problems with the socket than we would have liked. Having said that, the prototyping stage went very well and was definitely a success.
Several features had to be removed or changed and other features needed to be added for the PCI version.
Most of this was possible with very little hand-tweaking, with one exception. The StrongARM processor requires a very early response on its "wait state" line: the response must arrive in less than half the clock period. This is a very tight timing budget, since it is equivalent to doubling the clock rate on that path. So it was necessary to manually place the four flip-flops that generated the wait-state output. This was extremely cumbersome because of a bug in the software, which requires a small manual modification to the synthesis output. The required overall timing was always achievable, but occasionally the compile times were quite long (for example, 45 minutes).
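The wait-state budget can be worked out from the 32.5-MHz local bus clock mentioned earlier. The following is a minimal sketch of that arithmetic, not a timing analysis: the function names are hypothetical, and the 0.5 fraction is only the upper bound implied by "less than half the clock period".

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Return the clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def response_budget_ns(freq_mhz: float, fraction: float = 0.5) -> float:
    """Return the time allowed for a response due within `fraction` of a period.

    The default of 0.5 is an assumption: the text only says the response
    must arrive in *less than* half the period, so this is an upper bound.
    """
    return clock_period_ns(freq_mhz) * fraction


LOCAL_BUS_MHZ = 32.5  # local bus clock stated in the article

period_ns = clock_period_ns(LOCAL_BUS_MHZ)
budget_ns = response_budget_ns(LOCAL_BUS_MHZ)
# A path that must settle in half a period behaves like one clocked twice as fast.
equivalent_mhz = 1000.0 / budget_ns

print(f"clock period:          {period_ns:.2f} ns")
print(f"wait-state budget:   < {budget_ns:.2f} ns")
print(f"equivalent clock rate: {equivalent_mhz:.1f} MHz")
```

Under these assumptions the wait-state path has to meet timing as if the local bus ran at about 65 MHz, which is why only those four flip-flops needed manual placement.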
In moving to production, we required both a cost reduction and better security for the intellectual property contained in the design. The move to antifuse was a natural progression that addressed both these concerns. Given that the second revision of the printed-wiring board was started two months after the first boards were received, we thought it wise to socket the FPGA on the first batch.
In this case we moved from a slower SRAM technology to a faster antifuse one, which greatly eased the conversion. No longer did we have to manually place flip-flops, and the overall device speed (flip-flop to flip-flop) was considerably faster than we needed.
The conversion process is not only possible but also relatively painless. The design presented here went from its original state to being retargeted twice relatively quickly; the hardware portion of the project took only about four months.
One interesting note that came out of all this is that, in our case, a Xilinx 4013E contained the same amount of logic as a QuickLogic 3025. This means that according to the way Xilinx and QuickLogic are currently counting gates, 25,000 QuickLogic gates are approximately equal to 13,000 Xilinx gates. Incidentally, 25,000 QuickLogic gates are also equal to 9,000 ASIC gates (the way QuickLogic used to count them).
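The gate-count equivalence above can be turned into rough conversion factors. This is a sketch based on the single design point reported here (one design, three ways of counting), so the ratios are illustrative rather than general; the dictionary and function names are hypothetical.

```python
# Gate counts reported for the same design under three counting methods.
# These come from one data point in the article, not from vendor data sheets.
GATES_FOR_THIS_DESIGN = {
    "quicklogic": 25_000,  # current QuickLogic counting
    "xilinx": 13_000,      # current Xilinx counting
    "asic": 9_000,         # the way QuickLogic used to count (ASIC gates)
}


def convert_gates(gates: float, source: str, target: str) -> float:
    """Scale a gate count from one counting method to another.

    Assumes the ratio observed for this one design holds linearly,
    which is only a rough approximation.
    """
    return gates * GATES_FOR_THIS_DESIGN[target] / GATES_FOR_THIS_DESIGN[source]


for source in ("xilinx", "asic"):
    ratio = convert_gates(1, source, "quicklogic")
    print(f"1 {source} gate is roughly {ratio:.2f} quicklogic gates")
```

A factor close to two between the Xilinx and QuickLogic figures is worth keeping in mind when comparing data sheets during retargeting.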
Since we had no experience with PCI, one of our first decisions was to figure out how to interface to the PCI bus. We didn't have the time or the inclination to design our own PCI interface, so that left two possible solutions: implement a PCI core from one of the PLD vendors or purchase a standalone PCI interface chip. For our application, we needed only a target (slave) interface.
Going with an off-the-shelf interface chip avoided the risk of implementing someone else's PCI core and meant lower parts cost. The main reason for the lower cost is that including a core in the PLD would require a larger (and more expensive) PLD, and the increase in the cost of the PLD usually works out to be more than the cost of a standalone device. The only thing that would have taken us down the IP core path would have been a lack of room for two devices.
Designing FPGAs with retargetability in mind not only opens up options that would not otherwise be possible, but it enforces good design practices. Keeping the design in a vendor-independent high-level language that is fully synchronous is the best way to avoid those difficult-to-find bugs.
Performing a detailed resource calculation with a spreadsheet gives the designer a good early indication of which parts the design will fit into and can therefore target.
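The spreadsheet calculation described above can be sketched in a few lines of code. Everything numeric here is hypothetical: the block names, the per-block estimates, the device capacities and the 80 percent utilization ceiling are placeholders for illustration, not figures from this design or from any vendor data sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Flip-flop and logic-cell counts for a design block or a device."""
    name: str
    flip_flops: int
    logic_cells: int


def total(blocks):
    """Sum the estimated resources over all design blocks."""
    return (sum(b.flip_flops for b in blocks),
            sum(b.logic_cells for b in blocks))


def candidate_devices(blocks, devices, max_utilization=0.8):
    """Return names of devices whose capacity covers the design.

    max_utilization leaves headroom for routing and late changes;
    0.8 is an assumed rule of thumb, not a vendor recommendation.
    """
    need_ff, need_cells = total(blocks)
    return [d.name for d in devices
            if need_ff <= d.flip_flops * max_utilization
            and need_cells <= d.logic_cells * max_utilization]


# Hypothetical per-block estimates for a memory/DMA controller.
blocks = [
    Resources("dma_state_machine", flip_flops=24, logic_cells=60),
    Resources("address_counter", flip_flops=32, logic_cells=40),
    Resources("wait_state_generator", flip_flops=4, logic_cells=10),
]

# Hypothetical device capacities.
devices = [
    Resources("small_part", flip_flops=96, logic_cells=128),
    Resources("medium_part", flip_flops=192, logic_cells=256),
    Resources("large_part", flip_flops=384, logic_cells=512),
]

print("design needs:", total(blocks))
print("candidate parts:", candidate_devices(blocks, devices))
```

Keeping the estimates per block makes it easy to rerun the check when a feature is added or removed, which matters when the same design is likely to be retargeted.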