The Advanced Encryption Standard (AES) replaced DES and Triple DES in 2001 as the preferred symmetric cryptographic cipher for the 21st century. The US government's National Institute of Standards and Technology (NIST) mandated AES for civilian agencies, and the National Security Agency (NSA) has authorized AES for the encryption of classified information.
AES is an open standard (see fips-197.pdf) that was selected through an open competition. The winner was the Rijndael algorithm, chosen because it combines an extremely high level of security with computational efficiency. The algorithm consists of Exclusive-OR functions combined with matrix operations, and its mathematically 'clean' design avoids the risk of 'back doors' for unauthorized users.
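The XOR-based structure is easy to see in the AddRoundKey step, the part of every AES round that mixes the key into the state. A minimal sketch in Python (the function name and the example values are illustrative; the key shown is the sample key from FIPS-197 Appendix A):

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """AddRoundKey: bytewise XOR of the 16-byte AES state with a round key."""
    return bytes(s ^ k for s, k in zip(state, round_key))

state = bytes(range(16))  # an arbitrary 128-bit (16-byte) example state
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # FIPS-197 sample key
mixed = add_round_key(state, key)

# XOR is its own inverse, so applying the same round key twice restores
# the original state -- one reason the design is considered so 'clean'.
assert add_round_key(mixed, key) == state
```

This self-inverse property is also why the same hardware XOR network can serve both encryption and decryption of this step.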
The elegance and efficiency of the algorithm make it suitable for either hardware or software implementation. Low data rates can be handled by software-only solutions. Hardware solutions are, of course, much faster and are often specified because implementing the critical security components in hardware isolates them from software threats such as viruses. This avoids the need for a detailed and costly security analysis of every software component in the system.
To achieve higher data throughput designers can use a SoC (ASIC) or FPGA platform to provide hardware acceleration. This is where another feature of AES comes into play, the scalability of the algorithm. Fig 1 gives a typical trade-off between throughput and equivalent ASIC gate count and exhibits a nearly linear relationship between complexity and data rate.
1. Example area/performance trade-off for AES configurations.
A low gate count design will use a narrow data path (down to 8 bits) and process each 128-bit block over multiple cycles. A classic cost/performance trade-off is achieved by increasing the hardware resources to give a wider data path, which needs fewer cycles for a given throughput. Implementations using 32-bit data paths often offer an optimal trade-off because of the way the AES algorithm is defined.
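As a rough model of this trade-off (an assumption for illustration; real cores differ in round structure and may add key-schedule cycles), an iterative AES-128 core that processes its 10 rounds `data_width_bits` at a time needs the following cycle count per 128-bit block:

```python
def cycles_per_block(data_width_bits: int, rounds: int = 10) -> int:
    """Rough cycle count for an iterative AES core: each of the `rounds`
    rounds touches all 128 state bits, data_width_bits at a time
    (10 rounds for AES-128)."""
    assert 128 % data_width_bits == 0, "width must divide the block size"
    return rounds * (128 // data_width_bits)

# Narrow data path: cheap but slow; wide data path: more gates, fewer cycles.
print(cycles_per_block(8))    # 160 cycles per block
print(cycles_per_block(32))   # 40 cycles per block
print(cycles_per_block(128))  # 10 cycles per block
```

The linear scaling of this simple model is consistent with the nearly linear complexity/data-rate relationship shown in Fig 1.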
FPGA applications can exploit a similar trade-off. For example, a wireless application requiring 100 Mbps can be realized using a small core with a 16-bit data width and a 70 MHz system clock. Optical networking needing 10 Gbps is achieved by increasing the data width to 128 bits, adding pipelining, and winding the clock up to 156 MHz. Trade-offs within this 100:1 range provide intermediate solutions that span the needs of military, broadcast, communications, and storage applications.
2. Simplified block diagram showing area/performance trade-off.
Until recently, the choice between a SoC/ASIC or an FPGA implementation was usually clear-cut. Compared to ASIC solutions, FPGAs carry overheads that have an impact on technical and commercial performance. The programmable interconnect on an FPGA adds RC delays that reduce performance compared to the custom metal of an ASIC, and the FPGA needs additional transistors to provide the programmability, which raises the cost.
Over recent years, however, FPGAs have reduced these disadvantages and closed the gap significantly to the point where they are routinely used for volume production. A programmable solution is, without question, the fastest way to market and for this reason has become ubiquitous.
Much has been written about the total SoC project costs of mask sets, design time, and tool suites for a 65 nm chip, with estimates starting at $5M and going stratospheric. Staggering costs like these eliminate all but a handful of designs, as witnessed by the dwindling number of ASIC starts. Recent reports (see the SoC Schedules EE Times article #206905136, for example) also suggest that nearly 9 out of 10 projects overrun their deadlines, making long development schedules even longer.
That said, if the choice is for a masked solution because of unit cost or extreme performance requirements, then there is a wide selection from over 30 IP vendors.
From Fig 1, the cost of the silicon for the AES function will be extremely low, with the largest design occupying only 1 or 2 cents' worth of silicon. AES cores can be highly optimized for ASIC. More significant are the engineering costs of integrating the function into the design, and especially design verification. This is where comprehensive test benches that cover all the "corner cases" start to pay off. Some applications may require a FIPS validation of the AES implementation, with the attendant risk of a respin.
When the target implementation technology is an FPGA, the choice of IP appears equally wide. Many vendors offer netlists that can be input into the FPGA design flow. A faint warning bell should be ringing at this point unless the design has been specifically targeted at the FPGA architecture. The reason is that the ASIC world differs from the FPGA realm in subtle ways. An ASIC builds up the functions the designer specifies from a rich cell library, and there is no need to consider the impact of implementation. In contrast, FPGAs have fixed resources onto which the design must map efficiently.
Small details such as using asynchronous resets rather than synchronous signals can have disproportionate impacts. The details of the memory design can have a large influence, as FPGA memories are fixed and finite, and an unsympathetic design could double the resources consumed. This would not be significant if the cost per gate were similar to an ASIC's, but that is not the case (see the Commercial Considerations topic below). If the design is moderate performance and the production quantity is low, then any inefficiency can probably be accommodated. If the design has been built specifically for FPGA implementation, these architectural differences should already have been taken into account.
Even within designs built for FPGAs there can be large differences. You would expect vendor-to-vendor differences, but there can also be surprises within a single vendor's portfolio. For example, the premium to implement a combined data encrypter/decrypter over an encrypter-only design can range from a modest 10% with lots of resource sharing to around twice the size in the worst case.
Another significant variable relates to the key expansion system used in the AES process to encrypt the plaintext. The algorithm to calculate these round keys can be implemented either in the FPGA hardware or in software on an external processor. A software approach may be suitable for low-throughput schemes with spare processing cycles, but it will not suit high performance and is counter-intuitive if you have already decided to use hardware acceleration. Note, however, that the key expander can be resource intensive and can double the design size, although it is normally included in the resource estimates.
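For a sense of what the key expander computes, the AES-128 key schedule is small enough to sketch in software. The following Python version is a straightforward transcription of the FIPS-197 schedule (with the standard S-box embedded as a table); it expands a 16-byte cipher key into the 44 round-key words, and is checked below against the worked example in FIPS-197 Appendix A:

```python
# The AES S-box from FIPS-197, as a 256-byte lookup table.
SBOX = bytes.fromhex(
    "637c777bf26b6fc53001672bfed7ab76"
    "ca82c97dfa5947f0add4a2af9ca472c0"
    "b7fd9326363ff7cc34a5e5f171d83115"
    "04c723c31896059a071280e2eb27b275"
    "09832c1a1b6e5aa0523bd6b329e32f84"
    "53d100ed20fcb15b6acbbe394a4c58cf"
    "d0efaafb434d338545f9027f503c9fa8"
    "51a3408f929d38f5bcb6da2110fff3d2"
    "cd0c13ec5f974417c4a77e3d645d1973"
    "60814fdc222a908846eeb814de5e0bdb"
    "e0323a0a4906245cc2d3ac629195e479"
    "e7c8376d8dd54ea96c56f4ea657aae08"
    "ba78252e1ca6b4c6e8dd741f4bbd8b8a"
    "703eb5664803f60e613557b986c11d9e"
    "e1f8981169d98e949b1e87e9ce5528df"
    "8ca1890dbfe6426841992d0fb054bb16"
)

# Round constants for AES-128 (10 rounds).
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def expand_key_128(key: bytes) -> list:
    """Expand a 16-byte AES-128 key into 44 four-byte round-key words."""
    w = [key[4 * i:4 * i + 4] for i in range(4)]
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = bytes(SBOX[b] for b in t[1:] + t[:1])   # RotWord, then SubWord
            t = bytes([t[0] ^ RCON[i // 4 - 1]]) + t[1:]  # XOR round constant
        w.append(bytes(a ^ b for a, b in zip(w[i - 4], t)))
    return w

# FIPS-197 Appendix A.1 worked example.
w = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
assert w[4].hex() == "a0fafe17" and w[7].hex() == "2a6c7605"
```

In hardware, the same schedule can either be precomputed and stored or generated on the fly alongside each round, which is the main source of the resource variation noted above.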
Vendor-to-vendor differences can be even more surprising. A comparison of data sheets for a similar configuration can reveal significant throughput differences from similar resources. The explanation (to some degree, at least) lies in the data width used or the features provided.
Another variable that is worth considering is the clock frequency used to achieve the throughput. The relationship is given by:
Throughput (Mbps) = 128 × f_clk (MHz) / number of cycles
... where the number of cycles depends on the key size and the data width and can range from 1 to over 600.
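As a worked illustration of the relationship (the function name and the cycle counts are assumptions chosen in the spirit of the examples earlier, not figures from any vendor's data sheet):

```python
def aes_throughput_mbps(f_clk_mhz: float, cycles_per_block: int) -> float:
    """Throughput (Mbps) = 128 * f_clk / cycles, for 128-bit AES blocks."""
    return 128 * f_clk_mhz / cycles_per_block

# A fully pipelined 128-bit core accepting one block per cycle at 156 MHz:
print(aes_throughput_mbps(156, 1))   # 19968.0 Mbps, i.e. roughly 20 Gbps
# A compact 8-bit core needing 160 cycles per block at 70 MHz:
print(aes_throughput_mbps(70, 160))  # 56.0 Mbps
```

Sweeping the cycle count from 1 to several hundred while holding the clock fixed reproduces the roughly 100:1 throughput range discussed above.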
The engineering costs of a high clock speed are higher power consumption and more difficult timing closure. FIFOs and data flow control may be required if the core runs at a different speed from the rest of the design, adding both cost and complexity.