Web security is too expensive. The network traffic that must be secured is growing fast enough that new levels of equipment efficiency must be reached in order to keep up with demand. The traffic that should be secured will only be secured when the efficiency metrics show the cost to be reasonable. Adding Secure Sockets Layer (SSL) support to Web-based networks represents a significant cost increase over operating unsecured networks. Considerably more network equipment must be used to support the same number of secure connections as unsecured connections. This limited scalability restricts deployment of security services.
Indeed, at Web sites with high volumes of secure traffic, even a Web server with a 1-GHz Pentium III processor, capable of more than 2,000 clear transactions per second, can be reduced to handling fewer than 100 SSL transactions per second, at a per-transaction latency more than 20 times that of a clear transaction. This is because the cryptographic transforms that provide privacy and authentication for the SSL protocol do not map well to traditional processors or even to newer Network Processors. Limited in cryptographic performance and heavy consumers of power (milliwatts per transaction), traditional processors cannot provide scalability in data centers where rack space, power density and cooling requirements are already problems.
Existing cryptographic accelerators typically process RSA decryption operations at rates of 200 to 800 per second. By offloading the mathematically intensive RSA computation from the host CPU, per connection latency is reduced and the number of transactions a CPU can complete is increased.
However, "connection scaling," or more simply the number of connections a device can support, then becomes constrained by the computational overhead of processing the symmetric cryptographic services such as ARC4-MD5. At 2,000 connections per second, the cryptographic load on a server encrypting data going to the client (using the maximum SSL record size) is over 280 Mbit/second. Using a 1-GHz Pentium III as a reference, the bulk cryptography would consume over 55% of the CPU for ARC4-MD5 and nearly 500% for compute-intensive 3DES-SHA1.
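The bulk-encryption load can be sanity-checked with simple arithmetic. The sketch below assumes one maximum-size record per connection and takes the record limits from the SSL 3.0 specification (16,384 bytes of plaintext, up to 18,432 bytes of ciphertext); those sizes, the one-record-per-connection assumption, and all names are illustrative and not taken from our measurements.

```python
# Rough bulk-cipher load at a given connection rate (illustrative assumptions).
MAX_PLAINTEXT_BYTES = 2 ** 14           # SSL 3.0 plaintext fragment limit
MAX_CIPHERTEXT_BYTES = 2 ** 14 + 2048   # SSL 3.0 ciphertext fragment limit


def bulk_load_mbit_per_sec(connections_per_sec, record_bytes):
    """Bits per second needed to encrypt one record per connection."""
    return connections_per_sec * record_bytes * 8 / 1_000_000


low = bulk_load_mbit_per_sec(2000, MAX_PLAINTEXT_BYTES)
high = bulk_load_mbit_per_sec(2000, MAX_CIPHERTEXT_BYTES)
print(f"load at 2000 connections/sec: {low:.1f} to {high:.1f} Mbit/sec")

# Taking the quoted 280 Mbit/sec and CPU shares at face value, the implied
# software throughput of the reference CPU is load / fraction of CPU used.
implied_arc4_md5 = 280 / 0.55   # about 509 Mbit/sec
implied_3des_sha1 = 280 / 5.00  # about 56 Mbit/sec
print(f"implied CPU throughput: ARC4-MD5 {implied_arc4_md5:.0f}, "
      f"3DES-SHA1 {implied_3des_sha1:.0f} Mbit/sec")
```

The quoted figure of just over 280 Mbit/second falls between the two bounds, which suggests it includes some per-record overhead beyond the raw plaintext.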
While some existing ICs have attempted to offload the cryptographic primitives involved in the bulk cipher suites, the results have been poor. In at least one case they have actually decreased the number of connections that can be processed, due to poor performance by the associated cryptographic cores and inefficient use of the available PCI bus bandwidth.
With such considerations in mind, we set out to design an IC that would provide true scalable SSL acceleration at the system level. The goal was for the entire system to deliver SSL performance at a manageable financial cost and level of power consumption, rather than simply to maximize the performance of individual cores.
Our design team kept both performance metrics and efficiency metrics at the forefront of the specification. For performance metrics we looked at two measures. First, we insisted on raw cryptographic acceleration performance that significantly surpassed existing CPUs and IC cores. This performance is measured in Mbit/sec for each of the symmetric algorithms ARC4, MD5, 3DES, and SHA1 and in decryptions-per-second for RSA decrypts. Second, we looked at overall system-level performance. Fast cores are necessary but not sufficient for system-level performance. True system-level performance requires that the chip be designed in such a way that the entire SSL protocol is accelerated, taking advantage of the full capability of all the component parts.
For efficiency metrics, we focused on key ratios for power consumption and financial cost. For power, we set milliwatts-per-RSA decrypt/sec and milliwatts-per-ciphersuite/sec as our benchmarks: our chip consumes only 3.5 watts, leading to 0.85 milliwatts-per-RSA decrypt/sec and 1.81 milliwatts-per-ciphersuite/sec. For cost, we set dollars-per-RSA decrypt/sec and dollars-per-ciphersuite/sec as our benchmarks: a cost of $150 in volume leads to about $0.04 per RSA decrypt/sec and $0.08 per ciphersuite/sec.
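These ratios also imply the sustained operation rates behind them. The following is a minimal sketch of that arithmetic, assuming each ratio is simply the chip's total power (or volume price) divided by the sustained rate; the variable names are illustrative.

```python
# Back out the implied operation rates from the power ratios, then derive
# the cost ratios from the $150 volume price.
CHIP_POWER_MW = 3500.0   # 3.5 watts
CHIP_COST_USD = 150.0    # volume price

MW_PER_RSA_OP = 0.85     # milliwatts per RSA decrypt/sec
MW_PER_SUITE_OP = 1.81   # milliwatts per ciphersuite/sec

rsa_rate = CHIP_POWER_MW / MW_PER_RSA_OP      # roughly 4,100 decrypts/sec
suite_rate = CHIP_POWER_MW / MW_PER_SUITE_OP  # roughly 1,900 ciphersuite ops/sec

cost_per_rsa = CHIP_COST_USD / rsa_rate       # dollars per RSA decrypt/sec
cost_per_suite = CHIP_COST_USD / suite_rate   # dollars per ciphersuite/sec

print(f"implied RSA rate:         {rsa_rate:,.0f} per sec")
print(f"implied ciphersuite rate: {suite_rate:,.0f} per sec")
print(f"cost per RSA decrypt/sec: ${cost_per_rsa:.2f}")
print(f"cost per ciphersuite/sec: ${cost_per_suite:.2f}")
```

Under these assumptions the cost ratios come out to roughly four cents and eight cents per unit of sustained throughput.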
Early in the design process, we created the very high-speed cryptographic cores that serve as the lowest-level building blocks of our chip. Concurrently, our network protocol engineers identified SSL-level protocol bottlenecks that would guide the creation of high-level protocol operations to be carried out as atomic operations on the chip. The result is a chip that does more than simply accelerate the raw cryptographic calculations. It is a true protocol processor for SSL, a new generation of security device: an intelligent, single-chip network security processor.
It includes primitives to optimize the performance of the two phases of an SSL session - the SSL handshake and SSL record-layer processing. The SSL handshake allows the client and server systems to agree on a protocol revision, server identity, and a shared secret that will be used to generate key material for data encryption and authentication.
The SSL record layer processes byte streams into record blocks that are encrypted and authenticated using a pair of encryption and hashing algorithms called an SSL ciphersuite. The SSL session establishment phase requires functions for generating secure random numbers, performing the RSA decryption, and generating the master secret and key material using a series of complex hashes. For the SSL Record protocol, user data must be encrypted using a symmetric cipher and authenticated using a hashing algorithm based on keys provided in the SSL handshake. All of these functions are implemented in hardware on our chip as single operations.
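To give a sense of the "series of complex hashes," here is a software sketch of master-secret and key-material generation. It assumes the SSL 3.0 derivation (nested MD5 over SHA-1 with the labels "A", "BB", "CCC", and so on); TLS uses a different pseudorandom function, and the function names are illustrative rather than the chip's command set.

```python
import hashlib


def ssl3_expand(secret: bytes, seed: bytes, length: int) -> bytes:
    """SSL 3.0 style expansion: MD5(secret + SHA1(label + secret + seed))."""
    out = b""
    i = 0
    while len(out) < length:
        label = bytes([ord("A") + i]) * (i + 1)  # b"A", b"BB", b"CCC", ...
        inner = hashlib.sha1(label + secret + seed).digest()
        out += hashlib.md5(secret + inner).digest()
        i += 1
    return out[:length]


def master_secret(pre_master: bytes, client_random: bytes,
                  server_random: bytes) -> bytes:
    """48-byte master secret from the RSA-decrypted pre-master secret."""
    return ssl3_expand(pre_master, client_random + server_random, 48)


def key_block(master: bytes, client_random: bytes, server_random: bytes,
              length: int) -> bytes:
    """Key material; note the randoms are concatenated in the opposite order."""
    return ssl3_expand(master, server_random + client_random, length)


# Example with placeholder values (not real handshake data).
pre = b"\x03\x00" + b"\x11" * 46
cr, sr = b"\xaa" * 32, b"\xbb" * 32
ms = master_secret(pre, cr, sr)
# ARC4-MD5 needs two 16-byte MAC secrets and two 16-byte cipher keys.
kb = key_block(ms, cr, sr, 64)
print(len(ms), len(kb))
```

In software this costs several hash invocations per handshake on top of the RSA decryption, which is why collapsing the sequence into one hardware operation is attractive.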
Our team also took advantage of the fact that the cryptographic primitives that make up the SSL protocol lend themselves to parallelism. The use of ciphersuite processing allows ARC4 encryption and MD5 hashing cores to run simultaneously on the same data to generate output at twice the rate possible in existing CPU or IC solutions.
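The idea of running both cores over the same data can be illustrated in software. The sketch below assumes SSL 3.0 record processing with ARC4-MD5: each chunk of the record is fed to the hash and the cipher in the same pass, and the resulting MAC is encrypted last. In software the two steps still run one after the other per chunk; in hardware the two cores would consume the chunk at the same time. The class and function names are illustrative.

```python
import hashlib


class ARC4:
    """Plain software ARC4 stream cipher, kept small for illustration."""

    def __init__(self, key: bytes):
        s = list(range(256))
        j = 0
        for i in range(256):
            j = (j + s[i] + key[i % len(key)]) % 256
            s[i], s[j] = s[j], s[i]
        self.s, self.i, self.j = s, 0, 0

    def process(self, data: bytes) -> bytes:
        s, i, j = self.s, self.i, self.j
        out = bytearray()
        for byte in data:
            i = (i + 1) % 256
            j = (j + s[i]) % 256
            s[i], s[j] = s[j], s[i]
            out.append(byte ^ s[(s[i] + s[j]) % 256])
        self.i, self.j = i, j
        return bytes(out)


def seal_record(cipher, mac_secret, seq_num, content_type, fragment, chunk=64):
    """One pass over the fragment: hash and encrypt each chunk together."""
    inner = hashlib.md5(mac_secret + b"\x36" * 48
                        + seq_num.to_bytes(8, "big")
                        + bytes([content_type])
                        + len(fragment).to_bytes(2, "big"))
    out = bytearray()
    for off in range(0, len(fragment), chunk):
        block = fragment[off:off + chunk]
        inner.update(block)            # MD5 core sees the plaintext chunk
        out += cipher.process(block)   # ARC4 core sees the same chunk
    mac = hashlib.md5(mac_secret + b"\x5c" * 48 + inner.digest()).digest()
    out += cipher.process(mac)         # the MAC is encrypted after the data
    return bytes(out)


sealed = seal_record(ARC4(b"\x02" * 16), b"\x01" * 16, 0, 23, b"hello" * 50)
print(len(sealed))
```

Because the data is read once and both results come out of the same pass, the record does not have to be fetched a second time for authentication.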
Adding an SSL accelerator to process large volumes of transactions naturally means overlapping requests may be processed in parallel rather than serially. Our IC includes multiple high-performance parallel execution units. While a host CPU or older-generation IC can perform only one operation at a time, either an RSA decryption or a 3DES operation, the multiple execution units of our IC can perform both simultaneously. So "connection scaling" increases not only because of the SSL offload, but also because of the chip's ability to perform operations in parallel.
Specifically, our chip can perform up to four operations in parallel: secure random number generation at rates up to 10,000 SSL transactions/sec, public key operations such as RSA, encryption with bulk ciphers, and authentication with hash algorithms. The three main sections of the chip (the random number generator, the Public Key Engine, and the Cryptographic Controller) each run independently of the others. Additionally, the four cores that make up the Cryptographic Controller (3DES, ARC4, SHA1, and MD5) also run independently of each other, so they too act as parallel execution units.
Driving the parallel execution units requires an architecture that allows the network security processor itself to control the transfer and flow of data and commands across the PCI bus. Clearly, a synchronous design could not effectively exploit the parallel nature of the system, so in our design we used an asynchronous non-blocking I/O strategy.
When the host wants to perform an operation, it places the request in host memory in one of three command queues and informs the chip that new commands are available. The chip processes each command in the queue, transferring other host data from and back to memory if required. Completion notification occurs when another command is sent to the same queue, when a requested interrupt is generated, or when a programmable timeout interrupt fires. The parallel-asynchronous queue operation allows the host processor to continue processing incoming requests or network manipulation overhead until the PCI SSL accelerator has transformed the data or computed the requested command result.
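The queue-based, non-blocking model can be sketched in software. This is a toy model only: worker threads stand in for the chip's execution units, a Python queue stands in for a command queue in host memory, and an event stands in for the interrupt. The three queue names and all class names are hypothetical and do not describe the chip's actual interface.

```python
import queue
import threading


class Completion:
    """Handle the host can poll or wait on, standing in for the interrupt."""

    def __init__(self):
        self._event = threading.Event()
        self.result = None

    def done(self):
        return self._event.is_set()

    def wait(self, timeout=None):
        self._event.wait(timeout)
        return self.result

    def _finish(self, result):
        self.result = result
        self._event.set()


class AcceleratorModel:
    """Three independent command queues, each drained by its own worker."""

    QUEUE_NAMES = ("rng", "public_key", "crypto")  # hypothetical names

    def __init__(self):
        self._queues = {name: queue.Queue() for name in self.QUEUE_NAMES}
        for q in self._queues.values():
            threading.Thread(target=self._drain, args=(q,), daemon=True).start()

    def _drain(self, q):
        while True:
            operation, args, completion = q.get()
            completion._finish(operation(*args))

    def submit(self, queue_name, operation, *args):
        """Enqueue a command and return at once; the host is not blocked."""
        completion = Completion()
        self._queues[queue_name].put((operation, args, completion))
        return completion


chip = AcceleratorModel()
# Placeholder operations: a modular exponentiation and a byte transform.
key_job = chip.submit("public_key", pow, 4, 13, 497)
bulk_job = chip.submit("crypto", lambda data: bytes(b ^ 0xFF for b in data), b"record")
# The host is free to accept new connections here while both jobs run.
print(key_job.wait(timeout=5), bulk_job.wait(timeout=5))
```

The point of the design is visible in `submit`: it returns before the work is done, so commands on different queues overlap and the host only synchronizes when it actually needs a result.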
See related chart
The parallel nature of the SSL record-processing core, which encrypts and authenticates records simultaneously in one PCI bus transaction, also speeds overall system throughput and efficiency. Existing solutions must transfer the data across the PCI bus multiple times to complete one SSL record operation, effectively cutting the available bus bandwidth by more than half.
The cryptographic state for the public key engine can also be stored on the chip. This is done for security reasons rather than for a significant performance boost. Private key information is extremely sensitive to compromise. Because the information is stored in memory that only the IC can read, an attacker who has compromised the Web server or load-balancing appliance has no way to obtain the private key, which could otherwise be used to impersonate the compromised site.