Startup Crescendo Networks saw an opportunity in terminating TCP connections with Layer 4/Layer 7 processing capabilities, an approach that could dramatically improve throughput on networks running, for example, Secure Sockets Layer communications into a bank of servers. But the application was a minefield of uncertainties. That meant the task of architectural partitioning, and in particular the choice of an implementation strategy, would prove critical to this system OEM.
"The experience of vendors who have tried to come up with a hardware solution for Layer 4/7 processing has been very bad," explained Yiftach Shoolman, founder and president of the Dublin, Calif., company. "There is so much going on in the evolution of the protocols at those levels that it is almost impossible to predict the changes. There is a huge need for hardware acceleration, because processing these layers is a bottleneck. But a hardware approach is simply too risky."
Crescendo's architecture team decided to plunge in anyway, to see whether the relatively fixed functions could be partitioned into hardware and the constantly changing functions into either configurable hardware or software, while still ending up at an attractive cost and performance point.
"We recognized at once that we would be wise to use off-the-shelf components as much as possible," Shoolman said. That led the architects to focus on an off-the-shelf network processor. But they quickly found it lacked the functionality for the deeper-layer processing that would differentiate Crescendo's product. "That left us with two alternatives," Shoolman said. "We had to add functionality to the NPU. So we could use an NPU with an ASIC, or we could use an NPU with an FPGA."
A conventional analysis found the ASIC would have a strong advantage in unit cost. But it would also have nearly a two-year design cycle, and unless it were based on a software-programmable core, it would be a dead end from the beginning. "There are simply too many changes going on to lock anything into hardware," Shoolman said. The prospect of building a fast processor and tightly coupled application engines into an ASIC made the second alternative look better.
"The second possibility was an FPGA," Shoolman said, and it again came with two alternatives. Logic for upper-layer processing could be created in RTL and configured into the FPGA, giving the best power and performance. Or the volatile logic could be put in software and executed on a CPU core embedded inside the FPGA.
The project team had generated a lot of RTL for application processing when the architects realized, Shoolman said, that the FPGA would be so close to capacity, about 80 percent utilized, that the sorts of changes the team anticipated would be catastrophic. Yes, the FPGA could be reconfigured, even in the field. But with the device so close to capacity, achieving a successful route and reaching timing closure would be horrendous every time a change was made.
"We had to move some of the functionality to software," Shoolman said. But that created another problem. "If we dropped an ARM core into the FPGA, we would be at too high a utilization again," Shoolman explained. "We needed a very compact core that was intended for implementation inside an FPGA."
The team found such a core in Altera Corp.'s Nios. By instantiating six to eight Nios cores in the FPGA, the architects found sufficient processing power for the volatile code, and enough logic fabric left for the functions that must be hardwired. The result, according to modeling, would be a 10x to 20x improvement over software-only processing on the servers.
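The trade-off described here, in which soft cores compete with hardwired logic for a fixed amount of fabric while headroom is kept for routing and timing closure, can be sketched as a simple budget check. This is only an illustration: the function name and every number below are invented placeholders, not figures from Crescendo or Altera.

```python
def max_soft_cores(total_cells, hardwired_cells, core_cells, target_percent):
    """Return how many soft CPU cores fit in the fabric left over after the
    hardwired logic, while keeping utilization at or below target_percent."""
    # Cells usable before the utilization ceiling is hit (integer math
    # avoids floating-point rounding surprises).
    ceiling = total_cells * target_percent // 100
    budget = ceiling - hardwired_cells
    if budget <= 0:
        return 0
    return budget // core_cells


# All values are hypothetical, chosen only to illustrate the reasoning.
TOTAL_CELLS = 40_000        # assumed device capacity in logic cells
HARDWIRED_CELLS = 14_000    # assumed size of the functions kept in RTL
COMPACT_CORE_CELLS = 1_500  # assumed footprint of an FPGA-oriented soft core
LARGE_CORE_CELLS = 15_000   # assumed footprint of a general-purpose core
TARGET_PERCENT = 65         # assumed ceiling that leaves room to re-route

compact_fit = max_soft_cores(TOTAL_CELLS, HARDWIRED_CELLS,
                             COMPACT_CORE_CELLS, TARGET_PERCENT)
large_fit = max_soft_cores(TOTAL_CELLS, HARDWIRED_CELLS,
                           LARGE_CORE_CELLS, TARGET_PERCENT)

print(f"compact cores that fit: {compact_fit}")
print(f"large cores that fit: {large_fit}")
```

Under these made-up numbers, several compact cores fit below the ceiling while a single large core does not, which is the shape of the argument Shoolman makes for a core designed for FPGA implementation.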
The thinking process was a long way from an a priori decision, Shoolman said. At Crescendo, the chief technical officer does system partitioning and implementation strategy. But CTO Yigal Rappaport worked with the director-level people in both hardware and software engineering. "We are very fortunate to have years of huge experience in hardware-software co-design," Shoolman reflected.
The project began with a full-system simulation in C++, which allowed Rappaport, who is also Crescendo's chief architect, to experiment freely with hardware-software partitioning strategies. But it wasn't possible to simply impose a partitioning from on high, Shoolman said.
"It was a very interactive process. You can't stay at a high level of abstraction and still understand the implications of a partitioning decision."
So, activity broadened out beyond just Rappaport and his team directors. When a deeper look into the actual implementation of a block was necessary, a director would bring his design team in as well. Thus, architectural partitioning and implementation planning became a process of trying an approach, validating it with partial implementation and trying again.
The result was an innovative mixture of off-the-shelf NPU, off-the-shelf FPGA and synthesizable, FPGA-oriented CPUs, each processor with its own applications code. After taking home NetWorld+Interop's "best" in the Performance Enhancer Category, the team is now ready to try keeping up with the whirlwind that is Layer 4/7 processing.
See related chart