SAN MATEO, Calif. -- The ranks of configurable CPUs -- processors designed to be augmented by customer-specified execution units -- will expand significantly today with independent announcements by Faraday Technology Corp. and startup Stretch Inc. The two architectures, both derived from existing CPU designs, bring new dimensions to the market.
The argument behind customer-configurable CPUs, as propounded by their originators, ARC International and Tensilica Inc., can be both simple and compelling. An application is written in C or C++ and profiled. The hot spots, generally tight arithmetic loops, are reduced to small sequences of one or more custom instructions. Then an execution unit is added to the CPU to carry out the new instructions, and a new CPU design and tool suite are generated. The result is that the application remains in software, rather than being committed to immutable ASIC hardware, but the execution performance, die size and energy consumption approach the levels of a hardwired ASIC.
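The flow described above can be illustrated with a small model. This is only a sketch under assumptions: the sum-of-absolute-differences loop is a made-up example of a hot spot, and `sad4` is a hypothetical custom instruction, not one taken from ARC, Tensilica, Faraday or Stretch. The point is that the application stays in software while the inner loop collapses to one operation per four bytes.

```python
# Illustrative model only. sad4 stands in for a hypothetical custom
# instruction that an added execution unit would carry out in hardware.

def sad_scalar(a: bytes, b: bytes) -> int:
    """Baseline hot spot: one subtract, absolute value and add per byte."""
    total = 0
    for x, y in zip(a, b):
        total += abs(x - y)
    return total


def sad4(word_a: int, word_b: int) -> int:
    """Hypothetical custom instruction: sum of absolute differences over
    the four bytes packed into two 32-bit words, as a single operation."""
    total = 0
    for shift in (0, 8, 16, 24):
        byte_a = (word_a >> shift) & 0xFF
        byte_b = (word_b >> shift) & 0xFF
        total += abs(byte_a - byte_b)
    return total


def sad_with_custom_op(a: bytes, b: bytes) -> int:
    """The same loop rewritten to issue one sad4 per four input bytes."""
    if len(a) != len(b) or len(a) % 4 != 0:
        raise ValueError("inputs must be equal length, a multiple of 4 bytes")
    total = 0
    for i in range(0, len(a), 4):
        word_a = int.from_bytes(a[i:i + 4], "little")
        word_b = int.from_bytes(b[i:i + 4], "little")
        total += sad4(word_a, word_b)
    return total
```

Both versions return the same result; the rewritten loop simply runs a quarter as many iterations, which is where a real custom execution unit would recover its cycles.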
ARC and Tensilica have pushed this approach with moderate, but quiet, success. In both cases the flow involves generating RTL for a proprietary CPU core with a customer-defined additional execution unit and entering this RTL into a standard-cell-based design flow. In both cases, the secret sauce rests in the generation process, in the additional tools that are generated with the CPU RTL and in the proprietary coprocessor interface that permits an execution unit to be developed safely by engineers who have no knowledge of the details of the CPU's design.
The two new entrants are taking related but distinct directions. Faraday has broken away from the use of proprietary processors, offering a 500-MHz ARM processor with a unique coprocessor port and a number of other novel architectural support blocks, all for use within its ASIC service and United Microelectronics Corp.'s foundry program. Stretch, in contrast, is offering a standard-product microprocessor chip with an SRAM-based, user-configurable execution unit that can be modified during execution.
Faraday proffers an ARM
The Faraday core is actually one component in a library of hard-intellectual-property (IP) offerings from the newly aggressive ASIC company. Rather than following ARM Ltd. down the ARM-9 and ARM-10 trail, Faraday decided to focus on deeply embedded-processing applications -- such as edge and access router chips -- and to do its own implementation of the version-4 architecture with an eight-stage pipeline. The core will be ARM-certified (now in process at ARM in Cambridge, England) and will include the full set of ARM v4 internal features, including the debug unit. Faraday will offer three versions of the design, each using a different blend of United Microelectronics' low-leakage and high-speed libraries: 166-, 333- and 500-MHz hard cores.
Accompanying the processor core are several important building blocks that Faraday believed were critical to high-data-bandwidth applications, said company president Charlie Cheng. The most obvious is a nonblocking crossbar switch to replace the AMBA High-Performance Bus in the canonical version of the ARM architecture. The crossbar takes considerably more die area than a conventional multiplexer array, but in terms of throughput and deterministic performance, it is far better, Cheng said. This block, too, is provided as a hard macro, with an SRAM-style interface and filters tuned to support delivery of sustained bandwidth to streaming data with maximum latency requirements, such as high-definition video.
The third important hard-IP block is a centralized, table-driven data coherency engine. "Ever since the early days of symmetric multiprocessing, architects have used MESI coherency protocols and distributed controllers as kind of a default," Cheng said. "It's clearly a better approach in that world. But when you only have a few data sources, clearly defined data flows and static task assignment, MESI is overkill. A centrally administered EI protocol is more than adequate."
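To make the contrast with distributed MESI controllers concrete, here is a rough sketch of what a centrally administered two-state scheme could look like. Everything in it is an assumption: the article does not describe Faraday's engine, "EI" is read here as Exclusive/Invalid, and the class and method names are hypothetical. The single table is the only place ownership is recorded, so there are no Shared or Modified states to track and no snooping among peers.

```python
class CentralCoherencyTable:
    """Hypothetical two-state (Exclusive / Invalid) coherency directory.

    One central table records which agent, if any, holds each block.
    This is an illustration of the idea, not Faraday's design.
    """

    def __init__(self, agents):
        self.agents = set(agents)
        self.owner = {}          # block address -> agent holding it Exclusive
        self.invalidations = []  # log of (agent, block) invalidate messages

    def state(self, agent, block):
        """Return 'E' if the agent holds the block, otherwise 'I'."""
        return "E" if self.owner.get(block) == agent else "I"

    def acquire(self, agent, block):
        """Grant the block to one agent, invalidating any previous holder."""
        if agent not in self.agents:
            raise ValueError(f"unknown agent: {agent!r}")
        previous = self.owner.get(block)
        if previous is not None and previous != agent:
            self.invalidations.append((previous, block))
        self.owner[block] = agent

    def release(self, agent, block):
        """Give the block up; afterward every agent sees it as Invalid."""
        if self.owner.get(block) == agent:
            del self.owner[block]
```

With only a few data sources and static task assignment, as Cheng describes, a table of this kind stays small, which is presumably what makes the centralized approach attractive.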
But the show stopper in the architecture may be the coprocessor interface. Rather than being isolated on the end of a private bus and treated as a slave, the single coprocessor in the Faraday architecture is a peer to the ARM core. In fact, it could be another ARM core. The coprocessor is coupled to the CPU through ARM's tightly coupled memory, a very fast multiport scratchpad, and through interrupts. Both processors share access to the crossbar. This allows customers to design anything from a simple add-on instruction to quite sophisticated engines and to couple them tightly to the ARM CPU without tying them directly into the ARM pipeline timing. That greatly simplifies timing analysis and verification, Cheng said.
The contrast to Stretch's vision could not be greater. While ARC, Faraday and Tensilica all focus on system-on-chip designers, the engineers at Stretch took up the cause of the design team working at board level. "In the past, when there was a processing task too hard for the available microprocessor, designers had a limited number of choices," said Stretch's chief executive officer, Gary Banta.
"They could add more processors if the task decomposed into a multiprocessing problem and they had the space and power for more CPUs," Banta said. "Or, they could add an FPGA or ASIC to the board to handle the critical inner loops where the hard part was. But all of these solutions had a negative impact on board area, bill-of-materials cost and power."
Stretch aimed to bring the idea of a configurable CPU to the board-level designer. But of course it was not practical to do that by making the design team undertake a cell-based ASIC design. Instead, Stretch created a standard-product CPU with an SRAM-based configurable processing array sitting beside it on the die.
The Stretch flow
The Stretch design flow starts very much like any other. Designers create their application in C/C++, profile it and home in on the inner loops. But then a Stretch tool compiles those critical loops directly into a parallel execution engine mapped onto the configurable fabric. The result is a CPU running a C program, with some loops implemented as single- or multicycle operations on the array fabric.
To reduce their own cost and time-to-market, the Stretch designers started with the Tensilica Xtensa-5 CPU core. They simply replaced the custom execution unit that would have been generated by the Tensilica tools with their own fabric and added a unique register file between the two.
The nature of that register file and fabric is dictated by the kinds of things that happen in the inner loops of applications, said Stretch's chief technical officer, Albert Wang. The operations the Stretch tools extract from these loops are arithmetic in nature, he said, and lend themselves to simple pipelines with unidirectional flow, but are all over the place in terms of operations on word, byte or bit boundaries.
So, Stretch created a file of thirty-two 128-bit-wide registers to sit between the Xtensa core and the fabric. Programmable data paths permit the fabric to extract data of any width on any boundary, including noncontiguous bits, from a register. There are three read ports and two write ports between the register file and the fabric. The write ports permit masked write operations, so the fabric actually can operate on the registers at bit level.
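The register-file behavior described above, arbitrary-width reads on any boundary, gathers of noncontiguous bits and masked writes, can be modeled with ordinary integer arithmetic. The sketch below is illustrative only; the function names are hypothetical and it says nothing about how Stretch's programmable data paths are actually built.

```python
REG_BITS = 128                    # width of one wide register, per the article
REG_MASK = (1 << REG_BITS) - 1


def extract_field(reg: int, lsb: int, width: int) -> int:
    """Read `width` bits starting at bit `lsb`, on any bit boundary."""
    if lsb < 0 or width < 1 or lsb + width > REG_BITS:
        raise ValueError("field does not fit in the register")
    return (reg >> lsb) & ((1 << width) - 1)


def gather_bits(reg: int, positions) -> int:
    """Collect noncontiguous bits; positions[0] becomes bit 0 of the result."""
    out = 0
    for i, pos in enumerate(positions):
        if not 0 <= pos < REG_BITS:
            raise ValueError("bit position out of range")
        out |= ((reg >> pos) & 1) << i
    return out


def masked_write(reg: int, value: int, mask: int) -> int:
    """Update only the bits selected by `mask`; leave all others untouched."""
    return ((reg & ~mask) | (value & mask)) & REG_MASK
```

For example, `extract_field(reg, 60, 32)` pulls a 32-bit value that straddles the 64-bit midpoint of the register, and `masked_write(reg, 0, 1 << 5)` clears a single bit without disturbing its neighbors.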
Internally, the execution fabric is an array of bit-level primitives and pipeline registers. These primitives can be linked to form two kinds of basic blocks: arithmetic units and multipliers. So under control of the configuration SRAMs, the fabric is populated with a collection of ALUs, multipliers and registers of varying sizes, interconnected in whatever pattern is required for the data flow of the operation.
The full fabric is divided into two symmetrical halves. Each half can be configured to implement a different pipeline, so that two quite different atomic operations can be implemented simultaneously. Alternatively, the two halves can be used in a double-buffering mode: One half can be executing while the control store for the other half is reloading. The loading process takes 80 to 100 microseconds and is done through a direct-memory-access port. That way, the fabric can be time-multiplexed to implement a large number of atomic operations, so long as they aren't all needed at once.
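The benefit of the double-buffering mode can be estimated with a simple timing model. This is a sketch under stated assumptions, not a description of the real chip: it assumes each configuration fits in one half of the fabric, that reloading one half takes the full quoted 80 to 100 microseconds, and that configurations run strictly one after another. The function name and parameters are hypothetical.

```python
def total_time_us(run_times_us, reload_us, double_buffered):
    """Estimate the time to run a sequence of fabric configurations.

    run_times_us[i] is how long configuration i executes before the
    next one is needed. The first load can never be hidden.
    """
    if not run_times_us:
        return 0
    total = reload_us                      # initial load, always exposed
    last = len(run_times_us) - 1
    for i, run in enumerate(run_times_us):
        if i == last:
            total += run                   # nothing left to load afterward
        elif double_buffered:
            # The next configuration loads into the idle half while this
            # one runs; the switch waits for whichever finishes later.
            total += max(run, reload_us)
        else:
            total += run + reload_us       # fabric sits idle while reloading
    return total
```

With a 100-microsecond reload and three operations that each run for 200 microseconds, the model gives 900 microseconds without double buffering and 700 with it. The reload is hidden entirely only when each operation runs at least as long as a reload takes.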
This reuse capability and some other simplifying factors make the fabric much simpler than an equivalent amount of static FPGA, said Wang. He explained that in the pipelines created by the compiler, flow is unidirectional, eliminating the need for most of the semirandom interconnect segments needed by FPGAs.
Communication between the CPU and the fabric is handled entirely through the wide registers and is predefined by the compiler. When the compiler extracts and creates a custom pipeline, it also works out the latency, so it knows how large a delay slot to insert in the code stream for that operation. Software at the user level is standard C/C++, with both a primitive BIOS and MontaVista Linux available for the Xtensa.
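How a compiler might size such a delay slot can be sketched from the clock rates quoted for the chip, 300 MHz for the CPU and 100 MHz for the fabric. The assumption here, which the article does not confirm, is that the delay is simply the pipeline depth in fabric cycles converted to CPU cycles and rounded up; the function name is hypothetical.

```python
CPU_HZ = 300_000_000      # CPU clock quoted in the article
FABRIC_HZ = 100_000_000   # fabric clock quoted in the article


def delay_slot_cpu_cycles(fabric_stages, cpu_hz=CPU_HZ, fabric_hz=FABRIC_HZ):
    """CPU cycles to wait for a fabric pipeline of `fabric_stages` stages.

    Assumes one stage per fabric clock and rounds up to whole CPU cycles.
    """
    if fabric_stages < 1:
        raise ValueError("pipeline needs at least one stage")
    # Integer ceiling division avoids floating-point rounding.
    return (fabric_stages * cpu_hz + fabric_hz - 1) // fabric_hz
```

Under these assumptions a four-stage fabric pipeline would cost twelve CPU cycles, so the compiler would need to find twelve cycles of independent work or stall.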
Physically, the Stretch chip will be available in three versions differing only in their I/O configurations. The most expensive will emphasize high-speed communications interfaces, while the least expensive -- around $35 in quantity -- will have interfaces more appropriate for embedded signal-processing applications. The chips are to be implemented in a 130-nanometer process by Taiwan Semiconductor Manufacturing Co. Ltd. The CPU will operate at 300 MHz and the configurable fabric at 100 MHz.
Banta allowed that on paper, the Stretch approach should be intermediate in performance and efficiency between an ASIC approach such as Tensilica's and an FPGA-based implementation. But he boasted that on one of the EEMBC benchmarks, the Stretch chip simulation performed better than its Tensilica counterpart. "Perhaps we simply had more insight into the available optimizations in the benchmark," Banta said. "But I think the difference may be that the ability to apply multiple configurations to the fabric at run-time allowed our designers to be more aggressive."
From both the Faraday and Stretch announcements and from the gradual uncloaking of existing chips that have used the ARC or Tensilica cores, it is clear that the configurable processor is being taken seriously by some sophisticated design teams. If the exact boundaries of the domain for this solution are not yet defined and if the competition between extendable CPU cores and reconfigurable arrays or fabrics has not yet reached a conclusion, it is at least certain that this is an important weapon in the design arsenal.