BDTI has released independent benchmark results for Tilera's massively parallel TILE64 processor on the BDTI Communications Benchmark (OFDM). The TILE64 chip incorporates 64 processor cores connected to each other in a mesh configuration. The cores operate at 866 MHz and are fairly simple, three-issue VLIW machines that support limited SIMD operations, such as SIMD adds and subtracts (but not SIMD multiplies). Tilera expects engineers to program the chip using C/C++ along with intrinsics to access the SIMD capabilities.
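To make "SIMD adds" concrete, here is a minimal portable C sketch of what a packed add computes: two independent 16-bit additions carried in one 32-bit word. This is not a Tilera intrinsic. The function name `add_2x16` and the two-lane, 16-bit layout are assumptions made purely for illustration, since the article does not specify the TILE64's lane widths or intrinsic names.

```c
#include <stdint.h>

/* Hypothetical helper (not a Tilera intrinsic): adds two pairs of 16-bit
 * lanes packed into 32-bit words. Each lane wraps around on overflow, and
 * no carry crosses from the low lane into the high lane. */
static uint32_t add_2x16(uint32_t a, uint32_t b)
{
    /* Add the low 15 bits of each lane; clearing each lane's top bit
     * guarantees a carry cannot spill into the neighboring lane. */
    uint32_t low = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);

    /* Fold each lane's top bit back in with XOR (an add without carry-out). */
    return low ^ ((a ^ b) & 0x80008000u);
}
```

On hardware with a real SIMD add, a single instruction would do this work; the sketch only shows the lane-wise behavior a programmer would expect from such an intrinsic.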
The TILE64 targets a wide range of applications, including networking, communications, and digital video. According to Tilera, the chip consumes under 20 watts. Tilera began shipping to initial customers in 2007, and volume production is scheduled for the fourth quarter of this year.
BDTI evaluated the TILE64's performance using the BDTI Communications Benchmark (OFDM). This benchmark is an application-oriented benchmark based on an orthogonal frequency division multiplexing (OFDM) receiver, as shown in the block diagram below. It is representative of the baseband processing found in many current and emerging wired and wireless communications applications. BDTI verified the TILE64 benchmark results on hardware.
BDTI has used the BDTI Communications Benchmark (OFDM) to evaluate a range of processing engines that target communications applications, including traditional, high-performance DSP processors, massively parallel processors, and high-performance, DSP-oriented FPGAs. As noted earlier, the TILE64 is intended for use in a wide range of applications, of which communications is only a subset. BDTI has not benchmarked the chip's capabilities on other types of processing.
For this benchmark, BDTI reports two sets of results: low-cost results, which are optimized to provide the lowest cost per channel; and high-capacity results, which are optimized to accommodate the maximum number of channels per chip. A chip vendor may use two different chips to generate these two results.
Because Tilera has so far benchmarked only one of its chips, the high-capacity and low-cost results are the same. High-capacity benchmark results for the TILE64 and other selected chips are shown in Table 1.
Additional BDTI Communications Benchmark (OFDM) results are available at http://www.bdti.com/bdtimark/ofdm.htm#Scores.
Table 1. BDTI Certified high-capacity results for the BDTI Communications Benchmark (OFDM). Results © 2006-2008 BDTI.
As shown in Table 1, the TILE64 is able to handle 15 channels of BDTI's OFDM benchmark. This is one channel more than picoChip's PC102, though as shown above, the TILE64 is much more expensive. The TILE64 implements dramatically more channels than Texas Instruments' TMS320C6410 (a traditional single-core DSP), but many fewer than the Xilinx Virtex-4 FPGA. (We should note here that picoChip, TI, and Xilinx have all introduced new chips since these benchmark results were issued; BDTI does not yet have results for these newer chips.) The TILE64 is clearly a powerful chip, but its cost-performance, while superior to that of the TI DSP, is not nearly as good as that of the PC102 or FX140.
It's particularly interesting to compare the TILE64 to the PC102, since both are massively parallel devices. The PC102 operates at a much lower clock rate (160 MHz vs. 866 MHz) and uses a larger, heterogeneous array of processors (308 vs. 64) plus 14 specialized co-processors. The TILE64 uses a smaller, homogeneous array of faster processors. The two approaches yield similar benchmark results here (in terms of the number of channels) but make different tradeoffs in terms of cost and programmability.
On the Tilera chip, each core implements the full BDTI Communications Benchmark (OFDM) with a throughput equal to one quarter of that required for real-time operation. A cluster of four cores is therefore able to deliver real-time throughput with four-frame latency. Achieving the 15-channel benchmark result requires 60 cores, plus another four cores to handle I/O and buffering. Tilera implemented the OFDM channel on one core and then replicated this implementation across the chip. In general, it's likely that TILE64 users will often start with a C implementation of their application running on one core and then add more cores to improve performance. Because the cores do not share global resources, the TILE64's performance on the benchmark scaled linearly as more cores were added.
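The core budget described above reduces to simple arithmetic, sketched here in C. The constants come from the article (four cores per channel, four cores for I/O and buffering, 64 cores in total). The function names, the core numbering, and the round-robin assignment of frames to the cores in a cluster are assumptions for illustration only; the article does not describe how Tilera actually distributes frames.

```c
/* Figures taken from the article. */
enum {
    CORES_PER_CHANNEL = 4,  /* each core runs at 1/4 of real-time throughput */
    IO_CORES          = 4,  /* cores reserved for I/O and buffering */
    TOTAL_CORES       = 64
};

/* Cores required to run the given number of channels in real time. */
static int cores_needed(int channels)
{
    return channels * CORES_PER_CHANNEL + IO_CORES;
}

/* Largest channel count that fits in the given number of cores. */
static int max_channels(int total_cores)
{
    if (total_cores <= IO_CORES)
        return 0;
    return (total_cores - IO_CORES) / CORES_PER_CHANNEL;
}

/* Hypothetical mapping: cores 0..3 handle I/O, and each channel owns the
 * next four cores. Successive frames rotate through a channel's cluster
 * (an assumed round-robin scheme), so every core gets four frame periods
 * to finish one frame, consistent with the four-frame latency. */
static int core_for_frame(int channel, long frame)
{
    return IO_CORES + channel * CORES_PER_CHANNEL
         + (int)(frame % CORES_PER_CHANNEL);
}
```

With these figures, `cores_needed(15)` comes to 64, which is exactly the chip's core count, and `max_channels(TOTAL_CORES)` comes to 15, matching the benchmark result. The linear relationship in `cores_needed` mirrors the linear scaling the article reports.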
Tilera's simple implementation approach differs from that of picoChip, which used three different benchmark implementations to make maximum use of the PC102's heterogeneous resources. Some of these implementations use the hardware co-processors, others are coded exclusively in software, and still others use a mixture of the two. This is a fairly time-consuming implementation strategy, though it yields excellent chip utilization and low cost-per-channel.
The TILE64 implementation was written in C/C++ with intrinsics, and Tilera says that its engineer completed the implementation process in several weeks. In comparison, we expect that developing a reasonably efficient, assembly-optimized implementation of the benchmark on a high-performance DSP (like the 'C64x) would take roughly 6-8 weeks (depending on the availability of off-the-shelf components, like FFTs), while an FPGA implementation would take even longer since it requires the user to work in VHDL.
For additional analysis of the TILE64 benchmark results and programmability, visit InsideDSP.