Design Article
TMS320C67x vs. ADSP-21160: Which Floating-Point DSP Offers Highest Performance?
Ian Main
8/22/1998 12:00 AM EDT
The Texas Instruments TMS320C67x and the Analog Devices ADSP-21160 SHARC processors are the two highest performance floating-point DSPs on the market today. Which of these two processors provides the highest system performance?
System engineers must select the device that provides the most effective solution to meet the requirements of their DSP application. While the obvious step is to compare the raw processing power of the two processors, this comparison will give little indication of expected system performance, especially in highly demanding multiprocessing applications.
The selection of the better DSP platform from a systems perspective requires an analysis of many aspects of the application. Firstly, the I/O data rates and channel density must be reviewed to determine the bandwidth in and out of the system. The next step involves the mapping of DSP algorithms to DSP devices. This may be complex and requires an understanding of I/O data paths, memory management, inter-processor communication capability and synchronization mechanisms. While the resolution of these issues determines the best technical solution, other factors also require consideration. For example, time-to-market is influenced by the availability of third-party library support and the characteristics of the development tools accompanying each processor.
This paper discusses the factors that the system engineer should consider in selecting one of these processing platforms over the other. The discussion includes an analysis of specific applications to illustrate the system parameters that should be compared in the decision process. The paper also deals with the support for each platform, highlighting tools that assist the developer in achieving the highest performance in each case.
| Feature | Analog Devices '21160 SHARC (100 MHz) | Texas Instruments 'C6701 |
|---|---|---|
| Performance (MFLOPS) | | |
| Peak | 600 | 1 GFLOP |
| Sustained | 400 | 500-700 |
| MMACs | 200 | 334 |
| Bandwidth: External Memory | 534 MB/s | 667 MB/s |
| I/O Bandwidth: Total | 1.134 GB/s (+ Serial Ports) | 667 MB/s (+ Host Port) (+ Serial Ports) |
| Interrupt Latency | 4 x 10 ns cycles | 11 x 6 ns cycles |
| Core Features | | |
| Number of Data Registers | 32 (+32 alternate) | 32 |
| Extended Floating Point Support | 40-bit extended precision | 64-bit double precision |
| Peripheral Features | | |
| Internal Memory Size | 4 Mbit (2 x 2 Mbit) | 1 Mbit (2 x 512 kbit) |
| Program Memory Structure | Configurable; 48-bit instructions | 16K x 32; 2K x 256-bit instructions |
| Data Memory Structure | Configurable: 16-, 32-, 48- or 64-bit | 16K x 32; 8/16/32/40/64-bit data |
| Cache | 32 instructions (if selected) | Entire internal program memory (if selected) |
| DMA Channels | 14 | 4+1 (HPI) |
| I/O Capability | | |
| Primary External Data Interface | 64 bits @ 66 MHz | 32 bits @ 167 MHz |
| Serial Ports | 2 @ 66 Mbit/s | 2 @ 83.5 Mbit/s |
| Other Ports | 6 x Link Ports @ 100 MB/s | Slave Host Port @ clk/4 (41.5 MB/s) |
| Inherent Multiprocessing Support | Cluster and Link Ports | None |
From the above table, it can be seen that the raw processing power of the 'C6701 exceeds that of the '21160 by approximately 30%. In general, this gives the 'C6701 an advantage in single-processor low- and medium-bandwidth configurations. However, the '21160, with roughly 1.7 times the total I/O bandwidth (1.134 GB/s vs. 667 MB/s in the table) and four times the internal memory capacity, is a more appropriate solution in high-bandwidth and multiprocessing applications.
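As a rough cross-check, the sketch below recomputes a few derived figures from the table. It assumes, which the article does not state explicitly, that the '21160 total I/O figure is the external port plus all six link ports, and that interrupt latency is simply cycle count multiplied by cycle time. All variable names are illustrative.

```python
# Figures copied from the comparison table; names are illustrative only.
SHARC_EXT_MB_S = 534        # '21160 external memory bandwidth
SHARC_LINK_PORTS = 6        # number of link ports
SHARC_LINK_MB_S = 100       # bandwidth per link port
C6701_IO_MB_S = 667         # 'C6701 total I/O (host and serial ports excluded)

# Assumption: '21160 total I/O = external port + all six link ports.
sharc_io_mb_s = SHARC_EXT_MB_S + SHARC_LINK_PORTS * SHARC_LINK_MB_S

io_ratio = sharc_io_mb_s / C6701_IO_MB_S
memory_ratio = 4 / 1        # internal memory in Mbit, '21160 vs. 'C6701

# Assumption: interrupt latency = cycles x cycle time, in nanoseconds.
sharc_irq_ns = 4 * 10
c6701_irq_ns = 11 * 6

print(f"'21160 total I/O: {sharc_io_mb_s} MB/s ({io_ratio:.2f}x the 'C6701)")
print(f"Internal memory ratio: {memory_ratio:.0f}x")
print(f"Interrupt latency: {sharc_irq_ns} ns vs. {c6701_irq_ns} ns")
```

Under these assumptions the link ports account for the whole difference between the '21160's external memory and total I/O figures.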
Of course, the 'C6701 also has significant I/O bandwidth and with the assistance of external hardware, it may also be used effectively in multiprocessing architectures. The merits of each are investigated in the multiprocessing section.
The following three sections compare the '21160 and 'C67x families, beginning with a review of the implications of different memory support for each. This is followed by an analysis of algorithm distribution and data flow in multiprocessing systems and ends with a study of the system I/O capability of each.
High-Performance Memory Support
Many algorithms benefit from high-performance external memory, and in some applications it is critical. With the advent of synchronous burst memory support and ever-increasing on-chip memory, application code is usually executed from internal memory for highest performance. However, critical variables (such as filter tap coefficients) must frequently be stored externally because internal resources are limited. Both the '21160 and the 'C67x support high-performance external memory.
The '21160 processor may be interfaced to asynchronous SRAM (ASRAM) or synchronous SRAM (SSRAM), either 32 or 64 bits wide. It supports synchronous and sequential burst transfers for the efficient transfer of large blocks of data. The '21160's DMA controller automatically packs external data (16-, 32-, 48-, or 64-bit) into the appropriate internal word width, either 64 or 48 bits.
The 'C67x directly supports 32-bit SSRAM, SDRAM, and ASRAM as its high-performance resource. This memory will likely be available at 167 MHz by the time the DSP is shipping, allowing for single-cycle access. The pipeline delay of SSRAM should be taken into account in throughput considerations: it adds three cycles to each first access. The consequence is that critical sections of code must be run from internal DSP memory, as it requires more than 8 clock cycles to load a single 256-bit fetch packet (eight 32-bit instructions) from any external memory.
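To illustrate how the first-access penalty is amortized over a burst, here is a minimal sketch. It assumes one 32-bit word per clock after a fixed three-cycle penalty on a 167 MHz bus; the function name and the burst lengths are illustrative, not taken from any data sheet.

```python
def effective_bandwidth_mb_s(burst_words, clock_mhz=167, bytes_per_word=4,
                             first_access_penalty=3):
    """Sustained bandwidth of one burst, assuming one word per clock
    after a fixed first-access penalty (in clock cycles)."""
    cycles = burst_words + first_access_penalty
    return clock_mhz * bytes_per_word * burst_words / cycles

# Short bursts pay heavily for the pipeline fill; long bursts approach peak.
for n in (1, 8, 64, 1024):
    print(f"{n:5d}-word burst: {effective_bandwidth_mb_s(n):6.1f} MB/s")
```

Under this model an isolated single-word access achieves only a quarter of the peak rate, while bursts of a thousand words or more come within about half a percent of it.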
In summary, the 167 MHz synchronous interface of the 'C67x gives it an external memory access advantage of 668 MBytes/s vs. 528 MBytes/s, an advantage of some 25%. However, this can only be realized for multiple consecutive external accesses where the pipeline delay becomes negligible, and it is somewhat negated by the fact that the '21160 has significantly larger internal memory. In cases where consecutive instructions must be accessed from external memory, the processing performance of both devices drops. The theoretical performance of the 'C67x can be reduced from 1 GFLOP to 167 MFLOPS. The '21160's peak performance drops from 600 MFLOPS to 396 MFLOPS due to the clock differential between the external and internal buses.
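The bandwidth figures in this summary follow from bus width and clock rate, as the short calculation below shows. The '21160 derating line reflects one reading of the text, namely that peak throughput scales with the ratio of the external (66 MHz) to internal (100 MHz) clock; the helper name is illustrative.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_bits):
    """Peak bus bandwidth in MBytes/s: one full-width transfer per clock."""
    return clock_mhz * bus_bits // 8

c67x_mb_s = peak_bandwidth_mb_s(167, 32)    # 32-bit bus at 167 MHz
sharc_mb_s = peak_bandwidth_mb_s(66, 64)    # 64-bit bus at 66 MHz
advantage = (c67x_mb_s - sharc_mb_s) / sharc_mb_s

# Assumption: '21160 peak MFLOPS scales with external/internal clock ratio
# when every instruction is fetched from external memory.
sharc_external_mflops = 600 * 66 / 100

print(f"'C67x:  {c67x_mb_s} MB/s")
print(f"'21160: {sharc_mb_s} MB/s")
print(f"'C67x advantage: {advantage:.1%}")
print(f"'21160 running from external memory: {sharc_external_mflops:.0f} MFLOPS")
```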
High-Density Memory Support
In data-driven applications (e.g. imaging and radar), the DSP requires high-density memory for temporary storage of data. Usually memory access is sequential due to the correlated nature of the data.
With the addition of some external logic, the '21160 may be interfaced to low-cost bulk DRAM with one or two 15 ns wait states. The 'C67x, on the other hand, supports a glueless connection to SDRAM. As with SSRAM, there is a pipeline latency of three cycles, but sequential accesses take two 6 ns clock cycles. Paging and refresh delays also need to be considered, as these will result in non-deterministic delays of ten cycles or more. In spite of this, SDRAM clearly has an access advantage over DRAM when making sequential accesses to large sets of data.
If multiprocessing is necessary to meet either the real-time demands of the application or high I/O rates, DSP system performance becomes more relevant than device features. System performance considers algorithm and data distribution in addition to inter-processor communication capability. This section explores the performance that can be expected using multiprocessor configurations of both devices.
Data Storage and Distribution
Here we consider the movement of data between algorithms in a multiprocessing system.
Whether using a '21160 or 'C67x platform, it is good practice to decouple the flow of data from the actual processing algorithms. This can be done using DMA co-processors to manage data flow between sub-systems by transferring large blocks between intermediate buffers. This is particularly important for the 'C67x, where optimized inner loops running on the DSP cannot be interrupted to service I/O or manage data. By decoupling data structures, these software pipelines will be allowed to run to completion, ensuring peak performance. Of course, if extremely low latency is a requirement, 'C67x loops must be unrolled at the expense of code size. Even then, the memory pipeline of the processor results in a latency when switching tasks (an 11-instruction latency to flush the pipeline and vector to the new address). In contrast, inner loops on the '21160 processor are interruptible, making it easier to balance low-latency I/O performance with optimum CPU performance.
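One common way to decouple data movement from processing is ping-pong (double) buffering. The sketch below is a plain-Python simulation of the idea, not vendor code: the "DMA" step and the kernel run in turn here, whereas on real hardware the DMA transfer would overlap the kernel. The names `process_stream` and `sum_of_squares` are hypothetical.

```python
def process_stream(blocks, kernel):
    """Ping-pong buffering: while the kernel works on one buffer, the
    (simulated) DMA engine fills the other. Both steps run in turn here;
    on real hardware the DMA transfer would overlap the kernel."""
    buffers = [None, None]
    results = []
    source = iter(blocks)
    buffers[0] = next(source, None)          # prime the first buffer
    active = 0
    while buffers[active] is not None:
        idle = 1 - active
        buffers[idle] = next(source, None)   # "DMA" fills the idle buffer
        results.append(kernel(buffers[active]))  # inner loop runs to completion
        active = idle                        # swap buffer roles
    return results

def sum_of_squares(block):
    """Stand-in for an optimized inner loop."""
    return sum(x * x for x in block)

print(process_stream([[1, 2], [3, 4], [5]], sum_of_squares))
```

Because the kernel only ever sees a complete buffer, it never needs to be interrupted mid-loop to service I/O; the block size sets the trade-off between throughput and latency.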
When it comes to distributing data around a multiprocessor system, the '21160 supports this directly through both Link Ports and broadcast capabilities of the multiprocessor cluster architecture. The 'C67x relies on the DSP board architecture to provide a flexible communication system with external DMA facilities to move data between DSP sub-systems.
Algorithm Distribution
If the software developer is used to mapping algorithms directly to standard nodal topologies as a method of distributing the algorithm (e.g. mesh or hyper-cube), the '21160 probably remains the processor of choice as it supports these physical topologies through Link Port connections. However, if a 'C67x platform is selected with a DSP RTOS that supports a virtual network between tasks, the standard topologies can still be implemented in abstraction from the hardware layer.
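As an illustration of mapping a standard topology onto link ports, the sketch below computes hypercube neighbours by flipping one bit of the node number per dimension. It assumes one link port per dimension, so the six link ports listed in the feature table would cap the cube at six dimensions (64 nodes); the function name is illustrative.

```python
LINK_PORTS = 6  # link ports per '21160, from the feature table

def hypercube_neighbours(node, dimensions):
    """Return the nodes directly connected to `node` in a hypercube.
    Assumption: one link port is dedicated to each dimension."""
    if dimensions > LINK_PORTS:
        raise ValueError("more dimensions than available link ports")
    if not 0 <= node < 2 ** dimensions:
        raise ValueError("node number out of range")
    # Neighbours differ from `node` in exactly one address bit.
    return [node ^ (1 << d) for d in range(dimensions)]

print(hypercube_neighbours(5, 3))   # neighbours of node 5 in a 3-cube
```

On a platform without link ports, the same neighbour table could instead drive a virtual network provided by an RTOS.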
Loading Code from External Memory
If algorithms are run from internal memory, it is easy to predict the data I/O throughput for both processors. If algorithms must be loaded from external memory, a more careful analysis may be required.
The '21160 supports synchronous operation, burst transfers and asynchronous external memory. Exact throughput and access latency depend on the interface used. In general, code is transferred to internal memory under DMA control, unpacked and run from internal memory rather than being executed directly from external memory. The 48-bit wide instructions may be stored in packed format in 64-bit wide memory. This means that four instructions are loaded every three 66 MHz clock cycles (a load rate of 88 million instructions per second).
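The packing arithmetic can be sketched as follows. The load-rate calculation follows directly from the figures in the text; the `pack` and `unpack` helpers use a simple back-to-back, least-significant-first bit layout that is an assumption for illustration and may not match the actual '21160 packing order.

```python
from math import gcd

INSTRUCTION_BITS = 48   # '21160 instruction width
MEMORY_BITS = 64        # external memory word width
BUS_CLOCK_MHZ = 66      # external bus clock

# Smallest group in which instructions and memory words line up (192 bits).
group_bits = INSTRUCTION_BITS * MEMORY_BITS // gcd(INSTRUCTION_BITS, MEMORY_BITS)
instructions_per_group = group_bits // INSTRUCTION_BITS
words_per_group = group_bits // MEMORY_BITS
load_rate_mips = BUS_CLOCK_MHZ * instructions_per_group / words_per_group

def pack(instructions):
    """Pack 48-bit instructions back-to-back into 64-bit memory words
    (hypothetical layout: first instruction in the least significant bits)."""
    stream = 0
    for i, opcode in enumerate(instructions):
        stream |= (opcode & (2 ** 48 - 1)) << (i * INSTRUCTION_BITS)
    n_words = -(-len(instructions) * INSTRUCTION_BITS // MEMORY_BITS)  # ceiling
    return [(stream >> (j * MEMORY_BITS)) & (2 ** 64 - 1) for j in range(n_words)]

def unpack(words, count):
    """Recover `count` 48-bit instructions from packed 64-bit words."""
    stream = 0
    for j, word in enumerate(words):
        stream |= word << (j * MEMORY_BITS)
    return [(stream >> (i * INSTRUCTION_BITS)) & (2 ** 48 - 1) for i in range(count)]

code = [0x111111111111, 0x222222222222, 0x333333333333, 0x444444444444]
packed = pack(code)
print(len(packed), "memory words for", len(code), "instructions")
print(f"Load rate: {load_rate_mips:.0f} million instructions per second")
```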
If algorithms are run from SBSRAM on the 'C67x, code is burst into internal memory at 667 MB/s (assuming a 167 MHz memory bus), with a three-cycle initial latency to fill the pipeline of the external memory. As each fetch packet comprises eight 32-bit wide instructions, execute packets are loaded at between 21 and 167 million packets per second, depending on the number of arithmetic units being targeted. If these code accesses are interleaved with SDRAM data accesses, for example, prediction becomes complex due to paging and refresh cycle latencies, and performance is poor. As is the case with the '21160, code is generally not executed from external memory due to performance degradation. For large algorithms, it is more efficient to run the processor with the cache enabled, allowing execution from internal memory.
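The range of 21 to 167 million execute packets per second can be reproduced with a short calculation. It assumes one 32-bit instruction arrives per 167 MHz bus cycle and ignores the initial three-cycle latency; the function name is illustrative.

```python
BUS_CLOCK_MHZ = 167   # external memory bus clock

def execute_packets_per_second_m(instructions_per_packet):
    """Millions of execute packets loaded per second, assuming one
    32-bit instruction per bus cycle and no pipeline-fill penalty."""
    return BUS_CLOCK_MHZ / instructions_per_packet

# Eight parallel instructions (all units busy) down to one per packet.
for n in (8, 4, 2, 1):
    rate = execute_packets_per_second_m(n)
    print(f"{n} instruction(s) per execute packet: {rate:6.1f} M packets/s")
```

The more functional units each execute packet keeps busy, the fewer packets per second the external bus can deliver, which is why heavily parallel code suffers most when run from external memory.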
Inter-Processor Messaging
The efficient passing of semaphores and low-latency messages is integral to any multiprocessing system. The '21160 supports these through a multiprocessor memory space within a cluster for broadcasting of messages, and Link Port connections between clusters and DSP boards for point-to-point connections. The 'C67x relies on external resources provided on the DSP board. For example, Spectrum includes DPRAM and QPRAM in the dual and quad 'C67x implementations of the 'C6x architecture. This memory connects directly to the external bus of all the processors and provides a low-latency path between subsystems.
Of course, interrupts provide the lowest latency mechanism for inter-processor signaling and synchronization. Whether a '21160 or 'C67x is selected, ensure that the DSP carrier board supports inter-processor interrupts.
Single Processor as Target
It is safe to say that if a single DSP is the target of all input data, system considerations are similar whether you select a 'C67x or a '21160 processor as the DSP.
Assuming that the application can run from internal memory, a 'C67x may have a performance edge over the '21160 in managing a single high-bandwidth stream (668 MBytes/s vs. 528 MBytes/s external port throughput). However, it could be argued that the larger internal memory capacity of the '21160 will negate this advantage due to its capacity to handle larger data blocks.
Assuming the data can be made available on Link Ports, the '21160 is more effective at managing multiple medium-to-high bandwidth channels using its DMA resources.
In applications where I/O data transfers from the external port to local DSP memory are interleaved with processor data accesses (local memory to internal registers), there is a trade-off between data block size and real-time response no matter which DSP is selected.
Multiprocessor Systems as Targets for High-Bandwidth Data
If the processing requirement of the application is excessive (due to high bandwidth), a multiprocessor solution is required no matter which technology is selected. This section investigates these high-end applications.
Due to its inherent multiprocessor support, the '21160 is well suited to these applications. The network can easily be scaled to suit the I/O processing requirements. The availability of off-board Link Port connections makes scaling just as easy across multiple DSP boards as it is across DSPs on a single board. Additional features such as the capability to broadcast data throughout a cluster make distribution of the input data easy.
The 'C67x, unlike its floating-point predecessor, the TMS320C40, has no native multiprocessing support. It has been left to the DSP board vendors to devise effective methods of inter-processor communication. Spectrum's 'C67x architecture is an example of this, using a specialized ASIC to bridge each DSP to a common PCI backbone. This allows for a distributed memory architecture in which each DSP can pump data to the local memory of any other DSP on the same board. However, it is more difficult to distribute the data across multiple boards. It is considered poor practice to use the system bus (VME, PCI, CompactPCI, VXI, etc.) for high-bandwidth data, and consequently a number of I/O buses (e.g. FPDP and Raceway) support multiple slave DSP boards networked to an I/O master. The I/O-bus to DSP carrier board connection is often implemented using open-standard interconnects, e.g. PMC modules. If tighter coupling between I/O and DSPs is required, this may be achieved by connecting the I/O directly to the external memory bus via a local mezzanine, e.g. a Processor Expansion Module (PEM).
Whether a 'C67x or a '21160 is selected, there are numerous DSP network topologies available to support the I/O data flow requirements of most applications. For example, Spectrum's 'C67x architectures support PMC-based I/O streamed to all of the local processing nodes and on the SHARC platforms, I/O can be easily distributed using Link Ports. This allows one to stream incoming data between two or more processors to distribute the processing load. In both cases, the throughput is limited by the performance of the local PCI bus rather than any DSP capabilities.
Multiprocessor Systems as Targets for Low-Bandwidth Data
There are two instances where applications require multi-DSP configurations with low data rates. Firstly, there are applications with computationally intensive algorithms where the processing load exceeds the capability of a single processor even though the I/O bandwidth is low. Secondly, in applications with multiple I/O channels, it is often convenient to distribute the I/O processing across a network of DSPs.
In the first instance, the 'C67x may offer a better solution as it will reduce the number of DSPs required in the system due to the higher CPU core performance. In the second instance, with limited channel count, either a 'C67x or '21160 may be appropriate. Once the channel count demands a multi-board solution, the '21160 may be preferable due to its inherent support for inter-board communications through Link Ports.
Finally, both the '21160 and the 'C67x support two TDM serial ports. Most DSP systems vendors make these available to the user for direct connection to their I/O circuits.
Multitasking Support
It is easy to conceive of mapping multiple tasks to multiple DSPs in a '21160 network, especially if we consider a single task per processor. In simple pipelines or array processing applications, the '21160 may be the processor of choice due to its support for separate tasks or algorithms at each node of a multi-dimensional array. However, many 'C67x (and '21160) applications may require multiple tasks multiplexed onto each DSP. In such cases, a DSP-based RTOS provides the developer with a scheduling kernel to simplify development. Some of these RTOS kernels (e.g. 3L's Diamond) have very low overhead and provide other features (e.g. inter-task communications independent of the underlying hardware). The 'C67x processor will likely be the target of multi-instance applications (e.g. modems). Once again, development can be simplified through the selection of an appropriate RTOS to manage context switches and multiple data streams.
Diamond and others (e.g. Eonic's Virtuoso) are likely to be supported on both platforms.
Third-Party Library Support
The availability of optimized function libraries (e.g. Imaging, Math and Signal Processing) allows developers to concentrate on their own applications rather than on time-consuming hand coding of commonly used building blocks. It usually takes a year before third-party library support for any DSP processor is available, and the 'C67x will probably be no exception. Texas Instruments maintains an up-to-date web site with free code examples; this is a useful resource for new 'C67x developers. The '21160, being code compatible with the ADSP-2106x SHARC product range, already has library support from companies like Wideband Computers. Libraries that are optimized to take advantage of the SIMD instruction set of the '21160 will probably follow shortly.
Development Environments
Both Texas Instruments and Analog Devices supply a solid suite of DSP development tools. If there is any difference, it is in the way that TI's 'C67x tools focus on code optimization while the strength of ADI's Visual-DSP lies in its multiprocessor support.
For example, the natural development methodology using TI Tools is as follows:
- Develop the application in C
- Write inner loops in linear assembly language
- Use the assembly optimizer to take full advantage of the VLIW architecture.
The assembly optimizer is the key to assisting the developer in gaining maximum code performance.
When developing code for the '21160, hand optimization will be required to take full advantage of the SIMD instruction set. The management of multiple tasks on different processors within a cluster is complex. However, Visual-DSP simplifies C code development in this multiprocessing environment through a sophisticated linker that supports shared memory and multiprocessor linking. Additionally, flexible overlay support allows the development of code that can be moved between overlay and non-overlay memory without rework.
It would be easy if we could classify the 'C67x or the SHARC according to application (e.g. DSP X works for Sonar and DSP Y is best for Medical Imaging). Unfortunately this is seldom possible.
Let us take sonar as an example. Within sonar, we may find a simple replica correlation application running on a DSP connected to a single hydrophone and an alarm. A towed-array sonar system, on the other hand, may have a few hundred sonar pods feeding into a meshed array of DSPs running multiple beamforming algorithms. In the first instance, the best solution may be one or two 'C67x processors, while multiple '21160s may be more appropriate for the sonar array-processing application.
Floating-point DSPs are also selected as development platforms for fixed-point applications. This is due to the ease of coding during the proof-of-concept phase. In this light, both processors may be used as the springboard for any fixed-point application; as with floating-point, it is impossible to generalize here.
In conclusion, it is more appropriate to select the DSP platform according to the multiprocessing, I/O and support requirements discussed above than attempt to classify the applications.
This paper has investigated various aspects of single and multiprocessor implementations using both the 'C67x and '21160 processors. While the 'C67x appears to have a performance edge in single processor implementations, the '21160 may gain the upper hand in multiprocessing applications. The most suitable platform depends on data flow, memory requirements, array topology and algorithm characteristics.



