As DSP chip performance reaches monster proportions and board makers pack multiples of these processors onto their products, there's a call for DSP board architectures that can keep the critters suitably fed and effectively synchronized. Recent days have seen stepped-up activity to turn the megaflops on a board into useful work.
"One thing about board-level design is that you're always fighting a battle to add more features in the square inches you've got," said Rob Shadduck, executive vice president of Blue Wave Systems Inc. (Carrollton, Texas). At bottom, he said, "All features will boil down to how well you maximize data flow around the board."
Mercury Computer Systems (Chelmsford, Mass.) made a push late last year to up the throughput for boards compatible with its ANSI-standard Raceway interconnect to make it "a bigger and better topology," said Barry Eisenstein, vice president of advanced technology at Mercury. The second generation of the interconnect, dubbed Race++, represents "a managed evolution," he said, which increases not just performance but also connectivity and scalability.
Race++ boosts the frequency of the interconnect from 40 to 66 MHz and increases the number of ports provided in one crossbar ASIC from six to eight. Point-to-point capabilities rise from 160 to 264 Mbytes/second, while aggregate bandwidth goes from 480 Mbytes/s to 1 Gbyte/s. And while Race can scale to handle about 1,000 processors, Race++ takes on upwards of 4,000.
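The quoted Race and Race++ figures can be reproduced with simple arithmetic. The sketch below assumes a 4-byte-wide data path per port and assumes that aggregate bandwidth means every port on one crossbar is paired with one other port. Both assumptions are inferred from the numbers above rather than stated by Mercury, and the function names are illustrative.

```python
# Back-of-the-envelope check of the Race and Race++ figures quoted above.
# Assumptions (not stated in the article): each port moves 4 bytes per
# clock, and "aggregate" means ports // 2 simultaneous port-to-port paths.

def port_bandwidth_mbytes(clock_mhz, path_bytes=4):
    """Peak point-to-point rate of one crossbar port, in Mbytes/s."""
    return clock_mhz * path_bytes

def crossbar_aggregate_mbytes(clock_mhz, ports, path_bytes=4):
    """Peak aggregate rate when the ports are paired off, in Mbytes/s."""
    return (ports // 2) * port_bandwidth_mbytes(clock_mhz, path_bytes)

# (point-to-point, aggregate) for each generation
race = (port_bandwidth_mbytes(40), crossbar_aggregate_mbytes(40, 6))
race_pp = (port_bandwidth_mbytes(66), crossbar_aggregate_mbytes(66, 8))

print("Race:  ", race)
print("Race++:", race_pp)
```

Under those assumptions the Race++ aggregate comes out slightly above 1,000 Mbytes/s, in line with the rounded 1-Gbyte/s figure.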
Among the more finicky details of the architectural enhancements, Race provides the hardware for two nodes to implement adaptive routing for finding the least-congested paths from one point to another. Race++ supports adaptive routing for all nodes.
Further, Eisenstein said, while sources and destinations on a Race fabric may be CPUs, DSPs, specialty processors, local memory, shared memory buffers and various bus address spaces, Race++ "extends the concept to encompass the crossbar ASIC itself." The benefit, he said, "is that it adds handles for users to control adaptive routing and new ways to reconfigure routing in real-time."
Optimizing data flow around and between DSP boards, of course, requires attention to memory and I/O architectures and interprocessor communications. According to a spokesman for Alacron (Nashua, N.H.), the efficiency of a design based on multiple Sharc DSPs from Analog Devices Inc. "is critically dependent upon system architecture, and architectures commonly employed in multiprocessor Sharc designs do not take full advantage of the Sharc's capabilities." To perform efficiently, he said, "a system's architecture must support the parallel-programming model that is best suited to a given application, and provide sufficient memory bandwidth to permit the Sharc processors to function as close to optimally as possible."
The company recently committed to applying its dual-ported local-memory architecture to Analog Devices' latest 200-MHz ADSP-21160 Sharc DSPs. "These new processors will provide developers with up to five times the performance per processor of earlier models," said the spokesman.
Alacron's memory architecture supports single- and multiple-instruction-set programming models and provides all Sharc processors with "full memory bandwidth," said the spokesman, "without contention for global memory." The result is "full scalability for applications, whether the design calls for two or eight Sharcs or 24, without the use of exotic memory architectures."
Historically, multiple Sharcs and C40 DSPs from Texas Instruments Inc. have been sewn together with those chips' native ports: "link" ports and "comm" ports, respectively. Late last year, for example, Analogic Corp. (Wakefield, Mass.) launched a 3U CompactPCI board, the CPCI-DSP-II, with two Sharcs linked to each other and the outside world by 40-Mbyte/s link ports. "Top performance can be achieved by establishing a direct data interface through each board's DSP link port," said a company spokesman.
For high-end applications using many DSPs, however, the native ports are fast becoming inadequate, according to some sources. "Those paths are not nowadays of sufficient bandwidth for high-end applications," said Blue Wave's Shadduck. "It's a rough rule of thumb that if you've got 1 Gflops of processing power, you're probably going to need to move data around on the order of 100 to 200 Mbytes/s, and those links are not up to those rates."
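Shadduck's rule of thumb can be turned into a rough sizing check. The sketch below assumes the data-movement requirement scales linearly with processing power and uses the 40-Mbyte/s link-port rate quoted for the Analogic board. The function names are illustrative, and the result is a best case that ignores protocol overhead and load imbalance.

```python
import math

# Rough sizing check based on the rule of thumb quoted above:
# about 100 to 200 Mbytes/s of data movement per Gflops of processing.
# Assumption: the requirement scales linearly with Gflops.

LINK_PORT_MBYTES = 40  # per-link rate quoted for the Sharc link ports

def required_bandwidth_mbytes(gflops, low=100, high=200):
    """Return the (low, high) data-movement estimate in Mbytes/s."""
    return (gflops * low, gflops * high)

def links_needed(gflops, link_mbytes=LINK_PORT_MBYTES):
    """Number of fully saturated links needed to cover the estimate."""
    low, high = required_bandwidth_mbytes(gflops)
    return (math.ceil(low / link_mbytes), math.ceil(high / link_mbytes))

print(required_bandwidth_mbytes(1))
print(links_needed(1))
```

Even in this idealized case, a 1-Gflops system would have to keep three to five link ports running flat out, which is the gap Shadduck is pointing to.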
Late last year, however, BittWare Research Systems (Concord, N.H.) launched a board up in the 720-Mflops range that relies on a novel use of Sharc link ports. The Goblin, a 6U CompactPCI board with up to seven ADSP-2106x Sharcs, has an asymmetrical multiprocessing model and a highly scalable hypercube-like architecture. The scheme boasts a 560-Mbyte/s memory bandwidth and starts at $8,495.
The Goblin designates one master DSP and up to six slaves, said Gordon Leeuwrik, vice president of engineering, with each slave DSP linked to the master and to its three "nearest neighbor" slaves. While the slaves gang up on the processing tasks at hand, the master handles the bus interface and assorted housekeeping chores.
The nearest-neighbor configuration allows any DSP to talk to any other DSP on the board "in two hops maximum," Leeuwrik said, or in four hops between DSPs in a multiboard configuration.
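The two-hop claim can be checked with a breadth-first search over a model of the board. The article does not give the exact slave-to-slave wiring, so the sketch below assumes one plausible arrangement: six slaves wired as a triangular prism, which gives each slave exactly three slave neighbors, plus a link from every slave to the master. The node numbering and the SLAVE_LINKS table are hypothetical.

```python
from collections import deque

# Hypothetical model of a one-master, six-slave board.
# Assumption: slaves are wired as a triangular prism (two triangles
# joined by three rungs); the article does not specify the wiring.
MASTER = 0
SLAVES = range(1, 7)
SLAVE_LINKS = [
    (1, 2), (2, 3), (3, 1),   # first triangle
    (4, 5), (5, 6), (6, 4),   # second triangle
    (1, 4), (2, 5), (3, 6),   # rungs joining the triangles
]

def build_links():
    """Return an adjacency map: node -> set of directly linked nodes."""
    links = {node: set() for node in [MASTER, *SLAVES]}
    for slave in SLAVES:
        links[MASTER].add(slave)
        links[slave].add(MASTER)
    for a, b in SLAVE_LINKS:
        links[a].add(b)
        links[b].add(a)
    return links

def hops_from(links, start):
    """Breadth-first search: fewest hops from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in links[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def max_hops(links):
    """Worst-case hop count between any two nodes on the board."""
    return max(max(hops_from(links, node).values()) for node in links)

links = build_links()
worst_case = max_hops(links)
print("worst-case hops:", worst_case)
```

In this model the master uses six links and each slave uses four, and no pair of DSPs is more than two hops apart.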
Elsewhere in the Sharc world, Spectrum Signal Processing (Burnaby, B.C.) provided a ringing endorsement of the PMC (PCI Mezzanine Card) last month, announcing a PMC board called Aida with four Sharc 21060 DSPs in residence. It starts at $6,050 and boasts "up to 332-Mbyte/s off-board communications."
Graeme Harfman, the Sharc product-line manager, said Spectrum is pitching the board in three ways: as a DSP coprocessor for a general-purpose single-board computer; as a vehicle for expanding the link ports on a system DSP board; and as "a low-latency, deterministic data-communications gateway" from an SBC to a DSP subsystem.
"Traditionally, parallel-processing developers relied on backplane communications and were faced with a relatively slow bandwidth between their run-time single-board computer and their multiboard Sharc system," Harfman said.
Over in TI land, the current-generation C6x DSPs "have no comm ports," said Blue Wave's Shadduck, "so you need to provide another means of throwing data at the processors, some high-speed data path that requires minimal processor intervention." The route taken by Blue Wave for a quad-DSP VME board introduced earlier this year is a crossbar switch.
The crossbar on the Model VME/C6420 board links up to four TMS320C6201 fixed-point DSPs or TMS320C6701 floating-point DSPs, as well as the board's VMEbus interface, P2 interface and on-board PMC site. It allows up to five 200-Mbyte/s transfers to occur simultaneously: four on board, one between boards.
The crossbar is dynamically reconfigurable and all of the routing is done in hardware, Shadduck said. The result, he said, is "an immense saving in software overhead." The board uses a Motorola MPC860 as a supervisory processor. The starting price is $14,000.
Ixthos Inc. (Leesburg, Va.), a wholly owned subsidiary of DY 4 Systems Inc. (Kanata, Ontario), launched a new DSP board architecture early this year called Champ, starting out with a four-DSP model. First to roll was the Champ-C6, which has either four 167-MHz TMS320C6701s or four 200-MHz TMS320C6201s, delivering 4 Gflops or 6,400 Mips. Prices start at $15,900.
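The Champ-C6 peak ratings follow from per-chip figures. The sketch below assumes the commonly quoted peaks of six floating-point operations per cycle for the C6701 and eight instructions per cycle for the C6201. Those per-cycle values are assumptions not stated in the article, and sustained performance will be lower.

```python
# Peak-rate arithmetic for a four-DSP board.
# Assumptions (not from the article): the C6701 peaks at 6 floating-point
# operations per cycle and the C6201 at 8 instructions per cycle.

def board_peak(dsps, clock_mhz, ops_per_cycle):
    """Peak operations per second for the board, in millions."""
    return dsps * clock_mhz * ops_per_cycle

c6701_mflops = board_peak(dsps=4, clock_mhz=167, ops_per_cycle=6)
c6201_mips = board_peak(dsps=4, clock_mhz=200, ops_per_cycle=8)

print("4 x C6701:", c6701_mflops, "Mflops")
print("4 x C6201:", c6201_mips, "Mips")
```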
Champ (more properly, the Common Heterogeneous Architecture for Multiprocessing) relies on a 66-MHz, 64-bit PCI backbone bus and the company's own IXStar DSP/PCI interface ASIC. "The Champ architecture organizes the processing resource as two clusters of DSP processor elements," said Mark Alexander, product-support manager at Ixthos.
Both processors in a cluster have their own 32-bit PCI path to a common IXStar, and that ASIC gives each cluster dedicated access to a 66-MHz, 64-bit PMC expansion site. Clusters are isolated from each other and from the board's supervisory PowerPC microprocessor by PCI/PCI bridges, which "allow concurrent data movement and make each processing cluster's I/O operations independent of other on-board data-transfer operations," Alexander said. "The internal PCI bus allows up to three independent data paths, each with peak transfer rates of 528 Mbytes/s."
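The 528-Mbyte/s figure is the standard peak for a 64-bit bus clocked at 66 MHz, as the short calculation below shows. The combined figure for three paths is an upper bound that assumes all three run at peak simultaneously, which the article does not claim. The function name is illustrative.

```python
# Peak transfer rate of a parallel bus: clock rate times width in bytes.

def bus_peak_mbytes(clock_mhz, width_bits):
    """Peak bus rate in Mbytes/s, assuming one transfer per clock."""
    return clock_mhz * width_bits // 8

per_path = bus_peak_mbytes(66, 64)   # one 66-MHz, 64-bit PCI path
three_paths = 3 * per_path           # upper bound for three paths at once

print("per path:", per_path, "Mbytes/s")
print("three paths:", three_paths, "Mbytes/s")
```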
Pentek Inc. (Upper Saddle River, N.J.), on the other hand, eschewed PMC for the I/O on its Model 429x, a quad-C6201 or quad-C6701 VMEbus board announced late last year. Instead it tapped a homegrown I/O scheme called VIM (VelociTI Interface Modules), which gives every processor its own dedicated mezzanine site. (VelociTI is TI's VLIW DSP architecture, implemented in the C6x family.)
Pentek already supports three to four different mezzanine buses for I/O, but a new bus was absolutely required to keep up with the DSPs, said company vice president Rodger Hosking. The VIM scheme is "tailored for maximum I/O throughput and fully buffered with synchronous FIFOs to minimize processor overhead," he said. It provides 400 Mbytes/s of I/O bandwidth for each DSP on board, as well as two buffered 50-Mbit/s serial ports and a 32-bit data-address port for status and control. Aggregate I/O bandwidth on the four-DSP board is 1.6 Gbytes/s. A CompactPCI follow-on is expected.
The company has been steadily expanding the slate of VIM modules to fill the expansion sites on the Model 429x. It entered the new year with three choices in its stable: a narrowband digital receiver, an A/D converter and a C40 comm-port adapter. At the start of this year, Pentek introduced its fourth, fifth and sixth VIM modules, and it hopes to have an even dozen available by year's end. The three new modules are the $2,000 Model 6220 Raceway interface, the $995 Model 6226 FPDP (front-panel data port) adapter and the Model 6216, a $4,000 multifunction VIM module containing two wideband digital down converters, two amplifiers and two 65-MHz, 12-bit A/D converters.
Both Race and FPDP "provide direct board-to-board connection between processor boards, high-speed peripherals and fast memory boards," Hosking said, but he noted that there are differences between these two ANSI standards.
"Race is a high-speed synchronous backplane fabric that provides multiple, simultaneous 160-Mbyte/s data channels between boards," he said. "For example, in a 12-board system, an aggregate transfer rate of up to 1 Gbyte/s is achieved."
As for FPDP, "it's a front-panel ribbon-cable interface capable of delivering data at sustained rates of 160 Mbytes/s. For example, in a four-board system, an aggregate transfer rate of 1.28 Gbytes/s is achieved." The FPDP is "particularly useful for implementing pipelined, star or mesh data-flow schemes across multiple C6x boards," said Hosking.
The Model 6216 combo VIM board is squarely aimed at software radio applications, serving as "a complete front end" on a DSP board, according to Hosking. It "allows system engineers to perform real-time DSP processing of 25-MHz digital-receiver signals for the first time."
With two of these boards on a DSP baseboard, "four complete software radio receiver channels including the digitizer, digital receiver and DSP functions can all be handled in a single VME slot," he said, "dramatically reducing system space, power and cost."
Elsewhere in TI-based DSP boards, Hunt Engineering (U.K.) Ltd. announced a modular C6x packaging and communications scheme late last year called Heron (Hunt Engineering Resource Node) that it hopes will repeat the success of TI's modular TIM-40 scheme for C4x-era DSPs. In the United States, Traquair Data Systems Inc. (Ithaca, N.Y.) has adopted Heron for its own products.
As with TIM-40, Heron mezzanine modules may contain a DSP and memory, some type of I/O resource or a combination of the two, with the number of modules that can fit on a carrier board depending on the carrier's form factor. Traquair's first launch for the architecture was the HEPC8, a $3,400 PCI carrier board that accommodates four Heron modules. The company also plans to introduce VMEbus and CompactPCI carrier boards this year, as well as several Heron modules, including a digital-camera interface and a multichannel A/D board.
The HEPC8 contains a local implementation of Heron's communications scheme, dubbed Heart, which has the benefits of high bandwidth, low latency, determinism and flexibility, according to Traquair president Steve Bradshaw. The scheme supports multiple simultaneous transfers at 400 Mbytes/s each, with links for point-to-point, multicast and broadcast operations dynamically established by software. A follow-on PCI carrier board with a global Heart implementation, the HEPC9, will be introduced sometime this year.