Design Article
The Give and Take of Designing RISC/DSP Dual-Core SoCs
Jack Shandle
7/31/2002 12:00 AM EDT
As streaming media and gigabit networking applications become commonplace, more SoC design teams are integrating RISC and DSP processor cores on the same ASIC. This is a daunting task to be sure, but much can be learned from a review of pioneering applications such as cell phones and advanced home-entertainment systems.
The leading-edge teams who have migrated designs from 2G to 2.5G to 3G cell phones, for example, have made architectural changes to accommodate new data types such as music, graphics and simple video. They have also added conventional computing power to run what is essentially a PC in a cell phone. At an even higher level of complexity, multiple high-bandwidth streaming media channels for home entertainment center applications are driving profound changes in both architecture and operating systems.
A typical, and very important, high-level design decision for a RISC/DSP SoC is partitioning the system tasks between the DSP and the RISC host. The fundamental rule is that the RISC core handles control and the DSP executes specific algorithms. Some exceptions will be mentioned later in this article.
As a starting point, let's look at the components in a dual-core SoC.
From a hardware perspective, parallel execution implies multiple, independent execution units and buses. TI's TMS320C62xx, for example, has eight independent execution units, which allow the processor to issue up to eight instructions per clock cycle, all encoded in a single long instruction that describes a single operation.
VLIW DSP cores typically use 32-bit wide instruction words, double the size of those in conventional DSPs. This allows designers to use larger register sets to enhance performance. The wider word is necessary, however, because information about which unit will execute the instructions must be included in the instruction word.
A downside of the long instruction word, however, is high program memory usage, which translates into additional cost for additional RAM or ROM. Power consumption is also high relative to conventional DSPs.
The most common technique used to implement VLIW DSP processing in hardware is called SIMD (single instruction/multiple data), which allows parallel execution of different data using the same instruction. SIMD hardware units vary from vendor to vendor and even within vendors. The design choice mostly involves how the different types of hardware units (MACs, ALUs, and shifters) are grouped.
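As a rough illustration of the SIMD idea, the sketch below treats one 32-bit word as two independent 16-bit lanes and applies a single "add" to both lanes at once. The function names and the two-lane layout are assumptions made for this example; they do not describe any particular vendor's hardware units.

```python
MASK16 = 0xFFFF  # one 16-bit lane


def pack16x2(hi: int, lo: int) -> int:
    """Pack two 16-bit values into one 32-bit word (hypothetical lane layout)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16x2(word: int) -> tuple:
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return ((word >> 16) & MASK16, word & MASK16)


def simd_add16x2(a: int, b: int) -> int:
    """One 'instruction', two data items: add lane by lane with wraparound.

    A carry out of the low lane is discarded rather than propagated into
    the high lane, which is what keeps the two additions independent.
    """
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo
```

Real SIMD units differ in lane width, lane count, and whether they saturate or wrap on overflow, so this is only a model of the principle.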
VLIW DSP cores impact SoCs in several ways including: high-data-rate, interrupt-sensitive streaming media that is processed in real time; 32-bit instruction words; sophisticated parallel-execution hardware architectures; deep instruction and data memories; longer arbitration delays; and difficult applications programming.
Typically under the host control are I/O blocks such as UARTs, USB cores, and Firewire cores as well as functions such as memory management, debugging, expansion-bus cores, and application-specific cores such as rendering engines. A 32-bit core that clocks at 150 MHz or greater and has architectural features such as multiple pipelines and sizeable cache memories would typically be chosen to share CPU duties with the DSP on most SoCs.
On the software side, compatibility with existing instruction-set architectures is important and so is support for a wide range of real-time operating systems including OSE Systems' OSECK or Wind River Systems' VxWorks, as well as other operating systems such as Linux and Microsoft Windows CE if the end-user application requires it.
The obvious solution of specifying a real-time operating system is not enough. The parallel-processing architectures of VLIW DSPs also demand that the OS simultaneously execute multiple algorithms, schedule the tasks, and communicate with the RISC host CPU, all on an event-driven basis.
Inter-process communication requires a multi-threaded RTOS. Communication with the host requires a link handler that can create a logical channel between processes running on different processors and processor types. More important, message exchange between processes must follow the same protocols regardless of whether the communicating processes are located on the same processor, on multiple instances of the same type of processor, or on different types of processors.
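The location-transparency requirement can be sketched in a few lines. In the model below, `Processor`, `LinkHandler`, and the process names are all hypothetical; no real RTOS API is being shown. The point is only that the sender uses the same `send` call whether the destination process lives on its own core or on the other one.

```python
from collections import deque


class Processor:
    """Hypothetical core that owns one mailbox per local process."""

    def __init__(self, name):
        self.name = name
        self.mailboxes = {}

    def spawn(self, process_name):
        self.mailboxes[process_name] = deque()


class LinkHandler:
    """Hypothetical link handler: routes a message by process name only."""

    def __init__(self, processors):
        self.processors = processors

    def _locate(self, process_name):
        for cpu in self.processors:
            if process_name in cpu.mailboxes:
                return cpu
        raise KeyError("unknown process: " + process_name)

    def send(self, dest, payload, sender):
        # Same call and same message format for local and remote targets.
        cpu = self._locate(dest)
        cpu.mailboxes[dest].append((sender, payload))
        return cpu.name  # returned only so the routing can be observed

    def receive(self, process_name):
        box = self._locate(process_name).mailboxes[process_name]
        return box.popleft() if box else None


risc, dsp = Processor("risc"), Processor("dsp")
risc.spawn("control")
risc.spawn("ui")
dsp.spawn("decoder")
link = LinkHandler([risc, dsp])

link.send("ui", "show status", sender="control")        # same core
link.send("decoder", "start stream", sender="control")  # other core
```

A production link handler would also have to deal with shared-memory or bus transport, byte ordering between processor types, and failure of the remote core, none of which is modeled here.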
For many applications, these requirements add up to an RTOS with a standard API, standardized inter-process communication and debugging environments, and an extensive portfolio of third-party products such as IP stacks. Table 1 shows a detailed list of RTOS enhancements required for dual-core RISC/DSP communication.
Table 1: List of RTOS enhancements required for dual-core RISC/DSP communication
Two critically important high-level design steps are to assign each system task to the appropriate core and to have a full understanding of the conditions under which the cores communicate. Task partitioning varies from application to application but the basic premise is to assign intense signal processing tasks to the DSP and control and user-interface tasks such as access to external memory, storage media, and the incoming data stream to the RISC host. Encoding the incoming wire-stream data into a format such as PCM that is easily handled by the DSP core is usually best assigned to the host processor.
Communication issues can be considerably more complex but the basic methodology for addressing them is similar for most applications. Among the key considerations are communication bandwidth and each core's minimum and maximum latency. Data bandwidth is provided in the specifications for the cores and buses available for the SoC and is a fairly straightforward calculation. Latency, on the other hand, is not.
Typically, latency is application dependent and far more important than actual data-transfer time. DSP applications that involve voice, video, audio, or even graphics cannot tolerate interruptions in the data stream without annoying the user, hence this is a critical issue. There are basically three options for solving latency-induced problems: raise the priority of the communication task between the RISC and DSP cores, change the buffer transfer sizes, and reduce the priorities of some of the control tasks causing the latency.
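To make the buffer-size option concrete, here is a minimal sizing sketch. It assumes a constant-rate stream and a known worst-case gap during which the host cannot service the DSP; the function names and the example figures (48 kHz, 16-bit stereo PCM, a 5 ms worst-case latency) are illustrative assumptions, not values from any specific design.

```python
def min_buffer_bytes(stream_bytes_per_s: int, worst_latency_us: int) -> int:
    """Smallest buffer that rides out the worst-case latency without underrun.

    Integer arithmetic with ceiling division avoids floating-point rounding.
    """
    return -(-stream_bytes_per_s * worst_latency_us // 1_000_000)


def underruns(buffer_bytes: int, stream_bytes_per_s: int,
              worst_latency_us: int) -> bool:
    """True if the stream drains the buffer before the next refill arrives."""
    return buffer_bytes * 1_000_000 < stream_bytes_per_s * worst_latency_us


# Assumed example stream: 48,000 samples/s x 2 channels x 2 bytes per sample.
PCM_RATE = 48_000 * 2 * 2

needed = min_buffer_bytes(PCM_RATE, 5_000)  # 5 ms worst-case latency
```

Larger buffers tolerate longer latencies but add delay and memory cost, which is why the other two options, adjusting task priorities, are usually weighed alongside this one.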
The same cannot be said for 3G cell phones. The introduction of multimedia data types results in still higher data rates and even more intensive call processing. But the real difference, says Rayfield, is the introduction of another processor block for application processing.
This "user compute" makes sense because the functionality of a PDA or PC is being integrated into the phone. Figure 1 shows the addition of an application processor that accesses dedicated 64 Mbytes of SDRAM.
Another significant increase in 3G design complexity is the introduction of an operating system for the application processor. This could be any of a variety of PC or PDA OSs including Palm, Symbian, WinCE, Linux and, in the Asian market, iTron. Figure 1 shows the bus architecture of the additional processor block.
ARM is a strong proponent of adding a user-compute RISC core to solve the 3G challenge but its solution is not universal. Another option is to further offload the existing RISC core by adopting a DSP architecture that does not require the RISC core to issue long series of instructions to the DSP; in other words, make the DSP more autonomous.
In the ARM architecture, CPUs connect to peripherals mostly via point-to-point buses. As cell geometries decrease, says Rayfield, wiring has become almost free and overhead is trivial. This undermines most of the reasons for adopting a tristate shared bus.
Architectural innovations to handle streaming media include:
- A unified memory architecture
- Three data buses each servicing specific task domains
- A software architecture to tie it all together.
In the Nexperia platform, the MIPS core handles the RTOS, graphics and the applications software. A TriMedia core handles most of the streaming-media processing and system-wide task scheduling. As opposed to simpler applications mentioned earlier, the two CPUs are part of a single, integrated system. The CPUs share a unified memory that not only allows them to swap tasks to balance computing loads when necessary, but also provides considerable savings in memory costs.
Each processor core can address virtually any of the peripherals, but every peripheral is assigned to the task domain of one of the cores. This scheme markedly reduces overall system cost and power consumption because resources such as main memory and disk and memory interfaces are shared.
The Nexperia bus architecture consists of three task domains (Figure 2). The backbone is a point-to-point memory bus that connects external SDRAM with the SoC's peripherals for high-throughput, low-latency DMA access.
The two remaining domains are a MIPS PI (Peripheral Interconnect) bus that connects the MIPS core to peripherals in its domain and a TR32 PI bus that performs the same function for the TriMedia core. In addition, the bus architecture includes a crossover bridge joining the MIPS and TriMedia PI buses. This allows memory-mapped I/O access from each processor to control or observe the status of all the peripherals.
While the MIPS processor runs an RTOS, the TriMedia VLIW engine has its own software architecture. The TriMedia Streaming Software Architecture (TSSA) represents one strategy for handling separate datastreaming tasks. TSSA must communicate with the RTOS, of course, but its primary function is to support on-chip hardware with multimedia libraries. These libraries consist of components that perform most of the datastream processing including digitizing, processing, and rendering.
TSSA essentially configures processing components inside the TriMedia core according to instructions received from the host regarding what type of processing the engine requires. This is a departure from the usual process of executing algorithms in software. Here, dedicated signal processing engines are dynamically created, connected, configured, and destroyed depending on the type of data being processed at the moment.
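The create/connect/configure/destroy life cycle described here can be modeled roughly as follows. This is not the TSSA API; `Component`, `RECIPES`, and the toy per-sample operations are hypothetical stand-ins chosen only to show a pipeline being assembled on demand for a given data type and then torn down.

```python
class Component:
    """Hypothetical processing stage that forwards its output downstream."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.downstream = None

    def connect(self, other):
        self.downstream = other

    def push(self, sample):
        out = self.fn(sample)
        return self.downstream.push(out) if self.downstream else out


# Assumed recipes: which stages the host requests for each data type.
RECIPES = {
    "audio": [("gain", lambda x: x * 2),
              ("clip", lambda x: max(-100, min(100, x)))],
    "video": [("invert", lambda x: 255 - x)],
}


def build_pipeline(data_type):
    """Create and connect the stages for one data type; return the head."""
    stages = [Component(name, fn) for name, fn in RECIPES[data_type]]
    for upstream, downstream in zip(stages, stages[1:]):
        upstream.connect(downstream)
    return stages[0]


def destroy_pipeline(head):
    """Disconnect every stage so its resources can be reclaimed."""
    while head is not None:
        nxt = head.downstream
        head.downstream = None
        head = nxt
```

For example, `build_pipeline("audio").push(80)` applies the gain stage and then the clip stage, while a "video" request would build a different chain from the same component type.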
The Nexperia platform is an early entry in the race to the next generation of DSP/RISC SoC in which signal processing is farmed out to multiple DSP cores, each supported by dedicated hardware accelerators that are software-configured by the designer in C.
If this is the next generation architecture for mixed-core SoCs, it will mark a return to the DSP being a slave to the RISC host. Oz Levia, chief technology officer of Improv Systems, a leading advocate and supplier of configurable DSPs, contends that a loosely coupled architecture is the most efficient way to handle high-bit-rate streaming media.
According to Levia, "The DSP must autonomously process information coming in at wire speed." Improv envisions a loosely coupled interface between the DSP and the host in which the host has a library of high-level commands for the DSP such as "decode a frame," "execute a DCT," and "dump a buffer." The DSP already has a library of assembly-language commands it uses to respond to high-level commands from the host.
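A loosely coupled interface of this kind might look like the sketch below, in which the host issues only named high-level commands and the DSP maps each one to its own internal routine. The `Dsp` class, the command strings, and the toy "decode" step are assumptions for illustration; the DCT is a naive, unnormalized DCT-II written for clarity rather than speed, standing in for the optimized assembly routines a real DSP would use.

```python
import math


class Dsp:
    """Hypothetical DSP model: the host sees only high-level commands."""

    def __init__(self):
        self.buffer = []
        self.commands = {
            "decode_frame": self._decode_frame,
            "execute_dct": self._execute_dct,
            "dump_buffer": self._dump_buffer,
        }

    def handle(self, command, *args):
        """Single entry point used by the host."""
        return self.commands[command](*args)

    def _decode_frame(self, frame_bytes):
        # Stand-in decode: unsigned bytes become samples centered on zero.
        self.buffer = [b - 128 for b in frame_bytes]
        return len(self.buffer)

    def _execute_dct(self):
        # Naive unnormalized DCT-II over the current buffer.
        n = len(self.buffer)
        self.buffer = [
            sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                for i, x in enumerate(self.buffer))
            for k in range(n)
        ]
        return n

    def _dump_buffer(self):
        out, self.buffer = self.buffer, []
        return out
```

The host side then reduces to calls such as `dsp.handle("decode_frame", data)` followed by `dsp.handle("execute_dct")`, with no knowledge of how the DSP schedules or implements the work.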
This expertise can only come from the core vendors, he says, because only they know the intricacies, and eccentricities, of their products. Similarly, test strategies depend heavily on the ability of JTAG circuits, for example, to access both cores. Such a strategy requires intimate knowledge of the cores that is typically found only in the vendor's realm of expertise, Baczewski says.
If there are some organizing principles for understanding the integration issues for RISC/DSP dual-core SoCs, they are:
- The architectures are heavily application-dependent
- Data types, particularly streaming media, may require a new paradigm for power- and resource-efficient signal processing
- Bus architectures are evolving away from tristate implementations and toward point-to-point, DMA backbones supported by peripheral buses.
Contributing writer Jack Shandle is a former chief editor of both Electronic Design magazine and ChipCenter.com. He holds a BSEE degree and, over his 15-year career in technical publishing, has written hundreds of articles on all aspects of the electronics OEM industry. Jack is president of eContentWorks, a consultancy that creates high-value content for publishers, EOEM corporations, and industry associations. His email address is jshandle@earthlink.net.


