Jeffrey M. Arnold, chief engineer at Adaptive Silicon Inc. (Los Gatos, Calif.), and Neal Stollon, product manager for LSI Logic Corp.'s ZSP digital signal processor development group (Milpitas, Calif.), cooperated on the Cheshire architecture for use in digital still photography.
A hybrid system-on-chip architecture that combines digital signal processing, reduced instruction-set (RISC) processing and programmable logic can provide a powerful platform for a wide range of embedded applications in image and signal processing, communications and control. Cheshire is such a hybrid SoC platform for digital cameras. This paper describes the requirements of the application and the issues involved in partitioning it across a DSP, a RISC processor and an embedded programmable-logic core.
The processing cores used in Cheshire are the ARM9 RISC processor, the LSI Logic ZSP digital signal processor and the Adaptive Silicon MSA2500 programmable-logic core [3, 12, 8]. Programmable logic is an important component in the design toolbox since it offers the system architect the flexibility of software while retaining the performance and power advantages of dedicated hardware [2]. Large speedups over software of critical pieces of the application can be achieved by exploiting the fine-grained parallelism inherent in programmable logic, and by constructing very deep pipelines [1, 7]. In addition to accelerating portions of the application, programmable logic makes it possible to tailor a single SoC to a variety of end products by choosing the configuration to fit the product.
Reconfigurable logic allows us to partition the application in time as well. For example, in a digital camera, compression and decompression are not performed at the same time. Therefore, we can share the same programmable-logic resources to support each function, loading the compression algorithm when the camera is in record mode and the decompression algorithm when it is in playback mode. Indeed, with programmable logic that supports fast, dynamic partial reconfiguration, we can even change the logic between processing steps within the same mode.
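The time-partitioning idea can be sketched as a small model in which one programmable-logic core holds a single configuration at a time and is reloaded only when the camera mode changes. This is a minimal illustration, not the Cheshire configuration manager: the names `Mode`, `PlcCore`, `CONFIG_FOR_MODE`, `enter_mode` and the configuration labels are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    RECORD = "record"
    PLAYBACK = "playback"


# Hypothetical configuration labels; a real system would reference bitstreams.
CONFIG_FOR_MODE = {
    Mode.RECORD: "jpeg_compress",
    Mode.PLAYBACK: "jpeg_decompress",
}


class PlcCore:
    """Toy model of a programmable-logic core that holds one configuration."""

    def __init__(self):
        self.loaded = None
        self.reload_count = 0

    def load(self, config_name):
        # Reconfigure only when the requested logic is not already resident.
        if self.loaded != config_name:
            self.loaded = config_name
            self.reload_count += 1


def enter_mode(plc, mode):
    """Load the configuration that matches the camera mode and return its name."""
    plc.load(CONFIG_FOR_MODE[mode])
    return plc.loaded
```

Because compression and decompression never run concurrently, the same logic resources serve both, and the only cost in this model is one reload per mode switch.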
One application area that benefits from the hybrid architecture approach is digital still photography. When mapping an application to a hybrid architecture we must match the computational requirements of each algorithm to the capabilities of the computational components. In this application, arithmetic-intensive operations map naturally to the ZSP. Bit-level logic operations and pipelined data manipulation map well to the programmable-logic cores (PLCs). Both the ZSP and the PLCs are well-suited to operations on data streams. In addition, the programmable logic achieves the highest degree of parallelism, and hence the greatest efficiency, when operating on very deep pipelines. The overall system communication, control and user-interface functions are best suited to the ARM9 processor.
With these principles in mind, we examine several of the key steps in the application to understand their computational requirements and to determine how best to map these functions to the processing elements of our hybrid architecture [5].
- Imager: The imager is typically a CCD array, although CMOS imagers are starting to appear on the market. The imager block in this design includes the A/D conversion.
- Initial image processing: The initial image processing includes adjusting the black level (subtracting the effect of the background current in the imager), compensating for the nonlinearity of the lens, interpolating across any known faulty pixel cells and balancing the whiteness level. Once the black level is known, the adjustment is a constant subtraction from every pixel value. The lens compensation is implemented as a table lookup. Faulty pixel interpolation and whiteness balancing are arithmetic in nature.
- Gamma correction: The gamma-correction step compensates for the nonlinear brightness effect of printers and display devices. Typically, a standard gamma value is used unless the output device is known. Gamma correction is implemented as a table lookup.
- Color space conversion: The RGB values produced by the imager are converted to the YCrCb color space for final processing and compression. Color space conversion is a set of linear arithmetic operations on the RGB values.
- Final image processing: The imaging array effectively acts as a low-pass filter on the image. In some high-end cameras the image may be subjected to further processing, typically including edge detection in the Y channel and color correction at the edges in the Cr and Cb channels.
- Autofocus: Cameras with mechanical autofocus employ a feedback loop from the image to control the
- Compression: Most digital cameras today use the JPEG standard compression algorithm. JPEG compression is a three-step process: a discrete cosine transform (DCT) is applied to 8 x 8 or 16 x 16 blocks of the input image; the resulting coefficients are quantized; and the quantized coefficients are encoded using a modified Huffman code.
- Decompression: To allow playback of the stored images, the JPEG algorithm is reversed. The stored coefficients are decoded, subjected to an inverse quantization process and an inverse DCT.
- Image management: The image-management block controls the flash image storage, and reads and writes the JPEG header information.
- Viewfinder: The viewfinder block is responsible for scaling the image to fit the LCD screen, and performing contrast enhancement and gamma correction to compensate for the nonlinear intensity of the liquid-crystal display.
- User interface: The user interface functions include mode selection, on-screen display of menus, editing the storage buffer, battery monitoring and system I/O.
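Three of the steps listed above (black-level adjustment, gamma correction and color space conversion) are simple enough to sketch directly. The following is an illustrative model only, under several assumptions not stated in the article: 8-bit samples, a display gamma of 2.2, and the full-range ITU-R BT.601 coefficients used by JFIF for the RGB-to-YCbCr conversion. All function names are hypothetical.

```python
def subtract_black_level(pixels, black_level):
    """Constant subtraction from every pixel value, clamped at zero."""
    return [max(0, p - black_level) for p in pixels]


def build_gamma_lut(gamma=2.2, max_value=255):
    """Precompute out = max * (in / max) ** (1 / gamma) for every input code.

    The gamma value of 2.2 is an assumed default, not taken from the article.
    """
    return [round(max_value * (i / max_value) ** (1.0 / gamma))
            for i in range(max_value + 1)]


def apply_lut(pixels, lut):
    """Table lookup: one memory read per pixel, no arithmetic."""
    return [lut[p] for p in pixels]


def clamp8(x):
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, int(round(x))))


def rgb_to_ycbcr(r, g, b):
    """Linear conversion using the full-range BT.601 (JFIF) coefficients."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clamp8(y), clamp8(cb), clamp8(cr)
```

The split mirrors the mapping principles discussed earlier: the two table-driven steps reduce to lookups that suit pipelined logic, while the color conversion is a handful of multiply-accumulate operations of the kind a DSP handles well.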
Fig. 1 shows the top level of the Cheshire architecture. The principal processing units are the ARM9 core, the ZSP core and the two PLC cores. The major blocks communicate with one another and with the various memories and peripheral devices over the Amba High-Speed Bus (AHB) [3, 4, 10]. AHB bus masters include the ARM9, the ZSP and DMA engines in each of the PLCs. The ZSP and the PLC blocks are also connected through a higher-bandwidth intercore interface, the direct ZSP interface (DZI), which allows high throughput and direct data sharing between cores.
The ARM9 core is responsible for a variety of system-management functions. These include all communication with the user through the various buttons, on-screen menus and the USB; power and battery-life management; configuration management for the programmable-logic cores, including synchronization and dynamic run-time reconfiguration; and image storage and retrieval. The ARM9 coordinates all of the system activities with a real-time operating system.
The imager block consists of the external imaging array and its associated A/D conversion, together with a DMA engine that transfers the data directly into one port of the dual-ported frame buffer. The frame buffer, which is big enough to contain four complete images, is used to hold the latest input image as well as processed intermediate images.
ZSP core subsystem
The DSP subsystem consists of the ZSP400 core and its local memory subsystem. The ZSP400 is a four-way superscalar, 16-bit DSP core developed by LSI Logic. Its architecture is based on a five-stage pipeline [11].
The ZSP400 core implements two interface ports for memory and peripherals: an internal port for close-coupled, single-cycle program and data memory, and an external port that gives the instruction unit (IU) and data unit (DU) alternative access to external memory and peripherals. The internal port allows closely coupled "local" memory access and is intended for use with synchronous on-chip memory. By using dual-ported memory and a memory interface controller (as seen in Fig. 1) that allows multiplexing and segmentation of memory ports, a low-overhead direct memory access interface to external on-chip logic is implemented.
The external port interfaces the ZSP to external memory and peripherals and provides 16-bit input and 32-bit output data busing to the core IU and DU.
The Cheshire architecture uses two Adaptive Silicon MSA2500 programmable-logic cores, each with dedicated memories and high-performance communication circuitry designed to accelerate image-processing operations. Fig. 2 shows the organization of the PLC blocks.
Although the two PLC blocks are organized identically, they perform somewhat different functions. The PLC-1 block is typically used to accelerate operations in the main image pipeline, and works closely with the ZSP. The PLC-2 block processes the viewfinder data; that is, it performs the image scaling, contrast enhancement and gamma correction for the LCD. The image scaling can be handled as either a simple decimation or a linear averaging. Contrast enhancement is done using the histogram projection algorithm. A table lookup operation is used to perform gamma correction.
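The viewfinder operations assigned to PLC-2 can be modeled in a few lines. This is a software sketch under stated assumptions, not the logic loaded into the core: "linear averaging" is interpreted here as a block (box) mean over each tile, and histogram projection is taken in its usual formulation, in which the occupied gray levels are spread evenly across the output range regardless of how many pixels share each level. The function names and the integer scale factor are hypothetical.

```python
def decimate(image, factor):
    """Simple decimation: keep every factor-th pixel in each dimension."""
    return [row[::factor] for row in image[::factor]]


def box_average(image, factor):
    """Averaging: replace each factor x factor tile with its integer mean."""
    height = len(image) // factor * factor
    width = len(image[0]) // factor * factor
    out = []
    for y in range(0, height, factor):
        out_row = []
        for x in range(0, width, factor):
            tile = [image[y + dy][x + dx]
                    for dy in range(factor) for dx in range(factor)]
            out_row.append(sum(tile) // len(tile))
        out.append(out_row)
    return out


def histogram_projection(image, max_out=255):
    """Map the occupied input levels to evenly spaced output levels."""
    levels = sorted({p for row in image for p in row})
    if len(levels) == 1:
        # A flat image has no contrast to stretch; map it to level 0.
        return [[0 for _ in row] for row in image]
    step = max_out / (len(levels) - 1)
    lut = {level: round(i * step) for i, level in enumerate(levels)}
    return [[lut[p] for p in row] for row in image]
```

Decimation needs no arithmetic at all, while averaging costs a few additions per output pixel but avoids discarding samples; both reduce to fixed, streaming operations. Once its lookup table is built, the contrast step is also a table lookup, like the gamma correction that follows it.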
Cheshire's architecture uses dual approaches for integration between cores. Both ZSP and PLC cores interface to the Amba AHB bus, along with every other significant on-chip logic block. All communication with the peripherals is handled through the AHB by the ARM9 processor.
1. J. Arnold, W. Luk and K. Pocek, eds., Field-Programmable Custom Computing Technology: Architectures, Tools and Applications, Kluwer Academic Publishers, 2000.
2. D. Buell, J. Arnold and W. Kleinfelder, Splash 2: FPGAs in a Custom Computing Machine, IEEE Computer Society Press, 1996.
3. S. Furber, ARM System-on-Chip Architecture, Addison Wesley, 2000.
4. J. Hesketh, "The programmable-logic core: Enabling the configurable system-on-a-chip," Proceedings of DesignCon 2001/PLD Forum, February 2001.
5. A. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, 1989.
6. C. Kozyrakis and D. Patterson, "A new direction for computer architecture research," IEEE Computer, November 1998.
7. C. Rupp et al., "The NAPA adaptive processing architecture," in K. Pocek and J. Arnold, eds., Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, April 1998.
8. C. Rupp, "Features and benefits of ALU-based programmable logic," Electronic Engineering, February 2001.
9. N. Stollon and B. Sihlbom, "BAZIL: A multi-core architecture for flexible broadband processing," Proceedings of the Embedded Processor Conference, 2001.
10. N. Stollon, "Using Amba for signal processor core integration," Portable Design Magazine, July 2001.
11. "ZSP architecture overview white paper," posted at www.zsp.com.