Design Example: A Cross-correlation Image Processor
In this section, we discuss a medical image processing application. Because the design is complex, we present only its highlights. A traditional image processor might consist of one or more DSP processors running a set of pipelined assembly and C code. The DSP would be responsible for the algorithm calculation as well as the overall system control functions, an area where DSPs typically perform poorly.
Registration of medical images can be aided by the use of fiducial (reference) markers purposefully placed in the image. One technique that can be used to automatically locate these marks is to run a cross-correlation between the image of the fiducial mark and the image under study. The core algorithm for this processing is a small mask cross-correlation which is processed at each point in the image.
We will design and implement a simple 3x3 cross-correlation image processor. The algorithm multiplies the 9 pixel values in the 3x3 array by the corresponding 9 values stored in a 3x3 reference mask and then sums the 9 products. As a final step, the algorithm rounds off the sum and then bit-shifts it back to the original data size. This circuit will be referred to as a multiplier segment.
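The multiply, sum, round-off, and shift steps described above can be sketched in C as follows. This is a software illustration only, not the DSP Builder circuit. The function name, the shift amount SHIFT_BITS, and the saturation of out-of-range results are assumptions made for the example; the article does not state them.

```c
#include <stdint.h>

/* Bits shifted out after rounding. Assumed value: the article does not
   give the shift used in the actual design. */
#define SHIFT_BITS 16

/* Multiply the nine pixels of a 3x3 window by the nine mask values,
   sum the products, round to nearest, and shift back to 16 bits. */
uint16_t correlate3x3(const uint16_t window[9], const uint16_t mask[9])
{
    uint64_t sum = 0;  /* nine 32-bit products can overflow 32 bits */

    for (int i = 0; i < 9; i++) {
        sum += (uint32_t)window[i] * mask[i];
    }

    sum += (uint64_t)1 << (SHIFT_BITS - 1);  /* add one half LSB to round */
    sum >>= SHIFT_BITS;                      /* scale back to 16 bits */

    /* Saturate rather than wrap (an assumption of this sketch). */
    return sum > UINT16_MAX ? UINT16_MAX : (uint16_t)sum;
}
```

The 64-bit accumulator is used because nine products of two 16-bit values can exceed 32 bits; a hardware implementation would size the adder tree for the same reason.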
The image that we will be processing is 512x512 pixels, typical of CT images. The 3x3 cross-correlation will be evaluated at each pixel in the original image. We will evaluate the Texas Instruments 3x3 correlator, which uses 8-bit data, but we will devise our own 16-bit correlator that is better suited to the dynamic range of medical images.
In order to increase parallelism in our PLD design, we will process eight rows of the image simultaneously, stepping through the rows from left to right. The value of eight is a trade-off between speed and design complexity for this example. The design is extensible to as many parallel row calculations as the chosen PLD can support. (Note: Altera has PLDs with up to 80,000 LEs.) For simplicity, Figure 6 shows only one-half (4 rows) of the correlation processor, which we will use as part of a larger system.
Operation of the system is straightforward (cf. Figure 6). The DMA is constructed as eight parallel sections, each serving one of the multiplier segments. The first step is for the NIOS to load the parallel DMA processor with starting addresses for each of the segments. Each segment of the DMA extracts the image data for the 3x3 cross-correlation and loads it into its respective multiplier segment, one value per clock cycle. At each load cycle, the circuit calculates an output, but the only valid result is the one produced after all 9 values have been loaded. Therefore, the DMA returns to memory only those values that result from a completed calculation.
Figure 6. Four Segments of an Eight Segment 3x3 Correlator
The design consists of four major components:
1. The eight 3x3 multiplier segments designed in DSP Builder
2. A NIOS processor designed in SOPC Builder
3. A parallel DMA module that operates at high speed designed in DSP Builder
4. The C code for the NIOS.
An important point to note as we walk through this example is that the Altera software allows the design process to be object-oriented in much the same manner as C++. Once a design element is created and verified, we will encapsulate it and treat it as a black box (in analogy to a class in C++). It will appear as a block in the Quartus II integration, and at that stage, we will no longer concern ourselves with the lower level details of the block. The advantage of this approach is the same as with C++: re-usability and standardized interfaces.
Figure 7(a) shows the Simulink block diagram from DSP Builder used to create a single multiplier/round-off segment depicted in Figure 6. The appeal of DSP Builder is that standard signal processing blocks can be used to assemble a high-speed signal processing system in a PLD without resorting to HDL programming. In the construction of the multiplier sections, four standard blocks from the DSP Builder Library were used: 1) two 9-tap FIR filter blocks, 2) two Multiply-Add blocks, 3) one Parallel Adder block, and 4) one Product block.
To understand the heart of the algorithm, we will step through the operation of a single multiplier/round-off section. First, the data_reg_select line is held low and the nine 16-bit values of the 3x3 reference mask are loaded into the lower FIR filter. The system is now ready to work its way through the image. Next, at the edge of each clock cycle (in this design f = 300 MHz), the DMA presents a 16-bit value from the image on the Data_In port (far left-hand side of Figure 7(a)), which is loaded into the first tap of the upper FIR register. The nine products and their sum are calculated in a single clock cycle. This result is invalid until the whole set of nine values has been loaded into the upper FIR, so the system does nothing with it. At each clock cycle, another of the nine image values is loaded and the result is recalculated. The DMA circuitry transfers the cross-correlated value to memory only after the entire calculation is completed. All other results are ignored.
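The load-and-ignore behavior described above can be modeled in software. The following is a minimal sketch, assuming a simple shift-register model of the two FIR blocks; the type and function names (segment_t, segment_load_mask, segment_clock) are hypothetical and do not correspond to DSP Builder signals. The model reports a result as valid only on the ninth load of each window.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TAPS 9

/* Software model of one multiplier segment. All names are illustrative. */
typedef struct {
    uint16_t mask[TAPS];  /* reference mask (lower FIR) */
    uint16_t data[TAPS];  /* image values (upper FIR) */
    int      loaded;      /* image values loaded toward the current result */
} segment_t;

void segment_init(segment_t *s)
{
    memset(s, 0, sizeof *s);
}

/* Shift one mask value into the lower register. */
void segment_load_mask(segment_t *s, uint16_t value)
{
    memmove(&s->mask[1], &s->mask[0], (TAPS - 1) * sizeof s->mask[0]);
    s->mask[0] = value;
}

/* One clock: shift in an image value and recompute the sum of products.
   Returns true only on the ninth load, when *sum_out is valid. */
bool segment_clock(segment_t *s, uint16_t data_in, uint64_t *sum_out)
{
    memmove(&s->data[1], &s->data[0], (TAPS - 1) * sizeof s->data[0]);
    s->data[0] = data_in;

    uint64_t sum = 0;
    for (int i = 0; i < TAPS; i++) {
        sum += (uint32_t)s->data[i] * s->mask[i];
    }
    *sum_out = sum;

    s->loaded++;
    bool valid = (s->loaded == TAPS);
    if (valid) {
        s->loaded = 0;  /* start counting toward the next window */
    }
    return valid;
}
```

Because the mask and the image values are shifted in through the same kind of register, the k-th image value loaded always lines up with the k-th mask value loaded once all nine are in place. A caller would discard the sum whenever segment_clock returns false, which mirrors the DMA ignoring the intermediate results.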
On the edge of the ninth clock cycle, all the registers have been loaded and the calculation is complete. Since the round-off operation must be presented with valid data from the cross-correlation sums, the system requires another clock cycle to perform the round-off and bit-shift. The result appearing at the Conv_data port (far right-hand side of Figure 7(a)) is taken by the DMA circuitry and placed back in the resultant image memory in a single clock cycle. Our total time to calculate a single 3x3 cross-correlation is therefore 11 clock cycles: nine to load, one to round off and shift, and one to store. In the full design, eight of these sections are aligned in parallel, so that 8 cross-correlations are performed in 11 clock cycles.
Figure 7(b) shows the Quartus II black boxes of the correlator and the round-off circuits. (Note the encapsulation of the circuit in Figure 7(a) into a single black box. This correlator can now be shared among designers and projects.)
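As a rough check on what these cycle counts imply for a whole image, the arithmetic can be written out as below. This is a best-case sketch: it assumes every pixel of the image is evaluated, that groups of correlations run back to back, and that memory and the bus never stall. The function name and parameters are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles to correlate a whole image when `segments` correlations
   complete together every `cycles_per_group` cycles. Best case: no
   memory or bus stalls are modeled. */
uint64_t image_cycles(uint32_t width, uint32_t height,
                      uint32_t segments, uint32_t cycles_per_group)
{
    assert(segments > 0);

    uint64_t pixels = (uint64_t)width * height;
    uint64_t groups = (pixels + segments - 1) / segments;  /* round up */

    return groups * cycles_per_group;
}
```

For the numbers in this example, image_cycles(512, 512, 8, 11) gives 360,448 cycles, or 1.375 cycles per pixel. At the 300 MHz clock quoted above, that is roughly 1.2 ms per image under these idealized assumptions.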
Figure 7a. The DSP Builder Block Diagram of a Single Cross-Correlator Segment
Figure 7b. The Quartus II Block Diagram of a Single Cross-Correlator Segment
The NIOS serves as the on-chip (PLD) controller by setting up the DMA transfers, loading static values such as the mask, and communicating with the host system. (Note: SOPC Builder provides a number of standard communication protocols including serial, Ethernet, and PCI bridges. Third-party vendors can supply other protocols.) Construction of a complete NIOS processor consists of selecting the major processor components needed in the application. Within SOPC Builder, the engineer chooses the required components such as Boot ROM, external memory, and peripherals. Figure 8(a) shows the NIOS design components for the correlator in the list format of SOPC Builder. Components are selected from the left-hand panel and are automatically added to the table in the center. Parameters such as memory locations, interrupts, and operational modes are set with the Configuration Wizard of each component. Figure 8(b) shows the entire CPU and all of its peripherals as a "black box" in the Quartus II environment.
Upon generation and compilation of the NIOS, a custom software development kit (SDK) is automatically generated for use with the hardware. The SDK contains explicit instructions for the C compiler/linker such as variable definitions, and memory maps to actual devices. The software for the correlator is then written within the context of this SDK and the low level software details are managed by the SDK itself. In other words, connectivity between the hardware and NIOS software is automatically handled by the SDK.
In order to streamline the NIOS code, we will create two custom peripherals by grouping the hardware for the DMA and the hardware for the multiplier cells into separate custom peripherals. These complex circuits are then reduced to single function calls within the NIOS software (see below).
After the complete circuit has been simulated, we accomplish this abstraction by collecting the two hardware sections into separate files. By following a few simple design rules, these custom logic blocks are imported into the NIOS design by filling out some simple menu items in the SOPC Builder software. We then instantiate these blocks as Black Box items, meaning that they will now appear as user blocks in the Quartus II final system integration. After this integration, these circuits are accessed and controlled in our NIOS C code as ordinary function calls.
Figure 8a. The NIOS Soft Processor in SOPC Builder
Figure 8b. The NIOS Soft Processor in Quartus II
As we saw in the multiplier/round-off discussion, the most time consuming portion of the algorithm is in data movement. In order to get high throughput in a cross-correlator of this type, it is essential to have a DMA processor that moves the data from external memory to the inputs of the multiplier segments as quickly as possible. To accomplish this, the DMA processor loads the nth byte of the nine image values simultaneously on all eight segments.
The DMA processor is essentially just a memory address/fetch engine that generates the addresses of the desired memory locations and manipulates the required control lines to force the memory to present its data. In addition, it must do some simple shifting of addresses in order to acquire the data in a 3x3 block within the image and then index through the pixels of the original image on the clock edge. It also places the resultant cross-correlation into memory in image format, usually just rows placed end-to-end.
The actual structure of the DMA processor depends on the system architecture, and explicitly on a number of decisions the designer may be forced to make for reasons other than speed, for example, choosing SDRAM over SRAM because of cost. The precise construction will be sensitive to the following issues:
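The address shifting described above can be illustrated with a short sketch. It assumes a row-major image of 512 pixels per row, interior pixels only (no border handling), and eight segments working on consecutive rows; the article does not specify the actual address logic, and the function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define IMAGE_WIDTH 512
#define SEGMENTS    8

/* Pixel offset of tap n (0..8) of the 3x3 window centered at (row, col)
   in a row-major image. Interior pixels only. */
uint32_t tap_offset(uint32_t row, uint32_t col, unsigned n)
{
    assert(n < 9 && row >= 1 && col >= 1);

    uint32_t r = row + n / 3 - 1;  /* window row: row-1, row, row+1 */
    uint32_t c = col + n % 3 - 1;  /* window column: col-1, col, col+1 */

    return r * IMAGE_WIDTH + c;
}

/* Offsets fetched in one load cycle: tap n for all eight segments,
   assuming segment k handles image row first_row + k. */
void load_cycle_offsets(uint32_t first_row, uint32_t col, unsigned n,
                        uint32_t out[SEGMENTS])
{
    for (unsigned seg = 0; seg < SEGMENTS; seg++) {
        out[seg] = tap_offset(first_row + seg, col, n);
    }
}
```

Stepping n from 0 to 8 produces the nine load cycles for one group of eight correlations; incrementing col then moves every segment one pixel to the right along its row. The result for each segment would be written to offset row * IMAGE_WIDTH + col of the output image.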
1. Memory type, e.g., SRAM or SDRAM
2. Memory access speed
3. Bus speed, architecture and arbitration
Utilizing the NIOS custom peripheral capability, program and hardware control can be greatly simplified. The hardware shown in Figure 6 will be encapsulated into the three separate function calls found in Table 1.
Table 1. NIOS Custom Peripherals for the 3x3 Cross-correlator
Once the proper interface discipline has been followed in the setup and definition of our custom peripherals, their use becomes that of a simple function call. The program that runs the cross-correlator is now abstracted into a few simple lines of C. Since we will calculate 8 rows at a time, we will run the cross-correlation processor 64 times to complete our 512x512 image.
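The shape of such a control loop might look like the sketch below. The three functions load_mask, dma_set_start_row, and run_correlator_pass are hypothetical stand-ins invented for this illustration; they are not the custom-peripheral calls of Table 1, whose names and signatures are not reproduced here. In this sketch they are empty software stubs so that the loop can be compiled and run on its own.

```c
#include <stdint.h>

#define IMAGE_ROWS    512
#define ROWS_PER_PASS 8

/* Hypothetical stand-ins for the custom-peripheral calls. On the real
   system these would drive the DMA and multiplier hardware. */
static void load_mask(const uint16_t mask[9]) { (void)mask; }
static void dma_set_start_row(uint32_t row)   { (void)row; }
static void run_correlator_pass(void)         { }

/* Correlate the whole image eight rows at a time.
   Returns the number of passes made. */
unsigned correlate_image(const uint16_t mask[9])
{
    unsigned passes = 0;

    load_mask(mask);  /* the mask is static, so it is loaded once */

    for (uint32_t row = 0; row < IMAGE_ROWS; row += ROWS_PER_PASS) {
        dma_set_start_row(row);   /* point the eight DMA sections at rows */
        run_correlator_pass();    /* sweep those rows left to right */
        passes++;
    }
    return passes;
}
```

With 512 rows and eight rows per pass, the loop body executes 64 times, matching the count given above.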
Performance and Cost Comparison between a PLD and a Traditional DSP
Finally, we compare the performance of our 8-segment cross-correlator with that of a 200 MHz TMS320C6201 processor running the same algorithm from TI's IMAGELIB Image Processing Library. Comparison of the two systems is found in Table 2.
Table 2. Stratix EP1S10 vs. TMS320C6201: Clock Cycles per Pixel (1)
The calculation of the results runs 21 times faster in the PLD than on the 6201 DSP. The speed gain in the data transfer is much less and will depend upon the memory and bus configuration. There is some degree of uncertainty in the comparison of data transfer rates because the transfers are performed on two different systems. The message here is that the more computationally intense an algorithm, the more speed gain is possible with the Altera PLD approach. If data transfer dominates the algorithm, then the speed gain will be less.
Cost of implementation is also an important design metric. Let's compare the cost of a TMS320C6201 implementation versus the same system on the Stratix family of FPGAs. At the time of this writing, the 5,000 unit cost of a TMS320C6201 200 MHz was approximately $120. The same quantity of the Stratix EP1S10 (a 10,000 LE component) was approximately $60 per part. However, the design in this example consumed only 5,500 LEs including the NIOS, DMAs, and the eight row, parallel multiplier units. Software profiling of the TMS320C6201 shows that it is fully engaged in the processing of the correlator. Additional tasks would only increase the time to completion for the image. However, the EP1S10 has 45% of its processing power that is truly idle, and can be applied to other tasks in the system. Thus, the true cost of implementation of the correlator is 55% of the part cost (assuming full utilization of the EP1S10 in the system). See Table 3 for a comparison.
Table 3. Stratix EP1S10 vs. TMS320C6201 Cost of Implementation
Not only does the FPGA co-processor speed up the completion of the image by a factor of six, but it also costs less than one-third to implement in a system.
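The cost claim can be checked against the figures quoted above. The arithmetic below uses only the stated part prices and LE counts, and it assumes, as the text does, that the unused portion of the EP1S10 is fully employed elsewhere in the system.

```latex
\begin{align*}
\text{fraction of device used} &= \frac{5{,}500\ \text{LEs}}{10{,}000\ \text{LEs}} = 0.55 \\
\text{effective PLD cost} &\approx 0.55 \times \$60 = \$33 \\
\frac{\text{effective PLD cost}}{\text{DSP cost}} &\approx \frac{\$33}{\$120} = 0.275 < \tfrac{1}{3}
\end{align*}
```

Even without the utilization adjustment, the full $60 part price is one-half of the DSP price.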
Like every design decision in a complex system, moving all, or part, of an algorithm into a PLD co-processor depends on a wide range of issues. It really comes down to the analysis of the size and complexity of both the hardware and software components in the design. For example, a simple algorithm presently running on a DSP that is not under enormous deadline pressure is probably not a good candidate for porting to a PLD unless the goal is to reduce PC board real estate or to have a higher degree of hardware integration. However, if your system is high performance, and generally pushes a DSP to its limits, then a PLD co-processor approach may be your only viable alternative. At Glacier Point Research, we have found that switching to PLDs as the primary processors in our industrial ultrasound systems, in place of the C6000 DSPs, is the only way we can get the performance the application demands.
Altera's design tools (Quartus II, DSP Builder, and SOPC Builder) are a major step forward in system development. The wide range of automated tools makes the development of PLD systems faster and easier than ever before. We have shown here that the use of PLDs as co-processors in medical and industrial applications provides considerable speed gains over a high-end DSP. In addition, the system architect is given a much higher degree of control than is possible with DSPs.
Special Acknowledgement to:
Dan Grolemund, Ph.D. and Jose Gorostegui
Glacier Point Research, Inc.
San Mateo, CA