How to use Field-Programmable Object Arrays (FPOAs) in image processing

Sean Riley, MathStar

6/27/2007 5:43 PM EDT

Algorithm-driven design is the norm rather than the exception for the modern system designer, and it often requires higher-performance integrated circuits (ICs). Unfortunately, the pressure to get products to market – coupled with restricted budgets – rules out application-specific integrated circuits (ASICs). As a result, more and more embedded system designers are turning to programmable logic devices and away from the ASIC approach. To date, the most popular programmable logic choices have been field programmable gate arrays (FPGAs) and digital signal processors (DSPs). Although these devices have enjoyed broad market acceptance, they do not scale well to meet high-performance system requirements.

A new category of very high-performance programmable logic devices has been developed to address the unmet needs of system designers. The MathStar Field Programmable Object Array (FPOA) is an example of this category, offering field re-programmability, 1 GHz performance, a 400-object array, high-speed I/O, and a streamlined design process. The design methodology of an FPOA uses building blocks called "objects" rather than the "gates" used in an FPGA. This object approach allows an FPOA to operate at 1 GHz, up to four times faster than an FPGA, while still offering all the benefits of a programmable logic device. The FPOA has been developed to provide deterministic timing, so no timing closure is required. Unlike an FPGA, a 1 GHz FPOA will always operate at 1 GHz. The result is much higher performance than is attainable with other reprogrammable solutions.

FPOA application performance in image processing
Because of its high performance, the FPOA is useful in a wide range of applications, including machine vision, professional video, medical imaging and image processing. These applications are built around extremely fast building blocks, such as flat field error correction, Fast Fourier Transform (FFT), Finite Impulse Response (FIR) filters, Infinite Impulse Response (IIR) filters, Discrete Cosine Transform, 2D convolution filters, and even video codecs such as MPEG2, JPEG2000 and MPEG4/H.264. This article covers three examples: flat field error correction, FFT and 2D convolution filter.

Flat Field Error Correction: Flat field correction is a very important algorithm in many industries, including professional video, security and surveillance and machine vision. It's used to adjust image sensor output data to ensure that errors and flaws in the optical sensor are not propagated to the rest of the system.

This correction process addresses three types of pixel-based non-uniformities: gain, dark current offset and defective pixels. In order to rectify these non-uniformities, a calibration and correction process must be performed. The calibration process determines the correction factors for pixel gain and offset, and generates a defective pixel map. The correction process takes these factors and calculates an appropriate value for non-uniform pixels.
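The calibration and correction steps can be sketched in a few lines of Python. This is a minimal illustration only, not MathStar's implementation: it works on a single line of pixels, and the function names, the two-frame calibration (one dark frame, one uniformly lit frame), the dead-pixel threshold, the 16-bit output range and the neighbor-averaging repair are all assumptions made for the example.

```python
def calibrate(dark_frame, flat_frame, dead_threshold=1):
    """Derive per-pixel offsets, gains and a defect map from two reference frames.

    dark_frame: pixel values captured with no light (dark current offset).
    flat_frame: pixel values captured under uniform illumination.
    dead_threshold is a hypothetical cutoff: a pixel whose response to light
    is below it is flagged as defective.
    """
    offsets = list(dark_frame)
    responses = [f - d for f, d in zip(flat_frame, dark_frame)]
    defects = [r < dead_threshold for r in responses]
    good = [r for r, bad in zip(responses, defects) if not bad]
    target = sum(good) / len(good)  # mean response of the working pixels
    gains = [0.0 if bad else target / r for r, bad in zip(responses, defects)]
    return offsets, gains, defects


def correct(raw, offsets, gains, defects, max_value=65535):
    """Apply offset and gain correction, then patch defective pixels."""
    out = []
    for pixel, offset, gain in zip(raw, offsets, gains):
        value = round((pixel - offset) * gain)
        out.append(min(max(value, 0), max_value))  # clamp to the assumed 16-bit range
    for i, bad in enumerate(defects):
        if bad:
            # Assumed repair strategy: average the good left/right neighbors.
            near = [out[j] for j in (i - 1, i + 1)
                    if 0 <= j < len(out) and not defects[j]]
            out[i] = round(sum(near) / len(near)) if near else 0
    return out
```

Calibration runs once per sensor, or whenever operating conditions change, while the correction loop runs on every pixel of every frame, which is why its throughput dominates the system requirement.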

As sensor resolutions grow to 4K × 4K and beyond, the computation requirements for flat field error correction grow with the pixel count, which quadruples each time the linear resolution doubles. Using only 13 to 22 objects, the FPOA architecture is able to sustain performance rates of 500 megapixels per second in continuous flat field error correction. This is more than four times the performance of a large FPGA and supports over 60 frames per second for a 4K × 2K image sensor. Rates higher than 1 gigapixel per second are achievable by implementing several flat field error correction blocks in parallel within a single FPOA.

Fast Fourier Transform: The FFT is an ingenious algorithm that is used for applications that require a discrete signal to be converted from the time domain to the frequency domain. The performance metrics of an FFT include the number of bits used to represent each sample, the number of samples, or points, in the FFT representation, and the rate at which the FFT can handle new inputs, also known as the sample rate.
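For readers who want to see the time-to-frequency conversion concretely, the following is a textbook recursive radix-2 FFT in Python. It is offered only as a reference sketch of the algorithm; it says nothing about how the transform is mapped onto FPOA objects, and it uses floating-point arithmetic rather than the fixed-width samples a hardware implementation would use.

```python
import cmath


def fft(samples):
    """Radix-2 decimation-in-time FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("number of points must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])  # transform of even-indexed samples
    odd = fft(samples[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle            # butterfly, upper half
        out[k + n // 2] = even[k] - twiddle   # butterfly, lower half
    return out
```

The number of points is the length of the input list. Each recursion level halves the problem, so an N-point transform needs on the order of N log2 N butterfly operations.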

The FPOA architecture is ideal for FFTs with sample rates of 1 gigasample per second. This performance level is up to four times what a large FPGA can accomplish. Table 1 shows performance benchmarks for various FFTs implemented on a 400-object FPOA.


Table 1. FFT performance when implemented on an FPOA.

2D Convolution Filter: Convolution kernels are a common and necessary component of image processing systems. The basic structure of the convolution kernel is used in spatial filtering, edge detection and other areas of image processing. The basic idea is to scan an entire image with a mask (also called the kernel), generating a weighted sum for each pixel. Depending on this weighted sum and the contents of the kernel, specific information about the image can be determined. The 2D convolution algorithm consists of arithmetic operations on pixels and memory buffers for localized image storage. As shown in the flat field error correction example, the architecture of the FPOA is well suited to these types of pixel-based applications. The FPOA is able to achieve a much higher pixel processing rate than FPGA architectures. Table 2 shows the object usage for various performance levels of 2D convolution filters.
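The weighted-sum scan described above can be written out as a short Python sketch. It is a plain software reference, not the FPOA mapping: the function name is hypothetical, border pixels are simply skipped, and the kernel is flipped as the mathematical definition of convolution requires (for a symmetric kernel this is identical to sliding the mask unflipped).

```python
def convolve3x3(image, kernel):
    """Convolve a 2D image (list of rows) with a 3x3 kernel.

    Only interior pixels are produced, so the result is two rows and
    two columns smaller than the input.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        line = []
        for c in range(1, cols - 1):
            acc = 0
            for i in range(3):
                for j in range(3):
                    # Flipped indexing: kernel[i][j] weights the pixel mirrored
                    # about the center of the 3x3 window.
                    acc += kernel[i][j] * image[r - i + 1][c - j + 1]
            line.append(acc)
        out.append(line)
    return out
```

Each output pixel needs nine multiplies and eight additions, plus access to three image lines at once, which is where the arithmetic operations and localized memory buffers mentioned above come from.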


Table 2. 2D convolution filter performance and resource utilization estimates for FPOA.

Why DSPs and FPGAs don't scale
To date, the most popular programmable logic choices have been DSPs and FPGAs. A typical DSP may have up to four processing engines that run at speeds of 800 MHz or higher. These processing engines are the equivalent of a specialized microprocessor, which is programmed in C-level code by a software engineer. This means that a compiler must be used to translate the C-level code into assembly language that is understood by the DSP device. For highly optimized performance, however, developers generally must hand-edit the DSP assembly code and manually assign tasks to specific processing engines. As a result, DSPs are ultimately limited in performance by their clock rate and the number of useful operations they can do per clock cycle. DSPs can achieve high clock rates, but cannot implement many functions in parallel.

FPGAs, on the other hand, are able to achieve very high levels of parallelization but cannot achieve high clock rates. Because of this parallelism, FPGAs have emerged as a viable alternative for design teams who seek higher performance than can be realized on DSPs. FPGAs provide large arrays of programmable resources, generically called configurable logic blocks (CLBs). CLBs are basically look-up tables, flip-flops, registers and memory banks. More advanced FPGAs also include dedicated resources such as multiply-accumulators (MACs) and embedded central processing unit (CPU) cores. While DSPs are typically programmed in C-level or assembly code, FPGAs are programmed with an HDL-based flow.

With each successive generation, FPGA architectures have been realizing smaller and smaller performance gains. This is because the FPGA architecture is limited by its internal interconnect scheme. A designer must go through a process called "timing closure" in order to finalize a design in an FPGA. It is not uncommon for an FPGA to advertise a maximum clock rate of 500 MHz but only be able to close timing at 200 MHz. This mismatch between advertised clock rate and real-life clock rate forces hardware designers to build in plenty of clock frequency headroom when doing an FPGA design. Adding to the headache, the timing closure process is not deterministic, meaning the design may need to be changed in order to close timing. This makes development time hard to scope, as any engineering manager who has directed a high-performance application on an FPGA knows. An FPGA has a large amount of parallel processing capability but is limited by its relatively low clock rate.

A comparison of parallel processing capability and operating frequency (clock rate) is shown in Fig 1 below.


1. Comparison of performance fundamentals of FPOA, FPGA and DSP architectures.

The FPOA architecture
System designers need a new architecture that enables them to continue developing high-performance embedded systems around programmable logic. The FPOA is a new field-programmable silicon platform that was developed for this express purpose. Unlike FPGAs, which implement most functions at the gate level, FPOAs employ higher-order building blocks called objects. These objects provide a much higher level of abstraction and, therefore, higher performance than the gates of conventional FPGAs. For example, the MathStar Arrix family of FPOAs contains over 400 objects that are able to pass data and signals to each other through a configurable communication framework. The timing of both the objects and the communication framework is deterministic at clock rates up to 1 GHz. This deterministic performance eliminates the tedious timing closure design step commonly associated with FPGAs. In addition, the FPOA architecture allows high-level functions, algorithms, equations, and block diagrams to be quickly, directly, and efficiently realized in high-performance silicon.

The current version of the FPOA has three different 1 GHz core objects. The Arithmetic Logic Unit (ALU) executes logical and mathematical functions on 16-bit data and provides general-purpose logic functions for control. The Multiply Accumulator (MAC) performs 16×16 multiply operations with a 40-bit accumulator. The Register File (RF) is a very fast, local memory that can be programmed as RAM, FIFO, or as a sequential read object. These core objects are surrounded by a periphery of internal RAM (IRAM), external DRAM controllers (XRAM) and high-speed I/O, which move data between core objects and off-chip devices.

MathStar's current Arrix FPOA includes 256 ALU objects, 64 MAC objects and 80 RF objects arranged in a 20×20 array as shown in Fig 2.


2. Architecture of MathStar Arrix Field Programmable Object Array.

FPOA Interconnect Framework: The FPOA's 1 GHz interconnect framework is a configurable mesh of connections used to transfer signals and data between objects. There are two types of connections: Nearest Neighbor and Party Line. Nearest Neighbor connections allow each object to communicate with any of its eight adjacent neighbors in a single clock cycle at 1 GHz. Any object can pull data from its nearest neighbors, operate on it, and provide it as an output in a single clock cycle. Party Lines provide communications between objects that are not adjacent. At 1 GHz, Party Lines can connect any object to another object if it is within a three-object radius. Party Lines also operate in a single clock cycle.
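To make the two connection types concrete, the sketch below classifies the link between two positions in a 20×20 array. It is a toy model built on assumptions: the article does not say how the three-object radius is measured, so the code uses the larger of the row and column distances (which is consistent with an object having eight adjacent neighbors), and the function name and return strings are hypothetical.

```python
GRID = 20  # the core array described in this article is 20 x 20 objects


def connection_type(src, dst, party_line_radius=3):
    """Classify a single-cycle link between two (row, col) object positions.

    Returns "nearest neighbor", "party line", or None when the destination
    is the source itself or lies outside the assumed single-cycle reach.
    """
    (r0, c0), (r1, c1) = src, dst
    for v in (r0, c0, r1, c1):
        if not 0 <= v < GRID:
            raise ValueError("position is outside the object array")
    # Assumption: distance is the larger of the row and column offsets.
    distance = max(abs(r0 - r1), abs(c0 - c1))
    if distance == 1:
        return "nearest neighbor"
    if 1 < distance <= party_line_radius:
        return "party line"
    return None
```

Under this model, placing communicating objects within three positions of each other keeps every transfer to one clock cycle, which is the kind of placement decision a mapping tool has to make.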

Computation Workhorse – ALU: The ALU is a programmable, multi-state core object that provides a general-purpose 16-bit arithmetic logic block for data operations as well as general-purpose logic/truth functions for control bit operations. Each of the 256 ALU objects can conditionally execute more than 30 instructions in a single clock cycle. Each ALU instruction queue is eight instructions deep.

Fast Multiplier – MAC: The MAC object can perform a variety of multiply and accumulate functions at speeds up to 1 GHz. The multiplier function multiplies two 16-bit inputs and generates a 32-bit result plus carry. The accumulator function adds a 32-bit input to an existing or new value and, depending on the configuration, provides a 40-bit output result. The multiplier function requires two clock cycles to complete and the accumulator function requires one clock cycle to complete; however, the MAC is fully pipelined so that at every FPOA clock cycle, the MAC accepts new inputs and generates a new result.
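The bit widths above can be illustrated with a small behavioral model in Python. This is a sketch under stated assumptions rather than a description of the real object: operands are treated as signed two's complement values, the accumulator wraps around on overflow instead of saturating, the pipeline latency is not modeled, and the class and helper names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


class Mac:
    """Behavioral model: 16 x 16 multiply feeding a 40-bit accumulator."""

    def __init__(self):
        self.acc = 0

    def clear(self):
        self.acc = 0

    def step(self, a, b):
        # A signed 16 x 16 product always fits in 32 bits.
        product = to_signed(a, 16) * to_signed(b, 16)
        # Assumption: the 40-bit accumulator wraps on overflow.
        self.acc = to_signed(self.acc + product, 40)
        return self.acc
```

The eight bits of headroom above a 32-bit product mean that at least 256 worst-case products can be summed before the accumulator can overflow, which is what makes long dot products and filter taps practical.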

Fast, Temporary Storage – RF Object: The RF object contains 64 memory locations of 20 bits each. The 20 bits are configured as 16 data bits and 4 control bits. The RF object supports three operating modes: random access mode, FIFO mode and Read Sequence mode. The RF object can also be configured with 40-bit double-width read and write data paths. All modes support simultaneous read and write every clock cycle at 1 GHz.
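As an illustration of the FIFO mode, here is a minimal Python model of a 64-entry queue of 20-bit words. The class name, the method names and the choice to pack the 4 control bits above the 16 data bits are assumptions made for the example; the article does not specify the bit layout.

```python
from collections import deque


class RegisterFileFifo:
    """Toy model of an RF object in FIFO mode: 64 words of 16 data + 4 control bits."""

    DEPTH = 64

    def __init__(self):
        self._words = deque()

    def push(self, data, control=0):
        if len(self._words) >= self.DEPTH:
            raise OverflowError("FIFO is full")
        # Assumed layout: control bits in the upper 4 bits of the 20-bit word.
        self._words.append(((control & 0xF) << 16) | (data & 0xFFFF))

    def pop(self):
        """Return (data, control) for the oldest word; IndexError if empty."""
        word = self._words.popleft()
        return word & 0xFFFF, word >> 16

    def __len__(self):
        return len(self._words)
```

A queue like this is the natural way to delay a stream of pixels by a fixed number of cycles, for example when a filter needs earlier samples alongside the current one.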

Memory and I/O: The FPOA has multiple internal SRAM (IRAM) banks, each linked to the core object array via Party Line connections and accessible every other clock. The FPOA also has two independent external memory (XRAM) controllers, each operating up to 2.4 GB/s. Each XRAM controller provides access to external 36-bit Double-Data-Rate (DDR) Reduced Latency DRAM (RLDRAM-II) memory. The FPOA has two high-speed I/O interfaces (RX/TX), each providing 16 Gbps of simultaneous transmit and receive bandwidth. General Purpose I/O (GPIO) banks provide configurable LVCMOS I/O pins for connection to embedded system control.

MathStar's FPOA design flow
MathStar's FPOA design flow enables teams of designers to create, verify, program and debug algorithms on FPOA devices. Designs are entered at the level of FPOA objects and are behaviorally simulated in a graphical design. Using the COAST mapping tool, these virtual designs are then mapped into the hardware resources of the FPOA.

The COAST tool generates an object code stream that designers can load onto the array via a PROM or through the FPOA's JTAG interface. Since the FPOA does not require timing closure, designs implemented and simulated for an FPOA are guaranteed to work in actual hardware as long as design rules are followed. The FPOA design process is akin to aligning data on clock cycle boundaries as opposed to closing physical timing in an iterative fashion in an FPGA. The COAST mapping tool is shown in Fig 3.


3. COAST design software allows for object connection and assignment.

The previously discussed 2D convolution filter example is shown in Fig 4 as seen by the COAST mapping tool. This particular example outlines a 3×3 kernel implementation using nine objects.


4. 2D Convolution Filter 3×3 Kernel Slice Layout in COAST.

Summary
The FPOA represents the next generation of programmable logic, addressing high-performance embedded applications such as machine vision, professional video, medical imaging and image processing. The FPOA combines massively parallel computation found in FPGAs with the 1 GHz clock rates found in DSPs. For applications such as FFTs, flat field error correction and 2D Convolution kernels, the FPOA represents up to four times the performance of large FPGAs in a single chip solution.

For more information on MathStar or the Arrix Family of FPOAs, visit www.mathstar.com or send an email to sean.riley@mathstar.com.

Sean Riley joined MathStar as Vice President of Marketing in April 2005. He is responsible for the planning, definition, positioning and marketing of MathStar's Field Programmable Object Array (FPOA) product line. Sean joined MathStar from Intel Corporation where he spent thirteen years in various marketing, engineering and general management roles. Sean's career has focused on building new businesses within the networking and communications market and he is most proud of his role in making Ethernet a ubiquitous networking standard throughout the world. His combination of technical and marketing knowledge helps him set strategic directions and develop profitable business plans in order to capture the strategic opportunity. Sean can be contacted at sean.riley@mathstar.com.

