Common system architectures prioritize different elements of the path from sensor to useful data as follows:
Machine vision: For this application, the quickest path to working software is a high-level language. The application focuses on higher frame rates and higher resolution, which remains a moving target as sensors increase in capacity and speed. Typically, an FPGA parallelizing compiler takes the code being offloaded from the CPU and generates the HDL. Ideally, there is a standard board interface to the specific platform, which may consist of a standalone FPGA or an FPGA that is part of a larger system. Also ideally, this interface lives in a separate layer (e.g., a platform support package) so that the design is not locked into any particular platform.
Typical systems take high-resolution images using a commercial off-the-shelf (COTS) camera and eliminate the processing bottleneck by offloading the processing to parallel processes running in an FPGA.
On the tool side, C-based compilers lend themselves to a stream processing model, while OpenCL offers a threading plus memory model. In either case, it helps accelerate the project to know about FPGAs and how the tools actually produce results; i.e., what to expect. This knowledge varies widely from one design team to the next. There can be lots of "gotchas" after waiting perhaps four to eight hours for the design to propagate (via synthesis and place-and-route) into the physical FPGA. The more this image flow is pipelined, the more the architect should consider multiple processors in the fabric as well as the CPUs that are available on some FPGAs.
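As a rough illustration of the stream processing model, the following plain C sketch shows a hypothetical 3-tap smoothing stage that consumes one pixel and produces one pixel per loop iteration. The function name, the (1, 2, 1)/4 kernel, and the use of arrays in place of hardware streams are all assumptions made for illustration; a real C-to-FPGA tool would supply its own stream read/write primitives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical streaming stage: causal 3-tap smoothing filter.
 * One pixel is read and one is written per iteration, and the only
 * state is a two-pixel shift register. Plain arrays stand in for the
 * hardware streams a C-to-FPGA tool would provide. The output lags
 * the input by one pixel because the window looks backward only. */
void smooth3_stream(const uint8_t *in, uint8_t *out, size_t n)
{
    assert(in != NULL && out != NULL);

    uint8_t prev1 = 0;  /* pixel from one iteration ago  */
    uint8_t prev2 = 0;  /* pixel from two iterations ago */

    for (size_t i = 0; i < n; i++) {
        uint8_t cur = in[i];            /* stands in for a stream read */

        if (i == 0) {                   /* replicate the first pixel at the edge */
            prev1 = cur;
            prev2 = cur;
        }

        /* (1, 2, 1) / 4 weighting; the maximum sum is 1020. */
        unsigned sum = (unsigned)prev2 + 2u * (unsigned)prev1 + (unsigned)cur;
        out[i] = (uint8_t)(sum >> 2);   /* stands in for a stream write */

        prev2 = prev1;                  /* advance the shift register */
        prev1 = cur;
    }
}
```

Because each iteration depends only on the current pixel and a small, fixed amount of history, a loop of this shape is the kind a compiler can typically pipeline, and several such stages can be chained stream-to-stream.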
Real-time input to output: In this type of application, the system is processing frame-by-frame and doing something on every frame. For instance, distributed video (e.g., HD TV) has frame buffers, but it still has to run in real-time. Drop-out is just that: a lost frame is neither recovered nor worried about. In some cases, machine vision can compromise in deference to staying real-time or as fast as possible. In applications where loss isn't fatal, the system architect may compromise on the frame rate or the resolution. Buffering compensates for latencies or lags. Frame buffering is typically used when a frame must be recalled more than once, or as a giant FIFO at the frame level (vs. at the pixel level). In this situation, external memory is cheaper than on-chip memory.
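The frame-level FIFO with unrecovered drop-out can be sketched in plain C as follows. This is a minimal software model under stated assumptions, not a hardware design: the names, the four-slot depth, and the 16-byte "frame" are hypothetical and chosen only to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_BYTES 16u  /* tiny illustrative frame; real frames are far larger */
#define FIFO_DEPTH   4u  /* hypothetical number of frame slots in external memory */

typedef struct {
    uint8_t  slots[FIFO_DEPTH][FRAME_BYTES];
    size_t   head;       /* index of the oldest buffered frame */
    size_t   count;      /* number of frames currently buffered */
    unsigned dropped;    /* frames lost because the FIFO was full */
} frame_fifo;

void fifo_init(frame_fifo *f)
{
    assert(f != NULL);
    memset(f, 0, sizeof *f);
}

/* Producer side never blocks: if the FIFO is full, the incoming frame
 * is dropped and only counted, which keeps the input in real time. */
bool fifo_push(frame_fifo *f, const uint8_t *frame)
{
    assert(f != NULL && frame != NULL);
    if (f->count == FIFO_DEPTH) {
        f->dropped++;
        return false;
    }
    size_t tail = (f->head + f->count) % FIFO_DEPTH;
    memcpy(f->slots[tail], frame, FRAME_BYTES);
    f->count++;
    return true;
}

/* Consumer side: copies out the oldest frame, or returns false if empty. */
bool fifo_pop(frame_fifo *f, uint8_t *frame)
{
    assert(f != NULL && frame != NULL);
    if (f->count == 0) {
        return false;
    }
    memcpy(frame, f->slots[f->head], FRAME_BYTES);
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

The depth of such a buffer sets how much downstream latency can be absorbed before frames start to drop; the policy shown here (discard the newest frame) is one choice among several.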
Analytics or reconstruction: An example of this type of application is a medical scanner image reconstruction algorithm. According to Professor Scott A. Hauck from the University of Washington:
Computed Tomography (CT) image reconstruction techniques represent a class of algorithms that are ideally suited for co-processor acceleration. The Filtered Back Projection (FBP) algorithm is one such popular CT reconstruction method that is computationally intensive but amenable to extensive parallel execution. [it is amenable to using] an FPGA accelerator for the critical back projection step in FBP using a C-to-FPGA tool flow like Impulse C. The strategies show orders of magnitude speedup over a software implementation of back projection and can achieve nearly the same performance as hand coded HDL while significantly reducing the design effort.
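To make the back projection step concrete, here is a textbook pixel-driven sketch in plain C for parallel-beam geometry. It is not the Impulse C implementation referred to in the quote; the function name, the argument layout, the unit pixel and detector spacing, and the use of precomputed sine and cosine tables are assumptions made for illustration. The constant scale factor normally applied after accumulation is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Pixel-driven back projection, parallel-beam geometry (textbook form).
 *   sino   : n_angles x n_det filtered projections, row-major
 *   cos_t  : cosine of each projection angle (precomputed table)
 *   sin_t  : sine of each projection angle (precomputed table)
 *   image  : n x n output, row-major, accumulated in place
 * Assumes unit pixel and detector spacing, with both the image and the
 * detector centered on the rotation axis. */
void backproject(const float *sino, const float *cos_t, const float *sin_t,
                 size_t n_angles, size_t n_det, float *image, size_t n)
{
    assert(sino != NULL && cos_t != NULL && sin_t != NULL && image != NULL);
    assert(n > 0 && n_det > 0);

    const float img_c = 0.5f * (float)(n - 1);      /* image center     */
    const float det_c = 0.5f * (float)(n_det - 1);  /* detector center  */
    const float t_max = (float)(n_det - 1);

    for (size_t a = 0; a < n_angles; a++) {
        const float *row = sino + a * n_det;
        for (size_t iy = 0; iy < n; iy++) {
            for (size_t ix = 0; ix < n; ix++) {
                float x = (float)ix - img_c;
                float y = (float)iy - img_c;

                /* Detector coordinate that this pixel projects onto. */
                float t = x * cos_t[a] + y * sin_t[a] + det_c;
                if (t < 0.0f || t > t_max) {
                    continue;                       /* falls off the detector */
                }

                /* Linear interpolation between neighboring detector bins. */
                size_t i0 = (size_t)t;
                size_t i1 = (i0 + 1 < n_det) ? i0 + 1 : i0;
                float  w  = t - (float)i0;

                image[iy * n + ix] += (1.0f - w) * row[i0] + w * row[i1];
            }
        }
    }
}
```

Every pixel update within one projection angle is independent of the others, which is what makes the inner loops natural candidates for unrolling and pipelining in an FPGA.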
Quality of results (QoR) is one of the "dirty little secrets" of most HLL methodologies. HLLs and optimizing compilers do well at unrolling and parallelizing, but -- without management -- they can add overhead. Often, the more physical HLLs have less overhead but, conversely, require more knowledge of the hardware. Memory and special FPGA resources are a couple of potential traps. Ideally, algorithms will minimize off-chip memory use and stay within the available special on-chip resources, such as block RAM and DSP blocks. We debug quite a few designs that "fall off a cliff" when a design iteration requires "just one more DSP block" than is actually available on-chip.
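The DSP "cliff" can be illustrated with a back-of-the-envelope budget check in plain C. This is a deliberately simplified, hypothetical model: it assumes each unrolled copy of a loop body consumes a fixed number of DSP blocks, and the struct, function names, and figures are invented for illustration. Actual usage depends on the device and on what the tools infer, so the synthesis report remains the authority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource model for one design. */
typedef struct {
    unsigned dsp_available;  /* DSP blocks on the target device (assumed)    */
    unsigned dsp_fixed;      /* DSP blocks used outside the unrolled loop    */
    unsigned dsp_per_copy;   /* DSP blocks per unrolled copy of the loop body */
} dsp_budget;

/* DSP blocks required for a given unroll factor under this model. */
unsigned dsp_needed(const dsp_budget *b, unsigned unroll)
{
    assert(b != NULL);
    return b->dsp_fixed + unroll * b->dsp_per_copy;
}

/* True if the unroll factor stays within the device's DSP blocks. */
bool unroll_fits(const dsp_budget *b, unsigned unroll)
{
    return dsp_needed(b, unroll) <= b->dsp_available;
}

/* Largest unroll factor that still fits, or 0 if nothing fits. */
unsigned max_unroll(const dsp_budget *b)
{
    assert(b != NULL && b->dsp_per_copy > 0);
    if (b->dsp_fixed > b->dsp_available) {
        return 0;
    }
    return (b->dsp_available - b->dsp_fixed) / b->dsp_per_copy;
}
```

With an assumed budget of 220 DSP blocks, 20 in fixed use, and 3 per copy, an unroll factor of 66 needs 218 blocks and fits, while 67 needs 221 and misses by a single block.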
HLLs and tools in this ecosystem include freely available options such as OpenCL, a royalty-free open standard with a reasonable list of libraries and macros. FPGA manufacturers produce excellent tools, but these -- of course -- tend to lock designers into a specific brand. OpenCL is emerging as a possible standard and, although "young" as applied to FPGAs, offers a lot of promise for co-processing with GPUs and CPUs using a common language. Third-party tools, such as Impulse C, are well-established and come with more intensive factory support (their vendors sell design seats, not chips), but they may cost more.
About the authors
Brian Durwood and Ed Trexel are employees of Impulse Accelerated Technologies, one of the established companies making C-to-FPGA tools and offering design services for prototypes and proofs of concept. The Impulse team has assisted on over 600 such designs.