In configuring next-generation, large-scale parallel processing arrays, some teams are relying on "heterogeneous processing": a fifty-cent phrase for a microprocessor with one or more on-board co-processors for high-speed on-node processing, most typically a GPU, FPGA, Cell, and/or DSP. While the debate continues about the right ratio of microprocessors to co-processors, most teams agree that the basic plumbing of memory management can be the real bottleneck. Today the only practical solution is to have the microprocessor and co-processors share memory on the node, and to interconnect many nodes with GigE, InfiniBand, or a custom interconnect, arranging the nodes in a distributed-memory layout.
Enter the unintended consequence of scaling. Amdahl's law says that as you add more processors, the work that cannot be parallelized and the overhead of coordination come to dominate, so each additional processor buys less. Basically, the Nth bricklayer you add to build a brick wall begins to slow things down because all the bricklayers are reaching for bricks from the same pile and get in each other's way. Add another N bricklayers and it only gets worse (the short calculation after the quote below puts numbers on this). So the idea is to complement the original processor (the first bricklayer) with a co-processor that makes that bricklayer more efficient (faster), independent of any other bricklayer. Imagine a machine that hands the bricklayer a pre-cemented brick, so all they need to do is place it. Or, there is always the old analogy:
"I know how to make 4 horses pull a cart - I don't know how to make 1024 chickens do it."
Using co-processors dodges Amdahl's law by using more powerful nodes, and thus needing fewer of them to reach the same level of performance. While this approach is successful, it puts more burden on the programmer, who must adopt a heterogeneous programming model and successfully implement it on a given node and across multiple nodes. How does the programmer deploy the algorithm in this new environment? Can it be emulated in a single simulation? How does the programmer debug a multi-node program in which every node uses co-processors? This article will discuss these basics within the tool flow and then focus primarily on memory-mapping issues, at the low end with FPGA-enabled co-processing and at the high end with thousand-processor arrays.
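To put rough numbers on the "fewer, faster nodes" argument, the sketch below uses a deliberately simple cost model that is entirely assumed for illustration (the work sizes, the 8x acceleration factor, and a coordination cost that grows linearly with node count are all made-up parameters): 64 nodes accelerated 8x by co-processors deliver the same parallel capacity as 512 plain nodes, but pay only one eighth of the coordination cost.

    #include <stdio.h>

    /* Assumed illustrative model:
       time(N, A) = serial + parallel / (N * A) + per_node * N,
       where A is the per-node acceleration from the co-processor and the
       last term stands in for synchronization/communication cost that
       grows with the number of nodes. */
    static double run_time(int nodes, double accel)
    {
        const double serial   = 1.0;     /* assumed serial work (arbitrary units) */
        const double parallel = 1000.0;  /* assumed parallelizable work           */
        const double per_node = 0.01;    /* assumed coordination cost per node    */
        return serial + parallel / (nodes * accel) + per_node * nodes;
    }

    int main(void)
    {
        /* Same aggregate parallel capacity: 512 plain nodes versus
           64 nodes each sped up 8x by an on-node co-processor. */
        printf("512 plain nodes:       %.2f\n", run_time(512, 1.0));
        printf(" 64 accelerated nodes: %.2f\n", run_time(64, 8.0));
        return 0;
    }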
In our vision of heterogeneous processing, FPGAs are tightly coupled with one or more microprocessors on a motherboard, sharing a common memory space. Distributed global memory cannot be accessed directly (by design); it is accessed instead through a message-passing interface such as MPI, across an interconnect like GigE or InfiniBand, or a custom high-performance interconnect like Cray's SeaStar network. There used to be multiple flavors of message-passing interfaces, but MPI is now the most common. Other parallel programming models, such as OpenMP and Pthreads, are alternatives, but they require the shared memory available on the nodes with the FPGAs. There are also PGAS (Partitioned Global Address Space) languages and libraries such as SHMEM, UPC, and Co-Array Fortran. These give the programmer a one-sided messaging model, which provides a global address space across the whole machine, but without cache coherency.
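For reference, the fragment below is a minimal two-sided MPI example using only standard MPI calls (the payload size and message tag are arbitrary choices for illustration): rank 0 sends a buffer to rank 1 over whichever interconnect the MPI library is layered on, be it GigE, InfiniBand, or a custom fabric.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double buf[256];                 /* assumed payload: 256 doubles */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0 && size > 1) {
            for (int i = 0; i < 256; i++) buf[i] = (double)i;
            /* Two-sided messaging: the send completes only when the
               matching receive is posted on the other node. */
            MPI_Send(buf, 256, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 256, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %g ... %g\n", buf[0], buf[255]);
        }

        MPI_Finalize();
        return 0;
    }

A PGAS library such as SHMEM replaces the matched send/receive pair with a one-sided put directly into the remote node's portion of the global address space, which is the one-sided model described above, minus cache coherency.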