@Rodney: Maybe the cost of ASIC development is not justified. What if Altera turns this into a practical compromise solution for reconfigurable computing? Has someone finally broken away from the idea that all we need is yet another computer ISA? Microcode control has been used for CISC for years, so this may be a step up from an ISA to actually running high-level language source code (C++). Yes, it can be done!
@Rodney, Algorithms change rapidly. The FPGA can track that. By the time an ASIC is in production, would it still run the algorithms you want? How about for the entire 3-year life of the kit in deployment? True, this may teach something about a good compromise special-purpose core, but there are three other stages (per DecafBad). The fascinating thing about the FPGA in this scenario is the long-term flexibility in what is a fixed plant with a big investment to pay down. We already have plenty of classical compute capacity on fixed architectures. Why rush to freeze the FPGA?
It is also interesting to wonder what other major applications could benefit from this hybrid. A data center runs a lot of different workloads.
Be careful when you suggest RTL sucks. It stands for Register Transfer Level, which describes hardware at a level of abstraction (or several) below what software engineers generally code at. With RTL you are not developing code that runs on a processor; you are developing the processor itself. Of course this means development will take longer, but what else should one expect? The outcome, however, if specified correctly, will be a design that provides significantly higher bandwidth and lower latency thanks to the massively parallel architecture that FPGAs provide. There is no processor on the planet that would be able to match it.
Is the instruction set proprietary? Is the data-flow control microcode? Horizontal?
Quote: So the soft processors offer a compromise. They're not as efficient as full-custom RTL for one particular language, but they also do not require the FPGA to be completely reconfigured whenever languages change. Instead, only the instruction memories need to be updated, and that takes much less time. End quote
Describes a good approach to general FPGA design.
It is also interesting that many languages can be used as source. Why can't this type of core be generalized as a compromise for general design? The embedded memory blocks can also be used to reduce place-and-route effort, and the relaxed circuit speed reduces the need for optimization, which helps shorten compile time.
I have a preliminary design that uses 4 memory blocks and a couple of hundred LUTs to run C source in a similar fashion. Sounds like we have similar approaches.
@TanjB is very insightful here. Different ranker models (English, French, Chinese, Spanish, etc.) each require a different set of free-form expressions. (After all, it matters whether adjectives come before or after nouns when evaluating the importance of any given search term.) So, while it is possible to heavily specialize the FFE for one model (e.g. compile the FFE expressions for English directly into a spatial dataflow graph), that won't help much when a search in a different language comes along.
So the soft processors offer a compromise. They're not as efficient as full-custom RTL for one particular language, but they also do not require the FPGA to be completely reconfigured whenever languages change. Instead, only the instruction memories need to be updated, and that takes much less time.
It is important to note that the 60-core free-form expression [FFE] soft processor is only one of four stages in the processing pipeline that we implemented for doing Bing page ranking. The other three stages were more specialized than the FFE soft-processor cores.
Also, saying that the FPGAs "implement" C++ is a generalization of what is really going on. The FFE soft processors implement a custom instruction set aimed specifically at efficiently executing free-form expressions. The expressions happen to be coded in C++, but are then compiled via the Phoenix compiler into this customized FFE instruction set. Phoenix can take C++, as well as a wide variety of other languages, and compile them to the custom FFE instruction set. So there is nothing special about C++ here. The code only needs to be written in a language that Phoenix [or any similar compiler] can take as input.
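The real FFE instruction set and the Phoenix compiler are not public, so the following Python toy is only an illustrative sketch of the idea described above: a fixed "soft processor" (here a stack-machine interpreter) whose behavior is set entirely by the instruction stream loaded into it, so switching models means reloading instruction memory rather than rebuilding the processor. All names (`compile_expr`, `run`, the feature and opcode names) are invented for this example.

```python
# Hypothetical sketch -- not the actual FFE ISA or Phoenix output.

def compile_expr(tokens):
    """'Compile' a postfix expression over named features into a toy
    instruction stream (the analog of filling an instruction memory)."""
    program = []
    for tok in tokens:
        if tok in ("ADD", "MUL", "MAX"):
            program.append((tok, None))
        else:
            program.append(("LOAD", tok))  # push a feature value
    return program

def run(program, features):
    """A stack-machine 'soft processor': a fixed datapath whose behavior
    is determined entirely by the contents of its instruction memory."""
    stack = []
    for op, arg in program:
        if op == "LOAD":
            stack.append(features[arg])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "MAX":
            b, a = stack.pop(), stack.pop()
            stack.append(max(a, b))
    return stack.pop()

# Two "ranker models" = two instruction streams for the same processor.
english = compile_expr(["tf", "idf", "MUL", "bm25", "ADD"])
french  = compile_expr(["tf", "idf", "MUL", "proximity", "MAX"])

feats = {"tf": 2.0, "idf": 3.0, "bm25": 1.5, "proximity": 5.0}
print(run(english, feats))  # 2.0*3.0 + 1.5 = 7.5
print(run(french, feats))   # max(2.0*3.0, 5.0) = 6.0
```

The interpreter (the "hardware") never changes between the two models; only the program does — which is the compromise the article describes between full-custom RTL per model and full FPGA reconfiguration.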
@SwanOnChips: Your observation is very clever. You are correct – the picture shows 48 cores rather than the 60 cores described in the ISCA and Hot Chips publications. The reason is far less technical than one might imagine – the Xilinx PlanAhead tool produces more aesthetically pleasing pictures than Altera's Chip Planner. I used an old picture from the implementation on an early Xilinx prototype because I think it better illustrated the area ratios of each component of the FFE. The Altera implementation used in the pilot had 60 cores.
Thanks to everyone for the constructive comments and questions. I appreciate the feedback.
@cd2012: We started this project in 2011, well before either Xilinx or Altera announced any support for OpenCL. If we were to start anew today, we would certainly look deeply into OpenCL, along with a number of other very promising tools/languages/methodologies, some of which have been mentioned in the comments already.
However, @TanjB is right – the vast majority of the effort in this project was focused on building a system that integrated well in the datacenter environment. I encourage you to read the ISCA paper and (when it's available) check out the Hot Chips presentation describing the kinds of challenges that must be addressed in order to integrate FPGAs into the datacenter.
The micrograph pictures 48 adapter cards, not 48 cores within an FPGA. Each of those 48 cards is attached to a separate server, and the 6 x 8 arrangement networks the FPGAs in that rack. Functionality flows across multiple FPGA chips if the algorithm is large.