I'm happy to see that FPGAs are finding their way into datacenters for unexpected uses (I mean compared to their traditional uses), but at the same time they were far from being the first choice: "we looked at software, then GPUs and then FPGAs". And why?
Because "The FPGA tools are too slow, there are too many warnings and not great debugging -- but that's not new." Combine that with the fact that RTL is still horrible to write (and whether you use old or new HDLs does not change a lot IMO), and they had to come up with "a middle ground between C++ and RTL where people can program". Interesting, it sounds a lot like Cx.
Of course if Microsoft and Baidu had known about ngDesign, they could have used an Eclipse-based IDE featuring on-the-fly error checking and fast code generation. We have plans for a fast simulator + debugger, so Microsoft and Baidu, feel free to contact us!
Be careful when you suggest RTL sucks. It stands for Register Transfer Logic, which is a language that is a level of abstraction or abstractions lower than what Software Engineers generally code. With RTL, you are not developing code that runs on a processor, you are developing the processor itself. Of course this means the development will take longer, but what else should one expect? The outcome, however, if specified correctly, will be a design that provides significantly higher bandwidth and lower latency due to the massively parallel architecture that FPGAs provide. There is no processor on the planet that would be able to match it.
I am fully aware of what RTL stands for and what it is used for, namely designing hardware circuits at a low level of abstraction. The thing is that it doesn't have to be this way: you can actually write code structured as tasks running in parallel, each one performing computations described as sequential C-like code. In fact, that's what we're doing at Synflow, we've created a new programming language called Cx that actually makes hardware design fun again! And yes, I believe that RTL kind of sucks, which is a good reason why Microsoft ended up designing softcores to program them in C++ on FPGA lol
RTL Register Transfer LEVEL (of abstraction, not language). Hardware Description Language i.e. Verilog VHDL.
HDLs should be the result of logic design, NOT used for design input. HOWEVER Verilog was defined for circuit level simulation. THEN came synthesis to infer logic that would create the same circuits and is limited by the available set of logic functions.
There are no such limits for logic design functions.
Worse yet, there are no logic design tools, only simulators that require HDL compilation first, which takes an unacceptable amount of time. (ModelSim can model HDL, but generally a netlist generated by compile is used.)
Programming and Logic Design are totally different. Programming manipulates data sequentially. Hardware gates registers to data flow for data manipulation. There must be sufficient time for the data to stabilize before capturing the result. This time is of no concern to the programmer.
Every hardware register must not change while the contents are being used. This would apply to dataflow programming also.
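To illustrate the point about registers staying stable while their contents are in use, here is a minimal Python sketch. It is not from any tool mentioned in this thread; the function names and the two-register swap are hypothetical, chosen only to contrast a sequential update with a clocked one.

```python
# Hypothetical two-register example: exchange the contents of A and B.

def sequential_swap(a, b):
    """Software-style update: each assignment sees the result of the previous one."""
    a = b  # a is overwritten here...
    b = a  # ...so b just receives its own old value back
    return a, b


def clocked_swap(a, b):
    """Register-transfer-style update: compute every next value from the
    current (stable) register contents, then commit them together."""
    next_a = b  # the logic reads only the current state
    next_b = a
    return next_a, next_b  # models the clock edge: both registers load at once


print(sequential_swap(1, 2))  # (2, 2)
print(clocked_swap(1, 2))     # (2, 1)
```

The second function only works because neither register changes while the next values are being computed, which is the constraint described above.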
Hardware consists of data-flow and control logic. Programming is control flow and variables.
We don't need another programming language, rather a tool that accepts text to define registers, memories, and logic functions. The key is to complete the logic design before creating HDL(RTL).
@betajet: A programming language produces a new executable (.exe). The existing tools do not compile a new executable either, so to hopefully clarify:
A program language compiler itself does not change, but generates object code that must run on a computer.
"compile" as applied to an HDL generates a netlist that is then processed by other programs in the tool chain. They do not generate executables -- HLS??
To me HDL represents more the physical circuits and not the logical function. Synthesis is limited by the ability of humans to think of unique logic functions that would generate the particular HDL.
An example of formatted data is a spreadsheet that can manipulate input without a compile step. A program could be written to do the same thing, but it would have to be compiled then run.
Anyway, the existing tool chains based on HDLs start in the middle of the design process and essentially ignore the functional/logical aspects. That makes them poor for debug. They assume a completed design and are nearly useless for debug and design iterations. And they generate all kinds of messages about physical rather than functional things.
What I suggested is a program to take input that defines the data-flow and the logic to control the gating of registers and memories. (just like the old days when there was logic design and data-flow design then humans would transcribe the logic into machine readable format for automated wiring). NO, NO, NO I really do not mean do it manually let a program tie things together and do the transcription. Just do what the existing tools fail to do.
@ Matthieu: "they had to come up with "a middle ground between C++ and RTL where people can program". Interesting, it sounds a lot like Cx"
If Cx produces HDL that must be processed by the tool chain and results in configuring the FPGA and they compile C++ to generate memory content, how can they be alike?
They also stated that it takes too long to reconfigure and the tools are too slow, so it must be they found a fast way to execute C++ source. And they run many threads in parallel, so the threads are probably independent since there is no mention of an OS to synchronize the execution.
@KarlS01: what I meant was that both approaches are comparable in being at a higher level than RTL, but lower level than C++. Now you're right that the technologies are not alike, Cx is for hardware design, whereas C++ is still used in that case to write software (even if that software runs on soft cores)
What I had understood by "tools are too slow" was that they were talking about synthesis, place and route tools, etc. Which could be a good reason for using softcores, since you only have to synthesize once, and then you write software that runs on the softcore. My guess is that they used FPGAs to create their own parallel processing platform suited to their needs. I think they could have used a many-core processor (like Adapteva's Epiphany) for that matter, but it did not yet exist when they started the project. And perhaps it did not meet their needs either, after all they looked at GPU first before deciding on FPGA.
@Matthieu: Agreed "tools too slow" as you said. Seems they also think configuration time of FPGA is too long compared to loading memory.
I think Cx approach is more like what logic design used to be. (design first then synthesize) It just seems better to do the design rather than use "inference" to synthesize and live with what you get.
There is still the point of just loading memory being so much faster than synthesis. Micro-code is still being used in CPUs and it also can be used for hardware control as a performance compromise.
I will have to think more about whether it can complement Cx.
At this stage more of the engineering effort revolves around system compatibility with the data center environment (power, form factor, networking, etc.), and scheduling use of resources distributed across a network of FPGAs, than the relatively solved problem of programming individual FPGAs.
I would be curious to see the trade study between using FPGAs and using GPUs with either CUDA or OpenCL to accelerate the datacenter. As for the C++ interface, that's really nice for us old-timers, but show me a C++ programmer who is under 40-something and I'll buy you a Starbucks coffee :)
Thanks to everyone for the constructive comments and questions. I appreciate the feedback.
@cd2012: We started this project in 2011, well before either Xilinx or Altera announced any support for OpenCL. If we were to start anew today, we would certainly look deeply into OpenCL, along with a number of other very promising tools/languages/methodologies, some of which have been mentioned in the comments already.
However, @TanjB is right – the vast majority of the effort in this project was focused on building a system that integrated well in the datacenter environment. I encourage you to read the ISCA paper and (when it's available) check out the Hot Chips presentation describing the kinds of challenges that must be addressed in order to integrate FPGAs into the datacenter.
@cd2012: My comment alludes to the answer. It's because they did not implement SW as HW which is what the FPGA suppliers will do with OpenCL. For whatever reason Microsoft chose to implement many-core SOFT processors on the FPGA's, so the FPGA's are still running SW, albeit multi-threaded.
Ranking is a particular kind of processing. It is uniform in that the same functions are used by all rankers. However, there are in practice hundreds of ranker models (think about all the languages, and then all the distinct modes of query - science, news, celebrity, commerce, etc) which use those functional primitives in different subsets and with different combinations. A particular model can project into the fabric a set of standard primitives it has chosen to use and then bind them in a custom data flow and weighting.
@TanjB is very insightful here. Different ranker models (e.g. English, French, Chinese, Spanish, etc...) each require a different set of Free Form Expressions. (After all, it matters if adjectives come before or after nouns when evaluating the importance of any given search term). So, while it is possible to heavily specialize FFE for one model (e.g. compile the FFE expressions for English directly into a spatial dataflow graph), that won't help much when a search in a different language comes along.
So the soft processors offer a compromise. They're not as efficient as full-custom RTL for one particular language, but they also do not require the FPGA to be completely reconfigured whenever languages change. Instead, only the instruction memories need to be updated, and that takes much less time.
Is the instruction set proprietary? Is the data flow control micro-code? horizontal?
Quote: So the soft processors offer a compromise. They're not as efficient as full-custom RTL for one particular language, but they also do not require the FPGA to be completely reconfigured whenever languages change. Instead, only the instruction memories need to be updated, and that takes much less time. End quote
Describes a good approach to general FPGA design.
It is also interesting that many languages can be used as source. Why cannot this type of core be generalized as a compromise for general design? The embedded memory blocks can also be used to reduce place and route and the circuit speed reduces the need for optimization to help shorten compile time.
I have a preliminary design that uses 4 memory blocks and a couple of hundred LUTs to run C source in a similar fashion. Sounds like we have similar approaches.
It is important to note that the 60 core freeform expression [FFE] soft processor is only one of four stages in the processing pipeline that we implemented for doing Bing page ranking. There are 3 other stages which were more specialized than the FFE soft processor cores.
Also, saying that the FPGAs "implement" C++ is a generalization of what is really going on. The FFE soft processors implement a custom instruction set aimed specifically at efficiently executing free form expressions. The expressions happen to be coded in C++, but are then compiled via the Phoenix compiler into this customized FFE instruction set. Phoenix can take C++, as well as a wide variety of other languages, and compile them to the custom FFE instruction set. So there is nothing special about C++ here. The code only needs to be a language that Phoenix [or any similar compiler] can take as an input.
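As a rough illustration of the flow described here (expressions compiled into a custom instruction set, then executed by a fixed core), the following Python sketch compiles arithmetic expressions into a tiny stack-machine program. Everything in it is an assumption made for illustration: the opcodes, the `compile_expr` and `run` helpers, and the feature names `hits` and `length` are hypothetical and have nothing to do with the real FFE instruction set or the Phoenix compiler.

```python
import ast
import operator

# Hypothetical instruction set: PUSH constant, LOAD feature, and four ALU ops.
ALU_OPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL", ast.Div: "DIV"}
ALU_FUNCS = {"ADD": operator.add, "SUB": operator.sub,
             "MUL": operator.mul, "DIV": operator.truediv}


def compile_expr(source):
    """Compile an arithmetic expression string into a list of stack instructions."""
    program = []

    def emit(node):
        if isinstance(node, ast.BinOp) and type(node.op) in ALU_OPS:
            emit(node.left)
            emit(node.right)
            program.append((ALU_OPS[type(node.op)], None))
        elif isinstance(node, ast.Name):
            program.append(("LOAD", node.id))
        elif isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            program.append(("PUSH", node.value))
        else:
            raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    emit(ast.parse(source, mode="eval").body)
    return program


def run(program, features):
    """The fixed 'core': executes whatever program sits in its instruction memory."""
    stack = []
    for opcode, arg in program:
        if opcode == "PUSH":
            stack.append(arg)
        elif opcode == "LOAD":
            stack.append(features[arg])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(ALU_FUNCS[opcode](left, right))
    return stack.pop()


features = {"hits": 3, "length": 10}          # hypothetical input features
model_a = compile_expr("hits * 2 + length")   # one "ranker model"
model_b = compile_expr("(hits + 1) / length") # a different one, same core
print(run(model_a, features))  # 16
print(run(model_b, features))  # 0.4
```

Switching from `model_a` to `model_b` changes only the program handed to `run`; the core itself is untouched. In this toy setting, that is the analogue of updating instruction memories instead of reconfiguring the FPGA.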
@Swan: "For whatever reason Microsoft chose to implement many-core SOFT processors on the FPGA's, so the FPGA's are still running SW, albeit multi-threaded."(parallel multi-threaded)
They stated that it is faster to just re-load memories than to reconfigure the FPGA.
The soft cores run at a lower clock frequency, but still accelerate the application. There are many soft CPUs available, but they designed a soft core -- why? Traditional CPUs are memory bound in that there is so much load and store overhead to get the operands into registers and to put the result back into memory. Just streaming the FFE into local memory and storing the result of the algorithm is important. Doing many threads in parallel is even better. Compiling C++ source for a custom soft core without a classical ISA is another plus.
FPGA design is totally different than SW programming. Data is processed as it moves thru the FPGA dataflow. SW selects operands and operators repetitively to do processing which requires a processing step for each operator while multiple operators can be evaluated per step in HW.
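A small, idealized Python sketch of that difference: it counts the steps needed to evaluate one expression when one operator is applied per step, versus when all operators whose inputs are ready are evaluated in the same step. The expression and function names are hypothetical, and the model ignores load/store overhead, routing delays, and the fact that real CPUs can issue more than one operation per cycle.

```python
# Expression tree as nested tuples: (operator, left, right); leaves are operand names.
# Hypothetical expression: (a + b) * (c + d) - (e * f)
expr = ("-", ("*", ("+", "a", "b"), ("+", "c", "d")), ("*", "e", "f"))


def software_steps(node):
    """One operator per step: the count is the total number of operators."""
    if isinstance(node, str):
        return 0
    _, left, right = node
    return 1 + software_steps(left) + software_steps(right)


def dataflow_steps(node):
    """Independent operators evaluate together: the count is the tree depth."""
    if isinstance(node, str):
        return 0
    _, left, right = node
    return 1 + max(dataflow_steps(left), dataflow_steps(right))


print(software_steps(expr))  # 5
print(dataflow_steps(expr))  # 3
```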
Also, OpenCL is mostly matrix manipulation, so while it may allow matrix algorithms to be programmed in C++, it is most likely a subset and probably not the subset needed for FFE evaluation.
What intrigued me about Microsoft's use of the FPGA's is they implemented ranking of the web search results NOT with HW implementation of SW, but with 60 SOFT processor cores on the FPGA! Since each core handles 4 threads, that's a total of 240 threads. (Though the micrograph in their slide pictures only 48 cores (8 clusters of 6 cores).)
@SwanOnChips: Your observation is very clever. You are correct – the picture shows 48 cores rather than the 60 cores described in the ISCA and Hot Chips publications. The reason is far less technical than one might imagine – the Xilinx PlanAhead tool produces more aesthetically-pleasing pictures than Altera's Chip Planner. I used an old picture from the implementation on an early Xilinx prototype because I think it better illustrated the area ratios of each component of FFE. The Altera implementation used in the pilot had 60 cores.
The micrograph pictures 48 adapter cards, not 48 cores within an FPGA. Each of those 48 is attached to a separate server, and the 6 x 8 arrangement is networking the FPGAs in that rack. Functionality flows across multiple FPGA chips if the algorithm is large.
The processor is multi-core shared ALU design as opposed to conventional FPGA design. It is also multi-threaded so the speedup is due to multi-thread execution rather than traditional FPGA pipe-lined computation.
This may mean that a different design approach is needed. Technology evolution may have made the traditional approach of maximizing fmax obsolete.
@Rodney, Algorithms change rapidly. The FPGA can track that. By the time an ASIC is in production would it still run the algorithms you want? How about for the entire 3 year life of the kit in deployment? True this may teach something about a good compromise special core but there are 3 other stages (per DecafBad). The fascinating thing about FPGA in this scenario is the long term flexibility in what is a fixed plant with a big investment to pay down. We already have plenty of classical compute capacity on fixed architectures. Why rush to freeze the FPGA?
It is also interesting to wonder what other major applications can benefit from this hybrid. A data center runs a lot of different stuff.
The algorithms may change, but that doesn't mean that the processor architecture must change.
FPGAs are expensive solutions to mass production systems. An adequate ASP, this is engineering after all, could be substantially less expensive than the FPGAs and also substantially reduce the data center power consumption. They would also be capable of higher clock speeds than an FPGA.
@Rodney: Maybe the cost of an ASIC development is not justified. What if Altera turns this into a practical/compromise solution to re-configurable computing? Finally someone broke away from the idea that all we need is still another computer ISA? For years micro-code control has been used for CISC so this may be a step up from ISA to actually running high level language source code(C++). Yes, it can be done!!