
M'soft Plugs FPGAs in Datacenter

China's Baidu adopts FPGAs, too
8/12/2014 07:27 PM EDT
Matthieu Wipliez
User Rank
Manager
Hint: RTL sucks.
Matthieu Wipliez   8/13/2014 6:30:29 AM
I'm happy to see that FPGAs are finding their way into datacenters for unexpected uses (unexpected compared to their traditional uses, I mean), but at the same time they were far from being the first choice: "we looked at software, then GPUs and then FPGAs". And why?

Because "The FPGA tools are too slow, there are too many warnings and not great debugging -- but that's not new." Combine that with the fact that RTL is still horrible to write (and whether you use old or new HDLs does not change a lot IMO), and they had to come up with "a middle ground between C++ and RTL where people can program". Interesting, it sounds a lot like Cx.

Of course, if Microsoft and Baidu had known about ngDesign, they could have used an Eclipse-based IDE featuring on-the-fly error checking and fast code generation. We have plans for a fast simulator + debugger. Microsoft and Baidu, feel free to contact us!

Glenadush
User Rank
Rookie
Re: Hint: RTL sucks.
Glenadush   8/15/2014 10:37:53 AM
Hi Matthieu,

Be careful when you suggest RTL sucks. It stands for Register Transfer Logic, which is a level of abstraction (or several) lower than what software engineers generally code at. With RTL, you are not developing code that runs on a processor; you are developing the processor itself. Of course this means development will take longer, but what else should one expect? The outcome, however, if specified correctly, will be a design that provides significantly higher bandwidth and lower latency, thanks to the massively parallel architecture that FPGAs provide. There is no processor on the planet that would be able to match it.

 

Matthieu Wipliez
User Rank
Manager
Re: Hint: RTL sucks.
Matthieu Wipliez   8/20/2014 5:13:30 AM
Hi Glenadush,

I am fully aware of what RTL stands for and what it is used for, namely designing hardware circuits at a low level of abstraction. The thing is that it doesn't have to be this way: you can actually write code structured as tasks running in parallel, each one performing computations described as sequential C-like code. In fact, that's what we're doing at Synflow: we've created a new programming language called Cx that actually makes hardware design fun again! And yes, I believe that RTL kind of sucks, which is a good reason why Microsoft ended up designing soft cores on the FPGA and programming them in C++, lol
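The style described here -- independent parallel tasks, each written as plain sequential code and communicating over channels -- can be sketched in ordinary Python (threads and queues standing in for hardware processes; this is an analogy, not Cx syntax, and all names are invented for illustration):

```python
# Toy version of "tasks running in parallel, each one sequential":
# two producer tasks feed a consumer over queues. In hardware these
# would be concurrent processes connected by FIFOs.
import queue
import threading

def producer(q, values):
    for v in values:        # plain sequential code inside the task
        q.put(v)
    q.put(None)             # end-of-stream marker

def consumer(q1, q2, out):
    total = 0
    for q in (q1, q2):
        while (v := q.get()) is not None:
            total += v
    out.append(total)

q1, q2, out = queue.Queue(), queue.Queue(), []
tasks = [threading.Thread(target=producer, args=(q1, [1, 2, 3])),
         threading.Thread(target=producer, args=(q2, [10, 20])),
         threading.Thread(target=consumer, args=(q1, q2, out))]
for t in tasks: t.start()
for t in tasks: t.join()
print(out[0])  # 36
```

Each task's body is ordinary sequential code; the concurrency lives entirely in how the tasks are wired together, which is the property being claimed for Cx-style design.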

KarlS01
User Rank
Manager
Re: Hint: RTL sucks === HDL tools suck
KarlS01   8/26/2014 2:34:02 PM
  1. RTL = Register Transfer LEVEL (of abstraction, not Language); HDL = Hardware Description Language, i.e. Verilog or VHDL.
  2. HDLs should be the result of logic design, NOT used for design input.  HOWEVER, Verilog was defined for circuit-level simulation.  THEN came synthesis to infer logic that would create the same circuits, and it is limited by the available set of logic functions.
  3. There are no such limits for logic design functions.
  4. Worse yet, there are no logic design tools, only simulators that require HDL compilation first, which takes an unacceptable amount of time.  (ModelSim can model HDL, but generally a netlist generated by compile is used.)
  5. Programming and logic design are totally different.  Programming manipulates data sequentially.  Hardware gates registers to the data flow for data manipulation.  There must be sufficient time for the data to stabilize before capturing the result; this time is of no concern to the programmer.
  6. Every hardware register must not change while its contents are being used.  This would apply to dataflow programming also.
  7. Hardware consists of data flow and control logic.  Programming is control flow and variables.
  8. We don't need another programming language, rather a tool that accepts text to define registers, memories, and logic functions.  The key is to complete the logic design before creating HDL (RTL).
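The contrast in points 5 and 7 can be illustrated with a toy sketch (Python; the example and names are invented for illustration): a sequential program applies one operator per step, while hardware evaluates independent operators in the same clock cycle.

```python
# Toy illustration (not real HDL): computing y = (a + b) * (c - d).

# A sequential processor touches one operator per step:
def sequential(a, b, c, d):
    t1 = a + b      # step 1
    t2 = c - d      # step 2
    return t1 * t2  # step 3 -> three "instructions"

# A hardware dataflow evaluates independent operators in the same
# clock cycle; here one dict update stands in for one cycle.
def dataflow_cycle(a, b, c, d):
    # the adder and subtractor operate in parallel in cycle 1
    stage1 = {"sum": a + b, "diff": c - d}
    # the multiplier consumes both results in cycle 2
    return stage1["sum"] * stage1["diff"]  # two cycles total

print(sequential(2, 3, 5, 1))      # 20
print(dataflow_cycle(2, 3, 5, 1))  # 20
```

Same result, but the dataflow version finishes in two "cycles" instead of three "instructions" because independent operators run side by side -- the essence of the HW/SW difference Karl describes.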


betajet
User Rank
CEO
Re: Hint: RTL sucks === HDL tools suck
betajet   8/26/2014 6:14:38 PM
Karl wrote: We don't need another programming language, rather a tool that accepts text to define registers, memories, and logic functions.

Sounds like another programming language to me :-)

I like your point that (sequential) programming and logic design are quite different.  I certainly think differently when doing one versus the other, even if the languages used look similar.

KarlS01
User Rank
Manager
Re: Hint: RTL sucks === HDL tools suck
KarlS01   8/27/2014 10:11:44 AM
@betajet:  A programming language produces a new executable (.exe). The existing tools do not compile a new executable either, so to hopefully clarify:
  1. A programming language compiler itself does not change, but generates object code that must run on a computer.
  2. "Compile" as applied to an HDL generates a netlist that is then processed by other programs in the tool chain.  They do not generate executables -- HLS??
  3. To me, HDL represents the physical circuits more than the logical function.  Synthesis is limited by the ability of humans to think of the unique logic functions that would generate the particular HDL.
  4. An example of formatted data is a spreadsheet, which can manipulate input without a compile step.  A program could be written to do the same thing, but it would have to be compiled and then run.
  5. Anyway, the existing tool chains based on HDLs start in the middle of the design process and essentially ignore the functional/logical aspects.  They assume a completed design, are nearly useless for debug and design iterations, and generate all kinds of messages about physical rather than functional things.

What I suggested is a program that takes input defining the data flow and the logic that controls the gating of registers and memories -- just like the old days, when there was logic design and data-flow design, and humans would transcribe the logic into machine-readable format for automated wiring. NO, NO, NO, I really do not mean do it manually: let a program tie things together and do the transcription. Just do what the existing tools fail to do.

Thanks.

KarlS01
User Rank
Manager
Re: Hint: RTL sucks.
KarlS01   8/20/2014 2:46:59 PM
@Matthieu: "they had to come up with 'a middle ground between C++ and RTL where people can program'. Interesting, it sounds a lot like Cx"

If Cx produces HDL that must be processed by the tool chain and results in configuring the FPGA, while they compile C++ to generate memory content, how can the two be alike?

They also stated that it takes too long to reconfigure and the tools are too slow, so it must be that they found a fast way to execute C++ source. And they run many threads in parallel, so the threads are probably independent, since there is no mention of an OS to synchronize the execution.

Matthieu Wipliez
User Rank
Manager
Re: Hint: RTL sucks.
Matthieu Wipliez   8/20/2014 3:55:26 PM
@KarlS01: what I meant was that both approaches are comparable in being at a higher level than RTL but a lower level than C++. Now, you're right that the technologies are not alike: Cx is for hardware design, whereas C++ is still used in that case to write software (even if that software runs on soft cores).

What I had understood by "tools are too slow" was that they were talking about synthesis, place and route tools, etc. Which could be a good reason for using soft cores: you only have to synthesize once, and then you write software that runs on the soft core. My guess is that they used FPGAs to create their own parallel processing platform suited to their needs. I think they could have used a many-core processor (like Adapteva's Epiphany) for that purpose, but it did not yet exist when they started the project. And perhaps it would not have met their needs either; after all, they looked at GPUs first before deciding on FPGAs.

KarlS01
User Rank
Manager
Re: Hint: RTL sucks.
KarlS01   8/20/2014 5:13:17 PM
@Matthieu:  Agreed, "tools too slow," as you said. It seems they also think the configuration time of the FPGA is too long compared to loading memory.

I think the Cx approach is more like what logic design used to be (design first, then synthesize). It just seems better to do the design rather than use "inference" to synthesize and live with what you get.

There is still the point of just loading memory being so much faster than synthesis. Micro-code is still being used in CPUs, and it can also be used for hardware control as a performance compromise.

I will have to think more about whether it can complement Cx.

 

cd2012
User Rank
Manager
Using C++
cd2012   8/13/2014 11:22:31 AM
Both FPGA vendors offer OpenCL, which allows programmers to use C++. I wonder why this team didn't use it. Was it not mature enough at the time? Or did OpenCL stand in the way of a performance gain?

TanjB
User Rank
Rookie
Re: Using C++
TanjB   8/13/2014 2:15:43 PM
At this stage more of the engineering effort revolves around system compatibility with the data center environment (power, form factor, networking, etc.), and scheduling use of resources distributed across a network of FPGAs, than the relatively solved problem of programming individual FPGAs.

AZskibum
User Rank
CEO
Re: Using C++
AZskibum   8/13/2014 8:54:37 PM
I would be curious to see the trade study between using FPGAs and using GPUs with either CUDA or OpenCL to accelerate the datacenter. As for the C++ interface, that's really nice for us old-timers, but show me a C++ programmer who is under 40-something and I'll buy you a Starbucks coffee :)

rick merritt
User Rank
Author
Re: Using C++
rick merritt   8/14/2014 1:16:14 PM
@cd2012: Great question. I'll ask Andrew to weigh in!

SwanOnChips
User Rank
Freelancer
Re: Using C++
SwanOnChips   8/14/2014 8:23:10 PM
@Merritt, now that I have answered the question for @cd2012, the real question I think to ask Microsoft is embedded in my answer. Want to ask them that?

0xDECAFBAD
User Rank
Rookie
Re: Using C++
0xDECAFBAD   8/15/2014 2:03:50 AM
Thanks to everyone for the constructive comments and questions. I appreciate the feedback.

@cd2012:  We started this project in 2011, well before either Xilinx or Altera announced any support for OpenCL. If we were to start anew today, we would certainly look deeply into OpenCL, along with a number of other very promising tools/languages/methodologies, some of which have been mentioned in the comments already.

However, @TanjB is right – the vast majority of the effort in this project was focused on building a system that integrated well in the datacenter environment. I encourage you to read the ISCA paper and (when it's available) check out the Hot Chips presentation describing the kinds of challenges that must be addressed in order to integrate FPGAs into the datacenter.

SwanOnChips
User Rank
Freelancer
Re: Using C++
SwanOnChips   8/14/2014 8:20:38 PM
@cd2012: My comment alludes to the answer. It's because they did not implement SW as HW, which is what the FPGA suppliers will do with OpenCL. For whatever reason, Microsoft chose to implement many-core SOFT processors on the FPGAs, so the FPGAs are still running SW, albeit multi-threaded.

TanjB
User Rank
Rookie
using soft cores
TanjB   8/15/2014 1:36:22 AM
Ranking is a particular kind of processing.  It is uniform in that the same functions are used by all rankers.  However, there are in practice hundreds of ranker models (think about all the languages, and then all the distinct modes of query - science, news, celebrity, commerce, etc) which use those functional primitives in different subsets and with different combinations.  A particular model can project into the fabric a set of standard primitives it has chosen to use and then bind them in a custom data flow and weighting.

 

0xDECAFBAD
User Rank
Rookie
Re: using soft cores
0xDECAFBAD   8/15/2014 2:40:05 AM
@TanjB is very insightful here. Different ranker models (e.g. English, French, Chinese, Spanish, etc...) each require a different set of Free Form Expressions. (After all, it matters if adjectives come before or after nouns when evaluating the importance of any given search term). So, while it is possible to heavily specialize FFE for one model (e.g. compile the FFE expressions for English directly into a spatial dataflow graph), that won't help much when a search in a different language comes along.

So the soft processors offer a compromise. They're not as efficient as full-custom RTL for one particular language, but they also do not require the FPGA to be completely reconfigured whenever languages change. Instead, only the instruction memories need to be updated, and that takes much less time.
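The compromise described here -- a fixed datapath whose behavior changes by reloading instruction memory rather than by reconfiguring the FPGA -- can be sketched as a toy interpreter (Python; the instruction set below is invented for illustration and is not the actual FFE ISA):

```python
# Toy soft core: a fixed "datapath" (the interpreter) whose behavior
# changes by reloading instruction memory, not by re-synthesis.
def run(imem, a, b):
    """Execute a tiny accumulator program on inputs a and b."""
    acc = 0
    for op, arg in imem:
        operand = {"a": a, "b": b}[arg]
        if op == "LDA":
            acc = operand          # load accumulator
        elif op == "ADD":
            acc += operand
        elif op == "MUL":
            acc *= operand
    return acc

# "Model 1": acc = a + b
prog_add = [("LDA", "a"), ("ADD", "b")]
# "Model 2": acc = a * b -- same hardware, new instruction memory
prog_mul = [("LDA", "a"), ("MUL", "b")]

print(run(prog_add, 3, 4))  # 7
print(run(prog_mul, 3, 4))  # 12
```

Swapping `prog_add` for `prog_mul` is analogous to updating the instruction memories when the ranker model changes: the expensive part (the interpreter/datapath) never changes.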

KarlS01
User Rank
Manager
Re: using soft cores
KarlS01   8/15/2014 9:47:26 AM
@DECAFBAD:

Is the instruction set proprietary?   Is the data flow control micro-code? horizontal?

The following

Quote: So the soft processors offer a compromise. They're not as efficient as full-custom RTL for one particular language, but they also do not require the FPGA to be completely reconfigured whenever languages change. Instead, only the instruction memories need to be updated, and that takes much less time.  End quote

Describes a good approach to general FPGA design.

It is also interesting that many languages can be used as source. Why can't this type of core be generalized as a compromise for general design? The embedded memory blocks can also be used to reduce place and route, and the modest circuit speed reduces the need for optimization, which helps shorten compile time.

I have a preliminary design that uses 4 memory blocks and a couple of hundred LUTs to run C source in a similar fashion.  Sounds like we have similar approaches.

0xDECAFBAD
User Rank
Rookie
Re: Using C++
0xDECAFBAD   8/15/2014 2:27:07 AM
It is important to note that the 60 core freeform expression [FFE] soft processor is only one of four stages in the processing pipeline that we implemented for doing Bing page ranking. There are 3 other stages which were more specialized than the FFE soft processor cores.

Also, saying that the FPGAs "implement" C++ is a generalization of what is really going on. The FFE soft processors implement a custom instruction set aimed specifically at efficiently executing free form expressions. The expressions happen to be coded in C++, but are then compiled via the Phoenix compiler into this customized FFE instruction set. Phoenix can take C++, as well as a wide variety of other languages, and compile them to the custom FFE instruction set. So there is nothing special about C++ here. The code only needs to be a language that Phoenix [or any similar compiler] can take as an input.
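The flow described here -- source expression in, custom instruction stream out -- can be illustrated with a toy lowering pass (Python; the opcodes and the stack machine are invented stand-ins, not Phoenix or the real FFE instruction set):

```python
# Toy "compiler": lower a nested expression tree to a linear
# instruction stream for a stack machine, then execute it.
def compile_expr(expr):
    """Numbers become PUSH; ('ADD'|'MUL', lhs, rhs) lowers recursively."""
    if isinstance(expr, (int, float)):
        return [("PUSH", expr)]
    op, lhs, rhs = expr
    return compile_expr(lhs) + compile_expr(rhs) + [(op, None)]

def execute(code):
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append({"ADD": a + b, "MUL": a * b}[op])
    return stack.pop()

# (2 + 3) * 4, written as a tree and lowered to instructions
code = compile_expr(("MUL", ("ADD", 2, 3), 4))
print(execute(code))  # 20
```

The point is that the source notation is incidental -- any front end that can produce the tree can target the same instruction stream, which mirrors the claim that there is nothing special about C++ as the input language.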

KarlS01
User Rank
Manager
Re: Using C++
KarlS01   8/20/2014 1:30:11 PM
@Swan: "For whatever reason Microsoft chose to implement many-core SOFT processors on the FPGA's, so the FPGA's are still running SW, albeit multi-threaded." (parallel multi-threaded)
  • They stated that it is faster to just re-load memories than to reconfigure the FPGA.
  • The soft cores run at a lower clock frequency, but still accelerate the application.  There are many soft CPUs available, but they designed their own soft core -- why?  Traditional CPUs are memory bound, in that there is so much load and store overhead to get the operands into registers and to put the result back into memory.  Just streaming the FFE into local memory and storing the result of the algorithm is important.  Doing many threads in parallel is even better.  Compiling C++ source for a custom soft core without a classical ISA is another plus.
  • FPGA design is totally different from SW programming.  Data is processed as it moves through the FPGA dataflow.  SW selects operands and operators repetitively, requiring a processing step for each operator, while multiple operators can be evaluated per step in HW.
  • Also, OpenCL is mostly matrix manipulation, so while it may allow matrix algorithms to be programmed in C++, it is most likely a subset, and probably not the subset needed for FFE evaluation.

 

SwanOnChips
User Rank
Freelancer
Microsoft did not simply implement SW in HW
SwanOnChips   8/14/2014 2:17:25 AM
What intrigued me about Microsoft's use of the FPGAs is that they implemented ranking of the web search results NOT with a HW implementation of SW, but with 60 SOFT processor cores on the FPGA! Since each core handles 4 threads, that's a total of 240 threads. (Though the micrograph in their slide pictures only 48 cores -- 8 clusters of 6 cores.)

rick merritt
User Rank
Author
Re: Microsoft did not simply implement SW in HW
rick merritt   8/14/2014 1:15:22 PM
@Swan: Thanks for calling out that detail.

It's amazing to me how these MS Research folks started out looking for software to accelerate Bing and followed a path all the way to developing FPGA cores and tools.

0xDECAFBAD
User Rank
Rookie
Re: Microsoft did not simply implement SW in HW
0xDECAFBAD   8/15/2014 2:09:00 AM
@SwanOnChips:  Your observation is very clever. You are correct -- the picture shows 48 cores rather than the 60 cores described in the ISCA and Hot Chips publications. The reason is far less technical than one might imagine: the Xilinx PlanAhead tool produces more aesthetically pleasing pictures than Altera's Chip Planner. I used an old picture from the implementation on an early Xilinx prototype because I think it better illustrated the area ratios of each component of FFE. The Altera implementation used in the pilot had 60 cores.

TanjB
User Rank
Rookie
Re: Microsoft did not simply implement SW in HW
TanjB   8/15/2014 1:49:20 AM
The micrograph pictures 48 adapter cards, not 48 cores within an FPGA.  Each of those 48 is attached to a separate server, and the 6 x 8 arrangement is networking the FPGAs in that rack.  Functionality flows across multiple FPGA chips if the algorithm is large.

http://research.microsoft.com/apps/pubs/default.aspx?id=212001

 

KarlS01
User Rank
Manager
"We gave them a middle ground between C++ and RTL where people can program,"
KarlS01   8/14/2014 5:23:54 PM
The processor is a multi-core, shared-ALU design, as opposed to a conventional FPGA design.  It is also multi-threaded, so the speedup is due to multi-thread execution rather than traditional FPGA pipelined computation.

This may mean that a different design approach is needed. Technology evolution may have made the traditional approach of maximizing fmax obsolete.

Rodney.Sinclair
User Rank
Rookie
Application Specific Processor
Rodney.Sinclair   8/15/2014 9:50:08 AM
Based on this discussion, it sounds like what the project really did was make an application specific processor.

Now that they know what kind of processor and communication architecture works well, they need to turn it into an ASIC.

This means of course that Altera won't make much money on this project.

TanjB
User Rank
Rookie
Re: Application Specific Processor
TanjB   8/15/2014 11:30:29 AM
@Rodney, Algorithms change rapidly.  The FPGA can track that.  By the time an ASIC is in production would it still run the algorithms you want?  How about for the entire 3 year life of the kit in deployment?  True this may teach something about a good compromise special core but there are 3 other stages (per DecafBad).  The fascinating thing about FPGA in this scenario is the long term flexibility in what is a fixed plant with a big investment to pay down.  We already have plenty of classical compute capacity on fixed architectures.  Why rush to freeze the FPGA?

It is also interesting to wonder what other major applications can benefit from this hybrid.  A data center runs a lot of different stuff.

Rodney.Sinclair
User Rank
Rookie
Re: Application Specific Processor
Rodney.Sinclair   8/15/2014 12:01:00 PM
The algorithms may change, but that doesn't mean that the processor architecture must change.

FPGAs are an expensive solution for mass-production systems.  An adequate ASP -- this is engineering, after all -- could be substantially less expensive than the FPGAs and could also substantially reduce the datacenter's power consumption.  It would also be capable of higher clock speeds than an FPGA.

KarlS01
User Rank
Manager
Re: Application Specific Processor
KarlS01   8/15/2014 11:32:02 AM
@Rodney: Maybe the cost of an ASIC development is not justified.  What if Altera turns this into a practical compromise solution for reconfigurable computing?  Finally, someone broke away from the idea that all we need is yet another computer ISA.  For years, micro-code control has been used for CISC, so this may be a step up from an ISA to actually running high-level-language source code (C++).  Yes, it can be done!!

VictorRBlake
User Rank
Rookie
FPGAs are the right solution for EFT
VictorRBlake   8/16/2014 3:26:46 PM
FPGAs may be costly, but they are certainly the right solution for an engineering field trial of a new system. It will give them flexibility to re-program, add functionality, etc.
