The amazing story of how one man single-handedly invented a new computing architecture, designed a multi-million-gate SoC, and went from RTL to GDSII tapeout in just six weeks.
In the course of my travels around the world I have been fortunate enough to meet some truly great engineers. However, it's rare that I am completely blown away by someone on the engineering front. At least, this was true until I was introduced to Andreas Olofsson, president and architect of Adapteva Inc. (www.adapteva.com). As far as I am concerned, Andreas is "an engineer's engineer." This is a man who single-handedly invented a new computer architecture, designed his own System-on-Chip (SoC) from the ground up – including learning how to use all of the EDA tools – then took the device all the way to working silicon and a packaged prototype... and that's when things really started to get interesting!
How it all began
A few years ago, while working on various aspects of digital signal processing, Andreas began to ponder the problem that existing processing solutions – while very versatile – were not inherently efficient in terms of the number of floating-point operations (flops) that could be achieved per watt. Andreas was targeting really complex floating-point problems that require a massive amount of flops. This includes the obvious suspects such as radar, medical imaging, and communications infrastructure tasks like beam forming. But even battery-powered handheld applications increasingly require the ability to perform computationally-intensive tasks while consuming as little power as possible.
As Andreas told me, "It was obvious that the world needed extremely high-performance, low-power computational processing capabilities, but I felt that the existing market simply wasn't doing enough to satisfy these needs." Then Andreas had an epiphany – he realized that it would be possible to achieve his goal by creating an SoC comprised of a matrix of processing elements along with an associated on-chip network, all – as he says – "obsessively fine-tuned for miserly power consumption."
Andreas was well aware that this had been tried before – recent casualties in the market include Ambric with its Massively Parallel Processor Array (MPPA) architecture and structured object programming model, and MathStar with its Field-Programmable Object Arrays (FPOAs) and indescribable programming model. But he believed that, "(a) they came in too early and (b) they had problems with their programming models."
Andreas was a man with a vision. He believed in his heart and soul that he could solve the problem. Thus, on January 23, 2008, he left his existing job and – after a week's vacation with his wife to recharge his batteries – he felt ready to leap into action.
Defining and capturing the architecture
Upon returning from his vacation, Andreas formed his new one-man company, Adapteva, and then disappeared down into his basement. Week after week, he spent sixteen hours a day doing research, reading technical papers, and making notes. As he wryly comments: "I have a very understanding and supportive wife."
I had to question how Andreas was funding all of this effort, and he replied that he was using his pension money – basically, he was betting his entire future and that of his family on the belief that he could succeed on his own where much larger companies had failed. (Andreas is correct – he does have a very understanding and supportive wife!)
On the basis of his research, Andreas determined that the "best-of-the-best" existing solutions were achieving around 0.5 to 1.0 gigaflops per watt (Gfpw). Based on this, Andreas set his target as 50 Gfpw – that's real flops achievable in real-world applications (not marketing flops) – which would be 50 to 100X better than the competition.
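The "50 to 100X" figure follows directly from dividing the target by each end of the existing range. A minimal sketch of that arithmetic, using only the numbers quoted above (the function and constant names are illustrative, not anything Adapteva used):

```python
# Figures quoted in the article, in gigaflops per watt.
TARGET_GFPW = 50.0
EXISTING_LOW_GFPW = 0.5
EXISTING_HIGH_GFPW = 1.0


def improvement_factors(target, existing_low, existing_high):
    """Return (smallest, largest) improvement factor over the existing range."""
    # The smallest gain is measured against the best existing part,
    # the largest gain against the weakest one.
    return target / existing_high, target / existing_low


low, high = improvement_factors(TARGET_GFPW, EXISTING_LOW_GFPW, EXISTING_HIGH_GFPW)
print(f"Target is {low:.0f}X to {high:.0f}X better than existing solutions")
```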
Andreas started with his processor. Looking first at offerings from ARM and MIPS, he decided that there was no way he could use these processors to obtain the power-efficiency he required. "They're both great architectures," he notes, "they're just not good enough for what I had in mind – you can’t make something that will be everything to everybody."
So Andreas decided to design his own high-performance, low-power, ANSI-C programmable 32-bit floating-point processor from the ground up. The next step up was to design the rest of the node, which – in addition to the processor – contains 32 KB of local memory, a Direct Memory Access (DMA) engine, and a router. Andreas also designed the rest of the on-chip network, and a number of other elements such as high-speed input/output (I/O) SRAM buffer macros.
The entire architecture is fully scalable from 4 to 1,024 processing nodes. For his first iteration, Andreas decided to create a chip containing a 4 x 4 matrix of 16 processing nodes, as illustrated in Figure 1. This decision was based on the fact that it's obviously simpler to create a 4 x 4 matrix than it would be to create a larger matrix. Also, having a smaller chip increases the market size; a larger chip would be of interest to a smaller market.
Figure 1. The initial device was comprised of sixteen processing nodes.
The architecture works by having each processing node perform read and write operations using address-data pairs. The node doesn't care whether the generated address refers to its local memory or to a remote node. There is no messing around required to set up channels – the combination of the routers and the autonomous multi-hop Network-on-Chip (NoC) ensures that all address-data pairs arrive at their intended destination. The whole chip is designed to run at 1 GHz, which means that address-data pairs can traverse the entire chip with less than 10 ns of latency.
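To make the idea of address-data pairs more concrete, here is a small software sketch of how a single flat address could identify both a node and a location in that node's local memory. The article gives only the 4 x 4 mesh and the 32 KB of memory per node; the bit layout (row, column, offset), the column-then-row routing order, and every function name below are assumptions made for illustration, not Adapteva's actual design.

```python
# Known from the article: a 4 x 4 mesh, 32 KB of local memory per node.
# Assumed for illustration: the address layout and the routing order.
MESH_DIM = 4
LOCAL_MEM_BYTES = 32 * 1024
OFFSET_BITS = 15  # log2(32 KB)
COORD_BITS = 2    # log2(4)


def make_address(row, col, offset):
    """Pack a node coordinate and a local offset into one flat address."""
    if not (0 <= row < MESH_DIM and 0 <= col < MESH_DIM):
        raise ValueError("node coordinate outside the mesh")
    if not 0 <= offset < LOCAL_MEM_BYTES:
        raise ValueError("offset outside local memory")
    return (row << (COORD_BITS + OFFSET_BITS)) | (col << OFFSET_BITS) | offset


def decode_address(addr):
    """Split a flat address back into (row, col, offset)."""
    offset = addr & (LOCAL_MEM_BYTES - 1)
    col = (addr >> OFFSET_BITS) & (MESH_DIM - 1)
    row = (addr >> (OFFSET_BITS + COORD_BITS)) & (MESH_DIM - 1)
    return row, col, offset


def xy_route(src, dst):
    """Return the nodes visited when moving along the row first, then the column."""
    row, col = src
    path = [(row, col)]
    while col != dst[1]:
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    while row != dst[0]:
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    return path


def store(memories, src, addr, value):
    """Deliver one address-data pair; return the number of router hops taken."""
    row, col, offset = decode_address(addr)
    path = xy_route(src, (row, col))
    # The sender issues the same operation whether the target is local or remote.
    memories.setdefault((row, col), {})[offset] = value
    return len(path) - 1


memories = {}
hops = store(memories, (0, 0), make_address(3, 3, 0x10), 42)
print(f"corner-to-corner write took {hops} hops")
```

In this model a write to the node's own memory takes zero hops and a corner-to-corner write takes six, and the sending code is identical in both cases. At 1 GHz each clock cycle lasts 1 ns, so the quoted latency of less than 10 ns corresponds to fewer than ten clock cycles.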
Once he had defined the architecture, Andreas started to capture his design in RTL. The hierarchical design required less than 10K lines of Verilog code and it took around three months to go from the architecture to the final RTL.
Since he didn’t have access to a commercial Verilog software simulator, Andreas used the open-source Verilator Verilog-to-C translator, which he describes as being "screamingly fast."
This allowed him to simulate and verify the design under a range of real-world data-processing scenarios, thereby confirming that he was on the right track. "This means," says Andreas, "that from day one there was never any question in my mind as to whether my Verilog would work."
Design tools, tapeout, and building a prototype chip
Filled with confidence that his design would perform as desired, Andreas started to turn his attention to actually building the device. In the summer of 2008, he began to consider both a budget and a formal business plan. Also around this time, Andreas approached some venture capitalists (VCs) who, he says, "laughed at me." Andreas continues, "They said that this was not the right time, not the right product, not the right market, and not the right (one man) team."
I bet they are sorry now (grin)!
Undeterred, Andreas started talking to EDA tool vendors and semiconductor foundries. He quickly discovered that the entire suite of EDA tools required for this sort of design can easily cost around $1 million. Similarly, creating the photomasks for the design at his targeted 65 nm technology node would also cost around $1 million. "The end result was that if I did things the conventional way, this project was going to cost much, much more than was in my bank account," says Andreas, "so I decided NOT to do things the conventional way."
In the case of the semiconductor foundry of choice, the solution was to use a multi-project shuttle approach. The idea is that multiple companies come together to share the costs of creating the photomasks. They essentially design one large chip that actually comprises a number of smaller chips. When the wafer is eventually fabricated, diced and sliced, each company receives its own chips. This means that if 20 companies share the cost of the masks, each will pay only around $50K. "This idea has been around for 30 years," says Andreas, "so it's shocking how few startups use it."
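The shuttle arithmetic is simple division. A minimal sketch using the figures quoted above, and assuming the mask cost is split evenly among participants (real shuttle pricing may depend on the die area each company uses; the function name is illustrative):

```python
def mask_cost_per_participant(total_mask_cost, participants):
    """Evenly split a mask-set cost across the companies sharing a shuttle run."""
    if participants < 1:
        raise ValueError("at least one participant is required")
    return total_mask_cost / participants


# Figures from the article: roughly $1 million for masks, 20 companies sharing.
share = mask_cost_per_participant(1_000_000, 20)
print(f"Each company pays about ${share:,.0f}")
```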
Meanwhile, Andreas had started talking to the folks at Magma Design Automation. "Magma was big enough to provide the tools I needed while being small enough to listen," he notes. "I was fortunate in that I made contact with a very visionary Magma sales person named Jeff Remmers. Jeff took the time to listen and understand exactly what was going on, and then he told me that our industry would only become stronger if companies like Adapteva could design and build chips."
Andreas first met Jeff on February 1, 2009. Two weeks later they closed a deal so that Adapteva could get started on the design. "Magma worked with me to understand my business goals and constraints and provided a start-up package that enabled me to use world-class software," he added. "This was vital to my business in taping-out the important first silicon in record time and within budget."
At this point Andreas's pension savings were pretty much exhausted, so he had to start looking for more funding. He ended up raising $200K from close family members to complete the development of his silicon prototypes, device packaging, and short-term EDA tool licenses. "This was really hard for me," says Andreas. "It's one thing to risk your own funds on a dangerous venture; it's a whole other thing to spend other people's hard-earned money."
The good news was that Andreas’s Verilog RTL went through clean synthesis and compilation within one day of his receiving and installing the Magma tools. After that he started experimenting with the floorplan and design constraints while also learning how to use the other tools in the suite, such as timing analysis and place-and-route. Almost unbelievably, it took Andreas only six weeks from receiving the Magma tools to take his multi-million-gate SoC, with more than 50 hard macros and hundreds of high-speed I/O SRAM macros, from RTL to GDSII / tapeout.
Packaged chips and programming models
As soon as the GDSII was out of the door, Andreas could turn his attention to other things, including the packaging and the programming model. To be honest, with all he had already achieved, it would not have surprised me to hear that he had once again done everything himself. However, it seems that he knows when the time has come to share the load.
Thus, Andreas outsourced the ball grid array (BGA) package design to a small consulting company that, for only $15K (the majority of which was non-recurring engineering [NRE]), agreed to design and handle the packaging as well as deliver 50 packaged parts.
With regard to the programming model, Andreas notes that part of doing this sort of project with a small budget is that one has to make engineering compromises. It wasn't possible to create a full-up parallelizing compiler in a timely and cost-effective manner (he may look at creating one in the future). So, at the moment, each core has to be programmed individually using ANSI standard C. But this has advantages in terms of understandability and ease-of-use. Also, Andreas says that he "got lucky," because the contractors he used to create his compiler "were incredibly cheap and did a fantastic job."
Furthermore, Adapteva has created a library of pre-defined modules for commonly used, very compute-intensive algorithmic tasks, such as a 1000-point FFT implemented across all 16 cores.
But wait, there's more...
Andreas started to attend conferences "to find out what was going on and to make sure that I was correct in my understanding of what the market needed."
While at an embedded military conference, Andreas met Jeff Milrod, who is the president of Bittware, a company that makes high-performance DSP boards. The two men started talking and Andreas explained what he was doing. Jeff confirmed that there was indeed a market for the type of chip Andreas proposed. On the downside, Jeff gave his opinion that it simply wasn't possible for a single person to design and build such a chip (or any modern chip at all, for that matter). On the upside, Jeff said that if, by some strange quirk of fate, Andreas were to succeed, then he should get in touch because Jeff would be interested in creating boards around this device.
Andreas at the testbench with a working prototype.
Andreas received his packaged prototype chips in the fall of 2009 and – with his working prototypes in hand – he contacted Jeff at Bittware, who immediately invested $1.5 million in Adapteva. With these funds, Andreas was now in a position to start building his company, so he brought in two old friends and colleagues, Oleg Raikhman and Roman Trogan. Andreas says that Oleg is a software open source verification guru. "He can take and reverse-engineer a multi-million-line piece of open source code and get it to work. He is a prolific producer – very fast and very good quality – definitely the best verification engineer and software development engineer I know."
Meanwhile, Roman is an expert with regard to understanding and analyzing micro-architectures. "He can analyze the most complex problem in his head and work out where the bug is better than anyone. I haven’t seen a problem he can’t solve in chip design. He is a really extraordinary producer also."
The Adapteva team.
Andreas also notes that there's a big step function between building a prototype and creating a product, and that "It cannot be underestimated how much work goes into creating a real product. I could make a proof-of-concept prototype by myself, but the final product would never have been a success without Roman and Oleg."
Oleg and Roman formally joined Adapteva on January 15, 2010. After spending three months building a full infrastructure from the ground up, the team was ready to tape out their first real product on April 1, 2010. Andreas says that, using Magma's Talus Flow Manager to guide the process, the total time for an RTL-to-GDSII tapeout of their 40-million-transistor SoC was less than 24 hours. "This was fortunate, because we realized that there was a small but important feature we wished to push in at the last moment, which required an RTL change 24 hours before tapeout. Because we knew we could do it ... we did!"
An Adapteva chip on a Bittware evaluation board.
Andreas and his colleagues received the first of their packaged production chips in August 2010 and the chips have already demonstrated sustained performance efficiency of 25 gigaflops-per-watt. The folks at Bittware are currently designing boards using these chips. And I, for one, cannot wait to see the result!