SAN JOSE, Calif. — The huge data sets collected by web giants such as Amazon, Google, and Facebook are fueling a renaissance of new chips to process them. Two of the latest efforts will be described at an annual conference on computer architecture in late June.
Stanford researchers will describe Plasticine, a reconfigurable processor that sports nearly 100x better performance/watt than an FPGA while being easier to program. Separately, two veteran designers at Nvidia were part of a team that defined an inference processor that delivers more than twice the performance and energy efficiency of existing devices.
The chips represent the tip of an iceberg of work. Intel acquired three machine-learning startups in the past year. Rival Samsung, along with Dell EMC, invested in Graphcore (Bristol, U.K.), one of a half-dozen independent startups in the area.
Meanwhile, Nvidia is racking up rising sales for its GPUs as neural network training engines. Simultaneously, it is morphing its architecture to better handle such jobs.
Google claims that neither its massive clusters of x86 CPUs nor Nvidia’s GPUs are adequate. So it has rolled out two versions of its own accelerator, the TPU.
“This is Compute 2.0; it is absolutely a new world of computing,” said Nigel Toon, chief executive of Graphcore. “Google eventually will use racks and racks of TPUs and almost no CPUs because 98 percent of its revenues come from search,” a good application for machine learning.
Eventually, machine-learning chips will appear in a broad range of embedded systems. With 18 million cars sold a year compared to 10 million servers, “self-driving cars could be a bigger market than the cloud for this technology, and it’s a market that never existed before,” said Toon.
The shared vision is an AI processor that can handle both training and inference for today’s rainbow of neural networks, and maybe even some emerging self-learning techniques. Such chips need to deliver performance through massive parallelism, yet be power-efficient and easy to program.
Even the basic math for the job is a subject of lively debate. Toon believes that a mix of 16-bit floating-point multiplication with 32-bit accumulates delivers optimal precision with minimal errors.
That’s the approach that the new tensor cores in Nvidia’s Volta use, as well as the competing high-end chip that Graphcore will sample to early partners in October. The startup is focused on one big chip using novel memories and interconnects in and out of the chip to link cells and clusters.
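The arithmetic Toon describes can be illustrated with a small sketch. This is only a software model of the numeric idea, not a description of how Nvidia's tensor cores or Graphcore's chip are built; the helper names (`to_fp16`, `to_fp32`, `dot_fp16`, `dot_mixed`) are hypothetical and use Python's standard `struct` module to round values to 16-bit and 32-bit floating point.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_fp16(a, b):
    """Dot product with 16-bit multiplies and a 16-bit accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))  # product rounded to 16 bits
        acc = to_fp16(acc + prod)                # running sum rounded to 16 bits
    return acc

def dot_mixed(a, b):
    """Dot product with 16-bit inputs and a 32-bit accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(x) * to_fp16(y)  # product of two 16-bit values fits in 32 bits
        acc = to_fp32(acc + prod)       # running sum rounded to 32 bits
    return acc

ones = [1.0] * 4096
print(dot_fp16(ones, ones))   # 2048.0: the 16-bit sum stops growing
print(dot_mixed(ones, ones))  # 4096.0: the 32-bit sum is exact
```

In this toy case the 16-bit accumulator stalls at 2048 because half precision cannot represent 2049, so each further addition of 1 is rounded away. Keeping the inputs at 16 bits but the running sum at 32 bits avoids that loss.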