SAN JOSE, Calif. — Google’s Tensor Processing Unit beat Intel’s Xeon and Nvidia GPU in machine-learning tests by more than an order of magnitude, the web giant reported. A 17-page paper gives a deep dive into the TPU and benchmarks showing that it is at least 15 times faster and delivers 30 times more performance/watt than the merchant chips.
In May, Google announced the ASIC designed to accelerate inference jobs for a wide range of applications on its data center servers. Now it is providing a first in-depth look at the chip and its performance in a paper to be presented at a computer architecture conference in June.
The paper gives a deep look into both the accelerator and Google’s diverse neural network workloads, suggesting that engineers have much more to learn about the rapidly expanding field.
“We want to hire good engineers, so we want engineers to know that we are doing high-quality work, and we want to let cloud customers know our capabilities,” said Norman P. Jouppi, a distinguished hardware engineer who led a team of more than 70 engineers that contributed to the TPU.
One of the contributors on the project was retired U.C. Berkeley professor and veteran processor architect, David Patterson, who will present the paper today (April 5) at a gathering of engineers in Silicon Valley. Google also posted a blog on the chip written by Jouppi.
The chip is still being used in Google’s data centers today. Jouppi declined to give any details about how broadly it is deployed or any planned enhancements.
The 40-W TPU is a 28-nm chip running at 700 MHz, designed to accelerate Google’s TensorFlow algorithm. Its main logic unit packs 65,536 8-bit multiply-accumulate units and a 24-Mbyte cache, delivering 92 tera-operations/second.
In benchmarks in 2015 using Google’s machine-learning jobs, the TPU ran 15 to 30 times faster and delivered 30 to 80 times better performance per watt than Intel’s Haswell server CPU and Nvidia’s K80 GPU. “The relative incremental-performance/watt — which was our company’s justification for a custom ASIC — is 41 to 83 for the TPU, which lifts the TPU to 25 to 29 times the performance/watt of the GPU,” the paper reports.
The 2015 tests used a 22-nm, 18-core Intel Haswell E5-2699 v3 CPU with a 145-W TDP running at 2.3 GHz. The Nvidia GPU was a 150-W K80 running at up to 875 MHz.
Next step: A look inside the TPU
The TPU (stars) outpaced Intel's Haswell (circles) and K80 (triangles) in a range of neural network inference jobs. (Images: Google)