ANTWERP, Belgium — Google has designed and deployed a second generation of its Tensor Processing Unit (TPU) and is offering access to the machine-learning ASIC as a cloud service for commercial customers and researchers. A server with four of the so-called Cloud TPUs delivers 180 TFlops and will be used for both training and inference tasks.
The effort aims to harness rising interest in machine learning to drive use of Google’s cloud services. It also aims to rally more users around its open-source TensorFlow framework, the only software interface that the new chip supports.
The Cloud TPU supports floating-point math, which Google encourages for both training and inference jobs to simplify deployment. The first-gen ASIC used quantized integer math and was focused solely on inference jobs.
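The practical difference between the two approaches can be illustrated with a minimal sketch of symmetric 8-bit quantization, the general technique behind integer inference; the function names and the scaling scheme here are illustrative assumptions, not Google's actual implementation.

```python
import numpy as np

def quantize_int8(x):
    # Hypothetical symmetric quantization: map float32 values onto int8
    # using a single per-tensor scale factor.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float values.
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.2, 3.3, -0.01], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# Rounding error is bounded by the scale factor, i.e. small relative
# to the tensor's dynamic range -- acceptable for inference, but the
# extra conversion step is what floating-point hardware avoids.
```

Integer math buys density and power efficiency at the cost of a quantization step; supporting floating point end to end removes that conversion from the deployment pipeline, which is the simplification Google is pointing to.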
Google is packing four of the new chips on a custom accelerator board. It packs at least 64 of these boards on a two-dimensional torus network in a cluster called a pod that’s capable of up to 11.5 petaflops. The initial chip rode a PCI Express card in an x86 server.
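The quoted figures are self-consistent, which a few lines of arithmetic make clear (assuming a pod is built from 64 of the four-chip boards):

```python
# Back-of-the-envelope check of the article's throughput figures.
tflops_per_board = 180   # one four-chip Cloud TPU board
chips_per_board = 4
boards_per_pod = 64      # assumption: "64 of them" means 64 boards

tflops_per_chip = tflops_per_board / chips_per_board       # 45 TFlops per chip
pod_pflops = tflops_per_board * boards_per_pod / 1000.0    # 11.52 PFlops per pod
print(tflops_per_chip, pod_pflops)
```

At 64 boards the pod lands at 11.52 PFlops, matching the "up to 11.5 petaflops" claim.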
“Many of the same people were involved in the design of the second-generation TPU, which is an entire system compared to the first, which is a smaller-scale thing,” said Jeff Dean, a senior fellow at Google, speaking in a press briefing. “You can run inference on a single chip, but for training, you need to think more holistically.”
The Cloud TPU board seems to lack DRAM but has several components hidden by heat sinks. (Images: Google)
Google said that the new ASIC handily beats GPUs on training. The company’s latest large language-translation models take a full day to train on 32 of the current top-end GPUs. The same job runs in six hours on one-eighth of a pod, presumably eight TPUs.
Google started deploying the first-generation TPU in 2015. It is used for a wide variety of the company’s cloud services, including search, translation, and Google Photos.
The TPU was first announced a year ago at the annual Google I/O event. In a paper released last month, Google said that the 40-W TPU is a 28-nm chip running at 700 MHz, designed to accelerate workloads written with Google’s TensorFlow framework. Its matrix-multiply unit packs 65,536 8-bit multiply-accumulate units, fed from a 24-Mbyte on-chip buffer, delivering 92 tera-operations/second.
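Those numbers hang together: counting a multiply and an add as two operations per MAC per cycle, the peak rate falls out directly.

```python
# Sanity check of the first-gen TPU's quoted peak throughput.
macs = 65536               # 8-bit multiply-accumulate units (256 x 256 array)
clock_hz = 700e6           # 700 MHz
ops_per_mac_per_cycle = 2  # one multiply plus one accumulate

tera_ops = macs * clock_hz * ops_per_mac_per_cycle / 1e12
print(round(tera_ops, 1))  # ~91.8, matching the quoted ~92 tera-operations/second
```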
In benchmarks on Google’s 2015 machine-learning workloads, the TPU ran 15 to 30 times faster, and delivered 30 to 80 times better performance per watt, than Intel’s Haswell server CPU and Nvidia’s K80 GPU.
Next page: Inferring what’s inside the Cloud TPU