Chip vendors first implement new applications in CPUs. If an application is suitable for GPUs and DSPs, it may move to them next. Over time, companies develop ASICs and ASSPs. Is deep learning moving through the same sequence?
In the brief history of deep neural networks (DNNs), users have tried several hardware architectures to increase their performance. General-purpose CPUs are the easiest to program but are the least efficient in performance per watt. GPUs are optimized for parallel floating-point computation and provide several times better performance than CPUs. As GPU vendors discovered a sizable new customer base, they began to enhance their designs to further improve DNN throughput. For example, Nvidia’s new Volta architecture adds dedicated matrix-multiply units, accelerating a common DNN operation.
Even these enhanced GPUs remain burdened by their graphics-specific logic. Furthermore, the recent trend is to use integer math for DNN inference, although most training continues to use floating-point computations. Nvidia also enhanced Volta’s integer performance, but it still recommends using floating point for inference. Chip designers, however, are well aware that integer units are considerably smaller and more power efficient than floating-point units, a benefit that increases when using 8-bit (or smaller) integers instead of 16-bit or 32-bit floating-point values.
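To make the integer-math trend concrete, the sketch below shows one common way to map 32-bit floating-point weights onto 8-bit integers. This is a generic symmetric quantization scheme for illustration, not any particular vendor's method; the weight values are made up.

```python
# Illustrative symmetric INT8 quantization: scale the largest-magnitude
# weight to the int8 range [-127, 127], store 8-bit values plus one
# float scale factor per tensor.

def quantize_int8(weights):
    """Quantize a list of float weights to int8 values and a scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Hypothetical weight values for demonstration:
weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)   # q = [42, -127, 5, 90], scale ≈ 0.01
approx = dequantize(q, scale)       # close to the original weights
```

Each weight now occupies one byte instead of four, which is why 8-bit MAC units can be so much smaller and more power efficient than their floating-point counterparts.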
Unlike GPUs, DSPs are designed for integer math and are particularly well suited to the convolution functions in convolutional neural networks (CNNs). Vector DSPs use wide SIMD units to further accelerate inference calculations. For example, Cadence’s C5 DSP core includes four SIMD units that are each 2,048 bits wide; as a result, the core can complete 1,024 8-bit integer multiply-accumulate (MAC) operations per cycle. That works out to more than 1 trillion MACs per second in a 16nm design. MediaTek has licensed a Cadence DSP as a DNN accelerator in its newest smartphone processors.
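The C5 throughput figures above follow directly from the SIMD geometry. The arithmetic below reproduces them; the 1.1 GHz clock is an assumption chosen to illustrate how the quoted numbers combine, not a published specification.

```python
# Back-of-envelope MAC throughput for a wide-SIMD DSP, using the C5
# figures quoted above: four SIMD units, each 2,048 bits wide,
# operating on 8-bit integers.

simd_units = 4
simd_width_bits = 2048
element_bits = 8

lanes_per_unit = simd_width_bits // element_bits  # 256 8-bit lanes per unit
macs_per_cycle = simd_units * lanes_per_unit      # 1,024 MACs per cycle

# Assumed (illustrative) clock for a 16nm design:
clock_hz = 1.1e9
macs_per_second = macs_per_cycle * clock_hz       # > 1 trillion MACs/s
```

Any clock above roughly 977 MHz pushes this configuration past the 1-trillion-MACs-per-second mark.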
Opportunities for New Architectures
The most efficient architectures are designed from the ground up for DNNs, eliminating features needed only by other applications and optimizing for the specific calculations that DNNs require. These architectures can be implemented in proprietary ASICs or in chips sold to system makers (application-specific standard products, or ASSPs). The most prominent DNN ASIC is Google’s TPU, which is optimized for inference tasks. It consists mainly of a systolic array of 65,536 MAC units and 28MB of memory to hold the DNN weights and accumulators. The TPU uses a simple four-stage pipeline and executes only a handful of instructions.
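The systolic organization is worth a brief sketch. In a weight-stationary array like the TPU's, each MAC cell holds one weight; activations stream in from one edge while partial sums accumulate along each column. The toy function below mirrors that accumulation pattern in ordinary Python at a tiny scale (the TPU's array is 256x256, giving the 65,536 MACs cited above); it is a conceptual illustration, not Google's implementation.

```python
# Toy sketch of weight-stationary systolic accumulation: each column of
# MAC cells holds one column of the weight matrix, and the partial sum
# for that column is built up cell by cell as activations pass through.

def systolic_matvec(weights, activations):
    """Compute y[j] = sum_i activations[i] * weights[i][j], one column
    at a time, mirroring partial sums flowing down a systolic column."""
    rows = len(weights)
    cols = len(weights[0])
    out = [0] * cols
    for j in range(cols):          # one column of MAC cells
        partial = 0
        for i in range(rows):      # partial sum moves down the column
            partial += activations[i] * weights[i][j]
        out[j] = partial
    return out

# 2x2 example (hypothetical values):
systolic_matvec([[1, 2], [3, 4]], [5, 6])   # → [23, 34]
```

Because each weight stays put while data flows past it, the array avoids re-fetching weights from memory on every cycle, which is the main source of its efficiency.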
Several startups are also developing custom architectures for DNNs. Intel acquired one of them, Nervana, last year and plans to sample its first ASSP by the end of this year; the company hasn’t disclosed any details of its architecture, however. Wave Computing has developed a dataflow processor for DNNs. Other well-funded startups include Cerebras, Graphcore, and Groq. We expect at least some of these companies to deliver production devices in 2018.
Another way to implement an optimized architecture is in an FPGA. Microsoft has widely deployed FPGAs as part of its Catapult and Brainwave programs; Baidu, Facebook, and other cloud-service providers (CSPs) also use FPGAs to accelerate DNNs. This approach avoids the multimillion-dollar tapeout fees of ASICs and ASSPs and provides faster turnaround: FPGAs can be reprogrammed in minutes whenever the design changes. But they operate at lower clock speeds and offer far less logic capacity than ASICs. Figure 1 summarizes our view of the relative efficiency of these solutions.
Figure 1. Relative efficiency of deep-learning accelerators. Depending on the hardware design, the performance per watt of deep-learning accelerators can vary by at least two orders of magnitude. (*Using a custom architecture. Source: The Linley Group)
Some companies are hedging their bets by augmenting an existing design with a more customized accelerator. Nvidia’s Xavier chip, designed for self-driving cars, adds an integer-math block to accelerate DNN inference. Ceva and Synopsys have designed similar units to enhance their SIMD DSP cores. These blocks simply contain a large number of integer MAC units to boost math throughput. Since they don’t replace the underlying GPU or DSP architecture, however, they aren’t as efficient as a from-scratch design.
One challenge for custom designs is that deep-learning algorithms are evolving rapidly. TensorFlow, the most popular DNN development framework, wasn’t even available two years ago, and data scientists continue to evaluate new DNN structures, convolution functions, and data formats. A design customized for today’s workloads may not be optimal, or even functional, for the DNNs of two years from now. To address this problem, most ASIC and ASSP designs are programmable and flexible, but FPGAs offer the ultimate flexibility: Microsoft, for example, has defined a proprietary 9-bit floating-point format as part of its Brainwave design.
Moving Through the Options
Throughout its history, the semiconductor industry has usually implemented new applications first in general-purpose CPUs. If an application is suitable for existing specialized chips such as GPUs and DSPs, it may move to them next. Over time, if the new application becomes a sizable market, companies begin to develop ASICs and ASSPs, although these devices are likely to retain some programmability. Only when an algorithm becomes highly stable (for example, MPEG) does it see implementation in fixed-function logic.
Deep learning is currently moving through this sequence. GPUs and DSPs are clearly applicable, and demand is high enough that ASICs are beginning to appear. Several startups and other companies are developing ASSPs that will ship in 2018 and beyond. FPGAs typically serve low-volume or niche applications, but deep learning is already showing enough promise to justify ASIC tapeouts.
The winning DNN architecture is far from clear, however. Although the deep-learning market is growing rapidly, it’s still much smaller than the PC, smartphone, and automotive markets. Thus, the business case for ASICs and ASSPs is marginal. By contrast, companies such as Intel and Nvidia can use their high-performance processors from other markets and enhance them for deep learning, delivering competitive products with extensive software support and frequent updates. We will likely see many different hardware architectures coexist in the deep-learning market for years to come.
-- Linley Gwennap is Principal Analyst at The Linley Group and editor-in-chief of Microprocessor Report. He has recently completed a new report on processors for deep learning.