AI Silicon Gets Mixed Report Card
Date: 2018-03-01
SAN JOSE, Calif. — A leading researcher in deep learning praised some of the latest accelerator chips. He also pointed out shortcomings in both the silicon and the software it is supposed to speed up.
The results came, in part, from tests using DeepBench, an open source benchmark for training neural networks using 32-bit floating point math. Baidu, the Google of China, released DeepBench in September 2016 and updated it in June to cover inference jobs and use of 16-bit math.
On some low-level operations such as matrix multiplication, chips with dedicated hardware such as the tensor cores on Nvidia’s Volta GPU can deliver “hundreds of TeraFlops…several factors faster than the previous generation at 5-10 TFlops,” said Greg Diamos, a senior researcher at Baidu’s Silicon Valley AI Lab.
However, some low-level operations “used in real apps don’t have enough [data] locality to get full use of these specialized processors, so we either have to live with moderate speed ups or change the algorithms,” he said.
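The locality problem can be illustrated with a rough estimate of arithmetic intensity, the number of floating-point operations performed per byte moved to or from memory. The sketch below is not taken from DeepBench or from Baidu's tests. The function name, the matrix sizes and the assumption that each matrix touches memory exactly once are all illustrative. It compares a matrix-vector product, the shape a recurrent layer takes when it processes one sequence at a time, with the same weights applied to a batch of 64 inputs.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=4):
    """Estimate FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Illustrative model only: assumes A, B and C each cross the memory
    bus exactly once, and defaults to 32-bit (4-byte) elements.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical layer size, chosen only for illustration.
hidden = 2048

# One input vector at a time: every weight is loaded to do two FLOPs.
single_input = gemm_arithmetic_intensity(m=hidden, n=1, k=hidden)

# 64 inputs at once: each loaded weight is reused 64 times.
batched = gemm_arithmetic_intensity(m=hidden, n=64, k=hidden)

print(f"single input: {single_input:.2f} FLOPs/byte")
print(f"batch of 64:  {batched:.2f} FLOPs/byte")
```

Under these assumptions the single-input case performs well under one operation per byte, so memory bandwidth rather than arithmetic throughput sets its speed, while the batched case reuses each weight many times. That gap is why both larger batches and more convolution-like layers are candidate fixes.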
The Baidu research team is exploring two ways to get more bang for the buck with the new chips. In one effort, researchers are opening controls in their algorithms to take in simultaneous feeds, hoping to boost data parallelism 10-fold.
The other path is to make all models look more like the convolutional neural nets (CNNs) typically used in imaging applications. CNNs have more locality than the recurrent neural nets (RNNs) typically used for sequential data such as text or audio apps, Diamos said.
When researchers replaced “a stack of RNN layers with CNN layers” in a Baidu model that generates audio from text, “compute intensity increased by 40x, [delivering] good utilization of the new hardware. We have to go through all the apps we build to see if we can use this approach generally, or only use it for speech synthesis,” he said.
It’s unclear when research on either approach will be ready to use in production systems. Meanwhile, the Baidu researcher shared other observations from hardware tests.
