
Ex-Baidu Scientist Blazes AI Shortcut

Native support for 3D tensor operation
8/31/2017 05:31 PM EDT
dromdrom   9/4/2017 4:30:00 AM
"We can imagine when an autonomous car drives at 65 mph and needs to break at once"

I hope the car needs to brake and not break at once.

One core is sometimes better than two
pkorolov   9/3/2017 4:23:50 AM
Clearly an accelerator can improve performance compared to a general-purpose imaging DSP. However, two cores significantly increase overhead, since two working memories plus a shared memory are now required to pass data back and forth. A second core, even if it is highly specialized, still needs most of the logic of a general CPU architecture: instruction fetch, decode, data load/store, etc.

Why not just add the CNN capabilities to the imaging DSP? You eliminate all the extra data movement, all the extra memories, and all the extra logic due to having two cores instead of one. 
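The data-movement argument can be made concrete with a toy model (the layer sizes and core placements below are made up purely for illustration): activations only cross the shared memory when consecutive layers run on different cores, so a single-core mapping incurs none of that traffic.

```python
# Toy model of cross-core activation traffic (illustrative numbers only).
# acts[i] is the size in bytes of the activations produced by layer i;
# placement[i] names the core that runs layer i ("dsp" or "acc").
def cross_core_bytes(acts, placement):
    # Activations cross the shared memory only when consecutive
    # layers run on different cores.
    return sum(acts[i]
               for i in range(len(acts) - 1)
               if placement[i] != placement[i + 1])

acts = [100_000, 200_000, 300_000, 50_000]           # hypothetical layer outputs
two_core = cross_core_bytes(acts, ["dsp", "acc", "acc", "dsp"])
one_core = cross_core_bytes(acts, ["dsp", "dsp", "dsp", "dsp"])
print(two_core, one_core)  # 400000 0
```

Every layer boundary that straddles the two cores adds a full activation copy each way, which is exactly the overhead a single fused core avoids.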

This is what Cadence has done with their Vision C5 DSP. It is still a general-purpose imaging DSP with specialized CNN instructions added. It can sustain 1024 MACs/cycle along with 1024 bits/cycle of local memory transfer, move data to and from system memory, and perform several additional ALU or coefficient-decompression operations. All of this fits in an area of less than 1 mm^2 in TSMC's 16nm FFC process.
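As a rough sanity check on the 1024 MACs/cycle figure (the 1 GHz clock below is an assumption for illustration, not a number from the comment), peak throughput scales linearly with the clock:

```python
# Back-of-the-envelope peak throughput for a fixed-function MAC array.
# The 1 GHz clock is an assumed figure, not a vendor specification.
def peak_tera_macs(macs_per_cycle, clock_hz):
    return macs_per_cycle * clock_hz / 1e12

tmacs = peak_tera_macs(1024, 1.0e9)
print(tmacs)        # 1.024 TMAC/s
print(2 * tmacs)    # 2.048 TOPS, counting each MAC as two ops
```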

Of course, Cadence is already working on next-generation products that will scale to many thousands of MACs/cycle. Using the Tensilica Instruction Extension language, Cadence engineers or end users can add instructions or new architectural features in days instead of the months required with a regular RTL design flow.

Performance efficiencies
Yairs   9/2/2017 1:28:42 PM
Congrats to NovuMind on their interesting design and new path.

The claim that a DSP is not efficient for tensor calculations seems correct as a general concept, but not in the case of CEVA's XM4 and XM6 vision DSPs. Using dedicated mechanisms in the architecture, such as a sliding window and combining several input channels together, the DSP solution can achieve over 90% MAC utilization on any type of filter, not just 3x3.
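One way to see why utilization need not depend on kernel size: in a simple row-wise vector mapping (a toy model for intuition, not CEVA's actual sliding-window scheme), the wasted lanes come only from the output width not dividing the lane count, and the kernel size cancels out:

```python
import math

# Toy MAC-utilization model for a row-wise vector mapping.
# Illustrative only; not a description of CEVA's actual architecture.
def mac_utilization(out_w, out_h, k, lanes):
    useful = out_w * out_h * k * k        # MACs that produce real results
    vec_ops = math.ceil(out_w / lanes)    # vector ops per output row per tap
    cycles = vec_ops * k * k * out_h      # one filter tap per vector op
    return useful / (cycles * lanes)

# A 224-wide output on 64 lanes gives 224/(4*64) = 0.875 for any kernel size.
print(mac_utilization(224, 224, 3, 64))   # 0.875
print(mac_utilization(224, 224, 7, 64))   # 0.875
```

Under this mapping the k*k factor appears in both the useful work and the cycle count, so only the `out_w` vs. `lanes` mismatch determines utilization.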

To scale performance to thousands of multipliers, there are indeed advantages to more dedicated, hardwired architectures that improve efficiency. CEVA's CNN block supports any filter size very efficiently while significantly reducing power consumption and area compared to using a DSP alone. By combining a CNN accelerator with DSPs, it is possible to support future neural network layers that are not guaranteed to run efficiently on accelerators with very fixed functionality.

CEVA's vision and deep learning solution is already licensed to and proven by dozens of customers and partners, and is entering production in several markets including mobile, surveillance, and automotive.



NovuMind's architecture is a step backward
pkorolov   9/1/2017 9:00:26 PM
NovuMind's architecture claims to achieve 75% to 90% efficiency on 3x3 kernels.

That's great, but Tensilica's Vision C5 DSP, available for licensing today, can achieve the same efficiency for any kernel size. Not only that, Tensilica's DSP can handle all the other processing required in neural networks, including non-linear activation functions and on-the-fly coefficient decompression.

In addition, Tensilica's vision DSPs support any network in the Caffe or TensorFlow frameworks, not just the few that use small kernels. Code is generated automatically for any of these networks and runs efficiently even when data is stored in slow DRAM.

I would suggest that Dr. Wu consider licensing Tensilica's core, which solves the problems he is trying to solve without the limitations of his approach. And since Tensilica's cores are extensible, any customer can further improve performance or tune them for specific use cases.
