The folks at Samplify Systems say that their new APAX technology “Improves the performance of multi-core processors,” but I think it’s much more pervasive than that…
Actually, it’s a little bit tricky to know where to start, because this is a bit of a “chicken and egg” situation. Would it be better to begin with the problem and then describe Samplify’s APAX solution (I ask myself), or is it best to explain APAX technology and then discuss the problems it addresses?
Hmmm. Well, before I do anything else, I really should make the point that APAX technology is of interest to people designing ASIC/SoC devices, and FPGAs; and it’s also of interest to folks working with everything from microcontrollers and microprocessors to honking great big supercomputers and cloud computing scenarios. Phew!
Have I captured your interest yet?
The underlying problem is that Moore’s Law is enabling an exponential increase in the number of cores (CPUs, GPUs, Codecs, etc.) on a single die, while the access speed to external (off-chip) memory is increasing only at a linear rate and the width of memory busses is limited by packaging constraints. Meanwhile, off-chip interface bandwidth increases in discrete steps along with the rest of the computing industry; in the case of PCI Express, for example, Gen 1.0a was introduced in 2003 (Gen 1.1 arrived on the scene in 2005), Gen 2.0 was introduced in 2007, and although the base specification for Gen 3.0 was made available in November 2010, we’re still waiting “for the rush to start”.
The bottom line is that memory and interface bandwidths are not increasing at the same rate as the number of processing cores, with the result that a large proportion of multi-core applications are running at only 10% or less of their potential capability.
Let’s start with a really simple example. Consider an application processor containing a CPU core, a Video Codec, a GPU core, and a Display Controller as shown below. All of these cores communicate via an on-chip system bus and an on-chip memory controller to external DRAM. In this case, the main bottleneck is the “pipe” (bus) connecting the memory controller to the external memory.
In particular, observe the small Samplify Icon that looks a bit like two italic ‘f’ letters located between the system bus and the memory controller. This icon represents Samplify’s APAX technology, which – in this particular example – would be realized as a hard core (it would be delivered in the form of RTL that would be synthesized along with all of the other cores by whoever was designing and implementing the application processor).
Samplify’s tag line is “…simply the bits that matter”. I personally like to think about their technology as saving you from carrying a load of “deadweight bits” around with you (“Let’s put those dead bits down and move on with our lives”).
What the APAX core does is to compress the data as it moves from the other cores in the application processor to the memory controller, which means that the memory controller has significantly less data to pass to the external memory (this works with all forms of data … integer, fixed-point, floating-point, etc.). Similarly, when data is being retrieved from the external memory, the APAX core decompresses it before it’s handed over to the other functional blocks.
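To make the "compress on the way out, decompress on the way back" idea a little more concrete, here is a minimal Python sketch. Please note that this is not Samplify's algorithm: the class name CompressedMemoryPort is something I've made up for illustration, a plain dictionary stands in for the external DRAM, and zlib stands in for a lossless compressor. The only point is to show where the compressor sits and why fewer bytes cross the "pipe" to the external memory.

```python
import struct
import zlib


class CompressedMemoryPort:
    """Hypothetical model of a compressor sitting between the system bus
    and the memory controller. zlib is a stand-in, NOT the APAX algorithm."""

    def __init__(self):
        self._dram = {}            # address -> compressed bytes (stand-in for DRAM)
        self.bytes_from_cores = 0  # traffic seen on the on-chip system bus
        self.bytes_to_dram = 0     # traffic seen on the off-chip memory "pipe"

    def write(self, address, data):
        # Compress on the way out to external memory.
        packed = zlib.compress(data)
        self._dram[address] = packed
        self.bytes_from_cores += len(data)
        self.bytes_to_dram += len(packed)

    def read(self, address):
        # Decompress before handing the data back to the requesting core.
        return zlib.decompress(self._dram[address])


# 1024 signed 16-bit samples with a repeating pattern (2048 bytes in total).
samples = struct.pack("<1024h", *[(i % 64) * 4 for i in range(1024)])

port = CompressedMemoryPort()
port.write(0x1000, samples)
restored = port.read(0x1000)

print("bytes on system bus :", port.bytes_from_cores)
print("bytes to/from DRAM  :", port.bytes_to_dram)
print("round trip is exact :", restored == samples)
```

The cores on either side never see the compressed form; they write and read ordinary data, and only the traffic counters reveal that less data crossed the memory interface.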
The APAX core, which consumes a tiny portion of the silicon real estate, can be controlled by software on the fly to select between different compression scenarios, including Bypass, Lossless, and Lossy (Fixed-rate or Fixed-quality). The lossless mode typically offers compression between 1.5-to-1 and 2.5-to-1, while the lossy modes offer anywhere between 3-to-1 and 8-to-1 or more, without the effects of lost data being detectable to humans in audible or visual form. The great thing is that Samplify provides tools that allow designers to experiment with different scenarios; in the case of the lossy algorithms, for example, you can gradually “turn the knob” to increase or decrease the amount of compression being applied while observing the effects on the signal-to-noise ratio and so forth.
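If you'd like a feel for what "turning the knob" means, the following Python sketch may help. It is only an illustration under my own assumptions: the functions quantize and snr_db are hypothetical helpers, and simply discarding low-order bits is a far cruder fixed-rate lossy scheme than anything a commercial product would use. Even so, it shows the basic trade-off between compression ratio and signal-to-noise ratio.

```python
import math


def quantize(samples, kept_bits, full_bits=16):
    """Keep only the top `kept_bits` of each sample (crude fixed-rate lossy
    scheme, used here purely for illustration)."""
    shift = full_bits - kept_bits
    if shift == 0:
        return list(samples)
    half_step = 1 << (shift - 1)  # reconstruct at the middle of each step
    return [((s >> shift) << shift) + half_step for s in samples]


def snr_db(original, restored):
    """Signal-to-noise ratio in dB between the original and restored samples."""
    signal_power = sum(o * o for o in original)
    noise_power = sum((o - r) ** 2 for o, r in zip(original, restored))
    if noise_power == 0:
        return math.inf  # bit-exact reconstruction
    return 10.0 * math.log10(signal_power / noise_power)


# A 16-bit test tone: ten cycles of a sine wave, 100 samples per cycle.
signal = [round(20000 * math.sin(2 * math.pi * i / 100)) for i in range(1000)]

# "Turn the knob": fewer kept bits means more compression and a lower SNR.
for kept_bits in (16, 12, 8, 6, 4):
    snr = snr_db(signal, quantize(signal, kept_bits))
    print(f"{kept_bits:2d} bits kept -> {16 / kept_bits:.1f}-to-1, SNR {snr:.1f} dB")
```

Each step down the list raises the compression ratio and lowers the SNR, which is exactly the sort of relationship a designer would want to observe before settling on a lossy setting.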
As another example, let’s take NASA’s Pleiades Supercomputer, which I believe was ranked #7 among the supercomputers in the world as of November 2011. Check out this technical report from the NASA site. I was absolutely amazed to discover that in the case of the ECCO (Estimating the Circulation and Climate of the Ocean) application, the sustained percentage performance/throughput of the processing cores themselves was less than 1.4% (it’s generally less than 8% for all high-performance computing (HPC) applications):
This means that actually performing the interesting computations occupies only a fraction of the CPU cores’ time; the rest of the time they are performing communications and storage-related tasks or – the majority of the time – idling along waiting while the data is moving around the system.
In the above image, the Compute times are such thin (less than 1.4%) slices at the bottom of each column that you can’t even really see them on this diagram (the lowest blue areas that you can differentiate are associated with Communications).
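As a back-of-the-envelope exercise, here is a small Python sketch of how the compute share might change if the time spent moving data were reduced. The function utilization and its scalable_share parameter are my own inventions; the model assumes that some fraction of the non-compute time shrinks in direct proportion to the compression ratio, which is an optimistic simplification rather than a measured result.

```python
def utilization(compute_share, ratio, scalable_share=1.0):
    """Estimate the fraction of time spent computing after compression.

    compute_share  : fraction of time currently spent on computation
    ratio          : assumed compression ratio (2.0 means 2-to-1)
    scalable_share : ASSUMED fraction of the non-compute time that shrinks
                     in proportion to the amount of data being moved
    """
    other = 1.0 - compute_share
    fixed = other * (1.0 - scalable_share)     # unaffected by compression
    scaled = other * scalable_share / ratio    # shrinks with the data volume
    return compute_share / (compute_share + fixed + scaled)


baseline = 0.014  # the "less than 1.4%" figure quoted for ECCO
for ratio in (1.0, 2.0, 4.0, 8.0):
    best_case = utilization(baseline, ratio)
    half_case = utilization(baseline, ratio, scalable_share=0.5)
    print(f"{ratio:.0f}-to-1: {best_case:.1%} if all data time scales, "
          f"{half_case:.1%} if only half of it does")
```

The interesting thing is how sensitive the answer is to scalable_share; how much of the waiting time is really proportional to data volume is something that would have to be measured for each application.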
APAX technology can be delivered in one of two ways – as software to run on a processor core or as RTL that can be synthesized as part of an ASIC/SoC or FPGA design (more on FPGA implementations in just a moment). This means that it can be used all over the place to accelerate computing, consumer electronics, smartphone, and tablet applications as illustrated below:
In the case of the CPU, for example, APAX can accelerate memory-bound applications. In the case of the GPU, it can accelerate 3D graphics rendering (the same technology can be used for integer RGB texture values and floating-point meshes) and increase DisplayPort throughput. In the case of the memory controller, APAX increases effective memory bandwidth, thereby eliminating the need to add additional memory ports. In the case of the SouthBridge, APAX increases I/O bandwidth between the CPU and the peripherals. And APAX can also be used to increase storage throughput, network throughput, and inter-processor communications (APAX also makes your teeth whiter and gives you a gleaming smile).
But wait, there’s more… the big “buzz” on everyone’s lips at the moment is “Cloud Computing”, which can be made to sound wonderful. What people tend not to tell you is that communications between cloud computing nodes (which may not be in the same rack, or even in the same data center) are significantly less efficient than in purpose-built supercomputing configurations.
How much less efficient? Well, it’s pretty bad; HPC applications can have 40-to-1000% worse performance when running in the cloud. The thing is that APAX technology can dramatically accelerate these HPC applications by reducing the latency of the inter-processor data transfers.
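To see why compressing inter-processor traffic can pay off (and when it might not), consider the following simple Python model of a single node-to-node transfer. Everything here is hypothetical: the function transfer_time, the 1 Gb/s link, the 100-microsecond latency, and the per-byte codec costs are illustrative assumptions of mine, not figures from Samplify.

```python
def transfer_time(payload_bytes, link_bytes_per_s, link_latency_s,
                  ratio=1.0, codec_s_per_byte=0.0):
    """Rough time to move one payload between two nodes.

    ratio            : assumed compression ratio (1.0 means no compression)
    codec_s_per_byte : ASSUMED combined compress + decompress cost per
                       uncompressed byte
    """
    wire_s = (payload_bytes / ratio) / link_bytes_per_s  # time on the link
    codec_s = payload_bytes * codec_s_per_byte           # time spent in the codec
    return link_latency_s + wire_s + codec_s


payload = 64 * 1024 * 1024   # 64 MiB of data
link = 1.25e8                # 1 Gb/s link, expressed in bytes per second
latency = 100e-6             # 100 microseconds of link latency

plain = transfer_time(payload, link, latency)
fast_codec = transfer_time(payload, link, latency, ratio=2.0, codec_s_per_byte=1e-9)
slow_codec = transfer_time(payload, link, latency, ratio=2.0, codec_s_per_byte=1e-8)

print(f"uncompressed          : {plain:.3f} s")
print(f"2-to-1 with fast codec: {fast_codec:.3f} s")
print(f"2-to-1 with slow codec: {slow_codec:.3f} s")
```

The takeaway from this toy model is that compression helps only when the codec is fast relative to the link; a slow codec can cost more time than the reduced data volume saves.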
Last, but certainly not least, we come to one of my favorite topics – programmable logic in the form of FPGAs. Today’s FPGAs can support the equivalent of tens of millions of ASIC logic gates, which means they can support multiple soft processor cores, peripheral functions, and hardware accelerators. They also offer hard memory controller cores and so forth. Once again, APAX technology can dramatically increase the effective interface and memory bandwidths when moving data on- and off-chip.
And then there are the red-hot new devices that are best thought of as “FPGAs on Steroids”. These are the ones implemented at the 28nm technology node that incorporate a hard dual-core ARM Cortex-A9 processor subsystem, hard core memory controllers, programmable FPGA fabric, and all sorts of other stuff. I am of course talking about the Arria V and Cyclone V SoC FPGAs from Altera and the Zynq-7000 Extensible Processing Platform from Xilinx. Personally, I think that the combination of devices of this caliber and Samplify’s APAX technology (delivered as RTL and implemented in programmable fabric) is a “marriage made in heaven”… I can’t wait to see how this all turns out…
Availability
Samplify’s APAX technology is available as software for Intel x86/x64 CPUs on both Linux and Windows 7 for HPC and cloud computing applications. Samplify’s APAX technology will be available as Verilog RTL for SoCs and FPGAs in Q3 2012. For more information, please visit www.samplify.com/apax