MANHASSET, N.Y. NeoMagic Corp. is crafting a mobile-system architecture designed to attack two of the most pressing concerns for handheld devices: power consumption and multimedia-processing performance. The Associative Processor Array (APA) architecture addresses those issues by performing both bit storage and processing in a single memory cell, slashing the amount of data that must be shuttled around during image and video processing while increasing the amount of processing that can be done per clock cycle.
"Data movement is one of the bigger challenges [for mobile systems] and one that isn't well-addressed by current architectures," said Marc Singer, vice president of corporate marketing at NeoMagic, referring specifically to RISC and digital signal processors. The APA is aimed particularly at the high-end market, where enhanced multimedia such as MPEG-4 video will be implemented. Singer expects to incorporate the architecture into the next member of the company's MiMagic processor family early next year.
"This clearly takes them to a higher level of integration; the integrated memory is the key," said Dean McCarron, president of Mercury Research (Cave Creek, Ariz.).
NeoMagic's architectural disclosure coincided with the release this week of the MiMagic 5 applications processor for midrange media phones that need basic imaging capability at low power. Low-end devices, with basic voice and text messaging, are not a target.
With both the APA and MiMagic 5, the Santa Clara, Calif., company is going after an applications processor market that it believes will ship 60 million units a year by 2006 for PDAs alone.
On top of that, NeoMagic said, will be personal entertainment devices, accounting for up to 40 million units a year, and media phones, at up to 50 million units.
But Singer pointed out that the very term "applications processor" is still not well-known. "We see a common system-partitioning scheme across all mobile devices, with the communications processor taking care of the real-time tasks, and the applications processor taking care of the data processing and control functions," said Singer.
Two-way tack
This bifurcated design approach is reflected in Intel Corp.'s Personal Internet Client Architecture and Texas Instruments Inc.'s Open Multimedia Applications Platform, among others. In the PCA, Intel uses its Xscale RISC processor to address the applications segment and the Micro Signal Architecture DSP co-developed with Analog Devices Inc. to handle the communications function. TI's Omap, for its part, incorporates an ARM processor for applications and the company's own vast DSP expertise for communications.
However, NeoMagic has zeroed in entirely on the applications processor, specifically the multimedia-processing function.
An applications processor, as NeoMagic defines it, typically comprises a RISC processor, I/Os and a multimedia engine with some memory.
Such a structure enables "better multimedia at lower power," Singer said.
To tackle multimedia, he said, the company first had to come to grips with its three primary characteristics. "It's an abundant-data type, so you're using a lot of data," Singer said. "It comes in a wide format. And it tends to [involve] a lot of concurrent data," meaning a great deal of information at the same time. Multimedia is also "multidimensional," he noted, referring to the fact that the data is often related to some of the other data that's being processed at the same time.
Those concerns are at the heart of NeoMagic's design for the MiMagic 5. An extension of work done on the MiMagic 3, which introduced the company's view of the applications processor as separate from the communications processor, the MiMagic 5 is built around an ARM922T RISC engine. It takes the MiMagic 3's 12-channel DMA-enabled parallel-bus structure and adds 160 kbytes of on-board memory as well as a dedicated data path for preprocessing in hardware. The on-board buffer memory is said to reduce power consumption by limiting external accesses, while the video preprocessor handles color space conversion, video scaling, overlay and keying.
"The 160 kbytes allows us to buffer a full 320 x 240[-pixel] frame as well as have some working space for some of the multimedia tasks," said Singer.
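Singer's buffer claim can be checked with quick arithmetic. The sketch below assumes 16 bits per pixel and 1,024 bytes per kbyte; the article states neither, so the working-space figure is illustrative only.

```python
# Assumptions (not stated in the article): 2 bytes per pixel, 1 kbyte = 1,024 bytes.
WIDTH, HEIGHT = 320, 240
BYTES_PER_PIXEL = 2
BUFFER_BYTES = 160 * 1024  # the MiMagic 5's 160 kbytes of on-board memory

# Size of one full frame, and what is left over for multimedia working space.
frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
working_bytes = BUFFER_BYTES - frame_bytes

print(f"frame: {frame_bytes} bytes, working space: {working_bytes} bytes")
```

Under those assumptions one frame takes 153,600 bytes, leaving 10,240 bytes of working space.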
The device is built in a 0.18-micron, 1.8-volt process and clocks at up to 220 MHz. In idle mode, with the display being refreshed, power consumption is typically 25 milliwatts, said Singer. This rises, but stays under 50 mW, when processing a QCIF image from the camera.
The chip also boasts extensive interface support, in the form of a digital camera interface, four UARTs, two Universal Serial Bus ports, two SPI interfaces, general-purpose I/Os and two secure-digital I/Os.
Sampling now, the MiMagic 5 is but a precursor to the APA, said Singer. The APA goes beyond well-known caching and buffering schemes that attempt to minimize the amount of data being shuttled around. Instead of local caching, it puts the processing elements right alongside the stored bits, allowing the transformations to take place within the cache.
For the core architecture of the APA, NeoMagic took the content-addressable memory cell structure, the essence of which is the compare function, and added transistors for intercell connectivity, to perform the compare-and-write function and to manage the single-cycle access.
Simpler instructions
Part of the breakthrough, said Singer, was simplifying the instruction set to best take advantage of the architecture's capabilities. "It does compare, write or move column, and shift row; anything that can be expressed as a truth table can be easily programmed," he said. The dimensions of the array determine how many words can be loaded and processed in parallel.
The first implementation of the APA will be a 160-bit-wide data array by 512 words. The data can be 8, 16 or 32 bits wide. The structure is such that a single-cycle compare of 512 words to an 8-bit pattern, followed by a single-cycle write on that 512-word data, takes just two clock cycles. This rapid compare-and-write is what makes the APA so fit for the repetitive calculations inherent in the motion-estimation function at the heart of MPEG-4 processing, the company said.
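The compare-then-write pattern can be modeled in software. The following is a minimal sketch, not NeoMagic's design: the function names, the bit layout and the XOR truth table are all hypothetical, and the Python list comprehensions merely stand in for what the hardware would do across all 512 words at once.

```python
def compare(words, pattern, mask):
    """Associative compare: tag every word whose masked bits equal the pattern."""
    return [(w & mask) == (pattern & mask) for w in words]


def write(words, tags, pattern, mask):
    """Associative write: overwrite the masked bits of every tagged word."""
    return [(w & ~mask) | (pattern & mask) if t else w for w, t in zip(words, tags)]


def apply_truth_table(words, table, in_mask, out_mask):
    """Run one compare and one write per truth-table row; return words and cycle count."""
    cycles = 0
    for in_bits, out_bits in table:
        tags = compare(words, in_bits, in_mask)            # modeled as 1 cycle
        words = write(words, tags, out_bits, out_mask)     # modeled as 1 cycle
        cycles += 2
    return words, cycles


# Hypothetical layout: bit 0 = a, bit 1 = b, bit 2 = result of a XOR b.
XOR_TABLE = [(0b00, 0b000), (0b01, 0b100), (0b10, 0b100), (0b11, 0b000)]
words = [0b000, 0b001, 0b010, 0b011] * 128   # 512 words, matching the array depth
result, cycles = apply_truth_table(words, XOR_TABLE, in_mask=0b011, out_mask=0b100)
```

In this model the cycle count depends only on the number of truth-table rows, not on how many words are in the array, which is the property the article attributes to the APA.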
Motion estimation works by dividing the screen into 16 x 16 macro blocks, then searching for a particular block in subsequent frames and sending the motion vector for a match to the processor. The processor now has to store only the vector for where that block moved, not the complete block, thereby achieving the desired compression. "If I did this using a sequential engine I'd have to reload each pixel of the macro block multiple times, as I have to look at multiple candidate locations," said Singer.
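For readers unfamiliar with block matching, here is a sketch of the sequential approach Singer describes, using the sum of absolute differences as the match cost. It is a generic full search under assumed parameters (a search range of 4 pixels, synthetic 48 x 48 test frames, hypothetical function names), not NeoMagic's algorithm.

```python
def sad(block_frame, search_frame, bx, by, cx, cy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in
    block_frame and the candidate location (cx, cy) in search_frame."""
    return sum(
        abs(block_frame[by + j][bx + i] - search_frame[cy + j][cx + i])
        for j in range(n)
        for i in range(n)
    )


def best_motion_vector(block_frame, search_frame, bx, by, n=16, search=4):
    """Try every offset within +/-search pixels; return (cost, dx, dy) of the best match."""
    height, width = len(search_frame), len(search_frame[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cx, cy = bx + dx, by + dy
            if not (0 <= cx <= width - n and 0 <= cy <= height - n):
                continue  # candidate would fall outside the frame
            cost = sad(block_frame, search_frame, bx, by, cx, cy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic frames: block_frame is search_frame displaced by a known offset.
SIZE = 48
search_frame = [[(3 * x * x + 5 * y * y + 7 * x * y) % 256 for x in range(SIZE)]
                for y in range(SIZE)]
block_frame = [[search_frame[(y + 2) % SIZE][(x - 3) % SIZE] for x in range(SIZE)]
               for y in range(SIZE)]

cost, dx, dy = best_motion_vector(block_frame, search_frame, bx=16, by=16)
```

Every candidate location rereads all 256 pixels of its 16 x 16 window, which is the repeated loading the quote refers to.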
The APA doesn't have to load the entire location because the locations almost always have some overlap, he went on. It need only load the incremental pixels and then, by doing a column-and-row shift, reorder the pixel data for the "compare" locations and begin processing the next calculation.
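The saving from loading only incremental pixels can be estimated with a toy count. The model below is an assumption-laden sketch: it supposes a serpentine scan in which each one-pixel step needs just one new row or column of the block, and it counts only pixel loads. The search range of 8 pixels is hypothetical, and the result is not meant to reproduce NeoMagic's figures.

```python
def sequential_loads(n, search):
    """Every candidate location reloads its full n x n window."""
    candidates = (2 * search + 1) ** 2
    return candidates * n * n


def incremental_loads(n, search):
    """Load the first window in full, then one new row or column of n pixels
    per one-pixel step (assumes a serpentine scan over the candidates)."""
    candidates = (2 * search + 1) ** 2
    return n * n + (candidates - 1) * n


seq = sequential_loads(16, 8)
inc = incremental_loads(16, 8)
print(f"sequential: {seq} pixel loads, incremental: {inc} pixel loads")
```

With a 16 x 16 block and 289 candidate locations, this model gives 73,984 loads for the sequential case against 4,864 for the incremental one.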
"The result," said Singer, "is orders-of-magnitude fewer data loads, with far more processing capability per clock, and the combination gives lower power." The scheme also improves the quality of the motion estimation, he claimed.
This form of processing contrasts sharply with RISC- and DSP-based engines, said Singer. "RISC CPUs are sequential-processing flow machines, optimized for decision-intensive tasks. As a result, they're good for branching, context switching and multitasking," he said. Their arithmetic capabilities are limited, and they're not up to the demands of motion estimation, he said.
DSPs, meanwhile, excel at fast math. "Typically they're constructed based on single-cycle MAC [multiply/accumulate] engines with a caching pipeline for repetitive data loads," Singer said. Though he acknowledged that this makes them better candidates than RISC chips, "it's still not particularly good for motion estimation because the kind of calculation we do in MPEG-4 motion estimation is a sum of absolute differences," not a MAC. "The sum of absolute differences starts with a subtraction, and DSPs aren't particularly good at that," said Singer.
Although the concept behind APA is not entirely new, it's gotten a lift from process technology advances. "This kind of array processing simply wasn't practical at consumer price levels in a 0.35- or 0.25-micron world," said Singer. "We needed to get to 0.18 or 0.13 micron, which we have done."
For programmers to get up to speed on APA, Singer said, will take "a new approach to algorithm development. You can start by porting an existing algorithm, but then you learn to think in parallel functionality, vs. the sequential engine; [that's when] you get the most out of it. You have to step out of the von Neumann architecture and into a new domain."