Generally speaking, there are a limited number of options when it comes to executing compute-intensive data processing applications at consumer electronics price-points:
- ASICs: These little scamps offer high performance and low power consumption, but their functionality is "frozen in silicon", they have long lead times, and they have extremely high development costs.
- FPGAs: These "off-the-shelf" devices are reprogrammable via hardware design methodologies, but they have relatively slow reconfiguration rates that make them unsuitable for applications requiring dynamic reconfigurability. Also, although improvements are constantly being made, they consume relatively large amounts of power compared to ASICs and SoCs, which tends to make them unsuitable for use in low-power, hand-held, consumer electronics applications.
- CPUs/DSPs: Both general-purpose CPUs and special-purpose digital signal processors (DSPs) are highly programmable, but they consume a lot of power and are not capable of addressing extremely compute-intensive or bandwidth-intensive algorithms.
- SoCs: System-on-chip devices are complex beasts that combine ASIC hardware with CPU/DSP functions, hardware accelerators, blocks of memory, peripherals, and so forth. Not surprisingly, these share the pros and cons of ASICs/CPUs/DSPs.
In addition, of course, there is a plethora of specialized architectures, such as arrays of ALUs and/or Processing Elements (PEs) and/or CPUs/DSPs, but these tend to be focused on tasks like robotic vision systems and implementing wireless base-stations, where considerations such as power consumption and cost are less of an issue (feel free to peruse and ponder my Computing Universe paper for more details on these little rascals).
But now it seems that there's a new kid on the block. Having worked in "stealth mode" for the last couple of years, the folks at Element CXI are leaping onto center stage with a family of devices called Elemental Computing Arrays (ECAs). The guys and gals at Element CXI achieved first silicon in June 2007; they first demonstrated these devices in October 2007 at the CEATEC imaging, information, and communications conference in Japan; and first customer shipments are currently scheduled for Q1 2008.
"But just what are ECAs?" you cry! Well, let me tell you what I know, but remember that "a little knowledge is a dangerous thing", so you should regard me as being a highly dangerous individual (because I know so little)!
First and foremost, before we leap into the fray with gusto and abandon, we should note that ECAs are going to be presented as "off-the-shelf" chips that you can incorporate into your own designs. Next, we should note that these little rapscallions are dynamically reconfigurable. We could talk in terms of "reconfiguring the chip hundreds of thousands of times a second", but this doesn't really let us grasp the scope of things; it may be better to visualize this along the lines of "any portion of the chip – up to and including the entire device – can be dynamically reconfigured in a single clock cycle". Last (for the moment) but not least, we should be aware that these little ragamuffins use an all-software programming model.
It's difficult to know quite where to start, because there are so many aspects to this technology that it makes your head spin. So, in order to keep what little sanity I have left, I'm going to start at the bottom and work my way up.
Conceptually, the lowest-level functional blocks in an ECA are known as Elements. There are currently seven types of Elements, which are divided into three main classes: computation, storage, and signaling, as illustrated in Fig 1.
Fig 1. There are seven types of fundamental building blocks called Elements.
The Compute-Class Elements are as follows:
- BREO: Bit RE-Orderer. This enables shifting, interleaving, packing, and unpacking operations and can be used for (un)packing, (de)interleaving, (de)puncturing, bit extraction, simple conditionals, etc.
- BSHF: Barrel SHiFter. This enables shifting operations and can be used for 16-bit barrel shift, left shift, right shift, logical shift, arithmetic shift, concatenation, etc.
- MULT: 16×16 signed and unsigned MULTiplier with optional 32-bit accumulation stage; double 8×8 multiplies.
- SALU: A Super ALU that performs 16-bit and 32-bit arithmetic and logical functions and can be used for sorts, compares, ANDs, ORs, XORs, ADDs, SUBs, ABS, masking, detecting, leading 0's, etc.
- TALU: A Triple ALU that enables up to three simultaneous logical and arithmetic functions with conditional execution. This little scallywag can be used for sorts, compares, ANDs, ORs, XORs, ADDs, SUBs, ABS, masking, detecting, Viterbi ACS, CORDIC, Motion Estimation, etc.
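To make the arithmetic in the list above a little more concrete, here is a minimal Python sketch of the sort of 16-bit operations the BSHF and MULT descriptions mention (logical and arithmetic shifts, concatenation, and a 16×16 multiply with a 32-bit accumulation stage). This is emphatically not Element CXI's instruction set: every function name here is hypothetical, and the assumption that the accumulator wraps modulo 2^32 (rather than saturating) is mine, since the description above doesn't say.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def shift_left16(x, n):
    """Left shift of a 16-bit word; bits shifted past bit 15 are lost."""
    return (x << n) & MASK16


def shift_right_logical16(x, n):
    """Logical right shift: zeros are shifted in from the left."""
    return (x & MASK16) >> n


def shift_right_arith16(x, n):
    """Arithmetic right shift: the sign bit (bit 15) is replicated."""
    return (to_signed(x, 16) >> n) & MASK16


def concat16(hi, lo):
    """Concatenate two 16-bit words into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def mac16(a, b, acc=0, signed=True):
    """16x16 multiply followed by a 32-bit accumulate.

    ASSUMPTION: the accumulator wraps modulo 2**32 on overflow.
    """
    if signed:
        a, b = to_signed(a, 16), to_signed(b, 16)
    else:
        a, b = a & MASK16, b & MASK16
    return (acc + a * b) & MASK32
```

For example, `shift_right_arith16(0x8000, 4)` keeps the sign bit set and gives `0xF800`, whereas `shift_right_logical16(0x8000, 4)` gives `0x0800`. The signed/unsigned flag on `mac16` mirrors the "signed and unsigned" wording in the MULT entry.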
The Storage-Class Elements are as follows:
- MEMU: A MEMory Unit providing random-access memory and sophisticated DAG (Data Address Generation) capabilities used for data storage.
The Signaling-Class Elements are as follows:
- SME: A State Machine Element is used to implement sequential code, operate as a co-processor with other Elements, and operate as a virtual Element for data-flow programs. The SME is a sequential processor, but – unlike traditional processors – it can be augmented by the other Elements in the same Cluster (we'll talk about Clusters in a moment). The SME is also used to implement the real-time operating system, run-time environment, housekeeping, test and resilience capabilities, and so forth.
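As a quick recap of the three lists above, the seven Element types and their classes can be written down as a small lookup table. This is only a bookkeeping sketch in Python; the names `ElementClass`, `ELEMENT_TYPES`, and `elements_in` are hypothetical and don't correspond to anything in Element CXI's tools.

```python
from enum import Enum


class ElementClass(Enum):
    COMPUTE = "computation"
    STORAGE = "storage"
    SIGNALING = "signaling"


# The seven Element types described above, keyed by mnemonic.
ELEMENT_TYPES = {
    "BREO": ElementClass.COMPUTE,    # Bit RE-Orderer
    "BSHF": ElementClass.COMPUTE,    # Barrel SHiFter
    "MULT": ElementClass.COMPUTE,    # 16x16 MULTiplier
    "SALU": ElementClass.COMPUTE,    # Super ALU
    "TALU": ElementClass.COMPUTE,    # Triple ALU
    "MEMU": ElementClass.STORAGE,    # MEMory Unit
    "SME": ElementClass.SIGNALING,   # State Machine Element
}


def elements_in(element_class):
    """Return the mnemonics belonging to one class, in alphabetical order."""
    return sorted(
        name for name, cls in ELEMENT_TYPES.items() if cls is element_class
    )
```

Calling `elements_in(ElementClass.COMPUTE)` returns the five compute mnemonics, while the storage and signaling classes currently hold one Element type apiece.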
Elements are non-homogeneous data-flow computational engines. All of the Elements have the same form, but different capabilities, thereby allowing each type to be implemented in the most efficient manner. Since all of the Elements have identical interfaces, this will facilitate adding new Elements in the future, and also creating new devices with different mixtures of Elements to target specific classes of problems.
Each Element has four 16-bit inputs and two 16-bit outputs (some Elements have the capability of ganging a pair of inputs or outputs together to perform 32-bit operations). Each input and output of an Element is queued, thereby isolating the Element from interconnect delays, and every Element executes an operation in one clock cycle (one Elemental instruction – the 32-bit multiplication – does require four clock cycles, but each of these is a normal clock cycle).
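One way to picture the queued interface just described is as a toy data-flow model: four input queues, two output queues, and at most one operation per clock tick. The Python sketch below is purely illustrative. The class name `Element`, the `used_inputs` parameter, and in particular the firing rule (an Element fires only when every input it uses has a word waiting) are my assumptions, not details confirmed by Element CXI.

```python
from collections import deque


class Element:
    """Toy model: four queued 16-bit inputs, two queued 16-bit outputs."""

    NUM_INPUTS = 4
    NUM_OUTPUTS = 2

    def __init__(self, operation, used_inputs):
        self.operation = operation            # returns a tuple of 1 or 2 words
        self.used_inputs = tuple(used_inputs)
        self.inputs = [deque() for _ in range(self.NUM_INPUTS)]
        self.outputs = [deque() for _ in range(self.NUM_OUTPUTS)]

    def push(self, port, word):
        """Queue a 16-bit word on an input port."""
        self.inputs[port].append(word & 0xFFFF)

    def tick(self):
        """Model one clock cycle; return True if the Element fired.

        ASSUMPTION: the Element fires only when every input it uses
        has at least one word queued; otherwise it simply waits.
        """
        if not all(self.inputs[p] for p in self.used_inputs):
            return False
        operands = [self.inputs[p].popleft() for p in self.used_inputs]
        results = self.operation(*operands)
        if len(results) > self.NUM_OUTPUTS:
            raise ValueError("an Element has only two outputs")
        for port, word in enumerate(results):
            self.outputs[port].append(word & 0xFFFF)
        return True


def add_with_carry(a, b):
    """Hypothetical operation: 16-bit sum on output 0, carry on output 1."""
    total = a + b
    return (total & 0xFFFF, total >> 16)


adder = Element(add_with_carry, used_inputs=(0, 1))
adder.push(0, 0xFFFF)
fired_early = adder.tick()   # input 1 is still empty, so nothing happens
adder.push(1, 0x0002)
fired = adder.tick()         # both operands are queued, so the Element fires
```

The point of the queues in this sketch is the same as in the description above: the producer of a word and its consumer don't have to be in lock-step, so a late-arriving operand just delays the firing rather than corrupting the result.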