The Jazz architecture was designed with several goals in mind. Our target market, applications rich in data and computation such as video, audio and image processing, required an architecture with high data throughput and high performance. To facilitate faster time-to-market, we also focused on a system solution that would be completely programmable and configurable post-manufacturing. Another design goal was to support the entire life cycle of a product, from early prototype through market entry to high-volume mass production; that required an efficient and modular architecture that would be scalable to allow for cost reduction. Finally, since the architecture is licensed to several partners such as semiconductor vendors and system companies, it had to be easy to implement in silicon.
The Jazz architecture is designed to support such consumer applications as audio, video, image and embedded connectivity. All of the target applications have some common attributes that are critical for an effective solution.
It is clear that the primary performance bottleneck in our target applications (MPEG-2, JPEG, AC-3) is a data-flow problem. Data is consumed, transformed and passed on to a later stage in the algorithm. In such a computational model, a central memory architecture has no advantages. Consequently, the Jazz architecture has a cascade memory architecture that allows compute resources to consume and produce data without bandwidth limitation. That allows for simultaneous access to memory from all computational resources and a higher overall number of operations that can be performed per cycle.
It is clear that while an application contains many different types of algorithms and transforms, each of those can be, and often is, implemented using common operations. Rather than focus on specific implementations of custom blocks, we focused on determining the best combination of general-purpose data-path operators. For example, operations such as a single-cycle multiply-accumulate with a 64-bit accumulator and a zero-cycle (in-operation) round and saturate provide an effective data path for those algorithms and transforms. In some cases, we added low-level features to the general-purpose operations to increase their applicability to our target markets.
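As a rough illustration of what a multiply-accumulate with a wide accumulator followed by round and saturate computes, here is a minimal Python sketch. The 16-bit Q15 operand format, the saturating accumulator and all function names are assumptions made for the example; the article does not specify the operand widths or rounding mode of the Jazz data path.

```python
# Illustrative sketch only: Q15 operands and these names are assumptions,
# not details of the Jazz data path.
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def saturate(value, lo, hi):
    """Clamp value into [lo, hi] instead of letting it wrap around."""
    return max(lo, min(hi, value))


def mac(acc, a, b):
    """Multiply a by b and add to a 64-bit accumulator (saturating by assumption)."""
    return saturate(acc + a * b, INT64_MIN, INT64_MAX)


def round_and_saturate(acc, shift=15):
    """Round to nearest by adding half an LSB, shift down, clamp to 16 bits."""
    rounded = (acc + (1 << (shift - 1))) >> shift
    return saturate(rounded, INT16_MIN, INT16_MAX)


def dot_q15(xs, ys):
    """Dot product of two Q15 vectors, one MAC per element pair."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return round_and_saturate(acc)
```

The wide accumulator lets many products be summed without intermediate overflow, and folding the round and saturate into the final step is what keeps the inner loop to a single operation per sample pair.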
We designed the compute engines with a very large instruction word (VLIW) microarchitecture in order to complete the support for high computational bandwidth. Those engines provide as much as 1.5 billion operations/second (BOPs), including five distinct memory accesses per cycle, three ALU operations, seven address or HW loop index computations and several other operations. Aggressive instruction compression yielded a high code density that complemented the high performance.
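One way to sanity-check the peak figure is to count the per-cycle operations the text enumerates and multiply by the 100 MHz clock rate quoted elsewhere in this article. This is a reader's consistency check under that assumption, not the authors' stated derivation, and it leaves out the "several other operations" that are not itemized.

```python
# Per-cycle operation counts as enumerated in the text.
MEMORY_ACCESSES = 5
ALU_OPS = 3
ADDRESS_OR_LOOP_OPS = 7

# Assumption: the 100 MHz clock rate quoted elsewhere in the article applies here.
CLOCK_HZ = 100_000_000

ops_per_cycle = MEMORY_ACCESSES + ALU_OPS + ADDRESS_OR_LOOP_OPS
peak_ops_per_second = ops_per_cycle * CLOCK_HZ

print(ops_per_cycle, peak_ops_per_second)
```

Under these assumptions the itemized operations alone account for 15 operations per cycle, which lines up with the quoted 1.5 BOPs.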
To ensure faster time-to-market, we designed the Jazz architecture to be compiler-programmable. That strict design goal was critical to avoid hardware structures that could not be utilized by a compiler and to design the compilation system in tandem with the hardware architecture. We also recognized that in many instances the I/O functionality could delay and impede the completion of a product. We designed the Jazz architecture with programmable I/O sections that can support serial and parallel interfaces such as SDRAM and 1394 with user-definable clocks. Since the interface definition can be programmed and changed after the hardware is done, it does not delay the deployment of the product.
Raw power without efficiency is inappropriate for consumer applications. The Jazz architecture incorporates several design choices that make it more compact and amenable to cost reduction and high-volume optimizations. Since the architecture is memory-dominated, the on-board memory is a logical four-port RAM implemented with high-speed two-port read/write SRAM. That is possible because VLIW does not require pipeline stages to achieve high performance, so the clock rate is kept at a moderate 100 MHz. To ensure high code density, an aggressive compaction of VLIW code is used to achieve over 80 percent utilization of instruction slots in memory.
The Jazz architecture is designed to support an application through the life-cycle phases. For prototype and early production, the application can be mapped to a generic Jazz implementation and get to market quickly. Shifting to high-volume production requires aggressive cost reduction to capture maximum return from shrinking margins.
A lower-cost and lower-power custom Jazz-based architecture can be created to match the needs of an application by removing unused memory, replacing RAM with ROM for instructions, removing unused compute capacity and removing unused I/O pins. None of those changes requires a redesign of the underlying application.
Let's look at one design issue in some detail, using it as an example of how architectural design decisions are focused on providing general-purpose capabilities rather than application-specific function.
The goal was to improve the performance and code density of a function used to extract bits from a bit stream. The function, GetBits, was used extensively in our target applications, and was consuming too many instructions and taking too many cycles to execute. Since it was used so often, its performance had a large impact on static code size and application performance.
A solution was proposed that involved the use of dedicated logic to directly perform the GetBits algorithm. Such a move would have resulted in fast performance and dense code, but would have been very inflexible, would not have fit in well with the architecture and would not have provided any benefit to any code except for GetBits itself. Instead, we selected key operations from the GetBits code that could be accelerated with new hardware operations. We picked operations that delivered improved performance and density while fitting cleanly with the rest of the architecture. The added benefit is that some of those operations can be used to improve general code in addition to GetBits.
For the pointers into the bit stream, the operations assume some persistent state between the execution of tasks manipulating the bit stream. The compiler automatically generates code at the start of those tasks to load the state into registers and at the end to store it back. Typically, a task will use the GetBits procedures many times, so the overhead of loading and saving the variables at the start and end of the task should be relatively small.
The new operations included:
- Bit Mask (ALU). Mask off the upper bits of a 32-bit word,
- Add Immediate (ALU). Add an immediate value (4-bit) in the ALU,
- Add Modulo 5 (ALU). Five-bit addition in the ALU,
- Modulo Addressing. Supports modulo address arithmetic.
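The four operations above can be sketched as small Python functions. The function names, the unsigned treatment of the 4-bit immediate and the base/length form of the circular buffer are assumptions for illustration; the article does not give the instruction encodings or exact semantics.

```python
WORD_MASK = 0xFFFFFFFF  # 32-bit word


def bit_mask(word, keep_bits):
    """Mask off the upper bits of a 32-bit word, keeping the low keep_bits bits."""
    return word & ((1 << keep_bits) - 1) & WORD_MASK


def add_immediate(value, imm4):
    """Add a 4-bit immediate (assumed unsigned, 0..15) to a 32-bit value."""
    if not 0 <= imm4 <= 15:
        raise ValueError("immediate must fit in 4 bits")
    return (value + imm4) & WORD_MASK


def add_modulo_5bit(a, b):
    """Five-bit addition: only the low five bits of the sum are kept."""
    return (a + b) & 0x1F


def modulo_address(addr, step, base, length):
    """Advance addr by step inside the circular buffer [base, base + length)."""
    return base + (addr - base + step) % length
```

A five-bit sum is a natural fit for a bit position within a 32-bit word, and modulo addressing lets a word pointer walk a circular stream buffer without explicit wrap-around tests.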
With those capabilities, the GetBits function can be implemented in three cycles. In addition, those three instructions can be packed in with other operations, reducing the overhead even further. The pseudocode looks like this:
- Set up registers for bit pointer, word pointer, two active words from bit stream,
- Instruction 1. Add number of bits to bit pointer, load word pointer, shift active words,
- Instruction 2. Mask number of bits from shift result,
- Instruction 3. Reset word pointer and active words.
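The steps above can be modeled in software roughly as follows. This is a minimal sketch, not the Jazz instruction sequence: the class name `BitReader`, the MSB-first bit order, the limit of 1 to 32 bits per call and the circular word buffer are all assumptions made for the example.

```python
class BitReader:
    """Hypothetical software model of the three-step GetBits sequence."""

    def __init__(self, words):
        # Setup: bit pointer, word pointer and two active 32-bit words.
        self.words = words                      # circular buffer of 32-bit words
        self.w0 = words[0]                      # active word currently being read
        self.w1 = words[1 % len(words)]         # next active word
        self.word_ptr = 2 % len(words)          # index of the next word to load
        self.bit_ptr = 0                        # bits already consumed from w0

    def get_bits(self, n):
        if not 1 <= n <= 32:
            raise ValueError("n must be between 1 and 32")
        # Step 1: add the bit count to the bit pointer and shift the
        # two active words so the wanted bits land in the low positions.
        end = self.bit_ptr + n
        window = (self.w0 << 32) | self.w1
        shifted = window >> (64 - end)
        # Step 2: mask off everything above the low n bits.
        value = shifted & ((1 << n) - 1)
        # Step 3: if w0 is used up, slide the active words and advance
        # the word pointer with modulo addressing.
        if end >= 32:
            self.w0 = self.w1
            self.w1 = self.words[self.word_ptr]
            self.word_ptr = (self.word_ptr + 1) % len(self.words)
        self.bit_ptr = end & 0x1F               # five-bit (modulo 32) addition
        return value
```

For example, with `BitReader([0x12345678, 0x9ABCDEF0, 0x0F0F0F0F])`, successive calls `get_bits(4)`, `get_bits(8)` and `get_bits(24)` return `0x1`, `0x23` and `0x456789`, the last one spanning the boundary between the first two words. The object's fields play the role of the persistent state that is loaded at the start of a task and stored at its end.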
The Jazz architecture is licensed to implementation partners that have the right to manufacture and use the architecture in the deployment of products. Essentially, the architecture is synchronous, and while it uses multiple clock domains, the interface is well-defined and well-segregated.
The architecture is also very regular, with repeatable structures that can be synthesized and routed hierarchically. Global connectivity is for the most part cascaded, which makes placement and floor planning straightforward. While the architecture is memory-rich and requires some 40 or 50 memory blocks, only four or five distinct memory cuts are actually needed. Full-scan design and memory built-in self-test provide simple but effective test coverage. A sizeable JazzE6 (six engines) with 256 user pins, 1.44 Mbits of data SRAM (dual port) and 2.88 Mbits of instruction memory in a 0.25-micron process measures about 11 mm x 11 mm and, at 100 MHz, draws between 2.5 W and 4 W, depending on the application.