In the SoC design and integration methodology of an MPEG-1/2 Audio Layer 3 (MP3) decoder chip, a top-down integration flow, combined with a focus on constraint-driven timing analysis, modular simulation, and DFT solutions, led to an implementation cycle of only eight weeks. The chip architecture is centered on Motorola's ColdFire V2 processor in combination with several application-specific modules.
We believe that in consumer applications such as MP3 decoding, the re-use of complex functional modules is an absolute must to meet today's narrow market windows. The MP3 audio decoder chip re-uses major parts of its functionality, such as the 32-bit RISC processor core with its peripherals and memory. In our design, several specific modules were designed in different design centers and were more or less complete when the SoC integration started.
The main charters of the SoC design and integration in this project were:
- Automatically generate a top-level description of the entire chip, combining all functional modules, ensuring I/O assignment, and integrating DFT control.
- Develop and implement a state-of-the-art DFT solution covering stuck-at, transition, and path delay faults.
- Develop and implement a modular verification environment, which ensured the correct integration of each pre-verified module into the entire system.
- Implement the entire system in a small chip area and ensure high speed: the central processor had to run at a guaranteed 140 MHz.
While none of these methods is completely new in the industry, they are not always all used together. In many designs, trade-offs are made in one or more of these areas, flows are not followed in a disciplined manner, or experiments are made with new techniques even though time does not permit. The result is most likely a delayed tape-out, a lot of stress, and possibly panic in the design phase. Sometimes even a re-spin of the silicon is necessary if timing, functionality, or test requirements are not completely met.
In the case of the MCF5249, very aggressive goals were set in terms of design cycle, performance and quality. We believe the following results show that the disciplined, yet aggressive approach paid off.
- Integration and verification of the SoC was completed within eight weeks.
- An initial test program was successfully running within two hours after the first prototypes arrived.
- An application board was playing music two weeks later (MP3 decoded with our chip).
- A minimum processor speed of 140 MHz was achieved.
- Power consumption was lowered by a factor of three compared to other commercial solutions.
The chip was primarily designed for audio systems supporting MP3 playback. The 32-bit RISC ColdFire V2 processor comes with both an Enhanced Multiply Accumulate unit and DSP-like addressing modes for non-sequential addressing of the local memories. The fully synthesizable core comes packaged with a standard set of peripherals and the ability to add customer-specific modules, as demonstrated by the IDE SmartMedia Interface and the audio modules. The audio modules provide an interface to common audio protocols such as EIAJ and IEC958. These features, together with 96 Kbytes of on-chip memory and 8 Kbytes of instruction cache, make this a viable solution for the audio market.
The modules in this design range from very standard peripherals to application-specific blocks. All are soft IP, coded in Verilog or VHDL (no custom IP).
Gathering modules from different design centers and vendors and making them work together can prove to be a difficult task. We ended up with many modules in different forms and at different stages of completeness. Since specification or design changes were still possible, the top-level generation was scripted with an in-house tool to allow fast turn-around with highly reproducible results. This top-level flow includes the automatic creation and connection of the Boundary Scan Register and the JTAG logic to meet the IEEE 1149.1 standard.
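As an illustration of scripted top-level generation, the following minimal Python sketch emits a structural Verilog top level from a list of module instances. It is not the in-house tool mentioned above, whose details are not given here. The `Instance` class, the `generate_top` function, and all module, instance, and net names are hypothetical, and only single-bit nets are handled.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One module instance in the top level (all names are hypothetical)."""
    module: str                                # module type, e.g. "audio_if"
    name: str                                  # instance name, e.g. "u_audio"
    ports: dict = field(default_factory=dict)  # port name -> top-level net


def generate_top(top_name, chip_ports, instances):
    """Emit a structural Verilog top level as a string.

    chip_ports maps each chip-level port to "input" or "output".
    Any net used by an instance that is not a chip port becomes a wire.
    """
    nets = {net for inst in instances for net in inst.ports.values()}
    wires = sorted(nets - set(chip_ports))

    lines = [f"module {top_name} ("]
    lines.append(",\n".join(f"    {d} {p}" for p, d in chip_ports.items()))
    lines.append(");")
    lines += [f"    wire {w};" for w in wires]
    for inst in instances:
        conns = ", ".join(f".{p}({n})" for p, n in inst.ports.items())
        lines.append(f"    {inst.module} {inst.name} ({conns});")
    lines.append("endmodule")
    return "\n".join(lines)


# Hypothetical two-module chip: a CPU core and an audio interface on one bus.
chip_ports = {"clk": "input", "sdata_out": "output"}
instances = [
    Instance("cpu_core", "u_cpu", {"clk": "clk", "bus": "sys_bus"}),
    Instance("audio_if", "u_audio",
             {"clk": "clk", "bus": "sys_bus", "sdata": "sdata_out"}),
]
top_verilog = generate_top("chip_top", chip_ports, instances)
print(top_verilog)
```

Because the netlist is regenerated from one description, a late change to a module's ports only requires editing the instance list and rerunning the script, which is the kind of reproducibility the scripted flow aims for.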
Once the top-level chip module was in place, the chip implementation flow started. The flow was highly automated, starting from top-level/JTAG creation. This approach avoids, as much as possible, any manual interaction where mistakes can be introduced. The synthesis strategy, using Synopsys' Design Compiler, relied on a distributed, bottom-up compile using foundry-supplied wireload models with the appropriate margin. The impact of scan insertion was taken into account early in the process by using a scan-ready compile and performing a scan design rule check on each block, so that scan design rule problems were reported as soon as possible. The full-chip scan insertion was done outside the synthesis tool, in Mentor Graphics' FastScan tool, which handled the numerous clock domains while inferring all the necessary scan connectivity.
Floor planning and chip power routing were performed using Silicon Ensemble from Cadence. These were started as soon as a preliminary scan-inserted netlist was available, which gave the back-end designer time to work through any floor planning issues that arose. Analog power to the phase-locked loop (PLL) was routed manually; the power rails of the standard cell rows, however, were used to connect power to the memories automatically. With this method, a very low-resistance connection between the memories and the power routing was achieved.
Wire-length- and congestion-driven placement of the standard cells was then performed. Even though timing was critical, this approach worked quite well, and timing closure was achieved using the In-Place Optimization (IPO)/Engineering Change Order (ECO) loop. To keep the scan chain connectivity from influencing the placement, the scan chains were broken prior to placement and reconnected in a new order after placement. This paved the way for clock tree insertion and for global and detailed routing.
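The post-placement scan reconnection can be pictured with a small Python sketch. The text does not say how the new order was chosen, so the greedy nearest-neighbour heuristic below is an assumption used only for illustration. The function names, cell names, and coordinates are hypothetical.

```python
import math


def reorder_scan_chain(placement, start):
    """Greedy nearest-neighbour ordering of scan cells by placed location.

    placement: dict of cell name -> (x, y) placed coordinates.
    start: cell that must stay first (for example the chain's head register).
    """
    remaining = dict(placement)
    order = [start]
    current = remaining.pop(start)
    while remaining:
        # Closest cell not yet stitched; ties broken by name so that
        # repeated runs give the same order.
        nxt = min(remaining,
                  key=lambda c: (math.dist(current, remaining[c]), c))
        order.append(nxt)
        current = remaining.pop(nxt)
    return order


def chain_length(placement, order):
    """Total straight-line length of the scan connections in this order."""
    return sum(math.dist(placement[a], placement[b])
               for a, b in zip(order, order[1:]))


# Hypothetical placement of four scan flops along one row.
placement = {"ff_a": (0, 0), "ff_b": (10, 0), "ff_c": (1, 0), "ff_d": (9, 0)}
netlist_order = ["ff_a", "ff_b", "ff_c", "ff_d"]    # pre-placement order
placed_order = reorder_scan_chain(placement, "ff_a")

print(chain_length(placement, netlist_order))
print(placed_order, chain_length(placement, placed_order))
```

In this toy case the netlist order zigzags across the row, while the placement-aware order visits neighbouring cells in turn and needs much less scan routing.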
In an ideal world, this flow would be completed once, all problems would be solved, timing would be met, and tape-out would occur. But we know the world is not perfect. Therefore, the results of the final implementation steps were fed back into timing analysis, and the IPO/ECO loop began. The layout parasitics were extracted using Cadence's Hyperextract 2.5D algorithm. This output was then converted into Standard Delay Format using Pearl and used for timing analysis of the chip. The full-chip, placement-aware timing optimization was performed using worst-case and best-case timing and parasitics data simultaneously, which made it possible to fix setup violations under worst-case conditions and hold violations under best-case conditions in one run. Since design changes at this stage of the implementation needed to be kept to a minimum, operations were limited to cell sizing and buffer/inverter insertion or removal. The modified design database, including the placement information for the new cells, was then fed back into the layout tool using the ECO approach. Here, the old routing information was discarded, and the extraction and analysis steps were rerun until timing targets were met and physical, functional, and DFT verification were completed.
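The idea of checking both corners in one run can be sketched in Python as follows. This is a simplified model, not the tool flow described above: clock skew and uncertainty are ignored, and the `Path` class, the `check_paths` function, the path names, and all delay, setup, and hold values are hypothetical. Only the 140 MHz target comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    delay_worst_ns: float  # launch-to-capture delay at the slow corner
    delay_best_ns: float   # the same path at the fast corner


def check_paths(paths, period_ns, setup_ns, hold_ns):
    """Return {path name: [violations]} using both corners in one pass."""
    violations = {}
    for p in paths:
        found = []
        # Setup: slow-corner data must arrive one setup time before
        # the next capturing clock edge.
        if period_ns - setup_ns - p.delay_worst_ns < 0:
            found.append("setup")
        # Hold: fast-corner data must stay stable for the hold time
        # after the capturing clock edge.
        if p.delay_best_ns - hold_ns < 0:
            found.append("hold")
        if found:
            violations[p.name] = found
    return violations


PERIOD_NS = 1000.0 / 140.0  # 140 MHz target, roughly 7.14 ns

paths = [
    Path("mac_to_regfile", delay_worst_ns=7.5, delay_best_ns=3.0),  # too slow
    Path("scan_shift", delay_worst_ns=0.4, delay_best_ns=0.05),     # too fast
    Path("bus_decode", delay_worst_ns=5.0, delay_best_ns=2.0),      # clean
]
violations = check_paths(paths, PERIOD_NS, setup_ns=0.2, hold_ns=0.1)
print(violations)
```

Looking at both corners together matters because the two fixes pull in opposite directions: speeding up a path for setup can create a hold problem at the fast corner, and adding delay for hold can break setup at the slow corner.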
Advanced DFT solutions were implemented to run all necessary functional and structural patterns at-speed. This ensures high fault coverage and short test time.
The implemented DFT solutions comprised five test modes: JTAG, PLL test, memory BIST, Dynamic High stress mode, and Scan mode. The entire chip can run on internal PLL clocks or on external bypass clocks. Additional sub-modes, such as engineering BIST and bypass test clocks, help test engineers debug the test program. All of the test circuitry of the chip is controlled centrally from the TCU. The TCU generates all necessary control signals for top-level port switching and for each sub-module, depending on the currently selected test mode. It also controls the power-down feature of the PLL.
The memory BIST test mode runs at-speed and has two modes of operation: production BIST (PBIST) and engineering BIST (EBIST). PBIST is used during production test for simple pass/fail identification of the local memories. EBIST allows the memory data to be monitored and thus helps with fault localization during debugging.
The scan mode is based on an advanced at-speed methodology. Scan chains are inserted separately per clock domain and per triggering edge (pos/neg). Each chain has its own independent at-speed scan enable signal and dedicated head and tail registers. Scan input and output switching is controlled with a special Bus Scan Enable (BSE) signal. This hardware architecture supports stuck-at, transition, and path delay patterns, all running at-speed.
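The partitioning rule for the scan chains (one chain per clock domain and per triggering edge, each with its own scan enable) can be sketched in Python. This is only an illustration of the grouping. The function names, flop names, clock domain names, and the `se_<domain>_<edge>` naming scheme are hypothetical.

```python
from collections import defaultdict


def partition_scan_cells(flops):
    """Group flops into one chain per (clock domain, triggering edge).

    flops: iterable of (flop name, clock domain, edge), edge "pos" or "neg".
    """
    chains = defaultdict(list)
    for name, domain, edge in flops:
        if edge not in ("pos", "neg"):
            raise ValueError(f"unknown edge {edge!r} for flop {name}")
        chains[(domain, edge)].append(name)
    return dict(chains)


def scan_enable_nets(chains):
    """Give every chain its own scan-enable net (naming is hypothetical)."""
    return {key: f"se_{key[0]}_{key[1]}" for key in chains}


# Hypothetical flops from two clock domains, with mixed triggering edges.
flops = [
    ("u_cpu/pc_reg", "cpu_clk", "pos"),
    ("u_cpu/ir_reg", "cpu_clk", "pos"),
    ("u_cpu/dbg_reg", "cpu_clk", "neg"),
    ("u_audio/tx_reg", "audio_clk", "pos"),
]
chains = partition_scan_cells(flops)
enables = scan_enable_nets(chains)
for key, cells in chains.items():
    print(key, enables[key], cells)
```

Keeping domains and edges in separate chains means no chain shifts across a clock boundary, and the per-chain scan enables let each domain be switched between shift and capture independently.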
During Automated Test Pattern Generation (ATPG), the faults were covered successively in multiple runs. The individual clock domains and the clock-crossing domains were addressed one at a time, and the faults detected in previous runs were deleted from the fault list. With this scan architecture and ATPG flow, excellent fault coverage was achieved.
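The fault-list bookkeeping across successive ATPG runs can be sketched in Python as follows. The `detect` callables stand in for actual ATPG tool runs and are purely hypothetical, as are the fault identifiers and the helper names. The sketch only shows how faults detected in one run are removed before the next run starts.

```python
def run_atpg_passes(all_faults, passes):
    """Run ATPG passes in order, pruning detected faults after each one.

    passes: ordered list of (label, detect), where detect(faults) returns
    the subset of the given faults that this run detects.
    Returns (per-pass report, undetected faults, overall coverage).
    """
    remaining = set(all_faults)
    report = []
    for label, detect in passes:
        detected = detect(remaining) & remaining
        remaining -= detected          # later runs no longer target these
        report.append((label, len(detected)))
    total = len(all_faults)
    coverage = (total - len(remaining)) / total
    return report, remaining, coverage


def by_domain(domain, limit=None):
    """Stand-in for one ATPG run targeting a single domain (hypothetical)."""
    def detect(faults):
        hits = sorted(f for f in faults if f[0] == domain)
        return set(hits[:limit])       # limit=None keeps every hit
    return detect


# Hypothetical fault list: (domain, fault id).
faults = {("cpu", 1), ("cpu", 2), ("audio", 1),
          ("cpu->audio", 1), ("cpu->audio", 2)}
passes = [
    ("cpu domain", by_domain("cpu")),
    ("audio domain", by_domain("audio")),
    ("clock crossing", by_domain("cpu->audio", limit=1)),
]
report, undetected, coverage = run_atpg_passes(faults, passes)
print(report, undetected, coverage)
```

Pruning the list after every run keeps each later run focused on the faults that are still open, so pattern count and run time are not spent on faults that are already covered.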
The chip was fabricated in a 0.18 µm 5-metal CMOS process. An application-like evaluation environment proved the correct functionality. The pre-verified test program ran on a tester within two hours after the silicon arrived.