The advent of smaller geometries has made it possible and practical to integrate more and more functionality onto a semiconductor chip. Developers look to incorporate features that will distinguish their products from their competitors', and with these features comes the growing need for embedded memory. Bringing memory onto the ASIC often lowers cost and power consumption, improves performance, and increases the reliability of the system on a chip
(SOC).
Many of today's chips demand more embedded memory than ever before. Large amounts of SRAM, ROM, EPROM, multi-port RAM, and DRAM are finding their way on board. For example, in the case of high-performance microprocessors, 30-50 percent of the premium space and 80 percent of the transistors are allocated to memory alone. These processors include several levels of cache for data and instructions, multi-port SRAMs for TAGs, TLBs, CAMs, register files, and general purpose SRAMs. As the need for
embedded memories continues to increase, so does the complexity, density, and speed of these memories. This, in turn, creates the need for specialized memory designs that require a high level of expertise and a specialized tool set to which many companies may lack access.
Outsourcing memory
Because of the stand-alone nature of memory blocks (they often begin or consume a pipeline in a clocked system), many chip developers find that outsourcing the design of the memory module is a rational
decision to make both for financial and human resource reasons. Memory blocks can be well defined and separated out from a system much more easily than can other components of a semiconductor chip. The modular nature of memory blocks, the huge demand for embedded memories, as well as the fact that the memory core may utilize new technologies in which the system design team lacks design expertise, have all resulted in the growth of memory compiler and custom memory design houses. To meet overarching system design
schedules, these design houses can provide many of the onboard memories to the system designers in a timely fashion.
While many companies do outsource the design of their embedded memories, many wait too long to make the decision. Seeking outside help early in the schedule can give the system designers the pin locations, footprints (which will establish the memory size), and the HDL models for the memories as soon as possible. These early efforts help ensure a timely and efficient end product without
compromising area, performance, or quality.
An alternate method of obtaining an embedded memory design is to use a memory compiler, which can provide a physical block relatively quickly and inexpensively. While this method is expedient and quite adequate for standard memory configurations, it has several downsides. Generally, compiled memory designs result in a larger memory block and less efficient overall system performance. In addition, the memory design may be inflexible when the
system design requires additional features.
Figure 1 - Custom Memory Blocks. The various memories included RAMs, TLBs, register files, ROMs, multi-port RAMs, and CAMs, as well as general purpose blocks.
Conversely, obtaining an embedded memory
design through a custom design house such as ours, Puyallup Integrated Circuit Company (PICCO), can offer multiple advantages. Customized memories can accommodate emerging system needs such as the need to pitch match the logic with the memory core. Instead of placing a standard memory block on the chip and then synthesizing the logic around it to create a desired function, designers can move the logic into the memory block, allowing the physical layout to fit tightly with the memory pitch dimensions. This
approach reduces the overall chip size, allows for a higher memory density, and improves the performance of the chip. The resulting design can be faster, more compact, less power-hungry, and more cleanly routed.
The complexities of current memory design demand a thorough series of procedures. Our design methodology covers the entire spectrum from concept to netlist, including the design, layout, and verification of a memory block. Precise methods help to ensure that the memory block will work, and work
well, when plugged into the SOC.
Memories for RISC
One of our recently completed designs included all of the embedded memories for a 500-MHz 64-bit RISC microprocessor. The onboard memories had to be fast and complex to service the equally fast and complex microprocessor. The various custom memories, which consumed more than one third of the area of the 200-mm² CPU, implement Level 1 and Level 2 caches, two levels of translation look-aside buffers (TLBs) to convert virtual page addresses to physical
addresses, multi-port register files for fixed- and floating-point cores, and other functions such as look-up tables (LUTs) and general purpose memory (GP). The caches contain separate memories for data storage, tag, and least-recently-used (LRU) functions. In addition to the multi-port storage array, the register files also contain ROMs and CAMs for address translation and a renaming logic unit (see Figure 1). In all, we created 20 unique memory designs. Nearly all macros required a single-cycle access.
Often these access times needed to be 1 ns or less since they comprised only a fraction of the function required during the 2-ns pipeline.
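The arithmetic behind that budget is straightforward. The following minimal Python sketch works it through; the 50 percent share of the cycle given to the memory access is an illustrative assumption, not a figure from the design.

```python
def cycle_time_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def access_budget_ns(freq_mhz: float, access_fraction: float) -> float:
    """Part of the cycle left for the memory access itself.

    access_fraction is a hypothetical share of the pipeline stage; the
    rest goes to address generation, muxing, and setup at the receiver.
    """
    return cycle_time_ns(freq_mhz) * access_fraction


period = cycle_time_ns(500.0)          # 500 MHz -> 2.0 ns pipeline stage
budget = access_budget_ns(500.0, 0.5)  # half the stage -> 1.0 ns access
```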
The complexity and uniqueness of each memory meant that a memory compiler wasn't a viable option. Each embedded memory required a custom design using novel circuit techniques to meet the high performance, density, low power, and high noise immunity required for the microprocessor.
Such a microprocessor had to use one of the most advanced, state-of-the-art processes:
0.18-µm, 6-layer copper dual-damascene metal CMOS. The small feature sizes and high-performance transistors presented additional design challenges. The narrow wires (whose heights were greater than their horizontal spaces) were especially susceptible to crosstalk and electromigration effects, while the low threshold of the transistors resulted in lower noise tolerances.
Design methodology
To familiarize ourselves with each new process and produce a consistent set of guidelines for each
designer to follow, we first develop a comprehensive set of design standards. These include optimal gate ratios, fanouts, maximum transistor widths, and pre-layout resistance and capacitance rules-of-thumb. Because high-density and high-speed memories require aggressive circuit techniques, crosstalk avoidance techniques and noise margin design standards are critical. Crosstalk standards dictate procedures for routing adjacent signals, while other noise margin standards define rules for static noise margin and
writability for latched circuits.
The design of multiple macros for a chip demands consistent circuit standards. Especially important are standards for clock generators and registers so that input setup-and-hold times are consistent across the entire CPU. To minimize clock skew, the designer needs to tightly control ratios and fanouts, as well as the rise and fall times of all the clock generators.
Figure 2 - Test your memories. Our test chip consisted of several embedded memories with scan and BIST, a TAP controller to initialize each test, and a PLL to drive the internal clock grid at frequency.
Additionally, we use design-for-test (DFT) features such as scan and full-frequency built-in-self-test (BIST) for each memory. Undoubtedly, BIST is a
more complicated technique than a test scheme that multiplexes the I/Os of the embedded memory to a test bus and routes them to the chip I/O pads for evaluation by a tester. However, BIST offers the advantages of working independently of the tester and operating the memory at full frequency. Depending on the complexity of the BIST, a signature can isolate a failure to just a particular instance or to an actual I/O or memory cell. The latter feature is useful for the implementation of redundancy and for
detailed failure analysis. BIST also provides a useful technique for testing the functionality and determining the maximum operating frequency of the macro or memory, but usually lacks the ability to predict the macro's access time. The DFT features add less than 2 percent area overhead and are invaluable in validating the memories. Using these techniques and custom embedded ATE (automatic test equipment) circuits, we have built several test chips to validate the complex design techniques used in building the
memories (see Figure 2). Since it's currently impractical to drive external I/O pads at 500 MHz, we implemented proprietary embedded ATE circuitry to capture and evaluate the actual access times of the embedded macros. By building the tester on the chip, we ensured that a low-cost digital tester could drive and evaluate the test chip.
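To make the idea of a memory BIST pattern concrete, here is a minimal Python sketch of March C-, one widely used memory test algorithm. The article does not say which algorithm the BIST engines ran, so the choice of March C- is an assumption, and the `FaultyMemory` class with its single stuck-at cell is a hypothetical software model rather than hardware.

```python
class FaultyMemory:
    """Bit-wide memory model; one cell may be stuck at a fixed value."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at  # (address, value) or None

    def write(self, addr, bit):
        self.cells[addr] = bit

    def read(self, addr):
        if self.stuck_at is not None and self.stuck_at[0] == addr:
            return self.stuck_at[1]  # the fault overrides the stored bit
        return self.cells[addr]


def march_c_minus(mem, size):
    """Run March C- and return the sorted addresses whose reads mismatched."""
    failures = set()
    up = range(size)
    down = range(size - 1, -1, -1)

    def element(order, expect, write):
        # One march element: optional read-and-compare, then optional write.
        for addr in order:
            if expect is not None and mem.read(addr) != expect:
                failures.add(addr)
            if write is not None:
                mem.write(addr, write)

    element(up, None, 0)    # w0 over all addresses
    element(up, 0, 1)       # r0, w1 ascending
    element(up, 1, 0)       # r1, w0 ascending
    element(down, 0, 1)     # r0, w1 descending
    element(down, 1, 0)     # r1, w0 descending
    element(up, 0, None)    # final r0
    return sorted(failures)


good = march_c_minus(FaultyMemory(16), 16)
bad = march_c_minus(FaultyMemory(16, stuck_at=(5, 1)), 16)
```

Returning the failing addresses rather than a single pass/fail bit mirrors the point made above: a more detailed signature is what makes redundancy repair and failure analysis possible.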
Timing and functional verification
Accurate timing models are crucial for any high-performance semiconductor chip. To characterize and simulate critical paths
in the embedded memories, we use Hspice from Avanti. Since it's impractical from a simulation runtime standpoint to simulate the entire macro's layout parasitic extraction (LPE) netlist, we use a lumping and loading technique (see Figure 3). While this methodology is common, it often leaves itself open to inaccurate modeling of the distributed loads and transmission-line effects that are represented by resistor-capacitor (RC) networks. The RC networks include not only resistance and capacitance but also transistors to accurately model
gate and source/drain capacitance. Recognizing the need to guarantee accurate timing, we have written tools to verify that all components of a critical path match the actual macro LPE netlist. We compare wire, gate, source/drain, and coupling capacitance and resistance for nets of interest between the critical path and LPE netlists. When these values don't match, we must update the load models.
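A minimal Python sketch of this kind of load comparison follows. It is not the in-house tool: the dictionary layout, the field names, the example values, and the 5 percent tolerance are all assumptions made for illustration.

```python
def mismatched_loads(critical, lpe, rel_tol=0.05):
    """Return {net: {field: (critical_value, lpe_value)}} for stale loads.

    Both arguments map net name -> {field name -> value}. A field is
    reported when the critical-path value differs from the LPE value by
    more than rel_tol of the LPE value.
    """
    report = {}
    for net, crit_vals in critical.items():
        lpe_vals = lpe.get(net)
        if lpe_vals is None:
            report[net] = {"missing_in_lpe": (None, None)}
            continue
        for field, crit in crit_vals.items():
            ref = lpe_vals.get(field, 0.0)
            limit = rel_tol * max(abs(ref), 1e-18)
            if abs(crit - ref) > limit:
                report.setdefault(net, {})[field] = (crit, ref)
    return report


# Hypothetical values: wl0 is within tolerance, bl0 capacitance is stale.
critical_path = {
    "wl0": {"wire_cap_fF": 120.0, "res_ohm": 85.0},
    "bl0": {"wire_cap_fF": 200.0, "res_ohm": 40.0},
}
lpe_netlist = {
    "wl0": {"wire_cap_fF": 122.0, "res_ohm": 85.0},
    "bl0": {"wire_cap_fF": 260.0, "res_ohm": 41.0},
}
stale = mismatched_loads(critical_path, lpe_netlist)
```

Any net that appears in the report is one whose load model in the critical-path deck needs updating before the timing numbers can be trusted.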
Hspice analysis includes simulations for at least six process, temperature, and voltage corners (P-T-V) with
measure statements and plot analysis at each corner. We analyze measure statements and plots and search for incorrect behavior such as poor signal-slew rates, signal glitches caused by crosstalk or charge sharing, unwanted overlapping pulses, poor propagation delays, and poor setup-and-hold margins about clocked circuits.
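The corner sweep can be pictured with a short Python sketch. The corner values, the 1.0-ns limit, and the toy delay function below are all hypothetical stand-ins; in practice the measured delay at each corner comes from the Hspice measure statements, not from a formula.

```python
from itertools import product

# Hypothetical corner definitions for illustration only.
PROCESS = ("slow", "typical", "fast")
TEMP_C = (0, 125)
VDD = (1.62, 1.98)

corners = list(product(PROCESS, TEMP_C, VDD))  # 3 x 2 x 2 = 12 corners


def toy_access_delay_ns(corner):
    """Toy stand-in for a simulated delay; NOT a circuit model."""
    process, temp_c, vdd = corner
    factor = {"slow": 1.2, "typical": 1.0, "fast": 0.85}[process]
    return 0.78 * factor * (1.0 + 0.001 * temp_c) * (1.8 / vdd)


def failing_corners(measure, limit_ns):
    """Return every corner whose measured delay exceeds the limit."""
    return [c for c in corners if measure(c) > limit_ns]


too_slow = failing_corners(toy_access_delay_ns, 1.0)
```

With these made-up numbers only the slow-process, low-voltage corners miss the limit, which is the typical shape of a setup-time problem; hold and race checks usually fail at the opposite extremes.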
We typically use a Verilog or VHDL model to model and simulate the entire SOC. To ensure accuracy, each embedded memory has a Verilog model associated with it. Our responsibility is to
ensure that the circuit implementation functionally matches the HDL model. For each memory, we write a comprehensive test bench to test all address combinations, control, and test modes (scan and BIST, in other words). We then apply these vectors and their associated expect data to the full LPE netlist for each macro.
As mentioned above, it proves impractical to have Hspice simulate extremely large netlists and large vector sets (often thousands of vectors). To bridge the gap between Hspice and
Verilog, we use Synopsys' Timemill, which combines logical equivalency testing and circuit electrical verification. It can take, as input, the full memory's LPE spice netlist driven by the same vectors simulated in Verilog. We have found that the tool has good timing accuracy so it can also point out timing weaknesses, in addition to detecting functional differences between the circuit netlist and the Verilog model. The tool isn't a fault simulator, but the vectors should toggle more than 99 percent of all nodes
for good coverage. We run Timemill over the same P-T-V corners as the Hspice simulations. Additional quality assurance runs also check for undriven nodes, low- and maximum-frequency operation, and P-T-V extremes.
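The 99 percent toggle target can be checked with a few lines of code once node activity has been recorded. This Python sketch assumes a hypothetical trace format (node name mapped to a sequence of sampled 0/1 values); it does not reflect the simulator's actual output files.

```python
def toggle_coverage(node_traces):
    """Fraction of nodes that take both logic values at least once."""
    if not node_traces:
        return 0.0
    toggled = sum(1 for values in node_traces.values()
                  if {0, 1} <= set(values))
    return toggled / len(node_traces)


def untoggled_nodes(node_traces):
    """Names of nodes that never left a single logic value."""
    return sorted(name for name, values in node_traces.items()
                  if not {0, 1} <= set(values))


# Hypothetical traces: 'scan_en' is never exercised by the vector set.
traces = {
    "wl0": [0, 1, 0, 0],
    "bl0": [1, 1, 0, 1],
    "sa_out": [0, 0, 1, 1],
    "scan_en": [0, 0, 0, 0],
}
coverage = toggle_coverage(traces)
idle = untoggled_nodes(traces)
```

The list of untoggled nodes is usually more useful than the percentage, since it points directly at the test mode or address range the vectors missed.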
Physical verification
We also use Calibre from Mentor Graphics to verify the physical design. Complete LVS and DRC rule decks check for correct circuit connectivity and all spacing, width, overlap, and enclosure violations. Additional quality assurance rule decks check for floating
layers, resistive connections, and unwanted geometries.
For layout parasitic extraction we use Mentor's xCalibre, which generates LPE netlists for use in Hspice critical-path analysis and Timemill simulations. For accurate extractions, the layout hierarchy must match the schematic hierarchy at all levels. Additionally, all feedthroughs must be embedded into each leaf cell so that their parasitic effects will be modeled in the subcircuit LPE netlist.
Figure 3 - Timing simulation model. The critical components of this memory array model are the placement of memory cell clusters in the four corners of the array; the periphery that contains the address decoding, clock generation, and drive circuitry; the transmission line models that separate the four clusters and long routes; and the coupling capacitors for crosstalk modeling (not shown).
Although LPE netlists are back-annotated into the critical-path simulation, it's imperative that no major surprises crop up between pre-LPE estimates and post-LPE simulation results. Given the high-performance results we're trying to achieve, attention to quality layout practices is crucial to our circuit design techniques. Some of these layout practices include folding or sharing signal source/drains whenever
possible, as well as shielding clock lines and groups of decode lines, folding wide transistors, and using multiple contacts to minimize resistance. (This can be a particularly important detail when driving a large load.)
Quality assurance
In addition to the above-mentioned procedures and checks, we also perform extensive quality assurance analysis on each macro before its release to the system designer. Since EDA quality assurance tools are just emerging and may not be fully validated, we have
developed many of our own in-house checks. One level of QA checking can be achieved using in-house software developed specifically for memories in the smaller geometries. We use the tool to ensure that the Hspice critical-path netlist loading exactly matches the full-layout LPE netlist. It also analyzes every net in the entire LPE netlist and checks for excessive driver fanout and skew ratio; it detects multiple drivers on a net and finds the nets that are susceptible to charge sharing (especially dynamic
nets) and crosstalk effects. For the latter, the coupling capacitance, driver's strength, receiver's noise margin, and number of adjacent nets are all taken into account. The designer must either correct or justify any net in violation of any of the above checks.
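The per-net checks described above can be sketched in Python as follows. This is an illustration, not the in-house software: the `Net` fields, the fanout definition (load capacitance over driver input capacitance), and both thresholds are assumptions, and a real crosstalk check would also weigh driver strength and receiver noise margin.

```python
from dataclasses import dataclass


@dataclass
class Net:
    name: str
    drivers: tuple              # names of the gates driving the net
    driver_input_cap_fF: float  # input capacitance of the driving gate
    load_cap_fF: float          # total capacitance seen by the driver
    coupling_cap_fF: float      # part of load_cap_fF coupled to neighbors


def qa_violations(nets, max_fanout=4.0, max_coupling_ratio=0.25):
    """Return (net name, rule) pairs a designer must fix or justify."""
    violations = []
    for net in nets:
        if len(net.drivers) > 1:
            violations.append((net.name, "multiple drivers"))
        if net.load_cap_fF / net.driver_input_cap_fF > max_fanout:
            violations.append((net.name, "excessive fanout"))
        if net.coupling_cap_fF / net.load_cap_fF > max_coupling_ratio:
            violations.append((net.name, "crosstalk risk"))
    return violations


# Hypothetical nets for illustration.
nets = [
    Net("dec_a", ("inv1",), 5.0, 15.0, 2.0),
    Net("wl_7", ("drv3",), 10.0, 60.0, 6.0),
    Net("bl_2", ("wr0", "pre0"), 8.0, 24.0, 9.0),
]
report = qa_violations(nets)
```

A bitline with both a write driver and a precharge device is a legitimate multi-driver net, which is why each flagged net is either corrected or justified rather than rejected automatically.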
We perform QA checking on the layout with a special DRC ruleset. This process finds resistive connections (for example, routes through poly, diffusion, or substrate) and checks power grid integrity and excessively wide transistors. Resistive, or
soft, connections that a typical DRC ruleset fails to check may not cause a functional failure in silicon, but can easily contribute to frequency-related or stability failures.
To meet the timing criteria, designers must sometimes make tradeoffs between noise tolerances and speed. Even so, all circuits must pass minimum noise margin rules or the circuit will likely fail when placed in the entire CPU. Circuits such as memory cells, ratioed logic (also known as pseudo-NMOS), and dynamic logic gates all
undergo static and dynamic noise margin analyses. We run Monte Carlo Hspice analysis on circuits where device parameter mismatches on the same die can prove critical (differential sense amps, for instance). Finally, all memory cells and latches are tested for writability over all P-T-V corners.
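The statistical idea behind the Monte Carlo runs can be shown with a simplified Python sketch. It replaces circuit simulation with a Gaussian threshold-mismatch model, which is an assumption; the sigma and signal values are hypothetical, and counting a failure whenever the offset magnitude reaches the bitline differential is a deliberately conservative simplification.

```python
import random


def sense_amp_failure_rate(sigma_vt_mv, signal_mv, trials=10000, seed=1):
    """Estimate how often input-pair mismatch swamps the bitline signal.

    Each transistor of the differential pair gets an independent
    Gaussian threshold shift; the offset is the difference of the two.
    """
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    failures = 0
    for _ in range(trials):
        offset = rng.gauss(0.0, sigma_vt_mv) - rng.gauss(0.0, sigma_vt_mv)
        if abs(offset) >= signal_mv:
            failures += 1
    return failures / trials


weak_signal = sense_amp_failure_rate(10.0, 20.0)
strong_signal = sense_amp_failure_rate(10.0, 60.0)
```

Comparing the two rates shows why the developed bitline differential has to be sized against several sigma of device mismatch, not against the nominal offset of zero.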
The power distribution and integrity of the power grid have a significant impact on the macro's performance. Voltage IR drops on Vdd and ground bounce on Vss affect noise margins, timing, and possibly
functionality. The problem magnifies with lower supply voltage levels and smaller Vts associated with deep submicron feature sizes. Additionally, the high current densities at the 500-MHz cycle times associated with narrow lines in 0.18-µm technology increase the possibility for electromigration failures. EM failures will usually occur after several months or years of use because of the gradual degradation of interconnects from current flow and Joule heating. If these failures occur too soon in the lifetime of a design,
they can be catastrophic, since they will typically occur in a customer's system in the field.
Using Synopsys' Powermill (Timemill's sister tool) to simulate the entire macro's power, we can create a current map that details each subcircuit's power by placement location. The current map, along with the macro layout's RC-extracted netlist, is input to a tool that analyzes the power buses' IR drops and EM. The tool reports any wire segment or contact/via that fails, allowing designers to improve the
busing. Layout overlays of the errors, as well as contour maps and 3D current and voltage distribution plots are also available to assist the analysis.
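A one-dimensional Python sketch shows the kind of calculation such a tool performs on every bus segment. It assumes a single rail fed from one end with current taps along its length; the resistances, currents, wire width, and the 1-mA/µm current limit are hypothetical, since real EM limits are process-specific.

```python
def rail_ir_drop(segment_res_ohm, tap_current_a):
    """Cumulative IR drop (volts) at each tap of a rail fed from one end.

    Segment i connects tap i-1 (or the supply pad when i is 0) to tap i
    and carries the current of tap i plus every tap beyond it.
    """
    drops = []
    total = 0.0
    downstream = sum(tap_current_a)
    for r, i_tap in zip(segment_res_ohm, tap_current_a):
        total += r * downstream
        drops.append(total)
        downstream -= i_tap
    return drops


def em_suspect_segments(tap_current_a, width_um, max_ma_per_um):
    """Indices of segments whose current per unit width exceeds the limit."""
    suspects = []
    downstream = sum(tap_current_a)
    for index, i_tap in enumerate(tap_current_a):
        if downstream * 1000.0 / width_um > max_ma_per_um:
            suspects.append(index)
        downstream -= i_tap
    return suspects


res = [0.5, 0.5, 0.5]        # ohms per segment (hypothetical)
taps = [0.01, 0.02, 0.01]    # amps drawn at each tap (hypothetical)
drops = rail_ir_drop(res, taps)
hot = em_suspect_segments(taps, width_um=20.0, max_ma_per_um=1.0)
```

The segments nearest the pad carry the most current, so they fail the EM check first even though the largest voltage drop appears at the far end of the rail.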
These QA procedures aren't limited to the highest speeds and the smallest processes. Even larger processes (0.35-µm and below) and typical slower speeds (100 MHz and above) can exhibit increased susceptibility to noise margin, crosstalk, IR drop, or EM-related failures.
On delivery
When outsourcing the design of embedded memories, a customer
should expect certain deliverables. Early on, memory designers should provide an abstract for floorplanning and placement and routing that establishes the critical boundaries and pin locations for the system designers. The customer should also expect accurate HDL models so that they can eliminate any system bugs. Later, the memory design team should deliver a timing library with delay and race lookup tables or equations that the customer can use in full-chip logic and timing simulations. Current topology
maps help systems engineers to analyze power, IR drops, and EM at the full-chip level. The design team should also include a test bench with test vectors for the memory block along with adequate documentation. The final product is the complete layout database, which will be the memory block that is dropped into place on the system chip. It should come with complete documentation that includes simulation, timing, and verification results, as well as design details, netlists, and schematics.
Embedded
memories are a vital part of today's semiconductor chips, and the level of interoperability they provide to the full chip determines the efficiency, speed, and performance of the overall chip. A solid design methodology can deliver a well-designed memory.
Embedded memories require tighter controls than traditional off-chip memories, since they are subject to externally generated noise. Moreover, the power grid of the memory may need to carry current from the external logic. Designers must learn to predict
and implement accurate gray box models because they usually design the memories in parallel with the entire chip, and the memory integration must occur without a hitch.
The development of quality embedded memories starts with the setting of stringent design standards. This effort, supplemented by quality assurance tools, truly succeeds only when implemented by a design team that not only can design innovative circuits, but also has the discipline to adhere to the strict methodology.
Eric Hall and George Costakis are two of the founders of Puyallup Integrated Circuit Company. Since 1990, PICCO has specialized in full-custom, high-performance, and high-density memory design services and IP.
Copyright © 2000 Integrated System Design Magazine