MoSys combines design, process and test to break the 2 billion accesses per second barrier
John Scott-Thomas
4/20/2012 4:59 PM EDT
MoSys has created a new serial memory, the Bandwidth Engine IC, that pairs a highly efficient 10G serial interface with an innovative architecture to perform more than 2 billion memory accesses per second. This access rate is needed to support the data rates of 100GE (100 Gigabit Ethernet) and 100 Gb/s aggregate line cards. The Bandwidth Engine contains on-chip ALUs and a memory architecture that accelerate networking operations such as statistics, and it was designed for applications where high data speeds, 10-year expected lifetimes, and government-mandated power reductions create restrictive specifications. It distinguishes itself from traditional networking devices by emphasizing fast, intelligent access, which works well in packet classification applications. Achieving this access rate required a highly collaborative design approach that combined exacting product definition, tightly designed RTL code, a high-speed, low-latency SerDes, the core 1T-SRAM technology developed by MoSys, and innovative layout and packaging design. The result is a device whose high-speed serial interface eases SoC packaging and system design challenges. Overall system performance increases while power and cost fall, because banks of traditional memory devices are consolidated into one Bandwidth Engine.
The Bandwidth Engine uses MoSys' initial technical innovation, the 1T-SRAM, an embedded DRAM (eDRAM) memory that approaches SRAM speeds. It does this with an eDRAM array architecture built from small memory banks, which reduces the capacitive and resistive loading on the bitline and so lowers latency. In addition, the 1T-SRAM interface hides the DRAM refresh and precharge cycles. The memory banks read and write at "SRAM-like" speeds, with cycle times of 3.9 nanoseconds. The embedded memory is organized into four independent partitions, each divided into 64 banks of 32K x 72 bits, for a total memory size of 576 Mb. Each partition has one write port and two read ports, which are accessed in a round-robin TDM (time-division multiplexed) manner. The combination of these array elements allows up to 12 operations to be in flight in every 3.9-nanosecond period, so at a 10G interface rate it is possible to issue about three commands every nanosecond.
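The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are made up for this example, "32K" is read as 32 x 1024 words, and the 12 in-flight operations are assumed to come from four partitions with three ports each, which the text implies but does not state outright.

```python
# Back-of-envelope check of the capacity and access-rate figures quoted above.
# All names are hypothetical; the port interpretation is an assumption.

PARTITIONS = 4
BANKS_PER_PARTITION = 64
WORDS_PER_BANK = 32 * 1024      # "32K" taken as 32 * 1024 words (assumption)
BITS_PER_WORD = 72
PORTS_PER_PARTITION = 1 + 2     # one write port plus two read ports
CYCLE_NS = 3.9                  # bank cycle time in nanoseconds


def total_capacity_mbit() -> float:
    """Total embedded memory in megabits (1 Mb = 2**20 bits)."""
    bits = PARTITIONS * BANKS_PER_PARTITION * WORDS_PER_BANK * BITS_PER_WORD
    return bits / (1024 * 1024)


def ops_in_flight() -> int:
    """Operations that can overlap within one bank cycle, one per port."""
    return PARTITIONS * PORTS_PER_PARTITION


def ops_per_ns() -> float:
    """Peak operations per nanosecond if every port is busy every cycle."""
    return ops_in_flight() / CYCLE_NS


if __name__ == "__main__":
    print(f"capacity: {total_capacity_mbit():.0f} Mb")
    print(f"operations in flight per cycle: {ops_in_flight()}")
    print(f"operations per nanosecond: {ops_per_ns():.2f}")
```

Under these assumptions the capacity comes to 576 Mb and the peak rate to roughly three operations per nanosecond, which agrees with the figures in the text.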
The Bandwidth Engine builds on the core memory array architecture by adding an innovative, 90 percent efficient, low-latency interface that runs over one to sixteen differential serial links that are CEI-11 or XFI compatible. The Gigachip Interface (GCI) has been optimized for high-access-rate devices, using 80-bit packets that carry a 72-bit payload and 8 bits of CRC. The GCI is designed for chip-to-chip communication rather than a typical network SerDes application, and it includes an automatic error-recovery mechanism to guarantee the reliable data transport required by the intended enterprise and carrier markets. MoSys chose a mesochronous interface to minimize the latency associated with traditional SerDes. The data is pipelined through a control block that feeds the bit stream to the four 1T-SRAM partitions. On the transmit side, as each partition becomes available, one every nanosecond, up to two data words can be read out and returned to the GCI for transmission to the host.
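The 90 percent efficiency figure follows directly from the frame layout: 72 payload bits out of every 80 bits on the wire. The sketch below packs a payload and a CRC into one 80-bit frame and checks it on receipt. It is a minimal illustration, not the GCI specification: the article does not say which CRC polynomial or bit ordering GCI uses, so the CRC-8 polynomial 0x07, the payload-then-CRC layout, and every function name here are assumptions. The bandwidth helper also ignores any line-coding overhead.

```python
# Illustrative 80-bit frame: 72-bit payload followed by an 8-bit CRC.
# The polynomial, field order, and names are assumptions, not the GCI spec.

PAYLOAD_BITS = 72
CRC_BITS = 8
FRAME_BITS = PAYLOAD_BITS + CRC_BITS


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, zero initial value (assumed parameters)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def build_frame(payload: int) -> int:
    """Append the CRC of a 72-bit payload to form an 80-bit frame."""
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload must fit in 72 bits")
    crc = crc8(payload.to_bytes(PAYLOAD_BITS // 8, "big"))
    return (payload << CRC_BITS) | crc


def check_frame(frame: int):
    """Return the payload if the CRC matches, otherwise None."""
    payload = frame >> CRC_BITS
    received_crc = frame & 0xFF
    expected_crc = crc8(payload.to_bytes(PAYLOAD_BITS // 8, "big"))
    return payload if received_crc == expected_crc else None


def frame_efficiency() -> float:
    """Fraction of transmitted bits that carry payload."""
    return PAYLOAD_BITS / FRAME_BITS


def payload_gbps(lanes: int, lane_gbps: float = 10.0) -> float:
    """Payload bandwidth for a lane count, ignoring line-coding overhead."""
    return lanes * lane_gbps * PAYLOAD_BITS / FRAME_BITS


if __name__ == "__main__":
    frame = build_frame(0x0123456789ABCDEF01)
    print(f"frame: {frame:020x}")
    print(f"efficiency: {frame_efficiency():.0%}")
    print(f"16 lanes: {payload_gbps(16):.0f} Gb/s of payload")
```

A receiver that gets `None` from `check_frame` would request a retry. How GCI actually recovers from errors is not described in the article, so that step is left out of the sketch.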
Another innovative feature of the Bandwidth Engine design is its set of on-chip ALUs, one per partition, which can manipulate data using an internal read-modify-write operation. Using the ALU offloads the host processor and frees up the interface for other operations, resulting in higher performance and improved energy efficiency. Because each partition has its own ALU, one instruction can be issued every nanosecond at a 10G interface rate. To ensure data integrity through an ALU operation, the ECC syndrome bits of the 72-bit word are checked, the word is corrected if necessary, and the syndrome bits are recalculated.
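To see why an on-chip read-modify-write frees up the interface, consider a statistics counter that is bumped once per packet. The toy model below counts how many transfers cross the serial link when the host does the update itself versus when it sends a single ALU command. Everything in it is hypothetical: the class and method names, the 64-bit counter width, and the cost model of one transfer per command or response are assumptions made for illustration, not the device's command set. ECC checking is not modeled.

```python
# Toy model contrasting a host-side counter update with an on-chip
# read-modify-write. Names, counter width, and the one-transfer-per-command
# cost model are hypothetical, not the device's actual command set.

COUNTER_MASK = (1 << 64) - 1  # assumed counter width; the article gives none


class Partition:
    """One memory partition with a local ALU (illustrative only)."""

    def __init__(self):
        self.words = {}
        self.link_transfers = 0  # commands and responses crossing the link

    def read(self, addr):
        self.link_transfers += 2  # read command in, data word back
        return self.words.get(addr, 0)

    def write(self, addr, value):
        self.link_transfers += 1  # write command carrying the data
        self.words[addr] = value & COUNTER_MASK

    def alu_add(self, addr, delta):
        self.link_transfers += 1  # one command; the update stays on-chip
        self.words[addr] = (self.words.get(addr, 0) + delta) & COUNTER_MASK


def host_side_update(partition, addr, delta):
    """Host reads the counter, adds to it, and writes it back."""
    value = partition.read(addr)
    partition.write(addr, value + delta)


def in_memory_update(partition, addr, delta):
    """Host sends one command and the partition's ALU does the rest."""
    partition.alu_add(addr, delta)


if __name__ == "__main__":
    host, alu = Partition(), Partition()
    for _ in range(1000):            # 1000 packets of 64 bytes each
        host_side_update(host, addr=7, delta=64)
        in_memory_update(alu, addr=7, delta=64)
    print("host-side transfers:", host.link_transfers)
    print("in-memory transfers:", alu.link_transfers)
```

In this model both approaches leave the same value in memory, but the host-side version uses three link transfers per update where the in-memory version uses one.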
At the silicon level, MoSys chose TSMC as the manufacturing foundry. The device is built on TSMC's embedded DRAM process, which is based on capacitor-under-bitline technology, shown in Figure 1. MoSys has made a successful device by combining TSMC's stable process with its own novel chip design. The sense amplifiers connected to the bitline exploit all the metal layers available in the logic-compatible TSMC process. Three stages of multiplexed sense amplifiers are used. The first-level bitline runs 20 microns in metal 1 and connects the memory cells to the first-level sense amplifier. The outputs of two first-level sense amplifiers are multiplexed and run 750 microns in metal 4 to a second-level sense amplifier. Finally, eight second-level outputs are multiplexed to a third-level sense amplifier using 750 microns of metal 6.

Figure 1: SEM cross-section of the Bandwidth Engine.
The layout of the Bandwidth Engine is also oriented to the primary design goals of high access rate and low latency. Conventional Serializer/Deserializer (SerDes) devices place the I/O at the edge of the chip. The Bandwidth Engine changes this by placing the I/O, Gigachip Interface, and clocking at the center of the die. This achieves two things: a 2-3 nanosecond reduction in latency, and latency equalization. Placing the GCI and SerDes lanes in the center of the chip also reduces receiver/transmitter (Rx/Tx) crosstalk and allows future generations of the Bandwidth Engine to keep the same pinout. The design challenge with this approach was maintaining sufficient noise isolation between the memory, core, and SerDes. The die layout is shown in Figure 2. In addition to the central SerDes block, two inductors are visible. These belong to the LC oscillators used in the VCOs (Voltage Controlled Oscillators) of two Phase Locked Loops (PLLs). Two PLLs are required to cover the 6-10 GHz frequency range used by the chip, and LC oscillators were chosen to keep PLL jitter low.
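One plausible reason a single LC oscillator does not cover the whole range is tuning span. An ideal LC tank resonates at f = 1 / (2π√(LC)), so with a fixed inductor the capacitance must change by the square of the frequency ratio. The sketch below works through that arithmetic. It is a hedged illustration, not a description of the actual VCOs: the function names are made up, the component values are arbitrary, and the split of the 6-10 GHz range at its geometric mean is an assumption, since the article does not say how the two PLLs divide the range.

```python
import math

# Ideal LC-tank arithmetic. Component values and the band split point are
# illustrative assumptions, not measurements of the Bandwidth Engine.


def lc_resonant_hz(inductance_h: float, capacitance_f: float) -> float:
    """Resonant frequency of an ideal LC tank, f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def cap_ratio_for_range(f_min_hz: float, f_max_hz: float) -> float:
    """Cmax/Cmin needed to tune a fixed-inductor tank from f_min to f_max."""
    return (f_max_hz / f_min_hz) ** 2


if __name__ == "__main__":
    f_lo, f_hi = 6e9, 10e9
    f_split = math.sqrt(f_lo * f_hi)  # assumed split at the geometric mean

    print(f"example tank (1 nH, 1 pF): {lc_resonant_hz(1e-9, 1e-12) / 1e9:.2f} GHz")
    print(f"one oscillator, 6-10 GHz: Cmax/Cmin = {cap_ratio_for_range(f_lo, f_hi):.2f}")
    print(f"lower band up to {f_split / 1e9:.2f} GHz: "
          f"Cmax/Cmin = {cap_ratio_for_range(f_lo, f_split):.2f}")
    print(f"upper band from {f_split / 1e9:.2f} GHz: "
          f"Cmax/Cmin = {cap_ratio_for_range(f_split, f_hi):.2f}")
```

Under these assumptions, splitting the range between two oscillators reduces the capacitance tuning ratio each one needs from about 2.8 to about 1.7.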
The package was co-designed with the chip. The package uses 8 metal layers, giving the designers some flexibility to fine-tune the series inductance of the package leads. The package inductance is designed to compensate for the parasitic pad capacitance. This results in a cleaner eye diagram, improved return loss, and reduced data error rates.
To reduce test costs, which are significant when testers must connect to the sixteen channels on the Bandwidth Engine, a Design-for-Test (DFT) processor was placed on-chip. The processor can be reprogrammed over the manufacturing lifetime of the chip. This allows product engineers to modify the test algorithm as more is learned about the weak-bit signatures seen during test, which is essential to ensuring the enterprise- and carrier-grade quality and reliability needed to support a ten-year lifespan goal. In the future it may be possible to reduce or eliminate burn-in time for the part.
Ultimately, it is the teamwork of the architecture, design, layout, process, test, and manufacturing groups that allowed the Bandwidth Engine to achieve the 2 billion accesses per second and the 10-year lifespan required by enterprise customers. Patents covering the design of the Bandwidth Engine were filed one and a half years ago and are currently at the patent application stage. MoSys has design wins in progress with tier-one networking partners. The design itself is scalable, and MoSys feels it can be improved and used for 400GE. The next generation could see a fifty percent performance improvement.

Figure 2: Floor plan of the Bandwidth Engine. Note the placement of support circuitry in the center of the die.
John Scott-Thomas is a product manager at UBM TechInsights, a sister company to EE Times. Arabinda Das, a senior process analyst at UBM TechInsights, also contributed to this article.