Constantly escalating bandwidth requirements for SoCs are dictating revisions in every aspect of the design process, so it's not surprising to find leading-edge design teams searching for better solutions.
A few teams have gone beyond looking. They're trading conventional shared-bus architectures for new ones that use multiple buses in new topologies. For these teams, latency and insufficient bandwidth are no longer intractable problems. On the other hand, they must wrestle with making the new architectures work for them.
Many factors feed the trend. Multiple peripherals on a bus cause electrical loading that limits attainable clock rates. SoCs that sustain high I/O data rates present another problem. As always, power consumption is a constraint.
The shared bus is still the most common way to move on-chip data. In this scheme, a large multiplexer selects the source and drives a single interconnect net, which broadcasts the signals to all devices on the net.
Typically, outgoing address bits and data are driven through a high fan-out net to all peripherals. An address decoder selects the peripheral, which replies to the master's request. The read-data signals are routed back through a large multiplexer controlled by the address decoder, which selects the data wires the master needs.
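In behavioral terms, the read path of such a shared bus reduces to a decode-then-select step. The short Python sketch below is a minimal behavioral model of that path; the address map and peripheral names are invented for illustration.

```python
# Hypothetical shared-bus read path: one address decoder, one wide read-data multiplexer.
PERIPHERALS = {0x0: "uart", 0x1: "timer", 0x2: "dma", 0x3: "gpio"}

def shared_bus_read(address, read_data):
    selected = PERIPHERALS[(address >> 28) & 0x3]   # decoder picks the addressed slave
    return read_data[selected]                      # mux routes only that slave's data back

# Every peripheral sits on the net, but only the decoded one answers the master.
print(shared_bus_read(0x1000_0004, {"uart": 0, "timer": 42, "dma": 0, "gpio": 0}))  # 42
```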
Unfortunately, multiplexer-oriented topologies are not very scalable electrically because high fan-out and deep multiplexer logic add delays. The physical wires also tend to be long, adding more delay.
Latency becomes an issue. Although the data streams through a common net, information transfer takes place between the master and a single peripheral. While this transaction occupies the bus, all other peripherals are forced idle, at least in terms of receiving or sending bits.
When more than one master is present (SoCs often have multiple processors and a memory controller), the fact that the bus is unavailable for other transfers results in latency and wait states.
Designers have been implementing workarounds for years. Many of LSI Logic's customers, for example, have found that segmented buses with bridges between segments are sufficient. With this technique, designers can finely tune each subsystem to achieve latency and frequency goals. The design style requires a segmentation bridge between higher and lower frequency peripherals, however, which introduces two drawbacks: added latency between segments and a lot of up-front partitioning, says Balraj "Raj" Singh, senior marketing manager for processor cores at LSI.
Customers are opting for hybrid topologies that include segmented buses and point-to-point connections. "What we never see," Singh says, "is a pure switched-bus architecture and that's because of legacy peripherals."
One key difference between the hybridized present and the point-to-point/switch-matrix future is that CPU companies are making building blocks available to implement point-to-point transfers, which, in turn, makes for a less custom, more reusable design.
Bandwidth and Latency
Latency is the primary problem addressed by point-to-point buses. Fundamentally, a bus can transfer one bit of data per data line per clock cycle. The bus's maximum bandwidth is the bus width multiplied by the bus clock frequency. As long as the data packet (control, address, and data bits) fits within the bus width, latency is limited to pipelining effects and is minimal.
Latency becomes a problem when a data packet is larger than the bus width and bits must be transferred over multiple clock cycles, says Tim Mace, AMBA Program Bus Manager of ARM. Typically, the master maintains control of the bus for successive clock cycles to burst data, which creates more latency. As previously mentioned, however, the most significant latency occurs when multiple masters contend for bus access.
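As a rough illustration of that arithmetic, the Python sketch below computes peak bandwidth and the number of cycles a burst needs when a packet exceeds the bus width. The bus width, clock rate, and packet size are hypothetical, not taken from any particular bus.

```python
import math

def peak_bandwidth_bits_per_s(bus_width_bits, clock_hz):
    """Maximum bandwidth: one bit per data line per clock cycle."""
    return bus_width_bits * clock_hz

def burst_cycles(packet_bits, bus_width_bits):
    """Cycles needed to move a packet; more than one means the master must burst."""
    return math.ceil(packet_bits / bus_width_bits)

# Hypothetical example: a 32-bit bus at 200 MHz moving a 128-bit packet.
width, clock = 32, 200e6
print(peak_bandwidth_bits_per_s(width, clock) / 1e9, "Gbit/s peak")  # 6.4 Gbit/s
print(burst_cycles(128, width), "cycles to burst the packet")        # 4 cycles
```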
Designers have three basic strategies to increase bandwidth and decrease latency:
- Increase bus width. This strategy works only if peripherals can accommodate the wider bus in a single cycle. A 128-bit bus communicating with 32-bit wide peripherals, for example, operates at 25% of its bandwidth capacity or less (see the sketch after this list).
- Speed up the bus. This allows multiple masters to interleave on the bus with no apparent contention. A sophisticated interface is required to allow peripherals to keep pace, however, and power requirements are an important consideration.
- Provide multiple parallel buses. This solution creates point-to-point connections between masters and peripherals, allowing simultaneous transfers without the problem of matching peripheral clocks to bus clocks. ARM's AMBA AHB Multi-Layer Architecture is an example of this strategy. If a bus is provided for each master, a switched-bus architecture results. The design challenge is how to make the point-to-point connections. Two star IP companies, ARM and MIPS Technologies, use interconnect matrices and crossbar switches, respectively.
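To put a number on the first strategy's caveat, here is a minimal sketch of effective utilization when a wide bus talks to narrower peripherals, using the 128-bit bus and 32-bit peripherals cited above:

```python
def effective_utilization(bus_width_bits, peripheral_width_bits):
    """Fraction of bus capacity usable when a peripheral moves fewer bits per cycle."""
    return min(peripheral_width_bits, bus_width_bits) / bus_width_bits

# A 128-bit bus serving 32-bit peripherals uses at most a quarter of its capacity.
print(effective_utilization(128, 32))  # 0.25
```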
Over the past few years, ARM's AMBA bus protocol has become a de facto standard. The most recent AMBA upgrades give SoC designers the third option previously mentioned. Designers can implement point-to-point connections with virtually any number of layers of the protocol, each representing a separate bus. ARM employs an interconnect matrix to create point-to-point connections between masters and slave peripherals in a SoC.
In addition to increasing bandwidth, multi-layer buses offer other advantages. System resources do not have to be allocated to particular masters early in the design, for example, because each master can have its own bus and, at least theoretically, can use all the bandwidth.
Similarly, the design of each AMBA AHB (Advanced High-performance Bus) layer can be fairly simple. Arbitration and master-to-slave multiplexing usually can be avoided because, in a given configuration, each master typically accesses a different subset of slaves. If only one master ever accesses a slave, no arbitration between master layers is needed for it.
Although the masters have individual buses, more than one master may want access to the same peripheral. In this case, an arbitration block must be added to handle arbitration at each shared peripheral.
In an example of a system that has three masters and four slaves, the interconnect matrix is configured for three AHB layers, one layer for each master (Figure 1). The interconnect matrix comprises three types of components. Each layer has a decode stage that determines which slave has been chosen for the transfer. A multiplexer routes the transfer between the correct master/slave combination. An input stage stores data until it is needed.
Figure 1: This example of the interconnect matrix for ARM's multi-layer bus architecture links three masters to four peripherals. A decode stage determines which slave has been chosen for the transfer. A multiplexer routes the transfer between the correct master/slave combination. An input stage stores data until it is needed.
If two layers want to access the same slave simultaneously, arbitration is performed at that slave's port. Several arbitration schemes are available from ARM, but the designer can choose to implement any arbitration strategy.
Since the lower-priority layer cannot access the slave during the transfer, an input stage is included in the interconnect matrix to store a copy of the pipelined address and control information until the slave device is available.
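Behaviorally, the decode stage, per-slave arbitration, and input stage can be modeled in a few lines. The Python sketch below is a simplified illustration under assumed behavior, not ARM's implementation; the address map, the FIFO-style arbitration, and all master and slave names are invented.

```python
from collections import defaultdict, deque

def decode(address):
    """Hypothetical address map: the top address bits select one of four slaves."""
    return f"slave{(address >> 28) & 0x3}"

class InterconnectMatrix:
    """Behavioral model: one request queue (the input stage) per slave port."""
    def __init__(self):
        self.pending = defaultdict(deque)          # slave -> queued (master, address)

    def request(self, master, address):
        self.pending[decode(address)].append((master, address))

    def clock(self):
        """Each cycle, every slave port grants one request (simple FIFO arbitration);
        losers stay buffered in the input stage until the slave is free."""
        grants = []
        for slave, queue in self.pending.items():
            if queue:
                master, address = queue.popleft()
                grants.append((master, slave, address))
        return grants

matrix = InterconnectMatrix()
matrix.request("cpu", 0x1000_0000)   # decodes to slave1
matrix.request("dma", 0x1000_0040)   # same slave: must wait in the input stage
matrix.request("lcd", 0x2000_0000)   # different slave: proceeds in parallel
print(matrix.clock())   # cpu->slave1 and lcd->slave2 transfer on the same cycle
print(matrix.clock())   # dma->slave1 completes on the next cycle
```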
ARM's memory controllers have multiple ports and, therefore, fit easily into a multi-layer architecture. "Our philosophy is to provide the components through the ADK (ARM Developers Kit) to allow our customers to build whatever interconnect architecture they need," says Mace.
Although the multi-layer architecture implemented by AMBA 2.0 is sufficient for present designs, Mace says, ARM's technology roadmap addresses future performance requirements.
Not surprisingly, ARM's primary competitor espouses a different strategy. IBM Microelectronics' roadmap points toward a full-crossbar switched-bus architecture that can sustain all the data gigahertz CPUs can churn out, says Kalpesh Gala, PowerPC Strategic Marketing Manager. But for the near future, IBM will leverage the capabilities of its 128-bit CoreConnect bus architecture to maximize bus utilization.
CoreConnect was developed around a slave-centric view of bus operations, as opposed to the competing master-centric view. As a result, CoreConnect will implement parallelism in what are called "ways," or independent shared-bus slave segments.
Down the line, customers will first see a CoreConnect IP module that implements two shared-bus slave segments, followed by multiple shared-bus segments and finally a full switched point-to-point solution, says Gala. In the meantime, there is plenty of bandwidth left in CoreConnect's three hierarchical buses (the processor local bus, on-chip peripheral bus, and device control-register bus).
Since an effective way to reduce latency is to provide independent read and write buses with address pipelining of multiple outstanding bus-master requests, CoreConnect also allows simultaneous reads and writes. IBM has also implemented 128-bit interfaces pervasively in its CoreConnect product line.
Pipelining is a natural way to reduce latency, says Rick Hofmann, a Senior Engineer at IBM and an inventor of CoreConnect. The address bus, for example, pipelines additional read and write requests while uncompleted read and write transfers are in progress. Another innovation is a performance monitor that attaches to the processor local bus (PLB) so SoC designers can detect events, check performance, measure per-device bus utilization, and even continuously monitor and update target embedded applications to maximize system-bus bandwidth.
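As a back-of-the-envelope illustration of why address pipelining helps, the sketch below compares total bus occupancy with and without overlapping the address phase of one transfer with the data phase of the previous one. The cycle counts are hypothetical and do not represent CoreConnect's actual timing.

```python
def total_cycles(n_transfers, addr_cycles, data_cycles, pipelined):
    """Bus occupancy for back-to-back transfers, with and without address pipelining."""
    if not pipelined:
        return n_transfers * (addr_cycles + data_cycles)
    # Pipelined: after the first address, each new address overlaps a data phase.
    return addr_cycles + n_transfers * max(data_cycles, addr_cycles)

# Hypothetical phase lengths: 1 address cycle and 4 data cycles per transfer.
print(total_cycles(8, addr_cycles=1, data_cycles=4, pipelined=False))  # 40 cycles
print(total_cycles(8, addr_cycles=1, data_cycles=4, pipelined=True))   # 33 cycles
```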
In those instances where design teams still want to implement point-to-point connections, says Hofmann, they can design a crossbar circuit with existing CoreConnect IP. While IBM has not productized parallel structures, Hofmann says, an arbiter block and other elements are freely available to CoreConnect licensees, so any competent designer can create a specific crossbar structure from them and other CoreConnect IP.
In July, MIPS Technologies introduced IP that gives designers a head start on implementing point-to-point connections. Taking a page from communications designers, MIPS employs the principles of a cross-point switch in SoC-It, says Product Manager Ken Yap. SoC-It is a system-controller IP block that leverages the multi-layer capabilities of ARM's AMBA to allow two simultaneous transfers on the same clock cycle. It offers more plug-and-play usability than ARM's approach but is presently limited to two bus layers.
Each bus layer has a dedicated SoC-It system-controller kernel that includes a five-by-five crossbar switch fabric, a dual-port memory controller, a bus interface unit to the MIPS CPU, an interrupt controller, an arbiter block, and three dual-port IP interfaces that connect to bridges, customer IP, and peripherals (Figure 2). To move data in and out of the SoC-It kernel, MIPS has interfaces to the AMBA AHB bus bridge, the PCI bus bridge, and a peripheral bus controller.
Figure 2: The MIPS system controller block consists of a five-by-five crossbar switch that connects any combination of the CPU, the memory controller, and three IP interface units. An arbiter block uses a round-robin arbitration scheme as default, but you can modify the block.
The five-by-five crossbar switch provides interconnects between any combination of the CPU, the memory controller and three IP interface units. Both the memory controller and the CPU act as masters. The arbiter block uses a round-robin arbitration scheme as default but can be modified.
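A round-robin arbiter of the kind described can be modeled in a few lines. The sketch below is a generic illustration, not MIPS' actual logic, and the port names are invented.

```python
class RoundRobinArbiter:
    """Grants one requester per cycle, rotating priority so no port starves."""
    def __init__(self, ports):
        self.ports = list(ports)
        self.next_idx = 0                          # port with highest priority this cycle

    def grant(self, requests):
        n = len(self.ports)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            if self.ports[idx] in requests:
                self.next_idx = (idx + 1) % n      # rotate priority past the winner
                return self.ports[idx]
        return None

# Hypothetical port names for the five crossbar clients.
arb = RoundRobinArbiter(["cpu", "memctl", "ip0", "ip1", "ip2"])
print(arb.grant({"cpu", "ip1"}))   # cpu wins this cycle
print(arb.grant({"cpu", "ip1"}))   # ip1 wins the next; cpu cannot monopolize the port
```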
The memory controller is tightly coupled with MIPS' CPU cores and supports PC100 and PC133 SDRAM, as well as DDR200 and DDR266 SDRAM. SoC-It controllers come in three flavors, one for each MIPS core.
IP reuse is a high priority for SoC design teams, and nowhere is it more important than at the bus interface. This results in another important trend: interface standardization. ARM's AMBA Compliance Testbench and its AMBA Compliance Program, for example, support IP reuse, says Mace.
The Open Core Protocol (OCP) is a newly launched silicon-core protocol providing a bus-independent, high-performance, configurable interface between on-chip silicon cores and communication systems. The Open Core Protocol technical paper provides an OCP overview, along with the protocol's highlights, capabilities, advantages, and key features.
Other reuse initiatives are the VSIA (Virtual Socket Interface Alliance) and OCP-IP (Open Core Protocol International Partnership) standards for interfacing third-party IP with on-chip buses. Both should accelerate the trend to new architectures.
VSIA's Virtual Component Interface (VCI) defines point-to-point electrical connectivity that makes it easier to manage wire and logic delays in physical design, says Anssi Haverinen, chair of VSIA's On-Chip Bus Working Group and a research manager for Nokia. The underlying bus topology has less effect on module design when VCI is used because the interface is agnostic about the interconnect. "This makes the shift from shared-bus to something else much easier," according to Haverinen.
VCI has won acceptance because it provides some bus independence. On the other hand, it has received limited support from EDA vendors, largely because it does not define test signals, a configuration interface, and other key elements of a commercial implementation.
Many VCI supporters are adopting the OCP interface because it is VCI compatible and the transition is straightforward. The difference is that OCP defines a complete socket for IP blocks, says Haverinen, who, in addition to his role at VSIA, is an OCP-IP steering group member.
When it comes to putting today's on-chip bus trends into an historical context, the analogy with system-level networks is inescapable. Terms such as crossbar switch and point-to-point connectivity, for example, come from the networking world.
Buses and networks have quite different technologies, however, and understanding how they differ is important if the analogy is to be extended to SoCs. The most significant difference is that the system- and inter-system-level world relies on standard protocol stacks to communicate between devices and, consequently, decouples the interconnect from the devices.
On the other hand, buses tend to be more primitive in the sense that all of the IP blocks are expected to understand the bus's protocol and pipelining style, to cite just two examples. In other words, system requirements are reflected back into the IP.
For starters, this makes IP reuse difficult. Protocols such as OCP have been developed to facilitate reuse. Just as important, however, by decoupling IP blocks from the interconnect, OCP opens vistas of opportunity for optimizing system capabilities.
"OCP and the standard socket concept are critical pieces of the puzzle," says David Lautzenheizer, marketing vice president of Sonics. Since OCP is core-centric, he adds, each core can have its own data-word width, burst attributes, interrupt schemes, and other critical parameters. "The IP core publishes its requests to the world and it's up to the interconnect to get the right information to it."
Sonics' SiliconBackplane MicroNetwork provides such an intelligent interconnect. Physically adjacent to each IP core is a piece of the network that Sonics calls an agent (Figure 3). The agent's job is to communicate with the core in its particular flavor of OCP. Agents communicate with each other using a unique internal protocol that combines a fully pipelined, non-blocking, programmable-latency bus with an access mechanism that guarantees bandwidth. While the SiliconBackplane is topologically equivalent to today's shared-bus structures, its internal protocols deliver significantly higher bandwidth utilization, as much as 90%, according to Lautzenheizer.
Figure 3: Network components known as agents are used in Sonics' SiliconBackplane architecture to communicate with each IP core in its particular flavor of OCP.
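One common way to guarantee bandwidth on a shared medium is to pre-allocate time slots to initiators, with unclaimed slots falling back to ordinary arbitration. The sketch below illustrates only that general idea; the slot table and agent names are invented and should not be read as the SiliconBackplane's actual protocol.

```python
# Generic time-slot table: each agent owns a fraction of the cycles, which bounds
# its worst-case bandwidth no matter what the other agents do.
SLOT_TABLE = ["dsp", "cpu", "dsp", "video", "dsp", "cpu"]   # hypothetical 6-slot frame

def winner(cycle, requests):
    """The slot owner transfers if it has a request; otherwise the slot falls back
    to any other requester so guaranteed slots are not wasted."""
    owner = SLOT_TABLE[cycle % len(SLOT_TABLE)]
    if owner in requests:
        return owner
    return next(iter(requests), None)

# With all three agents requesting, the dsp still gets 3 of every 6 slots.
for cycle in range(6):
    print(cycle, winner(cycle, {"dsp", "cpu", "video"}))
```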
With Sonics' toolkit, designers can scale interconnect bandwidth independently from the OCP sockets, says Chief Technology Officer Drew Wingard. The interconnect can be wider, clocked faster, or pipelined differently than would ordinarily be dictated by individual IP blocks in the system, including the memory controller.
To enable maximum floorplan flexibility and rapid timing convergence, the bus connecting the agents combines the functions of multiplexing and repeater insertion, resulting in "short hops" between IP blocks.
With so many solutions emerging, it is no wonder that design teams have their work cut out for them just in deciding on a topology. They should expect it to stay that way. "There is no final answer," says Haverinen, "only an endless cycle of optimization and re-optimization."
About the Author
Contributing writer Jack Shandle is a former chief editor of both Electronic Design magazine and ChipCenter.com. He holds a BSEE degree and has written hundreds of articles on all aspects of the electronics OEM industry. Jack is president of eContentWorks, a consultancy that creates high-value content for publishers, eOEM corporations, and industry associations. His email address is email@example.com