The impact of memory architecture is a much-researched but little-publicized element of the performance of today's network switches. Over the years, many academic papers have been written about the advantages of output queued shared memory architectures over combined input-output queued (CIOQ) memory architectures.
Despite all of that research, the industry has standardized on the CIOQ architecture, mainly because of the design challenges of implementing a shared-memory architecture. However, there are now merchant switch chips able to take advantage of a shared memory design, making it timely to compare the two architectures and discuss the advantages of adopting the shared memory approach.
For years, the Holy Grail in switch chip design has been an output queued shared memory architecture. It has been difficult to achieve in the past due to the high bandwidth required between the switch inputs and the shared memory. Because of this, most vendors implement a combined input-output queued architecture, which uses less core memory bandwidth but requires extra features to avoid blocking. In most switch chip designs there is a compromise between core bandwidth and ingress complexity, with the result that corner-case blocking still remains. These switches must also store packets at both ingress and egress, adding to latency and memory requirements.
By providing a high-bandwidth core memory, the output queued switch architecture can be made simpler than competing CIOQ architectures: it eliminates the complexity of ingress virtual output queues (VOQs) and the extra memory they require. In addition, multicast packets are stored only once, further reducing on-chip memory requirements.
Traditional CIOQ Architectures
Memory access bandwidth has long been a thorn in the side of switch chip architects. With traditional crossbar and memory designs, there is insufficient on-chip bandwidth to allow every input port to write into the same output queue simultaneously, so packets must be buffered at the ingress, where they are subject to head-of-line (HOL) blocking. To get around this blocking issue, chip architects include virtual output queues at every switch input. This is the combined input-output queued architecture shown in Figure 1.
Virtual output queues at each ingress port provide a separate queue for each switch output (egress) port. If a particular egress queue is temporarily blocked, the matching ingress queue is flow controlled, but packets destined for other egress ports can bypass the blocked queue and be sent to non-blocked egress ports. For an N-port switch, however, this means N*N ingress queues and associated schedulers, which add significant complexity to the design of the switch. It also adds to packet latency, since each packet must be queued twice as it passes through the switch. Because of the complexity of VOQs and their schedulers, many switch designs tolerate some internal blocking in order to simplify the design.
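To make the bookkeeping concrete, the following Python sketch models the N*N virtual output queues of a CIOQ ingress stage. It is a minimal, hypothetical model, not a description of any vendor's implementation; the class and method names (CIOQIngress, enqueue, dequeue_for_egress) are illustrative only.

```python
from collections import deque

class CIOQIngress:
    """Toy model of the ingress side of an N-port CIOQ switch.

    Each ingress port keeps one virtual output queue (VOQ) per egress
    port, so an N-port switch holds N*N ingress queues in total.
    """

    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        # voqs[ingress][egress] holds packets waiting at 'ingress'
        # that are destined for 'egress'.
        self.voqs = [[deque() for _ in range(num_ports)]
                     for _ in range(num_ports)]

    def enqueue(self, ingress: int, egress: int, packet: bytes) -> None:
        # First queuing point: the packet waits at the ingress until the
        # matching egress buffer can accept it.
        self.voqs[ingress][egress].append(packet)

    def dequeue_for_egress(self, ingress: int, egress: int, egress_ready: bool):
        # If one egress is blocked, only its VOQ is flow controlled;
        # VOQs for other egress ports can still be served.
        if egress_ready and self.voqs[ingress][egress]:
            return self.voqs[ingress][egress].popleft()
        return None

# A 64-port switch needs 64 * 64 = 4,096 ingress queues plus schedulers.
ingress_stage = CIOQIngress(num_ports=64)
print(len(ingress_stage.voqs) * len(ingress_stage.voqs[0]))  # 4096
```

Even in this toy form, the queue count and the per-queue scheduling state grow quadratically with port count, which is the complexity the text refers to.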
Some switch fabrics are designed using chip sets that have separate ingress/egress chips, sometimes called fabric interface chips (FICs), along with central crossbar and scheduler chips. Because off-chip interfaces provide limited bandwidth compared to on-chip interfaces, blocking can occur at the crossbar outputs, requiring a CIOQ architecture. This configuration also requires a complex central scheduler and the associated flow-control and grant signaling that must be communicated between devices. All of this adds significantly to cost, power and area.
Output Queued Fabric Architecture
A true output queued shared memory switch chip architecture provides full-bandwidth access to every output queue from every input port; no blocking occurs within the switch, eliminating the need for complex VOQs. The block diagram of a shared memory architecture is shown in Figure 2.
With this architecture, all packets arriving at any ingress port are immediately queued at full line rate into shared memory. Packets are then scheduled from shared memory to the egress ports. Multicast packets are de-queued multiple times, once to each egress port in the fan-out. Each egress port has an independent scheduler, a design that is much simpler than a central scheduler. In addition, since each packet is queued only once, cut-through latencies of a few hundred nanoseconds can be achieved independent of packet size.
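The same kind of sketch can illustrate the shared-memory approach. In the hypothetical model below, a packet is written once into a shared buffer pool and each egress port holds only references into that pool, so a multicast packet occupies memory once regardless of fan-out. The names (SharedMemorySwitch, enqueue, dequeue) are assumptions made for illustration, not an actual chip interface.

```python
from collections import deque

class SharedMemorySwitch:
    """Toy model of an output queued shared memory switch.

    Packets are stored once in a shared buffer pool; each egress queue
    holds references to buffers, so multicast packets are stored once
    and de-queued once per fan-out port.
    """

    def __init__(self, num_ports: int):
        self.buffers = {}                 # buffer_id -> [packet, remaining fan-out]
        self.next_id = 0
        self.egress_queues = [deque() for _ in range(num_ports)]

    def enqueue(self, packet: bytes, egress_ports: list) -> None:
        # Single queuing point: one copy in shared memory per packet.
        buf_id = self.next_id
        self.next_id += 1
        self.buffers[buf_id] = [packet, len(egress_ports)]
        for port in egress_ports:         # multicast: one reference per fan-out port
            self.egress_queues[port].append(buf_id)

    def dequeue(self, egress_port: int):
        # Each egress port runs its own simple scheduler over its own queue.
        if not self.egress_queues[egress_port]:
            return None
        buf_id = self.egress_queues[egress_port].popleft()
        packet, refs = self.buffers[buf_id]
        if refs == 1:
            del self.buffers[buf_id]      # last fan-out copy sent; free the buffer
        else:
            self.buffers[buf_id][1] = refs - 1
        return packet

switch = SharedMemorySwitch(num_ports=64)
switch.enqueue(b"multicast payload", egress_ports=[1, 5, 9, 12])
print(len(switch.buffers))                # 1 buffer stored despite a fan-out of four
```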
Multicast adds complexity to CIOQ designs because of its blocking nature. With the CIOQ architecture, the egress buffers must accept a packet before it can be de-queued from the ingress buffer, adding to ingress congestion. The packet must also be stored multiple times per switch, adding to both the overall memory requirements and the latency. It also increases latency jitter, because copies of the packet sit in different physical egress queues.
An output queued architecture stores the packet only once per switch and de-queues it multiple times, once to each egress port in the fan-out. This reduces on-chip congestion, reduces overall memory requirements and provides low-latency transmission. Port-to-port skew is also minimized, which is important in applications such as video distribution.
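As a back-of-the-envelope illustration, assuming a CIOQ switch holds one ingress copy plus one egress copy per fan-out port while an output queued switch holds a single shared copy, the per-packet storage for one multicast frame works out as follows; the figures are illustrative only.

```python
packet_kb = 2      # one 2 Kbyte multicast packet
fanout = 4         # delivered to four egress ports

cioq_copies = 1 + fanout        # one ingress copy plus one copy per egress port
oq_copies = 1                   # single shared-memory copy, de-queued four times

print(cioq_copies * packet_kb)  # 10 Kbytes stored across the CIOQ switch
print(oq_copies * packet_kb)    # 2 Kbytes stored in the output queued switch
```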
Memory Size Comparison
To examine the impact on memory size, let's compare the memory requirements of a CIOQ architecture with those of an output queued architecture in a typical implementation. As described above, internal blocking and multicast inefficiencies require additional ingress memory, and multicast replication requires additional egress memory on a CIOQ switch. There are also inefficiencies in transferring packets between input queues and output queues, but we won't cover those in this comparison.
Assume the design goal is to provide enough on-chip packet memory to support 1,000 packets up to 2Kbytes in size. The output queued architecture allows storing all of these packets in a single shared memory buffer, and therefore requires 2MB of on-chip memory.
Let's assume a CIOQ switch would normally assign 1MB to the ingress queues and 1MB to the egress queues. Also assume, due to the design trade-offs discussed above, that there is a 20% blocking probability at the ingress. This means the CIOQ switch would actually need 1.2MB of ingress memory. Next, assume 20% of the traffic is multicast with an average fan-out of four ports. This requires 60% more egress memory, or 1.6MB. So for this example, the CIOQ switch would need 2.8MB of memory to come close to the performance available with 2MB of output queued memory.
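The arithmetic above can be reproduced in a few lines. The 20% ingress blocking overhead and the 20% multicast share with a fan-out of four are the same working assumptions used in the text, not measured figures.

```python
# Design goal from the text: 1,000 packets of up to 2 Kbytes each.
PACKET_COUNT = 1_000
PACKET_SIZE_KB = 2

# Output queued: every packet is held once in one shared buffer.
oq_memory_kb = PACKET_COUNT * PACKET_SIZE_KB            # 2,000 KB, i.e. 2 MB

# CIOQ: memory is split between ingress and egress buffers.
ingress_kb = egress_kb = oq_memory_kb / 2               # 1 MB each

blocking_overhead = 0.20   # assumed ingress blocking probability
ingress_kb *= 1 + blocking_overhead                     # 1.2 MB of ingress memory

multicast_share = 0.20     # assumed fraction of traffic that is multicast
fanout = 4                 # assumed average egress ports per multicast packet
# Each multicast packet needs (fanout - 1) extra egress copies.
egress_kb *= 1 + multicast_share * (fanout - 1)         # 1.6 MB of egress memory

print(f"Output queued: {oq_memory_kb / 1000:.1f} MB")               # 2.0 MB
print(f"CIOQ:          {(ingress_kb + egress_kb) / 1000:.1f} MB")   # 2.8 MB
```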