The iSCSI protocol will make it possible to design storage arrays, host-bus adapters and storage network switches that can transport Small Computer System Interface (SCSI) data over TCP/IP networks. While Fibre Channel will continue to be the primary storage interface for midsize to large storage infrastructures, iSCSI will be useful in smaller environments such as entry-level or departmental storage-area networks (SANs). For end users, this holds the promise of reduced costs and management complexity, since carrying SCSI over the Internet Protocol extends the scope of SANs.
System designers, though, need to consider the processing requirements of iSCSI, especially for 10-Gbit Ethernet implementations. Design features that help meet such requirements include a direct path to memory for data and the use of custom hardware such as ASICs to handle TCP and SCSI protocol processing.
To understand the design challenges inherent in 10-Gbit iSCSI, one must first understand the underlying SCSI and TCP/IP protocols.
Since its development in the 1980s, SCSI has become one of the most common means for linking computers to storage systems. SCSI relies on "initiators" and "targets" to originate and carry out I/O commands. Initiators include interface cards such as host-bus adapters in servers that are linked to a SAN. Targets are commonly storage devices. Logical units within a target, such as disk volumes or individual tape drives, are identified by unique logical-unit numbers. SCSI commands define a SCSI task to be executed by a logical unit at the target.
The Internet Protocol (IP) governs the forwarding of data packets across the Internet, while the Transmission Control Protocol (TCP) ensures that messages are properly divided into packets for distribution, reliably delivered and properly reassembled at their destination. Each iSCSI session can encompass multiple TCP/IP connections, and each TCP/IP connection can involve considerable network traffic. To find and identify a target, an iSCSI initiator needs the IP address, TCP port number and iSCSI target name in order to establish an iSCSI session to transmit and receive data.
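As a small illustration of the three items an initiator needs, the sketch below groups them into one record. The class name, field names, IP address and target name are hypothetical and chosen only for illustration; the default port shown is the TCP port registered for iSCSI.

```python
from dataclasses import dataclass

# 3260 is the TCP port registered for iSCSI targets.
DEFAULT_ISCSI_PORT = 3260


@dataclass(frozen=True)
class TargetAddress:
    """Hypothetical record of what an initiator needs to reach a target."""
    ip_address: str
    target_name: str                    # illustrative iSCSI target name
    tcp_port: int = DEFAULT_ISCSI_PORT

    def portal(self) -> str:
        """Return the address:port pair that the TCP connection is opened to."""
        return f"{self.ip_address}:{self.tcp_port}"


# Example values only; 192.0.2.10 is a documentation address.
target = TargetAddress("192.0.2.10", "iqn.2001-04.com.example:storage.disk1")
print(target.portal(), target.target_name)
```

The IP address and port locate the TCP endpoint, while the target name identifies which iSCSI target the session is meant for once the connection is up.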
Thus, iSCSI occupies a layer in the protocol stack between SCSI and TCP. It transports SCSI commands and responses, and it handles data transfer on behalf of the SCSI layer, generally to and from the buffers provided by the SCSI layer.
A TCP byte stream used by iSCSI is divided into a succession of iSCSI protocol data units (PDUs) that carry commands, responses and data. Each iSCSI PDU includes an iSCSI header (created by the sender) and a payload (such as a SCSI command, response or data). The receiver uses the iSCSI header to identify and extract the iSCSI payload for appropriate SCSI processing. On the wire, each network packet also carries TCP, IP and Ethernet headers, which the layers below iSCSI add on transmission and strip on receipt. A PDU may span multiple TCP packets; in turn, a TCP packet may contain portions of one or more PDUs.
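The framing described above can be sketched in software as follows. This is a minimal illustration, not a complete receiver: it assumes the header layout of the published iSCSI specification (a 48-byte basic header whose byte 4 gives the additional-header length in 4-byte words and whose bytes 5 to 7 give the data length, with the data padded to a 4-byte boundary), and it ignores the optional header and data digests. The names PduReassembler, make_pdu and pad4 are hypothetical.

```python
BHS_LEN = 48  # fixed size of the iSCSI basic header segment, in bytes


def pad4(n: int) -> int:
    """Round n up to the next multiple of 4 (data segments are padded)."""
    return (n + 3) & ~3


class PduReassembler:
    """Accumulates TCP payload bytes and returns PDUs as they complete."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, segment: bytes) -> list[tuple[bytes, bytes]]:
        """Add one TCP segment's payload; return completed (header, data) pairs."""
        self._buf += segment
        pdus = []
        while len(self._buf) >= BHS_LEN:
            ahs_len = self._buf[4] * 4                        # additional headers
            data_len = int.from_bytes(self._buf[5:8], "big")  # unpadded data length
            total = BHS_LEN + ahs_len + pad4(data_len)
            if len(self._buf) < total:
                break                     # rest of this PDU is in a later segment
            start = BHS_LEN + ahs_len
            pdus.append((bytes(self._buf[:BHS_LEN]),
                         bytes(self._buf[start:start + data_len])))
            del self._buf[:total]         # keep any bytes of the next PDU
        return pdus


def make_pdu(opcode: int, data: bytes) -> bytes:
    """Build a bare test PDU (no additional headers, no digests)."""
    header = bytearray(BHS_LEN)
    header[0] = opcode & 0x3F
    header[5:8] = len(data).to_bytes(3, "big")
    return bytes(header) + data + bytes(pad4(len(data)) - len(data))


# Two PDUs cut into 50-byte "segments", so the second segment holds the
# end of one PDU and the start of the next.
stream = make_pdu(0x01, b"A" * 10) + make_pdu(0x21, b"B" * 5)
receiver = PduReassembler()
for i in range(0, len(stream), 50):
    for header, data in receiver.feed(stream[i:i + 50]):
        print(hex(header[0]), data)
```

The point of the sketch is that PDU boundaries are recovered only from the iSCSI header, never from TCP segment boundaries, which is why a hardware receiver must track its position in the byte stream.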
The computational demands of packet and PDU processing along with protocol management are significant. In a 10-Gbit iSCSI implementation, the sheer throughput of TCP/IP and iSCSI traffic is beyond the ability of most microprocessors to handle effectively. Therefore, it is desirable to perform iSCSI data transfers directly to and from memory buffers using dedicated hardware to handle iSCSI data-in and data-out PDUs along with related sequencing and flow control.
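A rough calculation shows the scale of the problem. The sketch below assumes standard 1,500-byte Ethernet payloads and full line-rate traffic; both are assumptions made for illustration, and real frame sizes vary.

```python
LINK_BPS = 10_000_000_000  # 10-Gbit Ethernet line rate, bits per second
MTU = 1500                 # assumed Ethernet payload size, bytes
WIRE_OVERHEAD = 38         # preamble 8 + header 14 + CRC 4 + interframe gap 12


def frames_per_second(link_bps: int = LINK_BPS, mtu: int = MTU) -> float:
    """Frames per second on a saturated link with full-size frames."""
    return link_bps / ((mtu + WIRE_OVERHEAD) * 8)


fps = frames_per_second()
print(f"{fps:,.0f} frames/s, about {1e9 / fps:,.0f} ns per frame")
```

Under these assumptions the link delivers more than 800,000 frames per second in each direction, leaving a little over a microsecond to process each one. Smaller frames shrink that budget further.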
Also, iSCSI includes optional cyclic redundancy checks (CRCs) to protect its headers and data, but those CRCs differ in two important ways from the ones that are familiar to Ethernet designers.
First, iSCSI uses a different CRC polynomial, CRC32C, to enhance error detection for data that is also covered by an Ethernet CRC. Second, the data covered by a single iSCSI CRC may span several TCP packets that arrive at different times. By contrast, an Ethernet CRC covers data in one Ethernet packet. The second difference can be addressed either by buffering packets until the data can be reassembled and then performing the iSCSI CRC calculation, or by implementing the means to calculate and combine partial iSCSI CRCs.
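The simplest form of partial calculation is a running CRC whose state is carried from one packet to the next as data arrives in order. The sketch below is a plain table-driven software version of CRC32C, shown only to illustrate the idea; the class name RunningCrc32c is hypothetical, and combining partial CRCs computed over out-of-order pieces requires additional arithmetic that is not shown.

```python
def _make_table() -> list[int]:
    """Build the byte-wise lookup table for CRC32C."""
    poly = 0x82F63B78  # CRC32C (Castagnoli) polynomial, bit-reflected form
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


class RunningCrc32c:
    """Keeps CRC state between chunks that arrive in order."""

    def __init__(self) -> None:
        self._state = 0xFFFFFFFF          # standard initial value

    def update(self, chunk: bytes) -> None:
        """Fold one more chunk of the covered data into the CRC."""
        crc = self._state
        for byte in chunk:
            crc = (crc >> 8) ^ _TABLE[(crc ^ byte) & 0xFF]
        self._state = crc

    def value(self) -> int:
        """Return the CRC of everything seen so far."""
        return self._state ^ 0xFFFFFFFF   # standard final inversion


# The same data split across three "packets" gives the same CRC as one pass.
crc = RunningCrc32c()
for piece in (b"123", b"45", b"6789"):
    crc.update(piece)
print(hex(crc.value()))  # 0xe3069283, the standard CRC32C check value
```

Because only 32 bits of state are carried between packets, this approach avoids holding the whole data segment in a reassembly buffer, at the cost of requiring in-order processing.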
The ability to handle disruptions in the flow of data is critical in designing 10-Gbit iSCSI systems. Events that can disrupt the flow occur at each level of the iSCSI protocol stack: a missing TCP segment that requires retransmission; a missing iSCSI PDU that requires buffering of commands at the target; or a SCSI task-management function, such as ABORT TASK, that affects iSCSI processing.
In developing custom hardware, designers should resist the temptation to overoptimize pipelines with the expectation that there will be a clean, undisturbed flow of data. And because dedicated hardware may be unsuited to handling flow-disrupting events, it may be preferable to handle them in firmware or (in unusual cases) in software executed by the device's operating system.
Much of the iSCSI protocol management should ultimately be done using firmware. This includes iSCSI functionality such as connection setup and command sequencing.
By contrast, the handling of initial discovery and in-band authentication may be appropriate for drivers executed by the device's operating system.
Proper implementation of buffering is crucial to maintaining performance. Buffering out-of-order packets and commands, rather than dropping them and requiring that they be retransmitted, reduces both network traffic and endpoint processing loads. In addition, designers should use an up-to-date TCP implementation that supports selective acknowledgement and explicit congestion notification, since these functions can reduce the amount of packet retransmission required and, hence, the overall network traffic and endpoint loads.
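The buffering of out-of-order arrivals can be sketched as a small reorder buffer keyed by sequence number. This is an illustrative simplification: the class name ReorderBuffer is hypothetical, and the sketch ignores sequence-number wraparound and any limit on how much may be held, both of which a real design must handle.

```python
class ReorderBuffer:
    """Holds early arrivals until the missing items before them show up."""

    def __init__(self, first_expected: int = 0) -> None:
        self._expected = first_expected
        self._pending: dict[int, bytes] = {}

    def receive(self, seq: int, item: bytes) -> list[bytes]:
        """Accept one item; return every item that is now deliverable in order."""
        if seq < self._expected or seq in self._pending:
            return []                     # duplicate of something already seen
        self._pending[seq] = item
        ready = []
        while self._expected in self._pending:
            ready.append(self._pending.pop(self._expected))
            self._expected += 1
        return ready


buf = ReorderBuffer()
print(buf.receive(1, b"second"))   # held: item 0 has not arrived yet
print(buf.receive(2, b"third"))    # held as well
print(buf.receive(0, b"first"))    # releases all three, in order
```

Only the missing item has to be retransmitted; the items that arrived early are kept and delivered as soon as the gap is filled.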
An iSCSI initiator can transfer data to or from a target in two basic ways. Immediate or unsolicited data is transmitted in the same PDU as the SCSI command (immediate), or in one or more PDUs that follow the command PDU without waiting for permission from the target (unsolicited). Solicited data transfer must wait for permission from the target in the form of an R2T (ready-to-transfer) PDU that signals the target's readiness to receive the data. Since many SCSI writes involve less than 8 kbytes of data, providing target buffers of up to 8 kbytes per command for immediate or unsolicited data can eliminate one or more round trips that would otherwise be needed for solicited data transfer.
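The saving can be estimated with a short calculation. The sketch below assumes that each R2T solicits at most a fixed burst of data; the function name and the 64-kbyte burst size are hypothetical values chosen for illustration, while the 8-kbyte buffer matches the figure above.

```python
import math


def r2t_round_trips(write_len: int,
                    unsolicited_buf: int = 8 * 1024,
                    burst_len: int = 64 * 1024) -> int:
    """R2T exchanges needed for one write.

    write_len       -- total bytes the initiator wants to write
    unsolicited_buf -- bytes the target accepts without an R2T (0 disables)
    burst_len       -- assumed maximum bytes solicited by a single R2T
    """
    remaining = max(0, write_len - unsolicited_buf)
    return math.ceil(remaining / burst_len)


for size in (4 * 1024, 8 * 1024, 100 * 1024):
    with_buf = r2t_round_trips(size)
    without = r2t_round_trips(size, unsolicited_buf=0)
    print(f"{size // 1024:>4} kbytes: {with_buf} R2T with buffer, {without} without")
```

For writes that fit within the unsolicited buffer, the R2T exchange disappears entirely, which matters most on links where the round-trip time dominates the transfer time of a small write.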
Designers need to be aware of the complexity involved in introducing additional TCP/IP stacks beyond those running in the main operating system. Creating an additional complete stack on, for example, an iSCSI host-bus adapter duplicates functions performed by the host TCP/IP stack and adds significant administrative complexity. Furthermore, integrating multiple TCP/IP stacks may require interfaces to hardware and firmware at the TCP level that control which connections use which network interface. A more desirable alternative is to share and reuse appropriate functionality in the main TCP/IP stack.
A different source of complexity arises when iSCSI sessions are allowed to span multiple hardware network interfaces, such as multiple iSCSI host-bus adapters on a single server. This is because load balancing, error recovery and enforcement of iSCSI command ordering become more complicated when they must be coordinated among multiple interfaces. One way to reduce this complexity is to restrict iSCSI sessions to a single hardware interface and use host-based software such as EMC's PowerPath or Veritas' Dynamic Multipathing to assist in failover and load balancing.
Also, any custom hardware or firmware must be able to handle the larger number (possibly hundreds) of simultaneous active connections that may occur in a 10-Gbit iSCSI design, as compared with the smaller number of initiators per target in a typical SCSI environment. An important scenario to consider is a 10-Gbit hardware target communicating with software initiators running at 1 Gbit or slower. Without the proper resources and functionality to handle a sufficient number of active connections, the target hardware could bog down and become a bottleneck to overall system performance.
Designers should be aware that iSCSI requires the implementation of the Encapsulating Security Payload (ESP) and Internet Key Exchange portions of the IPsec protocol, and that there are significant design issues involved in implementing ESP functionality (both encryption and cryptographic integrity) at 10-Gbit speeds.
Designers of 10-Gbit Ethernet iSCSI should provide a direct path between the system interface and memory, dedicate hardware to handle protocol-processing tasks and consider which other tasks should be carried out by firmware vs. system-level software.