Design Article
File Sharing on the WAN: A Matter of Latency
Vinodh Dorairajan, Tacit Networks
12/15/2004 6:00 AM EST
The speed of light sets a hard limit on how fast data can travel, and that limit is one of the fundamental problems in using the Internet to exchange data across the wide area network (WAN). The time for a packet of data to go from the sender to the receiver and back is called the latency of the network. It is also called the round-trip time or network response time.
Since the speed of light is constant, latency is directly proportional to the distance between the two endpoints of communication. In plain language, this means that the longer the distance, the longer the delay.
Conventional wisdom says that increasing bandwidth will lead to improved performance. In fact, the transmission control protocol (TCP) limits the number of concurrent bytes transmitted, regardless of the size of the transmission pipe. TCP restricts the amount of data on a connection to avoid overloading bottlenecks in the network.
The TCP stack will send at most a "window's worth" of data at once. After that it waits for the acknowledgement of at least the first packet before sending more data. So, the higher the latency, the longer TCP has to wait before it can transmit its next packets. Thus, adding more bandwidth alone doesn't help: a connection can move at most one window of data per round trip, no matter how large the pipe is.
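The window limit can be made concrete with a rough calculation. The sketch below assumes a 65,535-byte window (the largest TCP allows without window scaling; the article does not state the window size in use) and a 120-ms round trip, and treats the 45 Mbit/s link as a hypothetical upgrade for comparison.

```python
def window_ceiling_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one connection's throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


WINDOW_BYTES = 65_535   # assumed window size, not taken from the article
RTT_SECONDS = 0.120     # 120 ms round trip

ceiling = window_ceiling_bps(WINDOW_BYTES, RTT_SECONDS)

# Compare a T1 with a hypothetical faster link over the same distance.
for name, link_bps in (("T1", 1_544_000), ("45 Mbit/s link", 45_000_000)):
    achievable = min(link_bps, ceiling)
    print(f"{name}: at most {achievable / 1e6:.2f} Mbit/s on one connection")
```

Under these assumptions the ceiling is about 4.4 Mbit/s, so a single connection cannot go faster than that however much bandwidth is added. A smaller window lowers the ceiling further.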
Even when transmitting large amounts of data is not the primary concern, network latency still impacts performance. Many communications protocols, including common file sharing protocols such as network file system (NFS) and common Internet file system (CIFS), were designed to work in a local area network (LAN) environment where latency is extremely low. Even simple actions, such as retrieving the attributes of a file, can require numerous round trips across the network. In a WAN environment, each of those round trips carries a latency penalty, dramatically slowing operations.
Round Trips and RPCs
The large numbers of round trips that NFS and CIFS protocols force on a file as it is sent across the WAN are really a result of the remote procedure calls (RPCs) that these protocols use. All programmers know about functions; they are self-contained blocks of code that can be called from other sections of the program to do specific work. Functions are compiled as part of the program and reside in that program's binary code. However, when you take this paradigm and involve different machines, RPCs come into play.
RPCs occur when the function and the program are in different binaries or reside on different machines, with the program running on a client machine and the function running on a server machine. When the client calls for the function, it calls a "stub" (or placeholder) function, which then takes all the function parameters and sends them across the network to the server. The function is then executed on the server and the results are sent back to the client, which then forwards them to the program.
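The stub mechanism described above can be sketched in a few lines. This is an illustrative toy, not the NFS or CIFS wire format: the names `FakeTransport`, `make_stub`, and `get_file_size`, the JSON encoding, and the sample data are all hypothetical, and the network is replaced by an in-process call that simply counts round trips.

```python
import json


# Server side: the real function lives here.
def get_file_size(path: str) -> int:
    sizes = {"/docs/report.doc": 2_097_152}   # made-up data for the demo
    return sizes[path]


SERVER_FUNCTIONS = {"get_file_size": get_file_size}


def server_dispatch(request_bytes: bytes) -> bytes:
    """Unpack a request, run the named function, and pack the result."""
    request = json.loads(request_bytes)
    result = SERVER_FUNCTIONS[request["name"]](*request["args"])
    return json.dumps({"result": result}).encode()


class FakeTransport:
    """Stands in for the network; each send() is one round trip."""

    def __init__(self) -> None:
        self.round_trips = 0

    def send(self, request_bytes: bytes) -> bytes:
        self.round_trips += 1
        return server_dispatch(request_bytes)


def make_stub(name: str, transport: FakeTransport):
    """Build a client-side stub that looks like a local function."""
    def stub(*args):
        request = json.dumps({"name": name, "args": args}).encode()
        reply = transport.send(request)   # parameters go out, result comes back
        return json.loads(reply)["result"]
    return stub


transport = FakeTransport()
remote_get_file_size = make_stub("get_file_size", transport)
print(remote_get_file_size("/docs/report.doc"), transport.round_trips)
```

To the calling program the stub is indistinguishable from a local function, which is exactly why the cost of each call is easy to overlook.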
Programmers who take the RPC approach often make the assumption that the client and the server will be on the same subnet and therefore close to each other. In this scenario, RPCs are not an issue because of the proximity of the client and server.
As soon as the program runs over a WAN, however, that premise causes "chattiness" stemming from the many, many RPCs it must make to complete each task. When the function and the program reside in the same binary, there can still be multiple function calls that degrade performance, but it is not as noticeable, especially on fast systems with lots of memory. The negative effects of chattiness are a lot more dramatic, however, when each function call has to go over a high latency network connection. Because both NFS and CIFS are dependent on RPCs, both also suffer from chattiness.
A Simulated Environment
To help highlight and confirm the above points, some experiments were conducted. All experiments described here have been carried out with the aid of a WAN simulator that can replicate real WAN conditions of latency, bandwidth, and packet loss over a T1 (1.544 Mbps) bandwidth connection.
During the simulation, both NFS/Linux and CIFS/Windows protocols were tested. For NFS, two Linux computers were used to simulate both ends of the network connection: one for the NFS server and another for the NFS client. Similarly for CIFS, two Microsoft Windows computers were used, one running Windows Server 2003 to act as the server, and another running Windows XP to act as the CIFS client desktop. The routes between the boxes were set up to go through the WAN simulator on an isolated network to avoid interference or noise skewing the results.
Experiment One: The Effects of Latency on NFS
Originally developed by Sun Microsystems, NFS is now an industry standard for accessing files on remote UNIX servers. How does latency affect NFS? To begin with, let's look at the results of an experiment.
An NFS client obtaining attributes for a single file from an NFS server over the WAN with a latency of 120 ms and a bandwidth of 1.5 Mbit/s resulted in three RPCs across the network: "access", "lookup", and "getattr". The access RPC checks whether the user with the given user ID (UID) has permission to access the file. The lookup RPC then establishes an NFS file handle for a given path name, and the getattr RPC is the call that actually gets the attributes of the file.
What's wrong with the above picture is that it takes the "function calls" paradigm of a single, integrated binary program and exports that to a distributed RPC model that results in many RPCs across the network. If these RPCs were over a local network, it would hardly matter, but over a 120-ms latency network, these RPCs can destroy the program's performance.
To compute the network time taken to complete a file system call, the following formula is used: t = n * l, where t is the total time taken, n is the number of RPCs issued and l is the latency between the client and the server. Using this formula, the time taken to get the attributes of a single file would be: t = 3 * 120 = 360 ms. If we then define a session as a set of file system operations, the network time taken for the session would be: T = Σ_i (n_i * l), where n_i is the number of RPCs issued by the i-th operation in the session.
This means that if a user were trying to get the attributes of ten files in a directory, in a serial fashion, it would take a total of 3.6 seconds. Because this calculation counts only network latency, however, the actual time will probably be higher.
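The two formulas can be checked with a few lines of code. This is a minimal sketch; the function names are hypothetical, and the inputs are the figures from the experiment above (three RPCs per file, 120 ms of latency).

```python
def call_time_ms(rpc_count: int, latency_ms: float) -> float:
    """t = n * l for a single file system call."""
    return rpc_count * latency_ms


def session_time_ms(rpc_counts: list[int], latency_ms: float) -> float:
    """T = sum of n_i * l over every operation in the session."""
    return sum(call_time_ms(n, latency_ms) for n in rpc_counts)


LATENCY_MS = 120
GETATTR_RPCS = 3   # access, lookup, getattr

# One file, then ten files fetched one after another.
print(call_time_ms(GETATTR_RPCS, LATENCY_MS), "ms")
print(session_time_ms([GETATTR_RPCS] * 10, LATENCY_MS) / 1000, "s")
```

Because the model counts only round trips, it gives a lower bound; server processing, transmission time, and retransmissions all add to it.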
When reading a 1-Mbyte file over the above connection, NFS issued 1308 network packets for a total of 20 RPCs, taking approximately 11 seconds to complete. It's important to note that a good chunk of these 11 seconds was spent retransmitting packets. This happens because NFS uses the User Datagram Protocol (UDP), which does not provide any guarantee of packet order or confirmation of delivery. When the NFS client does not get a response back from the NFS server within a certain time frame, it assumes that the packet did not reach the server and resends the RPC, wasting bandwidth.
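The retransmission behavior described above can be illustrated with a small timing sketch. The timer values and the doubling backoff below are assumptions chosen for illustration; real NFS client timers depend on the implementation and its mount options.

```python
def transmissions_before_reply(rtt_ms: float, initial_timeout_ms: float,
                               backoff: float = 2.0) -> int:
    """Count how many times a request is sent before the first reply arrives.

    The first copy is sent at time 0 and its reply arrives at rtt_ms.
    Each time the timer expires before then, the request is sent again
    and the timer is multiplied by `backoff`.
    """
    sends = 1
    elapsed = 0.0
    timeout = initial_timeout_ms
    while elapsed + timeout < rtt_ms:
        elapsed += timeout
        timeout *= backoff
        sends += 1
    return sends


# Same hypothetical 20-ms timer on a LAN (1 ms) and on the simulated WAN (120 ms).
for rtt_ms in (1, 120):
    sends = transmissions_before_reply(rtt_ms, initial_timeout_ms=20)
    print(f"round trip {rtt_ms} ms: request sent {sends} time(s)")
```

Every send after the first is wasted bandwidth: the original request was never lost, it simply had not had time to complete the round trip.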
The NFS server used in the above experiment was a fairly high-end machine with almost no load on the system. Even in the absence of load, when high latencies are present in the network, the NFS client will assume that it is causing load on the server and will delay RPCs, increasing the time required for operations to complete. If the load on the NFS server were increased until all NFS server threads were busy, many more packets would be dropped, increasing network retransmissions. All this only further reduces NFS WAN performance, and leads to more frustration.
Experiment Two: The Effects of Latency on CIFS
Now let's look at how latency affects the CIFS protocol. CIFS is the file sharing protocol of choice in Microsoft Windows networks, and support for CIFS is integral to Microsoft operating system technology. It is important to note that CIFS has richer semantics than NFS in terms of authentication and authorization, and because of this there is a lot of network packet flow between the server and client machines when transmitting data.
Given the same WAN conditions as simulated above for the NFS experiment, a CIFS client obtaining file attributes from a CIFS server resulted in four RPCs, or a network time of 480 ms (4 x 120 = 480 ms). Again, to access ten files serially in a directory (and using the formulas in the previous section), the time taken would theoretically be at least 4.8 seconds.
While both NFS and CIFS have mechanisms to get the attributes of many files at the same time, applications are normally interested in files following different file patterns. For instance, while NFS and CIFS are designed to get attributes of multiple files in a single RPC call, Microsoft Word might open three different files when it is editing a document and it gets the attributes of these files one by one and not in a single getattributes call. Because of that, applications end up trying to get the attributes of these files in a serial manner, running into more latency issues.
Reading the same random 1-Mbyte file used in the NFS experiment, CIFS issued over 160 RPCs. The total number of TCP/IP packets that flowed over the wire was approximately 1300. Again, most of those packets were TCP retransmissions. The TCP protocol stack on Windows XP simply timed out waiting for acknowledgement from the TCP server and retransmitted the packets. Because of that, the transmission of the 1-Mbyte file took about 16 seconds.
To complicate matters further, Windows clients can time out and stop transferring files on a slow connection. Even if the underlying TCP connection is still active, the CIFS clients will time out waiting for data and close the connection. The user has no choice but to retry the transfer, which, of course, might fail again.
It is clear that whether NFS or CIFS is used, the number of RPCs generated when accessing files is predictable and simply cannot be made more efficient. The problem is that while the packets we send over the Internet can travel at the speed of light, they only travel a few at a time and may be further bottlenecked by user contention and limited network bandwidth, resulting in lost productivity and frustrated users. As the next section will illustrate, this only gets more pronounced as WAN latency meets the protocol-intensive applications that have become a part of our everyday lives.
Windows on the WAN
Microsoft Office applications are perhaps today's most frequently used applications, and, among these, Microsoft Word files may be the most highly used and transmitted files in the world. While no one would argue the usefulness of Microsoft Word, using the WAN to transmit Microsoft Word files can be a painful and time-consuming experience.
Word uses many temporary files, and an entire Word file is read and written multiple times, significantly slowing down WAN file transmission operations. As one of our experiments showed, just opening the first 8 kbytes (2 memory pages) of a 2-Mbyte Word document for reading took a full 50 seconds (including a TCP retransmission) due to the effects of WAN latency.
After that, as Word attempted to read the remainder of the document, the network latency caused the CIFS client to disconnect. A new file sharing session had to be set up, which caused more delays, and between more disconnections and TCP retransmissions it was a full 180 seconds before Word finally managed to read the whole file.
All this took place before the user could even type a single character into the document. Once a small modification was made to this document, it took well over two and a half minutes (where CIFS RPCs numbered 960 and the total number of TCP packets was over 8000) to save. And this figure does not include the time spent waiting for the user interface to be responsive again. These multiple minutes of delay add up to much lost productivity when you consider how many files users open and close over the course of a day.
This experiment of Windows over the WAN is just one example of how latency can wreak havoc with geographically dispersed systems, but it is not an isolated example. On the contrary, this is pretty typical behavior for most Windows applications whenever they meet the latency of the WAN.
When WAFS Meets the WAN
Since we can't change the speed of light, how do we solve the problem of high CIFS and NFS latency over the WAN? Our first reaction might be to want to change NFS and CIFS protocols.
This, however, is easier said than done. For one thing, existing NFS and CIFS installations number in the billions, and changing all of them would be a monumental task for the IT industry. And even if Microsoft and the Linux vendors were able to carry out this task, each new version of an operating system or application would still have to maintain backwards compatibility with all other existing systems.
So how does one solve a problem where neither a law of nature nor a law of business can essentially be changed? One answer lies in a new technology called wide-area file services (WAFS). WAFS has been expressly designed to solve the problem of high latency networks, and when WAFS meets the WAN, the problem of latency can be addressed in a much more effective way.
In a typical WAFS network configuration, one WAFS server appliance is installed at the data center to offer access to shared storage via the WAN. A second appliance is installed at a remote office in front of the user network where it transmits requests for files. WAFS technology then uses specially designed protocols that recognize and transmit just the changes that have been made to a file, drastically cutting the number of RPCs that flow across the network to a bare minimum.
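The idea of sending only the changes can be illustrated with generic fixed-block differencing. WAFS products use their own protocols, which the article does not describe, so the sketch below is a stand-in technique rather than any vendor's method; the 4-kbyte block size and all function names are assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the sketch


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Hash every fixed-size block of a file."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old: bytes, new: bytes,
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return {block_index: new_block} for every block that differs."""
    old_digests = block_digests(old, block_size)
    delta = {}
    for index, offset in enumerate(range(0, len(new), block_size)):
        block = new[offset:offset + block_size]
        is_new = index >= len(old_digests)
        if is_new or hashlib.sha256(block).digest() != old_digests[index]:
            delta[index] = block
    return delta


def apply_delta(old: bytes, delta: dict[int, bytes], new_length: int,
                block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new file from the old copy plus the changed blocks."""
    blocks = [old[i:i + block_size] for i in range(0, len(old), block_size)]
    for index, block in delta.items():
        if index < len(blocks):
            blocks[index] = block
        else:
            blocks.append(block)
    return b"".join(blocks)[:new_length]


# A 2-Mbyte file with a small edit in the middle.
old_file = bytes(2 * 1024 * 1024)
new_file = bytearray(old_file)
new_file[1_000_000:1_000_005] = b"edit!"
new_file = bytes(new_file)

delta = changed_blocks(old_file, new_file)
sent = sum(len(block) for block in delta.values())
print(f"{sent} of {len(new_file)} bytes cross the WAN")
assert apply_delta(old_file, delta, len(new_file)) == new_file
```

Fixed-block comparison handles in-place edits well but not insertions that shift the rest of the file; a production system would need a more robust differencing scheme.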
By utilizing a local cache in the remote appliance, WAFS technology also ensures that most file requests are "warm" and the data is instantly available to remote office users, improving the user experience and reducing frustration. In essence, WAFS helps bring the data closer to the end user, resulting in substantially reduced access times.
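The effect of a warm cache can be shown with a minimal read-through cache. This is a sketch under simplifying assumptions: the class name and the `fetch_from_data_center` callback are hypothetical, the cache never evicts anything, and it ignores the question of keeping copies at different sites in step.

```python
class RemoteOfficeCache:
    """Serve repeat reads locally; cross the WAN only on a cold miss."""

    def __init__(self, fetch_from_data_center):
        self._fetch = fetch_from_data_center
        self._store = {}
        self.wan_fetches = 0

    def read(self, path: str) -> bytes:
        if path not in self._store:          # cold: one trip across the WAN
            self._store[path] = self._fetch(path)
            self.wan_fetches += 1
        return self._store[path]             # warm: served from local storage


def fetch_from_data_center(path: str) -> bytes:
    # Placeholder for the slow WAN transfer.
    return f"contents of {path}".encode()


cache = RemoteOfficeCache(fetch_from_data_center)
for _ in range(5):
    cache.read("/docs/report.doc")
print("reads: 5, WAN fetches:", cache.wan_fetches)
```

Only the first read pays the latency penalty; the other four never leave the remote office.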
Advanced WAFS solutions also implement coherency and consistency mechanisms, which keep remote sites from overwriting one another's data by preventing conflicting file save operations. These mechanisms also ensure that updates to a file from one remote site flow to another remote site as quickly as possible, so that a user at another remote site is not working on a stale version of the file.
Another feature of WAFS solutions that is also critical to a truly successful implementation is "write back caching with logging." Write back caching with logging means that all changes to a file are resident in persistent state storage on local disk until the user has finished modifying the document, transmitting these changes back to the data center only when the user is done writing to the file. This way, users never experience the pain and delays of waiting for a file to save across the WAN.
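Write back caching with logging can be sketched as follows. The class, the JSON log format, and the `send_to_data_center` callback are all hypothetical; the point is only to show changes being made durable on local disk first and shipped across the WAN once, when the file is closed.

```python
import json
import os
import tempfile


class WriteBackFile:
    """Log every write locally; send the changes when the file is closed."""

    def __init__(self, log_path: str, send_to_data_center):
        self._log_path = log_path
        self._send = send_to_data_center
        self._log = open(log_path, "a", encoding="utf-8")

    def write(self, offset: int, data: bytes) -> None:
        record = {"offset": offset, "data": data.hex()}
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())   # the change now survives a crash

    def close(self) -> None:
        self._log.close()
        with open(self._log_path, encoding="utf-8") as log:
            changes = [json.loads(line) for line in log]
        self._send(changes)            # one WAN transfer for the whole session
        os.remove(self._log_path)      # log is no longer needed once sent


wan_transfers = []                     # stands in for the data center
log_path = os.path.join(tempfile.mkdtemp(), "report.doc.log")

doc = WriteBackFile(log_path, wan_transfers.append)
doc.write(0, b"Hello")
doc.write(5, b", WAN")
doc.close()
print(len(wan_transfers), "WAN transfer carrying", len(wan_transfers[0]), "writes")
```

Each `write` returns as soon as the local disk has the data, so the user never waits on the WAN while editing.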
There are several gains that WAFS technology can achieve in the "fight" against WAN latency. The most advanced WAFS technology nearly eliminates WAN latency, and because it essentially allows data center storage to be virtually indistinguishable from local storage, it also means that distributed enterprises do not have to deploy duplicate storage at each remote office. With both the WAN file sharing and storage consolidation benefits, companies that have implemented WAFS technology report performance improvements of up to 100 times, while obtaining a return on investment in 3 to 9 months through hard dollar savings and user productivity improvements.
About the Author
Vinodh Dorairajan is a senior software engineer at Tacit Networks. Vinodh holds a master's degree in computer applications from Bharathidasan University, India, and can be reached at vinny@tacitnetworks.com.



