Whether it is at the core, edge, or access platform, network routers have one basic function: packet forwarding. Edge routers, however, have the need to perform additional processing and deeper examination of incoming packetsin the most efficient manner. To accomplish this, designers need to know how to analyze their design in order to utilize the most appropriate searching technology.
With the emergence of reduced-latency DRAM (RLDRAM) and high-density fast ternary content addressable memory (TCAM)-enabled network search engines, router designers many new options to consider. As designs and applications become more and more complex, it becomes readily apparent that there are many design tradeoffs. There are many factors that must be taken into account, including performance requirements and types of databases, as well as application-specific factors that must be evaluated when applying the chosen technology.
This two-part tutorial will describe in detail the design complexities of selecting a particular search algorithm or technology. In Part 1, we'll investigate the key design considerations, including development issues and attributes that help determine which diverse memory architecture to incorporate into your design. In Part 2, we'll provide examples of applications in which these technologies can be applied.
Router Functions and Searching Databases
The job of the intelligent edge router, also called an ingress router, is to take incoming packets and send them to the network switch fabric. The edge router typically resides in the network service provider's central office, and resides on a line card in a network-processing unit (NPU). These line cards do not simply forward packets; they must act intelligently in deciding when and how to forward them. For instance, the edge router must implement quality of service (QoS) protocols to ensure that customers get the right share of the data channel's bandwidth. Implementing these QoS protocols and other intelligent packet forwarding requires the use of several specialized search databases.
The general structure of the line card's logic flow for packet forwarding appears in Figure 1, which shows the various databases and their connection to the line card's tasks. Three databases are of special interest: the flow cache, the Layer 2 (L2) and Layer 3 (L3) tables, and the access control lists (ACL). Each has to be searched repeatedly as part of the packet forwarding, and each database has different search requirements.
Figure 1: Logic flow and packet forwarding for a line card optimized for edge routers.
The flow cache is used to monitor flows and may be used to speed router operation, taking advantage of packet traffic's "flow" nature. A given application is likely to send a series of packets with the same header characteristics. By keeping a cache of recently processed headers together with their associated forwarding, classification, and policy data, the line card can avoid processing the header each time and simply forward the packet using the flow data generated the first time. A search of the flow cache must use an exact-match operation.
For IPv4/6 packets, the line card examines the L3 network address databases in order to determine the next hop in the packet's route. In this operation, the routing decision may not need to use the full packet header. To speed operation, the search of this database uses a longest-prefix matching (LPM) algorithm.
ACLs are the result of implementing rules for each packet header combination. These rules allow the line card to verify that the services associated with routing the packet match the network status as well as the terms of the user's contract. Users may have restricted access to some services, have contract-related bandwidth limitations, or not have authorization to access some network resources. In order to implement these rules, the line card needs to classify the packet according to its access rights.
Because many users are subject to the same rules, but with different outcomes, the search on the ACL database uses a best-match algorithm with wildcards to mask arbitrary sections of the header from the search. For example, an access rule may say "decline all packets coming from any source IP address destined for Port 80 on any IP destination address" by masking all but the port address in the header.
Line Card Performance
All these database searches must be performed at speeds sufficient to keep pace with the incoming traffic. If these searches are not performed at line rate, packets will be dropped. There are two components to the line card's search performance requirements. The first is the traffic rate. The second is the number of database searches a packet requires. Simple forwarding may only require one or two searches, but access control and billing may require many more. Figure 2 shows the data rates that line cards must handle for a variety of function and speed combinations, as well as the types of equipment that provide those functions.
Figure 2: Search rate requirements for network search engines.
In addition to performing the search function, the line card must maintain the search databases. The frequency of updates on the databases depends on how quickly the network is changing. In relatively static networks using 16-4-4-4-4 trie algorithm for classification, adding or deleting an entry in L2 or flow cache tables requires access to 256 accesses per second (assuming an addition of prefix 8).
The border gateway protocol (BGP) maintains the L3 table and typically requires about 3000 entries per second during extreme route flap conditions. For algorithm-based search tables, this translates to 6 million-memory accesses per second, about 2 percent of the 250 MHz quad data rate (QDR) memory bus used on many line cards.
The memory accesses become much more frequent, however, if the network is updating the flow table or allocating VLAN traffic to switched virtual connections. In this case, table updates can reach more than 100,000 entries per second, or 25 percent of bus capacity, creating the potential for a memory bottleneck.
Unlike the L2/L3 tables, the current ACL tables are fairly static all the time. This may change in next-generation networks. Internet security is becoming more important and networks may need to adjust quickly and adaptively in order to foil cyber attacks, which will place an additional burden on ACL table updates.
In addition to performance, there are other hardware issues that must be kept in mind when choosing the searching implementation for a line card. One is power. The central office where line cards reside is typically small and closed, creating a heat trap. Line cards can easily dissipate as much as 300 W each, so cooling becomes a major problem for central offices. Design of the classification database memory allows a trade-off among speed, latency, board space, and power dissipation. Selecting the right database architecture, then, is an important factor in setting overall system power.
Other Search Engine Choice Factors
Other factors to consider in line card design include the density and availability of memory. Memory density will determine the size of databases that are practical within the board's physical constraints, thus determining its performance. The availability of memory will affect both product cost and production schedule. The wrong memory choice could result in a product that cannot be produced on time, which negatively impacts a designer's ability to meet customer requirements.
Although the search requirements for line cards can be high, there are many search algorithms that can satisfy them. For line card design, however, the algorithm also dictates the architecture. For example, the hashing and Trie algorithms use standard SRAM.
The same algorithms can be used with RLDRAM or other memories by replicating the database over several devices so that searches can run in parallel to attain speeds comparable to SRAM. In one example, four RLDRAM devices with 20-ns random access times can store one database replicated four times to reach up to 200 MHz of memory performance. This solution may be less costly than an SRAM solution when only memory cost is compared, but the system costs go up when the memory bus is replicated four times. Additionally, the replication of databases further complicates the updating of tables. With either architecture, however, the algorithms force a tradeoff between search, update speed and memory size.
Direct Index is Simplest
One of the simplest search methods is to use the search parameter, or key, as the address to an SRAM table that contains an index into the relevant database. This direct indexing provides an exact match, as used in flow caching, using the packet header field as the search key. The approach is fast, delivering the index in a single memory access, but uses memory inefficiently. A search key containing "W" bits needs an index memory of 2W locations. As a result, keys much larger than 25 bits become impractical.
The hashing algorithm provides a fast method of indexing large tables with much wider keys, up to a few hundred bits. Hashing compresses the wide search key to a narrow key of approximately 20 bits, after which the narrow key is used in the direct indexing method. Various methods are available for hashing the search key; one approach is to use an encryption algorithm such as data encryption standard (DES) with a fixed key.
It is possible that hashing a wide key will result in two search keys equating to the same bit sequence in the first 20 bits. The keys thus clash, generating the same index, and the algorithm must resolve the ambiguity. Typically, the resolution requires the algorithm to directly compare the search key to the clashing table entries. This added comparison slows down the hashing algorithm, but the probability of a clash is often quite low. The performance reduction may be significant in worst-case situations.
The hashing algorithm, however, is limited to an exact match. Thus, every bit is significant. While this is suitable to flow caching, the L3 tables need to support longest prefix matching, and ACL needs to use wildcards. These databases need a different algorithm.
The binary search retrieval algorithm (Trie) allows wildcards in the lower-order bits of a search key. The algorithm performs a bit-by-bit examination of an incoming search key, altering left and right pointers with each bit examined, as shown in Figure 3. When the algorithm reaches the first wildcard, or "don't care" condition, the search stops. The pointer's value now provides the table index. This result is a "best match" search. Pointers with shorter prefixes are identified during the search, like P4 in the figure, but the algorithm keeps track of the best match information (in this case P3) at every level.
Figure 3: Diagram of the binary retrieval (tie) algorithm).
These binary search tries are memory efficient, but slow. Each bit in the key triggers a memory access. Because most tables are too large for a CPU's on-board cache, these accesses have the additional drawback of being across an external bus. Parallel examination of multiple search keys can help speed the search, but this requires additional memory interfaces, which adds pin count and design complexity because of the need to interleave memory accesses. As a result, the algorithm is best suited to a small database with wide keys, such as the ACL.
A variation of the binary retrieval algorithm, called the multiple-level Trie, can increase performance and is popular for large databases with narrow keys, such as forwarding. This algorithm examines several bits of the search key (the "stride factor") at each level. The results of the first level point to one of several tables, or nodes, in the second level. The results of the second level point to one of the nodes for the third level, and so on.
The multiple-level trie approach requires a new node for each possible pointer in the previous level. If the stride factor is "K0" bits at the first level, the algorithm wants 2<> nodes at the second level. Each pointer from each node at the second level wants a node at the third level, and so on.
Theoretically, the number of nodes the algorithm calls for quickly becomes unmanageable. In practice, however, the number of nodes actually needed at each level will be no greater than the number of entries (N) in the database, so the memory requirements are significantly less than a pure geometric progression predicts. A calculation showing the memory requirements for a three-level Trie structure with stride values of 16, 8, and 8 and 4096 table entries appears in Table 1. This is a common configuration for IPv4 forwarding.
Table 1: Storage Requirements for a Variable-Stride Trie
Maintaining multi-level Trie nodes depends on the type of entry being made in the database. An IPv4 (32-bit) address masked to the first 12 bits, for instance, requires a change to only 16 locations in the first node and no changes to the other levels for the 16-8-8 structure of the example. Masked to 28 bits, the entry would require one change in the root node, one change in the relevant node at the next level, and 16 changes at the third level. Masking to fewer than 8 bits never occurs.
Other structures are possible for the multi-level Trie. A five-level 16-4-4-4-4 structure would need a total of 10.5 Mbits of memory for 4096 IPv4 entries, as opposed to the 69.2 Mbits needed for 16-8-8. The five-level structure also requires few maintenance entries. The tradeoff for the smaller memory requirement is a longer search path.
Multiple Field Searches
The Trie algorithms become complex when applied to the multiple field searches demanded by access control. In these searches, a wild card may be located at several points in the search key, as opposed to being restricted to the lower-order bits as in the forwarding searches.
The combination of wild cards and multiple search fields means the algorithm must perform a hierarchical search, so it searches for matches in the first field. Where wildcards exist, the search must then jump to start searching in the next field as well as continue searching the next bit in the first field. This branching continues through as many fields as needed. An example of a hierarchical search through a set of rules composed of two fields is shown in Figure 4.
Figure 4: Hierarchical search using multiple fields.
Because a wildcard may exist at any point, if the first field has W1 bits, a second field has W2 bits, and so on. The maximum number of searches needed can be as great as W1xW2xW3... which can easily run into millions of memory access per search. Alternative algorithms are available to reduce search time, but they require proportionally larger memory.
There are more efficient multiple-field search algorithms in the research stage. These heuristic techniques typically reduce the problem by limiting search key configurations, which requires a thorough understanding of the rules being implemented and often use pre-calculated tables. This limits their usefulness to static tables. Next-generation equipment will typically require dynamic table management.
All these direct and Trie search algorithms depend on random-access memory for their implementation. An alternative is to use an architecture that employs ternary content-addressable memory (TCAM), such as a network search engine (NSE). TCAM-based NSEs are specialty memory devices that make a simultaneous comparison to the search key, with wildcards, of all entries in the memory. The NSE returns the address of the first matching entry, which serves as the index into the database. These TCAM-based searches can be used for all three search types used in line cards: exact match, longest prefix (best) match and multiple field match. Let's look at the TCAM approach further.
Single-Cycle Search with NSEs
The TCAM-based NSE is able to search its entire memory array in a single clock cycle by incorporating a comparison function into a standard RAM cell. The bit lines of the RAM cell work normally when writing data into memory. When making a comparison, however, they hold the pattern to be matched.
Additional transistors in the TCAM form an exclusive-NOR with the RAM cell value and the bit lines and the output of this logic ties to a "match" line for that address. Additional transistors allow a wildcard value to force a match on the corresponding bit. All bits at a given address tie to the same match line, so any mismatch discharges the line. Only if the stored value at an address matches the comparison value will the match line remain high.
Each comparison in the NSE requires the pre-charging of the match line, so the devices can exhibit higher power consumption than a conventional SRAM. Advanced TCAM-based NSE designs, however, include segmented power management features, such as a dynamic database management feature. If an NSE holds several databases, the power management feature enables a pre-charge of only those lines associated with the active database.
The NSE devices not only perform a search in a single clock cycle, they can be updated in a single clock cycle. This makes them much faster to update than SRAM or RLDRAM implementing a Trie algorithm. TCAM-based NSEs that can search at 250 million searches per second (MSPS) are available from vendors, making searches and updates at network line rates possible.
If performance were the only issue, TCAM-based devices would be the memory of choice for all search operations. Cost, power, and density are all important design factors, however, and must be traded off against speed. The exact tradeoff depends on the size and performance requirements of the search, which requires a system level comparison of the algorithms and architectures.
In Part 2, we'll upon the recognition of the various design considerations, such as cost and power, and give examples of the types of applications where these searching technologies can be applied.
Editor's Note: To view Part 2, click here.
About the Authors
Michael Miller has been with IDT for more than 15 years and serves as the company's chief technical officer responsible for systems technology. He holds a bachelor of science degree in computer science from the California Polytechnic State University at San Luis Obispo and has been awarded 25 patents to date. Michael can be reached ta firstname.lastname@example.org.
Bertan Tezcan is a systems engineer in the Systems Technology Group at IDT where he is responsible for the algorithmic and architecture definition for the IDT network search engines and content inspection engines. He holds a bachelor of science degree in electrical engineering from Bilkent University in Ankara, Turkey; a masters of science degree in electrical engineering from the University of Southern California; and a masters of business administration degree from Santa Clara University. Bertan can be reached at email@example.com.