XML has become an indispensable part of IT infrastructure. As more web services applications exchange XML messages over the network, we're seeing the emergence of XML-aware network routers and switches. However, to meet the high throughput and low latency requirements, those purpose-built appliances need XML processing technology that goes beyond traditional, software-based XML processing models. In addition, porting XML processing on-chip is becoming a key technology. Situated at the seventh layer of OSI (Open Systems Interconnect), XML processing, when implemented on custom hardware, can be considered the functional extension of network coprocessors.
Although XML is being used in many different ways, one thing remains constant: an XML document must be parsed before applications can do anything with it. Currently, DOM (Document Object Model) and SAX/Pull, the two popular processing models, are used to parse XML.
DOM is a tree-based XML processing API spec that internally creates in-memory data structure, precisely modeling data represented in XML. Allowing random access, DOM is generally considered an easy and natural way of working with XML. SAX (Simple API for XML) and Pull are XML streaming APIs that directly export a low-level tokenizer to calling applications. By not constructing in-memory data structure, SAX and Pull can process XML files larger than available physical memory, but are nonetheless difficult to use, especially when the structural complexity of XML is high.
High-performance XML processing can be important for various application areas. For middleware and transactional database applications, people are often concerned with the application servers' CPU load, as well as the number of transactions performed per time unit. For network switches and routers handling XML messages for security and routing purposes, the typical metrics of interest are throughput and latency. It's critical for those purpose-built devices to achieve the kind of performance that people expect out of network appliances.
Unfortunately, both DOM and SAX/PULL have performance problems that usually make them less than satisfactory for aforementioned applications. Although easy to use, DOM mandates a large number of object allocation operations that are adverse to high performance. Typically, a DOM parser in Java doesn't deliver more than 5 Mbytes/s sustained throughput on a high-end server. On the other hand, because SAX and PULL don't convey structural information of XML, application developers often either have to scan the XML document multiple times, or create custom object models to represent the hierarchical data. Both of these approaches usually render SAX/Pulls raw performance advantages over DOM insignificant.
Looking at the networking industry over the past 20 years reveals a common trend. Whenever there's an emerging performance bottleneck, people resort to custom hardware, typically an ASIC or FPGA, for enhanced processing. At OSI layer two and three, network processors and switching fabrics configured in massive parallels enable backbone routers to handle millions of packets per second. At layer four, TOEs (TCP offload engines) perform the dirty work of terminating a gigabit TCP connection, and thereby free up host processors to focus on application-specific processing. At layer six, VPNs (Virtual Private Networks) based either on SSL (Secure Socket Layer) or IP-SEC makes extensive use of custom hardware to handle data encryption, and signing that would otherwise drain the CPU. Dedicated ASICs performing deep content inspection and pattern matching are one reason why web switches and intrusion detection systems (both at layer 7) can keep up with multi-gigabit networks in enterprise data centers.
Since HTTP carries XML documentation in its body, XML processing is considered a layer-7 problem. Following the footsteps of other layer-7 problems, it's natural to look for ways to use ASICs to boost XML's processing performance. But now it's a layer-7 problem. More than the classic "flexibility vs. efficiency" difference between CPUs and ASICs, be aware that XML on-chip processing is deeper than any other layer-7 problems, as it must intimately and seamlessly interface with application logic. More precisely, ASIC-based XML processing is subject to constraints specific to both XML and custom hardware, and a good design is about formulating the XML processing in a way so that the applications can take maximum performance benefit of custom hardware.
Analyzing the porting of XML on-chip
General-purpose processors, or CPUs, are the most important component of a computer. To program a CPU, application developers usually write code in some kind of programming language, then compile the code into instructions, which in turn get executed by the processor. An important function of compilers is mapping application code, which essentially describes the application behavior, to instructions. A processor has its own bag of performance-enhancing tricks: modern processors can fetch, and reorder multiple independent instructions so they can be executed in parallel by on-chip execution units.
Because the CPUs are flexible, the architecture isn't optimized for any specific tasks. Mapping between an application's behavior and the corresponding assembly code often introduces unnecessary processing overhead. There's a class of computation problems for which the processing overhead due to the general-purpose design becomes significant. The common characteristics of those problems are that they have low data dependency, possess a high degree of architecture parallelism, and that they require a simple set of operations to be performed repetitively over a large amount of data. The best way to improve their efficiency is to optimize at the highest abstraction (architectural) layer. This isn't a CPU's strength because it only sees instructions, and therefore can only optimize the execution at the instruction level.
Compared to CPUs, custom hardware, typically in the form of ASICs or FPGAs, isn't nearly as programmable. Being application specific, custom hardware lacks a CPU's flexibility and takes on a narrow set of functions. But not doing everything allows it to completely bypass the intermediate "instruction mapping," stage and directly use logic gates, such as registers, NANDs, NORs, and XORs, to describe the application behavior. Hence, custom hardware can optimize at the architecture level to maximize processing efficiency. Some of the benefits are smaller die sizes, less heat generation, and higher performance, even at a much slower clock rate than the CPUs.
It's well known that to design a scaleable distributed system, one should minimize state information. Otherwise, the overhead of storing and managing state reduces performance. This is why redirect servers are preferred over the proxy setting for both SIP and DNS. For custom hardware design, being stateless is even more important, as it often has a set of fixed length registers and a limited amount of linear memory buffer not capable of excessive buffering. So the most efficient memory access pattern for custom hardware is forward-only, flow-through to minimize buffering.
C's struct provides a way to bind multiple member variables/fields into one entity. In object-oriented languages, objects are actually a souped-up version of struct with added support for access control and member methods. Both a struct and an object are, at the programming language level, different from primitive data types, such as an integer. However, the physical representations of a struct, an object, and an integer have one thing in common: they're all small memory blocks filled with bits. This is important in ASIC or FPGAs. On an IC, object and structs don't exist. Generally, everything consists of a constant number of bits that can be stored in registers to represent the state. As for the binding, the bits can be split into different groups to represent different variables. For example, a 32-bit register can bind four 8-bit integers, two 16-bit integers, or even 32 single-bit Booleans.
The Data Encryption Standard (DES) is a widely-used symmetrical block-cipher algorithm whose ASIC implementation delivers a sustained throughput of multiple gigabits per second. At the top level, DES decryption converts 64-bit cipher blocks into 64-bit plain-text blocks based on a pre-negotiated 64-bit key value. In many ways, a system-level view of DES on a chip exhibits the constraints and design principles of how to maximize custom hardware. First, a DES chip does nothing but decryption, and does it much faster than a CPU because the algorithm is burned right into silicon. Second, the design is stateless; except for a few internal state registers, every 64-bit input has its own corresponding 64-bit output, and no buffering is needed. Third, the 64-bit data blocks moved across the PCI bus are constant in length and naturally persistent. Hence, application code residing in a separate memory space benefits from the custom hardware's horsepower. Overall, the DES problem provides a good prototype for ASIC-based XML parsing.
DOM and SAX are the two natural candidates for custom hardware implementation because many XML applications are developed using these two processing models. However, there are two kinds of problems that prevent them from being ported on a chip. The first one is technical issues. The second is about whether performance improvement due to custom hardware is significant enough to justify the effort.
It makes sense to port DOM on a chip because building a DOM tree is often an application's most processing-intensive part. However, DOM on a chip doesn't work very well. The first show stopper: the in-memory representation of a DOM tree, consisting of many objects stitched together using pointers, isn't inherently persistent, and therefore can't be transported across the process address space as is.
The second show stopper: If custom hardware produces serialized DOM, the cost of converting the DOM into an in-memory tree becomes prohibitively expensive so that there's little tangible performance benefit to port DOM on a chip. And even if DOM on a chip is possible, the cost of garbage collecting all the objects caps the overall performance gain achievable using custom hardware.
Implementing SAX on a chip also faces a few technical problems. When an application uses SAX to parse XML, the parsing code and application logic are interwoven. This parsing style means the application must initiate many expensive device calls. As an alternative, the chip can produce some form of serialized SAX events. But again, this approach won't provide any substantial performance benefit because of the object creation cost to regenerate the SAX events. However, the biggest problem with SAX on a chip is that SAX parsing itself usually isn't a significant performance bottleneck for an application. So porting SAX on a chip is simply solving the wrong problem.
VTD-XML: a processing model for custom hardware
Traditionally, the first step of text processing is for a parser to split input text documents into many strings, also known as tokens, which are usually null-terminated character arrays. This style of parsing is "extractive" in that the parser physically extracts useful content from the document into those arrays. However, there's another way to tokenize, or record, the starting offsets and lengths of tokens, while leaving the token content unchanged in the source document. In other words, the parse treats the document as a large token bucket, and creates a map detailing the positions of tokens in the bucket.
To understand the difference between the extractive and non-extractive tokenization approaches, consider the following usage scenarios:
String comparison: under the traditional text-processing framework, one uses some flavors of C's "strcmp" function (in ) to compare an extractive token against a known string. In our approach, one simply uses C's "strncmp" function in .
String to numerical data conversion: some frequently used macros, such as "atoi" and "atof," can be revised to work with non-extractive tokens. One possible change is the functions' signatures. For example, atoi takes a character string as the input. To make a non-extractive equivalent, one can create a new-atoi that accepts three variables, the source document (of the type char*), offset (of the type int), and length (of the type int). The difference in implementation mostly deals with the new string/token representation.
Trim: removing the leading and trailing white spaces of a "non-extractive" token only requires changes to the values of offset and length. This is usually simpler than extractive style of tokenization, which often involves creating new objects.
Built on the concept of the "non-extractive" tokenization, VTD (virtual token descriptor) is a binary format specifying how to encode various token parameters into integers. A VTD record is a 64-bit integer that encodes the starting offsets, lengths, types and nesting depth of tokens in XML (see the figure). VTD also requires that the XML document remain intact in memory.
A VTD record is a 64-bit integer that encodes the tokens' starting offsets, lengths, types, and nesting depth in XML.
VTD-XML, the name of both the open source, Java-based, XML processing API and the corresponding XML processing model, internally uses the VTD spec to represent the tokenized form of XML. In addition, VTD-XML introduces the concept of location caches (LCs) to provide a hierarchical view of an XML info-set.
Alluding to previous discussions, VTD is suited for custom hardware implementation. One way to look at this is that VTD turns an XML processing problem into the DES encryption. The fact that VTD records are constant in length allows custom hardware to feature a stateless, flow-through, "process-then-discard" architecture that requires minimal buffering. By burning the VTD algorithm on-chip, the design can optimize at the architecture level to achieve significant performance improvement over general-purpose CPUs. Specifying parameter encoding at the bit layout level ensures platform independence and interoperability among different implementations. Finally, because VTD records are constant length, they're addressable using integers. In other words, VTD records are persistent post-binding, meaning they can be transported across the PCI bus, and bypass the step and the associated cost of allocating a large number of objects, so software applications at the receiving end can take maximize the custom hardware.
Nevertheless, the implications of VTD don't end with custom hardware. In fact, VTD-XML provides DOM-like random access with only 20% to 33% of DOM 's memory usage. The reasons are to avoid per-object overhead and for bulk allocation of storage.
In many VM-based, object-oriented programming languages, per-object allocation incurs a small amount of memory overhead. A VTD record is immune to the overhead because it is not an object. When fixed in length, VTD records can be stored in large memory blocks, which are more efficient to allocate and GC. By allocating a large array for 4096 VTD records, one incurs the per-array overhead (16 bytes in JDK 1.4) only once, thus considerably reducing per-record overhead.
By reducing object creation cost, VTD-XML can achieve high performance, even in software. In fact, VTD-XML often outperforms SAX parser with NULL content handlers.
About the author
Jimmy Zhang is the founder of XimpleWare. He holds MS and BS degrees from the department of EECS at U.C. Berkeley. He can be reached at email@example.com.