REDMOND, Wash. - Smart visualization software, designed by the military to represent thousands of online documents in 3-D maps, has surfaced in a commercial version called Themescape. Licensed to Cartia Inc. by the military lab Pacific Northwest National Laboratory (Richland, Wash.), Themescape clusters documents with similar content onto topographical "information maps," on which users click with a mouse to zoom in on topic areas and individual source documents.
"We've spent the last two years completely rewriting Themescape from the ground up for end users," said Michael New, vice president of Cartia, based here. "It now runs on Windows platforms instead of high-end workstations. And instead of requiring 3-D hardware, we show its results on 2-D topographical maps."
Themescape combines neural learning with advanced statistical grouping techniques to comb thousands of online documents and glean their "meaning" from the context in which words are used. Instead of looking up words in a dictionary, Themescape compiles the statistical frequency with which one word is found near another in a document. After learning these statistical relationships, it clusters documents with similar meanings onto topographical maps, where each "hill" indicates a concept.
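The idea of compiling how often one word appears near another can be illustrated with a minimal Python sketch. It is not Cartia's code: the function name `cooccurrence_counts`, the two-word window and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each unordered word pair appears within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only ahead, so each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

# Hypothetical sample text, not taken from any Themescape data set.
tokens = "the encryption key protects the encryption transaction".split()
pairs = cooccurrence_counts(tokens, window=2)
print(pairs.most_common(3))
```

Pairs that recur, such as "encryption" next to "the" in this toy sentence, accumulate higher counts. A real system would presumably drop noise words first and feed the counts to a learning stage.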
No matter how many documents Themescape scans, said New, the presentation method always results in a manageable map showing clusters of documents with related meanings. As an example, the system performed a patent search that organized thousands of documents into a hill labeled "encryption." Nearby slopes labeled "authentication," "transaction" and "databases" indicated clusters of patents related to encryption.
"Themescape can scan any type of textual documents, be they Web sites, e-mail messages, reports, word-processor documents or whatever," said New. "Before presenting the content of the documents in a topographical map, our harvester aggregates it into a master index."
The time it takes the harvester to aggregate content depends on how many documents are involved. For instance, the 200 documents in the Starr Report took 2.5 minutes to aggregate, New said, whereas aggregating 16,500 letters to the editor in a popular magazine took 90 minutes.
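The two timings New cites can be turned into rough throughput figures. The short Python calculation below uses only the numbers quoted above; the helper name `docs_per_minute` is an assumption, and because the article gives no document lengths, the rates are only a coarse comparison.

```python
def docs_per_minute(n_docs, minutes):
    """Average number of documents aggregated per minute."""
    return n_docs / minutes

starr = docs_per_minute(200, 2.5)      # Starr Report documents
letters = docs_per_minute(16_500, 90)  # letters to the editor

print(f"Starr Report: {starr:.0f} documents per minute")
print(f"Letters:      {letters:.0f} documents per minute")
```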
Cartia pays a licensing fee to Pacific Northwest National Laboratory, as well as royalties to Battelle, the local agency that runs the lab. Battelle has taken an equity position in Cartia, the dividends from which will be split with the Department of Energy. Cartia has hired the five government researchers who developed Themescape as full-time employees.
Themescape aggregates information based on primitive units of meaning gleaned from the source documents. These are then mapped using familiar elevations, distances and other common topographical symbols. A self-organizing neural network scans and indexes the statistical relationships among the words occurring in harvested documents so that no human intervention is required. The end result is an information map that translates the meaning found in documents into a visual form that users can delve into interactively. Users can make sense of any set of documents using the same techniques, regardless of how many documents are in the set, because the user interface presents the same degree of visual complexity.
Catalogued by context
The first step Themescape takes involves the spatial and conceptual organization of the set of source documents. Data reduction is accomplished with natural-language filtering algorithms that reduce the original text to the set of words relevant to the meaning of the documents. After noise words (for example, "the" and "a") are eliminated, the statistical frequency of the remaining words is catalogued by context.
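A minimal Python sketch of this data-reduction step might look as follows. The noise-word list is an assumption (the article names only "the" and "a"), as are the function name `content_word_frequencies` and the sample sentence. The sketch counts plain frequencies and does not attempt the cataloguing by context that the article describes.

```python
import re
from collections import Counter

# Assumed noise-word list; the article itself names only "the" and "a".
NOISE_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def content_word_frequencies(text):
    """Lowercase the text, split it into words, drop noise words, count the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in NOISE_WORDS)

# Hypothetical sample sentence.
freqs = content_word_frequencies(
    "The patent describes a key and the encryption of a key."
)
print(freqs)
```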
Each word is then analyzed in the context in which it is found, in order to separate the units of meaning. For instance, a computer "program" is indexed differently from the program for a baseball game by dint of context.
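One simple way to picture separating senses by context is to compare the words around "program" with a signature for each sense. The Python sketch below swaps in a basic word-overlap technique and should not be read as Themescape's method: the article says the relationships are learned statistically without human intervention, whereas the `SENSES` signatures here are hand-written and hypothetical.

```python
# Hypothetical, hand-written context signatures for two senses of "program".
SENSES = {
    "software": {"computer", "code", "run", "install"},
    "event": {"baseball", "game", "ticket", "stadium"},
}

def pick_sense(context_words):
    """Return the sense whose signature shares the most words with the context."""
    context = set(context_words)
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(pick_sense("install the computer program".split()))
print(pick_sense("bought a program at the baseball game".split()))
```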
Once the units of meaning are identified, they are associated with the other units of meaning to form a hierarchical set of themes. When the themes are catalogued, their meaning can be "spatialized" by dynamically assigning spatial structures. Thus, each document becomes associated with a multidimensional "meaning" space where the dimensions correspond to the concepts and themes contained in the document set as a whole. This localization of meaning becomes the basic database structure referenced by the proprietary mapping algorithms.
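The notion of a multidimensional "meaning" space can be sketched in Python by treating each document as a vector with one dimension per theme. The three theme names come from the patent-search example earlier in the article. The counts, the helper names and the use of cosine similarity as the closeness measure are assumptions, since the article does not say how Themescape compares documents.

```python
import math

# Theme dimensions borrowed from the article's patent-search example.
THEMES = ["encryption", "authentication", "databases"]

def to_vector(theme_counts):
    """Place a document in theme space: one coordinate per theme."""
    return [float(theme_counts.get(theme, 0)) for theme in THEMES]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical theme counts for three documents.
doc_a = to_vector({"encryption": 3, "authentication": 1})
doc_b = to_vector({"encryption": 2, "authentication": 1})
doc_c = to_vector({"databases": 4})

print(cosine_similarity(doc_a, doc_b))  # close to 1: similar themes
print(cosine_similarity(doc_a, doc_c))  # 0.0: no themes in common
```

Documents whose vectors point in similar directions would be candidates for the same cluster, and so for the same hill on the map.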
These algorithms take the content representation and translate it into a series of 2-D topographical maps where elevation represents density of content. Interactive tools help users spot trends and locate relevant information nuggets from the overall visual organization. For instance, a temporal slicing tool helps relate information occurring on similar dates, but at various depths in the map.
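The mapping algorithms are proprietary, but the idea that elevation represents density of content can be illustrated with a small Python sketch. It assumes each document already has a 2-D position between 0 and 1, bins the positions into a coarse grid and treats the count in each cell as its elevation. The function name, grid size and sample positions are hypothetical.

```python
def elevation_grid(points, size=4):
    """Bin (x, y) positions in [0, 1] into a size-by-size grid of document counts."""
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid

# Hypothetical document positions: three close together, one far away.
points = [(0.10, 0.10), (0.15, 0.20), (0.20, 0.12), (0.80, 0.90)]
grid = elevation_grid(points)

for row in grid:
    print(row)
```

The cell holding the three neighboring documents ends up with the highest count, which corresponds to a "hill" in the article's terms. A production map would presumably smooth these counts into contours.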
The technological basis for Themescape, called Spire by the government, is available from Battelle, which is customizing it for use in subscribing government agencies. The new commercial version is being used at British Petroleum, Philips Electronics, Ford and Texaco.