Using symbol pairing, researchers at Rochester Institute of Technology have developed a search engine that makes it practical to retrieve information based on mathematical terms.
Any MCU/MPU hardware or software engineer whose work involves complex mathematical algorithms knows that traditional search engines are useless for finding specific references to a particular expression or equation. This inability to correctly identify and find mathematical expressions is common to almost all commercial online and personal desktop/laptop search engines.
But now, researchers at Rochester Institute of Technology have created an enhancement to traditional search engines that will make searching for mathematical expressions as easy as searching for text. Principal investigator Richard Zanibbi, associate professor of computer science at RIT, told EE Times that eventually this research may find its way into traditional tools such as Google, but a lot of work remains to make it both easier to use and faster.
Called Tangent, the search engine, created by graduate students David Stalnaker and Nidhin Pattaniyil, allows experts and non-experts to search effectively for formulas by entering them in the LaTeX scientific markup language, or just by drawing the formula.
Currently Tangent is built on top of the Solr open-source enterprise search platform. It has been used to index many collections, including Wikipedia and part of the arXiv collection of science articles. Zanibbi said a demonstration search engine is available online, where some typical results are displayed and where visitors can also enter their own equations and give it a test run.
Nidhin Pattaniyil, a 2014 computer science master’s graduate who helped create Tangent, presents the math search engine poster paper at NTCIR in Tokyo. (Source: Rochester Institute of Technology)
On average the engine can retrieve 28,000 relevant results in 2.2 seconds. “But while that might be suitable for a desktop application, that is too slow for large scale implementations,” Zanibbi told EE Times. “We plan to continue scaling our search engine up and make it faster.” However, he said, when searching for mathematical expressions, the results are better than Google's. At a recent information retrieval competition held in Japan, the Tangent search engine produced the best results for formula search and had the highest percentage of “relevant” hits.
What is so different about Tangent from traditional search engines?
According to Zanibbi, virtually all commercial search tools are based on linear left/right representations of text, which do not take positional information (such as superscript/subscript and fractions) into account. Mathematical expressions require the use of a more nuanced tree-structured algorithm to correctly identify and search for such terms. “For many people, visual elements are the anchor for understanding how to organize things, especially with math. We can’t just rely on text-based math, we need an intuitive search engine for visual math.”
Just as in traditional text information retrieval, documents are parsed for words to be inserted into an index. But with Tangent, pairs of math symbols are also inserted into the index, along with information on the position of the symbols relative to one another, which preserves structural information. “It is much simpler than most people expect,” said Zanibbi. “You don’t have to encode everything.”
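To make the idea concrete, here is a minimal Python sketch of an index keyed on symbol pairs. It is not Tangent's code: the tuple format `(first symbol, second symbol, relation)`, the relation names, the sample documents, and the scoring rule (count of shared pairs) are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical pair format: (first symbol, second symbol, relation).
# The relation names are illustrative, not Tangent's actual encoding.
DOCS = {
    "doc_a": [("x", "2", "above"), ("x", "+", "adjacent"), ("+", "y", "adjacent")],  # x^2 + y
    "doc_b": [("x", "2", "below"), ("x", "+", "adjacent"), ("+", "y", "adjacent")],  # x_2 + y
    "doc_c": [("a", "b", "adjacent")],                                                 # ab
}

def build_index(docs):
    """Map each symbol pair to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, pairs in docs.items():
        for pair in pairs:
            index[pair].add(doc_id)
    return index

def search(index, query_pairs):
    """Rank documents by how many distinct query pairs they share (assumed scoring)."""
    scores = defaultdict(int)
    for pair in set(query_pairs):
        for doc_id in index.get(pair, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_index(DOCS)
query = [("x", "2", "above"), ("x", "+", "adjacent")]  # the start of "x^2 + ..."
results = search(index, query)
print(results)
```

Because the relation is part of the key, the superscript in `doc_a` and the subscript in `doc_b` land in different index entries, so `doc_a` matches both query pairs while `doc_b` matches only one.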
Like existing text-based search engines, the approach developed at RIT uses the well-known inverted index technique. But instead of finding the location of text and storing such information in a database index, the researchers at RIT use an inverted index to map pairs of symbols in a particular mathematical equation and store that data in a symbol layout tree that contains information about position. The positional relations include above/superscript, below/subscript, adjacent, and within (for square roots). Fractions are encoded as a FRAC symbol with the numerator above and the denominator below. A square root can have an expression within it, and other symbols will be adjacent.
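The layout relations described above can be sketched as a small tree in Python. This is an illustration under stated assumptions, not the RIT implementation: the `Node` class, the `SQRT` symbol name, and the `symbol_pairs` helper are hypothetical, and the sketch only emits pairs for directly connected symbols, whereas the real system may index more than that.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One symbol in a (hypothetical) symbol layout tree."""
    symbol: str
    # Maps a relation name ("adjacent", "above", "below", "within") to a child node.
    children: dict = field(default_factory=dict)

def symbol_pairs(node):
    """Yield (parent symbol, child symbol, relation) for every edge in the tree."""
    for relation, child in node.children.items():
        yield (node.symbol, child.symbol, relation)
        yield from symbol_pairs(child)

# Layout tree for: sqrt(x) + a/b
# The fraction is a FRAC symbol with the numerator above and the denominator below.
frac = Node("FRAC", {"above": Node("a"), "below": Node("b")})
plus = Node("+", {"adjacent": frac})
# The square root has x within it, and the "+" sits adjacent to it.
root = Node("SQRT", {"within": Node("x"), "adjacent": plus})

pairs = list(symbol_pairs(root))
for pair in pairs:
    print(pair)
```

Each tuple produced here is the kind of key that could be stored in an inverted index, so a query formula is matched by the pairs it shares with indexed formulas rather than by a left-to-right string comparison.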
RIT's math search engine retrieves complex equations in documents, using math expressions as the queries.
Tangent was tested against seven other search engines at the 11th NII Testbeds and Community for Information Research (NTCIR) conference held Dec. 9–12 in Japan. The search engine described there beat the competition when searching through Wikipedia articles and a collection of 100,000 scientific documents. Tangent produced the highest-rated top five hits for combined text and math queries, with 92 percent of queries being relevant.
Currently, Zanibbi is working to improve Tangent with Kenny Davila, a Ph.D. student in computing and information sciences, and with collaborators Frank Tompa and Andrew Kane from the University of Waterloo, Canada. Tangent is part of the National Science Foundation-funded project on “Combining Algorithms for Recognition and Retrieval of Mathematics.”
I've tried out RIT's demonstration mathematical expression search engine on a number of equations I have come across in papers and articles I have edited or read. So far, it has flawlessly pulled up relevant articles containing the equations I entered. I did not notice how much time it took. I was just happy it found results without the roundabout and time-consuming workarounds that traditional text searches require.
Bernard Cole, MCU and PCB Designline editor, is an embedded microsystems technology analyst who writes about hardware/software design and use across a range of applications. Contact him at firstname.lastname@example.org. EE Times