# New measure of success cited for statistical prediction

HANCOCK, N.H. — Renewed scrutiny of a statistical technique used by British intelligence to decode German military communications during World War II has opened new avenues in statistical prediction that researchers say could improve machine-learning software.

Today's spell checkers, data retrieval methods and speech recognition systems use variations of the Good-Turing estimator, named for British mathematicians I.J. Good and Alan M. Turing. The technique, representing an intuitive breakthrough in the modeling of probability distributions behind data streams, was credited with shortening the war by several years. But while Good published a mathematical justification of the method after the war, statisticians remained unsure why the estimator worked so well.

Only within the past few years has an outside group come up with a method for quantifying the success of the Good-Turing algorithm. Now that work, by Alon Orlitsky and his colleagues at the University of California-San Diego's Department of Electrical Engineering, has yielded a statistical estimator that the researchers say is more accurate than Good-Turing over time. The innovation is asymptotic, meaning that it will eventually outperform Good-Turing predictions on given sequences. But the rate of convergence is a practical issue that needs more work, Orlitsky said.

As Orlitsky's team tried to measure the performance of Good-Turing, it discovered a definite limit on how well the method performs. A precise mathematical formulation of that limit led to insights into creating a limitless estimator.

The estimation problem is commonly encountered in applications of probability theory: Given a random data stream, infer the probability distribution that generates it. For example, observations of a long string of binary digits generated by repeated coin tosses would quickly reveal that the occurrence of each digit is equally likely. Even if an observer doesn't know how the string of bits is generated, the frequency patterns will give clues to its generation. As the data stream lengthens, the frequencies will converge to equal outcomes for each digit-and in general, the longer the sequence, the better the estimation technique becomes.

"If the number of possible outcomes is small, the problem is easy. But with a large set of outcomes, the optimal distribution begins to disappear into noise," Orlitsky said. In addition, it is necessary to include all possible outcomes in the estimation of a probability distribution even if they do not occur-the so-called "hidden outcome" problem.

Orlitsky uses the example of a safari drive that yields sightings of three giraffes, one zebra and two elephants. Based on that experience, one might naturally assign probabilities of 1/2 for encountering giraffes, 1/6 for zebras and 1/3 for elephants. But that assumption fails to recognize that the next animal encountered might be a lion. To avoid a nasty surprise, the good estimator would take the entire set of known species in the area of the game drive into account and assign a probability to all species in the set.

That was the problem faced by British cryptologists during the World War II. The German Enigma encryption machine used a huge number of decryption keys, making it almost impossible to crack the code. British intelligence had gained possession of Enigma machines, had determined how they worked and had even obtained a copy of the full book of keys. Some messages had been decrypted and the keys used recorded, so that the code breakers had a small sample from a very large set of keys. But it was unlikely the Germans would continue to use the same keys, so some method of assigning a probability distribution to the keys not yet used was needed.

Instead of quantifying the occurrence of keys, Good and Turing decided to quantify the frequencies at which keys occurred. It seems intuitive that if a key has been seen frequently already, it is less likely to occur in the future, while one that has not yet occurred or has been seen only once will have a higher probability of appearing.

Turing assumed that the frequencies would follow a standard binomial distribution. Taking the frequency of known keys as empirical data, he assigned probabilities to the entire set of keys.

It was a promising approach, but the method suffered from noise problems. Good introduced smoothing algorithms on the data to eliminate the noise. That was the genesis of the Good-Turing estimator.

The data-smoothing aspect is key to getting the method to work effectively, and a lot of work over the years has gone into devising better smoothing techniques. Depending on how data is smoothed, different versions of the Good-Turing estimator have been created for different applications. By coming up with a means of quantifying the success of these different estimators, Orlitsky found that they all have a hard limit on their effectiveness.

Orlitsky was able to discover this limit by quantifying the problem in terms of the positive integers. The nature of the sample set is actually irrelevant to the probabilistic algorithm. What matters is the order in which outcomes appear and how often they appear. So a sample sequence such as giraffe, giraffe, elephant, giraffe, zebra would be encoded in numbers as 1,1,2,1,3. Every time a new item appears, it is assigned the next-highest number, so that this mathematical model, according to its creators, can capture the worst possible problem-one in which there is an infinite number of hidden data items.

Using numbers as the sample set also makes it possible to consider the asymptotic behavior of estimators as the length of sampled data increases indefinitely.

A given method for estimating the probability of a sequence of numbers is compared with the maximum probability assigned to the sequence by all possible probability distributions. Specifically, Orlitsky defines the attenuation of an estimator as the ratio of the maximum value resulting from the application of all possible probability distributions to the value assigned by the estimator. Since an estimator uses some probability distribution, its value will always be lower than the maximum value, so this measure is always greater than 1. Thus, a figure of merit would be how close the attenuation of an estimator approaches 1.

"That type of definition has been used before on problems where the sample set is small and known, but we have extended it to large, even infinite sample sets," Orlitsky said. By looking at how attenuation varies as the length of a data stream increases, Orlitsky and his colleagues were able to discover the ultimate limits of a given estimation technique.

A common technique known as the add-constant estimator, which was first suggested by Pierre-Simon Laplace, the inventor of probability theory, turns out to be very poor. As the data stream lengthens, the estimates actually diverge away from 1; thus add-constant estimators have infinite attenuation. That was a surprising finding, because add-constant estimators have always been a simple and effective approach on small-sample-set problems.

Good-Turing estimators always stay within a factor of 2 of the optimum probability, but Orlitsky discovered that they have a lower limit as well. He was able to show that an estimator with one of the simplest smoothing algorithms could not get below a factor of 1.39 of the best possible estimation for a string of data. Estimators with more complex smoothing algorithms were difficult to analyze, but simulations indicated that they too had a lower limit on their attenuation.

"Theoretically we can show that it is possible to create estimators that attenuate to 1, but the practical aspect is how fast they converge," Orlitsky said. "That is a problem we are working on now-and we have some new results that we are not ready to talk about at this time."