LONDON: A new way of using neural networks to perform vision and other machine understanding tasks has emerged from the Gatsby Computational Neuroscience Unit at University College London. By separating the problem into perception and recognition tasks, the new algorithm makes it possible to train the neural network with less outside intervention.
A type of network called a Restricted Boltzmann Machine (RBM) is built upon the idea of learning and recognizing through a "Product of Experts," a concept that is thought to be more biologically valid than its predecessors. With the RBM algorithm, developed by Geoffrey Hinton, a cognitive researcher at the Gatsby unit, it is now also much more efficient.
If the technique is widely adopted, the network models produced may not only aid in building better vision recognition but also help in the understanding of the human visual system.
'A godsend'
"For hardware designers, the binary-unit version of the Product of Experts is a godsend," said professor Alan Murray of the department of electronics and electrical engineering at the University of Edinburgh. "It represents, I believe, the first neural-network architecture that is both sensibly implementable and worth implementing."
In terms of applications, he said, "I have uncovered an ability in the binary Product of Experts to perform online and adaptive novelty detection in both artificial and real data." When fed real heartbeat data, "the Product of Experts is able to model normality and to detect anomalies such as ventricular ectopic beats autonomously, without supervised training."
This, he said, "offers the possibility of a silicon integrated novelty-detector chip with applications in many forms of sensor processing, with direct analog interface to the sensors."
Daniel Lee, a member of the technical staff at Bell Laboratories, called the Product of Experts (PoE) model "an interesting architecture, because it allows the individual experts to combine their very general preferences to obtain specific likelihoods."
In standard mixture-type models, Lee said, "the individual experts have to be very specific themselves. So, for example, if you had one expert that preferred furry animals, whereas another expert preferred domesticated animals and another preferred small animals, their votes combined in a PoE model would light up dogs and cats very nicely." The mixture model, by contrast, "would have one expert prefer dogs and another expert prefer cats, and their sum would be dogs-and-cats."
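Lee's furry/domesticated/small example can be tried with a few lines of Python. This is only an illustrative sketch: the animal list, the trait assignments and the 3-to-1 preference ratio are all invented here, and the mixture shown simply averages the same three vague experts to show that summing their opinions does not sharpen them the way multiplying does.

```python
# Hypothetical toy data: which traits each animal has.
ANIMALS = {
    "dog": {"furry", "domesticated", "small"},
    "cat": {"furry", "domesticated", "small"},
    "wolf": {"furry"},
    "goldfish": {"domesticated", "small"},
    "mouse": {"furry", "small"},
    "cow": {"domesticated"},
}


def expert(trait, preference=3.0):
    """A vague expert: animals with the trait are `preference` times likelier."""
    scores = {
        name: (preference if trait in traits else 1.0)
        for name, traits in ANIMALS.items()
    }
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


def product_of_experts(experts):
    """Multiply the experts' probabilities, then renormalize."""
    scores = {}
    for name in ANIMALS:
        p = 1.0
        for e in experts:
            p *= e[name]
        scores[name] = p
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


def mixture_of_experts(experts):
    """Average the experts' probabilities (equal mixing weights)."""
    return {name: sum(e[name] for e in experts) / len(experts) for name in ANIMALS}


experts = [expert("furry"), expert("domesticated"), expert("small")]
poe = product_of_experts(experts)
mix = mixture_of_experts(experts)

for name in ANIMALS:
    print(f"{name:9s} product={poe[name]:.3f} mixture={mix[name]:.3f}")
```

With these made-up numbers the product places roughly two-thirds of its probability on dogs and cats, while the average of the same experts places less than half there.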
What Hinton has developed, Lee explained, is a "nice practical algorithm that allows the experts to learn these types of general categories from data."
The new algorithm builds upon some of Hinton's earlier work in Boltzmann machines and RBMs. Lee described these as "similar to PoEs in that individual experts' opinions are combined multiplicatively." However, he said, "they have long suffered from the disadvantage that training a Boltzmann machine involved generating lots of random samples on a computer, which took too much time."
Simpler training
Hinton's PoE algorithm takes far fewer samples to train the system "and could enable these types of models to fit large, complex data sets," Lee said.
In engineering terms, conventional neural networks are simply complex filters that take incoming signals A, B and C (which could, for instance, be images of people) and give out the right answers X, Y and Z (perhaps names), respectively. They have major advantages in real-world applications like face recognition, because A, B and C cannot be known exactly in advance: Age, lighting, pose, dress and other variables make it extremely unlikely that anyone would ever look the same to a camera on different occasions.
Unlike more conventional algorithmic systems, neural networks, in which processing and storing information are essentially the same task, reconfigure themselves based on experience. They learn that various images map to X and find their similarities, while at the same time determining how they differ from those images that map to Y. This learning is what makes them so powerful.
Structurally, neural networks consist of a number of "neurons," or processing elements, that sum and then perform some function on incoming data from other neurons. In back-propagation, the neurons in the first layer get their information from the outside world: In an image-processing application, this means that each neuron looks at the signal coming in from a single pixel. The next, or hidden, layer can have any number of neurons, each connected to many of the pixels in the first layer. These neurons are then connected to the output.
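The layered structure described here can be sketched in a few lines of Python. Everything specific in this example is an assumption made for illustration: the four-pixel input, the three hidden neurons, the two outputs, the weight values and the choice of a sigmoid as the "function" each neuron applies. The input layer is treated as a simple pass-through of pixel values.

```python
import math


def neuron(inputs, weights, bias):
    """Sum the weighted inputs, then squash the total with a sigmoid."""
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-total))


def forward(pixels, hidden_layer, output_layer):
    """Propagate pixel values through the hidden layer to the outputs."""
    hidden = [neuron(pixels, weights, bias) for weights, bias in hidden_layer]
    return [neuron(hidden, weights, bias) for weights, bias in output_layer]


# Hypothetical 4-pixel "image" and hand-picked (untrained) weights.
pixels = [0.0, 1.0, 1.0, 0.0]
hidden_layer = [
    ([0.5, -0.2, 0.1, 0.4], 0.0),
    ([-0.3, 0.8, 0.6, -0.1], -0.5),
    ([0.2, 0.2, -0.4, 0.3], 0.1),
]
output_layer = [
    ([1.0, -1.0, 0.5], 0.0),
    ([-0.5, 1.2, 0.3], 0.2),
]

outputs = forward(pixels, hidden_layer, output_layer)
print(outputs)
```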
To train the network, the image signals are allowed to propagate through it. They are processed by the neurons and attenuated or amplified by the strength of the connections between them, thus storing the learned information. Then, the "answer" is compared with the label already assigned to the data. In that way the network learns by changing the interconnection weights to minimize the difference between the answer it gave and the correct one.
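The compare-and-adjust loop can be illustrated with a deliberately reduced example: a single sigmoid neuron trained on four labeled inputs, rather than full back-propagation through a hidden layer. The task (learning logical OR), the learning rate and the number of passes are arbitrary choices made for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """The neuron's 'answer' for one input pattern."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))


def train(examples, epochs=2000, learning_rate=0.5):
    """Nudge the weights to shrink the gap between answer and label."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            error = label - predict(weights, bias, inputs)
            weights = [
                w + learning_rate * error * x for w, x in zip(weights, inputs)
            ]
            bias += learning_rate * error
    return weights, bias


# Labeled training data: each input pattern comes with the correct answer.
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(examples)
answers = [round(predict(weights, bias, inputs)) for inputs, _ in examples]
print(answers)
```

Note that every training pattern must arrive with its label attached, which is the practical burden of supervised learning.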
How well this process works determines whether one neural-network architecture or configuration performs better than another.
One difficulty in this basic approach is that the learning must be supervised: there must be some kind of teacher. The neural network is shown lots of examples of objects that have been labeled in advance, and so eventually begins to associate the input (face) with the label (name). This technique is known as response learning, because it directly links the inputs with the outputs. But from a practical point of view it is inefficient: Not only is lots of training data necessary to fully represent the "fuzziness" of the problem at hand, but all that training data has to be labeled somehow, presumably by a human.
Hinton and his colleagues have chosen to concentrate on another approach: perceptual learning. Instead of focusing on recognizing images, the network's job is simply to learn to be good at perceiving a given type of data without initially assigning any label to it. An example of this kind of learning in humans is the fact that people brought up to speak different languages are better at distinguishing between different sets of sounds, even outside the context of a meaningful word or sentence.
All the perceptual neural net does is to decompose the data into a particular combination of features: The better adapted the feature set, which is stored in the interconnection weights, the more efficient and accurate the network will be for that problem.
By using this perceptual learning as a "front end" to a pattern-recognition system, the second part, the response learning, becomes a more tractable problem. If the perceptual network has done its job well, a class of objects should now be represented by a relatively small number of feature combinations compared with the number of images that went into defining those features. Essentially, because the fuzziness of the original images has been encapsulated in the hidden units, much less labeled training data should be necessary for the second stage.
One way of determining whether a network is good at perceiving incoming data is to see what kind of data it would generate. So-called generative models, where the neural network is essentially run backwards and the hidden units (features) are stimulated to produce their own "input," have the advantage of showing what a network "believes in."
Unfortunately, the more powerful nonlinear generative models have traditionally been difficult to work with. They fall into two classes, both of which have disadvantages. The first, known as causal models, can be compared to computer graphics. Though it is easy to generate pictures from, say, a 3-D model, it is not easy to reconstruct the 3-D model from this data. In fact, this is why machine vision is so difficult in the first place. It is easy to generate images, yet difficult to extract meaning from those images afterward.
Hinton's Product of Experts was thought to have the opposite problem. Here, to generate an image from scratch, every active hidden unit must agree on every pixel. To achieve this, each hidden unit is either active or not based on its own statistics. If on, it then "votes" on whether each pixel should be on or off based on the relevant interconnect weights. If the vote is not unanimous, the dice are effectively rolled again to select which hidden units should be active until a set is found that can agree on what image should be produced.
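The roll-the-dice procedure in this paragraph can be rendered as a toy program. It is a literal reading of the description, not Hinton's code: the three experts, their activation probabilities and their fixed on/off votes are invented for illustration, and in a real model each vote would itself be probabilistic. Pixels that no active expert cares about are simply left off.

```python
import random

# Hypothetical experts: chance of being active, plus votes on the
# pixels each one cares about (pixel index -> on/off).
EXPERTS = [
    {"p_on": 0.5, "votes": {0: 1, 1: 1}},
    {"p_on": 0.5, "votes": {1: 1, 2: 0}},
    {"p_on": 0.5, "votes": {2: 1, 3: 1}},  # clashes with expert 1 on pixel 2
]
N_PIXELS = 4


def try_once(rng):
    """Pick active experts at random; return their image if they all agree."""
    active = [k for k, e in enumerate(EXPERTS) if rng.random() < e["p_on"]]
    image = {}
    for k in active:
        for pixel, vote in EXPERTS[k]["votes"].items():
            if image.setdefault(pixel, vote) != vote:
                return active, None  # vote not unanimous
    return active, [image.get(p, 0) for p in range(N_PIXELS)]


def generate(seed=0, max_tries=1000):
    """Re-roll the set of active experts until one set agrees on an image."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        active, image = try_once(rng)
        if image is not None:
            return active, image, attempt
    raise RuntimeError("no unanimous set of experts found")


active, image, attempts = generate()
print(active, image, attempts)
```

With only three experts a consistent set turns up quickly; the number of re-rolls needed grows rapidly as experts and pixels are added, which is the impracticality the article refers to.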
Building a set of experts that all agree on an image is not so hard, as long as the knowledge of each of these is sufficiently narrow. In Hinton's system, not all of the hidden units care about every pixel in the image: They specialize in detecting certain features and ignoring others. Therefore, some features will complement others and will be able to coexist with them. On the other hand, it is difficult to find those winning combinations in the first place, which is why such models were thought to be impractical.
Building on the past
What Hinton realized was that there is no need to start from scratch, and he was able to exploit this fact in the learning algorithm for a Product of Experts. Data comes in through the input, which excites various units (features) in the hidden layer. These are then used to generate an image, which is compared with the original data to produce an error signal. That signal, in turn, is used to update the interconnection weights between the input pixels and hidden units.
This way, with each new piece of data, the feature set is refined to more perfectly generate images like those in the training set. Effectively, the neural network is accepting the excited hidden units, caused by the data, as the "answer" to the question, "Which features should I use to generate this image?" and then trying to optimize the features that have been turned on to create a better image next time.
Over time, the system efficiently creates a set of optimized features that can accurately regenerate the training data. That allows easy inference between generated image and feature combination.
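The data-to-features-to-image-to-update cycle can be sketched for a small binary RBM. The article does not name the update rule; this sketch assumes one-step contrastive divergence (CD-1), the rule usually associated with this work. The class name `TinyRBM`, the two six-pixel training patterns, the learning rate and the number of passes are all invented for illustration.

```python
import math
import random


def sigmoid(x):
    x = max(-50.0, min(50.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Hypothetical minimal binary RBM trained with CD-1."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.w = [
            [self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
            for _ in range(n_visible)
        ]
        self.b_vis = [0.0] * n_visible
        self.b_hid = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [
            sigmoid(self.b_hid[j] + sum(v[i] * self.w[i][j] for i in range(len(v))))
            for j in range(len(self.b_hid))
        ]

    def visible_probs(self, h):
        return [
            sigmoid(self.b_vis[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
            for i in range(len(self.b_vis))
        ]

    def cd1_step(self, v0, lr=0.1):
        # 1. The data excites the hidden units (features).
        h0_probs = self.hidden_probs(v0)
        h0 = [1 if self.rng.random() < p else 0 for p in h0_probs]
        # 2. The excited features generate an image (as pixel probabilities).
        v1 = self.visible_probs(h0)
        h1_probs = self.hidden_probs(v1)
        # 3. The data/reconstruction mismatch drives the weight update.
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.w[i][j] += lr * (v0[i] * h0_probs[j] - v1[i] * h1_probs[j])
            self.b_vis[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.b_hid[j] += lr * (h0_probs[j] - h1_probs[j])

    def reconstruction_error(self, v):
        recon = self.visible_probs(self.hidden_probs(v))
        return sum((a - b) ** 2 for a, b in zip(v, recon))


# Two made-up six-pixel training "images"; no labels are involved.
patterns = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
rbm = TinyRBM(n_visible=6, n_hidden=2, seed=0)

error_before = sum(rbm.reconstruction_error(p) for p in patterns)
for _ in range(500):
    for p in patterns:
        rbm.cd1_step(p)
error_after = sum(rbm.reconstruction_error(p) for p in patterns)

print(f"reconstruction error before={error_before:.3f} after={error_after:.3f}")
```

No unanimous set of experts has to be searched for here: the hidden units switched on by each training image are taken as given, and only the weights are adjusted.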
"The advantage of this new learning technique is that it allows a network of nonlinear neurons to learn how to decompose images into features," said Hinton. "And after the network has learned, it can extract features from images very rapidly."
In experiments where the technique was applied to both handwriting and face recognition, detectors similar to those found in early vision systems, such as those that look for lines or edges at different orientations, emerged naturally as hidden units, he said.
The fact that many of the feature detectors look alike despite being specialized is particular to the PoE approach. In causal models, the features or hidden units are independent and compete to affect the image. Like features are therefore less apt to coexist.
In Hinton's networks, because features work cooperatively, it is natural to evolve many that will affect the image in a similar way.