Deep inside the urban canyons of San Diego, the heart of an engineer waxes and wanes, happiness or sadness seesawing on the strength of two punctuation marks: a colon paired with a parenthesis, either closed :) or open :(.
But researchers at Javier Movellan's Machine Perception Laboratory at the University of California, San Diego, want to go beyond cute keystrokes in making the computer a more empathetic companion. And there's good reason to do so, said Marian Stewart Bartlett, a postdoctoral researcher at the lab who is exploring ways to make a computer understand facial expressions.
"People need to share their feelings in electronic communications that's why they invented the sad face and smiley face out of a colon and a parenthesis," she said. "Many an e-mail argument would have been averted in person, because our facial expressions would have softened spoken words that sound harsh when read."
Since Movellan founded the Machine Perception Lab three years ago, he has focused its research on automating the recognition of communication modalities that people naturally use, but which are foreign to current computer user interfaces. Cues like lip movements, facial expressions and tone of voice are all Greek to a PC, but Movellan believes they carry a richer stream of communication than a keyboard and display.
Javier Movellan's Machine Perception Lab hopes to create a user interface that registers facial expressions and tone of voice. It now recognizes 12 moods.
"At the Machine Perception Lab, we believe that personal communications involve a great deal of emotions," said Movellan. "These so-called affective communications, most of which involve the face, greatly enrich the interactions between humans. In the lab, our goal is to build computer interfaces that modify their responses to us in recognition of our emotional states."
Movellan and his colleagues hope to go beyond cute punctuation by reading the expressions right off your face, and electronically transmitting them along with your words in futuristic e-mail "talking heads." A tiny camera atop your PC will annotate the emotions you are expressing as you type your e-mail messages, and convey them on screen as a kind of running version of the "happy face." It will show, besides joy and sadness, anger, frustration and eight other emotional states. So far, Bartlett's prototype can reliably recognize 12 emotional states from a video camera focused on a face.
"E-mail is just an example," said Bartlett. "There are hundreds of applications for our facial-expression recognizers. For instance, imagine a Furby [the electronic pet] that responds to your child's sad face by saying 'There, there, don't be sad, why don't we play a game?' "
Of the five senses, the Machine Perception Lab is currently concentrating on sight and sound, devoting the majority of its research projects to recognizing facial expressions and tone of voice. Adding affective recognition will enable a Furby, in the example above, to react to happiness, sadness, anger or other emotions in a child, for instance by whimpering when the child scolds it.
"When I started the lab three years ago to read lips, I thought we could enhance the accuracy of speech recognition by augmenting it with visual information from the lips," said Movellan. "Unfortunately, our hybrid recognizer, which combined acoustic and visual information, actually performed less well than the acoustic recognizer by itself, but the experience convinced me how rich the nonlexical content of communications is."
Movellan cited an intriguing audiovisual illusion, the McGurk effect, in which a recording of a speaker saying "ba" is dubbed over video of the same speaker mouthing "ga." The listener invariably hears something in between, usually "da."
The McGurk effect ate into the accuracy of the hybrid speech recognizer. However, the McGurk quirk also convinced Movellan that human communications contain much more content than mere syntax. In other words, the meaning of what we say is more expansive than the meaning of the words we use to say it.
"Speech recognizers try to strip away all the emotional content of communication, reducing it to a robotic-sounding drone," said Movellan. "But we do just the opposite we strip away the words being spoken and retain just the emotions, like the way we communicate with animals. Our pets can understand our emotions very accurately, even though they understand very few of the words we are using. This is true of all our modes of communication, so at the lab we work with as many modes as possible."
One experiment running in the lab matches up each speaker with the words being spoken by combining visual and aural data from a videotape. By correlating the movement of pixels with the sound waves, it is possible to put closed captions beneath the current speaker in, say, a panel discussion, rather than just running them along the bottom of the screen and leaving it up to the viewer to figure out who is saying what.
Correlating those images and sound waves involved "studying the ventriloquist effect, whereby we 'hear' the dummy speaking the words because we see its mouth moving," said Movellan.
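The article doesn't detail the lab's algorithm, but the core idea, scoring each region of the frame by how well its pixel motion tracks the loudness of the soundtrack, can be sketched in a few lines of Python. Everything below (the width-wise split into thirds, the Pearson correlation, the synthetic demo) is an illustrative guess, not the lab's actual method:

    import numpy as np

    def locate_speaker(frames, loudness, n_regions=3):
        """Return the index of the frame region whose pixel motion
        best tracks the audio loudness, plus all region scores.

        frames: (T, H, W) grayscale video, one frame per loudness sample
        loudness: (T,) audio energy per frame
        """
        # Motion energy: mean absolute frame-to-frame pixel change.
        motion = np.abs(np.diff(frames.astype(float), axis=0))    # (T-1, H, W)
        scores = []
        for region in np.array_split(motion, n_regions, axis=2):  # split width-wise
            energy = region.mean(axis=(1, 2))                     # (T-1,)
            # Pearson correlation between region motion and loudness.
            scores.append(np.corrcoef(energy, loudness[1:])[0, 1])
        return int(np.argmax(scores)), scores

    # Toy demo: noise everywhere, but the right third "moves" with the audio.
    rng = np.random.default_rng(0)
    T, H, W = 200, 24, 48
    loudness = np.abs(np.sin(np.linspace(0, 12, T))) + 0.1 * rng.random(T)
    frames = 5 * rng.random((T, H, W))
    frames[:, :, 32:] += 10 * loudness[:, None, None] * rng.random((T, H, 16))
    print(locate_speaker(frames, loudness)[0])   # -> 2, the rightmost region

A caption system would then draw the text beneath whichever region wins at each moment.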
In the case of Bartlett's facial recognition of emotions, the trick was to give the computer the same sort of processing the brain's visual cortex uses. Neurons in the primary visual cortex, an area called V1, behave like spatial bandpass filters known as Gabor filters. The lab found that by running software versions of Gabor filters over a video stream of the face, it could reliably identify the emotions being expressed.
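The story doesn't give Bartlett's filter parameters, but a minimal sketch of the approach, a bank of complex Gabor filters at several frequencies and orientations whose magnitude responses form the feature vector handed to a classifier, might look like the following Python; the frequencies, Gaussian width and kernel size here are illustrative guesses:

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel(freq, theta, sigma=3.0, size=21):
        """Complex Gabor kernel: a plane wave at spatial frequency `freq`
        and orientation `theta`, windowed by a Gaussian of width `sigma`."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        # Rotate coordinates so the wave propagates along `theta`.
        xr = x * np.cos(theta) + y * np.sin(theta)
        envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
        carrier = np.exp(2j * np.pi * freq * xr)
        return envelope * carrier

    def gabor_features(image, freqs=(0.1, 0.2), n_orientations=4):
        """Stack of Gabor magnitude responses, one channel per
        (frequency, orientation) pair, ready for a standard classifier."""
        feats = []
        for f in freqs:
            for k in range(n_orientations):
                kern = gabor_kernel(f, k * np.pi / n_orientations)
                resp = convolve2d(image, kern, mode="same")
                feats.append(np.abs(resp))  # magnitude is locally phase-invariant
        return np.stack(feats)

    face = np.random.rand(64, 64)        # stand-in for a cropped face image
    print(gabor_features(face).shape)    # -> (8, 64, 64)

Taking the magnitude rather than the raw filter output makes the features insensitive to small shifts of the face, which is part of why Gabor representations suit expression recognition.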
According to Bartlett, the technology to perform this emotional-recognition task is finished. All that remains is the nitty-gritty engineering work to integrate the technology into working applications.
"We can already reliably recognize the 12 principal facial expressions with the same accuracy as a human expert, but we need to make the technology work in real environments, where there is noise and different scales to deal with," she said.
In the lab, users' heads are clamped in a virtual vise; that is, subjects must hold their heads still at a specified distance from the camera, seated against a plain white background. In the real world, however, a person will be moving around in the video frame, and backgrounds will contain coworkers, posters of swimsuit models and who knows what else.
"The trick is how to find where the face is in the frame," Movellan said. "We think we can do that by analyzing the colors it appears that the colors of faces exist in a very small band, regardless of the race of the speaker." But he acknowledged that "a lot of tough engineering work remains to be done to create user-friendly applications."
In the end, the work at the Machine Perception Lab could revolutionize human-computer interactions. For instance, electronic pets could not only respond appropriately to children, but could therapeutically monitor senior citizens. Often, when queried, the elderly reply that they are fine, even when seriously depressed. An electronic pet that can recognize emotions could alert medical personnel to problems.
Education will likewise benefit, according to Movellan. Automated teachers could detect when a student is perplexed or frustrated, when material is understood and when it needs to be repeated or presented in a different way. By learning the manner in which each student learns best, virtual teachers will be able to adjust their strategies in the same way good human teachers do.
"It is important that we stop thinking of computers as a monitor and a keyboard," said Movellan. "In the future, everything will be a computer. All our sensors, like video cameras, will have very powerful microprocessors inside them to perform the same sorts of 'early processing' tasks that our eyes and ears perform independently of the brain. Our desks and even the walls of our homes will be filled with computers and display technology."
The Machine Perception Lab intends to make it possible for such hardware systems to reach their potential by creating software to recognize all the modalities of communications that people recognize in each other. "Even our desks will need to automatically respond to us," said Movellan. "For instance, desks will need to recognize gestures so we can grab virtual objects on the desk and operate them."