We are currently poised on the brink of a dramatic increase in the deployment of voice-controlled "things." Some people call this "embedded speech," while others refer to it as "emerging audio." Whatever you call it, it won't be long before the world in which we live is a very different place.
We hear a lot about the Internet of Things (IoT), which some people are starting to call "The Internet of Everything." Now, imagine being able to control all of these "things" by simply talking to them. Let's take this one step further, and imagine these things "talking" to each other.
Last year, I told you about the folks at a company called Sensory, who are doing some incredible things with regard to speech recognition. As part of these discussions, we considered the idea of being able to control your alarm clock by simply saying something like "Clock, please wake me at 6:30 a.m. tomorrow morning, and then wake my wife at 7:30 a.m." (See: Sensory's mega-cool TrulyHandsfree voice control 3.0 and Yes, We CAN Hear You Now! The Rise of Embedded Speech.)
What we are talking about now (pun intended) is taking this to the next level. As I pen these words, we are in the middle of a cold snap. Suppose I had the ability to talk to the thermostat in my home and say: "Thermostat, starting at 10:00 p.m., please drop the temperature to 69 degrees, but make sure the house is warmed up to 72 degrees by the time I get up in the morning."
Now, the thermostat could say "And what time will you be getting up?" However, it would make a lot more sense if it were to first check with my alarm clock to see if I'd already specified a wake-up time, in which case it wouldn’t have to waste my time asking me to repeat myself. Now let's suppose that I subsequently decide to get up a little earlier to walk on the treadmill, for example. So when I retire for the night, I might say to the alarm clock "Clock, I've changed my mind, please wake me at 6:00 a.m." Wouldn’t it be cool (or warm, as the case might be) if the clock automatically communicated this vital piece of new information to the thermostat?
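To make this clock-to-thermostat conversation a little more concrete, here's a minimal sketch in Python. Everything here is hypothetical -- the device names, the topic string, and the tiny in-memory message bus, which stands in for a real machine-to-machine protocol such as MQTT -- but it captures the idea of one "thing" announcing new information and another "thing" reacting to it:

```python
class Hub:
    """A toy in-memory publish/subscribe bus (hypothetical; real devices
    would talk over a network protocol such as MQTT)."""
    def __init__(self):
        self.subscribers = {}

    def subscribe(self, topic, handler):
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, payload):
        # Deliver the message to every device listening on this topic.
        for handler in self.subscribers.get(topic, []):
            handler(payload)


class Thermostat:
    """Listens for the alarm clock's wake-up time announcements."""
    def __init__(self, hub):
        self.warm_by = None
        hub.subscribe("alarm/wake_time", self.on_wake_time)

    def on_wake_time(self, payload):
        # Warm the house so it's comfortable by the new wake-up time.
        self.warm_by = payload["time"]


hub = Hub()
thermostat = Thermostat(hub)

# The alarm clock announces the user's revised wake-up time;
# the thermostat picks it up without the user repeating anything.
hub.publish("alarm/wake_time", {"time": "06:00"})
print(thermostat.warm_by)  # → 06:00
```

The design point is that neither device needs to know the other exists -- the clock just publishes its new wake-up time, and any interested device (thermostat, coffee maker, you name it) can subscribe to it.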
However, there is a bit of a "gotcha" to all of this, which is the fact that the majority of existing voice processing solutions, such as those employed by smartphone applications, are of a type known as "near-field." Basically, the "near-field" moniker refers to the fact that the user's mouth is "up close and personal" with regard to the microphone on the smartphone. When it comes to controlling things remotely, we need solutions that are capable of far-field voice input processing (FFVIP), which involves a whole new set of challenges.
One company at the forefront of FFVIP technology is Conexant Systems. Conexant's technology is deployed between the microphone(s) and the speech recognition technology, such as that provided by Sensory. The point is that, although Sensory's speech recognition technology is phenomenally clever, at the end of the day it's only as good as the signal you feed it -- a noisy signal going in is going to result in less than optimal results coming out. Conexant's technology takes horribly noisy signals from the outside world, performs its magic, and then presents crisp, clean signals to the downstream speech-processing algorithms as illustrated below:
One technique used to clean up noisy signals is to take the inputs from multiple microphones and use sophisticated digital signal processing to compare them, intelligently suppress extraneous noise in the environment, and focus on the dominant voice signal. Until recently, doing this effectively required incredible computational power coupled with four or more microphones. The problem is that the creators of things like tablet computers cannot afford (in terms of both cost and space) to deploy four microphones in their products. Current tablets typically have only one microphone, although next-generation tablets -- such as the iPad Air -- will boast at least two.
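To give a feel for what "comparing" two microphone signals can mean, here's a toy delay-and-sum beamformer in plain Python. This is my own illustrative sketch, not Conexant's algorithm: the second microphone's signal is shifted by the (assumed known) arrival delay so the voice adds coherently, while uncorrelated noise adds incoherently and is partially cancelled in the average.

```python
def delay_and_sum(mic1, mic2, delay):
    """Toy two-microphone delay-and-sum beamformer.

    `delay` is the number of samples by which the voice reaches mic2
    later than mic1. Shifting mic2 back into alignment makes the voice
    add coherently, while uncorrelated noise adds incoherently, so its
    power in the average is roughly halved (about a 3 dB SNR gain).
    """
    # Advance mic2 by `delay` samples to line it up with mic1.
    aligned = mic2[delay:] + [0.0] * delay
    return [(a + b) / 2.0 for a, b in zip(mic1, aligned)]


# Demo: the same "voice" waveform arrives at mic2 two samples late.
voice = [0.0, 1.0, 0.0, -1.0] * 4
mic1 = list(voice)
mic2 = [0.0, 0.0] + voice[:-2]  # delayed copy of the voice

output = delay_and_sum(mic1, mic2, delay=2)
print(output[:4])  # → [0.0, 1.0, 0.0, -1.0] (the voice is recovered)
```

A real far-field front end does much more than this -- it estimates the delay on the fly (e.g., via cross-correlation), adapts as the talker moves, and combines beamforming with echo cancellation and noise reduction -- but the core "align and combine" idea is the same.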
I was just chatting with Saleel Awsare, vice president and general manager at Conexant. He told me that Conexant's latest AudioSmart Solutions -- just announced at the Consumer Electronics Show (CES) and targeting far-field voice processing for TVs, PCs, smartphones, and tablets -- can achieve better results using only two microphones than traditional offerings have managed with four or more.
In fact, Conexant has just announced a new third-generation voice input processor system-on-chip (SoC) with an embedded, low-power speech recognition engine. This device, the CX2092x, has been designed specifically for deployment in Smart TVs.
The increased clarity in far-field conditions exhibited by the CX2092x allows the user to employ voice control from up to four meters away from the television. Already, LG -- one of the largest TV manufacturers in the world -- has announced that it will be deploying the CX2092x in its 2014 TV lineup, and I expect that other TV vendors will soon follow LG's lead. As I said earlier, we are currently poised on the brink of a dramatic increase in the deployment of voice-controlled "things." Hold onto your hats, because the world is about to change!
— Max Maxfield, Editor of All Things Fun & Interesting