It all comes back to Star Trek. Whenever I have a complaint about the way technology works (or doesn’t), a little voice in the back of my head says, “They don’t worry about this on Star Trek.” One of the things they apparently don’t worry about aboard starships a few centuries hence is being understood by computers. Keyboards and mice, we are led to believe, are relics of the distant past, and voice recognition has been perfected. That’s a rosy and probably overly optimistic future, but one small aspect of the Star Trek computer interface is closer than many people realize. Have you ever noticed that when giving spoken commands to the onboard computer, Enterprise crewmembers never worry about where the microphone is located? Somehow, the entire ship manages to listen to everything that’s spoken, intelligently pick out particular voices, and determine which words should count as commands. While we’re not quite there yet, technology has taken a meaningful stride in that direction, thanks to devices called array microphones.
The Inglorious Legacy of Speech Recognition
But first, a story. In late 1993, I was working as a computer graphic artist for a major electrical equipment manufacturer. My friend David was in charge of maintaining the group’s network of Macintosh computers, and one day he invited me into his office to see the company’s latest acquisition: a brand new Quadra 840av computer. David was grinning proudly because he got to play with it before anyone else, and he was eager to see my reaction to the machine’s highly touted voice-recognition capabilities. Apple had included a special digital signal processing (DSP) chip in the computer whose main purpose was to make voice recognition fast enough for day-to-day use. Having read the reviews already, I knew that I should be able to speak any menu command and have the computer execute it without getting anywhere near the keyboard or mouse. So I stood near the microphone and said, “Computer, open Microsoft Word.” It did. Now I was grinning too. “New. Paste. Select All. Text to Table.” I rattled off a bunch of commands and the computer dutifully executed every single one, instantly and perfectly. I was in geek heaven.
The Quadra ended up on the desk of one of my coworkers, a guy named Ken. He, too, was very excited about the new technology and eager to play with it. He named his computer Brancusi after the sculptor, and that was the word he used to trigger new commands. Unfortunately, in the real-world office environment, Ken’s success with voice recognition left a bit to be desired. We’d hear him plead with his computer. “Brancusi, open Photoshop.” A pause. “Brancusi!” Another pause. “Brancusi, what time is it?” Nothing. “Brancusi, WHAT TIME IS IT?” Ken was not amused when we called his attention to the clock on the wall. The point was, the machine refused to understand him, and before long he gave up and turned off the speech recognition feature altogether.
Over the past decade, speech recognition software has certainly improved. Where once you were restricted to a very limited vocabulary of commands—or dictation software that required you to pause after every word—now you can buy applications that will, with a fair degree of accuracy, transcribe normal, continuous speech. But speech recognition is still a very imperfect undertaking for a variety of reasons. Not least is the way microphones work.
The Ears Have It
It’s tempting to think of microphones as being something like human ears, but ears have a couple of very important advantages over microphones. First, they generally come in directionally optimized pairs. And second, they make use of a sophisticated signal processor known as a brain. It’s actually the brain that does the bulk of what we consider listening—sorting out which sounds deserve to be focused on, distinguishing one speaker from another, and following a conversation even when the speaker is moving around. Conventional microphones lack any intelligence and so simply pick up whatever is around them—the sound of computer fans, telephone calls, nearby conversations, music, or traffic. They have no way to discriminate and give you just the particular sounds you want. In terms of speech recognition, this causes some serious problems for your computer, which can’t figure out which sounds are meant to be understood as commands and which should be discarded as noise.
The usual way to “solve” these problems is to wear a headset microphone, which puts a highly directional pickup very close to your mouth. At such a close range the gain (or input volume) doesn’t have to be very high, so extraneous noises are usually avoided. Headset mics do provide pretty good results, but most people don’t really enjoy wearing a contraption on their heads just so they can talk to their computers—even if the headset is wireless.
Creating a Digital Ear
An array microphone does for microphones what your brain does for your ears. It gives them some intelligence. The general idea is that you take a set of microphones—which could be two, or eight, or thousands—and add a DSP chip with some sophisticated logic. The array microphone’s processor continuously figures out where the primary speaker is in the field of audio input it’s receiving, and selectively adjusts the output so that most of it is coming from the microphone that is getting the best signal. This amounts to “focusing” on a certain direction and distance so that the speaker’s voice, and little or no other noise, actually reaches the output. (Some array microphones are actually much more sophisticated than this, doing advanced noise cancellation and other tricks.)
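To make the “focusing” idea concrete, here is a minimal sketch in Python of delay-and-sum beamforming, a textbook way of steering a microphone array. It is not necessarily what any particular product does, and it differs slightly from the simplified description above: rather than favoring the one microphone with the best signal, it time-aligns all of the microphones toward a direction and averages them, so the voice adds up while unrelated noise partly cancels. Everything in it is an assumption made for illustration: the function names, the four-microphone layout, the two-sample delay between neighboring microphones, and the synthetic “voice” and noise.

```python
import math
import random

def delay_and_sum(channels, delays):
    """Advance each channel by its delay (in samples), then average them."""
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, delays):
        for i in range(n):
            j = i + delay
            if 0 <= j < n:          # samples shifted past the edge count as silence
                out[i] += samples[j]
    return [v / len(channels) for v in out]

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)

def best_steering_step(channels, steps):
    """Return the per-microphone delay step whose averaged output is loudest."""
    def output_power(step):
        delays = [step * m for m in range(len(channels))]
        return power(delay_and_sum(channels, delays))
    return max(steps, key=output_power)

# Toy scene (all values invented): 4 mics in a line, and a "voice" that reaches
# each successive mic 2 samples later, plus independent noise at every mic.
rng = random.Random(0)
N, MICS, TRUE_STEP = 800, 4, 2
voice = [math.sin(2 * math.pi * i / 20) + 0.5 * math.sin(2 * math.pi * i / 7)
         for i in range(N)]
channels = []
for m in range(MICS):
    lag = TRUE_STEP * m
    channels.append([(voice[i - lag] if i >= lag else 0.0) + rng.gauss(0, 0.3)
                     for i in range(N)])

# "Listen" in several candidate directions and keep the loudest one.
best_step = best_steering_step(channels, range(-4, 5))
focused = delay_and_sum(channels, [best_step * m for m in range(MICS)])

def error(signal):
    """Mean squared difference between a signal and the clean voice."""
    return power([a - b for a, b in zip(signal, voice)])

print("estimated delay step:", best_step)
print("single-mic error:", round(error(channels[0]), 4))
print("focused error:   ", round(error(focused), 4))
```

In this toy setup the search should settle on the delay step used to build the scene, and the focused output should sit closer to the clean voice than any single microphone does. A real array has to cope with echoes, moving speakers, and fractional-sample delays, which is where the more sophisticated processing comes in.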
The implication of this is that in principle, with an array microphone on the desk, a speaker can walk from side to side, or move backward or forward—and maintain the same level of accuracy in speech recognition as with a headset mic. But that’s in principle. Just as ordinary microphones vary in quality and sophistication, so do array mics. Some work well only at close distances; some cancel out certain kinds of noise better than others. As with anything, you get what you pay for—there are array mics that cost less than US$50, and those that cost $3,000. But even the most expensive array mics are only designed to pick a single speaker out of a background of noise; the all-important Star Trek tricks they can’t do—at least not yet—are distinguishing one speaker from another and intelligently distinguishing commands from conversation.
An Array of Uses
Array microphones are used in numerous applications besides computer speech recognition. They are sometimes used in recording audio for films and TV shows—where extraneous noises are a no-no—or to amplify the voices of performers in a play. Some hearing aids use an array of miniature microphones to help wearers focus on a single speaker in a noisy environment. You may also find array microphones in cars, where they’re used for hands-free phones. Home automation enthusiasts have been known to use microphone arrays—sometimes separate arrays in each room—so they can speak a command from anywhere in the house (“Turn kitchen lights on,” “Activate force field”) that will be carried out by equipment attached to a central computer. Array microphones have been around for decades, but in the last several years, advances in digital signal processing have made them much more sophisticated. A few centuries more, and we may have the other speech recognition issues ironed out for good. Then we can move on to that whole warp drive problem. —Joe Kissell
For a well-written and noncommercial (but somewhat technical) discussion of array microphones, check out Directional Microphone Array Processing Unit, a paper Daniel Schreck and Sean Nelson presented as a senior design project at Stevens Institute of Technology in Hoboken, New Jersey.
The current favorite among desktop array mics for speech recognition use is the Voice Tracker by Acoustic Magic. A much cheaper alternative (N.B. you get what you pay for) is the Superbeam SoundMAX Array Microphone from Andrea Electronics. At the extreme high end, consider the Audio-Technica DeltaBeam, which retails for about $3,000 and is designed for use in broadcasting and film.
Interested in automating your home to respond to your voice? A good place to start is Smarthome.
And if dictating letters or commands to your computer is your holy grail, you’ll also need some software, such as IBM ViaVoice Simply Dictation for Mac OS X or Dragon NaturallySpeaking for Windows. But a disclaimer: every time I decide to play with speech recognition, I end up abandoning it after a day or two. For me, the issue is not one of power or accuracy; I can get the computer to do what I need it to do with the sound of my voice. The main issue is that I can type quietly without disturbing my wife, who’s working in the same room (or the passenger next to me on the plane, or the other people in a cafe), whereas talking to my computer can be very distracting to others. In addition, I sometimes like to listen to music while I work. That would cause no end of problems for my computer, and if I were going to wear headphones, well, that would be the same inconvenience as wearing a microphone headset. There’s also the fact that most of what I do on the computer is either programming (for which speech recognition doesn’t make sense) or writing, which for me involves a lot of visual interaction with the text, not just blind dictation. All that to say: array microphones are extremely cool for what they do, but even if they work perfectly, they still aren’t enough to make me want to give up my keyboard.