When I started playing with computers, audio output was primitive and there was no means of audio input at all. Voice-controlled computers were pure science fiction. When the Sound Blaster gave my computer a voice, it also enabled rudimentary voice recognition. The mechanics were crude and the recognition poor, but the promise was there.
Voice recognition capabilities have improved in the years since. Automated phone systems have offered voice-controlled menus within the past decade or so. (“For balance and payment information, press or say 4.”) Recognizing words within a narrow, specific domain is now considered easy.
Just within the past few years, advances in neural networks (“deep learning”) have enabled tremendous leaps in recognition accuracy. No longer constrained to telling a handful of numbers apart, as when navigating a voice menu, voice commands can now be general enough to be interesting. And so now we have digital assistants like Apple’s Siri, Google Assistant, Amazon’s Alexa, and Microsoft’s Cortana.
But when I try to communicate with them, I still feel frustrated by the experience. The problems are rarely technological now; the recognition rate is pretty good. But there is something else. I had been thinking of it as “low-bandwidth communication,” but I just read an article from Wired that offers a much better explanation: these voice-controlled robots are designed to be chatty.
The problem has moved from one of technical implementation (“how do we recognize words”) to one of user experience (“how do we react to words we recognize”), and I do not appreciate the current state of the art at all. The article lays out why designers choose to do this: to make the audio assistants less intimidating to people new to the technology, they make them sound like a polite butler instead of an efficient computer. I understand the reason, but I’m eager for the industry to move past this introductory phase. Or at least to start offering a “power user” mode.
After all, when I perform a Google search, I don’t phrase the query the way I would to a person. I don’t type “Hey I’d like to read about neural networks, could you bring up the Wikipedia article, please?” No, I type in “wikipedia neural network.”
Voice interaction with a computer should be just as terse and efficient, but we’re not there yet. Even worse, we’re not there on purpose, as a deliberate user experience design choice, and that just makes me grind my teeth.
Today, if I want to control a light by voice, I have to say something like “Alexa, turn on the hallway lights.”
I look forward to the day when I can call out: