Though the world is captivated by Siri and her charming ways, old-skoolers know that she’s not the first virtual assistant to offer such a service. Siri may have brought some personality to the table, along with her ability to respond to natural language and understand context, albeit in a limited way, but the technology itself is nothing new. In fact, the dream of commanding devices using nothing but voice dates back to the 1950s. Here’s a brief look at how speech recognition technology has come along since the old days and what the future has in store for this fascinating field.
Speech recognition technology was in its infancy in the 1950s. The first system, named Audrey (Automatic Digit Recognizer), was developed by Bell Laboratories in 1952 and could only recognize numbers. The device, although accurate, forced the speaker to pause for 350 milliseconds between words and only understood the numbers 1 to 9.
It wasn’t until 10 years later that IBM showcased the Shoebox device, which had improved speech recognition abilities. Big Blue’s “speech recognizer”, as it was called back then, could understand a whopping 16 words in English – 10 digits and six arithmetical command words.
Speech recognition technology made some big leaps in the 1970s when the US Department of Defense decided to chip in and provide research funding. One of the results of these efforts was Harpy, a speech understanding system developed at Carnegie Mellon University that could understand about 1,000 words.
The IBM Shoebox in action
The 1970s was also the time when speech recognition technology got a boost in the form of the hidden Markov model, or HMM, a statistical method that helps machines better identify words by using complex mathematical pattern-matching algorithms. The HMM would go on to become the basis of most speech recognition software developed by AT&T, IBM, Philips, and Dragon Systems.
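To make the HMM idea a little more concrete, here is a minimal sketch in Python of Viterbi decoding, a standard way to find the most likely sequence of hidden states behind a series of observations. Everything in it is a made-up toy: the two hidden states ("vowel" and "consonant"), the two acoustic labels ("loud" and "quiet"), and all the probabilities are assumptions chosen for illustration, and it is not meant to represent how any of the products mentioned here actually worked.

```python
# Toy hidden Markov model. All names and numbers below are hypothetical.
STATES = ["vowel", "consonant"]
START_P = {"vowel": 0.5, "consonant": 0.5}
TRANS_P = {
    "vowel": {"vowel": 0.3, "consonant": 0.7},
    "consonant": {"vowel": 0.6, "consonant": 0.4},
}
EMIT_P = {
    "vowel": {"loud": 0.8, "quiet": 0.2},
    "consonant": {"loud": 0.3, "quiet": 0.7},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path and its probability."""
    # Each column maps a state to (best path probability, previous state).
    columns = [
        {s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}
    ]
    for obs in observations[1:]:
        prev = columns[-1]
        column = {}
        for s in states:
            # Pick the predecessor that gives the best path ending in s.
            column[s] = max(
                (prev[p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        columns.append(column)

    # Start from the best final state and follow the back-pointers.
    last = max(states, key=lambda s: columns[-1][s][0])
    path = [last]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, columns[-1][last][0]


if __name__ == "__main__":
    heard = ["quiet", "loud", "quiet"]
    print(viterbi(heard, STATES, START_P, TRANS_P, EMIT_P))
```

A real recognizer would work with far more states and with probabilities learned from recorded speech, but the underlying principle of picking the best-scoring hidden sequence is the same.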
In 1985, Kurzweil Applied Intelligence released the first speech-to-text program, which understood 1,000 words. The company followed it up two years later with an updated version, which saw the vocabulary grow to 20,000 words. The technology as a whole, however, was still hampered by its reliance on the discrete utterance system, which made it necessary to pause between words.
The 1990s finally saw multiple companies releasing speech recognition software for the masses. In 1994, Dragon released its discrete speech recognition software, Dragon Dictate, for a cool $9,000. It competed with IBM Personal Dictation System, Kurzweil Voice for Windows, Listen for Windows, and numerous other offerings. In 1997, Dragon introduced its first continuous speech dictation software, “Naturally Speaking”. The new version removed the need for users to pause between words.
With the speech recognition industry in a lull, taking the technology to mobile devices was the logical next step. The Google Voice Search app made its way to the iPhone in 2008. The app relies on Google’s cloud data centers to process voice requests, matching them against the huge pool of human-speech samples and search queries collected by the web giant. This method effectively dealt with the issues of data availability and the lack of processing power that had troubled speech recognition software. A personalized recognition feature was added to the Android app in December 2010, allowing it to produce better results by analyzing your voice and learning your unique speech patterns.
Along came Siri
In 2011, Siri came along and took the world by storm. This was partly due to Apple’s clever marketing, and partly due to “her” rather unique way of handling requests. Using the same kind of cloud-based processing as Google Voice Search, Siri is able to provide a contextual reply with a hint of artificial intelligence. This was more than enough to rejuvenate the speech recognition industry, while putting the pressure back on everyone else to create better iterations of speech recognition software.
Though Android isn’t short of options when it comes to virtual assistants, Google is currently developing a new intelligent voice assistant, which it will name simply the Assistant. Word has it that Google will push the app big time in Q4 2012. Google also plans to release an API for the app so that developers can integrate it into other apps.
Peter Mahoney, chief marketing officer of Nuance Communications, the company behind the Dragon speech-recognition application, talked to Ars Technica about what lies ahead for future voice assistants. According to Mahoney, aside from understanding requests in a conversational tone, we can look forward to the days when voice-activated assistants will remember every little request that we’ve ever made, thus enabling them to respond better to casual questions. “I think you’ll see systems that are more conversational, that have the ability to ask more sophisticated follow-up questions and adapt to the individual,” Mahoney said.
Mahoney concluded by saying that “the systems will learn from us about what kinds of things they need to cover, and they’ll get smarter over time.” Eventually, the voice assistants in our smartphones will develop long-term memory to cater to users better.
With an improvement in speech recognition algorithms and supporting hardware, the limitations that current voice assistants are facing will be a thing of the past. It is plausible that we will see speech recognition technology embedded in household items and most mobile devices in the near future.
What do you think? Will speech recognition fundamentally change the way we interact with technology? Let us know your opinion in the comments section below!
There’s no doubt in my mind that speech recognition will be as common as a speaker on our future devices. I just sent 3 text messages in a row to my wife using my voice, and I am planning to really learn how to use the technology because it is inherently productive: it saves time when communicating.
You left out Voice Command for Windows Mobile.
When will Siri and Assistant know if we are happy/sad or relaxed/hurried? Do tones and frequency convey moods? This can be very valuable on several fronts and improve overall responsiveness and value.
Voice-rec and voice-to-text translation are great solutions. But can they be applied to improving 2-way voice on digital networks? Many conversations (particularly mobile to mobile) are half-intelligible. If words are not immediately clear to a processor, can the same processor go about trying to guess at the correct word and insert it using speech synthesized from the talker’s voice patterns? Often I find myself guessing what the other person said, or asking them to repeat it. There’s some of this guessing approach in this video of Peter Norvig. http://www.youtube.com/watch?v=yvDCzhbjYWs
Check out a New Android app called SPLISTER:
Use voice recognition to create Lists and be reminded of important dates without having to type anything!