If Google is right, then the way we will engage our technology in the future will be conversational. Typing and pecking around for buttons will give way to fluid conversations that we’ll have with our devices on a daily basis. But there’s a serious problem with the way the technology is currently being developed.
Apparently, most of the data used to train speech recognition systems is perilously old and fiendishly narrow. Sample-collection projects have been underway since the ’80s, and the bulk of this data comes from white college students.
One prolific sample collection initiative, for example, was called Call Home. It was a service that offered free long-distance calling to college students in the early nineties. These calls were recorded, transcribed, and tagged, then sold to scientists and researchers.
“Historically, speech recognition systems have been trained from data collected mostly in universities, and mostly from the student population,” says Gavalda, head of machine intelligence at Yik Yak and a speech recognition expert. “The [diversity of voices] reflects the student population of 30 years ago.”
Naturally, this creates a problem. Global speech is much more varied than that of your average pog-playing, Reebok-pumping, fanny-pack-wearing baby of the ’80s. Regional accents make casual vocal interaction with technology problematic, and there’s concern in the industry about a growing “speech divide” that limits how these speakers can use their devices.
Google naturally collects tons of data on the regular from people using its speech recognition software all over the world, but to be truly effective, this data needs to be accurately tagged, annotated, and transcribed. To that end, it appears that Google has enlisted a company called Appen to help.
Appen has been posting calls for voice samples in a variety of telling subreddits. The first call was spotted in /r/Edinburgh, which seems like a natural way to gather lots of data to tackle the tricky Scottish accent.
Calls are also appearing in subreddits like /r/slavelabour, /r/beermoney, and /r/workonline, which focus on doing small tasks for payment. The company is offering $35 for 2,000 recorded phrases, each of which takes between 3 and 5 seconds to enunciate. By our math, that’s somewhere in the ballpark of $15 per hour, which isn’t too shabby. If you’re under 17, the deal is actually sweeter: $26 for 500 phrases.
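The back-of-the-envelope math above is easy to verify. A minimal sketch (the pay figures and the 3–5 second phrase duration come from Appen’s postings; the 4-second midpoint is our assumption):

```python
def hourly_rate(pay_dollars, num_phrases, seconds_per_phrase):
    """Effective hourly rate for recording a batch of phrases."""
    hours = num_phrases * seconds_per_phrase / 3600  # total recording time
    return pay_dollars / hours

# Standard offer: $35 for 2,000 phrases at ~4 seconds each
print(round(hourly_rate(35, 2000, 4), 2))   # → 15.75

# Under-17 offer: $26 for 500 phrases at ~4 seconds each
print(round(hourly_rate(26, 500, 4), 2))    # → 46.8
```

At the 4-second midpoint the standard offer works out to roughly $15.75 per hour (ranging from about $12.60 at 5 seconds to $21.00 at 3 seconds), while the under-17 offer pays nearly three times as much per recorded second.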
The Verge reached out to redditors who had taken Appen and Google up on their offer and found that most of them described having difficulty interacting with voice technology like Google Now, Alexa, and Siri because of their accent. Google and Appen seem especially interested in thick regional accents from rural UK and American flyover states. Speakers of English as a second language from India and China are also being recruited.
Hopefully, this research will make voice technology easier to engage with for users all over the world, closing the aforementioned “speech divide.”
What are your thoughts regarding this sample collecting? Has your accent made ‘OK Google’ a hassle in the past? Let us know in the comments below!