Unless you’ve been living under a rock, you’re probably familiar with Google Assistant at this point. Google has made a massive push into artificial intelligence and machine learning. It even states at its events that it has moved from a mobile-first strategy to an AI-first strategy. That means it wants to train computers to deliver relevant and helpful information to you before you even know you need it.
You may have noticed a difference in Google Assistant over the last few days. That’s because Google has started using a technology called WaveNet from the DeepMind team. The goal of WaveNet is to move Assistant from robotic-sounding synthesized speech to a more natural speech pattern. Synthesized speech like you’d get from Google Assistant or Apple’s Siri is normally stitched together using small bits of recorded speech. This is called “concatenative text-to-speech” and it’s why some answers can sound a bit off when they’re read back to you.
Since bits of speech are essentially glued together, it’s hard to account for emotion or inflection. To get around that, most voice models are trained with samples that have as little variance as possible. That lack of variance in the speech pattern is why it can sound a bit robotic, and it is the problem Google and the DeepMind team are trying to solve with WaveNet.
WaveNet takes a completely different approach. Instead of recording hours of words, phrases, and fragments and then linking them together, the technology uses real speech to train a neural network. WaveNet learned the underlying structure of speech, such as which tones follow others and which waveforms are realistic and which aren’t. Using that knowledge, the network was then able to synthesize audio one sample at a time, with each new sample taking into account the samples before it. By being aware of the waveform that came before, WaveNet was able to create speech patterns that sound more natural.
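For the curious, the sample-by-sample idea can be sketched in a few lines of Python. This is only a toy illustration, not DeepMind’s implementation: the function `toy_next_sample_distribution` is a hypothetical stand-in for the trained neural network, and the choice of 256 amplitude levels and a simple "stay close to the previous sample" rule are assumptions made to keep the example small.

```python
import math
import random


def toy_next_sample_distribution(history, levels=256):
    """Hypothetical stand-in for the trained network.

    Returns one probability per amplitude level, given the samples
    generated so far. A real model would learn this from recorded speech;
    here we simply favor levels close to the previous sample.
    """
    prev = history[-1] if history else levels // 2
    spread = 4.0  # assumed smoothness parameter, purely illustrative
    weights = [
        math.exp(-((level - prev) ** 2) / (2 * spread ** 2))
        for level in range(levels)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, levels=256, seed=0):
    """Build a waveform one sample at a time.

    Each new sample is drawn from a distribution that depends on the
    samples generated before it, then appended to the history.
    """
    rng = random.Random(seed)
    waveform = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(waveform, levels)
        next_sample = rng.choices(range(levels), weights=probs, k=1)[0]
        waveform.append(next_sample)
    return waveform


samples = generate(100)
print(samples[:10])
```

The important part is the loop in `generate`: nothing is glued together from pre-recorded pieces, and every new value is chosen in light of the waveform so far, which is the property the paragraph above describes.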
The advantages of this new system are subtle, but you can definitely hear them. When speaking to another human, you’ll pick up on when they’re coming to the end of a thought because their voice starts to go down at the end of a sentence. If you ever sit and watch the news for a few minutes, you can always tell when a story is about to end because the anchor will start to slow down and the volume or tone of their voice lowers. Missing subtleties like that are part of the reason concatenative text-to-speech sounds less natural. That’s a huge part of where the new WaveNet technology improves on the current system.
With this new system, WaveNet can add in subtle sounds to make the voice even more believable. While the sound of your lips smacking together or the sides of your mouth opening might be almost imperceptible, you still do hear those things. Small details like this add to the authenticity of the new waveforms.
The system has come a long way in a short time. When it was introduced just 12 months ago, it took one second to generate 0.02 seconds of speech. In those 12 months, the team was able to make the process 1,000 times faster, so it can now generate 20 seconds of higher-quality audio in just one second of processing time. The team has also increased the quality of the audio: the waveform resolution for each sample has been bumped from 8 bits to 16 bits, the resolution used in CDs (remember those?).
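If you want to check the math on those figures, here is a quick back-of-the-envelope sketch in Python. The function and variable names are our own, chosen for illustration; the only inputs are the numbers quoted above (0.02 seconds of audio per second of processing, a 1,000x speedup, and 8-bit versus 16-bit samples).

```python
def audio_seconds_per_compute_second(baseline, speedup):
    """Audio produced per second of processing after a given speedup."""
    return baseline * speedup


def amplitude_levels(bits):
    """Number of distinct values a sample can take at a given bit depth."""
    return 2 ** bits


original_rate = 0.02  # seconds of speech per second of processing, a year ago
current_rate = audio_seconds_per_compute_second(original_rate, 1000)

levels_8_bit = amplitude_levels(8)
levels_16_bit = amplitude_levels(16)

print(f"Audio per second of processing now: {current_rate:.0f} seconds")
print(f"8-bit samples: {levels_8_bit} possible values")
print(f"16-bit samples: {levels_16_bit} possible values")
print(f"Increase in resolution: {levels_16_bit // levels_8_bit}x")
```

In other words, the jump from 8 to 16 bits means each sample can take one of 65,536 values instead of 256, which gives the waveform far finer gradations to work with.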
To hear the differences, we suggest you head over to Google’s blog on this topic (linked below). The new technology is rolling out for U.S. English and Japanese voices and Google has provided comparisons for each.
Have you noticed a change in Google Assistant recently? Does a more natural-sounding voice make you more likely to use it? Let us know down in the comments.