Unless you’ve been living under a rock, you’re probably familiar with Google Assistant at this point. Google has made a massive push into artificial intelligence and machine learning. It even states at its events that it has moved from a mobile-first strategy to an AI-first strategy. That means it wants to train computers to deliver relevant and helpful information to you before you even know you need it.
Traditional text-to-speech systems essentially glue recorded bits of speech together, which makes it hard to account for emotion or inflection. To get around that, most voice models are trained with samples that have as little variance as possible. That lack of variance in the speech pattern is why the result can sound a bit robotic, and it is the problem Google and the DeepMind team are trying to solve with WaveNet.
WaveNet is a completely different approach. Instead of recording hours of words, phrases, and fragments and then linking them together, the technology uses real speech to train a neural network. WaveNet learned the underlying structure of speech, such as which tones followed others and which waveforms were realistic and which weren’t. Using what it learned, the network was then able to synthesize a voice waveform one sample at a time, with each new sample taking into account the samples that came before it. By being aware of the waveform so far, WaveNet was able to create speech patterns that sound more natural.
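To make the "one sample at a time" idea concrete, here is a minimal Python sketch of that kind of generation loop. It is not WaveNet: the function `toy_next_sample` is a hypothetical stand-in for the trained neural network, and its coefficients, the starting values, and the 16,000-sample length are all assumptions picked for illustration. The only part that mirrors the description above is the loop, where every new sample is computed from the samples already generated.

```python
import random


def toy_next_sample(history, rng):
    """Hypothetical stand-in for the trained network.

    Predicts the next sample from the two samples before it. The fixed
    coefficients describe a decaying oscillation; a real model would
    learn its behavior from recorded speech instead.
    """
    prev1 = history[-1]
    prev2 = history[-2]
    predicted = 1.9 * prev1 - 0.95 * prev2
    # A little randomness, so the output is not a perfectly clean tone.
    return predicted + rng.gauss(0.0, 0.001)


def generate(num_samples, seed=0):
    """Build a waveform one sample at a time, each from earlier samples."""
    rng = random.Random(seed)
    samples = [0.0, 0.1]  # arbitrary starting context (an assumption)
    while len(samples) < num_samples:
        samples.append(toy_next_sample(samples, rng))
    return samples


audio = generate(16000)  # 16,000 samples, an illustrative length
print(len(audio), "samples generated")
```

Because each sample depends on earlier output, the loop cannot simply compute all samples at once, which hints at why generating audio this way can be slow.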
With this new system, WaveNet can add in subtle sounds to make the voice even more believable. While the sound of your lips smacking together or the sides of your mouth opening might be almost imperceptible, you still do hear those things. Small details like this add to the authenticity of the new waveforms.
The system has come a long way in a short time. When it was introduced just 12 months ago, it took one second to generate 0.02 seconds of speech. Since then, the team has made the process 1,000 times faster: it can now generate 20 seconds of higher-quality audio in just one second of processing time. The waveform resolution for each sample has also been bumped from 8 bits to 16 bits, the resolution used in CDs (remember those?).
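For anyone who wants to check the math, the short Python sketch below works through those figures. The function and variable names are our own, chosen for illustration; the only inputs are the numbers quoted above (0.02 seconds of speech per second of processing, a 1,000x speedup, and 8 versus 16 bits per sample).

```python
def audio_seconds_per_compute_second(old_rate, speedup):
    """Seconds of audio produced per second of processing after a speedup."""
    return old_rate * speedup


def quantization_levels(bits):
    """Number of distinct values a single sample can take at this bit depth."""
    return 2 ** bits


old_rate = 0.02   # seconds of speech per second of processing, at launch
speedup = 1000    # improvement reported over the following 12 months

new_rate = audio_seconds_per_compute_second(old_rate, speedup)
print(f"New rate: about {new_rate:.0f} seconds of audio per second of processing")

levels_8 = quantization_levels(8)
levels_16 = quantization_levels(16)
print(f"8-bit samples:  {levels_8} possible values")
print(f"16-bit samples: {levels_16} possible values")
print(f"That is {levels_16 // levels_8} times as many values per sample")
```

The takeaway is that the original system ran well below real time, while the new one runs well above it, and each sample can now take one of 65,536 values instead of 256.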
To hear the differences, we suggest you head over to Google’s blog on this topic (linked below). The new technology is rolling out for U.S. English and Japanese voices, and Google has provided comparisons for each.
Have you noticed a change in Google Assistant recently? Does a more natural-sounding voice make you more likely to use it? Let us know down in the comments.