Google can use AI to isolate voices in a crowd - with impressive results
- Google researchers have devised a deep learning model that can isolate individual voices in a video.
- The model aims to replicate the ability of humans to isolate certain sounds.
- The researchers hope that the tech will have a number of uses including improving hearing aids and automatic subtitles.
Google researchers have come up with a way of isolating the voice of a single speaker in a video from other voices and background noise. The method uses a deep learning model that can computationally produce videos in which the speech of specific people is enhanced.
The model uses both the audio and the visual signals of the speaker, such as the movement of the mouth, to replicate the human ability to focus on one sound while tuning out others. This ability is known as the cocktail party effect.
In a blog post, Google explains that in order to develop the method, the researchers gathered a collection of 100,000 high-quality videos and talks from YouTube. From these, they produced around 2,000 hours of video, each clip featuring a single person talking to the camera without any background interference.
Using this video, Google then created what it calls “synthetic cocktail parties” made up of face videos, their corresponding speech from separate video sources, and non-speech background noise. It then trained the model to be able to split these cocktail parties into separate audio for each speaker in the video.
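To make the idea of a "synthetic cocktail party" more concrete, here is a minimal Python sketch of how clean single-speaker audio might be combined with background noise to build a training example. It is an illustration only, not Google's pipeline: the blog post as summarized here does not say how the mixing is done, so simple sample-by-sample addition is an assumption. The function names (`sine_wave`, `make_synthetic_mixture`), the `noise_gain` parameter, and the use of pure tones as stand-ins for speech are all hypothetical, and the video side of the data is left out entirely.

```python
import math
import random


def sine_wave(freq_hz, n_samples, sample_rate=16000):
    """Hypothetical stand-in for a clean speech track: a pure tone."""
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n_samples)
    ]


def make_synthetic_mixture(clean_tracks, noise, noise_gain=0.3):
    """Add clean single-speaker tracks and scaled noise sample by sample.

    Assumption: plain additive mixing. Returns (mixture, targets), where
    the mixture is what a separation model would receive as audio input
    and the untouched clean tracks are the per-speaker outputs it would
    be trained to recover.
    """
    # Trim everything to the shortest signal so the samples line up.
    n = min([len(noise)] + [len(track) for track in clean_tracks])
    mixture = [
        sum(track[i] for track in clean_tracks) + noise_gain * noise[i]
        for i in range(n)
    ]
    targets = [track[:n] for track in clean_tracks]
    return mixture, targets


# Two "speakers" (tones at different pitches) plus random background noise.
rng = random.Random(0)
n_samples = 1600
speaker_a = sine_wave(220.0, n_samples)
speaker_b = sine_wave(330.0, n_samples)
noise = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

mixture, targets = make_synthetic_mixture([speaker_a, speaker_b], noise)
```

Because the clean tracks are known before they are mixed, every synthetic example comes with its own ground truth, which is what would let a model be trained to split the mixture back into one audio track per speaker.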
The post claims that users of the model simply have to select the face of the person in the video that they want to hear.
The results shown in the videos on the blog are pretty impressive.
A sports debate that is almost unintelligible due to the participants shouting over each other becomes crystal clear after the voices of each speaker are separated. In another video, the tech is able to isolate the sound of someone talking in the background of a video conference call.
As for potential uses, Google has focused on using it as a pre-processing step for automatic video captioning. In a video in the blog post, captions are clearly improved after the tech is used to isolate the voices of the people in the video.
However, it doesn’t take a wild leap of the imagination to think of other ways that this tech could be used. Adding cameras to smart speakers could seriously improve the way these speakers hear and understand instructions. Meanwhile, adding it to the video camera on your phone could improve the sound quality of your videos. Google also mentions that the tech could be put towards improving hearing aids.
Of course, it would also appear to make it incredibly easy for someone with this tech to indiscriminately spy on any individual within a large crowd.
Best not to think about that, though.