Google can use AI to isolate voices in a crowd - with impressive results
August 12, 2025
Google researchers have come up with a way of isolating the voice of a single speaker in a video from other voices and background noise. The method uses a deep learning model that can computationally produce videos in which the speech of specific people is enhanced.
It uses both the audio and visual signals of the speaker, such as the movement of the mouth, to replicate the ability of humans to effectively focus on one sound. This phenomenon is also known as the cocktail party effect.

In a blog post, Google explains that in order to develop the method, the researchers gathered a collection of 100,000 high-quality videos and talks from YouTube. They then produced around 2,000 hours of video featuring single people talking to the camera without any background interference.
Using this video, Google then created what it calls “synthetic cocktail parties” made up of face videos, their corresponding speech from separate video sources, and non-speech background noise. It then trained the model to be able to split these cocktail parties into separate audio for each speaker in the video.
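To make the idea of a "synthetic cocktail party" more concrete, here is a minimal Python sketch of the audio half of that recipe: several clean tracks are summed with some background noise, and the clean tracks are kept as the targets a separation model would be trained to recover. This is an illustration rather than Google's actual pipeline. The function names, the sine-wave stand-ins for speech, and the `noise_gain` value are all assumptions made for the example, and the face video that Google's model also relies on is left out entirely.

```python
import math
import random


def tone(freq_hz, n_samples, sample_rate=16000):
    """Hypothetical stand-in for a clean speech track: a plain sine wave."""
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n_samples)
    ]


def make_synthetic_mixture(clean_tracks, noise, noise_gain=0.3):
    """Mix clean tracks with scaled noise.

    Returns (mixture, targets). The targets are the clean tracks trimmed
    to a common length; they are what a separation model would be asked
    to recover from the mixture. noise_gain is an assumed value.
    """
    # Trim everything to the shortest input so samples line up.
    n = min([len(noise)] + [len(track) for track in clean_tracks])
    targets = [track[:n] for track in clean_tracks]
    mixture = [
        sum(track[i] for track in targets) + noise_gain * noise[i]
        for i in range(n)
    ]
    return mixture, targets


# Two "speakers" and some random noise, all synthetic placeholders.
rng = random.Random(0)
speaker_a = tone(220.0, 1600)
speaker_b = tone(330.0, 1200)
noise = [rng.uniform(-1.0, 1.0) for _ in range(2000)]

mixture, targets = make_synthetic_mixture([speaker_a, speaker_b], noise)
```

Because the clean tracks are known in advance, every mixture comes with its own answer key, which is what makes this kind of synthetic data convenient for training.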
The post claims that users of the model simply have to select the face of the person in the video that they want to hear.
The results provided through videos on the blog are pretty impressive.
A sports debate that is almost unintelligible due to the participants shouting over each other becomes crystal clear after the voices of each speaker are separated. In another video, the tech is able to isolate the sound of someone talking in the background of a videoconference call.
As for potential uses, Google has focused on using it as a pre-processing step for automatic video captioning. In a video in the blog post, captions are clearly improved after the tech is used to isolate the voices of the people in the video.
However, it doesn’t take a wild leap of the imagination to think of other ways that this tech could be used. Adding cameras to smart speakers could seriously improve the way these speakers hear and understand instructions. Meanwhile, adding it to the video camera on your phone could improve the sound quality of your videos. Google also mentions that the tech could be put towards improving hearing aids.
Of course, it would also appear to make it incredibly easy for someone with this tech to indiscriminately spy on any individual within a large crowd.
Best not to think about that, though.