Google's new AI is able to pick out voices in a crowd

The tech could be used for hearing aids and video conferencing

Google's research team has developed technology that can isolate individual voices in a crowd, just as a human can.

It's based on the cocktail party effect, which refers to how the human brain mutes other sounds and voices when having a conversation with an individual in a busy place. The technology mimics this by separating a mixed audio signal into individual speech tracks.

Google's demonstration uses a video with lots of people talking at once. The user can select a particular face and hear only that person's speech. They can also select the context of the conversation, so that only speech relevant to it is played, even if multiple people are discussing the same subject.
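Under the hood, the paper describes the model as predicting a time-frequency mask for each visible speaker, which is applied to the spectrogram of the mixed audio to recover that speaker's voice. A minimal sketch of that masking step in Python follows; it is illustrative only, not Google's code, and assumes `mask` has come from a trained separation model:

    import numpy as np
    from scipy.signal import stft, istft

    def apply_speaker_mask(mixture, mask, fs=16000, nperseg=512):
        """Recover one speaker's audio by masking the mixed spectrogram.

        `mask` is assumed to be a (complex or real) ratio mask with the
        same shape as the mixture's STFT, produced by a separation model
        for the speaker whose face the user selected.
        """
        _, _, spec = stft(mixture, fs=fs, nperseg=nperseg)      # mixed spectrogram
        _, clean = istft(spec * mask, fs=fs, nperseg=nperseg)   # masked -> waveform
        return clean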

The company explained its technology could be used to improve how hearing aids work, or to boost video conferencing tools, enabling calls to take place in the middle of an open office, for example, rather than only in a soundproofed meeting room.

Inbar Mosseri and Oran Lang, software engineers working on the project at Google Research, explained that the audio goes hand in hand with visual cues: the system analyses mouth movements to match each sound with the right person.

"Intuitively, movements of a person's mouth, for example, should correlate with the sounds produced as that person is speaking, which in turn can help identify which parts of the audio correspond to that person," said researchers Inbar Mosseri and Oran Lang, writing in a blog post.

"The visual signal not only improves the speech separation quality significantly in cases of mixed speech (compared to speech separation using audio alone, as we demonstrate in our paper), but, importantly, it also associates the separated, clean speech tracks with the visible speakers in the video."

The data used to develop the technology was collected from 100,000 videos of lectures and talks on YouTube. Segments with no background sound and just a single person in view were extracted, then mixed together to generate videos of a cocktail party-type environment, with non-speech background noise added from AudioSet.
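A rough sketch of that data recipe might look like the following; the function and its signal-to-noise handling are illustrative assumptions, not Google's actual pipeline:

    import numpy as np

    def make_synthetic_mixture(clean_clips, noise, snr_db=0.0):
        # Hypothetical sketch: overlap clean single-speaker waveforms and
        # add non-speech noise (e.g. from AudioSet) at a target SNR.
        n = min(len(noise), *(len(c) for c in clean_clips))
        speech = sum(c[:n] for c in clean_clips)        # overlap the speakers
        noise = noise[:n]
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        # Scale noise so the speech-to-noise power ratio matches snr_db.
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        mixture = speech + scale * noise
        # The original clean clips serve as training targets.
        return mixture, [c[:n] for c in clean_clips]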

Google researchers then used this content to train a multi-stream convolutional neural network-based model to separate out the individual speech tracks.
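In outline, such a model consumes two kinds of input streams, a spectrogram of the mixed audio and per-frame embeddings of a visible face, fuses them over time, and emits a separation mask for that speaker. The sketch below shows this structure in PyTorch; the layer types and sizes are placeholders rather than the architecture from Google's paper:

    import torch
    import torch.nn as nn

    class MultiStreamSeparator(nn.Module):
        def __init__(self, n_freq=257, face_dim=512, hidden=256):
            super().__init__()
            # Audio stream: convolutions over the mixed spectrogram.
            self.audio_stream = nn.Sequential(
                nn.Conv2d(2, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=5, padding=4, dilation=2), nn.ReLU(),
            )
            # Visual stream: convolutions over per-frame face embeddings.
            self.visual_stream = nn.Sequential(
                nn.Conv1d(face_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            )
            # Fusion of the two streams over time.
            self.fusion = nn.LSTM(32 * n_freq + hidden, hidden,
                                  batch_first=True, bidirectional=True)
            # Real + imaginary mask bins for the selected speaker.
            self.mask_head = nn.Linear(2 * hidden, 2 * n_freq)

        def forward(self, spec, face_feats):
            # spec: (B, 2, T, F) real/imag mixture spectrogram
            # face_feats: (B, face_dim, T) embeddings of the selected face
            a = self.audio_stream(spec)                   # (B, 32, T, F)
            a = a.permute(0, 2, 1, 3).flatten(2)          # (B, T, 32*F)
            v = self.visual_stream(face_feats)            # (B, hidden, T)
            v = v.permute(0, 2, 1)                        # (B, T, hidden)
            fused, _ = self.fusion(torch.cat([a, v], dim=-1))
            mask = self.mask_head(fused)                  # (B, T, 2*F)
            return mask.view(mask.size(0), mask.size(1), 2, -1)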

Google Research's findings are outlined in an in-depth paper called Looking to Listen at the Cocktail Party, and the company said it is exploring ways to apply the technology's principles to its products in the future.
