Microsoft unveils AI that can simulate your voice from just 3 seconds of audio

VALL-E language model can even imitate the original speaker’s emotional tone using artificial intelligence

Tuesday 10 January 2023 15:17 EST

Microsoft has unveiled an AI voice simulator capable of accurately imitating a person’s voice after listening to them speak for just three seconds.

The VALL-E language model was trained using 60,000 hours of English speech from 7,000 different speakers in order to synthesize “high-quality personalised speech” from any unseen speaker.

Once the artificial intelligence system has a person’s voice recording, it is able to make it sound like that person is saying anything. It is even able to imitate the original speaker’s emotional tone and acoustic environment.

“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot text to speech synthesis (TTS) system in terms of speech naturalness and speaker similarity,” a paper describing the system stated.

“In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

Potential applications include authors reading entire audiobooks from just a sample recording, videos with natural language voiceovers, and filling in speech for a film actor if the original recording was corrupted.

As with other deepfake technology that imitates a person’s visual likeness in videos, there is the potential for misuse.

The VALL-E software used to generate the fake speech is currently not available for public use, with Microsoft citing “potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker”.

Microsoft said it would also abide by its Responsible AI Principles as it continues to develop VALL-E, as well as consider possible ways to detect synthesized speech in order to mitigate such risks.

Microsoft trained VALL-E using voice recordings in the public domain, mostly from LibriVox audiobooks, while the speakers who were imitated took part in the experiments willingly.

“When the model is generalised to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech,” Microsoft researchers said in an ethics statement.
