Adobe's New 'Photoshop For Voice' App Lets You Put Words in People's Mouths : ScienceAlert

Software giant Adobe has teased a powerful new audio editing app that could forever change the way we view the authenticity of recorded speech.

Dubbed Project VoCo, the prototype might best be described as 'Photoshop for voice', enabling anyone to freely edit the spoken content in audio recordings – in much the same way as programs like Photoshop allow you to edit visual data.

Previewing the app at the Adobe Max 2016 software expo last week, researcher Zeyu Jin from Princeton University showed just how easy it will be in the near future to manipulate and transform sound files – and in extreme cases effectively put words that were never actually said into people's mouths.

While audio-editing apps have long enabled people to manually cut, copy, and splice together parts of sound waves, VoCo (voice conversion) operates on a new principle, using an algorithm that breaks down and recompiles human speech.

Adobe hasn't explained how this technology works just yet, but the software seems to identify and log phonemes – the individual speech sounds we put together to make up words and sentences.

With the right amount of sound data on file – which Adobe says is about 20 minutes of one person talking – VoCo will have actually recorded enough of these phonemes to basically impersonate that person, by stitching them together into new word and sentence formations.

In the video below, you can see how VoCo works. Using a snippet of audio recorded from comedian Keegan-Michael Key, Jin first starts to rearrange the words.

In the clip, Key says, "I kissed my dogs and my wife." In the program, a visual representation of the sound wave appears in one window, while another window displays the spoken words in text.

By simply copying and pasting in the text window – with no other editing techniques needed at all – Jin first changes the recording to, "I kissed my wife, and my wife," then manually types "dogs" back in to the end of the sentence: "I kissed my wife, and my dogs."

So far, this might not be anything extraordinary, since all those words appeared in the original recording. But then Jin types in a new word that wasn't part of the audio, inserting a name to give the sentence a wholly different significance: "I kissed Jordan and my dogs."

To take it further, Jin then edits the audio to make it say "I kissed Jordan three times."

It's worth pointing out that the recording when played back does sound a little glitchy, with the pacing of the speech being a little off, but bear in mind this is only a prototype version.

As Sebastian Anthony at Ars Technica points out, Adobe often previews work-in-progress software at its Max event a year or two before it becomes commercialised – and no doubt, as the technology improves, this mimicry of a real voice's speech could get a lot better.

But unlike Photoshop and its many clones, which enjoy broad appeal – since pretty much everybody likes photos – who would need this kind of audio-editing trickery?

Adobe is pitching VoCo at media, podcasters, filmmakers, and audio industry professionals, arguing that the ability to nip and tuck speech recordings will make their working lives easier.

"When recording voiceovers, dialogue, and narration, people would often like to change or insert a word or a few words due to either a mistake they made or simply because they would like to change part of the narrative," the company says in a press release.

"[With VoCo] you can simply type in the word or words that you would like to change or insert into the voiceover. The algorithm does the rest and makes it sound like the original speaker said those words."

But even though the software is undoubtedly impressive, not everybody is thrilled by the new ease and sophistication of this digital audio forgery.

After all, these kinds of edits could be used to impersonate basically anybody, which could lead to all kinds of problems – just as rampant Photoshopping makes it harder to trust the digitised images we see on the internet every day.

"It seems that Adobe's programmers were swept along with the excitement of creating something as innovative as a voice manipulator, and ignored the ethical dilemmas brought up by its potential misuse," media and technology researcher Eddy Borges Rey from the University of Stirling in the UK told the BBC.

"Inadvertently, in its quest to create software to manipulate digital media, Adobe has [already] drastically changed the way we engage with evidential material such as photographs."

Adobe says it is aware of the potential for misuse with Project VoCo, and is already working on technologies that will make it possible to detect if a recording has been tampered with – such as embedding hidden audio watermarks, which could potentially trigger voice security features used in systems like digital banking.

But while machines might be able to detect the mimics, that doesn't mean we will be too – so in the future, we might need to get used to not trusting our ears so much when we hear recordings of politicians, public figures, or even loved ones.

And until VoCo gets released – Adobe hasn't confirmed a timeframe as yet – we also won't know whether humans are the only things it can fool.

"Biometric companies say their products would not be tricked by this, because the things they are looking for are not the same things that humans look for when identifying people," researcher Steven Murdoch from University College London told the BBC.

"But the only way to find out is to test them, and it will be some time before we know the answer."