You pick up the phone and hear a voice. It sounds exactly like someone you know and love, perhaps your child or grandchild. They’re asking for money to help get them out of a dicey situation. You have no reason to suspect it’s not them—after all, it’s their voice.
But it’s not them. It’s someone using a digital tool that mimics their voice. Built from that person’s vocal data, sometimes only small bits of recordings, the tool produces a voice that sounds remarkably like that of someone you know.
We fail to detect these “morphed” voices because our brains simply cannot keep up with increasingly sophisticated technology, according to recent research presented by the Internet Society. Evolution has not caught up with the internet revolution.
“The main takeaway is that human brains may not be able to distinguish a speaker’s voice from its morphed version, which means that people would be susceptible to voice impersonation attacks at a fundamental biological level,” Nitesh Saxena, lead researcher on the study and the director of the University of Alabama at Birmingham’s Security and Privacy in Emerging Computing and Networking Systems (SPIES) Lab, told The Daily Beast.
The researchers used a specialized brain imaging technique that relies on near-infrared light to map brain activity. The technique, called functional near-infrared spectroscopy, or fNIRS, is similar to an MRI in some ways, but fNIRS allows researchers to study subjects in an environment much closer to real life (participants have to lie down for an MRI, which isn’t how most people go about their days).
Using fNIRS, the researchers examined which parts of the brain were activated, and to what extent, when participants listened to real voices versus voices synthesized with a common voice synthesis tool.
Researchers first primed participants to be on the lookout for real and fake entities by exposing them to real and fake websites and forged and legitimate paintings.
The study then unfolded in two parts. First, 20 participants—10 males and 10 females from the University of Alabama at Birmingham—were informed they would hear a real and a morphed voice.
But the fNIRS imaging showed that participants’ brains didn’t register a difference between the real and synthetic voices, despite the priming with the websites and paintings; even when participants were on the lookout, their brains heard the voices the same way.
That could be because of a region within the auditory cortex of our brain, the superior temporal gyrus.
Previous studies have shown that impairment of this area of the brain drastically reduces our ability to recognize voices. In this study, the area was especially confused because of the decision-making it had to do to figure out whether a voice was real or altered.
Saxena believes that the brain builds a model of a speaker’s voice, stores it in memory, and later recalls that model to identify the voice.
It’s a struggle many people are beginning to face, as companies have been developing this kind of tech for years.
In the summer of 2018, Google previewed one of its new products, Google Duplex, which enabled Google Assistant to make AI-governed phone calls for users, such as setting an appointment at a salon or reserving a table at a restaurant. Humans on the other end had little to no idea they were talking to a machine, and Google Assistant reproduced nuanced parts of human speech, including placeholder words like “um.”
It wasn’t the first time technology like this had been demoed live. Adobe had previously been working to develop Adobe VoCo, tech that let users edit voices after a certain amount of vocal data had been fed into it. Using a text interface, a user could type in a sentence and Adobe VoCo would synthesize it in a person’s voice, making it seem as if they had said something they’d never said at all.
And as we allow greater numbers of smart speakers such as Amazon Echo and Google Home into our homes, they create repositories of our voice data that could be hacked and used to create synthesized versions of our voices.
This study builds on previous research by Saxena and others, which showed that automated systems rejected fake voices only 10 to 20 percent of the time on average. And while humans do better than automated systems, they identify a morphed voice only about 50 percent of the time on average.
One limitation of the study, according to Saxena, is that in real-world voice impersonation attacks, the targets are not explicitly told to distinguish real from fake voices, or that there is a fake voice at all.
But that, he points out, could also make the results of the study more troubling, because even when people are told to watch out for a fake voice, they are still bad at identifying it correctly.
Is there any hope that our brains could get better at determining which voices are altered as such voices become more common?
That’s unlikely to happen on its own, according to Saxena, but there is some hope.
“It may be possible to train the users to explicitly detect morphed voices based on some of the inherent features of these morphed voices,” Saxena said. “Although machine-based voice biometric systems are also susceptible to morphing attacks, it might also be possible to improve such systems to detect morphing attacks which may aid the human users.”