This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.
Amidst the many announcements we saw at last week’s Google I/O and Microsoft Build conferences, one specific demo stole the show: Duplex, an artificial intelligence feature for Google Assistant that can engage in phone conversations on your behalf while imitating your voice with convincing accuracy.
The presentation showed prerecorded conversations in which Duplex called a hair salon and a restaurant and talked to human receptionists, who did not notice they were speaking to an AI agent. Though Duplex has yet to show its worth in live action (if it ever goes live), the demo was pretty impressive, showing the agent continue the conversation through unexpected twists and even throw in some well-timed “ums” and “mm-hmms” from time to time.
The Duplex demo drew enthusiasm among the audience, but not the media. Most of the coverage of the event focused on the ethical and security implications of AI algorithms becoming very good at imitating humans. Many experts pointed out that technologies such as Google Duplex must reveal their identities to their interlocutors, and others warned against the use of automated voice conversations for fraud and phishing scams.
However, given the hype and quasi-wave of panic that the media has caused, it’s good to know the extent and limits of Duplex and other AI imitation technologies.
How does Duplex’s AI conversation work?
The reason Duplex stirred such shock was the clever combination of existing deep learning techniques rather than the introduction of new AI technology.
Duplex’s magic relies on understanding what the human interlocutor is saying and generating the right responses. First, Duplex must turn the human’s voice into text to be able to analyze it. This is something that the search giant is especially good at, thanks to the billions of hours’ worth of transcribed YouTube videos it owns and the hundreds of millions of users who have been interacting with its voice search and digital assistant. Google’s deep learning algorithms are effective enough to pick up and decompose human voice under different circumstances and accents.
Next, Google has to understand and respond to what the human interlocutor says. This is the domain of natural language processing and generation (NLP/NLG). NLP is the branch of artificial intelligence that analyzes the meaning and context of human-written text. NLG, the flip side of the coin, generates human-understandable text from data. Again, this is an area where Google is very good, thanks to its virtually unlimited store of data and its years-long experience in analyzing and translating different texts.
Finally, Duplex has to deliver the response in the user’s voice. Voice synthesis has been around for a while, but earlier technologies relied on recording a large number of samples from a person’s voice and stitching them together to create sentences. The results always sounded artificial because they couldn’t create the little details that make a voice human.
Deep learning has helped propel voice synthesis forward through a different approach. Deep neural networks, the software architecture at the heart of deep learning applications, analyze a subject’s voice samples and create a digital model of their voice. The algorithms can then generate new voice samples instead of stitching previous ones together. With enough training, deep learning algorithms can generate very natural-sounding speech and add the intonations and nuances that make for a natural conversation, without the need for a lot of prerecorded samples. Google is not the only company to create AI-synthesized voice, though the prerecorded demo it showed at I/O sounded better than most previous samples.
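To make the three stages concrete, here is a toy Python sketch of the pipeline described above: speech to text, language understanding and generation, then voice synthesis. It is not Google’s implementation. Every function name is hypothetical, and each stage is a trivial stand-in (string cleanup, keyword matching, byte encoding) for what would really be a trained deep learning model.

```python
import re
from dataclasses import dataclass
from typing import Optional


def speech_to_text(audio: str) -> str:
    """Stage 1 (stub): we pretend the 'audio' is already a transcript."""
    return audio.strip().lower()


@dataclass
class Intent:
    name: str
    time: Optional[str] = None


def understand(text: str) -> Intent:
    """Stage 2a (toy NLP): keyword and regex matching, not a learned model."""
    offer = re.search(r"how about\s+(\d{1,2}(?::\d{2})?\s?(?:am|pm))", text)
    if offer:
        return Intent("offer_time", offer.group(1))
    if "what time" in text:
        return Intent("ask_time")
    return Intent("unknown")


def generate_reply(intent: Intent, preferred_time: str) -> str:
    """Stage 2b (toy NLG): fill a canned template for each known intent."""
    if intent.name == "ask_time":
        return f"Um, do you have anything at {preferred_time}?"
    if intent.name == "offer_time":
        return f"Mm-hmm, {intent.time} works. Thank you!"
    return "Sorry, could you say that again?"


def text_to_speech(text: str) -> bytes:
    """Stage 3 (stub): a real system would synthesize a waveform here."""
    return text.encode("utf-8")


def respond(audio: str, preferred_time: str = "12 pm") -> bytes:
    """Run one conversational turn through all three stages."""
    text = speech_to_text(audio)
    intent = understand(text)
    reply = generate_reply(intent, preferred_time)
    return text_to_speech(reply)
```

Calling `respond("How about 1:15 pm?")` walks a single turn through all three stages. Anything outside the two hard-coded intents falls through to the "say that again" reply, a crude illustration of how tightly such an agent is tied to the task it was built for.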
What are the limits of AI conversations?
Though impressive, Duplex suffers from the same limitations that other deep learning applications do, and its capabilities will probably be much more limited than the impression its demo has created. While AI agents are becoming better at carrying conversations in narrow domains such as customer service, banking and general healthcare, they’re still terrible at carrying out in-depth and broad conversations. The latter is something that only human-level AI can accomplish.
Even the most complex AI chatbots need guidance from a human when they become engaged in meaningful conversations. Facebook tried creating a general-purpose AI assistant that would be backed by human operators. The purpose of the assistant was pretty much like that of Duplex: reaching out on your behalf and making reservations, appointments, purchases, etc. Facebook shut down the project two and a half years after the beta started, without ever launching it officially.
What this means is that fears that a technology such as Google Duplex can become a tool for carrying out complex fraud and social engineering attacks are probably misplaced. Phishing and spear phishing require meticulous planning and interactions with the target, not something that a technology such as Duplex can accomplish yet.
However, this doesn’t mean that Duplex can’t cause harm. In an interview with The Washington Post, Madeline Lamo, a University of Washington graduate student, described a scenario where an automated agent makes dozens of reservations at a restaurant to prevent others from doing so (though I don’t know how it would manage to do that with the same voice, while calling a receptionist who can’t handle calls in parallel).
Another threat might be the bot itself giving away information about its owner. For instance, a malicious actor might try to scam a Duplex agent into giving away information about its owner, such as where they live, what time they go to work, etc. The Washington Post’s Molly Roberts describes the threat in this piece. This sounds more realistic than the previous threat, especially since AI agents such as Duplex are trained to engage in simple, straightforward conversations aimed at exchanging information.
All this said, I believe the ethical requirement that companies should be transparent to users when they’re interacting with an AI agent is necessary, for the sake of avoiding confusion and frustration. I can say this from my own experience of interacting with AI-powered chatbots. As soon as I learn that I’m chatting with an AI agent, I change my tone and conversation style to make sure I don’t confuse the chatbot (so that it won’t frustrate me in return).
Drawing a conclusion from all of this, I’d say that Duplex mostly owed its hype to the fact that it sounded human. But the real utility of Duplex is its language processing capabilities, which don’t necessarily require a perfect human voice, especially if we want to be clear to humans that they’re speaking to a robot. So, for instance, I could have Siri call a restaurant and book a table on my behalf instead of impersonating me. Will the receptionists answering the calls loathe speaking to robots? Why should they? Robots have been answering our calls for decades. Why shouldn’t they be answering theirs?