Voice AI, Telecom, Scams, and Co-evolution

What a time to be alive. I was inspired to write this after seeing a friend’s work get some new attention (more on that below). And by the way, this is my first post on this new second-order thinking project, so I’m diving in even before I’ve had the chance to write a piece on why I started the project itself.

Let’s discuss new opportunities with interactive voice AI.

If you haven’t seen it, here’s the Google Duplex demo from two days ago that has everyone fired up. Google Duplex is a voice AI that, at this point, can make a transactional phone call for you, saving you time by talking to the human on the other end.

Actually, the team’s writeup about Duplex, complete with recorded examples, is even more interesting. Worth a read.

But the subtle question is, why is everyone so fired up about this? After all, the technology to make a phone reservation like the one demoed is not new. My friend Jeff Smith co-founded a startup that did the same thing a year before the Duplex demo. You can watch his demo here:

Jeff’s co-founder Wesley has a great writeup here. He also mentions Duplex’s use of “disfluencies,” noting that adding disfluencies (the “uh”s and “um”s that make us cringe when we hear recordings of our own voices) is not a big technological improvement, but it sure does make the demo better.

When it comes to transactions that must take place by phone, Duplex and John Done are good for humans. In the demos, the person trying to acquire the service (book the appointment or buy the flowers) saves time and can accomplish things that they never would in person. The business potentially gets more customers or saves time on bookings, since a benevolent voice AI is only going to ask about things related to actual wants.

Second-Order Effects

(Remember, this new blog focuses on second-order thinking….)

There are several second-order effects of voice AI that is this good.

The most direct change is in not having to wait on hold, look up and call multiple numbers, call back if there is no answer, or deal with anything else that can go wrong with a phone call. Related to this, international call volume (both traditional PSTN and VoIP) remained flat from 2013 to 2015 (the last year for which data is available). Even though more people are still getting access to phones, voice is a smaller part of what we do with them. Many of the appointments and quick check-ins that had to take place by voice call in the past have since switched to other forms of communication, often by removing the need for a human on both sides of the exchange.

Humans are already used to asking for and receiving information from a machine. What John Done and Duplex have done is the reverse of that.

So back to positive effects from a voice AI. One application is to put data and automation to work for the customer’s benefit, in a way that a business cannot easily defend against. Imagine a shy caller who is not able to bully their way into a reservation at a hot restaurant (yes, this is actually a thing). Meanwhile, their AI can. Or, a human being who speaks with a less favored accent may be told that there is no availability, while their AI, which sounds acceptable, gets the reservation. Impact: more diverse restaurant goers, fewer awkward moments.

A negative impact in the short term, depending on how fast things change, is that a little more than 1M people were employed as receptionists as of 2016. While receptionists aren’t made obsolete by voice AI (answering phones is only part of their job), their jobs will change. Some will be able to deal with the change and others won’t.

Businesses will need voice AIs to talk to voice AIs… and the occasional human. At some point, it won’t be worthwhile for humans to answer the phone in a restaurant, salon, or florist. I’m curious to learn what percentage of calls in a restaurant deal with table bookings or service hours versus other types of calls that will probably stay human longer — for example, calls from suppliers.

Now let’s make the (minuscule) assumption that if this tech can be used in beneficial ways, it can also be used for the opposite. While access to the tech is limited today, it is only a matter of time before it is widely available. The key difference is that voice AI enables misuse at scale.

Remember back when Uber employees generated multiple requests for Lyft rides, only to cancel them? The purpose was to waste the time of the drivers and encourage them to drop Lyft for Uber. Imagine doing something like that at scale for a restaurant, cafe, salon, or any type of business that you compete with. It would be possible to tie up phone lines, waste time, or leave restaurants with empty tables.

An older scam that this tech will scale is what’s known as the “Hey, Grandma” scam, where a grandparent gets a call from a “grandchild” in distress. There are different flavors of this. For US grandparents the story is often that the grandchild got into legal trouble and needs money wired. In China and Taiwan, it’s often that the grandchild has been kidnapped and is being beaten up. Again, wire the money.

While the Lyft drivers couldn’t know that their requests were fake, why can’t the grandparents tell that they aren’t really receiving a call from a grandchild? Most of the time they can tell, but apparently the success rate may be as high as 2%! That’s more than enough to make it worthwhile. If you can scale this scam by automating realistic calls using a realistic voice AI, then that’s a game changer. If you can also emulate the voice of the supposed grandchild, the success rate will increase. If you can emulate the voice of the grandchild and also know specific facts about them, perhaps gleaned from public social media, the rate increases even further.
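To see why a roughly 2% hit rate is “more than enough,” here is a back-of-the-envelope sketch. Only the 2% figure comes from above; the per-call cost, call volume, and payout are hypothetical placeholders, not data from any real operation.

```python
# Back-of-the-envelope economics of an automated "Hey, Grandma" scam.
# Only the ~2% success rate comes from the text above; every other
# number is a hypothetical placeholder.

success_rate = 0.02         # cited figure: as high as 2%
cost_per_call = 0.01        # assumed: robocall-scale pricing, in dollars
calls_per_day = 10_000      # assumed: volume one automated operation might place
payout_per_success = 500    # assumed: a typical wire-transfer ask, in dollars

daily_cost = calls_per_day * cost_per_call
daily_revenue = calls_per_day * success_rate * payout_per_success

print(f"Daily cost:    ${daily_cost:,.2f}")      # $100.00
print(f"Daily revenue: ${daily_revenue:,.2f}")   # $100,000.00
print(f"Return:        {daily_revenue / daily_cost:,.0f}x the cost")
```

Even if these placeholders are off by an order of magnitude in either direction, the asymmetry holds: the scammer’s costs scale like a robocall while the payoff scales with each successful mark, which is exactly why a more convincing voice AI is such a multiplier.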

Related to the above, for a great writeup of why it’s so cheap to make phone calls at scale (as taken advantage of by robocallers), see this post on The Broken Economics Of Robocalls.

The Antidote to This Scam (…or the Scamidote)

There are long-standing low-tech solutions to this problem. One is a secret passphrase that only someone in the family would know (I know of multiple instances where this technique is actually used). Another is to ask a question that only the real caller could answer. While it’s difficult to remember to do this when you are caught up in an emotional moment, the best way to check who is really on the other end of the line is to use a preset code or ask a specific question. The solution to this tech problem can’t come from technology. Even today, voice emulation technology is pretty good, as heard for example in these demos of a fake Trump voice. Other copycat voices can be created with only seconds of recorded audio of the actual person.
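For illustration only, the family passphrase is really just a challenge–response check. The sketch below is hypothetical (the secret and prompts are placeholders, and the real-world version lives in people’s heads, not in code), but it shows the shape of the protocol:

```python
import hmac

# Hypothetical sketch of the "family passphrase" challenge-response check.
# In real life this is a memorized phrase, not software.
FAMILY_SECRET = "placeholder passphrase agreed on in advance"

def caller_is_verified(spoken_response: str) -> bool:
    """Return True only if the caller gives the pre-agreed passphrase."""
    # Constant-time comparison is overkill for a phone call, but it is the
    # idiomatic way to compare secrets in code.
    return hmac.compare_digest(
        spoken_response.strip().lower().encode(),
        FAMILY_SECRET.strip().lower().encode(),
    )

if __name__ == "__main__":
    # Before wiring any money, ask for the passphrase.
    print(caller_is_verified("placeholder passphrase agreed on in advance"))  # True
    print(caller_is_verified("Grandma, it's really me, please hurry"))        # False
```

The point is the same as in the human version: the check relies on shared knowledge the scammer cannot synthesize from audio alone.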

Out of Step Timing in Co-evolution

Yuval Noah Harari notes in his book Sapiens that history has seen periods of co-evolution (lions and gazelles each gradually getting faster over long stretches of time) and periods when one species (that’s us humans) arrived in a new environment and, using intelligence, tools, and cooperation, was able to wipe out stronger species. There was no chance for the woolly mammoths to evolve alongside humans.

So even if the passphrase-and-question antidote described above starts to be widely used, a scammer still only needs to fool a few people to make the scam worthwhile. In some cases, scammers and their targets will co-evolve, each getting better as time goes on. But perhaps not in this case. Here the scammers’ tech advantage happens at scale before their targets realize what’s happening. While the scammers’ tech builds on itself, each potential target must be educated and must maintain their rationality when they hear an extreme phone call from a supposed family member in trouble. The humans eventually catch up, but they suffer in the short term from this out-of-step timing. If anything, a scammer running a malevolent voice AI would want to keep the humans happy so that the scam itself survives longer.

The Medium is the Mess

In my old voice startup (2009–11) I built lots of unintelligent interactive voice responses, but nothing like John Done and Duplex. We researched the differences in how people communicate when they use voice alone, versus text, or voice plus video. People communicate in incredibly different ways depending on the medium. As above, imagine a shy person who can’t get the words out being able to type a conversation that their voice AI then delivers. Or, in the next generation, have them speak the words directly and let their voice AI relay them to the listener with more conviction.

It is ridiculous to imagine that voice AIs will be required to identify themselves as such. That is, various governments could “mandate” it and various companies could “demand” it, but it’s not going to happen. When misusing voice AI costs next to nothing, fooling people is highly profitable, and the calls can be made from international jurisdictions, there is no way to enforce such a rule.

But a force for good when it comes to making our lives easier in everyday ways? Yes, absolutely. Just with some second-order effects.