I had thought it’d be fun, and uncanny, to clone my voice. I’d sought out the AI start-up ElevenLabs, paid $22 for a “creator” account, and uploaded some recordings of myself. A few hours later, I typed some words into a text box, hit “Enter,” and there I was: all the nasal lilts, hesitations, pauses, and mid-Atlantic-by-way-of-Ohio vowels that make my voice mine.
It was me, only more pompous. My voice clone speaks with the cadence of a pundit, no matter the subject. I type I like to eat pickles, and the voice spits it out as if I’m on Meet the Press. That’s not my voice’s fault; it is trained on just a few hours of me speaking into a microphone for various podcast appearances. The model likes to insert ums and ahs: In the recordings I gave it, I’m thinking through answers in real time and choosing my words carefully. It’s uncanny, yes, but also quite convincing—a part of my essence that’s been stripped, decoded, and reassembled by a little algorithmic model so as to no longer need my pesky brain and body.
Using ElevenLabs, you can clone your voice like I did, or type in some words and hear them spoken by “Freya,” “Giovanni,” “Domi,” or hundreds of other fake voices, each with a different accent or intonation. Or you can dub a clip into any one of 29 languages while preserving the speaker’s voice. In each case, the technology is unnervingly good. The voice bots don’t just sound far more human than voice assistants such as Siri; they also sound better than any other widely available AI audio software right now. What’s different about the best ElevenLabs voices, trained on far more audio than what I fed into the machine, isn’t so much the quality of the voice but the way the software uses context clues to modulate delivery. If you feed it a news report, it speaks in a serious, declarative tone. Paste in a few paragraphs of Hamlet, and an ElevenLabs voice reads it with a dramatic storybook flair. (...)
ElevenLabs knew its model was special when it started spitting out audio that accurately represented the relationships between words, Staniszewski told me—pronunciation that changed based on the context (minute, the unit of time, instead of minute, the description of size) and emotion (an exclamatory phrase spoken with excitement or anger).
Much of what the model produces is unexpected—sometimes delightfully so. Early on, ElevenLabs’ model began randomly inserting applause breaks after pauses in its speech: It had been training on audio clips from people giving presentations in front of live audiences. Quickly, the model began to improve, becoming capable of ums and ahs. “We started seeing some of those human elements being replicated,” Staniszewski said. The big leap was when the model began to laugh like a person. (My voice clone, I should note, struggles to laugh, offering a machine-gun burst of “haha”s that sound jarringly inhuman.)
Compared with OpenAI and other major companies, which are trying to wrap their large language models around the entire world and ultimately build an artificial human intelligence, ElevenLabs has ambitions that are easier to grasp: a future in which ALS patients can still communicate in their voice after they lose their speech. Audiobooks that are ginned up in seconds by self-published authors, video games in which every character is capable of carrying on a dynamic conversation, movies and videos instantly dubbed into any language. A sort of Spotify of voices, where anyone can license clones of their voice for others to use—to the dismay of professional voice actors. The gig-ification of our vocal cords.
What Staniszewski also described when talking about ElevenLabs is a company that wants to eliminate language barriers entirely. The dubbing tool, he argued, is its first step toward that goal. A user can upload a video, and the model will translate the speaker’s voice into a different language. When we spoke, Staniszewski twice referred to the Babel fish from the science-fiction book The Hitchhiker’s Guide to the Galaxy—he described making a tool that immediately translates every sound around a person into a language they can understand. (...)
ElevenLabs’ voice bots launched in beta in late January 2023. It took very little time for people to start abusing them. Trolls on 4chan used the tool to make deepfakes of celebrities saying awful things. They had Emma Watson reading Mein Kampf and the right-wing podcaster Ben Shapiro making racist comments about Representative Alexandria Ocasio-Cortez. In the tool’s first days, there appeared to be virtually no guardrails. “Crazy weekend,” the company tweeted, promising to crack down on misuse.
ElevenLabs added a verification process for cloning; when I uploaded recordings of my voice, I had to complete multiple voice CAPTCHAs, speaking phrases into my computer in a short window of time to confirm that the voice I was duplicating was my own. The company also decided to limit its voice cloning strictly to paid accounts and announced a tool that lets people upload audio to see if it is AI generated. But the safeguards from ElevenLabs were “half-assed,” Hany Farid, a deepfake expert at UC Berkeley, told me—an attempt to retroactively focus on safety only after the harm was done. And they left glaring holes. Over the past year, the deepfakes have not been rampant, but they also haven’t stopped. (...)
For Farid, the UC Berkeley researcher, ElevenLabs’ inability to control how people might abuse its technology is proof that voice cloning causes more harm than good. “They were reckless in the way they deployed the technology,” Farid said, “and I think they could have done it much safer, but I think it would have been less effective for them.”
The core problem of ElevenLabs—and the generative-AI revolution writ large—is that there is no way for this technology to exist and not be misused. Meta and OpenAI have built synthetic voice tools, too, but have so far declined to make them broadly available. Their rationale: They aren’t yet sure how to unleash their products responsibly. As a start-up, though, ElevenLabs doesn’t have the luxury of time. “The time that we have to get ahead of the big players is short,” Staniszewski said, referring to the company’s research efforts. “If we don’t do it in the next two to three years, it’s going to be very hard to compete.” Despite the new safeguards, ElevenLabs’ name is probably going to show up in the news again as the election season wears on. There are simply too many motivated people constantly searching for ways to use these tools in strange, unexpected, even dangerous ways.
In the basement of a Sri Lankan restaurant on a soggy afternoon in London, I pressed Staniszewski about what I’d been obliquely referring to as “the bad stuff.” He didn’t avert his gaze as I rattled off the ways ElevenLabs’ technology could be and has been abused. When it was his turn to speak, he did so thoughtfully, not dismissively; he appears to understand the risks of his products and other open-source AI tools. “It’s going to be a cat-and-mouse game,” he said. “We need to be quick.”
The uncomfortable reality is that there aren’t a lot of options to ensure bad actors don’t hijack these tools. “We need to brace the general public that the technology for this exists,” Staniszewski said. He’s right, yet my stomach sinks when I hear him say it. Mentioning media literacy, at a time when trolls on Telegram channels can flood social media with deepfakes, is a bit like showing up to an armed conflict in 2024 with only a musket.
The conversation went on like this for a half hour, followed by another session a few weeks later over the phone. A hard question, a genuine answer, my own palpable feeling of dissatisfaction. I can’t look at ElevenLabs and see beyond the risk: How can you build toward this future? Staniszewski seems unable to see beyond the opportunities: How can’t you build toward this future? I left our conversations with a distinct sense that the people behind ElevenLabs don’t want to watch the world burn. The question is whether, in an industry where everyone is racing to build AI tools with similar potential for harm, intentions matter at all.
by Charlie Warzel, The Atlantic | Read more:
Image: Daniel Stier for The Atlantic