You may have already heard an AI voice on YouTube or a podcast — and not even known it
- Adam Bluestein
Seattle’s WellSaid is training its AI to synthesize human speech, and help clients produce it en masse.
Chris Black, the founder and managing director of Energy Industry Academy, an online learning platform for professionals in the energy, oil and gas, and construction industries, talks like a regular guy from somewhere in the South. Which he is.
“My ambiguous twang,” he says, comes from growing up in the DC area, attending law school in Miami and living in Houston for the past ten years. He has produced some 400 hours of audio content — wonky, technical stuff on “nondestructive testing” and safety compliance — which has been enthusiastically received by his target audience. That can be chalked up to Black’s specialized knowledge — but the fact that his voice sounds like “one of them” certainly contributes to his appeal.
What listeners don’t know is that the voice they’re hearing isn’t always Black’s own, but a synthetic version of his voice created with AI technology by Seattle-based WellSaid Labs.
For companies working in the burgeoning business of AI voices, this is the holy grail — producing human-like speech that can’t be distinguished from the real thing. “His voice was somebody that sounded familiar that they could learn from,” says Rhyan Johnson, WellSaid’s senior voice data engineer. “Out of 400 hours of content they’ve generated, 100 hours is using his synthetic voice — and no one has ever noted a difference, which is really exciting.”
As demand grows for lifelike synthetic voices — for use in digital assistants and smart speakers, virtual customer-service agents, e-learning, audiobooks, gaming and more — the market for text-to-speech (TTS) is projected to reach $7 billion in 2028, up from about $2.3 billion in 2020, according to Emergen Research. Along with Big Tech players such as Google, Amazon, Microsoft and others that offer synthetic voices for developers to build applications with, there is a growing contingent of startups, like WellSaid, that focus on making high-quality synthetic voices more accessible and affordable across a variety of niche applications.
WellSaid was spun out of the Allen Institute for Artificial Intelligence’s AI2 Incubator in 2019 and raised $10 million in Series A funding last July, led by Bellevue, Washington-based early-stage venture firm Fuse. It currently offers 49 off-the-shelf voice “avatars,” with four more coming later this month. The company also works with customers to create custom voices. Voices come in narration, promotional, conversational and storytelling styles.
“A lot of our customers are in the e-learning or commercial training space,” says Johnson. “So, finding voices that are engaging and easy to listen to, easy to learn from, that was really our first challenge.” In addition to Energy Industry Academy, WellSaid has worked with the Explanation Company on a custom voice for an interactive voice-based search app for kids. “They can ask pretty much anything, like, ‘when was the Ankylosaurus alive?’,” says Johnson. “It will try to formulate some kind of response and give a little more information — ‘Let’s talk about dinosaurs.’”
The company also works with clients in health and wellness, including XpertPatient, which focuses on delivering resources to cancer patients who are trying to navigate conversations with their doctors. “People engage and remember so much more when they can listen, instead of just giving them a pamphlet or workbooks,” Johnson says. “Finding a voice that is natural and can give feeling to the content that you are trying to provide has been really kind of rewarding to be a part of.”
All of WellSaid’s AI voice “avatars” are based on the voices of real people, who go into a studio and record about two hours of speech, reading from a script designed to capture a range of speech sounds with various intonations. (Originally, they needed 20 hours of audio; Alexa’s celebrity voices of Shaquille O’Neal, Melissa McCarthy, and Samuel L. Jackson reportedly required 60 hours.) This data is fed into deep-learning models that have been trained to convert discrete chunks of text, called graphemes, into notations of sound called phonemes, and to create a visualization called a spectrogram that shows the corresponding sound frequencies.
These spectrograms are carefully lined up with words, stitched together, and converted into the actual waveforms of speech by passing them through an adapted WaveNet, a type of “neural” vocoder (vocal synthesizer) originally developed by Google’s DeepMind group. Most so-called neural voice models — which deliver a more natural voice quality than earlier “concatenative” text-to-speech approaches — employ a similar workflow.
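For readers who want to see the flow of data, the stages described above (text to phonemes, phonemes to spectrogram frames, frames to a waveform) can be sketched as a toy Python pipeline. This is only an illustration of the hand-offs between stages, not WellSaid's system and not a neural model: the lexicon, the per-phoneme pitches and the function names `to_phonemes`, `to_spectrogram` and `vocode` are all hypothetical, and a single sine tone per frame stands in for what a real spectrogram and WaveNet-style vocoder would produce.

```python
import math

# Hypothetical toy lexicon: spelling (graphemes) -> sound symbols (phonemes).
LEXICON = {"well": ["W", "EH", "L"], "said": ["S", "EH", "D"]}

# Hypothetical pitch in Hz for each phoneme. A real model predicts many
# frequency bands per frame; one tone per frame keeps the sketch short.
PHONEME_HZ = {"W": 220.0, "EH": 330.0, "L": 262.0, "S": 440.0, "D": 294.0}


def to_phonemes(text):
    """Stage 1: look up each word's phonemes in the toy lexicon."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def to_spectrogram(phonemes, frames_per_phoneme=5):
    """Stage 2: emit a run of frames (one frequency each) per phoneme."""
    return [PHONEME_HZ[p] for p in phonemes for _ in range(frames_per_phoneme)]


def vocode(frames, sample_rate=8000, frame_ms=10):
    """Stage 3: turn frames into waveform samples (a stand-in for a vocoder)."""
    samples_per_frame = sample_rate * frame_ms // 1000
    wave = []
    phase = 0.0
    for hz in frames:
        for _ in range(samples_per_frame):
            # Carry the phase across frames so the tone does not jump.
            phase += 2 * math.pi * hz / sample_rate
            wave.append(math.sin(phase))
    return wave


phonemes = to_phonemes("well said")
frames = to_spectrogram(phonemes)
wave = vocode(frames)
```

In a production system each stage is learned from recorded speech rather than looked up in a table, which is why the amount of studio audio matters.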
Once a voice is ready, it can be summoned up to read pretty much any new script that a customer wants. (Like many TTS companies, WellSaid doesn’t allow its technology to be used to read out various kinds of offensive speech.) As new voices are added, and they gain experience reading new text, the system gets smarter. That means, among other things, that when a new voice is created, the system just needs to identify what makes it sound uniquely like this new person. And if it encounters a script with words it hasn’t trained on, it can guess at pronunciation from other examples.
WellSaid’s AI model has been trained to recognize non-standard words, such as dollar amounts, years, phone numbers, URLs, acronyms and abbreviations. And this month, the company added new markup tools that let customers easily annotate text that needs special treatment, giving them phonetic controls to correct pronunciation of things like names and company-specific jargon, and to better reflect regional accents (“car-muhl” vs. “car-a-mel,” for example). “It’s as if you’re working with a real voice actor and saying, ‘you’re reading this text, but read it this way,’” says Johnson.
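The kind of handling described above, expanding non-standard words and letting a customer override a pronunciation, can be sketched with a few rules in Python. This is a minimal rule-based sketch under stated assumptions: WellSaid's model learns this behavior rather than using hand-written rules, and the `normalize` function, its `overrides` argument and the respelling format are hypothetical, not the company's markup tools.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def _dollars(match):
    words = number_to_words(int(match.group(1)))
    if match.group(2):  # optional "million" or "billion"
        words += " " + match.group(2)
    return words + " dollars"


def _year(match):
    century, rest = int(match.group(1)), int(match.group(2))
    if rest < 10:  # years such as 2005 are left untouched in this sketch
        return match.group()
    return number_to_words(century) + " " + number_to_words(rest)


def normalize(text, overrides=None):
    """Expand a few kinds of non-standard words into speakable text."""
    # Hypothetical customer respellings are applied first.
    for word, respelling in (overrides or {}).items():
        text = re.sub(r"\b" + re.escape(word) + r"\b",
                      lambda m, r=respelling: r, text)
    # Whole-dollar amounts up to three digits, e.g. "$10 million".
    text = re.sub(r"\$(\d{1,3})(?!\.?\d)(?: (million|billion))?", _dollars, text)
    # Four-digit years from 1100 to 2099, read as two pairs.
    text = re.sub(r"\b(1[1-9]|20)(\d\d)\b", _year, text)
    # All-caps acronyms are read letter by letter.
    text = re.sub(r"\b[A-Z]{2,}\b", lambda m: " ".join(m.group()), text)
    return text


spoken = normalize("WellSaid raised $10 million in 2019.")
custom = normalize("caramel AI", {"caramel": "car-muhl"})
```

Here `spoken` becomes "WellSaid raised ten million dollars in twenty nineteen." and `custom` becomes "car-muhl A I". Decimal amounts such as "$2.3 billion" are deliberately left unchanged, since handling them would take more rules than a short sketch allows.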
WellSaid currently employs about 50 people, some working remotely and some at its Seattle HQ, where the mix of so many cutting-edge companies in AI, and voice in particular, creates a fertile environment for innovation. “Everyone’s very friendly and encouraging and sharing knowledge,” says Johnson. “It’s a great space to be a part of.” The company aims to grow headcount to 60 by year-end and, says Johnson, “There are job openings across all departments.”