John Gregoire never needed a microphone. His voice would project from his 5-foot-11 frame, streaking out of the dugout when he coached his sons’ Little League teams and filling hotel ballrooms packed with his clients. John spent his days as a management consultant and executive coach, flying from his Maine home to New York or Philadelphia or taking the train down to Boston. His voice was his instrument; timely interjection is everything in coaching. The days he didn’t spend on the road were spent on the phone, talking through problems with clients. Then on the drive home, he’d crank the radio, singing along and drumming the rhythm with his hands.
In 2006, John opened his own coaching business and spent the next year constantly on the road. Two weeks of travel every month brought on a bout of pneumonia, which, even after he recovered, left him with overwhelming fatigue. But his wife, Linda, knew something else was off.
By afternoon each day, his speech was slow and slurred, she says. For John, it was hard to enunciate; his lips felt awkward. They chalked it up to his work schedule.
One week that June, he headed back out on the road, to a meeting in Chicago. John and his clients sat down for a steak dinner and poured some wine. He began talking, and the two clients exchanged glances. They thought he was drunk; he hadn’t taken a sip.
Small things started to pile up: trouble opening jars, cutting himself while cooking dinner, difficulty showing his son a guitar chord. In New York, he fell over on the subway. One October morning, Linda looked out the window and saw him in the driveway. He looked nice: khakis, a blazer. He placed the garbage in the back of his car and pulled to the street to toss it in the can. As he went to throw it, his momentum bowled him over, and he rolled down the driveway. Linda sprinted out the front door, crying. They made an appointment to see a neurologist.
On December 17, John and Linda took the Amtrak Downeaster train to meet with a specialist at St. Elizabeth’s Medical Center in Boston.
With each stop, the train grew fuller with Christmas shoppers; at one point the whole car started singing carols. When they arrived, they walked 20 minutes through the early winter slush, Linda holding John up the whole way. The doctor was blunt: John had ALS. He gave John 18 months to live and offered the couple a coupon for a discounted wheelchair.
Linda limped John back across the now frozen slush and onto the train. She curled up, put her head on her husband’s shoulder, and cried. He tilted his head back and closed his eyes. They stayed like that for two and a half hours, all the way back to Maine.
John went back to work, and a new normal set in. Clients were understanding; email would suffice instead of phone conversations. Every afternoon at 3:30, John drove to Starbucks for a Venti-size coffee. Placebo or not, the coffee helped his speech. But it couldn’t last forever. Within a year, the man who could command a ballroom with his voice was resigned to typing on an iPad and having it speak for him.
“Bossy Ryan” took over from there. The mechanical, demanding voice on John’s iPad earned that nickname from Linda. Instead of John sneaking up behind her and whispering in her ear, she had tone-deaf Bossy Ryan. “You hear somebody on the phone, sometimes you can identify who that person is before they even say their name,” Linda says. “You take that away from them and just give them Bossy Ryan? It’s not right.”
Although John had fully switched over to the iPad, he still had a BlackBerry, and on it a voicemail greeting: Hi, this is John Gregoire. I can’t take your call, leave a message. Every day, Linda would call John’s number just to hear that anodyne message, over and over. Hi, this is John … She knew every crack, every pause.
One day she made the call, and it was gone. She checked the number and tried again. Still nothing. She jumped in her car and bolted to the AT&T store. It turned out that the employees who’d helped Linda arrange to have John’s BlackBerry shut off had neglected to tell her she’d lose the message. Linda cried in the store.
“You gotta get it back!” she pleaded. “It’s in the cloud somewhere, right?”
The employees shook their heads.
John and Linda still have home movies, but they’re too hard to watch. To look back on those things now, Linda says, what’s the point? They’re a painful reminder of what they’ll never get back: the whispers, the kisses, even the fights. There was something about hearing just his voice on the greeting, though, that had been soothing.
Rupal Patel strides onto the TED stage at San Francisco’s SFJAZZ Center. A black slide with white type rises behind her. “In the words of the poet Longfellow,” she says, “the human voice is the organ of the soul.”
A photo of Stephen Hawking slides across the screen, and then one of a little girl, then three more people. All use communication devices to help them speak. And all of them, Rupal says, may be using the same voice. This problem, this lack of individualization, this lack of soul, was driven home for her 11 years earlier.
It was August 2002, and she had just gotten off the stage at the conference of the International Society for Augmentative and Alternative Communication, in Odense, Denmark. She walked into the crowded technology exhibit hall, where people from all over the world were pitching voice programs, software, and research. There, she stumbled across a conversation between a young girl, no older than 10, and a middle-aged man. When they spoke, they used the same voice.
That voice, known colloquially as “Perfect Paul” (or even more colloquially, the “Stephen Hawking voice”), is the most popular artificial voice on the planet. You may also know it as the longtime voice of the National Weather Service. Developed in the early ’80s by Digital Equipment Corporation for its DECtalk speech synthesizer, it got its name because it was the clearest artificial voice on the market.
As Rupal looked around the exhibit hall, she saw hundreds of people using only a handful of voices. “We wouldn’t dream of fitting a little girl with the prosthetic of a grown man,” she tells the TED audience. “So why then the same prosthetic voice?”
She reached out to Tim Bunnell, a speech synthesis expert who was already building personalized voices for people who lost the ability to speak later in life, people who’d banked their speech knowing it was escaping them. He did this by clipping together a person’s speech samples and reconstructing his or her voice. Rupal had to find a way to reverse engineer the system with whatever vocal ability an individual had. Maybe that was the ability to pronounce one or two vowels, or maybe it was just a noise from deep within their larynx. Whatever it was, Rupal was going to capture it.
“What happens next is best described by my daughter’s analogy—she’s 6,” Rupal says on the TED stage. “She calls it mixing colors to paint voices.”
To create a voice, Rupal takes those unique source sounds from a speech-impaired person and combines them with full speech from someone roughly the same age and gender (similar regional accents are helpful, too).
She introduces the story of Samantha, a 17-year-old with perisylvian syndrome, a rare disorder that limits her ability to speak. But Samantha can still produce vowel-like sounds. Rupal plays an audio recording of Samantha pronouncing an “Ahhh” sound. She pauses. “Now, Samantha can say this,” Rupal says.
A new recording begins. It’s feminine and youthful and optimistic. “This voice is only for me,” Samantha says. “I can’t wait to use my new voice with my friends.”
Two weeks before the TED talk, on Thanksgiving Day 2013, Rupal had set up a website, VocalID.org, which contained a small section where visitors could sign up to donate their voices. Before walking on stage, Rupal had 10 surrogate voices she could use to make personalized voices. Two hundred, she thought, would be a good start. That alone would be 10 times larger than her biggest study. Within a week, 1,500 people had signed up. Two months later: 8,000 people.
Money started trickling in from the National Institutes of Health and the National Science Foundation. Rupal hired a programmer to build out the website, making it possible for those 8,000 waiting donors to give their voices. But what should they say?
Rupal compiled 3,500 sentences based on their sounds and sound combinations. Some were selected for rhythm, melody, and emotion; 250 constituted our most familiar phrases.
Good to see you.
I love you.
Donors flooded the site. Voice drives surfaced at middle schools and as bar mitzvah projects. Chinese residents used the voice donation as English language practice. More than 10,000 people have now started or completed donating, no small feat considering the 3,500 sentences take roughly six or seven hours to orate.
To take all those donations and actually turn them into usable voices, Rupal needed to build a more complex algorithm that would search through the bank and pull out appropriate matches. In stepped Geoff Meltzner.
Geoff and Rupal had met eight years earlier, when he was working on silent speech recognition. (The work was fascinating: Imagine sensors on your neck and face that can recognize words even if you just mouth them silently.) Geoff built the algorithm Rupal needed and installed a computer in his basement to work through all the permutations. Before he came on, Rupal had to spend 40 to 50 hours manually massaging every voice created by her team at Northeastern University, smoothing out transitions or tweaking certain letter combinations. The computer’s fine-tuning helped drop that number to about 15 hours.
To supplement the federal grants, Rupal launched an Indiegogo crowdfunding campaign with a goal of raising $70,000. Among the incentives, she included a “Trailblazer” option: If you donated $10,000 you could get one of the first voices off the line. (Or, in this case, out of Geoff’s basement.) Rupal didn’t expect anyone to donate that much; four families did. She chose three additional voice beneficiaries and got to work.
The recipients included a 59-year-old man with ALS from Windham, Maine, and a 12-year-old girl with cerebral palsy from Plano, Texas.
“All right, Tess, what do you want to listen to?”
Ann Gregorek adjusts herself in the driver’s seat of her van, looks to Tess in the passenger’s seat, and starts scanning the radio stations. Fall Out Boy and Coldplay speed by before the dial lands on “Chandelier,” by Sia. Tess kicks her foot.
“Really? This one?”
Tess is 12 years old, in seventh grade, and is dressed in a bright red Texas Rangers sweatshirt and Jordan basketball pants. Her pecan hair is pulled into a high ponytail, and her right wrist is adorned with the kind of cloth bracelet championed by 12-year-olds with high ponytails. She has a broad, generous smile, one that makes her eyes look smaller because her cheeks raise so much. On the back of her wheelchair hangs her shimmering purple cooler bag. Her favorite singer is Taylor Swift, but for the moment—I guess—she’ll settle for Sia.
“If she used this all the time, it would take forever, so I just say ‘Kick when I get to the one you want,’” Ann says.
“This” is Tess’ talker. She can explain.
“I push a button on the left side of my head to control my talker,” Tess says. “Please be patient because it takes me a long time to talk back to you.”
The voice is somehow both flat and tinny, like a See ’n Say toy. It doesn’t sound like a precocious tomboy, and it doesn’t do justice to the fact that when Tess smiles you can practically see it even if you’re standing behind her. It’s straight from the computer, standard-issue. It’s used by thousands around the world. It is, in every way, inhuman.
Ann and her husband, Jason, knew Tess would never be able to speak. Born with cerebral palsy, Tess is limited physically and vocally. Cognitively? One hundred percent there. (During visits with her math tutor, Tess sits in her chair and solves equations in her head, using the same kick method to dispense her answers.)
When Tess was 3, she and Ann sat in Tess’ bedroom, surrounded by her stuffed animals. Tess was the teacher; Ann and the animals, the students. “Let’s rhyme,” Tess said. “What rhymes with man?”
“Tan! Fan!” Ann shouted.
Around and around they went, until lunchtime. Ann got up to make lunch but heard Tess clicking around. Three minutes later, Tess made her announcement: “Remember, girls and boys, no school tomorrow.”
“And I’m like, ‘Remember?’” Ann says. “Where’d you learn that?”
By first grade she’d written her first book. Ice Man was a story about playing in the snow with her sisters, and she followed it up with a slew of poetry collections. As a middle schooler she’s ventured into playing the drums, using her head switch to keep the rhythm. At school, there are other kids like her, which means that there are also other kids with her same voice. For teachers with their backs to the room, this creates a near impossible task: One student responds, but six students have the same voice. Who answered the question?
For Tess there’s also the reality of nearly being a teenager and watching others talk, laugh, or sing as they walk down the hallway.
One night in November 2014, Ann and Jason stumbled across the video of Rupal introducing Samantha to a San Francisco crowd. They quickly sent off an email: “Hello! We are just wiping tears away from our eyes after listening to your TED talk.” Both were stunned when Rupal responded. (“I couldn’t believe I was talking to the Rupal Patel,” Ann says.) They soon became VocalID Trailblazers.
A year later, Rupal walks into their Texas home. She’s just flown in from the Bay Area, where she spent yet another day pitching investors on VocalID. She leans out of her chair and smiles at Tess: “Are you ready for this?”
With Ann beaming by her side, Tess nods. Rupal hits play.
“My name is Tess.”
She plays the second one. It’s quicker than the first, and more youthful. It’s a monumental shift from Tess’ current voice, which, when Rupal plays it soon after, sounds cartoony in comparison.
“They sound like girls,” Ann says. “They sound like Tess.”
Tess puts her head down and smiles. Rupal plays them again. The first is a bit breathy; the second is louder. It’s more confident, clearer. Tess picks the second voice, and Rupal begins the process of uploading it onto Tess’ machine. Half an hour, a machine switch, and a call to Geoff later, the voice is finally ready to go.
“All right, Tess, you have it, you can say anything you want now,” Ann says. “What are your first words going to be?”
“This is one small step … ,” Jason says in his best Neil Armstrong impression.
“Don’t mess with Texas,” Geoff suggests over the phone.
Ann looks at Tess. It’s late afternoon, and she’s tired. “Once you’re organized, of course,” Ann says. “We’ll wait until you’re organized.”
Earlier in the day, Ann was nervous. What if she doesn’t like the voice? Is the whole thing too much pressure? Tess, less so. “She said ‘Whatever’ yesterday and laughed. I think she’s just annoying me on purpose,” Ann said in a text.
Geoff is still on the phone; Rupal, still at Tess’ side. Everyone waits, and Tess lowers her head. Minutes pass. To break the suspense, Rupal gives Tess a T-shirt that features all the Indiegogo donors’ names incorporated into the VocalID logo. It doesn’t work.
“She’s 12,” Jason says with a laugh. “You can’t make her talk, and you can’t make her shut up either.”
Eventually Tess goes into the other room and watches television. Later, after Rupal leaves, she speaks with her sister, alone.
“People will say to me, ‘Why don’t you just make 1,000 voices? You have 10,000 voices in your bank, just flood the market,’” Rupal says that night. “Tess is not one of 1,000 people. The fact that she can’t control so many other things, but she made breath to make sound, and we took that sound to make her voice ... Is it important for her to just pick a voice from a thousand? It may not be important for her right now. But for the 16-year-olds and 17-year-olds we’re making voices for, they say, ‘I don’t want to pick from a library. I’m not a library. I’m a human.’”
Every day, Rupal has the same routine. Wake up at 6:30 a.m., get her kids ready for school, head down the road into the office. Walk past the Italian restaurant on the first floor and up to the third floor, where VocalID’s Belmont, Massachusetts, headquarters shares a kitchen, conference room, and restroom with a marketing team, a sports psychologist, and a forestry consultant. She works all day, tucks her kids in—her husband is a professor at MIT; they’re both on the move all the time—and then begins her second shift. At 9 p.m. she starts her round of West Coast calls, which last until early morning. She falls asleep at 2.
She’s operated on four and a half hours of sleep for the past 18 months.
“This is all I’ve been doing, thinking, breathing,” she says. “It’s not super healthy working until 2 every day and then getting back up at 6:30 or 7 a.m.; it’s driven by adrenaline and driven by passion and driven by drivenness. But there’s a cost to it.”
That cost is missing Christmas cookie exchanges, or discovering more white hairs, or finding nannies to watch her two children when she needs to go to dinners and her husband is out of town. The cost is flying across the country and meeting with investors who just don’t get what she’s trying to do—and then flying back to Boston and wondering if she’s going to land her next grant. It’s exhausting. Her tiny office is nearly unadorned, save for a Tupperware of candy and an innovation award from South by Southwest.
There’s a financial cost, too. Her own net worth is in the company, and her seven employees won’t work below market rate forever. The biggest cost, though, is time. Every day the company doesn’t lock down investors is another day that a competitor could join the market. So she needs to adapt.
The first step: Automate more. After Tess got her voice, Ann called Rupal. The voice, she said, it wasn’t quite Tess. Tess is stronger than that, Ann said, so they sent Rupal new audio samples and started the process again. By the end of 2016 this extra work might not be necessary. Recipients will be able to adjust their voices using an online portal. Want more volume? You got it. More bass? Turn the knob to the right. Part of this is for the customer—automatic, at-will adjustment gives the recipient the exact sound they want—but much of it is a matter of necessity.
VocalID has orders for 76 voices in 2016, up from 2015’s trailblazing seven. The goal for this year is to drop that manual manipulation time from 15 hours to under eight. By 2018, with the addition of the online portal, Rupal hopes to shrink that number to below one. Because even if that time drops to three hours—less than one-tenth of what it took in 2014—the business might not be scalable.
While VocalID started as an altruistic idea, the science and the technology behind it aren’t cheap—the voices cost $1,250, plus an annual upkeep fee. Those voices, though, aren’t the company’s end goal.
Imagine you’re on the subway, but you’re supposed to be on a conference call. What if, instead of being late, you could send a text that was read over the call in your own voice? Or, using that same line of thinking, what if every email you received could be read to you in the sender’s voice? Those applications, Rupal believes, are the future of VocalID. If she can lower the production costs, VocalID can enter those markets, making voices free for the Tesses of the world.
For every money-saving portal, though, comes another pile of questions. For Tess, a 12-year-old entering puberty, that voice may change another two or three times. Does that require a new donor? That probably won’t work; it would sound too different. Would the original donor be willing to redonate? Maybe, but is that an imposition the company is willing to bank on?
And then there are the Johns of the world, who come with their own questions. Can you accurately recreate someone’s voice from audio and video recordings? Can you restore the organ of one’s soul?
On December 14, 2015, three days shy of the eighth anniversary of his ALS diagnosis, John and Linda load into their van and start the 120-mile drive from their Maine home to VocalID’s offices. They don’t talk much. John prepares himself for disappointment.
In the small conference room, John, Linda, Rupal, and Geoff all cram in with a news crew from Maine and prepare to hear three voices. Linda stands behind John, draping her arm over the back of his chair and leaning over his left shoulder. Geoff hits play.
“Can we stop at Starbucks to refuel?”
Rupal smiles and then sees John and Linda’s blank faces. She looks to the floor and fiddles with her fingers. Geoff plays the second voice.
“Can we stop at Starbucks to refuel?”
Linda smiles; John nods politely. Geoff plays the final voice.
“Can we stop at Starbucks to refuel?”
Halfway through the sentence, Linda’s eyes go wide, and John raises his head. They both turn and smile at Rupal. “That’s the one,” Linda says, nodding her head. “That’s really close.”
She rubs John’s shoulders and looks down at her husband. His lips and eyes are sealed shut, and he’s crying. Linda takes off his glasses and wipes his cheeks. Geoff plays one more sample.
“My name is John. I live in Windham, Maine.”
A shriek slips out of John, and Linda leans over to whisper into his ear: “That sounds like you.”
“I can hear him talking sometimes,” Linda says later. “I can hear him talking, but I can’t hear his voice because I’ve been so accustomed to Ryan—Bossy Ryan—that it’s sort of skewed my memory of what his voice sounds like. So when we heard it, it was like having a friend visit.”
Once the voice is loaded onto his machine, John plans out his first sentence with his new, old voice. There is no question what it will be.
“I love you, Linda.”
Soon, they load back into the van. They exclaim together—I can’t believe ... I can’t believe—John in his chair, Linda peeking back in the rearview mirror. The skyline fades behind them, and they ease up Interstate 95, back to Windham. Drained, John falls quickly to sleep.