Teaching a Computer to Hear
Janet MacIver Baker, J69, simply calls it “the debacle.” It’s as good a word as any to describe the cascade of events that led to the sudden demise of Dragon Systems, the company she and her husband Jim had grown for decades, then lost virtually overnight.
Living up to its namesake’s reputation for power and creativity, Dragon cracked one of the most intractable problems in computing—speech recognition technology. Then, just as their successful business had merged with a larger company and was poised to grow worldwide, the Bakers watched as the company went bust, taking all of their technology, employees and money with it.
Worse yet, they lost control of the patents to their own ideas, and were now suddenly barred from developing the technology they’d pioneered. Baker has a word for that, too: “Devastating,” she says. Sitting at her dining table in an ornate Victorian home in the Boston suburb of Newton, Baker clearly didn’t lose everything with the dissolution of her company. But even after more than a decade—and continuing lawsuits against those they deem responsible—she can’t talk about the debacle without emotion.
She plucks artifacts of Dragon Systems from a large plastic bin filled with magazine clippings, articles from the New York Times and the Wall Street Journal and other mementos. Both of the Bakers have since landed on their feet. Jim is a Distinguished Career Professor at Carnegie Mellon, and helped found the Defense Department–funded Human Language Technology Center at Johns Hopkins University. Janet is a visiting scientist at the MIT Media Lab and a lecturer at Harvard Medical School, where she is researching how the brain understands language.
And although their company disappeared, the technology that Baker helped invent lives on. In February 2010, the company that acquired Dragon’s patents and software, Nuance Communications, collaborated with Siri Inc., a spinoff of SRI International (formerly Stanford Research Institute), to unveil a new smartphone app, an intelligent assistant called Siri.
Two months later, Apple acquired Siri Inc. for an undisclosed sum and made the app the most hyped aspect of its iPhone 4S, released last fall (Siri is also a big part of the new iPhone 5). Nuance itself produces a suite of dictation programs based on Dragon NaturallySpeaking, licenses Dragon-brand software to control everything from car navigation systems to coffee makers to TVs and just announced its own competitor to Siri called Nina. None of this would be possible if it weren’t for a promise that Baker and her husband made to each other more than 40 years ago.
A Very Powerful Force
Now in her 60s, Janet Baker flashes a warm smile as she lays out muffins and coffee on a sun-splashed dining table. Her long gray hair is tied back in a no-nonsense ponytail, and she wears dangly earrings, gold-rimmed glasses and a vaguely Asian-looking purple blouse. Dragons are all around the house, with statuettes lining the mantel in the living room and a 200-pound sculpture guarding the front hall. “In the Chinese world, the dragon is the spirit of creativity, but every culture has some powerful, untamed force,” she says. “We thought that speech and language was also a very powerful force.”
The Bakers first vowed to tame that force soon after they married, during Columbus Day weekend of 1971. The two, he a mathematician and she a biophysicist, were doctoral students at Rockefeller University, in New York City, and later transferred to Carnegie Mellon University. Young and ambitious, they sought an area in which they could pool their expertise to make a significant impact on the world. “We wanted to do something that would be practical and useful and more than a paper on a library shelf,” Baker says. “And we decided the goal had to be satisfied in our lifetime.”
Making some rough estimates of the computing power necessary to solve the problem (and taking into account Moore’s law, which predicts that microchips double in power about every two years), they settled on speech recognition as a worthwhile and attainable goal for their life’s work.
“Our long-term goal was always to achieve a system to transcribe what you are saying in close to real time, with a practical degree of accuracy and a high degree of performance,” Baker says. “We figured it would take us 25 to 40 years to get there.”
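To get a feel for that kind of back-of-the-envelope estimate, here is a minimal Python sketch. It assumes only the rule of thumb quoted above (one doubling of computing power every two years); the function name is made up for illustration, and this is not a reconstruction of the Bakers' actual calculation.

```python
def growth_factor(years, doubling_period_years=2.0):
    """Multiplicative gain in computing power after `years`,
    assuming one doubling every `doubling_period_years` (an idealized Moore's law)."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # The two ends of the 25-to-40-year horizon mentioned in the article.
    for horizon in (25, 40):
        factor = growth_factor(horizon)
        print(f"{horizon} years -> roughly {factor:,.0f} times the computing power")
```

Under that assumption, waiting 25 years buys a few thousand times more computing power, and waiting 40 years buys about a million times more, which is why a problem far out of reach in 1971 could plausibly be called attainable within a working lifetime.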
At the time they made that choice, speech recognition was already a staple of science fiction adventures like Star Trek, Star Wars and 2001: A Space Odyssey, where robots and computers effortlessly responded to human speech—for good or for bad (“Just what do you think you are doing, Dave?”).
In reality, however, the problem was not so easy to solve. Computers might be good at arithmetic, but they have only a fraction of the computational power of the human brain. So while humans can easily recognize the same word spoken in different tones and accents, computers exhibit a notorious tin ear for language.
“There is huge variability in the way we speak,” explains Baker. “I can say the same word, word, wooord, wooooord?”—she alters the intonation half a dozen times—“and I can drastically change the way it sounds. You recognize it as ‘word’ every time, even when there is a jackhammer or a lawnmower in the background. That’s not easy.”
Studies show that we humans often recognize words even when we don’t hear large parts of them—our brains fill in the gaps automatically. How we do it, though, is an almost total mystery, dependent on a complex interplay of sensory and cognitive brain functions.
Everything is Connected
But Baker had been trained in looking at problems from multiple directions at once. Growing up in Cambridge, Mass., she had already decided by the age of five that she was going to be a doctor. She remembers coming across a hulking edition of Gray’s Anatomy in her hallway bookshelf. Taught to read at an early age, she pulled it out and read the foreword and was immediately discouraged.
“The message was that you can’t understand any system of the body without understanding all of the other systems, because they are all interconnected,” she says. “I didn’t see how anyone could break into that loop.”
The lesson stuck with her, however, as her interests took her from clinical medicine to medical research by high school. A friend was the son of a well-known neuroscientist at MIT, Jerome Lettvin, who let her sit in on classes and even gave her lab space for her own experiments—a heady experience for a 15-year-old.
When it came time for her to pick a college, she chose Tufts, thanks in large part to a new Experimental College program, an intensive math and science curriculum funded by the National Science Foundation. The idea was for students to approach a problem—say, snake locomotion—simultaneously from different disciplines, including math, chemistry, physics and biology.
Baker’s roving mind ate it up. “I’ve always been interested in looking to see what one area of science can give to another,” she says. She applied that multidisciplinary approach as she began focusing on how animals process information, studying the neurophysiology of moth ears at Tufts, the visual receptors of horseshoe crabs at Rockefeller University and the acoustics of speech processing at Carnegie Mellon.
When she turned her attention to the problem of speech recognition, it was almost like opening up Gray’s Anatomy all over again. When humans take in speech, they process an incredible amount of information: not only the unique characteristics of a person’s voice, but also the language being spoken, the topic of conversation, the speaker’s emotional state and the background noise.
As with the systems of the body, it’s hard to make sense of any one of those without understanding the others. At the time the Bakers began their quest, however, most researchers in speech recognition approached the task as a rule-based process, combining one or more of those sources of information and applying certain rules to determine the word that had been spoken.
Computers Learning Language
Jim pioneered a method to combine all of these knowledge sources into a common mathematical model. He collected statistics on the different sources of information and used probabilities to predict the words being spoken. As more speech became available, the system learned to make better predictions. “That was radical at the time, and even considered heretical by some,” says Janet Baker. “People were saying you can’t model this, you can’t model that, but it turns out you can model almost anything in this framework.” The approach quickly proved workable, though the computer industry did not adopt it for more than a decade.
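The idea of folding several knowledge sources into one probabilistic score can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the candidate words, the probability tables and the function names are invented, and the model is a toy, not the one Dragon used. It shows only the general principle that each source contributes a probability, the probabilities are multiplied (added as logarithms), and the statistics come from counting examples.

```python
import math


def estimate_probabilities(counts):
    """Turn raw counts from observed speech into probabilities.
    More data simply means better counts, which is the 'learning' step."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def recognize(acoustic_likelihood, language_prior):
    """Pick the word with the highest combined score.
    Adding log-probabilities is the same as multiplying the probabilities."""
    def score(word):
        return math.log(acoustic_likelihood[word]) + math.log(language_prior[word])
    return max(acoustic_likelihood, key=score)


# Hypothetical numbers, invented for this example.
# How well the incoming sound matches each candidate word:
acoustic_likelihood = {"word": 0.30, "ward": 0.45, "whirred": 0.25}
# How often each word turned up in previously seen text on the same topic:
language_prior = estimate_probabilities({"word": 60, "ward": 30, "whirred": 10})

if __name__ == "__main__":
    print("Sound alone suggests:", max(acoustic_likelihood, key=acoustic_likelihood.get))
    print("Sound plus context suggests:", recognize(acoustic_likelihood, language_prior))
```

In this toy case the sound by itself slightly favors "ward," but once the context statistics are combined with it, "word" wins. No hand-written rule made that decision; the numbers did.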
After completing their doctorates at Carnegie Mellon, the couple worked on speech recognition at IBM and Verbex (a division of Exxon Enterprises). But after a few years, they grew frustrated with the slow pace of development of the technology for commercial use. They decided to go it alone, opting to form a startup dedicated to advancing state-of-the-art speech recognition in practical, affordable products and applications. With just $35,000 of personal savings—and a big mortgage and two preschoolers to tend to—they founded Dragon Systems in May 1982. They were a close-knit team, sharing the work on both technical and business fronts.
They couldn’t afford large mainframes or other custom hardware. Instead, they worked on small personal computers such as the Apple IIe, streamlining and adapting mathematical equations that were typically run on machines with much greater computing power. By 1984 they had created a prototype of a dictation system that could transcribe words. Providing. The. Speaker. Paused. Between. Each. One. At first, it took 22 seconds for the computer to understand a single word. Over the next six years, however, as computing power increased and they made further advances, the response time shrank to a fraction of a second.
The company released the world’s first general-purpose dictation system, DragonDictate-30K, in March 1990. While the program was well received by the technical community and the press, it didn’t break through with general consumers, who balked at the idea of pausing between words. Despite that drawback, there were major beneficiaries. People who couldn’t type, especially disabled people, suddenly found that they could get jobs, go to school, produce e-mail and reports and otherwise benefit from computers, all because of the Dragon software.
Despite the modest success of DragonDictate, the Bakers still hadn’t achieved the goal they set for themselves in 1971: to create a system that could transcribe continuous speech. This proved a tough nut to crack. To recognize words without pauses, the computer had to be able to identify the beginning and end of words anywhere in the vocal stream—the difference, say, between I love you and all of you or isle view. “That doesn’t sound like a big change,” says Baker, “but it turns out it takes from a hundred to a thousand times as much computational power.” After a few years, they developed a prototype.
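Why finding word boundaries is so costly can be illustrated with a small Python sketch. It is a simplification under stated assumptions: letters stand in for sounds, the tiny lexicon and its probabilities are invented, and the dynamic-programming search shown here is a generic textbook technique, not Dragon's algorithm. The point is that without pauses, the program has to consider a word ending at every position in the stream.

```python
import math

# Hypothetical lexicon with made-up word probabilities.
LEXICON = {"i": 0.20, "love": 0.10, "you": 0.15, "all": 0.10,
           "of": 0.20, "a": 0.15, "glove": 0.02, "lo": 0.01}


def segment(stream, lexicon=LEXICON):
    """Split an unbroken stream into the most probable word sequence.
    Returns a list of words, or None if no segmentation exists."""
    n = len(stream)
    max_len = max(len(word) for word in lexicon)
    # best[k] = (log-probability of the best split of stream[:k],
    #            start index of the last word in that split)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        # Try every place the last word could have started.
        for start in range(max(0, end - max_len), end):
            word = stream[start:end]
            if word in lexicon and best[start][0] > -math.inf:
                candidate = best[start][0] + math.log(lexicon[word])
                if candidate > best[end][0]:
                    best[end] = (candidate, start)
    if best[n][0] == -math.inf:
        return None
    # Walk backward through the stored start indices to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(stream[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    print(segment("iloveyou"))
    print(segment("allofyou"))
```

With pauses between words, each chunk can be matched against the vocabulary on its own. Without them, every position is a possible boundary and every boundary multiplies the candidates to check, and real speech adds acoustic uncertainty on top of that.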
Already, however, they were conscious of other companies—heavy-hitters like IBM and Microsoft—working on their own speech recognition programs. Now within sight of their goal, the Bakers sought an infusion of cash to pull them over the finish line first. Up to that point, the company had been completely self-funded and debt free, growing entirely through revenues from products, software licenses and some U.S. government research contracts. But competition changed the game.
In 1994, Dragon accepted $20 million from the disk drive maker Seagate Technology for a 25 percent stake in the company. The money went to translating research code into marketable form, as well as hiring a phalanx of sales and marketing staff to launch the product with a splash within three years. (An additional investment boosted Seagate’s stake to about 35 percent.)
The result was Dragon NaturallySpeaking, the first general-purpose continuous dictation system. All users had to do was train the program to the sound of their voice, and they could dictate notes and letters into word processing files, as well as control their computer with simple voice commands. On April 2, 1997, the product was released into the world.
Dragon NaturallySpeaking was an instant success. It swept the awards for the software industry, won accolades from both the technology and the business community and made the “editors’ choice” and “best of” lists in the major computer magazines. Doctors, lawyers and average computer users could now just talk to create text and data. Now, 26 years after the Bakers had first predicted success within 25 to 40 years, they had reached their goal.
As with many accomplishments, however, the Bakers realized that this was only the beginning of a new journey. Mainstream newspapers such as USA Today and the New York Times, while enthusiastic about the prospects of the new technology, noted that it was not without its kinks—requiring substantial time to train the computer properly and still making mistakes.
“We had achieved our initial goal, but we didn’t think we had achieved our pinnacle,” acknowledges Baker. Now that they had cracked the nut of how to transcribe speech, the Bakers redoubled their efforts to make the product better. In addition to improving the accuracy rate of the software, they set out to exploit new applications.
In short order, the company grew to 400 employees, with six different lines of research, including multilingual dictation and transcription; telephone systems; embedded technologies for cell phones, mobile devices, and auto navigation; voice-enabled Internet, which was just then becoming widespread; and audio mining for searching recorded speech for specific words, topics and speakers.
Such an ambitious program would require more cash. With revenues of nearly $70 million, Dragon Systems had a head start on all of these technologies, but the Bakers feared they wouldn’t keep it for long with the big boys on their heels. Other companies had periodically offered to buy Dragon Systems—now the Bakers seriously considered two such overtures, one from Visteon, an $18 billion division of Ford Motor Company, and the other from Lernout & Hauspie, a $10 billion Belgian speech software company.
Beginning of the End
Then came “the debacle.”
In addition to hiring top lawyers and accountants, the Bakers paid $5 million to the Wall Street firm of Goldman Sachs to perform due diligence. Before considering the offers, they wanted to make sure the companies were on sound financial footing and would come through on what they promised. Looking back on the decision now, Baker still insists they did the right thing. “We spent millions of dollars trying to do it the best way possible with the best partners we could find,” she says. “We always had a policy of trying to find the best possible people to do what could be done, but—” she continues, her voice lowering, “we lost big time.”
After merger arrangements with Visteon fell through, the Bakers decided to go with L&H, signing a deal for $580 million. Initially, the deal was supposed to be half in cash and half in stock. But just before signing, L&H spent a substantial amount to acquire another company, Dictaphone. The company asked the Bakers for an all-stock deal instead, and they agreed. The deal went through in June 2000. The champagne had hardly been drunk, however, before a Wall Street Journal reporter published a series of scathing articles, starting in August, suggesting that L&H had cooked its books: that it had created fictitious orders from customers in Asia and blatantly fabricated revenues. By the end of November, L&H had gone bankrupt, taking the Bakers’ company, and most of their money, with it.
A dozen years later, Baker still seems incredulous about the sudden reversal of fortune. “We kept on trying to get it back,” she says. “We tried to raise money to buy back a piece of it, and kept trying and trying until it became clear we were not going to get any of it back.”
In the fire sale that followed the bankruptcy, a company called ScanSoft bought most of L&H’s assets—including the invaluable Dragon computer code—on the cheap. Engaging in a speech company shopping spree, ScanSoft acquired Nuance in 2005, and took its name to become Nuance Communications.
It wasn’t until 2010 that the executives from L&H were held accountable for their crimes. Nearly a decade after the company collapsed, its founders, Jo Lernout and Pol Hauspie, were convicted in a Belgian court of accounting fraud and sentenced along with two other executives to three- to five-year terms.
Because of overcrowding in the Belgian prisons, though, their sentences were all commuted and they served little actual time. The Bakers have filed suit against Goldman Sachs in U.S. District Court in Boston for a billion dollars in damages, alleging that the company failed to vet L&H’s finances. Under advice from her lawyers, Baker won’t talk about the lawsuit, but a long feature story in the New York Times this summer detailed a string of missteps by Goldman, in which the company entrusted the deal to four unsupervised junior executives, who failed to adequately check L&H’s customers.
For its part, Goldman Sachs has argued in court papers that checking L&H’s financials was not Goldman’s responsibility, but was properly the role of its accounting firm, Arthur Andersen. Besides, Goldman has argued, since the deal was made by the board of Dragon Systems—a business that no longer exists—the Bakers have no right to sue them. The financial services company has further demanded the Bakers pay all its legal fees.
Meanwhile, the Bakers have been left to pick up the pieces, watching helplessly while Nuance’s stock value has skyrocketed in recent years, now exceeding $7 billion. (The company made an estimated $60 to $120 million on the Apple deal alone.)
“We lost big time,” Baker repeats quietly. “But life goes on.” This year in Kyoto, Jim and Janet Baker were honored “for fundamental contributions to the theory and practice of automatic speech recognition” with what amounts to a lifetime achievement award by the international engineering trade association IEEE, which declared in an award ceremony that the pair has “reshaped the field of automatic speech recognition.”
Apple won’t talk about how much Dragon’s software undergirds the infrastructure of Siri, and Baker won’t talk about Siri, either. It’s public knowledge, though, that during the development of the software, Nuance contributed the portion that transcribes a user’s speech—both for transcribing voice to text messages, notes and emails, and for asking Siri questions like “Where can I find a hamburger joint?” It was Siri Inc. that developed the app’s natural-language-understanding “back end,” the part that figures out appropriate responses to those questions. So Baker probably escapes blame from Siri’s critics, who fault the virtual assistant for her sometimes frustrating and less than helpful responses.
The Final Frontier
Baker is bullish about the current boom in speech recognition applications, even if she isn’t at the heart of its proliferation. She compares it to the excitement over networks a decade ago, when connecting computers was a complicated feat. “Now people don’t think of networks, because they are integrated into everything we are doing,” she says.
She envisions a world, not too far in the future, where speech technology is considered equally mundane. “Speech recognition will be most successful when it’s invisible, just incorporated into things as another way of interacting,” she says, “and you don’t have to jump through hoops or go through a lot of preparation to do it.”
Besides lecturing on entrepreneurship at business schools and consulting on speech technology at the MIT Media Lab, Baker has returned to her roots in biology. As a researcher at Harvard Medical School, she studies the most complex computer there is: the human brain.
Even as Siri and other programs have begun to mimic the brain’s potential, the way humans understand language is still a black box to neuroscience. Using tools like magnetoencephalography that didn’t exist decades ago, Baker has gone back to square one, trying to understand how the brain distinguishes between different words and concepts.
“We started speech recognition with only 10 or 20 words,” says Baker, referring to the early days of Dragon, when calculating how to transcribe even that many words was an accomplishment. Now, with the brain, she says, “we are back to 10-word recognition. We can do pretty well figuring out when and where it’s happening, but exactly how it’s doing it is still a mystery.”
If the dragon is the symbol of power and creativity, then the brain might be the most formidable beast of all.
This article first appeared in the Fall 2012 issue of Tufts Magazine.
Michael Blanding is author of The Coke Machine: The Dirty Truth Behind the World’s Favorite Soft Drink (Avery, 2010) and is at work on his next book, The Map Thief (Gotham, 2013).