Behind Apple's Siri Lies Nuance's Speech Recognition

Although Apple won’t confirm it publicly, quite a bit of Siri’s cleverness in understanding spoken language comes from technology supplied by Nuance Communications, Inc.  And such is the sensitivity of Apple’s suppliers to crossing the computing-device giant that even a year after Nuance’s CEO confirmed the relationship, Nuance rank and file are reluctant to talk about it.

Nuance is arguably the most advanced speech recognition company in the world.  It has absorbed nearly every small company working on the problem for the past couple of decades.  Dragon Systems, which possessed advanced speech recognition technology when it ran into financial trouble, has become one of Nuance’s crown jewels.  Other companies working on the problem include Google, IBM, and Microsoft.

Natural input methods have been a holy grail in the personal computing industry since the beginning.  Soon after founding Microsoft, CEO Bill Gates began talking about ways other than the keyboard and mouse to get data into computers.  He recognized that there was something essentially strange about the input devices most of us had come to accept.

The first real breakthrough in human interface came with the iPhone in 2007, when touch on glass was finally integrated so well that it felt, well, natural.

And Microsoft came up with Kinect in 2010, the best example of gesture input (human movement without direct contact).

But voice, one of the most obvious methods, has been strangely elusive.  Voice can be used for both control (Okay, Google, open Maps!) and transcription (spoken words converted to text).  But people demand extremely high accuracy in speech-to-text conversion, and even 99% isn't enough.  After all, on an average page of 300 words, that’s three errors per page.  The industry has had little trouble getting above 90%, but the last few percentage points continue to be a slog.

I've recently been trying out Nuance’s Dragon Dictate 4 for Mac, which represents the state of the art in recognition.  You have to train the software on your particular speech patterns, which takes a few minutes, but then it’s pretty robust.  It managed supercalifragilisticexpialidocious pretty well, but got hung up on antidisestablishmentarianism.  And I didn't go easy on it, mumbling in my usual manner rather than articulating clearly and avoiding slang, as I would if speaking English to a non-native speaker.  Even so, it was both fast and accurate most of the time.  The editing commands made correction easier, as I was able to select words and change them without touching the mouse or keyboard.

Essentially, speech recognition takes phonemes (speech sounds) and tries to make them into words.  Early work on speech — most notably championed by Massachusetts Institute of Technology (MIT) Professor Noam Chomsky — tried to put all human language into a single model.  He believed there was a universal grammar.
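
To make that first step concrete, here is a toy sketch of my own (not Nuance's code) that looks up a phoneme sequence in a pronunciation dictionary; the phoneme symbols and entries are invented, and a real recognizer scores an enormous dictionary probabilistically rather than demanding an exact match:

    # Toy sketch of the phoneme-to-word step: look up a phoneme sequence in a
    # small pronunciation dictionary.  Symbols and entries are invented.

    PRONUNCIATIONS = {
        ("K", "AE", "T"): "cat",
        ("K", "AA", "T"): "cot",
        ("D", "AO", "G"): "dog",
    }

    def phonemes_to_word(phonemes):
        """Return the dictionary word matching a phoneme sequence, if any."""
        return PRONUNCIATIONS.get(tuple(phonemes), "<unknown>")

    print(phonemes_to_word(["K", "AE", "T"]))   # cat
    print(phonemes_to_word(["B", "IH", "G"]))   # <unknown>

The hard part, of course, is choosing among many near-matches, which is where the statistics described below come in.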

The problem with his model was that real language — particularly when accounting for all human language, even in obscure pockets of the world — constantly violated the model.  There were so many exceptions that the rules seemed arbitrary.

At a small computational linguistics company called ILA that I worked for in the early 1990s, when Chomsky’s theories were still the guiding light for many of us, we tried to implement language models, only to discover that the “tableware,” the list of exceptions and their handling, required far more work than the models.

According to my good friend Dave Baggett — who cut his eye teeth in MIT’s artificial intelligence lab, worked with me at ILA, went on to co-found ITA Software (which used some of the same linguistic principles to optimize airline fare databases and was eventually bought by Google for $730 million), and is now, in his latest venture, Inky, “fixing email” — “the empirical data from real language doesn't entirely fit the model.”

His reaction to the idea that Chomskyites are disturbed that higher-level linguistic features can't be described in models: “they should be disturbed — it's evidence we still don't understand what the hell is going on.”

“The lower you go down in the language stack,” Baggett says, “the better linguistics people understand it.  That is, phonology is pretty well understood, morphology a bit less, and syntax a bit less.  Semantics is a fail.”

What has happened since Chomsky reigned all but unchallenged is that computing has become more powerful and language databases (corpora) have become huge.  With these tools, companies like Nuance that work on speech recognition have turned almost entirely to statistical methods.  Rather than trying to model language the old way, their programs simply look at how often particular word juxtapositions occur, “voting” on their confidence in various answers when more than one candidate presents itself.

In other words, with a huge corpus and faster computing it is possible to determine the likelihood of any particular word following another, regardless of grammatical relationship.  You don’t have to solve the Chomskyan problem of how language and meaning are structured.  You can just do it mathematically.
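
As a rough illustration (my own sketch, with a tiny invented corpus, not anything a vendor actually ships), counting adjacent word pairs in a body of text yields exactly those likelihoods:

    from collections import Counter

    # A tiny invented corpus; real systems train on billions of words
    corpus = "open the maps app please open the mail app".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(prev_word, word):
        # P(word | prev_word), estimated directly from raw counts
        return bigrams[(prev_word, word)] / unigrams[prev_word]

    # "the" is followed by "maps" and "mail" equally often in this corpus, so
    # the two candidates each get a confidence "vote" of 0.5
    print(bigram_prob("the", "maps"))   # 0.5
    print(bigram_prob("the", "mail"))   # 0.5

Production systems use far larger corpora and cleverer smoothing, but the principle is the same: frequency stands in for grammar.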

However, there are mathematics and then there are mathematics.  Brute force — or the examination of every possibility — uses a lot of resources and for some problems would take impossibly long.  So, smart algorithms are always desirable as a way to cut through the thicket.  Most speech recognition now makes use of some variant of the Viterbi algorithm.
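
For the curious, here is a minimal sketch of Viterbi decoding on a toy hidden Markov model (my own illustration; the two states and their probabilities are invented and bear no resemblance to a real acoustic model):

    # Toy hidden Markov model: guess whether each acoustic frame is a vowel or
    # a consonant sound.  States and probabilities are invented.

    def viterbi(observations, states, start_p, trans_p, emit_p):
        """Return the most likely state sequence for the observations."""
        # V[t][s] = (probability of the best path ending in state s at time t,
        #            the previous state on that path)
        V = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
        for t in range(1, len(observations)):
            V.append({})
            for s in states:
                prob, prev = max(
                    (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][observations[t]], p)
                    for p in states
                )
                V[t][s] = (prob, prev)
        # Trace back from the most probable final state
        state = max(V[-1], key=lambda s: V[-1][s][0])
        path = [state]
        for t in range(len(observations) - 1, 0, -1):
            state = V[t][state][1]
            path.insert(0, state)
        return path

    states = ["vowel", "consonant"]
    start_p = {"vowel": 0.5, "consonant": 0.5}
    trans_p = {"vowel": {"vowel": 0.3, "consonant": 0.7},
               "consonant": {"vowel": 0.6, "consonant": 0.4}}
    emit_p = {"vowel": {"low_energy": 0.2, "high_energy": 0.8},
              "consonant": {"low_energy": 0.7, "high_energy": 0.3}}

    print(viterbi(["high_energy", "low_energy", "high_energy"],
                  states, start_p, trans_p, emit_p))
    # ['vowel', 'consonant', 'vowel']

The point is that Viterbi finds the most likely sequence of hidden states without enumerating every possible path, which is why it is so much cheaper than brute force.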

As Baggett puts it, “Linguistic theory helps us by generating higher-level features for statistical machine learning processes.  And machine learning is a great tool when you have no clue of the underlying structure.”

An example is simulated annealing for searching a large space.  The algorithm is named after, and modeled on, the way cooling metal alloys form larger crystals the more slowly they cool.  “Tailored search is always better,” says Baggett, “but if you don't know how to tailor, simulated annealing still works better than brute force.”
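
In miniature, the technique looks something like this (my own sketch; the cost function and cooling schedule are invented): the search accepts occasional bad moves while the temperature parameter is high and settles down as it cools.

    import math
    import random

    def cost(x):
        # Toy objective: minimized at x = 3
        return (x - 3) ** 2

    def simulated_annealing(start=0.0, temp=10.0, cooling=0.95, steps=2000):
        x = start
        best = x
        for _ in range(steps):
            candidate = x + random.uniform(-1, 1)     # propose a nearby solution
            delta = cost(candidate) - cost(x)
            # Always accept improvements; accept worse moves with probability
            # exp(-delta / temp), which shrinks as the temperature drops
            if delta < 0 or random.random() < math.exp(-delta / temp):
                x = candidate
            if cost(x) < cost(best):
                best = x
            temp = max(temp * cooling, 1e-9)          # cool slowly, never to zero
        return best

    print(round(simulated_annealing(), 2))            # lands near 3.0

Even this crude version reliably gets close to the minimum without examining the whole search space.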

With these statistical methods, speech recognition, particularly speaker-dependent recognition, has largely been solved.  And so, most of the time, Siri at least knows what words you are saying.  The next step — language understanding — is much tougher, lying, as it does, higher up the linguistic abstraction chain.  Again, massive computing and vast storage come to the rescue.  Just this past week, IBM announced that its Watson cognitive system will be used to help oncologists at the New York Genome Center do genomic research.  Watson uses statistical methods to “understand” language, allowing researchers to do a smart literature search quickly.  Apple, Google, and Nuance are also working on language understanding.

In time, Siri and her ilk will increasingly get it right — not just what you said, but what you meant.

Meanwhile, Nuance’s Dictate is clearly a boon for anyone doing text input, and particularly for people with disabilities.  For those with good ideas but poor typing skills, Dictate is a great way to create text quickly.

Twitter: RogerKay