code and the oracular

Report on Automatic Speech Recognition


This is a report I wrote when I was a consultant to a secretarial and transcription company. I recommended a pilot scheme to test ASR, and the verdict after running it was that the technology was not yet mature.

ASR Report

The encroachment of Artificial Intelligence techniques into areas previously undertaken only by humans continues inexorably (see the work of Ray Kurzweil): financial decisions, legal expert systems and autonomous vehicle control are all examples. As the price-performance of hardware improves in line with Moore’s law, this trend will become ever more important. One human-level task that is increasingly automated is ASR – Automatic Speech Recognition (to be contrasted with voice recognition, which is the fingerprinting of a person’s actual voice). Anyone in the transcription industry should take note of this trend.

To digress, Moore’s law is not so much a law as an observed trend: strictly, the number of transistors on a chip doubles roughly every two years, which is popularly quoted as microprocessor performance doubling roughly every 18 months. This is an example of exponential growth, although in reality the curve is not smooth but stepwise, made up of lots of little leaps as new breakthroughs are made in the underlying technologies. There may be a limit to Moore’s law, i.e. this growth may not continue forever. One barrier is that there is a physical limit to how small components on a silicon chip can be made. Below about 20 nanometers they are expected to fail because of quantum leakage of the electrons used to carry signals. Of course many hope that another breakthrough will enable the growth to continue. This would have to be a whole new computing paradigm, perhaps leaving behind the silicon basis of current computers completely and using light instead, so-called ‘photonics’.

In the short term, though, most analysts will continue to bank on Moore’s law, certainly until the 20nm limit approaches.
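To make the exponential claim concrete, here is a minimal sketch of the arithmetic, assuming the popular 18-month doubling period quoted above. The function name `growth_factor` and its parameters are illustrative only, and the smooth curve it produces ignores the stepwise leaps seen in practice.

```python
def growth_factor(years: float, doubling_period_years: float = 1.5) -> float:
    """Return the multiplicative growth after `years` for a fixed doubling period.

    The default of 1.5 years is the popular "18 months" figure; it is an
    assumption, not a measured constant.
    """
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Print the idealised growth over a few time spans.
    for years in (1.5, 3, 6, 15):
        print(f"{years} years -> growth factor {growth_factor(years):,.0f}")
```

On this idealised curve, fifteen years contains ten doublings, which is a growth factor of 1024. That is why even a modest-sounding doubling period matters so much over the lifetime of a business.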

Hardware improvement is not the only measure of our computing power, though, since even an ultra-fast machine needs to be programmed, and this is where the real art lies. Some would argue that our intelligence in constructing programs is the real bottleneck, perhaps pointing to a “complexity ceiling” which prevents humans from writing programs longer than a few tens of thousands of lines of high-level language code. In this sense we may well not even be using the full power of existing computers, since we are unable to exploit it with perfect code. I personally feel that progress does continue as in other sciences, and that even if Moore’s law slowed down, progress would still continue steadily as our scientific knowledge matures and beds down and new software techniques are discovered.

The upshot of all this is that processing times for ASR systems are steadily shortening. More powerful machines are one cause; the other is the improvement of the software itself as programming techniques advance. Thus there is a twofold push up the performance curve.

The following are some of the major growing applications of ASR:

  • content based audio search
  • voice dialing
  • call routing
  • aircraft control
  • automatic real time voice translation
  • healthcare applications, particularly radiology reporting
  • legal transcription

An additional benefit is the reduction of RSI from typing.

The success of ASR in a medical context is perhaps driven by shortage of funding and huge workloads, as well as by the fact that doctors in hospitals do most of their work away from terminals. Perhaps the biggest single application is radiology reporting. However, applying ASR to other fields such as legal document transcription or business meeting transcripts is not so dissimilar in terms of requirements. A complex medical vocabulary is simply one example of a specialised linguistic domain, and other domains should be suitable for ASR with no fundamental redesign of the tools.

Estimates of the medical transcription market range around $10 billion for North America alone (from Speech in the Warehouse).

Also see: Nuance Speech Recognition Used to Digitize Radiology Reporting at Royal Free Hospital in London

The development of ASR has not all been a success story, though. Ray Kurzweil, best known in the popular mind as the developer of an early music synthesiser, created an innovative (for those days) system known as Kurzweil Voice Report in 1987. The company was acquired by Lernout and Hauspie but later failed, due to immaturity of the technology. The history of AI has passed through various “AI winters”, but I argue these were often a matter of perception, caused by a short-termist optimism which was too easily disappointed. People expected human-level performance instantly, and so the bubble burst.

More recently, though, major breakthroughs have been made, and speech transcription can now usually offer approximately 99% word accuracy.

The so-called AI effect occurs when cutting-edge technologies for computationally hard problems mature and come to be seen as merely clever programming, rather than as something showing the power of human-level problem solving.

This effect recurs in many different areas. As the scientific process expands and new problems fall to its methods, this creep clearly shows the expanding frontier of the computer revolution, which subsumes more and more tasks previously seen as possible only for human workers. More and more businesses are either capitalising on this process or falling by the wayside as they fail to innovate successfully.

The DARPA Grand Challenge saw its first winner (date) when an autonomous vehicle successfully negotiated the 120 mile course. AI skeptics should take note as the evidence of steady progress in the field builds up.

Effective ASR technologies might become integrated into the conventional toolset of desktop office computing. Will there be a role for an outsourced ASR service in the future? A smooth, low-cost service might still appeal, perhaps based around a simple process of emailing an MP3 uploaded from a voice recorder and receiving back an email with the text transcript in a matter of minutes. It would avoid software installation, maintenance and backups, along with the minor irritations and learning curve that increasingly motivate many business people to opt for “software as a service” instead of “DIY” client-based applications run in house.

Marketing should thus stress ease of use, since this is the main driver for hiring a service rather than running your own ASR software in house.

Some of the major providers

  • Philips
  • Nuance (Dragon)
  • Dictaphone
  • Scansoft
  • Lumenvox
  • IBM (WebSphere)

A valid assessment of the feasibility of ASR technologies would have to involve a comparison of the error rates of human tasks and machine performed tasks.
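One common way to put a number on such a comparison is the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn a transcript into a trusted reference, divided by the length of the reference. The sketch below is a plain edit-distance implementation, not the scoring method of any particular vendor. The function name and the crude normalisation (lower-casing and splitting on whitespace) are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by reference length.

    Normalisation here is deliberately crude (lower-case, whitespace split);
    a real comparison would also need to agree on punctuation and numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the patient has a hairline fracture"
    machine = "the patient as a hairline fracture"
    print(word_error_rate(reference, machine))
```

Scoring both the human and the machine transcript against the same corrected reference gives directly comparable figures. A claimed "99% word accuracy" corresponds roughly to a WER of 0.01, although the two are not exact complements because inserted words also count as errors.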

Some, e.g. Dictaphone, use statistical NLP techniques, such as standardising medical terms with the SNOMED US medical database. This represents a fine-tuning of the transcription process that is an add-on to the basic speech-to-text engine. Other additional modules, such as grammar and syntax checking, are also common.


“It is the intermarriage of systems that has added value to the firm,” says Paul Parsons, senior partner at Greenwoods, a London law firm. “I would not have the speech recognition system if it was not fully integrated into the case management system.” ASR for court stenography is also growing.

The Industry as a Whole

Standards are being agreed upon in the industry and code is being shared, e.g. VoiceXML, SALT, CCXML and MRCP, which will accelerate progress further.

During the past few weeks I have attempted a thorough assessment of the Nuance Dragon software, which is the most popular ASR product in the UK. I have attempted to integrate the use of transcription into my working habits and to assess whether time and effort have been saved.

For me the habit of using a keyboard is a very ingrained one, and it took some time before I was able to assess how much work this software has saved me, just as a personal user. The learning curve would indeed be faced by almost all users. Different working habits may need to be developed such as the habit of brainstorming large amounts of speech into text and then reassessing and correcting later. The ease with which words can be spoken does create a pleasant sense of instant creativity, but errors seem to become fiddly and irritating too.

Human secretaries are able to resolve similar sounds into text in context by making guesses about what the speaker “means”. Disambiguation of this kind is beyond current computer systems, since the system would have to “understand” the semantic content of the speech. This is still a failing of all state-of-the-art ASR systems. “Ums” and “ers” in speech confuse the system, and since nearly every speaker produces them there are often errors in the transcript. Another major issue is the partial necessity of training the software, which would obviously become difficult with multiple speakers. I will be contacting Nuance to find out more about their enterprise-grade system and to ask them to help me obtain an estimate of how effective their software could be with zero training. My version, Dragon NaturallySpeaking 10, allows untrained use, and once again this is a critical factor we can only assess with more testing. I will also ask Nuance how to implement a client-server architecture in which Dragon runs on a remote server, has MP3 voice files transmitted to it, and returns transcripts in some appropriate document form.

A complete assessment of the state of the industry would have to involve a careful analysis of the extent to which the accelerating computing power delivered by Moore’s law will improve the performance of automatic speech recognition systems. Amazingly powerful machines must still be programmed, and in fact modern feature-bloated software sometimes tends to become less time-efficient. What I have found is that there is still an appreciable error rate, and at the moment I am unable to quantify how it would show up in day-to-day use in your business. Without a more detailed idea of the spectrum of subject domains covered by your clients, I feel it would be hard to do a proper job of testing ASR technologies on my own.

I feel the real world is usually captured better by long experimentation than by instant theorising. Suppose a small proportion of incoming speech files in your business were scheduled to undergo both kinds of transcription, both transcripts were then corrected by a human, and the finished results (and time taken) were carefully compared. In my view this would give us the best opportunity to assess the relative effectiveness of the two rival transcription methods in situ.

Looking further ahead, if a “silver” ASR-based service were offered at a slightly lower cost than the “gold” human transcription service, an ongoing process could continue to refine the techniques and business processes involved in establishing an ASR-based arm of your enterprise. If ASR becomes truly viable you will then be well prepared.

Practice makes perfect, and so my recommendation is a careful pilot scheme to test a chosen variety of ASR software for ease of integration and time / labour saved. One way to approach this pilot scheme would be to compare your cost of transcribing one minute of speech in the existing way with the cost using the ASR process, thereby obtaining an efficiency measure in pounds per minute or other suitable units. It may then also be possible to make an educated quantitative guess as to how rapidly developments in future ASR systems will reduce this basic cost and thus render ASR more feasible.
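The pounds-per-minute measure could be worked out along the following lines. This is only a sketch: the `PilotJob` record, the `cost_per_audio_minute` function, the hourly rate and the fixed cost (standing in for licence or hosting fees) are all hypothetical, and every figure in the example is a made-up placeholder rather than pilot data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PilotJob:
    """One dictation processed during the pilot (hypothetical record layout)."""
    audio_minutes: float   # length of the recording
    labour_minutes: float  # human time spent typing and/or correcting it


def cost_per_audio_minute(jobs: Iterable[PilotJob],
                          hourly_rate: float,
                          fixed_cost: float = 0.0) -> float:
    """Return the cost in pounds of producing one finished minute of transcript.

    `fixed_cost` is an assumed lump sum (e.g. software licences) spread over
    the jobs in the sample; set it to zero for the purely manual method.
    """
    jobs = list(jobs)
    total_audio = sum(job.audio_minutes for job in jobs)
    if total_audio <= 0:
        raise ValueError("the sample must contain some audio")
    labour_cost = sum(job.labour_minutes for job in jobs) / 60.0 * hourly_rate
    return (labour_cost + fixed_cost) / total_audio


if __name__ == "__main__":
    # Placeholder figures only: the same two recordings handled both ways.
    manual = [PilotJob(10, 40), PilotJob(5, 20)]
    asr_then_corrected = [PilotJob(10, 25), PilotJob(5, 11)]
    print("manual:", cost_per_audio_minute(manual, hourly_rate=12.0))
    print("ASR:   ", cost_per_audio_minute(asr_then_corrected,
                                           hourly_rate=12.0, fixed_cost=3.0))
```

Running the same recordings through both methods, as in the example, keeps the comparison fair: differences in speaker, subject matter and audio quality affect both figures equally.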

We should also try and track the increasing adoption of ASR in your client base which may further influence take-up of your current human transcription service. Perhaps a small web based questionnaire on your site could be used to probe how clients, from the various sectors, view using ASR, and how many of them have considered it as a real alternative. This would be helpful and easy to accomplish.


Written by Luke Dunn

April 10, 2010 at 5:02 pm
