
Speech Recognition Maligned?

October 19, 2011
by Joe Marion
Take a look at ASR implementation practices before pointing the finger

I recently came across a posting regarding a study published in the American Journal of Roentgenology (AJR, October 2011, Vol. 197:4, pp. 1-5) that describes how “breast imaging reports generated with automated speech recognition software are eight times more likely to contain major errors than those generated with conventional dictation transcription.” The implication is that automated speech recognition (ASR) is somehow to blame for the errors. I would submit that this is indicative of poor ASR implementation practices, not of the ASR application itself.

I base my conclusion on my own experience as an early reseller of the original IBM MedSpeak/Radiology product, dating back to the mid-1990s. Radiologists are used to dictating and having a transcriptionist edit their dictation. In many situations, the sheer volume of dictated reports leads the radiologist to rely on the transcriptionist to catch obvious errors and raise questionable ones with the radiologist, so that signoff is a mere formality. Oftentimes, a radiologist will sign the stack of reports with no further review other than a casual glance. Such an approach worked well, but it was time-consuming and expensive.

Enter ASR technology. Such applications were originally designed to enable the dictating physician to self-edit the dictation and correct misrecognitions. Later iterations included workflow models that allowed a transcriptionist “editor” to make the corrections, to satisfy the criticism of radiology groups that felt self-editing was “beneath them” or too time-consuming.

To understand the issue with editing, one has to understand the mechanics of ASR applications. Most use multiple methods to convert speech into text. First, the application listens to the sound and makes a “best guess” at the dictated word based on frequency tables of the sounds, or phonemes. Next, many applications apply rule sets to improve the context of the transcription. Specialized dictation limits the available word choices depending on the vocabulary, or topic; radiology’s vocabulary is likely different from pathology’s, so words in the active vocabulary have a higher probability of selection. Lastly, the applications look at the context of the words and try to predict the likely next word. For example, in the phrase “clinical indication: colon cancer,” the punctuation “:” and the word “colon” sound the same. How does the system know that the first sound is punctuation and the second is anatomy? “Clinical indication” is interpreted as a header phrase, and punctuation usually does not follow punctuation, so the first occurrence is interpreted as punctuation and the second as anatomy.
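As a rough illustration of that last step, the toy Python sketch below shows how simple context rules can disambiguate the spoken word “colon” as punctuation or anatomy. It is a sketch only, not any vendor’s actual engine: the header list and the two rules are invented for the example.

```python
# Toy context-rule disambiguation: decide whether the sound "colon"
# should be rendered as punctuation (":") or as the anatomical word.

# Hypothetical list of sounds with more than one possible written form.
ACOUSTIC_CANDIDATES = {"colon": [":", "colon"]}  # both sound identical

# Invented stand-in for the engine's knowledge of report headers.
HEADER_PHRASES = {"clinical indication", "impression", "findings"}

def interpret(tokens_so_far, heard_word):
    """Pick a written form for an ambiguous sound based on prior context."""
    candidates = ACOUSTIC_CANDIDATES.get(heard_word, [heard_word])
    if len(candidates) == 1:
        return candidates[0]          # unambiguous sound, nothing to decide

    prev_two = " ".join(tokens_so_far[-2:]).lower()
    prev_one = tokens_so_far[-1] if tokens_so_far else ""

    # Rule 1: a header phrase is usually followed by punctuation.
    if prev_two in HEADER_PHRASES:
        return ":"
    # Rule 2: punctuation rarely follows punctuation, so use the word.
    if prev_one == ":":
        return "colon"
    return "colon"

# "clinical indication colon colon cancer" as spoken, rendered in order:
dictated = ["clinical", "indication", "colon", "colon", "cancer"]
report = []
for sound in dictated:
    report.append(interpret(report, sound))
print(" ".join(report))  # clinical indication : colon cancer
```

Real engines weigh statistical language models rather than hand-written rules, but the principle is the same: the surrounding words decide between acoustically identical candidates.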

With that explanation as background, how does it relate to a misunderstanding of errors? In ASR applications, the transcribed text remains associated with the speech until the report is saved, at which time the application updates its sound and context rules based on the text associated with that speech. If one accepts or saves a report without correcting errors, the system wrongly assumes the text is a correct interpretation of the sound. Therefore, if errors are not corrected, the accuracy of the system declines, because the system is learning from its own mistakes.
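The toy sketch below illustrates that feedback loop. The names, numbers, and update rule are invented for the example and do not reflect any real product’s adaptation logic; the point is only that signing uncorrected text reinforces the error, while correcting it steers the statistics back.

```python
from collections import Counter, defaultdict

class ToyRecognizer:
    """Illustrative stand-in for an adaptive ASR engine; not a real product."""

    def __init__(self):
        # For each sound, counts of which text has been "confirmed" at signoff.
        self.observations = defaultdict(Counter)
        self.observations["effusion"]["effusion"] = 1  # weak initial prior

    def transcribe(self, sound):
        counts = self.observations[sound]
        return counts.most_common(1)[0][0] if counts else sound

    def sign_report(self, sound, final_text):
        # At signoff the engine assumes the final text correctly renders the sound.
        self.observations[sound][final_text] += 1

engine = ToyRecognizer()

# Reports are repeatedly signed without fixing a misrecognition ("infusion").
for _ in range(3):
    engine.sign_report("effusion", "infusion")
print(engine.transcribe("effusion"))   # infusion -- the error has been learned

# Correcting before signoff pushes the statistics back toward the right word.
for _ in range(5):
    engine.sign_report("effusion", "effusion")
print(engine.transcribe("effusion"))   # effusion
```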

I would submit that this is the major factor in the errors reported in the aforementioned AJR study. If the radiologist who dictated the report does not correct the errors, the system will continue to make them. My most shining example of this was a personal experience demonstrating the technology at the University of Iowa many years ago. To preclude any chance of my tricking the system with a canned report, they handed me a previously dictated report and asked me to dictate it verbatim. My initial embarrassment over “Barnhart catheter” being misrecognized as “barn yard catheter” was alleviated when I corrected the text and added a pronunciation using the application’s correction utility. To their utter amazement, my subsequent dictation was accurate!
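For illustration only, the hypothetical snippet below mimics the kind of correction utility described in that anecdote: a user-maintained lexicon mapping a pronunciation to a preferred written form. The class, the method names, and the “sounds like” string are assumptions for the sketch, not the actual IBM MedSpeak interface.

```python
class UserLexicon:
    """Hypothetical user correction lexicon; names are illustrative only."""

    def __init__(self):
        self.entries = {}  # pronunciation (as spoken) -> preferred written form

    def add_word(self, written, sounds_like):
        """Register a custom term and roughly how it is pronounced."""
        self.entries[sounds_like.lower()] = written

    def resolve(self, heard_phrase):
        """Prefer user-added terms over the engine's default guess."""
        return self.entries.get(heard_phrase.lower(), heard_phrase)

lexicon = UserLexicon()
lexicon.add_word("Barnhart catheter", sounds_like="barn heart catheter")

print(lexicon.resolve("barn heart catheter"))  # Barnhart catheter
print(lexicon.resolve("chest tube"))           # chest tube (unchanged)
```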

The bottom line is that I believe both vendors and support staff sometimes overlook the value of stressing to users the need to review their dictation and correct errors in order to improve accuracy. If more attention were paid to behavior modification in the implementation of ASR technology, I submit there would be fewer errors in the resultant reports. Perhaps the researchers who conducted the study should enforce error correction and repeat the study; I would submit that the results would be significantly improved. For those considering ASR applications, don’t be dissuaded by such research! Instead, make sure that adequate training and follow-up are included in the contract.



Joe, this is a great and timely post! I have been talking to people about how Apple is using its Siri personal assistant and the eventual application to healthcare. There are systems with the ability to take a dictated note and pass the information into an EMR template to data-mine the clinical elements. Of course, most providers are using Dragon to populate their note templates. However, the QA falls squarely on the person who has to sign the note. This is no different from accepting a transcribed document and signing and filing it without reading it.
I guess that as more people use technology, the expectation level of accuracy and "error free" increases.
The AJR statement that “breast imaging reports generated with automated speech recognition software are eight times more likely to contain major errors than those generated with conventional dictation transcription” just reflects that after a transcriptionist completes the note, it goes to a QA person before it is released to the provider (I worked for a transcription company in a previous life). QA of speech recognition is incumbent on the speaker.
AJR jumped to a conclusion about the technology without fully understanding the two processes.


Right you are! As a speech recognition reseller in a prior life, I used to start every discussion by suggesting that if the users were willing to adapt to the technology, we could have a meaningful discussion, but if the expectation was that the technology would adapt to the user, it would be a short discussion. I place some of the blame on the entities teaching users how to use the technology. If proper use habits are established from the beginning, users should have the discipline to correct errors. If not stressed, "you get what you pay for" and there will be errors!

I'm waiting for the funny misinterpretations of Apple's Siri - just like those for medical dictation using speech recognition!

Terrific post. What you're describing certainly generalizes to all forms of supervised machine learning. If the system is designed to learn and we don't teach it, we're to blame. Of course, that expectation is rarely set with users.

I just attended AHIMA two weeks ago, and speech recognition with NLP was discussed and shown by multiple vendors as well as in educational sessions. It's not something I can discuss in detail, since we have an NLP offering. That said, what's new with speech recognition is performance, and the emerging paradigm is actual dialogue with physician users. There's more value in providing some feedback to the stack that includes ASR. Will it take another decade before we get the cadence right? I hope not.

One of the big take-homes for me from reading your post is that it's more worthwhile than I realized to spend the time correcting speech recognition with the tools provided. These systems cannot learn without patient (as in patience) teaching!