In their paper entitled "English Broadcast News Speech Recognition by Humans and Machines," a team from IBM and Appen sets out to identify techniques that close the gap between automatic speech recognition (ASR) and human performance.
IBM’s initial work in the voice recognition space was done as part of the U.S. government’s Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program, which led to significant advances in speech recognition technology. The EARS program produced about 140 hours of supervised broadcast news (BN) training data and around 9,000 hours of very lightly supervised training data drawn from the closed captions of television shows. By contrast, EARS produced around 2,000 hours of highly supervised, human-transcribed training data for conversational telephone speech (CTS).
Because so much training data is available for CTS, the team from IBM and Appen endeavored to apply similar speech recognition strategies to BN to see how well those techniques translate across applications. To understand the challenge the team faced, it helps to call out some key differences between the two speech styles:
Broadcast news (BN)
Conversational telephone speech (CTS)
The team adapted the speech recognition systems that were so successfully used for the EARS CTS research: multiple long short-term memory (LSTM) and ResNet acoustic models trained on a range of acoustic features, along with word and character LSTMs and convolutional WaveNet-style language models. This strategy had produced word error rates between 5.1% and 9.9% for CTS in a previous study, specifically on the HUB5 2000 English Evaluation conducted by the Linguistic Data Consortium (LDC). The team tested a simplified version of this approach on the BN data set, which wasn’t human-annotated, but rather created using closed captions.
Instead of adding all the available training data, the team carefully selected a reliable subset, then trained LSTM and residual network-based acoustic models with a combination of n-gram and neural network language models on that subset. In addition to automatic speech recognition testing, the team benchmarked the automatic system against an Appen-produced high-quality human transcription. The primary language model training text for all these models consisted of a total of 350 million words from different publicly available sources suitable for broadcast news.
In the first set of experiments, the team separately tested the LSTM and ResNet acoustic models in conjunction with the n-gram language model (LM) and a feed-forward neural network language model (FF-NNLM), then combined scores from the two acoustic models and compared the outcome with the results obtained on the older CTS evaluation. Unlike the results observed in the original CTS testing, combining scores from the LSTM and ResNet models produced no significant reduction in the word error rate (WER). The LSTM model with an n-gram LM performs quite well on its own, and its results improve further with the addition of the FF-NNLM.
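As a rough illustration of what combining scores from two acoustic models can look like, the sketch below takes a weighted sum of per-frame log scores from two models. This is an assumption made for illustration only: the article does not describe the combination method, and the function names, the weight, and the toy values are all hypothetical.

```python
import math
from typing import List

# One row per audio frame, one log score per output class.
Frames = List[List[float]]


def combine_frame_scores(lstm_scores: Frames, resnet_scores: Frames,
                         lstm_weight: float = 0.5) -> Frames:
    """Log-linear (weighted-sum) combination of two models' frame scores.

    Hypothetical helper: the weighting scheme is an assumption, not the
    method reported in the paper.
    """
    if len(lstm_scores) != len(resnet_scores):
        raise ValueError("both models must score the same number of frames")
    combined = []
    for lstm_row, resnet_row in zip(lstm_scores, resnet_scores):
        if len(lstm_row) != len(resnet_row):
            raise ValueError("both models must score the same classes")
        combined.append([
            lstm_weight * a + (1.0 - lstm_weight) * b
            for a, b in zip(lstm_row, resnet_row)
        ])
    return combined


def best_class_per_frame(scores: Frames) -> List[int]:
    """Index of the highest-scoring class in each frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy log posteriors for two frames and two classes (made-up values).
lstm = [[math.log(0.7), math.log(0.3)], [math.log(0.2), math.log(0.8)]]
resnet = [[math.log(0.4), math.log(0.6)], [math.log(0.1), math.log(0.9)]]
combined = combine_frame_scores(lstm, resnet)
print(best_class_per_frame(combined))
```

When the two models mostly agree, as they would if one is already strong, the combined decision rarely differs from the stronger model's own, which is one way a combination can yield little gain.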
For the second set of experiments, word lattices were generated after decoding with the LSTM+ResNet+n-gram+FF-NNLM model. The team generated n-best lists from these lattices and rescored them with the LSTM1-LM. LSTM2-LM was also used to rescore word lattices independently. Significant WER gains were observed after using the LSTM LMs. This led the researchers to hypothesize that the secondary fine-tuning with BN-specific data is what allows LSTM2-LM to perform better than LSTM1-LM.
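The n-best rescoring step can be sketched as follows. This is a minimal illustration, not the team's implementation: the `toy_lm_log_prob` function stands in for an LSTM language model, and the interpolation weight and all scores are made-up assumptions.

```python
from typing import Callable, List, Tuple

# (hypothesis text, first-pass log score from the decoder)
Hypothesis = Tuple[str, float]


def rescore_nbest(nbest: List[Hypothesis],
                  lm_log_prob: Callable[[str], float],
                  lm_weight: float = 0.5) -> List[Hypothesis]:
    """Add a weighted LM log probability to each first-pass score and
    return the hypotheses sorted from best to worst."""
    rescored = [
        (text, first_pass + lm_weight * lm_log_prob(text))
        for text, first_pass in nbest
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


def toy_lm_log_prob(text: str) -> float:
    """Hypothetical stand-in for a neural LM: a lookup of made-up scores."""
    scores = {"recognize speech": -2.0, "wreck a nice beach": -8.0}
    return scores.get(text, -20.0)


# The first pass slightly prefers the wrong hypothesis (made-up scores).
nbest = [("wreck a nice beach", -9.5), ("recognize speech", -10.0)]
reranked = rescore_nbest(nbest, toy_lm_log_prob)
print(reranked[0][0])
```

Here the language model score outweighs the small first-pass preference, so the second hypothesis moves to the top of the list.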
Our ASR results have clearly improved state-of-the-art performance, and significant progress has been made compared to systems developed over the last decade. Compared with the human performance results, the ASR WER is about 3% worse in absolute terms, that is, about three percentage points higher. Although the machine and human error rates are comparable, the ASR system has much higher substitution and deletion error rates.
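For readers unfamiliar with how these error types are counted, here is a minimal sketch of WER scoring with a substitution, deletion, and insertion breakdown, using a standard word-level edit-distance alignment. The function and class names are illustrative, the example sentences are invented, and this is not the scoring tool used in the paper.

```python
from typing import NamedTuple


class ErrorCounts(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    reference_length: int

    @property
    def wer(self) -> float:
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_length


def count_errors(reference: str, hypothesis: str) -> ErrorCounts:
    """Align hypothesis to reference by edit distance and count error types."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] against hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no added cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            dp[i][j] = min(substitution, deletion, insertion)
    _, subs, dels, ins = dp[len(ref)][len(hyp)]
    return ErrorCounts(subs, dels, ins, len(ref))


# Invented example: one substitution ("sat" -> "sit"), one deleted "the".
result = count_errors("the cat sat on the mat", "the cat sit on mat")
print(result, round(result.wer, 3))
```

WER divides the total number of errors by the number of reference words, so two systems with similar WER can still differ sharply in which error types dominate.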
Looking at the different error types and rates, the research produced interesting takeaways:
The experiments show that ASR techniques can be transferred across domains to provide highly accurate transcriptions. For both acoustic and language modeling, the LSTM- and ResNet-based models proved effective, and human evaluation experiments kept us honest. That said, while our methods keep improving, there is still a gap to close between human and machine performance, demonstrating a continued need for research on automatic transcription for broadcast news.
Source: https://appen.com/blog/improving-the-accuracy-of-automatic-speech-recognition-models-for-broadcast-news/