A short introduction to speech recognition

by Olivier Deroo

Automatic speech recognition (ASR) is useful as a multimedia browsing tool: it allows us to easily search and index recorded audio and video data. Speech recognition is also useful as a form of input. It is especially useful when someone's hands or eyes are busy: it allows people working in active environments such as hospitals to use computers, and it also allows people with handicaps such as blindness or palsy to use them. Finally, although everyone knows how to talk, not everyone knows how to type; with speech recognition, typing would no longer be a necessary skill for using a computer. If it could one day be combined with natural language understanding, it would make computers accessible to people who do not want to learn the technical details of using them.

In 1994, IBM was the first company to commercialize a dictation system based on speech recognition. Speech recognition has since been integrated into many applications:
- telephony applications,
- embedded systems (telephone voice dialing, car kits, PDAs, etc.),
- multimedia applications, such as language learning tools.

Many improvements have been made over the last 50 years, but computers are still not able to understand every single word pronounced by everyone. Speech recognition remains a very difficult problem.

There are quite a lot of difficulties. The main one is that two speakers uttering the same word will say it very differently from each other. This problem is known as inter-speaker variation (variation between speakers). In addition, the same person does not pronounce the same word identically on different occasions. This is known as intra-speaker variation: even consecutive utterances of the same word by the same speaker will differ. Again, a human would not be confused by this, but a computer might be. The waveform of a speech signal also depends on the recording conditions (noise, reverberation, ...). Noise and channel distortions are very difficult to handle, especially when there is no a priori knowledge of the noise or the distortion.

Recognition modes

A speech recognition system can be used in many different modes (speaker-dependent or speaker-independent, isolated or continuous speech, and small, medium or large vocabulary).

Speaker-Dependent / Speaker-Independent Systems

A speaker-dependent system must be trained on a specific speaker in order to accurately recognize what has been said. To train the system, the speaker is asked to record predefined words or sentences, which are analyzed and whose analysis results are stored. This mode is mainly used in dictation systems, where a single speaker uses the recognizer. In contrast, speaker-independent systems can be used by any speaker without any training procedure. They are therefore used in applications where a training stage is not possible (typically telephony applications). The accuracy of the speaker-dependent mode is, unsurprisingly, better than that of the speaker-independent mode.

Isolated Word Recognition

This is the simplest speech recognition mode and the least demanding in terms of CPU. Each word is surrounded by silence, so that word boundaries are well known: the system does not need to find the beginning and end of each word in a sentence. The word is compared to a list of word models, and the model with the highest score is retained by the system. This kind of recognition is mainly used in telephony applications to replace traditional DTMF methods.
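
What follows is a minimal sketch (in Python) of isolated word recognition by template matching: the unknown word's feature sequence is compared to one stored reference sequence per vocabulary word using dynamic time warping, and the closest word is retained. The templates here are random placeholders for real cepstral features; an actual system would rather use the statistical models (HMMs) described further below.

import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (one row per frame, one column per feature coefficient)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])       # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # skip a frame of a
                                 cost[i, j - 1],          # skip a frame of b
                                 cost[i - 1, j - 1])      # match both frames
    return cost[n, m]

def recognize_isolated_word(utterance, word_templates):
    """Return the vocabulary word whose template is closest to the utterance."""
    scores = {w: dtw_distance(utterance, t) for w, t in word_templates.items()}
    return min(scores, key=scores.get)

# Toy example with random 12-dimensional "feature" frames.
rng = np.random.default_rng(0)
templates = {"yes": rng.normal(0.0, 1.0, (30, 12)),
             "no": rng.normal(2.0, 1.0, (25, 12))}
test = templates["yes"] + rng.normal(0.0, 0.1, (30, 12))
print(recognize_isolated_word(test, templates))           # expected: yes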

Continuous Speech Recognition

Continuous speech recognition is much more natural and user-friendly: the computer must be able to recognize a sequence of words in a sentence. However, this mode requires much more CPU and memory, and the recognition accuracy is noticeably lower than in the preceding mode. Why is continuous speech recognition more difficult than isolated word recognition?

Some possible explanations are:
- the speakers' pronunciation is less careful;
- the speaking rate is less constant;
- word boundaries are not necessarily clear;
- there is more variation in stress and intonation (interaction between vocal tract and excitation);
- additional variability is introduced by the unconstrained sentence structure;
- coarticulation is increased, both within and between words;
- speech is mixed with hesitations, partial repetitions, etc.

Keyword Spotting

This mode was created to bridge the gap between isolated and continuous speech recognition. Recognition systems based on keyword spotting are able to identify, in a sentence, a word or group of words corresponding to a particular command. For example, consider a virtual kiosk that gives customers directions to a particular department in a supermarket. There are many different ways of asking for this kind of information; one possibility could be "Hello, can you please give me the way to the television department". The system should be able to extract the important word "television" from the sentence and give the associated information to the customer.
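
A minimal sketch of the idea, assuming the recognizer already returns word hypotheses with confidence scores (the keyword list, threshold and decoded output below are purely illustrative):

KEYWORDS = {"television", "bakery", "checkout"}
CONFIDENCE_THRESHOLD = 0.7

def spot_keywords(hypotheses):
    """Return the keywords found with sufficient confidence."""
    return [word for word, confidence in hypotheses
            if word.lower() in KEYWORDS and confidence >= CONFIDENCE_THRESHOLD]

# A decoded customer request with per-word confidence scores.
decoded = [("hello", 0.90), ("can", 0.80), ("you", 0.85), ("give", 0.60),
           ("me", 0.70), ("the", 0.80), ("way", 0.75), ("to", 0.80),
           ("the", 0.80), ("television", 0.92), ("department", 0.88)]
print(spot_keywords(decoded))                              # ['television']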

Vocabulary Size

The size of the available vocabulary is another key point in speech recognition applications. Clearly, the larger the vocabulary, the more opportunities the system has to make errors. A good speech recognition system will therefore make it possible to adapt its vocabulary to the task it is currently assigned to (i.e., possibly enable dynamic adaptation of its vocabulary). The difficulty levels are usually classified according to Table 1, with a score from 1 to 10, where 1 is the simplest system (speaker-dependent, able to recognize isolated words from a small vocabulary of about 10 words) and 10 corresponds to the most difficult task (speaker-independent continuous speech over a large vocabulary of, say, 10,000 words). State-of-the-art speech recognition systems with acceptable error rates lie somewhere between these two extremes.
 
 
 

 

Speaker mode        |  Isolated Words          |  Continuous Speech
                    |  Small Voc. | Large Voc. |  Small Voc. | Large Voc.
Speaker Dependent   |      1      |     4      |      5      |     7
Multi-Speaker       |      2      |     4      |      6      |     7
Speaker Independent |      3      |     5      |      8      |    10

Table 1 : Classification of speech recognition mode difficulties.

The error rates commonly obtained on speaker-independent isolated word databases are around 1% for a 100-word vocabulary, 3% for 600 words and 10% for 8,000 words [DER98]. For a speaker-independent continuous speech recognition database, the error rates are around 15% with a trigram language model and a 65,000-word vocabulary [YOU97].

The Speech Recognition Process

The speech recognition process can be divided into several components, illustrated in Figure 4.

Fig. 4  The speech recognition process.

Note that the first block, which consists of the acoustic environment plus the transduction equipment (microphone, preamplifier, filtering, A/D converter), can have a strong effect on the generated speech representations. For instance, additive noise, room reverberation, microphone position and microphone type can all be associated with this part of the process.

The second block, the feature extraction subsystem, is intended to deal with these problems, as well as to derive acoustic representations that are both good at separating classes of speech sounds and effective at suppressing irrelevant sources of variation.
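
As an illustration, here is a minimal sketch of such a frame-based front end: a short-time power spectrum is computed every 10 ms over 25 ms windows and turned into a few cepstral coefficients. Real front ends add refinements (pre-emphasis, mel filtering, liftering, delta features, normalization) that are omitted here, and the frame sizes are common choices rather than values from this article.

import numpy as np

def cepstral_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_ceps=13):
    """Return one cepstral feature vector per 10 ms frame (about 100 per second)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-10    # short-time power spectrum
        cepstrum = np.fft.irfft(np.log(power))             # real cepstrum of the frame
        features.append(cepstrum[:n_ceps])                 # keep the first coefficients
    return np.array(features)

# Example: one second of a synthetic 440 Hz tone.
t = np.arange(16000) / 16000.0
print(cepstral_features(np.sin(2 * np.pi * 440 * t)).shape)   # (98, 13)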

The next two blocks in Figure 4 illustrate the core acoustic pattern matching operations of speech recognition. In nearly all ASR systems, a representation of speech, such as a spectral or cepstral representation, is computed over successive intervals, e.g., 100 times per second. These representations or speech frames are then compared to the spectra or cepstra of frames that were used for training, using some measure of similarity or distance. Each of these comparisons can be viewed as a local match. The global match is a search for the best sequence of words (in the sense of the best match to the data), and is determined by integrating many local matches. The local match does not typically produce a single hard choice of the closest speech class, but rather a group of distances or probabilities corresponding to possible sounds. These are then used as part of a global search or decoding to find an approximation to the closest (or most probable) sequence of speech classes, or ideally to the most likely sequence of words. Another key function of this global decoding block is to compensate for temporal distortions that occur in normal speech. For instance, vowels are typically shortened in rapid speech, while some consonants may remain nearly the same length. 
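
The local match can be pictured as follows: every incoming frame is scored against each speech class, producing a matrix of frame-by-class distances (or probabilities) that the global search then combines over time. The class "models" below are just illustrative mean vectors, not a real acoustic model.

import numpy as np

def local_match(frames, class_means):
    """Return the class names and a (num_frames, num_classes) distance matrix."""
    names = list(class_means)
    distances = np.array([[np.linalg.norm(frame - class_means[name]) for name in names]
                          for frame in frames])
    return names, distances

rng = np.random.default_rng(1)
class_means = {"aa": rng.normal(0.0, 1.0, 13),             # stand-ins for trained models
               "iy": rng.normal(2.0, 1.0, 13),
               "sil": np.zeros(13)}
frames = rng.normal(0.0, 1.0, (5, 13))                     # five incoming feature frames
names, distances = local_match(frames, class_means)
print(names, distances.shape)                              # ['aa', 'iy', 'sil'] (5, 3)
# The global match combines such per-frame scores over time (e.g. with the
# Viterbi decoding sketched below) instead of choosing a class frame by frame.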

The recognition process is based on statistical models (hidden Markov models, HMMs) [RAB89,RAB93], which are now widely used in speech recognition. A hidden Markov model is typically defined (and represented) as a stochastic finite state automaton (SFSA) built up from a finite set of possible states, each of which is associated with a specific probability distribution (or probability density function, in the case of likelihoods).
Ideally, there should be an HMM for every possible utterance, but this is clearly infeasible. A sentence is thus modeled as a sequence of words. Some recognizers operate at the word level, but for any substantial vocabulary (say, over 100 words or so) it is usually necessary to further reduce the number of parameters (and, consequently, the required amount of training material). To avoid the need for a new training phase each time a word is added to the lexicon, word models are often composed of concatenated sub-word units. Any word can be split into acoustic units. Although there are good linguistic arguments for choosing units such as syllables or demi-syllables, the units most commonly used are speech sounds (phones), i.e. acoustic realizations of linguistic units called phonemes. Phonemes are speech sound categories that are meant to differentiate between words in a language. One or more HMM states are commonly used to model a segment of speech corresponding to a phone. Word models are then concatenations of phone or phoneme models (constrained by the pronunciations in a lexicon), and sentence models are concatenations of word models (constrained by a grammar).
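
The following sketch illustrates these ideas: a word model for "cat" is built by concatenating three single-state phone models into a left-to-right HMM, and Viterbi decoding recovers the best state (phone) sequence for a series of frame scores. The transition probabilities and emission scores are made-up numbers chosen only to make the example run.

import numpy as np

def viterbi(log_emissions, log_transitions, log_initial):
    """Return the most likely state sequence for frame log-scores of shape (T, S)."""
    T, S = log_emissions.shape
    delta = np.full((T, S), -np.inf)                        # best score ending in each state
    backpointer = np.zeros((T, S), dtype=int)
    delta[0] = log_initial + log_emissions[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_transitions    # scores[i, j]: from state i to j
        backpointer[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emissions[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return path[::-1]

# Word "cat" as a left-to-right concatenation of three one-state phone models.
phones = ["k", "ae", "t"]
S = len(phones)
log_transitions = np.full((S, S), np.log(1e-6))             # effectively forbid other jumps
for s in range(S):
    log_transitions[s, s] = np.log(0.6)                     # stay in the same phone
    if s + 1 < S:
        log_transitions[s, s + 1] = np.log(0.4)             # move to the next phone
log_initial = np.log(np.array([0.999998, 1e-6, 1e-6]))      # start in the first phone
rng = np.random.default_rng(2)
log_emissions = rng.normal(-5.0, 1.0, (9, S))               # made-up frame scores
log_emissions[0:3, 0] += 4; log_emissions[3:6, 1] += 4; log_emissions[6:9, 2] += 4
print([phones[s] for s in viterbi(log_emissions, log_transitions, log_initial)])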

Hybrid Systems

Several authors [RIC91,BOU94] have shown that the outputs of artificial neural networks (ANNs) used in classification mode can be interpreted as estimates of posterior probabilities of output classes conditioned on the input. It has thus been proposed to combine ANNs and HMMs into what is now referred to as hybrid HMM/ANN speech recognition systems. 
Since we ultimately derive essentially the same probability with an ANN as we would with a conventional (e.g., Gaussian mixture) estimator, what is the point in using ANNs? There are several potential advantages that we, and others, have observed: enhanced model accuracy, availability of contextual information, and increased discrimination.
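
In practice, the connection between the two models is made by dividing the network's posterior p(class | frame) by the class prior, which yields a scaled likelihood that can replace the usual emission probability in the HMM. A minimal sketch, with made-up posterior and prior values:

import numpy as np

def scaled_log_likelihoods(posteriors, priors, eps=1e-10):
    """Turn ANN posteriors of shape (T, S) into log(p(frame | class) / p(frame)) scores."""
    return np.log(posteriors + eps) - np.log(priors + eps)

posteriors = np.array([[0.80, 0.15, 0.05],                  # frame 1: ANN outputs per class
                       [0.10, 0.70, 0.20]])                 # frame 2
priors = np.array([0.50, 0.30, 0.20])                       # class frequencies in the training data
print(scaled_log_likelihoods(posteriors, priors))
# These scores can be plugged into a Viterbi decoder like the one sketched above.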

Model accuracy

ANN estimation of probabilities does not require detailed assumptions about the form of the statistical distribution to be modeled, resulting in more accurate acoustic models. 

Contextual Information

For the ANN estimator, multiple inputs can be used from a range of speech frames, and the network will learn something about the correlation between the acoustic inputs. This contrasts with more conventional approaches, which assume that successive acoustic vectors are uncorrelated (which is clearly not the case).
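
A common way to provide this context is to stack several neighbouring frames into a single network input, as in the following sketch (the context width of four frames on each side is an illustrative choice):

import numpy as np

def stack_context(features, left=4, right=4):
    """Turn (T, D) features into (T, (left + 1 + right) * D) context windows,
    repeating the first and last frames at the edges."""
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    width = left + 1 + right
    return np.stack([padded[t:t + width].ravel() for t in range(len(features))])

frames = np.random.default_rng(3).normal(size=(100, 13))
print(stack_context(frames).shape)                          # (100, 117): 9 frames of 13 coefficients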

Discrimination

ANNs can easily accommodate discriminant training, that is: at training time, speech frames which characterize a given acoustic unit are used to train the corresponding HMM to recognize these frames, and to train the other HMMs to reject them. Of course, as currently done in standard HMM/ANN hybrids, discrimination is only local (at the frame level). Nevertheless, this discriminant training option is clearly closer to how humans recognize speech.

Current Research In Speech Recognition

During the last decade there have been many research efforts to improve speech recognition systems. The most common ones can be classified into the following areas: robustness against noise, improved language models, multilinguality, and data fusion and multi-stream processing.

Robustness against noise 

Many research laboratories have shown an increasing interest in robust speech recognition, where robustness refers to the need to maintain good recognition accuracy even when the quality of the input speech is degraded. As spoken language technologies are increasingly transferred to real-life applications, the need for greater robustness against noisy environments becomes more and more apparent. The performance degradation in noisy real-world environments is probably the most significant factor limiting the uptake of ASR technology. Noise considerably degrades the performance of speech recognition systems even on quite easy tasks, such as recognizing a sequence of digits in a car environment. A typical degradation of performance on this task is shown in Table 2.
 
 
 
 

SNR |  -5 dB  |  0 dB   |  10 dB  |  15 dB  |  Clean
WER |  90.2 % |  72.2 % |  20.0 % |  8.0 %  |  1.0 %

Table 2 : Word error rate on the Aurora 2 database
(continuous digits in noisy environments for different signal-to-noise ratios).

In the case of short-term (frame-based) frequency analysis, even when only a single frequency component is corrupted (e.g., by a selective additive noise), the whole feature vector provided by the feature extraction stage in Fig. 4 is generally corrupted, and the performance of the recognizer is typically severely impaired.
Multi-band speech recognition [DUP00] is one possible approach explored by many researchers. Current automatic speech recognition systems treat the incoming signal as one entity. There are, however, several reasons why we might want to view the speech signal as a multi-stream input in which each stream contains specific information and is therefore processed (up to some time range) more or less independently of the others. In multi-band speech recognition, the input speech is divided into disjoint frequency bands that are treated as separate sources of information. These streams can then be merged in the recognizer to determine the most likely spoken words. Hybrid HMM/ANN systems provide a good framework for such problems, where discrimination and the possibility of using temporal context are important features.
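
The recombination step can be pictured as follows: each frequency band produces its own class posteriors, and the streams are merged before decoding. In this sketch the number of bands, the weights and the weighted log-linear merging rule are illustrative assumptions, and the per-band posteriors are faked rather than produced by real band-limited ANNs.

import numpy as np

def merge_band_posteriors(band_posteriors, band_weights):
    """Merge a list of per-band (T, S) posterior arrays into one (T, S) array
    using a weighted log-linear combination, then renormalize per frame."""
    log_merged = sum(weight * np.log(posteriors + 1e-10)
                     for posteriors, weight in zip(band_posteriors, band_weights))
    merged = np.exp(log_merged)
    return merged / merged.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
bands = [rng.dirichlet(np.ones(3), size=5) for _ in range(4)]   # 4 bands, 5 frames, 3 classes
weights = [0.4, 0.3, 0.2, 0.1]                                  # e.g. down-weight noisier bands
print(merge_band_posteriors(bands, weights).shape)              # (5, 3)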

Language Models

Other research aims to improve language models, which are another key component of speech recognition systems. The language model is the component that incorporates the syntactic constraints of the language. Most state-of-the-art large vocabulary speech recognition systems make use of statistical language models, which are easily integrated with the other system components. Most probabilistic language models are based on the empirical paradigm that a good estimate of the probability of a linguistic event can be obtained by observing this event on a large enough text corpus. The most commonly used models are n-grams, where the probability of a sentence is estimated from the conditional probabilities of each word or word class given the n-1 preceding words or word classes. Such models are particularly attractive since they are both robust and efficient, but they are limited to modeling only the local linguistic structure. Bigram and trigram language models are widely used in speech recognition systems (e.g. dictation systems).
One important issue for speech recognition is how to create language models for spontaneous speech. When recognizing spontaneous speech in dialogs, it is necessary to deal with extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluencies, partial words, hesitations and repetitions. These kinds of variation can degrade recognition performance. For example, the results obtained on the SwitchBoard database (telephone conversations) show a recognition accuracy of only 50% for the baseline systems [COH94]. Better language models are presently a major issue and could be obtained by looking beyond n-grams, by identifying useful linguistic information and integrating more of it. Better pronunciation modeling will probably enlarge the population that can get acceptable results from a speech recognition system and therefore strengthen the acceptability of the system.
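
To make the n-gram idea discussed above concrete, here is a minimal sketch of a bigram model estimated from word counts with add-one smoothing; the two-sentence training corpus is purely illustrative.

import math
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokens = [["<s>"] + sentence.split() + ["</s>"] for sentence in corpus]
unigram_counts = Counter(word for sentence in tokens for word in sentence)
bigram_counts = Counter((sentence[i], sentence[i + 1])
                        for sentence in tokens for i in range(len(sentence) - 1))
vocabulary_size = len(unigram_counts)

def bigram_prob(previous, word):
    """P(word | previous) with add-one smoothing."""
    return (bigram_counts[(previous, word)] + 1) / (unigram_counts[previous] + vocabulary_size)

def sentence_log_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(bigram_prob(words[i], words[i + 1])) for i in range(len(words) - 1))

print(sentence_log_prob("the cat sat on the rug"))     # plausible word order scores higher...
print(sentence_log_prob("rug the on sat cat the"))     # ...than the same words scrambled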

Data Fusion and multi-stream

Many researchers have shown that improvements can be obtained by combining multiple speech recognition systems or by combining data extracted from multiple recognition processes. Sustained incremental improvements, based on the use of statistical techniques on ever larger amounts of data and on different annotated data, should be observed in the coming years. It may also be interesting to describe the speech signal in terms of several information streams, each stream resulting from a particular way of analyzing the speech signal [DUP97]. For example, models aimed at capturing the syllable-level temporal structure could be used in parallel with classical phoneme-based models. Another potential application of this approach is the dynamic merging of asynchronous temporal sequences (possibly with different frame rates), such as visual and acoustic inputs.

Multilingual Speech Recognition

Addressing multilinguality is very important in speech recognition: a system able to recognize multiple languages is much easier to put on the market than a system addressing only one language. Language identification consists in detecting the language being spoken, which makes it possible to select the right acoustic and language models. Many research laboratories have tried to build systems addressing this problem, with some success (both the Center for Spoken Language Understanding, Oregon, and our laboratory are able to recognize the language of a 10-second speech chunk with an accuracy of about 80%). Another alternative could be to use language-independent acoustic models, but this is still at the research stage.

References

[BOI00] Boite R., Bourlard H., Dutoit T., Hancq J. and Leich H., "Traitement de la parole", Presses polytechniques et universitaires romandes, Lausanne, Switzerland, ISBN 2-88074-388-5, 488 pp., 2000.
[BOU94] Bourlard H. and Morgan N., "Connectionist Speech Recognition - A Hybrid Approach", Kluwer Academic Publishers, 1994.
[COH94] Cohen J., Gish H. and Flanagan J., "SwitchBoard - The Second Year", Technical report, CAIP Workshop in Speech Recognition: Frontiers in Speech Processing II, July 1994.
[DER98] Deroo O., "Modèle dépendant du contexte et fusion de données appliqués à la reconnaissance de la parole par modèle hybride HMM/MLP", PhD Thesis, Faculté Polytechnique de Mons, 1998 (http://tcts.fpms.ac.be/publications/phds/deroo/these.zip).
[DUP00] Dupont S., "Etude et développement d'architectures multi-bandes et multi-modales pour la reconnaissance robuste de la parole", PhD Thesis, Faculté Polytechnique de Mons, 2000.
[DUP97] Dupont S., Bourlard H. and Ris C., "Robust Speech Recognition based on Multi-stream Features", Proceedings of the ESCA/NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-à-Mousson, France, pp. 95-98, 1997.
[RAB89] Rabiner L. R., "A tutorial on Hidden Markov Models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, no. 2, pp. 257-285, 1989.
[RAB93] Rabiner L. R. and Juang B. H., "Fundamentals of Speech Recognition", PTR Prentice Hall, 1993.
[RIC91] Richard M. D. and Lippmann R. P., "Neural network classifiers estimate Bayesian a posteriori probabilities", Neural Computation, vol. 3, pp. 461-483, 1991.
[YOU97] Young S., Adda-Decker M., Aubert X., Dugast C., Gauvain J.-L., Kershaw D. J., Lamel L., van Leeuwen D. A., Pye D., Robinson A. J., Steeneken H. J. M. and Woodland P. C., "Multilingual large vocabulary speech recognition: the European SQALE project", Computer Speech and Language, vol. 11, pp. 73-89, 1997.
