Work Package Manager: FPMs
Executing Partners: CUED, FPMs, SU, ICSI

Thus far, hybrid HMM/ANN systems have been shown to provide good acoustic modeling, with several advantages over other approaches. Many (relatively simple) speech recognition systems based on this hybrid HMM/ANN approach have proven, in controlled tests, to be both effective in terms of accuracy (comparable to or better than equivalent state-of-the-art systems) and efficient in terms of CPU and memory run-time requirements (see, e.g., [10,15]). More recently, such a system (ABBOT from Cambridge University, see, e.g., ) has been evaluated under both the North American ARPA program and the European LRE SQALE project (20,000-word vocabulary, speaker-independent continuous speech recognition). In the preliminary results of the SQALE evaluation (reported in ), the system was found to perform slightly better than any other leading European system while requiring an order of magnitude less CPU resources to complete the test. Another striking result is that the acoustic models for this system used several hundred thousand parameters (around 500,000 for ABBOT) while the corresponding models for the competing systems used millions of parameters (around ).
This workpackage is thus devoted to further fundamental research in new conceptual advances in hybrid systems that have been identified by the partners.
WP5 does not have any milestones.
Task Coordinator: FPMs
Executing Partners: FPMs, ICSI
Although hybrid HMM/ANN speech recognition systems appear to be efficient models, we believe that new developments are necessary to further boost this technology. At the same time, recent developments in systems that attempt to model auditory/perceptual properties have been relatively successful. We believe, however, that hybrid HMM/ANN systems are potentially more flexible and better adapted to modelling some important perceptual properties.
The goal of this Task is thus to investigate, develop and test new recognition systems based on perceptual properties and able to take full advantage of the HMM/ANN framework (i.e., properties that would be difficult to model outside the HMM/ANN framework). As discussed below, two approaches are investigated: SPAM, modelling transitions, and the multi-band approach, working on the basis of independent narrow frequency bands. In both cases, an HMM/ANN approach is required since the local classifier requires some of the ANN properties, mainly (1) the ability to look at longer temporal segments (acoustic context), and (2) discrimination.
The first approach studied in this Task, referred to as SPAM (Stochastic Perceptual Auditory-Event-Based Models), attempts to model speech as a sequence of (non-stationary) transitions, disregarding stationary portions of the signal. An initial theory was set up in WERNICKE and preliminary encouraging results were published in . As described below (and in the publications in Deliverable D5.1), this work has been pursued in this project and has led to new results basically showing that SPAM:
However, this new approach also led us to start investigating a new speech recognition model based on other properties of human hearing. As recently discussed in , there are many reasons to believe that human speech decoding is done by processing narrow frequency bands quite independently of each other and recombining the partial (over time and frequency) decisions at some segmental level. In the framework of SPRACH, a significant amount of work has thus been done in this direction, leading to what has been referred to as ``subband-based speech recognition'', recently further generalized to ``multi-stream speech recognition''. As summarized below (and further discussed in Deliverable D5.1), this approach is very promising from several points of view and is attracting worldwide attention.
SPAM theory and earlier experiments were reported in a number of conference papers as well as in the journal paper . The central notion for these experiments was to focus modeling power on points of phonetic onset decision (called ``avents'') by tying states associated with all other non-onset frames to the same distribution. Given the discriminant training for our hybrid HMM/ANN system, this leads to a system that focuses on distinguishing between classes of transitions.
In the original theory, we estimated acoustic probabilities for avent class $c_k$ of the form $P(c_k \mid X, t, \tau)$, in which $X$ represents the current chunk of input data and $\tau$ the previous time index for which an avent had been hypothesized; $\tau$ thus becomes one of the stochastic variables of the model.
In the earlier experiments, we used a simplified model in which we neglected the dependence on the time index and the previous avent class. This year, we extended the work to include the dependence on the time to the previous avent, modifying both the neural network (to include binary inputs representing 3 ranges of this time) and the stochastic models (expanded by a factor of 3 to incorporate the possible times). Results for an experiment on Bellcore isolated digits (both without and with additive car noise) are shown in the accompanying table.
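The binary duration inputs described above can be sketched as follows. This is an illustration only: the function name and the range edges are ours, as the report does not give the actual boundary values used in the experiments.

```python
def duration_inputs(frames_since_avent, edges=(5, 15)):
    """Encode the time since the previous hypothesized avent as 3
    binary network inputs, one per duration range.  The range edges
    here are illustrative, not the experimental values."""
    if frames_since_avent < edges[0]:
        return [1, 0, 0]            # short gap since last avent
    elif frames_since_avent < edges[1]:
        return [0, 1, 0]            # medium gap
    return [0, 0, 1]                # long gap
```

Exactly one of the three inputs is active for any duration, so the network can condition its transition-class estimates on coarse timing information.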
Figure 5.1: Error rates for isolated digits plus ``oh'', ``no'', and ``yes'', recorded over a public-switched telephone network. The noisy case includes artificially added car noise resulting in a 10dB SNR.
As can be seen, the addition of durational dependence to the SPAM model reduces the error rate in the clean case while maintaining that rate in the noisy case. The clean-case error reduction is significant at , assuming a normal approximation to a binomial distribution for the errors. The reduction in the error rate for the combined system, while not as statistically significant, is certainly in the correct direction.
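The significance test referred to above, a normal approximation to the binomial distribution of errors, can be sketched as follows. All numbers here are illustrative; the actual error counts and test-set sizes are those of the experiment, not reproduced in this sketch.

```python
import math

def error_reduction_z(err_base, err_new, n):
    """Z-score of an error-rate difference under a normal approximation
    to the binomial (sketch; pooled-variance two-proportion test)."""
    p = (err_base + err_new) / 2.0            # pooled error estimate
    se = math.sqrt(2.0 * p * (1.0 - p) / n)   # std. error of the difference
    return (err_base - err_new) / se

# e.g. 4.0% vs 3.0% error on 2000 test tokens (illustrative numbers):
z = error_reduction_z(0.04, 0.03, 2000)
```

A z-score above about 1.64 would correspond to significance at the 5% level in a one-sided test.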
The work of Fletcher and his colleagues  (see the insightful review of his work in ) suggests that the decoding of the linguistic message is based on decisions within narrow frequency bands that are processed quite independently of each other. Recombination of the decisions from those frequency bands is done at ``certain levels'' so that the global error rate is equal to the product of the ``band-limited'' error rates within the independent frequency channels. This also means that if any of the frequency bands yields a zero (or low) error rate, the resulting error rate will also be zero (or relatively low), (almost) independently of the error rates in the remaining bands.
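The product-of-errors rule is easy to illustrate numerically (the per-band error rates below are invented for the example):

```python
def combined_error(band_error_rates):
    """Fletcher's product-of-errors rule: the full-band error rate is
    the product of the per-band error rates, so a single low-error
    band dominates the overall result."""
    prod = 1.0
    for e in band_error_rates:
        prod *= e
    return prod

noisy = combined_error([0.5, 0.4, 0.3])            # three unreliable bands
with_clean = combined_error([0.5, 0.4, 0.3, 0.0])  # add one error-free band
```

Even three bands with 30-50% error combine to about 6% error, and adding a single error-free band drives the combined error to zero.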
We see at least three engineering reasons for considering this approach:
It is perhaps obvious that a core issue in the design of any sub-band-based system is the choice of the number and position of the constituent sub-bands. Once these are determined, the approach presented here fundamentally consists of the combination of the output of multiple recognizers, one for each band, at some level of representation. Fundamentally, each of these recognizers consists of a probability estimator and a time-warp engine.
Of course, there is less information in a sub-band than in the whole band; the partial decisions may thus be less reliable. To avoid too much flexibility in choosing the time-warping path, it is necessary to re-introduce some constraints at a higher level. This is done by forcing synchrony (in terms of the underlying segmentation) of the different independent frequency band recognizers at some level, as shown in Figure 5.2. In other words, the scores of the different sub-band recognizers are recombined at a certain speech unit level (i.e., over a certain time segment) to yield a global score and a global decision. Up to now we have done this at the state, phoneme, syllable or word levels, although we are interested in looking at other units for this purpose.

We note here that while this is quite easy at the HMM state level (and at the word level, in the case of isolated word recognition), it is no longer straightforward at any intermediate subword unit level (simply using the standard one-pass dynamic programming approach). Rather, the system can either use an approach based on the 2-level dynamic time warping algorithm, or else an adaptation of HMM decomposition  (initially introduced to decompose the speech signal into a speech part and a noise part). In the framework of sub-band-based speech recognition, a similar HMM decomposition formalism can be used to perform multi-dimensional time warping and recombination of the frequency sub-bands. However, as opposed to standard HMM decomposition, it is not the same input signal that is fed into the different HMM models but different band-limited versions of the original speech signal.
Figure 5.2: General form of a K-band recognizer with anchor points between speech units (to force synchrony between frequency bands).
Although Fletcher's recombination criterion  suggests an attractive optimum (since zero error in any band yields zero error overall), we are not aware of any statistical formalism for achieving this. Thus, we decided to define the log-likelihood of a full-band acoustic vector X given a word (sentence) model M as
\[
\log P(X \mid M) = \sum_{j} \log P(X_j \mid M_j)
\]
where $X_j$ represents the j-th segment of X and $M_j$ the model associated with $X_j$ during dynamic time warping. Depending on the recombination level, $M_j$ could be an HMM state model, a word model, a phone model or any other subword unit model. For each segment, the statistical recombination of the frequency sub-bands is performed according to
\[
\log P(X_j \mid M_j) = f\bigl(\log P(X_j^1 \mid M_j^1), \ldots, \log P(X_j^K \mid M_j^K); w_1, \ldots, w_K\bigr)
\]
where $X_j^k$ is the band-limited sequence of acoustic parameters associated with the k-th frequency band, $M_j^k$ is the model associated with $X_j^k$, and the $w_k$'s are the recombination parameters. $P(X_j^k \mid M_j^k)$ thus represents the likelihood of a partial (frequency-limited and time-limited) sequence given model $M_j^k$ and can be computed with standard HMM or hybrid HMM/ANN systems.
Two different recombination functions have been tested: a linear combination
\[
\log P(X_j \mid M_j) = \sum_{k=1}^{K} w_k \, \log P(X_j^k \mid M_j^k)
\]
and a non-linear combination
\[
\log P(X_j \mid M_j) = g\bigl(\log P(X_j^1 \mid M_j^1), \ldots, \log P(X_j^K \mid M_j^K); \Theta\bigr)
\]
where $g$ represents a multilayer perceptron (MLP) parameterized in terms of the weights $\Theta$ and with the sub-band log-likelihoods $\log P(X_j^k \mid M_j^k)$, $k = 1, \ldots, K$, at its input.
Three different strategies have been considered for estimating the recombination parameters: (1) normalized phoneme-level recognition rates in each frequency band, (2) normalized S/N ratios in each frequency band, and (3) a multilayer perceptron (trained on clean data).
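The linear recombination rule and the simple weight-normalisation strategies (1) and (2) can be sketched as follows. The function names and all numeric values are ours, for illustration only; the actual estimation procedures are those of the cited work.

```python
def weights_from_band_scores(scores):
    """Strategies (1) and (2): normalise per-band figures of merit
    (phoneme recognition rates, or S/N ratios) so they sum to one
    and can serve as recombination weights."""
    total = sum(scores)
    return [s / total for s in scores]

def recombine_linear(band_log_likelihoods, weights):
    """Linear recombination of sub-band segment log-likelihoods:
    log P(X_j | M_j) = sum_k w_k * log P(X_j^k | M_j^k)."""
    return sum(w * ll for w, ll in zip(weights, band_log_likelihoods))

# Four sub-bands with illustrative per-band recognition rates and
# illustrative segment log-likelihoods:
w = weights_from_band_scores([62.0, 70.0, 55.0, 48.0])
score = recombine_linear([-12.3, -9.8, -15.1, -11.0], w)
```

Strategy (3), the MLP recombination, would replace `recombine_linear` with a trained non-linear network taking the same sub-band log-likelihoods as input.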
Preliminary experiments have been reported (and compared with a state-of-the-art full band approach) in  where it was shown on a speaker independent task (108 isolated words, telephone speech) that:
We have also compared the performance of the multi-band approach (on both isolated word and continuous speech recognition problems) in terms of:
Three sets of features have been considered: 15 critical band energies (CBE); lpc-cepstral features independently computed for each frequency band and followed by cepstral mean subtraction (CMS); and, finally, lpc-cepstral features independently computed from band-limited critical band energies previously processed with the J-RASTA technique  (for the case of wideband noise).
We have experimented with several numbers of bands, ranging from 3 to 6.
Different recombination levels (HMM state, phone or syllable) have also been compared.
Different databases have been tested:
Table 5.1: Error rates on isolated word recognition (13 American English words, telephone speech). Features were either critical band energies (CBE) or lpc-cepstral features (CMS) independently computed for each sub-band and followed by cepstral mean subtraction. ``FB'' refers to the regular full-band recognizer. For the 3-, 4- and 6-sub-band systems, state-level log-likelihood recombination was performed by an MLP.
Database 1 has already been used to compare the error rates on clean speech for CBE and CMS, and for different numbers of sub-bands. The results, reported in Table 5.1, show that: (1) sub-band modelling yields better recognition performance than a standard full-band approach, and (2) all-pole modelling of cepstral vectors improves the performance of the full-band system (of course, this was already known!) but also the performance of the sub-band approach.
In the case of noisy speech, as reported in Table 5.2, the sub-band approach, combined with J-RASTA processing, yields better recognition performance than the full-band J-RASTA recognizer.
Table 5.2: Error rates on isolated word recognition (13 American English words) of telephone speech with additive car noise (10dB SNR). Training was on clean speech only. ``J-RASTA'' refers to lpc-cepstral features independently computed from band-limited critical band energies previously processed with the J-RASTA noise cancellation technique. Recombination was performed at the state level using an MLP.
The SPAM work reported here has so far been limited to the digits corpus. We intend to extend the same techniques to larger databases, and in particular to the OGI Numbers corpus.
Thus far the SPAM and subband efforts have been quite separate. It is also true that each of them will require significant effort individually; however, by the end of this project we hope to experiment with the combination of these methods.
Task Coordinator: FPMs
Executing Partners: FPMs, ICSI
The initial goal of this Task was to investigate, develop and further test the new hybrid HMM/ANN system initiated in WERNICKE. This new technique, referred to as REMAP (Recursive Estimation and Maximization of A Posteriori Probabilities), allows full discriminant training of HMM/ANN systems, leading to globally discriminant models [Bourlard et al., 1995]. Compared with other alternative approaches (hybrid as well as non-hybrid) such as Maximum Mutual Information or Generalized Probabilistic Descent, this new approach directly maximizes global a posteriori probabilities (without the need to compute probabilities of rival models) to arrive at discriminant solutions.
We believe that REMAP could be of major importance for the further enhancement of hybrid HMM/ANN systems, as well as for pattern classification systems in general. It is the goal of this task to continue the work in this innovative area.
The REMAP training algorithm has been implemented and debugged. New MLP targets have been estimated for two databases (Phonebook and Numbers93). At the same time, we are still working on the adaptation of the MLP training for the estimation of conditional transition probabilities. Discriminant HMMs  have been implemented in two STRUT  decoders: a small-vocabulary continuous speech decoder using wordpair grammars and a lattice (HTK format) decoder. In addition, a recent study (see Annex 5.1 for a draft description) proposes a forward-backward HMM/ANN training which should also lead to global discrimination. The algorithm has been implemented at ICSI and remains to be tested on recognition tasks.
REMAP uses local conditional posterior probabilities of transitions (estimated by an MLP) to maximize during training (or estimate during recognition) the global posterior probabilities of word sequences, hence minimizing the global posterior probabilities of rival word sequences and, in theory, the error rate. Thus, although we still use MLPs to estimate posterior probabilities, these are no longer divided by priors, and the network is now trained with targets that are themselves local posterior probabilities, iteratively re-estimated to guarantee a monotonic increase of the global posterior probabilities of the correct sentences. This kind of algorithm is thus expected to have better global discriminant properties than the previously developed hybrid HMM/ANN systems. On top of better discriminant properties, we believe that the model could also have additional potential benefits in terms of its modeling properties.
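The iterative re-estimation of MLP targets can be sketched with a forward-backward recursion over the network's posterior outputs. This is a simplified sketch under our own assumptions: the real algorithm conditions transition probabilities on the acoustics, whereas here they are taken as a fixed matrix.

```python
import numpy as np

def reestimate_targets(posteriors, trans):
    """One REMAP-style target re-estimation pass (simplified sketch):
    given per-frame MLP state posteriors (T x Q) and a transition
    matrix (Q x Q), a forward-backward recursion yields per-frame
    state occupancies, which become the soft targets for the next
    round of MLP training."""
    T, Q = posteriors.shape
    alpha = np.zeros((T, Q))
    beta = np.ones((T, Q))
    alpha[0] = posteriors[0] / posteriors[0].sum()
    for t in range(1, T):                      # forward pass
        alpha[t] = posteriors[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()             # normalise for stability
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = trans @ (posteriors[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                       # state occupancies
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Training the MLP on these soft occupancy targets, rather than on hard 0/1 labels, is what distinguishes this scheme from the previous Viterbi-style hybrid training.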
As soon as the MLP libraries are modified, we plan to carry out several experiments on both the REMAP (FPMs) and the Forward-Backward (ICSI) trainings. Both algorithms lead to a reestimation of the output class probabilities. The main difference between REMAP and hybrid HMM/ANN forward-backward training is the underlying statistical model. Initially referred to as ``Discriminant HMM'', the REMAP system could also be referred to as Stochastic Finite State Acceptor .
Task Coordinator: CUED
Executing Partners: CUED, SU
The mixture-of-experts paradigm uses a divide-and-conquer approach in which multiple ``experts'' are combined to solve a given task . The approach has previously been used to combine connectionist acoustic models. The work described here extends the mixture-of-experts approach to speaker adaptation, and for use in conjunction with boosting.
Speaker adaptation within the mixture-of-experts framework has been implemented, and results suggest that the method provides improvements in performance, particularly for supervised adaptation . Boosting has been used to provide bootstrap models for a mixture-of-experts acoustic model, and preliminary results on a small isolated digits task show the effectiveness of this scheme .
Speaker adaptation is performed using an architecture in which each expert comprises a recurrent network acoustic model (whose weights are fixed) and a linear input network (LIN). A gating network forces the LINs to learn to specialise in particular regions of the input space. The outputs of the LIN-RNN experts are then combined non-linearly to achieve a global transform that is locally linear.
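The gated combination of linear experts can be sketched as follows. The names, shapes and softmax gate are our own illustrative assumptions; only the idea of softly mixing several linear input transforms comes from the text.

```python
import numpy as np

def moe_transform(x, lins, gate_w):
    """Mixture-of-experts input transform (sketch): each expert is a
    linear input network (LIN) y = A x + b; a softmax gate mixes the
    expert outputs, giving a transform that is non-linear globally
    but locally linear."""
    scores = gate_w @ x
    g = np.exp(scores - scores.max())
    g /= g.sum()                                   # softmax gating weights
    outs = np.stack([A @ x + b for A, b in lins])  # one output per LIN
    return (g[:, None] * outs).sum(axis=0)         # gated combination
```

Because the gating weights sum to one, regions of the input space where one gate dominates are transformed by (approximately) a single linear map.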
Boosting is a data selection technique which filters the training data to produce models trained on different distributions . In this manner the models specialise in different regions of the input space. These models are used to initialise the experts of a mixture-of-experts acoustic model.
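The data-filtering idea can be sketched in a few lines. This is a simplified illustration under our own assumptions (keep every misclassified example, subsample the rest); it is not the exact selection scheme of the cited work.

```python
import random

def boost_filter(examples, is_correct, keep_right=0.3, seed=0):
    """Boosting-style data selection (simplified sketch): keep every
    example the current model misclassifies and subsample those it
    gets right, so the next expert trains on a harder distribution.
    The keep probability is illustrative."""
    rng = random.Random(seed)
    return [x for x in examples
            if not is_correct(x) or rng.random() < keep_right]
```

Running this repeatedly, each time against the latest model, produces a set of training distributions whose models can then initialise the experts of a mixture-of-experts acoustic model.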
It is planned to extend the combination of boosting and mixture-of-experts to large vocabulary continuous speech recognition.
Task Coordinator: SU
Executing Partners: SU, FPMs, ICSI
The use of acoustic processors based on auditory modeling has been proposed as a route to robust speech recognition in noisy environments. We plan to address some specific issues pertaining to the use of auditory models in speech recognition.
At Sheffield work has commenced on nonlinear dimensionality reduction algorithms.
In this task we have so far concentrated on developing nonlinear dimensionality reduction algorithms, rather than applying the techniques to auditory model data. This is in part due to the SPERT training problems discussed in WP6. We have begun exploring a latent variable modelling approach for this application [47,48]. The latent variable approach to dimension reduction involves defining a mapping from a latent space X to the observed space Y. The GTM algorithm  provides a way to do this, by defining a generalized linear mapping (using a layer of fixed radial basis functions) from the latent space to the observed space. The EM algorithm is used for training, where the E-step estimates the posterior probabilities of points in latent space given the observed data (and parameters) and the M-step optimizes the mapping from latent space to observed space. The posterior probabilities of points in latent space may be regarded as ``filling in'' the missing data.
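The E-step just described can be sketched as follows for an isotropic Gaussian noise model. All names, shapes and parameter values are our own illustrative assumptions, not the actual GTM implementation used in the project.

```python
import numpy as np

def gtm_responsibilities(Y, latent, centers, W, sigma_rbf, beta):
    """E-step of a GTM-like model (simplified sketch): a grid of
    latent points is mapped through fixed radial basis functions and
    a linear layer into data space; responsibilities are the posterior
    probabilities of each latent point given each datum (the
    ``filled-in'' missing data of the text)."""
    # RBF features of the K latent points: Phi is (K x M)
    d2 = ((latent[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2.0 * sigma_rbf ** 2))
    X = Phi @ W                                  # latent images in data space
    # isotropic Gaussian log-likelihoods with precision beta,
    # normalised over latent points for each datum
    dist2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # (N x K)
    logp = -0.5 * beta * dist2
    logp -= logp.max(axis=1, keepdims=True)      # numerical stability
    R = np.exp(logp)
    return R / R.sum(axis=1, keepdims=True)      # responsibilities (N x K)
```

The M-step (omitted) would re-fit the linear weights W by weighted least squares using these responsibilities.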
Although the GTM algorithm has performed well on a visualization problem (with a two-dimensional latent space), it is clear that it will not scale. This is due to two main reasons:
The projection pursuit/GTM approach will be evaluated on auditory model data, and also electropalatogram (EPG) data and compared with linear methods such as principal components analysis. Recognition experiments will involve training networks on dimension reduced data.
Task Coordinator: CUED
Executing Partners: CUED
Preliminary work with context-dependent phone modelling has yielded significant reductions in word error rate (15%-20%). HMM systems typically show much larger gains from context-dependent modelling, and this task addresses that gap.
Experiments have been carried out with a state-based context-dependent system , but with limited success. Further extensions of the current phone-based system have, however, proven successful for the Broadcast News corpus across all conditions. Further experiments are necessary to increase the number of context-dependent outputs.
The context-dependent system training has been rewritten to run faster and with more flexibility. Training an SI84-size system now takes roughly 4 hours, compared to 11 hours previously.
Currently the system runs with word-internal context-dependent phones. One possible extension would be to use cross-word context-dependent phones. This sort of extension has realised significant gains in HMM systems in the past.
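The difference between word-internal and cross-word context expansion can be sketched as follows. The helper, the lexicon and the `L-C+R` label convention are our own illustrative assumptions.

```python
def triphones(words, lexicon, cross_word=False):
    """Expand a word sequence into context-dependent (triphone) labels.
    Word-internal modelling resets the context at word boundaries;
    cross-word modelling lets context span them.  (Hypothetical
    helper; 'sil' marks an unavailable context.)"""
    phones, bounds = [], []
    for w in words:
        start = len(phones)
        phones.extend(lexicon[w])
        bounds.append((start, len(phones)))
    out = []
    for start, end in bounds:
        for i in range(start, end):
            if cross_word:
                left = phones[i - 1] if i > 0 else 'sil'
                right = phones[i + 1] if i < len(phones) - 1 else 'sil'
            else:  # word-internal: context stops at the word boundary
                left = phones[i - 1] if i > start else 'sil'
                right = phones[i + 1] if i < end - 1 else 'sil'
            out.append(f'{left}-{phones[i]}+{right}')
    return out
```

Cross-word expansion produces many more distinct contexts at word boundaries, which is where HMM systems have historically gained accuracy.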
Preliminary encouraging results with SPAM have been published. A new technique, based on a sub-band model of speech, has been set up and first promising results have been obtained. A first version of a REMAP training and recognition system has been implemented (and results should come soon). The mixture-of-experts approach has been applied to speaker adaptation. Work on context-dependent modelling is in progress.