

WP4 Overview

Work Package Manager: CUED
Executing Partners: CUED, FPMs, SU, INESC
One of the goals of this project is a flexible hybrid HMM/ANN speech recognizer that is largely independent of its initial training data and easy to adapt to new languages. Hybrid HMM/ANN systems may also be more robust under task-independent training. This work package therefore investigates several topics related to these problems.

WP4 Milestones and Deliverables

WP4 Milestone: M4.1

As reported in Deliverable D4.1, this milestone was met.

WP4 Deliverable: D4.1

The report below, together with Annex D4.1, constitutes this deliverable.

Task 4.1: Training Independent Tasks

Task Coordinator: FPMs
Executing Partners: FPMs, CUED

Task 4.1: Task Objective

For task independence, we want our acoustic models to be independent of both the acoustic environment and the lexicon. Currently, our base system is trained for a single acoustic condition: noise-free read speech recorded with a known microphone.

Task 4.1: Status

The hybrid HMM/MLP system has been tested on a database very well suited for task independent training, PHONEBOOK.

A task-independent, speaker-independent isolated-word recognition system has been trained and tested on PHONEBOOK, and was shown to reach a recognition rate of around 95% on a 600-word lexicon.

Task 4.1: Technical Description

At FPMs, we have trained our system on an isolated-word database, PHONEBOOK. The database contains 106 word lists, each composed of 75 or 76 words pronounced by a few (typically around 11) speakers. The speakers and the words are different for each word list.

We have tested our system on 8 word lists that did not belong to the training set. Since the lexicon is different for each of these 8 word lists, we can either recognize the 8 word lists as a whole (yielding a lexicon of 600 words) or recognize each word list independently with a lexicon of about 75 words. In the second case, the reported recognition rate is the unweighted average of the 8 per-list recognition rates. So far, only a "small" training set (5 hours of speech) has been used, although final systems will also be trained on the "full" training set.

Two pronunciation dictionaries have been used. The first is the one released with PHONEBOOK; it contains the phonetic transcriptions of the PHONEBOOK words according to a 42-phoneme inventory. The second is the 110,000-word CMU 0.4 dictionary, which uses 39 phonemes (a subset of the TIMIT phonemes). Some of the PHONEBOOK words that were not present in CMU 0.4 have been transcribed manually.
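The per-list scoring described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names and the toy results are hypothetical, and the 600-word condition is not reproduced here because it requires a separate recognition run with the merged lexicon. The sketch contrasts the unweighted average over word lists with a token-weighted rate, to show why the distinction matters when lists differ in size.

```python
def recognition_rate(results):
    """Fraction of correct tokens; results is a list of (reference, hypothesis) pairs."""
    correct = sum(1 for ref, hyp in results if ref == hyp)
    return correct / len(results)


def unweighted_average_rate(per_list_results):
    """Average of the per-list rates; every word list counts equally."""
    rates = [recognition_rate(r) for r in per_list_results]
    return sum(rates) / len(rates)


def token_weighted_rate(per_list_results):
    """Rate over all tokens pooled together; larger lists weigh more."""
    pooled = [pair for results in per_list_results for pair in results]
    return recognition_rate(pooled)


# Toy data (hypothetical): two word lists of different sizes.
list_a = [("cat", "cat"), ("dog", "dog"), ("sun", "son"), ("map", "map")]  # 3 of 4 correct
list_b = [("tree", "tree"), ("lake", "late")]                              # 1 of 2 correct

unweighted = unweighted_average_rate([list_a, list_b])  # (0.75 + 0.5) / 2 = 0.625
weighted = token_weighted_rate([list_a, list_b])        # 4 correct out of 6 tokens
```

Since the PHONEBOOK word lists have similar sizes (75 or 76 words, around 11 speakers each), the two figures would be expected to stay close, but only the unweighted average matches the scoring described in the text.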

Figure 4.1: Phoneme model with 3 tied states.

Figure 4.2: Phoneme model with 3 independent states.

Figure 4.3: Phoneme model with 1 state and 2 transition states. The two transition states are tied across several phones and possibly across all the phones.

Three kinds of left-to-right phone models have been investigated:

  1. Model 1: 1-state phone models with a minimum duration constraint of 3, as presented in Figure 4.1.
  2. Model 2: 3-state phone models (minimum duration of 3), as shown in Figure 4.2.
  3. Model 3: A new acoustic modeling approach in which the 1st and 3rd states of Model 2 are tied across different phones, and possibly across all phones. These tied states represent the inter-phone transitions. Figure 4.3 shows the resulting topology. In the experiments reported here, tying was done across phones according to their broad phonetic class. This led to 9 transition states towards a phone and 9 transition states from a phone.
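The three topologies can be sketched as state-label sequences, which makes the effect of tying on the number of distinct acoustic states easy to see. This is a minimal sketch, not the project's implementation: the phone set, the broad-class mapping and all function names are hypothetical, and only the state labelling is modelled (no transition probabilities or network outputs).

```python
# Hypothetical broad phonetic classes for a handful of phones (illustration only).
BROAD_CLASS = {
    "p": "stop", "t": "stop",
    "s": "fricative", "z": "fricative",
    "iy": "vowel", "ae": "vowel",
}


def model1_states(phone):
    # Model 1: one acoustic state repeated three times (minimum duration of 3).
    return [phone, phone, phone]


def model2_states(phone):
    # Model 2: three independent states per phone.
    return [phone + "_1", phone + "_2", phone + "_3"]


def model3_states(phone, broad_class=BROAD_CLASS):
    # Model 3: the entry and exit states are shared by all phones of a broad class.
    cls = broad_class[phone]
    return ["to_" + cls, phone, "from_" + cls]


def word_states(pronunciation, topology):
    # A word model is the concatenation of its phone models.
    return [state for phone in pronunciation for state in topology(phone)]


def count_distinct_states(phones, topology):
    return len({state for phone in phones for state in topology(phone)})


phones = list(BROAD_CLASS)
n_model1 = count_distinct_states(phones, model1_states)  # 6 phones -> 6 states
n_model2 = count_distinct_states(phones, model2_states)  # 6 phones -> 18 states
n_model3 = count_distinct_states(phones, model3_states)  # 6 phone states + 3 "to" + 3 "from" = 12
```

With the 9 broad classes used in the reported experiments, the same construction would give 9 "to" and 9 "from" transition states in addition to one central state per phone.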

Table 4.1:   Error rates on isolated-word recognition (75 American English words never seen in the training data) with hybrid HMM/ANN systems, using either the log-RASTA-PLP or the CMS feature set and the PHONEBOOK phonetic transcriptions.

Table 4.2:   Error rates on isolated-word recognition (75 or 600 American English words never seen in the training data) with hybrid HMM/ANN systems, using either the log-RASTA-PLP or the CMS feature set and the CMU phonetic transcriptions.

Given the (preliminary) results presented in Table 4.1, there may indeed be an advantage in using Model 3 for training-independent tasks. However, in a recent experiment based on the CMU lexicon (in place of the PHONEBOOK lexicon), a 166,000-parameter HMM/ANN system using log-RASTA-PLP features and Model 1 yielded a 2.5% error rate. The same system yielded an 8.5% error rate on the 600-word lexicon (Table 4.2).

Task 4.1: Future Developments

Further tests will be made with the CMU dictionary, and a context-dependent system will be trained.

Task 4.2: Unsupervised adaptation and training

Task Coordinator: INESC
Executing Partners: CUED, INESC, SU

Task 4.2: Task Objective

Continuous speech recognition systems are normally speaker-independent because they are trained on large speech databases. Besides this speaker independence, which arises from the use of a large pool of speakers, the systems are also trained for a global task that depends on the database text.

These systems normally show a large degradation in performance when tested on speakers whose characteristics differ from those represented in the training pool. Their lexicons are also not the most suitable for specific tasks.

In this task we will develop techniques for unsupervised speaker and task adaptation in the context of hybrid HMM/ANN systems.

Task 4.2: Status

Although this task is only just beginning, we have carried out some experiments on unsupervised speaker adaptation using both MLP and RNN structures for phoneme classification.

Unsupervised speaker and environmental adaptation was successfully carried out during the 1995 ARPA Multiple Unknown Microphones task [42]. The mixture of linear input networks (MLIN) architecture, which builds a mixture of LIN-adapted RNNs, has been developed [43].

Task 4.2: Technical Description

The work carried out at INESC for this task was based on previous work from the WERNICKE project, in which different speaker-adaptation techniques were developed for a hybrid ANN-HMM system [44]. Among the techniques presented, the Linear Input Network (LIN) performed better than several alternatives. These evaluations were made on the Resource Management (RM) corpus in a static supervised speaker-adaptation task. In [45] this technique was extended to an incremental unsupervised mode on the Wall Street Journal (WSJ) database with an MLP-HMM system. In [46] we evaluated the LIN technique in both supervised and unsupervised modes in an incremental speaker-adaptation task using the WSJ database.

The unsupervised speaker adaptation carried out at CUED follows the standard block adaptation procedure. A forced alignment is made against the utterance hypotheses of the current model in order to obtain a frame/phone alignment. This alignment is then used to train the LIN or MLIN in a supervised fashion.
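The block adaptation loop can be sketched in a few lines of plain Python. This is a toy sketch under strong assumptions, not the CUED or INESC code: the fixed network is replaced by a single softmax layer `W` (a stand-in for the MLP or RNN), the forced alignment is replaced by the frame-wise best class of the unadapted model, and all names, data and hyperparameters are hypothetical. What it does show is the core idea: only the linear input transformation (A, b) is trained, starting from the identity, while the speaker-independent weights stay fixed.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def posteriors(W, A, b, x):
    # The LIN (A, b) transforms the input; the fixed classifier W maps it to posteriors.
    z = [s + bi for s, bi in zip(matvec(A, x), b)]
    return softmax(matvec(W, z))


def identity_lin(d):
    A = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    return A, [0.0] * d


def pseudo_labels(W, frames):
    # Stand-in for forced alignment: frame-wise best class of the unadapted model.
    A, b = identity_lin(len(frames[0]))
    labels = []
    for x in frames:
        p = posteriors(W, A, b, x)
        labels.append(p.index(max(p)))
    return labels


def train_lin(W, frames, labels, lr=0.1, epochs=20):
    # Gradient descent on the cross-entropy; only A and b are updated, W is fixed.
    d = len(frames[0])
    A, b = identity_lin(d)
    for _ in range(epochs):
        for x, y in zip(frames, labels):
            p = posteriors(W, A, b, x)
            dlogits = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
            # Back-propagate through the fixed layer W to the LIN output.
            dz = [sum(W[k][i] * dlogits[k] for k in range(len(W))) for i in range(d)]
            for i in range(d):
                for j in range(d):
                    A[i][j] -= lr * dz[i] * x[j]
                b[i] -= lr * dz[i]
    return A, b


def mean_log_prob(W, A, b, frames, labels):
    total = sum(math.log(posteriors(W, A, b, x)[y]) for x, y in zip(frames, labels))
    return total / len(frames)


# Toy adaptation block (hypothetical data): 2 features, 2 classes.
W = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.6, 0.1], [0.7, 0.2], [0.1, 0.5], [0.2, 0.6]]
labels = pseudo_labels(W, frames)          # unsupervised: labels come from the model itself
A, b = train_lin(W, frames, labels)        # supervised training on those labels
A0, b0 = identity_lin(2)
before = mean_log_prob(W, A0, b0, frames, labels)
after = mean_log_prob(W, A, b, frames, labels)
```

Initializing the LIN to the identity means that the adapted system starts out identical to the speaker-independent one, so adaptation can only move away from the baseline as far as the adaptation data supports.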

The linear input network code was extended to run within the mixture of experts framework. Here each expert comprises a recurrent network (whose weights are fixed) and a LIN. The outputs of the experts are combined using a gating network, which results in a non-linear adaptation technique. The MLIN is trained using the EM algorithm.
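The way the gating network combines the experts can be sketched as follows. The sketch rests on several assumptions that are not taken from the report: the fixed recurrent network is replaced by a single shared softmax layer `W`, the gating network is a linear layer `G` followed by a softmax, and the responsibility computation is the usual E-step of the textbook mixture-of-experts formulation rather than the specific procedure used for the MLIN. All names and values are hypothetical.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def expert_posteriors(W, lin, x):
    # One expert: its own LIN (A, b) followed by the fixed, shared network W.
    A, b = lin
    z = [s + bi for s, bi in zip(matvec(A, x), b)]
    return softmax(matvec(W, z))


def mlin_posteriors(W, lins, G, x):
    # The gating network weights each expert; the mixture is a convex combination.
    gates = softmax(matvec(G, x))
    outs = [expert_posteriors(W, lin, x) for lin in lins]
    n_classes = len(outs[0])
    mixed = [sum(g * o[c] for g, o in zip(gates, outs)) for c in range(n_classes)]
    return mixed, gates, outs


def responsibilities(gates, outs, label):
    # E-step of the standard mixture-of-experts EM: posterior probability that
    # each expert generated the observed label for this frame.
    joint = [g * o[label] for g, o in zip(gates, outs)]
    total = sum(joint)
    return [j / total for j in joint]


# Toy setup (hypothetical): 2 features, 2 classes, 2 experts.
W = [[1.0, 0.0], [0.0, 1.0]]
lins = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # identity LIN
    ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),  # LIN that scales the input
]
G = [[0.0, 0.0], [0.0, 0.0]]                 # untrained gate: uniform weights
x = [0.5, 0.1]

mixed, gates, outs = mlin_posteriors(W, lins, G, x)
resp = responsibilities(gates, outs, label=0)
```

Because the gate weights depend on the input frame, the overall input transformation differs from frame to frame, which is why a mixture of linear transformations yields a non-linear adaptation.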

Task 4.2: Future Development

The linear transformation learned by the LIN will be compared with an online speaker adaptation technique that uses speaker cluster information gathered from the training data. Speaker adaptive training using LINs may also be investigated.

The LIN technique will be applied to the baseline system for Portuguese in an unsupervised mode.

WP4: Conclusion

The PHONEBOOK task has allowed the exploration of different configurations of state-tying to achieve task independence. Work on the LIN architecture has continued from WERNICKE and the MLIN architecture has been developed.


Jean-Marc Boite
Tue Jan 7 12:46:31 MET 1997