Work Package Manager: CUED
Executing Partners: CUED, FPMs, SU, INESC

One of the goals of this project is a flexible hybrid HMM/ANN speech recognizer that is largely independent of the initial training data and easy to adapt to new languages. It is also possible that hybrid HMM/ANN systems are more robust under task-independent training. Consequently, several topics related to these problems are investigated in this WP.
Task Coordinator: FPMs
Executing Partners: FPMs, CUED
For task independence we want our acoustic models to be independent of both the acoustic environment and the lexicon. Currently our base system is trained for a single acoustic condition: noise-free read speech recorded with a known microphone.
The hybrid HMM/MLP system has been tested on PHONEBOOK, a database well suited to task-independent training.
A task-independent, speaker-independent isolated word recognition system has been trained and tested on PHONEBOOK and was shown to yield around 95% recognition performance on a 600-word lexicon.
At FPMs, we have trained our system on an isolated-word database, PHONEBOOK. The database contains 106 word lists, each composed of 75 or 76 words pronounced by a few (typically around 11) speakers. The speakers and words are different for each word list. We have tested our system on 8 word lists that did not belong to the training set. Since the lexicon is different in each of these 8 word lists, we can either recognize the 8 word lists as a whole (yielding a lexicon of 600 words) or recognize each word list independently with a lexicon of about 75 words. In the second case, the recognition rate is the (unweighted) average of the 8 per-list recognition rates. So far, only a ``small'' training set (5 hours of speech) has been used (although final systems will also be trained on the ``full'' training set). The first dictionary is the one released with PHONEBOOK and contains the phonetic transcriptions of the PHONEBOOK words according to a 42-phoneme inventory. The second dictionary is the 110,000-word CMU 0.4 dictionary, using 39 phonemes (a subset of the TIMIT phonemes). The PHONEBOOK words not present in CMU 0.4 have been transcribed manually.
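As a small illustration of the two scoring choices described above (the per-list counts below are invented numbers, not PHONEBOOK results), the unweighted per-list average and the pooled rate over the merged lexicon are computed differently:

```python
# Hypothetical per-list results for 8 test word lists (counts are made up).
correct = [71, 70, 72, 69, 73, 70, 71, 72]   # words recognized per list
total   = [75, 75, 76, 75, 76, 75, 75, 76]   # words in each list

# Per-list recognition rates, then the unweighted average used when each
# list is decoded against its own ~75-word lexicon.
rates = [c / t for c, t in zip(correct, total)]
unweighted_avg = sum(rates) / len(rates)

# A pooled rate (as when all lists are decoded against one 600-word
# lexicon) would instead weight each list by its size.
pooled = sum(correct) / sum(total)
```

The two figures differ only slightly here because the lists are nearly the same size; with a merged lexicon the recognition task itself is also harder, so the pooled run would normally use its own counts.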
Figure 4.1: Phoneme model with 3 tied states.
Figure 4.2: Phoneme model with 3 independent states.
Figure 4.3: Phoneme model with 1 state and 2 transition states. These two transition states are tied across several phones and possibly across all the phones.
Three kinds of left-to-right phone models, shown in Figures 4.1 to 4.3, have been investigated.
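The three topologies can be sketched as state-tying schemes over a left-to-right phone model (a minimal illustration; the state names below are invented, not those of the actual system):

```python
def phone_states(phone, topology):
    """Return the output-distribution labels for a left-to-right phone model.

    topology 1: 3 states tied to a single distribution per phone (Fig. 4.1)
    topology 2: 3 independent states per phone (Fig. 4.2)
    topology 3: 1 phone-specific state flanked by 2 transition states
                shared across phones (Fig. 4.3)
    """
    if topology == 1:
        return [phone, phone, phone]
    if topology == 2:
        return [f"{phone}_1", f"{phone}_2", f"{phone}_3"]
    if topology == 3:
        # Here the transition states are tied across all phones.
        return ["trans_in", phone, "trans_out"]
    raise ValueError(f"unknown topology: {topology}")
```

Model 3 needs far fewer phone-specific distributions, which is one reason it is a candidate for task-independent training with limited data.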
Table 4.1: Error rates on isolated word recognition (75 American English words never seen in the training data) with hybrid HMM/ANN systems, using either the log-RASTA-PLP or CMS feature set and the PHONEBOOK phonetic transcriptions.
Table 4.2: Error rates on isolated word recognition (75 or 600 American English words never seen in the training data) with hybrid HMM/ANN systems, using either the log-RASTA-PLP or CMS feature set and the CMU phonetic transcriptions.
Given the (preliminary) results presented in Table 4.1, there could indeed be an advantage in using Model 3 for task-independent training. However, in a recent experiment based on the CMU lexicon (in place of the PHONEBOOK lexicon), a 166,000-parameter HMM/ANN system using log-RASTA-PLP features and Model 1 yielded a 2.5% error rate. The same system also yielded an 8.5% error rate on the 600-word lexicon (Table 4.2).
Further tests will be made with the CMU dictionary, and a context-dependent system will be trained.
Task Coordinator: INESC
Executing Partners: CUED, INESC, SU
Continuous speech recognition systems are normally speaker-independent because they are trained on large speech databases. Besides their speaker independence, which arises from the use of a large pool of speakers, the systems are also trained for a global task that depends on the database text.
These systems normally show a large degradation in performance when tested on speakers whose characteristics differ from those of the training pool. Moreover, their lexicons are not the most suitable for specific tasks.
In this task we will develop techniques for unsupervised speaker and task adaptation in the context of hybrid HMM/ANN systems.
Although this task is only beginning, we have performed some experiments on unsupervised speaker adaptation using both MLP and RNN structures for phoneme classification.
Unsupervised speaker and environmental adaptation were successfully carried out on the 1995 ARPA Multiple Unknown Microphones task. The mixture of linear input networks (MLIN) architecture has been developed, which builds a mixture of LIN-adapted RNNs.
The work developed at INESC for this task was based on previous work from the WERNICKE project, where different techniques for speaker adaptation were developed and applied to a hybrid ANN/HMM system. Among the techniques evaluated, the Linear Input Network (LIN) was shown to outperform several alternatives. These evaluations were made on the Resource Management (RM) corpus in a static supervised speaker-adaptation task. The technique was later extended to an incremental unsupervised mode on the Wall Street Journal (WSJ) corpus with an MLP-HMM system, and we have since evaluated the LIN technique in both supervised and unsupervised modes in an incremental speaker-adaptation task using the WSJ database.
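The essence of the LIN is a trainable linear transform placed in front of a frozen speaker-independent network, initialised at identity and trained alone on adaptation data. A minimal numerical sketch (the tiny linear-plus-softmax "network" below is a stand-in for the real MLP/RNN, and all sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained speaker-independent (SI) network: a fixed
# linear map from 3 input features to 4 phone classes plus a softmax.
W_si = rng.normal(size=(4, 3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adapt_lin(X, Y, lr=0.05, steps=300):
    """Train only the LIN (an input-space linear transform) by gradient
    descent on cross-entropy; the SI network weights stay frozen."""
    W_lin = np.eye(X.shape[1])          # identity start: the unadapted system
    for _ in range(steps):
        U = X @ W_lin.T                 # LIN-transformed features
        P = softmax(U @ W_si.T)         # frozen SI network posteriors
        G = (P - Y) @ W_si              # gradient of the loss w.r.t. U
        W_lin -= lr * (G.T @ X) / len(X)
    return W_lin
```

Because the logits are linear in the LIN weights, this training problem is convex, which helps when only small amounts of adaptation data are available.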
The unsupervised speaker adaptation carried out at CUED follows the standard block adaptation procedure. A forced alignment is made against the utterance hypotheses of the current model in order to obtain a frame/phone alignment. This alignment is then used to train the LIN or MLIN in a supervised fashion.
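The procedure can be sketched as a simple loop (all function arguments below are hypothetical stand-ins, not the CUED code):

```python
def block_adapt(utterances, recognise, forced_align, train_supervised, model):
    """Unsupervised block adaptation: decode each block, force-align the
    hypothesis to obtain frame/phone labels, then train the LIN/MLIN on
    those labels exactly as in supervised adaptation."""
    for utt in utterances:
        hyp = recognise(model, utt)                   # first-pass hypothesis
        labels = forced_align(model, utt, hyp)        # frame/phone alignment
        model = train_supervised(model, utt, labels)  # supervised-style update
    return model
```

The hypothesis transcription plays the role of the missing reference, so adaptation quality depends on the first-pass error rate.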
The linear input network code was extended to run within the mixture-of-experts framework. Each expert comprises a recurrent network (whose weights are fixed) and a LIN. The outputs of the experts are combined using a gating network, which results in a non-linear adaptation technique. The MLIN is trained using the EM algorithm.
Future work includes a comparison of the linear transformation learned by the LIN with an online speaker-adaptation technique that uses speaker-cluster information gathered from the training data, as well as a possible investigation of speaker-adaptive training using LINs.
The LIN technique will be applied to the baseline system for Portuguese in an unsupervised mode.
The PHONEBOOK task has allowed the exploration of different configurations of state-tying to achieve task independence. Work on the LIN architecture has continued from WERNICKE and the MLIN architecture has been developed.