One of INESC's main goals in this project is to develop a speaker-independent, large-vocabulary continuous speech recognition system. For this purpose, an adequate database of both speech and text was collected under task 1.1. The problem is how to obtain a pronunciation dictionary for such a large database. From an analysis of PÚBLICO texts we obtained around 155,000 different words. However, studies of natural language processing for Portuguese have reported around 1.8 million words. How will we deal with this large amount of data?
We propose two different approaches to the problem. One is to find regularities among words, e.g. by dividing each word into three parts: prefix, body and suffix. This will allow us to derive new pronunciations from existing ones through a concatenation process. We are currently studying this approach, which is also being used for Language Model Adaptation (task 3.3).
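The concatenation idea can be illustrated with a minimal sketch. The three small lexicons, the phone symbols and the function name `derive_pronunciation` are all hypothetical and chosen only for illustration; they are not the INESC phone set or implementation. The sketch also ignores any sound changes at the boundaries between parts, which a real system would have to handle.

```python
# Hypothetical toy lexicons: spelling -> list of phone symbols (illustrative only).
PREFIXES = {"re": ["R", "e"], "des": ["d", "e", "S"]}
BODIES = {"faz": ["f", "a", "z"], "cont": ["k", "o~", "t"]}
SUFFIXES = {"er": ["e", "r"], "ar": ["a", "r"]}


def derive_pronunciation(word):
    """Split `word` into prefix + body + suffix and concatenate their phones.

    The prefix and the suffix may be empty; the body must be a known entry.
    Returns the first decomposition found, or None if there is none.
    """
    for i in range(len(word)):
        prefix = word[:i]
        if prefix and prefix not in PREFIXES:
            continue
        for j in range(i + 1, len(word) + 1):
            body, suffix = word[i:j], word[j:]
            if body in BODIES and (not suffix or suffix in SUFFIXES):
                return (PREFIXES.get(prefix, [])
                        + BODIES[body]
                        + SUFFIXES.get(suffix, []))
    return None


if __name__ == "__main__":
    print(derive_pronunciation("refazer"))
    print(derive_pronunciation("descontar"))
```

With such a scheme, the dictionary only needs entries for the parts, and the number of derivable words grows with the product of the three inventories rather than their sum.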
The other approach is to build a pronunciation from phoneme sequences obtained from the recognizer and to smooth those pronunciations with a Text-to-Speech system (more accurately, the grapheme-to-phonetic output of such a system) and/or with pronunciation rules. For this, INESC has had a close collaboration with ICSI.
In this task, ICSI has undertaken software improvements to produce an easy-to-use lexicon toolkit suitable for installation at other sites. The toolkit incorporates the static pronunciation modeling techniques previously developed at ICSI, and allows the user to iteratively retrain the lexicon and the acoustic model using a single command. The new software has been tested using Portuguese data, although not enough data is yet available to give meaningful recognition results.
A pictorial description of how the toolkit works is shown in Figure 2.1. The lexicon training process can begin with a multiple pronunciation dictionary, or by applying phonological rules to a set of canonical pronunciations. Pronunciation probabilities can be reestimated by simply counting occurrences of each pronunciation variant, by merging states in the HMM pronunciation models, or by estimating the likelihood of each phonological rule's application. One can also combine these methods. Low probability pronunciations can be pruned. Phoneme duration models are also reestimated from the data. Lexicon training passes can be alternated with neural net acoustic model training (provided by the ICSI Quicknet software package), or either the lexicon or the neural net can be trained independently, holding the other constant. Previously, we reported results using BOOGIE, an HMM-merging software package developed at ICSI, which we found had a slight advantage over non-merging techniques on the OGI Numbers corpus. Maximum-likelihood HMM state merging has been integrated into our portable package to eliminate the need for the LISP-based BOOGIE package.
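The simplest of the reestimation options described above, counting occurrences of each pronunciation variant and then pruning low-probability ones, can be sketched as follows. This is only an illustration under stated assumptions, not the ICSI toolkit's code: the input format (a list of word and phone-sequence pairs taken from an alignment), the function name `reestimate` and the `min_prob` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def reestimate(alignments, min_prob=0.1):
    """Estimate pronunciation probabilities by relative frequency, then prune.

    alignments: iterable of (word, phones) pairs, one per aligned word token.
    Variants whose relative frequency is below `min_prob` are dropped (the
    most frequent variant is always kept), and the rest are renormalized.
    """
    counts = defaultdict(Counter)
    for word, phones in alignments:
        counts[word][tuple(phones)] += 1

    lexicon = {}
    for word, variants in counts.items():
        total = sum(variants.values())
        best = variants.most_common(1)[0][0]
        kept = {p: n / total for p, n in variants.items()
                if n / total >= min_prob or p == best}
        norm = sum(kept.values())
        lexicon[word] = {p: q / norm for p, q in kept.items()}
    return lexicon


if __name__ == "__main__":
    data = ([("dois", ["d", "o", "j", "S"])] * 7
            + [("dois", ["d", "o", "S"])] * 2
            + [("dois", ["d", "o", "j"])] * 1)
    for pron, prob in reestimate(data, min_prob=0.15)["dois"].items():
        print(" ".join(pron), round(prob, 3))
```

Alternating a pass like this with acoustic model retraining changes the alignments, and therefore the counts, on the next iteration.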
In other work at ICSI, we are continuing development of algorithms for automatically learning pronunciation variants in English for the Switchboard corpus, with an eye towards developing software that can be ported over to Portuguese. Our method uses statistical models such as decision trees to determine how pronunciations can deviate from the canonical, depending on local contextual factors. At the 1996 Johns Hopkins Summer Workshop on Innovative Technologies for Large Vocabulary Conversational Speech Recognition, we (along with others) developed a model which predicted this transformation from canonical form on the phone level, taking into account the stress, syllabic position, and phonological features of surrounding phones.
To better understand Switchboard pronunciation phenomena, we have also undertaken an investigation of the phonological patterns in Switchboard, using labeled speech data transcribed by linguists at ICSI. In addition to building phone-level mappings, we have been examining pronunciation alternatives at the syllable and word levels. Syllable-level models incorporate more context than phone-level models; in addition, using only a few models, one can cover a large portion of the words in a corpus (the top 175 syllables cover 71% of the tokens in Switchboard). Word-level models are yet more specific, but the data for building models is sparse. We hope to be able to integrate information from all of these levels to smooth estimates of pronunciation probability.
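A coverage figure of this kind is the fraction of all tokens accounted for by the N most frequent types. The sketch below shows one way to compute it; the function name `token_coverage` and the toy token list are hypothetical, and the sketch does not reproduce the Switchboard statistics.

```python
from collections import Counter


def token_coverage(tokens, top_n):
    """Return the fraction of tokens covered by the `top_n` most frequent types."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / sum(counts.values())


if __name__ == "__main__":
    # Toy syllable tokens (illustrative symbols only).
    syllables = ["dh_ax"] * 5 + ["ae_n_d"] * 3 + ["t_uw"] * 2
    print(token_coverage(syllables, 1))
    print(token_coverage(syllables, 2))
```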
We have also been working towards introducing more dynamic factors into the pronunciation model. In conversational speech, for example, speaking rate has a considerable effect on pronunciations: in Switchboard, we found that from very slow to very fast speech, the phoneme deletion rate rises from 9.3% to 13.6%; the phone substitution rate also changes significantly, rising from 16.9% to 24.2%. We are also examining whether word predictability, as represented by unigram and trigram language model scores, has an effect on pronunciation.
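Deletion and substitution rates such as those quoted above can be computed from an alignment between canonical and transcribed phones. The following is a minimal sketch under assumptions that are ours, not the report's: the alignment is given as pairs of canonical phone and surface phone, with `None` marking a deletion, and both rates are taken per canonical phone. The function name is hypothetical.

```python
def deletion_and_substitution_rates(aligned):
    """Compute (deletion_rate, substitution_rate) per canonical phone.

    aligned: list of (canonical_phone, surface_phone) pairs, where
    surface_phone is None when the canonical phone was deleted.
    """
    total = len(aligned)
    if total == 0:
        return 0.0, 0.0
    deletions = sum(1 for _, surface in aligned if surface is None)
    substitutions = sum(1 for canonical, surface in aligned
                        if surface is not None and surface != canonical)
    return deletions / total, substitutions / total


if __name__ == "__main__":
    # Toy alignment: one deletion and one substitution out of four phones.
    alignment = [("ae", "ae"), ("n", "n"), ("d", None), ("t", "dx")]
    print(deletion_and_substitution_rates(alignment))
```

Computing these rates separately for utterances grouped by speaking rate gives a per-rate comparison of the kind reported here.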