
WP2: Lexica and Learning of Lexica

WP2: Overview

Work Package Manager: INESC
Executing Partners: FPMs, INESC, ICSI
The goals of large-vocabulary, speaker-independent recognition, domain adaptation and task independence require the availability of appropriate pronunciation lexica, encompassing a very large number of words and possibly multiple pronunciations for each word. It has recently been shown that the pronunciation lexicon is one of the most important components in the development of large-vocabulary, speaker-independent recognition systems.

The main focus of this workpackage is the portability of these systems to new languages and the development of the new lexica those languages require. In this respect, the two new languages involved in SPRACH are in different situations. For French, a significant amount of speech recognition work has already been done, and consequently relatively large lexica already exist, although they may have to be augmented. For Portuguese, the only existing work concerns small hand-built vocabularies for very limited tasks.

To cover the necessary developments we divide this workpackage into two tasks. In Task T2.1: Baseline dictionaries for new languages, which lasted only the first year of the project, we developed baseline lexica for the French and Portuguese languages. In Task T2.2: Automatic Learning of New Dictionaries, which runs over the remaining two years of the project, our intention is to develop techniques to automatically create new pronunciations for existing words in the lexicon and to add new words with their pronunciations to the baseline lexica created in Task T2.1. A more detailed description of each task is given below.

WP2: Milestones and Deliverables

WP2: Milestones

For the first year we had the following milestones:

As reported in Deliverable D2.2, milestone 2.2 has been met. Milestone 2.1 has not been met.

WP2: Deliverables

For the first year we had the following deliverables:

The text of the report below, as well as Annexes D2.1 and D2.2, constitutes this deliverable.

Task 2.1: Baseline Dictionaries for New Languages

Task Coordinator: INESC
Executing Partners: FPMs, INESC

Task 2.1: Task Objective

This task encompasses the work done on the development of baseline dictionaries for French and Portuguese.

For French, no good pronunciation dictionary was found, so we decided to create our own. The speech synthesis group at FPMs has developed a tool for grapheme-to-phoneme translation, and we plan to use it to generate our dictionary.

For Portuguese there was no pronunciation dictionary available that could be used in speech recognition. Our goal is to obtain a baseline pronunciation dictionary to be used in the development of the baseline system for Portuguese described in Task T1.2.

Task 2.1: Status

During the first year of the project we developed the baseline pronunciation dictionaries for Portuguese and French. This was the work planned for this task, and we consider it successfully completed for Portuguese; no French dictionary is available yet.

Task 2.1: Technical Description

In this task INESC created a baseline pronunciation dictionary for the Portuguese language, to be used in the development of the baseline system for Portuguese (see Task T1.2). Since our baseline system was based on the SAM database (see Task T1.1), our baseline dictionary was also based on the SAM database. SAM (see [24]) provides the orthographic text of the sentences together with their phonetic transcriptions, produced by hand by specialised linguistic researchers.

We began by collecting a Word Frequency List (WFL) of the SAM database, obtaining a list of 1314 different words. For each of those words we needed a phonetic transcription. To obtain these transcriptions we took the SAM orthographic and phonetic transcriptions and aligned them with the help of some specifically designed tools. Because the sentence transcriptions take co-articulation effects into account, some words appear with different pronunciations. The alignment process was verified by hand. In the end we obtained 1437 different pronunciations, which means that some of the words have more than one pronunciation.
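As an illustration of this procedure, the Python sketch below accumulates a word frequency list and per-word pronunciation variants from already aligned (word, phone string) pairs. It is only a hypothetical illustration: the alignment itself is assumed given, and the function name build_lexicon and the SAMPA-like phone strings are made up for the example rather than taken from the actual tools.

  from collections import Counter, defaultdict

  def build_lexicon(aligned_sentences):
      """aligned_sentences: iterable of sentences, each a list of
      (word, phone_string) pairs produced by the hand-verified alignment."""
      word_freq = Counter()
      pronunciations = defaultdict(Counter)
      for sentence in aligned_sentences:
          for word, phones in sentence:
              word = word.lower()
              word_freq[word] += 1
              pronunciations[word][phones] += 1
      return word_freq, pronunciations

  # toy example with illustrative SAMPA-like phone strings
  sentences = [[("bom", "b o~"), ("dia", "d i 6")],
               [("bom", "b o~ m"), ("dia", "d i 6")]]
  freq, lex = build_lexicon(sentences)
  print(len(freq), "distinct words")          # -> 2 distinct words
  # "bom" appears with two pronunciation variants, "dia" with one
  for word, variants in lex.items():
      for phones, count in variants.items():
          print(word, phones, count)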

A data file including the baseline dictionary for Portuguese is part of Deliverable D2.2 and can be found in Annex D2.2.

Task 2.2: Automatic Learning of New Dictionaries

Task Coordinator: INESC
Executing Partners: INESC, ICSI

Task 2.2: Task Objective

In this task we will focus on developing techniques to automatically add, to an existing dictionary, new pronunciations of existing words as well as new words with their pronunciations. The Portuguese language is the main target, but we expect that at least some of the techniques developed under this task will be general enough to be applied to other languages.

Task 2.2: Status

This task begins in the second year of the project and extends to its end. At this point, some background work has been done at ICSI and there are some ideas about the work to be done at INESC.

Task 2.2: Technical Description

One of the main goals of INESC in this project is to develop a speaker-independent, large-vocabulary continuous speech recognition system. As mentioned before, adequate speech and text databases are being collected under Task T1.1. The problem is how to obtain a pronunciation dictionary for such a large database. A preliminary analysis of the PUBLICO texts shows around 50 thousand different words. However, the best written Portuguese dictionaries contain around 100 thousand different words, and studies on Natural Language Processing for Portuguese have reported around 1.8 million words. How will we deal with this large amount of data?
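The preliminary word counts above come from a simple word-type count over the newspaper text. The sketch below is a hypothetical illustration of such a count; the function name, the tokenization regular expression and the toy sentences are assumptions made for the example, not the tools actually used on PUBLICO.

  import re
  from collections import Counter

  WORD = re.compile(r"[^\W\d_]+")   # sequences of letters, accents included

  def count_word_types(lines):
      """Count distinct word types in an iterable of text lines
      (for the real corpus, pass open file objects instead)."""
      counts = Counter()
      for line in lines:
          counts.update(w.lower() for w in WORD.findall(line))
      return counts

  # toy example standing in for the PUBLICO newspaper text
  counts = count_word_types(["O jornal PUBLICO publica textos todos os dias.",
                             "Os textos do jornal."])
  print(len(counts), "distinct words")   # -> 9 distinct words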

We do not expect to find the definitive solution, but we propose two different approaches to reduce the problem. One is to build pronunciations from phoneme sequences obtained from the recognizer and to smooth those pronunciations with a text-to-speech system (more accurately, with its grapheme-to-phoneme output) and/or pronunciation rules. For this we expect a close collaboration with ICSI. The other approach is to exploit regularities between words by dividing each word into three parts: prefix, body and suffix. This will allow us to derive new pronunciations from existing ones through a concatenation process, as sketched below. This approach will also be used for Language Models Adaptation (Task T3.3).
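The following sketch illustrates the concatenation idea under the assumption that a word's decomposition into prefix, body and suffix and the pronunciations of the parts are already known. The affix tables, the phone strings and the function name are purely illustrative.

  # hypothetical affix pronunciation tables (illustrative phone strings)
  prefix_pron = {"re": "R 1", "": ""}
  suffix_pron = {"mente": "m e~ t 1", "s": "S", "": ""}

  def derive_pronunciation(prefix, body, suffix, body_lexicon):
      """Derive a new pronunciation by concatenating the phone strings of
      the prefix, the body (taken from the existing lexicon) and the suffix."""
      parts = [prefix_pron[prefix], body_lexicon[body], suffix_pron[suffix]]
      return " ".join(p for p in parts if p)

  body_lexicon = {"lenta": "l e~ t 6"}              # known body pronunciation
  print(derive_pronunciation("", "lenta", "mente", body_lexicon))
  # -> "l e~ t 6 m e~ t 1" (illustrative only)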

In this task, ICSI will be building larger pronunciation dictionaries for Portuguese based on smaller baseline dictionaries using automatic learning techniques. INESC has developed a hand-built baseline lexicon using the SAM database (as reported in Task 2.1), and has provided it to ICSI as part of an initial examination of the data. Building dictionaries for large-vocabulary Portuguese, however, will not begin until the PUBLICO speech database is collected. In the meantime, at ICSI, we have been experimenting with different lexicon generation and pruning methods on a subset of the Numbers95 database from OGI. This continuous-speech database has a vocabulary of 35 words, which makes it easier to examine lexica in detail.

We have been experimenting with lexica constructed from several sources. As a baseline, we used a canonical set of single pronunciations selected from the CMU pronunciation dictionary [26]. We also generated lexica using either a combination of canonical dictionaries [26,27], pronunciations from hand-transcribed data from Numbers93 (an earlier version of the corpus) and TIMIT [25,28], or both the canonical and data-derived pronunciations. To generalize and prune these lexica, three different processes were used in different experiments. Two of these, Bayesian merging of pronunciations and maximum-likelihood merging of pronunciations, used BOOGIE [23], a grammar induction tool developed at ICSI. The third process was a simpler pronunciation graph building tool, which did no generalization or merging. In the following table, we show the results on a 1206-utterance, 4670-word development test set of continuous numbers.

 
Table 2.1: Word error rates for a subset of the OGI Numbers95 database. Rows correspond to different sources of pronunciations, and columns correspond to alternative schemes for merging multiple pronunciations.

BOOGIE running in maximum-likelihood mode (which does less merging than the Bayesian mode) performs best; in particular, it performs slightly but significantly better than the baseline when using all pronunciations. The difference with respect to the simple pronunciation graph tool is not significant, however; we are considering adding maximum-likelihood merging capabilities to the simple pronunciation graph tool in order to have a simple lexicon toolkit for sharing with the other partners.
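As a rough illustration of the kind of functionality such a simple toolkit might offer, the sketch below prunes pronunciation variants by relative frequency. It is a stand-in example, not the BOOGIE tool or the actual pronunciation graph tool, and the threshold and phone strings are made up.

  from collections import Counter

  def prune_lexicon(variant_counts, min_rel_freq=0.1):
      """Keep, for each word, the pronunciation variants whose relative
      frequency in the training data is at least min_rel_freq; always keep
      at least the most frequent variant."""
      pruned = {}
      for word, counts in variant_counts.items():
          total = sum(counts.values())
          kept = [p for p, c in counts.items() if c / total >= min_rel_freq]
          pruned[word] = kept or [counts.most_common(1)[0][0]]
      return pruned

  variants = {"seven": Counter({"s eh v ax n": 40, "s eh v n": 12, "s eh m": 1})}
  print(prune_lexicon(variants))   # keeps the two frequent variants only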

In addition, we have been experimenting with various ways of extending pronunciations for the Numbers95 speech recognition task. We compared results obtained with a lexicon of single, canonical dictionary pronunciations against a multiple-pronunciation lexicon generated by applying phonological rules to the canonical pronunciations, following the method described in [20,21], to model dialectal and fast-speech effects. In initial tests, we found that using the rule-expanded pronunciations improved performance, although not significantly (from the baseline 9.0% WER to 8.5%). We have also experimented with a lexicon generated by selecting the most common pronunciations from hand transcriptions of the Numbers95 training data, which reduces the error slightly further, to 8.2% WER.
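The sketch below illustrates the general idea of rule-based expansion with two made-up, context-free rewrite rules over ARPAbet-like phone strings; the actual rule set of [20,21] is richer and context-dependent.

  import re

  # illustrative fast-speech rewrite rules (not the rules of [20,21])
  RULES = [
      (r"\bn t iy\b", "n iy"),   # /t/ deletion after /n/ ("twenty" -> "twenny")
      (r"\bih n\b", "ax n"),     # vowel reduction
  ]

  def expand(pronunciation):
      """Return the canonical pronunciation plus every single-rule variant."""
      variants = {pronunciation}
      for pattern, replacement in RULES:
          variants.add(re.sub(pattern, replacement, pronunciation))
      return sorted(variants)

  print(expand("s eh v ax n"))      # no rule applies: canonical form only
  print(expand("t w eh n t iy"))    # adds the reduced "twenny"-style variant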

In other work at ICSI, at the 1996 Johns Hopkins Summer Workshop on Innovative Technologies for Large Vocabulary Conversational Speech Recognition, we (along with other participants) developed a model to automatically learn pronunciation variants for words directly from machine transcriptions of acoustic data from the Switchboard corpus. This model computed the probability of a particular pronunciation of a word given the surrounding context of words, resulting in a 1% absolute improvement in word error rate over the baseline (from 46% to 45%, a reduction similar to that achieved with other techniques explored by researchers working on the Switchboard corpus). We expect further improvements once the acoustic models are retrained (they were not retrained for the experiments reported above). In future work we will focus on adding further contextual factors, such as local measurements of speech rate, to the pronunciation model. We will also use the model to predict new pronunciations for unseen words, which we will use to extend the Portuguese lexicon.
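The toy model below illustrates the basic idea of conditioning the probability of a pronunciation on the surrounding word context, with a back-off to word-only counts when the context has not been observed. The class, its interface and the example pronunciations are assumptions made for illustration; it does not reproduce the workshop model.

  from collections import defaultdict, Counter

  class ContextPronModel:
      """Estimate P(pronunciation | word, following word), backing off to
      P(pronunciation | word) when the word pair was never observed."""
      def __init__(self):
          self.ctx = defaultdict(Counter)   # (word, next_word) -> pron counts
          self.uni = defaultdict(Counter)   # word -> pron counts

      def observe(self, word, next_word, pron):
          self.ctx[(word, next_word)][pron] += 1
          self.uni[word][pron] += 1

      def prob(self, word, next_word, pron):
          counts = self.ctx.get((word, next_word)) or self.uni[word]
          total = sum(counts.values())
          return counts[pron] / total if total else 0.0

  m = ContextPronModel()
  m.observe("going", "to", "g ax n ax")     # reduced "gonna"-style variant
  m.observe("going", "to", "g ow ih ng")    # canonical variant
  print(m.prob("going", "to", "g ax n ax")) # -> 0.5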

Task 2.2: Future Developments

This task begins next year, and we expect to carry out the work described in the previous section.

WP2: Conclusion

This workpackage has two phases, represented by the two tasks. In Task T2.1 we developed a baseline dictionary for Portuguese, which will be used in Task T2.2 as the starting point for automatically creating a large dictionary for Portuguese.


