
Background

State-of-the-art continuous speech recognition systems suffer from several problems. First, recognizing unrestricted speech requires a very large acoustic dictionary, which increases the search space, slows down recognition and degrades performance. This is particularly true for highly inflectional languages such as German and Portuguese. Second, even a huge dictionary cannot foresee all the new, out-of-vocabulary (OOV) words occurring in the test text. As a consequence, there will always be some words unknown to the recognizer that cannot be recognized properly and that may cause errors in the recognition process. Finally, in spite of large databases, training material is still insufficient. This applies in particular to statistical language models such as n-gram language models [8], which need a lot of data to guarantee robust probability estimates. Hence a way has to be found to build robust language models for inflectional languages like Portuguese. In this work we propose a decomposition method originally based on the morphological analysis of Portuguese words. Instead of using words as the only base units, we also use morphemes. With this process we reduce not only the vocabulary size, and therefore the language model perplexity, but also the rate of out-of-vocabulary words and the memory requirements.

In every language there are two main mechanisms for creating new words: derivation and compounding. In Portuguese, derivation is the main one, especially inflectional derivation. Comparing Portuguese with English, it is easy to see that Portuguese has a far larger number of inflections. Consider the verb ``cantar'' (``to sing'' in English). In Portuguese there is a different ending for almost every person in the singular and plural:

So instead of the 2 different endings found in English (sing, sings), there are 4 of them in Portuguese. Moreover, taking every verb, every tense and every person into account, one gets many more distinct word forms than in English. A complete morphological decomposition would significantly reduce the vocabulary size. However, we do not have enough linguistic knowledge to perform an exhaustive decomposition, and it would be difficult to recompose the morphemes at the final stage of speech recognition. In Portuguese the verb is the most variable word class, so our study was based only on the morphological decomposition of regular verbs.
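The effect of such a decomposition on vocabulary size can be illustrated with a minimal Python sketch. It is not the decomposition procedure used in this work: the ending list (the six standard present-indicative endings of regular -ar verbs), the stem list and the names `decompose` and `unit_vocabulary` are all assumptions chosen for illustration. A word is split only when what remains after stripping an ending is a known stem, which avoids splitting words that are not verb forms.

```python
# Hypothetical ending inventory: present-indicative endings of regular -ar
# verbs. An illustrative assumption, not the inventory used in this work.
AR_PRESENT_ENDINGS = ["o", "as", "a", "amos", "ais", "am"]


def decompose(word, stems):
    """Return [stem, '+ending'] if word is a known stem plus a listed ending,
    otherwise return the word unchanged as a single unit."""
    # Try longer endings first so that "amos" is preferred over "o" or "a".
    for ending in sorted(AR_PRESENT_ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if stem in stems:
                return [stem, "+" + ending]
    return [word]


def unit_vocabulary(words, stems):
    """Collect the set of base units (stems, endings, whole words)."""
    units = set()
    for word in words:
        units.update(decompose(word, stems))
    return units


# Hypothetical stems of three regular verbs: cantar, falar, andar.
stems = {"cant", "fal", "and"}
words = [stem + ending for stem in stems for ending in AR_PRESENT_ENDINGS]

print(decompose("cantamos", stems))   # ['cant', '+amos']
print(decompose("casa", stems))       # ['casa'] (not a known verb stem)
print(len(set(words)), len(unit_vocabulary(words, stems)))  # 18 9
```

In this toy setting the full-form vocabulary grows as the product of stems and endings, while the unit vocabulary grows only as their sum. Marking endings with a leading "+" is one possible way of keeping enough information to recompose words after recognition.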


Christophe Ris
1998-11-10