Next: Task 3.3: Future Development Up: Portuguese Language Modelling Previous: Background

Decomposition Results

All data used in our experiments was taken from the BD-PUBLICO Portuguese database [12]. It consists of about 11 million running words comprising 158,186 distinct words, plus two test sets, one with a 5K and the other with a 20K word vocabulary, used to evaluate the impact of the decomposition process. First, we morphologically classified all 158,186 words. Table 3.5 shows that 35% of the words are verbal inflections, whose decomposition yields a 29% reduction in vocabulary size (from 158,186 to 112,286 words). We can therefore conclude that a significant vocabulary reduction can be achieved by decomposing verbal inflections. Second, we decomposed these verbal inflections into root/suffix form using specifically designed tools and a hand-made list of all possible verbal suffixes occurring in the 55,701 verbal inflections.
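The root/suffix split described above can be sketched as follows. This is a minimal illustration only: the suffix list here is a toy subset of the real 434-entry hand-made list, and the names `SUFFIXES` and `decompose` are hypothetical, not the tools actually used.

```python
# Toy subset of a hand-made verbal suffix list (the real list had 434 entries),
# sorted longest-first so the most specific suffix is tried before shorter ones.
SUFFIXES = sorted(["armos", "asse", "amos", "ava", "ar", "o", "a"],
                  key=len, reverse=True)

def decompose(word):
    """Split a verbal inflection into (root, suffix).

    Returns (word, "") when no known suffix matches or when stripping
    the suffix would leave an empty root.
    """
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)], suf
    return word, ""
```

For example, `decompose("falamos")` splits the inflection into the root `fal` and the suffix `amos`; counting each root and suffix once is what shrinks the 55,701 inflections to 5,696 roots plus 434 suffixes.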

Table 3.5: Comparing word and morpheme vocabulary.
BD-PUBLICO vocabulary   158,186
Verbal inflections       55,701
Verbal roots              5,696
Verbal suffixes             434
Decomposed vocabulary   112,286

In our experiments we built two overall language models: a bigram language model based on words (LM_WORDS) and another based on their morpheme decompositions (LM_MORPHEMES). The backoff bigram language models were generated with the ``Carnegie Mellon Statistical Language Modeling Toolkit'' [11]. Smoothing was done by absolute discounting [28], discarding all singletons. Tests were run on both test sets, with 5K and 20K vocabularies. Table 3.6 shows how word decomposition affects vocabulary size; the reduction grows with vocabulary size.
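The smoothing scheme can be illustrated with a minimal sketch of absolute discounting for a bigram model. This is an interpolated variant that redistributes the discounted mass over the unigram distribution; the toolkit's exact backoff formulation may differ, and `train_bigram_ad` is a hypothetical name introduced here for illustration.

```python
from collections import Counter

def train_bigram_ad(tokens, d=0.5):
    """Bigram LM with absolute discounting (interpolated sketch).

    Each observed bigram count is discounted by d; the freed probability
    mass is redistributed according to the unigram distribution.
    """
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    p_uni = {w: c / total for w, c in uni.items()}

    def prob(w1, w2):
        c1 = sum(c for (a, _), c in bi.items() if a == w1)  # context count
        if c1 == 0:
            # unseen context: fall back to the unigram probability
            return p_uni.get(w2, 0.0)
        c12 = bi.get((w1, w2), 0)
        # mass freed by discounting d from each distinct successor of w1
        n_succ = sum(1 for (a, _) in bi if a == w1)
        backoff_mass = d * n_succ / c1
        return max(c12 - d, 0) / c1 + backoff_mass * p_uni.get(w2, 0.0)

    return prob
```

Because the redistributed mass is weighted by a unigram distribution that sums to one, the conditional probabilities for any seen context still sum to one.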

Table 3.6: Vocabulary decrease on both test sets. ``TRAIN'' is the vocabulary that results from the texts used as language-model training material.
Vocabulary    Words   Morphemes   Reduction
5K            5,000     4,068      -18.6%
20K          13,070     9,651      -26.2%
TRAIN       139,758    97,386      -30.3%

Table 3.7: Reduction of new words (OOV).
Test set   Words   Morphemes
5K          20%       15%
20K         10%        7%

Moreover, the number of new (out-of-vocabulary) words in the test sets decreased when using morphemes instead of words (table 3.7), as did the memory requirements of the proposed model (table 3.8). As expected, the slower vocabulary growth leads to a significant perplexity reduction for the morpheme-based language models: morpheme bigram perplexity is 63% lower than word bigram perplexity (see table 3.9).

Table 3.8: Reduction (in Mbytes) of memory requirements.
Test set   Words   Morphemes   Reduction
5K          2.5       1.8        -28%
20K         4.4       3.2        -27%

Table 3.9: Word and morpheme perplexity.
Test set   Words   Morphemes   Reduction
5K          229       85         -63%
20K         257       95         -63%
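The perplexity figures above follow the standard definition, PP = exp(-(1/N) * sum of log P(w_i | w_{i-1})). A minimal sketch, assuming a conditional probability function `prob(w1, w2)` such as a bigram model provides (`perplexity` is a name introduced here for illustration):

```python
import math

def perplexity(prob, tokens):
    """Bigram perplexity: exp of the negative mean log-probability
    of each token given its predecessor."""
    logp = 0.0
    n = 0
    for w1, w2 in zip(tokens, tokens[1:]):
        logp += math.log(prob(w1, w2))
        n += 1
    return math.exp(-logp / n)
```

As a sanity check, a model that assigns every word probability 1/4 yields a perplexity of exactly 4 on any test text.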

Christophe Ris