
Lexicon

From 188 editions of the PÚBLICO newspaper, corresponding to 24,287 articles and 10,976,009 words in total, we computed a Word Frequency List (WFL) containing 155,867 distinct words. After selecting the different sets (training, development test and evaluation test), we ended up with a total of 27,833 distinct words. This list of words was phonetically transcribed by a rule system 1.3. This system has some known difficulties, so the lexicon is being revised by hand by a specialized linguist. The lexicon will be further refined through the use of smoothing techniques that generate alternative pronunciations based on actual pronunciations and on the likelihood of the acoustic-phonetic models.
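As an illustration only, the counting step behind a word frequency list can be sketched as follows. The tokenization rule (lowercased runs of letters, optionally joined by hyphens) and the function names `tokenize`, `word_frequency_list` and `lexicon_for` are assumptions made for this sketch; the text above does not describe how the original WFL was tokenized or how the sets were selected.

```python
import re
import unicodedata
from collections import Counter

# Assumed tokenization rule (not taken from the original system):
# runs of letters, including accented ones, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase word tokens."""
    # Normalize so that accented letters are single code points.
    text = unicodedata.normalize("NFC", text)
    return [token.lower() for token in TOKEN_RE.findall(text)]


def word_frequency_list(articles):
    """Count how often each distinct word occurs across all articles."""
    counts = Counter()
    for article in articles:
        counts.update(tokenize(article))
    return counts


def lexicon_for(selected_articles):
    """Return the sorted distinct words of a selected subset of articles."""
    return sorted(word_frequency_list(selected_articles))


# Toy data standing in for the newspaper articles.
articles = [
    "O jornal publicou a notícia.",
    "A notícia do jornal foi lida.",
]

wfl = word_frequency_list(articles)
total_words = sum(wfl.values())   # number of running words
distinct_words = len(wfl)         # number of different words
```

Here `total_words` and `distinct_words` play the role of the corpus totals quoted above, and `lexicon_for` shows how restricting the count to a selected subset of articles yields a smaller word list.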



Christophe Ris
1998-11-10