The development database for this baseline system was the Portuguese part of the SAM database EUROM.1. The SAM database consists of read speech from three different sets of speakers, each with different material, as described in Task 1.1. The database provides no phonetic labelling, no dictionary and no language model. What it does provide are the speech files, the assignment of prompts to each speaker, the text of the passages and sentences, and the SAMPA phonotypical transcription.
The development of the Portuguese baseline system consisted of the following steps (see Figure 1.1):
For the automatic labelling of the SAM database we first trained the acoustic-phonetic models on the TIMIT database (an MLP trained on TIMIT to perform phoneme classification). Next we created two conversion tables: one from the TIMIT phonemes to the IPA (International Phonetic Alphabet) symbols and another from SAMPA (the Portuguese SAM phonemes) to the IPA symbols. Combining the two tables gave a mapping from TIMIT phonemes to Portuguese phonemes. Not every phoneme has a counterpart in the other set. The TIMIT phonemes without a counterpart were mapped to the closest Portuguese phonemes; these are just a few phonemes (dh, dx, em, en, eng, ...). The Portuguese phonemes without a counterpart were given a default threshold value, and the probability vector was then rescaled to sum to one.
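The mapping step can be illustrated with a minimal Python sketch. It is not the code used in the project: the function name `map_posteriors`, the floor value, the toy phoneme symbols and the choice to sum the probabilities of TIMIT phonemes that share a Portuguese target are all assumptions made for illustration.

```python
def map_posteriors(timit_probs, timit_to_pt, pt_phonemes, floor=0.001):
    """Turn one frame of TIMIT posteriors into Portuguese posteriors.

    timit_probs: dict TIMIT phoneme -> probability for one frame.
    timit_to_pt: dict TIMIT phoneme -> Portuguese phoneme (complete table).
    pt_phonemes: list of all Portuguese phonemes.
    floor:       default value for Portuguese phonemes that no TIMIT
                 phoneme maps to (hypothetical value).
    """
    pt_probs = {p: 0.0 for p in pt_phonemes}
    # Assumption: TIMIT phonemes sharing a target have their mass summed.
    for timit_ph, prob in timit_probs.items():
        pt_probs[timit_to_pt[timit_ph]] += prob
    # Portuguese phonemes without a TIMIT counterpart get the default value.
    covered = set(timit_to_pt.values())
    for p in pt_phonemes:
        if p not in covered:
            pt_probs[p] = floor
    # Rescale so that the vector sums to one again.
    total = sum(pt_probs.values())
    return {p: v / total for p, v in pt_probs.items()}


# Toy example with invented symbols and values.
example = map_posteriors(
    {"aa": 0.5, "dh": 0.2, "d": 0.3},
    {"aa": "a", "dh": "d", "d": "d"},
    ["a", "d", "J"],
    floor=0.25,
)
```

In this toy case "dh" has no Portuguese counterpart and is folded into "d", while "J" has no TIMIT source and receives only the default value before rescaling.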
The next step was to feed the SAM database to the TIMIT net. The probabilities output by the TIMIT net were transformed according to the previously defined mapping table, becoming the new set of Portuguese probabilities. This set of probabilities was used in the Y0 1.2 decoder to perform the forced alignment. As input to the forced alignment process we also used the baseline lexicon developed under Task 2.1.
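To make the idea of forced alignment concrete, the following Python sketch aligns a known phoneme sequence to per-frame probabilities with a simple Viterbi pass. It is only an illustration under strong assumptions (one state per phoneme, no transition probabilities, no duration constraints) and does not reproduce the Y0 decoder; the function name `forced_align` and the toy data are hypothetical.

```python
import math


def forced_align(frame_probs, phonemes, eps=1e-12):
    """Return one phoneme label per frame for a fixed phoneme sequence.

    frame_probs: list of dicts (phoneme -> probability), one per frame.
    phonemes:    phoneme sequence from the lexicon, in order.
    """
    n_frames, n_states = len(frame_probs), len(phonemes)
    if n_frames < n_states:
        raise ValueError("fewer frames than phonemes")
    neg_inf = float("-inf")

    def logp(t, s):
        # eps avoids log(0) for phonemes with zero probability.
        return math.log(max(frame_probs[t].get(phonemes[s], 0.0), eps))

    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = logp(0, 0)  # the path must start in the first phoneme
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + logp(t, s)
    # Trace back from the last phoneme at the last frame.
    labels = [None] * n_frames
    s = n_states - 1
    for t in range(n_frames - 1, -1, -1):
        labels[t] = phonemes[s]
        s = back[t][s]
    return labels


# Toy example with invented probabilities.
toy_frames = [{"a": 0.9, "b": 0.1}, {"a": 0.8, "b": 0.2}, {"a": 0.1, "b": 0.9}]
toy_labels = forced_align(toy_frames, ["a", "b"])
```

The resulting frame labels are the kind of output that can serve as training targets for a new classifier.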
After completing the alignment we trained a new MLP on the SAM database (using the labels from the forced alignment) to perform phoneme classification. With this network we made another forced alignment pass, generating new labels. The process was iterated three times. The results are presented in Table 1.9.
The results show the improvement in classification at the frame level. This process of alternating training and alignment proved effective in decreasing the classification error.
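A frame-level classification error of this kind can be computed as in the sketch below. The report does not state exactly how the error was measured, so this is an assumption: a frame counts as an error when the most probable phoneme of the network differs from the label given by the alignment. The function name `frame_error_rate` and the toy data are hypothetical.

```python
def frame_error_rate(frame_probs, labels):
    """Fraction of frames whose most probable phoneme differs from the label.

    frame_probs: list of dicts (phoneme -> probability), one per frame.
    labels:      list of reference phoneme labels, one per frame.
    """
    if len(frame_probs) != len(labels) or not labels:
        raise ValueError("need the same, non-zero number of frames and labels")
    errors = sum(
        1
        for probs, label in zip(frame_probs, labels)
        if max(probs, key=probs.get) != label
    )
    return errors / len(labels)


# Toy example with invented values: the third frame is misclassified.
toy_probs = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.7, "b": 0.3},
    {"a": 0.6, "b": 0.4},
    {"a": 0.2, "b": 0.8},
]
toy_rate = frame_error_rate(toy_probs, ["a", "a", "b", "b"])
```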
After the training/alignment phase we evaluated the system. In the training phase we used 179 files from the Many Talker Set. For evaluation we picked the 10 speakers of the Few Talker Set, choosing three passages for each speaker. Here we again used Y0, now including a word pair language model extracted from the SAM texts (for details see Task 3.3). As reported in Task 3.3, we built two different word pair language models. In the first we collected the pairs from the SAM text with the sentences isolated from their passages. Recall from Task 1.1 that each passage contains five thematically connected sentences; this gave 250 separate sentences. In the second we took the passage itself as the unit, which gave 50 passages with no division between the sentences of the same passage. The results are presented in Table 1.10.
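The difference between the two language models can be sketched in Python as follows. This is an illustration only: the function name `word_pairs`, the boundary markers and the toy two-sentence passage are assumptions and are not taken from the SAM texts or from Task 3.3.

```python
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers


def word_pairs(units):
    """Collect, for every word, the set of words allowed to follow it.

    units: list of token lists; each list is one unit (sentence or passage).
    """
    allowed = defaultdict(set)
    for tokens in units:
        seq = [START] + tokens + [END]
        for prev, nxt in zip(seq, seq[1:]):
            allowed[prev].add(nxt)
    return allowed


# Toy passage of two sentences (invented text).
passage = [["o", "vento", "norte"], ["o", "sol", "brilhou"]]

# First model: every sentence is a separate unit.
sentence_model = word_pairs(passage)

# Second model: the whole passage is one unit, so the last word of a
# sentence may be followed by the first word of the next sentence.
passage_model = word_pairs([[w for sent in passage for w in sent]])
```

In the sentence-level model "norte" can only be followed by the end marker, whereas the passage-level model also allows "norte" to be followed by "o", which is the kind of cross-sentence connection discussed here.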
These results show the great influence of the language model in such a small task. In the first case the model contained no connections between sentences, which does not match the speech files, where the sentences of a passage follow one another. When those connections were introduced into the language model, the system performance improved greatly.