
Task 1.3: Technical Description

The development database for this baseline system was the Portuguese part of the SAM database EUROM.1. The SAM database consists of read speech from three different sets of speakers, with different material, as described in Task 1.1. This database has no phonetic labelling, no dictionary and no language model. It provides the speech files, the assignment of prompts to each speaker, the text of the passages and sentences, and the SAMPA phonotypical transcriptions.

The development of the Portuguese baseline system consisted of the following steps (Figure 1.1):

1. Selection of the material to use as training and test sets.
2. Creation of a text database corresponding to the spoken material.
3. Creation of a Word Frequency List (WFL).
4. Creation of a baseline dictionary for the words in the WFL (see Task 2.1).
5. Automatic labelling and segmentation of the SAM database.
6. Training of the acoustic phonetic models based on the labelling from step 5.
7. Re-labelling of the SAM database using the acoustic phonetic models from step 6. Steps 6 and 7 were iterated several times to improve the acoustic phonetic models and the labelling.
8. Creation of a word pair grammar from the text database defined in step 2. For details see Task 3.3.
9. Evaluation of the system over the test set.
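The core of the pipeline (steps 5 to 7) is a bootstrap loop that alternates model training and re-labelling. A minimal schematic sketch, with toy stand-ins instead of the real MLP training and forced-alignment components:

```python
# Schematic sketch of the iterative part of the pipeline (steps 5-7).
# Both functions are hypothetical toy stand-ins, not the real components.

def train_models(labels):
    # Stand-in for training the acoustic phonetic MLP on the current labels.
    return {"seen": labels}

def relabel(models):
    # Stand-in for forced alignment: pretend each pass refines every label.
    return [label + "'" for label in models["seen"]]

labels = ["a", "b"]            # bootstrap labels (initial labelling pass)
for _ in range(3):             # steps 6 and 7, iterated
    models = train_models(labels)
    labels = relabel(models)
print(labels)  # ["a'''", "b'''"]
```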

Figure: Automatic training/labeling process

For the automatic labelling of the SAM database we first trained the acoustic phonetic models on the TIMIT database (an MLP trained on TIMIT to perform phoneme classification). Next we created two conversion tables: one from the TIMIT phonemes to IPA (International Phonetic Alphabet) symbols [8] and another from the SAMPA symbols (Portuguese SAM phonemes) to the IPA symbols. Composing the two tables gave a mapping from TIMIT phonemes to Portuguese phonemes. Obviously, not all phonemes have a correspondence. The few TIMIT phonemes without one (dh, dx, em, en, eng, ...) are mapped to the closest Portuguese phonemes. Portuguese phonemes without a correspondence were assigned a default value, and the probability vector was rescaled to sum to one.
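The composition of the two conversion tables can be sketched as follows. All table entries and the nearest-phoneme choices below are hypothetical, abbreviated examples; the real tables cover the full TIMIT and Portuguese SAMPA inventories.

```python
# Hypothetical, abbreviated conversion tables (illustrative entries only).
timit_to_ipa = {"iy": "i", "eh": "E", "s": "s", "dh": "D"}
sampa_to_ipa = {"i": "i", "E": "E", "s": "s", "d": "d"}

# Invert the SAMPA table so we can go from IPA back to SAMPA.
ipa_to_sampa = {ipa: sampa for sampa, ipa in sampa_to_ipa.items()}

# Hand-picked closest Portuguese phonemes for the few TIMIT phonemes
# whose IPA symbol has no match in the Portuguese inventory (hypothetical).
closest_pt = {"dh": "d"}

# Compose the two tables: TIMIT -> IPA -> SAMPA.
timit_to_sampa = {}
for timit, ipa in timit_to_ipa.items():
    timit_to_sampa[timit] = ipa_to_sampa.get(ipa, closest_pt.get(timit))

print(timit_to_sampa)  # {'iy': 'i', 'eh': 'E', 's': 's', 'dh': 'd'}
```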

The next step was to feed the SAM database through the TIMIT net. The probabilities produced by the TIMIT net were transformed according to the mapping table defined above, yielding the new set of Portuguese phoneme probabilities. This set of probabilities was used in the Y0 1.2 decoder to perform a forced alignment. As input to this forced alignment process we also used the baseline lexicon developed under Task 2.1.
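The probability transformation described above can be sketched as follows. The function name, the default value and the toy phoneme sets are assumptions for illustration; only the mechanism (accumulate mapped posteriors, floor unmapped phonemes, renormalise) comes from the text.

```python
def remap_probs(timit_probs, timit_to_pt, pt_phones, default=0.001):
    # Accumulate the TIMIT posteriors onto their mapped Portuguese phonemes.
    pt = {p: 0.0 for p in pt_phones}
    for timit, prob in timit_probs.items():
        pt[timit_to_pt[timit]] += prob
    # Portuguese phonemes with no TIMIT source get a small default value
    # (hypothetical magnitude), then the vector is rescaled to sum to one.
    for p in pt_phones:
        pt[p] = pt[p] or default
    total = sum(pt.values())
    return {p: v / total for p, v in pt.items()}

# Toy example: two TIMIT phonemes both map to Portuguese "a",
# and Portuguese "L" has no TIMIT counterpart.
probs = remap_probs({"aa": 0.7, "ah": 0.2, "s": 0.1},
                    {"aa": "a", "ah": "a", "s": "s"},
                    pt_phones=["a", "s", "L"])
print(round(sum(probs.values()), 6))  # 1.0
```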

After completing the alignment we trained a new MLP on the SAM database (using the labels from the forced alignment) to perform phoneme classification. With this network we made another forced-alignment pass, generating new labels. The process was iterated three times. The results are presented in Table 1.9.

Table: Percentage of correct frames in the successive passes of the alignment. The evaluation was made during training, against the labels from the previous alignment.

Pass   % correct frames    % correct frames
       (training set)      (validation set)
1      54.86               53.15
2      63.40               62.26
3      65.41               63.91

The results show the improvement in frame classification from pass to pass: the training/alignment procedure proved effective in decreasing the classification error.
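The figures in Table 1.9 are per-frame accuracies; the metric itself is straightforward to compute. A minimal sketch (hypothetical function name, toy label sequences):

```python
def frame_accuracy(reference, hypothesis):
    # Percentage of frames whose hypothesised phoneme label matches
    # the reference labelling (both are per-frame label sequences).
    assert len(reference) == len(hypothesis)
    correct = sum(r == h for r, h in zip(reference, hypothesis))
    return 100.0 * correct / len(reference)

# Toy frame sequences: 4 of the 5 frames agree.
print(frame_accuracy(list("aabbc"), list("aabcc")))  # 80.0
```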

After the training/alignment phase we evaluated the system. For training we had used 179 files from the Many Talker Set; for evaluation we picked the 10 speakers of the Few Talker Set, choosing three passages per speaker. Here too we used Y0, now including a word pair language model extracted from the SAM texts (for details see Task 3.3). As reported in Task 3.3, we built two different word pair language models. For the first, we collected the pairs from the SAM text with the sentences isolated from the passages; recall from Task 1.1 that each passage contains five thematically connected sentences, which gave 250 separate sentences. For the second, we took the passage itself as the unit, giving 50 passages with no division between the sentences of a passage. The results are presented in Table 1.10.
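The difference between the two word pair grammars is only the text unit over which adjacent pairs are collected. A minimal sketch with a hypothetical helper and toy text:

```python
def word_pairs(units):
    # Collect the set of adjacent word pairs seen in the training text.
    # Each element of `units` is one text unit (a sentence or a whole
    # passage); pairs are never collected across unit boundaries.
    pairs = set()
    for unit in units:
        words = unit.split()
        pairs.update(zip(words, words[1:]))
    return pairs

sentences = ["the cat sat", "sat near the dog"]
passage = [" ".join(sentences)]  # the same text as one passage-level unit

print(len(word_pairs(sentences)))  # 5
print(len(word_pairs(passage)))    # 6: adds the cross-sentence pair
```

With sentence units no pair spans a sentence boundary; with passage units the pair linking the last word of one sentence to the first word of the next is also allowed.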

Table: Evaluation results on the test set.

Grammar       Word error (%)
word pair 1   50.1
word pair 2   15.6

These results show the strong influence of the language model in such a small task. In the first model there are no connections between sentences, which is not the case in the speech files. When those connections were introduced into the language model, system performance improved greatly.
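The word error figures in Table 1.10 follow the usual definition: the minimum number of word substitutions, insertions and deletions needed to turn the recognised string into the reference, as a percentage of the reference length. A minimal sketch (hypothetical function name, toy strings):

```python
def word_error_rate(reference, hypothesis):
    # Levenshtein distance over words, as a percentage of reference length.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i            # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j            # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

# One substitution plus one deletion over a four-word reference.
print(word_error_rate("a b c d", "a x c"))  # 50.0
```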

Christophe Ris