Next: Phoneme Recognition Experiments Up: Task 1.2: Technical Description Previous: Task 1.2: Technical Description

The labeling and training procedure

As the Bref database comes without phonetic labeling, the first task was to generate an initial phone alignment. A high-quality concatenative speech synthesizer (MBROLA [1]) was used to produce a synthetic reference signal from the phonetic transcription derived from the text. Because the segmentation of this synthetic reference is known, each speech signal from the database can be temporally aligned to it and the phone boundaries carried over, so the alignment process reduces to a simple dynamic time warping (DTW) algorithm. Two different voices (one male, one female) were used so that a correct alignment could be obtained in all cases. This first segmentation was used to bootstrap the training of an HMM system with Gaussian mixture densities. That system provided a new segmentation, which was used to train an MLP, which in turn provided a new segmentation; this procedure was iterated a few times.

Four sets of acoustic features were used: Perceptual Linear Predictive coefficients (PLP), log-RASTA-PLP coefficients, LPC-cepstral features with cepstral mean subtraction (CMS), and Mel-scale frequency cepstral coefficients (MFCC). These parameters were computed every 10 ms on 30 ms analysis windows. The feature set for our hybrid system was a 26-dimensional vector composed of the feature parameters, the $\Delta$-feature parameters, the $\Delta$-log-energy and the $\Delta\Delta$-log-energy. Nine frames of contextual information were presented at the input of the network, leading to 9 × 26 = 234 inputs.
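The label-transfer step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the report does not specify the local distance, the DTW step pattern, or how boundaries are carried over, so the Euclidean distance, the symmetric (diagonal/vertical/horizontal) steps and the rule "each real frame takes the label of the first synthetic frame it is matched to" are all assumptions, and `dtw_path` and `transfer_labels` are hypothetical names. Feature vectors are plain Python lists so the sketch has no dependencies.

```python
import math


def dtw_path(ref, test):
    """Minimum-cost DTW path between two sequences of feature vectors.

    ref:  frames of the synthetic reference, whose segmentation is known.
    test: frames of the database utterance to be segmented.
    Assumed here: Euclidean local distance and equally weighted
    diagonal, vertical and horizontal steps.
    Returns (ref_index, test_index) pairs from (0, 0) to (N-1, M-1).
    """
    n, m = len(ref), len(test)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the last cell; the diagonal step wins ties.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        diag, up, left = cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
        best = min(diag, up, left)
        if best == diag:
            i, j = i - 1, j - 1
        elif best == up:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def transfer_labels(ref_labels, path, num_test_frames):
    """Give every test frame the phone label of the first reference frame aligned to it."""
    test_labels = [None] * num_test_frames
    for i, j in path:
        if test_labels[j] is None:
            test_labels[j] = ref_labels[i]
    return test_labels


# Toy 1-D example: the synthetic reference holds phone "a" for 2 frames, then "b"
# for 2 frames; the real utterance is slower and holds each sound for 3 frames.
ref = [[0.0], [0.0], [5.0], [5.0]]
test = [[0.0], [0.0], [0.0], [5.0], [5.0], [5.0]]
labels = transfer_labels(["a", "a", "b", "b"], dtw_path(ref, test), len(test))
print(labels)  # ['a', 'a', 'a', 'b', 'b', 'b']
```

Because the DTW path is monotone and visits every test frame, each real frame receives exactly one label, and the phone boundaries of the synthetic signal are mapped onto the time axis of the recorded one.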
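The arithmetic behind the 234 network inputs can be made concrete with a small sketch. Since the 26 dimensions consist of k feature parameters, k $\Delta$-parameters and two energy derivatives, 2k + 2 = 26 gives k = 12 coefficients per frame. The regression window used for the $\Delta$ computation (±2 frames here) and the repetition of edge frames as padding are assumptions not stated in the report; `deltas`, `frame_vectors` and `stack_context` are hypothetical helpers.

```python
def deltas(seq, k=2):
    """Regression-based time derivative of a sequence of vectors.

    d_t = sum_{n=1..k} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..k} n^2),
    repeating the first/last frame beyond the edges. k = 2 is an assumption.
    """
    T = len(seq)
    dim = len(seq[0])
    denom = 2 * sum(n * n for n in range(1, k + 1))
    out = []
    for t in range(T):
        d = [0.0] * dim
        for n in range(1, k + 1):
            fwd = seq[min(t + n, T - 1)]
            bwd = seq[max(t - n, 0)]
            for i in range(dim):
                d[i] += n * (fwd[i] - bwd[i])
        out.append([x / denom for x in d])
    return out


def frame_vectors(cep, log_energy):
    """26-dim frames: 12 coefficients, 12 deltas, delta-logE, delta-delta-logE.

    The static log-energy itself is not included, matching the feature list above.
    """
    d_cep = deltas(cep)
    e = [[v] for v in log_energy]
    d_e = deltas(e)
    dd_e = deltas(d_e)
    return [c + dc + de + dde for c, dc, de, dde in zip(cep, d_cep, d_e, dd_e)]


def stack_context(frames, context=4):
    """Concatenate each frame with `context` neighbours on each side.

    context=4 gives the 9-frame window (9 * 26 = 234 MLP inputs); repeating
    the edge frames as padding is an assumption.
    """
    T = len(frames)
    return [
        [x for off in range(-context, context + 1)
         for x in frames[min(max(t + off, 0), T - 1)]]
        for t in range(T)
    ]


# Toy input: 20 frames (200 ms at a 10 ms shift) whose 12 coefficients and
# log-energy all grow linearly with time.
cep = [[float(t)] * 12 for t in range(20)]
log_energy = [float(t) for t in range(20)]
inputs = stack_context(frame_vectors(cep, log_energy))
print(len(inputs), len(inputs[0]))  # 20 234
```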
Frame-level recognition rates on the training and cross-validation sets are reported in Table 1.5.


 
Table 1.5: Frame-level recognition rates of a classical hybrid HMM/MLP system trained on different feature sets.

Set                Nb. of frames   LPC-CMS   log-RASTA-PLP   PLP     MFCC
Train                  2,400,000     76.3%           78.6%   82.4%   82.4%
Cross-validation         270,000     74.0%           76.0%   80.2%   79.6%
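The report does not spell out how the frame-level rates in Table 1.5 are computed. A common definition, assumed in the sketch below, is the percentage of frames for which the MLP output with the highest posterior matches the frame's reference phone label; `frame_recognition_rate` is a hypothetical helper.

```python
def frame_recognition_rate(posteriors, labels):
    """Percentage of frames whose highest-scoring MLP output matches the reference label.

    posteriors: one list of class scores per frame; labels: one class index per frame.
    """
    if not labels or len(posteriors) != len(labels):
        raise ValueError("posteriors and labels must be non-empty and of equal length")
    correct = sum(
        1
        for scores, ref in zip(posteriors, labels)
        if max(range(len(scores)), key=scores.__getitem__) == ref
    )
    return 100.0 * correct / len(labels)


posteriors = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
print(frame_recognition_rate(posteriors, [0, 1, 2, 1]))  # 75.0
```

Because Bref provides no manual phonetic labels, the reference labels in such a computation would come from the automatically generated segmentation, so a rate of this kind measures agreement with that alignment rather than with hand-labeled phones.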
 


Christophe Ris
1998-11-10