As the Bref database provides no phonetic labeling, the first
task was to generate an initial phone alignment. A high-quality
concatenative speech synthesizer (MBROLA) was used to
produce a synthetic reference signal from the phonetic transcription
derived from the text. The speech signal from the database was
then time-aligned with this reference, whose segmentation is
known. The alignment process is thus reduced to a simple dynamic time
warping algorithm. Two different synthetic voices (one male, one female)
were used so that a correct alignment could be obtained in all cases.
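
The alignment step described above can be illustrated with a minimal dynamic time warping sketch. It is not the implementation used for Bref: the function names `dtw_path` and `transfer_boundaries`, the choice of local steps (diagonal, vertical, horizontal) and the distance function are assumptions made for illustration. In practice each frame would be an acoustic feature vector rather than a scalar.

```python
from math import inf

def dtw_path(ref, sig, dist):
    """Return a minimal-cost alignment path as (ref_index, sig_index) pairs."""
    n, m = len(ref), len(sig)
    # cost[i][j]: best cumulative cost of aligning ref[:i] with sig[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], sig[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # diagonal step
                                 cost[i - 1][j],      # advance reference only
                                 cost[i][j - 1])      # advance signal only
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

def transfer_boundaries(path, ref_boundaries):
    """Map each reference boundary frame to the first signal frame aligned with it."""
    first_match = {}
    for i, j in path:
        first_match.setdefault(i, j)
    return [first_match[b] for b in ref_boundaries]

# Toy example: scalar "frames", three phones starting at reference frames 0, 2, 4.
ref = [0, 0, 1, 1, 2, 2]
sig = [0, 0, 0, 1, 2, 2, 2, 2]
path = dtw_path(ref, sig, lambda a, b: abs(a - b))
boundaries = transfer_boundaries(path, [0, 2, 4])
```

Because the segmentation of the synthetic reference is known, mapping its boundary frames through the warping path gives a first segmentation of the natural speech signal.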
This first segmentation was used to bootstrap the training of an HMM
system based on multi-Gaussian (Gaussian mixture) densities. That system
provided a new segmentation, which we used to train an MLP, which in turn
provided a new segmentation. This procedure was repeated for a few iterations.
Four sets of acoustic features were used: the Perceptual Linear
Predictive coefficients (PLP), the log-RASTA-PLP coefficients, the
LPC-cepstral features with cepstral mean subtraction (CMS) and the
Mel-scale frequency cepstral coefficients (MFCC). These parameters
were computed every 10 ms over 30 ms analysis windows. The feature set
for our hybrid system was based on a 26-dimensional vector composed of
the feature parameters, the $\Delta$-feature parameters, the
$\Delta$-log-energy and the $\Delta\Delta$-log-energy. Nine frames of
contextual information were placed at the input of the network,
leading to 234 inputs.
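
The framing and context-stacking arithmetic above can be checked with a short sketch. Several details are assumptions not stated in the text: the 16 kHz sampling rate in the example, a symmetric context of four frames on each side of the current frame, and the repetition of edge frames at utterance boundaries. The function names `frame_starts` and `stack_context` are hypothetical.

```python
def frame_starts(n_samples, sample_rate, win_ms=30, hop_ms=10):
    """Start sample of each analysis window (30 ms windows every 10 ms)."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    if n_samples < win:
        return []
    return list(range(0, n_samples - win + 1, hop))

def stack_context(frames, context=9):
    """Concatenate each frame with its neighbours to form the network input."""
    half = context // 2          # assumed symmetric: 4 frames left, 4 right
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - half, t + half + 1):
            k = min(max(k, 0), n - 1)   # assumed: repeat edge frames
            window.extend(frames[k])
        stacked.append(window)
    return stacked

# One second of audio at an assumed 16 kHz sampling rate.
starts = frame_starts(16000, 16000)

# Twenty dummy 26-dimensional feature vectors, stacked over nine frames.
frames = [[float(t)] * 26 for t in range(20)]
inputs = stack_context(frames)
```

With 26 features per frame and nine frames of context, each stacked vector has 9 × 26 = 234 components, matching the number of network inputs.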
The training and cross-validation scores at the frame level are reported in Table 1.5.