Next: Task 7.3: Demonstrations Up: Task 7.2: Technical Description Previous: The 1997 Hub-4E Acoustic

The 1997 CU-CON Evaluation System

PRIMARY TEST SYSTEM DESCRIPTION

The CU-CON1 system for the November 1997 Hub-4E evaluation is a connectionist-HMM system, based on the ABBOT system [4]. The system uses a recurrent neural network (RNN) to produce a posteriori phone class probability estimates. These are mapped to scaled likelihoods and used as observation probabilities within an HMM framework.
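The mapping from posteriors to scaled likelihoods can be illustrated with a minimal sketch. It assumes the conversion usual in connectionist-HMM systems, in which each posterior P(q|x) is divided by the class prior P(q); the function name, the probability floor, and the example values are illustrative and are not taken from the CU-CON system.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """Return log(P(q|x) / P(q)) for every phone class q.

    posteriors: network outputs P(q|x) for one frame (assumed to sum to 1).
    priors:     class priors P(q), e.g. relative frequencies in the
                training alignment (an assumption of this sketch).
    floor:      hypothetical lower bound that avoids log(0).
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]


# One frame with three phone classes (made-up numbers).
frame_scores = scaled_log_likelihoods([0.7, 0.2, 0.1], [0.5, 0.25, 0.25])
```

Working in the log domain keeps the scores directly usable as observation terms in a Viterbi-style decoder, where log probabilities are summed along a path.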

Specific features of the system are as follows:

ACOUSTIC TRAINING

Different acoustic model sets have been used for wide-band and telephone bandwidth data. A total of eight different acoustic models have been used, and these are described below.

Wide-band:

Four acoustic models have been used for wide-band data. All the models use PLP acoustic features, and estimate 697 word-internal context-dependent phone probabilities. The four models have the following features:

model 1: Training data was the 104 hours of broadcast news training data released in 1997. The training software was modified to allow the use of 384 state units, resulting in models with approximately 174,000 parameters.
model 2: Training data and architecture the same as model 1. Training data presented in reverse order, which produces a different model because the RNN is time-asymmetric.
model 3: This model used only the data labelled as focus conditions F0 or F1, i.e. only the data from the broadcast studio, from the 104 hours of acoustic training data.
model 4: Training data and architecture the same as model 3. Training data presented in reverse order.

The outputs of the four models were merged in the log domain to produce the final context-dependent phone probabilities.
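A merge in the log domain can be sketched as follows. The sketch assumes a uniform average of the log probabilities (a geometric mean of the model outputs) followed by renormalisation; the text does not state the weighting actually used, and the function name and floor value are illustrative.

```python
import math


def merge_log_domain(model_outputs, floor=1e-8):
    """Merge per-model probability vectors by averaging their logs.

    model_outputs: one probability vector per model, all of the same
                   length, for a single frame.
    Returns a vector that sums to 1.
    """
    if not model_outputs:
        raise ValueError("at least one model output is required")
    n_classes = len(model_outputs[0])
    if any(len(m) != n_classes for m in model_outputs):
        raise ValueError("all model outputs must have the same length")

    n_models = len(model_outputs)
    # Uniform average of log probabilities (assumed weighting).
    avg_log = [
        sum(math.log(max(m[k], floor)) for m in model_outputs) / n_models
        for k in range(n_classes)
    ]
    # Subtract the peak before exponentiating for numerical stability.
    peak = max(avg_log)
    unnormalised = [math.exp(v - peak) for v in avg_log]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Two models that disagree symmetrically (made-up numbers).
merged = merge_log_domain([[0.9, 0.1], [0.1, 0.9]])
```

Averaging logs penalises a class that any single model considers very unlikely, which is the main behavioural difference from a linear average of the probabilities.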

Telephone bandwidth:

Four acoustic models have been used for telephone bandwidth data. Each of the models estimates 604 word-internal context-dependent phone probabilities. The four models have the following features:

model 5: Training data was the 50 hours of broadcast news training data released in 1996. The model uses PLP features and has 256 state units (84,000 parameters). Adaptation was performed using the telephone bandwidth segments from the training data.

model 6: Training data and architecture the same as model 5. Training data presented in reverse order. This model was also adapted to the telephone bandwidth segments.

model 7: This model uses the same data as model 5, except that MEL+ acoustic features are used. This model was also adapted to the telephone bandwidth segments.

model 8: This model is the same as model 7, except that the training data is presented in reverse order. This model was also adapted to the telephone bandwidth segments.

The outputs of the four models were merged in the log domain to produce the final context-dependent phone probabilities.

The specification for the 1997 evaluation demands that any required segmentation must be done automatically. The CU-CON system used tools provided by NIST to perform audio segmentation. These tools implement the method of Siegler et al. [4]. Means and variances are estimated for a two-second window placed at each point in the audio stream. A Kullback-Leibler distance between successive windows is then computed, and when this reaches a local maximum a new segment boundary is marked. The tools also classify the segments as either full or telephone bandwidth. This is accomplished by building Gaussian mixture models for each of the segments. Maximum likelihood selection of the class given the segment is performed by comparing the segment models to models trained on known bandwidth data. The tools also perform clustering of the segments, but this was not used by the CU-CON system.
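The boundary detection step described above can be sketched in simplified form. This is not the NIST tool: it assumes a single scalar feature per frame, a single Gaussian per window, a symmetric form of the Kullback-Leibler distance, and a hypothetical `min_distance` threshold that suppresses maxima caused by noise. All names are illustrative.

```python
def gaussian_stats(frames, var_floor=1e-6):
    """Mean and (floored) variance of a window of scalar features."""
    mean = sum(frames) / len(frames)
    var = sum((x - mean) ** 2 for x in frames) / len(frames)
    return mean, max(var, var_floor)


def symmetric_kl(stats_a, stats_b):
    """KL(a||b) + KL(b||a) for two univariate Gaussians."""
    (mean_a, var_a), (mean_b, var_b) = stats_a, stats_b
    d2 = (mean_a - mean_b) ** 2
    return (var_a + d2) / (2 * var_b) + (var_b + d2) / (2 * var_a) - 1.0


def find_boundaries(features, window, min_distance=0.0):
    """Return frame indices where the distance between the windows
    immediately before and after the index is a local maximum."""
    distances = {}
    for t in range(window, len(features) - window + 1):
        left = gaussian_stats(features[t - window:t])
        right = gaussian_stats(features[t:t + window])
        distances[t] = symmetric_kl(left, right)

    boundaries = []
    for t in sorted(distances):
        d = distances[t]
        prev_d = distances.get(t - 1, float("-inf"))
        next_d = distances.get(t + 1, float("-inf"))
        # Local maximum above the (assumed) noise threshold.
        if d > min_distance and d > prev_d and d >= next_d:
            boundaries.append(t)
    return boundaries


# Synthetic stream: 20 frames near 0 followed by 20 frames near 5.
stream = [0.0, 0.1] * 10 + [5.0, 5.1] * 10
change_points = find_boundaries(stream, window=6, min_distance=1.0)
```

In a real system the window length would correspond to two seconds of frames and the statistics would be computed over multi-dimensional feature vectors rather than a scalar.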

GRAMMAR TRAINING

4-gram and trigram language models were built using version 2 of the CMU-Cambridge Statistical Language Modelling Toolkit.

RECOGNITION LEXICON DESCRIPTION

The recognition lexicon used was developed for the 1996 Hub-4 evaluation. The training lexicon was extended to cover the extra acoustic training data available. This required the generation of approximately 5k pronunciations.

NEW CONDITIONS FOR THIS EVALUATION

The 1997 Hub-4E evaluation system used the following new features.


Christophe Ris
1998-11-10