next up previous contents
Next: Task 4.2: Future Development Up: Task 4.2: Technical Description Previous: Segmentation of Broadcast Audio

Vocal Track Length Normalisation

Vocal tract length normalisation via warping of the frequency scale during feature extraction has also been used for speaker adaptation. A generic 64 element Gaussian mixture model is trained using a gender balanced subset of the total training data. A model is trained for each speaker at each of 10 warping factors, and the warp factor with the highest likelihood chosen. The RNN acoustic model is then retrained using warped training data. The warping factor used at recognition time is chosen by building Gaussian mixture models of the test data at each warping scale, and using maximum likelihood selection. This method has been shown to reduce word error rate by 8.7% on the WSJCAM0 5k development test set.

Christophe Ris