The broadcast news training data does not include syllable boundary or phonetic alignment information, so an automatic procedure for determining syllable boundaries is required. The approach used in this work derives syllable boundaries from phonetic alignments. The first step is to produce a pronunciation with tagged syllable boundaries for every word in the training data; this was done automatically using the NIST tsylb2 software. The first phone of each syllable is tagged as an onset phone. Viterbi forced alignment is then used to determine phone alignments for the training data, and these alignments are combined with the syllable-tagged lexicon to derive the syllable onsets.
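The final step above can be sketched as follows. This is a minimal illustration, not the original implementation: the data layout (tuples of phone label, start frame, end frame) and the function name are assumptions.

```python
# Combine a Viterbi phone alignment with onset tags from a syllable-tagged
# lexicon to locate syllable onset frames. Layout and names are illustrative.

def syllable_onset_frames(phone_alignment, onset_flags):
    """Return the start frames of phones tagged as syllable onsets.

    phone_alignment: list of (phone, start_frame, end_frame) tuples from
                     forced alignment, in temporal order.
    onset_flags:     parallel list of booleans from the syllable-tagged
                     pronunciation (True where the phone begins a syllable).
    """
    return [start
            for (phone, start, end), is_onset in zip(phone_alignment, onset_flags)
            if is_onset]

# Example: "seven" aligned as s-eh-v-ax-n, syllabified as [s eh] [v ax n]
alignment = [("s", 0, 5), ("eh", 5, 12), ("v", 12, 18),
             ("ax", 18, 24), ("n", 24, 30)]
flags = [True, False, True, False, False]
print(syllable_onset_frames(alignment, flags))  # → [0, 12]
```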
A single-hidden-layer, fully connected MLP with 500 hidden units was trained to estimate the probability that a given frame is a syllable onset. The input to this MLP consists of 9 contiguous frames of perceptual linear prediction (PLP) features computed over a 32ms window every 16ms. For the purposes of training, the syllable onsets were represented as a series of four frames, with the initial frame corresponding to the actual onset derived from the phonetic alignments.
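The construction of the training data described above can be sketched as two small helpers: one expands each onset into a four-frame positive target, and one stacks 9 contiguous feature frames into a single network input. The edge-handling policy (repeating the first/last frame at utterance boundaries) and all names here are assumptions for illustration.

```python
# Build frame-level targets and stacked-context inputs for the onset MLP.
# Edge handling and helper names are illustrative assumptions.

def make_targets(num_frames, onset_frames, width=4):
    """Mark each onset frame and the following width-1 frames as positive."""
    targets = [0.0] * num_frames
    for onset in onset_frames:
        for k in range(onset, min(onset + width, num_frames)):
            targets[k] = 1.0
    return targets

def stack_context(features, i, context=9):
    """Concatenate `context` feature frames centred on frame i.

    features: list of per-frame feature vectors (e.g. PLP coefficients).
    Frames beyond the utterance edges are clamped (first/last frame repeats).
    """
    half = context // 2
    window = []
    for j in range(i - half, i + half + 1):
        j = min(max(j, 0), len(features) - 1)  # clamp at utterance edges
        window.extend(features[j])
    return window
```

For example, `make_targets(10, [2, 8])` marks frames 2-5 and 8-9 as positive, and `stack_context` on frame 0 repeats the first frame to fill the left half of the window.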
A simple numeric threshold applied to the probability estimates generated by the neural network determines whether a frame is identified as a syllable onset. This method correctly detected 92% of the onsets derived from the phonetic alignments. However, it also falsely detected onsets in 30% of the frames outside the four-frame window defined during training. This effect can be seen in Figure 5.7, which shows an example of the neural network output: the detected onset regions tend to be much wider than the four-frame window used during training.
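The thresholding step, together with one simple way of handling the wide detected regions noted above (collapsing each contiguous above-threshold run into a single hypothesized onset region), might look like the following. The threshold value and function names are assumptions, not the values used in this work.

```python
# Threshold the per-frame onset probabilities, then merge contiguous
# above-threshold frames into single onset regions. Illustrative only.

def detect_onset_frames(probs, threshold=0.5):
    """Indices of frames whose onset probability reaches the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]

def collapse_runs(frames):
    """Merge contiguous frame indices into (start, end) onset regions."""
    regions = []
    for i in frames:
        if regions and i == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], i)  # extend the current run
        else:
            regions.append((i, i))            # start a new run
    return regions

probs = [0.1, 0.9, 0.95, 0.8, 0.2, 0.1, 0.7, 0.9, 0.3]
frames = detect_onset_frames(probs)   # → [1, 2, 3, 6, 7]
print(collapse_runs(frames))          # → [(1, 3), (6, 7)]
```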