
Segmentation of Broadcast Audio

Faced with an unsegmented stream of audio, for example a section of radio broadcast, the first task that must be performed by a speech recognition system is to decide which regions contain speech and which do not. Rather than attempting to segment an audio stream into speech and non-speech, we have accepted the limitations of the speech recognizer and aimed to separate the audio into regions that are recognizable speech and those which are not. Non-recognizable speech includes not only non-speech audio, such as music, but also speech recorded under acoustic conditions so poorly matched to our models that no reliable recognition result can be produced. If this segmentation can be provided on the basis of a purely acoustic confidence measure (i.e. a measure that is independent of the language model), it may be computationally inexpensive and may be computed before speech recognition has been attempted. A segmentation system of this kind would enable us to concentrate the speech recognition effort exclusively on regions where it may be usefully applied.

The acoustic confidence measure employed here is based on the estimates of local posterior probabilities produced by a recurrent network. $S(n_s,n_e)$ is the entropy of the $K$ posterior phone probability estimates $q_k$ output by the recurrent network given the data $x$, averaged over a window of fixed duration [6]:

\begin{displaymath}S(n_s,n_e) = -\frac{1}{n_{e}-n_{s}}\sum^{n_e}_{n=n_s}\sum^{K}_{k=1}F(q_k \vert x_n) \log F(q_k \vert x_n)
\end{displaymath} (4.1)

where $n_s$ and $n_e$ are the frame indices for the beginning and end of the data window. In regions of the signal where the models are well matched, the distribution of phone posteriors will typically be dominated by a single phone class and will have low entropy. In non-speech regions, or poorly modeled speech regions, several alternative phone models may have roughly equal posterior probabilities, leading to a higher value of $S$.
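As a concrete illustration, equation 4.1 can be computed directly from the network outputs. The following is a minimal sketch in Python; the function name and the assumption that the posteriors arrive as an N-by-K NumPy array are ours, not part of the original system:

```python
import numpy as np

def windowed_entropy(posteriors, n_s, n_e):
    """Average per-frame entropy of phone posteriors over frames [n_s, n_e).

    posteriors: array of shape (N, K); each row is a distribution over K phones.
    """
    window = posteriors[n_s:n_e]
    # Guard against log(0); zero-probability terms contribute (almost) nothing.
    p = np.clip(window, 1e-12, 1.0)
    per_frame = -np.sum(p * np.log(p), axis=1)  # entropy of each frame
    return per_frame.mean()                     # average over the window
```

A uniform posterior gives the maximum value $\log K$, while a distribution dominated by a single phone class gives a value near zero, matching the behaviour described above.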

Ideally, there would be a clear distinction between regions of well modeled speech, where the value of $S$ is low, and regions of poorly modeled speech and non-speech, where the value is high. However, there are several factors that weaken the power of the measure:

Certain models may be well matched to the data even during periods of non-speech. This is most obviously true for the silence model, but other phone models may also be closely matched to non-speech sounds: for example, background hiss can be mistaken for a sibilant such as s. In contrast, there are certain weak phones that are often ambiguous even in clean, otherwise well-modeled speech (e.g. ix, dx, ux). By excluding frames whose highest posterior probability belongs to either of these two classes, the power of the confidence measure can be increased.
The per-frame entropy is noisy. Even in clean speech, spikes occur at regular intervals, corresponding to poorly modeled phone transitions. These spikes can easily obscure the underlying trends (see figure 4.2). However, by applying a median filter with a sufficiently short window (i.e. 50-80ms), many of these spikes can be removed, reducing the value of $S$ during the speech regions relative to that during the non-speech regions.
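The short median filtering just described can be sketched as follows. This is a minimal sketch; the function name is ours, and the 16ms frame period is an assumption inferred from the "5 frames $\approx$ 80ms" figure quoted later in the text:

```python
import numpy as np

def median_smooth(per_frame_entropy, frame_ms=16, window_ms=80):
    """Suppress short entropy spikes at phone transitions with a median filter."""
    w = max(1, int(round(window_ms / frame_ms)))
    if w % 2 == 0:
        w += 1                      # a median filter needs an odd window
    pad = w // 2
    x = np.pad(np.asarray(per_frame_entropy, dtype=float), pad, mode='edge')
    # Sliding windows of length w; the median of each is the smoothed value.
    windows = np.lib.stride_tricks.sliding_window_view(x, w)
    return np.median(windows, axis=1)
```

An isolated one-frame spike in an otherwise flat entropy track is removed entirely, while genuine sustained changes of level (such as a transition into non-speech) survive the filter.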

Figure 4.2: The per-frame entropy for the phrase `America in black and white.' Although this is clean studio speech, entropy spikes occur at each of the phone transitions.
\mbox{\epsffig{figure=t4-2:transition.eps,width=0.9\columnwidth, height=0.4\columnwidth} }

Figure 4.3 shows the entropy measure over a 10 minute segment of a typical radio broadcast. The measure was calculated by applying the processing described in the previous section and then averaging over consecutive 40-frame ($\approx 600$ms) windows. As can be seen, even after this 600ms averaging rapid fluctuations remain. These were filtered out with further median smoothing before segmentation was performed; this final stage of smoothing used a window equivalent to approximately 10 seconds. The result is illustrated in the lower panel of figure 4.3. Segmentation was performed by locating local maxima and minima in the difference function of the smoothed entropy and identifying these as segmentation points when their absolute value exceeded a given threshold.
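The coarse segmentation just described can be sketched as below. This is a minimal sketch under stated assumptions: the function name, the 17-block ($\approx 10$s) smoothing window, and the threshold value are ours, chosen only to match the scales quoted in the text:

```python
import numpy as np

def segment_points(entropy, block=40, smooth_blocks=17, threshold=0.1):
    """Block-average the per-frame entropy (~600 ms blocks), median-smooth
    over ~10 s, then mark boundaries where the difference function peaks."""
    n = len(entropy) // block
    coarse = entropy[:n * block].reshape(n, block).mean(axis=1)
    # Heavy median smoothing over smooth_blocks * 600 ms (~10 s).
    pad = smooth_blocks // 2
    x = np.pad(coarse, pad, mode='edge')
    smoothed = np.median(
        np.lib.stride_tricks.sliding_window_view(x, smooth_blocks), axis=1)
    d = np.diff(smoothed)  # the difference function
    # Local extrema of |d| above the threshold become segmentation points.
    points = [i for i in range(1, len(d) - 1)
              if abs(d[i]) > threshold
              and abs(d[i]) >= abs(d[i - 1]) and abs(d[i]) >= abs(d[i + 1])]
    return [p * block for p in points]
```

Note that on a synthetic low-to-high entropy step the detected boundary lands a few hundred milliseconds early: exactly the coarse localisation error that the fine-tuning step of the next paragraph is designed to correct.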

Figure 4.3: The raw entropy measure (top) and the smoothed and segmented entropy measure (bottom) for a 10 minute extract from a broadcast news program.
\mbox{\epsffig{figure=t4-2:entplot.eps,width=\columnwidth} }

Although this procedure was able to detect the segmentation points, the heavy smoothing of the entropy function means that they are positioned only to within a few seconds of their correct location. Therefore, before they can be used, their location has to be `fine tuned.' This was accomplished using a technique similar to that of Siegler et al. [4]. Means and variances are calculated to describe the PDFs of the data within windows of 2s duration positioned on either side of the segmentation point. The distance between these PDFs is measured using a suitable distance metric, and the position of the segmentation point is then adjusted, within a small window, so as to maximize this distance.

Following Siegler et al., a KL2 distance metric was employed. This is a symmetric form of an information-theoretic measure equal to the additional bit rate accrued by encoding random variable $B$ with a code designed for the optimal encoding of random variable $A$ [2]. When both $A$ and $B$ have Gaussian distributions:

\begin{displaymath}KL2(A,B) = \frac{\sigma^2_A}{\sigma^2_B} + \frac{\sigma^2_B}{\sigma^2_A} + (\mu_A-\mu_B)^2\left(\frac{1}{\sigma^2_A}+\frac{1}{\sigma^2_B}\right)
\end{displaymath} (4.2)

The greater this value, the greater the distance between the two PDFs.
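The KL2 boundary refinement can be sketched as follows. This is a minimal sketch, not the original implementation: the function names, the search window, and the small variance floor are ours, and a single feature dimension is assumed in place of the full feature vectors:

```python
import numpy as np

def kl2(mu_a, var_a, mu_b, var_b):
    """Symmetric KL (KL2) distance between two Gaussians, constants dropped."""
    return (var_a / var_b + var_b / var_a
            + (mu_a - mu_b) ** 2 * (1.0 / var_a + 1.0 / var_b))

def refine_boundary(feat, point, half_win, search):
    """Shift `point` within +/- search frames to maximise the KL2 distance
    between Gaussians fitted to the half_win frames on either side."""
    best, best_d = point, -np.inf
    for p in range(point - search, point + search + 1):
        left, right = feat[p - half_win:p], feat[p:p + half_win]
        # Variance floor (1e-8) avoids division by zero on constant windows.
        d = kl2(left.mean(), left.var() + 1e-8,
                right.mean(), right.var() + 1e-8)
        if d > best_d:
            best, best_d = p, d
    return best
```

When the candidate point lies a few frames from a genuine change in the feature statistics, the KL2 distance is maximised exactly where the two windows are statistically most dissimilar, pulling the boundary onto the true change point.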

Classification of the segments was based on the same acoustic confidence measure as employed for segment boundary detection: the entropy over the phone posteriors was calculated for each frame of the segment, and 5-frame (i.e. $\approx80$ms) median smoothing was applied to reduce the influence of phone transitions. Frames attributed either to silence or to any of the weak phones were removed, and the mean entropy value for the remaining frames was calculated. Low values indicate well modeled speech that is worth decoding; higher values indicate poorly modeled speech and non-speech. A threshold can be set to decide which segments to excise.
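The classification step can be sketched as below. This is a minimal sketch; the function name, the per-frame label representation, and the particular set of excluded labels are our assumptions:

```python
import numpy as np

def classify_segment(entropy, top_phone, exclude, threshold):
    """Score a segment by its mean smoothed per-frame entropy.

    entropy:   per-frame entropy of the phone posteriors within the segment
    top_phone: most probable phone label for each frame
    exclude:   labels to drop, e.g. silence and weak phones {'sil','ix','dx','ux'}
    Returns (is_speech, score); low scores indicate well modeled speech.
    """
    # 5-frame (~80 ms) median smoothing to suppress phone-transition spikes.
    x = np.pad(np.asarray(entropy, dtype=float), 2, mode='edge')
    sm = np.median(np.lib.stride_tricks.sliding_window_view(x, 5), axis=1)
    keep = np.array([p not in exclude for p in top_phone])
    score = sm[keep].mean() if keep.any() else np.inf
    return score < threshold, score
```

A segment whose retained frames have consistently low entropy is passed to the decoder; one whose entropy stays high after the silence and weak-phone frames are removed is excised.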

The overall segmentation system may be summarized as follows:

1. Feature extraction (e.g. PLP)
2. RNN estimation of posterior probabilities of phone classes
3. Computation of per-frame entropy of phone posteriors (smoothed using a 50-80ms median filter)
4. Peak detection and segment boundary adjustment using the KL2 distance measure
5. Recomputation of segmental entropy and classification of segments into ``speech'' and ``non-speech''

Experiments to evaluate the system have been conducted using radio broadcasts from the 1996 ARPA Hub 4 Broadcast News evaluation [5]. A 30 minute radio show was selected from the corpus and segmented into a number of candidate segments. Each of these segments was then decoded using the ABBOT HMM/ANN system. A time-aligned reference word sequence was obtained by performing a forced alignment of the reference word transcription. The word sequences hypothesised during the decoding of each segment were then aligned to the reference word sequence, enabling recognition word error rates (WERs) to be computed.

As a calibration of the recognition system, a standard pre-segmented evaluation was also performed. The context-independent ABBOT system was used for these experiments, with a 64K word vocabulary and a trigram LM. Table 4.5 shows the recognition performance when using the pre-segmented evaluation data. Also shown are the number of words in each condition, and the percentage these form of the total number of words in the show (including words outside the evaluation data, i.e. commercial breaks).

Table 4.5: Results using pre-segmented evaluation data. Word error rates are given for both the context-independent system (CI) and the context-dependent system (CD) using 603 within-word context-dependent phone models.
Condition           Words   % Total   WER (CI)   WER (CD)
F0 - prepared         638      12.3       21.2       17.9
F1 - spontaneous     1342      25.9       39.3       33.4
F2 - low fidelity     813      15.7       84.4       84.8
F3 - music            162       3.1       38.9       30.9
F4 - noise            187       3.6       48.6       51.4
FX - mixed            358       6.9       84.2       81.1
All                  3500      67.5       51.7       48.5

Figure 4.4: WER for each segment plotted against segment entropy value. Points are weighted by words per segment (top) or decoding time (bottom).
...2:wer_cm_npf.eps,width=0.8\columnwidth} }

Figure 4.4 shows the average WER for each of the 81 segments returned by the automatic segmentation procedure. It can be seen that there is a high degree of correlation between WER and the confidence value for the segments. Also, although many of the `poor' segments contain few words, they occupy a large proportion of the total decoding time.

Table 4.6: The simple and weighted correlations between the entropy measure $S$ and segment WER and computational cost. In the weighted measure, segments are weighted by the number of words they contain. Several variations of the confidence measure are shown, illustrating the importance of each stage in the processing of the raw per-frame entropy: `raw' refers to the measure derived from a simple averaging of unprocessed frame entropies; `-transitions' includes median filtering to remove the effect of phone transitions; `-silence' shows the effect of excluding the silent frames; `-weak' shows the effect of excluding the set of indistinct phones that generally have high entropies even in clean speech; `all' shows the result of combining all these techniques.
                  WER vs. S           Cost vs. S
                  simple   weighted   simple   weighted
raw                0.684      0.825    0.665      0.845
-transitions       0.695      0.832    0.670      0.850
-silence           0.799      0.915    0.739      0.915
-weak              0.689      0.831    0.643      0.841
all                0.812      0.923    0.742      0.919


The correlations between the segment confidence measure and WER (and decoding time) are detailed in table 4.6. By setting the confidence threshold to an appropriate value, it is possible to exclude the segments that contribute most to both the error and the decoding time. By excluding segments of non-speech, which are likely to provoke insertion errors, it is possible actually to reduce the overall error rate. These savings can be seen in figure 4.5.

By examining the manner in which the average WER for the recovered segments varies as a greater number of segments are accepted for decoding, we can obtain some measure of the system's segmentation and classification performance. Figure 4.6 illustrates that the WER increases steadily as the confidence threshold is decreased. Note that the best classification line passes very close to both the `F0' and `all eval' operating points. If the system had made an inappropriate segmentation of the data, mixing poorly modelled and well modelled speech within individual segments, reaching the F0 operating point would not be possible.

Figure 4.5: By ignoring segments of sufficiently low confidence, both overall WER and decoding time can be reduced simultaneously.
\mbox{\epsffig{figure=t4-2:opcurve.eps,width=0.8\columnwidth} }

Figure 4.6: WER of included segments versus acceptance threshold, using the CI recognition system. The upper line shows the WER achieved using the acoustic confidence measure. The lower line shows the optimal WER that could be achieved by a system that accepted the segments in order of increasing WER. Also shown are the points corresponding to using the pre-segmented evaluation data and using either just the clean studio speech F0, or the whole evaluation set.
\mbox{\epsffig{figure=t4-2:cum_WER2.eps,width=0.8\columnwidth} }

This technique uses a single acoustic confidence measure both to segment continuous audio and to predict which segments contain speech that may be regarded as recognisable. The technique has two important attributes: (1) it is computationally inexpensive, allowing an overall reduction in the computational cost of speech recognition; (2) since the confidence measure is derived directly from our recognition models, the segmentation offered is entirely pragmatic, i.e. the data is divided into that which fits the models well and is therefore likely to be recognisable, and that which does not. If different models are used then different segments will be found, but they will be the segments that are most likely to be of practical value.

Christophe Ris