
Combining multiple time-scale features

Another potential advantage of the multi-stream approach, discussed now, is the possibility of combining short-term and long-term temporal information. Current ASR systems mainly use short-term information, typically at the phoneme level, while longer-term information is supposed to be captured via the HMM topology. However, it is often acknowledged that larger lexical units than the phoneme may be necessary to capture all of the variability of speech and to model long-term dynamics. A plausible candidate is the syllable. Unfortunately, long-term temporal dependencies (dynamics), e.g., between syllables, are not explicitly captured by centisecond-based feature extraction or by the model topology. Consequently, properly handling longer temporal regions (stretching over more than the typical phoneme or HMM-state duration) is still an open issue.

Several studies have attempted to use acoustic context, either by conditioning the posterior probabilities on several acoustic frames or by using temporal derivative features (see, e.g., [5,19]). Typically, an optimum was observed with a context covering 90 ms of speech, corresponding approximately to the mean duration of phonetic units. However, these approaches cannot represent higher-level temporal processes (such as syllable dynamics) since the underlying HMM is still phoneme based and implicitly assumes piecewise stationarity (at the HMM-state level). What we should actually attempt to do is to process short-term and long-term information with two concurrent HMMs assuming (via different topologies and different features) piecewise stationarity at different temporal scales. In the following, we therefore tested the multi-stream approach to combine short-term dependencies and their associated features (e.g., at the level of 90 ms) with long-term dependencies and their corresponding features (e.g., at the level of 200 ms).
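The two time scales can be illustrated by stacking consecutive feature frames into context windows of different widths, one per stream. This is a minimal sketch only; the function name, the edge-padding scheme, and the feature dimensionality are assumptions of this illustration, not the system described in the text:

```python
import numpy as np

def stack_context(frames, context):
    """Stack `context` consecutive frames (centered on each time step)
    into one long vector per frame, padding the edges by repeating the
    first/last frame.  Padding scheme is an assumption of this sketch."""
    half = context // 2
    padded = np.concatenate([np.repeat(frames[:1], half, axis=0),
                             frames,
                             np.repeat(frames[-1:], half, axis=0)])
    return np.stack([padded[t:t + context].ravel()
                     for t in range(len(frames))])

# 100 frames of (assumed) 13-dimensional acoustic features
feats = np.random.randn(100, 13)
short_ctx = stack_context(feats, 9)    # short-term stream: 9-frame context
long_ctx = stack_context(feats, 17)    # long-term stream: 17-frame context
```

Each stream's acoustic model would then be trained on its own stacked representation, so the two concurrent HMMs see the same signal at different temporal granularities.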

Figure 5.6: Syllable [se] multi-stream model.

As a first attempt in this direction, and as illustrated in Figure 5.6, initial experiments were performed with syllable models described in terms of two parallel models:

A ``regular'' syllable model, built up by concatenating context-independent (HMM/ANN) phone models and intended to capture the fine structure of the syllable. This model processes acoustic vectors as is usual in HMM/ANN systems, typically 9 frames of acoustic context covering about 100 ms. A minimum phone duration constraint was also imposed.
A second HMM aimed at capturing the gross temporal structure of the syllable. In our initial experiments, this model was composed of fewer states (3 states in our case) processing a larger temporal context of about 200 ms. It is, however, clear that both the topology and the features will be subject to optimization in the future.
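For concreteness, the two parallel models above could be summarized as a small configuration sketch. Only the context sizes and the 3-state gross model come from the text; the field names and the per-phone state count are assumptions of this illustration:

```python
# Hypothetical summary of the two parallel syllable models; field names
# are illustrative, not taken from the system described in the text.
SYLLABLE_STREAMS = {
    "fine": {                     # concatenation of CI phone models
        "unit": "phone",
        "states_per_phone": 3,    # assumed typical HMM/ANN topology
        "context_frames": 9,      # ~100 ms of acoustic context
        "min_phone_duration": True,
    },
    "gross": {                    # coarse syllable-level model
        "unit": "syllable",
        "num_states": 3,          # 3-state model, as stated in the text
        "context_frames": 17,     # ~200 ms of acoustic context
    },
}
```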

Preliminary tests were performed on the NUMBERS'93 database.

Full-band log-RASTA-PLP parameters were used, with 9 frames (125 ms) of contextual information for the phoneme-based model and 17 frames (225 ms) for the gross syllable model. Decoding was done with the HMM decomposition/recombination algorithm. We recombined the sub-stream models either linearly (Eq. 5.3) or by using a multilayer perceptron (Eq. 5.4). As an additional reference point, tests were also performed by constraining the search (based on phone HMMs) to match the true syllable segmentation (obtained from a Viterbi alignment).
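The linear recombination rule can be sketched as a weighted sum of per-stream log posteriors. This is a minimal illustration only; the exact form of Eq. 5.3 and the weight normalization are assumptions (the MLP alternative of Eq. 5.4 would instead feed both stream scores into a small trained network):

```python
import math

def linear_recombination(stream_log_probs, weights):
    """Combine per-stream log posteriors as a weighted sum in the
    log domain.  Weights are assumed to sum to one."""
    return sum(w * lp for w, lp in zip(weights, stream_log_probs))

# equal weighting of a phone-stream score and a syllable-stream score
combined = linear_recombination([math.log(0.8), math.log(0.6)],
                                [0.5, 0.5])
```

During decoding, such a combined score would replace the single-stream score at each recombination point of the decomposition/recombination algorithm.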

Tests were done on clean speech as well as on speech corrupted by additive stationary white noise. The results, reported in Table 5.1 and compared to a state-of-the-art phone-based hybrid HMM/ANN system, clearly show a significant performance improvement.

Table 5.1: Word error rates on continuous numbers (Numbers'93 database). Phone refers to the regular phone-based recognizer. Linear refers to the multi-stream system with linear recombination of the two streams. MLP refers to recombination with an MLP. Cheat refers to constraining the dynamic programming search with the true syllable boundaries. Noise was additive Gaussian white noise at 15 dB SNR.
Error rate       Phone   Linear  MLP     Cheat
clean speech     10.7%   10.1%   8.9%    6.8%
speech + noise   17.2%   16.2%   16.2%   13.5%


Christophe Ris