Since Jont Allen's cogent retelling of Harvey Fletcher's work on the articulation index [6,1], there has been considerable international interest in the speech recognition community in multi-band ASR. The main idea of this approach is to divide the signal into separate spectral bands, process each independently (typically generating state probabilities or likelihoods for each), and then merge the information streams, as shown in Figure 5.1. Some of the motivations for this multi-band approach are:
The most common objection to the use of separate statistical models for each band has been that important information, in the form of correlation between bands, may be lost. Our experience, and that of our colleagues, has been that recognition performance is not hurt by this approach. In the work reported here, we examine estimator performance in more detail.
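The per-band modeling described above can be made concrete with a minimal sketch. Treating the sub-band streams as statistically independent, the merged log-likelihood of each state is simply a (possibly weighted) sum of the per-band log-likelihoods. The function name, the two-band example, and the phone labels below are illustrative assumptions, not the specific recognizer used in this work:

```python
def merge_band_log_likelihoods(band_logliks, weights=None):
    """Merge per-band state log-likelihoods under an independence
    assumption: the merged score for each state is a (weighted) sum
    of that state's log-likelihoods across the sub-bands.

    band_logliks: list of dicts, one per sub-band, mapping state -> log-likelihood
    weights: optional per-band weights (e.g. band reliability estimates)
    """
    if weights is None:
        weights = [1.0] * len(band_logliks)
    states = band_logliks[0].keys()
    return {s: sum(w * band[s] for w, band in zip(weights, band_logliks))
            for s in states}

# Two hypothetical sub-bands scoring two phone states for one frame.
band1 = {"aa": -2.0, "iy": -3.5}
band2 = {"aa": -2.5, "iy": -0.5}
merged = merge_band_log_likelihoods([band1, band2])
best = max(merged, key=merged.get)  # "iy": -4.0 beats "aa": -4.5
```

A weighted variant of this sum is one common way to down-weight bands corrupted by narrow-band noise, which is among the motivations for the multi-band approach.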
Some multi-band researchers [4,12,9,13] have postulated that transitions in sub-bands occur asynchronously, and that a phone- or syllable-level merging of the multi-band streams is therefore necessary to permit independent alignment for each band within the merged unit. However, this hypothesis has not been tested directly; nor has there been a study of transition-boundary shifts in the presence of speech-signal variations such as room reverberation or changes in speaking rate. Without such evidence, we cannot justify the consideration of longer-term merging units for multi-band ASR. Below, we examine this assumption by analyzing the transition lags in each sub-band to see whether sub-band transitions do occur asynchronously.
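The lag analysis just described can be sketched as follows. The idea is to locate a transition frame in each sub-band and report each band's lag relative to a reference band; a nonzero spread of lags would indicate asynchrony. The threshold-crossing detector and the toy trajectories below are simplifying assumptions for illustration, not the boundary-detection method used in the experiments:

```python
def transition_frame(trajectory, threshold=0.5):
    """Return the index of the first frame whose value reaches the
    threshold: a crude stand-in for a phone-boundary detector applied
    to one sub-band's smoothed activation trajectory."""
    for i, value in enumerate(trajectory):
        if value >= threshold:
            return i
    return None

def band_lags(band_trajectories, reference_band=0):
    """Transition lag of each sub-band relative to a reference band,
    in frames. Positive lag: this band's transition occurs later."""
    times = [transition_frame(t) for t in band_trajectories]
    ref = times[reference_band]
    return [t - ref for t in times]

# Hypothetical per-band activations around a single phone boundary.
bands = [
    [0.1, 0.2, 0.6, 0.9, 1.0],  # band 0: transition at frame 2 (reference)
    [0.1, 0.1, 0.3, 0.7, 0.9],  # band 1: transition at frame 3, lags by 1
    [0.2, 0.7, 0.8, 0.9, 1.0],  # band 2: transition at frame 1, leads by 1
]
print(band_lags(bands))  # → [0, 1, -1]
```

Collecting such lags over many boundaries, with and without reverberation or speaking-rate changes, is one way to quantify whether the asynchrony is large enough to motivate phone- or syllable-level merging units.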