In speech recognition systems, one important overall goal is to accurately estimate the joint distribution of feature vectors given a particular acoustic model. To improve recognition performance, it is probably advantageous to concentrate on those aspects of these joint probability distributions that are represented poorly by traditional methods. For example, under the conditional independence assumptions associated with Hidden Markov Models (HMMs), the distribution of time-localized feature vectors is modeled with a dependence only on a hidden state variable and not directly on the potentially useful acoustic context. The multiband approach described above can be viewed as an attempt to better model this joint distribution overall.
We have also proposed that by modeling the joint distribution of time-localized feature vectors together with statistics relating those feature vectors to the relevant acoustic context, we can estimate information contained in the feature-vector joint distribution without the theoretical or computational difficulties that typically accompany such estimates.
We do this using the modcrossgram (MCG), a computational means of estimating short-time spectro-temporal correlation-based statistics that are informative about the feature-vector joint distribution. Using the standard hybrid ANN/HMM architecture, we compared an MCG-based speech recognition system with a more traditional one on an isolated-word speech database. We showed that, in the presence of noise, the MCG-based system achieves a significant reduction in word error rate over the standard system.
Furthermore, we evaluated information-preserving reduction strategies for those statistics. We claim that the statistics corresponding to spectro-temporal loci in speech with relatively large mutual information are most useful for estimating the information contained in the feature-vector joint distribution, and that a system using such statistics is more likely to generalize. Using an EM algorithm to compute the mutual information between pairs of points in the time-frequency grid, we verified these hypotheses using both overlap plots and speech recognition word-error results. Finally, we propose a data-derived Bayes-net augmentation to HMMs that explicitly includes a dependence on the relevant acoustic context.
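The EM-based mutual-information computation can be sketched as follows. This is a minimal illustration, not the exact procedure used in our experiments: the helper names (`fit_gmm_em`, `mi_gmm`), the diagonal-covariance mixture form, and the Monte-Carlo estimator — fit Gaussian mixtures by EM to the joint and marginal samples, then average log p(x,y) − log p(x) − log p(y) over the data — are all assumptions made for the sketch.

```python
import numpy as np

def fit_gmm_em(data, k=4, iters=50, seed=0):
    """Fit a diagonal-covariance Gaussian mixture to `data` (N, D) by plain EM."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    mu = data[rng.choice(n, k, replace=False)]        # initialize means from samples
    var = np.full((k, d), data.var(axis=0))           # shared initial variances
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: log-responsibilities under each diagonal Gaussian component
        logp = (-0.5 * ((data[:, None, :] - mu) ** 2 / var
                        + np.log(2 * np.pi * var)).sum(-1)) + np.log(w)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ data) / nk[:, None]
        var = (r.T @ data**2) / nk[:, None] - mu**2 + 1e-6
    return w, mu, var

def gmm_logpdf(x, w, mu, var):
    """Log-density of samples `x` (N, D) under the fitted mixture."""
    logp = (-0.5 * ((x[:, None, :] - mu) ** 2 / var
                    + np.log(2 * np.pi * var)).sum(-1)) + np.log(w)
    m = logp.max(axis=1)
    return m + np.log(np.exp(logp - m[:, None]).sum(axis=1))

def mi_gmm(x, y, k=4):
    """Monte-Carlo MI estimate (nats): mean of log p(x,y) - log p(x) - log p(y)."""
    xy = np.column_stack([x, y])
    joint = fit_gmm_em(xy, k)
    px = fit_gmm_em(x[:, None], k)
    py = fit_gmm_em(y[:, None], k)
    return float(np.mean(gmm_logpdf(xy, *joint)
                         - gmm_logpdf(x[:, None], *px)
                         - gmm_logpdf(y[:, None], *py)))
```

On strongly dependent pairs this estimator returns a clearly positive value, while for independent pairs it stays near zero (up to mild overfitting bias of the mixture fits).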
Figure 5.3 shows the information density of a randomly selected 2-hour section of Switchboard. The plot shows the mutual information, computed using the EM algorithm, between pairs of points in the time-frequency plane. In other words, we first compute
\[
I_{i,j}(d) = I(X_{t,i};\, X_{t-d,j}),
\]
where $I(X,Y)$ is the mutual information between $X$ and $Y$, and $X_{t,i}$ is channel $i$ of the feature vector at position $t$ (i.e., variation over $t$ defines a sampling ensemble). We then compute the average
\[
\bar{I}(d,f) = \frac{1}{N_f} \sum_{j-i=f} I_{i,j}(d),
\]
where $f$ spans the frequency differences shown in Figure 5.3, $N_f$ is the number of channel pairs at frequency difference $f$, and the time lag $d$ runs to 425 ms into the past. As can be seen, there is significant information spanning at least around 200 ms into the past within band, but the information appears to drop off more rapidly as the frequency difference increases. This can be viewed as an additional motivation for the multiband approach, since it shows that, at least when we compute the unconditional mutual information between pairs of points in the time-frequency plane, much of the information lies within band anyway. Furthermore, such information might help determine the bandwidth used for each sub-band recognizer.
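The averaging over the time-frequency grid can be sketched as follows, here with a simple histogram plug-in MI estimate in place of the EM-based one. The function names (`hist_mi`, `avg_mi_surface`), the bin count, and the lag/frequency-difference ranges are illustrative assumptions; the sketch pairs channel $i$ at time $t$ with channel $i+f$ at time $t-d$ and averages over all channel pairs at each $(d, f)$.

```python
import numpy as np

def hist_mi(x, y, bins=16):
    """Plug-in MI estimate (nats) from a 2-D histogram of paired samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def avg_mi_surface(S, max_lag, max_df, bins=16):
    """Average MI between channel i at time t and channel i+f at time t-d,
    grouped by time lag d and frequency difference f.
    S: (channels, frames) spectrogram-like feature matrix."""
    C, T = S.shape
    out = np.zeros((max_lag + 1, max_df + 1))
    for d in range(max_lag + 1):
        for f in range(max_df + 1):
            vals = [hist_mi(S[i, d:], S[i + f, :T - d], bins)
                    for i in range(C - f)]
            out[d, f] = np.mean(vals)
    return out
```

On features with temporal correlation but independent channels, the resulting surface shows exactly the pattern discussed above: information decays with time lag within a band and falls off sharply across frequency.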