
Multi-Stream Statistical Model

We address here the problem of recombining several sources of information represented by different input streams. The problem can be formulated as follows: assume an observation sequence X composed of K input streams $X^k$ representing the utterance to be recognized, and assume that the hypothesized model M for an utterance is composed of J sub-unit models $M_j~(j=1, \ldots, J)$ associated with the sub-unit level at which we want to recombine the input streams (e.g., syllables). To process the streams independently of one another up to the defined sub-unit level, each sub-unit model $M_j$ is composed of parallel models $M_j^k$ (possibly with different topologies) that are forced to recombine their respective segmental scores at some temporal anchor points. The resulting statistical model is illustrated in Figure 5.4. In this model we note that:

Figure 5.4: General form of a K-stream recognizer with anchor points between speech units (to force synchrony between the different streams). Note that the model topology is not necessarily the same for the different sub-systems.

The recognition problem for a likelihood-based system can then be formulated in terms of finding the model M maximizing:

\begin{displaymath}p(X\vert M) = \prod_{j=1}^{J}
p(X_j\vert M_j) \end{displaymath}

where $X_j$ represents the multiple-stream subsequence associated with the sub-unit model $M_j$. Assuming that we have a different "expert" $E_k$ for each input stream $X^k$ (e.g., one "expert" for long-term features and one "expert" for short-term features) and that those experts are mutually exclusive (i.e., conditionally independent) and collectively exhaustive, we have:

\begin{displaymath}\sum_{k=1}^{K} P(E_k ) = 1 \end{displaymath}

where $P(E_k)$ represents the probability that expert $E_k$ is better than any other expert. We then have:

\begin{displaymath}p(X\vert M) = \prod_{j=1}^{J}
\sum_{k=1}^{K} p(X_j^k \vert M_j^k ) P(E_k \vert M_j )
\end{displaymath} (5.1)

where $P(E_k \vert M_j)$ represents the reliability of expert $E_k$ given the considered sub-unit.
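As an illustration of Equation (5.1), the following minimal sketch evaluates the expert-weighted likelihood in the log domain. It assumes that the per-segment stream log-likelihoods and the reliabilities have already been computed elsewhere; the function names and the nested-list layout are hypothetical choices made for this example, not part of the original system.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood_experts(seg_loglik, reliability):
    """Log of Equation (5.1).

    seg_loglik[j][k]  = log p(X_j^k | M_j^k)   (assumed precomputed)
    reliability[j][k] = P(E_k | M_j), summing to 1 over k for each j
    """
    total = 0.0
    for logs, probs in zip(seg_loglik, reliability):
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("reliabilities must sum to 1 for each sub-unit")
        # log of sum_k p(X_j^k | M_j^k) P(E_k | M_j); zero-weight experts are skipped
        terms = [lp + math.log(p) for lp, p in zip(logs, probs) if p > 0.0]
        total += logsumexp(terms)  # product over j becomes a sum of logs
    return total
```

Working with logarithms avoids underflow when the segment likelihoods are very small, which is the usual situation for long segments.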

Conceptually, the analysis above suggests that, given any hypothesized segmentation, the hypothesis score may be evaluated using multiple experts and some measure of their reliability. Generally, the experts could operate at different time scales, but the formalism requires a resynchronization of the information streams at some recombination point corresponding to the end of some relevant segment (e.g., a syllable).

In the specific case in which the streams are assumed to be statistically independent, we do not need an estimate of the expert reliability, since we can decompose the full likelihood into a product of stream likelihoods for each segment model. For this case we can simply compute:

\begin{displaymath}\log p(X\vert M) = \sum_{j=1}^{J}
\sum_{k=1}^{K} \log p(X_j^k \vert M_j^k )
\end{displaymath} (5.2)

Equation (5.2) contains no weighting factors, even though the reliability of the different input streams may differ. The approach can therefore be generalized to a weighted log-likelihood approach. We then have:

 \begin{displaymath}\log p(X\vert M) = \sum_{j=1}^{J}
\sum_{k=1}^{K} w_j^k \log p(X_j^k \vert M_j^k )
\end{displaymath} (5.3)

where $w_j^k$ represents the reliability of input stream k for sub-unit j. In the multi-band case (see Section 5.3.3), these weighting factors could be computed, e.g., as a function of the normalized SNR in the time (j) and frequency (k) limited segment $X_j^k$ and/or of the normalized information available in band k for sub-unit model $M_j$.
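Equations (5.2) and (5.3) can be sketched with a single function, as below. The log-likelihoods and the weights are assumed to be given; how the weights are derived (e.g., from SNR) is outside the scope of this sketch, and the function name is hypothetical.

```python
def weighted_log_likelihood(seg_loglik, weights=None):
    """Equation (5.3); with weights=None it reduces to Equation (5.2).

    seg_loglik[j][k] = log p(X_j^k | M_j^k)   (assumed precomputed)
    weights[j][k]    = w_j^k, reliability of stream k for sub-unit j
    """
    total = 0.0
    for j, logs in enumerate(seg_loglik):
        for k, logp in enumerate(logs):
            w = 1.0 if weights is None else weights[j][k]
            total += w * logp
    return total
```

Setting a weight to zero removes the corresponding stream from the score for that sub-unit, which is one way an unreliable band could be ignored.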

More generally, we may also use a nonlinear system to recombine probabilities or log likelihoods so as to relax the assumption of the independence of the streams:

 \begin{displaymath}\log p(X\vert M) = \sum_{j=1}^{J}
f \left( W, \{ \log p(X_j^k \vert M_j^k ) , ~ \forall k \} \right)
\end{displaymath} (5.4)

where W is a global set of recombination parameters.
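The general form of Equation (5.4) can be sketched as follows. The choice of f is left open in the text; the two functions shown here (a linear combination and a small one-hidden-layer network with tanh units) are only illustrative assumptions, as is the layout of the parameter set W.

```python
import math


def recombine(seg_loglik, f, W):
    """Equation (5.4): sum over sub-units j of f(W, {log p(X_j^k | M_j^k)}).

    W is shared by all sub-units, i.e. it is a global parameter set.
    """
    return sum(f(W, logs) for logs in seg_loglik)


def linear_f(W, logs):
    """Linear f; W is a list of K stream weights (hypothetical layout)."""
    return sum(w * lp for w, lp in zip(W, logs))


def mlp_f(W, logs):
    """Nonlinear f: one hidden tanh layer (hypothetical architecture).

    W = (hidden_weights, hidden_biases, output_weights, output_bias)
    """
    hidden_w, hidden_b, out_w, out_b = W
    hidden = [
        math.tanh(sum(w * lp for w, lp in zip(row, logs)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b
```

With `linear_f` and all weights equal to one, `recombine` gives the same value as the unweighted sum of stream log-likelihoods; a trained nonlinear f would replace it when the independence assumption is relaxed.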

During recognition, we will have to find the best sentence model M maximizing p(X|M). Different solutions will be investigated, including:

Recombination at the sub-unit level (where the $M_j$'s are sub-unit models composed of parallel sub-models, one for each input stream, as illustrated in Figure 5.4).
Recombination at the HMM-state level (where the $M_j$'s are HMM states), which is also discussed in this paper although it does not allow for asynchrony of the different streams.

Recombination at the HMM-state level can be done in many ways, including an untrained linear combination or a trained linear or nonlinear combination (e.g., by using a recombining neural network). This is straightforward to implement and amounts to performing a standard Viterbi decoding in which the local (log) probabilities are obtained from a linear or nonlinear combination of the local stream probabilities. Of course, this approach does not allow for asynchrony, yet it has been shown to be very promising for the multi-band approach discussed in Section 5.3.3.
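A minimal sketch of state-level recombination is given below, assuming a linear combination with fixed stream weights. The local stream log probabilities, the transition matrix and the initial distribution are assumed to be supplied by the caller; all function names and data layouts are hypothetical.

```python
def combine_linear(stream_logprobs, weights):
    """Combine per-stream local log probabilities for one frame.

    stream_logprobs[k][s] = local log probability of state s in stream k
    weights[k]            = weight of stream k (assumed fixed over time)
    """
    n_states = len(stream_logprobs[0])
    return [
        sum(w * lp[s] for w, lp in zip(weights, stream_logprobs))
        for s in range(n_states)
    ]


def viterbi(local_logprobs, log_trans, log_init):
    """Standard Viterbi search; returns (best log score, best state path).

    local_logprobs[t][s], log_trans[i][j] and log_init[s] are log values.
    """
    n_states = len(log_init)
    delta = [log_init[s] + local_logprobs[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(local_logprobs)):
        new_delta, ptr = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda i: delta[i] + log_trans[i][s])
            new_delta.append(delta[prev] + log_trans[prev][s] + local_logprobs[t][s])
            ptr.append(prev)
        delta = new_delta
        backptr.append(ptr)
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptr):  # follow back-pointers from the last frame
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path


def decode_multistream(frames, weights, log_trans, log_init):
    """frames[t][k][s]: local log probability of state s, stream k, frame t."""
    local = [combine_linear(frame, weights) for frame in frames]
    return viterbi(local, log_trans, log_init)
```

Only the computation of the local scores differs from a single-stream decoder; the search itself is unchanged, which is why all streams are forced to share the same state sequence.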

On the other hand, recombination of the input streams at the sub-unit level requires a significant adaptation of the recognizer. We are presently using an algorithm referred to as "HMM recombination", which is an adaptation of the HMM decomposition algorithm [17]. The HMM decomposition algorithm is a time-synchronous Viterbi search that allows the decomposition of a single stream (speech signal) into two independent components (typically speech and noise). In the same spirit, a similar algorithm can be used to combine multiple input streams (e.g., short-term features and long-term features) into a single HMM. The constraint between the parallel sub-models is implemented by forcing these models to have the same begin and end points. The resulting decoding process can be implemented via a particular form of dynamic programming that guarantees the optimal segmentation.
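The effect of the shared begin and end points can be illustrated with a simplified segment-level dynamic program. This is not the HMM recombination algorithm itself (which is a time-synchronous Viterbi search); it is a sketch that assumes each stream can already score any sub-unit over any frame interval, and it only shows how common boundaries lead to an optimal joint segmentation. All names are hypothetical.

```python
def make_combined_score(stream_scores):
    """Sum the segment log scores of all streams over the same interval.

    Each element of stream_scores is a function score(j, start, end)
    returning the log score of sub-unit j on frames [start, end).
    """
    def combined(j, start, end):
        return sum(score(j, start, end) for score in stream_scores)
    return combined


def best_segmentation(n_frames, n_units, segment_score):
    """Find the boundaries that maximize the sum of segment scores.

    Assumes n_frames >= n_units and at least one frame per sub-unit.
    Returns (best log score, boundaries), with n_units + 1 boundaries.
    """
    neg_inf = float("-inf")
    # best[j][t]: best score of the first j sub-units covering frames [0, t)
    best = [[neg_inf] * (n_frames + 1) for _ in range(n_units + 1)]
    back = [[None] * (n_frames + 1) for _ in range(n_units + 1)]
    best[0][0] = 0.0
    for j in range(1, n_units + 1):
        for end in range(j, n_frames + 1):
            for start in range(j - 1, end):
                if best[j - 1][start] == neg_inf:
                    continue
                score = best[j - 1][start] + segment_score(j - 1, start, end)
                if score > best[j][end]:
                    best[j][end] = score
                    back[j][end] = start
    bounds = [n_frames]
    for j in range(n_units, 0, -1):  # trace the anchor points backwards
        bounds.append(back[j][bounds[-1]])
    bounds.reverse()
    return best[n_units][n_frames], bounds
```

Because every stream is scored on the same interval, the streams are free to follow different state paths inside a sub-unit but must agree at its boundaries, which play the role of the anchor points.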

Christophe Ris