Next: Portuguese Language Modelling Up: Task 3.3: Technical Description Previous: Experiments

N-best Cache LMs

The cache-based language models used in this work are constructed simply by linearly interpolating a trigram language model ${\cal T}$ with a cache-based component ${\cal C}$. Instead of the conventional backward cache used in [7], we use a forward-backward cache, which operates as follows: given a word string $w_1^n = w_1,\ldots,w_n$, the probability distribution for a word $w$ at position $i$ in this string is given by

\begin{displaymath}
\hat{P}(w \mid w_1^{i-1}, w_{i+1}^{n}) = (1-\lambda)\hat{P}_{\cal T}(w \mid w_1^{i-1}) + \lambda \hat{P}_{\cal C}(w \mid w_{1}^{i-1}, w_{i+1}^{n})
\end{displaymath} (3.11)

where $\lambda$ is the interpolation parameter.

The cache-based probabilities are calculated such that a word which has been observed in close proximity to the current word will have a high probability estimate, whereas a word which has not been observed will have zero probability.
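The interpolation in equation (3.11) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the cache component is approximated here by a simple relative-frequency (unigram) estimate over the cached words, and all names and the toy data are hypothetical.

```python
from collections import Counter

def cache_prob(word, cache_words):
    """Cache component: relative frequency of `word` among the cached words.
    A word never observed in the cache gets zero probability, as described
    in the text."""
    counts = Counter(cache_words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts[word] / total

def interpolated_prob(word, p_trigram, cache_words, lam=0.05):
    """Equation (3.11): (1 - lambda) * P_trigram + lambda * P_cache."""
    return (1.0 - lam) * p_trigram + lam * cache_prob(word, cache_words)

# Toy forward-backward cache: words observed before and after position i.
cache = ["the", "market", "fell", "sharply", "market"]
p = interpolated_prob("market", p_trigram=0.01, cache_words=cache, lam=0.1)
```

Here "market" occurs twice in five cached words, so the cache component contributes $0.1 \times 0.4$ on top of the discounted trigram probability.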

We conducted perplexity experiments using the cache-based language models. The test set used was the language model test text from the 1996 Hub 4 broadcast news evaluation. The optimal value of the interpolation parameter $\lambda$ (see equation (3.11)) was calculated by the EM algorithm. It was found that the adaptation technique reduced the perplexity by 10% when compared to a baseline trigram model.
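The EM estimation of the interpolation weight $\lambda$ for a two-component mixture can be sketched as below. This is an assumed, generic formulation (the E-step computes each word's posterior responsibility for the cache component; the M-step averages them); the per-word probabilities in the example are invented.

```python
def em_lambda(p_trigram, p_cache, iters=100, lam=0.5):
    """Estimate the mixture weight lambda maximising the held-out likelihood
    of (1 - lambda) * P_trigram + lambda * P_cache, by EM.
    p_trigram, p_cache: per-word component probabilities on held-out data."""
    for _ in range(iters):
        # E-step: posterior probability that each word came from the cache
        resp = [lam * pc / ((1.0 - lam) * pt + lam * pc)
                for pt, pc in zip(p_trigram, p_cache)]
        # M-step: the new lambda is the mean responsibility
        lam = sum(resp) / len(resp)
    return lam

# Hypothetical held-out probabilities for four words:
lam_hat = em_lambda([0.10, 0.20, 0.05, 0.30], [0.40, 0.10, 0.20, 0.00])
```

Each EM iteration is guaranteed not to decrease the held-out likelihood, so the estimate converges to a local (here, for one parameter, global) optimum.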

We investigated the effect of cache-based language models on word error rate by conducting lattice rescoring experiments. A modified version of the 1996 Abbot evaluation system [13] was used to generate lattices for each of the segments in the six shows of the devtest of the 1996 Hub 4 evaluation. For these lattices, the lattice word error rate was $7.0\%$.

Three different types of cache-based model were investigated:

1. The best hypothesis from the initial recognition pass is added to the cache.
2. The N-best list from the initial recognition pass is added to the cache. The probabilities assigned to each of the N-best hypotheses in the initial recognition pass are normalized, and are used to weight the contribution to the cache-based probabilities made by each hypothesis. It has been shown [12] that the proportion of the N-best list which contains a particular word in the appropriate position is a simple but effective measure for the confidence we have in that word being correct. Therefore, by adding the whole N-best list to the cache, we are effectively weighting each word's contribution to the cache by our confidence that it was recognised correctly. In this way we hope to partially overcome the problem of an errorful first recognition pass.
3. The cache contains the reference transcription, rather than the recognizer output. This puts an upper bound on the performance gain we can expect from techniques which attempt to compensate for the problems of an errorful initial transcription.
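The N-best weighting described above can be sketched as follows. All names are hypothetical, and one assumption is made explicit: hypothesis scores are taken as log probabilities and softmax-normalised, whereas the actual system's normalisation may differ.

```python
import math
from collections import defaultdict

def nbest_weighted_cache(nbest):
    """Build a weighted cache from an N-best list.
    nbest: list of (hypothesis_words, log_score) pairs.  The scores are
    normalised so the hypothesis weights sum to one, and each word's cache
    count is the total weight of the hypotheses containing it -- an implicit
    confidence that the word was recognised correctly."""
    # Softmax-normalise the hypothesis log scores (assumption; see lead-in).
    m = max(score for _, score in nbest)
    weights = [math.exp(score - m) for _, score in nbest]
    z = sum(weights)
    cache = defaultdict(float)
    for (words, _), w in zip(nbest, weights):
        for word in words:
            cache[word] += w / z
    return dict(cache)

# Toy 2-best list with equal scores: "a" appears in both hypotheses,
# so it receives full weight; "b" and "c" each receive half.
counts = nbest_weighted_cache([(["a", "b"], 0.0), (["a", "c"], 0.0)])
```

A word appearing in every hypothesis thus contributes a full count, while a word appearing in only a few low-scoring hypotheses contributes almost nothing, which is exactly the confidence weighting the text motivates.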

As in the perplexity experiments, we use a forward and backward cache. Each of the six shows in the test set is partitioned into topic-homogeneous `stories'. For each segment that we rescore, the cache can be thought of as containing the initial transcription (or N-best initial transcriptions) for every segment in the same story with the exception of the current segment.
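The leave-one-out construction of the cache per segment can be sketched as below; the function name and data layout are hypothetical.

```python
def story_cache(story_segments, current_idx):
    """Cache used when rescoring segment `current_idx`: the (hypothesised)
    transcriptions of every other segment in the same story, providing both
    forward and backward context.  `story_segments` is a list of word lists,
    one per segment, in story order."""
    cache = []
    for i, words in enumerate(story_segments):
        if i != current_idx:  # exclude the segment being rescored
            cache.extend(words)
    return cache

# Toy story of three segments; rescoring the middle one.
ctx = story_cache([["stocks", "fell"], ["the", "dow"], ["fell", "again"]], 1)
```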

Table 3.4 shows the effect of these three different types of cache-based model on word error rate, with different values of the interpolation parameter $\lambda$.

Table 3.4: Total number of errors and % word error rates for cache-based approaches. The test set was the six shows in the 1996 Hub 4 devtest, which contains a total of 22697 words.

Baseline trigram model: 8126 errors (35.8%), independent of $\lambda$.

              $\lambda = 0.05$   $\lambda = 0.1$
1-best        8150 (35.9%)       8175 (36.0%)
100-best      8141 (35.9%)       8163 (36.0%)
200-best      8139 (35.9%)       8160 (36.0%)
Supervised    8106 (35.7%)       8100 (35.7%)


The results of Table 3.4 show that:

1. Adding the recogniser's best hypothesis to the cache slightly increases the number of errors relative to the baseline trigram model.
2. Weighting the cache contributions by the N-best list recovers only part of this small degradation.
3. Even the supervised cache, built from the reference transcription, gives only a marginal reduction in errors.

It should be noted, however, that none of these three observations is statistically significant (see footnote 3.4). For this reason, it is the third point that seems to be the most important. It clearly points to the conclusion that our initial hypothesis (that `A significant problem with cache-based language models is that the cache will be based on an errorful transcription') was incorrect. It seems that no technique to improve the contents of the cache will result in a significant reduction in word error rate for this task.

Christophe Ris