next up previous contents
Next: WP4: APPLICATION DOMAIN Up: No Title Previous: WP2: Lexica and

WP3: Language Models and LM Adaptation

WP3: Overview

Work Package Manager: SU
Executing Partners: CUED, INESC, SU
In WERNICKE we realized that the efficient representation and use of language models (LMs) was of major importance for flexible recognizers but was unfortunately often disregarded. In consequence we are mounting a substantial investigation of novel language modeling approaches for spoken language.

Although most of the developments here will be done on US and UK English (milestones and deliverables are referring to those languages), outputs will also be exploited, when possible, for other languages.

WP3: Milestones and Deliverables

WP3 Milestone: M3.1

As reported in Deliverable D3.1, this milestone was met.

WP3 Deliverable: D3.1

The report below as well Annex D3.1 constitute this deliverable.

Task T3.1: Markov Model Based Grammars

Task Coordinator: CUED
Executing Partners: CUED, SU

Task T3.1: Objectives

Markov model based grammars, such as trigram grammars, are the most commonly used statistical language model. They are relatively easy to estimate and currently provide the best performance for very large vocabulary recognition. Hence it is both necessary to incorporate this basic tool and prudent to investigate this method with the aim of producing better statistical language models. The topics to be investigated in this task include the representation and use of different grammars by Markov based models, the identification and use of class-based models and the use of variable-order Markov models.

Task T3.1: Status

Since language model (LM) research involves the processing of gigabytes of text data, it is important to have efficient flexible software tools to process source texts. The Carnegie Mellon University Statistical Language Modeling (CMU SLM) Toolkit [52] was released in 1994. This is a publicly-available set of Unix software tools designed to facilitate language modelling work in the research community. The quantity of text now available for training LMs has increased dramatically, and the original toolkit is unable to process such a large quantity of text efficiently. Furthermore, increased amounts of training data have led to greater interest in extending the traditional trigram models to the use of 4- and 5-gram models, which were not supported by the original version of the toolkit.

Carnegie Mellon and Cambridge Universities have collaborated to produce version 2 of the toolkit in order to address the weaknesses of version 1. The CMU-Cambridge Statistical Language Modeling Toolkit is currently undergoing beta testing, and the final version should be publicly available by February 1997. Complementing this, Sheffield University have developed a set of LM software tools that allow a more flexible approach to text processing, and can interface to the CMU-Cambridge toolkit.

Task T3.1: Technical Description

Both sets of software tools offer much more efficient processing compared with the original CMU toolkit. For example, the Sheffield tools are can process the LM-1 corpus (producing a trigram LM) in around 8 hours on a modern workstation. The toolkits are complementary in that the CMU-Cambridge toolkit defines a vocabulary (64K words or fewer) at an early stage, and emphasizes flexibility in the LMs constructed. The Sheffield tools are constructed with unlimited vocabulary LMs in mind, and thus the definition of a vocabulary happens at a late stage.

The key improvements in the version 2 of the CMU-Cambridge toolkit are:

The additional flexibility offered by the Sheffield tools includes:

The toolkits interface at the file format level --- i.e. text processed using the Sheffield tools may be further processed using the CMU-Cambridge tools.

Further documentation on the LM software tools is provided in annex D3.1.

Task T3.1: Future Developments

The LM software tools are a valuable resource which will enable experimental work in language modelling to take place much more efficiently. Moreover, because the tools were developed ``in-house'' it is possible to readily adapt the tools to more precisely fit the specific needs of our group. For example, the CMU-Cambridge toolkit has already been modified locally to be able to write out language models in a format which can be immediately read by the NOWAY decoder.

Work in progress includes:

Task T3.2: Grammatical Inference

Task Coordinator: SU
Executing Partners: SU

Task T3.2: Objectives

The objective of this task is to develop methods for the inference of more powerful language models than n-grams. In particular, we will be exploring finite state language models (beyond n-grams).

Task T3.2: Status

Work on this topic has focused on ``multigram'' models for the bracketing of text. A goal of this approach is to bracket text into multi-word units (using a simple statistical algorithm), and then using these multigrams as the basic symbols in an n-gram model. This may be regarded as a hierarchical approach to increasing language model context, as well a method to find groups of frequently co-occuring words. The software tools for multigram processing have been developed, using some of the tools developed in task T3.1, and multigram models have been created and evaluated using the WSJ corpus.

Task T3.2: Technical Description

The multigram model [53] is a variable length sequence model in which a stream of input symbols is viewed as a succession of variable length independent sequences. Thus if we use a 2-multigram model and have a string of three symbols abc, the multigram estimate of the probability of is:

As a statistical model of text this 2-multigram model is no improvement on a backed-off bigram estimate () since:

Where maybe estimated by a backoff if the bigram does not exist:

(where is the backoff weight for a). Clearly the multigram model makes more independence assumptions than the n-gram model (or, alternatively, assumes unit backoff weights at certain points) and is hence unlikely produce models of lower perplexity.

However, the independence assumptions made by the multigram model are reasonable if the aim is to bracket, or segment, a sequence of symbols (words) into finite length subsequences. We are investigating this approach.

Consider the following example from the CSR LM-1 corpus:

Here, a sequence of three words, ``D. D. T.'', implies a name of chemical which occurs in this order for many documents. Thus, it is probably not very productive if we use a standard trigram model (i.e, ). In this case, we wish to handle ``D. D. T.'' as one word, rather than a sequence of three words so that we can reduce the redundancy of text for the later processing. Using the multigram model, we are able to compute the best possible segmentation of the original text. In fact, one of the multigram models produced the following segmentation;
which probably will agree with many persons' intuition.

We have found that the perplexity of the segmented text was reduced to around 200 given the sufficiently large multigram model, even though we assume the independence between bracketed word set.

There are many extensions possible for the multigram approach, including class-based multigrams, hierarchical multigram models in which a multigram is built using sequences derived on a corpus bracketed by a multigram model and the investigation of multigram bracketing in a tagging or parsing context.

Currently three basic tools (included in the Sheffield tool of Task T3.1) are being used extensively for the multigram modeling experiments:

These procedures treat sequences of one/two/three words as ergodic hidden Markov model (HMM) states and are designed to find the (locally) best segmentation of corpus text.

In order to handle the large size of text for parameter adjustment, it is critical to start the training from reasonably good point in the parameter space, otherwise, it would take weeks till we finally obtain such model parameters. Because the multigram model and the conventional n-gram model share the very similar structure (i.e., both consist of the word sequences), it is natural to derive an initial multigram from its n-gram counterpart. We initialize a multigram model from an n-gram model computed on the same data. Firstly, unigrams and bigrams are derived from trigram counts. The initial multigram probabilities are simply calculated from the corresponding uni/bi/trigram probabilities , i.e.,

This is only an approximate computation of the multigram parameters since the n-gram model does not include the multigram subsequence independence assumption.

Smoothing is a key issue for the inference of multigram model parameters. In our initial experiments, we have restricted the number of vocabulary words to 20,000, with out of vocabulary words being mapped to a common unknown word symbol. However, this still results in millions of bigrams and trigrams, which may not be computationally practical, and many of which will have a very low probability that may be not be reliably estimated. To combat this we have adopted a smoothing technique known as absolute discounting [58]: the initialization procedure removes bigrams and trigrams of the original n-gram model whose frequency is less than some threshold level. After calculating the multigram probabilities, the removed probability masses are redistributed to those not appearing in the multigram model.

As noted above, the initial multigram model is an approximation and requires further adjustment of parameters. This may be performed by using an ergodic HMM which is trained using the standard EM algorithm. In the case of a 3-multigram, a three-state HMM is constructed, with the states corresponding to one-, two- and three-word sequences (see fig. 3.1). Given this HMM structure, the forward-backward training accumulates evidence for ``expected segmentation'' from corpus texts, and the parameters are re-estimated from the accumulation. This HMM training procedure is straightforward and was described in [59], and will not be repeated here. The key computational issue in the training of the multigram HMM is effective hashing of word sequences.

Note that the multigram approach assume the independence between each segmented sequence of word(s). While this is a stronger (and less realistic) assumption compared with n-gram models of language, it may be appropriate in the context of bracketing subsequences of text, which could then be further exploited.

Figure 3.1: Three-state ergodic HMM used to estimate the parameters of a 3-multigram model. Each state produces a sequence of one/two/three words, then makes a transition to the next state.

In addition to a well-developed training procedure, another benefit of using the HMM structure is the Viterbi algorithm. Given a multigram model, a three-state ergodic HMM may be reconstructed and searched for the best possible segmentation (i.e., Viterbi path) of the corpus text. From this test-set perplexities may be calculated.

Tables 3.1 and 3.2 summarize some experiments using the `` Wall Street Journal'' (WSJ) articles from the `` CSR LM-1'' corpus. The initial multigram models with a cutoff frequency of 100, 50, 20, 10 and 5 were generated using trigram counts computed from the WSJ data. The number of vocabulary words was fixed (19,998 words), but the numbers of two/three-word sequences increase as the cutoff frequency decreases.

Table 3.1 shows the result of pilot experiment using the small amount of corpus text. First thing to note is that, although not very surprising, the model perplexity improved as the size of model increased. Model parameter adjustment was effective and reduced the perplexity further. However, it is also observed that the improvement became less significant for models with large number of parameters; too many iterations of the forward-backward algorithm were counter-productive due to over-fitting.

Table 3.1: This is a pilot experiment using 47 files of WSJ'94 for training and 39 files from WSJ'89 for segmentation and perplexity computation. The initial multigram models with a cutoff frequency of 100, 50, 20, 10, and 5 were generated from a trigram file, then the parameters were adjusted using the ergodic HMM training. This table shows the number of one/two/three-word sequences and the perplexities before and after one/ten iterations of training.

The experimental results using the larger amount of corpus text is shown in Table 3.2. Naturally these models worked better than the small data set case in Table 3.1, since the large data set resulted in the larger size parameter. It should be noted that, because of the reasonably good initialization scheme, convergence of the HMM training was achieved after a few iterations over the whole training text. Especially more than a single iteration was not productive, again due to over-fitting to the training corpus.

Table 3.2: The experiments used 881 files between WSJ'87 and WSJ'93 from the `` CSR LM-1'' corpus for training. Also, 47 files from WSJ'94 was used for segmentation and perplexity computation. The initial multigram models with a cutoff frequency of 100, 50, 20, 10, and 5 were generated from a trigram file, then the parameters were adjusted using the ergodic HMM training (one iteration). This table shows the number of one/two/three-word sequences and the perplexities before and after the adjustment. (Complete results expected by the review meeting.)

Task T3.2: Future Developments

Our interest in multigram models stems from their potential to statistically bracket a corpus and form the basis of class-based models, with classes corresponding to word sequences. We plan to investigate this by experimenting with n-gram models using sequence classes.

Task T3.3: Language Model Adaptation

Task Coordinator: CUED
Executing Partners: CUED, SU, INESC

Task T3.3: Objectives

In constructing a language model intended for general text, one is faced with the following problem. One can either generate a model which is trained on material from a specific domain, with the result that the model's performance will be good for text from the same domain, but poor for more general text, or one can construct a model by training it on text from many diverse sources, which will perform better on general text, but will not be especially well suited for any particular domain. Clearly, the ideal would be a general language model whose parameters could be automatically tuned according to the style of text it is attempting to model.

In this task we are investigating dynamic and adaptive language modelling algorithms that can operate in an unsupervised manner.

Task T3.3: Status

At CUED, experiments have been carried out using two adaptation techniques (see Section 3.5.3). These experiments were based on constructing adaptive language models from training data taken from the British National Corpus, and calculating their perplexity using test text from a disjoint set of the same corpus. Both techniques yield a significant reduction in perplexity over the baseline trigram language model given this multi-domain corpus, the mixture-based model giving a reduction and the cache-based system giving a reduction. The two techniques attack the problem of adaptation at different scales, and as a result can be used in parallel to give a total perplexity reduction of .

At Sheffield work has begun investigations on a mixture-based technique using latent semantic analysis. Additionally, some preliminary investigations have been carried out in using textual data from the Internet for adaptive language modelling.

For the work involved in this task INESC begun by collecting a small language model for the Portuguese language from the SAM database which is also to be used as the baseline language model of the Portuguese baseline system (see Task T1.3)

Task T3.3: Technical Description

  At CUED, two different adaptation techniques have been investigated. The first is a mixture-based approach. The training text is partitioned according to the style of text, and a trigram language model is constructed for each component. Each component is assigned a weight according to its performance modelling the observed text, and a final language model is constructed as the weighted sum of each of the mixture components.

The second approach is based on a cache of recent words. Previous work has shown that words that have occurred recently in the text are likely to reoccur. This work investigates the hypothesis that more recent words should be considered more useful within the cache, by implementing a cache in which a word's contribution to its recurrence probability decays exponentially with its distance from the current word.

Work underway at Sheffield has investigated similar techniques, using text obtained from the Internet as training data. A simple information retrieval (IR) system [60] was used in conjunction with software that automatically downloaded documents from the Internet. In a pilot study the test set consisted of text of a known classification - this classification was used as the query term in the IR system, and a language model constructed from the retrieved documents (and blended with a ``general'' language model). Although this resulted in perplexity improvements of up to 20%, online experiments in which the language model cache was used as the IR query were not successful.

We are further developing this work via latent semantic language modelling. Latent semantic indexing [54,55] is a modern information retrieval (IR) technique that is based on the singular values decomposition (SVD) of large sparse term (word) by document matrices. Each column of such a matrix describes a document, with the entries being the unigram relative frequencies of each word in that documentgif. The singular vectors corresponding to the k largest singular values are then used to define k-dimensional document and term spaces, where k is typically of the order of 100. This approach differs from usual vector-based IR methods in that a lower dimensional document subspace is automatically inferred using the SVD. The technique is referred to as ``latent semantic'' since the projection to the lower dimensional subspace has the effect of clustering together semantically similar words and documents, with IR performance data indicating that points in the derived subspace are more reliable indicators of meaning than individual words or terms.

Latent semantic models were recently used for semantic word clustering in speech recognition [57]. We believe that this dimension reduction may also be used for document modelling. A lower dimensional document subspace may be derived using the latent semantic analysis, and a mixture-based language model may be inferred in this subspace. A data point in this document space corresponds to the projection of a document (article/paragraph/sentence). Since documents may now be represented as points in a continuous space, we can construct a gaussian mixture model using the EM algorithm or a Bayesian refinement such as AutoClass [56]. This allows for both ``hard'' clustering in which each document is assigned to a single cluster, or ``soft'' clustering in which documents are assigned probabilistically to all clusters. Separate n-gram models may be constructed for each mixture component (cluster) with the n-gram counts for each document weighted by the posterior probability of the document given the cluster.

The principal computational burden of this approach lies in the singular valued decomposition of the word by document matrix. It is not unreasonable to expect this matrix to have dimensions of 60,000 100,000; however such matrices are sparse and it is possible to perform such computations on a modern workstation. We are using a publicly available package SVDPACKC to perform these computations. We hope to be able to report early results at the review meeting.

Baseline Language Model for Portuguese

The SAM database is a small and limited database (see Task T1.1). There are 50 paragraphs with 5 sentences each in a total of 2842 words (1314 different words). With this small amount of text is difficult to have any significance in the word or pair of words estimations. We found 2782 different pairs of words where 2560 just occur once. As a language model we made a word pair grammar where we defined the words which could begin the sentences, the words that end the sentences and the intermediate pairs.

We first collected the pairs from the SAM text for the sentences isolated from the paragraphs. The evaluation results were of word errors. Such a bad result could be explained from the fact that each of the speech files contain a paragraph and not just a sentence. Next we consider as our unit the paragraph and we generated again the word pair grammar. In this case we achieved a result of word errors. As expected the language model is doing a great job in such a small task.

Task T3.3: Future Developments

These adaptation techniques result is significant reductions in perplexity. It is therefore to be assumed that if implemented in a speech recognition system, they would give a useful reduction in word error rate. Future work will be directed at testing this hypothesis using lattice rescoring experiments.

WP3: Conclusion

Work on a variety of novel language modelling techniques is in progress, with some preliminary results reported. Language modelling software of increased efficiency and and flexibility has been developed, and some initial work has been performed in developing Portuguese language models.

next up previous contents
Next: WP4: APPLICATION DOMAIN Up: No Title Previous: WP2: Lexica and

Jean-Marc Boite
Tue Jan 7 12:46:31 MET 1997