Next: Mixture LM Up: Task 3.3: Technical Description Previous: Task 3.3: Technical Description

Document Space Modelling

Mixtures of language models (LMs), based on some notion of semantics, have recently been proposed as an approach to dealing with domain adaptation [7]. These approaches involve partitioning the corpus according to the style of the text to produce a set of component LMs, which are then blended together to produce a mixture LM. Although it is relatively rare, some text corpora include manual tagging of articles by subject (e.g., the British National Corpus). However, hand-labeled style may not necessarily produce the best partitions for use in conventional statistical speech recognition applications. Furthermore, it may be quite difficult to track the varying style of texts by hand. As a consequence, it is of interest to develop an automatic method for clustering the corpus texts in an unsupervised manner.
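To make the idea of blending component LMs concrete, the following is a minimal sketch of linear interpolation, assuming add-alpha smoothed unigram components and fixed mixture weights. The function names (`unigram_lm`, `mixture_lm`), the toy partitions, and the weights 0.7/0.3 are illustrative assumptions and are not taken from the work described here, which does not state the model order or how the weights are chosen.

```python
from collections import Counter


def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (assumed form)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def mixture_lm(components, weights):
    """Linear interpolation: P(w) = sum_k weights[k] * P_k(w)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    vocab = components[0].keys()
    return {
        w: sum(lam * lm[w] for lam, lm in zip(weights, components))
        for w in vocab
    }


# Two toy corpus partitions standing in for style-specific text clusters.
news = "the market fell the bank cut rates".split()
sport = "the team won the match the fans cheered".split()
vocab = sorted(set(news) | set(sport))

components = [unigram_lm(news, vocab), unigram_lm(sport, vocab)]
# Hypothetical weights favouring the first component.
mixed = mixture_lm(components, [0.7, 0.3])
```

Because each component is a proper distribution over the shared vocabulary and the weights sum to one, the mixture is also a proper distribution. In practice the weights would be estimated from held-out or adaptation data rather than fixed by hand.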

In this work, the problem is approached through the construction of a document space model that encapsulates corpus-derived semantic information. Once a consistent and powerful model is constructed, it can be applied to a number of language modeling tasks. In particular, it is straightforward to develop mixture LMs that are tuned to the varying style of the text. To this end, an approach to information retrieval (IR) known as latent semantic analysis (LSA) is used to uncover semantic information from the corpus [8,9].
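As an illustration of how LSA can place documents in a low-dimensional space, the sketch below builds a raw term-document count matrix and takes its truncated singular value decomposition with NumPy. The raw-count weighting, the rank of 2, the cosine similarity measure, and the helper names (`lsa_document_space`, `cosine`) are assumptions made for this example; the text above does not specify these details.

```python
import numpy as np


def lsa_document_space(docs, rank=2):
    """Project documents into a rank-dimensional space via truncated SVD."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}

    # Term-document matrix of raw counts (rows: terms, columns: documents).
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            counts[index[w], j] += 1.0

    # counts = U @ diag(s) @ Vt; keep only the leading `rank` dimensions.
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    doc_vectors = (np.diag(s[:rank]) @ Vt[:rank]).T  # one row per document
    return vocab, doc_vectors


def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy documents: the first two and the last two share vocabulary.
docs = [
    "bank rates market shares",
    "market shares fell bank",
    "team match goal fans",
    "fans cheered team goal",
]
vocab, vecs = lsa_document_space(docs, rank=2)

same_topic = cosine(vecs[0], vecs[1])
cross_topic = cosine(vecs[0], vecs[2])
```

In this toy case, documents that share vocabulary end up close together in the reduced space, while documents with disjoint vocabulary are nearly orthogonal. Document vectors of this kind could then be passed to any unsupervised clustering method to obtain the corpus partitions from which component LMs are built.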

The main focus of this work has been on the British National Corpus (BNC) [10]. It contains examples of both spoken and written British English, manually tagged with various levels of linguistic information. It is a general corpus; it is not specifically restricted to any particular subject field or genre. The corpus comprises more than four thousand texts with about one hundred million words, which were hand-labeled into ten domains.



 
Christophe Ris
1998-11-10