
Experiments

4124 BNC texts were partitioned at random (independently of the hand-labeled domain information) as follows:

generation set: 3309 texts (80 %) for LM generation.
evaluation set: 400 texts (10 %) for LM evaluation.
The remaining 419 texts of the corpus were held out for future use. Note that the whole corpus contains about 360 thousand distinct words, of which $19\:991$ were selected as the vocabulary in order of unigram frequency. Out-of-vocabulary (OOV) words were treated as a single ``unknown'' token. This partition and vocabulary were maintained throughout the experiments described in this section.
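As a concrete sketch, the vocabulary selection and OOV mapping can be written in a few lines of Python; the function names and the <unk> token string are illustrative, not taken from the original setup:

from collections import Counter

UNK = "<unk>"           # illustrative unknown-word token
VOCAB_SIZE = 19_991

def build_vocabulary(texts, vocab_size=VOCAB_SIZE):
    """texts: iterable of token lists. Keep the most frequent word
    types, i.e. selection in unigram frequency order."""
    counts = Counter()
    for tokens in texts:
        counts.update(tokens)
    return {w for w, _ in counts.most_common(vocab_size)}

def map_oov(tokens, vocab):
    """Replace out-of-vocabulary words with the single unknown token."""
    return [w if w in vocab else UNK for w in tokens]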

Because each text in the BNC contains tens to hundreds of thousands of words, texts were mechanically subdivided into fixed-size windows so that varying styles within a text could be tracked. Using a fixed window size of 1000 words, the 3309 texts in the generation set were divided into $87\:149$ units. These units are referred to as ``documents''.
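A minimal sketch of this windowing step, assuming each BNC text is available as a token list (whether a short trailing fragment is kept or dropped is not specified above; this sketch keeps it):

WINDOW = 1000

def split_into_documents(tokens, window=WINDOW):
    """Mechanically split one BNC text into fixed-size word windows,
    the ``documents'' referred to above."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]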

In the experiments, $40\:000$ documents were randomly chosen and a $19\:991\times40\:000$ term-by-document matrix was generated from the unigram relative frequencies. The matrix was very sparse; approximately 1.6 % of its elements were non-zero. The SVD was applied (using a publicly available package, ``SVDPACKC'' [11]), computing the top 200 singular values and their corresponding singular vectors. Using Equation (3.10), all $87\:149$ documents were projected onto the 200-dimensional document space. The documents were then clustered using the k-means algorithm with the cosine-angle metric. The resulting document clusters are referred to as ``classes'' to differentiate them from the hand-labeled domains.
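This pipeline can be sketched as follows, substituting scipy and scikit-learn for SVDPACKC; Equation (3.10) is assumed here to be the standard LSA folding-in projection $\hat{d} = S^{-1}U^{T}d$, which may differ in detail from the original:

import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

K = 200          # singular triplets retained
N_CLASSES = 10   # number of document clusters ("classes")

def project(doc_vec, U, s):
    """Fold a term-frequency vector into the K-dim document space;
    assumes Equation (3.10) is the LSA projection S^{-1} U^T d."""
    return (U.T @ doc_vec) / s

def build_document_space(A, doc_vectors):
    """A: sparse 19,991 x 40,000 term-by-document matrix of unigram
    relative frequencies; doc_vectors: frequency vectors of all
    87,149 documents to be projected."""
    U, s, _ = svds(csc_matrix(A), k=K)                    # top-K SVD
    D = np.vstack([project(d, U, s) for d in doc_vectors])
    # Unit-normalizing the rows makes Euclidean k-means equivalent to
    # clustering by cosine angle (spherical k-means).
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    labels = KMeans(n_clusters=N_CLASSES, n_init=10).fit_predict(D)
    return U, s, D, labels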

Although similarity to the hand-labeled domains does not necessarily imply superiority in the language modeling task, it is still of interest to compare the automatically generated classes against the hand-labeled domains. Figure 3.2 shows how many documents from each domain were assigned to each of the 10 classes. For example, it can be observed from the figure that most documents in the spoken domain were identified as class 2, while most of those in the imaginative domain fell into either class 1 or class 9. Conversely, many documents in class 4 came from the natural science, applied science, or social science domains, and those in class 5 from world affairs.


  
Figure 3.2: The number of documents from each domain assigned to each of the 10 classes. Area size corresponds to the number of documents.
[figure: circles.eps]

First, a single trigram-based LM was derived from the complete generation set. This LM is referred to as the ``full LM''. Its perplexity on the evaluation set was 186.9, as shown in Table 3.3; this serves as the baseline for the rest of the experiments. Second, following Clarkson et al. [7], the 3309 texts in the generation set were partitioned into 10 domains using the hand-labeled information embedded in each text: 1 domain for all spoken texts and 9 domains (from imaginative to leisure) for written texts. A trigram-based LM was created for each of the 10 domains, and the mixture of these 10 domain LMs achieved a perplexity of 178.8. The initial mixing factors $c_j^{(0)}$ were set proportional to the total number of trigrams in each component LM. In contrast, a mixture LM of the 10 automatically derived classes achieved a perplexity of 171.2, a clear improvement over the hand-labeled domain model. Note that no domain information from the evaluation set was used when computing the perplexity, since the evaluation data were assumed to be completely novel, with no manually tagged information available.


 
Table 3.3: Perplexities for the single LM and the mixture LMs. In the mixture models, the 10 hand-labeled domain LMs and the 10 automatically derived class LMs are compared.

  model                      perplexity
  single   ``full LM''       186.9
  mixture  10-domain LMs     178.8
  mixture  10-class LMs      171.2
 
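For concreteness, a sketch of the mixture evaluation follows. The linear interpolation $P(w_n\,|\,h)=\sum_j c_j P_j(w_n\,|\,h)$ and the count-proportional initialization are from the text above; the component-LM interface (an object with a prob(word, history) method) is a hypothetical stand-in:

import math

def initial_weights(trigram_counts):
    """Initial mixing factors c_j^{(0)}, proportional to each component
    LM's total trigram count, as described in the text."""
    total = sum(trigram_counts)
    return [n / total for n in trigram_counts]

def mixture_perplexity(lms, weights, tokens):
    """Perplexity of a linear mixture P(w|h) = sum_j c_j P_j(w|h),
    conditioning on trigram histories.
    lm.prob(word, history) is a hypothetical component-LM interface."""
    logp = 0.0
    for n in range(2, len(tokens)):
        history = (tokens[n - 2], tokens[n - 1])
        p = sum(c * lm.prob(tokens[n], history)
                for c, lm in zip(weights, lms))
        logp += math.log(p)
    return math.exp(-logp / (len(tokens) - 2))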

An LM adaptation experiment made use of the document space information, as illustrated in Figure 3.3. Each document in the evaluation set was first sectioned into fixed-size windows (1000 words for this experiment), then projected down to the document space using Equation (3.10). When evaluating the trigram probabilities at word $w_n$, the class LM closest to the projection of the most recent window (i.e., the window preceding word $w_n$) was selected and blended with the ``full LM''.


  
Figure 3.3: The LM adaptation procedure using the document class information of the evaluation set.
[figure: decode.eps]
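A rough sketch of this selection-and-blending step, under stated assumptions: the class centroids come from the k-means step above, and the interpolation weight lam is an illustrative fixed value (the text does not specify how the two models were weighted):

import numpy as np

def nearest_class(window_vec, centroids):
    """Index of the class centroid closest in cosine angle to the
    projection of the most recent window."""
    v = window_vec / np.linalg.norm(window_vec)
    C = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(C @ v))

def adapted_prob(word, history, full_lm, class_lms, cls, lam=0.5):
    """Blend the selected class LM with the ``full LM''; lam is an
    assumed fixed interpolation weight, not from the original text."""
    return (lam * class_lms[cls].prob(word, history)
            + (1.0 - lam) * full_lm.prob(word, history))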

Figure 3.4 shows perplexities for such mixtures when the document space was divided into 10 to 1000 clusters. The adapted model achieved about the same perplexity level as a blind mixture of all component LMs, and performed even better when the number of clusters was less than 100. This approach takes advantage of the automatic nature of the LSA modeling: a single ``full LM'' was tuned to the document space at only a slight increase in computational cost.


  
Figure 3.4: Perplexities when the document space is divided into 10 to 1000 clusters. For each window, the class LM closest to the projection of the most recent window was selected and blended with the single ``full LM''.
[figure: perplex.eps]

