The 4124 BNC texts were partitioned at random (and independently of the hand-labeled domain information) as follows:
Because each text in the BNC contains tens to hundreds of thousands of words, the texts were subdivided mechanically into fixed-size windows so that varying styles within a text can be tracked. Using a fixed window size of 1000 words, the 3309 texts in the generation set were divided into units. These units are referred to as ``documents''.
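The mechanical windowing step can be sketched as follows. This is a minimal illustration, not the original preprocessing code; the function name is hypothetical, and how partial trailing windows were handled is not specified in the text, so the sketch simply drops them.

```python
# Sketch of the windowing step: each BNC text is cut into consecutive
# fixed-size 1000-word "documents". Illustrative only.

def split_into_documents(tokens, window_size=1000):
    """Split a token list into consecutive fixed-size windows.

    A trailing fragment shorter than window_size is dropped here;
    the text does not specify how partial windows were treated.
    """
    return [tokens[i:i + window_size]
            for i in range(0, len(tokens) - window_size + 1, window_size)]

text = ["word"] * 2500          # a toy 2500-word text
docs = split_into_documents(text)
print(len(docs))                # 2 full 1000-word documents
```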
In the experiments, documents were randomly chosen and a term-by-document matrix was generated from the unigram relative frequencies. The matrix was very sparse; approximately 1.6% of its elements were nonzero. The SVD was applied (using the publicly available package ``SVDPACKC''), computing the top 200 singular values and their corresponding singular vectors. Using Equation (3.10), documents were projected onto the 200-dimensional document space. The documents were then clustered using the k-means algorithm with the cosine angle metric. The resulting document clusters are referred to as ``classes'' in order to differentiate them from the hand-labeled domains.
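A minimal sketch of this pipeline, under stated assumptions: the matrix sizes are toy values (not the actual vocabulary or document counts), `scipy`'s truncated SVD stands in for SVDPACKC, and clustering by cosine angle is approximated by length-normalizing the projections and running Euclidean k-means on them.

```python
# Toy pipeline: sparse term-by-document matrix of unigram relative
# frequencies -> truncated SVD -> reduced-space document projections
# -> k-means clustering into "classes". Sizes are illustrative.
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.cluster.vq import kmeans2
from scipy.sparse.linalg import svds

# toy matrix: 500 terms x 300 documents, ~1.6% nonzero as in the text
A = sparse_random(500, 300, density=0.016, random_state=0, format="csc")
col_sums = np.asarray(A.sum(axis=0)).ravel()
col_sums[col_sums == 0] = 1.0               # guard empty toy columns
A = A.multiply(1.0 / col_sums).tocsc()      # relative frequencies per document

k = 20                                      # 200 in the actual experiments
U, s, Vt = svds(A, k=k)                     # top-k singular triplets

docs = Vt.T * s                             # k-dimensional document projections
norms = np.linalg.norm(docs, axis=1, keepdims=True)
docs = docs / np.where(norms == 0, 1.0, norms)   # unit length for cosine

# Euclidean k-means on unit vectors approximates cosine clustering
centroids, classes = kmeans2(docs, 10, minit="++", seed=0)
print(docs.shape, len(classes))
```

Length-normalizing before Euclidean k-means works because, for unit vectors, squared Euclidean distance is a monotone function of cosine similarity.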
Although similarity does not necessarily imply superiority in the language modeling task, it is still of interest to compare the automatically generated classes against the hand-labeled domains. Figure 3.2 shows how many documents from each domain were classified into each of the 10 classes. For example, it can be observed from the figure that most documents in the spoken domain were identified as class 2, while most of those in the imaginative domain fell into either class 1 or class 9. On the other hand, many documents in class 4 came from the natural science, applied science, or social science domains, and those in class 5 from world affairs.
First, a single trigram-based LM was derived from the complete generation set. This LM is referred to as the ``full LM''. Its perplexity was 186.9 on the texts in the evaluation set, as shown in Table 3.3; this serves as the baseline for the rest of the experiments. Second, following Clarkson et al., the 3309 texts in the generation set were partitioned into 10 domains using the hand-labeled information embedded in each text: 1 domain for all spoken texts and 9 domains (from imaginative to leisure) for written texts. A trigram-based LM was created for each of the 10 domains. The perplexity of a mixture LM of the 10 domain LMs was 178.8. Initially, the mixing factors cj(0) were set proportional to the total number of trigrams in each component LM. On the other hand, a mixture LM of the 10 automatically derived classes achieved a perplexity of 171.2, a clear improvement over the hand-labeled domain model. Note that domain information from the evaluation set was not used when computing the perplexity, as the evaluation set was assumed to be completely novel data for which no manually tagged information was available.
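The mixture evaluation above can be sketched as follows. This is a hedged illustration, not the actual toolkit code: the component probabilities and trigram counts are toy numbers, and only the linear interpolation P(w|h) = sum_j c_j P_j(w|h) with count-proportional initial weights is shown (the subsequent re-estimation of the mixing factors is omitted).

```python
# Sketch of mixture-LM perplexity: components are combined linearly,
# with mixing factors c_j initialized proportional to each component's
# trigram count. All probabilities below are toy values.
import math

def mixture_perplexity(component_probs, weights):
    """component_probs: one entry per word position; each entry lists
    the per-component probabilities P_j(w_n | h_n) for that word."""
    log_sum = 0.0
    for probs in component_probs:
        p = sum(c * pj for c, pj in zip(weights, probs))
        log_sum += math.log(p)
    return math.exp(-log_sum / len(component_probs))

trigram_counts = [400, 100]                      # toy counts, two components
total = sum(trigram_counts)
weights = [n / total for n in trigram_counts]    # c_j(0) proportional to counts

toy_probs = [[0.10, 0.02], [0.05, 0.20], [0.08, 0.01]]
print(round(mixture_perplexity(toy_probs, weights), 2))
```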
The LM adaptation experiment made use of the document space information, as illustrated in Figure 3.3. Each document in the evaluation set was first sectioned into fixed-size windows (1000 words for this experiment) and then projected down to the document space using Equation (3.10). When evaluating the trigram probabilities at word wn, the class LM closest to the projection of the most recent window (i.e., the window preceding word wn) was selected and blended with the ``full LM''.
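The selection-and-blend step can be sketched as follows. The function names, the toy 2-dimensional vectors, and the interpolation weight are all illustrative assumptions; the text does not specify the blending weight used.

```python
# Sketch of the adaptation step: project the most recent window into
# the document space, pick the class whose centroid is closest in
# cosine angle, and interpolate that class LM with the full LM.
import numpy as np

def closest_class(window_vec, centroids):
    """Index of the centroid with the largest cosine similarity."""
    w = window_vec / np.linalg.norm(window_vec)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ w))

def adapted_prob(p_full, p_class, lam=0.5):
    # linear blend of the selected class LM with the full LM;
    # lam is an illustrative weight, not the value used in the thesis
    return lam * p_class + (1.0 - lam) * p_full

centroids = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # toy class centroids
window = np.array([0.9, 0.1])       # toy projection of the previous window
j = closest_class(window, centroids)
print(j, adapted_prob(0.001, 0.004))
```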
Figure 3.4 shows the perplexities of such mixtures when the document space was divided into 10 to 1000 clusters. This approach achieved about the same perplexity level as a blind mixture of all component LMs, and performed even better when the number of clusters was less than 100. It took advantage of the automatic nature of the LSA modeling: a single ``full LM'' was tuned to the document space with only a slight increase in computational cost.