Next: Experiments Up: Document Space Modelling Previous: Mixture LM

Modeling the Document Space

Latent semantic analysis (LSA) is a modern IR technique based on the singular value decomposition (SVD) of a very large, sparse term (word) by document matrix [8,9]. Each column of such a matrix describes a document, with the entries being some measure corresponding to each vocabulary word in that document. The singular vectors corresponding to the k largest singular values are then used to define k-dimensional term and document spaces, where k is typically of the order of 100. Put simply, the approach models the co-occurrence of vocabulary words and documents captured by the very large matrix. The technique is referred to as ``latent semantic'' because the projection onto the lower-dimensional subspace has the effect of clustering together semantically similar words and documents. IR performance data indicate that points in the derived subspace are more reliable indicators of meaning than individual words or terms [8]. Furthermore, assuming that a document is a (linear) combination of words, any document (with tens of thousands of vocabulary words) can be projected down to a vector of a few hundred dimensions, regardless of whether it was included in the original matrix. In this work, this reduced-dimensional space is referred to as the document space. One major advantage of this approach is that a lower-dimensional document subspace is inferred automatically by the SVD.

How the term by document matrix is generated is a focal point of the LSA approach, because it determines the notion of semantics expressed in the space. For example, the unigram frequencies might be used as the column (i.e., document vector) entries of such a matrix. As the total word counts often vary by orders of magnitude between documents, the unigram probabilities can be used instead if one wants to avoid the possible effect of document size.
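As a sketch of this construction, the following builds a term by document matrix whose columns are unigram probabilities. The toy corpus is made up for illustration, and a dense NumPy array stands in for the sparse representation used in practice:

```python
from collections import Counter

import numpy as np

# Hypothetical toy corpus; each document is a list of word tokens.
docs = [
    "the stock market fell sharply today".split(),
    "the market rally lifted stock prices".split(),
    "rain and wind are forecast today".split(),
]

# Vocabulary: every word seen in the corpus.
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

# Term by document matrix A: column j holds the unigram
# probabilities p_j(w) of document j, so document length
# no longer influences the scale of the entries.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w, c in Counter(d).items():
        A[index[w], j] = c / len(d)
```

Each column sums to one by construction, so two documents of very different lengths but similar word usage produce similar columns.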

When characterizing each document by the occurrence of each word, it is useful to take into account how specific a word is to that document within the whole corpus. One such measure, often used in IR, is the ``inverse document factor''. It is computed as $\displaystyle \frac{p_j(w)}{p(w)}$, where pj(w) and p(w) are the unigram probabilities of word w in document j and in the whole corpus, respectively. This measure boosts the unigram probabilities, within a document, of words that are not very common in the whole document set. In IR work, this matrix is weighted by terms designed to improve retrieval performance [9]. This may be an area for further investigation in language modeling work.
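A minimal sketch of this weighting, with a small made-up matrix of per-document unigram probabilities; here p(w) is estimated as the average of the per-document probabilities, which is one possible choice, not the only one:

```python
import numpy as np

# Made-up term by document matrix of unigram probabilities p_j(w),
# one column per document (columns sum to 1).
A = np.array([
    [0.50, 0.00, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.50, 0.50],
])

# Corpus-wide unigram probability p(w), here the average of the
# per-document probabilities (assumes uniform document weights).
p = A.mean(axis=1, keepdims=True)

# Weighted entries p_j(w) / p(w): a word that is rare in the corpus
# overall but frequent in document j gets a value well above 1.
W = A / p
```

The first word occurs in documents 0 and 2 but not 1, so its weighted entries are scaled up where it does occur relative to its corpus-wide rate.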

The principal computational burden of this approach lies in the SVD of the term by document matrix. It is not unreasonable to expect this matrix to have dimensions of at least $20\:000\times20\:000$; however, such matrices are sparse (1-2% of the elements are non-zero), and it is possible to perform such computations on a modern workstation [11]. Formally, an $m\times n$ matrix A of rank r can be decomposed as

$\displaystyle A = U\Sigma V^T$     (3.9)

where ``T'' denotes the transpose. $\Sigma$ is an $r\times r$ diagonal matrix whose diagonal elements are the singular values, i.e., the non-negative square roots of the r non-zero eigenvalues of $AA^T$. U and V are $m\times r$ and $n\times r$ matrices whose columns are often referred to as the term and document singular vectors; they are the orthonormal eigenvectors associated with the r non-zero eigenvalues of $AA^T$ and $A^TA$, respectively [8,9]. The schematic is given in Figure 3.1.

Figure 3.1: Schematic of the singular value decomposition (SVD) of the term by document matrix A.
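The decomposition and its rank-k truncation can be sketched with NumPy's dense SVD; the small random matrix is purely illustrative, since a real term by document matrix is large and sparse and would call for a sparse, partial SVD:

```python
import numpy as np

# Hypothetical small term by document matrix A (m x n).
rng = np.random.default_rng(0)
A = rng.random((8, 5))

# Thin SVD: A = U @ diag(s) @ Vt, with the singular values s
# returned in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values and the associated
# term (U_k) and document (V_k) singular vectors.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# A_k is the closest rank-k matrix to A in the least-squares sense.
A_k = U_k @ np.diag(s_k) @ Vt_k
```

For the matrices of interest here, only the k leading singular triplets are ever needed, which is what makes the computation feasible at scale.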

The singular vectors corresponding to the k ($k\leq r$) largest singular values are then used to define a k-dimensional document space. From these vectors, $m\times k$ and $n\times k$ matrices Uk and Vk may be defined, along with the $k\times k$ singular value matrix $\Sigma_k$. It is known that $A_k = U_k\Sigma_k V_k^T$ is the closest rank-k matrix (in the least-squares sense) to the original matrix A [9]. As a consequence, given an m-dimensional vector q representing a document, its k-dimensional projection $\hat{q}_k$, computed by

$\displaystyle \hat{q}_k = q^T U_k\Sigma_k^{-1}$     (3.10)

lies in the k-dimensional document space that is closest, in the above sense, to the original m-dimensional space. In the experiments, $m = 19\:991$ and k = 200 were used, effectively achieving a reduction in the working space by a factor of the order of 100.
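Equation 3.10 can be sketched as follows, again with a small random matrix standing in for the real term by document matrix:

```python
import numpy as np

# Hypothetical setup: thin SVD of a small term by document matrix.
rng = np.random.default_rng(1)
A = rng.random((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k = U[:, :k]
Sigma_k_inv = np.diag(1.0 / s[:k])

# Fold a new m-dimensional document vector q (e.g. unigram
# probabilities over the vocabulary) into the document space:
# q_hat = q^T U_k Sigma_k^{-1}, as in equation (3.10).
q = rng.random(8)
q_hat = q @ U_k @ Sigma_k_inv   # k-dimensional projection
```

Note that projecting a column of A itself recovers exactly the corresponding column of $V_k^T$, i.e., the coordinates that document already has in the SVD, which is the sense in which the projection is consistent for unseen documents.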

The k-dimensional projection $\hat{q}_k$ represents the principal components that characterize the ``semantic'' content of the document. Thus, corpus documents can be classified according to their projections using, say, the k-means clustering algorithm together with some metric. The experiments here used the cosine angle between two vectors $\mbox{\boldmath$a$}$ and $\mbox{\boldmath$b$}$, defined as $\displaystyle \cos\phi = \frac{\mbox{\boldmath$a$}\cdot\mbox{\boldmath$b$}}{\Vert\mbox{\boldmath$a$}\Vert\,\Vert\mbox{\boldmath$b$}\Vert}$.
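A minimal sketch of this metric, with made-up 2-dimensional projections in place of the real 200-dimensional ones:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical document projections q_hat in a k-dimensional space.
q1 = np.array([0.9, 0.1])
q2 = np.array([0.8, 0.3])   # similar direction to q1
q3 = np.array([-0.2, 0.9])  # different direction

# Documents pointing in similar directions score near 1,
# regardless of vector length.
sim_close = cosine(q1, q2)
sim_far = cosine(q1, q3)
```

Because the cosine ignores vector length, two documents of different sizes that use words in similar proportions are still judged close, which suits the clustering step.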

Christophe Ris