A method to generate the term-by-document matrix is one focal point of the
LSA approach because it affects the notion of semantics expressed in the space.
For example, the unigram frequencies (raw word counts) might be used for the column
(*i.e.*, document vector) entries of such a matrix.
As the total word counts often vary by orders of magnitude between documents,
the unigram probabilities can be used instead if one wants to avoid the
possible effect of the document sizes.
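As a concrete sketch of this construction, assuming a toy corpus and a plain whitespace tokenizer (both illustrative, not from the original experiments), each column can be filled with the unigram probabilities *p*_{j}(*w*) so that document length has no effect:

```python
from collections import Counter

import numpy as np

# Toy corpus; each document is one string (illustrative data only).
docs = ["the cat sat on the mat", "the dog ate the bone", "cats and dogs"]

# Vocabulary over the whole corpus.
vocab = sorted({w for d in docs for w in d.split()})
w2i = {w: i for i, w in enumerate(vocab)}

# Term-by-document matrix: column j holds the unigram probabilities
# p_j(w), i.e. counts normalized by document length, so that document
# size has no effect on the column entries.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    counts = Counter(d.split())
    n_j = sum(counts.values())
    for w, c in counts.items():
        A[w2i[w], j] = c / n_j
```

Normalizing each column by the document's total word count is what makes two documents of very different lengths comparable.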

When characterizing each document by the occurrence of each word, it would be
useful if the uniqueness of the word in the whole corpus could also be considered.
One such measure, often used in the IR area, is the ``inverse document factor''.
It calculates

*p*_{j}(*w*) / *p*(*w*),

where *p*_{j}(*w*) and *p*(*w*) are the unigram probabilities of word *w* in document *j* and in the whole corpus,
respectively.
This measure enhances the unigram probabilities of words that are not
very common in the whole document set.
In IR work, this matrix is further weighted by terms designed to improve the retrieval
performance [9].
This may be an area for further investigation for language modeling work.
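Assuming the weighting takes the simple ratio form *p*_{j}(*w*)/*p*(*w*), a probability analogue of the classic IDF weight (the exact form is an assumption here), a minimal sketch:

```python
from collections import Counter

import numpy as np

docs = ["the cat sat on the mat", "the dog ate the bone", "cats and dogs"]
vocab = sorted({w for d in docs for w in d.split()})
w2i = {w: i for i, w in enumerate(vocab)}

# Corpus-wide unigram probabilities p(w).
corpus = Counter(w for d in docs for w in d.split())
n_total = sum(corpus.values())
p_w = np.array([corpus[w] / n_total for w in vocab])

# Weighted entries p_j(w) / p(w): words that are rare corpus-wide
# have their per-document probabilities boosted.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    counts = Counter(d.split())
    n_j = sum(counts.values())
    for w, c in counts.items():
        A[w2i[w], j] = (c / n_j) / p_w[w2i[w]]
```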

The principal computational burden of this approach lies in the SVD of the
term-by-document matrix.
This matrix can be expected to grow very large;
however, such matrices are sparse (1-2% of the
elements are non-zero), and it is possible to perform such computations on
a modern workstation [11].
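For instance, SciPy's sparse SVD routine (`scipy.sparse.linalg.svds`, assuming SciPy is available; the matrix below is random stand-in data) computes only the leading singular triplets without ever forming a dense decomposition:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# A sparse random matrix standing in for the real term-by-document matrix.
rng = np.random.default_rng(0)
A = sp.random(500, 200, density=0.02, random_state=rng, format="csr")

k = 10
# svds computes only the k largest singular triplets; the dense SVD of
# the full matrix is never formed.
U_k, s, Vt_k = svds(A, k=k)

# Sort the triplets by decreasing singular value (svds does not
# guarantee a particular ordering).
order = np.argsort(s)[::-1]
U_k, s, Vt_k = U_k[:, order], s[order], Vt_k[order]
```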
First, an *m* × *n*
matrix *A* (whose rank is *r*) can be decomposed as

*A* = *U* *S* *V*^{T},

where *U* is an *m* × *r* matrix with orthonormal columns, *S* = diag(σ_{1}, ..., σ_{r}) is the diagonal matrix of singular values (σ_{1} ≥ σ_{2} ≥ ... ≥ σ_{r} > 0), and *V* is an *n* × *r* matrix with orthonormal columns.

The singular vectors corresponding to the *k* (*k* < *r*)
largest singular
values are then used to define the *k*-dimensional document space.
Using these vectors, the *m* × *k* and *n* × *k*
matrices *U*_{k} and *V*_{k} may
be redefined along with the *k* × *k*
singular value matrix *S*_{k}.
It is then known that

*A*_{k} = *U*_{k} *S*_{k} *V*_{k}^{T}

is the closest matrix (in
a least square sense) of rank *k* to the original matrix
*A* [9].
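This best rank-*k* property (the Eckart-Young theorem) can be checked numerically; the matrix here is random stand-in data:

```python
import numpy as np

# Random stand-in for the term-by-document matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))

# Full SVD (numpy returns singular values in descending order), then
# keep only the k largest triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Eckart-Young: the Frobenius-norm (least squares) error of the best
# rank-k approximation equals the norm of the discarded spectrum.
err = np.linalg.norm(A - A_k)
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```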
As a consequence, given an *m*-dimensional vector *q* for a document, it is
warranted that the *k*-dimensional projection *q̂*
computed by

*q̂* = *S*_{k}^{-1} *U*_{k}^{T} *q*

lies in the *k*-dimensional space closest (in the same least square sense) to the original document space.

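This ``folding-in'' projection can be sketched as follows (random stand-in data; note that for a column of *A* itself the projection recovers the corresponding column of *V*_{k}^{T} exactly):

```python
import numpy as np

# Random stand-in for an m x n term-by-document matrix.
rng = np.random.default_rng(2)
m, n, k = 40, 15, 5
A = rng.random((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]
S_k_inv = np.diag(1.0 / s[:k])

# Fold an m-dimensional document vector q into the k-dimensional space:
# q_hat = S_k^{-1} U_k^T q
q = rng.random(m)
q_hat = S_k_inv @ U_k.T @ q

# Sanity check: folding in a column of A recovers the matching column
# of V_k^T, since A = U S V^T and U has orthonormal columns.
assert np.allclose(S_k_inv @ U_k.T @ A[:, 0], Vt[:k, 0])
```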
The *k*-dimensional projection *q̂*
represents principal components
that characterize ``semantic'' information of the document.
Thus, corpus documents can be classified according to their projections using,
say, the *k-means* clustering algorithm together with some metric.
Experiments here used the cosine angle metric defined as

cos(*q̂*_{1}, *q̂*_{2}) = (*q̂*_{1} · *q̂*_{2}) / (‖*q̂*_{1}‖ ‖*q̂*_{2}‖)

between two vectors *q̂*_{1}
and *q̂*_{2}.
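The metric itself is straightforward to compute (toy vectors for illustration):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two projection vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy k-dimensional projections (illustrative values only).
q1 = np.array([1.0, 0.0, 1.0])
q2 = np.array([1.0, 1.0, 0.0])

sim = cosine(q1, q2)  # 0.5 for these two vectors
```

Since the cosine depends only on vector directions, a common trick is to L2-normalize all projections first, after which squared Euclidean distance is monotonically related to the cosine and ordinary *k-means* effectively clusters by angle.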