Tagged Language Modeling

A tagged LM is an extension of the conventional n-gram model. First, define the vocabulary ${\cal V}= \{ w^{[1]}, \cdots, w^{[M]} \}$ of size M, and let $<\!\! w_1, \cdots, w_i, \cdots \!\!>$ denote a sequence of words. Suppose there exist L different tagged classes, ${\cal T}= \{ t^{[1]}, \cdots, t^{[L]} \}$. Each word wi in the document is assumed to be either classified into one of the tagged classes, $t_i\in{\cal T}$, or not classified into any class. As a convention, a unique identifier ei for wi is defined as

 
$\displaystyle e_i = \left\{
\begin{array}{ll}
t_i & \mbox{if $w_i$ belongs to class $t_i\in{\cal T}$,} \\
w_i & \mbox{if $w_i$ does not belong to any class in ${\cal T}$.}
\end{array} \right.$     (3.1)

A tagged LM computes the score of each word wi given the history of identifiers $e_1^{i-1} = <\!\! e_1, \cdots, e_{i-1} \!\!>$ by
 
$\displaystyle f(w_i \vert e_1^{i-1}) = \sum_{e_i\in (w_i\cup{\cal T})} f(w_i, e_i \vert e_1^{i-1})
\sim \sum_{e_i\in (w_i\cup{\cal T})} f(w_i \vert e_i) f(e_i \vert e_1^{i-1})$     (3.2)

In Equation (3.2), f(ei | e1i-1) is a standard n-gram model over the extended vocabulary ${\cal V}\cup{\cal T}$, where $\cup$ denotes set union, and
 
$\displaystyle f(w_i \vert e_i) = \left\{
\begin{array}{ll}
f(w_i \vert t_i) & e_i = t_i\in{\cal T}, \\
1 & e_i = w_i\in{\cal V},
\end{array} \right.$     (3.3)

where f(wi | ti) is the unigram probability of word wi within tagged class $t_i\in{\cal T}$.
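The decomposition in Equations (3.2) and (3.3) can be sketched as follows. Here `ngram` stands in for f(ei | e1i-1) and `class_unigram[t][w]` for f(wi | ti); both are toy placeholders with hand-set probabilities, not trained models:

```python
def tagged_lm_score(word, history, ngram, class_unigram):
    """Equation (3.2): sum over candidate identifiers e_i for w_i."""
    # Term for e_i = w_i: f(w_i | e_i) = 1 (second case of Equation 3.3).
    score = ngram(word, history)
    # Terms for e_i = t_i in T: f(w_i | t_i) * f(t_i | e_1^{i-1}).
    for tag, unigram in class_unigram.items():
        score += unigram.get(word, 0.0) * ngram(tag, history)
    return score

# Toy example: one tagged class with hand-set probabilities.
class_unigram = {"<CITY>": {"london": 0.5, "paris": 0.5}}
def ngram(e, history):  # fake f(e | e_1^{i-1}), independent of history
    return {"london": 0.01, "<CITY>": 0.2}.get(e, 0.0)

# "london" is scored both as itself and through the <CITY> class:
assert abs(tagged_lm_score("london", (), ngram, class_unigram)
           - (0.01 + 0.5 * 0.2)) < 1e-9
```

Note that a word inside the vocabulary still picks up probability mass through any tagged class that contains it, which is exactly what the sum over $e_i\in (w_i\cup{\cal T})$ expresses.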

Equations (3.2) and (3.3) provide an approach to tagged LM processing. The LM generation procedure is as follows:

1.
Initialize identifiers as ei = wi for all words in the document.

2.
Select the vocabulary ${\cal V}$ from the most frequent words.

3.
Mark the document with the set of tags ${\cal T}$. For each word $w_i\mathop{\in\!\!\!\!\!/}{\cal V}$, replace its identifier with ei = ti if a class $t_i\in{\cal T}$ is found for wi. Otherwise, wi is out-of-vocabulary (OOV) and is mapped to the unknown-word symbol (UNK).

4.
Count the n-gram statistics, f(ei | e1i-1), based on a vocabulary set, ${\cal V}\cup{\cal T}$.

5.
Count the unigram statistics f(w) for words $w\mathop{\in\!\!\!\!\!/}{\cal V}$ that appear in a tagged class of ${\cal T}$.
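The five steps above can be sketched as follows, under the assumption that the corpus is pre-tokenized and that a hypothetical `tag_of(w)` lookup returns the tagged class of a word (or `None`). Raw counts stand in for the n-gram and class-unigram statistics; smoothing and a real named-entity tagger are omitted:

```python
from collections import Counter

def build_counts(corpus, vocab_size, tag_of, order=2):
    """Collect the statistics of steps 1-5 from a list of token lists."""
    words = [w for sent in corpus for w in sent]
    # Step 2: vocabulary V = most frequent words.
    vocab = {w for w, _ in Counter(words).most_common(vocab_size)}
    ngram_counts, class_counts = Counter(), Counter()
    for sent in corpus:
        # Steps 1 and 3: identifiers e_i; words outside V get their tag or UNK.
        ids = []
        for w in sent:
            if w in vocab:
                ids.append(w)
            else:
                tag = tag_of(w)
                ids.append(tag if tag is not None else "<UNK>")
                if tag is not None:
                    class_counts[(tag, w)] += 1  # Step 5: counts for f(w | t)
        # Step 4: n-gram counts over the extended vocabulary V union T.
        for i in range(len(ids) - order + 1):
            ngram_counts[tuple(ids[i:i + order])] += 1
    return vocab, ngram_counts, class_counts
```

Normalizing `ngram_counts` and `class_counts` (with smoothing) would yield the estimates f(ei | e1i-1) and f(wi | ti) used in Equations (3.2) and (3.3).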

Christophe Ris
1998-11-10