Next: Task 3.1: Future Development Up: Named Entity Tagged LMs Previous: Tagged Language Modeling


The WSJ corpus from the ARPA CSR project contains articles from the Wall Street Journal newspaper from 1987 to 1994. It was partitioned into LM generation (1987 to 1993) and evaluation (1994) subsets. The generation set contained nearly 110 million words, from which the $19\:952$ most frequent words were chosen as the vocabulary ${\cal V}$.
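
As a rough illustration of choosing a vocabulary in frequency order, the following Python sketch ranks word types by count and keeps the top entries. The function name `build_vocabulary`, the tie-breaking rule, and the toy sentence are assumptions made for this example; the tools actually used to build ${\cal V}$ are not described here.

```python
from collections import Counter


def build_vocabulary(tokens, size):
    """Return the `size` most frequent word types, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically so the result
    # is deterministic (the tie-breaking rule is an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]


# Toy data, invented for illustration only.
toy_tokens = "the bank said the bank would buy the shares".split()
vocab = build_vocabulary(toy_tokens, 2)
print(vocab)
```

With the real generation set, `size` would be $19\:952$.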

The documents were then processed by NED; each word was either marked up with one of the tags described earlier or left unmarked. In cases of unresolvable type ambiguity, the version of NED used here adds a plain ``NAME'' tag. Tags on in-vocabulary words were removed before counting the statistics. For simplicity, only the three NE tags (``ORGANISATION'', ``PERSON'', ``LOCATION'') and the ambiguity tag (``NAME'') were used (i.e., four tags in total in ${\cal T}$). Temporal and number expression tags were not used, although they might also provide useful information about the context. An identifier $e_i$ was set for each word $w_i$ according to Definition (3.1), and three sets of LMs were generated:

conventional LM
A standard type trigram derived from a sequence of words. All OOVs were mapped to the UNK symbol.

tagged LM with UNK extension
A standard type trigram and a unigram extension for words mapped to the UNK symbol.

tagged LM with NE extension
An NE trigram derived from a sequence of identifiers. OOVs (i.e., words neither in ${\cal V}$ nor tagged) were mapped to UNK. Unigram extensions were built for words in the tagged classes ${\cal T}$ and for those mapped to UNK.

After discounting and smoothing, the resulting trigram models were of approximately the same size (over 8 million trigrams each). The unigram sizes for each tagged class are shown in Table 3.1.
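
The identifier assignment for the NE extension can be sketched as follows. This is an interpretation of the description above, not a transcription of Definition (3.1): it assumes that an in-vocabulary word is its own identifier (its tag being dropped), that a tagged OOV word takes its tag as identifier, and that everything else maps to UNK. The function name `identifier` and the toy vocabulary are hypothetical.

```python
UNK = "UNK"
TAGS = {"ORGANISATION", "PERSON", "LOCATION", "NAME"}  # the four tags in T


def identifier(word, tag, vocab):
    """Map a (word, tag) pair to an identifier e for the NE trigram.

    `tag` is the NED tag for the word, or None if the word was not tagged.
    """
    if word in vocab:
        return word  # tags on in-vocabulary words are removed
    if tag in TAGS:
        return tag   # tagged OOV word: the class stands in for the word
    return UNK       # neither in the vocabulary nor tagged


# Toy vocabulary and words, invented for illustration only.
toy_vocab = {"the", "bank", "said"}
print(identifier("bank", "ORGANISATION", toy_vocab))
print(identifier("sheffield", "LOCATION", toy_vocab))
print(identifier("zyzzyva", None, toy_vocab))
```

For the UNK extension, the same sketch would apply with the second branch removed, so that every OOV word maps to UNK.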

Equations (3.2) and (3.3) may be used to estimate the language model probabilities when decoding. Alternatively, (3.2) may be approximated by a maximization:

$\displaystyle f(w_i \vert e_1^{i-1}) \sim \max_{w\in e,\; e\in ({\cal V}\cup{\cal T})} f(w \vert e)\, f(e \vert e_1^{i-1})\; . \qquad (3.4)$

This allows a decoder to recover a sequence of words and identifiers:

$\displaystyle (w_i, e_i) \sim \mathop{\rm argmax}_{w\in e,\; e\in ({\cal V}\cup{\cal T})} f(w \vert e)\, f(e \vert e_1^{i-1})\; . \qquad (3.5)$

Note that this approach essentially performs an NE tagging operation based on unigram statistics. It is not possible to use NED when decoding since, aside from the additional computational burden that would be imposed, NED requires punctuated, marked up text, not the raw word sequence hypotheses generated by a speech recognizer.
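
To make the maximization in (3.5) concrete, the sketch below scores each identifier that contains a given word and keeps the best one. It is a simplified, single-word view under stated assumptions: the function name `tag_word` is hypothetical, the probabilities are invented toy values, and $f(e \vert e_1^{i-1})$ is assumed to have been evaluated already for the current history. It does not reproduce the search performed by an actual decoder.

```python
def tag_word(memberships, f_w_given_e, f_e_given_hist):
    """Return (identifier, score) maximising f(w|e) * f(e|history).

    memberships:    identifiers e that contain the word w
    f_w_given_e:    dict e -> f(w | e), the unigram extension
    f_e_given_hist: dict e -> f(e | e_1^{i-1}) for the current history
    """
    scored = [(e, f_w_given_e[e] * f_e_given_hist[e]) for e in memberships]
    return max(scored, key=lambda pair: pair[1])


# Toy numbers for an OOV word that belongs to two tagged classes.
ident, score = tag_word(
    memberships=["PERSON", "LOCATION"],
    f_w_given_e={"PERSON": 0.002, "LOCATION": 0.010},
    f_e_given_hist={"PERSON": 0.05, "LOCATION": 0.02},
)
print(ident, score)
```

In this toy case the word is tagged ``LOCATION'', because the larger unigram probability outweighs the smaller class trigram probability.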

Speech recognition experiments were carried out on the DARPA North American Business News task, using the context-independent ABBOT system. The OOV rate of the test set was evaluated by comparing the transcription with the vocabularies of the three LM sets described earlier. There were a total of 6059 words in the transcription; 5809 (95.9%) were included in the trigram vocabulary ${\cal V}$. About 70% of the 250 OOV words were included in the tagged unigram word set, reducing the effective OOV rate from 4.1% to 1.3%. The distribution across tags is shown in Table 3.1.

Table 3.1: This table shows the unigram size and the number of hits for each tagged LM. 250 words (4.1%) in the transcription were not found in the trigram vocabulary ${\cal V}$, out of which 174 and 173 words were included (i.e., ``hit'') in the UNK and the NE extensions, respectively. Because some words were members of more than one tagged subset, the accumulated number was greater than 173 in the latter case.

tagged LMs            unigram size    #``hits''
UNK extension
  ``UNK''             $126\:002$      174
NE extension
  ``ORGANISATION''    $21\:038$       20
  ``PERSON''          $41\:176$       45
  ``LOCATION''        $3\:457$        6
  ``NAME''            $31\:823$       38
  ``UNK''             $49\:256$       121
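
The OOV figures quoted for this test set can be re-derived from the raw counts. The short Python check below uses only numbers reported in the text and the table; the variable names are chosen for this example.

```python
total_words = 6059   # words in the transcription
in_vocab = 5809      # words found in the trigram vocabulary V
unk_hits = 174       # OOV words found in the UNK extension
ne_hits_by_tag = {
    "ORGANISATION": 20, "PERSON": 45, "LOCATION": 6, "NAME": 38, "UNK": 121,
}

oov = total_words - in_vocab                                 # 250 OOV words
coverage = 100 * in_vocab / total_words                      # share covered by V
oov_rate = 100 * oov / total_words                           # OOV rate
hit_share = 100 * unk_hits / oov                             # OOVs recovered
effective_oov_rate = 100 * (oov - unk_hits) / total_words    # remaining OOV rate
ne_hits_total = sum(ne_hits_by_tag.values())                 # per-tag hits summed

print(f"coverage {coverage:.1f}%, OOV rate {oov_rate:.1f}%")
print(f"recovered {hit_share:.1f}% of OOVs, effective OOV rate {effective_oov_rate:.1f}%")
print(f"summed NE-extension hits: {ne_hits_total}")
```

The summed per-tag hits exceed the 173 distinct words hit by the NE extension, which is consistent with some words belonging to more than one tagged subset.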

Decoding was performed by the single-pass NOWAY decoder [6]$^{3.2}$. Note that, when using the tagged LMs, decoding was performed with a vocabulary of $123\:848$ words ($130\:964$ pronunciations)$^{3.3}$, represented as a pronunciation tree containing around $300\:000$ nodes (phone models). In addition to the three LM sets, a variation of the tagged LM with UNK extension was also tested. This variation used a flat estimate of $P(w \vert e)$ (set to $10^{-5}$, since there were about $10^5$ words in ${\cal T}$).

Table 3.2 shows the word error rate (WER) for each LM set on this task. The tagged LMs, with an effective vocabulary of 126 thousand words, reduced the WER by about 14% relative to the conventional LM (from 20.5% to 17.7%). Although there was no significant difference in WER between tagging words with NEs in ${\cal T}$ and tagging all of them with UNK, performance improved when an estimated unigram model was used for $P(w \vert e)$ instead of a flat estimate. This table also gives an indication of how many of the words that were OOV relative to ${\cal V}$ (but in ${\cal T}$) were detected. 2.8% of the words in the test set fall into this category, and these results indicate that the tagged LM approach detected about 70% of them.

Table 3.2: This table shows the word error rate (WER) for the conventional and the tagged LMs. The ``OOV hit'' rate is the percentage of recognised words that were in ${\cal T}$ but OOV relative to ${\cal V}$ (2.8% of the words in this set fall into this category).

                      WER (%)    ``OOV hits'' (%)
conventional LM       20.5       0.0
tagged LMs
  (flat estimate)     18.2       1.6
  UNK extension       17.7       1.8
  NE extension        17.7       2.0
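
The relative WER reductions and the share of detectable OOV words actually recovered follow directly from the table entries. The sketch below recomputes them; the dictionary keys are labels chosen for this example, and the 2.8% figure is the proportion of test words in ${\cal T}$ but not in ${\cal V}$ quoted in the caption.

```python
wer = {"conventional": 20.5, "flat": 18.2, "unk_ext": 17.7, "ne_ext": 17.7}
oov_hits = {"flat": 1.6, "unk_ext": 1.8, "ne_ext": 2.0}
detectable = 2.8  # % of test words in T but OOV relative to V

baseline = wer["conventional"]
# Relative WER reduction with respect to the conventional LM, in percent.
relative_reduction = {
    name: 100 * (baseline - value) / baseline
    for name, value in wer.items() if name != "conventional"
}
# Share of the detectable OOV words that each tagged LM recovered, in percent.
detected_share = {name: 100 * hits / detectable for name, hits in oov_hits.items()}

for name in oov_hits:
    print(f"{name}: WER -{relative_reduction[name]:.1f}% relative, "
          f"{detected_share[name]:.0f}% of detectable OOVs hit")
```

Under this reading, the figure of about 70% corresponds to the NE extension (2.0 out of 2.8), with the UNK extension slightly lower.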

Christophe Ris