Simply increasing the vocabulary size is not always an acceptable option, since it requires recomputing the n-gram model and may substantially increase the n-gram size. Additionally, there may simply not be enough data to estimate much more than unigram statistics for many words that would otherwise be OOV. The approach investigated in this work defines semantic classes covering a subset of the possible vocabulary. An n-gram model is constructed over these classes and augmented by a unigram model, conditioned on these classes, for less frequent words (i.e., those which the conventional n-gram alone would classify as OOV). This approach is adopted with a view to reducing the OOV rate, restricting the size of the n-gram model to a manageable level, and effectively pooling the n-gram statistics of semantically related words.
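The combination described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the probabilities, the class name, and the names `word_class`, `bigram`, `class_unigram`, and `prob` are all invented for exposition. Frequent words retain their own n-gram statistics, while a rare word is scored through its semantic class: the class's n-gram probability in context times a class-conditioned unigram.

```python
# Toy sketch of a class-augmented bigram model (hypothetical numbers).
# Rare words are mapped to a semantic class; the class behaves as an
# ordinary token in the n-gram, and P(word | class) redistributes its
# mass among the class members.

word_class = {"ibm": "ORGANISATION", "acme": "ORGANISATION"}  # rare word -> class

# P(token | previous word); semantic classes act as ordinary tokens.
bigram = {"of": {"ORGANISATION": 0.4, "the": 0.3}}

# Class-conditioned unigram P(word | class) for the rare words.
class_unigram = {"ORGANISATION": {"ibm": 0.7, "acme": 0.3}}

def prob(word, history):
    """P(word | history) under the class-augmented model."""
    dist = bigram.get(history, {})
    if word in word_class:                      # rare word: score via its class
        cls = word_class[word]
        return dist.get(cls, 0.0) * class_unigram[cls].get(word, 0.0)
    return dist.get(word, 0.0)                  # frequent word: direct n-gram
```

For example, `prob("ibm", "of")` multiplies P(ORGANISATION | of) by P(ibm | ORGANISATION), so the class's bigram statistics are shared by every word it covers.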
Recent research in the DARPA Message Understanding Conference (MUC) has shown that the recognition and classification of proper names in business newswire text can now be done on a large scale and with high accuracy; the best systems now approach 96% combined precision and recall. The software we have used here for identifying proper names is part of the LaSIE system developed at the University of Sheffield. LaSIE has been designed as a general-purpose information extraction research system whose overall purpose is to extract important facts from business newswire texts. The stand-alone version of the LaSIE NE identifier used here (known as NED) has achieved roughly 92% combined precision and recall against a blind test set of 30 newswire texts. NED recognizes and classifies the classes of naming expressions specified in the MUC-6 NE task definition, including named entities (``ORGANISATION'', ``PERSON'', ``LOCATION''), temporal expressions (``DATE'', ``TIME''), and number expressions (``MONEY'', ``PERCENTAGE''). Although these are not the only classes of proper names, they do account for the bulk of proper name occurrences in business newswire text.