The generic NgramTagger module resolves part-of-speech ambiguities with a probabilistic language model based on n-grams. On the GrammarUnit layer, a Viterbi decoder is used to find the best path through the possible 'Name' fields of each GrammarUnit. The standard ARPA file format is used (see the CMU SLMT ToolKit by Doug Paul). NgramTagger will be made compatible with the n-gram file format of Festival (A. Black, P. Taylor, R. Cayley, Univ. Edinburgh) in the near future.
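The decoding step described above can be illustrated with a minimal sketch. It is not the NgramTagger source: it assumes a bigram model only, represents each GrammarUnit as a plain list of candidate tags (standing in for its possible 'Name' fields), and uses hypothetical names (viterbi, bigram_logp, TOY_MODEL) and made-up toy probabilities.

```python
def viterbi(candidates, bigram_logp, start="<s>"):
    """Return the most probable tag sequence under a bigram model.

    candidates:  list of lists; candidates[i] holds the possible tags of
                 unit i (each unit is assumed to have at least one tag).
    bigram_logp: function (previous_tag, tag) -> log10 probability.
    """
    if not candidates:
        return []
    # best[tag] = (score of the best path ending in tag, that path)
    best = {start: (0.0, [])}
    for tags in candidates:
        new_best = {}
        for tag in tags:
            # Extend every surviving path with this tag and keep the best one.
            score, path = max(
                ((s + bigram_logp(prev, tag), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy log10 probabilities, invented for illustration only.
TOY_MODEL = {
    ("<s>", "DET"): -0.1,
    ("DET", "NOUN"): -0.2,
    ("DET", "VERB"): -2.0,
    ("NOUN", "VERB"): -0.3,
    ("NOUN", "NOUN"): -1.5,
    ("VERB", "NOUN"): -1.0,
    ("VERB", "VERB"): -2.5,
}


def toy_logp(prev, tag):
    # Unseen tag pairs get a low, arbitrary floor value.
    return TOY_MODEL.get((prev, tag), -5.0)


if __name__ == "__main__":
    units = [["DET"], ["NOUN", "VERB"], ["NOUN", "VERB"]]
    print(viterbi(units, toy_logp))
```

The sketch keeps whole paths in the table instead of back-pointers, which is simpler to read but uses more memory on long sentences.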
This module needs one database file containing the n-gram probabilities. The structure of this file is as follows:
Beginning of data mark: \data\
ngram 1=nr # number of 1-grams
ngram 2=nr # number of 2-grams
ngram 3=nr # number of 3-grams
ngram N=nr # number of N-grams
\1-grams:
proba1 word1 backoff1
\2-grams:
proba2 word1 word2 backoff2
\3-grams:
proba3 word1 word2 word3 backoff3
\N-grams:
probaN word1 word2 word3 ... wordN
End of data mark: \end\
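As an illustration of this layout, the following sketch reads an ARPA-style file into Python dictionaries. It is not the loader used by NgramTagger; the function name parse_arpa, the returned data structure and the sample values are assumptions made for the example. Probabilities and backoff weights in ARPA files are base-10 logarithms.

```python
import re


def parse_arpa(lines):
    """Parse ARPA-format lines into (counts, ngrams).

    counts: {order: declared number of n-grams}
    ngrams: {order: {tuple_of_words: (log10_prob, backoff_or_None)}}
    """
    counts, ngrams = {}, {}
    order = None       # current \N-grams: section; None while in the header
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "\\data\\":
            in_data = True
            continue
        if line == "\\end\\":
            break
        if not in_data:
            continue   # ignore anything before the \data\ mark
        header = re.match(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if header and order is None:
            counts[int(header.group(1))] = int(header.group(2))
            continue
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if section:
            order = int(section.group(1))
            ngrams[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        prob = float(fields[0])
        words = tuple(fields[1:1 + order])
        rest = fields[1 + order:]
        # The highest-order section normally carries no backoff weight.
        backoff = float(rest[0]) if rest else None
        ngrams[order][words] = (prob, backoff)
    return counts, ngrams


# Small made-up sample, for illustration only.
SAMPLE = [
    "\\data\\",
    "ngram 1=2",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-0.3 DET -0.5",
    "-0.4 NOUN -0.6",
    "",
    "\\2-grams:",
    "-0.2 DET NOUN",
    "",
    "\\end\\",
]

if __name__ == "__main__":
    counts, ngrams = parse_arpa(SAMPLE)
    print(counts)
    print(ngrams)
```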
The name of this file has to be defined in the EULER initialization file, using the -arpa and -dico flags:
grammar = NgramTagger.dll -arpa -dico grammar_arpa_file_name
The optional flag -noBackOff suppresses the contribution of backoff probabilities in the computation of path probabilities.
No other specific flag is needed.
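The role of backoff weights can be sketched as follows. This is one plausible reading of the standard Katz-style lookup used with ARPA models, not the NgramTagger implementation: when an n-gram is missing, the backoff weight of its history is added to the log probability of the shorter n-gram. The use_backoff=False branch assumes that -noBackOff simply skips that weight, which is an interpretation of the flag description. The names log_prob and UNSEEN_LOGP and the toy values are hypothetical.

```python
UNSEEN_LOGP = -99.0  # arbitrary floor for words missing from the model


def log_prob(ngrams, history, word, use_backoff=True):
    """Return log10 P(word | history) with Katz-style backoff.

    ngrams:  {order: {tuple_of_words: (log10_prob, backoff_or_None)}}
    history: tuple of preceding words (may be empty).
    """
    key = history + (word,)
    entry = ngrams.get(len(key), {}).get(key)
    if entry is not None:
        return entry[0]            # the n-gram itself is in the model
    if not history:
        return UNSEEN_LOGP         # unknown unigram
    penalty = 0.0
    if use_backoff:
        hist_entry = ngrams.get(len(history), {}).get(history)
        if hist_entry is not None and hist_entry[1] is not None:
            penalty = hist_entry[1]
    # Retry with a history shortened by its oldest word.
    return penalty + log_prob(ngrams, history[1:], word, use_backoff)


# Toy model, invented for illustration only.
TOY_NGRAMS = {
    1: {("DET",): (-0.3, -0.5), ("NOUN",): (-0.4, -0.6), ("VERB",): (-0.7, -0.2)},
    2: {("DET", "NOUN"): (-0.2, None)},
}

if __name__ == "__main__":
    print(log_prob(TOY_NGRAMS, ("DET",), "NOUN"))
    print(log_prob(TOY_NGRAMS, ("DET",), "VERB"))
    print(log_prob(TOY_NGRAMS, ("DET",), "VERB", use_backoff=False))
```

In this sketch, the pair (DET, VERB) is absent from the 2-gram section, so its score is the backoff weight of DET plus the unigram score of VERB; with use_backoff=False only the unigram score remains.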
N-gram tagging has been used successfully for both European and non-European languages.
To adapt this module to another language, one simply needs to run the n-gram computation tool available in the CMU SLMT ToolKit (or, in the near future, the n-gram tool in Festival) on tagged corpora. This will automatically produce files compatible with NgramTagger.
This module is distributed under the EULER license.