A generic rule-based lemmatization module that retrieves the possible part-of-speech tags of isolated words.


This module needs 5 lexicon files:

Each lexicon is a simple text file, in which each line contains, respectively:

The five lexicon file names must be specified with the -dico flag, in the following order: inflection, part-of-speech, derivation, lemma and locution file names.


lemmat = RuleLemmatizer.dll -dico InflexionLex POSLex DerivLex LemmeLex LocuLex


At run-time, only Word items whose 'POS' field is equal to 'WORD' are processed.

If a sequence of Word items matches a LOCUTION key, each WORD_POS in the corresponding list replaces the 'POS' field of the corresponding Word item. GRAMMAR_UNIT then replaces the 'Name' feature of the GrammarUnit linked with the first Word, and all GrammarUnit items linked with the other Word items are erased.

Otherwise, the set of accepted part-of-speech values for a Word item is the set of pairs (POS_VALUE, INFLEXION_VALUE) that satisfy the following conditions:

No flag is used for specifying this run-time behaviour.
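
The locution-matching step described above can be sketched as follows. This is only an illustration, not the module's implementation: the Word and GrammarUnit classes, the dictionary layout of the locution lexicon, the longest-match-first strategy and the tag names are all assumptions made for the example, and "erasing" a GrammarUnit is modelled here by unlinking it from its Word.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class GrammarUnit:
    # Hypothetical stand-in for the GrammarUnit item; only 'Name' is modelled.
    name: str


@dataclass
class Word:
    # Hypothetical stand-in for the Word item.
    text: str
    pos: str = "WORD"
    unit: Optional[GrammarUnit] = None


# Assumed layout: LOCUTION key (tuple of word forms) -> (GRAMMAR_UNIT, [WORD_POS, ...])
Locutions = Dict[Tuple[str, ...], Tuple[str, List[str]]]


def apply_locutions(words: List[Word], locutions: Locutions) -> List[Word]:
    """Rewrite POS fields and grammar units for word sequences matching a locution."""
    max_len = max((len(key) for key in locutions), default=0)
    i = 0
    while i < len(words):
        matched = False
        # Assumption: try the longest candidate sequence first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            span = words[i:i + n]
            # Only Word items whose POS is still 'WORD' are considered.
            if any(w.pos != "WORD" for w in span):
                continue
            entry = locutions.get(tuple(w.text for w in span))
            if entry is None:
                continue
            unit_name, pos_list = entry
            for w, pos in zip(span, pos_list):
                w.pos = pos                      # WORD_POS replaces 'POS'
            if span[0].unit is not None:
                span[0].unit.name = unit_name    # GRAMMAR_UNIT replaces 'Name'
            for w in span[1:]:
                w.unit = None                    # other grammar units are erased
            i += n
            matched = True
            break
        if not matched:
            i += 1
    return words


if __name__ == "__main__":
    sentence = [Word(t, unit=GrammarUnit(t)) for t in ["in", "spite", "of", "rain"]]
    lexicon = {("in", "spite", "of"): ("in_spite_of", ["PREP", "PREP", "PREP"])}
    for w in apply_locutions(sentence, lexicon):
        print(w.text, w.pos, w.unit.name if w.unit else None)
```

Words left with a 'POS' of 'WORD' after this step would then go through the rule-based analysis.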



Source code is not available.

Adaptability to other languages

This lemmatizer can be used to perform (inflectional) morphological analysis of any language in which inflection modifies the last characters of a lemma, which covers at least most European languages. One simply needs to create ad-hoc lexicons, which may be rather time-consuming, depending on the language and on the information one has at hand when starting this work.
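
To illustrate what "inflection modifies the last characters of a lemma" means in practice, here is a minimal sketch of suffix-based analysis. It does not reproduce the module's lexicon formats or its acceptance conditions: the rule tuples, the lemma dictionary and the tag names below are hypothetical and chosen only for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rule: (ending of the inflected form, ending of the lemma, POS, inflection)
SuffixRule = Tuple[str, str, str, str]


def candidate_analyses(
    form: str,
    rules: List[SuffixRule],
    lemmas: Dict[str, Set[str]],
) -> Set[Tuple[str, str, str]]:
    """Return the (lemma, POS, inflection) triples compatible with `form`."""
    analyses = set()
    for ending, lemma_ending, pos, inflection in rules:
        if not form.endswith(ending):
            continue
        # Replace the inflected ending with the ending of the lemma.
        lemma = form[:len(form) - len(ending)] + lemma_ending
        # Keep the analysis only if the lemma is known with that POS.
        if pos in lemmas.get(lemma, set()):
            analyses.add((lemma, pos, inflection))
    return analyses


if __name__ == "__main__":
    rules = [
        ("s", "", "NOUN", "plural"),
        ("ies", "y", "NOUN", "plural"),
        ("ed", "", "VERB", "past"),
    ]
    lemmas = {"cat": {"NOUN"}, "city": {"NOUN"}, "walk": {"VERB", "NOUN"}}
    print(candidate_analyses("cities", rules, lemmas))
    print(candidate_analyses("walked", rules, lemmas))
```

Adapting such a scheme to another language amounts to writing the rule list and the lemma dictionary for that language, which is where most of the effort goes.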

Note that a complex lemmatizer is not needed in all cases. When a complete lexicon (including all inflected forms) is available for a language, this module can be bypassed altogether.


This module is distributed under the EULER license.

Copyright © 1999 TCTS LAB, Faculté Polytechnique de Mons, Belgium