RulePreProcessorFr parses a text, extracts its first sentence, and fills the MLC (more precisely, its Token, Word, and GrammarUnit layers) with the constituents of that sentence. Abbreviations, numbers, and proper names are detected and expanded if necessary.
The only feature of RulePreProcessorFr that we found easy to move into an external database (instead of hard-coding it in the source code of the module) is its ability to expand abbreviations automatically. The abbreviation dictionary is a simple text file. Each line consists of a KEY enclosed in double quotes, a SEPARATOR (spaces and/or tabs), and a VALUE enclosed in double quotes.
If VALUE contains more than one word, the words must be separated by spaces. Note that when a single quote acts as a word boundary, it must stay attached to the word it belongs to (typically the word on its left) and must be followed by a space.
KEYs may not be duplicated.
The "#" character is a comment marker: the rest of the line is ignored.
# ABREVIATION FILE
# abréviations de c'est-à-dire
"c.-a-d." "c' est à dire"
"c.-à-d." "c' est à dire"
The abbreviation file must be declared in the Euler initialisation file, with the -abrev flag.
Unlike abbreviation expansion, number expansion is hard-coded (but can be adapted quite easily). French, Belgian or Swiss pronunciation rules can be selected with the -French (default), -Belgium or -Swiss flag.
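The file format described above can be read with a few lines of C++. The following is only a minimal sketch, not the module's actual loader: the names AbbrevTable, readQuoted and loadAbbreviations are hypothetical, and the sketch assumes that quoted fields contain no embedded double quotes and that any line not starting with a quoted KEY (blank lines, comment lines) can be skipped.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container: KEY -> VALUE split into its space-separated words.
using AbbrevTable = std::map<std::string, std::vector<std::string>>;

// Reads one double-quoted field, skipping leading spaces and tabs.
// Returns false when the next non-blank character is not a double quote.
static bool readQuoted(const std::string& line, std::size_t& pos, std::string& out) {
    while (pos < line.size() && (line[pos] == ' ' || line[pos] == '\t')) ++pos;
    if (pos >= line.size() || line[pos] != '"') return false;
    const std::size_t close = line.find('"', pos + 1);
    if (close == std::string::npos)
        throw std::runtime_error("unterminated quote: " + line);
    out = line.substr(pos + 1, close - pos - 1);
    pos = close + 1;
    return true;
}

AbbrevTable loadAbbreviations(std::istream& in) {
    AbbrevTable table;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t pos = 0;
        std::string key, value;
        if (!readQuoted(line, pos, key)) continue;  // blank or comment-only line
        if (!readQuoted(line, pos, value))
            throw std::runtime_error("missing VALUE: " + line);
        // Anything after VALUE (for instance a trailing "# ..." comment) is ignored.
        std::vector<std::string> words;
        std::istringstream splitter(value);
        for (std::string w; splitter >> w;) words.push_back(w);
        // KEYs may not be duplicated: emplace reports whether the key was new.
        if (!table.emplace(key, words).second)
            throw std::runtime_error("duplicate KEY: " + key);
    }
    return table;
}
```

Storing VALUE as a list of words reflects the rule that its words are separated by spaces; with the example file above, the entry for "c.-a-d." would hold four words, the first being "c'" with its quote attached.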
Example (in the initialization file):
preprocFr = RulePreProcessorFr.dll -abrev ..\..\..\DataBases\abrv.dba -French
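To illustrate what the three pronunciation flags change, here is a rough sketch of number expansion below 100. It is not taken from the module's sources: the Variant enum and expandBelow100 are hypothetical names, the word forms follow common usage (Belgian "septante" and "nonante", Swiss "huitante", which in practice varies by canton), and the rules actually hard-coded in RulePreProcessorFr may differ.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical mirror of the -French / -Belgium / -Swiss flags.
enum class Variant { French, Belgium, Swiss };

// Spells out 0..99; hyphens join tens and units, " et " is used before un/onze.
std::string expandBelow100(int n, Variant v) {
    static const char* const small[] = {
        "zéro", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit",
        "neuf", "dix", "onze", "douze", "treize", "quatorze", "quinze", "seize",
        "dix-sept", "dix-huit", "dix-neuf"};
    static const char* const tens[] = {
        "", "", "vingt", "trente", "quarante", "cinquante", "soixante",
        "septante", "huitante", "nonante"};
    if (n < 0 || n > 99) throw std::out_of_range("expected 0..99");
    if (n < 20) return small[n];

    const int t = n / 10, u = n % 10;
    std::string base;
    int rest = u;
    bool allowEt = true;  // "et" appears in 21, 31, ..., 71 but not after quatre-vingt
    if (v == Variant::French && t == 7) {
        base = "soixante"; rest = 10 + u;          // 70..79 = soixante + 10..19
    } else if (v == Variant::French && t == 9) {
        base = "quatre-vingt"; rest = 10 + u; allowEt = false;
    } else if (v != Variant::Swiss && t == 8) {
        if (u == 0) return "quatre-vingts";        // plural "s" only for exactly 80
        base = "quatre-vingt"; allowEt = false;
    } else {
        base = tens[t];                            // plain decimal tens word
    }
    if (rest == 0) return base;
    if (allowEt && (rest == 1 || rest == 11)) return base + " et " + small[rest];
    return base + "-" + small[rest];
}
```

Under these assumptions only the handling of 70 to 99 depends on the variant, which is consistent with the remark that number expansion can be adapted quite easily.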
RulePreProcessorFr is able to parse text from three types of input streams:
Any such flag, which should be stored in the CmdLine object, will be removed after processing its content. This makes it possible to process a sequence of inputs with only one CmdLine.
IMPORTANT NOTE: This module does not own its input data. If the input is discarded (file moved or deleted, buffer emptied or freed) while the module is running, the application may crash!
RulePreProcessorFr creates three layers in the MLC: the Token, Word and GrammarUnit layers, containing items of type Token, Word and GrammarUnit, respectively. The Token layer simply contains all the elementary textual units of a sentence (including punctuation and separators), as they were found in the text. The Word layer contains words after token analysis (i.e. after number or abbreviation expansion, or any other specific text transformation). The GrammarUnit layer contains units seen as a single entity for morpho-syntactic analysis purposes. Each extracted token is linked to the first word and to the first GrammarUnit it corresponds to.
When a complete sentence has been correctly parsed, RulePreProcessorFr returns true, which lets the EULER kernel hand control to the other modules. If the end of the input stream is reached, RulePreProcessorFr returns false, which stops the kernel.
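The relation between the three layers can be pictured with a small data-structure sketch. This is not the MLC's real interface: the struct layouts, the Layers type and the addToken helper are hypothetical, indices stand in for whatever links the MLC actually uses, and grouping all the words of one token into a single GrammarUnit is a simplifying assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical item types; the real MLC items carry more information.
struct Word        { std::string text; };
struct GrammarUnit { std::size_t firstWord; std::size_t wordCount; };
struct Token {
    std::string text;              // the token exactly as found in the input
    std::size_t firstWord;         // index of the first Word it produced
    std::size_t firstGrammarUnit;  // index of the first GrammarUnit it produced
};

struct Layers {
    std::vector<Token>       tokens;
    std::vector<Word>        words;
    std::vector<GrammarUnit> grammarUnits;
};

// Appends one token and its expansion (assumed non-empty) to the three layers.
void addToken(Layers& mlc, const std::string& tokenText,
              const std::vector<std::string>& expansion) {
    const std::size_t firstWord = mlc.words.size();
    const std::size_t firstUnit = mlc.grammarUnits.size();
    for (const std::string& w : expansion) mlc.words.push_back(Word{w});
    // Assumption: the whole expansion is treated as one unit for later analysis.
    mlc.grammarUnits.push_back(GrammarUnit{firstWord, expansion.size()});
    mlc.tokens.push_back(Token{tokenText, firstWord, firstUnit});
}
```

In this picture an abbreviation such as "c.-à-d." remains a single entry in the Token layer while contributing four entries to the Word layer, and the token keeps a link to the first of them.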
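The true/false convention suggests a sentence-by-sentence control loop on the kernel side. The sketch below only illustrates that idea under stated assumptions: Module and runPipeline are hypothetical names, modules are modelled as plain callables, and the real EULER kernel interface is not shown in this document.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for a module entry point: returns true on success.
using Module = std::function<bool()>;

// Runs the downstream modules once per sentence delivered by the preprocessor.
// Returns the number of sentences processed before the input was exhausted.
int runPipeline(const Module& preprocessor, const std::vector<Module>& downstream) {
    int sentences = 0;
    while (preprocessor()) {            // true: one complete sentence is in the MLC
        for (const Module& m : downstream) m();
        ++sentences;
    }
    return sentences;                   // preprocessor returned false: end of input
}
```

The point of the design is that the preprocessor alone decides when the run ends, so the other modules never need to inspect the input stream themselves.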
GNU C++ Sources: Euler\modules\RulePreProcessorFr\*.*
Preprocessing is known to be quite language-dependent. Instead of trying to produce a single preprocessing module meant to work for several languages (which would quickly lead to a very complex module), we have developed a French preprocessing GNU module (hence its name), and suggest adapting its source code to create new preprocessing modules for other languages. Note that preprocessing need not be as complex as in RulePreProcessorFr.
(Preprocessing is also known to be very application-dependent. Reading emails, for instance, is not the same as reading stock market prices, because the format of the input text differs a lot. This is another good reason not to try to develop a general-purpose preprocessor.)
This module and its sources are distributed under the GNU license.