This task had two goals: to follow developments in state-of-the-art US English databases, and to develop multilingual databases (mainly the Portuguese database).
To compare the performance of different recognition systems, or to assess the advantages of modifications to our basic systems, the use of a common and widely used speech database is of the greatest importance. For that reason we have been using the ARPA CSR corpora of North American Business News (NABN) to develop and assess neural network technologies on different continuous speech recognition tasks.
However, when we modify our basic systems, it is important to have a short turnaround time in their evaluation. This calls for databases that are small and well defined, yet still difficult. The OGI Numbers 93 and PhoneBook databases, both of which have been in use in this project, meet these requirements. The WSJ0 and WSJCAM0 speech databases are also being used for that purpose.
The bulk of this task, however, concerns the use of multilingual databases to perform research on new languages. Of the three European languages targeted in this project, Portuguese presented the most difficult situation, since there were virtually no Portuguese databases of adequate size for training the kind of recognizers addressed in this project. For that reason, a large Portuguese database has been collected within this task, based on texts from the PÚBLICO newspaper. A small Portuguese database from the SAM project was nevertheless used to develop the baseline system for Portuguese (Task 1.3). The two other languages are addressed through the use of WSJCAM0 for UK English and BREF for French.
Some databases are also used for language modelling, such as CSR-LM1 and the British National Corpus. In addition, the text part of the PÚBLICO database will be used for the language modelling work on Portuguese.
Some of these databases were described in last year's report. The present report includes an updated description of some of them.