Work Package Manager: INESC
Executing Partners: CUED, FPMs, INESC, SU
As stated initially, this project has two essential objectives: (1) to stay at the forefront of research in large vocabulary continuous speech recognition, and (2) to address the multilingual aspects of recognition with hybrid HMM-ANN systems.
As a starting point for this project, and as a result of the previous WERNICKE project, all the partners have developed their own systems for the Wall Street Journal (WSJ) database, which was the standard international database used for the development and evaluation of large vocabulary, speaker-independent, continuous speech recognizers.
This workpackage was planned to cover the ground necessary to achieve our essential objectives. In that sense, we intended the following with this workpackage:
To cover all these aspects we divided the workpackage into three tasks. The first, Task T1.1: Databases, covers points 1 and 2 above. The other two, Task T1.2: Baseline System for French and Task T1.3: Baseline System for Portuguese, cover point 3.
For the first year we had the following milestone:
No Deliverables were scheduled for this year.
Task Coordinator: INESC
Executing Partners: CUED, FPMs, INESC, SU, ICSI
To compare the performance of different recognition systems, or to assess the advantages of modifications to our basic systems, the use of a common and widely used speech database is of the greatest importance. For that reason we will be using the ARPA CSR corpora of North American Business News (NABN) to develop and assess neural network technologies on different continuous speech recognition tasks.
However, when we make modifications to our basic systems it is important to have a short turnaround time in their evaluation. For that it is important to use small, well-defined but still difficult databases. This is the case for the OGI Numbers 93 database and the PhoneBook database, both of which are in use in this project. The WSJ0 and WSJCAM0 speech databases are also being used for that purpose.
In addition to these databases we will be using multilingual databases to perform research on new languages. Of the three European languages targeted in this project, Portuguese presents the most difficult situation, since there is virtually no Portuguese database of a size adequate for training the kind of recognizers addressed in this project. A large Portuguese database based on the PUBLICO newspaper will therefore have to be collected within this task. Meanwhile, a small Portuguese database from the SAM project will be used to develop the baseline system for Portuguese (Task T1.3). The other two languages are covered by WSJCAM0 for English and BREF for French.
There are also some databases for language modelling, such as CSR-LM1 and the British National Corpus. The text part of the PUBLICO database will also be used for language modelling work on Portuguese.
Some of these databases will be described below.
Some of the databases in use in this project are distributed by the Linguistic Data Consortium (LDC) and are available to all partners that are members of the LDC. Other databases, such as EUROM1 (which includes the SAM Portuguese Database) and BREF, are distributed through the European Language Resources Association (ELRA). Although BREF is marked as available in the ELRA catalog, it is not yet distributed. The SAM Portuguese Database is available at INESC. A large Portuguese database based on the PUBLICO newspaper is being collected at INESC.
The different databases in use in this project are described below.
PhoneBook is a phonetically-rich isolated word telephone-speech database. It consists of more than 92,000 utterances and almost 8,000 different words, with an average of 11 talkers for each word (totaling 23 hours of speech). Each speaker of a demographically-representative set of over 1,300 native speakers of American English made a single telephone call and read 75 words. The database contains 106 word lists, each composed of 75 or 76 words that have been pronounced by a few (typically around 11) speakers. The speakers and words are different for each word list. The word lists are labeled xy, with x = a, b, c, d or e and y = a, ..., z, except if x = e, in which case y is equal to a or b only. There are thus 4 x 26 + 2 = 106 word lists. Since the database is very large, we defined two training sets, one cross-validation set and one test set as follows:
Numbers93 is a continuous-speech database collected by the CSLU at the Oregon Graduate Institute. It consists of numbers spoken naturally over telephone lines on the public-switched network. The Numbers'93 database consists of 2,167 speech files of spoken numbers produced by 1,132 callers.
From the ELRA HTML description of BREF: The BREF corpus was designed to provide enough read speech data for the development and evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a large corpus of continuous speech for the acquisition of acoustic-phonetic knowledge of spoken French. All the recorded texts were selected from extracts of the French newspaper Le Monde so as to provide a large vocabulary (over 20,000 words) and a wide range of phonetic environments. The entire BREF corpus contains over 100 hours of speech material from 120 speakers. The BREF-80 sub-corpus consists of 2 ISO9660 CDROMs, BREF80-1 and BREF80-2, containing speaker-independent training data from 80 speakers. Together these 2 CDs contain 5330 sentences, an average of 67 sentences per speaker. While this data represents only a small portion of the entire BREF corpus, the sentences have been selected to cover most of the BREF training prompts, in order to conserve a wide range of phonetic contexts with a minimum amount of speech data. Thus, the BREF80 sub-corpus produced on these CDs was especially selected to train speaker-independent, vocabulary-independent speech recognizers.
The situation of Portuguese is different from that of English and French. At the beginning of the project only the Portuguese version of SAM was available, joined during the last part of the year by SPEECHDAT. The latter is a telephone speech corpus based on isolated words and spellings, with just a few sentences per speaker. The SAM database is read speech recorded in an anechoic room by a relatively large number of speakers. Neither of these speech databases is of adequate size for training the kind of recognizers addressed in this project.
One of our goals is to build a speaker-independent, large vocabulary continuous speech recognition system for Portuguese. To reach this goal we face two basic problems. One is the need for an appropriate Portuguese database in terms of size and content (in both speech and text). The other is that no segmented and labelled database exists for Portuguese. To solve the latter problem we planned to develop a basic system for Portuguese (Task T1.3), in which we would test some techniques for automatic segmentation and labelling in parallel with the development of a basic lexicon (Task T2.2) and a language model for Portuguese.
For that reason we committed ourselves to defining, developing and acquiring an appropriate database for Portuguese through the SPRACH project. Meanwhile we will use the SAM database to develop the basic system for Portuguese.
Below we describe the SAM database as well as the work done to collect the new database.
The Portuguese SAM EUROM.1 database is a result of ESPRIT Project 6819 SAM-A. The corpus consists of 4 components:
Different parts and differing amounts of this material were recorded by three sets of native Portuguese speakers:
Most of the speakers were selected from the staff of INESC and CLUL. The age group with the largest representation is 20-29 (about 42%), followed by the age groups 30-39 (33%), 40-49 (12%), below 20 (7%) and above 50 (7%).
This database was recorded in an anechoic room, following all the settings of the other SAM databases. The detailed description of the database includes the separation of prompts for each speaker, the text of the passages and sentences, and the SAMPA phonetic transcription. A description of the speakers is also included.
In the development of the basic system for Portuguese (Task T1.3) we will be using the passages of the Many Talker Set. This gives us a total of 180 speech files.
The SAM database was very useful for developing a basic system for Portuguese (as shown in Task T1.3), but it is not satisfactory in terms of size and content. For that reason we committed ourselves to defining, developing and acquiring an appropriate database for Portuguese through the SPRACH project.
With this new database we intend to create a corpus equivalent in size to the WSJ0 database. We also wanted to base it on a newspaper, but one with broad coverage of subjects and a variety of writing styles. Our choice was the PUBLICO newspaper.
We chose the PUBLICO newspaper for several reasons. PUBLICO is one of the best daily newspapers in the Portuguese language, with broad coverage of subjects through its different sections, and is produced by an excellent group of journalists and collaborators. The newspaper publishes its editions on the Web through the PUBLICO ON-LINE initiative (http://www.publico.pt/). This Web version has all the text of the daily printed edition. Since INESC is also involved in this initiative, we were able to obtain the texts in HTML format directly from their FTP site.
Next we describe the different steps towards our present state:
After all these steps we consider the texts ready to use as selection material for the sentence prompts and as a basis for language model training.
Under this task we have to define the different train and test sets and choose the sentences to record. In that sense we plan to do the following steps:
After all these steps we will be ready to record the speech files. Each file will contain one sentence. The recordings will take place at INESC in a soundproof room. We plan to use a Sennheiser HMD 25-1 microphone with a Shure FP11 preamplifier.
The speakers will be selected from the Instituto Superior Técnico of the University of Lisbon. This is an engineering school with undergraduate and graduate students. The ages will be in the range 18-26. Among the speakers we can find a wide range of regional accents.
Since we are committed to continually evaluating our recognizers on internationally accepted databases, we remain open to new databases that may appear for that purpose.
In addition, we will continue our work on the acquisition of the Portuguese database based on the PUBLICO newspaper.
Task Coordinator: FPMs
Executing Partners: FPMs
Starting from the baseline US-English large vocabulary continuous speech recognizer developed in the framework of WERNICKE, the early task of FPMs will be to develop an equivalent system in French. This will involve accessing and processing the BREF database available via ELRA.
FPMs became a member of ELRA and paid the fee to access the BREF database. Unfortunately, as of this writing, we have not been able to get access to the data. Consequently, most of the work in this task was impossible. The corresponding effort was spent on software development (see the STRUT description in section 7.3.3).
Task Coordinator: INESC
Executing Partners: INESC
For Portuguese, the main goal in this project is to develop a speaker-independent, large vocabulary continuous speech recognition system. To reach this goal we need an adequate database in both speech and text. We also need an appropriate phoneme segmentation of that database, or tools that will enable us to produce it.
Unfortunately that database does not exist yet and is being collected under Task T1.1. To overcome this problem we began developing a baseline system for Portuguese based on a much smaller database that is already available. This baseline system will allow us to develop the basic structures necessary to build a large system.
In this first year of the project we developed a speaker-independent continuous speech recognition system for Portuguese. Based on a relatively small database, with a medium vocabulary of 1314 words, we trained the acoustic-phonetic models, created a baseline lexicon and built a word-pair grammar for a small and limited task associated with the EUROM.1 SAM Portuguese database.
The development database for the Portuguese baseline system was the Portuguese part of the SAM database EUROM.1. The SAM database consists of read speech with three different sets of speakers and different material, as described in Task T1.1. This database has no phonetic labelling, no dictionary and no grammar. It provides the speech files, the separation of prompts for each speaker, the text of the passages and sentences, and the SAMPA phonotypical transcription.
Therefore, the development of the Portuguese baseline system consisted of the following steps:
For the automatic labelling of the SAM database we first trained the acoustic-phonetic models on the TIMIT database (an MLP trained on TIMIT to perform phoneme classification). Next we created two conversion tables: one from the TIMIT phonemes to the IPA symbols, and another from SAMPA (the Portuguese SAM phonemes) to the IPA symbols. Putting both tables together we created a mapping table from TIMIT phonemes to Portuguese phonemes. Naturally, not all phonemes have a correspondence.
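The composition of the two conversion tables can be illustrated with a minimal sketch. The function name `compose_via_ipa` and the three-entry toy tables are hypothetical and chosen only for illustration; the actual tables used in the project are not reproduced here.

```python
def compose_via_ipa(timit_to_ipa, sampa_to_ipa):
    """Map each TIMIT phoneme to the SAMPA phoneme that shares its IPA symbol.

    Returns the mapping plus the list of TIMIT phonemes with no counterpart.
    """
    # Invert the SAMPA table so it can be looked up by IPA symbol.
    ipa_to_sampa = {ipa: sampa for sampa, ipa in sampa_to_ipa.items()}
    mapping = {}
    unmapped = []
    for timit, ipa in timit_to_ipa.items():
        if ipa in ipa_to_sampa:
            mapping[timit] = ipa_to_sampa[ipa]
        else:
            unmapped.append(timit)
    return mapping, unmapped


# Toy tables (illustrative only): IPA symbols written as Unicode escapes.
timit_to_ipa = {"sh": "\u0283", "iy": "i", "th": "\u03b8"}  # esh, i, theta
sampa_to_ipa = {"S": "\u0283", "i": "i"}

mapping, unmapped = compose_via_ipa(timit_to_ipa, sampa_to_ipa)
print(mapping)
print(unmapped)
```

In this toy case the TIMIT phoneme `th` has no entry in the Portuguese table and ends up in the unmapped list, which is the situation the text refers to when it says that not all phonemes have a correspondence.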
The next step was to feed the SAM database to the TIMIT net. The probabilities produced by the TIMIT net are transformed according to the previously defined mapping table, becoming the new set of Portuguese probabilities. This set of probabilities is used in the Y0 decoder to perform the forced alignment. As input to this forced alignment process we also use the baseline lexicon developed under Task T2.1.
After completing the alignment we trained a new MLP on the SAM database (with the labels from the forced alignment) to perform phoneme classification. With this network we made another forced alignment pass, generating new labels. The process was iterated three times. The results are presented in Table 1.1.
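One possible way to transform the network outputs through the mapping table is sketched below. The report does not state the exact rule, so this sketch makes two assumptions: the probabilities of all TIMIT phonemes mapped to the same Portuguese phoneme are summed, and the result is renormalised per frame. The function name `map_posteriors` and the toy values are hypothetical.

```python
def map_posteriors(frame_probs, timit_to_pt, pt_phonemes):
    """Convert one frame of TIMIT phoneme probabilities to Portuguese ones.

    Assumption: probabilities of TIMIT phonemes sharing a Portuguese target
    are summed, unmapped TIMIT phonemes are dropped, and the frame is then
    renormalised so that it sums to one.
    """
    totals = {p: 0.0 for p in pt_phonemes}
    for timit, prob in frame_probs.items():
        target = timit_to_pt.get(timit)
        if target is not None:
            totals[target] += prob
    norm = sum(totals.values())
    if norm == 0.0:
        # No mapped phoneme received any mass: fall back to a uniform frame.
        return {p: 1.0 / len(pt_phonemes) for p in pt_phonemes}
    return {p: value / norm for p, value in totals.items()}


# Toy frame (illustrative values only).
frame = {"sh": 0.5, "iy": 0.25, "th": 0.25}
timit_to_pt = {"sh": "S", "iy": "i"}
pt_frame = map_posteriors(frame, timit_to_pt, ["S", "i"])
print(pt_frame)
```

The renormalisation step is a design choice of this sketch; a decoder that works with scaled likelihoods could also accept the unnormalised sums.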
Table 1.1: Percentage of correct frames in the different passes of the alignment.
The results show the improvement in frame classification. This training/alignment process proved to be effective in decreasing the classification error.
After the training/alignment phase we evaluated the system. In the training phase we used 179 files from the Many Talker Set. For evaluation we picked the 10 speakers of the Few Talker Set, choosing 3 passages for each speaker. Here we also used Y0 for evaluation, now including a word-pair language model extracted from the SAM texts (for details see Task T3.3). As reported in Task T3.3, we built 2 different word-pair grammars. In the first, we collected the pairs from the SAM text with the sentences isolated from the passages. Recall from Task T1.1 that each passage contains 5 thematically connected sentences. In this case we got 250 separate sentences. In the second, we took the passage itself as our unit and got 50 passages, with no division between sentences in the same passage. The results are presented in Table 1.2.
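The frame-level measure reported for each pass can be computed as in the following minimal sketch, which assumes one predicted label and one reference label per frame. The function name `percent_correct_frames` and the toy label sequences are hypothetical.

```python
def percent_correct_frames(predicted, reference):
    """Percentage of frames whose predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same number of frames")
    if not reference:
        raise ValueError("at least one frame is required")
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return 100.0 * correct / len(reference)


# Toy example: labels from the classifier versus labels from the alignment.
predicted = ["S", "S", "i", "i"]
reference = ["S", "i", "i", "i"]
score = percent_correct_frames(predicted, reference)
print(score)
```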
Table 1.2: Evaluation results
These results show the great influence of the language model in such a small task.
The work done under this task will continue next year through the improvement of the training/alignment procedure and through the inputs from Tasks T2.2 and T3.3. The availability of a large speech and text database for Portuguese will also benefit the baseline system for Portuguese.
In this workpackage we want to follow the constant evolution of the internationally accepted databases for the evaluation of continuous speech recognizers with very large vocabularies. Providing the databases for French and Portuguese needed to develop the multilingual aspects of the project, together with the development of baseline recognizers for French and Portuguese, is also one of the main targets of this workpackage. These basic objectives have been achieved.
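The difference between the two word-pair grammars (sentence units versus passage units) can be sketched as follows. The boundary markers `<s>` and `</s>`, the function name `word_pairs` and the toy passage are assumptions made for illustration; the actual grammar format is described in Task T3.3.

```python
def word_pairs(units):
    """Collect the set of allowed word pairs, one text unit at a time.

    Each unit is wrapped in start/end markers (an assumption of this sketch),
    so pairs never span two different units.
    """
    pairs = set()
    for unit in units:
        words = ["<s>"] + unit.lower().split() + ["</s>"]
        pairs.update(zip(words, words[1:]))
    return pairs


# Toy passage made of two sentences (illustrative only).
passage = ["o gato dorme", "a casa cai"]

# First grammar: every sentence is its own unit.
sentence_pairs = word_pairs(passage)

# Second grammar: the whole passage is one unit, with no sentence division.
passage_pairs = word_pairs([" ".join(passage)])

print(sorted(passage_pairs - sentence_pairs))
```

With passage units the pair linking the last word of one sentence to the first word of the next is allowed, whereas with sentence units it is not; the grammar therefore constrains the decoder differently in the two cases.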