The situation of the Portuguese language was different from that of English and French. At the beginning of the project the only available database was the Portuguese version of SAM. The Portuguese SAM EUROM.1 database is a result of ESPRIT Project 6819 SAM-A. The SAM database consists of read speech recorded in an anechoic room by a relatively large number of speakers. This database was described in the first-year report. However, none of the available Portuguese speech databases is large enough to train the kind of recognizers addressed in this project.
One of our goals in the SPRACH project is to build a speaker-independent, large-vocabulary continuous speech recognition system for the Portuguese language. To reach this goal we had to solve two basic problems. The first was the need for a Portuguese database appropriate in terms of size and contents (both speech and text). We therefore committed ourselves to defining, developing and acquiring an appropriate database for Portuguese within the SPRACH project. The second was the lack of a segmented and labelled database for the Portuguese language. To solve it we planned to develop a basic system for Portuguese (Task 1.3), in which we tested some techniques for automatic segmentation and labelling, in parallel with the development of a basic lexicon (Task 2.2) and a language model for Portuguese. We used the SAM database to develop this baseline system.
Below we describe the work done to collect the new Portuguese database.
The SAM database was very useful for developing a baseline system for Portuguese (as shown under Task 1.3 in the first-year report), but it was not satisfactory in terms of size and contents for the development of a complete speaker-independent, large-vocabulary continuous speech recognizer. For that reason we chose to define, develop and acquire an appropriate database for Portuguese within the SPRACH project.
With this new database our aim was to create a corpus equivalent in size to the WSJ0 database. As in WSJ0, we chose newspaper text as the corpus, taken from the PÚBLICO newspaper. We chose PÚBLICO for several reasons. It is one of the best daily newspapers in the Portuguese language, with a broad coverage of subjects and writing styles across its different sections, written by an excellent set of journalists and collaborators. Additionally, the newspaper publishes its editions on the Web through the PÚBLICO ON-LINE initiative (http://www.publico.pt/). This Web version contains all the text of the daily paper edition. Since INESC is also involved in that initiative, it was relatively easy for us to obtain the texts.
As recording population we selected students from the Instituto Superior Técnico (IST), a large engineering school of the Technical University of Lisbon, with both undergraduate and graduate students. For this task the population has some limitations and some advantages. The main limitation is the age range, which only spans 19 to 28. However, there are several advantages. Since IST is one of the best and largest engineering schools in the country, it attracts students from different regions (from south to north, from the larger cities on the coastline to the small ones in the interior) and from different social levels, which gives us a wide variety of speakers with different accents.
The recordings took place in a soundproof room at INESC (Lisbon), which is located in the neighborhood of IST. The recordings started in mid-April 1997 and finished in November 1997.
The first phase of our work consisted of collecting the text of the first 6 months of the newspaper that were available through the WWW (from the first on-line edition, on September 22, 1995, until March 31, 1996). Next, these data were converted from HTML format to plain text. Afterwards the files were cleaned up (removing HTML headers and some duplicated information) and a single file was created for each day. Each edition was labeled with a unique id derived from the date, and each article with a unique id based on the edition and on the original filename. This makes it very easy to locate any part of the text at any stage of the processing.
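The labelling scheme described above can be sketched as follows. This is only an illustration: the `pub` prefix, the `YYYYMMDD` date layout and the function names are assumptions, since the exact id format is not given here.

```python
from datetime import date
from pathlib import Path


def edition_id(day: date) -> str:
    """Build an edition id from the publication date (hypothetical format)."""
    return f"pub{day:%Y%m%d}"


def article_id(day: date, original_filename: str) -> str:
    """Build an article id from the edition id and the original file name."""
    # Only the file name without directory or extension is kept (assumption).
    return f"{edition_id(day)}-{Path(original_filename).stem}"
```

With ids of this kind, any sentence selected later can be traced back to its edition and article by simple string matching.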
In the end we obtained 188 files, one for each edition of the newspaper. This represents approximately 220 Mb of text. In the next step we analyzed the texts in order to correct misspellings and to convert numbers into their orthographic form. The selection and conversion were done automatically, and then manually verified and corrected. This step was very costly in terms of time and manpower, involving 10 persons for approximately 2 months.
After all these steps we considered the texts ready for use both as selection material for the sentence prompts and as a basis for language model development.
This work was done during the first year of the project and described in last year's report.
This year we began with the definition of the different training and test sets and the selection of the sentences to record. We started by computing the overall totals for these texts (Table 1.1).
Next we performed a statistical analysis of the texts to help us decide which parameters to use in the selection of sentences. Those statistics led us to decide that our spoken paragraphs should have 2 to 4 sentences each, and that each sentence should have between 6 and 39 words. We rejected paragraphs with just one sentence, because we want to maintain coherent paragraph blocks of text which "provide semantically meaningful material, thereby facilitating the production of realistic speech prosodics", and longer paragraphs (more than 4 sentences), which occur very infrequently and are normally harder to read. These limits, together with the restriction that every word must occur more than twice in the texts, defined the set of paragraphs and sentences that were available for selection.
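A minimal sketch of these selection criteria is given below. The limits (2 to 4 sentences, 6 to 39 words, words occurring more than twice) come from the text above; the sentence splitter, the tokenizer and all names are simplifying assumptions and not the tools actually used.

```python
import re
from collections import Counter

MIN_SENTENCES, MAX_SENTENCES = 2, 4   # sentences per paragraph
MIN_WORDS, MAX_WORDS = 6, 39          # words per sentence
MIN_WORD_COUNT = 3                    # "more than twice"


def split_sentences(paragraph):
    """Naive splitter on ., ! and ? (an assumption for illustration)."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [s.strip() for s in parts if s.strip()]


def words(sentence):
    """Lower-cased alphabetic tokens, keeping hyphenated compounds together."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", sentence.lower())


def eligible_paragraphs(paragraphs):
    """Return the paragraphs that satisfy all three restrictions."""
    counts = Counter(
        w for p in paragraphs for s in split_sentences(p) for w in words(s)
    )
    selected = []
    for p in paragraphs:
        tokenised = [words(s) for s in split_sentences(p)]
        if not MIN_SENTENCES <= len(tokenised) <= MAX_SENTENCES:
            continue
        if all(
            MIN_WORDS <= len(ws) <= MAX_WORDS
            and all(counts[w] >= MIN_WORD_COUNT for w in ws)
            for ws in tokenised
        ):
            selected.append(p)
    return selected
```

Word counts are taken over the whole collection before filtering, so a paragraph can be rejected because of a single rare word.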
In the next phase we divided the texts into three parts: training, development and evaluation. We used 80% of the text for training, 10% for development and 10% for evaluation. The division was made at random, with the paragraph as the unit.
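The 80/10/10 division at paragraph level can be illustrated with the following sketch. The fixed seed and the function name are assumptions added for reproducibility; the proportions are those stated above.

```python
import random


def split_paragraphs(paragraphs, seed=0, train=0.8, dev=0.1):
    """Randomly split paragraphs into training, development and evaluation parts."""
    shuffled = list(paragraphs)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    n_train = round(len(shuffled) * train)
    n_dev = round(len(shuffled) * dev)
    training = shuffled[:n_train]
    development = shuffled[n_train:n_train + n_dev]
    evaluation = shuffled[n_train + n_dev:]  # whatever remains
    return training, development, evaluation
```

Because whole paragraphs are assigned to a single part, no sentence of a recorded paragraph can appear in more than one part.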
From the training part of the text we randomly selected paragraphs with a total of 10,000 sentences to be used as recording material. In this training set we have a total of 21,025 different words.
For both the development and evaluation test sets we decided to have two vocabularies: a small one with no more than 5K words and a larger one with no more than 20K words. As in WSJ0 we would like to have 2,000 sentences for each of the 5K sets and 4,000 sentences for each of the 20K sets.
For the 20K development test set we picked sentences at random, obeying the restrictions defined in the previous section and the additional one of using no more than 20K different words. We selected 4,000 sentences with a final vocabulary of 13,070 words.
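One simple way to pick sentences at random under a vocabulary cap is sketched below. The accept-or-skip strategy and all names are assumptions; the text above only states that the selection was random and bounded by the vocabulary size.

```python
import random


def select_sentences(sentences, n_target, max_vocab, seed=0):
    """Pick up to n_target sentences whose joint vocabulary fits max_vocab.

    sentences is a list of token lists. Sentences are visited in random
    order and one is skipped if it would push the vocabulary over the cap.
    """
    order = list(range(len(sentences)))
    random.Random(seed).shuffle(order)
    vocab, chosen = set(), []
    for i in order:
        candidate_vocab = vocab | set(sentences[i])
        if len(candidate_vocab) > max_vocab:
            continue  # this sentence would exceed the vocabulary limit
        vocab = candidate_vocab
        chosen.append(i)
        if len(chosen) == n_target:
            break
    return chosen, vocab
```

If the loop ends before `n_target` is reached, the cap was too tight for the available material, and the caller can either accept fewer sentences or raise `max_vocab`.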
The 5K words development test set was harder to select. We followed the same procedure as for the 20K set and we obtained only 809 sentences. At that point we decided to allow the vocabulary size to increase until we got 2,000 sentences. The results are presented in Table 1.2.
All these sets are available, but we used only the first one, the 5K-word set with 809 sentences. Since we need a total of only 400 sentences (with repetition) for recording, this number of sentences is still sufficient. It is important for us to keep the total vocabulary within 5K words because of computational limitations. The same process was applied to both the development and evaluation test sets.
Additionally, 15 speaker-adaptation and three calibration sentences were selected. These sentences were chosen to be phonetically rich. They were originally from the PÚBLICO texts but were modified by hand.
All the selected sentences were examined individually to eliminate those that were hard to read. They were then converted into prompts to be used in the recording phase, and into standard SGML format to be used in scoring the recognizer.
The next step was to define the various recording sets. We decided to have a large training set of 8,000 sentences from 100 speakers, and development and evaluation test sets of 400 sentences from 10 speakers each, for both 5K and 20K vocabularies. The numbers of speakers and sentences are the following for each set:
The three calibration sentences were the same for all speakers, and the 15 speaker-adaptation sentences were the same for all speakers of the development and evaluation test sets.
The allocation of the sentences to the speakers was random, with sentence replacement between speakers.
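The allocation can be pictured with the sketch below. It assumes that each speaker draws an equal number of distinct sentences from the common pool (for example, 400 sentences over 10 speakers gives 40 each) and that the pool is not depleted between speakers, so the same sentence may be read by more than one speaker. Both points are our reading of the description, and the names are hypothetical.

```python
import random


def allocate(sentence_ids, n_speakers, per_speaker, seed=0):
    """Randomly assign per_speaker sentences to each speaker.

    Sentences are distinct within a speaker, but the full pool is offered
    again to every speaker (replacement between speakers).
    """
    rng = random.Random(seed)
    return {
        speaker: rng.sample(sentence_ids, per_speaker)
        for speaker in range(n_speakers)
    }
```

For the 5K test sets this would be called as `allocate(list(range(809)), 10, 40)`, using the 809 available sentences mentioned earlier.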
After the allocation of the sentences to the speakers we compiled the resulting vocabulary for each set (Table 1.3).
In the end we compiled a list of the different words for all the sets (for the total sentences). That list has a total of 27,833 words. It was from this list that a pronunciation dictionary was developed, as described in Task 2.2. From this global dictionary we extracted the pronunciation dictionaries associated with each of the above sets.
From the training part of the texts, four closed-vocabulary bigram backoff language models were computed (5K/20K words x development/evaluation test sets) using the CMU-Cambridge SLM Toolkit. The perplexity results obtained for each set are presented in Table 1.4.
As can be observed, the perplexity values associated with these tasks are large.
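For reference, perplexity is the inverse geometric mean of the probabilities that the model assigns to the words of the test text. The sketch below computes it from per-word base-10 log probabilities; the function name and input format are assumptions, and this is not the toolkit's own interface.

```python
import math


def perplexity(log10_probs):
    """Perplexity from a list of per-word log10 probabilities."""
    if not log10_probs:
        raise ValueError("at least one word probability is required")
    mean_log = sum(log10_probs) / len(log10_probs)
    return 10 ** (-mean_log)
```

A model that gives every word a probability of 1/4 has a perplexity of 4, so a large value means that, on average, the recognizer has to choose among many equally likely words at each position.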
For the time being, we chose to record only the training set and the 5K development and evaluation test sets. We expect to record the 20K development and evaluation test sets in a later phase. It will be simple to create new sets from now on, because the recording conditions and the students will still be available.
The recordings took place at INESC, in a soundproof room. A desk-mounted microphone was used to collect the signal.
The database collection was advertised throughout the IST campus, and the students volunteered to participate in the project. As compensation for their collaboration they received a T-shirt with the logo of the project. The recordings started in mid-April 1997 and finished in November 1997. The database will be packed on 4 CDs that we expect to release in the near future.
The database presented here is the result of extensive and careful planning. We expect that this database of both speech and text, together with the supporting material, will be useful to the speech recognition research community in creating and developing continuous speech recognition systems for European Portuguese.