New MBROLA databases

One of the biggest interests of the MBROLA project (and definitely its most original aspect) lies in its ability to provide an ever growing set of languages/voices to users.

To achieve this goal, the MBROLA project has itself been organized so as to encourage other research labs or companies to share their diphone databases.

The terms of this sharing policy can be summarized as follows :

1. You may ask for our Linux-based "mbrolization tools" to adapt your own diphone database to the Mbrola format. However, since we currently have no manpower behind the MBROLA project, we cannot provide support for these tools (other than a how-to file).

2. The resulting Mbrola diphone database will be copyrighted by Faculte Polytechnique de Mons - T.DUTOIT. Non-commercial use of the database in the framework of the MBROLA project will be automatically granted to Internet users. In return, we shall send you a license agreement which will transfer all our commercial rights on the newly created database to you, provided the database is used with and only with the MBROLA program.

3. All these details will be fixed by an official agreement before we send you the tools.

If you want to create a database from scratch

First, you should be aware that recording a diphone database is not a trivial operation. If it is not performed carefully, the result can be disappointing. FR1, for instance, required about one month of work, even with the help of efficient laboratory tools for signal recording and editing. What is more, some phonetic knowledge of the target language is necessary to create the initial corpus.

So if you just think of designing a new diphone database as a game, forget it.

If, on the contrary, you are willing to spend some time to provide the MBROLA community with a new language or voice, or if you already have a diphone database and wish to share it in Mbrola format (and receive in return the rights for any commercial exploitation of the Mbrola diphone database we will create for you), welcome here.

If you still want to create a database from scratch

Creating a database is typically achieved in three steps :

  • Creating a text corpus
  • Recording the corpus
  • Segmenting the speech corpus

Creating a text corpus

Diphones are speech units that begin in the middle of the stable state of a phone and end in the middle of the following one. Their main interest in synthesis is that they minimize concatenation problems, since they involve most of the transitions and co-articulations between phones, while requiring an affordable amount of memory, as their number remains relatively small (as opposed to other synthesis units such as half-syllables or triphones).

Hence, the first step to build a diphone database consists of fixing a list of all the phones of a language. Notice that phones are acoustic instances of phonemes. Phonemes are themselves defined on a functional, linguistic level.

Obtaining a list of phones from a list of phonemes requires enumerating allophones, i.e. acoustic versions of some phonemes that significantly differ from the standard one, mostly due to co-articulation constraints. Although it is not necessary to account for all allophonic variations to build an intelligible synthesizer, the naturalness of synthetic speech may be affected if too few allophones are considered. In FR1, for example, we did not consider allophones at all. As a result, some allophonic phenomena, such as devoicing of /R/ when followed or preceded by unvoiced plosives, are only partially accounted for.

When a complete list of phones has emerged, including allophones if possible, a corresponding list of diphones is immediately obtained, and a list of words is carefully compiled, in such a way that each diphone appears at least once (twice is better, for safety). Unfavorable positions, like inside stressed syllables or in strongly reduced (i.e. over-co-articulated) contexts, should be excluded. One typically uses carrier sentences into which the word containing the diphone under consideration is inserted. Notice that many diphones only appear across word boundaries (i.e. not within single words). Some diphones never appear at all. Hence, the task of creating a text corpus which contains all existing ones is not trivial.
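The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not part of the MBROLA tools: the function names, the "_" silence symbol and the toy phone inventory are all assumptions, and transcriptions are taken to be plain lists of phone symbols.

```python
from itertools import product

def all_diphones(phones, silence="_"):
    """Every ordered pair of units (phones plus silence), except silence-silence."""
    units = list(phones) + [silence]
    return {(a, b) for a, b in product(units, repeat=2) if (a, b) != (silence, silence)}

def diphones_in(transcription, silence="_"):
    """Diphones found in one transcription, with silence added at both edges."""
    padded = [silence] + list(transcription) + [silence]
    return set(zip(padded, padded[1:]))

def missing_diphones(phones, corpus, excluded=frozenset()):
    """Diphones that no corpus item covers, ignoring pairs listed in `excluded`
    (for instance pairs known never to occur in the language)."""
    covered = set()
    for transcription in corpus:
        covered |= diphones_in(transcription)
    return all_diphones(phones) - covered - set(excluded)

# Toy inventory (hypothetical): two phones and two corpus items.
phones = ["a", "t"]
corpus = [["t", "a"], ["a", "t", "a"]]
print(sorted(missing_diphones(phones, corpus)))
# prints [('a', 'a'), ('t', '_'), ('t', 't')]
```

Running such a check after each addition to the word list shows which diphones still need a carrier word. It does not detect unfavorable positions (stress, reduction), which still have to be judged by hand.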

Recording the corpus

The corpus is then read, by a professional speaker if possible, digitally recorded, and stored in digital format. (Recommended format: Fs = 16 kHz, 16-bit, mono)
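If the recordings are stored as WAV files (an assumption, since no container format is prescribed here), the recommended format can be verified with Python's standard wave module. The function name check_format is hypothetical.

```python
import wave

def check_format(path, rate=16000, sample_bytes=2, channels=1):
    """Return a list of mismatches between a WAV file and the recommended format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sampling rate is {wav.getframerate()} Hz, expected {rate} Hz")
        if wav.getsampwidth() != sample_bytes:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits, expected {8 * sample_bytes} bits")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
    return problems
```

An empty list means the file matches 16 kHz, 16-bit, mono. Any other result lists what has to be converted before going further.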

IMPORTANT : In order for the Mbrola resynthesis operation to achieve best results, the corpus should be read with the most monotonic intonation possible (just like when reading a long and boring enumeration). Even at the ends of words, the fundamental frequency should remain constant. Since this is a totally unnatural way of reading a text, the speaker should practice before starting the recording session.

NOTA BENE : If you already have a diphone database which you want to make available in Mbrola format, contact the author, even if it has not been recorded with constant pitch. It is very likely that your database can be used anyway.

It is best to use high quality audio devices (microphone, pre-amp, A/D converter). The sound recording tools provided with many low-price commercial boards, for example, should be avoided, as they produce undesired recording noise. To roughly test the quality of your recording system, just plug the microphone in, adjust the recording level, hold your breath, and record. Or, if you can, short-circuit the microphone input of your system, and record. Then inspect the recorded noise. In the case of FR1, the noise only corrupted the last three bits of our data, leaving thirteen significant bits.
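The "silent recording" test can be turned into a rough number of corrupted bits. The sketch below assumes the silence was saved as a 16-bit mono WAV file without a significant DC offset, and it uses the peak noise sample as the measure, which is a simplification. The function names are hypothetical.

```python
import struct
import wave

def read_samples(path):
    """Read signed 16-bit mono PCM samples from a WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        frames = wav.readframes(wav.getnframes())
    # WAV data is little-endian, hence the "<" prefix.
    return struct.unpack("<%dh" % (len(frames) // 2), frames)

def noisy_bits(samples):
    """Number of low-order bits needed to hold the largest noise sample."""
    peak = max((abs(s) for s in samples), default=0)
    return peak.bit_length()

def significant_bits(samples, word_size=16):
    """Bits left above the noise, for a given sample word size."""
    return word_size - noisy_bits(samples)

# Hypothetical silence recording whose noise stays within +/-7.
silence = [3, -5, 7, -2, 0]
print(noisy_bits(silence), significant_bits(silence))
# prints 3 13
```

With noise confined to the three lowest bits, thirteen bits remain, which is the situation reported above for FR1.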

Another important type of noise to avoid is ambient noise and reverberation. In particular, the recording should be free of low frequency noises, due to trucks passing in the neighborhood for instance. Most of the time you won't hear them, but your microphone will hardly fail to detect them, especially if it is a high quality one. The best way to avoid them is to install your recording system inside a professional soundproof room. For FR1, this is what we did.

Segmenting the corpus

Once the corpus has been recorded, all diphones must be located, either manually with the help of signal visualization tools, or automatically by segmentation algorithms, the decisions of which are checked and corrected interactively. One of our partners, Arthur Dirksen, kindly provides his Diphone Studio software to alleviate the painstaking task of database segmentation (Thanks, Arthur!). A diphone database is finally created, which centralizes the results, in the form of : the names of the diphones, the related waveforms, their durations, and their internal subdivisions. In particular, the position of the border between phones should be stored, so that the duration of one half-phone can be modified without affecting the length of the other one.
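One possible in-memory representation of a segmented diphone is sketched below. It is not the Mbrola database format: the class, its field names and the "a-t" naming scheme are assumptions, positions are sample indices at 16 kHz, and the midpoint of each phone stands in for "the middle of the stable state", which in practice is chosen by inspection.

```python
from dataclasses import dataclass

@dataclass
class Diphone:
    name: str      # e.g. "a-t"
    start: int     # first sample, inside the first phone
    boundary: int  # sample where the first phone ends and the second begins
    end: int       # one past the last sample, inside the second phone

    def half_durations_ms(self, rate=16000):
        """Durations of the two half-phones, in milliseconds."""
        to_ms = 1000 / rate
        return ((self.boundary - self.start) * to_ms,
                (self.end - self.boundary) * to_ms)

def diphone_from_phones(left, right):
    """Build a diphone from two adjacent phone segments given as (name, start, end)."""
    (l_name, l_start, l_end), (r_name, r_start, r_end) = left, right
    if l_end != r_start:
        raise ValueError("phone segments must be adjacent")
    return Diphone(f"{l_name}-{r_name}",
                   (l_start + l_end) // 2,  # middle of the first phone
                   l_end,                   # border between the two phones
                   (r_start + r_end) // 2)  # middle of the second phone

d = diphone_from_phones(("a", 1600, 3200), ("t", 3200, 4800))
print(d.name, d.half_durations_ms())
# prints a-t (50.0, 50.0)
```

Because the boundary is stored explicitly, either half-phone can later be stretched or shortened on its own.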

NOTA BENE : For optimal results with Mbrola, it is best to keep diphones in context. The MBROLA resynthesis operation, indeed, includes some pitch analysis, which itself achieves more accurate results when, say, 50 ms of speech are kept at the left and right of each diphone.
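Keeping context amounts to widening each cut by a fixed number of samples, clamped to the limits of the recording. A minimal sketch, assuming sample indices at 16 kHz and a hypothetical helper name:

```python
def with_context(start, end, total_samples, rate=16000, context_ms=50):
    """Widen the span [start, end) by context_ms on each side, within the recording."""
    pad = rate * context_ms // 1000  # 800 samples for 50 ms at 16 kHz
    return max(0, start - pad), min(total_samples, end + pad)

print(with_context(2400, 4000, total_samples=4500))
# prints (1600, 4500)
```

The original diphone limits should still be stored alongside the widened span, so that the extra speech is used for analysis only.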

What to do now?

If you want to build a new diphone database (and assuming you are now aware that this is not as straightforward as it may seem), please do the following :

  • Contact the author and announce your intention to proceed to the creation of a new database.
  • Download the Database License Agreement. Print 2 copies of it. Read them, fill them in, and sign both copies on all pages. Send both copies back to us by email.
  • Download the instructions for creating a diphone database. Follow these instructions as best as you can. The final quality of your database depends on it (plus, we won't process databases which do not match our requirements).

Thank you for your cooperation !

Last updated Oct, 28, 2014