The MBROLIGN Project

Towards a Large Repository of Aligned Text-to-Speech


The aim of the MBROLIGN project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons (Belgium), is to create a repository of phonetic/prosodic (PHO) files for as many languages as possible, and provide them free for non-commercial applications. The phonetic/prosodic files collected are .pho files compatible with the MBROLA synthesizer.

The ultimate goal of this project is to boost academic research on corpus-based prosodic modeling, widely regarded as one of the biggest challenges facing Text-To-Speech synthesis in the years to come.

Central to the MBROLIGN project is MBROLIGN v1.1, a text-to-speech alignment tool based on the freely available MBROLA synthesizer. MBROLIGN takes a list of phonemes and a wav file as input, and produces a phonetic/prosodic file as output. Text-to-speech alignment is based on dynamic time warping between the input wav file and an utterance synthesized from the input phonemes with the MBROLA synthesizer.
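
To illustrate the idea of alignment by dynamic time warping, here is a minimal Python sketch. It is not the MBROLIGN implementation: this page does not say which acoustic features, local distance, or path constraints the tool uses, so the sketch applies textbook DTW to toy one-dimensional frames. The names dtw_path and map_boundaries are hypothetical. The reference sequence stands for the synthesized utterance, whose phoneme boundaries are known, and the test sequence stands for the recorded wav file.

```python
def dtw_path(ref, test, dist=lambda a, b: abs(a - b)):
    """Textbook DTW: return (total_cost, path) aligning ref to test.

    path is a list of (ref_index, test_index) pairs, in increasing order.
    """
    n, m = len(ref), len(test)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning ref[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance ref only
                                 cost[i][j - 1])      # advance test only
    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


def map_boundaries(path, ref_boundaries):
    """Map frame indices in ref to the first test frame aligned with each."""
    first_match = {}
    for i, j in path:
        first_match.setdefault(i, j)
    return [first_match[b] for b in ref_boundaries]


# Toy data: the "recorded" sequence is a time-stretched copy of the
# "synthesized" one. Real systems would compare feature vectors per frame.
synth = [0, 0, 1, 1, 0, 0]
recorded = [0, 0, 0, 1, 1, 1, 1, 0, 0]
total, path = dtw_path(synth, recorded)
recorded_boundaries = map_boundaries(path, [0, 2, 4])
```

The known segment starts in the synthesized sequence (frames 0, 2 and 4 in this toy case) are carried over to the recording through the warping path, which is how segment durations for the recorded speech can be derived.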

Text+speech databases are needed as input to the MBROLIGN aligner. In order to ensure a wide distribution of the outputs of this program, we have established a very simple sharing policy to encourage other research labs or companies to share their phonetic/prosodic files. The terms of this sharing policy can be summarized as follows:

After an official agreement between the author of MBROLIGN and the owner of a text+speech database, the database is processed by the MBROLIGN team and the phonetic/prosodic files are returned to the owner of the database. These files are also made available on the MBROLIGN web site for non-commercial, non-military use as part of the MBROLIGN project. Commercial rights on the phonetic/prosodic files remain with the database provider.

The phonetic/prosodic files collected follow this very simple format:
_ 51 25 114 
b 62 
o~ 127 48 170 
Z 110 53 116 
u 211 
R 150 50 91 
_ 91

This is also the format of the input data required by the MBROLA synthesizer. Each line contains a phoneme name, a duration (in ms), and a series (possibly empty) of pitch pattern points. Each pitch pattern point consists of two integers: its position within the phoneme (in % of the phoneme's total duration), and the pitch value (in Hz) at that position.

Hence, the first line of the above example:

_ 51 25 114

corresponds to a silence of 51 ms with one pitch pattern point of 114 Hz placed at 25% of those 51 ms, that is, 12.75 ms into the silence. Pitch pattern points define a piecewise linear pitch curve.
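
The format described above is simple enough to read with a few lines of Python. The following is a minimal sketch, not code from the MBROLIGN or MBROLA distributions: the function names parse_pho, pitch_targets and pitch_at are hypothetical. Two choices are assumptions rather than facts stated on this page: lines starting with ";" are skipped as comments, and the pitch is held constant before the first and after the last pitch pattern point.

```python
SAMPLE = """\
_ 51 25 114
b 62
o~ 127 48 170
Z 110 53 116
u 211
R 150 50 91
_ 91
"""


def parse_pho(text):
    """Parse .pho text into (phoneme, duration_ms, [(percent, hz), ...])."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: blank lines and ';' comment lines carry no phoneme.
        if not fields or fields[0].startswith(";"):
            continue
        name, duration = fields[0], int(fields[1])
        numbers = [int(x) for x in fields[2:]]
        if len(numbers) % 2:
            raise ValueError("unpaired pitch value in line: " + line)
        points = list(zip(numbers[0::2], numbers[1::2]))
        entries.append((name, duration, points))
    return entries


def pitch_targets(entries):
    """Convert per-phoneme percentages to absolute (time_ms, hz) targets."""
    targets, start = [], 0.0
    for _, duration, points in entries:
        for percent, hz in points:
            targets.append((start + duration * percent / 100.0, float(hz)))
        start += duration
    return targets


def pitch_at(targets, t_ms):
    """Piecewise linear pitch at time t_ms (ms from utterance start)."""
    if not targets:
        return None
    if t_ms <= targets[0][0]:
        return targets[0][1]  # assumption: hold the first value
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t0 <= t_ms <= t1:
            if t1 == t0:
                return f1
            return f0 + (f1 - f0) * (t_ms - t0) / (t1 - t0)
    return targets[-1][1]  # assumption: hold the last value


entries = parse_pho(SAMPLE)
targets = pitch_targets(entries)
total_ms = sum(duration for _, duration, _ in entries)
```

For the example file, the first target falls at 12.75 ms (25% of the 51 ms silence), and phonemes without pitch pattern points, such as "b" and "u", simply inherit the straight line drawn between the neighbouring targets.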


Here is a demo of some phonetic/prosodic files obtained in the context of this project, and resynthesized with the MBROLA synthesizer:
  • LeSoir97, 70 minutes of Belgian read news
  • Emotions, 30 minutes of emotive French speech
  • EUROM1, a text-to-speech alignment of the EUROM1 (MULTEXT Prosodic Database part) English CD
  • Arabic, 40 minutes of Arabic speech (Moroccan)

