The MBROLIGN Project
Towards a Large Repository of Aligned Text-to-Speech
The aim of the MBROLIGN project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons (Belgium), is to create a repository of phonetic/prosodic (PHO) files for as many languages as possible, and provide them free for non-commercial applications. The phonetic/prosodic files collected are .pho files compatible with the MBROLA synthesizer.
The ultimate goal of this project is to boost academic research on corpus-based prosodic modeling, widely regarded as one of the biggest challenges facing text-to-speech synthesis in the years to come.
Central to the MBROLIGN project is MBROLIGN v1.1, a text-to-speech alignment tool based on the freely available MBROLA synthesizer. MBROLIGN takes a list of phonemes and a WAV file as input, and produces a phonetic/prosodic file as output. Text-to-speech alignment is based on dynamic time warping between the input WAV file and an utterance synthesized from the input phonemes with the MBROLA synthesizer.
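As an illustration of the dynamic time warping step, here is a minimal Python sketch of the classic algorithm. It is not the MBROLIGN implementation: the function names dtw_path and map_frames, the absolute-difference distance, and the use of one number per frame are all assumptions made for the example. A real aligner would compare vectors of acoustic features extracted from the two signals.

```python
from math import inf


def dtw_path(ref, test, dist=lambda a, b: abs(a - b)):
    """Align two non-empty feature sequences with classic DTW.

    Returns (total_cost, path), where path is a list of (ref_index,
    test_index) pairs. Hypothetical helper, not part of MBROLIGN.
    """
    n, m = len(ref), len(test)
    # cost[i][j]: minimal cumulative cost of aligning ref[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance ref only
                                 cost[i][j - 1])      # advance test only
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def map_frames(path):
    """Map each ref frame to the first test frame aligned with it."""
    first = {}
    for i, j in path:
        first.setdefault(i, j)
    return first


# Toy data: the "recorded" sequence holds its middle value one frame longer.
synthesized = [1, 2, 3]
recorded = [1, 2, 2, 3]
total, path = dtw_path(synthesized, recorded)
frame_map = map_frames(path)
```

In this sketch, phoneme boundaries known in the synthesized utterance could be carried over to the recording by looking them up in frame_map, which is the general idea behind deriving durations from the warping path.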
Text+speech databases are needed as input to the MBROLIGN aligner. To ensure wide distribution of the outputs of this program, we have established a very simple sharing policy, so as to encourage other research labs or companies to share their phonetic/prosodic files. The terms of this sharing policy can be summarized as follows:
After some official agreement between the author of MBROLIGN and the owner of a text+speech database, the database is processed by the MBROLIGN team and the phonetic/prosodic files are returned to the owner of the database. These files are also made available on the MBROLIGN web site for non-commercial, non-military use as part of the MBROLIGN project. Commercial rights on the phonetic/prosodic files remain with the database provider.
The phonetic/prosodic files collected follow this very simple format:
_ 51 25 114
o~ 127 48 170
Z 110 53 116
R 150 50 91
This is also the format of the input data required by the MBROLA synthesizer. Each
line contains a phoneme name, a duration (in ms), and a series
(possibly empty) of pitch pattern points, each composed of two integer
numbers: the position of the pitch pattern point within
the phoneme (in % of its total duration), and the pitch value
(in Hz) at this position.
Hence, the first line of the above example:
_ 51 25 114
corresponds to a silence of 51 ms, with a
pitch pattern point of 114 Hz at 25% of these 51 ms (about 13 ms
into the silence). Pitch pattern
points define a piecewise linear pitch curve.
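To make the format concrete, the following Python sketch parses the four example lines and evaluates the piecewise linear pitch curve they define. The function names parse_pho_line, pitch_targets and pitch_at are hypothetical, and two choices are assumptions of this sketch rather than documented MBROLA behavior: numbers are read as floats, and the pitch is held constant before the first and after the last pitch pattern point.

```python
def parse_pho_line(line):
    """Split one line into (phoneme, duration_ms, [(position_pct, pitch_hz), ...])."""
    fields = line.split()
    name, duration = fields[0], float(fields[1])
    values = [float(v) for v in fields[2:]]
    if len(values) % 2:
        raise ValueError("pitch pattern points must come in pairs: " + line)
    return name, duration, list(zip(values[0::2], values[1::2]))


def pitch_targets(lines):
    """Convert per-phoneme pitch points to absolute (time_ms, pitch_hz) targets."""
    targets, start = [], 0.0
    for line in lines:
        _, duration, points = parse_pho_line(line)
        for position_pct, pitch_hz in points:
            targets.append((start + duration * position_pct / 100, pitch_hz))
        start += duration  # next phoneme begins where this one ends
    return targets


def pitch_at(targets, time_ms):
    """Linearly interpolate the pitch curve at time_ms.

    Assumption of this sketch: the curve is flat outside the first and
    last targets.
    """
    if time_ms <= targets[0][0]:
        return targets[0][1]
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if time_ms <= t1:
            return f0 + (f1 - f0) * (time_ms - t0) / (t1 - t0)
    return targets[-1][1]


example = ["_ 51 25 114", "o~ 127 48 170", "Z 110 53 116", "R 150 50 91"]
targets = pitch_targets(example)
total_ms = sum(parse_pho_line(line)[1] for line in example)
```

Here the first target falls at 12.75 ms (25% of 51 ms) and the last at 363 ms (288 ms of preceding phonemes plus 50% of 150 ms), for a total utterance duration of 438 ms.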
Here is a demo of some phonetic/prosodic files obtained in the context of this project, and resynthesized with the MBROLA synthesizer:
- LeSoir97, 70 minutes of Belgian read news
- Emotions, 30 minutes of emotive French speech
- EUROM1, a text-to-speech alignment of the EUROM1 (MULTEXT PROSODIC Database part) ENGLISH CD
- Arabic, 40 minutes of Arabic speech (Moroccan)
Last updated December 17, 1999. Send comments to firstname.lastname@example.org