The Technologies We Use

The TCTS lab of the Faculté Polytechnique de Mons has been active in the field of Text-To-Speech synthesis since 1983. A PhD thesis was recently completed on the design of LIPSSTALK, a high-quality, diphone-based text-to-speech (TTS) system for the French language. LIPSSTALK incorporates innovative implementations of both the NLP and DSP parts: the LInguistic Processing for Speech Synthesis (LIPSS) text analysis system and the time-domain Multi-Band Re-synthesis OverLap-Add (MBROLA) synthesis algorithm, respectively. LIPSSTALK has recently served as the basis for EULER, a multilingual TTS system built on top-quality software engineering principles.

Part of our work is devoted to the study of concatenation-based synthesis techniques. On the basis of a comparative study of several TTS models and algorithms, we have opted for a synthesis technique known as MBROLA (a demo is available here and here), which takes advantage of the flexibility of parametric speech models while keeping the computational simplicity of time-domain synthesizers. This model compares favorably with others with respect to the aforementioned quality criteria, for the following reasons:

  • Its computational complexity has been kept as low as that of the TD-PSOLA approach (4 operations/sample on average), while enabling the synthesizer to apply spectral smoothing in the time domain between neighboring segments, a feature that is not available with TD-PSOLA.
  • As a result, the fluidity of MBROLA-based speech is enhanced, so that even diphones produce high-quality synthetic speech. We have not even had to optimize our diphone database through the classical and painstaking trial-and-error operation that consists of rejecting and re-recording bad segments (i.e. segments that introduce significant discontinuities when concatenated with others).
  • It is possible to code the MBROLA diphone database for French at bit rates lower than 30 kbits/s without degrading speech quality (to be compared with the 256 kbits/s required to store speech at 16 kHz with 16-bit resolution), at the cost of a noticeable increase in computational load (8 operations/sample).
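The figures quoted in the list above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the operations-per-second estimate assumes that synthesis runs at the same 16 kHz rate quoted for storage, which the text does not state.

```python
# Figures quoted in the text above.
SAMPLE_RATE_HZ = 16_000       # storage sampling rate
BITS_PER_SAMPLE = 16          # storage resolution
CODED_RATE_KBITS = 30         # upper bound quoted for the coded French database
OPS_PER_SAMPLE_PLAIN = 4      # average quoted for MBROLA and TD-PSOLA
OPS_PER_SAMPLE_CODED = 8      # quoted when the coded database is used


def raw_bit_rate_kbits(sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed speech, in kbits/s."""
    return sample_rate_hz * bits_per_sample / 1000


def ops_per_second(ops_per_sample, sample_rate_hz):
    """Operations per second, ASSUMING output is produced at sample_rate_hz."""
    return ops_per_sample * sample_rate_hz


raw_rate = raw_bit_rate_kbits(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
compression_ratio = raw_rate / CODED_RATE_KBITS

print(f"raw bit rate: {raw_rate:.0f} kbits/s")
print(f"compression ratio at 30 kbits/s: at least {compression_ratio:.1f}:1")
print(f"plain database: {ops_per_second(OPS_PER_SAMPLE_PLAIN, SAMPLE_RATE_HZ)} ops/s")
print(f"coded database: {ops_per_second(OPS_PER_SAMPLE_CODED, SAMPLE_RATE_HZ)} ops/s")
```

Since 30 kbits/s is quoted as an upper bound on the coded bit rate, the ratio computed here is a lower bound on the compression actually achieved.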

In the context of High Quality Text-to-Speech (HQ-TTS) synthesis by concatenation, we have also investigated the use of three other leading speech models: the classical Auto-Regressive model (LPC; a demo is available here) with order 18, the hybrid Harmonic/Stochastic (H/S) model (also known as the MBE model; a demo is available here), and the 'null' model, as implemented by the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA; a demo is available here) synthesis algorithm. Quality assessment tests, performed on practical software implementations of the related HQ-TTS systems, have shown that our own MBROLA model combines the computational efficiency and the naturalness of the original TD-PSOLA algorithm with the flexibility of the H/S model.
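For readers unfamiliar with the overlap-add principle that gives TD-PSOLA and MBROLA their names, the following is a minimal generic sketch. It is not the MBROLA or TD-PSOLA algorithm: segments are placed at a fixed hop rather than pitch-synchronously, no multi-band re-synthesis or spectral smoothing is performed, and all names and the constant test input are hypothetical.

```python
import math


def hann(n):
    """Periodic Hann window of length n.

    Copies of this window shifted by n / 2 samples add up to 1.
    """
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(segments, hop):
    """Window each segment and sum them, placing segment k at offset k * hop."""
    if not segments:
        return []
    total = max(k * hop + len(seg) for k, seg in enumerate(segments))
    out = [0.0] * total
    for k, seg in enumerate(segments):
        window = hann(len(seg))
        for i, sample in enumerate(seg):
            out[k * hop + i] += window[i] * sample
    return out


# Hypothetical input: four constant segments of 8 samples with 50% overlap.
FRAME_LEN = 8
HOP = FRAME_LEN // 2
segments = [[1.0] * FRAME_LEN for _ in range(4)]
signal = overlap_add(segments, HOP)
```

Wherever two windowed segments overlap, the constant input is reconstructed without amplitude ripple, which is the property that lets time-domain synthesizers join segments at a cost of a few operations per sample.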

We also study corpus-based techniques for speech synthesis, especially in the context of prosody generation. F. Malfrere will soon present his PhD dissertation on this topic. See here for more details.

We also work on developing aids for the handicapped, and have set up several international collaboration projects built on the model of the MBROLA project. See our Projects page for more details.

Last updated December 17, 1999. Send comments to dutoit@tcts.fpms.ac.be