High Quality Text-To-Speech Synthesis of the French Language
Supervised by Prof. Henri Leich
PhD dissertation submitted at the Faculté Polytechnique de Mons, TCTS Lab, 31 bvd Dolez, B-7000 Mons (Belgium)
In the context of High Quality Text-to-Speech (HQ-TTS) synthesis by concatenation, we have investigated the use of four leading speech models: the classical Auto-Regressive (LPC) model (of order 18), the hybrid Harmonic/Stochastic (H/S) model (also denoted as the MBE model), the 'null' model, as implemented by the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) synthesis algorithm, and our own Multi-Band Re-synthesis Pitch-Synchronous OverLap-Add (MBR-PSOLA) model, which combines the computational efficiency of the original TD-PSOLA algorithm with the flexibility of the H/S model. Quality assessment tests have been performed on practical software implementations of the four corresponding HQ-TTS systems. All four used the same segment database and the same test input data (phonemes and prosody), so that differences in the results can be attributed to the models themselves.
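To make the 'null' model concrete, here is a minimal textbook-style sketch of time-domain pitch-synchronous overlap-add. It is not the implementation evaluated in the dissertation. The names `hann` and `td_psola` are hypothetical, pitch marks are assumed to be given, and the pitch is assumed to be nearly constant over the signal.

```python
import math

def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def td_psola(signal, marks, factor):
    """Scale the pitch of `signal` by `factor`, keeping its duration.

    signal: list of float samples.
    marks:  increasing sample indices of pitch marks (assumed given).
    factor: > 1 raises the pitch, < 1 lowers it.
    Assumption: near-constant pitch, so a single average period is used.
    """
    period = round((marks[-1] - marks[0]) / (len(marks) - 1))
    window = hann(2 * period + 1)          # two periods, centred on a mark
    out = [0.0] * len(signal)
    new_period = period / factor           # spacing of synthesis marks
    t = float(marks[0])
    while t <= marks[-1]:
        centre = round(t)
        # Reuse the windowed segment around the closest analysis mark.
        src = min(marks, key=lambda m: abs(m - t))
        for k in range(-period, period + 1):
            i, j = src + k, centre + k
            if 0 <= i < len(signal) and 0 <= j < len(out):
                out[j] += window[k + period] * signal[i]   # overlap-add
        t += new_period
    return out

if __name__ == "__main__":
    marks = list(range(20, 181, 20))
    pulses = [1.0 if i % 20 == 0 else 0.0 for i in range(200)]
    doubled = td_psola(pulses, marks, 2.0)
    print([i for i, v in enumerate(doubled) if v > 0.5])
```

No spectral model is fitted at any point: the waveform is only windowed, shifted, and summed. This is why the method is cheap, and also why it offers no handle for smoothing spectral mismatches at segment boundaries.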
Our conclusions can be summarized as follows:
1. LPC, hybrid H/S, and MBR-PSOLA are superior to TD-PSOLA regarding the availability of automatic analysis procedures, a key point for developing multi-lingual TTS systems.
2. Prosody matching gives comparable results with all four models.
3. As far as segment concatenation capabilities are concerned, which are essential in TTS synthesis, LPC is slightly superior to hybrid H/S, which is itself approximately equivalent to MBR-PSOLA. TD-PSOLA exhibits virtually no segment concatenation capabilities.
Turning to more economic criteria, one notices that:
4. An efficient segment database compression algorithm is available for the LPC and hybrid H/S synthesizers. One is currently being developed for MBR-PSOLA, and it is clear that the resulting compression ratio will be superior to the one obtained for TD-PSOLA, while remaining computationally simple.
5. Because of the computational complexity of their respective synthesizers, the LPC and hybrid H/S approaches clearly cannot do without a DSP. In contrast, both TD-PSOLA and MBR-PSOLA run in real time on a 386-based PC.
Finally, when comparing the quality and intelligibility test results of the four models, it appears that:
6. TD-PSOLA and MBR-PSOLA have the highest CVC scores, with a slight advantage for TD-PSOLA. As expected, the hybrid H/S synthesizer is much more intelligible than the LPC one.
7. Fluidity is better ensured by the MBR-PSOLA and hybrid H/S synthesizers, given their superior concatenation capabilities.
8. Regarding naturalness, TD-PSOLA prevails de facto, since it does not make use of any speech model. MBR-PSOLA, however, is found to be as natural as TD-PSOLA, given its increased fluidity. H/S follows, far ahead of LPC.
Natural Language Processing problems have also been addressed, in the form of the LInguistic Processing for Speech Synthesis (LIPSS) system developed at the TCTS Lab of the Faculté Polytechnique de Mons. LIPSS uses a combination of text and Prolog rules: text rules account for simple linguistic facts, while complex descriptions are written in Prolog and compiled. It is based on a multi-layer expert system analysis strategy, and implements a minimum requirement principle: because LIPSS can handle partial information, the size of the rule and exception dictionaries has been kept to a minimum. Lexicalization has practically been restricted to function words, non-adjectival adverbs, and verbs (not to mention the Grapheme-To-Phoneme exception lexica). The link between the DSP and NLP blocks remains to be made, however, by further improving the Grapheme-To-Phoneme module and designing ad hoc Structural Analysis and Prosody Generation modules.
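As a rough illustration of how a small exceptions lexicon can sit in front of simple text rules, here is a toy grapheme-to-phoneme sketch. It does not reproduce LIPSS or its rule format. The entries in `EXCEPTIONS` and `RULES`, the SAMPA-like symbols, and the function name `to_phonemes` are all assumptions made for the example.

```python
# Toy exceptions lexicon: whole words whose pronunciation the rules
# would get wrong (hypothetical entries, SAMPA-like symbols).
EXCEPTIONS = {"femme": "fam", "monsieur": "m@sj2"}

# Toy context-free rewrite rules (hypothetical, far from complete).
RULES = {"eau": "o", "au": "o", "ou": "u", "ch": "S", "qu": "k"}

def to_phonemes(word):
    """Look the word up in the exceptions first, then apply the
    rewrite rules left to right, preferring the longest grapheme."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    longest = max(len(g) for g in RULES)
    out = []
    i = 0
    while i < len(word):
        for n in range(longest, 0, -1):
            chunk = word[i:i + n]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            # No rule matched: pass the letter through unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)

if __name__ == "__main__":
    for w in ("chou", "bateau", "femme"):
        print(w, to_phonemes(w))
```

Only words that the rules mishandle need a lexicon entry, so the lexicon stays small as long as the rules cover the regular cases.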
Last updated September, 2007.