TTS Synthesis with the hybrid H/S model
The hybrid harmonic/stochastic (H/S) synthesizer demonstrated here is one of the four synthesizers developed in the course of Thierry Dutoit's PhD (see the complete demo). It is described in detail in:
T. DUTOIT, "On the use of a hybrid harmonic/stochastic model for TTS synthesis-by-concatenation", submitted to Speech Communication, to appear, August 1996.
Together with three other TTS systems based on the same diphone database, it has been used for a general comparison of the use of speech models in the context of TTS synthesis, in :
"High Quality Text-To-Speech Synthesis : A Comparison of Four Candidate Algorithms", T. DUTOIT, Proc. ICASSP'94, Adelaide, Australia, 19-22 April 1994, vol. 1, pp. 565-568. (Postscript file of a draft version : 36 Kb)
Last but not least, it has served as a basis in the development of the MBROLA TTS synthesizer, the results of which are now available to internet users in the context of the MBROLA Project.
Demo files (16 kHz/16 bits - SUN .AU format)
This French hybrid H/S TTS synthesizer is based on Griffin's analysis algorithm and uses the OLA synthesis approach, together with the prosody matching and segment concatenation methods detailed in the above-mentioned paper. We have scaled the amplitude of the unvoiced (UV) components by a factor of 0.8 in order to minimize extra breathiness in sonorants without unduly affecting the production of unvoiced sounds.
IMPORTANT : It should be emphasized that, in order to test the segmental quality of this concatenation-based synthesizer independently of suprasegmental effects, we have provided it with prosodic information directly stylized from natural pronunciation of the text.
For example, "bonjour.raw" was obtained from the following input file :
_ 51 25 114
on 127 48 170
j 110 53 116
r 150 50 91
Each line contains a phoneme name, a duration (in ms), and a series (possibly none) of pitch pattern points composed of two integer numbers each : the position of the pitch pattern point within the phoneme (in % of its total duration), and the pitch value (in Hz) at this position. Hence, the first line of bonjour.pho :
_ 51 25 114
tells the synthesizer to produce a silence of 51 ms, and to put a pitch pattern point of 114 Hz at 25% of 51 ms. Pitch pattern points define a piecewise linear pitch curve.
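As an illustration of the format described above, here is a minimal Python sketch that parses such lines and evaluates the resulting piecewise linear pitch curve. It is not the synthesizer's own code: the names `Phoneme`, `parse_pho_line`, `pitch_targets` and `pitch_at` are hypothetical, and holding the pitch constant before the first and after the last pitch pattern point is an assumption, since the text above does not say what happens there.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    name: str
    duration_ms: int
    pitch_points: list  # list of (position_percent, pitch_hz) pairs


def parse_pho_line(line):
    """Parse one line: phoneme name, duration in ms, then zero or more (position %, pitch Hz) pairs."""
    fields = line.split()
    name, duration = fields[0], int(fields[1])
    values = [int(v) for v in fields[2:]]
    if len(values) % 2 != 0:
        raise ValueError("pitch pattern points must come in (position, pitch) pairs")
    return Phoneme(name, duration, list(zip(values[0::2], values[1::2])))


def pitch_targets(phonemes):
    """Convert per-phoneme pitch points into (absolute time in ms, pitch in Hz) pairs."""
    targets = []
    start_ms = 0.0
    for p in phonemes:
        for position_percent, pitch_hz in p.pitch_points:
            targets.append((start_ms + p.duration_ms * position_percent / 100.0, pitch_hz))
        start_ms += p.duration_ms
    return targets


def pitch_at(targets, t_ms):
    """Linearly interpolate the pitch at time t_ms.

    Assumption: the pitch is held constant outside the first and last targets.
    """
    if not targets:
        raise ValueError("no pitch pattern points")
    if t_ms <= targets[0][0]:
        return float(targets[0][1])
    if t_ms >= targets[-1][0]:
        return float(targets[-1][1])
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t0 <= t_ms <= t1:
            if t1 == t0:
                return float(f1)
            return f0 + (f1 - f0) * (t_ms - t0) / (t1 - t0)


# The lines of the input file exactly as listed above.
lines = ["_ 51 25 114", "on 127 48 170", "j 110 53 116", "r 150 50 91"]
phonemes = [parse_pho_line(line) for line in lines]
targets = pitch_targets(phonemes)
```

With these lines, the first target falls at 25% of 51 ms, that is 12.75 ms, with a pitch of 114 Hz, and `pitch_at(targets, t)` gives the interpolated pitch at any time `t` in milliseconds.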
Listeners will notice that the speech suffers from some "roughness", which appears when the model is used to concatenate diphones (copy synthesis experiments with exactly the same analysis/synthesis source code produced synthetic speech which could hardly be distinguished from the original). Even so, the overall quality is superior to that of classical LPC-based synthesis, and the segment concatenation algorithm used here produces very fluid speech: concatenation points can hardly be detected.
We have also included here a number of test files illustrating the experiments we carried out with wide-band hybrid H/S speech and reported in our recent Speech Communication paper (to appear, August 1996, see above).
NB: All speech files listed below are 16-bit/16 kHz .au files. The synthetic ones were all produced with the OLA IFFT algorithm described in the paper.
- French word 'devisageons', original speech (28 kb)
- 'devisageons', copy H/S synthesis, criterion 1, original phases (27 kb)
- 'devisageons', copy H/S synthesis, criterion 1, continuous phases (27 kb)
- 'devisageons', copy H/S synthesis, criterion 5, original phases (27 kb)
- 'devisageons', copy H/S synthesis with prosody modification, criterion 1, original phases (41 kb)
- 'devisageons', copy H/S synthesis with prosody modification, criterion 1, continuous phases (41 kb)
- 'comme un coup donne par l'air', concatenative H/S synthesis, criterion 1, original phases (48 kb)
- 'comme un coup donne par l'air', concatenative H/S synthesis, criterion 1, continuous phases (48 kb)
It should be clear from these examples (and our more extensive tests have confirmed it) that :
- Hybrid H/S copy synthesis achieves very high segmental quality when original phases are maintained (in contrast with the poorer quality obtained when phases are computed as the integrals of fundamental frequency and constrained to be a continuous function of time). Synthetic speech, indeed, can hardly be distinguished from the original.
- Criterion 5 results in a loss of high frequency harmonics, as opposed to criterion 1 (although some extra breath is also encountered with criterion 1).
- Prosodic modifications (with the phase modification algorithm described in the paper) give good results, although the difference between original and continuous harmonic phases is reduced.
- One can hardly separate original phase from continuous phase synthesis when using the H/S model for concatenative synthesis.
- Few concatenation points, if any, can be heard during concatenative synthesis (this is even clearer in the numerous examples given at the top of this page, for concatenative synthesis only).
It remains that H/S concatenative synthesis is unquestionably superior to its LPC counterpart.
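To make the "continuous phases" variant mentioned in the list above more concrete, here is a minimal Python sketch of a phase track obtained by integrating the fundamental frequency over time. It is only an illustration under stated assumptions, not the algorithm of the paper: the function name `continuous_phases`, the fixed frame period, the trapezoidal integration rule, and the convention that harmonic k advances k times faster than the fundamental are all assumptions made for this example.

```python
import math


def continuous_phases(f0_hz, frame_period_s, harmonic=1, initial_phase=0.0):
    """Return the phase (in radians) of one harmonic at each analysis frame.

    The phase is the running integral of 2*pi*harmonic*f0(t), approximated
    with the trapezoidal rule between consecutive frames (an assumption),
    so it is a continuous function of time by construction.
    """
    phases = [initial_phase]
    for previous_f0, current_f0 in zip(f0_hz, f0_hz[1:]):
        mean_f0 = 0.5 * (previous_f0 + current_f0)
        phases.append(phases[-1] + 2.0 * math.pi * harmonic * mean_f0 * frame_period_s)
    return phases


# Hypothetical input: a constant 100 Hz fundamental sampled every 10 ms.
phases = continuous_phases([100.0, 100.0, 100.0], frame_period_s=0.01)
```

In this hypothetical case the fundamental completes one full period per frame, so the phase advances by about 2π between consecutive frames.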
Last updated December 17, 1999, send comments to firstname.lastname@example.org