Noé Tits has held a degree in Electrical Engineering, with a specialization in Signals, Systems and Bio-engineering, from the Engineering Faculty of the University of Mons since June 2017. He carried out his master's thesis at the University of the Basque Country in Bilbao (Spain), in the Aholab laboratory, which specializes in speech processing, where he developed a tool for pathological voice analysis. He is currently pursuing a PhD thesis on expressive speech synthesis using Machine Learning techniques.
PhD thesis (2017 - 2021)
Text-To-Speech synthesis systems have existed for decades and have recently improved with the advent of Deep Neural Networks (DNNs). These systems offer excellent speech quality by learning from tens of hours of speech.
The challenge faced by researchers today has therefore evolved: it is now necessary to produce remarkable voices, similar to those of actors, with a distinctive vocal grain and a great capacity for expressiveness. This is the field of expressive speech synthesis.
In this context, the main issues are the variability in the vocal expression of emotions and the difficulty of annotating large databases with highly subjective expressive metadata that are still poorly defined. Deep Learning has proven effective at handling complex data, but it requires large amounts of such annotated data.
Currently, some databases exist that have been annotated by hand with emotion labels. Those databases are suitable for recognition systems but not for speech synthesis because of their format. Indeed, they contain background noise and speech overlaps, as they generally consist of dyadic conversations.
The strategy proposed in this project is precisely to develop an automated system for the expressive annotation of large voice databases. This system would be trained on existing annotated databases suitable for recognition and then applied to speech databases with high audio quality suitable for speech synthesis.
The result of this study will be twofold, since it will simultaneously produce large annotated databases and the associated expressive speech synthesis system. In order to evaluate the quality of this system, we will place it in a context of interaction in which it has to imitate the expressiveness of its interlocutor.