Projects
The Speech Synthesis Research Group of the Faculté Polytechnique de Mons was created in the 1990s and produced several widely used tools, mostly in the context of the MBROLA project and its follow-up projects, such as MBROLIGN, MBRDICO, and W. From 2000 onward, the activities of the TTS group evolved toward the development of the EULER Project (an attempt at a generic, open source TTS solution).
In 2003, the activities of the group were retargeted toward the processing of voice quality effects, and the group was renamed "VOQUAL". See the VOQUAL Group Web pages for more info.
Current R&D projects
The MediaTIC portfolio was submitted in September 2007 in response to the first call for proposals of the ERDF and started on 1 July 2008. This ambitious project falls within the scope of measure 2.2, dedicated to exploiting the potential of research centres. More concretely, the project's objective is to increase the competitiveness of innovative technology SMEs in Wallonia through collective projects driven by concrete industrial requests. It acts as a cross-cutting action for innovation in the NTIC component of each strategic line defined by the Walloon Marshall Plan.
Numediart is a long-term research programme centered on Digital Media Arts, funded by Région Wallonne, Belgium (grant N°716631). Its main goal is to foster the development of new media technologies through digital performances and installations, in connection with local companies and artists.
CALLAS ("Conveying Affectiveness in Leading-Edge Living Adaptive Systems") is a European Integrated Project (FP6). It aims at designing and developing multimodal architectures that give strong importance to emotions, for Arts and Entertainment. The central idea of the project is that new media targeting the recognition and production of emotions can enhance users' (or spectators') experience and interaction. CALLAS thus investigates how emotions can be detected at the input level and how, at the output level, these emotions can be processed to generate new audiovisual content that enriches the users' experience. The input modalities include both vocal and body language (recorded through video cameras and haptic devices). To improve the recognition of emotions, the problem of merging the information coming from these different modalities will also be examined. Applications range from digital theatre productions (playing audio or visual content in relation to the actors' and spectators' feelings) to real or virtual museum tours (taking the visitor's interest into account to reshape the exhibition and select the level of information the audioguide gives), as well as interactive television (modifying a scenario according to the spectator's emotions).
Intelligibility and expressivity have become the keywords in speech synthesis. To this end, HTS, a system based on the statistical generation of voice parameters from Hidden Markov Models, has recently shown its potential for efficiency and flexibility. Nevertheless, this approach has not yet reached maturity and is limited by the buzziness it produces. This drawback is undoubtedly due to the parametric representation of speech, which induces a loss of voice quality. The first part of this thesis is consequently devoted to the high-quality analysis of speech. In the future, applications oriented towards voice conversion and expressive speech synthesis could also be carried out.
Automatic speech recognition is of great importance for the automatic indexing of audiovisual documents. Indexing broadcast news spread over long periods of time is a challenge from a vocabulary point of view, because new words, names, and places keep appearing. Techniques for updating LVCSR language models (vocabulary and grammar) are therefore necessary. An alternative to LVCSR is keyword spotting, which only requires the phonetic transcription of the new words to be detected. Not all keywords are equal in terms of "detectability". This work focuses on predicting keyword spotting performance and on improving keyword spotting accuracy by adapting decision parameters given a priori information on the words to be detected.
Human speech contains many paralinguistic sounds conveying information about the speaker's (affective) state. Laughter is one of those signals. Due to its high variability, both inter- and intra-speaker (the same person will laugh differently depending on emotional state, environment, etc.), it is difficult to recognize laughter in an audio recording or to synthesize natural-sounding, human-like laughter. In the framework of the CALLAS project, our study aims at capturing the global patterns of laughter in order to develop algorithms that detect it in real time and produce natural laughter utterances. Potential uses cover the broad range of applications using automatic speech recognition and synthesis for human-computer interaction.
PAST stands for Pathology Assessment by Source-Tract separation of speech. Speech is one of the most natural ways for humans to communicate, and it can be affected by disorders when used intensively. These problems especially affect people such as singers or teachers. When the pathology becomes painful, these persons have to undergo a speech assessment performed by a clinician. This examination consists of acoustical, aerodynamic and image recordings, which help the clinician diagnose the degree of pathology. In the field of speech processing, most researchers have been interested in estimating the contributions of the glottal source and the vocal tract to the speech signal. Among these approaches, the ZZT representation was recently proposed and suggests very interesting perspectives. This PhD thesis proposes to use this representation, along with others, to evaluate the impact of pathology by estimating the glottal source and vocal tract contributions to the speech signal.
The main objective of the Action is to develop an advanced acoustical, perceptual and psychological analysis of verbal and non-verbal communication signals originating in spontaneous face-to-face interaction, in order to identify algorithms and automatic procedures capable of identifying human emotional states. Several key aspects will be considered, such as the integration of the developed algorithms and procedures for application in telecommunication, and for the recognition of emotional states, gestures, speech and facial expressions, in anticipation of the implementation of intelligent avatars and interactive dialogue systems that could be exploited to improve user access to future telecommunication services.
There are various methods of analysis aimed at classifying vocal pathologies, but none is really powerful. First, "perceptive" analysis allows the doctor to rate the quality of the voice according to several criteria; the problem with this method is the subjectivity of the judgement. This is why specialists prefer "acoustic" analysis, a computer-assisted method that computes a series of objective parameters on the vocal signal, which are then used to characterize the patient's voice. However, this method is only effective for analyzing sustained vowels, and not continuous speech, which would be more suitable. Moreover, strongly hoarse speakers are unable to produce pseudoperiodic speech.
TTSBOX performs the synthesis of Genglish (for "Generic English"), an imaginary language obtained by replacing English words with generic words. Genglish therefore has a rather limited lexicon, but its pronunciation retains most of the problems encountered in natural languages. TTSBOX uses simple data-driven techniques (bigrams, CARTs, NUUs) while trying to keep the code minimal, so that it remains readable for students with reasonable MATLAB practice.
The goal of the MBROLA project is to obtain a set of high quality speech synthesizers for as many languages as possible, free for use in non-commercial applications. The ultimate goal is to boost academic research on speech synthesis, and particularly on prosody generation, known as one of the biggest challenges in Text-to-Speech synthesis for the years to come. As of 2003, 26 languages and more than 50 voices are available. Many other languages are in preparation. The software has been compiled on 21 machine/OS combinations.
Past R&D projects
The objective of IRMA is to design and develop an innovative modular interface for personalized, efficient, secure, and economically viable multimodal search and navigation in indexed audiovisual databases. It will enable contextual, intuitive, and natural search complemented by fluid navigation. In this way, IRMA will provide an environment that makes the best possible use of the intelligence of the search engine's user.
The main objective of this COST Action is to improve the quality and capabilities of the voice services for telecommunication systems through the development of new nonlinear speech processing techniques. The proposed new mathematical methods are expected to provide advances in generic speech processing functions. Examples of these are: higher quality speech synthesis, more efficient speech coding, improved speech recognition, and improved speaker identification.
The STOP Project aims at studying the relationship between speech dynamics and voice quality, based on home-made tools for efficient source-tract separation.
Armageddon is an opera sung and played in real time by human-controlled robots. It was created by Art Zoyd; the robot voices are taken from the MBROLA Project (under Max/MSP).
The SIMILAR European Network of Excellence will create an integrated task force on multimodal interfaces that respond intelligently to speech, gestures, vision, haptics and direct brain connections by merging into a single research group excellent European laboratories in Human-Computer Interaction (HCI) and in Signal Processing.
NUMBROLA is an extension of MBROLA towards corpus-based, non-uniform unit (NUU) selection techniques in speech synthesis. The goal of NUMBROLA is to provide a standard concatenative synthesizer to people active in NUU research. A French database and a first version of the software have been made available. We are currently working on an improved version, based on a modified MBROLA algorithm: TP-MBROLA.
The main objective of this Action is to create knowledge in several problem areas of spoken language interaction in telecommunications in order to achieve communicative interfaces that provide a natural human-computer interaction through more cognitive, intuitive and robust interfaces, whether monolingual, multilingual or multimodal. The scientific programme emphasises speech and dialogue processing in multimodal communication interfaces, issues related to robustness and multilinguality, human-computer dialogue theories, and models and systems and associated tools for the establishment of interactive systems. The programme also involves the evaluation of telecommunication applications in which spoken language is the only or one of many types of input or output modalities.
The goal of this program is to transcribe a symbolic input, i.e. a string of symbols belonging to some alphabet, into a symbolic output according to a regular grammar described in terms of a system of multi-level rewriting rules (MLRR). "Symbols" and "alphabet" have to be understood here as generic terms: they can be characters, phonemes, syllables, words, phrases, etc. This project is closed but the software is available in Open Source format.
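As an illustration of what rule-based transcription of a symbol string can look like, here is a minimal Python sketch. It is not the MLRR software: it handles a single level only, and the rule format (left context, target, right context, output), the function name `rewrite`, and the toy rules are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical rule format (an assumption of this sketch, not the MLRR syntax):
# (left_context, target, right_context, output). Targets must be non-empty.
Rule = Tuple[str, str, str, str]


def rewrite(symbols: str, rules: List[Rule]) -> List[str]:
    """Scan left to right; at each position apply the first rule that matches.

    A rule matches when its target starts at the current position, the input
    before that position ends with the left context, and the input after the
    target starts with the right context. Unmatched symbols are copied as-is.
    """
    output: List[str] = []
    i = 0
    while i < len(symbols):
        for left, target, right, out in rules:
            end = i + len(target)
            if (symbols.startswith(target, i)
                    and symbols[:i].endswith(left)
                    and symbols.startswith(right, end)):
                output.append(out)
                i = end
                break
        else:
            # No rule applies here: pass the symbol through unchanged.
            output.append(symbols[i])
            i += 1
    return output


# Toy letter-to-sound rules, ordered from most to least specific.
TOY_RULES: List[Rule] = [
    ("", "ph", "", "f"),   # "ph" is rewritten as "f" in any context
    ("", "c", "e", "s"),   # "c" before "e" is rewritten as "s"
    ("", "c", "", "k"),    # any other "c" is rewritten as "k"
]

result = rewrite("phace", TOY_RULES)
```

Rule order matters in this sketch: the context-dependent rule for "c" must come before the default one, since the first matching rule wins.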
Confidence measures for the results of speech/speaker recognition make these systems more useful in real-time applications. A confidence measure provides a test statistic for accepting or rejecting the recognition hypothesis of a speech/speaker recognition system. Such systems are usually based on statistical modeling techniques, and in this thesis we define confidence measures for the statistical modeling techniques they use. For speech recognition, we tested available confidence measures and a newly defined confidence measure based on acoustic prior information under two error-inducing conditions: out-of-vocabulary words and the presence of additive noise. We showed that the newly defined confidence measure performs better in both tests. A review of speech recognition and speaker recognition techniques and of some related statistical methods is given throughout the thesis. We also defined a new interpretation technique for confidence measures, based on the Fisher transformation of likelihood ratios obtained in speaker verification. This transformation provides a linearly interpretable confidence level which can be used directly in real-time applications such as dialog management. We also tested the confidence measures for speaker verification systems and evaluated their efficiency for the adaptation of speaker models. We showed that using confidence measures to select adaptation data improves the accuracy of the speaker model adaptation process. Another contribution of this thesis is the preparation of a phonetically rich continuous speech database for the Turkish language. The database is used for developing an HMM/MLP hybrid speech recognition system for Turkish. Experiments on the test sets of the database showed that the speech recognition system has good accuracy for long speech sequences, while performance is lower for short words, as is the case for current speech recognition systems for other languages. A new language modeling technique for the Turkish language is introduced in this thesis, which can also be used for other agglutinative languages. Performance evaluations showed that the newly defined language modeling technique outperforms the classical n-gram language modeling technique.
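The idea of a confidence measure as a test statistic for accepting or rejecting a hypothesis can be sketched generically in Python as follows. This is not the measure defined in the thesis: it is a plain average log-likelihood ratio against a background model, and the function names, the background model, and the threshold value are assumptions made for illustration.

```python
from typing import Sequence


def confidence_score(hyp_loglik: Sequence[float],
                     bg_loglik: Sequence[float]) -> float:
    """Average per-frame log-likelihood ratio.

    hyp_loglik: frame log-likelihoods under the recognized hypothesis model.
    bg_loglik:  frame log-likelihoods under a competing background model
                (hypothetical; any alternative model could play this role).
    """
    if len(hyp_loglik) != len(bg_loglik) or not hyp_loglik:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(h - b for h, b in zip(hyp_loglik, bg_loglik))
    return total / len(hyp_loglik)


def accept(score: float, threshold: float = 0.0) -> bool:
    """Accept the hypothesis when the score reaches the decision threshold.

    The default threshold of 0.0 is an arbitrary choice for this sketch; in
    practice it would be tuned on held-out data.
    """
    return score >= threshold


score = confidence_score([-1.0, -2.0], [-2.0, -4.0])
decision = accept(score)
```

Normalizing by the number of frames keeps scores comparable between short and long utterances.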
This book addresses the problems of spoken dialogue system design and especially automatic learning of optimal strategies for man-machine dialogues. Besides the description of the learning methods, this text proposes a framework for realistic simulation of human-machine dialogues based on probabilistic techniques, which allows automatic evaluation and unsupervised learning of dialogue strategies. This framework relies on stochastic modelling of modules composing spoken dialogue systems as well as on user modelling. Special care has been taken to build models that can either be hand-tuned or learned from generic data.
For years, non-coordinated research effort on the design of text-to-speech (TTS) systems has led to unavoidable cross-system and cross-language incompatibility. The EULER project aimed at producing a unified, extensible, and publicly available research, development and production environment for multilingual TTS synthesis. EULER has led to the development of a corpus-based French TTS system. The project is no longer supported, but the software components are still available.
MBRDICO is a talking dictionary using MBROLA as a back-end speech synthesizer. Text processing is performed using a complete GNU GPL package for automatic phonetization training (letter/phoneme alignment, decision tree building, stress assignment) and duration/intonation generation. French, US English, and Arabic are available. We no longer work directly on this project, but all its sources are available for use or extension. This work is the result of a collaboration between:
MBROLIGN is a fast MBROLA-based text-to-speech aligner. It is provided free for use in non-commercial applications. The goal of this project is to create large phonetically and prosodically labeled speech databases for as many languages as possible, thereby drastically expanding the reach of speech technology. This project is currently closed, but the software is available for database creation.
The W project aimed at creating a fast computer keyboard driver for people with speech disabilities. The related software is based on the grade II Braille languages developed by associations of blind people all over the world, and it minimizes the number of keystrokes needed to utter a word (the name of the project is the grade II abbreviation for "word" in English). This project has been extended by MULTITEL ASBL in the framework of the FASTY EC/FP5 Project.
Speaker Recognition in Telephony
OOBP is a programming paradigm developed at TCTS Lab since 1994. It is defined as Object Oriented Programming around processes and combines OOP and block descriptions. Plug and Play Software extends OOBP by defining input and output data as abstract streams.
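To give a feel for objects organized around processes that exchange abstract streams, here is a minimal Python sketch. It is not the TCTS implementation: the class names `Block`, `Gain` and `MovingAverage`, the `chain` helper, and the use of Python iterators as the stream abstraction are all assumptions made for this example.

```python
from typing import Iterable, Iterator, List


class Block:
    """A processing block: consumes an input stream, produces an output stream."""

    def process(self, stream: Iterable[float]) -> Iterator[float]:
        raise NotImplementedError


class Gain(Block):
    """Multiplies every sample by a constant factor."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def process(self, stream: Iterable[float]) -> Iterator[float]:
        for sample in stream:
            yield sample * self.factor


class MovingAverage(Block):
    """Averages the most recent `width` samples (fewer at the start)."""

    def __init__(self, width: int) -> None:
        self.width = width

    def process(self, stream: Iterable[float]) -> Iterator[float]:
        window: List[float] = []
        for sample in stream:
            window.append(sample)
            if len(window) > self.width:
                window.pop(0)
            yield sum(window) / len(window)


def chain(blocks: List[Block], source: Iterable[float]) -> Iterator[float]:
    """Plug blocks together: each block's output becomes the next block's input."""
    stream: Iterable[float] = source
    for block in blocks:
        stream = block.process(stream)
    return iter(stream)


smoothed = list(chain([Gain(2.0), MovingAverage(2)], [1.0, 2.0, 3.0]))
```

Because each block only sees an iterable, the same chain works whether the source is a list, a file reader, or a live generator, which is the point of treating inputs and outputs as abstract streams.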