
INTRODUCTION

Project Overview

 

Objectives

 

The basic theme of this project, referred to as SPRACH (SPeech Recognition Algorithms for Connectionist Hybrids), is to build upon WERNICKE (ESPRIT Basic Research Project 6487, October 1992-October 1995) to further develop new theories, algorithms, hardware and software tools for the extension of hybrid Hidden Markov Model (HMM) / Artificial Neural Network (ANN) methods to different continuous speech recognition systems. While continuing the theoretical and development work successfully carried out in WERNICKE, this new project also aims at extending the WERNICKE results to new languages (UK English, French and Portuguese) and to flexible speech recognition systems that can easily be adapted to new domains with new lexica and new syntaxes. One of the SPRACH objectives is therefore to develop powerful tools that allow easy adaptation and testing of the known (as well as the newly developed) technology on different tasks.

In WERNICKE, on top of substantial theoretical results [2], it was demonstrated (see, e.g., [15]), using standard international reference databases (such as the unlimited vocabulary ARPA North American Business News database and the EU funded SQALE project), that the hybrid HMM/ANN approaches lead to competitive state-of-the-art systems. Furthermore, the investigated hybrid approach was shown to have additional advantages in terms of CPU utilization and memory bandwidth. It is, however, our belief that such systems can also be more flexible and more robust. In addition to building on the WERNICKE large vocabulary continuous speech recognition system, SPRACH investigates the development of systems for smaller, task independent applications, with no need to retrain the system or develop a new lexicon or grammar when moving from one task to another.

Industrial Relevance

 

Motivated by the results achieved in WERNICKE, several industrial and academic laboratories have recently compared the hybrid approaches developed in WERNICKE with the best classical HMM approaches on a number of speech recognition tasks. In cases where the comparison was controlled, the hybrid approach performed better when the number of parameters was similar, and about the same in some cases where the classical system used many more parameters. Evidence for this can be found in a number of sources, including:

The most recent results, those of the EU funded SQALE evaluations, show the hybrid approach slightly ahead of more traditional HMM systems. The hybrid system was evaluated on both British and American English tasks, using a 20,000 word vocabulary and a trigram language model, along with the other leading European systems produced by LIMSI (France), Philips (Germany) and Cambridge University/HTK (UK) [18]. Additionally, the hybrid system was efficient in its runtime CPU and memory requirements.

Finally, the hybrid HMM/ANN approaches developed in WERNICKE are quite general and can be applied to other tasks. Recently, this approach was adopted by several laboratories to handle speaker verification [13] (NYNEX), handwriting recognition [17] (AT&T), gene classification and fault diagnosis [19].

Partners

 

The partners are:

Industrial Advisory Board

 

To further reinforce the industrial relevance of this project and its possible industrial impact, four major industrial partners agreed to be part of the SPRACH Industrial Advisory Board with the aim of (1) guiding the research partners through the cooperative definition of potential applications, test tasks and development prototypes, and (2) maintaining an awareness of current and future developments in the area.

These industrial partners are: (1) British Broadcasting Corporation (BBC), UK, (2) Thomson CSF, France, (3) Daimler-Benz, Germany, and (4) CSELT, Italy. It is clear that all of them are highly interested in the possible outputs of the present project. Furthermore, it is worth noting that:

  1. BBC and Thomson are particularly interested in automatic indexing of spoken language and in recognition of broadcast speech (which is one of the specific applications considered in this project).
  2. Daimler-Benz is very active in the area of speech recognition and has an interest in learning about potential advantages of the hybrid HMM/ANN technology, particularly for robust systems. Daimler-Benz is also one of the German industries funding ICSI, the US subcontractor of the current project.
  3. CSELT is also a major player in European speech recognition technology and is committed to turning this technology into products. Recently, they presented a (patent pending) speech recognition system based on hybrid HMM/ANN technology [8].

Expected Results

 

Possible applications and demonstration systems that are targeted in this project include:

  1. Very large vocabulary (64K words) continuous speech recognition of read speech. This will be an essential enabling technology for many multimedia and telematics applications.
  2. Voice-driven typewriter: A dictation system running in real time with simple editing commands.
  3. Flexible continuous speech recognizer in which lexica and grammars can be defined on the spot, without the need of training.
  4. Smaller (but realistic) tasks, including, e.g., robust recognition of free format numbers. This could be done on the basis of existing databases like the OGI numbers databases.
  5. Recognition of broadcast speech: transcription of radio or television speech (e.g. news-readers).
  6. Extension of the above to several European languages. On top of the properties discussed above, another interesting feature of the hybrid systems is that they do not seem to require extensive knowledge of the languages or their phonological rules to adapt the recognizer. With appropriate databases (which become more and more available), development of a new language is quite straightforward.

To conclude this introduction, we also remind the reader that in this project all the partners use common, fast and flexible hardware (SPERT) that has been developed by ICSI, the SPRACH subcontractor. As already shown in WERNICKE, the availability of common hardware and software that is somewhat customized for the research approaches under investigation permits both the incorporation of very computationally intensive algorithms and the comparison of their efficacy across the different sites.

Work plan Overview

General Analysis

 

As already mentioned, this project builds upon the 1992-1995 ESPRIT project WERNICKE which developed a state-of-the-art, speaker independent, large vocabulary continuous speech recognition system (comparable with the best) that is significantly more compact and efficient than its competitors.

WERNICKE also demonstrated that hybrid HMM/ANN technology is a viable, and probably preferable, basis for the goals of this project (e.g., it is more compact, less "specialized" and, consequently, easier to adapt to new tasks and new languages). The resulting hybrid HMM/ANN systems have proven to be good alternatives to standard HMM technology. This is particularly promising since it seems to be more and more difficult to improve on standard HMMs, and the need for alternative technologies and new paradigms is often acknowledged by scientists working in this field. As briefly discussed in Section 0.1.2, this technology has also proven to be potentially useful in other application domains. The output of WERNICKE can thus be considered successful and has already attracted substantial interest from several industries. However, it is clear that there is still much to be done to improve the existing system.

As briefly explained in Section 0.1.1, the fundamental aim of the present project is to further develop and optimize our hybrid HMM/ANN speaker independent, large vocabulary (64K words), continuous speech recognizers, and continue their comparison with other state-of-the-art systems. In SPRACH, the advantages of hybrid HMM/ANN systems are further exploited by extending the systems to new languages (UK English, French and Portuguese) and to flexible speech recognition systems that can easily be adapted to new domains with new lexica and new syntaxes.

To achieve this goal, the approach followed in this project has been built upon several basic parts, spread across different Work Packages, with very strong relationships and inter-dependencies:

  1. Extension of baseline HMM/ANN systems (available for American English and UK English) to French and Portuguese, and adaptation to different assessment databases. This is covered by Work Packages WP1 (for databases and baseline systems), WP2 (for lexica and automatic learning of lexica) and WP3 (for language models and language model adaptation).
  2. Development, and assessment on applications defined in Section 0.1.5, of task independent hybrid HMM/ANN recognizers in UK English, US English (for international assessment), French and Portuguese. This requires: (1) large databases in the targeted languages (covered by WP1), (2) automatic generation of phonetic transcription and phonological rules of new lexica (covered by WP2), (3) fast adaptation of language models (covered by WP3), and (4) task-independent acoustic models robust to noise and channel conditions (covered by WP4).

    Formal assessment of these systems is not always possible. However, prototype systems will be set up regularly and will be made available for testing by our industrial advisors (on applications possibly defined by them); this is covered by WP7. Whenever possible, formal assessment will also be done on smaller databases (with or without retraining) when available; this will be the case for the OGI free format numbers, as mentioned in Section 0.1.5.

  3. Following the WERNICKE format, formal assessment and comparisons with other state-of-the-art systems via international competition on the basis of common databases are being pursued. Therefore, this project has to put a large effort into the use of speech data that are widely used for evaluating continuous speech recognizers all around the world. This is covered by WP1 and WP7 (since training and assessment on large common databases requires substantial effort and was originally underestimated in WERNICKE). In WP7, a task exclusively devoted to maintaining a good and efficient decoder for large lexica has been added.
  4. Development and evaluation of new theories and methods to improve or go beyond the existing hybrid HMM/ANN systems. This constitutes the "research core" of this project, and is addressed in work package WP5. In this work package, several promising approaches that could go beyond the initial hybrid HMM/ANN systems and improve them have been listed. Although this part is more research oriented, it is not too speculative, since preliminary work has already been done in each of the mentioned areas and since these are closely related to the above mentioned issues.
  5. Use of common hardware and software tools to help the research and to implement resulting algorithms (covered by WP6). This was shown to be particularly useful and efficient in WERNICKE since:
    1. This forces all the partners to work on the same software and hardware.
    2. Although hybrid HMM/ANN approaches appear to show several advantages in terms of performance and reduced complexity during recognition, this is achieved at the cost of drastically increased training time, which makes further European developments and investigations in this field (and probably also in many other problems involving ANN algorithms) impossible without special hardware. This kind of hardware and associated software does not exist in Europe yet, and its development would probably require tens of man-years. Note that some more specialized computers have been developed for this purpose in Europe, but they are less suited to the flexible programming needs of a research environment such as that of WERNICKE.
    3. This significantly reduces research and test cycles.
Recently, our subcontractor ICSI released (as originally planned in WERNICKE) their full-custom single chip vector microprocessor that will be used in this project. This processor was designed to be a good match to the kind of research that is being done by the SPRACH partners. However, to surpass the level of performance obtained by high-end workstations, the design needed to be somewhat specialized for the relevant styles of computation. To permit use of this chip that is both efficient and flexible along the lines of research pursued by this group, ICSI keeps developing software classes that support all the computation for the kinds of neural networks that are used in this project.

Structure, Work Packages and Tasks

In short, on top of WP0 on Project Management, eight work packages have been defined:

  1. WP1: Database gathering from different sources and set up of baseline systems. In this framework, the large vocabulary, continuous speech recognizer resulting from WERNICKE will be extended to French and Portuguese.
  2. WP2: Development of (and development tools for) lexica for multiple languages, including baseline dictionaries for new languages and automatic learning of new dictionaries.
  3. WP3: Development tools and research on different approaches to represent and adapt language models (LM), with particular focus on generality and ease of use.
  4. WP4: Development tools and research on application domain independence and adaptation, including task independence of acoustic models, and unsupervised adaptation and training of speakers and acoustic models.
  5. WP5: More fundamental research into important issues related to speech recognition in general and hybrid systems in particular, including perceptual models, global discrimination, mixture of experts, and others. It is expected that, as for WERNICKE, research into those very well defined promising research areas will lead to further enhancement of our existing systems.
  6. WP6: Development of the software and hardware tools necessary to carry out the proposed work. As already shown with WERNICKE, this is particularly important in (1) reducing the research cycle and (2) forcing all the partners to work on the same software and hardware basis.
  7. WP7: Evaluations and prototype development to regularly assess the progress of SPRACH. Building upon the WERNICKE software, it is expected that some of those demonstration systems will actually be close to real "products".
  8. WP8: Results dissemination and exploitation.

In the table below, we summarize the work packages, broken down into their component tasks and their estimated manpower.

Strong interaction between all the partners and all work packages is guaranteed through the use of the same hardware, software and (research, i.e., US English) databases. Only language specific developments (UK English, French and Portuguese) will be carried out by the respective sites.

General Status

At the end of its first year, the general status of this project is quite satisfactory. The preparation of the Portuguese database acquisition is progressing well, and a Portuguese dictionary has been built. A first version of a baseline Portuguese speaker-independent continuous speech recognition system was built successfully. Unfortunately, no work could be done on the development of a French large vocabulary speech recognition system, due to the lack of a database. Work on a variety of novel language modelling techniques is in progress, with some preliminary results reported in WP 3. A vocabulary independent isolated word recognition system has been developed. The Linear Input Network (LIN) technique for speaker adaptation has been further investigated at CUED. Work on large vocabulary continuous speech recognition has also continued. Many new techniques have been investigated in WP 5, including SPAM, sub-band based models, REMAP, and mixtures of experts for speaker adaptation.

All the partners have been equipped with the SPERT board developed at ICSI, allowing the use of algorithms requiring high performance hardware. The software was roughly functional at an early stage, but 16-bit variables were found to be inadequate for the weights used in our training algorithms. Modified software was developed to update, store, and retrieve 32-bit weights. The modified fixed-point routines appear to have resolved the differences between floating point and fixed point training for the feedforward neural network. We are now working on a corresponding resolution for the recurrent neural network (RNN). A speech training and recognition toolkit, compatible with existing software, has been developed at FPMs. CUED released a new version of AbbotDemo in September. The consortium has decided to make the software developed in the framework of WERNICKE and SPRACH available to the research community.
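The weight-precision issue mentioned above can be illustrated with a small sketch. It is not the SPERT software: the fixed-point formats (12 and 28 fractional bits), the update size and the helper names `quantize` and `train` are assumptions chosen only for illustration. It shows how a weight update smaller than half the fixed-point resolution is rounded away at every step with a narrow format, while a wider format accumulates it.

```python
def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits (simulated fixed point)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def train(frac_bits, steps=1000, update=1e-5, start=0.5):
    """Apply the same small update repeatedly, storing the weight in fixed point."""
    w = quantize(start, frac_bits)
    for _ in range(steps):
        w = quantize(w + update, frac_bits)
    return w


# Assumed formats: 12 fractional bits for a 16-bit weight,
# 28 fractional bits for a 32-bit weight.
w16 = train(frac_bits=12)
w32 = train(frac_bits=28)

print("narrow format:", w16)  # the update is lost at every step
print("wide format:  ", w32)  # close to the floating point value 0.51
```

With the narrow format each update is below half of the resolution (2**-12), so the stored weight never moves; with the wide format the accumulated value stays close to the floating point result, which is the kind of discrepancy between fixed point and floating point training described above.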

Organization of this Progress Report

The general guidelines for the format of this progress report were "short" and "precise". Theoretical results developed previously and related to the approaches used in this work are not recalled in the written documents (although they will probably be briefly recalled during the formal presentation); only references to these theoretical results are provided. In case of new results, general technical descriptions are given in the technical section of the progress report (section on "Technical Description" for each task); a more detailed description is given in the Deliverables when necessary. Technical reports and publications have been included as parts of these Deliverables.

The outline for each workpackage write-up is the following:

  1. WP Overview: List of workpackage manager and partners and short description of the workpackage.
  2. Milestones and Deliverables: List of T0+12 Milestones and Deliverables and pointers to the following sections.
  3. For each Task x:
  4. Conclusion





Jean-Marc Boite
Tue Jan 7 12:46:31 MET 1997