Work Package Manager: CUED
Executing Partners: CUED, FPMs, INESC, SU, ICSI
In recent years, formal evaluations have provided a way of both quantifying the effect of algorithmic developments in speech recognition and disseminating these advances to the research community.
Task Coordinator: SU
Executing Partners: CUED, FPMs, INESC, SU, ICSI
Note: this task was named ``Decoder Issues'' in the proposal.
An achievement of the WERNICKE project was the development of an efficient single-pass decoder, NOWAY. In SPRACH three principal decoder issues are investigated:
To integrate software coming from different sources, FPMs decided to develop the ``Speech Training and Recognition Unified Tool'' (STRUT).
The STRUT software is now in use by all researchers in speech recognition at FPMs, as well as in other speech labs. STRUT is able to do speech analysis, model training and speech recognition. It does not contain any large vocabulary decoder, as it is compatible with NOWAY, but a lattice decoder has been written. A User's Guide can be found at:
http://tcts.fpms.ac.be/speech/strut/users-guide/users-guide.html
A PostScript version is also available:
The current release version of NOWAY is v1.14. The following extensions and algorithmic enhancements have been made to the NOWAY decoder:
Using the 1994 NAB Hub 3 data (Sennheiser mic) with a 60,000 word vocabulary, NOWAY performs decoding using an average of less than 800 phone models per frame (corresponding to about 1.35 realtime on a 167 MHz UltraSparc) resulting in a search error of less than 4%.
Additionally, an initial version of a lattice decoder has been developed, with an alpha release scheduled for the near future.
The NOWAY decoder was initially developed as part of WERNICKE and has since been further developed as part of SPRACH. It uses a start synchronous search architecture based on stack decoding (fig. 7.1). There is a stack of hypotheses for each time t, with each hypothesis containing a reference time (t), a proposed transcription covering time 0 to t and a log probability score. The hypotheses in the stack with the lowest reference time are extended in parallel by all possible extension words (modulo pruning), using a pronunciation prefix tree. A particular feature of the search algorithm is the direct use of posterior probabilities in phone deactivation pruning, an approach which offers up to an order of magnitude speedup with little or no search error. Structural advantages of this search algorithm include the facts that each hypothesis contains the entire history and that the acoustic model (and not the language model) generates possible extensions. This allows a very simple interface between the search and the language model (the language model need only return the probability of a string of words being extended by a hypothesized extension word), and also means that the vocabulary is defined by the acoustic model, so long as there is a default symbol in the language model for ``unknown'' words. We have recently run a decoding using a 350,000 word vocabulary (with a language model vocabulary of 60,000 words); a similar recognition rate to the standard 60,000 word decoding was achieved, with a three-fold increase in the size of the active search space.
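The phone deactivation pruning described above can be illustrated with a minimal sketch. It assumes that the acoustic model supplies one posterior probability per phone per frame and that a phone is simply excluded from the search at any frame where its posterior falls below a fixed threshold. The function name, the threshold value and the toy posteriors are illustrative assumptions, not values taken from NOWAY.

```python
def active_phones(posteriors, threshold=1e-4):
    """Return, for each frame, the set of phone indices kept in the search.

    posteriors: list of frames, each a list of phone posterior probabilities.
    threshold:  hypothetical pruning threshold (not a NOWAY default).
    """
    return [
        {phone for phone, prob in enumerate(frame) if prob >= threshold}
        for frame in posteriors
    ]


# Toy posteriors for two frames and four phones (illustrative values only).
frames = [
    [0.90, 0.09, 0.01, 0.00001],
    [0.40, 0.59, 0.00005, 0.00995],
]
kept = active_phones(frames)
print(kept)
```

Because the decision depends only on the local posterior, phones can be deactivated before any search hypotheses are examined at that frame.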
Figure 7.1: Illustration of the start synchronous search strategy. The stack at time is being processed. The most likely hypothesis at this time (``efficient search'') is extended by the most probable one word extensions (``a'' and ``algorithm'' are illustrated). The resultant extended hypotheses are inserted into the stack at their reference time --- in this case ``algorithm'' has duration . In practice all hypotheses with identical reference times are extended in parallel.
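The start synchronous strategy can be sketched as follows. This is a simplified illustration, not the NOWAY implementation: the callback `extensions(t)` stands in for the acoustic search over the pronunciation tree and is assumed to return candidate words starting at time t together with their end time and acoustic log score, while `lm_logprob(history, word)` is the only call made to the language model. The beam width, the stack size limit and the toy data are hypothetical.

```python
from collections import defaultdict


def start_synchronous_search(n_frames, extensions, lm_logprob,
                             beam=8.0, max_hyps=5):
    """Toy start synchronous stack search; returns (log score, words) or None."""
    stacks = defaultdict(list)      # reference time -> [(log score, words)]
    stacks[0].append((0.0, ()))
    for t in range(n_frames):       # always process the earliest stack first
        stack = sorted(stacks.pop(t, []), reverse=True)
        if not stack:
            continue
        best = stack[0][0]
        # Prune: keep at most max_hyps hypotheses within the beam of the best.
        survivors = [h for h in stack[:max_hyps] if h[0] >= best - beam]
        # The acoustic side proposes extensions once per start time;
        # every extension must end later than t so the search advances.
        candidates = [c for c in extensions(t) if t < c[1] <= n_frames]
        for score, words in survivors:
            for word, end, acoustic in candidates:
                total = score + acoustic + lm_logprob(words, word)
                stacks[end].append((total, words + (word,)))
    final = stacks.get(n_frames, [])
    return max(final) if final else None


# Hypothetical word candidates: start time -> (word, end time, acoustic score).
TOY = {
    0: [("efficient", 3, -2.0), ("a", 1, -4.0)],
    1: [("fish", 3, -5.0)],
    3: [("search", 5, -1.5), ("surge", 5, -1.0)],
}
BIGRAMS = {("efficient", "search"): -0.5}


def extensions(t):
    return TOY.get(t, [])


def lm_logprob(history, word):
    prev = history[-1] if history else "<s>"
    return BIGRAMS.get((prev, word), -3.0)   # flat back-off score (assumption)


result = start_synchronous_search(5, extensions, lm_logprob)
print(result)
```

Each hypothesis carries its full word history, so the language model interface reduces to a single function of the history and the proposed word, as the text describes.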
Some pruning improvements have been achieved using unigram smearing and an improved initialization of the least upper bound (pruning reference). Unigram smearing involves approximating the trigram language model probabilities by the unigram probabilities at the state level (within words). This may be computed in advance, by storing in each node of the pronunciation tree the maximum unigram probability over all words passing through that node. This weak incorporation of unigram probabilities is effective if a small search error (typically less than 5%) is tolerated, halving the search space. A further improvement in pruning was obtained by initially estimating the least upper bound (pruning reference) by a depth-first single word extension along the locally most probable node sequence.
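The precomputation behind unigram smearing can be sketched as below, assuming a pronunciation prefix tree keyed by phone labels. The class, the function name, the toy lexicon and the unigram values are illustrative assumptions; only the idea of storing the maximum unigram probability at each node comes from the text.

```python
class Node:
    """One node of a pronunciation prefix tree."""

    def __init__(self):
        self.children = {}   # phone label -> Node
        self.words = []      # words whose pronunciation ends at this node
        self.smear = 0.0     # max unigram probability of words passing through


def build_smeared_tree(lexicon, unigram):
    """Build the tree and store the smeared unigram value at every node."""
    root = Node()
    for word, phones in lexicon.items():
        prob = unigram[word]
        node = root
        node.smear = max(node.smear, prob)
        for phone in phones:
            node = node.children.setdefault(phone, Node())
            node.smear = max(node.smear, prob)
        node.words.append(word)
    return root


# Hypothetical lexicon and unigram probabilities.
lexicon = {
    "a": ["ah"],
    "all": ["ao", "l"],
    "also": ["ao", "l", "s", "ow"],
}
unigram = {"a": 0.02, "all": 0.003, "also": 0.001}

root = build_smeared_tree(lexicon, unigram)
print(root.smear, root.children["ao"].smear)
```

During the search, the value stored at a node can be applied as an early language model estimate while still inside a word, then replaced by the full language model probability once the word identity is known.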
More complex acoustic and language models may be rapidly evaluated using word graphs (or lattices). All the information for lattice generation is produced as a side effect of the start synchronous search strategy, so the development of the lattice generation facility may be classed as a software advance. An advantage of the start synchronous search architecture is that portions of intraphrase silence may be recognized without interfering with the search-LM interface, and may thus be included in the lattices that are generated. To complement lattice generation, a lattice decoder is under development (currently in pre-alpha test state). Although the search-LM interface in NOWAY is straightforward, it is easier to implement novel LMs within a lattice decoder, since efficiency issues (such as LM caching) need not be considered in the smaller search space of the word graph.
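As an illustration of why a word graph makes it cheap to try a new language model, the following sketch finds the best path through a small lattice under a bigram-style LM. It is not the lattice decoder under development: the edge format, the assumption that node identifiers increase along every edge (for example, frame indices), the scale factor and the toy scores are all hypothetical.

```python
from collections import defaultdict


def rescore_lattice(edges, start, end, lm_logprob, lm_scale=1.0):
    """Best (log score, words) from start to end under a bigram-style LM.

    edges: list of (from_node, to_node, word, acoustic_logprob), with
           from_node < to_node for every edge (assumption).
    """
    outgoing = defaultdict(list)
    for u, v, word, acoustic in edges:
        outgoing[u].append((v, word, acoustic))
    # The search state is (node, previous word) so that bigram context is kept.
    best = {(start, "<s>"): (0.0, ())}
    for node in sorted(outgoing):
        states = [(key, val) for key, val in best.items() if key[0] == node]
        for (_, prev), (score, words) in states:
            for v, word, acoustic in outgoing[node]:
                new = score + acoustic + lm_scale * lm_logprob(prev, word)
                key = (v, word)
                if key not in best or new > best[key][0]:
                    best[key] = (new, words + (word,))
    finals = [val for (node, _), val in best.items() if node == end]
    return max(finals) if finals else None


# Hypothetical lattice: "surge" is acoustically better, "search" fits the LM.
edges = [
    (0, 1, "the", -1.0),
    (1, 3, "search", -2.0),
    (1, 3, "surge", -1.5),
    (3, 4, "algorithm", -1.0),
]
BIGRAMS = {("the", "search"): -0.5, ("search", "algorithm"): -0.5}


def lm_logprob(prev, word):
    return BIGRAMS.get((prev, word), -3.0)   # flat back-off score (assumption)


best_path = rescore_lattice(edges, 0, 4, lm_logprob)
print(best_path)
```

The lattice contains only a handful of edges per time region, so the language model is queried far less often than in the full first-pass search.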
The STRUT software consists of many ``independent'' small pieces of code, one for each module in the process of speech recognition: sampling, feature extraction, clustering, probability estimation, and decoding. Data exchange between programs can be done by files, pipes or sockets. For this structure to work well, data and file formats have to be precisely defined. Each data file has an ASCII header of at least 1024 bytes, describing the contents of the file. Putting as much information as possible in those headers allows the user to check each step of the recognition or training process. Routines have been developed to edit or remove those headers, as well as to read or write the data. The format of the header has been inspired by the format defined by the National Institute of Standards and Technology (NIST) for speech waveform files. This makes STRUT compatible with databases provided by the Linguistic Data Consortium (LDC), and allows the SPeech HEader REsources (SPHERE) software package to be used to handle those headers.
Four feature analysis procedures are currently available: PLP, Rasta-PLP, LPC-Cepstrum and CMS.
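A header of this kind can be read with a few lines of code. The sketch below assumes the NIST SPHERE layout (a `NIST_1A` magic line, a header size line, then `name -type value` fields terminated by `end_head`); the exact fields used by STRUT are not given here, so the field names in the example are hypothetical, as are the function names.

```python
def make_header(fields, size=1024):
    """Build a SPHERE-style ASCII header padded with spaces to `size` bytes."""
    text = "NIST_1A\n%7d\n" % size
    for name, value in fields.items():
        if isinstance(value, int):
            text += f"{name} -i {value}\n"
        elif isinstance(value, float):
            text += f"{name} -r {value}\n"
        else:
            text += f"{name} -s{len(value)} {value}\n"
    text += "end_head\n"
    return text.ljust(size).encode("ascii")


def parse_header(raw):
    """Return (header_size, fields) from the start of a file's bytes."""
    magic, size_text = raw[:16].decode("ascii").split()
    if magic != "NIST_1A":
        raise ValueError("not a SPHERE-style header")
    header_size = int(size_text)
    fields = {}
    # Only the header is decoded; the data after it may be binary.
    for line in raw[:header_size].decode("ascii").split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line:
            continue
        name, ftype, value = line.split(None, 2)
        if ftype == "-i":
            value = int(value)
        elif ftype == "-r":
            value = float(value)
        fields[name] = value
    return header_size, fields


# Hypothetical fields followed by two bytes of binary data.
header = make_header({"sample_rate": 16000, "feature_type": "PLP"})
size, fields = parse_header(header + b"\x00\xff")
print(size, fields)
```

Since the header size is stored in the header itself, a reader can skip straight to the data, and tools that only inspect or edit the description never need to touch the samples or feature vectors.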
Three speech recognition methods can be implemented with STRUT:
Version 2.0 of NOWAY will be released with the alpha release of SPRACHWORKS on 1 Feb 1997 (see task T8.1). An alpha-test version of the lattice decoder will also be released at that time.
Planned enhancements include the following:
STRUT still needs to be improved and cleaned up in view of the SPRACHWORKS release (see Section 8.3 for a description of SPRACHWORKS).
Task Coordinator: SU
Executing Partners: CUED, FPMs, SU
The Cambridge University connectionist speech recognition group (CU-CON) has participated in the Advanced Research Projects Agency (ARPA) sponsored CSR evaluations since 1993. Various sites from the USA and Europe participate in these evaluations, which consist of a number of benchmark tests. The tests specify acoustic and language model training data, and this allows the direct comparison of a number of different systems. The ABBOT system was shown to be competitive with state-of-the-art HMM systems in the 1995 evaluations [42,34]. Indeed, the system was shown to be more robust to varying channel conditions than any of the other participating systems.
The task for the 1996 evaluations is the transcription of broadcast television and radio news. The objective, therefore, is to assess the potential of hybrid connectionist-HMM systems on a new and challenging task, and to determine whether the robustness demonstrated for the read business news task of 1995 will be maintained for more difficult tasks. An additional benefit of participating in the evaluation is the supply of acoustic and language modelling data.
The CU-CON evaluation system is currently undergoing fine tuning in preparation for decoding the evaluation data. The decoding of the evaluation data will be completed by 12 December 1996, and results will be available on 16 December 1996.
FPMs did not participate in evaluations on the BREF database, since they never received the CD-ROMs.
The 1996 evaluation consists of two components, a ``partitioned evaluation'' (PE) component, and an ``unpartitioned evaluation'' (UE) component. CU-CON will participate in the PE only. The PE contains speech that is manually segmented into homogeneous regions, and provides for a set of controlled contrastive conditions referred to as ``evaluation focus conditions'':
F0 (baseline broadcast speech): This condition describes speech that is directed to the general broadcast audience, and that is recorded in a quiet studio environment. The speech is assumed to be mostly read from prepared texts.
F1 (spontaneous broadcast speech): This condition describes speech that is directed to one or more human conversational partners, and that is recorded in a quiet studio environment. The speech is assumed to be spontaneous.
F2 (speech over telephone channels): This condition describes speech that is recorded over reduced-bandwidth conditions, such as local or long distance telephony.
F3 (speech in the presence of background music): This condition describes speech that satisfies the conditions of baseline broadcast speech, or spontaneous broadcast speech, except that it is broadcast with additive background music.
F4 (speech under degraded acoustic conditions): This condition describes speech that is acoustically degraded for reasons other than the use of telephone channels or the presence of background music. Sources of degradation include additive or environmental noise.
F5 (speech from non-native speakers): This condition describes speech that satisfies the attributes of baseline broadcast speech, except that it is spoken by non-native speakers of American English. It is assumed to be spoken by fluent speakers of English with a foreign accent.
The CU-CON evaluation system uses a number of different acoustic and language models for the different focus conditions. Acoustic models have been built for the F0, F1, and F2 focus conditions, and a baseline acoustic model trained on all available broadcast news data (35 hours) is used for the remaining focus conditions. The language model used for spontaneous speech (focuses F1 and F2) was built using the transcribed broadcast news data supplied specifically for the 1996 evaluations. The 1995 Hub 4 general North American business news text data was also used to build the language model for the remaining focus conditions. The application of boosting and linear input network adaptation has also been investigated during the development of the evaluation system. A more complete description of the system will be published.
NIST plans to release a further 50 hours of transcribed acoustic training data towards the end of December. Further development of the evaluation system is planned using this extra data. The language model adaptation techniques described in Task 3.3 are currently not supported by the NOWAY decoder. It is planned to develop lattice rescoring software to allow the application of various language models, and to investigate the performance of these models on the evaluation data.
Task Coordinator: CUED
Executing Partners: CUED, FPMs, INESC, SU
From their system trained on PHONEBOOK, FPMs has developed a demonstration of Stock Market information.
A new release of AbbotDemo was made in September 1996. AbbotDemo is a pre-compiled version of the hybrid neural net/HMM recognition system that is supplied as a demonstration of the technology.
A Stock Market information system has been developed at FPMs and will be demonstrated at the review meeting. The demonstration program runs on a Sun workstation equipped with a telephony board manufactured by Linkon. A demonstration is permanently available at +32 65 37 41 77. If you phone that number, after receiving a welcome message you can pronounce the name of a share and the system will give you its current value. The list of shares is available at the Stock Market demonstration home page:
http://tcts.fpms.ac.be/speech/quotes.html
The September 1996 release of AbbotDemo increases the base vocabulary from 5,000 words to 10,000 words. In addition, more language modelling text was used so as to be more in line with typical speech (a move away from North American Newspaper Language).
Periodically new releases of AbbotDemo will be made to incorporate updates made to the base system.