TCTS Lab Research Groups





The Speech Training and Recognition Unified Tool

The Speech Training and Recognition Unified Tool (STRUT) has been developed for research on speech recognition and for fast development and testing of related applications. The software performs speech analysis, model training and speech recognition. The tool consists of many "independent" small pieces of code, one for each identified module in the speech recognition process: sampling, feature extraction, clustering, probability estimation, and decoding. 

There are many advantages to this software development approach, including:

  • It is much easier to develop, test and maintain "independent" small pieces of code than one big program. 

  • Replacing one block by another is just a matter of changing the name of a program on a Unix command line. This makes research much easier, and integration of pieces of code written by several people becomes simpler.

  • Since each block is a separate process, the programs can run in parallel on more than one processor or workstation.

Data exchange between the programs can be done in three ways:

  • Files : the output of each program is stored in a file. This is particularly useful when testing (debugging) a small part of the whole chain, since all the information required by downstream programs can be saved to a file and inspected. 

  • Pipes : the standard output of a program is sent ("piped") to the standard input of the following program. This is the procedure used for on-line demos. On a multi-processor workstation, Unix pipes benefit from the fact that each program is a separate process and can thus run on a different processor.

  • Sockets : the output of a program is written to a socket (the standard inter-process communication interface in Unix), and the next process in the chain reads its input from the same socket. While this setup is harder to orchestrate, it allows programs to run on separate workstations, provided they belong to the same network. This again allows parallel processing, which is a great plus for an on-line demo. Sockets also allow a program to have more than one input and one output.
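The socket-based exchange between two blocks can be sketched as follows. This is only an illustration: the "feature extraction" and "probability estimation" roles and the data passed between them are toy stand-ins, not actual STRUT programs, and both endpoints live in one process here, whereas real blocks would be separate programs, possibly on different workstations.

```python
import socket

def feature_block(conn, samples):
    """Toy 'feature extraction' block: send one value per input sample."""
    for s in samples:
        conn.sendall(f"{s * 2}\n".encode("ascii"))
    conn.shutdown(socket.SHUT_WR)  # signal end-of-stream to the peer

def estimator_block(conn):
    """Toy downstream block: read features until the peer closes."""
    buf = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        buf += chunk
    return [int(line) for line in buf.decode("ascii").split()]

# socketpair() gives two connected endpoints in a single process; with
# real blocks, each endpoint would belong to a separate program.
a, b = socket.socketpair()
feature_block(a, [1, 2, 3])
a.close()
features = estimator_block(b)
b.close()
print(features)  # [2, 4, 6]
```

The same chain could equally be built with Unix pipes; sockets are only needed when the blocks run on different machines or need more than one input or output.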

For this structure to work well, data and file formats have to be precisely defined. Each data file has an ASCII header of at least 1024 bytes describing the contents of the file. Putting as much information as possible in these headers allows the user to check each step of the recognition or training process. Routines have been developed to edit or remove these headers, as well as to read or write the data. The header format was inspired by the format defined by the National Institute of Standards and Technology (NIST) for speech waveform files. This makes STRUT compatible with databases provided by the Linguistic Data Consortium (LDC), and allows the SPeech HEader REsources (SPHERE) software package to be used to handle the headers. 
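A NIST/SPHERE-style header of this kind can be sketched as below. The field names (`sample_rate`, `feature_kind`) are illustrative, not STRUT's actual ones; the point is that the ASCII header is padded to a fixed size, here 1024 bytes, so the binary data always starts at a known offset.

```python
HEADER_SIZE = 1024  # minimum header size used by STRUT data files

def write_header(fields):
    """Serialize 'name -type value' lines and pad to HEADER_SIZE bytes."""
    lines = ["NIST_1A", f"{HEADER_SIZE:>7d}"]
    for name, value in fields.items():
        if isinstance(value, int):
            lines.append(f"{name} -i {value}")
        else:
            lines.append(f"{name} -s{len(value)} {value}")
    lines.append("end_head")
    text = "\n".join(lines) + "\n"
    return text.encode("ascii").ljust(HEADER_SIZE, b" ")

def read_header(raw):
    """Parse the fields back out of a padded header block."""
    fields = {}
    for line in raw[:HEADER_SIZE].decode("ascii").splitlines():
        parts = line.split(None, 2)
        # field lines look like 'name -i 16000' or 'name -s3 lpc'
        if len(parts) == 3 and parts[1].startswith("-"):
            name, typ, value = parts
            fields[name] = int(value) if typ == "-i" else value
    return fields

header = write_header({"sample_rate": 16000, "feature_kind": "lpc_cepstrum"})
print(len(header))          # 1024
print(read_header(header))  # {'sample_rate': 16000, 'feature_kind': 'lpc_cepstrum'}
```

Because the header is plain ASCII, the output of any block in the chain can be inspected with ordinary text tools, which is what makes checking each step of the process so convenient.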


Figure 1: Recognizer block diagram

The block diagram of a recognizer is presented in figure 1. If you don't work with discrete probabilities, the clustering block can be removed.

Figure 2: Viterbi training block diagram

For Viterbi training, the decoder block is replaced by a state path decoder (figure 2). The resulting segmentation is then used by another program to update the models, whatever they are. Baum-Welch training can also be implemented easily: probabilities for each state are stored instead of the segmentation.
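The state path decoder used in this training loop can be sketched as a plain Viterbi alignment. The toy model below (a two-state left-to-right topology with made-up probabilities) is only an illustration of the algorithm; a real STRUT model would supply the per-frame state probabilities and the transition structure.

```python
import math

def viterbi_path(obs_logprob, trans_logprob):
    """obs_logprob[t][s]: log p(frame t | state s);
    trans_logprob[p][s]: log p(state p -> state s).
    Returns the most likely state sequence (the segmentation)."""
    n_frames, n_states = len(obs_logprob), len(obs_logprob[0])
    delta = [obs_logprob[0][:]]  # best log-score ending in each state
    back = []                    # backpointers for the traceback
    for t in range(1, n_frames):
        row, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[-1][p] + trans_logprob[p][s])
            row.append(delta[-1][best_prev] + trans_logprob[best_prev][s]
                       + obs_logprob[t][s])
            ptr.append(best_prev)
        delta.append(row)
        back.append(ptr)
    # Trace back from the best final state to recover the segmentation.
    state = max(range(n_states), key=lambda s: delta[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

NEG = -math.inf  # forbidden transition (left-to-right: no going back)
trans = [[math.log(0.5), math.log(0.5)],
         [NEG,           math.log(1.0)]]
obs = [[math.log(0.9), math.log(0.1)],   # frame 0 looks like state 0
       [math.log(0.6), math.log(0.4)],
       [math.log(0.1), math.log(0.9)]]   # frame 2 looks like state 1
print(viterbi_path(obs, trans))  # [0, 1, 1]
```

The returned per-frame state labels are exactly what the model-update program consumes; for Baum-Welch, this hard assignment would be replaced by the stored per-state probabilities.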


Have a look at the STRUT user's guide.