Discriminant features are often computed by applying Linear Discriminant Analysis (LDA) to sequences of acoustic vectors. LDA extracts from these sequences a set of discriminant parameters that maximize class separability, by designing a linear transformation that projects an n-dimensional space onto an m-dimensional space (m < n). Previous work shows that applying LDA to speech recognition problems increases performance ([26], [27], [28]) and robustness against some types of noise [29]. Here, we propose a method for extracting discriminant parameters using Artificial Neural Networks (ANN), and more particularly Multilayer Perceptrons (MLP). ANNs are indeed powerful tools that can be trained to solve complex nonlinear classification problems. Each hidden layer of a feed-forward network computes its outputs as a nonlinear transformation of its inputs, so each hidden layer can be regarded as an internal representation of the input signal that prepares it for the classification task. Such a representation can therefore be seen as a nonlinear discriminant analysis (NLDA) of the input features, and it provides an alternative to classical speech features (MFCC, LPC-cepstrum, RASTA-PLP cepstrum, ...).
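For reference, the linear projection performed by classical LDA can be sketched as follows. This is a minimal NumPy illustration and not the implementation used in the cited works: the function name `lda_projection`, the small ridge term added to the within-class scatter, and the Cholesky-based solution of the generalized eigenproblem are all assumptions made for the example.

```python
import numpy as np

def lda_projection(X, y, m):
    """Return an (n, m) matrix projecting n-dim vectors onto m discriminant axes.

    X: (num_samples, n) feature vectors; y: (num_samples,) integer class labels.
    """
    n = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((n, n))  # within-class scatter
    Sb = np.zeros((n, n))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # Assumption: a small ridge keeps Sw positive definite.
    Sw += 1e-6 * np.eye(n)
    # Solve Sb w = lambda Sw w through the Cholesky factor Sw = L L^T.
    L = np.linalg.cholesky(Sw)
    L_inv = np.linalg.inv(L)
    eigvals, eigvecs = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    order = np.argsort(eigvals)[::-1][:m]  # keep the m largest eigenvalues
    return L_inv.T @ eigvecs[:, order]
```

Projected features are then obtained as `X @ W`, where `W` is the returned matrix. With C classes, at most C - 1 of the discriminant axes carry class information.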
Nonlinear discriminant analysis can then be achieved by designing an MLP in which the last hidden layer contains fewer nodes than the input layer. With this architecture, the hidden layer acts as a bottleneck that both reduces the redundancy of the input layer and extracts the information relevant to the classification.
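The bottleneck idea can be illustrated with a small forward pass that returns both the class posteriors and the activations of the last hidden layer, the latter being the candidate discriminant features. This is only a sketch under stated assumptions: the helper names `init_mlp` and `forward`, the sigmoid hidden units, the softmax output, the random (untrained) weights, and the layer sizes in the usage lines are illustrative choices, not a description of the trained networks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_mlp(layer_sizes, seed=0):
    """Random (untrained) weights and zero biases for a feed-forward MLP."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """Return (class posteriors, activations of the last hidden layer)."""
    h = x
    for W, b in params[:-1]:
        h = sigmoid(h @ W + b)           # nonlinear hidden transformation
    W_out, b_out = params[-1]
    logits = h @ W_out + b_out
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    post = np.exp(logits)
    post /= post.sum(axis=-1, keepdims=True)               # softmax
    return post, h

# Illustrative sizes: 234 inputs, a 38-node bottleneck, 47 output classes.
params = init_mlp([234, 38, 47])
inputs = np.zeros((4, 234))              # four dummy input vectors
posteriors, features = forward(params, inputs)
print(features.shape)                    # bottleneck features, one row per input
```

After training, only `features` would be passed on to the recognizer; the output layer serves to drive the discriminative training criterion.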
One could worry about whether a neural network designed with a small number of hidden nodes can be trained efficiently. In practice, training such an ANN is efficient only for simple tasks. A better way to train an ANN containing a bottleneck is to introduce a second hidden layer containing a large number of neurons. Once the neural network has been trained, we expect the outputs of the last hidden layer to provide discriminant features that can be fed to a classical recognizer (discrete HMM, multi-Gaussian HMM, hybrid HMM/MLP, ...). The features so defined offer the following advantages:
Tests were conducted with both a continuous-density HMM recognizer (CDHMM) and a hybrid HMM/MLP recognizer on the PHONEBOOK database [25].
Recognition results are presented in Table 5.4 for the continuous-density HMM recognizer and in Table 5.5 for the hybrid HMM/MLP recognizers. The differences in performance between the CDHMM and the hybrid HMM/ANN systems can be explained by the fact that we trained only context-independent phone models, and that a minimum duration of the phone models was imposed for the hybrid systems but not for the CDHMM.
In our experiments, we first tried to extract NLDA parameters from a single-hidden-layer MLP (NLDA-234-38-47). The corresponding results clearly indicate that this structure is ineffective: the MLP is too small to estimate reliable posterior probabilities. Experiments with two-hidden-layer MLPs show a significant improvement for the continuous-density recognizer (a reduction of about 25% in the error rate) on both test sets. This could be explained by the fact that the Gaussians have diagonal covariance matrices, which assumes that the parameters of the feature vectors are decorrelated. The MLP used for NLDA probably decorrelates the parameters so as to extract a maximum of information, which matches the assumption of diagonal covariance matrices. To verify this hypothesis, we compared the correlation coefficients of the RASTA-PLP parameters and of the NLDA parameters as follows:
Let v be the complete feature vector (including derivatives, if any). For each HMM state, we computed the correlation matrix between the coefficients of the feature vectors:
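$$\rho_{ij} = \frac{E\left[(v_i - \mu_i)(v_j - \mu_j)\right]}{\sigma_i \sigma_j} \qquad (5.5)$$
where $\mu_i = E[v_i]$ and $\sigma_i^2 = E[(v_i - \mu_i)^2]$. To facilitate the comparison of correlation matrices, we computed the value r related to the correlation matrix by:
(5.6)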
where N is the feature vector dimension. This value gives an idea of the global correlation between the coefficients of a feature vector. Figure 5.8 shows the global correlation of the feature vector coefficients (solid line: RASTA-PLP; dotted line: NLDA) for each phoneme model. The global correlations of the RASTA-PLP and NLDA parameters are almost the same. This is quite interesting, since NLDA parameters are extracted from several context frames. It indicates that NLDA is able to extract parameters that incorporate context information while keeping the global correlation at the same level as for a single RASTA-PLP feature vector.
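The per-state correlation analysis described above can be sketched in a few lines of NumPy. The correlation matrix is the standard Pearson one. The summary value, however, is an assumption: the exact definition of r is not reproduced in this text, so the sketch uses the mean absolute off-diagonal correlation as one plausible stand-in. The function names are hypothetical.

```python
import numpy as np

def correlation_matrix(V):
    """Pearson correlation matrix of the coefficients.

    V: (num_frames, N) feature vectors aligned to one HMM state.
    Constant coefficients would yield NaN entries.
    """
    return np.corrcoef(V, rowvar=False)

def global_correlation(C):
    """Assumed summary: mean absolute value of the off-diagonal entries of C."""
    N = C.shape[0]
    off_diagonal = ~np.eye(N, dtype=bool)
    return np.abs(C[off_diagonal]).mean()
```

Under this assumed definition, r is 0 for fully decorrelated coefficients and 1 when every pair of coefficients is perfectly correlated, so it can be compared across feature sets of different dimension N.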
In a second set of experiments, we trained a neural network directly on the NLDA parameters, again with some context frames (9 frames, except for the 64-component NLDA vector, for which we used 5 context frames). Results with these hybrid recognizers also show improved performance, especially for the second test set. However, the improvements are not as large as for the CDHMM, probably because both MLPs (the one used for NLDA and the one used for probability estimation) are trained with the same criterion. The improvements could result from a better modeling of the context, since the probability estimator accounts for some context of the discriminant features, which are themselves extracted from some acoustic context. It is interesting to note that increasing the context (beyond 90 ms) for the baseline recognizer never led to better performance. Increasing the number of hidden nodes (to 1,000) for the baseline did not decrease the error rate either.
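Feeding a network "with some context frames" amounts to concatenating each frame with its neighbours before presenting it to the input layer. A minimal sketch is given below. The function name `stack_context`, the centered window, and the repetition of the first and last frames at the utterance boundaries are assumptions, since the text does not say how the edges are handled.

```python
import numpy as np

def stack_context(frames, context=9):
    """Concatenate each frame with its neighbours in a centered window.

    frames: (T, D) sequence of feature vectors.
    Returns a (T, context * D) array; boundaries are padded by repeating
    the first and last frames (an assumption of this sketch).
    """
    if context % 2 != 1:
        raise ValueError("context must be odd so the window is centered")
    half = context // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    T = frames.shape[0]
    # Slice k holds, for every time t, the frame at offset (k - half).
    return np.hstack([padded[k:k + T] for k in range(context)])
```

For example, 9 context frames of hypothetical 26-dimensional vectors give 234-dimensional network inputs.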
It is interesting to note that the improvements obtained with NLDA parameters are largely independent of the size of the feature vectors: similar results were achieved with 26-, 38-, and 64-component vectors.