Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications
by Lakhmi C. Jain; N.M. Martin
CRC Press, CRC Press LLC
ISBN: 0849398045   Pub Date: 11/01/98
  



To reduce the memory requirements further, “key-frames” have been subdivided into 4×4 pixel blocks and the resulting vectors have been clustered a second time in the 16-dimensional pixel space, providing a predefined number of “key-blocks” (128 and 256), which form the viseme reconstruction codebook (see Figure 21). Each of the 128 “key-frames” has finally been associated with a list of 7-bit indexes addressing the appropriate blocks in the reconstruction codebook. Experiments have been performed using 256×256 and 128×128 pel image formats, composed of 4096 and 1024 “key-blocks,” respectively.
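The following is a minimal sketch of this key-block codebook construction, assuming a k-means-style clustering (the text does not name the clustering algorithm) and purely illustrative function names:

import numpy as np
from scipy.cluster.vq import kmeans2, vq

BLOCK = 4  # 4x4 pixel blocks -> 16-dimensional vectors


def frame_to_blocks(frame):
    """Split a (H, W) grayscale key-frame into flattened 4x4 blocks."""
    h, w = frame.shape
    blocks = (frame.reshape(h // BLOCK, BLOCK, w // BLOCK, BLOCK)
                   .swapaxes(1, 2)
                   .reshape(-1, BLOCK * BLOCK))
    return blocks.astype(np.float64)


def build_codebook(key_frames, n_codewords=128, seed=0):
    """Cluster all 4x4 blocks of the key-frames into n_codewords
    'key-blocks' (the viseme reconstruction codebook).
    Assumption: k-means clustering is used; the chapter only says the
    vectors are clustered a second time in the 16-dimensional space."""
    data = np.vstack([frame_to_blocks(f) for f in key_frames])
    codebook, _ = kmeans2(data, n_codewords, minit='++', seed=seed)
    return codebook


def encode_frame(frame, codebook):
    """Represent one key-frame as a list of codebook indexes
    (7-bit indexes when the codebook has 128 entries)."""
    indexes, _ = vq(frame_to_blocks(frame), codebook)
    return indexes


def decode_frame(indexes, codebook, shape):
    """Rebuild an approximation of the key-frame from its index list."""
    h, w = shape
    blocks = codebook[indexes].reshape(h // BLOCK, w // BLOCK, BLOCK, BLOCK)
    return blocks.swapaxes(1, 2).reshape(h, w)

With a 128×128 pel frame this yields 1024 block indexes per key-frame, and 4096 for the 256×256 format, consistent with the figures quoted above.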

The visual synthesis of speech has been evaluated by computing the Mean Square Error (MSE) between the original images of the corpus and those reconstructed by means of the “key-blocks.” In this evaluation, the viseme reconstruction codebook has been addressed either by using the actual articulatory vectors measured from the images or the estimates derived through speech analysis. Various dimensionalities of the articulatory space and of the reconstruction codebook have been used. The estimation MSE for each articulatory parameter has also been evaluated. The objective MSE evaluation alone, however, cannot provide sufficient indication of performance, since relevant components of speech reading depend on how well the coarticulatory dynamics are rendered and on the level of coherence with acoustic speech.
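As a hedged illustration of this objective evaluation step (function names are not from the chapter), the MSE between each original corpus image and its key-block reconstruction can be computed and averaged over the sequence:

import numpy as np


def frame_mse(original, reconstructed):
    """Per-frame MSE between two grayscale images of equal size."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return float(np.mean(diff ** 2))


def sequence_mse(originals, reconstructions):
    """Average MSE over a whole test sequence; the reconstructions may
    come from either the measured articulatory vectors or the estimates
    derived through speech analysis."""
    return float(np.mean([frame_mse(o, r)
                          for o, r in zip(originals, reconstructions)]))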

Because of this, a set of subjective experiments has been carried out with both normal-hearing and hearing-impaired subjects. The experiments consist of showing sample sequences of visualized speech to observers, who are asked to express their evaluation in terms of readability, visual discrimination, quality of the articulatory dynamics, and level of coherence with acoustic speech. Sequences were encoded off-line with different configurations; in particular, two choices were taken both for spatial resolution (128×128 and 256×256 pixels) and for time resolution (12.5 and 25 frame/sec). The number of articulatory parameters used to synthesize the mouth articulation was increased from 2 (mouth height and width) to 10 (including the protrusion parameter extracted from the side view). The original video sequence, representing the speaker’s mouth pronouncing a list of Italian isolated words (from a corpus of 400 words), was displayed at 12.5 frame/sec without audio at half resolution (128×128 pels) and full resolution (256×256 pels). Only the frontal view of the mouth was displayed. Observers, seated at a distance of 30 cm from a 21-inch monitor in a darkened and quiet area of the laboratory, were asked to write down the words they succeeded in speech reading. The sequence was then displayed a second time, increasing the time resolution to 25 frame/sec.

The presentation of the original sequence provided indications of the personal proficiency of each observer. In fact, besides the evident difference in sensitivity between normal-hearing and hearing-impaired people, a significant variability is also present within the same class of subjects. Therefore, the subjective evaluation score has been normalized for each individual on the basis of his/her speech reading perception threshold. For each observer, and for each of the two possible image formats (128×128 or 256×256), the minimum time resolution (12.5 or 25 Hz) allowing successful speech reading was found. Success was measured on a restricted set of “articulatory easy” words for which 90% correct recognition was required.
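A small sketch of this per-observer calibration, under an assumed data layout (the chapter does not specify how responses were recorded; all names below are illustrative): for each image format, the lowest frame rate at which the observer reaches 90% correct recognition on the “articulatory easy” word set is retained.

FRAME_RATES = (12.5, 25.0)          # Hz, the two time resolutions used
SUCCESS_THRESHOLD = 0.90            # 90% correct recognition required


def recognition_rate(responses, easy_words):
    """Fraction of the 'easy' words the observer wrote down correctly.
    `responses` maps each presented word to True/False (recognized)."""
    return sum(responses[w] for w in easy_words) / len(easy_words)


def calibrate_observer(results, easy_words):
    """`results[(fmt, rate)]` holds one observer's responses for a given
    image format and frame rate; returns, per format, the minimum frame
    rate that allowed successful speech reading (None if neither did)."""
    calibration = {}
    for fmt in ("128x128", "256x256"):
        calibration[fmt] = None
        for rate in sorted(FRAME_RATES):
            if recognition_rate(results[(fmt, rate)], easy_words) >= SUCCESS_THRESHOLD:
                calibration[fmt] = rate
                break
    return calibration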

Further experimentation with synthetic images was then performed, observer by observer, using the exact time resolution that had allowed his/her successful speech reading of the original sequences. The whole test was repeated, replacing the original images with synthetic ones reconstructed from “key-blocks” addressed by means of the actual (no estimation error) articulatory parameters. Finally, a third repetition of the test was performed, addressing the reconstruction codebook by means of parameters estimated from speech through the TDNNs (with estimation error). Before each test repetition, the order of the words in the sequence was randomly shuffled to avoid expectations on the part of the observer.

The number of people involved in the tests is still too small, especially as far as pathological subjects are concerned, but on-going experimentation aims at enlarging this number significantly. A total of 15 observers were involved in the evaluations, only 2 of whom were hearing impaired, with a 70 dB loss. The preliminary results reported in Tables 4 and 5 take into account, for each observer, only the words that he/she correctly recognized in the original sequence. This has been done according to our particular interest in evaluating the subjective “similarity” of the reconstructed images to the original, as far as the possibility of correct speech reading is concerned.

Table 4 Results reported in each column express the average percentage of correctly recognized words. The test is subjective since, as explained in the text, both the video time resolution and the set of test words have been calibrated for each observer. In this case images are synthesized by selecting, each time, one of 128 possible key-frames (visemes) and approximating it as a mosaic of 4×4 blocks extracted from a codebook of either 128 or 256 elements. The image format was either 128×128 or 256×256.

The use of parameters W and H alone did not allow speech reading; since observers were more inclined to guess than to recognize, the test outcomes have been considered unreliable and have been omitted. From the results in both tables, it is evident that the progressive introduction of parameters W, H, dw, LM, and Lup significantly raises the recognition rate, while only slight improvement is gained when parameters h, w, LC, and lup are added. This can be explained by the fact that the former set of parameters forms a basis (mutual independence), while the latter set is strongly correlated with the former and can supply only marginal information. The information associated with the teeth (supplied manually, since it could not be estimated through the TDNN) has proved to be of great importance for improving the quality of speech visualization, since it directly concerns the dental articulatory place and provides information on the tongue position.

Table 5 Results of the same subjective tests reported in Table 4, except that in this case 256 possible key-frames (visemes) were used.


