NASA Scientific and Technical Aerospace Reports
discrimination information (or the cross-entropy) between the source and the model is proposed. This approach does not
require the commonly used assumption that the source to be modeled is a hidden Markov process. The algorithm is started
from the model estimated by the traditional maximum likelihood (ML) approach and alternately decreases the discrimination
information over all probability distributions of the source which agree with the given measurements and all hidden Markov
models. The proposed procedure generalizes the Baum algorithm for ML hidden Markov modeling. The procedure is shown
to be a descent algorithm for the discrimination information measure, and its local convergence is proved.
Author
Markov Processes; Information Theory; Information Systems; Probability Distribution Functions; Maximum Likelihood
Estimates
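The discrimination information the abstract minimizes is the Kullback-Leibler divergence between a source distribution and a model distribution. A minimal numerical sketch, using hypothetical three-symbol distributions (the abstract's actual distributions range over hidden Markov models):

```python
import math

def discrimination_information(p, q):
    """Kullback-Leibler divergence D(p || q) in nats: the discrimination
    information between a source distribution p and a model distribution q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical source and two candidate model distributions.
source = [0.5, 0.3, 0.2]
model_a = [0.4, 0.4, 0.2]
model_b = [0.5, 0.25, 0.25]

# A minimum-discrimination-information criterion prefers the model with
# the smaller divergence from the source; a descent algorithm such as the
# one in the abstract decreases this measure at every iteration.
da = discrimination_information(source, model_a)
db = discrimination_information(source, model_b)
```

Here `model_b` is closer to the source, so its divergence is smaller; the divergence of any distribution from itself is zero.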
20060001627 International Business Machines Corp., Paris, France<br />
Context-Dependent Phonetic Markov Models for Large Vocabulary Speech Recognition<br />
Derouault, Anne-Marie; IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ‘87); Volume
1; 1987, pp. 10.1.1 - 10.1.4; In English; See also 20060001583; Copyright; Avail.: Other Sources
One approach to large vocabulary speech recognition is to build phonetic Markov models and to concatenate them to
obtain word models. In previous work, we designed a recognizer based on 40 phonetic Markov machines, which accepts a
10,000-word vocabulary ([3]) and, more recently, a 200,000-word vocabulary ([5]). Since there is one machine per
phoneme, these models obviously do not account for coarticulatory effects, which may lead to recognition errors. In this paper,
we improve the phonetic models by using general principles about coarticulation effects on automatic phoneme recognition.
We show that both the analysis of the errors made by the recognizer and linguistic facts about phonetic context influence
suggest a method for choosing context-dependent models. This method limits the growth of the number of phonemes
while still accounting for the most important coarticulation effects. We present our experiments with a system applying these
principles to a set of models for French. With this new system including context-dependent machines, the phoneme recognition
rate goes from 82.2% to 85.3%, and the word error rate with a 10,000-word dictionary decreases from 11.2% to 9.8%.
Author
Context; Phonemes; Error Analysis; Phonetics; Words (Language); Speech Recognition; Linguistics
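The abstract's core idea, one Markov machine per phoneme, concatenated into word models, with context-dependent variants substituted only where coarticulation matters most, can be sketched as follows. All names and the variant inventory are illustrative, not taken from the paper:

```python
# Base inventory: one machine per phoneme (machine names are hypothetical).
base_models = {"b": "M_b", "o": "M_o", "n": "M_n"}

# A small set of context-dependent variants, keyed by (left_context, phoneme).
# Keeping this set small limits the growth of the model inventory while still
# covering the most important coarticulation effects.
context_models = {("b", "o"): "M_o_after_b"}

def word_model(phonemes):
    """Build a word model by concatenating phone machines, preferring a
    context-dependent variant of a phoneme's machine when one exists."""
    machines = []
    left = None  # left phonetic context, None at word start
    for ph in phonemes:
        machines.append(context_models.get((left, ph), base_models[ph]))
        left = ph
    return machines
```

For the phoneme sequence `["b", "o", "n"]`, the vowel machine after `b` is replaced by its context-dependent variant, while the other phonemes keep their base machines.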
20060001657 Mitre Corp., McLean, VA, USA<br />
Information-Theoretic Compressibility of Speech Data<br />
Ramsey, L. Thomas; Gribble, David; IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP
‘87); Volume 1; 1987, pp. 1.6.1 - 1.6.4; In English; See also 20060001583; Copyright; Avail.: Other Sources
Two standard reversible coding algorithms, Ziv-Lempel and a dynamic Huffman algorithm, are applied to various types
of speech data. The data tested were PCM, DPCM, and prediction residuals from LPC. Neither algorithm shows much promise
on small amounts of data, but both performed well on large amounts. Typically the Ziv-Lempel required about 12 seconds of
data (at 8000 samples per second) to reach a stable compression rate. The dynamic Huffman coding took much less time
to ‘warm up’, often needing something like 64 milliseconds. Approximately 66 seconds of PCM with 12 bits per sample was
compressed 6.4% by the Ziv-Lempel coding and 20.7% by the dynamic Huffman coding. The same numbers for DPCM with
13 bits per sample are 17.7% and 35.6%, respectively. The prediction residuals had compression rates very close to those of
DPCM, regardless of whether 1, 2, 5, or 10 prediction coefficients were used.
Author
Information Theory; Compressibility; Predictions; Speech; Coefficients; Differential Pulse Code Modulation
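The abstract's implementations are not given; as a rough modern stand-in, zlib's DEFLATE (which combines LZ77, a Ziv-Lempel variant, with Huffman coding) can illustrate how such compression percentages are measured, and why residual-like data with most values near zero compresses better than wide-range samples. The synthetic byte streams below are only crude analogues of the PCM and LPC-residual data in the paper:

```python
import zlib
import random

def compression_percent(data: bytes) -> float:
    """Size reduction achieved by DEFLATE, as a percentage of the input size
    (e.g. 20.0 means the compressed stream is 20% smaller than the input)."""
    compressed = zlib.compress(data, level=9)
    return 100.0 * (1 - len(compressed) / len(data))

random.seed(0)
# One second of data at 8000 samples/second, as in the paper's setup.
# "PCM-like": samples spread uniformly over the full byte range (high entropy).
pcm_like = bytes(random.randrange(256) for _ in range(8000))
# "Residual-like": samples concentrated near zero (lower entropy), mimicking
# prediction residuals, which the paper found compressed like DPCM.
residual_like = bytes(min(255, abs(int(random.gauss(0, 8)))) for _ in range(8000))
```

On these streams the residual-like data compresses substantially, while the uniform stream barely compresses at all, consistent with the paper's observation that lower-entropy representations (DPCM, LPC residuals) yield higher compression rates.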
20060001668 American Telephone and Telegraph Co., NJ, USA
A Connected Speech Recognition System Based on Spotting Diphone-Like Segments - Preliminary Results<br />
Rosenberg, A. E.; Colla, A. M.; IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ‘87);
Volume 1; 1987, pp. 3.6.1 - 3.6.4; In English; See also 20060001583; Copyright; Avail.: Other Sources
A template-based connected speech recognition system, which represents words as sequences of diphone-like segments,
has been implemented and evaluated. The inventory of segments is divided into two principal classes: ‘steady-state’ speech
sounds such as vowels, fricatives, and nasals, and ‘composite’ speech sounds consisting of sequences of two or more speech
sounds in which the transitions from one sound to another are intrinsic to the representation of the composite sound. Templates
representing these segments are extracted from labelled training utterances. Words are represented by network models whose
branches are diphone segments. Word juncture phenomena are accommodated by including segment branches that characterize
transition pronunciations between specified classes of words. The recognition of a word in a specified utterance takes place