Explicit Pronunciation Training Using Automatic Speech ...



Jonathan Dalby and Diane Kewley-Port

Explicit Pronunciation Training Using

Automatic Speech Recognition


© 1999 CALICO Journal

Jonathan Dalby

Communication Disorders Technology, Inc.

Diane Kewley-Port

Indiana University and Communication Disorders Technology, Inc.


A system is described, provisionally named Pronto, which uses automatic

speech recognition (ASR) for training pronunciation of second languages

in adult learners. The first version of Pronto was developed for native

speakers of American English learning Spanish and for Mandarin Chinese

speakers learning English. Pronto grows out of work in the Indiana Speech

Training Aid (ISTRA) research program, which has demonstrated significant

improvement in the pronunciation of hearing-impaired and normal-hearing

but misarticulating children through the use of ASR-derived feedback.

This feedback has also been shown to improve pronunciation in

adult learners of a second language. Methods are described for developing

training in Pronto, and results are presented from evaluating classes of

speech recognizers for use in different aspects of pronunciation training.


Pronunciation Training, Speech Recognition, Speech Training Aids, Evaluation,

User Tests, Minimal Pairs


One of the most difficult tasks associated with learning a second language

as an adult is mastering the phonological and phonetic systems of

the new language. The twenty- or thirty- or even forty-year immigrant

Volume 16 Number 3 425


who still speaks with a “heavy, thick, or strong” accent is a common anecdotal

character. In this paper we will describe a system, provisionally named

Pronto, that uses automatic speech recognition (ASR) technology for explicit

training of second language pronunciation in adult learners. The

first version of Pronto was developed for native speakers of American

English who wish to learn Spanish, and for native speakers of Mandarin

Chinese learning English.

We use the term “explicit training” to describe the system because it

includes curricula for specific pairs of native/target language sounds for

which empirical methods have been applied to discover typical segmental

(phoneme) errors of learners. Assumptions that have motivated the design

features of Pronto include the following:

1) Segment-level (phoneme) errors seriously degrade the intelligibility

of speech by nonnatives (Rogers & Dalby, 1996);

2) Typical pronunciation difficulties for a given target language

will differ for speakers of different native languages

(Kenworthy, 1987; Flege & Wang, 1989; Rochet, 1995);

3) Second language learners may have both a production and a

perception intelligibility deficit for certain phonological contrasts

of the target language (Strange, 1995);

4) Explicit production and perception training on difficult segmental

contrasts in the second language can improve the intelligibility

of nonnative speakers (Rogers, Dalby, & DeVane, 1994);


5) Feedback derived from automatic speech recognition technology

should be similar to speech quality judgments made

by humans (Anderson & Kewley-Port, 1995; Kewley-Port et

al., 1991; Watson et al., 1989).

Methods will be described for developing structured curricula for second

language intelligibility training in Pronto, as well as techniques for

evaluating different ASR technologies to support these curricula. Finally,

methods will be discussed for evaluating the effectiveness of the intelligibility

training developed in Pronto.



The work in Pronto stems from the Indiana Speech Training Aid (ISTRA)

research program, which seeks to develop training aids for speech-language

pathologists and educators of the deaf to use in providing pronunciation

training to clients with a variety of speech disorders. One focus is


on hearing-impaired and normal-hearing but misarticulating children.

ISTRA prototypes have led to demonstrable improvement in the quality

of pronouncing target words, with some generalization to nontarget words,

based on independent evaluation by juries of listeners (Kewley-Port et al., 1991).


ISTRA employs a template-based, speaker-dependent isolated word recognizer

(also known as discrete ASR). This technology is combined with

speech drills in graphical game-like formats, with names such as BASEBALL,

MOONRIDE, and BOWLING. These formats provide appealing

environments for prompting articulation and giving pronunciation feedback.

Figure 1 shows the graphical interface for the BOWLING game,

where the word to be pronounced is displayed on screen and feedback on

pronunciation quality is in the form of number of pins knocked down.

Figure 1

Graphical Interface from a Speech Drill in ISTRA

Note: The interface is modeled on a bowling game. The learner is prompted

to say the word displayed (“rum”), and a pronunciation score appears as

the number of pins knocked down.

Feedback on pronunciation quality is determined by the speech recognizer

as follows. In an ISTRA application, templates are made using four

tokens of a client’s current best productions of a target word as judged by

the speech clinician. The clinician elicits these good productions by using

traditional articulation training methods. Once a template for an improved

target is made, the client can practice the word on the computer without


supervision. Feedback is derived from the distance metric of the recognizer,

comparing a current production with the template. This ASR-derived

metric has been shown to correlate well with speech quality judgments

made by human listeners (Watson et al., 1989; Anderson & Kewley-Port,

1995).
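
As a rough illustration of how a recognizer distance might be turned into game feedback such as pins knocked down, the sketch below maps a distance onto a 0 to 10 scale. The linear mapping, the function name `pins_from_distance`, and the two calibration distances are assumptions made for illustration; the text does not specify how ISTRA scales its metric.

```python
def pins_from_distance(distance, best=0.0, worst=100.0, max_pins=10):
    """Map a template distance to a pin count (hypothetical linear scaling).

    `best` is the distance treated as a perfect match and `worst` the
    distance at or beyond which no pins fall; both are assumed values.
    """
    if worst <= best:
        raise ValueError("worst must be greater than best")
    # Clamp the distance into the calibrated range.
    clamped = min(max(distance, best), worst)
    # A smaller distance means a closer match to the template, so more pins.
    quality = (worst - clamped) / (worst - best)
    return round(quality * max_pins)
```

In practice the calibration range would presumably be set per word or per client, since template distances are not comparable across arbitrary templates.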

To inform learners of the goodness of their pronunciation over a series

of attempts for a set of words, the system uses graphical displays such as

bargraphs. Studies of ISTRA show that children as young as four years

can understand and benefit from bargraph feedback (Kewley-Port et al., 1991).





Phonemes Versus Phonetic Features

A key distinction in our discussion is between phonology and phonetics.

Whereas articulation of phonemes is called segmental and use of prosody

is suprasegmental (involving intonation and stress), phonetic features might

be called subsegmental—that is, they concern articulation differences

within the phoneme (or within-segment deviations), such as the time of

voice onset in the stops /b/ and /d/, discussed below. Both levels, phonological

and phonetic, play a role in interference from the native language

to the language being learned.

Sources of Interference

Recent research has shown that the difficulty adult learners have in mastering

the phonological and phonetic systems of a new language occurs in

both the speech production and speech perception domains (Strange,

1995). In a general way this difficulty is caused by differences in structure

of the phonological and phonetic systems between the target and the native

languages. However, the manner in which the structure of a first language

(L1) may interfere with learning the sound system of a second language

(L2) turns out to be complex.


Certain types of L1 interference in L2 phonological learning are reasonably

obvious and might even be predicted a priori from knowledge of the


phonological structures of the two languages. For example, it is not very

surprising that Chinese learners of English have difficulty producing and

perceiving English syllable-final consonants and consonant clusters, because

few such sequences occur in (the major dialects of) Chinese

(Kenworthy, 1987; Flege & Wang, 1989; Anderson, 1983). Nor is it very

surprising that native Spanish-speaking learners of English tend to transfer Spanish

phonological rules to English words inappropriately. One such Spanish

phonological rule realizes voiced stop consonants (e.g., the /d/ sound

in “dog”) as fricatives when they occur between vowels (cada ‘each’).

English words such as “ladder” thus tend to be pronounced “lather” by

these speakers (Flege & Davidian, 1984). American English learners of

Spanish have a similar kind of interference from the English rule that

“flaps” intervocalic apical stops (i.e., /t, d/) in English (as in “ladder”).

When learners inappropriately apply that rule in speaking Spanish, they

produce a sound like the Spanish tapped /r/. As a result, American English

pronunciations of, for example, Spanish cada (phonetically /kada/)

sound like cara ‘face’ (phonetically /kara/) to Spanish-speaking listeners.


It is less obvious that phonetic similarity can also contribute to the difficulty

of L2 learning. Phonetic distinctions among vowels may be reflected

in the acoustic signal by small differences in formants. Formants are resonance

or vibration bands in the frequency spectrum that determine the

quality of a vowel. Flege (1987) demonstrated that American English learners

of French were more accurate in terms of formant frequency values in

their productions of the French high front rounded vowel /y/ (as in tu

‘you’), which does not occur in English, than they were in their productions

of French /u/ (as in tous ‘all’), which has a close but not identical

counterpart in English (the /u/ in “to”). He hypothesizes that native English

speakers fail to learn to produce the phone that is closer in formant

space to an English phone due to “equivalence classification.” That is, the

/u/ of English is perceptually close enough to French /u/ that learners use

the English phone rather than learn a new sound. The notorious difficulty

Japanese learners of English have with the /r, l/ distinction may also be

due to the existence of similar sounds in Japanese. In this case the two

sounds are allophonic variants of a single phoneme in Japanese, rather

than separate phonemes as they are in English (Miyawaki et al., 1975).

Many of the characteristics of foreign-accented speech can be properly

characterized only at the phonetic level of analysis, the level at which the

acoustic cues to phonemic identity are produced and perceived (see

Eskenazi, this issue). The acoustic-phonetic details of the encoding of the

“same” phonological contrasts can vary greatly from language to language.


The voicing contrast between Spanish and English in syllable-initial stops,

such as /d/ in “dog,” provides an example. In this position the distinction

between /p, t, k/ and /b, d, g/ is largely cued acoustically by differences in

voice onset time (VOT) relative to the release of the stop closure (Lisker

& Abramson, 1964). This is true in both Spanish and English, but the

boundary between the two classes of sounds is very different in the two

languages. English contrasts “long lag” voiceless stops with voiced stops

that have short lag or short lead, while Spanish contrasts short lag voiceless

and long lead voiced stops in this position. 1 An English-accented Spanish

/b/ may well be heard as /p/ by native Spanish listeners, and a Spanish-accented

English /p/ can be confused with /b/ by native English listeners

(e.g., the Spanish speaker’s “pat” heard as “bat”) (Williams, 1979).
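
The cross-language confusion described above can be sketched as a single VOT value being categorized against two different boundaries. The boundary values below are illustrative assumptions, not measurements from the studies cited, and the function name is hypothetical.

```python
# Illustrative category boundaries in milliseconds (assumed values, not
# taken from the source): negative VOT = voicing lead, positive = lag.
VOT_BOUNDARY_MS = {"english": 25.0, "spanish": 0.0}


def perceived_voicing(vot_ms, listener_language):
    """Return 'voiced' or 'voiceless' as a listener of the given language
    would be expected to categorize a syllable-initial stop."""
    boundary = VOT_BOUNDARY_MS[listener_language]
    return "voiceless" if vot_ms > boundary else "voiced"


# A short-lag stop (about +10 ms) falls in the English voiced category
# and in the Spanish voiceless category under these assumed boundaries.
short_lag = 10.0
english_view = perceived_voicing(short_lag, "english")
spanish_view = perceived_voicing(short_lag, "spanish")
```

Under these assumptions the same short-lag token is "voiced" to the English listener and "voiceless" to the Spanish listener, which is the pattern behind both confusions.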

Fine-grained articulatory and perceptual patterns such as VOT tend to

be transferred from L1 to L2 (Port & Mitleb, 1983). Because these patterns

involve complex articulatory and perceptual habits, they can be very

difficult to modify. It has been demonstrated that subsegmental deviations

from native norms such as these play a role in native listeners’ perception

of foreign accent (Flege, 1984; see also Eskenazi, this issue), and it is

likely that they also have important consequences for the intelligibility of

nonnative speech.

Rochet (1995) describes a case of phonological interference that illustrates

the need for detailed language-specific analysis of speech production

errors. The facts in the case are the more pointed for being, on the

surface, quite counter-intuitive. English and Brazilian Portuguese each have

two high vowels, /i/ and /u/. French has three, /i/ and /u/ and the high

front rounded /y/. English-speaking learners of French tend to substitute

English /u/ for French /y/ whereas Brazilian Portuguese learners of French

tend to substitute their /i/ for this novel phoneme.

These facts pose something of a conundrum for those who would like to

base predictions of learning difficulty strictly on phonological analyses. In

terms of phonemes, English and Brazilian Portuguese have what appears

to be the same difference with respect to French—namely two high vowels

as opposed to three. Only by examining differences in the category

boundaries in the three languages—that is, phonetic differences—can the

difference in perception be understood. Using synthetic stimuli that varied

the second formant in a continuum from /i/ to /u/, Rochet (1995)

showed that English and Portuguese speakers divide this perceptual space

differently. Portuguese speakers identified more of the stimuli as /i/ than

as /u/ and English speakers did the opposite. Stimuli in the middle of the

continuum (which French speakers identified as /y/) were thus classified

differently by English and Portuguese speakers. In summary, as Rochet

(1995) points out, these examples underscore the need for detailed acoustic-phonetic

analysis if the effects of a given L1 phonology on the learning

of a given L2 are to be understood.




Perception Precedes Production?

The relation of speech production ability to speech perception ability in

second language learning is not very well understood at present. While

second language teachers often assume that students must be able to perceive

an L2 contrast before they can learn to produce it, this is not necessarily

always the case. Goto (1971) and later Sheldon and Strange (1982)

showed that some Japanese learners of English were able to produce the

English /r, l/ contrast more reliably than they could perceive it.

Learning to Perceive

Speech perception training for second languages has been well studied

in recent years. This research has yielded several interesting facts that

have led us to include perception training as an important component of

the Pronto system. Logan, Lively, and Pisoni (1991) showed that perception

training with natural speech tokens of American English /l/ and /r/ in

several phonological environments spoken by multiple talkers was effective

in training Japanese learners to perceive this novel (and difficult) contrast.

The subjects in this study showed not only improved identification

(and lowered response latencies) for the words actually trained, but also

generalization of training to new words containing these sounds, spoken

by new talkers. The generalization of training effect in this study contrasts

with the later finding of no generalization for subjects trained on a single

talker (Lively, Logan, & Pisoni, 1993). Furthermore, Lively et al. (1994)

showed that training of this sort can result in changes in adults’ second

language perception that persist over time. Subjects trained using this

paradigm who were retested after three months showed that they had

retained their improved ability to correctly identify words containing these sounds.


Perception as a Route to Production

Rochet (1995) cites a study showing that perception training can actually

improve speech production skills (see also Bradlow et al., 1996). Native

Mandarin Chinese-speaking subjects in this experiment were trained

to modify their French VOT boundary for /b/ and /p/ toward native French

values using synthetic speech stimuli in a “bu/pu” continuum. This perception

training by itself was shown to improve the subjects’ correct productions

of words containing these phonemes as measured by native French

listeners’ classification of the words in tests given before and after training.

Subjects’ perception of the French category boundary was also improved,

and this result generalized to /b/ and /p/ before different untrained

vowels as well as to new voiced/voiceless consonant pairs /g, k/

and /t, d/ before /u/. Importantly, this training did not generalize to different

word positions for these phonemes—for example, syllable-final or

intervocalic positions of /b/ and /p/. This fact emphasizes the need to

train L2 learners with words containing target contrasts in as many word

positions as possible (Rochet, 1995; Lively, Logan, & Pisoni, 1993).

These facts underlie the statement of two principles that have guided

the development of the Pronto training modules. First, segmental errors

that are typical in speakers of one language when learning a second language

are not predictable from theory (at least, not from any currently

developed theory) and must be determined empirically for specific L1/L2

pairs (Munro, 1991). Second, speech perception training using native productions

by multiple talkers should be combined with speech production

drills to have the best chance of improving the intelligibility of learners’ speech.




Why to Train: Cognitive Costs of Listening to Accented Speech

A third principle guiding curriculum development in the Pronto system

is that the effects on intelligibility of specific nonnative pronunciation errors

should be established empirically. It is possible that “accented but

perfectly intelligible speech” (a subjective rating by a trained Foreign Service

Institute evaluator cited in Yule, 1990) does exist. It is certainly the

case that there are many factors besides the pronunciation of individual

speech segments that influence speech comprehension. Syntactic and semantic

context (Morton, 1979), grammaticality and familiarity of topic

(Gass & Varonis, 1984), and familiarity of accent and speaker (Brodkey,

1972) are additional factors that influence how well listeners understand

speech. But human speech understanding cannot operate completely “top-down.”

Errors in the sensory input that result in bottom-up processing

errors or uncertainties for listeners will reduce the comprehensibility of

an utterance (Marslen-Wilson, 1985). Experiments with synthetic speech

have shown that even when that speech is highly intelligible, it requires

more cognitive effort to process (as estimated by response latencies in

word/nonword classification tasks) than does natural (native) speech

(Pisoni, Nusbaum, & Greene, 1985). This finding suggests that even “perfectly

intelligible” foreign-accented speech may require more processing

time than native speech and may in fact be less intelligible in suboptimal

listening conditions such as over the telephone, in the presence of background

noise, or under conditions of high cognitive load.

What to Train: Estimating Effects of Specific Pronunciation Errors on Intelligibility

In addition to determining empirically the kinds of pronunciation errors

typical of a given L1/L2 pair, the effect of these errors on the intelligibility

of individual words and on utterances longer than a word should be established.

For developing an efficient pronunciation training curriculum, it

would also be useful to know which segmental errors are the most detrimental

to overall intelligibility in the target language. While these issues

have been discussed from a theoretical perspective in the past (e.g., Brown,

1988), they have not received the empirical study they deserve.

Rogers and Dalby (1996) studied the effects of segmental errors in Mandarin

Chinese-accented English. Segmental errors were evaluated by collecting

native English listeners’ responses to Mandarin-accented productions

of isolated words in a forced-choice, two-alternative identification

task using minimal pairs (such as “bead, bid”). In this procedure the interpretation

of the identity of the error is unambiguous, even in the presence

of other possible production errors in the same word (Weismer & Martin,

1992). Sentence and passage errors were measured using a count of words

correct in orthographic transcriptions of native listeners. The study showed

that segmental error scores obtained from read productions of isolated

words predict errors in reading whole sentences and passages reasonably

well (see also Rogers, 1997). Furthermore, Rogers and Dalby showed that

certain segmental error types had a greater effect on connected speech

intelligibility than did others. Errors in vowel tenseness, the distinction

between English /i/ and /I/ in “bead, bid,” for example, affected intelligibility

more than other vowel or consonant errors. Among consonant errors,

errors in voicing (/p/ versus /b/ in “pat, bat”) degraded intelligibility

more than other types.

The establishment of a link between segmental errors in isolated words

and the intelligibility (or comprehensibility) of larger utterances is important

because it validates the use of minimal contrast drill (also called minimal

pairs) in pronunciation training and makes it easier to evaluate the

effectiveness of that training. Intelligibility training based on minimally

contrasting words is typical of speech training provided to hearing-impaired

and normal-hearing misarticulating children by speech-language

pathologists (Kewley-Port et al., 1991). It is also widely used in second

language instruction (Kenworthy, 1987; see also Wachowicz & Scott, this


issue) and has even proved effective in speech perception training (Logan

et al., 1991).

How to Train: Effectiveness of Minimal Pair Training Using ASR

Rogers, Dalby, and DeVane (1994) established that speech drill using

minimal pairs can result in improved productions of English words by

native Mandarin Chinese speakers. They also showed that this kind of

drill was effective when conducted using ASR technology. The computer-based

training used by Rogers et al. employed ISTRA’s template-based,

speaker-dependent word recognizer. While ISTRA was not developed for

second language training, Rogers et al. (1994) used it in this preliminary

study to determine whether feedback derived from a speech recognition

score could improve the intelligibility of L2 speech. Pre- and posttest intelligibility

ratings by a jury of native listeners showed that the training

was effective. Not only did both vowel (the /i/ vs. /I/ contrast) and consonant

(/th/ vs. /s/) productions improve, but the study also showed that

this ASR-trained intelligibility improvement generalized to untrained words

containing these contrasts. Though modest in scope, this study appears to

be one of the few to show experimentally that speech production skills

can improve with training that uses ASR technology. The importance of

such studies in establishing the validity of this kind of training cannot be



Methods for Developing Curricula for Segmental Intelligibility Training

The following sections will describe the methods used for developing

pronunciation training modules for specific L1/L2 pairs for Pronto. We

believe the development path described here is unique in that it relies on

linguistic analysis to establish, to the greatest extent possible, the specifications

of a suitable ASR technology. This contrasts with approaches in

which the capabilities of the selected speech recognizer determine what

will be trained.

Step 1. Creating the Segmental Inventory Test: The first step in the empirical

discovery of L2 pronunciation difficulties for modules of the Pronto

system is to create word lists that contain all the vowel, diphthong, and

consonant phonemes of the target language in as many syllable positions

as possible. Ideally this would be an exhaustive inventory. With a language

like Spanish it is practical to approximate this comprehensive coverage

more closely than it is for English, since the number of initial, medial,

and final consonant clusters in Spanish is relatively small. With American

English as the target language, however, it is impractical to include all

the medial clusters of the language in the list: American English has 67

possible syllable-initial clusters, about 173 possible syllable-final clusters,

and very many more possible word-medial clusters. The necessary economy

of sampling medial clusters is not ideal, but we have rationalized that it is

at least appropriate because all medial clusters in English are composed of

a possible syllable final cluster followed by one of the possible syllable

initial clusters. To date we have versions of a Segmental Inventory Test for

General American English and for Spanish. The American English list

contains about 360 words and the Spanish list contains about 230. Both

lists also contain several polysyllabic words designed to elicit typical errors

in stress placement.

Step 2. Analyzing Errors from the Segmental Inventory Test: The second

development step for each training module is to perform an error

analysis of accented speech. Digital audio recordings are collected of native

L1 speakers reading the L2 segmental inventory test. The speakers

are selected to represent different levels of ability in L2. Phoneticians then

carefully transcribe these recordings, using a consensus method among multiple

transcribers. The errors produced in the readings are collated

and counted, yielding a subset of the segmental inventory test containing

the words in which segmental errors were actually made by the talkers. To

help ensure that the errors are representative of the group of speakers,

errors that occur only once are eliminated. The error analysis process is

labor intensive, but one can be reasonably confident that it yields an error

list that is typical of segmental pronunciation difficulties for a specific

group of speakers. Error analysis of Mandarin Chinese-accented English

yielded an inventory of 45 consonantal and 13 vocalic errors (Rogers,

1997). Analysis of Spanish-accented English produced 60 consonantal and

19 vocalic errors, while an analysis of English-accented Spanish showed

54 consonantal errors and 55 vocalic errors. In addition to segmental errors,

word-level stress placement errors are also transcribed and tabulated.
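
The collating step in Step 2 can be sketched as a simple tally. The data layout (pairs of target and produced phonemes pooled over speakers), the function name, and the sample values are assumptions for illustration, not the authors' actual tooling or data.

```python
from collections import Counter


def tabulate_errors(transcribed_errors):
    """Count (target, produced) substitutions and drop one-off errors.

    `transcribed_errors` is an iterable of (target_phoneme,
    produced_phoneme) tuples pooled over all speakers.
    """
    counts = Counter(transcribed_errors)
    # Keep only errors observed more than once.
    return {error: n for error, n in counts.items() if n > 1}


# Made-up sample transcriptions, for illustration only.
sample = [("i", "I"), ("i", "I"), ("th", "s"), ("th", "s"),
          ("th", "s"), ("v", "w")]
recurring = tabulate_errors(sample)
```

Here the single ("v", "w") substitution is discarded, while the two recurring substitutions are retained with their counts.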

Step 3. Developing the Pronunciation Training Sequence: Motivated by

the documented success of phonological contrast training, lists of minimal

pairs of words are created containing the segmental contrasts derived

from the error analysis. To speed this process, software has been developed

to search machine-readable dictionaries that include phonemic transcriptions.

This software allows the user to specify a contrast and a phonological

environment; it then searches the lexicon exhaustively for word

pairs containing the contrast. In the English/Spanish module, for example,

one error from the Segmental Inventory Test by native speakers of English

was pronouncing intervocalic Spanish /d/ as Spanish /r/, based on interference

from the English rule that flaps /d/ between vowels, discussed


earlier. The rule defining that error would be input as follows:

V d V —> V r V (i.e., /d/ is pronounced /r/ between vowels).

Given this rule, the program produces (eventually) a list of Spanish word

pairs such as

acabada/acabara ‘finished/it finished’

afectada/afectara ‘affected/it affected’

and so on. For the above rule, the program produced 363 pairs, to be used

for training American speakers away from applying the flapping rule to

Spanish words. The software is currently not sensitive to word frequency

and sometimes produces pairs containing extremely infrequent words,

which have to be eliminated by hand by native speakers of the target language.
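
A minimal sketch of such a lexicon search is given below, assuming a lexicon that maps each spelling to a tuple of phoneme symbols. The toy lexicon, the simplified vowel set, and the function name are hypothetical; the actual software and dictionary format are not described in enough detail to reproduce.

```python
VOWELS = {"a", "e", "i", "o", "u"}


def find_minimal_pairs(lexicon, source, target):
    """Find word pairs that differ only by `source` vs `target` between vowels.

    `lexicon` maps spelling -> tuple of phoneme symbols.
    """
    # Index words by phoneme string so candidate forms can be looked up.
    by_phonemes = {}
    for word, phones in lexicon.items():
        by_phonemes.setdefault(phones, []).append(word)

    pairs = []
    for word, phones in lexicon.items():
        for i in range(1, len(phones) - 1):
            intervocalic = phones[i - 1] in VOWELS and phones[i + 1] in VOWELS
            if phones[i] == source and intervocalic:
                # Apply the rule V source V -> V target V at this position.
                swapped = phones[:i] + (target,) + phones[i + 1:]
                for other in by_phonemes.get(swapped, []):
                    pairs.append((word, other))
    return sorted(pairs)


toy_lexicon = {
    "cada": ("k", "a", "d", "a"),
    "cara": ("k", "a", "r", "a"),
    "todo": ("t", "o", "d", "o"),
    "toro": ("t", "o", "r", "o"),
    "dedo": ("d", "e", "d", "o"),
}
pairs = find_minimal_pairs(toy_lexicon, "d", "r")
```

With this toy lexicon the search yields cada/cara and todo/toro; "dedo" is skipped because the form produced by the rule is not in the lexicon. A frequency filter would be one way to address the infrequent-word problem noted above.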

Step 4. Ranking the Importance of Errors: The next step is to determine

the importance of each error in the module’s error list, as estimated by

phoneme frequency and the relative number of minimal pairs representing

the error found in the lexicon. A formula provides a measure of importance

that is used in editing the lists of minimal pairs to derive a set of

training words for the speech production and perception drills. As a result

of this editing, the minimal pairs represent the errors in the error list in

numbers that are roughly proportional to the error’s estimated importance.

Assuming that students will have equal difficulty in learning each of the

contrasts, this distribution of training pairs allows for more training time

to be allocated to the more important contrasts.
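
The proportional allocation in Step 4 might look like the sketch below. The importance formula shown (phoneme frequency multiplied by the error's share of minimal pairs) is a placeholder assumption, since the actual formula is not given here; the example numbers are likewise invented.

```python
def allocate_training_pairs(errors, total_slots):
    """Distribute `total_slots` training pairs across errors.

    `errors` maps an error label -> (phoneme_frequency, minimal_pair_count).
    The importance formula below is a placeholder assumption.
    """
    total_pairs = sum(count for _, count in errors.values())
    importance = {
        label: freq * (count / total_pairs)
        for label, (freq, count) in errors.items()
    }
    total_importance = sum(importance.values())
    # Each error gets a share of slots proportional to its importance,
    # but never more pairs than the lexicon search actually found.
    return {
        label: min(errors[label][1],
                   round(total_slots * score / total_importance))
        for label, score in importance.items()
    }


# Invented figures for two errors, for illustration only.
example = {"d~r": (0.05, 300), "b~p": (0.02, 100)}
slots = allocate_training_pairs(example, 40)
```

Because of rounding, the allocated counts may not always sum exactly to `total_slots`; a production version would need a tie-breaking rule.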

Step 5. Recording Native Speakers: The edited error lists are given to

native speakers to read and record. The digitized waveforms are used in

perception training as well as in training and evaluating the speech recognizer

used in the Pronto system.

Step 6. Ordering by Complexity: While the Pronto system does not rank

single contrasts relative to other single contrasts in terms of expected learning

difficulty, it does assume that words containing more than one error

environment are more difficult for learners to produce and perceive than

are those containing a single error environment. Thus, early training is

conducted using word pairs with only one error environment. Words containing

multiple environments are introduced only after improvement has

been shown for each of the individual errors.
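
The ordering rule in Step 6 can be sketched as follows, assuming each training word is annotated with the error environments it contains. The annotation format, the error labels, and both function names are hypothetical.

```python
def order_by_complexity(words):
    """Sort training words so single-environment items come first.

    `words` maps a word -> list of error environments it contains.
    """
    return sorted(words, key=lambda w: (len(words[w]), w))


def unlocked_words(words, improved_errors):
    """Return words available for training at the current stage.

    Single-environment words are always available; a multi-environment
    word is unlocked once every error it contains has shown improvement.
    """
    return [
        w for w in order_by_complexity(words)
        if len(words[w]) == 1 or set(words[w]) <= set(improved_errors)
    ]


# Hypothetical annotations for three English words.
annotated = {"thick": ["th~s"], "bead": ["i~I"], "thief": ["th~s", "i~I"]}
early = unlocked_words(annotated, {"th~s"})
later = unlocked_words(annotated, {"th~s", "i~I"})
```

Here "thief" stays locked until both of its contrasts have improved.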


Learner Tasks, Feedback, and Adaptive Sequencing in Pronto

It is not the case that all typical L2 errors present equal learning difficulty.

However, aside from such notable errors as English /l, r/ for Japanese

and speakers of certain other Asian languages, or the difficulty native

French speakers have with English /dh/ and /th/, not enough is currently

known to anticipate relative learning difficulties with confidence in curriculum

design. In the Pronto system we address this uncertainty with a

system of scoring designed to adapt automatically to differences in learning

rate for the contrasts in each module. This is done by keeping a record

of student performance for each contrast on each of three tasks:

1) Word identification—Students listen to words from multiple

talkers presented aurally and indicate which word they hear

via mouse or keyboard;

2) Word imitation—Students repeat words presented aurally and

their response is evaluated by the recognizer;

3) Word production—Students respond to visually presented

prompts by speaking the word, with no immediate auditory


The system evaluates and records performance on each of these tasks

continuously. Feedback is displayed to student and instructor in a barchart,

illustrated in Figure 2, which summarizes skill level by contrast pair by

task (perception or production).


Figure 2

View of the Pronto system User Interface

Note: Phonological contrast (minimal) pairs are listed in order of importance

from top to bottom. Light outer portions of the horizontal bars show

student’s current skill level in speech perception (left) and speech production

(right). The dark inner portions of the bars indicate students’ current

“intelligibility gap” for these skills. (Graphic design by Roy Sillings.)

In addition to keeping current scores for each task for each contrast in

the curriculum, the system keeps a global score that is a weighted sum of

the scores by task. By maintaining this global training score, the system

can adapt automatically to differential learning difficulty for the various

contrasts. This unique aspect of Pronto training provides a mechanism

derived from linguistic principles for steering the student efficiently through

the curriculum. Users may choose which contrast and which types of drills

they wish to practice, but their overall intelligibility profile will improve

more if they show improvement on the phonetic contrasts that are more

highly valued by the global training score.
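
A minimal sketch of this bookkeeping, in Python, may help make the idea concrete. The task weights, the per-contrast importance values, and all names (ContrastRecord, global_score) are illustrative assumptions; the article does not specify how Pronto actually weights tasks or contrasts.

```python
from dataclasses import dataclass, field

TASKS = ("identification", "imitation", "production")

# Hypothetical task weights (assumption: not given in the article).
TASK_WEIGHTS = {"identification": 0.2, "imitation": 0.3, "production": 0.5}


@dataclass
class ContrastRecord:
    """Running tally of trials for one contrast, e.g. 'thick/sick'."""
    importance: float  # hypothetical weight for the contrast's rank in the curriculum
    correct: dict = field(default_factory=lambda: {t: 0 for t in TASKS})
    total: dict = field(default_factory=lambda: {t: 0 for t in TASKS})

    def record(self, task, hit):
        """Store the outcome of one trial on one task."""
        self.total[task] += 1
        self.correct[task] += int(hit)

    def task_score(self, task):
        """Proportion correct on a task; 0.0 if the task is untried."""
        n = self.total[task]
        return self.correct[task] / n if n else 0.0

    def weighted_score(self):
        """Weighted sum of the three task scores for this contrast."""
        return sum(TASK_WEIGHTS[t] * self.task_score(t) for t in TASKS)


def global_score(records):
    """Importance-weighted average of the per-contrast weighted scores."""
    total_importance = sum(r.importance for r in records.values())
    return sum(r.importance * r.weighted_score()
               for r in records.values()) / total_importance


records = {
    "thick/sick": ContrastRecord(importance=2.0),
    "red/wed": ContrastRecord(importance=1.0),
}
records["thick/sick"].record("identification", True)
records["thick/sick"].record("identification", False)
records["thick/sick"].record("imitation", True)
print(round(global_score(records), 3))
```

Under these assumed weights, improvement on a contrast with a larger importance value moves the global score further than the same improvement on a less important contrast, which is the steering effect described above.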



A key thrust of the ISTRA program has been to identify a speech recognition

engine that can be configured so that its output, a recognition score,

provides a measure of speech quality that is useful as feedback to the

learner. To be most useful in intelligibility training of the sort described

here, the speech recognizer should meet two requirements:

1) It should be capable of highly accurate recognition of word

pairs containing the target phoneme contrasts when spoken

by native speakers. This establishes the baseline recognition

accuracy of the system. If the recognizer cannot reliably discriminate

between “thick” and “sick,” for example, when these

words are produced by native speakers, it certainly will not

be able to do so when presented with speech that is accented

or disordered in some other way.

2) It should produce recognition scores that can be used to provide

valid evaluative feedback for speech training drills.

In this section we will discuss methods that we and our co-investigators

have developed for assessing how well a proposed recognition technology

meets the two requirements of identification accuracy and evaluative validity

(Anderson & Kewley-Port, 1995; Watson et al., 1989).

HMM Versus Template-Based Recognizers

There are two main classes of recognizers: (a) Hidden Markov Model

(HMM) systems, based on nondeterministic stochastic modeling, and (b)

template-based systems, which perform pattern matching using dynamic

programming or other time normalization techniques. HMM systems underlie

many of the language tutors described in this volume (e.g., the Entropic

HTK recognizer used in Subarashii, the SRI Nuance recognizer

used in VILTS, and the Carnegie Mellon University Sphinx recognizer used

in work reported by Aist and Mostow and by Eskenazi). These are commonly

used to support continuous speech recognition. Template-based

systems include the well known Scott Instruments Model SIR, no longer

marketed but key to early ASR-based language tutors (e.g., Auralog’s earlier

AuraLang products; see also LaRocca, 1994). A current example is

Motorola’s Clamor speech recognizer. These are commonly used to support

discrete speech recognition. Whereas research in speech recognition

now tends to rely on the stochastic modeling of HMM, it is also true that

earlier successes with pattern-matching systems have shown them to be

useful for speech training aids, as detailed below.
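
To illustrate what template matching with dynamic programming involves, the following Python sketch computes a dynamic time warping (DTW) distance between a spoken token and stored word templates, then picks the closer template. It is a textbook DTW under stated assumptions, not the algorithm of any recognizer named above: the one-dimensional feature frames are toy values, and the names dtw_distance and classify are hypothetical.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two sequences of feature vectors.

    Each sequence is a list of equal-length lists of floats (one per frame).
    The warping path may stretch or compress either sequence in time.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local frame-to-frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # token frame repeated
                                 cost[i][j - 1],      # template frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(token, templates):
    """Return the best-matching word and the distance to every template."""
    distances = {word: dtw_distance(token, frames)
                 for word, frames in templates.items()}
    best = min(distances, key=distances.get)
    return best, distances


# Toy templates for a minimal pair (invented values, one feature per frame).
templates = {
    "thick": [[0.0], [1.0], [2.0]],
    "sick": [[5.0], [5.0], [5.0]],
}
token = [[0.0], [0.0], [1.0], [2.0]]  # a slower rendition of "thick"
best, distances = classify(token, templates)
print(best, distances)
```

The distance to the winning template is the kind of graded quantity that could serve as a recognition score, whereas the identity of the winning template gives only a hit or a miss.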

Tests of Speech Recognizers on Minimal Pairs

Many commercially available recognition engines in both classes have

undergone extensive testing and can boast of impressively high recognition

accuracy for typical voice input tasks. However, the phonetic discriminations

these tasks require are not typically as challenging as are

those of intelligibility training employing minimally contrasting word pairs.

Using tests with minimal pairs, we have conducted our own evaluations

of various commercial and experimental systems that represent both classes

of technology. HMM recognizers used in past evaluations include

DragonWriter from Dragon Systems (available from their Voice Tools

toolkit product) and VoiceType Application Factory from IBM. Template-based

recognizers have included Micro IntroVoice from Voice Connexion

and the Scott Instruments Model SIR, later implemented on the Aria chip

set from Sierra Semiconductor.

Tests in ISTRA

METHOD: In the course of evaluating three speaker-dependent recognizers

as candidates for use in the ISTRA system, Anderson and Kewley-Port

(1995) created a database of normal and misarticulated English speech.

The normal speech was from adult speakers and included minimal pairs

containing the twenty-five most common substitution errors of

misarticulating children. The database contained multiple repetitions of

word pairs such as “some/thumb, red/wed, then/den, thin/fin” and so on.

Tokens of disordered speech were collected from children undergoing

speech therapy. Misarticulated tokens were recorded at the beginning of

their training and improved productions were elicited at the end. The basic

recognition accuracy of one HMM and two template-based recognizers

was evaluated using twenty test tokens per contrast from normal adult

speech. All recognizers were optimized for performance prior to testing.


The HMM recognizer had the best overall accuracy with 90% correct. The best of

the two template recognizers was correct on 86% of the trials while the

second template recognizer was correct on only 79% of the trials. The

HMM recognizer did not perform better on all the contrasts in the database,

however. On the manner of articulation distinction in the pair “shore/

chore,” and the manner and place contrast in the pair “gem/them,” for

example, the HMM performed worse than at least one of the template

recognizers, and the pattern of errors was also very different between the

two template-based systems. Since the methods of signal processing and

evaluation used in each of the systems differ greatly, this result is not surprising.

However, it does reveal how important it is to specify the details

of the recognition task when evaluating a recognizer. These data showed

that all three of the recognizers had strengths and weaknesses that are not

obvious from examining just the overall recognition accuracy.
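
The point about overall accuracy hiding per-contrast differences can be sketched in a few lines of Python. The trial data below are invented purely for illustration and are not taken from Anderson and Kewley-Port (1995); the function name accuracy_by_contrast is likewise hypothetical.

```python
from collections import defaultdict


def accuracy_by_contrast(trials):
    """Tally recognition accuracy overall and separately for each contrast.

    trials: iterable of (contrast, intended_word, recognized_word) tuples.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for contrast, intended, recognized in trials:
        counts[contrast] += 1
        hits[contrast] += int(intended == recognized)
    per_contrast = {c: hits[c] / counts[c] for c in counts}
    overall = sum(hits.values()) / sum(counts.values())
    return overall, per_contrast


# Invented example trials (assumption: illustrative values only).
trials = [
    ("shore/chore", "shore", "shore"),
    ("shore/chore", "chore", "shore"),  # misrecognized
    ("gem/them", "gem", "gem"),
    ("gem/them", "them", "them"),
]
overall, per_contrast = accuracy_by_contrast(trials)
print(overall, per_contrast)
```

Two recognizers with the same overall figure can have quite different per-contrast tables, so a comparison of this kind is most informative when reported contrast by contrast.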

Results—Validity of Evaluative Scores: Speech from misarticulating children

was used by Anderson and Kewley-Port (1995) to evaluate the

recognizers for their capability to distinguish between tokens of words

rated as “correct” versus “incorrect” by trained human listeners as well as

for their potential in deriving measures of evaluative feedback to be used

in speech drill. Again, the HMM recognizer was better than the template

recognizers in distinguishing between the two categories of tokens. However,

it was much worse than either template recognizer at providing an

evaluative score for samples of disordered speech. In this test a jury of five

trained listeners rated multiple tokens of words from misarticulating children

on a seven-point scale where one is equal to “very poor articulation”

and seven equates with “normal.” The recognizers were trained using the

three best of these productions and the recognition scores for the other

tokens in the set were recorded. Correlation analysis of the recognition

scores from the three recognizers with the average of the human listeners

was then performed. This analysis showed that the correlations for the

template-based recognizers' scores were quite good, similar in fact to

interrater correlations, but that the correlation for the “confidence” score

returned by the HMM recognizer was not. Subsequent study of the more standard log

likelihood ratios of another HMM recognizer also showed low correlation

with human ratings.
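
The correlation analysis described above can be sketched as follows in Python. This assumes a plain Pearson correlation between recognizer scores and the mean jury rating per token; the article does not state which correlation statistic was used, and the function names and numbers are illustrative only.

```python
import math


def mean_ratings(jury):
    """Average the listeners' ratings token by token.

    jury: list of per-listener lists, each holding one rating per token.
    """
    return [sum(column) / len(column) for column in zip(*jury)]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Invented ratings on the seven-point scale from two listeners, four tokens.
jury = [[1, 3, 5, 7],
        [2, 4, 4, 6]]
human = mean_ratings(jury)
recognizer_scores = [10.0, 30.0, 40.0, 65.0]  # hypothetical similarity scores
print(round(pearson_r(recognizer_scores, human), 3))
```

If a recognizer reports a distance rather than a similarity, a useful score would be expected to correlate negatively with the human ratings, so the magnitude rather than the sign is what matters in that case.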

In summary, prior research in the ISTRA program has found HMM-based

ASR more accurate for identifying which word was said and template-based

ASR better for measuring quality of pronunciation, based on

its distance metric. In addition, the two classes of ASR differ in terms of

which contrasts each handles better. To date, the decision in ISTRA has

been to rely on template-matching algorithms.

Tests in the Pronto System

Recognizer evaluation for the Pronto system has been conducted following

procedures similar to those in Anderson and Kewley-Port (1995).

A database of digitized speech was collected from fourteen male and fourteen

female speakers of American English for the minimal pair drill derived

from the error analysis of Mandarin-accented English. Baseline

speaker-independent recognition rates on a subset of these data show a

similar pattern of relative performance for a template-based recognizer

and an HMM recognizer. While the overall recognition accuracy for the

HMM recognizer is higher, its scores are lower than those of the template

recognizer for vowel and nasal contrasts. To date, neither recognizer has

produced evaluative scores with acceptably high correlations with human

judgments of speech quality when tested in speaker-independent mode.

While research continues on deriving a valid intelligibility rating measure

from a speaker-independent recognition technology, the first version

of the Pronto system has been implemented using the HMM recognizer.

Because pronunciation training will be conducted using the minimal pairs

target/error paradigm, with both templates active at the time of recognition,

the dichotomous (“hit/miss”) feedback of the system is valid. Experiments

are planned for the future to assess the effectiveness of Pronto

training in improving the intelligibility of learners speaking second-language



The authors would like to thank William Mills for his contributions to the

development of the ISTRA and Pronto systems. This work was supported

by the National Institutes of Health’s National Institute on Deafness and Other

Communication Disorders, SBIR grant DC02213, and by Army Research

Institute contract DASW01-96-C-044 to Communication Disorders Technology,

Bloomington, IN.


1 While the phonemic boundary for English voiced/voiceless stops (/b/ vs. /p/) is

at the “very short lag” position, variants of voiced stops with long lags occasionally

occur in English.


Anderson, J. I. (1983). The difficulties of English syllable structure for Chinese

ESL learners. Language Learning and Communication, 2 (1), 53-61.

Anderson, S., & Kewley-Port, D. (1995). Evaluation of speech recognizers for speech

training applications. IEEE Transactions on Speech and Audio Processing,

3 (4), 229-241.

Bradlow, A., Akahane-Yamada, R., Pisoni, D. B., & Tohkura, Y. (1996). Three

converging tests of improvement in speech production after perceptual

identification training on a non-native phonetic contrast. Journal of the

Acoustical Society of America, 100 (4), Pt. 2, 2725 (A).

Brodkey, D. (1972). Dictation as a measure of mutual intelligibility: A pilot study.

Language Learning, 22 (2), 203-217.

Brown, A. (1988). Functional load and the teaching of pronunciation. TESOL

Quarterly, 22, 593-606.

Flege, J. E. (1984). The detection of French accent by American listeners. Journal

of the Acoustical Society of America, 76, 692-707.

Flege, J. E. (1987). The production of “new” and “similar” phones in a foreign

language: Evidence for the effect of equivalence classification. Journal of

Phonetics, 15, 47-65.

Flege, J. E., & Davidian, R. D. (1984). Transfer and developmental processes in

adult foreign language speech production. Applied Psycholinguistics, 5,

323-347.

Flege, J. E., & Wang, C. (1989). Native-language phonotactic constraints affect

how well Chinese subjects perceive the word-final /t/-/d/ contrast. Journal

of Phonetics, 17, 299-315.

Gass, S., & Varonis, E. M. (1984). The effect of familiarity on the comprehensibility

of non-native speech. Language Learning, 34, 65-90.

Goto, H. (1971). Auditory perception by normal Japanese adults of the sounds “l”

and “r.” Neuropsychologia, 9, 317-323.

Kenworthy, J. (1987). Teaching English Pronunciation. New York: Longman.

Kewley-Port, D., Watson, C. S., Elbert, M., Maki, D. & Reed, D. (1991). The

Indiana Speech Training Aid (ISTRA) II: Training curriculum and selected

case studies. Clinical Linguistics and Phonetics, 5, 13-38.

LaRocca, S. (1994). Exploiting strengths and avoiding weaknesses in the use of

speech recognition for language learning. CALICO Journal, 12 (1), 102-


Lisker, L. & Abramson, A. (1964). A cross-language study of voicing in initial

stops: Acoustical measurements. Word, 20, 384-422.

Lively, S. E., Logan, J. S., & Pisoni, D. B. (1993). Training Japanese listeners to

identify English /r/ and /l/ II: The role of phonetic environment and

talker variability in learning new perceptual categories. Journal of the

Acoustical Society of America, 94, 1242-1255.

Lively, S. E., Pisoni, D. B., Yamada, R. A., Tohkura, Y., & Yamada, T. (1994).

Training Japanese listeners to identify English /r/ and /l/ III: Long-term

retention of new phonetic categories. Journal of the Acoustical Society of

America, 96, 2076-2087.

Logan, J. S., Lively, S. E., & Pisoni, D. B. (1991). Training Japanese listeners to

identify English /r/ and /l/: A first report. Journal of the Acoustical Society

of America, 89, 874-886.

Marslen-Wilson, W. D. (1985). Aspects of human speech understanding. In F. Fallside

& W. A. Woods (Eds.), Computer speech processing. Englewood Cliffs,

NJ: Prentice Hall.

Morton, J. (1979). Word recognition structure and process. In J. Morton & J. Marshall

(Eds.), Structure and process. Cambridge, MA: MIT Press.

Miyawaki, K., Strange, W., Verbrugge, R., Liberman, A., Jenkins, J., & Fujimura, O.

(1975). An effect of linguistic experience: The discrimination of [r] and

[l] by native speakers of Japanese and English. Perception and Psychophysics,

18 (5), 331-340.

Munro, M. (1991). Perception and production of English vowels by native speakers

of Arabic (Doctoral dissertation, University of Alberta, 1991).

Pisoni, D. B., Nusbaum, H., & Greene, B. (1985). Perception of synthetic speech

generated by rule. Proceedings of the IEEE, 73, 1665-1676.

Port, R., & Mitleb, F. (1983). Segmental features and implementation in acquisition

of English by Arabic speakers. Journal of Phonetics, 11, 219-229.

Rochet, B. L. (1995). Perception and production of second-language speech sounds

by adults. In W. Strange (Ed.), Speech perception and linguistic experience.

Timonium, MD: York Press.

Rogers, C. L. (1997). Segmental intelligibility assessment for Chinese-accented

English (Doctoral dissertation, Indiana University, 1997).

Rogers, C. L., & Dalby, J. M. (1996). Prediction of foreign-accented speech intelligibility

from segmental contrast measures. Journal of the Acoustical Society

of America, 100 (4) Pt. 2, 2725 (A).

Rogers, C. L., Dalby, J. M., & DeVane, G. (1994). Intelligibility training for foreign-accented

speech: A preliminary study. Journal of the Acoustical Society

of America, 96 (5) Pt. 2, 3348 (A).

Sheldon, A., & Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners

of English: Evidence that speech production can precede speech perception.

Applied Psycholinguistics, 3, 243-261.

Strange, W. (1995). Cross-language studies of speech perception: A historical review.

In W. Strange (Ed.), Speech perception and linguistic experience.

Timonium, MD: York Press.

Watson, C. S., Reed, D., Kewley-Port, D., & Maki, D. (1989). The Indiana Speech

Training Aid (ISTRA) I: Comparisons between human and computer-based

evaluation of speech quality. Journal of Speech and Hearing Research,

32, 245-251.

Weismer, G., & Martin, R. (1992). Acoustic and perceptual approaches to the study

of intelligibility. In R. D. Kent (Ed.), Intelligibility in speech disorders:

Theory, measurement and management. Amsterdam: J. Benjamins.

Williams, L. (1979). The modification of speech perception and production in second-language

learning. Perception and Psychophysics, 26 (2), 95-104.

Yule, G. (1990). The spoken language. Annual Review of Applied Linguistics, 10,



Jonathan Dalby is Senior Scientist at Communication Disorders Technology

(CDT), Inc., where he conducts research and development of speech

training systems that employ automatic speech recognition technology.

Before joining CDT, he served as Research Associate at the Centre for

Speech Technology Research at the University of Edinburgh, Scotland,

and participated in the development of a large-vocabulary continuous

speech recognition system. Previously, he studied speech production and

speech perception in the Phonetics Laboratory at Indiana University, and

he taught English as a Second Language for several years both overseas

and in the United States. His Ph.D. in linguistics is from Indiana University.

Diane Kewley-Port is Associate Professor in the Department of Speech

and Hearing Sciences at Indiana University as well as cofounder and Executive

Vice President of Communication Disorders Technology, Inc. She

has studied the use of automatic speech recognition in speech training

systems since 1987. Previously, she conducted research in speech signal

processing and speech perception for several years at Haskins Laboratories

and at Bell Laboratories. A Fellow of the Acoustical Society of America,

she is past associate editor of topics in speech processing and communication

systems for the Journal of the Acoustical Society of America. Her

Ph.D. in speech sciences is from City University of New York. She won

the Edward Sapir Award for best dissertation in linguistics and, as a University

of Michigan student, the Sarah Parker Memorial Award as outstanding

woman engineer.


Jonathan Dalby

Communication Disorders Technology, Inc.

501 North Morton Street #215

Bloomington, IN 47404

Phone: 812/336-1766

E-Mail: dalby@comdistec.com

Professor Diane Kewley-Port

Department of Speech and Hearing Sciences

Indiana University

Bloomington, IN 47405

Phone: 812/855-5103

E-Mail: kewley@indiana.edu




