
Proceedings, FONETIK 2004, Dept. of Linguistics, Stockholm University

Modelling interactive language learning: a project presentation

Lacerda, F., Sundberg, U. (1), Carlson, R. (2) and Holt, L. (3)
(1) Dept. of Linguistics, Stockholm University, Stockholm
(2) Dept. of Speech, Music, Hearing, KTH, Stockholm
(3) Dept. of Psychology, Carnegie Mellon University, Pittsburgh, USA

Abstract

This paper describes a recently started interdisciplinary research program aiming at investigating and modelling fundamental aspects of the language acquisition process. The working hypothesis assumes that general purpose perception and memory processes, common to both human and other mammalian species, along with the particular context of initial adult-infant interaction, underlie the infant’s ability to progressively derive linguistic structure implicitly available in the ambient language. The project is conceived as an interdisciplinary research effort involving the areas of Phonetics, Psychology and Speech recognition. Experimental speech perception techniques will be used at Dept. of Linguistics, SU, to investigate the development of the infant’s ability to derive linguistic information from situated connected speech. These experiments will be matched by behavioural tests of animal subjects, carried out at CMU, Pittsburgh, to disclose the potential significance that recurrent multi-sensory properties of the stimuli may have for spontaneous category formation.
Data from infant and child vocal productions as well as infant-adult interactions will also be collected and analyzed to address the possibility of a production-perception link. Finally, the data from the infant and animal studies will be integrated and tested in mathematical models of the language acquisition process, developed at TMH, KTH.

Background

At the advent of experimental studies of the language acquisition process, the primary focus was on the infant’s ability to discriminate or produce isolated speech sounds (e.g. Eimas, Siqueland, Jusczyk, & Vigorito, 1971; Locke, 1983). However necessary and important as initial steps, the study of isolated perceptual or articulatory phenomena soon falls short of addressing the general linguistic implications of such initial phonetic abilities. More recent attempts have therefore focused on investigating, for example, the internal structure of phonetic categories in half-year-old infants (Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992), the infant’s ability to recognize word-like patterns in continuous speech (e.g. Jusczyk, 1999) or to capitalize on statistical regularities in the speech signal (Saffran, Aslin, & Newport, 1996). Research on infants’ vocal production has also shifted from a focus on isolated speech sound productions to attempts to relate patterns of babbling to a wider, biologically oriented linguistic frame (MacNeilage & Davis, 2000). Yet many of these studies emanate from the perspective of full-fledged adult linguistic behaviour, in which the infant is assumed to acquire its ambient language by engaging in a process of finding out its phonemic and other traditional linguistic constituents.
Not surprisingly, the infant’s capability of learning the ambient language, with all its variability, has been proposed to rely on pre-wired linguistic knowledge in terms of a Language Acquisition Device or notions of ‘poverty of the stimulus’. In the present research program, on the other hand, the infant is assumed to have no innate language-specific predispositions, and language acquisition is viewed as a consequence of general sensory and memory processes and of continuity between low-level sensory information processing and language capacity.

From the theoretical outline of the present research project it is assumed that the initial phases of the human language acquisition process constitute a general purpose process through which auditory representations of speech sequences are linked to other co-occurring sensory stimuli as an automatic consequence of exposure to correlated multi-sensory information input. Since spoken language is often used to refer to objects or actions in the shared outside world, there is an inevitable implicit correlation between hearing the sounds of words or phrases relating to these objects and seeing, feeling, smelling or otherwise perceiving the referents of the spoken language.

Initially, a sound sequence may be associated with any object that happens to be presented in temporal contiguity with the sound, but continued exposure to richly varying spoken language soon provides enough information to narrow the scope of the sound-object links or to review them (Lacerda, 2003). As this happens, a more detailed representation of the initially unanalyzed chunks of spoken sounds starts to emerge, enabling differentiation between identifiable (acquired) sound objects and the remaining unknown sounds.

In the present theoretical framework, this represents the emergence of a general segmentation procedure that will eventually lead to lexical representation. At this basic level, the human representation capacity is likely to be shared by some other species, although humans at some level obviously depart from them by using sound representations as objects that can be processed as combinatorial elements while keeping their attached multi-sensory links. At this level a cognitive process that is unparalleled in the animal world begins to emerge, but its ontogenesis is, so far, unaccounted for.
By the coordinated study of the early segmentation and sound-object association processes in human and animal subjects, and by detailed simulation of the experimental conditions with operative mathematical models, the current project expects to make a contribution to the understanding of this unique human capacity.

Research plan

Through a joint effort from three areas of research (early human language development, animal learning and speech technology), data will be gathered and used to generate and calibrate mathematical models that may account for language learning in terms of general multi-sensory input, memory processes and the infant’s interaction with its environment.

Infants as models of speech and language development

Human infants typically engage in speech communication activities within their first years of life and usually achieve a well-developed linguistic competence by about 10-12 years of age. Retrospective analyses of the language acquisition process often present language learning as an effortless and smooth development towards the adult language. This approach misses the important information conveyed by the difficulties and the errors that children actually make, and it detaches the speech signal from its ecological context. Indeed, traditional language development research fails to address some fundamental questions such as:

• Why does the sound of speech seem to be more attractive for the newborn infant than other sounds? Is this neonatal preference a sign of human-specific genetic mechanisms that under normal circumstances generate behaviour that is taken for an innate preference for listening to spoken language?
• How do representations of phonemes, syllables and words arise and develop in the human infant? Can these traditionally assumed units, the very core of the adult linguistic structure, be derived from continuous speech in the absence of pre-wired linguistic knowledge?
• How do adults conceive of the infant’s linguistic competence, and what phonetic and linguistic modifications do they introduce to accommodate the infant?
• Are production preferences reflected in perceptual biases? For example, how is the perception of phonemic contrasts affected by the child’s production abilities? How does the interplay between perception and production develop during the first years of life? How does the combinatorial pressure of lexical representations relate to the child’s phonological awareness?

Questions like these will be investigated from a broad developmental perspective using well-established techniques such as the High-Amplitude Sucking technique, to test the youngest infants, and the visual preference paradigm, to study the emergence of lexical and phonological structure in children up to 2 years of age. The next step will be to investigate the stability over time and the generalization ability of the established auditory-visual representations, testing the subjects’ ability to cope with different kinds of noise affecting the speech signal, in line with the phoneme restoration and rhyming studies.

The subject population will consist of large cross-sectional samples (about 100 subjects) to address specific perceptual issues, plus a group of 20 infant-adult dyads for additional detailed investigation of production and interaction aspects.

Animal studies of sensory processing and memory

As speech scientists began to argue about the specificity of human speech perception, the results from auditory experiments with animal subjects became a crucial contribution to the on-going debate. For instance, animal studies contributed the important result that certain acoustic continua were categorized in much the same way by both human and non-human subjects (Kuhl & Padden, 1983; Kuhl & Padden, 1982). More recent studies involving animal subjects have further exposed general mechanisms accounting for the internal organization of categories of speech sounds, the influence of phonetic context (Lotto, Kluender, & Holt, 1997) and sensitivity to acoustic trading relations (Holt, Lotto, & Kluender, 2001; Kluender, Lotto, Holt, & Bloedel, 1998; Kluender & Lotto, 1994; Lotto et al., 1997). Using animal subjects in perception studies has proven to offer a unique possibility to exercise complete experimental control over experience.
The general procedure in studies with, for instance, gerbils allows assessing detection, discrimination and identification of speech stimuli.

In the present project gerbils will be used in perception experiments since this species has been used successfully in previous studies. Methods of training these animal subjects are based on a go/no-go paradigm. The gerbils are trained with positive reinforcement to remain on a little platform in a cage on a “positive” stimulus and to jump off the platform on a “negative” stimulus. The gerbils are “at work” for 15-20 min. in daily sessions and their performance is measured in terms of % correct responses and d’.

The current project attempts to widen the scope of speech perception experiments with animals by trying to create realistic animal models of certain early stages of the language acquisition process.
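
As an illustration of the two performance measures named for the go/no-go sessions, the following minimal sketch computes % correct and d’ from the four trial counts of a session. It uses the standard signal detection definition, d’ = z(hit rate) - z(false-alarm rate). Several things here are assumptions of the sketch and are not specified in the project description: the mapping of responses to outcomes (a "hit" is taken to be the trained response on a “positive” stimulus, a "false alarm" the same response on a “negative” stimulus), the log-linear correction for extreme rates, the function names and the example counts.

```python
from statistics import NormalDist


def rates(hits, misses, false_alarms, correct_rejections):
    """Return (hit rate, false-alarm rate) with a log-linear correction.

    Adding 0.5 to each count and 1 to each trial total keeps both rates
    strictly between 0 and 1, so the z-transform below stays finite.
    This correction is an assumption of the sketch.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return hit_rate, fa_rate


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    hit_rate, fa_rate = rates(hits, misses, false_alarms, correct_rejections)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


def percent_correct(hits, misses, false_alarms, correct_rejections):
    """Share of trials answered correctly, in percent."""
    correct = hits + correct_rejections
    total = hits + misses + false_alarms + correct_rejections
    return 100.0 * correct / total


if __name__ == "__main__":
    # Hypothetical counts for one daily session (not data from the project).
    session = dict(hits=45, misses=5, false_alarms=10, correct_rejections=40)
    print(f"% correct: {percent_correct(**session):.1f}")
    print(f"d': {d_prime(**session):.2f}")
```

Unlike % correct, d’ separates the animal's sensitivity to the stimulus contrast from any overall tendency to stay on or leave the platform, which is presumably why both measures are reported.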
It is expected that the animal models will perform more or less like the infant subjects up to a certain level of initial representation. By studying the levels of complexity for which the animal models fall short of human performance, the project is expected to contribute important insights into the emergence of early cognitive processes.

Testing the hypotheses in mathematical models

Addressing the early stages of the language acquisition process from a broad perspective offers a unique opportunity to understand and design speech communication systems that hopefully may be able to capitalize on the key aspects of human children’s flexibility and learning capability rather than attempting to mimic adult stereotypes.

The ultimate goal of a speech recognition system is to be able to handle speech signals with human-like efficiency. While the performance of available systems using speech input may be reasonable under optimal communication situations, the systems’ lack of flexibility and their vulnerability to noise or moderately adverse communication situations are a significant hindrance to a wider and safer application of such speech interfaces. There are several factors that account for the mismatch between the expected and the obtained performance of such speech communication interfaces. In addition to the difficulties inherent to the characterization of the speech signals per se, these engineered interfaces have also been developed to mimic and match the speech communication performance of experienced adult speakers, bypassing the developmental process that eventually led to the adult communicative competence.
And while this is an acceptable strategy to quickly approach the ultimate communicative goal, it relies on a deterministic view of the speech signal and of the whole human communication process that tends to result in rather rigid systems that can hardly cope with realistic variance in, for instance, the speech signal.


Part of the problems usually faced by speech recognition systems is caused by the strong focus on the speech signal itself, as the primary component of the speech communication process. While this is a reasonable starting point for the development of man-machine speech interfaces, it is obvious that in natural speech communication settings the speech signal is only part, albeit a crucial one, of the communication process (Colin et al., 2002; McGurk & MacDonald, 1976), and recent developments of language and speech systems are beginning to take advantage of multimodal information to improve the systems’ communication efficiency (e.g. Beskow, 2003). However, in spite of the improvement in communication performance clearly accrued by the introduction of additional information sources, the design of these communication systems may still be limited by their attempt to engineer a full-fledged human communication system while bypassing the long developmental path that eventually led to such competent adult performance. Thus, the current project will attempt to incorporate linguistic and psychological knowledge of the early stages of the language acquisition process in order to design a prototypical system that essentially will be able to learn from multimodal information sources and integrate them to achieve flexible and realistic representations of its environment.

Acknowledgements

The project is funded by The Bank of Sweden Tercentenary Foundation K2003-0867.

References

Beskow, J. (2003). Talking heads - models and applications for multimodal speech synthesis. PhD thesis, TMH/KTH.
Colin, C., Radeau, M., Soquet, A., Demolin, D., Colin, F., & Deltenre, P. (2002). Mismatch negativity evoked by the McGurk-MacDonald effect: a phonetic representation within short-term memory. Clin. Neurophysiol. 113, 495-506.
Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science 171, 303-306.
Holt, L. L., Lotto, A. J., & Kluender, K. R. (2001). Influence of fundamental frequency on stop-consonant voicing perception: a case of learned covariation or auditory enhancement? JASA 109, 764-774.
Jusczyk, P. W. (1999). How infants begin to extract words from speech. Trends Cogn. Sci. 3, 323-328.
Kluender, K. R. & Lotto, A. J. (1994). Effects of first formant onset frequency on [-voice] judgments result from auditory processes not specific to humans. JASA 95, 1044-1052.
Kluender, K. R., Lotto, A. J., Holt, L. L., & Bloedel, S. L. (1998). Role of experience for language-specific functional mappings of vowel sounds. JASA 104, 3568-3582.
Kuhl, P. K. & Padden, D. M. (1982). Enhanced discriminability at the phonetic boundaries for the voicing feature in macaques. Percept. Psychophys. 32, 542-550.
Kuhl, P. K. & Padden, D. M. (1983). Enhanced discriminability at the phonetic boundaries for the place feature in macaques. JASA 73, 1003-1010.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science 255, 606-608.
Lacerda, F. (2003). Phonology: An emergent consequence of memory constraints and sensory input. Reading and Writing: An Interdisciplinary Journal 16, 41-59.
Locke, J. L. (1983). Phonological Acquisition and Change. New York: Academic Press.
Lotto, A. J., Kluender, K. R., & Holt, L. L. (1997). Perceptual compensation for coarticulation by Japanese quail (Coturnix coturnix japonica). JASA 102, 1134-1140.
MacNeilage, P. F. & Davis, B. L. (2000). On the Origin of Internal Structure of Word Forms. Science 288, 527-531.
McGurk, H. & MacDonald, J. (1976). Hearing lips and seeing voices. Nature 264, 746-748.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical Learning by 8-Month-Old Infants. Science 274, 1926-1928.
