Proceedings Fonetik 2009 - Institutionen för lingvistik

Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

A first step towards a text-independent speaker verification Praat plug-in using Mistral/Alize tools

Jonas Lindh
Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg

Abstract

Text-independent speaker verification can be a useful tool as a substitute for passwords or as an additional security check. The tool can also be used in forensic phonetic casework. A text-independent speaker verification Praat plug-in was created using tools from the open source Mistral/Alize toolkit. A gate keeper setup was created for 13 department employees and tested for verification. Two different universal background models were trained, and the same set was tested and evaluated with each. The results are promising and indicate that such a tool can be useful in research on voice quality.

Introduction

Automatic methods are increasingly being used in forensic phonetic casework, but most often in combination with aural/acoustic methods. It is therefore important to get a better understanding of how the two systems compare. For several studies on voice quality judgement, but also as a tool for visualisation and demonstration, a text-independent speaker comparison was implemented as a plug-in to the phonetic analysis program Praat (Boersma & Weenink, 2009). The purpose of this study was to make the implementation as easy to use as possible, so that people with phonetic knowledge could use the system to demonstrate the technique or perform research. A state-of-the-art technique, the so-called GMM-UBM (Reynolds, 2000), was applied with tools from the open source toolkit Mistral (formerly Alize) (Bonastre et al., 2005; 2008). This paper gives an overview of the implementation and the tools used, without any deeper analysis.
A small test was then made on high quality recordings to see what difference the availability of training data for the universal background model makes. The results show that for demonstration purposes a very simple world model that includes the speakers you have trained as targets is sufficient. For research purposes, however, a larger world model should be trained in order to obtain more accurate scores.

Mistral (Alize), an open source toolkit for building a text-independent speaker comparison system

The NIST speaker recognition evaluation campaign started as early as 1996 with the purpose of driving the technology of text-independent speaker recognition forward, as well as testing the performance of the state-of-the-art approach and discovering the most promising algorithms and new technological advances (from http://www.nist.gov/speech/tests/sre/, Jan 12, 2009). The aim is to have an evaluation at least every second year, and some tools are provided to facilitate the presentation of the results and the handling of the data (Martin and Przybocki, 1999). A few labs have been evaluating their developments since the very start, with increasing performance over the years. These labs have generally performed best in the evaluation. However, an evaluation is a rather tedious task for a single lab, and the question of some kind of coordination came up. This coordination could simply mean sharing information, system scores or other material in order to improve the results. On the other hand, the more natural way to share and interpret results is open source. On this basis Mistral, and more specifically the ALIZE SpkDet packages, were developed and released as open source software under a so-called LGPL licence (Bonastre et al., 2005; 2008).

Method

A standard setup was made for placing data within the plug-in. At the top of the tree structure, several scripts controlling executable binaries, configuration files, data etc. were created, with basic button interfaces that show up in a given Praat configuration.
The scripts were organised according to the steps that have to be covered to create a test environment.

Steps for a fully functional text-independent system in Praat

First of all, the recordings at hand have to be parameterized. In this first implementation SPro (Guillaume, 2004) was chosen for parameter extraction, as support for it was already implemented in the Mistral programs. There are two ways to extract parameters: either you choose a folder with audio files (preferably in wave format, although other formats are supported) or you record a sound in Praat directly. If the recording is meant to enroll a user of the system (a target), the first option in a scroll list, “New User”, can be chosen. This function checks the sampling frequency and resamples if it is other than 16 kHz (currently the default), then performs frame selection by excluding silent intervals longer than 100 ms before 19 MFCCs are extracted and stored in a parameter file. The parameters are automatically energy normalized before storage. The name of the user is also stored in the system's list of users. If you want to add more users, you go through the same procedure again. When you are done you can choose the next option in the scroll list, “Train Users”. This procedure checks the list of users and then normalizes the parameters and trains the user models from a background model (UBM) that was itself trained using the maximum likelihood criterion. The individual models are trained to maximise the a posteriori probability that the claimed identity is the true identity given the data (MAP training). This procedure requires that you already have a trained UBM. If you do not, you can choose the function “Train World”, which takes your list of users (together with any speakers you have added solely for inclusion in the world model) and trains a UBM, by default a Gaussian mixture model (GMM) with 512 components. The last option in the scroll list is “Recognise User”, which tests the recording against all the models trained by the system.
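
The MAP training and scoring steps described above can be sketched in a few lines. This is a minimal illustration of mean-only MAP adaptation from a UBM followed by a raw log likelihood ratio, in the spirit of the GMM-UBM approach (Reynolds, 2000); it is not the Mistral/Alize code. The function names, the relevance factor of 16, the two-component one-dimensional toy UBM and all data values are assumptions made for illustration, whereas the plug-in works on 19 MFCCs with 512 components by default.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def component_logs(x, weights, means, variances):
    """Log of weight times density for each mixture component."""
    return [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation; weights and variances stay those of the UBM.

    The relevance factor is an assumed, commonly used value.
    """
    k, dim = len(weights), len(means[0])
    counts = [0.0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for x in frames:
        logs = component_logs(x, weights, means, variances)
        norm = logsumexp(logs)
        for c in range(k):
            post = math.exp(logs[c] - norm)  # responsibility of component c
            counts[c] += post
            for d in range(dim):
                sums[c][d] += post * x[d]
    adapted = []
    for c in range(k):
        alpha = counts[c] / (counts[c] + relevance)  # data-dependent mixing
        data_mean = [s / counts[c] for s in sums[c]] if counts[c] > 0 else means[c]
        adapted.append([alpha * e + (1.0 - alpha) * m
                        for e, m in zip(data_mean, means[c])])
    return adapted


def avg_loglik(frames, weights, means, variances):
    """Average per-frame log likelihood under a GMM."""
    return sum(logsumexp(component_logs(x, weights, means, variances))
               for x in frames) / len(frames)


def llr(frames, target_means, weights, ubm_means, variances):
    """Raw (not normalised) log likelihood ratio: target model vs. UBM."""
    return (avg_loglik(frames, weights, target_means, variances)
            - avg_loglik(frames, weights, ubm_means, variances))


# Toy 1-D data (hypothetical values, not taken from the paper).
ubm_weights = [0.5, 0.5]
ubm_means = [[-1.0], [1.0]]
ubm_vars = [[1.0], [1.0]]
enrol_frames = [[1.5], [2.0], [2.5]] * 20

target_means = map_adapt_means(enrol_frames, ubm_weights, ubm_means, ubm_vars)
llr_same = llr([[1.8], [2.2], [2.0]], target_means, ubm_weights, ubm_means, ubm_vars)
llr_other = llr([[-1.5], [-1.0], [-1.2]], target_means, ubm_weights, ubm_means, ubm_vars)
print(target_means, llr_same, llr_other)
```

In this toy case the adapted mean of the second component moves from the UBM value towards the enrollment data, so frames resembling the enrollment data score above zero and frames from the other region score below zero.
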
A list of raw (not normalised) log likelihood ratio scores gives you feedback on how well the recording fitted each of the models. In a commercial or fully fledged verification system you would also have to test and decide on a threshold. As that is not the main purpose here, we only speculate on the possible use of a threshold for this demo system.

Preliminary UBM performance test

To get a first impression of how well the implementation worked, a small pilot study was made using two different world models. For this purpose 13 colleagues (4 female and 9 male) at the department of linguistics were recorded using a headset microphone. To enroll them as users they had to read a short passage from a well known text (a comic about a boy ending up with his head in the mud). The recordings from the reading task were between 25 and 30 seconds long. Three of the speakers were later recorded with the same kind of headset to test the system. One male and one female speaker were then also recorded to be used as impostors. For the test utterances the subjects were told to produce an utterance close to “Hej, jag heter X, jag skulle vilja komma in, ett två tre fyra fem.” (“Hi, I am X, I would like to enter, one two three four five.”). The tests were run twice. In the first test only the enrolled speakers were used to train the UBM. In the second the UBM was trained on excerpts from interviews with 109 young male speakers from the Swedia dialect database (Eriksson, 2004). The enrolled speakers were not included in the second world model.

Results and discussion

During the enrollment of speakers some shortcomings in the original scripts were discovered, such as how clipping in recordings was handled and how feedback was given to the user while training models.
The scripts were updated to take care of these issues, and enrollment was then done without problems. In the first test only the intended target speakers were used to train a UBM before they were enrolled.

[Figure 1: bar chart “LLR Score Test 1 Speaker RA”. Y-axis: LLR, from 0.6 down to -1. X-axis: enrolled models with sex, RA (M), JA (M), PN (M), JV (F), KC (F), AE (M), EB (F), JL (M), TL (M), JaL (M), SS (M), UV (F), HV (M); test recording RA_t throughout.]

Figure 1. Result for test 1 speaker RA against all enrolled models. Row 1 shows male (M) or female (F) model, row 2 model name and row 3 the test speaker.

In Figure 1 we can observe that the speaker is correctly accepted, as the speaker's own model yields the only positive LLR (0.44). The closest competitor is the model of speaker JA (-0.08).
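
The accept/reject reading of Figure 1 can be written out as a small sketch. Only the two scores quoted in the text are used; the other eleven values are not given in the text and are not reproduced. The function names and the threshold of 0 are assumptions for illustration, since the paper only speculates about thresholds for the demo system.

```python
def verify(scores, claimed, threshold=0.0):
    """Accept the claimed identity if its LLR exceeds the threshold.

    The default threshold of 0.0 is an assumption, not a tuned value.
    """
    return scores[claimed] > threshold


def best_match(scores):
    """Return the name of the model with the highest LLR."""
    return max(scores, key=scores.get)


# LLR scores for test recording RA_t, as quoted in the text for Figure 1.
scores_ra_test = {"RA": 0.44, "JA": -0.08}

accepted = verify(scores_ra_test, "RA")
top_model = best_match(scores_ra_test)
# Margin between the true model and the closest competing model.
margin = scores_ra_test["RA"] - scores_ra_test["JA"]
print(top_model, accepted, round(margin, 2))
```

With a threshold at zero the claim “RA” is accepted and a claim “JA” for the same recording would be rejected. A real system would set the threshold from a larger set of target and impostor trials.
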