Addressing database mismatch in forensic speaker recognition with Ahumada III: a public real-casework database in Spanish

Daniel Ramos, Joaquin Gonzalez-Rodriguez, Javier Gonzalez-Dominguez and Jose Juan Lucena-Molina ¡

ATVS - Biometric Recognition Group, Escuela Politecnica Superior, C./ Francisco Tomas y Valiente 11, Universidad Autonoma de Madrid, E-28049 Madrid, Spain
¡ Acoustics and Image Processing Department, Criminalistic Service, Direccion General de la Policia y de la Guardia Civil, Ministerio del Interior, Madrid, Spain

daniel.ramos@uam.es

Abstract

This paper presents and describes Ahumada III, a speech database in Spanish collected from real forensic cases. In its current release, the database presents ¢£ male speakers recorded using the systems and procedures followed by the Spanish Guardia Civil police force. The paper also explores the usefulness of such a corpus for facing the important problem of database mismatch in speaker recognition, understood as the difference between the database used for tuning a speaker recognition system and the data which the system will handle in operational conditions.
This problem is typical in forensics, where variability in speech conditions may be extreme and difficult to model. Therefore, this work also presents a study evaluating the impact of this problem, for which a corpus named NIST4M (NIST MultiMic MisMatch) has been constructed from NIST SRE 2006 data. NIST4M presents microphone data both in the enrolled models and in the test segments, allowing the generation of trials in a variety of strongly mismatched conditions. Database mismatch is simulated by eliminating some microphone channels of interest from the background data, and computing scores with speech from such microphones in unknown testing conditions, as usually happens in forensic speaker recognition. Finally, we show how the incorporation of Ahumada III as background data helps to face database mismatch in real-world forensic conditions.

1. Introduction

Interest in using automatic systems for forensic speaker recognition has increased in recent years. The main reasons for this are the improved accuracy of the technology [1] and a more comprehensive study of the role of automatic speaker recognition in forensic science [2]. However, speaker recognition technology still has important challenges to address.
First, speech data scarcity remains a problem for automatic systems. Many forensic cases involve recovered questioned recordings presenting a small amount of speech, often under unfavourable quality conditions. Recent research has focused on increasing the robustness and accuracy of speaker recognition technology under short-duration conditions [3]. Second, session variability mismatch introduces variation between different utterances of the same speaker, typically due to factors such as transmission channel, speaking style, speaker emotional state, environmental conditions, recording devices, etc. This variability seriously degrades the performance of a system, and its compensation has been the subject of abundant recent research [4, 5]. In fact, this has been a major topic of research in recent Speaker Recognition Evaluations (SRE) conducted by NIST [1].

* This work has been supported by the Spanish Ministry of Education under project TEC2006-13170-C02-01. J. G.-D. also thanks the Spanish Ministry of Education for supporting his doctoral research. We also thank Ana Rosa Gonzalez-Sanz and the people from the Acoustics and Image Processing Department of Guardia Civil for their important effort in collecting data for forensic purposes.
Finally, we identify the database mismatch problem, understood as the variation in conditions between the dataset used for tuning an automatic speaker recognition system (referred to as the background or development database) and the data used in real-world operational conditions (known as the evaluation or operational database). Database mismatch is a typical situation in forensic speaker recognition, mainly because of the limited availability of real-casework databases for system tuning, and also because the conditions of the speech in real-world forensic recordings are extremely variable. Due precisely to session variability and extreme conditions, there are several frequent situations where database mismatch may constitute a serious problem. On the one hand, problematic incriminatory questioned speech may typically include:

- Telephone wire-taps. The development database may not have speech data recorded in the same conditions as the wire-tapping system. This problem may become serious if the forensic laboratory has not made an effort to collect field data using the recording system at hand.
The acquisition of such a database is usually expensive and time-consuming, as the final corpus should have a significant size and represent the session variability of telephone recordings used in casework.

- Distant or hidden microphones, in cases where such a device is present on an individual or in a prepared room. Here the environmental conditions of the recordings are extremely variable, depending on the room characteristics, the microphone position and layout, etc. It is virtually impossible to simulate such conditions for the purpose of recording a representative database, especially if it is desired to take all possible cases into account.

On the other hand, control speech recordings from a given suspect may be problematic in the following usual situations:

- Telephone conversations for which the suspect admits authorship, either recorded at police dependencies or wire-tapped. This is exactly the same case as for a recovered wire-tap.

- Speech recorded using a microphone. Although the recording conditions are much more controlled than in the hidden-microphone case for questioned recordings, variability in room layout, environmental conditions and microphone types makes it extremely difficult to predict which kind of speech will be needed for a representative background database.

Although database mismatch is commonly seen as a problem in the community and among forensic experts, to the authors' knowledge, rigorous experimental studies on this important issue in forensic speaker recognition have mainly focused on telephonic databases. With the release of the NIST multi-microphone databases in SRE 2005 and 2006 [1], a richer dataset has become available for research on the topic. In fact, realistic databases for forensic speaker recognition are important for two main reasons: first, more data will be available for research on database mismatch; and second, forensic laboratories will be able to improve the robustness of their speaker recognition systems in casework conditions.

Driven by these needs, this paper presents the Ahumada III database, a publicly available real-casework corpus in Spanish, which has been acquired by the Acoustics and Image Processing Department of the Spanish Guardia Civil.
In its current release, Ahumada III Release 1 (Ah3R1), it includes speech data from real forensic cases recovered using one of the typical recording systems of Guardia Civil, namely analog magnetic tapes containing GSM tappings. Moreover, the database is being extended with more material under this platform and also using SITEL, a Spanish nationwide digital tapping system. Ah3R1 includes variability in conditions such as noise, environmental characteristics, emotional state, country and region of origin and dialect of speakers, etc. Future releases will significantly increase the amount of data presenting strong variability from real cases.

This work also explores the database mismatch problem using the presented Ahumada III database and a corpus generated from multi-microphone data from NIST SRE 2006, namely the NIST4M database (NIST MultiMic MisMatch). This paper is organized as follows. First, the Ahumada III database is described in the context of Guardia Civil operative procedures for forensic speaker recognition. The experimental section then presents results which illustrate the impact of database mismatch on system performance in two different ways: database mismatch using NIST multi-microphone corpora, and robustness in real-casework conditions using Ahumada III.
Results show the importance of the database mismatch problem and encourage research in the field and further data collection. Finally, conclusions are drawn.

2. Addressing database mismatch in real casework: the Ahumada III database

In recent years, the Spanish Guardia Civil has made a significant effort in applying automatic speaker recognition systems to forensic voice evidence [2]. Following a Bayesian framework, and pursuing transparency in their reports, much of this effort has concentrated on the assessment of their system for the sake of testability, a demanding requirement in the new paradigm of forensic identification sciences [6]. Most voice evidence arriving at Guardia Civil laboratories has two possible origins. First, digitized analog magnetic recordings from GSM mobile calls are typical in cases between 1995 and 2004. From the recordings of this type received in the last ten years, those authorized (case by case) by the corresponding judge after a trial have been added to a database registered with the Spanish Ministerio del Interior, known as Base de Datos de Registros Acústicos (BDRA) 1. Second, a nationwide digital interception system (SITEL) has been used since 2005 by the two Spanish State Police forces.
This system records digital wiretaps directly connected to all mobile telephone operators.

With the purpose of achieving robustness in real-case conditions and proper assessment of systems, Guardia Civil has recorded several speech databases in the last decade. Back in 1998, the Ahumada I corpus was collected [7], containing 100 male speakers in telephone and microphone recordings. A subcorpus of Ahumada I spontaneous telephone speech was used in the NIST Speaker Recognition Evaluations in both 2000 and 2001. A complementary database known as Gaudi, with 100 female speakers, was recorded in 2001 2. From 2004 to 2006, the Baeza database was recorded, including GSM and microphone spontaneous conversational speech sessions recorded at the same time, from ¡¢£ males and ¡¢ females. Baeza has provided, up to this moment, the most relevant reference population to be used in real cases, as it was recorded through the same Spanish GSM network as in actual cases. In 2008, all 100 male speakers from Ahumada I are still available and are to be recorded again through GSM and SITEL. In this way, Ahumada II (as it is to be known) will constitute a major contribution to the analysis of long-term stability and degradation of speaker features after ten years. As most new cases come through SITEL recordings and are to be, by this time, evaluated with Baeza as the reference population, also in 2008 almost 100 speakers from Baeza (different from the Ahumada speakers) are to be recorded again through SITEL.
This database will be known as Ahumada IV and will be used for system assessment.

2.1. The Ahumada III database

Ahumada III consists of authorized conversational speech from real cases, both from BDRA and SITEL. The expected size of the database in number of speakers and variety of conditions addressed is huge, both in terms of the number of available calls and the amount of data. However, as conditions are not uniform, and speech recordings have to be authorized one by one, different releases of the database will become available progressively.

Ahumada III Release 1 (Ah3R1) consists of ¢£ speakers from a number of real cases with GSM BDRA calls across Spain, with a variety of countries of origin of the speakers, emotional and acoustic conditions, and dialects in the case of Spanish speech. The only low-variability dimension is gender, as all of them are male speakers. All ¢£ speakers in Ah3R1 have two minutes of speech available from a single phone call, to be used as the unquestioned (control) recording for the purpose of model training or voice characterization. Additionally, ten speech segments for £ speakers and five segments for £ speakers are included for testing purposes, each one from a different call. Such fragments present between ¡ and ¢¤ seconds of speech, with an average duration of £ seconds. An evaluation protocol, equivalent to that of NIST SRE [1], is available and ready to be used.
In this way, experienced NIST SRE participants can immediately run and assess their systems on the Ah3R1 data. Ah3R1 is publicly available for research purposes under a license agreement 3 to be signed with Guardia Civil (contact: crim-acustica@guardiacivil.es). Sample speech segments can be listened to directly on the ATVS website, in order to perceive the quality and diversity of the Ah3R1 recordings.

1 With reference public scientific file number 1981420003 from Spanish Guardia Civil, Orden Ministerial INT/3764/2004 de 11 de noviembre.
2 A 25-speaker Ahumada male subcorpus and an equivalent 25-speaker Gaudi female subcorpus are freely available at http://atvs.ii.uam.es/databases.jsp.

3. Experiments on database mismatch

In this section we present results illustrating the database mismatch problem, and how Ahumada III improves accuracy in real forensic casework. For score computation, the ATVS GMM system has been used, where speech data known to come from a given speaker is represented using Gaussian Mixture Models (GMM) adapted from a Universal Background Model (UBM). GMMs of ££¢ mixtures have been used for modelling MFCC-based features, with channel compensation based on feature warping and factor analysis [4].

3.1. Database mismatch with the NIST4M database

An evaluation corpus, namely NIST4M (NIST MultiMic MisMatch), has been constructed using NIST SRE 2006 data. This corpus aims at simulating unfavourable session variability. Utterances in NIST SRE 2006 from all speakers having microphone conversations have been combined in NIST4M for trial generation, providing multi-microphone data both in the enrolled models and in the test segments.
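The GMM-UBM scoring used throughout this section (a speaker model MAP-adapted from a UBM, with trials scored as an average log-likelihood ratio against the UBM) can be sketched as follows. This is a minimal numpy illustration, not the ATVS system: it assumes a toy diagonal-covariance UBM, means-only MAP adaptation with a relevance factor r = 16 (an assumed, typical value), and it omits MFCC extraction, feature warping and factor analysis entirely.

```python
import numpy as np

def gmm_loglik(X, w, mu, var):
    """Per-frame log-likelihood of X under a diagonal-covariance GMM.
    X: (N, D) frames; w: (M,) mixture weights; mu, var: (M, D)."""
    diff = X[:, None, :] - mu[None, :, :]                      # (N, M, D)
    log_comp = -0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=2)
    log_comp += np.log(w)[None, :]                             # add log mixture weights
    m = log_comp.max(axis=1, keepdims=True)                    # stable log-sum-exp
    return m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))

def map_adapt_means(X, w, mu, var, r=16.0):
    """Means-only MAP adaptation of the UBM to enrolment frames X.
    r is the relevance factor (r=16 is an assumption, not the paper's value)."""
    diff = X[:, None, :] - mu[None, :, :]
    lc = -0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=2) + np.log(w)
    lc -= lc.max(axis=1, keepdims=True)
    gamma = np.exp(lc)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # (N, M) responsibilities
    n = gamma.sum(axis=0)                                      # soft counts per mixture
    Ex = gamma.T @ X / np.maximum(n[:, None], 1e-10)           # first-order statistics
    alpha = (n / (n + r))[:, None]                             # adaptation coefficients
    return alpha * Ex + (1 - alpha) * mu                       # interpolate towards UBM

def llr_score(X_test, w, mu_spk, mu_ubm, var):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    return float(np.mean(gmm_loglik(X_test, w, mu_spk, var)
                         - gmm_loglik(X_test, w, mu_ubm, var)))
```

With this sketch, a target trial (test frames drawn from the enrolled speaker) should score above a non-target trial, which is all the trial generation described below relies on.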
Database mismatch has been simulated by eliminating all the data for a given microphone from the background data, using that microphone only in the NIST4M evaluation data. Thus, trials for that microphone are performed without such data in the background set, and therefore under database mismatch. This situation is usual in forensic speaker recognition.

For the construction of NIST4M, half of the telephone utterances of each speaker were selected as testing segments and the rest as training data. For microphone data, the microphones in one session per speaker have been used for training, and ¢ microphones from each of the remaining sessions are used for testing. The microphone types changed with different speakers, in such a way that the data for all microphone types tend to be balanced. Trials were generated by performing all possible combinations between utterances of the same speaker for target trials, and with the next £¢ speakers for non-target trials, avoiding different-gender comparisons.

Development data has been selected from past NIST SREs. Microphone data from NIST SRE 2005 has been divided into three sub-sets, used separately for training the factor analysis parameters, the UBM and the T-Norm cohorts. This data has been complemented with NIST SRE 2005, 2004 and 2002 telephone data, as well as speech from Switchboard I 4.

For the experiments presented here, microphones 3 (m3) and 5 (m5), as denoted by NIST [1], have been chosen.
Both microphones present different characteristics, m3 being a distant microphone and m5 a higher-quality device. Two different testing conditions have been defined. On the one hand, we define a same-microphone condition, where only trials containing a given microphone in both the training and the testing speech data have been selected. For this condition, in the case of m3 there are target and ¤¢¤ non-target scores, and in the case of m5 there are £ target and ¤ £ non-target scores. Such numbers are quite small, and therefore results obtained for this condition should be interpreted carefully. However, we have considered them here in order to illustrate database mismatch when session variability is controlled. On the other hand, we define a microphone-model condition, where trials containing a given microphone in the models and every possible channel in the test segments have been considered.

3 Available at http://atvs.ii.uam.es/databases.jsp
4 See the LDC website: http://www.ldc.upenn.edu/
For this condition, when m3 is selected there are £¢ target and ¢££¤ non-target scores, whereas ¤£¤ target and ¤¢£¢ non-target scores have been generated for m5.

For the database mismatch simulation, on the one hand, the scores for the m3 microphone-model condition in a database matching situation will have data from the NIST SRE 2005 m3 microphone in the UBM, the factor analysis data and the T-Norm cohort. On the other hand, in database mismatch conditions, data recorded over the NIST SRE 2005 m3 microphone will be eliminated from the UBM, the factor analysis data and the T-Norm cohort. The same procedure is applied to m5. It should be remarked that the data from a given microphone represents only £¢¤ ¤¥ of the whole data used for training the UBM and factor analysis, and only £ out of £ models in the whole T-Norm cohort. This is important in order to evaluate the impact of eliminating such a relatively small amount of data from the background set.

3.1.1. Results

Table 1 shows the effect of database mismatch for the same-microphone condition described above. Performance is measured in the form of the equal error rate (EER) and the Detection Cost Function (DCF) as defined by NIST [1]. Results show that database mismatch leads to a significant degradation in performance, especially for the DCF performance measure. However, these results should be interpreted carefully, as the number of scores in each condition is scarce.

              DB match            DB mismatch
              EER       DCF       EER       DCF
m3 vs. m3     15.2137   0.0391    17.094    0.0472
m5 vs. m5     1.4815    0.0033    2.2222    0.0143

Table 1: Performance of the same-microphone condition under database mismatch.

Figure 1 illustrates the effects of database mismatch for the microphone-model condition. This aims to simulate a typical forensic situation, where speech from the suspect may have been recorded in particular microphone conditions for which a background database is not available. The DET curves show a significant degradation in discrimination performance when the microphone of the considered model is eliminated from the background data.

[Figure 1: DET curves (False Acceptance vs. False Rejection Probability, in %) for the microphone-model condition, illustrating the effect of database mismatch. EER-DET values: m5 match 7.28; m5 mismatch (no m5) 8.82; m3 match 21.06; m3 mismatch (no m3) 22.60.]

3.2. Experiments with Ahumada III

In this section, the use of Ahumada III for compensating database mismatch on real-casework data is illustrated. The simulation has been conducted by adding Ahumada III data to the UBM trained with the NIST SRE data described in Section 3.1. Although more accurate results can be obtained by using the Baeza or BDRA databases from Guardia Civil, we selected a NIST-SRE-based background dataset for the illustration of database mismatch, and also for transparency, because any laboratory having the NIST SRE databases can reproduce the same results on Ahumada III. The same ATVS GMM system has been used for performing the experiments, except for session variability compensation, which has not considered factor analysis. Moreover, due to data scarcity in Ah3R1, we will not explore effects in factor-analysis-based session variability compensation or T-Norm cohorts; this will be considered in future work as new releases of Ahumada III become available.

In order to simulate the effect of using Ahumada III in real casework, we have used a cross-validation procedure, where we rotate the subsets of speakers used for UBM training ( £ or ¢£ of the ¢£ speakers considered in the test) and compute scores with the rest of the available speakers ( £ or £). We call these conditions the 1/2 match and 1/3 match conditions respectively. Thus, we obtain ££ target trials for both conditions, £¡££ non-target trials for the 1/2 match condition and ¤¡££ non-target trials for the 1/3 match condition.

[Figure 2: DET curves (False Acceptance vs. False Rejection Probability, in %) for experiments using Ahumada III. Database mismatch is reduced by increasingly adding Ahumada III data to the NIST-SRE-based UBM. EER-DET values: Ah3 mismatch 13.58; Ah3 1/3 match 12.19; Ah3 1/2 match 11.47.]
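The EER and minimum DCF figures reported in this section can be computed from pooled target and non-target score lists by a simple threshold sweep. The sketch below is an illustration, not NIST's official scoring tool, and the cost parameters (C_miss = 10, C_FA = 1, P_target = 0.01) are the values commonly published for NIST SREs of this period, assumed here rather than taken from the paper.

```python
import numpy as np

def eer_and_min_dcf(tar, non, c_miss=10.0, c_fa=1.0, p_tar=0.01):
    """Equal error rate and minimum detection cost from score lists.
    tar/non: target and non-target trial scores (higher = more target-like)."""
    tar, non = np.asarray(tar, float), np.asarray(non, float)
    thr = np.unique(np.concatenate([tar, non]))                # candidate thresholds
    p_miss = np.array([(tar < t).mean() for t in thr])         # targets rejected
    p_fa = np.array([(non >= t).mean() for t in thr])          # non-targets accepted
    i = int(np.argmin(np.abs(p_miss - p_fa)))                  # closest crossing point
    eer = (p_miss[i] + p_fa[i]) / 2.0
    dcf = c_miss * p_tar * p_miss + c_fa * (1.0 - p_tar) * p_fa
    return eer, float(dcf.min())
```

Note that with discrete score sets the EER is only approximated at the nearest crossing of the miss and false-alarm curves (interpolation would refine it), and the minimum DCF is optimistic compared with the actual DCF at a preset threshold.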
In both situations, the proportion of Ahumada III speech data with respect to NIST SRE data in the UBM does not exceed £¤¥ after silence removal.

Figure 2 shows the DET plots of the proposed experiments. The inclusion of Ahumada III in the UBM increases discrimination accuracy, as expected. Moreover, the increase is highly significant considering the small percentage (less than £¤¥) of Ahumada III data with respect to NIST SRE data in the UBM.

4. Conclusions

This paper has presented the first release of Ahumada III, a speech corpus in Spanish from real police investigations. The database is being built as a tool to help forensic laboratories assess and tune their speaker recognition systems under the most realistic conditions possible. In this sense, the important problem of database mismatch has been explored using the presented Ahumada III database and other multichannel corpora such as NIST4M, a database constructed from NIST SRE 2006 multichannel data. In this work, database mismatch has been simulated in UBM modelling, channel compensation based on factor analysis, and T-Norm cohorts. As a result, it has been observed that database mismatch has a critical impact on the performance of systems, even if the background data matching the operational speech conditions is relatively small with respect to the whole dataset used for tuning. It has also been shown how databases such as Ahumada III help to compensate for this important problem.
There are many topics which have not been explored in this work, such as the influence of database mismatch on session variability, its impact on calibration and likelihood-ratio computation, and a more rigorous study of the different microphone channels in the NIST4M corpus. However, with new releases of Ahumada III, scientists will be able to explore such problems in depth in a realistic forensic framework.

5. References

[1] M. A. Przybocki, A. F. Martin, and A. N. Le, "NIST speaker recognition evaluations utilizing the Mixer corpora - 2004, 2005, 2006," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 1951-1959, 2007.

[2] J. Gonzalez-Rodriguez, P. Rose, D. Ramos, D. T. Toledano, and J. Ortega-Garcia, "Emulating DNA: Rigorous quantification of evidential weight in transparent and testable forensic speaker recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 2072-2084, 2007.

[3] B. Fauve, N. Evans, and J. Mason, "Improving the performance of text-independent short duration SVM- and GMM-based speaker verification," in Proc. of Odyssey, Stellenbosch, South Africa, 2008.

[4] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Speaker and session variability in GMM-based speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1448-1460, 2007.

[5] R. Vogt and S. Sridharan, "Explicit modelling of session variability for speaker verification," Computer Speech and Language, vol. 22, no. 1, pp. 17-38, 2007.

[6] M. J. Saks and J. J. Koehler, "The coming paradigm shift in forensic identification science," Science, vol. 309, no. 5736, pp. 892-895, 2005.

[7] J. Ortega-Garcia, J. Gonzalez-Rodriguez, and V. Marrero-Aguilar, "AHUMADA: A large speech corpus in Spanish for speaker characterization and identification," Speech Communication, vol. 31, no. 2-3, 2000.