13.07.2015 Views

Name-Ethnicity Classification and Ethnicity ... - Dr. C. Lee Giles

Name-Ethnicity Classification and Ethnicity ... - Dr. C. Lee Giles

Name-Ethnicity Classification and Ethnicity ... - Dr. C. Lee Giles

SHOW MORE
SHOW LESS
  • No tags were found...

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Table 6: Most predictive features, excluding nonASCII, foreach name-ethnicity according to the ethnicity classifier. ‘$’represents the word boundary <strong>and</strong> ‘+’ indicates the lastnameboundary.<strong>Ethnicity</strong> Feature ExampleMEA ou Jassem KhalloufiIND bh Bhairavi DesaiFRN ien$ Henri ChretienGER sch Gregor SchneiderITA one+ Giovanni FalconeSPA $de$ Jesus de PolancoRUS v+ Viktor YakushevCHI ng$ Jing HuangJAP tsu Yutakayama HiromitsuKOR ae Song Tae Kon, Nam Yun BaeVIE nh Nguyen Dinh, Linh Chicharacter sequences (as in ‘Bhairavi Desai’) are highly predictiveof Indians names. The ‘sch’ character sequences ismore unique in German names, while ‘v+’ (the lastnameending with ‘v’) is a good indicator for names of Russianethnicity. An example of notable phonetic features picked upby our classifier is Soundex S400 for Middle Eastern names(corresponding to ‘EL’ as in ‘Hatem El Mekki’). It is interestingto note that the majority of the top features are characterfeatures. This suggests that the written form of a nameis a better indicator of its ethnicity than its phonetic. In general,the top features learned by our classifier comply withour expectations.Overall, our name-ethnicity classifier can infer ethnicityfrom names with fairly high accuracy. Our result is overallcomparable to the previously reported attempt in (Ambekaret al. 2009), <strong>and</strong> is significantly better at identifyingsome ethnic groups such as German, <strong>and</strong> East Asian names.Using smaller units of names, instead of the full names, asfeatures not only makes the classifier generalize better to unseennames, but should also help reduce the number of trainingdata needed in order to include a new ethnic class. Oneinteresting direction for future work would be to comparethe effect of the size of the training data on the performancebetween our model <strong>and</strong> theirs.<strong>Name</strong> Matching EvaluationWe evaluate our name matching algorithm on the DBLP authordataset (Dataset 2) (Lange <strong>and</strong> Naumann 2011). Thedataset contains 10,000 pairs of ambiguous author recordsfrom DBLP, that were manually cleaned due to author requests,or by fine-tuning heuristics. The dataset was createdto be a challenging testbed for author record disambiguation.We use the subset of the data (2500 record pairs), whichcomprises only of author records from authors with multiplename aliases.As the baselines, we implement four st<strong>and</strong>ard stringmatching techniques, Jaro-Winkler <strong>and</strong> Levenstein editingdistance, together with their recursive versions (L2 Jaro-Winkler <strong>and</strong> L2 Levenstein) using the SecondString library 1 .1 http://secondstring.sourceforge.net/Figure 2: The confusion matrix of name-ethnicity classificationon Wikipedia data. The rows are the true ethnicities,<strong>and</strong> the columns are the predictions. Darker tiles indicatesless confusion between ethnic pairs, while lighter tiles representmore confusion.We evaluate the performance of all baselines with variousthresholds, <strong>and</strong> plot their precision <strong>and</strong> recall in Figure 3.The maximal F1 for all possible threshold values is 0.75 forJaro-Winkler, 0.70 for Levenstein distance, 0.80 for L2 Jaro-Winkler, <strong>and</strong> 0.77 for L2 Levenstein.We create three ethnicity-dependent name matching models;one for Middle Eastern names (MEA), one for Spanishnames (SPA), one for East Asian names (CHI, JAP, KOR,VIE), <strong>and</strong> one default model, which is trained on all ethnicitynames. The three ethnic groups are chosen since theyhave quite distinct naming conventions compared to typicalWestern names. We train <strong>and</strong> evaluate the ethnicity-sensitivename matching models using 10-folds cross validation. Ourmodel achieves 0.99 precision, 0.89 recall <strong>and</strong> 0.94 F1 measure.This is 14% improvement in F1 over the best st<strong>and</strong>ardsimilarity measures (L2 Jaro-Winkler). Our algorithm scoresextremely high precision. While the recall is at 0.89, whichis slightly lower, it is still quite high. When examining thenames that their matches were not correctly recalled, weobserve that some of them would be difficult even for humanwithout using other information. For instance, ‘HedvigSidenbladh’ <strong>and</strong> ‘Hedvig Kjellström’ are marked as the sameauthor in the DBLP dataset, so as ‘Maria-Florina Balcan’<strong>and</strong> ‘Maria-Florina Popa.’ These are examples of authorsthat changed their last names. One drawback of training separatemodels for each ethnicity is that we effectively makethe training data for each model smaller. Thus we would liketo test our model with other larger dataset in the future <strong>and</strong>to investigate the impact on performance.ConclusionWe propose a novel name-ethnicity classifier based on themultinomial logistic regression. Our name-ethnicity classifieris trained <strong>and</strong> evaluated on Wikipedia data, achievingaround 85% accuracy. We observe that name strings wereusually more important than phonetic sequences for classifi-1146

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!