Information Retrieval & Apache Solr Use Case - NBIC

Information Retrieval & Apache Solr Use Case
May 21, 2010
Leon Mei

Outline
• Information retrieval & text mining
• Apache Lucene/Solr
• Use case: expert finder

Information retrieval
[Diagram: Information Need → Representation → Query; Information Items → Representation → Indexed Items; Query and Indexed Items → Relevance? → Retrieved information → Evaluating / Relevance feedback]

Text vector representation
• “Bag of words”
• “Bag of phrases”
• With stemming/normalization
• N-gram
• ...
• Concept based
• Content-bearing words, stop list
[Diagram: Text → Bag of words, N-gram, ...; table with columns Term and Term frequency (2, 2, 1, 1, 1, ...)]
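
The bag-of-words and n-gram representations above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer Solr uses: the stop list, the regular expression, and the sample sentence are all assumptions made for the example, and no stemming is applied.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration; real stop lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text):
    """Count content-bearing terms, skipping words on the stop list."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)

def word_ngrams(text, n=2):
    """Count runs of n consecutive tokens (stop words are kept here)."""
    tokens = tokenize(text)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "The spread of malaria and the control of malaria"
print(bag_of_words(text))
print(word_ngrams(text))
```

Each counter maps a term (or phrase) to its term frequency, which is the raw material for the weighting schemes on the following slides.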

Relevance #1
• Boolean
– e.g. query “term 1 term 2 term 3”

Relevance #2
• Vector space model
– $d_j = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$
– $q = (w_{1,q}, w_{2,q}, \dots, w_{t,q})$
• Probability model
• Citation analysis model
* wikipedia
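
In the vector space model, a document vector d_j and a query vector q are commonly compared by the cosine of the angle between them. The slide does not name a similarity measure, so cosine similarity is an assumption here, and the sample weights are made up for illustration.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(d) != len(q):
        raise ValueError("vectors must have the same number of terms")
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

# Illustrative weights over a three-term vocabulary.
d_j = [0.5, 0.0, 0.8]
q = [1.0, 0.0, 0.0]
print(cosine_similarity(d_j, q))
```

Documents are then ranked by descending similarity to the query.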

Term frequency & inverse document frequency
• Some terms are more important than others
• $w = \mathrm{TF} \cdot \mathrm{IDF}$
• $\mathrm{TF}_{t,d}$
– the frequency of occurrence of a term t in document d
• $\mathrm{IDF}_t$
– $\mathrm{IDF}_t = \log_{10}(N / n_t)$
– N is the number of documents in the collection, and $n_t$ is the number of documents where term t occurs
– N = 100, $n_t$ = 25, idf ≈ 0.6
– N = 100, $n_t$ = 1, idf = 2
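
The two worked values on this slide (0.6 and 2) are consistent with a base-10 logarithm of N / n_t. The sketch below assumes that form; the function names are chosen for this example only.

```python
import math

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency, assuming a base-10 logarithm."""
    return math.log10(n_docs / n_docs_with_term)

def tf_idf(term_count, n_docs, n_docs_with_term):
    """Weight w = TF * IDF for one term in one document."""
    return term_count * idf(n_docs, n_docs_with_term)

print(round(idf(100, 25), 1))  # term in 25 of 100 documents
print(round(idf(100, 1), 1))   # term in 1 of 100 documents
```

A term that appears in every document gets an IDF of 0, so it contributes nothing to the weight regardless of how often it occurs.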

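Evaluation
• Recall: $\frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|} = \frac{TP}{TP + FN}$
• Precision: $\frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|} = \frac{TP}{TP + FP}$
[Figure: axes labeled recall and precision]
Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. 1972.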
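
The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The document IDs below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets to avoid dividing by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]   # what the engine returned
relevant = ["d2", "d4", "d7"]          # what a judge marked relevant
print(precision_recall(retrieved, relevant))
```

Here two of the four retrieved documents are relevant (precision 1/2), and two of the three relevant documents were found (recall 2/3).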

Solr: a brief history
• 2004, CNET
– search capability for reviews, news, price offers, etc.
• 2006, joined Apache, under the Lucene project
– Lucene is a high-performance, full-featured text search engine library in Java
– indexing at 20 MB/minute on a Pentium M 1.5 GHz
– 1 MB memory requirement
– index size 20-30% of text size
• Nov 2009, Solr 1.4 released
• Who is using Solr
– WhiteHouse.gov; AOL.com; SourceForge.net; Netflix.com; Plaxo.com; ...

Solr: features
• XML over HTTP: update via POST, query via GET
• Supports tokenizing, stemming, normalization
• Web interface to invoke, monitor, and analyze the search engine
• Supports load distribution: server replication
• Caching, auto-warming
• External file-based configuration of stopword lists, synonym lists, and protected word lists
• Import data from a DB or other sources
• ...

Solr: update example
• http://localhost:8983/solr/update/?
• Add: m00123456, memory, 2G, Kingston [ ... [ ... ]]
• Delete: m00123456; supplier:Kingston
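
Solr's XML update messages wrap documents in `<add><doc><field name="...">` elements and deletions in `<delete>` with an `<id>` or a `<query>` child. The sketch below builds such messages for the values on this slide. The field names `id`, `category`, and `size` are hypothetical (only `supplier` appears on the slide), and the code only builds the XML; it does not send anything to a server.

```python
import xml.etree.ElementTree as ET

def build_add(fields):
    """Build an <add><doc>...</doc></add> message from a dict of fields."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")

def build_delete(doc_id=None, query=None):
    """Build a <delete> message that removes by ID or by query."""
    delete = ET.Element("delete")
    if doc_id is not None:
        ET.SubElement(delete, "id").text = doc_id
    if query is not None:
        ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")

# Field names other than "supplier" are assumptions for this example.
add_xml = build_add({
    "id": "m00123456",
    "category": "memory",
    "size": "2G",
    "supplier": "Kingston",
})
print(add_xml)
print(build_delete(doc_id="m00123456"))
print(build_delete(query="supplier:Kingston"))
```

Each string would be sent as the body of an HTTP POST to the update URL, followed by a `<commit/>` message to make the changes visible to searches.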

Expert finder
• Who are the experts on malaria?
• Person DB (PubMed authors)

Article DB & Author DB in Solr
• Article profile DB
– 19 million PubMed articles
– Each article: title, abstract, MeSH, keyword, chemicals
– list of concepts => article DB
– Query using a PMID
  • [C0041221, 0.39], [C0879626, 0.03], ...
• Author profile DB
– 2.14 million unique authors
– Identify all articles of each unique author
  • Query the article DB
  • [C0041221, 0.39], [C0879626, 0.03], ...
– Each author: list of expertise => author DB
  • 39 x C0041221, 3 x C0879626
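
One way to read the step from article profiles to an author profile is sketched below: concept weights are summed over all of an author's articles and converted to integer counts. The factor of 100 is inferred from the slide's example (0.39 becomes 39, 0.03 becomes 3); summing across articles and the function name are assumptions, not a description of the original system.

```python
from collections import defaultdict

def author_profile(article_profiles, scale=100):
    """Combine per-article [concept, weight] lists into concept counts."""
    totals = defaultdict(float)
    for profile in article_profiles:
        for concept, weight in profile:
            totals[concept] += weight
    # Scale and round so each concept gets an integer expertise count.
    return {concept: round(weight * scale) for concept, weight in totals.items()}

# One article with the concept weights shown on the slide.
articles = [[("C0041221", 0.39), ("C0879626", 0.03)]]
print(author_profile(articles))
```

With the single article above, the result matches the slide's "39 x C0041221, 3 x C0879626".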

Cook an expert-finder query
• Identify a set of articles in an area of interest
– a list of PMIDs
• Query the article DB using this list
– [C4249671, 1], [C0041234, 0.79], ...
• Construct the query
– {C4249671, C0041234, ...}
• Can specify the minimum number of matches
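
A cooked query like this could be sent to Solr as a plain GET request. The sketch below assumes the dismax query parser, whose `mm` parameter sets the minimum number of clauses that must match; whether the original expert finder used dismax is an assumption, and the field name `expertise` and the `/solr/select` URL are hypothetical choices for this example. The code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

def expert_query_url(concepts, min_match,
                     base="http://localhost:8983/solr/select"):
    """Build a Solr query URL that searches for a set of concept IDs."""
    params = {
        "q": " ".join(concepts),   # the cooked set of concepts
        "defType": "dismax",       # assumed query parser
        "qf": "expertise",         # hypothetical author-profile field
        "mm": str(min_match),      # minimum number of matching concepts
    }
    return base + "?" + urlencode(params)

print(expert_query_url(["C4249671", "C0041234"], min_match=2))
```

Raising `min_match` narrows the result to authors whose profiles cover more of the concepts in the area of interest.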

Summary on expert-finder
• Rank the authors based on the similarity between their publications and the publications belonging to a particular field
[Diagram: PubMed, Article DB, Author DB; an area of interest becomes a cooked query, which returns a ranked list of authors: 1. Barend, 2. Rob, 3. Machiel, 4. Marc, ...]

Acknowledgement
• Christine Chichester (NBIC)
• Bharat Singh (NBIC)
• Marc Weeber (Knewco, Inc)
• Jun Wang (University College London)
