Information Retrieval & Apache Solr Use Case - NBIC
Information Retrieval & Apache Solr Use Case
May 21, 2010
Leon Mei

Outline
• Information retrieval & text mining
• Apache Lucene/Solr
• Use case: expert finder

Information retrieval
[Diagram: Information Need, Representation, Query; Information Items, Representation, Indexed Items; Relevance?; Retrieved information; Evaluating/Relevance feedback]

Text vector representation
• “Bag of words”
• “Bag of phrases”
• With stemming/normalization
• N-gram
• ...
• Concept based
• Content bearing word, stop list
[Figure: Text mapped to BagOfWords, N-gram, ...; table of Term against Term frequency (2, 2, 1, 1, 1, ...)]

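
The bag-of-words, stop-list and n-gram ideas on this slide can be sketched in a few lines of Python. This is only an illustration: the stop list, the tokenizer and the function names are assumptions, not something taken from the slides, and no stemming is applied.

```python
import re
from collections import Counter

# Hypothetical, tiny stop list for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "is", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text, stop_words=STOP_WORDS):
    """Return a mapping term -> term frequency, skipping stop-list words."""
    return Counter(t for t in tokenize(text) if t not in stop_words)


def word_ngrams(text, n=2):
    """Return a mapping n-gram (tuple of n consecutive tokens) -> frequency."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


bag = bag_of_words("The treatment of malaria and malaria prevention")
bigrams = word_ngrams("malaria drug resistance", n=2)
```

Here `bag` counts "malaria" twice and drops "the", "of" and "and", while `bigrams` holds the two word pairs of the second text.
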
Relevance #1
• Boolean
– e.g. query “term 1 term 2 term 3”

Relevance #2
• Vector space model
– $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
– $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
• Probability model
• Citation analysis model
* wikipedia

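
The slide gives the document and query vectors but does not say how they are compared. A common choice in the vector space model is the cosine of the angle between them; the sketch below assumes that measure and stores each vector as a sparse dictionary from term to weight. The function name is illustrative.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_d * norm_q)


same = cosine_similarity({"malaria": 1.0, "drug": 1.0},
                         {"malaria": 1.0, "drug": 1.0})
disjoint = cosine_similarity({"malaria": 1.0}, {"solr": 1.0})
```

Vectors that point in the same direction score close to 1, and vectors with no shared terms score 0.
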
Term frequency & inverse document frequency
• Some terms are more important than others
• $w = \mathrm{TF} \cdot \mathrm{IDF}$
• $\mathrm{TF}_{t,d}$
– the frequency of occurrence of a term $t$ in document $d$
• $\mathrm{IDF}_t = \log_{10}(N / n_t)$
– $N$ is the number of documents in the collection, and $n_t$ is the number of documents where term $t$ occurs
– $N = 100$, $n_t = 25$: $\mathrm{IDF}_t \approx 0.6$
– $N = 100$, $n_t = 1$: $\mathrm{IDF}_t = 2$

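
The two worked values on this slide (0.6 and 2) are consistent with a base-10 logarithm of N / n_t. Assuming that form of IDF, the weights can be reproduced as follows; the function names are illustrative.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency, assuming idf = log10(N / n_t)."""
    return math.log10(n_docs / doc_freq)


def tf_idf(tf, n_docs, doc_freq):
    """Term weight w = TF * IDF for one term in one document."""
    return tf * idf(n_docs, doc_freq)


common_term = idf(100, 25)    # term occurs in 25 of 100 documents
rare_term = idf(100, 1)       # term occurs in 1 of 100 documents
weight = tf_idf(3, 100, 1)    # the rare term appears 3 times in a document
```

A term found in only one document gets a much larger weight than a term found in a quarter of the collection, which is the point of the first bullet.
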
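Evaluation
• Recall: $\dfrac{|\text{relevant} \wedge \text{retrieved}|}{|\text{relevant}|} = \dfrac{TP}{TP + FN}$
• Precision: $\dfrac{|\text{relevant} \wedge \text{retrieved}|}{|\text{retrieved}|} = \dfrac{TP}{TP + FP}$
[Figure: precision-recall plot; axes: recall, precision]
Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. 1972.
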
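
The recall and precision definitions on the Evaluation slide translate directly into set operations. The sketch below uses made-up document IDs; returning 0.0 for an empty set is a convention chosen here, not something the slide specifies.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    true_positives = len(relevant & retrieved)  # relevant AND retrieved
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs: 4 relevant documents, 3 retrieved, 2 in common.
precision, recall = precision_recall({1, 2, 3, 4}, {3, 4, 5})
```

With these inputs, two of the three retrieved documents are relevant, and two of the four relevant documents were found.
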
Solr: a brief history
• 2004, CNET
– search capability for reviews, news, price offers, etc.
• 2006, joined Apache under the Lucene project
– Lucene is a high-performance, full-featured text search engine library in Java
– 20MB/minute on Pentium M 1.5GHz
– 1MB memory requirement
– index size 20-30% of text size
• Nov 2009, Solr 1.4 released
• Who is using Solr
– WhiteHouse.gov; AOL.com; SourceForge.net; Netflix.com; Plaxo.com; ...

Solr: features
• XML over HTTP: update via POST, query via GET
• Supports tokenizing, stemming, normalization
• Web interface to invoke, monitor, and analyze the search engine
• Supports load distribution: server replication
• Caching, auto-warming
• External file-based configuration of stopword lists, synonym lists, and protected word lists
• Import data from DB or other sources
• ...

Solr: update example
• http://localhost:8983/solr/update/?
• Add
– m00123456, memory, 2G, Kingston [ ... [ ... ]]
• Delete
– m00123456
– supplier:Kingston

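
The XML markup of this example did not survive on the slide, so the following is a reconstruction sketch rather than the original message. It builds Solr XML update messages (`<add>`, `<delete>` by `<id>` or `<query>`) with Python's standard library. The field names `id`, `type`, `size` and `supplier` are hypothetical labels for the values shown above; the real names depend on the Solr schema in use. Nothing is sent over the network here.

```python
import xml.etree.ElementTree as ET


def build_add(docs):
    """Build a Solr <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_delete(doc_id=None, query=None):
    """Build a Solr <delete> message, by document ID and/or by query."""
    delete = ET.Element("delete")
    if doc_id is not None:
        ET.SubElement(delete, "id").text = doc_id
    if query is not None:
        ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


# Field names below are assumptions; only the values come from the slide.
add_xml = build_add([{"id": "m00123456", "type": "memory",
                      "size": "2G", "supplier": "Kingston"}])
delete_by_id = build_delete(doc_id="m00123456")
delete_by_query = build_delete(query="supplier:Kingston")
```

Each resulting string would be sent as the body of an HTTP POST to the update URL, in line with the "update via POST" feature listed earlier.
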
Expert finder
• Who are the experts on malaria?
• Person DB (PubMed authors)

Article DB & Author DB in Solr
• Article profile DB
– 19 million PubMed articles
– Each article: title, abstract, MeSH, keyword, chemicals
– list of concepts => article DB
– Query using a PMID
  • [C0041221, 0.39], [C0879626, 0.03], ...
• Author profile DB
– 2.14 million unique authors
– Identify all articles of each unique author
  • Query the article DB
  • [C0041221, 0.39], [C0879626, 0.03], ...
– Each author: list of expertise => author DB
  • 39 x C0041221, 3 x C0879626

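
The slide shows article concept weights such as [C0041221, 0.39] turning into author expertise counts such as 39 x C0041221, but it does not spell out the rule. One plausible reading, assumed here, is that the weights of all of an author's articles are summed per concept and then scaled by 100 and rounded. The function name and the scale factor are illustrative.

```python
from collections import defaultdict


def author_profile(article_profiles, scale=100):
    """Sum concept weights over an author's articles and scale to integers.

    article_profiles: one list of (concept_id, weight) pairs per article.
    The sum-then-scale rule is an assumption, not taken from the slides.
    """
    totals = defaultdict(float)
    for concepts in article_profiles:
        for concept_id, weight in concepts:
            totals[concept_id] += weight
    return {cid: round(total * scale) for cid, total in totals.items()}


# One article with the concept weights shown on the slide.
profile = author_profile([[("C0041221", 0.39), ("C0879626", 0.03)]])
```

For this single article the result matches the counts on the slide; with several articles, concepts that recur accumulate a larger count.
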
Cook an expert-finder query
• Identify a set of articles in an area of interest
– a list of PMIDs
• Query the article DB using this list
– [C4249671, 1], [C0041234, 0.79], ...
• Construct the query
– {C4249671, C0041234, ...}
• Can specify the minimum number of matches

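
The slides do not say which Solr mechanism enforces the minimum number of matches. One option is the DisMax query parser's `mm` (minimum should match) parameter, and the sketch below assumes it. The field name `concepts`, the `top_k` cut-off and the function name are hypothetical.

```python
from urllib.parse import urlencode


def cook_query(concept_weights, top_k=10, min_match=1, field="concepts"):
    """Turn weighted concepts into URL-encoded Solr query parameters.

    concept_weights: list of (concept_id, weight) pairs from the article DB.
    The highest-weighted top_k concepts become the query terms.
    """
    ranked = sorted(concept_weights, key=lambda cw: cw[1], reverse=True)
    concept_ids = [cid for cid, _ in ranked[:top_k]]
    params = {
        "defType": "dismax",          # use the DisMax query parser
        "q": " ".join(concept_ids),   # the cooked concept set
        "qf": field,                  # hypothetical field holding concept IDs
        "mm": str(min_match),         # minimum number of terms that must match
    }
    return urlencode(params)


query_string = cook_query([("C0041234", 0.79), ("C4249671", 1.0)], min_match=2)
```

The resulting string would be appended to the select URL of the author DB and issued as an HTTP GET.
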
Summary on expert-finder
• Rank the authors based on the similarity between their publications and the publications that belong to a particular field
[Diagram: PubMed, Article DB, Author DB; Interested area, Cooked query; A ranked list: 1, Barend; 2, Rob; 3, Machiel; 4, Marc; ...]

Acknowledgement
• Christine Chichester (NBIC)
• Bharat Singh (NBIC)
• Marc Weeber (Knewco, Inc)
• Jun Wang (University College London)