Information Retrieval & Apache Solr Use Case - NBIC


Information Retrieval & Apache Solr Use Case
May 21, 2010
Leon Mei


Outline
• Information retrieval & text mining
• Apache Lucene/Solr
• Use case: expert finder


Information retrieval
[Diagram: an information need is represented as a query; information items are represented as indexed items; query and indexed items are compared for relevance, yielding the retrieved information, which feeds evaluation / relevance feedback.]


Text vector representation
• “Bag of words”
• “Bag of phrases”
• With stemming/normalization
• N-gram...
• Concept based
• Content-bearing words, stop list

Term      Term frequency
Bag       2
Of        2
Words     1
Text      1
N-gram    1
...       ...
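The bag-of-words and N-gram representations can be sketched in a few lines of Python. This is only an illustration: the tokenizer regex, the `STOP_WORDS` set, and the function names are assumptions, not anything the slides specify.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration; the slides do not give one.
STOP_WORDS = frozenset({"of", "the", "a", "and"})

def tokenize(text):
    """Lowercase the text and split it into word tokens (hyphens kept)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def bag_of_words(text, stop_words=frozenset()):
    """Return a term -> term-frequency mapping, skipping stop words."""
    return Counter(t for t in tokenize(text) if t not in stop_words)

def word_ngrams(text, n=2):
    """Return counts of word n-grams (n consecutive tokens)."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

slide_text = "Bag of words. Bag of phrases. Text. N-gram."
tf = bag_of_words(slide_text)
tf_content = bag_of_words(slide_text, STOP_WORDS)
```

With no stop list, `tf` counts "bag" and "of" twice each; passing `STOP_WORDS` drops "of" and keeps only the content-bearing words.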


Relevance #1
• Boolean
– e.g. query “term 1 term 2 term 3”
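As a rough illustration of Boolean relevance, the sketch below builds a tiny inverted index and treats a multi-term query as a conjunction (AND). The AND semantics, the sample documents, and the function names are assumptions; the slide does not say which operators join the terms.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical three-document collection.
docs = {
    1: "malaria vaccine trial",
    2: "malaria parasite genome",
    3: "vaccine safety",
}
index = build_inverted_index(docs)
hits = boolean_and(index, ["malaria", "vaccine"])
```

A document either matches or it does not; there is no ranking among the hits, which is the limitation the weighted models address.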


Relevance #2
• Vector space model
– $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
– $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
• Probability model
• Citation analysis model
* Source: Wikipedia
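In the vector space model, a common way to compare a document vector with a query vector is cosine similarity. The slide does not name the similarity measure, so cosine is an assumption here, as is the sparse dict representation (term -> weight).

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two sparse weight vectors (term -> weight)."""
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_d * norm_q)

# Hypothetical weights for illustration.
doc = {"malaria": 3.0, "vaccine": 4.0}
query = {"malaria": 3.0}
score = cosine_similarity(doc, query)
```

Because the score is normalized by vector length, long documents are not favored merely for containing more terms.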


Term frequency & inverse document frequency
• Some terms are more important than others
• $w = \mathrm{TF} \cdot \mathrm{IDF}$
• $\mathrm{TF}_{t,d}$
– the frequency of occurrence of a term $t$ in document $d$
• $\mathrm{IDF}_t$
– $\mathrm{IDF}_t = \log_{10}(N / n_t)$
– $N$ is the number of documents in the collection, and $n_t$ is the number of documents where term $t$ occurs
– $N = 100$, $n_t = 25$: $\mathrm{idf} \approx 0.6$
– $N = 100$, $n_t = 1$: $\mathrm{idf} = 2$
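The two worked values on the slide (0.6 and 2) are consistent with a base-10 logarithm, idf = log10(N / n_t). The sketch below assumes that form; the function names are illustrative.

```python
import math

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency, assuming the base-10 form log10(N / n_t)."""
    return math.log10(n_docs / n_docs_with_term)

def tf_idf(term_freq, n_docs, n_docs_with_term):
    """Term weight w = TF * IDF."""
    return term_freq * idf(n_docs, n_docs_with_term)

common = idf(100, 25)  # term occurs in a quarter of the collection
rare = idf(100, 1)     # term occurs in a single document
```

A term found in only one of 100 documents gets more than three times the weight of a term found in 25 of them, which is the sense in which some terms are more important than others.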


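Evaluation
• Recall
– $\text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|} = \frac{TP}{TP + FN}$
• Precision
– $\text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|} = \frac{TP}{TP + FP}$
[Plot: precision versus recall]
Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. 1972.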
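Precision and recall follow directly from the sets of relevant and retrieved items. A minimal sketch, with hypothetical document ids and a convention (an assumption) of returning 0.0 when a denominator is empty:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    true_positives = len(relevant & retrieved)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgement: 4 relevant documents, 3 retrieved, 2 in common.
precision, recall = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).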


Solr: a brief history
• 2004, CNET
– search capability for reviews, news, price offers, etc.
• 2006, joined Apache under the Lucene project
– Lucene is a high-performance, full-featured text search engine library in Java
– 20 MB/minute on a Pentium M 1.5 GHz
– 1 MB memory requirement
– index size 20-30% of text size
• Nov 2009, Solr 1.4 released
• Who is using Solr
– WhiteHouse.gov; AOL.com; SourceForge.net; Netflix.com; Plaxo.com; ...


Solr: features
• XML over HTTP: update via POST, query via GET
• Supports tokenizing, stemming, normalization
• Web interface to invoke, monitor, and analyze the search engine
• Supports load distribution: server replication
• Caching, auto-warming
• External file-based configuration of stopword lists, synonym lists, and protected word lists
• Imports data from a DB or other sources
• ...
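"Query via GET" means a search is just a URL. The sketch below only builds such a URL and sends nothing. It assumes a local Solr at `localhost:8983` with the standard `/select` handler and uses the common `q`, `rows`, and `wt` parameters; the field name in the sample query is an assumption about the schema.

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance.
SOLR_BASE = "http://localhost:8983/solr"

def build_select_url(query, rows=10, base=SOLR_BASE):
    """Build a GET URL for Solr's /select handler with a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base + "/select?" + urlencode(params)

url = build_select_url("supplier:Kingston")
```

The resulting URL can be opened in a browser or fetched with any HTTP client; `urlencode` takes care of escaping characters such as the colon in the field query.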


Solr: update example
• http://localhost:8983/solr/update/?
• Add: a document with the field values m00123456, memory, 2G, Kingston [ ... [ ... ]]
• Delete: by ID m00123456, or by query supplier:Kingston
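The update messages are XML documents POSTed to the update URL. The sketch below rebuilds them in Solr's standard `<add>`/`<doc>`/`<field>` and `<delete>` message format. Only the field values and the `supplier:Kingston` query come from the slide; the field names `id`, `type`, and `size` are hypothetical, and `post_update` is an illustrative helper that is defined but not called.

```python
import urllib.request
import xml.etree.ElementTree as ET

def add_message(fields):
    """Build an <add><doc>...</doc></add> message from a name -> value mapping."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")

def delete_by_id(doc_id):
    """Build a <delete><id>...</id></delete> message."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")

def delete_by_query(query):
    """Build a <delete><query>...</query></delete> message."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")

def post_update(xml_body, url="http://localhost:8983/solr/update"):
    """POST an update message to Solr and return the HTTP status (not called here)."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# Field names other than "supplier" are assumptions.
add_xml = add_message(
    {"id": "m00123456", "type": "memory", "size": "2G", "supplier": "Kingston"}
)
```

Changes normally become searchable only after a commit, so a real client would follow these messages with a `<commit/>` message to the same URL.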


Expert finder
• Who are the experts on malaria?
[Diagram: Person DB (PubMed authors)]


Article DB & Author DB in Solr
• Article profile DB
– 19 million PubMed articles
– Each article: title, abstract, MeSH, keyword, chemicals
– list of concepts => article DB
– Query using a PMID
  • [C0041221, 0.39], [C0879626, 0.03], ...
• Author profile DB
– 2.14 million unique authors
– Identify all articles of each unique author
  • Query the article DB
  • [C0041221, 0.39], [C0879626, 0.03], ...
– Each author: list of expertise => author DB
  • 39 x C0041221, 3 x C0879626
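The slide's example suggests that an article's concept weights (0.39, 0.03) are turned into integer repetition counts (39 x, 3 x) in the author profile. One possible reading, sketched below, sums the weights over all of an author's articles and scales by 100. Both the summation across articles and the scale factor are assumptions inferred from that single example.

```python
from collections import defaultdict

# Inferred from the slide's example: weight 0.39 -> 39 x, 0.03 -> 3 x.
SCALE = 100

def author_profile(article_profiles):
    """Combine per-article [concept, weight] lists into concept -> integer count."""
    totals = defaultdict(float)
    for profile in article_profiles:
        for concept, weight in profile:
            totals[concept] += weight
    return {concept: round(weight * SCALE) for concept, weight in totals.items()}

# One article with the concept weights shown on the slide.
articles = [[("C0041221", 0.39), ("C0879626", 0.03)]]
profile = author_profile(articles)
```

Storing a concept as a repeated term lets an ordinary text index carry the weight as term frequency, which may be why the counts are integers.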


Cook an expert-finder query
• Identify a set of articles in an area of interest
– a list of PMIDs
• Query the article DB using this list
– [C4249671, 1], [C0041234, 0.79], ...
• Construct the query
– {C4249671, C0041234, ...}
• Can specify the minimum number of matches
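The steps above can be sketched in memory, without Solr. `cook_query` merges the concept lists of the selected articles into a query set, and `rank_authors` keeps authors who match at least `min_matches` query concepts. The overlap score is a simple stand-in for Solr's own relevance scoring, and the author names, the `top_k` cutoff, and the concept `C9999999` are hypothetical.

```python
def cook_query(article_profiles, top_k=10):
    """Sum concept weights over the selected articles and keep the top_k concepts."""
    totals = {}
    for profile in article_profiles:
        for concept, weight in profile:
            totals[concept] = totals.get(concept, 0.0) + weight
    ranked = sorted(totals, key=lambda c: (-totals[c], c))
    return set(ranked[:top_k])

def rank_authors(author_profiles, query, min_matches=1):
    """Rank authors by the summed counts of the query concepts they match."""
    scored = []
    for author, profile in author_profiles.items():
        matched = query.intersection(profile)
        if len(matched) >= min_matches:
            scored.append((sum(profile[c] for c in matched), author))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [author for _, author in scored]

# Concept lists returned for two hypothetical PMIDs in the area of interest.
articles = [[("C4249671", 1.0), ("C0041234", 0.79)], [("C4249671", 0.5)]]
query = cook_query(articles, top_k=2)

# Hypothetical author profiles (concept -> count).
authors = {
    "author_a": {"C4249671": 39, "C0041234": 3},
    "author_b": {"C4249671": 10},
    "author_c": {"C9999999": 50},
}
ranking = rank_authors(authors, query)
strict_ranking = rank_authors(authors, query, min_matches=2)
```

Raising `min_matches` trades recall for precision: authors who share only one concept with the field drop out of the list.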


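Summary on expert-finder
• Rank the authors based on the similarity between their publications and the publications that belong to a particular field
[Diagram: PubMed feeds the article DB and the author DB; an area of interest is turned into a cooked query, which is run against the author DB to produce a ranked list (1. Barend, 2. Rob, 3. Machiel, 4. Marc, ...).]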


Acknowledgement
• Christine Chichester (NBIC)
• Bharat Singh (NBIC)
• Marc Weeber (Knewco, Inc)
• Jun Wang (University College London)
