
Web Information Retrieval

Hang Li

Microsoft Research Asia


Talk Outline

• What is Web Information Retrieval
• Overview of Web IR Technologies
• Application Technologies for Web IR
• Component Technologies for Web IR
• Summary


What is Web Information Retrieval


Web Search is Part of Our Life


Web Users Heavily Rely on Search Engines

http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf


Physically, a Search System is a Data Center


Advanced Web Search Technologies Are Used…

Statistical Learning

Large Scale Distributed Computing


In This Talk

Web Information Retrieval (IR)
= Web Search from an Algorithmic Perspective


Research Areas Related to Web IR

• Information Retrieval
• Statistical Machine Learning
• Data Mining
• Natural Language Processing
• Human-Computer Interaction


Overview of Web IR Technologies

Application technologies: Document Search, Entity Search, Facet Search, Question Answering, Image Search

Component technologies: Relevance Ranking, Importance Ranking, Web Page Understanding, Query Understanding, Crawling, Indexing, Search Result Presentation, Anti-Spam, Search Log Data Mining

Fundamental technologies: Classification, Clustering, Learning to Rank, Link Analysis, Information Extraction


Fundamental Technologies for Web IR

• Classification
• Clustering
• Learning to Rank
• Link Analysis (Graph Learning)
• Information Extraction (Structured Prediction)


Component Technologies for Web IR

• Relevance Ranking
• Importance Ranking
• Web Page Understanding
• Query Understanding
• Crawling
• Indexing
• Search Result Presentation
• Anti-Spam
• Search Log Data Mining


Application Technologies for Web IR

• Document Search (Web Page Search)
• Entity Search
• Facet Search
• Question Answering
• Image Search


Overview of Web IR Technologies

Application technologies: Document Search, Entity Search, Facet Search, Question Answering, Image Search

Component technologies: Relevance Ranking, Importance Ranking, Web Page Understanding, Query Understanding, Crawling, Indexing, Search Result Presentation, Anti-Spam, Search Log Data Mining

Fundamental technologies: Classification, Clustering, Learning to Rank, Link Analysis, Information Extraction


Application Technologies


Application Technologies for Web IR

• Document Search (Web Page Search)
• Entity Search
• Facet Search
• Question Answering
• Image Search


Information in Different Granularities and Forms

• Document
• Entity: person
• Facet: definition, FAQ
• Table


Document Search


Definition Search


Expert Search


FAQ Search


Image Search


Question Answering


Component Technologies


Component Technologies for Web IR

• Relevance Ranking
• Importance Ranking
• Web Page Understanding
• Query Understanding
• Crawling
• Indexing
• Search Result Presentation
• Anti-Spam
• Search Log Data Mining


Example of Web Search Architecture

[Architecture diagram] Components: User, User Interface, Ranker, Index, Indexer, Crawler, Web.
Associated technologies: query understanding, relevance ranking, importance ranking, search result presentation, search log data mining, Web page understanding, anti-spam.


Relevance Ranking


General Framework for Relevance Ranking

documents (information): $D = \{d_1, d_2, \ldots, d_n\}$

query (or question): $q$

ranking function: $f(q, d)$

relevance scores for ranking: $d_1 \sim f(q, d_1),\ d_2 \sim f(q, d_2),\ \ldots,\ d_n \sim f(q, d_n)$


Relevance

• No rigorous definition
• Example: query = "soccer", document = about soccer, then the document is relevant
• Judgment by humans: several discrete levels, e.g., "definitely relevant", "partially relevant"


Key Factor for Relevance: Matching between Query and Document

$f(q, d \mid D)$

[Diagram: matching between the query $q$ (and its variant $q'$) and the document $d$]


Probabilistic Model

documents: $d_1, d_2, \ldots, d_n$

query (or question): $q$

$P(r \mid q, d),\quad r \in \{1, 0\}$

relevance scores for ranking: $d_1 \sim P(r \mid q, d_1),\ d_2 \sim P(r \mid q, d_2),\ \ldots,\ d_n \sim P(r \mid q, d_n)$


Okapi BM25
(Robertson and Walker 1994)

documents: $d_1, d_2, \ldots, d_n$; query (or question): $q$

ranking function:

$$f(q, d) = \sum_{w \in q \cap d} \frac{(k+1)\,\mathrm{tf}(w)}{k\left((1-b) + b\,\frac{dl}{avgdl}\right) + \mathrm{tf}(w)}$$

where $\mathrm{tf}(w)$ is the frequency of term $w$ in $d$, $dl$ is the document length, $avgdl$ is the average document length, and $k$ and $b$ are parameters.
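A minimal sketch of BM25 scoring in Python (not code from the talk; the standard formulation also weights each term by inverse document frequency, which is included here):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k=1.2, b=0.75):
    """Score one document against a query with BM25.

    query_terms, doc_terms: lists of tokens; corpus: list of token lists.
    k and b are the usual BM25 parameters.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    dl = len(doc_terms)
    tf = Counter(doc_terms)
    score = 0.0
    for w in set(query_terms):
        if tf[w] == 0:
            continue
        df = sum(1 for d in corpus if w in d)                 # document frequency of w
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * (k + 1) * tf[w] / (k * ((1 - b) + b * dl / avgdl) + tf[w])
    return score

# Example: rank a toy corpus for the query "web search"
corpus = [["web", "search", "engine"], ["soccer", "news"], ["web", "page", "ranking"]]
query = ["web", "search"]
ranked = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)
print(ranked[0])  # the document about web search ranks first
```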


Language Model
(Ponte and Croft 1998)

document = bag of words:

$d_1 = w_{11} w_{12} \cdots w_{1 l_1}$
$d_2 = w_{21} w_{22} \cdots w_{2 l_2}$
$\cdots$
$d_n = w_{n1} w_{n2} \cdots w_{n l_n}$

query: $q = w_{q1} w_{q2} \cdots w_{q l_q}$

relevance scores for ranking: $d_1 \sim P(q \mid d_1),\ d_2 \sim P(q \mid d_2),\ \ldots,\ d_n \sim P(q \mid d_n)$
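A hedged sketch of query-likelihood scoring under a unigram language model with Jelinek-Mercer smoothing (the talk does not specify a smoothing method; this is one common choice):

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, corpus, lam=0.8):
    """log P(q | d) under a unigram language model with Jelinek-Mercer smoothing."""
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(w for d in corpus for w in d)
    coll_len = sum(coll_counts.values())
    log_prob = 0.0
    for w in query_terms:
        p_doc = doc_counts[w] / len(doc_terms)    # maximum-likelihood estimate from d
        p_coll = coll_counts[w] / coll_len        # collection (background) model
        p = lam * p_doc + (1 - lam) * p_coll      # linear interpolation
        if p == 0:                                # term unseen anywhere: apply a floor
            p = 1e-12
        log_prob += math.log(p)
    return log_prob

corpus = [["web", "search", "engine"], ["soccer", "news"], ["web", "page", "ranking"]]
query = ["web", "search"]
best = max(corpus, key=lambda d: query_likelihood(query, d, corpus))
print(best)  # documents are ranked by P(q | d)
```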


Learning to Rank Model
(Herbrich et al., 2000)

documents: $d_1, d_2, \ldots, d_n$

query (or question): $q$

$r = f(q, d)$, e.g., Ranking SVM

relevance scores for ranking: $d_1 \sim f(q, d_1),\ d_2 \sim f(q, d_2),\ \ldots,\ d_n \sim f(q, d_n)$
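A minimal sketch of the pairwise idea behind Ranking SVM: learn a linear score $f(x) = w \cdot x$ so that, within one query, documents with higher labels score higher than documents with lower labels. This is a toy subgradient version of the pairwise hinge loss, not the authors' implementation:

```python
import numpy as np

def train_pairwise_ranker(X, y, query_ids, epochs=100, lr=0.01, margin=1.0):
    """Learn weights w so that w.x_i > w.x_j whenever y_i > y_j within the same query."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for q in set(query_ids):
            idx = [i for i, qid in enumerate(query_ids) if qid == q]
            for i in idx:
                for j in idx:
                    if y[i] > y[j] and w @ (X[i] - X[j]) < margin:
                        # hinge loss on the pair (i, j): push w toward satisfying the margin
                        w += lr * (X[i] - X[j])
    return w

# Toy data: 2 features per query-document pair (e.g., BM25 score, PageRank), graded labels
X = np.array([[2.0, 0.5], [1.0, 0.2], [0.1, 0.9], [1.5, 0.1], [0.2, 0.3]])
y = np.array([2, 1, 0, 2, 0])
qid = [1, 1, 1, 2, 2]
w = train_pairwise_ranker(X, y, qid)
print(sorted(range(len(y)), key=lambda i: -(w @ X[i])))  # indices ranked by learned score
```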


Training Process

1. Data Labeling (rank): for training queries $q_1, \ldots, q_m$, the associated documents $d_{i,1}, \ldots, d_{i,n_i}$ are labeled with relevance grades $y_{i,1}, \ldots, y_{i,n_i}$.
2. Feature Extraction: each query-document pair is converted into a feature vector $x_{i,j}$ with label $y_{i,j}$.
3. Learning: a ranking model $f(x)$ is learned from the labeled feature vectors.


Testing Process

1. Data Labeling (rank): the documents of held-out test queries are labeled in the same way.
2. Feature Extraction: the test query-document pairs are converted into feature vectors.
3. Ranking with $f(x)$: the documents of each test query are sorted by $f(x_{i,j})$.
4. Evaluation: the ranked lists are compared against the labels to produce the Evaluation Result.


Previous Approach: Pairwise Approach

• Transforming ranking to classification
  – Ranking SVM (Herbrich et al., 2000)
  – RankBoost (Freund et al., 2003)
  – RankNet (Burges et al., 2005)
  – IR-SVM (Cao et al., 2005)
  – FRank (Tsai et al., 2006)


Our Proposal = Listwise Approach

• Probabilistic Model
  – ListNet (Cao et al., 2007)
  – ListMLE (Xia et al., 2008)
• Optimizing Upper Bounds of IR Measures
  – AdaRank (Xu & Li, 2007)
  – SVM-MAP (Yue et al., 2007)
  – PermuRank (Xu et al., 2008)
• Approximation of IR Measures
  – SoftRank (Taylor 2007)
  – ApproxRank (Qin et al., to appear)
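For the flavor of the listwise approach, a hedged sketch of a ListNet-style top-one probability loss: the scores of a whole result list are turned into a distribution with softmax and compared to the distribution implied by the labels (a simplified illustration, not the authors' implementation):

```python
import numpy as np

def listnet_top1_loss(scores, labels):
    """Cross entropy between the label-induced and score-induced
    top-one probability distributions over one query's documents."""
    p_label = np.exp(labels) / np.exp(labels).sum()   # target distribution from labels
    p_score = np.exp(scores) / np.exp(scores).sum()   # model distribution from scores
    return -(p_label * np.log(p_score)).sum()

labels = np.array([2.0, 1.0, 0.0])   # graded relevance of three documents
good = np.array([3.0, 1.5, 0.2])     # scores that agree with the labels
bad = np.array([0.2, 1.5, 3.0])      # scores that reverse the order
print(listnet_top1_loss(good, labels) < listnet_top1_loss(bad, labels))  # True
```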


Importance Ranking


General Model for Importance Ranking

documents: $D = \{d_1, d_2, \ldots, d_n\}$

query (or question): $q$

relevance function $f(q, d)$ and importance function $g(d)$

importance and relevance scores for ranking: each $d_i$ is ranked by combining $g(d_i)$ with $f(q, d_i)$


Importance

• No rigorous definition
• Citation, quality, novelty
• Independent of query


PageRank
(Brin and Page, 1998)

$$P(d_i) = \alpha \sum_{d_j \in M(d_i)} \frac{P(d_j)}{L(d_j)} + (1 - \alpha)\,\frac{1}{n}$$

where $M(d_i)$ is the set of pages linking to $d_i$, $L(d_j)$ is the number of outlinks of $d_j$, $n$ is the number of pages, and $\alpha$ is the damping factor.
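A hedged sketch of computing PageRank by power iteration on a tiny in-memory link graph (a standard illustration of the formula above, not code from the talk):

```python
def pagerank(links, alpha=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # sum of P(d_j) / L(d_j) over pages d_j that link to p
            inflow = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = alpha * inflow + (1 - alpha) / n
        pr = new
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))  # page "c" accumulates the most link mass
```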


Building User Browsing Graph
(Liu et al 2008)

A directed graph with rich metadata:
• Vertex: Web page
• Edge: transition
• Edge weight w_ij: number of transitions
• Staying time T_i: time spent on page i
• Vertex weight C_i: number of visits to page i
• Reset probability: normalized frequencies as first page of a session


BrowseRank: Continuous-Time Markov Model

Directly computing the stationary distribution of the continuous-time model is hard; instead:
• Estimating the staying-time distribution of each state
• Computing the stationary distribution of a discrete-time Markov chain (called the embedded Markov chain)


Web Page Understanding


Web Page Understanding

• Web Information Extraction
  – Block Analysis
  – Metadata Extraction (Title, Date, etc.)
  – Text Information Extraction
  – Wrapper Generation
• Web Page Classification
  – Based on Semantics
  – Based on Type (Homepage, Spec, etc.)


Block Analysis
(Song et al 2004)

• A Web page can be segmented into blocks
• Important blocks can be identified


Title Extraction
(Hu et al 2005)


Text Information Extraction

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted table:

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  Founder   Free Software Foundation


General Model for Information Extraction

Training Data: $(O_1, S_1), (O_2, S_2), \ldots, (O_n, S_n)$

Learning System → Model: $P(O, S)$ or $P(S \mid O)$

Test Data: $O_{n+1}$

Extraction System → $(O_{n+1}, S_{n+1})$


Wrapper Generation


Query Understanding


Query Understanding

• Spelling Error Correction
• Query Refinement
  – E.g., "ny times" → "new york times"
• Query Classification
  – Based on Semantics (Sport, etc.)
  – Based on Type (Name Query, etc.)
• Query Suggestion
  – E.g., "microsoft" → "bill gates"


Mismatching between Query Term and Document Term

The user wants to search for "myspace", but query and documents may use different forms: "myspace", "my space", "my", "space".


Understanding the Intent and Solving the Mismatch

The user wants to search for "myspace"; query understanding maps the surface forms "my space", "my", and "space" to the intended term "myspace".


Structured Prediction Problem

"Ideal" word sequence: windows onecare
Observed "noisy" word sequence: window onecar

The task is to map the original query word sequence to the "ideal" query word sequence.


Conditional Random Fields for Query Refinement (Guo et al 2008)

Introducing Refinement Operations: a CRF over the query, with observed query words x_i, refined words y_i, and a refinement operation o_i at each position.

Operations
• Spelling: insertion, deletion, substitution, transposition, …
• Word Stemming: +s/-s, +es/-es, +ed/-ed, +ing/-ing, …
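As a much simpler stand-in for the CRF-based refinement model, a hedged sketch of query spelling correction with edit-distance candidate generation and a unigram query-log language model (illustrative only; the term list and frequencies below are made up):

```python
def edits1(word):
    """All strings one edit (insertion, deletion, substitution, transposition) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

# Hypothetical term frequencies mined from a query log
query_log_freq = {"windows": 500, "onecare": 80, "window": 300, "one": 900}

def correct(word):
    """Pick the most frequent known term among the word and its one-edit variants."""
    candidates = {word} | edits1(word)
    known = [w for w in candidates if w in query_log_freq]
    return max(known, key=lambda w: query_log_freq[w]) if known else word

print(correct("onecar"))   # -> "onecare"
print(correct("windows"))  # unchanged: already a frequent query term
```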


Crawling


Crawling

• Crawling Scheduling
• Near Duplicate Detection


Crawling Scheduling

• Search of a large-scale and dynamically changing graph
• Breadth-first crawling
• Preferential crawling: importance, freshness, coverage
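A minimal sketch of breadth-first crawling over a toy, in-memory link graph (real crawlers fetch pages over HTTP and respect politeness constraints; the graph here is a hypothetical stand-in):

```python
from collections import deque

# Hypothetical Web: page URL -> list of outgoing links
web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": ["a.com"],
    "d.com": [],
}

def bfs_crawl(seeds, max_pages=100):
    """Breadth-first crawl: visit pages level by level from the seed URLs."""
    frontier = deque(seeds)
    visited = []
    seen = set(seeds)
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)                # "fetch" the page
        for link in web.get(url, []):      # extract its outlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(bfs_crawl(["a.com"]))  # ['a.com', 'b.com', 'c.com', 'd.com']
```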


Near Duplicate Detection

Represent each document as a set of shingles (overlapping word n-grams) and detect near-duplicates by the similarity between shingle sets.
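A hedged sketch of shingle-based near-duplicate detection using Jaccard similarity (a standard formulation; the similarity threshold is an arbitrary illustration):

```python
def shingles(text, k=3):
    """Set of overlapping word k-grams (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
doc3 = "web information retrieval is part of our life"

s1, s2, s3 = shingles(doc1), shingles(doc2), shingles(doc3)
print(jaccard(s1, s2) > 0.3)   # True: near duplicates
print(jaccard(s1, s3) > 0.3)   # False: unrelated documents
```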


Indexing


Indexing

• Inverted Indexing


Inverted Index

t1 → d11, d12, d13, …
t2 → …
t3 → …
…
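A minimal sketch of building and querying an inverted index (term → sorted list of document IDs), with an AND query implemented as an intersection of posting lists (illustrative, not the talk's implementation):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict doc_id -> text. Returns term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def and_query(index, terms):
    """Documents containing all query terms (intersection of posting lists)."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "web information retrieval",
    2: "web search engine",
    3: "image retrieval",
}
index = build_inverted_index(docs)
print(index["retrieval"])                       # [1, 3]
print(and_query(index, ["web", "retrieval"]))   # [1]
```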


Search Result Presentation


Search Result Presentation

• Summary Generation
• Result Clustering
• Result Diversification (implicit clustering)
• Novelty
• Location Sensitivity


Search Result Clustering
(Zeng et al 2004)


Query Suggestion


Search Intent and Context

• Suppose a user issues the query "gladiator". History? People? Film?
• If we know the user issued the query "beautiful mind" before "gladiator":
  – The user is likely to be interested in the film
  – The user is likely to be searching for films starring Russell Crowe


Context Aware Query Suggestion
(Cao et al 2008)

• Offline part: model learning
  – Summarizing queries into concepts by clustering the click-through bipartite graph
  – Mining frequent patterns from session data and building a concept sequence suffix tree
• Online part: query suggestion


Anti-Spam


Anti-Spam

• Spam manipulates relevance and importance
• Boundary with Search Engine Optimization (SEO)
  – Unethical if the aim is to be ranked higher than the page's real value warrants
  – "Cheating" search engines
• Spam Types
  – Content Spam
  – Link Spam
  – Comment Spam
  – Cloaking


Content Spam

• Highlighting: e.g., repeated query terms "q q q q q q"
• Dumping: e.g., stuffing unrelated terms "a b c d e f"
d e f


Link Spam

• Link from Blog, Forum, etc.
• Link Exchange
• Link Farm


Cloaking

Serving different content to the End User and to the Search Engine.


Search Log Data Mining


Search Log Data Mining

• Query Log
• Click-through Data
• Search Session Data


Click-through Data

Queries q1, q2, …, qm and documents d1, d2, …, dn connected by clicks form a bipartite graph.
• Bipartite graph: implicit tags
• Click position: implicit relevance judgment
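A minimal sketch of turning a click log into the query-document bipartite graph with click counts, which could then feed implicit relevance estimation or query clustering (the log format below is a hypothetical simplification):

```python
from collections import defaultdict

# Hypothetical click log: (query, clicked document, click position)
click_log = [
    ("myspace", "myspace.com", 1),
    ("myspace", "myspace.com", 1),
    ("my space", "myspace.com", 2),
    ("ny times", "nytimes.com", 1),
]

def build_click_bipartite(log):
    """Edge weights of the query-document bipartite graph = number of clicks."""
    edges = defaultdict(int)
    for query, doc, _pos in log:
        edges[(query, doc)] += 1
    return dict(edges)

edges = build_click_bipartite(click_log)
print(edges[("myspace", "myspace.com")])  # 2 clicks: a weak implicit relevance signal
# Queries sharing clicked documents ("myspace", "my space") can be grouped into one concept.
```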


Search Session Data

• Sequence of queries from the same user
• Relations between queries
  – Error correction
  – Related query
  – Refined query
  – …


Summary


Talk Outline

• What is Web Information Retrieval
• Overview of Web IR Technologies
• Application Technologies for Web IR
• Component Technologies for Web IR
• Summary


Overview of Web IR Technologies

Application technologies: Document Search, Entity Search, Facet Search, Question Answering

Component technologies: Relevance Ranking, Importance Ranking, Web Page Understanding, Query Understanding, Crawling, Indexing, Search Result Presentation, Anti-Spam, Search Log Data Mining

Fundamental technologies: Classification, Clustering, Learning to Rank, Link Analysis, Information Extraction


Overview of Web Search System

[Architecture diagram] Components: User, User Interface, Ranker, Index, Indexer, Crawler, Web.
Associated technologies: query understanding, relevance ranking, learning to rank, importance ranking, search result presentation, search log data mining, document understanding, anti-spam.


Thank You!

hangli@microsoft.com
