Web Information Retrieval

Hang Li 

Microsoft Research Asia

Talk Outline 

• What is Web Information Retrieval 

• Overview of Web IR Technologies 

• Application Technologies for Web IR 

• Component Technologies for Web IR 

• Summary

What is Web Information Retrieval

Web Search is Part of Our Life

Web Users Heavily Rely on Search Engines 

http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf

Physically, a Search System is a Data Center

Advanced Web Search Technologies Are Used…

Statistical Learning

Large-Scale Distributed Computing

In This Talk 

Web Information Retrieval (IR)

= Web Search from an Algorithmic Perspective

Research Areas Related to Web IR

• Information Retrieval 

• Statistical Machine Learning 

• Data Mining 

• Natural Language Processing 

• Human-Computer Interaction

Overview of Web IR Technologies

Application technologies: Document Search, Entity Search, Facet Search, Question Answering, Image Search

Component technologies: Relevance Ranking, Importance Ranking, Web Page Understanding, Query Understanding, Crawling, Indexing, Search Result Presentation, Anti-Spam, Search Log Data Mining

Fundamental technologies: Classification, Clustering, Learning to Rank, Link Analysis, Information Extraction

Fundamental Technologies for Web IR

• Classification 

• Clustering 

• Learning to Rank 


• Link Analysis (Graph Learning) 

• Information Extraction (Structured Prediction)

Component Technologies for Web IR 

• Relevance Ranking 

• Importance Ranking 

• Web Page Understanding 

• Query Understanding 

• Crawling 

• Indexing 

• Search Result Presentation 

• Anti-Spam 

• Search Log Data Mining

Application Technologies for Web IR


• Document Search (Web Page Search) 

• Entity Search 

• Facet Search 

• Question Answering 

• Image Search


Application Technologies

Application Technologies for Web IR


• Document Search (Web Page Search) 

• Entity Search 

• Facet Search 

• Question Answering 

• Image Search

Information in Different Granularities and Forms

• Document 

• Entity: person 

• Facet: definition, FAQ 

• Table

Document Search

Definition Search

Expert Search

FAQ Search

Image Search

Question Answering

Component Technologies

Component Technologies for Web IR 

• Relevance Ranking 

• Importance Ranking 

• Web Page Understanding 

• Query Understanding 

• Crawling 

• Indexing 

• Search Result Presentation 

• Anti-Spam 

• Search Log Data Mining

Example of Web Search Architecture

[Figure: web search architecture. System components: User, User Interface, Ranker, Index, Indexer, Crawler, Web. Technologies marked on the diagram: importance ranking, query understanding, relevance ranking, search result presentation, search log data mining, web page understanding, anti-spam.]

Relevance Ranking

General Framework for Relevance Ranking

documents (information): $D = \{d_1, d_2, \ldots, d_n\}$

query (or question): $q$

ranking function: $f(q, d)$

relevance scores for ranking:

$d_1 \sim f(q, d_1)$

$d_2 \sim f(q, d_2)$

$\ldots$

$d_n \sim f(q, d_n)$

Relevance 

• No rigorous definition 

• Query = “soccer”, document = about soccer → document relevant

• Judgment by humans: several discretized levels, e.g. “definitely relevant”, “partially relevant”

Key Factor for Relevance: Matching between Query and Document

$f(q, d \mid D)$

[Figure: matching between query and document; labels $q$, $q'$, $d$]

Probabilistic Model

documents: $d_1, d_2, \ldots, d_n$

query: $q$

model: $P(r \mid q, d)$, $r \in \{1, 0\}$

relevance scores for ranking:

$d_1 \sim P(r \mid q, d_1)$

$d_2 \sim P(r \mid q, d_2)$

$\ldots$

$d_n \sim P(r \mid q, d_n)$

Okapi or BM25

(Robertson and Walker 1994)

documents: $d_1, d_2, \ldots, d_n$

query: $q$

ranking function:

$$\sum_{w \in q \cap d} \frac{(k + 1)\, tf(w)}{k \left( (1 - b) + b\, \frac{dl}{avgdl} \right) + tf(w)}$$
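The ranking function on this slide sums, over the words shared by query and document, the term $(k+1)\,tf(w) / (k((1-b) + b \cdot dl/avgdl) + tf(w))$. A minimal Python sketch of that sum follows. The function name `bm25_score`, the whitespace-tokenized input lists, and the defaults `k=1.2` and `b=0.75` are assumptions for illustration and do not come from the slide. Many BM25 formulations also multiply each term by an inverse document frequency weight; the slide text as extracted shows no such factor, so the sketch leaves it out.

```python
def bm25_score(query_terms, doc_terms, avgdl, k=1.2, b=0.75):
    """Score one document against a query with the slide's BM25-style sum.

    query_terms, doc_terms: lists of tokens (tokenization is assumed done).
    avgdl: average document length over the collection.
    k, b: free parameters; the defaults are common choices, not slide values.
    """
    dl = len(doc_terms)  # document length
    score = 0.0
    for w in set(query_terms):
        tf = doc_terms.count(w)  # term frequency of w in the document
        if tf == 0:
            continue  # only words in both query and document contribute
        length_norm = k * ((1 - b) + b * dl / avgdl)
        score += (k + 1) * tf / (length_norm + tf)
    return score
```

With `b=0` the length normalization disappears and each matching word contributes `(k + 1) * tf / (k + tf)`, which saturates as `tf` grows.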

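Language Model

(Ponte and Croft 1998)

document = bag of words

query: $q = w_{q1}, w_{q2}, \ldots, w_{q l_q}$

documents:

$d_1 = w_{11}, w_{12}, \ldots, w_{1 l_1}$

$d_2 = w_{21}, w_{22}, \ldots, w_{2 l_2}$

$\ldots$

$d_n = w_{n1}, w_{n2}, \ldots, w_{n l_n}$

scores for ranking:

$d_1 \sim P(q \mid d_1)$

$d_2 \sim P(q \mid d_2)$

$\ldots$

$d_n \sim P(q \mid d_n)$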
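Ranking by $P(q \mid d)$ with documents treated as bags of words can be sketched as a unigram query likelihood. The Python sketch below is one common way to do this and is not necessarily the estimator of the cited paper. The linear interpolation with collection statistics, the weight `lam`, and the function name `query_log_likelihood` are assumptions added for illustration; smoothing keeps a query word that is missing from a document from forcing the probability to zero.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Return log P(q | d) under a smoothed unigram model.

    query, doc: lists of tokens; collection: all tokens of all documents.
    lam: interpolation weight for the collection model (assumed value).
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    log_p = 0.0
    for w in query:
        p_doc = doc_tf[w] / len(doc)            # maximum likelihood in d
        p_coll = coll_tf[w] / len(collection)   # background probability
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # word unseen in the whole collection
        log_p += math.log(p)
    return log_p
```

Documents are then ranked by descending log-likelihood for the same query.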

Learning to Rank Model

(Herbrich et al., 2000)

documents: $d_1, d_2, \ldots, d_n$

query: $q$

model: $r = f(q, d)$ (Ranking SVM)

scores for ranking:

$d_1 \sim f(q, d_1)$

$d_2 \sim f(q, d_2)$

$\ldots$

$d_n \sim f(q, d_n)$

Training Process

Training data: queries $q_1, \ldots, q_m$; query $q_i$ has documents $d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}$

1. Data Labeling (rank)

$q_i$: $(d_{i,1}, y_{i,1}), (d_{i,2}, y_{i,2}), \ldots, (d_{i,n_i}, y_{i,n_i})$, for $i = 1, \ldots, m$

2. Feature Extraction

$q_i$: $(x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), \ldots, (x_{i,n_i}, y_{i,n_i})$, for $i = 1, \ldots, m$

3. Learning

$f(x)$

Testing Process

Test data: query $q_{m+1}$ with documents $d_{m+1,1}, d_{m+1,2}, \ldots, d_{m+1,n_{m+1}}$

1. Data Labeling (rank)

$(d_{m+1,1}, y_{m+1,1}), (d_{m+1,2}, y_{m+1,2}), \ldots, (d_{m+1,n_{m+1}}, y_{m+1,n_{m+1}})$

2. Feature Extraction

$(x_{m+1,1}, y_{m+1,1}), (x_{m+1,2}, y_{m+1,2}), \ldots, (x_{m+1,n_{m+1}}, y_{m+1,n_{m+1}})$

3. Ranking with $f(x)$

$(x_{m+1,1}, f(x_{m+1,1}), y_{m+1,1}), (x_{m+1,2}, f(x_{m+1,2}), y_{m+1,2}), \ldots, (x_{m+1,n_{m+1}}, f(x_{m+1,n_{m+1}}), y_{m+1,n_{m+1}})$

4. Evaluation

Evaluation Result

Previous Approach: Pairwise Approach

• Transforming ranking to classification

– Ranking SVM (Herbrich et al., 2000)

– RankBoost (Freund et al., 2003)

– RankNet (Burges et al., 2005)

– IR-SVM (Cao et al., 2005)

– FRank (Tsai et al., 2006)
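The transformation of ranking into classification can be illustrated for a single query: every pair of documents with different labels becomes one classification example whose feature vector is the difference of the two documents' feature vectors. The Python sketch below shows only that transformation plus a hinge loss for a linear scorer. It is not the optimizer of any of the listed methods, and the names `pairwise_examples` and `pairwise_hinge_loss` are hypothetical.

```python
def pairwise_examples(features, labels):
    """Build (difference vector, +1) examples for one query.

    features: list of feature vectors x_j; labels: list of grades y_j,
    where a larger grade means the document should rank higher.
    """
    pairs = []
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                diff = [a - b for a, b in zip(features[i], features[j])]
                pairs.append((diff, 1))  # x_i should score above x_j
    return pairs


def pairwise_hinge_loss(w, pairs):
    """Sum of hinge losses of a linear scorer w over the pair examples."""
    total = 0.0
    for diff, z in pairs:
        margin = z * sum(wi * di for wi, di in zip(w, diff))
        total += max(0.0, 1.0 - margin)
    return total
```

A weight vector that orders every pair correctly with margin at least 1 has zero loss, which is the sense in which a classifier on pairs induces a ranking.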

Our Proposal = Listwise Approach 

• Probabilistic Model 

– ListNet (Cao, et al 2007) 

– ListMLE (Xia, et al 2008) 

• Optimizing Upper Bounds of IR Measures 

– AdaRank (Xu & Li, 2007) 

– SVM-MAP (Yue, et al, 2007) 

– PermuRank (Xu, et al 2008) 

• Approximation of IR Measures 

– SoftRank (Taylor 2007) 

– ApproxRank (Qin, et al, to appear)
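As an illustration of the probabilistic listwise idea, the sketch below computes a ListNet-style loss on one query: labels and model scores are each turned into a "top-one" probability distribution with a softmax, and the loss is the cross entropy between the two distributions. This is a sketch under that reading of ListNet; the function names are hypothetical, and gradient-based training of the scoring function is left out.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top1_loss(labels, scores):
    """Cross entropy between top-one distributions of labels and scores."""
    p = softmax(labels)   # target distribution from relevance labels
    q = softmax(scores)   # model distribution from predicted scores
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is defined on the whole list of documents for a query at once, rather than on single documents or pairs.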

Importance Ranking

General Model for Importance Ranking

documents: $D = \{d_1, d_2, \ldots, d_n\}$

query: $q$

relevance: $f(q, d)$

importance: $g(d)$

importance and relevance scores for ranking:

$d_1$: $g(d_1)$, $f(q, d_1)$

$d_2$: $g(d_2)$, $f(q, d_2)$

$\ldots$

$d_n$: $g(d_n)$, $f(q, d_n)$

Importance 

• No rigorous definition 

• Citation, quality, novelty 

• Independent of query

PageRank

(Brin and Page, 1998)

$$P(d_i) = \alpha \sum_{d_j \in M(d_i)} \frac{P(d_j)}{L(d_j)} + (1 - \alpha) \frac{1}{n}$$
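The PageRank recurrence can be solved by repeated substitution (power iteration). The Python sketch below reads $M(d_i)$ as the set of pages linking to $d_i$ and $L(d_j)$ as the number of out-links of $d_j$, which is the usual notation but is not spelled out on the slide. The damping value `alpha=0.85`, the fixed iteration count, and the uniform treatment of pages without out-links are assumptions for illustration.

```python
def pagerank(out_links, alpha=0.85, iterations=50):
    """Power iteration for PageRank on a small in-memory graph.

    out_links: dict mapping each page to the list of pages it links to;
    every linked page must also appear as a key.
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - alpha) / n for p in pages}  # random jump term
        for p in pages:
            targets = out_links[p]
            if targets:
                share = alpha * rank[p] / len(targets)  # P(d_j) / L(d_j)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its score
                # uniformly over all pages.
                for t in pages:
                    new_rank[t] += alpha * rank[p] / n
        rank = new_rank
    return rank
```

The scores stay a probability distribution at every step, so they can be compared directly across pages.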

Building User Browsing Graph

(Liu et al 2008)

A directed graph with rich metadata.

• Vertex: Web page

• Edge: Transition

• Edge weight $w_{ij}$: Number of transitions

• Staying time $T_i$: Time spent on page $i$

• Vertex weight $C_i$: Number of visits for page $i$

• Reset probability: Normalized frequencies as first page of session

BrowseRank: Continuous-time Markov Model

• Calculating the stationary distribution of the continuous-time model directly: hard

• Estimating the staying time distribution of each state

• Computing the stationary distribution of a discrete-time Markov chain (called the embedded Markov chain)
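The two-step decomposition can be sketched with a standard property of continuous-time Markov processes: the stationary probability of a state is proportional to the stationary probability of the embedded discrete-time chain multiplied by the mean staying time in that state. The Python sketch below assumes the embedded transition matrix `P` and the mean staying times are already estimated; the function names and the simple power iteration are illustrative assumptions, not the estimation procedure of the cited paper.

```python
def stationary_distribution(P, iterations=200):
    """Power iteration for the stationary distribution of a
    row-stochastic transition matrix P (list of lists)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def browse_rank(P, mean_stay):
    """Combine embedded-chain probabilities with mean staying times.

    P: embedded discrete-time transition matrix.
    mean_stay: mean staying time per page, in any consistent unit.
    """
    pi = stationary_distribution(P)
    weighted = [p * t for p, t in zip(pi, mean_stay)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

Two pages visited equally often therefore receive different scores when users stay longer on one of them.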

Web Page Understanding

Web Page Understanding 

• Web Information Extraction 

– Block Analysis 

– Metadata Extraction (Title, Date, etc) 

– Text Information Extraction 

– Wrapper Generation 

• Web Page Classification 

– Based on Semantics 

– Based on Type (Homepage, Spec, etc)

Block Analysis

(Song et al 2004)

• A web page can be segmented into blocks

• Important blocks can be identified

Title Extraction 

(Hu et al 2005)

Text Information Extraction

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Software Foundation

General Model for Information Extraction

Training Data: $(O_1, S_1), (O_2, S_2), \ldots, (O_n, S_n)$

Learning System

Model: $P(O, S)$ or $P(S \mid O)$

Test Data: $O_{n+1}$

Extraction System

Output: $(O_{n+1}, S_{n+1})$

Wrapper Generation

Query Understanding

Query Understanding 

• Spelling Error Correction 

• Query Refinement 

– E.g., “ny times” → “new york times”

• Query Classification 

– Based on Semantics (Sport, etc) 

– Based on Type (Name Query, etc) 

• Query Suggestion 

– E.g., “microsoft” → “bill gates”

Mismatching between Query Term and Document Term

[Figure: user intent “I want to search ‘myspace’”. Terms shown: “myspace”, “my space”, “my”, “space”.]

Understanding the Intent and Solving the Mismatch

[Figure: user intent “I want to search ‘myspace’”. A Query Understanding step sits between the query forms “myspace” and “my space” and the terms “myspace”, “my”, “space”.]

Structured Prediction Problem

“Ideal” word sequence (“ideal” query word sequence): windows onecare

Observed “noisy” word sequence (original query word sequence): window onecar

Conditional Random Fields for Query Refinement (Guo et al 2008)

Introducing Refinement Operations

$o_{i-1}$, $o_i$, $o_{i+1}$

$y_{i-1}$, $y_i$, $y_{i+1}$

$x_{i-1}$, $x_i$, $x_{i+1}$

Operations 

Spelling: insertion, deletion, substitution, transposition, … 

Word Stemming: +s/-s, +es/-es, +ed/-ed, +ing/-ing, …
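The spelling operations listed above (insertion, deletion, substitution, transposition) can be enumerated for a single word as follows. This Python sketch only generates candidate words that are one operation away from the observed word. It does not implement the conditional random field that chooses among operations, and the function name `single_edits` and the lowercase ASCII alphabet are assumptions for illustration.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """Return all strings one spelling operation away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {l + r[1:] for l, r in splits if r}
    transpositions = {l + r[1] + r[0] + r[2:]
                      for l, r in splits if len(r) > 1}
    substitutions = {l + c + r[1:]
                     for l, r in splits if r for c in alphabet}
    insertions = {l + c + r for l, r in splits for c in alphabet}
    candidates = deletions | transpositions | substitutions | insertions
    return candidates - {word}  # the unchanged word is not a refinement
```

For the example query on the earlier slide, “windows” is among the candidates for “window” and “onecare” is among the candidates for “onecar”.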

Crawling

Crawling 

• Crawling Scheduling 

• Near Duplicate Detection

Crawling Scheduling 

• Search of a large-scale and dynamically changing graph

• Breadth-first crawling

• Preferential crawling: importance, freshness, coverage
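Preferential crawling can be sketched as a graph search whose frontier is a priority queue instead of the first-in, first-out queue of breadth-first crawling. In the Python sketch below, `get_links` and `priority` are hypothetical callbacks supplied by the caller: a real crawler would fetch and parse pages in `get_links` and would combine importance, freshness, and coverage in `priority`. Neither is specified on the slide.

```python
import heapq


def preferential_crawl(seeds, get_links, priority, max_pages):
    """Visit pages in order of decreasing priority.

    seeds: starting URLs; get_links(url): out-links of a page;
    priority(url): larger means crawl sooner; max_pages: crawl budget.
    """
    # heapq is a min-heap, so priorities are negated.
    frontier = [(-priority(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority(link), link))
    return order
```

Using a constant priority with an insertion counter as tie-breaker would recover breadth-first order, so the two strategies on the slide differ only in how the frontier is ordered.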

Near Duplicate Detection

[Figure: similarity computed between the shingles of two documents]
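Shingle-based near duplicate detection can be sketched as follows: each document is reduced to its set of overlapping word w-grams (shingles), and two documents are compared by the Jaccard similarity of those sets. The shingle width `w=3`, the whitespace tokenization, and the function names are assumptions for illustration. Large-scale systems typically compare compact sketches of the shingle sets rather than the sets themselves, which is not shown here.

```python
def shingles(text, w=3):
    """Return the set of word w-grams (shingles) of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two pages are then flagged as near duplicates when the similarity exceeds a chosen threshold.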

Indexing

Indexing

• Inverted Indexing

Inverted Index

t1 → d11, d12, d13, …

t2 → …

t3 → …

…
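The inverted index maps each term to the list of documents that contain it. A minimal in-memory Python sketch follows. The function names, the integer document ids, and the whitespace tokenization are assumptions for illustration; production indexes also store positions and frequencies and compress the posting lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of ids of documents containing it.

    docs: dict from document id to document text.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            index[term].append(doc_id)
    return dict(index)


def search_and(index, query):
    """Return ids of documents containing every term of the query."""
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

A conjunctive query is answered by intersecting the posting lists of its terms, so no document text is scanned at query time.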

Search Result Presentation

Search Result Presentation 

• Summary Generation 

• Result Clustering 

• Result Diversification (implicit clustering) 

• Novelty 

• Location Sensitiveness

Search Result Clustering 

(Zeng et al 2004)

Query Suggestion

Search Intent and Context 

• Suppose a user issues the query “gladiator”

History? People? Film?

• If we know the user issued the query “beautiful mind” before “gladiator”

• The user is likely to be interested in the film

• The user is likely to be searching for films starring Russell Crowe

Context Aware Query Suggestion 

(Cao et al 2008) 

• Offline part: model learning

– Summarizing queries into concepts by clustering the click-through bipartite graph

– Mining frequent patterns from session data and building a concept sequence suffix tree

• Online part: query suggestion

Anti-Spam

Anti-Spam 

• Spam manipulates relevance and importance

• Boundary between spam and Search Engine Optimization

– Unethical if a page is ranked higher than its real value warrants

– “Cheating” search engines

• Spam Type 

– Content Spam 

– Link Spam 

– Comment Spam 

– Cloaking

Content Spam

Highlighting: q q q q q q

Dumping: a b c d e f

Link Spam 

• Link from Blog, Forum, etc 

• Link Exchange 

• Link Farm

Cloaking 

End User 

Search Engine

Search Log Data Mining

Search Log Data Mining 

• Query Log 

• Click-through Data 

• Search Session Data

Click-through Data

[Figure: bipartite graph between queries q1, q2, …, qm and documents d1, d2, …, dn]

Bipartite graph: implicit tag

Click position: implicit relevance judgment
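The click-through bipartite graph can be built directly from a log of (query, clicked document) records. The Python sketch below stores click counts as edge weights and, as one simple use of the graph, finds other queries that share a clicked document with a given query. The log format, the function names, and the example URLs in the usage note are hypothetical.

```python
from collections import defaultdict


def build_click_graph(log):
    """Build query -> {document: click count} from (query, doc) records."""
    graph = defaultdict(lambda: defaultdict(int))
    for query, doc in log:
        graph[query][doc] += 1
    return {q: dict(docs) for q, docs in graph.items()}


def co_clicked_queries(graph, query):
    """Return other queries that share at least one clicked document."""
    docs = set(graph.get(query, {}))
    return sorted(q for q, clicked in graph.items()
                  if q != query and docs & set(clicked))
```

For example, if both “ny times” and “new york times” led to clicks on the same page, each query acts as an implicit tag for that page and the two queries become neighbors in the graph.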

Search Session Data 

• Sequence of queries from same user 

• Relation between queries 

– Error correction 

– Related query 

– Refined query 

– …

Summary

Talk Outline 

• What is Information Retrieval 

• Overview of Web IR Technologies 

• Application Technologies for Web IR 

• Component Technologies for Web IR 

• Summary


Overview of the Web Search System

[Figure: web search system. System components: User, User Interface, Ranker, Index, Indexer, Crawler, Web. Technologies marked on the diagram: query understanding, relevance ranking, learning to rank, search result presentation, search log data mining, document understanding, importance ranking, anti-spam.]

Thank You! 

hangli@microsoft.com
