Web Information Retrieval
<strong>Web</strong> <strong>Information</strong> <strong>Retrieval</strong><br />
Hang Li<br />
Microsoft Research Asia
Talk Outline<br />
• What is <strong>Web</strong> <strong>Information</strong> <strong>Retrieval</strong><br />
• Overview of <strong>Web</strong> IR Technologies<br />
• Application Technologies for <strong>Web</strong> IR<br />
• Component Technologies for <strong>Web</strong> IR<br />
• Summary
What is <strong>Web</strong> <strong>Information</strong><br />
<strong>Retrieval</strong>
<strong>Web</strong> Search is Part of Our Life
<strong>Web</strong> Users Heavily Rely on Search Engines<br />
http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf
Physically, a Search System Is a Data Center
Advanced <strong>Web</strong> Search Technologies<br />
Are Used…<br />
Statistical Learning<br />
Large Scale Distributed<br />
Computing
In This Talk<br />
<strong>Web</strong> <strong>Information</strong> <strong>Retrieval</strong> (IR)<br />
= <strong>Web</strong> Search from Algorithm<br />
Perspective
Research Areas Related to<br />
<strong>Web</strong> IR<br />
• <strong>Information</strong> <strong>Retrieval</strong><br />
• Statistical Machine Learning<br />
• Data Mining<br />
• Natural Language Processing<br />
• Human-Computer Interaction
Overview on <strong>Web</strong> IR<br />
Technologies<br />
Document Search, Entity Search, Facet Search,<br />
Question Answering, Image Search<br />
Relevance Ranking, Importance Ranking,<br />
<strong>Web</strong> Page Understanding, Query Understanding,<br />
Crawling, Indexing, Search Result Presentation,<br />
Anti-Spam, Search Log Data Mining<br />
Classification, Clustering, Learning to Rank,<br />
Link Analysis, <strong>Information</strong> Extraction
Fundamental Technologies for<br />
<strong>Web</strong> IR<br />
• Classification<br />
• Clustering<br />
• Learning to Rank<br />
• Link Analysis (Graph Learning)<br />
• <strong>Information</strong> Extraction (Structure<br />
Prediction)
Component Technologies for <strong>Web</strong> IR<br />
• Relevance Ranking<br />
• Importance Ranking<br />
• <strong>Web</strong> Page Understanding<br />
• Query Understanding<br />
• Crawling<br />
• Indexing<br />
• Search Result Presentation<br />
• Anti-Spam<br />
• Search Log Data Mining
Application Technologies for<br />
<strong>Web</strong> IR<br />
• Document Search (<strong>Web</strong> Page Search)<br />
• Entity Search<br />
• Facet Search<br />
• Question Answering<br />
• Image Search
Application Technologies
<strong>Information</strong> in Different<br />
Granularities and Forms<br />
• Document<br />
• Entity: person<br />
• Facet: definition, FAQ<br />
• Table
Document Search
Definition Search
Expert Search
FAQ Search
Image Search
Question Answering
Component Technologies
Example of <strong>Web</strong> Search<br />
Architecture<br />
[Figure: search architecture diagram]<br />
• Components: User, User Interface, Ranker, Index, Indexer, Crawler, <strong>Web</strong><br />
• Technologies marked on the diagram: query understanding, relevance ranking, importance ranking, search result presentation, search log data mining, <strong>Web</strong> page understanding, anti-spam
Relevance Ranking
General Framework for Relevance<br />
Ranking<br />
documents (information): $D = \{d_1, d_2, \ldots, d_n\}$<br />
query (or question): $q$<br />
$f(q, d)$<br />
relevance scores for ranking:<br />
$d_1 \sim f(q, d_1)$<br />
$d_2 \sim f(q, d_2)$<br />
$\vdots$<br />
$d_n \sim f(q, d_n)$
Relevance<br />
• No rigorous definition<br />
• Query = “soccer”, document = about soccer →<br />
document relevant<br />
• Judgment by humans: several discretized<br />
levels, e.g. “definitely relevant”, “partially<br />
relevant”
Key Factor for Relevance: Matching<br />
between Query and Document<br />
$f(q, d \mid D)$<br />
[Figure: diagram with nodes labeled $q$, $q'$, and $d$]
Probabilistic Model<br />
documents: $d_1, d_2, \ldots, d_n$<br />
query (or question): $q$<br />
$P(r \mid q, d)$, $r \in \{1, 0\}$<br />
relevance scores for ranking:<br />
$d_1 \sim P(r \mid q, d_1)$<br />
$d_2 \sim P(r \mid q, d_2)$<br />
$\vdots$<br />
$d_n \sim P(r \mid q, d_n)$
Okapi or BM25<br />
(Robertson and Walker 1994)<br />
documents: $d_1, d_2, \ldots, d_n$<br />
query (or question): $q$<br />
ranking function:<br />
$$\sum_{w \in d \cap q} \frac{(k + 1)\, tf(w)}{k \left( (1 - b) + b \, \frac{dl}{avgdl} \right) + tf(w)}$$
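The ranking function on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Okapi system itself: the function name `bm25_score` and the default values `k=1.2` and `b=0.75` are assumptions (commonly quoted settings, not taken from the talk), and, like the slide's formula, the sketch leaves out the inverse-document-frequency weight that full BM25 multiplies into each term.

```python
from collections import Counter


def bm25_score(query_terms, doc_terms, avgdl, k=1.2, b=0.75):
    """Sum the saturated term-frequency contribution of each query word in the document.

    query_terms, doc_terms: lists of tokens; avgdl: average document length.
    k and b are illustrative defaults (assumptions, not from the slides).
    """
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    # Length normalization: documents longer than average are penalized.
    norm = k * ((1 - b) + b * dl / avgdl)
    score = 0.0
    for w in set(query_terms):
        if tf[w] > 0:  # only words in both query and document contribute
            score += (k + 1) * tf[w] / (norm + tf[w])
    return score


if __name__ == "__main__":
    docs = [["soccer", "news", "soccer"], ["tennis", "news", "today"]]
    avgdl = sum(len(d) for d in docs) / len(docs)
    for d in docs:
        print(d, bm25_score(["soccer"], d, avgdl))
```

The saturation in the denominator means a second occurrence of a word adds less than the first, which is the main difference from a plain term-frequency count.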
Language Model<br />
(Ponte and Croft 1998)<br />
document = bag of words<br />
$q = w_{q1}\, w_{q2} \cdots w_{q l_q}$<br />
$d_1 = w_{11}\, w_{12} \cdots w_{1 l_1}$<br />
$d_2 = w_{21}\, w_{22} \cdots w_{2 l_2}$<br />
$\vdots$<br />
$d_n = w_{n1}\, w_{n2} \cdots w_{n l_n}$<br />
relevance scores for ranking:<br />
$d_1 \sim P(q \mid d_1)$<br />
$d_2 \sim P(q \mid d_2)$<br />
$\vdots$<br />
$d_n \sim P(q \mid d_n)$
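A small sketch of scoring by $P(q \mid d)$ under the bag-of-words assumption follows. The slide does not say how the word probabilities are estimated, so the Jelinek-Mercer smoothing used here, the mixing weight `lam`, and the function name `query_log_likelihood` are all assumptions made for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Return log P(q | d) for a unigram model of the document.

    Each word probability mixes the document estimate with a collection-wide
    estimate (Jelinek-Mercer smoothing, an assumed choice) so that a query word
    missing from the document does not force the score to zero.
    """
    doc_tf = Counter(doc_terms)
    coll_tf = Counter(collection_terms)
    doc_len = len(doc_terms) or 1       # avoid division by zero on empty input
    coll_len = len(collection_terms) or 1
    log_p = 0.0
    for w in query_terms:
        p = (1 - lam) * doc_tf[w] / doc_len + lam * coll_tf[w] / coll_len
        if p == 0.0:                    # word unseen in the whole collection
            return float("-inf")
        log_p += math.log(p)
    return log_p


if __name__ == "__main__":
    d1 = ["soccer", "match", "soccer"]
    d2 = ["tennis", "match"]
    collection = d1 + d2
    for d in (d1, d2):
        print(d, query_log_likelihood(["soccer", "match"], d, collection))
```

Documents are then ranked by this log-likelihood, highest first.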
Learning to Rank Model<br />
(Herbrich et al., 2000)<br />
documents: $d_1, d_2, \ldots, d_n$<br />
query (or question): $q$<br />
$r = f(q, d)$<br />
Ranking SVM<br />
relevance scores for ranking:<br />
$d_1 \sim f(q, d_1)$<br />
$d_2 \sim f(q, d_2)$<br />
$\vdots$<br />
$d_n \sim f(q, d_n)$
Training Process<br />
queries $q_1, \ldots, q_m$; query $q_i$ has documents $d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}$<br />
1. Data Labeling (rank): each document $d_{i,j}$ receives a label $y_{i,j}$<br />
2. Feature Extraction: each $d_{i,j}$ becomes a feature vector $x_{i,j}$, giving pairs $(x_{i,j}, y_{i,j})$<br />
3. Learning: $f(x)$
Testing Process<br />
test query $q_{m+1}$ with documents $d_{m+1,1}, d_{m+1,2}, \ldots, d_{m+1,n_{m+1}}$<br />
1. Data Labeling (rank): each document $d_{m+1,j}$ receives a label $y_{m+1,j}$<br />
2. Feature Extraction: pairs $(x_{m+1,j}, y_{m+1,j})$<br />
3. Ranking with $f(x)$: triples $(x_{m+1,j}, f(x_{m+1,j}), y_{m+1,j})$<br />
4. Evaluation: Evaluation Result
Previous Approach: Pairwise<br />
Approach<br />
• Transforming ranking to classification<br />
– Ranking SVM (Herbrich et al, 2000)<br />
– RankBoost (Freund et al, 2003)<br />
– RankNet (Burges et al, 2005)<br />
– IR-SVM (Cao et al, 2005)<br />
– FRank (Tsai et al, 2006)
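The step "transforming ranking to classification" can be illustrated with the difference-vector construction used by linear pairwise methods such as Ranking SVM. The sketch below only builds the classification examples for one query. The helper name `make_pairs` is hypothetical, and training a classifier on the result is left out.

```python
def make_pairs(features, labels):
    """Turn one query's labeled documents into pairwise classification examples.

    features: list of feature vectors (lists of floats), one per document.
    labels:   list of relevance grades, larger means more relevant.
    Returns (difference_vector, class) tuples with class +1 or -1.
    """
    pairs = []
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                diff = [a - b for a, b in zip(features[i], features[j])]
                # x_i - x_j is a positive example, x_j - x_i a negative one.
                pairs.append((diff, +1))
                pairs.append(([-v for v in diff], -1))
    return pairs


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    y = [2, 0, 0]
    for example in make_pairs(x, y):
        print(example)
```

Documents with equal labels produce no pairs, and pairs are never formed across different queries, since relevance grades are only comparable within one query.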
Our Proposal = Listwise Approach<br />
• Probabilistic Model<br />
– ListNet (Cao, et al 2007)<br />
– ListMLE (Xia, et al 2008)<br />
• Optimizing Upper Bounds of IR Measures<br />
– AdaRank (Xu & Li, 2007)<br />
– SVM-MAP (Yue, et al, 2007)<br />
– PermuRank (Xu, et al 2008)<br />
• Approximation of IR Measures<br />
– SoftRank (Taylor 2007)<br />
– ApproxRank (Qin, et al, to appear)
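To make the listwise idea concrete, here is a hedged sketch of a ListNet-style loss: the scores of a whole list are mapped to a "top-one" probability distribution with a softmax, and the loss is the cross entropy between the distribution induced by the labels and the one induced by the model. The function names are hypothetical and the sketch omits the model and its gradient-based training.

```python
import math


def softmax(scores):
    """Map a list of scores to probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top1_loss(model_scores, label_scores):
    """Cross entropy between the top-one distributions of labels and model scores."""
    p_true = softmax(label_scores)
    p_model = softmax(model_scores)
    return -sum(pt * math.log(pm) for pt, pm in zip(p_true, p_model))


if __name__ == "__main__":
    labels = [2.0, 1.0, 0.0]
    print(listnet_top1_loss([2.0, 1.0, 0.0], labels))  # model agrees with labels
    print(listnet_top1_loss([0.0, 1.0, 2.0], labels))  # model reverses the order
```

Unlike the pairwise construction, the loss is defined on the entire list of documents for a query at once.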
Importance Ranking
General Model for Importance<br />
Ranking<br />
documents: $D = \{d_1, d_2, \ldots, d_n\}$<br />
query (or question): $q$<br />
$f(q, d)$, $g(d)$<br />
importance and relevance scores for ranking:<br />
$g(d_1),\ f(q, d_1)$<br />
$g(d_2),\ f(q, d_2)$<br />
$\vdots$<br />
$g(d_n),\ f(q, d_n)$
Importance<br />
• No rigorous definition<br />
• Citation, quality, novelty<br />
• Independent of query
PageRank<br />
(Brin and Page, 1998)<br />
$$P(d_i) = \alpha \sum_{d_j \in M(d_i)} \frac{P(d_j)}{L(d_j)} + (1 - \alpha)\, \frac{1}{n}$$
$M(d_i)$: pages linking to $d_i$; $L(d_j)$: number of out-links of $d_j$; $n$: number of pages
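The PageRank recurrence can be computed by power iteration. The sketch below is illustrative only: the damping value `alpha=0.85`, the fixed iteration count, and the uniform treatment of pages without out-links are assumptions, and `out_links` is a hypothetical input format in which every link target also appears as a key.

```python
def pagerank(out_links, alpha=0.85, iterations=50):
    """Power iteration for P(d_i) = alpha * sum_j P(d_j) / L(d_j) + (1 - alpha) / n.

    out_links: dict mapping each page to the list of pages it links to.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - alpha) / n for p in pages}
        for p, targets in out_links.items():
            if targets:
                share = alpha * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its score uniformly (an assumption).
                for t in pages:
                    new_rank[t] += alpha * rank[p] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

Because each update redistributes all of the probability mass, the scores keep summing to 1 across iterations.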
Building User Browsing Graph<br />
(Liu et al 2008)<br />
A directed graph with rich metadata.<br />
Vertex: <strong>Web</strong> page<br />
Edge: Transition<br />
Edge weight $w_{ij}$: Number of transitions<br />
Staying time $T_i$: Time spent on page $i$<br />
Vertex weight $C_i$: Number of visits to page $i$<br />
Reset probability: Normalized frequencies as first page of session
BrowseRank: Continuous-time<br />
Markov Model<br />
• Calculating the stationary distribution of the continuous-time process directly: hard<br />
• Estimating the staying time distribution of each state<br />
• Computing the stationary distribution of a discrete-time Markov chain (called the embedded Markov chain)
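The two-step computation can be sketched as follows, using the standard result that the stationary probability of a state in a continuous-time Markov process is proportional to its embedded-chain stationary probability multiplied by its mean staying time. This is a toy illustration, not the BrowseRank implementation: the function names are hypothetical, the transition matrix is assumed to be row-stochastic and ergodic (with any reset probability already folded in), and staying times are summarized by their means.

```python
def stationary_distribution(transition, iterations=200):
    """Power iteration on a row-stochastic matrix given as a list of rows."""
    n = len(transition)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return pi


def browse_rank(transition, mean_staying_time):
    """Weight the embedded chain's stationary distribution by mean staying time."""
    embedded = stationary_distribution(transition)
    weighted = [p * t for p, t in zip(embedded, mean_staying_time)]
    total = sum(weighted)
    return [w / total for w in weighted]


if __name__ == "__main__":
    # Two pages visited equally often; users stay three times longer on page 1.
    transition = [[0.5, 0.5], [0.5, 0.5]]
    print(browse_rank(transition, [1.0, 3.0]))
```

In this toy case the longer staying time alone makes the second page more important, even though both pages receive the same number of visits.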
<strong>Web</strong> Page Understanding
<strong>Web</strong> Page Understanding<br />
• <strong>Web</strong> <strong>Information</strong> Extraction<br />
– Block Analysis<br />
– Metadata Extraction (Title, Date, etc)<br />
– Text <strong>Information</strong> Extraction<br />
– Wrapper Generation<br />
• <strong>Web</strong> Page Classification<br />
– Based on Semantics<br />
– Based on Type (Homepage, Spec, etc)
Block Analysis<br />
(Song et al 2004)<br />
• <strong>Web</strong> page can be segmented into blocks<br />
• Important blocks can be identified
Title Extraction<br />
(Hu et al 2005)
Text <strong>Information</strong> Extraction<br />
October 14, 2002, 4:00 a.m. PT<br />
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.<br />
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.<br />
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."<br />
Richard Stallman, founder of the Free Software Foundation, countered saying…<br />
NAME | TITLE | ORGANIZATION<br />
Bill Gates | CEO | Microsoft<br />
Bill Veghte | VP | Microsoft<br />
Richard Stallman | Founder | Free Software Foundation
General Model for<br />
<strong>Information</strong> Extraction<br />
Training Data: $(O_1, S_1), (O_2, S_2), \ldots, (O_n, S_n)$<br />
Learning System<br />
Model: $P(O, S)$ or $P(S \mid O)$<br />
Test Data: $O_{n+1}$<br />
Extraction System<br />
Output: $(O_{n+1}, S_{n+1})$
Wrapper Generation
Query Understanding
Query Understanding<br />
• Spelling Error Correction<br />
• Query Refinement<br />
– E.g., “ny times” → “new york times”<br />
• Query Classification<br />
– Based on Semantics (Sport, etc)<br />
– Based on Type (Name Query, etc)<br />
• Query Suggestion<br />
– E.g., “microsoft” → “bill gates”
Mismatching between Query Term<br />
and Document Term<br />
[Figure: a user thinking “I want to search ‘myspace’”; terms shown: “myspace”, “my space”, “space”, “my”]
Understanding the Intent and<br />
Solving the Mismatch<br />
[Figure: a user thinking “I want to search ‘myspace’”; a Query Understanding step sits between the query and the terms “myspace”, “my space”, “space”, “my”]
Structured Prediction Problem<br />
“Ideal” word sequence (“ideal” query word sequence): windows onecare<br />
Observed “noisy” word sequence (original query word sequence): window onecar
Conditional Random Fields for Query<br />
Refinement (Guo et al 2008)<br />
Introducing Refinement Operations<br />
$o_{i-1}$ $o_i$ $o_{i+1}$<br />
$y_{i-1}$ $y_i$ $y_{i+1}$<br />
$x_{i-1}$ $x_i$ $x_{i+1}$<br />
Operations<br />
Spelling: insertion, deletion, substitution, transposition, …<br />
Word Stemming: +s/-s, +es/-es, +ed/-ed, +ing/-ing, …
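The four spelling operations listed above (insertion, deletion, substitution, transposition) are the same operations counted by the optimal string alignment variant of edit distance. The sketch below is not the CRF model of Guo et al. It only shows how one might measure how many such operations separate an observed query word from a candidate correction, and the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b.

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent characters, each at cost 1.
    """
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


if __name__ == "__main__":
    for noisy, ideal in [("window", "windows"), ("onecar", "onecare"), ("wnidows", "windows")]:
        print(noisy, ideal, edit_distance(noisy, ideal))
```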
Crawling
Crawling<br />
• Crawling Scheduling<br />
• Near Duplicate Detection
Crawling Scheduling<br />
• Search of a large scale and dynamically<br />
changing graph<br />
• Breadth first crawling<br />
• Preferential crawling: importance, freshness,<br />
coverage
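Preferential crawling can be sketched as a graph search driven by a priority queue. In the sketch below, `get_links` and `priority` are hypothetical callables supplied by the caller (for example, a fetcher and an importance or freshness estimate). No real network access is performed, and politeness delays, re-crawling, and distribution across machines are ignored.

```python
import heapq


def crawl_order(seed_urls, get_links, priority, max_pages=100):
    """Return the order in which pages are fetched under preferential crawling.

    The frontier is a heap keyed on negative priority, so the most important
    known URL is always fetched next.
    """
    frontier = [(-priority(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for link in get_links(url):
            if link not in seen:         # schedule each URL at most once
                seen.add(link)
                heapq.heappush(frontier, (-priority(link), link))
    return order


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    importance = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5}
    print(crawl_order(["a"], graph.__getitem__, importance.__getitem__))
```

Replacing the heap with a first-in, first-out queue would turn the same loop into breadth-first crawling.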
Near Duplicate Detection<br />
[Figure: similarity computed between the shingles of two documents]
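A minimal sketch of shingle-based similarity: each document is reduced to its set of overlapping word k-grams (shingles), and two documents are compared by the Jaccard coefficient of those sets. The shingle length `k=3` and the whitespace tokenization are assumptions, and production systems typically hash and sample the shingles rather than compare full sets.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    d1 = "web search is part of our life"
    d2 = "web search is part of daily life"
    print(jaccard(shingles(d1), shingles(d2)))
```

Two pages would be flagged as near duplicates when this similarity exceeds a chosen threshold.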
Indexing
Indexing<br />
• Inverted Indexing
Inverted Index<br />
t1: d11, d12, d13, …<br />
t2: …<br />
t3: …<br />
……
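The structure on this slide, a mapping from each term to the list of documents containing it, can be sketched with a dictionary. The function names and the whitespace tokenization are illustrative assumptions, and real indexes also store positions and term frequencies and compress the posting lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of the ids of documents that contain it.

    docs: dict mapping document id to its text.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            index[term].append(doc_id)
    return dict(index)


def search_and(index, terms):
    """Return ids of documents containing every query term."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    docs = {1: "web search", 2: "web mining", 3: "image search"}
    index = build_inverted_index(docs)
    print(index["web"])
    print(search_and(index, ["web", "search"]))
```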
Search Result Presentation
Search Result Presentation<br />
• Summary Generation<br />
• Result Clustering<br />
• Result Diversification (implicit clustering)<br />
• Novelty<br />
• Location Sensitiveness
Search Result Clustering<br />
(Zeng et al 2004)
Query Suggestion
Search Intent and Context<br />
• Suppose a user issues the query “gladiator”<br />
History?<br />
People?<br />
Film?<br />
• If we know the user issued the query “beautiful mind” before<br />
“gladiator”<br />
• User is likely to be interested in the film<br />
• User is likely to be searching for films starring Russell Crowe.
Context Aware Query Suggestion<br />
(Cao et al 2008)<br />
• Offline part: model learning<br />
– Summarizing queries into concepts by clustering the click-through<br />
bipartite graph<br />
– Mining frequent patterns from session data and building a<br />
concept sequence suffix tree<br />
• Online part: query suggestion
Anti-Spam
Anti-Spam<br />
• Manipulate relevance and importance<br />
• Boundary with Search Engine Optimization<br />
– Not ethical if the aim is to be ranked higher than the page's real value warrants<br />
– “Cheating” search engines<br />
• Spam Type<br />
– Content Spam<br />
– Link Spam<br />
– Comment Spam<br />
– Cloaking
Content Spam<br />
Highlighting: q q q q q q<br />
Dumping: a b c d e f
Link Spam<br />
• Link from Blog, Forum, etc<br />
• Link Exchange<br />
• Link Farm
Cloaking<br />
[Figure: different content is served to the End User and to the Search Engine]
Search Log Data Mining
Search Log Data Mining<br />
• Query Log<br />
• Click-through Data<br />
• Search Session Data
Click-through Data<br />
[Figure: bipartite graph linking queries q1, q2, …, qm to clicked documents d1, d2, …, dn]<br />
Bipartite graph: implicit tag<br />
Click position: implicit relevance judgment
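The click-through bipartite graph can be sketched as a nested dictionary from queries to clicked URLs with click counts. The log format and function names are hypothetical, and the Jaccard overlap of clicked URLs is just one simple way to use the "implicit tag" idea to relate queries that lead to the same pages.

```python
from collections import defaultdict


def build_click_graph(click_log):
    """Build query -> {url: click count} from (query, clicked_url) pairs."""
    graph = defaultdict(lambda: defaultdict(int))
    for query, url in click_log:
        graph[query][url] += 1
    return graph


def query_similarity(graph, q1, q2):
    """Jaccard overlap of the URL sets clicked for two queries."""
    a = set(graph.get(q1, {}))
    b = set(graph.get(q2, {}))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    log = [
        ("ny times", "nytimes.com"),
        ("new york times", "nytimes.com"),
        ("new york times", "wikipedia.org/nyt"),
        ("ny times", "nytimes.com"),
    ]
    g = build_click_graph(log)
    print(g["ny times"]["nytimes.com"])
    print(query_similarity(g, "ny times", "new york times"))
```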
Search Session Data<br />
• Sequence of queries from same user<br />
• Relation between queries<br />
– Error correction<br />
– Related query<br />
– Refined query<br />
– …
Summary
Talk Outline<br />
• What is <strong>Web</strong> <strong>Information</strong> <strong>Retrieval</strong><br />
• Overview of <strong>Web</strong> IR Technologies<br />
• Application Technologies for <strong>Web</strong> IR<br />
• Component Technologies for <strong>Web</strong> IR<br />
• Summary
Overview on <strong>Web</strong> IR<br />
Technologies<br />
Document Search, Entity Search, Facet Search,<br />
Question Answering<br />
Relevance Ranking, Importance Ranking,<br />
<strong>Web</strong> Page Understanding, Query Understanding,<br />
Crawling, Indexing, Search Result Presentation,<br />
Anti-Spam, Search Log Data Mining<br />
Classification, Clustering, Learning to Rank,<br />
Link Analysis, <strong>Information</strong> Extraction
Overview on <strong>Web</strong> Search System<br />
[Figure: search system diagram]<br />
• Components: User, User Interface, Ranker, Index, Indexer, Crawler, <strong>Web</strong><br />
• Technologies marked on the diagram: query understanding, relevance ranking, learning to rank, search result presentation, search log data mining, document understanding, importance ranking, anti-spam
Thank You!<br />
hangli@microsoft.com