Learning to Rank for Information Retrieval
Tie-Yan Liu
Lead Researcher
Microsoft Research Asia

4/23/2008 Tie-Yan Liu @ Renmin University
Outline

• Ranking in Information Retrieval
• Learning to Rank
  – Introduction to Learning to Rank
  – Pointwise approach
  – Pairwise approach
  – Listwise approach
• Summary
Introduction to Ranking in IR
Inside Search Engine

[Architecture diagram. Online part: the user interface receives the query, backed by caching and relevance ranking over the inverted index and page ranks. Offline part: the crawler fetches pages; the web page parser extracts links & anchors into a link map and cached pages; the index builder constructs the inverted index; the web graph builder constructs the web graph for link analysis (page ranks); page & site statistics are collected.]
Ranking in Information Retrieval

• Document index: D = {d_i}
• Query: q = "www2008"
• The ranking model sorts the documents by relevance to q, producing a ranked list d_{q,1}, d_{q,2}, …, d_{q,k}, e.g.:
  1. www2008.org
  2. www2008.org/submissions/index.html
  3. www.iw3c2.org/blog/welcome/www2008
Widely-used Judgments

• Binary judgment
  – Relevant vs. irrelevant
• Multi-level ratings
  – Perfect > Excellent > Good > Fair > Bad
• Pairwise preferences
  – Document A is more relevant than document B with respect to query q
Evaluation Measures

• MRR (Mean Reciprocal Rank)
  – For query q_i, let r_i be the rank position of its first relevant document.
  – MRR: average of 1/r_i over all queries.
• WTA (Winners Take All)
  – 1 if the top-ranked document is relevant, 0 otherwise; averaged over all queries.
• MAP (Mean Average Precision)
• NDCG (Normalized Discounted Cumulative Gain)
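The two simplest measures above can be sketched in a few lines. A minimal sketch over binary relevance lists (1 = relevant, 0 = irrelevant), ranked best-first; the function names are illustrative:

```python
def reciprocal_rank(ranked_rels):
    """1/r_i for the first relevant document; 0 if none is relevant."""
    for pos, rel in enumerate(ranked_rels, start=1):
        if rel:
            return 1.0 / pos
    return 0.0

def mrr(queries):
    """Average of 1/r_i over all queries."""
    return sum(reciprocal_rank(r) for r in queries) / len(queries)

def wta(queries):
    """Winners Take All: 1 if the top document is relevant, else 0."""
    return sum(1.0 if r and r[0] else 0.0 for r in queries) / len(queries)

queries = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(mrr(queries))  # (1/2 + 1 + 0) / 3 = 0.5
```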
MAP

• Precision at position n:
  P@n = #{relevant documents in top n results} / n
• Average precision:
  AP = Σ_n ( P@n · I{document at position n is relevant} ) / #{relevant documents}
  – Example: relevant documents at positions 1, 3, and 5 give
    AP = (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
• MAP: AP averaged over all queries in the test set.
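The AP definition above can be sketched directly; this reproduces the slide's worked example (relevant documents at positions 1, 3, and 5):

```python
def average_precision(ranked_rels):
    """AP = mean of P@n taken at each relevant position (binary labels)."""
    hits, precisions = 0, []
    for pos, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / pos)  # P@n at this relevant position
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(queries):
    return sum(average_precision(r) for r in queries) / len(queries)

ap = average_precision([1, 0, 1, 0, 1])
print(round(ap, 2))  # 0.76, as in the slide's example
```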
NDCG

• NDCG at position n:
  N(n) = Z_n · Σ_{j=1}^{n} (2^{r(j)} − 1) / log(1 + j)
  where r(j) is the rating of the document at position j, and Z_n normalizes so that the perfect ranking gets N(n) = 1.
• NDCG is averaged over all queries in the test set.
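A minimal NDCG@n sketch following the slide's formula, with the normalizer Z_n obtained from the ideally ordered list:

```python
import math

def dcg(ratings, n):
    """Sum of (2^r(j) - 1) / log(1 + j) over the top n positions."""
    return sum((2 ** r - 1) / math.log(1 + j)
               for j, r in enumerate(ratings[:n], start=1))

def ndcg(ratings, n):
    """Normalize by the DCG of the ideal (descending-rating) ordering."""
    ideal = dcg(sorted(ratings, reverse=True), n)
    return dcg(ratings, n) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 3, 2, 1, 0], 5)
print(perfect)  # 1.0 for an ideally ordered list
```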
More about Evaluation Measures

• Query-level
  – Averaged over all queries.
  – The measure is bounded, and every query contributes equally to the average.
• Position
  – Position discounts are used throughout the measures.
• Rating
  – Different ratings may correspond to different gains.
Learning to Rank for IR

Overview
Pointwise Approach
Pairwise Approach
Listwise Approach
Conventional IR Models

Category                 Example Models
Similarity-based models  Boolean model, Vector space model, Latent Semantic Indexing, …
Probabilistic models     BM25 model, Language model for IR, …
Hyperlink-based models   HITS, PageRank, …
Problems with Conventional Ranking Models

• For a particular model
  – Parameter tuning is difficult, especially when there are many parameters to tune.
• For comparison between two models
  – Given a test set, when one model is over-tuned (overfitting) while the other is not, how can we compare them?
• For a collection of models
  – Hundreds of models have been proposed in the literature; it is non-trivial to combine them.
Machine Learning Can Help

• Machine learning is an effective tool
  – To automatically tune parameters
  – To combine multiple evidences
  – To avoid over-fitting (regularization framework, structural risk minimization, etc.)
Framework of Learning to Rank for IR

• Training data: queries with labeled documents (labels: multiple-level ratings, for example), e.g.
  q^(1): (d_1^(1), 4), (d_2^(1), 2), …, (d_n^(1), 1)
  …
  q^(m): (d_1^(m), 5), (d_2^(m), 3), …, (d_n^(m), 2)
• Learning system: learns a model f(q, d, w) from the training data.
• Test data: a new query q with unlabeled documents (d_1, ?), (d_2, ?), …, (d_n, ?).
• Ranking system: sorts the test documents by the model, f(q, d_{i1}, w) ≥ f(q, d_{i2}, w) ≥ … ≥ f(q, d_{in}, w).
Connection with Conventional Machine Learning

• With certain assumptions, learning to rank can be transformed (reduced) to conventional machine learning problems:
  – Regression (assume labels to be scores)
  – Classification (assume labels to have no order)
  – Pairwise classification (assume document pairs to be independent of each other)
Difference from Conventional Machine Learning

• Unique properties of ranking in IR
  – Role of query
    • The query determines the logical structure of the ranking problem.
    • The loss function should be defined on ranked lists w.r.t. a query, since IR measures are computed at the query level.
  – Relative order is important
    • No need to predict a category, or the value of f(x).
  – Position
    • Top-ranked objects are more important.
Categorization of Learning to Rank Methods

• Reduced to classification or regression
  – Pointwise: Discriminative model for IR (SIGIR 2004), McRank (NIPS 2007), …
  – Pairwise: Ranking SVM (ICANN 1999), RankBoost (JMLR 2003), LDM (SIGIR 2005), RankNet (ICML 2005), FRank (SIGIR 2007), GBRank (SIGIR 2007), QBRank (NIPS 2007), MPRank (ICML 2007), …
• Treated as a new problem
  – Pairwise: IRSVM (SIGIR 2006), MHR (SIGIR 2007), …
  – Listwise: AdaRank (SIGIR 2007), SVM-MAP (SIGIR 2007), SoftRank (LR4IR 2007), GPRank (LR4IR 2007), CCA (SIGIR 2007), LambdaRank (NIPS 2006), RankCosine (IP&M 2007), ListNet (ICML 2007), ListMLE (ICML 2008), …
Overview of Pointwise Approach

• Reduce ranking to regression or classification on single documents.
• Examples:
  – Discriminative model for IR.
  – McRank: learning to rank using multi-class classification and gradient boosting.
Discriminative Model for IR
(R. Nallapati, SIGIR 2004)

• Features are extracted for each document.
• Treat relevant documents as positive examples, irrelevant documents as negative examples.
• Use SVM or MaxEnt to perform discriminative learning.
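The pointwise recipe above can be sketched with a plain-Python logistic regression in place of SVM/MaxEnt: fit a classifier on per-document feature vectors, then rank by the learned score. The feature values below are illustrative, not from the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Stochastic gradient ascent on the log-likelihood of binary labels."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (t - p) * xi for wi, xi in zip(w, x)]
    return w

# Toy per-document features, e.g. (BM25-like score, term frequency).
X = [[2.0, 3.0], [1.8, 2.5], [0.2, 0.4], [0.1, 0.2]]
y = [1, 1, 0, 0]  # relevant = positive, irrelevant = negative
w = train_logistic(X, y)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
# At test time, rank a query's documents by score, descending.
ranked = sorted(range(len(X)), key=lambda i: -score(X[i]))
print(ranked[0] in (0, 1))  # a relevant document ranks first
```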
McRank
(P. Li, et al. NIPS 2007)

• The loss in terms of DCG is upper bounded (although loosely) by the multi-class classification error.
• Use gradient boosting to perform multi-class classification:
  – K trees per boosting iteration.
  – Tree outputs are converted to probabilities using the logistic function.
  – Convert the classification results into a ranking.
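A hedged sketch of the last step: given per-document class probabilities over K relevance levels (as the boosted classifier would output), one way to obtain a ranking score, consistent with McRank's approach, is the expected relevance Σ_k k·p_k. The probability values are illustrative:

```python
def expected_relevance(probs):
    """Sum_k k * p_k over relevance levels k = 0 .. K-1."""
    return sum(k * p for k, p in enumerate(probs))

docs = {
    "d1": [0.1, 0.2, 0.7],  # likely highly relevant
    "d2": [0.6, 0.3, 0.1],  # likely irrelevant
    "d3": [0.2, 0.6, 0.2],
}
# Score each document by expected relevance and sort descending.
ranking = sorted(docs, key=lambda d: -expected_relevance(docs[d]))
print(ranking)  # ['d1', 'd3', 'd2']
```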
Discussions

• Assumption
  – Relevance is absolute and query-independent.
• However, in practice, relevance is query-dependent:
  – Relevance labels are assigned by comparing the documents associated with the same query.
  – An irrelevant document for a popular query might have a higher term frequency than a relevant document for a rare query.
Overview of Pairwise Approach

• No longer assumes absolute relevance.
• Reduce ranking to classification on document pairs (i.e. pairwise classification) w.r.t. the same query.
• Examples: RankNet, FRank, RankBoost, Ranking SVM, etc.
• Transform: for query q^(i), labeled documents (d_1^(i), 5), (d_2^(i), 3), …, (d_n^(i), 2) become preference pairs such as (d_1^(i), d_2^(i)), (d_1^(i), d_n^(i)), (d_2^(i), d_n^(i)), since 5 > 3, 5 > 2, and 3 > 2.
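The transform above can be sketched as a few lines that turn per-query ratings into preference pairs (a, b), meaning "a should rank above b":

```python
def to_pairs(labeled_docs):
    """labeled_docs: list of (doc_id, rating) for one query."""
    pairs = []
    for i, (da, ra) in enumerate(labeled_docs):
        for db, rb in labeled_docs[i + 1:]:
            if ra > rb:
                pairs.append((da, db))
            elif rb > ra:
                pairs.append((db, da))  # ties generate no pair
    return pairs

docs = [("d1", 5), ("d2", 3), ("d3", 2)]
print(to_pairs(docs))  # [('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3')]
```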
RankNet
(C. Burges, et al. ICML 2005)

• New loss function: cross entropy (CE loss) between target posteriors (from the pairwise preference labels) and modeled posteriors.
• The model outputs a pairwise probability via the logistic function of the score difference.
RankNet (2)

• Use a neural network as the model, and gradient descent as the algorithm, to optimize the cross-entropy loss.
• Evaluate on single documents: output a relevance score for each document w.r.t. a new query.
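A hedged sketch of the RankNet pairwise loss: model the probability that document i beats document j with a logistic function of the score difference, and take cross entropy against the target posterior. Scores here are plain numbers standing in for neural-network outputs:

```python
import math

def pair_prob(s_i, s_j):
    """Modeled posterior P(i beats j) = logistic of the score difference."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def ce_loss(s_i, s_j, target):
    """target = 1 if i should rank above j, 0 if below, 0.5 if tied."""
    p = pair_prob(s_i, s_j)
    return -target * math.log(p) - (1 - target) * math.log(1 - p)

# A correctly ordered pair (preferred doc scored higher) has lower loss.
print(ce_loss(2.0, 0.0, 1.0) < ce_loss(0.0, 2.0, 1.0))  # True
```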
RankBoost
(Y. Freund, et al. JMLR 2003)

Use AdaBoost to perform pairwise classification:

• Given: initial distribution D_1 over document pairs (x_0, x_1).
• For t = 1, …, T:
  – Train a weak learner using distribution D_t.
  – Get weak ranking h_t : X → R.
  – Choose α_t ∈ R.
  – Update:
    D_{t+1}(x_0, x_1) = D_t(x_0, x_1) · exp(α_t (h_t(x_0) − h_t(x_1))) / Z_t
    where Z_t = Σ_{x_0, x_1} D_t(x_0, x_1) · exp(α_t (h_t(x_0) − h_t(x_1)))
• Output the final ranking: H(x) = Σ_{t=1}^{T} α_t h_t(x)
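A hedged sketch of this loop on 1-D scores with threshold weak rankers h(x) = 1[x > θ]. Pairs are (x_0, x_1) with x_1 preferred, so the update up-weights pairs the chosen weak ranker misorders; the α formula follows the usual RankBoost choice 0.5·ln((1+r)/(1−r)). Data and thresholds are illustrative:

```python
import math

def rankboost(pairs, thresholds, T=10):
    D = {p: 1.0 / len(pairs) for p in pairs}  # initial distribution D_1
    ensemble = []                              # list of (alpha_t, theta_t)
    for _ in range(T):
        # Weak ranker maximizing |r|, r = sum D * (h(x1) - h(x0)).
        def r_of(th):
            h = lambda x: 1.0 if x > th else 0.0
            return sum(D[(x0, x1)] * (h(x1) - h(x0)) for x0, x1 in pairs)
        theta = max(thresholds, key=lambda th: abs(r_of(th)))
        r = r_of(theta)
        if abs(r) >= 1.0 or r == 0.0:
            break
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        h = lambda x, th=theta: 1.0 if x > th else 0.0
        # Distribution update with normalizer Z_t.
        newD = {(x0, x1): D[(x0, x1)] * math.exp(alpha * (h(x0) - h(x1)))
                for x0, x1 in pairs}
        Z = sum(newD.values())
        D = {p: v / Z for p, v in newD.items()}
        ensemble.append((alpha, theta))
    def H(x):  # final ranking H(x) = sum_t alpha_t h_t(x)
        return sum(a * (1.0 if x > th else 0.0) for a, th in ensemble)
    return H

# Preference pairs on a 1-D feature (second element preferred), with one
# contradictory pair so no single threshold is perfect.
pairs = [(0.1, 0.9), (0.2, 0.8), (0.7, 0.3)]
H = rankboost(pairs, thresholds=[0.25, 0.5, 0.75])
print(H(0.9) > H(0.1))  # True
```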
Ranking SVM
(R. Herbrich, et al. ICANN 1999)

Use SVM to perform pairwise classification:

  min_{w,ξ} (1/2)||w||² + C Σ_i ξ_i
  s.t. z_i ⟨w, x_i^(1) − x_i^(2)⟩ ≥ 1 − ξ_i, ξ_i ≥ 0

• Treat each difference vector x^(1) − x^(2) as an instance of learning.
• Use SVM to perform binary classification on these instances, to learn the model parameter w.
• Use f(x; ŵ) = ⟨ŵ, x⟩ for testing.
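A hedged sketch of the same objective as a pairwise hinge loss on difference vectors, optimized by subgradient descent in place of a QP solver. Features and pairs are illustrative:

```python
def rank_svm(pairs, dim, C=1.0, lr=0.01, epochs=200):
    """pairs: list of (x_pref, x_other) feature vectors, x_pref preferred."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)                      # gradient of (1/2)||w||^2
        for xp, xo in pairs:
            diff = [a - b for a, b in zip(xp, xo)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin < 1:                  # hinge loss is active
                grad = [g - C * d for g, d in zip(grad, diff)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

# Toy 2-D features; the first document of each pair should rank higher.
pairs = [([2.0, 1.0], [1.0, 0.5]), ([1.5, 2.0], [0.5, 1.0])]
w = rank_svm(pairs, dim=2)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
print(score([2.0, 1.0]) > score([1.0, 0.5]))  # True
```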
Other Work in this Category

• FRank: A Fidelity Loss for Information Retrieval (M. Tsai, T. Liu, et al. SIGIR 2007)
• Linear discriminant model for information retrieval (LDM) (J. Gao, H. Qi, et al. SIGIR 2005)
• A Regression Framework for Learning Ranking Functions using Relative Relevance Judgments (GBRank) (Z. Zheng, H. Zha, et al. SIGIR 2007)
• A General Boosting Method and its Application to Learning Ranking Functions for Web Search (QBRank) (Z. Zheng, H. Zha, et al. NIPS 2007)
• Magnitude-preserving Ranking Algorithms (MPRank) (C. Cortes, M. Mohri, et al. ICML 2007)
Discussions

• Progress made as compared to the pointwise approach
  – No longer assumes absolute relevance.
• However,
  – The unique properties of ranking in IR have not been fully modeled.
Problems with Pairwise Approach (1)

• The number of instance pairs varies from query to query, so queries with many documents dominate the pairwise loss.
Problems with Pairwise Approach (2)

• Negative effects of making errors at top positions are not captured. Example (d: definitely relevant, p: partially relevant, n: not relevant):
  – ranking 1: p d p n n n n — one wrong pair, at the top: worse.
  – ranking 2: d p n p n n n — one wrong pair, lower down: better.
Problems with Pairwise Approach (3)

• Different rating pairs correspond to different ranking criteria:
  – Perfect: uniqueness, domain knowledge, authority.
  – Excellent / Good: content match.
  – Bad: detrimental.
As a Result, …

• Users only see the ranked list of documents, not document pairs.
• It is not clear how the pairwise loss correlates with IR evaluation measures, which are defined at the query level.

[Figure: pairwise loss vs. (1 − NDCG@5) on a TREC dataset.]
Incremental Solution to Problems (1)(2)
IRSVM: Change the Loss of Ranking SVM
(Y. Cao, J. Xu, T. Liu, et al. SIGIR 2006)

• Query-level normalizer.
• Implicit position discount: larger weight for more critical pairs.
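The two modifications can be sketched on top of a pairwise hinge loss: normalize each query by its pair count, and weight pairs by importance (e.g. pairs involving top ratings). The weights and the one-feature scorer below are illustrative:

```python
def irsvm_loss(queries, score):
    """queries: list of per-query pair lists [(x_pref, x_other, weight), ...]."""
    total = 0.0
    for pairs in queries:
        q_loss = sum(w * max(0.0, 1.0 - (score(xp) - score(xo)))
                     for xp, xo, w in pairs)
        total += q_loss / len(pairs)    # query-level normalizer
    return total / len(queries)

score = lambda x: x[0]                  # trivial one-feature scorer
q1 = [([2.0], [0.0], 2.0),              # critical (top-rating) pair, weight 2
      ([1.0], [0.5], 1.0)]
q2 = [([0.5], [1.5], 1.0)]              # a misordered pair
print(irsvm_loss([q1, q2], score))
```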
Incremental Solution to Problem (3)
Multi-Hyperplane Ranker
(T. Qin, T. Liu, et al. SIGIR 2007)

• Suppose there are k-level ratings:
  – m = k(k−1)/2 different types of document pairs:
    • (rank 1, rank 2), (rank 1, rank 3), (rank 2, rank 3), …
• Divide-and-conquer approach:
  – Train m hyperplanes using Ranking SVM, with each hyperplane only focusing on one type of document pair.
  – Combine the ranking results given by these hyperplanes to compose the final ranked list, by means of rank aggregation.
More Fundamental Solution:
Listwise Approach to Ranking

• Directly operate on ranked lists
  – Directly optimize IR evaluation measures: AdaRank, SVM-MAP, SoftRank, RankGP, …
  – Define listwise loss functions: RankCosine, ListNet, ListMLE, …
Overview of Listwise Approach

• Instead of reducing ranking to regression or classification, perform learning directly on the document list.
  – Treats ranked lists as learning instances.
• Two major categories:
  – Directly optimize IR evaluation measures.
  – Define listwise loss functions.
Directly Optimize IR Evaluation Measures

• It is a natural idea to directly optimize what is used to evaluate the ranking results.
• However, it is non-trivial:
  – Evaluation measures such as NDCG are usually non-smooth and non-differentiable.
  – It is challenging to optimize such objective functions, since most optimization techniques were developed for the smooth and differentiable case.
Tackle the Challenges

• Use optimization technologies designed for smooth and differentiable problems:
  – Fit the optimization of the evaluation measure into the logic of an existing optimization technique (AdaRank).
  – Optimize a smooth and differentiable upper bound of the evaluation measure (SVM-MAP).
  – Soften (approximate) the evaluation measure so as to make it smooth, differentiable, and easy to optimize (SoftRank).
• Use optimization technologies designed for non-smooth and non-differentiable problems:
  – Genetic programming (RankGP).
AdaRank
(J. Xu, et al. SIGIR 2007)

• Use Boosting to perform the optimization.
• Embed the IR evaluation measure in the distribution update of Boosting.
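A hedged sketch of this boosting loop: maintain a distribution over queries and up-weight queries on which the current ranker scores poorly under the IR measure E (a stand-in returning values in [0, 1], e.g. NDCG or MAP). The α formula follows AdaRank's choice; the distribution update is simplified here to use only the newest weak ranker rather than the full ensemble, and the rankers/measure values are illustrative:

```python
import math

def adarank(queries, weak_rankers, E, T=5):
    """queries: query ids; weak_rankers: candidates; E(h, q) in [0, 1]."""
    P = [1.0 / len(queries)] * len(queries)   # distribution over queries
    ensemble = []                              # list of (alpha, ranker)
    for _ in range(T):
        # Pick the weak ranker maximizing the weighted measure.
        h = max(weak_rankers,
                key=lambda hr: sum(p * E(hr, q) for p, q in zip(P, queries)))
        num = sum(p * (1 + E(h, q)) for p, q in zip(P, queries))
        den = sum(p * (1 - E(h, q)) for p, q in zip(P, queries))
        alpha = 0.5 * math.log(num / max(den, 1e-12))
        ensemble.append((alpha, h))
        # Embed the measure in the update: hard queries gain weight.
        w = [math.exp(-E(h, q)) for q in queries]
        Z = sum(w)
        P = [wi / Z for wi in w]
    return ensemble

# Toy measure values per (ranker, query).
scores = {("h1", "q1"): 0.9, ("h1", "q2"): 0.3,
          ("h2", "q1"): 0.5, ("h2", "q2"): 0.6}
E = lambda h, q: scores[(h, q)]
ens = adarank(["q1", "q2"], ["h1", "h2"], E, T=3)
print(ens[0][1])  # h1 is picked first (highest average measure)
```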
Theoretical Analysis

• The training error is continuously reduced during the learning phase when a certain condition on the weak rankers holds. [Condition formula omitted in the slide.]

Practical Solution

• In practice, we cannot guarantee that this condition holds. [Remainder of the slide omitted.]
SVM-MAP
(Y. Yue, et al. SIGIR 2007)

• Use the SVM framework for optimization.
• Incorporate IR evaluation measures in the listwise constraints:

  min (1/2)||w||² + (C/N) Σ_i ξ_i
  s.t. ∀y' ≠ y: F(q, d, y) ≥ F(q, d, y') + (1 − E(y', y)) − ξ_i

• The sum of slacks Σ ξ_i upper bounds 1 − E(y, y').
Challenges in SVM-MAP

• For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in the front.
• An incorrect labeling would be any other ranking.
• There is an exponential number of rankings, and thus an exponential number of constraints!
Structural SVM Training

• STEP 1: Perform optimization on only the current working set of constraints.
• STEP 2: Use the model learned in STEP 1 to find the most violated constraint from the exponential set of constraints.
  – This step can also be very fast, since MAP is invariant to the order of documents within a relevance class.
• STEP 3: If the constraint returned in STEP 2 is more violated than the most violated constraint in the working set, add it to the working set.

Steps 1–3 are guaranteed to loop for at most a polynomial number of iterations [Tsochantaridis et al. 2005].
SoftRank
(M. Taylor, et al. LR4IR 2007)

• Add additive noise to the ranking result, and thus make NDCG "soft" enough for optimization.
• Define the probability of a document being ranked at a particular position.
From Scores to Rank Distribution

• Consider the score s_j of each document as a random variable (the deterministic score plus Gaussian noise).
• Compute the probability that doc i is ranked before doc j.
• From the pairwise probabilities, derive the expected rank of each document.
• Derive the full rank distribution p_j(r).
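The first two steps can be sketched in closed form: with Gaussian scores s_j ~ N(μ_j, σ²), the probability that doc i beats doc j is Φ((μ_i − μ_j)/(σ√2)), and the expected rank of doc j sums the probabilities of losing to each other document. The score means are illustrative:

```python
import math

def prob_i_before_j(mean_i, mean_j, sigma=1.0):
    """P(s_i > s_j) for independent Gaussian scores with equal variance."""
    z = (mean_i - mean_j) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

def expected_rank(means, j, sigma=1.0):
    """Expected (0-based) rank of doc j = sum of probabilities of losing."""
    return sum(prob_i_before_j(mi, means[j], sigma)
               for i, mi in enumerate(means) if i != j)

means = [2.0, 0.5, 0.0]
print(expected_rank(means, 0) < expected_rank(means, 2))  # True
```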
Soft NDCG

• From hard to soft: replace the deterministic position discount with its expectation under the rank distribution.
• Use neural networks for optimization.
RankGP
(J. Yeh, et al. LR4IR 2007)

• Define the ranking function as a tree.
• Use genetic programming to learn the optimal tree.
  – GP is widely used to optimize non-smooth, non-differentiable objectives.
  – Crossover, mutation, reproduction, tournament selection.
  – Fitness function = IR evaluation measure.
Other Work in this Category

• Learning to Rank with Nonsmooth Cost Functions (C.J.C. Burges, R. Ragno, et al. NIPS 2006)
• A combined component approach for finding collection-adapted ranking functions based on genetic programming (CCA) (H. Almeida, M. Goncalves, et al. SIGIR 2007)
Discussions

• Progress has been made as compared to the pointwise and pairwise approaches.
• Still many unsolved problems:
  – Is the upper bound in SVM-MAP tight enough?
  – Is SoftNDCG highly correlated with NDCG?
  – Is there any theoretical justification of the correlation between the implicit loss function used in LambdaRank and NDCG?
  – Multiple measures are not necessarily consistent with each other: which one should we optimize?
Define Listwise Loss Functions

• Indirectly optimize IR evaluation measures
  – Consider the properties of ranking for IR.
• Examples:
  – RankCosine: uses the cosine similarity between score vectors as the loss function.
  – ListNet: uses the KL divergence between permutation probability distributions as the loss function.
  – ListMLE: uses the negative log likelihood of the ground-truth permutation as the loss function.
• Theoretical analysis of listwise loss functions.
RankCosine
(T. Qin, T. Liu, et al. IP&M 2007)

• The loss function is defined via the cosine similarity between the predicted score vector and the ground-truth score vector.
• Properties:
  – Insensitive to the number of documents / document pairs per query.
  – Emphasizes correct ranking at the top, by setting high scores for top documents in the ground truth.
  – Bounded.
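A hedged sketch of a RankCosine-style loss: one minus the cosine similarity between predicted and ground-truth score vectors, scaled here to [0, 1] (the exact normalization in the paper may differ):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rankcosine_loss(pred, truth):
    """0 for a perfect match; bounded above since cosine is in [-1, 1]."""
    return 0.5 * (1.0 - cosine(pred, truth))

truth = [6.0, 4.0, 3.0]
print(rankcosine_loss(truth, truth))                 # 0.0 for a perfect match
print(rankcosine_loss([3.0, 0.0, 1.0], truth) > 0)   # True
```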
Optimization

• Additive model as the ranking function.
• Loss summed over queries.
• Gradient descent for optimization.
More Investigation on Loss Functions

• A simple example:
  – function f: f(A)=3, f(B)=0, f(C)=1, giving the ranking ACB
  – function h: h(A)=4, h(B)=6, h(C)=3, giving the ranking BAC
  – ground truth g: g(A)=6, g(B)=4, g(C)=3, giving the ranking ABC
• Question: which function is closer to the ground truth?
  – Based on pointwise similarity: sim(f,g) < sim(g,h).
  – Based on pairwise similarity: sim(f,g) = sim(g,h).
  – Based on cosine similarity between score vectors: sim(f,g) = 0.85 < sim(g,h) = 0.93, so h appears closer.
• However, according to position discount, f should be closer to g (f puts A first, as g does)!
Permutation Probability Distribution

• Question: how to represent a ranked list?
• Solution:
  – Ranked list ↔ permutation probability distribution.
  – A more informative representation of a ranked list: permutations and ranked lists have a 1-1 correspondence.
Luce Model: Defining Permutation Probability

• The probability of permutation π is defined as

  P_s(π) = Π_{j=1}^{n} φ(s_{π(j)}) / Σ_{k=j}^{n} φ(s_{π(k)})

• Example:

  P_f(ABC) = [φ(f(A)) / (φ(f(A)) + φ(f(B)) + φ(f(C)))]
           · [φ(f(B)) / (φ(f(B)) + φ(f(C)))]
           · [φ(f(C)) / φ(f(C))]

  – First factor: P(A ranked No.1).
  – Second factor: P(B ranked No.2 | A ranked No.1) = P(B ranked No.1) / (1 − P(A ranked No.1)).
  – Third factor: P(C ranked No.3 | A ranked No.1, B ranked No.2).
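The definition above can be sketched directly with φ = exp, using the slide's ground-truth scores; the most and least probable permutations come out as the next slide claims:

```python
import math

def perm_prob(scores, pi, phi=math.exp):
    """scores: dict doc -> score; pi: permutation as a list of docs."""
    p = 1.0
    vals = [phi(scores[d]) for d in pi]
    for j in range(len(vals)):
        p *= vals[j] / sum(vals[j:])  # doc at position j wins among the rest
    return p

g = {"A": 6.0, "B": 4.0, "C": 3.0}
probs = {perm: perm_prob(g, list(perm))
         for perm in ("ABC", "ACB", "BAC", "BCA", "CAB", "CBA")}
print(max(probs, key=probs.get))  # ABC
print(min(probs, key=probs.get))  # CBA
```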
Properties of Luce Model

• Nice properties:
  – Continuous, differentiable, and convex w.r.t. the score s.
  – Suppose g(A)=6, g(B)=4, g(C)=3; then P(ABC) is largest and P(CBA) is smallest. For the ranking ABC, swapping any two documents decreases the probability, e.g. P(ABC) > P(ACB).
  – Translation invariant, when φ(s) = exp(s).
  – Scale invariant, when φ(s) = a·s.
Distance between Ranked Lists

• Use KL divergence (with φ = exp) to measure the difference between the permutation probability distributions induced by the score vectors:
  – dis(f, g) = 0.46
  – dis(g, h) = 2.56
• Now f is correctly judged closer to g.
ListNet Algorithm
(Z. Cao, T. Qin, T. Liu, et al. ICML 2007)

• Loss function = KL divergence between the two permutation probability distributions (φ = exp). With a linear scoring function s = w·x and top-k subgroups g = (j_1, …, j_k) ∈ G_k:

  L(w) = − Σ_{q∈Q} Σ_{g∈G_k} P_y(g) · log Π_{t=1}^{k} [ exp(w·x_{j_t}) / Σ_{u=t}^{n} exp(w·x_{j_u}) ]

• Model = neural network.
• Algorithm = gradient descent.
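The k = 1 case of the loss above reduces to cross entropy between two softmax distributions over scores, which can be sketched in a few lines; the score vectors below are illustrative:

```python
import math

def top1_dist(scores):
    """Top-1 permutation probabilities with phi = exp (a softmax)."""
    exps = [math.exp(s) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

def listnet_top1_loss(pred_scores, truth_scores):
    """Cross entropy between ground-truth and modeled top-1 distributions."""
    p_truth = top1_dist(truth_scores)
    p_pred = top1_dist(pred_scores)
    return -sum(pt * math.log(pp) for pt, pp in zip(p_truth, p_pred))

truth = [6.0, 4.0, 3.0]
good = [3.0, 2.0, 1.0]   # same ordering as the ground truth
bad = [1.0, 2.0, 3.0]    # reversed ordering
print(listnet_top1_loss(good, truth) < listnet_top1_loss(bad, truth))  # True
```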
Experimental Results

[Figure: training performance on a TREC dataset. The listwise loss (ListNet) is more closely correlated with the IR measure than the pairwise loss (RankNet).]
Limitations of ListNet

• Too complex in practice.
• The performance of ListNet depends on the mapping function from ground-truth labels to scores:
  – Ground truths are permutations (ranked lists) or groups of permutations.
  – Ground-truth labels are not naturally scores.
Solution

• Never assume the ground truth to be scores.
• Given the ranking model, compute the likelihood of a ground-truth permutation, and maximize it to learn the model parameters.
Likelihood Loss

• Top-k likelihood.
• Regularized likelihood loss.
ListMLE Algorithm
(F. Xia, T. Liu, et al. ICML 2008)

• Loss function = likelihood loss.
• Model = neural network.
• Algorithm = stochastic gradient descent.

[Figure: results on the TD2004 dataset.]
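The likelihood loss can be sketched as the negative log likelihood of the ground-truth permutation under the Luce model (φ = exp) of the predicted scores; only the ground-truth order is needed, no label-to-score mapping. Scores are illustrative:

```python
import math

def listmle_loss(pred_scores, truth_order):
    """truth_order: document indices from best to worst."""
    s = [pred_scores[i] for i in truth_order]
    loss = 0.0
    for j in range(len(s)):
        # -log P(this doc ranked here | earlier docs already placed)
        loss += math.log(sum(math.exp(v) for v in s[j:])) - s[j]
    return loss

truth_order = [0, 1, 2]
good = [3.0, 2.0, 1.0]   # scores agreeing with the ground-truth order
bad = [1.0, 2.0, 3.0]
print(listmle_loss(good, truth_order) < listmle_loss(bad, truth_order))  # True
```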
Summary

• Ranking is a new application, and also a new machine learning problem.
• Learning to rank: from the pointwise and pairwise approaches to the listwise approach.
• Many open questions.
[Timeline, 2000–2007, of learning-to-rank methods by category (* = papers published by our group). Pointwise: Discriminative IR model, McRank. Pairwise: Ranking SVM, RankBoost, LDM, RankNet, IRSVM, MHR, FRank, MPRank, QBRank, GBRank, Rank Refinement. Listwise: LambdaRank, RankCosine, AdaRank, SVM-MAP, SoftRank, RankGP, CCA, ListNet, ListMLE, SRA, Relational Ranking.]
References<br />
• R. Herbrich, T. Graepel, et al. Support Vector Learning for Ordinal Regression, ICANN 1999.<br />
• T. Joachims, Optimizing Search Engines Using Clickthrough Data, KDD 2002.<br />
• Y. Freund, R. Iyer, et al. An Efficient Boosting Algorithm for Combining Preferences, JMLR 2003.<br />
• R. Nallapati, Discriminative Models for Information Retrieval, SIGIR 2004.<br />
• J. Gao, H. Qi, et al. Linear Discriminant Model for Information Retrieval, SIGIR 2005.<br />
• C.J.C. Burges, T. Shaked, et al. Learning to Rank Using Gradient Descent, ICML 2005.<br />
• I. Tsochantaridis, T. Joachims, et al. Large Margin Methods for Structured and Interdependent<br />
Output Variables, JMLR 2005.<br />
• F. Radlinski, T. Joachims, Query Chains: Learning to Rank from Implicit Feedback, KDD 2005.<br />
• I. Matveeva, C.J.C. Burges, et al. High Accuracy Retrieval with Multiple Nested Ranker, SIGIR 2006.<br />
• C.J.C. Burges, R. Ragno, et al. Learning to Rank with Nonsmooth Cost Functions, NIPS 2006.<br />
• Y. Cao, J. Xu, et al. Adapting Ranking SVM to Information Retrieval, SIGIR 2006.<br />
• Z. Cao, T. Qin, et al. Learning to Rank: From Pairwise to Listwise Approach, ICML 2007.<br />
References (2)<br />
• T. Qin, T.-Y. Liu, et al. Multiple Hyperplane Ranker, SIGIR 2007.<br />
• T.-Y. Liu, J. Xu, et al. LETOR: Benchmark Dataset for Research on Learning to Rank for Information<br />
Retrieval, LR4IR 2007.<br />
• M.-F. Tsai, T.-Y. Liu, et al. FRank: A Ranking Method with Fidelity Loss, SIGIR 2007.<br />
• Z. Zheng, H. Zha, et al. A General Boosting Method and its Application to Learning Ranking<br />
Functions for Web Search, NIPS 2007.<br />
• Z. Zheng, H. Zha, et al. A Regression Framework for Learning Ranking Functions Using Relative<br />
Relevance Judgments, SIGIR 2007.<br />
• M. Taylor, J. Guiver, et al. SoftRank: Optimising Non-Smooth Rank Metrics, LR4IR 2007.<br />
• J.-Y. Yeh, J.-Y. Lin, et al. Learning to Rank for Information Retrieval Using Genetic Programming,<br />
LR4IR 2007.<br />
• T. Qin, T.-Y. Liu, et al. Query-level Loss Functions for Information Retrieval, Information Processing<br />
and Management, 2007.<br />
• X. Geng, T.-Y. Liu, et al. Feature Selection for Ranking, SIGIR 2007.<br />
• Y. Yue, T. Finley, et al. A Support Vector Method for Optimizing Average Precision, SIGIR 2007.<br />
References (3)<br />
• Y.-T. Liu, T.-Y. Liu, et al. Supervised Rank Aggregation, WWW 2007.<br />
• J. Xu and H. Li, AdaRank: A Boosting Algorithm for Information Retrieval, SIGIR 2007.<br />
• F. Radlinski, T. Joachims, Active Exploration for Learning Rankings from Clickthrough Data, KDD 2007.<br />
• P. Li, C. Burges, et al. McRank: Learning to Rank Using Multiple Classification and Gradient Boosting, NIPS 2007.<br />
• C. Cortes, M. Mohri, et al. Magnitude-Preserving Ranking Algorithms, ICML 2007.<br />
• H. Almeida, M. Goncalves, et al. A Combined Component Approach for Finding Collection-Adapted Ranking<br />
Functions Based on Genetic Programming, SIGIR 2007.<br />
• T. Qin, T.-Y. Liu, et al. Learning to Rank Relational Objects and Its Application to Web Search, WWW 2008.<br />
• R. Jin, H. Valizadegan, et al. Ranking Refinement and Its Application to Information Retrieval, WWW<br />
2008.<br />
• F. Xia, T.-Y. Liu, et al. Listwise Approach to Learning to Rank – Theory and Algorithm, ICML 2008.<br />
• Y. Lan, T.-Y. Liu, et al. On Generalization Ability of Learning to Rank Algorithms, ICML 2008.<br />
Advertisement<br />
• SIGIR 2008 workshop on learning to rank<br />
– http://research.microsoft.com/users/LR4IR-2008/<br />
– Submission due: May 16.<br />
• SIGIR 2008 tutorial on learning to rank<br />
– Includes more new algorithms<br />
– Includes more theoretical discussion<br />
Advertisement (2)<br />
• LETOR: Benchmark Dataset for Research on Learning to Rank<br />
– http://research.microsoft.com/users/LETOR/<br />
– T.-Y. Liu, J. Xu, et al. LR4IR 2007<br />
tyliu@microsoft.com<br />
http://research.microsoft.com/users/tyliu/<br />