
Learning to Rank for Information Retrieval

Tie-Yan Liu
Lead Researcher
Microsoft Research Asia

4/23/2008 Tie-Yan Liu @ Renmin University


Outline

• Ranking in Information Retrieval
• Learning to Rank
  – Introduction to Learning to Rank
  – Pointwise approach
  – Pairwise approach
  – Listwise approach
• Summary


Introduction to Ranking in IR


Inside Search Engine

[Diagram: search engine architecture]
• Online part: User Interface, Caching, Relevance Ranking (serving a Query against the Inverted Index and Cached Pages)
• Offline part: Crawler, Web Page Parser (Pages, Links & Anchors), Index Builder (Inverted Index), Web Graph Builder (Link Map, Web Graph), Link Analysis (Page Ranks, Page & Site Statistics)


Ranking in Information Retrieval

[Diagram: the ranking process]
• Document index: D = {d_i}
• Query: q = {www2008}
• The ranking model produces a ranking of documents d_{q,1}, d_{q,2}, …, d_{q,k},
  e.g., www2008.org, www2008.org/submissions/index.html, …, www.iw3c2.org/blog/welcome/www2008


Widely-used Judgments

• Binary judgment
  – Relevant vs. irrelevant
• Multi-level ratings
  – Perfect > Excellent > Good > Fair > Bad
• Pairwise preferences
  – Document A is more relevant than document B with respect to query q


Evaluation Measures

• MRR (Mean Reciprocal Rank)
  – For query q_i, let r_i be the rank position of the first relevant document
  – MRR: average of 1/r_i over all queries
• WTA (Winners Take All)
  – If the top-ranked document is relevant: 1; otherwise: 0
  – Averaged over all queries
• MAP (Mean Average Precision)
• NDCG (Normalized Discounted Cumulative Gain)


MAP

• Precision at position n:
  P@n = #{relevant documents in top n results} / n
• Average precision:
  AP = Σ_n ( P@n · I{document at position n is relevant} ) / #{relevant documents}
• Example (relevant documents at positions 1, 3, 5):
  AP = (1/3) · (1/1 + 2/3 + 3/5) ≈ 0.76
• MAP: AP averaged over all queries in the test set
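The AP computation can be sketched in a few lines of Python (a minimal illustration; the function names are my own):

```python
def average_precision(rels):
    """AP for one query; rels[i] is 1 if the document at rank i+1 is relevant."""
    hits, precisions = 0, []
    for n, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / n)  # P@n at each relevant position
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(queries):
    """MAP: AP averaged over all queries in the test set."""
    return sum(average_precision(r) for r in queries) / len(queries)

# The slide's example: relevant documents at positions 1, 3, 5
print(round(average_precision([1, 0, 1, 0, 1]), 2))  # → 0.76
```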


NDCG

• NDCG at position n:
  N(n) = Z_n · Σ_{j=1}^{n} (2^{r(j)} − 1) / log(1 + j)
  where r(j) is the rating of the j-th ranked document and Z_n normalizes so that the perfect ranking gets N(n) = 1
• NDCG is averaged over all queries in the test set
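The NDCG formula can be sketched as follows (assuming a natural-log discount, with the gain 2^r − 1 from the slide; names are illustrative):

```python
import math

def dcg(ratings, n):
    """DCG@n with gain (2^r - 1) and discount 1/log(1 + j)."""
    return sum((2 ** r - 1) / math.log(1 + j)
               for j, r in enumerate(ratings[:n], start=1))

def ndcg(ratings, n):
    """NDCG@n: Z_n is the reciprocal of the DCG of the perfect ranking."""
    ideal = dcg(sorted(ratings, reverse=True), n)
    return dcg(ratings, n) / ideal if ideal > 0 else 0.0

# A perfectly ordered list scores exactly 1.0
print(ndcg([3, 3, 2, 1, 0], 5))  # → 1.0
```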


More about Evaluation Measures

• Query level
  – Averaged over all queries.
  – The measure is bounded, and every query contributes equally to the average.
• Position
  – A position discount is used in most measures: errors at top positions cost more.
• Rating
  – Different ratings may correspond to different gains.


Learning to Rank for IR

Overview
Pointwise Approach
Pairwise Approach
Listwise Approach


Conventional IR Models

Category                   Example models
Similarity-based models    Boolean model, Vector space model, Latent Semantic Indexing, …
Probabilistic models       BM25 model, Language model for IR, …
Hyperlink-based models     HITS, PageRank, …


Problems with Conventional Ranking Models

• For a particular model
  – Parameter tuning is difficult, especially when there are many parameters to tune.
• For comparison between two models
  – Given a test set, when one model is over-tuned (overfitting) while the other is not, how can we compare them?
• For a collection of models
  – Hundreds of models have been proposed in the literature; how to combine them is non-trivial.


Machine Learning Can Help

• Machine learning is an effective tool
  – To automatically tune parameters
  – To combine multiple sources of evidence
  – To avoid overfitting (regularization framework, structural risk minimization, etc.)


Framework of Learning to Rank for IR

[Diagram: learning-to-rank framework]
• Training data: queries q^(1), …, q^(m), each with documents d_1^(i), d_2^(i), …, d_n^(i) and labels (multi-level ratings, for example), e.g., (d_1^(1), 4), (d_2^(1), 2), …, (d_n^(1), 1) and (d_1^(m), 5), (d_2^(m), 3), …, (d_n^(m), 2)
• Learning system: learns a ranking model f(q, d, w) from the training data
• Test data: a new query q with unlabeled documents (d_1, ?), (d_2, ?), …, (d_n, ?)
• Ranking system: sorts the test documents by their scores f(q, d_i1, w), f(q, d_i2, w), …, f(q, d_in, w)
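The ranking-system side of the framework can be sketched with a linear scoring model (an assumption for illustration only; the slides do not fix a model class, and all names here are hypothetical):

```python
# f(q, d, w) = w · x(q, d), where x(q, d) is a query-document feature vector.
def score(w, features):
    return sum(wi * xi for wi, xi in zip(w, features))

def rank(w, docs):
    """docs: list of (doc_id, feature_vector); returns ids by descending score."""
    return [d for d, _ in sorted(docs, key=lambda p: -score(w, p[1]))]

w = [0.7, 0.3]  # learned by the learning system
docs = [("d1", [0.2, 0.1]), ("d2", [0.9, 0.5]), ("d3", [0.4, 0.4])]
print(rank(w, docs))  # → ['d2', 'd3', 'd1']
```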


Connection with Conventional Machine Learning

• With certain assumptions, learning to rank can be transformed (reduced) to conventional machine learning problems
  – Regression (assume labels to be scores)
  – Classification (assume labels to have no order)
  – Pairwise classification (assume document pairs to be independent of each other)


Difference from Conventional Machine Learning

• Unique properties of ranking in IR
  – Role of the query
    • The query determines the logical structure of the ranking problem.
    • The loss function should be defined on ranked lists w.r.t. a query, since IR measures are computed at the query level.
  – Relative order is important
    • No need to predict a category, or the value of f(x)
  – Position
    • Top-ranked objects are more important


Categorization of Learning to Rank Methods

• Pointwise approach (reduced to classification or regression)
  – Discriminative model for IR (SIGIR 2004), McRank (NIPS 2007), …
• Pairwise approach
  – Reduced to classification or regression: Ranking SVM (ICANN 1999), RankBoost (JMLR 2003), LDM (SIGIR 2005), RankNet (ICML 2005), FRank (SIGIR 2007), GBRank (SIGIR 2007), QBRank (NIPS 2007), MPRank (ICML 2007), …
  – Treated as a new problem: IRSVM (SIGIR 2006), MHR (SIGIR 2007), …
• Listwise approach (treated as a new problem)
  – AdaRank (SIGIR 2007), SVM-MAP (SIGIR 2007), SoftRank (LR4IR 2007), GPRank (LR4IR 2007), CCA (SIGIR 2007), LambdaRank (NIPS 2006), RankCosine (IP&M 2007), ListNet (ICML 2007), ListMLE (ICML 2008), …


Learning to Rank for IR

Overview
Pointwise Approach
Pairwise Approach
Listwise Approach


Overview of Pointwise Approach

• Reduce ranking to regression or classification on single documents.
• Examples:
  – Discriminative model for IR
  – McRank: learning to rank using multi-class classification and gradient boosting


Discriminative Model for IR
(R. Nallapati, SIGIR 2004)

• Features are extracted for each document
• Treat relevant documents as positive examples, irrelevant documents as negative examples
• Use SVM or MaxEnt to perform discriminative learning


McRank
(P. Li, et al. NIPS 2007)

• The loss in terms of DCG is upper bounded (although loosely) by the multi-class classification error
• Use gradient boosting to perform multi-class classification
  – K trees per boosting iteration
  – Tree outputs converted to probabilities using the logistic function
  – Convert classification to ranking: score each document from its class probabilities (e.g., by the expected rating)


Discussions

• Assumption
  – Relevance is absolute and query-independent.
• However, in practice, relevance is query-dependent
  – Relevance labels are assigned by comparing the documents associated with the same query.
  – An irrelevant document for a popular query might have higher term frequency than a relevant document for a rare query.


Learning to Rank for IR

Overview
Pointwise Approach
Pairwise Approach
Listwise Approach


Overview of Pairwise Approach

• No longer assume absolute relevance.
• Reduce ranking to classification on document pairs (i.e., pairwise classification) w.r.t. the same query.
• Examples
  – RankNet, FRank, RankBoost, Ranking SVM, etc.

Transform: for query q^(i) with labeled documents (d_1^(i), 5), (d_2^(i), 3), …, (d_n^(i), 2), the label comparisons 5 > 3, 5 > 2, 3 > 2 yield preference pairs (d_1^(i), d_2^(i)), (d_1^(i), d_n^(i)), …, (d_2^(i), d_n^(i)).
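The transformation from rated document lists to preference pairs can be sketched as follows (a minimal illustration; names are my own):

```python
from itertools import combinations

def to_pairs(labeled_docs):
    """Turn a (doc, rating) list for one query into preference pairs (preferred, other)."""
    pairs = []
    for (da, ra), (db, rb) in combinations(labeled_docs, 2):
        if ra > rb:
            pairs.append((da, db))
        elif rb > ra:
            pairs.append((db, da))  # equal ratings produce no pair
    return pairs

print(to_pairs([("d1", 5), ("d2", 3), ("d3", 2)]))
# → [('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3')]
```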


RankNet
(C. Burges, et al. ICML 2005)

• New loss function: cross entropy (CE loss)
  – Target posteriors: the desired probability P̄_ij that document i ranks higher than document j
  – Modeled posteriors: P_ij
  – Define o_ij = f(x_i) − f(x_j)
• Model output probabilities using the logistic function:
  P_ij = exp(o_ij) / (1 + exp(o_ij))


RankNet (2)

• Use a neural network as the model, and gradient descent as the algorithm, to optimize the cross-entropy loss
• Evaluation is on single documents: output a relevance score for each document w.r.t. a new query
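The cross-entropy loss over one scored pair can be sketched as follows (a minimal sketch of RankNet's pair loss under its logistic model; the function name is illustrative):

```python
import math

def ranknet_loss(score_i, score_j, target_p):
    """CE loss: -P̄_ij log P_ij - (1 - P̄_ij) log(1 - P_ij), with P_ij = sigmoid(o_ij)."""
    o_ij = score_i - score_j
    p_ij = 1.0 / (1.0 + math.exp(-o_ij))
    return -(target_p * math.log(p_ij) + (1 - target_p) * math.log(1 - p_ij))

# When i is truly preferred (target 1), a larger score margin lowers the loss:
print(ranknet_loss(2.0, 0.0, 1.0) < ranknet_loss(0.5, 0.0, 1.0))  # → True
```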


RankBoost
(Y. Freund, et al. JMLR 2003)

Use AdaBoost to perform pairwise classification:

• Given: initial distribution D_1 over document pairs (x_0, x_1)
• For t = 1, …, T:
  – Train a weak learner using distribution D_t
  – Get a weak ranking h_t : X → R
  – Choose α_t ∈ R
  – Update:
    D_{t+1}(x_0, x_1) = D_t(x_0, x_1) exp(α_t (h_t(x_0) − h_t(x_1))) / Z_t
    where Z_t = Σ_{x_0, x_1} D_t(x_0, x_1) exp(α_t (h_t(x_0) − h_t(x_1)))
• Output the final ranking: H(x) = Σ_{t=1}^{T} α_t h_t(x)
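One distribution-update step can be sketched as follows (assuming the convention that each pair (x0, x1) means x1 should rank above x0; names are illustrative):

```python
import math

def rankboost_update(D, h, alpha):
    """D_{t+1}(x0, x1) = D_t(x0, x1) exp(alpha (h(x0) - h(x1))) / Z_t."""
    new_D = {pair: w * math.exp(alpha * (h[pair[0]] - h[pair[1]]))
             for pair, w in D.items()}
    Z = sum(new_D.values())  # normalizer Z_t keeps D a distribution
    return {pair: w / Z for pair, w in new_D.items()}

D = {("a", "b"): 0.5, ("c", "d"): 0.5}
h = {"a": 0.9, "b": 0.1, "c": 0.2, "d": 0.8}  # h misranks (a, b), ranks (c, d) correctly
D2 = rankboost_update(D, h, alpha=1.0)
print(D2[("a", "b")] > D2[("c", "d")])  # misranked pair gains weight → True
```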


Ranking SVM
(R. Herbrich, et al. ICANN 1999)

Use SVM to perform pairwise classification:

  min_{w,ξ}  (1/2)||w||² + C Σ_i ξ_i
  s.t.  z_i ⟨w, x_i^(1) − x_i^(2)⟩ ≥ 1 − ξ_i,  ξ_i ≥ 0

• Use x^(1) − x^(2) as an instance of learning
• Use SVM to perform binary classification on these instances, to learn the model parameter w
• Use f(x; ŵ) = ⟨ŵ, x⟩ for testing
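The pairwise reduction can be sketched in pure Python; this uses a crude subgradient-descent stand-in for a real SVM solver, on hand-made toy pairs, so it is illustrative only:

```python
def train_rank_svm(pairs, dim, C=1.0, lr=0.1, epochs=200):
    """pairs: list of (x1, x2) with x1 preferred; fits w on difference vectors x1 - x2."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x1, x2 in pairs:
            diff = [a - b for a, b in zip(x1, x2)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # subgradient of (1/2)||w||^2 + C * hinge(1 - <w, diff>)
            grad = [wi / (C * len(pairs)) for wi in w]
            if margin < 1:
                grad = [g - d for g, d in zip(grad, diff)]
            w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

pairs = [([1.0, 0.0], [0.0, 1.0]),   # docs strong in feature 1 beat docs strong in feature 2
         ([0.8, 0.1], [0.2, 0.9])]
w = train_rank_svm(pairs, dim=2)
print(w[0] > w[1])  # the learned w prefers feature 1 → True
```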


Other Work in this Category

• FRank: A Fidelity Loss for Information Retrieval (M. Tsai, T. Liu, et al. SIGIR 2007)
• Linear discriminant model for information retrieval (LDM) (J. Gao, H. Qi, et al. SIGIR 2005)
• A Regression Framework for Learning Ranking Functions using Relative Relevance Judgments (GBRank) (Z. Zheng, H. Zha, et al. SIGIR 2007)
• A General Boosting Method and its Application to Learning Ranking Functions for Web Search (QBRank) (Z. Zheng, H. Zha, et al. NIPS 2007)
• Magnitude-preserving Ranking Algorithms (MPRank) (C. Cortes, M. Mohri, et al. ICML 2007)


Discussions

• Progress made as compared to the pointwise approach
  – No longer assume absolute relevance
• However,
  – Unique properties of ranking in IR have not been fully modeled.


Problems with Pairwise Approach (1)

• The number of instance pairs varies greatly from query to query, so queries with many documents dominate the pairwise loss.


Problems with Pairwise Approach (2)

• Negative effects of making errors at top positions
  (d: definitely relevant, p: partially relevant, n: not relevant)
  – ranking 1: p d p n n n n (one wrong pair, at the very top: worse)
  – ranking 2: d p n p n n n (one wrong pair, at a lower position: better)


Problems with Pairwise Approach (3)

• Different rating pairs correspond to different ranking criteria
  – Perfect: uniqueness, domain knowledge, authority
  – Excellent / Good: content match
  – Bad: detrimental


As a Result, …

• Users only see the ranked list of documents, not document pairs.
• It is not clear how the pairwise loss correlates with IR evaluation measures, which are defined at the query level.

[Plot: pairwise loss vs. (1 − NDCG@5) on a TREC dataset]


Incremental Solution to Problems (1)(2)
IRSVM: Change the Loss of Ranking SVM
(Y. Cao, J. Xu, T. Liu, et al. SIGIR 2006)

• Query-level normalizer
• Implicit position discount: larger weight for more critical pairs


Incremental Solution to Problem (3)
Multi-Hyperplane Ranker
(T. Qin, T. Liu, et al. SIGIR 2007)

• Suppose there are k-level ratings
  – m = k(k−1)/2 different types of document pairs:
    (rank 1, rank 2), (rank 1, rank 3), (rank 2, rank 3), …
• Divide-and-conquer approach
  – Train m hyperplanes using Ranking SVM, with each hyperplane focusing on only one type of document pair.
  – Combine the ranking results given by these hyperplanes into the final ranked list, by means of rank aggregation.


More Fundamental Solution
Listwise Approach to Ranking

• Directly operate on ranked lists
  – Directly optimize IR evaluation measures: AdaRank, SVM-MAP, SoftRank, RankGP, …
  – Define listwise loss functions: RankCosine, ListNet, ListMLE, …


Learning to Rank for IR

Overview
Pointwise Approach
Pairwise Approach
Listwise Approach


Overview of Listwise Approach

• Instead of reducing ranking to regression or classification, perform learning directly on the document list.
  – Treats ranked lists as learning instances.
• Two major categories
  – Directly optimize IR evaluation measures
  – Define listwise loss functions


Directly Optimize IR Evaluation Measures

• It is a natural idea to directly optimize what is used to evaluate the ranking results.
• However, it is non-trivial.
  – Evaluation measures such as NDCG are usually non-smooth and non-differentiable.
  – It is challenging to optimize such objective functions, since most optimization techniques were developed for the smooth and differentiable case.


Tackle the Challenges

• Use optimization technologies designed for smooth and differentiable problems
  – Fit the optimization of the evaluation measure into the logic of an existing optimization technique (AdaRank)
  – Optimize a smooth and differentiable upper bound of the evaluation measure (SVM-MAP)
  – Soften (approximate) the evaluation measure so as to make it smooth, differentiable, and easy to optimize (SoftRank)
• Use optimization technologies designed for non-smooth and non-differentiable problems
  – Genetic programming (RankGP)


AdaRank
(J. Xu, et al. SIGIR 2007)

• Use Boosting to perform the optimization.
• Embed the IR evaluation measure in the distribution update of Boosting.


Theoretical Analysis

• Training error will be continuously reduced during the learning phase when …


Practical Solution

• In practice, we cannot guarantee …


SVM-MAP
(Y. Yue, et al. SIGIR 2007)

• Use the SVM framework for optimization.
• Incorporate IR evaluation measures in the listwise constraints.
• The sum of slacks ∑ξ upper bounds 1 − E(y, y').

  min_w  (1/2)||w||² + (C/N) Σ_i ξ_i
  s.t.  ∀y' ≠ y:  F(q, d, y) ≥ F(q, d, y') + (1 − E(y', y)) − ξ_i


Challenges in SVM-MAP

• For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in front.
• An incorrect labeling would be any other ranking.
• There is an exponential number of rankings, and thus an exponential number of constraints!


Structural SVM Training

• STEP 1: Perform optimization on only the current working set of constraints.
• STEP 2: Use the model learned in STEP 1 to find the most violated constraint from the exponential set of constraints.
  – This step can also be very fast, since MAP is invariant to the order of documents within a relevance class.
• STEP 3: If the constraint returned in STEP 2 is more violated than the most violated constraint in the working set, add it to the working set.

STEPS 1-3 are guaranteed to loop for at most a polynomial number of iterations [Tsochantaridis et al. 2005].


SoftRank
(M. Taylor, et al. LR4IR 2007)

• Add additive noise to the ranking scores, and thus make NDCG "soft" enough for optimization.
• Define the probability of a document being ranked at a particular position.


From Scores to Rank Distribution

• Consider the score s_j of each document as a random variable
• Probability that doc i is ranked before doc j
• Expected rank
• Rank distribution p_j(r)
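Assuming Gaussian score distributions with a shared variance (the smoothing device described on this slide), the pairwise probabilities and expected ranks can be sketched as follows (names are illustrative):

```python
import math

def pairwise_prob(mu_i, mu_j, sigma):
    """P(s_i > s_j) when s_i, s_j ~ N(mu, sigma^2): the difference has variance 2 sigma^2."""
    z = (mu_i - mu_j) / math.sqrt(2 * sigma ** 2)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF at z

def expected_rank(mus, sigma, j):
    """Expected (0-based) rank of document j: sum of probabilities of being beaten."""
    return sum(pairwise_prob(mus[i], mus[j], sigma)
               for i in range(len(mus)) if i != j)

mus = [2.0, 1.0, 0.0]
# The document with the highest mean score has the lowest expected rank:
print(expected_rank(mus, 0.5, 0) < expected_rank(mus, 0.5, 2))  # → True
```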


Soft NDCG

• From hard to soft: replace the deterministic ranks in NDCG with the rank distribution
• Neural networks for optimization


RankGP
(J. Yeh, et al. LR4IR 2007)

• Define the ranking function as a tree
• Use genetic programming to learn the optimal tree
  – GP is widely used to optimize non-smooth, non-differentiable objectives.
  – Crossover, mutation, reproduction, tournament selection.
  – Fitness function = IR evaluation measure.


Other Work in this Category

• Learning to Rank with Nonsmooth Cost Functions (C.J.C. Burges, R. Ragno, et al. NIPS 2006)
• A combined component approach for finding collection-adapted ranking functions based on genetic programming (CCA) (H. Almeida, M. Goncalves, et al. SIGIR 2007)


Discussions

• Progress has been made as compared to the pointwise and pairwise approaches
• Still many unsolved problems
  – Is the upper bound in SVM-MAP tight enough?
  – Is SoftNDCG highly correlated with NDCG?
  – Is there any theoretical justification of the correlation between the implicit loss function used in LambdaRank and NDCG?
  – Multiple measures are not necessarily consistent with each other: which one should be optimized?


Define Listwise Loss Functions

• Indirectly optimize IR evaluation measures
  – Consider properties of ranking for IR.
• Examples
  – RankCosine: use the cosine similarity between score vectors as the loss function
  – ListNet: use the KL divergence between permutation probability distributions as the loss function
  – ListMLE: use the negative log-likelihood of the ground-truth permutation as the loss function
• Theoretical analysis on listwise loss functions


RankCosine
(T. Qin, T. Liu, et al. IP&M 2007)

• Loss function defined via the cosine similarity between the predicted score vector and the ground-truth score vector
• Properties
  – Insensitive to the number of documents / document pairs per query
  – Emphasizes correct ranking at the top, by setting high scores for top positions in the ground truth
  – Bounded


Optimization

• Additive model as the ranking function
• Loss summed over queries
• Gradient descent for optimization


More Investigation on Loss Functions

• A simple example:
  – function f: f(A)=3, f(B)=0, f(C)=1  →  ranking ACB
  – function h: h(A)=4, h(B)=6, h(C)=3  →  ranking BAC
  – ground truth g: g(A)=6, g(B)=4, g(C)=3  →  ranking ABC
• Question: which function is closer to the ground truth?
  – Based on pointwise similarity: sim(f,g) < sim(g,h)
  – Based on pairwise similarity: sim(f,g) = sim(g,h)
  – Based on cosine similarity between score vectors: sim(f,g) = 0.85, sim(g,h) = 0.93, so h looks closer
• However, according to the position discount, f should be closer to g (f gets the top position right)!


Permutation Probability Distribution

• Question:
  – How to represent a ranked list?
• Solution
  – Ranked list ↔ permutation probability distribution
  – A more informative representation of the ranked list: permutations and ranked lists are in 1-1 correspondence.


Luce Model: Defining Permutation Probability

• The probability of a permutation π under scores s is defined as
  P_s(π) = Π_{j=1}^{n} φ(s_{π(j)}) / Σ_{k=j}^{n} φ(s_{π(k)})
• Example:
  P_f(ABC) = [φ(f(A)) / (φ(f(A)) + φ(f(B)) + φ(f(C)))] · [φ(f(B)) / (φ(f(B)) + φ(f(C)))] · [φ(f(C)) / φ(f(C))]
  – First factor: P(A ranked No.1)
  – Second factor: P(B ranked No.2 | A ranked No.1) = P(B ranked No.1) / (1 − P(A ranked No.1))
  – Third factor: P(C ranked No.3 | A ranked No.1, B ranked No.2) = 1
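The Luce-model permutation probability can be computed directly (a short sketch with φ = exp; names are my own):

```python
import math

def permutation_prob(scores, perm, phi=math.exp):
    """P_s(pi) = prod_j phi(s_pi(j)) / sum_{k >= j} phi(s_pi(k))."""
    p = 1.0
    for j in range(len(perm)):
        denom = sum(phi(scores[k]) for k in perm[j:])
        p *= phi(scores[perm[j]]) / denom
    return p

# Ground truth from the slides: g(A)=6, g(B)=4, g(C)=3
s = [6, 4, 3]
# The permutation that sorts by score (ABC) is more probable than its reverse:
print(permutation_prob(s, [0, 1, 2]) > permutation_prob(s, [2, 1, 0]))  # → True
```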


Properties of the Luce Model

• Nice properties
  – Continuous, differentiable, and convex w.r.t. the score s.
  – Suppose g(A) = 6, g(B) = 4, g(C) = 3: then P(ABC) is the largest and P(CBA) is the smallest; for the ranking ABC given by f, swapping any two documents decreases the probability, e.g., P(ABC) > P(ACB).
  – Translation invariant, when φ(s) = exp(s)
  – Scale invariant, when φ(s) = a·s


Distance between Ranked Lists

• With φ = exp, use the KL divergence to measure the difference between permutation probability distributions:
  – dis(f,g) = 0.46
  – dis(g,h) = 2.56


ListNet Algorithm
(Z. Cao, T. Qin, T. Liu, et al. ICML 2007)

• Loss function = KL divergence between the two permutation probability distributions (φ = exp), summed over queries q ∈ Q and over top-k subgroups G(j_1, …, j_k), with model scores given by w · X
• Model = neural network
• Algorithm = gradient descent
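For the top-1 case, the permutation distribution reduces to a softmax over scores, and the loss to a cross entropy between the ground-truth and model distributions (a sketch, not the full top-k formula; names are my own):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_top1_loss(true_scores, model_scores):
    """Cross entropy between top-1 Luce distributions (phi = exp)."""
    p, q = softmax(true_scores), softmax(model_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# A model agreeing with the ground-truth order incurs a lower loss:
print(listnet_top1_loss([6, 4, 3], [3.0, 2.0, 1.0])
      < listnet_top1_loss([6, 4, 3], [1.0, 2.0, 3.0]))  # → True
```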


Experimental Results

[Plot: training performance on a TREC dataset; the listwise loss (ListNet) is more correlated with the evaluation measure than the pairwise loss (RankNet)]


Limitations of ListNet

• Too complex in practice.
• The performance of ListNet depends on the mapping function from ground-truth labels to scores
  – Ground truths are permutations (ranked lists) or groups of permutations.
  – Ground-truth labels are not naturally scores.


Solution

• Never assume the ground truth to be scores.
• Given the ranking model, compute the likelihood of a ground-truth permutation, and maximize it to learn the model parameters.


Likelihood Loss

• Top-k likelihood
• Regularized likelihood loss


ListMLE Algorithm
(F. Xia, T. Liu, et al. ICML 2008)

• Loss function = likelihood loss
• Model = neural network
• Algorithm = stochastic gradient descent

[Plot: experimental results on the TD2004 dataset]
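The likelihood loss for a single query can be sketched as the negative log-likelihood of the ground-truth permutation under the Luce model with φ = exp (scores listed in ground-truth order; a log-sum-exp form is used for stability; names are my own):

```python
import math

def listmle_loss(scores_in_true_order):
    """-log P(ground-truth permutation) = sum_j [log sum_{k>=j} exp(s_k) - s_j]."""
    loss = 0.0
    for j in range(len(scores_in_true_order)):
        tail = scores_in_true_order[j:]
        m = max(tail)  # log-sum-exp trick
        log_denom = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_denom - scores_in_true_order[j]
    return loss

# Scores that agree with the ground-truth order give a smaller loss:
print(listmle_loss([3.0, 2.0, 1.0]) < listmle_loss([1.0, 2.0, 3.0]))  # → True
```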


Summary

• Ranking is a new application, and also a new machine learning problem.
• Learning to rank: from the pointwise and pairwise approaches to the listwise approach.
• Many open questions remain.


[Timeline chart, 2000–2007, grouping algorithms into the pointwise, pairwise, and listwise approaches: RankingSVM, RankBoost, RankNet, FRank, MHR, MPRank, QBRank, LDM, IRSVM, GBRank, McRank, Discriminative IR model, ListNet, ListMLE, RankCosine, AdaRank, SVM-MAP, SoftRank, RankGP, CCA, LambdaRank, SRA, Relational Ranking, Rank Refinement. Asterisks mark papers published by our group.]<br />



References<br />

• R. Herbrich, T. Graepel, et al. Support Vec<strong>to</strong>r <strong>Learning</strong> <strong>for</strong> Ordinal Regression, ICANN 1999.<br />

• T. Joachims, Optimizing Search Engines Using Clickthrough Data, KDD 2002.<br />

• Y. Freund, R. Iyer, et al. An Efficient Boosting Algorithm <strong>for</strong> Combining Preferences, JMLR 2003.<br />

• R. Nallapati, Discriminative models <strong>for</strong> in<strong>for</strong>mation retrieval, SIGIR 2004.<br />

• J. Gao, H. Qi, et al. Linear discriminant model <strong>for</strong> in<strong>for</strong>mation retrieval, SIGIR 2005.<br />

• C.J.C. Burges, T. Shaked, et al. <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> using Gradient Descent, ICML 2005.<br />

• I. Tsochantaridis, T. Joachims, et al. Large Margin Methods <strong>for</strong> Structured and Interdependent<br />

Output Variables, JMLR, 2005.<br />

• F. Radlinski, T. Joachims, Query Chains: <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> from Implicit Feedback, KDD 2005.<br />

• I. Matveeva, C.J.C. Burges, et al. High accuracy retrieval with multiple nested ranker, SIGIR 2006.<br />

• C.J.C. Burges, R. Ragno, et al. <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> with Nonsmooth Cost Functions, NIPS 2006.<br />

• Y. Cao, J. Xu, et al. Adapting <strong>Rank</strong>ing SVM <strong>to</strong> In<strong>for</strong>mation <strong>Retrieval</strong>, SIGIR 2006.<br />

• Z. Cao, T. Qin, et al. <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong>: From Pairwise <strong>to</strong> Listwise Approach, ICML 2007.<br />



References (2)<br />

• T. Qin, T.-Y. Liu, et al, Multiple hyperplane <strong>Rank</strong>er, SIGIR 2007.<br />

• T.-Y. Liu, J. Xu, et al. LETOR: Benchmark dataset <strong>for</strong> research on learning <strong>to</strong> rank <strong>for</strong> in<strong>for</strong>mation<br />

retrieval, LR4IR 2007.<br />

• M.-F. Tsai, T.-Y. Liu, et al. F<strong>Rank</strong>: A <strong>Rank</strong>ing Method with Fidelity Loss, SIGIR 2007.<br />

• Z. Zheng, H. Zha, et al. A General Boosting Method and its Application <strong>to</strong> <strong>Learning</strong> <strong>Rank</strong>ing<br />

Functions <strong>for</strong> Web Search, NIPS 2007.<br />

• Z. Zheng, H. Zha, et al, A Regression Framework <strong>for</strong> <strong>Learning</strong> <strong>Rank</strong>ing Functions Using Relative<br />

Relevance Judgment, SIGIR 2007.<br />

• M. Taylor, J. Guiver, et al. Soft<strong>Rank</strong>: Optimising Non-Smooth <strong>Rank</strong> Metrics, LR4IR 2007.<br />

• J.-Y. Yeh, J.-Y. Lin, et al. <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> <strong>for</strong> In<strong>for</strong>mation <strong>Retrieval</strong> using Genetic Programming,<br />

LR4IR 2007.<br />

• T. Qin, T.-Y. Liu, et al, Query-level Loss Function <strong>for</strong> In<strong>for</strong>mation <strong>Retrieval</strong>, In<strong>for</strong>mation Processing<br />

and Management, 2007.<br />

• X. Geng, T.-Y. Liu, et al, Feature Selection <strong>for</strong> <strong>Rank</strong>ing, SIGIR 2007.<br />

• Y. Yue, T. Finley, et al. A Support Vec<strong>to</strong>r Method <strong>for</strong> Optimizing Average Precision, SIGIR 2007.<br />



References (3)<br />

• Y.-T. Liu, T.-Y. Liu, et al, Supervised <strong>Rank</strong> Aggregation, WWW 2007.<br />

• J. Xu and H. Li, A Boosting Algorithm <strong>for</strong> In<strong>for</strong>mation <strong>Retrieval</strong>, SIGIR 2007.<br />

• F. Radlinski, T. Joachims. Active Exploration <strong>for</strong> <strong>Learning</strong> <strong>Rank</strong>ings from Clickthrough Data, KDD 2007.<br />

• P. Li, C. Burges, et al. Mc<strong>Rank</strong>: <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> Using Multiple Classification and Gradient Boosting, NIPS 2007.<br />


• C. Cortes, M. Mohri, et al. Magnitude-preserving <strong>Rank</strong>ing Algorithms, ICML 2007.<br />

• H. Almeida, M. Goncalves, et al. A combined component approach <strong>for</strong> finding collection-adapted ranking<br />

functions based on genetic programming, SIGIR 2007.<br />

• T. Qin, T.-Y. Liu, et al, <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> Relational Objects and Its Application <strong>to</strong> Web Search, WWW 2008.<br />

• R. Jin, H. Valizadegan, et al. <strong>Rank</strong>ing Refinement and Its Application <strong>to</strong> In<strong>for</strong>mation <strong>Retrieval</strong>, WWW<br />

2008.<br />

• F. Xia, T.-Y. Liu, et al. Listwise Approach <strong>to</strong> <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> – Theory and Algorithm, ICML 2008.<br />

• Y. Lan, T.-Y. Liu, et al. On Generalization Ability of <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> Algorithms, ICML 2008.<br />



Advertisement <br />

• SIGIR 2008 workshop on learning <strong>to</strong> rank<br />

– http://research.microsoft.com/users/LR4IR-2008/<br />

– Submission due: May 16.<br />

• SIGIR 2008 tu<strong>to</strong>rial on learning <strong>to</strong> rank<br />

– Include more new algorithms<br />

– Include more theoretical discussions<br />



Advertisement (2) <br />

• LETOR: Benchmark Dataset <strong>for</strong> Research on <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong><br />

– http://research.microsoft.com/users/LETOR/<br />

– T. Liu, J. Xu, et al. LR4IR 2007<br />



tyliu@microsoft.com<br />

http://research.microsoft.com/users/tyliu/<br />

