Direct Optimization for Web Search Ranking
Olivier Chapelle
SIGIR Workshop: Learning to Rank for Information Retrieval, July 23rd 2009
Outline
1. Introduction
2. Continuous approximation
3. Structured output learning
4. Perspectives
Web search ranking
Ranking via a relevance function: given a query q and a document d, estimate the relevance of d to q. Web search results are sorted by relevance. Traditional relevance functions (e.g. BM25) are hand-designed; recently, several machine learning approaches have been proposed to learn the relevance function. Learning a relevance function is practical, but there are other possibilities: learning a ranking directly, or learning a preference function.
Machine learning for ranking
Training data:
1. Binary relevance labels (traditional IR)
2. Multiple levels of relevance (Excellent, Good, Bad, ...)
3. Pairwise comparisons

1 and 2 can be converted into 3. 1 and 2 need human editors, but 3 can use large amounts of click data → skip-above pairs in [Joachims '02]. Rest of this talk: 2.
Information retrieval metrics

Binary relevance labels: average precision, reciprocal rank, winner takes all, AUC (i.e. fraction of misranked pairs), ...

Multiple levels of relevance: Discounted Cumulative Gain at rank p:

DCG_p = Σ_{ranks j} D_p(j) G(s_{r(j)}) = Σ_{documents i} D_p(r^{-1}(i)) G(s_i)

Rank j        1    2           3           ...
D(j)          1    1/log2(3)   1/log2(4)   ...
G(s_{r(j)})   3    7           0           ...

where s_i is the relevance score of document i, from 0 (Bad) to 4 (Perfect); r is the ranking function: r(j) = i means document i is at position j; D_p is the discount function truncated at rank p, D(j) = 1/log2(1 + j) if j ≤ p and 0 otherwise; G is the gain function, G(s) = 2^s − 1.
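To make the definition concrete, here is a minimal NumPy sketch (an illustration added for this write-up, not code from the talk) that reproduces the example in the table above:

```python
import numpy as np

def dcg(scores, ranking, p=10):
    """DCG_p for one query.

    scores[i]  : editorial relevance s_i of document i, in {0, ..., 4}.
    ranking[j] : index of the document placed at position j, i.e. r(j).
    """
    gains = 2.0 ** np.asarray(scores, dtype=float) - 1.0   # G(s) = 2^s - 1
    top = np.asarray(ranking)[:p]                          # truncate at rank p
    discounts = 1.0 / np.log2(np.arange(len(top)) + 2.0)   # D(j) = 1/log2(1+j)
    return float(discounts @ gains[top])

# The table above: gains 3, 7, 0 at ranks 1, 2, 3 (i.e. s = 2, 3, 0).
print(dcg(scores=[2, 3, 0], ranking=[0, 1, 2], p=3))  # 3 + 7/log2(3) + 0 ≈ 7.42
```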
Features

Given a query and a document, construct a feature vector x_i with 3 types of features:
- Query only: type of query, query length, ...
- Document only: PageRank, length, spam, ...
- Query & document: match score, ...

Set of q = 1, ..., Q queries; set of n triplets (query, document, score) written (x_i, s_i), with x_i ∈ R^d and s_i ∈ {0, 1, 2, 3, 4}. U_q is the set of indices associated with the q-th query.
Approaches to ranking
- Pointwise: classification [Li et al. '07], regression → works surprisingly well.
- Pairwise: RankSVM; perceptron [Crammer et al. '03]; neural nets [Burges et al. '05], LambdaRank; boosting: RankBoost, GBRank.
- Listwise: not metric-specific: ListNet, ListMLE; metric-specific: AdaRank; structured learning: SVMMAP, [Chapelle et al. '07]; gradient descent: SoftRank, [Chapelle et al. '09].
Two approaches for direct optimization of the DCG:
1. Gradient descent on a smooth approximation of the DCG.
2. Large margin structured output learning where the loss function is the DCG.

Orthogonal issue: the choice of the architecture → for simplicity, linear functions. Non-linear extensions are presented at the end.
Continuous approximation
Main difficulty for direct optimization (by gradient descent, for instance): the DCG is not continuous and is constant almost everywhere → use a continuous approximation of it.

DCG_1 = Σ_i I(i = argmax_j w^T x_j) G(s_i) ≈ Σ_i [exp(w^T x_i / σ) / Σ_j exp(w^T x_j / σ)] G(s_i)

→ "soft-argmax"; softness controlled by σ.
Generalization to DCG_p:

A(w, σ) := Σ_{j=1}^p D(j) · [Σ_i G(s_i) h_ij] / [Σ_i h_ij],

with h_ij a "smooth" version of the indicator "is x_i at the j-th position in the ranking?":

h_ij = exp( −(w^T x_i − w^T x_{r(j)})² / (2σ²) ).

σ controls the amount of smoothing: when σ → 0, A(w, σ) → DCG_p. A(w, σ) is continuous but non-differentiable; however, it is differentiable almost everywhere → no problem for gradient descent. The approach generalizes to other IR metrics such as MAP.
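A NumPy sketch of A(w, σ) (my illustration, not the author's code; `X` holds the feature vectors x_i of one query as rows, `s` the relevance scores, and the ranking r is induced by sorting the scores w^T x_i):

```python
import numpy as np

def smooth_dcg(w, X, s, sigma, p=10):
    """Smoothed DCG_p: A(w, sigma) as defined above."""
    f = X @ w                                   # scores w^T x_i
    r = np.argsort(-f)[:p]                      # r(j): document at position j
    gains = 2.0 ** np.asarray(s, float) - 1.0   # G(s_i)
    D = 1.0 / np.log2(np.arange(len(r)) + 2.0)  # discount D(j)
    # h[i, j] = exp(-(w.x_i - w.x_{r(j)})^2 / (2 sigma^2))
    h = np.exp(-(f[:, None] - f[r][None, :]) ** 2 / (2.0 * sigma ** 2))
    return float(np.sum(D * (gains @ h) / h.sum(axis=0)))
```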
Optimization by gradient descent and annealing:
1. Initialize: w = w0 and σ large.
2. Starting from w, minimize λ||w − w0||² − A(w, σ) by (conjugate) gradient descent.
3. Divide σ by 2 and go back to 2 (or stop).

w0 is an initial solution, such as the one given by pairwise ranking.

[Figure: the objective function along a line-search direction t ∈ [−5, 5], for σ = 0.125, 1, 8, 64.]
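A sketch of the annealing loop (again my illustration: `smooth_dcg` is the function above, and SciPy's conjugate gradient with numerical gradients stands in for the analytic gradients one would use in practice):

```python
import numpy as np
from scipy.optimize import minimize

def train_annealed(X, s, w0, lam=1.0, sigma0=64.0, n_rounds=10):
    """Minimize lam*||w - w0||^2 - A(w, sigma), halving sigma each round."""
    w, sigma = w0.copy(), sigma0
    for _ in range(n_rounds):
        obj = lambda v: lam * np.sum((v - w0) ** 2) - smooth_dcg(v, X, s, sigma)
        w = minimize(obj, w, method="CG").x     # (conjugate) gradient descent
        sigma /= 2.0                            # anneal the smoothing
    return w
```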
Evaluation on web search data

Dataset: several hundred features; ∼50k (query, url) pairs from an international market; ∼1500 queries randomly split into training / test (80% / 20%); 5 levels of relevance.

[Figure: DCG5 as a function of the smoothing factor σ, for λ = 10^1 to 10^5, on the training set (left) and the test set (right).]
DCG can be improved by almost 10% on the training set (left), but not more than 1% on the test set (right).
Evaluation on Letor 3.0

Ohsumed dataset:
[Figure: NDCG at positions 1 to 10 for SmoothNDCG, RankSVM, Regression, AdaRank-NDCG and ListNet.]

All datasets: NDCG / MAP
[Figure: NDCG@10 (left) and MAP (right) across the Letor datasets for SVMMAP, RankBoost, ListNet, FRank, AdaRank-NDCG, AdaRank-MAP, Regression, RankSVM and SmoothNDCG.]
Structured output learning

Notations: x_q is the set of documents associated with query q; x_qi is the i-th document. y_q is a ranking (i.e. a permutation): y_qi is the rank of the i-th document (obtained by sorting the scores s_qi).

Learning for structured outputs [Tsochantaridis et al. '04]: learn a mapping x → y through a joint feature map Ψ(x, y), with prediction rule ŷ = argmax_y w^T Ψ(x, y).
We take Ψ(x_q, y_q) = Σ_i x_qi A(y_qi), where A : N → R is a user-defined non-increasing function. The ranking is given by the order of the w^T x_qi, because w^T Ψ(x, y) = Σ_i w^T x_qi A(y_qi).

Example with three documents, scores w^T x_qi = 2.5, 3.7, −0.5 and ranks 2, 1, 3:
w^T Ψ(x, y) = 2.5 × A(2) + 3.7 × A(1) + (−0.5) × A(3) = 2.5 × 2 + 3.7 × 3 + (−0.5) × 1 = 15.6 → max.

Constraints for correct predictions on the training set:

∀q, ∀y ≠ y_q:  w^T Ψ(x_q, y_q) − w^T Ψ(x_q, y) > 0.
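A small sketch of the feature map and the induced prediction rule (my illustration; the concrete choice of A below is just an example of a non-increasing function):

```python
import numpy as np

def psi(X, y, A):
    """Psi(x, y) = sum_i x_i A(y_i); X is (n, d), y holds ranks in {1,...,n}."""
    return X.T @ A(np.asarray(y))

A = lambda r: np.maximum(4.0 - r, 0.0)   # example non-increasing A(r)

def predict_ranks(w, X):
    """argmax_y w^T Psi(x, y): sort documents by decreasing score w^T x_i."""
    order = np.argsort(-(X @ w))               # best document first
    y = np.empty(len(order), dtype=int)
    y[order] = np.arange(1, len(order) + 1)    # rank of each document
    return y
```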
SVM-like optimization problem:

min_{w, ξ}  (λ/2) ||w||² + Σ_{q=1}^Q ξ_q

under the constraints: ∀q, ∀y ≠ y_q,

w^T Ψ(x_q, y_q) − w^T Ψ(x_q, y) ≥ Δ(y, y_q) − ξ_q,

where Δ(y, y_q) is the query loss, e.g. the difference between the DCGs of rankings y and y_q. At the optimal solution, ξ_q ≥ Δ(ŷ_q, y_q) with ŷ_q = argmax_y w^T Ψ(x_q, y).
Optimization

ξ_q = max_y Δ(y, y_q) + w^T Ψ(x_q, y) − w^T Ψ(x_q, y_q).

Need to find the argmax:

ỹ = argmax_y Σ_i [ A(y_i) w^T x_qi − G(s_qi) D(y_i) ]

→ can be solved efficiently as a linear assignment problem.
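The argmax decomposes into per-(document, rank) costs, so it can be solved with the Hungarian algorithm; a SciPy sketch (my illustration, with `A` and `D` passed in as callables on rank arrays):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def most_violated(w, X, s, A, D):
    """y~ = argmax_y sum_i [A(y_i) w^T x_i - G(s_i) D(y_i)]."""
    n = len(s)
    f = X @ w                                   # w^T x_i
    gains = 2.0 ** np.asarray(s, float) - 1.0   # G(s_i)
    ranks = np.arange(1, n + 1)
    # benefit[i, j]: value of putting document i at rank ranks[j]
    benefit = np.outer(f, A(ranks)) - np.outer(gains, D(ranks))
    rows, cols = linear_sum_assignment(-benefit)    # maximize total benefit
    y = np.empty(n, dtype=int)
    y[rows] = ranks[cols]
    return y

# e.g. D = lambda r: np.where(r <= 10, 1.0 / np.log2(1.0 + r), 0.0)
```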
Two optimization strategies.

Cutting plane (the strategy used in SVMstruct); iterate between:
1. Solving the problem on a subset of the constraints.
2. Finding and adding the most violated constraints.

Unconstrained optimization:

min_w  (1/2) ||w||² + Σ_q max_y [ Δ(y, y_q) + w^T Ψ(x_q, y) − w^T Ψ(x_q, y_q) ].

Convex, but not differentiable → subgradient descent, bundle methods. (A subgradient sketch follows below.)
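For the unconstrained form, a plain subgradient-descent sketch (my illustration, reusing `psi` and `most_violated` from above; Δ(y, y_q) does not depend on w, so a subgradient of each max term is Ψ(x_q, ỹ) − Ψ(x_q, y_q)):

```python
import numpy as np

def subgradient_train(Xs, ys, ss, A, D, lr=0.01, epochs=100):
    """min_w 0.5||w||^2 + sum_q max_y [Delta + w^T Psi(x_q,y) - w^T Psi(x_q,y_q)]."""
    w = np.zeros(Xs[0].shape[1])
    for _ in range(epochs):
        g = w.copy()                                # gradient of 0.5 ||w||^2
        for X, y_q, s in zip(Xs, ys, ss):
            y_t = most_violated(w, X, s, A, D)      # argmax of the inner max
            g += psi(X, y_t, A) - psi(X, y_q, A)    # subgradient of that max
        w -= lr * g
    return w
```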
Experiments

Normalized DCG at different truncation levels k:
[Figure: NDCG_k for k = 2 to 10, comparing SVMStruct, Regression and RankSVM.]

λ chosen on a validation set; A(r) = max(k + 1 − r, 0); training time ∼20 minutes.

About 2% improvement (p-value = 0.03 vs regression, 0.07 vs RankSVM).
Ohsumed dataset (Letor distribution): 3 levels of relevance; 25 features; 106 queries split into training / validation / test.

The optimal solution is w = 0, even for small values of λ. Reason: there are a lot of constraints and not a lot of variables → the objective function looks like x ↦ |x|.

[Diagram: positions of w^T Ψ(x, y), w^T Ψ(x, ỹ) and w^T Ψ(x, ŷ) on an axis, with documents labeled Perfect, Bad, Good.]

The loss is large (because Perfect < Bad), but we would like it to be small (because Good is at the top).
A tighter upper bound

The standard structured loss for example i is

max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, y_i) + Δ(y, y_i).

Replace the target y_i by the best zero-loss ranking:

min_{ŷ : Δ(ŷ, y_i) = 0}  max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, ŷ) + Δ(y, ŷ),

then move the constraint into the objective:

min_ŷ max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, ŷ) + Δ(y, ŷ) + Δ(ŷ, y_i)
  = min_ŷ max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, ŷ) + Δ(y, y_i)
  = max_y ( w^T Ψ(x_i, y) + Δ(y, y_i) ) − max_ŷ w^T Ψ(x_i, ŷ).

This new objective is:
1. Smaller than the original loss: take ŷ = y_i.
2. Still an upper bound on the loss: take y = ŷ.
3. Non-convex.

→ This upper bound can be used for any structured output learning problem. Details available in Tighter bounds for structured estimation [Do et al. '09] and Optimization of ranking measures [Le et al.].
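Both maxima in the last expression are again linear assignment problems (in fact `most_violated` above already computes the first one, since Δ(y, y_i) differs from −Σ_i G(s_i) D(y_i) only by a constant), so the non-convex bound is cheap to evaluate. A sketch under those assumptions, with the DCG-difference loss:

```python
import numpy as np

def delta_dcg(y, y_ref, s, D):
    """Delta(y, y_ref) = DCG(y_ref) - DCG(y); y_i is the rank of document i."""
    gains = 2.0 ** np.asarray(s, float) - 1.0
    return float(gains @ (D(np.asarray(y_ref)) - D(np.asarray(y))))

def tighter_bound(w, X, y_true, s, A, D):
    """max_y (w^T Psi + Delta(y, y_true)) - max_yhat w^T Psi(x, yhat)."""
    y_tilde = most_violated(w, X, s, A, D)      # loss-augmented argmax
    y_hat = predict_ranks(w, X)                 # plain argmax of w^T Psi
    return (w @ psi(X, y_tilde, A) + delta_dcg(y_tilde, y_true, s, D)
            - w @ psi(X, y_hat, A))
```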
Ohsumed dataset

[Figure: NDCG_k for k = 2 to 10, comparing SVMStruct, RankBoost and RankSVM.]

w0 found by regression: serves as a starting point and in the regularizer ||w − w0||². Optimization for DCG_10.
Non-linear extensions

1. The "obvious" kernel trick.
2. Gradient boosted decision trees, via Friedman's functional gradient boosting framework:

min_{f ∈ F} R(f) = Σ_{j=1}^n ℓ(y_j, f(x_j)),   F = { Σ_{i=1}^N α_i h_i }.

Typically h_i is a tree and N is infinite, so direct optimization over F is not possible. Instead:

f ← 0
repeat
    g_j ← −∂ℓ(f(x_j), y_j) / ∂f(x_j)                (functional gradient)
    î ← argmin_{i,λ} Σ_{j=1}^n (g_j − λ h_i(x_j))²  (steepest component)
    ρ̂ ← argmin_ρ R(f + ρ h_î)                       (line search)
    f ← f + η ρ̂ h_î                                 ("shrinkage" when η < 1)
until max iterations reached

A runnable sketch follows below.
The objective function can be much more general: R(f(x_1), ..., f(x_n)), with g_j = −∂R/∂f(x_j). For ranking via structured output learning:

R(f) = Σ_q max_y ( Δ(y, y_q) + Σ_i f(x_qi) (A(y_i) − A(y_qi)) ).

Preliminary results are disappointing: with gradient boosted decision trees, no difference between regression and structured output learning → could be because the loss function matters only when the class of functions is restricted (underfitting).
Perspectives
Choice of the objective function
General consensus on the relative performance of learning-to-rank methods: pointwise < pairwise < listwise. True, but the differences are very small. On real web search data using non-linear functions: pairwise is ∼0.5-1% better than pointwise; listwise is ∼0-0.5% better than pairwise. The Letor datasets are useful for testing ideas, but validation in a real setting is necessary.
Public web search datasets

Internet Mathematics 2009: a dataset released by the Russian search engine Yandex for a competition, available at http://company.yandex.ru/grant/2009/en/datasets. 9,124 queries / 97,290 judgements (training); 245 features; 5 levels of relevance; 132 submissions.

Yahoo! also plans to organize a similar competition and release datasets. Stay tuned!
To improve a ranking system, work in priority on:
1. Feature development
2. Choice of the function class
3. Choice of the objective function to optimize

But 1 and 2 are orthogonal to learning to rank. What other interesting problems lie beyond the choice of the objective function?
Sample selection bias

Training and offline test sets typically come from pooling the top results of other ranking functions, but online test documents come from a "larger" distribution (all the documents from a simple ranking function). Problem: the learning algorithm does not learn to demote very bad pages (low BM25, spam, ...) because they rarely appear in the training set. Solution: reweight the training set so that it resembles the online test distribution.
Diversity

Output a set of relevant documents that is also diverse. This requires going beyond learning a relevance function; structured output learning can be a principled framework for this purpose, but in any case there is extra computational load at test time. Problem: no cheap metric for diversity. Diversity of content is more important than diversity of topic: a user can always reformulate an ambiguous query.
Transfer / multi-task learning

How to leverage the data from one (big) market for another (small) one?
Cascade learning

Ideally: rank all existing web pages. In practice: only a small subset of them is ranked using machine learning. Instead: build a "cascade" of T rankers f_1, ..., f_T. All documents are fed to f_1; the bottom documents are discarded after each round; features and functions are of increasing complexity; each ranker is learned.
Low-level learning

Two different ML philosophies:
1. Design a limited number of high-level features and put an ML algorithm on top of them.
2. Let the ML algorithm work directly on a large number of low-level features.

We have done 1, but 2 has been successful in various domains such as computer vision. Two ideas (a toy sketch of the second follows below):
- Learn BM25 by introducing several parameters per word (such as the k in the saturation function).
- Define the match score as Σ_{i,j} w_ij q_i d_j and learn the w_ij. See the earlier talk Learning to rank with low rank.
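A toy sketch of the second idea (my illustration): a bilinear match score Σ_{i,j} w_ij q_i d_j over term vectors, with a low-rank factorization W = U V^T so that the number of parameters stays manageable, in the spirit of the "low rank" talk mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, RANK = 10_000, 50          # vocabulary size, factorization rank

U = 0.01 * rng.standard_normal((VOCAB, RANK))
V = 0.01 * rng.standard_normal((VOCAB, RANK))

def match_score(q, d):
    """q^T W d with W = U V^T, i.e. sum_ij w_ij q_i d_j (q, d: term weights)."""
    return float((q @ U) @ (V.T @ d))
```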
Summary

Optimizing ranking measures is difficult, but feasible. Two types of approaches: convex upper bound or non-convex approximation. Only small improvements in real settings (large number of examples, large number of features, non-linear architecture) → the choice of the objective function has a small influence on the overall performance. Research on learning to rank should focus on new problems.