NTU & MSRA Ming-Feng Tsai

Presentation on theme: "NTU & MSRA Ming-Feng Tsai" — Presentation transcript:

1 Fidelity Rank
NTU & MSRA, Ming-Feng Tsai

2 Ranking
Ranking vs. Classification
Training samples are not independent and identically distributed
The training criterion is not compatible with IR evaluation measures
Applications of ranking: web search, movie suggestion, collaborative filtering
We would like to use a machine learning approach to learn the underlying ranking function

3 Machine Learning Approaches
Step 1: Annotate examples of the required data
Step 2: Induce a function automatically from these examples
A "typical" machine learning problem: classification
Prediction problem: will it rain tomorrow?
Training data:

4 ML on Classification
Machine learning: induce a function from examples (x1, y1), …, (xn, yn)
Classification problem:
A huge amount of work exists on classification problems

5 Machine Learning Approaches
                Classification  Structured  Ranking
Generative          (a)            (c)       (??)
Discriminative      (b)            (d)        (e)
Two learning approaches:
Generative: attempt to model a distribution over examples
Discriminative: attempt to learn a function directly
Examples:
(a) Naïve Bayes, mixtures of Gaussians
(b) Support vector machines
(c) Probabilistic grammars, graphical models (HMMs and CRFs)
(d) Michael Collins (2005) used SVMs for parsing
(e) This talk (??)

6 A One-dimensional Classification Problem

7 Starbucks Density vs. Voting Behavior

8 Starbucks Density vs. Voting Behavior

9 A "Generative" Approach
Attempt to model P(x | y = +1) and P(x | y = -1)
For example, as normal distributions:
Class +1 has mean = 1.618, variance = 1.35 (after log transform)
Class -1 has mean = 3.64, variance = 1.29 (after log transform)
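The generative recipe above can be sketched in a few lines of Python. The means and variances are the ones quoted on the slide; the equal-priors decision rule and the classifier itself are illustrative assumptions, not taken from the talk.

```python
import math

# Illustrative sketch: one univariate normal per class on the
# log-transformed Starbucks density, using the slide's parameters.
# Equal class priors are assumed.
MEAN_POS, VAR_POS = 1.618, 1.35   # class +1, after log transform
MEAN_NEG, VAR_NEG = 3.64, 1.29    # class -1, after log transform

def log_normal_pdf(x, mean, var):
    """Log-density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def classify(x):
    """Predict the class whose conditional density is larger."""
    pos = log_normal_pdf(x, MEAN_POS, VAR_POS)
    neg = log_normal_pdf(x, MEAN_NEG, VAR_NEG)
    return +1 if pos >= neg else -1
```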

10 A "Discriminative" Approach
Choose a threshold with the minimum number of errors
In the example:
Starbucks density >= threshold => Kerry
Starbucks density < threshold => Bush
Classifies 43 / 51 (84%) of states correctly
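The threshold search the slide describes (the exact threshold value is not preserved in the transcript) can be sketched as a scan over candidate cut points; the data in the test are illustrative, not the actual state data.

```python
def best_threshold(xs, ys):
    """Try every observed value as a candidate threshold and return
    the one minimizing training errors for the rule:
    x >= t -> +1, x < t -> -1."""
    best_t, best_err = None, len(ys) + 1
    for t in sorted(set(xs)):
        err = sum(1 for x, y in zip(xs, ys)
                  if (1 if x >= t else -1) != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```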

11 Generative vs. Discriminative
Generative models: assume some distribution P(x, y) generating examples; try to model P(x, y)
Statistical theory: Bayesian or asymptotic results for parameter estimation
Discriminative models: assume some distribution P(x, y) generating examples; find a function with a low error rate on the training examples
Statistical theory: non-parametric results relating the error on the training sample to the probability of error on future examples

12 Four Components of Discriminative ML Approaches
                Classification  Structured  Ranking
Generative          (a)            (c)       (??)
Discriminative      (b)            (d)        (e)
Features: require domain knowledge, such as NLP and IR expertise; most important for overall performance
Model: F(x)
Loss function: should be compatible with the evaluation measure
Optimization approach: neural networks, SVM, boosting, BLasso, …

13 The Contrast Between Classification and Ranking
Model
Classification: F(x)
Ranking: pair-wise F(xi, xj), or global F(x1, x2, …, xn)
Loss function
Should be relative to the evaluation metric
Should consider the global loss of the order
Learning algorithm
Most ranking algorithms still use the same approaches to minimize the loss function

14 Learning to Rank
Many ML approaches have been applied to ranking:
RankSVM: T. Joachims, SIGKDD 2002 (SVM-Light); ordinal regression
RankBoost: Y. Freund and R. Iyer, Journal of Machine Learning Research, 2003; regression error on weighted distributions
RankNet: C. J. C. Burges, ICML 2005 (MSN Search); probabilistic ranking framework, used in the MSN search engine

15 Motivation
RankNet
Pro: probabilistic ranking framework; can take multi-level relevance information into account for training
Con: training is inefficient; query-level loss is neglected
Motivation: improve efficiency; improve the loss function

16 Probabilistic Ranking Model
Model the posterior by Pij
The map from model outputs to probabilities is modeled with a logistic function
Properties:
Multi-level relevance information can be taken into account
Consistency requirements: if P(A>B) = 0.5 and P(B>C) = 0.5, then P(A>C) = 0.5
Confidence, or lack of confidence, builds as expected: if P(A>B) = 0.6 and P(B>C) = 0.6, then P(A>C) > 0.6
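The logistic map and both properties can be checked numerically. This is a minimal sketch assuming the RankNet-style convention that the output difference o_ij = f(x_i) - f(x_j) is passed through a logistic function, and that differences add along a chain (o_AC = o_AB + o_BC).

```python
import math

def pair_prob(o_ij):
    """Map an output difference o_ij = f(x_i) - f(x_j) to the
    posterior P(x_i > x_j) via a logistic function."""
    return 1.0 / (1.0 + math.exp(-o_ij))

def logit(p):
    """Inverse of the logistic map."""
    return math.log(p / (1.0 - p))

# Consistency: P(A>B) = 0.5 means o_AB = 0, so a chain of zeros
# stays at 0.5. Confidence build-up: with o_AC = o_AB + o_BC,
# P(A>B) = P(B>C) = 0.6 gives P(A>C) = pair_prob(2 * logit(0.6)) > 0.6.
```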

17 Fidelity
Fidelity: a measure of distance between two probability distributions
A more reasonable loss function, inspired by quantum computation
Properties:
Symmetric: F(p, q) = F(q, p)
The loss is bounded between 0 and 1
The minimum loss value 0 is attainable, which helps the loss converge

18 Fidelity Loss Function
A pair-wise differentiable loss function
Pair-level loss is considered, e.g. the loss of (5, 4, 3, 2, 1) vs. (4, 3, 2, 1, 0) is zero
Query-level loss is also considered, since the loss of each pair is between 0 and 1
Total loss =
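The transcript omits the formula, so the sketch below assumes the usual fidelity measure from quantum computation, F(p, q) = Σ_i √(p_i · q_i), applied to the two-outcome pairwise distributions (P*_ij, 1 − P*_ij) and (P_ij, 1 − P_ij); the symmetry, boundedness, and zero-minimum properties listed on the previous slide then follow.

```python
import math

def fidelity_loss(p_target, p_model):
    """Assumed pairwise fidelity loss:
    F_ij = 1 - (sqrt(P*_ij * P_ij) + sqrt((1 - P*_ij) * (1 - P_ij))).
    Symmetric, bounded in [0, 1], and exactly 0 when the modeled
    pairwise probability matches the target."""
    return 1.0 - (math.sqrt(p_target * p_model)
                  + math.sqrt((1.0 - p_target) * (1.0 - p_model)))
```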

19 Fidelity Loss Function (cont.)
Fidelity loss vs. cross-entropy loss (RankNet)

20 Learning Algorithm: FRank
A learning algorithm combining an additive model with the pair-wise differentiable fidelity loss function
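A toy sketch of what such an additive model might look like. This is an illustrative reconstruction, not the paper's actual FRank algorithm: threshold stumps are added greedily over 1-D scores, with the coefficient chosen by grid search to reduce the summed pairwise fidelity loss (assumed form as above).

```python
import math

def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))

def fidelity(p_star, p):
    return 1.0 - (math.sqrt(p_star * p) + math.sqrt((1 - p_star) * (1 - p)))

def pairwise_loss(scores, pairs):
    """Summed fidelity loss over pairs (i, j) with target P*(i > j) = 1."""
    return sum(fidelity(1.0, sigmoid(scores[i] - scores[j]))
               for i, j in pairs)

def frank_sketch(xs, pairs, rounds=20, alphas=(0.1, 0.3, 1.0)):
    """Each round greedily adds a threshold stump h(x) = [x >= t]
    with the coefficient that most reduces the total fidelity loss."""
    scores = [0.0] * len(xs)
    for _ in range(rounds):
        best = None
        for t in xs:
            h = [1.0 if x >= t else 0.0 for x in xs]
            for a in alphas:
                trial = [s + a * v for s, v in zip(scores, h)]
                loss = pairwise_loss(trial, pairs)
                if best is None or loss < best[0]:
                    best = (loss, trial)
        scores = best[1]
    return scores
```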

21 Experiment
Data description
Consists of 19,600 queries
Randomly choose 20 unlabeled documents as the poorly relevant documents
Features: include query-dependent and query-independent features, preprocessed with global normalization
Details of the data set:

22 Experimental Results
Training process of FRank
The fidelity loss decreases as the number of weak learners increases; meanwhile, the NDCG value increases
The fidelity loss function is consistent with NDCG
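For reference, NDCG can be computed as below; the 2^rel − 1 gain and log2 position discount are the common learning-to-rank convention, assumed here rather than taken from the slide.

```python
import math

def dcg(rels):
    """DCG with 2**rel - 1 gain and a log2 position discount."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(ranked_rels):
    """NDCG: DCG of the given ordering divided by the ideal DCG."""
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```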

23 Experimental Results (cont.)
Performance on the validation set, used to select the best parameter setting for each ranking algorithm

24 Experimental Results (cont.)
Performance comparison on the testing set
Complex models perform better than simple models
Our proposed method outperforms the other methods
We performed a t-test between FRank and RankNet_TwoLayer; the resulting p-values are … for …, … for …, and … for …

25 Experimental Results (cont.)
Experiment on different amounts of training data
To investigate how the number of training queries affects performance, we separately trained the ranking algorithms on 1,000, 2,000, 4,000, 8,000, and 12,000 queries
In this experiment, RankSVM can output reasonable models when trained with 1,000, 2,000, and 4,000 queries
Details of the different training sets:

26 Experimental Results (cont.)
Experimental results on different amounts of training data
When the number of training queries is small, the linear model performs better than the non-linear model
Since FRank introduces query-level normalization in the loss function, it still outperforms the other methods
RankSVM would be a good candidate for learning to rank
The methods based on the probabilistic ranking framework perform well

27 Conclusion
Ranking model: F(x)
Loss function: the pair-wise fidelity loss Fij, which seems more consistent with IR evaluation
Learning algorithm: we combine this pair-wise differentiable loss function with a boosting optimization approach

28 Future Work
                Classification  Structured  Ranking
Generative          (a)            (c)       (??)
Discriminative      (b)            (d)        (e)
(??) Generative model for ranking
Ranking model
Pair-wise ranking function: F(xi, xj)
Global ranking function: F(x1, x2, …, xn)
Loss function
Global loss function: L(x1, x2, …, xn), more consistent with NDCG
Optimization technique

