1 A Fast Algorithm for Learning Large Scale Preference Relations. Vikas C. Raykar and Ramani Duraiswami (University of Maryland, College Park), Balaji Krishnapuram (Siemens Medical Solutions, USA). AISTATS 2007
2 Learning Many learning tasks can be viewed as function estimation.
3 Learning from examples. Not all supervised learning procedures fit the standard classification/regression framework. In this talk we are mainly concerned with ranking/ordering.
4 Ranking / Ordering For some applications ordering is more important Example 1: Information retrieval Sort in the order of relevance
5 Ranking / Ordering For some applications ordering is more important Example 2: Recommender systems Sort in the order of preference
6 Ranking / Ordering For some applications ordering is more important Example 3: Medical decision making Decide over different treatment options
7 Ranking formulation Algorithm Fast algorithm Results Plan of the talk
8 Preference relations. Given a preference relation we can order/rank a set of instances. Goal – learn a preference relation. Training data – a set of pairwise preferences.
9 Ranking function. New goal – learn a ranking function, which provides a numerical score (and is not unique). Why not use a classifier/ordinal regressor as the ranking function?
10 Why is ranking different? The training data are pairwise preference relations, and the loss counts pairwise disagreements.
11 Training data, more formally. From these two we can get a set of pairwise preference relations.
12 Loss function: the generalized Wilcoxon-Mann-Whitney (WMW) statistic. Minimize the fraction of pairwise disagreements, i.e. maximize the fraction of pairwise agreements: total # of pairwise agreements / total # of pairwise preference relations.
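On a two-class problem the WMW statistic is exactly the area under the ROC curve, and counting pairwise agreements makes this concrete. A minimal sketch (the function name is my own, not from the talk):

```python
import numpy as np

def wmw_statistic(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs the scores order correctly.

    For a two-class problem this equals the area under the ROC curve:
    total # of pairwise agreements / total # of pairwise preference relations.
    """
    s_pos = np.asarray(scores_pos)[:, None]   # shape (m, 1)
    s_neg = np.asarray(scores_neg)[None, :]   # shape (1, n)
    return np.mean(s_pos > s_neg)             # averages over all m*n pairs
```

A perfect ranker that scores every preferred instance higher attains WMW = 1; reversing the scores gives 0.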
13 Consider a two-class problem.
14 Function class: linear ranking functions. Different algorithms use different function classes: RankNet – neural network; RankSVM – RKHS; RankBoost – boosted decision stumps.
15 Ranking formulation –Training data – Pairwise preference relations –Ideal Loss function – WMW statistic –Function class – linear ranking functions Algorithm Fast algorithm Results Plan of the talk
16 The likelihood. Maximizing the WMW directly is a discrete optimization problem. Assumption: every pair is drawn independently. Model the probability of each pairwise preference with a sigmoid [Burges et al.] and choose w to maximize the log-likelihood.
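Under the independence assumption the log-likelihood of the pairwise preferences factorizes into a sum of log-sigmoids of score differences. A sketch for a linear ranking function (the names and the numerically stable log-sigmoid form are my own):

```python
import numpy as np

def log_sigmoid(z):
    # log sigma(z) = -log(1 + exp(-z)), computed stably via logaddexp
    return -np.logaddexp(0.0, -z)

def pairwise_log_likelihood(w, X, pairs):
    """Log-likelihood of the sigmoid pair model of [Burges et al.]:
    P(x_u preferred over x_v) = sigma(w . (x_u - x_v)), with every
    preference pair (u, v) assumed drawn independently."""
    margins = (X[pairs[:, 0]] - X[pairs[:, 1]]) @ w
    return log_sigmoid(margins).sum()
```

With a zero weight vector every pair is a coin flip, so each pair contributes log(1/2) to the log-likelihood.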
17 The MAP estimator
18 Another interpretation. What we want to maximize is the 0-1 indicator function (the WMW); what we actually maximize is the log-sigmoid. The log-sigmoid is a lower bound for the indicator function.
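The bound is easy to check numerically. The base-2 scaling below is an illustrative choice on my part that makes the bound tight: 1 + log2 sigma(z) never exceeds the 0-1 indicator 1[z > 0], matches it exactly at z = 0, and approaches it as z grows:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def indicator(z):
    return 1.0 if z > 0 else 0.0

def log_sigmoid_bound(z):
    # 1 + log2(sigma(z)): a lower bound on the 0-1 indicator of z > 0
    return 1.0 + math.log2(sigmoid(z))

# the bound holds pointwise on a grid of z values
for z in [k / 10.0 for k in range(-50, 51)]:
    assert log_sigmoid_bound(z) <= indicator(z) + 1e-12
```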
19 Lower bounding the WMW: the (suitably scaled) log-likelihood ≤ WMW.
20 Gradient-based learning. Use the nonlinear conjugate-gradient algorithm: requires only gradient evaluations – no function evaluations, no second derivatives. The gradient is a sum over all preference pairs.
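A sketch of the gradient of the (negative) pair log-likelihood for a linear ranker, verifiable against finite differences; the function name and the explicit pair-difference form are my own:

```python
import numpy as np

def neg_ll_and_grad(w, X, pairs):
    """Negative pairwise log-likelihood and its gradient.

    Each preference pair contributes -log sigma(z) with z = w . (x_u - x_v);
    the gradient is -sum_k sigma(-z_k) * (x_u - x_v), which is all a
    nonlinear conjugate-gradient solver needs per iteration.  Forming the
    sum over every pair is the quadratic cost the fast algorithm removes.
    """
    d = X[pairs[:, 0]] - X[pairs[:, 1]]    # pair difference vectors
    z = d @ w
    nll = np.logaddexp(0.0, -z).sum()      # sum of -log sigma(z)
    grad = -(1.0 / (1.0 + np.exp(z))) @ d  # sigma(-z) = 1 / (1 + exp(z))
    return nll, grad
```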
21 RankNet – loss: cross entropy; function class: neural net; trained by backpropagation on pairwise preference relations.
22 RankSVM – loss: pairwise disagreements; function class: RKHS; trained as an SVM on pairwise preference relations.
23 RankBoost – loss: pairwise disagreements; function class: decision stumps; trained by boosting on pairwise preference relations.
24 Ranking formulation –Training data – Pairwise preference relations –Loss function – WMW statistic –Function class – linear ranking functions Algorithm –Maximize a lower bound on WMW –Use conjugate-gradient –Quadratic complexity Fast algorithm Results Plan of the talk
25 Key idea: use an approximate gradient. Extremely fast – linear time. Converges to the same solution, though it requires a few more iterations.
26 Core computational primitive: weighted summation of erfc functions.
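Direct evaluation of the primitive makes the quadratic cost explicit: every one of the M targets is summed over all N sources. A sketch (notation assumed here: sources x_i with weights q_i, targets y_j):

```python
import math

def erfc_sum_direct(q, x, y):
    """E(y_j) = sum_i q_i * erfc(y_j - x_i), evaluated directly.

    With N sources and M targets this costs O(M * N) erfc evaluations,
    which dominates the gradient computation.
    """
    return [sum(qi * math.erfc(yj - xi) for qi, xi in zip(q, x)) for yj in y]
```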
27 Notion of approximation
28 Example
29 1. Beaulieu's series expansion. Retain only the first few terms contributing to the desired accuracy; derive error bounds to choose the number of terms.
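A sketch of the truncated expansion. The form below (odd harmonics damped by a Gaussian factor) follows Beaulieu's series; the particular h and p are illustrative choices, not the tuned values from the paper, and the approximation is only valid while |z| stays well inside the half-period pi/(2h):

```python
import math

def erfc_series(z, h=0.3, p=10):
    """erfc(z) ~ 1 - (4/pi) * sum over odd n < 2p of
                     exp(-(n*h)**2) * sin(2*n*h*z) / n

    The exp(-(n*h)**2) factor decays rapidly in n, so a few retained
    terms already reach the desired accuracy for moderate |z|.
    """
    s = sum(math.exp(-(n * h) ** 2) * math.sin(2.0 * n * h * z) / n
            for n in range(1, 2 * p, 2))
    return 1.0 - (4.0 / math.pi) * s
```

With these parameters the pointwise error against math.erfc stays comfortably below 1e-4 for |z| ≤ 2.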
30 2. Error bounds
31 3. Use truncated series
32 4. Regrouping. A and B do not depend on y and can be precomputed in O(pN); once A and B are precomputed, all M targets can be evaluated in O(pM). Total cost reduced from O(MN) to O(p(M+N)).
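The regrouping can be sketched end-to-end: expanding sin(2nh(y - x)) separates the source and target dependence, so the source-side sums A_n, B_n are computed once in O(pN) and every target then costs O(p). Parameter choices are illustrative, as above:

```python
import math

def erfc_sum_fast(q, x, y, h=0.3, p=10):
    """Approximate E(y_j) = sum_i q_i * erfc(y_j - x_i) in O(p*(N + M)).

    Uses the truncated series for erfc together with
        sin(2nh(y - x)) = sin(2nhy)cos(2nhx) - cos(2nhy)sin(2nhx),
    so A_n = sum_i q_i cos(2nh x_i) and B_n = sum_i q_i sin(2nh x_i)
    do not depend on y and are precomputed once.  Assumes all y_j - x_i
    lie well inside (-pi/(2h), pi/(2h)).
    """
    ns = list(range(1, 2 * p, 2))             # odd series indices
    Q = sum(q)
    # O(p*N): source-side sums, independent of the targets
    A = [sum(qi * math.cos(2 * n * h * xi) for qi, xi in zip(q, x)) for n in ns]
    B = [sum(qi * math.sin(2 * n * h * xi) for qi, xi in zip(q, x)) for n in ns]
    out = []
    for yj in y:                              # O(p*M) once A, B are known
        s = sum(math.exp(-(n * h) ** 2) / n *
                (math.sin(2 * n * h * yj) * a - math.cos(2 * n * h * yj) * b)
                for n, a, b in zip(ns, A, B))
        out.append(Q - (4.0 / math.pi) * s)
    return out
```

Against the direct O(MN) sum this agrees to the series accuracy while touching each source and each target only p times.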
33 5. Other tricks: exploit the rapid saturation of the erfc function; space subdivision; choosing the parameters to achieve the error bound. See the technical report.
34 Numerical experiments
35 Precision vs Speedup
36 Ranking formulation –Training data – Pairwise preference relations –Loss function – WMW statistic –Function class – linear ranking functions Algorithm –Maximize a lower bound on WMW –Use conjugate-gradient –Quadratic complexity Fast algorithm –Use fast approximate gradient –Fast summation of erfc functions Results Plan of the talk
37 Datasets. 12 public benchmark datasets; five-fold cross-validation experiments. CG tolerance 1e-3; accuracy for the gradient computation 1e-6.
38 Direct vs fast – WMW statistic. [Per-dataset table of WMW values omitted.] The WMW is similar for both the exact and the fast approximate version.
39 Direct vs fast – time taken. [Per-dataset table of training times in seconds omitted.]
40 Effect of gradient approximation
41 Comparison with other methods RankNet - Neural network RankSVM - SVM RankBoost - Boosting
42 Comparison with other methods. The WMW is similar for all the methods. The proposed method is faster than all the other methods; the next best time is achieved by RankBoost. Only the proposed method can handle large datasets.
43 Sample result – Dataset 8 (N=950, d=10, S=5). [Table of time taken (secs) and WMW for RankNCG direct, RankNCG fast, RankNet (linear, two-layer), RankSVM (linear, quadratic), and RankBoost omitted.]
44 Sample result – Dataset 11 (N=4177, d=9, S=3). [Same comparison table omitted.]
45 Application to collaborative filtering. Predict movie ratings for a user based on the ratings provided by other users. MovieLens dataset: 1 million ratings (1-5), 3592 movies, 6040 users. Feature vector for each movie – the ratings provided by d other users.
46 Collaborative filtering results
47 Collaborative filtering results
48 Ranking formulation –Training data – Pairwise preference relations –Loss function – WMW statistic –Function class – linear ranking functions Algorithm –Maximize a lower bound on WMW –Use conjugate-gradient –Quadratic complexity Fast algorithm –Use fast approximate gradient –Fast summation of erfc functions Results –Similar accuracy as other methods –But much much faster Plan/Conclusion of the talk
49 Future work. Other applications – neural networks, probit regression. A nonlinear/kernelized variation. Code coming soon.
51 Thank You ! | Questions ?