1 A fast algorithm for learning large scale preference relations. Vikas C. Raykar and Ramani Duraiswami, University of Maryland, College Park; Balaji Krishnapuram, Siemens Medical Solutions USA. AISTATS 2007.

2 Learning Many learning tasks can be viewed as function estimation.

3 Learning from examples. [Diagram: training examples fed to a learning algorithm.] Not all supervised learning procedures fit in the standard classification/regression framework. In this talk we are mainly concerned with ranking/ordering.

4 Ranking / Ordering. For some applications ordering is more important. Example 1: Information retrieval – sort in the order of relevance.

5 Ranking / Ordering. For some applications ordering is more important. Example 2: Recommender systems – sort in the order of preference.

6 Ranking / Ordering. For some applications ordering is more important. Example 3: Medical decision making – decide over different treatment options.

7 Plan of the talk
– Ranking formulation
– Algorithm
– Fast algorithm
– Results

8 Preference relations. Given a preference relation we can order/rank a set of instances. Goal – learn a preference relation. Training data – a set of pairwise preferences.
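As a rough sketch of the setup (the notation below is assumed for these notes rather than copied from the slide images): each instance is a feature vector, and the training data is a set of ordered pairs.

```latex
% Instances x_i \in \mathbb{R}^d; observed pairwise preferences
\mathcal{P} = \{ (i, j) : x_i \succ x_j \}
% where x_i \succ x_j reads "instance i is preferred over instance j".
```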

9 Ranking function. Why not use a classifier/ordinal regressor as the ranking function? Goal – learn a preference relation. New goal – learn a ranking function, which provides a numerical score and is not unique.

10 Why is ranking different? [Diagram: the learning algorithm is trained on pairwise preference relations, and the loss counts pairwise disagreements.]

11 Training data, more formally. From these two we can get a set of pairwise preference relations.

12 Loss function – generalized Wilcoxon-Mann-Whitney (WMW) statistic. Minimize the fraction of pairwise disagreements, i.e., maximize the fraction of pairwise agreements: the total # of pairwise agreements divided by the total # of pairwise preference relations.
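Spelled out with the notation sketched above (the paper's generalized WMW additionally handles multiple ordered classes, so take this as an illustrative special case), the statistic is the fraction of correctly ordered pairs:

```latex
\mathrm{WMW}(f) \;=\; \frac{\displaystyle\sum_{(i,j) \in \mathcal{P}} \mathbf{1}\big[ f(x_i) \ge f(x_j) \big]}{|\mathcal{P}|}
```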

13 Consider a two class problem

14 Function class – linear ranking function. Different algorithms use different function classes: RankNet – neural network; RankSVM – RKHS; RankBoost – boosted decision stumps.
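For the linear function class used in the rest of the talk, the score is just an inner product with a weight vector (symbols assumed):

```latex
f(x) \;=\; w^{\top} x, \qquad w \in \mathbb{R}^{d}
```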

15 Plan of the talk
Ranking formulation
– Training data: pairwise preference relations
– Ideal loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
Fast algorithm
Results

16 The likelihood. Maximizing the WMW directly is a discrete optimization problem. Instead, model each pairwise preference with a sigmoid [Burges et al.], assume every pair is drawn independently, and choose w to maximize the log-likelihood.
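The slide's equations are not in the transcript; the standard Burges-style pairwise sigmoid model, which is what the bullet points describe, looks like this (notation assumed):

```latex
P(x_i \succ x_j \mid w) = \sigma\!\big( w^{\top}(x_i - x_j) \big), \quad \sigma(z) = \frac{1}{1 + e^{-z}}
\qquad\Longrightarrow\qquad
\mathcal{L}(w) = \sum_{(i,j) \in \mathcal{P}} \log \sigma\!\big( w^{\top}(x_i - x_j) \big)
```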

17 The MAP estimator
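The equation on this slide is not in the transcript; a typical MAP formulation, assuming a zero-mean Gaussian prior on w (equivalently an L2 penalty with strength λ), would be:

```latex
\widehat{w}_{\mathrm{MAP}} \;=\; \arg\max_{w} \Big[ \mathcal{L}(w) - \tfrac{\lambda}{2} \lVert w \rVert^{2} \Big]
```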

18 Another interpretation. The 0-1 indicator function is what we want to maximize; the log-sigmoid is what we actually maximize. The log-sigmoid is a lower bound for the indicator function.

19 Lower bounding the WMW: log-likelihood ≤ WMW.
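A sketch of why this holds (the scaling below is one common choice and is assumed here, not copied from the slide): for every z, the shifted log-sigmoid 1 + log₂ σ(z) lies below the 0-1 step, so averaging over pairs lower-bounds the WMW.

```latex
1 + \log_{2}\sigma(z) \;\le\; \mathbf{1}[z > 0] \quad \forall z
\;\;\Longrightarrow\;\;
\frac{1}{|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}} \Big( 1 + \log_{2}\sigma\big(w^{\top}(x_i - x_j)\big) \Big) \;\le\; \mathrm{WMW}(f)
```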

20 Gradient-based learning. Use a nonlinear conjugate-gradient algorithm. Requires only gradient evaluations: no function evaluations, no second derivatives. The gradient (shown on the slide) is a sum over all preference pairs.
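A minimal sketch of the naive gradient computation for the log-sigmoid objective above (assuming the plain penalized log-likelihood; this is not the paper's exact code). Each preference pair contributes one sigmoid-weighted difference, so the direct cost grows with the number of pairs, which is quadratic in the number of instances.

```python
import numpy as np

def naive_objective_and_gradient(w, X, pairs, lam=0.0):
    """Log-sigmoid lower bound on the WMW and its gradient, computed pair by pair.

    X     : (N, d) array of feature vectors
    pairs : list of (i, j) index pairs meaning "instance i is preferred over j"
    lam   : optional L2 penalty weight (MAP / Gaussian-prior interpretation)
    Cost is O(|pairs| * d), i.e. quadratic in N when all pairs are used.
    """
    obj = -0.5 * lam * (w @ w)
    grad = -lam * w
    for i, j in pairs:
        diff = X[i] - X[j]
        z = w @ diff
        obj += -np.logaddexp(0.0, -z)        # log sigma(z), written stably
        grad += diff / (1.0 + np.exp(z))     # sigma(-z) * (x_i - x_j)
    return obj, grad

# Toy usage: rank 2-D points by their first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
pairs = [(i, j) for i in range(50) for j in range(50) if X[i, 0] > X[j, 0]]
obj, grad = naive_objective_and_gradient(np.zeros(2), X, pairs)
```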

21 RankNet. [Diagram: training data – pairwise preference relations; loss – cross entropy; function class – neural net; trained with backpropagation.]

22 RankSVM. [Diagram: training data – pairwise preference relations; loss – pairwise disagreements; function class – RKHS; trained as an SVM.]

23 RankBoost. [Diagram: training data – pairwise preference relations; loss – pairwise disagreements; function class – decision stumps; trained with boosting.]

24 Plan of the talk
Ranking formulation
– Training data: pairwise preference relations
– Loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
Fast algorithm
Results

25 Key idea. Use an approximate gradient that can be computed in linear time. Converges to the same solution, at the cost of a few more iterations.

26 Core computational primitive: weighted summation of erfc functions.
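A sketch of this primitive in its direct form (the exact argument scaling inside the erfc is on the slide and is assumed here): for source points x₁..x_N with weights q₁..q_N and target points y₁..y_M, evaluate the weighted sum at every target, which costs O(MN) when done directly.

```python
import numpy as np
from scipy.special import erfc

def erfc_sum_direct(y, x, q, h=1.0):
    """Direct weighted summation of erfc functions: E(y_j) = sum_i q_i * erfc((y_j - x_i) / h).

    The exact scaling used in the paper may differ; this shows the O(M*N)
    structure that the fast algorithm replaces with O(p*(M+N)).
    """
    # Broadcasting builds the full M x N matrix of pairwise differences.
    return erfc((y[:, None] - x[None, :]) / h) @ q

# Toy usage
rng = np.random.default_rng(1)
x, q = rng.normal(size=1000), rng.normal(size=1000)
y = rng.normal(size=500)
E = erfc_sum_direct(y, x, q)
```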

27 Notion of approximation

28 Example

29 1. Beaulieu's series expansion. Retain only the first few terms contributing to the desired accuracy; derive error bounds to choose the number of terms.

30 2. Error bounds

31 3. Use truncated series

32 3. Regrouping. The terms A and B do not depend on y and can be computed in O(pN). Once A and B are precomputed, the sum at all targets can be computed in O(pM). Total cost reduced from O(MN) to O(p(M+N)).
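Schematically (a sketch with assumed symbols φₙ, ψₙ, not the paper's exact truncated series): once the erfc kernel is replaced by a p-term expansion that separates the source variable x from the target variable y, the double sum factorizes.

```latex
\mathrm{erfc}\!\left(\frac{y - x}{h}\right) \;\approx\; \sum_{n=1}^{p} \phi_n(x)\, \psi_n(y)
\quad\Longrightarrow\quad
E(y_j) \;\approx\; \sum_{n=1}^{p} \psi_n(y_j)
\underbrace{\left[ \sum_{i=1}^{N} q_i\, \phi_n(x_i) \right]}_{\text{precomputed once in } O(pN)}
```

Each of the M targets then costs only O(p) given the precomputed source-side sums, which is where the O(p(M+N)) total comes from.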

33 3. Other tricks: rapid saturation of the erfc function; space subdivision; choosing the parameters to achieve the error bound. See the technical report.

34 Numerical experiments

35 Precision vs Speedup

36 Plan of the talk
Ranking formulation
– Training data: pairwise preference relations
– Loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
Fast algorithm
– Use fast approximate gradient
– Fast summation of erfc functions
Results

37 Datasets. 12 public benchmark datasets; five-fold cross-validation experiments; CG tolerance 1e-3; accuracy for the gradient computation 1e-6.

38 Direct vs fast – WMW statistic. [Table: per-dataset WMW for the direct and the fast versions.] WMW is similar for both the exact and the fast approximate version.

39 Direct vs fast – time taken. [Table: per-dataset training time in seconds for the direct and the fast versions.]

40 Effect of gradient approximation

41 Comparison with other methods: RankNet – neural network; RankSVM – SVM; RankBoost – boosting.

42 Comparison with other methods. WMW is nearly the same for all the methods. The proposed method is faster than all the other methods; the next best time is achieved by RankBoost. Only the proposed method can handle large datasets.

43 Sample result – Dataset 8 (N=950, d=10, S=5). [Table: time taken (secs) and WMW for RankNCG direct, RankNCG fast, RankNet linear, RankNet two layer, RankSVM linear, RankSVM quadratic, and RankBoost.]

44 Sample result – Dataset 11 (N=4177, d=9, S=3). [Table: time taken (secs) and WMW for the same methods.]

45 Application to collaborative filtering. Predict movie ratings for a user based on the ratings provided by other users. MovieLens dataset: 1 million ratings (1-5), 3592 movies, 6040 users. Feature vector for each movie – the ratings provided by d other users.
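A minimal sketch of how such training data could be assembled (the array names and dense-matrix setup are illustrative assumptions, not the paper's preprocessing): movies rated by the target user become instances, their feature vectors are the ratings given by d other users, and pairwise preferences come from the target user's own ratings.

```python
import numpy as np

def pairwise_data_for_user(ratings, user, d=100, rng=None):
    """Build (features, preference pairs) for one target user.

    ratings : (n_users, n_movies) array, 0 = not rated, 1..5 = rating
    user    : index of the target user
    d       : number of other users whose ratings form each movie's feature vector
    """
    rng = rng or np.random.default_rng(0)
    rated = np.flatnonzero(ratings[user])                 # movies the target user rated
    others = rng.choice(np.delete(np.arange(ratings.shape[0]), user), size=d, replace=False)
    X = ratings[np.ix_(others, rated)].T                  # (n_rated_movies, d) feature matrix
    r = ratings[user, rated]                              # target user's own ratings
    # A movie with a strictly higher rating is preferred over one with a lower rating.
    pairs = [(i, j) for i in range(len(rated)) for j in range(len(rated)) if r[i] > r[j]]
    return X, pairs

# Toy usage with a random ratings matrix.
R = np.random.default_rng(2).integers(0, 6, size=(500, 200))
X, pairs = pairwise_data_for_user(R, user=0, d=50)
```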

46 Collaborative filtering results

47 Collaborative filtering results

48 Plan/Conclusion of the talk
Ranking formulation
– Training data: pairwise preference relations
– Loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
Fast algorithm
– Use fast approximate gradient
– Fast summation of erfc functions
Results
– Similar accuracy as other methods
– But much, much faster

49 Ranking formulation
– Training data: pairwise preference relations
– Loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
Fast algorithm
– Use fast approximate gradient
– Fast summation of erfc functions
Results
– Similar accuracy as other methods
– But much, much faster
Future work
– Other applications: neural network, probit regression
– Code coming soon

50 Ranking formulation
– Training data: pairwise preference relations
– Loss function: WMW statistic
– Function class: linear ranking functions
Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
Fast algorithm
– Use fast approximate gradient
– Fast summation of erfc functions
Results
– Similar accuracy as other methods
– But much, much faster
Future work
– Other applications: neural network, probit regression
– Nonlinear kernelized variation

51 Thank you! Questions?