Cost-Sensitive Learning for Large-Scale Hierarchical Classification of Commercial Products
Jianfu Chen, David S. Warren, Stony Brook University


Classification is a fundamental problem in information management: spam filtering assigns email content to spam or ham; product classification maps a textual product description to a commodity node in the UNSPSC taxonomy. [Figure: the four UNSPSC levels (Segment, Family, Class, Commodity), e.g. Vehicles and their Accessories and Components (25) > Motor vehicles (10) > Passenger motor vehicles (15) > Automobiles or cars (02), with sibling commodities such as Buses and Limousines.] The task: given a product description, find the commodity node it corresponds to.

How should we design a classifier for a given real-world task? Standard formulations abstract away the application context, but that context may matter when the classifier is applied to a real-life problem. We can improve a classifier's usefulness by looking at the details of the particular problem at hand.

Method 1: no design. Try off-the-shelf classifiers (SVM, logistic regression, decision tree, neural network, ...): fit f(x) on a training set and evaluate on a test set. Classification is a well-studied problem, so there is no need to reinvent the wheel; we can simply try a variety of standard classifiers and see which one works best. So far we are happy: we have a simple solution, and simplicity is beauty. What else do we need? We forgot to ask some simple questions: What is the classifier actually for? Why do we care about this particular classification task in the first place? How do we measure a classifier's performance according to our interests? Most standard classifiers answer these for us with an implicit assumption: that we want to minimize error rate, or equivalently, maximize accuracy. For many real-world tasks that is exactly right; for others, minimizing error rate is not what we actually want to achieve in practice.
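
A minimal scikit-learn sketch of Method 1 (my illustration, not code from the talk; the TF-IDF features and the specific candidate classifiers are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def pick_best(descriptions, labels):
    """Fit several off-the-shelf classifiers and keep the best one."""
    X = TfidfVectorizer().fit_transform(descriptions)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
    candidates = [LinearSVC(),
                  LogisticRegression(max_iter=1000),
                  DecisionTreeClassifier()]
    # Note: .score() is accuracy, i.e. exactly the implicit
    # "minimize error rate" assumption the talk questions.
    return max(candidates,
               key=lambda c: c.fit(X_tr, y_tr).score(X_te, y_te))
```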

Method 2: optimize what we really care about. Ask: what is the use of the classifier? How do we evaluate its performance according to our interests? Then quantify what we really care about, and optimize exactly that. This tightly couples performance evaluation and learning. (We cover the method quickly here and return to the extended methodology in the conclusion.)

Our task: hierarchical classification of commercial products, i.e. mapping a textual product description to a node in the UNSPSC taxonomy. [Figure: the same UNSPSC taxonomy fragment as above, with levels Segment, Family, Class, Commodity.]

A product taxonomy helps customers find desired products quickly: taxonomy organization complements keyword search (looking for gift ideas for a kid? Browse Toys & Games: dolls, puzzles, building toys, ...). It also facilitates exploring similar products, product recommendation, and corporate spend analysis.

We assume misclassification of products leads to revenue loss. [Figure: the textual description of a computer mouse. Classified correctly, under Desktop computers and accessories > mouse (next to keyboard), the product realizes its expected annual revenue; misclassified, e.g. as pet, the vendor loses part of the potential revenue.]

What do we really care about? A vendor's business goal is to maximize profit, i.e. to maximize revenue, or equivalently, to minimize revenue loss.

Observation 1: the misclassification cost of a product depends on its potential revenue.

Observation 2: the misclassification cost of a product depends on how far apart the true class and the predicted class are in the taxonomy. [Figure: the mouse description again. Predicting keyboard, a sibling commodity under Desktop computers and accessories, is a near miss; predicting pet, in a distant subtree, is a much costlier error.] A sketch of one way to compute this hierarchical distance follows.
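
The slides do not give a formula for the hierarchical distance; here is a minimal sketch under one common reading (distance = number of levels below the lowest common ancestor, so sibling commodities are at distance 1 and nodes in different segments at distance 4). The example codes are hypothetical:

```python
def tree_distance(path_y, path_yp):
    """Hierarchical distance between two commodity nodes in a 4-level
    taxonomy, each given as a (segment, family, class, commodity) tuple.
    Returns 0 for identical nodes, 4 when even the segments differ."""
    common = 0
    for a, b in zip(path_y, path_yp):
        if a != b:
            break
        common += 1
    return len(path_y) - common

# A mouse misclassified as a keyboard (same class, different commodity)
# vs. as a pet (different segment). Codes below are made up.
mouse    = ("43", "2117", "211708", "21170801")
keyboard = ("43", "2117", "211708", "21170802")
pet      = ("10", "1015", "101517", "10151701")
assert tree_distance(mouse, keyboard) == 1
assert tree_distance(mouse, pet) == 4
```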

The proposed performance evaluation metric: average revenue loss,

$R_{em} = \frac{1}{m} \sum_{(x, y, y') \in D} v(x) \cdot L_{y y'}$

where $v(x) \cdot L_{y y'}$ is the revenue loss of product x. The example weight $v(x)$ is the potential annual revenue of product x. The error function $L_{y y'}$ is the loss ratio: the percentage of the potential revenue a vendor will lose due to misclassification from class y to class y'. It is a non-decreasing monotonic function of the hierarchical distance between y and y', $L_{y y'} = f(d(y, y'))$, for example:

d(y, y'):  1    2    3    4
L_{y y'}:  0.2  0.4  0.6  0.8
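
A minimal sketch of the metric, assuming the example loss-ratio table above (i.e. f(d) = 0.2 d) and the tree_distance helper from the previous sketch:

```python
def loss_ratio(d):
    """Loss ratio f(d) from the example table: the fraction of potential
    revenue lost at hierarchical distance d (f(0) = 0: correct is free)."""
    return 0.2 * d

def average_revenue_loss(examples):
    """Empirical risk R_em. examples: list of (v, path_y, path_yp) where
    v is the product's potential annual revenue and path_y / path_yp are
    the true and predicted taxonomy paths."""
    return sum(v * loss_ratio(tree_distance(y, yp))
               for v, y, yp in examples) / len(examples)
```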

Learning: minimizing average revenue loss. The empirical risk $R_{em} = \frac{1}{m} \sum_{(x, y, y') \in D} v(x) \cdot L_{y y'}$ is not directly tractable to minimize, so we minimize a convex upper bound of it.

Multi-class SVM with margin re-scaling: require the score of the true class, $\theta_{y_i}^T x_i$, to exceed the score $\theta_{y'}^T x_i$ of every other class by a margin that scales with the loss, $L(x_i, y_i, y') = v(x_i) \cdot L_{y_i y'}$:

$\min_{\theta, \xi} \; \frac{1}{2} \|\theta\|^2 + \frac{C}{m} \sum_{i=1}^m \xi_i$
$\text{s.t.} \;\; \forall i, \forall y': \; \theta_{y_i}^T x_i - \theta_{y'}^T x_i \ge L(x_i, y_i, y') - \xi_i, \quad \xi_i \ge 0$

This objective is a convex upper bound of $\frac{1}{m} \sum_{i=1}^m L(x_i, y_i, y')$, and we can plug in any loss function:

0-1:      $[y_i \ne y']$ (error rate; the standard multi-class SVM)
VALUE:    $v(x_i) \, [y_i \ne y']$ (product revenue)
TREE:     $D(y_i, y')$ (hierarchical distance)
REVLOSS:  $v(x_i) \, L_{y_i y'}$ (revenue loss)

A training sketch follows.
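
A minimal NumPy sketch of one way to train this model, via a stochastic subgradient method with loss-augmented inference (my illustration rather than the authors' solver; the learning rate and update constants are assumptions):

```python
import numpy as np

def train(X, Y, loss, num_classes, C=1.0, lr=0.01, epochs=10, seed=0):
    """X: (m, n) feature matrix; Y: length-m integer labels;
    loss(y, yp): any plug-in loss from the table, with loss(y, y) == 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros((num_classes, n))
    for _ in range(epochs):
        for i in rng.permutation(m):
            x, y = X[i], Y[i]
            scores = theta @ x
            # Loss-augmented inference: the most violated constraint.
            aug = np.array([loss(y, yp) for yp in range(num_classes)]) + scores
            y_hat = int(np.argmax(aug))
            theta *= 1.0 - lr                  # gradient of (1/2)||theta||^2
            if aug[y_hat] > scores[y]:         # margin violated: hinge active
                theta[y] += lr * C * x         # raise the true class's score
                theta[y_hat] -= lr * C * x     # lower the violator's score
    return theta
```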

Dataset: UNSPSC (United Nations Standard Products and Services Code) data. Source: multiple online marketplaces oriented toward DoD and Federal government customers (GSA Advantage, DoD EMALL). Taxonomy: the UNSPSC taxonomy, a 4-level balanced tree. Size: 1.4M examples, 1073 leaf classes. Product revenues are simulated as revenue = price × sales.

Experimental results. [Figure: average revenue loss (in K$) of the different algorithms.]

What's wrong? In the constraint set of

$\min_{\theta, \xi} \; \frac{1}{2} \|\theta\|^2 + \frac{C}{m} \sum_{i=1}^m \xi_i \quad \text{s.t.} \; \forall i, \forall y' \ne y_i: \; \theta_{y_i}^T x_i - \theta_{y'}^T x_i \ge L(x_i, y_i, y') - \xi_i, \; \xi_i \ge 0$

the required margin is $L(x_i, y_i, y') = v(x_i) \cdot L_{y_i y'}$, and revenue loss ranges from a few K to several M. The margins therefore span orders of magnitude, and a few high-revenue products dominate the objective.

Loss normalization: linearly scale the loss function to a fixed range $[1, M_{max}]$, say $[1, 10]$:

$L^s(x, y, y') = 1 + \frac{L(x, y, y') - L_{min}}{L_{max} - L_{min}} \cdot (M_{max} - 1)$

The objective now upper bounds both the 0-1 loss and the average normalized loss. A sketch follows.
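
A minimal sketch of this normalization, assuming the raw losses have already been collected over the training set:

```python
def normalize_losses(losses, m_max=10.0):
    """Linearly rescale raw losses into [1, m_max], per the formula above."""
    lo, hi = min(losses), max(losses)
    if hi == lo:                     # degenerate case: all losses equal
        return [1.0 for _ in losses]
    return [1.0 + (l - lo) / (hi - lo) * (m_max - 1.0) for l in losses]
```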

Final results: a 7.88% reduction in average revenue loss! [Figure: average revenue loss (in K$) of the different algorithms.]

Conclusion. The general methodology: first ask what we really care about for this task (minimize error rate? minimize revenue loss?) and encode it as a performance evaluation metric, the empirical risk, i.e. the average misclassification cost

$R_{em} = \frac{1}{m} \sum_{(x, y, y') \in D} L(x, y, y') = \frac{1}{m} \sum_{(x, y, y') \in D} w(x) \cdot \Delta(y, y')$

Then ask how to approximate that metric to make it tractable: pick a model plus a tractable loss function, under regularized empirical risk minimization; a general method is the multi-class SVM with margin re-scaling and loss normalization. Finally, optimize to find the best parameters.

Thank you! Questions?