
1 Cost-Sensitive Learning for Large-Scale Hierarchical Classification of Commercial Products
Jianfu Chen, David S. Warren, Stony Brook University

2 Classification is a fundamental problem in information management.
[Figure: a textual product description mapped into the UNSPSC taxonomy, whose four levels are Segment, Family, Class, and Commodity; e.g., Vehicles and their Accessories and Components (25) > Motor vehicles (10) > Passenger motor vehicles (15) > Automobiles or cars (03), with siblings such as Buses (02), Limousines (06), Safety and rescue vehicles (17), and other segments such as Food Beverage and Tobacco Products (50) and Office Equipment and Accessories and Supplies (44). A spam/ham example contrasts flat binary classification.]
Introduce the classification problem: given a product description, find the commodity node it corresponds to.

3 How should we design a classifier for a given real-world task?
Standard classifiers abstract away from context, but context may matter when we apply them to a real-world problem. We can improve a classifier's usefulness for a particular problem by looking at the details of that particular instance.

4 Try Off-the-shelf Classifiers
Method 1. No design: split the data into a training set and a test set, learn f(x), and try off-the-shelf classifiers:
SVM
Logistic regression
Decision tree
Neural network
...
Classification is a well-studied problem, so there is no need to reinvent the wheel: we can try a variety of standard classifiers and see which one works best. So far we are very happy; we have a simple solution, and simplicity is beauty. What else do we need to do? We all forget to ask some simple questions. What is the use of the classifier? Why do we care about this particular classification task in the first place? How do we measure the performance of a classifier according to our interests? Standard classifiers do not ask what we care about: they simply assume we care about the error rate and try to minimize it, or equivalently, to maximize accuracy. For many real-world tasks that is exactly right, but for others minimizing error rate might not be exactly what we want to achieve in practice. A sketch of this method follows below.
Implicit assumption: we are trying to minimize error rate, or equivalently, maximize accuracy.
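As a concrete sketch of Method 1, assuming scikit-learn and a hypothetical load_products() that returns a feature matrix and commodity labels, one might compare the candidates like this:

```python
# A sketch of Method 1 with scikit-learn. load_products() is hypothetical;
# any feature matrix X and commodity-label vector y would do.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_products()  # hypothetical loader
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

candidates = {
    "SVM": LinearSVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(),
    "Neural network": MLPClassifier(),
}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    # The implicit assumption of Method 1: the score that matters is accuracy.
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```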

5 Method 2. Optimize what we really care about
What is the use of the classifier? How do we evaluate the performance of a classifier according to our interests? Quantify what we really care about, then optimize it. (Cover this very quickly here; the extended methodology moves to the conclusion.) This tightly couples performance evaluation and learning.

6 Hierarchical classification of commercial products
[Figure: a textual product description classified into the UNSPSC taxonomy (Segment > Family > Class > Commodity), e.g., Vehicles and their Accessories and Components (25) > Motor vehicles (10) > Passenger motor vehicles (15) > Automobiles or cars (03).]

7 Product taxonomy helps customers to find desired products quickly.
Facilitates exploring similar products
Helps product recommendation
Facilitates corporate spend analysis
Taxonomy organization complements keyword search: looking for gift ideas for a kid? Browse Toys&Games: dolls, puzzles, building toys, ...

8 We assume misclassification of products leads to revenue loss.
[Figure: the textual product description of a computer mouse. Correctly classified under Desktop computer and accessories > mouse, the product realizes an expected annual revenue; misclassified (e.g., under pet), the vendor loses part of the potential revenue.]

9 What do we really care about?
Maximize profit. A vendor's business goal is to maximize revenue, or equivalently, to minimize revenue loss.

10 Observation 1: the misclassification cost of a product depends on its potential revenue.

11 Observation 2: the misclassification cost of a product depends on how far apart the true class and the predicted class are in the taxonomy.
[Figure: the product description of a mouse again. Predicting a nearby sibling such as keyboard under Desktop computer and accessories is a cheaper mistake than predicting a distant class such as pet. Pet instead of car?]
A sketch of one way to compute this taxonomy distance follows below.
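As a concrete sketch of Observation 2, the snippet below computes one common taxonomy distance, the number of levels between a leaf and the lowest common ancestor of the two classes, which is consistent with the 1-4 range of d(y, y') used later for a 4-level tree. The path encoding and the codes in the example are hypothetical, not actual UNSPSC assignments.

```python
# A sketch of the taxonomy distance d(y, y') behind Observation 2, assuming
# each class is encoded by its path of codes from the root.
def hierarchical_distance(path_y, path_y_prime):
    """Levels from leaf y up to the lowest common ancestor of y and y'."""
    common = 0
    for a, b in zip(path_y, path_y_prime):
        if a != b:
            break
        common += 1
    return len(path_y) - common

mouse = (44, 10, 20, 1)     # hypothetical (segment, family, class, commodity)
keyboard = (44, 10, 20, 2)  # sibling commodity under the same class
pet = (10, 13, 15, 8)       # shares only the root with the mouse
print(hierarchical_distance(mouse, keyboard))  # 1: a cheap mistake
print(hierarchical_distance(mouse, pet))       # 4: an expensive one
```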

12 The proposed performance evaluation metric: average revenue loss
$$R_{em} = \frac{1}{m} \sum_{(x, y, y') \in D} v_x \cdot L_y^{y'}$$
The example weight $v_x$ is the potential annual revenue of product x.
The error function $L_y^{y'}$ is the loss ratio: the percentage of the potential revenue a vendor will lose due to misclassification from class y to class y'. It is a non-decreasing monotonic function of the hierarchical distance between y and y', $f(d(y, y'))$, for example:

d(y, y')    1    2    3    4
L_y^{y'}    0.2  0.4  0.6  0.8

(The metric is transcribed as code below.)
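A direct transcription of the metric as Python, assuming the loss-ratio table above and that a correct prediction (d = 0) incurs no loss; the function and variable names are illustrative.

```python
# The evaluation metric transcribed as Python. LOSS_RATIO follows the table
# above; mapping d = 0 (a correct prediction) to zero loss is an assumption.
LOSS_RATIO = {0: 0.0, 1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8}

def average_revenue_loss(examples, distance):
    """examples: list of (v_x, y, y') triples, where v_x is the product's
    potential annual revenue; distance: the taxonomy distance d(y, y')."""
    total = sum(v_x * LOSS_RATIO[distance(y, y_pred)]
                for v_x, y, y_pred in examples)
    return total / len(examples)
```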

13 Learning: minimizing average revenue loss
$$R_{em} = \frac{1}{m} \sum_{(x, y, y') \in D} v_x \cdot L_y^{y'}$$
The metric itself is not directly optimizable, so we minimize a convex upper bound instead.

14 Multi-class SVM with margin re-scaling
The margin between the true class score $\theta_{y_i}^T x_i$ and any other class score $\theta_{y'}^T x_i$ is re-scaled by the loss:
$$\theta_{y_i}^T x_i - \theta_{y'}^T x_i \ge L(x_i, y_i, y') = v_{x_i} \cdot L_{y_i}^{y'}$$
$$\min_{\theta, \xi} \frac{1}{2} \|\theta\|^2 + \frac{C}{m} \sum_{i=1}^m \xi_i$$
$$\text{s.t. } \forall i, \forall y': \; \theta_{y_i}^T x_i - \theta_{y'}^T x_i \ge L(x_i, y_i, y') - \xi_i, \quad \xi_i \ge 0$$
(The per-example slack is sketched as code below.)
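A minimal NumPy sketch of the per-example slack implied by these constraints; theta, cost_row, and the function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Per-example slack implied by the margin re-scaling constraints:
#   xi_i = max_{y'} ( L(x_i, y_i, y') + theta_{y'}^T x_i ) - theta_{y_i}^T x_i
# The full objective adds (1/2)||theta||^2 + (C/m) * sum_i xi_i on top.
def margin_rescaled_slack(theta, x, y, cost_row):
    """theta: (num_classes, num_features); x: (num_features,);
    cost_row[y']: v_x * L_y^{y'}, with cost_row[y] == 0."""
    scores = theta @ x                 # every class's score on x
    augmented = scores + cost_row      # loss-augmented scores
    return float(np.max(augmented) - scores[y])
```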

15 Multi-class SVM with margin re-scaling
This objective is a convex upper bound of $\frac{1}{m} \sum_{i=1}^m L(x_i, y_i, y')$:
$$\min_{\theta, \xi} \frac{1}{2} \|\theta\|^2 + \frac{C}{m} \sum_{i=1}^m \xi_i$$
$$\text{s.t. } \forall i, \forall y': \; \theta_{y_i}^T x_i - \theta_{y'}^T x_i \ge L(x_i, y_i, y') - \xi_i, \quad \xi_i \ge 0$$
We can plug in any loss function (sketched below):

0-1       $[y_i \ne y']$               error rate (standard multi-class SVM)
VALUE     $v_{x_i} [y_i \ne y']$       product revenue
TREE      $D(y_i, y')$                 hierarchical distance
REVLOSS   $v_{x_i} L_{y_i}^{y'}$       revenue loss
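The four plug-in losses transcribed as Python for concreteness. The uniform signature, with the revenue v_x, distance function d, and loss-ratio function f passed in, is an assumption of this sketch, not the paper's interface.

```python
# The four loss functions that can be plugged into the margin re-scaling
# constraint; each returns L(x, y, y') for one candidate class y'.
def zero_one(y, y_prime, v_x, d, f):   # 0-1: standard multi-class SVM
    return float(y != y_prime)

def value(y, y_prime, v_x, d, f):      # VALUE: weight every error by revenue
    return v_x * float(y != y_prime)

def tree(y, y_prime, v_x, d, f):       # TREE: hierarchical distance
    return float(d(y, y_prime))

def revloss(y, y_prime, v_x, d, f):    # REVLOSS: revenue loss
    return v_x * f(d(y, y_prime))
```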

16 Dataset
UNSPSC (United Nations Standard Products and Services Code) dataset. Product revenues are simulated: revenue = price * sales.
Data sources: multiple online marketplaces oriented toward DoD and Federal government customers (GSA Advantage, DoD EMALL)
Taxonomy structure: 4-level balanced tree (the UNSPSC taxonomy)
#examples: 1.4M
#leaf classes: 1,073

17 Experimental results
[Table: average revenue loss (in K$) of different algorithms.]

18 What’s wrong? min 𝜃,𝜉 1 2 𝜃 2 + 𝐶 𝑚 𝑖=1 𝑚 𝜉 𝑖 𝑠.𝑡. ∀𝑖, ∀ 𝑦 ′ ≠ 𝑦 𝑖 :
𝜃 𝑦 𝑖 𝑇 𝑥 𝑖 − 𝜃 𝑦 ′ 𝑇 𝑥 𝑖 ≥𝐿 𝑥 𝑖 , 𝑦 𝑖 , 𝑦 ′ − 𝜉 𝑖 𝜉 𝑖 ≥0 𝑣 𝑥 𝑖 ⋅ 𝐿 𝑦 𝑖 𝑦 ′ Revenue loss ranges from a few K to several M

19 Loss normalization
Linearly scale the loss function to a fixed range $[1, M_{max}]$, say $[1, 10]$:
$$L^s(x, y, y') = 1 + \frac{L(x, y, y') - L_{min}}{L_{max} - L_{min}} \cdot (M_{max} - 1)$$
The objective now upper bounds both the 0-1 loss and the average normalized loss. (A one-line transcription follows below.)
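The normalization as Python, assuming L_min and L_max are the smallest and largest raw losses observed on the training set; the default M_max = 10 matches the [1, 10] example above.

```python
# Linearly rescale a raw loss into [1, M_max] so required margins no longer
# span orders of magnitude.
def normalize_loss(L, L_min, L_max, M_max=10.0):
    return 1.0 + (L - L_min) / (L_max - L_min) * (M_max - 1.0)
```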

20 Final results
A 7.88% reduction in average revenue loss!
[Table: average revenue loss (in K$) of different algorithms.]

21 Conclusion
The empirical risk is the average misclassification cost:
$$R_{em} = \frac{1}{m} \sum_{(x, y, y') \in D} L(x, y, y') = \frac{1}{m} \sum_{(x, y, y') \in D} w_x \cdot \Delta(y, y')$$
What do we really care about for this task, minimizing error rate or minimizing revenue loss? The answer defines the performance evaluation metric.
How do we approximate the performance evaluation metric to make it tractable? A model plus a tractable loss function, via regularized empirical risk minimization.
Optimization: find the best parameters.
A general method: multi-class SVM with margin re-scaling and loss normalization.

22 Thank you! Questions?

