Presentation on theme: "Jianfu Chen, David S. Warren Stony Brook University"— Presentation transcript:
1 Cost-Sensitive Learning for Large-Scale Hierarchical Classification of Commercial Products
Jianfu Chen, David S. Warren
Stony Brook University
2 Classification is a fundamental problem in information management.
[Figure: a spam/ham example, and a textual product description mapped into the four-level UNSPSC taxonomy (Segment, Family, Class, Commodity), e.g. Vehicles and their Accessories and Components (25) > Motor vehicles (10) > Passenger motor vehicles (15) > Automobiles or cars (03); siblings include Buses (02), Limousines (06), and other segments such as Food Beverage and Tobacco Products (50).]
Speaker note: introduce the classification problem. Given a product description, find the commodity node it corresponds to.
3 How should we design a classifier for a given real-world task?
A standard classifier is an abstraction that ignores context, but context may matter when applying it to a real-world problem. We can improve a classifier's usefulness for a particular problem by looking at the details of that particular instance.
4 Method 1: no design. Try off-the-shelf classifiers.
Training set -> f(x) -> test set. Try a variety of off-the-shelf classifiers: SVM, logistic regression, decision tree, neural network, ...
Classification is a well-studied problem, so there is no need to reinvent the wheel: we can simply try a variety of standard classifiers and see which one works best. So far we are very happy: we have a simple solution, and simplicity is beauty. What else do we need to do?
We often forget to ask some simple questions. What is the classifier for? Why do we care about this particular classification task in the first place? How do we measure the performance of a classifier according to our interests?
Most standard classifiers do not ask what we care about. They simply assume we care about the error rate and try to minimize it, or equivalently, to maximize accuracy. For many real-world tasks that is exactly what we want, but for others minimizing error rate is not what we actually need to achieve in practice.
Implicit assumption: we are trying to minimize error rate, or equivalently, maximize accuracy.
5 Method 2: optimize what we really care about.
What is the classifier for? How do we evaluate the performance of a classifier according to our interests?
Quantify what we really care about, then optimize what we care about. This tightly couples performance evaluation and learning.
Speaker note: cover this very quickly here, and move the extended methodology to the conclusion.
6 Hierarchical classification of commercial products
[Figure: a textual product description mapped into the four-level UNSPSC taxonomy (Segment, Family, Class, Commodity), e.g. Vehicles and their Accessories and Components (25) > Motor vehicles (10) > Passenger motor vehicles (15) > Automobiles or cars (03), with siblings such as Buses (02) and Limousines (06).]
7 A product taxonomy helps customers find desired products quickly.
- Facilitates exploring similar products
- Helps product recommendation
- Facilitates corporate spend analysis
[Figure: looking for gift ideas for a kid? Browse Toys & Games > dolls, puzzles, building toys, ... Taxonomy organization complements keyword search.]
8 We assume misclassification of products leads to revenue loss.
[Figure: a textual product description of a mouse. Classified correctly under Desktop computer and accessories > mouse, the vendor realizes an expected annual revenue; misclassified (e.g. as pet), the vendor loses part of the potential revenue.]
9 What do we really care about?
Maximize profit? A vendor's business goal is to maximize revenue, or equivalently, to minimize revenue loss.
10 Observation 1: the misclassification cost of a product depends on its potential revenue.
11 Observation 2: the misclassification cost of a product depends on how far apart the true class and the predicted class are in the taxonomy.
[Figure: a textual product description of a mouse. Predicting pet instead of Desktop computer and accessories > mouse is a more distant, and hence costlier, error than predicting keyboard, a sibling under the same class.]
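One plausible reading of this observation can be sketched in code. The distance function below is an assumption consistent with the 4-level balanced tree and the loss table on the next slide: d(y, y') counts the levels from the leaf classes up to their lowest common ancestor. The parent map and node codes are made up for illustration.

```python
def ancestors(parent, node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def hier_distance(parent, y_true, y_pred):
    """Levels between two leaves and their lowest common ancestor.
    Assumes both leaves sit at the same depth (a balanced tree)."""
    if y_true == y_pred:
        return 0
    up_true = ancestors(parent, y_true)
    up_pred = set(ancestors(parent, y_pred))
    for d, node in enumerate(up_true):
        if node in up_pred:
            return d
    raise ValueError("nodes are not in the same tree")

# A tiny UNSPSC-style tree (hypothetical codes): 25 and 50 are
# segments; 03, 02, 18, 47 are commodities (the leaves).
PARENT = {
    "25": "root", "50": "root",
    "10": "25", "45": "50",
    "15": "10", "17": "10", "46": "45",
    "03": "15", "02": "15", "18": "17", "47": "46",
}
```

Under this definition, sibling commodities are at distance 1, while commodities under different segments are at distance 4, matching the range of the loss table that follows.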
12 The proposed performance evaluation metric: average revenue loss.

R_em = (1/m) * sum_{(x, y, y') in D} v(x) * L(y, y')

Each term v(x) * L(y, y') is the revenue loss of product x:
- The example weight v(x) is the potential annual revenue of product x.
- The error function L(y, y') is the loss ratio: the percentage of the potential revenue a vendor will lose due to misclassification from class y to class y'. It is a non-decreasing monotonic function of the hierarchical distance between y and y', L(y, y') = f(d(y, y')):

  d(y, y')  | 1   | 2   | 3   | 4
  L(y, y')  | 0.2 | 0.4 | 0.6 | 0.8
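The metric itself is a short computation. A minimal sketch, assuming the evaluation set has already been reduced to (revenue, hierarchical distance) pairs and using 0.2 per level of separation as the loss ratio f(d); the sample revenues are made up:

```python
def average_revenue_loss(samples, loss_ratio):
    """R_em: mean of v(x) * L(y, y') over the evaluation set.
    samples: (v(x), d(y, y')) pairs; a correct prediction has d = 0."""
    return sum(v * loss_ratio(d) for v, d in samples) / len(samples)

# Hypothetical evaluation set: one correct prediction, one two-level
# error, one cross-segment (four-level) error.
samples = [(100_000, 0), (50_000, 2), (10_000, 4)]
r_em = average_revenue_loss(samples, lambda d: 0.2 * d)
```

Note that a high-revenue product classified correctly contributes nothing, while even a low-revenue product misclassified across segments contributes 80% of its revenue.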
16 Dataset
UNSPSC (United Nations Standard Product and Service Code) dataset. Product revenues are simulated: revenue = price * sales.
- data source: multiple online marketplaces oriented to DoD and Federal government customers (GSA Advantage, DoD EMALL)
- taxonomy structure: 4-level balanced tree (the UNSPSC taxonomy)
- #examples: 1.4M
- #leaf classes: 1073
17 Experimental results
[Figure: average revenue loss (in K$) of different algorithms.]
18 What’s wrong?
The multi-class SVM with margin re-scaling uses the loss directly as the required margin:

min_{theta, xi} (1/2)||theta||^2 + (C/m) * sum_{i=1}^{m} xi_i
s.t. for all i and all y' != y_i:
  theta_{y_i}^T x_i - theta_{y'}^T x_i >= L(x_i, y_i, y') - xi_i
  xi_i >= 0

where L(x_i, y_i, y') = v(x_i) * L(y_i, y') is the revenue loss, which ranges from a few K to several M. Margins of such wildly different scales let a few high-revenue products dominate the objective.
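The slack variables make the scale problem concrete. Below is a pure-Python sketch of the slack implied by the constraints above, xi = max(0, max over y' != y of L(x, y, y') - (theta_y . x - theta_{y'} . x)); the weight vectors and loss values in the usage example are hypothetical:

```python
def margin_rescaled_slack(theta, x, y, loss):
    """theta: one weight vector per class; loss[y_alt] = v(x) * L(y, y_alt).
    Margin re-scaling: the true class y must beat each class y_alt
    by at least loss[y_alt]; the slack is the worst violation."""
    score = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in theta]
    worst = 0.0
    for y_alt, required in enumerate(loss):
        if y_alt == y:
            continue
        worst = max(worst, required - (score[y] - score[y_alt]))
    return max(0.0, worst)
```

With revenue-scaled losses, a single product whose loss is in the millions demands a margin millions wide, so its slack dwarfs every other term in the objective.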
19 Loss normalization
Linearly scale the loss function to a fixed range [1, M_max], say [1, 10]:

L^s(x, y, y') = 1 + ((L(x, y, y') - L_min) / (L_max - L_min)) * (M_max - 1)

The objective now upper bounds both the 0-1 loss and the average normalized loss.
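The rescaling is one line of arithmetic. A sketch, assuming L_min and L_max are the smallest and largest raw losses observed on the training set; the example bounds (1K and 5M, echoing the "few K to several M" range) are made up:

```python
def normalize_loss(raw, raw_min, raw_max, m_max=10.0):
    """Linearly scale a raw loss from [raw_min, raw_max] into [1, m_max]."""
    return 1.0 + (raw - raw_min) / (raw_max - raw_min) * (m_max - 1.0)
```

The lower bound of 1 matters: every mistake still costs at least as much as a plain 0-1 error, which is why the objective upper bounds the 0-1 loss as well.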
20 Final results
[Figure: average revenue loss (in K$) of different algorithms.]
7.88% reduction in average revenue loss!
21 Conclusion
What do we really care about for this task? Minimizing error rate, or minimizing revenue loss? Answering this question gives the performance evaluation metric: the empirical risk, i.e. the average misclassification cost

R_em = (1/m) * sum_{(x, y, y') in D} L(x, y, y') = (1/m) * sum_{(x, y, y') in D} w(x) * Delta(y, y')

How do we approximate the performance evaluation metric to make it tractable? Regularized empirical risk minimization: choose a model and a tractable loss function, then optimize to find the best parameters. A general method: multi-class SVM with margin re-scaling and loss normalization.