
Large Scale Multi-Label Classification via MetaLabeler
Lei Tang, Arizona State University
Suju Rajan and Vijay K. Narayanan, Yahoo! Data Mining & Research

Large Scale Multi-Label Classification
Huge number of instances and categories
Common for online content:
Query categorization
Web page classification
Social bookmark/tag recommendation
Video annotation/organization

Challenges
Multi-class: thousands of categories
Multi-label: each instance has more than one label
Large scale: huge numbers of instances and categories
Our query categorization problem: 1.5M queries, 7K categories
Yahoo! Directory: 792K docs, 246K categories (Liu et al. 05)
Most existing multi-label methods do not scale: structural SVM, mixture models, collective inference, maximum-entropy models, etc.
The simplest One-vs-Rest SVM is still widely used

One-vs-Rest SVM
Example: 4 categories C1-C4 and 4 instances with true labels x1: {C1, C3}, x2: {C1, C2, C4}, x3: {C2}, x4: {C2, C4}.
One binary SVM is trained per category (SVM1-SVM4): instances carrying that label are positives, all other instances are negatives.
[Figure: the four binary training sets and the per-category predictions for each instance]
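To make this concrete, here is a minimal sketch in Python with scikit-learn that reproduces the toy setup above; the feature values are invented for illustration, and OneVsRestClassifier stands in for any one-vs-rest SVM implementation:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy data mirroring the slide: 4 instances x1..x4, 4 categories C1..C4.
# Feature values are made up for illustration.
X = np.array([[1.0, 0.2],
              [0.9, 1.0],
              [0.1, 0.8],
              [0.3, 1.0]])
# Multi-label targets as a binary indicator matrix (columns: C1, C2, C3, C4).
Y = np.array([[1, 0, 1, 0],   # x1 -> C1, C3
              [1, 1, 0, 1],   # x2 -> C1, C2, C4
              [0, 1, 0, 0],   # x3 -> C2
              [0, 1, 0, 1]])  # x4 -> C2, C4

# One binary SVM per category: positives are the instances carrying that
# label, negatives are all other instances.
ovr = OneVsRestClassifier(LinearSVC())
ovr.fit(X, Y)

# Real-valued scores per category; sorting them gives a label ranking.
scores = ovr.decision_function(X)
print(scores.shape)  # (4, 4): one row per instance, one score per category
```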

One-vs-Rest SVM
Pros:
Simple, fast, scalable
Each label is trained independently, so training is easy to parallelize
Cons:
Highly skewed class distribution (few +, many -)
Biased prediction scores
Still, it outputs a reasonably good ranking (Rifkin and Klautau 04)
e.g. with 4 categories C1, C2, C3, C4 and true labels C1, C3 for x1, the prediction scores satisfy {s1, s3} > {s2, s4}
Can we then predict the number of labels to keep?

MetaLabeler Algorithm
1. Obtain a ranking of class membership for each instance
Any generic ranking algorithm can be applied; here, One-vs-Rest SVM is used
2. Build a Meta Model to predict the number of top classes to keep
Construct meta labels
Construct meta features
Build the meta model

Meta Model – Training
Example queries with their labels:
Q1 = affordable cocktail dress; labels: Formal wear, Women Clothing (2 labels)
Q2 = cotton children jeans; labels: Children Clothing (1 label)
Q3 = leather fashion in 1990s; labels: Fashion, Women Clothing, Leather Clothing (3 labels)
Meta data (query: #labels): Q1: 2, Q2: 1, Q3: 3
Train a regression Meta-Model on top of the One-vs-Rest SVM to predict the number of labels
How to handle predictions like 2.5 labels?
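A minimal sketch of this meta-model training step, continuing the toy X, Y, and ovr objects from the earlier sketch; the choice of LinearSVR as the regressor and the round-and-clip rule for fractional predictions such as 2.5 are assumptions, not necessarily the exact setup in the paper:

```python
import numpy as np
from sklearn.svm import LinearSVR

# Meta label for each training instance = how many true labels it has,
# e.g. Q1 -> 2, Q2 -> 1, Q3 -> 3 in the slide's example.
meta_labels = Y.sum(axis=1)

# Meta features: here the raw prediction scores of the one-vs-rest SVMs
# (see the next slide for the content/score/rank-based variants).
meta_features = ovr.decision_function(X)

# Regression meta model that predicts the number of labels.
meta_model = LinearSVR()
meta_model.fit(meta_features, meta_labels)

# Fractional outputs such as 2.5 are mapped back to a valid label count,
# e.g. by rounding and clipping to at least one label (an assumed rule).
raw = meta_model.predict(meta_features)
num_labels = np.clip(np.rint(raw), 1, Y.shape[1]).astype(int)
```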

Meta Feature Construction
Content-based: use the raw data (the raw data contains all the information)
Score-based: use the prediction scores (the bias in the scores might be learned)
Rank-based: use the sorted prediction scores
Example: scores for (C1, C2, C3, C4) = (0.9, -0.2, 0.7, -0.6)
Score-based meta feature: (0.9, -0.2, 0.7, -0.6); rank-based meta feature: (0.9, 0.7, -0.2, -0.6)
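The three variants amount to simple transformations of the raw data and the SVM scores; a sketch follows, with function names made up for illustration:

```python
import numpy as np

def content_meta_features(X):
    """Content-based: use the raw input features directly."""
    return X

def score_meta_features(ovr, X):
    """Score-based: use the per-category prediction scores."""
    return ovr.decision_function(X)

def rank_meta_features(ovr, X):
    """Rank-based: sort each instance's scores in descending order,
    discarding which category produced which score."""
    scores = ovr.decision_function(X)
    return -np.sort(-scores, axis=1)

# For the slide's example scores (C1..C4) = (0.9, -0.2, 0.7, -0.6):
# score-based feature vector: [0.9, -0.2, 0.7, -0.6]
# rank-based feature vector:  [0.9, 0.7, -0.2, -0.6]
```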

MetaLabeler Prediction
Given one instance:
1. Obtain the ranking over all labels
2. Use the meta model to predict the number of labels
3. Pick that many top-ranking labels
MetaLabeler is easy to implement: existing SVM packages/software can be used directly
It can also be combined with a hierarchical structure easily: simply build a Meta Model at each internal node
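Putting the pieces together, the prediction step for one instance could look like the following sketch, reusing the hypothetical ovr and meta_model objects from the earlier sketches (the meta model is assumed to take score-based meta features):

```python
import numpy as np

def metalabeler_predict(x, ovr, meta_model, category_names):
    """Rank all labels, predict how many to keep, and return the top-ranking ones."""
    scores = ovr.decision_function(x.reshape(1, -1))[0]   # one score per category
    ranking = np.argsort(-scores)                         # best-ranked category first
    raw_k = meta_model.predict(scores.reshape(1, -1))[0]  # may be fractional, e.g. 2.5
    k = int(np.clip(np.rint(raw_k), 1, len(category_names)))
    return [category_names[i] for i in ranking[:k]]

# Usage on the toy data from the earlier sketches:
# metalabeler_predict(X[0], ovr, meta_model, ["C1", "C2", "C3", "C4"])
```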

Baseline Methods
Existing thresholding methods (Yang 2001):
Rank-based cut (RCut): output a fixed number of top-ranking labels for each instance
Proportion-based cut: for each label, choose a proportion of test instances as positive; not applicable for online prediction
Score-based cut (SCut, a.k.a. threshold tuning): for each label, determine a threshold based on cross-validation; tends to overfit and is not very stable
MetaLabeler can be viewed as a local RCut method: it customizes the number of labels for each instance
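For comparison, the RCut and SCut baselines can be sketched as follows; the per-label threshold search below is a simple scan over observed score values, which is only one of several possible tuning schemes:

```python
import numpy as np
from sklearn.metrics import f1_score

def rcut(scores, r):
    """RCut: keep the top-r ranked labels for every instance."""
    pred = np.zeros_like(scores, dtype=int)
    top = np.argsort(-scores, axis=1)[:, :r]
    for i, idx in enumerate(top):
        pred[i, idx] = 1
    return pred

def scut(scores, Y_true):
    """SCut: tune one threshold per label on held-out data to maximize F1."""
    thresholds = np.zeros(scores.shape[1])
    for j in range(scores.shape[1]):
        best_f1, best_t = -1.0, 0.0
        for t in np.unique(scores[:, j]):
            f1 = f1_score(Y_true[:, j], (scores[:, j] >= t).astype(int),
                          zero_division=0)
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[j] = best_t
    return thresholds
```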

Publicly Available Benchmark Data
Yahoo! Web Page Classification
11 data sets, each constructed from a top-level category; 2nd-level topics are the categories
16-32k instances, 6-15k features, 14-23 categories
1.2-1.6 labels per instance, at most 17 labels
Each label has at least 100 instances
RCV1: a large-scale text corpus
101 categories, 3.2 labels per instance
For evaluation purposes, 3000 documents are used for training and 3000 for testing
Highly skewed distribution (some labels have only 3-4 instances)

MetaLabeler with Different Meta Features
Which type of meta feature is more predictive?
The content-based MetaLabeler outperforms the other meta features
Yahoo! performance is averaged over the 11 data sets (please refer to the paper for details); RCV1 performance is averaged over 5-fold cross-validation

Performance Comparison
MetaLabeler tends to outperform the other methods

Bias in MetaLabeler
The distribution of the number of labels is imbalanced: most instances have a small number of labels, while a small portion of instances have many more
This imbalanced distribution leads to a bias in MetaLabeler: it prefers to predict fewer labels and only predicts many labels with strong confidence
The Society data set is chosen for this analysis as it has the largest number of categories (23)

Scalability Study
Each curve shows the total computation time. This differs slightly from the figure in the paper, which shows only the additional time for MetaLabeler and threshold tuning.
Threshold tuning requires cross-validation; otherwise it overfits
MetaLabeler simply adds some meta labels and learns additional One-vs-Rest SVMs

Scalability Study (cont.)
Threshold tuning: cost increases linearly with the number of categories in the data
E.g. 6,000 categories -> 6,000 thresholds to be tuned
MetaLabeler: cost is upper bounded by the maximum number of labels per instance
E.g. with 6,000 categories but at most 15 labels per instance, only 15 additional binary SVMs need to be learned
The Meta Model is "independent" of the number of categories

Application to Large Scale Query Categorization
Query categorization problem:
1.5 million unique queries: 1M for training, 0.5M for testing
120k features
An 8-level taxonomy of 6,433 categories
Multiple labels, e.g. for the query "0% interest credit card no transfer fee":
Financial Services/Credit, Loans and Debt/Credit/Credit Card/Balance Transfer
Financial Services/Credit, Loans and Debt/Credit/Credit Card/Low Interest Card
Financial Services/Credit, Loans and Debt/Credit/Credit Card/Low-No-fee Card
1.23 labels on average, at most 26 labels

Flat Model
The flat model does not leverage the hierarchical structure
Threshold tuning on the training data alone takes 40 hours, while MetaLabeler takes 2 hours. Here, thresholds are tuned on the training data only, with no cross-validation, as some categories have very few instances.
A similar pattern is observed for Macro-F1.

Hierarchical Model - Training
At each node of the taxonomy, starting from the root:
Step 1: Generate training data (the instances routed to this node)
Step 2: Roll up labels to the node's children
Step 3: Create an "Other" category
Step 4: Train a One-vs-Rest SVM over the children and "Other"
[Figure: taxonomy diagram illustrating the four steps from the original training data to the new training data]
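A rough sketch of the "roll up labels" and "Other" steps at a single internal node; the path representation (tuples of category names) and the helper name are assumptions made for illustration:

```python
def roll_up(full_label_paths, node_path):
    """Map each deep label to the child of `node_path` it passes through.
    Labels that end exactly at this node go to a synthetic 'Other' class.
    Paths are tuples, e.g. ('Financial Services', 'Credit, Loans and Debt', 'Credit')."""
    depth = len(node_path)
    rolled = set()
    for path in full_label_paths:
        if path[:depth] != node_path:
            continue                      # label not under this node
        if len(path) > depth:
            rolled.add(path[depth])       # the child category it passes through
        else:
            rolled.add("Other")           # label stops at this node
    return rolled

# A One-vs-Rest SVM (plus a Meta Model, as described earlier) is then trained
# per node on the rolled-up labels of the instances routed to that node.
```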

Hierarchical Model - Prediction
For a query q, predict using the SVMs trained at the root level, then descend into each predicted child and repeat
Stop when reaching a leaf node or the "Other" category
[Figure: example of query q being routed down the taxonomy, stopping at leaves and at "Other"]
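The top-down prediction can be sketched as a recursion over the taxonomy; the node objects, their children and path attributes, and the predict_labels method wrapping each node's One-vs-Rest SVM and Meta Model are all hypothetical:

```python
def predict_top_down(query_features, node):
    """Descend the taxonomy from `node`, collecting full label paths."""
    results = []
    # node.model is the one-vs-rest SVM + MetaLabeler trained at this node.
    for child_name in node.model.predict_labels(query_features):
        if child_name == "Other" or child_name not in node.children:
            results.append(node.path)                 # stop at this node
            continue
        child = node.children[child_name]
        if not child.children:                        # leaf node: stop here
            results.append(child.path)
        else:
            results.extend(predict_top_down(query_features, child))
    return results

# Usage (hypothetical): labels = predict_top_down(x_query, taxonomy_root)
```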

Hierarchical Model + MetaLabeler
Precision decreases by 1-2%, but recall improves by 10% at deeper levels.

Features in MetaLabeler
Example queries and the related categories they activate:
Overstock.com: Mass Merchants/…/discount department stores; Apparel & Jewelry; Electronics & Appliances; Home & Garden; Books-Movies-Music-Tickets
Blizard: Toys & Hobbies/…/Video Game; Computing/…/Computer Game Software; Entertainment & Social Event/…/Fast Food Restaurant; Reference/News/Weather Information
Threading: Books-Movies-Music-Tickets/…/Computing Books; Computing/…/Programming; Health and Beauty/…/Unwanted Hair; Toys and Hobbies/…/Sewing

Conclusions & Future Work
MetaLabeler is promising for large-scale multi-label classification
Core idea: learn a meta model to predict the number of labels
Simple, efficient and scalable; uses existing SVM software directly; easy for practical deployment
Future work:
How to optimize MetaLabeler for a desired performance level, e.g. > 95% precision?
Application to social-networking-related tasks

Questions?

References
Liu, T., Yang, Y., Wan, H., Zeng, H., Chen, Z., and Ma, W. 2005. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explor. Newsl. 7, 1 (Jun. 2005), 36-43.
Rifkin, R. and Klautau, A. 2004. In Defense of One-Vs-All Classification. J. Mach. Learn. Res. 5 (Dec. 2004), 101-141.
Yang, Y. 2001. A study of thresholding strategies for text categorization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, Louisiana, United States). SIGIR '01. ACM, New York, NY, 137-145.

Hierarchical vs. Flat Model
Flat model: build a one-vs-rest SVM over all the labels, with no taxonomy information used during training
The hierarchical model has about 5% higher recall at deeper levels.