
1 Efficient Text Categorization with a Large Number of Categories Rayid Ghani KDD Project Proposal

2 Text Categorization  Numerous applications: Search Engines/Portals, Customer Service, Email Routing, …  Domains: Topics, Genres, Languages  Making $$$

3 How do people deal with a large number of classes?  Use fast multiclass algorithms (Naïve Bayes) that build one model per class  Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems  What happens with a 1000-class problem?  Can we do better?

4 ECOC to the Rescue!  An n-class problem can be solved by solving log₂ n binary problems  More efficient than one-per-class  Does it actually perform better?
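To make the efficiency claim concrete, a quick back-of-the-envelope check in Python (illustrative, not from the slides):

```python
import math

n_classes = 1000
one_per_class = n_classes                       # one binary model per class
ecoc_minimum = math.ceil(math.log2(n_classes))  # shortest code giving every class a distinct codeword

print(one_per_class)  # 1000 binary problems
print(ecoc_minimum)   # 10 binary problems
```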

5 What is ECOC?  Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)  Use a learner to learn the binary problems

6 Training ECOC  Each class is assigned a binary codeword; each column defines one binary problem for a base learner:

       f1 f2 f3 f4 f5
    A   0  0  1  1  0
    B   1  0  1  0  0
    C   0  1  1  1  0
    D   0  1  0  0  1

Testing ECOC  The five trained classifiers predict the string X = 1 1 1 1 0 for a test instance
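A minimal training sketch of this decomposition (illustrative only: it assumes the code matrix as reconstructed above and uses scikit-learn's LogisticRegression as the base binary learner; the function name is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Code matrix as reconstructed on slide 6: one row per class (A-D),
# one column per binary problem f1-f5.
CODE = np.array([[0, 0, 1, 1, 0],   # A
                 [1, 0, 1, 0, 0],   # B
                 [0, 1, 1, 1, 0],   # C
                 [0, 1, 0, 0, 1]])  # D

def train_ecoc(X, y):
    """Train one binary classifier per column of the code matrix.

    y holds class indices 0..3; column j relabels every example with
    the bit CODE[y[i], j], and a binary learner is fit on that split.
    """
    classifiers = []
    for j in range(CODE.shape[1]):
        bits = CODE[y, j]  # binary relabeling for problem f_{j+1}
        classifiers.append(LogisticRegression(max_iter=1000).fit(X, bits))
    return classifiers
```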

7 ECOC - Picture [Figure: the codeword matrix from slide 6 shown with classes A–D and binary problems f1–f5; animation step]

8 ECOC - Picture [Figure: same matrix; animation step]

9 ECOC - Picture [Figure: same matrix; animation step]

10 ECOC - Picture [Figure: same matrix plus the test codeword X = 1 1 1 1 0 matched against the class codewords]

11 Classification Performance vs. Efficiency [Figure: plot of classification performance against efficiency, marking NB, ECOC, the preliminary results, and this proposal's target]  Preliminary Results: ECOC reduces the error of the Naïve Bayes classifier by 66% with NO increase in computational cost (as used in Berger 99)

12 Proposed Solutions  Design codewords that minimize cost and maximize “performance”  Investigate the assignment of codewords to classes  Learn the decoding function  Incorporate unlabeled data into ECOC

13 Use unlabeled data with a large number of classes  How? Use EM → mixed results  Think again! Use Co-Training → disastrous results  Think one more time

14 Use Unlabeled data  Current learning algorithms using unlabeled data (EM, Co-Training) don’t work well with a large number of categories  ECOC works great with a large number of classes but there is no framework for using unlabeled data

15 Use Unlabeled Data  ECOC decomposes multiclass problems into binary problems  Co-Training works great with binary problems  ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training  Preliminary Results: Not so great! (very sensitive to initial labeled documents)
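A minimal sketch of this hybrid for a single binary problem, assuming two feature views X1/X2 (e.g., two disjoint halves of the vocabulary) and scikit-learn's MultinomialNB; the function and its parameters are illustrative:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain_binary(X1, X2, y, n_rounds=10, n_add=5):
    """Co-train one ECOC binary problem over two feature views.

    y holds 0/1 bits for labeled examples and -1 for unlabeled ones.
    Each round, the classifier on each view labels the unlabeled
    examples it is most confident about and adds them to the shared
    labeled pool; the final bit classifier trains on the result.
    """
    y = y.copy()
    for _ in range(n_rounds):
        for X in (X1, X2):
            labeled = y != -1
            unlabeled = np.flatnonzero(y == -1)
            if unlabeled.size == 0:
                return y
            clf = MultinomialNB().fit(X[labeled], y[labeled])
            probs = clf.predict_proba(X[unlabeled])
            # Promote the n_add most confident unlabeled examples.
            confident = unlabeled[np.argsort(probs.max(axis=1))[-n_add:]]
            y[confident] = clf.predict(X[confident])
    return y
```

The sensitivity to the initial labeled documents noted above shows up here directly: whichever bits the first few confident promotions receive tend to snowball through later rounds.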

16 What Next?  Use an improved version of co-training (gradient descent): Less prone to random fluctuations Uses all unlabeled data at every iteration  Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training

17 Work Plan  Collect Datasets  Codeword Assignment - 2 weeks  Learning Decoding - 1-2 weeks  Using Unlabeled Data - 2 weeks  Design Codes - 2 weeks  Project Write-up - 1 week

18 Summary  Use ECOC for efficient text classification with a large number of categories  Reduce code length without sacrificing performance  Fix code length and increase performance  Generalize to domain-independent classification tasks involving a large number of categories

19 Testing ECOC  To test a new instance: Apply each of the trained binary classifiers to the new instance Combine the predictions to obtain a binary string (codeword) for the new point Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure)
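A sketch of that decoding step, continuing the illustrative CODE matrix and train_ecoc classifiers from the earlier sketch:

```python
import numpy as np

def decode_ecoc(classifiers, x, code):
    """Predict the class whose codeword is nearest in Hamming distance.

    x is a single feature vector; each classifier contributes one bit,
    and the class whose row of `code` disagrees on the fewest bits wins.
    """
    bits = np.array([clf.predict(x.reshape(1, -1))[0] for clf in classifiers])
    distances = (code != bits).sum(axis=1)  # Hamming distance to each codeword
    return int(np.argmin(distances))        # index of the predicted class
```

On the slide-6 example, the predicted string 1 1 1 1 0 is distance 1 from C's codeword and at least 2 from the others, so the instance is labeled C.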

20 The Decoding Step  Standard: Map to the nearest codeword according to Hamming distance  Can we do better?

21 The Real Question  There is a tradeoff between the “learnability” of the individual binary problems and the error-correcting power of the code

22 Codeword assignment  Standard Procedure: Assign codewords to classes randomly  Can we do better?

23 Goal of Current Research  Improve classification performance without increasing cost: Design short codes that perform well; Develop algorithms that increase performance without affecting code length

24 Previous Results  Performance increases with the length of the code  Gives the same percentage increase in performance over NB regardless of training set size  BCH Codes > Random Codes > Hand-constructed Codes

25 Others have shown that ECOC  Works great with arbitrarily long codes  Longer codes = More error-correcting power = Better performance  Longer codes = More computational cost
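That power comes from the minimum Hamming distance d between codewords: a code can correct up to ⌊(d−1)/2⌋ wrong bits, i.e., that many erring binary classifiers. A quick check on the illustrative slide-6 matrix shows why short codes are weak:

```python
import numpy as np
from itertools import combinations

CODE = np.array([[0, 0, 1, 1, 0],   # A
                 [1, 0, 1, 0, 0],   # B
                 [0, 1, 1, 1, 0],   # C
                 [0, 1, 0, 0, 1]])  # D

# Minimum pairwise Hamming distance across codewords.
d_min = min(int((a != b).sum()) for a, b in combinations(CODE, 2))
print(d_min, (d_min - 1) // 2)  # -> 1 0: this toy code corrects no errors

# Longer codes (e.g., the 63-bit BCH codes cited later) raise d_min,
# which is exactly the extra error-correcting power traded for cost.
```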

26 ECOC to the Rescue!  An n-class problem can be solved by solving log₂ n binary problems  More efficient than one-per-class  Does it actually perform better?

27 Previous Results  Industry Sector Data Set:

    Naïve Bayes | Shrinkage¹ | ME² | ME w/ Prior³ | ECOC 63-bit
    66.1%       | 76%        | 79% | 81.1%        | 88.5%

ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost
1. (McCallum et al. 1998)  2, 3. (Nigam et al. 1999)  (Ghani 2000)

28 Design codewords  Maximize performance (accuracy, precision, recall, F1?)  Minimize the length of the code  Search the space of codewords by gradient descent on G = Error + Code_Length
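Since codewords are discrete, plain gradient descent does not apply literally; a simple stand-in that captures the same objective is a local bit-flip search over code matrices, sketched below. All names and the error proxy are assumptions for illustration, not the proposal's actual method:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def surrogate_G(code, lam=0.1):
    """Stand-in for G = Error + Code_Length: negative minimum row
    distance as an error proxy, plus a penalty on code width.
    (With a fixed width the penalty is constant; it matters when
    comparing codes of different lengths.)"""
    d_min = min(int((a != b).sum()) for a, b in combinations(code, 2))
    return -d_min + lam * code.shape[1]

def local_search(n_classes=4, n_bits=8, n_steps=200):
    """Greedy bit-flip descent on surrogate_G from a random code matrix."""
    code = rng.integers(0, 2, size=(n_classes, n_bits))
    best = surrogate_G(code)
    for _ in range(n_steps):
        i, j = rng.integers(n_classes), rng.integers(n_bits)
        code[i, j] ^= 1              # flip one bit
        score = surrogate_G(code)
        if score <= best:
            best = score             # keep the improvement
        else:
            code[i, j] ^= 1          # revert the flip
    return code
```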

29 Codeword Assignment  Generate the confusion matrix and use it to assign the codewords that are farthest apart to the most confusable classes  Pros: Focusing more on confusable classes can help  Cons: Individual binary problems can be very hard
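A brute-force sketch of that idea (the helper is hypothetical, and scoring every class-to-codeword permutation is feasible only for a handful of classes):

```python
import numpy as np
from itertools import permutations

def assign_codewords(confusion, code):
    """Place heavily confused class pairs at large Hamming distance.

    Scores each assignment by the sum over class pairs of
    (confusion counts between the pair) x (Hamming distance between
    their assigned codewords), and keeps the best assignment.
    """
    k = len(code)
    best_perm, best_score = None, -1.0
    for perm in permutations(range(k)):
        score = sum((confusion[i, j] + confusion[j, i]) *
                    (code[perm[i]] != code[perm[j]]).sum()
                    for i in range(k) for j in range(i + 1, k))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm  # best_perm[class_index] = row of `code` to use
```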

30 The Decoding Step  Weight the individual classifiers according to their training accuracies and do weighted majority decoding  Pose the decoding as a separate learning problem and use regression or a neural network
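A sketch of the first option, weighted majority decoding, as a variant of the Hamming decoder above (the learned-decoder option would replace this with a trained regressor or network; names are illustrative):

```python
import numpy as np

def weighted_decode(bits, code, weights):
    """Accuracy-weighted decoding: trust accurate classifiers more.

    bits:    the predicted bit from each binary classifier.
    weights: e.g., each classifier's training accuracy.
    Every disagreeing bit costs that classifier's weight; the class
    with the smallest total weighted disagreement wins.
    """
    cost = ((code != bits) * weights).sum(axis=1)
    return int(np.argmin(cost))
```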

