
1 Efficient Text Categorization with a Large Number of Categories Rayid Ghani KDD Project Proposal

2 Text Categorization
 Numerous applications: search engines/portals, customer service, …
 Domains: topics, genres, languages
 $$$ making

3 How do people deal with a large number of classes?
 Use fast multiclass algorithms (Naïve Bayes), which build one model per class
 Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems
 What happens with a 1000-class problem?
 Can we do better?

4 ECOC to the Rescue!
 An n-class problem can be solved by solving log₂ n binary problems (e.g., 10 binary problems suffice to give 1000 classes unique codewords)
 More efficient than one-per-class
 Does it actually perform better?

5 What is ECOC?
 Solve multiclass problems by decomposing them into multiple binary problems
 Use a learner to learn the binary problems

6 Training ECOC / Testing ECOC  [Figure: a 4-class, 5-bit code matrix; classes A-D are rows and the binary functions f1-f5 are columns. Training: learn one binary classifier per column, relabeling every example with its class's bit in that column. Testing: apply f1-f5 to a new instance X, which here yields the codeword 0 0 1 1 1. A sketch follows below.]

        f1  f2  f3  f4  f5
    A    0   0   1   1   0
    B    1   0   1   0   0
    C    0   1   1   1   0
    D    0   1   0   0   1
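A minimal sketch of this training/testing loop in Python, using the matrix as reconstructed above; scikit-learn's logistic regression stands in for the base binary learner (the deck itself uses Naïve Bayes), and train_ecoc/predict_ecoc are illustrative names, not from the slides:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Code matrix from the slide: rows = classes A-D, columns = binary problems f1-f5.
CODE = np.array([
    [0, 0, 1, 1, 0],  # A
    [1, 0, 1, 0, 0],  # B
    [0, 1, 1, 1, 0],  # C
    [0, 1, 0, 0, 1],  # D
])

def train_ecoc(X, y):
    """Learn one binary classifier per column; y holds class indices 0-3 (A-D)."""
    classifiers = []
    for j in range(CODE.shape[1]):
        bit_labels = CODE[y, j]  # relabel every example by its class's j-th bit
        classifiers.append(LogisticRegression().fit(X, bit_labels))
    return classifiers

def predict_ecoc(classifiers, x):
    """Apply all binary classifiers, then map to the class with the nearest codeword."""
    bits = np.array([clf.predict(x.reshape(1, -1))[0] for clf in classifiers])
    hamming = (CODE != bits).sum(axis=1)  # Hamming distance to each codeword
    return hamming.argmin()               # 0-3, i.e. A-D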

7-10 ECOC - Picture  [Figure, built up over four slides: the same 4×5 code matrix, with classes A-D drawn in the space of codewords and one classifier fi per column; on the final slide a test instance X is run through all five classifiers and comes out as the codeword 1 1 1 1 0.]
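Tracing the decoding of that pictured test point, reusing the CODE matrix from the sketch above (same row-wise reading of the slide's bits):

x_bits = np.array([1, 1, 1, 1, 0])   # the five classifier outputs for X
print((CODE != x_bits).sum(axis=1))  # [2 2 1 4]: C's codeword is nearest, so X -> C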

11 Classification Performance vs. Efficiency  [Figure: chart of classification performance against efficiency, locating NB, ECOC (preliminary results), and the target of this proposal.]  ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost.

12 Proposed Solutions
 Design codewords that minimize cost and maximize “performance”
 Investigate the assignment of codewords to classes
 Learn the decoding function
 Incorporate unlabeled data into ECOC

13 Use Unlabeled Data
 Current learning algorithms that use unlabeled data (EM, Co-Training) don’t work well with a large number of categories
 ECOC works great with a large number of classes, but there is no framework for using unlabeled data

14 Use Unlabeled Data
 ECOC decomposes multiclass problems into binary problems
 Co-Training works great with binary problems
 ECOC + Co-Train = learn each binary problem in ECOC with Co-Training (and variants of Co-Training such as Co-EM); a sketch follows below
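A hedged sketch of the combination for a single ECOC bit: standard co-training over two feature views. The view split, the Naïve Bayes base learner over bag-of-words counts, and every name here are illustrative assumptions, not the proposal's specification:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train_bit(X1, X2, y, U1, U2, rounds=10, k=5):
    """Co-train one ECOC binary problem.

    (X1, X2): the two feature views of the labeled data; y: this
    column's 0/1 bit labels; (U1, U2): the views of the unlabeled pool.
    Each round, each view's classifier labels the k unlabeled examples
    it is most confident about, and both views absorb them.
    """
    for _ in range(rounds):
        if len(U1) == 0:
            break
        c1 = MultinomialNB().fit(X1, y)
        c2 = MultinomialNB().fit(X2, y)
        top1 = c1.predict_proba(U1).max(axis=1).argsort()[-k:]  # c1's confident picks
        top2 = c2.predict_proba(U2).max(axis=1).argsort()[-k:]  # c2's confident picks
        picks = np.concatenate([top1, top2])
        new_y = np.concatenate([c1.predict(U1[top1]), c2.predict(U2[top2])])
        X1 = np.vstack([X1, U1[picks]])
        X2 = np.vstack([X2, U2[picks]])
        y = np.concatenate([y, new_y])
        keep = np.setdiff1d(np.arange(len(U1)), picks)  # shrink the unlabeled pool
        U1, U2 = U1[keep], U2[keep]
    return MultinomialNB().fit(X1, y), MultinomialNB().fit(X2, y)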

15 Summary

16 Testing ECOC
 To test a new instance:
Apply each of the trained binary classifiers to the new instance
Combine the predictions to obtain a binary string (codeword) for the new point
Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure)

17 The Decoding Step
 Standard: map to the nearest codeword according to Hamming distance
 Can we do better?

18 The Real Question?  Tradeoff between “learnability” of binary problems and the error-correcting power of the code

19 Codeword Assignment
 Standard procedure: assign codewords to classes randomly
 Can we do better?

20 Goal of Current Research
 Improve classification performance without increasing cost:
Design short codes that perform well
Develop algorithms that increase performance without affecting code length

21 Previous Results
 Performance increases with the length of the code
 Gives the same percentage increase in performance over NB regardless of training set size
 BCH codes > random codes > hand-constructed codes

22 Others have shown that ECOC
 Works great with arbitrarily long codes
 Longer codes = more error-correcting power = better performance
 Longer codes = more computational cost

23 ECOC to the Rescue!
 An n-class problem can be solved by solving log₂ n binary problems
 More efficient than one-per-class
 Does it actually perform better?

24 Previous Results  Industry Sector Data Set:

                Naïve Bayes   Shrinkage [1]   ME [2]   ME w/ Prior [3]   ECOC 63-bit
    Accuracy    66.1%         76%             79%      81.1%             88.5%

 ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost
 1. (McCallum et al. 1998)   2, 3. (Nigam et al. 1999)   ECOC: (Ghani 2000)

25 Design codewords
 Maximize performance (accuracy, precision, recall, F1?)
 Minimize the length of the codes
 Search the space of codewords by gradient descent on G = Error + Code_Length (see the sketch below)
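One way to make that search concrete. The slide says gradient descent, but over a discrete code matrix a local-search stand-in is easier to sketch; evaluate_error, the trade-off weight lam, and all other names here are assumptions:

import numpy as np

def search_code(n_classes, max_bits, evaluate_error, iters=500, lam=0.01, seed=0):
    """Hill-climb on G = Error + lam * Code_Length (the slide's objective).

    evaluate_error is an assumed callback that trains the ECOC ensemble
    for a candidate code matrix and returns its held-out error.
    """
    rng = np.random.default_rng(seed)
    code = rng.integers(0, 2, size=(n_classes, max_bits))
    best_g = evaluate_error(code) + lam * code.shape[1]
    min_bits = int(np.ceil(np.log2(n_classes)))  # shortest code with unique codewords
    for _ in range(iters):
        cand = code.copy()
        if rng.random() < 0.2 and cand.shape[1] > min_bits:
            cand = np.delete(cand, rng.integers(cand.shape[1]), axis=1)  # try a shorter code
        else:
            cand[rng.integers(n_classes), rng.integers(cand.shape[1])] ^= 1  # flip one bit
        g = evaluate_error(cand) + lam * cand.shape[1]
        if g < best_g:  # keep the candidate only if it lowers G
            code, best_g = cand, g
    return code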

26 Codeword Assignment
 Generate the confusion matrix and use it to assign the most confusable classes the codewords that are farthest apart (see the sketch below)
 Pros: focusing more on the confusable classes can help
 Cons: individual binary problems can be very hard
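A brute-force sketch of that idea, workable only for a handful of classes; the particular scoring function is my assumption, not the proposal's:

import numpy as np
from itertools import permutations

def assign_codewords(confusion, code):
    """Permute the codeword rows so the most confusable class pairs sit on
    the most distant codewords.  confusion is an (n x n) confusion matrix
    from a preliminary classifier; brute force over n! permutations.
    """
    n = code.shape[0]
    dist = (code[:, None, :] != code[None, :, :]).sum(axis=2)  # pairwise Hamming
    best_p, best_score = None, -np.inf
    for perm in permutations(range(n)):
        p = np.array(perm)
        score = (confusion * dist[p][:, p]).sum()  # reward separating confused pairs
        if score > best_score:
            best_p, best_score = p, score
    return code[best_p]  # row i is now class i's codeword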

27 The Decoding Step
 Weight the individual classifiers according to their training accuracies and do weighted-majority decoding (see the sketch below)
 Pose the decoding as a separate learning problem and use regression/a neural network  [Figure: two example 8-bit strings, 10011001 and 11011101.]
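A minimal sketch of the first idea, weighted-majority decoding; the assumption here is that weights holds each binary classifier's training accuracy:

import numpy as np

def weighted_decode(bits, code, weights):
    # Nearest-codeword decoding where disagreeing with an accurate
    # classifier costs more than disagreeing with a weak one.
    dist = ((code != bits) * weights).sum(axis=1)  # weighted Hamming distance
    return dist.argmin()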

