Presentation transcript: Using Error-Correcting Codes for Text Classification (Rayid Ghani, Carnegie Mellon University)

1 Some Recent Work
ECOC for Text Classification
Hybrids of EM & Co-Training (with Kamal Nigam)
Learning to build a monolingual corpus from the web (with Rosie Jones)
Effect of Smoothing on Naive Bayes for text classification (with Tong Zhang)
Hypertext Categorization using link and extracted information (with Sean Slattery & Yiming Yang)

2 Using Error-Correcting Codes for Text Classification
Rayid Ghani
Center for Automated Learning & Discovery, Carnegie Mellon University
This presentation can be accessed at http://www.cs.cmu.edu/~rayid/talks/

3 Outline
Introduction to ECOC
Intuition & Motivation
Some Questions
Experimental Results
Semi-Theoretical Model
Types of Codes
Drawbacks
Conclusions

4 Introduction
Decompose a multiclass classification problem into multiple binary problems:
One-Per-Class approach (moderately expensive)
All-Pairs (very expensive)
Distributed Output Code (efficient, but what about performance?)
Error-Correcting Output Codes (?)
The sketch below contrasts how many binary problems each approach creates.
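A minimal sketch of the size of each decomposition; m = 105 matches the Industry Sector dataset used later in the talk, and n = 63 matches the longest BCH code in the experiments (both values are taken from later slides, not prescribed here):

    import numpy as np

    rng = np.random.default_rng(0)
    m = 105   # number of classes, as in the Industry Sector dataset
    n = 63    # code length, matching the longest BCH code in the talk

    # One-per-class: the m x m identity matrix; minimum row Hamming distance 2
    one_per_class = np.eye(m, dtype=int)

    # All-pairs: one binary problem for every pair of classes
    n_all_pairs = m * (m - 1) // 2          # 5460 classifiers for m = 105

    # Distributed / error-correcting code: n freely chosen bits per class
    distributed = rng.integers(0, 2, size=(m, n))

    print(one_per_class.shape[1], n_all_pairs, distributed.shape[1])  # 105 5460 63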

6 Is it a good idea?
Larger margin for error, since errors can now be "corrected".
One-per-class is a code with minimum Hamming distance (HD) = 2.
Distributed codes have low HD.
The individual binary problems can be harder than before.
Useless unless the number of classes > 5.

7 Training ECOC
Given m distinct classes:
1. Create an m x n binary matrix M.
2. Each class is assigned ONE row of M.
3. Each column of the matrix divides the classes into TWO groups.
4. Train the base classifiers to learn the n binary problems (see the sketch below).

Example code matrix (m = 4 classes, n = 5 binary problems):

      f1  f2  f3  f4  f5
  A    0   0   1   1   0
  B    1   0   1   0   0
  C    0   1   1   1   0
  D    0   1   0   0   1
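A minimal training sketch under stated assumptions: X is a bag-of-words matrix, y holds integer class indices 0..m-1, and scikit-learn's MultinomialNB stands in for the base learner (the slide does not prescribe an implementation):

    import numpy as np
    from sklearn.base import clone
    from sklearn.naive_bayes import MultinomialNB

    def train_ecoc(M, X, y, base=MultinomialNB()):
        # M is the m x n code matrix; row M[c] is the codeword of class c.
        classifiers = []
        for j in range(M.shape[1]):
            bits = M[y, j]   # relabel each example by its class's bit in column j
            classifiers.append(clone(base).fit(X, bits))
        return classifiers

With the 4 x 5 matrix above, column f3 for example trains a single binary classifier to separate {A, B, C} from {D}.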

9 Testing ECOC
To test a new instance:
Apply each of the n classifiers to the new instance.
Combine the predictions to obtain a binary string (codeword) for the new point.
Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure); see the sketch below.
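A matching decoding sketch, under the same assumptions as the training sketch above:

    import numpy as np

    def predict_ecoc(M, classifiers, X):
        # Each of the n classifiers predicts one bit per test instance.
        bits = np.column_stack([clf.predict(X) for clf in classifiers])
        # Hamming distance from each predicted codeword to every row of M;
        # classify to the class whose codeword is nearest.
        dists = (bits[:, None, :] != M[None, :, :]).sum(axis=2)
        return dists.argmin(axis=1)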

10 ECOC - Picture
The example code matrix again, with the four classes A-D and the five binary classifiers f1-f5:

      f1  f2  f3  f4  f5
  A    0   0   1   1   0
  B    1   0   1   0   0
  C    0   1   1   1   0
  D    0   1   0   0   1

13 ECOC - Picture
A test instance X receives the predicted codeword 1 1 1 1 0 from the five classifiers; the nearest row of the matrix is C's codeword (Hamming distance 1), so X is classified as C.

14 Single classifier – learns a complex boundary once.
Ensemble – learns a complex boundary multiple times.
ECOC – learns a "simple" boundary multiple times.

15 Questions?
How well does it work?
How long should the code be?
Do we need a lot of training data?
What kind of codes can we use?
Are there intelligent ways of creating the code?

16 Previous Work
Combined with Boosting: ADABOOST.OC (Schapire, 1997), (Guruswami & Sahai, 1999)
Local learners (Ricci & Aha, 1997)
Text classification (Berger, 1999)

17 Experimental Setup
Generate the code: BCH codes.
Choose a base learner: the Naive Bayes classifier as used in text classification tasks (McCallum & Nigam 1998); a toy example follows below.
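For concreteness, a toy run of such a base learner on one artificial binary problem; the documents and labels are made up, and scikit-learn's MultinomialNB stands in for the multinomial ("event model") Naive Bayes of McCallum & Nigam (1998):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["quarterly earnings rose sharply",    # made-up page, bit = 0
            "new graphics chipset announced"]     # made-up page, bit = 1
    bits = [0, 1]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)                   # bag-of-words counts
    clf = MultinomialNB().fit(X, bits)            # smoothed multinomial model
    print(clf.predict(vec.transform(["earnings fell"])))   # -> [0]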

18 Dataset
Industry Sector dataset: company web pages classified into 105 economic sectors.
Standard stoplist; no stemming; all MIME headers and HTML tags skipped.
Experimental approach similar to McCallum et al. (1998) for comparison purposes.

19 Results
Classification accuracies on five random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000. ECOC is 88% accurate!

20 Results
Industry Sector dataset:

  Method                                 Accuracy
  Naive Bayes                            66.1%
  Shrinkage (McCallum et al. 1998)       76%
  Maximum Entropy (Nigam et al. 1999)    79%
  ME with Prior (Nigam et al. 1999)      81.1%
  ECOC, 63-bit                           88.5%

ECOC reduces the error of the Naive Bayes classifier by 66% (error drops from 33.9% to 11.5%).

21 The Longer the Better!
Table 2: average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000 words selected using Information Gain.
Longer codes mean larger codeword separation.
The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C.
If the minimum Hamming distance is h, then the code can correct ⌊(h-1)/2⌋ errors, as the sketch below illustrates.
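A minimal sketch of these two quantities, reusing the toy 4 x 5 code from slide 7 (that tiny code is only for illustration; its minimum distance is far below that of the BCH codes in the experiments):

    import numpy as np
    from itertools import combinations

    def min_hamming_distance(M):
        # Smallest Hamming distance between any pair of distinct codewords
        return min(int((a != b).sum()) for a, b in combinations(M, 2))

    M = np.array([[0, 0, 1, 1, 0],   # rows A-D of the slide 7 example
                  [1, 0, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 1, 0, 0, 1]])

    h = min_hamming_distance(M)
    print(h, (h - 1) // 2)   # h = 1 here, so 0 correctable errors; the 63-bit
                             # BCH code used later has h = 31, correcting 15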

22 Size Matters?

23 Size does NOT matter!

24 Semi-Theoretical Model
Model ECOC by a binomial distribution B(n, p), where n = length of the code and p = probability of each bit being classified incorrectly.

25 Semi-Theoretical Model
(H_min = minimum Hamming distance, E_max = number of correctable errors, P_ave = average per-bit accuracy)

  # of Bits   H_min   E_max   P_ave   Accuracy
  15          5       2       .85     .59
  15          5       2       .89     .80
  15          5       2       .91     .84
  31          11      5       .85     .67
  31          11      5       .89     .91
  31          11      5       .91     .94
  63          31      15      .89     .99

The sketch below computes the model's predicted accuracy.
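The binomial prediction can be computed directly; a sketch, reading P_ave in the table as the per-bit accuracy (so 1 - P_ave is the p of the model statement above):

    from math import comb

    def predicted_accuracy(n, h_min, p_bit):
        # Decoding succeeds when at most floor((h_min - 1) / 2) of the n
        # independent bits are wrong: sum those binomial probabilities.
        e_max = (h_min - 1) // 2
        q = 1 - p_bit
        return sum(comb(n, k) * q**k * p_bit**(n - k) for k in range(e_max + 1))

    print(predicted_accuracy(15, 5, 0.85))   # ~0.60, close to the .59 row
    print(predicted_accuracy(63, 31, 0.89))  # ~0.999, matching the .99 row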

28 [Figure: binary partitions of four 20-Newsgroups classes (alt.atheism, comp.os.windows, comp.sys.ibm.hardware, talk.misc.religion), annotated with per-partition binary accuracies of 99%, 73%, 68%, 81%, 86%, and 87%.]

29 Types of Codes
Data-Independent: Algebraic, Random, Hand-Constructed
Data-Dependent: Adaptive

30 What is a Good Code?
Row separation
Column separation (independence of errors for each binary classifier)
Efficiency (for long codes)

31 Choosing Codes

              Random                       Algebraic
  Row Sep     On average, for long codes   Guaranteed
  Col Sep     On average, for long codes   Can be guaranteed
  Efficiency  No                           Yes

32 Experimental Results

  Code            Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate
  15-bit BCH      5            15           49           64           20.6%
  19-bit Hybrid   5            18           15           69           22.3%
  15-bit Random   2 (1.5)      13           42           60           24.1%

The sketch below shows how these separation statistics are measured.
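These statistics are straightforward to compute from a code matrix; a small sketch (the random 105 x 15 code here is only a stand-in for the actual codes used):

    import numpy as np
    from itertools import combinations

    def separation_stats(M):
        # Hamming distances between all pairs of rows (codewords)
        # and between all pairs of columns (binary problems).
        rows = [int((a != b).sum()) for a, b in combinations(M, 2)]
        cols = [int((a != b).sum()) for a, b in combinations(M.T, 2)]
        return min(rows), max(rows), min(cols), max(cols)

    M = np.random.default_rng(0).integers(0, 2, size=(105, 15))
    print(separation_stats(M))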

33 Interesting Questions
NB does not give good probability estimates; does using ECOC result in better estimates?
How should codewords be assigned to classes?
Can decoding be posed as a supervised learning task?

34 Drawbacks
Can be computationally expensive.
Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems.

35 Current Work
Combine ECOC with Co-Training to use unlabeled data.
Automatically construct optimal / adaptive codes.

36 Conclusion
Performs well on text classification tasks.
Can be used when training data is sparse.
Algebraic codes perform better than random codes for a given code length.
Hand-constructed codes may not be the answer.

37 Background
Co-training seems to be the way to go when there is (and maybe even when there isn't) a feature split in the data.
Reported results on co-training only deal with very small (toy) problems, mostly binary classification tasks (Blum & Mitchell 1998, Nigam & Ghani 2000).

38 Co-Training Challenge
Task: apply co-training to a 65-class dataset containing 130,000 training examples.
Result: co-training fails!

39 Solution?
ECOC seems to work well when there is a large number of classes.
ECOC decomposes a multiclass problem into several binary problems.
Co-training works well with binary problems.
So: combine ECOC and co-training.

40 Algorithm
Learn each bit of the ECOC codeword using a co-trained classifier (sketched below).
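A sketch of one plausible reading of this step, not the exact procedure from the talk: for each column of the code matrix, run Blum & Mitchell-style co-training over the two views (Title and Description), each view's Naive Bayes labeling the unlabeled examples it is most confident about. The function name, growth schedule, and round count are all illustrative:

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.naive_bayes import MultinomialNB

    def cotrain_bit(X1, X2, y, U1, U2, rounds=5, grow=10):
        # X1/X2, U1/U2: sparse bag-of-words matrices for the two views
        # (labeled / unlabeled pools); y: this bit (0/1) for the labeled pool.
        y = np.asarray(y)
        for _ in range(rounds):
            if U1.shape[0] < 2 * grow:
                break
            c1 = MultinomialNB().fit(X1, y)
            c2 = MultinomialNB().fit(X2, y)
            # Each view picks the unlabeled examples it is most confident
            # about and labels them for the next round of joint training.
            pick1 = np.argsort(c1.predict_proba(U1).max(axis=1))[-grow:]
            pick2 = np.setdiff1d(np.argsort(c2.predict_proba(U2).max(axis=1))[-grow:], pick1)
            new = np.concatenate([pick1, pick2])
            y = np.concatenate([y, c1.predict(U1[pick1]), c2.predict(U2[pick2])])
            X1, X2 = vstack([X1, U1[new]]), vstack([X2, U2[new]])
            keep = np.setdiff1d(np.arange(U1.shape[0]), new)
            U1, U2 = U1[keep], U2[keep]
        return MultinomialNB().fit(X1, y), MultinomialNB().fit(X2, y)

At test time the two view classifiers can be combined, e.g. by averaging their predicted bit probabilities, and the n co-trained bits are then decoded exactly as in plain ECOC.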

41 Dataset (Job Descriptions)
65 classes, 32,000 examples.
Two feature sets: Title and Description.

43 Results
10% train, 50% unlabeled, 40% test:

  NB            40.3%
  ECOC          48.9%
  EM            30.83%
  CoTraining
  ECOC-EM
  ECOC-Cotrain
  ECOC-CoEM

