2 Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories Rayid Ghani Advisor: Tom Mitchell

3 Text Categorization  Numerous applications: search engines/portals, customer service, email routing, …  Domains: topics, genres, languages

4 Problems  Practical applications such as web portals deal with a large number of categories and current systems are not specifically designed to handle that  A lot of labeled examples are needed for training the system

5 How do people deal with a large number of classes?  Use algorithms that scale up linearly Naïve Bayes – fast, handles multiclass problems, builds one probability distribution per class SVM – can only handle binary problems so an n-class problem is solved by solving n binary problems  Can we do better than scale up linearly with the number of classes?

6 ECOC to the Rescue!  An n-class problem can be solved by solving fewer than n binary problems  More efficient than one-per-class  Is it better in classification performance?

7 ECOC to the rescue!  Motivated by error-correcting codes – add redundancy to correct errors  Solve multiclass problems by decomposing them into multiple binary problems ( Dietterich & Bakiri 1995 )  Can be more efficient than one-per-class approach

8 Related Work using ECOC  Focus is on accuracy, mostly at the cost of computational efficiency Applied to text classification with NB (Berger 1999) Applied to several non-text problems using Dtrees, NN, kNN (Dietterich & Bakiri 1995, Aha 1997) Combined with Boosting (Schapire 1999, Guruswami & Sahai 1999)

9 Training and Testing ECOC [Figure: a code matrix with classes A–D as rows and binary functions f1–f4 as columns; training learns one binary classifier per column, and testing runs all the binary classifiers on a new example X to produce a predicted codeword]
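To make the training and testing procedure concrete, here is a minimal Python sketch (not the implementation used in the talk). It assumes a dense feature matrix X, integer class labels y, and a 0/1 code matrix; the base learner (Naïve Bayes, matching the talk) and all names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB  # any binary base learner could be used

def train_ecoc(X, y, code):
    """Train one binary classifier per column of the code matrix.
    code[c, b] is the b-th bit of class c's codeword."""
    classifiers = []
    for b in range(code.shape[1]):
        bit_labels = code[y, b]  # relabel each example with its class's b-th bit
        classifiers.append(MultinomialNB().fit(X, bit_labels))
    return classifiers

def predict_codeword(classifiers, x):
    """Run every binary classifier on one example to get its predicted codeword."""
    return np.array([clf.predict(x.reshape(1, -1))[0] for clf in classifiers])
```

At test time the predicted codeword is then mapped to a class, for example by Hamming distance as sketched a few slides below.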

10–12 ECOC – Picture [Figure, built up over three slides: the 4-class code matrix with classes A–D and binary classifiers f1–f4; the final frame adds a test example X and its predicted bits]

13 ECOC can recover from the errors of the individual classifiers [Figure: the 4-class code matrix from the previous slides next to the 4×4 one-per-class (identity) code]  Longer codes mean larger codeword separation  The minimum Hamming distance of a code C is the smallest distance between any pair of codewords in C  If the minimum Hamming distance is h, then the code can correct at most ⌊(h−1)/2⌋ errors
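Below is a small, self-contained illustration of the Hamming-distance decoding and the error-correction property described above; the 4-class code is a made-up example (minimum Hamming distance 3, so one wrong bit can be corrected), not the code from the slide.

```python
import numpy as np

def decode(predicted_codeword, code):
    """Assign the class whose codeword is nearest in Hamming distance."""
    distances = np.sum(code != predicted_codeword, axis=1)
    return int(np.argmin(distances))

# Toy 4-class, 5-bit code with minimum Hamming distance 3 (corrects 1 error).
code = np.array([[0, 0, 1, 1, 1],
                 [0, 1, 0, 1, 0],
                 [1, 0, 0, 0, 1],
                 [1, 1, 1, 0, 0]])

# Class 2's codeword with one bit flipped still decodes to class 2.
noisy = np.array([1, 0, 0, 1, 1])
assert decode(noisy, code) == 2
```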

14 ECOC: longer codes → better error-correcting properties → better classification performance, but also higher computational cost

15 [Figure: classification performance vs. efficiency, plotting Naïve Bayes, ECOC (as used in Berger 99), and the goal of this work]

16  Efficient (short) codes  Maximize Accuracy  Reduce Labeled Data

17 Roadmap  Efficiency: shorter and better codes  Accuracy: maximize classification performance  Data efficiency: use unlabeled data [Figure: the classification performance vs. efficiency plot from slide 15]

18 What is a Good Code?  Row Separation  Column Separation (low correlation between the errors of the individual binary classifiers)  Efficiency

19 Choosing the codewords  Use Coding Theory for Good Error-Correcting Codes? [Dietterich & Bakiri 1995] Guaranteed properties for a fixed-length code  Random codes [Berger 1999, James 1999] Asymptotically good (the longer the better) Computational Cost is very high  Construct codes that are data-dependent
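As a rough illustration of the trade-off between random and algebraic codes discussed here, the sketch below draws a random code and measures its row and column separation (the column check ignores column complements for simplicity). The sizes and function names are arbitrary; generating an actual BCH code is not shown.

```python
import numpy as np
from itertools import combinations

def min_separation(vectors):
    """Smallest pairwise Hamming distance among a set of 0/1 vectors."""
    return min(int(np.sum(a != b)) for a, b in combinations(vectors, 2))

def random_code(n_classes, n_bits, seed=0):
    """Random codewords: good on average for long codes, but a short
    random code can easily have poorly separated rows or columns."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(n_classes, n_bits))

code = random_code(n_classes=10, n_bits=15)
print("row separation:   ", min_separation(code))    # distance between codewords
print("column separation:", min_separation(code.T))  # how different the binary problems are
```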

20 Text Classification with Naïve Bayes  “Bag of Words” document representation  Estimate parameters of generative model:  Naïve Bayes classification:
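The parameter-estimation and classification equations on this slide did not survive the transcript; as a reference, the standard multinomial Naïve Bayes formulation with Laplace smoothing (presumably what was shown) is:

```latex
% Parameter estimation (smoothed word probabilities and class priors):
\hat{P}(w_t \mid c_j) = \frac{1 + \sum_{d_i \in c_j} N(w_t, d_i)}
                             {|V| + \sum_{s=1}^{|V|} \sum_{d_i \in c_j} N(w_s, d_i)},
\qquad
\hat{P}(c_j) = \frac{|c_j|}{|D|}

% Classification of a new document d:
c^{*} = \arg\max_{c_j} \; \hat{P}(c_j) \prod_{w_t \in d} \hat{P}(w_t \mid c_j)^{N(w_t, d)}
```

Here N(w_t, d) is the count of word w_t in document d, V is the vocabulary, and D is the set of labeled training documents.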

21 Industry Sector Dataset Consists of company web pages classified into 105 economic sectors [McCallum et al. 1998, Ghani 2000]

22 Results – Industry Sector Data Set: Naïve Bayes 66.1%, Shrinkage 76% (McCallum et al. 1998), MaxEnt 79%, MaxEnt w/ Prior 81.1% (Nigam et al. 1999), ECOC 63-bit 88.5%. ECOC reduces the error of the Naïve Bayes classifier by 66% with HALF the computational cost.

23 Experimental Setup  Generate the code: BCH codes  Choose a base learner: Naïve Bayes classifier

24 Datasets  Industry Sector 10,000 Corporate web pages (105 classes)  Hoovers-28 and Hoovers-255 4,285 websites  Jobs-15 and Jobs-65 100,000 job titles and descriptions (from WhizBang! Labs)

25 Results

26 Comparison with Naïve Bayes


28 A Closer Look…

29 ECOC for better Precision Industry Sector Dataset

30 ECOC for better Precision Hoovers-28 dataset

31 Summary – Short Codes  Use algebraic coding theory to obtain short codes  Highly efficient and accurate at the same time  High precision classification

32  Efficient (short) codes  Maximize Accuracy  Reduce Labeled Data

33 Roadmap  Efficiency: shorter and better codes  Accuracy: Maximize classification performance  Data efficiency : use unlabeled data

34 Maximize classification performance for fixed code lengths  Investigate the assignment of codewords to classes  Learn the decoding function

35 Codeword assignment  Standard Procedure: Assign codewords to classes randomly  Can we do better?

36 Codeword Assignment  Generate the confusion matrix and use it to assign the most confusable classes the codewords that are farthest apart  Pros: Since most errors are between the most confusable classes, giving those classes maximally separated codewords lets the code absorb more of the binary classifiers' mistakes on them  Cons: The new individual binary problems can be very hard

37 Codeword Assignment Example [Figure: four classes – Computer.Software, Metal Industry, Retail.Technology, Jewelery Industry – shown with their codewords before and after reassignment, so that the most confusable classes end up with the codewords that are farthest apart]
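A hedged sketch of one way to implement the confusion-based assignment described above: greedily give the most confusable class pairs the remaining pair of codewords that are farthest apart. This greedy pairing is an illustrative simplification, not the exact procedure from the talk; all names are assumptions.

```python
import numpy as np

def assign_codewords(code, confusion):
    """Map class index -> codeword index so that highly confused class
    pairs get codewords with large Hamming distance."""
    k = len(code)
    cw_dist = np.array([[np.sum(code[i] != code[j]) for j in range(k)]
                        for i in range(k)])
    conf = confusion + confusion.T          # symmetrize
    np.fill_diagonal(conf, 0)

    assignment = {}
    free_classes, free_codewords = set(range(k)), set(range(k))
    # Visit class pairs from most to least confusable.
    pairs = sorted(((conf[i, j], i, j) for i in range(k) for j in range(i + 1, k)),
                   reverse=True)
    for _, ci, cj in pairs:
        if ci in free_classes and cj in free_classes and len(free_codewords) >= 2:
            # Give this pair the two remaining codewords that are farthest apart.
            wi, wj = max(((a, b) for a in free_codewords for b in free_codewords if a != b),
                         key=lambda p: cw_dist[p[0], p[1]])
            assignment[ci], assignment[cj] = wi, wj
            free_classes -= {ci, cj}
            free_codewords -= {wi, wj}
    # Any leftover class keeps a leftover codeword.
    for c, w in zip(sorted(free_classes), sorted(free_codewords)):
        assignment[c] = w
    return assignment
```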

38 Results: Standard ECOC 75.4%, ECOC with intelligent codeword assignment 56.3% (on 10 classes chosen from the Industry Sector dataset). Why? Some of the bits are really hard to learn.

39 The Decoding Step  Standard: Use Hamming distance to map to the class with the nearest codeword  Weight the individual classifiers according to their training accuracies and do weighted majority decoding  Can we do better?

40 The Decoding Step  Pose decoding as a separate learning problem and use regression or a neural network [Figure: the predicted codeword is mapped to an actual codeword/class either by Hamming distance or by a learned neural network]
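A minimal sketch of the learned-decoding idea, using scikit-learn's MLPClassifier as the neural network (an assumption; any regressor or classifier over the bit outputs would do). The inputs are the binary classifiers' outputs on held-out data and the true class labels; all names are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def learn_decoder(predicted_codewords, true_classes):
    """Learn a mapping from predicted codewords (one row per example,
    one column per binary classifier) directly to class labels."""
    decoder = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
    decoder.fit(predicted_codewords, true_classes)
    return decoder

def decode_learned(decoder, predicted_codeword):
    """Replace Hamming-distance decoding with the learned decoder."""
    return decoder.predict(np.atleast_2d(predicted_codeword))[0]
```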

41 Results: Standard ECOC 75.4%, ECOC with learned decoding 76.1% (on 10 classes chosen from the Industry Sector dataset).

42 What If?  We combine assigning codewords to classes by confusability with learning the decoding [Pipeline: codeword assignment → learn binary classifiers → learn decoding function]

43 Results (on 10 classes chosen from the Industry Sector dataset):
Random codeword assignment + Hamming-distance decoding: 75.4%
Intelligent codeword assignment + Hamming-distance decoding: 56.3%
Random codeword assignment + neural-network decoding: 76.1%
Intelligent codeword assignment + neural-network decoding: 83.3%

44 Picture of why the combination works [Figure: input and output code matrices] Since each column is a binary classification task, a learner with less than 50% accuracy can be made more than 50% accurate by just inverting its predictions (accuracy p < 0.5 becomes 1 − p > 0.5).

45 Summary – Maximize Accuracy  Assigning codewords to classes intelligently and then learning the decoding function performs well.

46  Efficient (short) codes  Maximize Accuracy  Reduce Labeled Data

47 Roadmap  Efficiency: shorter and better codes  Accuracy: Maximize classification performance  Data efficiency : use unlabeled data

48 Supervised Learning with Labeled Data Labeled data is required in large quantities and can be very expensive to collect.

49 Why use Unlabeled data?  Very Cheap in the case of text Web Pages Newsgroups Email Messages  May not be equally useful as labeled data but is available in enormous quantities

50 Goal  Make learning more efficient by reducing the amount of labeled data required for text classification with a large number of categories

51 Related research with unlabeled data  Using EM in a generative model (Nigam et al. 1999)  Transductive SVMs (Joachims 1999)  Co-Training type algorithms (Blum & Mitchell 1998, Collins & Singer 1999, Nigam & Ghani 2000)

52 Plan  ECOC: very accurate and efficient for text categorization with a large number of classes  Co-Training: useful for combining labeled and unlabeled data with a small number of classes

53 The Feature Split Setting [Example job posting with two feature sets – title: "Junior Vermiculturist"; description: "Looking to get involved in worm farming and composting with worms. Look no more!" – and the caption "My Advisor"]

54 The Co-training Algorithm [Blum & Mitchell 1998]  Loop (while unlabeled documents remain): Build classifiers A and B (using Naïve Bayes) Classify unlabeled documents with A and B (using Naïve Bayes) Add the most confident A predictions and the most confident B predictions as labeled training examples

55 The Co-training Algorithm [Blum & Mitchell 1998] [Diagram: Naïve Bayes on feature set A and Naïve Bayes on feature set B each learn from the labeled data, estimate labels for the unlabeled data, select their most confident predictions, and add them to the labeled data]
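A hedged sketch of the co-training loop, with Naïve Bayes on each view as on the slide. It assumes dense count matrices XA/XB for the two views (e.g. title and description), labeled (_l) and unlabeled (_u) pools, and it makes some simplifications (a fixed number of additions per round, no per-class balancing, duplicates between the two views' picks are not merged); the names and defaults are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(XA_l, XB_l, y_l, XA_u, XB_u, rounds=10, per_round=5):
    y_l = np.asarray(y_l)
    for _ in range(rounds):
        if XA_u.shape[0] == 0:
            break
        clf_a = MultinomialNB().fit(XA_l, y_l)
        clf_b = MultinomialNB().fit(XB_l, y_l)
        # Each view's classifier labels the pool and picks its most confident predictions.
        conf_a = clf_a.predict_proba(XA_u).max(axis=1)
        conf_b = clf_b.predict_proba(XB_u).max(axis=1)
        top_a = np.argsort(-conf_a)[:per_round]
        top_b = np.argsort(-conf_b)[:per_round]
        picks = np.concatenate([top_a, top_b])
        new_labels = np.concatenate([clf_a.predict(XA_u[top_a]),
                                     clf_b.predict(XB_u[top_b])])
        # Add the confident self-labeled examples to the labeled pool...
        XA_l = np.vstack([XA_l, XA_u[picks]])
        XB_l = np.vstack([XB_l, XB_u[picks]])
        y_l = np.concatenate([y_l, new_labels])
        # ...and remove them from the unlabeled pool.
        keep = np.setdiff1d(np.arange(XA_u.shape[0]), picks)
        XA_u, XB_u = XA_u[keep], XB_u[keep]
    return MultinomialNB().fit(XA_l, y_l), MultinomialNB().fit(XB_l, y_l)
```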

56 One intuition behind co-training  A and B are redundant  A features are independent of B features given the class  Co-training is like learning with random classification noise: to B, the examples labeled by A's most confident predictions look like a random sample with a small misclassification error (A's error rate)

57 ECOC + Co-Training  ECOC decomposes multiclass problems into binary problems  Co-Training works great with binary problems  ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
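A compact sketch of the combination: each column of the ECOC matrix is turned into a binary problem, and each binary problem is handed to a semi-supervised binary learner such as the co-training routine sketched above. The co_train_binary callable and the argument layout are placeholders, not the talk's actual interface.

```python
def train_ecotrain(code, y_labeled, labeled_views, unlabeled_views, co_train_binary):
    """code: (n_classes, n_bits) 0/1 matrix; labeled_views/unlabeled_views hold
    the two feature sets.  Returns one co-trained binary classifier per bit."""
    bit_classifiers = []
    for b in range(code.shape[1]):
        bit_labels = code[y_labeled, b]  # relabel the labeled data with this column's bit
        bit_classifiers.append(co_train_binary(labeled_views, bit_labels, unlabeled_views))
    return bit_classifiers
```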

58 [Figure: example categories – SPORTS, SCIENCE, ARTS, HEALTH, POLITICS, LAW]

59 Datasets  Jobs-65 (from WhizBang!): job postings with two feature sets (title and description), 65 categories, baseline 11%  Hoovers-255: collection of 4,285 corporate websites, each company classified into one of 255 categories, baseline 2%

60 Results (classification accuracy, %; Naïve Bayes and ECOC use no unlabeled data, while EM, Co-Training, and ECOC + Co-Training use the unlabeled data)
Dataset       Naïve Bayes          ECOC                 EM        Co-Training  ECOC + Co-Training
              10% lab.  100% lab.  10% lab.  100% lab.  10% lab.  10% lab.     10% labeled
Jobs-65       50.1      68.2       59.3      71.2       58.2      54.1         64.5
Hoovers-255   15.2      32.0       24.8      36.5        9.1      10.2         27.6

62 Results Hoovers-255 dataset

63 Summary – Using Unlabeled Data  Combination of ECOC and Co-Training performs extremely well for learning with labeled and unlabeled data  Results in High-Precision classifications  Randomly partitioning the vocabulary works well for Co-Training

64 Conclusions  Using short codes with good error-correcting properties results in high accuracy and precision  Combination of assigning codewords to classes according to their confusability and learning the decoding is promising  Using unlabeled data for a large number of categories by combining ECOC with Co-Training leads to good results

65 Next Steps  Use Active Learning techniques to pick initial labeled examples  Develop techniques to partition a standard feature set into two redundantly sufficient feature sets  Create codes specifically for unlabeled data

66 What happens with sparse data?

67 ECOC + CoTrain – Results (105-class problem; accuracy %)
Algorithm                                   300L + 0U   50L + 250U   5L + 295U   (per class)
Naïve Bayes (no unlabeled data)             76          67           40.3
ECOC 15-bit (no unlabeled data)             76.5        68.5         49.2
EM (uses unlabeled data)                    –           68.2         51.4
Co-Train (uses unlabeled data)              –           67.6         50.1
ECoTrain (ECOC + Co-Training, unlabeled)    –           72.0         56.1

68 What Next?  Use improved version of co-training (gradient descent) Less prone to random fluctuations Uses all unlabeled data at every iteration  Use Co-EM (Nigam & Ghani 2000) - hybrid of EM and Co-Training

69 Potential Drawbacks  Random Codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems

70 Summary  Use ECOC for efficient text classification with a large number of categories  Increase Accuracy & Efficiency  Use Unlabeled data by combining ECOC and Co-Training  Generalize to domain-independent classification tasks involving a large number of categories

71 Semi-Theoretical Model Model ECOC by a Binomial Distribution B(n,p) n = length of the code p = probability of each bit being classified incorrectly

72 Semi-Theoretical Model: model ECOC by a binomial distribution B(n, p), where n = length of the code and p = probability of each bit being classified incorrectly
# of Bits   H_min   E_max   P_ave   Accuracy
15          5       2       .85     .59
15          5       2       .89     .80
15          5       2       .91     .84
31          11      5       .85     .67
31          11      5       .89     .91
31          11      5       .91     .94
63          31      15      .89     .99
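The table's accuracy column can be roughly reproduced from the stated model: with n bits, minimum Hamming distance H_min, and average bit accuracy P_ave, the predicted accuracy is the binomial probability of at most ⌊(H_min − 1)/2⌋ bit errors. A small sketch (the values come out close to, but not exactly, the figures above):

```python
from math import comb

def predicted_accuracy(n_bits, h_min, p_ave):
    """P(at most floor((h_min - 1) / 2) bit errors) under Binomial(n_bits, 1 - p_ave)."""
    e_max = (h_min - 1) // 2
    p_err = 1.0 - p_ave
    return sum(comb(n_bits, k) * p_err**k * (1.0 - p_err)**(n_bits - k)
               for k in range(e_max + 1))

print(round(predicted_accuracy(15, 5, 0.85), 2))   # about 0.60 (table: .59)
print(round(predicted_accuracy(31, 11, 0.89), 2))  # about 0.88 (table: .91)
```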

74 The Longer the Better! [Table 2: average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000 words selected using Information Gain]  Longer codes mean larger codeword separation  The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C  If the minimum Hamming distance is h, then the code can correct ⌊(h−1)/2⌋ errors

75 Types of Codes  Data-independent: algebraic, random, hand-constructed  Data-dependent: adaptive

76 What is a Good Code?  Row Separation  Column Separation (Independence of errors for each binary classifier)  Efficiency (for long codes)

77 Choosing Codes
              Random                          Algebraic
Row Sep       On average (for long codes)     Guaranteed
Col Sep       On average (for long codes)     Can be guaranteed
Efficiency    No                              Yes

78 Experimental Results
Code            Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate
15-bit BCH      5            15           49           64           20.6%
19-bit Hybrid   5            18           15           69           22.3%
15-bit Random   2 (1.5)      13           42           60           24.1%

79 Design codewords  Maximize performance (accuracy, precision, recall, F1?)  Minimize the length of the codes  Search in the space of codewords through gradient descent on G = Error + Code_Length
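The gradient-descent search over codewords is only named on the slide; as an illustration of searching code space against an objective of the form G = Error + Code_Length, here is a hedged stand-in that uses simple bit-flip hill climbing with a crude error proxy. The objective weights, the error proxy, and all names are assumptions, not the talk's actual method.

```python
import numpy as np

def objective(code, bit_error_estimates, lam=0.01):
    """Stand-in for G = Error + Code_Length: expected bit errors beyond what
    the code's minimum Hamming distance can correct, plus a length penalty."""
    n_classes, n_bits = code.shape
    h_min = min(int(np.sum(code[i] != code[j]))
                for i in range(n_classes) for j in range(i + 1, n_classes))
    correctable = (h_min - 1) // 2
    expected_errors = float(np.sum(bit_error_estimates[:n_bits]))
    return max(0.0, expected_errors - correctable) + lam * n_bits

def search_code(n_classes, n_bits, bit_error_estimates, iters=1000, seed=0):
    """Hill climbing: flip one bit at a time and keep the flip if it helps."""
    rng = np.random.default_rng(seed)
    code = rng.integers(0, 2, size=(n_classes, n_bits))
    best = objective(code, bit_error_estimates)
    for _ in range(iters):
        i, j = rng.integers(n_classes), rng.integers(n_bits)
        code[i, j] ^= 1
        score = objective(code, bit_error_estimates)
        if score <= best:
            best = score
        else:
            code[i, j] ^= 1  # revert a flip that made things worse
    return code, best
```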

80 [Figure: classification performance vs. efficiency, plotting Naïve Bayes, ECOC (as used in Berger 99), and the GOAL of this work – a repeat of slide 15]

81 One-per-class (identity) code for classes A–J:
A 1 0 0 0 0 0 0 0 0 0
B 0 1 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 0 0 0
D 0 0 0 1 0 0 0 0 0 0
E 0 0 0 0 1 0 0 0 0 0
F 0 0 0 0 0 1 0 0 0 0
G 0 0 0 0 0 0 1 0 0 0
H 0 0 0 0 0 0 0 1 0 0
I 0 0 0 0 0 0 0 0 1 0
J 0 0 0 0 0 0 0 0 0 1

