
Employing EM and Pool-Based Active Learning for Text Classification. Andrew McCallum and Kamal Nigam, Just Research and Carnegie Mellon University.


1 Employing EM and Pool-Based Active Learning for Text Classification Andrew McCallum and Kamal Nigam, Just Research and Carnegie Mellon University

2 Text Active Learning Many applications Scenario: ask for labels of a few documents While learning: –Learner carefully selects unlabeled document –Trainer provides label –Learner rebuilds classifier
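The select/label/rebuild loop on this slide can be sketched as a generic pool-based routine. The helper names `train`, `select`, and `oracle` are hypothetical stand-ins for the pieces the talk describes, not functions from the paper:

```python
def active_learn(labeled, unlabeled, rounds, train, select, oracle):
    """Generic active-learning loop (hypothetical helper names).

    train(labeled)          -> classifier built from labeled (doc, label) pairs
    select(clf, unlabeled)  -> index of the unlabeled document to query
    oracle(doc)             -> label supplied by the human trainer
    """
    clf = train(labeled)
    for _ in range(rounds):
        i = select(clf, unlabeled)          # learner carefully selects a document
        doc = unlabeled.pop(i)
        labeled.append((doc, oracle(doc)))  # trainer provides the label
        clf = train(labeled)                # learner rebuilds the classifier
    return clf
```

Any committee-based disagreement score can be plugged in through `select`, which is where the QBC machinery of the following slides lives.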

3 Query-By-Committee (QBC) Label documents with high classification variance Iterate: –Create a committee of classifiers –Measure committee disagreement about the class of unlabeled documents –Select a document for labeling Theoretical results promising [Freund et al. 97] [Seung et al. 92]

4 Text Framework “Bag of Words” document representation Naïve Bayes classification: For each class, estimate P(word|class)
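A minimal sketch of the slide's framework: bag-of-words documents and a multinomial naive Bayes classifier that estimates P(word|class) per class. Add-one (Laplace) smoothing is an assumption here; the slide does not specify the smoothing scheme:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, label) pairs.
    Returns class priors and add-one smoothed P(word|class) estimates."""
    vocab = {w for words, _ in docs for w in words}
    classes = {y for _, y in docs}
    counts = {c: Counter() for c in classes}
    for words, y in docs:
        counts[y].update(words)
    n_docs = Counter(y for _, y in docs)
    priors = {c: n_docs[c] / len(docs) for c in classes}
    cond = {c: {w: (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
                for w in vocab}
            for c in classes}
    return priors, cond

def classify(words, priors, cond):
    """Pick the class maximizing log P(class) + sum of log P(word|class)."""
    scores = {c: math.log(priors[c]) +
                 sum(math.log(cond[c][w]) for w in words if w in cond[c])
              for c in priors}
    return max(scores, key=scores.get)
```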

5 Outline: Our approach Create committee by sampling from distribution over classifiers Measure committee disagreement with KL-divergence of the committee members Select documents from a large pool using both disagreement and density-weighting Add EM to use documents not selected for labeling

6 Creating Committees Each class is a distribution over word frequencies For each member, construct each class by: –Drawing from the Dirichlet distribution defined by the labeled data [Slide diagram: labeled data defines a classifier distribution; the MAP classifier and sampled Members 1, 2, 3 form the committee]
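The committee construction above can be sketched as follows. A draw from a Dirichlet distribution can be obtained by normalizing independent Gamma samples; treating the Dirichlet parameters as smoothed word counts is an assumption consistent with the slide, not a spec from it:

```python
import random

def sample_word_dist(counts):
    """Draw one P(word|class) vector from the Dirichlet distribution whose
    parameters are the (add-one smoothed) labeled-data word counts.
    A Dirichlet sample is a normalized vector of Gamma(alpha_w, 1) draws."""
    gammas = {w: random.gammavariate(c + 1, 1.0) for w, c in counts.items()}
    total = sum(gammas.values())
    return {w: g / total for w, g in gammas.items()}

def sample_committee(class_counts, k):
    """class_counts: {class: {word: count}} from the labeled data.
    Returns k committee members, each a {class: {word: prob}} classifier."""
    return [{c: sample_word_dist(counts) for c, counts in class_counts.items()}
            for _ in range(k)]
```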

7 Measuring Committee Disagreement Kullback-Leibler Divergence to the mean –compares differences in how members “vote” for classes –Considers entire class distribution of each member –Considers “confidence” of the top-ranked class
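KL divergence to the mean, as described on this slide, averages each member's divergence from the committee's mean class distribution, so it uses every member's full class distribution and its confidence. A small sketch:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) over a shared class set."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p if p[c] > 0)

def disagreement(votes):
    """votes: one class-posterior dict per committee member for one document.
    Returns the mean KL divergence of each member's distribution to the
    committee mean; zero iff all members vote identically."""
    classes = votes[0].keys()
    mean = {c: sum(v[c] for v in votes) / len(votes) for c in classes}
    return sum(kl(v, mean) for v in votes) / len(votes)
```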

8 Selecting Documents Stream-based sampling: –Disagreement => Probability of selection –Implicit (but crude) instance distribution information Pool-based sampling: –Select highest disagreement of all documents –Lose distribution information
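The stream-based variant above can be sketched in a few lines: each arriving document is accepted for labeling with probability tied to its disagreement, which is what gives the crude, implicit sensitivity to the instance distribution. The `scale` factor is a hypothetical knob, not from the talk:

```python
import random

def stream_select(disagreement, scale=1.0):
    """Stream-based sampling: label the incoming document with probability
    proportional to its committee disagreement (capped at 1)."""
    p = min(1.0, scale * disagreement)
    return random.random() < p
```

Pool-based sampling instead ranks the whole pool and takes the argmax of disagreement, which discards that distributional information; the density-weighted variant on the next slides restores it.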

9 Disagreement

10 Density-Weighted Pool-Based Sampling A balance of disagreement and distributional information Select documents by combining disagreement with density Calculate density by: –(Geometric) average distance to all documents
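One plausible reading of this slide, sketched below: score each pool document by its disagreement weighted by its density, where density is the geometric mean of its similarity to every other pool document. Treating the pairwise values as similarities in (0, 1] is an assumption for this sketch:

```python
import math

def density(i, sims):
    """Geometric mean of document i's similarity (in (0, 1]) to every other
    pool document: high for documents in dense regions of the pool."""
    others = [sims[i][j] for j in range(len(sims)) if j != i]
    return math.exp(sum(math.log(s) for s in others) / len(others))

def select_density_weighted(disagreements, sims):
    """Pick the pool document with the highest density-weighted disagreement."""
    scores = [disagreements[i] * density(i, sims)
              for i in range(len(disagreements))]
    return max(range(len(scores)), key=scores.__getitem__)
```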

11 Disagreement

12 Density

13 Datasets and Protocol Reuters-21578 and a subset of Newsgroups One initial labeled document per class 200 iterations of active learning [Slide graphic: Newsgroups classes mac, ibm, graphics, windows, X under computers; Reuters classes acq, corn, trade, ...]

14 QBC on Reuters acq, P(+) = 0.25; trade, P(+) = 0.038; corn, P(+) = 0.018

15 Selection comparison on News5

16 EM after Active Learning After active learning, only a few documents have been labeled Use EM to predict the labels of the remaining unlabeled documents Use all documents to build a new classification model, which is often more accurate
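The EM step above can be sketched generically: the E-step labels the unlabeled pool with the current model, the M-step retrains on labeled plus self-labeled data. This sketch uses hard label assignments for simplicity; EM with naive Bayes, as in the talk, would use soft (probabilistic) labels:

```python
def em(train, predict, labeled, unlabeled, iters=10):
    """Simplified hard-EM loop (hypothetical helper names).

    train(pairs)      -> model from (doc, label) pairs
    predict(model, d) -> label for document d
    """
    model = train(labeled)
    for _ in range(iters):
        guessed = [(d, predict(model, d)) for d in unlabeled]  # E-step
        model = train(labeled + guessed)                       # M-step
    return model
```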

17 QBC and EM on News5

18 Related Work Active learning with text: –[Dagan & Engelson 95]: QBC for part-of-speech tagging –[Lewis & Gale 94]: Pool-based non-QBC –[Liere & Tadepalli 97 & 98]: QBC with Winnow & Perceptrons EM with text: –[Nigam et al. 98]: EM with unlabeled data

19 Conclusions & Future Work Small P(+) => better active learning Leverage unlabeled pool by: –pool-based sampling –density-weighting –Expectation-Maximization Different active learning approaches a la [Cohn et al. 96] Interleaved EM & active learning

20 Document classification: the Potential 3 × 10^8 unlabeled web pages Classification important for the Web –Knowledge extraction –User interest modeling

21 Document classification: the Status Good techniques exist –Many parameters to estimate –Data very sparse –Lots of training examples needed

22 Document classification: the Challenge Labeling data is expensive –requires human interaction –domains may constrain labeling effort Use Active Learning! –Pick carefully which documents to label –Reach the knee of the learning curve sooner

23 Disagreement Example

24 Reuters-21578 Skewed priors => better active learning? Reuters: binary classification & skewed priors Better active-learning results on the rarer classes

25 comp.* Newsgroups dataset 5 categories, 1000 documents each 20% held out for testing One initial labeled document per class 200 iterations of active learning 10 runs per curve [Slide graphic: classes mac, ibm, graphics, windows, X under computers]

26 Text Classification Many applications Good techniques exist Require lots of data Labeling expensive Use active learning [Example document: "Corn prices rose today while corn futures dropped in surprising trading activity. Corn..."]

27 Old QBC Stuff For each unlabeled document: –Pick two consistent hypotheses –If they disagree about the label, request it

