Active, Semi-Supervised Learning for Textual Information Access
Anastasia Krithara¹, Cyril Goutte², Massih-Reza Amini³, Jean-Michel Renders¹


1 Active, Semi-Supervised Learning for Textual Information Access. Anastasia Krithara¹, Cyril Goutte², Massih-Reza Amini³, Jean-Michel Renders¹. International Workshop on Intelligent Information Access, Helsinki 2006, 07/07/06. ¹ Xerox Research Centre Europe, 6 chemin de Maupertuis, F-38240 Meylan, France. ² National Research Council Canada, Institute for Information Technology, Interactive Language Technologies Group, 101 St-Jean-Bosco Street, Gatineau, QC K1A 0R6, Canada. ³ Department of Computer Science, University of Paris VI, 8 rue de Capitaine Scott, 75015 Paris, France.

2 Introduction (1). Supervised learning: a classifier is trained on labeled examples, which are obtained by running unlabeled examples through an annotation process. Problem: the annotation process is often costly and time-consuming.

3 Introduction (2). Solutions: Semi-Supervised Learning and Active Learning. Both address the same problem, but from different perspectives.

4 Outline. The problem and proposed solutions. Our method: Active, Semi-Supervised PLSA. Experiments. Conclusions and future work.

5 Semi-Supervised Learning (SSL). Given a small set of labeled data L and a large set of unlabeled data U, train a model M on L ∪ U. Unlabeled data can give us valuable information about P(x).

6 Active Learning. Given a small set of labeled data L and a large set of unlabeled data U, repeat:
  Train a model M on L.
  Use M to test U.
  Select the most useful example from U.
  Ask the human expert to label it.
  Add the labeled example to L.
Until M reaches a certain performance level or a certain number of queries has been made.

7 Combination of SSL and Active Learning. Given a small set of labeled data L and a large set of unlabeled data U, repeat:
  Train a model M on L ∪ U (semi-supervised learning).
  Use M to test U.
  Select the most useful example from U.
  Ask the human expert to label it.
  Add the labeled example to L.
Until M reaches a certain performance level or a certain number of queries has been made. A code sketch of this loop is given below.
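As an illustration only, here is a minimal Python sketch of this combined loop. The model interface (fit taking labeled and unlabeled data, predict_proba), the oracle callable, and the uncertainty-based selection rule are placeholders assumed for the sketch, not the authors' implementation.

    import numpy as np

    def active_ssl_loop(model, X_labeled, y_labeled, X_unlabeled, oracle, n_queries=50):
        """Generic active + semi-supervised loop (illustrative sketch only).

        model     : classifier with fit(X_l, y_l, X_u) and predict_proba(X) (assumed interface)
        oracle    : callable returning the true label of a queried example (the human expert)
        n_queries : stop after this many queries, or earlier if the pool empties
        """
        X_l, y_l = list(X_labeled), list(y_labeled)
        X_u = list(X_unlabeled)
        for _ in range(n_queries):
            if not X_u:
                break
            # Semi-supervised training on labeled AND unlabeled data.
            # (Ignoring the unlabeled set here reduces this to the plain
            #  active-learning loop of slide 6.)
            model.fit(np.array(X_l), np.array(y_l), np.array(X_u))
            # Score the unlabeled pool and pick the most uncertain example
            # (highest entropy of the predicted label distribution, as on slide 14).
            proba = model.predict_proba(np.array(X_u))
            entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
            i = int(entropy.argmax())
            # Ask the expert for the label and move the example from U to L.
            X_l.append(X_u[i])
            y_l.append(oracle(X_u[i]))
            del X_u[i]
        return model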

8 Active, Semi-Supervised PLSA (1). We represent our document collection as a term-by-document matrix, in other words as word-document co-occurrence counts n(w,d).
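For concreteness, a term-by-document count matrix of this kind can be built as follows. This is an illustrative sketch using scikit-learn (not mentioned on the slides), with made-up toy documents.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the pitcher threw a fast ball",           # toy baseball document
        "the goalie stopped the puck on the ice",  # toy hockey document
    ]
    vectorizer = CountVectorizer()
    # Rows are documents, columns are terms; the (d, w) entry is the
    # co-occurrence count n(w, d) used by the mixture model.
    X = vectorizer.fit_transform(docs)             # sparse document-by-term matrix
    print(X.toarray())
    print(vectorizer.get_feature_names_out())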

9 Active, Semi-Supervised PLSA (2). Problem: synonyms (different words with the same meaning) and polysemes (words with multiple meanings) create a disconnection between topics and words. Solution: PLSA (Probabilistic Latent Semantic Analysis) aims to discover the meaning behind the words, in other words the topics of the document.

10 Active, Semi-Supervised PLSA (3). We model our data by a mixture model, under the assumption that w and d are independent given the latent component: P(w,d) = Σ_c P(c) P(w|c) P(d|c), where P(w|c) gives the profile of a component, the document-side distributions give the topics which are in a document, and c = 1…K is the index over the K latent components.
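To make the "topics of a document" remark concrete, the document-level topic distribution follows from the mixture by Bayes' rule. This is written for the symmetric PLSA parameterization, which is an assumption about the exact form on the original slide:

    P(w,d) = \sum_{c=1}^{K} P(c)\, P(w \mid c)\, P(d \mid c),
    \qquad
    P(c \mid d) = \frac{P(c)\, P(d \mid c)}{\sum_{c'=1}^{K} P(c')\, P(d \mid c')}.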

11 Active, Semi-Supervised PLSA (4). When the ratio of labeled to unlabeled documents is very low, some components contain only unlabeled examples. In this case arbitrary probabilities will be assigned to these components, which will lead to arbitrary decisions during classification. Solution: introduce an additional "fake label" variable z = L0. All labeled examples keep their label; all unlabeled examples get the new "label" z = L0. After training the model, we distribute the probability obtained for the "fake" label onto the "real" labels.
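The redistribution rule itself is not preserved in this transcript. One simple scheme, shown below purely as an assumption-laden sketch and not necessarily the authors' published formula, distributes the mass of the fake label over the real labels in proportion to their current posteriors, i.e. it renormalizes over the real labels:

    import numpy as np

    def redistribute_fake_label(posteriors, fake_index=0):
        """posteriors: array of shape (n_docs, n_labels) including the fake label L0.

        Returns posteriors over the real labels only, with the probability mass of
        the fake label redistributed proportionally (simple renormalization).
        This is an illustrative assumption, not the authors' exact rule.
        """
        real = np.delete(posteriors, fake_index, axis=1)   # drop the L0 column
        return real / real.sum(axis=1, keepdims=True)      # renormalize each row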

12 Active, Semi-Supervised PLSA (5). Taking into account the label z, the mixture model is extended to cover words, documents and labels jointly, where c = 1…K is the index over the K latent components. We then use a variant of the EM algorithm to train this multinomial mixture model, maximizing the (log-)likelihood of the data, where z(d) is the (unique) label of document d and n(w,d) is the number of occurrences of word w in document d.
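The formulas themselves are not reproduced in this transcript. Under the assumption that the label is generated from the latent component via a term P(z|c), one reading of the extended model and its log-likelihood that is consistent with the quantities defined above is:

    P(w, d, z) = \sum_{c=1}^{K} P(c)\, P(w \mid c)\, P(d \mid c)\, P(z \mid c),
    \qquad
    \mathcal{L} = \sum_{d} \sum_{w} n(w,d)\, \log \sum_{c=1}^{K} P(c)\, P(w \mid c)\, P(d \mid c)\, P(z(d) \mid c).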

13 Active, Semi-Supervised PLSA (6). The EM algorithm alternates an E-step, which computes the posterior probability of each latent component for every observation, and an M-step, which re-estimates the mixture parameters from these posteriors.
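The slide's update equations are not preserved here. For the mixture sketched above, the standard EM updates take the following form; this is a reconstruction under the same assumptions as the likelihood sketch, and the authors' variant may differ in details:

    \text{E-step:}\quad
    P(c \mid d, w) = \frac{P(c)\, P(w \mid c)\, P(d \mid c)\, P(z(d) \mid c)}
                          {\sum_{c'} P(c')\, P(w \mid c')\, P(d \mid c')\, P(z(d) \mid c')}

    \text{M-step:}\quad
    P(w \mid c) \propto \sum_{d} n(w,d)\, P(c \mid d, w), \qquad
    P(d \mid c) \propto \sum_{w} n(w,d)\, P(c \mid d, w),

    \phantom{\text{M-step:}}\quad
    P(z \mid c) \propto \sum_{d:\, z(d)=z} \sum_{w} n(w,d)\, P(c \mid d, w), \qquad
    P(c) \propto \sum_{d} \sum_{w} n(w,d)\, P(c \mid d, w).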

14 Active, Semi-Supervised PLSA (7). On top of the semi-supervised learning, active learning: query the unlabeled example about which the model is most uncertain, i.e. the example with the highest entropy of the predicted label distribution. For the binary case, this is the example whose predicted probability is closest to 0.5.
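Written out, the uncertainty criterion referred to on this slide is the standard entropy-based selection rule (the slide's own formula is not preserved in the transcript):

    d^{*} = \arg\max_{d \in U} \Big( -\sum_{y} P(y \mid d)\, \log P(y \mid d) \Big),

which in the binary case is equivalent to choosing the document whose predicted probability P(y=1 \mid d) is closest to 0.5.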

15 Experimental Setting (1). Corpus: 3 binary problems from the 20-newsgroups dataset:
  rec.sport.baseball (994) vs. rec.sport.hockey (999)
  comp.sys.ibm.pc.hardware (982) vs. comp.sys.mac.hardware (961)
  talk.religion.misc (628) vs. alt.atheism (799)
(They represent easy, moderate and hard problems respectively.) Corpus split: 80% for the training set, with 2 labeled examples (one of each category) and the rest as unlabeled examples; 20% for the test set (for an unbiased estimate of the accuracy).

16 Experimental Setting (2). Comparison of the following methods: semi-supervised PLSA + active learning; semi-supervised PLSA + random queries; SVM + active learning (choosing the examples closest to the margin).
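As an illustration of the SVM baseline's query rule (a sketch using scikit-learn, which the slides do not mention), the example closest to the margin is the one with the smallest absolute decision-function value:

    import numpy as np
    from sklearn.svm import LinearSVC

    def query_closest_to_margin(X_labeled, y_labeled, X_unlabeled):
        """Return the index of the unlabeled example closest to the SVM margin."""
        clf = LinearSVC().fit(X_labeled, y_labeled)
        # decision_function gives signed distances to the separating hyperplane;
        # the most uncertain example is the one with the smallest |distance|.
        scores = clf.decision_function(X_unlabeled)
        return int(np.argmin(np.abs(scores)))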

17 Experimental Results: Baseball vs. Hockey. Comparison of the active semi-supervised PLSA (top) with semi-supervised PLSA querying random examples (middle) and SVM querying the examples closest to the margin (bottom).

18 Experimental Results: PC vs. Mac.

19 Experimental Results: Religion vs. Atheism.

20 Conclusions. We proposed a method which combines SSL and active learning using PLSA. The combination outperforms semi-supervised PLSA alone. The harder the problem, the more active learning helps.

21 Future Work. More experiments with different datasets. Use different active learning methods. Take different costs into account. Apply our method to multiclass problems. …

22 Thank you. Questions?

