
1 Semi-supervised learning and self-training LING 572 Fei Xia 02/14/06

2 Outline
Overview of semi-supervised learning (SSL)
Self-training (a.k.a. bootstrapping)

3 Additional references
Xiaojin Zhu (2006): Semi-Supervised Learning Literature Survey.
Olivier Chapelle et al. (2005): Semi-Supervised Learning. The MIT Press.

4 Overview of SSL

5 What is SSL?
Labeled data:
–Ex: POS tagging: tagged sentences
–Creating labeled data is difficult, expensive, and/or time-consuming.
Unlabeled data:
–Ex: POS tagging: untagged sentences
–Obtaining unlabeled data is easier.
Goal: use both labeled and unlabeled data to improve performance.

6 Learning
–Supervised (labeled data only)
–Semi-supervised (both labeled and unlabeled data)
–Unsupervised (unlabeled data only)
Problems:
–Classification
–Regression
–Clustering
–…
→ Focus on the semi-supervised classification problem

7 A brief history of SSL
The idea of self-training appeared in the 1960s. SSL took off in the 1970s. Interest in SSL increased in the 1990s, mostly due to applications in NLP.

8 Does SSL work?
Yes, under certain conditions:
–Problem itself: knowledge of p(x) carries information that is useful in the inference of p(y | x).
–Algorithm: the modeling assumptions fit well with the problem structure.
SSL will be most useful when there is far more unlabeled data than labeled data.
SSL can degrade performance when mistakes reinforce themselves.

9 Illustration (Zhu, 2006)

10 Illustration (cont)

11 Assumptions
Smoothness (continuity) assumption: if two points x1 and x2 in a high-density region are close, then so should be the corresponding outputs y1 and y2.
Cluster assumption: if points are in the same cluster, they are likely to be of the same class.
Low-density separation: the decision boundary should lie in a low-density region.
…

12 SSL algorithms
Self-training
Co-training
Generative models:
–Ex: EM with generative mixture models
Low-density separation:
–Ex: Transductive SVM
Graph-based models

13 Which SSL method should we use?
It depends. Semi-supervised methods make strong model assumptions; choose one whose assumptions fit the problem structure.

14 Semi-supervised and active learning
They address the same issue: labeled data are hard to get.
Semi-supervised: choose unlabeled data (with their predicted labels) to be added to the labeled data.
Active learning: choose unlabeled data to be annotated by a human.

15 Self-training

16 Basics of self-training
Probably the earliest SSL idea; also called self-teaching or bootstrapping. It appeared in the 1960s and 1970s.
First well-known NLP paper: (Yarowsky, 1995)

17 Self-training algorithm
Let L be the set of labeled data and U the set of unlabeled data.
Repeat:
–Train a classifier h on the training data L
–Classify the data in U with h
–Find a subset U’ of U with the most confident scores
–L + U’ → L
–U – U’ → U
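
The following is a minimal, runnable sketch of this loop, assuming a scikit-learn-style base learner and a fixed probability threshold for picking U’ (the logistic-regression learner, the threshold value, and the function names are illustrative, not from the slides):

```python
# Self-training sketch: repeatedly grow L with the most confidently
# classified members of U. Assumptions (not from the slides): numpy arrays
# as input, scikit-learn available, confidence = max class probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_iter=20):
    L_X, L_y, U = X_labeled, y_labeled, X_unlabeled
    h = LogisticRegression(max_iter=1000)
    for _ in range(max_iter):
        h.fit(L_X, L_y)                              # train a classifier h on L
        if len(U) == 0:
            break
        probs = h.predict_proba(U)                   # classify data in U with h
        confident = probs.max(axis=1) >= threshold   # U': most confident scores
        if not confident.any():
            break
        pseudo = h.classes_[probs[confident].argmax(axis=1)]
        L_X = np.vstack([L_X, U[confident]])         # L + U' -> L
        L_y = np.concatenate([L_y, pseudo])
        U = U[~confident]                            # U - U' -> U
    return h.fit(L_X, L_y)                           # final classifier on the grown L
```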

18 Case study: (Yarowsky, 1995)

19 Setting
Task: WSD
–Ex: “plant”: living / factory
“Unsupervised”: only a few seed collocations are needed for each sense.
Learner: decision list (DL)

20 Assumption #1: one sense per collocation
Nearby words provide strong and consistent clues to the sense of a target word.
The effect varies depending on the type of collocation:
–It is strongest for immediately adjacent collocations.
Assumption #1 → use collocations in the decision rules.

21 Assumption #2: one sense per discourse
The sense of a target word is highly consistent within any given document.
The assumption holds most of the time (99.8% in their experiment).
Assumption #2 → filter and augment the addition of unlabeled data.

22 Step 1: identify all examples of the given word: “plant”
Our sample set S.

23 Step 2: create initial labeled data using a small number of seed collocations
Sense A: “life”; Sense B: “manufacturing”
This gives our L(0); U(0) = S – L(0) is the residual data set.
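
Below is a small sketch of this seeding step, assuming each occurrence of “plant” is represented as the set of words in its context window (the data layout and helper names are illustrative; only the two seed collocations come from the slide):

```python
# Step 2 sketch: label an occurrence only if its context matches exactly one
# kind of seed collocation; everything else goes to the residual set U(0).
SEEDS = {"life": "A", "manufacturing": "B"}   # seed collocations from the slide

def seed_label(samples):
    """samples: list of context-word sets, one per occurrence of 'plant'."""
    labeled, residual = [], []                # L(0) and U(0) = S - L(0)
    for context in samples:
        senses = {SEEDS[w] for w in context if w in SEEDS}
        if len(senses) == 1:
            labeled.append((context, senses.pop()))
        else:                                 # no seed hit, or conflicting seeds
            residual.append(context)
    return labeled, residual
```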

24 Initial “labeled” data

25 Step 3a: train a DL classifier
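
As a rough picture of what training a decision-list (DL) classifier involves, the sketch below ranks (collocation, sense) rules by a smoothed log-likelihood ratio and classifies with the first matching rule. Yarowsky (1995) uses richer collocational feature templates and different smoothing details, so treat this as a schematic, not his exact procedure:

```python
# Decision-list sketch: score rules "collocation f -> sense s" by a smoothed
# log-likelihood ratio, sort them, and classify by the first rule that fires.
from collections import Counter, defaultdict
from math import log

def train_decision_list(labeled, alpha=0.1):
    """labeled: list of (context_word_set, sense) pairs."""
    by_feature = defaultdict(Counter)           # collocation -> counts per sense
    senses = set()
    for context, sense in labeled:
        senses.add(sense)
        for f in context:
            by_feature[f][sense] += 1
    rules = []
    for f, counts in by_feature.items():
        total = sum(counts.values())
        for s in senses:
            score = log((counts[s] + alpha) / (total - counts[s] + alpha))
            rules.append((score, f, s))
    rules.sort(reverse=True)                    # strongest evidence first
    return rules

def classify(rules, context):
    for score, f, s in rules:
        if f in context:                        # first matching rule decides
            return s, score
    return None, 0.0
```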

26 Step 3b: apply the classifier to the entire set
Add to L the members of U that are tagged with probability above a threshold.
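
A sketch of this growth step, assuming a tagger function such as classify() from the decision-list sketch above; the threshold value is illustrative:

```python
# Step 3b sketch: tag the residual set with the current classifier and move
# members tagged above a confidence threshold into the labeled set.
def grow_labeled(tagger, labeled, residual, threshold=2.0):
    """tagger(context) -> (sense, score), e.g. classify() from the DL sketch."""
    still_residual = []
    for context in residual:
        sense, score = tagger(context)
        if sense is not None and score >= threshold:
            labeled.append((context, sense))     # confident: add to L
        else:
            still_residual.append(context)       # stays in U for the next round
    return labeled, still_residual
```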

27 Step 3c: filter and augment this addition with assumption #2
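
One way to realize this filter in code, assuming each tagged example carries the id of the discourse/document it came from (the data layout is illustrative, and only the majority-override half of the filter-and-augment step is shown):

```python
# Step 3c sketch (one sense per discourse): within each document, let the
# majority sense override minority tags. (Yarowsky also augments: untagged
# occurrences in the same discourse can inherit this sense; omitted here.)
from collections import Counter, defaultdict

def one_sense_per_discourse(tagged):
    """tagged: list of (doc_id, context, sense) triples for the target word."""
    by_doc = defaultdict(list)
    for doc_id, context, sense in tagged:
        by_doc[doc_id].append((context, sense))
    filtered = []
    for doc_id, items in by_doc.items():
        majority = Counter(s for _, s in items).most_common(1)[0][0]
        filtered.extend((doc_id, context, majority) for context, _ in items)
    return filtered
```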

28 Repeat step 3 until convergence

29 The final DL classifier

30 The original algorithm Keep initial labeling unchanged.

31 Options for obtaining the seeds
Use words in dictionary definitions
Use a single defining collocate for each sense:
–Ex: “bird” and “machine” for the word “crane”
Label salient corpus collocates:
–Collect frequent collocates
–Manually check the collocates
→ Getting the seeds is not hard.

32 Performance
Baseline: 63.9%
Supervised learning: 96.1%
Unsupervised learning: 90.6% – 96.5%

33 Why does it work? (Steven Abney, 2004): “Understanding the Yarowsky Algorithm”

34 Summary of self-training
The algorithm is straightforward and intuitive, and it produces outstanding results.
However, the added unlabeled data can pollute the original labeled data.

35 Additional slides

36 Papers on self-training
Yarowsky (1995): WSD
Riloff et al. (2003): identifying subjective nouns
Maeireizo et al. (2004): classifying dialogues as “emotional” or “non-emotional”

