1 Word representations: A simple and general method for semi-supervised learning. Joseph Turian, with Lev Ratinov and Yoshua Bengio. Goodies: http://metaoptimize.com/projects/wordreprs/

2 Supervised training: sup data → sup model.

3 Supervised training: sup data → sup model. Semi-sup training?

4 Supervised training: sup data → sup model. Semi-sup training?

5 Supervised training: sup data + more feats → sup model. Semi-sup training?

6 Sup data + more feats → sup model, separately for sup task 1 and sup task 2.

7 Joint semi-sup: unsup data + sup data → semi-sup model.

8 Unsup pretraining: unsup data → unsup model; semi-sup fine-tuning: + sup data → semi-sup model.

9 Unsup training: unsup data → unsup model → unsup feats.

10 Unsup training: unsup data → unsup feats; sup training: sup data + unsup feats → semi-sup model.

11 Unsup training: unsup data → unsup feats, shared across sup task 1, sup task 2, and sup task 3.

12 What unsupervised features are most useful in NLP?

13 Natural language processing: words, words, words.

14 How do we handle words? Not very well.

15 “One-hot” word representation: |V| = |vocabulary|, e.g. 50K for PTB. Window (word -1, word 0, word +1) → Pr dist over labels, via a (3*|V|) x m parameter matrix.

16 One-hot word representation: 85% of vocab words occur as only 10% of corpus tokens → bad estimate of Pr(label | rare word). Single word input → |V| x m parameter matrix.
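
A minimal sketch of what the deck means by a one-hot representation (toy vocabulary and names are illustrative, not from the talk):

```python
import numpy as np

def one_hot(word, vocab_index):
    """Map a word to a |V|-dimensional indicator vector: all zeros, a single one."""
    v = np.zeros(len(vocab_index))
    v[vocab_index[word]] = 1.0
    return v

# Toy vocabulary; the real PTB vocabulary is on the order of 50K types.
vocab_index = {"the": 0, "cat": 1, "sat": 2}
x = one_hot("cat", vocab_index)   # array([0., 1., 0.])
```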

17 Approach

18 Approach: manual feature engineering

19 Approach: manual feature engineering

20 Approach: induce word reprs over large corpus, unsupervised; use word reprs as word features for supervised task.

21 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs.

22 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs.

23 Distributional representations: a matrix F of size W (size of vocab) x C, e.g. F_{w,v} = Pr(v follows word w) or F_{w,v} = Pr(v occurs in same doc as w).

24 Distributional representations: compress F to f = g(F) of dimension d, with C >> d; e.g. g = LSI/LSA, LDA, PCA, ICA, random transformations.
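
A hedged sketch of slides 23-24: build a co-occurrence count matrix F and compress it with truncated SVD (an LSA-style choice of g). The toy corpus and variable names are mine, not the talk's.

```python
import numpy as np

# Toy "documents"; F[w, v] counts how often v occurs in the same document as w.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "dog", "ran"]]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

F = np.zeros((len(vocab), len(vocab)))
for d in docs:
    for w in d:
        for v in d:
            if w != v:
                F[idx[w], idx[v]] += 1

# g(F) = f: reduce the W x C matrix to d << C dimensions (here via truncated SVD, i.e. LSA).
d = 2
U, S, Vt = np.linalg.svd(F, full_matrices=False)
word_reprs = U[:, :d] * S[:d]   # one dense d-dimensional row per word
```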

25 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs.

26 Class-based word repr: |C| classes, hard clustering. Word input → (|V|+|C|) x m parameter matrix (one-hot word plus one-hot class).

27 Class-based word repr: hard vs. soft clustering; hierarchical vs. flat clustering.

28 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs (Brown: hard, hierarchical clustering; HMM: soft, flat clustering); distributed word reprs.

29 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs (Brown: hard, hierarchical clustering; HMM: soft, flat clustering); distributed word reprs.

30 Brown clustering: hard, hierarchical class-based LM (Brown et al., 1992); greedy technique for maximizing bigram mutual information; merge words by contextual similarity.

31 Brown clustering (image from Terry Koo): cluster(chairman) = '0010', 2-prefix(cluster(chairman)) = '00'.

32 Brown clustering: hard, hierarchical class-based LM; 1000 classes; use prefixes = 4, 6, 10, 20.
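
A small sketch of how the cluster-path prefixes on slides 31-32 become features; the function name and dictionary keys are illustrative, not from the talk.

```python
def brown_prefix_features(bitstring, prefix_lengths=(4, 6, 10, 20)):
    """Turn a Brown cluster path (a bit string from the hierarchy) into
    coarse-to-fine features: one feature per prefix length that fits."""
    return {f"brown-pre{p}": bitstring[:p] for p in prefix_lengths if len(bitstring) >= p}

print(brown_prefix_features("1010001100"))
# {'brown-pre4': '1010', 'brown-pre6': '101000', 'brown-pre10': '1010001100'}
```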

33 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs (Brown: hard, hierarchical clustering; HMM: soft, flat clustering); distributed word reprs.

34 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs.

35 Distributed word repr: k-dimensional (low), dense representation; a “word embedding” matrix E of size |V| x k. Word input → k-dim embedding → k x m parameter matrix.

36 Sequence labeling w/ embeddings: window (word -1, word 0, word +1) → lookups in the |V| x k “word embedding” matrix E (tied weights) → (3*k) x m parameter matrix.
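
A minimal sketch of the lookup-and-concatenate step on slide 36; the sizes and the stand-in matrix E are mine (E would come from unsupervised training).

```python
import numpy as np

def window_input(word_ids, E):
    """Look up each window word in the shared |V| x k embedding matrix E
    (tied weights across positions) and concatenate into one 3*k vector."""
    return np.concatenate([E[w] for w in word_ids])

V, k = 1000, 50
E = 0.01 * np.random.randn(V, k)       # stand-in embedding matrix
x = window_input([3, 17, 42], E)       # shape (150,): word -1, word 0, word +1
```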

37 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs (Collobert + Weston, 2008; HLBL embeddings, Mnih + Hinton, 2007).

38 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs (Collobert + Weston, 2008; HLBL embeddings, Mnih + Hinton, 2007).

39 Collobert + Weston 2008: score a 5-word window (w1 w2 w3 w4 w5) with a 50*5 → 100 → 1 net; require score(true window) > μ + score(corrupted window).
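
A hedged sketch of the ranking setup on slide 39: a 50*5 → 100 → 1 scorer plus the margin criterion. Parameter names and the hinge form of the loss are mine; the corrupted window replaces the centre word with a random word.

```python
import numpy as np

def score_window(word_ids, E, W1, b1, w2, b2):
    """Score a 5-word window: concatenate five 50-dim embeddings (250 inputs),
    one hidden layer of 100 units, scalar output."""
    x = np.concatenate([E[w] for w in word_ids])   # (250,)
    h = np.tanh(W1 @ x + b1)                       # (100,)
    return float(w2 @ h + b2)                      # scalar score

def ranking_loss(score_true, score_corrupt, margin=1.0):
    # Hinge loss: the true window should beat the corrupted window
    # by at least the margin mu.
    return max(0.0, margin - score_true + score_corrupt)
```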

40 50-dim embeddings: Collobert + Weston (2008); t-SNE vis by van der Maaten + Hinton (2008).

41 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs (Collobert + Weston, 2008; HLBL embeddings, Mnih + Hinton, 2007).

42 Log-bilinear language model (LBL): w1 w2 w3 w4 → linear prediction of w5.
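
A minimal sketch of the linear prediction on slide 42: the LBL predicts the next word's embedding as a linear function of the context embeddings, then scores every vocabulary word. Matrix names follow Mnih + Hinton's description loosely and are not from the talk.

```python
import numpy as np

def lbl_logits(context_ids, E, C, b):
    """E: (|V|, k) word embeddings; C: one k x k matrix per context position;
    b: (|V|,) per-word biases. Returns one pre-softmax score per vocab word."""
    r_hat = sum(C[i] @ E[w] for i, w in enumerate(context_ids))  # predicted embedding of w5
    return E @ r_hat + b                                         # score(v) = r_v . r_hat + b_v
```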

43 HLBL = hierarchical (fast) training of LBL (Mnih + Hinton, 2009).

44 Approach: induce word reprs over large corpus, unsupervised (Brown: 3 days; HLBL: 1 week, 100 epochs; C&W: 4 weeks, 50 epochs); use word reprs as word features for supervised task.

45 Unsupervised corpus: RCV1 newswire, 40M tokens (vocab = all 270K types).

46 Supervised tasks: chunking (CoNLL, 2000) with a CRF (Sha + Pereira, 2003); named entity recognition (NER) with an averaged perceptron (linear classifier), based upon Ratinov + Roth (2009).
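
The NER system is Ratinov + Roth's; the sketch below is only a generic averaged perceptron (the "linear classifier" named on slide 46), with naive averaging and dense features for brevity, not their implementation.

```python
import numpy as np

def averaged_perceptron(X, y, n_classes, epochs=10):
    """Multiclass averaged perceptron. X: (n, d) feature matrix, y: (n,) int labels."""
    n, d = X.shape
    w = np.zeros((n_classes, d))       # current weights
    w_sum = np.zeros((n_classes, d))   # running sum, averaged at the end
    updates = 0
    for _ in range(epochs):
        for i in range(n):
            pred = int(np.argmax(w @ X[i]))
            if pred != y[i]:
                w[y[i]] += X[i]
                w[pred] -= X[i]
            w_sum += w
            updates += 1
    return w_sum / updates             # averaging damps overfitting to late updates
```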

47 Unsupervised word reprs as features. Word = “the”; embedding = [-0.2, …, 1.6]; Brown cluster = 1010001100 (cluster 4-prefix = 1010, cluster 6-prefix = 101000, …).

48 Unsupervised word reprs as features. Orig X = {pos-2=“DT”: 1, word-2=“the”: 1, ...}; X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1}; X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …, word-2-dim49: 1.6, ...}.
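
A sketch reproducing the three feature settings on slide 48 as a Python dict builder; key formats mimic the slide, everything else is illustrative.

```python
def word_features(offset, word, pos, brown_path=None, embedding=None):
    """Baseline features, plus Brown-cluster prefixes and/or embedding dimensions."""
    feats = {f"pos{offset}={pos}": 1, f"word{offset}={word}": 1}
    if brown_path is not None:
        for p in (4, 6, 10, 20):
            if len(brown_path) >= p:
                feats[f"class{offset}-pre{p}={brown_path[:p]}"] = 1
    if embedding is not None:
        for i, value in enumerate(embedding):
            feats[f"word{offset}-dim{i:02d}"] = value   # real-valued features
    return feats

print(word_features(-2, "the", "DT", brown_path="1010001100"))
```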

49 Embeddings: Normalization E = σ * E / stddev(E).
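
The normalization on slide 49 as one line of numpy (σ = 0.1 is the value the summary slide reports):

```python
import numpy as np

def normalize_embeddings(E, sigma=0.1):
    """Rescale the embedding matrix so its values have standard deviation sigma."""
    return sigma * E / np.std(E)
```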

50 Embeddings: Normalization (Chunking)

51 Embeddings: Normalization (NER)

52 Repr capacity (Chunking)

53 Repr capacity (NER)

54 Test results (Chunking)

55 Test results (NER)

56 MUC7 (OOD) results (NER)

57 Test results (NER)

58 Test results. Chunking: C&W = Brown; NER: C&W < Brown. Why?

59 Word freq vs word error (Chunking)

60 Word freq vs word error (NER)

61 Summary: both Brown + word emb can increase acc of a near-SOTA system; combining can improve accuracy further; on rare words, Brown > word emb; scale parameter σ = 0.1. Goodies: http://metaoptimize.com/projects/wordreprs/ (word features! code!)

62 Difficulties with word embeddings: no stopping criterion during unsup training; more active features (slower sup training); hyperparameters (learning rate for model, optional learning rate for embeddings, normalization constant) vs. Brown clusters, which have few hyperparams.

63 HMM approach: soft, flat class-based repr; multinomial distribution over hidden states = word representation; 80 hidden states (Huang and Yates, 2009). No results with HMM approach yet.
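
Slide 63 reports no results yet; the sketch below only illustrates how a per-token posterior over hidden states (whose per-type average could serve as the word representation) might be computed with forward-backward, given already-trained HMM parameters. All names and the scaling scheme are assumptions, not the talk's recipe.

```python
import numpy as np

def state_posteriors(obs, pi, A, B):
    """Scaled forward-backward. obs: token ids (length T); pi: (S,) initial state dist;
    A: (S, S) transition matrix; B: (S, V) emission matrix. Returns (T, S) posteriors."""
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    beta = np.zeros((T, S))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Word representation for a type = average of its tokens' state posteriors.
```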

