1 Word representations: A simple and general method for semi-supervised learning. Joseph Turian, with Lev Ratinov and Yoshua Bengio. Goodies: http://metaoptimize.com/projects/wordreprs/

2 Supervised training: sup data → sup model.

3 Supervised training: sup data → sup model. Semi-sup training?

4 Supervised training: sup data → sup model. Semi-sup training?

5 Supervised training: sup data + more feats → sup model. Semi-sup training?

6 Sup data + more feats → sup model, separately for sup task 1 and sup task 2.

7 Joint semi-sup: unsup data + sup data → semi-sup model.

8 Unsup pretraining: unsup data → unsup model; semi-sup fine-tuning: + sup data → semi-sup model.

9 Unsup training: unsup data → unsup model → unsup feats.

10 Unsup training: unsup data → unsup feats; sup training: sup data + unsup feats → semi-sup model.

11 Unsup training: unsup data → unsup feats, shared across sup task 1, sup task 2, and sup task 3.

12 What unsupervised features are most useful in NLP?

13 Natural language processing: words, words, words.

14 How do we handle words? Not very well.

15 “One-hot” word representation: |V| = |vocabulary|, e.g. 50K for PTB. Window (word -1, word 0, word +1) → Pr dist over labels, via a (3*|V|) x m parameter matrix.

16 One-hot word representation: 85% of vocab words occur as only 10% of corpus tokens → bad estimate of Pr(label | rare word). Single word input → |V| x m parameter matrix.
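
A minimal sketch of what the deck means by a one-hot representation (toy vocabulary and names are illustrative, not from the talk):

```python
import numpy as np

def one_hot(word, vocab_index):
    """Map a word to a |V|-dimensional indicator vector: all zeros, a single one."""
    v = np.zeros(len(vocab_index))
    v[vocab_index[word]] = 1.0
    return v

# Toy vocabulary; the real PTB vocabulary is on the order of 50K types.
vocab_index = {"the": 0, "cat": 1, "sat": 2}
x = one_hot("cat", vocab_index)   # array([0., 1., 0.])
```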

17 Approach

18 Approach: manual feature engineering

19 Approach: manual feature engineering

20 Approach: induce word reprs over large corpus, unsupervised; use word reprs as word features for supervised task.

21 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs.

22 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs.

23 Distributional representations: a matrix F of size W (size of vocab) x C, e.g. F_{w,v} = Pr(v follows word w) or F_{w,v} = Pr(v occurs in same doc as w).

24 Distributional representations: compress F to f = g(F) of dimension d, with C >> d; e.g. g = LSI/LSA, LDA, PCA, ICA, random transformations.
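
A hedged sketch of slides 23-24: build a co-occurrence count matrix F and compress it with truncated SVD (an LSA-style choice of g). The toy corpus and variable names are mine, not the talk's.

```python
import numpy as np

# Toy "documents"; F[w, v] counts how often v occurs in the same document as w.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "dog", "ran"]]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

F = np.zeros((len(vocab), len(vocab)))
for d in docs:
    for w in d:
        for v in d:
            if w != v:
                F[idx[w], idx[v]] += 1

# g(F) = f: reduce the W x C matrix to d << C dimensions (here via truncated SVD, i.e. LSA).
d = 2
U, S, Vt = np.linalg.svd(F, full_matrices=False)
word_reprs = U[:, :d] * S[:d]   # one dense d-dimensional row per word
```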

25 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs.

26 Class-based word repr: |C| classes, hard clustering. Word input → (|V|+|C|) x m parameter matrix (one-hot word plus one-hot class).

27 Class-based word repr: hard vs. soft clustering; hierarchical vs. flat clustering.

28 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs (Brown: hard, hierarchical clustering; HMM: soft, flat clustering); distributed word reprs.

29 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs (Brown: hard, hierarchical clustering; HMM: soft, flat clustering); distributed word reprs.

30 Brown clustering: hard, hierarchical class-based LM (Brown et al., 1992); greedy technique for maximizing bigram mutual information; merge words by contextual similarity.

31 Brown clustering (image from Terry Koo): cluster(chairman) = '0010', 2-prefix(cluster(chairman)) = '00'.

32 Brown clustering: hard, hierarchical class-based LM; 1000 classes; use prefixes = 4, 6, 10, 20.
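
A small sketch of how the cluster-path prefixes on slides 31-32 become features; the function name and dictionary keys are illustrative, not from the talk.

```python
def brown_prefix_features(bitstring, prefix_lengths=(4, 6, 10, 20)):
    """Turn a Brown cluster path (a bit string from the hierarchy) into
    coarse-to-fine features: one feature per prefix length that fits."""
    return {f"brown-pre{p}": bitstring[:p] for p in prefix_lengths if len(bitstring) >= p}

print(brown_prefix_features("1010001100"))
# {'brown-pre4': '1010', 'brown-pre6': '101000', 'brown-pre10': '1010001100'}
```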

33 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs (Brown: hard, hierarchical clustering; HMM: soft, flat clustering); distributed word reprs.

34 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs.

35 Distributed word repr: k-dimensional (low), dense representation; a “word embedding” matrix E of size |V| x k. Word input → k-dim embedding → k x m parameter matrix.

36 Sequence labeling w/ embeddings: window (word -1, word 0, word +1) → lookups in the |V| x k “word embedding” matrix E (tied weights) → (3*k) x m parameter matrix.
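
A minimal sketch of the lookup-and-concatenate step on slide 36; the sizes and the stand-in matrix E are mine (E would come from unsupervised training).

```python
import numpy as np

def window_input(word_ids, E):
    """Look up each window word in the shared |V| x k embedding matrix E
    (tied weights across positions) and concatenate into one 3*k vector."""
    return np.concatenate([E[w] for w in word_ids])

V, k = 1000, 50
E = 0.01 * np.random.randn(V, k)       # stand-in embedding matrix
x = window_input([3, 17, 42], E)       # shape (150,): word -1, word 0, word +1
```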

37 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs (Collobert + Weston, 2008; HLBL embeddings, Mnih + Hinton, 2007).

38 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs (Collobert + Weston, 2008; HLBL embeddings, Mnih + Hinton, 2007).

39 Collobert + Weston 2008: score a 5-word window (w1 w2 w3 w4 w5) with a 50*5 → 100 → 1 net; require score(true window) > μ + score(corrupted window).
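
A hedged sketch of the ranking setup on slide 39: a 50*5 → 100 → 1 scorer plus the margin criterion. Parameter names and the hinge form of the loss are mine; the corrupted window replaces the centre word with a random word.

```python
import numpy as np

def score_window(word_ids, E, W1, b1, w2, b2):
    """Score a 5-word window: concatenate five 50-dim embeddings (250 inputs),
    one hidden layer of 100 units, scalar output."""
    x = np.concatenate([E[w] for w in word_ids])   # (250,)
    h = np.tanh(W1 @ x + b1)                       # (100,)
    return float(w2 @ h + b2)                      # scalar score

def ranking_loss(score_true, score_corrupt, margin=1.0):
    # Hinge loss: the true window should beat the corrupted window
    # by at least the margin mu.
    return max(0.0, margin - score_true + score_corrupt)
```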

40 50-dim embeddings: Collobert + Weston (2008); t-SNE vis by van der Maaten + Hinton (2008).

41 Less sparse word reprs? Distributional word reprs; class-based (clustering) word reprs; distributed word reprs (Collobert + Weston, 2008; HLBL embeddings, Mnih + Hinton, 2007).

42 Log-bilinear language model (LBL): w1 w2 w3 w4 → linear prediction of w5.
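
A minimal sketch of the linear prediction on slide 42: the LBL predicts the next word's embedding as a linear function of the context embeddings, then scores every vocabulary word. Matrix names follow Mnih + Hinton's description loosely and are not from the talk.

```python
import numpy as np

def lbl_logits(context_ids, E, C, b):
    """E: (|V|, k) word embeddings; C: one k x k matrix per context position;
    b: (|V|,) per-word biases. Returns one pre-softmax score per vocab word."""
    r_hat = sum(C[i] @ E[w] for i, w in enumerate(context_ids))  # predicted embedding of w5
    return E @ r_hat + b                                         # score(v) = r_v . r_hat + b_v
```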

43 HLBL = hierarchical (fast) training of LBL (Mnih + Hinton, 2009).

44 Approach: induce word reprs over large corpus, unsupervised (Brown: 3 days; HLBL: 1 week, 100 epochs; C&W: 4 weeks, 50 epochs); use word reprs as word features for supervised task.

45 Unsupervised corpus: RCV1 newswire, 40M tokens (vocab = all 270K types).

46 Supervised tasks: chunking (CoNLL, 2000) with a CRF (Sha + Pereira, 2003); named entity recognition (NER) with an averaged perceptron (linear classifier), based upon Ratinov + Roth (2009).
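
The NER system is Ratinov + Roth's; the sketch below is only a generic averaged perceptron (the "linear classifier" named on slide 46), with naive averaging and dense features for brevity, not their implementation.

```python
import numpy as np

def averaged_perceptron(X, y, n_classes, epochs=10):
    """Multiclass averaged perceptron. X: (n, d) feature matrix, y: (n,) int labels."""
    n, d = X.shape
    w = np.zeros((n_classes, d))       # current weights
    w_sum = np.zeros((n_classes, d))   # running sum, averaged at the end
    updates = 0
    for _ in range(epochs):
        for i in range(n):
            pred = int(np.argmax(w @ X[i]))
            if pred != y[i]:
                w[y[i]] += X[i]
                w[pred] -= X[i]
            w_sum += w
            updates += 1
    return w_sum / updates             # averaging damps overfitting to late updates
```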

47 Unsupervised word reprs as features. Word = “the”; embedding = [-0.2, …, 1.6]; Brown cluster = 1010001100 (cluster 4-prefix = 1010, cluster 6-prefix = 101000, …).

48 Unsupervised word reprs as features. Orig X = {pos-2=“DT”: 1, word-2=“the”: 1, ...}; X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1}; X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …, word-2-dim49: 1.6, ...}.
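
A sketch reproducing the three feature settings on slide 48 as a Python dict builder; key formats mimic the slide, everything else is illustrative.

```python
def word_features(offset, word, pos, brown_path=None, embedding=None):
    """Baseline features, plus Brown-cluster prefixes and/or embedding dimensions."""
    feats = {f"pos{offset}={pos}": 1, f"word{offset}={word}": 1}
    if brown_path is not None:
        for p in (4, 6, 10, 20):
            if len(brown_path) >= p:
                feats[f"class{offset}-pre{p}={brown_path[:p]}"] = 1
    if embedding is not None:
        for i, value in enumerate(embedding):
            feats[f"word{offset}-dim{i:02d}"] = value   # real-valued features
    return feats

print(word_features(-2, "the", "DT", brown_path="1010001100"))
```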

49 Embeddings: Normalization E = σ * E / stddev(E).
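
The normalization on slide 49 as one line of numpy (σ = 0.1 is the value the summary slide reports):

```python
import numpy as np

def normalize_embeddings(E, sigma=0.1):
    """Rescale the embedding matrix so its values have standard deviation sigma."""
    return sigma * E / np.std(E)
```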

50 Embeddings: Normalization (Chunking)

51 Embeddings: Normalization (NER)

52 Repr capacity (Chunking)

53 Repr capacity (NER)

54 Test results (Chunking)

55 Test results (NER)

56 MUC7 (OOD) results (NER)

57 Test results (NER)

58 Test results. Chunking: C&W = Brown; NER: C&W < Brown. Why?

59 Word freq vs word error (Chunking)

60 Word freq vs word error (NER)

61 Summary: both Brown + word emb can increase acc of a near-SOTA system; combining can improve accuracy further; on rare words, Brown > word emb; scale parameter σ = 0.1. Goodies: http://metaoptimize.com/projects/wordreprs/ (word features! code!)

62 Difficulties with word embeddings: no stopping criterion during unsup training; more active features (slower sup training); hyperparameters (learning rate for model, optional learning rate for embeddings, normalization constant) vs. Brown clusters, which have few hyperparams.

63 HMM approach: soft, flat class-based repr; multinomial distribution over hidden states = word representation; 80 hidden states (Huang and Yates, 2009). No results with HMM approach yet.
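
Slide 63 reports no results yet; the sketch below only illustrates how a per-token posterior over hidden states (whose per-type average could serve as the word representation) might be computed with forward-backward, given already-trained HMM parameters. All names and the scaling scheme are assumptions, not the talk's recipe.

```python
import numpy as np

def state_posteriors(obs, pi, A, B):
    """Scaled forward-backward. obs: token ids (length T); pi: (S,) initial state dist;
    A: (S, S) transition matrix; B: (S, V) emission matrix. Returns (T, S) posteriors."""
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    beta = np.zeros((T, S))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Word representation for a type = average of its tokens' state posteriors.
```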

