Word representations: A simple and general method for semi-supervised learning Joseph Turian with Lev Ratinov and Yoshua Bengio Goodies:

2 [Diagram: Sup data → supervised training → Sup model]

3 [Diagram: Sup data → supervised training → Sup model. Question: semi-sup training?]

4 [Diagram repeated: Sup data → supervised training → Sup model. Semi-sup training?]

5 [Diagram: Sup data + more feats → supervised training → Sup model. Semi-sup training?]

6 [Diagram: one supervised pipeline per task: (Sup data + more feats → Sup model) for sup task 1 and again for sup task 2]

7 [Diagram: Unsup data + Sup data → joint semi-sup training → Semi-sup model]

8 [Diagram: Unsup data → unsup pretraining → Unsup model → semi-sup fine-tuning with Sup data → Semi-sup model]

9 [Diagram: Unsup data → unsup training → unsup feats]

10 [Diagram: Unsup data → unsup training → unsup feats; unsup feats + Sup data → sup training → Semi-sup model]

11 [Diagram: Unsup data → unsup training → unsup feats, shared by sup task 1, sup task 2, and sup task 3]

12 What unsupervised features are most useful in NLP?

13 Natural language processing Words, words, words

14 How do we handle words? Not very well

15 “One-hot” word representation. |V| = |vocabulary|, e.g. 50K for PTB2. [Diagram: window (word -1, word 0, word +1) → 3*|V|-dim one-hot input → (3*|V|) x m weights → m-dim Pr dist over labels]

16 One-hot word representation. 85% of vocab words occur as only 10% of corpus tokens. Bad estimate of Pr(label|rare word). [Diagram: word 0 → |V|-dim one-hot input → |V| x m weights → m]
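A minimal sketch of this one-hot window setup, assuming a toy 4-word vocabulary and illustrative layer sizes (the real vocabulary would be ~50K types and m is just a stand-in output dimension):

import numpy as np

# Hypothetical toy vocabulary; a real PTB-scale vocab would be ~50K types.
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
V = len(vocab)

def one_hot(word):
    """|V|-dimensional indicator vector for a single word."""
    x = np.zeros(V)
    x[vocab[word]] = 1.0
    return x

# A 3-word window (word -1, word 0, word +1) concatenates to a 3*|V| vector,
# which feeds a (3*|V|) x m weight matrix in the supervised model.
window = ["the", "cat", "sat"]
x = np.concatenate([one_hot(w) for w in window])

m = 5                                   # output dimension (illustrative)
W = np.random.randn(3 * V, m) * 0.01    # (3*|V|) x m parameters
scores = x @ W                          # almost all of x is zero: sparse, no sharing across words
print(scores.shape)                     # (5,)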

17 Approach

18 Approach Manual feature engineering

19 Approach Manual feature engineering

20 Approach Induce word reprs over large corpus, unsupervised Use word reprs as word features for supervised task

21 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs

22 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs

23 Distributional representations: a matrix F with W rows (size of vocab) and C context columns, e.g. F(w,v) = Pr(v follows word w) or F(w,v) = Pr(v occurs in same doc as w)

24 Distributional representations: apply a dimensionality reduction g to F, f = g(F), mapping the C context dimensions down to d with C >> d; e.g. g = LSI/LSA, LDA, PCA, ICA, random transformations
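A sketch of the f = g(F) pipeline, with bigram counts estimating Pr(v follows w) and truncated SVD standing in for one of the listed choices of g; the corpus and dimensions are toy values:

import numpy as np

# Toy corpus; real distributional methods use millions of tokens.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
W = len(vocab)

# F[w, v] ~ Pr(v follows word w), estimated from bigram counts.
F = np.zeros((W, W))
for a, b in zip(corpus, corpus[1:]):
    F[idx[a], idx[b]] += 1.0
F = F / np.maximum(F.sum(axis=1, keepdims=True), 1.0)

# f = g(F): compress the C context columns down to d dimensions, d << C.
d = 3
U, S, Vt = np.linalg.svd(F, full_matrices=False)
f = U[:, :d] * S[:d]          # d-dimensional distributional repr per word

print(f[idx["cat"]])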

25 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs

26 Class-based word repr: |C| classes, hard clustering. [Diagram: word 0 → (|V|+|C|)-dim input → (|V|+|C|) x m weights → m]

27 Class-based word repr Hard vs. soft clustering Hierarchical vs. flat clustering

28 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs –Brown (hard, hierarchical) clustering –HMM (soft, flat) clustering Distributed word reprs

29 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs –Brown (hard, hierarchical) clustering –HMM (soft, flat) clustering Distributed word reprs

30 Brown clustering Hard, hierarchical class-based LM Brown et al. (1992) Greedy technique for maximizing bigram mutual information Merge words by contextual similarity

31 Brown clustering (image from Terry Koo): cluster(chairman) = '0010', 2-prefix(cluster(chairman)) = '00'

32 Brown clustering Hard, hierarchical class-based LM 1000 classes Use prefixes = 4, 6, 10, 20
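A sketch of turning a Brown cluster path into prefix features of lengths 4, 6, 10, 20 as on the slide; the bit strings below are made up for illustration:

# Hypothetical Brown cluster paths (bit strings from the induced hierarchy).
brown_path = {
    "chairman": "0010110110",
    "president": "0010110111",
    "the": "111101",
}

def brown_prefix_features(word, lengths=(4, 6, 10, 20)):
    """Prefix features of the word's Brown path; shorter prefixes = coarser clusters.
    If the path is shorter than the prefix length, the whole path is used."""
    path = brown_path.get(word)
    if path is None:
        return {}
    return {f"brown-pre{n}": path[:n] for n in lengths}

print(brown_prefix_features("chairman"))
# {'brown-pre4': '0010', 'brown-pre6': '001011', 'brown-pre10': '0010110110', 'brown-pre20': '0010110110'}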

33 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs –Brown (hard, hierarchical) clustering –HMM (soft, flat) clustering Distributed word reprs

34 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs

35 Distributed word repr: k-dimensional (low), dense representation; “word embedding” matrix E of size |V| x k. [Diagram: word 0 → k-dim embedding → k x m weights → m]

36 Sequence labeling w/ embeddings: “word embedding” matrix E of size |V| x k, tied weights across positions. [Diagram: (word -1, word 0, word +1) → three k-dim lookups through the shared |V| x k matrix → (3*k) x m weights → m]
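A sketch of the embedding lookup for a 3-word window with a single tied |V| x k matrix E; the sizes are illustrative (the talk uses k = 50 and a much larger vocabulary):

import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
V, k = len(vocab), 4
rng = np.random.default_rng(0)

E = rng.normal(scale=0.1, size=(V, k))   # |V| x k "word embedding" matrix, shared (tied) across positions

def window_features(words):
    """Concatenate the k-dim embeddings of (word -1, word 0, word +1): a (3*k)-dim input."""
    return np.concatenate([E[vocab[w]] for w in words])

x = window_features(["the", "cat", "sat"])
m = 5
W_out = rng.normal(size=(3 * k, m))      # (3*k) x m supervised layer
print((x @ W_out).shape)                 # (5,)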

37 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs –Collobert + Weston (2008) –HLBL embeddings (Mnih + Hinton, 2007)

38 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs –Collobert + Weston (2008) –HLBL embeddings (Mnih + Hinton, 2007)

39 Collobert + Weston (2008): 50-dim embeddings; score a 5-word window w1 w2 w3 w4 w5 and a corrupted copy with the middle word replaced, requiring score(true window) > μ + score(corrupted window)
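A sketch of this ranking criterion: the observed window should out-score, by a margin, the same window with its middle word replaced by a random word. The linear scorer below is a stand-in for C&W's neural scorer, and all sizes are illustrative:

import numpy as np

rng = np.random.default_rng(0)
V, k, n = 1000, 50, 5                      # vocab size, embedding dim, window size
E = rng.normal(scale=0.1, size=(V, k))
w = rng.normal(scale=0.1, size=(n * k,))   # stand-in linear scorer (C&W use a deeper network)

def score(window_ids):
    return float(np.concatenate([E[i] for i in window_ids]) @ w)

def ranking_loss(window_ids, corrupt_id, margin=1.0):
    """Hinge loss: the true window should out-score the corrupted window by `margin`."""
    corrupted = list(window_ids)
    corrupted[n // 2] = corrupt_id         # replace the middle word
    return max(0.0, margin - score(window_ids) + score(corrupted))

window = rng.integers(0, V, size=n)
print(ranking_loss(window, corrupt_id=int(rng.integers(0, V))))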

40 50-dim embeddings: Collobert + Weston (2008) t-SNE vis by van der Maaten + Hinton (2008)

41 Less sparse word reprs? Distributional word reprs Class-based (clustering) word reprs Distributed word reprs –Collobert + Weston (2008) –HLBL embeddings (Mnih + Hinton, 2007)

42 Log-bilinear Language Model (LBL): given w1 w2 w3 w4 w5, the model makes a linear prediction of w5 from the context words w1..w4
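A sketch of the LBL prediction step: the embedding of w5 is predicted as a linear function (one matrix per context position) of the embeddings of w1..w4, and each candidate word is scored by its dot product with that prediction; the sizes are illustrative:

import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 50
E = rng.normal(scale=0.1, size=(V, k))           # shared word embeddings
C = rng.normal(scale=0.1, size=(4, k, k))        # one k x k matrix per context position

def next_word_log_probs(context_ids):
    """Predicted embedding for w5 is a linear function of w1..w4's embeddings;
    each word is scored by its dot product with that prediction, then normalized."""
    r_hat = sum(C[j] @ E[i] for j, i in enumerate(context_ids))   # predicted k-dim vector
    scores = E @ r_hat                                            # |V| scores
    return scores - np.logaddexp.reduce(scores)                   # log Pr(w5 | w1..w4)

log_probs = next_word_log_probs([3, 17, 42, 99])
print(int(np.argmax(log_probs)))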

43 HLBL HLBL = hierarchical (fast) training of LBL Mnih + Hinton (2009)

44 Approach Induce word reprs over large corpus, unsupervised –Brown: 3 days –HLBL: 1 week, 100 epochs –C&W: 4 weeks, 50 epochs Use word reprs as word features for supervised task

45 Unsupervised corpus RCV1 newswire 40M tokens (vocab = all 270K types)

46 Supervised Tasks Chunking (CoNLL, 2000) –CRF (Sha + Pereira, 2003) Named entity recognition (NER) –Averaged perceptron (linear classifier) –Based upon Ratinov + Roth (2009)

47 Unsupervised word reprs as features Word = “the” Embedding = [-0.2, …, 1.6] Brown cluster = (cluster 4-prefix = 1010, cluster 6-prefix = 101000, …)

48 Unsupervised word reprs as features Orig X = {pos-2=“DT”: 1, word-2=“the”: 1,...} X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1} X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …, word-2-dim49: 1.6,...}
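A sketch of how a token's sparse feature map could be augmented with Brown-prefix and embedding-dimension features in the format shown above; the lookup tables are toy stand-ins, not the induced representations:

# Toy lookup tables standing in for the induced representations.
brown_path = {"the": "101000110"}
embedding = {"the": [-0.2, 0.7, 1.6]}          # real embeddings have ~50 dims

def add_word_repr_features(feats, position, word):
    """Augment a sparse feature dict (as used by the CRF/perceptron) for one context word."""
    path = brown_path.get(word)
    if path is not None:
        for n in (4, 6, 10, 20):
            feats[f"class{position}-pre{n}={path[:n]}"] = 1.0   # binary indicator features
    emb = embedding.get(word)
    if emb is not None:
        for d, val in enumerate(emb):
            feats[f"word{position}-dim{d:02d}"] = val           # real-valued features
    return feats

feats = {"pos-2=DT": 1.0, "word-2=the": 1.0}    # original sparse features
print(add_word_repr_features(feats, position=-2, word="the"))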

49 Embeddings: Normalization E = σ * E / stddev(E)
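The normalization rule in code; σ = 0.1 is the scale reported later in the talk:

import numpy as np

def normalize_embeddings(E, sigma=0.1):
    """Rescale the embedding matrix so its per-entry standard deviation is sigma
    (E = sigma * E / stddev(E)); sigma = 0.1 is the value used in the talk."""
    return sigma * E / E.std()

E = np.random.randn(50000, 50)        # |V| x k raw embeddings
print(normalize_embeddings(E).std())  # ~0.1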

50 Embeddings: Normalization (Chunking)

51 Embeddings: Normalization (NER)

52 Repr capacity (Chunking)

53 Repr capacity (NER)

54 Test results (Chunking)

55 Test results (NER)

56 MUC7 (OOD) results (NER)

57 Test results (NER)

58 Test results Chunking: C&W = Brown NER: C&W < Brown Why?

59 Word freq vs word error (Chunking)

60 Word freq vs word error (NER)

61 Summary Both Brown + word emb can increase acc of near-SOTA system Combining can improve accuracy further On rare words, Brown > word emb Scale parameter σ = 0.1 Goodies: Word features! Code!

62 Difficulties with word embeddings No stopping criterion during unsup training More active features (slower sup training) Hyperparameters –Learning rate for model –(optional) Learning rate for embeddings –Normalization constant vs. Brown clusters, few hyperparams

63 HMM approach Soft, flat class-based repr Multinomial distribution over hidden states = word representation 80 hidden states Huang and Yates (2009) No results with HMM approach yet
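A sketch of how the soft HMM representation would be read off: each word type's representation is the averaged posterior over the 80 hidden states at its occurrences. The per-token posteriors below are random placeholders standing in for forward-backward output from a trained HMM:

import numpy as np
from collections import defaultdict

def hmm_word_representations(tokens, state_posteriors):
    """Average the per-token posterior over hidden states for each word type.
    `state_posteriors` is a (num_tokens x num_states) array assumed to come from
    forward-backward under a trained HMM (training not shown here)."""
    sums = defaultdict(lambda: np.zeros(state_posteriors.shape[1]))
    counts = defaultdict(int)
    for tok, post in zip(tokens, state_posteriors):
        sums[tok] += post
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}

tokens = ["the", "cat", "sat", "the", "dog"]
posteriors = np.random.dirichlet(np.ones(80), size=len(tokens))   # placeholder posteriors
reprs = hmm_word_representations(tokens, posteriors)
print(reprs["the"].shape)   # (80,)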