Unsupervised Dependency Parsing David Mareček Institute of Formal and Applied Linguistics Charles University in Prague Doctoral thesis defense September 26, 2012

Outline Unsupervised dependency parsing  What is it?  What is it good for? My work  Reducibility feature  Dependency model  Gibbs sampling algorithm for dependency trees  Results

Supervised  Parser is learned on a manually annotated treebank Unsupervised  No treebanks, no language specific linguistic rules  Only corpora without manual tree annotations Semi-supervised  Something in the middle Dependency parsing My grandmother plays computer games. PRP$ NN VBZ NN NNS.

Unsupervised dependency parsing Induction of linguistic structure directly from a text corpus  based on language-independent linguistic assumptions about dependencies  sometimes called “grammar induction” We can use it for any language and domain  We do not need any new manually annotated treebanks  Independent of linguistic theory We can tune it with respect to the final application  E.g. in machine translation:  We do not know which structure is best for a particular language pair  It can be different from the structures used in treebanks. It’s a challenge...  Children do not use treebanks when learning their mother tongue.  Could machines do it as well?

REDUCIBILITY

Reducibility Definition: A word (or a sequence of words) in a sentence is reducible if it can be removed from the sentence without violating its correctness. Some conference participants missed the last bus yesterday.  “Some participants missed the last bus yesterday.” – REDUCIBLE  “Some conference participants the last bus yesterday.” – NOT REDUCIBLE

Hypothesis If a word (or sequence of words) is reducible in a particular sentence, it is a leaf (or a subtree) in its dependency structure. (Figure: dependency tree of “Some conference participants missed the last bus yesterday.”)

Hypothesis If a word (or sequence of words) is reducible in a particular sentence, it is a leaf (or a subtree) in its dependency structure.  It mostly holds across languages  Problems occur mainly with function words PREPOSITIONAL PHRASES: They are at the conference. DETERMINERS: I am in the pub. AUXILIARY VERBS: I have been sitting there. Let’s try to recognize reducible words automatically...

Recognition of reducible words We remove the word from the sentence. But how can we automatically recognize whether the rest of the sentence is correct or not?  Hardly... (we don’t have any grammar yet) If we have a large corpus, we can search for the reduced sentence.  If it is in the corpus -> it is (possibly) grammatical  If it is not in the corpus -> we do not know We will find only a few reducible words this way...  very low recall

Other possibilities? Could we take a smaller context than the whole sentence?  Does not work at all for free word-order languages. Why not use part-of-speech tags instead of words?  DT NN VBZ IN DT NN.  DT NN VBZ DT NN. ... but the preposition IN should not be reducible Solution:  We use the very sparse set of reducible words found in the corpus to estimate “reducibility scores” for PoS tags (or PoS tag sequences)

Computing reducibility scores For each possible PoS unigram, bigram and trigram:  Find all its occurrences in the corpus  For each such occurrence, remove the respective words and search for the rest of the sentence in the corpus.  If it occurs at least once elsewhere in the corpus, the occurrence is proclaimed as reducible. Reducibility of a PoS n-gram = relative number of reducible occurrences Example corpus: “I saw her.” (PRP VBD PRP.) – “She was sitting on the balcony and wearing a blue dress.” (PRP VBD VBG IN DT NN CC VBG DT JJ NN.) – “I saw her in the theater.” (PRP VBD PRP IN DT NN.) The trigram “IN DT NN” occurs twice; only the occurrence in the third sentence is reducible (removing “in the theater” yields “I saw her.”, which is in the corpus), so R(“IN DT NN”) = 1/2.

Computing reducibility scores r(g) ... number of reducible occurrences of the PoS n-gram g c(g) ... number of all its occurrences Reducibility score (relative number of reducible occurrences): R(g) = r(g) / c(g)
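To make the procedure concrete, here is a minimal Python sketch of the score computation. It assumes the corpus is already PoS-tagged and stored as a list of (word, tag) sentences; the function name and data layout are illustrative, not taken from the thesis implementation.

```python
from collections import defaultdict

def reducibility_scores(tagged_sentences, max_n=3):
    # Index every sentence by its word sequence so we can test whether a
    # reduced sentence occurs elsewhere in the corpus.
    sentence_set = {tuple(w.lower() for w, _ in s) for s in tagged_sentences}

    reducible = defaultdict(int)   # r(g): reducible occurrences of n-gram g
    total = defaultdict(int)       # c(g): all occurrences of n-gram g

    for sent in tagged_sentences:
        words = [w.lower() for w, _ in sent]
        tags = [t for _, t in sent]
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                gram = tuple(tags[i:i + n])
                total[gram] += 1
                reduced = tuple(words[:i] + words[i + n:])
                # An occurrence counts as reducible if the reduced sentence
                # is itself found in the corpus.
                if reduced and reduced in sentence_set:
                    reducible[gram] += 1

    # R(g) = r(g) / c(g): relative number of reducible occurrences.
    return {g: reducible[g] / total[g] for g in total}

# Example (toy corpus from the slide):
# corpus = [[("I", "PRP"), ("saw", "VBD"), ("her", "PRP"), (".", ".")],
#           [("I", "PRP"), ("saw", "VBD"), ("her", "PRP"),
#            ("in", "IN"), ("the", "DT"), ("theater", "NN"), (".", ".")]]
# reducibility_scores(corpus)[("IN", "DT", "NN")]  # -> 1.0
```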

Examples of reducibility scores Reducibility scores of the English PoS tags  induced from the English Wikipedia corpus

Examples of reducibility scores Reducibility scores of Czech PoS tags  1st and 2nd positions of the PDT tag

DEPENDENCY TREE MODEL

Dependency tree model Consists of four submodels  edge model  fertility model  distance model  reducibility model Simplification  we use only PoS tags, we don’t use word forms (except for computing reducibility scores)

Edge model P(dependent tag | edge direction, parent tag)  “Rich get richer” principle on dependency edges
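One common way to write the “rich get richer” behavior is a Chinese-restaurant-style estimate in which an edge becomes more probable the more often it has already been generated. The form below is only a hedged sketch: the base distribution P_0 and the hyperparameter alpha are placeholders and not necessarily the exact parametrization used in the thesis.

\[
P(t_d \mid \mathit{dir}, t_g) \;=\;
  \frac{\mathrm{count}(t_d, \mathit{dir}, t_g) + \alpha \, P_0(t_d)}
       {\mathrm{count}(\mathit{dir}, t_g) + \alpha}
\]

Here count(·) counts the dependency edges generated so far with the given dependent tag, edge direction, and parent (governing) tag.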

Fertility model P(number of left and right children | parent tag)  “Rich get richer” principle

Distance model Longer edges are less probable.

Reducibility model Probability of a subtree is proportional to its reducibility score.

Probability of treebank The probability of the whole treebank, which we want to maximize  Multiplication over all models and words in the corpus
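The slide’s formula did not survive extraction; a schematic reconstruction consistent with the description above (a product of the four submodel factors over all words) would be the following, with the exact arguments of each factor left out:

\[
P(\text{treebank}) \;=\; \prod_{\text{sentences}} \; \prod_{w \,\in\, \text{sentence}}
  P_{\mathrm{edge}}(\cdot)\; P_{\mathrm{fertility}}(\cdot)\;
  P_{\mathrm{distance}}(\cdot)\; P_{\mathrm{reducibility}}(\cdot)
\]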

GIBBS SAMPLING OF DEPENDENCY TREES

Gibbs sampling Initialization  A random projective dependency tree is generated for each sentence Sampling  Small changes in the dependency structures are made in many iterations across the treebank  The small changes are chosen randomly with respect to the probability distribution over the resulting treebanks Decoding  Final trees are built according to the last 100 samples

Gibbs sampling – bracketing notation Each projective dependency tree can be expressed by a unique bracketing.  Each bracket pair belongs to one node and delimits its descendants from the rest of the sentence.  Each bracketed segment contains just one word that is not embedded deeper; this node is the head of the segment. Example (tree over the tags DT NN VB RB IN DT JJ NN): (((DT) NN) VB (RB) (IN ((DT) (JJ) NN)))
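A small Python sketch of the bracketing notation, assuming a projective tree is given as a list of head indices (0 = artificial root) over the PoS tags. The head assignment in the example is a hypothetical tree chosen to reproduce the bracketing on the slide.

```python
def bracketing(tags, heads):
    # Group children by their head; index 0 is the artificial root.
    children = {i: [] for i in range(len(tags) + 1)}
    for i, h in enumerate(heads, start=1):
        children[h].append(i)

    def segment(node):
        # Each bracket pair delimits a node together with all its descendants;
        # the head is the single word not embedded in a deeper bracket.
        parts = []
        for c in (c for c in children[node] if c < node):
            parts.append(segment(c))
        parts.append(tags[node - 1])
        for c in (c for c in children[node] if c > node):
            parts.append(segment(c))
        return "(" + " ".join(parts) + ")"

    # The artificial root itself is not bracketed; concatenate its children.
    return " ".join(segment(c) for c in children[0])

# Example from the slide: DT NN VB RB IN DT JJ NN
tags = ["DT", "NN", "VB", "RB", "IN", "DT", "JJ", "NN"]
heads = [2, 3, 0, 3, 3, 8, 8, 5]   # hypothetical heads reproducing the bracketing
print(bracketing(tags, heads))      # (((DT) NN) VB (RB) (IN ((DT) (JJ) NN)))
```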

Gibbs sampling – small change Choose one non-root node and remove its bracket. Add another bracket which does not violate projectivity. Example: removing the IN bracket from (((DT) NN) VB (RB) (IN ((DT) (JJ) NN))) leaves ( ((DT) NN) VB (RB) IN ((DT) (JJ) NN) ); candidate brackets to add include (IN ((DT) (JJ) NN)), ((RB) IN ((DT) (JJ) NN)), ((RB) IN), (((DT) NN) VB (RB)), (((DT) NN) VB), (VB (RB)), (VB), and (IN).

Gibbs sampling – decoding After 200 iterations  We run the MST algorithm  Edge weights = occurrences of individual edges in the treebank during the last 100 sampling iterations  The output trees may be non-projective
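A sketch of the decoding step for one sentence, assuming the edge occurrences from the last 100 samples are collected as a dictionary {(head, dependent): count} with 0 as the artificial root. It uses networkx's Chu-Liu/Edmonds implementation (maximum_spanning_arborescence) as a stand-in; the thesis's own MST code may differ.

```python
import networkx as nx

def decode_sentence(n_words, edge_counts):
    G = nx.DiGraph()
    for (head, dep), count in edge_counts.items():
        # Edge weight = how often this edge appeared in the last 100 samples.
        G.add_edge(head, dep, weight=count)
    # Maximum spanning arborescence (Chu-Liu/Edmonds); the result may be
    # non-projective even though every sampled tree was projective.
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    heads = {dep: head for head, dep in tree.edges()}
    # Return the head index for each word 1..n_words (0 = root).
    return [heads.get(i, 0) for i in range(1, n_words + 1)]

# Example with hypothetical counts for a 3-word sentence:
counts = {(0, 2): 100, (2, 1): 90, (2, 3): 80, (0, 1): 10, (1, 3): 20}
print(decode_sentence(3, counts))   # [2, 0, 2]
```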

EXPERIMENTS AND EVALUATION

Data Inference and evaluation  CoNLL 2006/2007 test data  HamleDT treebanks (30 languages) Estimating reducibility scores  Wikipedia corpus (W2C)  85 mil. tokens for English... 3 mil. tokens for Japanese

Experiments Different languages Different combinations and variants of models Supervised / unsupervised PoS tags  POS, CPOS, number of classes Including / excluding punctuation  from training / from evaluation Different decoding methods Different evaluation metrics  DAS, UAS, NED

Results Reducibility model is very useful  (table: attachment scores with and without the reducibility model for English, German, and Czech)  For some languages, I achieved better results when using unsupervised PoS tags instead of supervised ones Many mistakes are in punctuation...

Results

Conclusions I have introduced the reducibility feature, which is useful in unsupervised dependency parsing. Reducibility scores for individual PoS tag n-grams are computed on a large corpus; the inference itself is done on smaller data. I have proposed an algorithm for sampling projective dependency trees. Better results for 15 out of 20 treebanks  compared to the 2011 state of the art Future work:  Employ lexicalized models  Improve reducibility – a different treatment of function words  Parallel unsupervised parsing for machine translation

Thank you!

ANSWERS

Answers to A. Soegaard’s questions The aim of the parsing may be:  To be able to parse any language using all the data resources available (McDonald, Petrov,...)  To induce a grammar without using any manually annotated data (Spitkovsky, Blunsom,...) For a completely unsupervised solution  I should use unsupervised PoS tagging as well  I would not know which words are verbs, which are nouns,... Hyperparameter tuning and evaluation  In future work, it should be extrinsic (on a final application, e.g. MT)  In my thesis, the only possibility was to evaluate against existing treebanks

Answers to A. Soegaard’s questions [2] Decoding:  I’ve chosen maximum-spanning-tree decoding.  The results using annealing were not very different  Non-projective (Chu-Liu-Edmonds algorithm)  I have not tested the projective (Eisner’s) algorithm. Comparing results with other work  Many papers report results on sentences not longer than 10 words. Turkish 2006 data are missing  I did not have these data available.

Answers to F. Jurčíček’s questions (2) Chinese restaurant process  Treebank generation ~ Chinese restaurant (4), (7) What is the history?  When generating a treebank, a new dependency edge is generated based on the previously generated edges  When sampling a new treebank, new edges are sampled based on all the other edges in the treebank (exchangeability) (5) Are the distance and reducibility models really unsupervised?  Unsupervised – we do not need any labeled data  Language-independent – they work for all the languages  Are the properties of distance and reducibility assumptions, or were they observed from data?  The repeatability of edges could be observed from data as well.

Answers to F. Jurčíček’s questions [2] (7) Probability of a dependency relation  The proposed sampling algorithm can change more than one edge at once (to preserve treeness)  The probability of the rest of the treebank is equal for all the candidates. (7) Dependencies in the same tree are not i.i.d.  That’s true; I am aware of it.  The violation of independence is negligible given a very high number of sentences. (8) Small changes  Described by removing a bracket and adding another bracket.  As a result, more than one edge may be changed in one sample.