Exploiting Reducibility in Unsupervised Dependency Parsing. David Mareček and Zdeněk Žabokrtský, Institute of Formal and Applied Linguistics, Charles University in Prague. EMNLP conference, July 12, 2012, Jeju Island, Korea.

Outline
- Unsupervised dependency parsing
  - Motivations
- Reducibility
  - What is reducibility?
  - Computing reducibility scores
- Employing reducibility in unsupervised dependency parsing
  - Dependency model
  - Inference: Gibbs sampling of projective dependency trees
  - Results

Motivations for unsupervised dependency parsing
Parsing without using any treebank or any language-specific rules.
- For under-resourced languages or domains?
  - Every new treebank is expensive and time-consuming.
  - However, semi-supervised methods are probably more useful than completely unsupervised ones.
- Universality across languages
  - A parser independent of any particular linguistic theory.
  - Treebanks differ in the ways they capture various linguistic phenomena.
- An unsupervised parser might find structures more suitable than those in treebanks.
  - It might work better in final applications (MT, QA, ...).
  - GIZA++ is also unsupervised, yet universal and widely used.
  - Dependency parsing is similar to the word alignment task.

REDUCIBILITY

Reducibility
Definition: A word (or a sequence of words) is reducible if we can remove it from the sentence without violating the grammaticality of the rest of the sentence.
- Original: "Some conference participants missed the last bus yesterday."
- "Some participants missed the last bus yesterday." ("conference" is REDUCIBLE)
- "Some conference participants the last bus yesterday." ("missed" is NOT REDUCIBLE)

Hypothesis
If a word (or sequence of words) is reducible in a particular sentence, it is a leaf (or a subtree) in the dependency structure.
(Example dependency tree on the slide over the sentence "Some conference participants missed the last bus yesterday.")

Hypothesis (continued)
- It mostly holds across languages.
- Problems occur mainly with function words:
  - PREPOSITIONAL PHRASES: "They are at the conference."
  - DETERMINERS: "I am in the pub."
  - AUXILIARY VERBS: "I have been sitting there."
Let's try to recognize reducible words automatically...

Recognition of reducible words
We remove the word from the sentence. But how can we automatically recognize whether the rest of the sentence is grammatical or not?
- Hardly... (we don't have any grammar yet)
- If we have a large corpus, we can search for the reduced sentence:
  - it is in the corpus -> it is (possibly) grammatical
  - it is not in the corpus -> we do not know
- This way we will find only a few words reducible... very low recall. (A minimal sketch of this check follows.)
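
The corpus-lookup check above can be sketched in a few lines of Python. This is only an illustrative sketch under the stated assumptions (a tokenized corpus held in memory as a set of sentences); the function and variable names are invented for the example, not taken from the authors' implementation.

```python
def is_reducible(sentence, start, length, corpus_sentences):
    """Check whether removing sentence[start:start+length] yields a sentence
    that is attested in the corpus (a cheap proxy for grammaticality)."""
    reduced = tuple(sentence[:start] + sentence[start + length:])
    # Attested in the corpus -> (possibly) grammatical; otherwise we do not know.
    return reduced in corpus_sentences


# Toy example with a hypothetical two-sentence corpus:
corpus = {
    ("I", "saw", "her", "."),
    ("I", "saw", "her", "in", "the", "theater", "."),
}
# Removing "in the theater" leaves "I saw her .", which is attested -> True.
print(is_reducible(["I", "saw", "her", "in", "the", "theater", "."], 3, 3, corpus))
```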

Other possibilities?
- Could we take a smaller context than the whole sentence?
  - Does not work at all for free word-order languages.
- Why not use part-of-speech tags instead of words?
  - e.g., "DT NN VBZ IN DT NN ." reduced to "DT NN VBZ DT NN ."
  - ... but the preposition IN alone should not be reducible.
- Solution:
  - We use the (very sparse) reducible words found in the corpus to estimate "reducibility scores" for PoS tags (or PoS tag sequences).

Computing reducibility scores
For each possible PoS unigram, bigram, and trigram:
- Find all its occurrences in the corpus.
- For each such occurrence, remove the respective words and search for the rest of the sentence in the corpus.
- If it occurs at least once elsewhere in the corpus, the occurrence is proclaimed reducible.
Reducibility of a PoS n-gram = relative number of reducible occurrences.
Example (toy corpus):
- "I saw her." (PRP VBD PRP .)
- "She was sitting on the balcony and wearing a blue dress." (PRP VBD VBG IN DT NN CC VBG DT JJ NN .)
- "I saw her in the theater." (PRP VBD PRP IN DT NN .)
The trigram "IN DT NN" occurs twice; only the occurrence "in the theater" is reducible (removing it yields "I saw her.", which is in the corpus), so R("IN DT NN") = 1/2.

Computing reducibility scores (formula)
- r(g) ... number of reducible occurrences of the PoS n-gram g
- c(g) ... number of all occurrences of g
- Reducibility score: R(g) = r(g) / c(g)
(A minimal sketch of this computation follows.)
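
A minimal Python sketch of the score computation just described, assuming the whole PoS-tagged corpus fits in memory and that attestation anywhere in the corpus counts as reducible; reducibility_scores and the data layout are illustrative names, not the authors' code.

```python
from collections import defaultdict

def reducibility_scores(tagged_sentences, max_n=3):
    """Estimate R(g) = r(g) / c(g) for every PoS unigram, bigram and trigram g.

    tagged_sentences: list of sentences, each a list of (word, pos_tag) pairs.
    Returns a dict mapping PoS n-grams (tuples of tags) to scores in [0, 1].
    """
    # Word-form sentences, used to check whether a reduced sentence is attested.
    attested = {tuple(w for w, _ in sent) for sent in tagged_sentences}

    reducible = defaultdict(int)   # r(g): reducible occurrences
    total = defaultdict(int)       # c(g): all occurrences

    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                gram = tuple(tags[i:i + n])
                total[gram] += 1
                reduced = tuple(words[:i] + words[i + n:])
                if reduced in attested:
                    reducible[gram] += 1

    return {g: reducible[g] / total[g] for g in total}
```

On the toy three-sentence corpus from the previous slide (with the tags shown there), this returns 0.5 for the trigram ('IN', 'DT', 'NN'), matching the R("IN DT NN") = 1/2 example.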

Examples of reducibility scores
Reducibility scores of the English PoS tags, induced from the English Wikipedia corpus.

DEPENDENCY TREE MODEL

Dependency tree model
Consists of four submodels:
- edge model
- fertility model
- distance model
- reducibility model
Simplifications:
- we use only PoS tags, not word forms
- we induce projective trees only

Edge model
P(dependent tag | edge direction, parent tag)
- "Rich get richer" principle on dependency edges
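
The slide does not give the exact formula; a common way to obtain "rich get richer" behaviour is a Chinese-restaurant-process style estimate over the edges already generated. The hyperparameter alpha and the uniform base distribution over the tagset T below are assumptions for illustration, not read off the slide:

```latex
P(t_d \mid t_p, \mathit{dir}) \;=\;
  \frac{c(t_p, \mathit{dir}, t_d) + \alpha \cdot \frac{1}{|T|}}
       {c(t_p, \mathit{dir}, \cdot) + \alpha}
```

Here c(.) counts how many times the configuration has already been generated elsewhere in the sampled treebank, so frequent dependent tags become even more likely.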

Fertility model
P(number of left and right children | parent tag)
- "Rich get richer" principle
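
Analogously to the edge model, a CRP-style sketch can be written for fertilities; the hyperparameter beta and the base distribution P_0 over fertility values are again assumptions for illustration:

```latex
P(f \mid t_p) \;=\; \frac{c(t_p, f) + \beta \cdot P_0(f)}{c(t_p, \cdot) + \beta}
```

where f stands for the pair (number of left children, number of right children) of a node with tag t_p.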

Distance model
- Longer edges are less probable.
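
One simple way to express "longer edges are less probable" is an inverse-power penalty on the distance between a dependent at position i and its parent at position j; the exponent gamma here is an assumption for illustration:

```latex
P_{\mathit{dist}}(i, j) \;\propto\; \frac{1}{|i - j|^{\gamma}}, \qquad \gamma > 0
```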

Reducibility model
- The probability of a subtree is proportional to its reducibility score.

Probability of the treebank
- The probability of the whole treebank is what we want to maximize.
- It is a multiplication over all submodels and all words in the corpus (see the sketch below).
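
Putting the four submodels together, the maximized quantity sketched on the slide can be written as a product over all words; the exact factorization and notation below are an illustrative reconstruction rather than a quotation from the paper:

```latex
P(\text{treebank}) \;=\; \prod_{i}
  P_{\mathit{edge}}(t_i \mid t_{\pi(i)}, \mathit{dir}_i)\;
  P_{\mathit{fert}}(f_i \mid t_i)\;
  P_{\mathit{dist}}(i, \pi(i))\;
  P_{\mathit{red}}(\text{subtree}(i))
```

where pi(i) is the parent of word i and P_red is taken proportional to the reducibility score of the PoS n-gram covered by the subtree rooted at i.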

Gibbs sampling: bracketing notation
Each projective dependency tree can be expressed by a unique bracketing.
- Each bracket pair belongs to one node and delimits its descendants from the rest of the sentence.
- Each bracketed segment contains just one word that is not embedded deeper; this node is the head of the segment.
Example (tree over the tag sequence DT NN VB RB IN DT JJ NN, rooted in VB):
  (((DT) NN) VB (RB) (IN ((DT) (JJ) NN)))
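
A small Python sketch of this tree-to-bracketing mapping; the head indices in the usage example are a reconstruction chosen to reproduce the bracketing shown on the slide, and all names are illustrative.

```python
def bracketing(tags, heads):
    """Render a projective dependency tree in the bracketing used on the slide.

    tags:  list of PoS tags, e.g. ["DT", "NN", "VB", ...]
    heads: head index for each token (0-based), -1 for the root's head.
           The tree must be projective, so every subtree spans a contiguous segment.
    """
    children = [[] for _ in tags]
    roots = []
    for i, h in enumerate(heads):
        (roots if h < 0 else children[h]).append(i)

    def render(i):
        # A node's bracket encloses its left children, its own tag, then its right children.
        left = " ".join(render(c) for c in children[i] if c < i)
        right = " ".join(render(c) for c in children[i] if c > i)
        inner = " ".join(part for part in (left, tags[i], right) if part)
        return "(" + inner + ")"

    return " ".join(render(r) for r in roots)


# Reconstructed tree for the slide's example bracketing:
tags = ["DT", "NN", "VB", "RB", "IN", "DT", "JJ", "NN"]
heads = [1, 2, -1, 2, 2, 7, 7, 4]   # VB is the root; IN heads the final noun phrase
print(bracketing(tags, heads))       # (((DT) NN) VB (RB) (IN ((DT) (JJ) NN)))
```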

Gibbs sampling: a small change
- Choose one non-root node and remove its bracket.
- Add another bracket that does not violate projectivity.
Example: after removing the bracket of the IN node we have
  (((DT) NN) VB (RB) IN ((DT) (JJ) NN))
and the candidate brackets to add include:
  (IN ((DT) (JJ) NN)), ((RB) IN ((DT) (JJ) NN)), ((RB) IN),
  (((DT) NN) VB (RB)), (((DT) NN) VB), (VB (RB)), (VB), (IN)

Gibbs sampling: decoding
After 200 sampling iterations:
- We run the maximum spanning tree (MST) algorithm.
- Edge weights = number of occurrences of the individual edges in the sampled trees during the last 100 iterations.
- The output trees may therefore be non-projective.
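
A minimal sketch of this decoding step, assuming per-sentence edge counts collected over the last 100 samples. It uses networkx's maximum_spanning_arborescence for the MST step as a stand-in for whatever implementation the authors used; all names are illustrative.

```python
from collections import Counter
import networkx as nx

def decode(edge_counts):
    """Build one output tree from accumulated edge counts.

    edge_counts: Counter mapping (head, dependent) pairs (0 = artificial root,
                 1..n = word positions) to how often the edge appeared in the
                 last 100 sampling iterations.
    Returns a dict dependent -> head.
    """
    g = nx.DiGraph()
    for (head, dep), count in edge_counts.items():
        g.add_edge(head, dep, weight=count)
    # Maximum-weight spanning arborescence; node 0 has no incoming edges,
    # so it acts as the root of the decoded tree.
    tree = nx.maximum_spanning_arborescence(g, attr="weight")
    return {dep: head for head, dep in tree.edges()}


# Toy usage: counts over a 3-word sentence from hypothetical sampling runs.
counts = Counter({(0, 2): 95, (2, 1): 80, (1, 2): 20, (2, 3): 90, (3, 1): 15})
print(decode(counts))   # e.g. {2: 0, 1: 2, 3: 2}
```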

Evaluation
CoNLL 2006/2007 test data:
- all sentences (all lengths)
- punctuation was removed before the evaluation
- directed attachment score
Wikipedia corpora for estimating reducibility scores:
- from 85 mil. tokens for English down to 3 mil. tokens for Japanese
Impact of the reducibility model: attachment scores with and without reducibility for English, German, and Czech (table on the slide).

Results
Directed attachment scores on CoNLL 2006/2007 test data: Spitkovsky 2012 (Spi 2012) vs. Mareček 2012 (Mar 2012).
Languages compared (table on the slide): Arabic (2006, 2007), Basque, Bulgarian, Catalan, Chinese (2006, 2007), Czech (2006, 2007), Danish, Dutch, English, German, Greek, Hungarian, Italian, Japanese, Portuguese, Slovenian, Spanish, Swedish, Turkish (2006, 2007), plus the average over all languages.

Conclusions
- I have introduced the reducibility feature, which is useful in unsupervised dependency parsing.
- The reducibility scores for individual PoS tag n-grams are computed on a large corpus and then used in the induction algorithm on a smaller corpus.
- State-of-the-art? It might have been in January 2012.
Future work:
- Employ lexicalized models.
- Improve reducibility: a better treatment of function words.

Thank you for your attention.