Treebanks are Not Naturally Occurring Data Choices in Treebank Design and What They Mean for Natural Language Processing Owen Rambow Columbia University.


1 Treebanks are Not Naturally Occurring Data Choices in Treebank Design and What They Mean for Natural Language Processing Owen Rambow Columbia University – CCLS rambow@ccls.columbia.edu

2 Goal of this Talk In natural language processing, we all use syntactic representations (treebanks) In theoretical syntax, they use syntactic representations Problems in both communities: – Little reflection on what these syntactic representations mean – Representation confused with phenomenon which is the object of scientific investigation: natural language syntax Goal: increase meta-scientific understanding of syntactic representations for NLP

3 Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

4 Acknowledgements Rajesh Bhatt and Fei Xia – Bhatt, Rambow and Xia: Linguistic Phenomena, Analyses, and Representations: Understanding Conversion between Treebanks, IJCNLP 2011 Hindi-Urdu Treebank Project – Dipti Misra Sharma (IIIT Hyderabad) – Martha Palmer (Colorado) – NSF Grant All errors and infelicitous choices in this presentation are my own

5 What is a Syntactic Representation? 1. Syntactic phenomena, e.g.: – Subject of a verb – Relative clause – Small clause Linguists tend to agree on what phenomena exist 2. Mathematical representation type, e.g.: – Phrase structure tree – Dependency tree – Or something more complicated: LFG, TAG, … 3. Formal syntactic description: a. Mapping from phenomena to representations (of a particular type) b. The representation chosen for a specific phenomenon is also called its analysis c. The phenomena extracted from a representation are its interpretation d. A formal description is a syntactic theory if it makes predictions

6 Example: Small Clauses Hindi – आतिफ ने सीमा को बेवकूफ समझा – Atif ne Seema ko bewakuuf samjhaa – Atif Erg Seema Acc stupid consider.Pfv – ‘Atif considered Seema stupid.’ English – Atif considered Seema stupid – Atif considered her stupid

7 What is the Phenomenon? Syntactically and semantically, consider takes a clausal complement – Atif considered [ clause that she is stupid] – Atif considered [ clause her stupid] But two problems: – No verb – her is semantically subject of stupid but has accusative case, which is unusual (subjects are usually nominative) So: – Atif considered [ small clause her stupid]

8 What is the Representation Type? For this example, we will show dependency trees and phrase structure trees

9 Analysis 1a for Small Clauses: No Accusative Case Marking Structure represents her as subject but not accusative case marking of her [DS: considers -Subj-> Atif; considers -Obj-> stupid; stupid -Subj-> her]

10 Analysis 1b for Small Clauses: Exceptional Case Marking Structure represents her as subject and accusative case marking through node label [DS: considers -Subj-> Atif; considers -Obj-ECM-> stupid; stupid -Subj-> her]

11 Analysis 1a for Small Clauses: No Accusative Case Marking Structure represents her as subject but not accusative case marking of her [DS: considers -Subj-> Atif; considers -Obj-> stupid; stupid -Subj-> her] [PS: (S (NP Atif) (VP considers (S her (VP (AdjP stupid)))))]

12 Analysis 1b for Small Clauses: Exceptional Case Marking Structure represents her as subject and accusative case marking through node label [DS: considers -Subj-> Atif; considers -Obj-ECM-> stupid; stupid -Subj-> her] [PS: (S (NP Atif) (VP considers (SC her (VP (AdjP stupid)))))] Close to analysis adopted in Chomsky (1981)

13 Note on DS and PS These analyses are intuitively very similar Formal notion: “consistency” (Fei Xia, see Bhatt, Rambow & Xia 2011) – Intuition: a very simple and general algorithm can transform a consistent DS into a PS and vice versa

14 Analysis 2a for Small Clauses: General Monoclausal Analysis Structure represents accusative case marking of her (as object of matrix verb) but not her as semantic subject [DS: considers -Subj-> Atif; considers -Obj-> her; considers -Obj2-> stupid]

15 Analysis 2b for Small Clauses: Syntactic Monoclausal Analysis Structure represents accusative case marking of her (as object of matrix verb) and her as semantic subject using node label [DS: considers -Subj-> Atif; considers -Obj-> her; considers -ObjPred-> stupid]

16 Analysis 2b for Small Clauses: Syntactic Monoclausal Analysis Structure represents accusative case marking of her (as object of matrix verb) and her as semantic subject using node label [DS: considers -k1-> Atif; considers -k2-> her; considers -k2s-> stupid] Neo-Paninian analysis from IIIT Hyderabad, used for DS in Hindi-Urdu Treebank

17 Analysis 2b for Small Clauses: Syntactic Monoclausal Analysis Structure represents accusative case marking of her (as object of matrix verb) and her as semantic subject using node label [DS: समझा -k1-> आतिफ ने; समझा -k2-> सीमा को; समझा -k2s-> बेवकूफ] Neo-Paninian analysis from IIIT Hyderabad, used for DS in Hindi-Urdu Treebank

18 Analysis 2a for Small Clauses: General Monoclausal Analysis Structure represents accusative case marking of her (as object of matrix verb) but not her as semantic subject [DS: considers -Subj-> Atif; considers -Obj-> her; considers -Obj2-> stupid] [PS: (S (NP Atif) (VP considers her (AdjP stupid)))]

19 Analysis 2b for Small Clauses: Syntactic Monoclausal Analysis Structure represents accusative case marking of her (as object of matrix verb) and her as semantic subject using node label [DS: considers -k1-> Atif; considers -k2-> her; considers -k2s-> stupid] [PS tree with nodes S, NP (Atif), VP (considers her), AdjP (stupid), and an SC label]

20 Analysis 3 for Small Clauses: Raising to Object Structure represents accusative case marking of her and her as semantic subject but requires empty category [DS: considers -Subj-> Atif; considers -Obj-> her_1; considers -Obj-Pred-> stupid; stupid -Subj-> e_1]

21 Analysis 3 for Small Clauses: Raising to Object Structure represents accusative case marking of her and her as semantic subject but requires empty category [DS: considers -Subj-> Atif; considers -Obj-> her_1; considers -Obj-Pred-> stupid; stupid -Subj-> e_1] [PS: (S (NP Atif) (VP considers her_1 (S e_1 (VP (AdjP stupid)))))] Analysis used for PS in Hindi-Urdu Treebank

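22 Comparison of Representations Less information: – Tree 1a [considers -Subj-> Atif; considers -Obj-> stupid; stupid -Subj-> her] – Tree 2a [considers -Subj-> Atif; considers -Obj-> her; considers -Obj2-> stupid] Same information: – Tree 1b [considers -Subj-> Atif; considers -Obj-ECM-> stupid; stupid -Subj-> her] – Tree 2b [considers -Subj-> Atif; considers -Obj-> her; considers -ObjPred-> stupid] – Tree 3 [considers -Subj-> Atif; considers -Obj-> her_1; considers -Obj-Pred-> stupid; stupid -Subj-> e_1]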
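The comparison can be made concrete with a small sketch. The five dependency analyses are written as lists of (head, label, dependent) arcs, and two interpretation functions try to recover the phenomena from each one. The arc lists follow the slides; the interpretation rules (`semantic_subject`, `is_accusative`) and the `COINDEX` table are illustrative assumptions standing in for treebank guidelines, not rules taken from any actual treebank.

```python
# Each analysis is a list of (head, label, dependent) arcs, following the slides.
ANALYSES = {
    "1a": [("considers", "Subj", "Atif"), ("considers", "Obj", "stupid"),
           ("stupid", "Subj", "her")],
    "1b": [("considers", "Subj", "Atif"), ("considers", "Obj-ECM", "stupid"),
           ("stupid", "Subj", "her")],
    "2a": [("considers", "Subj", "Atif"), ("considers", "Obj", "her"),
           ("considers", "Obj2", "stupid")],
    "2b": [("considers", "Subj", "Atif"), ("considers", "Obj", "her"),
           ("considers", "ObjPred", "stupid")],
    "3":  [("considers", "Subj", "Atif"), ("considers", "Obj", "her"),
           ("considers", "Obj-Pred", "stupid"), ("stupid", "Subj", "e1")],
}

# Assumption: Analysis 3 coindexes the empty category e1 with "her".
COINDEX = {"e1": "her"}


def dependent(arcs, head, label):
    """Return the dependent attached to `head` with `label`, or None."""
    for h, l, d in arcs:
        if h == head and l == label:
            return d
    return None


def semantic_subject(arcs, pred):
    """Hypothetical guideline: who is the semantic subject of `pred`?"""
    subj = dependent(arcs, pred, "Subj")
    if subj is not None:
        # Resolve an empty category to the word it is coindexed with.
        return COINDEX.get(subj, subj)
    for h, l, d in arcs:
        # The ObjPred label says: my subject is the object of my head.
        if d == pred and l == "ObjPred":
            return dependent(arcs, h, "Obj")
    return None


def is_accusative(arcs, word):
    """Hypothetical guideline: is accusative marking of `word` recoverable?"""
    for h, l, d in arcs:
        if d != word:
            continue
        if l == "Obj":
            return True
        # Subject of a predicate that is itself attached with Obj-ECM.
        if l == "Subj" and any(l2 == "Obj-ECM" and d2 == h for _, l2, d2 in arcs):
            return True
    return False


summary = {name: (semantic_subject(arcs, "stupid"), is_accusative(arcs, "her"))
           for name, arcs in ANALYSES.items()}
```

Under these assumed rules, Trees 1b, 2b and 3 yield both facts, while Tree 1a loses the case marking and Tree 2a loses the semantic subject. The information sits in the pairing of structure and interpretation rule, not in the tree shape alone.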

23 Initial Summary: Syntactic Phenomena, Representation Types, Analyses Syntactic phenomena are the empirical data of syntax as part of the science of language – Can be very similar across languages There can be several possible analyses – Some have less information – But there can be different analyses that represent the same information differently The analyses can be similar in DS and PS Lots of choices in treebank design!

24 Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

25 Representation Types: Dependency and Phrase Structure Dependency Tree (DS): – One label alphabet, words (= words in a sentence) – All nodes labeled with words or empty strings Phrase Structure Tree (PS): – Two disjoint label alphabets, terminals (= words in sentence) and nonterminals – All and only interior nodes are labeled with nonterminals – Leaves are labeled with terminals or empty strings
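A minimal sketch of the two definitions as data structures, assuming trees are stored as nested nodes. The class and function names (`DSNode`, `PSNode`, `is_valid_ds`, `is_valid_ps`) are hypothetical, and edge labels such as Subj are left out because the definitions above only constrain node labels.

```python
from dataclasses import dataclass, field
from typing import List, Set

EMPTY = ""  # label of an empty category (empty string)


@dataclass
class DSNode:
    """Dependency-tree node: labeled with a word or the empty string."""
    word: str
    children: List["DSNode"] = field(default_factory=list)


@dataclass
class PSNode:
    """Phrase-structure node: nonterminal if interior, terminal if leaf."""
    label: str
    children: List["PSNode"] = field(default_factory=list)


def is_valid_ds(node: DSNode, words: Set[str]) -> bool:
    """One label alphabet: every node carries a word or the empty string."""
    ok = node.word == EMPTY or node.word in words
    return ok and all(is_valid_ds(c, words) for c in node.children)


def is_valid_ps(node: PSNode, terminals: Set[str], nonterminals: Set[str]) -> bool:
    """Two disjoint alphabets: nonterminals on all and only interior nodes."""
    if terminals & nonterminals:
        return False
    if not node.children:  # leaf
        return node.label == EMPTY or node.label in terminals
    return node.label in nonterminals and all(
        is_valid_ps(c, terminals, nonterminals) for c in node.children)


words = {"Rajesh", "happily", "eats", "laddus"}
ds = DSNode("eats", [DSNode("Rajesh"), DSNode("happily"), DSNode("laddus")])
ps = PSNode("S", [PSNode("N", [PSNode("Rajesh")]),
                  PSNode("Adv", [PSNode("happily")]),
                  PSNode("V", [PSNode("eats")]),
                  PSNode("N", [PSNode("laddus")])])
```

Neither definition mentions word order, empty categories, or projectivity, so those properties are free choices for either tree type.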

26 Non-Differences Between DS and PS WRONG: “DS can’t have empty categories” [DS: considers -Subj-> Atif; considers -Obj-> her_1; considers -Obj-Pred-> stupid; stupid -Subj-> e_1]

27 Non-Differences Between DS and PS WRONG: “DS must be unordered, PS must be ordered” that the dancer a flower the violinist gives

28 Non-Differences Between DS and PS WRONG: “PS can’t be non-projective/have crossing arcs”
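As a sketch of why nothing in the PS definition rules out crossing arcs: if each leaf records its surface position, a constituent whose yield is not a contiguous span is discontinuous. The tuple encoding and the toy sentence below are hypothetical illustrations, not an example from the talk, and the code assumes every interior node has at least one child.

```python
def yield_positions(tree):
    """Sorted surface positions under `tree`.

    Interior node: (label, [children]); leaf: (word, position).
    """
    _, rest = tree
    if isinstance(rest, int):
        return [rest]
    positions = []
    for child in rest:
        positions.extend(yield_positions(child))
    return sorted(positions)


def discontinuous_nodes(tree):
    """Labels of interior nodes whose yield has a gap."""
    label, rest = tree
    if isinstance(rest, int):
        return []
    found = []
    pos = yield_positions(tree)
    if pos[-1] - pos[0] + 1 != len(pos):
        found.append(label)
    for child in rest:
        found.extend(discontinuous_nodes(child))
    return found


# Invented toy word order "laddus Rajesh eats": the VP groups positions 0 and 2.
toy = ("S", [("NP", [("Rajesh", 1)]),
             ("VP", [("NP", [("laddus", 0)]), ("V", [("eats", 2)])])])
gaps = discontinuous_nodes(toy)
```

Here `gaps` contains only the VP. The tree is still a valid PS tree under the definition; whether a treebank allows such constituents is a design choice.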

29 Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

30 The Double Life of Dependency and of Constituency There are two meanings to “dependency”: – A type of tree (no nonterminal labels) – A type of syntactic phenomenon, namely a relation between words (e.g., “subjecthood”) There are two meanings to “constituency” – A type of tree (terminal labels on leaf nodes; = PS tree) – A type of syntactic phenomenon, namely a grouping of words into phrases Mathematical objects = tools for linguists to represent data Empirical facts = data for linguists Don’t confuse data with its representation!

31 Constituency in PS Representation but not in Syntax VP premodifiers in PTB can be either sister to VP or in VP; there is no difference in meaning – (S (NP-SBJ Sandy) (ADVP-TMP often) (VP throws (NP curves))) – (S (NP-SBJ Sandy) (VP (ADVP-TMP often) throws (NP curves))) “The variation is in general free” (PTB Guidelines p. 137)

32 Constituency in Syntax but not in PS Representation Flat NP in PTB – Syntax of Adj-N-N sequences: (green (baby bassinet)) ((small business) administration) – Representation in Penn Treebank: (green baby bassinet) (small business administration) Conscious decision not to represent NP-internal pre-nominal structure

33 Dependency in DS Representation but not in Syntax --- links in CATiB dependency treebank of Arabic – Signal that connected words are not analyzed by annotators for dependency Pof link in DS part of Hindi-Urdu Treebank – Used for multiword expressions which form a unit syntactically (and thus have no structure) but may occur apart in sentence

34 Dependency in Syntax but not in DS Representation Small clauses in DS for Hindi-Urdu Treebank: dependency between embedded predicate and its semantic subject not marked as dependency link in tree [DS: considers -k1-> Atif; considers -k2-> her; considers -k2s-> stupid]

35 Aren’t DS and PS Representations Complementary? NO! Syntactic dependency can be encoded in DS, and typically is Usual convention: attachment in projection shows type of dependency

36 Aren’t DS and PS Representations Complementary? NO! Syntactic constituency is represented in DS Usual convention: each node is both the word and the head of the phrase containing it and all its descendants
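The convention can be sketched in a few lines: given a head index per word, the phrase headed by a word is that word plus all of its descendants. The encoding (a `heads` list with -1 for the root) and the function name `phrases` are assumptions made for illustration. The example uses Analysis 1a of the small clause.

```python
def phrases(words, heads):
    """Map each word index to the words of the phrase it heads.

    heads[i] is the index of word i's head, or -1 for the root.
    """
    children = {i: [] for i in range(len(words))}
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)

    def span(i):
        # The word itself plus everything below it, in surface order.
        out = [i]
        for c in children[i]:
            out.extend(span(c))
        return sorted(out)

    return {i: [words[j] for j in span(i)] for i in range(len(words))}


# Analysis 1a: considers -> Atif, considers -> stupid, stupid -> her
words = ["Atif", "considers", "her", "stupid"]
heads = [1, -1, 3, 1]
result = phrases(words, heads)
```

The entry for stupid is the small clause (her stupid), and the entry for considers is the whole sentence, so this DS encodes the constituents without any nonterminal nodes.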

37 Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

38 But What About Intermediate Projections? If PS: – only has preterminals (e.g. V) and maximal projections (e.g. S); – and no attachment happens at preterminal; then we have trivial correspondence to DS (consistency in the formal sense of Xia) Then PS and DS must have the exact same expressive power

39 No Intermediate Projections: DS and PS Representationally Equivalent [PS: (S (N Rajesh) (Adv happily) (V eats) (N laddus))] [DS: eats -> Rajesh; eats -> happily; eats -> laddus] What about the POS tags in the DS?

40 No Intermediate Projections: DS and PS Representationally Equivalent [PS: (S (N Rajesh) (Adv happily) (V eats) (N laddus))] [DS: eats/V -> Rajesh/N; eats/V -> happily/Adv; eats/V -> laddus/N] What about grammatical functions?

41 No Intermediate Projections: DS and PS Representationally Equivalent [PS: (S (N-Subj Rajesh) (Adv-Adj happily) (V eats) (N-Obj laddus))] [DS: eats/V -Subj-> Rajesh/N; eats/V -Adj-> happily/Adv; eats/V -Obj-> laddus/N]
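The trivial correspondence can be sketched as a round trip. This is a sketch under stated assumptions, not the consistency algorithm of Bhatt, Rambow and Xia: the `PROJECTION` table is a made-up label mapping, a word without dependents stays a bare preterminal, each phrase is assumed to contain exactly one preterminal whose tag projects to the phrase label, and the DS is assumed to be projective.

```python
# Assumed mapping from a head's POS tag to the label of its maximal projection.
PROJECTION = {"V": "S", "N": "NP", "Adv": "AdvP", "Adj": "AdjP"}


def ds_to_ps(words, tags, heads, i=None):
    """Build a PS with only preterminals and maximal projections."""
    if i is None:
        i = heads.index(-1)  # start at the root
    deps = [j for j, h in enumerate(heads) if h == i]
    pre = (tags[i], words[i])
    if not deps:
        return pre  # no dependents: bare preterminal
    kids = [pre if j == i else ds_to_ps(words, tags, heads, j)
            for j in sorted(deps + [i])]
    return (PROJECTION[tags[i]], kids)


def ps_to_ds(ps):
    """Recover (words, tags, heads) from a PS built as above."""
    words, tags, heads = [], [], []

    def visit(node):
        # Returns the index of the head word of `node`.
        label, rest = node
        if isinstance(rest, str):  # preterminal
            words.append(rest)
            tags.append(label)
            heads.append(-1)
            return len(words) - 1
        idx = [visit(child) for child in rest]
        # Head child: the preterminal whose tag projects to this label.
        head = next(i for i, child in zip(idx, rest)
                    if isinstance(child[1], str) and PROJECTION[child[0]] == label)
        for i in idx:
            if i != head:
                heads[i] = head
        return head

    visit(ps)
    return words, tags, heads


words = ["Rajesh", "happily", "eats", "laddus"]
tags = ["N", "Adv", "V", "N"]
heads = [2, 2, -1, 2]
ps = ds_to_ps(words, tags, heads)
```

For this sentence `ps` is the flat S from the slide, and `ps_to_ds(ps)` returns the original words, tags and heads. Grammatical function labels could be carried along the same way, as edge labels in the DS and as function tags on the PS nodes.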

42 Intermediate Projections If PS only has preterminals (e.g. V) and maximal projections (e.g. S), we have trivial correspondence to DS No direct equivalent of intermediate projections (e.g. VP) in DS

43 Intermediate Projections: DS and PS Different! [PS tree for Rajesh happily eats laddus with phrasal nodes NP, AdvP and an intermediate VP under S] [DS: eats/V -Subj-> Rajesh/N; eats/V -Adj-> happily/Adv; eats/V -Obj-> laddus/N] Note: in X, X’, XP, XP = maximal projection EXCEPT: in V, VP, S, VP = intermediate projection

44 Intermediate Projections If PS only has preterminals (e.g. V) and maximal projections (e.g. S), we have trivial correspondence to DS No direct equivalent of intermediate projections (e.g. VP) in DS: do we have a problem? What is important: how intermediate projections are used in the formal syntactic description A node by itself is not information about the syntax; only a node together with its interpretation is

45 Why Have Intermediate Projections? Evidence for IntProj as a constituent – VP fronting in English Use of a VP to represent semantic facts – Interpretation of adverbs in English Use of IntProj to represent syntactic facts – Subject and object of verbs in English – Genitive versus adjectival modification in Arabic – Also: arguments of verbs in Hindi; verb second in German

46 Intermediate Projection as Constituent: VP Fronting in English VP can be moved as a constituent – Rajesh said he would [ VP eat laddus] and [ VP eat laddus] Rajesh did

47 VP Fronting in English [PS tree of the fronted clause “eat laddus Rajesh did”, with node labels Sbar, S, VP, NP, N, V, Aux]

48 What can Dependency Do? [Four dependency trees for “eat laddus Rajesh did”, using the labels Top, Subj, Obj, Aux and Comp; one tree contains an empty category ε1 coindexed with eat] – If auxiliary did is head (basically, we have a VP!) – If eat is head Issue: two possible analyses of auxiliary-verb structures in DS

49 Intermediate Projections and Semantic Scope Possible claim: – VP adverbs characterize manner of event: John quickly/happily played chess – S adverbs characterize speaker attitude towards event: Happily/fortunately (enough), John lost against me at chess Problem: interpretation independent of word order – Quickly/happily John played chess – John happily/oddly (enough) lost against me at chess Conflict between using VP to identify subject and using VP to describe semantics of adverbs – Would need traces to show scope In fact, Penn Treebank for English does not use VPs for scope

50 Use of VP to Indicate Subject [DS: considers -Subj-> Atif; considers -Obj-> her_1; considers -Obj-Pred-> stupid; stupid -Subj-> e_1] [PS: (S (NP Atif) (VP considers her_1 (S e_1 (VP (AdjP stupid)))))]

51 Projections: Noun Modification in Arabic Treebank Idafa (genitive) construction: – (NP (N بيت /byt/house) (NP (N المصري /AlmSry/the-Egyptian))) the house of the Egyptian Modification construction: – (NP (N الوزير /Alwzyr/the-minister) (Adj المصري /AlmSry/the-Egyptian)) the Egyptian minister Syntactic difference signaled by projection to X or XP Can’t do in DS – need labels

52 Conclusion from Intermediate Projections Intermediate projections (e.g. VP, N’) are specific to PS They can be used to encode syntactic or semantic information But: DS can express the same information! Conclusion: the existence of intermediate projections in PS is not a reason to say that PS is more expressive than DS

53 Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

54 What Does This Mean for NLP? Treebanks are not naturally occurring data The guidelines are painstakingly produced by linguists and represent a formal description of the language Annotators understand a sentence, determine what syntactic phenomena exist, and use the guidelines to choose an analysis for the sentence (a structure) Users of the treebank can use the guidelines to interpret the structures and get back the syntactic phenomena present These phenomena, and not their representation in the treebank, can be used for NLP in whatever representation the researcher chooses! There is already lots of linguistics in our resources; we just need to make use of that linguistic information!

55 Conclusion 4 takehome lessons

56 Lesson 1: There are All Sorts of Trees Fallacy: DS trees are unordered, have no empty categories,… and PS trees are ordered and … Truth: DS and PS trees are graphs that can have all sorts of properties

57 Lesson 2: Trees Need an Interpretation Fallacy: Trees have intrinsic syntactic meaning Truth: Trees only gain syntactic meaning through interpretation; in treebanks: guidelines

58 Lesson 3: Conversion between Formats Depends on Interpretation Fallacy: It is easier/harder to convert DS to PS than vice versa Truth: conversion means preserving interpretation; thus it depends on the two representations between which conversion happens

59 Lesson 4: There are two Meanings to “Constituency” and “Dependency” Fallacy: DS trees only show dependency and PS trees only show constituency Truth: there are distinct mathematical (data structure) notions and linguistic notions of “dependency” and “phrase structure”; each tree type can show each

60 Thank you!

