Drexel – 4/22/13 1/39 Treebank Analysis Using Derivation Trees Seth Kulick


1 Drexel – 4/22/13 1/39 Treebank Analysis Using Derivation Trees Seth Kulick skulick@seas.upenn.edu

2 Overview
- Background
  - Treebank: a collection of sentences annotated (by people).
  - The role of treebanks in Natural Language Processing (NLP).
- The Problem
  - Automatically discovering internal inconsistency in treebanks.
- Our approach
  - System output
- Adaptation for parser evaluation/inter-annotator agreement
- Future work

3 Treebank Example
[Tree diagram for “The wholesale price index stood at 90.1”, with nodes S, NP-SBJ, NML, VP, PP, NP]
- Each sentence has a tree with syntactic information
- The annotation is done (ideally) in conformity with a set of annotation guidelines
Treebank annotation is hard and error-prone

4 The role of treebanks in NLP
- Penn Treebank (1994): 1M words from the Wall Street Journal; “statistical revolution in NLP”
- Training and evaluation material for parsers
- Parsers: a machine learning problem
  - input: new sentence; output: tree for that sentence
  - used for downstream processing
- Many approaches to parsing
  - Break down full trees into smaller parts that can be used to assign structure to a new sentence
  - Rely on consistency of annotation.
- What does it mean for a treebank to be inconsistent?

5 Annotation inconsistency: Example 1
- Two instances of “The wholesale price index” in a treebank.
- Suppose a parser encountered “The wholesale price index” in new data. How would it know what to do?
[Two tree diagrams for “The wholesale price index”: one NP containing an NML node, one flat NP]

6 Annotation inconsistency: Example 2
- Two instances of “foreign trade exports” in a treebank
[Two tree diagrams for “foreign trade exports”: one NP containing an NML node, one flat NP]

7 Annotation inconsistency: Example 3
- Two instances of “than average” in a treebank
[Two tree diagrams: (PP than (NP average)) and (PP than (ADJP average))]

8 The Problem: How to detect inconsistencies
- A million words of such trees: what to compare, how to find inconsistencies?
[Two tree diagrams of sentences whose subject is “The wholesale price index” (nodes S, NP-SBJ, VP), one with an NML node and one without]

9 The Problem: How to detect inconsistencies
- After approx. 20 years of making treebanks, this is still an open question.
- Usual approach: hand-crafted searches for structural problems (e.g. two SBJs for a verb, unexpected unary branches, etc.)
- But we want a way to automatically discover inconsistencies.
  - Dickinson & Meurers (TLT 2003…)
  - Kulick et al. (ACL 2011, LREC 2012, NAACL 2013)
- Also for treebanks meant for linguistic research, not as training material for parsers.

10 The Importance of the Problem
- Lots of treebanks
  - Penn Treebank: 1M words (DARPA)
  - OntoNotes: 1.3M words (DARPA)
  - Arabic Treebank: 600K words (DARPA)
  - Penn Parsed Corpora of Historical English: 1.8M words (NSF)
- Annotation makes treebank construction expensive
  - Need faster and better treebank construction
  - NSF Linguistics/Robust Intelligence (Kroch & Kulick)
- Treebank analysis overlaps with key concerns
  - How to increase annotation speed
  - How to determine features for parsers.

11 Our approach
- Similarities to parsing work: break down full trees to compare parts of them
  - Adapt some ideas from parsing research
  - Tree Adjoining Grammar (Joshi et al…)
- Basic idea:
  - Decompose each tree into smaller chunks of structure
  - “Derivation tree” relates these chunks together
  - Compare strings based on their derivation tree fragments
- Two advantages
  - Group inconsistencies by structural properties
  - Abstract away from interfering surface properties

12 Treebank Decomposition
[Tree diagram for “The wholesale price index stood at 90.1”, with nodes S, NP-SBJ, NML, VP, PP, NP]

13 Treebank Decomposition
[Tree diagram for “The wholesale price index stood at 90.1”, with nodes S, NP-SBJ, NML, VP, PP, NP]
Decomposition is done by heuristics controlling the descent through the tree. The separate chunks (elementary trees) become the nodes of the “derivation tree”.

14 Derivation Tree
Each node is an elementary tree.

15 Derivation Tree
Derivation tree fragment for “the wholesale price index”

16 Treebank Example
[Tree diagram for “The wholesale price index was 180.9”, with nodes S, NP-SBJ, VP, NP]

17 Treebank Example
[Tree diagram for “The wholesale price index was 180.9”, with nodes S, NP-SBJ, VP, NP]

18 Derivation Tree
Another derivation tree fragment for “the wholesale price index”

19 The Basic Idea
- “nucleus”: a sequence of words being examined for consistency of annotation
- Collect all the instances of that nucleus
  - Each instance has an associated derivation tree fragment
  - Have a set of all derivation tree fragments for the nucleus
- If more than one derivation tree fragment, flag the nucleus as inconsistent
- Simplifying the problem
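
The flagging rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the system described in the talk: the bracketed strings stand in for real derivation tree fragments, and the names `instances`, `fragments_by_nucleus` and `inconsistent_nuclei` are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: one (nucleus, fragment) pair per instance in the corpus.
# The bracketed strings are stand-ins for derivation tree fragments.
instances = [
    ("the wholesale price index", "NP(the, NML(wholesale, price), index)"),
    ("the wholesale price index", "NP(the, wholesale, price, index)"),
    ("the wholesale price index", "NP(the, wholesale, price, index)"),
    ("at 90.1", "PP(at, NP(90.1))"),
]


def fragments_by_nucleus(pairs):
    """Collect the set of distinct fragments seen for each nucleus."""
    table = defaultdict(set)
    for nucleus, fragment in pairs:
        table[nucleus].add(fragment)
    return dict(table)


def inconsistent_nuclei(pairs):
    """Keep only the nuclei that were annotated in more than one way."""
    return {
        nucleus: fragments
        for nucleus, fragments in fragments_by_nucleus(pairs).items()
        if len(fragments) > 1
    }


flagged = inconsistent_nuclei(instances)
for nucleus, fragments in flagged.items():
    print(nucleus, "->", len(fragments), "different fragments")
```

Using a set per nucleus means repeated instances with the same annotation collapse, so only genuinely different annotations trigger the flag.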

20 Derivation Tree Fragments
- Two different derivation tree fragments for “the wholesale price index”: flag as inconsistent

21 Organization by Annotation Structures
- Great advantage of this approach
  - For any set of instances for a nucleus, we have the derivation tree fragments for each instance
  - Group nuclei by the set of derivation tree fragments used for all the instances that the nucleus has.

22 Another Annotation Inconsistency
- Two instances of “the economic growth rate”
- Inconsistent, just like “the wholesale price index”
[Two tree diagrams for “the economic growth rate”: one NP containing an NML node, one flat NP]

23 Derivation Tree Fragments
- Aside from the words, exactly the same as the derivation tree fragments for “the wholesale price index”

24 Organization by Annotation Structures
- Why this is good:
  - e.g. “a short interest position”, “the economic growth rate”, “the real estate industry” are all annotated inconsistently in the same way
  - We care more about different types of consistencies/inconsistencies than individual cases of words.
  - Very helpful for identifying the sorts of errors annotators might be making, or problematic areas for a parser.
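
Grouping nuclei by annotation type can be illustrated as follows. This is a sketch under stated assumptions: nested tuples stand in for derivation tree fragments, the tree shapes are illustrative, and replacing every word with a positional placeholder (`delexicalize`) is one possible way of ignoring the words, not necessarily what the system does.

```python
from collections import defaultdict


def delexicalize(tree, counter=None):
    """Replace each word (leaf string) with a positional placeholder w0, w1, ..."""
    if counter is None:
        counter = [0]
    if isinstance(tree, str):
        placeholder = f"w{counter[0]}"
        counter[0] += 1
        return placeholder
    label, *children = tree
    return (label, *(delexicalize(child, counter) for child in children))


def group_by_annotation_type(fragments_by_nucleus):
    """Map each set of word-free fragments to the nuclei annotated that way."""
    groups = defaultdict(list)
    for nucleus, fragments in fragments_by_nucleus.items():
        key = frozenset(delexicalize(fragment) for fragment in fragments)
        groups[key].append(nucleus)
    return groups


# Toy data: (label, child, child, ...) tuples, with plain strings as words.
data = {
    "the wholesale price index": [
        ("NP", "the", ("NML", "wholesale", "price"), "index"),
        ("NP", "the", "wholesale", "price", "index"),
    ],
    "the economic growth rate": [
        ("NP", "the", ("NML", "economic", "growth"), "rate"),
        ("NP", "the", "economic", "growth", "rate"),
    ],
    "than average": [
        ("PP", "than", ("NP", "average")),
        ("PP", "than", ("ADJP", "average")),
    ],
}

groups = group_by_annotation_type(data)
for nuclei in groups.values():
    print(sorted(nuclei))
```

The two four-word nuclei end up in one group because their fragments differ only in the words, while “than average” forms a separate annotation type.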

25 System Overview
- Compute derivation tree for each sentence
- Identify strings to compare: the “nuclei”
  - For now, use all strings that are constituents somewhere in the corpus. (this can’t be right in the long run)
- Get the derivation tree fragments for each nucleus
  - Include nuclei with more than one derivation tree fragment.
  - Sort by set of derivation tree fragments used for a nucleus.
- Software: getting in shape for release
  - Derivation tree and elementary tree extraction: Java
  - Everything else: MySQL and Python. (trees in MySQL)
  - Output: static HTML for now.
- Used for current treebanking
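
The nucleus-selection step (“all strings that are constituents somewhere in the corpus”) might look roughly like this. It is a minimal sketch, not the released software: the trees are toy nested tuples loosely modelled on the slide examples, and all function names are hypothetical.

```python
def leaves(tree):
    """Return the words of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def constituent_strings(tree):
    """Yield the word sequence covered by every labelled node."""
    if isinstance(tree, str):
        return
    yield tuple(leaves(tree))
    for child in tree[1:]:
        yield from constituent_strings(child)


def candidate_nuclei(corpus):
    """Every word sequence that is a constituent in at least one tree."""
    nuclei = set()
    for tree in corpus:
        nuclei.update(constituent_strings(tree))
    return nuclei


def occurrences(nucleus, corpus):
    """(sentence index, start offset) of each occurrence, constituent or not."""
    hits = []
    n = len(nucleus)
    for index, tree in enumerate(corpus):
        words = tuple(leaves(tree))
        for start in range(len(words) - n + 1):
            if words[start:start + n] == nucleus:
                hits.append((index, start))
    return hits


# Toy corpus: (label, child, ...) tuples with plain strings as words.
corpus = [
    ("S",
     ("NP-SBJ", "The", ("NML", "wholesale", "price"), "index"),
     ("VP", "stood", ("PP", "at", ("NP", "90.1")))),
    ("S",
     ("NP-SBJ", "The", "wholesale", "price", "index"),
     ("VP", "was", ("NP", "180.9"))),
]

nuclei = candidate_nuclei(corpus)
print(occurrences(("wholesale", "price"), corpus))
```

Note that “wholesale price” is a constituent only in the first tree, yet both occurrences are collected, which is what lets the two annotations be compared.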

26 Why not just compare subtrees?
- Two instances of “The wholesale price index” in a treebank.
- Why bother decomposing it?
- Adjunction/coordination
- Partial constituents/Treebank simplifications
[Two tree diagrams for “The wholesale price index”: one NP containing an NML node, one flat NP]

27 Adjunction
- “of the class” modifying “the teacher”
- Suppose we want to look at all instances of “the teacher of the class”
[Tree diagram for “the teacher of the class” (nodes NP, PP, NP), captioned “Adjunction structure for modification”]

28 Adjunction
- “that I took” modifying “the class”
- No subtree with just “the teacher of the class”
[Tree diagram for “the teacher of the class that I took” (nodes NP, PP, NP, NP, SBAR)]
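
The point that plain subtree comparison misses this case can be checked with a small sketch. The two trees below are assumed shapes reconstructed from the slide text (the internal structure of the SBAR is left flat), and `has_constituent` is a hypothetical helper.

```python
def yield_of(tree):
    """Return the words of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(yield_of(child))
    return words


def has_constituent(tree, words):
    """True if some labelled node covers exactly the given words."""
    if isinstance(tree, str):
        return False
    if yield_of(tree) == words:
        return True
    return any(has_constituent(child, words) for child in tree[1:])


# Assumed tree shapes, written as (label, child, ...) tuples.
plain = ("NP",
         ("NP", "the", "teacher"),
         ("PP", "of", ("NP", "the", "class")))

with_relative_clause = ("NP",
                        ("NP", "the", "teacher"),
                        ("PP", "of",
                         ("NP",
                          ("NP", "the", "class"),
                          ("SBAR", "that", "I", "took"))))

target = "the teacher of the class".split()
print(has_constituent(plain, target))                 # a node covers the string
print(has_constituent(with_relative_clause, target))  # no node covers it
```

In the second tree the words of the nucleus are all present and contiguous, but no single node covers exactly them, so a subtree-based comparison has nothing to compare.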

29 Adjunction in Derivation Tree Fragments
- nucleus “the teacher of the class”
- a1-a5 and b1-b5 are compared
- No interference from b6 (and below).

30 Arabic Treebank example
nucleus is “summit Sharm el-Sheikh”
[Tree diagrams of NPs containing “summit Sharm el-Sheikh”, one of them also including “Egypt”; captions: “Annotation is different”, “Annotation is same”]

31 Arabic Treebank example
nucleus is “summit Sharm el-Sheikh”

32 Further Abstraction from Treebank
- Two instances of nucleus “one place”
  - Not a constituent in one of them.
[Diagram label: “not a constituent here”]

33 Further Abstraction from Treebank
- Change #b2 to not have the QP
- More abstract representation also used for parsing

34 (Partial) Evaluation
- Arabic Treebank, 598K words (Maamouri et al, 2008-9)
  - First 10 annotation types include 266 nuclei, all correctly identified
  - recall of inconsistencies: hard to measure
  # Nuclei   # Reported   # non-duplicate   # Annotation Types
  54,496     9,984        4,272             1,911
- OntoNotes English newswire, 525K words
  - evaluation so far: successful, not perfect
  # Nuclei   # Reported   # non-duplicate   # Annotation Types
  30,497     3,609        3,012             1,186

35 System Output
- Similar words grouped together

36 Parser Analysis/Inter-Annotator Agreement
- Method of finding nuclei is in general a problem
  - Reliance on identical strings of words: “later”, “l8r”
  - Must occur at least once as a constituent
- But in one situation it is completely natural
  - Comparing two sets of annotations over the same sentences
  - Kulick et al, NAACL 2013
- Parser evaluation: output of parser compared to treebank gold
- Inter-annotator Agreement Evaluation
  - Two annotators working on the same sentences
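
When two annotations cover the same sentence, every word span lines up exactly, so candidate nuclei are easy to find. The sketch below shows that idea with a deliberately simpler technique than the one in the talk: it compares labelled brackets inside each span instead of derivation tree fragments. The trees and function names are assumptions made for illustration.

```python
def labeled_spans(tree, start=0, spans=None):
    """Return (end offset, set of (label, start, end)) for a nested-tuple tree."""
    if spans is None:
        spans = set()
    if isinstance(tree, str):
        return start + 1, spans
    position = start
    for child in tree[1:]:
        position, _ = labeled_spans(child, position, spans)
    spans.add((tree[0], start, position))
    return position, spans


def disagreements(tree_a, tree_b):
    """Spans that are constituents in either tree but bracketed differently."""
    _, spans_a = labeled_spans(tree_a)
    _, spans_b = labeled_spans(tree_b)
    candidates = sorted({(s, e) for _, s, e in spans_a | spans_b})
    differing = []
    for start, end in candidates:
        inside_a = {x for x in spans_a if start <= x[1] and x[2] <= end}
        inside_b = {x for x in spans_b if start <= x[1] and x[2] <= end}
        if inside_a != inside_b:
            differing.append((start, end))
    return differing


# Two hypothetical annotations of the same two words, "than average".
annotator_1 = ("PP", "than", ("NP", "average"))
annotator_2 = ("PP", "than", ("ADJP", "average"))

print(disagreements(annotator_1, annotator_2))
```

The same comparison applies to parser evaluation by treating the parser output as one annotation and the gold tree as the other.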

37 Example of Inter-annotator (dis)agreement
- What to compare? Original starting point of work
- Two annotators on same sentence

38 IAA evaluation
- Evaluation on 4,270-word pre-release subset of Google English Web Treebank (Bies et al, 2012)
  Inconsistency Type   # Found   # Accurate
  Function Tags only   5         3
  POS Tags only        18        13
  Structural           129       122
- Evaluation on 82,701-word pre-release supplement, Modern British English (Kroch & Santorini, in prep)
- 1,532 inconsistency types with 2,194 nuclei
- First has 88 nuclei, second has 37
- First 20 (375 nuclei) all true instances of inconsistent annotation

39 Future work
- Clustering based on words and derivation tree fragments.
  - Problems of multiple spellings: webtext, historical corpora
- Order output based on some notion of what’s more likely to be an error.
- Dependency work.
- Parsing work based on derivation trees.

