
1 Thoughts on Treebanks
Christopher Manning, Stanford University

2 Q1: What do you really care about when you're building a parser?
- Completeness of information
  - There's not much point in having a treebank if you really end up having to do unsupervised learning anyway
  - You want the annotation to be adding human value
  - Classic bad example: noun compound structure in the Penn English Treebank
- Consistency of information
  - If things are annotated inconsistently, you lose both in training (if the inconsistency is widespread) and in evaluation
  - Bad example: "long ago" constructions: as long ago as …; not so long ago
- Mutual information
  - Categories should be as mutually informative as possible (a sketch of measuring this follows below)
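
To make the mutual information point concrete, here is a minimal sketch (mine, not from the talk) of estimating how informative one annotation layer is about another from co-occurrence counts. The toy observations and the function name mutual_information are illustrative assumptions.

    from collections import Counter
    from math import log2

    def mutual_information(pairs):
        """Estimate I(X; Y) in bits from a list of (x, y) observations."""
        n = len(pairs)
        joint = Counter(pairs)
        px = Counter(x for x, _ in pairs)
        py = Counter(y for _, y in pairs)
        mi = 0.0
        for (x, y), c in joint.items():
            p_xy = c / n
            mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
        return mi

    # Toy example: how much does a phrasal category tell you
    # about the POS tag of its head?
    observations = [("NP", "NN"), ("NP", "NNS"), ("VP", "VBD"),
                    ("VP", "VBZ"), ("NP", "NN")]
    print(mutual_information(observations))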

3 Q3: What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
- Information on function is definitely useful
  - We should move to always having typed dependencies
  - Clearest example in the Penn English Treebank: temporal NPs
- Empty categories don't necessarily give much value in the dumbed-down world of Penn English Treebank parsing work
  - Though it should be tried again/more
  - But they are definitely useful if you want to know this stuff!
    - Subcategorization/argument structure determination
    - Natural Language Understanding!!
  - Cf. the work of Johnson, Levy and Manning, etc. on long-distance dependencies
- I'm sceptical that there is a categorical argument/adjunct distinction to be made
  - Leave it to the real numbers
  - This means that subcategorization frames can only be statistical (see the sketch below)
  - Cf. Manning (2003)
  - I've got some more slides on this from another talk if you want …
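
A minimal sketch of what "statistical subcategorization frames" might look like in practice: relative frequencies of frames per verb, estimated from counts. The toy observations, frame notation, and helper name are my own, not from the talk.

    from collections import Counter, defaultdict

    # Toy (verb, frame) observations, e.g. extracted from a parsed corpus
    observations = [
        ("give", "NP NP"), ("give", "NP PP-to"), ("give", "NP PP-to"),
        ("eat", "NP"), ("eat", ""),  # "" = intransitive use
    ]

    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1

    def frame_distribution(verb):
        """P(frame | verb) as relative frequencies:
        a graded, not categorical, notion of subcategorization."""
        total = sum(counts[verb].values())
        return {frame: c / total for frame, c in counts[verb].items()}

    print(frame_distribution("give"))  # {'NP NP': 0.33..., 'NP PP-to': 0.66...}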

4 Q3: What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
- Do you prefer a more refined tagset for parsing? Yes. I mightn't use it, but I often do
- The transform-detransform framework:
  RawInput → TransformedInput → Parser → TransformedOutput → DesiredOutput
  - I think everyone does this to some extent
  - Some, like Johnson and Klein and Manning, have exploited it very explicitly: NN-TMP, IN^T, NP-Poss, VP-VBG, NP-v
  - Everyone else should think about it more
- It's easy to throw away overly precise information, or to move information around deterministically (tag to phrase or vice versa), if it's represented completely and consistently! (A sketch follows below.)
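
A minimal sketch of the transform-detransform idea: split categories before training/parsing (here, marking possessive NPs in the style of an NP-Poss split), then strip the annotations to recover trees in the original tagset. The tuple tree encoding and function names are mine, chosen for illustration.

    # Trees as nested tuples: (label, child, ...); leaves are (tag, word)
    def transform(tree):
        """Annotate: relabel NPs whose last child is a POS tag as NP-Poss."""
        if len(tree) == 2 and isinstance(tree[1], str):
            return tree  # preterminal leaf: leave as-is
        children = [transform(c) for c in tree[1:]]
        label = tree[0]
        if label == "NP" and children[-1][0] == "POS":
            label = "NP-Poss"
        return (label, *children)

    def detransform(tree):
        """Strip annotations so output matches the original treebank tagset."""
        if len(tree) == 2 and isinstance(tree[1], str):
            return tree
        return (tree[0].split("-")[0], *[detransform(c) for c in tree[1:]])

    t = ("NP", ("NP", ("NNP", "John"), ("POS", "'s")), ("NN", "dog"))
    split = transform(t)     # inner NP becomes NP-Poss
    restored = detransform(split)
    print(split, restored == t)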

5 Q4: How does grammar writing interact with treebanking?
- In practice, they often haven't interacted much
- I'm a great believer that they should
  - Having a grammar is a huge guide to how things should be parsed, and a way to check parsing consistency (see the sketch below)
  - It also allows opportunities for analysis updating, etc.
  - Cf. the Redwoods Treebank, and subsequent efforts
- The inability to automatically update treebanks is a growing problem
  - Current English treebanking isn't having much impact because of annotation differences with the original PTB
  - Feedback from users has only rarely been harvested
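
One way a grammar can check treebank consistency, sketched minimally: verify that every local tree in an annotated tree corresponds to a production licensed by the grammar, and flag those that don't. The toy rule set and tree encoding are assumptions of mine.

    # A toy CFG as a set of (parent, (children...)) productions
    GRAMMAR = {
        ("S", ("NP", "VP")),
        ("NP", ("DT", "NN")),
        ("VP", ("VBD", "NP")),
    }

    def unlicensed_local_trees(tree):
        """Yield local trees (productions) not licensed by GRAMMAR."""
        if len(tree) == 2 and isinstance(tree[1], str):
            return  # preterminal leaf: not checked here
        production = (tree[0], tuple(child[0] for child in tree[1:]))
        if production not in GRAMMAR:
            yield production
        for child in tree[1:]:
            yield from unlicensed_local_trees(child)

    t = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBD", "barked")))
    print(list(unlicensed_local_trees(t)))  # [('VP', ('VBD',))] -- flag for review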

6 Q5: What methodological lessons can be drawn for treebanking?
- Good guidelines (loosely, a grammar!)
- Good, trained people
- Annotator buy-in
- Ann Bies said all this … I strongly agree!
- I think there has been a real underexploitation of technology for treebank validation
  - Doing vertical searches/checks almost always turns up inconsistencies (see the sketch below)
  - Either these checks or a grammar should drive vertical review
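
A minimal sketch of a vertical check: group every occurrence of the same word sequence across the corpus and flag those that received more than one annotation. The corpus format and function name are illustrative assumptions.

    from collections import defaultdict

    def inconsistent_annotations(corpus):
        """corpus: iterable of (word_sequence, annotation) pairs, e.g. a
        constituent's words and its label. Return sequences annotated
        in more than one way -- candidates for vertical review."""
        seen = defaultdict(set)
        for words, annotation in corpus:
            seen[tuple(words)].add(annotation)
        return {words: anns for words, anns in seen.items() if len(anns) > 1}

    corpus = [
        (["as", "long", "ago", "as"], "ADVP"),
        (["as", "long", "ago", "as"], "PP"),   # same string, different analysis
        (["not", "so", "long", "ago"], "ADVP"),
    ]
    print(inconsistent_annotations(corpus))
    # {('as', 'long', 'ago', 'as'): {'ADVP', 'PP'}}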

7 Q6: What are advantages and disadvantages of pre-processing the data to be treebanked with an automatic parser?
- The economics are clear: you reduce annotation costs
- The costs are also clear:
  - The parser places a large bias on the trees produced
  - Humans are lazy/reluctant to correct mistakes
- A clear example: I think it is fair to say that many POS errors in the Penn English Treebank can be traced to the POS tagger
  - E.g., sentence-initial capitalized Separately, Frankly, Currently, Hopefully analyzed as NNP
  - That doesn't look like a human being's mistakes to me
- The answer: more use of technology to validate and check humans (a sketch follows below)
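
A minimal sketch of the kind of automatic check that could catch the NNP example above: scan tagged sentences for sentence-initial capitalized -ly words tagged as proper nouns. The data format is an assumption, and a real validator would have many such heuristic rules.

    def suspicious_initial_nnp(sentences):
        """sentences: list of [(word, tag), ...]. Flag sentence-initial
        capitalized -ly words tagged NNP: likely tagger-induced errors."""
        flagged = []
        for i, sent in enumerate(sentences):
            word, tag = sent[0]
            if tag == "NNP" and word[0].isupper() and word.lower().endswith("ly"):
                flagged.append((i, word))
        return flagged

    sentences = [
        [("Frankly", "NNP"), (",", ","), ("it", "PRP"), ("failed", "VBD")],
        [("Frankly", "RB"), (",", ","), ("it", "PRP"), ("worked", "VBD")],
    ]
    print(suspicious_initial_nnp(sentences))  # [(0, 'Frankly')]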

8 Q7: What are the advantages of a phrase-structure and/or a dependency treebank for parsing?
- The current split in the literature between "phrase-structure" and "dependency" parsing is largely bogus (in my opinion)
  - The Collins/Bikel parser operates largely in the manner of a dependency parser
  - The Stanford parser contains a strict (untyped) dependency parser
- Phrase structure parsers have the advantage of phrase structure labels
  - A dependency parser is just a phrase structure parser where you cannot refer to phrasal types or condition on phrasal span
  - This extra info is useful; it's silly not to use it
- Labeling phrasal heads (= dependencies) is useful. Silly not to do it
  - Automatic "head rules" should have had their day by now!! (see the sketch below)
  - Scoring based on dependencies is much better than Parseval!!!
- Labeling dependency types is useful
  - Especially in languages with freer word order
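
To make the "head rules" point concrete, here is a minimal sketch in the spirit of Collins-style head finding (the rules and encoding here are toy ones of mine): percolate heads up a phrase-structure tree and read off the unlabeled dependencies. This is exactly the kind of automatic procedure the slide argues should be superseded by heads annotated in the treebank itself.

    # Toy head rules: for each parent label, child labels in priority order
    HEAD_RULES = {"S": ["VP", "NP"], "VP": ["VBD", "VBZ", "VP"],
                  "NP": ["NN", "NNS", "NP"]}

    def find_head(tree, deps):
        """Return the lexical head word of tree; record (head, dependent)
        arcs for all non-head children in deps."""
        if len(tree) == 2 and isinstance(tree[1], str):
            return tree[1]  # preterminal: the word is its own head
        heads = [find_head(child, deps) for child in tree[1:]]
        labels = [child[0] for child in tree[1:]]
        head_index = 0
        for label in HEAD_RULES.get(tree[0], []):
            if label in labels:
                head_index = labels.index(label)
                break
        for i, h in enumerate(heads):  # non-head children depend on the head
            if i != head_index:
                deps.append((heads[head_index], h))
        return heads[head_index]

    deps = []
    t = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBD", "barked")))
    find_head(t, deps)
    print(deps)  # [('dog', 'the'), ('barked', 'dog')]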

