Multilingual PennTools that capture parses and predicate-argument structures, and their use in Applications Martha Palmer, Aravind Joshi, Mitch Marcus,


1 Multilingual PennTools that capture parses and predicate-argument structures, and their use in Applications Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Fernando Pereira University of Pennsylvania March 11, 2002 TIDES SITE VISIT

2 Outline Overview –Objectives, resource development, applications Supervised Training of Individual Components –parsers –semantic taggers Training with labeled and unlabeled data –co-training –active learning (annotation tools)

3 Objectives Resources ($200K) –Chinese TreeBank II –Parallel Korean/English TreeBanks –PropBank –Multilingual Annotation Tool – (Tom Morton, Nianwen Xue, Jeremy Lacivita) NYU, MITRE, LDC

4 Objectives (cont) PennTools ($300K) –Morphological Analyzers (at LDC) –Major decrease in parser development time and parser running time (Dan Bikel, Carlos Prolo, Anoop Sarkar) –Automatic Predicate Argument Tagging (Dan Gildea) –Word Sense Disambiguation, English & Chinese (Hoa Dang)

5 Chinese TreeBank II Fu-dong Chiou, Nianwen Xue Cost of CTB I, 100K words: $270K Additional 40K (20K, 20K) –speedup given automatic parses? doubled –compare HK, Sinorama, People’s Daily 2002: 360K words, $100K –Chiang’s parser doubles annotation speed –96K words bracketed as of March 8, 2002 –110K Xinhua news, 200K other newswire, 50K DLI corpus –release of original 100K + 150K planned for June

6 English Translation: CTB I TIDES Beijing E-C Translation LTD 12 week estimate, actual 15 weeks, Nov 100K words, around $10K (.06 per char) 3rd pass for error correction –taking longer than expected –40K/100K done

7 Chinese PropBank - DOD Proposal stage, 2 yrs, 275K a year Year One (Just got funded) –Develop lexicon guidelines, 2600 verbs –Tag 100K CTB Year Two –Extend guidelines, up to 5 or 6000 verbs –Tag additional 400K CTB II Spinoff – Chinese lexicon

8 Richer CTB Annotations TIDES ($25K) Coreference Tagging (Susan Converse) –Draft guidelines –100K words tagged Sense tagging (Hoa Dang)

9 Korean/English Parallel TreeBank Chunghye Han, Narae Han, Allen Lee (CoGenTex/Penn/Systran: ARL MT Project) Defense Language Institute data –50K word corpus of military messages –Same corpus available in Chinese Guidelines for POS tagging, bracketing http://www.cis.upenn.edu/~xtag/koreantag/index.html Companion Transfer Lexicon, 4000 entries READY TO RELEASE

10 English PropBank Paul Kingsbury, Scott Cotton 1M words of Treebank New semantic augmentations –Predicate-argument relations for verbs, –label arguments (arg0, arg1, arg2) –First subtask, 300K word financial subcorpus Spin-off: English lexical resource –3500+ verbs
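As a minimal sketch of what "label arguments (arg0, arg1, arg2)" could look like as data, the snippet below stores one verb instance with its numbered arguments. The class name, the example sentence, and the role glosses in the comments are hypothetical illustrations, not taken from the PropBank frames files or annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Predicate:
    """One annotated verb instance with its numbered arguments (hypothetical layout)."""
    lemma: str
    # Maps a label such as "arg0" to the text span that fills that role.
    args: Dict[str, str] = field(default_factory=dict)

    def labels(self) -> List[str]:
        """Return the argument labels in sorted order."""
        return sorted(self.args)


# Hypothetical sentence: "The company sold its shares to investors."
sold = Predicate(
    lemma="sell",
    args={
        "arg0": "The company",   # assumed gloss: the seller
        "arg1": "its shares",    # assumed gloss: the thing sold
        "arg2": "to investors",  # assumed gloss: the buyer
    },
)

print(sold.lemma, sold.labels())
```

Keeping the labels as plain numbered strings mirrors the slide's arg0/arg1/arg2 scheme; what each number means for a given verb would come from that verb's frames file entry.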

11 English PropBank – Current Status Frames files – 787 verb lemmas (includes phrasal variants - 932) – 363/ VerbNet semi-automatic expansions (subtask/PB) First subtask: 300K financial subcorpus 22,595 unique predicates annotated out of 29K, (80%) –6K+ remaining (7 weeks, 2000@week, first pass) 1040 verb lemmas out of 1700+ (59%) –700 remaining (3.5 months, 200@month) PropBank, (including some of Brown?) – 34,437 predicates annotated out of 118K, (29%) – 1040 verb lemmas out of 3500, (29%)

12 Summary of Resources
Project | 2002 Funds | Status | Completion Date
Chinese Treebank II | $100K | 146K/400K words (100K@4 mo) | Dec '02
English translation of Chinese Treebank I (2001 $10K) | $4K | 3rd pass, 40K/100K | July '02
Richer Chinese Treebank (coref/wsd) | $25K | 1st pass pronouns 100K/100K; 30/100 verbs | Dec '02
Korean/English parallel Treebank (2001 $100K) | $50K | 50K English / 50K Korean | March '02
English PropBank (2001 $250K) | $250K | 240K/300K finan.; 300K/1M WSJ | June '02; Dec '02

13 Objectives (cont) Applications: ($200K) + ($150K) Relation Extraction and MT –Initial experiments with MUC 7 –Korean/English MT system wrap-up –Plans for investigating statistical MT approaches

14 Information Extraction with TAG (Libin Shen, Anoop Sarkar, and Jinying Chen) Template Relation (TR) task of the 7th Message Understanding Conference F-Measure of 78% on sentence-level relation which is comparable to the best results in MUC-7 Convert IE into a discriminative problem –Syntactic Analysis with Supertagger [Joshi 1994] and Lightweight Dependency Analyzer [Srinivas 1997] –Machine Learning with Boosting algorithm [Schapire 2000]
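For reference, the F-measure quoted above combines precision and recall into one score. The sketch below uses the standard F-beta formula; the slides do not report the underlying precision, recall, or counts, so the numbers in the usage line are made up purely for illustration.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """F-beta score from raw counts; beta=1 gives the balanced F-measure."""
    predicted = true_pos + false_pos   # relations the system proposed
    actual = true_pos + false_neg      # relations in the answer key
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_pos / predicted
    recall = true_pos / actual
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, not results from the MUC-7 experiments.
score = f_measure(true_pos=78, false_pos=22, false_neg=22)
print(f"{score:.2f}")
```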

15 Korean/English ARL MT System: New Parser Evaluation
Treebank trained – Anoop Sarkar
Result | Off-the-shelf parser | Treebank parser, 1st pass | W/ improved collocations & markers | Corrected DSyntS
OK | 32% | 35% | 51% | 82%
Fixable | 64% | 65% | 45% | 16%
Bad | 4% | 5% | 4% | 2%
Dependency Evaluation: 75.7% on test, 97.58% training
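The slides do not say how the dependency evaluation was scored. One common choice is attachment accuracy, the fraction of words whose predicted head matches the gold-standard head. The sketch below assumes that metric; the function name and the head indices are hypothetical.

```python
from typing import Sequence


def attachment_accuracy(gold_heads: Sequence[int], pred_heads: Sequence[int]) -> float:
    """Fraction of words whose predicted head index equals the gold head index."""
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted analyses must cover the same words")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)


# Made-up four-word sentence; 0 marks the root, other values are 1-based head positions.
gold = [2, 0, 2, 3]
pred = [2, 0, 2, 2]
print(attachment_accuracy(gold, pred))
```

A large gap between training and test scores, like the one reported on the slide, would show up here as a much higher value on sentences the parser was trained on than on held-out sentences.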

16 Statistical Approaches to MT (Dan Gildea, Yuan Ding, Owen Rambow) Tree-based alignment: –use one or both sets of trees from parallel treebanks to constrain alignments, –compare with unstructured alignments (IBM models). Word-sense disambiguation: –apply maximum entropy model of word sense disambiguation to translation selection. Monolingual corpora: –translation selection based on dependency statistics from monolingual corpora. Statistical generation: –PropBank as underlying representation for statistical generation (JHU summer workshop).

