Multilingual PennTools that capture parses and predicate-argument structures, and their use in Applications
Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Fernando Pereira
University of Pennsylvania
March 11, 2002
TIDES SITE VISIT

Outline
Overview
–Objectives, resource development, applications
Supervised Training of Individual Components
–parsers
–semantic taggers
Training with labeled and unlabeled data
–co-training
–active learning (annotation tools)

Objectives
Resources ($200K)
–Chinese TreeBank II
–Parallel Korean/English TreeBanks
–PropBank
–Multilingual Annotation Tool (Tom Morton, Nianwen Xue, Jeremy Lacivita)
NYU, MITRE, LDC

Objectives (cont)
PennTools ($300K)
–Morphological Analyzers (at LDC)
–Major decrease in parser development time and parser running time (Dan Bikel, Carlos Prolo, Anoop Sarkar)
–Automatic Predicate Argument Tagging (Dan Gildea)
–Word Sense Disambiguation, English & Chinese (Hoa Dang)

Chinese TreeBank II
Fu-dong Chiou, Nianwen Xue
Cost of CTB I, 100K words: $270K
Additional 40K (20K, 20K)
–speedup given automatic parses? doubled
–compare HK, Sinorama, People’s Daily
2002 - 360K words, $100K
–Chiang’s parser doubles annotation speed
–96K words bracketed as of March 8, 2002
–110K Xinhua news, 200K other newswire, 50K DLI corpus
–release of original 100K + 150K planned for June

English Translation: CTB I
TIDES
Beijing E-C Translation LTD
12 week estimate, actual 15 weeks, Nov
100K words, around $10K (.06 per char)
3rd pass for error correction
–taking longer than expected
–40K/100K done

Chinese PropBank - DOD
Proposal stage, 2 yrs, 275K a year
Year One (Just got funded)
–Develop lexicon guidelines, 2600 verbs
–Tag 100K CTB
Year Two
–Extend guidelines, up to 5000 or 6000 verbs
–Tag additional 400K CTB II
Spinoff – Chinese lexicon

Richer CTB Annotations
TIDES ($25K)
Coreference Tagging (Susan Converse)
–Draft guidelines
–100K words tagged
Sense tagging (Hoa Dang)

Korean/English Parallel TreeBank
Chunghye Han, Narae Han, Allen Lee (CoGenTex/Penn/Systran: ARL MT Project)
Defense Language Institute data
–50K word corpus of military messages
–Same corpus available in Chinese
Guidelines for POS tagging, bracketing
http://www.cis.upenn.edu/~xtag/koreantag/index.html
Companion Transfer Lexicon, 4000 entries
READY TO RELEASE

English PropBank
Paul Kingsbury, Scott Cotton
1M words of Treebank
New semantic augmentations
–Predicate-argument relations for verbs
–label arguments (arg0, arg1, arg2)
–First subtask, 300K word financial subcorpus
Spin-off: English lexical resource
–3500+ verbs
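As a rough illustration of what a predicate-argument annotation with numbered argument labels might look like in code, here is a minimal sketch. The `Predicate` class, the token-span encoding, and the example sentence are all hypothetical and are not the PropBank file format; only the labels arg0 and arg1 come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Predicate:
    """One annotated verb instance (hypothetical structure, not the PropBank format)."""
    lemma: str                      # verb lemma, e.g. "buy"
    token_index: int                # position of the verb in the sentence
    # label -> (start, end) token span, end exclusive
    args: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def arg_text(tokens: List[str], pred: Predicate, label: str) -> str:
    """Return the words covered by one labeled argument."""
    start, end = pred.args[label]
    return " ".join(tokens[start:end])


# Invented example sentence, for illustration only.
tokens = "The company bought the shares".split()
pred = Predicate(lemma="buy", token_index=2,
                 args={"arg0": (0, 2), "arg1": (3, 5)})
```

Here `arg_text(tokens, pred, "arg0")` picks out the buyer and `arg_text(tokens, pred, "arg1")` the thing bought.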

English PropBank – Current Status
Frames files
– 787 verb lemmas (includes phrasal variants - 932)
– 363/ VerbNet semi-automatic expansions (subtask/PB)
First subtask: 300K financial subcorpus
– 22,595 unique predicates annotated out of 29K (80%)
– 6K+ remaining (7 weeks, 2000@week, first pass)
– 1040 verb lemmas out of 1700+ (59%)
– 700 remaining (3.5 months, 200@month)
PropBank (including some of Brown?)
– 34,437 predicates annotated out of 118K (29%)
– 1040 verb lemmas out of 3500 (29%)

Summary of Resources
Project | 2002 Funds | Status | Completion Date
Chinese Treebank II | $100K | 146K/400K words (100K@4 mo) | Dec '02
English translation of Chinese Treebank I (2001 $10K) | $4K | 3rd pass, 40K/100K | July '02
Richer Chinese Treebank (coref/wsd) | $25K | 1st pass pronouns 100K/100K; 30/100 verbs | Dec '02
Korean/English parallel Treebank (2001 $100K) | $50K | 50K English / 50K Korean | March '02
English PropBank (2001 $250K) | $250K | 240K/300K finan.; 300K/1M WSJ | June '02; Dec '02

Objectives (cont)
Applications: ($200K) + ($150K)
Relation Extraction and MT
–Initial experiments with MUC 7
–Korean/English MT system wrap-up
–Plans for investigating statistical MT approaches

Information Extraction with TAG
(Libin Shen, Anoop Sarkar, and Jinying Chen)
Template Relation (TR) task of the 7th Message Understanding Conference
F-Measure of 78% on sentence-level relation, which is comparable to the best results in MUC-7
Convert IE into a discriminative problem
–Syntactic Analysis with Supertagger [Joshi 1994] and Lightweight Dependency Analyzer [Srinivas 1997]
–Machine Learning with Boosting algorithm [Schapire 2000]
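For readers unfamiliar with the F-measure quoted above, the following sketch shows the standard definition as the weighted harmonic mean of precision and recall. The function name and the counts in the example are hypothetical; the slide reports only the final 78% figure, not the underlying counts or the weighting used.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented counts, chosen only so that precision == recall == 0.78.
score = f_measure(true_pos=78, false_pos=22, false_neg=22)
```

When precision and recall are equal, the F-measure equals that shared value, which is why these invented counts give 0.78.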

Korean/English ARL MT System: New Parser Evaluation
Treebank trained – Anoop Sarkar
Result | off-the-shelf parser | Treebank parser 1st pass | W/ improved collocations & markers | Corrected DSyntS
OK | 32% | 35% | 51% | 82%
Fixable | 64% | 65% | 45% | 16%
Bad | 4% | 5% | 4% | 2%
Dependency Evaluation: 75.7% on test, 97.58% training
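The slide does not say how the dependency evaluation score is computed. One common choice, assumed here purely for illustration, is the fraction of words whose predicted head matches the gold head. The function name and the toy head lists below are hypothetical.

```python
from typing import Sequence


def attachment_accuracy(gold_heads: Sequence[int], predicted_heads: Sequence[int]) -> float:
    """Fraction of words whose predicted head index equals the gold head index.

    Assumption: each word has exactly one head, with 0 standing for the root.
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted analyses must cover the same words")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Invented four-word sentence: the last word is attached to the wrong head.
accuracy = attachment_accuracy([2, 0, 2, 3], [2, 0, 2, 2])
```

Under this assumed metric, the gap between the test and training figures on the slide would reflect how many head attachments the parser gets right on unseen versus seen sentences.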

Statistical Approaches to MT
(Dan Gildea, Yuan Ding, Owen Rambow)
Tree-based alignment:
–use one or both sets of trees from parallel treebanks to constrain alignments,
–compare with unstructured alignments (IBM models).
Word-sense disambiguation:
–apply maximum entropy model of word-sense disambiguation to translation selection.
Monolingual corpora:
–translation selection based on dependency statistics from monolingual corpora.
Statistical generation:
–PropBank as underlying representation for statistical generation (JHU summer workshop).
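The idea of translation selection from monolingual dependency statistics can be sketched as follows: among the candidate translations of a word, prefer the one that co-occurs most often as a dependent of the already-chosen head word in a target-language corpus. This is a toy sketch under that assumption, not the project's method; the function name, the counts, and the example words are all invented.

```python
from collections import Counter
from typing import Iterable, Tuple


def select_translation(candidates: Iterable[str], head_word: str,
                       dep_counts: "Counter[Tuple[str, str]]") -> str:
    """Pick the candidate seen most often as a dependent of head_word.

    dep_counts maps (head, dependent) pairs to corpus frequencies;
    unseen pairs count as 0. Ties go to the earliest candidate.
    """
    return max(candidates, key=lambda cand: dep_counts[(head_word, cand)])


# Invented (head, dependent) counts from a hypothetical English corpus.
dep_counts = Counter({("open", "account"): 12, ("open", "bill"): 1})
choice = select_translation(["bill", "account"], "open", dep_counts)
```

A real system would need smoothing and a much larger set of dependency relations; the raw counts are used here only to keep the sketch short.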
