Multilingual PennTools that capture parses and predicate-argument structures, for use in Applications Martha Palmer, Aravind Joshi, Mitch Marcus, Mark.


1 Multilingual PennTools that capture parses and predicate-argument structures, for use in Applications
Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Fernando Pereira
University of Pennsylvania
March 25, 2003
TIDES SITE VISIT

2 Translation Issues: Chinese to English
–Word order
–Dropped arguments
–Lexical ambiguities
–Structure vs morphology
CH: ta zai wen-jian shang qian-zi
EN: he signed the document

3 Abstracting away from surface structure
sign: NP1 [case:nom], NP2 [case:acc]
qian-zi: NP1, NP2 [prep:zai]
CH: ta zai wen-jian shang qian-zi
EN: he signed the document
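
The idea on this slide can be sketched in code. The following is a minimal illustration, not the PennTools implementation: the `FRAMES` table, the `SurfaceArg` class, and the `abstract` function are hypothetical names, and the only data used are the two frames shown on the slide. It shows how differently marked surface arguments (English case versus the Chinese preposition zai) can be mapped onto the same abstract slots NP1 and NP2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceArg:
    slot: str    # abstract slot name taken from the slide, e.g. "NP1"
    marker: str  # surface marking, e.g. "case:nom"; "" when the slide shows none


# Hypothetical frame table holding only the two predicates from the slide.
FRAMES = {
    "sign": (SurfaceArg("NP1", "case:nom"), SurfaceArg("NP2", "case:acc")),
    "qian-zi": (SurfaceArg("NP1", ""), SurfaceArg("NP2", "prep:zai")),
}


def abstract(predicate, fillers):
    """Map surface-marked fillers onto the abstract slots of a predicate.

    `fillers` is keyed by surface marker; the result is keyed by abstract slot.
    """
    return {arg.slot: fillers[arg.marker] for arg in FRAMES[predicate]}


english = abstract("sign", {"case:nom": "he", "case:acc": "the document"})
chinese = abstract("qian-zi", {"": "ta", "prep:zai": "wen-jian"})

# Both sentences fill the same abstract slots despite different surface marking.
same_slots = english.keys() == chinese.keys()
```

Keying the result by abstract slot rather than by surface marker is the point of the sketch: downstream components would only see NP1 and NP2, regardless of language.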

4 Common Thread
Predicate-argument structure
–Basic constituents of the sentence and how they are related to each other
Constituents
–he, the document
Relations
–Sign

5 Penn approach
Annotation + machine learning = IP tools
George Washington signed the Constitution.
PERSON COMMUNICATION
PROPER NOUN VERB DET PROPER NOUN
[ [NP1 ] [ [ ] NP2 ]]
Arg0-Agent REL Arg1-Theme

6 Predicate-argument structure
George Washington signed the Constitution.
sign
–Agent: George W.
–Theme: Constitution
NP1 [case:nom], NP2 [case:acc]
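
As a rough sketch of how such a structure might be held in a program, the snippet below builds one proposition from labeled spans. The `Proposition` class and `from_labeled_spans` function are hypothetical names introduced for illustration only; the labels (Arg0-Agent, REL, Arg1-Theme) and the sentence are the ones shown on slide 5, and nothing here reflects the actual PennTools output format.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    rel: str                                   # the predicate (relation) text
    args: dict = field(default_factory=dict)   # role label -> argument text


def from_labeled_spans(spans):
    """Build a Proposition from (label, text) pairs, as a tagger might emit them."""
    rel = None
    args = {}
    for label, text in spans:
        if label == "REL":
            rel = text
        else:
            args[label] = text
    if rel is None:
        raise ValueError("no REL span found")
    return Proposition(rel, args)


# Labeled spans for the example sentence on the slide.
spans = [
    ("Arg0-Agent", "George Washington"),
    ("REL", "signed"),
    ("Arg1-Theme", "the Constitution"),
]
prop = from_labeled_spans(spans)
```

After this runs, `prop.rel` holds the predicate and `prop.args` maps each role label to its text span, which is the same information the slide lists under sign, Agent, and Theme.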

7 Outline for Today
Introduction
Overview
–Objectives, Chinese TreeBank, Agenda
PennTools: Training of Individual Components
–noun phrase chunkers, parsers, word sense taggers, semantic argument taggers
–Training with labeled and unlabeled data
  Active learning (annotation tools)
  Unsupervised learning
  Combining labeled and unlabeled data
Information Extraction
Machine Translation

8 Objectives – Resources: TreeBanks
Fu-dong Chiou, Tsan Kauang Lee, Chingyi Chia, Meiyu Chang
Prior releases
–Chinese TreeBanks 1.0 and 2.0 (100K and revisions)
–Korean/English Parallel TreeBanks
Recent releases
–Chinese TreeBank 3.0 (250K)
–Chinese TreeBank 2.0 and English translation as parallel corpora
Future releases
–Chinese TreeBank 4.0 (400K, Dec, ‘03), 5.0 (500K, ‘04)
–CTB English Translation Treebank 1.0

9 Sighan’03, Sapporo, Japan
Second SIGHAN Workshop on Chinese Language Processing, ACL’03, Sapporo, Japan
and the First International Chinese Word Segmentation Bakeoff
Four sources for training and test corpora:
–The Academia Sinica (Taiwan) Treebank: Taiwan Big Five encoding
–The Beijing University Institute of Computational Linguistics Corpus: GB encoding
–The Penn Chinese Treebank: GB encoding
–Hong Kong City University corpus: HK Big Five encoding

10 Summary of Chinese TreeBanks
Resource | Genre | Data, Cost | Completion Date
Chinese Treebank 1.0 | Xinhua Newswire | 100K | June, ‘00
Chinese Treebank 2.0 | Xinhua Newswire | 100K, $270K | Dec, ‘00
Proposed Chinese TreeBank Release | Xinhua Newswire | 250K, $100K | Feb, 03
Chinese TreeBank 3.0 (+CTB 2.0) | Xinhua Newswire | 150K, $70 | March, ’03*
Chinese TreeBank 4.0 | Sinorama (Taiwanese Magazine) | 100K, $80K** | July, ‘03
* Delay caused by poor quality of English Translation.
** Increased cost due to difficulty w/ automatic parsing of new genre.

11 Parallel TreeBanks
Lessons learned
–good quality translation is slow, expensive and hard to come by
–switching genres (Xinhua to Sinorama) can really slow down treebanking
–Start with good quality parallel corpora, similar genre if possible – AFP

12 Parallel TreeBanks
To Do
–Finish double pass of Sinorama (100K + additional 50K, Oct, ‘03)
–AFP – 100K words, Summer, ‘04
–English treebanking, first 100K, and then?

13 Richer CTB Annotations
Coreference Tagging (Susan Converse)
–Guidelines presented at Sighan’02, Coling-02, Taiwan
–100K words tagged, double annotated, adjudication is ongoing, additional tagging
–Two preliminary tools for recovering dropped arguments under development
  Hobbs algorithm modified for Chinese
  MaxEnt system

14 Summary of Resources
Resource | Genre | Data, Cost | Completion Date
Chinese Treebank 4.0 | Sinorama (Taiwanese Magazine) | 150K | Oct, 03
Chinese Treebank 5.0 | AFP | 100K | 2004
CTB English Translation TreeBank | Translation of Xinhua Newswire | 100K, $70K | Aug, 03
Chinese/English Parallel TreeBank | Chinese/English Sinorama; Chinese/English AFP | 150K; 100K | ??
English PropBank | Financial subcorpus, WSJ; Penn TreeBank II, WSJ | 300K; 1M, $625K | June ‘02; Dec ‘03
Chinese PropBank | Xinhua Newswire | 250K, $500K | Summer, ‘04

15 Resource Development
Chinese PropBank – Nianwen Xue
English PropBank – Olga Babko-Malaya

16 Objectives (cont.)
PennTools ($200K) – faster training of multilingual components with less annotation
–Noun phrase chunking with SuperTags (Libin Shen)
–Parsing in Multiple Languages (Dan Bikel)
–(Unsupervised) Coarse-grained Word Sense Disambiguation (Jinying Chen)
–Automatic Predicate Argument Tagging (using labeled and unlabeled data) (Szuting Yi)

17 Objectives (cont.)
Applications: Putting it all together
–Semantic Relations for Passage Retrieval (Tom Morton)
–Information Extraction – ACE
  Participated in ’02 English Entity and Relation evaluation
  Future directions of ACE (Seth Kulick and Edward Loper)
  Recent Improvements in English Named Entity Tagging (Ryan McDonald)
  Preliminary work on Chinese (Yuan Ding, John Blitzer)
–Machine Translation
  Flexible Tree-to-string Alignment (Dan Gildea)
  Johns Hopkins Summer Workshop plans

