Semantic Annotation for Interlingual Representation of Multilingual Texts Teruko Mitamura (CMU), Keith Miller (MITRE), Bonnie Dorr (Maryland), David Farwell.

Semantic Annotation for Interlingual Representation of Multilingual Texts
Teruko Mitamura (CMU), Keith Miller (MITRE), Bonnie Dorr (Maryland), David Farwell (NMSU), Nizar Habash (Columbia), Stephen Helmreich (NMSU), Eduard Hovy (ISI), Lori Levin (CMU), Owen Rambow (Columbia), Flo Reeder (MITRE), Advaith Siddharthan (Columbia)
LREC 2004 Workshop: “Beyond Named Entity Recognition: Semantic labelling for NLP tasks”

LREC 2004 Workshop

IAMTC (Interlingua Annotation of Multilingual Corpora) Project
Collaboration:
–New Mexico State University
–University of Maryland
–Columbia University
–MITRE
–Carnegie Mellon University
–ISI, University of Southern California

Goals of IAMTC
Interlingua design
–Three levels of depth
Annotation methodology
–Manuals, tools, evaluations
Annotated multi-parallel texts
–Foreign language original and multiple English translations
–Foreign languages: Arabic, French, Hindi, Japanese, Korean, Spanish

Getting at Meaning (Two translations of a Korean original text)

Translation 1: Starting on January 1 of next year, SK Telecom subscribers can switch to less expensive LG Telecom or KTF. … The Subscribers cannot switch again to another provider for the first 3 months, but they can cancel the switch in 14 days if they are not satisfied with services like voice quality.

Translation 2: Starting January 1st of next year customers of SK Telecom can change their service company to LG Telecom or KTF … Once a service company swap has been made, customers are not allowed to change companies again within the first three months, although they can cancel the change anytime within 14 days if problems such as poor call quality are experienced.

Color Key
Black: same meaning and same expression
Green: small syntactic difference
Blue: lexical difference
Red: not contained in the other text
Purple: larger difference
–Need to use some inference to know that the meaning is the same

Getting at Meaning (Two translations of a Japanese original text)

Translation 1: This year, too, in addition to the birth of Mitsubishi Chemical, which has already been announced, other rather large-scale mergers may continue, and be recorded as a "year of mergers."

Translation 2: This year, which has already seen the announcement of the birth of Mitsubishi Chemical Corporation as well as the continuous numbers of big mergers, may too be recorded as the "year of the merger" for all we know.

More lexical similarity. More differences in dependency relations.

Toward a ‘Theory of Annotation’
Recently, sharp increase in number of annotated resources being built:
–Penn Treebank, PropBank, many others…
For annotation, need
–Theory behind phenomena being annotated (for)
–Annotation termsets (even WordNet, FrameNet, VerbNet, HowNet…)
–Standard (?) annotation corpus (same old Treebank?)
–Annotation tools: they make an immense difference
–Carefully considered annotation procedure (interleaving per text vs. per sentence, etc.)
–Reconciliation and consistency checking procedures
–Evaluation measures, appropriately defined

Corpus and Data
Initial Corpus
–10+ texts in each language
–2+ translations each into English
Interlingua designed for MT
–Multiple English translations of same source show translation divergences. Some phenomena:
  Lexical level: word changes
  Syntactic level: phrasing, thematization, nominalization
  Semantic level: additional/different content
  Discourse level: multi-clause structure, anaphora
  Pragmatic level: speech acts, implicatures, style, interpersonal
Causes of divergence
–Genuine ambiguity/vagueness of source meaning
–Translator error/reinterpretation

IL Development: Staged, deepening
IL0: simple dependency tree gives structure
IL1: semantic annotations for Nouns, Verbs, Adjs, Advs, and Theta Roles
–Not yet ‘semantic’: “buy” ≠ “sell”, many remaining simplifications
–Concept ‘senses’ from ISI’s Omega ontology
–Theta Roles from Dorr’s LCS work
–Elaborate annotation manuals
–Tiamat annotation interface
–Post-annotation reconciliation process and interface
–Evaluation scores: annotator agreement
IL2: that comes next…

Details of IL0
Deep syntactic dependency representation:
–Removes auxiliary verbs, determiners, and some function words
–Normalizes passives, clefts, etc.
–Includes syntactic roles (Subj, Obj)
Construction:
–Dependency parsed using Connexor (English)
–Tapanainen and Jarvinen, 1997
–Hand-corrected
Extensive manual and instructions on IAMTC Wiki website
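The function-word removal listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's pipeline: the `Node` class, the POS labels `Aux` and `Det`, and the toy parse are hypothetical, and the slides say the real IL0 trees came from Connexor output that was then hand-corrected.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical tag set: POS labels treated as removable function words.
REMOVABLE_POS = {"Aux", "Det"}


@dataclass
class Node:
    index: int
    form: str
    pos: str
    head: int       # index of the parent node; 0 marks the root
    relation: str


def prune_function_words(nodes: List[Node]) -> List[Node]:
    """Drop removable nodes and reattach their children to the nearest kept ancestor."""
    by_index: Dict[int, Node] = {n.index: n for n in nodes}

    def kept_head(head: int) -> int:
        # Climb past removed ancestors until a kept node or the root is reached.
        while head != 0 and by_index[head].pos in REMOVABLE_POS:
            head = by_index[head].head
        return head

    return [
        Node(n.index, n.form, n.pos, kept_head(n.head), n.relation)
        for n in nodes
        if n.pos not in REMOVABLE_POS
    ]


# Toy parse of "The study has led them" (heads and labels are illustrative).
toy_parse = [
    Node(1, "The", "Det", 2, "Mod"),
    Node(2, "study", "N", 4, "Subj"),
    Node(3, "has", "Aux", 4, "Mod"),
    Node(4, "led", "V", 0, "Root"),
    Node(5, "them", "Pron", 4, "Obj"),
]

pruned = prune_function_words(toy_parse)
print([(n.form, n.relation) for n in pruned])
```

Passive and cleft normalization are not covered here; they need construction-specific rules that the slides do not spell out.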

Example of IL0
TrEd, Pajas, 1998
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

Example of IL0
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

Node              POS    Relation
announced         V      Root
Mohamed           PN     Subj
Sheikh            PN     Mod
Defense_Minister  PN     Mod
who               Pron   Subj
also              Adv    Mod
of                P      Mod
UAE               PN     Obj
at                P      Mod
ceremony          N      Obj
inauguration      N      Mod

Details of IL1
Intermediate semantic representation:
–Annotations performed manually by each person alone
–Associate open-class lexical items with Omega Ontology items
–Replace syntactic relations by one of approx. 20 semantic (theta) roles (from Dorr), e.g., AGENT, THEME, GOAL, INSTR…
–No treatment of prepositions, quantification, negation, time, modality, idioms, proper names, NP-internal structure…
Nodes may receive more than one concept
–Average: about 1.2
Manual under development; annotation tool built

Example of IL1
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

Example of IL1: internal representation
The study led them to ask the Czech government to recapitalize CSA at this level.

[3, lead, V, lead, Root, LEAD<GET, GUIDE]
[2, study, N, study, AGENT, SURVEY<WORK, REPORT]
[4, they, N, they, THEME, ---, ---]
[6, ask, V, ask, PROPOSITION, ---, ---]
[9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION]
[8, Czech, Adj, Czech, MOD, CZECH~CZECHOSLOVAKIA, ---]
[11, recapitalize, V, recapitalize, PROP, CAPITALIZE<SUPPLY, INVEST]
[12, csa, N, csa, THEME, AIRLINE<LINE, ---]
[16, at, P, value_at, GOAL, ---, ---]
[15, level, N, level, ---, DEGREE, MEASURE]
[14, this, Det, this, ---, ---, ---]

Figure callouts: Semantic Roles; Concepts from the Omega Ontology
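The bracketed records above are regular enough to load into a small data structure. The following Python sketch assumes a field order inferred from the example itself (index, surface form, part of speech, lemma, role, then up to two concept slots), with `---` marking an empty slot; the `IL1Node` class and `parse_il1_record` function are hypothetical names, and concept strings such as `LEAD<GET` are kept as opaque labels because the slide does not explain the `<` and `~` notation.

```python
from dataclasses import dataclass
from typing import List, Optional

EMPTY = "---"  # placeholder used in the slide for an unfilled slot


@dataclass
class IL1Node:
    index: int
    form: str
    pos: str
    lemma: str
    role: Optional[str]   # theta role or syntactic label; None if unfilled
    concepts: List[str]   # Omega concept labels, kept verbatim


def parse_il1_record(record: str) -> IL1Node:
    """Parse one '[idx, form, POS, lemma, role, concept, concept]' record."""
    fields = [f.strip() for f in record.strip().strip("[]").split(",")]
    if len(fields) < 5:
        raise ValueError(f"too few fields in record: {record!r}")
    index, form, pos, lemma, role, *concepts = fields
    return IL1Node(
        index=int(index),
        form=form,
        pos=pos,
        lemma=lemma,
        role=None if role == EMPTY else role,
        concepts=[c for c in concepts if c != EMPTY],
    )


records = [
    "[3, lead, V, lead, Root, LEAD<GET, GUIDE]",
    "[4, they, N, they, THEME, ---, ---]",
    "[15, level, N, level, ---, DEGREE, MEASURE]",
]
for node in map(parse_il1_record, records):
    print(node.index, node.lemma, node.role, node.concepts)
```

Keeping the concepts as a list reflects the point on the IL1 slide that a node may receive more than one concept.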

Details of IL2 (in development)
Start capturing meaning:
–Handle proper names: one of around 5 classes (PERSON, LOCATION, TIME, ORGANIZATION…)
–Conversives (buy vs. sell) at the FrameNet level
–Non-literal language usage (open the door to customers vs. start doing business)
–Extended paraphrases involving syntax, lexicon, grammatical features
–Possible incorporation of other ‘standardized’ notations for temporal and spatial expressions
Still excluded:
–Quantification and negation
–Discourse structure
–Pragmatics

Omega ontology
Single set of all semantic terms, taxonomized and interconnected
Merger of existing ontologies and other resources:
–Manually built top structure from ISI
–WordNet (110,000 nodes) from Princeton
–Mikrokosmos (6000 nodes) from NMSU
–Penman Upper model (300 nodes) from ISI
–1-million+ instances (people, locations) from ISI
–TAP domain relations from Stanford…
Undergoing constant reconciliation and pruning
Used in several past projects (metadata formation for database integration; MT; QA; summarization)

Dependency parser and Omega ontology
Omega (ISI): 110,000 concepts (WordNet, Mikrokosmos, etc.), 1.1 million instances
URL:
Dependency parser (Prague)

Tiamat: annotation interface
For each new sentence:
Candidate concepts
Step 1: find Omega concepts for objects and events
Step 2: select event frame (theta roles)

Evaluation webpage

Evaluation
Three approaches to evaluation:
–Inter-annotator agreement (completed)
–Sentence generation from extracted annotation structure (to be completed)
–Comparison of interlingual structures, i.e. graph comparisons (not planned)
Inter-annotator agreement: Is the IL sufficiently defined to permit consistent annotation?
–Impacts ontology, theta-roles: coverage and precision
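The slides do not say which agreement statistic the project used. As a hedged illustration only, the sketch below computes two common choices for a pair of annotators who each assign one label per node: raw observed agreement and Cohen's kappa. The function names and the sample theta-role labels are assumptions made for the example, and nodes carrying several concepts would need a set-based variant that is not shown.

```python
from collections import Counter
from typing import Sequence


def observed_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of nodes on which the two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(a)
    p_o = observed_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Illustrative theta-role choices for four nodes (not project data).
annotator_1 = ["AGENT", "THEME", "GOAL", "THEME"]
annotator_2 = ["AGENT", "THEME", "THEME", "THEME"]

print(observed_agreement(annotator_1, annotator_2))
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa is lower than raw agreement here because both annotators use THEME often, so some matches are expected by chance.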

Annotation Issues
1. Post-annotation consistency checking
–Novice annotators may make inconsistent annotations within the same text.
–Intra-annotator consistency checking procedure, e.g. if two nodes in different sentences are co-indexed, then annotators must ensure that the two nodes carry the same meaning in the context of the two different sentences
2. Post-annotation reconciliation
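The intra-annotator check in item 1 can be approximated mechanically before a human reviews the text. The Python sketch below assumes each annotated node is available as a `(sentence_id, node_id, coref_index, concept)` tuple; that layout, the function name, and the sample concept labels are hypothetical and chosen only to illustrate the idea of flagging co-indexed nodes that were given different concepts.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (sentence_id, node_id, coref_index, concept): an assumed layout, not the project's format.
AnnotatedNode = Tuple[int, int, str, str]


def find_inconsistent_chains(nodes: Iterable[AnnotatedNode]) -> Dict[str, List[str]]:
    """Return each co-reference index whose nodes carry more than one concept."""
    concepts_by_index = defaultdict(set)
    for _sentence_id, _node_id, coref_index, concept in nodes:
        concepts_by_index[coref_index].add(concept)
    return {
        index: sorted(concepts)
        for index, concepts in concepts_by_index.items()
        if len(concepts) > 1
    }


# Illustrative annotations: chain "c1" is consistent, chain "c2" is not.
annotations = [
    (1, 12, "c1", "AIRLINE"),
    (3, 4, "c1", "AIRLINE"),
    (1, 9, "c2", "AUTHORITIES"),
    (2, 7, "c2", "GOVERNMENTAL-ORGANIZATION"),
]

print(find_inconsistent_chains(annotations))
```

A flagged chain is not necessarily an error, since the slide asks annotators to judge meaning in context; the check only narrows down where to look.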

2. Post-annotation reconciliation
Question: How much can annotators be brought into agreement?
Procedure:
–Annotator sees all annotations, votes Yes/Maybe/No on each
–Annotators then discuss all differences (telephone conference)
–Annotators then vote again, independently
–We collapse all Yes and Maybe votes, compare them with No to identify all serious disagreement
Result:
–Annotators derive common methodology
–Small errors and oversights removed during discussion
–Inter-annotator agreement improved
–Serious problems of interpretation or error identified
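The final step of the procedure, collapsing Yes and Maybe and comparing against No, can be sketched as follows. This is one reading of the slide rather than the project's actual tooling: it treats an annotation as a serious disagreement when, after collapsing, at least one annotator accepts it and at least one rejects it. The function names and the annotation identifiers are hypothetical.

```python
from typing import Dict, List


def collapse(vote: str) -> str:
    """Map Yes/Maybe to 'accept' and No to 'reject'."""
    normalized = vote.strip().lower()
    if normalized in {"yes", "maybe"}:
        return "accept"
    if normalized == "no":
        return "reject"
    raise ValueError(f"unknown vote: {vote!r}")


def serious_disagreements(votes: Dict[str, List[str]]) -> List[str]:
    """Return ids of annotations whose collapsed votes are split."""
    flagged = []
    for annotation_id, annotator_votes in votes.items():
        if len({collapse(v) for v in annotator_votes}) > 1:
            flagged.append(annotation_id)
    return flagged


# Second-round votes from three annotators (illustrative values).
second_round = {
    "s1-node3-LEAD": ["Yes", "Maybe", "Yes"],
    "s1-node9-AUTHORITIES": ["Yes", "No", "Maybe"],
    "s1-node12-AIRLINE": ["No", "No", "No"],
}

print(serious_disagreements(second_round))
```

Unanimous rejection is not flagged under this reading, because the annotators agree with each other even though they reject the annotation.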

Annotation across Translations
Question: How different are the translations?
Procedure:
–Annotator sees annotations across both translations, identifies differences of form and meaning
–Annotator selects ‘true’ meaning(s)
Results (work still in progress):
–Impacts ontology richness/conciseness
–Improvement in Interlingua representation ‘depth’
–Useful for IL2 design development
Observations:
–This is very hard work
–Methodology unclear: what is seen first, how to show alternatives, what to do with results…

Principal problems to date
Proper nouns
–Proposed solution: automatically tag with one of 6 types (Person, Location, Org, DateTime, etc.)
Noun compounds
–Alternatives: tag head only; parse and tag whole structure
Omega is too rich
–Hard to distinguish from the others
–Granularity of concept selection
Light verbs
–Proposed solution: rephrase to remove light verb if possible (“take a shower” → “shower”, but “take a shower” → ?)
Vagueness and ambiguity
–Annotate all plausible senses (“propose” as Urge and Suggest)
Idioms and metaphors
–Proposed solution: ?

Discussion and conclusion
Results are encouraging
–But more work must be done to solidify them
Outcomes: how have we done?
–IL design: partly, and IL2 in the works
–Annotation methodology, manuals, tools, evals: yes
–Annotated parallel texts: approx. 150 done
  Six texts, two translations, annotators
Next steps
–Foreign language annotation standards and tools
–Development of IL2
–Addressing coverage gaps (1/3 of open class words marked as having no concept)
–Generation of surface structure from deep structure
Is it possible?

Contact information
URLs and Wiki pages:
–Project website: