Semantic Annotation for Interlingual Representation of Multilingual Texts Teruko Mitamura (CMU), Keith Miller (MITRE), Bonnie Dorr (Maryland), David Farwell.

1 Semantic Annotation for Interlingual Representation of Multilingual Texts
Teruko Mitamura (CMU), Keith Miller (MITRE), Bonnie Dorr (Maryland), David Farwell (NMSU), Nizar Habash (Columbia), Stephen Helmreich (NMSU), Eduard Hovy (ISI), Lori Levin (CMU), Owen Rambow (Columbia), Flo Reeder (MITRE), Advaith Siddharthan (Columbia)
LREC 2004 Workshop: “Beyond Named Entity Recognition: Semantic labelling for NLP tasks”

3 IAMTC (Interlingua Annotation of Multilingual Corpora) Project
Collaboration:
– New Mexico State University
– University of Maryland
– Columbia University
– MITRE
– Carnegie Mellon University
– ISI, University of Southern California

4 Goals of IAMTC
Interlingua design
– Three levels of depth
Annotation methodology
– Manuals, tools, evaluations
Annotated multi-parallel texts
– Foreign language original and multiple English translations
– Foreign languages: Arabic, French, Hindi, Japanese, Korean, Spanish

5 Getting at Meaning (two translations of a Korean original text)
Translation 1: Starting on January 1 of next year, SK Telecom subscribers can switch to less expensive LG Telecom or KTF. … The Subscribers cannot switch again to another provider for the first 3 months, but they can cancel the switch in 14 days if they are not satisfied with services like voice quality.
Translation 2: Starting January 1st of next year customers of SK Telecom can change their service company to LG Telecom or KTF … Once a service company swap has been made, customers are not allowed to change companies again within the first three months, although they can cancel the change anytime within 14 days if problems such as poor call quality are experienced.

6 Color Key
– Black: same meaning and same expression
– Green: small syntactic difference
– Blue: lexical difference
– Red: not contained in the other text
– Purple: larger difference; some inference is needed to know that the meaning is the same

7 Getting at Meaning (two translations of a Japanese original text)
Translation 1: This year, too, in addition to the birth of Mitsubishi Chemical, which has already been announced, other rather large-scale mergers may continue, and be recorded as a "year of mergers."
Translation 2: This year, which has already seen the announcement of the birth of Mitsubishi Chemical Corporation as well as the continuous numbers of big mergers, may too be recorded as the "year of the merger" for all we know.
Observations: more lexical similarity; more differences in dependency relations.

8 Toward a ‘Theory of Annotation’
Recently, a sharp increase in the number of annotated resources being built:
– Penn Treebank, PropBank, many others…
For annotation, need:
– Theory behind phenomena being annotated (for)
– Annotation termsets (even WordNet, FrameNet, VerbNet, HowNet…)
– Standard (?) annotation corpus (same old Treebank?)
– Annotation tools: they make an immense difference
– Carefully considered annotation procedure (interleaving per text vs. per sentence, etc.)
– Reconciliation and consistency checking procedures
– Evaluation measures, appropriately defined

9 Corpus and Data
Initial Corpus
– 10+ texts in each language
– 2+ translations each into English
Interlingua designed for MT
– Multiple English translations of the same source show translation divergences.
Some phenomena:
– Lexical level: word changes
– Syntactic level: phrasing, thematization, nominalization
– Semantic level: additional/different content
– Discourse level: multi-clause structure, anaphor
– Pragmatic level: Speech Acts, implicatures, style, interpersonal
Causes of divergence
– Genuine ambiguity/vagueness of source meaning
– Translator error/reinterpretation

10 IL Development: Staged, deepening
IL0: simple dependency tree gives structure
IL1: semantic annotations for Nouns, Verbs, Adjs, Advs, and Theta Roles
– Not yet ‘semantic’: “buy” ≠ “sell”, many remaining simplifications
– Concept ‘senses’ from ISI’s Omega ontology
– Theta Roles from Dorr’s LCS work
– Elaborate annotation manuals
– Tiamat annotation interface
– Post-annotation reconciliation process and interface
– Evaluation scores: annotator agreement
IL2: that comes next…

11 Details of IL0
Deep syntactic dependency representation:
– Removes auxiliary verbs, determiners, and some function words
– Normalizes passives, clefts, etc.
– Includes syntactic roles (Subj, Obj)
Construction:
– Dependency parsed using Connexor (English)
– Tapanainen and Jarvinen, 1997
– Hand-corrected
Extensive manual and instructions on IAMTC Wiki website

12 Example of IL0
Tree display: TrEd (Pajas, 1998)
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

13 Example of IL0
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”
Nodes (word, part of speech, syntactic role; the tree indentation is not preserved in this transcript):
announced V Root
Mohamed PN Subj
Sheikh PN Mod
Defense_Minister PN Mod
who Pron Subj
also Adv Mod
of P Mod
UAE PN Obj
at P Mod
ceremony N Obj
inauguration N Mod
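
The flattened IL0 listing on this slide reads as a sequence of word / part-of-speech / role triples. A minimal sketch of regrouping them is given here; the class name `IL0Node` and the function `parse_il0_listing` are hypothetical, not part of any IAMTC tool. Head attachment cannot be recovered from the transcript, so the sketch keeps only the flat triples.

```python
from typing import List, NamedTuple


class IL0Node(NamedTuple):
    """One IL0 node as it appears in the flattened listing (hypothetical type)."""
    word: str
    pos: str
    relation: str


def parse_il0_listing(listing: str) -> List[IL0Node]:
    """Group a whitespace-separated listing into (word, POS, relation) triples."""
    tokens = listing.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected a sequence of word/POS/relation triples")
    return [IL0Node(*tokens[i:i + 3]) for i in range(0, len(tokens), 3)]


# Listing copied from the slide; multiword names are already joined with "_".
LISTING = (
    "announced V Root Mohamed PN Subj Sheikh PN Mod "
    "Defense_Minister PN Mod who Pron Subj also Adv Mod "
    "of P Mod UAE PN Obj at P Mod ceremony N Obj inauguration N Mod"
)

nodes = parse_il0_listing(LISTING)
subjects = [n.word for n in nodes if n.relation == "Subj"]
```

A full IL0 structure would also need a head pointer per node, which the slide showed graphically.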

14 Details of IL1
Intermediate semantic representation:
– Annotations performed manually by each person alone
  Associate open-class lexical items with Omega Ontology items
  Replace syntactic relations by one of approx. 20 semantic (theta) roles (from Dorr), e.g., AGENT, THEME, GOAL, INSTR…
– No treatment of prepositions, quantification, negation, time, modality, idioms, proper names, NP-internal structure…
Nodes may receive more than one concept
– Average: about 1.2
Manual under development; annotation tool built

15 Example of IL1
Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that “we want to make Dubai a new trading center”

16 Example of IL1: internal representation
The study led them to ask the Czech government to recapitalize CSA at this level.
[3, lead, V, lead, Root, LEAD<GET, GUIDE]
[2, study, N, study, AGENT, SURVEY<WORK, REPORT]
[4, they, N, they, THEME, ---, ---]
[6, ask, V, ask, PROPOSITION, ---, ---]
[9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION]
[8, Czech, Adj, Czech, MOD, CZECH~CZECHOSLOVAKIA, ---]
[11, recapitalize, V, recapitalize, PROP, CAPITALIZE<SUPPLY, INVEST]
[12, csa, N, csa, THEME, AIRLINE<LINE, ---]
[16, at, P, value_at, GOAL, ---, ---]
[15, level, N, level, ---, DEGREE, MEASURE]
[14, this, Det, this, ---, ---, ---]
Slide callouts: Semantic Roles; Concepts from the Omega Ontology
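
The bracketed records on this slide can be read mechanically. The sketch below assumes each record has seven comma-separated fields: position, word, part of speech, lemma, semantic role, and two concept slots, with "---" marking an empty slot. Those field names are an interpretation of the slide, and `IL1Node` and `parse_il1` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IL1Node:
    index: int            # token position in the sentence (assumed meaning)
    word: str
    pos: str
    lemma: str            # fourth field; "value_at" for "at" on the slide
    role: Optional[str]   # semantic role, None when the slide shows "---"
    concepts: List[str]   # Omega concepts, empty slots dropped


def parse_il1(text: str) -> List[IL1Node]:
    """Parse every [a, b, c, d, e, f, g] record found in the text."""
    nodes = []
    for body in re.findall(r"\[([^\]]*)\]", text):
        fields = [f.strip() for f in body.split(",")]
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {body!r}")
        index, word, pos, lemma, role, *concepts = fields
        nodes.append(IL1Node(
            index=int(index), word=word, pos=pos, lemma=lemma,
            role=None if role == "---" else role,
            concepts=[c for c in concepts if c != "---"],
        ))
    return nodes


SLIDE_TEXT = """
[3, lead, V, lead, Root, LEAD<GET, GUIDE]
[2, study, N, study, AGENT, SURVEY<WORK, REPORT]
[4, they, N, they, THEME, ---, ---]
[6, ask, V, ask, PROPOSITION, ---, ---]
[9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION]
[8, Czech, Adj, Czech, MOD, CZECH~CZECHOSLOVAKIA, ---]
[11, recapitalize, V, recapitalize, PROP, CAPITALIZE<SUPPLY, INVEST]
[12, csa, N, csa, THEME, AIRLINE<LINE, ---]
[16, at, P, value_at, GOAL, ---, ---]
[15, level, N, level, ---, DEGREE, MEASURE]
[14, this, Det, this, ---, ---, ---]
"""

il1_nodes = parse_il1(SLIDE_TEXT)
unannotated = [n.word for n in il1_nodes if not n.concepts]
```

Listing the nodes with no concept, as `unannotated` does, is one simple way to surface the coverage gaps the deck mentions later.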

17 Details of IL2 (in development)
Start capturing meaning:
– Handle proper names: one of around 5 classes (PERSON, LOCATION, TIME, ORGANIZATION…)
– Conversives (buy vs. sell) at the FrameNet level
– Non-literal language usage (open the door to customers vs. start doing business)
– Extended paraphrases involving syntax, lexicon, grammatical features
– Possible incorporation of other ‘standardized’ notations for temporal and spatial expressions
Still excluded:
– Quantification and negation
– Discourse structure
– Pragmatics

18 Omega ontology
Single set of all semantic terms, taxonomized and interconnected (http://omega.isi.edu)
Merger of existing ontologies and other resources:
– Manually built top structure from ISI
– WordNet (110,000 nodes) from Princeton
– Mikrokosmos (6000 nodes) from NMSU
– Penman Upper Model (300 nodes) from ISI
– 1-million+ instances (people, locations) from ISI
– TAP domain relations from Stanford…
Undergoing constant reconciliation and pruning
Used in several past projects (metadata formation for database integration; MT; QA; summarization)

19 Dependency parser and Omega ontology
Omega (ISI): 110,000 concepts (WordNet, Mikrokosmos, etc.), 1.1 million instances. URL: http://omega.isi.edu
Dependency parser (Prague)

20 Tiamat: annotation interface
For each new sentence:
– Step 1: find Omega concepts for objects and events
– Step 2: select event frame (theta roles)
Interface panel label: Candidate concepts

21 Evaluation webpage

22 Evaluation
Three approaches to evaluation:
– Inter-annotator agreement: completed
– Sentence generation from extracted annotation structure: to be completed
– Comparison of interlingual structures (graph comparisons): not planned
Inter-annotator agreement: Is the IL sufficiently defined to permit consistent annotation?
– Impacts ontology, theta-roles: coverage and precision
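
The slide does not say which agreement measure was used. As an illustration only, the sketch below computes mean pairwise agreement: the fraction of nodes on which two annotators chose the same label, averaged over all annotator pairs. The function name and the sample labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_agreement(annotations: Dict[str, List[str]]) -> float:
    """Mean over annotator pairs of the share of items labelled identically."""
    pairs = list(combinations(sorted(annotations), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    total = 0.0
    for a, b in pairs:
        labels_a, labels_b = annotations[a], annotations[b]
        if not labels_a or len(labels_a) != len(labels_b):
            raise ValueError("annotators must label the same non-empty item list")
        matches = sum(x == y for x, y in zip(labels_a, labels_b))
        total += matches / len(labels_a)
    return total / len(pairs)


# Hypothetical theta-role choices by three annotators for four nodes.
sample = {
    "annotator_1": ["AGENT", "THEME", "GOAL", "THEME"],
    "annotator_2": ["AGENT", "THEME", "GOAL", "GOAL"],
    "annotator_3": ["AGENT", "THEME", "THEME", "GOAL"],
}
agreement = pairwise_agreement(sample)
```

Raw agreement does not correct for chance, and it treats each node as having a single label; since IL1 nodes may carry more than one concept, a set-overlap or chance-corrected measure might be preferable in practice.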

23 Annotation Issues
1. Post-annotation consistency checking
– Novice annotators may make inconsistent annotations within the same text.
– Intra-annotator consistency checking procedure, e.g.: if two nodes in different sentences are co-indexed, then annotators must ensure that the two nodes carry the same meaning in the context of the two different sentences
2. Post-annotation reconciliation
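
The co-indexing check described on this slide could be partly automated. The sketch below is one possible reading: "same meaning" is approximated as "same set of concepts", which is an assumption, and the identifiers and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

# (co-index id, sentence number, concepts chosen for the node)
Annotation = Tuple[str, int, Iterable[str]]


def inconsistent_coindexed(
    annotations: Iterable[Annotation],
) -> Dict[str, List[Tuple[int, FrozenSet[str]]]]:
    """Return co-indexed groups whose nodes do not all carry the same concepts."""
    by_index: Dict[str, List[Tuple[int, FrozenSet[str]]]] = defaultdict(list)
    for coindex, sentence, concepts in annotations:
        by_index[coindex].append((sentence, frozenset(concepts)))
    return {
        coindex: occurrences
        for coindex, occurrences in by_index.items()
        if len({concepts for _, concepts in occurrences}) > 1
    }


# Hypothetical annotations by a single annotator across three sentences.
sample_annotations = [
    ("e1", 1, ["AIRLINE<LINE"]),
    ("e1", 3, ["AIRLINE<LINE"]),
    ("e2", 1, ["AUTHORITIES"]),
    ("e2", 2, ["GOVERNMENTAL-ORGANIZATION"]),
]
flagged = inconsistent_coindexed(sample_annotations)
```

A flagged group is only a prompt for the annotator to look again; differing concept sets may still be correct in context.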

24 2. Post-annotation reconciliation
Question: How much can annotators be brought into agreement?
Procedure:
– Annotator sees all annotations, votes Yes/Maybe/No on each
– Annotators then discuss all differences (telephone conf)
– Annotators then vote again, independently
– We collapse all Yes and Maybe votes, compare them with No to identify all serious disagreement
Result:
– Annotators derive common methodology
– Small errors and oversights removed during discussion
– Inter-annotator agreement improved
– Serious problems of interpretation or error identified
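
The final step of the reconciliation procedure, collapsing Yes and Maybe and comparing them with No, can be sketched as follows. Treating an item as a "serious disagreement" when it receives both an accepting vote and a No is an interpretation of the slide, and the item keys are hypothetical.

```python
from typing import Dict, List


def collapse(vote: str) -> bool:
    """Map Yes/Maybe to True (accept) and No to False (reject)."""
    normalized = vote.strip().lower()
    if normalized in ("yes", "maybe"):
        return True
    if normalized == "no":
        return False
    raise ValueError(f"unknown vote: {vote!r}")


def serious_disagreements(votes: Dict[str, List[str]]) -> List[str]:
    """Items on which at least one annotator accepts and at least one rejects."""
    flagged = []
    for item, item_votes in votes.items():
        if len({collapse(v) for v in item_votes}) == 2:
            flagged.append(item)
    return flagged


# Hypothetical second-round votes from three annotators.
second_round = {
    "node3:LEAD<GET": ["Yes", "Maybe", "Yes"],
    "node3:GUIDE": ["Yes", "No", "Maybe"],
    "node12:AIRLINE<LINE": ["No", "No", "No"],
}
disputed = serious_disagreements(second_round)
```

Under this reading a unanimous No is not a disagreement: the annotators agree that the annotation should go.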

25 Annotation across Translations
Question: How different are the translations?
Procedure:
– Annotator sees annotations across both translations, identifies differences of form and meaning
– Annotator selects ‘true’ meaning(s)
Results (work still in progress):
– Impacts ontology richness/conciseness
– Improvement in Interlingua representation ‘depth’
– Useful for IL2 design development
Observations:
– This is very hard work
– Methodology unclear: what is seen first, how to show alternatives, what to do with results…

26 Principal problems to date
Proper nouns
– Proposed solution: automatically tag with one of 6 types (Person, Location, Org, DateTime, etc.)
Noun compounds
– Alternatives: tag head only; parse and tag whole structure
Omega is too rich
– Hard to distinguish one concept from the others
– Granularity of concept selection
Light verbs
– Proposed solution: rephrase to remove light verb if possible (“take a shower” → “shower”, but “take a shower” → ?)
Vagueness and ambiguity
– Annotate all plausible senses (“propose” as Urge and Suggest)
Idioms and metaphors
– Proposed solution: ?

27 Discussion and conclusion
Results are encouraging
– But more work must be done to solidify them
Outcomes: how have we done?
– IL design: partly, and IL2 in the works
– Annotation methodology, manuals, tools, evals: yes
– Annotated parallel texts: approx. 150 done (six texts, two translations, 10-12 annotators)
Next steps
– Foreign language annotation standards and tools
– Development of IL2
– Addressing coverage gaps (1/3 of open class words marked as having no concept)
– Generation of surface structure from deep structure: is it possible?

28 Contact information
URLs and Wiki pages:
– Project website: http://aitc.aitcnet.org/nsf/iamtc/

