Semi-automatic Annotation of the Romanian TimeBank 1.2

1 Semi-automatic Annotation of the Romanian TimeBank 1.2. CALP07 workshop @ RANLP. Corina Forăscu, Radu Ion, Dan Tufiş. Faculty of Computer Science, Al.I. Cuza University of Iasi, Romania & Research Institute for Artificial Intelligence of the Romanian Academy. corinfor@info.uaic.ro, {radu, tufis}@racai.ro

2 Outline. 1. Fundamentals; 2. TimeML & TimeBank; 3. Corpus processing (3.1 translation, 3.2 pre-processing, 3.3 alignment, 3.4 annotation import); 4. Conclusions

3 Fundamentals. Temporal information in natural language: 1. Time-denoting expressions: references to a calendar or clock system, expressed by NPs, PPs, or AdvPs (the 23rd of May, 1998; Monday; tomorrow; the second semester). 2. Event-denoting expressions: references to an event, expressed by (a) sentences, more precisely their syntactic head, the main verb (John listens to the music.); (b) noun phrases (Israel will ask the USA to delay a military strike against Iraq.)

4 Motivation (1). NLP applications that benefit: lexicon induction and linguistic investigation using very large annotated corpora; question answering (questions like when, how often or how long); information extraction or information retrieval; machine translation (translated and normalized temporal references; mappings between the different behavior of tenses from language to language); discourse processing (temporal structure of discourse and summarization).

5 Motivation (2). Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi. (Now he realised that exactly because of this incident he had suddenly decided to come home and to begin his journal exactly today.)

6 Motivation (3). Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.

7 State of the Art
1947  Reichenbach: The tenses of verbs
1998  MUC 7
2000  TIMEX
2001  STAG (Setzer); TIDES 2001: TIMEX2 v.1.0.2; ACL: Temporal and Spatial Information Processing
2002  LREC 2002: Annotation Standards for Temporal Information in Natural Language; DAML-Time; TERQAS: TimeML v.1.0
2004  ACE – TERN: TIMEX2 v.1.1; TARSQI: TimeML v.1.2
2005  ACE – TERN: TIMEX2 v.1.2; ACL 2005: TARSQI system
2006  Time Symposium; ACL-COLING WS: ARTE (Annotating and Reasoning about Time and Events)

8 TERQAS 2002. TimeML v.1.0: a metadata standard for marking events, their temporal anchoring, and links in news articles; TimeBank corpus v.1.0; guidelines for temporal annotation.

9 Outline. 1. Fundamentals; 2. TimeML & TimeBank; 3. Corpus processing (3.1 translation, 3.2 pre-processing, 3.3 alignment, 3.4 annotation import); 4. Conclusions

10 TimeML v.1.2. A metadata standard developed especially for news articles, for marking: events (EVENT, MAKEINSTANCE); the temporal anchoring of events (TIMEX3, SIGNAL); links between events and/or timexes (TLINK, ALINK, SLINK).

11 Events (1). Situations that happen or occur; states or circumstances in which something obtains or holds true. Expressed by tensed verbs, adjectives, nominalizations. Example: The oat-bran craze [e190] has cost [e189] the world's largest cereal maker market share. 7 classes of EVENTs: OCCURRENCE, PERCEPTION, REPORTING, ASPECTUAL, STATE, I_STATE, I_ACTION.

12 Events (2). The oat-bran craze [e190] has cost [e189] the world's largest cereal maker market share. Analysts say [e28] much of Kellogg's erosion [e204] has been in such core brands as Corn Flakes, ...

13 Instances. Based on the event annotation: how many different instances or realizations a given event has (at least one). An instance carries the tense and aspect of the verb-denoted event. Example: John learns [e1] twice on Monday.

14 Temporal expressions: TIMEX3 (1). Explicit & implicit temporal expressions. Times: 11 o'clock; midnight. Dates: fully specified (May 23, 2006; winter, 2005) or underspecified (Monday; next week; last month; two years ago). Durations: two months; three hours. Sets: every week; every Tuesday.

15 Temporal expressions: TIMEX3 (2). Examples: 10/30/89; the next two years or so; soon.
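
As an illustration of what a TIMEX3 annotation adds to the first example on this slide, the sketch below wraps 10/30/89 in a minimal TIMEX3 element with a normalized ISO value. It assumes only the core attributes tid, type and value (the real TimeBank markup carries more), takes the ID t192 from the TLINK slides, where it appears next to this date, and uses a hypothetical helper name, make_timex3.

```python
from datetime import datetime
import xml.etree.ElementTree as ET


def make_timex3(tid: str, text: str, fmt: str = "%m/%d/%y") -> ET.Element:
    """Wrap a US-style date string in a minimal TIMEX3 element (illustrative)."""
    # strptime maps two-digit years 69-99 to 1969-1999, so "89" becomes 1989.
    iso_value = datetime.strptime(text, fmt).date().isoformat()
    elem = ET.Element("TIMEX3", {"tid": tid, "type": "DATE", "value": iso_value})
    elem.text = text
    return elem


timex = make_timex3("t192", "10/30/89")
print(ET.tostring(timex, encoding="unicode"))
```

Underspecified expressions such as "soon" or "the next two years or so" cannot be normalized this way; they need the document creation time and further attributes, which this sketch leaves out.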

16 Temporal signals: SIGNAL. Function words that indicate how temporal objects are to be related to each other: temporal prepositions (on, in, at, from, to, before, after, during), conjunctions (before, after, while, when) and/or modifiers; negative expressions; modal verbs; prepositions signaling modality ("to"); special characters denoting ranges in temporal expressions ("-" and "/").

17 Dependencies: LINKs. Temporal relations (TLINK): anchors to time; orders between times and events. Aspectual relations (ALINK): phases of an event. Subordinating relations (SLINK): events that syntactically subordinate other events.

18 Temporal relations: TLINK (1). A temporal relation between two temporal elements (event-event, event-timex); EVENTs take part through their INSTANCEs. 13 relTypes, as in Allen's relations: simultaneous; identical; one before (/after) the other; one immediately before (/after) the other; one including / being included in the other; one holding during the duration of the other; one being the beginning (/ending) of the other; one being begun (/ended) by the other.
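
The thirteen relations listed on this slide can be written down as a small lookup table. The sketch below is only an illustration: the relType strings are recalled from the TimeML specification rather than taken from the slides, so they should be checked against the specification, and the names REL_TYPES, CONVERSE and converse are hypothetical.

```python
# Slide wording -> TimeML relType value(s). The strings are an assumption
# (recalled from the TimeML specification, not shown on the slides).
REL_TYPES = {
    "simultaneous": ("SIMULTANEOUS",),
    "identical": ("IDENTITY",),
    "before / after": ("BEFORE", "AFTER"),
    "immediately before / after": ("IBEFORE", "IAFTER"),
    "including / being included": ("INCLUDES", "IS_INCLUDED"),
    "holding during": ("DURING",),
    "beginning / ending of": ("BEGINS", "ENDS"),
    "begun / ended by": ("BEGUN_BY", "ENDED_BY"),
}

ALL_REL_TYPES = [rel for rels in REL_TYPES.values() for rel in rels]

# Converse pairs: if A rel B holds, then B CONVERSE[rel] A holds.
_PAIRS = [
    ("BEFORE", "AFTER"),
    ("IBEFORE", "IAFTER"),
    ("INCLUDES", "IS_INCLUDED"),
    ("BEGINS", "BEGUN_BY"),
    ("ENDS", "ENDED_BY"),
]
CONVERSE = {}
for left, right in _PAIRS:
    CONVERSE[left] = right
    CONVERSE[right] = left
# Symmetric relations are their own converse.
for symmetric in ("SIMULTANEOUS", "IDENTITY"):
    CONVERSE[symmetric] = symmetric


def converse(rel_type):
    """Return the converse relType, or None when this list has no converse."""
    return CONVERSE.get(rel_type)


print(len(ALL_REL_TYPES), converse("BEGINS"), converse("DURING"))
```

DURING is left without a converse because the slide lists no counterpart for it.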

19 Temporal relations: TLINK (2). The oat-bran craze [e190/ei1994] has cost [e189/ei1995] the world's largest cereal maker market share. The company's president quit [e3/ei1996] suddenly. [Figure: diagram linking craze (ei1994), cost (ei1995), quit (ei1996) and 10/30/89 (t192)]

20 Temporal relations: TLINK (3). [Figure: diagram linking craze (ei1994), cost (ei1995), quit (ei1996) and 10/30/89 (t192)]

21 Aspectual relations: ALINK. The relationship between an aspectual event and its argument event. Initiation: John started [ei5] to read [ei6]. Culmination: John finished [ei5] assembling [ei6] the table. Termination: John stopped talking. Continuation: John kept talking.

22 Subordination relations: SLINK. For contexts introducing relations between two events, of type: Modal: John should have bought some wine. Factive: John forgot that he was in Boston yesterday. Counterfactive: John prevented the divorce. Evidential: John said he bought some wine. Negative evidential: John denied he bought only beer. Conditional: If John leaves today, Mary will cry.

23 TimeBank 1.2. 183 English news report documents, TimeML annotated, distributed through LDC; 4715 sentences with 10586 unique lexical units, from a total of 61042 lexical units. Non-TimeML markup in TimeBank 1.1: structure information (header), named entity recognition, sentence boundary information.

24 TimeBank 1.2
TimeML tag   count
events        7935
instances     7940
timexes       1414
signals        688
alinks         265
slinks        2932
tlinks        6418
TOTAL        27592

25 Outline. 1. Fundamentals; 2. TimeML & TimeBank; 3. Corpus processing (3.1 translation, 3.2 pre-processing, 3.3 alignment, 3.4 annotation import); 4. Conclusions

26 Translation. 2 "trained translators"; one final correction. Translation desiderata: 1-1 sentence alignment; POS preserved; verb tense mapped onto Romanian; format of dates, moments of the day and numbers conforming to the norms of written Romanian. Result: 4715 sentences (translation units), 65375 lexical tokens, including punctuation marks, representing 12640 lexical types.

27 Preprocessing the corpus. Tokenisation: MtSeg, with idiomatic expressions and clitic splitting. POS-tagging: TnT, adapted & improved to determine the POS of unknown words. Lemmatisation: probabilistic, based on a lexicon. Chunking: REs over POS tags to determine non-recursive NPs, APs, AdvPs, PPs.

28 Alignment. YAWA: 4 stages, evaluated over the data in the Shared Task on Word Alignment, Romanian-English track, organized at ACL 2005. Current: P = 88.80%, R = 74.83%, F = 81.22%. 91714 alignments, manually checked, out of which 25346 are NULL-alignments.

29 Alignment. 1. Content words alignment: based on the translation lexicons; P = 94.08%, R = 34.99%, F = 51.00%. 2. Inside-chunks alignment: simple empirical rules to align the words within the corresponding chunks; P = 89.90%, R = 53.90%, F = 67.40%. 3. Alignment in contiguous sequences of unaligned words: using the POS-affinities of the unaligned words and their relative positions. 4. Correction phase: the wrong links, introduced mainly in stage 3, are removed.
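
The F values on the two alignment slides are consistent with the usual balanced F-measure, F = 2PR / (P + R). The sketch below recomputes them from the reported precision and recall. It assumes that this is the formula the authors used; the function and dictionary names are illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) in percent, as given on the slides.
REPORTED = {
    "YAWA, all stages": (88.80, 74.83, 81.22),
    "stage 1, content words": (94.08, 34.99, 51.00),
    "stage 2, inside chunks": (89.90, 53.90, 67.40),
}

for name, (p, r, f_reported) in REPORTED.items():
    print(f"{name}: computed F = {f_measure(p, r):.2f}, reported F = {f_reported:.2f}")
```

The recomputed values agree with the slides to within about 0.01, a difference that is expected when P and R are themselves rounded. The jump in recall from stage 1 to stage 2, at a small cost in precision, is what drives the F gain.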

30 Alignment

31 Alignment. The parallel corpus = 183 files in XCES format. Example sentence pair: Pe_de_altă_parte, se dovedeşte a fi altă săptămână financiară foarte proastă … / On_the_other_hand, it 's turning out to be another very bad financial week …

32 Annotation import. Based on the Romanian-English lexical alignment.

33 Annotation import. For every pair of sentences S_ro and S_en from the TimeBank parallel corpus, with T_en the equivalent English sentence: 1. construct a list E of pairs of English text fragments with sequences of English indexes from S_en and T_en. E = {…}.

34 Annotation import. 2. Add to every element of E the XML context in which that text fragment appeared in the original English TimeBank. E' = {…}. 3. Construct the list RW of Romanian words along with the transferred XML contexts, using E' and the lexical alignment between S_ro and S_en. If a word in S_ro is not aligned, the top context for it, namely s, is considered. RW = {…}.

35 Annotation import. 4. Construct the final list R of Romanian text fragments from RW by conflating adjacent elements of RW that appear in the same XML context. Output the list in XML format.
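
Steps 3 and 4 of the import procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: XML contexts are represented as plain strings such as "s/EVENT[eid=e3]" (a made-up notation), a Romanian word aligned to several English words takes the context of the lowest-indexed one (an arbitrary choice), and the Romanian sentence and its alignment are invented for the example rather than taken from the corpus.

```python
from itertools import groupby

TOP_CONTEXT = "s"  # context given to unaligned Romanian words (step 3)


def transfer_contexts(ro_words, en_contexts, alignment):
    """Project English XML contexts onto Romanian words, then merge runs.

    ro_words:    Romanian tokens of S_ro, in order
    en_contexts: {English token index: XML context string}, i.e. E' flattened
    alignment:   iterable of (ro_index, en_index) lexical alignment pairs
    """
    # Step 3: each Romanian word inherits the context of an aligned English
    # word. With several aligned words, the lowest index wins (assumption).
    first_en = {}
    for ro_i, en_i in sorted(alignment):
        first_en.setdefault(ro_i, en_i)
    rw = [
        (word, en_contexts.get(first_en.get(i), TOP_CONTEXT))
        for i, word in enumerate(ro_words)
    ]
    # Step 4: conflate adjacent words that share the same context.
    return [
        (" ".join(word for word, _ in run), context)
        for context, run in groupby(rw, key=lambda pair: pair[1])
    ]


# English: The(0) company(1) 's(2) president(3) quit(4) suddenly(5) .(6)
en_contexts = {0: "s", 1: "s", 2: "s", 3: "s", 4: "s/EVENT[eid=e3]", 5: "s", 6: "s"}
# Illustrative Romanian rendering and alignment (not corpus data).
ro_words = ["Preşedintele", "companiei", "a", "demisionat", "brusc", "."]
alignment = [(0, 0), (0, 3), (1, 1), (1, 2), (2, 4), (3, 4), (4, 5), (5, 6)]

fragments = transfer_contexts(ro_words, en_contexts, alignment)
for text, context in fragments:
    print(f"{context}: {text}")
```

In this example the single English event word "quit" maps onto the two-word Romanian verb group "a demisionat", which step 4 merges into one fragment because both words carry the same context.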

36 Annotation import. Offline markup (MAKEINSTANCE, ALINK, TLINK and SLINK tags): the transfer kept only those XML tags from the English version whose IDs belong to XML structures that have been transferred to Romanian.
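
The filtering rule for the offline markup can be sketched as below. The sketch assumes a simplified representation: each MAKEINSTANCE is a dict with eiid and eventID, and each link is a dict whose hypothetical "refs" field lists every ID it points to (real TimeML links use several distinct attributes instead). All IDs are made up for illustration.

```python
def filter_offline_markup(instances, links, transferred_events, transferred_timexes):
    """Keep only offline tags whose referenced IDs survived the transfer."""
    # A MAKEINSTANCE is kept when its EVENT was transferred to Romanian.
    kept_instances = [mi for mi in instances if mi["eventID"] in transferred_events]
    # Links may point to event instances or to timexes.
    valid_ids = {mi["eiid"] for mi in kept_instances} | set(transferred_timexes)
    # A link is kept only if every ID it references is still present.
    kept_links = [ln for ln in links if all(ref in valid_ids for ref in ln["refs"])]
    return kept_instances, kept_links


# Hypothetical IDs: event e2 was lost during the transfer.
instances = [
    {"eiid": "ei1", "eventID": "e1"},
    {"eiid": "ei2", "eventID": "e2"},
]
links = [
    {"lid": "l1", "tag": "TLINK", "refs": ("ei1", "t1")},
    {"lid": "l2", "tag": "TLINK", "refs": ("ei1", "ei2")},
]

kept_instances, kept_links = filter_offline_markup(
    instances, links, transferred_events={"e1"}, transferred_timexes={"t1"}
)
print([mi["eiid"] for mi in kept_instances], [ln["lid"] for ln in kept_links])
```

Dropping an event therefore also drops its instance and every link that mentions that instance, so a lost event costs more than one tag in the transfer counts.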

37 Annotation import
TimeML tag   transferred   % transferred
events            7703           97.07
instances         7706           97.05
timexes           1356           95.89
signals            668           97.09
alinks             249           93.96
slinks            2831           96.55
tlinks            6122           95.38
TOTAL            26635           96.53
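
The percentages on this slide can be re-derived by dividing the transferred counts by the English TimeBank 1.2 counts given on slide 24. In the sketch below the counts are copied from the two slides, while the helper name and the use of truncation are assumptions: the reported figures match when the ratio is truncated, rather than rounded, to two decimals.

```python
# Tag counts in the English TimeBank 1.2 (slide 24).
ORIGINAL = {
    "events": 7935, "instances": 7940, "timexes": 1414, "signals": 688,
    "alinks": 265, "slinks": 2932, "tlinks": 6418, "TOTAL": 27592,
}
# Tags transferred to Romanian and the reported percentages (slide 37).
TRANSFERRED = {
    "events": 7703, "instances": 7706, "timexes": 1356, "signals": 668,
    "alinks": 249, "slinks": 2831, "tlinks": 6122, "TOTAL": 26635,
}
REPORTED_PCT = {
    "events": "97.07", "instances": "97.05", "timexes": "95.89", "signals": "97.09",
    "alinks": "93.96", "slinks": "96.55", "tlinks": "95.38", "TOTAL": "96.53",
}


def percent_truncated(part: int, whole: int) -> str:
    """Percentage with two decimals, truncated; integer arithmetic only."""
    hundredths = part * 10000 // whole
    return f"{hundredths // 100}.{hundredths % 100:02d}"


for tag, total in ORIGINAL.items():
    computed = percent_truncated(TRANSFERRED[tag], total)
    print(f"{tag:<10} {TRANSFERRED[tag]:>6} / {total:>6} = {computed}% (slide: {REPORTED_PCT[tag]}%)")
```

Both TOTAL rows equal the sum of the seven tag rows (27592 and 26635), so 957 tags were not carried over.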

38 Conclusions & future work. Improve & evaluate the annotation transfer; study the adequacy of temporal theories to Romanian; (semi-)automatic mark-up of the temporal information in Romanian texts (news + literature).

39 Thank you! (Temporal) Questions?

