
1 Exploring PropBanks for English and Hindi Ashwini Vaidya Dept of Linguistics University of Colorado, Boulder

2 Why is semantic information important? Imagine an automatic question answering system: Who created the first effective polio vaccine? Two possible choices: – Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine – The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

3 Question Answering Who created the first effective polio vaccine? – Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine – The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

4 Question Answering Who created the first effective polio vaccine? – [Becton Dickinson] created the [first disposable syringe] for use with the mass administration of the first effective polio vaccine – [The first effective polio vaccine] was created in 1952 by [Jonas Salk] at the University of Pittsburgh

5 Question Answering Who created the first effective polio vaccine? – [Becton Dickinson agent] created the [first disposable syringe theme] for use with the mass administration of the first effective polio vaccine – [The first effective polio vaccine theme] was created in 1952 by [Jonas Salk agent] at the University of Pittsburgh

6 Question Answering We need semantic information to prefer the right answer: the theme of create should be the first effective polio vaccine, but the theme in the first sentence is the first disposable syringe, so we can filter out the wrong answer.
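
To make the filtering idea concrete, here is a minimal Python sketch. The candidate structures and the function name are invented for illustration; they are not the output of any particular semantic role labeller.

```python
# Hypothetical role-labelled candidates for the question
# "Who created the first effective polio vaccine?"
candidates = [
    {"predicate": "create",
     "agent": "Becton Dickinson",
     "theme": "the first disposable syringe"},
    {"predicate": "create",
     "agent": "Jonas Salk",
     "theme": "the first effective polio vaccine"},
]

def answer_who_created(target_theme, candidates):
    """Keep only candidates whose theme matches the entity in the question,
    and return their agent as the answer."""
    return [c["agent"] for c in candidates
            if c["predicate"] == "create"
            and c["theme"].lower() == target_theme.lower()]

print(answer_who_created("the first effective polio vaccine", candidates))
# -> ['Jonas Salk']
```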

7 We need semantic information
– To find out about events and their participants
– To capture semantic information across syntactic variation

8 Semantic information Semantic information about verbs and their participants is expressed through semantic roles: Agent, Experiencer, Theme, Result, etc. However, it is difficult to settle on a standard set of thematic roles.

9 Proposition Bank The Proposition Bank (PropBank) provides a way to carry out general-purpose semantic role labelling. A PropBank is a large annotated corpus of predicate-argument information: a set of semantic roles is defined for each verb, and a syntactically parsed corpus is then tagged with verb-specific semantic role information.

10 Outline
– English PropBank: background; annotation; frame files & tagset
– Hindi PropBank development: adapting frame files; light verbs; mapping from dependency labels

11 Proposition Bank The first (English) PropBank was created on a 1-million-word, syntactically parsed Wall Street Journal corpus. PropBank annotation has also been done on other genres, e.g. web text and biomedical text, and Arabic, Chinese & Hindi PropBanks have been created.

12 English PropBank English PropBank was envisioned as the next level of the Penn Treebank (Kingsbury & Palmer, 2003): it added a layer of predicate-argument information to the Penn Treebank. It is broad in its coverage, covering every instance of a verb and its semantic arguments in the corpus, and amenable to collecting representative statistics.

13 English PropBank Annotation Two steps are involved in annotation:
– Choose a sense ID for the predicate
– Annotate the arguments of that predicate with semantic roles
This requires two components: frame files and the PropBank tagset.

14 PropBank Frame files PropBank defines semantic roles on a verb-by-verb basis. These are defined in a verb lexicon consisting of frame files: each predicate has a set of roles associated with each distinct usage, and a polysemous predicate can have several rolesets within its frame file.

15 An example: John rings the bell
ring.01 'make sound of bell'
  Arg0: causer of ringing
  Arg1: thing rung
  Arg2: ring for

16 An example: John rings the bell / Tall aspen trees ring the lake
ring.01 'make sound of bell'
  Arg0: causer of ringing
  Arg1: thing rung
  Arg2: ring for
ring.02 'to surround'
  Arg1: surrounding entity
  Arg2: surrounded entity

17 An example: [John] rings [the bell] (ring.01) / [Tall aspen trees] ring [the lake] (ring.02)
ring.01 'make sound of bell': Arg0 causer of ringing; Arg1 thing rung; Arg2 ring for
ring.02 'to surround': Arg1 surrounding entity; Arg2 surrounded entity

18 An example: [John ARG0] rings [the bell ARG1] (ring.01) / [Tall aspen trees ARG1] ring [the lake ARG2] (ring.02)
ring.01 'make sound of bell': Arg0 causer of ringing; Arg1 thing rung; Arg2 ring for
ring.02 'to surround': Arg1 surrounding entity; Arg2 surrounded entity
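
As a rough illustration, the two rolesets above could be held in a small lookup structure like the one below (the real frame files are distributed as XML; the layout and names here are only an illustrative stand-in).

```python
# Illustrative stand-in for the ring frame file: one entry per roleset,
# each with its description and role inventory.
FRAME_FILES = {
    "ring": {
        "ring.01": {"description": "make sound of bell",
                    "roles": {"Arg0": "causer of ringing",
                              "Arg1": "thing rung",
                              "Arg2": "ring for"}},
        "ring.02": {"description": "to surround",
                    "roles": {"Arg1": "surrounding entity",
                              "Arg2": "surrounded entity"}},
    }
}

def roles_for(lemma, roleset_id):
    """Return the role inventory an annotator chooses from for this sense."""
    return FRAME_FILES[lemma][roleset_id]["roles"]

print(roles_for("ring", "ring.02"))
# {'Arg1': 'surrounding entity', 'Arg2': 'surrounded entity'}
```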

19 Frame files The Penn Treebank had about 3185 unique lemmas (Palmer, Gildea & Kingsbury, 2005); the most frequently occurring verb is say. A small number of verbs had several framesets, e.g. go, come, take, make; most others had only one frameset per file.

20 PropBank annotation pane in Jubilee

21 English PropBank Tagset Numbered arguments Arg0, Arg1, and so on up to Arg4. Modifiers carry function tags, e.g. ArgM-LOC (location), ArgM-TMP (time), ArgM-PRP (purpose); modifiers give additional information about when, where or how the event occurred.

22 PropBank tagset
Numbered argument | Description
Arg0 | agent, causer, experiencer
Arg1 | theme, patient
Arg2 | instrument, benefactive, attribute
Arg3 | starting point, benefactive, attribute
Arg4 | ending point
The numbered arguments correspond to the valency requirements of the verb, or to arguments that occur with high frequency with that verb.

23 PropBank tagset There are 15 modifier labels for English PropBank, e.g. [He Arg0] studied [economic growth Arg1] [in India ArgM-LOC]
Modifier | Description
ArgM-LOC | location
ArgM-TMP | time
ArgM-GOL | goal
ArgM-MNR | manner
ArgM-CAU | cause
ArgM-ADV | adverbial

24 PropBank tagset The labels are verb-specific yet generalized: Arg0 and Arg1 correspond to Dowty's proto-roles, which leverage the commonalities among semantic roles. Agents, causers and experiencers map to Arg0; undergoers, patients and themes map to Arg1.

25 PropBank tagset While annotating Arg0 and Arg1:
– Unaccusative verbs take Arg1 as their subject argument: [The window Arg1] broke
– Unergatives take Arg0: [John Arg0] sang
A distinction is also made between internally caused events (blush: Arg0) & externally caused events (redden: Arg1).

26 PropBank tagset How might these map to the more familiar thematic roles? Yi, Loper & Palmer (2007) describe such a mapping to VerbNet roles

27 Arg0 and Arg1 are the most frequent labels (85%) and are learnt more easily by automatic systems. Arg2 is less frequent and maps to more than one thematic role; Arg3-5 are even more infrequent.

28 Using PropBank As a computational resource:
– Train semantic role labellers (Pradhan et al., 2005)
– Question answering systems (with FrameNet)
– Project semantic roles onto a parallel corpus in another language (Pado & Lapata, 2005)
For linguists, to study various phenomena related to predicate-argument structure.

29 Outline
– English PropBank: background; annotation; frame files & tagset
– Hindi PropBank development: adapting frame files; light verbs; mapping from dependency labels

30 Developing PropBank for Hindi-Urdu Hindi-Urdu PropBank is part of a project to develop a multi-layered and multi-representational treebank for Hindi-Urdu, comprising the Hindi Dependency Treebank, the Hindi PropBank and the Hindi Phrase Structure Treebank. It is an ongoing project at CU-Boulder.

31 Hindi-Urdu PropBank A corpus of 400,000 words for Hindi and a smaller corpus of 150,000 words for Urdu; the Hindi corpus consists of newswire text from Amar Ujala. So far: 220 verb frames, ~100K words annotated.

32 Developing Hindi PropBank Making a PropBank resource for a new language involves:
– Linguistic differences: capturing relevant language-specific phenomena
– Annotation practices: maintaining similar annotation practices for consistency across PropBanks

33 Developing Hindi PropBank PropBank annotation for English, Chinese & Arabic was done on top of phrase structure trees Hindi PropBank is annotated on dependency trees

34 Dependency tree Dependency trees represent relations that hold between constituents (chunks); karaka labels show the relations between the head verb and its dependents. Example ('Raam gave money to the woman'): gave has dependents Raam-erg (k1), money (k2) and woman-dat (k4).
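
A small sketch of this dependency (chunk) tree as a data structure, with karaka labels on the edges; the class and field names are illustrative, not the treebank's actual file format.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    form: str                  # surface form of the chunk head
    case: str = ""             # case marker, e.g. "erg", "dat"
    label: str = ""            # karaka (dependency) label to the head verb
    children: list = field(default_factory=list)

# "Raam gave money to the woman"
gave = Chunk("gave")
gave.children = [
    Chunk("Raam",  case="erg", label="k1"),   # agent-like participant
    Chunk("money",             label="k2"),   # theme-like participant
    Chunk("woman", case="dat", label="k4"),   # recipient
]

for dep in gave.children:
    print(f"{gave.form} --{dep.label}--> {dep.form} ({dep.case or 'unmarked'})")
```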

35 Hindi PropBank There are three components to the annotation: Hindi frame file creation, insertion of empty categories, and semantic role labelling.

36 Hindi PropBank There are three components to the annotation: Hindi frame file creation, insertion of empty categories, and semantic role labelling. Both frame creation and labelling require new strategies for Hindi.

37 Hindi PropBank Hindi frame files were adapted to include – Morphological causatives – Unaccusative verbs – Experiencers Additionally, changes had to be made to analyze the large number (nearly 40%) of light verbs

38 Causatives Verbs that are related via morphological derivation are grouped together as individual predicates in the same frame file. E.g. Cornerstone's multi-lemma mode is used for the verbs KA 'eat', KilA 'feed' and KilvA 'cause to feed'.

39 Causatives
raam ne [Arg0] khaana [Arg1] khaayaa (Ram erg food eat-perf) 'Ram ate the food'
mohan ne [Arg0] raam ko [Arg2] khaana [Arg1] khilaayaa (Mohan erg Ram dat food eat-caus-perf) 'Mohan made Ram eat the food'
sita ne [ArgC] mohan se [ArgA] raam ko [Arg2] khaana [Arg1] khilvaayaa (Sita erg Mohan instr Ram acc food eat-ind.caus-caus-perf) 'Sita, through Mohan, made Ram eat the food'
Roleset KA.01 'to eat': Arg0 eater; Arg1 the thing being eaten
Roleset KilA.01 'to feed': Arg0 feeder; Arg2 eater; Arg1 the thing being eaten
Roleset KilvA.01 'to cause to be fed': ArgC causer of feeding; ArgA feeder; Arg2 eater; Arg1 the thing eaten

40 Unaccusatives PropBank needs to distinguish proto-agents and proto-patients.
– Sudha danced: unergative; animate, agentive arguments are Arg0
– The door opened: unaccusative; non-animate, patient arguments are Arg1
For English, these distinctions were available in VerbNet; for Hindi, various diagnostic tests are applied to make this distinction.

41 Unaccusativity diagnostics (5 tests), applied in stages:
First stage: the cognate object & ergative case tests, applicable to verbs undergoing the transitivity alternation; these eliminate verbs that take (mostly) animate subjects (e.g. naac 'dance', dauRa 'run', bEtha 'sit'), while alternating verbs such as khulnaa 'open' and barasnaa 'rain' already come out as unaccusative here.
Later stages: for non-alternating verbs and others that remain, the verb is unaccusative if impersonal passives are not possible, use of huaa (the past participial relative) is possible, and the inanimate subject appears without an overt genitive; this eliminates bEtha. If the tests disagree, take the majority vote on the tests.
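
A simplified sketch of how this staged procedure could be encoded; the per-verb test outcomes below are hypothetical booleans, not an implemented lexical resource, and the staging is compressed into a single function for illustration.

```python
def is_unaccusative(verb):
    # Stage 1: alternating verbs whose subjects are (mostly) animate are
    # eliminated, i.e. treated as not unaccusative (e.g. naac, dauRa).
    if verb.get("alternating") and verb.get("mostly_animate_subject"):
        return False
    # Later stages: apply the remaining diagnostics and take a majority vote.
    tests = [
        not verb.get("impersonal_passive_ok"),      # impersonal passive impossible
        verb.get("huaa_participle_ok"),             # past participial relative possible
        verb.get("inanimate_subject_no_genitive"),  # inanimate subject, no overt genitive
    ]
    return sum(bool(t) for t in tests) >= 2

# Hypothetical test outcomes for khulnaa 'open'
khulnaa = {"alternating": True, "mostly_animate_subject": False,
           "impersonal_passive_ok": False, "huaa_participle_ok": True,
           "inanimate_subject_no_genitive": True}
print(is_unaccusative(khulnaa))   # True -> its subject is annotated as Arg1
```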

42 Unaccusative verb: mahak 'to smell good', Arg1 'entity that smells good'. The diagnostics recorded for it (impersonal passive test, past participial relative test) support unaccusativity; this information is then captured in the frame.

43 Experiencers Arg0 includes agents, causers and experiencers. In Hindi, experiencer subjects occur with dikhnaa 'to glimpse', milnaa 'to find', suujhnaa 'to be struck with', etc., and are typically marked with dative case:
– Mujh-ko chaand dikhaa (I.dat moon glimpse.pf) 'I glimpsed the moon'
PropBank labels these as Arg0, with an additional descriptor field 'experiencer' (internally caused events).

44 Complex predicates There are a large number of complex predicates in Hindi:
– Noun-verb complex predicates: vichaar karna 'think do' (to think), dhyaan denaa 'attention give' (to pay attention)
– Adjectives can also occur: accha lagnaa 'good seem' (to like)
The predicating element is not the verb alone, but forms a composite with the noun/adjective.

45 Complex predicates There are also a number of verb-verb complex predicates: light verbs combine with the bare form of the verb to convey different meanings, e.g. ro paRaa 'cry lie' (burst out crying), adding some aspectual meaning to the verb. As these occur with only a certain set of verbs, we find them automatically, based on the part-of-speech tag.

46 Complex predicates However, noun-verb complex predicates are more pervasive: they occur with a large number of nouns, and frame files for the nominal predicates need to be created. E.g. in a sample of 90K words, the light verb kar 'do' occurred 1738 times with 298 different nominals.

47 Complex predicates: annotation strategy; additional resources (frame files); cluster the large number of nominals?

48 Light verb annotation A convention for the annotation of light verbs has been adopted across the Hindi, Arabic, Chinese and English PropBanks (Hwang et al., 2010). Annotation is carried out in three passes:
– Manual identification of the light verb
– Annotation of arguments based on the frame file
– Deterministic merging of the light verb and the true predicate
In Hindi, this process may be simplified because the dependency label pof identifies a light verb.

49 Example: raam-erg paese corii kiye (Raam erg money theft do.perf) 'Ram stole the money'
1. Identify the N+V sequences that are complex predicates.
2. Annotate the predicating expression with ARG-PRX: REL: kar, ARG-PRX: corii
3. Annotate the arguments and modifiers of the complex predicate with a nominal predicate frame.
4. Automatically merge the nominal with the light verb: REL: corii_kiye, Arg0: raam, Arg1: paese
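
A minimal sketch of the deterministic merging pass (step 4): the nominal tagged ARG-PRX is folded into the light verb to form a composite REL, and the already-annotated arguments are carried over. The data structures are illustrative; the slide lists the lemma kar as REL before merging, while the sketch simply concatenates the inflected forms.

```python
def merge_light_verb(annotation):
    """annotation: {'rel': light verb form, 'arg_prx': nominal, 'args': {...}}"""
    merged = {"rel": f"{annotation['arg_prx']}_{annotation['rel']}"}
    merged.update(annotation["args"])     # Arg0, Arg1, ... stay as annotated
    return merged

example = {
    "rel": "kiye",                        # light verb 'do' (perfective form)
    "arg_prx": "corii",                   # nominal predicate 'theft'
    "args": {"Arg0": "raam", "Arg1": "paese"},
}
print(merge_light_verb(example))
# {'rel': 'corii_kiye', 'Arg0': 'raam', 'Arg1': 'paese'}
```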

50 Hindi PropBank tagset (24 labels)
Label | Description
ARG0 | agent, causer, experiencer
ARG1 | patient, theme, undergoer
ARG2 | beneficiary
ARG3 | instrument

51 Annotating Hindi PropBank
ARG0 agent, causer, experiencer; ARG1 patient, theme, undergoer; ARG2 beneficiary; ARG3 instrument
Function tags: ARG2-ATR attribute; ARG2-LOC location; ARG2-GOL goal; ARG2-SOU source

52 Annotating Hindi PropBank
ARG0 agent, causer, experiencer; ARG1 patient, theme, undergoer; ARG2 beneficiary; ARG3 instrument
Function tags: ARG2-ATR attribute; ARG2-LOC location; ARG2-GOL goal; ARG2-SOU source
Causative: ARGC causer; ARGA secondary causer

53 Annotating Hindi PropBank
ARG0 agent, causer, experiencer; ARG1 patient, theme, undergoer; ARG2 beneficiary; ARG3 instrument
Function tags: ARG2-ATR attribute; ARG2-LOC location; ARG2-GOL goal; ARG2-SOU source
Causative: ARGC causer; ARGA secondary causer
Complex predicate: ARGM-VLV verb-verb construction; ARGM-PRX noun-verb construction

54 Annotating Hindi PropBank Other modifier labels:
ARGM-ADV adverb; ARGM-CAU cause; ARGM-DIR direction; ARGM-DIS discourse; ARGM-EXT extent; ARGM-LOC location; ARGM-MNR manner; ARGM-MNS means; ARGM-MOD modal; ARGM-NEG negation; ARGM-PRP purpose; ARGM-TMP time

55 Hindi PropBank annotation Annotation pane Frameset file display

56 Annotation practice We need to maintain good annotation practices. Current practice is double-blind annotation, followed by adjudication; inter-annotator agreement measures the consistency of the annotation task. English PropBank had high inter-annotator agreement (kappa = 0.91).
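
For reference, a small sketch of how an agreement figure like this (Cohen's kappa) is computed from two annotators' label choices; the label sequences below are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[lab] / n) * (freq_b[lab] / n)
                   for lab in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

a = ["Arg0", "Arg1", "Arg1", "ArgM-LOC", "Arg0", "Arg1"]
b = ["Arg0", "Arg1", "Arg2", "ArgM-LOC", "Arg0", "Arg1"]
print(round(cohens_kappa(a, b), 2))
```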

57 Hindi PropBank annotation To improve consistency & annotation speed, PropBank annotation on dependency trees has some advantages: the Hindi Treebank uses a large set of dependency labels that carry rich semantic information. Example ('gave'): Raam-erg (k1/Arg0), money (k2/Arg1), woman-dat (k4/Arg2).

58 Deriving PropBank labels from dependencies We can derive Hindi PropBank labels from dependency labels; the mapping will reduce annotation effort and improve speed. The dependency tagset has labels that are in some ways fairly similar to PropBank:
– Verb-specific labels k1-k5
– Verb modifier labels k7p, k7t, rh, etc.

59 Label comparison Using linguistic intuition, we can compare HDT labels with the numbered arguments in HPB.

60 Label comparison Similarly, linguistic intuition gives us the mapping from HDT labels to HPB modifiers.

61 Label comparison These mappings are included in the PB frame files (only for numbered arguments); they are the basis for the linguistically motivated rules. For example, the verb A 'to come':
Roleset A.01 'to come (path)': k1 maps to Arg1; k2p maps to Arg2-GOL
Roleset A.03 'to arrive': k1 maps to Arg0; k2p maps to Arg2-GOL; k5 maps to Arg2-SOU
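
A small sketch of how these frame-file rules could be consulted during pre-annotation; the dictionary layout is illustrative, not the frame files' actual format.

```python
# Frame-file mapping rules: roleset -> {dependency label: PropBank label}
MAPPING_RULES = {
    "A.01": {"k1": "Arg1", "k2p": "Arg2-GOL"},                    # 'to come (path)'
    "A.03": {"k1": "Arg0", "k2p": "Arg2-GOL", "k5": "Arg2-SOU"},  # 'to arrive'
}

def map_argument(roleset_id, dep_label):
    """Return the PropBank label predicted by the frame-file rule, if any."""
    return MAPPING_RULES.get(roleset_id, {}).get(dep_label)

print(map_argument("A.03", "k5"))    # Arg2-SOU
print(map_argument("A.01", "k7p"))   # None -> fall back to other rule types
```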

62 Automatic mapping of DT to PB A rule-based, probabilistic system for automatic mapping. We use two kinds of resources:
– An annotated corpus [Treebank + PropBank]: 32,300 tokens, 2005 predicates
– Frame files with mapping rules

63 Argument classification We use three kinds of rules to carry out the automatic mapping (a sketch of the cascade follows):
– Deterministic rules: the dependency label is mapped directly onto a PropBank label
– Empirically derived rules: using corpus statistics associating dependency & PropBank labels
– Linguistically motivated rules: derived from linguistic intuition & captured in the frame files
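
A sketch of how the three rule types could be combined in a cascade. The ordering of the cascade and the particular deterministic label pairs (k7p, k7t, rh) are assumptions for illustration, not the system's documented behaviour.

```python
# Assumed deterministic pairs: some modifier dependency labels map directly.
DETERMINISTIC = {"k7p": "ARGM-LOC", "k7t": "ARGM-TMP", "rh": "ARGM-CAU"}

def classify(roleset_id, voice, dep_label, empirical_rules, frame_file_rules):
    # 1. Deterministic rule, if one exists for this dependency label.
    if dep_label in DETERMINISTIC:
        return DETERMINISTIC[dep_label]
    # 2. Empirically derived rule: most probable label for this feature tuple.
    probs = empirical_rules.get((roleset_id, voice, dep_label))
    if probs:
        return max(probs, key=probs.get)
    # 3. Linguistically motivated rule from the frame file.
    return frame_file_rules.get(roleset_id, {}).get(dep_label)

emp = {("A.01", "active", "k1"): {"Arg1": 0.9, "Arg0": 0.1}}
ff  = {"A.01": {"k2p": "Arg2-GOL"}}
print(classify("A.01", "active", "k1", emp, ff))    # Arg1
print(classify("A.01", "active", "k2p", emp, ff))   # Arg2-GOL
```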

64 Example of the rules
Features | Count | PropBank labels
xe.01_active_k1 (xe 'give') | 32 | Arg0: 0.93, Arg1: 0.03, Arg2: 0.03
xe.01_active_k2 | 65 | Arg1: 0.95, Arg2: 0.01, Arg0: 0.01
xe.01_active_k4 | 34 | Arg2: 0.94, Arg0: 0.02
We associate the probability of each label with a particular feature tuple; we use only three features: roleset ID, voice and dependency label. For the verb 'give', we get the correct mapping to the Hindi labels.
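
A sketch of how such empirically derived rules could be built from the annotated corpus: count, for each (roleset, voice, dependency-label) tuple, how often each PropBank label was annotated, then normalise to probabilities. The instances below are toy data.

```python
from collections import Counter, defaultdict

def build_empirical_rules(instances):
    """instances: iterable of (roleset_id, voice, dep_label, pb_label) tuples."""
    counts = defaultdict(Counter)
    for roleset, voice, dep, pb in instances:
        counts[(roleset, voice, dep)][pb] += 1
    return {tup: {pb: c / sum(ctr.values()) for pb, c in ctr.items()}
            for tup, ctr in counts.items()}

toy = ([("xe.01", "active", "k1", "Arg0")] * 30 +
       [("xe.01", "active", "k1", "Arg1")] * 1 +
       [("xe.01", "active", "k2", "Arg1")] * 20)
rules = build_empirical_rules(toy)
print(rules[("xe.01", "active", "k1")])   # Arg0 dominates, as in the table above
```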

65 Evaluation 32,300 tokens of annotated data; ten-fold cross-validation is used for evaluation.

66 Results: precision, recall and F1 score for the empirically derived rules vs. the linguistically motivated rules.

67 Results: precision, recall and F-score for the empirically derived rules vs. the linguistically motivated rules, plus a separate breakdown of numbered argument accuracy.

68 Evaluation Linguistically motivated rules improve the recall with a slight drop in precision. For the most frequent PropBank labels, the empirically derived rules perform well; more data should improve the performance for modifier arguments.

69 Evaluation Annotation practices also affect the mapping:
– Some PB labels are coarse-grained, e.g. ArgM-ADV maps to four different dependency labels
– Some PB labels are fine-grained, e.g. means and causes are distinguished (ArgM-MNS, ArgM-CAU) but are lumped together under a single label in the dependency treebank

70 Evaluation Our goal is to find a useful set of mappings to use at a pre-annotation stage. We expect that with more PropBanked data, the empirically derived rules will perform better; using additional resources such as Hindi WordNet can help to make up for the lack of training data.

71 Summary Semantic role labels English PropBank – Frame files – PropBank tagset Development of Hindi PropBank – Linguistic issues – Experiments

72 Questions?

