GENIA-GR: a Grammatical Relation Corpus for Parser Evaluation in the Biomedical Domain Yuka Tateisi 1, Yusuke Miyao 2, Kenji Sagae 2, Jun'ichi Tsujii 2,3.


1 GENIA-GR: a Grammatical Relation Corpus for Parser Evaluation in the Biomedical Domain Yuka Tateisi 1, Yusuke Miyao 2, Kenji Sagae 2, Jun'ichi Tsujii 2,3 1 Department of Informatics, Kogakuin University 2 Graduate School of Information Science and Technology, University of Tokyo 3 School of Computer Science, University of Manchester/National Centre for Text Mining

2 Background Sentences: "Interleukin 1 beta inhibits insulin secretion"; "IL-1beta is known to inhibit insulin secretion"; "Insulin secretion is inhibited by IL-1 beta". Parsing gives the Predicate-Argument Structure: [Verb] inhibit, [Subject] IL-1beta, [Object] insulin secretion. Rules: [Event]=[Verb], [Agent]=[Subject], [Target]=[Object]. IE gives: [Event] Inhibition, [Agent] IL-1beta, [Target] insulin secretion.
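The rule step on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' IE system: the names ROLE_RULES, EVENT_TYPES and to_event are hypothetical, and the verb-to-event lexicon contains just the one pair shown on the slide.

```python
# Hypothetical rule table taken from the slide:
# [Event]=[Verb], [Agent]=[Subject], [Target]=[Object].
ROLE_RULES = {"Verb": "Event", "Subject": "Agent", "Object": "Target"}

# Hypothetical lexicon mapping a verb lemma to an event type
# (the slide shows only inhibit -> Inhibition).
EVENT_TYPES = {"inhibit": "Inhibition"}


def to_event(pas):
    """Turn a predicate-argument structure (a dict) into an event frame."""
    event = {}
    for role, value in pas.items():
        event_role = ROLE_RULES.get(role)
        if event_role is None:
            continue  # roles without a rule are ignored in this sketch
        if role == "Verb":
            # Fall back to the bare lemma when the lexicon has no entry.
            value = EVENT_TYPES.get(value, value)
        event[event_role] = value
    return event


# All three example sentences on the slide share this structure after parsing.
pas = {"Verb": "inhibit", "Subject": "IL-1beta", "Object": "insulin secretion"}
event = to_event(pas)
print(event)
```

Because the active, passive and "is known to" variants all normalise to the same predicate-argument structure, one rule table is enough for the three surface forms.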

3 Constituency vs Dependency Dependency structure is becoming more common. Predicate-argument relations are easier to access from dependency structure than from constituency structure. [Figure: constituency tree (S, VP) and dependency graph (subject and object arcs) for "I love you"]

4 Parser Evaluation Need to know the performance of parsers: cross-formalism evaluation (PTB-based, TAG, HPSG, etc.) and cross-domain evaluation (newswire text etc.). Previous work used Stanford Dependencies (Clegg et al. 2007, Pyysalo et al. 2007): automatic translation from a treebank-based corpus, which may contain errors introduced by the conversion process. Good for comparison between treebank-based parsers, but may be unreliable for evaluating parsers based on other formalisms.
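The slides do not say which metric is used to score a parser against a dependency gold standard. A common choice, assumed here purely for illustration, is precision, recall and F-score over relation tuples. The function name evaluate and the toy tuples are hypothetical.

```python
def evaluate(gold, predicted):
    """Precision, recall and F1 over sets of dependency tuples.

    A predicted tuple counts as correct only if an identical tuple is in the
    gold standard (exact match on relation type and every slot).
    """
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: the parser finds the subject but attaches the object wrongly.
gold = [("ncsubj", "love", "I", "_"), ("dobj", "love", "you")]
predicted = [("ncsubj", "love", "I", "_"), ("dobj", "love", "I")]
precision, recall, f1 = evaluate(gold, predicted)
print(precision, recall, f1)
```

Exact tuple matching is the strictest variant. Schemes that give partial credit for a correct head with a wrong relation type would need a different comparison.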

5 Our Approach Manual Annotation of Dependency Structure. Grammatical Relations (GR: Carroll et al. 1998): a Gold Standard Corpus exists for the Wall Street Journal section of the Penn Treebank and has been used for cross-framework evaluation in the newswire domain. Annotation on GENIA text: other information (POS, tree, terms, event structure) is available.

6 GENIA-GR Base Text: 50-abstract subset of GENIA. MeSH term: NF kappa B. Full text in PubMed Central. Never used for training with the GENIA Treebank. Annotation Scheme: Follows Briscoe’s scheme*. Not annotated inside technical terms. *An introduction to tag sequence grammars and the RASP system parser (Technical report, University of Cambridge)

7 Grammatical Relations Shows dependency structure: Head-Dependent pairs with types of dependency. Example, "I love you": (ncsubj love I _) "I" is the subject of "love"; (dobj love you) "you" is the object of "love".
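The two tuples on this slide can be read into a small data structure as follows. This is a minimal sketch: the GR namedtuple and parse_gr are hypothetical names. The slot layout (type, head, dependent, then any remaining slots) is assumed only for the ncsubj and dobj patterns shown here, since other relation types order their slots differently.

```python
from collections import namedtuple

# Hypothetical container: relation type, head, dependent, and leftover slots.
GR = namedtuple("GR", ["type", "head", "dependent", "extra"])


def parse_gr(text):
    """Parse a tuple such as '(ncsubj love I _)' into a GR record."""
    body = text.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError(f"not a GR tuple: {text!r}")
    fields = body[1:-1].split()
    if len(fields) < 3:
        raise ValueError(f"expected a type, a head and a dependent: {text!r}")
    return GR(fields[0], fields[1], fields[2], fields[3:])


subject = parse_gr("(ncsubj love I _)")
direct_object = parse_gr("(dobj love you)")
print(subject.dependent, "is the", subject.type, "of", subject.head)
print(direct_object.dependent, "is the", direct_object.type, "of", direct_object.head)
```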

8 Technical Terms (Named Entities) Technical Terms := Terms tagged in GENIA term corpus Multi-word terms are “declared” with IDs taken from GENIA term corpus Position in the sentence Only IDs are referred to in the dependency structure

9 Example
sentence( id(96099434.1)
  named_entities(
    term( id(T1) span(1:3)(Protein kinase C-zeta))
    term( id(T2) span(5:7)(NF-kappa B activation))
    term( id(T4) span(9:12)(human immunodeficiency virus-infected monocytes))
  sentence_form(Protein kinase C-zeta mediates NF-kappa B activation in human immunodeficiency virus-infected monocytes. )
  rasp(
    (ncsubj mediates:4 *T1* _)
    (iobj mediates:4 in:8)
    (dobj mediates:4 *T2*)
    (dobj in:8 *T4*) ))

10 Example (same record as slide 9; callout "Sentence ID" on id(96099434.1))

11 Example (same record as slide 9; callout "Multi-word Terms" on the term( ... ) entries under named_entities)

12 Example (same record as slide 9; callout "Sentence Form" on sentence_form( ... ))

13 Example (same record as slide 9; callout "Dependency Structure" on the tuples under rasp( ... ))
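The example record can be read with a few regular expressions, which also shows how the *T1*-style references point back to the declared terms. This is a sketch written against this single example only. The helper names (parse_terms, parse_relations, resolve) are hypothetical, and the actual corpus files may need a more careful reader.

```python
import re

# The record from the example slide, copied as shown there.
record = (
    "sentence( id(96099434.1) named_entities( "
    "term( id(T1) span(1:3)(Protein kinase C-zeta)) "
    "term( id(T2) span(5:7)(NF-kappa B activation)) "
    "term( id(T4) span(9:12)(human immunodeficiency virus-infected monocytes)) "
    "sentence_form(Protein kinase C-zeta mediates NF-kappa B activation in "
    "human immunodeficiency virus-infected monocytes. ) "
    "rasp( (ncsubj mediates:4 *T1* _) (iobj mediates:4 in:8) "
    "(dobj mediates:4 *T2*) (dobj in:8 *T4*) ))"
)

TERM_RE = re.compile(r"term\(\s*id\((\w+)\)\s*span\((\d+):(\d+)\)\(([^()]*)\)\)")


def parse_terms(text):
    """Map each term ID to its token span (1-based, inclusive) and surface text."""
    return {
        m.group(1): {"span": (int(m.group(2)), int(m.group(3))), "text": m.group(4)}
        for m in TERM_RE.finditer(text)
    }


def parse_relations(text):
    """Return each tuple inside rasp( ... ) as a list of slots."""
    rasp_part = text.split("rasp(", 1)[1]
    return [m.group(1).split() for m in re.finditer(r"\(([^()]+)\)", rasp_part)]


def resolve(slot, terms):
    """Replace a *ID* reference with the term text; leave other slots alone."""
    m = re.fullmatch(r"\*(\w+)\*", slot)
    return terms[m.group(1)]["text"] if m else slot


terms = parse_terms(record)
relations = [[resolve(slot, terms) for slot in rel] for rel in parse_relations(record)]
for rel in relations:
    print(rel)
```

In this record each span length equals the number of whitespace-separated tokens in the term text (for example 9:12 covers the four tokens of T4), and the numeric suffixes on mediates:4 and in:8 fill the positions between the terms.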

14 Annotation Procedure Manual correction of RASP parser output. Annotation by one annotator. Procedure: 1. Initial annotation 2. Error correction, identification of problems 3. Refinement of scheme, further correction 4. Annotation by other annotator(s), inter-annotator agreement evaluation 5. Error correction, re-refinement of scheme, etc. 6. Publication

15 Annotation Results
Corpus                              No. of sentences   No. of dependencies
GENIA-GR                            492                10029
PTB-GR (Carroll and Briscoe 2006)   560                10906

16 Distribution of Dependency Types
Relation Type   N GENIA   F GENIA   N PTB   F PTB   F GENIA / F PTB
dobj            2181      4.43      1762    3.15    1.41
ncmod           2163      4.4       3548    6.34    0.69
det             990       2.01      1115    1.99    1.01
conj            975       1.98      591     1.06    1.88
iobj            962       1.96      545     0.97    2.01
ncsubj          946       1.92      1351    2.41    0.8
passive         330       0.67      228     0.41    1.65
xcomp           294       0.6       380     0.68    0.88
xmod            294       0.6       178     0.32    1.88
Other           894       1.82      1208    2.16    0.84
TOTAL           10029               10906
N x: occurrence in the corpus. F x: frequency per sentence. GENIA: GENIA (492 sentences). PTB: PTB-WSJ (560 sentences).
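The derived columns of this table follow directly from the raw counts and the sentence totals: F is N divided by the number of sentences, and the last column is the ratio of the two unrounded F values. A short sketch, with hypothetical variable and function names, that recomputes them:

```python
# Sentence totals and raw counts as given on the slide.
SENTENCES = {"GENIA": 492, "PTB": 560}
COUNTS = {  # relation type: (N GENIA, N PTB)
    "dobj": (2181, 1762),
    "ncmod": (2163, 3548),
    "det": (990, 1115),
    "conj": (975, 591),
    "iobj": (962, 545),
    "ncsubj": (946, 1351),
    "passive": (330, 228),
    "xcomp": (294, 380),
    "xmod": (294, 178),
    "Other": (894, 1208),
}


def frequencies(relation):
    """Return (F GENIA, F PTB, F GENIA / F PTB) for one relation type."""
    n_genia, n_ptb = COUNTS[relation]
    f_genia = n_genia / SENTENCES["GENIA"]
    f_ptb = n_ptb / SENTENCES["PTB"]
    return f_genia, f_ptb, f_genia / f_ptb


for relation in COUNTS:
    f_genia, f_ptb, ratio = frequencies(relation)
    print(f"{relation:8s} {f_genia:5.2f} {f_ptb:5.2f} {ratio:5.2f}")
```

A ratio well above 1 (iobj, conj, xmod, passive) marks a relation type that is more frequent per sentence in the GENIA sample than in the WSJ sample.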

17 Typical Construction in Biomedical Abstract (?) A similar potentiation of TNF effects was observed in Jurkat T cells and HeLa cells treated with soluble Tat protein.

18 Problems found in the initial annotation process Domain-specific structures Coordination Prepositional Arguments/Modifiers Inconsistency with other GENIA corpora

19 Genes: "residues 296 to 302" [Figure: dependency arcs over "residues 296 to 302" labeled ncmod(ta), ncmod, dobj]

20 Genes residues 296 to 302 (ncmod ta residues 296) (ncmod _ 296 to) (dobj to 302)

21 Binding (participle as a prenominal modifier) Examples: "New York-based firm", "NF kappa B binding domain". [Figure: current annotation, with ncmod arcs over "New York based firm" and "NF kappa B binding domain"]

22 Binding Current annotation (below), following an example from PTB-GR (above) New York-based firm (ncmod _ *New York* based) (ncmod _ firm *New York*) NF kappa B binding domain (ncmod _ *NF kappa B* binding) (ncmod _ domain *NF kappa B*)

23 Binding (participle as a prenominal modifier) But perhaps it is better to do it this way. [Figure: "NF kappa B binding domain" with arcs labeled dobj, ncsubj, xmod; "NF kappa B binding activity" with arcs labeled dobj, ncmod(ta)]

24 Binding But perhaps it is better to do it this way: NF kappa B binding domain (dobj binding *NF kappa B*) (xmod _ domain binding) (ncsubj binding domain _) NF kappa B binding activity (dobj binding *NF kappa B*) (xmod ta activity binding)

25 References in the text: "M.Nugeyre, F.Barre-Sinoussi, and N.Israel, J.Virol.72 : 5852-5861, 1998" Where is the head? [Figure: dependency graph over M.Nugeyre, F.Barre-Sinoussi, N.Israel, and, J.Virol.72, 5852-5861, 1998, with arcs labeled conj and ta]

26 References in the text M.Nugeyre, F.Barre-Sinoussi, and N.Israel, J.Virol.72 : 5852-5861, 1998 (ta comma J.Virol.72 and) (ta colon J.Virol.72 5852-5861) (ta comma J.Virol.72 1998) (conj and M.Nugeyre) (conj and F.Barre-Sinoussi) (conj and N.Israel) Where is the head?

27 Math expressions: "2 x NFKappaB > or = SIVmac239 approximately deltaNFkappaB" What’s the dependency type? [Figure: dependency graph over 2 x NFKappaB, or, >, =, ≈, SIVmac239, deltaNFkappaB, with arcs labeled conj, arg, xmod]

28 Math expressions: "2 x NFKappaB > or = SIVmac239 approximately deltaNFkappaB" What’s the dependency type? (arg or *2 x NFkappaB*) (conj or >) (conj or =) (arg or SIVmac239) (xmod SIVmac239 approximately _) (arg approximately deltaNFkappaB _)

29 References and Math expressions Not frequent but problematic May occur more frequently in full text References may be treated like terms Math expressions can work as S or VP T n 0.

30 Coordination CD8(+) CD4(+) CD4(+) and CD8(+) T lymphocytes CD4(+) CD8(+)

31 "CD4(+) and CD8(+) T lymphocytes" [Figure: dependency graph over CD4(+), and, CD8(+), T lymphocytes, with arcs labeled conj and ncmod]

32 Coordination (conj and CD4(+)) (conj and CD8(+)) (ncmod *T lymphocytes* and _) CD8(+) CD4(+) CD4(+) and CD8(+) T lymphocytes

33 (ncmod ellip CD4(+) _) (ncmod *T lymphocytes* CD8(+) _ ) (conj and ellip) (conj and *T lymphocytes*) (ncmod *T lymphocytes* CD4(+) _) (ncmod *T lymphocytes* CD8(+) _ ) (conj and CD4(+)) (conj and CD8(+)) CD4(+) CD8(+) ?

34 "CD4(+) and CD8(+) T lymphocytes" CD4(+) CD8(+) ? [Figure: dependency graph with an ellip node alongside CD4(+), T lymphocytes, CD8(+), and, with arcs labeled ncmod and conj] No mechanism to coindex.
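The collapsed annotation on slide 32, where the modifier attaches to the coordinator "and", can be expanded mechanically into the distributed alternative listed on slide 33, where each conjunct modifies the head directly. The sketch below assumes the slot order used on those two slides, (ncmod head dependent _) and (conj coordinator conjunct). The function name distribute is hypothetical. Tuples are written as Python tuples rather than parsed from text because tokens such as CD4(+) contain parentheses themselves.

```python
def distribute(relations):
    """Rewrite ncmod tuples whose dependent is a coordinator into one tuple
    per conjunct; every other tuple is passed through unchanged."""
    conjuncts = {}
    for rel in relations:
        if rel[0] == "conj":
            conjuncts.setdefault(rel[1], []).append(rel[2])

    expanded = []
    for rel in relations:
        if rel[0] == "ncmod" and rel[2] in conjuncts:
            for conjunct in conjuncts[rel[2]]:
                expanded.append(("ncmod", rel[1], conjunct, rel[3]))
        else:
            expanded.append(rel)
    return expanded


# Slide 32 annotation of "CD4(+) and CD8(+) T lymphocytes".
collapsed = [
    ("conj", "and", "CD4(+)"),
    ("conj", "and", "CD8(+)"),
    ("ncmod", "*T lymphocytes*", "and", "_"),
]
for rel in distribute(collapsed):
    print(rel)
```

The ellipsis alternative on slide 33 cannot be produced this way, because it needs an extra "ellip" node tied to *T lymphocytes*, and the scheme has no mechanism to coindex the two.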

35 PP argument/modifier iobj (argument) or ncmod (modifier)? Reference to external resources GENIA event corpus PASBio (Wattarujeekrit et al 2008) A similar potentiation of TNF effects was observed in Jurkat T cells and HeLa cells treated with soluble Tat protein.

36 Inconsistencies with other GENIA corpora Dependency Inside a Token renal cell carcinoma-derived gangliosides Mostly involving hyphens Many inside technical terms Dependency on Part of a Technical Term more mature cell lines

37 Conclusions The 50-abstract GENIA subset with dependency annotation is in the error-correction phase. Work to be done: Refinement of the scheme; Evaluation; (Enhancement to full text)

