New Frontiers in Corpus Annotation Workshop, 6/29/05 Ann Bies – Linguistic Data Consortium* Seth Kulick – Institute for Research in Cognitive.


1 New Frontiers in Corpus Annotation Workshop, 6/29/05
Ann Bies – Linguistic Data Consortium*
Seth Kulick – Institute for Research in Cognitive Science*
Mark Mandel – Linguistic Data Consortium*
* University of Pennsylvania
Parallel Entity and Treebank Annotation

2 Mining the Bibliome: Information Extraction from the Biomedical Literature
NSF ITR grant EIA-0205448
Collaboration with Division of Oncology, Children’s Hospital of Philadelphia
PubMed abstracts – mining cancer literature for associations that link variations in genes with malignancies
http://bioie.ldc.upenn.edu – release 0.9 available
1157 abstracts entity annotated, 318 also treebanked

3 Outline
Entity Annotation
Treebank Annotation – Modifications from Penn Treebank guidelines
Annotation Process and Merged Format
Entity-Constituent Mapping – How successful?

4 Entity Annotation
Gene X with genomic Variation event Y is correlated with Malignancy Z
Gene – composite entity, can refer to gene or protein: Gene-generic, Gene-protein, Gene-RNA
(Malignancy – under development, not included in release 0.9)
Variation Event – Relation between entities representing different aspects of a variation

5 Entity Annotation - Variations
Variation – A relation between variation component entities
“a single nucleotide substitution at codon 249, predicting a serine to cysteine amino acid substitution”
  Var-type – substitution
  Var-location – codon 249
  Var-state-orig – serine
  Var-state-altered – cysteine

6 A Change in Tokenization
Tokenization – Many hyphenated words treated as separate tokens
“New York-based”
  Old (Penn Treebank) tokenization: [New] [York-based]
  New tokenization: [New] [York] [-] [based]
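A minimal sketch of the new tokenization in Python, assuming the simplest possible rule: every hyphen becomes its own token. The slide only says that many hyphenated words are split, and the merged output example keeps K-ras as a single token, so the actual rule is more selective than this. The function name `tokenize` is hypothetical.

```python
import re

def tokenize(text):
    # Runs of non-space, non-hyphen characters are tokens;
    # each hyphen is emitted as a token of its own (simplifying assumption).
    return re.findall(r"[^\s-]+|-", text)

print(tokenize("New York-based"))
print(tokenize("K- and N-ras"))
```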

7 Discontinuous Entities
E.g.: “K- and N-ras”
Tokenization: [K] [-] [and] [N] [-] [ras]
Entity annotation:
  [K] [-] … [ras] – “chain” of discontinuous tokens
  [N] [-] [ras] – contiguous tokens
Splitting up not always done, depends on coordination
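One way to picture the difference between a chain and a contiguous entity is to store each entity as the token positions it covers. This is only an illustrative representation, not the release format; the names `entities` and `is_chain` are hypothetical.

```python
tokens = ["K", "-", "and", "N", "-", "ras"]

# Each entity is the sorted list of token positions it covers (assumed representation).
entities = {"K-ras": [0, 1, 5], "N-ras": [3, 4, 5]}

def is_chain(positions):
    # A chain skips at least one token between its first and last position.
    return positions != list(range(positions[0], positions[-1] + 1))

for name, positions in entities.items():
    kind = "chain" if is_chain(positions) else "contiguous"
    print(name, [tokens[i] for i in positions], kind)
```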

8 Treebank Annotation
Default NP right-branching structure
  (NP (JJ primary) (NN liver) (NN cancer))
Simplifies multi-token nominal annotation
Allows recovery of implicit constituents:
  (NP (JJ primary) (newnode (NN liver) (NN cancer)))
Entities sometimes map to such implicit constituents
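The recovery of implicit constituents can be sketched as follows, assuming a flat NP is given as a list of its children and that right-branching makes every proper suffix of two or more children a constituent. The function name `implicit_constituents` is hypothetical, and an explicit NML child would count as a single element.

```python
def implicit_constituents(children):
    # Under the right-branching default, each proper suffix of length >= 2
    # is an implicit constituent (a candidate "newnode").
    return [children[i:] for i in range(1, len(children) - 1)]

print(implicit_constituents(["primary", "liver", "cancer"]))
print(implicit_constituents(["the", "K-ras", "NML", "point", "mutations"]))
```

For the three-word NP this yields only "liver cancer", the bracketing shown on the slide.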

9 Treebank Annotation
Exceptions to right-branching marked by NML
So: Any two or more non-final elements that form a constituent are a NML
  (ADJP (NML (NNP New) (NNP York)) (HYPH -) (VBN based))
  (ADJP (NML (NN breast) (NN cancer)) (HYPH -) (VBN associated))
  (NP (NML (NN human) (NN liver) (NN tumor)) (NN analysis))

10 Treebank Annotation
Placeholder *P* for distributed material in coordinated nominal structures
“K- and N-ras”
(NP (NP (NN K)
        (HYPH -)
        (NML-1 (-NONE- *P*)))
    (CC and)
    (NP (NN N)
        (HYPH -)
        (NML-1 (NN ras))))

11 Treebank Annotation
To the left or right
“codon 12 or 13”
(NP (NP (NML-1 (NN codon))
        (CD 12))
    (CC or)
    (NP (NML-1 (-NONE- *P*))
        (CD 13)))

12 First Release
Goal – let users choose how to handle the integration of entity and treebank levels
Standoff annotation for entity and treebank
Identical tokenization
Merged representation
Penn Treebank style (POSTag:[from..to] terminal)
Entity listing before each tree

13 Merged Output Example
sentence 4 Span:331..605
;In the present study, we screened for
;the K-ras exon 2 point mutations in a
;group of 87 gynecological neoplasms
;[373..378]:gene-rna:"K-ras"
;[379..385]:variation-location:"exon 2"
;[386..401]:variation-type: "point mutations"

14 Merged Output Example
[…]
((VP (VBD:[356..364] screened)
     (PP-CLR (IN:[365..368] for)
             (NP (DT:[369..372] the)
                 (NN:[373..378] K-ras)
                 (NML (NN:[379..383] exon) (CD:[384..385] 2))
                 (NN:[386..391] point)
                 (NNS:[392..401] mutations)))
[…]

15 Merged Output Example
((VP (VBD:[356..364] screened)
     (PP-CLR (IN:[365..368] for)
             (NP (DT:[369..372] the)
                 (NN:[373..378] K-ras)
                 (NML (NN:[379..383] exon) (CD:[384..385] 2))
                 (NN:[386..391] point)
                 (NNS:[392..401] mutations)))
;[373..378]:gene-rna:"K-ras"
;[379..385]:variation-location:"exon 2"
;[386..401]:variation-type: "point mutations"

16 Entity-Constituent Mapping: Exact Match
Exact Match: A node in the tree yields exactly the entity:
;[379..385]:variation-location:"exon 2"
(NP (DT:[369..372] the)
    (NN:[373..378] K-ras)
    (NML (NN:[379..383] exon) (CD:[384..385] 2))
    (NN:[386..391] point)
    (NNS:[392..401] mutations)))

17 Entity-Constituent Mapping: Missing Node
Missing Node – Possible to add a node to yield exactly the entity
;[386..401]:variation-type: "point mutations"
(NP (DT:[369..372] the)
    (NN:[373..378] K-ras)
    (NML (NN:[379..383] exon) (CD:[384..385] 2))
    (NN:[386..391] point)
    (NNS:[392..401] mutations)))

18 Entity-Constituent Mapping: Missing Node
Done for internal research purposes, not in release (implicit constituents)
NML already in release (explicit constituents)
(NP (DT:[369..372] the)
    (NN:[373..378] K-ras)
    (NML (NN:[379..383] exon) (CD:[384..385] 2))
    (newnode (NN:[386..391] point)
             (NNS:[392..401] mutations))))

19 Entity-Constituent Mapping: Crossing
Crossing: Cuts across constituent boundaries, so cannot even add a node yielding the entity
Typical case: entity containing text corresponding to a prepositional phrase
One ER showed a G-to-T mutation in the second position of codon 12
[1280..1307]: variation-location: “second position of codon 12”

20 Entity-Constituent Mapping: Crossing
Crossing - Determiner in NP but not in entity.
Could relax matching, or modify entity or treebank annotation. Didn’t do that.
(NP (NP (DT:[1276..1279] the)
        (JJ:[1280..1286] second)
        (NN:[1287..1295] position))
    (PP (IN:[1296..1298] of)
        (NP (NN:[1299..1304] codon)
            (CD:[1305..1307] 12)))))
[1280..1307]: variation-location: “second position of codon 12”
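The three mapping outcomes can be sketched as a small classifier. This is an illustration under stated assumptions, not the authors' procedure: trees are hand-built nested tuples, a "missing node" is taken to mean that the entity is covered exactly by a run of two or more adjacent sister nodes, and empty elements such as *P* are not handled. The names `span` and `classify` are hypothetical.

```python
def span(node):
    # Leaves are (tag, start, end); internal nodes are (label, [children]).
    if isinstance(node[1], int):
        return (node[1], node[2])
    kids = node[1]
    return (span(kids[0])[0], span(kids[-1])[1])

def classify(node, entity):
    if span(node) == entity:
        return "exact match"
    if isinstance(node[1], int):
        return "crossing"
    kid_spans = [span(k) for k in node[1]]
    starts = [s for s, _ in kid_spans]
    ends = [e for _, e in kid_spans]
    # A run of adjacent sisters covering the entity could be wrapped in a new node.
    found_run = (entity[0] in starts and entity[1] in ends
                 and starts.index(entity[0]) < ends.index(entity[1]))
    result = "missing node" if found_run else "crossing"
    for kid in node[1]:
        sub = classify(kid, entity)
        if sub == "exact match":
            return sub
        if sub == "missing node":
            result = sub
    return result

kras_np = ("NP", [("DT", 369, 372), ("NN", 373, 378),
                  ("NML", [("NN", 379, 383), ("CD", 384, 385)]),
                  ("NN", 386, 391), ("NNS", 392, 401)])
codon_np = ("NP", [("NP", [("DT", 1276, 1279), ("JJ", 1280, 1286), ("NN", 1287, 1295)]),
                   ("PP", [("IN", 1296, 1298),
                           ("NP", [("NN", 1299, 1304), ("CD", 1305, 1307)])])])

print(classify(kras_np, (379, 385)))     # "exon 2"
print(classify(kras_np, (386, 401)))     # "point mutations"
print(classify(codon_np, (1280, 1307)))  # "second position of codon 12"
```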

21 Entity-Constituent Mapping – Chain Exact Match
“codon 12 or 13”
Entities: “codon 12”, “codon..13”
(NP (NP (NML-1 (NN codon))
        (CD 12))
    (CC or)
    (NP (NML-1 (-NONE- *P*))
        (CD 13)))

22 Entity-Constituent Mapping – Chain Not an Exact Match
“specific codons (12, 13, and 61)”
Entities: “codons…12”, “codons..13”, “codons..61”
(NP (JJ specific)
    (NNS codons)
    (PRN (-LRB- -LRB-)
         (NP (NP (CD 12))
             (, ,)
             (NP (CD 13))
             (, ,)
             (CC and)
             (NP (CD 61)))
         (-RRB- -RRB-)))

23 Multiple Token Entities (Non-Chained)
Entity Type          Total  Exact Match  Missing Node  Crossing
Gene-generic             6            4             1         1
Gene-protein           349          236           103        10
Gene-RNA               156          115            35         6
Var-location           445          348            68        29
Var-state-orig           5            3             1         1
Var-state-altered       10            8             0         2
Var-type               271          123           142         6
Total                 1242          837           350        55 (4.4%)

24 Multiple Token Entities (Chained)
Entity Type          Total  Exact Match  Not Exact Match
Gene-generic             0            0                0
Gene-protein             6            4                2
Gene-RNA                36           29                7
Var-location           125          103               22
Var-state-orig           0            0                0
Var-state-altered        0            0                0
Var-type                 1            0                1
Total                  168          136               32 (19%)

25 Conclusion
Annotation of entities and treebank done together
Identical tokenization for entities and trees, with standoff annotation
Allows flexibility in use of integrated annotation
Only 6.2% of the multi-token entities (87 of 1410) cannot be mapped to an implicit or explicit constituent node
Changes in Treebank guidelines
Use of Relations for potentially large entities
Next: Relation annotation and integrated taggers
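The 6.2% figure follows from the totals rows of the two multiple-token tables: crossing entities among the non-chained ones plus not-exact matches among the chained ones. A short check in Python, using only those totals (the helper name `pct` is hypothetical):

```python
def pct(part, whole):
    # Percentage rounded to one decimal place.
    return round(100 * part / whole, 1)

non_chained_total, crossing = 1242, 55
chained_total, not_exact = 168, 32

print(pct(crossing, non_chained_total))   # share of non-chained entities that cross
print(pct(not_exact, chained_total))      # share of chained entities without an exact match
print(pct(crossing + not_exact, non_chained_total + chained_total))  # combined
```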

26 References
Ryan’s tagger
Dan’s parser
Web page again

27 Entity Annotation - Variations
“(S249C)”
  Var-type – none
  Var-location – 249
  Var-state-orig – S
  Var-state-altered – C
Gene-{RNA,generic,protein} disambiguates gene metonymy
Var-{type,location,state-orig,state-altered} are different kinds of entities
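To make the component breakdown concrete, here is a small sketch that splits a shorthand such as "(S249C)" into the same components. The pattern is an assumption for illustration (single uppercase letters around a number, with optional parentheses); the annotation itself was done by annotators, not by a rule like this, and the name `parse_substitution` is hypothetical.

```python
import re

# Assumed shape: optional "(", one letter, digits, one letter, optional ")".
SHORTHAND = re.compile(r"^\(?([A-Z])(\d+)([A-Z])\)?$")

def parse_substitution(text):
    m = SHORTHAND.match(text)
    if m is None:
        return None
    orig, location, altered = m.groups()
    # The shorthand names no variation type, so Var-type stays empty.
    return {
        "Var-type": None,
        "Var-location": location,
        "Var-state-orig": orig,
        "Var-state-altered": altered,
    }

print(parse_substitution("(S249C)"))
```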

28 Entities
                                   --Multiple Tokens--
Entity Type          Single Tokens  Non-chains  Chains
Gene-generic                   104           6       0
Gene-protein                   921         349       6
Gene-RNA                      1987         156      36
Var-location                    95         445     125
Var-state-orig                 151           5       0
Var-state-altered              162          10       0
Var-type                       235         271       1

29 Introduction
Corpus for biomedical IE with several levels of annotation:
  Entity
  Syntactic Structure (Treebank)
  Relations (McDonald et al., ACL 2005)
Ideal - entities mapped to treebank constituents
Allow users to choose how to integrate the levels

30 Annotation Process
Tokenization → Entity → POS → Treebanking → Merged Representation
Minimal requirement: identical tokenization for entity and treebank annotation
Did not require an entity/constituent correspondence – but how did it work out?

