Hong Cui University of Arizona Phenotype RCN Feb 25-27, 2013 SUB-LANGUAGE PROCESSING FOR PHENOTYPE CURATION.


1 Hong Cui University of Arizona Phenotype RCN Feb 25-27, 2013 SUB-LANGUAGE PROCESSING FOR PHENOTYPE CURATION

2 Agenda CharaParser Methodology Evaluation Applications CharaParser for Phenoscape New modules Evaluations Challenges

3 Fine-Grained Semantic Mark-up To annotate factual information from textual morphological descriptions of biodiversity in such a detailed manner that the machine readable annotation itself provides information equivalent to the original text.

4 An Example

5

6 Previous Research Syntactic parsing approach (Taylor, 1995; Abascal & Sánchez, 1999; Vanel, 2004). Interactive extraction (Diederich, Fortuner & Milton, 1999). Semi-supervised bootstrapping for lexicons (Riloff, 1999). Supervised regular expression rule learning (Soderland, 1999; Tang & Heidorn, 2008). Ontology driven and parallel text (Woods et al., 2004). Supervised association rule learning (Cui & Heidorn, 2007).

7

8

9

10 General-Purpose Parsers?

11 CharaParser Approach 1. Unsupervised machine learning to find anatomy and character terms from descriptions automatically. No need to prepare training examples; 50%-80% of terms learned. 2. General-purpose syntactic parser (e.g., Stanford Parser) to parse the syntactic structure of sentences. No need to create a special-purpose, domain-dependent parser; the lexicon learned in step 1 is used to adapt the parser to biodiversity domains. 3. Intuitive rules to produce annotations from parse trees.

12 Unsupervised lexicon learning If it is known that "roots" is an organ: Roots yellow to medium brown or black, thin. Petals yellow or white; Petals absent; Subtending bracts absent; Abaxial hastula absent;
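The chain of inferences on this slide can be sketched as a small bootstrapping loop. This is only a toy illustration, not CharaParser's actual algorithm: the function name `bootstrap_lexicon`, the adjacent-word heuristic, and the tokenization are all assumptions made for the example. Starting from the single seed "roots", a word that directly follows a known organ is taken to be a character state, and an unknown word that directly precedes a known state is taken to be an organ.

```python
import re

# Clauses taken from the slide; the seed lexicon contains only "roots".
CLAUSES = [
    "Roots yellow to medium brown or black, thin.",
    "Petals yellow or white",
    "Petals absent;",
    "Subtending bracts absent;",
    "Abaxial hastula absent;",
]


def bootstrap_lexicon(clauses, seed_organs):
    """Toy bootstrapping: grow organ and state sets from adjacent word pairs.

    Hypothetical heuristic (an assumption, not the published method):
      * known organ followed by an unknown word -> that word is a state
      * unknown word followed by a known state  -> that word is an organ
    The loop repeats until a full pass adds nothing new.
    """
    organs, states = set(seed_organs), set()
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            tokens = re.findall(r"[a-z]+", clause.lower())
            for left, right in zip(tokens, tokens[1:]):
                known = organs | states
                if left in organs and right not in known:
                    states.add(right)
                    changed = True
                elif right in states and left not in known:
                    organs.add(left)
                    changed = True
    return organs, states


organs, states = bootstrap_lexicon(CLAUSES, {"roots"})
print(sorted(organs), sorted(states))
```

On the slide's clauses the loop learns "yellow" from "roots", then "petals" from "yellow", then "absent" from "petals", and finally "bracts" and "hastula" from "absent". Words such as "white" or "thin" stay unlearned because nothing known is adjacent to them, which is in keeping with the partial term coverage reported for the unsupervised step.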

13 CharaParser: Term Reviewer

14 Ontology Term Organizer

15 Why use these methods? Portability to new descriptions from different taxon groups. Scalability to large amounts of legacy descriptive literature, e.g., Biodiversity Heritage Library

16

17 Compared against a Heuristics-Based Method Parser performance evaluated on the same data sets. CharaParser: unsupervised learning + Stanford Parser Heuristics-based: unsupervised learning + regular expression rules

18 Annotation Problems Chunk errors: Leaves oblanceolate to lanceolate, largest 14–20(–40) × 3–4(–5) mm, pliant; Attachment errors: on outer cypselae, crowns of bristlelike scales ca. 0.5 mm; on inner, of dusky white or pale yellow, plumose bristles 5–6 mm. Semantics: straight posterolateral bounding ridges to subtriangular, bilobed ventral muscle field;

19 Applications at Various Development Stages Convert XML markup to SDD for identification key generation Character matrices for tree of life RDF for the Semantic Web and search Use marked-up descriptions to support search FNA Experimental Search Data source is RDF triples Allow character based search Plants that give yellow flowers at 200-400 meter elevation in April in North Carolina
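A character-based search over triples, like the example query on this slide, could look roughly like the sketch below. Everything in it is invented for illustration: the taxa (`taxon_A` and so on), the predicate names, and the `search` helper are assumptions, and the real FNA Experimental Search presumably queries an RDF store rather than a Python list.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; not real FNA data.
TRIPLES = [
    ("taxon_A", "flower_color", "yellow"),
    ("taxon_A", "elevation_m", (100, 500)),
    ("taxon_A", "flowering_month", "April"),
    ("taxon_A", "distribution", "North Carolina"),
    ("taxon_B", "flower_color", "yellow"),
    ("taxon_B", "elevation_m", (800, 1200)),
    ("taxon_B", "flowering_month", "April"),
    ("taxon_B", "distribution", "North Carolina"),
    ("taxon_C", "flower_color", "white"),
    ("taxon_C", "elevation_m", (200, 400)),
    ("taxon_C", "flowering_month", "April"),
    ("taxon_C", "distribution", "North Carolina"),
]


def search(triples, conditions):
    """Return subjects for which every condition is met by at least one value.

    `conditions` maps a predicate name to a test function on the object value.
    """
    index = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        index[subject][predicate].append(obj)
    return sorted(
        subject
        for subject, props in index.items()
        if all(
            any(test(value) for value in props.get(predicate, []))
            for predicate, test in conditions.items()
        )
    )


# "Plants that give yellow flowers at 200-400 meter elevation in April in North Carolina"
query = {
    "flower_color": lambda v: v == "yellow",
    "elevation_m": lambda v: v[0] <= 400 and v[1] >= 200,  # ranges overlap
    "flowering_month": lambda v: v == "April",
    "distribution": lambda v: v == "North Carolina",
}
matches = search(TRIPLES, query)
print(matches)
```

The point of the sketch is that once descriptions are marked up at the character level, each clause of the natural-language query becomes an independent test on one predicate.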

20

21 To-Dos Tighter integration of ontologies in annotation process. Currently internal glossaries are used in place of ontologies to link a character state (e.g., red) to a character (color) Synonyms are not controlled Petiolate = with petiole Continue to reduce annotation errors Accommodate various syntactic styles Diagnosis paragraphs Comparison among different taxa

22 Ontology challenges: coverage (2010)
Source | # of char states | in FNAGloss | in Oxford | in Radford | in PATO
FNA | 1356 | 630 | 259 | 350 | 244
FoC | 1550 | 676 | 297 | 368 | 247
Not found in any ontology: FNA: 593; FoC: 742

23 Inter-Ontology Agreement (2010)
Common character states: n = 64
4 ontologies agree: 12 (11 = shape, 1 = position)
3 ontologies agree: 30 (PATO disagrees with others = 27)
2 ontologies agree: 13
No agreement: 9

24 Phenotype Curation Convert character and character state information from natural language descriptions to EQ statements

25 Curator Mental Process read description Identify key phrases (raw EQ) ontologized EQ ontologies

26 Phenotype Curation System = CharaParser + Phenex. PCS system schematic diagram.

27 CharaParser Output Candidate EQ statements

28 Phenex: EQ reviewer and editor

29 A Zoom-in View

30 Adapted CharaParser Character Description State Descriptions CharaParser XML to Raw EQs Raw EQs to Final EQs Ontologies

31 Evaluations Internal evaluation: The development corpus (three publications on fishes and archosaurs) provided 1,200 character descriptions, 100 of which were included in the internal evaluation benchmark. Raw EQ performance: 90%. Final EQ performance: 50%. BioCreative 2012 evaluation: 50 descriptions independently selected by the organizer (>50% of Qs were not in ontologies). Gold standard created by the chief Phenoscape curator (raw and final). Three biocurators worked in two modes (Phenex vs. Phenex+CharaParser). Raw EQ performance: CharaParser better than biocurators. Final EQ performance: biocurators better than CharaParser. Inter-curator agreements:

32 Public Evaluation: Results Overview:

33 CharaParser Effect

34 Inter-Curator Agreements
Pair | Precision | Recall
Curator 1 vs 2 | 39 | 49
Curator 1 vs 3 | 47 | 56
Curator 2 vs 3 | 77 | 71
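The slide does not say how pairwise agreement was scored. One plausible reading, sketched below as an assumption, treats the second curator's EQ statements as the reference set and counts exact matches; the function name `pairwise_agreement` and the sample statements are hypothetical.

```python
def pairwise_agreement(candidate, reference):
    """Precision and recall of `candidate` EQ statements against `reference`.

    Each statement is an (entity, quality) tuple and only exact matches count.
    This is a simplification; the evaluation behind the slide may have
    credited partial matches.
    """
    candidate, reference = set(candidate), set(reference)
    shared = candidate & reference
    precision = len(shared) / len(candidate) if candidate else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical annotations from two curators for the same descriptions.
curator_a = {
    ("entity_1", "quality_1"),
    ("entity_2", "quality_2"),
    ("entity_3", "quality_3"),
    ("entity_4", "quality_4"),
}
curator_b = {
    ("entity_1", "quality_1"),
    ("entity_2", "quality_2"),
    ("entity_5", "quality_5"),
}

precision, recall = pairwise_agreement(curator_a, curator_b)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this definition precision and recall swap when the two curators swap roles, so a pair only gets two different numbers because one curator is arbitrarily treated as the reference.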

35 Error Analyses Various fixable syntactic problems E.g., digits I-III Curation granularity CharaParser generated more candidate EQs than curators Preopercular latero-sensory canal leaves preopercle at first exit and enters a plate: yes/no Annotating relations (relational quality) contact between …

36 Ontology Access Currently use keyword-based search: class labels and exact, narrower, and related synonyms. False positives: acute (shape) =? acute (process); "margin" is a broad synonym of "marginal zone of embryo" in UBERON. Pre-composed terms in ontology: ceratobranchial 5 tooth, rib of vertebra 5, body of humerus. Ambiguous term use in descriptions: epibranchial 1 => epibranchial 1 element? bone? cartilage? No matching
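The keyword lookup and its false positives can be illustrated with a minimal sketch. The class IDs (`EX:1` and so on), the `OntologyClass` structure, and the `keyword_search` function are hypothetical stand-ins, not the actual ontology access code; only the term examples come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    class_id: str  # placeholder ID, not a real ontology identifier
    label: str
    branch: str    # rough category, used here only to show ambiguity
    synonyms: dict = field(default_factory=dict)  # scope -> list of strings


CLASSES = [
    OntologyClass("EX:1", "acute", "shape"),
    OntologyClass("EX:2", "acute", "process"),
    OntologyClass("EX:3", "marginal zone of embryo", "anatomy",
                  {"broad": ["margin"]}),
]


def keyword_search(term, classes, scopes=("exact", "narrow", "related")):
    """Return IDs of classes whose label or in-scope synonym equals `term`."""
    term = term.strip().lower()
    hits = []
    for cls in classes:
        names = [cls.label]
        for scope in scopes:
            names.extend(cls.synonyms.get(scope, []))
        if term in (name.lower() for name in names):
            hits.append(cls.class_id)
    return hits


print(keyword_search("acute", CLASSES))
print(keyword_search("margin", CLASSES))
print(keyword_search("margin", CLASSES,
                     scopes=("exact", "narrow", "related", "broad")))
```

A plain string match cannot tell the two "acute" classes apart, and widening the synonym scopes to include broad synonyms pulls in "marginal zone of embryo" for "margin". Both effects are the kind of false positive the slide describes, which is why some form of disambiguation or subsetting is needed on top of keyword search.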

37 Exploration of Solutions Experimented with word sense disambiguation: crinkly not in PATO. Candidate matches: [undulate->1.00000000000002] [obovate->1.00000000000001] [flat->1.00000000000001] [flattened->1] [circinate->0.884697579551583] Experimenting with subsets: Specify included classes, e.g., classes related to vertebrates. Specify excluded classes, e.g., exclude certain developmental stages. Ideas to try out: Bootstrapping to narrow down the search space, starting from known classes and evaluating candidate matches based on their distances to the known classes and other sources of evidence.

38 Annotation consistency Instructions given to human curators are helpful to CharaParser. Restricted relation list: http://phenoscape.org/wiki/Guide_to_Character_Annotation#Relations_used_for_post-compositions

39 Feed more info to EQ generation module Ontologies

40 Recent Improvements Explorer of Taxon Concepts project. Making it a pure-Java program/web-based application (currently requires MySQL + Perl). Making it faster: optimization of the program, removing MySQL and reducing I/O, parallel computing using Java threads. Preliminary evaluation shows it is 20 times faster: 2 sec/taxon description. Memory requirements increased threefold.

41 Acknowledgements Fine-Grained Semantic Markup Project (current and past): James Macklin: Agriculture and Agri-Food Canada; Robert (Bob) Morris, Alex Dusenbery: UMass-Boston; Hariharan Gopalakrishnan, Zilong Chang, Thomas Rodenhausen, Mohan Krishna Gowda, Partha Pratim Sanyal, Chunshui Yu: University of Arizona. Phenoscape Project: Chris Mungall: Lawrence Berkeley National Lab; Melissa Haendel: Oregon Health & Science University; Paula Mabee, Alex Dececchi: University of South Dakota; Jim Balhoff, Wasila Dahdul, Hilmar Lapp, Todd Vision: NESCent. NSF ABI and EF Programs. The Flora of North America Project.

