Toward Making Online Biological Data Machine Understandable
Cui Tao
Data Extraction Research Group, Department of Computer Science, Brigham Young University, Provo, UT 84602


Introduction

PROBLEMS:
  - Huge, evolving number of bio-databases, e.g. the molecular biology database collection:
      2004: 548 total, 162 more than in 2003
      2005: 719 total, 171 more than in 2004
  - Different access capabilities
  - Syntactic heterogeneity
  - Semantic heterogeneity
  - Updated at any time by independent authorities

GOALS: To help biologists cross-search various resources.
Examples:
  - Cross-linked information (join queries):
    "Find genes which are longer than 5 kbp, whose products have at least two helices, and participate in glycolysis"
    Sources: GenBank, PDB, KEGG
  - Collecting information from similar data sources (union queries):
    "Find genes newly annotated after Jan. 2003 in the fly and worm genomes"
    Sources: FlyBase, WormBase

SOLUTION:
  - Source page understanding
      Table interpretation
      Aligning with an ontology
  - Source location through semantic annotation
      Metadata vs. instance data annotation
      Use of annotation in query processing
  - Ontology evolution
      Adjustments to ISA and Part-Of hierarchies
      Addition of attributes

Source Page Understanding

SIBLING PAGE COMPARISON
Key Concepts: sibling pages and sibling tables
Main Idea:
  - Compare two sibling tables: variable fields ~ values, fixed fields ~ labels
  - Structure pattern for one pair of sibling tables -> general structure pattern for all sibling tables
Steps:
  - Transfer each HTML table to a DOM tree
  - Find sibling tree pairs
  - Compare and find matched nodes
  - Generate a structure pattern for all sibling tables

Two sample pages (partial information), shown on the poster as table / tr / td trees:

  Sample page 1
  Gene Model      Status                 Nucleotides (coding/transcript)   Protein       Swissprot     Amino Acids
  F47G6.1 1, 2    confirmed by cDNA(s)   1773/7391 bp                      WP:CE26812    DTN1_CAEEL    590 aa

  Sample page 2
  Gene Model          Status                           Nucleotides (coding/transcript)   Protein       Amino Acids
  F18H3.5b 1, 2, 3    confirmed by cDNA(s)             1029/3051 bp                      WP:CE18608    342 aa
  F18H3.5a 1, 2       partially confirmed by cDNA(s)   1221/1704 bp                      WP:CE28918    406 aa

SAMPLE ONTOLOGY OBJECT RECOGNITION
Key Concepts: sample ontology object, expected values
Steps:
  - Map the values with the sample ontology object set
  - Map the labels with the ontology concepts
  - Understand all pages from the same web site
[Figure: a sample ontology object (partial information). Labels: Species, Protein Name, Map to, Update values]

Source Location by Semantic Indexing

  - Semantic Web
  - Semantic annotation
  - Query
META-DATA ANNOTATION
DATA ANNOTATION
[Figure labels: Source Organism, Accession Number, Protein Name, Length in Amino Acid, Molecular Weight in Da; ProtoNet]

Ontology Evolution

  - Likely to have "imperfect" ontologies
  - Can enrich semi-automatically
  - Two possibilities:
      Value enrichment
      Object-set and relationship-set enrichment

VALUE ENRICHMENT
[Figure labels: Source, Target; Source Organism, Accession Number, Protein Name, Length in Amino Acid, Molecular Weight in Da]

RELATIONSHIP-SET ENRICHMENT
OBJECT-SET ENRICHMENT
[Figure labels: Start, End, Length in Amino Acid, Location, Gene; values shown: "37,?612,?680"; "37,?610,?585"; "3,?095"; Old ontology, Updated ontology, Possible new object sets that could be added to the ontology]

Conclusions

Implementation Status:
  - Finished: sibling table comparison technique
  - Working on: sample ontology object recognition; ontology generation in the biological domain

Delimitations:
  - Ontology: will not cover everything in the domain
  - Source page understanding: structured/semi-structured
  - Value enrichment: only value lexicons
  - Object set and relationship set enrichment: only ISA and Part-Of hierarchies and simple attribute additions

Contact Information

Cui Tao, ctao@cs.byu.edu
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Provo, UT 84602
http://www.deg.byu.edu/
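
The sibling table comparison steps listed under Source Page Understanding can be illustrated with a small sketch. This is not the group's implementation: it simplifies the DOM tree to a row-by-row grid of cells, aligns cells purely by position, and uses hypothetical names (CellCollector, parse_table, structure_pattern) and simplified sample cells loosely based on the poster's sample pages. Cells that are identical in both tables are treated as fixed fields (labels), and cells that differ as variable fields (values).

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_table(html):
    """Simplified stand-in for 'transfer each HTML table to a DOM tree'."""
    collector = CellCollector()
    collector.feed(html)
    return [[cell.strip() for cell in row] for row in collector.rows]


def structure_pattern(table_a, table_b):
    """Compare two sibling tables cell by cell (positional alignment is an assumption).

    Identical cells become ("label", text); differing cells become ("value", None).
    """
    return [
        [("label", a) if a == b else ("value", None) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(table_a, table_b)
    ]


# Illustrative sibling tables; the cell contents are simplified sample data.
PAGE_A = """<table>
<tr><td>Gene Model</td><td>Status</td><td>Amino Acids</td></tr>
<tr><td>F47G6.1</td><td>confirmed by cDNA(s)</td><td>590 aa</td></tr>
</table>"""
PAGE_B = """<table>
<tr><td>Gene Model</td><td>Status</td><td>Amino Acids</td></tr>
<tr><td>F18H3.5a</td><td>partially confirmed by cDNA(s)</td><td>406 aa</td></tr>
</table>"""

pattern = structure_pattern(parse_table(PAGE_A), parse_table(PAGE_B))
for row in pattern:
    print(row)
```

A single pair can misclassify a value that happens to be identical on both pages, which is one reason the poster generalizes from one pair of sibling tables to a pattern over all sibling tables.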

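
The two mapping steps of sample ontology object recognition (values to the sample object, then labels to ontology concepts) might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the token-based matching rule, and the pairing of concepts with values. The concept names are borrowed from the poster's figure labels; the poster does not say which values belong to which concept.

```python
import re


def tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def map_labels_to_concepts(sample_object, page_fields):
    """Map page labels to ontology concepts via a known sample object.

    sample_object: dict of concept -> expected value for one known object.
    page_fields:   list of (label, value) pairs read from the object's page.
    A label maps to a concept when every token of the expected value
    appears among the tokens of the page value.
    """
    mapping = {}
    for label, value in page_fields:
        value_tokens = tokens(value)
        for concept, expected in sample_object.items():
            expected_tokens = tokens(expected)
            if expected_tokens and expected_tokens <= value_tokens:
                mapping[label] = concept
                break
    return mapping


# Hypothetical sample ontology object (concept -> expected value).
sample_object = {
    "Protein Name": "DTN1_CAEEL",
    "Length in Amino Acid": "590",
}

# Label/value pairs as they might come out of table interpretation.
page_fields = [
    ("Gene Model", "F47G6.1"),
    ("Swissprot", "DTN1_CAEEL"),
    ("Amino Acids", "590 aa"),
]

label_to_concept = map_labels_to_concepts(sample_object, page_fields)
print(label_to_concept)
```

Once the labels of one page are mapped, the same label-to-concept mapping can be reused for sibling pages from the same site, which is the third step on the poster.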
