Toward Making Online Biological Data Machine Understandable Cui Tao.

6/21/2015

Motivation
- Huge, evolving number of bio-databases
  - The molecular biology database collection
    - 2004: 548 in total, 162 more than in 2003
    - 2005: 719 in total, 171 more than in 2004
- Different access capabilities
  - From web-service-level interfaces to basic HTTP form interfaces
  - From simple lists and keyword queries to full-featured Boolean queries
  - Different query languages
- Syntactic heterogeneity
  - Flat files with/without format definitions
  - Relational databases
  - Structured/unstructured HTML files
- Semantic heterogeneity
  - Different identifiers
  - Different perspectives
  - Different terminologies
  - Different units
- Sometimes the information a user needs spans multiple sources
- Making online biological data machine understandable is important and challenging

Motivation
To help biologists:
- Perform background research
- Gain insight into relationships and interactions among different research discoveries
- Build up research strategies inspired by others’ hypotheses

System Overview
[Diagram labels: Locate Sources, Source URLs, Located Sources, Obtain Pages, Retrieved Pages, Cache Pages, Understand Pages (Extract), Understood Pages, Semantic Web Pages, Indexes, Enrich Ontology, Gene Extraction Ontology, Seed Ontologies, Execute Query]

Research Issues
- Source page understanding
  - Attribute-value pair discovery?
  - Aligning with an ontology?
- Source location through semantic indexing
  - Metadata vs. instance data indexing?
  - Use of indexes in query processing?
- Ontology evolution
  - Adjustments to ISA and Part-Of hierarchies?
  - Addition of attributes?

Thesis Statement
Build a proof-of-concept prototype that resolves the research issues:
- Automatically understands the structure of source pages
- Automatically converts source pages into semantic web pages
- Semantically indexes biological resources
- Semi-automatically updates the ontology

Outline
- Extraction ontology
- Source page understanding
- Source location through semantic indexing
- Ontology enrichment

Extraction Ontology (Partial)

Extraction Ontology Construction
- Knowledge sources
  - Gene Ontology
    - Thousands of terms
  - All Species Toolkit
    - Total of 1,231,935 names
  - Protein databases
    - Thousands of protein names
- Regular expressions, keywords (Molecular Function, Biological Process, Cellular Component)
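
A minimal sketch of how a regular expression and context keywords might be combined to recognize values for a concept. The concept names come from the slides; the patterns, the `RECOGNIZERS` table, and the `recognize` function are illustrative assumptions, not the actual contents of the extraction ontology.

```python
import re

# Hypothetical recognizers: each concept pairs a value pattern with
# context keywords that must also appear in the text.
RECOGNIZERS = {
    "Accession Number": {
        "pattern": re.compile(r"\bNP_\d+\b"),  # assumed: RefSeq-style protein IDs
        "keywords": ("accession",),
    },
    "Molecular Function": {
        "pattern": re.compile(r"\b[a-z][a-z ]* binding\b"),  # assumed phrase shape
        "keywords": ("molecular function",),
    },
}


def recognize(text):
    """Return {concept: [matches]} for concepts whose keyword and pattern both occur."""
    lowered = text.lower()
    found = {}
    for concept, rec in RECOGNIZERS.items():
        # Skip the concept unless one of its context keywords is present.
        if not any(keyword in lowered for keyword in rec["keywords"]):
            continue
        matches = rec["pattern"].findall(text)
        if matches:
            found[concept] = matches
    return found


result = recognize("Accession: NP_079345. Molecular Function: zinc ion binding")
```

Requiring a keyword as well as a pattern match is one way to keep a loose pattern from firing on unrelated text; a real recognizer would likely weigh such evidence rather than demand both.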

Source Page Understanding

Source Page Understanding
- Three steps:
  - Recognize attributes and values
  - Find attribute-value pairs
  - Map attribute-value pairs to target concepts
- Two techniques:
  - Sibling page comparison
  - Seed ontology recognition

Sibling Page Comparison

Sibling Page Comparison
[Figure label: Attribute]
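
One plausible reading of sibling page comparison, sketched under stated assumptions: sibling pages come from the same template, so text fragments that appear on every page are treated as attribute labels and the fragments that vary are treated as values. The function name `pair_attributes` and the rule "a value belongs to the nearest preceding label" are assumptions for illustration. The first page uses the marker record shown in these slides; the second page is invented.

```python
def pair_attributes(pages):
    """Pair shared fragments (assumed attribute labels) with the varying
    fragments that follow them (assumed values), page by page."""
    # Fragments present on every sibling page are treated as template text.
    shared = set(pages[0]).intersection(*(set(p) for p in pages[1:]))
    paired = []
    for fragments in pages:
        pairs = {}
        current = None  # most recent attribute label seen on this page
        for fragment in fragments:
            if fragment in shared:
                current = fragment
            elif current is not None:
                pairs.setdefault(current, []).append(fragment)
        paired.append(pairs)
    return paired


pages = [
    ["Marker Name:", "ABP1", "Forward Primer:", "CTTATGCTGCGAGTGCAGTC",
     "Reverse Primer:", "AGCAATGGAGAAGTTCCTACC"],
    # Second sibling page is invented for illustration.
    ["Marker Name:", "XYZ9", "Forward Primer:", "ACGTACGTAC",
     "Reverse Primer:", "TGCATGCATG"],
]
pairs = pair_attributes(pages)
```

Requiring a fragment to appear on every page is the strictest possible cutoff. A value that happens to be identical across all siblings would be misread as a label, which is one reason a tuned threshold may be preferable.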

Seed Ontology Recognition
- What is a seed ontology?
  - A seed ontology contains as much information as we can collect for one object in a specified application domain with respect to the extraction ontology.
- Why do we use a seed ontology?

Seed Ontology Recognition
Marker Name: ABP1
Forward Primer: CTTATGCTGCGAGTGCAGTC
Reverse Primer: AGCAATGGAGAAGTTCCTACC

nucleus; zinc ion binding; nucleic acid binding; zinc ion binding; nucleic acid binding; linear; NP_079345; 9606; Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo; NP_079345; Homo sapiens; human; GTTTTTGTGTT……….ATAAGTGCATTAACGGCCCACATG; FLJ14299; msdspagsnprtpessgsgsgg………tagpyyspyalygqrlasasalgyq; hypothetical protein FLJ14299; 8; eight; “8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; “37,?612,?680”; “37,?610,?585”;
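
The seed values above mix literal strings with pattern-like entries such as “8:?p\s?12”. A minimal sketch, assuming each seed value is applied as a regular expression to text fragments of a source page describing the same object, so that a matching fragment can be tied to a concept. The concept names in `SEED` are assumed groupings (the slide lists the values without labels), and `label_fragments` is a hypothetical helper.

```python
import re

# Seed values copied from the slide; the concept each is filed under is assumed.
SEED = {
    "Cellular Component": [r"nucleus"],
    "Molecular Function": [r"zinc ion binding", r"nucleic acid binding"],
    "Protein Name": [r"hypothetical protein FLJ14299"],
    # Patterns kept verbatim, including the unescaped dot in the second one.
    "Chromosome Location": [r"8:?p\s?12", r"8:?p11.2"],
}


def label_fragments(fragments):
    """Map each page fragment to the seed concepts whose patterns match it."""
    labels = {}
    for fragment in fragments:
        for concept, patterns in SEED.items():
            if any(re.search(p, fragment, re.IGNORECASE) for p in patterns):
                labels.setdefault(fragment, []).append(concept)
    return labels


labels = label_fragments([
    "Location: 8p12",              # invented page fragments
    "Function: zinc ion binding",
    "unrelated text",
])
```

Fragments with no seed match are simply left unlabeled here; how such fragments are handled in the actual system is not stated in the slides.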

Seed Ontology Recognition

Source Location through Semantic Indexing
- Motivation:
  - Hundreds of available biological repositories
  - Time consuming to browse all of them
  - Leads quickly to needed sources for a query
- Solution: semantic indexing
  - Meta-data
  - Data

Source Location through Semantic Indexing: Meta-Data
[Figure labels: ProtoNet; Source Organism, Accession Number, Protein Name, Length in Amino Acid, Molecular Weight in Da]

Source Location through Semantic Indexing: Meta-Data
Protein Name = “Hypothetical protein FLJ14299”
Length in Amino Acid = ?
[Figure labels: Length in Amino Acid, Protein Name, ProtoNet]
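
The meta-data example above can be sketched as an inverted index from concept names to the sources that expose them. This is a sketch under assumptions: `build_index`, `locate_sources`, and the source `ExampleDB` are hypothetical, and the ProtoNet attribute list is the one shown in these slides.

```python
from collections import defaultdict

SOURCES = {
    "ProtoNet": ["Source Organism", "Accession Number", "Protein Name",
                 "Length in Amino Acid", "Molecular Weight in Da"],
    # Hypothetical second source, added only to show filtering.
    "ExampleDB": ["Source Organism", "Accession Number"],
}


def build_index(source_concepts):
    """Invert {source: [concepts]} into {concept: {sources}}."""
    index = defaultdict(set)
    for source, concepts in source_concepts.items():
        for concept in concepts:
            index[concept].add(source)
    return index


def locate_sources(index, query_concepts):
    """Return the sources that cover every concept mentioned in the query."""
    candidate_sets = [index.get(concept, set()) for concept in query_concepts]
    return set.intersection(*candidate_sets) if candidate_sets else set()


index = build_index(SOURCES)
# The query constrains Protein Name and asks for Length in Amino Acid.
hits = locate_sources(index, ["Protein Name", "Length in Amino Acid"])
```

A meta-data index of this kind only says which sources can answer a query in principle. Whether a given source actually holds the record for “Hypothetical protein FLJ14299” is a question for the data-level index.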

Source Location through Semantic Indexing: Data
[Figure labels: Semantic Web, Semantic indexing, Query]

Ontology Enrichment
- Likely to have “imperfect” ontologies
  - Incomplete ISA and Part-Of hierarchies
  - Incomplete lexicons
  - Incomplete with respect to concepts
- Can enrich semi-automatically
- Two possibilities:
  - Data frame enrichment
  - Object set and relationship set enrichment

Ontology: Data Frame Enrichment

Ontology: Object Set and Relationship Set Enrichment
[Figure labels: Source, Target]

Ontology: Object Set and Relationship Set Enrichment
[Figure labels: Source Organism, Accession Number, Protein Name, Length in Amino Acid, Molecular Weight in Da]
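
A minimal sketch of the semi-automatic part of attribute addition, assuming that source attributes with no counterpart in the target ontology are proposed as candidates and a person then accepts or rejects each one. The function `propose_additions` and the contents of `target_concepts` are hypothetical; the source attributes are the ones listed on this slide.

```python
def propose_additions(target_concepts, source_attributes):
    """Return source attributes that have no same-named concept in the target."""
    known = {concept.lower() for concept in target_concepts}
    return [attr for attr in source_attributes if attr.lower() not in known]


# Assumed current state of the target ontology, for illustration only.
target_concepts = ["Protein Name", "Accession Number"]
source_attributes = ["Source Organism", "Accession Number", "Protein Name",
                     "Length in Amino Acid", "Molecular Weight in Da"]

candidates = propose_additions(target_concepts, source_attributes)
```

Exact name matching is the simplest possible test. It would miss synonyms and differing terminologies, so a realistic comparison would need the lexicons and recognizers of the extraction ontology.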

Research Plan
- Build and test the system step by step
- Provide experimental evidence that issues have been resolved
  - Source page understanding
  - Source location through semantic indexing
  - Ontology enrichment

Research Plan: Source Page Understanding
- Training set
  - Choose thresholds
  - Set up rules
  - Combine results from different techniques
  - Refine the seed ontologies
- Test set
  - Detect attributes and values
  - Form attribute-value pairs
  - Recognize mappings from source attribute-value pairs to target concepts

Research Plan: Others
- Source location through semantic indexing
- Ontology enrichment
  - Data frame enrichment
  - Concept and relationship set enrichment

Delimitations
- Extraction ontology: will not cover all the concepts, relationships, and values in the molecular biology domain
- Source page understanding: only deals with structured/semi-structured source pages
- Data frame enrichment: will not do automatic regular expression enrichment
- Object set and relationship set enrichment: will be limited to enriching ISA and Part-Of hierarchies and simple attribute additions
- Prototype system: will use an available front-end query interface; will not do further integration beyond synchronization with the target gene extraction ontology

Contributions
- Will contribute to both information extraction technology and bioinformatics
- Can find appropriate sources, retrieve needed information, understand a source page, and extract useful information automatically
- Can convert understood source pages into semantic web pages automatically
- Can enrich ontologies semi-automatically
- Can likely be extended to other domains
