1
Toward Making Online Biological Data Machine Understandable
Cui Tao
2
6/21/2015
Motivation
- Huge and growing number of bio-databases
  - The Molecular Biology Database Collection
    - 2004: 548 total, 162 more than in 2003
    - 2005: 719 total, 171 more than in 2004
- Different access capabilities
  - From web-service-level interfaces to basic HTTP form interfaces
  - From simple lists and keyword queries to full-featured Boolean queries
  - Different query languages
- Syntactic heterogeneity
  - Flat files with or without format definitions
  - Relational databases
  - Structured and unstructured HTML files
- Semantic heterogeneity
  - Different identifiers
  - Different perspectives
  - Different terminologies
  - Different units
- Sometimes the information a user needs spans multiple sources
- Making online biological data machine understandable is important and challenging
3
Motivation
To help biologists:
- Perform background research
- Gain insight into relationships and interactions among different research discoveries
- Build research strategies inspired by others’ hypotheses
4
System Overview
[Figure: system architecture diagram. Labels: Located Sources, Locate Sources, Obtain Pages, Understand Pages (Extract), Indexes, Source URLs, Semantic Web Pages, Understood Pages, Retrieved Pages, Cache Pages, Enrich Ontology, Gene Extraction Ontology, Seed Ontologies, Execute Query]
5
Research Issues
- Source page understanding
  - Attribute-value pair discovery?
  - Aligning with an ontology?
- Source location through semantic indexing
  - Metadata vs. instance data indexing?
  - Use of indexes in query processing?
- Ontology evolution
  - Adjustments to ISA and Part-Of hierarchies?
  - Addition of attributes?
6
Thesis Statement
Build a proof-of-concept prototype that resolves the research issues:
- Automatically understands the structure of source pages
- Automatically converts source pages into semantic web pages
- Semantically indexes biological resources
- Semi-automatically updates the ontology
7
Outline
- Extraction ontology
- Source page understanding
- Source location through semantic indexing
- Ontology enrichment
8
Extraction Ontology (Partial)
13
Extraction Ontology Construction
Knowledge sources:
- Gene Ontology: thousands of terms
- All Species Toolkit: 1,231,935 names in total
- Protein databases: thousands of protein names
- Regular expressions and keywords (Molecular Function, Biological Process, Cellular Component)
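The slide names regular expressions and keywords as knowledge sources but does not show how they are encoded. A minimal sketch of one possible encoding follows. The concept names, the `GO:` identifier pattern, the chromosome-band pattern, and the `recognize` helper are all assumptions for illustration, not the presenter's actual data frames.

```python
import re

# Hypothetical recognizers: each concept pairs a value pattern with context keywords.
RECOGNIZERS = {
    "GO Term ID": {
        "value": re.compile(r"\bGO:\d{7}\b"),
        "keywords": ["molecular function", "biological process", "cellular component"],
    },
    "Chromosome Location": {
        "value": re.compile(r"\b\d{1,2}p\d{1,2}(?:\.\d+)?\b"),
        "keywords": ["location", "cytogenetic"],
    },
}


def recognize(text):
    """Return {concept: [matched values]} for every concept whose value
    pattern matches and whose context keywords also appear in the text."""
    lowered = text.lower()
    found = {}
    for concept, rec in RECOGNIZERS.items():
        values = rec["value"].findall(text)
        has_keyword = any(keyword in lowered for keyword in rec["keywords"])
        if values and has_keyword:
            found[concept] = values
    return found
```

Requiring both a value match and a nearby keyword is one simple way to cut false positives; a real extraction ontology would likely carry richer context rules.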
14
Source Page Understanding
17
Source Page Understanding
Three steps:
1. Recognize attributes and values
2. Find attribute-value pairs
3. Map attribute-value pairs to target concepts
Two techniques:
- Sibling page comparison
- Seed ontology recognition
18
Sibling Page Comparison
21
Sibling Page Comparison
[Figure label: Attribute]
23
Sibling Page Comparison
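The sibling page comparison slides are figures, so the algorithm itself is not spelled out in the transcript. A simplified sketch of the general idea, under the assumption that sibling pages share a template: text that appears on every sibling page is treated as an attribute label, and text that varies is treated as a value. The function name and the pre-split line representation are hypothetical.

```python
def split_attributes_and_values(pages):
    """pages: non-empty list of sibling pages, each a list of text lines in page order.

    Lines present on every page are treated as attribute labels (template text).
    Each remaining line is paired with the most recent label that precedes it.
    Returns one list of (attribute, value) pairs per page.
    """
    common = set(pages[0])
    for page in pages[1:]:
        common &= set(page)

    results = []
    for page in pages:
        pairs = []
        current = None
        for line in page:
            if line in common:
                current = line          # template text: start a new attribute
            elif current is not None:
                pairs.append((current, line))  # varying text: value of that attribute
        results.append(pairs)
    return results
```

This exact-match comparison would mislabel a value that happens to be identical on every sibling page, which is presumably one reason the presentation combines it with seed ontology recognition.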
24
Seed Ontology Recognition
- What is a seed ontology? A seed ontology contains as much information as we can collect for one object in a specified application domain with respect to the extraction ontology.
- Why do we use a seed ontology?
25
Seed Ontology Recognition
Marker Name: ABP1
Forward Primer: CTTATGCTGCGAGTGCAGTC
Reverse Primer: AGCAATGGAGAAGTTCCTACC
26
nucleus; zinc ion binding; nucleic acid binding; zinc ion binding; nucleic acid binding; linear; NP_079345; 9606; Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo; NP_079345; Homo sapiens; human; GTTTTTGTGTT………. ATAAGTGCATTAACGGCCCACATG; FLJ14299 msdspagsnprtpessgsgsgg………tagpyyspyalygqrlasasalgyq; hypothetical protein FLJ14299; 8; eight; “8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; : “37,?612,?680”; “37,?610,?585”;
27
Seed Ontology Recognition
28
nucleus; zinc ion binding; nucleic acid binding; zinc ion binding; nucleic acid binding; linear; NP_079345; 9606; Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo; NP_079345; Homo sapiens; human; GTTTTTGTGTT………. ATAAGTGCATTAACGGCCCACATG; FLJ14299 msdspagsnprtpessgsgsgg………tagpyyspyalygqrlasasalgyq; hypothetical protein FLJ14299; 8; eight; “8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; : “37,?612,?680”; “37,?610,?585”;
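One way to read the seed values above: if a page about the seed object shows one of these known values next to a source attribute, that attribute probably corresponds to the target concept the value belongs to. The sketch below illustrates that reading. The grouping of seed values under concept names is an assumption (the slide lists values without labels), the dot in the location patterns is escaped here for strictness, and `map_attributes` is a hypothetical helper.

```python
import re

# Hypothetical seed ontology for one gene: target concept -> patterns for known values.
# The assignment of values to concepts is assumed, not taken from the slides.
SEED = {
    "Protein Name": [r"hypothetical protein FLJ14299"],
    "Organism": [r"Homo sapiens", r"human"],
    "Chromosome Location": [r"8:?p11\.2", r"8:?p11\.23"],
}


def map_attributes(pairs, seed=SEED):
    """pairs: (source attribute, value) tuples taken from a page about the seed object.

    Returns {source attribute: target concept} for each pair whose value
    matches one of the seed patterns (case-insensitive). The first matching
    concept wins; unmatched attributes are left out.
    """
    mapping = {}
    for attribute, value in pairs:
        for concept, patterns in seed.items():
            if any(re.search(p, value, re.IGNORECASE) for p in patterns):
                mapping[attribute] = concept
                break
    return mapping
```

Attributes with no seed match stay unmapped, which leaves them as candidates for the other technique or for ontology enrichment.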
29
Source Location through Semantic Indexing
Motivation:
- Hundreds of available biological repositories
- Time-consuming to browse all of them
- Leads quickly to needed sources for a query
Solution, semantic indexing:
- Meta-data
- Data
30
Source Location through Semantic Indexing: Meta-Data
[Figure labels: Source Organism, Accession Number, Protein Name, Length in Amino Acid, Molecular Weight in Da; ProtoNet]
31
Source Location through Semantic Indexing: Meta-Data
Query: Protein Name = “Hypothetical protein FLJ14299”, Length in Amino Acid = ?
[Figure labels: Length in Amino Acid, Protein Name; ProtoNet]
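The query on this slide mentions two concepts, Protein Name and Length in Amino Acid, and the figure points to ProtoNet as a source that covers both. A minimal sketch of a meta-data index that would support this kind of lookup is shown below. It assumes the index is a plain inverted map from concept to sources; the function names and the second source in the usage note are hypothetical.

```python
from collections import defaultdict


def build_metadata_index(source_attributes):
    """source_attributes: {source name: iterable of concepts the source exposes}.

    Returns an inverted index {concept: set of source names}.
    """
    index = defaultdict(set)
    for source, concepts in source_attributes.items():
        for concept in concepts:
            index[concept].add(source)
    return index


def locate_sources(index, query_concepts):
    """Return the sources that cover every concept mentioned in the query."""
    candidate_sets = [index.get(concept, set()) for concept in query_concepts]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)
```

For example, with an index built from `{"ProtoNet": ["Protein Name", "Length in Amino Acid"], "OtherDB": ["Protein Name"]}` (where `OtherDB` is made up), a query over both concepts narrows the candidates to ProtoNet alone.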
32
Source Location through Semantic Indexing: Data
[Figure labels: Semantic Web, Semantic indexing, Query]
33
Ontology Enrichment
- Likely to have “imperfect” ontologies
  - Incomplete ISA and Part-Of hierarchies
  - Incomplete lexicons
  - Incomplete with respect to concepts
- Can enrich semi-automatically
- Two possibilities:
  - Data frame enrichment
  - Object set and relationship set enrichment
34
Ontology: Data Frame Enrichment
35
Ontology: Object Set and Relationship Set Enrichment
[Figure labels: Source, Target]
36
Ontology: Object Set and Relationship Set Enrichment
[Figure labels: Source Organism, Accession Number, Protein Name, Length in Amino Acid, Molecular Weight in Da]
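The slides describe enrichment as semi-automatic but do not show the procedure. One plausible shape, sketched below under that assumption: the system proposes source attributes that have no counterpart in the target ontology, and a human reviewer decides which ones to add. Matching by case-insensitive name is a stand-in for whatever alignment the real system uses, and both function names are hypothetical.

```python
def propose_additions(source_attributes, ontology_concepts):
    """Return source attributes with no counterpart among the ontology's concepts
    (case-insensitive name match), in source order, as candidates for review."""
    known = {concept.lower() for concept in ontology_concepts}
    return [attr for attr in source_attributes if attr.lower() not in known]


def enrich(ontology_concepts, candidates, approve):
    """Return the concept list extended with the candidates that the
    reviewer callback `approve` accepts; rejected candidates are dropped."""
    return list(ontology_concepts) + [c for c in candidates if approve(c)]
```

Keeping the approval step as a callback makes the human decision explicit, which matches the "semi-automatically" wording used throughout the presentation.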
37
Research Plan
- Build and test the system step by step
- Provide experimental evidence that the issues have been resolved:
  - Source page understanding
  - Source location through semantic indexing
  - Ontology enrichment
38
Research Plan: Source Page Understanding
Training set:
- Choose thresholds
- Set up rules
- Combine results from different techniques
- Refine the seed ontologies
Test set:
- Detect attributes and values
- Form attribute-value pairs
- Recognize mappings from source attribute-value pairs to target concepts
39
Research Plan: Others
- Source location through semantic indexing
- Ontology enrichment
  - Data frame enrichment
  - Concept and relationship set enrichment
40
Delimitations
- Extraction ontology: will not cover all the concepts, relationships, and values in the molecular biology domain
- Source page understanding: only deals with structured and semi-structured source pages
- Data frame enrichment: will not do automatic regular expression enrichment
- Object set and relationship set enrichment: will be limited to enriching ISA and Part-Of hierarchies and simple attribute additions
- Prototype system: will use an available front-end query interface; will not do further integration beyond synchronization with the target gene extraction ontology
41
Contributions
- Will contribute to both information extraction technology and bioinformatics
- Can find appropriate sources, retrieve needed information, understand a source page, and extract useful information automatically
- Can convert understood source pages into semantic web pages automatically
- Can enrich ontologies semi-automatically
- Can likely be extended to other domains