Source Page Understanding for Heterogeneous Molecular Biological Data


1 Source Page Understanding for Heterogeneous Molecular Biological Data
Cui Tao. Supported by NSF.

2 Introduction
Online biological data: highly diverse in both granularity and variety; available in various formats; uses different terminologies, ID systems, units, …
Automatically understanding heterogeneous source pages is a challenge.
Extraction-ontology-based source page understanding.
2/23/2019

3 Extraction Ontology (Partial)

4 Extraction Ontology (Partial)

5 Extraction Ontology (Partial)

6 Extraction Ontology (Partial)

7 Extraction Ontology (Partial)

8 Extraction Ontology Construction
Knowledge sources: Gene Ontology (thousands of terms); All Species Toolkit (1,231,935 names in total); protein databases (thousands of protein names); regular expressions and keywords (Molecular Function, Biological Process, Cellular Component).
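The slide lists lexicons, regular expressions, and keywords as the raw material for the extraction ontology but does not show how they are combined. The following is a minimal sketch under stated assumptions: the class `ConceptRecognizer`, its field names, and the accession pattern `NP_\d+` are hypothetical and chosen only for illustration; the sample terms are taken from the values shown later in the deck.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ConceptRecognizer:
    """Hypothetical recognizer for one concept of an extraction ontology."""
    name: str
    lexicon: set = field(default_factory=set)           # known terms, lower-cased
    value_patterns: list = field(default_factory=list)  # regexes a whole value must match
    keywords: list = field(default_factory=list)        # regexes hinting at an attribute label

    def matches_value(self, text):
        """True if the text is a known term or fully matches a value pattern."""
        t = text.strip()
        if t.lower() in self.lexicon:
            return True
        return any(re.fullmatch(p, t) for p in self.value_patterns)

    def matches_keyword(self, text):
        """True if the text contains one of the concept's keywords."""
        return any(re.search(k, text, re.IGNORECASE) for k in self.keywords)


# The terms come from the slides; the accession pattern is an assumed example.
molecular_function = ConceptRecognizer(
    name="Molecular Function",
    lexicon={"zinc ion binding", "nucleic acid binding"},
    keywords=[r"molecular\s+function"],
)
protein_id = ConceptRecognizer(
    name="Protein accession",
    value_patterns=[r"NP_\d+"],
)

print(molecular_function.matches_value("Zinc ion binding"))
print(molecular_function.matches_keyword("Molecular Function:"))
print(protein_id.matches_value("NP_079345"))
```

Keeping the lexicon, value patterns, and keywords as separate fields lets a large term list (such as the species names mentioned above) be loaded without touching the pattern logic.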

9 Source Page Understanding

12 Source Page Understanding
Three steps: (1) recognize attributes and values; (2) find attribute-value pairs; (3) map attribute-value pairs to target concepts.
Two techniques: sibling page comparison; seed ontology recognition.
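The deck does not spell out the sibling page comparison algorithm, so the following is only a rough sketch of the first two steps under one assumption: text fragments that recur unchanged across two sibling pages are treated as attribute labels, and fragments that differ are treated as values of the nearest preceding label. The function name and the second page's contents are made up for illustration.

```python
def pair_attributes_with_values(page_a, page_b):
    """Return (attribute, value) pairs for page_a.

    page_a and page_b are lists of text fragments in document order.
    Assumption: fragments present on both pages are attribute labels;
    fragments unique to page_a are values of the most recent label.
    """
    shared = set(page_a) & set(page_b)
    pairs = []
    current_attribute = None
    for fragment in page_a:
        if fragment in shared:
            current_attribute = fragment
        elif current_attribute is not None:
            pairs.append((current_attribute, fragment))
    return pairs


# Illustrative sibling pages; page_b is invented sample data.
page_a = ["Organism", "Homo sapiens", "Chromosome", "8",
          "Function", "zinc ion binding"]
page_b = ["Organism", "Mus musculus", "Chromosome", "11",
          "Function", "DNA binding"]

for attribute, value in pair_attributes_with_values(page_a, page_b):
    print(attribute, "->", value)
```

A sketch like this misclassifies a value that happens to be identical on both pages (two records for the same organism, for example), so it should be read as an illustration of the idea, not as the method used in the presentation.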

13 Sibling Page Comparison

14 Sibling Page Comparison

15 Sibling Page Comparison

16 Sibling Page Comparison
Attribute

17 Sibling Page Comparison

18 Sibling Page Comparison

19 Seed Ontology Recognition
What is a seed ontology? Why do we use a seed ontology?

20
Homo sapiens; human; zinc ion binding; nucleus; zinc ion binding; nucleic acid binding; linear; NP_079345; 9606; Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo; Homo sapiens; human; GTTTTTGTGTT……….ATAAGTGCATTAACGGCCCACATG; FLJ14299; msdspagsnprtpessgsgsgg………tagpyyspyalygqrlasasalgyq; hypothetical protein FLJ14299; 8; eight; “8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; : “37,?612,?680”; “37,?610,?585”;
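The values on this slide mix literal strings with quoted regular expressions. A minimal sketch of how such seed values might be matched against the cells of a source page follows; the function name, the case-insensitive whole-cell matching rule, and the sample cells are assumptions, while the literals and patterns are copied from the slide.

```python
import re

# Literal seed values copied from the slide (a subset), stored lower-cased.
SEED_LITERALS = {
    "homo sapiens", "human", "zinc ion binding", "nucleic acid binding",
    "nucleus", "linear", "np_079345", "9606", "flj14299",
    "hypothetical protein flj14299", "8", "eight",
}

# Quoted patterns copied from the slide, written as Python raw strings.
SEED_PATTERNS = [
    r"8:?p\s?12", r"8:?p11.2", r"8:?p11.23",
    r"37,?612,?680", r"37,?610,?585",
]


def find_seed_hits(cells):
    """Return the indexes of cells recognized as seed values.

    Assumption: a cell counts as a hit when its whole text equals a
    literal (ignoring case) or fully matches one of the patterns.
    """
    hits = []
    for index, cell in enumerate(cells):
        text = cell.strip()
        if text.lower() in SEED_LITERALS:
            hits.append(index)
        elif any(re.fullmatch(p, text) for p in SEED_PATTERNS):
            hits.append(index)
    return hits


# Invented page cells: labels at even positions, values at odd positions.
cells = ["Organism", "Homo sapiens", "Location", "8p11.23",
         "Start", "37,612,680", "Length", "2095"]
print(find_seed_hits(cells))
```

The optional comma in a pattern such as `37,?612,?680` lets the same seed value match both `37612680` and `37,612,680`, which is one way to cope with the formatting differences between sources mentioned in the introduction.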

21 Seed Ontology Recognition

22
Homo sapiens; human; nucleus; zinc ion binding; nucleic acid binding; 9606; Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo; zinc ion binding; nucleic acid binding; NP_079345; nucleus; linear; NP_079345; FLJ14299; GTTTTTGTGTT……….ATAAGTGCATTAACGGCCCACATG; msdspagsnprtpessgsgsgg………tagpyyspyalygqrlasasalgyq; 8; eight; “8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; : hypothetical protein FLJ14299; “37,?612,?680”; “37,?610,?585”;

23 Evaluation
“Training”: determine thresholds; set up rules for recognizing attribute-value patterns; determine rules for combining different pair-wise comparisons; refine seed ontologies.
Test: structure recognition (test at the column/row level); measure precision/recall values; mapping recognition (test at the concept level).
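The slide names precision and recall as the evaluation measures without defining them. The standard definitions are precision = correct / predicted and recall = correct / expected; the sketch below applies them to sets of attribute-value pairs. The function name and the sample pairs are illustrative assumptions, not results from the presentation.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two collections of items.

    precision = |predicted ∩ gold| / |predicted|
    recall    = |predicted ∩ gold| / |gold|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: three pairs recognized, four pairs expected, two correct.
predicted = {("Organism", "Homo sapiens"), ("Chromosome", "8"),
             ("Length", "human")}
gold = {("Organism", "Homo sapiens"), ("Chromosome", "8"),
        ("Function", "zinc ion binding"), ("Protein", "NP_079345")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

The same function works at either level named on the slide, since only the items being compared change (columns or rows for structure recognition, concept mappings for mapping recognition).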

24 Contribution
Will contribute to both information extraction technology and bioinformatics.
Can automatically understand both the structure and the semantics of source pages in the molecular biology domain.

