HTML Table Interpretation by Sibling Page Comparison in the Molecular Biology Domain
Cui Tao and David W. Embley
Data Extraction Research Group
Department of Computer Science, Brigham Young University, Provo, UT 84602

1. Introduction

GOALS: To help biologists cross-search various resources
  Examples:
    “Find genes which are longer than 5kbp, whose products have at least two helices, and participate in glycolysis” – GenBank, PDB, KEGG
    “Find genes newly annotated after Jan. 2003 in the fly and worm genomes” – FlyBase, WormBase

PROBLEMS:
  Huge evolving number of Bio-databases
    Molecular biology database collection
      2004: total 548, 162 more than 2003
      2005: total 719, 171 more than 2004
  Different access capabilities
  Syntactic heterogeneity
  Semantic heterogeneity
  Updated at anytime by independent authorities

SOLUTION:
  Source page understanding – Table interpretation
    Table recognition
    Table pattern generalization
    Pattern adjustment
  Information extraction & semantic annotation
  Source location through semantic indexing
  Cross-database query processing

Focus of this poster
  Input: an HTML table
  Output: a formal table notation (Wang notation)

2. Table Interpretation

Table-Interpretation Steps
  HTML table → DOM tree
  Tree matching → Find sibling tables
  Variable fields ~ values & Fixed fields ~ labels
  Infer pattern

2.1. Table Recognition

  Consider all tagged tables
  Unnest
  Filter out tables containing no data (non-data tables)
  [Figure: example page with labels and values marked]

2.2. Sibling Page Comparison

  Sibling pages & sibling tables
  Sibling table match percentage: max match score / tree size
    > the high threshold: exact match or near exact match
    < the low threshold: false match
    In between: sibling tables

  [Figure: DOM trees (table, tr, td) of two sibling tables]
  Sibling table 1:
    Gene Model | Status | Nucleotides (coding/transcript) | Protein | Swissprot | Amino Acids
    F47G6.1 1, 2 | confirmed by cDNA(s) | 1773/7391 bp | WP:CE26812 | DTN1_CAEEL | 590 aa
  Sibling table 2:
    Gene Model | Status | Nucleotides (coding/transcript) | Protein | Amino Acids
    F18H3.5b 1, 2, 3 | confirmed by cDNA(s) | 1029/3051 bp | WP:CE18608 | 342 aa
    F18H3.5a 1, 2 | partially confirmed by cDNA(s) | 1221/1704 bp | WP:CE28918 | 406 aa

2.3. Structure Pattern Generation

  Pre-defined structure templates
  Matches any pre-defined pattern template?
  Generates a specific structure pattern for the table (location, structure)
  Dynamically adjust the structure pattern
    An optional label
    An optional value

3. Results

EXPERIMENTAL RESULTS:
  Test Set: 10 web sites; 100 sibling pages; 862 HTML tables
  Table Recognition: correctly eliminated all but 3 non-data tables
  Pattern Generation: successfully recognized 28 of 29 patterns
  Dynamic Adjustment: 5 location adjustments; 12 structure adjustments; all correct

4. Conclusions

We can:
  Recognize data tables
  Find labels and values
  Infer table patterns
  Dynamically adjust table patterns
Domain Generality: work for other domains
Pattern Combinations

Contact Information
  Cui Tao, ctao@cs.byu.edu
  David W. Embley, embley@cs.byu.edu
  http://www.deg.byu.edu/
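
The sibling-table test (match percentage = max match score / tree size, compared against a high and a low threshold) and the "fixed fields ~ labels, variable fields ~ values" step can be illustrated with a minimal Python sketch. Several things here are assumptions and not taken from the poster: the poster does not name its tree-matching algorithm, so a simple top-down ordered tree matching stands in for it; "tree size" is read as the size of the larger tree; the threshold values 0.9 and 0.3 are hypothetical; and `Node`, `row`, `classify`, and `split_labels_values` are illustrative names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A DOM-like node: a tag name, or the text itself for a text leaf."""
    label: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(tree_size(c) for c in node.children)


def match_score(a: Node, b: Node) -> int:
    """Simple top-down ordered tree matching (an assumed stand-in):
    counts the node pairs in a best order-preserving matching."""
    if a.label != b.label:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match_score(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


HIGH, LOW = 0.9, 0.3  # hypothetical thresholds; the poster gives no values


def classify(a: Node, b: Node, high: float = HIGH, low: float = LOW) -> str:
    """Apply the three-way threshold rule to two table trees."""
    pct = match_score(a, b) / max(tree_size(a), tree_size(b))
    if pct > high:
        return "exact or near-exact match"
    if pct < low:
        return "false match"
    return "sibling tables"


def leaves(node: Node) -> List[str]:
    """Text leaves in document order (empty cells are not handled here)."""
    if not node.children:
        return [node.label]
    return [text for c in node.children for text in leaves(c)]


def split_labels_values(a: Node, b: Node) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Fixed fields become labels, variable fields become values.
    Assumes both tables have their cells in the same positions."""
    labels, values = [], []
    for x, y in zip(leaves(a), leaves(b)):
        if x == y:
            labels.append(x)
        else:
            values.append((x, y))
    return labels, values


def row(*cells: str) -> Node:
    """Build a tr node with one td per cell text."""
    return Node("tr", [Node("td", [Node(c)]) for c in cells])


# Two small tables built from cell texts shown on the poster.
t1 = Node("table", [row("Status", "Protein"),
                    row("confirmed by cDNA(s)", "WP:CE26812")])
t2 = Node("table", [row("Status", "Protein"),
                    row("partially confirmed by cDNA(s)", "WP:CE28918")])

print(classify(t1, t2))
print(split_labels_values(t1, t2))
```

In this toy case the two trees share their structure and header cells but differ in the data cells, so the match percentage falls between the two assumed thresholds and the header texts come out as labels. Real pages would need a tolerant cell alignment in place of the positional `zip`.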