Table Interpretation by Sibling Page Comparison Cui Tao & David W. Embley Data Extraction Group Department of Computer Science Brigham Young University.


1 Table Interpretation by Sibling Page Comparison Cui Tao & David W. Embley Data Extraction Group Department of Computer Science Brigham Young University Supported by NSF

2 Table Interpretation (in context) Context: Table Understanding Table Recognition Table Interpretation Table Conceptualization Table Understanding Applications Not only “understanding” wrt community knowledge But also creation or augmentation of community knowledge Challenging Conceptual-Modeling Work

3 Table Interpretation (in context) Context: Table Understanding Table Recognition Table Interpretation with Sibling Pages (TISP) Table Conceptualization Table Understanding Applications Not only “understanding” wrt community knowledge But also creation or augmentation of community knowledge Challenging Conceptual-Modeling Work

4 TISP: Table Recognition and Interpretation Recognize tables (discard non-tables) Locate table labels Locate table values Find label/value associations

5 Recognize Tables Data Table Layout Tables (discard) Nested Data Tables

6 Locate Table Labels Examples: Identification.Gene model(s).Protein; Identification.Gene model(s).2

7 Locate Table Labels Examples: Identification.Gene model(s).Gene Model; Identification.Gene model(s).2

8 Locate Table Values Value

9 Find Label/Value Associations Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918

10 Conceptual Table Interpretation Wang Notation [Wang96]; (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918 Table Ontology
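The label/value association on this slide can be held in a small lookup structure. The following is a minimal sketch, assuming each label is a dotted path string and a cell is addressed by an unordered set of label paths. The names `associations` and `lookup` are hypothetical and do not come from the presentation.

```python
# Hypothetical in-memory form of a Wang-style association:
# a set of dotted label paths maps to one cell value.
associations = {
    frozenset({
        "Identification.Gene model(s).Protein",
        "Identification.Gene model(s).2",
    }): "WP:CE28918",
}


def lookup(table, *label_paths):
    """Return the value addressed by the given label paths, or None."""
    return table.get(frozenset(label_paths))


# The order of the label paths does not matter in this sketch.
value = lookup(
    associations,
    "Identification.Gene model(s).2",
    "Identification.Gene model(s).Protein",
)
print(value)
```

Using a `frozenset` as the key is a design choice of this sketch: it treats the two label paths as jointly identifying the cell, regardless of the order in which they are given.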

11 Interpretation Technique: Sibling Page Comparison

12 Same

13 Interpretation Technique: Sibling Page Comparison Almost Same

14 Interpretation Technique: Sibling Page Comparison Different Same

15 Technique Details Unnest tables Match tables in sibling pages “Perfect” match (table for layout → discard) “Reasonable” match (sibling table) Determine/Use Table-Structure Pattern Discover pattern Pattern usage Dynamic pattern adjustment

16 Table Unnesting

17 Match Based on DOM Tree

18 Simple Tree Matching Algorithm Labels Values [Yang91] Match Score Categorization: Exact/Near-Exact, Sibling-Table, False
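The simple tree matching algorithm named on this slide counts the largest number of node pairs that can be matched between two ordered trees while preserving parent/child and sibling order. Below is a minimal sketch, assuming DOM nodes are modeled as `(tag, children)` tuples. The normalization by average tree size and the two thresholds in `categorize` are illustrative assumptions, not values from the presentation.

```python
def tree_size(node):
    """Number of nodes in a (tag, children) tree."""
    _tag, children = node
    return 1 + sum(tree_size(child) for child in children)


def simple_tree_matching(a, b):
    """Size of the maximum order-preserving matching between trees a and b."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: best matching between the first i children of a
    # and the first j children of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def match_score(a, b):
    """Matched nodes divided by the average tree size (assumed normalization)."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


def categorize(score, exact_threshold=0.9, sibling_threshold=0.5):
    """Map a score to the slide's three categories; thresholds are assumptions."""
    if score >= exact_threshold:
        return "exact/near-exact"
    if score >= sibling_threshold:
        return "sibling-table"
    return "false"


row = ("tr", [("td", []), ("td", [])])
two_rows = ("table", [row, row])
one_row = ("table", [row])
print(categorize(match_score(two_rows, two_rows)))
print(categorize(match_score(two_rows, one_row)))
print(categorize(match_score(two_rows, ("div", []))))
```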

19 Table Structure Patterns Regularity Expectations: ({L} {V})^n; ({L})^n (({V})^n)^+; … Pattern combinations are also possible.

20 Pattern Usage (Location.Genetic Position) = X:12.69 +/- 0.000 cM [mapping data] (Location.Genomic Position) = X:13518823..13515773 bp
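Once a structure pattern is known, extraction amounts to reading cells in the order the pattern prescribes. The sketch below assumes a table has already been reduced to a list of rows of cell strings, and that the Location table follows the "one label, then one value, repeated n times" pattern. The function names are hypothetical.

```python
def apply_label_value_pattern(rows, prefix):
    """Rows of the form [label, value]: pair each label with its value."""
    return {f"{prefix}.{label}": value for label, value in rows}


def apply_header_pattern(rows):
    """First row holds n labels; every later row holds n values."""
    labels, *value_rows = rows
    return [dict(zip(labels, values)) for values in value_rows]


# Cell contents taken from the Pattern Usage slide.
location_rows = [
    ["Genetic Position", "X:12.69 +/- 0.000 cM [mapping data]"],
    ["Genomic Position", "X:13518823..13515773 bp"],
]
extracted = apply_label_value_pattern(location_rows, "Location")
for label_path, value in extracted.items():
    print(f"({label_path}) = {value}")
```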

21 Dynamic Pattern Adjustment Before: ({L})^5 (({V})^5)^+ After: ({L})^5 (({V})^5)^+ | ({L})^6 (({V})^6)^+
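The adjustment on this slide widens a pattern with five labels so that it also accepts six. One way to picture this is as a set of accepted label-row widths that grows when a new sibling table does not fit. This is only a sketch under that assumption; how TISP actually detects and records a variation is not stated here, and `adjust_widths` and `render` are hypothetical names.

```python
def adjust_widths(known_widths, label_row):
    """Return the accepted widths, extended by this label row's width."""
    return known_widths | {len(label_row)}


def render(known_widths):
    """Write the accepted widths as alternatives of the header pattern."""
    return " | ".join(
        f"({{L}})^{n} (({{V}})^{n})^+" for n in sorted(known_widths)
    )


widths = {5}
# Hypothetical label row with six labels found on a later sibling page.
new_label_row = [f"label{i}" for i in range(1, 7)]
widths = adjust_widths(widths, new_label_row)
print(render(widths))
```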

22 TISP Evaluation Applications Commercial: car ads Scientific: molecular biology Geopolitical: US states and countries Data: > 2,000 tables, 275 sibling tables, 35 web sites Evaluation Initial two sibling pages Correct separation of data tables from layout tables? Correct pattern recognition? Remaining tables in site Information properly extracted? Able to detect and adjust for pattern variations?

23 Experimental Results Table recognition: correctly discarded 157 of 158 layout tables Pattern recognition: correctly found 69 of 72 structure patterns Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
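The counts on this slide can be read as simple success rates. The short computation below only divides the reported counts; it is not the F-measure calculation, whose precision and recall inputs are not given in this transcript.

```python
# Counts taken from the Experimental Results slide: (correct, total).
results = {
    "layout tables correctly discarded": (157, 158),
    "structure patterns correctly found": (69, 72),
}

rates = {
    name: round(100 * correct / total, 1)
    for name, (correct, total) in results.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```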

24 Discovered Difficulties Abundance of null entries Multiple tables as a single table Recognize and group Use box model [Gatterbauer07] Factored labels

25 Table Understanding Table Recognition Data table vs. table for layout Adjust (group table components, defactor labels, …) Table Interpretation Populate table ontology Additional table-ontology elements (title, footnotes, …) Table Conceptualization Capture table semantics Reverse engineer as a conceptual model Table Understanding Embed within a community ontology Alternatively, augment community knowledge

26 Knowledge Generation TANGO (Table Analysis for Generating Ontologies) repeatedly turns raw tables into conceptual mini-ontologies and integrates them into a growing ontology. Repeat until ontology developed: 1. recognize table 2. interpret table 3. conceptualize table 4. merge 5. adjust Growing Ontology
Example table:
fleckvelter   gonsity (ld/gg)   hepth (gd)
burlam        1.2               120
falder        2.3               230
multon        2.5               400

27 Conclusions and Future Opportunities Conclusions Table Interpretation: overall F-measure of 94.5% Can successfully apply sibling-page technique Future Opportunities Table understanding Knowledge generation Challenging conceptual-modeling work www.deg.byu.edu

