1 From Tessellations to Table Interpretation R. C. Jandhyala 1, M. Krishnamoorthy 1, G. Nagy 1, R. Padmanabhan 1, S. Seth 2, W. Silversmith 1 1 DocLab,


1 1 From Tessellations to Table Interpretation. R. C. Jandhyala (1), M. Krishnamoorthy (1), G. Nagy (1), R. Padmanabhan (1), S. Seth (2), W. Silversmith (1). Affiliations: (1) DocLab, Rensselaer Polytechnic Institute; (2) Computer Science and Engineering, University of Nebraska-Lincoln. (Supported by NSF Grants # 044114854 and 0414644, and Rensselaer Center for Open Source Software)

2 2 Goal: Construction of a narrow-domain ontology from semi-structured web data (“table understanding”)

3 3 Outline A B C D Tilings (rectangular tessellations) X-Y trees (1984) Tables Wang Categories (1996) Grammars

4 4 Outline A B C D Tilings (rectangular tessellations) X-Y trees (1984) Tables Wang Categories (1996) Grammars

5 5 Web tables Cannot precisely define human-understandable tables. Convert to smaller set of admissible tables. Why? Algorithmic ease.

6 6 Admissible Tables Have stub, headings and data cells.

7 7 Factor out layout-equivalent tables

8 8 Outline A B C D Tilings (rectangular tessellations) X-Y trees (1984) Tables Wang Categories (1996) Grammars

9 9 Rectangular Tessellations Partition of an isothetic rectangle into rectangles. Uniquely defined by junction points (location and type). Number of tessellations increases rapidly with table size.

10 10 XY Tessellations Special case of rectangular tessellations. Successive horizontal and vertical cuts. Easily represented by trees.
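To make the "successive horizontal and vertical cuts" idea concrete, here is a minimal sketch that builds a tree from a tiling by recursive cutting. It is an illustration only, not the authors' code: the tuple layout `(x0, y0, x1, y1)`, the function name `xy_tree`, and the node labels `"V"` and `"H"` are all assumptions made for this example, and the input is assumed to be a valid tiling of a rectangle.

```python
def xy_tree(rects):
    """Build an X-Y tree from a tiling given as (x0, y0, x1, y1) tuples.

    Returns a single rectangle for a leaf, a (label, children) tuple for an
    internal node, or None if the tiling cannot be produced by successive
    full-width or full-height cuts. Assumes the input is a valid tiling.
    """
    if len(rects) == 1:
        return rects[0]
    # axis 0: cuts at fixed x (labelled "V"); axis 1: cuts at fixed y ("H").
    for axis, label in ((0, "V"), (1, "H")):
        lo = min(r[axis] for r in rects)
        hi = max(r[axis + 2] for r in rects)
        # Candidate cut positions are tile edges strictly inside the region.
        edges = sorted({r[axis] for r in rects} - {lo})
        # A cut is valid only if no tile straddles it.
        cuts = [e for e in edges
                if all(r[axis + 2] <= e or r[axis] >= e for r in rects)]
        if cuts:
            bounds = [lo] + cuts + [hi]
            children = []
            for a, b in zip(bounds, bounds[1:]):
                group = [r for r in rects if a <= r[axis] and r[axis + 2] <= b]
                child = xy_tree(group)
                if child is None:
                    return None
                children.append(child)
            return (label, children)
    return None  # no cut exists in either direction


# A full-height left column next to two stacked tiles.
A, B, C = (0, 0, 1, 2), (1, 0, 2, 1), (1, 1, 2, 2)
simple = xy_tree([A, B, C])

# Five tiles arranged so that every candidate cut is straddled by some tile.
pinwheel = xy_tree([(0, 0, 2, 1), (2, 0, 3, 2), (1, 2, 3, 3),
                    (0, 1, 1, 3), (1, 1, 2, 2)])
```

In this sketch `simple` is one vertical cut followed by a horizontal cut of the right-hand part, while `pinwheel` is `None` because no full-width or full-height cut exists.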

11 11 A tiling and its X-Y Tree (aka slicing structure, puzzle tree, tree map)

12 12 Non-slicing structures – No XY tree In fact, X-Y tilings are an infinitesimal fraction of all tilings. This helps, because tables never contain this “spiral” structure.

13 13 Fundamental Idea Use XY trees to automate table processing and understanding.

14 14 Table to XY tree – EX2XY Applicable to any XY tessellation. Input – Excel Table – Copy and paste or Import. – Edit to make admissible. Output – XY tree – as XML for portability. – as parenthesized string for grammars.

15 15 Example (http://www40.statcan.ca/l01/cst01/econ50-eng.htm)

16 16 After import into Excel

17 17 After Editing

18 18 Output - XML … Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars) …

19 19 Outline A B C D Tilings (rectangular tessellations) X-Y trees (1984) Tables Wang Categories (1996) Grammars

20 20 Table Grammars Can characterize entire families of tables. Developed grammar for one family. Input - Nested parenthesized notation. Output – Accept/Reject as example of family.

21 21 Grammar for parsing column headers
S := A (Rule 1)
A := {B} (Rule 2)
B := c [X] B | c [X] (Rules 3 and 4)
X := c X | A X | A | c (Rules 5, 6, 7 and 8)
S is the start symbol. A generates all admissible column headers. B generates category trees. c is a root category. X generates sub-categories.
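The eight rules above are small enough to check by hand with a recursive-descent recognizer. The sketch below is one possible reading, not the authors' parser: it treats X as one or more items that are each either `c` or an `A`, and B as one or more `c [X]` groups, which generates the same strings as Rules 3 to 8. The names `tokenize`, `HeaderParser` and `accepts` are hypothetical.

```python
def tokenize(sentence):
    """Split a parenthesized sentence into the symbols { } [ ] c."""
    tokens = []
    for ch in sentence:
        if ch.isspace():
            continue
        if ch not in "{}[]c":
            raise ValueError(f"unexpected character {ch!r}")
        tokens.append(ch)
    return tokens


class HeaderParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, symbol):
        if self.peek() != symbol:
            raise ValueError(f"expected {symbol!r} at position {self.pos}")
        self.pos += 1

    def parse_a(self):
        # Rule 2: A := {B}
        self.expect("{")
        self.parse_b()
        self.expect("}")

    def parse_b(self):
        # Rules 3 and 4: one or more groups of the form c [X]
        while True:
            self.expect("c")
            self.expect("[")
            self.parse_x()
            self.expect("]")
            if self.peek() != "c":
                break

    def parse_x(self):
        # Rules 5 to 8: one or more items, each either c or a nested A
        count = 0
        while self.peek() in ("c", "{"):
            if self.peek() == "c":
                self.pos += 1
            else:
                self.parse_a()
            count += 1
        if count == 0:
            raise ValueError(f"expected c or '{{' at position {self.pos}")


def accepts(sentence):
    """Return True if the sentence is derivable from S (Rule 1: S := A)."""
    try:
        parser = HeaderParser(tokenize(sentence))
        parser.parse_a()
        return parser.pos == len(parser.tokens)
    except ValueError:
        return False
```

Under this reading, `accepts("{c [c c] c [c {c [c c]} c {c [c c]}]}")` accepts the XY tree sentence quoted on slide 38, while `accepts("{c}")` rejects because a root category must be followed by a bracketed group.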

22 22 Table Grammars Cannot check if table is consistent. Need further geometric alignment and lexical checks.

23 23 Outline A B C D Tilings (rectangular tessellations) X-Y trees (1984) Tables Wang Categories (1996) Grammars

24 24 Logical Structure of Tables How to interpret a table? – Describe relationship between header cells and content cells [Wang, U. Waterloo, 1996]. Wang notation – Elegant description. – Dimensionality: Number of category trees. – Cartesian product maps categories to data.

25 25 Layout independent Wang Notation Different layout and same information means same Wang Notation

26 26 Wang Category Trees for either table characteristic gonsity hepth fleck burlam falder multon Any data cell can be designated by a path through each category tree. Leaves correspond to row or column headings.

27 27 “Real” Table Understanding Analyzing logical structure not sufficient. Need additional information from title, footnotes, captions, etc. Semantic analysis of the labels also important – need external knowledge.

28 28 Does Wang Notation always exist? Not always! Inconsistent tables do not have Wang Notation. Others can be edited using virtual headers.

29 29 XY tree to Wang Notation Algorithm Input – XY trees. Output – XML version of Wang Notation. Checks for table consistency.

30 30 Algorithm Locate principal regions - stub, headers and content cells. Extract Wang categories. Compute Cartesian product of category paths. Match each key to the content of a delta cell.
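The last two steps, computing the Cartesian product of category paths and matching each key to a delta cell, can be sketched as follows for the two-dimensional case. This is an illustrative assumption, not the XY2WANG implementation: category trees are represented as nested dicts, delta cells as a row-major list of lists, and the labels (`Year`, `Region`, and so on) and the function names `leaf_paths` and `wang_map` are invented for the example.

```python
from itertools import product


def leaf_paths(tree, prefix=()):
    """List every root-to-leaf path of a category tree given as nested dicts."""
    paths = []
    for label, subtree in tree.items():
        path = prefix + (label,)
        if subtree:
            paths.extend(leaf_paths(subtree, path))
        else:
            paths.append(path)
    return paths


def wang_map(row_tree, col_tree, delta):
    """Map each (row path, column path) key to the matching delta cell.

    Raises ValueError when the delta-cell grid does not have one cell per
    key, which is a simple stand-in for a table consistency check.
    """
    rows = leaf_paths(row_tree)
    cols = leaf_paths(col_tree)
    if len(delta) != len(rows) or any(len(r) != len(cols) for r in delta):
        raise ValueError("delta-cell grid does not match the category paths")
    keys = product(rows, cols)  # row-major order, same as the delta grid
    values = (value for row in delta for value in row)
    return dict(zip(keys, values))


# Hypothetical table: years down the stub, nested regions across the header.
row_tree = {"Year": {"2007": {}, "2008": {}}}
col_tree = {"Region": {"East": {}, "West": {"Coast": {}, "Inland": {}}}}
delta = [[1, 2, 3],
         [4, 5, 6]]

table = wang_map(row_tree, col_tree, delta)
cell = table[(("Year", "2008"), ("Region", "West", "Inland"))]
```

Because the keys depend only on the category trees and not on which tree is laid out as rows, transposing the table changes the order of the key components but not the set of paths or the data they address.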

31 31 Conclusions Admissible layouts identified for ease of processing. Algorithms developed for – extracting XY trees from tables. – extracting Wang notation from XY trees. Family of tables identified using a grammar.

32 32 Future work Augmentations - captions, aggregates, units, etc. Expand the grammar. Automate conversion of table to admissible formats. (http://www40.statcan.ca/l01/cst01/agri111a-eng.htm)

33 33 THANK YOU

34 34 Goal: construction of a narrow-domain ontology from semi-structured web data (“table understanding”) Currently multon is the best choice for rapitting velters. It is about 25% better than burlam or falder, which have the same girby (hepth/gonsity ratio). Check another table to see whether elmer is even better. NOT TODAY!

35 35 H-first tree can be transformed into V-first tree (and vice-versa)

36 36 EX2XY: Algorithm Two workhorses: – Vertical_cut – returns leftmost sub-rectangle of a given rectangle. – Horizontal_cut – returns topmost sub-rectangle of a given rectangle.

37 37 EX2XY: Algorithm (contd.) Used in a pair of procedures P1 and P2. P1 cuts vertically and submits first sub-rectangle to P2 for horizontal cuts. Similarly with P2.
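The two workhorses and the P1/P2 pair can be sketched on a grid of cell identifiers, where a merged cell repeats its identifier across the grid positions it covers. This is only one plausible reading of the slides, with several assumptions: the grid representation, the half-open `(top, left, bottom, right)` bounds, the `stuck` flag used to detect a region that neither procedure can cut, and the `ex2xy` wrapper are all hypothetical, and distinct cells are assumed to carry distinct identifiers.

```python
def is_single_cell(grid, top, left, bottom, right):
    first = grid[top][left]
    return all(grid[r][c] == first
               for r in range(top, bottom) for c in range(left, right))


def vertical_cut(grid, top, left, bottom, right):
    """Return the right edge (exclusive) of the leftmost sub-rectangle."""
    for c in range(left + 1, right):
        # A cut is allowed only where no merged cell spans the boundary.
        if all(grid[r][c - 1] != grid[r][c] for r in range(top, bottom)):
            return c
    return right


def horizontal_cut(grid, top, left, bottom, right):
    """Return the bottom edge (exclusive) of the topmost sub-rectangle."""
    for r in range(top + 1, bottom):
        if all(grid[r - 1][c] != grid[r][c] for c in range(left, right)):
            return r
    return bottom


def p1(grid, top, left, bottom, right, stuck=False):
    """Cut vertically and hand each sub-rectangle to p2."""
    pieces, start = [], left
    while start < right:
        end = vertical_cut(grid, top, start, bottom, right)
        pieces.append((start, end))
        start = end
    if len(pieces) > 1:
        return ("V", [p2(grid, top, s, bottom, e) for s, e in pieces])
    if is_single_cell(grid, top, left, bottom, right):
        return grid[top][left]
    if stuck:  # p2 could not cut this region either
        raise ValueError("not an XY tessellation")
    return p2(grid, top, left, bottom, right, stuck=True)


def p2(grid, top, left, bottom, right, stuck=False):
    """Cut horizontally and hand each sub-rectangle to p1."""
    pieces, start = [], top
    while start < bottom:
        end = horizontal_cut(grid, start, left, bottom, right)
        pieces.append((start, end))
        start = end
    if len(pieces) > 1:
        return ("H", [p1(grid, s, left, e, right) for s, e in pieces])
    if is_single_cell(grid, top, left, bottom, right):
        return grid[top][left]
    if stuck:  # p1 could not cut this region either
        raise ValueError("not an XY tessellation")
    return p1(grid, top, left, bottom, right, stuck=True)


def ex2xy(grid):
    return p1(grid, 0, 0, len(grid), len(grid[0]))


# "T" is a title spanning three columns; "a" is a stub cell spanning two rows.
tree = ex2xy([["T", "T", "T"],
              ["a", "b", "c"],
              ["a", "d", "e"]])
```

For this small grid the sketch first cuts horizontally below the title, then vertically into three columns, then horizontally inside the two right-hand columns.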

38 38 Parenthesized notation P-notation has 1:1 correspondence with general trees. For above table, the XY tree sentence is: Sxy = {c [c c] c [c {c [c c]} c {c [c c]}]}.

39 39 A table with six Wang dimensions

40 40 XY2WANG: Other features Handles more complex scenarios: – Higher dimensionality. – Deeper nesting of headers. – Repetitive headers.

41 41 (http://www40.statcan.ca/l01/cst01/econ50-eng.htm)

42 42

43 43 Raghav’s Experiment

44 44

45 45

46 46 Average total time to process a table - 231 seconds. Average table size - 587 cells before preprocessing. Average preprocessing time - 104 seconds. Three-category tables took approximately 27 seconds longer than two-category tables.

47 47 Tables with aggregates and footnotes - more time to process. Strong correlation between processing time and table size. For future: automatically segmenting augmentations, categories and delta cells using visual cues.

