From Tessellations to Table Interpretation

Presentation transcript:

1 From Tessellations to Table Interpretation R. C. Jandhyala 1, M. Krishnamoorthy 1, G. Nagy 1, R. Padmanabhan 1, S. Seth 2, W. Silversmith 1 (1) DocLab, Rensselaer Polytechnic Institute (2) Computer Science and Engineering, University of Nebraska-Lincoln (Supported by NSF grants and the Rensselaer Center for Open Source Software)

2 Goal: Construction of a narrow-domain ontology from semi-structured web data (“table understanding”)

3 Outline A B C D Tilings (rectangular tessellations) X-Y trees (1984) Tables Wang Categories (1996) Grammars


5 Web tables Cannot precisely define human-understandable tables. Convert to smaller set of admissible tables. Why? Algorithmic ease.

6 Admissible Tables Have stub, headings and data cells.

7 Factor out layout-equivalent tables

8 Outline A B C D Tilings (rectangular tessellations) X-Y trees (1984) Tables Wang Categories (1996) Grammars

9 Rectangular Tessellations Partition of an isothetic rectangle into rectangles. Uniquely defined by junction points (location and type). Number of tessellations increases rapidly with table size.

10 XY Tessellations Special case of rectangular tessellations. Successive horizontal and vertical cuts. Easily represented by trees.

11 A tiling and its X-Y Tree (aka slicing structure, puzzle tree, tree map)

12 Non-slicing structures – No XY tree In fact, X-Y tilings are an infinitesimal fraction of all tilings. This helps, because tables never contain this “spiral” structure.

13 Fundamental Idea Use XY trees to automate table processing and understanding.

14 Table to XY tree – EX2XY Applicable to any XY tessellation. Input – Excel Table – Copy and paste or Import. – Edit to make admissible. Output – XY tree – as XML for portability. – as parenthesized string for grammars.

15 Example (

16 After import into Excel

17 After Editing

18 Output - XML … Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars) …

19 Outline A B C D Tilings (rectangular tessellations) X-Y trees (1984) Tables Wang Categories (1996) Grammars

20 Table Grammars Can characterize entire families of tables. Developed grammar for one family. Input - Nested parenthesized notation. Output – Accept/Reject as example of family.

21 Grammar for parsing column headers
S := A (Rule 1)
A := {B} (Rule 2)
B := c [X] B | c [X] (Rules 3 and 4)
X := c X | A X | A | c (Rules 5, 6, 7 and 8)
S is the start symbol. A generates all admissible column headers. B generates category trees. c is a root category. X generates sub-categories.
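The eight rules above are small enough to check by hand, but a recognizer makes the accept/reject behavior concrete. The following is a minimal recursive-descent sketch, not the authors' parser: the function name `accepts` and the decision to ignore whitespace are assumptions made for illustration. One token of lookahead suffices because B can only continue with `c`, and X can only continue with `c` or `{`.

```python
def accepts(sentence):
    """Return True if the sentence is derivable from S in the column-header grammar."""
    tokens = [ch for ch in sentence if not ch.isspace()]  # assumption: whitespace is not significant
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at token {pos}")
        pos += 1

    def parse_A():          # A := {B}
        expect("{")
        parse_B()
        expect("}")

    def parse_B():          # B := c [X] B | c [X]
        expect("c")
        expect("[")
        parse_X()
        expect("]")
        if peek() == "c":
            parse_B()

    def parse_X():          # X := c X | A X | A | c
        if peek() == "c":
            expect("c")
        else:
            parse_A()
        if peek() in ("c", "{"):
            parse_X()

    try:
        parse_A()           # S := A
        return pos == len(tokens)
    except SyntaxError:
        return False


print(accepts("{c [c c] c [c {c [c c]} c {c [c c]}]}"))
print(accepts("{c}"))
```

The first call uses the XY tree sentence given on the parenthesized-notation slide; the second is rejected because every root category must be followed by a bracketed list of sub-categories.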

22 Table Grammars Cannot check if table is consistent. Need further geometric alignment and lexical checks.

23 Outline A B C D Tilings (rectangular tessellations) X-Y trees (1984) Tables Wang Categories (1996) Grammars

24 Logical Structure of Tables How to interpret a table? – Describe relationship between header cells and content cells [Wang, U. Waterloo,1996]. Wang notation – Elegant description. – Dimensionality: Number of category trees. – Cartesian product maps categories to data.

25 Layout independent Wang Notation Different layout and same information means same Wang Notation

26 Wang Category Trees for either table [Figure: category trees with node labels characteristic, gonsity, hepth, fleck, burlam, falder, multon] Any data cell can be designated by a path through each category tree. Leaves correspond to row or column headings.

27 “Real” Table Understanding Analyzing logical structure is not sufficient. Need additional information from title, footnotes, captions, etc. Semantic analysis of the labels is also important – needs external knowledge.

28 Does Wang Notation always exist? Not always! Inconsistent tables do not have Wang Notation. Others can be edited using virtual headers.

29 XY tree to Wang Notation Algorithm Input – XY trees. Output – XML version of Wang Notation. Checks for table consistency.

30 Algorithm Locate principal regions - stub, headers and content cells. Extract Wang categories. Compute Cartesian product of category paths. Match each key to the content of a delta cell.
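The last two steps (Cartesian product of category paths, then matching each key to a delta cell) can be sketched as follows. This is an illustration under stated assumptions, not the XY2WANG implementation: it handles only two category trees, takes them as nested dicts already extracted from the stub and header regions, and assumes the delta cells are stored row-major in the same order as the leaves. The names `leaf_paths` and `wang_mapping`, and the placeholder labels in the example, are hypothetical.

```python
from itertools import product


def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path of a category tree given as nested dicts."""
    for label, children in tree.items():
        path = prefix + (label,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


def wang_mapping(row_tree, col_tree, delta):
    """Map each (row path, column path) key to the content of its delta cell."""
    rows = list(leaf_paths(row_tree))
    cols = list(leaf_paths(col_tree))
    # Consistency check: one delta cell per pair of leaves.
    if len(delta) != len(rows) or any(len(line) != len(cols) for line in delta):
        raise ValueError("delta region does not match the category leaves")
    return {
        (row_path, col_path): delta[i][j]
        for (i, row_path), (j, col_path) in product(enumerate(rows), enumerate(cols))
    }


# Placeholder categories and cell contents, for illustration only.
row_tree = {"Year": {"2001": {}, "2002": {}}}
col_tree = {"Term": {"Fall": {"Midterm": {}, "Final": {}}, "Spring": {}}}
delta = [["d00", "d01", "d02"],
         ["d10", "d11", "d12"]]

mapping = wang_mapping(row_tree, col_tree, delta)
print(mapping[(("Year", "2002"), ("Term", "Fall", "Final"))])
```

A table with more category trees would extend the product to one path per tree; the shape check is where an inconsistent table would be rejected in this sketch.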

31 Conclusions Admissible layouts identified for ease of processing. Algorithms developed for – extracting XY trees from tables. – extracting Wang notation from XY trees. Family of tables identified using a grammar.

32 Future work Augmentations - captions, aggregates, units, etc. Expand the grammar. Automate conversion of table to admissible formats. (

33 THANK YOU

34 Goal: construction of a narrow-domain ontology from semi-structured web data (“table understanding” ) Currently multon is the best choice for rapitting velters. It is about 25% better than burlam or falder, which have the same girby (hepth/gonsity ratio). Check another table to see whether elmer is even better. NOT TODAY!

35 H-first tree can be transformed into V-first tree (and vice-versa)

36 EX2XY: Algorithm Two workhorses: – Vertical_cut – returns leftmost sub-rectangle of a given rectangle. – Horizontal_cut – returns topmost sub-rectangle of a given rectangle.

37 EX2XY: Algorithm (contd.) Used in a pair of procedures P1 and P2. P1 cuts vertically and submits first sub-rectangle to P2 for horizontal cuts. Similarly with P2.
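The alternating-cut idea can be sketched in a few lines of Python. This is not the EX2XY code: it assumes the table is given as a grid in which a merged cell repeats one unique label over every position it spans, represents a rectangle as a hypothetical tuple `(r0, r1, c0, c1)` with exclusive upper bounds, and folds P1 and P2 into a single function `xy_tree` with a direction flag. A cut is allowed only where no cell straddles the cut line.

```python
def is_single_cell(grid, rect):
    r0, r1, c0, c1 = rect
    return len({grid[r][c] for r in range(r0, r1) for c in range(c0, c1)}) == 1


def vertical_cut(grid, rect):
    """Return the leftmost sub-rectangle, or rect itself if no vertical cut exists."""
    r0, r1, c0, c1 = rect
    for j in range(c0 + 1, c1):
        if all(grid[r][j - 1] != grid[r][j] for r in range(r0, r1)):
            return (r0, r1, c0, j)
    return rect


def horizontal_cut(grid, rect):
    """Return the topmost sub-rectangle, or rect itself if no horizontal cut exists."""
    r0, r1, c0, c1 = rect
    for i in range(r0 + 1, r1):
        if all(grid[i - 1][c] != grid[i][c] for c in range(c0, c1)):
            return (r0, i, c0, c1)
    return rect


def split(grid, rect, vertical):
    """Peel off sub-rectangles in one direction until no further cut is possible."""
    cut = vertical_cut if vertical else horizontal_cut
    pieces, rest = [], rect
    while True:
        piece = cut(grid, rest)
        pieces.append(piece)
        if piece == rest:
            return pieces
        r0, r1, c0, c1 = rest
        rest = (r0, r1, piece[3], c1) if vertical else (piece[1], r1, c0, c1)


def xy_tree(grid, rect=None, vertical=True, stuck=False):
    """Build a nested ('V' | 'H', children) tree; leaves are cell labels."""
    if rect is None:
        rect = (0, len(grid), 0, len(grid[0]))
    if is_single_cell(grid, rect):
        return grid[rect[0]][rect[2]]
    pieces = split(grid, rect, vertical)
    if len(pieces) == 1:
        if stuck:  # neither direction admits a cut: a non-slicing structure
            raise ValueError("not an XY tessellation")
        return xy_tree(grid, rect, not vertical, stuck=True)
    return ("V" if vertical else "H",
            [xy_tree(grid, piece, not vertical) for piece in pieces])


table = [["a", "b", "b"],
         ["a", "c", "d"]]
print(xy_tree(table))
```

On a "spiral" arrangement such as `[["a", "a", "b"], ["d", "e", "b"], ["d", "c", "c"]]`, neither direction admits a cut at the top level, so this sketch raises `ValueError` instead of returning a tree.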

38 Parenthesized notation P-notation has 1:1 correspondence with general trees. For above table, the XY tree sentence is: Sxy = {c [c c] c [c {c [c c]} c {c [c c]}]}.
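The 1:1 correspondence can be illustrated by a round trip between a sentence and a nested tree. The sketch below is an assumption-laden reading of the notation, not the authors' tool: it treats every `c` as a leaf and every matched `{…}` or `[…]` pair as an internal node whose children are the items inside it. The names `parse` and `to_string` are hypothetical.

```python
CLOSERS = {"{": "}", "[": "]"}


def parse(sentence):
    """Turn a parenthesized sentence into a nested (opener, children) tree."""
    tokens = [ch for ch in sentence if not ch.isspace()]
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of sentence")
        tok = tokens[pos]
        pos += 1
        if tok == "c":
            return "c"
        if tok in CLOSERS:
            children = []
            while pos < len(tokens) and tokens[pos] != CLOSERS[tok]:
                children.append(node())
            if pos >= len(tokens):
                raise ValueError("unbalanced brackets")
            pos += 1  # consume the matching closer
            return (tok, children)
        raise ValueError(f"unexpected token {tok!r}")

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing input after the tree")
    return tree


def to_string(tree):
    """Serialize a tree back to the parenthesized notation."""
    if tree == "c":
        return "c"
    opener, children = tree
    return opener + " ".join(to_string(child) for child in children) + CLOSERS[opener]


sentence = "{c [c c] c [c {c [c c]} c {c [c c]}]}"
tree = parse(sentence)
print(to_string(tree) == sentence)
```

Parsing and re-serializing returns the original sentence, which is the sense in which the string and the tree carry the same information.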

39 A table with six Wang dimensions

40 XY2WANG: Other features Handles more complex scenarios: – Higher dimensionality. – Deeper nesting of headers. – Repetitive headers.

41 (


43 Raghav’s Experiment


46 Average total time to process a table seconds. Average table size cells before preprocessing. Average preprocessing time seconds. 3 category tables took approximately 27 seconds more than 2 category tables.

47 Tables with aggregates and footnotes - more time to process. Strong correlation between processing time and table size. For future: automatically segmenting augmentations, categories and delta cells using visual cues.