From Tessellations to Table Interpretation Ramana C. Jandhyala DocLab, RPI.



Introduction
Novel aspects of our work:
–Focus on computer-constructed web tables
–Using commercial software
–Describing tables using XY trees
–Extracting the relationship of headers to content cells
This work formalizes the 200-table experiment conducted by Raghav. The tables were imported from 10 websites into Excel and manually edited into a form that can be processed algorithmically.
Average editing time: 104 sec. Average table size: 587 cells. Augmentations were not considered.

Rectangular Tessellations
Rectangular Tiling/Discrete Rectangular Tessellation
–Partition of an isothetic rectangle into rectangles
–Geometry uniquely defined by the locations and types of junction points
–The number of tessellations $N_{all}(m)$ increases exponentially with table size.
XY Tessellations
–Special case of rectangular tessellations
–Obtained by successive horizontal and vertical cuts
–The fraction of tessellations that are XY tilings, $N_{xy}(m) / N_{all}(m)$, decreases rapidly (Klarner-Magliveras), i.e. $\lim_{m \to \infty} N_{xy}(m) / N_{all}(m) = 0$

Taxonomy of web tables All tables have a stub, row headings, column headings and data cells. Some common layouts – admissible tessellations

Taxonomy of web tables (contd.)
Human-understandable tables: $N_{T,S,xy}(m)$, a mathematically indefinable and unknown number.
Convert them to a smaller set of admissible tables: $N_{A,S,xy}(m)$.
Layout-equivalent tables are enough for algorithmic analysis.

Taxonomy of web tables (contd.)
Number of different layout-equivalent admissible candidates: $N_{L,S,xy}(m)$.
For now, $N_{L,S,xy}(m) < N_{A,S,xy}(m)$.
Context-free grammars characterize entire families of layout-equivalent tables.

Logical Structure of Tables
XY trees only capture physical layout.
To understand a table, we need to analyse its logical structure, i.e. the relationship between header cells and content cells [Wang].
Wang notation consists of category trees (headings) and delta cells (content).
–Number of category trees: the dimensionality of the table
–The Cartesian product of the category trees leads to the delta cells.
–Size of table: the product of the number of rows and columns of delta cells

Logical Structure of Tables (contd.)
Well-formed tables: labeled table candidates for which Wang Notation exists.
Most tables are not well-formed, but they are easily converted into a well-formed format using virtual headers.
Analyzing logical structure is not sufficient for table understanding!

Our project: a front end for creating narrow-domain ontologies by combining information from web tables.
Our work is based on the following inequalities:
$N_{L,S,xy}(m) < N_{A,S,xy}(m) < N_{T,S,xy}(m) \ll N_{S,xy}(m) \ll N_{xy}(m) \ll N_{all}(m)$
Examples of each class are shown in the next slide.

Tessellations to XY trees
Horizontally and vertically ordered lists of junction points are not sufficient for reconstructing the XY tree, because they do not capture the adjacency topology.
Coordinates and junction types (NE-corner, T-junction, crossing etc.) are also needed.

Table to XY tree – EX2XY
Applicable to any tessellation for which an XY tree exists.
Input: Excel table
Output: XY tree (parenthesized notation)
Algorithm:
–CutV(R) cuts a rectangle R vertically and returns the leftmost sub-rectangle.
–CutH(R) cuts R horizontally and returns the topmost sub-rectangle.
–Both are used in a pair of procedures P1 and P2, which call each other recursively.
–P1 cuts the given rectangle vertically and submits the first sub-rectangle to P2 for horizontal cuts. Similarly with P2.
–The main procedure calls P1 for vertical cuts, and P2 for horizontal cuts.
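The alternating cuts described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the EX2XY implementation: the table is assumed to arrive as a grid of cell identifiers in which a merged cell repeats its identifier over every grid position it covers, the function names `_cuts`, `_split` and `ex2xy` are hypothetical, and a single `vertical` flag stands in for the mutually recursive pair P1/P2. The bracket convention (square brackets for side-by-side pieces, braces for stacked pieces) is inferred from the sample output in this presentation.

```python
def _cuts(grid, top, left, bottom, right, vertical):
    """Return the positions where a full-length cut crosses no merged cell."""
    if vertical:
        return [c for c in range(left + 1, right)
                if all(grid[r][c - 1] != grid[r][c] for r in range(top, bottom))]
    return [r for r in range(top + 1, bottom)
            if all(grid[r - 1][c] != grid[r][c] for c in range(left, right))]


def _split(grid, top, left, bottom, right, vertical):
    """Cut the rectangle in one direction, then recurse in the other."""
    cells = {grid[r][c] for r in range(top, bottom) for c in range(left, right)}
    if len(cells) == 1:
        return "c"  # a single (possibly merged) cell is a leaf
    cuts = _cuts(grid, top, left, bottom, right, vertical)
    if not cuts:
        # No cut in this direction: try the other one, or give up.
        if not _cuts(grid, top, left, bottom, right, not vertical):
            raise ValueError("no XY tree exists for this tessellation")
        return _split(grid, top, left, bottom, right, not vertical)
    if vertical:
        bounds = [left] + cuts + [right]
        parts = [_split(grid, top, a, bottom, b, False)
                 for a, b in zip(bounds, bounds[1:])]
        return "[" + " ".join(parts) + "]"
    bounds = [top] + cuts + [bottom]
    parts = [_split(grid, a, left, b, right, True)
             for a, b in zip(bounds, bounds[1:])]
    return "{" + " ".join(parts) + "}"


def ex2xy(grid):
    """Return the XY tree of a grid in parenthesized notation (labels omitted)."""
    return _split(grid, 0, 0, len(grid), len(grid[0]), True)


# "Year" spans two columns above the cells "2004" and "2005".
header = [["Year", "Year"],
          ["2004", "2005"]]
print(ex2xy(header))  # {c [c c]}
```

A pinwheel arrangement of five cells admits no full-length cut in either direction, so the sketch raises `ValueError` for it; that is the case the slide excludes with "for which an XY tree exists".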

Example – Original HTML table

Example (contd.) – After import into Excel

Example – After Editing

A snippet of the output (both parenthetical and XML outputs)
Parenthetical version of the output:
( [
{ ::15,2:15,2 ::16,2:16,2 Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)::17,2:30,2 }
{ ::15,3:15,3 ::16,3:16,3 Canada::17,3:17,3 Newfoundland and Labrador::18,3:18,3 Prince Edward Island::19,3:19,3 Nova Scotia::20,3:20,3 New Brunswick::21,3:21,3 Quebec::22,3:22,3 Ontario::23,3:23,3 Manitoba::24,3:24,3 Saskatchewan::25,3:25,3 Alberta::26,3:26,3 British Columbia::27,3:27,3 Yukon::28,3:28,3 Northwest Territories::29,3:29,3 Nunavut::30,3:30,3 }
{ Year::15,4:15,8 [ 2004::16,4:16,4 2005::16,5:16,5 2006::16,6:16,6 2007::16,7:16,7 2008::16,8:16,8 ]
XML version of the output:
Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)

Grammar for tables
The grammar uses nested parenthetical notation (P-notation).
P-notation has a 1:1 correspondence with general trees.
For the above table, the XY tree sentence (neglecting the textual labels) is:
$S_{xy}$ = {c [c c] c [c {c [c c]} c {c [c c]}]}
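The 1:1 correspondence between P-notation and trees can be made concrete with a short round trip. The following Python sketch is illustrative only: the tuple representation `(kind, children)` and the names `parse_p_notation` and `to_p_notation` are assumptions introduced here, not part of the tools described in this presentation.

```python
CLOSERS = {"{": "}", "[": "]"}


def parse_p_notation(text):
    """Parse a P-notation sentence into a list of (kind, children) nodes."""
    root = ("root", [])
    stack = [root]       # open nodes, innermost last
    expected = []        # closing brackets still owed, innermost last
    for ch in text:
        if ch.isspace():
            continue
        if ch == "c":
            stack[-1][1].append(("c", []))
        elif ch in CLOSERS:
            node = (ch, [])
            stack[-1][1].append(node)
            stack.append(node)
            expected.append(CLOSERS[ch])
        elif ch in "}]":
            if not expected or expected.pop() != ch:
                raise ValueError(f"unbalanced bracket {ch!r}")
            stack.pop()
        else:
            raise ValueError(f"unexpected symbol {ch!r}")
    if expected:
        raise ValueError("unclosed bracket")
    return root[1]


def to_p_notation(node):
    """Serialize one (kind, children) node back into P-notation."""
    kind, children = node
    if kind == "c":
        return "c"
    return kind + " ".join(to_p_notation(ch) for ch in children) + CLOSERS[kind]


sentence = "{c [c c] c [c {c [c c]} c {c [c c]}]}"
tree = parse_p_notation(sentence)[0]
print(len(tree[1]))                     # 4 children under the outer braces
print(to_p_notation(tree) == sentence)  # True
```

Parsing the example sentence and serializing the result reproduces the sentence exactly, which is the round trip the correspondence implies.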

Grammar
Grammar for parsing the column headers of all such layout-equivalent tessellations:
–S := A (Rule 1)
–A := {B} (Rule 2)
–B := c [X] B | c [X] (Rules 3 and 4)
–X := c X | A X | A | c (Rules 5, 6, 7 and 8)
where
–S is the start symbol
–A is the nonterminal that generates all admissible strings for column headers
–B generates >=1 instances of categories in the form c[X]; each c becomes a root category and X generates its subcategory tree
–X generates strings of size >=1 with arbitrary occurrences of c and A
The derivation for the previous example using an LALR parser is shown on the next slide.
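To see which sentences Rules 1 to 8 accept, a small recognizer is enough. The sketch below is a hand-written recursive-descent recognizer in Python, not the LALR parser mentioned above; it relies on the observation that B is one or more repetitions of `c [X]` and X is one or more items that are each either `c` or an `A`. All function names are hypothetical.

```python
class ParseError(Exception):
    """Raised when the token stream does not match the grammar."""


def _expect(toks, i, sym):
    """Consume the terminal sym at position i and return the next position."""
    if i < len(toks) and toks[i] == sym:
        return i + 1
    raise ParseError(f"expected {sym!r} at position {i}")


def _parse_a(toks, i):
    # A := {B}
    i = _expect(toks, i, "{")
    i = _parse_b(toks, i)
    return _expect(toks, i, "}")


def _parse_b(toks, i):
    # B := c [X] B | c [X]   (one or more categories)
    while True:
        i = _expect(toks, i, "c")
        i = _expect(toks, i, "[")
        i = _parse_x(toks, i)
        i = _expect(toks, i, "]")
        if not (i < len(toks) and toks[i] == "c"):
            return i


def _parse_x(toks, i):
    # X := c X | A X | A | c   (one or more items, each c or A)
    start = i
    while i < len(toks) and toks[i] in ("c", "{"):
        i = i + 1 if toks[i] == "c" else _parse_a(toks, i)
    if i == start:
        raise ParseError(f"empty subcategory list at position {i}")
    return i


def accepts(sentence):
    """Return True if the sentence is derivable from S := A."""
    toks = [ch for ch in sentence if not ch.isspace()]
    try:
        return _parse_a(toks, 0) == len(toks)
    except ParseError:
        return False


print(accepts("{c [c c] c [c {c [c c]} c {c [c c]}]}"))  # True
print(accepts("{c c}"))                                   # False
```

The first call uses the XY tree sentence from the previous slide; the second fails because a root category `c` must be followed by a bracketed subcategory list.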

The example demonstrates both the power and the limitation of grammars. A grammar can recognize broad classes of layouts, but it cannot check that the headings are proper labels for a well-formed table. If a table is accepted by the grammar, additional geometric alignment and lexical checks are needed to verify the Wang notation.

XY tree to Wang Notation XY2WANG converts an XY tree generated from a restricted family of admissible tables to Wang Notation. Example: Uses an indented table-of-contents format as a data structure.

XY2WANG
Input: XY trees with an arbitrary number of categories and arbitrary nesting.
Output: XML version of Wang Notation
For a table T = (C, d),
–Category Notation: C = { (A,{(A1,φ),(A2,φ)}), (B,{(B1,φ),(B2,φ),(B3,φ)}) }
–Delta mappings: δ({A.A1,B.B1}) = d11, δ({A.A1,B.B2}) = d12, …

XY2WANG: Algorithm
–First locate the 4 principal regions: stub, row headers, column headers and content cells.
–Extract Wang labeled domains under the assumption that each spanning cell is the header of the smaller cells either to its right (row headers) or below it (column headers).
–Compute the Cartesian product of the category paths and match each key to the content of a delta cell.
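The last step, pairing the Cartesian product of category paths with delta cells, can be illustrated on the two-category example T = (C, d) used earlier. This Python sketch makes several assumptions that are not in the source: categories are given as nested `(label, children)` pairs, the first category indexes rows and the second indexes columns, content cells are listed row by row, and the names `leaf_paths` and `delta_map` are hypothetical.

```python
from itertools import product


def leaf_paths(label, children, prefix=""):
    """Return dotted root-to-leaf paths such as 'A.A1' for one category tree."""
    path = f"{prefix}.{label}" if prefix else label
    if not children:
        return [path]
    paths = []
    for child_label, grandchildren in children:
        paths.extend(leaf_paths(child_label, grandchildren, path))
    return paths


def delta_map(categories, content_rows):
    """Match each combination of category paths to one content cell."""
    per_category = [leaf_paths(label, children) for label, children in categories]
    keys = [frozenset(combo) for combo in product(*per_category)]
    cells = [cell for row in content_rows for cell in row]
    if len(keys) != len(cells):
        raise ValueError("header paths and content cells do not line up")
    return dict(zip(keys, cells))


# Category notation from the slide: A has A1, A2; B has B1, B2, B3.
categories = [
    ("A", [("A1", []), ("A2", [])]),
    ("B", [("B1", []), ("B2", []), ("B3", [])]),
]
content = [["d11", "d12", "d13"],
           ["d21", "d22", "d23"]]

delta = delta_map(categories, content)
print(delta[frozenset({"A.A1", "B.B2"})])  # d12
print(len(delta))                          # 6
```

Keys are stored as frozensets to mirror the set-valued arguments of δ; the size check is a simple stand-in for detecting tables whose headers and content do not line up.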

XY2WANG: Table-of-contents data structure Example of a table and its corresponding table-of-contents data structure is shown

XY2WANG also handles more complex scenarios like:
–Higher Wang dimensionality
–Deeper nesting of headers
–Repetitive headers
–Detection of tables that are not well-formed
These are included in the following pseudocode

Conclusion Hierarchical structure of categories and flat structure of data cells is recovered from XY trees. Geometric and topological equivalence classes on tessellations and their XY trees are defined. Commonly encountered tables are examples of such classes. These tables are identified by parsing XY trees with a grammar. Assuming the header labels are consistent, Wang category notation is extracted.

Future work Account for aggregates – major component of web tables. Need to integrate other augmentations (footnotes, units, captions etc.) Expand on the grammar: current version accounts only for column headers. Automate the conversion from imported web tables to standard formats. Semantic interpretation of groups of conceptually overlapping tables based on precise representation of layout-invariant syntax.

Current Work
Converting web tables to standard formats for ease of processing.
–Internal conventions: A’, A’’, hybrids
Learning from XY trees using tree edit distance
–Learning from existing manipulations.
–Example: The user modifies table T1 to a standard format T1’. The steps are all recorded. Now use this information to predict the standard format of a new table T2.

Current work (contd.)
Relation of tree-edit distance to pre-order and post-order string edit distance
–Some interesting results and conjectures, but still half-baked!
–(Result) Pre-order and post-order traversals are enough for reconstructing a general tree.
–(Conjecture) For 2 XY trees, the distances between corresponding pre-order and post-order strings are equal, but not for general trees!
–(Conjecture) For 2 XY trees, the tree-edit distance is equal to the pre/post-order distances.
–Are tables with the same content, but different layouts, collinear (in terms of string/tree edit distance)?
Developing software to calculate tree edit distances, which should clarify many of these questions. (Any suggestions?)
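The string side of these conjectures is easy to experiment with. The Python sketch below computes pre-order and post-order label strings for two small trees and the Levenshtein distance between them; it does not compute tree edit distance itself. The `(label, children)` representation and the node labels "H" (horizontal cut) and "c" (cell) are assumptions made for illustration.

```python
def preorder(tree):
    """Labels in pre-order: node first, then its children left to right."""
    label, children = tree
    return [label] + [x for child in children for x in preorder(child)]


def postorder(tree):
    """Labels in post-order: children left to right, then the node."""
    label, children = tree
    return [x for child in children for x in postorder(child)] + [label]


def levenshtein(a, b):
    """Unit-cost string edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


# Two hypothetical XY trees that differ by one leaf cell.
t1 = ("H", [("c", []), ("c", [])])
t2 = ("H", [("c", [])])

print(levenshtein(preorder(t1), preorder(t2)))    # 1
print(levenshtein(postorder(t1), postorder(t2)))  # 1
```

On this pair the two string distances agree; checking the conjectures in general would need a tree edit distance routine to compare against.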