A Probabilistic Classifier for Table Visual Analysis
William Silversmith
TANGO Research Project
NSF Grant # 0414644 and 04414854
Greetings Prof. Embley!



Outline
Motivation
High level view
Algorithm
Results of a small experiment
Discussion
Appendix

Motivation (1)
Tables use visual cues to present information.
The approaches known to us rely exclusively on structure or pay little attention to visual information.
Potential use: segment category regions from delta cells, allowing further automation of TANGO.


Motivation (2)
However, visual features are not used consistently.
Ideally, analyze them automatically anyway:
– Globally
– Within an internet domain (e.g. Canada Statistics)
– Within a subject domain
Train a program to understand a given domain!

High Level View
Find a group of tables that are accompanied by a listing of their categories and delta cells.
Train the program on them to recognize how visual features predict the splits.
Use the trained program on new tables to allow TANGO to mine data.

Algorithm (1)
Find a training set (tables with verifying data).
Select visual features to analyze.
Examples: font, indentation, empty cells, background color, font color, and many more.

Algorithm (2)
Training the classifier:
for each attribute:
– Form the "difference table" (see Appendix 1)
– Sum along the rows and columns
– The row and column with the highest sums indicate the horizontal and vertical split predicted by the attribute
– Compare this coordinate with the one indicated by the verification data
– Track the number of hits and misses persistently
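The training steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the project's implementation: each attribute is assumed to be stored as a rectangular grid of values, the difference table is read as a per-cell count of differing neighbours, ties go to the topmost row and leftmost column, and the verification data is assumed to give the true split as a `(row, column)` pair. The names `difference_table`, `predicted_split`, and `tally` are hypothetical.

```python
STEPS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # top, bottom, left, right


def difference_table(grid):
    """For each cell, count the in-table neighbours whose value differs."""
    rows, cols = len(grid), len(grid[0])

    def differs(r, c, nr, nc):
        # Positions outside the table are not cells, so they never count.
        return 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != grid[r][c]

    return [[sum(differs(r, c, r + dr, c + dc) for dr, dc in STEPS)
             for c in range(cols)] for r in range(rows)]


def predicted_split(grid):
    """Row and column with the highest difference sums (first one on ties)."""
    diff = difference_table(grid)
    row_sums = [sum(row) for row in diff]
    col_sums = [sum(col) for col in zip(*diff)]
    return row_sums.index(max(row_sums)), col_sums.index(max(col_sums))


def tally(tables, attribute):
    """Count hits and misses of one attribute over (attributes, true_split) pairs."""
    hits = misses = 0
    for attributes, true_split in tables:
        if predicted_split(attributes[attribute]) == true_split:
            hits += 1
        else:
            misses += 1
    return hits, misses


# Toy training data (made up): bold header row and bold stub column.
bold_headers = [
    ["bold", "bold", "bold"],
    ["bold", "normal", "normal"],
    ["bold", "normal", "normal"],
]
uniform = [["normal"] * 3 for _ in range(3)]
training = [
    ({"font_style": bold_headers}, (1, 1)),
    ({"font_style": uniform}, (1, 1)),
]
font_hits, font_misses = tally(training, "font_style")
```

In this toy data the font style finds the split in the first table and misses it in the second, where the attribute never changes.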

Algorithm (3)
The probability that each attribute predicts a correct answer is its weight.
Weights can be tabulated for horizontal features, vertical features, and combined.
– Certain features may be more sensitive along a given direction
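As a rough illustration of the weighting, the sketch below turns hit and miss counts into weights by taking the fraction of hits, separately for each direction. The function name `weights_from_tallies`, the dictionary layout, and the counts are all hypothetical; the slides do not specify a data structure.

```python
def weights_from_tallies(tallies):
    """Map {attribute: {direction: (hits, misses)}} to {attribute: {direction: weight}}.

    Assumes every attribute was tested on at least one table, so that
    hits + misses is never zero.
    """
    return {
        attribute: {
            direction: hits / (hits + misses)
            for direction, (hits, misses) in by_direction.items()
        }
        for attribute, by_direction in tallies.items()
    }


# Made-up counts for four training tables.
tallies = {
    "font_style": {"horizontal": (3, 1), "vertical": (2, 2), "both": (2, 2)},
}
weights = weights_from_tallies(tallies)
```

With four training tables, every weight computed this way is a multiple of 0.25.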

Algorithm (4)
Using the classifier:
for each table in the data set:
– Form the difference table by summing all weighted attributes together for each difference
– Sum the resulting distinctions along rows and columns
– The highest sum, taking the topmost split for horizontal splits and the leftmost for vertical splits, is the predicted category/delta segmentation
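The classification step can be sketched the same way. This is one possible reading, not a definitive implementation: the per-attribute difference tables are scaled by a single combined weight per attribute and added cell by cell, and the first maximum is taken so that ties fall to the topmost row and leftmost column. The names `difference_table` and `classify`, the grids, and the weights are hypothetical.

```python
STEPS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # top, bottom, left, right


def difference_table(grid):
    """For each cell, count the in-table neighbours whose value differs."""
    rows, cols = len(grid), len(grid[0])

    def differs(r, c, nr, nc):
        return 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != grid[r][c]

    return [[sum(differs(r, c, r + dr, c + dc) for dr, dc in STEPS)
             for c in range(cols)] for r in range(rows)]


def classify(attributes, weights):
    """Predict the (row, column) split from weighted difference tables.

    attributes: {name: grid}, all grids with the same shape.
    weights:    {name: weight}, e.g. the combined weights from training.
    """
    diffs = {name: difference_table(grid) for name, grid in attributes.items()}
    first = next(iter(attributes.values()))
    rows, cols = len(first), len(first[0])
    combined = [[sum(weights[name] * diffs[name][r][c] for name in diffs)
                 for c in range(cols)] for r in range(rows)]
    row_sums = [sum(row) for row in combined]
    col_sums = [sum(col) for col in zip(*combined)]
    # list.index returns the first maximum: topmost row, leftmost column.
    return row_sums.index(max(row_sums)), col_sums.index(max(col_sums))


# Made-up table: font style marks the headers, data type changes in the last row.
table = {
    "font_style": [
        ["bold", "bold", "bold"],
        ["bold", "normal", "normal"],
        ["bold", "normal", "normal"],
    ],
    "data_type": [
        ["string", "string", "string"],
        ["string", "string", "string"],
        ["number", "number", "number"],
    ],
}
split = classify(table, {"font_style": 1.0, "data_type": 0.25})
```

Because the sums are floating-point values, exact ties are only reliable when the weights are exactly representable, as the multiples of 0.25 used here are.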

Experiment (1)
A small experiment was conducted by hand to test the efficacy of the algorithm.
Procedure:
– Select six tables from Canada Statistics
– Use four to train the classifier
– Assess it on the last two
By chance, one of the training set tables was a concatenated table.

Experiment (2)
Analyzed characteristics:
– Font style (normal/italic/bold)
– Indentation (left/center/right/offset)
– Data type (string or reasonably recognizable number)
– Adjacent empty cell (whitespace)
Empty/nonempty transitions do not count except for the whitespace measure.

Experiment: Training Results (3)

Experiment: Training Results (4)
Weight
Attribute     Horizontal Cut    Vertical Cut    Both Cuts
Data Type     X 0.25            X 1.00          X 0.25
Whitespace    X 1.00            X 0.75          X

Experiment: Classifier Results (5)
Both data set targets were properly segmented under both the combined-weights and the distinct-weights approaches for horizontal and vertical cuts.
Neither of the targets was a concatenated table.

Experiment: Classifier Results (6)
Uniformly random classifier:
P(T1 and T2) = (1/49)(1/120) = 1 in 5880 trials
P(T1 or T2) = (1/49)(119/120) + (1/5880) + (1/120)(48/49) = 1 in 35 trials
Sample space = number of points in the table
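The arithmetic for the random baseline can be checked with exact fractions. The sketch assumes, as the figures above imply, that the two target tables have 49 and 120 candidate points, that a random classifier picks one point uniformly in each, and that the two picks are independent. The variable names are illustrative only.

```python
from fractions import Fraction

p_t1 = Fraction(1, 49)    # chance of guessing the split point of T1
p_t2 = Fraction(1, 120)   # chance of guessing the split point of T2

# Both tables right: product of the two independent chances.
p_both = p_t1 * p_t2

# At least one table right: only T1, both, or only T2.
p_either = p_t1 * (1 - p_t2) + p_both + p_t2 * (1 - p_t1)
```

Here `p_both` works out to 1/5880 and `p_either` to 168/5880 = 1/35, matching the figures on the slide.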

Discussion: Results (1)
Font style: less important than previously thought
Font style still seemed to indicate the presence of aggregates
Some decisions were made by slim margins
Vertical cuts were predicted by indentation

Discussion: Results (2)
Whitespace is useful, but sometimes misleading.
A row of whitespace below the column categories confuses the classifier.
Some choices of validating splits are debatable.

Discussion: Future Issues (3)
Current formulation ignorant of number of cuts?
Potentially useful in describing large numbers of similar tables
Where is the training data?

Discussion: Future Issues (4)
Sources of training data are hard (or possibly expensive) to come by:
– Canada Statistics provides access to internal databases for ~$5000
Luckily, Raghav produced about 200 samples by accident!
Format is DoclabXML!
Hooray for Raghav and TAT!

Discussion: Enhancements (5)
Use a rule based learning system?
Detect visual patterns?
Combine other approaches?

Appendix 1: Difference Table
Form a difference table by taking all the cells adjacent on the top, bottom, left, and right and checking to see if there is a difference in the attribute(s) you are looking at.
A difference returns 1, no difference returns 0.
Do this for all cells in the table.
Edges of the table do not count as cells.
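A small Python sketch of this construction follows. It is one reading of the description, under the assumption that each cell accumulates a 1 for every adjacent cell whose attribute value differs; the function name `difference_table` and the sample grid are hypothetical.

```python
def difference_table(grid):
    """Build the difference table for one attribute.

    grid is a rectangular list of rows holding the attribute value of
    each cell. Each output cell is the number of top, bottom, left, and
    right neighbours whose value differs from its own.
    """
    rows, cols = len(grid), len(grid[0])
    diff = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                # Beyond the table edge there is no cell, so nothing is counted.
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != grid[r][c]:
                    diff[r][c] += 1
    return diff


# Made-up example: a bold header row above two rows of normal text.
font_style = [
    ["bold", "bold", "bold"],
    ["normal", "normal", "normal"],
    ["normal", "normal", "normal"],
]
diff = difference_table(font_style)
row_sums = [sum(row) for row in diff]
col_sums = [sum(col) for col in zip(*diff)]
```

In this example the header row and the row beneath it each sum to 3 while the last row sums to 0, so the difference mass sits on the boundary between header and body.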

Experimental Data Sources
Canada Statistics:
Training Set: – – – –
Data Set: – –