
TANGO: Table ANalysis for Generating Ontologies
Yuri A. Tijerino*, David W. Embley*, Deryle W. Lonsdale* and George Nagy**
* Brigham Young University
** Rensselaer Polytechnic Institute

List of contents
• Motivation
• Applications
• Table understanding
• Concept matching
• Ontology merging/growing
• Example
• Future direction

Motivation
• Semi-automated ontological engineering through Table Analysis for Generating Ontologies (TANGO)
• Keyword or link-analysis search is not enough to find information in tables
• Structure in tables can lead to domain knowledge, which includes concepts, relationships and constraints (ontologies)
• Tables on the web created for human use can lead to robust domain ontologies

TANGO Applications
• Extraction ontologies (generation)
• Data integration
• Semantic web
• Multiple-source query processing
• Document image analysis for documents that contain tables

Table understanding
• What is a table?
• Why table normalization?
• What is table understanding?
• What is mini-ontology generation?

Table understanding: What is a table?
• "…a two-dimensional assembly of cells used to present information…" (Lopresti and Nagy)
• Normalized tables (row-column format)
• Small paper (using OCR) and/or electronic tables (marked up) intended for human use

Table understanding: What is table normalization?
Table normalization means taking any table and producing a standard row-column table in which all data cells contain expanded values and type information.

Normalized table:

Country             | GDP/PPP          | Per Capita | Real-Growth Rate | Inflation
Afghanistan         | $21,000,000,000  | $800       | ?                | ?
Albania             | $13,200,000,000  | $3,800     | 7.3%             | 3.0%
Algeria             | $177,000,000,000 | $5,600     | 3.8%             | 3.0%
Andorra             | $1,300,000,000   | $19,000    | 3.8%             | 4.3%
Angola              | $13,300,000,000  | $1,330     | 5.4%             | 110.0%
Antigua and Barbuda | $674,000,000     | $10,000    | 3.5%             | 0.4%
…                   | …                | …          | …                | …

[Figure: raw table shown beside the normalized table.]

Table understanding: What is table normalization?

??             | Population | Growth rate | Population Density | Birth Rate | Death Rate | Migration Rate | Life Expectancy Male | Life Expectancy Female | Infant Mortality
Afghanistan    | 25,824,    | %           | 39.88 persons/km   | %          | 1.70%      | 1.46%          | 47.82 years          | 46.82 years            | 14.06%
Albania        | 3,364,     | %           | persons/km         | %          | 0.74%      | -0.29%         | 65.92 years          | 72.33 years            | 4.29%
Algeria        | 31,133,    | %           | 13.07 persons/km   | %          | 0.55%      | -0.05%         | 68.07 years          | 70.46 years            | 4.38%
American Samoa | 63,        | %           | persons/km         | %          | 0.40%      | 0.39%          | 71.23 years          | 79.95 years            | 1.02%
Andorra        | 65,        | %           | persons/km         | %          | 0.55%      | 1.76%          | 80.55 years          | 86.55 years            | 0.41%
Angola         | 11,        | %           | 8.97 persons/km    | %          | 1.64%      | 0.16%          | 46.08 years          | 50.82 years            | 12.92%
…              | …          | …           | …                  | …          | …          | …              | …                    | …                      | …
Western Sahara | 239,       | %           | 0.90 persons/km    | %          | 1.66%      | -0.54%         | 47.98 years          | 50.57 years            | 13.67%
World          | 5,995,544, | %           | 14.42 persons/km   | %          | 0.90%      | ?              | 61.00 years          | 65.00 years            | 5.60%
Yemen          | 16,942,    | %           | 32.09 persons/km   | %          | 0.99%      | 0.00%          | 58.17 years          | 61.88 years            | 6.98%
Zambia         | 9,663,     | %           | 13.05 persons/km   | %          | 2.26%      | 0.08%          | 36.72 years          | 37.21 years            | 9.19%
Zimbabwe       | 11,163,    | %           | 28.87 persons/km   | %          | 2.04%      | ?              | 38.77 years          | 38.94 years            | 6.12%

Table understanding: Information useful for normalization
• Captions – in vicinity of table (above, below, etc.)
• Footnotes – on annotated column labels or data cells
• Embedded information – in rows, columns or cells (e.g., $, %, (1,000), billions, etc.)
• Links to other views of the table, possibly with new information
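As an illustration of how embedded information such as $, % or "billions" might be expanded into typed values, here is a minimal Python sketch. The function name `normalize_cell`, the type tags and the `SCALES` table are assumptions made for this example; they are not part of TANGO.

```python
import re
from typing import Tuple, Union

# Hypothetical scale words; the slide only names "billions" and "(1,000)" as examples.
SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def normalize_cell(text: str) -> Tuple[Union[float, str, None], str]:
    """Expand one raw cell into (value, type tag)."""
    s = text.strip()
    if s in ("", "?"):
        # Unknown or missing value, as in the Afghanistan row.
        return None, "unknown"
    m = re.fullmatch(
        r"\$\s*([\d,]+(?:\.\d+)?)\s*(thousand|million|billion)?", s, re.IGNORECASE
    )
    if m:
        amount = float(m.group(1).replace(",", ""))
        scale = SCALES[m.group(2).lower()] if m.group(2) else 1.0
        return amount * scale, "dollar amount"
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*%", s)
    if m:
        return float(m.group(1)), "percentage"
    # Anything else is kept as a plain string.
    return s, "string"


row = ["Albania", "$13.2 billion", "$3,800", "7.3%", "3.0%"]
normalized = [normalize_cell(cell) for cell in row]
```

A fuller version would also read scale hints from captions and footnotes, which this sketch ignores.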

What is table understanding?
• Normalize table
• Take a table as an input and produce standard records in the form of attribute-value pairs as output
• Discover constraints among columns
• Understand the data values

Country             | GDP/PPP          | Per Capita | Real-Growth Rate | Inflation
Afghanistan         | $21,000,000,000  | $800       | ?                | ?
Albania             | $13,200,000,000  | $3,800     | 7.3%             | 3.0%
Algeria             | $177,000,000,000 | $5,600     | 3.8%             | 3.0%
Andorra             | $1,300,000,000   | $19,000    | 3.8%             | 4.3%
Angola              | $13,300,000,000  | $1,330     | 5.4%             | 110.0%
Antigua and Barbuda | $674,000,000     | $10,000    | 3.5%             | 0.4%
…                   | …                | …          | …                | …

Discovered relationships:
{has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita), has(Country, Real-growth rate*), has(Country, Inflation*)}

Annotations on the table:
• Left-most, primary key
• Country names (from data frame)
• Dollar amount (from data frame)
• Percentage (from data frame)
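The step from a normalized table to attribute-value records, together with the "left-most, primary key" observation, can be sketched roughly as follows. This is a simplified illustration; `to_records` and `leftmost_key` are hypothetical helper names, and treating the first column with unique, fully known values as the key is an assumption of this sketch.

```python
from typing import Dict, List, Optional


def to_records(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Turn each row into a record of attribute-value pairs."""
    return [dict(zip(header, row)) for row in rows]


def leftmost_key(header: List[str], rows: List[List[str]]) -> Optional[str]:
    """Return the left-most column whose values are all known and unique."""
    for i, name in enumerate(header):
        column = [row[i] for row in rows]
        if "?" not in column and len(set(column)) == len(column):
            return name
    return None


header = ["Country", "GDP/PPP", "Per Capita", "Real-Growth Rate", "Inflation"]
rows = [
    ["Afghanistan", "$21,000,000,000", "$800", "?", "?"],
    ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"],
    ["Algeria", "$177,000,000,000", "$5,600", "3.8%", "3.0%"],
]
records = to_records(header, rows)
key = leftmost_key(header, rows)  # "Country" for this sample
```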

Example: Creating a domain ontology
[Figure: domain ontology diagram. Object sets: Geopolitical Entity, Name, Time, Location, Longitude, Latitude, Country, City. Relationship labels: "has names", "Has GMT", "Latitude and longitude designates location". Notes on the diagram: has associated data frames; includes procedural knowledge (distances, duration between time zones).]

Example: Table understanding to mini-ontology generation

Agglomeration         | Population | Continent         | Country
Tokyo                 | 31,139,900 | Asia              | Japan
New York-Philadelphia | 30,286,900 | The Americas      | United States of America
Mexico                | 21,233,900 | The Americas      | Mexico
Seoul                 | 19,969,100 | Asia              | Korea (South)
Sao Paulo             | 18,847,400 | The Americas      | Brazil
Jakarta               | 17,891,000 | Asia              | Indonesia
Osaka-Kobe-Kyoto      | 17,621,500 | Asia              | Japan
…                     | …          | …                 | …
Niigata               | 503,500    | Asia              | Japan
Raurkela              | 503,300    | Asia              | India
Homjel                | 502,200    | Europe            | Belarus
Zunyi                 | 501,900    | Asia              | China
Santiago              | 501,800    | The Americas      | Dominican Republic
Pingdingshan          | 501,500    | Asia              | China
Fargona               | 501,000    | Asia              | Uzbekistan
Kirov                 | 500,200    | Europe            | Russia
Newcastle             | 500,000    | Australia/Oceania | Australia

[Figure: resulting mini-ontology with object sets Agglomeration, Population, Country, Continent.]

Example: Concept matching to ontology merging
[Figure: the mini-ontology (Agglomeration, Population, Country, Continent) is merged with the domain ontology (Geopolitical Entity, Name, Time, Location, Longitude, Latitude, Country, City). The merge result contains Geopolitical Entity, Name, Time, Location, Longitude, Latitude, Country, City, Agglomeration, Population and Continent, with the labels "has names", "Has GMT" and "Latitude and longitude designates location".]

Concept matching
We use exhaustive concept matching techniques to match concepts from different mini-ontologies, including:
• Lexical and Natural Language Processing
• Value Similarity
• Value Features
• Data Frame Comparison
• Constraints

Concept Matching (Lexical & NLP)
• Lexical
  - Direct comparisons (substring/superstring)
  - WordNet (Synonyms, Word Senses, Hypernyms/Hyponyms)
• Natural Language Processing
  - Phrases in column headers
  - Footnotes (for columns, rows, values)
  - Explanations of symbols, rows, columns
  - Titles and subtitles
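The direct substring/superstring comparison could look something like the sketch below. It covers only the direct lexical comparison, not the WordNet or NLP steps, and the names `normalize_label` and `lexical_match` are invented for this example.

```python
import re
from typing import Optional


def normalize_label(label: str) -> str:
    """Lower-case a label and collapse punctuation and spacing to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def lexical_match(a: str, b: str) -> Optional[str]:
    """Classify label a against label b: equal, substring, superstring, or None."""
    x, y = normalize_label(a), normalize_label(b)
    if not x or not y:
        return None
    if x == y:
        return "equal"
    if x in y:
        return "substring"
    if y in x:
        return "superstring"
    return None


match = lexical_match("City", "City/town")  # "substring"
```

A plain substring test can over-match short labels, so a practical matcher would probably compare whole words instead.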

Concept Matching (Value Similarity)
• Compute overlap for string values comparing data sets
• Compute overlap for numeric values comparing Gaussian Probability Distributions
• Compute similarity of numeric values using regression

Concept Matching (Value Similarity)
A: Afghanistan, Albania, Algeria, Andorra, …, Yemen, Zambia, Zimbabwe
B: Afghanistan, Albania, Algeria, American Samoa, …, World, Yemen, Zambia, Zimbabwe
[Figure: value lists A and B side by side, with values marked "In A not in B" or "In B not in A".]

Real-world example
• Total of 193 cells in A
• Total of 267 cells in B
• 77 fields in B not in A
• 3 fields in A not in B
• 190 total matches
• Proportion of matches with respect to A = 190/193 = 98%
• Proportion of matches with respect to B = 190/267 = 71%
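The overlap proportions above can be computed with plain set operations. The following Python sketch assumes exact string equality counts as a match, which is a simplification, and the function name `string_overlap` is hypothetical. The sample lists contain only the values visible on the slide, not the full 193 and 267 cells.

```python
from typing import Iterable, Tuple


def string_overlap(a: Iterable[str], b: Iterable[str]) -> Tuple[int, float, float]:
    """Return (matches, proportion w.r.t. A, proportion w.r.t. B)."""
    set_a, set_b = set(a), set(b)
    matches = len(set_a & set_b)
    return matches, matches / len(set_a), matches / len(set_b)


# Only the values visible on the slide.
col_a = ["Afghanistan", "Albania", "Algeria", "Andorra", "Yemen", "Zambia", "Zimbabwe"]
col_b = ["Afghanistan", "Albania", "Algeria", "American Samoa", "World",
         "Yemen", "Zambia", "Zimbabwe"]
matches, wrt_a, wrt_b = string_overlap(col_a, col_b)

# The slide's real-world figures, reproduced as plain arithmetic.
slide_wrt_a = round(190 / 193 * 100)  # 98
slide_wrt_b = round(190 / 267 * 100)  # 71
```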

Concept Matching (Value Similarity)
A: 31,900,600; 30,521,550; 25,335,200; 12,300,555; …; 3,567,203; 2,300,531; 1,400,112
B: 31,500,900; 30,400,111; 25,500,100; 21,000,900; …; 7,000,000; 3,500,050; 2,300,000; 1,500,000
[Figure: numeric value lists A and B side by side, with values marked "In A not in B" or "In B not in A", and a Gaussian PDF plot.]

• Total of 170 cells in A
• Total of 240 cells in B
• 50 fields in B not in A
• 2 fields in A not in B
• 168 total matches
• Proportion of matches with respect to A = 168/170 = 99%
• Proportion of matches with respect to B = 168/240 = 70%
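One way to compare numeric columns through Gaussian distributions is to fit a normal distribution to each column and measure how much the two densities overlap. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8 or later). The slides do not say how TANGO computes this overlap, so the overlapping coefficient used here is an assumption, and `gaussian_overlap` is a hypothetical name.

```python
from statistics import NormalDist
from typing import Sequence


def gaussian_overlap(a: Sequence[float], b: Sequence[float]) -> float:
    """Overlapping coefficient (0.0 to 1.0) of normal distributions fitted to a and b.

    Each column needs at least two values and a non-zero spread.
    """
    dist_a = NormalDist.from_samples(a)
    dist_b = NormalDist.from_samples(b)
    return dist_a.overlap(dist_b)


# Only the values visible on the slide.
pop_a = [31_900_600, 30_521_550, 25_335_200, 12_300_555, 3_567_203, 2_300_531, 1_400_112]
pop_b = [31_500_900, 30_400_111, 25_500_100, 21_000_900, 7_000_000, 3_500_050,
         2_300_000, 1_500_000]
growth = [7.3, 3.8, 3.8, 5.4, 3.5]  # percentages from the GDP table

similar = gaussian_overlap(pop_a, pop_b)      # two population columns: high overlap
dissimilar = gaussian_overlap(pop_a, growth)  # population vs. percentage: near zero
```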

Concept Matching (Value Features)
We can also compute similarities from value characteristics such as:
• Character/numeric length, ratio
• Numeric values: mean, variance, standard deviation
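A rough sketch of such value features for one column is shown below. The feature names, the helper `to_number` and the choice of population variance and standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev, pvariance
from typing import Dict, List, Optional


def to_number(value: str) -> Optional[float]:
    """Parse a cell such as "$3,800" or "7.3%" as a number, or return None."""
    try:
        return float(value.replace(",", "").strip().strip("$%"))
    except ValueError:
        return None


def value_features(values: List[str]) -> Dict[str, float]:
    """Compute simple character and numeric features for a non-empty column."""
    total_chars = sum(len(v) for v in values)
    digits = sum(ch.isdigit() for v in values for ch in v)
    features = {
        "mean_length": total_chars / len(values),
        "digit_ratio": digits / total_chars if total_chars else 0.0,
    }
    numbers = [n for n in map(to_number, values) if n is not None]
    if numbers:
        # Numeric features exist only when at least one value parses as a number.
        features["numeric_mean"] = mean(numbers)
        features["numeric_variance"] = pvariance(numbers)
        features["numeric_stdev"] = pstdev(numbers)
    return features


per_capita = value_features(["$800", "$3,800", "$5,600"])
```

Two columns whose feature values are close are candidates for the same concept; how the features are weighted against each other is left open here.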

Concept Matching (Data frames)
• Snippets of real-world knowledge about data (type, length, nearby keywords, patterns [as in regexps], functional, etc.)
• We have used data frames to
  - Recognize data types
  - Include recognizers for values (dates, times, longitude, latitude, countries, cities, etc.)
  - Provide conversion routines
  - Match headers, labels, footnotes and values
  - Compose or split columns (e.g., addresses)
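To make the idea of a data frame as a value recognizer concrete, here is a toy Python sketch in which each data frame is reduced to a single regular expression. Real data frames carry more knowledge (keywords, conversion routines and so on); the three patterns, the 0.8 threshold and the name `recognize` are all assumptions of this example.

```python
import re
from typing import List, Optional

# Toy "data frames": one regular-expression recognizer per value type.
DATA_FRAMES = {
    "dollar amount": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d+)?"),
    "percentage": re.compile(r"-?\d+(?:\.\d+)?%"),
    "duration in years": re.compile(r"\d+(?:\.\d+)? years"),
}


def recognize(values: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the data frame matching most known values, if it clears the threshold."""
    known = [v.strip() for v in values if v.strip() not in ("", "?")]
    if not known:
        return None
    best_name, best_score = None, 0.0
    for name, pattern in DATA_FRAMES.items():
        score = sum(bool(pattern.fullmatch(v)) for v in known) / len(known)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


column_type = recognize(["$800", "$3,800", "?", "$19,000"])  # "dollar amount"
```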

Concept Matching (Constraints)
• Keys in tables (as well as nonkeys)
• Functional relationships
• 1-1, 1-*, *-1 or *-* correspondences
• Subset/superset of value sets
• Unknown and null values
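The correspondence types listed above can be inferred from the value pairs of two columns, as in this minimal sketch. It assumes the observed pairs are representative of the real-world relationship and skips pairs with an unknown value ("?"); the function name `correspondence` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def correspondence(pairs: Iterable[Tuple[str, str]]) -> str:
    """Classify the A-to-B relationship as "1-1", "1-*", "*-1" or "*-*"."""
    a_to_b, b_to_a = defaultdict(set), defaultdict(set)
    for a, b in pairs:
        if "?" in (a, b):
            continue  # unknown values carry no evidence either way
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    a_functional = all(len(v) == 1 for v in a_to_b.values())  # each A has one B
    b_functional = all(len(v) == 1 for v in b_to_a.values())  # each B has one A
    if a_functional and b_functional:
        return "1-1"
    if a_functional:
        return "*-1"
    if b_functional:
        return "1-*"
    return "*-*"


# Agglomeration and Country values from the example table.
kind = correspondence([
    ("Tokyo", "Japan"),
    ("Osaka-Kobe-Kyoto", "Japan"),
    ("Seoul", "Korea (South)"),
])  # "*-1": many agglomerations map to one country
```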

Ontology merging/growing
• Direct merge (no conflicts)
  - Use results of matching phase to find similar concepts in ontologies (e.g., data value similarities, data frames, NLP, etc.)
• Conflict resolution
  - Interactively identify evidence and counter-evidence of functional relationships among mini-ontologies using constraint resolution
• IDS
  - Interaction with human knowledge engineer
  - Issues – identify
  - Default strategy – apply
  - Suggestions – make

Example: Another mini-ontology generation
[Figure: mini-ontology diagram. Object sets: Place, Longitude, Latitude, Elevation, USGS Quad, Area, Mine, Reservoir, Lake, City/town, Country, State, Place Name. The diagram includes a ⊎ symbol.]

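Example: Another mini-ontology generation
[Figure: the Place mini-ontology (Place, Longitude, Latitude, Elevation, USGS Quad, Area, Mine, Reservoir, Lake, City/town, Country, State, Place Name, with a ⊎ symbol) shown next to the merged ontology (Location, Longitude, Latitude, Name, Geopolitical Entity, Population, City, Agglomeration, Country, Continent, Time; labels "has names", "has GMT", "Latitude and longitude designates location"), with the two joined by "Merge".]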

Example: Concept Mapping to Ontology Merging
[Figure: merged ontology diagram. Object sets: Place, Elevation, USGS Quad, Area, Mine, Reservoir, Lake, Country, State, Location, Longitude, Latitude, Name, Geopolitical Entity, Population, Agglomeration, Continent, Time, Geopolitical Entity with population, City/town. Labels: "has names", "has GMT", "Latitude and longitude designates location". The diagram includes a ⊎ symbol.]

Future direction
• Start with multiple tables (or URLs) and generate mini-ontologies
• Identify most suitable mini-ontologies to merge by calculating which tables have most overlap of concepts
• Generate multiple domain ontologies
• Integrate with form-based data extraction tools (smarter Web search engines)
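Choosing which mini-ontologies to merge first by concept overlap could be approximated as in the sketch below. It is speculative: it scores overlap with the Jaccard coefficient over exact concept names, whereas the matching techniques described earlier are richer, and the ontology names and concept sets are read off the example diagrams for illustration only.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared concepts divided by all concepts in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_pair(ontologies: Dict[str, Set[str]]) -> Tuple[str, str]:
    """Return the pair of ontology names with the highest concept overlap."""
    return max(
        combinations(sorted(ontologies), 2),
        key=lambda pair: jaccard(ontologies[pair[0]], ontologies[pair[1]]),
    )


# Concept sets taken from the example diagrams (illustrative only).
ontologies = {
    "agglomerations": {"Agglomeration", "Population", "Country", "Continent"},
    "geopolitical": {"Geopolitical Entity", "Name", "Time", "Location",
                     "Longitude", "Latitude", "Country", "City"},
    "places": {"Place", "Longitude", "Latitude", "Elevation", "USGS Quad", "Area",
               "Mine", "Reservoir", "Lake", "City/town", "Country", "State",
               "Place Name"},
}
pair = best_pair(ontologies)  # ("geopolitical", "places")
```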