
TANGO Table ANalysis for Generating Ontologies Yuri A. Tijerino*, David W. Embley*, Deryle W. Lonsdale* and George Nagy** * Brigham Young University **


1 TANGO Table ANalysis for Generating Ontologies Yuri A. Tijerino*, David W. Embley*, Deryle W. Lonsdale* and George Nagy** * Brigham Young University ** Rensselaer Polytechnic Institute

2 List of contents
- Motivation
- Applications
- Table understanding
- Concept matching
- Ontology merging/growing
- Example
- Future direction

3 Motivation
- Semi-automated ontological engineering through Table Analysis for Generating Ontologies (TANGO)
- Keyword or link-analysis search is not enough to find information in tables
- Structure in tables can lead to domain knowledge, which includes concepts, relationships and constraints (ontologies)
- Tables on the web created for human use can lead to robust domain ontologies

4 TANGO Applications
- Extraction ontologies (generation)
- Data integration
- Semantic web
- Multiple-source query processing
- Document image analysis for documents that contain tables

5 Table understanding
- What is a table?
- Why table normalization?
- What is table understanding?
- What is mini-ontology generation?

6 Table understanding: What is a table?
- “…a two-dimensional assembly of cells used to present information…” (Lopresti and Nagy)
- Normalized tables (row-column format)
- Small paper (using OCR) and/or electronic tables (marked up) intended for human use

7 Table understanding: What is table normalization?
Table normalization takes any table and produces a standard row-column table in which every data cell contains expanded values and type information.

| Country | GDP/PPP | Per Capita | Real-Growth Rate | Inflation |
|---|---|---|---|---|
| Afghanistan | $21,000,000,000 | $800 | ? | ? |
| Albania | $13,200,000,000 | $3,800 | 7.3% | 3.0% |
| Algeria | $177,000,000,000 | $5,600 | 3.8% | 3.0% |
| Andorra | $1,300,000,000 | $19,000 | 3.8% | 4.3% |
| Angola | $13,300,000,000 | $1,330 | 5.4% | 110.0% |
| Antigua and Barbuda | $674,000,000 | $10,000 | 3.5% | 0.4% |
| … | … | … | … | … |

Figure labels: Raw table, Normalized table

8 Table understanding: What is table normalization?

9
| ? | ? | Population Growth rate | Population Density | Birth Rate | Death Rate | Migration Rate | Life Expectancy Male | Life Expectancy Female | Infant Mortality |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 25,824,882 | 3.95% | 39.88 persons/km² | 4.19% | 1.70% | 1.46% | 47.82 years | 46.82 years | 14.06% |
| Albania | 3,364,571 | 1.05% | 122.79 persons/km² | 2.07% | 0.74% | -0.29% | 65.92 years | 72.33 years | 4.29% |
| Algeria | 31,133,486 | 2.10% | 13.07 persons/km² | 2.70% | 0.55% | -0.05% | 68.07 years | 70.46 years | 4.38% |
| American Samoa | 63,786 | 2.64% | 320.53 persons/km² | 2.65% | 0.40% | 0.39% | 71.23 years | 79.95 years | 1.02% |
| Andorra | 65,939 | 2.24% | 146.53 persons/km² | 1.03% | 0.55% | 1.76% | 80.55 years | 86.55 years | 0.41% |
| Angola | 11,510 | 2.84% | 8.97 persons/km² | 4.31% | 1.64% | 0.16% | 46.08 years | 50.82 years | 12.92% |
| … | … | … | … | … | … | … | … | … | … |
| Western Sahara | 239,333 | 2.34% | 0.90 persons/km² | 4.54% | 1.66% | -0.54% | 47.98 years | 50.57 years | 13.67% |
| World | 5,995,544,836 | 1.30% | 14.42 persons/km² | 2.20% | 0.90% | ? | 61.00 years | 65.00 years | 5.60% |
| Yemen | 16,942,230 | 3.34% | 32.09 persons/km² | 4.33% | 0.99% | 0.00% | 58.17 years | 61.88 years | 6.98% |
| Zambia | 9,663,535 | 2.12% | 13.05 persons/km² | 4.45% | 2.26% | 0.08% | 36.72 years | 37.21 years | 9.19% |
| Zimbabwe | 11,163,160 | 1.02% | 28.87 persons/km² | 3.06% | 2.04% | ? | 38.77 years | 38.94 years | 6.12% |

10 Table understanding: Information useful for normalization
- Captions: in the vicinity of the table (above, below, etc.)
- Footnotes: on annotated column labels or data cells
- Embedded information: in rows, columns or cells (e.g., $, %, (1,000), billions, etc.)
- Links to other views of the table, possibly with new information
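The embedded markers listed above ($, %, thousands separators, ? for unknown) can be expanded into typed values roughly as follows. This is a minimal sketch rather than TANGO's implementation; the function name `normalize_cell` and the type tags it returns are assumptions introduced for illustration.

```python
import re
from typing import Optional, Tuple, Union

# A number with optional sign, thousands separators and decimals.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def normalize_cell(raw: str) -> Tuple[Optional[Union[float, str]], str]:
    """Return (expanded value, type tag) for one raw table cell."""
    text = raw.strip()
    if text in {"", "?"}:
        return None, "unknown"
    if re.fullmatch(r"\$" + NUMBER, text):
        return float(text[1:].replace(",", "")), "dollars"
    if re.fullmatch(NUMBER + "%", text):
        return float(text[:-1].replace(",", "")), "percent"
    if re.fullmatch(NUMBER, text):
        return float(text.replace(",", "")), "number"
    # Anything else is kept as plain text.
    return text, "text"


for cell in ["$21,000,000,000", "7.3%", "?", "Albania"]:
    print(cell, "->", normalize_cell(cell))
```

Scale words such as "billions" or a "(1,000)" column multiplier would need the caption or header as extra input, which this sketch does not model.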

11 What is table understanding?
- Normalize table
- Take a table as an input and produce standard records in the form of attribute-value pairs as output
- Discover constraints among columns
- Understand the data values

| Country | GDP/PPP | Per Capita | Real-Growth Rate | Inflation |
|---|---|---|---|---|
| Afghanistan | $21,000,000,000 | $800 | ? | ? |
| Albania | $13,200,000,000 | $3,800 | 7.3% | 3.0% |
| Algeria | $177,000,000,000 | $5,600 | 3.8% | 3.0% |
| Andorra | $1,300,000,000 | $19,000 | 3.8% | 4.3% |
| Angola | $13,300,000,000 | $1,330 | 5.4% | 110.0% |
| Antigua and Barbuda | $674,000,000 | $10,000 | 3.5% | 0.4% |
| … | … | … | … | … |

Relationships: {has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita), has(Country, Real-growth rate*), has(Country, Inflation*)}
Annotations: Left-most, primary key; Dollar amount (from data frame); Percentage (from data frame); Country names (from data frame)
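The step from a normalized table to attribute-value records, followed by a simple key check, might look like the sketch below. It is only an illustration under stated assumptions: `to_records` and `candidate_keys` are hypothetical helper names, and treating "all values known and unique" as the test for a key is a simplification of the constraint discovery the slide describes.

```python
from typing import Dict, List


def to_records(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Turn each row into an attribute-value record keyed by column label."""
    return [dict(zip(header, row)) for row in rows]


def candidate_keys(records: List[Dict[str, str]]) -> List[str]:
    """Columns whose values are all known ('?' means unknown) and unique."""
    if not records:
        return []
    keys = []
    for column in records[0]:
        values = [record[column] for record in records]
        if "?" not in values and len(set(values)) == len(values):
            keys.append(column)
    return keys


header = ["Country", "GDP/PPP", "Per Capita", "Real-Growth Rate", "Inflation"]
rows = [
    ["Afghanistan", "$21,000,000,000", "$800", "?", "?"],
    ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"],
    ["Algeria", "$177,000,000,000", "$5,600", "3.8%", "3.0%"],
    ["Andorra", "$1,300,000,000", "$19,000", "3.8%", "4.3%"],
]
records = to_records(header, rows)
keys = candidate_keys(records)
# Following the slide's "left-most, primary key" annotation, prefer the first candidate.
primary_key = keys[0] if keys else None
print(records[1])
print(primary_key)
```

On a small sample several columns can look unique by accident, which is why the left-most position is used here as the tie-breaker.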

12 Example: Creating a domain ontology
- Has associated data frames
- Includes procedural knowledge
  - Distances
  - Duration between time zones
[Diagram: domain ontology with object sets Name, Geopolitical Entity, Time, Location, Longitude, Latitude, Country and City; relationship labels "has", "names", "Latitude and longitude designates location" and "Has GMT".]

13 Example: Table understanding to mini-ontology generation

| Agglomeration | Population | Continent | Country |
|---|---|---|---|
| Tokyo | 31,139,900 | Asia | Japan |
| New York-Philadelphia | 30,286,900 | The Americas | United States of America |
| Mexico | 21,233,900 | The Americas | Mexico |
| Seoul | 19,969,100 | Asia | Korea (South) |
| Sao Paulo | 18,847,400 | The Americas | Brazil |
| Jakarta | 17,891,000 | Asia | Indonesia |
| Osaka-Kobe-Kyoto | 17,621,500 | Asia | Japan |
| … | … | … | … |
| Niigata | 503,500 | Asia | Japan |
| Raurkela | 503,300 | Asia | India |
| Homjel | 502,200 | Europe | Belarus |
| Zunyi | 501,900 | Asia | China |
| Santiago | 501,800 | The Americas | Dominican Republic |
| Pingdingshan | 501,500 | Asia | China |
| Fargona | 501,000 | Asia | Uzbekistan |
| Kirov | 500,200 | Europe | Russia |
| Newcastle | 500,000 | Australia/Oceania | Australia |

[Diagram: generated mini-ontology with object sets Agglomeration, Population, Country and Continent.]

14 Example: Concept matching to ontology merging
[Diagram: the mini-ontology (Agglomeration, Population, Country, Continent) is merged with the domain ontology (Name, Geopolitical Entity, Time, Location, Longitude, Latitude, Country, City; relationship labels "has", "names", "Latitude and longitude designates location", "Has GMT").]
[Diagram: merge results with object sets Name, Geopolitical Entity, Time, Location, Longitude, Latitude, Country, City, Agglomeration, Population and Continent; relationship labels "has", "names", "Latitude and longitude designates location", "Has GMT".]

15 Concept matching
We use exhaustive concept matching techniques to match concepts from different mini-ontologies, including:
- Lexical and Natural Language Processing
- Value Similarity
- Value Features
- Data Frame Comparison
- Constraints

16 Concept Matching (Lexical & NLP)
- Lexical
  - Direct comparisons (substring/superstring)
  - WordNet (Synonyms, Word Senses, Hypernyms/Hyponyms)
- Natural Language Processing
  - Phrases in column headers
  - Footnotes (for columns, rows, values)
  - Explanations of symbols, rows, columns
  - Titles and subtitles
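The direct substring/superstring comparison of labels can be sketched as below. This covers only the plain string comparison; the WordNet and NLP steps are not modeled. The function names and the four result labels are assumptions made for this example.

```python
import re


def canonical(label: str) -> str:
    """Lower-case a label and reduce it to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", label.lower()))


def lexical_match(a: str, b: str) -> str:
    """Classify label a against label b: 'equal', 'substring', 'superstring' or 'none'."""
    ca, cb = canonical(a), canonical(b)
    if not ca or not cb:
        return "none"
    if ca == cb:
        return "equal"
    if ca in cb:
        return "substring"    # a is contained in b
    if cb in ca:
        return "superstring"  # a contains b
    return "none"


print(lexical_match("Population", "Population Growth rate"))
print(lexical_match("City/town", "City"))
print(lexical_match("Country", "Elevation"))
```

A substring hit such as "Population" inside "Population Growth rate" is only weak evidence, so in practice it would be combined with the value-based techniques on the following slides.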

17 Concept Matching (Value Similarity)
- Compute overlap for string values comparing data sets
- Compute overlap for numeric values comparing Gaussian Probability Distributions
- Compute similarity of numeric values using regression

18 Concept Matching (Value Similarity)
Real-world example:
- A: Afghanistan, Albania, Algeria, Andorra, …, Yemen, Zambia, Zimbabwe
- B: Afghanistan, Albania, Algeria, American Samoa, …, World, Yemen, Zambia, Zimbabwe
- Figure labels: "In B not in A", "In A not in B", "In B not in A"
- Total of 193 cells in A
- Total of 267 cells in B
- 77 fields in B not in A
- 3 fields in A not in B
- 190 total matches
- Proportion of matches with respect to A = 190/193 = 98%
- Proportion of matches with respect to B = 190/267 = 71%
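The two proportions in this example are the shared values divided by the size of each column's value set. A small sketch, assuming exact string equality as the match criterion; the column contents below are synthetic placeholders that only reproduce the counts from the slide (193 values in A, 267 in B, 190 shared).

```python
from typing import Iterable, Tuple


def overlap_proportions(a: Iterable[str], b: Iterable[str]) -> Tuple[int, float, float]:
    """Return (matches, matches relative to A, matches relative to B)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0, 0.0, 0.0
    matches = len(set_a & set_b)
    return matches, matches / len(set_a), matches / len(set_b)


# Placeholder values, not the real country lists.
column_a = [f"shared-{i}" for i in range(190)] + [f"only-a-{i}" for i in range(3)]
column_b = [f"shared-{i}" for i in range(190)] + [f"only-b-{i}" for i in range(77)]

matches, wrt_a, wrt_b = overlap_proportions(column_a, column_b)
print(matches, f"{wrt_a:.0%}", f"{wrt_b:.0%}")
```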

19 Concept Matching (Value Similarity)
- A: 31,900,600; 30,521,550; 25,335,200; 12,300,555; …; 3,567,203; 2,300,531; 1,400,112
- B: 31,500,900; 30,400,111; 25,500,100; 21,000,900; …; 7,000,000; 3,500,050; 2,300,000; 1,500,000
- Figure labels: "In B not in A", "In A not in B", "In B not in A", "Gaussian PDF"
- Total of 170 cells in A
- Total of 240 cells in B
- 50 fields in B not in A
- 2 fields in A not in B
- 168 total matches
- Proportion of matches with respect to A = 168/170 = 99%
- Proportion of matches with respect to B = 168/240 = 70%
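The slide does not say how the Gaussian distributions of two numeric columns are compared. As one possible stand-in, the sketch below fits a normal distribution to each column and scores them with the Bhattacharyya coefficient, which is 1 for identical distributions and approaches 0 as they separate. The choice of measure and the function name are assumptions, not the authors' method.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def gaussian_similarity(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Bhattacharyya coefficient between normal fits of two numeric columns."""
    mu_x, mu_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("each column needs at least two distinct values")
    var_sum = sd_x ** 2 + sd_y ** 2
    distance = (0.25 * (mu_x - mu_y) ** 2 / var_sum
                + 0.5 * math.log(var_sum / (2 * sd_x * sd_y)))
    return math.exp(-distance)


# Only the values visible on the slide; the elided rows are left out.
col_a = [31900600, 30521550, 25335200, 12300555, 3567203, 2300531, 1400112]
col_b = [31500900, 30400111, 25500100, 21000900, 7000000, 3500050, 2300000, 1500000]
growth = [3.95, 1.05, 2.10, 2.64]  # percentages from another table

print(gaussian_similarity(col_a, col_b))   # two population-like columns
print(gaussian_similarity(col_a, growth))  # population vs. growth rate
```

Two population columns from different sources score close to 1, while a population column compared with a percentage column scores near 0.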

20 Concept Matching (Value Features)
We can also compute similarities from value characteristics such as:
- Character/numeric length, ratio
- Numeric values: mean, variance, standard deviation
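A feature vector of this kind could be computed per column as in the sketch below. The exact feature set is an assumption: here "length" is the mean string length, "ratio" is the share of digit characters, and the numeric statistics are taken over the values that parse as numbers.

```python
from statistics import mean, pstdev, pvariance
from typing import Dict, List


def value_features(values: List[str]) -> Dict[str, float]:
    """Summarize a column of raw cell strings as a small feature dictionary."""
    if not values:
        raise ValueError("column is empty")
    total_chars = sum(len(v) for v in values)
    digit_chars = sum(ch.isdigit() for v in values for ch in v)
    features = {
        "mean_length": mean(len(v) for v in values),
        "digit_ratio": digit_chars / total_chars if total_chars else 0.0,
    }
    numbers = []
    for v in values:
        try:
            numbers.append(float(v.replace(",", "").rstrip("%")))
        except ValueError:
            pass  # non-numeric cell, skipped for the numeric statistics
    if numbers:
        features["mean"] = mean(numbers)
        features["variance"] = pvariance(numbers)
        features["stdev"] = pstdev(numbers)
    return features


print(value_features(["31,139,900", "30,286,900", "21,233,900"]))
print(value_features(["Japan", "India", "China"]))
```

Columns whose feature dictionaries are close (for example, similar digit ratio and comparable mean) become candidates for the same concept.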

21 Concept Matching (Data frames)
- Snippets of real-world knowledge about data (type, length, nearby keywords, patterns [as in regexps], functional, etc.)
- We have used data frames to
  - Recognize data types
  - Include recognizers for values (dates, times, longitude, latitude, countries, cities, etc.)
  - Provide conversion routines
  - Match headers, labels, footnotes and values
  - Compose or split columns (e.g., addresses)
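A very reduced picture of a data frame is a named value pattern paired with a conversion routine. The sketch below assumes that simplification; the `Recognizer` class, the three patterns and the 80% threshold are all hypothetical choices for illustration and leave out keywords, context and operations.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


def to_number(text: str) -> float:
    """Conversion routine: strip everything except digits, sign and decimal point."""
    return float(re.sub(r"[^\d.\-]", "", text))


@dataclass
class Recognizer:
    """Hypothetical stand-in for one data frame."""
    name: str
    pattern: str
    convert: Callable[[str], float]


RECOGNIZERS = [
    Recognizer("dollar amount", r"\$\d[\d,]*(?:\.\d+)?", to_number),
    Recognizer("percentage", r"-?\d+(?:\.\d+)?%", to_number),
    Recognizer("duration in years", r"\d+(?:\.\d+)? years", to_number),
]


def recognize_column(values: List[str], threshold: float = 0.8) -> Optional[Recognizer]:
    """Return the first recognizer matching at least `threshold` of the known values."""
    known = [v.strip() for v in values if v.strip() not in {"", "?"}]
    if not known:
        return None
    for recognizer in RECOGNIZERS:
        hits = sum(bool(re.fullmatch(recognizer.pattern, v)) for v in known)
        if hits / len(known) >= threshold:
            return recognizer
    return None


frame = recognize_column(["$800", "$3,800", "?"])
print(frame.name, frame.convert("$3,800"))
```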

22 Concept Matching (Constraints)
- Keys in tables (as well as nonkeys)
- Functional relationships
- 1-1, 1-*, *-1 or *-* correspondences
- Subset/superset of value sets
- Unknown and null values
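The correspondence between two columns can be read off the observed value pairs, as in this sketch. It assumes that rows with unknown values ("?" or empty) are ignored and that the observed sample is representative; the function name and the notation mapping are assumptions for the example.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def correspondence(pairs: Iterable[Tuple[str, str]]) -> str:
    """Classify the left-to-right correspondence as '1-1', '*-1', '1-*' or '*-*'."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        if left in {"", "?"} or right in {"", "?"}:
            continue  # unknown or null values carry no evidence
        left_to_right[left].add(right)
        right_to_left[right].add(left)
    left_functional = all(len(v) == 1 for v in left_to_right.values())
    right_functional = all(len(v) == 1 for v in right_to_left.values())
    if left_functional and right_functional:
        return "1-1"
    if left_functional:
        return "*-1"  # many left values share one right value
    if right_functional:
        return "1-*"
    return "*-*"


agglomeration_country = [
    ("Tokyo", "Japan"),
    ("Osaka-Kobe-Kyoto", "Japan"),
    ("Seoul", "Korea (South)"),
    ("Jakarta", "Indonesia"),
]
print(correspondence(agglomeration_country))
```

Here each agglomeration maps to one country while Japan appears for several agglomerations, so the result is a functional, many-to-one correspondence.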

23 Ontology merging/growing
- Direct merge (no conflicts)
  - Use results of matching phase to find similar concepts in ontologies (e.g., data value similarities, data frames, NLP, etc.)
- Conflict resolution
  - Interactively identify evidence and counter evidence of functional relationships among mini-ontologies using constraint resolution
- IDS: interaction with human knowledge engineer
  - Issues: identify
  - Default strategy: apply
  - Suggestions: make
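For the conflict-free case, a direct merge can be pictured as a union of concepts and relationships after renaming matched concepts. The sketch below assumes an ontology is just a set of concept names plus (source, label, target) triples; the `Ontology` class, the `matches` mapping and the relationship triples in the example are hypothetical, and conflict resolution and IDS interaction are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Relationship = Tuple[str, str, str]  # (source concept, label, target concept)


@dataclass
class Ontology:
    concepts: Set[str] = field(default_factory=set)
    relationships: Set[Relationship] = field(default_factory=set)


def direct_merge(base: Ontology, mini: Ontology,
                 matches: Optional[Dict[str, str]] = None) -> Ontology:
    """Union `mini` into `base`, renaming mini concepts per the matching results."""
    matches = matches or {}

    def rename(concept: str) -> str:
        return matches.get(concept, concept)

    merged = Ontology(set(base.concepts), set(base.relationships))
    merged.concepts |= {rename(c) for c in mini.concepts}
    merged.relationships |= {
        (rename(source), label, rename(target))
        for source, label, target in mini.relationships
    }
    return merged


base = Ontology(
    {"Geopolitical Entity", "Name", "Location", "Country", "City"},
    {("Geopolitical Entity", "has", "Name")},
)
mini = Ontology(
    {"Agglomeration", "Population", "Country", "Continent"},
    {("Agglomeration", "has", "Population")},  # illustrative triple
)
merged = direct_merge(base, mini)
print(sorted(merged.concepts))
```

Concepts with the same name, such as Country, collapse into one node; differently named concepts that the matching phase identified would be passed in through `matches`.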

24 Example: Another mini-ontology generation
[Diagram: mini-ontology with object sets Place, Longitude, Latitude, Elevation, USGS Quad, Area, Mine, Reservoir, Lake, City/town, Country, State and Place Name; the diagram includes a ⊎ symbol.]

25 Example: Another mini-ontology generation
[Diagram: the Place mini-ontology (Place, Longitude, Latitude, Elevation, USGS Quad, Area, Mine, Reservoir, Lake, City/town, Country, State, Place Name; ⊎ symbol) is merged with the growing ontology (Location, Longitude, Latitude, Name, Geopolitical Entity, Population, City, Agglomeration, Country, Continent, Time; relationship labels "has", "names", "Latitude and longitude designates location", "has GMT").]

26 Example: Concept Mapping to Ontology Merging
[Diagram: merged ontology with object sets Place, Elevation, USGS Quad, Area, Mine, Reservoir, Lake, Country, State, Location, Longitude, Latitude, Name, Geopolitical Entity, Population, Agglomeration, Continent, Time, Geopolitical Entity with population and City/town; relationship labels "has", "names", "Latitude and longitude designates location", "has GMT"; the diagram includes a ⊎ symbol.]

27 Future direction
- Start with multiple tables (or URLs) and generate mini-ontologies
- Identify the most suitable mini-ontologies to merge by calculating which tables have the most overlap of concepts
- Generate multiple domain ontologies
- Integrate with form-based data extraction tools (smarter Web search engines)
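Choosing which mini-ontologies to merge first by concept overlap could be sketched as a pairwise ranking. The slides do not name an overlap measure, so Jaccard similarity over concept names is an assumption here, as are the function names; the three concept sets reuse column labels from the example tables.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared concepts divided by all concepts in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_merge_candidates(ontologies: Dict[str, Set[str]]) -> List[Tuple[float, str, str]]:
    """Return (overlap, name, name) for every pair, highest overlap first."""
    pairs = [
        (jaccard(ontologies[x], ontologies[y]), x, y)
        for x, y in combinations(sorted(ontologies), 2)
    ]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


mini_ontologies = {
    "cities": {"Agglomeration", "Population", "Country", "Continent"},
    "places": {"Place", "Longitude", "Latitude", "Elevation", "Country", "State"},
    "economy": {"Country", "GDP/PPP", "Per Capita", "Inflation"},
}
ranked = rank_merge_candidates(mini_ontologies)
for score, first, second in ranked:
    print(f"{first} + {second}: {score:.2f}")
```

Exact name equality understates overlap, so a fuller version would count concepts as shared when the matching techniques from the earlier slides agree.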

