Automating Schema Matching for Data Integration. David W. Embley, BYU Data Extraction Group, Brigham Young University. UFMG, June 2002. Funded by NSF.

Information Exchange
Source, Target
Information Extraction, Schema Matching
Leverage this … to do this

Presentation Outline
– Information Extraction
– Schema Matching for HTML Tables
– Direct Schema Matching
– Indirect Schema Matching
– Conclusions and Future Work

Information Extraction

Extracting Pertinent Information from Documents

A Conceptual Modeling Solution
[Diagram: Car object set connected by "has" and "is for" relationship sets to Year, Price, Make, Mileage, Model, Feature, and PhoneNr, with PhoneNr connected to Extension; the placement of the participation constraints (1..*, *) is not recoverable.]

Car-Ads Ontology
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; },
  …
End;
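The Year rule above can be read as three steps: find a context, pull the two digits out of it, and expand them to a four-digit year. The following is a minimal Python sketch of that reading. The two patterns are copied from the slide; the function name `extract_years` and the exact way the three clauses interact are assumptions, since the slide does not show the group's actual recognizer.

```python
import re

# Patterns copied from the Year data frame on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years found by the two-digit Year rule (hypothetical helper)."""
    years = []
    for ctx in CONTEXT.finditer(text):
        # Within each context match, pull out the two-digit year.
        hit = EXTRACT.search(ctx.group(0))
        if hit:
            # substitute "^" -> "19": prepend the century (assumed interpretation).
            years.append("19" + hit.group(0))
    return years


print(extract_years("'97, CHEVY Cavalier"))
print(extract_years("$97, down payment"))
```

The context pattern rejects a leading dollar sign or digit, so a price such as "$97," is not mistaken for a year.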

Recognition and Extraction
Car table (columns: Car, Year, Make, Model, Mileage, Price, PhoneNr); only fragments of the cell values survive:
– Subaru SW $1900 (363) …
– Elandra (336) …
– HONDA ACCORD EX 100K (336) …
Car-Feature table (Car: Feature values):
– 0001: Auto; AC
– 0002: Black door; tinted windows; Auto; pb; ps; cruise; am/fm; cassette stero; a/c
– 0003: Auto; jade green; gold

Schema Matching for HTML Tables

Table-Schema Matching (Basic Idea)
Many tables on the Web
Ontology-Based Extraction:
– Works well for unstructured or semistructured data
– What about structured data (tables)?
Method:
– Form attribute-value pairs
– Do extraction
– Infer mappings from extraction patterns

Problem: Different Schemas
Target Database Schema:
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different Source Table Schemas:
– {Run #, Yr, Make, Model, Tran, Color, Dr}
– {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
– {Vehicle, Distance, Price, Mileage}
– {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Problem: Attribute is Value

Problem: Attribute-Value is Value

Problem: Value is not Value

Problem: Implied Values

Problem: Missing Attributes

Problem: Compound Attributes

Problem: Merged Values

Problem: Values not of Interest

Problem: Factored Values

Problem: Split Values

Problem: Information Behind Links
– Single-column table (formatted as a list)
– Table extending over several pages

Solution
– Form attribute-value pairs (adjust if necessary)
– Do extraction
– Infer mappings from extraction patterns

Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest (operator symbols did not survive extraction):
(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
[Figure: sample ACURA table with legend]
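To make the unnesting step concrete, here is a small Python sketch that flattens the nesting Make, (Model, (Year, …)*)* into one row per car. The nested-dictionary representation, the function name `unnest`, and the model names in the sample data are all hypothetical; only the nesting pattern and the make ACURA come from the slide.

```python
def unnest(nested_table):
    """Flatten {make: {model: [record, ...]}} into one flat row per record."""
    rows = []
    for make, models in nested_table.items():
        for model, records in models.items():
            for record in records:
                # Copy the factored Make and Model values into every row.
                rows.append({"Make": make, "Model": model, **record})
    return rows


# Hypothetical sample data; "ModelA" and "ModelB" are placeholders.
sample = {
    "ACURA": {
        "ModelA": [{"Year": "1996", "Colour": "red"},
                   {"Year": "1998", "Colour": "black"}],
        "ModelB": [{"Year": "1997", "Colour": "white"}],
    }
}

for row in unnest(sample):
    print(row)
```

After flattening, every row carries its own Make and Model, so attribute-value pairs can be formed row by row.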

Solution: Replace Boolean Values
[Figure: ACURA table with legend; "Yes" entries in the Auto, Air Cond., AM/FM, and CD columns are replaced by the column name, e.g. "Yes" under CD becomes "CD".]
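The replacement of Boolean cells can be sketched in a few lines of Python. This is an illustration under assumptions: the set of strings treated as "true", the choice to map every other value in a Boolean column to `None`, and the function name `replace_boolean_values` are not taken from the slides.

```python
# Strings treated as a positive Boolean cell (assumption; the slide only shows "Yes").
TRUE_VALUES = {"yes", "y", "true"}


def replace_boolean_values(row, boolean_columns):
    """Replace 'Yes' in a Boolean column with the column name; drop other values."""
    result = {}
    for attribute, value in row.items():
        if attribute in boolean_columns:
            is_true = value.strip().lower() in TRUE_VALUES
            # A positive cell becomes the attribute name itself, e.g. CD: "CD".
            result[attribute] = attribute if is_true else None
        else:
            result[attribute] = value
    return result


row = {"Model": "ModelA", "CD": "Yes", "Auto": "No", "AM/FM": "Yes"}
print(replace_boolean_values(row, {"CD", "Auto", "Air Cond.", "AM/FM"}))
```

With the column name in place of "Yes", an extraction ontology that recognizes feature names such as "CD" can treat the cell like any other value.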

Solution: Form Attribute-Value Pairs
[Figure: ACURA table with legend and the Auto, Air Cond., AM/FM, and CD columns; pair contents not recoverable]

Solution: Adjust Attribute-Value Pairs
[Figure: ACURA table with legend and the Auto, Air Cond., AM/FM, and CD columns; pair contents not recoverable]

Solution: Do Extraction
[Figure: ACURA table with legend and the Auto, Air Cond., AM/FM, and CD columns; pair contents not recoverable]

Solution: Infer Mappings
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Each row is a car.
Mapping expressions (operator symbols did not survive extraction):
– Model: (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
– Make: (Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
– Year: Table
Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).

Solution: Do Extraction
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
– Model: (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table

Solution: Do Extraction
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
– Price: Table

Solution: Do Extraction
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
– Feature: built from the Colour, Auto, Air Cond., AM/FM, and CD columns of Table, each mapped to Feature, with "Yes" values replaced in the Auto, Air Cond., AM/FM, and CD columns

Experiment
Tables from 60 sites
– 10 "training" tables
– 50 test tables
357 mappings (from all 60 sites)
– 172 direct mappings (same attribute and meaning)
– 185 indirect mappings (29 attribute synonyms, 5 "Yes/No" columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)

Results
10 "training" tables
– 100% of the 57 mappings (no false mappings)
– 94.6% of the values in linked pages (5.4% false declarations)
50 test tables
– 94.7% of the 300 mappings (no false mappings)
– On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
16 missed mappings
– 4 partial (not all unions included)
– 6 non-U.S. car-ads (unrecognized makes and models)
– 2 U.S. unrecognized makes and models
– 3 prices (missing $ or found MSRP instead)
– 1 mileage (mileages less than 1,000)
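The reported figures are internally consistent, which a short Python check makes visible. The counts below are taken from the Experiment and Results slides; the variable names are only illustrative.

```python
# Missed mappings on the 50 test tables, by cause (counts from the slide).
missed = {
    "partial unions": 4,
    "non-U.S. car ads": 6,
    "U.S. unrecognized makes and models": 2,
    "prices": 3,
    "mileage": 1,
}

training_mappings = 57
test_mappings = 300

total_missed = sum(missed.values())            # 4 + 6 + 2 + 3 + 1
found = test_mappings - total_missed           # mappings discovered on test tables
recall_percent = 100 * found / test_mappings   # share of test mappings found

print(total_missed, found, round(recall_percent, 1))
print(training_mappings + test_mappings)       # total mappings over all 60 sites
```

The five causes add up to the 16 missed mappings, 284 of 300 gives the 94.7% figure, and 57 plus 300 gives the 357 mappings reported for all 60 sites.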

Direct Schema Matching

Attribute Matching for Populated Schemas
Central Idea: Exploit All Data & Metadata
Matching Possibilities (Facets)
– Attribute Names
– Data-Value Characteristics
– Expected Data Values
– Data-Dictionary Information
– Structural Properties

Approach
Target Schema T, Source Schema S
Framework
– Individual Facet Matching
– Combining Facets
– Best-First Match Iteration

Example
[Diagram: Source Schema S and Target Schema T, each centered on Car with "has" relationship sets and 0:1 or 0:* participation constraints; attribute labels in the diagram: Year, Make, Model, Cost, Style, Feature, Phone, Mileage, Miles. The assignment of attributes to S and T is not recoverable.]

Individual Facet Matching
– Attribute Names
– Data-Value Characteristics
– Expected Data Values

Attribute Names
Target and Source Attributes
– T : A
– S : B
WordNet
C4.5 Decision Tree: feature selection, trained on schemas in DB books
– f0: same word
– f1: synonym
– f2: sum of distances to a common hypernym root
– f3: number of different common hypernym roots
– f4: sum of the number of senses of A and B

WordNet Rule
– The number of different common hypernym roots of A and B
– The sum of distances of A and B to a common hypernym
– The sum of the number of senses of A and B

Confidence Measures

Data-Value Characteristics
C4.5 Decision Tree
Features
– Numeric data (mean, variation, standard deviation, …)
– Alphanumeric data (string length, numeric ratio, space ratio)
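The features listed above are simple statistics over a column of values. The Python sketch below computes one plausible version of each. It assumes "variation" means population variance, that "numeric ratio" and "space ratio" are the fractions of digit and whitespace characters, and that string length is averaged per value; the slides do not define these precisely, and the function names are hypothetical.

```python
import statistics


def numeric_features(values):
    """Summary statistics for a column of numbers (population versions assumed)."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "stdev": statistics.pstdev(values),
    }


def string_features(values):
    """Character-level statistics for a column of strings."""
    total_chars = sum(len(v) for v in values)
    if total_chars == 0:
        return {"avg_length": 0.0, "numeric_ratio": 0.0, "space_ratio": 0.0}
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch.isspace() for v in values for ch in v)
    return {
        "avg_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars,
        "space_ratio": spaces / total_chars,
    }


print(numeric_features([1900, 4500, 12000]))
print(string_features(["ACCORD EX", "Elandra"]))
```

Feature vectors like these, computed for a target attribute and a source attribute, would be the input to the decision tree that judges whether the two columns hold the same kind of data.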

Confidence Measures

Expected Data Values
Target Schema T and Source Schema S
– Regular expression recognizer for attribute A in T
– Data instances for attribute B in S
Hit Ratio = N'/N for (A, B) match
– N': number of B data instances recognized by the regular expressions of A
– N: number of B data instances
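The hit ratio N'/N is straightforward to compute once the recognizers for A are available. In this Python sketch the year recognizer and the sample values are hypothetical, and requiring a full match of each trimmed value is an assumption about what "recognized" means.

```python
import re


def hit_ratio(recognizers, instances):
    """Return N'/N: the fraction of B's instances recognized by A's regular expressions."""
    if not instances:
        return 0.0
    recognized = sum(
        1 for value in instances
        if any(r.fullmatch(value.strip()) for r in recognizers)
    )
    return recognized / len(instances)


# Hypothetical recognizer for a target attribute Year.
year_recognizers = [re.compile(r"(19|20)\d{2}")]

# Hypothetical data instances of a source attribute B.
source_values = ["1997", "2001", "red", "1989"]

print(hit_ratio(year_recognizers, source_values))
```

Here three of the four source values look like years, so the (A, B) pair would receive a hit ratio of 0.75.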

Confidence Measures

Combined Measures
Threshold: [value lost in extraction]

Final Confidence Measures
[Table of final confidence values; only three 0 entries survive]

Experimental Results
This schema, plus 6 other schemas
– 32 matched attributes
– 376 unmatched attributes
Combined result
– Matched: 100%
– Unmatched: 99.5% (false matches: "Feature" with "Color", "Feature" with "Body Type")
Individual facets (two sets of figures, in slide order)
– F1 (WordNet): 93.8%, 98.9%
– F2 (Value Characteristics): 84%, 97.9%
– F3 (Expected Values): 92%, 98.4%

Indirect Schema Matching

Schema Matching
[Diagram: source and target Car schemas; attribute labels: Year, Cost, Style, Feature, Phone, Miles, Mileage, Model, Make & Model, Color, Body Type]

Mapping Generation
Direct Matches as described earlier:
– Attribute Names based on WordNet
– Value Characteristics based on value lengths, averages, …
– Expected Values based on regular-expression recognizers
Indirect Matches:
– Direct matches
– Structure Evaluation: Union, Selection, Decomposition, Composition

Union and Selection
[Diagram: the same source and target Car schemas, highlighting union and selection]

Decomposition and Composition
[Diagram: the same source and target Car schemas, highlighting decomposition and composition]
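The indirect operations named on these slides can be illustrated with toy Python functions. This is a sketch under stated assumptions, not the authors' algorithm: the attribute names come from the diagrams, but which schema each belongs to, the lexicon-based splitting strategy, and all function names are hypothetical.

```python
def union_columns(rows, source_attributes):
    """Union: gather the values of several source attributes into one set per row."""
    return [{row[a] for a in source_attributes if row.get(a)} for row in rows]


def select_values(values, recognized):
    """Selection: keep only the values that belong to the wanted attribute."""
    return {v for v in values if v.lower() in recognized}


def decompose(value, known_makes):
    """Decomposition: split a 'Make & Model' value using a lexicon of makes."""
    # Try longer makes first so a longer name wins over its own prefix.
    for make in sorted(known_makes, key=len, reverse=True):
        if value.lower().startswith(make.lower()):
            return make, value[len(make):].strip()
    return None, value


def compose(make, model):
    """Composition: merge separate Make and Model values into one string."""
    return f"{make} {model}".strip()


rows = [{"Color": "red", "Body Type": "sedan"}]
print(union_columns(rows, ["Color", "Body Type"]))
print(select_values({"red", "sedan"}, {"red", "blue", "black"}))
print(decompose("Honda Accord EX", {"Honda", "Ford"}))
print(compose("Honda", "Accord EX"))
```

Union and selection are inverses in spirit, as are decomposition and composition, which is why the slides present them in pairs.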

Structure
[Diagram: schema PO (POShipTo, POBillTo, POLines; City, Street; Item, Count, Line, Qty, UoM) and schema PurchaseOrder (DeliverTo, InvoiceTo, Items; Item, ItemCount, ItemNumber, Quantity, UnitOfMeasure; Address with City, Street), labeled Target and Source]
Example taken from [MBR, VLDB'01]

Structure (Nonlexical Matches)
[Diagram: the PO and PurchaseOrder schemas, with DeliverTo highlighted]

Structure (Join over FD Relationship Sets, …)
[Diagram: the PO and PurchaseOrder schemas, with City and Street attached directly to POShipTo and DeliverTo]

Structure (Lexical Matches)
[Diagram: the PO and PurchaseOrder schemas, with lexical object sets City, Street, Count, Line, Qty, Quantity, and UnitOfMeasure highlighted]

Experiments
Methodology
Measures
– Precision
– Recall
– F Measure
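The three measures can be computed from counts of correct matches, false positives, and false negatives. The Python sketch below assumes the F measure is the balanced harmonic mean of precision and recall, which the slides do not state explicitly. The counts in the usage lines come from the Conclusions slide (494 of 513 matched with 22 false positives; 32 of 32 matched with 2 false positives).

```python
def precision_recall_f(correct, false_positive, false_negative):
    """Return (precision, recall, F), with F the balanced harmonic mean (assumed)."""
    precision = correct / (correct + false_positive)
    recall = correct / (correct + false_negative)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Direct and indirect matching: 494 of 513 found, 22 false positives.
p, r, f = precision_recall_f(correct=494, false_positive=22, false_negative=513 - 494)
print(round(100 * p), round(100 * r), round(100 * f))

# Direct matching only: 32 of 32 found, 2 false positives.
p_direct, r_direct, _ = precision_recall_f(correct=32, false_positive=2, false_negative=0)
print(round(100 * p_direct), round(100 * r_direct))
```

Both sets of counts reproduce the rounded percentages quoted in the conclusions.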

Results
[Table: columns Application (Number of Schemas), Precision (%), Recall (%), F (%), Correct, False Positive, False Negative; rows Course Schedule (5), Faculty Member (5), Real Estate (5); cell values not recoverable]
Data borrowed from Univ. of Washington [DDH, SIGMOD01]
Indirect Matches: 94% (precision, recall, F-measure)
Rough Comparison with U of W Results (Direct Matches only)
– Course Schedule: accuracy ~71%
– Faculty Members: accuracy ~92%
– Real Estate (2 tests): accuracy ~75%

Conclusions and Future Work

Conclusions
Table Mappings
– Tables: 94.7% (Recall); 100% (Precision)
– Linked Text: ~97% (Recall); ~86% (Precision)
Direct Attribute Matching
– Matched 32 of 32: 100% Recall
– 2 False Positives: 94% Precision
Direct and Indirect Attribute Matching
– Matched 494 of 513: 96% Recall
– 22 False Positives: 96% Precision

Current & Future Work: Improve and Extend Indirect Matching
– Improve Object-Set Matching (e.g. Lex/non-Lex)
– Add Relationship-Set Matching
– Computations

Current & Future Work: Tables Behind Forms
– Crawling the Hidden Web
– Filling in Forms from Global Queries

Current & Future Work: Developing Extraction Ontologies
Creation from Knowledge Sources and Sample Application Pages
– K Ontology + Data Frames, Lexicons, …
– RDF Ontologies
User Creation by Example

Current & Future Work: and Much More …
– Table Understanding
– Microfilm Census Records
– Generate Ontologies by Reading Tables
– …