Crossing the Structure Chasm Alon Halevy University of Washington, Seattle UBC, January 15, 2004.

Presentation transcript:

Crossing the Structure Chasm Alon Halevy University of Washington, Seattle UBC, January 15, 2004

The Structure Chasm
Authoring: creating a schema (structured data) vs. writing text (unstructured data).
Querying: using someone else's schema vs. keywords.
Data sharing: committees, standards vs. easy.
But with structure we can pose complex queries.

Why is This a Problem? Databases used to be isolated and administered only by experts. Today's applications call for large-scale data sharing: big science (bio-medicine, astrophysics, …); government agencies; large corporations; the web (over 100,000 searchable data sources). The vision: content authoring by anyone, anywhere; powerful database-style querying; use relevant data from anywhere to answer the query; the Semantic Web. Fundamental problem: reconciling different models of the world.

Outline. Other benefits of structure: (semantic) personal data management. A tour of recent data sharing architectures: data integration systems; peer-data management systems. The algorithmic problems: query reformulation; reconciling semantic heterogeneity. What can we do with a large corpus of schemas?

Adding Structure to Email. Email is often used for lightweight data management tasks: organizing a PC meeting + dinner; arranging a 'balanced' potluck; giving away opera tickets; announcing an event and associated reminders. Some specialized tools/services: Outlook scheduling, evite.com. Can we delegate some tasks easily?

Semantic Email Processes. [Figure: message flow among Originator, Process, Database, and Recipients, with a constraints check (OK / STOP). Messages shown: "Start a potluck process"; "What will you bring?"; "I'll bring a dessert"; "I'll bring an appetizer"; "I'll bring an entree"; "Too many desserts. Appetizer or entrée?"; "Here is what everyone is bringing…".]

Semantic Email [Etzioni, McDowell, (Ha)Levy]. Creating the structure? We'll help with template interfaces. Incorporating additional knowledge? "I always bring desserts"; "I don't schedule morning meetings". Another data sharing challenge. But it's free: (and cross platform)

Personal Data Management [Semex: Sigurdsson, Nemes, H.]. Data is organized by application: HTML, mail & calendar, papers, files, presentations. [Figure: association graph over Person, Paper, Message, Event, Document, Web Page, and Presentation, with links Author, Cites, Sender, Recipients, Organizer, Participants, Homepage, Cached, and Softcopy.]

Finding Publications. Persons: A. Halevy; Dan Suciu; Maya Rodrig; Steven Gribble; Zachary Ives. Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa.

Following Associations (1). [Figure: Bernstein; Publication.]

Following Associations (2). [Figure: Bernstein; Publication: "A survey of approaches to automatic schema matching"; "Corpus-based schema matching"; "Database management for peer-to-peer computing: A vision"; "Matching schemas by learning from others".]

Following Associations (3). [Figure: Bernstein; Publication; Cited by; Publication; Citations.]

Following Associations (4). [Figure: Bernstein; Publication; Cited Authors.]

Structure for Personal Data. High-level concepts are given, but we can later extend and personalize the concept hierarchy, share (parts of) our data with others, and incorporate external data into our view. Concepts are populated automatically with instances. Need instance-level reconciliation: Alon Halevy, A. Halevy, Alon Y. Levy – same guy!
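A minimal sketch of instance-level reconciliation, assuming a very simple rule: two name strings refer to the same person when they share a first initial and a canonical surname. The alias table and the function names (`SURNAME_ALIASES`, `name_key`, `reconcile`) are illustrative assumptions, not part of Semex.

```python
import re
from collections import defaultdict

# Hypothetical alias table: surname variants that denote the same family name.
SURNAME_ALIASES = {"levy": "halevy"}

def name_key(name: str) -> tuple[str, str]:
    """Reduce a personal name to (first initial, canonical surname)."""
    tokens = re.findall(r"[A-Za-z]+", name.lower())
    first_initial, surname = tokens[0][0], tokens[-1]
    return first_initial, SURNAME_ALIASES.get(surname, surname)

def reconcile(names: list[str]) -> list[list[str]]:
    """Group name strings that reduce to the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return list(groups.values())

mentions = ["Alon Halevy", "A. Halevy", "Alon Y. Levy", "Dan Suciu"]
clusters = reconcile(mentions)
```

A rule this coarse would also merge unrelated people who share an initial and surname, so a real system would need more evidence (co-authors, email addresses, and so on).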

Outline. Other benefits of structure: (semantic) personal data management. A tour of recent data sharing architectures: data integration systems; peer-data management systems. The algorithmic problems: query reformulation; reconciling semantic heterogeneity. What can we do with a large corpus of schemas?

Data Integration. Goal: provide a uniform interface to a set of autonomous data sources. First step towards data sharing. Many research projects (DB & AI). Mine: Information Manifold, Tukwila, LSD. Recent industry: startups (Nimble, Enosys, Composite, MetaMatrix); products from big players (BEA, IBM).

Relational DBMS Refresher. Schema: the template for data. Tables: Students, Takes, Courses. Queries:
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
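To make the refresher concrete, here is a small runnable sketch using Python's built-in `sqlite3` module. The column lists and the sample rows are assumptions; the slide only shows the attributes the query uses (`name`, `ssn`, `cid`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed table layouts; only name, ssn, and cid appear in the slide's query.
    CREATE TABLE Students (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Courses  (cid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Takes    (ssn TEXT, cid TEXT);
    -- Invented sample data.
    INSERT INTO Students VALUES ('111', 'Mary'), ('222', 'John');
    INSERT INTO Courses  VALUES ('c1', 'Databases'), ('c2', 'AI');
    INSERT INTO Takes    VALUES ('111', 'c1'), ('222', 'c2');
""")

# The query from the slide: names of the courses that Mary takes.
rows = conn.execute("""
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
""").fetchall()
mary_courses = [name for (name,) in rows]
```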

Data Integration: Higher-level Abstraction. [Figure: a query Q on the mediated schema is translated through semantic mappings into source queries Q1, Q2, Q3, ….]

Mediated Schema. Sources: OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO. Mediated schema concepts: Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment. Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code? [Tarczy-Hornoch, Mork]

Semantic Mappings
Inventory Database A:
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)
Differences in: names in schema; attribute grouping; coverage of databases; granularity and format of attributes.
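One way to picture a semantic mapping between the two inventory schemas is a view that expresses part of Database A's `BooksAndMusic` in terms of Database B's tables. The sketch below uses `sqlite3`. The correspondences it encodes (for example `Price` to `SuggestedPrice`, or `FirstName` plus `LastName` to `Author`) are assumptions made for illustration, the sample rows are invented, and the attributes `Publisher`, `Categories`, and `Keywords` are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Subset of Inventory Database B.
    CREATE TABLE Books   (Title TEXT, ISBN TEXT, Price REAL, DiscountPrice REAL, Edition TEXT);
    CREATE TABLE Authors (ISBN TEXT, FirstName TEXT, LastName TEXT);
    CREATE TABLE CDs     (Album TEXT, ASIN TEXT, Price REAL, DiscountPrice REAL, Studio TEXT);
    CREATE TABLE Artists (ASIN TEXT, ArtistName TEXT, GroupName TEXT);

    -- Invented sample rows.
    INSERT INTO Books   VALUES ('Example Book', 'isbn-1', 20.0, 15.0, '1st');
    INSERT INTO Authors VALUES ('isbn-1', 'Ann', 'Author');
    INSERT INTO CDs     VALUES ('Example Album', 'asin-1', 12.0, 10.0, 'Studio X');
    INSERT INTO Artists VALUES ('asin-1', 'Sam Singer', NULL);

    -- Assumed mapping: part of Database A's relation as a view over Database B.
    CREATE VIEW BooksAndMusic AS
      SELECT B.Title AS Title,
             A.FirstName || ' ' || A.LastName AS Author,  -- attribute grouping differs
             B.ISBN AS ItemID,                            -- names differ
             'Book' AS ItemType,
             B.Price AS SuggestedPrice
      FROM Books B JOIN Authors A ON A.ISBN = B.ISBN
      UNION ALL
      SELECT C.Album, R.ArtistName, C.ASIN, 'CD', C.Price
      FROM CDs C JOIN Artists R ON R.ASIN = C.ASIN;
""")

items = conn.execute(
    "SELECT Title, Author, ItemID, ItemType, SuggestedPrice "
    "FROM BooksAndMusic ORDER BY ItemType"
).fetchall()
```

The view touches three of the listed differences: names (`Album` vs. `Title`), attribute grouping (one `Author` string vs. two name columns), and coverage (books and CDs live in separate tables in Database B).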

Issues for Semantic Mappings. Formalism for mappings. Reformulation algorithms. How will we create them? [Figure: a query Q on the mediated schema is reformulated through semantic mappings into source queries Q'.]

Beyond Data Integration. Mediated schema is a bottleneck for large-scale data sharing. It's hard to create, maintain, and agree upon.

Peer Data Management Systems. Mappings specified locally. Map to most convenient nodes. Queries answered by traversing semantic paths. Piazza: [Tatarinov, H., Ives, Suciu, Mork]. [Figure: peers UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, and Toronto connected by mappings; a query Q is reformulated into Q1 through Q6.]
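The idea of answering a query by traversing semantic paths can be sketched as a graph search over locally specified mappings. The peer names come from the slide, but which peer maps to which is not recoverable from the transcript, so the edges below are invented. This is a plain breadth-first search, not Piazza's reformulation algorithm.

```python
from collections import deque

# Hypothetical local mappings: each peer lists the peers it has a mapping to.
MAPPINGS = {
    "UBC": ["UW"],
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["DBLP"],
    "DBLP": ["CiteSeer"],
    "Waterloo": ["UW"],
    "Toronto": ["Waterloo"],
    "CiteSeer": [],
}

def reachable_paths(start: str) -> dict[str, list[str]]:
    """Breadth-first search: a shortest mapping path from start to each reachable peer."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor in MAPPINGS.get(peer, []):
            if neighbor not in paths:
                paths[neighbor] = paths[peer] + [neighbor]
                queue.append(neighbor)
    return paths

paths_from_ubc = reachable_paths("UBC")
```

A query posed at UBC could be reformulated along each path in `paths_from_ubc`. Peers with no outgoing path from UBC in this invented graph (Toronto, Waterloo) contribute nothing.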

PDMS-Related Projects: Hyperion (Toronto); PeerDB (Singapore); Local relational models (Trento); Edutella (Hannover, Germany); Semantic Gossiping (EPFL Zurich); Raccoon (UC Irvine); Orchestra (Ives, U. Penn).

A Few Comments about Commerce. Until 5 years ago: data integration = data warehousing. Since then: a wave of startups (Nimble, MetaMatrix, Calixa, Composite, Enosys); big guys made announcements (IBM, BEA); [delay]; big guys released products. Success: analysts have a new buzzword, EII, a new addition to the acronym soup (with EAI). Lessons: performance was fine; need management tools.

Data Integration: Before. [Figure: a query Q on the mediated schema is reformulated into a query Q' on each source.]

Data Integration: After. [Figure: Nimble architecture. Front end: user applications, XML query, Lens™, FileInfoBrowser™, Software Developers Kit, NIMBLE™ APIs, Lens Builder™. Management tools: Integration Builder, Security Tools, Data Administrator, Concordance Developer. Integration layer: Nimble Integration Engine™ (Compiler, Executor), Metadata Server, Cache, Common XML View. Sources: relational, data warehouse/mart, legacy, flat file, web pages.]

Sound Business Models. Explosion of intranet and extranet information. 80% of corporate information is unmanaged. By X more enterprise data than 1999. The average company maintains 49 distinct enterprise applications and spends 35% of total IT budget on integration-related efforts. Source: Gartner, 1999.

Outline. Other benefits of structure: (semantic) personal data management. A tour of recent data sharing architectures: data integration systems; peer-data management systems. The algorithmic problems: query reformulation; reconciling semantic heterogeneity. What can we do with a large corpus of schemas?

Languages for Schema Mapping: GAV, LAV, GLAV. [Figure: a query Q on the mediated schema is reformulated into a query Q' on each source.]

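Local-as-View (LAV). Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name). Sources: R1, R2, R3, R4, R5, each described as a view over the mediated schema.
Books before 1970: R1(x, y, n) :- Book(x, y, z, t), Author(x, n), t < 1970
Humor books: R5(x, y) :- Book(x, y, "Humor", t)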

Query Reformulation. Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name). Sources: R1, R2, R3, R4, R5, where R1 holds books before 1970 and R5 holds humor books. Query: find authors of humor books. Plan: R1 join R5.

Query Reformulation. Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name). Sources: R1, R2, R3, R4, R5, where R1 exposes (ISBN, Title, Name) and R5 exposes (ISBN, Title). Query: find authors of humor books before 1960. Plan: can't do it! (subtle reasons)
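The two reformulation slides above can be sketched in a few lines of Python. The rows are invented, and the mediated relations are materialized only so the views can be computed; in a real data integration system the mediated schema holds no data. The view definitions follow the LAV slide, with `Book` taken to have four attributes as in the mediated schema.

```python
# Invented sample data for the mediated schema: (isbn, title, genre, year).
book = [
    ("i1", "Old Jokes", "Humor", 1955),
    ("i2", "Newer Jokes", "Humor", 1965),
    ("i3", "Modern Jokes", "Humor", 1985),
    ("i4", "Old Drama", "Drama", 1950),
]
author = [("i1", "Ann"), ("i2", "Bob"), ("i3", "Cy"), ("i4", "Dee")]

# Source contents, defined as views over the mediated schema (LAV).
# R1(x, y, n) :- Book(x, y, z, t), Author(x, n), t < 1970
r1 = [(x, y, n) for (x, y, z, t) in book for (x2, n) in author
      if x == x2 and t < 1970]
# R5(x, y) :- Book(x, y, "Humor", t)
r5 = [(x, y) for (x, y, z, t) in book if z == "Humor"]

# Plan for "find authors of humor books": R1 join R5 on ISBN.
plan_answers = sorted({n for (x, y, n) in r1 for (x5, y5) in r5 if x == x5})

# What the query would return if the mediated schema actually held the data.
true_answers = sorted({n for (x, y, z, t) in book for (x2, n) in author
                       if x == x2 and z == "Humor"})
```

With these two sources the plan is sound but incomplete: it returns Ann and Bob, and misses Cy because no source exposes authors of books from 1970 onward. One plausible reading of the "subtle reasons" on the second slide is visible here as well: neither R1 nor R5 exposes the year, so the plan cannot separate the 1955 book from the 1965 one and therefore cannot answer "before 1960".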

Query Reformulation. Query is posed on mediated schema that contains no data. Sources are answers to queries (views). Problem: answering queries using views. (Conceptually) need to invert query expression. Traditional databases also use this: can you reuse previously cached results?

Answering Queries Using Views. NP-complete for basic queries [LMSS, PODS 95]. Results depend on: query language used for sources and queries; open-world vs. closed-world assumption; allowable access patterns to the sources. A lot of beautiful theory!

Theory? A lot of beautiful theory. “There is in these words the beautiful maneuverability of the abstract, rushing in to replace the intractability of the concrete.” Milan Kundera The Book of Laughter and Forgetting

Practical Query Reformulation A lot of nice theory. But also very practical algorithms: MiniCon [Pottinger and H., 2001]: scales to thousands of sources. Every commercial DBMS implements some version of answering queries using views. See [Halevy, 2001] for survey.

Reformulation in PDMS. Can't follow all paths naively. Pruning techniques [Tatarinov, H.]. Can we pre-compute some paths? Need to compose mappings [Madhavan, H., VLDB-2003]. [Figure: peers UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, Toronto.]
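Pre-computing a path means composing the mappings along it. As a heavily simplified sketch, assume each mapping is just a one-to-one attribute correspondence stored as a dictionary. Real mappings are query expressions, and composing those is much harder than this. The attribute names below are invented.

```python
def compose(m_ab: dict[str, str], m_bc: dict[str, str]) -> dict[str, str]:
    """Compose A->B and B->C attribute correspondences into A->C.

    Attributes of A whose image in B has no counterpart in C are dropped.
    """
    return {a: m_bc[b] for a, b in m_ab.items() if b in m_bc}

# Hypothetical correspondences between three peers' schemas.
ubc_to_uw = {"paper_title": "title", "writer": "author", "venue": "conf"}
uw_to_dblp = {"title": "Title", "author": "Author"}

# Pre-computed shortcut mapping for the path UBC -> UW -> DBLP.
ubc_to_dblp = compose(ubc_to_uw, uw_to_dblp)
```

Even in this toy setting, composition loses information: `venue` has no image at DBLP, so it disappears from the shortcut mapping.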

Open PDMS Research Issues. Managing large networks of mappings: consistency; trust. Improving networks: finding additional mappings. Indexing: heterogeneous data across the network. Caching: where? what? [Figure: peers UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, Toronto.]

Outline. Other benefits of structure: (semantic) personal data management. A tour of recent data sharing architectures: data integration systems; peer-data management systems. The algorithmic problems: query reformulation; reconciling semantic heterogeneity. What can we do with a large corpus of schemas?

Semantic Mappings
Inventory Database A:
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)
Need mappings in every data sharing architecture. "Standards are great, but there are too many."

Why is it so Hard? Schemas never fully capture their intended meaning: schema elements are just symbols. We need to leverage any additional information we may have. 'Theorem': schema matching is AI-complete. Hence, a human will always be in the loop. Goal is to improve designer's productivity. Solution must be extensible.

Matching Heuristics. Multiple sources of evidence in the schemas:
Schema element names: BooksAndCDs/Categories ~ BookCategories/Category.
Descriptions and documentation: ItemID: unique identifier for a book or a CD; ISBN: unique identifier for any book.
Data types, data instances: DateTime vs. Integer; addresses have similar formats.
Schema structure: all books have similar attributes.
Use domain knowledge.
All these techniques consider only the two schemas. In isolation, techniques are incomplete or brittle: need principled combination.
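As an illustration of the first heuristic only (schema element names), here is a small sketch that scores name similarity by character-trigram overlap. Trigram Jaccard similarity is a stand-in chosen for this example; the slides do not say which name measure was used. The candidate list is taken from the schemas shown earlier.

```python
import re

def trigrams(name: str) -> set[str]:
    """Character trigrams of a name, ignoring case and non-alphanumerics."""
    s = re.sub(r"[^a-z0-9]", "", name.lower())
    return {s[i:i + 3] for i in range(len(s) - 2)}

def name_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets (0.0 when either is empty)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(element: str, candidates: list[str]) -> str:
    """Candidate whose name is most similar to the given element."""
    return max(candidates, key=lambda c: name_similarity(element, c))

candidates = ["BookCategories/Category", "Artists/ArtistName", "Books/Price"]
match = best_match("BooksAndCDs/Categories", candidates)
```

On its own this heuristic is brittle in the way the slide warns: it would never relate `ItemID` to `ISBN`, which is why it has to be combined with other evidence.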

Using Past Experience. Matching tasks are often repetitive. Humans improve over time at matching. A matching system should improve too! LSD: learns to recognize elements of mediated schema [Doan, Domingos, H., SIGMOD-01, MLJ-03]. Doan: 2003 ACM Distinguished Dissertation Award. [Figure: mediated schema over data sources.]

Example: Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description.
Schema of realestate.com: location, listed-price, phone, comments.
realestate.com data: location: Miami, FL; Boston, MA; ... listed-price: $250,000; $110, phone: (305); (617) comments: Fantastic house; Great location; ...
Learned hypotheses: if "phone" occurs in the name => agent-phone; if "fantastic" & "great" occur frequently in data values => description.
homes.com data: price: $550,000; $320, contact-phone: (278); (617) extra-info: Beautiful yard; Great beach; ...

Learning Source Descriptions. We learn a classifier for each element of the mediated schema. Training examples are provided by the given mappings. Multi-strategy learning: base learners (name, instance, description), combined using stacking. Accuracy of 70-90% in experiments.
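The multi-strategy idea can be sketched with two toy base learners and a combiner. This is not LSD itself. The base learners are simple token-overlap scorers, the training values are invented stand-ins for the real-estate example, and the combiner uses fixed equal weights where stacking would learn the weights from the base learners' predictions on training data.

```python
import re

# Training examples from an already-mapped source:
# (column name, sample values, mediated-schema element). Values are invented.
TRAINING = [
    ("location", ["Miami, FL", "Boston, MA"], "address"),
    ("listed-price", ["$250,000"], "price"),
    ("phone", ["305 555 0100"], "agent-phone"),
    ("comments", ["Fantastic house", "Great location"], "description"),
]
LABELS = sorted({label for _, _, label in TRAINING})

def words(text: str) -> set[str]:
    """Lowercase alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def name_learner(name: str, values: list[str]) -> dict[str, float]:
    """Score each label by token overlap between column names."""
    return {lab: max(jaccard(words(name), words(n))
                     for n, _, l in TRAINING if l == lab) for lab in LABELS}

def instance_learner(name: str, values: list[str]) -> dict[str, float]:
    """Score each label by word overlap between data values."""
    return {lab: max(jaccard(words(" ".join(values)), words(" ".join(v)))
                     for _, v, l in TRAINING if l == lab) for lab in LABELS}

# Assumption: fixed equal weights; stacking would learn these instead.
WEIGHTS = {name_learner: 0.5, instance_learner: 0.5}

def predict(name: str, values: list[str]) -> str:
    """Combine the base learners' scores and return the best-scoring label."""
    combined = {lab: sum(w * learner(name, values)[lab]
                         for learner, w in WEIGHTS.items()) for lab in LABELS}
    return max(combined, key=combined.get)

phone_label = predict("contact-phone", ["278 555 0101"])
info_label = predict("extra-info", ["Beautiful yard", "Great beach"])
```

The two predictions lean on different evidence: `contact-phone` is matched through its name, while `extra-info` is matched only through the word "great" in its values, which is the point of combining learners.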

Corpus-Based Schema Matching Can we use previous experience to match two new schemas? Can a corpus of schemas and matches be a general purpose resource? Information Retrieval and NLP progressed by using corpora – Can the same be done for structured data?

Corpus-Based Schema Matching. Can we use previous experience to match two new schemas? Corpus of schemas and matches (elements such as CDs, Categories, Artists, Items, Authors, Books, Music, Information, Literature, Publisher). Learn general-purpose knowledge: a classifier for every corpus element, using multi-strategy learning (Name Learner, Data Instances Learner, Data Type Learner, Description Learner, Structure Learner, Meta Learner). Reuse extracted knowledge to match new schemas.

The Corpus vs. Other Matchers

Exploiting Previous Experience

Corpus Challenges. What exactly should we learn? Generalizing with few training examples. Balancing previous experience with other clues. Size and scope of the corpus.

Other Corpus Based Tools Conjecture: a corpus of schemas can be the basis for many useful tools. Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion. Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately. Now we can cross the structure chasm.
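The auto-complete conjecture can be illustrated with a co-occurrence count over a tiny invented corpus. Everything here (the corpus, the scoring rule, the name `suggest`) is an assumption made for the sketch; the slides describe the tool only at the level of the conjecture.

```python
from collections import Counter

# Hypothetical corpus: each schema is reduced to its set of attribute names.
CORPUS = [
    {"title", "isbn", "price", "author"},
    {"title", "isbn", "publisher", "author"},
    {"album", "asin", "price", "artist"},
    {"title", "isbn", "price", "edition"},
]

def suggest(partial: set[str], k: int = 2) -> list[str]:
    """Suggest up to k attributes that co-occur most with the partial schema."""
    counts = Counter()
    for schema in CORPUS:
        overlap = len(partial & schema)
        if overlap:
            # Weight each unseen attribute by how much its schema resembles ours.
            for attr in schema - partial:
                counts[attr] += overlap
    # Descending count, then alphabetical, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [attr for attr, _ in ranked[:k]]

suggestions = suggest({"title", "isbn"})
```

A designer who has typed `title` and `isbn` is offered `author` and `price`, the attributes that most often accompany them in the corpus.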

Conclusion. Vision: data authoring, querying and sharing by everyone, everywhere. Structure is useful in our daily tasks. Key challenge: reconciling semantic heterogeneity. [Figure: corpus of schemas; schema mapping.]

Some References. Piazza: ICDE03, WWW03, VLDB-03. The Structure Chasm: CIDR-03. Surveys on schema matching languages: Halevy, VLDB Journal 01; Lenzerini, PODS 2002. Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01. Teaching integration to undergraduates: SIGMOD Record, September, 2003.