Crossing the Structure Chasm
Alon Halevy, University of Washington, Seattle
UBC, January 15, 2004

The Structure Chasm
                Unstructured (text)     Structured (database)
Authoring       Writing text            Creating a schema
Querying        Keywords                Using someone else's schema
Data sharing    Easy                    Committees, standards
But with structure we can pose complex queries.

Why is This a Problem?
- Databases used to be isolated and administered only by experts.
- Today's applications call for large-scale data sharing:
  - Big science (bio-medicine, astrophysics, ...)
  - Government agencies
  - Large corporations
  - The web (over 100,000 searchable data sources)
- The vision:
  - Content authoring by anyone, anywhere
  - Powerful database-style querying
  - Use relevant data from anywhere to answer the query
  - The Semantic Web
- Fundamental problem: reconciling different models of the world.

Outline
- Other benefits of structure:
  - (Semantic) email
  - Personal data management
- A tour of recent data sharing architectures:
  - Data integration systems
  - Peer-data management systems
- The algorithmic problems:
  - Query reformulation
  - Reconciling semantic heterogeneity
- What can we do with a large corpus of schemas?

Adding Structure to Email
- Email is often used for lightweight data management tasks:
  - Organizing a PC meeting + dinner
  - Arranging a 'balanced' potluck
  - Giving away opera tickets
  - Announcing an event and associated reminders
- Some specialized tools/services: Outlook scheduling, evite.com
- Can we delegate some email tasks easily?

Semantic Email Processes
[Figure: Originator, Recipients, Process, Database, Constraints Check]
- Originator to process: "Start a potluck process"
- Process to recipients: "What will you bring?"
- Replies: "I'll bring an entree", "I'll bring a dessert", "I'll bring an appetizer"
- Database (email, bringing): jane@cs Entree; john@cs Dessert; mary@ee Appetizer; jayant@u Dessert
- Constraints check: OK, or STOP with "Too many desserts. Appetizer or entrée?"
- Process to originator: "Here is what everyone is bringing..."

Semantic Email [Etzioni, McDowell, (Ha)Levy]
- Creating the structure? We'll help with template interfaces.
- Incorporating additional knowledge?
  - I always bring desserts
  - I don't schedule morning meetings
  -> Another data sharing challenge.
- But it's free (and cross platform): www.cs.washington.edu/research/semweb

Personal Data Management [Semex: Sigurdsson, Nemes, H.]
- Data is organized by application: HTML, Mail & calendar, Papers, Files, Presentations
[Figure: concepts Person, Message, Document, Event, Web Page, Presentation, Paper, Cached, Softcopy, linked by associations Sender, Recipients, Organizer, Participants, Author, Homepage, Cites]

Finding Publications
- Person: A. Halevy
- Person: Dan Suciu
- Person: Maya Rodrig
- Person: Steven Gribble
- Person: Zachary Ives
- Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa

Following Associations (1)
[Figure labels: Publication, Bernstein]

Following Associations (2)
[Figure labels: Publication, Bernstein]
- "A survey of approaches to automatic schema matching"
- "Corpus-based schema matching"
- "Database management for peer-to-peer computing: A vision"
- "Matching schemas by learning from others"

Following Associations (3)
[Figure labels: Publication, Bernstein, Cited by, Publication Citations]

Following Associations (4)
[Figure labels: Cited Authors, Bernstein, Publication]

Structure for Personal Data
- High-level concepts are given, but users can later:
  - extend and personalize the concept hierarchy,
  - share (parts of) their data with others,
  - incorporate external data into their view.
- Concepts are populated automatically with instances.
- Need instance-level reconciliation: Alon Halevy, A. Halevy, Alon Y. Levy: same guy!

Outline
- Other benefits of structure:
  - (Semantic) email
  - Personal data management
> A tour of recent data sharing architectures:
  > Data integration systems
  > Peer-data management systems
- The algorithmic problems:
  - Query reformulation
  - Reconciling semantic heterogeneity
- What can we do with a large corpus of schemas?

Data Integration
- Goal: provide a uniform interface to a set of autonomous data sources.
- First step towards data sharing.
- Many research projects (DB & AI)
  - Mine: Information Manifold, Tukwila, LSD
- Recent industry:
  - Startups: Nimble, Enosys, Composite, MetaMatrix
  - Products from big players: BEA, IBM

Relational DBMS Refresher
- Schema: the template for data. Tables: Students, Takes, Courses.
- Queries:
  SELECT C.name
  FROM Students S, Takes T, Courses C
  WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
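The refresher query can be tried end to end with a small in-memory database. The sketch below uses Python's standard sqlite3 module; the table columns beyond name, ssn and cid, and all of the rows, are assumptions made up for illustration, since the slide does not show the table contents.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical rows (not from the talk)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Students (ssn TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE Courses  (cid TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE Takes    (ssn TEXT, cid TEXT);
        INSERT INTO Students VALUES ('111', 'Mary');
        INSERT INTO Students VALUES ('222', 'John');
        INSERT INTO Courses  VALUES ('c1', 'Databases');
        INSERT INTO Courses  VALUES ('c2', 'Algorithms');
        INSERT INTO Courses  VALUES ('c3', 'AI');
        INSERT INTO Takes    VALUES ('111', 'c1');
        INSERT INTO Takes    VALUES ('111', 'c3');
        INSERT INTO Takes    VALUES ('222', 'c2');
        """
    )
    return conn


def courses_taken_by(conn, student_name):
    """Run the slide's three-way join; ORDER BY is added only for stable output."""
    query = """
        SELECT C.name
        FROM Students S, Takes T, Courses C
        WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid
        ORDER BY C.name
    """
    return [row[0] for row in conn.execute(query, (student_name,))]


if __name__ == "__main__":
    print(courses_taken_by(build_demo_db(), "Mary"))
```

The student name is passed as a bound parameter rather than spliced into the SQL string, which is the usual way to issue such a query from application code.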

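Data Integration: Higher-level Abstraction
[Figure: a query Q over the mediated schema; semantic mappings connect the mediated schema to the sources, which receive queries Q1, Q2, Q3, ...]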

Mediated Schema (www.biomediator.org; Tarczy-Hornoch, Mork)
- Sources: OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO
- Mediated schema entities: Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment
- Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?

Semantic Mappings
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
Differences in:
- Names in schema
- Attribute grouping
- Coverage of databases
- Granularity and format of attributes

Issues for Semantic Mappings
[Figure: a query Q over the mediated schema is turned into queries Q' on the sources via semantic mappings]
> Formalism for mappings
> Reformulation algorithms
> How will we create them?

Beyond Data Integration
- The mediated schema is a bottleneck for large-scale data sharing.
- It's hard to create, maintain, and agree upon.

Peer Data Management Systems
Piazza: [Tatarinov, H., Ives, Suciu, Mork]
[Figure: peers UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, Toronto connected by mappings; a query Q and its reformulations Q1 to Q6]
- Mappings specified locally
- Map to most convenient nodes
- Queries answered by traversing semantic paths.
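One way to picture "traversing semantic paths" is a breadth-first walk over the graph of peer mappings. The sketch below only enumerates the chain of mappings from a starting peer to every peer it can reach; it does not rewrite the query at each hop, which a real PDMS must do. The peer names come from the slide, but the edges in MAPPINGS are invented for illustration, since the slide's figure did not survive.

```python
from collections import deque

# Hypothetical mapping graph: an edge A -> B means peer A has a local mapping to B.
MAPPINGS = {
    "UBC": ["UW"],
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["DBLP"],
    "DBLP": ["CiteSeer"],
    "Waterloo": ["Toronto"],
    "Toronto": ["DBLP"],
    "CiteSeer": [],
}


def mapping_paths(mappings, start):
    """Return, for each peer reachable from start, one shortest chain of mappings."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor in mappings.get(peer, []):
            if neighbor not in paths:  # first visit is a shortest path in BFS
                paths[neighbor] = paths[peer] + [neighbor]
                queue.append(neighbor)
    return paths


if __name__ == "__main__":
    for peer, path in mapping_paths(MAPPINGS, "UBC").items():
        print(peer, "via", " -> ".join(path))
```

With these made-up edges, a query posed at UBC can reach CiteSeer through UW and DBLP, while Waterloo and Toronto stay unreachable because no mapping points toward them.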

PDMS-Related Projects
- Hyperion (Toronto)
- PeerDB (Singapore)
- Local relational models (Trento)
- Edutella (Hannover, Germany)
- Semantic Gossiping (EPFL Zurich)
- Raccoon (UC Irvine)
- Orchestra (Ives, U. Penn)

A Few Comments about Commerce
- Until 5 years ago: data integration = data warehousing.
- Since then:
  - A wave of startups: Nimble, MetaMatrix, Calixa, Composite, Enosys
  - Big guys made announcements (IBM, BEA).
  - [Delay] Big guys released products.
- Success: analysts have a new buzzword, EII
  - New addition to acronym soup (with EAI).
- Lessons:
  - Performance was fine. Need management tools.

Data Integration: Before
[Figure: a mediated schema over sources; a query Q on the mediated schema and queries Q' on the sources]

Data Integration: After
[Figure: Nimble product architecture]
- Front end: User Applications, Lens(TM), File, InfoBrowser(TM), Software Developers Kit, NIMBLE(TM) APIs, XML Query, XML Lens Builder(TM)
- Management tools: Integration Builder, Security Tools, Data Administrator, Concordance Developer
- Integration layer: Nimble Integration Engine(TM) with Compiler, Executor, Metadata Server, Cache
- Common XML view over: Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages

Sound Business Models
- Explosion of intranet and extranet information
- 80% of corporate information is unmanaged
- By 2004, 30X more enterprise data than 1999
- The average company:
  - maintains 49 distinct enterprise applications
  - spends 35% of total IT budget on integration-related efforts
Source: Gartner, 1999

Outline
- Other benefits of structure:
  - (Semantic) email
  - Personal data management
- A tour of recent data sharing architectures:
  - Data integration systems
  - Peer-data management systems
> The algorithmic problems:
  > Query reformulation
  > Reconciling semantic heterogeneity
> What can we do with a large corpus of schemas?

Languages for Schema Mapping
[Figure: a query Q over the mediated schema and queries Q' on the sources]
Mapping languages: GAV, LAV, GLAV

Local-as-View (LAV)
Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name)
Sources: R1, R2, R3, R4, R5
- Books before 1970: R1(x,y,n) :- Book(x, y, z, t), Author(x, n), t < 1970
- Humor books: R5(x,y) :- Book(x, y, "Humor")
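To make the two view definitions concrete, the following sketch evaluates R1 and R5 over a tiny, made-up instance of the mediated schema. In a real LAV setting the mediated schema holds no data and only the source contents exist; the BOOK and AUTHOR rows here are hypothetical and serve only to show what each source would contain under its definition.

```python
# Hypothetical instance of the mediated schema (placeholder titles and names).
BOOK = [  # (isbn, title, genre, year)
    ("b1", "Title A", "Humor", 1950),
    ("b2", "Title B", "Humor", 1985),
    ("b3", "Title C", "History", 1962),
]
AUTHOR = [  # (isbn, name)
    ("b1", "Author X"),
    ("b2", "Author Y"),
    ("b3", "Author Z"),
]


def r1(book, author):
    """R1(x, y, n) :- Book(x, y, z, t), Author(x, n), t < 1970."""
    return sorted(
        {
            (x, y, n)
            for (x, y, _z, t) in book
            for (ax, n) in author
            if ax == x and t < 1970
        }
    )


def r5(book):
    """R5(x, y) :- Book(x, y, "Humor"). Genre is fixed; Year is projected away."""
    return sorted({(x, y) for (x, y, z, _t) in book if z == "Humor"})


if __name__ == "__main__":
    print(r1(BOOK, AUTHOR))
    print(r5(BOOK))
```

Note that neither view exports Year or Genre as a column: each source reveals only the attributes in the head of its definition.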

Query Reformulation
Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name)
Sources: R1 (books before 1970), R2, R3, R4, R5 (humor books)
Query: Find authors of humor books
Plan: R1 Join R5
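The plan "R1 Join R5" can be sketched as a join on ISBN over the source contents. The rows below are hypothetical: R1 is assumed to return (ISBN, Title, Name) for books before 1970 and R5 to return (ISBN, Title) for humor books, following the LAV definitions in the talk.

```python
# Hypothetical source contents (placeholder values, not from the talk).
R1_ROWS = [  # books before 1970, with authors: (isbn, title, name)
    ("b1", "Title A", "Author X"),
    ("b3", "Title C", "Author Z"),
]
R5_ROWS = [  # humor books: (isbn, title)
    ("b1", "Title A"),
    ("b2", "Title B"),
]


def humor_authors_plan(r1_rows, r5_rows):
    """Evaluate the plan R1 JOIN R5 on ISBN and project the author name."""
    humor_isbns = {isbn for isbn, _title in r5_rows}
    return sorted({name for isbn, _title, name in r1_rows if isbn in humor_isbns})


if __name__ == "__main__":
    print(humor_authors_plan(R1_ROWS, R5_ROWS))
```

Every name the plan returns is a correct answer, but the plan can miss answers: the author of the humor book b2 never appears because R1 only covers books before 1970. Using only these two sources, that is the best this plan can do.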

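Query Reformulation
Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name)
Sources: R1(ISBN, Title, Name), R2, R3, R4, R5(ISBN, Title)
Query: Find authors of humor books before 1960
Plan: Can't do it! (subtle reasons; one visible here is that neither R1 nor R5 exports Year, so the 1960 cutoff cannot be applied)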

Query Reformulation
- Query is posed on a mediated schema that contains no data.
- Sources are answers to queries (views).
- Problem: answering queries using views.
- (Conceptually) need to invert the query expression.
- Traditional databases also use this: can you reuse previously cached results?

Answering Queries Using Views
- NP-Complete for basic queries [LMSS, PODS 95].
- Results depend on:
  - Query language used for sources and queries
  - Open-world vs. closed-world assumption
  - Allowable access patterns to the sources
- A lot of beautiful theory!

Theory?
A lot of beautiful theory.
"There is in these words the beautiful maneuverability of the abstract, rushing in to replace the intractability of the concrete."
Milan Kundera, The Book of Laughter and Forgetting

Practical Query Reformulation
- A lot of nice theory. But also very practical algorithms:
  - MiniCon [Pottinger and H., 2001]: scales to thousands of sources.
  - Every commercial DBMS implements some version of answering queries using views.
- See [Halevy, 2001] for survey.

Reformulation in PDMS
[Figure: peers UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, Toronto]
- Can't follow all paths naively
  - Pruning techniques [Tatarinov, H.]
- Can we pre-compute some paths?
  -> Need to compose mappings
  -> [Madhavan, H., VLDB-2003]

Open PDMS Research Issues
[Figure: peers UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, Toronto]
- Managing large networks of mappings:
  - Consistency
  - Trust
- Improving networks: finding additional mappings
- Indexing: heterogeneous data across the network
- Caching: Where? What?

Outline
- Other benefits of structure:
  - (Semantic) email
  - Personal data management
- A tour of recent data sharing architectures:
  - Data integration systems
  - Peer-data management systems
> The algorithmic problems:
  - Query reformulation
  > Reconciling semantic heterogeneity
> What can we do with a large corpus of schemas?

Semantic Mappings
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
- Need mappings in every data sharing architecture
- "Standards are great, but there are too many."

Why is it so Hard?
- Schemas never fully capture their intended meaning:
  - Schema elements are just symbols.
  - We need to leverage any additional information we may have.
- 'Theorem': Schema matching is AI-Complete.
- Hence, a human will always be in the loop.
  - Goal is to improve the designer's productivity.
  - Solution must be extensible.

Matching Heuristics
- Multiple sources of evidence in the schemas:
  - Schema element names: BooksAndCDs/Categories ~ BookCategories/Category
  - Descriptions and documentation: ItemID: unique identifier for a book or a CD; ISBN: unique identifier for any book
  - Data types, data instances: DateTime ≠ Integer; addresses have similar formats
  - Schema structure: all books have similar attributes
- Use domain knowledge
- All these techniques consider only the two schemas.
- In isolation, techniques are incomplete or brittle: need principled combination.
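The name-based heuristic can be sketched with plain string similarity. The example below uses difflib.SequenceMatcher from the Python standard library as a stand-in similarity measure; this choice is an assumption for illustration and not the measure used by any system named in the talk. The element names come from the inventory schemas shown earlier.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two schema element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_name_match(element, candidates):
    """Pick the candidate whose name is most similar to the given element name."""
    return max(candidates, key=lambda c: name_similarity(element, c))


if __name__ == "__main__":
    candidates = ["Category", "ISBN", "Price", "Title"]
    print(best_name_match("Categories", candidates))
    # Names alone give little support for ItemID ~ ISBN, even though the
    # documentation on the slide says both are unique identifiers.
    print(round(name_similarity("ItemID", "ISBN"), 2))
```

The second print illustrates the slide's warning that a single heuristic is brittle: ItemID and ISBN play the same role but share almost no characters, so evidence from descriptions or data instances is needed as well.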

Using Past Experience
- Matching tasks are often repetitive.
- Humans improve over time at matching. A matching system should improve too!
- LSD: learns to recognize elements of the mediated schema. [Doan, Domingos, H., SIGMOD-01, MLJ-03]
- Doan: 2003 ACM Distinguished Dissertation Award.
[Figure: a mediated schema mapped to data sources]

Example: Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data:
  location      Miami, FL         Boston, MA        ...
  listed-price  $250,000          $110,000          ...
  phone         (305) 729 0831    (617) 253 1429    ...
  comments      Fantastic house   Great location    ...
homes.com data:
  price          $550,000          $320,000          ...
  contact-phone  (278) 345 7215    (617) 335 2315    ...
  extra-info     Beautiful yard    Great beach       ...
Learned hypotheses:
- If "phone" occurs in the name => agent-phone
- If "fantastic" & "great" occur frequently in data values => description

Learning Source Descriptions
- We learn a classifier for each element of the mediated schema.
- Training examples are provided by the given mappings.
- Multi-strategy learning:
  - Base learners: name, instance, description
  - Combine using stacking.
- Accuracy of 70-90% in experiments.
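The multi-strategy idea can be illustrated with two toy base learners whose scores are merged into one prediction. This is only a sketch under loud assumptions: the base learners are hand-written rules modeled on the hypotheses from the real-estate example (the "$" rule for prices is invented), and the combination is a hand-weighted average standing in for a trained stacking meta-learner. It is not LSD's implementation.

```python
MEDIATED = ["address", "price", "agent-phone", "description"]


def name_learner(column_name):
    """Score mediated-schema elements from the column name alone."""
    scores = {label: 0.0 for label in MEDIATED}
    lowered = column_name.lower()
    if "phone" in lowered:
        scores["agent-phone"] = 1.0
    if "price" in lowered:
        scores["price"] = 1.0
    return scores


def instance_learner(values):
    """Score mediated-schema elements from the data values alone."""
    scores = {label: 0.0 for label in MEDIATED}
    if not values:
        return scores
    lowered = [v.lower() for v in values]
    # Assumed rule: values that start with "$" look like prices.
    scores["price"] = sum(v.startswith("$") for v in lowered) / len(lowered)
    # Rule from the example: "fantastic" / "great" suggest a description.
    hits = sum(any(w in v for w in ("fantastic", "great")) for v in lowered)
    scores["description"] = hits / len(lowered)
    return scores


def combine(column_name, values, weights=(0.5, 0.5)):
    """Weighted average of base-learner scores (fixed weights, not learned)."""
    by_name = name_learner(column_name)
    by_instance = instance_learner(values)
    combined = {
        label: weights[0] * by_name[label] + weights[1] * by_instance[label]
        for label in MEDIATED
    }
    return max(MEDIATED, key=combined.get)


if __name__ == "__main__":
    print(combine("extra-info", ["Beautiful yard", "Great beach"]))
    print(combine("contact-phone", ["(278) 345 7215", "(617) 335 2315"]))
```

In a stacking setup the weights would be fit on held-out predictions of the base learners rather than fixed by hand. When every score is zero, this sketch falls back to the first label in MEDIATED, which a real matcher would treat as "no match".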

Corpus-Based Schema Matching
- Can we use previous experience to match two new schemas?
- Can a corpus of schemas and matches be a general purpose resource?
- Information Retrieval and NLP progressed by using corpora. Can the same be done for structured data?

Corpus-Based Schema Matching
- Can we use previous experience to match two new schemas?
- Corpus of schemas and matches
  [Figure labels: CDs, Categories, Artists, Items, Artists, Authors, Books, Music Information, Literature, Publisher, Authors]
- Learn general purpose knowledge: a classifier for every corpus element
- Multi-strategy learning: Name Learner, Data Instances Learner, Data Type Learner, Description Learner, Structure Learner, combined by a Meta Learner
- Reuse extracted knowledge to match new schemas

The Corpus vs. Other Matchers

Exploiting Previous Experience

Corpus Challenges
- What exactly should we learn?
- Generalizing with few training examples
- Balancing previous experience with other clues
- Size and scope of the corpus

Other Corpus Based Tools
- Conjecture: a corpus of schemas can be the basis for many useful tools.
- Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion.
- Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.
- Now we can cross the structure chasm.

Conclusion
- Vision: data authoring, querying and sharing by everyone, everywhere.
- Structure is useful in our daily tasks.
- Key challenge: reconciling semantic heterogeneity
[Figure labels: corpus of schemas, schema mapping]

Some References
- www.cs.washington.edu/homes/alon
- Piazza: ICDE-03, WWW-03, VLDB-03
- The Structure Chasm: CIDR-03
- Surveys on schema matching languages:
  - Halevy, VLDB Journal 01
  - Lenzerini, PODS 2002
- Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01.
- Teaching integration to undergraduates: SIGMOD Record, September 2003.
