Alon Halevy University of Washington Joint work with AnHai Doan, Jayant Madhavan, Phil Bernstein, and Pedro Domingos Peer Data-Management Systems: Plumbing.

Presentation transcript:

Alon Halevy University of Washington Joint work with AnHai Doan, Jayant Madhavan, Phil Bernstein, and Pedro Domingos Peer Data-Management Systems: Plumbing for the Semantic Web

2 Agenda Elements of the Semantic Web Piazza: a peer data-management system –A database guy’s contribution to the semantic web The key issue: mapping between different models: –Some recent progress and current directions. The critical issue: crossing the structure chasm. The talk I’m not giving today: –A critique of the Semantic Web. Work and thoughts are in progress

3 The Semantic Web (my view) Web sites include structural annotations –You can pose meaningful queries on them. –Ontologies provide the semantic glue. –Internal implementation of web sites left open. Agents perform tasks: –Query one or more web sites –Perform updates (e.g., set schedules) –Coordinate actions –Trust each other (or not). I.e., agents operating on a gigantic heterogeneous distributed database.

4 Getting there Robust infrastructure for querying –Peer data management systems. Facilitate mapping between different structures. Need tools for: –Locating relevant structures –Easily joining the semantic web. Get data into structured form –Should we worry about the legacy web?

5 Agenda Elements of the Semantic Web (personal view) Piazza: a peer data-management system –A database guy’s contribution to the semantic web The key issue: mapping between different models: –Some recent progress and current directions. The critical issue: crossing the structure chasm.

6 Piazza: Peer Data-Management Goal: To enable users to share data across local or wide area networks in an ad-hoc, highly dynamic distributed architecture. Peers can: –Export base data –Provide views on base data –Serve as logical mediators for other peers Every peer can be both a server and a client. Peers join and leave the PDMS at will.

7 Extending the Vision to Data Sharing

8 Relationship of PDMS to… P2P overlay networks (the “S” word) Data integration systems (no central logical mediated schema) Federated databases (scale, ad-hoc nature) Distributed databases (no central administration)

9 Representing Data A spectrum of possibilities: –Relational tables, some integrity constraints –XML: can encode relational, hierarchical, OO –XQuery – emerging standard query language (SQL for XML) –RDF: “XML on drugs”. –Sees only the logic; ignores other aspects. –DAML+OIL –Full-blown knowledge representation language. They all have semantics; just different expressive powers. We keep the data simple. Mappings between data at different peers are more complex.

10 Piazza Querying
Semantic mappings between peers provide glue:
LH:CritBed(bed, hosp, room, PID, status) ⊆ H:CritBed(bed, hosp, room) & H:Patient(PID, bed, status)
9DC:SkilledPerson(PID, "Doctor") :- H:Doctor(SID, h, l, s, e)
9DC:SkilledPerson(PID, "EMT") :- H:EMT(SID, h, vid, s, e)
Query processing phases:
–Reformulate a query into queries over stored data.
–MiniCon algorithm (++) for answering queries using views.
–Extensions in Piazza enable chaining multiple peer mappings.
–Find best plan for the query and execute it:
–Tukwila data integration engine – an efficient processor for network-bound XML/relational data.
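The two 9DC:SkilledPerson rules can be read as a definition of one peer's relation in terms of another peer's relations. The sketch below only illustrates that reading by unfolding the rules into a union over peer H's data; it is not the MiniCon algorithm and not Piazza's reformulation procedure. The tuples are hypothetical, and it assumes that the first attribute of H:Doctor and H:EMT (written SID in the rule bodies) is the identifier exported as PID.

```python
from typing import List, Set, Tuple

# Hypothetical stored tuples at peer H. Attribute order follows the slide:
# Doctor(SID, h, l, s, e) and EMT(SID, h, vid, s, e). Values are placeholders.
H_DOCTOR: List[Tuple[str, ...]] = [("d1", "h1", "l1", "s1", "e1")]
H_EMT: List[Tuple[str, ...]] = [("e7", "h1", "vid3", "s2", "e2")]


def skilled_person(doctors, emts) -> Set[Tuple[str, str]]:
    """Unfold 9DC:SkilledPerson into the union of its two rule bodies.

    Assumption: the first attribute of each H tuple is the person id.
    """
    result = {(row[0], "Doctor") for row in doctors}
    result |= {(row[0], "EMT") for row in emts}
    return result


def query_skill(skill: str) -> List[str]:
    """A query posed over 9DC's schema, answered from H's stored data."""
    people = skilled_person(H_DOCTOR, H_EMT)
    return sorted(pid for pid, s in people if s == skill)


print(query_skill("Doctor"))
```

The query is written against 9DC:SkilledPerson only; the mapping is what lets it run over H's relations.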

11 Efficiency Issues in Piazza Intelligent data placement: –We may want to place views over data at key points in the PDMS: –Save work for frequently asked queries. –Increase availability in cases of failures. –Akamai for structured data –A form of automated reformulation. –Large search space of possibilities –Surprising lower bounds on very simple cases [Chirkova et al, VLDB 2001]. Efficient propagation of updates: –Approach: publish updategrams as first-class citizens.

12 Additional Piazza Issues The catalog of data sources –What does a catalog of structured data sources look like? –How can it be browsed by humans? –How do we facilitate joining a PDMS? –How can the catalog be distributed physically? Systems issues: –Architecture of a Piazza node: what are the components? –Naming issues –Security Piazza collaborators: Etzioni, Gribble, Ives, Levy, Suciu, Mork, Rodrig, Tatarinov.

13 Agenda Elements of the Semantic Web Piazza: a peer data-management system –A database guy’s contribution to the semantic web The key issue: mapping between different models: –Some recent progress and current directions. The critical issue: crossing the structure chasm.

14 It’s All About the Mappings It’s not about understanding the data: It’s about understanding each other. Whenever you see a model for some domain, there is another one hiding around the corner. Mappings provide semantic relationships between different peers. Specifying mappings: inherently a human-assisted task. Goal: make it easy, fast, incremental. Not a new problem!

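15 Example Semantic Mapping Mapping between XML DTDs [Figure: two XML DTD trees, each rooted at house. Element names shown: location, contact, name, phone, num-baths, address, full-baths, half-baths, contact-info, agent-name, agent-phone. Edges between the trees are labeled as 1-1 mappings and non 1-1 mappings.]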

16 Desiderata from Proposed Solutions Accuracy, efficiency, ease of use. Extensible: accommodate in a principled fashion: –User feedback –Domain constraints –General heuristics “Memory”, knowledge reuse: –System should exploit knowledge from previous matching tasks [LSD]. Some underlying semantics.

17 Why Matching is Difficult Structures represent same entity differently –different names => same entity: –area & address => location –same names => different entities: –area => location or square-feet Intended semantics is typically subjective! –IBM Almaden Lab = IBM? Schema, data and rules never fully capture semantics! –not adequately documented, certainly not for machine consumption. Often hard for humans (committees are formed!)

18 Learning for Mapping We started simple: generating semantic mappings between a mediated schema and a large set of data source schemas. Key idea: generate the first mappings manually, and learn from them to generate the rest. Technique: multi-strategy learning (extensible!) LSD (Learning Source Descriptions) [SIGMOD 2001]. Recent and current work: –(simple) Ontology mapping [WWW-02] –Complex mappings [COMAP] –Semantics [Madhavan et al., AAAI-02]

19 Data Integration (a simple PDMS) Find houses with four bathrooms priced under $500,000 [Figure: the query is posed over a mediated schema that sits above three source schemas (source schema 1, 2, 3) for the sites homes.com, realestate.com, and homeseekers.com.] Applications: WWW, enterprises, science projects Techniques: virtual data integration, warehousing, custom code. Query reformulation and optimization.

20 Learning from the Manual Mappings
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
realestate.com data listings (listed-price, contact-name, contact-phone, office, comments):
–$250K, James Smith, (305), (305), Fantastic house
–$320K, Mike Doan, (617), (617), Great location
Learned hypotheses:
–If “fantastic” & “great” occur frequently in data instances => description
–If “office” occurs in the name => office-phone
Schema of homes.com: sold-at, contact-agent, extra-info
homes.com data listings (sold-at, contact-agent, extra-info):
–$350K, (206), Beautiful yard
–$230K, (617), Close to Seattle
–$190K, (512), Great lot

21 Multi-Strategy Learning Use a set of base learners: –Name learner, Naïve Bayes, Whirl, XML learner And a set of recognizers: –County name, zip code, phone numbers. Each base learner produces a prediction weighted by confidence score. Combine base learners with a meta-learner, using stacking.

22 Base Learners
Name Learner
–training: (“location”, address) (“contact name”, name)
–matching: agent-name => (name,0.7),(phone,0.3)
Naive Bayes Learner
–training: (“Seattle, WA”,address) (“250K”,price)
–matching: “Kent, WA” => (address,0.8),(name,0.2)
[Figure: Training: training examples (X1,C1), (X2,C2), ..., (Xm,Cm), each an object with its observed label, yield a classification model (hypothesis). Matching: the model maps an object X to labels weighted by confidence score.]
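To make the train-then-match interface of a base learner concrete, here is a minimal sketch of a token-level Naive Bayes learner with add-one smoothing. It is a toy under stated assumptions, not LSD's implementation: the class name, the tokenizer, and the smoothing choice are all illustrative, and with only the two training pairs from the slide its confidence values will not match the slide's example numbers.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesLearner:
    """Hypothetical base learner: data instance -> {label: confidence}."""

    def __init__(self) -> None:
        self.label_counts: Counter = Counter()
        self.token_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: set = set()

    def train(self, examples: Iterable[Tuple[str, str]]) -> None:
        for text, label in examples:
            self.label_counts[label] += 1
            for t in tokens(text):
                self.token_counts[label][t] += 1
                self.vocab.add(t)

    def predict(self, text: str) -> Dict[str, float]:
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            # Add-one (Laplace) smoothing over the training vocabulary.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for t in tokens(text):
                score += math.log((self.token_counts[label][t] + 1) / denom)
            log_scores[label] = score
        # Normalize so the confidence scores sum to 1.
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}


learner = NaiveBayesLearner()
learner.train([("Seattle, WA", "address"), ("250K", "price")])
prediction = learner.predict("Kent, WA")
print(max(prediction, key=prediction.get))
```

The shared token "WA" is what pushes "Kent, WA" toward address, which is the effect the slide's matching example describes.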

23 Meta-Learner: Stacking [Wolpert 92, Ting & Witten 99]
Training
–uses training data to learn weights
–one for each (base-learner, mediated-schema element) pair
–weight (Name-Learner,address) = 0.2
–weight (Naive-Bayes,address) = 0.8
Matching: combine predictions of base learners
–computes weighted average of base-learner confidence scores
Example: source element area with instances Seattle, WA; Kent, WA; Bend, OR
–Name Learner: (address,0.4)
–Naive Bayes: (address,0.9)
–Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
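The combination step is small enough to check by hand. The sketch below reproduces the slide's arithmetic for the address label; the function and dictionary names are illustrative, and only the four numbers come from the slide. Because the two weights sum to 1, the weighted sum is also the weighted average.

```python
from typing import Dict, Tuple

# One learned weight per (base learner, mediated-schema element) pair.
# The two values below are the ones given on the slide.
WEIGHTS: Dict[Tuple[str, str], float] = {
    ("name_learner", "address"): 0.2,
    ("naive_bayes", "address"): 0.8,
}


def combine(label: str, scores: Dict[str, float],
            weights: Dict[Tuple[str, str], float] = WEIGHTS) -> float:
    """Weighted sum of base-learner confidence scores for one label."""
    return sum(weights[(learner, label)] * s for learner, s in scores.items())


# Base-learner predictions for the source element "area" (from the slide).
score = combine("address", {"name_learner": 0.4, "naive_bayes": 0.9})
print(round(score, 2))  # 0.4*0.2 + 0.9*0.8
```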

24 The LSD Architecture
[Figure: two-phase architecture.
Training phase components: mediated schema, source schemas, training data for base learners, Base-Learner 1 ... Base-Learner k, Meta-Learner, Hypothesis 1 ... Hypothesis k, weights for base learners.
Matching phase components: Base-Learner 1 ... Base-Learner k, Meta-Learner, Prediction Combiner (predictions for instances, predictions for elements), Constraint Handler (domain constraints), mappings.]

25 Domain Constraints Encode user knowledge about the domain Specified by examining mediated schema Examples –at most one source-schema element can match address –if a source-schema element matches house-id then it is a key –avg-value(price) > avg-value(num-baths) Given a mapping combination –can verify if it satisfies a given constraint area: address sold-at: price contact-agent: agent-phone extra-info: address
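Checking a mapping combination against a constraint can be sketched as follows, using the first example constraint (at most one source-schema element can match address) and the mapping combination listed on the slide. The function name is illustrative, and the second, repaired combination is a hypothetical added only for contrast.

```python
from collections import Counter
from typing import Dict


def violates_at_most_one(mapping: Dict[str, str], target: str) -> bool:
    """True if more than one source element is mapped to `target`."""
    return Counter(mapping.values())[target] > 1


# Mapping combination from the slide: source element -> mediated element.
candidate = {
    "area": "address",
    "sold-at": "price",
    "contact-agent": "agent-phone",
    "extra-info": "address",
}

# Hypothetical alternative in which extra-info no longer maps to address.
repaired = dict(candidate, **{"extra-info": "description"})

print(violates_at_most_one(candidate, "address"))
print(violates_at_most_one(repaired, "address"))
```

In the slide's combination both area and extra-info map to address, so the constraint flags it.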

26 Empirical Evaluation Four domains –Real Estate I & II, Course Offerings, Faculty Listings For each domain –create mediated DTD & domain constraints –choose five sources –extract & convert data listings into XML (faithful to schema!) –mediated DTDs: elements, source DTDs: Ten runs for each experiment - in each run: –manually provide 1-1 mappings for 3 sources –ask LSD to propose mappings for remaining 2 sources –accuracy = % of 1-1 mappings correctly identified
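The accuracy measure, the percentage of 1-1 mappings correctly identified, can be written down directly. In this sketch the gold and proposed mappings are hypothetical examples, not data from the experiments, and the function name is illustrative.

```python
from typing import Dict


def matching_accuracy(proposed: Dict[str, str], gold: Dict[str, str]) -> float:
    """Percentage of gold 1-1 mappings that the proposal gets right."""
    correct = sum(1 for src, tgt in gold.items() if proposed.get(src) == tgt)
    return 100.0 * correct / len(gold)


# Hypothetical gold-standard and proposed mappings for one source.
gold = {
    "area": "address",
    "sold-at": "price",
    "contact-agent": "agent-phone",
    "extra-info": "description",
}
proposed = dict(gold, **{"extra-info": "address"})  # one wrong out of four

print(matching_accuracy(proposed, gold))
```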

27 Matching Accuracy LSD’s accuracy: % Best single base learner: % + Meta-learner: % + Constraint handler: % + XML learner: % Average Matching Accuracy (%)

28 Sensitivity to Amount of Available Data [Plot: average matching accuracy (%) against number of data listings per source (Real Estate I)]

29 Contribution of Schema vs. Data [Chart: average matching accuracy (%) for LSD with only schema info, LSD with only data info, and complete LSD] More experiments in the paper [Doan et al. 01]

30 Contribution of Each Component [Chart: average matching accuracy (%) without Name Learner, without Naive Bayes, without Whirl Learner, without Constraint Handler, and for the complete LSD system]

31 The Next Steps Learning is a useful component. But it needs to be combined with: –User feedback –Domain constraints –General heuristics Need a representation of mappings: –First step – see [Madhavan et al., AAAI-02] –Also defines key inference problems for such a representation, –Provides answers for the mapping language used in Piazza. –Ultimately, some first-order probabilistic representation. Need benchmarks to measure progress.

32 Agenda Elements of the Semantic Web Piazza: a peer data-management system –A database guy’s contribution to the semantic web The key issue: mapping between different models: –Some recent progress and current directions. The critical issue: crossing the structure chasm.

33 Can We Cross the Structure Chasm? There are two worlds: –U-world: the current web, keyword search, Google –S-world: databases, knowledge bases, structured queries The web succeeded because it’s in the U-world. For the semantic web to succeed, we need to make it dead simple for people to: –Structure data, locate relevant data and data sets, query. However: –People have a hard time structuring their data –It’s harder to query structured data: need to know a terminology. –It’s harder to understand each other in the S-world. DB and KR people have no clue how to deal with this. More expressive power in the languages won’t help.