Principles of Dataspace Systems Alon Halevy PODS June 26, 2006.

1 Principles of Dataspace Systems Alon Halevy PODS June 26, 2006

2 Outline Example data management challenges → Denote by: “dataspaces” [Franklin, H., Maier] Dataspace Support Platforms: –“Pay-as-you-go” data management Putting meat to the bones: –Specific research problems, recent progress –Querying, dataspace evolution, reflection (Possibly) some predictions and subliminal messages.

3 Shrapnels in Baghdad Story courtesy of Phil Bernstein

4 OriginatedFrom PublishedIn ConfHomePage ExperimentOf ArticleAbout BudgetOf CourseGradeIn AddressOf Cites CoAuthor FrequentEmailer HomePage Sender EarlyVersion Recipient AttachedTo PresentationFor Personal Information Management [Semex: Dong et al.]

5 Google Base

6 The Web is Getting Semantic Forms (millions) Vertical search engines (hundreds) Annotation schemes: Flickr, ESP Game Google Coop –DB search engine coming soon! “A little semantics goes a long way” [See Madhavan talk on Wednesday afternoon]

7 “Data is the plural of anecdote” Digital libraries, enterprises, “smart homes” Corie Environmental Observation System – Talk to Maier Circle of Blue – Data about the world’s water sources The Boeing 777 – [Hanrahan @ Stanford]

8 Requirements A system that: –Is defined by boundaries (organizational, physical, logical), not explicit entry. –Provides best-effort services Little or no setup time –Leverages semantics when possible Manage dataspaces

9 Other Dataspace Characteristics All dataspaces contain >20% porn. The rest has >50% spam.

10 Dataspaces vs. Data Integration BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords Books Title ISBN Price DiscountPrice Edition CDs Album ASIN Price DiscountPrice Studio BookCategories ISBN Category CDCategories ASIN Category Artists ASIN ArtistName GroupName Authors ISBN FirstName LastName Inventory Database A Inventory Database B Data integration requires semantic mappings

11 Dataspaces vs. Data Integration Dataspaces are “pay as you go” Benefit Investment (time, cost) Dataspaces Data integration solutions Artist: Mike Franklin

12 Dataspaces vs. Data Integration Data integration systems require semantic mappings. –Dataspaces are “pay-as-you-go” Dataspaces capture a broader class of semantic relationships: –DerivedFrom, SnapshotOf, HighlyCorrelatedWith, …

13 Why Do This Now? Fact: –Data management is about people, not enterprises. Prediction: –In the next few years, DB conferences will be about data management for the masses. “CMA”: –If not, our community will become largely irrelevant. Observation: –We’re doing it anyway: e.g., DB&IR, information extraction, uncertainty, …

14 Outline Example data management challenges Denote by: “dataspaces” → Dataspace Support Platforms: –“Pay-as-you-go” data management Putting meat to the bones: –Specific research problems, recent progress –Querying, dataspace evolution, reflection (Possibly) some predictions and subliminal messages.

15 Logical Model: Participants and Relationships XML WSDL SDB sensor java RDF RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates

16 Relationships General form: (Obj1, Rel, Obj2, p) Obj1, Obj2: instances or sources; p: degree of certainty Language for describing relationships? –[Rosati]
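The general form on this slide can be sketched as a small record type. This is a minimal illustration, not the representation used by any actual DSSP: the class name `Relationship`, the helper `related`, the example objects, and the 0.5 threshold are all assumptions made for the example; the relationship names come from earlier slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One (Obj1, Rel, Obj2, p) entry; field names are hypothetical."""
    obj1: str   # an instance or a source
    rel: str    # e.g. "SnapshotOf", "EarlyVersion"
    obj2: str   # an instance or a source
    p: float    # degree of certainty, assumed to lie in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must lie in [0, 1]")


# Hypothetical catalog entries, invented for illustration.
catalog = [
    Relationship("BooksDB", "SnapshotOf", "InventoryA", 1.0),
    Relationship("paper_v1.pdf", "EarlyVersion", "paper_final.pdf", 0.8),
    Relationship("BooksDB", "HighlyCorrelatedWith", "InventoryB", 0.3),
]


def related(catalog, obj, min_p=0.5):
    """Return relationships that mention obj with certainty >= min_p."""
    return [r for r in catalog if obj in (r.obj1, r.obj2) and r.p >= min_p]
```

Calling `related(catalog, "BooksDB")` keeps only the `SnapshotOf` entry, because the correlation entry falls below the assumed threshold.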

17 Dataspace Support Platforms (DSSP) XML WSDL SDB sensor java RDF RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Local Store & Index Administration Discover & Enhance

18 Outline Example data management challenges Denote by: “dataspaces” Dataspace Support Platforms: –“Pay-as-you-go” data management → Putting meat to the bones: –Specific research problems, recent progress –Querying, dataspace evolution, reflection (Possibly) some predictions and subliminal messages.

19 Technical Outline Query Evolve Reflect Queries Answers Query processing

20 Dataspace Queries Keyword queries as starting point –Later may be refined to add structure –Formulated in terms of user’s “schema” Mostly of the form –Instance*: “britany spears” –P (instance) “chicago weather” “PC chair PODS” Query Evolve Reflect

21 Semantics of Answers 1. The actual answers: –P(instance), P*(instance) Query Evolve Reflect

22 Weather Seattle

23 Semantics of Answers 1. The actual answers: –P(instance), P*(instance) 2. Sources where answer can be found: –Partially specify the query to the source –Help the user clean the query Query Evolve Reflect

24 Toyota Corolla Palo alto

25 Volvo Palo alto Toyota Palo alto

26 Semantics of Answers 1. The actual answers: –P(instance), P*(instance) 2. Sources where answer can be found: –Partially specify the query to the source –Help the user clean the query 3. Supporting facts or sources: –Facts that can be used to derive P(instance) –Rest of derivation may be obvious to user Query Evolve Reflect

27 Related or Partial Answers In which country was Dan Suciu born? –Bucharest Latest edition of software X: –2004 edition Is the Space Needle higher than the Eiffel Tower? –Height of Seattle Space Needle –Height of Eiffel Tower Query Evolve Reflect 184m 324m Rank all types of answers

28 Query Processing: Data Integration Q1 Q2 Q3 Q4 Q41 Q42 Q43 Weather(Chicago) Query Evolve Reflect PDMS Active XML Data exchange

29 Query Processing: DSSPs Query: Jan Van den Bussche address First name: Jan Middle name: Last name: Van den Bussche Address: ? Query Evolve Reflect

30 Query Processing: DSSPs First name: Jan Middle name: Last name: Van den Bussche Address: ? address address: City required Keyword query: J. vd Bussche ………… City? StreetAdr … … (t1, p1) city, zip (t2, p2) ………… Companies address Query Evolve Reflect (t1, p3)

31 Two Principles Mappings are approximate at best –What do approximate mappings mean? –Answering queries with them? –Composition? Answering queries = Finding evidence + combining evidence Query Evolve Reflect
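One possible reading of "combining evidence" is merging the certainty values of several derivations of the same answer. The sketch below uses a noisy-or rule. That rule, the independence assumption behind it, and the name `combine_evidence` are illustrative assumptions; the slides do not prescribe a combination method.

```python
from math import prod


def combine_evidence(probs):
    """Noisy-or: chance that at least one independent piece of evidence holds."""
    probs = list(probs)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each certainty must lie in [0, 1]")
    # Probability that every piece of evidence fails, then its complement.
    return 1.0 - prod(1.0 - p for p in probs)
```

Under this assumed rule, two derivations with certainty 0.5 each give a combined certainty of 0.75, and an empty list of evidence gives 0.0.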

32 Technical Outline Query Evolve Reflect Reuse human attention

33 The Cost of Semantics Benefit Investment (time, cost) Dataspaces Data integration solutions Artist: Mike Franklin ? Semantic integration modeled by: {(Obj1, rel, Obj2, p), …} Query Evolve Reflect

34 Reusing Human Attention Principle: → User action = statement of semantic relationship → Leverage actions to infer other semantic relationships Examples –Providing a semantic mapping Infer other mappings –Writing a query Infer content of sources, relationships between sources –Creating a “digital workspace” Infer “relatedness” of documents/sources Infer co-reference between objects in the dataspace –Annotating, cutting & pasting, browsing among docs Query Evolve Reflect

35 Past, Present and Future Leverage past actions & existing structure: –[Dong et al., 2004, 2005], [He & Chang, 2003] Generalize from current actions –Queries, schema mappings Beg for extra attention: –ESP [von Ahn], mass collaboration [Doan+], active learning [Sarawagi et al.] Query Evolve Reflect

36 Reuse: Learning Schema Mappings [Doan et al.] Classifiers for mediated schema → Thousands of web forms mapped in little time → Transformic Inc: deep web search. → [Madhavan et al.]: infer mappings for any schemas in the domain Mediated schema Action Query Evolve Reflect (S1, M, S, p)

37 Technical Outline Query Evolve Reflect Unify lineage, uncertainty, and inconsistency Model them on views

38 Dataspace Reflection Answers are uncertain in dataspaces: –data, –mappings, –query answering techniques Data may be inconsistent Tracking lineage is crucial Query Evolve Reflect

39 A DSSP Needs to Process uncertain data Update uncertainty with new evidence Be proactive about reducing uncertainty Live with inconsistency Leverage lineage to reduce uncertainty (Obj1, Rel, Obj2, p) Query Evolve Reflect

40 Two Principles Need a single formalism for modeling: –uncertainty, –inconsistency, and –lineage Model them on views Query Evolve Reflect

41 Israel population Query Evolve Reflect Uncertainty & Lineage Trio @ Stanford

42 Uncertainty and Inconsistency Inconsistency = uncertainty about the truth –Salary (John Doe, $120,000) –Salary (John Doe, $135,000) → Salary (John Doe, $120,000 | $135,000) Orchestra, Ives @ U. Penn. Query Evolve Reflect
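The rewrite on this slide, from two conflicting facts to one fact with alternatives, can be sketched as grouping values by key. This is a toy illustration only: the function name `merge_conflicts` and the choice of a Python set to hold the alternatives are assumptions, and the sketch is not how Orchestra represents conflicts.

```python
from collections import defaultdict


def merge_conflicts(facts):
    """Group (key, value) facts so conflicting values become alternatives."""
    merged = defaultdict(set)
    for key, value in facts:
        merged[key].add(value)
    return dict(merged)


salary_facts = [("John Doe", 120_000), ("John Doe", 135_000)]
uncertain_salary = merge_conflicts(salary_facts)
```

Here `uncertain_salary["John Doe"]` holds both candidate values, which mirrors `Salary (John Doe, $120,000 | $135,000)`.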

43 Uncertainty Formalisms 101 Represent a set of possible worlds A-tuples - uncertainty on attribute values: –(PODS, 2006, {Chicago, Baltimore}) –Tuples can be optional Not closed under relational operators Query Evolve Reflect
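To make the A-tuple on this slide concrete, the sketch below enumerates the concrete tuples it may stand for. It assumes an uncertain attribute is written as a Python set of alternatives; the name `expand_a_tuple` is hypothetical, and optional tuples are not modeled.

```python
from itertools import product


def expand_a_tuple(a_tuple):
    """Return every concrete tuple that an A-tuple may stand for."""
    choices = [
        sorted(value) if isinstance(value, (set, frozenset)) else [value]
        for value in a_tuple
    ]
    # One pick per attribute; certain attributes have a single choice.
    return set(product(*choices))


pods = ("PODS", 2006, {"Chicago", "Baltimore"})
alternatives = expand_a_tuple(pods)
```

For the slide's example, `alternatives` contains one tuple with Chicago and one with Baltimore.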

44 Uncertainty Formalisms 101 (2) X-tuples – uncertainty on entire tuples: –{(PODS, 2006, Chicago), (PODC, 2006, Baltimore)} Still not closed under relational operators Query Evolve Reflect
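The X-tuple on this slide can be read as "exactly one of these whole tuples is true". The sketch below enumerates the possible worlds of a relation made of such X-tuples. The list-of-lists encoding and the name `possible_worlds` are assumptions for illustration, and optional X-tuples are left out.

```python
from itertools import product


def possible_worlds(x_relation):
    """Each world picks exactly one alternative from every X-tuple."""
    return [frozenset(choice) for choice in product(*x_relation)]


venue = [("PODS", 2006, "Chicago"), ("PODC", 2006, "Baltimore")]
worlds = possible_worlds([venue])
```

A relation with this single X-tuple has two possible worlds, each containing one of the two alternatives.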

45 Uncertainty 101: C-Tables { (PODC, 2006, Chicago, X=1), (PODS, 2006, Chicago, X <>1), (SIGMOD, 2006, Chicago, X<>1) } Closed and Complete! Understandable? –See [Das Sarma et al., ICDE 2006]: working models for uncertain data Query Evolve Reflect
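The C-table on this slide attaches a condition on the variable X to each tuple; fixing a value for X selects one possible world. The sketch below encodes the conditions as Python predicates. That encoding and the name `world_for` are assumptions, and the slide does not state the domain of X.

```python
# Each row pairs a tuple with its condition on X, as on the slide.
c_table = [
    (("PODC", 2006, "Chicago"), lambda x: x == 1),
    (("PODS", 2006, "Chicago"), lambda x: x != 1),
    (("SIGMOD", 2006, "Chicago"), lambda x: x != 1),
]


def world_for(c_table, x):
    """Return the tuples whose condition holds for this value of X."""
    return {row for row, condition in c_table if condition(x)}
```

With X = 1 the world contains only the PODC tuple; with any other value it contains the PODS and SIGMOD tuples together, a correlation that independent per-tuple alternatives cannot express.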

46 Adding Lineage to X-tuples [ULDBs, Benjelloun et al., VLDB 06] {(PODS, 2006, Ch), (PODC, 2006, Ba)} l1 l2 !(l1 & l2) { (…) (…) } (t) { (…) (…) } (t’) Query Evolve Reflect No effect on complexity of relational operators

47 Adding Lineage to A-tuples (PODS, {2005,2006}, {Chicago, Baltimore}) l1 l2 l3 l4 Lineage on views (projections) → Uncertainty should also be on views! (Halevy, Los Altos, CA, professor), 0.8, 0.6 → Answering queries using uncertain views See [Dalvi & Suciu, 2005] for a great start Query Evolve Reflect

48 Putting it all Together Query Evolve Reflect Approximate mappings Evidence combination Reuse human attention Create approximate semantic relationships Foundation for reasoning about uncertainty, inconsistency and lineage

49 Conclusion and Outlook Data management moving to consumers Dataspaces: key element in this agenda –Pay as you go data management –Reuse human attention The role of theory: –Reflect, generalize and explain –People, people, people

50 Some References SIGMOD Record, December 2005: –Original dataspace vision paper Maier EDBT 2006 talk PODS 2006 proceedings: challenges Data Integration: the Teenage Years –VLDB 2006 Teaching integration to undergraduates: –SIGMOD Record, September 2003.

