Structure and Value Synopses for XML Data Graphs

1 Structure and Value Synopses for XML Data Graphs
Neoklis Polyzotis (UW-Madison) Minos Garofalakis (Bell Labs)

2 Path Expressions
//author[book/year>2000]/paper
[Figure: example document tree with root r and author, title and year nodes; year values 1999 and 2002]

6 Path Expressions
//author[book/year>2000]/paper
[Figure: example document tree with root r and author, book, paper, year and title nodes; year values 1999 and 2002]
Efficient evaluation → Path Selectivity
Need to estimate true selectivities

7 Contribution
XSKETCH synopses: Structure + Value synopses
Graph structured XML Data
Branching PEs with value predicates
Low estimation error/Low storage

8 Outline
Preliminaries
XSKETCH Synopses
Construction
Experimental Results
Conclusions

9 XML Data Model
XML Document → XML Data Graph
Graph nodes → Elements + Attributes + Values
Graph edges → Nesting + Reference (ID/IDREF)
[Figure: example data graph with nodes ρ0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, T10, T11, T12, T13, E14 and values V4, V8, V10, V11, V12, V13, V14]
Legend: A: Author, PB: Publisher, B: Book, N: Name, P: Paper, T: Title, E: Editor

10 Path Expressions
XPath expressions
Simple: A/P/T
Branching: A[B]/P/T
Values: A/P/T[=v11]
Result is a set
[Figure: example data graph with nodes ρ0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, T10, T11, T12, T13, E14 and values V4, V8, V10, V11, V12, V13, V14]
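
The three kinds of path expression can be illustrated with a minimal Python sketch. The node names follow the slide, but the parent-child edges, the value assignments and the helper names (`tag`, `step`, `has_child`) are assumptions made for illustration; the figure's actual edges are not recoverable from the transcript.

```python
# Hypothetical document fragment: these edges are assumed, not taken from the slide.
# "r0" stands for the root node (rho-0 in the figure).
children = {
    "r0": ["A1", "A2"],
    "A1": ["N4", "P6"],
    "A2": ["N8", "B9", "P7"],
    "P6": ["T11"],
    "P7": ["T12"],
    "B9": ["T13"],
}
value = {"T11": "v11", "T12": "v12", "T13": "v13"}  # assumed values


def tag(node):
    """Tag is the node name without its numeric id, e.g. 'A' for 'A1'."""
    return node.rstrip("0123456789")


def step(nodes, label):
    """One child step: all children of `nodes` whose tag equals `label`."""
    return {c for n in nodes for c in children.get(n, []) if tag(c) == label}


def has_child(node, label):
    """Branching predicate [label]: the node has at least one such child."""
    return any(tag(c) == label for c in children.get(node, []))


authors = step({"r0"}, "A")

# Simple: A/P/T
simple = step(step(authors, "P"), "T")

# Branching: A[B]/P/T keeps only authors that also have a book child.
branching = step(step({a for a in authors if has_child(a, "B")}, "P"), "T")

# Values: A/P/T[=v11] filters the result elements on their value.
valued = {t for t in simple if value.get(t) == "v11"}

print(sorted(simple), sorted(branching), sorted(valued))
```

Each result is a Python set, mirroring the "Result is a set" semantics on the slide; the selectivity that the rest of the talk estimates is the size of such a set.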

15 Estimation Problem
Size of result set for a PE
Challenges:
Structural Correlations. Example: paper/author, book/author
Value to Value Correlations. Example: author[name=v1]/paper[title=v2]
Path to Value Correlations. Example: paper/author[=v], book/author[=v]

19 Outline
Preliminaries
XSKETCH Synopses
  Synopses Model
  Estimation
Construction
Experimental Results
Conclusions

20 Synopsis Model
Graph Synopsis
Edge Stability Information
Value Summaries

21 Graph Synopsis
Set of elements (same tag) → Summary Node
Document Edge → Summary Edge
[Figure: document graph (left) and its synopsis (right), with summary nodes ρ(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1)]
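
As a rough illustration of the two mappings above, the sketch below groups elements by tag into summary nodes and projects document edges onto summary edges. The edge list is a hypothetical stand-in for the figure (the real edges are not recoverable from the transcript), and `label_split_synopsis` is an illustrative name, not one used by the authors.

```python
from collections import Counter

# Hypothetical document edges (parent, child); assumed for illustration only.
doc_edges = [
    ("r0", "A1"), ("r0", "A2"), ("r0", "PB3"),
    ("A1", "N4"), ("A1", "P6"), ("A2", "N8"), ("A2", "P7"),
    ("PB3", "B5"), ("PB3", "B9"),
    ("P6", "T10"), ("P7", "T11"), ("B5", "T12"), ("B9", "T13"), ("B9", "E14"),
]


def tag(node):
    """Tag is the node name without its numeric id, e.g. 'PB' for 'PB3'."""
    return node.rstrip("0123456789")


def label_split_synopsis(edges):
    """Group elements by tag; return (extent size per tag, set of summary edges)."""
    elements = {n for edge in edges for n in edge}
    counts = Counter(tag(n) for n in elements)
    summary_edges = {(tag(u), tag(v)) for u, v in edges}
    return counts, summary_edges


counts, summary_edges = label_split_synopsis(doc_edges)
print(dict(counts))
print(sorted(summary_edges))
```

Grouping strictly by tag yields a single T node of size 4. The synopsis in the figure keeps two T(2) nodes instead, which is a finer partition of the same elements.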

22 Backward Edge Stability
u →b v: all elements in v have a parent in u
[Figure: document graph and synopsis with nodes ρ(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1); backward-stable edges are labeled b]

23 Forward Edge Stability
u →f v: all elements in u have a child in v
[Figure: document graph and synopsis with nodes ρ(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1); forward-stable edges are labeled f]
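
The two stability definitions can be checked directly against a document graph. The sketch below does so for a synopsis that groups elements by tag; the edge list is hypothetical (the figure's edges are not recoverable from the transcript) and the function names are illustrative.

```python
# Hypothetical document edges (parent, child); assumed for illustration only.
doc_edges = [
    ("r0", "A1"), ("r0", "A2"), ("r0", "PB3"),
    ("A1", "N4"), ("A1", "P6"), ("A2", "N8"), ("A2", "P7"),
    ("PB3", "B5"), ("PB3", "B9"),
    ("P6", "T10"), ("P7", "T11"), ("B5", "T12"), ("B9", "T13"), ("B9", "E14"),
]


def tag(node):
    return node.rstrip("0123456789")


def extent(label):
    """All document elements summarized by the node with this tag."""
    return {n for edge in doc_edges for n in edge if tag(n) == label}


def backward_stable(u, v):
    """u ->b v: every element in v has a parent in u."""
    with_parent_in_u = {c for p, c in doc_edges if tag(p) == u and tag(c) == v}
    return extent(v) <= with_parent_in_u


def forward_stable(u, v):
    """u ->f v: every element in u has a child in v."""
    with_child_in_v = {p for p, c in doc_edges if tag(p) == u and tag(c) == v}
    return extent(u) <= with_child_in_v


for u, v in [("A", "P"), ("B", "E"), ("B", "T")]:
    print(u, v, "b" if backward_stable(u, v) else "-", "f" if forward_stable(u, v) else "-")
```

In this assumed graph, A → P is both backward- and forward-stable, B → E is only backward-stable (B5 has no editor), and B → T is only forward-stable (two of the four titles hang under papers).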

24 Value Summaries
Summarize values "under" synopsis nodes
Implementation dependent on values
E.g., histograms, pruned suffix trees, ...
[Figure: document graph and synopsis with nodes ρ(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1), edge stability labels (b, f, b/f), and value summaries VN, VT, VT, VE]

25 Path to Value Correlations
One histogram per summary node
[Figure: document graph and synopsis with edge stability labels and value summaries VN, VT, VT, VE]

26 Value to Value Correlations
Multi-dimensional histograms
Correlations within stable neighborhood
[Figure: document graph and synopsis with edge stability labels and value summaries VN, VT, VT, VE; in the build step of slide 27 the summary VE is shown as the joint summary VE,T, and slide 28 repeats slide 26]

29 Outline
Preliminaries
XSKETCH Synopses
  Synopses Model
  Estimation
Construction
Experimental Results
Conclusions

30 Estimation
[Figure: document graph and synopsis with nodes ρ(1), A(2), PB(1), N(2), P(2), B(2), T(2), T(2), E(1), edge stability labels, and value summaries VN, VT, VT, VE]

31 Estimation
B[E=v1]/T[=v2] → 2 x f(B[E=v1]/T[=v2])
[Figure: document graph and synopsis with edge stability labels and value summaries VN, VT, VT, VE]

34 Estimation
B[E=v1]/T[=v2] → 2 x f(B/T[=v2]) x f(E=v1|B) x f(B[E])
[Figure: document graph and synopsis with edge stability labels and value summaries VN, VT, VT, VE]

35 Estimation Model
Break path into stable sub-paths
Derive correlation scopes
Apply statistical assumptions: Independence, Uniformity
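
To make the product form of the estimate concrete, here is a small worked sketch of the decomposition B[E=v1]/T[=v2] → 2 x f(B/T[=v2]) x f(E=v1|B) x f(B[E]). It assumes the leading 2 is the extent size of the result node T(2). All fraction values are invented for illustration, and approximating f(B[E]) by |E|/|B| is an assumption made here, not a rule stated on the slides.

```python
def estimate(count, fractions):
    """Multiply an extent size by a chain of (conditional) selectivity fractions."""
    result = float(count)
    for fraction in fractions:
        result *= fraction
    return result


# Extent sizes as shown on the synopsis: T(2), B(2), E(1).
count_T, count_B, count_E = 2, 2, 1

# Hypothetical fractions; none of these values come from the slides.
f_T_eq_v2 = 0.5                # f(B/T[=v2]): assumed share of T values equal to v2
f_E_eq_v1_given_B = 1.0        # f(E=v1|B): assumed share of E values equal to v1
f_B_has_E = count_E / count_B  # f(B[E]): assumed |E|/|B|, since B -> E is not f-stable

selectivity = estimate(count_T, [f_T_eq_v2, f_E_eq_v1_given_B, f_B_has_E])
print(selectivity)
```

With these made-up inputs the estimate is 2 x 0.5 x 1.0 x 0.5 = 0.5 elements. In a real synopsis the value fractions would be read from the histograms attached to the T and E nodes.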

36 Outline
Preliminaries
XSKETCH Synopses
Construction
Experimental Results
Conclusions

37 Coarsest Synopsis
[Figure: document graph and its coarsest synopsis, with nodes ρ(1), A(2), PB(1), N(2), P(2), B(2), T(4), E(1), edge stability labels, and value summaries VN, VT, VE]
All document paths, but also false paths
High estimation error
Small size

38 Perfect Synopsis
[Figure: document graph and its perfect synopsis, with one summary node per element (every extent has size 1), every edge labeled b/f, and value summaries VN, VN, VT, VT, VT, VT, VE]
All document paths and no false paths
Zero estimation error
Large size

39 Construction
Optimal XSKETCH: NP-Hard
Forward Selection Algorithm
Refinements
Successively refine coarsest summary
Selection criterion: marginal gains

40 Construction Step
[Figure: a candidate refinement r is applied to the current synopsis; the synopsis before and after r is scored against a path sample P]
Before r: E = error(P), S = size()
After r: E' = error(P), S' = size()
gain(r) = (E - E') / (S' - S)
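
The gain criterion suggests a greedy loop along the following lines. This is only a sketch under stated assumptions: the space budget as a stopping rule, the callback signatures, and the toy refinement table are all hypothetical, and the slides do not specify how candidates are generated or when construction stops.

```python
def forward_selection(synopsis, candidates, refine, error, size, budget):
    """Repeatedly apply the refinement with the largest marginal gain.

    gain(r) = (E - E') / (S' - S): error reduction per unit of added space.
    Stops when no candidate fits the budget or improves the error.
    """
    while True:
        E, S = error(synopsis), size(synopsis)
        best, best_gain = None, 0.0
        for r in candidates(synopsis):
            refined = refine(synopsis, r)
            E2, S2 = error(refined), size(refined)
            if S2 <= S or S2 > budget:
                continue  # skip refinements that add no space or overflow the budget
            gain = (E - E2) / (S2 - S)
            if gain > best_gain:
                best, best_gain = refined, gain
        if best is None:
            return synopsis
        synopsis = best


# Toy model (hypothetical): each refinement has (error reduction, size cost).
REFINEMENTS = {"backward-stabilize": (0.30, 2),
               "forward-stabilize": (0.10, 1),
               "value-expand": (0.25, 5)}

candidates = lambda s: [r for r in REFINEMENTS if r not in s]
refine = lambda s, r: s | {r}
error = lambda s: 1.0 - sum(REFINEMENTS[r][0] for r in s)
size = lambda s: 1 + sum(REFINEMENTS[r][1] for r in s)

chosen = forward_selection(frozenset(), candidates, refine, error, size, budget=4)
print(sorted(chosen))
```

In the toy run, backward-stabilize is picked first (gain 0.15 versus 0.10), forward-stabilize second, and value-expand never fits the budget of 4 units. In the setting of the talk, `error` would be measured on the path sample P and `size` would be the storage footprint of the synopsis.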

43 Refinements
Structural Refinements: backward-stabilize, forward-stabilize, backward-split
Value Refinements: value-expand, value-remove, value-refine

44 Outline
Preliminaries
XSKETCH Synopses
Construction
Experimental Results
Conclusions

45 Implementation
Single-dimensional histograms
Integer values
Strings hashed to integers
Construction: max-diff(V,A)
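
For readers unfamiliar with MaxDiff(V,A) histograms, the sketch below shows one common reading of the technique: values are sorted, each value gets an area (frequency times spread to the next value), and bucket boundaries are placed where adjacent areas differ most. This is a generic illustration, not the authors' code. The CRC32 hash for strings is an assumption, since the slide does not name a hash function, and the last spread is taken as 1 by convention.

```python
import zlib
from collections import Counter


def string_to_int(s):
    """Map a string to an integer; CRC32 is an assumed choice of hash."""
    return zlib.crc32(s.encode("utf-8"))


def maxdiff_va_buckets(values, num_buckets):
    """Build MaxDiff(V,A)-style buckets over a non-empty list of integers.

    Returns a list of (low, high, total_frequency) tuples.
    """
    freq = Counter(values)
    distinct = sorted(freq)
    # Spread: distance to the next distinct value (1 for the last value).
    spreads = [b - a for a, b in zip(distinct, distinct[1:])] + [1]
    areas = [freq[v] * s for v, s in zip(distinct, spreads)]
    diffs = [abs(areas[i + 1] - areas[i]) for i in range(len(areas) - 1)]
    # Cut after index i for the (num_buckets - 1) largest adjacent differences.
    by_diff = sorted(range(len(diffs)), key=lambda i: -diffs[i])
    cuts = sorted(by_diff[: num_buckets - 1])
    buckets, start = [], 0
    for cut in cuts:
        buckets.append(distinct[start:cut + 1])
        start = cut + 1
    buckets.append(distinct[start:])
    return [(b[0], b[-1], sum(freq[v] for v in b)) for b in buckets]


sample = [1, 1, 1, 2, 3, 4, 10, 10, 10, 10, 11]  # made-up values
histogram = maxdiff_va_buckets(sample, 3)
print(histogram)
```

Within a bucket, an estimator would typically assume frequencies are spread uniformly over the value range, which is where the uniformity assumption of the estimation model enters.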

46 Datasets
Dataset   Elements   Coarsest Summary (KB)   Perfect Summary (MB)
IMDB      102,755    7.8                     1.9
XMark     87,480     4.1                     3.3

47 Workload
1000 positive PEs
Biased random sample from document
Path length: 2-5
500 contain range predicates
Predicates: random, 10% of value domain
Similar results with negative PEs

48 Accuracy Metric Average Absolute Relative Error
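
A minimal sketch of the metric, assuming the plain definition (mean of |estimate - true| / true over the workload). The slide does not say whether any bound or smoothing is applied for small counts, so none is used here; the sample counts are made up.

```python
def avg_abs_relative_error(true_counts, estimates):
    """Mean of |estimate - true| / true over a workload of positive queries."""
    if not true_counts or len(true_counts) != len(estimates):
        raise ValueError("need two non-empty sequences of equal length")
    if any(t <= 0 for t in true_counts):
        raise ValueError("relative error is undefined for empty results")
    total = sum(abs(e - t) / t for t, e in zip(true_counts, estimates))
    return total / len(true_counts)


# Made-up true result sizes and estimates for three path expressions.
true_sizes = [100, 50, 10]
estimated = [90, 60, 10]
score = avg_abs_relative_error(true_sizes, estimated)
print(round(score, 3))
```

Because every query in a positive workload has a non-empty result, the division is well defined; negative PEs (true count 0) would need a different treatment.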

49 Results – IMDB (Branching)
Branches: 0-2 Avg. Result Count: 478(Predicates)/1901(No Predicates) (2%)

50 Results – IMDB (Simple)
Branches: 0 Avg. Result Count: 483(Predicates)/933(No Predicates)

51 Conclusions
Path selectivity estimation is important
XSKETCH synopses
  Branching PEs with predicates
  Graph-structured data
  Model: graph synopsis + stability + value summaries
  Efficient forward selection algorithm
Experimental Results
  Accurate synopses with small space requirements
  Effective construction algorithm

52 Overflow Slides

53 Path Expressions
XPath expressions
Simple: A/P/T
Branching: A[B]/P/T
Values: A[N=v8]/P/T, A/P/T[=v11]
Result is a set
[Figure: example data graph with nodes ρ0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, T10, T11, T12, T13, E14 and values V4, V8, V10, V11, V12, V13, V14]

54 Estimation (a)
A/P/T[=v] → 2 x f(T=v|A/P/T)
[Figure: document graph and synopsis with edge stability labels and value summaries VN, VT, VT, VE]

55 Estimation (c)
A/B/T[=v] → 2 x f(T=v|B/T) x f(A/B | B/T[=v])
[Figure: document graph and synopsis with edge stability labels and value summaries VN, VT, VT, VE]

56 Estimation (c)
A/B/T[=v] → 2 x f(T=v|B/T) x f(A/B)
(the condition on B/T[=v] is dropped, i.e. independence is assumed)
[Figure: document graph and synopsis with edge stability labels and value summaries VN, VT, VT, VE]

57 Value-expand example
A[N=v1]/P/T[=v2] → 2 x f(T=v2) x f(N=v1)
[Figure: document graph and synopsis with edge stability labels and value summaries VN, VT, VT]

58 Value-expand example
A[N=v1]/P/T[=v2] → 2 x f(T=v2 and N=v1)
[Figure: the same synopsis with one VT summary expanded to the joint summary VT,N]

59 Results – XMark (Branching)
Path length: Branches: 0-2 Avg. Result Count: 254(Predicates)/1057(No Predicates)

60 Results – XMark (Simple)
Path length: Branches: 0 Avg. Result Count: 302(Predicates)/771(No Predicates)

61 Synopsis Model
Graph Synopsis
Stability Information
(the two items above are bracketed together as: XSKETCH structural synopsis)
Value Distribution Information

62 XSKETCH Structural Synopses
Previous Work
Branching PEs
Graph structured XML data
Low estimation error/Small size
Values?
Path to value correlations
Value to value correlations

