CPT-S 580-06 Advanced Databases 11 Yinghui Wu EME 49.

2 Approximate query processing

3 Approximate query answering
We cannot always afford to find exact query answers in big data:
◦ Some queries are expensive (e.g., subgraph isomorphism)
◦ Resources are constrained; we cannot afford unlimited resources
◦ Applications may demand real-time response
1. Query-driven approximation
◦ Feasible query models: from intractable to low polynomial time
◦ Top-k query answering
2. Data-driven approximation: resource-bounded query answering
◦ Querying big data within our available resources

4 Revised graph query model
Relaxing the semantics of queries: a case study
◦ Subgraph isomorphism: NP-complete, with exponentially many matches
◦ Revised model: a polynomial-time algorithm (quadratic/cubic time)
◦ Effectiveness: captures more sensible matches in social graphs
◦ Efficiency: from intractable to low polynomial time
Use “cheaper” queries whenever possible; they work better for social network analysis

5 G-Ray: best-effort graph pattern matching (Hanghang T. KDD 09)
Input: attributed data graph and query graph
Output: matching subgraph

6 G-Ray: quick overview (for loop)
Step 1: SF; Step 2: NE; Step 3: BR; Step 4: NE; Step 5: BR; Step 6: NE; Step 7: BR; Step 8: BR
SF: Seed-Finder; NE: Neighborhood-Expander; BR: Bridge

7 G-Ray: example pattern and matches
Best-effort match:
◦ Not exact answers
◦ No approximation guarantee
◦ Loses topological information
◦ Runs in time linear in |G|

8 Top-k query answering
Traditional query answering: compute Q(D)
◦ Q(D) is expensive to compute when D is large
◦ The result Q(D) can be excessively large for users to inspect, even larger than D
◦ How many matches do you check when you use, e.g., Google?
Top-k query answering:
◦ Input: query Q, database D and a positive integer k
◦ Output: a top-ranked set of k matches of Q
Early termination: return top-k matches without computing Q(D)

9 Top-k graph querying
Input: graph G, query Q, integer k, answer quality measure
Output: a top-k answer set that maximizes the objective function F
Top-k algorithms: exact top-k, approximate top-k, any-time top-k, early terminating
How do top-k graph problems differ from top-k table aggregation?
◦ Valid match expansion
◦ Join (no monotonicity)
◦ Hard to show instance optimality (top-k join queries are special cases!)

10 GraphTA: a template
Initialize a candidate list L for each node/edge of interest in Q
For each list L:
    sort L with the ranking function
    set a cursor on L
Set an upper bound U
For each cursor c in each list L do:
    generate a match that contains c
    update Q(G, k)
    update threshold H with the lowest score in Q(G, k)
    move all cursors one step ahead
    update the upper bound U
    if k matches are identified and H >= U then break
Return Q(G, k)
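The template follows the shape of the classic threshold algorithm. A minimal Python sketch of that shape is given here under simplifying assumptions that are not in the slides: every candidate is a single item with one non-negative score per list, the aggregate score is the sum over lists, and "generating a match" is reduced to looking up the item's scores. The function name `threshold_topk` and the toy data are hypothetical.

```python
import heapq

def threshold_topk(lists, k):
    """lists: one dict {item: score} per query node/edge (scores >= 0).
    Returns the top-k (total, item) pairs and the number of rounds used."""
    # Sort every candidate list by descending score (the ranking function).
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    topk, seen, rounds = [], set(), 0   # topk is a min-heap playing Q(G, k)
    depth = max((len(sl) for sl in sorted_lists), default=0)
    for pos in range(depth):            # all cursors move one step per round
        rounds += 1
        upper = 0                       # upper bound U for any unseen item
        for sl in sorted_lists:
            if pos >= len(sl):
                continue
            item, score = sl[pos]
            upper += score
            if item in seen:
                continue
            seen.add(item)
            total = sum(l.get(item, 0) for l in lists)
            if len(topk) < k:
                heapq.heappush(topk, (total, item))
            elif total > topk[0][0]:
                heapq.heapreplace(topk, (total, item))
        # H is the lowest score kept so far; stop once H >= U.
        if len(topk) == k and topk[0][0] >= upper:
            break
    return sorted(topk, reverse=True), rounds

scores = [{"a": 9, "b": 8, "c": 1}, {"a": 8, "b": 9, "c": 2}]
best, rounds = threshold_topk(scores, k=1)
```

On this toy input the loop stops after two of the three rounds, which is the early termination the template aims for. For real graph patterns, generating a valid match is a join rather than a lookup, which is why monotonicity and instance optimality are harder to obtain.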

11 Finding best candidates
Query: find good PM (project manager) candidates who collaborated with PRG (programmer), DB (database developer) and ST (software tester).
[Figure: pattern graph Q with nodes Project Manager* (the “query focus”), Programmer, DB manager and Tester; collaboration network G with nodes PM 1 to PM 4, BA, PRG 1 to PRG 4, DB 1 to DB 3, UD 1, UD 2 and ST 1 to ST 4]
Complete matching relation:
◦ (project manager, PM 1), (project manager, PM 2), (project manager, PM 3), (project manager, PM 4)
◦ (programmer, PRG 1), (programmer, PRG 2), (programmer, PRG 3), (programmer, PRG 4)
◦ (DBmanager, DB 1), (DBmanager, DB 2), (DBmanager, DB 3)
◦ (tester, ST 1), (tester, ST 2), (tester, ST 3), (tester, ST 4)
Querying collaborative networks: we just want top-ranked PMs

12 Graph pattern matching with output node
Input: graph G = (V, E, f_A), pattern Q = (V_Q, E_Q, f_v, u_o), where u_o is the output node
Output: Q(G, u_o) = { v | (u_o, v) ∈ Q(G) }, the matches of the output node
Output: k nodes vs. the entire set Q(G)
Top-k query answering:
◦ Input: pattern Q, data graph G and a positive integer k
◦ Output: top-k matches in Q(G, u_o)
[Figure: pattern Q with output node PM* and nodes DB, PRG and ST; data graph with pm 1 ... pm n, db 1 to db 3, prg 1 to prg 3 and st 1 ... st m; top-2 matches marked]
How to rank the answers?

13 Top-k answers
Top-k matching: top-k matches that maximize the total relevance
[Figure: nodes PM 2, DB 2, DB 3, PRG 2, PRG 3, PRG 4, ST 2, ST 3, ST 4]
Relevant set R(u, v) for a match v of a query node u:
◦ all descendants of v that are matches of descendants of u
◦ a unique, maximum relevance set
Relevance function
◦ The more reachable matches, the better
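The relevant set can be read directly off the definition. The following Python sketch assumes the match relation is already known and given as a dictionary from query nodes to sets of data nodes; the helper names and the toy graphs are hypothetical and are not the graphs from the slides.

```python
def descendants(adj, start):
    """All nodes reachable from start by one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def relevant_set(q_adj, g_adj, matches, u, v):
    """Descendants of v in G that match some descendant of u in Q."""
    allowed = set()
    for u2 in descendants(q_adj, u):
        allowed |= matches.get(u2, set())
    return descendants(g_adj, v) & allowed

# Hypothetical toy pattern and data graph.
q_adj = {"PM": ["PRG", "DB"], "PRG": ["ST"], "DB": ["ST"]}
g_adj = {"PM1": ["PRG1"], "PM2": ["PRG2", "DB2"],
         "PRG2": ["ST2"], "DB2": ["ST3"]}
matches = {"PRG": {"PRG1", "PRG2"}, "DB": {"DB2"}, "ST": {"ST2", "ST3"}}

relevance = {v: len(relevant_set(q_adj, g_adj, matches, "PM", v))
             for v in ("PM1", "PM2")}
```

Here PM2 reaches four matches and PM1 only one, so PM2 ranks higher under "the more reachable matches, the better".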

14 Finding top-k matches
Finding top-k matches for acyclic patterns:
◦ Initialize a heap S, and a vector for each candidate v
◦ Compute a set of matches for some query nodes (those that can be determined without the following steps)
◦ Iteratively update the vectors of the other candidates by propagating the partial answers
◦ Termination condition: (1) each v in S is a match of u_o, and (2) min_{v ∈ S} l(u_o, v) ≥ max_{v′ ∈ can(u_o) \ S} h(u_o, v′), where l(u_o, v) and h(u_o, v) denote a lower bound and an upper bound of the relevance r(u_o, v)
Vector of a candidate v: X_v (is v a match?), v.R (relevance set), v.lower and v.upper (relevance bounds)
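The termination condition can be checked with a few lines of Python. This is only a sketch of the test itself, assuming the lower bounds, upper bounds and match flags have already been computed by the propagation step; the function name `can_terminate` and the numbers are hypothetical.

```python
def can_terminate(S, candidates, lower, upper, is_match):
    """True if S can be returned as the top-k answer without more work."""
    if not S or not all(is_match[v] for v in S):
        return False                     # condition (1) fails
    rest = [v for v in candidates if v not in S]
    if not rest:
        return True                      # nothing left that could beat S
    # Condition (2): worst lower bound in S beats best upper bound outside S.
    return min(lower[v] for v in S) >= max(upper[v] for v in rest)

candidates = ["PM1", "PM2", "PM3", "PM4"]
is_match = {"PM1": False, "PM2": True, "PM3": False, "PM4": False}
lower = {"PM2": 4}
upper = {"PM1": 1, "PM3": 3, "PM4": 2}
done = can_terminate({"PM2"}, candidates, lower, upper, is_match)
```

With these bounds no candidate outside S can reach a relevance of 4, so the search may stop before the remaining candidates are fully evaluated.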

15 Finding top-k matches: example
[Figure: pattern with Project Manager*, Programmer and DB manager; data graph with PM 1 to PM 4, BA, PRG 1, PRG 3, PRG 4 and DB 1 to DB 3; two tables of candidate vectors v.T, one after initialization and one after propagation from DB 2]
Result: a valid match whose relevant set includes the most matches compared with the others. The early termination condition is met.

16 A revision of conventional approximation theory

17 Traditional approximation theory
Traditional approximation algorithm T for an NPO (NP optimization problem):
◦ for each instance x, T(x) computes a feasible solution y
◦ quality metric f(x, y)
◦ performance ratio η ≥ 1: for all x, with OPT(x) the optimal solution,
    Minimization: OPT(x) ≤ f(x, y) ≤ η OPT(x)
    Maximization: (1/η) OPT(x) ≤ f(x, y) ≤ OPT(x)
Does it work when it comes to querying big data?

18 The approximation theory revisited
Traditional approximation algorithm T for an NPO:
◦ for each instance x, T(x) computes a feasible solution y
◦ quality metric f(x, y)
◦ performance ratio η (minimization): for all x, OPT(x) ≤ f(x, y) ≤ η OPT(x)
A quest for revising approximation algorithms for querying big data:
◦ Approximation: needed even for low PTIME problems, not just NPO
◦ Quality metric: the answer to a query is typically a set, not a number
◦ Approach: it does not help much if T(x) conducts computation on “big” data x directly!

19 Data-driven: resource-bounded query answering
Input: a class 𝒬 of queries, a resource ratio α ∈ [0, 1), and a performance ratio η ∈ (0, 1]
Question: develop an algorithm that, given any query Q ∈ 𝒬 and dataset D,
◦ accesses a fraction D_Q of D such that |D_Q| ≤ α|D|,
◦ computes Q(D_Q) as approximate answers to Q(D), and
◦ guarantees accuracy(Q, D, α) ≥ η
[Figure: dynamic reduction takes D and Q to D_Q; the approximation Q(D_Q) stands in for Q(D)]
Accessing at most α|D| amount of data in the entire process

20 Resource-bounded query answering
Resource bounded: resource ratio α ∈ [0, 1), decided by our available resources (time, space, …)
Dynamic reduction: given Q and D, find D_Q for Q
◦ contrast this with synopses, which find one reduced dataset for all Q: histograms, wavelets, sketches, sampling, …
◦ better reduction ratio
In combination with other tricks for making big data small: access schema, distributed, views, …
[Figure: dynamic reduction takes D and Q to D_Q; the approximation Q(D_Q) stands in for Q(D)]

21 Accuracy metrics
Performance ratio for approximate query answering: the F-measure of precision and recall, to cope with the set semantics of query answers
precision(Q, D, α) = |Q(D_Q) ∩ Q(D)| / |Q(D_Q)|
recall(Q, D, α) = |Q(D_Q) ∩ Q(D)| / |Q(D)|
accuracy(Q, D, α) = 2 * precision(Q, D, α) * recall(Q, D, α) / (precision(Q, D, α) + recall(Q, D, α))
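The F-measure above translates directly into set operations. In this Python sketch the two answer sets are passed in explicitly; the function name `accuracy` and the conventions for empty sets (1.0 when both are empty, 0.0 when they share nothing) are assumptions, since the slide does not define those cases.

```python
def accuracy(approx, exact):
    """F-measure of the approximate answers Q(D_Q) against the exact Q(D)."""
    approx, exact = set(approx), set(exact)
    if not approx and not exact:
        return 1.0                 # assumed convention: nothing to find
    hits = len(approx & exact)
    if hits == 0:
        return 0.0                 # assumed convention: avoids 0/0
    precision = hits / len(approx)
    recall = hits / len(exact)
    return 2 * precision * recall / (precision + recall)

score = accuracy({1, 2, 5}, {1, 2, 3, 4})   # precision 2/3, recall 1/2
```

For this toy input the F-measure is 4/7: two of the three returned answers are correct, and two of the four exact answers are found.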

22 Personalized social search
Making big graphs of PB size fit into our memory
Graph Search, Facebook:
◦ Find me all my friends who live in Pullman and like cycling
◦ Find me restaurants in Seattle my friends have been to
◦ Find me photos of my friends in New York
Personalized social search with α = 0.0015%:
◦ 1.5 * 10^-5 * 1 PB (10^15 B) = 15 * 10^9 B = 15 GB
◦ making big graphs of PB size as small as 15 GB!
◦ localized patterns with 100% accuracy!
Add to this access schema, distributed, views, …
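The size of the accessed fraction is α|D|, which is easy to get wrong by a power of ten when α is given as a percentage. A small Python check, using exact fractions and taking 1 PB as 10^15 bytes as the slide does; the helper name `budget_bytes` is hypothetical.

```python
from fractions import Fraction

PB = 10**15  # bytes, as on the slide

def budget_bytes(alpha, size_bytes=PB):
    """Amount of data that may be accessed for resource ratio alpha."""
    return Fraction(alpha) * size_bytes

alpha = Fraction(15, 10**6)   # 0.0015% = 0.0015 / 100 = 15 parts per million
gigabytes = budget_bytes(alpha) / 10**9
```

The same helper gives 10 GB for α = 10^-5 and 100 GB for α = 10^-4.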

23 Localized queries
Localized queries: can be answered locally
◦ Graph pattern queries: revised simulation queries
◦ matching relation over the d_Q-neighborhood of a personalized node
Michael: “find cycling fans who know both my friends in the cycling club and my friends in hiking groups”
[Figure: pattern with personalized node Michael (unique match), hiking group, cycling club member and ?cycling lovers; data graph with Michael, hiking groups hg 1 ... hg m, cycling club members cc 1 to cc 3 and cycling fans cl 1 ... cl n]
Applications: personalized social search, ego network analysis, …
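The slide names simulation queries without spelling out the matching relation. As background, here is a minimal Python sketch of plain graph simulation, not the revised, personalized variant the slide refers to: a data node stays a match of a query node only while each query edge can be followed to some matching child. The function name, the edge directions and the toy graph are assumptions made for illustration.

```python
def graph_simulation(q_adj, q_label, g_adj, g_label):
    """Return {query node: set of data nodes} for plain graph simulation."""
    # Start from label matches, then prune until nothing changes.
    sim = {u: {v for v in g_label if g_label[v] == q_label[u]}
           for u in q_label}
    changed = True
    while changed:
        changed = False
        for u, children in q_adj.items():
            for u2 in children:
                keep = {v for v in sim[u]
                        if any(w in sim[u2] for w in g_adj.get(v, ()))}
                if keep != sim[u]:
                    sim[u], changed = keep, True
    if any(not s for s in sim.values()):
        return {u: set() for u in sim}   # some query node has no match
    return sim

q_adj = {"me": ["cc", "hg"], "cc": ["cl"], "hg": ["cl"]}
q_label = {"me": "Michael", "cc": "cycling club",
           "hg": "hiking group", "cl": "cycling lover"}
g_adj = {"Michael": ["cc1", "cc2", "hg1"], "cc1": ["cl1"], "hg1": ["cl1"]}
g_label = {"Michael": "Michael", "cc1": "cycling club",
           "cc2": "cycling club", "hg1": "hiking group",
           "cl1": "cycling lover", "cl2": "cycling lover"}
sim = graph_simulation(q_adj, q_label, g_adj, g_label)
```

In this toy graph cc2 is pruned because it knows no cycling lover, while the isolated cl2 survives because plain simulation only checks outgoing edges. Restricting the relation to the neighborhood of the personalized node, as the slide describes, would rule out such distant nodes.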

24 Resource-bounded simulation
Pipeline: preprocessing (auxiliary information), then dynamic reduction (compute reduced subgraph), then approximate query evaluation over the reduced subgraph
Local auxiliary information on G, collected in preprocessing: degree, |neighbor|, …
Dynamically updated auxiliary information for a query node u and a data node v:
◦ Boolean guarded condition: label matching (does the label of v match that of u?)
◦ Cost function c(u, v): if v is included, the number of additional nodes that also need to be included (charged to the budget)
◦ Potential function p(u, v): estimated probability that v matches u (total number of nodes in the neighborhood of v that are candidate matches)
◦ Bound b: determines an upper bound on the number of nodes to be visited
Query-guided search, bounded by the budget
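The interplay of guard, cost, potential and budget can be illustrated with a best-first search that never visits more than a fixed number of nodes. This Python sketch is a loose illustration only: the scoring rule potential / (1 + cost), the way cost and potential are estimated from neighbor counts, and all names are assumptions rather than the functions used in the slides.

```python
import heapq

def reduce_graph(adj, labels, query_labels, start, budget):
    """Collect at most `budget` nodes around `start`, best score first."""
    def potential(v):   # neighbours of v that pass the label guard
        return sum(labels[w] in query_labels for w in adj.get(v, ()))

    def cost(v):        # neighbours that might have to be fetched with v
        return len(adj.get(v, ()))

    visited, frontier = {start}, []

    def push_neighbours(v):
        for w in adj.get(v, ()):
            # Guarded condition: only follow nodes whose label is in Q.
            if w not in visited and labels[w] in query_labels:
                score = potential(w) / (1 + cost(w))
                heapq.heappush(frontier, (-score, w))

    push_neighbours(start)
    while frontier and len(visited) < budget:
        _, v = heapq.heappop(frontier)
        if v in visited:
            continue
        visited.add(v)
        push_neighbours(v)
    return visited

adj = {"Michael": ["hg1", "cc1", "x1"], "hg1": ["cl1"],
       "cc1": ["cl1"], "x1": ["y1"]}
labels = {"Michael": "Michael", "hg1": "hiking group",
          "cc1": "cycling club", "cl1": "cycling lover",
          "x1": "other", "y1": "other"}
query_labels = {"Michael", "hiking group", "cycling club", "cycling lover"}
small = reduce_graph(adj, labels, query_labels, "Michael", budget=3)
full = reduce_graph(adj, labels, query_labels, "Michael", budget=10)
```

The query would then be evaluated on the subgraph induced by the returned nodes instead of on the whole graph. A tight budget may cut off nodes that belong to the exact answer, which is where the accuracy measure comes in.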

25 Resource-bounded simulation: example
Pipeline: preprocessing, then dynamic reduction (compute reduced subgraph), then approximate query evaluation over the reduced subgraph
[Figure: pattern with Michael, hiking group, cycling club and ?cycling lovers; data graph with Michael, hiking groups hg 1 ... hg m, cycling club members cc 1 to cc 3 and cycling fans cl 1 ... cl n; candidates annotated with auxiliary information: guard FALSE (no cost, potential or bound); guard TRUE, cost = 1, potential = 3, bound = 2; guard TRUE, cost = 1, potential = 2, bound = 2; overall bound = 14, visited = 16]
Match relation: (Michael, Michael), (hiking group, hg m), (cycling club, cc 1), (cycling club, cc 3), (cycling lover, cl n-1), (cycling lover, cl n)
Dynamic data reduction and query-guided search

26 Accuracy
[Figure: accuracy on Yahoo, varying α (in units of 10^-5)]
◦ 89%-100% accuracy for simulation queries
◦ both achieve 100% accuracy when α > 0.0015%
◦ 100% accuracy for 10^-5 * 1 PB (10^15 B) = 10^10 B = 10 GB

27 Non-localized queries
Reachability
◦ Input: a directed graph G, and a pair of nodes s and t in G
◦ Question: does there exist a path from s to t in G?
Non-localized: t may be far from s
Example: is Michael connected to Eric via social links?
[Figure: social graph containing Michael, cc 1, cl 3, cl 4, cl 5, cl 6, cl 7, cl 9, cl 16, cl n-1, … and Eric]
Does dynamic reduction work for non-localized queries?

28 Resource-bounded reachability
Yes, dynamic reduction works for non-localized queries
◦ From the big graph G to a small tree index G_Q, in O(|G|)
◦ Reduction size: |G_Q| ≤ α|G|
◦ Approximation (experimentally verified): no false positives, in time O(α|G|)
◦ Output: reachability query results

29 Preprocessing: landmarks
Pipeline: preprocessing, then dynamic reduction (compute landmark index), then approximate query evaluation over the landmark index
Landmarks
◦ a landmark node covers a certain number of node pairs
◦ reachability of the pairs it covers can be computed from landmark labels
A revision of 2-hop covers
Search the landmark index (of size ≤ α|G|) instead of G
[Figure: social graph from Michael to Eric; landmark cc 1 labeled “I can reach cl 3”; landmark cl n-1 labeled “cl 3 can reach me”]
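How landmark labels answer reachability without false positives can be seen in a small Python sketch. It is a simplification made for illustration: each landmark stores the full set of nodes that reach it and the full set of nodes it reaches, with no attempt to keep the index within α|G|, and the landmarks are picked by hand. All names and the toy graph are hypothetical.

```python
from collections import deque

def reach_set(adj, src):
    """Nodes reachable from src (including src) by breadth-first search."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def build_index(adj, landmarks):
    """For each landmark m: (nodes that can reach m, nodes m can reach)."""
    rev = {}
    for u, targets in adj.items():
        for v in targets:
            rev.setdefault(v, []).append(u)
    return {m: (reach_set(rev, m), reach_set(adj, m)) for m in landmarks}

def landmark_reach(index, s, t):
    """True only if some landmark lies on a path from s to t."""
    return s == t or any(s in anc and t in desc
                         for anc, desc in index.values())

adj = {"Michael": ["cc1"], "cc1": ["cl3"], "cl3": ["Eric"], "x": ["y"]}
index = build_index(adj, landmarks=["cl3"])
```

A positive answer is always backed by a real path through a landmark, so there are no false positives. A pair such as (x, y), which is connected but covered by no landmark, is reported as unreachable, which is the kind of error the accuracy measure accounts for.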

30 Hierarchical landmark index
Landmark index
◦ landmark nodes are selected to encode pairwise reachability
◦ hierarchical indexing: apply multiple rounds of landmark selection to construct a tree of landmarks
A node v can reach v′ if there exist v1, v2, v3 in the index such that v reaches v1, v2 reaches v′, and v1 and v2 are connected to v3 at the same level (coding)
[Figure: social graph from Michael to Eric and the tree of landmarks built over it, with cc 1, cl 7 and cl n-1 at the upper level and cl 3, cl 4, cl 5, cl 6, cl 9, cl 16, … below]

31 Hierarchical landmark index: guided search
Landmark index
◦ landmark nodes are selected to encode pairwise reachability
◦ hierarchical indexing: apply multiple rounds of landmark selection to construct a tree of landmarks
Auxiliary information for guided search on the landmark index:
◦ Stored per landmark: cover size, landmark labels/encoding, topological rank/range
◦ Boolean guarded condition (v, vp, v′): whether v can possibly reach v′ via vp
◦ Cost function c(v): size of unvisited landmarks in the subtree rooted at v
◦ Potential P(v): total cover size of unvisited landmarks as the children of v
[Figure: social graph from Michael to Eric and the tree of landmarks built over it]

32 Resource-bounded reachability
Pipeline: preprocessing, then dynamic reduction (compute landmark index), then approximate query evaluation over the landmark index
Bi-directed guided traversal of the landmark index, using local auxiliary information: “drill down” and “roll up”
[Figure: social graph from Michael to Eric and its landmark tree; landmarks annotated with condition = FALSE (no cost or potential); condition = ?, cost = 9, potential = 46; condition = ?, cost = 2, potential = 9; condition = TRUE]

33 Accuracy
[Figure: accuracy on Yahoo, varying α (in units of 10^-4)]
◦ achieves 100% accuracy when α > 0.05%
◦ 100% accuracy for 10^-4 * 1 PB (10^15 B) = 10^11 B = 100 GB

34 Efficiency: resource-bounded reachability
[Figure: running time on Yahoo, varying α (in units of 10^-4)]
◦ RBreach is 62.5 times faster than BFS and 5.7 times faster than BFS-OPT
◦ 10^-4 * 1 PB (10^15 B) = 10^11 B = 100 GB

35 Summing up

36 Approximate query answering
Challenges: to get real-time answers
◦ Big data and costly queries
◦ Limited resources
Two approaches:
◦ Query-driven approximation: cheaper queries; retain sensible answers
◦ Data-driven approximation: dynamic data reduction; query-guided search
Combined with techniques for making big data small
Reduce data of PB size to GB
Yes, we can query big data within bounded resources!

37 Papers for you to review
◦ G. Gou and R. Chirkova. Efficient algorithms for exact ranked twig-pattern matching over graphs. In SIGMOD, 2008. http://dl.acm.org/citation.cfm?id=1376676
◦ H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB, 2008. http://www.vldb.org/pvldb/1/1453899.pdf
◦ R. T. Stern, R. Puzis, and A. Felner. Potential search: A bounded-cost search algorithm. In ICAPS, 2011. (search Google Scholar)
◦ S. Zilberstein, F. Charpillet, P. Chassaing, et al. Real-time problem solving with contract algorithms. In IJCAI, 1999. (search Google Scholar)
◦ W. Fan, X. Wang, and Y. Wu. Diversified Top-k Graph Pattern Matching, VLDB 2014. (query-driven approximation)
◦ W. Fan, X. Wang, and Y. Wu. Querying big graphs with bounded resources, SIGMOD 2014. (data-driven approximation)

38 Reading
1. M. Arenas, L. E. Bertossi, J. Chomicki: Consistent Query Answers in Inconsistent Databases, PODS 1999. http://web.ing.puc.cl/~marenas/publications/pods99.pdf
2. Indrajit Bhattacharya and Lise Getoor. Collective Entity Resolution in Relational Data. TKDD, 2007. http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf
3. P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB 2011. http://www.vldb.org/pvldb/vol4/p956-li.pdf
4. W. Fan and F. Geerts, Relative information completeness, PODS, 2009.
5. Y. Cao, W. Fan, and W. Yu. Determining relative accuracy of attributes. SIGMOD 2013.
6. P. Buneman, S. Davidson, W. Fan, C. Hara and W. Tan. Keys for XML. WWW 2001.
