
1 QSX: Querying Social Graphs
◦ Approximate query answering
◦ Query-driven approximation
◦ Data-driven approximation
◦ Graph systems

2 Make big graphs small
Input: a class Q of queries
Question: Can we effectively find, given any query Q ∈ Q and any (possibly big) graph G, a small G_Q such that Q(G) = Q(G_Q)?
Is it always possible to compute exact query answers?
◦ Distributed query processing
◦ Boundedly evaluable graph queries
◦ Query-preserving graph compression
◦ Query answering using views
◦ Bounded incremental evaluation
[Figure: evaluating Q over the small G_Q instead of the big graph G]

3 Approximate query answering
We can’t always find exact query answers in big data:
◦ Some queries are not parallel scalable and can’t be made small enough
◦ We may have constrained resources -- we cannot afford unlimited resources
◦ Applications may demand real-time response
What can we do?
1. Query-driven approximation
   ① From intractable to low polynomial time
   ② Top-k query answering
2. Data-driven approximation: resource-bounded query answering -- querying big data within our available resources

4 Query-driven approximation

5 Revised graph simulation
Relaxing the semantics of query answering:
◦ Effectiveness: capture more sensible matches in social graphs
◦ Efficiency: from intractable to low polynomial time
Subgraph isomorphism: NP-complete, exponentially many matches
Revised graph simulation: quadratic/cubic time, a match relation of size at most |V_Q||V|
Use “cheaper” queries whenever possible -- they work better for social network analysis
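To make the contrast concrete, below is a minimal Python sketch of plain graph simulation computed by naive fixpoint refinement. The dict-based graph representation and the function name are assumptions for illustration; this is not the optimized quadratic-time algorithm or the revised simulation variants referred to on the slide.

```python
from collections import defaultdict

def graph_simulation(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
    """Maximum graph-simulation relation between a pattern and a data graph.

    pattern_nodes / graph_nodes: dict mapping node id -> label
    pattern_edges / graph_edges: iterable of (source, target) pairs
    Returns: dict mapping each pattern node to the set of data nodes simulating it.
    """
    p_succ, g_succ = defaultdict(set), defaultdict(set)
    for u, u2 in pattern_edges:
        p_succ[u].add(u2)
    for v, v2 in graph_edges:
        g_succ[v].add(v2)

    # Initial candidates: label match only.
    sim = {u: {v for v, lbl in graph_nodes.items() if lbl == pattern_nodes[u]}
           for u in pattern_nodes}

    # Fixpoint refinement: drop v from sim[u] if some pattern edge (u, u2)
    # has no witness edge (v, v2) in the data graph with v2 in sim[u2].
    changed = True
    while changed:
        changed = False
        for u in pattern_nodes:
            for u2 in p_succ[u]:
                bad = {v for v in sim[u] if not (g_succ[v] & sim[u2])}
                if bad:
                    sim[u] -= bad
                    changed = True
    return sim
```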

6 Top-k query answering
Traditional query answering: compute Q(G)
◦ It is expensive to compute when G is large
◦ The result Q(G) can be excessively large for users to inspect -- even larger than G
◦ How many matches do you check when you use, e.g., Google?
Top-k query answering:
◦ Input: a query Q, a graph G and a positive integer k
◦ Output: a top-ranked set of k matches of a designated node u_o
Early termination: return (diversified) top-k matches without computing the entire Q(G)

7 Finding best candidates
Query: find good PM (project manager) candidates who collaborated with PRG (programmer), DB (database developer) and ST (software tester).
[Figure: pattern graph Q with nodes Project Manager* (the “query focus”), Programmer, DB manager and Tester, matched against a collaboration network G with nodes PM_1..PM_4, PRG_1..PRG_4, DB_1..DB_3, ST_1..ST_4, BA, UD_1, UD_2]
The complete matching relation:
(project manager, PM_1), (project manager, PM_2), (project manager, PM_3), (project manager, PM_4)
(programmer, PRG_1), (programmer, PRG_2), (programmer, PRG_3), (programmer, PRG_4)
(DB manager, DB_1), (DB manager, DB_2), (DB manager, DB_3)
(tester, ST_1), (tester, ST_2), (tester, ST_3), (tester, ST_4)
Querying collaborative networks: we just want the top-ranked PMs

8 Graph pattern matching with output node
Input: a data graph G = (V, E, f_A) and a pattern Q = (V_Q, E_Q, f_v, u_o), where u_o is a designated output node
Output: Q(G, u_o) = { v | (u_o, v) ∈ Q(G) }, the matches of the output node
Top-k query answering:
◦ Input: pattern Q, data graph G and a positive integer k
◦ Output: top-k matches in Q(G, u_o) -- k nodes vs. the entire set Q(G)
[Figure: pattern Q with output node PM*, connected to DB, PRG and ST, and its top-2 matches among pm_1 .. pm_n in the data graph]
How to rank the answers?

9 Top-k answers
Relevant set R(u, v) for a match v of a query node u: all descendants of v that match descendants of u -- a unique, maximum relevant set
Relevance function: the more reachable matches, the better
Top-k matching: the k matches of the output node that maximize the total relevance
Based on relevance alone: O(|G||Q| + |G|^2) time, with early termination
[Figure: the relevant set of PM_2, containing DB_2, DB_3, PRG_2, PRG_3, PRG_4, ST_2, ST_3, ST_4]
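A sketch of relevance-based ranking under one plausible reading of the slide: score a match v of the output node by the number of (descendant query node, descendant data node) match pairs in its relevant set. The exact relevance function and the early-termination strategy of the cited work are not reproduced here; `graph_simulation` from the earlier sketch is assumed to supply the match relation, and `p_succ`/`g_succ` are successor maps of the pattern and data graph.

```python
from collections import deque

def descendants(succ, start):
    """Nodes reachable from start via succ (node -> set of successors)."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in succ.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def relevance(u_o, v, p_succ, g_succ, sim):
    """Size of the relevant set R(u_o, v): descendants of v matching descendants of u_o."""
    down_q = descendants(p_succ, u_o)   # descendants of the output node in the pattern
    down_g = descendants(g_succ, v)     # descendants of the candidate match in the data graph
    return sum(len(down_g & sim[u2]) for u2 in down_q)

def top_k_by_relevance(u_o, k, p_succ, g_succ, sim):
    """Rank the matches of the output node by relevance and keep the best k."""
    scored = [(relevance(u_o, v, p_succ, g_succ, sim), v) for v in sim[u_o]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```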

10 Ranking match results: relevance
Top-k graph pattern matching by relevance captures social impact
Top-k query answering: given pattern Q, data graph G and a positive integer k, return the top-k matches in Q(G, u_o)
[Figure: pattern PM* connected to DB, PRG and ST over the data graph; the top-2 relevant matches among pm_1 .. pm_n]

11 Result diversification
To avoid overly homogeneous matches, combine social relevance with diversity
Diversity function: based on the set difference of the relevant sets of two matches
Diversification is a bi-objective optimization, balancing relevance and diversity in a single function F(S)
Diversified top-k matching: find a set S of k matches of the output node such that F(S) is maximized
Diversified top-k matching is NP-complete -- more expensive than simulation
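A hedged sketch of how such a bi-objective F(S) might be optimized greedily. This is a generic max-sum style heuristic with an illustrative trade-off parameter `lam`; it is not the 2-approximation algorithm of the cited paper, and `rel`/`dist` are assumed callables supplied by the caller.

```python
def diversify_top_k(candidates, rel, dist, k, lam=0.5):
    """Greedy diversified top-k selection (a max-sum style heuristic).

    candidates: candidate matches of the output node
    rel(v):     relevance score of v
    dist(v, w): diversity distance between two matches, e.g. based on the
                set difference of their relevant sets
    lam:        relevance/diversity trade-off in the objective F(S)
    """
    remaining, selected = set(candidates), []
    while remaining and len(selected) < k:
        # Marginal gain: own relevance plus distance to the matches picked so far.
        def gain(v):
            return lam * rel(v) + (1.0 - lam) * sum(dist(v, w) for w in selected)
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```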

12 Ranking match results: diversity
Diversified top-k graph pattern matching captures social diversity
Top-k query answering: given pattern Q, data graph G and a positive integer k, return the top-k diversified matches in Q(G, u_o)
[Figure: pairwise diversity of candidate matches, e.g. δ_d(pm_1, pm_2) = (m+5)/(m+6), δ_d(pm_2, pm_3) = 3/(m+2), δ_d(pm_1, pm_3) = 1; the top-2 diversified matches]

13 The complexity
Top-k query answering: given pattern Q, data graph G and a positive integer k, return the top-k matches in Q(G, u_o) without computing the entire Q(G, u_o)
◦ Relevance alone: O((|V| + |Q|)(|E| + |V|)) -- quadratic time
◦ Diversification, based on both relevance and diversity: NP-complete (decision problem) and APX-hard; O((|V| + |Q|)(|E| + |V|)) with approximation ratio 2 -- quadratic time even with diversification
Early termination: stop as soon as the top-k matches are found; 1.8 times faster than the traditional simulation algorithm

14 Data-driven approximation for bounded resources

15 Traditional approximation theory
A traditional approximation algorithm T for an NPO (NP optimization problem):
◦ for each instance x, T(x) computes a feasible solution y
◦ quality metric f(x, y)
◦ performance ratio η ≥ 1, where OPT(x) denotes the optimal solution: for all x,
   - minimization: OPT(x) ≤ f(x, y) ≤ η · OPT(x)
   - maximization: (1/η) · OPT(x) ≤ f(x, y) ≤ OPT(x)
Does it work when it comes to querying big data?
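The performance-ratio bounds on this slide, restated as a LaTeX math block (the ratio symbol was lost in extraction; η is used here as a stand-in):

```latex
% Approximation algorithm T with performance ratio \eta \ge 1,
% where y = T(x) and OPT(x) is the optimum for instance x:
\text{minimization: } \mathrm{OPT}(x) \le f(x, y) \le \eta \cdot \mathrm{OPT}(x)
\qquad
\text{maximization: } \tfrac{1}{\eta}\,\mathrm{OPT}(x) \le f(x, y) \le \mathrm{OPT}(x)
```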

16 The approximation theory revisited
Traditional approximation algorithm T for an NPO: for each instance x, T(x) computes a feasible solution y with quality metric f(x, y) and performance ratio (minimization) OPT(x) ≤ f(x, y) ≤ η · OPT(x) for all x. Does this carry over to big data?
A quest for revising approximation algorithms for querying big data:
◦ Approximation: needed even for low PTIME problems, not just NPO
◦ Quality metric: the answer to a query is typically a set, not a number
◦ Approach: it does not help much if T(x) conducts its computation on the “big” data x directly!

17 Data-driven: resource-bounded query answering
Input: a class Q of queries, a resource ratio α ∈ [0, 1), and a performance ratio β ∈ (0, 1]
Question: Develop an algorithm that, given any query Q ∈ Q and dataset D,
◦ accesses a fraction D_Q of D such that |D_Q| ≤ α|D|,
◦ computes Q(D_Q) as approximate answers to Q(D), and
◦ guarantees accuracy(Q, D, α) ≥ β
Accessing at most α|D| amount of data in the entire process
[Figure: dynamic reduction from D to D_Q, then approximation of Q(D) by Q(D_Q)]

18 Resource-bounded query answering
Resource bounded: the resource ratio α ∈ [0, 1) is decided by our available resources: time, space, …
Dynamic reduction: given Q and D, find D_Q for this particular Q
◦ contrast this with synopses (histograms, wavelets, sketches, sampling, …), which find one reduced dataset for all Q
◦ dynamic reduction achieves a better reduction ratio
In combination with other tricks for making big data small: access schema, distributed processing, views, …

19 Accuracy metrics
Performance ratio for approximate query answering: the F-measure of precision and recall, to cope with the set semantics of query answers
◦ precision(Q, D, α) = |Q(D_Q) ∩ Q(D)| / |Q(D_Q)|
◦ recall(Q, D, α) = |Q(D_Q) ∩ Q(D)| / |Q(D)|
◦ accuracy(Q, D, α) = 2 · precision(Q, D, α) · recall(Q, D, α) / (precision(Q, D, α) + recall(Q, D, α))
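A minimal Python sketch of this accuracy metric on answer sets; how empty answer sets and a zero precision-plus-recall are scored is an assumption, since the slide does not specify it.

```python
def accuracy(approx_answers, exact_answers):
    """F-measure accuracy of Q(D_Q) (approximate) against Q(D) (exact)."""
    approx, exact = set(approx_answers), set(exact_answers)
    if not approx or not exact:
        # Convention (assumption): perfect only if both answer sets are empty.
        return 1.0 if approx == exact else 0.0
    overlap = len(approx & exact)
    precision = overlap / len(approx)
    recall = overlap / len(exact)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```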

20 Personalized social search
Graph Search, Facebook:
◦ Find me all my friends who live in Edinburgh and like cycling
◦ Find me restaurants in London my friends have been to
◦ Find me photos of my friends in New York
We can do personalized social search with α = 0.0015%!
1.5 × 10^-5 × 1 PB (10^15 B) = 1.5 × 10^10 B = 15 GB
We are making big graphs of PB size as small as 15 GB -- they fit into memory!
Localized patterns with 100% accuracy! Add to this access schema, distributed processing, views, …

21 Localized queries
Localized queries can be answered locally
◦ Graph pattern queries: revised simulation queries
◦ the matching relation is over the d_Q-neighborhood of a personalized node
Example -- Michael: “find cycling fans who know both my friends in the cycling club and my friends in hiking groups”
[Figure: pattern with personalized node Michael, hiking group, cycling club member and ?cycling lovers, matched against Michael (the unique match), hiking groups hg_1 .. hg_m, cycling club members cc_1 .. cc_3 and cycling fans cl_1 .. cl_n]
Applications: personalized social search, ego network analysis, …

22 Resource-bounded simulation
Three phases: preprocessing (local auxiliary information about G, e.g. degree and |neighbor|), dynamic reduction (compute a reduced subgraph), and approximate query evaluation over the reduced subgraph
For a candidate pair (u, v) of a query node u and a data node v, the dynamically updated auxiliary information includes:
◦ a Boolean guarded condition: label matching
◦ a cost function c(u, v): if v is included, the number of additional nodes that also need to be included -- the budget it consumes
◦ a potential function p(u, v): the estimated probability that v matches u (based on the number of nodes in the neighborhood of v that are candidate matches)
◦ a bound b, determining an upper bound on the number of nodes to be visited
Query-guided search -- bounded by the budget
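A schematic sketch of “query-guided search bounded by the budget”: expand the most promising, cheapest candidates first and stop once the visit budget is exhausted. The scoring rule `potential / (1 + cost)` and all function names are illustrative assumptions, not the algorithm of the SIGMOD 2014 paper; simulation would then be evaluated over the returned reduced subgraph.

```python
import heapq
import itertools

def resource_bounded_search(seeds, neighbors, potential, cost, bound):
    """Budget-bounded, query-guided expansion of a reduced subgraph.

    seeds:     starting data nodes (e.g. candidate matches of the personalized node)
    neighbors: node -> iterable of adjacent nodes
    potential: node -> estimated chance the node contributes to a match
    cost:      node -> estimated number of extra nodes its inclusion forces
    bound:     maximum number of nodes allowed in the reduced subgraph
    Returns the visited node set, i.e. the reduced subgraph to evaluate the query on.
    """
    def score(v):
        # Prefer promising (high potential) and cheap (low cost) nodes first.
        return -potential(v) / (1 + cost(v))

    tie = itertools.count()  # tie-breaker so the heap never compares node objects
    heap = [(score(v), next(tie), v) for v in seeds]
    heapq.heapify(heap)
    visited = set()
    while heap and len(visited) < bound:
        _, _, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        for w in neighbors(v):
            if w not in visited:
                heapq.heappush(heap, (score(w), next(tie), w))
    return visited
```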

23 Resource-bounded simulation
Dynamic data reduction and query-guided search on Michael’s query, with bound = 16 and 16 nodes visited
[Figure: guarded conditions, costs and potentials of candidate nodes during the search, e.g. TRUE, Cost = 1, Potential = 3, Bound = 2 for a cycling-club candidate; FALSE for pruned candidates]
Match relation found: (Michael, Michael), (hiking group, hg_m), (cycling club, cc_1), (cycling club, cc_3), (cycling lover, cl_n-1), (cycling lover, cl_n)

24 Accuracy
[Figure: accuracy on the Yahoo graph, varying α (in units of 10^-5)]
89%-100% accuracy for simulation queries; both achieve 100% accuracy when α > 0.0015%
α = 10^-5 of 1 PB (10^15 B) corresponds to 10^10 B = 10 GB

25 Efficiency of resource-bounded simulation
[Figure: evaluation time on the Yahoo graph, varying α (in units of 10^-5)]
α = 10^-5 of 1 PB (10^15 B) corresponds to 10^10 B = 10 GB

26 Non-localized queries
Reachability:
◦ Input: a directed graph G, and a pair of nodes s and t in G
◦ Question: Does there exist a path from s to t in G?
Non-localized: t may be far from s
Example: is Michael connected to Eric via social links?
[Figure: a long chain of social links from Michael through cc_1, cl_3, cl_7, …, cl_16 to Eric]
Does dynamic reduction work for non-localized queries?
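For reference, the exact baseline that resource-bounded reachability tries to avoid is a plain BFS over the data graph, which may visit O(|G|) nodes. A minimal sketch, with adjacency given as a dict:

```python
from collections import deque

def reachable(succ, s, t):
    """Exact reachability: is there a directed path from s to t?

    succ: node -> iterable of successors. May traverse the whole graph,
    which is exactly what resource-bounded reachability avoids on big graphs.
    """
    if s == t:
        return True
    seen, queue = {s}, deque([s])
    while queue:
        for w in succ.get(queue.popleft(), ()):
            if w == t:
                return True
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```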

27 Resource-bounded reachability
Dynamic reduction: from the big graph G to a small tree index G_Q with |G_Q| ≤ α|G|
Approximate reachability answers are computed over G_Q in O(α|G|) time instead of O(|G|) over G; experimentally verified to have no false positives
Yes, dynamic reduction works for non-localized queries

28 Preprocessing: landmarks
Phases: preprocessing, dynamic reduction (compute a landmark index), approximate query evaluation over the landmark index
Landmarks:
◦ a landmark node covers a certain number of node pairs
◦ reachability of the pairs it covers can be computed from landmark labels, e.g. cc_1 is labelled “I can reach cl_3” and cl_n-1 is labelled “cl_3 can reach me”
A revision of 2-hop covers: search the landmark index (of size ≤ α|G|) instead of G
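A toy sketch of the landmark-label idea in the 2-hop-cover style: each node records which landmarks it can reach and which landmarks can reach it, and a query succeeds only if some landmark certifies the path. The indexing here is naively one BFS per landmark in each direction, and the hierarchical index and guided search of the following slides are not modeled; pairs not covered by any landmark may be missed, but no false positives are reported, consistent with the slides.

```python
from collections import deque

def _reach_set(adj, start):
    """All nodes reachable from start over the adjacency map adj (start included)."""
    seen, queue = {start}, deque([start])
    while queue:
        for w in adj.get(queue.popleft(), ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def build_landmark_labels(succ, pred, landmarks):
    """Per-node labels: l_out[v] = landmarks v can reach ("I can reach ..."),
    l_in[v] = landmarks that can reach v ("... can reach me")."""
    l_out, l_in = {}, {}
    for lm in landmarks:
        for v in _reach_set(pred, lm):      # v can reach the landmark lm
            l_out.setdefault(v, set()).add(lm)
        for v in _reach_set(succ, lm):      # the landmark lm can reach v
            l_in.setdefault(v, set()).add(lm)
    return l_out, l_in

def covered_reachable(l_out, l_in, s, t):
    """True if some landmark w certifies s -> w -> t: no false positives,
    but pairs not covered by any landmark may be missed."""
    return bool(l_out.get(s, set()) & l_in.get(t, set()))
```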

29 Hierarchical landmark index
Landmark index:
◦ landmark nodes are selected to encode pairwise reachability
◦ hierarchical indexing: apply multiple rounds of landmark selection to construct a tree of landmarks
[Figure: a landmark tree over cc_1, cl_7, cl_n-1, cl_16, cl_3, cl_5, cl_6, cl_4, cl_9, … above the data graph from Michael to Eric]
A node v can reach v’ if there exist v_1, v_2, v_3 in the index such that v reaches v_1, v_2 reaches v’, and v_1 and v_2 are connected to v_3 (to and from, respectively) at the same level (via the coding)

30 Hierarchical landmark index: guided search
Guided search on the landmark index uses, for each index node v:
◦ a Boolean guarded condition on (v, v_p, v’): whether v can possibly reach v’ via v_p
◦ a cost function c(v): the number of unvisited landmarks in the subtree rooted at v
◦ a potential function P(v): the total cover size of the unvisited landmarks among the children of v
Each index node carries its cover size, landmark labels/encoding, and topological rank/range

31 Resource-bounded reachability: drill down and roll up
Bi-directed guided traversal over the landmark index from both Michael and Eric: “drill down” into subtrees whose guarded condition may still hold, “roll up” out of those whose condition is FALSE, guided by the local auxiliary information (condition, cost, potential)
[Figure: traversal over the landmark tree with annotations such as Condition = FALSE; Condition = ?, Cost = 9, Potential = 46; Condition = ?, Cost = 2, Potential = 9; Condition = TRUE]

32 Accuracy
[Figure: accuracy on the Yahoo graph, varying α (in units of 10^-4)]
Achieves 100% accuracy when α > 0.05%
α = 10^-4 of 1 PB (10^15 B) corresponds to 10^11 B = 100 GB

33 Efficiency: resource-bounded reachability
[Figure: evaluation time on the Yahoo graph, varying α (in units of 10^-4)]
RBreach is 62.5 times faster than BFS and 5.7 times faster than BFS-OPT
α = 10^-4 of 1 PB (10^15 B) corresponds to 10^11 B = 100 GB

34 Graph systems

35 Systems based on vertex-centric models (“think like a vertex”)
Giraph (Apache)
– Model: BSP (Pregel, Google)
– Runs on top of Hadoop (MapReduce)
– Scalability: Facebook used Giraph, with some performance improvements, to analyze one trillion (10^12) edges using 200 machines in 4 minutes
– Download: http://www.apache.org/dyn/closer.cgi/giraph/
GraphLab (CMU, Apache licence)
– Initially for machine learning, now beyond it
– Libraries of algorithms: https://dato.com/products/create/open_source.html
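To make “think like a vertex” concrete, here is a minimal single-machine simulation of a Pregel/BSP vertex program (single-source shortest paths), in Python. It only illustrates the programming model; it is not Giraph’s or GraphLab’s actual API.

```python
def pregel_sssp(vertices, out_edges, source):
    """Single-machine simulation of a Pregel/BSP vertex program:
    single-source shortest paths.

    vertices:  iterable of vertex ids
    out_edges: vertex -> list of (neighbor, edge_weight) pairs
    Messages sent in one superstep are delivered in the next; the run stops
    when no messages remain (every vertex has voted to halt).
    """
    INF = float("inf")
    value = {v: INF for v in vertices}   # per-vertex state
    inbox = {source: [0.0]}              # initial message to the source vertex
    while inbox:                         # one loop iteration = one superstep
        outbox = {}
        for v, msgs in inbox.items():    # compute() runs only on vertices with messages
            best = min(msgs)
            if best < value[v]:          # improved distance: update and notify neighbors
                value[v] = best
                for w, weight in out_edges.get(v, ()):
                    outbox.setdefault(w, []).append(best + weight)
        inbox = outbox                   # synchronization barrier: exchange messages
    return value
```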

36 Graph databases (database style)
Neo4j (Neo, GNU public license)
– A graph database: data is stored as nodes, edges and attributes
– Optimized for local traversal, with indexing
– Download: http://neo4j.com/
GraphX: a Spark API for graphs (AMPLab, UC Berkeley)
– Based on Spark (in-memory primitives, as opposed to Hadoop’s two-stage disk-based MapReduce; claimed 100 times faster; supports SQL)
– Combines graph-parallel (Pregel, GraphLab) and data-parallel (Spark, tables) computation
– Download: http://spark.apache.org/graphx/

37 Summing up

38 Approximate query answering: summing up
Challenges to getting real-time answers: big data and costly queries, limited resources
Two approaches:
◦ Query-driven approximation: cheaper queries that retain sensible answers
◦ Data-driven approximation: dynamic data reduction and query-guided search, reducing data of PB size to GB
Beyond graphs: relational queries (SPC: 7.9 × 10^-4, SQL: 1.9 × 10^-3)
Yes, we can query big data within bounded resources -- combined with techniques for making big data small!

39 Summary and review
◦ What is query-driven approximation? When can we use the approach?
◦ The traditional approximation scheme does not work very well for query answering in big data. Why?
◦ What is data-driven dynamic approximation? Does it work on localized queries? Non-localized queries?
◦ What is query-guided search?

40 Project (1) -- a research and development project
Recall graph pattern matching via subgraph isomorphism (Lecture 3), referred to as subgraph queries in the sequel.
◦ Revise subgraph queries to include a designated output node, as the query focus.
◦ Develop a ranking (relevance) function to rank matches of the output node in a subgraph query. Justify the design of your ranking function.
◦ Develop an algorithm that, given a graph G, a subgraph query Q with a designated output node v_o and a positive integer k, computes the top-k matches of v_o based on your ranking function.
◦ Give correctness and complexity analyses of your algorithms.
◦ Experimentally evaluate your algorithms, especially their scalability with the size of graphs.

41 Project (2) -- a development project
Revise graph pattern matching via subgraph isomorphism (Lecture 3) by including a designated personalized node, referred to as personalized subgraph queries in the sequel.
◦ Develop a resource-bounded algorithm for evaluating personalized subgraph queries.
◦ Give correctness and complexity analyses of your algorithm.
◦ Implement your algorithm.
◦ Experimentally evaluate your algorithm, including its accuracy and scalability.

42 Project (3) -- a research and development project
Recall keyword search with Steiner trees (Lecture 2). You may add practical restrictions to keyword search, e.g., require a designated keyword to match a small set of nodes in a graph.
◦ Develop a resource-bounded algorithm for keyword search.
◦ Give correctness and complexity analyses of your algorithm.
◦ Implement your algorithm.
◦ Experimentally evaluate your algorithm, including its accuracy and scalability.

43 Papers for you to review
◦ G. Gou and R. Chirkova. Efficient algorithms for exact ranked twig-pattern matching over graphs. In SIGMOD, 2008. http://dl.acm.org/citation.cfm?id=1376676
◦ H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB, 2008. http://www.vldb.org/pvldb/1/1453899.pdf
◦ R. T. Stern, R. Puzis, and A. Felner. Potential search: a bounded-cost search algorithm. In ICAPS, 2011. (search Google Scholar)
◦ S. Zilberstein, F. Charpillet, P. Chassaing, et al. Real-time problem solving with contract algorithms. In IJCAI, 1999. (search Google Scholar)
◦ W. Fan, X. Wang, and Y. Wu. Diversified top-k graph pattern matching. VLDB 2014. (query-driven approximation)
◦ W. Fan, X. Wang, and Y. Wu. Querying big graphs within bounded resources. SIGMOD 2014. (data-driven approximation)

