
1 QSX: Querying Social Graphs Querying Big Graphs Parallel scalability Making big graphs small –Bounded evaluability –Query-preserving graph compression.


1 QSX: Querying Social Graphs. Querying Big Graphs: parallel scalability; making big graphs small (bounded evaluability, query-preserving graph compression).

2 The impact of the sheer volume of big data. Using an SSD of 6 GB/s, a linear scan of a data set D would take 1.9 days when D is of 1 PB (10^15 B), and 5.28 years when D is of 1 EB (10^18 B). Is it feasible to query real-life big graphs? A departure from classical computational complexity theory. Traditional computational complexity theory of almost 50 years: the good: polynomial-time computable (PTIME); the bad: NP-hard (intractable); the ugly: PSPACE-hard, EXPTIME-hard, undecidable…

3 Parallel query answering. How to cope with the sheer volume of big graphs? We can do better provided more resources, e.g., 10,000 processors. Using 10,000 SSDs of 6 GB/s each, a linear scan of D might take 1.9 days/10,000 ≈ 16 seconds when D is of 1 PB (10^15 B), and 5.28 years/10,000 ≈ 4.63 hours when D is of 1 EB (10^18 B). Only ideally; why? [Figure: processors (P), memory (M) and a database (DB) connected by an interconnection network.] Do parallel algorithms always work? If not, is it still feasible to query big graphs?
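The scan-time figures on this slide follow from dividing the data size by the aggregate bandwidth. A minimal Python sketch, assuming that 6 GB/s means 6 × 10^9 bytes per second, that a year has 365 days, and that the data is split perfectly evenly with no coordination cost (the name `scan_seconds` is made up for illustration):

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_seconds(size_bytes: float,
                 bandwidth_bytes_per_s: float = 6e9,
                 devices: int = 1) -> float:
    """Ideal time to scan `size_bytes` spread evenly over `devices` drives."""
    return size_bytes / (bandwidth_bytes_per_s * devices)


PB, EB = 1e15, 1e18

print("1 PB, 1 drive      :", scan_seconds(PB) / SECONDS_PER_DAY, "days")
print("1 EB, 1 drive      :", scan_seconds(EB) / SECONDS_PER_YEAR, "years")
print("1 PB, 10,000 drives:", scan_seconds(PB, devices=10_000), "seconds")
print("1 EB, 10,000 drives:", scan_seconds(EB, devices=10_000) / SECONDS_PER_HOUR, "hours")
```

Under these idealized assumptions, 1 PB takes about 1.9 days on one drive and roughly 16 to 17 seconds on 10,000 drives; 1 EB takes about 5.28 years on one drive and about 4.6 hours on 10,000 drives.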

4 Parallel scalability

5 A distributed algorithm is useful if it is parallel scalable. Input: G = (G1, …, Gn), a partition of G distributed to (S1, …, Sn), and a query Q. Output: Q(G), the answer to Q in G. Complexity: t(|G|, |Q|) is the time taken by a sequential algorithm with a single processor; T(|G|, |Q|, n) is the time taken by a parallel algorithm with n processors. Parallel scalable: if T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) + O((n + |Q|)^k), including the cost of data shipment, where k is a constant. When G is big, we can still query G by adding more processors, if we can afford them.

6 Degree of parallelism -- speedup. Speedup: for a given task, TS/TL, where TS is the time taken by a traditional DBMS and TL is the time taken by a parallel system with more resources. TS/TL: more resources mean proportionally less time for a task. Linear speedup: the speedup is N when the parallel system has N times the resources of the traditional system. [Figure: speed (throughput, response time) plotted against resources, showing linear speedup.] Question: can we do better than linear speedup?

7 Better than linear speedup? NO, it is even hard to achieve linear speedup/scaleup! Give 4 reasons. Startup costs: initializing each process. Interference: competing for shared resources (network, disk, memory or even locks). Skew: it is difficult to divide a task into exactly equal-sized parts; the response time is determined by the largest part. Data shipment cost: in a shared-nothing architecture. Think of blocking in MapReduce. Linear speedup is the best we can hope for -- optimal! A closer look: Ullman's algorithm for subgraph isomorphism uses the adjacency matrix for the entire G. What if we break G into n fragments and leverage the data locality of subgraph isomorphism? Worst case: exponential in |G| and |Q| vs. exponential in |G|/n and |Q|! Contradiction? No: this compares the worst-case complexity of a particular algorithm with the time really needed by a sequential algorithm.
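The effect of skew and startup cost on speedup can be illustrated with a toy model. This is only a sketch under assumed simplifications: each fragment costs time proportional to its size, all processors start together, and interference and data shipment are ignored. The names `speedup`, `part_sizes` and `startup_cost` are hypothetical.

```python
def speedup(part_sizes, startup_cost=0.0):
    """Sequential time divided by parallel time in a toy cost model.

    The sequential run processes every part in turn; the parallel run
    finishes only when its largest part is done (skew), plus a fixed
    startup cost.
    """
    sequential = sum(part_sizes)
    parallel = startup_cost + max(part_sizes)
    return sequential / parallel


print(speedup([25, 25, 25, 25]))                  # even split over 4 processors
print(speedup([40, 20, 20, 20]))                  # skewed split over 4 processors
print(speedup([25, 25, 25, 25], startup_cost=5))  # even split with startup cost
```

With an even split and no startup cost the model gives a speedup equal to the number of processors; any skew or startup cost pushes it below that, which is why linear speedup is the upper limit in this model.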

8 Linear scalability: querying big data by adding more processors. An algorithm T for answering a class Q of queries. Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q. Output: Q(G), the answer to Q in G. Algorithm T is linearly scalable in computation if its parallel complexity is a function of |Q| and |G|/n, and in data shipment if the total amount of data shipped is a function of |Q| and n. The more processors, the less response time; independent of the size |G| of big G. Is it always possible?

9 Graph pattern matching via graph simulation. Input: a graph pattern Q and a graph G. Output: Q(G), a binary relation S on the nodes of Q and G such that each node u in Q is mapped to a node v in G with (u, v) ∈ S, and for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G such that (u', v') ∈ S. Computable in O((|V| + |V_Q|)(|E| + |E_Q|)) time. Parallel scalable?
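The definition of graph simulation can be turned into a simple fixpoint computation. The sketch below is a naive illustration, not the algorithm behind the stated time bound: it assumes nodes carry labels that must agree between pattern and graph (labels are not spelled out on the slide), and the function name and the input encoding (a dict from node to label, a set of edge pairs) are chosen here for illustration.

```python
def graph_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Return the maximum simulation relation as a set of (u, v) pairs.

    q_nodes / g_nodes: dict mapping node -> label.
    q_edges / g_edges: iterable of directed (source, target) pairs.
    """
    g_succ = {v: set() for v in g_nodes}
    for v, w in g_edges:
        g_succ[v].add(w)
    q_succ = {u: set() for u in q_nodes}
    for u, u2 in q_edges:
        q_succ[u].add(u2)

    # Start from all label-compatible candidates, then prune.
    sim = {u: {v for v in g_nodes if g_nodes[v] == q_nodes[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u in q_nodes:
            for v in list(sim[u]):
                # Every pattern edge (u, u2) needs some graph edge (v, v2)
                # whose target v2 is still a candidate for u2.
                if any(not (g_succ[v] & sim[u2]) for u2 in q_succ[u]):
                    sim[u].discard(v)
                    changed = True

    # Each pattern node must be matched; otherwise there is no match at all.
    if any(not sim[u] for u in q_nodes):
        return set()
    return {(u, v) for u in q_nodes for v in sim[u]}


pattern_nodes = {"A": "A", "B": "B"}
pattern_edges = {("A", "B")}
graph_nodes = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
graph_edges = {("a1", "b1")}
print(sorted(graph_simulation(pattern_nodes, pattern_edges, graph_nodes, graph_edges)))
```

In this example `a2` is pruned because it has no edge to a node labelled B, while both `b1` and `b2` match the pattern node B, which has no outgoing pattern edges to satisfy.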

10 Impossibility. It is nontrivial to develop parallel scalable algorithms. There exists NO algorithm for distributed graph simulation that is parallel scalable in either computation or data shipment. Why? Pattern: 2 nodes; graph: 2n nodes, distributed to n processors. Possibility: when G is a tree, parallel scalable in both response time and data shipment. What can we do if parallel scalability is beyond reach?

11 Making big graphs small

12 The cost of query answering. Input: a query Q and a graph G. Question: the answer Q(G) to Q in G; too costly when G is big. The cost of computing Q(G): a function f(|G|, |Q|). What should we do? Find a lower function for f, i.e., develop a faster algorithm? Reduce the size of Q? Reduce the size of G: reduce the cost of computing Q(G) by making G small! [Figure: Q is evaluated on a small G_Q instead of G.]

13 Making big graphs small. Input: a class 𝒬 of queries. Question: can we effectively find, given queries Q ∈ 𝒬 and any (possibly big) graph G, a small G_Q such that Q(G) = Q(G_Q)? How to make G small? Particularly useful for a single dataset G, e.g., the social graph of Facebook. Minimum G_Q: the necessary amount of data for answering Q. [Figure: Q(G) = Q(G_Q), with G_Q much smaller than G.]

14 The essence of parallel query answering. Given a big graph G and n processors S1, …, Sn: G is partitioned into fragments (G1, …, Gn), dividing a big G into small fragments Gi of manageable size; G is distributed to the n processors, with Gi stored at Si; each processor Si processes its local fragment Gi in parallel. Parallel query answering. Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q. Output: Q(G), the answer to Q in G. [Figure: Q is evaluated on fragments G1, G2, …, Gn, each of size |G|/n, much smaller than G.] What can we do if parallel scalability is beyond reach for our queries?

15 How to make big graphs small. Input: a class 𝒬 of queries. Question: can we effectively find, given queries Q ∈ 𝒬 and any (possibly big) graph G, a small G_Q such that Q(G) = Q(G_Q)? Effective methods for making big graphs small: a number of methods. We have seen one of the methods: parallel query answering. Other methods: in the next two lectures. The methods: distributed query processing; boundedly evaluable graph queries; query preserving graph compression; query answering using views; bounded incremental evaluation; … [Figure: Q(G) = Q(G_Q), with G_Q much smaller than G.]

16 Making big graphs small

17 What do we need? Input: a class 𝒬 of queries. Question: can we effectively find, given queries Q ∈ 𝒬 and any (possibly big) graph G, a small G_Q such that Q(G) = Q(G_Q)? How to characterize this? How to find G_Q? The time taken to find G_Q should be independent of |G|. Not very likely in the absence of auxiliary information. Why? [Figure: Q(G) = Q(G_Q), with G_Q much smaller than G.]

18 Boundedly evaluable queries. Input: a class 𝒬 of queries and an access schema A. Question: can we find, by using A, for any query Q ∈ 𝒬 and any (possibly big) graph G, a fraction G_Q of G such that |G_Q| is independent of |G|, Q(G) = Q(G_Q), and moreover, G_Q can be identified in time determined by Q and A? A closer look: G_Q does not get bigger when G grows, so Q(G_Q) can be efficiently computed; the time taken to find G_Q does not increase when G grows, so we can effectively find it. Is this possible in practice?

19 Example: subgraph isomorphism. Find pairs of leading actors and actresses who are from the same country and starred in an award-winning movie released in 2011-2014. Find all matches of the pattern in the graph. A movie database represented as a graph, for movies from 1880 to 2014. Nodes: movies, casts (actors, actresses), awards, etc. Edges: relationships between the nodes. 5.1 million nodes and 19.5 million edges. [Figure: pattern with nodes award, year (2011-2014), movie, actor, actress and country.]

20 Example: access constraints. Real-life limits that hold on the entire graph, regardless of the queries posed on it. C1: an award is presented to no more than 4 movies each year. C2: each movie has at most 30 leading actors and actresses. C3: each person has only one country of origin. C4-6: there are no more than 134 years (2014 − 1880), 24 major awards, and 196 countries in the graph. Build indices accordingly. [Figure: pattern with nodes award, year (2011-2014), movie, actor, actress and country.]

21 Example: a query plan. Visit at most 17922 nodes and 35136 edges, by using the indices, as opposed to 5.1 million nodes and 19.5 million edges. 1. Fetch a set V1 of 134 year nodes, 24 awards and 196 countries. 2. Fetch a set V2 of at most 24 * 3 * 4 = 288 award-winning movies released in 2011-2014, with at most 288 * 2 associated edges, by using award and year nodes in V1. 3. Fetch a set V3 of at most (30 + 30) * 288 = 17280 actors and actresses with 17280 edges, using nodes in V2. 4. Connect the actors and actresses in V3 to country nodes in V1, with at most 17280 edges. The result is G_Q. [Figure: pattern with nodes award, year (2011-2014), movie, actor, actress and country.]
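The totals quoted for this plan can be re-derived from the bounds in the access constraints. The sketch below simply redoes the slide's arithmetic; the factor 3 is taken as-is from the slide's product 24 * 3 * 4, the country count uses the 196 of constraint C4-6 (which is the value that reproduces the total of 17922 nodes), and all variable names are made up for illustration.

```python
def plan_bounds():
    """Upper bounds on the nodes and edges fetched by the example plan."""
    years, awards, countries = 134, 24, 196   # C4-6
    movies_per_award_and_year = 4             # C1
    leads_per_movie = 30                      # C2, counted once for actors, once for actresses
    year_factor = 3                           # factor used in the slide's 24 * 3 * 4

    v1 = years + awards + countries                       # step 1
    v2 = awards * year_factor * movies_per_award_and_year # step 2
    v3 = (leads_per_movie + leads_per_movie) * v2         # step 3

    nodes = v1 + v2 + v3
    edges = (2 * v2      # step 2: edges from each movie to its award and year
             + v3        # step 3: edges from movies to actors and actresses
             + v3)       # step 4: edges from actors and actresses to countries
    return nodes, edges


print(plan_bounds())
```

None of these quantities depends on the size of the graph, which is the point of the example: the bounds come only from the query and the access constraints.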

22 Access constraints. An access constraint has the form S → (l, N), where S is a set of node labels, l is another label, and N is a natural number (a cardinality). It combines a cardinality constraint and an index. Semantics: G satisfies S → (l, N) if for any set Vs of nodes in G labelled with the distinct labels in S, there exist at most N common neighbours of Vs with label l (nodes connected by an edge to each node in Vs), and there is an index on S for l: for each such set Vs, all common neighbours labelled l can be found in O(N) time. Access schema: a set of access constraints.
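The cardinality part of an access constraint can be checked by brute force on a small graph. The sketch below is only an illustration: it treats edges as undirected when looking for common neighbours (an assumption, since direction is not specified here), it ignores the index part of the constraint, and the function name `satisfies` and the input encoding are hypothetical.

```python
from itertools import product


def satisfies(labels, edges, S, l, N):
    """Check the cardinality part of the access constraint S -> (l, N).

    labels: dict mapping node -> label.
    edges:  iterable of (u, v) pairs, treated as undirected.
    S:      tuple of distinct labels (empty for the special case with no source labels).
    """
    nbrs = {v: set() for v in labels}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    targets = {v for v, lab in labels.items() if lab == l}

    if not S:
        # Empty S: the graph has at most N nodes labelled l.
        return len(targets) <= N

    by_label = [[v for v, lab in labels.items() if lab == s] for s in S]
    for vs in product(*by_label):          # one node per label in S
        common = set.intersection(*(nbrs[v] for v in vs)) & targets
        if len(common) > N:
            return False
    return True


labels = {"y2011": "year", "oscar": "award", "m1": "movie", "m2": "movie"}
edges = [("y2011", "m1"), ("oscar", "m1"), ("y2011", "m2"), ("oscar", "m2")]
print(satisfies(labels, edges, ("year", "award"), "movie", 4))
print(satisfies(labels, edges, ("year", "award"), "movie", 1))
```

In practice the check would not enumerate node combinations like this; the index that accompanies each constraint is what makes retrieving the common neighbours cheap.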

23 Example: access constraints. Useful special cases: ∅ → (l, N) and l → (l', N). C1: an award is presented to no more than 4 movies each year: (year, award) → (movie, 4). C2: each movie has at most 30 leading actors and actresses: movie → (actor/actress, 30). C3: each person has only one country of origin: actor/actress → (country, 1). C4-6: there are no more than 134 years (2014 − 1880), 24 major awards, and 196 countries in the graph: ∅ → (year, 134), ∅ → (award, 24), ∅ → (country, 196). Build indices accordingly.

24 Discovering access schema S → (l, N). Functional dependencies X → Y, e.g., movie → (year, 1): shredding graphs to relations and using, e.g., TANE. Degree bound: l → (l', N) if a node with label l has degree at most N, for any label l'. ∅ → (l, N): very common, e.g., ∅ → (country, 196). Aggregate queries: grouping by (year, award), we find (year, award) → (movie, 4). Real-life bounds: 5000 friends per person (Facebook). … How to maintain constraints in response to changes to graphs? Local changes: only to common neighbours.

25 Generating query plans. A query plan P for a query Q is a sequence of fetch operations that construct G_Q; then we compute Q(G_Q). Fetch operation fetch(u, Vs, C, q(u)): given a set Vs of nodes fetched earlier, fetch all common neighbours of Vs labelled l as candidates for u, by using access constraint C, such that the nodes satisfy the condition q(u) of u, e.g., year in [2011, 2014]. Efficient by using the indices. [Figure: pattern with nodes award, year (2011-2014), movie, actor, actress and country.]

26 Generating query plans: example. 1. Fetch a set V1 of 134 year nodes, 24 awards and 196 countries. 2. Fetch a set V2 of at most 24 * 3 * 4 = 288 award-winning movies released in 2011-2014, with at most 288 * 2 associated edges, by using award and year nodes in V1. 3. Fetch a set V3 of at most (30 + 30) * 288 = 17280 actors and actresses with 17280 edges, using nodes in V2. 4. Connect the actors and actresses in V3 to country nodes in V1, with at most 17280 edges. The result is G_Q, independent of |G| no matter how big G grows! Boundedly evaluable: Q is boundedly evaluable if there exists a query plan under an access schema A such that for all graphs G that satisfy A, its fetch operations find G_Q with Q(G_Q) = Q(G), and the time for all fetch operations is determined by Q and A only, independent of |G|.

27 An approach to querying big graphs. Given a query Q and an access schema A: 1. Decide whether Q is boundedly evaluable under A. 2. If so, generate a bounded query plan P for Q. Independent of the size |G|? 3. Given any graph G, use the query plan P to (a) fetch G_Q and (b) compute Q(G_Q). Questions: what is the complexity of deciding bounded evaluability, and of generating a boundedly evaluable query plan? Are we done yet?

28 Deciding bounded evaluability. Input: a query Q = (V_Q, E_Q), small in real life, and an access schema A. Question: is Q boundedly evaluable under A? Setting: graph pattern matching via subgraph isomorphism. Positive: in O(|A| |V_Q| |E_Q|) time, independent of any graph G. Characterization: Q is boundedly evaluable under A iff VCov(Q, A) = V_Q and ECov(Q, A) = E_Q. VCov(Q, A): the nodes covered by A, computed by ∅ → (l, N) first and inductively by other constraints in A. ECov(Q, A): the edges (u1, u2) covered by A, where one of the endpoints is in VCov and the other has a bounded number of candidates by A. Deciding bounded evaluability: independent of |G|.

29 Generating boundedly evaluable query plans. Input: a boundedly evaluable query Q = (V_Q, E_Q) and an access schema A. Output: a boundedly evaluable query plan P for Q under A. Setting: graph pattern matching via subgraph isomorphism. Always possible? Yes, since Q is decided boundedly evaluable under A. Positive: in O(|A| |E_Q| + |A| |V_Q|^2) time, independent of any graph G. Method: inductively identify covered nodes and edges, and in each step, generate a corresponding fetch operation. Query plan generation: independent of |G|.

30 Instance-bounded in a graph G 1.Decide whether Q is effectively bounded under A 2.If so, generate a bounded query plan P for Q For any finite set Q of pattern queries, access schema A and a graph G satisfying A, there exists M such that all queries in Q are M-bounded in G under A 30 Can we do anything if Q is not boundedly evaluable under A? Extending A by to A M adding constraints of the form   (l, M), l  (l’, M) such that G satisfies A M Query Q is M-bounded in G if there is G Q of G such that Q(G) = Q(G Q ), and G Q can be found in time determined by Q and A M M: may depend on |G| M  L Q (L Q + 1)/2, L Q : the number of labels in G Instance-bounded: on an individual graph, e.g., Facebook

31 Effectiveness of bounded evaluability. How effective is this approach? Bounded evaluability is effective for graph pattern queries: 60% of subgraph queries and 33% of simulation queries are boundedly evaluable under a small access schema. Improvement: 4 orders of magnitude for subgraph queries, and 3 orders of magnitude for simulation queries; 28587 times faster. A small M of 0.016% of |G| makes all queries M-bounded. Graph pattern matching via subgraph isomorphism has data locality. Does the same approach work on graph simulation, without data locality? All the results remain intact on graph pattern matching via simulation, with revised node and edge covers.

32 Query-preserving graph compression

33 Dynamic reduction vs. uniform reduction. Bounded evaluability: dynamic reduction on dataset D. Given a query Q, identify and fetch a minimum subset D_Q of D such that it has sufficient information for answering Q in D. What is the benefit? Is there any effective uniform reduction to query big data? Uniform reduction on dataset D: identify and fetch a minimum D_C such that for all queries Q posed on D, D_C has sufficient information to find answers to Q in D. What is the benefit? Questions: D_Q is typically smaller than D_C. Why? D_C is computed once offline and then we don't have to worry about it; is this claim true?

34 Graph compression. The cost of query processing: f(|G|, |Q|). It is unlikely that we can lower its complexity, but can we reduce the size of its parameter |G|? Compress big G into a smaller G_C. Compression: for a graph G, G_C = R(G) (compressing); for any Q, Q(G) = P(Q(G_C)) (post-processing). Lossless: restore G from G_C; G_C is not much smaller than G. Query friendly compression: decompression of G_C back to G. [Figure: G is compressed by R into G_C; Q is evaluated on G_C, and P maps Q(G_C) to Q(G).]

35 Query preserving graph compression. Query preserving compression for a class Q of queries: for any graph G, Gc = R(G) (compressing); for any Q in Q, Q(G) = P(Q(Gc)) (post-processing). Compress G w.r.t. a particular query class Q. [Figure: G is compressed by R into Gc; Q is evaluated on Gc, and P maps Q(Gc) to Q(G).]

36 What is new about query preserving compression? Query preserving compression for a class L of queries: for any graph G, Gc = R(G); for any Q in L, Q(G) = P(Q(Gc)). In contrast to lossless compression, there is no need to restore the original graph G. Relative to a class L of queries of users' choice: compress G relative to your queries. Better compression ratio: only information about L queries is kept. For any Q in L, Q(Gc) can be directly computed: any algorithms and indexing structures for G can be used for Gc, with no need to decompress Gc. Gc is computed once for all queries Q in L, and incrementally maintained.

37 Reachability queries. Reachability. Input: a directed graph G, and a pair of nodes s and t in G. Question: does there exist a path from s to t in G? O(|V| + |E|) time. Compress G by leveraging the equivalence relation. Equivalence relation: reachability relation R_e, where a node pair (u, v) ∈ R_e iff u and v have the same set of ancestors and descendants in G. For any graph G, there is a unique maximum R_e, i.e., the reachability equivalence relation of G.

38 Algorithm and example. 1. Compute R_e and its equivalence classes. 2. Construct a node for each equivalence class. 3. Construct G_C. O(|V||E|) time. [Figure: an example graph with nodes MSA1, MSA2, BSA1, BSA2, FA1, …, FA4 and C1, …, Ck, and its compressed graph for a reachability query Q_R.]

39 Reachability preserving compression. A reachability preserving compression R for G: R maps each node in G to its reachability equivalence class in G_C, and each edge to an edge between two equivalence classes. Nodes in G_C: equivalence classes. Correctness: for any query Q_R(v, w) over G, v can reach w iff R(v) can reach R(w) in G_C; compression R is in quadratic time; no post-processing function P is required. Reduction: 95% on average for reachability queries.
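The construction of the compressed graph for reachability can be sketched directly from the definition of the equivalence relation. This is a naive illustration rather than the algorithm behind the stated bounds: it computes ancestor and descendant sets by a separate traversal per node, it reads "reach" as "reach by a path with at least one edge", and it keeps edges that stay inside one class as self-loops so that cyclic classes remain distinguishable. All function names are made up for illustration.

```python
def reachable(edges, s, t):
    """True if a path with at least one edge leads from s to t."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    seen, stack = set(), [s]
    while stack:
        x = stack.pop()
        for y in succ.get(x, ()):
            if y == t:
                return True
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def compress_reachability(nodes, edges):
    """Map each node to its equivalence class and build the compressed edges."""
    succ = {v: set() for v in nodes}
    pred = {v: set() for v in nodes}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    def closure(start, adj):
        seen, stack = set(), [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return frozenset(seen)

    # Nodes with the same (ancestors, descendants) pair share a class.
    ids, class_of = {}, {}
    for v in nodes:
        signature = (closure(v, pred), closure(v, succ))
        class_of[v] = ids.setdefault(signature, len(ids))

    gc_edges = {(class_of[u], class_of[v]) for u, v in edges}
    return class_of, gc_edges


nodes = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("c", "d")]
class_of, gc_edges = compress_reachability(nodes, edges)
print(class_of, sorted(gc_edges))
print(reachable(edges, "a", "d"), reachable(gc_edges, class_of["a"], class_of["d"]))
```

In the example, `a` and `b` have the same ancestors (none) and the same descendants (`c` and `d`), so they collapse into one node of the compressed graph, and every reachability query gives the same answer on both graphs without any post-processing.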

40 What does it look like in real life? 18 times faster on average for reachability queries.

41 Graph pattern matching by graph simulation. Input: a directed graph G, and a graph pattern Q. Output: the maximum simulation relation R. Compress G by leveraging the equivalence relation. Bisimulation: a binary relation B over the node set V of G, such that for each node pair (u, v) ∈ B: L(u) = L(v); for each edge (u, u') ∈ E, there exists (v, v') ∈ E such that (u', v') ∈ B; and for each edge (v, v') ∈ E, there exists (u, u') ∈ E such that (u', v') ∈ B. Equivalence relation R_b: the unique maximum bisimulation relation. [Figure: two example graphs G1 and G2 with nodes labelled A, B, C and D.]
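The maximum bisimulation can be computed by repeatedly splitting a partition of the nodes until it is stable. The sketch below uses naive signature-based refinement for illustration only; it is not an efficient partition-refinement algorithm, and the function name and input encoding are hypothetical.

```python
def max_bisimulation(labels, edges):
    """Return a dict mapping each node to the id of its bisimulation class.

    labels: dict mapping node -> label.
    edges:  iterable of directed (source, target) pairs.
    """
    succ = {v: set() for v in labels}
    for u, v in edges:
        succ[u].add(v)

    # Start with one block per label, then split any block whose members
    # disagree on the set of blocks that their successors belong to.
    block = dict(labels)
    while True:
        signature = {v: (block[v], frozenset(block[w] for w in succ[v]))
                     for v in labels}
        ids = {}
        new_block = {v: ids.setdefault(signature[v], len(ids)) for v in labels}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block   # no block was split: the partition is stable
        block = new_block


labels = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
edges = [("a1", "b1"), ("a2", "b2")]
print(max_bisimulation(labels, edges))
```

Here `a1` and `a2` end up in the same class because each has exactly one successor labelled B, while `a3` is separated because it has no successors; each resulting class would become one node of the compressed graph.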

42 Compression for simulation. R(G): computes equivalence classes and constructs Gc with the equivalence classes. P(Q, Gc): nodes are expanded to the nodes in their equivalence classes. [Figure: graph G with nodes msa1, msa2, bsa1, bsa2, fa1, fa2, fa3 and c1, …, ck, and its compressed graph Gc with nodes MSAr, BSAr, FAr, FAr', Cr and Cr'.]

43 Compression for simulation. Query preserving compression for graph pattern matching. Compression function R( ): the maximum bisimulation relation on the nodes of G, an equivalence relation; nodes in Gc denote equivalence classes; R(G) is in O(|E| log |V|) time. Post-processing function P( ): makes use of the inverse of R( ); nodes in Q(Gc) are expanded to the nodes in their equivalence classes; P(Q, Gc) is in linear time in the size of Q(G), i.e., the size of the output. Reduction: 57% on average for graph pattern matching; 2.3 times faster (simulation). Subgraph isomorphism?

44 Summing up

45 Summary and review. What is parallel scalability? Why do we care about it? Study some parallel algorithms. Show that they are parallel scalable if they are, and disprove it otherwise. Why do we want to make big graphs small? How can we do it? What is bounded evaluability of queries? What auxiliary structures do we need to make queries boundedly evaluable? What is query-preserving graph compression? Is it lossless? Do we lose information when using such a compression scheme? How to develop query preserving graph compression schemes?

46 Project (1): a research and development project. Bounded evaluability. Recall keyword search via distinct-root trees (bounded by a fixed depth k; see Lectures 2 and 4). Develop an algorithm for keyword search based on access constraints; show that such queries can be boundedly evaluated. Develop optimization strategies. Develop a parallel version of your algorithm, in whatever model you like (MapReduce, BSP, GRAPE). Experimentally evaluate your algorithms, especially their scalability with the size of G. Write a survey on various methods for keyword search with distinct-root trees, as part of the related work.

47 Project (2): a research and development project. Recall graph pattern matching by subgraph isomorphism (Lecture 3). Develop a query-preserving compression scheme for subgraph isomorphism. Implement your compression scheme and an algorithm for graph pattern matching via subgraph isomorphism, based on your query-preserving compression scheme. Experimentally evaluate your compression scheme and evaluation algorithm, especially its scalability with the size of G. Write a survey on graph compression schemes, as part of the related work.

48 Project (3): a development project. Combine query-preserving compression and a distributed algorithm for reachability queries (Lecture 5). Develop a framework for answering reachability queries, with: a query-preserving compression scheme to reduce graphs; a distributed algorithm for answering reachability queries; an incremental algorithm to maintain compressed graphs in response to changes to the original graphs. Implement the framework with all three algorithms. Experimentally evaluate the method for answering reachability queries, especially its scalability with the size of G. Write a survey on graph compression schemes and distributed algorithms for reachability queries, as part of the related work.

49 Papers for you to review
M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna, and H. Oh. SCADS: Scale-independent storage for social computing applications. In CIDR, 2009. http://arxiv.org/ftp/arxiv/papers/0909/0909.1775.pdf
M. Armbrust, E. Liang, T. Kraska, A. Fox, M. J. Franklin, and D. Patterson. Generalized scale independence through incremental precomputation. In SIGMOD, 2013. http://www.cs.albany.edu/~jhh/courses/readings/armbrust.sigmod13.incremental.pdf
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, 2013. https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization. http://pages.cs.wisc.edu/~jignesh/publ/summarization.pdf
Y. Cao, W. Fan, and R. Huang. Making pattern queries bounded in big graphs. In ICDE, 2015. (bounded evaluability)
W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression. In SIGMOD, 2012. (query-preserving compression)

