Graph Analytics on Massive Collections of Small Graphs Dritan Bleco Yannis Kotidis Department of Informatics Athens University Of Economics and Business.


1 Graph Analytics on Massive Collections of Small Graphs
Dritan Bleco, Yannis Kotidis
Department of Informatics, Athens University of Economics and Business
EDBT 2014 - Athens
kotidis@aueb.gr, dritanbleco@aueb.gr

2 Outline
– Motivation
– Graph Records & Queries
– Storage of Graph Records and Indexing using a Column Store
– Graph View Materialization
– Selection of Graph Views
– Extensions
– Experiments
– Conclusions

3 Motivational Example
Focus on small graphs that are generated continuously
– Examples: data from CRM, WMS and SCM applications
Difference between our targeted applications and other applications of graphs (e.g. social web, biology)
– Not a single massive graph but a massive collection of smaller graphs
– Nodes/edges are mapped to real-world entities, so there is no need for isomorphism discovery

4 Framework Overview
Our framework puts together three different techniques
– A column-oriented relational backend that permits a flat description of the graph records. It avoids the recursion and costly joins for path calculations that a straightforward relational implementation requires
– A very efficient indexing mechanism using bitmap columns, analogous to the bitmap indexes frequently used in data warehouses. This model is generic and can accommodate specialized graph indexes (for example the gIndex)
– A framework that permits the creation and reuse of materialized graph views of different types. These views improve query times, especially for aggregation queries

5 [Figure: delivery network with nodes A to K grouped into Production Lines, Hubs and Customer Locations across Region 1 and Region 2; edges are either Own Routes or Leased Routes]
QUERIES
– Delivery time for products shipped via the [A, D, E, G, I] path
– Delivery cost for products shipped using Leased Routes
– The longest delay for products shipped from Region 1 to Location I via Hubs of Region 2

6 Primitive Query Types
Graph Queries
– Find records that contain a given query graph G_q
– The result is the record id with the respective measures of each matching record
– For example, return delivery times along all hops in [A, D, E, G, I]
Aggregate Graph Queries
– A Graph Query G_q with the addition of a user-defined aggregate function f
– The result is the aggregation of the measures along all maximal paths (paths connecting sink and terminal nodes in G_q)
– E.g. total delivery time for all shipments via [A, D, E, G, I]

7 Graph Queries
Edge ids: AB=1, AC=2, CE=3, AD=4, DE=5, EF=6, FG=7
Record 1 (edge id:measure): 1:3 2:4 3:2 4:1 5:2
Record 2 (edge id:measure): 2:1 3:2 4:2 5:3 6:4 7:1
Record 3 (edge id:measure): 4:5 5:4 6:3 7:1
Query: find records that follow path [ACEF]
Result (record id, related measures): r2, AC:1, CE:2, EF:4

8 Graph Aggregate Queries
Edge ids: AB=1, AC=2, CE=3, AD=4, DE=5, EF=6, FG=7
Record 1 (edge id:measure): 1:3 2:4 3:2 4:1 5:2
Record 2 (edge id:measure): 2:1 3:2 4:2 5:3 6:4 7:1
Record 3 (edge id:measure): 4:5 5:4 6:3 7:1
Query: find records and the total (sum) cost for path [ADEF]
Result (record id, aggregated measures): r2, ADEF:9
                                         r3, ADEF:12

9 Storage Model
Edge ids: AB=1, AC=2, CE=3, AD=4, DE=5, EF=6, FG=7
Each graph record becomes one row, with one measure column per edge id:
rec Id  m1    m2    m3    m4    m5    m6    m7
1       3     4     2     1     2     Null  Null
2       Null  1     2     2     3     4     1
3       Null  Null  Null  5     4     3     1

10 Bitmap Columns – a simple index
Edge ids: AB=1, AC=2, CE=3, AD=4, DE=5, EF=6, FG=7
Bitmap column b_i is 1 when the record contains edge i and 0 otherwise:
rec Id  m1    m2    m3    m4    m5    m6    m7    b1  b2  b3  b4  b5  b6  b7
1       3     4     2     1     2     Null  Null  1   1   1   1   1   0   0
2       Null  1     2     2     3     4     1     0   1   1   1   1   1   1
3       Null  Null  Null  5     4     3     1     0   0   0   1   1   1   1
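The flattening shown on slides 9 and 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `flatten` and the dictionary layout are hypothetical, and a real deployment would keep these columns in a column store rather than in Python dictionaries.

```python
# Edge-to-id mapping and records copied from the slide example.
EDGE_IDS = {"AB": 1, "AC": 2, "CE": 3, "AD": 4, "DE": 5, "EF": 6, "FG": 7}

# Each graph record is given as {edge id: measure}.
RECORDS = {
    1: {1: 3, 2: 4, 3: 2, 4: 1, 5: 2},
    2: {2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 1},
    3: {4: 5, 5: 4, 6: 3, 7: 1},
}


def flatten(rec_id, edges, num_edges=len(EDGE_IDS)):
    """Turn one graph record into a flat row (hypothetical helper).

    Measure column m<i> holds the edge measure or None (Null);
    bitmap column b<i> is 1 if edge i is present, else 0.
    """
    row = {"recid": rec_id}
    for e in range(1, num_edges + 1):
        row[f"m{e}"] = edges.get(e)
        row[f"b{e}"] = 1 if e in edges else 0
    return row


ROWS = [flatten(rec_id, edges) for rec_id, edges in RECORDS.items()]

for row in ROWS:
    print(row)
```

Each printed row corresponds to one line of the table on slide 10.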

11 Queries using Bitmap Columns
Edge ids: AB=1, AC=2, CE=3, AD=4, DE=5, EF=6, FG=7
rec Id  m1    m2    m3    m4    m5    m6    m7    b1  b2  b3  b4  b5  b6  b7
1       3     4     2     1     2     Null  Null  1   1   1   1   1   0   0
2       Null  1     2     2     3     4     1     0   1   1   1   1   1   1
3       Null  Null  Null  5     4     3     1     0   0   0   1   1   1   1
Graph Query: get the cost (delay) of each edge on path [ACEF]
    Select recid, m2, m3, m6 where b2=1 AND b3=1 AND b6=1
Graph Aggregate Query: get the total cost (delay) of path [ACEF]
    Select recid, m2 + m3 + m6 where b2=1 AND b3=1 AND b6=1
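As a rough illustration of how the two query types on slide 11 evaluate, the sketch below packs each bitmap column into a Python integer (one bit per record) and ANDs whole columns together. The function names and the integer encoding are assumptions made for this example; the slides do not describe how the column store represents bitmaps internally.

```python
# Measures m1..m7 per record, copied from the slide table (None = Null).
ROWS = {
    1: [3, 4, 2, 1, 2, None, None],
    2: [None, 1, 2, 2, 3, 4, 1],
    3: [None, None, None, 5, 4, 3, 1],
}
REC_IDS = sorted(ROWS)


def bitmap_column(edge_id):
    """Pack column b<edge_id> into an int; bit i corresponds to REC_IDS[i]."""
    bits = 0
    for i, rec_id in enumerate(REC_IDS):
        if ROWS[rec_id][edge_id - 1] is not None:
            bits |= 1 << i
    return bits


BITMAPS = {e: bitmap_column(e) for e in range(1, 8)}


def matching_records(edge_ids):
    """AND the bitmap columns of all query edges and decode the result."""
    mask = (1 << len(REC_IDS)) - 1
    for e in edge_ids:
        mask &= BITMAPS[e]
    return [rec_id for i, rec_id in enumerate(REC_IDS) if (mask >> i) & 1]


def graph_query(edge_ids):
    """Return (record id, measures of the query edges) per matching record."""
    return [(r, [ROWS[r][e - 1] for e in edge_ids])
            for r in matching_records(edge_ids)]


def aggregate_graph_query(edge_ids):
    """Return (record id, sum of measures along the path) per matching record."""
    return [(r, sum(measures)) for r, measures in graph_query(edge_ids)]


print(graph_query([2, 3, 6]))            # path [ACEF]
print(aggregate_graph_query([4, 5, 6]))  # path [ADEF]
```

The first call reproduces the result on slide 7 (record 2 with measures 1, 2, 4) and the second reproduces the sums on slide 8 (9 for record 2 and 12 for record 3).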

12 Graph View Materialization
Materialized Graph Views
– Used for Graph Queries / Aggregate Graph Queries
– Implemented as bitmaps resulting from ANDing the edges of a subgraph derived (by our techniques) from a set of graph queries
– These bitmaps are added as new columns in the database
Materialized Aggregate Graph Views
– Used for Graph Queries / Aggregate Graph Queries
– A bitmap (as in a Graph View) plus pre-computed aggregates. The bitmap is the corresponding materialized Graph View; the aggregates are derived from the measures stored in the graph records

13 Materialized Graph Views
Edge ids: AB=1, AC=2, CE=3, AD=4, DE=5, EF=6, FG=7
rec Id  m1    m2    m3    m4    m5    m6    m7    b1  b2  b3  b4  b5  b6  b7  b_q1
1       3     4     2     1     2     Null  Null  1   1   1   1   1   0   0   0
2       Null  1     2     2     3     4     1     0   1   1   1   1   1   1   1
3       Null  Null  Null  5     4     3     1     0   0   0   1   1   1   1   0
Query Q1: get the cost (delay) of each edge on path [ACEF]
Materialized view for Q1: b_q1 = b2 AND b3 AND b6
    Select recid, m2, m3, m6 where b_q1=1    (replaces b2=1 AND b3=1 AND b6=1)

14 Materialized Aggregate Views
Edge ids: AB=1, AC=2, CE=3, AD=4, DE=5, EF=6, FG=7
rec Id  m1    m2    m3    m4    m5    m6    m7    m_q1  b1  b2  b3  b4  b5  b6  b7  b_q1
1       3     4     2     1     2     Null  Null  Null  1   1   1   1   1   0   0   0
2       Null  1     2     2     3     4     1     7     0   1   1   1   1   1   1   1
3       Null  Null  Null  5     4     3     1     Null  0   0   0   1   1   1   1   0
Query Q1: get the total cost of path [ACEF]
Aggregated path view for Q1: b_q1 = b2 AND b3 AND b6, m_q1 = m2 + m3 + m6
    Select recid, m_q1 where b_q1=1    (replaces m2 + m3 + m6 and b2=1 AND b3=1 AND b6=1)

15 Another query can use the materialization of Q1
Edge ids: AB=1, AC=2, CE=3, AD=4, DE=5, EF=6, FG=7
rec Id  m1    m2    m3    m4    m5    m6    m7    m_q1  b1  b2  b3  b4  b5  b6  b7  b_q1
1       3     4     2     1     2     Null  Null  Null  1   1   1   1   1   0   0   0
2       Null  1     2     2     3     4     1     7     0   1   1   1   1   1   1   1
3       Null  Null  Null  5     4     3     1     Null  0   0   0   1   1   1   1   0
Aggregated view for Q1: b_q1 = b2 AND b3 AND b6, m_q1 = m2 + m3 + m6
Query Q2: get the total cost (delay) of path [ACEFG]
    Select recid, m_q1 + m7 where b_q1=1 AND b7=1
    (replaces m2 + m3 + m6 + m7 and b2=1 AND b3=1 AND b6=1 AND b7=1)
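The reuse of a materialized aggregate view on slides 14 and 15 can be sketched as follows. The helper names `materialize` and `total_via_view` are hypothetical and the view is held in a plain dictionary; the sketch only shows that Q2 needs one view lookup plus one extra edge instead of four edge columns.

```python
# Graph records as {edge id: measure}, copied from the slide example.
RECORDS = {
    1: {1: 3, 2: 4, 3: 2, 4: 1, 5: 2},
    2: {2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 1},
    3: {4: 5, 5: 4, 6: 3, 7: 1},
}


def materialize(records, view_edges):
    """Build an aggregate view: {record id: (bitmap bit, pre-summed measure)}.

    The bit is the AND of the edge bitmaps; the measure is the sum of the
    edge measures, or None when the record does not contain the subgraph.
    """
    view = {}
    for rec_id, edges in records.items():
        if all(e in edges for e in view_edges):
            view[rec_id] = (1, sum(edges[e] for e in view_edges))
        else:
            view[rec_id] = (0, None)
    return view


def total_via_view(records, view, extra_edges):
    """Answer an aggregate query as view measure + measures of extra edges."""
    result = []
    for rec_id, edges in records.items():
        bit, measure = view[rec_id]
        if bit and all(e in edges for e in extra_edges):
            result.append((rec_id, measure + sum(edges[e] for e in extra_edges)))
    return result


# View for Q1 = path [ACEF], i.e. edges 2, 3, 6.
Q1_VIEW = materialize(RECORDS, [2, 3, 6])
print(Q1_VIEW)

# Q2 = path [ACEFG] reuses the Q1 view and adds only edge 7 (FG).
print(total_via_view(RECORDS, Q1_VIEW, [7]))
```

For the slide data, only record 2 matches Q1 (pre-computed total 7), so Q2 returns a total of 8 for record 2.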

16 Re-use of materialized graph views
See our past work "Business Intelligence on Complex Graph Data", BEWEB, Berlin, Germany, March 2012
– How to formulate complex graph expressions using a set of intuitive operators we define
How to best answer a user query using materialized (Aggregate or not) Graph Views?
– A simple cost model based on the number of bitmaps required for answering a query
– Mapped to a set cover problem
– Solved via a greedy algorithm
– Details are in the paper
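To make the set cover formulation concrete, here is a minimal sketch of the textbook greedy heuristic applied to this setting. It is not the algorithm from the paper, whose details the slides do not give. The assumptions are: a view is usable only if its edges are a subset of the query's edges, every single-edge bitmap b_i is always available, and the cost is the number of bitmaps chosen. The view names are hypothetical.

```python
def greedy_cover(query_edges, views):
    """Pick bitmaps (views or single-edge columns) that cover the query edges.

    query_edges: iterable of edge ids in the query graph.
    views: {view name: iterable of edge ids ANDed into that view}.
    Returns the chosen bitmap names in selection order.
    """
    query = frozenset(query_edges)

    # A view whose edges are not all in the query would over-restrict it.
    candidates = {name: frozenset(edges) for name, edges in views.items()
                  if frozenset(edges) <= query}
    # Single-edge bitmaps guarantee that a cover always exists.
    for e in query:
        candidates[f"b{e}"] = frozenset([e])

    uncovered = set(query)
    chosen = []
    while uncovered:
        # Greedy step: take the bitmap covering the most uncovered edges
        # (names are sorted so that ties are broken deterministically).
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


# Q2 = [ACEFG] uses edges 2, 3, 6, 7; "bq1" covers [ACEF], "bq9" is unusable.
print(greedy_cover([2, 3, 6, 7], {"bq1": [2, 3, 6], "bq9": [4, 5, 6]}))
```

With the view for [ACEF] available, the query [ACEFG] is answered with two bitmaps (the view and b7) instead of four.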

17 What to materialize?
Aggressive materialization: materialize whole queries
– Often not possible due to space limitations
Our approach: Query Driven Graph View Selection
First need to derive a set of candidate views
– Naïve approach: consider all subsets of the edges in the union of all query graphs. This gives an exponential number of candidates (thus not feasible) and many redundant views
– Intuition: prune candidates based on a monotonicity property

18 Candidate Generation
[Figure: graph over nodes A, B, C, D, E, F, G, H, J]
Frequent query set: {[ACEFGHJ], [ADEFGHJ]}
Based on this property we only consider the following candidates:
1. Each query graph: adds {[ACEFGHJ], [ADEFGHJ]}
2. All subgraphs that are the intersection of two query graphs: adds {[EFGHJ]}
3. All subgraphs that are the intersection of two graphs from the previous step, repeated until no new views are created
View selection from the candidate set is mapped to a set cover problem
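The candidate generation steps on slide 18 amount to closing the query set under pairwise intersection. The sketch below is one straightforward way to compute that closure, assuming each query graph is represented as a set of edge labels; the function name is hypothetical, and each round intersects all pairs of current candidates (a slightly broader pairing than the slide describes, but with the same fixed point).

```python
from itertools import combinations


def candidate_views(query_graphs):
    """Return the query graphs plus all non-empty intersections, to a fixed point."""
    candidates = {frozenset(q) for q in query_graphs}
    added = True
    while added:
        new = set()
        for a, b in combinations(candidates, 2):
            common = a & b
            if common and common not in candidates:
                new.add(common)
        candidates |= new
        added = bool(new)
    return candidates


# Frequent query set from the slide, written as edge sets.
ACEFGHJ = {"AC", "CE", "EF", "FG", "GH", "HJ"}
ADEFGHJ = {"AD", "DE", "EF", "FG", "GH", "HJ"}

for view in sorted(candidate_views([ACEFGHJ, ADEFGHJ]), key=len):
    print(sorted(view))
```

For the two queries on the slide this yields three candidates: the two query graphs and their common suffix [EFGHJ].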

19 Extensions
All data are stored in a single relation
rec Id  m1    m2    m3    m4    m5    m6    m7    b1  b2  b3  b4  b5  b6  b7
1       3     4     2     1     2     Null  Null  1   1   1   1   1   0   0
2       Null  1     2     2     3     4     1     0   1   1   1   1   1   1
3       Null  Null  Null  5     4     3     1     0   0   0   1   1   1   1
But the data can also be partitioned into more than one relation
rec Id  m1    m2    m3    b1  b2  b3
1       3     4     2     1   1   1
2       Null  1     2     0   1   1
3       Null  Null  Null  0   0   0
rec Id  m4    m5    m6    m7    b4  b5  b6  b7
1       1     2     Null  Null  1   1   0   0
2       2     3     4     1     1   1   1   1
3       5     4     3     1     1   1   1   1
Can easily incorporate specialized graph indexes (for example the gIndex)

20 Experiments
Graph records from two datasets
1. NY (*): depicts New York roads
2. Gnutella (**): describes connections among Gnutella hosts from August 2002
Experimental evaluation among 4 systems
– Commercial Row Store Relational DB
– Column Store Relational DB
– Neo4j
– Commercial Native RDF DB
* http://www.dis.uniroma1.it/~challenge9/download.shtml
** http://snap.stanford.edu/data/p2p-Gnutella05.html

21 Comparison to Alternative Systems (no views)
– Our system provides almost constant query times with increasing graph query size, as fewer records are retrieved (even though more bitmaps are being used)
– The column store is not affected by increasing density (% of edges in a record)

22 Benefit of Using Graph Views
Graph views provide savings of up to 32% in query times
– There is a mandatory cost for fetching the records that is not affected by materialization
Thus, more savings are seen in aggregate queries
– Using 100 aggregate graph views reduces the execution time by 89%
Larger gains when queries exhibit skew (graphs in the paper)
[Figures: runtime for 100 uniform Graph Queries; runtime for 100 uniform Aggregate Graph Queries]

23 Using Additional Indexes
gIndex (record driven): trained the index using records that are part of the query result set
– It took about 24 hours to process about 100,000 records
Graph views (query driven) result in up to 6 times faster query processing times
– It ran in less than one second
[Figures: gIndex in 100 uniform Graph Queries; gIndex in 100 uniform Aggregate Graph Queries]

24 Conclusions
Presented a framework where both data and queries are modeled as abstract graph structures
– Abstracted two primitive query graphs
– Introduced two types of Graph Views for expediting queries
– Discussed an efficient mechanism for selecting a set of non-redundant views
– Answering queries using Graph Views by solving an instance of a set cover problem
Argued for a simple yet effective representation of graph records using a flat relational model implemented in a column store
– Introduced bitmap indexes for efficient query processing
– Graph Views are stored within the same relational schema
Presented experimental results using datasets consisting of hundreds of millions of graph records
– Experimental results show that our platform is orders of magnitude faster than a straightforward relational implementation and than alternative systems that natively handle graph data

25 Thank you, Dritan Bleco Questions?

