Towards Scalable RDF Graph Analytics on MapReduce
Padmashree Ravindra, Vikas V. Deshpande, Kemafor Anyanwu

1 Towards Scalable RDF Graph Analytics on MapReduce
Padmashree Ravindra, Vikas V. Deshpande, Kemafor Anyanwu
{pravind2, vvdeshpa, kogan}@ncsu.edu
COUL - semantic COmpUting research Lab

2 Introduction
- Growing interest in exploiting RDF data for decision-making
- Requires support for analytical-style querying, e.g.:
  Sales (Cust, prod, price, loc, month, year)
  For each prod, count for each month of 2008 the sales that were between the previous month's avg sale and the next month's avg sale (prev_avg_sale, next_avg_sale) *
  Prod   Month  Count
  Prod1  Feb    3
- More complex than traditional SPJ queries
- Often include multiple groupings and / or aggregations
- Next release of SPARQL expected to include such constructs
* Example from [1]

3 Analytical Query Processing
- Traditional OLAP techniques
  - Require star / snowflake schema
  - Enterprise-scale
- But Semantic Web data (RDF)
  - Semi-structured (labeled graphs)
  - Absence of star-like schema
  - Billion-triple data sets
- Goal: Exploit MapReduce-based frameworks to develop a scalable, cost-effective platform for Semantic Web analytics.

4 MapReduce-based Data Processing
- High-level dataflow languages: Pig Latin, DryadLINQ, HiveQL, JAQL
- Hybrid approach: HadoopDB [5]
- MapReduce in RDF processing
  - Graph pattern queries [8], [9]
  - Graph closure computation [10]
- RAPID [6]
  - Succinct expression of complex queries
  - Optimize multiple groupings / aggregations

5 RDF data model
- Statements (triples)
- Graph representation
- Groups = Stars (Rankings, UserVisits)
  Sub  Prop         Obj
  R1   type         Ranking
  R1   pageRank     11
  R1   pageURL      Url1
  R1   avgDuration  97
  UV1  type         UserVisits
  UV1  srcIP        158.112.27.3
  UV1  destURL      url1
  UV1  adRevenue    339.08142
  UV1  visitDate    1979/12/12
  UV1  userAgent    SCOPE
  UV1  cCode        VNM
  UV1  iCode        VNM-KH
  UV1  sKeyword     comets
  UV1  avgTime      3

6 Traditional Querying of RDF
- Graph pattern matching
- E.g. get details about all pages visited by particular users between 1979/12/01 and 1979/12/30
[Figure panels: SPARQL Query; Matching graph pattern]

7 Example Analytical Query on RDF data
Compute the average pageRank and total adRevenue for all pages visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30
- Pattern matching
  - Star sub graphs: Rankings, UserVisits
  - Join between the stars
- Grouping based on value of srcIP property
- Aggregation on value of pageRank and adRevenue
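The three steps of this query (pattern matching, grouping, aggregation) can be sketched in plain Python as a single-machine reference computation. This is only an illustration, not the MapReduce implementation discussed in the talk. The sample triples, the function name analytical_query, and the join condition (UserVisits.destURL equals Rankings.pageURL) are assumptions made for the example, and each property is assumed to be single-valued per subject.

```python
from collections import defaultdict

# Hypothetical sample triples (Sub, Prop, Obj), modeled on the example data.
TRIPLES = [
    ("R1", "type", "Ranking"), ("R1", "pageRank", "11"), ("R1", "pageURL", "url1"),
    ("R2", "type", "Ranking"), ("R2", "pageRank", "27"), ("R2", "pageURL", "url2"),
    ("UV1", "type", "UserVisits"), ("UV1", "srcIP", "158.112.27.3"),
    ("UV1", "destURL", "url1"), ("UV1", "adRevenue", "339.08142"),
    ("UV1", "visitDate", "1979/12/12"),
    ("UV2", "type", "UserVisits"), ("UV2", "srcIP", "159.222.21.9"),
    ("UV2", "destURL", "url1"), ("UV2", "adRevenue", "330.51248"),
    ("UV2", "visitDate", "1980/02/02"),
]


def analytical_query(triples, start="1979/12/01", end="1979/12/30"):
    """Return {srcIP: (average pageRank, total adRevenue)}."""
    # Pattern matching: collect each subject's properties into one star.
    stars = defaultdict(dict)
    for sub, prop, obj in triples:
        stars[sub][prop] = obj
    rankings = [s for s in stars.values() if s.get("type") == "Ranking"]
    # yyyy/mm/dd strings compare correctly as plain strings.
    visits = [s for s in stars.values()
              if s.get("type") == "UserVisits"
              and start <= s.get("visitDate", "") <= end]

    # Join between the stars (assumed condition: destURL == pageURL).
    rank_by_url = {r["pageURL"]: float(r["pageRank"]) for r in rankings}

    # Grouping on srcIP.
    groups = defaultdict(list)
    for v in visits:
        if v["destURL"] in rank_by_url:
            groups[v["srcIP"]].append(
                (rank_by_url[v["destURL"]], float(v["adRevenue"])))

    # Aggregation: average pageRank and total adRevenue per group.
    return {ip: (sum(r for r, _ in rows) / len(rows),
                 sum(a for _, a in rows))
            for ip, rows in groups.items()}
```

With the sample data only UV1 falls inside the date range, so the result holds a single group for srcIP 158.112.27.3.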

8 Pig : Data Processing
- Express data processing tasks using high-level query primitives
  - Usability, code reuse, automatic optimization
- Pig Latin data model: atom, tuple, bag (nesting)
- Operators: LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggr. functions
- Extensibility support via UDFs
- Operators compile into MapReduce jobs
- Partition REL A using values in age column ($1):
  SPLIT A into minors IF $1 < 18, majors IF $1 >= 18;
- Equijoin on REL A (column 0) and REL B (column 1):
  JOIN A by $0, B by $1;

9 Compiling Pig Latin's JOIN to MapReduce
JOIN A by $1, B by $0;
REL A:
  $0  $1
  C1  P1
  C1  P2
  C2  P1
REL B:
  $0  $1
  P1  18
  P2  25
- map: annotate based on $1 (join key)
- reduce: package tuples
  - Reducer 1 (key P1): (C1, P1, P1, 18), (C2, P1, P1, 18)
  - Reducer 2 (key P2): (C1, P2, P2, 25)
Result:
  $0  $1  $2  $3
  C1  P1  P1  18
  C2  P1  P1  18
  C1  P2  P2  25
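The map and reduce roles in this join can be emulated in a few lines of Python. This is a minimal in-memory sketch of a reduce-side join, not Pig's actual compiler output; the function names map_phase and reduce_phase and the "A"/"B" source tags are hypothetical.

```python
from collections import defaultdict

REL_A = [("C1", "P1"), ("C1", "P2"), ("C2", "P1")]
REL_B = [("P1", 18), ("P2", 25)]


def map_phase(rel_a, rel_b):
    """Annotate every tuple with its join key and a tag naming its source."""
    for t in rel_a:
        yield t[1], ("A", t)   # JOIN A by $1
    for t in rel_b:
        yield t[0], ("B", t)   # B by $0


def reduce_phase(pairs):
    """Package tuples per join key, then pair each A tuple with each B tuple."""
    buckets = defaultdict(lambda: {"A": [], "B": []})
    for key, (tag, tup) in pairs:      # stands in for the shuffle step
        buckets[key][tag].append(tup)
    joined = []
    for key in sorted(buckets):        # one "reducer" per key
        for a in buckets[key]["A"]:
            for b in buckets[key]["B"]:
                joined.append(a + b)
    return joined
```

Calling reduce_phase(map_phase(REL_A, REL_B)) yields the three four-column tuples produced by the two reducers in the example.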

10 Pattern Matching in Pig : Approach 1
Triple store, used three times as triples1, triples2, triples3:
  Sub  Prop      Obj
  R1   type      Ranking
  R1   pageRank  11
  R1   pageURL   Url1
  UV1  type      UserVisits
  UV1  srcIP     158.112.27.3
RankingsStarPattern = JOIN triples1 BY Sub, triples2 BY Sub, triples3 BY Sub;
- Rankings star pattern = 3-way self-join
- UserVisits star pattern = 5-way self-join
Issues
- Self-joins on very large relations: high I/O costs
- Generate meaningless tuples that need an additional filtering step, e.g. (R1, type, Ranking, R1, type, Ranking, R1, type, Ranking)
[Figure: Rankings star for R1 with edges type -> Ranking, pageRank -> 11, pageURL -> url1]
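The "meaningless tuples" issue is easy to reproduce on a toy input. The sketch below performs the 3-way self-join on Sub in memory and then applies the extra filtering step; the helper names three_way_self_join and is_rankings_star are hypothetical, and the data is a three-triple sample.

```python
from itertools import product

# Hypothetical sample: the three triples of the R1 Rankings star.
TRIPLES = [
    ("R1", "type", "Ranking"),
    ("R1", "pageRank", "11"),
    ("R1", "pageURL", "url1"),
]


def three_way_self_join(triples):
    """Join the triple relation with itself three times on Sub."""
    return [t1 + t2 + t3
            for t1, t2, t3 in product(triples, repeat=3)
            if t1[0] == t2[0] == t3[0]]


def is_rankings_star(row):
    """Extra filtering step: keep only the (type, pageRank, pageURL) combination."""
    return (row[1], row[4], row[7]) == ("type", "pageRank", "pageURL")


joined = three_way_self_join(TRIPLES)
stars = [row for row in joined if is_rankings_star(row)]
```

For one subject with three triples the join produces 27 rows, of which a single row is the intended star; the rest, including (R1, type, Ranking) repeated three times, have to be filtered out afterwards.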

11 Approach 2: Vertical Partitioning
- LOAD all the RDF triples
- SPLIT into one relation per property (Sub, Prop, Obj):
  - typeRanking: (R1, type, Ranking), (R2, type, Ranking)
  - pageRank: (R1, pageRank, 11), (R2, pageRank, 27)
  - pageURL: (R1, pageURL, url1), (R2, pageURL, url2)
  - typeUV: (UV1, type, userVisits), (UV2, type, userVisits)
  - srcIP: (UV1, srcIP, 158.112.27.3), (UV2, srcIP, 159.222.21.9)
  - destURL: (UV1, destURL, url1), (UV2, destURL, url1)
  - adRev: (UV1, adRev, 339.08142), (UV2, adRev, 330.51248)
  - visitDate: (UV1, visitDate, 1979/12/12), (UV2, visitDate, 1980/02/02)
- Filter on visitDate, leaving: (UV1, visitDate, 1979/12/12), (UV4, visitDate, 1979/12/02)
- Ranking = JOIN (compute Star Pattern)
- UserVisits = JOIN (compute Star Pattern)
- JOIN between Ranking, UserVisits
- GROUP BY srcIP
- FOREACH group GENERATE aggregations

12 Approach 2: Vertical Partitioning
Same dataflow as on the previous slide: LOAD all the RDF triples, SPLIT into the property relations (typeRanking, pageRank, pageURL, typeUV, srcIP, destURL, adRev, visitDate), Ranking = JOIN (compute Star Pattern)
Issues
- SPLIT: concurrent sub flows
  - Risk of disk spills
  - I/O costs
- Structure of intermediate relations
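Vertical partitioning and the per-star join can be illustrated with a small in-memory sketch. It is not the Pig script from the talk: vertical_partition and star_join are hypothetical helpers, the partition names are derived mechanically (so the UserVisits type relation is called typeUserVisits here rather than typeUV), and every property is assumed to be single-valued per subject.

```python
from collections import defaultdict

# Hypothetical sample triples for two Rankings subjects.
TRIPLES = [
    ("R1", "type", "Ranking"), ("R1", "pageRank", "11"), ("R1", "pageURL", "url1"),
    ("R2", "type", "Ranking"), ("R2", "pageRank", "27"), ("R2", "pageURL", "url2"),
    ("UV1", "type", "UserVisits"), ("UV1", "srcIP", "158.112.27.3"),
]


def vertical_partition(triples):
    """SPLIT the triple relation into one (Sub, Obj) relation per property."""
    parts = defaultdict(list)
    for sub, prop, obj in triples:
        # type triples are split further by class, e.g. typeRanking.
        name = f"type{obj}" if prop == "type" else prop
        parts[name].append((sub, obj))
    return dict(parts)


def star_join(parts, names):
    """Compute a star pattern by joining the named partitions on Sub."""
    lookups = {n: dict(parts[n]) for n in names}
    common = set.intersection(*(set(lookups[n]) for n in names))
    return {sub: {n: lookups[n][sub] for n in names} for sub in sorted(common)}


parts = vertical_partition(TRIPLES)
ranking = star_join(parts, ["typeRanking", "pageRank", "pageURL"])
```

Each partition is a separate intermediate relation, which is where the concurrent sub flows and the extra I/O listed under Issues come from.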

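13 Compilation to MapReduce Jobs
- Step 1: Pattern Matching
- Step 2: Grouping
- Step 3: Aggregation
[Figure: the FILTER, JOIN, GROUP BY and FOREACH operators for Rankings and UserVisits compiled into four MapReduce jobs, map1/reduce1 through map4/reduce4; three of the jobs are JOINs and one is the GROUP BY]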

14 Our Approach : RAPID+
- Goal: Minimize I/O costs
- Strategy: Concurrent computation of star patterns using a grouping-based algorithm
- Can improve efficiency using operator coalescing and look-ahead processing

15 Concurrent Star Pattern Matching
Query: Compute the average pageRank and total adRevenue for all pageURLs visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30
- Use grouping-based algorithm on a triple storage model: GROUP BY Subject
- More efficient with prior filtering of irrelevant triples
- Filter irrelevant properties: the input also contains (R1, avgDuration, 97), (UV1, userAgent, SCOPE), (UV1, cCode, VNM), (UV1, iCode, VNM-KH), (UV1, sKeyword, comets), (UV1, avgTime, 3)
Triples remaining for the Ranking and UserVisits stars:
  Sub  Prop       Obj
  R1   type       Ranking
  R1   pageRank   11
  R1   pageURL    Url1
  UV1  type       UserVisits
  UV1  srcIP      158.112.27.3
  UV1  destURL    url1
  UV1  adRevenue  339.08142
  UV1  visitDate  1979/12/12

16 Concurrent Star Pattern Matching - 2
- Filter irrelevant triples by coalescing LOAD and FILTER operators
- Using Pig Latin: separate LOAD and FILTER operators in map1
- Our approach (operator coalescing): a single LOAD with loadFilter in map1
  input = LOAD \data using loadFilter ( pageRank, pageURL, type:Ranking, destURL, adRevenue, srcIP, visitDate, type:UserVisits )
- Savings by coalescing:
  - Context switching
  - Parameter passing
  - Multiple handling of same data
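The idea behind coalescing LOAD and FILTER can be sketched as a single pass that parses a line and decides immediately whether to keep it, so no unfiltered relation is ever handed to a second operator. This is an illustration only: load_filter below is a hypothetical Python stand-in for the loadFilter function named above, and the whitespace-separated "Sub Prop Obj" line format is an assumption.

```python
# Properties and type values taken from the loadFilter argument list above.
RELEVANT_PROPS = {"pageRank", "pageURL", "destURL", "adRevenue", "srcIP", "visitDate"}
RELEVANT_TYPES = {"Ranking", "UserVisits"}


def load_filter(lines, props=RELEVANT_PROPS, types=RELEVANT_TYPES):
    """Parse and filter in one pass; irrelevant triples are never materialized."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed lines (assumed input format)
        sub, prop, obj = fields
        if prop in props or (prop == "type" and obj in types):
            yield sub, prop, obj


# Hypothetical input lines.
LINES = [
    "R1 type Ranking",
    "R1 pageRank 11",
    "R1 avgDuration 97",
    "UV1 srcIP 158.112.27.3",
    "UV1 userAgent SCOPE",
    "X1 type Document",
]
kept = list(load_filter(LINES))
```

Because the function is a generator, each triple is handled once, which mirrors the "multiple handling of same data" saving listed above.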

17 Grouping-based Pattern Matching
  Sub  Prop       Obj
  R1   type       Ranking
  R1   pageRank   11
  R1   pageURL    Url1
  UV1  type       UserVisits
  UV1  srcIP      158.112.27.3
  UV1  destURL    url1
  UV1  adRevenue  339.08142
  UV1  visitDate  1979/12/12
- GROUP BY Subject:
  starSubgraphs = GROUP input BY $0;
- BUT heterogeneous bags
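A small Python emulation of GROUP BY Subject shows why the resulting bags are heterogeneous. The function name group_by_subject is hypothetical and the grouping is done in memory rather than by a MapReduce shuffle.

```python
from collections import defaultdict

TRIPLES = [
    ("R1", "type", "Ranking"), ("R1", "pageRank", "11"), ("R1", "pageURL", "url1"),
    ("UV1", "type", "UserVisits"), ("UV1", "srcIP", "158.112.27.3"),
    ("UV1", "destURL", "url1"), ("UV1", "adRevenue", "339.08142"),
    ("UV1", "visitDate", "1979/12/12"),
]


def group_by_subject(triples):
    """Emulate GROUP input BY $0: one bag of (Prop, Obj) pairs per subject."""
    bags = defaultdict(list)
    for sub, prop, obj in triples:
        bags[sub].append((prop, obj))
    return dict(bags)


star_subgraphs = group_by_subject(TRIPLES)
```

A single grouping pass yields both stars at once, but the R1 bag holds three properties and the UV1 bag holds five different ones, so downstream code cannot assume a fixed schema.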

18 Filtering the Groups
- BUT all possible sub patterns computed
- Filter non-matching sub patterns
  - Value-based filtering: validate each sub graph against the filter condition, e.g. visitDate between 1979/12/01 and 1979/12/30
  - Structure-based filtering: eliminate sub graphs with missing properties, e.g. missing srcIP
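The two kinds of filtering can be sketched as predicates over a subject's bag. This is a hedged illustration: the REQUIRED table, the function names, and the sample bags are assumptions chosen to match the example query, not code from RAPID+.

```python
# Hypothetical required properties per star type for the example query.
REQUIRED = {
    "Ranking": {"pageRank", "pageURL"},
    "UserVisits": {"srcIP", "destURL", "adRevenue", "visitDate"},
}


def structure_ok(bag):
    """Structure-based filtering: every required property must be present."""
    props = dict(bag)
    required = REQUIRED.get(props.get("type"))
    return required is not None and required.issubset(props)


def value_ok(bag, start="1979/12/01", end="1979/12/30"):
    """Value-based filtering: visitDate, if present, must lie in the range."""
    date = dict(bag).get("visitDate")
    return date is None or start <= date <= end


def filter_groups(bags):
    return {sub: bag for sub, bag in bags.items()
            if structure_ok(bag) and value_ok(bag)}


BAGS = {
    "R1": [("type", "Ranking"), ("pageRank", "11"), ("pageURL", "url1")],
    "UV1": [("type", "UserVisits"), ("srcIP", "158.112.27.3"), ("destURL", "url1"),
            ("adRevenue", "339.08142"), ("visitDate", "1979/12/12")],
    # Out of range: fails value-based filtering.
    "UV2": [("type", "UserVisits"), ("srcIP", "159.222.21.9"), ("destURL", "url1"),
            ("adRevenue", "330.51248"), ("visitDate", "1980/02/02")],
    # Missing srcIP: fails structure-based filtering.
    "UV3": [("type", "UserVisits"), ("destURL", "url2"),
            ("adRevenue", "1.5"), ("visitDate", "1979/12/05")],
}
matching = filter_groups(BAGS)
```

Only R1 and UV1 survive: UV2 is rejected on its visitDate value and UV3 on its missing srcIP.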

19 Joining the Stars : Look-ahead Processing
Without look-ahead
- Star Pattern Matching Cycle
  - map: annotate based on Subject
  - reduce: group by Subject; process each bag (structure-based and value-based filtering)
- Next Cycle (Joining the Stars)
  - map: process each bag; annotate based on value of join property
  - reduce: join between the star sub graphs
With look-ahead
- Star Pattern Matching Cycle, reduce: group by Subject; process each bag (structure-based and value-based filtering); annotate based on value of join prop
- No repeated processing

20 Example : Look-ahead Processing
- Star Pattern Matching
  - Structure-based filtering
  - Value-based filtering
  - Look-ahead: annotate bag based on join key
  - Eliminate properties irrelevant for future processing (join and filter prop)
  - Minimize size of intermediate results
- Joining the Stars
  - Join between the star sub graphs
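Look-ahead processing can be sketched as a reducer that, after matching a star, already emits it keyed by the join value and stripped of properties the later steps do not need. This is an illustration under assumptions: JOIN_PROP, KEEP, and both function names are hypothetical, and the join condition (pageURL of a Ranking equals destURL of a UserVisits) is inferred from the example data.

```python
from collections import defaultdict

# Hypothetical settings: join property and properties needed after the join.
JOIN_PROP = {"Ranking": "pageURL", "UserVisits": "destURL"}
KEEP = {"Ranking": ("pageRank",), "UserVisits": ("srcIP", "adRevenue")}


def reduce_with_lookahead(bag):
    """End of the star-matching cycle: annotate by join key and project."""
    props = dict(bag)
    kind = props["type"]
    projected = {p: props[p] for p in KEEP[kind]}
    return props[JOIN_PROP[kind]], (kind, projected)


def join_cycle(annotated):
    """Next cycle: stars arrive already keyed by the join value."""
    by_key = defaultdict(lambda: {"Ranking": [], "UserVisits": []})
    for key, (kind, star) in annotated:
        by_key[key][kind].append(star)
    return [{**r, **v}
            for key in sorted(by_key)
            for r in by_key[key]["Ranking"]
            for v in by_key[key]["UserVisits"]]


BAGS = [
    [("type", "Ranking"), ("pageRank", "11"), ("pageURL", "url1")],
    [("type", "UserVisits"), ("srcIP", "158.112.27.3"), ("destURL", "url1"),
     ("adRevenue", "339.08142"), ("visitDate", "1979/12/12")],
]
joined = join_cycle(reduce_with_lookahead(bag) for bag in BAGS)
```

The join step never has to re-read a bag to find its join key, and the intermediate records carry only pageRank, srcIP and adRevenue.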

21 Comparison : Pig vs RAPID+
Pig Approach | RAPID+
Multiple map-reduce cycles: N star sub graphs, N cycles | Single cycle: N star sub graphs, 1 cycle
Potential for increased I/O: (i) Disk spills (SPLIT operator); (ii) Materialization of several intermediate results due to sequential computation of star patterns | Minimized I/O: (i) Filtering in triple storage model + load-filter coalescing; (ii) Concurrent computation of star patterns (single intermediate result)
Would require advanced optimization techniques: introduce project operator to eliminate unneeded columns | Smaller intermediate result sizes: eliminate tuples and columns not necessary in future steps of processing
Not applicable | Minimize repeated tuple handling by look-ahead processing

22 Case Study
- Setup: 5-node / 20-node Hadoop clusters on NCSU's Virtual Computing Lab [13]
- Dataset: Synthetic benchmark data set [4]
- Tasks: Baseline case
  - Task A (PM): basic pattern matching (2 star patterns and a join between the stars)
  - Task B (PM+GA): pattern matching with grouping and aggregation (two look-ahead processing opportunities)

23 Experimental Results
[Figure: Cost Analysis for Task A (PM), 5-node cluster]
[Figure: Cost Analysis for Task B (PM+GA), 5-node cluster]

24 Experimental Results
Scalability Study: 5-node vs 20-node clusters
[Figure panels: 1.8GB per node; 2.8GB per node]

25 Conclusion and Ongoing work
- Promising results even for baseline case
- Further opportunities for improvement
  - First-class operators vs UDFs
  - Exploit combiners during aggregations
  - More efficient data structures for processing bags
  - Further look-ahead optimizations during multiple groupings and aggregations

26 References
[1] D. Chatziantoniou, M. Akinde, T. Johnson, and S. Kim. The MD-join: an operator for Complex OLAP. ICDE 2001, 108–121
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proc. of OSDI'04, 2004
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proc. of ACM SIGMOD 2008, p. 1099-1110
[4] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In Proc. of SIGMOD 2009
[5] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, and A. Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB 2009
[6] R. Sridhar, P. Ravindra, and K. Anyanwu. RAPID: Enabling scalable ad-hoc analytics on the semantic web. ISWC 2009
[7] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI 2008
[8] A. Newman, Y. Li, and J. Hunter. Scalable Semantics – The Silver Lining of Cloud Computing. IEEE Fourth International Conference on eScience (eScience '08), 2008
[9] A. Newman, J. Hunter, Y.-F. Li, C. Bouton, and M. Davis. A Scale-Out RDF Molecule Store for Distributed Processing of Biomedical Data. HCLS'08 at WWW 2008
[10] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. Scalable Distributed Reasoning using MapReduce. In Proceedings of ISWC 2009
[11] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. VLDB 2007
[12] E. Prud'hommeaux and A. Seaborne. SPARQL query language for RDF. Technical report, World Wide Web Consortium (2005) http://www.w3.org/TR/rdf-sparql-quer
[13] VCL Setup at NC State University, https://vcl.ncsu.edu/
[14] HiveQL, http://hadoop.apache.org/hive/
[15] JAQL, http://code.google.com/p/jaql
[16] RDF, http://www.w3.org/RDF/

27 Thank You!

