Presentation is loading. Please wait.

Presentation is loading. Please wait.

Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh.

Similar presentations


Presentation on theme: "Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh."— Presentation transcript:

1 Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

2 - 2 - What is Pig? Procedural dataflow language (Pig Latin) for Map-Reduce Provides standard relational transforms (group, join, filter, sort, etc.) Schemas are optional; if used, can be part of data or specified at run time User defined functions are first class citizens of the language

3 - 3 - An Example You have a dataset urls: (url, category, pagerank) You want to know the top 10 urls per category as measured by pagerank for sufficiently large categories: urls = load ‘dataset’ as (url, category, pagerank); grps = group urls by category; bgrps = filter grps by COUNT(urls) > 1000000; rslt = foreach bgrps generate group, top10(urls); store rslt into ‘myOutput’;

4 - 4 - Pig Latin = Sweet Spot between SQL & Map- Reduce SQLPigMap-Reduce Programming style Large blocks of declarative constraints  “Plug together pipes” Built-in data manipulations Group-by, Sort, Join, Filter, Aggregate, Top-k, etc...  Group-by, Sort Execution modelFancy; trust the query optimizer  Simple, transparent Opportunities for automatic optimization Many  Few (logic buried in map() and reduce()) Data SchemaMust be known at table creation  Not required, may be defined at runtime

5 - 5 - Building Pig Type System and Type Inference Compilation to Map-Reduce Jobs Plan Execution Streaming; supporting user provided executables Performance Measurements Project Experience

6 - 6 - Map Reduce Overview

7 - 7 - From Pig Latin to Map Reduce Parser Script A = load B = filter C = group D = foreach Logical Plan Semantic Checks Logical Plan Logical Optimizer Logical Plan Logical to Physical Translator Physical Plan Physical To MR Translator MapReduce Launcher Jar to hadoop Map-Reduce Plan Logical Plan ≈ relational algebra Plan standard optimizations Physical Plan = physical operators to be executed Map-Reduce Plan = physical operators broken into Map, Combine, and Reduce stages

8 - 8 - Pig Latin to Logical Plan A = load ‘users’ as (user, age); B = load ‘pageviews’ as (user, url); C = filter A by age < 18; D = join A by user, B by user; E = group D by url; F = foreach E generate group, CalcScore(url); store F into ‘scored_urls’; Pig LatinLogical Plan load users load pageviews filter join group foreach store 

9 - 9 - Group (tim, 17, yahoo.com) (tim, 17, ebay.com) (joe, 15, yahoo.com) D = join A by user, B by user; E = group D by url; F = foreach E generate group, CalcScore(url); join group foreach (yahoo.com, ) (tim, 17), (joe, 15) (ebay.com, (tim, 17) ) (yahoo.com, 0.95) (ebay.com, 0.90) 

10 - 10 - Join join  cogroup foreach (tim, 17, yahoo.com) (tim, 17, ebay.com) (joe, 15, yahoo.com) (tim, yahoo.com) (tim, ebay.com) (joe, yahoo.com) (tim, 17) (joe, 15) (bob, 11) (tim, (17) ) (joe, (15), (yahoo.com) ) (bob, (11), ) (yahoo.com) (ebay.com) load pageviews filter load pageviews filter

11 - 11 - Join Implementations Default is symmetric hash join Fragment-replicate for joining large and small inputs Merge join for joining inputs sorted on join key Skew join for handling inputs with significant skew in the join key

12 - 12 - Logical to Physical Plan Logical Plan load users load pageviews filter join group foreach store Physical Plan load users load pageviews filter local rearrange global rearrange foreach local rearrange global rearrange package foreach package store 

13 - 13 - Physical to Map-Reduce Plan Physical Plan load users load pageviews filter local rearrange global rearrange foreach local rearrange global rearrange package foreach package store filter local rearrange foreach package local rearrange package foreach  Map-Reduce Plan map reduce

14 - 14 - Sharing Scans load users filter out bots group by state group by demographic apply UDFs store into ‘bystate’ store into ‘bydemo’

15 - 15 - Multiple Group Map-Reduce Plan map filter local rearrange split local rearrange reduce multiplex package foreach

16 - 16 - Performance

17 - 17 - Questions


Download ppt "Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh."

Similar presentations


Ads by Google