Presentation is loading. Please wait.

Presentation is loading. Please wait.

Pig Optimization and Execution Page 1 Alan F. © Hortonworks Inc. 2011.

Similar presentations


Presentation on theme: "Pig Optimization and Execution Page 1 Alan F. © Hortonworks Inc. 2011."— Presentation transcript:

1 Pig Optimization and Execution Page 1 Alan F. © Hortonworks Inc. 2011

2 Who Am I? Pig committer and PMC Member HCatalog committer and mentor Member of ASF and Incubator PMC Co-founder of Hortonworks Author of Programming Pig from O’Reilly Photo credit: Steven Guarnaccia, The Three Little Pigs

3 Who Are You? 3

4 What Should We Optimize? Minimize scans – Hadoop is still often I/O bound Minimize total number of MR jobs Minimize shuffle size and number of shuffles Avoid spills to disk Reduce or remove skew For small jobs, minimize start-up time 4

5 Pig Deployment User machine Hadoop Cluster Pig resides on user machine or gateway Job executes on cluster No server, all optimization and planning done on the launching machine

6 Pig Guts (i.e. Pig Architecture), p. 1 6 A = LOAD ‘myfile’ AS (x, y, z); B = GROUP A by x; C = FILTER B by group > 0; D = FOREACH C GENERATE group, COUNT(A); STORE D INTO ‘output’; Pig Latin Load Group Filter Foreach Store Logical Plan AST Semantic Checks

7 Pig Guts, p. 2 7 Load Group Filter Foreach Store Logical Plan Load Filter Group Foreach Store Rule based optimizations Map Filter Rearrange Reduce Package Foreach MapReduce Plan

8 Pig Guts, p. 3 8 Map Filter Rearrange Reduce Package Foreach MapReduce Plan Map Filter Rearrange Reduce Package Foreach Combine Foreach Physical optimizations

9 It would be really cool if… 9 Map Reduce Map Reduce What’s the right join algorithm here? Even with statistics it would be hard to know. Need on the fly execution plan rewrites.

10 Memory Java + memory management = oil + water –Java types inefficient memory users (~4x disk size) –Very difficult to tell how much memory you are using Originally tried to monitor memory use via MXBeans: FAIL! Now estimate number of records we can hold in memory and spill when we exceed; allow user to tune guess 10

11 Reducing Spills to Disk Select Map size and io.sort.mb size such that 1 Map produces 1 Combiner Would be nice if Pig did this automatically Recent improvements: hash based aggregation in

12 Skew You are only as fast as your slowest reducer Data often power law distributed, means one reducer gets 10x+ the data of others Solution 1, use combiner whenever possible Solution 2, break rule that all records for a given key go to one reducer; works for order by and join 12

13 Reducing your Reducers Whenever possible use algorithms that can be done with no reduce –Fragment-replicate join –Merge join –Collected group 13

14 (De)serialization Data moves between memory and disk often Need to highly optimize, more work to be done here Need to do lazy deserialization 14

15 Faster Job Startup Should be using the distributed cache for Pig jar and UDFs For small jobs could use LocalJobRunner Need to try Tenzing approach of having a few tasks spun up and waiting for small jobs 15

16 Improved Execution Models 16 Map Reduce Map Reduce This is unnecessary. Anything that can be done in this map can be pushed to the previous reduce. Need MR*

17 Code Generation Currently Pig physical operators are packaged in jar and pieced together on the backend to construct the data pipeline Tenzing and others have tried generating code on the fly instead, have seen significant improvements Downside, need javac on client machine 17

18 Multi-store script A = load ‘users’ as (name, age, gender, city, state); B = filter A by name is not null; C1 = group B by age, gender; D1 = foreach C1 generate group, COUNT(B); store D into ‘bydemo’; C2= group B by state; D2 = foreach C2 generate group, COUNT(B); store D2 into ‘bystate’; load users filter nulls group by state group by age, gender apply UDFs store into ‘bystate’ store into ‘bydemo’

19 Multi-Store Map-Reduce Plan map filter rearrange split rearrange reduce multiplex package foreach

20 Hash Join Pages Users Users = load ‘users’ as (name, age); Pages = load ‘pages’ as (user, url); Jnd = join Users by name, Pages by user; Map 1 Pages block n Pages block n Map 2 Users block m Users block m Reducer 1 Reducer 2 (1, user) (2, name) (1, fred) (2, fred) (1, jane) (2, jane)

21 Fragment Replicate Join Users = load ‘users’ as (name, age); Pages = load ‘pages’ as (user, url); Jnd = join Pages by user, Users by name using “replicated”; Pages Users Map 1 Map 2 Users Pages block 1 Pages block 1 Pages block 2 Pages block 2

22 Skew Join Pages Users Users = load ‘users’ as (name, age); Pages = load ‘pages’ as (user, url); Jnd = join Pages by user, Users by name using “skewed”; Map 1 Pages block n Pages block n Map 2 Users block m Users block m Reducer 1 Reducer 2 (1, user) (2, name) (1, fred, p1) (1, fred, p2) (2, fred) (1, fred, p3) (1, fred, p4) (2, fred) SPSP SPSP SPSP SPSP

23 Merge Join Pages Users aaron. zach aaron. zach Users = load ‘users’ as (name, age); Pages = load ‘pages’ as (user, url); Jnd = join Pages by user, Users by name using “merge”; Map 1 Map 2 Users Pages aaron… amr aaron … amy… barb amy …

24 Learn More Read the online documentation: Programming Pig from O’Reilly Press Join the mailing lists: for user for developer Follow me

25 Questions? 25


Download ppt "Pig Optimization and Execution Page 1 Alan F. © Hortonworks Inc. 2011."

Similar presentations


Ads by Google