Pig Optimization and Execution Page 1 Alan F. © Hortonworks Inc. 2011.

Pig Optimization and Execution
Alan F. Gates, @alanfgates
© Hortonworks Inc. 2011

Who Am I?
Pig committer and PMC Member
HCatalog committer and mentor
Member of ASF and Incubator PMC
Co-founder of Hortonworks
Author of Programming Pig from O’Reilly
Photo credit: Steven Guarnaccia, The Three Little Pigs

Who Are You?

What Should We Optimize?
Minimize scans – Hadoop is still often I/O bound
Minimize total number of MR jobs
Minimize shuffle size and number of shuffles
Avoid spills to disk
Reduce or remove skew
For small jobs, minimize start-up time

Pig Deployment
[Diagram: user machine, Hadoop cluster]
Pig resides on user machine or gateway
Job executes on cluster
No server, all optimization and planning done on the launching machine

Pig Guts (i.e. Pig Architecture), p. 1
Pig Latin:
    A = LOAD 'myfile' AS (x, y, z);
    B = GROUP A by x;
    C = FILTER B by group > 0;
    D = FOREACH C GENERATE group, COUNT(A);
    STORE D INTO 'output';
[Diagram labels: Pig Latin, AST, semantic checks, logical plan]
Logical plan: Load -> Group -> Filter -> Foreach -> Store

Pig Guts, p. 2
Logical plan: Load -> Group -> Filter -> Foreach -> Store
After rule based optimizations: Load -> Filter -> Group -> Foreach -> Store
MapReduce plan: Map (Filter, Rearrange); Reduce (Package, Foreach)
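The filter in the example script tests only the group key, which is why a rule can move it ahead of the group. A minimal Python sketch of such a rule follows. The plan representation and the function name are hypothetical and chosen for illustration; this is not Pig's optimizer code.

```python
def push_filter_before_group(plan):
    """Move a Filter on the group key in front of the Group that precedes it.

    plan is a list of (operator, detail) tuples in execution order
    (a hypothetical representation, not Pig's logical plan classes).
    """
    plan = list(plan)
    for i in range(1, len(plan)):
        op, detail = plan[i]
        prev_op, prev_detail = plan[i - 1]
        if op == "Filter" and prev_op == "Group" and detail.get("column") == "group":
            # Rewrite the predicate against the pre-group column, then swap.
            pushed = ("Filter", {"column": prev_detail["key"],
                                 "predicate": detail["predicate"]})
            plan[i - 1], plan[i] = pushed, plan[i - 1]
    return plan


logical_plan = [
    ("Load", {}),
    ("Group", {"key": "x"}),
    ("Filter", {"column": "group", "predicate": "> 0"}),
    ("Foreach", {}),
    ("Store", {}),
]
optimized_plan = push_filter_before_group(logical_plan)
```

Filtering first means fewer records reach the shuffle, which matches the goal of minimizing shuffle size.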

Pig Guts, p. 3
MapReduce plan: Map (Filter, Rearrange); Reduce (Package, Foreach)
After physical optimizations: Map (Filter, Rearrange); Combine (Foreach); Reduce (Package, Foreach)
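The combine stage can be added here because COUNT can be computed in steps: count per record, sum partial counts on the map side, then sum again in the reducer. The sketch below simulates that flow in Python for the example script. The function names and sample data are assumptions for illustration, not Pig's physical operators.

```python
from collections import defaultdict


def map_phase(records):
    # Filter (x > 0) and emit (key, 1) for every surviving record.
    for x, y, z in records:
        if x > 0:
            yield x, 1


def combine(pairs):
    # Map-side partial aggregation: sum the counts per key within one map task.
    partial = defaultdict(int)
    for key, count in pairs:
        partial[key] += count
    return list(partial.items())


def reduce_phase(combined_outputs):
    # Final aggregation: sum the partial counts from all map tasks.
    totals = defaultdict(int)
    for pairs in combined_outputs:
        for key, count in pairs:
            totals[key] += count
    return dict(totals)


# Hypothetical input splits, one per map task.
split_1 = [(1, "a", "b"), (1, "c", "d"), (2, "e", "f")]
split_2 = [(1, "g", "h"), (-3, "i", "j")]

shuffled = [combine(map_phase(split)) for split in (split_1, split_2)]
result = reduce_phase(shuffled)
```

In this sample four records pass the filter, but only three (key, count) pairs cross the shuffle, and the saving grows with the number of records per key.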

It would be really cool if…
[Diagram: Map -> Reduce -> Map -> Reduce]
What’s the right join algorithm here? Even with statistics it would be hard to know.
Need on the fly execution plan rewrites.

Memory
Java + memory management = oil + water
–Java types are inefficient users of memory (~4x disk size)
–Very difficult to tell how much memory you are using
Originally tried to monitor memory use via MXBeans: FAIL!
Now estimate the number of records we can hold in memory and spill when we exceed it; allow the user to tune the guess
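The estimate-and-spill approach can be sketched as a bag that keeps at most a fixed number of records in memory and writes the rest to temporary files. This is an illustrative Python model only. The class name, the `max_records_in_memory` parameter (standing in for the tunable guess), and the use of pickle are all assumptions, not Pig's bag implementation.

```python
import pickle
import tempfile


class SpillableBag:
    """Holds records in memory up to an estimated limit, then spills to disk."""

    def __init__(self, max_records_in_memory):
        self.max_records = max_records_in_memory  # the user-tunable guess
        self.in_memory = []
        self.spill_files = []  # list of (file, record_count)

    def add(self, record):
        self.in_memory.append(record)
        if len(self.in_memory) > self.max_records:
            self._spill()

    def _spill(self):
        # Write everything currently in memory to a temp file and free the list.
        spill_file = tempfile.TemporaryFile()
        for record in self.in_memory:
            pickle.dump(record, spill_file)
        self.spill_files.append((spill_file, len(self.in_memory)))
        self.in_memory = []

    def __iter__(self):
        # Read spilled records back first, then the records still in memory.
        for spill_file, count in self.spill_files:
            spill_file.seek(0)
            for _ in range(count):
                yield pickle.load(spill_file)
        yield from self.in_memory


bag = SpillableBag(max_records_in_memory=3)
for i in range(10):
    bag.add((i, "value"))
```

Counting records rather than bytes sidesteps the problem of measuring Java object sizes, at the cost of depending on how good the guess is.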

Reducing Spills to Disk
Select Map size and io.sort.mb size such that 1 Map produces 1 Combiner
Would be nice if Pig did this automatically
Recent improvements: hash based aggregation in 0.10

Skew
You are only as fast as your slowest reducer
Data is often power law distributed, which means one reducer gets 10x+ the data of others
Solution 1: use the combiner whenever possible
Solution 2: break the rule that all records for a given key go to one reducer; works for order by and join

Reducing your Reducers
Whenever possible use algorithms that can be done with no reduce
–Fragment-replicate join
–Merge join
–Collected group

(De)serialization
Data moves between memory and disk often
Need to highly optimize, more work to be done here
Need to do lazy deserialization
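Lazy deserialization means keeping the raw serialized record and decoding fields only when an operator asks for one. A minimal Python sketch follows, assuming a tab-delimited text record. The class name, the delimiter, and the `parse_count` counter are hypothetical and exist only to make the behavior visible.

```python
class LazyTuple:
    """Wraps a raw record and splits it into fields on first access only."""

    def __init__(self, raw_line, delimiter="\t"):
        self._raw = raw_line
        self._delimiter = delimiter
        self._fields = None
        self.parse_count = 0  # instrumentation for this sketch only

    def get(self, index):
        if self._fields is None:
            # Pay the parsing cost only when a field is actually requested.
            self._fields = self._raw.split(self._delimiter)
            self.parse_count += 1
        return self._fields[index]


record = LazyTuple("fred\t25\tNY")
untouched = record.parse_count  # nothing has been parsed yet
age = record.get(1)
name = record.get(0)            # reuses the fields parsed above
```

Records that are only passed through, for example to a spill file, are never parsed at all under this scheme.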

Faster Job Startup
Should be using the distributed cache for Pig jar and UDFs
For small jobs could use LocalJobRunner
Need to try Tenzing approach of having a few tasks spun up and waiting for small jobs

Improved Execution Models
[Diagram: Map -> Reduce -> Map -> Reduce, with a note on the second Map]
This is unnecessary. Anything that can be done in this map can be pushed to the previous reduce.
Need MR*

Code Generation
Currently Pig physical operators are packaged in a jar and pieced together on the backend to construct the data pipeline
Tenzing and others have tried generating code on the fly instead, and have seen significant improvements
Downside: need javac on the client machine

Multi-store script
    A = load 'users' as (name, age, gender, city, state);
    B = filter A by name is not null;
    C1 = group B by age, gender;
    D1 = foreach C1 generate group, COUNT(B);
    store D1 into 'bydemo';
    C2 = group B by state;
    D2 = foreach C2 generate group, COUNT(B);
    store D2 into 'bystate';
[Diagram: load users -> filter nulls, then two branches: group by age, gender -> apply UDFs -> store into 'bydemo'; group by state -> apply UDFs -> store into 'bystate']

Multi-Store Map-Reduce Plan
[Diagram: map (filter, split, rearrange, rearrange); reduce (multiplex, package, foreach)]
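The point of the split in this plan is that both groupings are fed from a single scan of the input. The Python sketch below models that idea with one loop and two counters. It collapses map and reduce into one process and uses made-up sample data, so it illustrates the sharing, not how Pig executes the plan.

```python
from collections import Counter


def multi_store(users):
    """Scan the input once and feed two independent aggregations."""
    by_demo = Counter()
    by_state = Counter()
    for name, age, gender, city, state in users:  # the single shared scan
        if name is None:                          # filter: name is not null
            continue
        by_demo[(age, gender)] += 1               # group by age, gender
        by_state[state] += 1                      # group by state
    return dict(by_demo), dict(by_state)


# Hypothetical sample rows: (name, age, gender, city, state).
users = [
    ("fred", 25, "m", "Austin", "TX"),
    ("jane", 25, "f", "Dallas", "TX"),
    (None, 40, "m", "Reno", "NV"),
    ("amy", 25, "f", "Fresno", "CA"),
]
bydemo, bystate = multi_store(users)
```

Without the shared scan, the two store statements would each load and filter users separately.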

Hash Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Users by name, Pages by user;
[Diagram: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name). Reducer 1 receives (1, fred) and (2, fred); Reducer 2 receives (1, jane) and (2, jane).]
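The diagram shows each map tagging records with the number of the input they came from, and the shuffle bringing all records for a key to one reducer. A minimal Python simulation of that flow follows. The function names, the two-reducer setup, and the sample data are assumptions for illustration, not Pig's implementation.

```python
from collections import defaultdict


def map_tagged(records, tag, key_index):
    # Emit (join key, (input tag, record)) so the reducer can tell inputs apart.
    for record in records:
        yield record[key_index], (tag, record)


def shuffle(tagged_pairs, num_reducers):
    # All records with the same key land in the same reducer partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in tagged_pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_join(partition):
    # For each key, cross the Users records (tag 2) with the Pages records (tag 1).
    for key, values in partition.items():
        pages_side = [record for tag, record in values if tag == 1]
        users_side = [record for tag, record in values if tag == 2]
        for user in users_side:
            for page in pages_side:
                yield user + page


users = [("fred", 25), ("jane", 31), ("zoe", 40)]
pages = [("fred", "/a"), ("fred", "/b"), ("jane", "/c"), ("bob", "/d")]

tagged = list(map_tagged(pages, 1, 0)) + list(map_tagged(users, 2, 0))
partitions = shuffle(tagged, num_reducers=2)
joined = sorted(row for part in partitions for row in reduce_join(part))
```

Every record of both inputs crosses the shuffle here, which is the cost the specialized joins on the following slides try to avoid.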

Fragment Replicate Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using 'replicated';
[Diagram: Map 1 reads Pages block 1 and Map 2 reads Pages block 2; each map also holds a full copy of Users.]
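In a fragment replicate join the small input (Users here) is loaded whole into every map task and the large input (Pages) is streamed past it one block at a time, so no reduce phase is needed. A minimal Python sketch of one such map-side join follows, with hypothetical function names and sample data.

```python
from collections import defaultdict


def build_replicated_table(small_records, key_index):
    # Each map task loads the whole small input into an in-memory hash table.
    table = defaultdict(list)
    for record in small_records:
        table[record[key_index]].append(record)
    return table


def map_join(fragment, fragment_key_index, replicated_table):
    # Stream one block of the large input and probe the in-memory table.
    for record in fragment:
        for match in replicated_table.get(record[fragment_key_index], []):
            yield record + match


users = [("fred", 25), ("jane", 31)]
pages_block_1 = [("fred", "/a"), ("bob", "/b")]
pages_block_2 = [("jane", "/c"), ("fred", "/d")]

# Each map task builds its own copy of the table and joins its own block.
map_1_output = list(map_join(pages_block_1, 0, build_replicated_table(users, 0)))
map_2_output = list(map_join(pages_block_2, 0, build_replicated_table(users, 0)))
```

The approach only works when the replicated input fits in the memory of a single map task.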

Skew Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using 'skewed';
[Diagram: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name). Reducer 1 receives (1, fred, p1), (1, fred, p2) and (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4) and (2, fred).]
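The diagram shows the Pages records for the hot key fred split across two reducers, with the matching Users record copied to both. The Python sketch below models that partitioning. The hot-key threshold, the round-robin assignment, and all names are assumptions for illustration; they are not taken from Pig's sampler or partitioner.

```python
from collections import Counter, defaultdict
from itertools import cycle


def find_hot_keys(sample_keys, threshold):
    # Keys seen more than `threshold` times in the sample are treated as hot.
    counts = Counter(sample_keys)
    return {key for key, count in counts.items() if count > threshold}


def partition_skewed(pages, users, hot_keys, num_reducers):
    reducers = [{"pages": [], "users": []} for _ in range(num_reducers)]
    round_robin = cycle(range(num_reducers))
    for page in pages:
        key = page[0]
        # Hot keys are spread over all reducers; other keys hash as usual.
        target = next(round_robin) if key in hot_keys else hash(key) % num_reducers
        reducers[target]["pages"].append(page)
    for user in users:
        key = user[0]
        # The other input is replicated to every reducer that may hold the key.
        targets = range(num_reducers) if key in hot_keys else [hash(key) % num_reducers]
        for target in targets:
            reducers[target]["users"].append(user)
    return reducers


def join_reducer(reducer):
    by_name = defaultdict(list)
    for user in reducer["users"]:
        by_name[user[0]].append(user)
    return [page + user for page in reducer["pages"] for user in by_name[page[0]]]


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 25), ("jane", 31)]

hot_keys = find_hot_keys([page[0] for page in pages], threshold=2)
reducers = partition_skewed(pages, users, hot_keys, num_reducers=2)
joined = sorted(row for reducer in reducers for row in join_reducer(reducer))
```

Splitting the hot key is safe for a join because each Pages record still meets every matching Users record exactly once.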

Merge Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using 'merge';
[Diagram: Pages and Users are both sorted, aaron … zach. Map 1 reads Pages aaron … amr and Users starting at aaron; Map 2 reads Pages amy … barb and Users starting at amy.]
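Because both inputs are already sorted on the join key, each map can join its range by walking the two inputs in step. The Python sketch below shows only that merge step over two sorted lists. How a map task locates its starting position in Users is not modeled, and the function name and sample data are hypothetical.

```python
def merge_join(left, right, left_key=0, right_key=0):
    """Join two lists that are both sorted ascending on their join keys."""
    results = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side records sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            # Pair every left-side record for this key with that run.
            while i < len(left) and left[i][left_key] == lk:
                for match in right[j:j_end]:
                    results.append(left[i] + match)
                i += 1
            j = j_end
    return results


pages = [("aaron", "/a"), ("amy", "/b"), ("amy", "/c"), ("zach", "/d")]
users = [("aaron", 30), ("barb", 22), ("zach", 41)]
joined = merge_join(pages, users)
```

Neither input has to be held in memory or shuffled, which is why this join can run with no reduce phase.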

Learn More
Read the online documentation: http://pig.apache.org/
Programming Pig from O’Reilly Press
Join the mailing lists:
–user@pig.apache.org for user questions
–dev@pig.apache.org for developer issues
Follow me on Twitter, @alanfgates

Questions?
