How to make your map-reduce jobs perform as well as Pig: Lessons from Pig optimizations
http://pig.apache.org
Thejas Nair, Pig team @ Yahoo!, Apache Pig PMC member

What is Pig?
- Pig Latin, a high-level data processing language.
- An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig Latin cup picture from http://www.flickr.com/photos/frippy/2507970530/

Pig Latin example
    Users = load 'users' as (name, age);
    Fltrd = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Jnd = join Fltrd by name, Pages by user;

Comparison with MR in Java
- 1/20 the lines of code
- 1/16 the development time
What about performance?

Pig Compared to Map Reduce
- Faster development time
- Data flow versus programming logic
- Many standard data operations (e.g. join) included
- Manages all the details of connecting jobs and data flow
- Copes with Hadoop version change issues

And, You Don’t Lose Power
- UDFs can be used to load, evaluate, aggregate, and store data
- External binaries can be invoked
- Metadata is optional
- Flexible data model
- Nested data types
- Explicit data flow programming

Pig performance
PigMix: Pig vs. map-reduce

Pig optimization principles
- vs RDBMS: accurate models for the data, the operators and the execution environment are absent.
- Use available reliable info. Trust user choice.
- Use rules that help in most cases.
- Rules based on runtime information.

Logical Optimizations
- Restructure the given logical dataflow graph
- Apply filter, project, limit early
- Merge foreach, filter statements
- Operator rewrites
Script (A = load, B = foreach, C = filter) -> Parser -> Logical Plan (A -> B -> C) -> Logical Optimizer -> Optimized L. Plan (A -> C -> B)
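The "apply filter early" rewrite on the slide above (A -> B -> C becoming A -> C -> B) can be sketched as a small rule over a linear plan. This is an illustrative toy, not Pig's optimizer: the `Op` class, its `passthrough` and `uses` fields, and `push_filters_early` are hypothetical names, and the only rule modeled is "a filter may move ahead of a foreach if it reads only columns the foreach passes through unchanged".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str                             # "load", "foreach" or "filter"
    name: str                             # alias used in the script
    passthrough: frozenset = frozenset()  # columns a foreach emits unchanged
    uses: frozenset = frozenset()         # columns a filter reads


def push_filters_early(plan):
    """Swap a filter ahead of the foreach before it whenever that is safe."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, cur = plan[i - 1], plan[i]
            safe = cur.uses <= prev.passthrough
            if cur.kind == "filter" and prev.kind == "foreach" and safe:
                plan[i - 1], plan[i] = cur, prev
                changed = True
    return plan


plan = [
    Op("load", "A"),
    Op("foreach", "B", passthrough=frozenset({"name", "age"})),
    Op("filter", "C", uses=frozenset({"age"})),
]
optimized = [op.name for op in push_filters_early(plan)]
```

For a hand-written map-reduce job the same lesson applies directly: drop rows and columns in the mapper, before any expensive per-record work and before the shuffle.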

Physical Optimizations
- Physical plan: sequence of MR jobs having physical operators.
- Built-in rules, e.g. use of combiner
- Specified in query, e.g. join type
Optimized L. Plan: X -> Y -> Z
  -> Translator -> Phy/MR plan: M(PX-PYm) R(PYr) -> M(Z)
  -> Optimizer -> Optimized Phy/MR Plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)
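The combiner step C(PYc) in the optimized plan above pre-aggregates map output before it is shuffled. The following is a minimal in-memory simulation of that idea for a per-key count; it does not use the Hadoop API, and all function names are made up for illustration.

```python
from collections import Counter, defaultdict


def map_phase(split):
    # Emit one (key, 1) pair per input record.
    return [(key, 1) for key in split]


def combine(pairs):
    # Pre-aggregate the output of a single map task.
    partial = Counter()
    for key, n in pairs:
        partial[key] += n
    return list(partial.items())


def reduce_phase(shuffled):
    totals = defaultdict(int)
    for key, n in shuffled:
        totals[key] += n
    return dict(totals)


def run_count(splits, use_combiner):
    """Return (final counts, number of records sent through the shuffle)."""
    shuffled = []
    for split in splits:
        pairs = map_phase(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_phase(shuffled), len(shuffled)


splits = [["fred", "fred", "jane"], ["fred", "jane", "jane"]]
plain_result, plain_shuffled = run_count(splits, use_combiner=False)
combined_result, combined_shuffled = run_count(splits, use_combiner=True)
```

Both runs produce the same counts; the combined run simply moves fewer records between the map and reduce phases. This only works for aggregates that can be computed from partial results, such as counts and sums.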

Hash Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Users by name, Pages by user;
Diagram:
- Map 1 reads Pages block n and emits (1, user)
- Map 2 reads Users block m and emits (2, name)
- Reducer 1 receives (1, fred), (2, fred)
- Reducer 2 receives (1, jane), (2, jane)
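The diagram above shows each map tagging its records with the input they came from (1 for Pages, 2 for Users) so that a reducer can tell the two sides apart. Here is a rough in-memory sketch of that reduce-side join; the helper names are hypothetical and the "shuffle" is just a list of dictionaries.

```python
from collections import defaultdict


def tagged_map(records, tag, key_index):
    # Emit (join key, (input tag, record)) for every record.
    for rec in records:
        yield rec[key_index], (tag, rec)


def shuffle(pairs, num_reducers):
    # Hash-partition by key; each partition groups values per key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def join_reduce(values):
    # Separate the two inputs by tag, then pair every Pages row with every Users row.
    pages = [rec for tag, rec in values if tag == 1]
    users = [rec for tag, rec in values if tag == 2]
    return [p + u for p in pages for u in users]


def hash_join(pages, users, num_reducers=2):
    pairs = list(tagged_map(pages, 1, 0)) + list(tagged_map(users, 2, 0))
    out = []
    for partition in shuffle(pairs, num_reducers):
        for values in partition.values():
            out.extend(join_reduce(values))
    return out


pages = [("fred", "p1"), ("jane", "p2"), ("fred", "p3")]
users = [("fred", 25), ("jane", 19), ("bob", 30)]
joined = sorted(hash_join(pages, users))
```

All rows for one key land on one reducer, which is why a single very frequent key can overload that reducer.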

Skew Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using 'skewed';
Diagram:
- Map 1 reads Pages block n and emits (1, user)
- Map 2 reads Users block m and emits (2, name)
- Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred)
- Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred)
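In the diagram above, the Pages rows for the frequent key fred are split across both reducers, and the matching Users row is copied to each of them. The sketch below imitates that behaviour in memory. It makes a simplifying assumption: the set of hot keys is passed in as `hot_keys`, whereas a real system would have to discover it, for example by sampling the input. All names are illustrative.

```python
from collections import defaultdict
from itertools import count


def skew_join(pages, users, hot_keys, num_reducers=2):
    """Return (joined rows, number of output rows produced by each reducer)."""
    reducers = [
        {"pages": defaultdict(list), "users": defaultdict(list)}
        for _ in range(num_reducers)
    ]
    round_robin = count()

    # Skewed side: spread rows of hot keys across all reducers.
    for row in pages:
        key = row[0]
        if key in hot_keys:
            target = next(round_robin) % num_reducers
        else:
            target = hash(key) % num_reducers
        reducers[target]["pages"][key].append(row)

    # Other side: replicate rows of hot keys to every reducer.
    for row in users:
        key = row[0]
        if key in hot_keys:
            targets = range(num_reducers)
        else:
            targets = [hash(key) % num_reducers]
        for target in targets:
            reducers[target]["users"][key].append(row)

    joined, load = [], []
    for reducer in reducers:
        rows = [
            p + u
            for key, page_rows in reducer["pages"].items()
            for p in page_rows
            for u in reducer["users"].get(key, [])
        ]
        joined.extend(rows)
        load.append(len(rows))
    return joined, load


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4")]
users = [("fred", 25)]
joined, load = skew_join(pages, users, hot_keys={"fred"})
```

With the hot key split this way, each of the two reducers produces half of the fred rows instead of one reducer producing all of them.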

Merge Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using 'merge';
Diagram:
- Pages and Users are both sorted: aaron … zach
- Map 1 reads Pages aaron … amr and Users starting at aaron …
- Map 2 reads Pages amy … barb and Users starting at amy …
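When both inputs are already sorted on the join key, as in the diagram above, a map task can join them by walking the two inputs in step, with no shuffle or reduce phase. The following is a plain two-pointer sketch of that merge over in-memory lists; `merge_join` is a hypothetical helper, and seeking to the right starting position in the second input is left out.

```python
def merge_join(pages, users):
    """Join two lists of tuples that are both sorted on their first field."""
    out = []
    i = j = 0
    while i < len(pages) and j < len(users):
        page_key, user_key = pages[i][0], users[j][0]
        if page_key < user_key:
            i += 1
        elif page_key > user_key:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(pages) and pages[i_end][0] == page_key:
                i_end += 1
            j_end = j
            while j_end < len(users) and users[j_end][0] == page_key:
                j_end += 1
            for p in pages[i:i_end]:
                for u in users[j:j_end]:
                    out.append(p + u)
            i, j = i_end, j_end
    return out


pages = [("aaron", "p1"), ("amy", "p2"), ("amy", "p3"), ("zach", "p4")]
users = [("aaron", 30), ("amr", 22), ("amy", 41)]
joined = merge_join(pages, users)
```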

Replicated Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using 'replicated';
Diagram:
- Map 1 reads Pages aaron … amr and holds all of Users (aaron … zach)
- Map 2 reads Pages amy … barb and holds all of Users (aaron … zach)
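In the diagram above every map task holds a full copy of the small input (Users) and streams its own block of the large one (Pages). A minimal sketch of that map-side join, assuming the small input fits in memory, might look like this; `build_lookup` and `replicated_join_map` are illustrative names only.

```python
from collections import defaultdict


def build_lookup(users):
    # Load the small input once per map task, indexed by join key.
    lookup = defaultdict(list)
    for row in users:
        lookup[row[0]].append(row)
    return lookup


def replicated_join_map(pages_split, lookup):
    # Stream one block of the large input and probe the in-memory table.
    for page in pages_split:
        for user in lookup.get(page[0], []):
            yield page + user


users = [("aaron", 30), ("amy", 41), ("zach", 52)]
lookup = build_lookup(users)
map1_out = list(replicated_join_map([("aaron", "p1"), ("amr", "p2")], lookup))
map2_out = list(replicated_join_map([("amy", "p3"), ("barb", "p4")], lookup))
```

No reduce phase is needed, at the cost of loading the small input in every map task.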

Group/cogroup optimizations
- On sorted and 'collected' data
    grp = group Users by name using 'collected';
Diagram:
- Pages: aaron, barney, carol, …, zach
- Map 1 reads aaron, barney
- Map 2 reads carol …
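If all records for a key sit next to each other inside a single input split, as the diagram above suggests, grouping can be finished in the map task. The sketch below shows the idea with `itertools.groupby`, which groups only adjacent equal keys; `collected_group` is a made-up name, and the contiguity of keys is an assumption the caller must guarantee.

```python
from itertools import groupby
from operator import itemgetter


def collected_group(split):
    # Assumes every key's records are contiguous within this split.
    for key, rows in groupby(split, key=itemgetter(0)):
        yield key, list(rows)


map1_split = [("aaron", 30), ("aaron", 31), ("barney", 45)]
map1_groups = list(collected_group(map1_split))
```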

Multi-store script
    A = load 'users' as (name, age, gender, city, state);
    B = filter A by name is not null;
    C1 = group B by age, gender;
    D1 = foreach C1 generate group, COUNT(B);
    store D1 into 'bydemo';
    C2 = group B by state;
    D2 = foreach C2 generate group, COUNT(B);
    store D2 into 'bystate';
Plan:
- A: load -> B: filter
- B -> C1: group -> D1: eval udf -> store into 'bydemo'
- B -> C2: group -> D2: eval udf -> store into 'bystate'

Multi-Store Map-Reduce Plan
- map: filter -> split -> local rearrange, local rearrange
- reduce: multiplex -> package -> foreach
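The plan above reads and filters the input once, splits the surviving rows into one keyed stream per store, and lets the reducer route each stream to its own aggregation. A hand-written job can copy this by tagging each emitted key with the branch it belongs to. The sketch below is an in-memory approximation; the branch tags reuse the output names from the script, and the function names are hypothetical.

```python
from collections import defaultdict


def multi_store_map(users):
    for name, age, gender, city, state in users:
        if name is None:                      # shared filter, applied once
            continue
        yield ("bydemo", (age, gender)), 1    # branch 1: group by age, gender
        yield ("bystate", state), 1           # branch 2: group by state


def multi_store_reduce(pairs):
    # Route each record to the aggregation of its branch, then count per key.
    outputs = {"bydemo": defaultdict(int), "bystate": defaultdict(int)}
    for (branch, key), n in pairs:
        outputs[branch][key] += n
    return {branch: dict(counts) for branch, counts in outputs.items()}


users = [
    ("fred", 25, "m", "SF", "CA"),
    ("jane", 25, "f", "LA", "CA"),
    (None, 40, "m", "NYC", "NY"),
    ("amy", 25, "f", "NYC", "NY"),
]
results = multi_store_reduce(multi_store_map(users))
```

The input is scanned once for both outputs instead of once per output.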

Memory Management
- Use disk if large objects don’t fit into memory.
- JVM limit > physical memory: very poor performance.
- Spill on memory threshold notification from JVM: unreliable.
- Pre-set limit for large bags.
- Custom spill logic for different bags, e.g. distinct bag.
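The "pre-set limit for large bags" idea above can be sketched as a container that keeps at most a fixed number of items in memory and writes the rest to temporary files. This is a simplified illustration, not Pig's bag implementation: `SpillableBag` and `max_in_memory` are hypothetical names, and the limit counts items rather than bytes.

```python
import pickle
import tempfile


class SpillableBag:
    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_files = []

    def add(self, item):
        self.memory.append(item)
        if len(self.memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the in-memory items to a temp file and free the list.
        spill = tempfile.TemporaryFile()
        for item in self.memory:
            pickle.dump(item, spill)
        self.spill_files.append(spill)
        self.memory = []

    def __iter__(self):
        # Replay spilled items in insertion order, then the in-memory tail.
        for spill in self.spill_files:
            spill.seek(0)
            while True:
                try:
                    yield pickle.load(spill)
                except EOFError:
                    break
        yield from self.memory


bag = SpillableBag(max_in_memory=3)
for value in range(8):
    bag.add(value)
contents = list(bag)
```

A fixed limit behaves predictably, unlike waiting for a low-memory notification that may arrive too late.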

Other optimizations
- Aggressive use of combiner, secondary sort
- Lazy deserialization in loaders
- Better serialization format
- Faster regex lib, compiled pattern
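Lazy deserialization, listed above, means a loader hands out records without parsing them until a field is actually read, so rows that are filtered out early cost almost nothing. Here is a small sketch under the assumption of tab-delimited text input; `LazyTuple` and its `parse_count` counter are invented for the illustration.

```python
class LazyTuple:
    def __init__(self, raw_line, delimiter="\t"):
        self._raw = raw_line
        self._delimiter = delimiter
        self._fields = None
        self.parse_count = 0   # how many times the line was actually split

    def get(self, index):
        # Parse on first access only, then reuse the parsed fields.
        if self._fields is None:
            self._fields = self._raw.split(self._delimiter)
            self.parse_count += 1
        return self._fields[index]


record = LazyTuple("fred\t25\tSF")
parses_before_access = record.parse_count
age = record.get(1)
name = record.get(0)
```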

Future optimization work
- Improve memory management
- Join + group in single MR, if same keys used
- Even better skew handling
- Adaptive optimizations
- Automated Hadoop tuning
- …

Pig - fast and flexible
More flexibility in 0.8, 0.9:
- UDFs in scripting languages (Python)
- MR job as relation
- Relation as scalar
- Turing complete Pig (0.9)
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Further reading
- Docs: http://pig.apache.org/docs/r0.7.0/
- Papers and talks: http://wiki.apache.org/pig/PigTalksPapers
- Training videos on vimeo.com (search 'hadoop pig')
