
1 Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai (@daijy), Thejas Nair (@thejasn)
© Hortonworks Inc. 2011

2 What is Apache Pig?
– Pig Latin, a high-level data processing language.
– An engine that executes Pig Latin locally or on a Hadoop cluster.
(Pig Latin cup pic from http://www.flickr.com/photos/frippy/2507970530/)

3 Pig Latin example
Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;

4 Why Pig?
Faster development
– Fewer lines of code
– Don't re-invent the wheel
Flexible
– Metadata is optional
– Extensible
– Procedural programming
(Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/)

5 Pig optimizations
Ideally the user should not have to bother.
Reality:
– Pig is still young and immature
– Pig does not have the whole picture
  – Cluster configuration
  – Data histogram
– Pig philosophy: Pig is docile

6 Pig optimizations
What Pig does for you:
– Performs safe transformations of the query
– Provides optimized operations (join, sort)
What you do:
– Organize input in an optimal way
– Optimize the Pig Latin query
– Tell Pig which join/group algorithm to use

7 Rule based optimizer
– Column pruner
– Push up filter
– Push down flatten
– Push up limit
– Partition pruning
– Global optimizer

8 Column Pruner
Pig will do column pruning automatically:
A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';
Pig prunes a2 automatically here.
Case where Pig will not do column pruning automatically: no schema specified in the load statement.
A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';
DIY:
A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';

9 Column Pruner
Another case where Pig does not do column pruning: Pig does not keep track of unused columns after grouping.
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';
DIY:
A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';

10 Push up filter
Pig splits the filter condition before pushing it.
– Original query: join A and B, then Filter (a0>0 && b0>10).
– Split filter condition: the conjunction becomes two filters, a0>0 and b0>10, still after the join.
– Push up filter: a0>0 moves above the join onto input A, and b0>10 onto input B, so the join processes fewer records.
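A minimal Pig Latin sketch of this rewrite (relation and field names are illustrative, not from the slides):
-- before the optimization
A = load 'A' as (a0);
B = load 'B' as (b0);
J = join A by a0, B by b0;
F = filter J by A::a0 > 0 and B::b0 > 10;
-- what the optimizer effectively produces
A1 = filter A by a0 > 0;
B1 = filter B by b0 > 10;
J1 = join A1 by a0, B1 by b0;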

11 Other push up/down
Push down flatten: Load → Flatten → Order becomes Load → Order → Flatten, so the sort runs before flatten multiplies the rows.
A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';
Push up limit: Load → Foreach → Limit becomes Load → Limit → Foreach, and finally Load (limited) → Foreach, with the limit folded into the load. Likewise Load → Order → Limit becomes Load → Order (limited), a sort that keeps only the top rows.
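A small example of the limit push-up (schema is illustrative):
A = load 'input' as (a0, a1);
B = foreach A generate a0 + a1;
C = limit B 10;
-- Pig moves the limit ahead of the foreach and into the loader,
-- so only about 10 records need to be read and projected.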

12 Partition pruning
Prune unnecessary partitions entirely (via HCatLoader).
– Without pruning: HCatLoader reads the 2010, 2011 and 2012 partitions, then Filter (year>=2011) throws 2010 away.
– With pruning: the filter is pushed into HCatLoader (year>=2011), and the 2010 partition is never read.
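A hedged sketch of what this looks like in Pig Latin (the table name and partition column are illustrative; the HCatLoader package path varies across HCatalog versions):
raw = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
recent = filter raw by year >= 2011;
-- the filter on the partition column is pushed into the loader,
-- so files under the year=2010 partition are never opened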

13 Intermediate file compression
A Pig script runs as a chain of MapReduce jobs: map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3.
– Intermediate files between map and reduce: Snappy.
– Temp files between MapReduce jobs: no compression by default.

14 Enable temp file compression
Pig temp files are not compressed by default:
– Issues with Snappy (HADOOP-7990)
– LZO: not an Apache-compatible license
Enable LZO compression:
– Install LZO for Hadoop
– In conf/pig.properties:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo
With LZO: up to 90%+ disk saving and up to 4x query speedup.

15 Multiquery
Combines two or more MapReduce jobs into one.
– Happens automatically
– Cases where we want to control multiquery: it combines too many jobs
Example plan: a single Load feeds Group by $0, Group by $1 and Group by $2, each followed by a Foreach and a Store; multiquery runs all three pipelines in one job.

16 Control multiquery
Disable multiquery: command line option -M.
Use "exec" to mark the boundary:
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';

17 Implement the right UDF
Algebraic UDF: evaluated in three phases
– Initial: runs in the map
– Intermediate: runs in the combiner
– Final: runs in the reduce
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';

18 Implement the right UDF
Accumulator UDF:
– Reduce-side UDF
– Normally takes a bag
Benefit:
– Big bags are passed in batches
– Avoids using too much memory
– Batch size: pig.accumulative.batchsize=20000
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';
public class my_accum implements Accumulator<Long> {
  private long sum = 0;
  public void accumulate(Tuple b) { /* called once per chunk of the bag */ }
  public Long getValue() { return sum; } // called after all bag chunks are processed
  public void cleanup() { sum = 0; } // reset state between keys
}

19 Memory optimization
Control bag size on the reduce side:
– If a bag's size exceeds the threshold, it spills to disk
– Tune the threshold so bags fit in memory when possible
In MapReduce the reducer receives reduce(Text key, Iterator values, ...); Pig materializes that iterator into one bag per grouped input (bag of input 1, input 2, input 3, ...).
pig.cachedbag.memusage=0.2

20 Optimization starts before Pig
– Input format
– Serialization format
– Compression

21 Input format - test query
> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, ….)

22 Input formats

23 Columnar format
RCFile: a columnar format for a group of rows.
More efficient if you query a subset of columns.

24 Tests with RCFile
Tests with load + project + filter out all records, using HCatalog, with compression and types.
– Test 1: project 1 out of 5 columns
– Test 2: project all 5 columns
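A hedged sketch of the projection test in Pig Latin (table and column names are illustrative):
t = load 'rc_table' using org.apache.hcatalog.pig.HCatLoader();
one_col = foreach t generate col1; -- RCFile deserializes only 1 of the 5 columns
all_cols = foreach t generate col1, col2, col3, col4, col5; -- reads everything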

25 RCFile test results

26 Cost based optimizations
Optimization decisions based on your query/data.
Often an iterative process: run query → measure → tune.

27 Cost based optimization - Aggregation
Hash Based Aggregation (HBA): inside the map task, the map logic feeds a hash-based aggregator, which partially aggregates the map output before it reaches the reduce task.
Use pig.exec.mapPartAgg=true to enable.

28 Cost based optimization – Hash Agg.
Auto-off feature: switches HBA off if the output reduction is not good enough.
Configuring Hash Agg:
– Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
– Configure the memory used: pig.cachedbag.memusage
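A hedged pig.properties sketch pulling these knobs together (the numbers are illustrative, not recommendations):
pig.exec.mapPartAgg = true
# switch HBA off unless map output shrinks by at least this factor
pig.exec.mapPartAgg.minReduction = 10
# fraction of heap the in-map aggregation cache may use
pig.cachedbag.memusage = 0.2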

29 Cost based optimization - Join
Use the appropriate join algorithm:
– Skew on the join key: skew join
– One input fits in memory: FR (fragment-replicate) join
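A short Pig Latin sketch of the two algorithms (aliases are illustrative):
-- 'skewed': samples the key distribution and spreads hot keys across reducers
J1 = join big_left by key, big_right by key using 'skewed';
-- 'replicated': the second (small) input is loaded into memory on every map
J2 = join big_left by key, small by key using 'replicated';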

30 Cost based optimization – MR tuning
Tune MapReduce parameters to reduce IO:
– Control spills using the map-side sort parameters
– Tune the reduce-side shuffle/sort-merge parameters
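A hedged example of the Hadoop 1.x knobs this refers to (parameter choice and values are illustrative assumptions, not from the slides):
# map side: bigger sort buffer and later spill threshold mean fewer spills
io.sort.mb = 256
io.sort.factor = 100
io.sort.spill.percent = 0.90
# reduce side: shuffle and in-memory merge buffers
mapred.job.shuffle.input.buffer.percent = 0.70
mapred.inmem.merge.threshold = 0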

31 Parallelism of reduce tasks
Example: number of reduce slots = 6.
Factors affecting runtime:
– Cores simultaneously used / skew
– Cost of having additional reduce tasks
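How reduce parallelism is set in Pig Latin (the value 6 just mirrors the slide's slot count):
set default_parallel 6;
-- or per operator:
B = group A by $0 parallel 6;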

32 Cost based optimization – keep data sorted
For frequent join operations on the same keys:
– Keep the data sorted on those keys
– Use merge join
– Group on sorted keys is also optimized
– Works with only a few load functions; needs an additional interface implementation
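A hedged sketch of the sorted-data operators (aliases illustrative; the load function must implement the required interfaces, e.g. CollectableLoadFunc for collected group):
C = join A by key, B by key using 'merge'; -- both inputs pre-sorted on key, no shuffle for the join
D = group A by key using 'collected'; -- all records for a key arrive together, so grouping is map-side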

33 Optimizations for sorted data

34 Future directions
– Optimize using stats
– Use historical stats with HCatalog
– Sampling

35 Questions?
