Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai, Thejas Nair
© Hortonworks Inc
What is Apache Pig?
– Pig Latin, a high-level data processing language
– An engine that executes Pig Latin locally or on a Hadoop cluster
Pig Latin example
Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
Why Pig?
Faster development
– Fewer lines of code
– Don't re-invent the wheel
Flexible
– Metadata is optional
– Extensible
– Procedural programming
Pig optimizations
Ideally the user should not have to bother
Reality
– Pig is still young and immature
– Pig does not have the whole picture
  – Cluster configuration
  – Data histogram
– Pig philosophy: Pig is docile
Pig optimizations
What Pig does for you
– Safe transformations of the query to optimize it
– Optimized operations (join, sort)
What you do
– Organize input in an optimal way
– Optimize the Pig Latin query
– Tell Pig which join/group algorithm to use
Rule-based optimizer
– Column pruner
– Push up filter
– Push down flatten
– Push up limit
– Partition pruning
– Global optimizer
Column Pruner
Pig will do column pruning automatically when a schema is given:

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
Store C into 'output';

Here Pig will prune a2 automatically.

Cases Pig will not do column pruning automatically:
– No schema specified in the load statement:

A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
Store C into 'output';

DIY:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
Store C into 'output';
Column Pruner
Another case where Pig does not do column pruning:
– Pig does not keep track of unused columns after grouping

A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
Store C into 'output';

DIY:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate a0, a1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
Store C into 'output';
Push up filter
Pig splits the filter condition before pushing it:
– Original query: Join(A, B) followed by Filter(a0>0 && b0>10)
– Split filter condition: Filter(a0>0) and Filter(b0>10)
– Push up filter: Filter(a0>0) applied to A and Filter(b0>10) applied to B, before the Join
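The rewrite can be sketched in Pig Latin; the relation names and join keys here are illustrative:

```pig
A = load 'A' as (a0, a1);
B = load 'B' as (b0, b1);
J = join A by a1, B by b1;
F = filter J by a0 > 0 and b0 > 10;
-- Pig rewrites this so that A is filtered by a0 > 0 and B by b0 > 10
-- before the join, shrinking the data that is shuffled.
```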
Other push up/down
Push down flatten:

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
Store C into 'output';

– Load → Flatten → Order becomes Load → Order → Flatten: fewer records go through the sort

Push up limit:
– Load → Foreach → Limit becomes Load → Limit → Foreach, and then Load(limited) → Foreach if the loader supports it
– Load → Order → Limit becomes Load → Order(limited)
Partition pruning
Prune unnecessary partitions entirely (HCatLoader):
– HCatLoader followed by Filter(year>=2011) becomes HCatLoader(year>=2011)
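A sketch, assuming a hypothetical HCatalog table 'logs' partitioned on year (the HCatLoader package name shown is the one in use at the time; it later moved under org.apache.hive):

```pig
A = load 'logs' using org.apache.hcatalog.pig.HCatLoader();
B = filter A by year >= 2011;
-- the filter on the partition column is pushed into HCatLoader,
-- so partitions with year < 2011 are never read
```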
Intermediate file compression
Pig Script → map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3
– Intermediate files between map and reduce: Snappy
– Temp files between MapReduce jobs: no compression by default
Enable temp file compression
Pig temp files are not compressed by default
– Issues with Snappy (HADOOP-7990)
– LZO: not an Apache license
Enable LZO compression:
– Install LZO for Hadoop
– In conf/pig.properties:

pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo

– With LZO: over 90% disk saving and up to a 4x query speed-up
Multiquery
Combine two or more MapReduce jobs into one
– Happens automatically
– Cases where we want to control multiquery: when it combines too many
Example: one Load feeding Group by $0, Group by $1, and Group by $2, each followed by a Foreach and a Store
Control multiquery
Disable multiquery:
– Command line option: -M
Use "exec" to mark the boundary:

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
Store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
Store C1 into 'output1';

exec

B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
Store C2 into 'output2';
Implement the right UDF
Algebraic UDF
– Initial: runs in the map
– Intermediate: runs in the combiner
– Final: runs in the reduce

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
Store C0 into 'output0';
Implement the right UDF
Accumulator UDF
– Reduce-side UDF
– Normally takes a bag
Benefit
– Big bags are passed in batches
– Avoids using too much memory
– Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
Store C0 into 'output0';

public class my_accum implements Accumulator<Long> {
  private long result = 0;
  public void accumulate(Tuple b) {
    // takes one chunk of the bag
  }
  public Long getValue() {
    // called after all bag chunks are processed
    return result;
  }
  public void cleanup() { result = 0; }
}
Memory optimization
Control bag size on the reduce side
– If the bag size exceeds a threshold, spill to disk
– Control the bag size to fit the bag in memory if possible
– pig.cachedbag.memusage=0.2

MapReduce: reduce(Text key, Iterator values, …); the Iterator is materialized into the bags of input 1, 2, and 3
Optimization starts before Pig
– Input format
– Serialization format
– Compression
Input format: test query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
( , thejasminesupperclub, …)
Input formats
Columnar format
RCFile
– Columnar format for a group of rows
– More efficient if you query a subset of columns
Tests with RCFile
Tests with load + project + filter out all records
– Using HCatalog, with compression and types
Test 1: project 1 out of 5 columns
Test 2: project all 5 columns
RCFile test results
Cost based optimizations
Optimization decisions based on your query/data
Often an iterative process: run query → measure → tune
Cost based optimization - Aggregation
Hash Based Aggregation (HBA)
– Use pig.exec.mapPartAgg=true to enable
– In the map task, each Map (logic) step feeds an HBA step that produces the map output sent to the reduce task
Cost based optimization – Hash Agg.
Configuring Hash Agg:
– The auto-off feature switches off HBA if the output reduction is not good enough
– Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
– Configure the memory used: pig.cachedbag.memusage
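Pulled together, a conf/pig.properties fragment might look like this (the values are illustrative, not recommendations):

```properties
pig.exec.mapPartAgg = true
# turn HBA off unless it shrinks map output by at least this factor
pig.exec.mapPartAgg.minReduction = 10
# fraction of memory available for cached bags
pig.cachedbag.memusage = 0.2
```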
Cost based optimization - Join
Use the appropriate join algorithm:
– Skew on the join key: skew join
– One input fits in memory: FR (fragment-replicate) join
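Both are selected with a using clause; the relation and key names here are illustrative:

```pig
-- skew join: tolerates a skewed distribution of uid values
J1 = join L by uid, R by uid using 'skewed';
-- FR join: every input after the first is replicated into memory
-- on each map task, so the small relation goes last
J2 = join big by uid, small by uid using 'replicated';
```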
Cost based optimization – MR tuning
Tune MapReduce parameters to reduce IO:
– Control spills using the map-side sort parameters
– Tune the shuffle/sort-merge parameters on the reduce side
Parallelism of reduce tasks
Example: number of reduce slots = 6
Factors affecting runtime:
– Number of cores simultaneously used / skew
– Cost of having additional reduce tasks
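Reduce parallelism can be set script-wide or per operator; the values are illustrative:

```pig
set default_parallel 6;           -- default for every reduce stage in the script
B = group A by $0 parallel 12;    -- override for a single operator
```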
Cost based optimization – keep data sorted
– Frequent join operations on the same keys
– Keep the data sorted on those keys
– Use merge join
– Optimized group on sorted keys
– Works with a few load functions; needs an additional interface implementation
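A merge join sketch, assuming both inputs are already stored sorted on uid (file and field names are illustrative):

```pig
A = load 'sorted_a' as (uid, a1);
B = load 'sorted_b' as (uid, b1);
J = join A by uid, B by uid using 'merge';
-- runs map-side over the pre-sorted inputs; no shuffle of A and B
```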
Optimizations for sorted data
Future Directions
– Optimize using stats
– Using historical stats with HCatalog
– Sampling
Questions?