
1 Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai (@daijy), Thejas Nair (@thejasn)
© Hortonworks Inc. 2011

2 What is Apache Pig?
– Pig Latin, a high-level data processing language.
– An engine that executes Pig Latin locally or on a Hadoop cluster.
(Pig Latin cup pic from http://www.flickr.com/photos/frippy/2507970530/)

3 Pig Latin example
Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;

4 Why Pig?
Faster development
– Fewer lines of code
– Don't re-invent the wheel
Flexible
– Metadata is optional
– Extensible
– Procedural programming
(Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/)

5 Pig optimizations
Ideally the user should not have to bother.
Reality:
– Pig is still young and immature
– Pig does not have the whole picture
  – Cluster configuration
  – Data histogram
– Pig philosophy: Pig is docile

6 Pig optimizations
What Pig does for you:
– Performs safe transformations of the query
– Provides optimized operations (join, sort)
What you do:
– Organize input in an optimal way
– Optimize the Pig Latin query
– Tell Pig which join/group algorithm to use

7 Rule based optimizer
– Column pruner
– Push up filter
– Push down flatten
– Push up limit
– Partition pruning
– Global optimizer

8 Column Pruner
Pig will do column pruning automatically:
A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
store C into 'output';
Pig prunes a2 automatically here.
Case where Pig will not do column pruning automatically: no schema specified in the load statement.
A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
store C into 'output';
DIY:
A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
store C into 'output';

9 Column Pruner
Another case where Pig does not do column pruning: Pig does not keep track of unused columns after grouping.
A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
store C into 'output';
DIY:
A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
store C into 'output';

10 Push up filter
Pig splits the filter condition before pushing it.
– Original query: join A and B, then Filter (a0>0 && b0>10).
– Split filter condition: the conjunction becomes two filters, a0>0 and b0>10, still after the join.
– Push up filter: a0>0 moves above the join onto input A, and b0>10 onto input B, so the join processes fewer records.
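A minimal Pig Latin sketch of this rewrite (relation and field names are illustrative, not from the slides):
-- before the optimization
A = load 'A' as (a0);
B = load 'B' as (b0);
J = join A by a0, B by b0;
F = filter J by A::a0 > 0 and B::b0 > 10;
-- what the optimizer effectively produces
A1 = filter A by a0 > 0;
B1 = filter B by b0 > 10;
J1 = join A1 by a0, B1 by b0;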

11 Other push up/down
Push down flatten: Load → Flatten → Order becomes Load → Order → Flatten, so the sort runs before flatten multiplies the rows.
A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
store C into 'output';
Push up limit: Load → Foreach → Limit becomes Load → Limit → Foreach, and finally Load (limited) → Foreach, with the limit folded into the load. Likewise Load → Order → Limit becomes Load → Order (limited), a sort that keeps only the top rows.
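A small example of the limit push-up (schema is illustrative):
A = load 'input' as (a0, a1);
B = foreach A generate a0 + a1;
C = limit B 10;
-- Pig moves the limit ahead of the foreach and into the loader,
-- so only about 10 records need to be read and projected.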

12 Partition pruning
Prune unnecessary partitions entirely (via HCatLoader).
– Without pruning: HCatLoader reads the 2010, 2011 and 2012 partitions, then Filter (year>=2011) throws 2010 away.
– With pruning: the filter is pushed into HCatLoader (year>=2011), and the 2010 partition is never read.
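A hedged sketch of what this looks like in Pig Latin (the table name and partition column are illustrative; the HCatLoader package path varies across HCatalog versions):
raw = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
recent = filter raw by year >= 2011;
-- the filter on the partition column is pushed into the loader,
-- so files under the year=2010 partition are never opened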

13 Intermediate file compression
A Pig script runs as a chain of MapReduce jobs: map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3.
– Intermediate files between map and reduce: Snappy.
– Temp files between MapReduce jobs: no compression by default.

14 Enable temp file compression
Pig temp files are not compressed by default:
– Issues with Snappy (HADOOP-7990)
– LZO: not an Apache-compatible license
Enable LZO compression:
– Install LZO for Hadoop
– In conf/pig.properties:
pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo
With LZO: up to 90%+ disk saving and up to 4x query speedup.

15 Multiquery
Combines two or more MapReduce jobs into one.
– Happens automatically
– Cases where we want to control multiquery: it combines too many jobs
Example plan: a single Load feeds Group by $0, Group by $1 and Group by $2, each followed by a Foreach and a Store; multiquery runs all three pipelines in one job.

16 Control multiquery
Disable multiquery: command line option -M.
Use "exec" to mark the boundary:
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
store C2 into 'output2';

17 Implement the right UDF
Algebraic UDF: evaluated in three phases
– Initial: runs in the map
– Intermediate: runs in the combiner
– Final: runs in the reduce
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
store C0 into 'output0';

18 Implement the right UDF
Accumulator UDF:
– Reduce-side UDF
– Normally takes a bag
Benefit:
– Big bags are passed in batches
– Avoids using too much memory
– Batch size: pig.accumulative.batchsize=20000
A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
store C0 into 'output0';
public class my_accum implements Accumulator<Long> {
  private long sum = 0;
  public void accumulate(Tuple b) { /* called once per chunk of the bag */ }
  public Long getValue() { return sum; } // called after all bag chunks are processed
  public void cleanup() { sum = 0; } // reset state between keys
}

19 Memory optimization
Control bag size on the reduce side:
– If a bag's size exceeds the threshold, it spills to disk
– Tune the threshold so bags fit in memory when possible
In MapReduce the reducer receives reduce(Text key, Iterator values, ...); Pig materializes that iterator into one bag per grouped input (bag of input 1, input 2, input 3, ...).
pig.cachedbag.memusage=0.2

20 Optimization starts before Pig
– Input format
– Serialization format
– Compression

21 Input format - test query
> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, ….)

22 Input formats

23 Columnar format
RCFile: a columnar format for a group of rows.
More efficient if you query a subset of columns.

24 Tests with RCFile
Tests with load + project + filter out all records, using HCatalog, with compression and types.
– Test 1: project 1 out of 5 columns
– Test 2: project all 5 columns
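A hedged sketch of the projection test in Pig Latin (table and column names are illustrative):
t = load 'rc_table' using org.apache.hcatalog.pig.HCatLoader();
one_col = foreach t generate col1; -- RCFile deserializes only 1 of the 5 columns
all_cols = foreach t generate col1, col2, col3, col4, col5; -- reads everything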

25 RCFile test results

26 Cost based optimizations
Optimization decisions based on your query/data.
Often an iterative process: run query → measure → tune.

27 Cost based optimization - Aggregation
Hash Based Aggregation (HBA): inside the map task, the map logic feeds a hash-based aggregator, which partially aggregates the map output before it reaches the reduce task.
Use pig.exec.mapPartAgg=true to enable.

28 Cost based optimization – Hash Agg.
Auto-off feature: switches HBA off if the output reduction is not good enough.
Configuring Hash Agg:
– Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
– Configure the memory used: pig.cachedbag.memusage
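A hedged pig.properties sketch pulling these knobs together (the numbers are illustrative, not recommendations):
pig.exec.mapPartAgg = true
# switch HBA off unless map output shrinks by at least this factor
pig.exec.mapPartAgg.minReduction = 10
# fraction of heap the in-map aggregation cache may use
pig.cachedbag.memusage = 0.2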

29 Cost based optimization - Join
Use the appropriate join algorithm:
– Skew on the join key: skew join
– One input fits in memory: FR (fragment-replicate) join
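A short Pig Latin sketch of the two algorithms (aliases are illustrative):
-- 'skewed': samples the key distribution and spreads hot keys across reducers
J1 = join big_left by key, big_right by key using 'skewed';
-- 'replicated': the second (small) input is loaded into memory on every map
J2 = join big_left by key, small by key using 'replicated';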

30 Cost based optimization – MR tuning
Tune MapReduce parameters to reduce IO:
– Control spills using the map-side sort parameters
– Tune the reduce-side shuffle/sort-merge parameters
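A hedged example of the Hadoop 1.x knobs this refers to (parameter choice and values are illustrative assumptions, not from the slides):
# map side: bigger sort buffer and later spill threshold mean fewer spills
io.sort.mb = 256
io.sort.factor = 100
io.sort.spill.percent = 0.90
# reduce side: shuffle and in-memory merge buffers
mapred.job.shuffle.input.buffer.percent = 0.70
mapred.inmem.merge.threshold = 0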

31 Parallelism of reduce tasks
Example: number of reduce slots = 6.
Factors affecting runtime:
– Cores simultaneously used / skew
– Cost of having additional reduce tasks
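How reduce parallelism is set in Pig Latin (the value 6 just mirrors the slide's slot count):
set default_parallel 6;
-- or per operator:
B = group A by $0 parallel 6;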

32 Cost based optimization – keep data sorted
For frequent join operations on the same keys:
– Keep the data sorted on those keys
– Use merge join
– Group on sorted keys is also optimized
– Works with only a few load functions; needs an additional interface implementation
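A hedged sketch of the sorted-data operators (aliases illustrative; the load function must implement the required interfaces, e.g. CollectableLoadFunc for collected group):
C = join A by key, B by key using 'merge'; -- both inputs pre-sorted on key, no shuffle for the join
D = group A by key using 'collected'; -- all records for a key arrive together, so grouping is map-side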

33 Optimizations for sorted data

34 Future directions
– Optimize using stats
– Use historical stats with HCatalog
– Sampling

35 Questions?
