Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai, Thejas Nair
© Hortonworks Inc
What is Apache Pig?
– Pig Latin, a high-level data processing language
– An engine that executes Pig Latin locally or on a Hadoop cluster
Pig Latin example
Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;
Why Pig?
Faster development
– Fewer lines of code
– Don't re-invent the wheel
Flexible
– Metadata is optional
– Extensible
– Procedural programming
Pig optimizations
Ideally the user should not have to bother
Reality
– Pig is still young and immature
– Pig does not have the whole picture
  – Cluster configuration
  – Data histogram
– Pig philosophy: Pig is docile
Pig optimizations
What Pig does for you
– Safe transformations of the query to optimize it
– Optimized operations (join, sort)
What you do
– Organize input in an optimal way
– Optimize the Pig Latin query
– Tell Pig which join/group algorithm to use
Rule-based optimizer
– Column pruner
– Push up filter
– Push down flatten
– Push up limit
– Partition pruning
– Global optimizer
Column Pruner
Pig will do column pruning automatically when a schema is given:

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
Store C into 'output';

Here Pig will prune a2 automatically.

Cases Pig will not do column pruning automatically:
– No schema specified in the load statement:

A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
Store C into 'output';

DIY:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
Store C into 'output';
Column Pruner
Another case where Pig does not do column pruning:
– Pig does not keep track of unused columns after grouping

A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
Store C into 'output';

DIY:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate a0, a1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
Store C into 'output';
Push up filter
Pig splits the filter condition before pushing it:
– Original query: Join(A, B) followed by Filter(a0>0 && b0>10)
– Split filter condition: Filter(a0>0) and Filter(b0>10)
– Push up filter: Filter(a0>0) applied to A and Filter(b0>10) applied to B, before the Join
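The rewrite can be sketched in Pig Latin; the relation names and join keys here are illustrative:

```pig
A = load 'A' as (a0, a1);
B = load 'B' as (b0, b1);
J = join A by a1, B by b1;
F = filter J by a0 > 0 and b0 > 10;
-- Pig rewrites this so that A is filtered by a0 > 0 and B by b0 > 10
-- before the join, shrinking the data that is shuffled.
```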
Other push up/down
Push down flatten:

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
Store C into 'output';

– Load → Flatten → Order becomes Load → Order → Flatten: fewer records go through the sort

Push up limit:
– Load → Foreach → Limit becomes Load → Limit → Foreach, and then Load(limited) → Foreach if the loader supports it
– Load → Order → Limit becomes Load → Order(limited)
Partition pruning
Prune unnecessary partitions entirely (HCatLoader):
– HCatLoader followed by Filter(year>=2011) becomes HCatLoader(year>=2011)
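A sketch, assuming a hypothetical HCatalog table 'logs' partitioned on year (the HCatLoader package name shown is the one in use at the time; it later moved under org.apache.hive):

```pig
A = load 'logs' using org.apache.hcatalog.pig.HCatLoader();
B = filter A by year >= 2011;
-- the filter on the partition column is pushed into HCatLoader,
-- so partitions with year < 2011 are never read
```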
Intermediate file compression
Pig Script → map 1 → reduce 1 → Pig temp file → map 2 → reduce 2 → Pig temp file → map 3 → reduce 3
– Intermediate files between map and reduce: Snappy
– Temp files between MapReduce jobs: no compression by default
Enable temp file compression
Pig temp files are not compressed by default
– Issues with Snappy (HADOOP-7990)
– LZO: not an Apache license
Enable LZO compression:
– Install LZO for Hadoop
– In conf/pig.properties:

pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo

– With LZO: over 90% disk saving and up to a 4x query speed-up
Multiquery
Combine two or more MapReduce jobs into one
– Happens automatically
– Cases where we want to control multiquery: when it combines too many
Example: one Load feeding Group by $0, Group by $1, and Group by $2, each followed by a Foreach and a Store
Control multiquery
Disable multiquery:
– Command line option: -M
Use "exec" to mark the boundary:

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
Store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
Store C1 into 'output1';

exec

B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
Store C2 into 'output2';
Implement the right UDF
Algebraic UDF
– Initial: runs in the map
– Intermediate: runs in the combiner
– Final: runs in the reduce

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
Store C0 into 'output0';
Implement the right UDF
Accumulator UDF
– Reduce-side UDF
– Normally takes a bag
Benefit
– Big bags are passed in batches
– Avoids using too much memory
– Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
Store C0 into 'output0';

public class my_accum implements Accumulator<Long> {
  private long result = 0;
  public void accumulate(Tuple b) {
    // takes one chunk of the bag
  }
  public Long getValue() {
    // called after all bag chunks are processed
    return result;
  }
  public void cleanup() { result = 0; }
}
Memory optimization
Control bag size on the reduce side
– If the bag size exceeds a threshold, spill to disk
– Control the bag size to fit the bag in memory if possible
– pig.cachedbag.memusage=0.2

MapReduce: reduce(Text key, Iterator values, …); the Iterator is materialized into the bags of input 1, 2, and 3
Optimization starts before Pig
– Input format
– Serialization format
– Compression
Input format: test query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
( , thejasminesupperclub, …)
Input formats
Columnar format
RCFile
– Columnar format for a group of rows
– More efficient if you query a subset of columns
Tests with RCFile
Tests with load + project + filter out all records
– Using HCatalog, with compression and types
Test 1: project 1 out of 5 columns
Test 2: project all 5 columns
RCFile test results
Cost based optimizations
Optimization decisions based on your query/data
Often an iterative process: run query → measure → tune
Cost based optimization - Aggregation
Hash Based Aggregation (HBA)
– Use pig.exec.mapPartAgg=true to enable
– In the map task, each Map (logic) step feeds an HBA step that produces the map output sent to the reduce task
Cost based optimization – Hash Agg.
Configuring Hash Agg:
– The auto-off feature switches off HBA if the output reduction is not good enough
– Configure the auto-off feature: pig.exec.mapPartAgg.minReduction
– Configure the memory used: pig.cachedbag.memusage
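Pulled together, a conf/pig.properties fragment might look like this (the values are illustrative, not recommendations):

```properties
pig.exec.mapPartAgg = true
# turn HBA off unless it shrinks map output by at least this factor
pig.exec.mapPartAgg.minReduction = 10
# fraction of memory available for cached bags
pig.cachedbag.memusage = 0.2
```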
Cost based optimization - Join
Use the appropriate join algorithm:
– Skew on the join key: skew join
– One input fits in memory: FR (fragment-replicate) join
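Both are selected with a using clause; the relation and key names here are illustrative:

```pig
-- skew join: tolerates a skewed distribution of uid values
J1 = join L by uid, R by uid using 'skewed';
-- FR join: every input after the first is replicated into memory
-- on each map task, so the small relation goes last
J2 = join big by uid, small by uid using 'replicated';
```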
Cost based optimization – MR tuning
Tune MapReduce parameters to reduce IO:
– Control spills using the map-side sort parameters
– Tune the shuffle/sort-merge parameters on the reduce side
Parallelism of reduce tasks
Example: number of reduce slots = 6
Factors affecting runtime:
– Number of cores simultaneously used / skew
– Cost of having additional reduce tasks
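Reduce parallelism can be set script-wide or per operator; the values are illustrative:

```pig
set default_parallel 6;           -- default for every reduce stage in the script
B = group A by $0 parallel 12;    -- override for a single operator
```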
Cost based optimization – keep data sorted
– Frequent join operations on the same keys
– Keep the data sorted on those keys
– Use merge join
– Optimized group on sorted keys
– Works with a few load functions; needs an additional interface implementation
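A merge join sketch, assuming both inputs are already stored sorted on uid (file and field names are illustrative):

```pig
A = load 'sorted_a' as (uid, a1);
B = load 'sorted_b' as (uid, b1);
J = join A by uid, B by uid using 'merge';
-- runs map-side over the pre-sorted inputs; no shuffle of A and B
```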
Optimizations for sorted data
Future Directions
– Optimize using stats
– Using historical stats with HCatalog
– Sampling
Questions?