Pig, a high level data processing system on Hadoop.

Slides:



Advertisements
Similar presentations
Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
Advertisements

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh.
Pig Optimization and Execution Page 1 Alan F. © Hortonworks Inc
Hui Li Pig Tutorial Hui Li Some material adapted from slides by Adam Kawa the 3rd meeting of WHUG June 21, 2012.
CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
MapReduce Online Veli Hasanov Fatih University.
© Hortonworks Inc Daniel Dai Thejas Nair Page 1 Making Pig Fly Optimizing Data Processing on Hadoop.
Alan F. Gates Yahoo! Pig, Making Hadoop Easy Who Am I? Pig committer Hadoop PMC Member An architect in Yahoo! grid team Or, as one coworker put.
J OIN ALGORITHMS USING MAPREDUCE Haiping Wang
High Level Language: Pig Latin Hui Li Judy Qiu Some material adapted from slides by Adam Kawa the 3 rd meeting of WHUG June 21, 2012.
Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
Programming Types of Testing.
Spark: Cluster Computing with Working Sets
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
CS347: MapReduce CS Motivation for Map-Reduce Distribution makes simple computations complex Communication Load balancing Fault tolerance … What.
(Hadoop) Pig Dataflow Language B. Ramamurthy Based on Cloudera’s tutorials and Apache’s Pig Manual 6/27/2015.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
HADOOP ADMIN: Session -2
Pig Acknowledgement: Modified slides from Duke University 04/13/10 Cloud Computing Lecture.
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.
Christopher Jeffers August 2012
Pig: Making Hadoop Easy Wednesday, June 10, 2009 Santa Clara Marriott.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Components of Database Management System
Big Data Analytics Training
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Introduction to Hadoop and HDFS
Making Hadoop Easy pig
Storage and Analysis of Tera-scale Data : 2 of Database Class 11/24/09
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
MapReduce How to painlessly process terabytes of data.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
An Introduction to HDInsight June 27 th,
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
VIRTUAL MEMORY By Thi Nguyen. Motivation  In early time, the main memory was not large enough to store and execute complex program as higher level languages.
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
MAP-REDUCE ABSTRACTIONS 1. Abstractions On Top Of Hadoop We’ve decomposed some algorithms into a map-reduce “workflow” (series of map-reduce steps) –
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.
Design of Pig B. Ramamurthy. Pig’s data model Scalar types: int, long, float (early versions, recently float has been dropped), double, chararray, bytearray.
Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010.
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
Data Engineering Shivnath Babu Introduction to Parallel Execution.
Aggregator Stage : Definition : Aggregator classifies data rows from a single input link into groups and calculates totals or other aggregate functions.
What is Pig ???. Why Pig ??? MapReduce is difficult to program. It only has two phases. Put the logic at the phase. Too many lines of code even for simple.
Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.
Pig, Making Hadoop Easy Alan F. Gates Yahoo!.
Hadoop.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
MSBIC Hadoop Series Processing Data with Pig
Pig Latin - A Not-So-Foreign Language for Data Processing
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
February 26th – Map/Reduce
Cse 344 May 4th – Map/Reduce.
CSE 491/891 Lecture 21 (Pig).
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
(Hadoop) Pig Dataflow Language
CS639: Data Management for Data Science
MapReduce: Simplified Data Processing on Large Clusters
(Hadoop) Pig Dataflow Language
Presentation transcript:

Pig, a high level data processing system on Hadoop

2 Is MapReduce not Good Enough? Restricted programming model Only two phases Job chain for long data flow Too many lines of code even for simple logic How many lines do you have for word count? Programmers are responsible for this

3 Pig to the Rescue High level dataflow language (Pig Latin) Much simpler than Java Simplifies the data processing Puts the operations at the apropriate phases Chains multiple MR jobs

4 How Pig is used in the Industry At Yahoo, 70% MapReduce jobs are written in Pig Used to Process web logs Build user behavior models Process images Data mining Also used by Twitter, LinkedIn, eBay, AOL,...

5 Motivation by Example Suppose we have user data in one file, website data in another file. We need to find the top 5 most visited pages by users aged 18-25

6 In MapReduce

7 In Pig Latin

8 Pig runs over Hadoop

9 Wait a minute How to map the data to records  By default, one line → one record  User can customize the loading process How to identify attributes and map them to the schema  Delimiter to separate different attributes  By default, delimiter is tab. Customizable.

10 MapReduce Vs. Pig cont. Join in MapReduce  Various algorithms. None of them are easy to implement in MapReduce  Multi-way join is more complicated  Hard to integrate into SPJA workflow

11 MapReduce Vs. Pig cont. Join in Pig  Various algorithms are already available.  Some of them are generic to support multi-way join  No need to consider integration into SPJA workflow. Pig does that for you! A = LOAD 'input/join/A'; B = LOAD 'input/join/B'; C = JOIN A BY $0, B BY $1; DUMP C;

12 Pig Latin Data flow language  Users specify a sequence of operations to process data  More control on the process, compared with declarative language Various data types are supported Schema is supported User-defined functions are supported

13 Statement A statement represents an operation, or a stage in the data flow Usually a variable is used to represent the result of the statement Not limited to data processing operations, but also contains filesystem operations

14 Schema User can optionally define the schema of the input data Once the schema of the source data is given, the schema of the intermediate relation will be induced by Pig

15 Schema cont. Why schema?  Scripts are more readable (by alias)  Help system validate the input Similar to Database?  Yes. But schema here is optional  Schema is not fixed for a particular dataset, but changable

16 Schema cont. Schema 1 A = LOAD 'input/A' as (name:chararray, age:int); B = FILTER A BY age != 20; Schema 2 A = LOAD 'input/A' as (name:chararray, age:chararray); B = FILTER A BY age != '20'; No Schema A = LOAD 'input/A' ; B = FILTER A BY A.$1 != '20';

17 Data Types Every attribute can always be interpreted as a bytearray, without further type definition Simple data types  For each attribute  Defined by user in the schema  Int, double, chararray... Complex data types  Usually contructed by relational operations  Tuple, bag, map

18 Data Types cont. Type casting  Pig will try to cast data types when type inconsistency is seen.  Warning will be thrown if casting fails. Process still goes on Validation  Null will replace the inconvertable data type in type casting  User can tell a corrupted record by detecting whether a particular attribute is null

19 Date Types cont.

20 Operators Relational Operators  Represent an operation that will be added to the logical plan  LOAD, STORE, FILTER, JOIN, FOREACH...GENERATE

21 Operators Diagnostic Operators  Show the status/metadata of the relations  Used for debugging  Will not be integrated into execution plan  DESCRIBE, EXPLAIN, ILLUSTRATE.

22 Functions Eval Functions  Record transformation Filter Functions  Test whether a record satisfies particular predicate Comparison Functions  Impose ordering between two records. Used by ORDER operation Load Functions  Specify how to load data into relations Store Functions  Specify how to store relations to external storage

23 Functions Built-in Functions  Hard-coded routines offered by Pig. User Defined Function (UDF)  Supports customized functionalities  Piggy Bank, a warehouse for UDFs

View of Pig from inside

25 Pig Execution Modes Local mode  Launch single JVM  Access local file system  No MR job running Hadoop mode  Execute a sequence of MR jobs  Pig interacts with Hadoop master node

26 Compilation

27 04/13/10 Parsing Type checking with schema Reference verification Logical plan generation One-to-one fashion Independent of execution platform Limited optimization No execution until DUMP or STORE Parsing

28 04/13/10 Logic Plan A=LOAD 'file1' AS (x, y, z); B=LOAD 'file2' AS (t, u, v); C=FILTER A by y > 0; D=JOIN C BY x, B BY u; E=GROUP D BY z; F=FOREACH E GENERATE group, COUNT(D); STORE F INTO 'output'; LOAD FILTER LOAD JOIN GROUP FOREACH STORE Logical Plan

29 04/13/10 Physical Plan 1:1 correspondence with most logical operators Except for: DISTINCT (CO)GROUP JOIN ORDER Physical Plan

Two typical types of join  Map-side join  Reduce-side join Joins in MapReduce

Map tasks: Table R Table L Map-side Join

REDUCE-SIDE JOIN Drawback: all records may have to be buffered Out of memory  The key cardinality is small  The data is highly skewed L: ratings.dat R: movies.dat Pairs: (key, targeted record) 1:: 1193 ::5:: :: 661 ::3:: :: 661 ::3:: :: 661 ::4:: :: 1193 ::5:: :: 1193 ::5:: :: 661 ::3:: :: 661 ::3:: :: 661 ::4:: :: 1193 ::5:: ::James and the Glant… 914 ::My Fair Lady ::One Flew Over the… 2355 ::Bug’s Life, A… 3408 ::Erin Brockovich… 661 ::James and the Glant… 914 ::My Fair Lady ::One Flew Over the… 2355 ::Bug’s Life, A… 3408 ::Erin Brockovich… 1193, L:1::1193::5:: , L :1::661::3:: , L :1::661::3:: , L :1::661::4:: , L :1 ::1193::5 :: , R :661::James and the Gla… 914, R : 914::My Fair Lady , R : 1193::One Flew Over … 2355, R : 2355::Bug’s Life, A… 3408, R : 3408::Erin Brockovi… (661, …) (1193, …) (661, …) (2355, …) (3048, …) (914, …) (1193, …) ( 661, [L :1::661::3::97…], [R:661::James…], [L:1::661::3::978…], [L :1::661::4::97…]) ( 2355, [R:2355::B’…]) (3408, [R:3408::Eri…]) (1,Ja..,3, …) (1,Ja..,4, …) Group by join key Buffers records into two sets according to the table tag + Cross-product Buffers records into two sets according to the table tag + Cross-product {(661::James…) } X (1::661::3::97…), (1::661::4::97…) Phase /FunctionImprovement Map FunctionOutput key is changed to a composite of the join key and the table tag. Partitioning functionHashcode is computed from just the join key part of the composite key Grouping functionRecords are grouped on just the join key

33 04/13/10 Physical Plan 1:1 correspondence with most logical operators Except for: DISTINCT (CO)GROUP JOIN ORDER Physical Plan

34 04/13/10 LOAD FILTER LOAD JOIN GROUP FOREACH STORE LOAD FILTER LOAD LOCAL REARRANGE PACKAGE FOREACH STORE GLOBAL REARRANGE LOCAL REARRANGE PACKAGE FOREACH GLOBAL REARRANGE

35 04/13/10 Physical Optimization Always use combiner for pre-aggregation Insert SPLIT to re-use intermediate result Early projection (logical or physical?) Physical Optimizations

36 04/13/10 MapReduce Plan Determine MapReduce boundaries GLOBAL REARRANGE STORE/LOAD Some operations are done by MapReduce framework Coalesce other operators into Map & Reduce stages Generate job jar file MapReduce Plan

37 04/13/10 LOAD FILTER LOAD LOCAL REARRANGE PACKAGE FOREACH STORE GLOBAL REARRANGE LOCAL REARRANGE PACKAGE FOREACH GLOBAL REARRANGE FILTER LOCAL REARRANGE Map Reduce Map Reduce PACKAGE FOREACH LOCAL REARRANGE PACKAGE FOREACH

38 Execution in Hadoop Mode The MR jobs not dependent on anything in the MR plan will be submitted for execution MR jobs will be removed from MR plan after completion  Jobs whose dependencies are satisfied are now ready for execution Currently, no support for inter-job fault- tolerance

Discussion of the Two Readings on Pig (SIGMOD 2008 and VLDB 2009)

40 Discussion Points for Reading 1 Examples of the nested data model, CoGroup, and Join (Figure 2) Nested query in Section 3.7

41 What are the Logical, Physical, and MapReduce plans for: STORE answer INTO ‘/user/alan/answer’;

42 04/13/10 LOAD LOCAL REARRANGE PACKAGE FOREACH STORE GLOBAL REARRANGE LOCAL REARRANGE PACKAGE FOREACH GLOBAL REARRANGE FILTER LOCAL REARRANGE Map Reduce Map Reduce PACKAGE FOREACH LOCAL REARRANGE PACKAGE FOREACH FILTER

 B,D  R.A = “c” R S Recall Operator Plumbing Materialization: output of one operator written to disk, next operator reads from the disk Pipelining: output of one operator directly fed to next operator

 B,D  R.A = “c” R S Materialization Materialized here

 B,D  R.A = “c” R S Iterators: Pipelining  Each operator supports: Open() GetNext() Close()

46 04/13/10 FILTER LOCAL REARRANGE Map Reduce PACKAGE FOREACH How do these operators execute in Pig? Hints (based on Reading 2):  What will Hadoop’s map function and reduce function calls do in this case?  How does each operator work? What does each operator do? (Section 4.3)  Outermost operator graph (Section 5)  Iterator model (Section 5)

47 04/13/10 Branching Flows in Pig Hints (based on Reading 2, Section 5.1, last two paras before Section 5.1.1):  Outermost data flow graph  New pause signal for iterators clicks = LOAD `clicks' AS (userid, pageid, linkid, viewedat); SPLIT clicks INTO pages IF pageid IS NOT NULL, links IF linkid IS NOT NULL; cpages = FOREACH pages GENERATE userid, CanonicalizePage(pageid) AS cpage, viewedat; clinks = FOREACH links GENERATE userid, CanonicalizeLink(linkid) AS clink, viewedat; STORE cpages INTO `pages'; STORE clinks INTO `links';

48 04/13/10 Branching Flows in Pig Draw the MapReduce plan for this query clicks = LOAD `clicks' AS (userid, pageid, linkid, viewedat); byuser = GROUP clicks BY userid; result = FOREACH byuser { uniqPages = DISTINCT clicks.pageid; uniqLinks = DISTINCT clicks.linkid; GENERATE group, COUNT(uniqPages), COUNT(uniqLinks); };

49 04/13/10 Branching Flows in Pig Draw the MapReduce plan for this query clicks = LOAD `clicks' AS (userid, pageid, linkid, viewedat); byuser = GROUP clicks BY userid; result = FOREACH byuser { fltrd = FILTER clicks BY viewedat IS NOT NULL; uniqPages = DISTINCT fltrd.pageid; uniqLinks = DISTINCT fltrd.linkid; GENERATE group, COUNT(uniqPages), COUNT(uniqLinks); };

Performance and future improvement

51 Pig Performance Images from

52 Future Improvements Query optimization  Currently rule-based optimizer for plan rearrangement and join selection  Cost-based in the future Non-Java UDFs Grouping and joining on pre-partitioned/sorted data  Avoid data shuffling for grouping and joining  Building metadata facilities to keep track of data layout Skew handling  For load balancing

53 Get more information at the Pig website You can work with the source code to implement something new in Pig Also take a look at Hive, a similar system from Facebook

54 References Some of the content come from the following presentations:  Introduction to data processing using Hadoop and Pig, by Ricardo Varela  Pig, Making Hadoop Easy, by Alan F. Gates  Large-scale social media analysis with Hadoop, by Jake Hofman  Getting Started on Hadoop, by Paco Nathan  MapReduce Online, by Tyson Condie and Neil Conway