The Pig Experience: Building High-Level Data Flows on top of Map-Reduce
Distributed Information Systems
Presenter: Javeria Iqbal
Tutor: Dr. Martin Theobald

Outline: Map-Reduce and the need for Pig Latin; Pig Latin; Compilation into Map-Reduce; Optimization; Future Work

Data Processing Renaissance: Internet companies are swimming in data (TBs/day for Yahoo! or Google, PBs/day for Facebook). Data analysis is the “inner loop” of product innovation.

Data Warehousing …? Scale: often not scalable enough. Price: prohibitively expensive at web scale, up to $200K/TB. SQL: high-level declarative approach, with little control over the execution method.

Map-Reduce: Map performs the filtering; Reduce performs the aggregation. These two high-level primitives enable parallel processing, BUT there are no complex database operations such as joins.

Execution Overview of Map-Reduce: The program is split into a master and worker threads. A map worker reads and parses key/value pairs and passes them to the user-defined Map function. Buffered pairs are written to local disk partitions, and the locations of the buffered pairs are sent to the reduce workers.

Execution Overview of Map-Reduce (continued): The reduce worker sorts the data by the intermediate keys. Each unique key and its values are passed to the user’s Reduce function. The output is appended to the output file for this reduce partition.
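The execution steps on the two slides above can be imitated in a single process. The following is only a minimal sketch, not Hadoop's or Google's implementation: `run_map_reduce`, the word-count functions, and the use of in-memory partitions instead of local disk files are all assumptions made for illustration.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn, num_partitions=2):
    """Single-process imitation of the map, shuffle/sort, and reduce steps."""
    # Map phase: each input record yields zero or more (key, value) pairs,
    # which are buffered into partitions chosen by hashing the key.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in map_fn(record):
            partitions[hash(key) % num_partitions][key].append(value)

    # Reduce phase: each partition is sorted by intermediate key, and every
    # unique key is passed with its values to the user's reduce function.
    output = []
    for partition in partitions:
        for key in sorted(partition):
            output.append((key, reduce_fn(key, partition[key])))
    return output


def word_count_map(line):
    # Hypothetical user-defined Map function: emit (word, 1) per word.
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    # Hypothetical user-defined Reduce function: sum the counts of one word.
    return sum(counts)


counts = dict(run_map_reduce(["pig latin", "pig hadoop"],
                             word_count_map, word_count_reduce))
```

In a real cluster the partitions live on different machines and the pairs travel over the network; the sketch keeps only the order of the steps.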

The Map-Reduce Appeal. Scale: scalable due to a simpler design, an explicit programming model, and only parallelizable operations. Price: runs on cheap commodity hardware, with less administration. SQL: procedural control instead, a processing “pipe”.

Disadvantages
1. Extremely rigid data flow (map, then reduce); other flows such as join, union, split, and chains of jobs have to be hacked in.
2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct.
3. Semantics are hidden inside the map and reduce functions, which makes programs difficult to maintain, extend, and optimize.
4. No combined processing of multiple data sets: joins and other data processing operations.

Motivation: Map-Reduce is scalable, cheap, and gives control over execution, but it is inflexible, requires lots of hand coding, and hides the semantics. We need a high-level, general data flow language.

Enter Pig Latin: scalable, cheap, and with control over execution. Pig Latin answers the need for a high-level, general data flow language.


Pig Latin: Data Types
Rich and simple data model.
Simple types (atoms): int, long, double, chararray, bytearray; an atom is a string or number, e.g. ‘apple’
Complex types:
Tuple: collection of fields, e.g. (‘apple’, ‘mango’)
Bag: collection of tuples, e.g. { (‘apple’, ‘mango’), (‘apple’, (‘red’, ‘yellow’)) }
Map: key/value pairs

Example: Data Model. Atom: contains a single atomic value, e.g. ‘alice’, ‘lanker’, ‘ipod’. Tuple: a sequence of fields. Bag: a collection of tuples with possible duplicates.
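As a rough analogy only, the four kinds of value can be mirrored with built-in Python containers. The variable names below are illustrative assumptions, and Python's types are not Pig's types.

```python
# Atom: a single atomic value (a string or a number).
atom = "alice"

# Tuple: a sequence of fields; a field may itself be a nested tuple.
record = ("apple", ("red", "yellow"))

# Bag: a collection of tuples that may contain duplicates, so a list
# (not a set) is the closest built-in Python container.
bag = [("apple", "mango"), ("apple", "mango"), record]

# Map: key/value pairs; a value may be any of the kinds above.
fruit_info = {"name": atom, "colours": [("red",), ("yellow",)]}
```

The duplicate tuple in `bag` is deliberate: a bag, unlike a set, keeps both copies.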

Pig Latin: Input/Output Data
Input:
    queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
Output:
    STORE query_revenues INTO 'myoutput' USING myStore();

Pig Latin: General Syntax. Discarding unwanted data: FILTER. Comparison operators such as ==, eq, !=, neq. Logical connectors: AND, OR, NOT.

Pig Latin: Expression Table

Pig Latin: FOREACH with Flatten
Without FLATTEN:
    expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
With FLATTEN:
    expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
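The difference between the two statements above is easiest to see on a tiny input. The sketch below imitates it with Python comprehensions; `expand_query` is a hypothetical stand-in for the `expandQuery` UDF (its real behaviour is not given in the slides), and the sample queries are made up.

```python
def expand_query(query_string):
    """Hypothetical stand-in for the expandQuery UDF: returns a bag of strings."""
    return [query_string, query_string + " news"]


queries = [("alice", "lakers"), ("bob", "ipod")]

# Without FLATTEN: one output tuple per input tuple, holding a nested bag.
nested = [(user_id, expand_query(q)) for user_id, q in queries]

# With FLATTEN: the bag is unnested, giving one output tuple per expansion.
flattened = [(user_id, e) for user_id, q in queries for e in expand_query(q)]
```

Here `nested` has two tuples, each carrying a bag, while `flattened` has four flat tuples.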

Pig Latin: COGROUP
Getting related data together: COGROUP. Suppose we have two data sets:
    result: (queryString, url, position)
    revenue: (queryString, adSlot, amount)
    grouped_data = COGROUP result BY queryString, revenue BY queryString;

Pig Latin: COGROUP vs. JOIN

Pig Latin: Map-Reduce
Map-Reduce expressed in Pig Latin:
    map_result = FOREACH input GENERATE FLATTEN(map(*));
    key_group = GROUP map_result BY $0;
    output = FOREACH key_group GENERATE reduce(*);

Pig Latin: Other Commands
UNION: returns the union of two or more bags
CROSS: returns the cross product
ORDER: orders a bag by the specified field(s)
DISTINCT: eliminates duplicate tuples in a bag

Pig Latin: Nested Operations
    grouped_revenue = GROUP revenue BY queryString;
    query_revenues = FOREACH grouped_revenue {
        top_slot = FILTER revenue BY adSlot eq 'top';
        GENERATE queryString, SUM(top_slot.amount), SUM(revenue.amount);
    };
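What the nested block computes can be sketched in Python: for each query string, the revenue from the ‘top’ slot and the total revenue. The sample rows are assumptions chosen for illustration, and the code mirrors only the semantics, not how Pig executes the statement.

```python
from collections import defaultdict

# revenue: (queryString, adSlot, amount); sample rows are made up.
revenue = [
    ("lakers", "top", 50), ("lakers", "side", 20),
    ("kings", "top", 30), ("kings", "side", 10),
]

# GROUP revenue BY queryString: one bag of revenue tuples per query string.
grouped_revenue = defaultdict(list)
for row in revenue:
    grouped_revenue[row[0]].append(row)

# Nested FOREACH: filter each group's bag, then aggregate both bags.
query_revenues = []
for query_string, bag in grouped_revenue.items():
    top_slot = [row for row in bag if row[1] == "top"]
    query_revenues.append(
        (query_string, sum(r[2] for r in top_slot), sum(r[2] for r in bag))
    )
```

The filter runs inside each group, which is why it is called a nested operation: `top_slot` is recomputed per query string.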

Pig Pen: Screen Shot

Pig Latin: Example 1
Suppose we have a table urls: (url, category, pagerank). A simple SQL query finds, for each sufficiently large category, the average pagerank of the high-pagerank urls in that category:
    SELECT category, AVG(pagerank)
    FROM urls
    WHERE pagerank > 0.2
    GROUP BY category
    HAVING COUNT(*) > 1000000

Data Flow: (1) filter good_urls by pagerank > 0.2; (2) group by category; (3) filter categories by count > 10^6; (4) foreach category generate the average pagerank.

Equivalent Pig Latin
    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
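The same four steps can be traced in plain Python, one statement per Pig Latin line. This is only a sketch of the semantics: the function name, its parameters, and the sample rows (including the made-up `tiny.example` url) are assumptions, and the threshold is lowered in the usage line so that a four-row input produces a result.

```python
from collections import defaultdict


def avg_pagerank_by_category(urls, min_pagerank=0.2, min_count=10**6):
    # FILTER urls BY pagerank > min_pagerank
    good_urls = [u for u in urls if u[2] > min_pagerank]
    # GROUP good_urls BY category
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)
    # FILTER groups BY COUNT(good_urls) > min_count, then average each group
    return {c: sum(r) / len(r) for c, r in groups.items() if len(r) > min_count}


urls = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
        ("flickr.com", "Photos", 0.7), ("tiny.example", "News", 0.1)]
result = avg_pagerank_by_category(urls, min_count=1)
```

With `min_count=1`, only the News category survives the second filter, because Photos has a single high-pagerank url.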

Example 2: Data Analysis Task
Find the top 10 most visited pages in each category.
Visits:
    User   Url          Time
    Amy    cnn.com      8:00
    Amy    bbc.com      10:00
    Amy    flickr.com   10:05
    Fred   cnn.com      12:00
Url Info:
    Url          Category   PageRank
    cnn.com      News       0.9
    bbc.com      News       0.8
    flickr.com   Photos     0.7
    espn.com     Sports     0.9

Data Flow: Load Visits; Group by url; Foreach url generate count; Load Url Info; Join on url; Group by category; Foreach category generate top10 urls.

Equivalent Pig Latin
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, COUNT(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';
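To check the expected result of this script on the small Visits and Url Info tables from the task slide, the data flow can be replayed in Python. This is a sketch under assumptions: `top_urls_by_category` is a hypothetical helper, the join is treated as an inner join, and the behaviour of the `top` function is assumed to be "the n rows with the highest visit count".

```python
from collections import Counter, defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]


def top_urls_by_category(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url (inner join)
    joined = [(url, visit_counts[url], category)
              for url, category, _rank in url_info if url in visit_counts]
    # group by category; foreach category generate the n most visited urls
    by_category = defaultdict(list)
    for url, count, category in joined:
        by_category[category].append((url, count))
    return {category: sorted(rows, key=lambda r: -r[1])[:n]
            for category, rows in by_category.items()}


top_urls = top_urls_by_category(visits, url_info)
```

Since espn.com has no visits, the Sports category drops out at the join.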

Quick Start and Interoperability: Pig operates directly over files, as the load and store statements of the script above show.

Quick Start and Interoperability: Schemas are optional and can be assigned dynamically, as in the "as (...)" clauses of the script above.

User-Code as a First-Class Citizen: User-defined functions (UDFs) can be used in every construct of the script above: Load, Store, Group, Filter, Foreach.

Nested Data Model: Pig Latin has a fully nested data model with atomic values, tuples, bags (lists), and maps. This avoids expensive joins.

Nested Data Model: Decouples grouping as an independent operation. Common case: aggregation on these nested sets. Power users: sophisticated UDFs, e.g., sequence analysis. Efficient implementation (see paper).
Visits:
    User   Url       Time
    Amy    cnn.com   8:00
    Amy    bbc.com   10:00
    Amy    bbc.com   10:05
    Fred   cnn.com   12:00
After group by url:
    group     Visits
    cnn.com   { (Amy, cnn.com, 8:00), (Fred, cnn.com, 12:00) }
    bbc.com   { (Amy, bbc.com, 10:00), (Amy, bbc.com, 10:05) }
“I frankly like pig much better than SQL in some respects (group + optional flatten), I love nested data structures.” Ted Dunning, Chief Scientist, Veoh

CoGroup
results:
    query    url        rank
    Lakers   nba.com    1
    Lakers   espn.com   2
    Kings    nhl.com    1
    Kings    nba.com    2
revenue:
    query    adSlot   amount
    Lakers   top      50
    Lakers   side     20
    Kings    top      30
    Kings    side     10
Cogrouped:
    group    results                                           revenue
    Lakers   { (Lakers, nba.com, 1), (Lakers, espn.com, 2) }   { (Lakers, top, 50), (Lakers, side, 20) }
    Kings    { (Kings, nhl.com, 1), (Kings, nba.com, 2) }      { (Kings, top, 30), (Kings, side, 10) }
The cross-product of the 2 bags would give the natural join.
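The closing remark, that crossing the two bags of each group yields the join, can be verified on the table above. The following Python sketch is illustrative only: `cogroup` is a hypothetical helper that groups on the first field of each row, not Pig's operator.

```python
from collections import defaultdict
from itertools import product

results = [("Lakers", "nba.com", 1), ("Lakers", "espn.com", 2),
           ("Kings", "nhl.com", 1), ("Kings", "nba.com", 2)]
revenue = [("Lakers", "top", 50), ("Lakers", "side", 20),
           ("Kings", "top", 30), ("Kings", "side", 10)]


def cogroup(left, right):
    """Group both inputs on their first field: {key: (left_bag, right_bag)}."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[0]][0].append(row)
    for row in right:
        groups[row[0]][1].append(row)
    return dict(groups)


grouped_data = cogroup(results, revenue)

# Crossing the two bags of each group gives the natural join; the shared
# key field is kept once by dropping it from the right-hand row.
joined = [l + r[1:] for l_bag, r_bag in grouped_data.values()
          for l, r in product(l_bag, r_bag)]
```

Each group contributes 2 x 2 rows, so `joined` has eight rows in total. Keeping the bags separate, as COGROUP does, lets a UDF process them without paying for the cross product.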

Pig Features: an explicit data flow language, unlike SQL; higher-level than the low-level procedural code of Map-Reduce; quick start and interoperability; three modes (interactive, batch, embedded); user-defined functions; nested data model.


Pig Process Life Cycle: Parser; Logical Optimizer; Map-Reduce Compiler; Map-Reduce Optimizer; Hadoop Job Manager. Translation steps shown in the figure: Pig Latin to logical plan; logical plan to physical plan.

Pig Latin to Logical Plan
    A = LOAD 'file1' AS (x,y,z);
    B = LOAD 'file2' AS (t,u,v);
    C = FILTER A by y > 0;
    D = JOIN C by x, B by u;
    E = GROUP D by z;
    F = FOREACH E generate group, COUNT(D);
    STORE F into 'output';
Logical plan: LOAD and FILTER on one input and LOAD on the other feed JOIN, followed by GROUP, FOREACH, and STORE. Schema annotations in the figure: (x,y,z), (x,y,z,t,u,v), (group, count).

Logical Plan to Physical Plan. Logical plan: LOAD, FILTER, LOAD, JOIN, GROUP, FOREACH, STORE. Physical plan operators: LOAD, FILTER, LOAD, LOCAL REARRANGE, GLOBAL REARRANGE, PACKAGE, FOREACH, LOCAL REARRANGE, GLOBAL REARRANGE, PACKAGE, FOREACH, STORE.

Physical Plan to Map-Reduce Plan. Physical plan operators: LOAD, FILTER, LOAD, LOCAL REARRANGE, GLOBAL REARRANGE, PACKAGE, FOREACH, LOCAL REARRANGE, GLOBAL REARRANGE, PACKAGE, FOREACH, STORE. Map-Reduce plan operators shown: Filter, Local Rearrange, Package, Foreach, Package, Foreach.

Implementation. Layers shown in the figure: user; SQL or Pig; automatic rewrite + optimize; Hadoop Map-Reduce; cluster. Pig is open-source. ~50% of Hadoop jobs at Yahoo! are Pig, with 1000s of jobs per day.

Compilation into Map-Reduce: Every group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases. The data flow (Load Visits; Group by url; Foreach url generate count; Load Url Info; Join on url; Group by category; Foreach category generate top10(urls)) compiles into three jobs: Map 1/Reduce 1, Map 2/Reduce 2, Map 3/Reduce 3.
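The boundary rule on this slide can be illustrated with a toy splitter. This is a deliberately simplified sketch, not Pig's compiler: it assumes a plan written as a flat list of operator strings, treats every GROUP or JOIN as a shuffle, and assigns a LOAD that appears after a shuffle to the next job's map phase. All names are hypothetical.

```python
BOUNDARY_OPS = ("GROUP", "JOIN")


def split_into_jobs(operators):
    """Cut a flat operator list into map-reduce jobs at each GROUP or JOIN."""
    jobs = []
    map_ops, reduce_ops, next_map_ops = [], [], []
    for op in operators:
        kind = op.split()[0]
        if kind in BOUNDARY_OPS:
            if reduce_ops:  # current job already has a shuffle: close it
                jobs.append({"map": map_ops, "reduce": reduce_ops})
                map_ops, reduce_ops, next_map_ops = next_map_ops, [], []
            reduce_ops.append(op)
        elif kind == "LOAD" and reduce_ops:
            next_map_ops.append(op)  # a new input is read by the next job's map
        elif reduce_ops:
            reduce_ops.append(op)    # pipelined into the current reduce phase
        else:
            map_ops.append(op)       # pipelined into the current map phase
    jobs.append({"map": map_ops, "reduce": reduce_ops})
    if next_map_ops:                 # trailing loads with no later shuffle
        jobs.append({"map": next_map_ops, "reduce": []})
    return jobs


PLAN = ["LOAD visits", "GROUP by url", "FOREACH count", "LOAD urlInfo",
        "JOIN on url", "GROUP by category", "FOREACH top10"]
jobs = split_into_jobs(PLAN)
```

On the example plan the splitter returns three jobs, matching Map 1/Reduce 1 through Map 3/Reduce 3 on the slide. A real plan is a graph rather than a list, which this sketch ignores.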

Nested Sub Plans. Operators shown in the figure: SPLIT, FILTER, FOREACH, FOREACH, FILTER, LOCAL REARRANGE, GLOBAL REARRANGE, SPLIT, FOREACH, PACKAGE, MULTIPLEX.


Using the Combiner: The input records (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5) pass through map; the reduce side receives k1 with v1, v3, v5 and k2 with v2, v4, and produces the output records. The combiner can pre-process data on the map side to reduce the data shipped. It applies to algebraic aggregation functions and to distinct processing.
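An aggregation function is algebraic when it can be computed from partial results, which is what lets a combiner run on the map side. The sketch below shows this for an average, kept as a (sum, count) pair. The three function names are illustrative assumptions, not Pig's UDF interface.

```python
def avg_initial(values):
    """Map side: turn raw values into one partial (sum, count) pair."""
    return (sum(values), len(values))


def avg_intermediate(partials):
    """Combiner: merge partial (sum, count) pairs; safe to apply repeatedly."""
    return (sum(s for s, _ in partials), sum(c for _, c in partials))


def avg_final(partials):
    """Reduce side: merge the remaining partials and produce the average."""
    total, count = avg_intermediate(partials)
    return total / count


# Values for one key, spread over two map tasks (made-up numbers).
partial_a = avg_initial([1.0, 3.0])
partial_b = avg_initial([5.0])
average = avg_final([partial_a, partial_b])
```

Only two small pairs are shipped to the reducer instead of three raw values; on real data the saving grows with the number of values per key.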

Skew Join: The default join method is symmetric hash join. For each key (Lakers and Kings in the cogrouped results/revenue example), the cross product is carried out on 1 reducer. This is a problem if there are too many values with the same key. Skew join samples the data to find frequent values and further splits them among reducers.

Fragment-Replicate Join: Symmetric hash join repartitions both inputs. If size(data set 1) >> size(data set 2), just replicate data set 2 to all partitions of data set 1. This translates to a map-only job that opens data set 2 as a “side file”.
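The idea can be sketched as a map-side hash join in Python: every partition of the large input is joined against a full in-memory copy of the small input, with no shuffle. The function name, the join on the first field, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def fragment_replicate_join(large_partitions, small):
    """Join each partition of the large input with a full copy of the small one."""
    # Build an in-memory hash table from the small input (the "side file").
    side = defaultdict(list)
    for row in small:
        side[row[0]].append(row)
    # Each map task scans only its own partition; no shuffle or reduce runs.
    return [
        [big + match[1:] for big in partition for match in side.get(big[0], [])]
        for partition in large_partitions
    ]


large_partitions = [[("cnn.com", "Amy")],
                    [("bbc.com", "Amy"), ("espn.com", "Fred")]]
small = [("cnn.com", "News"), ("bbc.com", "News")]
joined_partitions = fragment_replicate_join(large_partitions, small)
```

The approach only pays off when the small input fits in memory on every map task.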

Merge Join: Exploits data sets that are already sorted. Again a map-only job, which opens the other data set as a “side file”.
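A sort-merge join over two inputs that are already sorted on the join key can be sketched as follows. This is a generic textbook merge join written for illustration, assuming the key is the first field of each row; it does not reproduce how Pig reads the side file.

```python
def merge_join(left, right):
    """Inner join of two inputs that are already sorted on their first field."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == left_key:
                out.extend(left[i] + r[1:] for r in right[j:j_end])
                i += 1
            j = j_end
    return out


left = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
right = [("b", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
merged = merge_join(left, right)
```

Both inputs are scanned once, front to back, which is why no repartitioning or sorting step is needed.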

Multiple Data Flows [1]: Load Users; Filter bots; then two branches: (a) Group by state, Apply udfs, Store into ‘bystate’; (b) Group by demographic, Apply udfs, Store into ‘bydemo’. Phases shown: Map 1, Reduce 1.

Multiple Data Flows [2]: Load Users; Filter bots; then two branches: (a) Group by state, Apply udfs, Store into ‘bystate’; (b) Group by demographic, Apply udfs, Store into ‘bydemo’. Split and Demultiplex operators are added so that both branches run in one job (Map 1, Reduce 1).

Performance

Strong & Weak Points. Strong points: explicit dataflow; retains the properties of Map-Reduce (scalability, fault tolerance, multi-way processing); open source. Weak points: column-wise storage structures are missing; memory management; no facilitation for non-Java users; limited optimization; no GUI for flow graphs.
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.” Jasmine Novak, Engineer, Yahoo!
“With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially.” David Ciemiewicz, Search Excellence, Yahoo!

Hot Competitor: Google Map-Reduce System

Installation Process
1. Download a Java editor (NetBeans or Eclipse).
2. Create sample Pig Latin code in any Java editor.
3. Install the Pig plugins (JavaCC, Subclipse).
4. Add the necessary jar files for the Hadoop API to the project.
5. Run the input files in any Java editor using the Hadoop API.
NOTE: Your system must work as a distributed cluster.
Another NOTE: If you want to run the sample from the command line, please install additional software (JUnit, Ant, Cygwin) and set your path variables.

New Systems For Data Analysis: Map-Reduce, Apache Hadoop, Dryad, Sawzall, Hive, ...


Future / In-Progress Tasks: columnar-storage layer; non-Java UDFs and SQL interface; metadata repository; GUI for Pig; tight integration with a scripting language (use the loops, conditionals, and functions of the host language); memory management and enhanced optimization. Project Suggestions at:

Summary: There is a big demand for parallel data processing: emerging tools do not look like SQL DBMSs, and programmers like dataflow pipes over static files. Hence the excitement about Map-Reduce. But Map-Reduce is too low-level and rigid. Pig Latin is the sweet spot between Map-Reduce and SQL.