Pig Latin Olston, Reed, Srivastava, Kumar, and Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008. Shahram Ghandeharizadeh.

Presentation transcript:

Pig Latin
Olston, Reed, Srivastava, Kumar, and Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
Shahram Ghandeharizadeh, Computer Science Department, University of Southern California

A Shared-Nothing Framework
- Shared-nothing architecture consisting of thousands of nodes!
  - A node is an off-the-shelf, commodity PC.
- Google File System
- Google’s Bigtable Data Model
- Google’s Map/Reduce Framework
- Yahoo’s Pig Latin
- …

Pig Latin
- Supports read-only data analysis workloads that are scan-centric; no transactions!
- Fully nested data model.
  - Does not satisfy 1NF! By definition, it therefore violates the other normal forms as well.
- Extensive support for user-defined functions.
  - UDFs are first-class citizens.
- Manages plain input files without any schema information.
- A novel debugging environment.

Data Models
- Conceptual, Logical, Physical
- You are here! Relational data model, Relational Algebra, SQL

Data Models
- Conceptual, Logical, Physical
- You are here! Nested data model, Pig Latin

Why Nested Data Model?
- Closer to how programmers think and more natural to them.
  - E.g., to capture information about the positional occurrences of terms in a collection of documents, a programmer may create a structure of the form Idx<documentId, Set<positions>> for each term.
  - Normalization of the data creates two tables:
    Term_info: (TermId, termString, …)
    Pos_info: (TermId, documentId, position)
  - Obtain positional occurrences by joining these two tables on TermId and grouping on TermId and documentId.
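To make the contrast above concrete, here is a minimal Python sketch, not taken from the slides, that rebuilds the nested per-term structure from flat rows shaped like Pos_info. The sample rows and the helper name `nest` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows of the normalized table Pos_info: (TermId, documentId, position)
pos_info = [
    (1, "d1", 3),
    (1, "d1", 17),
    (1, "d2", 5),
    (2, "d1", 8),
]

def nest(rows):
    """Group flat (term, document, position) rows into term -> document -> positions."""
    index = defaultdict(lambda: defaultdict(set))
    for term_id, doc_id, position in rows:
        index[term_id][doc_id].add(position)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {term: dict(docs) for term, docs in index.items()}

index = nest(pos_info)
print(index[1])
```

With a nested model the structure in `index` is what the programmer stores and reads directly; with the normalized tables it has to be recomputed by the join and grouping step.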

Why Nested Data Model?
- Data is often stored on disk in an inherently nested fashion.
  - A web crawler might output, for each url, the set of outlinks from that url.
- A nested data model justifies a new algebraic language!
- Adoption by programmers, because it is easier to write user-defined functions.

Dataflow Language
- The user specifies a sequence of steps, where each step specifies only a single, high-level data transformation. This is similar to relational algebra: procedural, and desirable for programmers.
- With SQL, the user specifies a set of declarative constraints. This is non-procedural and desirable for non-programmers.

Dataflow Language: Example
- A high-level program that specifies a query execution plan.
- Example: For each sufficiently large category, retrieve the average pagerank of high-pagerank urls in that category.
- SQL, assuming a table urls (url, category, pagerank):
    SELECT category, AVG(pagerank)
    FROM urls
    WHERE pagerank > 0.2
    GROUP BY category
    HAVING COUNT(*) > 1000000

Dataflow Language: Example (Cont…)
- A high-level program that specifies a query execution plan.
- Example: For each sufficiently large category, retrieve the average pagerank of high-pagerank urls in that category.
- Pig Latin:
    Good_urls = FILTER urls BY pagerank > 0.2;
    Groups = GROUP Good_urls BY category;
    Big_groups = FILTER Groups BY COUNT(Good_urls) > 1000000;
    Output = FOREACH Big_groups GENERATE category, AVG(Good_urls.pagerank);
- Availability of a schema is optional! Without one, columns are referenced by position: $0, $1, $2, …
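The step-by-step reading of the Pig Latin program can be mirrored in plain Python. This is only a sketch of the semantics on an in-memory list, assuming tuples of the form (url, category, pagerank). The function name, the parameter names, and the tiny sample data are hypothetical, and the size threshold is lowered so the example has something to return.

```python
from collections import defaultdict

def average_pagerank_of_big_categories(urls, min_pagerank=0.2, min_count=1_000_000):
    # Good_urls = FILTER urls BY pagerank > 0.2
    good_urls = [u for u in urls if u[2] > min_pagerank]
    # Groups = GROUP Good_urls BY category
    groups = defaultdict(list)
    for row in good_urls:
        groups[row[1]].append(row)
    # Big_groups = FILTER Groups BY COUNT(Good_urls) > 1000000
    big_groups = {cat: bag for cat, bag in groups.items() if len(bag) > min_count}
    # Output = FOREACH Big_groups GENERATE category, AVG(Good_urls.pagerank)
    return {cat: sum(r[2] for r in bag) / len(bag) for cat, bag in big_groups.items()}

# Hypothetical sample data.
urls = [
    ("a.com", "news", 0.75),
    ("b.com", "news", 0.25),
    ("c.com", "news", 0.125),   # dropped by the pagerank filter
    ("d.com", "blogs", 0.5),    # its category is too small
]
result = average_pagerank_of_big_categories(urls, min_count=1)
print(result)
```

Each Python statement corresponds to one Pig Latin step, which is the point of the dataflow style: the program reads as the execution plan.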

Lazy Execution
- Database-style optimization by lazy processing of expressions.
- Example: recall urls: (url, category, pagerank). Find the set of urls of pages that are classified as spam and have a high pagerank score.
    Spam_urls = FILTER urls BY isSpam(url);
    Culprit_urls = FILTER Spam_urls BY pagerank > 0.8;
- Optimized execution:
    HighRank_urls = FILTER urls BY pagerank > 0.8;
    Culprit_urls = FILTER HighRank_urls BY isSpam(url);
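A small Python sketch can show why the reordering pays off: both orders return the same tuples, but the cheap pagerank test shrinks the input before the expensive UDF runs. The `is_spam` function here is a hypothetical stand-in that simply counts its own calls; the sample data is invented.

```python
def is_spam(url):
    # Hypothetical stand-in for an expensive UDF; counts how often it is called.
    is_spam.calls += 1
    return "spam" in url

is_spam.calls = 0

urls = [
    ("spam-a.com", "x", 0.9),
    ("b.com", "x", 0.95),
    ("spam-c.com", "y", 0.1),
    ("d.com", "y", 0.3),
]

def as_written(rows):
    spam_urls = [r for r in rows if is_spam(r[0])]
    return [r for r in spam_urls if r[2] > 0.8]

def optimized(rows):
    high_rank_urls = [r for r in rows if r[2] > 0.8]
    return [r for r in high_rank_urls if is_spam(r[0])]

is_spam.calls = 0
written_result = as_written(urls)
calls_as_written = is_spam.calls

is_spam.calls = 0
optimized_result = optimized(urls)
calls_optimized = is_spam.calls

print(calls_as_written, calls_optimized)
```

The reordering is safe only because both filters are side-effect free and independent of each other, which is an assumption of this sketch.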

Quick Start/Interoperability
- To process a file, the user provides a function that gives Pig the ability to parse the content of the file into records.
- Output of a Pig program is formatted based on a user-defined function.
- Why don't conventional DBMSs do the same? (They require importing data into system-managed tables.)
  - To enable transactional consistency guarantees,
  - To enable efficient point lookups (RIDs),
  - To curate data on behalf of the user, and record the schema so that other users can make sense of the data.

Pig

Data Model
- Consists of four types:
  - Atom: Contains a simple atomic value such as a string or a number, e.g., 'Joe'.
  - Tuple: Sequence of fields, each of which might be any data type, e.g., ('Joe', 'lakers').
  - Bag: A collection of tuples with possible duplicates. Schema of a bag is flexible.
  - Map: A collection of data items, where each item has an associated key through which it can be looked up. Keys must be data atoms. Flexibility enables data to change without re-writing programs.
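As a rough analogy only, the four types can be modeled with built-in Python values. The mapping chosen here (str or number for an atom, tuple for a tuple, list for a bag, dict for a map) and all sample values other than 'Joe' and 'lakers' are assumptions of this sketch, not part of Pig.

```python
# Atom: a simple atomic value.
atom = "Joe"

# Tuple: a sequence of fields, each of any type.
tup = ("Joe", "lakers")

# Bag: a collection of tuples; duplicates are allowed and the tuples
# need not all have the same number of fields (flexible schema).
bag = [("Joe", "lakers"), ("Joe", "lakers"), ("iPod", 2, "extra")]

# Map: items looked up by an atomic key; values may be any type, even a bag.
mapping = {"fan of": [("lakers",), ("iPod",)], "age": 20}

# Types nest freely: a tuple whose fields are an atom, a bag, and a map.
nested = (atom, bag, mapping)
print(nested[2]["age"])
```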

A Comparison with Relational Algebra
- Pig Latin
  - Everything is a bag.
  - Dataflow language.
- Relational Algebra
  - Everything is a table.
  - Dataflow language.

Expressions in Pig Latin

Specifying Input Data
- Use the LOAD command to specify the input data file.
- Input file is query_log.txt.
- Convert the input file into tuples using the myLoad deserializer.
- Loaded tuples have 3 fields.
- USING and AS clauses are optional.
  - If USING is omitted, the default deserializer, which expects a plain-text, tab-delimited file, is used.
- No schema: reference fields by position, e.g., $0.
- Return value, assigned to “queries”, is a handle to a bag.
  - “queries” can be used as input to subsequent Pig Latin expressions.
  - Handles such as “queries” are logical. No data is actually read and no processing is carried out until an instruction explicitly asks for output (STORE).
  - Think of it as a “logical view”.
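The laziness described above can be imitated with a Python generator: creating the handle reads nothing, and data flows only when something consumes it. This is a sketch under assumptions: the `load` function, its `using` and `as_schema` parameters, and the field names userId, queryString, and timestamp are all hypothetical names chosen for illustration.

```python
import io

def default_deserializer(line):
    # Mirrors the default described on the slide: plain text, tab-delimited.
    return tuple(line.rstrip("\n").split("\t"))

def load(lines, using=default_deserializer, as_schema=None):
    """Return a lazy handle; no line is read until the handle is iterated."""
    for line in lines:
        fields = using(line)
        if as_schema is not None:
            yield dict(zip(as_schema, fields))   # fields addressable by name
        else:
            yield fields                         # fields addressable by position

source = io.StringIO("alice\tlakers\t1\nbob\tiPod\t3\n")
queries = load(source, as_schema=("userId", "queryString", "timestamp"))

# Nothing has been read yet: the handle is only a logical view.
nothing_read_yet = source.tell() == 0

# Consuming the handle plays the role of an instruction that asks for output.
rows = list(queries)
print(rows[0]["queryString"])
```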

Per-tuple Processing
- Iterate over the members of a set using the FOREACH command.
- expandQuery is a UDF that generates a bag of likely expansions of a given query string.
- Semantics:
  - No dependence between the processing of different tuples of the input, which enables parallelism!
  - GENERATE can be followed by a list of any expressions from Table 1.
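A minimal Python model of FOREACH … GENERATE follows, assuming input tuples of the form (userId, queryString). The `expand_query` body is a hypothetical stand-in for the expandQuery UDF, and `foreach` is a helper name invented for this sketch.

```python
def expand_query(query):
    # Hypothetical stand-in for the expandQuery UDF: returns a bag of expansions.
    return [(query,), (query + " scores",)]

def foreach(bag, generate):
    """Apply `generate` to each tuple independently of all other tuples."""
    return [generate(t) for t in bag]

queries = [("alice", "lakers"), ("bob", "iPod")]
expanded_queries = foreach(queries, lambda t: (t[0], expand_query(t[1])))
print(expanded_queries[0])
```

Because `generate` sees one tuple at a time and keeps no state, the loop could be split across workers in any order, which is the parallelism the slide points to.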

FOREACH & Flattening
- To eliminate nesting in data, use FLATTEN.
- FLATTEN consumes a bag, extracts the fields of the tuples in the bag, and makes them fields of the tuple being output by GENERATE, removing one level of nesting.
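The effect of FLATTEN can be sketched in Python as follows: each input tuple yields one output tuple per member of its nested bag. The helper `foreach_flatten` and the stand-in `expand_query` are hypothetical names used only for this illustration.

```python
def expand_query(query):
    # Hypothetical stand-in for a UDF that returns a bag of expansions.
    return [(query,), (query + " scores",)]

def foreach_flatten(bag, generate):
    """`generate(t)` returns (prefix_fields, nested_bag).

    One output tuple is emitted per tuple of the nested bag, with the nested
    tuple's fields spliced in after the prefix fields.
    """
    output = []
    for t in bag:
        prefix, nested_bag = generate(t)
        for inner in nested_bag:
            output.append(prefix + inner)
    return output

queries = [("alice", "lakers"), ("bob", "iPod")]
flattened = foreach_flatten(queries, lambda t: ((t[0],), expand_query(t[1])))
print(flattened)
```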

FILTER
- Discards unwanted data. Identical to the select operator of relational algebra.
- Syntax:
  - FILTER bag-id BY expression
  - Expression is:
    field-name op constant
    field-name op UDF
    where op might be ==, eq, !=, neq, <, >, <=, >=
  - A comparison may combine several expressions using boolean operators (AND, OR, NOT).

A Comparison with Relational Algebra
- Pig Latin
  - Everything is a bag.
  - Dataflow language.
  - FILTER is the same as the Select operator.
- Relational Algebra
  - Everything is a table.
  - Dataflow language.
  - Select operator is the same as the FILTER command.

MAP part of MapReduce: Grouping related data
- COGROUP groups together tuples from one or more data sets that are related in some way.
- Example:
  - Imagine two data sets:
    - Results contains, for different query strings, the urls shown as search results and the position at which they are shown.
    - Revenue contains, for different query strings and different ad slots, the average amount of revenue made by the ad for that query string at that slot: (queryString, adSlot, amount).
  - For a queryString, group data together.

COGROUP
- The output of a COGROUP contains one tuple for each group.
  - First field of the tuple, named group, is the group identifier.
  - Each of the next fields is a bag, one for each input being cogrouped, and is named the same as the alias of that input.
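The output shape described above can be modeled in Python with one record per group, holding the group identifier plus one bag per input alias. This is a sketch: the `cogroup` helper, the use of dicts for the output tuples, and the sample rows are assumptions, with results taken to be (queryString, url, position) and revenue (queryString, adSlot, amount).

```python
def cogroup(inputs):
    """`inputs` is a list of (alias, bag, key_function).

    Returns one record per group: the group identifier under "group",
    then one bag per input, named after that input's alias.
    """
    groups = {}
    for alias, bag, key in inputs:
        for t in bag:
            k = key(t)
            if k not in groups:
                groups[k] = {"group": k, **{a: [] for a, _, _ in inputs}}
            groups[k][alias].append(t)
    return list(groups.values())

# Hypothetical sample data.
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("iPod", "top", 10)]

grouped_data = cogroup([
    ("results", results, lambda t: t[0]),
    ("revenue", revenue, lambda t: t[0]),
])
print(grouped_data[0])
```

A group that appears in only one input still gets a record; its bag for the other input is simply empty.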

COGROUP
- Grouping can be performed according to arbitrary expressions, which may include UDFs.
- Grouping is different from “Join”.

COGROUP is not JOIN
- Assign search revenue to search-result urls to figure out the monetary worth of each url. A UDF, distributeRevenue, attributes revenue from the top slot entirely to the first search result, while the revenue from the side slot may be attributed equally to all the results.
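One possible reading of distributeRevenue is sketched here in Python, applied to the two bags of a single group. The slide gives only the policy, so the function body, the slot names "top" and "side", the tuple layouts, and the sample values are all assumptions.

```python
def distribute_revenue(results, revenue):
    """results: bag of (queryString, url, position);
    revenue: bag of (queryString, adSlot, amount).

    Top-slot revenue goes entirely to the first search result; revenue from
    any other slot is split equally among all results.
    """
    worth = {url: 0.0 for _, url, _ in results}
    if not worth:
        return []
    first_url = min(results, key=lambda r: r[2])[1]
    for _, ad_slot, amount in revenue:
        if ad_slot == "top":
            worth[first_url] += amount
        else:
            share = amount / len(worth)
            for url in worth:
                worth[url] += share
    return list(worth.items())

# The two bags of one hypothetical group, as COGROUP would hand them over.
lakers_results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2)]
lakers_revenue = [("lakers", "top", 50), ("lakers", "side", 20)]
url_revenues = distribute_revenue(lakers_results, lakers_revenue)
print(url_revenues)
```

The function needs both bags whole, which a join would not provide: a join would already have paired every result with every revenue tuple.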

WITH JOIN

GROUP
- A special case of COGROUP when there is only one data set involved.
- Example: Find the total revenue for each query string.
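The total-revenue example can be sketched in Python as a grouping step followed by a per-group sum. The `group` helper and the sample revenue tuples, assumed to be (queryString, adSlot, amount), are hypothetical.

```python
from collections import defaultdict

def group(bag, key):
    """Single-input grouping: map each key to the bag of tuples that share it."""
    groups = defaultdict(list)
    for t in bag:
        groups[key(t)].append(t)
    return dict(groups)

# Hypothetical sample data.
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("iPod", "top", 10)]

grouped_revenue = group(revenue, key=lambda t: t[0])
query_revenues = {
    query: sum(amount for _, _, amount in bag)
    for query, bag in grouped_revenue.items()
}
print(query_revenues)
```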

JOIN
- Pig Latin supports equi-joins.
- Implemented using COGROUP.
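How an equi-join can be built from cogrouping is sketched below in Python: cogroup both inputs on the join key, then take the cross product of the two bags within each group. The `equi_join` helper and the sample data are assumptions for illustration.

```python
from itertools import product

def equi_join(left, right, left_key, right_key):
    # Cogroup: for every key, collect the matching tuples of each input.
    groups = {}
    for t in left:
        groups.setdefault(left_key(t), ([], []))[0].append(t)
    for t in right:
        groups.setdefault(right_key(t), ([], []))[1].append(t)
    # Flatten both bags of each group: the cross product within the group.
    # A group with an empty bag on either side contributes nothing.
    return [l + r for lbag, rbag in groups.values() for l, r in product(lbag, rbag)]

# Hypothetical sample data.
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("iPod", "top", 10)]

join_result = equi_join(results, revenue, lambda t: t[0], lambda t: t[0])
print(len(join_result))
```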

MapReduce in Pig Latin
- A map function operates on one input tuple at a time, and outputs a bag of key-value pairs.
- The reduce function operates on all values for a key at a time to produce the final results.
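The two functions above fit together as a per-tuple step with flattening, a grouping step on the key, and a per-group step. The following Python sketch shows that composition on a word count; the `map_reduce` helper and the word-count functions are hypothetical examples, not code from the slides.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Per-tuple step with flattening: each record yields a bag of (key, value) pairs.
    map_result = [pair for record in records for pair in map_fn(record)]
    # Grouping step: collect all values that share a key.
    key_groups = defaultdict(list)
    for key, value in map_result:
        key_groups[key].append(value)
    # Per-group step: reduce all values of one key at a time.
    return [reduce_fn(key, values) for key, values in key_groups.items()]

def word_count_map(line):
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    return (word, sum(counts))

counts = map_reduce(["pig latin", "pig"], word_count_map, word_count_reduce)
print(counts)
```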

MapReduce Plan Compilation
- Map tasks assign keys for grouping, and the reduce tasks process a group at a time.
- Compiler:
  - Converts each (CO)GROUP command in the logical plan into a distinct MapReduce job consisting of its own MAP and REDUCE functions.

Debugging Environment
- Iterative process for programming.
- Sandbox data set generated automatically to show results for the expressions.