Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Yahoo! Research.

SIGMOD’08. Presented by Sandeep Patidar. Modified from the original Pig Latin talk.

2 Outline
- Map-Reduce and the Need for Pig Latin
- Pig Latin Example
- Features and Motivation
- Pig Latin
- Implementation
- Debugging Environment
- Usage Scenarios
- Related Work
- Future Work

3 Data Processing Renaissance
- Internet companies are swimming in data, e.g. TBs/day at Yahoo!
- Data analysis is the “inner loop” of product innovation
- Data analysts are skilled programmers

4 Data Warehousing …?
- Scale: often not scalable enough
- Cost ($): prohibitively expensive at web scale, up to $200K/TB
- SQL: little control over the execution method; query optimization is hard in a parallel environment, with little or no statistics and lots of UDFs

5 New Systems For Data Analysis
- Map-Reduce
- Apache Hadoop
- Dryad

6 Map-Reduce
- Map: performs the group-by
- Reduce: performs the aggregation
- These two high-level declarative primitives enable parallel processing

7 Execution overview of Map-Reduce [2]
1) The Map-Reduce library in the user program first splits the input files into M pieces of typically 16 to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2) One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a task.
3) A worker that is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4) Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers.

8 Execution overview of Map-Reduce [2]
5) When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys. The sorting is needed because typically many different keys map to the same reduce task.
6) The reduce worker iterates over the sorted intermediate data and, for each unique key encountered, passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to the final output file for this reduce partition.
7) When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the Map-Reduce call in the user program returns back to the user code.

9 Figure: the input records (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5) pass through map and are grouped by key into k1: v1, v3, v5 and k2: v2, v4; reduce then produces the output records.
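
The grouping shown in this figure can be sketched in a few lines of Python. This is only an in-memory illustration of the map, group-by-key, and reduce steps, not how Hadoop or Google's system is implemented; the function name run_map_reduce, the numeric values, and the choice of sum as the reduce function are assumptions made for the example.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to each record, group the emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Sorting the keys mirrors the sort-by-intermediate-key step on the reduce side.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

# Hypothetical input shaped like the figure: keys k1/k2 with made-up numeric values.
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4), ("k1", 5)]
totals = run_map_reduce(
    records,
    map_fn=lambda rec: [rec],                    # identity map: emit the pair unchanged
    reduce_fn=lambda key, values: sum(values),   # aggregate all values of one key
)
print(totals)
```

All values for k1 reach a single reduce call, which is the property the rest of the talk relies on when GROUP is compiled into a map-reduce job.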

10 Map-Reduce Appeal
- Scale: scalable due to simpler design; only parallelizable operations; no transactions
- Cost ($): runs on cheap commodity hardware
- In place of SQL: procedural control, a processing “pipe”

11 Limitations of Map-Reduce
1. Extremely rigid data flow (M → R). Other flows are constantly hacked in: join, union, split, chains.
2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct.
3. Semantics are hidden inside the map and reduce functions, which makes programs difficult to maintain, extend, and optimize.

12 Pros And Cons
- High-level declarative language (SQL) versus low-level procedural language (Map-Reduce)
- Need a high-level, general data flow language

13 Enter Pig Latin
- Need a high-level, general data flow language: Pig Latin

15 Pig Latin Example 1
Suppose we have a table urls: (url, category, pagerank). A simple SQL query finds, for each sufficiently large category, the average pagerank of the high-pagerank urls in that category:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6

16 Equivalent Pig Latin program
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
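
To make the semantics of each step concrete, here is a rough Python equivalent of the four Pig Latin statements, run over a tiny in-memory list. The function name, the sample rows, and the lowered count threshold are assumptions for illustration only; Pig itself would run these steps as map-reduce jobs over files.

```python
from collections import defaultdict

def avg_pagerank_of_big_categories(urls, min_pagerank=0.2, min_count=10**6):
    # good_urls = FILTER urls BY pagerank > 0.2
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]
    # groups = GROUP good_urls BY category
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)
    # big_groups = FILTER groups BY COUNT(good_urls) > 10^6
    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {
        category: sum(u["pagerank"] for u in members) / len(members)
        for category, members in groups.items()
        if len(members) > min_count
    }

# Made-up sample rows; min_count is lowered so the tiny sample has a "big" category.
sample = [
    {"url": "a.example", "category": "News", "pagerank": 0.75},
    {"url": "b.example", "category": "News", "pagerank": 0.25},
    {"url": "c.example", "category": "News", "pagerank": 0.1},
    {"url": "d.example", "category": "Photos", "pagerank": 0.5},
]
result = avg_pagerank_of_big_categories(sample, min_count=1)
print(result)
```

Each intermediate variable corresponds to one named bag in the Pig Latin program, which is the step-by-step style the talk contrasts with a single SQL block.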

17 Data Flow
Filter good_urls by pagerank > 0.2 → Group by category → Filter category by count > 10^6 → Foreach category generate avg. pagerank

18 Example Data Analysis Task
Find the top 10 most visited pages in each category.

Visits
User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9

19 Data Flow
Load Visits → Group by url → Foreach url generate count
Load Url Info
Join on url → Group by category → Foreach category generate top10 urls

20 In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
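
The same pipeline can be traced in plain Python over the small Visits and Url Info tables from the earlier slide. This is a sketch of what each statement computes, assuming each url appears once in Url Info; the function name top_urls_per_category is made up for the example.

```python
from collections import Counter, defaultdict

def top_urls_per_category(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count(visits)
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url (inner join: unmatched urls drop out)
    joined = [
        (url, category, visit_counts[url])
        for url, category, _prank in url_info
        if url in visit_counts
    ]
    # group by category; foreach category generate the n most visited urls
    by_category = defaultdict(list)
    for url, category, count in joined:
        by_category[category].append((url, count))
    return {
        category: sorted(rows, key=lambda r: r[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]
top = top_urls_per_category(visits, url_info)
print(top)
```

espn.com has no visits, so the inner join removes it and no Sports category appears in the output.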

22 Dataflow Language
The user specifies a sequence of steps, where each step specifies only a single high-level data transformation.
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.” (Jasmine Novak, Engineer, Yahoo!)

23 Step by step execution
Although a Pig Latin program supplies an explicit sequence of operations, the operations need not be executed in that order. For example, to find the set of urls of pages that are classified as spam but have a high pagerank score:
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
isSpam might be an expensive UDF, so it is much better to filter the urls by pagerank first.
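
A small Python experiment shows why the reordering is safe and worthwhile: both orders return the same rows, but the cheap pagerank filter run first reduces the number of calls to the expensive function. The is_spam stand-in, the sample urls, and the call counter are assumptions for illustration, not part of Pig.

```python
def run_filters(urls, order):
    """Apply the two filters in the given order and count calls to the expensive UDF."""
    calls = 0

    def is_spam(url):                  # hypothetical stand-in for the isSpam UDF
        nonlocal calls
        calls += 1
        return "spam" in url

    filters = {
        "spam": lambda row: is_spam(row[0]),
        "pagerank": lambda row: row[1] > 0.8,
    }
    rows = urls
    for name in order:
        rows = [row for row in rows if filters[name](row)]
    return rows, calls

# Made-up (url, pagerank) rows.
urls = [("spam-a.example", 0.9), ("spam-b.example", 0.1),
        ("ok-c.example", 0.95), ("ok-d.example", 0.2)]
as_written, calls_written = run_filters(urls, ["spam", "pagerank"])
reordered, calls_reordered = run_filters(urls, ["pagerank", "spam"])
print(as_written == reordered, calls_written, calls_reordered)
```

The two filters commute because each one looks only at its own row, so an optimizer is free to pick the cheaper order.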

24 Quick Start and Interoperability
- Operates directly over files
- Schemas are optional and can be assigned dynamically
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
Without a schema, fields can be referenced by position:
gVisits = group visits by $1;
where $1 uses positional notation to refer to the second field.

25 Nested Data Model
- Pig Latin has a flexible, fully nested data model (described later) that allows complex, non-atomic data types such as sets, maps, and tuples
- A nested model is closer to how programmers think than normalized (1NF) data
- Avoids expensive joins for web-scale data
- Allows programmers to easily write UDFs

26 UDFs as First-Class Citizens
User-Defined Functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach.
Example 2: Suppose we want to find, for each category, the top 10 urls according to pagerank:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

28 Data Model
- Atom: contains a simple atomic value, e.g. ‘alice’, ‘lanker’, ‘ipod’
- Tuple: a sequence of fields
- Bag: a collection of tuples with possible duplicates

29 Map: a collection of data items, where each item has an associated key through which it can be looked up
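
One way to picture the four types is to map them onto Python values. This is only an analogy with made-up sample values, not Pig's internal representation: a list stands in for a bag because it keeps duplicates, and a dict stands in for a map.

```python
# Atom: a simple atomic value.
atom = "alice"

# Tuple: a sequence of fields; a field may itself be an atom, tuple, bag, or map.
tup = ("alice", "lakers")

# Bag: a collection of tuples with possible duplicates (a list keeps duplicates).
bag = [("alice", "lakers"), ("alice", ("ipod", "apple")), ("alice", "lakers")]

# Map: data items that are looked up through an associated key.
record_map = {"fan of": [("lakers",), ("ipod",)], "age": 20}

print(tup[1], bag.count(("alice", "lakers")), record_map["age"])
```

The second bag element holds a tuple inside a tuple, and the map holds a bag as a value, which is the "fully nested" property mentioned earlier.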

30 Pig Latin Commands
Specifying Input Data: LOAD
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
Per-tuple Processing: FOREACH
expand_queries = FOREACH queries GENERATE userId, expandQuery(queryString);

31 Pig Latin Commands (Cont.)
Discarding Unwanted Data: FILTER
real_queries = FILTER queries BY userId neq 'bot';
or
real_queries = FILTER queries BY NOT isBot(userId);
Filtering conditions involve a combination of expressions, comparison operators such as ==, eq, !=, neq, and the logical connectors AND, OR, NOT.

32 Expressions in Pig Latin

33 Example of flattening in FOREACH
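
As a rough illustration of flattening, the Python sketch below processes each query tuple with and without FLATTEN applied to a UDF that returns a bag. The expand_query function and the sample data are hypothetical; only the nested-versus-flattened output shape is the point.

```python
def foreach_generate(queries, expand_query, flatten=False):
    out = []
    for user_id, query in queries:
        expansions = expand_query(query)        # the UDF returns a bag (here a list)
        if flatten:
            # FLATTEN: one output tuple per element of the nested bag
            out.extend((user_id, e) for e in expansions)
        else:
            # no FLATTEN: the bag stays nested inside a single output tuple
            out.append((user_id, expansions))
    return out

def expand_query(query):                        # hypothetical UDF
    return [query + " rumors", query + " scores"]

queries = [("alice", "lakers")]
nested = foreach_generate(queries, expand_query)
flat = foreach_generate(queries, expand_query, flatten=True)
print(nested)
print(flat)
```

Without FLATTEN the output has one tuple per input tuple; with it, the user id is repeated for every element of the bag.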

34 Pig Latin Commands (Cont.)
Getting Related Data Together: COGROUP
Suppose we have two data sets:
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP result BY queryString, revenue BY queryString;

35 COGROUP versus JOIN
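
The difference between the two operators can be sketched in Python: COGROUP keeps the matching tuples from each input in separate nested bags per key, while an equi-join is what you get by taking the cross product of those bags and flattening. The function names and the sample rows are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, key=lambda t: t[0]):
    """Return {key: (bag of left tuples, bag of right tuples)}."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[key(t)][0].append(t)
    for t in right:
        groups[key(t)][1].append(t)
    return dict(groups)

def join(left, right, key=lambda t: t[0]):
    """Equi-join expressed as COGROUP followed by a per-group cross product."""
    return [l + r
            for l_bag, r_bag in cogroup(left, right, key).values()
            for l, r in product(l_bag, r_bag)]

# Made-up rows shaped like result:(queryString, url, position)
# and revenue:(queryString, adSlot, amount).
result = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]
grouped = cogroup(result, revenue)
joined = join(result, revenue)
print(len(grouped), len(joined))
```

Keeping the bags separate lets a UDF see all results and all revenue rows for one query string together, which a flat join output does not offer directly.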

36 Pig Latin Example 3
Suppose we were trying to attribute search revenue to search-result urls to figure out the monetary worth of each url.
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(result, revenue));
Here distributeRevenue is a UDF that accepts the search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them.

37 Pig Latin Commands (Cont.)
Special case of COGROUP: GROUP
grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;
JOIN in Pig Latin
join_result = JOIN result BY queryString, revenue BY queryString;

38 Pig Latin Commands (Cont.)
Map-Reduce in Pig Latin
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_group = GROUP map_result BY $0;
output = FOREACH key_group GENERATE reduce(*);

39 Pig Latin Commands (Cont.)
Other Commands
- UNION: returns the union of two or more bags
- CROSS: returns the cross product
- ORDER: orders a bag by the specified field(s)
- DISTINCT: eliminates duplicate tuples in a bag
Nested Operations
- Pig Latin allows some commands to be nested within a FOREACH command

40 Pig Latin Commands (Cont.)
Asking for Output: STORE
The user can ask for the result of a Pig Latin expression sequence to be materialized to a file:
STORE query_revenue INTO 'myoutput' USING myStore();
myStore is a custom serializer. For plain text files, it can be omitted.

42 Implementation
Figure labels: user; SQL (automatic rewrite + optimize) or Pig; Hadoop Map-Reduce; cluster.
Pig is open-source. http://incubator.apache.org/pig

43 Building a Logical Plan
- The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags being referred to are valid
- It builds a logical plan for every bag that the user defines
- Processing is triggered only when the user invokes a STORE command on a bag; at that point, the logical plan for that bag is compiled into a physical plan and executed
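
The lazy behaviour described here can be imitated with a toy Python class: defining a bag only records a plan step, and nothing is read or computed until store is called. The Bag class, its method names, and the loader are all hypothetical and far simpler than Pig's real logical and physical plans.

```python
class Bag:
    """A named step in a toy logical plan; nothing runs until store() is called."""

    def __init__(self, op, parent=None):
        self.op = op              # a zero-argument loader, or a function list -> list
        self.parent = parent

    def filter(self, pred):
        return Bag(lambda rows: [r for r in rows if pred(r)], self)

    def foreach(self, fn):
        return Bag(lambda rows: [fn(r) for r in rows], self)

    def store(self, sink):
        # Walk back to the loader, then execute the recorded steps front to back.
        steps, node = [], self
        while node.parent is not None:
            steps.append(node.op)
            node = node.parent
        rows = node.op()
        for op in reversed(steps):
            rows = op(rows)
        sink.extend(rows)

loads = []

def load():
    loads.append("loaded")        # records that the input was actually read
    return [1, 2, 3, 4]

plan = Bag(load).filter(lambda x: x % 2 == 0).foreach(lambda x: x * 10)
ran_before_store = list(loads)    # still empty: defining bags did no work
output = []
plan.store(output)
print(ran_before_store, output)
```

Deferring execution this way is what gives the system a whole plan to look at before deciding how to run it.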

44 Map-Reduce Plan Compilation
- Every group or join operation forms a map-reduce boundary
- Other operations are pipelined into the map and reduce phases

45 Compilation into Map-Reduce
Map 1: Filter good_urls by pagerank > 0.2 → Group by category
Reduce 1: Filter category by count > 10^6 → Foreach category generate avg. pagerank
Every group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases.

46 Compilation into Map-Reduce
Map 1 / Reduce 1: Load Visits → Group by url → Foreach url generate count
Map 2 / Reduce 2: Load Url Info → Join on url
Map 3 / Reduce 3: Group by category → Foreach category generate top10(urls)
Every group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases.
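
The boundary rule can be sketched as a small Python function that splits a linear list of operators into map-reduce stages. It is a simplification under stated assumptions: the plan is a straight line (no second input), operators that follow a boundary are placed in that stage's reduce phase, and the names compile_to_stages and BOUNDARY are made up for the example.

```python
BOUNDARY = {"group", "cogroup", "join"}

def compile_to_stages(ops):
    """Split (kind, description) operators into stages at each group/join boundary."""
    stages = [{"map": [], "boundary": None, "reduce": []}]
    for kind, desc in ops:
        stage = stages[-1]
        if kind in BOUNDARY:
            if stage["boundary"] is not None:
                # A second boundary needs its own map-reduce job.
                stage = {"map": [], "boundary": None, "reduce": []}
                stages.append(stage)
            stage["boundary"] = desc
        elif stage["boundary"] is None:
            stage["map"].append(desc)       # before the boundary: map phase
        else:
            stage["reduce"].append(desc)    # after the boundary: reduce phase
    return stages

plan = [("filter", "pagerank > 0.2"), ("group", "by category"),
        ("filter", "count > 10^6"), ("foreach", "avg pagerank")]
stages = compile_to_stages(plan)

chained = compile_to_stages([("group", "by url"), ("foreach", "count"),
                             ("group", "by category"), ("foreach", "top10")])
print(stages)
print(len(chained))
```

For the first plan this reproduces the single Map 1 / Reduce 1 split shown for Example 1; two groupings in a row yield two jobs.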

47 Efficiency With Nested Bags
- The (CO)GROUP command places tuples belonging to the same group into one or more nested bags
- The system can avoid actually materializing these bags, which is especially important when the bags are larger than a machine’s main memory
- One common case is where the user applies an algebraic aggregation function over the result of a (CO)GROUP operation
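
An algebraic aggregate is one that can be computed from small partial results, so the full bag never has to sit in memory on one machine. The Python sketch below shows this for an average, using (sum, count) partials; the three function names and the sample chunks are assumptions for illustration rather than Pig's actual UDF interface.

```python
def avg_initial(values):
    """Runs on each chunk of a group's values: a partial (sum, count)."""
    return (sum(values), len(values))

def avg_combine(partials):
    """Merges partials; can be applied repeatedly and in any grouping."""
    return (sum(s for s, _ in partials), sum(c for _, c in partials))

def avg_final(partial):
    total, count = partial
    return total / count

# One group's values, split across three hypothetical workers.
chunks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
partials = [avg_initial(chunk) for chunk in chunks]
average = avg_final(avg_combine(partials))
print(partials, average)
```

Only one (sum, count) pair per chunk crosses the network, regardless of how many tuples the group contains.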

48 Debugging Environment
- Constructing a Pig Latin program is an iterative process: the user makes an initial stab at writing a program, submits it to the system for execution, and inspects the output
- Over large data sets each iteration is slow; to avoid this inefficiency, users often create a side data set
- Unfortunately, this method does not always work well
- Pig comes with a debugging environment called Pig Pen, which creates a side data set automatically

49 Pig Pen screen shot

50 Generating a Sandbox Data Set
There are three primary objectives in selecting a sandbox data set:
- Realism: the sandbox data set should be a subset of the actual data set
- Conciseness: example bags should be as small as possible
- Completeness: example bags should collectively illustrate the key semantics of each command

51 Usage Scenarios
Session Analysis:
- Web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate:
  - How long is the average user session?
  - How many links does a user click on before leaving a website?
  - How do click patterns vary over the course of a day/week/month?
- Analysis tasks mainly consist of grouping the activity log by user and/or website
- First production release about a year ago
- At Yahoo!, 30% of all Hadoop jobs are run with Pig

52 Related Work
Sawzall
- Scripting language used at Google on top of map-reduce
- Rigid structure consisting of a filtering phase followed by an aggregation phase
DryadLINQ
- SQL-like language on top of Dryad, used at Microsoft
Nested Data Models
- Explored before in the context of object-oriented databases
- Data-parallel languages over nested data have also been explored, e.g., NESL

53 Future Work
Safe Optimizer
- Performs only high-confidence rewrites
User Interface
- “Boxes and arrows” GUI
- Promote collaboration, sharing code fragments and UDFs
External Functions
- Tight integration with a scripting language such as Perl or Python
Unified Environment

54 Summary
- Big demand for parallel data processing
  - Emerging tools that do not look like SQL DBMS
  - Programmers like dataflow pipes over static files
- Hence the excitement about Map-Reduce
- But Map-Reduce is too low-level and rigid
- Pig Latin: sweet spot between map-reduce and SQL

55 References
[1] C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004.
[3] Pig Latin talk at SIGMOD 2008. http://i.stanford.edu/~usriv/talks/sigmod08-pig-latin.ppt

56 Thank you
