# Pig, Making Hadoop Easy

Alan F. Gates, Yahoo!



## Who Am I?

- Pig committer and PMC member
- An architect on the Yahoo! grid team

Photo credit: Steven Guarnaccia, The Three Little Pigs

## Motivation by Example

You have web server logs of purchases on your site. You want to find the 10 users who bought the most and the cities they live in. You also need to know what percentage of purchases they account for in those cities. The data flow:

- Load logs
- Find top 10 users
- Store top 10 users
- Join by city
- Sum purchases by city
- Calculate percentage
- Store results

## In Pig Latin

```pig
raw = load 'logs' as (name, city, purchase);

-- Find top 10 users
usrgrp  = group raw by (name, city);
byusr   = foreach usrgrp generate group as k1, SUM(raw.purchase) as utotal;
srtusr  = order byusr by utotal desc;
topusrs = limit srtusr 10;
store topusrs into 'top_users';

-- Count purchases per city
citygrp = group raw by city;
bycity  = foreach citygrp generate group as k2, SUM(raw.purchase) as ctotal;

-- Join top users back to city totals
jnd = join topusrs by k1.city, bycity by k2;
pct = foreach jnd generate k1.name, k1.city, utotal / ctotal;
store pct into 'top_users_pct_of_city';
```
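For readers without a cluster handy, the same dataflow can be sketched in plain Python. This is purely illustrative (the sample records are invented); Pig's value is running this logic at scale as MapReduce jobs:

```python
from collections import defaultdict

# Hypothetical sample of the purchase log: (name, city, purchase).
logs = [("fred", "nyc", 5), ("jane", "sf", 3), ("fred", "nyc", 2), ("amy", "nyc", 4)]

# Sum purchases per (name, city), then take the top users.
by_user = defaultdict(int)
for name, city, purchase in logs:
    by_user[(name, city)] += purchase
top_users = sorted(by_user.items(), key=lambda kv: -kv[1])[:10]

# Sum purchases per city.
by_city = defaultdict(int)
for name, city, purchase in logs:
    by_city[city] += purchase

# Join top users back to their city totals and compute the percentage.
pct = [(name, city, total / by_city[city]) for (name, city), total in top_users]
print(pct)
```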

## Translates to Four MapReduce Jobs

- Job 1: load the logs; group by user and sum user purchases; store user purchases; in the same pass, group by city and sum city purchases.
- Job 2: sample the output of the user sums to decide how to partition for the order by.
- Job 3: order the user sums; limit to the top 10.
- Job 4: join the top users' purchases with the city purchases; store the results.

## Performance

(Benchmark chart in the original slides.)

## Where Do Pigs Live?

- Data Collection
- Data Factory, where Pig lives: pipelines, iterative processing, research
- Data Warehouse: BI tools, analysis

## Pig Highlights

- Language designed to enable efficient description of data flows
- Standard relational operators built in
- User defined functions (UDFs) can be written for column transformation (TOUPPER) or aggregation (SUM)
- UDFs can be written to take advantage of the combiner
- Four join implementations built in: hash, fragment-replicate, merge, skewed
- Multi-query: Pig will combine certain types of operations into a single pipeline to reduce the number of times data is scanned
- Order by provides total ordering across reducers in a balanced way
- Writing load and store functions is easy once an InputFormat and OutputFormat exist
- Piggybank, a collection of user-contributed UDFs
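The combiner point above can be sketched outside Hadoop: an algebraic aggregate like SUM can be partially computed per map task and the partials merged at the reducer, shrinking the shuffle. A minimal Python sketch (function names and data are illustrative, not Pig's API):

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Emit one (key, value) pair per input record.
    return [(name, purchase) for name, city, purchase in records]

def combine(pairs):
    # Combiner: partially sum values per key on the map side,
    # reducing the volume of data shuffled to reducers.
    pairs = sorted(pairs, key=itemgetter(0))
    return [(k, sum(v for _, v in grp)) for k, grp in groupby(pairs, key=itemgetter(0))]

def reduce_phase(all_partials):
    # Reducer: merge the partial sums from every map task.
    merged = defaultdict(int)
    for k, v in all_partials:
        merged[k] += v
    return dict(merged)

logs = [("fred", "nyc", 5), ("jane", "sf", 3), ("fred", "nyc", 2)]
partials = combine(map_phase(logs))
print(reduce_phase(partials))
```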

## Multi-Store Script

```pig
A  = load 'users' as (name, age, gender, city, state);
B  = filter A by name is not null;
C1 = group B by (age, gender);
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
```

The plan loads users once and filters out null names, then splits: one branch groups by age and gender and stores into 'bydemo'; the other groups by state and stores into 'bystate'.

## Multi-Store Map-Reduce Plan

- Map phase: filter, then split, with a local rearrange on each branch
- Reduce phase: a multiplexer routes each record to the proper branch's package and foreach operators

## Hash Join

```pig
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd   = join Users by name, Pages by user;
```

Each map task reads a block of Pages or Users and tags every record with its input, e.g. (1, fred) from Pages and (2, fred) from Users. Records with the same join key hash to the same reducer, which joins the two sides.
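The tag-and-shuffle mechanics of the default hash join can be sketched in Python; the partitioning by hashed key stands in for Hadoop's shuffle (data and names are illustrative):

```python
from collections import defaultdict

def hash_join(pages, users, n_reducers=2):
    # Map phase: tag each record with its input and route it to a
    # "reducer" by hashing the join key, as the shuffle would.
    shuffle = defaultdict(list)
    for user, url in pages:
        shuffle[hash(user) % n_reducers].append(("pages", user, url))
    for name, age in users:
        shuffle[hash(name) % n_reducers].append(("users", name, age))

    # Reduce phase: within each partition, collect both sides per key
    # and emit the cross product of matching rows.
    out = []
    for part in shuffle.values():
        left, right = defaultdict(list), defaultdict(list)
        for tag, key, val in part:
            (left if tag == "pages" else right)[key].append(val)
        for key in left:
            for url in left[key]:
                for age in right.get(key, []):
                    out.append((key, url, age))
    return out

pages = [("fred", "/a"), ("jane", "/b"), ("fred", "/c")]
users = [("fred", 30), ("jane", 25)]
print(sorted(hash_join(pages, users)))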

## Fragment-Replicate Join

```pig
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd   = join Pages by user, Users by name using 'replicated';
```

The small input (Users) is replicated in full to every map task, while the large input (Pages) is fragmented across map tasks; the join completes entirely on the map side, with no reduce phase.
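The map-side mechanics can be sketched in Python: each mapper builds an in-memory hash table from the replicated small input and probes it while streaming its fragment of the large input (data and names are illustrative):

```python
def replicated_join(pages_fragment, users_small):
    # Build an in-memory hash table from the small, replicated input;
    # every map task holds the entire Users table.
    users_by_name = {}
    for name, age in users_small:
        users_by_name.setdefault(name, []).append(age)

    # Stream this mapper's fragment of the large input and probe the
    # table: the join finishes map side, with no shuffle or reduce.
    return [(user, url, age)
            for user, url in pages_fragment
            for age in users_by_name.get(user, [])]

pages_block = [("fred", "/a"), ("jane", "/b"), ("zoe", "/c")]
users = [("fred", 30), ("jane", 25)]
print(replicated_join(pages_block, users))
```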

## Skew Join

```pig
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd   = join Pages by user, Users by name using 'skewed';
```

Pig first samples the input to find heavily skewed keys. Records for a hot key (e.g. fred) on the Pages side are split across multiple reducers, and the matching Users records for that key are replicated to each of those reducers, so no single reducer is overwhelmed.
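The hot-key handling can be sketched in Python: the skewed side's rows for a hot key are spread round-robin over several reducer slots, while the other side's rows for that key are replicated to each slot (a simplified sketch; Pig decides hot keys by sampling rather than taking them as a parameter):

```python
import itertools

def skew_join(pages, users, hot_keys, n_splits=2):
    # reducers maps a (key, split) id to (pages rows, users rows).
    reducers = {}
    def slot(rid):
        return reducers.setdefault(rid, ([], []))

    # Skewed side: hot-key rows are round-robined across splits.
    rr = itertools.cycle(range(n_splits))
    for user, url in pages:
        rid = (user, next(rr)) if user in hot_keys else (user, 0)
        slot(rid)[0].append((user, url))

    # Other side: hot-key rows are replicated to every split.
    for name, age in users:
        splits = range(n_splits) if name in hot_keys else [0]
        for i in splits:
            slot((name, i))[1].append((name, age))

    # Each reducer joins only the rows it received.
    out = []
    for prows, urows in reducers.values():
        for user, url in prows:
            for name, age in urows:
                if name == user:
                    out.append((user, url, age))
    return out

pages = [("fred", "/1"), ("fred", "/2"), ("fred", "/3"), ("jane", "/b")]
users = [("fred", 30), ("jane", 25)]
print(sorted(skew_join(pages, users, {"fred"})))
```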

## Merge Join

```pig
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd   = join Pages by user, Users by name using 'merge';
```

Both inputs must already be sorted on the join key (e.g. aaron through zach). Each map task walks its sorted fragment of Pages and the matching sorted range of Users in lockstep, merging the two; no reduce phase is needed.
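The core merge can be sketched in Python as a single synchronized scan over two sorted inputs (data is illustrative):

```python
def merge_join(pages_sorted, users_sorted):
    # Both inputs must arrive sorted on the join key; the join is one
    # synchronized pass, so no shuffle or reduce phase is needed.
    out, i, j = [], 0, 0
    while i < len(pages_sorted) and j < len(users_sorted):
        user, url = pages_sorted[i]
        name, _ = users_sorted[j]
        if user < name:
            i += 1
        elif user > name:
            j += 1
        else:
            # Emit every Users row matching this key, then advance Pages.
            k = j
            while k < len(users_sorted) and users_sorted[k][0] == user:
                out.append((user, url, users_sorted[k][1]))
                k += 1
            i += 1
    return out

pages = [("aaron", "/a"), ("amy", "/b"), ("amy", "/c")]
users = [("aaron", 71), ("amy", 22), ("barb", 40)]
print(merge_join(pages, users))
```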

## Who Uses Pig, for What?

- 70% of production grid jobs at Yahoo (tens of thousands per day)
- Also used by Twitter, LinkedIn, eBay, AOL, ...
- Used to:
  - Process web logs
  - Build user behavior models
  - Process images
  - Build maps of the web
  - Do research on raw data sets

## Components

Pig resides on the user's machine; the job executes on the Hadoop cluster. There is no need to install anything extra on your Hadoop cluster.

Accessing Pig:

- Submit a script directly
- Grunt, the Pig shell
- PigServer Java class, a JDBC-like interface

## How It Works

```pig
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
```

pig.jar parses, checks, and optimizes the Pig Latin, plans the execution (here: filter and initial counting in the map phase, summing of partial counts in the combine and reduce phases), submits the jar to Hadoop, and monitors job progress.

## New in 0.8

- UDFs can be written in Jython; not too much work to expand to other scripting languages
- Improved and greatly expanded statistics
- Performance improvements:
  - Automatic merging of small files
  - Compression of intermediate results
- PigUnit for unit testing your Pig Latin scripts
- Access to static Java functions as UDFs
- Improved HBase integration
- Custom partitioners: `B = group A by $0 partition by YourPartitioner parallel 2;`
- Greatly expanded string and math built-in UDFs

## What's Next?

Preview of Pig 0.9:

- Integrate Pig with scripting languages for control flow
- Add macros to Pig Latin
- Revive ILLUSTRATE
- Fix most runtime type errors
- Rewrite the parser to give useful error messages

Also coming: Programming Pig, from O'Reilly Press.

## Learn More

- Read the online documentation: http://pig.apache.org/
- Hadoop: The Definitive Guide, 2nd edition, has an up-to-date chapter on Pig; search at your favorite bookstore
- Join the mailing lists:
  - user@pig.apache.org for user questions
  - dev@pig.apache.org for developer issues
- Follow me on Twitter: @alanfgates

## UDFs in Scripting Languages

- Evaluation functions can now be written in scripting languages that compile down to the JVM
- Reference implementation provided in Jython
- JRuby and others could be added with minimal code
- JavaScript implementation in progress
- Jython sold separately (not bundled with Pig)

## Example Python UDF

test.py:

```python
@outputSchema("sqr:long")
def square(num):
    return num * num
```

test.pig:

```pig
register 'test.py' using jython as myfuncs;
A = load 'input' as (i:int);
B = foreach A generate myfuncs.square(i);
dump B;
```

## Better Statistics

- Statistics printed out at the end of a job run
- Pig information stored in Hadoop's job history files, so you can mine the information and analyze your Pig usage
  - A loader for reading job history files is included in Piggybank
- New PigRunner interface allows users to invoke Pig and get back a statistics object containing stats information
  - Can also pass a listener to track Pig jobs as they run
  - Done for Oozie so it can show users Pig statistics

## Sample Stats Info

Column abbreviations: MxMT/MnMT/AMT are max/min/avg map time; MxRT/MnRT/ART are max/min/avg reduce time.

```text
Job Stats (time in seconds):
JobId  Maps  Reduces  MxMT  MnMT  AMT  MxRT  MnRT  ART  Alias
job_0  2     1        15    3     9    27    27    27   a,b,c,d,e
job_1  1     1        3     3     3    12    12    12   g,h
job_2  1     1        3     3     3    12    12    12   i
job_3  1     1        3     3     3    12    12    12   i

Input(s):
Successfully read 10000 records from: "studenttab10k"
Successfully read 10000 records from: "votertab10k"

Output(s):
Successfully stored 6 records (150 bytes) in: "outfile"

Counters:
Total records written : 6
Total bytes written : 150
```

## Invoke Static Java Functions as UDFs

Often the UDF you need already exists as a Java function, e.g. Java's URLDecoder.decode() for decoding URLs:

```pig
define UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
A = load 'encoded.txt' as (e:chararray);
B = foreach A generate UrlDecode(e, 'UTF-8');
```

Currently this only works with simple types and static functions.
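The underlying idea, resolving an existing library function by its qualified name at runtime instead of writing a wrapper, can be sketched with Python's standard reflection (the helper name is illustrative, not part of Pig):

```python
import importlib

def invoke_by_name(qualified_name, *args):
    # Split 'package.module.function', import the module, look the
    # function up, and call it, mirroring how InvokeForString wraps
    # a static Java method named by its fully qualified name.
    module_name, func_name = qualified_name.rsplit(".", 1)
    func = getattr(importlib.import_module(module_name), func_name)
    return func(*args)

print(invoke_by_name("urllib.parse.unquote", "a%20b"))  # a b
```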

## Improved HBase Integration

- Can now read records as bytes instead of auto-converting to strings
- Filters can be pushed down
- Can store data in HBase as well as load from it
- Works with HBase 0.20 but not 0.89 or 0.90; the patch in PIG-1680 addresses this but has not been committed yet

