Alan Gates: Becoming a Pig Developer


1 Alan Gates Becoming a Pig Developer

2 Who Am I?
Pig committer
Hadoop PMC member
Yahoo! architect for Pig

3 Current Status
Release 0.3, June 2009
–Multi-store queries
Pig added to Amazon Elastic MapReduce, August 2009
Release 0.4, September 2009
–Added skew and merge join
–Added outer join (for default hash join only)
Release 0.5, November 2009
–Hadoop 0.20

4 Components
[Diagram: user machine and Hadoop cluster]
Pig resides on the user machine
Job executes on the cluster
No need to install anything extra on your Hadoop cluster

5 How It Works
Script (A = load; B = filter; C = group; D = foreach) → Parser → Logical Plan → Semantic Checks → Logical Plan → Logical Optimizer → Logical Plan → Logical to Physical Translator → Physical Plan → Physical to MR Translator → Map-Reduce Plan → MapReduce Launcher → jar to Hadoop
Logical Plan ≈ relational algebra
Logical Optimizer: standard plan optimizations
Physical Plan = physical operators to be executed
Map-Reduce Plan = physical operators broken into Map, Combine, and Reduce stages

6 Fragment Replicate Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using "replicated";
[Diagram: Map 1 reads Pages block 1 and Map 2 reads Pages block 2; each map also holds a full copy of Users]
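The idea behind the diagram can be illustrated with a small plain-Java simulation. This is only a sketch, not Pig's implementation: the class and method names (ReplicatedJoinSketch, mapTask) are hypothetical, rows are modeled as String arrays, and Users is assumed small enough to fit in each map task's memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a fragment replicate join (hypothetical names, not Pig code).
final class ReplicatedJoinSketch {

    // One simulated map task: joins one block of Pages against a full copy of Users.
    // Pages rows are {user, url}; Users rows are {name, age}.
    static List<String> mapTask(List<String[]> pagesBlock, List<String[]> users) {
        // Build an in-memory hash table over the small relation, keyed by name.
        Map<String, List<String[]>> usersByName = new HashMap<>();
        for (String[] u : users) {
            usersByName.computeIfAbsent(u[0], k -> new ArrayList<>()).add(u);
        }
        List<String> joined = new ArrayList<>();
        // Stream the block of the large relation and probe the table; no reduce phase.
        for (String[] p : pagesBlock) {
            List<String[]> matches = usersByName.get(p[0]);
            if (matches == null) {
                continue; // inner join: pages without a matching user are dropped
            }
            for (String[] u : matches) {
                joined.add(p[0] + "," + p[1] + "," + u[0] + "," + u[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.<String[]>asList(
            new String[] {"fred", "25"}, new String[] {"jane", "31"});
        List<String[]> block1 = Arrays.<String[]>asList(
            new String[] {"fred", "a.com"}, new String[] {"bob", "b.com"});
        List<String[]> block2 = Arrays.<String[]>asList(
            new String[] {"jane", "c.com"}, new String[] {"fred", "d.com"});
        // Each map task works independently on its own block.
        System.out.println(mapTask(block1, users));
        System.out.println(mapTask(block2, users));
    }
}
```

Because every map task has all of Users, the join finishes in the map phase, which is why this strategy only suits a small second input.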

7 Hash Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Users by name, Pages by user;
[Diagram: Map 1 reads Pages block n and tags records (1, user); Map 2 reads Users block m and tags records (2, name); Reducer 1 receives (1, fred) and (2, fred); Reducer 2 receives (1, jane) and (2, jane)]
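The tag, shuffle, and reduce steps in the diagram can be sketched in plain Java. This is an illustrative simulation under simplifying assumptions, not Pig's code: HashJoinSketch and its methods are hypothetical names, rows are two-element String arrays, and the shuffle is modeled as hash(key) mod the number of reducers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side hash join (hypothetical names, not Pig code).
final class HashJoinSketch {

    // Pages rows are {user, url}; Users rows are {name, age}.
    static List<String> join(List<String[]> pages, List<String[]> users, int numReducers) {
        // reducers.get(r) maps a join key to the tagged records {tag, payload} routed to reducer r.
        List<Map<String, List<String[]>>> reducers = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            reducers.add(new TreeMap<String, List<String[]>>());
        }
        // Map phase plus shuffle: tag each record with the index of its input.
        shuffle(pages, "1", reducers);
        shuffle(users, "2", reducers);

        // Reduce phase: for each key, cross the records tagged 1 with those tagged 2.
        List<String> out = new ArrayList<>();
        for (Map<String, List<String[]>> reducer : reducers) {
            for (Map.Entry<String, List<String[]>> e : reducer.entrySet()) {
                List<String> urls = new ArrayList<>();
                List<String> ages = new ArrayList<>();
                for (String[] tagged : e.getValue()) {
                    if (tagged[0].equals("1")) {
                        urls.add(tagged[1]);
                    } else {
                        ages.add(tagged[1]);
                    }
                }
                for (String url : urls) {
                    for (String age : ages) {
                        out.add(e.getKey() + "," + url + "," + age);
                    }
                }
            }
        }
        return out;
    }

    // Routes every row to the reducer chosen by the hash of its join key (column 0).
    static void shuffle(List<String[]> rows, String tag,
                        List<Map<String, List<String[]>>> reducers) {
        for (String[] row : rows) {
            int r = Math.floorMod(row[0].hashCode(), reducers.size());
            reducers.get(r)
                    .computeIfAbsent(row[0], k -> new ArrayList<>())
                    .add(new String[] {tag, row[1]});
        }
    }

    public static void main(String[] args) {
        List<String[]> pages = Arrays.<String[]>asList(
            new String[] {"fred", "a.com"}, new String[] {"jane", "c.com"});
        List<String[]> users = Arrays.<String[]>asList(
            new String[] {"fred", "25"}, new String[] {"jane", "31"});
        System.out.println(join(pages, users, 2));
    }
}
```

All records for one key land on the same reducer, so a single very frequent key can overload one reducer; that is the case the skew join on the next slide addresses.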

8 Skew Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using "skewed";
[Diagram: Map 1 reads Pages block n and tags records (1, user); Map 2 reads Users block m and tags records (2, name); Reducer 1 receives (1, fred, p1), (1, fred, p2), and (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), and (2, fred)]
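The diagram shows the records of one heavy key (fred) being split across both reducers, with the matching Users record sent to each. A minimal plain-Java sketch of that idea follows. It is not Pig's implementation: SkewJoinSketch is a hypothetical name, the set of hot keys is assumed to be known in advance, and hot records are dealt out round-robin, which is a simplification chosen here for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Plain-Java simulation of a skew join (hypothetical names, not Pig code).
final class SkewJoinSketch {

    // Returns the joined rows produced by each reducer.
    // Pages rows are {user, url}; Users rows are {name, age}.
    static List<List<String>> join(List<String[]> pages, List<String[]> users,
                                   Set<String> hotKeys, int numReducers) {
        List<List<String[]>> pagesAt = new ArrayList<>();
        List<List<String[]>> usersAt = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            pagesAt.add(new ArrayList<String[]>());
            usersAt.add(new ArrayList<String[]>());
        }
        // Skewed input: spread hot-key records round-robin, hash everything else.
        int next = 0;
        for (String[] p : pages) {
            int r = hotKeys.contains(p[0])
                ? (next++ % numReducers)
                : Math.floorMod(p[0].hashCode(), numReducers);
            pagesAt.get(r).add(p);
        }
        // Other input: replicate hot-key records to every reducer, hash everything else.
        for (String[] u : users) {
            if (hotKeys.contains(u[0])) {
                for (List<String[]> bucket : usersAt) {
                    bucket.add(u);
                }
            } else {
                usersAt.get(Math.floorMod(u[0].hashCode(), numReducers)).add(u);
            }
        }
        // Each reducer joins only the records it received.
        List<List<String>> out = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            List<String> joined = new ArrayList<>();
            for (String[] p : pagesAt.get(r)) {
                for (String[] u : usersAt.get(r)) {
                    if (p[0].equals(u[0])) {
                        joined.add(p[0] + "," + p[1] + "," + u[1]);
                    }
                }
            }
            out.add(joined);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pages = Arrays.<String[]>asList(
            new String[] {"fred", "p1"}, new String[] {"fred", "p2"},
            new String[] {"fred", "p3"}, new String[] {"fred", "p4"});
        List<String[]> users = Arrays.<String[]>asList(new String[] {"fred", "25"});
        System.out.println(join(pages, users, Collections.singleton("fred"), 2));
    }
}
```

The cost of balancing the heavy key is that its matching records from the other input are copied to every reducer that shares the key.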

9 Merge Join
    Users = load 'users' as (name, age);
    Pages = load 'pages' as (user, url);
    Jnd = join Pages by user, Users by name using "merge";
[Diagram: Pages and Users are both sorted, aaron … zach; Map 1 reads Pages aaron … amr and Users from aaron; Map 2 reads Pages amy … barb and Users from amy]
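Because both inputs are already sorted on the join key, each map can join its range with a single forward pass. The following plain-Java sketch shows that merge step for one range. It is illustrative only: MergeJoinSketch is a hypothetical name, and both lists are assumed to be sorted ascending by column 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the merge step of a merge join (hypothetical names, not Pig code).
final class MergeJoinSketch {

    // Both inputs must already be sorted by column 0.
    // Pages rows are {user, url}; Users rows are {name, age}.
    static List<String> join(List<String[]> pages, List<String[]> users) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < pages.size() && j < users.size()) {
            int cmp = pages.get(i)[0].compareTo(users.get(j)[0]);
            if (cmp < 0) {
                i++; // page key is smaller: advance Pages
            } else if (cmp > 0) {
                j++; // user key is smaller: advance Users
            } else {
                String key = pages.get(i)[0];
                // Find the end of the run of Users rows sharing this key.
                int jEnd = j;
                while (jEnd < users.size() && users.get(jEnd)[0].equals(key)) {
                    jEnd++;
                }
                // Pair every Pages row in the run with every Users row in the run.
                while (i < pages.size() && pages.get(i)[0].equals(key)) {
                    for (int k = j; k < jEnd; k++) {
                        out.add(key + "," + pages.get(i)[1] + "," + users.get(k)[1]);
                    }
                    i++;
                }
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pages = Arrays.<String[]>asList(
            new String[] {"aaron", "a.com"}, new String[] {"amy", "b.com"},
            new String[] {"amy", "c.com"}, new String[] {"zach", "d.com"});
        List<String[]> users = Arrays.<String[]>asList(
            new String[] {"aaron", "20"}, new String[] {"amy", "30"},
            new String[] {"barb", "40"});
        System.out.println(join(pages, users));
    }
}
```

Neither input is held in memory or shuffled in this sketch; the price is the requirement that both inputs arrive sorted.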

10 Multi-store script
    A = load 'users' as (name, age, gender, city, state);
    B = filter A by name is not null;
    C1 = group B by age, gender;
    D1 = foreach C1 generate group, COUNT(B);
    store D1 into 'bydemo';
    C2 = group B by state;
    D2 = foreach C2 generate group, COUNT(B);
    store D2 into 'bystate';
[Diagram: load users → filter nulls → two branches: (group by age, gender → apply UDFs → store into 'bydemo') and (group by state → apply UDFs → store into 'bystate')]
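The point of a multi-store script is that the shared load and filter run once and feed both groupings. A rough plain-Java analogue is a single pass over the rows that updates two aggregates. This is a sketch only: MultiStoreSketch is a hypothetical name, rows are String arrays in the column order of the script, and the two result maps stand in for the 'bydemo' and 'bystate' outputs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of a multi-store script (hypothetical names, not Pig code).
final class MultiStoreSketch {

    // Rows are {name, age, gender, city, state}.
    // Returns two maps: counts by "age,gender" and counts by state.
    static List<Map<String, Long>> run(List<String[]> users) {
        Map<String, Long> byDemo = new TreeMap<>();
        Map<String, Long> byState = new TreeMap<>();
        for (String[] u : users) {      // one scan of the input
            if (u[0] == null) {
                continue;               // shared filter: name is not null
            }
            byDemo.merge(u[1] + "," + u[2], 1L, Long::sum);   // group by age, gender
            byState.merge(u[4], 1L, Long::sum);               // group by state
        }
        return Arrays.asList(byDemo, byState);
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.<String[]>asList(
            new String[] {"fred", "25", "m", "sf", "CA"},
            new String[] {null, "25", "m", "la", "CA"},
            new String[] {"jane", "25", "m", "ny", "NY"});
        List<Map<String, Long>> result = run(users);
        System.out.println("bydemo:  " + result.get(0));
        System.out.println("bystate: " + result.get(1));
    }
}
```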

11 Multi-Store Map-Reduce Plan
Map: filter → split → local rearrange (one per branch)
Reduce: multiplex → package → foreach

12 Basic User Defined Functions
    A = load 'users';
    B = group A all;
    C = foreach B generate COUNT(A);
Reduce:
    long exec(bag b) {
        return b.size();
    }

13 Algebraic User Defined Functions
    A = load 'users';
    B = group A all;
    C = foreach B generate COUNT(A);
Initial (Map):
    long exec(tuple t) {
        return 1;
    }
Intermediate (Combine):
    long exec(bag b) {
        long sum = 0;
        for (long s : b) {
            sum += s;
        }
        return sum;
    }
Final (Reduce):
    long exec(bag b) {
        long sum = 0;
        for (long s : b) {
            sum += s;
        }
        return sum;
    }
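To see why the three stages give the same answer as a single count, the stages can be chained in plain Java. This is a sketch under assumptions, not Pig's Algebraic interface: AlgebraicCountSketch and its method names are hypothetical, and each inner list stands for the tuples seen by one map task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java walk-through of an algebraic COUNT (hypothetical names, not Pig's API).
final class AlgebraicCountSketch {

    // Initial (map): every input tuple contributes a count of 1.
    static long initial(String tuple) {
        return 1L;
    }

    // Intermediate (combine): sum the partial counts from one map task.
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long s : partials) {
            sum += s;
        }
        return sum;
    }

    // Final (reduce): sum the combined counts from all map tasks.
    static long finalStage(List<Long> partials) {
        long sum = 0;
        for (long s : partials) {
            sum += s;
        }
        return sum;
    }

    // Chains the three stages over the tuples of several simulated map tasks.
    static long count(List<List<String>> mapSplits) {
        List<Long> combined = new ArrayList<>();
        for (List<String> split : mapSplits) {
            List<Long> ones = new ArrayList<>();
            for (String tuple : split) {
                ones.add(initial(tuple));
            }
            combined.add(intermediate(ones));
        }
        return finalStage(combined);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d", "e"));
        System.out.println(count(splits));
    }
}
```

Only one number per map task crosses the network in this sketch, instead of every tuple, which is the benefit the combiner stage is meant to provide.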

14 Accumulative User Defined Functions
    A = load 'users' as (name, url, timestamp);
    B = group A by name;
    C = foreach B {
        D = order A by timestamp;
        generate SessionAnalysis(D);
    }
Reduce:
    public interface Accumulator<T> {
        public void accumulate(List b);
        public T getValue();
    }
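An accumulator receives a group's records in batches rather than as one bag, so it has to carry state between calls. The sketch below illustrates that pattern in plain Java. It is an assumption-laden illustration, not Pig's interface: BatchAccumulator and MaxGapAccumulator are hypothetical names, the extra type parameter for the batch element is added here, and the "longest gap between consecutive timestamps" metric merely stands in for whatever SessionAnalysis computes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical interface modeled on the slide's Accumulator; not Pig's API.
interface BatchAccumulator<I, T> {
    void accumulate(List<I> batch); // called once per batch of a group's records
    T getValue();                   // called after the last batch
}

// Tracks the largest gap between consecutive timestamps, assuming sorted input.
final class MaxGapAccumulator implements BatchAccumulator<Long, Long> {
    private Long previous = null; // last timestamp seen, carried across batches
    private long maxGap = 0;

    @Override
    public void accumulate(List<Long> batch) {
        for (long ts : batch) {
            if (previous != null) {
                maxGap = Math.max(maxGap, ts - previous);
            }
            previous = ts;
        }
    }

    @Override
    public Long getValue() {
        return maxGap;
    }

    public static void main(String[] args) {
        MaxGapAccumulator acc = new MaxGapAccumulator();
        // The group's records arrive in two batches instead of one large bag.
        acc.accumulate(Arrays.asList(1L, 2L, 10L));
        acc.accumulate(Arrays.asList(12L, 30L));
        System.out.println(acc.getValue());
    }
}
```

Since only one batch is in memory at a time, a group that is too large to hold as a single bag can still be processed.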

15 Performance Tips
Project early and often
Use parallel to set the number of reducers
Filter out nulls before a join
For integer arithmetic, declare types (untyped numeric fields are treated as doubles)

16 Performance
[Chart: performance by release: 0.1, 0.2, 0.3, 0.4, 0.5, trunk]

17 Upcoming Features
Redesign of load and store function interfaces
Adding outer join to all join types
UDFs in Python and Ruby
Changing spilling strategy to avoid running out of memory
Adding Accumulator interface

18 Learn More
Read the online documentation: http://hadoop.apache.org/pig/
Online tutorials
–From Yahoo: http://developer.yahoo.com/hadoop/tutorial/
–From Cloudera: http://www.cloudera.com/hadoop-training
A couple of Hadoop books include chapters on Pig; search at your favorite bookstore
Join the mailing lists:
–pig-user@hadoop.apache.org for user questions
–pig-dev@hadoop.apache.org for developer issues
Contribute back your work; over 40 people have contributed so far

19 Questions

