Pig: Making Hadoop Easy
Wednesday, June 10, 2009, Santa Clara Marriott


What is Pig?
» Pig Latin, a high-level data processing language.
» An engine that executes Pig Latin locally or on a Hadoop cluster.

An Example Problem
Data
– User records
– Pages served
Question: the 5 pages most visited by users aged 18 - 25.
Steps:
» Load Users
» Load Pages
» Filter by age
» Join on name
» Group on url
» Count clicks
» Order by clicks
» Take top 5

In Map Reduce

In Pig Latin

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
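To make the data flow of the script concrete, here is a minimal in-memory sketch of the same steps in Python. It is only an illustration of what each statement computes, not how Pig executes it; the sample records and the function name `top_pages` are assumptions made up for this example, and it assumes user names are unique.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' inputs.
users = [("amy", 22), ("bob", 31), ("cat", 18), ("dan", 25)]
pages = [("amy", "a.com"), ("amy", "b.com"), ("bob", "a.com"),
         ("cat", "a.com"), ("dan", "c.com"), ("dan", "a.com")]

def top_pages(users, pages, lo=18, hi=25, n=5):
    # filter Users by age >= 18 and age <= 25 (assumes unique user names)
    fltrd = {name for name, age in users if lo <= age <= hi}
    # join Fltrd by name, Pages by user; group by url; COUNT as clicks
    clicks = Counter(url for user, url in pages if user in fltrd)
    # order by clicks desc; limit n
    return clicks.most_common(n)

top5 = top_pages(users, pages)
```

With the sample data above, "bob" is filtered out by age, so only the clicks from the remaining three users are counted.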

Comparison
» 1/20 the lines of code
» 1/16 the development time
» Performance: 1.5x Hadoop

Pig Compared to Map Reduce
» Faster development time
» Data flow versus programming logic
» Many standard data operations (e.g. join) included
» Manages all the details of connecting jobs and data flow
» Copes with Hadoop version change issues

And, You Don’t Lose Power
» UDFs can be used to load, evaluate, aggregate, and store data
» External binaries can be invoked
» Metadata is optional
» Flexible data model
» Nested data types
» Explicit data flow programming

Pig Commands
load            Read data from file system.
store           Write data to file system.
foreach         Apply expression to each record and output one or more records.
filter          Apply predicate and remove records that do not return true.
group/cogroup   Collect records with the same key from one or more inputs.
join            Join two or more inputs based on a key. Various join algorithms available.
order           Sort records based on a key.
distinct        Remove duplicate records.
union           Merge two data sets.
split           Split data into 2 or more sets, based on filter conditions.
stream          Send all records through a user provided executable.
sample          Read a random sample of the data.
limit           Limit the number of records.

How it Works
Pig Latin script is translated to a set of operators which are placed in one or more MR jobs and executed.

A = load 'myfile';
B = filter A by $1 > 0;
C = group B by $0;
D = foreach C generate group, COUNT(B) as cnt;
E = filter D by cnt > 5;
dump E;

Map: Filter $1 > 0
Combiner: COUNT(B)
Reducer: SUM(COUNT(B)), Filter cnt > 5
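The split of operators across map, combiner, and reducer can be sketched in plain Python. This is a toy model of the idea only, not Pig's or Hadoop's actual code; the function names and the two input splits are assumptions invented for the illustration.

```python
from collections import defaultdict

def map_phase(records):
    # Map: apply the filter $1 > 0 and emit (key, record) with key = $0.
    for rec in records:
        if rec[1] > 0:
            yield rec[0], rec

def combine_phase(pairs):
    # Combiner: partial COUNT per key over one map task's output.
    partial = defaultdict(int)
    for key, _ in pairs:
        partial[key] += 1
    return list(partial.items())

def reduce_phase(partials, min_count=5):
    # Reducer: SUM the partial counts per key, then filter cnt > min_count.
    totals = defaultdict(int)
    for key, cnt in partials:
        totals[key] += cnt
    return {k: c for k, c in totals.items() if c > min_count}

# Two hypothetical input splits, each handled by its own map task.
split1 = [("x", 1)] * 4 + [("y", 2)] * 2 + [("x", -1)]
split2 = [("x", 3)] * 3 + [("y", 0)]
partials = combine_phase(map_phase(split1)) + combine_phase(map_phase(split2))
result = reduce_phase(partials)
```

The point of the combiner step is that each map task ships one partial count per key to the reducer instead of one record per input row.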

What Users Do with Pig
» Inside Yahoo (based on user interviews)
  › 60% of ad hoc and 40% of production MR jobs
  › Production
    Examples: search infrastructure, ad relevance
    Attraction: fast development, extensibility via custom code, protection against Hadoop changes, debuggability
  › Ad hoc
    Examples: user intent analysis
    Attraction: easy to learn, compact readable code, fast iteration when trying new algorithms, easy for collaboration

What Users Do with Pig
» Outside Yahoo (based on mailing list responses)
  › Processing search engine query logs
    “Pig programs are easier to maintain, and less error-prone than native java programs. It is an excellent piece of work.”
  › Image recommendations
    “I am using it as a rapid-prototyping language to test some algorithms on huge amounts of data.”
  › Adsorption Algorithm (video recommendations)
  › Hofmann’s PLSI implementation
    “The E/M login was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in mapreduce java. Exactly that’s the reason I wanted to try it out in Pig. It took ~ 3-4 days for me to write it, starting from learning pig.”

Users Extending Pig: PigPy
» Created by Marshall Weir at Zattoo
» Uses Python to create Pig Latin scripts on the fly
  › Enables looping
  › Branching based on job results
» Submits Pig jobs from Python scripts
» Caches intermediate calculations
» Avoids variable name collisions in large scripts
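The idea of building Pig Latin text from Python can be sketched as follows. This is not PigPy's API, which is not shown in the slides; `count_by_script` is a hypothetical helper written only to show how a Python loop can emit one script per grouping key.

```python
def count_by_script(source, fields, key, dest):
    """Return Pig Latin text that counts records of `source` per `key`.

    Hypothetical helper for illustration; it only builds a string and
    does not submit anything to Pig.
    """
    if key not in fields:
        raise ValueError(f"{key!r} is not a field of {source!r}")
    return "\n".join([
        f"data = load '{source}' as ({', '.join(fields)});",
        f"grpd = group data by {key};",
        "cnts = foreach grpd generate group, COUNT(data) as cnt;",
        f"store cnts into '{dest}';",
    ])

# Looping in Python: one generated script per grouping key.
fields = ["name", "age", "gpa"]
scripts = [count_by_script("students", fields, k, f"count_by_{k}")
           for k in ("name", "age")]
```

Each element of `scripts` is an ordinary Pig Latin script as a string, which a wrapper could then write to a file or hand to Pig.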

Version 0.2.0
» Released April 2009
» Added type system
» ~5x better performance than 0.1
» More aggressive use of the combiner
» Map side join
» Handles key skew in ORDER BY
» Improved error handling
» Improved documentation

Version 0.3.0
» Release branch created June 8th, 2009
» Supports multiple STOREs in one MR job
» Supports multiple GROUP BYs in one MR job

students = load 'students' as (name, age, gpa);
a_ed = filter students by age > 25;
store a_ed into 'adult_ed';
gname = group a_ed by name;
cname = foreach gname generate group, COUNT(a_ed);
store cname into 'count_by_name';
g_age = group a_ed by age;
c_age = foreach g_age generate group, COUNT(a_ed);
store c_age into 'count_by_age';

In 0.2.0 and before, this would be 3 MR jobs. In 0.3.0 it will be one. Seeing up to 10x speedup for these types of scripts.
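The benefit of sharing work across the three stores can be pictured with a small Python sketch that produces all three outputs from a single scan of the input. It illustrates the idea only and makes no claim about how Pig plans the job; the function name `multi_query` and the sample rows are assumptions.

```python
from collections import Counter

def multi_query(students):
    """Produce adult_ed, count_by_name and count_by_age in one pass."""
    adult_ed = []
    count_by_name = Counter()
    count_by_age = Counter()
    for name, age, gpa in students:
        if age > 25:                      # filter students by age > 25
            adult_ed.append((name, age, gpa))
            count_by_name[name] += 1      # group by name, COUNT
            count_by_age[age] += 1        # group by age, COUNT
    return adult_ed, dict(count_by_name), dict(count_by_age)

# Hypothetical sample rows: (name, age, gpa).
students = [("ann", 30, 3.5), ("bo", 22, 3.1), ("ann", 41, 3.9), ("cy", 30, 2.8)]
adult_ed, by_name, by_age = multi_query(students)
```

Running the three computations separately would read `students` three times; here the input is read once and every output is updated as each row goes by.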

Currently Working On
» Map side merge join
» Handling severe skew in join keys
» Improving memory footprint
» Extending optimizer capabilities
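For readers unfamiliar with the term, a merge join walks two inputs that are already sorted on the join key in step, so neither side has to be fully loaded or re-sorted. The following is a textbook sketch of that algorithm in Python, not Pig's implementation (which the slides do not describe); `merge_join` and the sample inputs are assumptions.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key has no partner yet; advance left
        elif lk > rk:
            j += 1          # right key has no partner yet; advance right
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing within the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out

# Hypothetical sorted inputs.
left = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
right = [("b", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
joined = merge_join(left, right)
```

The sketch depends on both inputs being sorted on the key; on unsorted input it silently misses matches.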

SQL
» Pig will be bilingual, accepting SQL and Pig Latin
» UDFs will work in both languages
» Gives users ability to choose appropriate interface level
» Administrators have one component to maintain

Metadata for the Grid
» Provide metadata model for files and directories as data sets
» Usable from Map Reduce and Pig
» Attach user defined attributes to data sets
» Define hierarchy and associations between data sets
» Record data schema and statistics
» Browsing, searching, and metadata administration via GUI and web services API
» JIRA: PIG-823

Storage Access Layer
» Common abstraction to contain storage access features and optimizations
» Support fast projection
» Support early row filtering
» CPU/space efficient data serialization and compression
» Usable by Map Reduce and Pig
» JIRA: PIG-833

Learn More
» Come to the Hadoop Summit Training, tomorrow
» Watch the training by Yahoo! and Cloudera: http://www.cloudera.com/hadoop-training-pig-introduction
» Get involved: http://hadoop.apache.org/pig
