Making Hadoop Easy: Pig


1 Making Hadoop Easy: Pig http://hadoop.apache.org/pig

2 What is Pig

3 Pig is a Language Pig Latin, a high-level data processing language, plus an engine that executes Pig Latin locally or on a Hadoop cluster.

4 Pig is a Hadoop Subproject
–Apache Incubator: October '07 to October '08
–Graduated into a Hadoop subproject
–Main page: http://hadoop.apache.org/pig/

5 Why Pig? Higher level languages:
–Increase programmer productivity
–Decrease duplication of effort
–Open the system to more users
Pig insulates you against Hadoop complexity:
–Hadoop version upgrades
–JobConf configuration tuning
–Job chains

6 An Example Problem Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 to 25. The pipeline: load Users, load Pages, filter by age, join on name, group on url, count clicks, order by clicks, take the top 5.

7 In Map Reduce

8 In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
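For readers who want to check the semantics, here is a minimal in-memory Python simulation of the same pipeline. The sample users and pages are invented; real Pig runs this as MapReduce jobs over files.

```python
# In-memory sketch of the Pig Latin pipeline above (invented sample data).
from collections import Counter

users = [("alice", 20), ("bob", 30), ("carol", 19)]
pages = [("alice", "a.com"), ("alice", "b.com"),
         ("carol", "a.com"), ("bob", "c.com")]

# Fltrd = filter Users by age >= 18 and age <= 25
fltrd = {name for name, age in users if 18 <= age <= 25}

# Jnd = join Fltrd by name, Pages by user
jnd = [(user, url) for user, url in pages if user in fltrd]

# Grpd = group Jnd by url; Smmd = foreach Grpd generate group, COUNT(Jnd)
clicks = Counter(url for _, url in jnd)

# Srtd = order Smmd by clicks desc; Top5 = limit Srtd 5
top5 = sorted(clicks.items(), key=lambda kv: -kv[1])[:5]
print(top5)  # [('a.com', 2), ('b.com', 1)]
```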

9 Ease of Translation Notice how naturally the components of the job translate into Pig Latin:
Load Users → Users = load …
Load Pages → Pages = load …
Filter by age → Fltrd = filter …
Join on name → Jnd = join …
Group on url → Grpd = group …
Count clicks → Smmd = … COUNT() …
Order by clicks → Srtd = order …
Take top 5 → Top5 = limit …

10 Comparison Compared to Map Reduce: 1/20 the lines of code, 1/16 the development time, and performance within 2x.

11 Pig Compared to Map Reduce
–Faster development time
–Many standard data operations (project, filter, join) already included
–Pig manages all the details of Map Reduce jobs and data flow for you

12 And, You Don't Lose Power
–Easy to provide user code throughout
–External binaries can be invoked
–Metadata is not required, but is supported and used when available
–Pig does not impose a data model on you
–Fine-grained control: one line equals one action
–Complex data types

13 Example, User Code
-- use a custom loader
Logs = load 'apachelogfile' using CommonLogLoader() as (addr, logname, user, time, method, uri, p, bytes);
-- apply your own function
Cleaned = foreach Logs generate addr, canonicalize(uri) as url;
Grouped = group Cleaned by url;
-- run the result through a binary
Analyzed = stream Grouped through 'urlanalyzer.py';
store Analyzed into 'analyzedurls';
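The 'urlanalyzer.py' binary itself is not shown on the slide. As a hedged illustration: a stream program reads tab-delimited records on stdin and writes tab-delimited records back on stdout. The field layout and the "analysis" below are assumptions for illustration only.

```python
# Hypothetical stream binary for Pig's stream operator.
# Reads tab-delimited records on stdin, writes them back with one
# extra field appended (a toy "analysis": the URL's last dot-segment).
import sys

def analyze(line):
    """Process one tab-delimited record; first field assumed to be a URL."""
    fields = line.rstrip("\n").split("\t")
    url = fields[0]
    tld = url.rsplit(".", 1)[-1]  # e.g. "http://example.com" -> "com"
    return "\t".join(fields + [tld])

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        stdout.write(analyze(line) + "\n")
```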

14 Example, Schema on the Fly
-- declare your types
Grades = load 'studentgrades' as (name: chararray, age: int, gpa: double);
Good = filter Grades by age > 18 and gpa > 3.0;
-- ordering will be by type
Sorted = order Good by gpa;
store Sorted into 'smartgrownups';
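As the slide's comment notes, ordering depends on the declared type: without "gpa: double" the field's raw bytes would be compared, so "10.0" sorts before "3.5". A rough Python illustration of the difference (the sample grades are invented):

```python
# Illustrating typed vs. untyped ordering (invented sample data).
grades = [("ann", "19", "3.5"), ("bob", "22", "10.0")]  # raw text fields

as_text = sorted(grades, key=lambda r: r[2])            # byte/string sort
as_double = sorted(grades, key=lambda r: float(r[2]))   # gpa declared double

print([r[0] for r in as_text])    # ['bob', 'ann']  -- "10.0" < "3.5"
print([r[0] for r in as_double])  # ['ann', 'bob']
```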

15 Example, Nested Data
Logs = load 'weblogs' as (url, userid);
Grpd = group Logs by url;
-- Code inside {} will be applied to each
-- group in turn.
DCnt = foreach Grpd {
  Userid = Logs.userid;
  DsctUsers = distinct Userid;
  generate group, COUNT(DsctUsers);
}
store DCnt into 'distinctcount';
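The same distinct-users-per-url logic can be sketched in Python to show what the nested foreach computes; the sample log records are invented:

```python
# Group records by url, then count distinct userids within each group.
from collections import defaultdict

logs = [("a.com", "u1"), ("a.com", "u1"), ("a.com", "u2"), ("b.com", "u1")]

grpd = defaultdict(list)
for url, userid in logs:
    grpd[url].append(userid)

# distinct + COUNT applied inside each group, as in the foreach block
dcnt = {url: len(set(userids)) for url, userids in grpd.items()}
print(dcnt)  # {'a.com': 2, 'b.com': 1}
```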

16 Pig Commands
load: Read data from the file system.
store: Write data to the file system.
foreach: Apply an expression to each record and output one or more records.
filter: Apply a predicate and remove records that do not return true.
group/cogroup: Collect records with the same key from one or more inputs.
join: Join two or more inputs based on a key.
order: Sort records based on a key.
distinct: Remove duplicate records.
union: Merge two data sets.
split: Split data into two or more sets, based on filter conditions.
stream: Send all records through a user-provided binary.
dump: Write output to stdout.
limit: Limit the number of records.

17 How it Works A Pig Latin script is translated to a set of operators which are placed in one or more M/R jobs and executed.
A = load 'myfile';
B = filter A by $1 > 0;
C = group B by $0;
D = foreach C generate group, COUNT(B) as cnt;
E = filter D by cnt > 5;
dump E;
This script becomes a single M/R job: the filter on $1 runs in the map, COUNT(B) is computed partially in the combiner and completed in the reducer, and the filter on cnt runs in the reducer.
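A rough Python sketch of that execution plan; the phase boundaries and sample records below are illustrative assumptions, not Pig's actual generated code:

```python
# Map / combine / reduce sketch of the plan for the script above.
from collections import defaultdict

def map_phase(records):
    # filter A by $1 > 0, then emit ($0, 1) for counting
    for key, val in records:
        if val > 0:
            yield key, 1

def combine(pairs):
    # partial COUNT on the map side
    partial = defaultdict(int)
    for key, n in pairs:
        partial[key] += n
    return partial.items()

def reduce_phase(pairs):
    # finish COUNT, then apply: filter D by cnt > 5
    totals = defaultdict(int)
    for key, n in pairs:
        totals[key] += n
    return {k: v for k, v in totals.items() if v > 5}

records = [("x", 1)] * 7 + [("x", -1)] + [("y", 2)] * 3
result = reduce_phase(combine(map_phase(records)))
print(result)  # {'x': 7} -- x passes (7 > 5), y does not (3)
```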

18 Current Pig Status 30% of all Hadoop jobs at Yahoo are now Pig jobs, 1000s per day. Graduated from the Apache Incubator in October '08 and was accepted as a Hadoop subproject. In the process of releasing version 0.2.0:
–type system
–2-10x speedup
–1.6x Hadoop latency
Improved user experience:
–Improved documentation
–Pig Tutorial
–UDF repository (PiggyBank)
–Development environment (Eclipse plugin)

19 What Users Do with Pig Inside Yahoo (based on user interviews):
–Used for both production processes and ad hoc analysis
–Production examples: search infrastructure, ad relevance. Attraction: fast development, extensibility via custom code, protection against Hadoop changes, debuggability
–Research examples: user intent analysis. Attraction: easy to learn, compact readable code, fast iteration on trying new algorithms, easy collaboration

20 What Users Do with Pig Outside Yahoo (based on mailing list responses):
–Processing search engine query logs: "Pig programs are easier to maintain, and less error-prone than native Java programs. It is an excellent piece of work."
–Image recommendations: "I am using it as a rapid-prototyping language to test some algorithms on huge amounts of data."
–Adsorption algorithm (video recommendations)
–Hofmann's PLSI implementation in Pig: "The E/M logic was implemented in Pig in 30-35 lines of Pig Latin statements. Took a lot less compared to what it took in implementing the algorithm in MapReduce Java. Exactly that's the reason I wanted to try it out in Pig. It took ~3-4 days for me to write it, starting from learning Pig :)"
–Inverted index: "The Pig feature that makes it stand out is the easy native support for nested elements, meaning a tuple can have other tuples nested inside it; they also support Maps and a few other constructs. The SIGMOD 2008 paper presents the language and gives examples of how the system is used at Yahoo. Without further ado, a quick example of the kind of processing that would be awkward, if not impossible, to write in regular SQL, and long and tedious to express in Java (even using Hadoop)."

21 What Users Do with Pig Common asks:
–Control structures or embedding
–UDFs in scripting languages (Perl, Python)
–More performance

22 Roadmap Performance:
–Latency: goal of 10-20% overhead compared to Hadoop
–Better scalability: memory usage, dealing with skew
–Planned improvements: multi-query support, rule-based optimizer, handling skew in joins, pushing projections to the loader, more efficient serialization, better memory utilization

23 Roadmap (cont.) Functionality:
–UDFs in languages other than Java (Perl, C++)
–New parser with better error handling

24 How Do I Get a Pig of My Own? You need an installation of Hadoop to run on; see http://hadoop.apache.org/core/. Get the pig jar: release 0.1.0 is at http://hadoop.apache.org/pig/releases.html, but I strongly recommend using the code from trunk. Get a copy of the hadoop-site.xml file for your Hadoop cluster. Run java -cp pig.jar:configdir org.apache.pig.Main, where configdir is the directory containing your hadoop-site.xml.

25 How Do I Make My Pig Work?
–Starting pig with no script puts you in the grunt shell, where you can type Pig and HDFS navigation commands.
–Pig Latin can be put in a file that is then passed to pig.
–A JDBC-like interface is available for Java usage.
–PigPen, an Eclipse plugin, supports textual and graphical construction of scripts, and shows sample data flowing through the script to illustrate how your script will work.

26 PigPen screenshot: script on the left, schema and example data flow on the right.

27 PigPen screenshot: graphical flow on the left, schema and example data flow on the right.

28 Q & A

