Pig Tutorial
Hui Li (lihui@indiana.edu)
Some material adapted from slides by Adam Kawa, 3rd meeting of WHUG, June 21, 2012.


What is Pig
- A framework for analyzing large unstructured and semi-structured data sets on top of Hadoop.
- The Pig engine parses Pig Latin scripts and compiles them into MapReduce jobs that run on Hadoop.
- Pig Latin is a simple but powerful data flow language, similar to a scripting language.
- It is SQL-like and provides common data operations (e.g. filters, joins, ordering).
- Writing a Pig Latin job is about as simple as writing SQL queries; for complex cases, developers can integrate user defined functions into the Pig statements.

Motivation for Using Pig
- Faster development
  - Fewer lines of code (writing MapReduce jobs becomes like writing SQL queries)
  - Code re-use (Pig library, Piggy Bank)
- One test: find the top 5 most frequent words
  - 10 lines of Pig Latin vs. 200 lines of Java
  - 15 minutes in Pig Latin vs. 4 hours in Java
- Pig accelerates the development process; many companies, such as Yahoo and Twitter, use Pig Latin to process large-scale data.

Word Count using MapReduce

Word Count using Pig
    Lines = LOAD 'input/access.log' AS (line:chararray);
    Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    Groups = GROUP Words BY word;
    Counts = FOREACH Groups GENERATE group, COUNT(Words) AS cnt;
    Results = ORDER Counts BY cnt DESC;
    Top5 = LIMIT Results 5;
    STORE Top5 INTO '/output/top5words';
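To make the dataflow concrete, here is a minimal plain-Python sketch of what each relation in the script holds. It is an illustration only, not how Pig executes the job: the function name top_words and the sample input are hypothetical, whitespace splitting is a simplification of TOKENIZE, and breaking ties alphabetically is a choice made here for deterministic output.

```python
from collections import Counter


def top_words(lines, n=5):
    """Mimic the word-count dataflow on an in-memory list of lines."""
    # Words: split every line and flatten into one record per word
    # (assumption: whitespace splitting stands in for TOKENIZE).
    words = [word for line in lines for word in line.split()]
    # Groups + Counts: group identical words and count each group.
    counts = Counter(words)
    # Results + Top5: order by count descending, then keep the first n.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:n]


sample = ["a b a", "b a c"]  # hypothetical input lines
print(top_words(sample, n=2))
```

Each Python statement corresponds to one or two Pig statements, which is the sense in which Pig Latin reads as a step-by-step data flow rather than a single declarative query.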

Pig Tutorial
- Basic Pig knowledge (Word Count)
  - Pig data types
  - Pig operations
  - How to run Pig scripts
- Advanced Pig features (Kmeans Clustering)
  - Embedding Pig within Python
  - User defined functions

Pig Data Types
Pig Latin data types are divided into primitive types and complex types.
- Primitive types: int, long, float, double, boolean, null, chararray, bytearray
- Complex types:
  - Cell: corresponds to a field in a database, e.g. {(0002576169), (Tome), (21), ("Male") ...}
  - Tuple: corresponds to a row in a database, e.g. (0002576169, Tome, 21, "Male")
  - DataBag: corresponds to a table or view in a database, e.g. {(0002576169, Tome, 21, "Male"), (0002576170, Mike, 20, "Male"), (0002576171, Lucy, 20, "Female") ...}
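As a rough analogy only (Python containers are not Pig types), the field, tuple and bag relationship can be sketched with built-in Python structures, reusing the sample rows from the slide. The variable names are hypothetical.

```python
# Tuple: an ordered set of fields, like one database row.
student = ("0002576169", "Tome", 21, "Male")

# Field (cell): a single value inside a tuple.
name = student[1]

# Bag: a collection of tuples, like a table; a list is used here as a stand-in.
students = [
    student,
    ("0002576170", "Mike", 20, "Male"),
    ("0002576171", "Lucy", 20, "Female"),
]

# Reading one field from every tuple in the bag.
ages = [row[2] for row in students]
print(name, ages)
```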

Pig Operations
- Loading data: LOAD loads input data
    Lines = LOAD 'input/access.log' AS (line:chararray);
- Projection: FOREACH ... GENERATE (similar to SELECT) takes a set of expressions and applies them to every record
- De-duplication: DISTINCT removes duplicate records
- Grouping: GROUP collects together records with the same key
- Aggregation: AVG, COUNT, COUNT_STAR, MAX, MIN, SUM
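The operations listed above can be sketched in plain Python to show what each one does to a set of records. This is an analogy under stated assumptions, not Pig's implementation: the records are made-up (name, age, gpa) rows, and insertion order is preserved only for readability.

```python
from collections import defaultdict

# Hypothetical records: (name, age, gpa)
records = [
    ("amy", 20, 3.0),
    ("bob", 21, 2.0),
    ("amy", 20, 3.0),  # exact duplicate of the first record
    ("cat", 20, 4.0),
]

# De-duplication (DISTINCT): drop duplicate records.
unique = list(dict.fromkeys(records))

# Projection (FOREACH ... GENERATE): keep only (age, gpa) from every record.
projected = [(age, gpa) for _, age, gpa in unique]

# Grouping (GROUP ... BY age): collect records that share a key.
groups = defaultdict(list)
for age, gpa in projected:
    groups[age].append(gpa)

# Aggregation (AVG): one value per group.
avg_gpa = {age: sum(gpas) / len(gpas) for age, gpas in groups.items()}
print(avg_gpa)
```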

How to run Pig Latin scripts
- Local mode
  - Neither Hadoop nor HDFS is required
  - The local host and local file system are used
  - Useful for prototyping and debugging
- Hadoop mode
  - Runs on a Hadoop cluster and HDFS
- Batch mode: run a script directly
    pig -p input=someInput script.pig
  where script.pig contains:
    Lines = LOAD '$input' AS (…);
- Interactive mode: use the Pig shell (Grunt) to run statements
    grunt> Lines = LOAD '/input/input.txt' AS (line:chararray);
    grunt> Unique = DISTINCT Lines;
    grunt> DUMP Unique;

Sample: Word Count using Pig
    Lines = LOAD 'input/access.log' AS (line:chararray);
    Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    Groups = GROUP Words BY word;
    Counts = FOREACH Groups GENERATE group, COUNT(Words) AS cnt;
    Results = ORDER Counts BY cnt DESC;
    Top5 = LIMIT Results 5;
    STORE Top5 INTO '/output/top5words';

Sample: Kmeans using Pig
K-means is a method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean.
- Assignment step: assign each observation to the cluster with the closest mean
- Update step: calculate the new means as the centroids of the observations in each cluster
Reference: http://en.wikipedia.org/wiki/K-means_clustering
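The two steps can be sketched for one-dimensional data (such as the gpa values used later in the deck) as a single pure-Python function. The name kmeans_step and the sample points are hypothetical, and keeping the old centroid for an empty cluster is an assumption made here to avoid division by zero.

```python
def kmeans_step(points, centroids):
    """Run one k-means iteration on 1-D data and return the new centroids."""
    # Assignment step: put each point into the cluster of its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: each new centroid is the mean of its cluster.
    # Assumption: an empty cluster keeps its previous centroid.
    return [
        sum(cluster) / len(cluster) if cluster else centroids[i]
        for i, cluster in enumerate(clusters)
    ]


points = [1.0, 2.0, 9.0, 11.0]  # hypothetical observations
print(kmeans_step(points, [0.0, 10.0]))
```

Repeating this function until the centroids stop moving gives the full algorithm; in the Pig version the assignment is done by a UDF and the update by GROUP plus AVG.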

Kmeans Using Pig
    from math import fabs
    from org.apache.pig.scripting import Pig

    # iter_num, MAX_ITERATION, v (number of centroids), tolerance,
    # last_centroids and initial_centroids are set up elsewhere in the script.
    PC = Pig.compile("""register udf.jar
        DEFINE find_centroid FindCentroid('$centroids');
        raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
        centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
        grouped = group centroided by centroid;
        result = foreach grouped generate group, AVG(centroided.gpa);
        store result into 'output';
        """)

    while iter_num < MAX_ITERATION:
        PCB = PC.bind({'centroids': initial_centroids})
        results = PCB.runSingle()
        iter = results.result("result").iterator()
        centroids = [None] * v
        distance_move = 0.0
        # Get the new centroids of this iteration and calculate the
        # moving distance from the last iteration.
        for i in range(v):
            tuple = iter.next()
            centroids[i] = float(str(tuple.get(1)))
            distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
        distance_move = distance_move / v
        if distance_move < tolerance:
            converged = True
            break
        ……

Embedding Python scripts with Pig
- Pig does not support control flow statements: if/else, while loops, for loops, etc.
- The Pig embedding API can leverage all language features provided by Python, including control flow (loops and exit criteria)
- Similar to a database embedding API
- Easier parameter passing
- JavaScript is available as well
- The framework is extensible: any JVM implementation of a language could be integrated

Compile Pig Script
Compile the Pig script outside the loop, since the same query runs in every iteration:
    P = Pig.compile("""register udf.jar
        DEFINE find_centroid FindCentroid('$centroids');
        raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
        centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
        grouped = group centroided by centroid;
        result = foreach grouped generate group, AVG(centroided.gpa);
        store result into 'output';
        """)
Within the loop, we invoke the compiled Pig script:
    while iter_num < MAX_ITERATION:
        Q = P.bind({'centroids': initial_centroids})
        results = Q.runSingle()
        ........
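The compile-once, bind-per-iteration pattern can be imitated outside Pig with Python's standard string.Template, which happens to use the same $name placeholder syntax. This is only an analogy for how a bound parameter ends up in the query text; it does not use or reproduce the Pig embedding API, and the second centroid string is made up.

```python
from string import Template

# "Compile" once: the script text is prepared a single time with its placeholder.
script = Template("DEFINE find_centroid FindCentroid('$centroids');")

# "Bind" on every iteration: only the parameter value changes.
for centroids in ["0.0:1.0:2.0:3.0", "0.4:1.2:2.2:3.4"]:
    bound = script.substitute(centroids=centroids)
    print(bound)
```

The first printed line matches the DEFINE statement that appears in the run log for the initial centroids.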

User Defined Function
- What is a UDF
  - A way to do an operation on a field or fields
  - Called from within a Pig script
  - Currently all done in Java
- Why use a UDF
  - You need to do more than grouping or filtering
  - Actually, filtering is a UDF
  - You may be more comfortable in Java land than in SQL/Pig Latin
- The Kmeans script registers and defines its UDF at the top of the compiled query:
    P = Pig.compile("""register udf.jar
        DEFINE find_centroid FindCentroid('$centroids');
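The deck does not show the Java source of FindCentroid, so the following is a guess at its logic, written in Python purely for illustration. It assumes the constructor argument is a colon-separated list of centroids (as in the '0.0:1.0:2.0:3.0' value seen in the run log) and that the function returns the centroid closest to the input value. The function name and signature are hypothetical.

```python
def find_centroid(centroids_param, value):
    """Return the centroid nearest to value (assumed behavior of FindCentroid)."""
    # Parse the colon-separated parameter string into floats.
    centroids = [float(c) for c in centroids_param.split(":")]
    # Pick the centroid with the smallest absolute distance to the value.
    return min(centroids, key=lambda c: abs(c - value))


print(find_centroid("0.0:1.0:2.0:3.0", 2.4))
```

In the Pig script this function is applied to every gpa value, so each record is tagged with its nearest centroid before grouping.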

Zoom In Pig Kmeans code
    # Iterate at most MAX_ITERATION times
    while iter_num < MAX_ITERATION:
        # Binding parameters
        PCB = PC.bind({'centroids': initial_centroids})
        results = PCB.runSingle()
        # Get the new centroids of this iteration and calculate the
        # moving distance from the last iteration
        iter = results.result("result").iterator()
        centroids = [None] * v
        distance_move = 0
        for i in range(v):
            tuple = iter.next()
            centroids[i] = float(str(tuple.get(1)))
            distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
        distance_move = distance_move / v
        if distance_move < tolerance:
            writeoutput()
            converged = True
            break
        # Update centroids
        last_centroids = centroids[:]
        initial_centroids = ""
        for i in range(v):
            initial_centroids = initial_centroids + str(last_centroids[i])
            if i != v - 1:
                initial_centroids = initial_centroids + ":"
        iter_num += 1
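The convergence test and the centroid-string update in the loop above can be isolated into two small helpers and checked without a Pig installation. The helper names and the sample values are hypothetical; the arithmetic follows the loop (mean absolute movement per centroid, colon-joined string for the next bind).

```python
def mean_move(last_centroids, centroids):
    """Average absolute distance each centroid moved since the last iteration."""
    total = sum(abs(old - new) for old, new in zip(last_centroids, centroids))
    return total / len(centroids)


def format_centroids(centroids):
    """Build the colon-separated string passed as the 'centroids' parameter."""
    return ":".join(str(c) for c in centroids)


last = [0.0, 1.0, 2.0, 3.0]  # hypothetical previous centroids
new = [0.5, 1.0, 2.5, 3.0]   # hypothetical centroids from this iteration
print(mean_move(last, new))
print(format_centroids(new))
```

Using join avoids the manual check for the last index when building the parameter string.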

Run Pig Kmeans Scripts
    2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to run:
    register udf.jar
    DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
    raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
    centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
    grouped = group centroided by centroid;
    result = foreach grouped generate group, AVG(centroided.gpa);
    store result into 'output';

    Input(s): Successfully read 10000 records (219190 bytes) from: "hdfs://iw-ubuntu/user/developer/student.txt"
    Output(s): Successfully stored 4 records (134 bytes) in: "hdfs://iw-ubuntu/user/developer/output"
    last centroids: [0.371927835052,1.22406743491,2.24162171881,3.40173705722]

References
1) http://pig.apache.org (Pig official site)
2) http://en.wikipedia.org/wiki/K-means_clustering
3) Slides by Adam Kawa, 3rd meeting of WHUG, June 21, 2012
4) Docs: http://pig.apache.org/docs/r0.9.0
5) Papers: http://wiki.apache.org/pig/PigTalksPapers
6) http://en.wikipedia.org/wiki/Pig_Latin
Questions?