Presented By: Imranul Hoque


1 Pig (Latin) Demo
Presented By: Imranul Hoque

2 Topics
Last Seminar:
  Hadoop Installation
  Running MapReduce Jobs
  MapReduce Code
  Status Monitoring
Today:
  Complexity of writing MapReduce programs
  Pig Latin and Pig
  Pig Installation
  Running Pig

3 Example Problem
Goal: for each sufficiently large category, find the average pagerank of high-pagerank urls in that category.
Table columns: URL, Category, Pagerank
Table values: Search Engine 0.9 News 0.8 Social Network 0.85 0.78 Blah 0.1 0.5

4 Example Problem (cont’d)
SQL:
  SELECT category, AVG(pagerank)
  FROM url-table
  WHERE pagerank > 0.2
  GROUP BY category
  HAVING COUNT(*) > 10^6
MapReduce: ?
Procedural (MapReduce) vs. Declarative (SQL)
Pig Latin: Sweet spot between declarative and procedural
[Diagram labels: Pig System, Hadoop, Pig Latin, MapReduce]
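The "MapReduce: ?" prompt above can be made concrete with a small sketch. This is not Hadoop code: it is a plain-Python imitation of the map, shuffle, and reduce phases, assuming made-up sample rows and a threshold of 1 in place of 10^6 so the toy data produces output. The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (url, category, pagerank). The URLs are made up.
ROWS = [
    ("a.example", "Search Engine", 0.9),
    ("b.example", "News", 0.8),
    ("c.example", "News", 0.78),
    ("d.example", "Blah", 0.1),
    ("e.example", "Blah", 0.5),
]


def map_fn(row, min_rank=0.2):
    """Emit (category, pagerank) for rows that pass the WHERE clause."""
    _url, category, pagerank = row
    if pagerank > min_rank:
        yield category, pagerank


def reduce_fn(category, pageranks, min_count):
    """Apply the HAVING clause, then compute AVG(pagerank) for the group."""
    if len(pageranks) > min_count:
        yield category, sum(pageranks) / len(pageranks)


def run_job(rows, min_count):
    # Shuffle phase: collect mapper output by key.
    shuffled = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            shuffled[key].append(value)
    # Reduce phase: one call per key.
    result = {}
    for key, values in shuffled.items():
        for category, avg in reduce_fn(key, values, min_count):
            result[category] = avg
    return result


# min_count is 1 here (not 10^6) so that the toy data yields a result.
RESULT = run_job(ROWS, min_count=1)
```

Even in this toy form, the filtering, grouping, and aggregation logic is spread across two functions plus the plumbing between them, which is the complexity the slide contrasts with the one-statement SQL query.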

5 Pig Latin Solution
  urls = LOAD 'url-table' AS (url, category, pagerank);
  good_urls = FILTER urls BY pagerank > 0.2;
  groups = GROUP good_urls BY category;
  big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
For each sufficiently large category, find the average pagerank of high-pagerank urls in that category.
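To show how each Pig Latin statement names one intermediate data set, here is a rough Python analogue with one assignment per statement. It is only an illustration of the dataflow style, not how Pig executes the script. The sample rows are made up, and `MIN_GROUP_SIZE` is a hypothetical stand-in for 10^6.

```python
from itertools import groupby
from statistics import mean

MIN_GROUP_SIZE = 1  # stand-in for 10^6 so the toy data yields a result

# urls = LOAD ... (made-up rows; the URLs are placeholders)
urls = [
    {"url": "a.example", "category": "Search Engine", "pagerank": 0.9},
    {"url": "b.example", "category": "News", "pagerank": 0.8},
    {"url": "c.example", "category": "News", "pagerank": 0.78},
    {"url": "d.example", "category": "Blah", "pagerank": 0.1},
    {"url": "e.example", "category": "Blah", "pagerank": 0.5},
]

# good_urls = FILTER urls BY pagerank > 0.2
good_urls = [r for r in urls if r["pagerank"] > 0.2]


def by_category(record):
    return record["category"]


# groups = GROUP good_urls BY category (groupby needs sorted input)
groups = {
    category: list(records)
    for category, records in groupby(sorted(good_urls, key=by_category), key=by_category)
}

# big_groups = FILTER groups BY COUNT(good_urls) > threshold
big_groups = {c: rs for c, rs in groups.items() if len(rs) > MIN_GROUP_SIZE}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
output = {c: mean(r["pagerank"] for r in rs) for c, rs in big_groups.items()}
```

Each intermediate name (`good_urls`, `groups`, `big_groups`) can be inspected on its own, which is part of what makes the step-by-step style easier to debug than a single nested query.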

6 Features
Dataflow language
  Find the set of urls that are classified as spam but have a high pagerank score:
  spam_urls = FILTER urls BY isSpam(url);
  culprit_urls = FILTER spam_urls BY pagerank > 0.8;
User defined function (UDF)
Debugging environment
Nested data model
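The role of a UDF in the spam example can be sketched in Python: the filter step accepts any user-written predicate. The body of `is_spam` below is a hypothetical stand-in (the slides do not say how isSpam works), and the sample URLs are made up.

```python
def is_spam(url):
    """Hypothetical stand-in for the isSpam UDF: flag URLs containing a blocked word."""
    return any(word in url for word in ("casino", "pills"))


# Made-up (url, pagerank) records.
urls = [
    ("cheap-pills.example", 0.9),
    ("news.example", 0.95),
    ("casino.example", 0.3),
]

# spam_urls = FILTER urls BY isSpam(url)
spam_urls = [record for record in urls if is_spam(record[0])]

# culprit_urls = FILTER spam_urls BY pagerank > 0.8
culprit_urls = [record for record in spam_urls if record[1] > 0.8]
```

Swapping in a different `is_spam` changes the classification without touching the two filter steps.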

7 Pig Latin Commands
  load: Read data from file system.
  store: Write data to file system.
  foreach: Apply expression to each record and output one or more records.
  filter: Apply predicate and remove records that do not return true.
  group/cogroup: Collect records with the same key from one or more inputs.
  join: Join two or more inputs based on a key.
  order: Sort records based on a key.
  distinct: Remove duplicate records.
  union: Merge two data sets.
  dump: Write output to stdout.
  limit: Limit the number of records.
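The difference between cogroup and join in the list above is easy to miss, so here is a minimal Python sketch of both for two inputs. It treats join as a cogroup whose bags are flattened into a cross product, which is how the two operators relate conceptually. The function names, signatures, and sample records are all hypothetical.

```python
from collections import defaultdict


def cogroup(left, right, key_left, key_right):
    """Collect records with the same key from two inputs: key -> (left_bag, right_bag)."""
    bags = defaultdict(lambda: ([], []))
    for record in left:
        bags[key_left(record)][0].append(record)
    for record in right:
        bags[key_right(record)][1].append(record)
    return dict(bags)


def join(left, right, key_left, key_right):
    """Inner join: flatten each pair of bags into its cross product."""
    return [
        l + r
        for l_bag, r_bag in cogroup(left, right, key_left, key_right).values()
        for l in l_bag
        for r in r_bag
    ]


def first_field(record):
    return record[0]


left = [(1, "a"), (2, "b"), (1, "c")]
right = [(1, "x"), (3, "y")]

grouped = cogroup(left, right, first_field, first_field)
joined = join(left, right, first_field, first_field)
```

Here `grouped` keeps keys 2 and 3 with one empty bag each, while `joined` drops them because an empty bag contributes nothing to the cross product.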

8 Pig System
[Diagram: user -> Pig Latin program -> Parser -> parsed program -> Pig Compiler (with cross-job optimizer) -> execution plan -> MR Compiler -> map-red. jobs -> Map-Reduce Cluster. Labels in the example plan: join, filter, f( ), X, Y, output.]

9 MapReduce Compiler

10 Pig Pen
Find users who tend to visit “good” pages
  Load Visits(user, url, time)
  Transform to (user, Canonicalize(url), time)
  Load Pages(url, pagerank)
  Join url = url
  Group by user
  Transform to (user, Average(pagerank) as avgPR)
  Filter avgPR > 0.5

11 Challenges?
  Load Visits(user, url, time): (Amy, cnn.com, 8am) (Amy, 9am) (Fred, 11am)
  Load Pages(url, pagerank): ( 0.9) ( 0.4)
  Transform to (user, Canonicalize(url), time): (Amy, 8am) (Amy, 9am) (Fred, 11am)
  Join url = url: (Amy, 8am, 0.9) (Amy, 9am, 0.4) (Fred, 11am, 0.4)
  Group by user: (Amy, { (Amy, 8am, 0.9), (Amy, 9am, 0.4) }) (Fred, { (Fred, 11am, 0.4) })
  Transform to (user, Average(pagerank) as avgPR): (Amy, 0.65) (Fred, 0.4)
  Filter avgPR > 0.5: (Amy, 0.65)
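The example dataflow above can be traced in a few lines of Python. This is a sketch under assumptions: most URLs are missing from the transcript, so the URLs other than cnn.com are placeholders chosen only so that the join reproduces the tuples on the slide, and `canonicalize` is a hypothetical stand-in for Canonicalize.

```python
from collections import defaultdict


def canonicalize(url):
    """Hypothetical stand-in for Canonicalize: lowercase and add a 'www.' prefix."""
    url = url.lower()
    return url if url.startswith("www.") else "www." + url


# Load Visits(user, url, time); every URL except cnn.com is a placeholder.
visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "example.org", "9am"),
    ("Fred", "www.example.org", "11am"),
]

# Load Pages(url, pagerank); the keys are placeholders.
pages = {"www.cnn.com": 0.9, "www.example.org": 0.4}

# Transform to (user, Canonicalize(url), time)
canonical = [(user, canonicalize(url), time) for user, url, time in visits]

# Join url = url, keeping (user, time, pagerank) as on the slide
joined = [(user, time, pages[url]) for user, url, time in canonical if url in pages]

# Group by user
by_user = defaultdict(list)
for user, _time, pagerank in joined:
    by_user[user].append(pagerank)

# Transform to (user, Average(pagerank) as avgPR)
avg_pr = {user: sum(ranks) / len(ranks) for user, ranks in by_user.items()}

# Filter avgPR > 0.5
good_users = {user: avg for user, avg in avg_pr.items() if avg > 0.5}
```

Without the canonicalize step, "cnn.com" and "example.org" would find no match in `pages`, and Amy would drop out of the join entirely.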

12 Installation
  Extract
  Build (ant)
    In pig and in tutorial dir
  Environment variable
    PIGDIR=~/pig-0.1.1
    HADOOPSITEPATH=~/hadoop /conf

13 Running Pig
Two modes:
  Local mode
  Hadoop mode
Three ways to execute:
  Shell (grunt)
  Script
  API (currently Java)
  GUI (future work)

14 Running Pig (2)
Save data into HDFS:
  bin/hadoop fs -copyFromLocal excite-small.log excite-small.log
Launch shell/Run script:
  java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -x mapreduce <script_name>
Our script: script1-hadoop.pig
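For scripted runs, the launch command above can be assembled from the two environment variables set during installation. The helper below is a hypothetical convenience, not part of Pig: it only builds the argument list shown on the slide and does not run anything. The `exec_type` parameter name is an assumption, and the ":" classpath separator assumes a Unix-like system.

```python
import os


def pig_command(script_name, exec_type="mapreduce", env=None):
    """Build the java argument list from the slide, using PIGDIR and HADOOPSITEPATH."""
    env = os.environ if env is None else env
    # Classpath is the Pig jar followed by the Hadoop conf directory.
    classpath = f"{env['PIGDIR']}/pig.jar:{env['HADOOPSITEPATH']}"
    return ["java", "-cp", classpath, "org.apache.pig.Main", "-x", exec_type, script_name]


# Example with made-up paths; pass the list to subprocess.run to launch the script.
command = pig_command(
    "script1-hadoop.pig",
    env={"PIGDIR": "/opt/pig", "HADOOPSITEPATH": "/opt/hadoop/conf"},
)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.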

15 Conclusion For more details: http://hadoop.apache.org/core/

