The Hadoop Stack, Part 1 Introduction to Pig Latin CSE 40822 – Cloud Computing – Fall 2014 Prof. Douglas Thain University of Notre Dame.

Slides:

Advertisements

Similar presentations

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.

Advertisements

Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh.

Hui Li Pig Tutorial Hui Li Some material adapted from slides by Adam Kawa the 3rd meeting of WHUG June 21, 2012.

CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.

Relational Algebra, Join and QBE Yong Choi School of Business CSUB, Bakersfield.

High Level Language: Pig Latin Hui Li Judy Qiu Some material adapted from slides by Adam Kawa the 3 rd meeting of WHUG June 21, 2012.

CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture VII: 2014/04/21.

Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement.

Parallel Computing MapReduce Examples Parallel Efficiency Assignment

The Hadoop Stack, Part 3 Introduction to Spark

Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.

Presented By: Imranul Hoque

Homework 1: Common Mistakes Memory Leak Storing of memory pointers instead of data.

Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.

Relations The Relational Data Model John Sieg, UMass Lowell.

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.

(Hadoop) Pig Dataflow Language B. Ramamurthy Based on Cloudera’s tutorials and Apache’s Pig Manual 6/27/2015.

Pig Latin Olston, Reed, Srivastava, Kumar, and Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD Shahram Ghandeharizadeh.

The Hadoop Stack, Part 2 Introduction to HBase CSE – Cloud Computing – Fall 2014 Prof. Douglas Thain University of Notre Dame.

CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.

HADOOP ADMIN: Session -2

Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.

Pig Acknowledgement: Modified slides from Duke University 04/13/10 Cloud Computing Lecture.

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.

Pig: Making Hadoop Easy Wednesday, June 10, 2009 Santa Clara Marriott.

Cloud Computing Other High-level parallel processing languages Keke Chen.

NoSQL continued CMSC 461 Michael Wilson. MongoDB  MongoDB is another NoSQL solution  Provides a bit more structure than a solution like Accumulo  Data.

Big Data Analytics Training

Pig Latin CS 6800 Utah State University. Writing MapReduce Jobs Higher order functions Map applies a function to a list Example list [1, 2, 3, 4] Want.

Hadoop Basics -Venkat Cherukupalli. What is Hadoop? Open Source Distributed processing Large data sets across clusters Commodity, shared-nothing servers.

Introduction to Hadoop and HDFS

Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.

QMapper for Smart Grid: Migrating SQL-based Application to Hive Yue Wang, Yingzhong Xu, Yue Liu, Jian Chen and Songlin Hu SIGMOD’15, May 31–June 4, 2015.

CSE 486/586 CSE 486/586 Distributed Systems Data Analytics Steve Ko Computer Sciences and Engineering University at Buffalo.

Storage and Analysis of Tera-scale Data : 2 of Database Class 11/24/09

MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.

Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.

Restore ： Reusing results of mapreduce jobs Jun Fan.

An Introduction to HDInsight June 27 th,

RESTORE IMPLEMENTATION as an extension to pig Vijay S.

Presented by Priagung Khusumanegara Prof. Kyungbaek Kim

Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.

MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.

Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.

MAP-REDUCE ABSTRACTIONS 1. Abstractions On Top Of Hadoop We’ve decomposed some algorithms into a map-reduce “workflow” (series of map-reduce steps) –

+ Information Systems and Databases 2.2 Organisation.

I Power Higher Computing Software Development Development Languages and Environments.

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.

Pig, a high level data processing system on Hadoop Gang Luo Nov. 1, 2010.

A Comparison of Approaches to Large-Scale Data Analysis Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, Michael.

CS347: Map-Reduce & Pig Hector Garcia-Molina Stanford University CS347Notes 09 1.

Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.

Other Map-Reduce (ish) Frameworks William Cohen. Y:Y=Hadoop+X or Hadoop~=Y What else are people using? – instead of Hadoop – on top of Hadoop.

CSE 326: Data Structures Lecture #22 Databases and Sorting Alon Halevy Spring Quarter 2001.

What is Pig ???. Why Pig ??? MapReduce is difficult to program. It only has two phases. Put the logic at the phase. Too many lines of code even for simple.

BIG DATA. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database.

Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )

Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.

Distributed Programming in “Big Data” Systems Pramod Bhatotia wp

Pig Latin - A Not-So-Foreign Language for Data Processing

Pig Latin: A Not-So-Foreign Language for Data Processing

Overview of big data tools

Pig : Building High-Level Dataflows over Map-Reduce

CSE 491/891 Lecture 21 (Pig).

Charles Tappert Seidenberg School of CSIS, Pace University

Hadoop – PIG.

Pig and pig latin: An Introduction

Presentation transcript:

The Hadoop Stack, Part 1 Introduction to Pig Latin CSE – Cloud Computing – Fall 2014 Prof. Douglas Thain University of Notre Dame

Three Case Studies Workflow: Pig Latin A dataflow language and execution system that provides an SQL- like way of composing workflows of multiple Map-Reduce jobs. Storage: HBase A NoSQl storage system that brings a higher degree of structure to the flat-file nature of HDFS. Execution: Spark An in-memory data analysis system that can use Hadoop as a persistence layer, enabling algorithms that are not easily expressed in Map-Reduce.

References Christopher Olston, et al, Pig Latin: A Not-so-Foreign Language for Data Processing, SIGMOD Alan Gates et al. Building a high-level dataflow system on top of Map- Reduce: the Pig experience, VLDB Apache Pig Documentation

Pig Latin Overview Map-Reduce programs can be challenging to write, and are limited to performing only one analysis step at a time. Pig Latin: Chain together multiple Map-Reduce runs, each one expressed by a compact statement similar to SQL. By using a high level query language, the system can optimize the order of operations in order to produce a better plan. Pig Pen: Test out queries on sampled data.

Dataflow Programming In a dataflow programming system (an old idea), a program is a directed graph of the steps by which multiple data items are transformed from input to output. The programmer is not responsible for choosing the order in which items are processed. Procedural Programming int A[] = { 10, 20, 30, … }; for( i=0; i<1000; i++ ) B[i] = F( A[i] ); for( i=0; i<1000; i++ ) C[i] = G( A[i] ); Dataflow Programming set A = { 10, 20, 30, … } set B = apply F(x) to A set C = apply G(x) to A

Sample Objective Find each category that contains more than 10^6 entries with a pagerank of 0.2 or higher. categoryurlpagerank Academicwww.nd.edu3.0 Commercialwww.Studebaker.com1.5 Non-Profitwww.redcross.org5.3...(billions of rows)...

If we did it in Map-Reduce: Round One – Compute Averages by Category map( data ) { split line into category, url, pagerank if(pagerank>0.2) emit(category,pagerank) } reduce( key, list( values ) ) { for each item in list { count++; total+=value; } emit( count, category, total/count ); } Round Two – Collect Results into One File map( data ) { split line into count, category, average if(count>10^6) emit(1,(category)); } reduce( key, list( values ) ) { for each value in list { emit(value); }

If we did it in SQL… SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6 categoryAVG(pagerank) Academic1.75 Commercial0.56 Non-Profit

Same thing in Pig Latin… urls = LOAD “urls.txt” AS (url, category, pagerank) good_urls = FILTER urls BY pagerank > 0.2; groups = GROUP good_urls BY category; big_groups = FILTER groups BY COUNT(good_urls)> 10^6; output = FOREACH big_group GENERATE category, AVG(good_urls.pagerank);

Basic Idea Express queries in Pig Latin. Pig translates statements into DAG. DAG is translated into Map-Reduce Jobs. Map-Reduce jobs run in Hadoop. Encourage Unstructured, Nested Data User Defined Functions: A small bit of Java that can be imported into the query language. Could potentially be a large/expensive piece of code.

Normalized Database (SQL) ProductIDName 1Bicycle 2Tricycle 3Unicycle PartIDNamePrice 1Wheel10 2Chain15 3Handlebars3 4Seat12 ProductIDPartIDQuantity Codd’s definitions of normal forms: 1NF - The domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain. 2NF - No non-prime attribute in the table is functionally dependent on a proper subset of any candidate key. 3NF - Every non-prime attribute is non-transitively dependent on every candidate key in the table. The attributes that do not contribute to the description of the primary key are removed from the table. In other words, no transitive dependency is allowed.

Nested Data (Pig) { Name-> ‘Bicycle’, Price -> 105, Parts -> { (1, ‘Wheel’, 2, ), (2, ‘Chain’, 1, ), (3, ‘Handlebars’, 1, 3.00), … }; Name -> ‘Tricycle’, Price -> $55, Parts -> { (1, ‘Wheel’, 3), (3, ‘Handlebars’,1), … } JSON-Like Syntax: Atom: ‘hello’ or 25 Ordered Tuple: (a,b,c) Unordered Bag: {a,b,c} Map: {a->x, b->y, c->z}

Operations bag = LOAD “file” USING format AS (field-list) bag = FILTER bag BY function bag = FOREACH bag GENERATE expression-list bag = DISTINCT bag bag = ORDER bag BY expression-list bag = GROUP bag BY (field-list) bag = COGROUP bag-list BY (field-list) STORE bag INTO file USING format

From Pig to Map-Reduce urls = LOAD “urls.txt” AS (url, category, pagerank) good_urls = FILTER urls BY pagerank > 0.2; groups = GROUP good_urls BY category; big_groups = FILTER groups BY COUNT(good_urls)> 10^6; output = FOREACH big_group GENERATE category, AVG(good_urls.pagerank); Example on Board: Code -> DAG -> Map-Reduce Jobs

Exercise Given a whole lot of stored in Hadoop, write a Pig program will yield the list of spammers who have sent 1000 or more pieces of spam in the last year, and also generate the list of victims who have received more than 1000 items in the last year. Mail from should never be considered spam. The mail is stored in this form: (sender, receiver, contents, date) You have a function called IsSpam(contents) which returns true if the contents contain spam. (Expensive function.)

Points for Discussion Which operations may be rearranged in the DAG? Is Pig Latin different from SQL in a fundamental way? Or could this have simply been done with slight changes to SQL? Big data systems often recommend data management practices that are completely opposite of those developed by relational database systems. Why?