Pig Latin - A Not-So-Foreign Language for Data Processing

Presentation transcript:

Pig Latin - A Not-So-Foreign Language for Data Processing YING YU Sep 19th 2016

Background: what's the problem?
Looking for trends → Product improvement → Advertisement
When data sets grow extremely large, how can we make ad-hoc data analysis tasks more efficient?

Previous Solutions: Analysis of Large Data Sets
Parallel database products: Teradata, Oracle RAC, etc.
  Problem: expensive at web scale
  SQL: unnatural and restrictive to write
MapReduce model
  Code: difficult to reuse and maintain
  Map and reduce functions: opaque to the system, which makes it hard to optimize or check the user's code

Yahoo!'s New Solution: Pig Latin
Combines the best of SQL and MapReduce:
  high-level declarative querying
  low-level procedural programming
A program is a sequence of steps, and each step carries out a single, fairly high-level data transformation.

Why Pig Latin?
1. Quick start and interoperability
   No need to import data before running queries
   Operates on data in external files
   The user chooses the output
2. Support for a nested data model
   Flexible, and more natural to think in
   Operating directly on files saves time and space
   Algebraic language
   e.g. to store information about term positions in documents:
   Nested data model: Map<documentId, Set<positions>>
   Databases: term_info: (termId, termString, ...) and position_info: (termId, documentId, position)
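
The contrast between the flat position_info rows and the nested Map<documentId, Set<positions>> can be sketched in Python. This is only an illustration of the two shapes: the sample rows and the helper name nest_positions are made up, and Python dicts and sets stand in for Pig's maps and bags.

```python
from collections import defaultdict

# Hypothetical normalized rows, shaped like position_info:
# (termId, documentId, position)
position_info = [
    (1, "doc1", 4),
    (1, "doc1", 9),
    (1, "doc2", 2),
    (2, "doc1", 7),
]


def nest_positions(rows):
    """Regroup flat rows into termId -> {documentId: set(positions)}."""
    nested = defaultdict(lambda: defaultdict(set))
    for term_id, doc_id, position in rows:
        nested[term_id][doc_id].add(position)
    # Convert to plain dicts so the result compares and prints cleanly.
    return {term: dict(docs) for term, docs in nested.items()}


term_positions = nest_positions(position_info)
print(term_positions[1])
```

In the nested form, everything known about one term sits in a single value, whereas the flat form needs a join or a regrouping step like the one above to reassemble it.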

Why Pig Latin?
3. Support for UDFs (extensibility)
   For specialized data processing; top10() is a UDF:
     groups = GROUP urls BY category;
     output = FOREACH groups GENERATE category, top10(urls);
   In SQL, each clause can only use its corresponding kind of function:
     SELECT url FROM table WHERE category = 'a'
4. A novel debugging environment
   Detects errors early and pinpoints the step that caused them
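
What the GROUP ... FOREACH ... top10(urls) pair computes can be modeled in a few lines of Python. This is a sketch under assumptions: the (url, category, pagerank) schema, the sample data, and the ranking rule inside top10 are illustrative, since the slide does not define them.

```python
from collections import defaultdict


def group_by(bag, key_index):
    """Model GROUP: one (key, bag_of_tuples) pair per distinct key."""
    groups = defaultdict(list)
    for tup in bag:
        groups[tup[key_index]].append(tup)
    return list(groups.items())


def top10(bag, rank_index=2):
    """Hypothetical UDF: keep the 10 tuples with the highest rank field."""
    return sorted(bag, key=lambda t: t[rank_index], reverse=True)[:10]


# Assumed schema: (url, category, pagerank)
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.4),
    ("c.com", "sports", 0.7),
]

groups = group_by(urls, key_index=1)
output = [(category, top10(bag)) for category, bag in groups]
```

The point of the example is that top10 receives a whole bag per group and returns a bag, which is the kind of function the slide says SQL clauses do not accept freely.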

Pig Latin: data model
Each type is listed with its definition, an example, and how to fetch data from it.
atom: a string or a number. Example: 'a'
tuple: a sequence of fields. Example: ('a','b'). Fetch: tuple.id, tuple.(id0, id1) or $0
bag: a collection of tuples. Example: {('a','b'),('a',('c','d'))}. Fetch: bag.id, bag.(id0, id1) or $0
map: a collection of data items, each looked up by a key. Example: ['fan of' → {('lakers'),('iPod')}, 'age' → 20]. Fetch: map#'key', i.e. $0#'key' or field_name#'key'
Fields can be referenced by position or by name.
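
As a rough analogy, the four types map onto familiar Python values. The mapping below is an assumption made for illustration (a list stands in for a bag, a dict for a map), not how Pig represents data internally.

```python
# atom: a single string or number
atom = "a"

# tuple: a sequence of fields; a field may itself be nested
tup = ("a", ("c", "d"))

# bag: a collection of tuples (a list here, so duplicates are allowed)
bag = [("a", "b"), ("a", ("c", "d"))]

# map: keys mapped to data items of any type
m = {"fan of": [("lakers",), ("iPod",)], "age": 20}

# Field access by position, like $0
first_field = tup[0]

# Projection over a bag, like bag.$0: one single-field tuple per input tuple
projected = [(t[0],) for t in bag]

# Map lookup, like map#'age'
age = m["age"]
```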

Commands
1. LOAD: specifies the input data. myLoad() converts the file into tuples.
     queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
2. FOREACH: processes every tuple independently; FLATTEN is used to eliminate nesting.
     expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
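
The effect of FLATTEN is easiest to see on concrete tuples. The Python model below is a sketch: my_load assumes tab-separated lines and expand_query returns two made-up expansions per query, neither of which is specified on the slide.

```python
def my_load(lines):
    """Stand-in for myLoad(): split each tab-separated line into a tuple."""
    return [tuple(line.split("\t")) for line in lines]


def expand_query(query_string):
    """Hypothetical UDF: return a bag of expansions for one query."""
    return [(query_string,), (query_string + " reviews",)]


def foreach_flatten(queries):
    """Model FOREACH ... GENERATE userId, FLATTEN(expandQuery(queryString))."""
    result = []
    for user_id, query_string, _timestamp in queries:
        # FLATTEN: emit one output tuple per element of the nested bag
        for (expanded,) in expand_query(query_string):
            result.append((user_id, expanded))
    return result


log_lines = ["alice\tlakers\t1", "bob\tipod\t2"]
queries = my_load(log_lines)
expanded_queries = foreach_flatten(queries)
```

Without FLATTEN, each output tuple would carry the whole bag of expansions as a single nested field; with it, each expansion becomes its own (userId, query) tuple.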

Commands
3. FILTER: keeps only the target subset of the data.
     real_queries = FILTER queries BY NOT isBot(userId);
   SQL for comparison: SELECT queryresult FROM table WHERE userId != 'a'
4. COGROUP: brings related data together. Related: GROUP, JOIN, map-reduce.
     grouped_data = COGROUP results BY queryString, revenue BY queryString;
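
A small Python model shows the shape of what FILTER and COGROUP return. The is_bot rule, the schemas, and the sample data are assumptions chosen for the example.

```python
def is_bot(user_id):
    """Hypothetical UDF: treat user ids starting with 'bot' as bots."""
    return user_id.startswith("bot")


def cogroup(left, right, left_key=0, right_key=0):
    """Model COGROUP: one (key, left_bag, right_bag) tuple per distinct key."""
    grouped = {}
    for tup in left:
        grouped.setdefault(tup[left_key], ([], []))[0].append(tup)
    for tup in right:
        grouped.setdefault(tup[right_key], ([], []))[1].append(tup)
    return [(key, bags[0], bags[1]) for key, bags in grouped.items()]


queries = [("alice", "lakers", 1), ("bot7", "ipod", 2)]
real_queries = [q for q in queries if not is_bot(q[0])]  # FILTER ... BY NOT isBot

# Assumed schemas: results (queryString, url, position),
# revenue (queryString, adSlot, amount)
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("kings", "side", 20), ("kings", "top", 30)]
grouped_data = cogroup(results, revenue)
```

Each output tuple keeps the two inputs in separate nested bags instead of pairing them up, which is what distinguishes COGROUP from a join.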

Commands
5. Other SQL-like commands: UNION, CROSS, ORDER and DISTINCT.
6. STORE: stores nested data; asks for the result to be materialized to a file.
     STORE query_revenues INTO 'myoutput' USING myStore();
7. Nested operations: FILTER, ORDER and DISTINCT can be nested within FOREACH.
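
Nesting FILTER within FOREACH means filtering the bag inside each group before aggregating it. The Python sketch below assumes a grouped revenue relation with (queryString, adSlot, amount) tuples; the data and the function name nested_foreach are illustrative.

```python
def nested_foreach(grouped_revenue):
    """For each group: filter the nested bag, then aggregate both bags."""
    output = []
    for query_string, revenue_bag in grouped_revenue:
        # Nested FILTER: keep only this group's tuples whose adSlot is 'top'
        top_slot = [t for t in revenue_bag if t[1] == "top"]
        output.append((
            query_string,
            sum(t[2] for t in top_slot),     # revenue from the top slot
            sum(t[2] for t in revenue_bag),  # total revenue
        ))
    return output


grouped_revenue = [
    ("lakers", [("lakers", "top", 50), ("lakers", "side", 20)]),
    ("kings", [("kings", "side", 30)]),
]
query_revenues = nested_foreach(grouped_revenue)
```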

Implementation
Execution platform: Hadoop
Build a logical plan: built iteratively, one command at a time, and independent of the execution platform
Try to avoid large nested bags
Q: What can be done when an operation after the COGROUP is not an algebraic UDF?
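
An algebraic function is one that can be computed from partial results, so the full nested bag never has to be held in one place. Average is a common example; the three-function split below (initial, combine, final) is a sketch of the idea with made-up names and data, not Pig's UDF interface.

```python
def initial(values):
    """Runs on one chunk of a bag: produce a partial (sum, count)."""
    return (sum(values), len(values))


def combine(partials):
    """Merge partial results; can be applied repeatedly, in any grouping."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def final(partial):
    """Turn the merged partial into the answer."""
    total, count = partial
    return total / count


# One bag, split across three hypothetical map tasks
chunks = [[10, 20], [30], [40, 50, 60]]
partials = [initial(chunk) for chunk in chunks]
average = final(combine(partials))
```

A function such as median has no small partial result of this kind, which is why the question on the slide matters: in that case the whole bag for a group has to be gathered before the function can run.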

Logical Plan
Construction of the logical plan: the interpreter parses the commands.
e.g. grouped_data = COGROUP results BY queryString, revenue BY queryString;
The logical plan of "grouped_data" includes:
  a COGROUP command
  the logical plans for the two bags, results and revenue
Compile: when the STORE command is invoked on a bag, the logical plan is compiled into a physical plan (logical plan → physical plan).
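
The build-now, run-at-STORE behaviour can be illustrated with a toy plan in Python. Everything here (PlanNode, load, filter_by, store, the executed log) is hypothetical scaffolding for the example and not Pig's internal design.

```python
class PlanNode:
    """A node in a toy logical plan: an operator plus its input plans."""

    def __init__(self, op, inputs=(), func=None):
        self.op = op
        self.inputs = list(inputs)
        self.func = func


executed = []  # records which operators actually ran, for illustration


def load(data):
    return PlanNode("LOAD", func=lambda: list(data))


def filter_by(plan, predicate):
    return PlanNode("FILTER", [plan], lambda bag: [t for t in bag if predicate(t)])


def store(plan):
    """Only here is the plan walked and executed, inputs first."""
    inputs = [store(child) for child in plan.inputs]
    executed.append(plan.op)
    return plan.func(*inputs)


queries = load([("alice", "lakers"), ("bot7", "ipod")])
real_queries = filter_by(queries, lambda t: not t[0].startswith("bot"))
ran_before_store = list(executed)  # still empty: only a plan exists so far

result = store(real_queries)
```

Each command only adds a node that points at the plans of its inputs, so the plan for real_queries contains the plan for queries, mirroring how the plan for grouped_data contains the plans for results and revenue.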

Map-Reduce Plan
Each COGROUP command in the logical plan is converted into a distinct map-reduce job.
Q: What if a COGROUP command has more than one input data set?
A: The map function appends an extra field to each tuple that identifies which data set the tuple came from.
Map and reduce instances run in parallel.
e.g. from Programming Pig
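
The source-tagging trick can be simulated in plain Python. This is a single-process sketch of the map, shuffle and reduce phases with made-up data; a real job would run many map and reduce instances in parallel on Hadoop.

```python
from collections import defaultdict


def map_phase(dataset, source_index, key_index=0):
    """Emit (key, (source_index, tuple)) so the reducer knows each tuple's origin."""
    return [(tup[key_index], (source_index, tup)) for tup in dataset]


def shuffle(pairs):
    """Group map output by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, tagged, num_sources):
    """Use the tag to route each tuple into the bag for its data set."""
    bags = [[] for _ in range(num_sources)]
    for source_index, tup in tagged:
        bags[source_index].append(tup)
    return (key, *bags)


results = [("lakers", "nba.com"), ("kings", "nhl.com")]
revenue = [("lakers", 50), ("lakers", 20)]

pairs = map_phase(results, 0) + map_phase(revenue, 1)
grouped_data = [
    reduce_phase(key, tagged, 2) for key, tagged in sorted(shuffle(pairs).items())
]
```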

Debugging environment: Pig Pen
Sandbox data set (realism, conciseness, completeness)
Helps the user understand the schema
Q: How can a data set be generated that fulfills all three objectives?

Usage scenarios
Rollup aggregates
  Data sets are too big, and arrive too continuously, to be accumulated
  Run successive aggregations or other commands directly on the files
Temporal analysis
  Studies how data changes by comparing data from different periods (COGROUP, UDFs)
Session analysis
  Analyzes user activity during a session
  The nested data model is used to represent and manipulate sessions (ORDER BY)

Future Optimizations
Database-style optimization
Safe optimization that does not change semantics
UI
Support for other languages
Unified environment

Objectives of Pig Latin
What are the main differences between Pig and an RDBMS?
Q: Does Pig have transaction handling and indexes?
A: No
How will Pig Latin support other scripting languages such as Perl or Python?