Christopher Olston and many others Yahoo! Research Programming and Debugging Large-Scale Data Processing Workflows.

Presentation transcript:

Programming and Debugging Large-Scale Data Processing Workflows
Christopher Olston and many others
Yahoo! Research

Context
Elaborate processing of large data sets, e.g.:
– web search pre-processing
– cross-dataset linkage
– web information extraction
[Diagram labels: ingestion, storage & processing, serving]

Context
Storage & processing stack:
– scalable file system, e.g. GFS
– distributed sorting & hashing, e.g. Map-Reduce
– dataflow programming framework, e.g. Pig
– workflow manager, e.g. Nova
Debugging aids:
– Before: example data generator
– During: instrumentation framework
– After: provenance metadata manager
Systems covered, in order: 1. Pig, 2. Nova, 3. Dataflow Illustrator, 4. Inspector Gadget, 5. Ibis

Next: PIG
[Context diagram repeated, with Pig (1) marked as the dataflow programming framework]

Pig: A High-Level Language and Runtime for Map-Reduce
– 1/20 the lines of code
– 1/16 the development time
– performs on par with raw Hadoop

Syntax
Example: web browsing sessions with "happy endings."
    Visits = load '/data/visits' as (user, url, time);
    Visits = foreach Visits generate user, Canonicalize(url), time;
    Pages = load '/data/pages' as (url, pagerank);
    VP = join Visits by url, Pages by url;
    UserVisits = group VP by user;
    Sessions = foreach UserVisits generate flatten(FindSessions(*));
    HappyEndings = filter Sessions by BestIsLast(*);
    store HappyEndings into '/data/happy_endings';

Pig Latin = Sweet Spot Between SQL & Map-Reduce
"I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." -- Jasmine Novak, Engineer, Yahoo!
"PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP.. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on top of Map-Reduce] doesn't)." -- Ricky Ho, Adobe Software
"The [Hofmann PLSA E/M] algorithm was implemented in pig in lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig." -- Prasenjit Mukherjee, Mahout project

Status
Productized at Yahoo (~12-person team)
– 1000s of jobs/day
– 70% of Hadoop jobs
Open-source (the Apache Pig Project)
Offered on Amazon Elastic Map-Reduce
Used by LinkedIn, Twitter, Yahoo, ...

Next: NOVA
[Context diagram repeated, with Pig (1) and Nova (2) marked; Nova is the workflow manager]

Why a Workflow Manager?
Modularity: a workflow connects N dataflow modules
– Written independently, and re-used in other workflows
– Scheduled independently
Optimization: optimize across modules
– Share read costs among side-by-side modules
– Pipeline data between end-to-end modules
Continuous processing: push new data through
– Selective re-running
– Incremental algorithms ("view maintenance")
Manageability: help humans keep tabs on execution
– Alerts
– Metadata (e.g. data provenance)

Example Workflow
[Diagram: news de-duplication workflow. Modules: template detection, template tagging, shingling, de-duping. Data: RSS feed, news articles, news site templates, shingle hashes seen, unique articles. Edges are annotated ALL or NEW. Callouts: Pig program template; "delta blocks"; programmable "merge" operation (lazy vs. eager); HDFS.]
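
The NEW and ALL edge annotations suggest that a module such as de-duping consumes only newly arrived records while consulting the full state accumulated so far. The following Python sketch illustrates that incremental pattern under stated assumptions: the function names, the word-trigram fingerprint, and the in-memory set standing in for "shingle hashes seen" are all hypothetical and are not taken from Nova.

```python
import hashlib

def shingle_hash(text, k=3):
    """Hypothetical fingerprint: hash of the set of word k-grams in the text."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    digest = hashlib.sha1("\n".join(sorted(shingles)).encode("utf-8"))
    return digest.hexdigest()

def dedupe_increment(new_articles, seen_hashes):
    """Process only the NEW articles against ALL hashes seen so far.

    Returns the unique new articles and the delta of hashes to add to the
    accumulated state.
    """
    unique, delta = [], set()
    for article in new_articles:
        h = shingle_hash(article)
        if h not in seen_hashes and h not in delta:
            unique.append(article)
            delta.add(h)
    return unique, delta

seen = set()  # stands in for the "shingle hashes seen" data set
batch1 = ["Markets rally on earnings news", "markets rally on earnings news"]
batch2 = ["Markets rally on earnings news", "Storm closes coastal roads today"]

unique1, delta1 = dedupe_increment(batch1, seen)
seen |= delta1  # merge the delta into the base right away (an "eager" merge)
unique2, delta2 = dedupe_increment(batch2, seen)
seen |= delta2
```

Each batch costs work proportional to the new data only; a "lazy" variant would keep the deltas separate and combine them when the state is next read.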

Nova Performance
[Two charts: Merge Overhead; Incremental Join]

Next: DATAFLOW ILLUSTRATOR
[Context diagram repeated, with Pig (1), Nova (2) and Dataflow Illustrator (3) marked; the Dataflow Illustrator is the "before" debugging aid, an example data generator]

Example Pig Dataflow
Find users who tend to visit "good" pages.
– Load Visits(user, url, time)
– Transform to (user, Canonicalize(url), time)
– Load Pages(url, pagerank)
– Join url = url
– Group by user
– Transform to (user, Average(pagerank) as avgPR)
– Filter avgPR > 0.5
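
To make the steps of this dataflow concrete, here is a minimal Python sketch that runs the same sequence (transform, join, group, average, filter) over a few in-memory records. It is an illustration, not Pig: `canonicalize` is a hypothetical stand-in for the Canonicalize UDF, and the `frogs.example` URL is invented; the users, times and pagerank values mirror the example records on the next slide.

```python
from collections import defaultdict
from urllib.parse import urlparse

def canonicalize(url):
    """Hypothetical stand-in for the Canonicalize UDF: lower-cased host without 'www.'."""
    host = urlparse(url if "//" in url else "//" + url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def good_page_users(visits, pages, threshold=0.5):
    # Transform: canonicalize each visit's URL.
    visits = [(user, canonicalize(url), time) for user, url, time in visits]
    # Join on url (pages is a list of (url, pagerank) tuples).
    rank = dict(pages)
    joined = [(user, time, rank[url]) for user, url, time in visits if url in rank]
    # Group by user.
    groups = defaultdict(list)
    for user, _time, pagerank in joined:
        groups[user].append(pagerank)
    # Transform to (user, avgPR), then filter on the threshold.
    averages = {user: sum(prs) / len(prs) for user, prs in groups.items()}
    return {user: avg for user, avg in averages.items() if avg > threshold}

visits = [("Amy", "http://www.cnn.com/", "8am"),
          ("Amy", "frogs.example", "9am"),
          ("Fred", "frogs.example", "11am")]
pages = [("cnn.com", 0.9), ("frogs.example", 0.4)]
result = good_page_users(visits, pages)
```

With these inputs only Amy survives the filter, since her average pagerank is above 0.5 and Fred's is not.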

Illustrated!
[Same dataflow, annotated with example records at each step]
– Load Visits(user, url, time): (Amy, cnn.com, 8am) (Amy, 9am) (Fred, 11am)
– Transform to (user, Canonicalize(url), time): (Amy, 8am) (Amy, 9am) (Fred, 11am)
– Load Pages(url, pagerank): ( 0.9) ( 0.4)
– Join url = url: (Amy, 8am, 0.9) (Amy, 9am, 0.4) (Fred, 11am, 0.4)
– Group by user: (Amy, { (Amy, 8am, 0.9), (Amy, 9am, 0.4) }) (Fred, { (Fred, 11am, 0.4) })
– Transform to (user, Average(pagerank) as avgPR): (Amy, 0.65) (Fred, 0.4)
– Filter avgPR > 0.5

Naïve Algorithm
[Same dataflow; example records appear only at the load and first transform steps]
– Load Visits(user, url, time): (Amy, cnn.com, 8am) (Amy, 9am) (Fred, 11am)
– Transform to (user, Canonicalize(url), time): (Amy, 8am) (Amy, 9am) (Fred, 11am)
– Load Pages(url, pagerank): ( 0.9) ( 0.4)
– Join url = url
– Group by user
– Transform to (user, Average(pagerank) as avgPR)
– Filter avgPR > 0.5

Dataflow Illustrator
Try to satisfy 3 objectives simultaneously:
– Realism, conciseness, completeness
Why it's hard:
– Large data set to comb through for "real" examples
– Selective operators (e.g. filter, join)
– Noninvertible operators (e.g. UDFs)
Implemented in Pig as the "ILLUSTRATE" command
"ILLUSTRATE lets me check the output of my lengthy batch jobs and their custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive." -- Russell Jurney, LinkedIn
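
The difficulty with selective operators can be shown with a small Python sketch: a naive sample of each input rarely contains matching join keys, so nothing flows past the join and the example is incomplete. The second function is a hypothetical repair that fetches the page records the sampled visits need; it only illustrates the idea and is not the algorithm behind ILLUSTRATE. All names and the synthetic data are assumptions.

```python
def join_on_url(visits, pages):
    """Inner join of (user, url) visits with (url, pagerank) pages."""
    rank = dict(pages)
    return [(user, url, rank[url]) for user, url in visits if url in rank]

def naive_sample(table, n):
    """Naive approach: just take the first n records of each input."""
    return table[:n]

def join_aware_sample(visits, pages, n):
    """Hypothetical repair: sample visits, then fetch the page records they need."""
    sampled_visits = visits[:n]
    wanted = {url for _user, url in sampled_visits}
    return sampled_visits, [page for page in pages if page[0] in wanted]

# Synthetic inputs: 1000 pages; the visits touch pages from the far end of the table.
pages = [("site%d.example" % i, i / 1000) for i in range(1000)]
visits = [("user%d" % i, "site%d.example" % (999 - i)) for i in range(100)]

naive_out = join_on_url(naive_sample(visits, 3), naive_sample(pages, 3))
sampled_visits, sampled_pages = join_aware_sample(visits, pages, 3)
aware_out = join_on_url(sampled_visits, sampled_pages)
```

Here `naive_out` is empty while `aware_out` has one joined record per sampled visit, which is the kind of completeness the illustrator aims for while keeping the example small and drawn from real data.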

Next: INSPECTOR GADGET
[Context diagram repeated, with items 1 to 4 marked; Inspector Gadget (4) is the "during" debugging aid, an instrumentation framework]

Motivated by User Interviews
– Interviewed 10 Yahoo dataflow programmers (mostly Pig users; some users of other dataflow environments)
– Asked them how they (wish they could) debug

Summary of User Interviews
# of requests   feature
7               crash culprit determination
5               row-level integrity alerts
4               table-level integrity alerts
4               data samples
3               data summaries
3               memory use monitoring
3               backward tracing (provenance)
2               forward tracing
2               golden data/logic testing
2               step-through debugging
2               latency alerts
1               latency profiling
1               overhead profiling
1               trial runs

Precept
Add these features to Pig without modifying Pig or tampering with data flowing through Pig

Our Hammer: Pig Script Rewriting
– Instrument Pig script by inserting special "monitoring agent" UDFs between operators
– Each agent observes records flowing through
– Agents communicate "interesting observations" to a central coordinator
– Coordinator delivers findings to end user
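
The rewriting idea can be sketched in a few lines of Python. The sketch models a dataflow as a list of functions over record streams and inserts a pass-through agent after every operator; each agent reports a record count to a coordinator without altering the data. The classes, names and in-process reporting are hypothetical simplifications, since Inspector Gadget rewrites Pig scripts and its agents run as UDFs.

```python
class Coordinator:
    """Collects observations from agents (hypothetical, in-process)."""
    def __init__(self):
        self.observations = []

    def report(self, agent_name, message):
        self.observations.append((agent_name, message))

def make_agent(name, coordinator):
    """Return a pass-through stage that counts records without modifying them."""
    def agent(records):
        count = 0
        for record in records:
            count += 1
            yield record  # data flows through untouched
        coordinator.report(name, "saw %d records" % count)
    return agent

def instrument(operators, coordinator):
    """Rewrite a pipeline by inserting one agent after every operator."""
    rewritten = []
    for i, op in enumerate(operators):
        rewritten.append(op)
        rewritten.append(make_agent("after_op_%d" % i, coordinator))
    return rewritten

def run(pipeline, records):
    for stage in pipeline:
        records = stage(records)
    return list(records)

# Toy two-operator dataflow: a filter followed by a projection.
operators = [
    lambda recs: (r for r in recs if r[1] > 0.5),
    lambda recs: (r[0] for r in recs),
]
coordinator = Coordinator()
output = run(instrument(operators, coordinator), [("a", 0.9), ("b", 0.4), ("c", 0.7)])
```

The original operators are left unchanged, which mirrors the precept of adding features without modifying the engine or the data.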

Instrumented Dataflow
[Diagram: dataflow operators (load, filter, join, group, count, store) with IG agents inserted between them, all reporting to the IG coordinator]

Example: Crash Culprit Determination
– Phases 1 to n-1: maintain count lower bounds (agents send tuple counts to the coordinator)
– Phase n: maintain last-seen tuples (agents send tuples to the coordinator)
[Diagram: instrumented dataflow (load, filter, join, group, count, store) with IG agents and the IG coordinator]
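
A much-simplified Python sketch of the last-seen-tuple idea: an agent placed in front of a fragile operator remembers the most recent record it passed on, so when the operator crashes the record responsible can be reported. The sketch collapses the multi-phase protocol above into a single run and uses hypothetical names throughout (`LastSeenAgent`, `parse_age`); it is not Inspector Gadget code.

```python
class LastSeenAgent:
    """Hypothetical agent: remembers the most recent record it passed on."""
    def __init__(self):
        self.last_seen = None
        self.count = 0

    def watch(self, records):
        for record in records:
            self.last_seen = record
            self.count += 1
            yield record

def parse_age(record):
    """Toy downstream operator that crashes on malformed input."""
    name, age = record
    return name, int(age)

def run_with_culprit_detection(records):
    agent = LastSeenAgent()
    try:
        return [parse_age(r) for r in agent.watch(records)], None
    except ValueError:
        # The record the agent saw last is the one that triggered the crash.
        return None, (agent.count, agent.last_seen)

result, culprit = run_with_culprit_detection(
    [("amy", "31"), ("fred", "n/a"), ("eve", "27")]
)
```

Keeping only a count and one record per agent keeps the monitoring overhead small, which is presumably why the earlier phases track counts alone.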

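Example: Forward Tracing
– The coordinator sends tracing instructions to the agents
– Agents send traced records to the coordinator
– The coordinator reports traced records to the user
[Diagram: instrumented dataflow (load, filter, join, group, count, store) with IG agents and the IG coordinator]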

Flow:
[Diagram labels: end user; dataflow program + app. parameters; application; IG driver library; launch instrumented dataflow run(s); dataflow engine runtime (load, filter, join, store, IG agents, IG coordinator); raw result(s); result]

Applications Developed Using IG
# of requests   feature                         lines of code (Java)
7               crash culprit determination     141
5               row-level integrity alerts      89
4               table-level integrity alerts    99
4               data samples                    97
3               data summaries                  130
3               memory use monitoring           N/A
3               backward tracing (provenance)   237
2               forward tracing                 114
2               golden data/logic testing       200
2               step-through debugging          N/A
2               latency alerts                  168
1               latency profiling               136
1               overhead profiling              124
1               trial runs                      93

Next: IBIS
[Context diagram repeated, with all five items marked; Ibis (5) is the "after" debugging aid, a provenance metadata manager]

Ibis Motivation
[Diagram: the storage & processing stack (scalable file system, e.g. GFS; distributed sorting & hashing, e.g. Map-Reduce; dataflow programming framework, e.g. Pig; workflow manager, e.g. Nova) together with a low-latency processor, ingestion and serving; datum X and datum Y are marked]
Question: what is the provenance of X?

Ibis Motivation
Benefits:
– Provide uniform view to users
– Factor out metadata management code
– Decouple metadata lifetime from data/subsystem lifetime
[Diagram: data processing sub-systems send metadata to Ibis, the metadata manager, which holds the integrated metadata; users send metadata queries and receive answers]

Key Challenge: Many Granularities of Data and Processing Elements
Data granularities: Table, Column group, Row, Column, Cell, Version, Web page, Snapshot
Process granularities: Workflow, Pig script, MR program, Pig job, Pig logical operation, Pig physical operation, MR job, MR job phase, MR task, Task attempt

Provenance Graph Example
[Diagram: provenance graph. Nodes: Yahoo! Movies web page; IMDB web page; "extract" pig script; pig job 1 (version = 2, wrapper = imdb); pig job 2 (version = 3, wrapper = yahoo); map task 1, attempt 1; map task 2, attempt 1; map output 1; map output 2; "merge" pig script; reduce task 1, attempt 1; Y!Movies extracted table; IMDB extracted table; combined extracted table]
Tables shown in the diagram:
(a) title      year  lead actor
    Avatar     2009  V1: Worthington, V2: Saldana
    Inception  2010  DiCaprio
(b) title      year  lead actor
    Avatar     2009  Saldana
    Inception  2010  DiCaprio
(c) title      year  lead actor
    Avatar     2009  Worthington
    Inception  2010  DiCaprio

IQL: Ibis Query Language
Find all data rows that stem from version 3 of the "extract" pig script:
    select d2.*
    from PigScript p, PigJob j, AnyData d1, Row d2
    where p.id = "extract.pig" and j under p and j.version = 3
      and j emits d1 and d1 influences(2) d2;
Resolve version conflicts based on source authority:
    select v.id, source.authScore
    from Version v, WebPage source
    where source influences(2) v
      and not exists (select *
                      from Version v2, WebPage source2, (Row,Column) commonParent
                      where source2 influences(2) v2
                        and v under commonParent and v2 under commonParent
                        and source2.authScore > source.authScore);
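
As a rough intuition for what an "influences" predicate computes, the Python sketch below treats provenance as a directed graph and answers two questions by reachability: what a source influences, and what influences a given result. The edge list is an assumed reading of the provenance graph example, the function names are hypothetical, and the numeric argument in `influences(2)` is not modelled because the slides do not define it.

```python
from collections import deque

# Edges read "source influences target"; this layout is an assumed reading
# of the provenance graph example, not data taken from Ibis.
EDGES = {
    "IMDB web page": ["IMDB extracted table"],
    "Yahoo! Movies web page": ["Y!Movies extracted table"],
    "IMDB extracted table": ["combined extracted table"],
    "Y!Movies extracted table": ["combined extracted table"],
}

def influenced_by(source, edges=EDGES):
    """Return every node reachable from source (transitive 'influences')."""
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def provenance_of(target, edges=EDGES):
    """Return every node that transitively influences target."""
    return {node for node in edges if target in influenced_by(node, edges)}

downstream = influenced_by("IMDB web page")
upstream = provenance_of("combined extracted table")
```

A real provenance manager would also have to relate elements at different granularities (for example a row to its table), which is what the `under` predicate in the queries above expresses.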

Summary
Storage & processing stack:
– scalable file system, e.g. GFS
– distributed sorting & hashing, e.g. Map-Reduce
– dataflow programming framework, e.g. Pig (1)
– workflow manager, e.g. Nova (2)
Debugging aids:
– Before: example data generator, Dataflow Illustrator (3)
– During: instrumentation framework, Inspector Gadget (4)
– After: provenance metadata manager, Ibis (5)

Related Work
– Pig: DryadLINQ, Hive, Jaql, Scope, relational query languages
– Nova: BigTable, CBP, Oozie, Percolator, scientific workflow, incremental view maintenance
– Dataflow illustrator: [Mannila/Raiha, PODS’86], reverse query processing, constraint databases, hardware verification & model checking
– Inspector gadget: XTrace, taint tracking, aspect-oriented programming
– Ibis: Kepler COMAD, ZOOM user views, provenance management for databases & scientific workflows

Collaborators
Shubham Chopra, Anish Das Sarma, Alan Gates, Pradeep Kamath, Ravi Kumar, Shravan Narayanamurthy, Olga Natkovich, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava, Andrew Tomkins