Christopher Olston and many others Yahoo! Research Programming and Debugging Large-Scale Data Processing Workflows.

Christopher Olston and many others Yahoo! Research Programming and Debugging Large-Scale Data Processing Workflows

Context Elaborate processing of large data sets e.g.: web search pre-processing cross-dataset linkage web information extraction serving ingestion storage & processing

Context Debugging aides: Before: example data generator During: instrumentation framework After: provenance metadata manager storage & processing scalable file system e.g. GFS distributed sorting & hashing e.g. Map-Reduce dataflow programming framework e.g. Pig workflow manager e.g. Nova Pig 1 Nova 2 Dataflow Illustrator 3 Inspector Gadget 4 Ibis 5

Next: PIG Debugging aides: Before: example data generator During: instrumentation framework After: provenance metadata manager storage & processing scalable file system e.g. GFS distributed sorting & hashing e.g. Map-Reduce dataflow programming framework e.g. Pig workflow manager e.g. Nova Pig 1

Pig: A High-Level Language and Runtime for Map-Reduce 1/20 the lines of code1/16 the development time performs on par with raw Hadoop

Syntax Visits = load ‘/data/visits’ as (user, url, time); Visits = foreach Visits generate user, Canonicalize(url), time; Pages = load ‘/data/pages’ as (url, pagerank); VP = join Visits by url, Pages by url; UserVisits = group VP by user; Sessions = foreach UserVisits generate flatten(FindSessions(*)); HappyEndings = filter Sessions by BestIsLast(*); store HappyEndings into '/data/happy_endings'; Web browsing sessions with “happy endings.”

Pig Latin = Sweet Spot Between SQL & Map-Reduce "I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.” -- Jasmine Novak, Engineer, Yahoo! "I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.” -- Jasmine Novak, Engineer, Yahoo! "PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP.. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on top of Map- Reduce] doesn’t).” -- Ricky Ho, Adobe Software "PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP.. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on top of Map- Reduce] doesn’t).” -- Ricky Ho, Adobe Software "The [Hofmann PLSA E/M] algorithm was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig.” -- Prasenjit Mukherjee, Mahout project "The [Hofmann PLSA E/M] algorithm was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig.” -- Prasenjit Mukherjee, Mahout project

Status Productized at Yahoo (~12-person team) – 1000s of jobs/day – 70% of Hadoop jobs Open-source (the Apache Pig Project) Offered on Amazon Elastic Map-Reduce Used by LinkedIn, Twitter, Yahoo,...

Next: NOVA Debugging aides: Before: example data generator During: instrumentation framework After: provenance metadata manager storage & processing scalable file system e.g. GFS distributed sorting & hashing e.g. Map-Reduce dataflow programming framework e.g. Pig workflow manager e.g. Nova Pig 1 Nova 2

Why a Workflow Manager? Modularity: a workflow connects N dataflow modules – Written independently, and re-used in other workflows – Scheduled independently Optimization: optimize across modules – Share read costs among side-by-side modules – Pipeline data between end-to-end modules Continuous processing: push new data through – Selective re-running – Incremental algorithms (“view maintenance”) Manageability: help humans keep tabs on execution – Alerts – Metadata (e.g. data provenance)

Example Workflow template detection news site templates news articles RSS feed template tagging shingling de- duping unique articles ALL NEW ALL shingle hashes seen ALL NEW Pig program template “delta blocks” programmable “merge” operation (lazy vs. eager) HDFS

Nova Performance Merge OverheadIncremental Join

Next: DATAFLOW ILLUSTRATOR Debugging aides: Before: example data generator During: instrumentation framework After: provenance metadata manager storage & processing scalable file system e.g. GFS distributed sorting & hashing e.g. Map-Reduce dataflow programming framework e.g. Pig workflow manager e.g. Nova Pig 1 Nova 2 Dataflow Illustrator 3

Transform to (user, Canonicalize(url), time) Load Pages(url, pagerank) Load Visits(user, url, time) Join url = url Group by user Transform to (user, Average(pagerank) as avgPR) Filter avgPR > 0.5 Example Pig Dataflow Find users who tend to visit “good” pages.

Transform to (user, Canonicalize(url), time) Join url = url Group by user Transform to (user, Average(pagerank) as avgPR) Filter avgPR > 0.5 Load Pages(url, pagerank) Load Visits(user, url, time) (Amy, 0.65) (Fred, 0.4) (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) }) (Fred, { (Fred, www.snails.com, 11am, 0.4) }) (Amy, www.cnn.com, 8am, 0.9) (Amy, www.snails.com, 9am, 0.4) (Fred, www.snails.com, 11am, 0.4) (Amy, cnn.com, 8am) (Amy, http://www.snails.com, 9am) (Fred, www.snails.com/index.html, 11am) (Amy, www.cnn.com, 8am) (Amy, www.snails.com, 9am) (Fred, www.snails.com, 11am) (www.cnn.com, 0.9) (www.snails.com, 0.4) Illustrated!

Transform to (user, Canonicalize(url), time) Join url = url Group by user Transform to (user, Average(pagerank) as avgPR) Filter avgPR > 0.5 Load Pages(url, pagerank) Load Visits(user, url, time) (Amy, cnn.com, 8am) (Amy, http://www.snails.com, 9am) (Fred, www.snails.com/index.html, 11am) (Amy, www.cnn.com, 8am) (Amy, www.snails.com, 9am) (Fred, www.snails.com, 11am) (www.youtube.com, 0.9) (www.frogs.com, 0.4) Naïve Algorithm

Dataflow Illustrator Try to satisfy 3 objectives simultaneously: – Realism, conciseness, completeness Why it’s hard: – Large data set to comb through for “real” examples – Selective operators (e.g. filter, join) – Noninvertible operators (e.g. UDFs) Implemented in Pig as “ILLUSTRATE” command “ILLUSTRATE lets me check the output of my lengthy batch jobs and their custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive.” -- Russell Jurney, LinkedIn “ILLUSTRATE lets me check the output of my lengthy batch jobs and their custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive.” -- Russell Jurney, LinkedIn

Next: INSPECTOR GADGET Debugging aides: Before: example data generator During: instrumentation framework After: provenance metadata manager storage & processing scalable file system e.g. GFS distributed sorting & hashing e.g. Map-Reduce dataflow programming framework e.g. Pig workflow manager e.g. Nova Pig 1 Nova 2 Dataflow Illustrator 3 Inspector Gadget 4

Motivated by User Interviews Interviewed 10 Yahoo dataflow programmers (mostly Pig users; some users of other dataflow environments) Asked them how they (wish they could) debug

Summary of User Interviews # of requestsfeature 7crash culprit determination 5row-level integrity alerts 4table-level integrity alerts 4data samples 3data summaries 3memory use monitoring 3backward tracing (provenance) 2forward tracing 2golden data/logic testing 2step-through debugging 2latency alerts 1latency profiling 1overhead profiling 1trial runs

Precept Add these features to Pig without modifying Pig or tampering with data flowing through Pig

Our Hammer: Pig Script Rewriting Instrument Pig script by inserting special “monitoring agent” UDFs between operators Each agent observes records flowing through Agents communicate “interesting observations” to a central coordinator Coordinator delivers findings to end user

group count join filter load IG coordinator store IG agent Instrumented Dataflow

Example: Crash Culprit Determination tuple countsPhases 1 to n-1: tuplesPhase n: Phases 1 to n-1: maintain count lower bounds Phase n: maintain last-seen tuples group count join filter load IG coordinator store IG agent

Example: Forward Tracing tracing instructions report traced records to user group count join filter load IG coordinator store IG agent traced records

join filter load IG coordinator IG agent store dataflow engine runtime application launch instrumented dataflow run(s) end user dataflow program + app. parameters IG driver library raw result(s) result Flow:

Applications Developed Using IG # of requestsfeaturelines of code (Java) 7crash culprit determination141 5row-level integrity alerts89 4table-level integrity alerts99 4data samples97 3data summaries130 3memory use monitoringN/A 3backward tracing (provenance) 237 2forward tracing114 2golden data/logic testing200 2step-through debuggingN/A 2latency alerts168 1latency profiling136 1overhead profiling124 1trial runs93

Next: IBIS Debugging aides: Before: example data generator During: instrumentation framework After: provenance metadata manager storage & processing scalable file system e.g. GFS distributed sorting & hashing e.g. Map-Reduce dataflow programming framework e.g. Pig workflow manager e.g. Nova Pig 1 Nova 2 Dataflow Illustrator 3 Inspector Gadget 4 Ibis 5

Ibis Motivation scalable file system e.g. GFS distributed sorting & hashing e.g. Map-Reduce dataflow programming framework e.g. Pig workflow manager e.g. Nova low-latency processor servingingestion Question: what is the provenance of X? datum X datum Y

Ibis Motivation Benefits: – Provide uniform view to users – Factor out metadata management code – Decouple metadata lifetime from data/subsystem lifetime data processing sub-systemsmetadata manager users metadata queries answers metadata Ibis integrated metadata

Key Challenge: Many Granularities of Data and Processing Elements Pig script Pig job Pig logical operation MR job Pig physical operation MR job phase MR task Task attempt data granularitiesprocess granularities Table Column group Row Column Cell Version Web page Workflow MR program Snapshot

Provenance Graph Example extract pig script titleyearlead actor Avatar2009V1: Worthington V2: Saldana Inception2010DiCaprio titleyearlead actor Avatar2009Saldana Inception2010DiCaprio titleyearlead actor Avatar2009Worthington Inception2010DiCaprio Yahoo! Movies web page Yahoo! Movies web page IMDB web page IMDB web page map output 1 map output 2 pig job 2 Y!Movies extracted table IMDB extracted table combined extracted table map task 1, attempt 1 map task 2, attempt 1 reduce task 1, attempt 1 merge pig script version = 3 wrapper = yahoo version = 3 wrapper = yahoo pig job 1 version = 2 wrapper = imdb version = 2 wrapper = imdb

IQL: Ibis Query Language Find all data rows that stem from version 3 of the “extract” pig script: Resolve version conflicts based on source authority: select d2.* from PigScript p, PigJob j, AnyData d1, Row d2 where p.id = “extract.pig” and j under p and j.version = 3 and j emits d1 and d1 influences(2) d2; select d2.* from PigScript p, PigJob j, AnyData d1, Row d2 where p.id = “extract.pig” and j under p and j.version = 3 and j emits d1 and d1 influences(2) d2; select v.id, source.authScore from Version v, WebPage source where source influences(2) v and not exists (select * from Version v2, WebPage source2, (Row,Column) commonParent where source2 influences(2) v2 and v under commonParent and v2 under commonParent and source2.authScore > source.authScore); select v.id, source.authScore from Version v, WebPage source where source influences(2) v and not exists (select * from Version v2, WebPage source2, (Row,Column) commonParent where source2 influences(2) v2 and v under commonParent and v2 under commonParent and source2.authScore > source.authScore);

Summary Debugging aides: Before: example data generator During: instrumentation framework After: provenance metadata manager storage & processing scalable file system e.g. GFS distributed sorting & hashing e.g. Map-Reduce dataflow programming framework e.g. Pig workflow manager e.g. Nova Pig 1 Nova 2 Dataflow Illustrator 3 Inspector Gadget 4 Ibis 5

Related Work Pig: DryadLINQ, Hive, Jaql, Scope, relational query languages Nova: BigTable, CBP, Oozie, Percolator, scientific workflow, incremental view maintenance Dataflow illustrator: [Mannila/Raiha, PODS’86], reverse query processing, constraint databases, hardware verification & model checking Inspector gadget: XTrace, taint tracking, aspect-oriented programming Ibis: Kepler COMAD, ZOOM user views, provenance management for databases & scientific workflows

Collaborators Shubham Chopra Anish Das Sarma Alan Gates Pradeep Kamath Ravi Kumar Shravan Narayanamurthy Olga Natkovich Benjamin Reed Santhosh Srinivasan Utkarsh Srivastava Andrew Tomkins

Christopher Olston and many others Yahoo! Research Programming and Debugging Large-Scale Data Processing Workflows.

Similar presentations

Presentation on theme: "Christopher Olston and many others Yahoo! Research Programming and Debugging Large-Scale Data Processing Workflows."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Christopher Olston and many others Yahoo! Research Programming and Debugging Large-Scale Data Processing Workflows.

Similar presentations

Presentation on theme: "Christopher Olston and many others Yahoo! Research Programming and Debugging Large-Scale Data Processing Workflows."— Presentation transcript:

Similar presentations

About project

Feedback