Slides prepared by Samkit


The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (VLDB 2015)

What is the Dataflow Model?
- Enables massive-scale, unbounded, out-of-order stream processing
- Provides a unified model for both batch processing and stream processing
- Makes it easy to select different trade-offs among completeness, cost, and latency
- Open-source SDK for building parallelized pipelines
- Available as a fully managed service on Google Cloud Platform

Outline for Rest of the Talk
- MillWheel overview
- FlumeJava overview
- Motivating use-cases
- Dataflow basics
- Data representation and transformation
- Windowing
- Triggering and watermarks
- Refinements
- Examples

MillWheel Overview
- Framework for building low-latency data processing applications
- Provides a programming model for building complex stream processing applications without distributed-systems expertise
- Guarantees exactly-once processing
- Uses checkpointing and persistent storage to provide fault tolerance

FlumeJava Overview
- A Java library for developing efficient data-parallel pipelines of MapReduces
- Provides an abstraction over different data representations and execution strategies, in the form of parallel collections and their operations
- Supports deferred evaluation of pipeline operations
- Transforms user-constructed, modular FlumeJava code into an execution plan that can be executed efficiently

Motivating Use-cases
- Advertisement billing pipeline inside Google: requires complete data; latency and cost can be higher
- Live cost-estimate pipeline: approximate results at low latency are acceptable; low cost is necessary
- Abuse detection pipeline: requires low latency; can work with incomplete data
- Abuse detection backfill pipeline (used to pick an appropriate detection algorithm): low latency and low cost are required; partial, incomplete data is acceptable

Dataflow Model Basics
Separates the logical notion of data processing from the underlying physical implementation:
- What results are being computed
- Where in event time they are being computed
- When in processing time they are materialized
- How earlier results relate to later refinements

Dataflow Basics
Separates the logical notion of data processing from the underlying physical implementation:
- What results are being computed: data representation and transformation
- Where in event time they are being computed: windowing
- When in processing time they are materialized: watermarks and triggering
- How earlier results relate to later refinements: refinements

Data Representation and Transformation
- Provides a logical view of data as an immutable collection of same-typed elements: PCollection<T>, PCollectionTuple, PCollectionList
- Each element carries an event-time timestamp
- Data flows through a pipeline, a directed graph of transformations
- Core transformations: ParDo(), GroupByKey(), Combine(), Flatten()
- Composite transformations: Join(), Count()
- Custom transformations can be used

Core Transformations
ParDo(DoFn<T, S>)
- Supports element-wise computation over an input PCollection<T>
- DoFn<T, S> is a function object defining how to map each element of input type T to zero or more elements of output type S
- DoFns should be pure functions of their input
- Can be used to express both the map and reduce parts of MapReduce
GroupByKey()
- Similar to the shuffle step in MapReduce
- Allows specifying a sorting order for the collection of values for each key

Core Transformations (continued)
Flatten()
- Takes a list of PCollection<T>s and combines them into a single PCollection<T>
Combine()
- Combines an input collection of values into a single output value
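
The semantics of these four core transformations can be sketched with plain Java collections. The block below is an illustrative in-memory simulation, not the Dataflow SDK API; the class and method names (CoreTransforms, parDo, groupByKey, combine, flatten) are hypothetical, chosen only to mirror the descriptions above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// In-memory simulation of the four core transformations (not SDK code).
public class CoreTransforms {

    // ParDo: element-wise; each input element maps to zero or more outputs.
    static <T, S> List<S> parDo(List<T> input, Function<T, List<S>> doFn) {
        List<S> out = new ArrayList<>();
        for (T t : input) out.addAll(doFn.apply(t));
        return out;
    }

    // GroupByKey: collect all values per key (the "shuffle" of MapReduce).
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> out = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : pairs)
            out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return out;
    }

    // Combine: reduce a collection of values to a single output value.
    static int combine(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Flatten: merge several collections into one.
    @SafeVarargs
    static <T> List<T> flatten(List<T>... inputs) {
        List<T> out = new ArrayList<>();
        for (List<T> in : inputs) out.addAll(in);
        return out;
    }

    public static void main(String[] args) {
        // Word count expressed with the simulated transformations.
        List<String> words = parDo(List.of("hello world", "hello"),
                line -> Arrays.asList(line.split(" ")));
        System.out.println(words); // [hello, world, hello]

        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) pairs.add(Map.entry(w, 1));
        Map<String, List<Integer>> grouped = groupByKey(pairs);
        System.out.println(grouped); // {hello=[1, 1], world=[1]}

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + " -> " + combine(e.getValue()));
        // hello -> 2
        // world -> 1
    }
}
```

Note how ParDo followed by GroupByKey and a per-key Combine reproduces the classic MapReduce word count, which is exactly the claim made on the ParDo slide.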

Example
Pipeline pipeline = Pipeline.create(options);
pipeline.apply(TextIO.Read.from(options.getInput()))
    .apply(ParDo.of(new Function1()))
    .apply(Count.perElement())
    .apply(ParDo.of(new Function2()))
    .apply(new Format())
    .apply(TextIO.Write.to(options.getOutput()));
pipeline.run();

Windowing
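
As a rough illustration of windowing semantics, the sketch below assigns event timestamps to fixed windows and groups timestamps into session windows. It is a plain-Java simulation, not the Dataflow SDK's windowing API; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of two windowing strategies (not SDK code).
public class Windowing {

    // Fixed windows of the given size: [0, size), [size, 2*size), ...
    static long fixedWindowStart(long eventTime, long size) {
        return eventTime - (eventTime % size);
    }

    // Session windows: sort timestamps and start a new session whenever
    // the gap to the previous timestamp exceeds the session timeout.
    static List<List<Long>> sessions(List<Long> eventTimes, long gap) {
        List<Long> sorted = new ArrayList<>(eventTimes);
        Collections.sort(sorted);
        List<List<Long>> out = new ArrayList<>();
        List<Long> current = null;
        long prev = 0;
        for (long t : sorted) {
            if (current == null || t - prev > gap) {
                current = new ArrayList<>();
                out.add(current);
            }
            current.add(t);
            prev = t;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(125, 60)); // 120
        System.out.println(sessions(List.of(1L, 3L, 30L, 32L), 10L));
        // [[1, 3], [30, 32]]
    }
}
```

The key point of the model is that windowing is driven by these event timestamps, not by the processing time at which elements happen to arrive.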

Watermark
- A function f(S) -> E, where:
  - S = a point in stream processing time
  - E = the point in event time up to which input data is complete as of S
- Provides a global time-progress metric
- Problems with watermarks:
  - Sometimes too fast: late data can still arrive behind the watermark
  - Sometimes too slow: a single slow record holds the watermark back
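
The f(S) -> E definition can be made concrete with a toy low-watermark tracker: the watermark is the minimum event timestamp among records that have arrived but are not yet fully processed, and everything earlier in event time is considered complete. This is an illustrative sketch, not MillWheel's implementation; the class and method names are hypothetical.

```java
import java.util.PriorityQueue;

// Toy low-watermark tracker (illustrative only).
public class WatermarkTracker {
    private final PriorityQueue<Long> pending = new PriorityQueue<>();

    void arrive(long eventTime) { pending.add(eventTime); }

    void completeOldest() { pending.poll(); } // oldest pending record is done

    // f(S) -> E: with no pending work, input is complete up to "now".
    long watermark(long now) {
        return pending.isEmpty() ? now : pending.peek();
    }

    public static void main(String[] args) {
        WatermarkTracker wm = new WatermarkTracker();
        wm.arrive(12);
        wm.arrive(7);
        System.out.println(wm.watermark(20)); // 7: event time 7 still pending
        wm.completeOldest();
        System.out.println(wm.watermark(20)); // 12
    }
}
```

This also shows why the watermark can be "too slow": as long as the record at event time 7 is unprocessed, the watermark cannot advance past it, no matter how far processing time has progressed.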

Triggering and Watermarks
- Triggers control when results are emitted
- Triggers can fire:
  - at completeness estimates (e.g., the watermark)
  - at points in processing time
  - in response to data arrival
- Triggers can be composed with loops, sequences, and other constructs
- Users can define their own triggers from these primitives

Triggering

Content Refinements
How should outputs per window accumulate?

Firing         | Elements | Discarding | Accumulating | Accumulating & Retracting
Speculative    | [3]      | 3          | 3            | 3
Watermark      | [5, 1]   | 6          | 9            | 9, -3
Late           | [2]      | 2          | 11           | 11, -9
Total observed |          | 11         | 23           | 11
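
The discarding and accumulating behavior above can be reproduced with a small simulation: one window whose trigger fires three times, with a speculative pane after [3], a watermark pane after [5, 1] arrive, and a late pane after [2]. This is an illustrative sketch with hypothetical names, not SDK code.

```java
import java.util.ArrayList;
import java.util.List;

// Simulates per-pane output under discarding vs. accumulating modes.
public class Refinements {

    static List<Integer> panes(List<List<Integer>> firings, boolean accumulating) {
        List<Integer> emitted = new ArrayList<>();
        int runningTotal = 0;
        for (List<Integer> batch : firings) {
            int paneSum = 0;
            for (int v : batch) paneSum += v;
            runningTotal += paneSum;
            // Discarding emits only this pane's contribution;
            // accumulating emits the running total so far.
            emitted.add(accumulating ? runningTotal : paneSum);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<List<Integer>> firings =
                List.of(List.of(3), List.of(5, 1), List.of(2));
        System.out.println(panes(firings, false)); // discarding:   [3, 6, 2]
        System.out.println(panes(firings, true));  // accumulating: [3, 9, 11]
        // Accumulating & retracting would additionally emit -3 and then -9
        // to cancel the previous panes, keeping downstream sums correct.
    }
}
```

Which mode is right depends on the consumer: discarding suits downstream aggregators that sum panes themselves, while accumulating suits consumers that simply overwrite the previous result.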

Examples: Different Combinations of What, Where, When, and How
Example input

Classic Batch Execution

GlobalWindows, AtPeriod, Accumulating
FixedWindows, Batch

FixedWindows, Streaming

FixedWindows, Streaming, Partial

Sessions, Retracting

Thank You