MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle (Google)


 Motivation and Requirements
 Introduction
 High-Level Overview of the System
 Fundamental Abstractions of the MillWheel Model
 API
 Fault Tolerance
 System Implementation
 Experimental Results and Related Work

 Record counts over one-second intervals (buckets) are compared to the expected traffic that the model predicts
 A consistent mismatch over n windows indicates that a query is spiking or dipping
 The model is updated with newly received data

 Requires both short-term and long-term storage
 Needs duplicate prevention, as duplicate record deliveries could cause spurious spikes
 Should distinguish whether expected data is merely delayed or actually absent
◦ MillWheel uses the low watermark for this

 Real-time processing of data
 Persistent state abstractions for user code
 Handling of out-of-order data
 Constant latency as the system scales to more machines
 Guaranteed exactly-once delivery of records

 A framework for building low-latency data-processing applications
 The system manages:
◦ Persistent state
◦ Continuous flow of records
◦ Fault-tolerance guarantees
 Provides a notion of logical time

 Provides fault tolerance at the framework level
◦ Correctness is ensured in the case of failure
◦ Records are handled in an idempotent fashion
 Ensures exactly-once delivery of records from the user's perspective
 Checkpointing is at fine granularity
◦ Eliminates buffering of pending data for long periods between checkpoints


 At a high level, MillWheel is a graph of computation nodes
◦ Users specify:
 A directed computation graph
 Application code for individual nodes
◦ Each node takes input and produces output
◦ Computations are also called transformations
 Transformations are parallelized
◦ Users are not concerned with fine-grained load balancing

 Users can add and remove computations dynamically
 All internal updates are atomically checkpointed per key
◦ User code can access a per-key, per-computation persistent store
◦ This allows powerful per-key aggregations
◦ Backed by a replicated, highly available data store (e.g. Spanner)

 Inputs and outputs in MillWheel are represented as (key, value, timestamp) triples
◦ key is a metadata field with semantic meaning
◦ value is an arbitrary byte string
◦ timestamp can be an arbitrary value
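The triple above can be sketched in Python as a small record type. This is illustrative only: the real MillWheel API is C++, and the field values here (a query string as the key, a timestamp value) are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str        # metadata field with semantic meaning (e.g. a query)
    value: bytes    # arbitrary byte-string payload
    timestamp: int  # arbitrary value, often near wall-clock time

r = Record(key="weather", value=b"payload", timestamp=1372000000)
```

All framework operations (state access, timers, watermarks) are scoped by the key of such a record.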


 Computations hold the application logic
 Code is invoked upon receipt of input data
 Code operates in the context of a single key

 An abstraction for aggregation and comparison between different records (similar to MapReduce)
 A key-extraction function (specified by the consumer) assigns a key to each record
 Computation code runs in the context of a specific key and accesses state for that key only
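A minimal sketch of consumer-specified key extraction, assuming a hypothetical record shape (a dict with a `query` field); the function name and record layout are invented for illustration:

```python
# Hypothetical key-extraction function supplied by a stream's consumer:
# records are keyed by their query string, so all records for the same
# query share per-key state in the consuming computation.
def extract_key(record: dict) -> str:
    return record["query"]

stream = [
    {"query": "weather", "ts": 1},
    {"query": "news", "ts": 2},
    {"query": "weather", "ts": 3},
]
keys = [extract_key(r) for r in stream]
```

Different consumers of the same stream can specify different key extractors, grouping the same records in different ways.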

 Provides a bound on the timestamps of future records
 The low watermark of A is
◦ min(oldest work of A, low watermark of C)
 where the oldest work of A is the timestamp of the oldest unfinished record in A
 and C produces output that A consumes


 Timers are per-key programmatic hooks that trigger at a specific wall-clock time or low-watermark value
 Timers are journaled in persistent state and survive process restarts and machine failures
 When fired, a timer runs the specified user function with the same exactly-once guarantees
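An illustrative in-memory model of low-watermark timers (in MillWheel the pending map would be journaled to persistent state; all names here are invented):

```python
pending = {}   # (key, fire_at) -> callback; journaled persistently in MillWheel
events = []

def set_timer(key, fire_at, callback):
    pending[(key, fire_at)] = callback

def advance_low_watermark(lw):
    # fire every timer whose timestamp the low watermark has passed, in order
    due = sorted([kf for kf in pending if kf[1] <= lw], key=lambda kf: kf[1])
    for key, fire_at in due:
        pending.pop((key, fire_at))(key)

set_timer("weather", 10, lambda k: events.append(k))
advance_low_watermark(5)    # watermark still below 10: timer does not fire
advance_low_watermark(12)   # watermark passed 10: timer fires exactly once
```

Because firing is driven by the low watermark rather than wall time, a timer only fires once the system can bound the timestamps of all future input.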


 Users implement a custom subclass of the Computation class
◦ It provides methods for accessing MillWheel abstractions
◦ The ProcessRecord and ProcessTimer hooks provide the two main entry points into user code
◦ Hooks are triggered in reaction to record receipt and timer expiration
 Per-key serialization is handled at the framework level
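A hedged Python sketch of the API shape. The real API is C++; only the Computation class and the two hook names come from the slides, and everything else (the counting logic, the dict standing in for per-key persistent state) is illustrative:

```python
class Computation:
    def process_record(self, record):
        raise NotImplementedError

    def process_timer(self, timer):
        raise NotImplementedError

class Counter(Computation):
    """Counts records per key; the dict stands in for per-key persistent state."""
    def __init__(self):
        self.counts = {}

    def process_record(self, record):
        key, _value, _ts = record
        self.counts[key] = self.counts.get(key, 0) + 1

c = Counter()
c.process_record(("weather", b"", 1))
c.process_record(("weather", b"", 2))
c.process_record(("news", b"", 3))
```

Because the framework serializes all work for a given key, the hook body can read and write that key's state without any user-level locking.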


 Each computation calculates a low-watermark value over all of its pending work
◦ Users rarely deal with low watermarks directly
◦ They manipulate them indirectly by assigning timestamps to records
 Injectors bring in external data and seed low-watermark values for the rest of the pipeline
◦ If an injector is distributed across multiple processes, the minimum watermark among all processes is used
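The distributed-injector rule is just a minimum over per-process values (process names and numbers below are illustrative):

```python
# Each injector process reports its own low watermark; the value seeded
# into the pipeline must be the minimum, since any process might still
# hold the oldest unsent data.
process_watermarks = {"injector-0": 105, "injector-1": 98, "injector-2": 110}
seed = min(process_watermarks.values())
```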

 Exactly-once delivery
◦ The steps performed on receipt of an input record for a computation are:
 The record is checked against deduplication data
 User code is run for the input
 Pending changes are committed to the backing store
 The sender is ACKed
 Pending productions are sent (retried until they are ACKed)
◦ The system assigns unique IDs to all records at production time; these IDs are stored so that duplicate records can be identified during retries
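The receipt steps above can be sketched as a single function. This is a toy model: the set of seen IDs and the state dict stand in for what MillWheel persists atomically in the backing store, and all names are invented:

```python
seen_ids = set()          # stands in for record IDs persisted with state
state = {"count": 0}

def on_receive(record_id, record, ack, produce):
    if record_id in seen_ids:      # 1. duplicate check
        ack()                      #    still ACK so the sender stops retrying
        return
    state["count"] += 1            # 2. run user code
    seen_ids.add(record_id)        # 3. commit changes + record ID atomically
    ack()                          # 4. ACK the sender
    produce(record)                # 5. send pending productions (retried)

acks, outputs = [], []
on_receive(1, "a", lambda: acks.append(1), outputs.append)
on_receive(1, "a", lambda: acks.append(1), outputs.append)  # sender retry
```

The retried delivery is ACKed but produces no second state update and no second output, which is the exactly-once guarantee from the user's perspective.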

 Strong productions
◦ Produced records are checkpointed before delivery
 Checkpointing is done in the same atomic write as the state modification
 If a process restarts, checkpoints are scanned into memory and replayed
 Checkpoint data is deleted once productions are ACKed
 Together, exactly-once delivery and strong productions make user logic appear idempotent
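A toy model of the strong-production lifecycle (the lists stand in for durable checkpoint storage and the downstream computation; on a real restart the surviving checkpoint entries would be replayed):

```python
checkpoint = []   # stands in for durable checkpoint storage
delivered = []    # stands in for the downstream computation

def produce_strongly(records):
    checkpoint.extend(records)       # persisted before any delivery attempt
    while checkpoint:
        r = checkpoint[0]
        delivered.append(r)          # deliver downstream (retried on failure)
        checkpoint.pop(0)            # delete the entry once the ACK arrives

produce_strongly(["r1", "r2"])
```

If the process died between the checkpoint write and delivery, the entries would still be in `checkpoint` and would be replayed, so downstream never silently loses a production.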

 Some computations may already be idempotent
◦ For these, strong productions and/or exactly-once delivery can be disabled
 Weak productions
◦ Downstream deliveries are broadcast optimistically, prior to persisting state
◦ Each stage waits for the downstream ACKs of its records
◦ The completion time of consecutive stages grows, so the chance of experiencing a failure increases

To overcome this, a small fraction of productions is checkpointed, allowing those stages to ACK their senders early. This selective checkpointing both improves end-to-end latency and reduces overall resource consumption.
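A back-of-the-envelope model (my own, not from the slides) of why waiting through many weak-production stages hurts: if each stage independently fails with probability p while the chain waits for ACKs, the chance that some stage must redo work grows with pipeline depth.

```python
def p_any_failure(p, n_stages):
    """Probability that at least one of n independent stages fails."""
    return 1 - (1 - p) ** n_stages

shallow = p_any_failure(0.01, 1)    # one stage waiting
deep = p_any_failure(0.01, 10)      # ten chained stages waiting
```

Checkpointing even a few intermediate stages cuts the chain into shorter independent segments, which is the intuition behind the latency and resource wins.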

 The following user-visible guarantees must be satisfied:
◦ No data loss
◦ Updates must obey exactly-once semantics
◦ All persisted data must be consistent
◦ Low watermarks must reflect all pending state in the system
◦ Timers must fire in order for a given key

 To avoid inconsistencies in persistent state
◦ Per-key updates are wrapped in a single atomic operation
 To avoid stale writes left over in the network
◦ A sequencer is attached to each write
◦ The backing store's mediator checks the sequencer before allowing a write
◦ A new worker invalidates any extant sequencers
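A sketch of the sequencer check the mediator performs to reject delayed writes from superseded ("zombie") workers; the single-cell store and integer sequencer are simplifications for illustration:

```python
store = {"value": None, "sequencer": 0}

def write(value, sequencer):
    if sequencer < store["sequencer"]:
        return False                 # stale network remnant: rejected
    store["value"] = value
    store["sequencer"] = sequencer
    return True

def promote_new_worker():
    store["sequencer"] += 1          # invalidates any extant sequencer

ok_before = write("a", 0)            # current worker's write: accepted
promote_new_worker()
ok_after = write("late", 0)          # old worker's delayed write: rejected
```

This is why a zombie process whose work was reassigned cannot corrupt state, even if its in-flight writes arrive after the handoff.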


 MillWheel is a distributed system that runs on a dynamic set of host servers
 Each computation in a pipeline runs on one or more machines
◦ Streams are delivered via RPC
 On each machine, the MillWheel system:
◦ Marshals incoming work
◦ Manages process-level metadata
◦ Delegates data to the appropriate computation

 Load distribution and balancing are handled by a master
◦ Each computation is divided into a set of lexicographic key intervals
◦ Intervals are assigned to a set of machines
◦ Depending on load, intervals can be merged or split
 Low watermarks
◦ A central authority tracks all low-watermark values in the system and journals them to persistent state
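A sketch of lexicographic key intervals: the master assigns half-open key ranges to machines and can split them as load changes. Interval boundaries, machine names, and the split helper are all invented for illustration:

```python
# Half-open intervals [lo, hi) over the key space, mapped to machines.
intervals = {("", "m"): "machine-1", ("m", "\uffff"): "machine-2"}

def machine_for(key):
    for (lo, hi), machine in intervals.items():
        if lo <= key < hi:
            return machine

def split(lo, hi, mid, new_machine):
    """Split [lo, hi) at mid, giving the upper half to a new machine."""
    old = intervals.pop((lo, hi))
    intervals[(lo, mid)] = old
    intervals[(mid, hi)] = new_machine

before = machine_for("apple")        # falls in ["", "m")
split("", "m", "f", "machine-3")     # hot range split under load
after = machine_for("goose")         # now falls in ["f", "m")
```

Because intervals are contiguous key ranges rather than hash buckets, merging and splitting only touch the boundaries of the affected ranges.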

 In-memory data structures are used to store aggregated timestamps
 Consumer computations subscribe to low-watermark updates from all of their senders
◦ Each uses the minimum of all those values
◦ The central authority's low-watermark values are at least as conservative as those of the workers

 Latency distribution for records when running over 200 CPUs
 With strong productions and exactly-once delivery disabled, the median record delay is 3.6 milliseconds and the 95th-percentile latency is 30 milliseconds
 With both enabled, the median latency jumps to 33.7 milliseconds

 Median latency stays roughly constant regardless of system size
 The 99th-percentile latency does get significantly worse as the system grows

 A simple three-stage MillWheel pipeline running on 200 CPUs
 Each computation's low-watermark value was polled once per second

 Increasing the available cache improves CPU usage roughly linearly (after 550 MB most data is cached, so further increases did not help)