Lecture 18-1 Lecture 17-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Hilfi Alkaff November 5, 2013 Lecture 21 Stream Processing.

Slides:

Advertisements

Similar presentations

Spark Streaming Large-scale near-real-time stream processing

Advertisements

Spark Streaming Large-scale near-real-time stream processing

Spark Streaming Real-time big-data processing

MapReduce Online Tyson Condie UC Berkeley Slides by Kaixiang MO

Apache Storm A scalable distributed & fault tolerant real time computation system ( Free & Open Source ) Shyam Rajendran 16-Feb-15.

Spark Streaming Large-scale near-real-time stream processing

Big Data Open Source Software and Projects ABDS in Summary XIV: Level 14B I590 Data Science Curriculum August Geoffrey Fox

MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.

MapReduce Online Veli Hasanov Fatih University.

Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)

UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Spark: Cluster Computing with Working Sets

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,

Discretized Streams: Fault-Tolerant Streaming Computation at Scale Wenting Wang 1.

Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)

Discretized Streams Fault-Tolerant Streaming Computation at Scale Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li, Timothy Hunter, Scott Shenker, Ion.

1 Large-Scale Machine Learning at Twitter Jimmy Lin and Alek Kolcz Twitter, Inc. Presented by: Yishuang Geng and Kexin Liu.

Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,

Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.

Piccolo – Paper Discussion Big Data Reading Group 9/20/2010.

Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.

CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 22: Stream Processing, Graph Processing All slides © IG.

Real-Time Stream Processing CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.

Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.

1 Fast Failure Recovery in Distributed Graph Processing Systems Yanyan Shen, Gang Chen, H.V. Jagadish, Wei Lu, Beng Chin Ooi, Bogdan Marius Tudor.

MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.

Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.

MapReduce M/R slides adapted from those of Jeff Dean’s.

MARISSA: MApReduce Implementation for Streaming Science Applications 作者 : Fadika, Z. ; Hartog, J. ; Govindaraju, M. ; Ramakrishnan, L. ; Gunter, D. ; Canon,

Spark Streaming Large-scale near-real-time stream processing

MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.

Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Optimizing Cloud MapReduce for Processing Stream Data using Pipelining 2011 UKSim 5th European Symposium on Computer Modeling and Simulation Speker : Hong-Ji.

Spark. Spark ideas expressive computing system, not limited to map-reduce model facilitate system memory – avoid saving intermediate results to disk –

Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies

Spark System Background Matei Zaharia  [June HotCloud ]  Spark: Cluster Computing with Working Sets  [April NSDI.

Part III BigData Analysis Tools (Storm) Yuan Xue

Adaptive Online Scheduling in Storm Paper by Leonardo Aniello, Roberto Baldoni, and Leonardo Querzoni Presentation by Keshav Santhanam.

Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.

Lecture 22: Stream Processing, Graph Processing

CSCI5570 Large Scale Data Processing Systems

Guangxiang Du*, Indranil Gupta

Original Slides by Nathan Twitter Shyam Nutanix

Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.

MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner

COS 518: Advanced Computer Systems Lecture 11 Michael Freedman

Ministry of Higher Education

Introduction to Spark.

Boyang Peng, Le Xu, Indranil Gupta

湖南大学-信息科学与工程学院-计算机与科学系

February 26th – Map/Reduce

Cse 344 May 4th – Map/Reduce.

CS110: Discussion about Spark

Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC

Distributed System Gang Wu Spring，2018.

Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)

Lecture 22: Stream Processing, Graph Processing

Apache Spark Lecture by: Faria Kalim (lead TA) CS425 Fall 2018 UIUC

Apache Hadoop and Spark

Fast, Interactive, Language-Integrated Cluster Computing

Streaming data processing using Spark

COS 518: Advanced Computer Systems Lecture 12 Michael Freedman

COS 518: Distributed Systems Lecture 11 Mike Freedman

Lecture 29: Distributed Systems

Presentation transcript:

Lecture 18-1 Lecture 17-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Hilfi Alkaff November 5, 2013 Lecture 21 Stream Processing  2013, H. Alkaff, I. Gupta

Lecture 17-2 Motivation Many important applications must process large streams of live data and provide results in near- real-time –Social network trends –Website statistics –Intrustion detection systems –etc. Require large clusters to handle workloads Require latencies of few seconds

Lecture 17-3 Traditional Solutions (Map-Reduce) No partial answers –Have to wait for the entire batch to finish Hard to configure when we want to add more nodes. Mapper 1 Mapper 2 Reducer 1 Reducer 2 New Mapper

Lecture 17-4 The Need for New Framework Scalable Second-scale Easy to program “Just” works –Adding/removing nodes should be transparent

Lecture 17-5 Enter Storm Open-sourced at Most watched JVM project Multiple languages API supported –Python, Ruby, Fancy Used by over 30 companies including –Twitter: For personalization, search –Flipboard: For generating custom feeds

Lecture 17-6 Storm’s Terminology Tuples Streams Spouts Bolts Topologies

Lecture 17-7 Tuples Ordered list of elements –E.g. [“Illinois”,

Lecture 17-8 Streams Unbounded sequence of tuples Tuple

Lecture 17-9 Spouts Source of streams (E.g. Web logs) Tuple

Lecture Bolts Processes tuples and create new streams Tuple

Lecture Bolts (Continued) Operations that can be performed –Filter –Joins –Apply/transform functions –Access DBs –Etc Could specify amount of parallelism in each bolt –How to decide which task in bolt to route subset of the streams to?

Lecture Streams Grouping in Bolts Shuffle Grouping –Streams are randomly distributed from to the bolt’s tasks Fields Grouping –Lets you group a stream by a subset of its fields (Example?) All Grouping –Stream is replicated across all the bolt’s tasks (Example?)

Lecture Topologies A directed graph of spouts and bolts Tuple

Lecture Word Count Example (Main)

Lecture Word Count Example (Spout)

Lecture Word Count Example (Bolt)

Lecture Word Count (Running)

Lecture Storm Cluster Master node –Runs a daemon called Nimbus –Responsible for distributing code around cluster –Assigning tasks to machines –Monitoring for failures Worker node –Runs a daemon called Supervisor –Listens for work assigned to its machines Zookeeper –Coordinates Nimbus and Supervisors communication –All state of Supervisor and Nimbus is being kept here

Lecture Storm Cluster NimbusZookeeper Supervisor Zookeeper Supervisor

Lecture Guaranteeing Message Processing When spout produces tuples, the followings are created: –Tuple for each word in the sentence –Tuple for the updated count for each word A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout

Lecture Anchoring When a word is being anchored, the spout tuple at the root of the tree will be replayed later on if the word tuple failed to be processed downstream –Might not necessarily wants this feature An output can be anchored to more than one input tuple –Failure of one tuple causes multiple tuples to be replayed

Lecture Miscellaneous for Storm’s Reliability API Emit(tuple, output) –Each word is anchored if you specify the input tuple as the first argument to emit Ack(tuple) –Acknowledge that you finish processing a tuple Fail(tuple) –Immediately fail the spout tuple at the root of tuple tree if there is an exception from the database, etc Must remember to ack/fail each tuple –Each tuple consume memory. Failure to do so results in memory leaks

Lecture What’s wrong with Storm? Mutable state can be lost due to failure! –May update mutable state twice! –Processes each record at least once Slow nodes? –No notion of time slices

Lecture Enter Spark Streaming Research project from AMPLab Group at UC Berkeley A follow-up from Spark research project –Idea: cache datasets in memory, dubbed Resilient Distributed Datasets (RDD) for repeated query –Able to perform 100 times faster than Hadoop MapReduce for iterative machine learning applications

Lecture Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs Chop up the live stream into batches of X seconds Spark treats each batch of data as RDDs and processes them using RDD operations Finally, the processed results of the RDD operations are returned in batches

Lecture Discretized Stream Processing (Continued) Run a streaming computation as a series of very small, deterministic batch jobs Chop up the live stream into batches of X seconds Spark treats each batch of data as RDDs and processes them using RDD operations Finally, the processed results of the RDD operations are returned in batches

Lecture Effects of Discretization (+) Easy to maintain consistency –Each time interval is processed independently »Especially useful when you have stragglers (slow nodes) »Not handled in Storm (+) Handle out-of-order records –Can choose to wait for a limited “slack time” –Incrementally correct late records at application level (+) Parallel recovery –Expose parallelism across both partitions of an operator and time (-) Raises minimum latency

Lecture Fault-tolerance in Spark Streaming RDDs are remember the sequence of operations that created it from the original fault-tolerant input data (Dubbed “lineage graph”) Batches of input data are replicated in memory of multiple worker nodes, therefore fault- tolerant Data lost due to worker failure, can be recomputed from input data

Lecture Fault-tolerant Stateful Processing State data not lost even if a worker node dies –Does not change the value of your result Exactly once semantics to all transformations –No double counting! Recovers from faults/stragglers within 1 sec

Lecture Comparison with Storm Higher throughput than Storm –Spark Streaming: 670k records/second/node –Storm: 115k records/second/node

Lecture More information Storm – – Spark Streaming – programming-guide.htmlhttp://spark.incubator.apache.org/docs/latest/streaming- programming-guide.html Other Streaming Frameworks –Yahoo S4 –LinkedIn Samza