StreamScope, S-Store
Akshun Gupta, Karthik Bala

What is Stream Processing?
- "Stream processing is designed to analyze and act on real-time streaming data, using continuous queries." (InfoQ, Stream Processing)
- Difference from batch processing: the "ability to process potentially infinite input events continuously with delays in seconds and minutes, rather than processing a static data set in hours and days." (StreamScope paper)

Applications of Stream Processing
- Twitter uses stream processing to show trending tweets
- Algorithmic trading / high-frequency trading
- Surveillance using sensors
- Real-time analytics
- And many more!

Stream Processing: Challenges
- Continuous, infinite amounts of data
- Need to deal with failures and planned maintenance
- Latency sensitive
- Need for high throughput
All of this makes stream applications hard to develop, debug, and deploy!

StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
Microsoft Research
Presented by Akshun Gupta

StreamScope - General Information
- Paper came out of Microsoft Research
- Has been deployed on a shared 20,000-server production cluster at Microsoft
- Runs Microsoft's core online advertisement service
- Created to handle business-critical applications and to give strong guarantees

Motivation
Want to design a streaming computation engine that can:
- Execute each event exactly once despite server failures and message loss
- Handle large amounts of load
- Scale well
- Travel back in time
- Continue operation during maintenance
- Make distributed streaming programming easy

Key Contributions
- StreamScope shows that a streaming computation engine does not need to unnaturally convert streaming computation into a series of mini-batch jobs (e.g., Apache Spark).
- Introduces the rVertex and rStream abstractions to simplify creating, debugging, and understanding stream computation engines.
- A proven system: deployed in production, running business-critical applications while coping with failures and load variations.

StreamS Abstractions - DAG
- The execution of a program is modeled as a DAG.
- Each vertex performs local computation and has input streams and output streams.
- Each stream is modeled as an infinite sequence of events, each with a continuously increasing sequence number.
- StreamScope determines the degree of parallelism for each stage based on data rate and computation cost.
- The execution of a vertex is tracked through a series of snapshots, where each snapshot is a triplet containing the current sequence numbers of its input streams, the current sequence numbers of its output streams, and the vertex's current state.
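A minimal sketch of what such a snapshot might look like as a data structure. The field names are illustrative assumptions; the paper only specifies the triplet of input sequence numbers, output sequence numbers, and vertex state.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass(frozen=True)
class Snapshot:
    """One point in a vertex's execution history (illustrative layout)."""
    input_seqs: Dict[str, int]   # input stream name -> last consumed sequence number
    output_seqs: Dict[str, int]  # output stream name -> last produced sequence number
    state: Any                   # opaque vertex/user state at this point
```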

StreamS Abstractions - rStream
- An abstraction that decouples upstream and downstream vertices and supports the failure recovery mechanisms.
- Maintains a sequence of events and their sequence numbers.
- Provides the API calls Write, Read, and GarbageCollect.
- Maintains the following properties:
  - Uniqueness: a unique value for each sequence number
  - Validity: if a Read happens for seq, a Write for seq is guaranteed to have happened
  - Reliability: for any Write(seq, e), Read(seq) will return e
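A toy, single-process sketch of the rStream API and its three properties. The real rStream is distributed and keeps events available across failures, which this dict-backed version does not attempt.

```python
class RStream:
    """Toy sketch of rStream; not the distributed implementation from the paper."""

    def __init__(self):
        self._events = {}  # sequence number -> event

    def write(self, seq, event):
        # Uniqueness: at most one value may ever be bound to a sequence number.
        if seq in self._events and self._events[seq] != event:
            raise ValueError(f"conflicting write for sequence number {seq}")
        self._events[seq] = event

    def read(self, seq):
        # Validity: a Read(seq) is only answered after a Write for seq has happened.
        if seq not in self._events:
            raise KeyError(f"no write has happened yet for sequence number {seq}")
        # Reliability: return exactly the event that was written for seq.
        return self._events[seq]

    def garbage_collect(self, up_to_seq):
        # Discard events that no downstream vertex will ever need to read again.
        self._events = {s: e for s, e in self._events.items() if s > up_to_seq}
```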

StreamS Abstractions - rVertex
- A vertex can save its state with snapshots.
- If a vertex fails, it can be restarted with Load(s), where s is a saved snapshot.
- rVertex guarantees determinism: running Execute() on the same snapshot will produce the same result.
- Determinism ensures correctness, but requires user-defined functions to behave deterministically.
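A skeleton of the rVertex idea, reusing the Snapshot and RStream sketches above. The method names mirror the paper's Load/Execute, but the body is an illustrative guess, not StreamScope's actual code.

```python
class RVertex:
    """Illustrative rVertex: deterministic local computation whose progress is
    captured by snapshots and can be resumed from any saved snapshot."""

    def __init__(self, compute_fn, inputs, outputs, snapshot):
        self.compute_fn = compute_fn  # user-defined function; must be deterministic
        self.inputs = inputs          # dict: stream name -> RStream
        self.outputs = outputs        # dict: stream name -> RStream
        self.snapshot = snapshot      # current Snapshot

    def load(self, snapshot):
        # Restart (possibly on another machine) from a saved snapshot.
        self.snapshot = snapshot

    def execute(self):
        # Consume the next event on each input, apply the deterministic user
        # function, and append the results to the output streams. Re-running
        # execute() from the same snapshot regenerates identical outputs,
        # which is what makes replay and replication safe.
        in_seqs = dict(self.snapshot.input_seqs)
        events = {name: stream.read(in_seqs[name] + 1)
                  for name, stream in self.inputs.items()}
        new_state, out_events = self.compute_fn(self.snapshot.state, events)
        out_seqs = dict(self.snapshot.output_seqs)
        for name, event in out_events.items():
            out_seqs[name] += 1
            self.outputs[name].write(out_seqs[name], event)
        self.snapshot = Snapshot({k: v + 1 for k, v in in_seqs.items()},
                                 out_seqs, new_state)
```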

Architecture
The program is compiled into a streaming DAG in three steps:
1. The program is first converted into a logical plan (DAG) of StreamScope runtime operators, which include temporal joins, window aggregates, and user-defined functions.
2. The optimizer then evaluates various plans, choosing the one with the lowest estimated cost based on available resources, data statistics such as the incoming rate, and an internal cost model.
3. A physical plan (DAG) is finally created by mapping each logical vertex to an appropriate number of physical vertices for parallel execution and scaling, with code generated for each vertex to be deployed in a cluster and to process input events continuously at runtime.
A job manager is responsible for: (1) scheduling vertices and establishing the channels (edges) of the DAG across machines; (2) monitoring progress and tracking snapshots; (3) providing fault tolerance by detecting failures/stragglers and initiating recovery actions.
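A rough sketch of step 3, expanding a logical vertex into several physical vertices. The sizing rule (incoming rate divided by an assumed per-vertex capacity) is only a guess at the kind of heuristic involved, not the paper's actual cost model.

```python
import math

def expand_logical_vertex(name, incoming_events_per_sec,
                          events_per_sec_per_vertex=50_000):
    """Pick a degree of parallelism for one logical vertex and name its
    physical instances. The capacity constant is an illustrative assumption."""
    dop = max(1, math.ceil(incoming_events_per_sec / events_per_sec_per_vertex))
    return [f"{name}[{i}]" for i in range(dop)]

# For example, a join stage seeing 240k events/s becomes 5 physical vertices:
print(expand_logical_vertex("TemporalJoin", 240_000))
```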

Failure Recovery Strategies
- Checkpoint-based recovery
  - Not performant when vertices hold large internal state
- Replay-based recovery
  - Rebuilds state by replaying the most recent window (e.g., 5 minutes)
  - The deterministic-execution property comes in handy
  - Might have to reload a large window, but doesn't have to checkpoint as frequently
- Replication-based recovery
  - Multiple instances of the same vertex run at the same time
  - Determinism ensures the outputs of different machines running the same vertex are the same
  - Overhead of extra resources
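A condensed sketch of the first two strategies, reusing the RVertex sketch above; checkpoint_store and the replay loop are invented names for illustration.

```python
def recover_from_checkpoint(vertex, checkpoint_store, vertex_id):
    # Load the most recent persisted snapshot and resume from there.
    # Cheap to replay, but checkpointing large internal state is expensive.
    vertex.load(checkpoint_store.latest(vertex_id))

def recover_by_replay(vertex, window_start_snapshot, num_events_in_window):
    # Restart from a snapshot at the start of the recent window and re-execute
    # its events; determinism guarantees the rebuilt state and regenerated
    # outputs match the originals.
    vertex.load(window_start_snapshot)
    for _ in range(num_events_in_window):
        vertex.execute()
```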

Evaluation
- Workload: detecting fraud clicks in online transactions
- 3,220 vertices
- 9.5 TB of events processed
- 180 TB of I/O
- 21.3 TB aggregate memory usage
- 7-day evaluation period

Evaluation - Failure Impact on Latency*
- A: Failed machines held large in-memory state → latency increased even for a small number of failures
- B: Large number of failures, but the affected vertices did not hold large in-memory state
- C: Unscheduled mass outage of machines → significant increase in latency
- D: Scheduled maintenance → graceful transition, no significant increase in latency
*End-to-end latency

Evaluation - Scalability
- X axis: degree of parallelism
- Y axis: maximum throughput sustained under a 1-second latency bound

Comparing Failure Recovery Strategies
- No effect on latency when using the replication strategy
- Longer latency delay for replay, because the state in a checkpoint is more condensed (the common case)
- In production, roughly 25% of cases use replay-based recovery, while the others use checkpointing

Comments
- The paper does not compare StreamScope with other streaming systems such as Spark or Storm.
- No outlook is given on whether this system will be offered as PaaS, or on any plan to open-source it.
- The restriction to deterministic applications is significant.

Key Takeaways
- Introduction of the rStream and rVertex abstractions
  - A new way to design streaming systems
  - Decoupling of upstream and downstream vertices
- Valuable engineering advice
  - Good comparison of the failover strategies: checkpointing, replay-based, replication-based
- Proven system under production load
  - Business-critical application
  - 20k+ nodes used
  - Scaling is robust

S-Store
Presented by Karthik Bala

Streaming Meets Transaction Processing
- Streaming: handles large amounts of data, but...
- Transaction processing: ACID guarantees, but...
- Challenge: build a streaming system that provides shared mutable state

Guarantees
- Transactions are stored procedures with input parameters.
- "Recall that it is the data that is sent to the query in streaming systems, in contrast to the standard DBMS model of sending the query to the data."
- OLTP transaction: can access public tables; "pull based"
- Streaming transaction: can access public tables, windows, and streams; "push based"

Contributions
Start with a traditional OLTP database system (H-Store) and add streaming transactions:
- Streams and windows represented as time-varying state
- Triggers to enable push-based processing over such state
- A streaming scheduler that ensures correct transaction ordering
- A variant of H-Store's recovery scheme that ensures exactly-once processing for streams

Transaction Execution
- Notation: s = stream, b = atomic batch, w = window, T = transaction
- Atomic batches must be processed as individual units
- Windows can be time-based or tuple-based
- Difference (stream vs. window): external vs. internal
- Border vs. interior transactions: one transaction's output is the input of the next

Transaction Execution
- ACID: wait until T commits to make its writes public
- Valid orderings? For an ordering to be correct:
  - It must follow the topological ordering of the dataflow graph (relaxed if the graph admits multiple orderings)
  - All batches must be processed in order
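A small sketch of checking the two ordering conditions above for a schedule of streaming transaction executions. The schedule encoding (a list of (transaction, batch_id) pairs) and the graph representation are assumptions made for illustration, not S-Store's scheduler.

```python
def is_valid_schedule(schedule, dataflow_edges):
    """schedule: list of (transaction_name, batch_id) in execution order.
    dataflow_edges: set of (upstream_txn, downstream_txn) pairs."""
    # Each transaction must see its batches in order.
    last_batch = {}
    for txn, batch in schedule:
        if batch < last_batch.get(txn, batch):
            return False
        last_batch[txn] = batch
    # A downstream transaction may only process batch b after its upstream
    # transaction has processed batch b (topological ordering per batch).
    seen = set()
    for txn, batch in schedule:
        for up, down in dataflow_edges:
            if down == txn and (up, batch) not in seen:
                return False
        seen.add((txn, batch))
    return True

# e.g. is_valid_schedule([("T1", 1), ("T2", 1), ("T1", 2), ("T2", 2)], {("T1", "T2")}) -> True
```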

Hybrid Schedules, Nested Transactions
- Any OLTP transaction can interleave between any pair of streaming transactions (in a valid TE schedule)
- Nested transactions: two or more transactions that execute as a single block
- No transaction can interleave between nested transactions

H-Store Architecture
- Commit log and checkpointing layers

S-Store Extensions
- Streams: time-varying H-Store tables; persistent and recoverable
- Triggers: attached to tables, activate when tuples are added; PE (partition engine) and EE (execution engine) triggers
- Window tables
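A toy Python-flavored sketch of the trigger idea (the real S-Store triggers live inside H-Store, not behind an API like this): an EE trigger fires inside the execution engine when tuples land in a stream table, while a PE trigger pushes the next transaction in the dataflow without a client round trip.

```python
class StreamTable:
    """Toy stand-in for an S-Store stream (a time-varying H-Store table)."""

    def __init__(self, name):
        self.name = name
        self.rows = []
        self.ee_triggers = []  # execution-engine triggers: fire per inserted tuple
        self.pe_triggers = []  # partition-engine triggers: enqueue the next transaction

    def insert(self, batch):
        self.rows.extend(batch)
        for trig in self.ee_triggers:
            for row in batch:
                trig(row)      # e.g. copy/filter the tuple into another table
        for trig in self.pe_triggers:
            trig(batch)        # e.g. schedule the downstream stored procedure
```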

Fault Tolerance
- Goal: exactly-once processing
- Even if a failure happens, state must be as if transaction T occurred exactly once!
- Weak recovery: correct but nondeterministic results

Recovery
- Strong recovery
  - Use H-Store's commit log from the latest snapshot, with PE triggers disabled (why? so that logged downstream transactions are not triggered and executed a second time)
- Weak recovery
  - Apply the snapshot
  - Start at the inputs of the dataflow graph (which are cached)
  - Leave PE triggers as is: interior transactions that were not logged need to be re-executed
  - Finally, replay the log
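A schematic sketch of the two recovery modes described above. The helper names (apply_snapshot, disable_pe_triggers, replay, entries_after, etc.) are invented for illustration and do not correspond to actual H-Store/S-Store code.

```python
def strong_recovery(db, commit_log, snapshot):
    db.apply_snapshot(snapshot)
    db.disable_pe_triggers()              # don't re-trigger transactions already in the log
    for txn in commit_log.entries_after(snapshot):
        db.replay(txn)                    # re-execute exactly the logged schedule
    db.enable_pe_triggers()

def weak_recovery(db, commit_log, snapshot):
    db.apply_snapshot(snapshot)
    # PE triggers stay enabled: replaying the logged transactions from the cached
    # dataflow inputs re-triggers the interior transactions that were never
    # logged, yielding a correct (though possibly different) ordering.
    for txn in commit_log.entries_after(snapshot):
        db.replay(txn)
```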

Performance and Evaluation
- H-Store alone: fast, but incorrect!
- Making H-Store correct makes it slow
- Esper and Storm (streaming systems) do better, but are bottlenecked by database access (a full round-trip wait): no push semantics, and only a single transaction request at a time

Performance and Evaluation (2)
- EE triggers: the bottleneck is round-trip time
- PE triggers: the bottleneck in H-Store is serialization (only one request at a time) plus additional round trips (all the way to the client!)

Performance and Evaluation (3) SP = stored procedure

Key Takeaways
- Ordering
- Push-based processing (triggers!)
- Weak vs. strong recovery
- ACID guarantees

Discussion
- S-Store: what about more than one node?!
- Are S-Store's evaluation methods okay?
- The implementation of the different failure strategies for each vertex is not given in the paper.
- No details on how the optimizer works: how does it know the cost of running the application before deploying it?
- Job manager fault tolerance is not discussed in the paper; if it is not replicated, it is a single point of failure.
- Lack of custom DAG creation: probably because they have optimized for their own workloads and applications.