Benchmarking Modern Distributed Stream Processing Systems



1 Benchmarking Modern Distributed Stream Processing Systems with Customizable Synthetic Workloads
Srujun Thanmay Gupta | B.S. Computer Engineering
Advisers: Prof. Indranil Gupta, Shadi A. Noghabi

2 Outline
Stream Processing Overview
Modern Frameworks
Why Benchmark?
Related Work
Finch: Design
Finch: Implementation & Details
Experiments

3 Stream Processing Overview
What is Stream Processing?
The 4 V's: Variety, Velocity, Veracity, Volume
Velocity: data is created so fast that it is infeasible to collect and store everything; it must be analyzed on the spot.
Why Stream Processing?
Live analysis of data for live results
Continuous computations
More flexible target applications

4 Example Stream Processing Applications
Computing trending topics or hashtags: based on factors like locality, time, etc.
Ad tracking:
click-through rate: how many users buy a product after clicking
ad curation: target ads to users based on their interactions
IoT sensor analysis:
detect anomalies
live machine learning
Common theme: process and analyze real-time events

5 Stream Processing Abstraction
Next: configuring stream dataflows

6 Modern Stream Processing Frameworks

7 Why Benchmark?
Stream processing has become immensely popular in recent years.
Many different frameworks now exist, each with its own design trade-offs and intricacies.
System administrators and developers must read through lengthy design documents to understand stream processing frameworks.
Goal:
help users get a better understanding of these details in the context of the application they are trying to create
compare both system features and application performance across different stream processing systems

8 Related Work – Current Benchmarking Tools
Chintapalli et al.: ad-tracking pipeline at Yahoo using Storm, Flink, and Spark Streaming
Lopez et al.: study of parallelism in stream processing on a threat-detection dataset
StreamBench: with Databricks, a suite of 7 benchmark workloads measuring throughput and latency of Spark Streaming, Storm, and Samza
Current stream processing benchmarks:
test very specific workloads
are only application-based: they test the performance of "word count," "ad tracking," etc.
are not flexible enough to emulate a user's desired application
need to also test system features
Benchmark suites have workloads like identity, sample, grep, word count, etc. User workloads are usually more complicated and specific to their end goal.

9 Stream Processing Feature Requirements
Based on Michael Stonebraker's paper, "The 8 Requirements of Real-Time Stream Processing":
Distributed parallel computation: computation is distributed across smaller tasks that each process a subset of the data streams
Scalability: trade-offs between data throughput and processing latency as nodes are added
Resiliency and fault tolerance for high availability: recovery from individual component failures; recovery of state when resuming from failures
Flexible data model integration: integrate with different data sources and destinations (databases, filesystems, message queues, hardware sensors, etc.)
Resiliency note: a consequence of the scalability requirement is that with more components in the system, the probability of individual component failure also increases.
Need to be able to evaluate each feature individually.

10 Need a benchmarking tool that is:
generic & flexible: arbitrary workloads on any target system
tunable: customizable parameters

11 Finch: Benchmark Synthetic Workloads
Goals:
Generate arbitrary and flexible synthetic workloads without modifying the target system
Enable evaluation of both system features and application performance
Highlight trade-offs in features between different target systems
Open source

12 Finch: Design
Input Data Generation: users can define the characteristics of the data stream, such as variability in message rate, message size, and the distribution of keys
Workload Generation: combine operators with customized parameters to create streaming pipelines; define arbitrary stream dataflows without writing code (synthetic workloads)
Integration with Stream Processing Frameworks: pluggable modules that provide implementations for different streaming frameworks
Output Data Collection & Analysis: extract and store performance metrics for later analysis

13 Finch: Design

14 Finch Workload Sources
name: the name used to refer to this source
num_keys: the number of keys in the keyspace of this source's messages
key_dist: the distribution of keys used by messages produced from this source
msg_dist: the distribution of the lengths of messages produced from this source
rate_dist: the distribution of the number of messages produced per second from this source
Distributions: Uniform, Zipfian, etc.
Keys come from the message key-value pairs; hashing keys across partitions achieves parallelism in the operators.
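As an illustration, these source parameters can be realized with a small sampler. The `make_source` helper below is a hypothetical sketch, not part of Finch: it draws keys from a Zipfian distribution and message sizes from a uniform integer distribution, matching the parameter names above.

```python
import random
from collections import Counter

def make_source(num_keys, exponent, msg_lower, msg_upper, seed=0):
    """Return a callable producing (key, size) events mimicking a Finch source:
    keys follow a Zipfian distribution, sizes a uniform integer distribution."""
    rng = random.Random(seed)
    # Zipfian weights: P(rank k) is proportional to 1 / k^exponent
    weights = [1.0 / (k ** exponent) for k in range(1, num_keys + 1)]
    keys = list(range(num_keys))
    def next_event():
        key = rng.choices(keys, weights=weights, k=1)[0]
        size = rng.randint(msg_lower, msg_upper)
        return key, size
    return next_event

# Parameters from the sample workload: 10 keys, exponent 1.2, sizes in [100, 1000]
next_event = make_source(num_keys=10, exponent=1.2, msg_lower=100, msg_upper=1000)
events = [next_event() for _ in range(10000)]
counts = Counter(key for key, _ in events)  # key 0 dominates the skewed keyspace
```

Skewed key distributions like this are what stress per-key state and partition balance in the target system.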

15 Finch Workload Operators (Stateless)
filter: remove a fraction of messages from the stream (p: drop probability)
split: send the input stream to multiple output streams (n: number of output streams)
modify (map/flatmap): apply a function to the message (size_ratio: how the message size is modified; rate_ratio: how many messages are emitted)
merge: combine messages from multiple input streams
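A minimal Python sketch of the stateless operators as generator functions; the names and signatures here are illustrative, not Finch's actual API:

```python
import random

def filter_op(stream, p, seed=0):
    """Drop each message with probability p (the `filter` operator)."""
    rng = random.Random(seed)
    for msg in stream:
        if rng.random() >= p:
            yield msg

def modify_op(stream, size_ratio):
    """Scale each message's payload size by size_ratio (the `modify` operator)."""
    for key, size in stream:
        yield key, int(size * size_ratio)

def merge_op(*streams):
    """Combine messages from multiple input streams (a sequential `merge`)."""
    for stream in streams:
        yield from stream

# filter followed by size-modify, as in the stateless experimental workload
source = [(k % 3, 100) for k in range(1000)]
out = list(modify_op(filter_op(source, p=0.5), size_ratio=0.5))
```

Composing operators as chained generators mirrors how a synthetic pipeline strings source, transformations, and sink together.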

16 Finch Workload Operators (Stateful)
join: combine messages based on matching keys (ttl: how long to persist state until a match is found)
window: apply a reduce (fold left) function over a time interval
tumbling: non-overlapping, contiguous time intervals
session: intervals delimited by gaps of inactivity
Window parameters: duration: the window's time length
State: any data that is persisted across messages and can be updated by the operators, e.g. active users on a webpage to track ads, or the last credit card per user transaction to identify fraud
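For instance, a tumbling window that folds values per key can be sketched in a few lines; `tumbling_window` below is an illustrative stand-in, not Finch's operator:

```python
from collections import defaultdict

def tumbling_window(events, duration):
    """Fold (timestamp, key, value) events per key into non-overlapping
    windows of `duration` seconds, keyed by window index."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key, value in events:
        # int(ts // duration) assigns each event to exactly one window
        windows[int(ts // duration)][key] += value
    return {w: dict(per_key) for w, per_key in windows.items()}

events = [(0.1, "a", 1), (0.7, "a", 1), (1.2, "b", 1), (1.9, "a", 1)]
result = tumbling_window(events, duration=1.0)
# window 0 folds both early "a" events; window 1 holds one "a" and one "b"
```

The per-window, per-key counts are exactly the kind of state that must survive a failure for the state-recovery requirement to hold.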

17 Sample Workload (JSON)
{
  "sources": {
    "s1": {
      "key_dist": "ZipfDistribution",
      "key_dist_params": { "num_keys": 10, "exponent": 1.2 },
      "msg_dist": "UniformIntegerDistribution",
      "msg_dist_params": { "lower": 100, "upper": 1000 },
      "rate_dist": "UniformIntegerDistribution",
      "rate_dist_params": { "rate": 10 }
    }
  },
  "transformations": {
    "t1": { "operator": "filter", "input": "s1", "params": { "p": 0.5 } },
    "t2": { "operator": "modify", "input": "t1", "params": { "size_ratio": 0.5 } }
  },
  "sinks": ["t2"]
}
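A spec like this forms a dataflow graph through the `input` references. As a sketch, a driver could resolve each sink's operator chain as follows; `chain_to_source` is a hypothetical helper for linear pipelines, not Finch's loader:

```python
import json

spec = json.loads("""
{
  "sources": {"s1": {"key_dist": "ZipfDistribution"}},
  "transformations": {
    "t1": {"operator": "filter", "input": "s1", "params": {"p": 0.5}},
    "t2": {"operator": "modify", "input": "t1", "params": {"size_ratio": 0.5}}
  },
  "sinks": ["t2"]
}
""")

def chain_to_source(spec, sink):
    """Follow `input` references from a sink back to its originating source."""
    path = [sink]
    node = sink
    while node in spec["transformations"]:
        node = spec["transformations"][node]["input"]
        path.append(node)
    return list(reversed(path))

chains = {sink: chain_to_source(spec, sink) for sink in spec["sinks"]}
```

Operators with multiple inputs (split, merge, join) would make this a general graph walk rather than a single chain.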

18 Finch Modules
finch-samza:
Functional programming style operators
Executes workloads on Hadoop YARN
Configurable state store (in memory, on disk, remote)
finch-heron:
Streamlet API, similar to Samza's
Executes workloads on either a Mesos cluster with the Apache Aurora scheduler, or the built-in cluster manager using the Nomad scheduler

19 Finch Metrics DEMO

20 Experiments
Feature-based Analysis: test the performance of stream processing framework features
Fault tolerance
Scalability
Resiliency
State recovery
Application-based Analysis: test real-world application performance to give users better information about cluster capabilities
Word count
Stream search (grep)
Live statistics
In the interest of time, we will focus on the feature-based analysis.

21 Experimental Setup (AWS EC2 instances)
Kafka cluster: 3× c5.xlarge (brokers)
Samza cluster (YARN): 1× m4.large (master), 3× c4.xlarge (slaves)
Heron cluster: 3× c4.xlarge (executors)

Type       vCPU  Memory  Network
m4.large   2     8 GB    450 Mbps
c4.xlarge  4     7.5 GB  750 Mbps
c5.xlarge  4     8 GB    2.25 Gbps

22 Experimental Workloads
All workloads use 30 keys and 10,000 messages per second.
stateless: filter followed by size-modify
stateful: 1-second tumbling window
pipeline: split into 2 streams, filter and modify respectively, then join on key

23 Throughput vs. Failure Size
Feature-based Analysis: Fault Tolerance
[Chart: throughput vs. failure ratio]

24 Recovery Time vs. Failure Size
Feature-based Analysis: Fault Tolerance

25 Feature-based Analysis: Scalability
Maximum throughput vs. number of containers (pipeline workload with 100 keys)

26 Conclusion
Finch: a new benchmarking tool that enables application-based as well as feature-based analysis
Generates arbitrary, customizable workloads
Users can learn about system trade-offs and how those trade-offs affect their applications
Future work:
Integrate Finch with other frameworks like Spark Streaming and Flink; collected metrics are very specific to each framework, which is a massive challenge
Additional feature analysis, e.g. resiliency and state recovery metrics

27 Questions?


