Presentation is loading. Please wait.

Presentation is loading. Please wait.

Apache Storm.

Similar presentations


Presentation on theme: "Apache Storm."— Presentation transcript:

1 Apache Storm

2 Introduction Apache Storm is a real-time fault-tolerant and distributed Stream Processing Engine. Open Sourced September 19th 2011 Main languages-Clojure  and the JAVA. Some of the Characteristics of storm are Fast, Scalable, Fault-tolerant, Reliable and Easy to operate. Some of the organizations that currently use Storm are: Yahoo!, Groupon, The Weather Channel, Alibaba, Baidu, and Rocket Fuel.

3 Why Storm? Dealing with Huge amount of data Streaming data processing
Use cases: real-time trading analytics malfunction detection social network smart advertisement placement log processing and metrics analytics. In recent years “Big Data” has become one of the hottest topic in software engineering. Necessity to deal with Huge amount of data brings new challenges and, of course, new opportunities. Streaming data processing has been attracting more and more attention due to its crucial impacts on a wide range of use cases, such as real-time trading analytics, malfunction detection, campaign, social network, smart advertisement placement, log processing and metrics analytics. To serve the booming demands of streaming data processing, many computation engines have been developed, one of them is Storm.

4 Internal Architecture
Nimbus-Master Node Assigns tasks Monitors failures Zookeeper Cluster state of Nimbus and Supervisor maintained in zookeeper Supervisor Communicates with Nimbus through Zookeeper about topologies and available resources. Workers Listens for assigned work and executes the application. Nimbus Zookeeper Zookeeper Zookeeper Change the diagram Supervisor Supervisor Supervisor Supervisor Workers Workers Workers Workers

5 Storm- Data Processing
Streams of tuples flowing through topologies Vertices represent computation and edges represent the data flow Vertices divided into Spouts –read tuples from external sources. Bolts – encapsulate the application logic. Tap diagram to show data flow

6 Interaction between Storm Internals Components
Events - Heartbeat protocol (every 15 seconds), synchronize supervisor event(every 10 seconds) and synchronize process event(every 3 seconds). Submits topology Advertises Topology Client Nimbus Supervisors* Match making Executors* Workers* Processes Tasks Spawns Workers -Client submits the topology to Nimbus. -Supervisor contacts Nimbus and advertises the current running topology and any vacancies to run more topologies. -Nimbus does the match-making between the pending topology and the supervisors. -Nimbus distributes the work. -Based on the job assignment, Supervisors spawns workers. -Each worker process runs one or more Executors and executors are in turn made up of one or more tasks. -The actual work for a bolt or a spout is done in the task.

7 Stream Grouping Shuffle grouping-randomly partitions the tuples.
Fields grouping- hashes on a subset of the tuple attributes/fields. All grouping- replicates the entire stream to all the consumer tasks. Global grouping- sends the entire stream to a single bolt.

8 Processing Semantics States of workers Atleast once:
Each tuple that is input to the topology will be processed atleast once. Atmost Once: Each tuple is processed once or dropped in case of failure. States of workers Supervisor periodically checks the state of workers for managing the worker processes. Timed out Not started Disallowed Valid

9 Topology Example

10 Topology Example-contd
SentenceSpout: emits a stream of tuples that represents sentences: {sentence: “my dog has fleas”} SplitSentenceBolt: emits a tuple for each word in the sentences it receives: {word: “my”}; {word:”dog”}; ... WordCountBolt: updates the count and at save it to file. Each node extends some abstract classes and must implements some basic methods for defining the format of the tuple emitted, the logic and so on…

11 Guaranteed Processing
Storm provides an API to guarantee that a tuple emitted by a spout is fully processed by the topology (at-least-once semantic). With guaranteed processing, each bolt in the tree can either acknowledge or fail a tuple Spout Side Bolts Side With guaranteed processing, each bolt in the tree can either acknowledge or fail a tuple: ▷  If all bolts in the tree acknowledge the tuple and child tuples the message processing is complete; ▷  if any bolts explicitly fail a tuple, or we exceed a time-out period, the processing is failed. From the Spout side, we have to keep track of the tuple emitted and be prepared to handle fails: On the Bolts side, we have to anchor any emitted tuple to the originating one and acknowledging or failing tuples

12 Deployment model of the cluster
The secondary Nimbus instance starts working when the primary one temporary fails. Each spout deals with a specific data stream which allows to produce tuples from streams with different protocols and data formats. The bolts from current layer are involved on the filtering, aggregating and analysis stages Referenced paper “Apache Storm Based on Topology for Real-Time Processing of Streaming Data from Social Networks” . Explained Deployment model of the cluster and suitability to work for Real-Time Processing of Streaming Data from Social Networks.

13 Storm Use Cases Twitter's infrastructure, including database systems (Cassandra, Memcached, etc), the messaging infrastructure, Mesos, and the monitoring/alerting systems Yahoo! is developing a next generation platform that enables the convergence of big-data and low-latency processing. Groupon Storm helps us analyze, clean, normalize, and resolve large amounts of non-unique data points with low latency and high throughput. Alibaba uses storm to process the application log and the data change in database to supply real time stats for data apps. Twitter's infrastructure, including database systems (Cassandra, Memcached, etc), the messaging infrastructure, Mesos, and the monitoring/alerting systems. Storm's isolation scheduler makes it easy to use the same cluster both for production applications and in-development applications, and it provides an efficient way for capacity planning. Yahoo! is developing a next generation platform that enables the convergence of big-data and low-latency processing. While Hadoop is our primary technology for batch processing, Storm empowers stream/micro-batch processing of user events, content feeds, and application logs. Groupon At Groupon we use Storm to build real-time data integration systems. Storm helps us analyze, clean, normalize, and resolve large amounts of non-unique data points with low latency and high throughput. Alibaba is the leading B2B e-commerce website in the world. We use storm to process the application log and the data change in database to supply real time stats for data apps.

14 Storm Use Cases-contd

15 Comparison big data open source tools
A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run “MapReduce jobs”, on Storm you run “topologies”. “Jobs” and “topologies” themselves are very different — one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).Storm can do real time processing of streams of tuple’s (incoming data) while Hadoop do batch processing with MapReduce job. Storm behave like true streaming processing systems with lower latencies, While Spark is able to handle higher throughput while having somewhat higher latencies. Storm is better choice for real time data processing REFERENCE -Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming, By Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Thomas Graves, Mark Holderbaugh Zhuo Liu, Kyle Nusbaum, Kishorkumar Patil, Boyang Jerry Peng and Paul Poulosky Yahoo Inc., Presented at 2016 IEEE International Parallel and Distributed Processing Symposium Workshops Add the paper

16 Storm vs Hadoop Storm is stateless but with Zookeeper Storm maintain states.

17 Storm- Pros and Cons Latency: very less
Fault tolerance: High fault tolerance Latency: very less Processing Model: Real-time stream processing model Programming language dependency: any programming language Reliable: each tuple of data should be processed at least once Scalability: high scalability Cons Use of native scheduler and resource management feature (Nimbus) in particular, become bottlenecks.  Difficulties with debugging given the way the threads and data flows are split. Because o first con Twitter switched from Storm to Twitter Heron

18 Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming
Referred paper “Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming”. As per the paper, they closely mimicked the real-world production scenarios and provided a performance comparison of the three data engines. The throughput vs latency graph for the various systems is explaining that Flink and Storm have very similar performance, and Spark Streaming, while it has much higher latency, is expected to be able to handle much higher throughput by configuring its batch interval higher. The reason why Spark has higher latency because of micro-batching methodology as compared to the pure streaming nature of Storm.

19 References Apache Storm Based on Topology for Real-Time Processing of Streaming Data from Social Networks, By Anatoliy Batyuk, Volodymyr Voityshyn, Presented at IEEE First International Conference on Data Stream Mining & Processing Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming, By Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Thomas Graves, Mark Holderbaugh Zhuo Liu, Kyle Nusbaum, Kishorkumar Patil, Boyang Jerry Peng and Paul Poulosky Yahoo Inc., Presented at 2016 IEEE International Parallel and Distributed Processing Symposium Workshops computation-engines-at INTRODUCTION TO APACHE STORM, by Tiziano De Matteis Using apache storm for big data, S Surshanov*, IITU, Kazakhstan Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel*, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy Ryaboy, Twitter, Inc., *University of Wisconsin – Madison


Download ppt "Apache Storm."

Similar presentations


Ads by Google