Presentation is loading. Please wait.

Presentation is loading. Please wait.

Real-Time Stream Processing CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.

Similar presentations


Presentation on theme: "Real-Time Stream Processing CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook."— Presentation transcript:

1 Real-Time Stream Processing CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

2 Agenda Apache Storm

3 Traditional Data Processing !!!ALL!!! the data !!!ALL!!! the data Batch Pre- Computation (aka MapReduce) Batch Pre- Computation (aka MapReduce) Index Query

4 Traditional Data Processing Slow... and views are out of date Absorbed into batch viewsNot absorbed Now Time

5 Compensating for the real-time stuff Need some kind of stream processing system to supplement our batch views Applications can then merge the batch and the real time views together!

6 How do we do that?

7

8 Enter: Storm Open-source project originally built by Twitter Now a top-level Apache project Enables distributed, fault-tolerant, real-time, guaranteed computation

9 A History Lesson on Twitter Metrics Twitter Firehose

10 A History Lesson on Metrics Twitter Firehose

11 Problems! Scaling is painful Fault-tolerance is practically non-existent Coding for it is awful

12 Wanted to Address Guaranteed data processing Horizontal Scalability Fault-tolerance No intermediate message brokers Higher level abstraction than message passing “Just works”

13 Storm Delivers Guaranteed data processing Horizontal Scalability Fault-tolerance No intermediate message brokers Higher level abstraction than message passing “Just works”

14 Use Cases Stream Processing Distributed RPC Continuous Computation

15 Storm Architecture Nimbus ZooKeeper Supervisor

16 Glossary Streams – Constant pump of data as Tuples Spouts – Source of streams Bolts – Process input streams and produce new streams – Functions, Filters, Aggregation, Joins, Talk to databases, etc. Topologies – Network of spouts and bolts

17 Tasks and Topologies

18 Grouping When a Tuple is emitted from a Spout or Bolt, where does it go? Shuffle Grouping – Pick a random task Fields Grouping – Consistent hashing on a subset of tuple fields All Grouping – Send to all tasks Global Grouping – Pick task with lowest ID

19 Topology shuffle [“url”] shuffle [“id1”, “id2”] all

20 Guaranteed Message Processing A tuple has not been fully processed until it all tuples in the “tuple tree” have been completed If the tree is not completed within a timeout, it is replayed Programmers need to use the API to ‘ack’ a tuple as completed

21 Stream Processing Example Word Count TopologyBuilder builder = new TopologyBuilder(); builder.setSpout(1, new SentenceSpout(true), 5); builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1); builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields(“word”)); Map conf = new HashMap(); conf.put(Config.TOPOLOGY_WORKERS, 5); StormSubmitter.submitTopology(“word-count”, conf, builder.createTopology());

22 public static class SplitSentence extends ShellBolt implements IRichBolt { public SplitSentence() { super(“python”, “splitsentence.py”); } public void declareOutputFields(OutputFieldsDeclaraer declarer) { declarer.declare(new Fields(“word”)); } #!/usr/bin/python import storm class SplitSentenceBolt(storm.BasicBolt): def process(Self, tup): words = tup.values[0].split(“ “) for word in words: storm.emit([word])

23 public static class WordCount implements IBasicBolt { Map counts = new HashMap (); public void prepare(Map conf, TopologyContext context) {} public void execute(Tuple tuple, BasicOutputCollector collector) { String word = tuple.getString(0); Integer count = counts.get(word); if (count == null) { count = 0; } ++count; counts.put(Word, count); collector.emit(new Values(word, count)); } public void cleanup () {} public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields(“word”, “count”)); }

24 Local Mode! TopologyBuilder builder = new TopologyBuilder(); builder.setSpout(1, new SentenceSpout(true), 5); builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1); builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields(“word”)); Map conf = new HashMap(); conf.put(Config.TOPOLOGY_WORKERS, 5); LocalCluster cluster = new LocalCluster(); cluster.submitTopology(“word-count”, conf, builder.createTopology()); Thread.sleep(10000); cluster.shutdown();

25 Command Line Interface Starting a topology storm jar mycode.jar twitter.storm.MyTopology demo Stopping a topology storm kill demo

26 Distributed RPC

27 DRPC Example Reach Reach is the number of unique people exposed to a specific URL on Twitter URL Tweeter Follower Distinct Follower Count Reach

28 Reach Topology Spout GetTweeters CountAggregator GetFollowers Distinct shuffle [“follower-id”] global

29 Storm Review Distributed code and configurations Robust process management Monitors topologies and reassigns failed tasks Provides reliability by tracking tuple trees Routing and partitioning of streams Serialization Fine-grained performance stats of topologies

30 APACHE SPARK

31 Concern! Say I have an application that involves many iterations... – Graph Algorithms – K-Means Clustering – Six Degrees of Bieber Fever What's wrong with Hadoop MapReduce?

32 References http://storm.incubator.apache.org/ http://www.slideshare.net/nathanmarz/storm- distributed-and-faulttolerant-realtime- computation http://www.slideshare.net/nathanmarz/storm- distributed-and-faulttolerant-realtime- computation http://www.slideshare.net/Hadoop_Summit/real time-analytics-with-storm http://www.slideshare.net/Hadoop_Summit/real time-analytics-with-storm http://spark.apache.org http://www.cs.berkeley.edu/~matei/papers/2012 /nsdi_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012 /nsdi_spark.pdf


Download ppt "Real-Time Stream Processing CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook."

Similar presentations


Ads by Google