Presentation on theme: "Streaming Analytics with Apache Flink 1.0"— Presentation transcript:

1 Streaming Analytics with Apache Flink 1.0
Stephan Ewen @stephanewen

2 Apache Flink Stack
Libraries on top of the DataStream API (Stream Processing) and the DataSet API (Batch Processing), both running on the Runtime (Distributed Streaming Data Flow). Streaming and batch as first class citizens.

3 Today
Today's focus: the DataStream API (Stream Processing) and the Runtime (Distributed Streaming Data Flow) within that stack. Streaming and batch as first class citizens.

4 Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams.

5 Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

6 Continuous Processing with Batch
Continuous ingestion → periodic (e.g., hourly) files → periodic batch jobs

7 λ Architecture "Batch layer": what we had before
"Stream layer": approximate early results

8 A Stream Processing Pipeline
collect → store → analyze → serve

9 A brief History of Flink
January ‘10: Stratosphere (Flink precursor); April ‘14: Flink Project Incubation; December ‘14: Top Level Project; releases v0.5, v0.6, v0.7, v0.8, v0.9, v0.10; March ‘16: Release 1.0.

10 A brief History of Flink
Same timeline, annotated: during the Stratosphere years, the academia gap (reading/writing papers, teaching, worrying about the thesis); later, realizing this might be interesting to people beyond academia (even more so, actually).

11 Programs and Dataflows
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))
val events: DataStream[Event] = lines.map((line) => parse(line))

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())

stats.addSink(new RollingSink(path))

Streaming dataflow: Source → map() → keyBy()/window()/apply() → Sink, each operator shown with parallelism 2 (Source [1], Source [2], …).

12 What makes Flink flink?
True Streaming: low latency, high throughput, well-behaved flow control (back pressure).
Event Time: make more sense of data, works on real-time and historic data, flexible windows (time, count, session, roll-your-own).
Stateful Streaming: windows & user-defined state, exactly-once semantics for fault tolerance, globally consistent savepoints.
APIs & Libraries: Complex Event Processing.

13 Streaming Analytics by Example

14 Time-Windowed Aggregations
case class Event(sensor: String, measure: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .sum("measure")

15 Time-Windowed Aggregations
case class Event(sensor: String, measure: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(60), Time.seconds(5))
  .sum("measure")

16 Session-Windowed Aggregations
case class Event(sensor: String, measure: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("sensor")
  .window(EventTimeSessionWindows.withGap(Time.seconds(60)))
  .max("measure")

17 Session-Windowed Aggregations
case class Event(sensor: String, measure: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("sensor")
  .window(EventTimeSessionWindows.withGap(Time.seconds(60)))
  .max("measure")

(Flink 1.1 syntax)

18 Pattern Detection
case class Event(producer: String, evtType: Int, msg: String)
case class Alert(msg: String)

val stream: DataStream[Event] = env.addSource(…)
stream
  .keyBy("producer")
  .flatMap(new RichFlatMapFunction[Event, Alert]() {

    lazy val state: ValueState[Int] = getRuntimeContext.getState(…)

    def flatMap(event: Event, out: Collector[Alert]) = {
      val newState = state.value() match {
        case 0 if (event.evtType == 0) => 1
        case 1 if (event.evtType == 1) => 0
        case x => out.collect(Alert(event.msg)); 0
      }
      state.update(newState)
    }
  })

19 Pattern Detection: embedded key/value state store
Same program as slide 18; the callout highlights that the ValueState[Int] obtained via getRuntimeContext.getState(…) is Flink's embedded key/value state store.

20 Many more
Joining streams (e.g., combining readings from several sensors; a small join sketch follows)
Detecting patterns (CEP)
Applying (changing) rules or models to events
Training and applying online machine learning models
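As a hedged illustration of the first item, a minimal sketch of a windowed join of two sensor streams with the DataStream API; the event types, field names, and window size are hypothetical, window-assigner class names shifted slightly between Flink 1.0 and 1.1, and imports are omitted as on the other code slides:

case class Temperature(sensor: String, degrees: Double)
case class Humidity(sensor: String, percent: Double)

val temps: DataStream[Temperature] = env.addSource(…)
val hums: DataStream[Humidity] = env.addSource(…)

// Combine readings from the same sensor that fall into the same 5-second event-time window.
temps.join(hums)
  .where(_.sensor)
  .equalTo(_.sensor)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  .apply((t, h) => (t.sensor, t.degrees, h.percent))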

21 (It's) About Time

22 The biggest change in moving from batch to streaming is handling time explicitly

23 Example: Windowing by Time
case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

24 Example: Windowing by Time
Same program as slide 23 (repeated).

25 Different Notions of Time
Event Producer → Message Queue (partition 1, partition 2) → Flink Data Source → Flink Window Operator. Event time is set at the producer, ingestion time at the Flink data source, and window processing time at the window operator.

26 Event Time vs. Processing Time
Star Wars as an example: by processing time (release order) the episodes arrive as IV (1977), V (1980), VI (1983), I (1999), II (2002), III (2005), VII (2015); by event time they are ordered as Episodes I through VII.

27 IoT / Mobile Applications
Events occur on devices → events are stored in a log (queue / log) → events are analyzed in a data streaming system (stream analysis).

28 Out of order Streams

29 Out of order Streams

30 Out of order Streams

31 Out of order Streams
A first burst of events and a second burst of events, arriving out of order.

32 Out of order Streams
Instant event-at-a-time processing yields arrival-time windows; event-time windows group the first and second burst of events by when they actually occurred.

33 Processing Time
case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(ProcessingTime)

val stream: DataStream[Event] = env.addSource(…)
stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

Windows are evaluated by the operator's processing time.

34 Ingestion Time
case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(IngestionTime)

val stream: DataStream[Event] = env.addSource(…)
stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

35 Event Time
case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)

val stream: DataStream[Event] = env.addSource(…)
stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

36 Event Time
case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)

val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignAscendingTimestamps(_.timestamp)

tsStream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

37 Event Time
case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)

val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignTimestampsAndWatermarks(
  new MyTimestampsAndWatermarkGenerator())

tsStream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
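MyTimestampsAndWatermarkGenerator above is user code that the transcript does not show; a minimal sketch of what such a periodic assigner might look like, assuming the AssignerWithPeriodicWatermarks interface (Flink 1.1) and an illustrative out-of-orderness bound of 3.5 seconds:

// Hypothetical implementation of the user-defined assigner referenced above.
class MyTimestampsAndWatermarkGenerator extends AssignerWithPeriodicWatermarks[Event] {

  val maxOutOfOrderness = 3500L        // assumed bound on lateness, in milliseconds
  var currentMaxTimestamp = 0L         // highest event timestamp seen so far

  override def extractTimestamp(event: Event, previousTimestamp: Long): Long = {
    currentMaxTimestamp = math.max(event.timestamp, currentMaxTimestamp)
    event.timestamp
  }

  // Called periodically by Flink: the watermark trails the highest seen timestamp by the bound.
  override def getCurrentWatermark: Watermark =
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
}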

38 Watermarks
Figure: an in-order stream (event timestamps 7, 9, 10, 11, 14, 15, 17, 18, 19, 20, 21, 23) carrying watermarks W(11) and W(20), and an out-of-order stream (7, 11, 15, 9, 12, 14, 17, 22, 20, 19, 21) carrying watermarks W(11) and W(17).

39 Watermarks in Parallel
Figure: watermarks are generated at the sources (e.g., W(17), W(33)) and flow with the events [id|timestamp] through the parallel map() and window() operators; each operator tracks event time based on the watermarks arriving on its input streams.

40 Mixing Event Time and Processing Time
case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)

val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignAscendingTimestamps(_.timestamp)

tsStream
  .keyBy("id")
  .window(SlidingEventTimeWindows.of(seconds(15), seconds(5)))
  .trigger(new MyTrigger())
  .sum("measure")

41 Window Triggers
React to any combination of event time, processing time, and event data.
Example of a mixed event-time / processing-time trigger: fire when event time reaches the window end, OR when processing time reaches the window end plus 30 seconds.

42 Trigger example
public class EventTimeTrigger extends Trigger<Object, TimeWindow> {

  public TriggerResult onElement(Object evt, long time, TimeWindow window, TriggerContext ctx) {
    ctx.registerEventTimeTimer(window.maxTimestamp());
    ctx.registerProcessingTimeTimer(window.maxTimestamp() + 30000);
    return TriggerResult.CONTINUE;
  }

  public TriggerResult onEventTime(long time, TimeWindow w, TriggerContext ctx) {
    return TriggerResult.FIRE_AND_PURGE;
  }

  public TriggerResult onProcessingTime(long time, TimeWindow w, TriggerContext c) {
    …
  }
}

43 Trigger example (continued)
Same trigger as slide 42, with the processing-time path completed:

  public TriggerResult onProcessingTime(long time, TimeWindow w, TriggerContext c) {
    return TriggerResult.FIRE_AND_CONTINUE;
  }

44 Matters of State (Fault Tolerance, Reinstatements, etc)

45 Back to the Aggregation Example
case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(
  new FlinkKafkaConsumer09(topic, schema, properties))

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

The windowed aggregation is stateful.

46 Fault Tolerance
Prevent data loss (reprocess lost in-flight events).
Recover state consistency (exactly-once semantics): pending windows & user-defined (key/value) state.
Checkpoint-based fault tolerance: periodically create checkpoints; on recovery, resume from the last completed checkpoint. Based on the Asynchronous Barrier Snapshots (ABS) algorithm. Enabling checkpointing is sketched below.
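A minimal sketch of switching checkpointing on for a job; the 5-second interval is illustrative (exactly-once is the default checkpointing mode):

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Draw a consistent checkpoint of all operator state every 5 seconds (interval is illustrative).
env.enableCheckpointing(5000)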

47 Checkpoints
Figure: a data stream with older and newer records; checkpoints capture the state of the dataflow at a point X and at a later point Y in the stream.

48 Checkpoint Barriers Markers, injected into the streams

49 Checkpoint Procedure

50 Checkpoint Procedure

51 Savepoints
A "Checkpoint" is a globally consistent point-in-time snapshot of the streaming program (point in stream, state).
A "Savepoint" is a user-triggered, retained checkpoint.
Streaming programs can start from a savepoint (Savepoint A, Savepoint B, … in the figure).
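Operationally, a savepoint is triggered with the command-line client, e.g. bin/flink savepoint <jobId>, and a (possibly modified) job is started from it with bin/flink run -s <savepointPath> …; the job ID and path are placeholders, and flag spellings may differ slightly between Flink versions.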

52 (Re)processing data (in batch)
Re-processing data (what-if exploration, correcting bugs, etc.) is usually done by running a batch job over a set of old files, with tools that map files to times: a collection of files, organized by ingestion time (e.g., hourly), is handed to the batch processor.

53 Unclear Batch Boundaries
The hourly files go to the batch processor, but what about sessions that span across batches?

54 (Re)processing data (streaming)
Draw savepoints at times that you will want to start new jobs from (daily, hourly, …).
Reprocess by starting a new job from a savepoint: the savepoint defines the start position in the stream (for example, Kafka offsets) and initializes pending state (like partial sessions).
Run the new streaming program from the savepoint.

55 Continuous Data Sources
Stream of Kafka partitions: a savepoint stores the Kafka offsets + operator state.
Stream view over a sequence of files: a savepoint stores the file modification timestamp + file position + operator state (work in progress, target: Flink 1.1).

56 Upgrading Programs
A program starting from a savepoint can differ from the program that created the savepoint.
Unique operator names match state to operators.
The mechanism can be used to fix bugs in programs, to evolve programs, parameters, libraries, … (a naming sketch follows).
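A hedged sketch of giving an operator a stable name so that its state in a savepoint can be matched to the upgraded program; the chosen name is illustrative (newer Flink versions use uid() for the same purpose):

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
  .name("15s-sliding-sum")   // stable, unique name used to re-assign state from the savepoint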

57 State Backends
Large state is a collection of key/value pairs.
The state backend defines which data structure holds the state, plus how it is snapshotted.
Most common choices (a configuration sketch follows):
Main memory, snapshots to the master
Main memory, snapshots to a distributed filesystem
RocksDB, snapshots to a distributed filesystem
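A minimal configuration sketch; the checkpoint path is a placeholder, and the RocksDB backend requires the separate flink-statebackend-rocksdb dependency:

// Keep state in main memory, snapshot to a distributed filesystem (path is a placeholder).
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))

// Or keep state in RocksDB, snapshot to a distributed filesystem.
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))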

58 Complex Event Processing Primer

59 Example: Temperature Monitoring
Receiving temperature and power events from sensors.
Looking for temperatures that repeatedly exceed thresholds within a short time period (10 seconds).

60 Event Types
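The event types were shown as a diagram on the original slide; a hedged Scala sketch of what they might look like (type and field names are hypothetical):

abstract class MonitoringEvent(val rackId: Int)
case class TemperatureEvent(override val rackId: Int, temperature: Double) extends MonitoringEvent(rackId)
case class PowerEvent(override val rackId: Int, voltage: Double) extends MonitoringEvent(rackId)
case class TemperatureAlert(rackId: Int, msg: String)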

61 Defining Patterns
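A hedged sketch of the pattern with Flink's CEP library: two temperature readings above a threshold within 10 seconds. The where/within syntax changed between Flink versions, and the 100-degree threshold is illustrative:

val warningPattern = Pattern
  .begin[TemperatureEvent]("first")
  .where(_.temperature > 100.0)
  .next("second")
  .where(_.temperature > 100.0)
  .within(Time.seconds(10))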

62 Generating Alerts
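And a hedged sketch of applying the pattern and emitting alerts, keying by the hypothetical rackId field; the select signature varies across CEP versions, so treat it as illustrative:

// temperatureStream: DataStream[TemperatureEvent] (hypothetical name)
val patternStream = CEP.pattern(temperatureStream.keyBy("rackId"), warningPattern)

val alerts: DataStream[TemperatureAlert] = patternStream.select(
  (matched: Map[String, TemperatureEvent]) =>
    TemperatureAlert(matched("first").rackId, "temperature exceeded the threshold twice within 10s"))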

63 An Outlook on Things to Come

64 Flink in the wild
30 billion events daily; 2 billion events in 10 1Gb machines; a data integration & distribution platform. (User names appeared as logos on the slide; see their respective talks.)

65 Roadmap
Dynamic scaling, resource elasticity
Stream SQL
CEP enhancements
Incremental & asynchronous state snapshotting
Mesos support
More connectors, end-to-end exactly once
API enhancements (e.g., joins, slowly changing inputs)
Security (data encryption, Kerberos with Kafka)

66 I stream, do you?

67 Why does Flink stream flink?
True Streaming: low latency, high throughput, well-behaved flow control (back pressure).
Event Time: make more sense of data, works on real-time and historic data, flexible windows (time, count, session, roll-your-own).
Stateful Streaming: windows & user-defined state, exactly-once semantics for fault tolerance, globally consistent savepoints.
APIs & Libraries: Complex Event Processing.

68 Addendum

69 Latency and Throughput

70 Low Latency and High Throughput
Frequently thought to be mutually exclusive: event-at-a-time means low latency but low throughput; mini-batch means high latency but high throughput. This is not true! Very little latency has to be sacrificed for very high throughput.

71 Latency and Throughput

72 Latency and Throughput

73 The Effect of Buffering
The network stack does not always operate in event-at-a-time mode. Optional buffering adds a few milliseconds of latency but increases throughput; it has no effect on application logic. The relevant knob is sketched below.
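A minimal sketch of the knob behind this trade-off, the stream environment's buffer timeout (the 10 ms value is illustrative):

// Flush network buffers at least every 10 ms (illustrative value).
// 0 flushes after every record (lowest latency); -1 flushes only when buffers are full (highest throughput).
env.setBufferTimeout(10)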

74 On a technical level: decouple all things
Clocks: wall-clock time (processing time), event time (watermarks & punctuations), consistency clock (logical checkpoint timestamps).
Buffering: windows (application logic), network (throughput tuning).

75 Decoupling clocks

76 Stream Alignment

77 On exactly-once guarantees
A giant topic of confusion: "exactly once" of what, exactly?

78 High Availability

79 High Availability Checkpoints
Figure: Client, JobManager, TaskManagers, Apache ZooKeeper™. Step 1: the TaskManagers take snapshots.

80 High Availability Checkpoints
Steps so far: TaskManagers take snapshots, persist the snapshots, and send state handles to the JobManager.

81 High Availability Checkpoints
Additionally: the JobManager creates the global checkpoint.

82 High Availability Checkpoints
Additionally: the JobManager persists the global checkpoint.

83 High Availability Checkpoints
Additionally: the handle to the global checkpoint is written to ZooKeeper.

84 The Counting Pyramid of Needs

