Stream Processing with Tamás István Ujj

Presentation transcript:

1 Stream Processing with Tamás István Ujj t.ujj@mortoff.hu

2 A database is nothing but our conception of it; what is man to say it differs from a stream in nature… Lambda Architecture

3 Customer Relationship Management, Business Process Management, Software Quality Management, Application Development, Manufacturing Support, Business Intelligence, Machine Learning, Big Data

4 Our Customers: Telecommunications, Manufacturing, Financial Sector

5 A real-time data architecture

6 I want to do complex calculations on large amounts of data. You need a batch processing system.

7 Diagram labels: New Data, Staging Area, Transformation Logic, Results. New data is written to a temporary staging area. A scheduled batch job executes the transformation logic.

8 We changed the logic. Let’s recalculate the previous results, too. Recomputation will cost you extra.

9 Diagram labels: New Data, Staging Area, ETL, Master Dataset, Transformation Logic, Transformation Logic (New), Results, Results (New). Master Dataset: an immutable, append-only set of raw data. Results can be recomputed from historical data.
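
The recomputation idea on this slide can be sketched in a few lines of Python. This is only an illustration: the event fields, the two logic functions and the sample data are assumptions made up for the example, not part of the presentation.

```python
# Hypothetical illustration: all names and data are assumptions.
master_dataset = []  # append-only list of raw events; entries are never modified


def append_raw(event):
    """New raw data is only ever appended to the master dataset."""
    master_dataset.append(event)


def logic_v1(events):
    """Original transformation logic: total amount per customer."""
    totals = {}
    for e in events:
        totals[e["customer"]] = totals.get(e["customer"], 0) + e["amount"]
    return totals


def logic_v2(events):
    """Changed logic: ignore refunds (negative amounts)."""
    return logic_v1(e for e in events if e["amount"] > 0)


for event in [
    {"customer": "a", "amount": 10},
    {"customer": "a", "amount": -3},
    {"customer": "b", "amount": 5},
]:
    append_raw(event)

results = logic_v1(master_dataset)      # results of the original logic
results_new = logic_v2(master_dataset)  # recomputed from the same raw history
```

Because the raw events are kept unchanged, the new logic can be applied to the full history without touching the old results.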

10 Why do I have to wait hours for the updated results?! We’ll have to reengineer the system for low latency.

11 Nathan Marz: Big Data: Principles and best practices of scalable real-time data systems

12 Diagram labels: New Data, Staging Area, ETL, Master Dataset, Transformation Logic, Batch Results, Transformation Logic (Streaming), Stream Engine, Real-Time Results. The batch layer calculates the correct results with high latency. The speed layer calculates approximate results on the most recent data in real time.
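
A minimal sketch of how a query might combine the two layers, assuming a simple counting workload. The event format, the `BATCH_HORIZON` cut-off and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical events: (key, count) pairs in arrival order.
events = [("page1", 1), ("page2", 1), ("page1", 1), ("page1", 1)]

# Assumption: the last batch run covered only the first two events.
BATCH_HORIZON = 2


def batch_layer(all_events, horizon):
    """High latency: covers everything up to the last batch run."""
    view = Counter()
    for key, n in all_events[:horizon]:
        view[key] += n
    return view


def speed_layer(all_events, horizon):
    """Low latency: covers only the events the batch layer has not seen yet."""
    view = Counter()
    for key, n in all_events[horizon:]:
        view[key] += n
    return view


def query(key, batch_view, realtime_view):
    """A query merges the batch view with the real-time view."""
    return batch_view[key] + realtime_view[key]


batch_view = batch_layer(events, BATCH_HORIZON)
realtime_view = speed_layer(events, BATCH_HORIZON)
page1_count = query("page1", batch_view, realtime_view)
```

Note that the same counting logic appears twice, once per layer, which is the code duplication the later slides argue against.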

13 Your architecture costs me a fortune! This is the price of Big Data.

14 You don’t need the batch layer. Interesting. That’s half the costs. Stream processing isn’t reliable on its own!

15 A well-designed streaming system provides exactly-once semantics, even in case of failure. Receiving the data: Kafka is a reliable source; offsets are tracked in checkpoints. Transforming the data: repeatable transformations. Pushing out the data: idempotent updates, or transactional updates (saving results and offsets). Diagram: a sequence of numbered records labeled Offset.
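
The transactional variant (saving results and offsets) can be sketched with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: an in-memory list stands in for a Kafka partition, and the table names, the `process` function and its `crash_at` parameter are hypothetical.

```python
import sqlite3


def open_store():
    """Result store that also holds the consumer's next offset."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE totals (key TEXT PRIMARY KEY, total INTEGER NOT NULL)")
    db.execute("CREATE TABLE offsets (id INTEGER PRIMARY KEY, next_offset INTEGER NOT NULL)")
    db.execute("INSERT INTO offsets VALUES (0, 0)")
    db.commit()
    return db


def process(db, log, crash_at=None):
    """Resume from the saved offset; each record's result and offset commit together."""
    (start,) = db.execute("SELECT next_offset FROM offsets").fetchone()
    for offset in range(start, len(log)):
        key, amount = log[offset]
        with db:  # one transaction: both writes are committed, or neither is
            db.execute("INSERT OR IGNORE INTO totals VALUES (?, 0)", (key,))
            db.execute("UPDATE totals SET total = total + ? WHERE key = ?", (amount, key))
            if offset == crash_at:
                raise RuntimeError("simulated crash before the offset is saved")
            db.execute("UPDATE offsets SET next_offset = ?", (offset + 1,))


log = [("a", 10), ("b", 5), ("a", 1)]  # stand-in for a replayable source
db = open_store()
try:
    process(db, log, crash_at=1)  # fails while handling offset 1
except RuntimeError:
    pass
process(db, log)  # restart: continues from the last committed offset

totals = dict(db.execute("SELECT key, total FROM totals"))
(next_offset,) = db.execute("SELECT next_offset FROM offsets").fetchone()
```

The simulated crash rolls back the partial update for offset 1, so the restart applies that record once and no amount is counted twice.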

16 Diagram labels: New Data, Staging Area, ETL, Master Dataset, Transformation Logic, Batch Results, Stream Engine, Real-Time Results, Offset, Transformation Logic (New), Offset (New). Kafka retains incoming data. Recomputation: processing data from the beginning of the stream with a parallel streaming job.
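
Recomputation by replay can be sketched as follows. The `RetainedLog` and `StreamJob` classes are hypothetical stand-ins for a retained Kafka topic and a streaming job; they do not use any real Kafka API.

```python
class RetainedLog:
    """Stand-in for a topic with retention: an append-only list read by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read_from(self, offset):
        return self._records[offset:]


class StreamJob:
    """A job with its own transformation logic and its own offset."""

    def __init__(self, log, logic, start_offset=0):
        self.log = log
        self.logic = logic
        self.offset = start_offset
        self.results = []

    def poll(self):
        """Process every record that arrived since the last poll."""
        for record in self.log.read_from(self.offset):
            self.results.append(self.logic(record))
            self.offset += 1


log = RetainedLog()
for amount in [3, 5, 8]:
    log.append(amount)

old_job = StreamJob(log, logic=lambda amount: amount)  # reports whole units
old_job.poll()

# The logic changes: a second job starts from offset 0 and runs in parallel.
new_job = StreamJob(log, logic=lambda amount: amount * 100, start_offset=0)
new_job.poll()

log.append(2)  # new data keeps arriving; both jobs see it
old_job.poll()
new_job.poll()
```

Once the new job has caught up, downstream consumers could be switched to its results and the old job stopped.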

17 How can I stream data from my databases? A stream is an ever-growing, immutable set of events. Under the hood, a database is also a stream of events: creates, updates and deletes.

18 A database is a view over this stream of events. Diagram: a stream of Create, Update and Delete events feeding the Database. Let’s capture this internal stream.
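
The claim that a database is a view over its event stream can be illustrated by replaying events into a dictionary. The event tuple format and the sample rows are assumptions made for this sketch.

```python
def apply(table, event):
    """Apply one (operation, key, row) event to the current table state."""
    op, key, row = event
    if op == "create":
        table[key] = dict(row)
    elif op == "update":
        table[key].update(row)
    elif op == "delete":
        del table[key]
    else:
        raise ValueError(f"unknown operation: {op}")
    return table


def materialize(events):
    """The table contents are what remains after replaying every event in order."""
    table = {}
    for event in events:
        apply(table, event)
    return table


# Hypothetical event stream.
events = [
    ("create", 1, {"name": "Ann", "city": "Pecs"}),
    ("create", 2, {"name": "Bob", "city": "Gyor"}),
    ("update", 1, {"city": "Budapest"}),
    ("delete", 2, None),
]

table = materialize(events)
```

Replaying only a prefix of the events gives the table as it was at an earlier point in time.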

19 A consistent snapshot of the entire database contents at one point in time. A real-time stream of changes from that point onward. PostgreSQL and Oracle support both. The technique is called Change Data Capture.
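
A toy model of the snapshot-plus-changes pattern, to show why the snapshot has to be tied to a position in the change stream. `SourceDatabase` and its methods are hypothetical; real Change Data Capture in PostgreSQL or Oracle uses database-specific mechanisms that are not shown here.

```python
import copy


class SourceDatabase:
    """Hypothetical source that keeps its current rows plus an ordered change log."""

    def __init__(self):
        self.rows = {}
        self.change_log = []  # (op, key, row) tuples in commit order

    def write(self, op, key, row=None):
        if op == "delete":
            self.rows.pop(key, None)
        else:  # "create" or "update"
            self.rows[key] = {**self.rows.get(key, {}), **(row or {})}
        self.change_log.append((op, key, row))

    def snapshot(self):
        """A consistent copy of all rows, plus the log position it corresponds to."""
        return copy.deepcopy(self.rows), len(self.change_log)

    def changes_since(self, position):
        return self.change_log[position:]


def apply_change(replica, change):
    """Apply one captured change to the downstream copy."""
    op, key, row = change
    if op == "delete":
        replica.pop(key, None)
    else:
        replica[key] = {**replica.get(key, {}), **(row or {})}


source = SourceDatabase()
source.write("create", 1, {"name": "Ann"})
source.write("create", 2, {"name": "Bob"})

replica, position = source.snapshot()  # bootstrap from the snapshot

source.write("update", 1, {"name": "Anna"})  # changes after the snapshot
source.write("delete", 2)

for change in source.changes_since(position):  # stream from that point onward
    apply_change(replica, change)
```

After the loop the replica matches the source, because no change before the snapshot position is applied twice and none after it is missed.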

20 Complex asynchronous transformations… …with low latency. And fault-tolerance through recomputation. And all this with a single computational model, without code duplication.

21 The SMACK stack: Spark for Micro-Batch Processing, Mesos for Cluster Management, Akka for Event Processing, Cassandra for Persistence, Kafka for Event Transport.

22 A trade-off between latency and computational power. Event Processing: sub-second latency; simple triggers. Micro-Batch Processing: latency of seconds to minutes; complex transformations. Responding to single events in real time, or a general analysis over the stream.
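
The difference between the two styles can be sketched in plain Python. The sensor readings, the threshold and the one-second interval are made-up values; a real deployment would use an event-processing or micro-batch engine rather than these functions.

```python
from collections import defaultdict

# Hypothetical readings: (timestamp in seconds, value).
readings = [(0.1, 20), (0.7, 95), (1.2, 30), (1.9, 40), (2.5, 99)]


def event_processing(stream, threshold):
    """React to each event as it arrives: a simple trigger."""
    for ts, value in stream:
        if value > threshold:
            yield (ts, "ALERT", value)


def micro_batch_processing(stream, interval):
    """Collect events into fixed intervals, then transform each batch as a whole."""
    batches = defaultdict(list)
    for ts, value in stream:
        batches[int(ts // interval)].append(value)
    return {b: sum(values) / len(values) for b, values in sorted(batches.items())}


alerts = list(event_processing(readings, threshold=90))
averages = micro_batch_processing(readings, interval=1.0)
```

The trigger can fire as soon as a single reading arrives, while the averages only become available once an interval has closed.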

23 Event Processing: sub-second latency; simple triggers. Micro-Batch Processing: latency of seconds to minutes; complex transformations. Akka Streams: Reactive Streams with back pressure. Kafka Streams. Some other alternatives: Storm, Flink, Samza.

24 Event Processing: sub-second latency; simple triggers. Micro-Batch Processing: latency of seconds to minutes; complex transformations. SQL, Machine Learning, Graph Analytics, Functional API.

25 Cluster Management. YARN: Hadoop and related components; a job request comes in and YARN places the job. Mesos: any application; a job request comes in, Mesos offers resources, and the job accepts or rejects them.
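
The two scheduling styles on this slide can be contrasted with a toy model. The node names, core counts and function names are hypothetical, and neither function reflects the actual YARN or Mesos APIs.

```python
# Hypothetical cluster state: free CPU cores per node.
nodes = {"node1": 4, "node2": 16}


def yarn_style(job_cpus, free):
    """The resource manager decides placement: first node with enough room."""
    for node, cpus in free.items():
        if cpus >= job_cpus:
            return node
    return None


def mesos_style(framework_accepts, free):
    """The master offers resources; the framework accepts or rejects each offer."""
    for node, cpus in free.items():
        if framework_accepts(node, cpus):
            return node
    return None


yarn_choice = yarn_style(2, nodes)

# This framework applies its own rule: it only accepts offers with 8 or more cores.
mesos_choice = mesos_style(lambda node, cpus: cpus >= 8, nodes)
```

In the first style the placement rule lives in the cluster manager; in the second, the application decides which offers suit it.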

26 Diagram labels: Upstream Sources, Downstream Applications. An architecture for converting large amounts of raw data into valuable information in real time.

27 Tamás István Ujj t.ujj@mortoff.hu Business Intelligence Inspiration: Nathan Marz, Jay Kreps, Tyler Akidau, Martin Kleppmann, Dean Wampler

