Stream Processing with Tamás István Ujj

A database is nothing but our conception of it; what is man to say it differs from a stream in nature… Lambda Architecture

Customer Relationship Management, Business Process Management, Software Quality Management, Application Development, Manufacturing Support, Business Intelligence, Machine Learning, Big Data

Our Customers: Telecommunications, Manufacturing, Financial Sector

A real-time data architecture

I want to do complex calculations on large amounts of data. You need a batch processing system.

[Diagram: New Data, Staging Area, Transformation Logic, Results] New data is written to a temporary staging area. A scheduled batch job executes the transformation logic.

We changed the logic. Let’s recalculate the previous results, too. Recomputation will cost you extra.

[Diagram: New Data, Staging Area, ETL, Master Dataset, Transformation Logic, Results, Transformation Logic (New), Results (New)] Master Dataset: an immutable, append-only set of raw data. Results can be recomputed from historical data.
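The recomputation idea on this slide can be sketched in a few lines. This is only an illustration under assumptions: the record layout, `ingest`, `totals_v1` and `totals_v2` are hypothetical names, and an in-memory list stands in for the master dataset.

```python
from typing import Dict, Iterable, List

# Append-only master dataset: raw records are only ever added, never changed.
master_dataset: List[dict] = []


def ingest(record: dict) -> None:
    """Append one raw record (hypothetical helper)."""
    master_dataset.append(record)


def totals_v1(records: Iterable[dict]) -> Dict[str, int]:
    """Original transformation logic: sum every amount per customer."""
    totals: Dict[str, int] = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals


def totals_v2(records: Iterable[dict]) -> Dict[str, int]:
    """Changed logic: refunds (negative amounts) are ignored."""
    return totals_v1(r for r in records if r["amount"] > 0)


for rec in [
    {"customer": "a", "amount": 10},
    {"customer": "a", "amount": -4},
    {"customer": "b", "amount": 7},
]:
    ingest(rec)

results = totals_v1(master_dataset)      # results of the original logic
results_new = totals_v2(master_dataset)  # recomputed from the same raw history
```

Because the raw data is kept untouched, changing the logic only means running the new function over the same history.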

Why do I have to wait hours for the updated results?! We’ll have to reengineer the system for low latency.

Nathan Marz: Big Data: Principles and best practices of scalable real-time data systems

[Diagram: New Data, Staging Area, ETL, Master Dataset, Transformation Logic, Batch Results, Transformation Logic (Streaming), Stream Engine, Real-Time Results] The batch layer calculates the correct results with high latency. The speed layer calculates the approximate results on the most recent data in real-time.
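A toy sketch of how a query might combine the two layers. All names here (`BATCH_HORIZON`, `batch_view`, `speed_view`, `query`) are assumptions for illustration, and in this simplified version the speed layer happens to be exact rather than approximate.

```python
from collections import Counter

# (key, timestamp) pairs; an in-memory stand-in for the incoming data.
events = [("page_a", 1), ("page_b", 2), ("page_a", 3), ("page_a", 4)]

# Assumption: the last finished batch run covered timestamps <= 2.
BATCH_HORIZON = 2


def batch_view(evts, horizon):
    """Batch layer: counts over everything the last batch run has seen."""
    return Counter(key for key, ts in evts if ts <= horizon)


def speed_view(evts, horizon):
    """Speed layer: counts over the recent data the batch has not reached yet."""
    return Counter(key for key, ts in evts if ts > horizon)


def query(key):
    """Serve a result by merging the batch view with the real-time view."""
    return batch_view(events, BATCH_HORIZON)[key] + speed_view(events, BATCH_HORIZON)[key]
```

When the next batch run finishes, the horizon moves forward and the speed layer's share of the answer shrinks again.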

Your architecture costs me a fortune! This is the price of Big Data.

You don’t need the batch layer. Interesting. That’s half the costs. Stream processing isn’t reliable on its own!

A well-designed streaming system provides exactly-once semantics, even in case of failure.
Receiving the data: Kafka is a reliable source. Tracking the offsets in checkpoints.
Transforming the data: Repeatable transformations.
Pushing out the data: Idempotent updates. Transactional updates. (Saving results and offsets.)
[Diagram label: Offset]
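The "saving results and offsets" point can be sketched with SQLite from the Python standard library. This is a minimal illustration, not the presenter's implementation: the table names, the `log` list standing in for a replayable source, and `process_available` are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (key TEXT PRIMARY KEY, total INTEGER NOT NULL)")
conn.execute("CREATE TABLE offsets (id INTEGER PRIMARY KEY, next_offset INTEGER NOT NULL)")
conn.execute("INSERT INTO offsets VALUES (1, 0)")
conn.commit()

# Stand-in for a replayable source: position in the list = offset.
log = [("a", 5), ("b", 2), ("a", 1)]


def process_available(conn, log):
    """Process every event at or after the stored offset.

    The result update and the offset update are committed in ONE transaction,
    so after a crash and restart an event is either fully applied or not at all.
    """
    next_offset = conn.execute(
        "SELECT next_offset FROM offsets WHERE id = 1"
    ).fetchone()[0]
    for offset in range(next_offset, len(log)):
        key, amount = log[offset]
        with conn:  # commits on success, rolls back on exception
            conn.execute("INSERT OR IGNORE INTO totals VALUES (?, 0)", (key,))
            conn.execute("UPDATE totals SET total = total + ? WHERE key = ?", (amount, key))
            conn.execute("UPDATE offsets SET next_offset = ? WHERE id = 1", (offset + 1,))


process_available(conn, log)
process_available(conn, log)  # simulated restart: nothing is applied twice

totals = dict(conn.execute("SELECT key, total FROM totals"))
stored_offset = conn.execute("SELECT next_offset FROM offsets").fetchone()[0]
```

Running the job a second time is a no-op because the stored offset already points past the last event.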

[Diagram: New Data, Staging Area, ETL, Master Dataset, Transformation Logic, Batch Results, Stream Engine, Offset, Real-Time Results, Transformation Logic (New), Offset (New), Real-Time Results] Kafka retains incoming data. Recomputation: processing data from the beginning of the stream with a parallel streaming job.
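A rough sketch of recomputation with a parallel job. It assumes a retained log that any job can read from any offset; `RetainedLog` and `StreamJob` are hypothetical classes written for this illustration and do not use any Kafka client API.

```python
class RetainedLog:
    """In-memory stand-in for a log that retains incoming events."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        return self._events[offset:]


class StreamJob:
    """A streaming job with its own logic and its own offset."""

    def __init__(self, logic):
        self.logic = logic
        self.offset = 0
        self.results = []

    def poll(self, log):
        for event in log.read_from(self.offset):
            self.results.append(self.logic(event))
            self.offset += 1


log = RetainedLog()
live_job = StreamJob(lambda x: x * 2)  # current transformation logic
for value in (1, 2, 3):
    log.append(value)
live_job.poll(log)

# The logic changed: start a second job from the beginning of the stream.
new_job = StreamJob(lambda x: x * 2 + 1)  # new transformation logic
new_job.poll(log)  # catches up on the retained history

# Both jobs keep running in parallel until the old one is switched off.
log.append(4)
live_job.poll(log)
new_job.poll(log)
```

The old job keeps serving results while the new one catches up, so the switch can happen once both offsets are level.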

How can I stream data from my databases? A stream is an ever-growing, immutable set of events. Under the hood, a database is also a stream of events: creates, updates and deletes.

A database is a view over this stream of events. [Diagram: a sequence of Create, Update and Delete events; Database] Let’s capture this internal stream.
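To make "a view over the stream" concrete, here is a small sketch that folds create, update and delete events into the current table contents. The event tuple layout and the function names are assumptions for illustration.

```python
def apply_event(state, event):
    """Apply one (operation, key, value) event to the table state."""
    op, key, value = event
    if op in ("create", "update"):
        state[key] = value
    elif op == "delete":
        state.pop(key, None)
    return state


def materialize(events):
    """The 'database': the result of replaying the whole event stream."""
    state = {}
    for event in events:
        apply_event(state, event)
    return state


events = [
    ("create", 1, "alice"),
    ("create", 2, "bob"),
    ("update", 1, "alicia"),
    ("delete", 2, None),
]

table = materialize(events)
```

The table holds only the latest state, while the event list still holds the full history.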

A consistent snapshot of the entire database contents at one point in time. A real-time stream of changes from that point onward. PostgreSQL and Oracle support both. The technique is called Change Data Capture.
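The two ingredients named on this slide, a snapshot plus the changes from that point onward, can be sketched as follows. This is a simplified in-memory model with hypothetical names (`write`, `change_log`, `snapshot_position`); it does not show the actual PostgreSQL or Oracle mechanisms.

```python
import copy


def apply_change(table, change):
    """Apply one (operation, key, row) change to a table."""
    op, key, row = change
    if op == "delete":
        table.pop(key, None)
    else:  # "create" or "update"
        table[key] = row


source_table = {1: {"name": "alice"}, 2: {"name": "bob"}}
change_log = []  # stand-in for the database's internal change stream


def write(op, key, row=None):
    """Every write changes the table and is recorded in the change stream."""
    apply_change(source_table, (op, key, row))
    change_log.append((op, key, row))


# Step 1: a consistent snapshot, plus the log position at which it was taken.
snapshot = copy.deepcopy(source_table)
snapshot_position = len(change_log)

# Writes keep arriving after the snapshot.
write("update", 1, {"name": "alicia"})
write("delete", 2)
write("create", 3, {"name": "carol"})

# Step 2: a downstream copy = snapshot + all changes from that position onward.
replica = copy.deepcopy(snapshot)
for change in change_log[snapshot_position:]:
    apply_change(replica, change)
```

Recording the position together with the snapshot is what keeps the copy consistent: no change is missed and none is applied twice.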

And all this with a single computational model, without code duplication. Complex asynchronous transformations… …with low latency. And fault-tolerance through recomputation.

The SMACK stack: Spark for Micro-Batch Processing, Mesos for Cluster Management, Akka for Event Processing, Cassandra for Persistence, Kafka for Event Transport

            Event Processing    Micro-Batch Processing
Latency     Sub-second          Seconds to minutes
Power       Simple triggers     Complex transformations

A trade-off between latency and computational power. Responding to single events in real-time or a general analysis over the stream.
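The difference between the two styles can be illustrated with plain Python. This is a sketch only: the sensor events, the threshold and the one-second interval are made up, and no Akka or Spark API is involved.

```python
from collections import defaultdict

# (timestamp in seconds, source, value); hypothetical sensor readings.
events = [(0.1, "sensor1", 20), (0.4, "sensor2", 95),
          (1.2, "sensor1", 22), (1.9, "sensor2", 97)]


def on_event(event, threshold=90):
    """Event processing: a simple trigger evaluated on each single event."""
    _, source, value = event
    return f"ALERT {source}" if value > threshold else None


alerts = [a for a in map(on_event, events) if a is not None]


def micro_batches(evts, interval=1.0):
    """Micro-batch processing: group events into fixed time intervals."""
    batches = defaultdict(list)
    for event in evts:
        batches[int(event[0] // interval)].append(event)
    return [batches[k] for k in sorted(batches)]


def average_per_source(batch):
    """A transformation that needs the whole batch, not a single event."""
    values = defaultdict(list)
    for _, source, value in batch:
        values[source].append(value)
    return {source: sum(v) / len(v) for source, v in values.items()}


batch_results = [average_per_source(b) for b in micro_batches(events)]
```

The trigger can react as soon as an event arrives, whereas the averages are only available once an interval has closed.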

Event Processing (sub-second latency, simple triggers) vs. Micro-Batch Processing (seconds to minutes, complex transformations). Akka Streams: Reactive Streams with back pressure. Kafka Streams. Some other alternatives: Storm, Flink, Samza.

Event Processing (sub-second latency, simple triggers) vs. Micro-Batch Processing (seconds to minutes, complex transformations). SQL, Machine Learning, Graph Analytics, Functional API.

Cluster Management. YARN: Hadoop and related components. Job request comes in, YARN places the job. Mesos: any application. Job request comes in, Mesos offers resources, job accepts or rejects.
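The two scheduling styles described on this slide can be contrasted in a toy model. This is a deliberate simplification with hypothetical names (`scheduler_places`, `offer_based`, the node table); it does not reflect the real YARN or Mesos APIs.

```python
# Free CPUs per node; a made-up cluster for illustration.
free_cpus = {"node1": 4, "node2": 8}


def scheduler_places(job_cpus, free):
    """Placement style: the cluster manager itself picks a node for the job."""
    for node, cpus in free.items():
        if cpus >= job_cpus:
            return node
    return None


def offer_based(job_accepts, free):
    """Offer style: the manager offers resources, the job accepts or rejects."""
    for node, cpus in free.items():
        if job_accepts(node, cpus):
            return node  # offer accepted
    return None  # every offer was rejected


def picky_job(node, cpus):
    """A job that needs 2 CPUs and, as an assumption, wants to run on node2."""
    return cpus >= 2 and node == "node2"


placed_on = scheduler_places(2, free_cpus)
accepted_on = offer_based(picky_job, free_cpus)
```

In the offer style the placement decision sits with the job, which is why it can apply criteria the manager knows nothing about.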

[Diagram: Upstream Sources, Downstream Applications] An architecture for converting large amounts of raw data into valuable information in real-time.

Tamás István Ujj, Business Intelligence. Inspiration: Nathan Marz, Jay Kreps, Tyler Akidau, Martin Kleppmann, Dean Wampler