Stream Processing with Tamás István Ujj

A database is nothing but our conception of it; what is man to say it differs from a stream in nature… Lambda Architecture

Customer Relationship Management, Business Process Management, Software Quality Management, Application Development, Manufacturing Support, Business Intelligence, Machine Learning, Big Data

Our Customers: Telecommunications, Manufacturing, Financial Sector

A real-time data architecture

I want to do complex calculations on large amounts of data. You need a batch processing system.

Diagram: New Data, Staging Area, Transformation Logic, Results. New data is written to a temporary staging area. A scheduled batch job executes the transformation logic.

We changed the logic. Let’s recalculate the previous results, too. Recomputation will cost you extra.

Diagram: New Data, Staging Area, ETL, Master Dataset, Transformation Logic, Results, Transformation Logic (New), Results (New). Master Dataset: an immutable, append-only set of raw data. Results can be recomputed from historical data.
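
The recomputation idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the purchase events, the customer names and both logic functions are hypothetical. It only shows that when raw data is appended and never changed, a new version of the transformation logic can be run over the full history.

```python
# Master dataset: raw events are only ever appended, never updated or deleted.
# The (customer, amount) events below are hypothetical example data.
master_dataset = []


def append_raw(event):
    """Append one raw event to the immutable, append-only master dataset."""
    master_dataset.append(event)


def run_batch(logic, data):
    """Recompute the results from scratch over all historical data."""
    results = {}
    for customer, amount in data:
        results[customer] = results.get(customer, 0) + logic(amount)
    return results


def total_spend(amount):
    """Original logic: every amount counts, refunds included."""
    return amount


def total_spend_ignoring_refunds(amount):
    """New logic: negative amounts (refunds) are ignored."""
    return max(amount, 0)


for event in [("alice", 30), ("bob", 20), ("alice", -10)]:
    append_raw(event)

old_results = run_batch(total_spend, master_dataset)
new_results = run_batch(total_spend_ignoring_refunds, master_dataset)
print(old_results, new_results)
```

Because the raw events are kept, changing the logic requires no migration of the old results: the new results are simply computed next to them.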

Why do I have to wait hours for the updated results?! We’ll have to reengineer the system for low latency.

Nathan Marz: Big Data: Principles and best practices of scalable real-time data systems

Diagram: New Data, Staging Area, ETL, Master Dataset, Transformation Logic, Batch Results, Transformation Logic (Streaming), Stream Engine, Real-Time Results. The batch layer calculates the correct results with high latency. The speed layer calculates approximate results on the most recent data in real time.
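
How the two layers combine at query time can be sketched as follows. This is a toy model under stated assumptions: the page-view events, the `SpeedLayer` class and the `query` function are hypothetical names, and the speed layer here counts exactly, whereas a real speed layer may trade accuracy for latency.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch layer: recompute the full view from all historical events."""
    return Counter(master_dataset)


class SpeedLayer:
    """Speed layer: incrementally counts only events the batch has not seen yet."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, page):
        self.view[page] += 1


def query(batch_view, speed_layer, page):
    """Serve a query by merging the batch view with the real-time view."""
    return batch_view[page] + speed_layer.view[page]


# Hypothetical page views already covered by the last batch run.
batch_view = build_batch_view(["/home", "/home", "/pricing"])

# Hypothetical page views that arrived after the last batch run.
speed = SpeedLayer()
for page in ["/home", "/docs"]:
    speed.on_event(page)

print(query(batch_view, speed, "/home"))
```

Note that the counting logic exists twice, once per layer. That duplication is the cost the following slides argue against.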

Your architecture costs me a fortune! This is the price of Big Data.

You don’t need the batch layer. Interesting. That’s half the costs. Stream processing isn’t reliable on its own!

A well-designed streaming system provides exactly-once semantics, even in case of failure. Receiving the data: Kafka is a reliable source, and the offsets are tracked in checkpoints. Transforming the data: repeatable transformations. Pushing out the data: idempotent updates, or transactional updates (saving results and offsets).
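
The transactional variant, saving results and offsets together, can be illustrated with a small in-memory simulation. This is a sketch only: the list `log` stands in for a Kafka partition, and `Sink`, `run` and `crash_at` are hypothetical names rather than any real client API.

```python
class Sink:
    """Stores the result and the next offset to read as one unit."""

    def __init__(self):
        self.total = 0
        self.offset = 0  # next log position to process

    def commit(self, total, offset):
        # Both values change in one step; this stands in for a single
        # database transaction that saves the result and the offset.
        self.total, self.offset = total, offset


def run(log, sink, crash_at=None):
    """Process the log from the committed offset onward.

    If crash_at is given, a failure is simulated after the record at that
    offset has been transformed but before its result is committed.
    """
    for offset in range(sink.offset, len(log)):
        new_total = sink.total + log[offset]  # repeatable transformation
        if offset == crash_at:
            raise RuntimeError("simulated failure before commit")
        sink.commit(new_total, offset + 1)


log = [5, 7, 11]  # hypothetical retained input records
sink = Sink()

try:
    run(log, sink, crash_at=1)  # first attempt fails on the second record
except RuntimeError:
    pass

run(log, sink)  # restart resumes from the committed offset
print(sink.total, sink.offset)
```

After the restart, the second record is processed again, but since its first attempt never committed, each record contributes to the total exactly once.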

Diagram: New Data, Staging Area, ETL, Master Dataset, Transformation Logic, Batch Results, Stream Engine, Offset, Real-Time Results, Transformation Logic (New), Offset (New), Real-Time Results. Kafka retains incoming data. Recomputation: processing data from the beginning of the stream with a parallel streaming job.

How can I stream data from my databases? A stream is an ever-growing, immutable set of events. Under the hood, a database is also a stream of events: creates, updates and deletes.

A database is a view over this stream of events. Diagram: a sequence of Create, Update and Delete events feeding the Database. Let’s capture this internal stream.

A consistent snapshot of the entire database contents at one point in time. A real-time stream of changes from that point onward. PostgreSQL and Oracle support both. The technique is called Change Data Capture.
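
The relationship between a table, its change events, and a snapshot can be made concrete with a short sketch. The event format `(operation, key, value)` and the sample rows are assumptions for illustration; real Change Data Capture tools emit richer records.

```python
def apply_event(table, event):
    """Apply one change event to a table held as a dict of key -> value."""
    operation, key, value = event
    if operation == "delete":
        table.pop(key, None)
    else:  # "create" and "update" both set the current value
        table[key] = value


def materialize(events, snapshot=None):
    """Build the table state from an optional snapshot plus later events."""
    table = dict(snapshot or {})
    for event in events:
        apply_event(table, event)
    return table


# Hypothetical change stream of a small table.
events = [
    ("create", 1, "Ann"),
    ("create", 2, "Bob"),
    ("update", 1, "Anna"),
    ("delete", 2, None),
    ("create", 3, "Cid"),
]

# Replaying the whole stream gives the current database contents.
full = materialize(events)

# A snapshot at one point in time, plus the changes from that point onward,
# gives the same contents.
snapshot = materialize(events[:3])
resumed = materialize(events[3:], snapshot)
print(full, resumed)
```

This is why a consumer only needs the two pieces named on the slide: the snapshot brings it up to a known point, and the change stream keeps it current from there.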

Complex asynchronous transformations… …with low latency. And fault-tolerance through recomputation. And all this with a single computational model, without code duplication.

The SMACK stack: Spark for Micro-Batch Processing, Mesos for Cluster Management, Akka for Event Processing, Cassandra for Persistence, Kafka for Event Transport.

A trade-off between latency and computational power: responding to single events in real time, or a general analysis over the stream.

          Event Processing    Micro-Batch Processing
Latency   Sub-second          Seconds to minutes
Power     Simple triggers     Complex transformations
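
The difference between the two styles can be sketched on the same input. The `(timestamp, value)` events, the threshold and the interval length are hypothetical; this example runs over a finished list, whereas a real engine would receive the events continuously.

```python
def event_processing(events, threshold):
    """Simple trigger: decide on each event by itself, as soon as it arrives."""
    return [(ts, value) for ts, value in events if value > threshold]


def micro_batch_processing(events, interval):
    """Group events into fixed time intervals, then analyse each group."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    # Average per batch, keyed by the start time of the interval.
    return {
        index * interval: sum(values) / len(values)
        for index, values in sorted(batches.items())
    }


# Hypothetical sensor readings as (timestamp in seconds, value).
events = [(0, 10), (1, 50), (2, 12), (5, 40), (6, 20)]

alerts = event_processing(events, threshold=30)
averages = micro_batch_processing(events, interval=5)
print(alerts, averages)
```

The trigger can fire the moment a single event arrives, but sees nothing beyond that event. The batch result has to wait for the interval to close, but can compute over everything inside it.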

Event processing: Akka Streams (Reactive Streams with back pressure), Kafka Streams. Some other alternatives: Storm, Flink, Samza.

Micro-batch processing: SQL, Machine Learning, Graph Analytics, Functional API.

Cluster Management. YARN: Hadoop and related components. A job request comes in, and YARN places the job. Mesos: any application. A job request comes in, Mesos offers resources, and the job accepts or rejects them.
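
The two scheduling styles on this slide can be contrasted with a toy model. None of this is the YARN or Mesos API: the function names, the node names and the CPU counts are invented for illustration, and real schedulers handle far more than free CPUs.

```python
def place_job(requested_cpus, nodes):
    """YARN-like style: the cluster manager picks a node for the job."""
    for name, free_cpus in nodes.items():
        if free_cpus >= requested_cpus:
            nodes[name] = free_cpus - requested_cpus
            return name
    return None


def offer_resources(job_accepts, nodes):
    """Mesos-like style: the manager makes offers, the job accepts or rejects."""
    for name, free_cpus in nodes.items():
        if job_accepts(name, free_cpus):
            return name
    return None


# Hypothetical cluster: node name -> free CPUs.
nodes = {"node-a": 2, "node-b": 8}

# The manager decides: the first node with enough free CPUs gets the job.
placed_on = place_job(4, nodes)


def picky_job(name, free_cpus):
    """A job with its own placement rules, unknown to the cluster manager."""
    return free_cpus >= 2 and name != "node-a"


# The job decides: it rejects the offer from node-a and accepts node-b.
accepted_on = offer_resources(picky_job, nodes)
print(placed_on, accepted_on)
```

In the second style the placement rules live in the job, so the cluster manager does not need to understand them. That is one way to read the slide's "any application".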

Diagram: Upstream Sources, Downstream Applications. An architecture for converting large amounts of raw data into valuable information in real time.

Tamás István Ujj, Business Intelligence. Inspiration: Nathan Marz, Jay Kreps, Tyler Akidau, Martin Kleppmann, Dean Wampler