Stream Analytics with SQL on Apache Flink®


1 Stream Analytics with SQL on Apache Flink®
Fabian Hueske, Strata Data Conference, New York, September 27th, 2017

2 About me
Apache Flink PMC member, contributing since day 1 at TU Berlin
Focusing on Flink's relational APIs for the last two years
Co-author of "Stream Processing with Apache Flink" (work in progress)
Co-founder & Software Engineer at data Artisans

3 dA Platform 2 = Open Source Apache Flink + dA Application Manager
data Artisans was founded by the original creators of Apache Flink® and provides the dA Platform. Two weeks ago, we announced the next version, dA Platform 2, which bundles Apache Flink and the dA Application Manager.

4 Productionizing and operating stream processing made easy
The dA platform makes operating stream processing applications in production easy.

5 The dA Platform 2
Components: Apache Flink (stateful stream processing), dA Application Manager (application lifecycle management), logging, metrics, CI/CD, and Kubernetes as the container platform. Typical workloads: real-time analytics, anomaly & fraud detection, real-time data integration, reactive microservices, and more, consuming streams from Kafka, Kinesis, S3, HDFS, databases, ...
At its center is Apache Flink, which performs the stateful stream processing. The second major component is the dA Application Manager. The Application Manager lets you manage the lifecycle of applications: you can seamlessly update an application, scale it out or in, or suspend it. You can also run different versions of an application for A/B testing. The dA Platform also integrates components for log and metrics collection and continuous integration/delivery. If you are interested in these features, talk to me or get in touch with data Artisans.

6 What is Apache Flink? Stateful Computations Over Data Streams
Data stream processing: real-time results from data streams
Batch processing: process static and historic data
Event-driven applications: data-driven actions and services
Apache Flink is a system that handles the full breadth of stream processing, from bounded (batch) data over streaming analytics to streaming data applications. The important point here is that Flink addresses the analytical as well as the transactional spectrum of stream processing.

7 What is Apache Flink? Stateful computations over streams
Stateful computations over streams, real-time and historic: fast, scalable, fault tolerant, in-memory, event time, large state, exactly-once. Streams come in from applications, devices, and databases; historic data is read from file or object storage; results feed queries, applications, and other streams.
Flink achieves this with an extensive feature set: scalability, fault tolerance, exactly-once state semantics, large state handling, and support for event-time processing.

8 Hardened at scale
Streaming Platform as a Service: billions of messages per day; 3,700+ containers running Flink, 1,400+ nodes, 22k+ cores, 100s of jobs
Streaming analytics platform: 100s of jobs, 1000s of nodes, TBs of state; metrics, analytics, real-time ML
Fraud detection; a lot of Streaming SQL; Streaming SQL as a platform
All of these features are proven to work at large scale across various dimensions such as data volume, cluster size, state size, and number of jobs, and for a wide variety of use cases. Many big companies run their stream processing workloads on Flink. Some, such as Alibaba and Uber, built their internal stream processing platforms on Flink, and more specifically on Flink's SQL interface.

9 Powerful Abstractions
Layered abstractions to navigate simple to complex use cases:
High-level analytics: SQL / Table API (dynamic tables)
Stream & batch data processing: DataStream API (streams, windows), e.g.
  val stats = stream
    .keyBy("sensor")
    .timeWindow(Time.seconds(5))
    .sum((a, b) -> a.add(b))
Stateful event-driven applications: ProcessFunction (events, state, time), e.g.
  def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
    // work with event and state
    (event, state.value) match { … }
    out.collect(…)   // emit events
    state.update(…)  // modify state
    // schedule a timer callback
    ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
  }
Flink offers APIs at different levels of abstraction that can all be mixed and matched. At the lowest level, ProcessFunctions give precise control over state and time, i.e., when to process data. The intermediate level, the DataStream API, provides higher-level primitives such as window processing. At the top level, the relational APIs, SQL and the Table API, are centered around the concept of dynamic tables. This is what I am going to talk about next.

10 Apache Flink's relational APIs
ANSI SQL & LINQ-style Table API
Unified APIs for batch & streaming data: a query specifies exactly the same result regardless of whether its input is static batch data or streaming data.
Common translation layers: optimization based on Apache Calcite, type system & code generation, table sources & sinks.

11 Show me some code!
tableEnvironment
  .scan("clicks")
  .filter('url.like("…"))
  .groupBy('user)
  .select('user, 'url.count as 'cnt)

SELECT user, COUNT(url) AS cnt
FROM clicks
WHERE url LIKE '…'
GROUP BY user

"clicks" can be a file, a database table, a stream, …

12 What if "clicks" is a file?
clicks:
  user  cTime     url
  Mary  12:00:00  …
  Bob   12:00:02  …
  Liz   12:00:03  …

SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user

result:
  user  cnt
  Mary  2
  Bob   1
  Liz   …

Q: What if we get more click data?
A: We run the query again.
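The batch behavior on a bounded "clicks" input can be illustrated with a toy Python sketch (this is not Flink code, and the sample rows, including the URLs, are invented for illustration):

```python
from collections import Counter

def run_query(clicks):
    """Batch evaluation of: SELECT user, COUNT(url) FROM clicks GROUP BY user."""
    return dict(Counter(user for user, _ctime, _url in clicks))

clicks = [
    ("Mary", "12:00:00", "./home"),
    ("Bob",  "12:00:02", "./cart"),
    ("Mary", "12:00:03", "./prod?id=1"),
]
result = run_query(clicks)

# More data arrives: in the batch model we simply run the query again
# over the full (now larger) input.
clicks.append(("Liz", "12:00:07", "./home"))
result2 = run_query(clicks)
```

The point of the sketch is that each run is a complete re-evaluation over a snapshot of the data; there is no incremental update.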

13 What if “clicks” is a stream?
We want the same results as for batch input! Does SQL work on streams as well?

14 SQL was not designed for streams
Relations are bounded (multi-)sets; streams are infinite sequences.
A DBMS can access all data; streaming data arrives over time.
SQL queries return a result and complete; streaming queries continuously emit results and never complete.

15 DBMSs run queries on streams
Materialized views (MVs) are similar to regular views, but persisted to disk or memory, and are used to speed up analytical queries. MVs need to be updated when their base tables change.
MV maintenance is very similar to SQL on streams: base table updates are a stream of DML statements, the MV definition query is evaluated on that stream, and the MV is the query result, continuously updated.
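The analogy can be made concrete with a toy Python sketch that maintains a COUNT view incrementally from a stream of DML statements (not DBMS or Flink code; the statement format and sample data are invented for illustration):

```python
# Incrementally maintain the view:
#   SELECT user, COUNT(*) FROM clicks GROUP BY user
# from a stream of DML statements on the base table.
view = {}

def on_insert(user):
    view[user] = view.get(user, 0) + 1

def on_delete(user):
    view[user] -= 1
    if view[user] == 0:
        del view[user]          # drop groups that became empty

dml_stream = [("INSERT", "Mary"), ("INSERT", "Bob"),
              ("INSERT", "Mary"), ("DELETE", "Bob")]
for op, user in dml_stream:
    (on_insert if op == "INSERT" else on_delete)(user)
```

Each base-table change updates the view in place instead of re-running the defining query, which is exactly the shape of a streaming SQL query.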

16 Continuous Queries in Flink
The core concept is the "dynamic table". Dynamic tables change over time. Queries on dynamic tables produce new dynamic tables (which are updated based on their input) and do not terminate. Streams and dynamic tables can be converted into each other.

17 Stream → Dynamic Table: Append mode
Stream records are appended to the table; the table grows as more data arrives.

Stream:                        Table:
Mary, 12:00:00, ./home         user  cTime     url
Bob,  12:00:00, ./cart         Mary  12:00:00  ./home
Mary, 12:00:05, ./prod?id=1    Bob   12:00:00  ./cart
Liz,  12:01:00, ./home         Mary  12:00:05  ./prod?id=1
Bob,  12:01:30, ./prod?id=3    Liz   12:01:00  ./home
Mary, 12:01:45, ./prod?id=7    Bob   12:01:30  ./prod?id=3
                               Mary  12:01:45  ./prod?id=7
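The append-mode conversion can be sketched in a few lines of toy Python (not Flink code; the record tuples mirror the slide's sample data):

```python
table = []  # the dynamic table, materialized as a growing list of rows

def on_record(rec):
    # Append mode: every stream record becomes a new row.
    table.append(rec)

stream = [("Mary", "12:00:00", "./home"),
          ("Bob",  "12:00:00", "./cart"),
          ("Mary", "12:00:05", "./prod?id=1")]
for rec in stream:
    on_record(rec)
```

No row is ever modified or removed; the table is simply the history of the stream so far.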

18 Stream → Dynamic Table: Upsert mode
Stream records have (composite) key attributes. Records are inserted, or update existing records with the same key.

Stream (user, lastLogin):    Table:
Mary, …                      user  lastLogin
Bob, …                       Mary  …
Mary, …                      Bob   …
Liz, …                       Liz   …
Bob, …
Mary, …
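The upsert-mode conversion can likewise be sketched as toy Python keyed on the user attribute (not Flink code; the login timestamps are invented because they are missing from the transcript):

```python
table = {}  # the dynamic table, materialized as key -> row

def on_record(user, last_login):
    # Upsert mode: insert a new row, or update the existing row with this key.
    table[user] = last_login

stream = [("Mary", "12:00:00"), ("Bob", "12:00:03"), ("Mary", "12:01:40")]
for user, login in stream:
    on_record(user, login)
```

The table keeps exactly one row per key, so later records with the same key overwrite earlier ones instead of growing the table.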

19 Querying a Dynamic Table
SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user

clicks:                        result:
Mary  12:00:00  ./home         user  cnt
Bob   12:00:00  ./cart         Mary  1 → 2 → 3
Mary  12:00:05  ./prod?id=1    Bob   1
Liz   12:01:00  ./home         Liz   1 → 2
Liz   12:01:30  ./prod?id=3
Mary  12:01:45  ./prod?id=7

Rows of the result table are updated as new clicks arrive.
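The continuous query above can be mimicked with a toy Python sketch that updates its result table on every input row (not Flink code; sample data follows the slide):

```python
counts = {}    # current result table: user -> cnt
updates = []   # sequence of updated (user, cnt) result rows, in emission order

def on_click(user, url):
    counts[user] = counts.get(user, 0) + 1
    updates.append((user, counts[user]))   # the result row that just changed

for user, url in [("Mary", "./home"), ("Bob", "./cart"),
                  ("Mary", "./prod?id=1")]:
    on_click(user, url)
```

Unlike the batch case, the query never re-reads its full input: each click only touches the one result row for that user.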

20 What about windows?
tableEnvironment
  .scan("clicks")
  .window(Tumble over 1.hour on 'cTime as 'w)
  .groupBy('w, 'user)
  .select('user, 'w.end as 'endT, 'url.count as 'cnt)

SELECT user,
       TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
       COUNT(url) AS cnt
FROM clicks
GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR), user

21 Computing window aggregates
SELECT user,
       TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
       COUNT(url) AS cnt
FROM clicks
GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR), user

clicks:                        result:
Mary  12:00:00  ./home         user  endT      cnt
Bob   12:00:00  ./cart         Mary  13:00:00  3
Mary  12:02:00  ./prod?id=2    Bob   13:00:00  1
Mary  12:55:00  …              Bob   14:00:00  1
Bob   13:01:00  ./prod?id=4    Liz   14:00:00  2
Liz   13:30:00  ./cart         Mary  15:00:00  1
Liz   13:59:00  ./home         Bob   15:00:00  2
Mary  14:00:00  ./prod?id=1    Liz   15:00:00  1
Liz   14:02:00  ./prod?id=8
Bob   14:30:00  ./prod?id=7
Bob   14:40:00  ./home

Rows are appended to the result table.
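A tumbling window simply assigns each record to the hour bucket its timestamp falls into; the bucketing can be sketched in toy Python (not Flink code; timestamps are plain seconds and the sample data is invented for brevity):

```python
from collections import defaultdict

WINDOW = 3600  # tumbling window size: 1 hour, in seconds

def window_end(ts):
    # End timestamp of the window that contains ts.
    return (ts // WINDOW + 1) * WINDOW

counts = defaultdict(int)  # (user, window end) -> COUNT(url)
clicks = [("Mary", 12 * 3600), ("Bob", 12 * 3600),
          ("Mary", 12 * 3600 + 120), ("Bob", 13 * 3600 + 60)]
for user, ts in clicks:
    counts[(user, window_end(ts))] += 1
```

Once a window's end time has passed (as tracked by watermarks), its row is final and can be appended to the result table.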

22 Why are the results not updated?
SELECT user,
       TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
       COUNT(url) AS cnt
FROM clicks
GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR), user

cTime is an event-time attribute: it is guarded by watermarks, internally represented as a special type, and user-facing as TIMESTAMP. Queries that operate on event-time attributes get special plans.

23 Dynamic Table → Stream
Converting a dynamic table into a stream: dynamic tables might update or delete existing rows, so updates must be encoded in the outgoing stream. The conversion of tables to streams is inspired by DBMS logs: DBMSs use logs to restore databases (and tables). REDO logs store new records to redo changes; UNDO logs store old records to undo changes.

24 Dynamic Table → Stream: REDO/UNDO
SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user

clicks:
user  url
Mary  ./home
Bob   ./cart
Mary  ./prod?id=1
Liz   ./prod?id=3
Bob   …

Encoding: + INSERT / - DELETE. Emitted stream (oldest first):
+ Mary,1
+ Bob,1
- Mary,1
+ Mary,2
+ Liz,1
- Bob,1
+ Bob,2
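The REDO/UNDO encoding can be sketched in toy Python: each change to a result row first retracts the old row and then inserts the new one (not Flink code; sample data follows the slide):

```python
counts = {}     # current result table: user -> cnt
changelog = []  # ("+", row) inserts and ("-", row) retractions

def on_click(user):
    old = counts.get(user)
    if old is not None:
        changelog.append(("-", (user, old)))  # retract the outdated result row
    counts[user] = (old or 0) + 1
    changelog.append(("+", (user, counts[user])))  # emit the new result row

for user in ["Mary", "Bob", "Mary"]:
    on_click(user)
```

A downstream consumer can rebuild the exact result table by applying the + and - messages in order, at the cost of two messages per update.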

25 Dynamic Table → Stream: REDO
SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user

clicks:
user  url
Mary  ./home
Bob   ./cart
Mary  ./prod?id=1
Liz   ./prod?id=3
Liz   …
Mary  ./prod?id=7

Encoding: * UPSERT by key / - DELETE by key. Emitted stream (oldest first):
* Mary,1
* Bob,1
* Mary,2
* Liz,1
* Liz,2
* Mary,3
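The upsert encoding needs only one message per change, because messages are keyed; a toy Python sketch (not Flink code; sample data follows the slide):

```python
counts = {}   # current result table: user -> cnt
upserts = []  # ("*", key, value) upsert messages, keyed by user

def on_click(user):
    counts[user] = counts.get(user, 0) + 1
    # One upsert message per change; the key tells the consumer which
    # result row to overwrite, so no explicit retraction is needed.
    upserts.append(("*", user, counts[user]))

for user in ["Mary", "Bob", "Mary"]:
    on_click(user)
```

Compared to the REDO/UNDO sketch, updates produce half as many messages, but the consumer must be able to update by key (e.g. a KV store or compacted Kafka topic).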

26 Can we run any query on a dynamic table?
No, there are space and computation constraints.
State size must not grow infinitely as more data arrives:
SELECT sessionId, COUNT(url) FROM clicks GROUP BY sessionId;
A change of an input table may only trigger a partial re-computation of the result table:
SELECT user, RANK() OVER (ORDER BY lastLogin) FROM users;

27 Bounding the size of query state
Adapt the semantics of the query: aggregate only the data of the last 24 hours and discard older data.
Trade the accuracy of the result for the size of the state: remove state for keys that became inactive.

SELECT sessionId, COUNT(url) AS cnt
FROM clicks
WHERE last(cTime, INTERVAL '1' DAY)
GROUP BY sessionId
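The second strategy, expiring state for inactive keys, can be illustrated with a toy Python sketch (not Flink code; Flink applies such retention internally, and the session IDs and timestamps here are invented):

```python
TTL = 24 * 3600  # drop state for keys inactive longer than 24 hours
state = {}       # sessionId -> (count, last_seen_timestamp)

def on_click(session, ts):
    cnt, _ = state.get(session, (0, ts))
    state[session] = (cnt + 1, ts)
    # Expire inactive sessions. If an expired session reappears later,
    # its count restarts at 1: we traded accuracy for bounded state.
    for s, (_, seen) in list(state.items()):
        if ts - seen > TTL:
            del state[s]

on_click("s1", 0)
on_click("s2", 1000)
on_click("s2", 2 * 24 * 3600)  # by now "s1" has expired and is removed
```

Scanning all keys on every record is of course only for illustration; a real implementation would use timers per key.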

28 Current state of SQL & Table API
Flink's relational APIs are rapidly evolving: lots of interest by the community and many contributors; used in production at large scale by Alibaba, Uber, and others.
Features released in Flink 1.3: GroupBy & Over windowed aggregates, non-windowed aggregates (with update changes), user-defined aggregation functions.
Features coming with Flink 1.4: windowed joins, reworked connector APIs.

29 What can be built with this?
Continuous ETL & Streaming Analytics Continuously ingest data Process with transformations & window aggregates Write to files (Parquet, ORC), Kafka, PostgreSQL, HBase, …

30 What can be built with this?
Event-driven applications & Dashboards Flink updates query results with low latency Result can be written to KV store, DBMS, compacted Kafka topic Maintain result table as queryable state

31 Wrap up!
Unified batch and stream processing. Used heavily in production at Alibaba, Uber, and others.
Lots of great features: continuously updating results like materialized views, a sophisticated event-time model with retractions, and user-defined scalar, table & aggregation functions. Check it out!

32 Thank you! @fhueske @ApacheFlink @dataArtisans
Available on O’Reilly Early Release!


34 We are hiring! data-artisans.com/careers

35 Tables are materialized streams
A table is the materialization of a stream of modifications, i.e., of SQL DML statements: INSERT, UPDATE, and DELETE. DBMSs process the statements by modifying tables:

INSERT (u1, Mary, …)                      →  row u1, Mary is added
INSERT (u2, Peter, …)                     →  row u2, Peter is added
UPDATE (lastLogin = …) WHERE (user = u1)  →  u1's lastLogin is updated
DELETE WHERE (user = u2)                  →  u2's row is removed
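The materialization of a DML stream into a table can be sketched in toy Python (not DBMS or Flink code; the tuple encoding of the statements and the timestamps are invented for illustration):

```python
table = {}  # user key -> (name, lastLogin)

def apply_dml(stmt):
    # Materialize one DML statement into the table.
    op = stmt[0]
    if op == "INSERT":
        _, key, name, last = stmt
        table[key] = (name, last)
    elif op == "UPDATE":
        _, key, last = stmt
        name, _old = table[key]
        table[key] = (name, last)   # overwrite lastLogin, keep the name
    elif op == "DELETE":
        _, key = stmt
        del table[key]

for stmt in [("INSERT", "u1", "Mary", "t1"),
             ("INSERT", "u2", "Peter", "t2"),
             ("UPDATE", "u1", "t3"),
             ("DELETE", "u2")]:
    apply_dml(stmt)
```

After replaying the whole stream, the table holds exactly the state the statements describe, which is the sense in which a table is a materialized stream.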

36 Flink's DataStream API
The DataStream API is very expressive: application logic is implemented as user-defined functions, with windows, triggers, evictors, state, timers, async calls, …
Many applications follow similar patterns, do not require the expressiveness of the DataStream API, and can be specified more concisely and easily with a DSL.
Q: What's the most popular DSL for data processing?
A: SQL!

