1 #SQLSAT454 Azure Stream Analytics [Part of the Data Platform] Marco Parenzan

2 #SQLSAT454 Sponsors

3 #SQLSAT454 Meet Marco Parenzan  Microsoft MVP 2015 for Azure  Develops modern distributed and cloud solutions  marco.parenzan@1nn0va.it  Passion for speaking and inspiring programmers, students, people  www.innovazionefvg.net  I’m a developer!

4 #SQLSAT454 Agenda  Why a developer talks about analytics  Analytics in a modern world  Introduction to Azure Stream Analytics  Stream Analytics Query Language (SAQL)  Handling time in Azure Stream Analytics  Scaling Analytics  Conclusions

5 #SQLSAT454 ANALYTICS IN A MODERN WORLD

6 #SQLSAT454 What is Analytics  From Wikipedia  Analytics is the discovery and communication of meaningful patterns in data.  Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance.  Analytics often favors data visualization to communicate insight.

7 #SQLSAT454 IoT proof of concept

8 #SQLSAT454 Event-based systems  An event is “something happened…  …somewhere…  …sometime!”  Events arrive at different times, i.e. have unique timestamps  Events arrive at different rates (events/sec)  In any given period of time there may be 0, 1 or more events

9 #SQLSAT454 Azure Service Bus  Relay: NAT and firewall traversal service, request/response services, unbuffered with TCP throttling  Queue: transactional cloud AMQP/HTTP broker, high-scale, high-reliability messaging, sessions, scheduled delivery, etc.  Topic: transactional message distribution, up to 2000 subscriptions per topic, up to 2K/100K filter rules per subscription  Notification Hub: high-scale notification distribution, most mobile push notification services, millions of notification targets  Event Hub: hyper scale, a million clients, concurrent

10 #SQLSAT454 Azure Event Hubs  > 1M event producers  > 1 GB/sec aggregate throughput  Partition routing: direct or via hash  Throughput Units: 1 ≤ TUs ≤ partition count  TU: 1 MB/s writes, 2 MB/s reads

11 #SQLSAT454 Microsoft Azure IoT Services  Device Connectivity: Event Hubs, Service Bus, External Data Sources  Storage: SQL Database, Table/Blob Storage, DocumentDB, External Data Sources  Analytics: Machine Learning, Stream Analytics, HDInsight, Data Factory  Presentation & Action: App Service, Power BI, Notification Hubs, Mobile Services, BizTalk Services

12 #SQLSAT454 ANALYTICS IN A MODERN WORLD

13 #SQLSAT454 Traditional analytics  Everything around us produces data  From devices, sensors, infrastructures and applications  Traditional Business Intelligence first collects data and analyzes it afterwards  Typically 1 day latency: analysis happens the day after  But we live in a fast-paced world  Social media  Internet of Things  Just-in-time production  Offline data quickly loses its value  For many organizations, capturing and storing event data for later analysis is no longer enough  Data at Rest

14 #SQLSAT454 Analytics in a modern world  We work with streaming data  We want to monitor and analyze data in near real time  Typically a few seconds up to a few minutes latency  So we don’t have time to stop, copy the data and analyze it: we have to work with streams of data  Data in Motion

15 #SQLSAT454 Scenarios  Real-time ingestion, processing and archiving of data  Real-time Analytics  Connected devices (Internet of Things)

16 #SQLSAT454 Why Stream Analytics in the Cloud?  Not all data is local  Event data is already in the Cloud  Event data is globally distributed  Bring the processing to the data, not the data to the processing

17 #SQLSAT454 Apply cloud principles  Focus on building solutions (PaaS or SaaS)  Without having to manage complex infrastructure and software  No hardware or other up-front costs and no time-consuming installation or setup  Elastic scale where resources are efficiently allocated and paid for as requested  Scale to any volume of data while still achieving high throughput, low latency and guaranteed resiliency  Up and running in minutes

18 #SQLSAT454 SCENARIO demo

19 #SQLSAT454 An API can be a “thing”  API Apps, Logic Apps  World-wide distributed API (REST)  Resource consuming (CPU, storage, network bandwidth)  Each request is logged  With Event Hub or in log files  Evaluate how the API is performing  “Real time” statistics  E.g. ASP.NET apps log directly to Event Hub

20 #SQLSAT454 INTRODUCTION TO AZURE STREAM ANALYTICS

21 #SQLSAT454 What is Azure Stream Analytics?  Azure Stream Analytics is a cost-effective event processing engine  Developers describe their desired transformations in a SQL-like syntax  It is a stream processing engine that integrates with a scalable event queuing system like Azure Event Hubs

22 #SQLSAT454 Canonical Stream Analytics Pattern

23 #SQLSAT454 Real-time analytics  Intake millions of events per second (up to 1 GB/s)  Scale that accommodates variable loads  Low processing latency, auto adaptive (sub-second to seconds)  Transform, augment, correlate, temporal operations  Correlate between different streams, or with reference data  Find patterns or lack of patterns in data in real time

24 #SQLSAT454 No challenges with scale  Elasticity of the cloud for scale out  Spin up any number of resources on demand  Scale from small to large when required  Distributed, scale-out architecture

25 #SQLSAT454 Fully managed  No hardware (PaaS offering)  Bypasses deployment expertise  No software provisioning and maintaining  No performance tuning  Spin up any number of resources on demand  Expand your business globally leveraging Azure regions

26 #SQLSAT454 Mission critical availability  Guaranteed event delivery  Guaranteed not to lose events or produce incorrect output  Guaranteed “once and only once” delivery of events  Ability to replay events  Guaranteed business continuity  Guaranteed uptime (three nines of availability)  Auto-recovery from failures  Built-in state management for fast recovery  Effective audits  Privacy and security properties of solutions are evident  Azure integration for monitoring and ops alerting

27 #SQLSAT454 Lower costs  Efficiently pay only for usage  Architected for multi-tenancy  Not paying for idle resources  Typical cloud expense model  Low startup costs  Ability to incrementally add resources  Reduce costs when business needs change

28 #SQLSAT454 Rapid development  SQL-like language  High-level: focus on the stream analytics solution  Concise: less code to maintain  First-class support for event streams and reference data  Built-in temporal semantics  Built-in temporal windowing and joining  Simple policy configuration to manage out-of-order events and late arrivals

29 #SQLSAT454 Azure Stream Analytics: Collect → Process → Deliver → Consume  Event Inputs: Event Hub, Azure Blob  Transform: temporal joins, filter, aggregates, projections, windows, etc.  Enrich, correlate  Outputs: SQL Azure, Azure Blobs, Event Hub, Service Bus Queue, Service Bus Topics, Table storage, PowerBI  Reference Data: Azure Blob  Guarantees: temporal semantics, guaranteed delivery, guaranteed up time

30 #SQLSAT454 Input sources for a Stream Analytics Job  Currently supported input Data Streams are Azure Event Hub, Azure IoT Hub and Azure Blob Storage  Multiple input Data Streams are supported  Advanced options let you configure how the job will read data from the input blob (which folders to read from, when a blob is ready to be read, etc.)  Reference data is usually static or changes very slowly over time  Must be stored in Azure Blob Storage  Cached for performance

31 #SQLSAT454 Defining Event Schema  The serialization format and the encoding for the input data sources (both Data Streams and Reference Data) must be defined  Currently three formats are supported: CSV, JSON and Avro (binary JSON: https://avro.apache.org/docs/1.7.7/spec.html)  For the CSV format a number of common delimiters are supported: comma (,), semicolon (;), colon (:), tab and space  For CSV and Avro you can optionally provide the schema for the input data

32 #SQLSAT454 Output for Stream Analytics Jobs  Currently supported output data stores:  Azure Blob storage: creates log files with temporal query results; ideal for archiving  Azure Table storage: more structured than blob storage, easier to set up than SQL database, and durable (in contrast to Event Hub)  SQL database: stores results in an Azure SQL Database table; ideal as a source for traditional reporting and analysis  Event Hub: sends an event to an event hub; ideal to generate actionable events such as alerts or notifications  Service Bus Queue: sends an event to a queue; ideal for sending events sequentially  Service Bus Topics: sends an event to subscribers; ideal for sending events to many consumers  PowerBI.com: ideal for near real-time reporting!  DocumentDb: ideal if you work with JSON and object graphs

33 #SQLSAT454 PREPARATION demo

34 #SQLSAT454 STREAM ANALYTICS QUERY LANGUAGE (SAQL)

35 #SQLSAT454 SAQL – Language & Library

36 #SQLSAT454 Supported types
Type | Description
bigint | Integers in the range -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)
float | Floating point numbers in the range -1.79E+308 to -2.23E-308, 0, and 2.23E-308 to 1.79E+308
nvarchar(max) | Text values, comprised of Unicode characters. Note: a value other than max is not supported
datetime | A date combined with a time of day with fractional seconds, based on a 24-hour clock and relative to UTC (time zone offset 0)
Inputs will be cast into one of these types  We can control these types with a CREATE TABLE statement: this does not create a table, but just a data type mapping for the inputs
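A minimal sketch of such a mapping, assuming a hypothetical input named InputStream whose payload carries three fields:

```sql
-- Hypothetical field names: this maps the payload of the input
-- named InputStream onto SAQL types. It does not create a table;
-- it only controls how incoming values are cast.
CREATE TABLE InputStream (
    DeviceId nvarchar(max),
    Temperature float,
    EventTime datetime
);
```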

37 #SQLSAT454 INTO clause  Pipelines data from input to output  Without an INTO clause we write to the destination named ‘output’  We can have multiple outputs  With the INTO clause we can choose the appropriate destination for every SELECT  E.g. send events to blob storage for big data analysis, but send special events to an event hub for alerting SELECT UserName, TimeZone INTO Output FROM InputStream WHERE Topic = 'XBox'

38 #SQLSAT454 WHERE clause  Specifies the conditions for the rows returned in the result set for a SELECT statement, query expression, or subquery  There is no limit to the number of predicates that can be included in a search condition. SELECT UserName, TimeZone FROM InputStream WHERE Topic = 'XBox'

39 #SQLSAT454 JOIN  We can combine multiple event streams or an event stream with reference data via a join (inner join) or a left outer join  In the join clause we can specify the time window in which we want the join to take place  We use a special version of DateDiff for this
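The shape of such a join, sketched with hypothetical entry/exit streams and columns; note that this special DATEDIFF takes the stream aliases, not columns, as its second and third arguments:

```sql
-- Sketch: pair each entry event with an exit event for the same
-- license plate occurring within the following 5 minutes.
SELECT E.LicensePlate, E.EntryTime, X.ExitTime
FROM EntryStream E TIMESTAMP BY EntryTime
JOIN ExitStream X TIMESTAMP BY ExitTime
    ON E.LicensePlate = X.LicensePlate
    AND DATEDIFF(minute, E, X) BETWEEN 0 AND 5
```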

40 #SQLSAT454 Reference Data  Seamless correlation of event streams with reference data  Static or slowly-changing data stored in blobs  CSV and JSON files in Azure Blobs, scanned for new snapshots on a settable cadence  JOIN (INNER or LEFT OUTER) between streams and reference data sources  Reference data appears like another input: SELECT myRefData.Name, myStream.Value FROM myStream JOIN myRefData ON myStream.myKey = myRefData.myKey

41 #SQLSAT454 Reference data tips  Currently reference data cannot be refreshed automatically  You need to stop the job and specify a new snapshot with the reference data  Reference data can only live in Blob storage  In practice you use services like Azure Data Factory to move data from Azure data sources to Azure Blob Storage  Have you followed Francesco Diaz’s session?

42 #SQLSAT454 UNION
SELECT TollId, ENTime AS Time, LicensePlate FROM EntryStream TIMESTAMP BY ENTime
UNION
SELECT TollId, EXTime AS Time, LicensePlate FROM ExitStream TIMESTAMP BY EXTime
EntryStream:
TollId | EntryTime | LicensePlate | …
1 | 2014-09-10 12:01:00.000 | JNB 7001 | …
1 | 2014-09-10 12:02:00.000 | YXZ 1001 | …
3 | 2014-09-10 12:02:00.000 | ABC 1004 | …
ExitStream:
TollId | ExitTime | LicensePlate
1 | 2009-06-25 12:03:00.000 | JNB 7001
1 | 2009-06-25 12:03:00.000 | YXZ 1001
3 | 2009-06-25 12:04:00.000 | ABC 1004
Result:
TollId | Time | LicensePlate
1 | 2014-09-10 12:01:00.000 | JNB 7001
1 | 2014-09-10 12:02:00.000 | YXZ 1001
3 | 2014-09-10 12:02:00.000 | ABC 1004
1 | 2009-06-25 12:03:00.000 | JNB 7001
1 | 2009-06-25 12:03:00.000 | YXZ 1001
3 | 2009-06-25 12:04:00.000 | ABC 1004

43 #SQLSAT454 STORING, FILTERING AND DECODING demo

44 #SQLSAT454 HANDLING TIME IN AZURE STREAM ANALYTICS

45 #SQLSAT454 Traditional queries  Traditional querying assumes the data doesn’t change while you are querying it: you query a fixed state  If the data is changing, snapshots and transactions ‘freeze’ the data while we query it  Since we query a finite state, our query should finish in a finite amount of time  (diagram: table → query → result table)

46 #SQLSAT454 A different kind of query  When analyzing a stream of data, we deal with a potentially infinite amount of data  As a consequence our query will never end!  To solve this problem most queries will use time windows  (diagram: stream → temporal query → result stream)

47 #SQLSAT454 Arrival Time vs Application Time  Every event that flows through the system comes with a timestamp that can be accessed via System.Timestamp  This timestamp can either be the arrival time or an application time, which the user can specify in the query  A record can have multiple timestamps associated with it  The arrival time has different meanings based on the input source  For events from Azure Service Bus Event Hub, the arrival time is the timestamp given by the Event Hub  For Blob storage, it is the blob’s last modified time  If the user wants to use an application time, they can do so using the TIMESTAMP BY keyword  Data is sorted by the timestamp column
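A sketch of TIMESTAMP BY, assuming a hypothetical SensorStream whose payload carries a ReadingTime field:

```sql
-- Sketch: aggregate on the application time embedded in the event
-- (ReadingTime) rather than the arrival time assigned by Event Hub.
SELECT DeviceId, AVG(Temperature) AS AvgTemp
FROM SensorStream TIMESTAMP BY ReadingTime
GROUP BY DeviceId, TumblingWindow(minute, 1)
```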

48 #SQLSAT454 Temporal Joins
SELECT Make
FROM EntryStream ES TIMESTAMP BY EntryTime
JOIN ExitStream EX TIMESTAMP BY ExitTime
ON ES.Make = EX.Make AND DATEDIFF(second, ES, EX) BETWEEN 0 AND 10
(diagram: toll entry and toll exit events for Mazda, BMW, Honda and Volvo on a 0–25 second timeline)

49 #SQLSAT454 Windowing Concepts  Common requirement to perform some set-based operation (count, aggregation, etc.) over events that arrive within a specified period of time  GROUP BY returns data aggregated over a certain subset of data  How to define a subset in a stream? Windowing functions!  Each GROUP BY requires a windowing function  (diagram: events at t1–t6 grouped into three time windows, summed by an aggregate function into output events)

50 #SQLSAT454 Three types of windows  Every window operation outputs events at the end of the window  The output of the window is a single event based on the aggregate function used; the event has the timestamp of the window  All windows have a fixed length  Tumbling window: aggregate per time interval  Hopping window: scheduled overlapping windows  Sliding window: window constantly re-evaluated  (diagram: events at t1–t6 grouped into three time windows, summed by an aggregate function into output events)

51 #SQLSAT454 Tumbling Window  Tumbling windows: repeat, are non-overlapping  An event can belong to only one tumbling window  Query: count the total number of vehicles entering each toll booth every interval of 20 seconds
SELECT TollId, COUNT(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 20)
(diagram: an event timeline split into 10-second and 20-second tumbling windows)

52 #SQLSAT454 Hopping Window  Hopping windows: repeat, can overlap, hop forward in time by a fixed period  Same as a tumbling window if hop size = window size  Events can belong to more than one hopping window  Query: count the number of vehicles entering each toll booth every interval of 20 seconds; update results every 10 seconds
SELECT COUNT(*), TollId
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(second, 20, 10)
(diagram: 20-second hopping windows with a 10-second hop, and 10-second hopping windows with a 5-second hop)

53 #SQLSAT454 Sliding Window  Sliding window: continuously moves forward by an ε (epsilon)  Produces an output only on the occurrence of an event  Every window has at least one event  Events can belong to more than one sliding window  Query: find all the toll booths which have served more than 10 vehicles in the last 20 seconds
SELECT TollId, COUNT(*)
FROM EntryStream ES
GROUP BY TollId, SlidingWindow(second, 20)
HAVING COUNT(*) > 10
(diagram: 20-second sliding windows re-evaluated as each event arrives)

54 #SQLSAT454 TEMPORAL TASKS demo

55 #SQLSAT454 SCALING ANALYTICS

56 #SQLSAT454 Streaming Unit  A measure of the computing resources available for processing a job  A streaming unit can process up to 1 MB/s  By default every job consists of 1 streaming unit  The total number of streaming units that can be used depends on:  the rate of incoming events  the complexity of the query

57 #SQLSAT454 Multiple steps, multiple outputs  A query can have multiple steps to enable pipelined execution  A step is a sub-query defined using WITH (“common table expression”)  The only query outside of the WITH keyword is also counted as a step  Can be used to develop complex queries more elegantly by creating an intermediate named result  Each step’s output can be sent to multiple output targets using INTO
WITH Step1 AS (
    SELECT COUNT(*) AS CountTweets, Topic
    FROM TwitterStream PARTITION BY PartitionId
    GROUP BY TumblingWindow(second, 3), Topic, PartitionId
),
Step2 AS (
    SELECT AVG(CountTweets)
    FROM Step1
    GROUP BY TumblingWindow(minute, 3)
)
SELECT * INTO Output1 FROM Step1
SELECT * INTO Output2 FROM Step2
SELECT * INTO Output3 FROM Step2

58 #SQLSAT454 Scaling Concepts – Partitions  When a query is partitioned, input events are processed and aggregated in separate partition groups  Output events are produced for each partition group  To read from Event Hubs, ensure that the number of partitions matches  The query within the step must have the PARTITION BY keyword  If your input is a partitioned event hub, we can write partitioned queries and partitioned subqueries (WITH clause)  A non-partitioned query with a 3-fold partitioned subquery can have (1 + 3) × 6 = 24 streaming units!
SELECT COUNT(*) AS Count, Topic
FROM TwitterStream PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), Topic, PartitionId
(diagram: a partitioned Event Hub feeding three parallel query results)

59 #SQLSAT454 Out of order inputs  Event Hub guarantees monotonicity of the timestamp on each partition of the Event Hub  All events from all partitions are merged by timestamp order; there will be no out-of-order events  When it’s important for you to use the sender’s timestamp, a timestamp from the event payload is chosen using TIMESTAMP BY, and then several sources of disorder can be introduced:  producers of the events have clock skews  network delay from the producers sending the events to Event Hub  clock skews between Event Hub partitions  Do we skip them (drop) or do we pretend they happened just now (adjust)?

60 #SQLSAT454 Handling out of order events  On the configuration tab, you will find the following defaults  Using 0 seconds as the out-of-order tolerance window means you assert all events are in order all the time  To allow ASA to correct the disorder, you can specify a non-zero out-of-order tolerance window size  ASA will buffer events up to that window and reorder them using the user-chosen timestamp before applying the temporal transformation  Because of the buffering, the side effect is that the output is delayed by the same amount of time  As a result, you will need to tune the value to reduce the number of out-of-order events and keep the latency low

61 #SQLSAT454 STRUCTURING AND SCALING QUERY demo

62 #SQLSAT454 CONCLUSIONS

63 #SQLSAT454 Summary  Azure Stream Analytics is the PaaS solution for analytics on streaming data  It is programmable with a SQL-like language  Handling time is a special and central feature  Scale with cloud principles: elastic, self-service, multitenant, pay per use  More questions:  Other solutions  Pricing  What to do with that data?  Futures

64 #SQLSAT454 Microsoft real-time stream processing options

65 #SQLSAT454 Apache Storm (in HDInsight)  Apache Storm is a distributed, fault-tolerant, open source real-time event processing solution  Storm was originally used by Twitter to process massive streams of data from the Twitter firehose  Today, Storm is a project of the Apache Software Foundation  Typically, Storm will be integrated with a scalable event queuing system like Apache Kafka or Azure Event Hubs

66 #SQLSAT454 Stream Analytics vs Apache Storm  Storm:  Data Transformation  Can handle more dynamic data (if you're willing to program)  Requires programming  Stream Analytics  Ease of Setup  JSON and CSV format only  Can change queries within 4 minutes  Only takes inputs from Event Hub, Blob Storage  Only outputs to Azure Blob, Azure Tables, Azure SQL, PowerBI

67 #SQLSAT454 Pricing  Pricing is based on volume per job:  volume of data processed  streaming units required to process the data stream  Volume of Data Processed: volume of data processed by the streaming job (in GB), €0.0009 per GB  Streaming Unit (a blended measure of CPU, memory, throughput): €0.0262 per hour (about €18.86 per month)

68 #SQLSAT454 Azure Machine Learning  Understand the “sequence” of data in the history to predict the future  Azure can ‘learn’ which values preceded issues  Azure Machine Learning

69 #SQLSAT454 Power BI  Solutions to create real-time dashboards  SaaS service  Inside Office 365

70 #SQLSAT454 Futures  [started]  Native integration with Azure Machine Learning  Provide better ways to debug.  [planned]  Call to a REST endpoint to invoke custom code  [under review]  Take input from DocumentDb

71 #SQLSAT454 Thanks  Don’t forget to compile the evaluation form here: http://speakerscore.com/sqlsat454  Marco Parenzan  http://twitter.com/marco_parenzan  http://www.slideshare.net/marcoparenzan  http://www.github.com/marcoparenzan

72 #SQLSAT454 #sqlsat454

