Pulsar Realtime Analytics At Scale Tony Ng, Sharad Murthy June 11, 2015.

Big Data Trends
Bigger data volumes
More data sources
–DBs, logs, behavioral & business event streams, sensors …
Faster analysis
–Next day to hours to minutes to seconds
Newer processing models
–MR, in-memory, stream processing, Lambda …

What is Pulsar
Open-source real-time analytics platform and stream processing framework

Business Needs for Real-time Analytics
Near real-time insights
React to user activities or events within seconds
Examples:
–Real-time reporting and dashboards
–Business activity monitoring
–Personalization
–Marketing and advertising
–Fraud and bot detection
[Diagram labels: Users Interact with Apps, Collect Events, Analyze & Generate Insights, Optimize App Experience]

Systemic Quality Requirements
Scalability
–Scale to millions of events / sec
Latency
–<1 sec delivery of events
Availability
–No downtime during upgrades
–Disaster recovery support across data centers
Flexibility
–User driven complex processing rules
–Declarative definition of pipeline topology and event routing
Data Accuracy
–Should deal with missing data
–99.9% delivery guarantee

Pulsar Real-time Analytics
Complex Event Processing (CEP): SQL on stream data
Custom sub-stream creation: Filtering and Mutation
In Memory Aggregation: Multi Dimensional counting

Pulsar Real-time Analytics Pipeline
[Diagram labels: Producing Applications, Real Time Data Pipeline, Collector, BOT Detector, Sessionizer, Event Distributor, Metrics Calculator, Enriched Events, Enriched Sessionized Events, Metrics Store, Real Time Dashboard, Real Time Metrics & Alert Consumer, Other Real Time Data Clients, Kafka, HDFS, Batch Loader, Batch Pipeline]

Pipeline Data
[Diagram labels: Producing Applications, Real Time Data Pipeline, Collector, Sessionizer, Event Distributor, Metrics Calculator, Metrics Store, Real Time Dashboard, Other Real Time Data Clients, Kafka, HDFS, Batch Loader, Batch Pipeline]
Annotations on the diagram:
–100,000+
–Unstructured
–Avg Payload – 3000 bytes
–Peak 300,000 to events/sec
–Billion sessions
–Mutated Streams
–Avg latency < 100 millisecond
–Event Enrichment
–8 – 10 Billion events/day
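
As a rough sanity check on the volumes above, the daily event count can be converted to an average per-second rate and an average ingest bandwidth. This is a minimal sketch that assumes traffic spread evenly over an 86,400-second day and uses the 3000-byte average payload from the slide. The function names are illustrative and not part of Pulsar.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def avg_events_per_sec(events_per_day: float) -> float:
    """Average event rate implied by a daily volume, assuming a uniform day."""
    return events_per_day / SECONDS_PER_DAY


def avg_bytes_per_sec(events_per_sec: float, payload_bytes: int = 3000) -> float:
    """Average ingest bandwidth for a given rate and average payload size."""
    return events_per_sec * payload_bytes


low = avg_events_per_sec(8e9)    # about 92,600 events/sec
high = avg_events_per_sec(10e9)  # about 115,700 events/sec
print(f"{low:,.0f} to {high:,.0f} events/sec on average")
print(f"{avg_bytes_per_sec(low) / 1e6:,.0f} to {avg_bytes_per_sec(high) / 1e6:,.0f} MB/sec")
```

The result is on the order of 100,000 events/sec, which fits the "100,000+" annotation if that figure refers to the average event rate. The transcript does not say so explicitly, so treat that reading as an assumption.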

Pulsar Framework Building Block (CEP Cell)
Event = Tuples (K,V) – Mutable
Abstractions: Channels, Processors, Pipelining, Monitoring
Declarative Pipeline Stitching
Channels: Cluster Messaging, File, REST, Kafka, Custom
Event Processor: Esper, RateLimiter, RoundRobinLB, PartitionedLB, Custom
[Diagram labels: JVM, Spring Container, Inbound Channel-1, Inbound Channel-2, Processor-1, Processor-2, Outbound Channel]

Multi Stage Distributed Pipeline

Pulsar Deployment Architecture

Availability And Scalability
Elastic Clusters
Self Healing
Pipeline Flow Control
Datacenter failovers
Dynamic Partitioning
–Consistent Hashing
Rate Limiting
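
The slide names consistent hashing as the mechanism behind dynamic partitioning but gives no detail. The following is a minimal sketch of a generic consistent-hash ring with virtual nodes, not Pulsar's actual implementation. The class name, the MD5 hash, the `vnodes` parameter and the node names are all assumptions made for illustration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # Each node owns `vnodes` points so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [point for point in self._ring if point[1] != node]

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("session-42")  # stable for as long as its owner stays
```

The property that matters for an elastic cluster is that removing a node only remaps the keys that node owned, and adding a node only takes keys over to the new node. Every other key keeps its owner, so state tied to a partition key, such as a session, stays where it is.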

Messaging Models
Push Model (At most once delivery semantics)
Pull Model (At least once delivery semantics)
Hybrid Model
[Diagram labels: Producer, Queue, Netty, Kafka, Replayer, Pause/Resume, Consumer]

Event Filtering and Routing Example
insert into SUBSTREAM
select roundTS(timestamp) as ts, D1, D2, D3, D4
from RAWSTREAM
where D1 = or D2 = or D3 = or D4 = ;
// // publish sub stream = D1); // partition key based on column D1
select * FROM SUBSTREAM;
[Diagram labels: CEP, OutboundMessaging, Topic1, Topic2, OutboundKafkaChannel]
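
The statement above filters the raw stream on four dimensions, projects a rounded timestamp plus those dimensions, and publishes the result partitioned by D1. The same logic can be sketched in plain Python. This is only an illustration: the comparison values are missing from the transcript, so the `wanted` values used below are hypothetical placeholders, and the behaviour of `roundTS` (assumed here to round down to a one-second bucket) and the CRC32-based partition choice are assumptions as well.

```python
import zlib

DIMENSIONS = ("D1", "D2", "D3", "D4")


def round_ts(timestamp_ms: int, bucket_ms: int = 1000) -> int:
    """Round a timestamp down to its bucket (assumed behaviour of roundTS)."""
    return timestamp_ms - timestamp_ms % bucket_ms


def to_substream(event: dict, wanted: dict):
    """Return the projected event if any dimension matches, else None."""
    if any(event.get(dim) == value for dim, value in wanted.items()):
        projected = {"ts": round_ts(event["timestamp"])}
        projected.update({dim: event.get(dim) for dim in DIMENSIONS})
        return projected
    return None


def partition_for(event: dict, partitions: int) -> int:
    """Stable partition choice keyed on column D1."""
    return zlib.crc32(str(event["D1"]).encode("utf-8")) % partitions


# Hypothetical filter values; the real ones are not in the transcript.
wanted = {"D1": "v1", "D2": "v2", "D3": "v3", "D4": "v4"}
raw = {"timestamp": 1234567, "D1": "v1", "D2": "x", "D3": "y", "D4": "z", "ua": "..."}
sub = to_substream(raw, wanted)
if sub is not None:
    print(sub, "-> partition", partition_for(sub, 8))
```

Keying the partition on D1 means all events with the same D1 value reach the same downstream consumer, which is what lets a later stage aggregate on that column without cross-node coordination.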

Aggregate Computation Example
// create 10-second time window context
create context MCContext end pattern [timer:interval(10)];
// aggregate event count along dimension D1 and D2 within specified time window
context MCContext insert into AGGREGATE
select count(*) as METRIC1, D1, D2 FROM SUBSTREAM
group by D1,D2 output snapshot
Select * from AGGREGATE;
[Diagram labels: SUBSTREAM, CEP, OutboundMessaging, OutboundKafkaChannel, Kafka, DRUID]
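
The aggregation above counts events per (D1, D2) pair inside consecutive 10-second windows. A minimal sketch of that computation follows, under two simplifying assumptions that differ from a streaming engine: events carry a `ts` field in seconds, and the whole batch is available up front rather than arriving continuously. The function name and field names other than D1, D2 and METRIC1 are illustrative.

```python
from collections import Counter, defaultdict


def tumbling_counts(events, window_sec=10):
    """Count events per (D1, D2) inside each fixed, non-overlapping window."""
    windows = defaultdict(Counter)
    for event in events:
        window_start = int(event["ts"] // window_sec) * window_sec
        windows[window_start][(event["D1"], event["D2"])] += 1
    # One output row per window and dimension pair, like a snapshot per window.
    return [
        {"window_start": start, "D1": d1, "D2": d2, "METRIC1": count}
        for start in sorted(windows)
        for (d1, d2), count in sorted(windows[start].items())
    ]


events = [
    {"ts": 1, "D1": "chrome", "D2": "us"},
    {"ts": 9, "D1": "chrome", "D2": "us"},
    {"ts": 12, "D1": "firefox", "D2": "de"},
]
for row in tumbling_counts(events):
    print(row)
```

Each output row has the shape a metrics store expects: a time bucket, the dimension values, and one metric.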

TopN Computation Example
TopN computation can be expensive with high cardinality dimensions
Consider approximate algorithms
–sacrifice a little accuracy for better space and time complexity
Implemented as aggregate functions, e.g.
select topN(100, 10, D1 ||','||D2 ||','||D3) as topn from RawEventStream;
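
The slide does not say which approximate algorithm sits behind `topN`, nor what its numeric arguments mean. As one example of the trade-off it describes, here is a minimal sketch of the Space-Saving heavy-hitter algorithm, which keeps a fixed number of counters regardless of how many distinct keys appear. The class name and the `capacity` and `n` parameters are illustrative and are not taken from Pulsar.

```python
class SpaceSaving:
    """Space-Saving sketch: tracks at most `capacity` keys."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts = {}  # key -> estimated count (never an undercount)
        self.errors = {}  # key -> maximum possible overcount

    def add(self, key, weight: int = 1) -> None:
        if key in self.counts:
            self.counts[key] += weight
        elif len(self.counts) < self.capacity:
            self.counts[key] = weight
            self.errors[key] = 0
        else:
            # Evict the current minimum; the newcomer inherits its count.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.errors.pop(victim)
            self.counts[key] = floor + weight
            self.errors[key] = floor

    def top(self, n: int):
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


sketch = SpaceSaving(capacity=5)
for key in ["a"] * 50 + ["b"] * 30 + [f"rare-{i}" for i in range(20)]:
    sketch.add(key)
print(sketch.top(2))
```

Memory is bounded by `capacity` instead of by the number of distinct keys. The cost is that the counts of rarely seen keys may be overestimated by up to the value recorded in `errors`. The linear scan for the minimum keeps this sketch short; a production version would normally use a structure that finds the minimum in constant time.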

Pulsar Integration with Druid
Druid
–Real-time ROLAP engine for aggregation, drill-down and slice-n-dice
Pulsar leveraging Druid
–Real-time analytics dashboard
–Near real-time metrics like number of visitors in the last 5 minutes, refreshing every 10 seconds
–Aggregate/drill-down on dimensions such as browser, OS, device, geo location
[Diagram labels: Collector, Sessionizer, Distributor, Metrics Calculator, Kafka, DRUID Ingest, DRUID, Real Time Pipeline, adhoc queries]

Key Differentiators
Declarative Topology Management
Streaming SQL with hot deployment of SQL
Elastic clustering with flow control in the cloud
Dynamic partitioning of clusters
Hybrid messaging model
–Combo of push and pull
< 100 millisecond pipeline latency
99.99% Availability
< 0.01% steady state data loss

Future Development and Open Source
Real-time reporting API and dashboard
Integration with Druid and other metrics stores
Session store scaling to 1 million insert/update per sec
Rolling window aggregation over long time windows (hours or days)
GROK filters for log processing
Anomaly detection

More Information
GitHub:
–repos: pipeline, framework, docker files
Website:
–Technical whitepaper
–Getting started
–Documentation
Google group: