1 Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Design Patterns For Real Time Streaming Analytics 19 Feb 2015 Sheetal Dolas Principal Architect, Hortonworks

2 Who am I?
- Principal Architect @ Hortonworks
- Most of my career has been in the field, solving real-life business problems
- Last 5+ years in Big Data, including Hadoop, Storm etc.
- Co-developed Cisco OpenSOC ( http://opensoc.github.io )
- sheetal@hortonworks.com, @sheetal_dolas

3 Agenda
- Streaming Architectural Patterns - Overview
- Design Patterns
  o What
  o Why
  o Illustrations
- QA

4 Streaming Architectural Patterns

5 Real Time Streaming Architecture
- Source Systems: Syslog, Machine Data, External Streams, Other
- Data Collection (Flume / Custom): Agent A, Agent B, Agent N
- Messaging System (Kafka): Topic A, Topic B, Topic N
- Real Time Processing (Storm): Topology A, Topology B, Topology N
- Storage: Search (Elastic Search / Solr), Low Latency NoSql (HBase), Historic (Hive / HDFS)
- Access: Web Services (REST API), Web Apps, Analytic Tools (R / Python), BI Tools, Alerting Systems

6 Lambda Architecture
- New Data: Data Stream
- Batch Layer: All Data, Pre-compute Views
- Speed Layer: Stream Processing, Real Time View
- Serving Layer: Batch View
- Data Access: Query
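
The Lambda layers above can be illustrated with a minimal plain-Python sketch. It assumes a simple event-count use case; the function names and the sample events are hypothetical and are not taken from the slides.

```python
from collections import Counter


def build_batch_view(all_events):
    """Batch layer: recompute counts from the complete event history."""
    return Counter(event["key"] for event in all_events)


def update_realtime_view(view, event):
    """Speed layer: incrementally count events not yet covered by a batch run."""
    view[event["key"]] += 1


def query(key, batch_view, realtime_view):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Hypothetical data: three events already in the batch store, one still in flight.
historic = [{"key": "login"}, {"key": "login"}, {"key": "purchase"}]
recent = [{"key": "login"}]

batch = build_batch_view(historic)
realtime = Counter()
for event in recent:
    update_realtime_view(realtime, event)
```

In this sketch a query for "login" sees both the pre-computed batch count and the events that arrived after the last batch run.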

7 Kappa Architecture
- Data Source: Data Stream
- Stream Processing System: Job Version n, Job Version n + 1
- Serving DB: Output table n, Output table n + 1
- Data Access: Query
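
One way to read the Kappa diagram: when processing logic changes, job version n + 1 replays the retained stream into a new output table, and reads switch over once it has caught up. The plain-Python sketch below illustrates that idea under stated assumptions; the log contents, table names and both job functions are hypothetical.

```python
def run_job(log, transform):
    """Replay the retained stream from the start and build an output table."""
    table = {}
    for event in log:
        key, value = transform(event)
        table[key] = table.get(key, 0) + value
    return table


def job_version_n(event):
    """Job version n: sum amounts per user."""
    user, amount = event
    return user, amount


def job_version_n_plus_1(event):
    """Job version n + 1: count events per user (a hypothetical logic change)."""
    user, _amount = event
    return user, 1


# Hypothetical retained stream of (user, amount) events.
log = [("alice", 10), ("bob", 5), ("alice", 7)]

tables = {"output_n": run_job(log, job_version_n)}
serving = "output_n"

# Deploy the new job alongside the old one, then switch reads and drop the old table.
tables["output_n_plus_1"] = run_job(log, job_version_n_plus_1)
serving = "output_n_plus_1"
del tables["output_n"]
```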

8 Design Patterns

9 Design Pattern – What is it?
A general reusable solution to a commonly occurring problem within a given context in software design.
- Solution: Reusable
- Problem: Commonly Occurring
- Software Design: Contextual

10 Design Patterns – Why?
- Streaming use cases have distinct characteristics
  o Unpredictable incoming data patterns
  o Correlating multiple streams
  o Out-of-sequence and late events
- High scale and continuous streams pose new challenges
  o Peaks and valleys
  o Changing data characteristics over a period of time
  o Maintaining the latency and throughput SLAs

11 Streaming Patterns
- Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture
- Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows
- Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events
- Stream Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication

12 Streaming Patterns – Being Discussed
- Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture
- Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows
- Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events
- Stream Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication

13 External Lookup
Dynamic, High Speed Enrichments With External Data Lookup

14 External Lookup - Description
Referencing frequently changing external system data for event enrichments, filters or validations, while minimizing event processing latencies and system bottlenecks and maintaining high throughput.

15 External Lookup - Challenges
- Increased latency due to frequent external system calls
- Insufficient memory to hold all reference data in memory
- Scalability and performance issues with large reference data sets
- Dynamic reference data needs frequent cache purges and refreshes
- External systems can become a bottleneck

16 External Lookup – Potential Options
- Criteria: Performance, Scalability, Fault Tolerance
- Options: Always Fetch, Cache Everything, Partition and Cache on the go

17 External Lookup - A Reference Use Case
Real Time Credit Card Fraud Identification and Alert
  o Credit card transaction data comes in as a stream (typically through Kafka)
  o An external system holds information about the card holder's recent location
  o Each credit card transaction is looked up against the user's current location
  o If the geographic distance between the credit card transaction location and the user's recent known location is significant, the transaction is flagged as potential fraud
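
The distance check in this use case can be sketched with the haversine great-circle formula. This is an illustration only: the slides do not say how distance is computed, and the 500 km threshold, the function names and the (latitude, longitude) tuple layout are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_potential_fraud(txn_location, user_location, threshold_km=500.0):
    """Flag a transaction whose location is far from the user's last known location.

    threshold_km is a hypothetical cut-off, not a value from the presentation.
    """
    return haversine_km(*txn_location, *user_location) > threshold_km


# Hypothetical coordinates (latitude, longitude).
new_york = (40.7128, -74.0060)
newark = (40.7357, -74.1724)
san_francisco = (37.7749, -122.4194)
```

With these inputs, a card used in San Francisco while the holder was last seen in New York is flagged, while a transaction in Newark is not.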

18 External Lookup - Topology Overview
- Flow (Storm): Source Stream -> Credit Card Transaction Spout -> Partitioner Bolt -> Fraud Analyzer Bolt -> Alerting System (Fraud Alert Email)
- Partitioner Bolt: partitions data based on the area code of the mobile numbers
- Fraud Analyzer Bolt: looks up the user's current location from the external system and finds the geo distance between transaction location and user location. Locally caches the user location data; cache validity is time bound
- External Reference Data: User Location Information

19 External Lookup - Peek in the Bolts
- Partitioner Bolt Instances 1, 2, ... n: stream is partitioned based on area code
- Fraud Analyzer Bolt Instance 1: CA, NV, TX
- Fraud Analyzer Bolt Instance 2: NY, CT, MA
- Fraud Analyzer Bolt Instance n: FL, NC, OH
- Each Fraud Analyzer Bolt instance keeps a local, time-sensitive cache (use a lightweight caching solution like Guava)
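
The "partition and cache on the go" idea can be sketched in plain Python. The slide recommends a lightweight cache such as Guava in the actual Java bolts; the `TimeBoundCache` class, the `partition_for` function and the CRC32-based routing below are hypothetical stand-ins rather than Storm or Guava APIs.

```python
import time
import zlib


class TimeBoundCache:
    """Loads values on demand and treats them as stale after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # function that calls the external system
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # fresh hit: no external call
        value = self._loader(key)  # miss or expired: fetch and re-cache
        self._entries[key] = (value, now + self._ttl)
        return value


def partition_for(area_code, num_tasks):
    """Route all events with the same area code to the same analyzer task."""
    return zlib.crc32(area_code.encode("utf-8")) % num_tasks
```

Because every event for a given area code reaches the same task, each task's cache only ever holds the slice of reference data it needs, and a restarted task simply rebuilds its cache as events arrive.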

20 External Lookup - Benefits of the approach
- Only required data is cached (on demand)
- Each bolt caches only its partition of the reference data
- Data is cached locally, so trips to the external system are reduced
- Cache is time sensitive
- On-the-go cache building handles failures elegantly

21 External Lookup – Applicability
- Stream processing depends on external data
- External data is too large to be held in the memory of each task
- External data keeps changing
- External system has scalability limitations

22 Responsive Shuffling

23 Responsive Shuffling - Description
Automatically adjust shuffling for better performance and throughput during peaks and under varying data skews in streams.

24 Responsive Shuffling - Challenges
- Incoming data stream is unpredictable and can be skewed
- Skew can change from time to time
- Managing latency and throughput with skews is difficult
- Since streams flow continuously, restarting the topology with new shuffling logic is not practical

25 Shuffling – Potential Options
- Criteria: Latency & Throughput, System Reliability, Uptime
- Options: Static Shuffle, Responsive Shuffle

26 Responsive Shuffling - A Reference Use Case
Optimized HBase Inserts
  o Event data is stored in HBase after Storm processing
  o Group events so that a bolt can insert more events into HBase with fewer trips to region servers
  o Over a period of time, HBase regions can split/merge
  o Automatically adjust the event grouping as the HBase region layout changes over time

27 Example – HBase writes w/o responsive shuffling
- App Bolt Instances 1, 2, 3 (100 events each): 300 events sent
- HBase Bolt Instances 1, 2, 3 (100 events each): 300 events sent
- Region Server Instances 1, 2, 3 (100 events each): 300 events received
- 9 trips to region servers

28 Responsive Shuffling - Design

29 Example – HBase writes with responsive shuffling
- App Bolt Instances 1, 2, 3 (100 events each): 300 events sent
- RS Aware Partitioner: automatically adapts to splitting/merging HBase regions
- HBase Bolt Instances 1, 2, 3 (100 events each): 300 events sent
- Region Server Instances 1, 2, 3 (100 events each): 300 events received
- 3 trips to region servers
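
The 9-versus-3 trip counts in the two HBase examples can be reproduced with a small plain-Python simulation. It is a sketch under assumptions: the deck's own sample code did not survive in this transcript, so the `RegionAwarePartitioner` class, the region start keys and the row keys are hypothetical. In a real topology this logic would typically sit in a custom stream grouping that refreshes its view of the region layout.

```python
import bisect


class RegionAwarePartitioner:
    """Maps a row key to the region that owns it, given sorted region start keys."""

    def __init__(self, region_start_keys):
        self.update_layout(region_start_keys)

    def update_layout(self, region_start_keys):
        """Call when regions split or merge; no topology restart is needed."""
        self._starts = sorted(region_start_keys)

    def region_for(self, row_key):
        # Index of the last region whose start key is <= row_key.
        return max(bisect.bisect_right(self._starts, row_key) - 1, 0)


def count_trips(assignments):
    """One trip per distinct (bolt, region) pair, assuming each bolt batches per region."""
    return len(set(assignments))


# Hypothetical layout: three regions starting at "", "h" and "p"; 300 row keys.
partitioner = RegionAwarePartitioner(["", "h", "p"])
keys = [prefix + str(i) for prefix in ("a", "k", "t") for i in range(100)]

# Static shuffle: round-robin over 3 HBase bolts, ignoring the region layout.
static = [(i % 3, partitioner.region_for(key)) for i, key in enumerate(keys)]

# Responsive shuffle: each bolt receives only the keys of one region.
aware = [(partitioner.region_for(key) % 3, partitioner.region_for(key)) for key in keys]
```

With the static shuffle every bolt holds keys for all three regions (9 trips); with the region-aware shuffle each bolt talks to a single region server (3 trips).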

30 Responsive Shuffling – Sample Code
- In App Bolt
- In RS Aware Partitioner
- In Topology

31 Responsive Shuffling - Benefits
- Topology responds to changes in data patterns and adapts accordingly
- Maintains a high level of SLA and throughput adherence
- Minimizes the need for maintenance and hence downtime

32 Responsive Shuffling - Applicability
- Change in shuffle pattern does not impact the final outcome
- Data stream has varying skews
- Target/reference system specifications change over a period of time

33 Out-of-Sequence Events

34 Out-of-Sequence Events - Description
An out-of-sequence event is one that's received late, sufficiently late that you've already processed events that should have been processed after the out-of-sequence event was received.

35 Out-of-Sequence Events - Challenges
- Hard to determine whether all events in a given window have been received
- Late events need relevant reference data to be looked up again
- Builds more pressure on processing components
- Increased latency and degraded overall system performance

36 Out-of-Sequence Events – Potential Options
- Criteria: Latency, Result Accuracy, Operational Ease
- Options: Drop, Wait, Fan Out

37 Out-of-Sequence Events - Processing
- Source Spout -> Event Filter Bolt
- Event Filter Bolt: monitors the events currently being processed and identifies out-of-sequence events
- Ordered events -> Typical Processing Bolt
- Out-of-Sequence events -> Special Handling Bolt (depending on processing complexity, this can be extended into a separate topology)
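
A minimal plain-Python sketch of the event filter's routing decision follows. It assumes each event carries a numeric event time and that "out of sequence" means older than the newest event time seen so far, minus an optional tolerance; the `EventFilter` class, its `allowed_lateness` parameter and the stream names are hypothetical.

```python
class EventFilter:
    """Routes events to the ordered or out-of-sequence stream by event time."""

    def __init__(self, allowed_lateness=0):
        self._max_seen = None                  # newest event time processed so far
        self._allowed_lateness = allowed_lateness

    def route(self, event_time):
        late = (self._max_seen is not None
                and event_time < self._max_seen - self._allowed_lateness)
        if late:
            return "out_of_sequence"           # goes to the special handling bolt
        if self._max_seen is None or event_time > self._max_seen:
            self._max_seen = event_time
        return "ordered"                       # goes to the typical processing bolt


event_filter = EventFilter()
routes = [event_filter.route(t) for t in [1, 2, 5, 3, 6, 4]]
```

Keeping this check cheap and separate lets the ordered path run at full speed, while the special handling path can be scaled independently for the smaller volume of late events.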

38 Out-of-Sequence Events – Benefits
- Separation of concerns
- Maintains the overall throughput and latency requirements
- Independent scaling of components

39 Out-of-Sequence Events - Applicability
- The order of events matters
- Processing out-of-sequence events needs special and complex logic
- Stream has a relatively low volume of out-of-sequence events

40 Thank You! sheetal@hortonworks.com @sheetal_dolas

41 Appendix

42 Data Security in Kafka

43 Data Security in Kafka - Description
Ability to use Kafka as a secure data transfer mechanism. Apache Kafka is a widely used messaging platform in streaming applications. Unfortunately, Kafka does not have built-in support for authentication and authorization (yet).

44 Data Security in Kafka - Flow
- Source Systems: Sources (Syslog)
- Data Collection: Custom Collector, Encrypting Producer
- Messaging System: Kafka (Encrypted Messages)
- Real Time Processing: Storm (Kafka Spout, Decrypting Bolt, App Bolt)

45 Data Security in Kafka – Encryption Details
- Data Collection (Event Producer): Event -> encrypt event(s) w/ AES -> encrypt AES key w/ RSA -> Event(s) Envelope
- Event(s) Envelope: Encrypted AES Key (w/ RSA), Encrypted Event (w/ AES)
- Messaging System (Kafka Topic): carries the Event(s) Envelope
- Real Time Processing (Storm, Decrypting Bolt): Event(s) Envelope -> decrypt AES key w/ RSA -> decrypt event(s) w/ AES -> Event

46 Data Security in Kafka – Encryption Details
- RSA public/private keys are generated ahead of time and securely shared with the topology
- The AES key is randomly generated and periodically refreshed
- Only a user holding the appropriate RSA private key can read the data
- One event or a batch of events can be encrypted together, as needed

47 Data Security in Kafka - Applicability
- Multiple applications want to use Kafka as the source of their streams
- Data is sensitive and cannot be shared between applications
- Other components in the pipeline are secured

48 Micro Batching

49 Micro Batching - Description
Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches or chunks of data. For incoming streams, the events can be packaged into small batches and delivered to a batch system for processing.

50 Micro Batching - Challenges
- Data delivery reliability
- Unnecessary data duplication
- Increased latency
- Complexity in time-bound batching

51 Micro Batching - Design Options
- Thread-based model
- Controller stream to trigger batch flush
- Use of tick tuples

52 Tick Tuples
Tick tuples are system-generated tuples that Storm can send to your bolt if you need to perform some action at a fixed interval.

53 Micro Batching - Benefits
- Takes advantage of system characteristics by batching events together
- Adheres to processing latency needs by ensuring that batches are executed at certain intervals
- Prevents data loss by acknowledging events only after successful processing
- Simple, elegant and easy-to-maintain code
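
The tick-tuple design can be sketched in plain Python. This is not Storm code: the `TICK` sentinel stands in for a system-generated tick tuple, and the `MicroBatchingBolt` class with its `flush` and `ack` callbacks is a hypothetical illustration of the flush-on-tick, flush-on-size and ack-after-success behaviour listed above.

```python
TICK = object()  # stand-in for a system-generated tick tuple


class MicroBatchingBolt:
    """Buffers events and flushes them on a tick or when the batch is full."""

    def __init__(self, flush, ack, max_batch_size):
        self._flush_fn = flush      # writes one batch to the target system
        self._ack = ack             # acknowledges a single event
        self._max_batch_size = max_batch_size
        self._pending = []

    def execute(self, tup):
        if tup is TICK:
            self._flush()           # time-bound flush keeps latency bounded
            return
        self._pending.append(tup)
        if len(self._pending) >= self._max_batch_size:
            self._flush()           # size-bound flush keeps memory bounded

    def _flush(self):
        if not self._pending:
            return
        # If the write raises, nothing is acked and the events stay pending.
        self._flush_fn(list(self._pending))
        for tup in self._pending:
            self._ack(tup)          # ack only after the batch was written
        self._pending = []
```

A short usage example, with lists standing in for the target system and the acker:

```python
sent, acked = [], []
bolt = MicroBatchingBolt(flush=sent.append, ack=acked.append, max_batch_size=3)
for event in ["e1", "e2"]:
    bolt.execute(event)
bolt.execute(TICK)  # flushes the partial batch ["e1", "e2"]
```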

54 Micro Batching - Applicability
- Target systems are more efficient with bulk transactions
- Processing a group of events is more efficient than processing individual events
- End-to-end event latency is not highly sensitive

55 Micro Batching – Sample Code

56 Thank You! sheetal@hortonworks.com @sheetal_dolas