Published by Angelica Nichols. Modified over 5 years ago.
1
InfoSphere Streams
Tushar Kale
Big Data Evangelist – Streams Architect
2
Agenda
- Overview
- Architecture
- Customer Use Cases
3
Big Data = Variety, Velocity, and Volume
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.
- Variety: Manage the complexity of multiple relational and non-relational data types and schemas
- Velocity: Streaming data and large-volume data movement
- Volume: Scale from terabytes to zettabytes
4
InfoSphere Streams: A Platform to Run In-Motion Analytics on BIG Data
- Real-time delivery
- Handles up to petabytes of data per day
- Supports traditional as well as non-traditional data (audio, video, etc.)
- Delivers insights with microsecond latencies
- Supports custom analytics written in C++/Java and warehouse analytic models
- A single instance can support multiple applications
[Diagram labels: Volume, Variety, Velocity, Agility; Powerful Analytics; Millions of events per second; Microsecond latency; Complex analytics; Traditional / non-traditional data sources]
[Example applications shown: ICU monitoring, environment monitoring, algo trading, telco churn prediction, cyber security, smart grid, government / law enforcement]
5
Stream Computing Illustrated
[Figure: a stream drawn as a sequence of tuples. Some example tuples carry the attributes directory ("/img", "/img", "/opt") and filename ("farm", "bird", "java", "cat"); others carry height, width and data.]
6
What can Streams do for you?
- Analyze and react to events as they are happening
- Take advantage of more sources of data in “true” real time
- Build models on your most up-to-the-second information that will help predict what happens next
Streams is middleware and a language for building and running analytic applications that operate on data in motion.
- Scale: easily handles anything from a few events per second to multiple millions of events per second
- Reaction time: actionable results are possible in much less than a second (under 20 microseconds possible)
- Enables TRUE situational awareness
7
BIG Data – Extending the Warehouse
[Diagram, three paths:]
- Traditional / relational data sources feed the database & warehouse: at-rest data analytics, producing results
- Non-traditional / non-relational data sources feed InfoSphere Streams: in-motion analytics, producing ultra-low-latency results
- Internet-scale traditional / relational and non-traditional / non-relational data sources feed InfoSphere BigInsights: data analytics, data operations & model building, producing results
8
Adaptive Analytics: Integrating Analytics on Data in Motion and Data at Rest
[Diagram labels:]
- InfoSphere Streams: data ingest, preparation, online analysis, model validation
- InfoSphere BigInsights, Database & Warehouse: data integration, data mining, machine learning, statistical modeling
- Visualization of real-time and historical insights
- Numbered flows: 1. Data Ingest; 2. Bootstrap/Enrich; 3. Adaptive Analytics Model
- Legend: data, control flow
9
Agenda
- Overview
- Architecture
- Customer Use Cases
10
What are key differentiating technical capabilities of Streams?
Language built for streaming applications:
- Reusable operators
- Rapid application development
- Continuous “pipeline” processing
Performance and scaling:
- Operator fusing and threading
- Efficient use of cores
- Distributed execution
- Very fast data exchange
Use the data that gives you a competitive advantage:
- Can handle virtually any data type
- Use data that is too expensive and time sensitive for traditional approaches
Easy to extend:
- Built-in adaptors
- Users add capability with familiar C++ and Java
Dynamic analysis:
- Programmatically change topology at runtime
- Create new subscriptions
- Create new port properties
Easy to manage:
- Automatic placement
- Extend applications incrementally without downtime
- Multi-user / multiple applications
Flexible and high-performance transport:
- Very low latency
- High data rates
11
InfoSphere Streams
- Streams Processing Language and IDE: Streams Studio, an Eclipse IDE for SPL
- Runtime Environment: highly scalable stream processing runtime
- Tools and Technology Integration: Streams Console & monitoring, built-in stream relational analytics, adapters, toolkits
- Supported on x86 hardware, Red Hat Enterprise Linux Version 5 (5.3 and up)
[Other diagram label: Front Office 3.0]
12
Terminology

    (stream<Type> A) as O1 = MySrc() {}
    () as O2 = MySink(A) {}
    () as O3 = MySink(A) {}

[Diagram: MySrc produces stream A; each of the two MySink instances receives A over its own stream connection.]
- Application: data flow graph of operator instances connected to each other via stream connections
- Operator: reusable stream analytic. Input ports receive data; output ports produce data. A source has no input ports; a sink has no output ports.
- Operator instance: a specific instantiation of an operator
- Stream: continuous series of tuples, generated by an operator instance’s output port
- Stream connection: a stream connected to a specific operator instance input port
- PE: a runtime process that executes a set of operator instances
- Job: an application instance running on a set of hosts
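As a rough illustration of these terms, the three-operator graph above can be modeled in a few lines of Python. This is only a sketch and not SPL or any Streams API: the names Stream, ListSource, CollectSink and build_and_run are hypothetical, and the whole graph runs in one thread rather than in PEs on a cluster.

```python
class Stream:
    """A stream: every submitted tuple goes to each stream connection."""

    def __init__(self):
        self._connections = []  # input-port callables of downstream instances

    def connect(self, input_port):
        self._connections.append(input_port)

    def submit(self, tup):
        for port in self._connections:
            port(tup)


class ListSource:
    """Source operator instance: no input ports, one output port."""

    def __init__(self, items, out):
        self._items = list(items)
        self._out = out

    def run(self):
        for item in self._items:
            self._out.submit(item)


class CollectSink:
    """Sink operator instance: one input port, no output ports."""

    def __init__(self):
        self.received = []

    def process(self, tup):
        self.received.append(tup)


def build_and_run(items):
    """Build the O1 -> A -> (O2, O3) graph from the slide and run it once."""
    a = Stream()
    o1 = ListSource(items, a)
    o2, o3 = CollectSink(), CollectSink()
    a.connect(o2.process)  # first stream connection
    a.connect(o3.process)  # second stream connection
    o1.run()
    return o2.received, o3.received
```

Both sinks see every tuple of A, which is the point of distinguishing a stream from its stream connections.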
13
InfoSphere Streams Programming Model
[Diagram labels: Source Adapters, Operator Repository, Sink Adapters, Application Programming (SPL), Platform-optimized compilation]
14
Streams Core Analytical Capabilities: Streams Built-in Relational and Utility Operators
- The Split operator divides incoming tuples into separate streams for parallel processing
- The Aggregate operator groups and summarizes incoming tuples
- The Delay operator “artificially” slows down a stream
- The Functor operator performs tuple-level manipulations
- The Punctor operator inserts punctuation marks in streams
- The Join operator correlates two streams
- The Sort operator imposes an order on incoming tuples in a stream
- The Barrier operator is used as a synchronization point
- And more!
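To make three of these operators concrete, here is a minimal Python sketch of what Functor, Split and Aggregate do to a sequence of tuples. It is not the SPL operators themselves: the function names and parameters are invented for illustration, tuples are plain dicts, and the Aggregate stand-in assumes a count-based tumbling window with a per-key sum.

```python
from collections import defaultdict


def functor(stream, transform, keep=lambda t: True):
    """Tuple-level manipulation: drop tuples failing keep, transform the rest."""
    for tup in stream:
        if keep(tup):
            yield transform(tup)


def split(stream, n_outputs, route):
    """Divide tuples across n_outputs lists; route(tup) picks the output index."""
    outputs = [[] for _ in range(n_outputs)]
    for tup in stream:
        outputs[route(tup) % n_outputs].append(tup)
    return outputs


def aggregate(stream, window_size, key, value):
    """Tumbling count window: emit per-key sums every window_size tuples.

    A trailing partial window is dropped in this sketch.
    """
    window = []
    for tup in stream:
        window.append(tup)
        if len(window) == window_size:
            sums = defaultdict(float)
            for t in window:
                sums[t[key]] += t[value]
            yield dict(sums)
            window = []
```

For example, `aggregate(functor(trades, transform, keep), 3, "sym", "px")` chains a filter-and-transform step into a windowed summary, in the spirit of the continuous pipeline the deck describes.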
15
Streams Core Adapter Capabilities: Streams Built-in Adapters and DB Toolkit
- The ODBCEnrich operator extends streaming data based on lookups performed from database tables
- The ODBCSource operator reads data from databases such as DB2, IDS and Oracle
- The ODBCAppend operator writes data to databases such as DB2, IDS and Oracle
- The solidDBEnrich operator extends streaming data based on lookups performed from in-memory database tables
- The FileSource operator reads data from files in formats such as csv, line, or binary
- The FileSink operator writes data to files in formats such as csv, line, or binary
- The TCPSource / UDPSource operators read data from sockets in formats such as csv, line, or binary
- The TCPSink / UDPSink operators write data to sockets in formats such as csv, line, or binary
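The source, enrich and sink roles listed above can be sketched in Python with the standard csv module. This is an analogy only: file_source, enrich and file_sink are hypothetical names, and a dict stands in for the database table that an enrich operator would query.

```python
import csv
import io


def file_source(text_file):
    """Read CSV rows as dict tuples (stand-in for reading a file in csv format)."""
    yield from csv.DictReader(text_file)


def enrich(stream, lookup, key):
    """Extend each tuple with attributes looked up by tup[key].

    lookup is a dict standing in for a database table; tuples with no
    match pass through unchanged.
    """
    for tup in stream:
        yield {**tup, **lookup.get(tup[key], {})}


def file_sink(stream, text_file, fieldnames):
    """Write tuples as CSV rows and return how many were written."""
    writer = csv.DictWriter(text_file, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    count = 0
    for tup in stream:
        writer.writerow(tup)  # missing fields are written as empty strings
        count += 1
    return count
```

A usage example: `file_sink(enrich(file_source(src), plans, "caller"), out, ["caller", "minutes", "plan"])`, where src and out are open text files (or io.StringIO objects) and plans maps a caller to extra attributes.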
16
Extensibility
- User-defined operators that extend the language: a reusable, generic operator model written in general-purpose programming languages (C++/Java)
- User-defined functions that extend the language
- Toolkits: sets of domain-specific operators/functions
  - Toolkits available as part of Streams: DB toolkit, data mining toolkit, financial toolkit
  - Streams Exchange on developerWorks: re-usable assets and forum
- Developers in two categories: application developers and toolkit developers
17
Static vs. Dynamic Composition
Static connections
- Fully specified at application development time and do not change at run time
Dynamic connections
- Partially specified at application development time (name or properties)
- Established at run time, as new jobs come and go
- Specifications can also be updated at run time
Dynamic application composition
- Incremental deployment of applications
- Dynamic adaptation of applications
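One way to picture a dynamic connection is a small matcher that pairs exported streams with property-based subscriptions at run time. The Python sketch below is an assumption-laden toy, not the Streams mechanism: Broker and its methods are hypothetical names, and matching is simple equality on every required property.

```python
class Broker:
    """Toy matcher between exported streams and property-based subscriptions."""

    def __init__(self):
        self._exports = {}        # stream name -> properties dict
        self._subscriptions = []  # (required properties, callback) pairs

    @staticmethod
    def _matches(required, properties):
        return all(properties.get(k) == v for k, v in required.items())

    def export(self, name, properties):
        """Announce a stream, or update its properties, at run time."""
        self._exports[name] = dict(properties)

    def withdraw(self, name):
        """Remove a stream, for example when its job is cancelled."""
        self._exports.pop(name, None)

    def subscribe(self, required, callback):
        """Register interest in any stream carrying these property values."""
        self._subscriptions.append((dict(required), callback))

    def publish(self, name, tup):
        """Deliver tup to every subscription matching the stream right now.

        Returns the number of deliveries; 0 if the stream is not exported.
        """
        properties = self._exports.get(name)
        if properties is None:
            return 0
        delivered = 0
        for required, callback in self._subscriptions:
            if self._matches(required, properties):
                callback(tup)
                delivered += 1
        return delivered
```

Because matching is evaluated on each publish, a subscriber registered before the exporter exists is connected as soon as the export appears, and a property update reroutes tuples without redeploying either side.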
19
InfoSphere Streams Runtime Architecture
[Architecture diagram: InfoSphere Streams Runtime running on a cluster of 125 blades]
- Tools: Eclipse IDE and management tools, language / optimizing compiler, admin config / console, management APIs, streamtool (caption: running anywhere inside the cluster)
- Components running on management hosts: Streams Web Service, Name Service, Root Service, Partition Service, Scheduler, Authorization and Authentication Service, Streams Application Manager, Streams Resource Manager
- Components running on application hosts: Processing Element Container (a subset of an SPL application, that is, a collection of operators), Agent, Host Controller
20
InfoSphere Streams Runtime
Streams is a distributed, multi-user, multi-instance system
- Multiple instances can run at the same time
- Can run jobs from multiple users
- A security model is provided for authentication and authorization
Application management
- New jobs can be added/removed at any time
- New and existing jobs can connect to each other
- Scheduler assigns PEs to hosts based on load
Resource management
- Hosts & services configuration and state
- System & application metrics
Failure semantics
- Recovery of management services state
- PEs can be restarted or relocated upon failure
- All connections will be re-established once a PE restarts
- All state and in-transit tuples are lost
- Checkpointing can be used to restore operator state
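The failure semantics above (state is lost on a PE failure unless it was checkpointed) can be illustrated with a small Python simulation. It is a sketch under stated assumptions: CountingOperator and simulate_failure_and_restart are hypothetical names, a dict stands in for durable checkpoint storage, and a checkpoint is taken every fixed number of tuples.

```python
import copy


class CountingOperator:
    """Keeps a running count per key and checkpoints it periodically."""

    def __init__(self, checkpoint_every, store):
        self.checkpoint_every = checkpoint_every
        self.store = store  # dict standing in for durable storage
        # On (re)start, restore whatever state was last checkpointed.
        self.counts = copy.deepcopy(store.get("counts", {}))
        self._since_checkpoint = 0

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self._since_checkpoint += 1
        if self._since_checkpoint >= self.checkpoint_every:
            self.store["counts"] = copy.deepcopy(self.counts)
            self._since_checkpoint = 0


def simulate_failure_and_restart(keys_before, keys_after, checkpoint_every):
    """Process keys_before, lose the operator, restart it, process keys_after."""
    store = {}
    op = CountingOperator(checkpoint_every, store)
    for key in keys_before:
        op.process(key)
    # Simulated failure: the in-memory operator and its state are discarded.
    op = CountingOperator(checkpoint_every, store)
    for key in keys_after:
        op.process(key)
    return op.counts
```

With a checkpoint every 2 tuples, `simulate_failure_and_restart(["a", "a", "b"], ["b"], 2)` ends with `{"a": 2, "b": 1}`: the first "b" arrived after the last checkpoint and is gone, which is the trade-off between checkpoint frequency and lost work.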
21
InfoSphere Streams Runtime - cont’d
- Runs on commodity hardware: from single node to blade centers to high-performance multi-rack clusters
- Adapts to changes
[Diagram: five x86 hosts]
22
InfoSphere Streams Runtime – cont’d
- Runs on commodity hardware: from single node to blade centers to high-performance multi-rack clusters
- Adapts to changes:
  - In workloads
[Diagram: five x86 hosts]
24
InfoSphere Streams Runtime – cont’d
- Runs on commodity hardware: from single node to blade centers to high-performance multi-rack clusters
- Adapts to changes:
  - In workloads
  - In resources
[Diagram: five x86 hosts]
26
Streams Studio Eclipse IDE
27
Streams Console – Metrics
New in the Streams Console this release is the ability to see "integrated" metrics for the object type that is selected. Here is the metrics view for the PEs running in the instance.
28
Agenda
- Overview
- Architecture
- Customer Use Cases
29
Streaming Analytics in Action
- Stock Market: impact of weather on securities prices; analyze market data at ultra-low latencies
- Natural Systems: wildfire management; water management
- Law Enforcement, Defense & Cyber Security: real-time multimodal surveillance; situational awareness; cyber security detection
- Transportation: intelligent traffic management
- Fraud Prevention: detecting multi-party fraud; real-time fraud prevention
- Manufacturing: process control for microchip fabrication
- e-Science: space weather prediction; detection of transient events; synchrotron atomic research
- Health & Life Sciences: neonatal ICU monitoring; epidemic early warning system; remote healthcare monitoring
- Telephony: CDR processing; social analysis; churn prediction; geomapping
- Other: smart grid; text analysis; who’s talking to whom?; ERP for commodities; FPGA acceleration

Key Points
- Representative and impressive use cases
- A way to augment and improve an existing use case and traditional technology: algorithmic trading, risk, fraud detection, security
- New opportunities enabled by the software: Galway Bay, neonatal intensive care, radio astronomy, traffic management
- Virtually every field needs this: Streams is suitable for building solutions in a variety of domains

Notes by domain
- Stock market: impact of weather on commodity pricing or stock prices. A demo was built to show the impact of a hurricane on oil assets in the Gulf of Mexico. Sample equity and options trading apps are included with Streams.
- Natural systems: a university with access to US weather satellites and unmanned aerial vehicles wants to build a real-time wildfire analysis and monitoring system that can detect smoke from wildfires, and task satellites and UAVs to monitor fires and direct firefighters. Several organizations are looking to better understand watersheds and oceans to predict weather patterns and manage fishing stocks.
- Transportation: real-time fleet management is one application. Another is integrating vehicle traffic information with subway, train and taxi information to improve transit operation. A pilot was done for Stockholm with KTH University.
- Manufacturing: IBM Burlington used Streams in a pilot to automate analysis of chip testing to improve yield.
- Health & life sciences: by correlating information across neonatal monitors, systemic infection can be detected 6 to 24 hours earlier than by experienced nurses. Now being extended for brain trauma in the neurologic ICU.
- Telephony: call detail record processing transforms telephone records in ASN.1 format to standard ASCII and puts them into databases for billing. It also summarizes to a dashboard. Other apps set up a social network of who is talking to whom, and who is likely to leave your company. Geomapping allows a telco to provide location-based services.
- e-Science: space weather prediction can help alleviate damage from solar flares to satellites and the electric grid. Detection of transient events and imaging of the remote universe from an array of radio telescopes is a worldwide, multiyear, $2 billion USD effort. Synchrotron research enables re-use of data across many research organizations.
- Fraud prevention: detecting multi-party fraud and finding fraud faster can result in less fraud.
- Law enforcement: monitoring cameras to detect faces and then perform facial recognition can be used to find criminals; combining many data types can improve analysis and provide better situational awareness.
- Other: smart grid to talk to many meters; text analytics to understand content; who’s talking to whom (voice analysis to build up a social network); weather summarization to optimize commodities purchases; and FPGA acceleration to improve number crunching (mathematical analysis) in Streams.
30
Smarter Faster Cheaper CDR Processing
- 6 billion CDRs per day, deduplicated over 7 days; processing latency reduced from 12 hours to a few seconds
- 6 machines (using ½ of processor capacity)
- InfoSphere Streams xDR Hub
- Key requirements: price/performance and scaling
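Deduplicating records over a sliding 7-day horizon can be sketched as follows. This is a single-process Python illustration, not how the xDR Hub is implemented: WindowedDeduplicator is a hypothetical name, each CDR is assumed to carry a unique record id and a timestamp in seconds, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import OrderedDict

SEVEN_DAYS = 7 * 24 * 3600  # window length in seconds


class WindowedDeduplicator:
    """Remembers record ids for window_seconds and flags repeats."""

    def __init__(self, window_seconds=SEVEN_DAYS):
        self.window_seconds = window_seconds
        self._seen = OrderedDict()  # record id -> first-seen timestamp, oldest first

    def _evict(self, now):
        # Drop ids whose first sighting is at least one window old.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen < self.window_seconds:
                break
            self._seen.popitem(last=False)

    def is_new(self, record_id, timestamp):
        """Return True the first time record_id is seen within the window."""
        self._evict(timestamp)
        if record_id in self._seen:
            return False
        self._seen[record_id] = timestamp
        return True
```

Memory grows with the number of distinct ids in the window, so at billions of records per day a real deployment would need to partition the id space across hosts or use a more compact structure; the sketch only shows the windowing logic.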
31
Telco: Beyond CDR processing, building on existing insight
[Diagram labels, in extraction order: Call Quality Analytics, Call Data Analytics, Database & Warehouse, Mobile Network, Churn Analytics, Network Analytics, Business Rules, Customer Interactions, Campaign Analytics, Audio Analytics, … Analytics, Location Analytics, … Analytics, Weather, … Analytics, Social Media, Social Analytics, InfoSphere Streams]
32
Surveillance and Physical Security: TerraEchos (Business Partner)
Use scenario
- State-of-the-art covert surveillance system based on the Streams platform
- Acoustic signals from buried fiber optic cables are monitored, analyzed and reported in real time for necessary action
- Currently designed to scale up to 1600 streams of raw binary data
Requirement
- Real-time processing of multi-modal signals (acoustics, video, etc.)
- Easy to expand, dynamic
- 3.5M data elements per second
Winner, 2010 IBM CTO Innovation Award
33
Cyber Security Analytics
[Diagram labels: IT I/S; Firewalls; Live Packet Capture; InfoSphere Streams with five Processing Element Containers; DNS / DHCP / Netflow sources; Botnet behavior modeling; External C&C feeds (live DB queries); Botnet nodes / malware; IP/MAC identifying suspects; Remediation infrastructure / ticketing]
34
University of Ontario Institute of Technology (UOIT) and Sick Kids Hospital
IBM Data Baby
35
Intelligent Transportation
- Multimodal data streams: GPS; counts, speeds, travel times; public transport; pollution measurements; weather conditions
- Outputs: archiving of cleansed data; real-time traffic monitoring; real-time traffic information; (multimodal) travel planner
- Only 4 x86 blade servers to process 250,000 GPS probes per second
- Pipeline stages shown: GPS data streams, real-time transformation logic, real-time geo mapping, real-time speed & heading estimation, real-time aggregates & statistics, storage adapters
- Consumers shown: interactive visualization, data warehouse, web server, Google Earth, offline statistical analysis
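The speed and heading estimation stage can be illustrated with two consecutive GPS fixes from the same vehicle. The Python sketch below is not taken from the pilot: it assumes each fix is a (timestamp in seconds, latitude, longitude) triple in decimal degrees, uses the haversine great-circle distance on a spherical Earth, and reports the initial bearing as the heading.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def heading_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def speed_and_heading(prev_fix, fix):
    """Return (speed in m/s, heading in degrees), or None if time did not advance."""
    t1, lat1, lon1 = prev_fix
    t2, lat2, lon2 = fix
    dt = t2 - t1
    if dt <= 0:
        return None  # duplicate or out-of-order probe; skip it
    distance = haversine_m(lat1, lon1, lat2, lon2)
    return distance / dt, heading_deg(lat1, lon1, lat2, lon2)
```

In a streaming setting this function would be applied per vehicle, keeping only the previous fix for each vehicle id as operator state, with the results feeding the aggregates and statistics stage.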
36
THINK
Nov 2013
37
Questions? Nov 2013