IBM InfoSphere Streams Technical Overview Julie Garrone Christophe Menichetti © 2013 IBM Corporation.

1 IBM InfoSphere Streams Technical Overview Julie Garrone Christophe Menichetti © 2013 IBM Corporation December 31, 2013

2 Agenda – InfoSphere Streams general description – Technical overview – Streams Processing Language – Use case – Demo

3 Agenda – InfoSphere Streams general description – Technical overview – Streams Processing Language – Use case – Demo

4 IBM InfoSphere Streams v3.0 A platform for real-time analytics on BIG data Volume – Terabytes per second – Petabytes per day Variety – All kinds of data – All kinds of analytics Velocity – Insights in microseconds Agility – Dynamically responsive – Rapid application development. Millions of events per second; microsecond latency; sensor, video, audio, text, and relational data sources; just-in-time decisions; powerful analytics.

5 Streams Analyzes All Kinds of Data – Mining in Microseconds (included with Streams) – Image & Video (Open Source) – Simple & Advanced Text (included with Streams) [Text example: (listen, verb), (radio, noun)] – Acoustic (IBM Research) (Open Source) – Geospatial (included with Streams) – Predictive (included with Streams) – Advanced Mathematical Models (IBM Research) – Statistics (included with Streams) ***New

6 Categories of Problems Solved by Streams Applications that require on-the-fly processing, filtering and analysis of streaming data – Sensors: environmental, industrial, surveillance video, GPS, … – Data exhaust: network/system/web server/app server log files – High-rate transaction data: financial transactions, call detail records Criteria: two or more of the following – Messages are processed in isolation or in limited data windows – Sources include non-traditional data (spatial, imagery, text, …) – Sources vary in connection methods, data rates, and processing requirements, presenting integration challenges – Data rates/volumes require the resources of multiple processing nodes – Analysis and response are needed with sub-millisecond latency – Data rates and volumes are too great for store-and-mine approaches

7 The Big Data Ecosystem: Interoperability is Key [Diagram labels: Streaming Data; Traditional / Relational Data Sources; Non-Traditional / Non-Relational Data Sources; Non-Traditional / Non-Relational Data Feeds; Internet-Scale Data Sets; Streams (RTAP: Analytics on Data in Motion); BigInsights (Analytics on Data at Rest); Data Warehouse (Analytics on Structured Data); Traditional Warehouse]

8 EuroBuzz with France Televisions: Smarter Analytics applied to the Eurovision Song Contest – Input: tweets on the Eurovision event from the Twitter Streaming API, in JSON format, filtered by Eurovision-specific hashtags: #eurovision #eurofrancetv… – Tweet format: {"entities":{"hashtags":[{"text":eurovision","text":Sweeden will win Eurovision for sure #eurovision …. – Real-time analysis in Streams: filters, aggregations, and calculations on incoming tweets – 2 different streams based on hashtags: a World stream collecting tweets from all Eurovision hashtags, and a France stream collecting tweets from the France Televisions hashtag #eurofrancetv – Analysis of 2 different languages: English and French – 24-hour data aggregation – Buzz statistics for the data-visualisation partner, the France 3 website, and the live Eurovision show: gauge by country, latest tweets on Eurovision, pictures of Eurovision / Twitter top contributors, overall statistics (World & France)

9 EuroBuzz with France Televisions: Smarter Analytics applied to the Eurovision Song Contest Infrastructure for the PoC – 2 blades in Active/Active mode (Nehalem sockets), all blades with RHEL – Master node: HS22, 8 cores 2.93 GHz, 48GB RAM – Secondary node: HS22, 8 cores 2.93 GHz, 48GB RAM – SAN storage (DS8700) to store all data during the PoC (all tweets) – Output to the data-visualisation partner Observations – Network peak at the beginning of each song – Very low CPU utilization: < 1% -> InfoSphere Streams efficiency [Chart labels: Max 100%, Max 10%]

10 University of Ontario Institute of Technology Use case – Neonatal infant monitoring – Predict infection in ICU 24 hours in advance Solutions – 120 children monitored: 120K msg/sec, billion msg/day – Trials expanding to include hospitals in US and China [Diagram labels: Sensor Network, Event Pre-processor, Analysis Framework, Solutions (Applications), Stream-based Distributed Interoperable Health care Infrastructure]

11 Asian Telco reduces billing costs and improves customer satisfaction Capabilities: Stream Computing, Analytic Accelerators – Real-time mediation and analysis of 5B CDRs per day – Data processing time reduced from 12 hrs to 1 min – Hardware cost reduced to 1/8th – Proactively address issues (e.g. dropped calls) impacting customer satisfaction

12 Agenda – InfoSphere Streams general description – Technical overview – Streams Processing Language – Use case – Demo

13 How Streams Works – Continuous ingestion – Continuous analysis

14 How Streams Works – Continuous ingestion – Continuous analysis Achieve scale: – By partitioning applications into software components – By distributing across stream-connected hardware hosts Infrastructure provides services for – Scheduling analytics across hardware hosts – Establishing streaming connectivity Where appropriate: – Elements can be fused together for lower communication latency [Diagram labels: Transform, Filter / Sample, Classify, Correlate, Annotate]

15 InfoSphere Streams Runtime Distributed runtime, running on a cluster – No limit on cluster size Import/export streams between jobs – Agility: dynamically reconfigure applications Performance and health metrics – down to operator level Jobs submitted and canceled at any time Hosts added to and removed from cluster at any time High Availability – Restart and relocate portions of applications – Assign parts of applications to host pools based on host tags Application portability – Simplifies test and operations

16 From Operators to Running Jobs Streams application graph: – A directed, possibly cyclic, graph – A collection of operators – Connected by streams Each complete application is a potentially deployable job Jobs are deployed to a Streams runtime environment, known as a Streams Instance (or simply, an instance) An instance can include a single processing node (hardware) or multiple processing nodes [Diagram labels: Streams instance, h/w node, Src, OP, Sink, stream]

17 InfoSphere Streams Objects: Runtime View Instance – Runtime instantiation of InfoSphere Streams executing across one or more hosts – Collection of components and services Processing Element (PE) – Fundamental execution unit that is run by the Streams instance – Can encapsulate a single operator or many fused operators Job – A deployed Streams application executing in an instance – Consists of one or more PEs [Diagram labels: Instance, Job, Node, PE, operator, Stream 1 to Stream 5]

18 InfoSphere Streams Objects: Development View Operator – The fundamental building block of the Streams Processing Language – Operators process data from streams and may produce new streams Stream – An infinite sequence of structured tuples – Can be consumed by operators on a tuple-by-tuple basis or through the definition of a window Tuple – A structured list of attributes and their types. Each tuple on a stream has the form dictated by its stream type Stream type – Specification of the name and data type of each attribute in the tuple Window – A finite, sequential group of tuples – Based on count, time, attribute value, or punctuation marks [Diagram labels: Streams Application, operator, stream, tuple. Example tuples on one stream: directory: "/img" filename: "farm"; directory: "/img" filename: "bird"; directory: "/opt" filename: "java"; directory: "/img" filename: "cat". Example tuples on another stream: height: 640 width: 480 data: ; height: 1280 width: 1024 data: ; height: 640 width: 480 data: ]
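The window definition on this slide can be illustrated with a short Python sketch (Python, not SPL). It covers count-based windows only, in two flavors: tumbling (consecutive, non-overlapping groups) and sliding (the most recent tuples after each arrival). The function names are hypothetical, and the sample tuples reuse the directory/filename values from the slide.

```python
from collections import deque


def tumbling_count_windows(stream, size):
    """Group tuples into consecutive, non-overlapping windows of `size` tuples.

    A trailing partial window is never emitted in this sketch.
    """
    window = []
    for tup in stream:
        window.append(tup)
        if len(window) == size:
            yield window
            window = []


def sliding_count_windows(stream, size):
    """After each arrival, yield the most recent `size` tuples (fewer at the start)."""
    window = deque(maxlen=size)  # the oldest tuple is evicted automatically
    for tup in stream:
        window.append(tup)
        yield list(window)


# Sample tuples taken from the slide; each tuple is modeled as a dict.
files = [
    {"directory": "/img", "filename": "farm"},
    {"directory": "/img", "filename": "bird"},
    {"directory": "/opt", "filename": "java"},
    {"directory": "/img", "filename": "cat"},
]

for window in tumbling_count_windows(files, 2):
    print([t["filename"] for t in window])
```

A stream is infinite in principle, which is why both helpers are generators: they emit windows as tuples arrive instead of waiting for the input to end.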

19 Host Pools Facilitate High Availability HA application design pattern – Source job exports stream, enriched with tuple ID – Jobs 1 & 2 process in parallel, and export final streams – Sink job imports streams, discards duplicates, alerts on missing tuples [Diagram labels: Source, Job 1, Job 2, Sink, Host pool 1 to Host pool 4, x86 host]
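The sink side of this HA pattern can be sketched in Python (not SPL). This is a minimal illustration under stated assumptions: each arrival is a (job name, tuple ID, payload) triple, the set of expected tuple IDs is known up front, and the function name `merge_redundant` is hypothetical. A real sink would detect missing tuples incrementally, for example with a timeout.

```python
def merge_redundant(arrivals, expected_ids):
    """Merge the outputs of redundant jobs.

    Keeps the first copy of each tuple ID, discards later duplicates,
    and reports the expected IDs that no job delivered.
    """
    seen = set()
    output = []
    for _job, tuple_id, payload in arrivals:
        if tuple_id in seen:
            continue  # duplicate delivered by the other job
        seen.add(tuple_id)
        output.append((tuple_id, payload))
    missing = sorted(set(expected_ids) - seen)
    return output, missing


# Hypothetical arrivals: both jobs deliver tuple 1, nobody delivers tuple 3.
arrivals = [
    ("job1", 1, "a"),
    ("job2", 1, "a"),
    ("job2", 2, "b"),
    ("job1", 4, "d"),
]
output, missing = merge_redundant(arrivals, expected_ids=range(1, 5))
print(output)
print("alert, missing tuple IDs:", missing)
```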

20 Agenda – InfoSphere Streams general description – Technical overview – Streams Processing Language – Use case – Demo

21 What is Streams Processing Language? Designed for stream computing – Define a streaming-data flow graph – Rich set of data types to define tuple attributes Declarative – Operator invocations name the input and output streams – Referring to streams by name is enough to connect the graph Procedural support – Full-featured imperative language – Custom logic in operator invocations – Expressions in attribute assignments and parameter definitions Extensible – User-defined data types – Custom functions written in SPL or a native language (C++ or Java) – Custom operators written in SPL – User-defined operators written in a native language (C++ or Java)

22 Some SPL Terms An operator represents a class of manipulations – of tuples from one or more input streams – to produce tuples on one or more output streams A stream connects to an operator on a port – an operator defines input and output ports An operator invocation – is a specific use of an operator – with specific assigned input and output streams – with locally specified parameters, logic, etc. Many operators have one input port and one output port; others have – zero input ports: source adapters, e.g., TCPSource – zero output ports: sink adapters, e.g., FileSink – multiple output ports, e.g., Split – multiple input ports, e.g., Join A composite operator is a collection of operators – An encapsulation of a subgraph of primitive operators (non-composite) and composite operators (nested) – Similar to a macro in a procedural language [Diagram labels: TCP Source, Split, Aggregate, Join, File Sink, composite operator, port, Employee Info, Salary Statistics]

23 Composite Operators Every graph is encoded as a composite – A composite is a graph of one or more operators – A composite may have input and output ports – Source code construct only: nothing to do with operator fusion (PEs) Each stream declaration in the composite – Invokes a primitive operator or – another composite operator An application is a main composite – No input or output ports – Data flows in and out but not on streams within a graph – Streams may be exported to and imported from other applications running in the same instance
    composite Main {
      graph
        stream … { }
        stream … { }
        ...
    }
[Diagram: Application (logical view), with operators connected by Stream 1 to Stream 5]

24 Composing a Flow Graph with Stream Definitions
    composite POS_TxHandling {
      graph
        stream<…> POS_Transactions = TCPSource() {…}
        stream<…> Sales = Operator1(POS_Transactions) {…}
        stream<…> TaxableSales = Operator2(Sales) {…}
        stream<…> TaxesDue = Operator3(TaxableSales) {…}
        () as Sink1 = TCPSink(TaxesDue) {…}
        stream<…> Deliveries = TCPSource() {…}
        stream<…> Inventory = Operator4(Sales;Deliveries) {…}
        stream<…> Reorders = Operator5(Inventory) {…}
        () as Sink2 = TCPSink(Reorders) {…}
    }
[Diagram: the flow graph of composite POS_TxHandling, with TCPSource, Operator 1 to Operator 5, and TCPSink connected by the streams POS_Transactions, Sales, TaxableSales, TaxesDue, Deliveries, Inventory, and Reorders]

25 Agenda – InfoSphere Streams general description – Technical overview – Streams Processing Language – Use case – Demo

26 Example description Goal – Leverage information about the use of the OpenVPN infrastructure by on-line students [Diagram labels: Users, OpenVPN, Syslog server, InfoSphere Streams, Data warehouse, Reports with Cognos and real-time reports with Cognos RTM]

27 Global diagram Different composite for each step of the program – File source – Parsing – Aggregation – Enrichment with database – Database loading – HDFS loading Transfers between composites with Export / Import operators

28 File source composite – FileSource: Read data from files in formats such as csv, line, or binary. Sample input:
    Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
    Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
    Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
    Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
    Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils; ;openvpn5;Sun Jun 9 05:12:03 …
    Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…

30 File source composite – Throttle: Make a stream flow at a specified rate. Sample input:
    Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
    Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
    Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
    Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
    Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils; ;openvpn5;Sun Jun 9 05:12:03 …
    Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
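What "make a stream flow at a specified rate" means can be sketched in Python (not SPL). This is an illustration of the idea, not the Throttle operator's implementation: the `throttle` generator and its `clock` and `sleep` parameters are hypothetical names, with the two callables injectable so the pacing can be checked without real waiting.

```python
import time


def throttle(stream, rate, clock=time.monotonic, sleep=time.sleep):
    """Yield tuples from `stream` no faster than `rate` tuples per second."""
    interval = 1.0 / rate
    next_release = clock()
    for tup in stream:
        delay = next_release - clock()
        if delay > 0:
            sleep(delay)  # hold the tuple back until its release time
        yield tup
        # Schedule the next release; never schedule in the past, so a slow
        # source does not cause a burst later on.
        next_release = max(next_release, clock()) + interval


# Replay three log lines at roughly 100 lines per second.
lines = ["line 1", "line 2", "line 3"]
for line in throttle(lines, rate=100):
    print(line)
```

Throttling a file replay like this is useful in a demo: a file is read almost instantly, so without pacing the downstream operators would see the whole log as one burst rather than as a stream.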

31 Parsing composite Two different parsing methods – TextExtract: Run AQL queries (module or TAM) over text documents – Functor: Perform tuple-level manipulations; used with Tokenizer Other operators to filter data – Filter: Remove some tuples from a stream – Split: Forward tuples to output streams based on a predicate

32 Parsing composite TextExtract – Use AQL text analysis to extract fields of the syslog: Timestamp, HostName, Tag, Content. Sample input:
    Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
    Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
    Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
    Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
    Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils; ;openvpn5;Sun Jun 9 05:12:03 …
    Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…

33 Parsing composite – Split the flow to eliminate incomplete lines – Filter only tuples coming from openvpn – Transfer only content part with Functor operator. Fields: Timestamp, HostName, Tag, Content. Sample input:
    Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
    Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
    Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
    Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
    Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils; ;openvpn5;Sun Jun 9 05:12:03 …
    Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…

34 Parsing composite Functor with tokenizer – Break a string based on a delimiter character – Structure of a sample openvpn content: ; ; ; ; ; ; ; ;S: ;R: – The status field can be Connected, Ended or Rejected Split the flow to eliminate Rejected connections
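The tokenizing step can be sketched in Python (not SPL). The first sample line is the Connected record shown elsewhere in this deck; the second is made up for illustration. The field names (`user`, `host`, `timestamp`, `status`, `message`) and their positions are assumptions read off that sample, since the slide's field labels did not survive.

```python
def tokenize(content, delimiter=";"):
    """Break a string into fields on a delimiter character."""
    return [field.strip() for field in content.split(delimiter)]


def parse_connection(content):
    """Map token positions to named attributes (positions are assumptions)."""
    fields = tokenize(content)
    return {
        "user": fields[1],
        "host": fields[3],
        "timestamp": fields[4],
        "status": fields[5],
        "message": fields[6],
    }


def drop_rejected(records):
    """Keep only connections whose status is not Rejected."""
    return [r for r in records if r["status"] != "Rejected"]


accepted = (";Odi_platform_demonstration_49165_1; ;openvpn6;"
            "Sun Jun 9 04:59:53 CEST 2013;Connected;Connection accepted; ;")
# Hypothetical line, invented for this sketch to exercise the Rejected branch.
rejected = (";someuser; ;openvpn5;"
            "Sun Jun 9 05:00:00 CEST 2013;Rejected;hypothetical reason; ;")

records = [parse_connection(accepted), parse_connection(rejected)]
kept = drop_rejected(records)
print(kept[0]["user"], kept[0]["status"])
```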

35 Aggregation composite – Aggregation to group logs which describe the same connection – Filter pairs that correspond to Connected / Ended. Example pair:
    ;Odi_platform_demonstration_49165_1; ;openvpn6;Sun Jun 9 04:59:53 CEST 2013;Connected;Connection accepted; ;
    ;Odi_platform_demonstration_49165_1; ;openvpn6;Sun Jun 9 06:46:39 CEST 2013;Ended;Connection terminated, temps cnx 6412 sec; ;S:78203;R:370963;
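Grouping the two log records that describe one connection can be sketched in Python (not SPL). This is a minimal illustration, assuming events are already parsed into dicts and that a connection is identified by the pair (user, host); both the key choice and the name `pair_connections` are hypothetical. The two sample events mirror the Connected / Ended pair on the slide.

```python
def pair_connections(events):
    """Emit one session per Connected event followed by a matching Ended event."""
    open_connections = {}  # (user, host) -> pending Connected event
    sessions = []
    for event in events:
        key = (event["user"], event["host"])
        if event["status"] == "Connected":
            open_connections[key] = event
        elif event["status"] == "Ended" and key in open_connections:
            start = open_connections.pop(key)
            sessions.append({
                "user": event["user"],
                "host": event["host"],
                "start": start["timestamp"],
                "end": event["timestamp"],
            })
        # An Ended event without a pending Connected event is dropped.
    return sessions


events = [
    {"user": "Odi_platform_demonstration_49165_1", "host": "openvpn6",
     "timestamp": "Sun Jun 9 04:59:53 CEST 2013", "status": "Connected"},
    {"user": "Odi_platform_demonstration_49165_1", "host": "openvpn6",
     "timestamp": "Sun Jun 9 06:46:39 CEST 2013", "status": "Ended"},
]
sessions = pair_connections(events)
print(sessions)
```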

36 Aggregation composite – Split the flow according to the userID: stud[0-9]* (the student connections we want to analyze), inst[0-9]*, other – Functor is used to create a unique output tuple for each ended connection
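The three-way split by userID can be sketched in Python (not SPL). The two patterns come from the slide; matching them as prefixes is an assumption made here, because the sample IDs in the results (such as stud213_8) carry a suffix after the digits. The output names and the ID inst12_1 are hypothetical.

```python
import re

STUDENT = re.compile(r"stud[0-9]*")
INSTRUCTOR = re.compile(r"inst[0-9]*")


def classify(user_id):
    """Pick the output for a userID; patterns are matched at the start of the ID."""
    if STUDENT.match(user_id):
        return "students"
    if INSTRUCTOR.match(user_id):
        return "instructors"
    return "other"


def split_by_user(tuples):
    """Forward each tuple to exactly one of three outputs, like a predicate-based split."""
    outputs = {"students": [], "instructors": [], "other": []}
    for tup in tuples:
        outputs[classify(tup["user"])].append(tup)
    return outputs


tuples = [
    {"user": "stud213_8"},
    {"user": "inst12_1"},   # hypothetical instructor ID
    {"user": "frbonfils"},
]
outputs = split_by_user(tuples)
print({name: len(items) for name, items in outputs.items()})
```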

37 Aggregation composite Partial results
    stud213_8,openvpn, ,, ,openvpn3,Ended, , ,19:08:21, ,19:10:17,117,4723,2779
    stud209_4,openvpn, ,, ,openvpn3,Ended, , ,12:22:53, ,12:25:37,163,5231,3263

38 Enrichment composite – ODBCSource: Read data from a SQL database; extract data about the course attended by students – Join: Correlate two streams according to the userID and the DateStart and DateEnd of the course
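The enrichment join can be sketched in Python (not SPL). A session is matched to a course when the userID is the same and the session date falls between the course's start and end dates. All dates, the course label, and the dict keys other than the userID are hypothetical, chosen only to make the example runnable.

```python
from datetime import date


def enrich(sessions, enrollments):
    """Attach course data to each session whose user and date match an enrollment."""
    enriched = []
    for session in sessions:
        for course in enrollments:
            same_user = session["user"] == course["user"]
            in_range = course["date_start"] <= session["date"] <= course["date_end"]
            if same_user and in_range:
                enriched.append({**session, "course": course["course"]})
    return enriched


# Hypothetical data: dates and the course label are invented for this sketch.
sessions = [
    {"user": "stud213_8", "date": date(2013, 6, 10)},
    {"user": "stud209_4", "date": date(2013, 7, 1)},
]
enrollments = [
    {"user": "stud213_8", "course": "COURSE-A",
     "date_start": date(2013, 6, 9), "date_end": date(2013, 6, 14)},
    {"user": "stud209_4", "course": "COURSE-A",
     "date_start": date(2013, 6, 9), "date_end": date(2013, 6, 14)},
]
result = enrich(sessions, enrollments)
print(result)
```

The second session is dropped because its date lies outside the course dates, which is the inner-join behavior assumed in this sketch.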

39 Enrichment composite Final results
    stud213_8,openvpn, ,, ,openvpn3,Ended, , ,19:08:21, ,19:10:17,117,4723,2779,JL27,1SM00,Cross brand,STG brands
    stud209_4,openvpn, ,, ,openvpn3,Ended, , ,12:22:53, ,12:25:37,163,5231,3263,5367,1SX2G,Cross brand,STG brands

40 Database loading composite – ODBCAppend: Insert rows into an SQL database table [Diagram labels: InfoSphere Streams, Data warehouse]

41 HDFS loading composite – HDFSFileSink: Write data to files in Hadoop File System

42 Agenda – InfoSphere Streams general description – Technical overview – Streams Processing Language – Use case – Demo

43 Toolkits and Accelerators to Speed Up Development – Standard Toolkit: Relational Operators (Filter, Sort, Functor, Join, Punctor, Aggregate); Adapter Operators (FileSource, UDPSource, FileSink, UDPSink, DirectoryScan, Export, TCPSource, Import, TCPSink, MetricsSink); Utility Operators (Custom, Split, Beacon, DeDuplicate, Throttle, Union, Delay, ThreadedSplit, Barrier, DynamicFilter, Pair, Gate, JavaOp, Switch, Parse, Format, Decompress, CharacterTransform); XML Operator (XMLParse) – IBM Supported Toolkits: Database, DataStage, Big Data, Data Explorer, Messaging, Internet, Text Analytics, Mining, SPSS, CEP, Time Series, Geospatial, Financial – Open-Source Toolkits: JSON, HTTP/REST, OpenCV, Accumulo, HBase, … – Big Data Accelerators: Telco Event Data Analytics, Social Data Analytics – User-Defined Toolkits: Extend the language by adding user-defined operators, types, and functions

45 RESOURCE LINKS

46 Who Can Help? The Big Data Tiger Team – Evangelists for your region and/or industry The Big Data Black Belts – Skilled TCPs for your region Product Management – Roger Rea Product Marketing – Asha Marsh Worldwide Sales – Stewart Hanna Worldwide Technical Sales – Robert Uleman Worldwide Software Services – Sandra Tucker Customer Enablement – Kevin Foster

47 Where to get Streams Software available via – Passport Advantage site – PartnerWorld – Academic Initiative – IBM Internal – 90-day trial code IBM Trials and Demos portal ibm.com -> Support & downloads -> Download -> Trials and demos

48 More Streams links Streams in the Information Management Acceleration Zone Streams in the Media Library (recorded sessions) – Search on streams The Streams Sales Kit – Software Sellers Workplace (the old eXtreme Leverage) – PartnerWorld InfoSphere Streams Wiki on developerWorks – Discussion forum, Streams Exchange

49 SPL DOC

53 Anatomy of an Operator Invocation Operators share a common structure – italics are sections to fill in
Syntax:
    stream<stream-type> stream-name = MyOperator(input-stream; …) {
        logic  logic ;
        window windowspec ;
        param  parameters ;
        output output ;
        config configuration ;
    }
Reading an operator invocation – Declare a stream stream-name – With attributes from stream-type – that is produced by MyOperator – from the input(s) input-stream – MyOperator behavior defined by logic, parameters, windowspec, and configuration; output attribute assignments are specified in output
Example:
    stream<rstring item> Sale = Join(Bid; Ask) {
        window Bid: sliding, time(30);
               Ask: sliding, count(50);
        param  match : Bid.item == Ask.item && Bid.price >= Ask.price;
        output Sale: item = Bid.item;
    }
For the example: – Declare the stream Sale with the attribute item, which is a raw (ASCII) string – Join the Bid and Ask streams with – sliding windows of 30 seconds on Bid, and 50 tuples of Ask – When items are equal, and Bid price is greater than or equal to Ask price – Output the item value on the Sale stream
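The behavior described for the Join example can be simulated in Python (not SPL). This is a simplified sketch, not the Join operator's implementation: an arriving tuple is matched against the other input's window and then stored in its own window. The class name `BidAskJoin`, the caller-supplied timestamps in seconds, and the exact eviction boundary are assumptions of this sketch.

```python
from collections import deque


def matches(bid, ask):
    """The match condition from the example: same item and bid price >= ask price."""
    return bid["item"] == ask["item"] and bid["price"] >= ask["price"]


class BidAskJoin:
    def __init__(self, bid_seconds=30.0, ask_count=50):
        self.bid_seconds = bid_seconds
        self.bids = deque()                  # (timestamp, bid): sliding time window
        self.asks = deque(maxlen=ask_count)  # sliding count window

    def _evict_old_bids(self, now):
        # Drop bids older than the time window (boundary handling is an assumption).
        while self.bids and now - self.bids[0][0] > self.bid_seconds:
            self.bids.popleft()

    def on_bid(self, now, bid):
        """Match a new bid against the Ask window, then store it."""
        self._evict_old_bids(now)
        sales = [{"item": bid["item"]} for ask in self.asks if matches(bid, ask)]
        self.bids.append((now, bid))
        return sales

    def on_ask(self, now, ask):
        """Match a new ask against the Bid window, then store it."""
        self._evict_old_bids(now)
        sales = [{"item": bid["item"]} for _, bid in self.bids if matches(bid, ask)]
        self.asks.append(ask)
        return sales


join = BidAskJoin()
join.on_bid(0, {"item": "X", "price": 10})
print(join.on_ask(10, {"item": "X", "price": 9}))   # bid is still in its window
print(join.on_ask(45, {"item": "X", "price": 9}))   # bid has expired by now
```

Timestamps are assumed to be non-decreasing; out-of-order arrivals would need a different eviction strategy.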

54 Streams Data Types – (any): (primitive), (composite) – (primitive): boolean, enum, (numeric), timestamp, (string), blob, xml – (numeric): (integral), (floatingpoint), (complex) – (integral): (signed) int8, int16, int32, int64; (unsigned) uint8, uint16, uint32, uint64 – (floatingpoint): (float) float32, float64, float128; (decimal) decimal32, decimal64, decimal128 – (complex): complex32, complex64, complex128 – (string): rstring, ustring – (composite): (collection), tuple – (collection): list, set, map User-defined types
    type Integers = list<…>;
    type MySchema = rstring s, Integers ls;

55 Collection Types list: array with bounds-checking [0, 17, age-1, 99] – Random access: can access any element at any time – Ordered, base-zero indexed: first element is someList[0] set: unordered collection {"cats", "yeasts", "plankton"} – No duplicate element values map: key-to-value mappings {"Mon":0, "Sat":99, "Sun":-1} – Unordered Use type constructors to specify element type – list<…>, set<…> – map<…, …> Can be nested to any number of levels – map<…, list<tuple<…>>> – {1 : [{"Joe",117885}, {"Fred",923416}], 2 : [{"Max",117885}], -1 : []} Bounded collections optimize performance – list<int32>[5] : at most 5 (32-bit) integer elements – Bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters

57 Streams Standard Toolkit: Relational Operators Relational operators – Functor: Perform tuple-level manipulations – Filter: Remove some tuples from a stream – Aggregate: Group and summarize incoming tuples – Sort: Impose an order on incoming tuples in a stream – Join: Correlate two streams – Punctor: Insert window punctuation markers into a stream

58 Streams Standard Toolkit: Adapter Operators Adapter operators – FileSource: Read data from files in formats such as csv, line, or binary – FileSink: Write data to files in formats such as csv, line, or binary – DirectoryScan: Detect files to be read as they appear in a given directory – TCPSource: Read data from TCP sockets in various formats – TCPSink: Write data to TCP sockets in various formats – UDPSource: Read data from UDP sockets in various formats – UDPSink: Write data to UDP sockets in various formats – Export: Make a stream available to other jobs in the same instance – Import: Connect to streams exported by other jobs – MetricsSink: Create displayable metrics from numeric expressions

59 Streams Standard Toolkit: Utility Operators Workload generation and custom logic – Beacon: Emit generated values; timing and number configurable – Custom: Apply arbitrary SPL logic to produce tuples Coordination and timing – Throttle: Make a stream flow at a specified rate – Delay: Time-shift an entire stream relative to other streams – Gate: Wait for acknowledgement from downstream operator – Switch: Block or allow the flow of tuples based on control input

60 Streams Standard Toolkit: Utility Operators (contd) Parallel flows – Barrier: Synchronize tuples from sequence-correlated streams – Pair: Group tuples from multiple streams of same type – Split: Forward tuples to output streams based on a predicate – ThreadedSplit: Distribute tuples over output streams by availability – Union: Construct an output tuple from each input tuple – DeDuplicate: Suppress duplicate tuples seen within a given time period Miscellaneous – DynamicFilter: Filter tuples based on criteria that can change while it runs – JavaOp: General-purpose operator for encapsulating Java code – Parse: Parse blob data for use with user-defined adapters – Format: Format blob data for use with user-defined adapters – Compress: Compress blob data – Decompress: Decompress blob data – CharacterTransform: Convert blob data from one encoding to another

61 Streams Extensibility: Toolkits Like packages, plugins, addons, extenders, etc. – Reusable sets of operators, types, and functions – What can be included in a toolkit? Primitive and composite operators; user-defined types; native and SPL functions; sample applications, utilities; tools, documentation, data, etc. – Versioning is supported – Define dependencies on other versioned assets (toolkits, Streams itself) Base for creating cross-domain and domain-specific applications Developed by IBM, partners, end-customers – Complete APIs and tools available – Same power as the built-in Standard Toolkit

62 Integration: the Internet and Database Toolkits Integration with traditional sources and consumers Internet Toolkit – InetSource: Periodically retrieve data from HTTP, FTP, RSS, and files Database Toolkit (ODBC databases: DB2, Informix, Oracle, solidDB, MySQL, SQLServer, Netezza) – ODBCAppend: Insert rows into an SQL database table – ODBCEnrich: Extend streaming data based on lookups in database tables – ODBCRun: Perform SQL queries with parameters from input tuples – ODBCSource: Read data from a SQL database – SolidDBEnrich: Perform table lookups in an in-memory database – DB2SplitDB: Split a stream by DB2 partition key – DB2PartitionedAppend: Write data to table in specified DB2 partition – NetezzaLoad: Perform high-speed loads into a Netezza database – NetezzaPrepareLoad: Convert tuple to delimited string for Netezza loads

63 Integration: the Big Data Toolkit Integration with IBM's Big Data Platform Data Explorer – DataExplorerPush: Insert records into a Data Explorer index BigInsights: Hadoop Distributed File System – HDFSDirectoryScan: Like DirectoryScan, only for HDFS – HDFSFileSource: Like FileSource, only for HDFS – HDFSFileSink: Like FileSink, only for HDFS – HDFSSplit: Write batches of data in parallel to HDFS

64 Integration: the DataStage Integration Toolkit Streams real-time analytics and DataStage information integration – Perform deeper analysis on the data as part of the information integration flow – Get more timely results and offload some analytics load from the warehouse Operators and tooling – Adapters to exchange data between Streams and DataStage: DSSource, DSSink – Tooling to generate integration code – DataStage provides Streams connectors

65 Integration: the Messaging Toolkit Integrate with IBM WebSphere MQ – Create a stream from a subscription to an MQ topic or queue – Publish a stream to an MQ series topic or queue Operators – XMSSource: Read data from an MQ queue or topic – XMSSink: Send messages to applications that use WebSphere MQ

66 Integration and Analytics: the Text Toolkit Derive structured information from unstructured text – Apply extractors: programs that encode rules for extracting information from text; written in AQL (Annotation Query Language); developed against a static repository of text data – AQL files can be combined into modules (directories) – Modules can be compiled into TAM files – NOTE: Old-style AOG files are not supported by BigInsights 2.0 and this toolkit; they can be used in Streams 3.0 with the Deprecated Text Toolkit Streams Studio plugins for developing AQL queries – Same tooling as in BigInsights Operator and utility – TextExtract: Run AQL queries (module or TAM) over text documents – createtypes script: Create stream type definitions that match the output views of the extractors Plays a major role in accelerators – SDA: Social Data Analytics – MDA: Machine Data Analytics

67 Analytics: The Geospatial Toolkit High-performance analysis and processing of geospatial data Enables Location Based Services (LBS) – Smarter Transportation: monitor traffic speed and density – Geofencing: detect when objects enter or leave a specified area Geospatial data types – e.g. Point, LineString, Polygon Geospatial functions – e.g. Distance, Map point to LineString, IsContained, etc.

68 Analytics: The TimeSeries Toolkit Apply Digital Signal Processing techniques – Find patterns and anomalies in real time – Predict future values in real time A rich set of functionality for working with time series data – Generation: Synthesize specific wave forms (e.g., sawtooth, sine) – Preprocessing: Preparation and conditioning (ReSample, TSWindowing) – Analysis: Statistics, correlations, decomposition and transformation – Modeling: Prediction, regression and tracking (e.g. Holt-Winters, GAMLearner)

69 Analytics: the Mining Toolkit Enables scoring of real-time data in a Streams application – Scoring is performed against a predefined model (in PMML file) – Supports a variety of model types and scoring algorithms Predictive Model Markup Language (PMML) – Standard for statistical and data mining models – Produced by Analytics tools: SPSS, ISAS, etc. – XML representation defined by … Scoring operators – Classification: Assign tuple to a class and report confidence – Clustering: Assign tuple to a cluster and compute score – Regression: Calculate predicted value and standard deviation – Associations: Identify the applicable rule and report consequent (rule head), support, and confidence Supports dynamic replacement of the PMML model used by an operator

70 IBM InfoSphere Streams Analytics: Streams + SPSS Real Time Scoring Service Included with SPSS, not with Streams [Diagram legend: S = SPSS Scoring Operator, R = SPSS Repository Operator, P = SPSS Publish Operator. Diagram labels: SPSS Modeler Scoring Stream, In-memory Repository, Change Notification, Model Refresh, File System, IBM SPSS Collaboration & Deployment Services, IBM SPSS Modeler Solution Publisher]

71 Analytics: the Financial Services Toolkit Adapters – Financial Information Exchange (FIX) – WebSphere Front Office for Financial Markets (WFO) – WebSphere MQ Low-Latency Messaging (LLM) Adapters Types and Functions – OptionType (put, call), TxType (buy, sell), Trade, Quote, OptionQuote, Order – Coefficient of Correlation – The Greeks (put/call values, delta, theta, etc.) Operators – Based on QuantLib financial analytics open source package – Compute theoretical value of an option (European, American) Example applications – Equities Trading – Options Trading

