IBM InfoSphere Streams Technical Overview

Presentation on theme: "IBM InfoSphere Streams Technical Overview"— Presentation transcript:

1 IBM InfoSphere Streams Technical Overview
Streams key messages: Velocity – low latency – microsecond response times Variety – all kinds of data, all kinds of analytics Volume – millions of events per second Agility – respond quickly to changes; fast development of Streaming apps Julie Garrone Christophe Menichetti March 25, 2017 © 2013 IBM Corporation

2 Agenda InfoSphere Streams general description Technical overview
Streams Processing Language Use case Demo

3 Agenda InfoSphere Streams general description Technical overview
Streams Processing Language Use case Demo

4 IBM InfoSphere Streams v3.0
A platform for real-time analytics on BIG data
Volume: terabytes per second, petabytes per day
Variety: all kinds of data, all kinds of analytics
Velocity: insights in microseconds
Agility: dynamically responsive, rapid application development
Millions of events per second
Microsecond latency
Sensor, video, audio, text, and relational data sources
Just-in-time decisions
Powerful analytics

That one sentence is the shortest statement I can think of that captures the essence of Streams.

Platform: Streams is not a solution or application, nor is it a limited-purpose tool. Instead, it is a platform. This means that it does not do anything useful when you first install it. It needs to be programmed before it can do anything. In that sense, it is much like a DBMS or an operating system. It comes with the tools, language, and building blocks that let you build programs for it, and with a runtime environment that lets you run those programs. The bad news is that you have to program it (you can't just pick a few options from a menu); the good news is that once you learn how to do that, it can do anything you can imagine (or, more accurately, anything you can find algorithms and code for).

Real-time: The Streams programs you create do their processing and analysis in as close to real time as it is possible to get on a standard IT platform. In this case, real-time means very low latency, where latency is the delay from the time a packet of data arrives to the time the result is available. A key factor here is that Streams does everything in memory; it has no concept of mass storage (disk).

Analytics: Because Streams is fast, scalable, and programmable, the kinds of analysis you can apply range from the simple to the extremely sophisticated. You are not limited to simple averages or if-then-else rules.

BIG data: Actually, make that infinite data. For purposes of program and algorithm design, streaming data has no beginning and no end and is therefore by definition infinite in volume. In practical terms, this means that Streams can process any kind of data feed, including those that would be much too slow or expensive to capture and store in their entirety.

Agility: We've added Agility to the three Vs of Big Data. A few features of the programming language and runtime make it easy to break an application into modules, add new modules, and modify existing ones without interrupting the overall solution.

5 Streams Analyzes All Kinds of Data
Acoustic (IBM Research) (Open Source) Mining in Microseconds (included with Streams) ***New Advanced Mathematical Models (IBM Research) Simple & Advanced Text (included with Streams) Text (listen, verb), (radio, noun) ***New Predictive (included with Streams) Statistics (included with Streams) The blue items are included in the Streams product (Mining in Microseconds, Statistics, Text Analysis) The red items have been built for Streams (Acoustic, Image & Video, Geospatial, Predictive, and Advanced Mathematics) and have been used in various projects and engagements, but they have not been made a part of the product. Contact development if you think any of them could help you in an opportunity. More later in this presentation on the toolkits available in the Streams product. A few notes: The “simple” in “Simple & Advanced Text” refers to basic functions for regular expression matching (and replacement) in string values; these are built into the Standard Toolkit, meaning they’re basically part of the language. “Advanced” refers to real text analysis using the same System T code used in BigInsights. This is new in Fix Pack 3 (November 2011). Statistics included with Streams range from simple aggregate functions (average, sum, etc.) to the more advanced metrics included in the Financial Services Toolkit. For true time series analysis, we have an advanced toolkit in development; this is covered under Advanced Mathematical Models in this slide. ***New Image & Video (Open Source) Geospatial (included with Streams) © 2013 IBM Corporation

6 Categories of Problems Solved by Streams
Applications that require on-the-fly processing, filtering and analysis of streaming data
  Sensors: environmental, industrial, surveillance video, GPS, …
  “Data exhaust”: network/system/web server/app server log files
  High-rate transaction data: financial transactions, call detail records
Criteria: two or more of the following
  Messages are processed in isolation or in limited data windows
  Sources include non-traditional data (spatial, imagery, text, …)
  Sources vary in connection methods, data rates, and processing requirements, presenting integration challenges
  Data rates/volumes require the resources of multiple processing nodes
  Analysis and response are needed with sub-millisecond latency
  Data rates and volumes are too great for store-and-mine approaches

Streams targets applications where processing goals are best achieved by on-the-fly mining and processing of streaming data. That is, where one or more of the following holds.

Ingest volume and rate exceed storage capacity. This requires data filtering, aggregating, and other processing to reduce the volume so it can be handled by a data storage and management system at the back end. Even if the goal is not analysis on the fly, Streams is needed simply to prepare the data for analysis later. Example: extremely noisy and voluminous data feeds in radio astronomy.

Ingest rate plus processing requirements exceed the ability of a single application on a single processor to process the data. InfoSphere Streams is specifically designed with parallelism and deployment across computing clusters in mind.

The result is needed in (near) real time. Latency refers to the delay, or elapsed time, between the appearance of a data item at the input and the availability of the result at the output. Analysis of stored data is not a viable option when delays measured in milliseconds add up to lives or money lost. Data must be processed in memory, as it comes in, before it can be committed to persistent storage.

7 The Big Data Ecosystem: Interoperability is Key
Streaming Data Streams Non-Traditional / Non-Relational Data Feeds RTAP: Analytics on Data in Motion Non-Traditional / Non-Relational Data Sources BigInsights Internet- Scale Data Sets Traditional / Relational Data Sources Analytics on Data at Rest The integration from the previous slide is depicted here again, in a slightly different and simpler way. The key point is that the Big Data approach, and in particular Streams/RTAP (Real-Time Analytic Processing), is (almost) always in addition to, not instead of, traditional warehousing and IT. All three approaches are not necessarily present in every project, but there are always bidirectional connections between them. Traditional Warehouse Traditional / Relational Data Sources Data Warehouse Analytics on Structured Data © 2013 IBM Corporation

8 EuroBuzz with France Televisions
Smarter Analytics applied to the Eurovision Song Contest
Twitter Streaming API: tweets filtered by Eurovision-specific hashtags (#eurovision, #eurofrancetv, …)
Real-time analysis: filters, aggregations, and calculations on incoming tweets
Two different streams based on hashtags
  World stream collecting tweets from all Eurovision hashtags
  France stream collecting tweets from the France Televisions hashtag #eurofrancetv
Analysis of two languages: English and French
24-hour data aggregation
Tweet format: {"entities":{"hashtags":[{"text":"eurovision","text":"Sweeden will win Eurovision for sure #eurovision" ….
Diagram labels: Live Eurovision Show; Tweets on Eurovision event; Streams; Overall statistics (World & France), JSON format; Data-visualisation partner; France 3 website (buzz statistics, latest tweets on Eurovision, gauge by country, pictures of Eurovision / Twitter top contributors)
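The hashtag-based split into a World stream and a France stream can be sketched in a few lines of Python. This is only an illustration of the routing and counting idea, not the SPL application used in the PoC. The hashtag sets are assumptions (the slide names #eurovision and #eurofrancetv and elides the rest), and the function names `hashtags` and `route` are hypothetical.

```python
import json
from collections import Counter

# Assumed hashtag sets: the slide lists #eurovision and #eurofrancetv and
# elides the others, so these sets are illustrative only.
WORLD_TAGS = {"eurovision", "eurofrancetv"}
FRANCE_TAGS = {"eurofrancetv"}


def hashtags(tweet):
    """Return the lowercase hashtag texts found under entities.hashtags."""
    entities = tweet.get("entities", {})
    return {h["text"].lower() for h in entities.get("hashtags", [])}


def route(raw_tweets):
    """Count tweets per stream ('world', 'france') from JSON-encoded tweets."""
    counts = Counter()
    for raw in raw_tweets:
        tags = hashtags(json.loads(raw))
        if tags & WORLD_TAGS:      # any Eurovision hashtag feeds the World stream
            counts["world"] += 1
        if tags & FRANCE_TAGS:     # the France Televisions hashtag feeds both
            counts["france"] += 1
    return counts
```

A tweet tagged #eurofrancetv is counted in both streams here, which matches the slide's description of the World stream as covering all Eurovision hashtags.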

9 Infrastructure for the PoC -> InfoSphere Streams Efficiency
EuroBuzz with France Televisions Smarter Analytics applied to the Eurovision Song Contest Infrastructure for the PoC 2 Blades in Active/Active Mode (Nehalem Sockets) SAN Storage to store all data during the PoC (All Tweets) All blades with RHEL6.1.64 HS22 8 cores 2.93 GHz 48GB RAM Master Node SAN DS8700 Data-visualisation Partner HS22 8 cores 2.93 GHz 48GB RAM Secondary Node SAN Network peak at the beginning of each song Max 100% Max 10% Very low CPU utilization : < 1% -> InfoSphere Streams Efficiency

10 University of Ontario Institute of Technology
Use case
  Neonatal infant monitoring
  Predict infection in ICU 24 hours in advance
Solutions
  120 children monitored: 120K msg/sec, billion msg/day
  Trials expanding to include hospitals in US and China
Diagram labels: Sensor Network; Event Pre-processor; Analysis Framework; Solutions (Applications); Stream-based Distributed Interoperable Health care Infrastructure

This is a research project done by UOIT (University of Ontario Institute of Technology, Toronto) for premature (neonatal) babies in Intensive Care Units. In ICUs, the instrumentation is there: dozens of sensors for each infant. But the instruments work in isolation and only produce visual signals, auditory signals in case of anomalies, and sometimes a paper record on demand. Streams is a highly usable platform for finally capturing all that data and combining it in useful ways, in real time. Since this kind of combined analysis has not been done before, Streams first needs to enable the research effort of figuring out which patterns are relevant and predictive, before any of this can be standardized and put into an approved product and practice.

In summary, Streams enables real-time analytics and correlations on physiological data streams such as blood pressure, temperature, EKG, and blood oxygen saturation. This, in turn, can enable detection of the onset of potentially life-threatening conditions up to 24 hours earlier than current medical practices. Early intervention leads to lower patient morbidity and better long-term outcomes. This technology also enables physicians to verify new clinical hypotheses.

11 Asian Telco reduces billing costs and improves customer satisfaction
Capabilities: Stream Computing, Analytic Accelerators
Real-time mediation and analysis of B CDRs per day
Data processing time reduced from hrs to 1 min
Hardware cost reduced to 1/8th
Proactively address issues (e.g. dropped calls) impacting customer satisfaction
Customer Reference Database Link:

We have been working with an Indian Telco client for some time now to help reduce their billing costs and improve customer satisfaction.

Challenge: Call Detail Record (CDR) processing within their data warehouse was sub-optimal. They could not achieve real-time billing, which required handling billions of CDRs per day and de-duplication against 15 days' worth of CDR data. They were also unable to support future IT and business needs with real-time analytics.

Solution: A single platform for mediation and real-time analytics reduces IT complexity. CDR processing was offloaded to InfoSphere Streams, resulting in enhanced data warehouse performance and improved TCO. The PMML standard is used to import data mining models from InfoSphere Warehouse. Each incoming CDR is analyzed using these data mining models, allowing immediate detection of events (e.g., dropped calls) that might create customer satisfaction issues.

Business benefit: Data is now processed at the speed of business, from 12 hours to 1 minute. Hardware costs are reduced to 1/8th. Future growth (more data, more analysis) is supported without the need to re-architect. A platform is in place for real-time analytics to drive revenue.

12 Agenda InfoSphere Streams general description Technical overview
Streams Processing Language Use case Demo

13 How Streams Works
Continuous ingestion
Continuous analysis

This slide and the following one are old but serve their purpose well. As with the video processing demo, the aim is for people to get familiar with the concept of streaming data processing as the flow of discrete packets of data from one operator to the next in a flow graph; each operator does something with the data that comes in and produces new data as a result. Seeing those little blocks move from bubble to bubble in the animated slide, relentlessly and never ending, conveys the point rather well, so let the audience stare at this for a while, especially once you reveal what goes on behind the grey panel. At this point, however, all you're showing is that a Streams application (here shown as a black box or, rather, a grey one) continuously ingests data from multiple sources (often many different types of data arriving at very different rates), and continuously sends analysis results and processed data to disk files, databases, real-time displays, or any other computational or IT processes (depicted as the proverbial cloud).

14 How Streams Works
Continuous ingestion
Continuous analysis: Filter / Sample, Transform, Annotate, Correlate, Classify
Infrastructure provides services for
  Scheduling analytics across hardware hosts
  Establishing streaming connectivity
Achieve scale
  By partitioning applications into software components
  By distributing across stream-connected hardware hosts
Where appropriate: elements can be fused together for lower communication latency

Click-by-click (3 and 4 are timed; no click needed): The overall application is composed of many individual operators (or, maybe, Processing Elements consisting of one or more operators each); those are the bubbles in this graph. The data packets (sometimes called messages; we call them tuples) generated by one bubble travel to the next downstream bubble, and so on. Sometimes the output from one bubble becomes input to multiple bubbles; conversely, sometimes multiple output streams merge as input to one bubble. But the tuples keep flowing.

Having all these separate processes working together gives us flexibility and scalability. If we need more power than a single computer can provide, we can simply distribute the bubbles over multiple boxes (hosts). Some (compute-intensive) bubbles get their very own box; other bubbles want to be in the same box, because they must exchange a lot of data and going across the network would slow that down. The Streams runtime services can help distribute the load. In the picture, the streams flowing within a host (interprocess communication) turn red, indicating that they are faster than the blue ones that go across the network between hosts.

Finally, we label some of the boxes with some of the processing stages typically found in many Streams solutions: filter and sample to reduce the amount of data flowing through the system; transform, changing the data format, computing new quantities, and dropping unneeded attributes; correlate to join or otherwise combine data from multiple sources; classify based on models generated from historical or training data; annotate to enrich the data by looking up static information based on keys present in the data.
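As a rough illustration of tuples flowing from one processing stage to the next, the following Python sketch chains generator functions for four of the stages named on this slide (filter, transform, annotate, classify; correlate is left out for brevity). It runs in a single process and says nothing about how Streams distributes or fuses operators. The attribute names (`sensor`, `temp_c`), the lookup table, and the threshold are all hypothetical.

```python
def filter_valid(tuples):
    """Filter: drop tuples that carry no temperature reading."""
    for t in tuples:
        if t["temp_c"] is not None:
            yield t


def transform(tuples):
    """Transform: compute a new attribute (Fahrenheit) from an existing one."""
    for t in tuples:
        yield {**t, "temp_f": t["temp_c"] * 9 / 5 + 32}


def annotate(tuples, locations):
    """Annotate: enrich each tuple by looking up static data on a key."""
    for t in tuples:
        yield {**t, "location": locations.get(t["sensor"], "unknown")}


def classify(tuples, threshold_f=100.0):
    """Classify: label each tuple with a (trivial, hypothetical) rule."""
    for t in tuples:
        yield {**t, "label": "hot" if t["temp_f"] > threshold_f else "normal"}


def run(readings, locations):
    """Wire the stages into one flow; each tuple passes through all of them."""
    flow = classify(annotate(transform(filter_valid(readings)), locations))
    return list(flow)
```

Because every stage is a generator, each tuple is pulled through the whole chain one at a time, which loosely mirrors the "tuples keep flowing" picture in the notes.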

15 InfoSphere Streams Runtime
Distributed runtime, running on a cluster
  No limit on cluster size
Import/export streams between jobs
Agility: dynamically reconfigure applications
Performance and health metrics down to operator level
Jobs submitted and canceled at any time
Hosts added to and removed from cluster at any time
High Availability
  Restart and relocate portions of applications
  Assign parts of applications to host pools based on host tags
Application portability
  Simplifies test and operations

The Streams runtime is a clustered runtime. With the compiler and deployment manager portions of Streams, it runs Streams jobs (the instantiations of Streams applications). With v2.0, IBM has removed the 125 node limit on cluster size that we had in Streams 1.2.

Streams jobs can import and export streams that are part of their application. So, any streaming data that is used between two operators in one job can be exported and imported into a new job. If a finance customer suspects there may be a better trading algorithm, they can add a new job that taps into the existing job to try out this new algorithm, with output sent to some test destination for verification. Larger applications can be built from smaller ones, much like the model-view-controller design pattern for Java apps, to facilitate updating one part of the overall application. But with Streams, this can mean dynamically adding new input sources (e.g., new smart grid sensors, or synchrophasors, can be added as the grid grows), new logic, or new output destinations.

Performance and health metrics are now reported at the operator level rather than only per Processing Element as in prior releases, and a new interface makes all performance metrics available programmatically to facilitate automated monitoring.

Applications and hosts can be added at any time to address scalability issues. As mentioned before, by designing applications to facilitate new jobs for new logic, input sources, and output destinations, the ability to dynamically add hosts and applications helps create a highly available, scalable environment. Existing application components can be restarted and relocated to other hosts. Assignment of jobs, or parts of jobs, to different hosts can be more easily accomplished with host tags. Applications can have the system duplicate data streams to jobs running on separate nodes in a cluster to improve application availability and resistance to hardware failures.

Streams has the ability to compile once and deploy to different runtimes. This simplifies test and operations, since it doesn't require developers to recompile programs for each unique runtime instance.

16 From Operators to Running Jobs
Src Sink stream Streams application graph: A directed, possibly cyclic, graph A collection of operators Connected by streams Each complete application is a potentially deployable job Jobs are deployed to a Streams runtime environment, known as a Streams Instance (or simply, an instance) An instance can include a single processing node (hardware) Or multiple processing nodes A stream processing application is modeled as a graph (also known as a network): a set of vertices (or nodes) connected by edges. In Streams, the vertices are called operators and the edges are called, well, streams. In mathematics, there are different types of graph; ours is directed, which means that the edges have a direction from one vertex to another but not back; they are arrows. Also, our graph may be cyclic, which means that loops are allowed. There are restrictions on this (only certain types of backward connections are allowed) but this is good enough for now. The term job simply refers to a running application. It’s looking at it from the runtime execution perspective rather than the programming perspective. node h/w node Streams instance © 2013 IBM Corporation 16

17 InfoSphere Streams Objects: Runtime View
Instance Instance Runtime instantiation of InfoSphere Streams executing across one or more hosts Collection of components and services Processing Element (PE) Fundamental execution unit that is run by the Streams instance Can encapsulate a single operator or many “fused” operators Job A deployed Streams application executing in an instance Consists of one or more PEs Job Node PE PE operator Stream 1 Stream 2 Stream 1 Node PE Stream 4 Stream 3 Stream 3 Stream 5 Another view of the same thing, to get a clear model in everyone’s mind. The Streams Instance is the runtime environment that makes it possible to run Streams applications: analogous to an application server or a database server. The instance is a collection of Linux components and services, managing one or more hosts for running applications. We’ve already seen the Processing Element as well: the fundamental unit of execution, akin to a shared library. Each PE encapsulates one or more operators; at deployment time, each PE is loaded by its own executable program, the PE Container. When a Streams application runs in the Instance, it is called a Job. The job, in turn, manifests itself as a collection of one or more PEs. When you monitor an instance (using one of a variety of tools, including a command-line program, a browser-based management console, and an interactive view of the flow graph in Streams Studio), you can get information about jobs and PEs, submit and cancel jobs, terminate and restart PEs, view PE-specific logs, and so on. © 2013 IBM Corporation

18 InfoSphere Streams Objects: Development View
Streams Application InfoSphere Streams Objects: Development View Operator The fundamental building block of the Streams Processing Language Operators process data from streams and may produce new streams Stream An infinite sequence of structured tuples Can be consumed by operators on a tuple-by-tuple basis or through the definition of a window Tuple A structured list of attributes and their types. Each tuple on a stream has the form dictated by its stream type Stream type Specification of the name and data type of each attribute in the tuple Window A finite, sequential group of tuples Based on count, time, attribute value, or punctuation marks operator stream height: width: data: height: width: The nice thing about Streams is that it effectively separates the various areas of concern: the runtime instance takes care of all the distributed-computing and communication transport issues; the operators encapsulate all the algorithmic smarts, which may be domain-specific and come from external, open-source libraries or internal, protected intellectual property; and the definition of the application graph, which is completely generic and can be done by an architect without knowledge of the algorithmic internals of each operator. The terms here contrast with the runtime view of the previous slide; these are all logical concepts for stream processing. directory: "/img" filename: "farm" directory: "/img" filename: "bird" directory: "/opt" filename: "java" filename: "cat" tuple © 2013 IBM Corporation
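To make the tuple, stream type, and window terms concrete, here is a small Python sketch of a count-based tumbling window over an endless stream. It is an analogy only; SPL windows are configured declaratively on operator invocations. The `Reading` type and its attributes, and the function names, are hypothetical.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List, NamedTuple


class Reading(NamedTuple):
    """Plays the role of a stream type: attribute names and their types."""
    sensor: str
    value: float


def tumbling(stream: Iterable[Reading], size: int) -> Iterator[List[Reading]]:
    """Group an unbounded stream into consecutive, non-overlapping windows."""
    window: List[Reading] = []
    for t in stream:
        window.append(t)
        if len(window) == size:   # window is full: emit it and start afresh
            yield window
            window = []


def averages(stream: Iterable[Reading], size: int) -> Iterator[float]:
    """Emit one average per full window."""
    for w in tumbling(stream, size):
        yield sum(t.value for t in w) / size


def endless() -> Iterator[Reading]:
    """A stream with no end: values 0.0, 1.0, 2.0, ..."""
    for i in count():
        yield Reading("s1", float(i))
```

The stream itself never ends, so a consumer decides how many results to take, for example `list(islice(averages(endless(), 3), 2))` for the first two window averages.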

19 Host Pools Facilitate High Availability
HA application design pattern
  Source job exports stream, enriched with tuple ID
  Jobs 1 & 2 process in parallel, and export final streams
  Sink job imports streams, discards duplicates, alerts on missing tuples
Diagram labels: Source, Job 1, Job 2, Sink; Host pools 1 to 4; x86 hosts

A typical pattern for high availability is the duplicate processing of the same data by running the same job multiple times, on different sets of hosts. Using export/import as mentioned before, they pick up the same data from a source module; likewise, at the end a sink module imports the results from both. If anything happens to one of the hosts running one job, chances are that the other job continues to run. The only complexity in this arrangement is that the sink job must perform a deduplication step. A Deduplicate operator is now part of the Standard Toolkit.

Duplicate processing provides a high degree of assurance against data loss, at the cost of using redundant hardware (that is always the case in high-availability scenarios). Within each job, there is also the standard recovery support (if appropriately configured): a failed job can be restarted on another host within the same host pool. This may entail data loss. It is up to the application designer to decide how much downtime and data loss is acceptable and what cost (in terms of redundant licenses and hardware) is acceptable to avoid that risk.
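The sink job's two duties in this pattern, discarding duplicates and alerting on missing tuples, can be sketched as follows. This is a batch-style Python illustration under simplifying assumptions: tuple IDs are known in advance and the set of seen IDs is unbounded, whereas a real streaming sink would have to bound that state (for example by time or count). The function name `merge_replicas` is hypothetical, and this is not the Standard Toolkit operator.

```python
def merge_replicas(tuples, expected_ids):
    """Merge output from redundant jobs.

    tuples: iterable of (tuple_id, payload) pairs from both jobs, in arrival order.
    expected_ids: the tuple IDs the source job is known to have emitted.
    Returns (unique tuples in first-arrival order, sorted list of missing IDs).
    """
    seen = set()
    unique = []
    for tuple_id, payload in tuples:
        if tuple_id in seen:          # second copy from the other job: discard
            continue
        seen.add(tuple_id)
        unique.append((tuple_id, payload))
    missing = sorted(set(expected_ids) - seen)   # candidates for an alert
    return unique, missing
```

For example, if Job 1 delivers IDs 1 and 2 and Job 2 delivers IDs 1, 2, and 3 while the source emitted IDs 1 to 4, the merged output holds one copy each of 1, 2, and 3, and ID 4 is reported as missing.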

20 Agenda InfoSphere Streams general description Technical overview
Streams Processing Language Use case Demo

21 What is Streams Processing Language?
Designed for stream computing Define a streaming-data flow graph Rich set of data types to define tuple attributes Declarative Operator invocations name the input and output streams Referring to streams by name is enough to connect the graph Procedural support Full-featured imperative language Custom logic in operator invocations Expressions in attribute assignments and parameter definitions Extensible User-defined data types Custom functions written in SPL or a native language (C++ or Java) Custom operators written in SPL User-defined operators written in a native language (C++ or Java) Streams Processing Language is unique to IBM InfoSphere Streams. This is what gives Streams the power and flexibility that other systems lack. Of course, the downside is that there is yet another language to learn. But in practice, this is not the burden it may seem, because the greatest step in the learning curve is getting comfortable with a new computing paradigm, the distributed nature of typical deployments, and so on. Once you’ve taken that step, the specific syntax of the language is a minor addition—especially since the IDE helps complete your stream and operator declarations with all the brackets and braces in the right place. Type-generic operators: The built-in operators work on any combination of the data types supported by Streams Declarative operators: To configure the operators, you declare them, give them names, define input and output stream(s), and assign parameters User-defined operators: These can take different forms, including one that makes them behave exactly like all the built-in operators (type-generic) Procedural constructs: Additional language elements that control repetition, loops, and other aspects of the streams connecting the operators The following slides cover all of this in greater detail. © 2013 IBM Corporation 21

22 Some SPL Terms
An operator represents a class of manipulations of tuples from one or more input streams to produce tuples on one or more output streams
A stream connects to an operator on a port
  An operator defines input and output ports
An operator invocation is a specific use of an operator
  with specific assigned input and output streams
  with locally specified parameters, logic, etc.
Many operators have one input port and one output port; others have
  zero input ports: source adapters, e.g., TCPSource
  zero output ports: sink adapters, e.g., FileSink
  multiple output ports, e.g., Split
  multiple input ports, e.g., Join
A composite operator is a collection of operators
  An encapsulation of a subgraph of primitive operators (non-composite) and composite operators (nested)
  Similar to a macro in a procedural language
Diagram labels: Employee Info, Aggregate, Salary Statistics, port, TCPSource, FileSink, Split, Join, composite operator

Just to give you an idea of how you work with the Streams Processing Language (SPL), here are some basic terms that have to do with the fundamental building block of the language: the operator. All the work is done inside the operators; as we've seen, the application graph consists of just operators and the streams that connect them. Each operator has its own definition in terms of the inputs and outputs it requires and supports, the parameters, required and optional, that tell it how to do its job, and other requirements such as windowing, output attribute assignments, etc. In the operator itself, inputs and outputs are referred to as ports.

In a real application program, operators are invoked; each invocation lets you specify the parameters, input and output stream names, etc. for that specific instance. The same operator may be invoked many times in the same program. This is completely analogous to, say, making function calls in a C++ program. (SPL also supports functions, though, so don't make too much of the analogy between operators and functions.)

The composite operator construct is a very powerful technique for avoiding repetitive code and promoting reusability. In the way it works, it is more like a macro than a function: the compiler expands its definition through all levels of nesting until there's nothing left but primitive operators. By the way, "primitive" here just means that it cannot be further decomposed, not that it is somehow not advanced or sophisticated. As a sign of the consistency of the language, the composite operator idea has been carried through to the point that every self-contained Streams application program is a special kind of composite: a main composite.
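The macro-like expansion of composites can be pictured with a short recursive Python sketch: any name that has a definition is replaced by its contents until only primitive operators remain. This models the idea only, not the SPL compiler, and it does not guard against a composite that refers to itself. The composite names `Clean` and `Normalize` in the usage example are hypothetical.

```python
def expand(name, composites):
    """Return the primitive operators that `name` expands to, in order.

    composites maps a composite operator's name to the list of operators
    it invokes; any name without an entry is treated as primitive.
    """
    if name not in composites:        # nothing left to decompose
        return [name]
    primitives = []
    for inner in composites[name]:    # expand each invocation in turn
        primitives.extend(expand(inner, composites))
    return primitives
```

For instance, with `{"Main": ["TCPSource", "Clean", "FileSink"], "Clean": ["Filter", "Normalize"], "Normalize": ["Functor"]}`, expanding `"Main"` goes through two levels of nesting before reaching only primitive operators.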

23 Composite Operators Every graph is encoded as a composite
A composite is a graph of one or more operators A composite may have input and output ports Source code construct only Nothing to do with operator fusion (PEs) Each stream declaration in the composite Invokes a primitive operator or another composite operator An application is a main composite No input or output ports Data flows in and out but not on streams within a graph Streams may be exported to and imported from other applications running in the same instance composite Main { graph stream … { } . . . Application (logical view) operator Stream 1 Stream 2 A little more detail about composite operators. In general, a composite operator is a subgraph (a piece of a graph) consisting of one or more operators. It may have input and output ports, as well as parameters. Each of the operators in a composite may be a primitive operator, or may itself be a composite: composites can be infinitely nested. In the invocation of an operator, you cannot tell whether it is a composite or a primitive: they look the same, and only by jumping to the implementation code can you find out whether it is one or the other. This nesting concept can be followed in the other direction, too: a complete graph (or application) is also a composite, but of a special kind: it is a main composite. Main composites are special in the sense that their declarations have no input and output streams (remember: a stream only exists within the Streams environment, flowing from one operator to another; applications get their data from the outside world but those data flows are not called streams), and no mandatory parameters. (We’ll learn about parameters later.) Yes, applications can export and import streams, but these are not inputs and outputs that are defined in the composite declaration; instead, they are created by (pseudo)operators in the body of the composite. Stream 1 Stream 4 Stream 3 Stream 3 Stream 5 23 23 © 2013 IBM Corporation 23

24 Composing a Flow Graph with Stream Definitions
[Diagram: Composite POS_TxHandling. TCPSource -> POS_Transactions -> Operator 1 -> Sales -> Operator 2 -> TaxableSales -> Operator 3 -> TaxesDue -> TCPSink; TCPSource -> Deliveries and Sales -> Operator 4 -> Inventory -> Operator 5 -> Reorders -> TCPSink]

  composite POS_TxHandling {
    graph
      stream<…> POS_Transactions = TCPSource() {…}
      stream<…> Sales = Operator1(POS_Transactions) {…}
      stream<…> TaxableSales = Operator2(Sales) {…}
      stream<…> TaxesDue = Operator3(TaxableSales) {…}
      () as Sink1 = TCPSink(TaxesDue) {…}
      stream<…> Deliveries = TCPSource() {…}
      stream<…> Inventory = Operator4(Sales;Deliveries) {…}
      stream<…> Reorders = Operator5(Inventory) {…}
      () as Sink2 = TCPSink(Reorders) {…}
  }

Without getting too deep into the syntax, this is an illustration of how SPL, as a declarative language, sets up the graph. At the top you see the graph as we’re building it; below is the code (without the details of each operator invocation). Step by step:

As mentioned in the notes to the previous slide, the program is itself a composite operator. One of the subclauses in the composite statement is the graph subclause.

We define two operators: a source-type operator (which has no input stream) to get data from the outside world (in this case, a TCP link), and another one, an instance of Operator1 (whatever that is). The Source produces the stream POS_Transactions, which is consumed by Operator1, which in turn produces the stream Sales. Simply by naming these streams, and mentioning POS_Transactions as the output of one and the input of another, we’ve “wired up” the graph.

We do it again for Operator 2, producing TaxableSales while consuming Sales. We complete a simple graph with a few more operators, defined in the same way, ending with a sink-type operator, which has no output stream but sends data to the outside world.

We can create more complex graphs, with branches, merge points, and so on. Here we’ve added another flow with more operators. Notice that Operator4 has two input ports; simply by naming two input streams (separated by a semicolon), we get what we want. Notice also that the Sales stream is consumed by more than one operator: Operator2 as well as Operator4. No special forking or branching construct is needed: simply by using the names we get what we want.

One more point: because this is a declarative language (like SQL), not a procedural one (like C++), the order of these statements does not matter at all.
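
To make the "wiring by name" idea concrete, here is a small Python analogy (not SPL, and not how the Streams compiler works internally). It records each invocation from the slide as an output stream, an operator, and its input streams, then derives who consumes what purely from the names. The `INVOCATIONS` table and the `wire` helper are names made up for this sketch.

```python
# Each entry mirrors one invocation from the slide:
# (output stream, operator, input streams). A sink has no output stream.
INVOCATIONS = [
    ("POS_Transactions", "TCPSource", []),
    ("Sales", "Operator1", ["POS_Transactions"]),
    ("TaxableSales", "Operator2", ["Sales"]),
    ("TaxesDue", "Operator3", ["TaxableSales"]),
    (None, "TCPSink", ["TaxesDue"]),
    ("Deliveries", "TCPSource", []),
    ("Inventory", "Operator4", ["Sales", "Deliveries"]),
    ("Reorders", "Operator5", ["Inventory"]),
    (None, "TCPSink", ["Reorders"]),
]


def wire(invocations):
    """Return {stream name: sorted list of operators that consume it}."""
    consumers = {}
    for _output, operator, inputs in invocations:
        for stream in inputs:
            consumers.setdefault(stream, []).append(operator)
    return {stream: sorted(ops) for stream, ops in consumers.items()}


edges = wire(INVOCATIONS)
# Reversing the statement order yields exactly the same wiring.
assert edges == wire(list(reversed(INVOCATIONS)))
print(edges["Sales"])  # ['Operator2', 'Operator4']
```

The Sales stream fans out to two consumers without any explicit fork, and shuffling the statements changes nothing, which is the point the notes make about declarative graph definitions.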


26 Example description
Goal: Leverage information about the use of the OpenVPN infrastructure by online students
[Diagram labels: Users, OpenVPN, Syslog server, InfoSphere Streams, data warehouse, reports with Cognos and real-time reports with Cognos RTM]

27 Global diagram
A different composite for each step of the program:
  File source
  Parsing
  Aggregation
  Enrichment with database
  Database loading
  HDFS loading
Transfers between composites use the Export / Import operators.

28 File source composite
FileSource: Read data from files in formats such as csv, line, or binary
  Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
  Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
  Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
  Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
  Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils; ;openvpn5;Sun Jun 9 05:12:03 …
  Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…

29 File source composite
FileSource: Read data from files in formats such as csv, line, or binary
  Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
  Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
  Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
  Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
  Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
  Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils; ;openvpn5;Sun Jun 9 05:12:03 …

30 File source composite
Throttle: Make a stream flow at a specified rate
  Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
  Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
  Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
  Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
  Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils; ;openvpn5;Sun Jun 9 05:12:03 …
  Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…

31 Parsing composite
Two different parsing methods:
  TextExtract: Run AQL queries (module or TAM) over text documents
  Functor: Perform tuple-level manipulations; used with Tokenizer
Other operators to filter data:
  Filter: Remove some tuples from a stream
  Split: Forward tuples to output streams based on a predicate

32 Parsing composite
TextExtract: Use AQL text analysis to extract the fields of the syslog
  Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
  Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
  Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
  Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
  Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils; ;openvpn5;Sun Jun 9 05:12:03 …
  Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
Fields: HostName, Tag, Content, Timestamp

33 Parsing composite
Keep only the tuples coming from openvpn (Filter)
Pass on only the content part with the Functor operator
Split the flow to eliminate incomplete lines
  Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
  Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
  Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
  Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
  Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils; ;openvpn5;Sun Jun 9 05:12:03 …
  Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1; ;…
Fields: HostName, Tag, Content, Timestamp

34 Parsing composite
Functor with tokenizer: Break a string based on a delimiter character
Structure of a sample openvpn content:
  ;<userId>;<virtualIP>;<cnxServer>;<dateTime>;<status>;<message>;<IPsource>;S:<bytesSend>;R:<bytesReceived>
<status> can be “Connected”, “Ended” or “Rejected”
Split the flow to eliminate “Rejected” connections
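
As a rough illustration of this step, the following Python sketch (an analogy, not the SPL Functor or Split operators themselves) tokenizes a content string on the `;` delimiter and then routes "Rejected" records to a separate output. The helper names `parse_content` and `split_rejected` are made up for the example; the sample string is one of the connection records shown in this deck, where the IP fields appear blank.

```python
FIELDS = ["userId", "virtualIP", "cnxServer", "dateTime",
          "status", "message", "IPsource"]


def parse_content(content):
    """Tokenize one openvpn content string on ';' into a dict of named fields."""
    # The string starts with ';', so the first token is empty and is dropped.
    tokens = content.split(";")[1:]
    record = {name: tok.strip() for name, tok in zip(FIELDS, tokens)}
    # The "S:<n>" and "R:<n>" byte counters are only present on some records.
    for tok in tokens[len(FIELDS):]:
        if tok.startswith("S:"):
            record["bytesSend"] = int(tok[2:])
        elif tok.startswith("R:"):
            record["bytesReceived"] = int(tok[2:])
    return record


def split_rejected(records):
    """Route records to (kept, rejected), the way a two-output split would."""
    kept, rejected = [], []
    for rec in records:
        (rejected if rec.get("status") == "Rejected" else kept).append(rec)
    return kept, rejected


sample = (";Odi_platform_demonstration_49165_1; ;openvpn6;"
          "Sun Jun 9 06:46:39 CEST 2013;Ended;"
          "Connection terminated, temps cnx 6412 sec; ;S:78203;R:370963;")
record = parse_content(sample)
print(record["status"], record["bytesSend"], record["bytesReceived"])
# Ended 78203 370963
```

Records that lack the byte counters (for example a "Connected" line) simply come back without those two keys, so a later step can tell the two kinds apart.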

35 Aggregation composite
Aggregation groups the log records that describe the same connection:
  ;Odi_platform_demonstration_49165_1; ;openvpn6;Sun Jun 9 04:59:53 CEST 2013;Connected;Connection accepted; ;
  ;Odi_platform_demonstration_49165_1; ;openvpn6;Sun Jun 9 06:46:39 CEST 2013;Ended;Connection terminated, temps cnx 6412 sec; ;S:78203;R:370963;
Keep only the pairs that correspond to Connected / Ended
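
The pairing logic can be pictured with a short Python sketch. This is an analogy only: the deck does not show how the Aggregate operator is configured, so the grouping key used here (user ID plus connection server) and the function name `pair_connections` are assumptions.

```python
def pair_connections(records):
    """Yield (connected, ended) pairs for records that share a connection key.

    Assumption: a connection is identified by (userId, cnxServer).
    """
    open_connections = {}
    for rec in records:
        key = (rec["userId"], rec["cnxServer"])
        if rec["status"] == "Connected":
            open_connections[key] = rec
        elif rec["status"] == "Ended" and key in open_connections:
            # Emit the pair and forget the connection.
            yield open_connections.pop(key), rec


log = [
    {"userId": "u1", "cnxServer": "openvpn6", "status": "Connected"},
    {"userId": "u2", "cnxServer": "openvpn5", "status": "Connected"},
    {"userId": "u1", "cnxServer": "openvpn6", "status": "Ended"},
    {"userId": "u3", "cnxServer": "openvpn6", "status": "Ended"},  # no matching start
]
pairs = list(pair_connections(log))
print(len(pairs))  # 1
```

An "Ended" record with no matching "Connected" record is dropped, and a connection that never ends stays pending, which matches the idea of keeping only complete Connected / Ended pairs.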

36 Aggregation composite
Functor is used to create a single output tuple for each ended connection
Split the flow according to the userID:
  “stud[0-9]*”: the students’ connections, which are the ones we want to analyze
  “inst[0-9]*”
  other
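
A minimal Python sketch of this three-way split, assuming the patterns are applied as prefixes (user IDs in the results look like stud213_8, so a whole-string match would miss them). The output names and the `route` function are made up for the example; this is not the SPL Split operator.

```python
import re

# Patterns taken from the slide; matched from the start of the user ID.
ROUTES = [
    ("stud", re.compile(r"stud[0-9]*")),
    ("inst", re.compile(r"inst[0-9]*")),
]


def route(user_id):
    """Return the name of the output a tuple with this user ID is sent to."""
    for name, pattern in ROUTES:
        if pattern.match(user_id):
            return name
    return "other"


print(route("stud213_8"), route("frbonfils"))  # stud other
```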

37 Aggregation composite
Partial results
  stud213_8,openvpn, ,, ,openvpn3,Ended, , ,19:08:21, ,19:10:17,117,4723,2779
  stud209_4,openvpn, ,, ,openvpn3,Ended, , ,12:22:53, ,12:25:37,163,5231,3263
  stud213_8,openvpn, ,, ,openvpn3,Ended, , ,19:08:21, ,19:10:17,117,4723,2779
  stud209_4,openvpn, ,, ,openvpn3,Ended, , ,12:22:53, ,12:25:37,163,5231,3263
  stud213_8,openvpn, ,, ,openvpn3,Ended, , ,19:08:21, ,19:10:17,117,4723,2779
  stud209_4,openvpn, ,, ,openvpn3,Ended, , ,12:22:53, ,12:25:37,163,5231,3263

38 Enrichment composite
ODBCSource: Read data from a SQL database
  Extract data about the course attended by students
Join: Correlate two streams
  According to the userID and the DateStart and DateEnd of the course
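
The join condition can be sketched in Python as follows. This is an analogy rather than the SPL Join operator, and it assumes that a connection matches a course when the user IDs are equal and the connection date falls between DateStart and DateEnd (inclusive). The field names `date` and `course`, and all the values, are hypothetical.

```python
from datetime import date


def enrich(connections, courses):
    """Attach course info to each connection made during that user's course."""
    for conn in connections:
        for course in courses:
            same_user = conn["userId"] == course["userId"]
            in_period = course["DateStart"] <= conn["date"] <= course["DateEnd"]
            if same_user and in_period:
                yield {**conn, "course": course["course"]}


# Hypothetical sample data for illustration only.
courses = [
    {"userId": "stud1", "course": "C100",
     "DateStart": date(2013, 6, 3), "DateEnd": date(2013, 6, 7)},
]
connections = [
    {"userId": "stud1", "date": date(2013, 6, 4)},   # during the course
    {"userId": "stud1", "date": date(2013, 6, 20)},  # after the course
    {"userId": "stud2", "date": date(2013, 6, 4)},   # different student
]
enriched = list(enrich(connections, courses))
print(len(enriched), enriched[0]["course"])  # 1 C100
```

In the real application the course rows arrive as a stream from the database source, so the join has to work over windows of both inputs rather than over complete lists as in this sketch.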

39 Enrichment composite
Final results
  stud213_8,openvpn, ,, ,openvpn3,Ended, , ,19:08:21, ,19:10:17,117,4723,2779,JL27,1SM00,Cross brand,STG brands
  stud209_4,openvpn, ,, ,openvpn3,Ended, , ,12:22:53, ,12:25:37,163,5231,3263,5367,1SX2G,Cross brand,STG brands
  stud213_8,openvpn, ,, ,openvpn3,Ended, , ,19:08:21, ,19:10:17,117,4723,2779,JL27,1SM00,Cross brand,STG brands
  stud209_4,openvpn, ,, ,openvpn3,Ended, , ,12:22:53, ,12:25:37,163,5231,3263,5367,1SX2G,Cross brand,STG brands
  stud213_8,openvpn, ,, ,openvpn3,Ended, , ,19:08:21, ,19:10:17,117,4723,2779,JL27,1SM00,Cross brand,STG brands
  stud209_4,openvpn, ,, ,openvpn3,Ended, , ,12:22:53, ,12:25:37,163,5231,3263,5367,1SX2G,Cross brand,STG brands

40 Database loading composite
ODBCAppend: Insert rows into an SQL database table
[Diagram: InfoSphere Streams loading the data warehouse]

41 HDFS loading composite
HDFSFileSink: Write data to files in the Hadoop Distributed File System (HDFS)


43 Toolkits and Accelerators to Speed Up Development
IBM Supported Toolkits: Database, DataStage, Big Data, Data Explorer, Messaging, Internet, Text Analytics, Mining, SPSS, CEP, Time Series, Geospatial, Financial
Open-Source Toolkits: JSON, HTTP/REST, OpenCV, Accumulo, HBase, …
Big Data Accelerators: Telco Event Data Analytics, Social Data Analytics
User-Defined Toolkits: Extend the language by adding user-defined operators, types, and functions
Standard Toolkit
  Relational Operators: Filter, Sort, Functor, Join, Punctor, Aggregate
  Adapter Operators: FileSource, UDPSource, FileSink, UDPSink, DirectoryScan, Export, TCPSource, Import, TCPSink, MetricsSink
  Utility Operators: Custom, Split, Beacon, DeDuplicate, Throttle, Union, Delay, ThreadedSplit, Barrier, DynamicFilter, Pair, Gate, JavaOp, Switch, Parse, Format, Decompress, CharacterTransform
  XML Operator: XMLParse

Operators are organized into toolkits. Other than the Standard Toolkit, toolkits often correlate to functional areas of use, or even to specific industries. There is a lot to say about each of these operators, but I won’t go into detail here; I will just highlight a few points.

You’ll see familiar terms like Join, Sort, and Aggregate. These do roughly what they do in, say, SQL, but with a twist: since there is no such thing as the “entire table” or the entire result set (streams have no beginning and no end, so they are infinite), you have to do things one “window” of data at a time. This dramatically changes how you have to think. Other names may seem strange (Punctor?).

Adapter operators (sources and sinks) are important since they communicate with the outside world. They cover a lot of ground, but in some projects you’ll have to write custom ones to deal with the specific type of data involved. An example is the kind of telco CDR (Call Detail Record) data described by a grammar definition language called ASN.1; our Telco services team has developed a parser generator for this kind of data, but it is not for casual use.

Operators like Throttle, Barrier, and such are unique to streaming data processing and have no parallel in the data-at-rest world.

The toolkits mentioned in the right-hand panel come with the product, but installing them is optional.


45 Resource Links

46 Who Can Help?
The Big Data Tiger Team: Evangelists for your region and/or industry
The Big Data Black Belts: Skilled TCPs for your region
Product Management: Roger Rea
Product Marketing: Asha Marsh
Worldwide Sales: Stewart Hanna
Worldwide Technical Sales: Robert Uleman
Worldwide Software Services: Sandra Tucker
Customer Enablement: Kevin Foster

For the current list of Big Data Tigers, see the IM Acceleration Zone: https://w3-connections.ibm.com/wikis/home?lang=en_US#/wiki/Info%20Mgmt%20Client%20Technical%20Professional%20Resources%20Wiki/page/People%20who%20can%20help?section=bigdatacontacts
The list of Big Data Black Belts is still being worked out and is more fluid.

47 Where to get Streams
Software available via:
  Passport Advantage site
  PartnerWorld
  Academic Initiative
  IBM Internal
  90-day trial code: IBM Trials and Demos portal (ibm.com -> Support & downloads -> Download -> Trials and demos)

URLs:
  Passport Advantage:
  PartnerWorld: https://www.ibm.com/partnerworld/mem/valuepack/
  Academic Initiative: https://www.ibm.com/developerworks/university/software/get_software.html
  XL (internal) Software Downloads:
  Trials and Demos:
NOTE: some of these links don’t open properly from PowerPoint. Copy and paste them into your browser to get around such problems.

Passport Advantage notes:
  Passport Advantage covers nearly all IBM Software from Software Group (SWG), including WebSphere, IM, Lotus, Tivoli and Rational
  The more you buy (including maintenance), the greater the discounts
  The discounts apply to all the software you buy (including maintenance)
  Maintenance, which includes telephone support for usage and defects, is included in new license prices
  Maintenance is 20% of the new license price

48 More Streams links
Streams in the Information Management Acceleration Zone
Streams in the Media Library (recorded sessions): search on “streams”
The Streams Sales Kit: Software Sellers Workplace (the old eXtreme Leverage), PartnerWorld
InfoSphere Streams Wiki on developerWorks: discussion forum, Streams Exchange

Everyone should have been introduced to the Information Management Acceleration Zone (IMAZ). Streams is under Big Data and Warehousing. The Sales Kit is a bit stale, so favor the IMAZ.

URLs:
  IM Acceleration Zone: https://w3-connections.ibm.com/wikis/home?lang=en_US#/wiki/Info%20Mgmt%20Client%20Technical%20Professional%20Resources%20Wiki/page/Streams
  Media Library:
  Software Sellers Workplace:
  PartnerWorld: https://www.ibm.com/partnerworld/wps/servlet/mem/ContentHandler/Q836166N45949G23/lc=en_US
  developerWorks:
The Streams Exchange is a place for Streams application developers to share ideas, source code, design patterns, experience reports, rules of thumb, best practices, etc.

49 SPL DOC

50 Some SPL Terms
An operator represents a class of manipulations of tuples from one or more input streams to produce tuples on one or more output streams
A stream connects to an operator on a port
  An operator defines input and output ports
An operator invocation is a specific use of an operator
  With specific assigned input and output streams
  With locally specified parameters, logic, etc.
Many operators have one input port and one output port; others have
  Zero input ports: source adapters, e.g., TCPSource
  Zero output ports: sink adapters, e.g., FileSink
  Multiple output ports, e.g., Split
  Multiple input ports, e.g., Join
A composite operator is a collection of operators
  An encapsulation of a subgraph of primitive operators (non-composite) and composite operators (nested)
  Similar to a macro in a procedural language
[Diagram: Employee Info -> Aggregate -> Salary Statistics, with ports marked; TCP Source, File Sink, Split, and Join shown as examples; a composite operator enclosing a subgraph]

Just to give you an idea of how you work with the Streams Processing Language (SPL), here are some basic terms that have to do with the fundamental building block of the language: the operator. All the work is done inside the operators; as we’ve seen, the application graph consists of just operators and the streams that connect them.

Each operator has its own definition in terms of the inputs and outputs it requires and supports, the parameters, required and optional, that tell it how to do its job, and other requirements such as windowing, output attribute assignments, etc. In the operator itself, inputs and outputs are referred to as ports.

In a real application program, operators are invoked; each invocation lets you specify the parameters, input and output stream names, etc. for that specific instance. The same operator may be invoked many times in the same program. This is completely analogous to, say, making function calls in a C++ program. (SPL also supports functions, though, so don’t make too much of the analogy between operators and functions.)

The composite operator construct is a very powerful technique for avoiding repetitive code and promoting reusability. In the way it works, it is more like a macro than a function: the compiler expands its definition through all levels of nesting until there’s nothing left but primitive operators. By the way, “primitive” here just means that it cannot be further decomposed, not that it is somehow not advanced or sophisticated.

As a sign of the consistency of the language, the composite operator idea has been carried through to the point that every self-contained Streams application program is a special kind of composite: a main composite.
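
The macro-like expansion can be pictured with a small Python sketch. It is only an analogy for what the notes describe, not the SPL compiler: the `COMPOSITES` table, the composite names Parsing and Cleanup, and the `expand` function are all hypothetical.

```python
# Hypothetical definitions: a composite maps to the operators it invokes.
# Any operator not listed here is treated as primitive.
COMPOSITES = {
    "Main": ["FileSource", "Parsing", "FileSink"],
    "Parsing": ["Filter", "Cleanup"],
    "Cleanup": ["Functor", "Split"],
}


def expand(operator):
    """Return the primitive operators left after expanding all nesting levels."""
    if operator not in COMPOSITES:
        return [operator]
    primitives = []
    for inner in COMPOSITES[operator]:
        primitives.extend(expand(inner))
    return primitives


print(expand("Main"))
# ['FileSource', 'Filter', 'Functor', 'Split', 'FileSink']
```

After expansion nothing but primitive operators remains, which is why a composite behaves more like a macro than like a function call at run time.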


53 Anatomy of an Operator Invocation
Operators share a common structure
  The placeholder names (italic on the slide) are sections to fill in
Reading an operator invocation:
  Declare a stream stream-name
  With attributes from stream-type
  That is produced by MyOperator
  From the input(s) input-stream
  MyOperator behavior is defined by logic, parameters, windowspec, and configuration; output attribute assignments are specified in output
For the example:
  Declare the stream Sale with the attribute item, which is a raw (ASCII) string
  Join the Bid and Ask streams with sliding windows of 30 seconds on Bid, and 50 tuples of Ask
  When items are equal, and Bid price is greater than or equal to Ask price
  Output the item value on the Sale stream

Syntax:
  stream<stream-type> stream-name = MyOperator(input-stream; …) {
    logic logic ;
    window windowspec ;
    param parameters ;
    output output ;
    config configuration ;
  }

Example:
  stream<rstring item> Sale = Join(Bid; Ask) {
    window Bid: sliding, time(30);
           Ask: sliding, count(50);
    param match : Bid.item == Ask.item && Bid.price >= Ask.price;
    output Sale: item = Bid.item;
  }

Not much to add to the slide text. Note that an operator can have zero or more input ports; multiple input streams are separated by a semicolon (;), as in the example. Each operator, primitive or composite, determines which of the window, param, and output clauses may or must be present; logic and config can apply to any operator.
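
To show what the two sliding windows in the example mean, here is a simplified Python model. It is a sketch under stated assumptions, not the Join operator's actual implementation: arrival times are passed in explicitly, each arriving tuple is matched against the other port's window before being added to its own, and the real operator's eviction details may differ. The class name `SlidingJoin` is made up.

```python
from collections import deque


class SlidingJoin:
    """Two-port join: Bid keeps the last 30 seconds, Ask keeps the last 50 tuples."""

    def __init__(self, bid_seconds=30.0, ask_count=50):
        self.bid_seconds = bid_seconds
        self.bids = deque()                  # (arrival time, tuple), oldest first
        self.asks = deque(maxlen=ask_count)  # oldest Ask is dropped automatically

    def _evict_bids(self, now):
        # Time-based eviction: drop bids older than the window length.
        while self.bids and now - self.bids[0][0] > self.bid_seconds:
            self.bids.popleft()

    @staticmethod
    def _match(bid, ask):
        # Same condition as the 'match' parameter in the example.
        return bid["item"] == ask["item"] and bid["price"] >= ask["price"]

    def on_bid(self, now, bid):
        self._evict_bids(now)
        sales = [{"item": bid["item"]} for ask in self.asks if self._match(bid, ask)]
        self.bids.append((now, bid))
        return sales

    def on_ask(self, now, ask):
        self._evict_bids(now)
        sales = [{"item": b["item"]} for _, b in self.bids if self._match(b, ask)]
        self.asks.append(ask)
        return sales


join = SlidingJoin()
join.on_bid(0, {"item": "A", "price": 10})
print(join.on_ask(5, {"item": "A", "price": 9}))   # [{'item': 'A'}]
print(join.on_ask(40, {"item": "A", "price": 9}))  # []  (the bid has aged out)
```

The second Ask finds nothing because the only Bid is more than 30 seconds old by then; the Ask window, by contrast, only shrinks when newer Asks push older ones out.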

54 Streams Data Types
(any)
  (primitive)
    boolean
    enum
    (numeric)
      (integral)
        (signed): int8, int16, int32, int64
        (unsigned): uint8, uint16, uint32, uint64
      (floatingpoint)
        (float): float32, float64, float128
        (decimal): decimal32, decimal64, decimal128
      (complex): complex32, complex64, complex128
    timestamp
    (string): rstring, ustring
    blob
    xml
  (composite)
    (collection): list, set, map
    tuple
Type classes are shown in parentheses; all other names are actual built-in types.

User-defined types:
  type Integers = list<int32>;
  type MySchema = rstring s, Integers ls;

ustring is a UTF-16 Unicode string; rstring is a “raw” (8-bit-character) string.
tuple is a sequence of attributes, i.e., a structure of named elements of other types, like struct in C/C++.
timestamp is a 128-bit standard representation of time, ranging to billions of years with a precision of nanoseconds.
Collection types and strings may be bounded, meaning that they are declared with a maximum length. This can improve performance in some cases, by taking advantage of the optimizations that can be applied to “façade tuples” (meaning, fixed-length).
The xml type, new in 3.0, includes built-in validation: if specified with a schema (xml<"schemaURI">), a full semantic validation; otherwise, a syntactical check for well-formed XML.
New types may be defined local to a composite, so they can only be used within that composite (say, to define a stream type), or global within a namespace, so they can be reused in any application within that namespace or that uses the namespace.

55 Collection Types
list: array with bounds-checking
  [0, 17, age-1, 99]
  Random access: can access any element at any time
  Ordered, base-zero indexed: first element is someList[0]
set: unordered collection
  {"cats", "yeasts", "plankton"}
  No duplicate element values
map: key-to-value mappings
  {"Mon":0, "Sat":99, "Sun":-1}
  Unordered
Use type constructors to specify element type
  list<type>, set<type>: list<uint16>, set<rstring>
  map<key-type,value-type>: map<rstring[3],int8>
Can be nested to any number of levels
  map<int32, list<tuple<ustring name, int64 value>>>
  {1 : [{"Joe",117885}, {"Fred",923416}], 2 : [{"Max",117885}], -1 : []}
Bounded collections optimize performance
  list<int32>[5]: at most 5 (32-bit) integer elements
  Bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters

Lists are based on arrays but with static or dynamic bounds-checking. Sets and maps are based on hash tables. Bounded collections and strings are purely a performance optimization; they can clutter up code but can be critically important in achieving the highest possible performance.


57 Streams Standard Toolkit: Relational Operators
Relational operators
  Functor: Perform tuple-level manipulations
  Filter: Remove some tuples from a stream
  Aggregate: Group and summarize incoming tuples
  Sort: Impose an order on incoming tuples in a stream
  Join: Correlate two streams
  Punctor: Insert window punctuation markers into a stream

Reference slide. The Relational operators Aggregate, Sort, and Join are named after analogous operations in relational databases, but in stream processing there is a twist: they can only operate on windows of tuples, never the “entire” stream. Functor is a general-purpose attribute transformation and filtering operator with many of the capabilities of the SQL SELECT statement. Punctor is unique to Streams: it exists solely to add window markers to the stream, thus separating groups of tuples into disjoint windows that can be used by downstream operators for their window-based operations.
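
The relationship between punctuation and windowed aggregation can be sketched in Python. This is an analogy, not the SPL Punctor or Aggregate operators: the marker object, the choice to insert the marker after a boundary value, and the function names are assumptions made for the example.

```python
PUNCT = object()  # stand-in for a window punctuation marker


def punctuate(values, is_boundary):
    """Pass values through, emitting a marker after each boundary value."""
    for v in values:
        yield v
        if is_boundary(v):
            yield PUNCT


def aggregate_sum(stream):
    """Emit one sum per punctuation-delimited window of the stream."""
    total, seen = 0, False
    for item in stream:
        if item is PUNCT:
            if seen:
                yield total
            total, seen = 0, False
        else:
            total, seen = total + item, True


readings = [1, 2, 10, 3, 4, 20, 5]
sums = list(aggregate_sum(punctuate(readings, lambda v: v >= 10)))
print(sums)  # [13, 27]
```

The trailing value 5 produces no output because its window has not been closed by a marker yet, which illustrates why a windowed operator never needs (or gets) the whole stream.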

58 Streams Standard Toolkit: Adapter Operators
Adapter operators
  FileSource: Read data from files in formats such as csv, line, or binary
  FileSink: Write data to files in formats such as csv, line, or binary
  DirectoryScan: Detect files to be read as they appear in a given directory
  TCPSource: Read data from TCP sockets in various formats
  TCPSink: Write data to TCP sockets in various formats
  UDPSource: Read data from UDP sockets in various formats
  UDPSink: Write data to UDP sockets in various formats
  Export: Make a stream available to other jobs in the same instance
  Import: Connect to streams exported by other jobs
  MetricsSink: Create displayable metrics from numeric expressions

Reference slide. Note that Import and Export are not “real” operators in that they don’t generate any code. Instead, they add metadata for the Streams Application Manager service to use in making stream connections between applications. For convenience, they are made to look syntactically like regular operators. And since they’re not about communicating with the outside world, they’re not really “adapters” either.

59 Streams Standard Toolkit: Utility Operators
Workload generation and custom logic
  Beacon: Emit generated values; timing and number configurable
  Custom: Apply arbitrary SPL logic to produce tuples
Coordination and timing
  Throttle: Make a stream flow at a specified rate
  Delay: Time-shift an entire stream relative to other streams
  Gate: Wait for acknowledgement from downstream operator
  Switch: Block or allow the flow of tuples based on control input

Reference slide. Many of these operators only make sense in a streaming-data system. The Custom operator is like a blank slate. It can perform any kind of custom logic in response to the arrival of a tuple on one of its input ports, including writing to the console, sending out tuples and punctuation markers, and calling system functions, by using the simple but powerful SPL expression language. The Gate operator can help even out tuple flow and avoid overwhelming a downstream operator by only sending tuples on after receiving acknowledgment of previous tuples. The Switch operator can temporarily block a stream altogether, for example while a downstream operator is initializing or refreshing lookup information. Operators that are new in 3.0 are highlighted in blue.

60 Streams Standard Toolkit: Utility Operators (cont’d)
Parallel flows
  Barrier: Synchronize tuples from sequence-correlated streams
  Pair: Group tuples from multiple streams of same type
  Split: Forward tuples to output streams based on a predicate
  ThreadedSplit: Distribute tuples over output streams by availability
  Union: Construct an output tuple from each input tuple
  DeDuplicate: Suppress duplicate tuples seen within a given time period
Miscellaneous
  DynamicFilter: Filter tuples based on criteria that can change while it runs
  JavaOp: General-purpose operator for encapsulating Java code
  Parse: Parse blob data for use with user-defined adapters
  Format: Format blob data for use with user-defined adapters
  Compress: Compress blob data
  Decompress: Decompress blob data
  CharacterTransform: Convert blob data from one encoding to another

Reference slide. Some parallel-flow operators are for performance optimization and load balancing (e.g., ThreadedSplit); others support High Availability scenarios (DeDuplicate). Yet others support parallel processing by applying independent algorithmic steps in parallel to the same data (Barrier, Union). DynamicFilter is a very powerful operator for maintaining a list of filter criteria that may change over time, such as key words for a simple text-based filter. Parse, Format, Compress, and Decompress expose some of the capabilities that are built in to the standard adapters for files and TCP and UDP sockets; this allows you to apply the same features to data coming from or going to other types of connectors. Operators that are new in 3.0 are highlighted in blue.
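
As one example of these semantics, duplicate suppression within a time period can be sketched in Python as below. This is an analogy, not the DeDuplicate operator: it assumes a tuple is remembered from the moment it is first let through, takes arrival times as explicit inputs, and never purges old keys (a real implementation would have to). The function name `deduplicate` is made up.

```python
def deduplicate(timed_tuples, period):
    """Yield (time, key) pairs, dropping keys already passed within `period` seconds."""
    last_passed = {}  # key -> time at which it was last let through
    for now, key in timed_tuples:
        previous = last_passed.get(key)
        if previous is None or now - previous > period:
            last_passed[key] = now
            yield now, key


events = [(0, "a"), (1, "a"), (2, "b"), (11, "a")]
print(list(deduplicate(events, period=10)))
# [(0, 'a'), (2, 'b'), (11, 'a')]
```

The second "a" is suppressed because it arrives one second after the first, while the last "a" passes because more than ten seconds have gone by, which is the behavior a High Availability setup with redundant feeds relies on.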

61 Streams Extensibility: Toolkits
Like packages, plugins, add-ons, extenders, etc.
  Reusable sets of operators, types, and functions
What can be included in a toolkit?
  Primitive and composite operators
  User-defined types
  Native and SPL functions
  Sample applications, utilities
  Tools, documentation, data, etc.
Versioning is supported
  Define dependencies on other versioned assets (toolkits, Streams itself)
Base for creating cross-domain and domain-specific applications
Developed by IBM, partners, end-customers
  Complete APIs and tools available
  Same power as the “built-in” Standard Toolkit

Streams is extensible at heart; extensibility is not an afterthought. New operators are developed all the time, by IBM itself as well as by users. With call-level APIs and code-generation facilities available to all users, user-developed toolkits have all the same power and access to system capabilities as IBM-developed ones. The only limitation is that it is not possible to create new “built-in” types. For example, the xml data type introduced in 3.0 had to be added to the system by IBM; it could not have been added by a third party.

62 Integration: the Internet and Database Toolkits
Integration with traditional sources and consumers
Internet Toolkit
  InetSource: Periodically retrieve data from HTTP, FTP, RSS, and files
Database Toolkit
  ODBC databases: DB2, Informix, Oracle, solidDB, MySQL, SQL Server, Netezza
  ODBCAppend: Insert rows into an SQL database table
  ODBCEnrich: Extend streaming data based on lookups in database tables
  ODBCRun: Perform SQL queries with parameters from input tuples
  ODBCSource: Read data from an SQL database
  SolidDBEnrich: Perform table lookups in an in-memory database
  DB2SplitDB: Split a stream by DB2 partition key
  DB2PartitionedAppend: Write data to a table in a specified DB2 partition
  NetezzaLoad: Perform high-speed loads into a Netezza database
  NetezzaPrepareLoad: Convert a tuple to a delimited string for Netezza loads

InetSource is good for REST interfaces (HTTP GET and POST requests) as well as the other types listed. For files, it differs from FileSource in that it reads periodically: either only the data added since the last read, or the entire file again. More operators will probably be added to the Internet toolkit in the future, for example for JavaScript-based data visualization, or for HTTP sources that require authentication and other capabilities not currently supported. Check the Streams Exchange on developerWorks for an early view of such operators.

Regarding the supported databases in the ODBC* operators, there are some nuances. Any database that has a driver for UnixODBC (2.3 or later) can be supported. For the specific databases listed, there are restrictions: on RHEL 6, Oracle is not supported, and on IBM Power7 systems running RHEL 6, solidDB, Netezza, and Oracle are not supported.

SolidDBEnrich is there in addition to ODBCEnrich because it is much more efficient to use the proprietary solidDB API than to go through ODBC and full SQL. There are restrictions, but for stream enrichment you usually don’t need all the flexibility of SQL.

Additional target-specific operators have been and will continue to be added to this toolkit; operators for high-speed writing to partitioned DB2 databases and to Netezza already exist.
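
As an illustration of what "enriching" a stream from a database table means, the following Python sketch looks up extra attributes for each incoming tuple with a parameterized query. It is only an analogy for ODBCEnrich: the standard-library `sqlite3` module stands in for an ODBC data source, and the table, column names, and the choice to forward unmatched tuples with empty attributes are assumptions made for the example.

```python
import sqlite3


def enrich(tuples, conn):
    """Yield each input tuple extended with attributes looked up by its key."""
    cur = conn.cursor()
    for t in tuples:
        # Parameterized lookup: the key comes from the streaming tuple.
        cur.execute(
            "SELECT name, region FROM customers WHERE id = ?",
            (t["customer_id"],),
        )
        row = cur.fetchone()
        if row is None:
            # Assumption for this sketch: unmatched tuples pass through
            # with empty enrichment attributes.
            yield {**t, "name": None, "region": None}
        else:
            yield {**t, "name": row[0], "region": row[1]}


# Hypothetical reference table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme', 'EMEA')")

stream = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 5.0},
]
enriched = list(enrich(stream, conn))
```

A simple key lookup like this covers most enrichment needs, which is why a restricted but faster interface such as the one SolidDBEnrich uses can be a reasonable trade-off.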

63 Integration: the Big Data Toolkit
Integration with IBM’s Big Data Platform
Data Explorer
  DataExplorerPush: Insert records into a Data Explorer index
BigInsights: Hadoop Distributed File System
  HDFSDirectoryScan: Like DirectoryScan, only for HDFS
  HDFSFileSource: Like FileSource, only for HDFS
  HDFSFileSink: Like FileSink, only for HDFS
  HDFSSplit: Write batches of data in parallel to HDFS

IBM InfoSphere Data Explorer can index and consume results from Streams. The DataExplorerPush operator uses the BigSearch API.

The Hadoop Distributed File System requires a separate set of source and sink adapters, because it has its own file system interface and is not POSIX-compatible. The HDFS* operators all have counterparts of the same name in the Standard Toolkit, and operate in similar ways. HDFSSplit is there to improve throughput by allowing you to parallelize the HDFS I/O operations: in iterative fashion, it sends a specific amount of data to each of several HDFSFileSink operators, followed by a window punctuation marker to indicate that it is time to write that batch. This fits the parallel and batch-oriented nature of Hadoop.

Note that there is no need for a similar set of adapters for BigInsights when configured with GPFS, because GPFS is POSIX-compliant and the standard-toolkit adapters work just fine.

The HDFS* operators support Hadoop versions [...] and [...] (on Power, 0.20 only). They use both Hadoop and JNI (Java Native Interface).
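
The batching scheme described for HDFSSplit can be sketched in a few lines of Python. This is an illustration of the idea only: the function name is hypothetical, output streams are modelled as plain lists, the batch is measured in tuples (the real operator may measure the amount of data differently), and `PUNCT` merely stands in for a window punctuation marker.

```python
PUNCT = object()  # stand-in for a window punctuation marker


def split_batches(tuples, n_sinks, batch_size):
    """Send `batch_size` tuples to one sink, punctuate, then move to the next sink."""
    outputs = [[] for _ in range(n_sinks)]
    sink = 0
    count = 0
    for t in tuples:
        outputs[sink].append(t)
        count += 1
        if count == batch_size:
            # Batch complete: tell this sink to write, then rotate to the next one.
            outputs[sink].append(PUNCT)
            sink = (sink + 1) % n_sinks
            count = 0
    return outputs


out = split_batches(range(5), n_sinks=2, batch_size=2)
```

With two sinks and a batch size of two, the first sink receives tuples 0 and 1 plus a marker, the second receives 2 and 3 plus a marker, and tuple 4 starts a new, still-open batch on the first sink. While one sink is busy writing its batch, the next one is already being filled.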

64 Integration: the DataStage Integration Toolkit
Streams real-time analytics and DataStage information integration
  Perform deeper analysis on the data as part of the information integration flow
  Get more timely results and offload some analytics load from the warehouse
Operators and tooling
  Adapters to exchange data between Streams and DataStage
    DSSource
    DSSink
  Tooling to generate integration code
  DataStage provides Streams connectors

The IBM InfoSphere DataStage Integration Toolkit is supported by DataStage version 9.1. Streams can enhance a DataStage ETL job by performing deeper real-time analytic processing (RTAP) on the data as it is being loaded. In a slightly different scenario, DataStage can perform additional enrichment and transformation on the output of a Streams RTAP application, before storing the end results in a warehouse.

65 Integration: the Messaging Toolkit
Integrate with IBM WebSphere MQ
  Create a stream from a subscription to an MQ topic or queue
  Publish a stream to an MQ topic or queue
Operators
  XMSSource: Read data from an MQ queue or topic
  XMSSink: Send messages to applications that use WebSphere MQ
NOTE: This toolkit is not supported on POWER7 systems.

The XMSSource and XMSSink operators use the standard XMS architecture and APIs (CreateProducer, CreateConsumer, CreateConnection).

66 Integration and Analytics: the Text Toolkit
Derive structured information from unstructured text
Apply extractors
  Programs that encode rules for extracting information from text
  Written in AQL (Annotation Query Language)
  Developed against a static repository of text data
  AQL files can be combined into modules (directories)
  Modules can be compiled into TAM files
  NOTE: Old-style AOG files not supported by BigInsights 2.0 and this toolkit
    Can be used in Streams 3.0 with the Deprecated Text Toolkit
Streams Studio plugins for developing AQL queries
  Same tooling as in BigInsights
Operator and utility
  TextExtract: Run AQL queries (module or TAM) over text documents
  createtypes script: Create stream type definitions that match the output views of the extractors
Plays a major role in accelerators
  SDA: Social Data Analytics
  MDA: Machine Data Analytics

The Text Toolkit does not require Hadoop or BigInsights, but in Text Analytics it is almost inevitable that the analytical model, as expressed in the AQL program, is developed and tested against a repository of text documents, usually in something like BigInsights. An example of text analytics is sentiment analysis based on a Twitter feed (the documents are tweets, in that case).

67 Analytics: The Geospatial Toolkit
High-performance analysis and processing of geospatial data
Enables Location Based Services (LBS)
  Smarter Transportation: monitor traffic speed and density
  Geofencing: detect when objects enter or leave a specified area
Geospatial data types
  e.g. Point, LineString, Polygon
Geospatial functions
  e.g. Distance, Map point to LineString, IsContained, etc.

Note that this is a toolkit that provides no operators; only types and functions. Geospatial and spatiotemporal (including the time dimension) computing form a rich domain that requires specific expertise, such as knowledge of cartography (maps and map projections), geospatial geometry (shapes and locations and their representations), set theory (spatiotemporal relationships), interoperability standards and conventions, as well as the specifics of the application (location intelligence and location-based services for businesses, security and surveillance, geographic information systems, traffic patterns and route finding, etc.). This toolkit provides the beginnings of a set of capabilities that lets Streams provide the real-time component for emerging location-based applications.
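
Geofencing reduces to a containment test applied to each new position, plus a check for when the answer flips. The Python sketch below shows one common way to do this, the ray-casting point-in-polygon test. It is not the toolkit's IsContained function: the function names are hypothetical, and coordinates are treated as planar x/y pairs, which ignores map projections and the curvature of the Earth.

```python
def is_contained(point, polygon):
    """Ray-casting test: True if `point` (x, y) lies inside `polygon` (list of vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # one more crossing to the right of the point
    return inside


def geofence_events(positions, fence):
    """Yield ('enter' | 'leave', position) whenever containment changes."""
    was_inside = None
    for p in positions:
        now_inside = is_contained(p, fence)
        if was_inside is not None and now_inside != was_inside:
            yield ("enter" if now_inside else "leave", p)
        was_inside = now_inside


fence = [(0, 0), (4, 0), (4, 4), (0, 4)]      # a square area
track = [(-1, 2), (1, 2), (3, 3), (5, 3)]     # successive positions of one object
events = list(geofence_events(track, fence))
```

Here the object starts outside the square, moves in, and leaves again, so the sketch reports one "enter" and one "leave" event.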

68 Analytics: The TimeSeries Toolkit
Apply Digital Signal Processing techniques
  Find patterns and anomalies in real time
  Predict future values in real time
A rich set of functionality for working with time series data
  Generation: Synthesize specific wave forms (e.g., sawtooth, sine)
  Preprocessing: Preparation and conditioning (ReSample, TSWindowing)
  Analysis: Statistics, correlations, decomposition and transformation
  Modeling: Prediction, regression and tracking (e.g. Holt-Winters, GAMLearner)

A time series is a sequence of numerical data that represents the value of an object or multiple objects over time. For example, you can have time series that contain: the monthly electricity usage that is collected from smart meters; daily stock prices and trading volumes; ECG recordings; seismographs; or network performance records. Time series have a temporal order and this temporal order is the foundation of all time series analysis algorithms. A time series can be regular (sampled at regular intervals) or irregular (sampled at arbitrary times, for example when triggered by events). In addition, a time series can be univariate (tracking a single, scalar variable) or vector (containing a collection of scalar variables); it can even be expanding, if the number of variables increases over time. This toolkit contains 25 operators; too many to list them all here.
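
To give a feel for streaming prediction, here is a small Python sketch of Holt's linear-trend exponential smoothing, the non-seasonal core of the Holt-Winters method named on the slide. It is a simplification chosen for brevity (no seasonal component) and is not the toolkit's operator; the class name, method name, and initialization of level and trend are assumptions.

```python
class HoltForecaster:
    """One-step-ahead forecasts from a stream of values (Holt's linear trend method)."""

    def __init__(self, alpha, beta):
        self.alpha = alpha    # smoothing factor for the level, 0 < alpha <= 1
        self.beta = beta      # smoothing factor for the trend, 0 < beta <= 1
        self.level = None     # smoothed estimate of the current value
        self.trend = 0.0      # smoothed estimate of the change per step

    def update(self, value):
        """Consume one observation and return the forecast for the next one."""
        if self.level is None:
            # First observation: start the level there, with no trend yet.
            self.level = value
        else:
            prev_level = self.level
            self.level = (self.alpha * value
                          + (1 - self.alpha) * (prev_level + self.trend))
            self.trend = (self.beta * (self.level - prev_level)
                          + (1 - self.beta) * self.trend)
        return self.level + self.trend


forecaster = HoltForecaster(alpha=1.0, beta=1.0)
forecasts = [forecaster.update(v) for v in (1.0, 3.0, 5.0)]
```

The state is just two numbers, updated once per arriving value, which is what makes this family of models a natural fit for in-memory stream processing. Smaller values of alpha and beta smooth out noise at the cost of reacting more slowly to change.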

69 Analytics: the Mining Toolkit
Enables scoring of real-time data in a Streams application
  Scoring is performed against a predefined model (in PMML file)
  Supports a variety of model types and scoring algorithms
Predictive Model Markup Language (PMML)
  Standard for statistical and data mining models
  Produced by Analytics tools: SPSS, ISAS, etc.
  XML representation defined by [...]
Scoring operators
  Classification: Assign tuple to a class and report confidence
  Clustering: Assign tuple to a cluster and compute score
  Regression: Calculate predicted value and standard deviation
  Associations: Identify the applicable rule and report consequent (rule head), support, and confidence
Supports dynamic replacement of the PMML model used by an operator

The Streams Data Mining Toolkit adds several operators that can analyze and score streaming data according to PMML-standard models developed separately (for example, with SPSS) based on historical data. The PMML support and scoring code is ported directly from the IBM InfoSphere Warehouse, ensuring consistency of results. Currently, the Streams Mining operators have no built-in provision for learning and letting the model evolve; they simply apply the model to each incoming tuple in turn, with no reference to any data that passed through before. They can, however, respond to model changes. As the warehouse-based analysis results in an updated model, simply overwrite the model file; the operator detects the update, reads in the new model, and applies it from then on to newly arriving data.
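
The "overwrite the model file and the operator picks it up" behaviour can be illustrated with a short Python sketch. This is an analogy, not the toolkit's mechanism: a small JSON file with two regression coefficients stands in for a PMML model, the class and attribute names are hypothetical, and change detection by file modification time is an assumption about how such a check could be done.

```python
import json
import os


class ReloadingScorer:
    """Score values against a model file, reloading the file whenever it changes."""

    def __init__(self, model_path):
        self.model_path = model_path
        self._mtime = None   # modification time of the model currently loaded
        self._coef = None    # hypothetical model: {"intercept": ..., "slope": ...}

    def _refresh(self):
        mtime = os.stat(self.model_path).st_mtime_ns
        if mtime != self._mtime:
            # File is new or was overwritten: read the updated model.
            with open(self.model_path) as f:
                self._coef = json.load(f)
            self._mtime = mtime

    def score(self, x):
        """Apply the current model to one value; no state carries over between tuples."""
        self._refresh()
        return self._coef["intercept"] + self._coef["slope"] * x
```

Each call scores a single value independently, mirroring the note that the operators apply the model tuple by tuple; only the model itself changes over time. A production implementation would also need to guard against reading a file that is only partially written.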

70 Analytics: Streams + SPSS Real Time Scoring Service
Included with SPSS, not with Streams

[Diagram: IBM InfoSphere Streams containing the SPSS Scoring (S), SPSS Repository (R), and SPSS Publish (P) operators. Other labels: SPSS Modeler Scoring Stream; IBM SPSS Modeler Solution Publisher; In memory; Change Notification; File System; Model Refresh; IBM SPSS Collaboration & Deployment Services Repository.]

The IBM SPSS Analytics Toolkit for InfoSphere Streams enables the use of SPSS Modeler predictive analytics and SPSS Collaboration and Deployment Services model management features in Streams applications. The Analytics Toolkit consists of three Streams operators to implement Modeler analytics in Streams applications.

The SPSS Scoring operator enables the scoring against predictive models designed in SPSS Modeler in the flow of Streams applications. It works by passing an in-memory request to SPSS Modeler Solution Publisher which conducts scoring via a Modeler scoring stream and returns results in-memory back to the SPSS Scoring operator.

The SPSS Repository operator, meanwhile, enables the use of SPSS Collaboration and Deployment Services model management functionality. Collaboration and Deployment Services is used to manage the lifecycle of SPSS predictive analytics assets. It provides automated evaluation of model effectiveness and automated refreshing of models when warranted. The SPSS Repository operator monitors the Collaboration and Deployment Services repository. When it receives notification that a scoring model is refreshed, it retrieves the updated model from the repository, prepares it and passes it to the SPSS Publish operator.

The SPSS Publish operator in turn generates the executable images required to update the scoring model used in the Streams application. Upon publication of the new images, the new scoring model is automatically swapped into place without disruption to the processing underway in the Streams application.

71 Analytics: the Financial Services Toolkit
Adapters
  Financial Information Exchange (FIX)
  WebSphere Front Office for Financial Markets (WFO)
  WebSphere MQ Low-Latency Messaging (LLM) Adapters
Types and Functions
  OptionType (put, call), TxType (buy, sell), Trade, Quote, OptionQuote, Order
  Coefficient of Correlation
  “The Greeks” (put/call values, delta, theta, etc.)
Operators
  Based on QuantLib financial analytics open source package.
  Compute theoretical value of an option (European, American)
Example applications
  Equities Trading
  Options Trading

The Financial Information eXchange (FIX) protocol is a messaging standard to support real-time electronic exchange of security (option and stock) transactions. The FIX protocol is documented at [...].

WebSphere Front Office for Financial Markets contains feed handlers for a large number of financial-market feeds. For a detailed description of the WebSphere Front Office suite, and of how to configure and use feed handlers and clients, refer to the WFO documentation at [...].

IBM WebSphere MQ Low Latency Messaging is a high-throughput, low-latency messaging transport. It is optimized for the very high-volume, low-latency requirements typical of financial markets firms and other industries where speed of data delivery is paramount.

In addition to the mentioned adapters, there are some simulation operators, which generate simulated feeds (continuous streams of cyclically repeating data): EquityMarketTradeFeed, EquityMarketQuoteFeed, and OrderExecutor. These operators support the included sample applications and are not intended for production use.
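
As a concrete example of "computing the theoretical value of an option", the Python sketch below prices a European option with the closed-form Black-Scholes formula. It is a stand-in chosen for brevity: the toolkit's operators are based on QuantLib, whose methods and interfaces are not shown here, and American options need a numerical method that this sketch does not cover. The function and parameter names are hypothetical.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def european_option_value(spot, strike, rate, volatility, years, option_type="call"):
    """Black-Scholes value of a European option on a non-dividend-paying asset.

    spot: current price of the underlying; strike: exercise price;
    rate: continuously compounded risk-free rate; volatility: annualized;
    years: time to expiry in years.
    """
    d1 = ((log(spot / strike) + (rate + 0.5 * volatility ** 2) * years)
          / (volatility * sqrt(years)))
    d2 = d1 - volatility * sqrt(years)
    discounted_strike = strike * exp(-rate * years)
    if option_type == "call":
        return spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    if option_type == "put":
        return discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)
    raise ValueError("option_type must be 'call' or 'put'")


call = european_option_value(100.0, 100.0, 0.05, 0.2, 1.0, "call")
put = european_option_value(100.0, 100.0, 0.05, 0.2, 1.0, "put")
```

In a streaming application a function like this would be re-evaluated for every incoming quote, so that the theoretical value can be compared with the market price as it changes. The call and put values are tied together by put-call parity: their difference equals the spot price minus the discounted strike.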

