Streaming Data and Analytics


1 Streaming Data and Analytics
COSC 526 Class 5. Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN. (Speaker note: start with ebola…)

2 Projects

3 Projects
Three main areas: Applications, Algorithms, Infrastructure.
Datasets:
- Healthcare claims (> 20 GB compressed)
- Twitter (1 TB compressed, one year of feed)
- Molecular simulations (> 1 TB compressed)
- Cybersecurity (> 60 GB)
- Bioinformatics (> 6 TB compressed)
- Imaging (> 2 TB compressed)

4 Applications
- Can you find patients with similar phenotypes knowing their treatment patterns?
- Can you group behaviors of individuals based on their tweet patterns?
- Can you predict time points at which an attack may occur within a network?
- Can you piece together hundreds of millions of base pairs against a reference genome?
- Can you connect the cognitive processes associated with decision making?

5 Infrastructure
- Scalable platform to host multiple types of datasets
- GPU processing and implementation of various algorithms (ICA, hierarchical eigensolvers, tensor toolbox, etc.)
- Scalable platform to host related datasets for co-analysis processes
- Efficient transfer of data between CPU and GPU for specific ML algorithms

6 Algorithms
- Machine learning algorithm implementation for CPU/GPU resources
- How do we exploit Intel's co-processors for ML algorithm design?
- Multi-way multi-task learning

7 Part 0: Introducing Streaming Data

8 Distributed Counting → Stream and Sort Counting
Pipeline: Machine A runs the counting logic, turning each incoming example (example 1, example 2, example 3, …) into standardized counter-update messages of the form "increment C[x] by D" (e.g., C[x1] += D1, C[x1] += D2, …). Machine B sorts the messages so that all updates for the same key are adjacent. Machines C1..CK run the logic to combine counter updates. Because the message routing is standardized, both the counting and combining stages are easy to parallelize. (A sketch follows below.)
Questions: Can we do this as more data becomes available? Do we have to update any of the logic to achieve this? What are the necessary updates to our approach?
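
As a rough illustration, here is a minimal single-machine sketch of the stream-and-sort pattern; the whitespace tokenization and (key, delta) message format are illustrative assumptions, not the course's exact setup:

```python
# A minimal sketch of stream-and-sort counting. Machine A emits
# (key, delta) counter-update messages; a sort groups identical keys;
# the combiner sums each run of adjacent updates.
import itertools

def emit_updates(examples):
    """Machine A: turn each example into counter-update messages."""
    for example in examples:
        for token in example.split():
            yield (token, 1)          # "increment C[token] by 1"

def stream_and_sort_count(examples):
    updates = sorted(emit_updates(examples))   # Machine B: the sort step
    # Machines C1..CK: combine consecutive updates for the same key.
    for key, group in itertools.groupby(updates, key=lambda kv: kv[0]):
        yield key, sum(delta for _, delta in group)

if __name__ == "__main__":
    stream = ["a b a", "b c", "a"]
    print(dict(stream_and_sort_count(stream)))  # {'a': 3, 'b': 2, 'c': 1}
```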

9 Is Naïve Bayes a Streaming Algorithm?
In ML, we call this online learning:
- Allows us to update the model as the data becomes available
- Adapts slowly as the data streams in
How do we do this with Naïve Bayes? Slow updates to the model:
- Learn from the training data
- For every example in the stream, update the model (with a learning rate)
(A count-based sketch follows below.)
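
A minimal sketch of streaming Naïve Bayes via running counts. This is one common realization; the lecture's learning-rate variant would scale the updates instead of incrementing raw counts, and the add-one smoothing and toy data here are my own assumptions:

```python
# Online Naive Bayes: fold each labeled example into running counts
# as it arrives; predictions use the counts seen so far.
from collections import defaultdict
import math

class OnlineNB:
    def __init__(self):
        self.class_counts = defaultdict(float)
        self.word_counts = defaultdict(lambda: defaultdict(float))

    def update(self, words, label):
        """Online learning step: absorb one example from the stream."""
        self.class_counts[label] += 1
        for w in words:
            self.word_counts[label][w] += 1

    def predict(self, words):
        total = sum(self.class_counts.values())
        vocab = {w for wc in self.word_counts.values() for w in wc}
        best, best_lp = None, float("-inf")
        for c, cc in self.class_counts.items():
            denom = sum(self.word_counts[c].values()) + len(vocab) + 1.0
            lp = math.log(cc / total)  # log prior
            for w in words:
                # add-one smoothing so unseen words don't zero things out
                lp += math.log((self.word_counts[c].get(w, 0.0) + 1.0) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb = OnlineNB()
nb.update(["free", "money", "offer"], "spam")
nb.update(["meeting", "agenda", "today"], "ham")
print(nb.predict(["free", "offer"]))  # "spam"
```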

10 What Attributes of Data have we seen?
- Large feature sets: Map/Reduce; Request & Answer; Stream & Sort; Rocchio's algorithm
- Streaming data: querying streams; filtering streams; counting; moment estimation
- High-dimensional data: dimensionality reduction; clustering; locality-sensitive hashing; spectral methods
- Graph data: community detection; PageRank/SimRank; link analysis

11 What can Hadoop/MapReduce do (and not do)?
Can do (pros):
- High throughput: batch optimized; assumes data is resident (somewhere, some place)
- Independent of storage and (somewhat) of language; flexible and fault-tolerant
Cannot do (cons):
- Decision making is "slow"
- Streaming/infinite data: data is getting updated by the second, and decision making needs to be fast!
- Single fixed data flow
- Not I/O efficient
Lee, K.Y., et al., SIGMOD Record 2011, Vol. 40 (4).

12 Streaming Data
Data streams are "infinite" and "continuous", and their distribution(s) can evolve over time.
Managing streaming data is important:
- Input is coming in extremely fast!
- We don't see all of the data; (sometimes) input rates are outside our control (e.g., Google queries, Twitter and Facebook status updates)
Key question in streaming data: how do we make critical calculations about the stream using limited (secondary) memory?

13 General Stream Processing Model
Streams enter over time, each composed of elements/tuples, e.g.:
…, 5, 2, 7, 0, 9, 3
a, r, v, t, y, h, b
…, 0, 1, 0, 1, 1, 0
A processor serves standing queries and ad-hoc queries, produces output, and works with limited working storage backed by archival storage.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets.

14 Requirements of Stream Processing
- Flexible data input
- Real-time, fault-tolerant, scalable processing
- Handling data failures
- Guaranteed delivery
Pipeline: incoming listeners → processor (standing queries, limited working storage, archival storage) → outgoing destination.

15 Stream Model
- Input elements enter at a rapid rate, O(millions/minute), at one or more ports
- Elements are usually called tuples
- We cannot store the entire data stream!
- How do we make critical calculations about the stream with limited secondary memory?

16 Questions that we are interested in
- Sampling from a data stream: construct a random sample
- Queries over a sliding window: number of items of type x in the last k elements of the stream
- Filtering a data stream: select elements with property x from the stream
- Counting distinct elements: number of distinct elements in the last k elements of the stream
- Estimating moments
- Finding frequent items

17 Example Applications
- Finding relevant products using clickstreams: Amazon, Google, and others want to find relevant products to recommend based on user interactions
- Trending topics on social media
- Health monitoring using sensor networks
- Monitoring molecular dynamics simulations
A. Ramanathan, J.-O. Yoo, and C.J. Langmead, On-the-fly identification of conformational sub-states from molecular dynamics simulations, J. Chem. Theory Comput. (2011), Vol. 7 (3).
A. Ramanathan, P.K. Agarwal, M.G. Kurnikova, and C.J. Langmead, An online approach for mining collective behaviors from molecular dynamics simulations, J. Comp. Biol. (2010), Vol. 17 (3).

18 Sampling from a Data Stream
Practical consideration: we cannot store all of the data, but we can store a sample!
Two problems:
- Sample a fixed proportion of elements in the stream (say 1 in 10)
- Maintain a random sample of fixed size over a potentially infinite stream: at any time k, we would like a random sample of s elements
Desirable property: for all time steps k, each of the k elements seen so far has equal probability of being sampled.

19 Part I: Sampling from a data stream

20 Sampling a Fixed Proportion
Scenario: a search engine query stream of tuples <user, query, time>.
Why is this interesting? It lets us answer questions like "how often did a user have to run the same query in a single day?"
We have storage for only 1/10th of the query stream.
Simple solution:
- Generate a random integer in [0..9] for each query
- Store the query if the integer is 0; otherwise discard it

21 What is wrong with our simple approach?
Suppose each user issues s queries once and d queries twice, for (s + 2d) total queries. What fraction of queries by an average search engine user are duplicates? The correct answer is d/(s + d).
If we keep only 10% of the queries, the sample will contain, in expectation:
- s/10 of the singleton queries
- d/100 duplicate queries appearing twice: the 1st copy gets sampled with probability 1/10, the 2nd also with 1/10, and there are d such queries, so (1/10)(1/10)d = d/100
- 18d/100 duplicate queries appearing exactly once: probability 1/10 for the first copy to be selected and 9/10 for the second not to be, and the other way around, so ((1/10)(9/10) + (9/10)(1/10))d = 18d/100
So the fraction of duplicates estimated from the sample is (d/100) / (s/10 + d/100 + 18d/100) = d/(10s + 19d), not the correct d/(s + d). (A quick simulation below checks this.)
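
A quick synthetic simulation of this bias; the stream contents, counts, and query IDs here are made up purely for illustration:

```python
# Keep each query with probability 1/10 and compare the duplicate
# fraction estimated from the sample against the true d/(s + d).
import random
from collections import Counter

random.seed(0)
s, d = 10_000, 1_000
queries = [f"single{i}" for i in range(s)] + 2 * [f"dup{i}" for i in range(d)]
random.shuffle(queries)

sample = Counter(q for q in queries if random.randrange(10) == 0)
dupes_in_sample = sum(1 for c in sample.values() if c == 2)
estimate = dupes_in_sample / len(sample)

print(f"estimate: {estimate:.4f}")           # close to the biased value
print(f"biased:   {d / (10*s + 19*d):.4f}")  # d/(10s + 19d) ~ 0.0084
print(f"true:     {d / (s + d):.4f}")        # d/(s + d)     ~ 0.0909
```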

22 Instead of queries, let's sample users…
- Pick 1/10th of users and take all of their searches into the sample
- Use a hash function that hashes the user name or user ID uniformly into 10 buckets
- Why is this a better solution?

23 Generalized Solution
Stream of tuples with keys:
- The key is some subset of each tuple's components; e.g., the tuple is (user, search, time) and the key is user
- The choice of key depends on the application
To get a sample of an a/b fraction of the stream:
- Hash each tuple's key uniformly into b buckets
- Pick the tuple if its hash value is at most a
How do we generate a 30% sample? Hash into b = 10 buckets, and take the tuple if it hashes to one of the first 3 buckets. (A sketch follows below.)
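
A minimal sketch of the generalized key-based sampler; the MD5 hash and tuple layout are my own illustrative choices:

```python
# Key-based a/b sampling: hashing on a stable key guarantees that all
# of a user's tuples are kept together or dropped together.
import hashlib

def keep(key, a, b):
    """Keep the tuple iff its key hashes into one of the first a of b buckets."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % b < a

stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "cats", 3)]
sample = [t for t in stream if keep(t[0], a=3, b=10)]  # ~30% of users
print(sample)  # a sampled user's tuples all appear; others are all dropped
```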

24 Maintaining a fixed-size sample
- Suppose we need to maintain a random sample S of size exactly s tuples (e.g., a main-memory size constraint)
- We don't know the length of the stream in advance
- Suppose at time n we have seen n items; each item should be in the sample S with equal probability s/n
How to think about the problem: say s = 2 and the stream is a x c y z k c d e g…
- At n = 5, each of the first 5 tuples is included in the sample S with equal probability
- At n = 7, each of the first 7 tuples is included in the sample S with equal probability
An impractical solution would be to store all n tuples seen so far and pick s of them at random.

25 Fixed-Size Sample Algorithm (a.k.a. Reservoir Sampling)
- Store the first s elements of the stream in S
- Suppose we have seen n-1 elements, and now the nth element arrives (n > s):
- With probability s/n, keep the nth element; otherwise discard it
- If we keep the nth element, it replaces one of the s elements in the sample S, picked uniformly at random
Claim: this algorithm maintains a sample S with the desired property: after n elements, the sample contains each element seen so far with probability s/n. (A sketch follows below.)
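
A minimal sketch of the algorithm; the names are mine:

```python
# Reservoir sampling: keep a uniform sample of s items from a stream
# of unknown length in a single pass.
import random

def reservoir_sample(stream, s, rng=random):
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)              # keep the first s elements
        elif rng.random() < s / n:           # keep the nth with prob. s/n
            sample[rng.randrange(s)] = item  # evict a uniform victim
    return sample

random.seed(42)
print(reservoir_sample(range(1_000_000), s=5))  # 5 uniform picks
```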

26 Proof: By Induction
We prove this by induction.
Base case: after we see n = s elements, the sample S has the desired property; each of the n = s elements is in the sample with probability s/s = 1.
Inductive step: assume that after n elements, the sample contains each element seen so far with probability s/n. We need to show that after seeing element n+1, the sample maintains the property: each element seen so far is in the sample with probability s/(n+1).
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets.

27 Proof: By Induction
Inductive hypothesis: after n elements, the sample S contains each element seen so far with probability s/n.
Now element n+1 arrives. For an element already in S, the probability that the algorithm keeps it in S is the probability that element n+1 is discarded, plus the probability that element n+1 is kept but some other element of the sample is evicted:
(1 - s/(n+1)) + (s/(n+1))·((s-1)/s) = n/(n+1)
So a tuple was in S at time n with probability s/n and stays in S through time n+1 with probability n/(n+1); hence the probability it is in S at time n+1 is (s/n)·(n/(n+1)) = s/(n+1). (Restated compactly below.)
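
The step above, restated:

```latex
% Inductive step of the reservoir-sampling proof, restated from the slide.
\begin{aligned}
\Pr[\text{tuple already in } S \text{ survives step } n+1]
  &= \Big(1-\tfrac{s}{n+1}\Big) + \tfrac{s}{n+1}\cdot\tfrac{s-1}{s}
   = \tfrac{n}{n+1},\\
\Pr[\text{tuple in } S \text{ after } n+1 \text{ elements}]
  &= \tfrac{s}{n}\cdot\tfrac{n}{n+1} = \tfrac{s}{n+1}.
\end{aligned}
```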

28 Part II: Querying using a Sliding Window

29 Sliding Window
Queries are not about individual values, but instead about a window of length N: the N most recent elements received.
What do we do if N is so large that the window cannot fit into memory?
- A single stream may be so large that it cannot fit into memory
- Multiple streams together may be so large that they don't fit into memory

30 Example 1: A single stream may not fit into memory
Molecular dynamics (MD) simulations output, for each time step T, the 3D coordinates (x, y, z) of every atom; time = data.
- Which regions of the molecule are moving together?
- At which time points are the spatio-temporal patterns of motion changing?
Often MD datasets are so large that they don't fit into memory (> several terabytes), and the questions are not about a "snapshot" but about "time windows".
Background: once a protein structure is available from experimental techniques, its atomic 3D coordinates can be simulated on computers. Beginning in the 1970s, molecular dynamics simulations became almost standard procedure for obtaining insights into how proteins and other biomolecules work, and they are now common in other areas, including materials science and chemistry.

31 Example 2: Multiple streams won't fit into memory
- Computer networks: information streaming from a variety of computers on a network: <source, port, destination, no. of packets>
- Sensor networks: multiple sensors reporting at different time scales
(Image credits: howstuffworks.com, fujitsu.com)

32 Running example: Amazon stream
For every product X we keep a stream of 0s and 1s:
- 0: X was not sold in the nth transaction
- 1: X was sold in the nth transaction
Question: how many times have we sold X in the last k sales?

33 Counting Bits
Given a stream of 0s and 1s: how many 1s are in the last k bits (k ≤ N)?
Simple solution:
- Store only the most recent N bits
- When a new bit comes in, discard the (N+1)th (oldest) bit

34 What’s the problem? No exact answer without storing the actual window!
We need at least N bits to store to get the exact answer Amazon sells approximately 400 products every second – 24 x 3600 x 400 ~= 34 million products a day Let’s say N = 1 billion! we quickly run out of memory

35 Can we approximate our solution?
With really large N, an approximate solution is acceptable. What if we assume that the bit stream is uniformly distributed?
Maintain 2 counters:
- S: number of 1s from the beginning of the stream
- Z: number of 0s from the beginning of the stream
Assuming uniformity, the number of 1s in the last N bits is approximately N · S/(S + Z). (See the snippet below.)
The caveat in this algorithm: are the distributions really uniform?
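
As a sketch of the uniformity-based estimate (the bit stream here is made up):

```python
# Estimate the count of 1s in the last N bits from two global counters,
# assuming the stream is uniformly distributed.
def uniform_estimate(stream, N):
    S = sum(stream)          # 1s since the beginning of the stream
    Z = len(stream) - S      # 0s since the beginning of the stream
    return N * S / (S + Z)

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
print(uniform_estimate(bits, N=4))  # 4 * 6/10 = 2.4 estimated 1s
```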

36 Datar-Gionis-Indyk-Motwani (DGIM) Algorithm
- Does not assume uniformity
- Stores only O(log² N) bits per stream
- Gives an approximate answer that is never off by more than 50%
- The error can be reduced to any fraction ε > 0 with a more complicated algorithm (and more bits, still O(log² N))

37 Simple extension to the previous idea: use exponential windows
- Summarize exponentially increasing regions of the stream, looking backward
- Drop small regions if they begin at the same point as a larger region
(Figure: regions with 1-counts 1, 2, 3, 4, 6, 10 growing backward from the present; a window of width 16 has 6 1s.)
We can reconstruct the count of the last N bits, except that we are not sure how many of the last 6 1s are included in the N.

38 What works? (And what does not)
What works:
- Stores only O(log² N) bits
- Updates are easy!
- The error is no larger than the number of 1s in the unknown area
What does not work:
- Assumes that 1s are fairly evenly distributed
- If all the 1s are at the end of the unknown area, you cannot bound the error

39 DGIM: Summarize blocks with a specific number of 1s and associate a timestamp
- We now vary not only the window size, but also the block size!
- When there are few 1s in the window, block sizes stay small, so errors are small
- Each bit has a timestamp 1, 2, …, kept modulo N (a total of log2 N bits suffices for a window)

40 DGIM: Buckets
Divide the window into buckets, each recording:
(1) The timestamp of its right (most recent) end, stored in O(log N) bits
(2) The number of 1s in the bucket, stored in O(log log N) bits (since the count is a power of 2, we need only store its exponent)
Important constraint on buckets: the number of 1s must be a power of 2!

41 Representing a data stream by buckets
- There are either one or two buckets with the same power-of-2 number of 1s
- Buckets do not overlap in timestamps
- Buckets are sorted by size: earlier buckets are not smaller than later buckets
- Buckets disappear when their end-time is more than N time units in the past

42 What does a bucketized stream look like?
Example, from oldest to newest: at least one bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, and 2 of size 1.
Properties: either one or two buckets with the same power-of-2 number of 1s; buckets do not overlap in timestamps; buckets are sorted by size.

43 How do we update the buckets?
When a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time.
If the current bit is 0: no other changes are needed.
If the current bit is 1:
- Create a new bucket of size 1 for just this bit, with end timestamp = current time
- If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
- If there are now three buckets of size 2, combine the oldest two into a bucket of size 4
- And so on…

44 Running through an example
- Current state of the stream: buckets as in the previous slide
- A bit of value 1 arrives: two size-1 buckets get merged into a size-2 bucket
- The next 1 arrives and a new size-1 bucket is created; then a 0 comes, then a 1, and buckets get merged again
- Final state of the buckets after merging

45 Querying our bucketized stream
To estimate the number of 1s in the most recent N bits:
- Sum the sizes of all buckets except the last, oldest one (note: "size" means the number of 1s in the bucket)
- Add half the size of the last bucket
Remember: we do not know how many 1s of the last bucket are still within the wanted window. (A sketch of the update and query rules follows below.)
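
A compact sketch of DGIM maintenance and querying under the rules above. The data layout is mine, and I use absolute timestamps for clarity; a real implementation stores timestamps modulo N and packs the buckets into O(log² N) bits:

```python
# DGIM: maintain buckets of power-of-2 sizes; estimate the count of 1s
# in the last N bits by summing full buckets plus half the oldest one.
from collections import deque

class DGIM:
    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = deque()   # (end_timestamp, size); newest on the left

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once it falls entirely outside the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.appendleft((self.t, 1))
        # Merge rule: never allow three buckets of the same size.
        i = 0
        while i + 2 < len(self.buckets) and \
                self.buckets[i][1] == self.buckets[i + 2][1]:
            # Combine the *older* two of the three equal-sized buckets;
            # the merged bucket keeps the more recent end timestamp.
            ts_new, size = self.buckets[i + 1]
            del self.buckets[i + 1]
            self.buckets[i + 1] = (ts_new, size * 2)
            i += 1

    def count(self):
        """Estimate the number of 1s in the last N bits."""
        sizes = [s for ts, s in self.buckets if ts > self.t - self.N]
        if not sizes:
            return 0
        return sum(sizes[:-1]) + sizes[-1] // 2  # half of the oldest bucket

d = DGIM(N=16)
for b in [1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    d.add(b)
print(d.count())  # approximate count of 1s in the last 16 bits
```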

46 Part III: Counting Distinct Elements

47 Problem Definition
- A data stream consists of elements chosen from a universe of size N
- Maintain a count of the number of distinct elements seen so far
One obvious approach: maintain the set of elements seen so far, i.e., keep a hash table of all the distinct elements seen so far.

48 Examples
- How many unique products did we sell tonight?
- How many unique requests came in to a website today?
- How many different words did we find on a website? (An unusually large number of words could be indicative of spam.)

49 Now, let’s constrain ourselves with limited storage…
How to estimate (in an unbiased manner) the number of distinct elements seen? Flajolet-Martin Approach Pick a hash function that maps each of the N elements to at least log2N bits For each stream element a, let r(a) be the number of trailing 0s in h(a) r(a) = position of first 1 counting from the right E.g., say h(a) = 12, then 12 is 1100 in binary, so r(a) = 2 Record R = the maximum r(a) seen R = maxa r(a), over all the items a seen so far Estimated number of distinct elements = 2R
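
A minimal sketch of the estimator; the MD5 hash is my own choice, and a single hash function gives a high-variance, powers-of-2 estimate (practical versions average or take medians over many hash functions):

```python
# Flajolet-Martin: the more distinct elements we hash, the longer the
# longest run of trailing zeros we expect to see; estimate = 2^R.
import hashlib

def trailing_zeros(x):
    # (x & -x) isolates the lowest set bit; its position is the count.
    return (x & -x).bit_length() - 1 if x else 0

def fm_estimate(stream):
    R = 0
    for a in stream:
        h = int(hashlib.md5(str(a).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

stream = [i % 500 for i in range(100_000)]  # 500 distinct elements
print(fm_estimate(stream))                  # a power of 2 near 500
```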

50 Part III (contd.): Counting Item Sets

51 Problem Definition
Given a stream, which items appear more than s times in the given window?
One approach:
- Think of the stream as one binary stream per item: 1 if the item is present, 0 if it is not
- Use DGIM to estimate the count of 1s for each item

52 Exponentially Decaying Windows
A heuristic for selecting likely frequent item(set)s.
Example: what are "currently" the most popular movies?
Instead of computing the raw counts of the last N elements, we compute a "smooth aggregate" over the whole stream.

53 What do we mean by a "smoothed" aggregate?
Suppose a stream has elements a_1, a_2, …, a_t, where a_1 is the first element to arrive and a_t is the current element. Let c be a small constant (10^-6 to 10^-9).
We define the exponentially decaying window to be:
$\sum_{i=1}^{t} a_i (1-c)^{t-i}$
When a new item a_{t+1} arrives: multiply the current sum by (1-c) and add a_{t+1}.

54 Now, how do we count items…
If each a_i is an item, we can compute the characteristic function of each possible item x as an exponentially decaying window $\sum_{i=1}^{t} \delta_i (1-c)^{t-i}$, where δ_i = 1 if a_i = x and 0 otherwise.
Imagine that for each item x we have a binary stream (1 if x appears, 0 otherwise). When a new item x appears:
- Multiply all counts by (1-c)
- Add +1 to the count of x
Call this sum the "weight" of x. (A sketch follows below.)
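
A minimal sketch of decayed item weights. I use an artificially large c so the decay is visible in a short demo (the slides suggest 10^-6 to 10^-9), and I decay every weight eagerly on each arrival; real implementations decay lazily for efficiency:

```python
# Exponentially decaying weights: one running weight per item; every
# arrival decays all weights by (1-c) and bumps the arriving item by 1.
def decayed_weights(stream, c=0.1):
    weights = {}
    for x in stream:
        for k in list(weights):                  # multiply all counts by (1-c)
            weights[k] *= (1 - c)
        weights[x] = weights.get(x, 0.0) + 1.0   # add 1 to the count of x
    return weights

clicks = ["m1", "m2", "m1", "m3", "m1", "m2", "m1"]
for movie, w in sorted(decayed_weights(clicks).items(), key=lambda kv: -kv[1]):
    print(movie, round(w, 3))   # m1 carries the largest weight
```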

55 Sliding vs. Decaying Windows
A property to remember: the sum over all item weights is $\sum_{t \ge 0} (1-c)^{t} = 1/c$, so the total weight in the system is bounded.

56 A practical application: creating a storyboard from Molecular Dynamics (MD) simulations
- We run really long time-scale simulations: a 10^-15 s (femtosecond) time step out to 10^-6 s (microseconds)
- Datasets can typically range from a few GB to O(100) TB!
- How do we find "events" of interest within these datasets?
- How do we organize a "storyboard" of events for biologists to interpret the results?

57 Using Higher-order Moments to capture events in MD simulations

58 Putting it all together…

59 Part IV: Engineering for Streaming Data
Apache Kafka Apache Storm

60 Why do we need to engineer?
Supporting data integration: making "all" data available in a uniform way to all of an organization's services and systems. This is challenging:
- Event-data fire-hose: log data comes in various forms!
- Explosion of specialized data systems: e.g., OLAP, databases, etc.
Dimensions to manage: evolution, dynamics, usability, quality (accuracy, validity, completeness, etc.), availability/collection (data capture, access, availability, etc.), predictive models, data management & reporting.
Jay Kreps, LinkedIn Engineering Blog.

61 Apache Kafka: Distributed Messaging System
Kafka sits between incoming listeners and outgoing destinations, addressing the stream-processing requirements from earlier: flexible data input; real-time, fault-tolerant, scalable processing; handling data failures; guaranteed delivery.

62 Three components of a messaging system
Publish/Subscribe model Message Producer, Consumer, Broker Durable message: one that is held on if a consumer becomes unavailable Persistent message: one where the message delivery is guaranteed Both messages cost! Storing metadata about billions of messages creates overhead Storage structure (e.g., B-trees) can impose restrictions

63 Messaging with Apache Kafka
Producers (applications, front ends, services) publish into Apache Kafka; consumers (Hadoop clusters, real-time monitoring, databases, etc.) read from it.
Explicitly distributed: producers, consumers, and brokers all run on a cluster and cooperate as a logical group.
LinkedIn Engineering built Kafka to handle its activity stream.

64 Anatomy of Kafka: Topics
- Topic: a category/feed to which messages are published
- An ordered, immutable sequence of messages that is contiguously appended to
- Consumers subscribe to the topics that they would like to read
(A sketch of the publish/subscribe flow follows below.)
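
A sketch of the publish/subscribe flow using the third-party kafka-python client; the broker address and topic name are assumptions, and this only runs against a reachable Kafka broker:

```python
# Publish one message to a topic and read it back.
# Requires: pip install kafka-python, and a broker at localhost:9092.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b"user=alice product=cats action=view")
producer.flush()   # block until the broker acknowledges the message

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # read the topic from the beginning
)
for message in consumer:            # iterates forever as messages arrive
    print(message.topic, message.offset, message.value)
    break                           # stop after one message in this demo
```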

65 Apache Storm
Topologies: basically a graph, where:
- "processes"/"jobs" are nodes
- "data flows" are edges
Data flows are essentially data streams:
- Spout: a source of a data stream, e.g., connecting to the Twitter API and emitting a tweet stream
- Bolt: processes streams from one or more spouts (or bolts) and emits yet another stream!
Topologies run forever (unless you stop them). (A conceptual sketch follows below.)
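
A conceptual sketch of the spout/bolt dataflow shape in plain Python generators; this is not the Apache Storm API, just an illustration of the topology idea with made-up data:

```python
# Spout -> bolt -> bolt, wired as chained generators over an
# unbounded stream (bounded here only for the demo).
import itertools
import random

def tweet_spout():
    """Spout: an unbounded source of tuples (here, fake tweets)."""
    words = ["ebola", "streaming", "kafka", "storm"]
    while True:
        yield f"I love {random.choice(words)}"

def split_bolt(stream):
    """Bolt: consume one stream, emit another (individual words)."""
    for tweet in stream:
        yield from tweet.split()

def count_bolt(stream):
    """Bolt: running word counts, emitted as (word, count) tuples."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Wire the topology: spout -> split -> count; take 10 tuples and stop.
random.seed(1)
topology = count_bolt(split_bolt(tweet_spout()))
for tup in itertools.islice(topology, 10):
    print(tup)
```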

66 Comparing MapReduce & Storm
MapReduce/Hadoop:
- Runs MapReduce jobs
- Built to leverage batch processing: ordering is unimportant
- Very simple interface, tied to Java
- Massive distributed processing is the end goal
Storm:
- Runs topologies
- Built to leverage streaming applications: ordering is important
- More complex (spouts and bolts); has interfaces in Java, Python, Scala, etc.
- Massive volume & velocity processing is the end goal

67 Where can we combine the two?
Scenario: monitoring network switches (Switch 1, Switch 2, Switch 3, Switch 4).
- Which switch has an authentic problem?
- How do we predict which switch will go down in the near future?
Prediction is not easy: monitor "online", build the model "offline", validate.
Pipeline: event sequence → event collector → event store (HDFS) → analytic model (NB).

68 Summary of Streaming Data Analytics
Data management issues are very different: streaming data storage and processing can constrain our analytic algorithms to think differently.
Example applications:
- Sampling a data stream
- Counting distinct items / frequent items
- Querying with sliding windows
Engineering perspective: Apache Storm and Kafka.

