Algorithms for Big Data: Streaming and Sublinear Time Algorithms

Moran Feldman

Motivation
Big data (huge data sets): Why? So what?
Why?
- The Internet: easy to collect data, easy to transfer data.
- New equipment (e.g., the LHC).
So what?
- Difficult to process all the data.
- Difficult to store all the data.

What is this Talk About?
Big data has motivated a lot of research (both CS and non-CS). In this talk we are interested in theoretical algorithms for big data problems:
- Sublinear time algorithms
- Streaming algorithms
We will see a few classical algorithms of each kind.

Sublinear Time Algorithms

Sublinear Time Algorithms
Most algorithms read all of their input, and therefore require at least linear time. We are interested in sublinear time algorithms, which cannot afford to read all of their input. We will start with a simple example…

Diameter Approximation
Instance
- A set P of points.
- A function d: P × P → R giving the distance between every pair of points.
- We assume d is a metric:
  - All distances are non-negative.
  - d(u, v) = 0 ⇔ u = v
  - d(u, v) = d(v, u)
  - d(u, v) + d(v, w) ≥ d(u, w)
Objective
- Approximate the diameter D of P, i.e., the maximum distance between two points of P.
[Figure: points u, v, w, z, with the distance d(v, w) marked.]

Algorithm
Trivial Algorithm
- Query the distance between every pair of points, and return the maximum one.
- Time complexity: O(|P|²), which is the size of the input.
More Involved Algorithm
- Fix an arbitrary point u.
- Query the distance of every other point from u.
- Return the maximum distance found.

Analysis
Time Complexity
- O(|P|), which is a square root of the size of the input.
Guarantee
- Let d(v, w) be the diameter. By the triangle inequality: d(u, v) + d(u, w) ≥ d(v, w) = D, so at least one of d(u, v) and d(u, w) is at least D/2.
- The algorithm outputs a value D' such that D/2 ≤ D' ≤ D.
This is a rare example of a sublinear time deterministic algorithm.
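The "more involved" algorithm and its guarantee can be sketched as follows. This is only an illustration under assumptions: the function names (`approx_diameter`, `exact_diameter`) and the Euclidean example points are not from the talk, and any metric could be plugged in for `dist`.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]  # assumption: 2D points, used only for the example


def approx_diameter(points: Sequence[Point],
                    dist: Callable[[Point, Point], float]) -> float:
    """Return D' with D/2 <= D' <= D using len(points) - 1 distance queries."""
    if not points:
        raise ValueError("need at least one point")
    u = points[0]  # an arbitrary fixed point
    return max((dist(u, p) for p in points[1:]), default=0.0)


def exact_diameter(points: Sequence[Point],
                   dist: Callable[[Point, Point], float]) -> float:
    """Trivial quadratic algorithm, used here only to check the guarantee."""
    return max((dist(p, q)
                for i, p in enumerate(points)
                for q in points[i + 1:]), default=0.0)


pts = [(0.0, 0.0), (3.0, 4.0), (-3.0, -4.0), (1.0, 1.0)]
d_approx = approx_diameter(pts, math.dist)
d_exact = exact_diameter(pts, math.dist)
print(d_approx, d_exact)
```

In this example the fixed point sits exactly halfway between the two farthest points, so the estimate hits the lower end of the guarantee, D' = D/2.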

Property testing
We are interested in deciding whether an object has some property. The answer often depends on all the input:
- Is a list of numbers sorted?
- Are all the numbers in a set distinct?
- Is an image a half-plane?
To get the right answer with a constant probability, one has to read a constant fraction of the input.

Property testing (cont.)
The exact definition varies. The goal is to distinguish between two cases:
- The object has the property: answer “Yes”.
- The object is ε-far from having the property: answer “No”. Intuitively, changing an ε fraction of the object cannot make it have the property.
- Otherwise: the answer does not matter!

Testing List Sortedness
Instance
- A list of n numbers.
Objective
- Test whether the list is sorted (ascending) or ε-far from being sorted, i.e., more than ε·n numbers have to be changed to make the list sorted.
Trivial Algorithms
- Pick a uniformly random 1 ≤ i ≤ n - 1, and test whether x_i ≤ x_{i+1}. This fails with high probability on some ½-far instance.
- Pick uniformly random 1 ≤ i ≤ j ≤ n, and test whether x_i ≤ x_j. This also fails with high probability on some ½-far instance.

Algorithm [EKKRV00] (Funda Ergün, Sampath Kannan, S Ravi Kumar, Ronitt Rubinfeld, Mahesh Viswanathan)
- Pick a uniformly random i.
- Run a binary search for x_i.
- Answer “No” if the binary search ends up at a point other than i (and “Yes” otherwise).
[Figure: binary search tree over the indexes, with root x_4, children x_2 and x_6, and leaves x_1, x_3, x_5, x_7.]
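A single run of the tester might look like the sketch below. It assumes distinct list elements and 0-based indexes; the names `binary_search_ends_at` and `sortedness_test_once` are made up for this illustration.

```python
import random
from typing import Optional, Sequence


def binary_search_ends_at(xs: Sequence[int], i: int) -> bool:
    """Binary-search for the value xs[i]; True iff the search stops at index i.

    Assumes the elements of xs are distinct.
    """
    target = xs[i]
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid == i
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False  # the search fell off the tree without finding the value


def sortedness_test_once(xs: Sequence[int],
                         rng: Optional[random.Random] = None) -> bool:
    """One run of the tester: True means "Yes", False means "No"."""
    if not xs:
        return True
    rng = rng or random.Random()
    i = rng.randrange(len(xs))  # uniformly random index
    return binary_search_ends_at(xs, i)


print(sortedness_test_once(list(range(50))))
```

Each run reads only O(log n) list entries, which is where the sublinear running time comes from.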

Completeness Analysis
We need to show that the algorithm always returns “Yes” when the input is sorted.
- Pick a uniformly random i.
- Run a binary search for x_i.
- Answer “No” if the binary search ends up at a point other than i (and “Yes” otherwise). This should never happen when the list is sorted (and the elements are unique).

Soundness Analysis
We need to upper bound the probability that the algorithm returns “Yes” when the list is ε-far from being sorted.
- An index i is “good” if the algorithm returns “Yes” when it randomly chooses i.
- Clearly: Probability of “Yes” = (number of good indexes) / n.
- Thus, we want to upper bound the number of good indexes in an ε-far list.

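Main Observation
Lemma
- The elements at the good indexes form a sorted sub-list.
Proof
- Let i < j be two good indexes.
- Let k be the index of their lowest common ancestor in the binary search tree, so i ≤ k ≤ j.
- Since i and j are good indexes, the search for x_i does not turn right at k and the search for x_j does not turn left at k. Hence x_i ≤ x_k and x_k ≤ x_j, which gives x_i ≤ x_j.
[Figure: the binary search tree over x_1, ..., x_7, with x_i, x_k and x_j marked.]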
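The lemma can be checked empirically on a random list. The sketch below assumes distinct elements and 0-based indexes; `is_good` and `good_indexes` are illustrative names rather than anything from the talk.

```python
import random
from typing import List, Sequence


def is_good(xs: Sequence[int], i: int) -> bool:
    """True iff a binary search for the value xs[i] stops exactly at index i."""
    target = xs[i]
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid == i
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False


def good_indexes(xs: Sequence[int]) -> List[int]:
    """All indexes on which the tester would answer "Yes"."""
    return [i for i in range(len(xs)) if is_good(xs, i)]


rng = random.Random(1)
xs = rng.sample(range(1000), 64)  # 64 distinct values in random order
good = good_indexes(xs)
good_values = [xs[i] for i in good]

# The lemma: values at good indexes, read left to right, are sorted.
lemma_holds = all(a < b for a, b in zip(good_values, good_values[1:]))
print(len(good), lemma_holds)
```

The root of the search tree (index 31 here) is always good, since the search for its value stops immediately.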

Soundness Analysis (cont.)
In an ε-far list:
- No (1 - ε)n elements can form a sorted sub-list.
- There are fewer than (1 - ε)n good indexes.
- The algorithm answers “Yes” with probability less than 1 - ε.
This can be improved by repetition. Repeating the algorithm ε⁻¹ times yields an algorithm that:
- Always answers “Yes” for a sorted input (it never fails on such an input).
- Answers “No” for an ε-far input with probability at least 1 - 1/e, since (1 - ε)^(1/ε) ≤ 1/e.
- Has time complexity O(ε⁻¹ log n).
- Does not fail with high probability on any ½-far input, unlike the trivial algorithms.
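The repetition step can be sketched as below. The function names are assumptions of this illustration, the list elements are assumed distinct, and the reversed list is just one convenient example of an input that is very far from sorted.

```python
import math
import random
from typing import Optional, Sequence


def ends_at(xs: Sequence[int], i: int) -> bool:
    """True iff a binary search for the value xs[i] stops at index i."""
    target = xs[i]
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid == i
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False


def repeated_sortedness_test(xs: Sequence[int], eps: float,
                             rng: Optional[random.Random] = None) -> bool:
    """Run the basic test ceil(1/eps) times; answer "Yes" only if all runs do."""
    if len(xs) < 2:
        return True
    rng = rng or random.Random()
    for _ in range(math.ceil(1 / eps)):
        i = rng.randrange(len(xs))
        if not ends_at(xs, i):
            return False  # one failed search is a proof of unsortedness
    return True


rng = random.Random(0)
print(repeated_sortedness_test(list(range(100)), 0.1, rng))
print(repeated_sortedness_test(list(range(100, 0, -1)), 0.5, rng))
```

A “No” answer is always correct, because a sorted list with distinct elements never fails a binary search; only the “Yes” answer carries a probability of error.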

Streaming Algorithms

Motivation
Two scenarios:
(a) Magnetic tape
- Advantages:
  - Cost effective: hardware, energy.
  - Fast sequential access.
  - Can be stored offsite: backup, security.
- Disadvantages:
  - Poor random access speed.
  - Poor long term reliability: data has to be copied occasionally.
(b) A network element
- Processes the network traffic: for example, detects malicious activity.
- Problem: the element can store only a small fraction of the traffic.

Streaming Model
[Figure: input stream → algorithm → an answer based on the input.]
Input stream, for example:
- Edges of an input graph
- Words of an input document
Main Issue
- The algorithm should use little memory, often polylogarithmic in the size of the input.
Multiple Passes
- Sometimes the algorithm is allowed multiple passes over the input.
- This is appropriate for the magnetic tape motivation.

Finding Frequent Elements
Theorem [MG82] (Misra and Gries)
There is a streaming algorithm using O(k (log n + log m)) space which:
- Outputs a set of at most k - 1 elements.
- Guarantees that this set contains every element with more than n/k appearances in the stream.
Remarks
- A second pass can be used to detect the elements that really have more than n/k appearances.
- For simplicity, we present the algorithm for the case k = 2.
- Occasionally comes up in job interviews.

Algorithm
Initialize: counter ← 0.
For each arriving element e do
    If counter = 0 then
        Set counter ← 1, candidate ← e.
    ElseIf candidate = e then
        Set counter ← counter + 1.
    Else
        Set counter ← counter - 1.
Return candidate.
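The pseudocode for k = 2 translates almost line by line into Python. The sketch below also adds the optional second pass mentioned in the remarks; the names `majority_candidate` and `verified_majority` are illustrative, and the second pass assumes the input can be iterated twice (e.g., a list).

```python
from typing import Hashable, Iterable, Optional, Sequence


def majority_candidate(stream: Iterable[Hashable]) -> Optional[Hashable]:
    """One pass, one counter and one stored element."""
    counter = 0
    candidate = None
    for e in stream:
        if counter == 0:
            candidate, counter = e, 1
        elif candidate == e:
            counter += 1
        else:
            counter -= 1
    return candidate


def verified_majority(items: Sequence[Hashable]) -> Optional[Hashable]:
    """Second pass: return the candidate only if it really is a majority."""
    candidate = majority_candidate(items)
    total = 0
    hits = 0
    for e in items:
        total += 1
        hits += (e == candidate)
    return candidate if 2 * hits > total else None


print(majority_candidate([2, 1, 2, 3, 2]))
print(verified_majority([1, 2, 3]))
```

Without the second pass the returned candidate is only meaningful when some element really appears more than n/2 times.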

Analysis
Immediate Observations
- The algorithm uses O(log n + log m) space.
- The algorithm outputs a single element.
- If there is no element that appears more than n/2 times, then we are done. Otherwise, let e_{1/2} be this element.
Definition
X is defined as follows:
- X = counter when the candidate is e_{1/2}.
- X = -counter when the candidate is not e_{1/2}.
Lemma
At every given time during the execution of the algorithm:
X ≥ (appearances of e_{1/2} so far) - (appearances of other elements so far).

Proof of the Lemma
Lemma
At every given time during the execution of the algorithm:
X ≥ (appearances of e_{1/2} so far) - (appearances of other elements so far).
Proof
The proof is by induction. The lemma trivially holds before the first element arrives. Assume it holds before the arrival of an element e; then there are four cases:
- e = e_{1/2} and e_{1/2} is the candidate: both sides increase by 1.
- e = e_{1/2} and e_{1/2} is not the candidate: both sides increase by 1.
- e ≠ e_{1/2} and e_{1/2} is the candidate: both sides decrease by 1.
- e ≠ e_{1/2} and e_{1/2} is not the candidate: the right side decreases by 1, and the left side changes by 1 (in either direction).
Intuitively, we get an inequality because elements other than e_{1/2} might cancel each other out.

Wrapping Up the Proof
After all the input is processed, we have:
X ≥ (appearances of e_{1/2}) - (appearances of other elements).
- The right side is positive, since e_{1/2} appears more than n/2 times.
- Hence, X must be positive as well.
- Hence, e_{1/2} is the final candidate.
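The talk only presents the case k = 2. For general k, the usual textbook form of the Misra and Gries algorithm keeps at most k - 1 counters. The sketch below follows that standard form; it is not taken from the slides, and the name `misra_gries` is illustrative.

```python
from typing import Dict, Hashable, Iterable, Set


def misra_gries(stream: Iterable[Hashable], k: int) -> Set[Hashable]:
    """Return at most k - 1 candidates, including every element that
    appears more than n/k times in a stream of length n."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for e in stream:
        if e in counters:
            counters[e] += 1
        elif len(counters) < k - 1:
            counters[e] = 1
        else:
            # No free counter: decrement all counters and drop the zeros.
            # This cancels e against one appearance of each stored element.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)


stream = [1] * 5 + [2] * 4 + [3, 4, 5]  # n = 12, so n/3 = 4
print(misra_gries(stream, 3))
```

For k = 2 this reduces to the single-counter algorithm: one stored candidate whose counter goes up on a match and down on a mismatch. As in the remarks, a second pass is needed to filter out candidates that do not actually exceed n/k appearances.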

Streaming Algorithms for Graph Problems
Streaming for Graph Problems
- The stream consists of the edges of the graph.
- We allow O(n polylog(n)) space. This is sublinear in the length of the stream, which can be Θ(n²).
(Trivial) Algorithm for Counting Connected Components
- Initially each node is an independent connected component.
- For each edge e that arrives, if the end points of e belong to different connected components, merge these connected components.

Algorithm for Counting Connected Components
Example
[Figure: an example run of the algorithm.]
Analysis
- Space complexity: O(n log n). It is enough to maintain the list of nodes in each connected component.
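One possible way to implement the merging step is sketched below. Instead of explicit node lists it uses a union-find (disjoint-set) structure, which is a substitution made for this illustration; it still stores one word per node, i.e., O(n log n) bits. Nodes are assumed to be numbered 0 to n - 1.

```python
from typing import Iterable, Tuple


def count_components(n: int, edge_stream: Iterable[Tuple[int, int]]) -> int:
    """Count connected components in one pass over the edges."""
    parent = list(range(n))  # each node starts as its own component

    def find(x: int) -> int:
        # Follow parent pointers to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for u, v in edge_stream:
        ru, rv = find(u), find(v)
        if ru != rv:  # end points lie in different components: merge them
            parent[ru] = rv
            components -= 1
    return components


print(count_components(5, [(0, 1), (1, 2), (3, 4)]))
```

The edges are consumed one at a time and never stored, so the memory use does not depend on the length of the stream.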

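Applications
Immediate Application
- Determining whether a graph is connected.
More Interesting Application
- Determining whether a graph is bipartite.
Algorithm
- Let G2 be the graph containing two copies u1, u2 of every node u of G, and the two edges (u1, v2) and (u2, v1) for every edge (u, v) of G.
- Let n1 and n2 be the number of connected components in G and G2, respectively.
- G is bipartite if and only if 2n1 = n2.
[Figure: an edge (u, v) of G and the corresponding edges between the copies u1, u2, v1, v2 in G2.]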

Analysis
Lemma
The copies of the nodes of a connected component C of G form:
- Two connected components of G2 if C is bipartite.
- A single connected component of G2 if C is not bipartite.
This lemma implies that the algorithm is correct.
Proof
- There is never a path in G2 between copies of nodes which are not connected in G.
- If u and v are connected in G, then each copy of u in G2 is connected to some copy of v.
- Hence, the copies of the nodes of a connected component C of G form one or two connected components in G2. Moreover, in the latter case each component contains exactly one copy of each node of C.

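Analysis (cont.)
C is not bipartite
- Let v be a node on an odd cycle of C.
- The cycle becomes a path between v1 and v2 in G2.
C is bipartite
- A path between v1 and v2 in G2 implies an odd cycle in G.
- Such a cycle cannot exist since C is bipartite.
Alternative view: if A and B are the two sides of a bipartite C, then the edges of G2 only connect A1 with B2 and A2 with B1.
[Figure: an odd cycle v, a, b, c, d in G and the corresponding path v1, a2, b1, c2, d1, v2 in G2; the sides A, B of a bipartite component and their copies A1, B2, A2, B1.]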
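The bipartiteness test can be sketched in one pass by maintaining the components of G and of G2 side by side. This sketch assumes that G2 has copies u1, u2 of every node u and edges (u1, v2), (u2, v1) for every edge (u, v), that nodes are numbered 0 to n - 1, and that copy 2 of node u is stored as u + n. The class and function names are illustrative.

```python
from typing import Iterable, Tuple


class DisjointSets:
    """Union-find structure that also tracks the number of components."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))
        self.components = size

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            self.components -= 1


def is_bipartite(n: int, edge_stream: Iterable[Tuple[int, int]]) -> bool:
    g = DisjointSets(n)        # components of G
    g2 = DisjointSets(2 * n)   # components of G2
    for u, v in edge_stream:
        g.union(u, v)
        g2.union(u, v + n)     # edge (u1, v2)
        g2.union(u + n, v)     # edge (u2, v1)
    return 2 * g.components == g2.components


print(is_bipartite(3, [(0, 1), (1, 2), (2, 0)]))
print(is_bipartite(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
```

The first call feeds a triangle (an odd cycle) and the second a 4-cycle, so the two graphs fall on opposite sides of the test.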

Questions?
