Estimating the Sortedness of a Data Stream. Parikshit Gopalan (UT Austin), T. S. Jayram (IBM Almaden), Robert Krauthgamer (IBM Almaden), Ravi Kumar (Yahoo! Research).


Data Stream Model of Computation. [Figure: input $X_1, X_2, X_3, \ldots, X_n$; storage.] Computing with massive data sets. Sequential access. Small storage space and update time. [Alon-Matias-Szegedy, …]

Sorting on Data Streams. We cannot sort efficiently on a data stream. Can we tell if the data needs to be sorted? [Ergun-Kannan-Kumar-Rubinfeld-Viswanathan, Ajtai-Jayram-Kumar-Sivakumar, Gupta-Zane, Cormode-Muthukrishnan-Sahinalp, Liben-Nowell-Vee-Zhu, Ailon-Chazelle-Comandur-Liu]
Measuring distance from sortedness:
– Kendall tau distance
– Spearman footrule distance
– Ulam distance

Candidate metrics 1. Spearman's footrule [$\ell_1$ distance]: $\sum_i |\sigma(i) - i|$. Easy to compute.

Candidate metrics 2. Kendall tau distance [number of inversions]. Inversions: pairs of positions $i < j$ with $\sigma(i) > \sigma(j)$. Within a factor 2 of Spearman's footrule [Diaconis-Graham]. An $O(\log n)$ space, 1-pass $(1 + \varepsilon)$-approximation algorithm [Ajtai-Jayram-Kumar-Sivakumar].
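As a small illustration of the first two candidate metrics, the following sketch computes both by brute force for a permutation of 1..n given as a Python list. It is not a streaming algorithm; the function names `footrule` and `kendall_tau` are chosen here for illustration only.

```python
def footrule(sigma):
    """Spearman's footrule: sum of |sigma(i) - i| over positions i = 1..n."""
    return sum(abs(v - i) for i, v in enumerate(sigma, start=1))

def kendall_tau(sigma):
    """Number of inversions, counted by brute force in O(n^2) time."""
    n = len(sigma)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if sigma[i] > sigma[j])

sigma = [3, 1, 2, 5, 4]
print(footrule(sigma), kendall_tau(sigma))  # prints "6 3"
```

On this example the footrule is exactly twice the inversion count, the extreme case of the factor-2 relation quoted from Diaconis-Graham.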

Candidate metrics 3. Ulam distance [edit distance]. $\mathrm{Ed}(\sigma)$: number of deletions needed to sort $\sigma$. Ulam: fastest way to sort a bridge hand.

Edit Distance and the LIS. $\mathrm{Ed}(\sigma)$: number of deletions needed to sort $\sigma$ (delete a number, then insert it in its correct place). $\mathrm{LIS}(\sigma)$: length of the longest increasing subsequence. $\mathrm{Ed}(\sigma) + \mathrm{LIS}(\sigma) = n$. Studied in statistics, biology, computer science, … Both take a global view of the sequence. Hard for models like streaming, sketching, and property testing. [Figure: example sequence; only the entries 51 … 100 are recoverable.]

Patience Sorting [Ross, Mallows]. Exact computation of $\mathrm{Ed}(\sigma)$ and $\mathrm{LIS}(\sigma)$. The number in place $i$ is the earliest (smallest) end of an increasing subsequence of length $i$. The number of occupied places at the end of the pass is the length of the LIS. [Figure: step-by-step run of Patience Sorting on an example sequence.]
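The invariant on the slide can be sketched in a few lines. This is a minimal version, assuming distinct numbers and a strictly increasing subsequence; the names `patience_lis` and `edit_distance_to_sorted` are illustrative, and the list `tails` plays the role of the "places".

```python
from bisect import bisect_left

def patience_lis(stream):
    """One pass over the stream; tails[i] is the smallest last element of
    any increasing subsequence of length i + 1 seen so far."""
    tails = []
    for x in stream:
        pos = bisect_left(tails, x)   # first place holding a number >= x
        if pos == len(tails):
            tails.append(x)           # x extends the longest subsequence
        else:
            tails[pos] = x            # x is an earlier end for length pos + 1
    return len(tails)

def edit_distance_to_sorted(seq):
    """Ed = n - LIS, using the identity Ed + LIS = n."""
    return len(seq) - patience_lis(seq)

print(patience_lis([3, 1, 2, 5, 4]), edit_distance_to_sorted([3, 1, 2, 5, 4]))  # prints "3 2"
```

The list `tails` can grow to the length of the LIS, which is why this exact method needs linear space in the worst case.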

Prior Work
Exact computation of $\mathrm{Ed}(\sigma)$ and $\mathrm{LIS}(\sigma)$:
– Patience Sorting [Ross, Mallows]
– $O(n)$ space, 1-pass streaming algorithm.
– $\Omega(\sqrt{n})$ space lower bound. [Liben-Nowell-Vee-Zhu]
Approximating $\mathrm{Ed}(\sigma)$ and $\mathrm{LIS}(\sigma)$:
– No sublinear-space algorithms, no lower bounds. [Ajtai et al., Cormode et al., Liben-Nowell et al.]
LIS algorithms parametrized by the length of the LIS: [Liben-Nowell-Vee-Zhu, Sun-Woodruff]
Computing $\mathrm{Ed}(\sigma)$ in other models:
– Property testing [Ergun et al., Ailon et al.]
– Sketching [Charikar-Krauthgamer]

Our Results
Approximating $\mathrm{Ed}(\sigma)$:
– An $O(\log^2 n)$ space, randomized 4-approximation for $\mathrm{Ed}(\sigma)$.
– An $O(\sqrt{n})$ space, deterministic $(1 + \varepsilon)$-approximation for $\mathrm{Ed}(\sigma)$.
Approximating the LIS:
– An $O(\sqrt{n})$ space, deterministic $(1 + \varepsilon)$-approximation for $\mathrm{LIS}(\sigma)$.
Exact computation of $\mathrm{Ed}(\sigma)$ and $\mathrm{LIS}(\sigma)$:
– An $\Omega(n)$ space lower bound for randomized algorithms.
– Independently proved by [Sun-Woodruff].
Lower bounds for approximating the LIS:
– Conjecture: deterministic algorithms require $\Omega(\sqrt{n})$ space for a $(1 + \varepsilon)$-approximation.

Computing the Edit Distance. Thm: For any $\varepsilon > 0$, there is a one-pass randomized algorithm using $O(\varepsilon^{-2} \log^2 n)$ space and update time that gives a $(4 + \varepsilon)$-approximation to $\mathrm{Ed}(\sigma)$.
1. A combinatorial measure that approximates the Ulam distance. Builds on [Ergun et al., Ailon et al.].
2. A sampling scheme to compute this measure in one pass.

A Voting Scheme [Ergun et al.]. A combinatorial measure called Unpopularity. Neighborhoods of $\sigma(i)$: intervals starting or ending at $i$. Deciding if $\sigma(i)$ is unpopular: for every neighborhood of $\sigma(i)$, every number in the neighborhood votes on the question "Is $\sigma(i)$ out of order?" If the majority in some neighborhood votes against $\sigma(i)$, it is marked unpopular. Let $U(\sigma)$ denote the number of unpopular numbers. [Ergun et al.]: $\mathrm{Ed}(\sigma) \le U(\sigma)$. [Ailon et al.]: $U(\sigma) \le 2\,\mathrm{Ed}(\sigma)$.

A Voting Scheme [Ergun et al.]. Can we estimate $U(\sigma)$ using a streaming algorithm? It is impossible to decide if $\sigma(i)$ is unpopular before seeing the entire input.

A New Voting Scheme. Neighborhoods of $\sigma(i)$: intervals ending at $i$. If the majority in some neighborhood votes against $\sigma(i)$, it is marked unpopular. Unpopularity is based only on the past, not the future. Thm: Let $V(\sigma)$ denote the number of unpopular numbers. Then $\mathrm{Ed}(\sigma)/2 \le V(\sigma) \le 2\,\mathrm{Ed}(\sigma)$.
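To make the past-only voting rule concrete, here is a brute-force sketch that is not a streaming algorithm. It rests on two assumptions not spelled out on the slide: a neighborhood is an interval of positions ending just before $i$, and "majority" means a strict majority of entries larger than $\sigma(i)$. The names `unpopular_past` and `lis_length` are illustrative.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in seq:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)

def unpopular_past(sigma):
    """Positions i for which some interval [j, i-1] has a strict majority
    of entries larger than sigma[i] (assumed reading of the voting rule)."""
    marked = []
    for i in range(len(sigma)):
        against = 0
        for j in range(i - 1, -1, -1):    # grow the interval leftwards
            if sigma[j] > sigma[i]:
                against += 1
            if 2 * against > i - j:       # strict majority votes against
                marked.append(i)
                break
    return marked

sigma = [3, 1, 2, 5, 4]
print(unpopular_past(sigma), len(sigma) - lis_length(sigma))  # prints "[1, 4] 2"
```

On this input two positions are marked and the edit distance is also 2, which sits inside the range allowed by the theorem.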

A Voting Scheme. Let $\mathrm{Ed}(\sigma) = k$. Then $V(\sigma) \le 2k$. Fix an optimal Bad set of size $k$ to delete; the remaining numbers are Good. How many numbers can be Unpopular? Partition Unpopular into Good and Bad. Good numbers form an increasing sequence, so Good never votes against Good. Good + Unpopular $\Rightarrow$ Bad neighborhood (some neighborhood in which the majority is Bad)! If $k$ numbers are Bad, at most $k$ are Good + Unpopular. Bad numbers might all be Unpopular. Hence $V(\sigma) \le 2k$.

A Voting Scheme. Let $\mathrm{Ed}(\sigma) = k$. Then $V(\sigma) \le 2k$. The bound can be tight. [Figure: example sequence; only the final entry 90 is recoverable.]

A Voting Scheme. Let $V(\sigma) = k$. Then $\mathrm{Ed}(\sigma) \le 2k$. Fix the set of $k$ Unpopular elements. Algorithm to produce an increasing sequence:
1. Scan right to left.
2. Delete Unpopular elements and inversions with respect to the last number kept in the sequence.
At least half of the deletions are Unpopular numbers. What remains is an increasing sequence.

A Voting Scheme. Let $V(\sigma) = k$. Then $\mathrm{Ed}(\sigma) \le 2k$. The bound can be tight. [Figure: example sequence; only the entries 11 … 90 are recoverable.]

A New Voting Scheme (recap). Neighborhoods of $\sigma(i)$ are intervals ending at $i$, so unpopularity depends only on the past. Thm: $\mathrm{Ed}(\sigma)/2 \le V(\sigma) \le 2\,\mathrm{Ed}(\sigma)$. Can we estimate $V(\sigma)$ efficiently?

Outline of Sampling Scheme. Taking a vote in one neighborhood:
– Take $O(\log n)$ samples and take the (approximate) majority. Reservoir Sampling [Vitter].
Computing $V(\sigma)$: need $O(\log n)$ samples from every neighborhood.
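A minimal sketch of the single-neighborhood step: reservoir sampling keeps a uniform sample of fixed size in one pass, and the vote is decided from the sample alone. This covers only one neighborhood, not the full scheme of the talk, which shares samples across neighborhoods. The names `reservoir_sample` and `sampled_vote_against` are illustrative, and the sample size `k` is a free parameter here.

```python
import random

def reservoir_sample(stream, k, rng):
    """Uniform sample of k items from a stream of unknown length, one pass."""
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            sample.append(x)          # fill the reservoir first
        else:
            r = rng.randrange(t)      # uniform integer in [0, t)
            if r < k:                 # happens with probability k / t
                sample[r] = x
    return sample

def sampled_vote_against(neighborhood, value, k, rng):
    """Approximate majority: do most sampled entries exceed `value`?"""
    sample = reservoir_sample(neighborhood, k, rng)
    against = sum(1 for x in sample if x > value)
    return 2 * against > len(sample)

rng = random.Random(0)
print(sampled_vote_against([9, 8, 7, 6], 5, 3, rng))  # prints "True"
```

In the example every entry of the neighborhood exceeds 5, so the outcome does not depend on which entries are sampled.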

Outline of Sampling Scheme. Computing $V(\sigma)$: need $O(\log n)$ samples from every neighborhood. Key observation: we don't need samples across intervals to be independent! Roughly $O(\log^2 n)$ samples suffice.

Deterministic Algorithm for LIS. Thm: For any $\varepsilon > 0$, there is a one-pass deterministic algorithm using $O(\sqrt{n/\varepsilon})$ space and update time that gives a $(1 - \varepsilon)$-approximation to $\mathrm{LIS}(\sigma)$. Based on a multiplayer communication protocol for LIS. [Figure: input sequence split among the players; entries not recoverable.] The algorithm simulates the protocol for $\sqrt{n}$ players.

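Two-Player Protocol for LIS. [Figure: two-player protocol on an example input split into two halves; recoverable labels: Patience Sorting, multiples of $\varepsilon k$, $n/2$, $k$, $1/\varepsilon$.]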

Approximating the LIS. Conjecture: For some $\varepsilon_0 > 0$, every 1-pass deterministic algorithm that gives a $(1 + \varepsilon_0)$-approximation to $\mathrm{LIS}(\sigma)$ requires $\Omega(\sqrt{n})$ space. Consider the $k$-player communication protocol for LIS. [Figure: input sequence split among the players; entries not recoverable.] As $k$ increases, the maximum message size increases. Proving the conjecture requires analyzing $k \approx \sqrt{n}$.

Lower Bounds for Approximating the LIS. Conjecture: For some $\varepsilon_0 > 0$, every 1-pass deterministic algorithm that gives a $(1 + \varepsilon_0)$-approximation to $\mathrm{LIS}(\sigma)$ requires $\Omega(\sqrt{n})$ space. Candidate hard instances? [Figure: example "No" and "Yes" instances; not recoverable.]

Open Problems
– Estimate the edit distance between two permutations.
– Tight bounds for approximation: show an $\Omega(\sqrt{n})$ lower bound for deterministic algorithms.
– A randomized algorithm for LIS?