Algorithms for data streams Foundations of Data Science 2014 Indian Institute of Science Navin Goyal.


Introduction
Data streams: very large input data arriving sequentially, too large to fit in memory.
Examples:
– networks (traffic passing through a router)
– databases (transaction logs)
– scientific data (satellites, sensors, LHC, …)
– financial data
What can we compute about the data in such situations?
Today’s lecture: start with an illustrative example problem, then cover some generalities about the streaming model and problems.

Example: Counting

Counting

Performance of Morris counter

Boosting the success probability I

Performance of Morris counter

Boosting the success probability II

Boosting success probability II

Test your understanding: Why don’t we just use the median all the time for boosting the probability of success instead of the mean?

Recap

Questions to ponder

Streaming data: models and problems

Models for streaming data

Restrictions on the algorithm

Some streaming problems: frequency moments

A general template for many streaming algorithms
– Come up with a basic random estimator for the quantity of interest (usually the non-trivial part).
– Give an efficient algorithm to compute the estimator (this may require hashing or some other way of reducing the randomness requirements).
– Improve the probability of success by some trick such as the median-of-means estimator.
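As a rough illustration of this template, the sketch below applies it to the counting example from earlier in the lecture. It assumes the standard Morris counter (increment a stored exponent x with probability 2^(-x), report 2^x - 1) as the basic estimator and boosts it with a median of means. The names `MorrisCounter` and `median_of_means_count`, and the parameters `groups` and `copies_per_group`, are hypothetical choices made for this example and do not come from the slides. Counting needs no hashing, so the second step of the template is trivial here.

```python
import random
import statistics


class MorrisCounter:
    """Approximate counter that stores only a small exponent x."""

    def __init__(self, rng):
        self.x = 0
        self.rng = rng

    def increment(self):
        # Step 1 (basic estimator): raise x with probability 2^(-x).
        if self.rng.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        # 2^x - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.x - 1


def median_of_means_count(stream_length, groups=5, copies_per_group=30, seed=0):
    """Step 3 (boosting): average within groups, then take the median of the group means."""
    rng = random.Random(seed)  # fixed seed is an assumption, used only for repeatability
    counters = [
        [MorrisCounter(rng) for _ in range(copies_per_group)]
        for _ in range(groups)
    ]
    # One pass over the stream: every independent copy sees every item.
    for _ in range(stream_length):
        for group in counters:
            for counter in group:
                counter.increment()
    # Averaging reduces the variance; the median reduces the failure probability.
    group_means = [
        statistics.mean(counter.estimate() for counter in group)
        for group in counters
    ]
    return statistics.median(group_means)


if __name__ == "__main__":
    print(median_of_means_count(1000))
```

Increasing `copies_per_group` should tighten the accuracy of each group mean, while increasing `groups` should lower the chance that the final median is far from the true count.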

Plan for next few lectures