Seminar on Sub-linear algorithms Prof. Ronitt Rubinfeld.


Seminar on Sub-linear algorithms Prof. Ronitt Rubinfeld

What are we here for?

Big data?

Really Big Data: impossible to access all of it; potentially accessible data is too enormous to be viewed by a single individual; once accessed, the data can change.

Approaches: ignore the problem, or develop algorithms for dealing with such data.

Small world phenomenon: each "node" is a person; an "edge" connects people who know each other.

Small world property: "connected" if every pair can reach each other; the "distance" between two people is the minimum number of edges to reach one from the other; the "diameter" is the maximum distance between any pair.

6 degrees of separation property. In our language: the diameter of the world population graph is 6.

Does Earth have the small world property? How can we know? The data collection problem is immense: there are unknown groups of people on Earth, and births/deaths keep changing the data.

The gold standard: linear time algorithms. Inadequate…

What can we hope to do without viewing most of the data? Can't answer "for all", "there exists", and other "exactly"-type statements: Are all individuals connected by at most 6 degrees of separation? Exactly how many individuals on earth are left-handed? Maybe we can answer: Is there a large group of individuals connected by at most 6 degrees of separation? Is the average pairwise distance of the graph roughly 6? Approximately how many individuals on earth are left-handed?

What can we hope to do without viewing most of the data? We must change our goals: for most interesting problems, the algorithm must give an approximate answer. We know we can answer some questions… e.g., sampling to approximate the average or median value.
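For instance, estimating an average of 0/1 values (say, the fraction of left-handed individuals) needs a number of samples that depends only on the desired accuracy, not on the population size. A minimal sketch in Python, with a hypothetical `population` list of 0/1 flags; the sample size follows the standard Hoeffding bound.

```python
import math
import random

def estimate_fraction(population, eps=0.05, delta=0.01):
    """Estimate the fraction of 1s in a 0/1 population by sampling.

    By the Hoeffding bound, ceil(ln(2/delta) / (2*eps^2)) samples give an
    estimate within additive eps with probability at least 1 - delta,
    independent of how large the population is.
    """
    m = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    hits = sum(random.choice(population) for _ in range(m))
    return hits / m

# Toy usage: 10% of a hypothetical population is "left-handed".
population = [1] * 100_000 + [0] * 900_000
print(estimate_fraction(population, eps=0.02))   # close to 0.1
```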

Sublinear time models: Random access queries: can access any word of the input in one step. Samples: can get a sample from a distribution p (producing x_1, x_2, x_3, …) in one step; alternatively, can only get a random word of the input in one step, e.g., when computing functions that depend on the frequencies of data elements, or when the data arrives in random order.

Sublinear space model: streaming. Data flows by, and once it goes by you can't see it again; you do have a small amount of space to work with and remember some things…
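One classic small-space primitive in this model (not from the slides) is reservoir sampling: keep a uniform random sample of the stream using space proportional to the sample size, no matter how long the stream is. A minimal sketch in Python:

```python
import random

def reservoir_sample(stream, k):
    """One pass over the stream, O(k) words of memory; at the end the
    reservoir is a uniform random sample of k stream items (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randrange(i + 1)   # uniform over positions 0..i
            if j < k:
                reservoir[j] = item       # replace a current sample
    return reservoir

print(reservoir_sample(range(10**6), 5))
```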

Part I: Query model

A. Standard approximation problems: estimate the value of an optimization problem after viewing only a small portion of the data. Classically studied problems: minimum spanning tree, vertex cover, max cut, positive linear programs, edit distance, … Some problems are "hard", some are "easy"…

Example: Vertex Cover. What is the smallest set of nodes that "sees" each edge? What's known? NP-complete (unlikely to have an efficient exact algorithm). Can approximate to within a multiplicative factor of 2 in linear time (see the sketch below). Can approximate to within a multiplicative factor of 2 plus a small additive error in CONSTANT TIME on sparse graphs! [Nguyen Onak 2008] [Onak Ron Rosen Rubinfeld 2012]
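The linear-time factor-2 approximation mentioned above is usually obtained from a greedy maximal matching: take both endpoints of every edge that is not yet covered. A minimal sketch, assuming the graph is given as an edge list:

```python
def vertex_cover_2approx(edges):
    """Greedy maximal matching: whenever an edge has both endpoints uncovered,
    add both endpoints to the cover. Any vertex cover must contain at least one
    endpoint of each matched edge, so the result is at most twice optimal."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover

# A 4-cycle: the optimum cover has size 2; this returns a cover of size at most 4.
print(vertex_cover_2approx([(0, 1), (1, 2), (2, 3), (3, 0)]))
```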

B. Property testing

Quickly distinguish inputs that have a specific property from those that are far from having the property. Benefits: a natural question; just as good when the data is constantly changing; a fast sanity check to rule out "bad" inputs (e.g., restaurant bills) and to decide when expensive processing is worthwhile; machine learning: the model selection problem. Main goal (pictured on the slide): among all inputs, accept those with the property and those close to having the property.

Property testing algorithms: Does the input object have crucial properties? Clusterability, small diameter graph, close to a codeword, in alphabetical order,… Examples: Can test if the social network has 6 degrees of separation in CONSTANT TIME! [Parnas Ron] Can test if a large dataset is sorted in O(log n) time [Ergun Kannan Kumar Rubinfeld Viswanathan] Can test if data is well-clusterable in CONSTANT time [Parnas Ron Rubinfeld]

Property testing: properties of any object, e.g., functions, graphs, strings, matrices, codewords. The model must specify the representation of the object and the allowable queries, and the notion of close/far, e.g., the number of bits/words that need to be changed, or edit distance.

A simple property tester

Sortedness of a sequence. Given: a list y_1, y_2, …, y_n. Question: is the list sorted? Clearly requires n steps – must look at each y_i.

Sortedness of a sequence. Given: a list y_1, y_2, …, y_n. Question: can we quickly test whether the list is close to sorted?

What do we mean by "quick"? Query complexity, measured in terms of the list size n. Our goal (if possible): very small compared to n; we will go for c·log n.

What do we mean by "close"? Definition: a list of size n is ε-close to sorted if we can delete at most εn values to make it sorted; otherwise it is ε-far. (ε is given as input, e.g., ε = 1/10.) [Slide shows examples: sorted, close, far.]
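For intuition (not a sublinear algorithm), the distance to sortedness under this definition is exactly one minus the fraction of elements in a longest non-decreasing subsequence, computable in O(n log n) time. A sketch in Python:

```python
from bisect import bisect_right

def distance_to_sorted(ys):
    """Fraction of entries that must be deleted to leave a sorted
    (non-decreasing) list: 1 - |longest non-decreasing subsequence| / n."""
    tails = []                      # tails[k] = smallest tail of a subsequence of length k+1
    for y in ys:
        pos = bisect_right(tails, y)
        if pos == len(tails):
            tails.append(y)
        else:
            tails[pos] = y
    return 1 - len(tails) / len(ys)

print(distance_to_sorted([1, 2, 3, 4, 5]))   # 0.0 (sorted)
print(distance_to_sorted([5, 4, 3, 2, 1]))   # 0.8 (far from sorted)
```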

Requirements for the algorithm: pass sorted lists; fail lists that are ε-far. Equivalently: if a list is likely to pass the test, then at most an ε fraction of the list needs to be changed to make it sorted. Probability of success > 3/4 (can be boosted arbitrarily high by repeating several times and outputting "fail" if we ever see a "fail", "pass" otherwise). Can test in O((1/ε) log n) time (and can't do any better!). What if the list is not sorted, but not far either? Then the tester may answer either way.

An attempt. Proposed algorithm: pick a random i and test that y_i ≤ y_{i+1}. Bad input type: 1,2,3,4,5,…,n/4, 1,2,…,n/4, 1,2,…,n/4, 1,2,…,n/4. It is difficult for this algorithm to find a "breakpoint", but other tests work well… (a sketch follows below). [Plot: y_i versus i.]
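A quick sketch of this first attempt and the bad input above (Python; the `trials` parameter is an illustrative choice, not from the slides). Only the three block boundaries are decreasing adjacent pairs, so they are almost never sampled:

```python
import random

def adjacent_pair_test(ys, trials=100):
    """First attempt: sample random adjacent pairs and check ys[i] <= ys[i+1]."""
    for _ in range(trials):
        i = random.randrange(len(ys) - 1)
        if ys[i] > ys[i + 1]:
            return "FAIL"
    return "PASS"

# Bad input from the slide: four copies of 1..n/4. The list is far from sorted,
# yet only the 3 block boundaries are decreasing adjacent pairs.
n = 10**6
bad = list(range(1, n // 4 + 1)) * 4
print(adjacent_pair_test(bad))   # almost always PASS
```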

A second attempt. Proposed algorithm: pick random i < j and test that y_i ≤ y_j. Bad input type: n/4 groups of 4 decreasing elements: 4,3,2,1, 8,7,6,5, 12,11,10,9, …, 4k, 4k-1, 4k-2, 4k-3, … The longest monotone subsequence has length n/4, and we must pick i, j in the same group to see a problem, so we need Ω(n^{1/2}) samples (a sketch follows below). [Plot: y_i versus i.]
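And a sketch of the second attempt on its bad input (again with an illustrative `trials` parameter). A violated pair is visible only when both queried positions fall in the same 4-element block:

```python
import random

def random_pair_test(ys, trials=100):
    """Second attempt: sample random index pairs i < j and check ys[i] <= ys[j]."""
    n = len(ys)
    for _ in range(trials):
        i, j = sorted(random.sample(range(n), 2))
        if ys[i] > ys[j]:
            return "FAIL"
    return "PASS"

# Bad input from the slide: n/4 blocks of 4 decreasing values: 4,3,2,1, 8,7,6,5, ...
# A single random pair lands in the same block with probability about 3/n, and even
# an algorithm that compares all pairs among its queried positions needs roughly
# sqrt(n) positions before two of them share a block (birthday paradox).
n = 10**6
bad = [4 * k + r for k in range(n // 4) for r in (4, 3, 2, 1)]
print(random_pair_test(bad))     # usually PASS, although the list is 3/4-far from sorted
```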

A minor simplification: assume the list is distinct (i.e., x_i ≠ x_j for i ≠ j). Claim: this is not really easier. Why? We can "virtually" append i to each x_i: x_1, x_2, …, x_n → (x_1, 1), (x_2, 2), …, (x_n, n); e.g., 1,1,2,6,6 → (1,1), (1,2), (2,3), (6,4), (6,5). This breaks ties without changing the order.
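This virtual transformation is a one-liner; comparisons on the pairs are lexicographic, so equal values are ordered by position:

```python
def break_ties(xs):
    """Pair each value with its (1-based) position; ties are broken by position."""
    return [(x, i) for i, x in enumerate(xs, start=1)]

print(break_ties([1, 1, 2, 6, 6]))   # [(1, 1), (1, 2), (2, 3), (6, 4), (6, 5)]
```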

A test that works. The test: repeat O(1/ε) times: pick a random i; look at the value of y_i; do a binary search for y_i. Does the binary search find any inconsistencies? If yes, FAIL. Do we end up at location i? If not, FAIL. Pass if we never failed. Running time: O(ε^{-1} log n). Why does this work? (A sketch follows below.)
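A minimal Python sketch of this spot-checker; the repetition count below is one illustrative choice of the O(1/ε) constant that pushes the failure probability below 1/4:

```python
import math
import random

def good_index(ys, i):
    """Binary search for the value ys[i] (values assumed distinct).
    Index i is 'good' if the search ends exactly at position i; every
    comparison made along the way is then consistent with sortedness."""
    lo, hi = 0, len(ys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if ys[mid] == ys[i]:
            return mid == i
        if ys[mid] < ys[i]:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

def sortedness_tester(ys, eps=0.1):
    """O((1/eps) log n)-time spot-checker: passes every sorted list, and fails
    (with probability >= 3/4) every list that is eps-far from sorted, since such
    a list has at most (1 - eps) * n good indices and (1 - eps)^(2/eps) < 1/4."""
    for _ in range(math.ceil(2 / eps)):
        i = random.randrange(len(ys))
        if not good_index(ys, i):
            return "FAIL"
    return "PASS"

print(sortedness_tester(list(range(1000))))          # PASS: sorted list
print(sortedness_tester(list(range(1000, 0, -1))))   # FAIL (w.h.p.): reversed list
```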

Behavior of the test: define index i to be good if the binary search for y_i is successful. The O((1/ε) log n)-time test (restated): pick O(1/ε) indices i and pass if they are all good. Correctness: if the list is sorted, then all i's are good (uses distinctness), so the test always passes. If the list is likely to pass the test, then at least (1-ε)n of the i's are good. Main observation: the good elements form an increasing subsequence. Proof: for i < j both good, we need to show y_i < y_j. Let k be the least common ancestor of i and j in the binary search tree. The search for i went left of k and the search for j went right of k, so y_i < y_k < y_j. Thus the list is ε-close to monotone (delete the < εn bad elements).

Part II: Sample model

Testing and learning distributions over BIG domains

Outbreak of diseases: similar patterns? Correlated with income level? More prevalent near large cities? [Maps: Flu 2015, Flu 2016.] Lots of zip codes!

Transactions of one age group vs. transactions of another age group. Testing closeness of two distributions: trend change? Lots of product types!

Play the lottery? Is it uniform? Is it independent?

Information in neural spike trains. Each application of a stimulus gives a sample of the signal (spike train). The entropy of the (discretized) signal indicates which neurons respond to the stimuli. [Plot: neural signals over time.] [Strong, Koberle, de Ruyter van Steveninck, Bialek '98]
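As a baseline (not one of the sublinear-sample estimators this line of work develops), the naive plug-in entropy estimate from empirical frequencies looks like this, on hypothetical discretized spike "words":

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Naive plug-in estimate of entropy (in bits) from empirical frequencies.
    Accurate only with many samples per distinct symbol; the point of
    sublinear-sample estimators is to do much better than this."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(plugin_entropy(['00', '01', '00', '10', '00', '01']))   # entropy of the empirical distribution
```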

High dimensional data: clusterable? Does it depend on all variables or only a few? E.g., blood pressure, age, weight, pulse, temperature, … vs. ICU, mortality, ventilator, …

What properties do your big distributions have?

Properties of distributions. Given samples of a distribution, we may need to know, e.g.: Is it uniform, normal, Poisson, Gaussian, Zipfian, …? Correlations/independence? Entropy? Number of distinct elements? "Shape" (monotone, bimodal, …)?

Key question: how many samples do you need, in terms of the domain size? Do you need to estimate the probability of each domain item? OR can the sample complexity be sublinear in the size of the domain? (Sublinear sample complexity rules out standard statistical techniques; see the sketch below.)
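One concrete example of a sublinear-sample test is the classic collision-based uniformity tester: the collision probability Σ_i p_i² equals 1/n for the uniform distribution on a domain of size n and is at least (1 + ε²)/n for any distribution that is ε-far from uniform in L1 distance, and it can be estimated from roughly √n samples. A hedged sketch in Python; the constants and threshold below are illustrative, not tuned:

```python
import math
import random
from collections import Counter

def collision_uniformity_test(samples, domain_size, eps=0.5):
    """Estimate the collision probability from pairwise collisions and compare it
    to a threshold between 1/n (uniform) and (1 + eps^2)/n (eps-far in L1)."""
    m = len(samples)
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    estimate = collisions / (m * (m - 1) / 2)
    threshold = (1 + eps ** 2 / 2) / domain_size
    return "looks uniform" if estimate <= threshold else "far from uniform"

# Toy usage: a domain of a million items, only ~40,000 samples.
n, eps = 1_000_000, 0.5
m = int(10 * math.sqrt(n) / eps ** 2)
uniform_samples = [random.randrange(n) for _ in range(m)]
skewed_samples = [random.randrange(n // 1000) for _ in range(m)]   # mass on 0.1% of the domain
print(collision_uniformity_test(uniform_samples, n, eps))
print(collision_uniformity_test(skewed_samples, n, eps))
```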

In many cases, the answer is YES! Well?

Assumptions on distributions? Is it Smooth? Monotone? Normal? Gaussian? Hopefully – NO ASSUMPTIONS!

Distributions on BIG domains

Independence, monotonicity, Gaussian, normal, uniform, … Clusterable, depends on few variables, … and lots and lots more!

To summarize: what can we do in sublinear time/space? Approximations – what do we mean by this? Optimization problems; decision problems – property testing; which distance metric? Which types of problems? Graphs and combinatorial objects, strings, functions, probability distributions. Settings? Small time, small space, assistance of a prover?

Related course: Sublinear time algorithms Monday in Shenkar 104

My contact info: you can also Google "Ronitt" and follow the links.

Giving a talk

Main issue

Giving an INTERESTING talk Good material Good slides Good delivery

Material Pick something you like! How many technical details? Make sure we learn something technical/conceptual

Slides At least 24 point font As few words as possible Split into more slides? Organization Use bullets, sub-bullets Use color wisely Write anything which should be remembered Repeat a definition on next slide? Use a board? Animations? Don’t overdo it! SPELL CHECK!!

Delivery. Bad: mumbling, a monotone voice, repeating each sentence twice, filler words, standing in front of the slides, not looking at the audience. Important: practice the talk – several times!

What now? Pick a paper, or two or three? Come talk to me!