No time! What can we hope to do without viewing most of the data?
Small world phenomenon The social network is a graph: – a node is a person – an edge connects two people who know each other 6 degrees of separation: Are all pairs of people connected by a path of length at most 6?
Vast data Impossible to access all of it Accessible data is too enormous to be viewed by a single individual Once accessed, data can change
The Gold Standard Linear time algorithms: – for inputs encoded by n bits/words, allow cn time steps (constant c) Inadequate…
What can we hope to do without viewing most of the data? Can't answer "for all" or "exactly" type statements: are all individuals connected by at most 6 degrees of separation? exactly how many individuals on earth are left-handed? Compromise? is there a large group of individuals connected by at most 6 degrees of separation? approximately how many individuals on earth are left-handed?
Property Testing Does the input object have crucial properties? Example Properties: Clusterability, Small diameter graph, Close to a codeword, Linear or low degree polynomial function, Increasing order Lots and lots more…
In the ballpark vs. out of the ballpark tests Property testing: Distinguish inputs that have a specific property from those that are far from having that property Benefits: – Can often answer such questions much faster – May be the natural question to ask when some noise is always present or when data is constantly changing – Gives a fast sanity check to rule out very bad inputs (e.g., restaurant bills) – Model selection problem in machine learning
Examples Can test if a function is a homomorphism in CONSTANT TIME [Blum Luby R.] Can test if the social network has 6 degrees of separation in CONSTANT TIME [Parnas Ron]
Constructing a property tester: Find a characterization of the property that is – Efficiently (locally) testable – Robust: objects that have the property satisfy the characterization, and objects far from having the property are unlikely to PASS (usually the bigger challenge)
A bad testing characterization: ∀x,y f(x)+f(y) = f(x+y) Another bad characterization: For most x, f(x)+f(1) = f(x+1) Good characterization: For most x,y, f(x)+f(y) = f(x+y) Example: Homomorphism property of functions (a sketch of the resulting test follows below)
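For concreteness, here is a minimal sketch of the test suggested by the good characterization, i.e., the [Blum Luby R.] homomorphism test, assuming f maps Z_n to Z_n; the number of trials would be tuned to the distance parameter ε:

```python
import random

def blr_homomorphism_test(f, n, trials=100):
    """Sample random x, y and check f(x) + f(y) = f(x+y) (mod n).
    A homomorphism always passes; a function that is far from every
    homomorphism is caught with constant probability per trial, so
    O(1/eps) trials suffice for distance parameter eps."""
    for _ in range(trials):
        x, y = random.randrange(n), random.randrange(n)
        if (f(x) + f(y)) % n != f((x + y) % n):
            return False  # violated triple found: reject
    return True  # no violation seen: accept

# f(x) = 3x mod 101 is a homomorphism, so this prints True.
print(blr_homomorphism_test(lambda x: (3 * x) % 101, 101))
```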
A bad testing characterization: For every node, all other nodes within distance 6. Another bad one: For most nodes, all other nodes within distance 6. Good characterization: For most nodes, there are many other nodes within distance 6. Example: 6 degrees of separation
Many more properties studied! Graphs, functions, point sets, strings, … Amazing characterizations of problems testable in graph and function testing models!
Properties of functions Linearity and low total degree polynomials [Blum Luby R.] [Bellare Coppersmith Hastad Kiwi Sudan] [R. Sudan] [Arora Safra] [Arora Lund Motwani Sudan Szegedy] [Arora Sudan]... Functions definable by functional equations – trigonometric, elliptic functions [R.] All locally characterized affine invariant function classes! [Bhattacharyya Fischer Hatami Hatami Lovett] Groups, Fields [Ergun Kannan Kumar R. Viswanathan] Monotonicity [EKKRV] [Goldreich Goldwasser Lehman Ron Samorodnitsky] [Dodis Goldreich Lehman Ron Raskhodnikova Samorodnitsky][Fischer Lehman Newman Raskhodnikova R. Samorodnitsky] [Chakrabarty Seshadri]… Convexity, submodularity [Parnas Ron R.] [Fattal Ron] [Seshadri Vondrak]… Low complexity functions, Juntas [Parnas Ron Samorodnitsky] [Fischer Kindler Ron Safra Samorodnitsky] [Diakonikolas Lee Matulef Onak R. Servedio Wan]…
Linear and low degree polynomials, trigonometric functions [Ergun Kumar Rubinfeld] [Kiwi Magniez Santha] [Magniez]… Some properties testable even with approximation errors!
Properties of sparse and general graphs Easily testable dense and hyperfinite graph properties are completely characterized! [Alon Fischer Newman Shapira][Newman Sohler] General Sparse graphs: bipartiteness, connectivity, diameter, colorability, rapid mixing, triangle free,… [Goldreich Ron] [Parnas Ron] [Batu Fortnow R. Smith White] [Kaufman Krivelevich Ron] [Alon Kaufman Krivelevich Ron]… Tools: Szemeredi regularity lemma, random walks, local search, simulate greedy, borrow from parallel algorithms
Some other combinatorial properties Set properties – equality, distinctness,... String properties – edit distance, compressibility,… Metric properties – metrics, clustering, convex hulls, embeddability... Membership in low complexity languages – regular languages, constant width branching programs, context-free languages, regular tree languages… Codes – BCH and dual-BCH codes, Generalized Reed- Muller,…
Can quantum property testers do better? Yes! e.g., [Buhrman Fortnow Newman Rohrig] [Friedl Ivanyos Santha] [Friedl Santha Magniez Sen]
Active property testing [Balcan Blais Blum Yang] May only query labels of points from a sample Useful for the model selection problem
Traditional approximation Output number close to value of the optimal solution (not enough time to construct a solution) Some examples: Minimum spanning tree, vertex cover, max cut, positive linear program, edit distance, …
Example: Vertex Cover Given a graph G(V,E), a vertex cover (VC) C is a subset of V that touches every edge, i.e., every edge has at least one endpoint in C What is the minimum size of a vertex cover? NP-complete Poly-time multiplicative 2-approximation based on the relationship between VC and maximal matching (see the sketch below)
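The classical 2-approximation just mentioned fits in a few lines; this is the standard maximal-matching argument, sketched in Python:

```python
def two_approx_vertex_cover(edges):
    """Greedily build a maximal matching and take both endpoints of
    every matched edge. Maximality ensures every edge is covered, and
    any optimal cover must pick at least one endpoint per matched
    edge, so the output has size at most 2*OPT."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))  # edge (u, v) joins the matching
    return cover

# Example: the path 1-2-3-4 yields {1, 2, 3, 4}, twice the optimum {2, 3}.
print(two_approx_vertex_cover([(1, 2), (2, 3), (3, 4)]))
```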
Classical approximation examples Can get a CONSTANT TIME approximation for vertex cover on sparse graphs! Output y which is at most 2·OPT + εn How? Oracle reduction framework [Parnas Ron] – Construct an oracle that tells you whether node u is in a 2-approximate vertex cover – Use the oracle + standard sampling to estimate the size of the cover (sampling step sketched below) But how do you implement the oracle?
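First, the sampling step, as a hedged sketch assuming a hypothetical `in_cover_oracle(v)`; implementing that oracle is the subject of the next slides:

```python
import random

def estimate_vc_size(nodes, in_cover_oracle, eps):
    """Sample O(1/eps^2) nodes, ask the oracle whether each lies in the
    implicitly defined 2-approximate cover, and rescale. Standard
    concentration bounds give an estimate within an additive eps*n of
    the cover's size with high probability. Constants here are
    illustrative, not tuned."""
    s = int(4 / eps ** 2) + 1
    sample = [random.choice(nodes) for _ in range(s)]
    frac = sum(1 for v in sample if in_cover_oracle(v)) / s
    return frac * len(nodes)
```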
Implementing the oracle – two approaches: – Sequentially simulate the computations of a local distributed algorithm [Parnas Ron] – Figure out what the greedy maximal matching algorithm would do on u [Nguyen Onak]
Implementing the Oracle via Greedy To decide if edge e is in the matching: Must know whether the adjacent edges that come before e in the ordering are in the matching Do not need to know anything about edges coming after Arbitrary edge orderings can have long dependency chains! [Figure: a path of edges whose ranks increase along the path, so deciding one edge means tracing back to the beginning; odd or even number of steps from the start?]
Breaking long dependency chains [Nguyen Onak] Assign random ordering to edges Greedy works under any ordering Important fact: random order has short dependency chains
Better Complexity for VC Always recurse on the least-ranked adjacent edge first (heuristic suggested by [Nguyen Onak]) Yields time poly in the degree [Yoshida Yamamoto Ito] Additional ideas yield query complexity nearly linear in the average degree for general graphs [Onak Ron Rosen R.]
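A sketch of the resulting matching oracle, combining the [Nguyen Onak] random ranks with the least-rank-first recursion; `adjacent_edges(e)` and the `rank` map are assumed interfaces, not a fixed API:

```python
def make_matching_oracle(adjacent_edges, rank):
    """Edge e is in the greedy maximal matching (w.r.t. the random
    ranks) iff no lower-ranked adjacent edge is in the matching.
    Recursion only descends to strictly smaller ranks, so it
    terminates, and under random ranks the expected number of
    recursive calls stays small."""
    memo = {}
    def in_matching(e):
        if e not in memo:
            lower = sorted((f for f in adjacent_edges(e) if rank[f] < rank[e]),
                           key=lambda f: rank[f])  # least rank first
            memo[e] = not any(in_matching(f) for f in lower)
        return memo[e]
    return in_matching

# The vertex-cover oracle then answers: is some edge incident to u matched?
```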
Further work More complicated arguments for maximum matching, set cover, positive LP… [Parnas Ron + Kuhn Moscibroda Wattenhofer] [Nguyen Onak] [Yoshida Yamamoto Ito] Even better results for hyperfinite graphs [Hassidim Kelner Nguyen Onak][Newman Sohler] e.g., planar Can dependence be made poly in average degree?
When we don't need to see all the output… do we need to see all the input?
Locally (list-)decodable codes [Sudan-Trevisan-Vadhan, Katz-Trevisan,…] [Diagram: the input is the encoding of a large message; the decoding algorithm answers "What is the i-th bit?"] Can design codes so that few queries are needed to compute the answer!!!
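As a concrete classical instance, the Hadamard code is locally decodable with 2 queries per vote; a sketch, assuming `codeword_query(x)` returns the (possibly corrupted) codeword bit at position x, given as a k-bit integer:

```python
import random

def hadamard_decode_bit(codeword_query, k, i, trials=9):
    """The message m in {0,1}^k is encoded by all parities <m, x>.
    For a random r, C(r) XOR C(r XOR e_i) equals m_i whenever both
    queried positions are uncorrupted, so a majority vote over a few
    random r's recovers m_i despite a small fraction of errors."""
    e_i = 1 << i  # indicator of coordinate i
    votes = sum(codeword_query(r) ^ codeword_query(r ^ e_i)
                for r in (random.getrandbits(k) for _ in range(trials)))
    return int(votes > trials // 2)
```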
Local decompression algorithms [Chandar Shah Wornell][Sadakane Grossi] [Gonzalez Navarro] [Ferragina Venturini] [Kreft Navarro] [Billie Landau Raman Sadakane Satti Weimann] [Dutta Levi Ron R.]… [Diagram: the input is a compression of large data; the decompression algorithm answers "What is the i-th bit?"] Design compression schemes that compress well, yet need only a few queries to compute the answer
Local Computation Algorithms (LCA) [R. Tamir Vardi Xie] [Diagram: the LCA probes individual bits x_j of the input x and, given queries i_1, i_2, …, returns the bits y_{i_1}, y_{i_2}, … of a legal output y]
Optimization problems [R. Tamir Vardi Xie] [Alon R. Vardi Xie] Maximal independent set Broadcast radio scheduling Hypergraph coloring K-CNF SAT Techniques as in oracle reduction framework, but oracle requirements are more stringent What more can be done in this model?
Two-phase plan: Phase 1: Compute a partial solution which decides most nodes – simulate local algorithms – can also simulate greedy Phase 2: Only small components are left… brute force works! – analysis similar to Beck/Alon – can also analyze via Galton-Watson
Is the lottery unfair? From Hitlotto.com: "Lottery experts agree, past number histories can be the key to predicting future winners."
True Story! Polish lottery Multilotek: choose 20 distinct numbers uniformly at random out of 1 to 80. The initial machine was biased: e.g., the probability of drawing 50-59 was too small. Past results: http://serwis.lotto.pl:8080/archiwum/wyniki_wszystkie.php?id_gra=2
Thanks to Krzysztof Onak (pointer) and Eric Price (graph)
New Jersey Pick 3,4 Lottery New Jersey Pick k (k=3,4) Lottery: pick k digits in order; 10^k possible values. Assume lottery draws are i.i.d. Data: Pick 3 – 8522 results from 5/22/75 to 10/15/00; χ²-test gives 42% confidence. Pick 4 – 6544 results from 9/1/77 to 10/15/00; fewer results than possible values; χ²-test gives no confidence
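A sketch of the χ² check described above (the exact statistic behind the original numbers is not specified here; this is the standard goodness-of-fit version, using SciPy):

```python
from collections import Counter
from scipy.stats import chisquare  # chi-squared goodness-of-fit test

def lottery_chi2(draws, k):
    """Compare observed counts of each of the 10^k Pick-k outcomes
    against the uniform expectation. Note the Pick 4 caveat from the
    slide: with only ~6544 draws over 10^4 outcomes, expected counts
    fall below 1 and the test is uninformative."""
    counts = Counter(draws)
    observed = [counts.get(v, 0) for v in range(10 ** k)]
    return chisquare(observed)  # uniform null hypothesis by default

# Usage: stat, pvalue = lottery_chi2(list_of_pick3_draws, k=3)
```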
Distributions on BIG domains Given samples of a distribution, need to know, e.g.: entropy number of distinct elements shape (monotone, bimodal,…) closeness to uniform, Gaussian, Zipfian… No assumptions on the shape of the distribution, i.e., no smoothness, monotonicity, Normality,… Considered in statistics, information theory, machine learning, databases, algorithms, physics, biology,…
Key Question How many samples do you need in terms of domain size? Do you need to estimate the probabilities of each domain item? Can sample complexity be sublinear in size of the domain? Rules out standard statistical techniques, learning distribution
Our Aim: Algorithms with sublinear sample complexity
Similarities of distributions Are p and q close or far? p is given via samples q is either – known to the tester (e.g., uniform), or – given via samples
Testing uniformity [GR, Batu et al.] Upper bound: Estimate the collision probability and use the known relation between the L1 and L2 norms Issues: – Collision probability of uniform is 1/n – Use O(√n) samples via recycling of sample pairs (see the sketch below) Comment: [P] uses a different estimator Easy lower bound: Ω(n^{1/2}) Can get Ω(n^{1/2}/ε²) [P]
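A minimal sketch of the collision-based uniformity test; the threshold constant is illustrative rather than the tuned one from the papers:

```python
from itertools import combinations

def collision_uniformity_test(samples, n, eps):
    """The collision probability ||p||_2^2 equals 1/n exactly when p is
    uniform, and is at least (1 + eps^2)/n when ||p - U||_1 > eps (by
    Cauchy-Schwarz). Estimate it by the fraction of colliding sample
    pairs; all ~s^2 pairs are 'recycled', so s = O(sqrt(n)) samples
    suffice to see collisions at all."""
    s = len(samples)
    collisions = sum(1 for a, b in combinations(samples, 2) if a == b)
    estimate = collisions / (s * (s - 1) / 2)
    return estimate <= (1 + eps ** 2 / 2) / n  # True = consistent with uniform
```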
Is p uniform? [Diagram: samples of p → Test → Pass/Fail?] Theorem ([Goldreich Ron][Batu Fortnow R. Smith White][Paninski]): The sample complexity of distinguishing p = U from ||p - U||_1 > ε is Θ(n^{1/2}) Nearly the same complexity to test whether p is any known distribution [Batu Fischer Fortnow Kumar R. White]: testing identity
Testing identity via testing uniformity on subdomains: (Relabel the domain so that q is monotone) Partition the domain into O(log n) groups so that each group is almost flat: within a group, q-probabilities differ by less than a (1+ε) multiplicative factor, so q is close to uniform over each group Test: – that p is close to uniform over each group – that p assigns approximately the correct total weight to each group (q is known; see the bucketing sketch below)
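A sketch of the bucketing step, assuming q is given as an array of probabilities; the mass threshold is illustrative:

```python
import math

def bucket_domain(q, eps):
    """Sort the domain by q-probability ('relabel so q is monotone'),
    set aside elements of negligible mass, and group elements whose
    probabilities are within a (1+eps) multiplicative factor, so q is
    nearly flat on each group. Produces O(log n / eps) groups."""
    n = len(q)
    order = sorted(range(n), key=lambda i: q[i])
    buckets, current, lo = [], [], None
    for i in order:
        if q[i] < 1.0 / (n * math.log(n + 2)):
            continue  # negligible total mass: handled separately
        if lo is None or q[i] <= lo * (1 + eps):
            current.append(i)           # still within (1+eps) of the
            lo = q[i] if lo is None else lo   # bucket's minimum
        else:
            buckets.append(current)
            current, lo = [i], q[i]     # start a new bucket
    if current:
        buckets.append(current)
    return buckets
```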
Testing closeness of two distributions: trend change? [Two panels: transactions of 20-30 yr olds vs. transactions of 30-40 yr olds]
Testing closeness Theorem ([BFRSW] [P. Valiant]): The sample complexity of distinguishing p = q from ||p - q||_1 > ε is Θ̃(n^{2/3}) [Diagram: samples of p and q → Test → Pass/Fail?]
Why so different? Collision statistics are all that matter Collisions on heavy elements can hide the collision statistics of the rest of the domain Construct pairs of distributions where the heavy elements are identical, but the light elements are either identical or very different
Additively estimate distance? To output ||p - q||_1 ± ε, need Θ(n/log n) samples [G. Valiant P. Valiant]
Collisions tell all Algorithms: – Use collisions to detect wrong behavior; e.g., too many collisions implies far from uniform [GR, BFRSW] – Use linear programming to determine if there is a distribution with the right collision probabilities and the right property [G. Valiant P. Valiant] Lower bounds: – For symmetric properties, collision statistics are the only relevant information [BFRSW] (see also [Orlitsky Santhanam Zhang] [Orlitsky Santhanam Viswanathan Zhang]) – Need new analysis tools, since collision counts are not independent: a central limit theorem for generalized multinomial distributions [G. Valiant P. Valiant]
Information theoretic quantities Entropy Support size
Information in neural spike trains [Strong, Koberle, de Ruyter van Steveninck, Bialek 98] Each application of a stimulus gives a sample of the signal (spike train) Entropy of the (discretized) signal indicates which neurons respond to stimuli [Figure: neural signals over time]
Can we get multiplicative approximations? In general, no…. distributions with entropy near 0 are hard to distinguish What if the entropy is bigger? Can γ-multiplicatively approximate the entropy with Õ(n^{1/γ²}) samples (when the entropy is > 2γ/ε) [Batu Dasgupta R. Kumar] requires Ω(n^{1/γ²}) [Valiant] better bounds when the support size is small [Brautbar Samorodnitsky] Similar bounds for estimating support size [Raskhodnikova Ron R. Smith] [Raskhodnikova Ron Shpilka Smith]
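For contrast, here is the naive plug-in estimator that the bounds above improve on; its empirical frequencies only become meaningful once the sample size approaches the domain size:

```python
from collections import Counter
from math import log2

def plugin_entropy(samples):
    """Empirical (plug-in) entropy: estimate each probability by its
    sample frequency and evaluate the entropy formula. Simple, but it
    needs roughly n samples on a domain of size n, which is exactly
    the cost the sublinear results above avoid."""
    s = len(samples)
    return -sum((c / s) * log2(c / s) for c in Counter(samples).values())
```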
Additive approximations for entropy and support size need Θ(n/log n) samples [Raskhodnikova Ron Shpilka Smith] [Valiant Valiant]
Estimating Compressibility of Data [Raskhodnikova Ron R. Smith] The general question is undecidable; studied for specific schemes: Run-length encoding Huffman coding Entropy Lempel-Ziv "Color number" = number of elements with probability at least 1/n (support size!)
Testing Independence: Shopping patterns: Independent of zip code?
Independence of pairs p is a joint distribution on pairs from [n] x [m] (wlog n ≥ m) Marginal distributions p1, p2 p is independent if p = p1 x p2, that is, p(a,b) = p1(a)·p2(b) for all (a,b) [Figure: an n x m grid of joint probabilities]
Theorem [Batu Fischer Fortnow Kumar R. White]: There exists an algorithm for testing independence with sample complexity O(n^{2/3} m^{1/3} poly(log n)) s.t. If p = p1 x p2, it outputs PASS If ||p - q||_1 > ε for every independent q, it outputs FAIL Ω(n^{2/3} m^{1/3}) samples are required [Levi Ron R.]
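For intuition, a naive plug-in sketch of the question (not the sublinear [Batu Fischer Fortnow Kumar R. White] algorithm, which buckets heavy/light elements and reuses the closeness tester):

```python
from collections import Counter

def empirical_independence_gap(samples):
    """Compare the empirical joint distribution to the product of its
    empirical marginals in L1 distance, over the observed marginal
    supports. A small value is consistent with independence; doing
    this with sublinearly many samples in n*m is the content of the
    theorem above."""
    s = len(samples)
    joint = Counter(samples)                 # samples are (a, b) pairs
    m1 = Counter(a for a, _ in samples)      # empirical first marginal
    m2 = Counter(b for _, b in samples)      # empirical second marginal
    return sum(abs(joint.get((a, b), 0) / s - (m1[a] / s) * (m2[b] / s))
               for a in m1 for b in m2)
```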
More properties: Limited Independence: [Alon Andoni Kaufman Matulef R. Xie] [Haviv Langberg] K-flat distributions [Levi Indyk R.] K-modal distributions [Daskalakis Diakonikolas Servedio] Poisson Binomial Distributions [Daskalakis Diakonikolas Servedio] Monotonicity over general posets [Batu Kumar R.] [Bhattacharyya Fischer R. P. Valiant] Properties of multiple distributions [Levi Ron R.]
Many other properties to consider! Higher dimensional flat distributions Mixtures of k Gaussians Junta-distributions …
What about joint properties of many distributions?
Some questions (and answers): Are they all equal? Can they be clustered into k groups of similar distributions? Do they all have the same mean? See [Levi Ron R. 2011, Levi Ron R. 2012]
Getting past the lower bounds Special distributions: e.g., uniform on a subset, monotone Other query models: queries to probabilities of elements Other distance measures [Guha McGregor Venkatasubramanian] Competitive classification/closeness testing – compare to the best symmetric test [Acharya Das Jafarpour Orlitsky Pan Suresh] [Acharya Das Jafarpour Orlitsky Pan]
More open directions Other properties? Non-iid samples?
Conclusion: For many problems, we need far less time and far fewer samples than one might think! Many cool ideas and techniques have been developed Lots more to do!