Group Testing and Coding Theory
Or, A Theoretical Computer Scientist’s (Biased) View of Group Testing
Atri Rudra, University at Buffalo, SUNY

Slide 1: Group testing overview
Test a soldier for a disease. WWII example: syphilis.

Slide 2: Group testing overview
Test an army for a disease. WWII example: syphilis.
What if only one soldier has the disease? Can we do better?

Slide 3: Communicating with my 2 year old
“Code” C: “Akash English”. C(x) is a “codeword”.
Diagram: x is encoded as C(x); y = C(x) + error is received; the receiver outputs x or gives up.

Slide 4: The setup
Mapping C: error-correcting code, or just code.
Encoding: x → C(x). Decoding: y → x, where y = C(x) + error; the decoder outputs x or gives up.
C(x) is a codeword.

Slide 5: The fundamental tradeoff
Correct as many errors as possible with as little redundancy as possible.
Can one achieve the “optimal” tradeoff with efficient encoding and decoding?

Slide 6: The main message: Coding Theory, Group Testing

Asymptotic view
Functions compared: n!, 10n^2, n^2.

O(·) notation
≤ is O with glasses.
poly(n) is O(n^c) for some fixed c.

Slide 9: Group testing overview
Test an army for a disease. WWII example: syphilis.
What if only one soldier has the disease?
Can pool blood samples and check if at least one soldier has the disease.

Slide 10: Group testing
Set of items: (unknown) vector x in {0,1}^n.
At most d positives: |x| ≤ d.
Tests: each test is a subset S of {1,..,n}.
Result of a test: OR of the x_i's such that i is in S.
Goal 1: Figure out x. Goal 2: Minimize the number of tests t.
Non-adaptive tests: all tests are fixed a priori, so they form a t × n binary matrix (rows are tests 1..t, columns are items 1..n).
t = O(d^2 log n) is possible.
Tons of applications.
Output: the positive items.
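A minimal sketch of the model on this slide, assuming 0-indexed items and tests stored as Python sets. The names `run_tests`, `tests`, and `positives` are illustrative and do not come from the talk.

```python
from typing import List, Set

def run_tests(tests: List[Set[int]], positives: Set[int]) -> List[int]:
    """Return r, where r[j] = 1 iff test j contains at least one positive item."""
    return [1 if test & positives else 0 for test in tests]

# Hypothetical toy instance: n = 6 items, t = 3 pooled tests, one positive.
tests = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]
positives = {3}          # the unknown vector x, written as a set of indices
results = run_tests(tests, positives)
print(results)
```

Each result is the OR of the entries of x indexed by the test, so only the second test, which contains item 3, comes back positive.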

Slide 11: The decoding step
M × x = r, where the t × n test matrix M is to be designed, x = (x_1, ..., x_n) is unknown, and r = (r_1, ..., r_t) is observed.
How fast can this step be done?

Slide 12: An application: heavy hitters
Stream items are numbers in the range {1,…,n}.
Output all items that occur at least a 1/d fraction of the time.
One pass, poly log space, poly log update, poly log report time.

Slide 13: Cormode-Muthukrishnan idea
Use group testing: maintain a counter c_i for each test i (the rows of the t × n test matrix M).
Heavy tail property: total frequency of non-heavy items < 1/d.
Maintain the count of items in each test, and the total count m.
r_i = 1 iff c_i ≥ m/d.
x_j = 1 iff j is a heavy item (|x| ≤ d).
r = M × x.
Reporting the heavy items is just decoding!
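The counter idea can be sketched as follows. This is an illustration under assumptions, not the Cormode-Muthukrishnan implementation: the class name `PooledCounters` and the toy tests are hypothetical, and the update loop touches every test for simplicity rather than achieving poly log update time.

```python
from typing import List, Set

class PooledCounters:
    """One counter per test, plus the total stream length m."""

    def __init__(self, tests: List[Set[int]], d: int) -> None:
        self.tests = tests
        self.d = d
        self.counts = [0] * len(tests)
        self.total = 0

    def update(self, item: int) -> None:
        # Count the item once in the total and once in every test containing it.
        self.total += 1
        for j, test in enumerate(self.tests):
            if item in test:
                self.counts[j] += 1

    def results(self) -> List[int]:
        # r_j = 1 iff c_j >= m/d, written as c_j * d >= m to avoid division.
        return [1 if self.total > 0 and c * self.d >= self.total else 0
                for c in self.counts]

# Hypothetical stream over items {0,1,2,3}: item 2 is the only heavy item for d = 2.
sketch = PooledCounters([{0, 1}, {2, 3}, {0, 2}, {1, 3}], d=2)
for item in [2] * 6 + [0, 1, 3, 1]:
    sketch.update(item)
print(sketch.results())
```

In this stream the non-heavy items make up 4 of 10 occurrences, which is below 1/d, so the thresholded counters equal the group testing results for x with only item 2 positive.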

Slide 14: Requirements from group testing
Non-adaptiveness is crucial.
Minimize t (space).
Strongly explicit matrix.
Minimize decoding time (report time).

Slide 15: An overview of results
# tests (t)       Decoding time   Reference
O(d^2 log n)      poly(t)         [INR10, NPR11]
O(d^2 log n)      O(nt)           [DR82], [PR08]
O(d^4 log n)      O(t)            [GI04]
O(d^2 log^2 n)    poly(t)         [GI04, implicit]
Annotations: d is O(log n); big savings.

Slide 16: Tackling the first row
# tests (t)       Decoding time   Reference
O(d^2 log n)      poly(t)         [INR10, NPR11]
O(d^2 log n)      O(nt)           [DR82], [PR08]
O(d^4 log n)      O(t)            [GI04]
O(d^2 log^2 n)    poly(t)         [GI04, implicit]

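Slide 17: d-disjunct matrices
Sufficient condition for group testing.
For every subset of d columns and every column disjoint from it (not in the subset), there exists a row where the disjoint column has a 1 and all d columns have a 0.
With the d columns as the set of positives, that test result is 0: every non-positive column has one 0 test result.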

Slide 18: Naïve decoder for d-disjunct matrices
If r_j = 0 then for every column i that is in test j, set x_i = 0.
If x_i = 1 then all tests column i participates in will have a 1.
O(nt) time over all n columns; O(Lt) time when run on only L columns.
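The two rules of the naïve decoder translate directly into a double loop. The sketch below assumes the matrix is a list of 0/1 rows; the 4 × 6 example matrix, whose columns are the 2-element subsets of the 4 rows, is a hypothetical 1-disjunct instance chosen for illustration.

```python
from typing import List

def naive_decode(matrix: List[List[int]], results: List[int]) -> List[int]:
    """Start with every item marked positive; any test with result 0 clears its items."""
    t, n = len(matrix), len(matrix[0])
    x = [1] * n
    for j in range(t):
        if results[j] == 0:
            for i in range(n):
                if matrix[j][i] == 1:
                    x[i] = 0
    return x

# Columns are the pairs (0,1),(0,2),(0,3),(1,2),(1,3),(2,3) of row indices.
matrix = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
results = [0, 1, 1, 0]   # outcome when only column 3 (rows 1 and 2) is positive
print(naive_decode(matrix, results))
```

The loops visit every entry of the t × n matrix once in the worst case, which is the O(nt) bound on the slide.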

Slide 19: What is known
Naïve decoding of a d-disjunct matrix takes O(nt) time.
Strongly explicit d-disjunct matrix with t = O(d^2 log^2 n) [Kautz-Singleton 1964].
Randomized d-disjunct matrix with t = O(d^2 log n) [Dyachkov-Rykov 1982].
Deterministic d-disjunct matrix with t = O(d^2 log n) [Porat-Rothschild 2008].
Lower bound of Ω(d^2 log n/log d) [Dyachkov-Rykov 1982].

Slide 20: Up next
# tests (t)       Decoding time   Reference
O(d^2 log n)      poly(t)         [INR10, NPR11]
O(d^2 log n)      O(nt)           [DR82], [PR08]
O(d^4 log n)      O(t)            [GI04]
O(d^2 log^2 n)    poly(t)         [GI04, implicit]

Slide 21: Error-correcting codes
Mapping C : Σ^k → Σ^m.
Dimension k, block length m, m ≥ k.
Rate R = k/m ≤ 1.
Decoding time complexity: efficient means polynomial in m.
Diagram: x is encoded as C(x); y is received; the decoder outputs x or gives up.

Slide 22: Noise model
Errors are worst case (Hamming): worst-case error locations, arbitrary symbol changes.
Limit on total number of errors.

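Slide 23: Hamming’s 60 yr old observation
Large “distance” is good: if every two codewords are at distance ≥ D, then the balls of radius D/2 around the codewords do not overlap, so fewer than D/2 errors can be corrected uniquely.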

Slide 24: All you need to remember about Reed-Solomon codes, Part I
q is a prime power.
There are q^{q/(d+1)} vectors from [q]^q where every two agree in < q/(d+1) positions.

Slide 25: How do we get binary codes?
Concatenation of codes [Forney 66].
C_1 : ({0,1}^k)^K → ({0,1}^k)^M (outer code).
C_2 : {0,1}^k → {0,1}^m (inner code).
C_1 ∘ C_2 : {0,1}^{kK} → {0,1}^{mM}.
Typically k = O(log M).
Encoding: x = (x_1, x_2, ..., x_K) is mapped to C_1(x) = (w_1, w_2, ..., w_M), and then C_1 ∘ C_2(x) = (C_2(w_1), C_2(w_2), ..., C_2(w_M)).

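Slide 26: Disjunct matrices from RS codes
d-disjunct matrix via code concatenation [Kautz, Singleton].
n = q^{q/(d+1)} columns; column i gets the ith codeword.
Each of the q codeword symbols is replaced by a block of q rows: symbol x becomes the 0/1 vector with a single 1 in position x.
t = q^2 = O(d^2 log^2 n).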
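A small sketch of this construction, under stated assumptions: q is taken to be prime so that arithmetic mod q is a field, the evaluation points are 0..q-1, and the parameter `k` plays the role of q/(d+1) on the slide. The function names are illustrative only.

```python
from itertools import product
from typing import List, Tuple

def rs_codewords(q: int, k: int) -> List[Tuple[int, ...]]:
    """All q**k codewords: polynomials of degree < k over Z_q (q prime) evaluated at 0..q-1."""
    words = []
    for coeffs in product(range(q), repeat=k):
        words.append(tuple(
            sum(c * pow(a, e, q) for e, c in enumerate(coeffs)) % q
            for a in range(q)))
    return words

def kautz_singleton(q: int, k: int) -> List[List[int]]:
    """(q*q) x (q**k) binary matrix: symbol s in position p sets row p*q + s of that column."""
    words = rs_codewords(q, k)
    matrix = [[0] * len(words) for _ in range(q * q)]
    for col, word in enumerate(words):
        for pos, sym in enumerate(word):
            matrix[pos * q + sym][col] = 1
    return matrix

M = kautz_singleton(3, 2)
for row in M:
    print("".join(map(str, row)))
```

With q = 3 and k = 2 this gives a 9 × 9 matrix in which every column has exactly three 1s and any two columns share at most one row, because two distinct polynomials of degree below 2 agree in at most one point.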

Slide 27: A q=3 example
Codewords, one per column:
0 0 0 1 1 1 2 2 2
0 1 2 1 2 0 2 0 1
Inner map: 0 → 100, 1 → 010, 2 → 001.
Binary blocks, one per codeword symbol:
100 100 100 010 010 010 001 001 001
100 010 001 010 001 100 001 100 010

Slide 28: Agreement between two columns
Same q=3 matrix as on slide 27: two columns agree in ≤ 1 position.
Agreement in binary = agreement among RS codewords < q/(d+1).

Slide 29: d-disjunct matrices
Sufficient condition for group testing.
For every subset of d columns (the set of positives) and every column disjoint from it, there exists a row where the disjoint column has a 1 and all d columns have a 0.

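Slide 30: d-disjunctness of Kautz-Singleton
A column agrees with each of the d columns in < q/(d+1) positions, so with all d of them together in < q·d/(d+1) positions.
That leaves > q − q·d/(d+1) > 0 positions, and in each of them the column has a 1 in a row where all d columns have a 0.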
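The counting argument on this slide can be written out as follows. The notation is an assumption made for this note: c is the codeword of the extra column, c^{(1)}, ..., c^{(d)} are the codewords of the d columns, and agr(·,·) counts positions where two codewords carry the same symbol.

```latex
\[
\bigl|\{\, p : c_p \in \{c^{(1)}_p,\dots,c^{(d)}_p\} \,\}\bigr|
\;\le\; \sum_{i=1}^{d} \mathrm{agr}\bigl(c, c^{(i)}\bigr)
\;<\; d \cdot \frac{q}{d+1},
\]
\[
\#\{\, p : c_p \ne c^{(i)}_p \text{ for all } i \,\}
\;>\; q - \frac{q\,d}{d+1} \;=\; \frac{q}{d+1} \;>\; 0 .
\]
```

In any such position p, the row indexed by (p, c_p) has a 1 in the column of c and a 0 in all d other columns, which is exactly the d-disjunct condition.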

Slide 31: Up next
# tests (t)       Decoding time   Reference
O(d^2 log n)      poly(t)         [INR10, NPR11]
O(d^2 log n)      O(nt)           [DR82], [PR08]
O(d^4 log n)      O(t)            [GI04]
O(d^2 log^2 n)    poly(t)         [GI04, implicit]

Slide 32: The basic idea
M × x = r: x = (x_1, ..., x_n) is unknown and r = (r_1, ..., r_t) is observed.
Every column is a codeword.
Show that group testing decoding is the same as “decoding” the code.
n = # codewords = exp(m), t = poly(m).

Slide 33: Decoding
C(x) sent, y received; x ∈ Σ^k, y ∈ Σ^m; R = k/m.
How much of y must be correct to recover x?
At least k symbols must be correct, so at most an (m−k)/m = 1−R fraction of errors.
1−R is the information-theoretic limit.
ρ: the fraction of errors the decoder can handle.
The information-theoretic limit implies ρ ≤ 1−R.

Slide 34: Can we get to the limit, ρ ≈ 1−R?
Not if we always want to uniquely recover the original message.
Limit for unique decoding: ρ ≤ (1−R)/2.
Picture: two codewords c_1 and c_2 and a received word r, with labels R, (1−R)/2 and 1−R.
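One standard way to see where the (1−R)/2 limit comes from is the Singleton bound, which is not stated on the slide and is brought in here as background. Writing D for the minimum distance of a code with dimension k and block length m:

```latex
\[
D \;\le\; m - k + 1
\quad\Longrightarrow\quad
\Bigl\lfloor \frac{D-1}{2} \Bigr\rfloor \;\le\; \frac{m-k}{2},
\qquad
\rho_{\text{unique}} \;\le\; \frac{m-k}{2m} \;=\; \frac{1-R}{2}.
\]
```

A received word halfway between two codewords at distance D cannot be attributed to either one, so unique decoding stops at about half the distance, which is half of the information-theoretic limit 1−R.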

Slide 35: List decoding [Elias57, Wozencraft58]
Always insisting on a unique codeword is restrictive.
The “pathological” cases are rare: a “typical” received word can be decoded beyond (1−R)/2.
Almost all the space in higher dimension: all but an exponentially small (in m) fraction.
Better error-recovery model: list decoding, that is, output a list of answers.
Example: spell checker.

Slide 36: Information theoretic limit
ρ < 1 − R: information-theoretic limit. Can handle twice as many errors.
Plot: fraction of errors (ρ) against rate (R), with curves for unique decoding and the information-theoretic limit.
Achievable by random codes. NOT ALGORITHMIC!

Slide 37: Other applications of list decoding
Cryptography: cryptanalysis of certain block-ciphers [Jakobsen98]; efficient traitor tracing scheme [Silverberg, Staddon, Walker 03].
Complexity theory: hardcore predicates from one way functions [Goldreich, Levin 89; Impagliazzo 97; Ta-Shma, Zuckerman 01]; worst-case vs. average-case hardness [Cai, Pavan, Sivakumar 99; Goldreich, Ron, Sudan 99; Sudan, Trevisan, Vadhan 99; Impagliazzo, Jaiswal, Kabanets 06].
Other algorithmic applications: IP traceback [Dean, Franklin, Stubblefield 01; Savage, Wetherall, Karlin, Anderson 00]; guessing secrets [Alon, Guruswami, Kaufman, Sudan 02; Chung, Graham, Leighton 01].

Slide 38: Algorithmic list decoding results
ρ ≈ 1 − R − ε for any ε > 0: folded RS codes [Guruswami, R. 06].
Plot: fraction of errors (ρ) against rate (R), with curves for unique decoding, Guruswami-Sudan 98, Parvaresh-Vardy 05, folded RS, and the information-theoretic limit.

Slide 39: Concatenated codes
Concatenation of codes [Forney 66].
C_1 : ({0,1}^k)^K → ({0,1}^k)^M (outer code).
C_2 : {0,1}^k → {0,1}^m (inner code).
C_1 ∘ C_2 : {0,1}^{kK} → {0,1}^{mM}.
Typically k = O(log M).
Encoding: x = (x_1, x_2, ..., x_K) is mapped to C_1(x) = (w_1, w_2, ..., w_M), and then C_1 ∘ C_2(x) = (C_2(w_1), C_2(w_2), ..., C_2(w_M)).
Brute force decoding for inner code.

Slide 40: List decoding C_1 ∘ C_2
The received blocks y_1, y_2, ..., y_M are in {0,1}^m; each block gives a list S_1, S_2, ..., S_M of symbols in {0,1}^k.
How do we “list decode” from lists?

Slide 41: List recovery
Input: lists S_1, S_2, S_3, ..., S_M, where each S_i is a subset of [q] with |S_i| ≤ d.
Output all codewords (c_1, c_2, c_3, ..., c_M) that agree with (all) the input lists, that is, c_i ∈ S_i for every i.

Slide 42: All you need to remember about (Reed-Solomon) codes, Part II
q is a prime power.
There are q^{q/(d+1)} vectors from [q]^q where every two agree in < q/(d+1) positions.
poly(q) time algorithm for list recovery: given lists S_1, ..., S_q, each a subset of [q] with |S_i| ≤ d, output all codewords (c_1, ..., c_q) that agree with all the input lists.

Slide 43: Back to the example
Same q=3 matrix as on slide 27.
Result vector, in blocks: 101 001 011.
Lists of symbols: {1,2}, {2}, {0,2}.
Output: the positive items.
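To make the step from a result vector to lists to codewords concrete, here is a brute-force sketch. It assumes a q = 3 code with polynomials of degree below 2 evaluated at all three points 0, 1, 2 (the extracted slide does not confirm the number of evaluation points), and it enumerates all codewords rather than using the poly(q) list recovery algorithm the talk refers to.

```python
from itertools import product
from typing import List, Set, Tuple

Q, K = 3, 2   # assumed parameters: arithmetic mod 3, polynomials of degree < 2

def codeword(coeffs: Tuple[int, ...]) -> Tuple[int, ...]:
    """Evaluate the polynomial with the given coefficients at 0..Q-1, mod Q."""
    return tuple(sum(c * pow(a, e, Q) for e, c in enumerate(coeffs)) % Q
                 for a in range(Q))

def lists_from_results(results: List[int]) -> List[Set[int]]:
    """Block p of the result vector becomes the set of symbols seen in position p."""
    return [{s for s in range(Q) if results[p * Q + s] == 1} for p in range(Q)]

def list_recover(lists: List[Set[int]]) -> List[Tuple[int, ...]]:
    """Brute force: keep every codeword whose symbol in each position lies in that list."""
    return [w for w in map(codeword, product(range(Q), repeat=K))
            if all(w[p] in lists[p] for p in range(Q))]

results = [1, 0, 1,  0, 0, 1,  0, 1, 1]   # blocks 101, 001, 011
lists = lists_from_results(results)
found = list_recover(lists)
print(lists, found)
```

Under these assumptions the blocks 101, 001, 011 give the lists {0,2}, {2}, {1,2}, and exactly two codewords agree with all three lists.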

Slide 44: All you ever needed to know about (Reed-Solomon) codes… at least for this talk
q is a prime power.
There are q^{q/(d+1)} vectors from [q]^q where every two agree in < q/(d+1) positions.
poly(q) time algorithm for list recovery: given lists S_1, ..., S_q, each a subset of [q] with |S_i| ≤ d, output all codewords (c_1, ..., c_q) that agree with all the input lists.

Slide 45: What does this imply?
KS matrix with t = O(d^2 log^2 n) and results r_1, ..., r_t.
poly(t) time step: leaves d^2 columns.
O(d^2 t) time step: leaves the set of positives (d columns).
Implicit in [Guruswami-Indyk 04].

Slide 46: Up next
# tests (t)       Decoding time   Reference
O(d^2 log n)      poly(t)         [INR10, NPR11]
O(d^2 log n)      O(nt)           [DR82], [PR08]
O(d^4 log n)      O(t)            [GI04]
O(d^2 log^2 n)    poly(t)         [GI04, implicit]

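Slide 47: Filter-evaluate decoding paradigm
Filter: a “filtering” matrix with results y_1, ..., y_t' is decoded in poly(t') time and leaves L columns.
Evaluate: the naïve decoder for the d-disjunct matrix, with results r_1, ..., r_t, runs on those L columns in O(Lt) time and leaves the set of positives (d columns).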
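The evaluate half of this paradigm is the naïve decoder restricted to a candidate list. The sketch below assumes each column is stored as the set of tests it participates in, and it takes the candidate list as given; producing that list from the filtering matrix is the hard part and is not modelled here. All names and the example data are hypothetical.

```python
from typing import List, Set

def evaluate_candidates(candidates: List[int],
                        columns: List[Set[int]],
                        results: List[int]) -> List[int]:
    """Keep candidate i only if every test containing item i came back 1."""
    return [i for i in candidates
            if all(results[j] == 1 for j in columns[i])]

# columns[i] = set of tests containing item i (the 2-subsets of 4 tests).
columns = [{0, 1}, {0, 2}, {0, 3}, {1, 2}, {1, 3}, {2, 3}]
results = [0, 1, 1, 0]     # outcome when only item 3 is positive
candidates = [1, 3, 5]     # hypothetical output of the filtering step (L = 3)
print(evaluate_candidates(candidates, columns, results))
```

Each candidate is checked against at most t tests, so the work is O(Lt) instead of the O(nt) needed to scan all n columns.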

Slide 48: So all we need to do
o(d^2 log n/log d) tests [Indyk, Ngo, R. 10] [Ngo, Porat, R. 11].

Slide 49: Overview of the results
# tests (t)       Decoding time   Reference
O(d^2 log n)      poly(t)         [INR10, NPR11]
O(d^2 log n)      O(nt)           [DR82], [PR08]
O(d^4 log n)      O(t)            [GI04]
O(d^2 log^2 n)    poly(t)         [GI04, implicit]

Slide 50: The main message: Coding Theory, Group Testing

Slide 51: Open Questions
Close the gap between upper and lower bounds.
Strongly explicit construction of optimal disjunct matrices?
Other applications of group testing? Complexity theory?

Slide 52: More on Coding Theory
http://www.cse.buffalo.edu/~atri/courses/coding-theory/book/index.html

Slide 53: Questions?

Slide 54: The filtering matrix
New* object: (d,L)-list disjunct matrix.
With d columns as the set of positives, running the naïve decoder returns ≤ L bogus columns, so at most d+L columns in total.
Independently considered by [Cheraghchi 09].
(d,d)-list disjunct matrices exist with O(d log n) tests.

Slide 55: Reed-Solomon codes
Message: (x_0, x_1, …, x_{k-1}) ∈ F^k.
View as poly. f(Y) = x_0 + x_1 Y + … + x_{k-1} Y^{k-1}.
Encoding: RS(f) = (f(α_1), f(α_2), …, f(α_m)), where F = {α_1, α_2, …, α_m}.
Alphabet size is at least m.
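A short sketch of this encoding map, assuming the field is the integers mod a prime q and the evaluation points are 0, 1, ..., m-1. The function name `rs_encode` and these choices are illustrative, and prime powers that are not prime would need genuine finite field arithmetic instead.

```python
from typing import List

def rs_encode(message: List[int], q: int, m: int) -> List[int]:
    """Evaluate f(Y) = sum_i message[i] * Y**i at the points 0..m-1 over Z_q (q prime)."""
    if m > q:
        raise ValueError("need m <= q distinct evaluation points")

    def f(a: int) -> int:
        acc = 0
        for coeff in reversed(message):   # Horner's rule, highest coefficient first
            acc = (acc * a + coeff) % q
        return acc

    return [f(a) for a in range(m)]

# f(Y) = 1 + 2Y over Z_5, evaluated at 0..4.
print(rs_encode([1, 2], q=5, m=5))
```

The check `m <= q` mirrors the last line of the slide: the m evaluation points must be distinct field elements, so the alphabet has at least m symbols.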

Slide 56: Revisiting the decoding algorithm
Each of the q blocks of the result vector r is decoded with the naïve decoder for a d-disjunct inner matrix, giving a list S_j with |S_j| ≤ d for block j.
Works but hits a d^3 barrier.

Slide 57: Connection to List Recovery
Block j of the result vector r gives a list S_j with |S_j| ≤ d.
Decoding: output all codewords that match the test results.
List recover from S_1, …, S_q to get the positive codewords.

Slide 58: Revisiting the decoding algorithm-II
Use (d,d)-list disjunct inner matrices with the naïve decoder: now |S_j| ≤ 2d.
Need to change the parameters of the Reed-Solomon codes a bit.

Slide 59: http://www.impawards.com/2007/are_we_done_yet.html

Slide 60: How we get our hands on…
RS codeword with n ~ q^{q/d}; each of the q symbols is encoded by a (d,d)-list disjunct matrix with d log q rows.
t = q × (d log q) ~ (d × log n/log q) × (d log q) = d^2 log n.
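The parameter count on this slide follows from solving n ~ q^{q/d} for q and substituting, treating ~ as equality up to constant factors:

```latex
\[
n \sim q^{q/d}
\;\Longrightarrow\;
\log n \sim \frac{q}{d}\,\log q
\;\Longrightarrow\;
q \sim \frac{d \log n}{\log q},
\]
\[
t \;=\; q \cdot (d \log q)
\;\sim\; \frac{d \log n}{\log q} \cdot d \log q
\;=\; d^{2} \log n .
\]
```

The log q factors cancel, which is how the concatenated construction reaches d^2 log n tests rather than the d^2 log^2 n of the Kautz-Singleton matrix.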

Slide 61: Solution 1 [Indyk, Ngo, R. 10]
(d,d)-list disjunct inner matrices with d log q rows: pick “inner” codes at random.

Slide 62: Solution 2 [Ngo, Porat, R. 10]
(d,d)-list disjunct inner matrices with d log q rows: use explicit expanders!
Some comments:
Left degree of the expander not important.
d^{1+o(1)} log q rows possible [GUV 07, Cheraghchi 09].
Use PV codes instead of RS codes.
