# Recovering Data in Presence of Malicious Errors

Atri Rudra, University at Buffalo, SUNY


2 The setup
- A mapping $C$: an error-correcting code, or just a code.
- Encoding: $x \to C(x)$; $C(x)$ is a codeword.
- Decoding: $y \to x$, where $y = C(x) + \text{error}$ is the received word; the decoder may instead give up.

3 Codes are useful!
- Cellphones, satellite broadcast, deep-space communication, Internet, CDs/DVDs, RAID, ECC memory, paper bar-codes.

4 Redundancy vs. error correction
- Repetition code: repeat every bit, say, 100 times. Good error-correcting properties, but too much redundancy.
- Parity code: add a single parity bit. Minimum amount of redundancy, but bad error-correcting properties: two errors go completely undetected.
- Neither of these codes is satisfactory.
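
The two toy codes on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the repetition factor is 3 rather than the slide's 100 to keep the example small.

```python
def repetition_encode(bits, times=3):
    """Repeat every bit `times` times."""
    return [b for b in bits for _ in range(times)]


def repetition_decode(received, times=3):
    """Decode by majority vote inside each block of `times` copies."""
    blocks = [received[i:i + times] for i in range(0, len(received), times)]
    return [1 if 2 * sum(block) > times else 0 for block in blocks]


def parity_encode(bits):
    """Append one bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_check(received):
    """Return True if the received word has an even number of 1s."""
    return sum(received) % 2 == 0


message = [1, 0, 1, 1]

# Repetition code: one flipped bit is corrected by the majority vote.
word = repetition_encode(message)
word[0] ^= 1
print(repetition_decode(word) == message)

# Parity code: one flip is detected, but two flips pass the check.
word = parity_encode(message)
word[0] ^= 1
print(parity_check(word))
word[1] ^= 1
print(parity_check(word))
```

The repetition code triples the length to fix a single flip per block, while the parity code adds one bit and cannot even detect a second flip, which is the tradeoff the slide points at.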

5 Two main challenges in coding theory
- Problem with the parity example: messages are mapped to codewords that do not differ in many places.
- Need to pick a lot of codewords that differ a lot from each other.
- Efficient decoding: the naive algorithm checks the received word against all codewords.

6 The fundamental tradeoff
- Correct as many errors as possible with as little redundancy as possible.
- Can one achieve the “optimal” tradeoff with efficient encoding and decoding?
- This talk: the answer is yes.

7 Overview of the talk
- Specify the setup: the model; what is the optimal tradeoff?
- Previous work
- Construction of a “good” code
- High-level idea of why it works
- Future directions: some recent progress

8 Error-correcting codes
- Mapping $C : \Sigma^k \to \Sigma^n$, with message length $k$ and code length $n \ge k$.
- Rate $R = k/n \le 1$.
- Decoding complexity: efficient means polynomial in $n$.

9 Shannon’s world (Claude E. Shannon)
- Noise is probabilistic.
- Binary Symmetric Channel: every bit is flipped with probability $p$.
- Benign noise model: for example, it does not capture bursty errors.

10 Hamming’s world (Richard W. Hamming)
- Errors are worst case, both in the error locations and in the (arbitrary) symbol changes.
- The only limit is on the total number of errors.
- Much more powerful than Shannon’s model: it captures bursty errors.
- We will consider this channel model.

11 A “low level” view
- Think of each symbol in $\Sigma$ as a packet.
- The setup: the sender wants to send $k$ packets and, after encoding, sends $n$ packets. Some packets get corrupted, and the receiver needs to recover the original $k$ packets.
- Packet size: ideally constant, but it can grow with $n$.

12 Decoding
- $C(x)$ is sent and $y$ is received, with $x \in \Sigma^k$, $y \in \Sigma^n$ and $R = k/n$.
- How much of $y$ must be correct to recover $x$? At least $k$ packets must be correct, so at most an $(n-k)/n = 1-R$ fraction of errors can be tolerated.
- $1-R$ is the information-theoretic limit.
- $\rho$: the fraction of errors the decoder can handle. The information-theoretic limit implies $\rho \le 1-R$.

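13 Can we get to the limit $\rho = 1-R$?
- Not if we always want to uniquely recover the original message.
- Limit for unique decoding: $\rho \le (1-R)/2$.
- [Figure: codewords $c_1$ and $c_2$ at distance $1-R$ from each other, with the received word $y$ at distance $(1-R)/2$ from each.]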

14 List decoding [Elias 57, Wozencraft 58]
- Always insisting on a unique codeword is restrictive.
- The “pathological” cases are rare: a “typical” received word can be decoded beyond $(1-R)/2$.
- Better error-recovery model: output a list of answers (list decoding). Example: a spell checker.
- [Figure note, radius $(1-R)/2$: almost all the space in higher dimension, all but an exponentially small (in $n$) fraction.]

15 Advantages of list decoding
- Typical received words have a unique closest codeword; list decoding returns a list of size one for such received words.
- Still deals with worst-case errors.
- How to deal with list size greater than one? Declare an error, or use some side information (as a spell checker does).

16 The list decoding problem
- Given a code and an error parameter $\rho$: for any received word $y$, output all codewords $c$ such that $c$ and $y$ disagree in at most a $\rho$ fraction of places.
- Fundamental question: what is the best possible tradeoff between $R$ and $\rho$ with “small” lists? Can it approach the information-theoretic limit $1-R$?

17 Other applications of list decoding (slide footer: May 25, 2007, Ph.D. Final Exam)
- Cryptography
  - Cryptanalysis of certain block ciphers [Jakobsen 98]
  - Efficient traitor tracing scheme [Silverberg, Staddon, Walker 03]
- Complexity theory
  - Hardcore predicates from one-way functions [Goldreich, Levin 89; Impagliazzo 97; Ta-Shma, Zuckerman 01]
  - Worst-case vs. average-case hardness [Cai, Pavan, Sivakumar 99; Goldreich, Ron, Sudan 99; Sudan, Trevisan, Vadhan 99; Impagliazzo, Jaiswal, Kabanets 06]
- Other algorithmic applications
  - IP traceback [Dean, Franklin, Stubblefield 01; Savage, Wetherall, Karlin, Anderson 00]
  - Guessing secrets [Alon, Guruswami, Kaufman, Sudan 02; Chung, Graham, Leighton 01]

18 Overview of the talk
- Specify the setup: the model; the optimal tradeoff between rate and fraction of errors
- Previous work
- Construction of a “good” code
- High-level idea of why it works
- Future directions: some recent progress

19 Information-theoretic limit
- $\rho < 1 - R$ is the information-theoretic limit.
- It allows handling twice as many errors as unique decoding.
- [Plot: fraction of errors ($\rho$) vs. rate ($R$); curves: unique decoding, information-theoretic limit.]

20 Achieving the information-theoretic limit
- There exist codes that achieve the information-theoretic limit, $\rho \ge 1-R-o(1)$, by a random coding argument.
- Not a useful result: the codes are not explicit, and there are no efficient list decoding algorithms.
- Need an explicit construction of such codes.
- We also need polynomial-time (list) decodability, which requires the list size to be polynomial.

21 The challenge
- Explicit construction of code(s) with efficient list decoding algorithms up to the information-theoretic limit: for rate $R$, correct a $1-R$ fraction of errors.
- Shannon’s work raised a similar challenge: explicit codes achieving the information-theoretic limit for stochastic models. That challenge has been met [Forney 66; Luby-Mitzenmacher-Shokrollahi-Spielman 01; Richardson-Urbanke 01].
- Now for the stronger adversarial model.

22 Guruswami-Sudan
- The best until 1998: $\rho = 1 - R^{1/2}$, for Reed-Solomon codes [Sudan 95; Guruswami-Sudan 98].
- Better than unique decoding. At $R = 0.8$: unique decoding 10%, information-theoretic limit 20%, GS 10.56%.
- Motivating question: close the gap between the blue and green lines with explicit, efficient codes.
- [Plot: fraction of errors ($\rho$) vs. rate ($R$); labeled curves: unique decoding, information-theoretic limit.]
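
The three percentages quoted at $R = 0.8$ follow directly from the formulas on the slides. A small sketch to reproduce them (the function names are illustrative, not from the talk):

```python
import math


def unique_radius(rate):
    """Unique decoding limit: (1 - R) / 2."""
    return (1 - rate) / 2


def gs_radius(rate):
    """Guruswami-Sudan radius for Reed-Solomon codes: 1 - sqrt(R)."""
    return 1 - math.sqrt(rate)


def limit_radius(rate):
    """Information-theoretic limit: 1 - R."""
    return 1 - rate


R = 0.8
for name, radius in [("unique", unique_radius),
                     ("GS", gs_radius),
                     ("limit", limit_radius)]:
    print(f"{name}: {100 * radius(R):.2f}%")
# unique: 10.00%
# GS: 10.56%
# limit: 20.00%
```

At high rates the Guruswami-Sudan radius is only slightly above the unique decoding radius, which is the gap the motivating question refers to.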

23 The best until 2005
- $\rho = 1 - (sR)^{s/(s+1)}$ for any $s \ge 1$ [Parvaresh, Vardy]; $s = 2$ in the plot.
- Based on Reed-Solomon codes.
- Improves on GS for $R < 1/16$.
- [Plot: fraction of errors ($\rho$) vs. rate ($R$); curves: unique decoding, information-theoretic limit, Guruswami-Sudan, Parvaresh-Vardy.]
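
The threshold $R < 1/16$ can be checked numerically for $s = 2$ by comparing the two radius formulas from the slides. A minimal sketch, with illustrative function names:

```python
def gs_radius(rate):
    """Guruswami-Sudan: 1 - R^(1/2)."""
    return 1 - rate ** 0.5


def pv_radius(rate, s=2):
    """Parvaresh-Vardy: 1 - (sR)^(s/(s+1))."""
    return 1 - (s * rate) ** (s / (s + 1))


for rate in (0.05, 1 / 16, 0.1):
    print(rate, round(pv_radius(rate), 4), round(gs_radius(rate), 4))
```

For $s = 2$, the inequality $(2R)^{2/3} < R^{1/2}$ rearranges to $R^{1/6} < 2^{-2/3}$, that is $R < 1/16$, so the two curves cross exactly at rate $1/16$.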

24 Our result
- $\rho = 1 - R - \varepsilon$ for any $\varepsilon > 0$: folded RS codes [Guruswami, R. 06].
- [Plot: fraction of errors ($\rho$) vs. rate ($R$); curves: unique decoding, information-theoretic limit, Guruswami-Sudan, Parvaresh-Vardy, our work.]

25 Overview of the talk
- Specify the setup: the model; the optimal tradeoff between rate and fraction of errors
- Previous work
- Our construction
- High-level idea of why it works
- Future directions: recent progress

26 The main result
- Construction of an algebraic family of codes, based on Reed-Solomon codes.
- For every rate $R > 0$ and every $\varepsilon > 0$: a list decoding algorithm that can correct a $1 - R - \varepsilon$ fraction of errors.

27 Algebra terminology
- $F$ will denote a finite field; think of it as the integers mod some prime.
- Polynomials: coefficients come from $F$. Example of degree 3 over $\mathbb{Z}_7$: $f(X) = X^3 + 4X + 5$.
- Evaluate polynomials at points in $F$: $f(2) = (8 + 8 + 5) \bmod 7 = 21 \bmod 7 = 0$.
- Irreducible polynomials have no non-trivial polynomial factors: $X^2 + 1$ is irreducible over $\mathbb{Z}_7$, while $X^2 - 1$ is not.
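
The slide's arithmetic over $\mathbb{Z}_7$ can be reproduced with a short sketch. The helper names are hypothetical, and the irreducibility check relies on the standard fact that a polynomial of degree 2 or 3 over a field is irreducible exactly when it has no root in that field.

```python
def poly_eval(coeffs, x, p):
    """Evaluate a polynomial at x modulo the prime p (Horner's rule).

    coeffs[i] is the coefficient of X^i.
    """
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % p
    return result


def has_root(coeffs, p):
    """Return True if the polynomial has a root in Z_p."""
    return any(poly_eval(coeffs, x, p) == 0 for x in range(p))


p = 7
f = [5, 4, 0, 1]                 # f(X) = X^3 + 4X + 5
print(poly_eval(f, 2, p))        # 21 mod 7 = 0

print(has_root([1, 0, 1], p))    # X^2 + 1: no root, so irreducible over Z_7
print(has_root([6, 0, 1], p))    # X^2 - 1 (note -1 = 6 mod 7): has roots 1 and 6
```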

28 Reed-Solomon codes
- Message: $(m_0, m_1, \dots, m_{k-1}) \in F^k$, viewed as the polynomial $f(X) = m_0 + m_1 X + \dots + m_{k-1} X^{k-1}$.
- Encoding: $RS(f) = (f(\alpha_1), f(\alpha_2), \dots, f(\alpha_n))$, where $F = \{\alpha_1, \alpha_2, \dots, \alpha_n\}$.
- [Guruswami-Sudan] Can correct up to a $1 - (k/n)^{1/2}$ fraction of errors in polynomial time.
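
Reed-Solomon encoding as described here is just polynomial evaluation. The sketch below assumes a prime field $\mathbb{Z}_p$ and uses all $p$ field elements as evaluation points, so $n = p$; the function names are illustrative.

```python
def poly_eval(coeffs, x, p):
    """Evaluate a polynomial at x modulo the prime p; coeffs[i] multiplies X^i."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % p
    return result


def rs_encode(message, p):
    """Encode (m_0, ..., m_{k-1}) by evaluating f at every element of Z_p."""
    return [poly_eval(message, alpha, p) for alpha in range(p)]


p = 7
codeword = rs_encode([1, 2], p)   # f(X) = 1 + 2X, so k = 2 and n = 7
print(codeword)                   # [1, 3, 5, 0, 2, 4, 6]
```

Two distinct polynomials of degree less than $k$ agree on at most $k - 1$ points, so any two distinct codewords differ in at least $n - k + 1$ positions.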

29 Parvaresh-Vardy codes (of order 2)
- Codeword: position $i$ carries the pair $(f(\alpha_i), g(\alpha_i))$, for $i = 1, \dots, n$.
- $g(X) = f(X)^q \bmod E(X)$; the extra information from $g(X)$ helps in decoding.
- Rate: $R_{PV} = k/2n$.
- [PV05] PV codes can correct a $1 - (k/n)^{2/3} = 1 - (2R_{PV})^{2/3}$ fraction of errors in polynomial time.

30 Towards our solution
- Suppose $g(X) = f(X)^q \bmod E(X) = f(\gamma X)$.
- Let us look again at the PV codeword. [Figure: the PV codeword, with the row of $f$ values above the row of $g$ values; the symbol labels did not survive extraction.]

31 Folded Reed-Solomon codes
- Suppose $g(X) = f(X)^q \bmod E(X) = f(\gamma X)$.
- Don’t send the redundant symbols. This reduces the length to $n/2$, so $R = (k/2)/(n/2) = k/n$.
- Using the PV result, the fraction of errors corrected is $1 - (k/n)^{2/3} = 1 - R^{2/3}$.

32 Getting to $1 - R - \varepsilon$
- We started with the PV code with $s = 2$ to get $1 - R^{2/3}$.
- Start instead with the PV code with general $s$: $1 - R^{s/(s+1)}$.
- Pick $s$ “large” enough to approach $1 - R - \varepsilon$.
- Decoding complexity increases from that of Parvaresh-Vardy but is still polynomial.
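
To get a feel for how large $s$ must be, one can search for the smallest $s$ with $1 - R^{s/(s+1)} \ge 1 - R - \varepsilon$. This sketch uses only the simplified formula from this slide; the function names are illustrative, and the actual analysis imposes its own requirements on $s$.

```python
def folded_radius(rate, s):
    """Radius from the slide's simplified formula: 1 - R^(s/(s+1))."""
    return 1 - rate ** (s / (s + 1))


def smallest_s(rate, eps):
    """Smallest s with folded_radius(rate, s) >= 1 - rate - eps.

    Terminates for 0 < rate < 1 and eps > 0, because R^(s/(s+1))
    decreases towards R as s grows.
    """
    s = 1
    while folded_radius(rate, s) < 1 - rate - eps:
        s += 1
    return s


for eps in (0.1, 0.05, 0.01):
    print(eps, smallest_s(0.5, eps))
```

Smaller values of $\varepsilon$ force larger $s$, which is where the growth in decoding complexity mentioned on the slide comes from.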

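33 What we actually do
- We show that for any generator $\gamma \in F \setminus \{0\}$: $g(X) = f(X)^q \bmod E(X) = f(\gamma X)$.
- Can achieve similar compression by grouping elements in orbits of $\gamma$: $m' \sim n/m$, $R \sim (k/m)/(n/m) = k/n$.
- Folded codeword ($m'$ symbols, each a block of $m$ evaluations): $(f(1), f(\gamma), \dots, f(\gamma^{m-1}))$, $(f(\gamma^m), f(\gamma^{m+1}), \dots, f(\gamma^{2m-1}))$, $\dots$, $(f(\gamma^{(m'-1)m}), f(\gamma^{(m'-1)m+1}), \dots, f(\gamma^{mm'-1}))$.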
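
The folding operation itself is simple bookkeeping: evaluate $f$ along the orbit $1, \gamma, \gamma^2, \dots$ and bundle every $m$ consecutive evaluations into one symbol. A minimal sketch over a prime field, with hypothetical function names and the small parameters $p = 7$, $\gamma = 3$, $m = 2$ chosen only for illustration:

```python
def poly_eval(coeffs, x, p):
    """Evaluate a polynomial at x modulo the prime p; coeffs[i] multiplies X^i."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % p
    return result


def rs_encode_orbit(message, gamma, p):
    """Evaluate f at 1, gamma, gamma^2, ..., gamma^(p-2)."""
    points = [pow(gamma, i, p) for i in range(p - 1)]
    return [poly_eval(message, alpha, p) for alpha in points]


def fold(codeword, m):
    """Bundle every m consecutive symbols into one folded symbol."""
    if len(codeword) % m != 0:
        raise ValueError("codeword length must be divisible by m")
    return [tuple(codeword[i:i + m]) for i in range(0, len(codeword), m)]


p, gamma, m = 7, 3, 2                # 3 generates the nonzero elements of Z_7
codeword = rs_encode_orbit([1, 2], gamma, p)   # f(X) = 1 + 2X, n = 6
print(fold(codeword, m))             # [(3, 0), (5, 6), (2, 4)]
```

The folded word has $n/m = 3$ symbols carrying $k/m = 1$ message symbol's worth of information each, so the rate stays $k/n$ while the alphabet grows.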

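34 Proving $f(X)^q \bmod E(X) = f(\gamma X)$
- First use the fact that $f(X)^q = f(X^q)$ over $F$, so we need to show $f(X^q) \bmod E(X) = f(\gamma X)$.
- Proving $X^q \bmod E(X) = \gamma X$ suffices.
- Or, that $E(X)$ divides $X^{q-1} - \gamma$, since $X^q - \gamma X = X(X^{q-1} - \gamma)$.
- $E(X) = X^{q-1} - \gamma$ is irreducible.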
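
The identity can be checked by brute force on a tiny field. The sketch below assumes $F = \mathbb{Z}_q$ for a prime $q$ and $E(X) = X^{q-1} - \gamma$, and reduces modulo $E(X)$ with the rule $X^{q-1} \equiv \gamma$. All function names are hypothetical, and this is a numerical sanity check rather than a proof.

```python
def poly_mul_mod(a, b, gamma, q):
    """Multiply a and b over Z_q and reduce modulo E(X) = X^(q-1) - gamma."""
    d = q - 1
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % q
    # Reduce from the top: X^t = gamma * X^(t-d) modulo E(X).
    for t in range(len(prod) - 1, d - 1, -1):
        prod[t - d] = (prod[t - d] + gamma * prod[t]) % q
        prod[t] = 0
    result = prod[:d]
    return result + [0] * (d - len(result))


def poly_pow_mod(f, exponent, gamma, q):
    """Compute f(X)^exponent mod E(X) by repeated multiplication."""
    result = [1] + [0] * (q - 2)
    for _ in range(exponent):
        result = poly_mul_mod(result, f, gamma, q)
    return result


def scale_argument(f, gamma, q):
    """Coefficients of f(gamma * X) over Z_q."""
    return [(c * pow(gamma, i, q)) % q for i, c in enumerate(f)]


q, gamma = 5, 2                      # 2 generates the nonzero elements of Z_5
f = [3, 1, 4, 2]                     # degree 3 < q - 1, coefficients low to high
print(poly_pow_mod(f, q, gamma, q) == scale_argument(f, gamma, q))
```

Here `f` is padded to exactly $q - 1$ coefficients so that both sides are compared as lists of the same length.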

35 Our result
- $\rho = 1 - R - \varepsilon$ for any $\varepsilon > 0$: folded RS codes [Guruswami, R. 06].
- [Plot: fraction of errors ($\rho$) vs. rate ($R$); curves: unique decoding, information-theoretic limit, Guruswami-Sudan, Parvaresh-Vardy, our work.]

36 “Welcome” to the dark side…

37 Limitations of our work
- To get to $1 - R - \varepsilon$, need $s > 1/\varepsilon$.
- Alphabet size $= n^s > n^{1/\varepsilon}$. Fortunately, it can be reduced to $2^{\mathrm{poly}(1/\varepsilon)}$ using concatenation + expanders [Guruswami-Indyk 02]. The lower bound is $2^{1/\varepsilon}$.
- List size (running time) $> n^{1/\varepsilon}$. Open question to bring this down.

38 Time to wake up

39 Overview of the talk
- List decoding primer
- Previous work on list decoding
- Codes over large alphabets: construction of a “good” code; high-level idea of why it works
- Codes over small alphabets: the current best codes
- Future directions: some (very) modest recent progress

40 Optimal tradeoff for list decoding
- Best possible $\rho$ is $H_q^{-1}(1-R)$, where for $q = 2$, $H(\rho) = -\rho \log \rho - (1-\rho) \log(1-\rho)$.
- There exists an $(H_q^{-1}(1-R-\varepsilon), O(1/\varepsilon))$ list decodable code: a random code of rate $R$ has this property with high probability.
- $\rho > H_q^{-1}(1-R+\varepsilon)$ implies super-polynomial list size, for any code.
- For large $q$, $H_q^{-1}(1-R) \to 1-R$.
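
For $q = 2$ the quantity $H^{-1}(1-R)$ has no closed form, but it is easy to compute numerically. A minimal sketch, assuming base-2 logarithms and taking the inverse on $[0, 1/2]$, where the binary entropy function is increasing; the function names are illustrative.

```python
import math


def binary_entropy(x):
    """H(x) = -x log2(x) - (1 - x) log2(1 - x), with H(0) = H(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def entropy_inverse(y, tol=1e-12):
    """Return x in [0, 1/2] with H(x) = y, found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for rate in (0.25, 0.5, 0.75):
    print(rate, round(entropy_inverse(1 - rate), 4), 1 - rate)
```

Comparing the last two columns shows how far the binary tradeoff $H^{-1}(1-R)$ sits below the large-alphabet limit $1-R$ at the same rate.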

41 Our results ($q = 2$)
- Optimal tradeoff: $\rho = H^{-1}(1-R)$.
- [Guruswami, R. 06]: “Zyablov” bound.
- [Guruswami, R. 07]: Blokh-Zyablov bound.
- [Plot: number of errors vs. rate; curves: Zyablov bound, Blokh-Zyablov bound, previous best, optimal tradeoff.]

42 How do we get binary codes?
- Concatenation of codes [Forney 66].
- $C_1 : (GF(2^k))^K \to (GF(2^k))^N$ (“outer” code).
- $C_2 : (GF(2))^k \to (GF(2))^n$ (“inner” code).
- $C_1 \circ C_2 : (GF(2))^{kK} \to (GF(2))^{nN}$.
- Typically $k = O(\log N)$, which allows brute-force decoding of the inner code.
- [Figure: $m = (m_1, \dots, m_K) \to C_1(m) = (w_1, \dots, w_N) \to C_1 \circ C_2(m) = (C_2(w_1), \dots, C_2(w_N))$.]

43 List decoding concatenated codes
- $C_1$ = folded RS code; $C_2$ = “suitably chosen” binary code.
- Natural decoding algorithm: divide the received word into blocks of length $n$; find the closest $C_2$ codeword for each block; run the list decoding algorithm for $C_1$.
- This loses information!

44 List decoding $C_2$
- [Figure: each received block $y_1, y_2, \dots, y_N \in GF(2)^n$ is list decoded into a list $S_1, S_2, \dots, S_N$ of elements of $GF(2)^k$.]
- How do we “list decode” from lists?

45 The list recovery problem
- Given a code and an error parameter $\rho$: for any set of lists $S_1, \dots, S_N$ such that $|S_i| \le s$ for every $i$, output all codewords $c$ such that $c_i \in S_i$ for at least a $1-\rho$ fraction of the $i$’s.
- List decoding is the special case with $s = 1$.
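
The definition can be made concrete with a brute-force list recoverer over a toy Reed-Solomon code. This is only an illustration of the problem statement, not the efficient algorithm from the talk: it enumerates every codeword, the names are hypothetical, and the error budget is passed as an integer count (so $\rho$ is `max_errors / N`).

```python
from itertools import product


def poly_eval(coeffs, x, p):
    """Evaluate a polynomial at x modulo the prime p; coeffs[i] multiplies X^i."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % p
    return result


def all_codewords(k, p):
    """All Reed-Solomon codewords of dimension k over Z_p (n = p)."""
    return [tuple(poly_eval(msg, a, p) for a in range(p))
            for msg in product(range(p), repeat=k)]


def list_recover(codewords, lists, max_errors):
    """Codewords c with c[i] outside lists[i] in at most max_errors positions."""
    return [c for c in codewords
            if sum(c[i] not in lists[i] for i in range(len(lists))) <= max_errors]


code = all_codewords(k=2, p=5)
c = tuple(poly_eval([1, 2], a, 5) for a in range(5))   # f(X) = 1 + 2X
d = (0, 0, 0, 0, 0)                                    # the zero polynomial

# List recovery with lists of size at most s = 2 and no errors allowed.
lists = [{ci, di} for ci, di in zip(c, d)]
print(list_recover(code, lists, max_errors=0))

# List decoding is the case s = 1: singleton lists built from a received word.
received = list(c)
received[0] = (received[0] + 1) % 5
print(list_recover(code, [{y} for y in received], max_errors=1))
```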

46 List decoding $C_1 \circ C_2$
- [Figure: list decode $C_2$ on each block $y_1, \dots, y_N$ to obtain lists $S_1, \dots, S_N$, then run the list recovering algorithm for $C_1$ on these lists.]

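47 Putting it together [Guruswami, R. 06]
- If $C_1$ can be list recovered from a $\rho_1$ fraction of errors and $C_2$ can be list decoded from a $\rho_2$ fraction of errors, then $C_1 \circ C_2$ can be list decoded from a $\rho_1 \rho_2$ fraction of errors.
- Folded RS codes of rate $R$ are list recoverable from a $1-R$ fraction of errors.
- There exist inner codes of rate $r$ that are list decodable from an $H^{-1}(1-r)$ fraction of errors; one can be found by “exhaustive” search.
- So $C_1 \circ C_2$ is list decodable from a $(1-R)H^{-1}(1-r)$ fraction of errors.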

48 Multilevel concatenated codes
- $C_1 : (GF(2^k))^K \to (GF(2^k))^N$ (“outer” code 1).
- $C_2 : (GF(2^k))^L \to (GF(2^k))^N$ (“outer” code 2).
- $C_{in} : (GF(2))^{2k} \to (GF(2))^n$ (“inner” code).
- $C_1$ and $C_2$ are FRS codes.
- [Figure: $m = (m_1, \dots, m_K) \to C_1(m) = (v_1, \dots, v_N)$ and $M = (M_1, \dots, M_L) \to C_2(M) = (w_1, \dots, w_N)$; the final codeword is $(C_{in}(v_1, w_1), C_{in}(v_2, w_2), \dots, C_{in}(v_N, w_N))$.]

49 Advantage over rate $rR$ concatenated codes
- $C_1$, $C_2$, $C_{in}$ have rates $R_1$, $R_2$ and $r$. The final rate is $r(R_1+R_2)/2$; choose $R_1 < R$.
- Step 1: just recover $m$. List decode $C_{in}$ up to $H^{-1}(1-r)$ errors, then list recover $C_1$ up to $1-R_1$ errors.
- Can handle $(1-R_1)H^{-1}(1-r) > (1-R)H^{-1}(1-r)$ errors.
- [Figure: the multilevel encoding diagram from slide 48.]

50 Advantage over concatenated codes
- Step 2: just recover $M$, given $m$.
- A subcode of $C_{in}$ of rate $r/2$ acts on $M$. List decode the subcode up to $H^{-1}(1-r/2)$ errors, then list recover $C_2$ up to $1-R_2$ errors.
- Can handle $(1-R_2)H^{-1}(1-r/2)$ errors.
- [Figure: the multilevel encoding diagram from slide 48.]

51 Wrapping it up
- Total errors that can be handled: $\min\{(1-R_1)H^{-1}(1-r),\ (1-R_2)H^{-1}(1-r/2)\}$.
- Better than $(1-R)H^{-1}(1-r)$ with $(R_1+R_2)/2 = R$: recall that $R_1 < R$, and $H^{-1}(1-r/2) > H^{-1}(1-r)$, so choose $R_2$ a bit $> R$.
- Optimize over choices of $r$, $R_1$ and $R_2$.
- Need nested list decodability of the inner code.
- Blokh-Zyablov follows from multiple outer codes.
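
The gain from two outer codes can be seen numerically by comparing the two expressions at equal overall rate. The sketch below plugs sample values into the formulas from these slides; the function names and the grid search over $R_1$ are illustrative assumptions, not the optimization used in the talk.

```python
import math


def binary_entropy(x):
    """H(x) = -x log2(x) - (1 - x) log2(1 - x), with H(0) = H(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def entropy_inverse(y, tol=1e-12):
    """Return x in [0, 1/2] with H(x) = y, found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def concatenated_radius(R, r):
    """Single outer code: (1 - R) * H^{-1}(1 - r)."""
    return (1 - R) * entropy_inverse(1 - r)


def multilevel_radius(R1, R2, r):
    """Two outer codes: min of the step 1 and step 2 radii."""
    return min((1 - R1) * entropy_inverse(1 - r),
               (1 - R2) * entropy_inverse(1 - r / 2))


def best_two_level(R, r, steps=1000):
    """Grid search over R1 < R, with R2 = 2R - R1 so that (R1 + R2)/2 = R."""
    best = (0.0, None, None)
    for i in range(1, steps):
        R1 = R * i / steps
        R2 = 2 * R - R1
        if R2 >= 1:
            continue
        value = multilevel_radius(R1, R2, r)
        if value > best[0]:
            best = (value, R1, R2)
    return best


R, r = 0.5, 0.5
print("single outer code:", concatenated_radius(R, r))
print("two outer codes (radius, R1, R2):", best_two_level(R, r))
```

Both constructions have overall rate $rR$ here, so the comparison is at equal rate; only the split of $R$ into $R_1$ and $R_2$ changes.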

52 Our results ($q = 2$)
- Optimal tradeoff: $\rho = H^{-1}(1-R)$.
- [Guruswami, R. 06]: “Zyablov” bound.
- [Guruswami, R. 07]: Blokh-Zyablov bound.
- [Plot: number of errors vs. rate; curves: Zyablov bound, Blokh-Zyablov bound, previous best, optimal tradeoff.]

53 How far can concatenated codes go?
- Outer code: folded RS.
- Random and independent inner codes: a different inner code for each outer symbol.
- Can get to the information-theoretic limit $\rho = H^{-1}(1-R)$ [Guruswami, R. 08].

54 To summarize
- List decoding is a central coding theory notion: it permits decoding up to the optimal fraction of adversarial errors, and it bridges the adversarial and probabilistic approaches to information theory.
- Shannon’s information-theoretic limit: $p = H^{-1}(1-R)$.
- List decoding information-theoretic limit: $\rho = H^{-1}(1-R)$.
- Efficient list decoding is possible for algebraic codes.

55 Our contributions
- Folded RS codes are explicit codes that achieve the information-theoretic limit for list decoding.
- Better list decoding for binary codes.
- Concatenated codes can get us to list decoding capacity.

56 Open questions
- Reduce the decoding complexity of our algorithm.
- List decoding for binary codes: explicitly achieve the error bound $\rho = H^{-1}(1-R)$; erasures: decode when $\rho = 1-R$.
- Non-algebraic codes? Graph-based codes?
- Other applications of these new codes: extractors [Guruswami, Umans, Vadhan 07]; approximating NP-witnesses [Guruswami, R. 08].

57 Thank you. Questions?
