Some aspects of information theory for a computer scientist Eric Fabre 11 Sep. 2014.

Some aspects of information theory for a computer scientist Eric Fabre http://people.rennes.inria.fr/Eric.Fabre http://www.irisa.fr/sumo 11 Sep. 2014

Outline 1. Information: measure and compression 2. Reliable transmission of information 3. Distributed compression 4. Fountain codes 5. Distributed peer-to-peer storage 11/09/14

1. Information: measure and compression

Let’s play…
One card is drawn at random from a given set (pictured on the slide). Guess the color (suit) of the card with a minimum of yes/no questions.
One strategy:
- is it hearts?
- if not, is it clubs?
- if not, is it diamonds?
This wins in 1 guess with probability 1/2, in 2 guesses with probability 1/4, and in 3 guesses with probability 1/4, hence 1.75 questions on average.
Is there a better strategy?

Observation / lessons
- more likely means easier to guess (carries less information)
- the amount of information depends only on the log-likelihood of an event
- guessing with yes/no questions = encoding with bits = compressing
The sequence of yes/no answers gives the binary codewords 1, 01, 001, 000.

Important remark: codes like the one below are not permitted.
Codewords: 1, 0, 11, 00
- they cannot be uniquely decoded if one transmits sequences of encoded values of X, e.g. the sequence 11 can encode “Diamonds” or “Hearts, Hearts”
- one would need one extra symbol to separate “words”
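The ambiguity can be checked mechanically. The sketch below enumerates every way of cutting a bit string into codewords. Only Hearts = 1 and Diamonds = 11 are stated on the slide; the assignment of 0 and 00 to Clubs and Spades, and the function name `parses`, are assumptions made for illustration.

```python
def parses(bits, code):
    """Return every way to split the string `bits` into codewords of `code`."""
    if not bits:
        return [[]]
    out = []
    for symbol, word in code.items():
        if bits.startswith(word):
            # Keep this symbol and parse the rest of the string recursively.
            out += [[symbol] + rest for rest in parses(bits[len(word):], code)]
    return out

# Non-permitted code (Clubs/Spades assignment is an assumption).
bad = {"Hearts": "1", "Clubs": "0", "Diamonds": "11", "Spades": "00"}
# Prefix-free code matching the yes/no questioning strategy.
good = {"Hearts": "1", "Clubs": "01", "Diamonds": "001", "Spades": "000"}

print(parses("11", bad))    # two readings: the code is ambiguous
print(parses("11", good))   # a single reading
```

With the prefix-free code, no codeword is the beginning of another one, so a sequence can be decoded on the fly without separators.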

Entropy
Source of information = random variable. Notation: variables X, Y, … taking values x, y, …
- information carried by the event “X=x”: $h(x) = -\log_2 P(X=x)$
- average information carried by X: $H(X) = -\sum_x P(X=x) \log_2 P(X=x)$
H(X) measures the average difficulty to encode/describe/guess random outcomes of X.
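As a quick numerical check, the sketch below computes the entropy of the card game. The suit probabilities (1/2, 1/4, 1/8, 1/8) are an assumption: they are inferred from the guessing probabilities of the game and are not listed explicitly in the transcript.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Assumed suit distribution, inferred from the card game.
suits = {"hearts": 1 / 2, "clubs": 1 / 4, "diamonds": 1 / 8, "spades": 1 / 8}

H = entropy(suits.values())
print(f"H(X) = {H} bits")
```

Under this assumption the entropy equals the 1.75 questions used on average by the guessing strategy.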

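Properties
- $H(X,Y) \le H(X) + H(Y)$, with equality iff X and Y are independent (i.e. $P(x,y) = P(x)P(y)$)
- $H(X) \ge 0$, with equality iff X is not random
- $H(X) \le \log_2 |\mathcal{X}|$, where $\mathcal{X}$ is the set of values of X, with equality iff $P_X$ is uniform
- Bernoulli distribution with parameter p: $H(X) = -p \log_2 p - (1-p) \log_2 (1-p)$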

Conditional entropy
$H(Y|X) = \sum_x P(X=x)\, H(Y|X=x) = H(X,Y) - H(X)$: uncertainty left on Y when X is known.
Property: $H(Y|X) \le H(Y)$, with equality iff Y and X are independent.

11/09/14 Example : X = color, Y = value average recall so one checks Exercise : check that

11/09/14 A visual representation

Data compression
CoDec for source X, with R bits/sample on average.
Rate R is achievable iff there exist CoDec pairs $(f_n, g_n)$ of rate R with vanishing error probability: $P\big(g_n(f_n(X_1 \dots X_n)) \ne X_1 \dots X_n\big) \to 0$ as $n \to \infty$.
Theorem (Shannon, ‘48):
- a lossless compression scheme for source X must have a rate R ≥ H(X) bits/sample on average
- the rate H(X) is (asymptotically) achievable
Usage: there was no better strategy for our card game!

Proof
Necessity: if R is achievable, then R ≥ H(X); quite easy to prove.
Sufficiency: for R > H(X), one has to build a lossless coding scheme using R bits/sample on average.
Solution 1:
- use a known optimal lossless coding scheme for X: the Huffman code
- then prove that its average codeword length L satisfies H(X) ≤ L < H(X) + 1
- over n independent symbols X_1,…,X_n, one has $H(X_1,\dots,X_n) = nH(X)$, so coding blocks of n symbols costs less than H(X) + 1/n bits/sample
Solution 2: encoding only “typical sequences”
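The bound H(X) ≤ L < H(X) + 1 of Solution 1 can be observed on small examples. The sketch below builds Huffman codeword lengths with a heap; the helper names and the two test distributions are illustrative assumptions (the first one is the suit distribution inferred from the card game).

```python
import heapq
from math import log2

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Return {symbol: codeword length} of a Huffman code for `probs`."""
    # Heap items: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # merging two subtrees adds one bit
            lengths[s] += 1        # to every codeword below them
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def average_length(probs):
    lengths = huffman_lengths(probs)
    return sum(p * lengths[s] for s, p in probs.items())

suits = {"hearts": 1 / 2, "clubs": 1 / 4, "diamonds": 1 / 8, "spades": 1 / 8}
skewed = {"a": 0.6, "b": 0.3, "c": 0.1}   # hypothetical non-dyadic source

for source in (suits, skewed):
    print(entropy(source.values()), average_length(source))
```

For the dyadic suit distribution the Huffman length meets the entropy exactly; for the skewed source it stays within one bit of it.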

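Typical sequences
Let X_1,…,X_n be independent, with the same law. By the law of large numbers, one has the a.s. convergence
$-\frac{1}{n} \log_2 P(X_1,\dots,X_n) = -\frac{1}{n} \sum_{i=1}^n \log_2 P(X_i) \to H(X)$
A sequence x_1…x_n is typical iff $\left| -\frac{1}{n} \log_2 P(x_1,\dots,x_n) - H(X) \right| \le \varepsilon$, or equivalently $2^{-n(H(X)+\varepsilon)} \le P(x_1,\dots,x_n) \le 2^{-n(H(X)-\varepsilon)}$.
Set of typical sequences: $A_\varepsilon^n$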

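AEP: asymptotic equipartition property
For the set $A_\varepsilon^n$ of typical sequences, one has $P(A_\varepsilon^n) \to 1$ and $|A_\varepsilon^n| \le 2^{n(H(X)+\varepsilon)}$.
So non-typical sequences count for 0, and there are approximately $2^{nH(X)}$ typical sequences, each of probability $2^{-nH(X)}$.
Among the $K^n = 2^{n \log_2 K}$ sequences, where K is the number of values of X, only about $2^{nH(X)}$ are typical.
Optimal lossless compression
- encode a typical sequence with nH(X) bits
- encode a non-typical sequence with n log_2 K bits
- add 0 / 1 as prefix to mean typ. / non-typ.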

Practical coding schemes
Encoding by typicality is impractical!
Practical codes:
- Huffman code
- arithmetic coding (adapted to data flows)
- etc.
All require knowing the distribution of the source to be efficient.
Universal code:
- does not need to know the source distribution
- for long sequences X_1…X_n, converges to the optimal rate H(X) bits/symbol
- example: Lempel-Ziv algorithm (used in ZIP, Compress, etc.)

2. Reliable transmission of information

Mutual information
$I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
Properties
- $I(X;Y) \ge 0$, with equality iff X and Y are independent
- measures how many bits X and Y have in common (on average)

Noisy channel
Channel = input alphabet $\mathcal{A}$, output alphabet $\mathcal{B}$, transition probability $P(b|a)$; observe that the input distribution $P(a)$ is left free.
Capacity: $C = \max_{P_A} I(A;B)$ bits / use of channel
- maximizes the coupling between input and output letters
- favors letters that are the least altered by noise

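Example: the erasure channel
A proportion p of the bits is erased. Define the erasure variable E = f(B), with E=1 when an erasure occurred and E=0 otherwise.
$H(A|B) = H(A|B,E) = (1-p)\,H(A|B,E=0) + p\,H(A|B,E=1)$, with $H(A|B,E=0) = 0$ and $H(A|B,E=1) = H(A)$.
So $I(A;B) = H(A) - H(A|B) = (1-p)\,H(A)$ and $C = 1-p$.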
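The capacity of the erasure channel can also be found numerically, by maximizing the mutual information over the input distribution. The sketch below is only an illustration: the erasure probability p = 0.25, the grid search over P(A=1), and the function names are all assumptions.

```python
from math import log2

def mutual_information(joint):
    """I(A;B) in bits, from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def erasure_joint(q, p):
    """Joint law of (A, B) when P(A=1) = q and each bit is erased ('?') with prob. p."""
    return {(0, 0): (1 - q) * (1 - p), (0, "?"): (1 - q) * p,
            (1, 1): q * (1 - p), (1, "?"): q * p}

p = 0.25  # hypothetical erasure probability
# Grid search over the free input distribution P(A=1) = q.
best_I, best_q = max((mutual_information(erasure_joint(q / 100, p)), q / 100)
                     for q in range(1, 100))
print(best_I, best_q)
```

The maximum is reached for a uniform input, and its value matches 1 - p.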

Protection against errors
Idea: add extra bits to the message, to augment its inner redundancy (this is exactly the opposite of data compression).
Coding scheme: X → f_n → A_1…A_n → noisy channel → B_1…B_n → g_n → estimate of X
- X takes values in { 1, 2, …, M = 2^{nR} }
- rate of the codec: R = log_2(M) / n transmitted bits / channel use
- R is achievable iff there exists a series of CoDecs (f_n, g_n) of rate R such that $P(\hat{X} \ne X) \to 0$ as $n \to \infty$, where $\hat{X} = g_n(B_1 \dots B_n)$ and $A_1 \dots A_n = f_n(X)$

Error correction (for a binary channel)
Repetition
- a useful bit U is sent 3 times: A_1 = A_2 = A_3 = U
- decoding by majority
- detects and corrects one error… but R' = R/3
Parity checks
- X = k useful bits U_1…U_k, expanded into n bits A_1…A_n
- rate R = k/n
- for example: add extra redundant bits V_{k+1}…V_n that are linear combinations of U_1…U_k
- examples: ASCII code (k=7, n=8), ISBN, social security number, credit card number
Questions: how??? and how many extra bits???
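Both ideas fit in a few lines. The sketch below shows a 3-fold repetition code with majority decoding, and a single even-parity bit in the spirit of the k=7, n=8 example; function names are illustrative assumptions.

```python
def encode_repetition(bits, r=3):
    """Send every useful bit r times."""
    return [b for b in bits for _ in range(r)]

def decode_repetition(received, r=3):
    """Majority vote inside each block of r received bits."""
    return [int(sum(received[i:i + r]) > r // 2)
            for i in range(0, len(received), r)]

def add_parity(bits):
    """Append one redundant bit so that the total number of ones is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """True when no error (or an even number of errors) is detected."""
    return sum(word) % 2 == 0

message = [1, 0, 1, 1]
sent = encode_repetition(message)
corrupted = sent[:]
corrupted[0] ^= 1          # one error in the first block
corrupted[4] ^= 1          # one error in the second block
print(decode_repetition(corrupted) == message)

word = add_parity([1, 0, 1, 1, 0, 0, 1])   # k=7 useful bits, n=8 sent bits
word[2] ^= 1                               # a single error is detected...
print(parity_ok(word))                     # ...but cannot be located
```

The repetition code corrects one error per block at the price of dividing the rate by 3; the parity bit costs little but only detects an odd number of errors.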

How?
Almost all channel codes are linear: Reed-Solomon, Reed-Muller, Golay, BCH, cyclic codes, convolutional codes… They use finite field theory and algebraic decoding techniques.
The Hamming code
- 4 useful bits U_1…U_4
- 3 redundant bits V_1…V_3
- rate R = 4/7
- detects and corrects 1 error (exercise…)
- trick: 2 codewords differ by at least 3 bits
Generating matrix (of a linear code): [ U_1 … U_4 ] · G = [ U_1 … U_4 V_1 … V_3 ], with
      U1 U2 U3 U4 V1 V2 V3
G = [  1  0  0  0  0  1  1 ]
    [  0  1  0  0  1  0  1 ]
    [  0  0  1  0  1  1  0 ]
    [  0  0  0  1  1  1  1 ]
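The exercise can be checked by brute force. The sketch below uses the generating matrix of the slide, lists the 16 codewords, computes their minimum distance, and decodes by nearest codeword. The brute-force decoder is an assumption chosen for brevity; practical decoders use the syndrome instead.

```python
from itertools import product

# Generating matrix from the slide: columns U1..U4 V1..V3.
G = [
    [1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
]

def encode(u):
    """Codeword u.G over GF(2) for 4 useful bits u."""
    return [sum(u[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def distance(x, y):
    """Hamming distance: number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

codewords = [encode(u) for u in product([0, 1], repeat=4)]
min_distance = min(distance(x, y)
                   for x in codewords for y in codewords if x != y)

def decode(received):
    """Nearest-codeword decoding; returns the 4 useful bits."""
    best = min(codewords, key=lambda c: distance(c, received))
    return best[:4]

print(min_distance)
```

Since two codewords differ in at least 3 positions, a word with a single error is still closer to the transmitted codeword than to any other one.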

How much?
Theorem (Shannon, ‘48):
- any achievable transmission rate R must satisfy R ≤ C transmitted bits / channel use
- any transmission rate R < C is achievable
Usage: measures the efficiency of an error correcting code for some channel.
(Figure labels: what people believed before ‘48; what Shannon proved in ‘48.)

Proof
Necessity: if a coding is (asympt.) error free, then its rate satisfies R ≤ C; rather easy to prove.
Sufficiency: any rate R < C is achievable.

(Figure: M typical sequences w_1, …, w_M of A_1…A_n are used as codewords; each w_i has an output cone of typical sequences w'_i of B_1…B_n that are jointly typical with it, inside the set of possible typical sequences at output.)
If M is small enough, the output cones do not overlap (with high probability).
Maximal number of input codewords: $M \approx 2^{nH(B)} / 2^{nH(B|A)} = 2^{nI(A;B)}$, which proves that any R < C is achievable.

Perfect coding
Perfect code = error-free and achieves capacity. What does it look like?
Setting: X → f_n → A_1…A_n → noisy channel → B_1…B_n → g_n.
- by the data processing inequality, nR = H(X) = I(X;X) ≤ I(A_1…A_n ; B_1…B_n) ≤ nC
- if R = C, then I(A_1…A_n ; B_1…B_n) = nC
- this is possible iff the letters A_i of the codeword are independent and each I(A_i;B_i) = C, i.e. each A_i carries R = C bits
For a binary channel: R = k / n, with k useful bits carried by n transmitted bits; a perfect code spreads information uniformly over a larger number of bits.

In practice
Random coding is impractical: it relies on a (huge) codebook for coding/decoding.
Algebraic (linear) codes were long preferred: more structure, coding/decoding with algorithms. But in practice, they remained much below optimal rates!
Things changed in 1993, when Berrou & Glavieux invented the turbo-codes, followed by the rediscovery of the low-density parity check codes (LDPC), invented by Gallager in his PhD… in 1963!
Both code families behave like random codes… but come with low-complexity coding/decoding algorithms.

Can feedback improve capacity?
Principle
- the outputs of the channel are revealed to the sender
- the sender can use this information to adapt its next symbol
Theorem: Feedback does not improve channel capacity.
But it can greatly simplify coding, decoding, and transmission protocols.

2nd PART
Information theory was designed for point-to-point communications, which was soon considered a limitation…
- broadcast channel: each user has a different channel
- multiple access channel: interferences
Spread information:
- which structure for this object?
- how to regenerate / transmit it?

2nd PART
What is the capacity of a network? Are network links just pipes, with capacity, in which information flows like a fluid?
How many transmissions to broadcast from A to C, D and from B to C, D?
(Figure: network with nodes A, B, C, D, E, F; the links carry the packets a, b and a+b.)
By network coding, one transmission over link E-F can be saved. (Medard & Koetter 2003)

3. Distributed source coding

Collecting spread information
- X, Y are two distant but correlated sources
- transmit their values to a unique receiver (perfect channels)
- no communication between the encoders
(Figure: X → encoder 1 at rate R_1 and Y → encoder 2 at rate R_2, separated by some distance, feed a joint decoder that outputs X,Y; the information diagram is split into H(X|Y), I(X;Y), H(Y|X).)
Naive solution = ignore correlation, compress and send each source separately: rates R_1 = H(X), R_2 = H(Y).
Can one do better, and take advantage of the correlation of X and Y?

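Example
- X = weather in Brest, Y = weather in Quimper (each is either sun or rain)
- the probability that the weathers are identical is 0.89
- one wishes to send the observed weather of 100 days in both cities
Joint distribution (the only one compatible with H(X) = H(Y) = 1 and 0.89 on the diagonal):
            Y = sun   Y = rain
X = sun      0.445     0.055
X = rain     0.055     0.445
One has H(X) = 1 = H(Y), so naïve encoding requires 200 bits.
I(X;Y) = 0.5, so not sending the “common information” saves 50 bits.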
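The figures of this example can be recomputed. The joint distribution used below (0.445 on the diagonal, 0.055 off the diagonal) is not printed in the transcript: it is derived from the two stated facts, H(X) = H(Y) = 1 and probability 0.89 of identical weather, so treat it as a reconstruction.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Reconstructed joint law of (X, Y) = (weather in Brest, weather in Quimper).
joint = {("sun", "sun"): 0.445, ("sun", "rain"): 0.055,
         ("rain", "sun"): 0.055, ("rain", "rain"): 0.445}

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p      # marginal of X
    p_y[y] = p_y.get(y, 0) + p      # marginal of Y

H_X = entropy(p_x.values())
H_Y = entropy(p_y.values())
H_XY = entropy(joint.values())
I_XY = H_X + H_Y - H_XY

days = 100
naive_bits = days * (H_X + H_Y)     # each city compressed separately
joint_bits = days * H_XY            # the pair compressed as a whole
print(round(I_XY, 3), round(naive_bits), round(joint_bits))
```

The mutual information comes out very close to 0.5 bit per day, which is where the saving of about 50 bits over 100 days comes from.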

Necessary conditions
Question: what are the best possible achievable transmission rates?
A pair (R_1, R_2) is achievable if there exist separate encoders f_n^X and f_n^Y of sequences X_1…X_n and Y_1…Y_n resp., and a joint decoder g_n, that are asymptotically error-free.
- Jointly, both coders must transmit the full pair (X,Y), so R_1 + R_2 ≥ H(X,Y)
- Each coder alone must transmit the private information that is not accessible through the other variable, so R_1 ≥ H(X|Y) and R_2 ≥ H(Y|X)

Result
Theorem (Slepian & Wolf, ‘73): the achievable region is defined by
- R_1 ≥ H(X|Y)
- R_2 ≥ H(Y|X)
- R_1 + R_2 ≥ H(X,Y)
(Figure: the achievable region in the (R_1, R_2) plane, with H(X|Y) and H(X) marked on the R_1 axis, H(Y|X) and H(Y) on the R_2 axis.)
The achievable region is easily shown to be convex and upper-right closed.
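The three inequalities translate directly into a membership test. In the sketch below, the function name is hypothetical and the numerical values are the rounded figures of the weather example (H(X|Y) = H(Y|X) = 0.5 and H(X,Y) = 1.5, which follow from H(X) = H(Y) = 1 and I(X;Y) = 0.5).

```python
def in_slepian_wolf_region(r1, r2, h_x_given_y, h_y_given_x, h_xy):
    """True when the rate pair (r1, r2) satisfies the three Slepian-Wolf constraints."""
    return (r1 >= h_x_given_y
            and r2 >= h_y_given_x
            and r1 + r2 >= h_xy)

# Rounded values from the weather example (bits per day).
H_X_GIVEN_Y, H_Y_GIVEN_X, H_XY = 0.5, 0.5, 1.5

for rates in [(1.0, 1.0), (0.5, 1.0), (0.75, 0.75), (0.7, 0.7)]:
    print(rates, in_slepian_wolf_region(*rates, H_X_GIVEN_Y, H_Y_GIVEN_X, H_XY))
```

The pair (1, 1) is the naive solution; (0.5, 1) and (0.75, 0.75) lie on the boundary of the region, with a total of 1.5 bits per day instead of 2.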

Compression by random binning
- encode only typical sequences w = x_1…x_n
- throw them at random into 2^{nR} bins (numbered 1, 2, 3, …, 2^{nR}), with R > H(X)
Encoding of w = the number b of the bin where w lies: a codeword on R bits/symbol.
Decoding: if w is the unique typical sequence in bin number b, output w; otherwise, output “error”.
Error probability: of the order of $2^{nH(X)} / 2^{nR} = 2^{-n(R-H(X))}$, which vanishes as n grows since R > H(X).

Proof of Slepian-Wolf
f_X and f_Y are two independent random binnings of rates R_1 and R_2, for x = x_1…x_n and y = y_1…y_n resp.
To decode the pair of bin numbers (b_X, b_Y) = (f_X(x), f_Y(y)), g outputs the unique pair (x,y) of jointly typical sequences in box (b_X, b_Y), or “error” if there is more than one such pair.
- R_2 > H(Y|X): given x, there are 2^{nH(Y|X)} sequences y that are jointly typical with x
- R_1 + R_2 > H(X,Y): the number of boxes 2^{n(R_1+R_2)} must be greater than 2^{nH(X,Y)}, the number of jointly typical pairs (x,y)
(Figure: grid of 2^{nR_1} × 2^{nR_2} boxes, with the jointly typical pairs (x,y) scattered in it.)

Example
X = color, Y = value.
(Figure: information diagram of X and Y, labelled with the values 0.5 and 1.25; partially filled code table for X given Y, with entries ?, ?, ?, ?, 010, 110, 111.)
Questions:
1. Is there an instantaneous (*) transmission protocol for rates R_X = 1.25 = H(X|Y), R_Y = 1.75 = H(Y)?
- send Y (always): 1.75 bits
- what about X? (caution: the code for X should be uniquely decodable)
2. What about R_X = R_Y = 1.5?
(*) i.e. for sequences of length n=1

In practice
The Slepian-Wolf theorem extends to N sources. It long remained an academic result, since no practical coders existed. At the beginning of the 2000s, practical coders and applications appeared:
- compression of correlated images (e.g. same scene, 2 angles)
- sensor networks (e.g. measure of a temperature field)
- case of a channel with side information
- acquisition of structured information, without communication

4. Fountain codes

Network protocols
TCP/IP (transmission control protocol)
(Figure: packets 1…7 are sent through the network, seen as an erasure channel; the receiver returns acknowledgements such as “ack 2”.)
Drawbacks
- slow for huge files over long-range connections (e.g. cloud backups…)
- feedback channel… but feedback does not improve capacity!
- repetition code… the worst rate among error correcting codes!
- designed by engineers who ignored information theory? :o)
However
- the erasure rate of the channel (thus capacity) is unknown / changing
- feedback makes protocols simpler
- there exist faster protocols (UDP) for streaming feeds

A fountain of information bits…
How to quickly and reliably transmit K packets of b bits?
Fountain code:
- from the K packets, generate and send a continuous flow of packets
- some get lost, some go through; no feedback
- as soon as K(1+ε) of them are received, any of them, decoding becomes possible
Fountain codes are examples of rateless codes (no predefined rate), or universal codes: they adapt to the channel capacity.

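Random coding…
The packet t_n sent at time n is a random linear combination of the K packets s_1…s_K to transmit:
$t_n = \sum_{k=1}^{K} G_{n,k}\, s_k$ (sum modulo 2, bit by bit)
where the G_{n,k} are random IID binary variables.
In matrix form, the K' sent packets satisfy [t_1 … t_{K'}]^T = G · [s_1 … s_K]^T, where G is a K' × K random binary matrix (rows such as 1001011…1, 1010001…0, 0110100…1, …, 1011010…0) and each packet has b bits.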
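A minimal sketch of this encoder is given below. Packets are represented as Python integers of b bits, so that the bitwise sum modulo 2 is a XOR; this representation, the function name and the toy packets are assumptions for illustration.

```python
import random

def fountain_packet(source, rng):
    """Return (row of G, encoded packet): XOR of a random subset of the source packets."""
    coeffs = [rng.randint(0, 1) for _ in source]   # one random row G_n of the matrix
    payload = 0
    for g, s in zip(coeffs, source):
        if g:
            payload ^= s                           # addition modulo 2, bit by bit
    return coeffs, payload

rng = random.Random(2014)                          # fixed seed for reproducibility
source = [0b1011, 0b0110, 0b1110, 0b0001]          # K = 4 toy packets of b = 4 bits
stream = [fountain_packet(source, rng) for _ in range(6)]
for coeffs, payload in stream:
    print(coeffs, format(payload, "04b"))
```

The receiver needs the row of G that goes with each packet; in practice it can be regenerated from a shared seed or a packet header, a detail left out of this sketch.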

Decoding
Sent: [t_1 … t_{K'}]^T = G · [s_1 … s_K]^T, with G a K' × K binary matrix.
Some packets are lost, and N out of K' are received: [r_1 … r_N]^T = G' · [s_1 … s_K]^T, where G' is the N × K matrix made of the rows of G that correspond to received packets.
This is equivalent to another random code with generating matrix G'. How big should N be to enable decoding?

Decoding
One has r = G'·s, where G' is a random N × K binary matrix. If G' is invertible, one can decode by s = G'^{-1}·r.
For N=K, what is the probability that G' is invertible? Answer: it converges quickly to 0.289 (as soon as K>10).
What about N=K+E? What is the probability P that at least one K×K sub-matrix of G' is invertible? Answer: P = 1-δ(E), where δ(E) ≤ 2^{-E} (δ(E) < 10^{-6} for E=20): exponential convergence to 1 with E, regardless of K.
Complexity
- encoding: K/2 operations per generated packet, so O(K^2)
- decoding: K^3 for matrix inversion
- one would like better complexities… linear?

LT codes
Invented by Michael Luby (2003), and inspired by LDPC codes (Gallager, 1963).
Idea: linear combinations of packets should be “sparse”.
Encoding
- for each packet t_n, randomly select a “degree” d_n according to some distribution ρ(d) on degrees
- choose at random d_n packets among s_1…s_K, and take as t_n the sum of these d_n packets
- some nodes have low degree, others have high degree: this makes the graph (s_1…s_K connected to t_1…t_N) a small world
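The two encoding steps can be sketched as follows. Packets are again represented as integers combined by XOR, and the degree distribution is passed as a list of (degree, probability) pairs; both choices, the function name and the toy distribution are assumptions for illustration.

```python
import random

def lt_packet(source, degree_dist, rng):
    """Return (indices of the combined source packets, their XOR)."""
    degrees, weights = zip(*degree_dist)
    d = rng.choices(degrees, weights=weights)[0]       # step 1: draw the degree d_n
    neighbours = rng.sample(range(len(source)), d)     # step 2: pick d_n distinct packets
    payload = 0
    for k in neighbours:
        payload ^= source[k]                           # sum modulo 2 of the chosen packets
    return sorted(neighbours), payload

rng = random.Random(0)
source = [0b1010, 0b0110, 0b1111, 0b0001]
toy_dist = [(1, 0.25), (2, 0.5), (3, 0.25)]            # hypothetical rho(d)
for _ in range(5):
    print(lt_packet(source, toy_dist, rng))
```

Compared with dense random coding, each packet now touches only a few source packets, which is what makes fast decoding possible.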

Decoding LT codes
Idea = a simplified version of turbo-decoding (Berrou) that resembles crossword solving.
(Example shown step by step as a figure.)
How to choose degrees?
- each iteration should yield a single new node of degree 1
- achieved by the distribution ρ(1) = 1/K and ρ(d) = 1/(d(d-1)) for d = 2…K
- the average degree is about log_e K, so the decoding complexity is K log_e K
In reality
- one needs a few nodes of high degree to ensure that every packet is connected to at least one check-node
- one needs a few more small-degree nodes to ensure that decoding starts
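The sketch below illustrates both halves of this slide: the degree distribution ρ given above, and a decoder that repeatedly looks for a check of degree 1, decodes the corresponding packet and removes it from the other checks. The function names and the toy packets are assumptions, and the example is not the one drawn on the slide.

```python
from math import log

def ideal_soliton(K):
    """rho(1) = 1/K and rho(d) = 1/(d(d-1)) for d = 2..K."""
    return {1: 1 / K, **{d: 1 / (d * (d - 1)) for d in range(2, K + 1)}}

def peel(K, packets):
    """Decode from (neighbour indices, XOR value) pairs; None marks undecoded packets."""
    decoded = [None] * K
    pending = [(set(nb), val) for nb, val in packets]
    progress = True
    while progress:
        progress = False
        remaining = []
        for nb, val in pending:
            # Subtract already decoded packets from this check.
            for k in [k for k in nb if decoded[k] is not None]:
                nb.discard(k)
                val ^= decoded[k]
            if len(nb) == 1:               # degree 1: the value is revealed
                (k,) = nb
                decoded[k] = val
                progress = True
            elif nb:
                remaining.append((nb, val))
        pending = remaining
    return decoded

K = 1000
rho = ideal_soliton(K)
mean_degree = sum(d * p for d, p in rho.items())
print(round(sum(rho.values()), 6), round(mean_degree, 2), round(log(K), 2))

source = [1, 0, 1, 1]                      # hypothetical 1-bit packets
packets = [([0], 1), ([0, 1], 1), ([1, 2], 1), ([0, 2, 3], 1)]
print(peel(4, packets))
```

The mean degree exceeds log_e K by a small constant, consistent with the K log_e K complexity. Decoding stalls as soon as no check of degree 1 is left, which is why the degree distribution matters so much.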

In practice…
Performance
- both encoding and decoding are in K log K (instead of K^2 and K^3)
- for large K > 10^4, the observed overhead E represents from 5% to 10%
- Raptor codes (Shokrollahi, 2003) do better: linear time complexity
Applications
- broadcast to many users: a fountain code adapts to the channel of each user; no need to rebroadcast packets missed by some user
- storage on many unreliable devices: e.g. RAID (redundant array of inexpensive disks), data centers, peer-to-peer distributed storage

5. Distributed P2P storage

Principle
Idea = the raw data s_1…s_K is split into packets and expanded with some ECC into redundant data t_1…t_N. Each newly created packet is stored independently, on distinct storages (disks, peers, …). The original data is erased.
Problems
- disks can crash, peers can leave: eventual data loss
- the original data can be recovered if enough packets remain…
- … but missing packets need to be restored (e.g. as new packets t'_2 … t'_N on new peers)
Restoration
- perfect: the packet that is lost is exactly replaced
- functional: new packets are built, to preserve data recoverability
- intermediate: maintain the systematic part of the data

Which codes?
Target: one should rebuild missing blocks… without first rebuilding the original data! (that would require too much bandwidth)
Fountain/random codes: random linear combinations of the remaining blocks among t_1…t_n will not preserve the appropriate degree distribution.
MDS codes (maximum distance separable codes): one can rebuild s_1…s_k from any subset of exactly k blocks in t_1…t_n; example: Reed-Solomon codes.

Example
(Figure: the data a, b, c, d, seen as k sets of α blocks, is expanded into n sets of α blocks: (a, b), (c, d), (a+c, b+d), (b+c, a+b+d). A reconstruction step is shown, in which β blocks are requested from the remaining sets.)

Example
(Figure: a second reconstruction on the storage layout (a, b), (c, d), (a+c, b+d), (b+c, a+b+d), with k sets of α blocks expanded into n sets of α blocks, and β blocks requested for the repair.)
Result (Dimakis et al., 2010): for functional repair, given k, n and d ≥ k (the number of nodes to contact for repair), network coding techniques allow one to optimally balance α (number of blocks) and β (bandwidth necessary for reconstruction).

6. Conclusion

A few lessons
Ralf Koetter (*): “Communications aren’t anymore about transmitting a bit, but about transmitting evidence about a bit.”
(*) one of the inventors of Network Coding
- Random structures spread information uniformly.
- Information theory gives bounds on how much one can learn about some hidden information…
- One does not have to build the actual protocols/codes that will reveal this information.

Management of distributed information… in other fields
Digital communications (network information theory)
- A, B: random variables, possibly correlated
- one wishes to compute in B the value f(A,B)
- how many bits should be exchanged?
- how many communication rounds?
Compressed sensing (signal processing)
- signal can be described by sparse coefficients
- random (sub-Nyquist) sampling
Communication complexity (computer science)
- A, B: variables, taking values in a huge space (n bits)
- how many bits should A send to B in order to check A=B?
- solution by random coding
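One classical way to read "solution by random coding" for the equality test is sketched below: A sends a few random parities of its n-bit string, and B compares them with the same parities of its own string. The slide does not detail the protocol, so the use of shared random masks, the function names and the parameters are all assumptions.

```python
import random

def parity(bits, mask):
    """Parity (sum modulo 2) of the bits selected by the mask."""
    return sum(a & m for a, m in zip(bits, mask)) % 2

def probably_equal(a_bits, b_bits, rounds, rng):
    """Compare `rounds` random parities instead of the n bits themselves."""
    for _ in range(rounds):
        mask = [rng.randint(0, 1) for _ in a_bits]   # shared randomness (assumed)
        if parity(a_bits, mask) != parity(b_bits, mask):
            return False      # a differing parity proves that A != B
    return True               # equal, or unequal with probability at most 2**-rounds

rng = random.Random(42)
n = 1000
a = [rng.randint(0, 1) for _ in range(n)]
b = a[:]
c = a[:]
c[123] ^= 1                   # differs from a in a single position

print(probably_equal(a, b, 64, rng))   # 64 bits exchanged instead of 1000
print(probably_equal(a, c, 64, rng))
```

When the two strings differ, each random parity exposes the difference with probability 1/2, so a few dozen exchanged bits are enough whatever the value of n.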

thank you !
