1
Information Theory (Tutorial on Neural Systems Modeling, Section 8.1)
Claude Shannon, the father of information theory
After mastering the content and concepts from this lesson, you should be able to:
define information, in a technical sense
write a program that computes the Shannon entropy associated with any probability distribution
2
Technical Terminology
Energy. Work. Significance. Information. These terms all have very precise technical definitions that are completely different from their colloquial usages.
3
Defining “information” in the technical sense
A random variable is a mathematical variable that randomly takes different values according to a probability distribution. Example: we can define a random variable X that is equal to 0 whenever a coin is flipped and ends up ‘tails,’ and is equal to 1 whenever the coin is flipped and ends up ‘heads.’
4
Defining “information” in the technical sense
Table: 7 realizations of the random variable X, with columns Trial Number, Result, and X; for example, Trial 1 came up Heads (X = 1) and Trial 3 came up Tails (X = 0).
5
Defining “information” in the technical sense
Every random variable is associated with a probability distribution function (pdf). The pdf for the random variable associated with a coin flip is a bar plot of p(X = x) versus x, with p(X = 0) = 0.5 and p(X = 1) = 0.5. (The probabilities of all events should add to what?)
6
Defining “information” in the technical sense
Each of the two possible outcomes of a coin flip is said to contain information. This information, I, is defined mathematically as
I(X=x) = -log2 P(X=x)
So, for example, the information contained in the event that a coin is flipped and ends up tails is…
I(X=0) = -log2 P(X=0)
I(X=0) = -log2(1/2)
I(X=0) = -(-1)
I(X=0) = 1 bit
What is the information associated with the event that a coin is flipped and ends up heads?
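A minimal sketch of this calculation in MATLAB (the language of the book's exercises); the variable names are mine, not from the text:
p_tails = 0.5;              % P(X = 0) for a fair coin
I_tails = -log2(p_tails);   % information in bits; evaluates to 1, matching the slide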
7
Where did that come from?
This mathematical definition comes from a very simple requirement: if two events, x and y, are statistically independent, then the information associated with both events occurring simultaneously should be the sum of the information associated with each event occurring individually:
I(X=x, Y=y) = I(X=x) + I(Y=y)
Because independent events satisfy P(X=x, Y=y) = P(X=x) P(Y=y), taking a logarithm turns this product of probabilities into a sum, which is exactly what the requirement demands.
Example: If you have two coins, then the event that you flip both and get heads for the first coin and tails for the second coin should carry the same information as flipping the first coin and getting heads, then flipping the second coin and getting tails, because the two events are independent.
On the other hand, if you had a set of “entangled” coins that are ALWAYS opposite of one another when flipped simultaneously, then the probability of getting heads for the first one and tails for the second would NOT be the same as flipping them individually and getting heads for the first and tails for the second; in that case the information for x and y together would not equal the sum of the information for x and y separately.
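A quick numerical check of this additivity in MATLAB, for two independent fair coins (the variable names are illustrative):
p_x = 0.5;  p_y = 0.5;                      % P(heads on coin 1), P(tails on coin 2)
I_joint = -log2(p_x * p_y);                 % information of both events together
I_sum   = (-log2(p_x)) + (-log2(p_y));      % sum of the individual informations
% both evaluate to 2 bits, as the additivity requirement demands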
8
An intuitive understanding of the technical concept of information
The information of any one event is the “degree of surprise” associated with that event actually happening.
9
Caution: do not confuse information with meaning
ToNSM, Anastasio, Chapter 8
10
The meaning of an event can be very tricky to define mathematically…
--A cornflake shaped like Illinois sold for $1350 on eBay!
--The network of neurons devoted to vision has a specific purpose, so we could presumably come up with a mathematical definition of how well one could see with various connectivities between those neurons.
--Some of those connectivities might be more likely than others; just because one particular connectivity has high meaning (results in very good sight) does not necessarily mean that it will also be high in information (i.e., very unlikely).
11
Shannon Entropy
The Shannon entropy of a random variable X is defined according to the pdf of X:
H(X) = -Σ_x P(X=x) log2 P(X=x)
This is the average information content of a random variable.
12
Shannon Entropy
What is the entropy of the random variable associated with a coin flip?
H(X) = -Σ_x P(X=x) log2 P(X=x)
H(X) = -P(X=0) log2 P(X=0) - P(X=1) log2 P(X=1)
H(X) = -(1/2) log2 (1/2) - (1/2) log2 (1/2)
H(X) = -(1/2)(-1) - (1/2)(-1)
H(X) = 1 bit
13
Shannon Entropy
What is the entropy of the random variable associated with a coin that is heads on both sides? (We still have that X is equal to 0 whenever a coin is flipped and ends up ‘tails,’ and is equal to 1 whenever the coin is flipped and ends up ‘heads.’) The pdf now puts all of its mass on x = 1: p(X=0) = 0 and p(X=1) = 1.
H(X) = -Σ_x P(X=x) log2 P(X=x)
H(X) = -P(X=0) log2 P(X=0) - P(X=1) log2 P(X=1)
H(X) = -(0) log2 0 - (1) log2 1    (taking 0 log2 0 = 0, its limiting value)
H(X) = 0 bits
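A minimal MATLAB sketch of the entropy program from the learning objectives; the function name shannon_entropy is mine, and zero-probability outcomes are dropped to implement the 0 log2 0 = 0 convention:
function H = shannon_entropy(p)
% SHANNON_ENTROPY  Shannon entropy (in bits) of a probability distribution p.
p = p(p > 0);             % drop zero-probability outcomes (0*log2(0) is taken as 0)
H = -sum(p .* log2(p));   % average information content, in bits
end
For the two coins above, shannon_entropy([0.5 0.5]) returns 1 and shannon_entropy([0 1]) returns 0.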
14
An intuitive understanding of entropy
The entropy of a probability distribution is the “average uncertainty” associated with that distribution. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.’ ~Claude Shannon
15
Applying information theory to neural systems (Anastasio 8.1)
After mastering the content and concepts from this lesson, you should be able to:
explain how information theory can be used to quantify how well a neuronal network represents different sensory stimuli
compute conditional probabilities
use probability distribution functions to compute the mutual information between two random variables
16
Given a realization of the inputs, the weight matrix ‘V’ determines the outputs. Assume the inputs are random variables.
17
Your nephew at the zoo
18
A four-animal zoo
x1 = 1 when the animal you’re looking at has fur, and x1 = 0 when it does not have fur
x2 = 1 when the animal you’re looking at has wings, and x2 = 0 when it does not have wings
Animal: Fish (x1 = 0, x2 = 0), Bird (x1 = 0, x2 = 1), Bear (x1 = 1, x2 = 0), Bat (x1 = 1, x2 = 1)
19
Flashcard system to learn animals
All animals equally likely to be shown by the flashcard system:
P(X1=0, X2=0) = 0.25
P(X1=1, X2=0) = 0.25
P(X1=0, X2=1) = 0.25
P(X1=1, X2=1) = 0.25
Fish most likely to be shown, and bat will never be shown:
P(X1=0, X2=0) = 0.75
P(X1=1, X2=0) = 0.125
P(X1=0, X2=1) = 0.125
P(X1=1, X2=1) = 0
Which of these distributions has the higher entropy?
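Using the shannon_entropy sketch from above (my own helper, not from the text), the two flashcard distributions can be compared directly:
H_uniform = shannon_entropy([0.25 0.25 0.25 0.25]);   % = 2 bits
H_skewed  = shannon_entropy([0.75 0.125 0.125 0]);    % ≈ 1.06 bits
The uniform distribution has the higher entropy, answering the question above.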
20
Notation
Mathematically represent the input pattern as a vector: X = (X1, X2). Do the same to represent the output pattern: Y = (Y1, Y2).
21
Conditional probability
One way to determine how well the output layer is forming a unique representation of the input layer is to look at how knowing the output pattern affects the probability distribution of the input patterns:
P(X=b | Y=a) = P(X=b, Y=a) / P(Y=a)
that is, the probability of BOTH X=b and Y=a occurring, divided by the probability of Y=a occurring.
22
Example: a joint probability table P(X, Y), with X and Y each taking the values 1 through 4 and nonzero entries 0.15, 0.2, 0.2, 0.25, 0.05, 0.1, and 0.05 (arranged as shown on the slide).
P(X=b | Y=a) = P(X=b, Y=a) / P(Y=a)
From the table:
P(X=3 | Y=2) = 0.70
P(X=3 | Y=1) = 0.50
P(Y=1 | X=3) = 0.167
23
Difference between P(X|Y) and P(Y|X)
Figure: a joint distribution P(X, Y) together with bar plots of the conditional distributions P(Y | X=1), P(Y | X=3), and P(X | Y=3) computed from it.
24
Exercise, using the joint table P(X, Y) shown on the slide (with entries including 0.2, 0.05, 0.1, and 0.15): What is P(Y | X=2)? What is P(X | Y=4)? What is P(X=2 | Y=4)? The partially worked answers on the slide leave the denominators blank for you to fill in, for example 0.15/( ) = 0.429 and 0.2/( ) = 0.80.
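A MATLAB sketch of how such conditional probabilities are computed from a joint table; the matrix Pxy below is a made-up example, not the table from the slides:
% hypothetical 4x4 joint table (rows index X, columns index Y)
Pxy = [0.10 0.05 0.05 0.05; ...
       0.05 0.10 0.05 0.05; ...
       0.05 0.15 0.10 0.05; ...
       0.05 0.05 0.05 0.00];
Py = sum(Pxy, 1);                  % marginal P(Y=a): sum down each column
Px = sum(Pxy, 2);                  % marginal P(X=b): sum across each row
P_X3_given_Y2 = Pxy(3,2) / Py(2);  % P(X=3 | Y=2) = P(X=3, Y=2) / P(Y=2)
P_Y2_given_X3 = Pxy(3,2) / Px(3);  % P(Y=2 | X=3) = P(X=3, Y=2) / P(X=3)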
25
Before knowing Y, all four input patterns are equally likely:
P(X1=0, X2=0) = 0.25
P(X1=1, X2=0) = 0.25
P(X1=0, X2=1) = 0.25
P(X1=1, X2=1) = 0.25
After knowing Y=(0,1), the input pattern is known with certainty:
P(X1=0, X2=0) = 1
P(X1=1, X2=0) = 0
P(X1=0, X2=1) = 0
P(X1=1, X2=1) = 0
(The slide shows the network diagram and bar plots of the probability of each input pattern, 1 through 4, before and after knowing Y.)
26
Another Example
P(X), before knowing Y:
P(X1=0, X2=0) = 0.25
P(X1=1, X2=0) = 0.25
P(X1=0, X2=1) = 0.25
P(X1=1, X2=1) = 0.25
P(X | Y=(0,1)), after knowing Y=(0,1):
P(X1=0, X2=0) = 0.5
P(X1=1, X2=0) = 0
P(X1=0, X2=1) = 0.5
P(X1=1, X2=1) = 0
27
Thinking in terms of entropy…
Figure: bar plots of P(X) (before knowing Y) and of P(X | Y=(0,1)) (after knowing Y=(0,1)) for three different networks:
Entropy reduced to zero (best representation)
Entropy reduced partially (intermediate representation)
Entropy not reduced at all (worst representation)
28
Mutual Information
I(X;Y) = H(X) - H(X|Y)
Here H(X) is the uncertainty in the input without knowing the output, and H(X|Y) is the uncertainty in the input after you do know the output. The more that knowledge of the output helps you reduce uncertainty about the input, the greater the mutual information between input and output.
Calculating mutual information: H(X|Y) is the conditional entropy, the entropy of P(X | Y=y) averaged over the outputs, H(X|Y) = Σ_y P(Y=y) H(X | Y=y).
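A minimal MATLAB sketch of this calculation from a joint table; the matrix Pxy here is a hypothetical example in which knowing the output narrows the input down to two equally likely patterns (it is not one of the networks from the slides):
% joint table over 4 input patterns (rows) and 2 output patterns (columns)
Pxy = [0.25 0.00; 0.00 0.25; 0.25 0.00; 0.00 0.25];
Px  = sum(Pxy, 2);                        % marginal P(X)
Py  = sum(Pxy, 1);                        % marginal P(Y)
Hx  = -sum(Px(Px>0) .* log2(Px(Px>0)));   % H(X): uncertainty before knowing Y
HxY = 0;                                  % H(X|Y): average uncertainty after knowing Y
for a = 1:numel(Py)
    if Py(a) > 0
        pc  = Pxy(:,a) / Py(a);           % conditional distribution P(X | Y=a)
        HxY = HxY - Py(a) * sum(pc(pc>0) .* log2(pc(pc>0)));
    end
end
MI = Hx - HxY;                            % mutual information, in bits
For this example H(X) = 2 bits and H(X|Y) = 1 bit, so the mutual information is 1 bit, matching the intermediate case pictured above.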
29
Measuring mutual information in a simple neural network
What connectivity maximizes mutual information? The inputs have P(X1=1) = 0.8 and P(X2=1) = 0.7.
30
So what is the probability distribution function?
Assuming the two inputs are independent, the joint pdf over the four input patterns is the product of the marginals:
P(X1=0, X2=0) = (1-0.8)(1-0.7) = 0.06
P(X1=1, X2=0) = (0.8)(1-0.7) = 0.24
P(X1=0, X2=1) = (1-0.8)(0.7) = 0.14
P(X1=1, X2=1) = (0.8)(0.7) = 0.56
(The slide shows these values as a bar plot over input patterns 1 through 4.)
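A short MATLAB sketch of this step, including the entropy of the resulting input distribution (the variable names are mine):
p1 = 0.8;   % P(X1 = 1)
p2 = 0.7;   % P(X2 = 1)
% joint pdf over the four input patterns (0,0), (1,0), (0,1), (1,1), assuming independence
pX = [(1-p1)*(1-p2), p1*(1-p2), (1-p1)*p2, p1*p2];   % = [0.06 0.24 0.14 0.56]
HX = -sum(pX .* log2(pX));                           % entropy of the input, ≈ 1.60 bits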
31
Computing the output pattern for a given input pattern
x=InPat(ll,:)'; % set input x to next pattern
q=V*x; % find the weighted input sum q
y=q>thr; % find thresholded output y
For a given input pattern X, the output pattern Y is completely determined.
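A sketch of how this snippet might be wrapped in a loop over all four input patterns; the weight matrix V, threshold thr, and pattern list InPat below are placeholder values, not the ones from the lab exercise:
V   = [1 0; 0 1];              % placeholder weight matrix: each output copies one input
thr = 0.5;                     % placeholder firing threshold
InPat = [0 0; 1 0; 0 1; 1 1];  % the four possible input patterns, one per row
OutPat = zeros(size(InPat));   % storage for the corresponding output patterns
for ll = 1:size(InPat,1)
    x = InPat(ll,:)';          % set input x to next pattern
    q = V*x;                   % find the weighted input sum q
    y = q > thr;               % find thresholded output y
    OutPat(ll,:) = y';         % record the deterministic output for this input
end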
32
To the lab!