Published by Diana Coffey. Modified over 2 years ago.

1
How much information does a language have? Shannon, C., Prediction and Entropy of Printed English, Bell System Technical Journal, 1951

2
Motivation/Skills

3
Redundancy
The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%. This means that when we write
En_ _ _sh ha_f o_ w_ _t w_ w_ _te i_ dete_ _ _ _e_ b_ t_e str_ct_r_ _ f _ _ _ lang_ _ _ _ a_d H_ _f i_ c_os_n fre_ _ _
Redundancy = 1 - H / H_max
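The redundancy formula on this slide can be tried numerically. The sketch below assumes that H_max is the entropy of equally likely symbols, log2 of the alphabet size, and uses an illustrative input of 2.3 bits per letter over a 27-symbol alphabet (26 letters plus space). Both the function name `redundancy` and those input values are assumptions made for the example, not figures taken from the slides.

```python
import math


def redundancy(h_bits: float, alphabet_size: int) -> float:
    """Return 1 - H / H_max, with H_max = log2(alphabet_size)."""
    h_max = math.log2(alphabet_size)  # entropy if every symbol were equally likely
    return 1.0 - h_bits / h_max


# Assumed illustration: H = 2.3 bits per letter, 27 symbols.
r = redundancy(2.3, 27)
print(f"redundancy = {r:.2f}")
```

With these assumed inputs the result lands near one half, in line with the "roughly 50%" figure quoted above.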

4

5
Entropy: how much information is produced on average for each letter?
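One common way to put a number on "information per letter" is the Shannon entropy of the letter distribution, H = -sum p log2 p. The sketch below estimates it from the letter counts of a string. The function name `letter_entropy` is hypothetical, and counting only alphabetic characters is an assumption made for this example.

```python
import math
from collections import Counter


def letter_entropy(text: str) -> float:
    """Entropy in bits per letter of the empirical letter distribution."""
    counts = Counter(ch for ch in text.upper() if ch.isalpha())
    total = sum(counts.values())
    # Sum of -p * log2(p) over the letters that actually occur.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(letter_entropy("AABB"))  # two equally likely letters
```

A text that uses two letters equally often carries one bit per letter; a text made of a single repeated letter carries none.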

6
French text using no vowel other than e:
L'Evêqe en effet est très streect: le clergé, de temps en temps, se permet de révéler ses préférences envers des événements frenchement débreedés, mets l'évêqe hème qe ses fêtes respectent des règles sévères et les trensgresser, c'est fréqemment reesqer de se fère relegger.
French text without the letter e:
Saisi par l'inspiration, il composa illico un lai, qui, suivant la tradition du Canticum Canticorum Salomonis, magnifiait l'illuminant corps d'Anastasia : Ton corps, un grand galion où j'irai au long-cours, un sloop, un brigantin tanguant sous mon roulis, Ton front, un fort dont j'irai à l'assaut, un bastion, un glacis qui fondra sous l'aquilon du transport qui m'agit,

7
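Letter probabilities (values missing in the source are left blank):
Letter  Probability
E       0.131
T       0.105
A       0.082
O       0.08
N       0.071
R       0.068
I       0.063
S       0.061
H       0.053
D       0.038
L       0.034
F       0.029
C       0.028
M       0.025
U
G       0.02
Y
P
W       0.015
B       0.014
V       0.009
K       0.004
X       0.002
J       0.001
Q
Z       8E-04
Second letter ordering shown on the slide: E A O S R N I D L C T U M P B G Y V Q H F Z J X W K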

8

9
How much information is obtained by adding one letter?
[Figure: the letters S and E combine into SE, beside the letter probability table (E 0.131, T 0.105, A 0.082, ... X 0.002, J 0.001, Q, Z 8E-04)]

10
Fn   Bits per letter
F0   4.75
F1   4.03
F2   3.32
F3   3.1
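The F_N values in this table can be approximated from a text sample. The sketch below assumes the usual definition, F_N = H(N-letter blocks) - H((N-1)-letter blocks) for N >= 1, with F0 = log2 of the alphabet size. The helper names `block_entropy` and `f_n` are hypothetical, and estimates from short samples are biased, so this is an illustration rather than a way to reproduce the table exactly.

```python
import math
from collections import Counter


def block_entropy(text: str, n: int) -> float:
    """Entropy in bits of the empirical distribution of n-letter blocks."""
    if n == 0:
        return 0.0
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / total) for c in grams.values())


def f_n(text: str, n: int) -> float:
    """Estimate F_N (n >= 1): information added by the n-th letter of a block."""
    return block_entropy(text, n) - block_entropy(text, n - 1)


f0 = math.log2(27)  # 26 letters plus space, all equally likely
sample = "ABAB" * 50  # toy text: the second letter is fully determined by the first
print(round(f0, 2), round(f_n(sample, 1), 2), round(f_n(sample, 2), 2))
```

In the toy sample each letter alone carries one bit, but once the previous letter is known the next one adds almost nothing, which is the same effect that makes F2 and F3 smaller than F1 in the table.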

11
Third-order approximation: IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
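Text like the sample above can be produced by drawing each letter according to how often it followed the preceding letters in a real text. The sketch below is one minimal way to do that, assuming that "order 3" means each letter depends on the two before it. The function names, the seed parameter, and the tiny training string are all assumptions for the example.

```python
import random
from collections import defaultdict


def build_model(text: str, order: int) -> dict:
    """Map each (order - 1)-letter context to the letters that followed it."""
    ctx_len = order - 1
    model = defaultdict(list)
    for i in range(len(text) - ctx_len):
        model[text[i:i + ctx_len]].append(text[i + ctx_len])
    return model


def generate(text: str, order: int, length: int, seed: int = 0) -> str:
    """Generate `length` letters whose n-gram statistics follow `text`."""
    rng = random.Random(seed)
    ctx_len = order - 1
    model = build_model(text, order)
    start = rng.randrange(len(text) - ctx_len + 1)
    out = text[start:start + ctx_len]  # begin with a context taken from the text
    while len(out) < length:
        context = out[-ctx_len:] if ctx_len else ""
        # Fall back to any letter of the text if this context never had a follower.
        followers = model.get(context) or list(text)
        out += rng.choice(followers)
    return out[:length]


sample = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG " * 3
print(generate(sample, 3, 40))
```

With a training text this small the output mostly replays fragments of the input; a large corpus is needed before the result starts to resemble the word-like nonsense shown on the slide.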

12
Word #   Word   Probability
1        The    .071
2        of     .034
3        and    .03

Vocabulary size (no. lemmas)   % of content in OEC   Example lemmas
10                             25%                   the, of, and, to, that, have
100                            50%                   from, because, go, me, our, well, way
[lost]                         [lost]%               girl, win, decide, huge, difficult, series
[lost]                         [lost]%               tackle, peak, crude, purely, dude, modest
50,000                         95%                   saboteur, autocracy, calyx, conformist
>1,000,000                     99%                   laggardly, endobenthic, pomological

13
Word #   Word   Probability
1        The    .071
2        of     .034
3        and    .03

14
Zipf's Law
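Zipf's law says that the probability of a word falls off roughly in inverse proportion to its frequency rank. The sketch below compares the three word probabilities listed on the earlier slides with the simplest form of the law, p(rank) = c / rank, where c is set from the first row. That choice of c and the helper names are assumptions made for illustration.

```python
from collections import Counter


def rank_frequencies(text: str) -> list:
    """Return (rank, word, relative frequency) tuples, most frequent first."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return [(rank, word, count / total)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


def zipf_prediction(rank: int, c: float) -> float:
    """Simplest Zipf form: probability inversely proportional to rank."""
    return c / rank


# Word probabilities from the slides: rank -> probability.
observed = {1: 0.071, 2: 0.034, 3: 0.03}
c = observed[1]  # assumption: fit the constant to the top-ranked word
for rank, p in observed.items():
    print(rank, p, round(zipf_prediction(rank, c), 4))
```

`rank_frequencies` can be applied to any text to get the same kind of rank table; the fit is only expected to hold approximately.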

15
Is English trying to warn us?
America ensure oil opportunity bush admit specifically agents smell denied arafat unhealthy
Word #   Word   Probability
1        The    .071
2        of     .034
3        and    .03

16
How to continue? Aoccdrnig to rseearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is that the frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe.

17
Revealing the statistics of the language
Q… words start with q
….q 8 words finish with q
….q ….
Ira 0 qq 0.1

18
Revealing the statistics of the language
THERE IS NO REVERSE ON A MOTORCYCLE FRIEND OF MINE FOUND THIS OUT RATHER DRAMATICALLY THE OTHER DAY
R R R R

19
[Figure axes: number of times guessed; position of the guessed letter]
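The guessing experiment replaces each letter of a message by the number of guesses needed to find it. In the cited paper the guesser is a person; the sketch below substitutes a very simple automatic guesser that ranks candidate letters by how often they followed the previous letter in a reference text. That substitution, the names `train` and `guess_counts`, the alphabetical tie-breaking, and the assumption that the message starts after a space are all choices made for this example.

```python
import string
from collections import Counter, defaultdict

ALPHABET = string.ascii_uppercase + " "  # 26 letters plus space


def train(reference: str) -> dict:
    """Count, for each letter, which letters followed it in the reference text."""
    follow = defaultdict(Counter)
    for prev, nxt in zip(reference, reference[1:]):
        follow[prev][nxt] += 1
    return follow


def guess_counts(message: str, follow: dict) -> list:
    """For each letter of `message`, the number of guesses the model needs."""
    ranks = []
    prev = " "  # assumption: the message starts after a space
    for ch in message:
        # Guess the most frequent follower first; break ties alphabetically.
        order = sorted(ALPHABET, key=lambda c: (-follow[prev][c], c))
        ranks.append(order.index(ch) + 1)  # ch must be in ALPHABET
        prev = ch
    return ranks


follow = train("THE THE THE ")
print(guess_counts("THE", follow))
```

The list of guess numbers is the reduced form of the message: a better predictor produces more 1s, and the frequencies of the guess numbers are what the later slides analyze.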

20
What is the probability of finding the number 1 in the third position? THE REV ERS MOT THA 112

21
THE ANT ERS MOT HER 222
THA HEN ERS TH_ AN_ 312
HE_ REV ERS MOT AND 311
LASCU
Probability of finding the number I in place N

22
Bounds
THERE IS NO REVERSE ON A MOTORCYCLE
F0 (all letters have the same probability)
F1 (each letter has its own probability)
F2 (correlation of two letters)
F0 (all numbers have the same probability)
F1 (each number has its own probability)
F2 (correlation of two numbers)
FN

23
Bounds
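The cited Shannon paper bounds the N-gram entropy using only the frequencies q_i with which the correct letter was found on the i-th guess: the upper bound is -sum q_i log2 q_i, and the lower bound is sum i (q_i - q_{i+1}) log2 i. The sketch below computes both from a list of guess numbers. The function name `entropy_bounds` and the default of 27 symbols are assumptions, and the lower bound is only meaningful when the q_i are non-increasing, as they would be for an ideal guesser.

```python
import math
from collections import Counter


def entropy_bounds(ranks: list, alphabet_size: int = 27) -> tuple:
    """Return (lower, upper) bounds in bits per letter from guess numbers."""
    counts = Counter(ranks)
    n = len(ranks)
    # q[i - 1] is the fraction of letters found on guess i; the extra 0.0 is q_{size+1}.
    q = [counts.get(i, 0) / n for i in range(1, alphabet_size + 1)] + [0.0]
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    lower = sum(i * (q[i - 1] - q[i]) * math.log2(i)
                for i in range(1, alphabet_size + 1))
    return lower, upper


# Toy guess numbers: three letters found on the first guess, one on the second.
print(entropy_bounds([1, 1, 1, 2]))
```

If every letter is found on the first guess both bounds are zero, and if all guess numbers are equally likely both bounds equal log2 of the alphabet size.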

24
Entropy

25

26

27
Bounds
Redundancy ~ 75%
Fn   Bits per letter
F0   4.75
F1   4.03
F2   3.32
F3   3.1
