# Language Learning, Week 6. Pieter Adriaans, Sophia Katrenko


Language Learning Week 6 Pieter Adriaans: pietera@science.uva.nl Sophia Katrenko: katrenko@science.uva.nl

Contents Week 5
- Information theory
- Learning as Data Compression
- Learning regular languages using DFA
- Minimum Description Length Principle

The minimum description length (MDL) principle (J. Rissanen): the best theory to explain a set of data is the one which minimizes the sum of
- the length, in bits, of the description of the theory, and
- the length, in bits, of the data when encoded with the help of the theory.
Data = d; theory t1 yields encoded data t1(d); theory t2 yields encoded data t2(d).
If |t2| + |t2(d)| < |t1| + |t1(d)| < |d|, then t2 is the best theory.
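As a toy illustration (the bit counts below are invented for illustration, not from the lecture), the MDL comparison is just a sum of two code lengths:

```python
# Hypothetical two-part code comparison; all lengths are in bits.

def two_part_length(theory_bits, encoded_data_bits):
    """MDL score of a theory t for data d: |t| + |t(d)|."""
    return theory_bits + encoded_data_bits

raw_data_bits = 1000            # |d|: the data stored literally
t1 = two_part_length(50, 800)   # weak theory: cheap to state, compresses little
t2 = two_part_length(200, 300)  # richer theory: costs more, compresses better

print(min(t1, t2))  # 500: |t2| + |t2(d)| < |t1| + |t1(d)| < |d|, so t2 wins
```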

Local, incremental learning (different compression paths)
Data = d; Theory = t; Encoded data = t(d)
[Diagram: the data is compressed in stages along a path: t1 encodes one part of the data, t2 then encodes a further part together with t1's output, and t3 encodes the remainder, accumulating the theory t3(t2(t1(d))) step by step]

Learnability
- Non-compressible sets
- Non-constructively compressible sets
- Constructively compressible sets
- Learnable sets = locally, efficiently, incrementally compressible sets

Regular Languages: Deterministic Finite Automata (DFA)
[Diagram: a four-state DFA with states 0-3 and transitions labelled a and b]
DFA = NFA (non-deterministic) = REG: all three have the same expressive power.
Example: {w ∈ {a,b}* | #a(w) and #b(w) both even}; sample strings: aa, abab, abaaaabbb.

Learning DFA’s: 2) MCA(S+), the Maximal Canonical Automaton
S+ = {c, abc, ababc}
[Diagram: MCA with states 0-9, one disjoint path per string in S+, labelled with a, b, c]
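A minimal sketch (my own helper, not the lecture's code) of building the related prefix tree acceptor PTA(S+); the MCA instead gives every string in S+ its own disjoint path, and merging the shared prefixes of the MCA yields the PTA:

```python
# Build the prefix tree acceptor PTA(S+) from positive examples.

def build_pta(positives):
    """Return (transitions, accepting states); state 0 is the start state."""
    transitions = {}          # (state, symbol) -> state
    accepting = set()
    next_state = 1
    for word in positives:
        state = 0
        for symbol in word:
            if (state, symbol) not in transitions:
                transitions[(state, symbol)] = next_state
                next_state += 1
            state = transitions[(state, symbol)]
        accepting.add(state)  # the state reached at the end of a word accepts
    return transitions, accepting

trans, acc = build_pta(["c", "abc", "ababc"])
# the shared prefix "ab" of "abc" and "ababc" is stored only once
```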

Learning DFA’s: 4) State Merging (Oncina & Vidal 92, Lang 98)
S+ = {c, abc, ababc}
[Diagram: automaton over states 0-9 after evidence-driven merging of states 0 and 2]

Learning DFA’s: 4) State Merging (Oncina & Vidal 92, Lang 98)
S+ = {c, abc, ababc}
[Diagram: automaton after evidence-driven merging of states 1,3 and 8,9]

Learning DFA’s: 4) State Merging (Oncina & Vidal 92, Lang 98)
S+ = {c, abc, ababc}
[Diagram: automaton after evidence-driven merging of states 0 and 4]

Learning DFA’s: 4) State Merging (Oncina & Vidal 92, Lang 98)
S+ = {c, abc, ababc}
[Diagram: automaton after evidence-driven merging of states 9 and 5]

Learning DFA’s: 4) State Merging (Oncina & Vidal 92, Lang 98)
S+ = {c, abc, ababc}
[Diagram: final automaton over states 0, 1 and 9 after merging states 0,4 and 9,5]

Learning DFA’s via evidence-driven state merging
Input: S+, S-. Output: a DFA.
1) Form MCA(S+)
2) Form PTA(S+)
3) Do until no merging is possible:
   - choose a merging of two states
   - perform the cascade of forced mergings needed to obtain a deterministic automaton
   - if the resulting DFA accepts sentences of S-, backtrack and choose another couple
4) End
Drawback: we need negative examples!
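The merge-and-cascade step can be sketched compactly (my own simplification, not the lecture's code): merge two states, cascade the forced merges needed to keep the automaton deterministic, and reject the merge if the result accepts a negative example.

```python
def find(parent, s):
    """Follow parent links to the representative of s's merged class."""
    while parent[s] != s:
        s = parent[s]
    return s

def try_merge(trans, accepting, states, a, b, negatives):
    """Merge states a and b; return the merged DFA, or None on S- conflict."""
    parent = {s: s for s in states}
    parent[find(parent, b)] = find(parent, a)
    changed = True
    while changed:                        # cascade of forced mergings
        changed = False
        targets = {}                      # (class, symbol) -> target class
        for (s, c), t in trans.items():
            key = (find(parent, s), c)
            tc = find(parent, t)
            if key in targets:
                u = find(parent, targets[key])
                if u != tc:               # nondeterminism: force a merge
                    parent[u] = tc
                    changed = True
            targets[key] = tc
    merged_trans = {(find(parent, s), c): find(parent, t)
                    for (s, c), t in trans.items()}
    merged_acc = {find(parent, s) for s in accepting}
    start = find(parent, 0)

    def accepts(word):
        state = start
        for ch in word:
            if (state, ch) not in merged_trans:
                return False
            state = merged_trans[(state, ch)]
        return state in merged_acc

    if any(accepts(w) for w in negatives):
        return None                       # DFA accepts a sentence of S-
    return merged_trans, merged_acc, start

# PTA of S+ = {c, abc}; merging states 0 and 2 yields a*(c | bc), which
# wrongly accepts "ac", so the merge must be undone when "ac" is in S-.
pta = {(0, 'c'): 1, (0, 'a'): 2, (2, 'b'): 3, (3, 'c'): 4}
assert try_merge(pta, {1, 4}, [0, 1, 2, 3, 4], 0, 2, ["ac"]) is None
assert try_merge(pta, {1, 4}, [0, 1, 2, 3, 4], 0, 2, ["bb"]) is not None
```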

Learning DFA using only positive examples with MDL
S+ = {c, cab, cabab, cababab, cababababab}
[Diagram: L1, a one-state automaton looping on a, b and c in state 0; L2, a three-state automaton with 0 -c-> 1 and a cycle 1 -a-> 2 -b-> 1]
Coding in bits (counting arrows, letters plus the empty letter, and states plus the outside world):
|L1| ≈ 5 · log2(3+1) · 2 · log2(1+1) = 20
|L2| ≈ 5 · log2(3+1) · 2 · log2(1+3) = 40
But L1 has 4 choices in state 0: |L1(S+)| = 26 · log2 4 = 52
L2 has 2 choices in state 1: |L2(S+)| = 16 · log2 2 = 16
|L2| + |L2(S+)| = 40 + 16 < |L1| + |L1(S+)| = 20 + 52
So L2 is the better theory according to MDL.
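The slide's arithmetic can be transcribed directly (the counting convention, 5 arrows, 3 letters plus an empty letter, and states plus the outside world, is taken from the slide as given):

```python
from math import log2

L1 = 5 * log2(3 + 1) * 2 * log2(1 + 1)   # one-state automaton: 20 bits
L2 = 5 * log2(3 + 1) * 2 * log2(1 + 3)   # three-state automaton: 40 bits
L1_data = 26 * log2(4)                   # 4 choices in state 0: 52 bits
L2_data = 16 * log2(2)                   # 2 choices in state 1: 16 bits

assert (L1, L2, L1_data, L2_data) == (20, 40, 52, 16)
assert L2 + L2_data < L1 + L1_data       # 56 < 72: MDL prefers L2
```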

Learning DFA’s using only positive examples with MDL
Input: S+. Output: a DFA.
1) Form MCA(S+)
2) Form DFA = PTA(S+)
3) Do until no merging is possible:
   - choose a merging of two states
   - perform the cascade of forced mergings to obtain a deterministic automaton DFA'
   - if |DFA'| + |DFA'(S+)| ≥ |DFA| + |DFA(S+)|, backtrack and choose another couple
4) End
Drawback: local minima!

Base case: 2-part code optimization
[Diagram: observed data is compressed (non-loss) by learning into a learned theory consisting of a program and an input; |Theory| < |Data|]

Paradigm case: a finite binary string
Data: 000110100110011111010101011010100000 (irregular: no theory shorter than the data itself)
Data: 010101010101010101010101010101010101
Theory: the program "for i = 1 to x print y" with input x = 18, y = '01'
|Theory| = |Program| + |Input| < |Data|
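The regular string can be checked with a literal rendering of the slide's program (a sketch; the bit-level encoding of x and y is not spelled out here):

```python
def run(x, y):
    """The 'theory': for i = 1 to x, print y."""
    return y * x

data = "01" * 18                 # the regular 36-bit string from the slide
assert run(18, "01") == data
# the constant-size loop plus the short input (x = 18, y = "01") is a far
# shorter description than the 36 literal bits
```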

Unsupervised Learning
[Diagram: an unknown system (a non-random, computational process) turns input into observed output; learning is non-loss compression of that output into a learned theory (program + input), with |Theory| < |Data|]

Supervised Learning
[Diagram: as in unsupervised learning, but the input to the unknown system is also observed; the learned theory maps observed input to observed output, with |Theory| < |Data|]

Adaptive System
[Diagram: the learning program runs coupled to the unknown system (a non-random, computational process), updating its learned theory from observed input and observed output while operating]

Agent System
[Diagram: several adaptive systems interacting with the unknown system (a non-random, computational process)]

Scientific Text: Bitterbase (Unilever)
The bitter taste of naringin and limonin was not affected by glutamic acid [rmflav 160] Exp.Ok;; Naringin, the second of the two bitter principles in citrus, has been shown to be a depressor of limonin bitterness detection thresholds [rmflav 1591];; Florisil reduces bitterness and tartness without altering ascorbic acid and soluble solids (primarily sugars) content [rmflav 584];; The influence of pH on the system was studied. The best substrate for Rhodococcus fascians at pH 7.0 was limonoate, whereas at pH 4.0 to 5.5 it appeared to be limonin. Results suggest that the citrus juice debittering process starts only once the natural precursor of limonin (limonoate A-ring lactone) has been transformed into limonin, the equilibrium displacement being governed by the citrus juice pH. [rmflav 474][rmflav 504];; Limonin D-ring lactone hydrolase, the enzyme catalysing the reversible lactonization/hydrolysis of the D-ring in limonin, has been purified from citrus seeds and immobilized on Q-Sepharose to produce homogeneous limonoate A-ring lactone solutions. The immobilized limonin D-ring lactone hydrolase showed good operational stability and was stable after sixty to seventy operations and storage at 4°C for six months.

Study of Benign Distributions

Colloquial Speech: Corpus Spoken Dutch
"omdat ik altijd iets met talen wilde doen." "dat stond in elk geval uh voorop bij mij." "en Nederlands leek me leuk." "da's natuurlijk een erg afgezaagd antwoord maar dat was 't wel." "en uhm ik ben d'r maar gewoon aan begonnen aan de en ik uh heb 't met uh ggg gezondheid." "ggg." "ik heb 't met uh met veel plezier gedaan." "ja prima." "ja 'k vind 't nog steeds leuk."
(English gloss: a speaker explains that she started studying Dutch because she always wanted to do something with languages, admits that is a clichéd answer, and says she did it with great pleasure and still enjoys it.)

Study of Benign Distributions

Motherese: Sarah-Jaqueline
*JAC: kijk, hier heb je ook puzzeltjes.
*SAR: die (i)s van mij.
*JAC: die zijn van jouw, ja.
*SAR: die (i)s +...
*JAC: kijken wat dit is.
*SAR: kijken.
*JAC: we hoeven natuurlijk niet alle zooi te bewaren.
*SAR: en die.
*SAR: die (i)s van mij, die.
*JAC: die is niet kompleet.
*JAC: die legt mamma maar terug.
*SAR: die (i)s van mij.
*SAR: xxx.
*SAR: die ga in de kast, deze.
*JAC: die ["], ja.
*JAC: molenspel.
*SAR: mole(n)spel ["].
(English gloss: a Dutch mother-child dialogue while sorting puzzles; the child repeatedly says "die is van mij", "that one is mine".)

Study of Benign Distributions

[Diagram: word-frequency distribution with a structured high-frequency core and a heavy low-frequency tail]

Power laws: log_c y = -a · log_c x + b, i.e. y = c^b · x^(-a)
[Diagram: a straight line with slope -a in log-log coordinates (log x against log y)]
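The identity can be verified numerically (the constants a, b, c below are arbitrary choices of mine):

```python
from math import log

# if log_c(y) = -a * log_c(x) + b, then y = c**b * x**(-a), and vice versa
a, b, c = 1.5, 2.0, 10.0
for x in (1.0, 2.0, 5.0, 100.0):
    y = c**b * x**(-a)
    assert abs(log(y, c) - (-a * log(x, c) + b)) < 1e-9  # straight line in log-log
```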

Observation
- Word frequencies in human utterances are dominated by power laws: a high-frequency core and a low-frequency heavy tail.
- Hypothesis: language is open and grammar is elastic; the occurrence of new words is a natural phenomenon, so syntactic/semantic bootstrapping must play an important role in language learning.
- Bootstrapping might be important for ontology learning as well as for child language acquisition.
- A better understanding of these distributions is necessary.
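A toy illustration (my example sentence, not the lecture's corpus) of the core/tail split:

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog saw the cat run".split()
ranked = Counter(corpus).most_common()       # high-frequency core first
hapaxes = [w for w, n in ranked if n == 1]   # the low-frequency tail

print(ranked[0], len(hapaxes))  # ('the', 4) 7: one frequent word, many rare ones
```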

Appropriate prior distributions
- Gaussian: human life span, duration of movies
- Power law: word-class frequencies, gross earnings of movies, length of poems
- Erlang: reigns of Pharaohs, number of years spent in office by members of the U.S. House of Representatives
(Griffiths & Tenenbaum, Psychological Science, 2006)

‘Illusions’ caused by inappropriate use of prior distributions
- Casino: we see our loss as an investment (cf. survival of the fittest: the harder you try, the bigger your chance of success).
- The Monty Hall paradox (aka Marilyn and the goats).
- A Dutch book (against an agent) is a series of bets, each acceptable to the agent, but which collectively guarantee her loss, however the world turns out.
- Harvard medical school test: there are no false negatives, 1/1000 of the tests is a false positive, and 1/1,000,000 of the population has the disease.
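Pushing the slide's Harvard Medical School numbers through Bayes' rule shows the illusion: even with no false negatives, a positive test almost never means disease.

```python
prevalence = 1 / 1_000_000     # 1 in a million has the disease
false_pos = 1 / 1000           # 1 in a thousand healthy people tests positive
sensitivity = 1.0              # "there are no false negatives"

p_positive = sensitivity * prevalence + false_pos * (1 - prevalence)
p_disease = sensitivity * prevalence / p_positive
print(round(p_disease, 6))     # 0.000999: about 1 in 1000 positives is real
```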

[Images: the same picture at 25 %, 50 %, 75 % and 100 % noise]

Two-part code optimization: Data = Theory + Theory(Data)
[Diagram: each noisy image decomposes as the clean image (the theory) plus 25 %, 50 %, 75 % or 100 % noise; at 100 % noise the image is pure noise]

[Images: JPEG versions of the noisy pictures, with file sizes 7 Kb, 8 Kb and 7 Kb]

Fact: standard data compression algorithms do an excellent job when one wants to study learning as data compression.
