
1 Unsupervised Learning of Natural Language Morphology using MDL John Goldsmith November 9, 2001

2 Unsupervised learning. Input: untagged text in orthographic or phonetic form, with spaces (or punctuation) separating words, but no tagging or text preparation.

3 Output: a list of stems, suffixes, and prefixes, and a list of signatures. A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. Hence, a stem in a corpus has a unique signature, and a signature has a unique set of stems associated with it.

4 (example of signature in English) Signature NULL.ed.ing.s, with stems ask, call, point:
ask: ask, asked, asking, asks
call: call, called, calling, calls
point: point, pointed, pointing, points
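To make the signature idea concrete, here is a minimal sketch (illustrative code, not the author's implementation) that expands this signature into the word forms it licenses:

```python
# A signature pairs a set of suffixes with the set of stems that take
# exactly those suffixes; "NULL" stands for the empty suffix.
signature = ["NULL", "ed", "ing", "s"]
stems = ["ask", "call", "point"]

def expand(stem, suffixes):
    """List every surface form licensed by a stem and a signature."""
    return [stem + ("" if suf == "NULL" else suf) for suf in suffixes]

for stem in stems:
    print(stem, "->", expand(stem, signature))
# ask -> ['ask', 'asked', 'asking', 'asks']
# call -> ['call', 'called', 'calling', 'calls']
# point -> ['point', 'pointed', 'pointing', 'points']
```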

5 Minimum Description Length (MDL)  Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989)  Work by Michael Brent and Carl de Marcken on word-discovery using MDL

6 Essence of MDL. We are given 1. a corpus, and 2. a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes. ("Given"? Given by whom? We'll get back to that.) (Remember: a distribution is a set of non-negative numbers summing to 1.0.)

7 The higher the probability that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data. Better said: -1 * log probability(corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data. This is known as the optimal compressed length of the data, given the model. Using base-2 logs, this number is a measure in information-theoretic bits.
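The "compressed length" idea can be illustrated with a toy calculation (my own sketch; the tiny distribution and corpus below are invented purely for illustration):

```python
import math

# Toy distribution over words (non-negative numbers summing to 1.0) --
# invented for illustration only.
model = {"the": 0.5, "dog": 0.25, "dogs": 0.25}

# A toy "corpus" of observed tokens.
corpus = ["the", "dog", "the", "dogs"]

# Optimal compressed length of the data given the model, in bits:
# -log2 P(corpus) = sum over tokens of -log2 P(token)
compressed_length = sum(-math.log2(model[w]) for w in corpus)
print(compressed_length)  # 1 + 2 + 1 + 2 = 6.0 bits
```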

8 Essence of MDL…  The goodness of the morphology is also measured by how compact the morphology is.  We can measure the compactness of a morphology in information theoretic bits.

9 How can we measure the compactness of a morphology?  Let’s consider a naïve version of description length: count the number of letters.  This naïve version is nonetheless helpful in seeing the intuition involved.

10 Naive Minimum Description Length
Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs (total: 61 letters)
Analysis:
Stems: jump, laugh, sing, sang, dog (20 letters)
Suffixes: s, ing, ed (6 letters)
Unanalyzed: the (3 letters)
Total: 29 letters.
Notice that the description length goes UP if we analyze sing into s+ing.
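The letter-counting measure on this slide can be written out directly (an illustrative sketch, not part of the system itself):

```python
def naive_description_length(stems, suffixes, unanalyzed):
    """Naive model length: simply count the letters in every list."""
    return sum(len(item) for item in stems + suffixes + unanalyzed)

stems = ["jump", "laugh", "sing", "sang", "dog"]   # 20 letters
suffixes = ["s", "ing", "ed"]                      # 6 letters
unanalyzed = ["the"]                               # 3 letters
print(naive_description_length(stems, suffixes, unanalyzed))  # 29
```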

11 Essence of MDL… The best overall theory of a corpus is the one for which the sum -1 * log prob(corpus) + length of the morphology (that's the description length) is the smallest.
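Putting the two pieces together, the quantity being minimized looks roughly like this (a sketch; `prob` and `model_length_in_bits` are placeholders for whatever probability model and morphology encoding one has chosen):

```python
import math

def description_length(corpus, morphology, prob, model_length_in_bits):
    """Total description length, in bits:
    -log2 P(corpus | morphology) + length of the morphology itself."""
    compressed_data = sum(-math.log2(prob(morphology, word)) for word in corpus)
    return compressed_data + model_length_in_bits(morphology)
```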

12 Essence of MDL…

13 Overall logic  Search through morphology space for the morphology which provides the smallest description length.

14 Corpus Pick a large corpus from a language -- 5,000 to 1,000,000 words.

15 Feed the corpus into the "bootstrapping" heuristic...

16 Out of the bootstrap heuristic comes a preliminary morphology, which need not be superb.

17 Feed that morphology to the incremental heuristics...

18 Out comes a modified morphology.

19 Is the modification an improvement? Ask MDL!

20 If it is an improvement, replace the morphology with the modified morphology (the old one goes to the garbage)...

21 Send the new morphology back to the incremental heuristics again...

22 Continue until there are no improvements to try.
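The flowchart in slides 14-22 amounts to a greedy search loop. Here is a hedged sketch of that loop; `bootstrap`, `propose_modifications`, and `description_length` are placeholders for the components described on the surrounding slides, passed in as functions:

```python
def learn_morphology(corpus, bootstrap, propose_modifications, description_length):
    """Greedy MDL search: keep a proposed change only if it shortens
    the total description length of morphology + compressed corpus."""
    morphology = bootstrap(corpus)                    # slides 15-16
    improved = True
    while improved:
        improved = False
        for candidate in propose_modifications(morphology, corpus):   # slides 17-18
            # Is the modification an improvement?  Ask MDL!  (slide 19)
            if description_length(candidate, corpus) < description_length(morphology, corpus):
                morphology = candidate                # old morphology goes to the garbage
                improved = True
    return morphology                                 # stop when nothing improves (slide 22)
```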

23 1. Bootstrap heuristic. A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes. In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so the search space has at least ∏ |w| members, the product taken over all the words in the corpus.

24 Better bootstrap heuristics. Heuristic, not perfection! There are several good heuristics; the best is a modification of a good idea of Zellig Harris (1955). Current variant: cut words at certain peaks of successor frequency. Problems: it can over-cut; it can under-cut.

25 Successor frequency: g o v e r n. Empirically, only one letter follows "gover": "n".

26 Successor frequency: g o v e r n m. Empirically, 6 letters follow "govern": m, i, o, s, e, # (end of word).

27 Successor frequency: empirically, 1 letter follows "governm": "e". Along the word, the successor frequencies run g o v e r (1) n (6) m (1) e: the 6 after "govern" is a peak of successor frequency.
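A small sketch of the successor-frequency computation (assuming the corpus is simply a set of word strings; following the slides, "#" stands for the end-of-word boundary; the word list below is invented for illustration):

```python
def successor_frequency(words, prefix):
    """Number of distinct letters that follow `prefix` in the word list;
    the end-of-word boundary counts as the pseudo-letter '#'."""
    successors = set()
    for w in words:
        if w.startswith(prefix):
            successors.add(w[len(prefix)] if len(w) > len(prefix) else "#")
    return len(successors)

words = {"govern", "governs", "governed", "governing", "government",
         "governments", "governmental", "governor"}
print(successor_frequency(words, "gover"))    # 1  (only "n")
print(successor_frequency(words, "govern"))   # 6  (m, i, o, s, e, #)
print(successor_frequency(words, "governm"))  # 1  (only "e")
```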

28 Lots of errors… For c o n s e r v a t i v e s the successor frequencies run 9 18 11 6 4 1 2 1 1 2 1 1; the cuts these peaks suggest are partly wrong and partly right.

29 Even so… We set conditions: accept cuts with stems at least 5 letters in length; demand that the successor frequency be a clear peak: 1 … N … 1 (e.g. govern-ment). Then, for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems.
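A sketch of how these conditions might be applied on top of the successor-frequency function above (pass `successor_frequency` in as `sf`). The thresholds (stem length 5, 5 stems per signature) and the 1 … N … 1 peak test come from the slide; everything else, including the function names, is illustrative:

```python
from collections import defaultdict

def peak_cut(words, w, sf, min_stem_len=5):
    """Return (stem, suffix) if w shows a clear successor-frequency peak
    1 ... N ... 1 at a stem of at least min_stem_len letters, else None.
    (The bare-stem case, i.e. the NULL suffix, is left to later heuristics.)"""
    for i in range(min_stem_len, len(w)):
        before, at, after = w[:i - 1], w[:i], w[:i + 1]
        if sf(words, before) == 1 and sf(words, at) > 1 and sf(words, after) == 1:
            return at, w[i:]
    return None

def harvest_signatures(words, sf, min_stems=5):
    """Group stems by their suffix sets; keep signatures with enough stems."""
    suffixes_of = defaultdict(set)
    for w in words:
        cut = peak_cut(words, w, sf)
        if cut is not None:
            stem, suffix = cut
            suffixes_of[stem].add(suffix)
    signatures = defaultdict(set)
    for stem, sufs in suffixes_of.items():
        signatures[tuple(sorted(sufs))].add(stem)
    return {sig: st for sig, st in signatures.items() if len(st) >= min_stems}
```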

30 2. Incremental heuristics, coarse-grained to fine-grained. 1. Stems and suffixes to split: accept any analysis of a word if it consists of a known stem and a known suffix. 2. Loose fit: suffixes and signatures to split: collect any string that precedes a known suffix, find all of its apparent suffixes, and use MDL to decide if it's worth it to do the analysis. We'll return to this in a moment.
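The first, coarse-grained heuristic is easy to state in code (an illustrative sketch; the sets of known stems and suffixes are assumed to come from the bootstrap step):

```python
def split_known(word, stems, suffixes):
    """Return (stem, suffix) if the word is exactly a known stem followed
    by a known suffix; otherwise return None."""
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        if stem in stems and suffix in suffixes:
            return stem, suffix
    return None

print(split_known("pointed", {"point", "call"}, {"ed", "ing", "s"}))  # ('point', 'ed')
```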

31 Incremental heuristic 3. Slide the stem-suffix boundary to the left: again, use MDL to decide. How do we use MDL to decide?

32 Using MDL to judge a potential stem: act, acted, action, acts, acting. We have the suffixes NULL, ed, ion, ing, and s, but no signature NULL.ed.ion.ing.s. Let's compute the cost versus the savings of the signature NULL.ed.ion.ing.s. Savings: stem savings: 4 redundant copies of the stem act, that's 3 x 4 = 12 letters, or almost 60 bits.

33 Cost of NULL.ed.ion.ing.s: a pointer to each suffix, plus a pointer to the signature. To give a feel for this: the total cost of the suffix pointers is about 30 bits; adding the cost of the pointer to the signature gives the total cost.

34 Cost of the signature: about 45 bits. Savings: about 60 bits. So MDL says: do it! Analyze the words as stem + suffix. Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already "existed".
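A back-of-the-envelope check of these numbers (treating a letter as costing log2 26, about 4.7 bits, and taking the roughly 45-bit signature cost from the slide as given):

```python
import math

bits_per_letter = math.log2(26)        # about 4.7 bits per letter

# Savings: 4 redundant copies of the 3-letter stem "act" disappear.
savings = 4 * 3 * bits_per_letter      # about 56 bits ("almost 60")

# Cost of the new signature (suffix pointers etc.): about 45 bits per the slide.
cost = 45

print(round(savings, 1))               # 56.4
print(savings > cost)                  # True -> MDL says: do the analysis
```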

35 Repair heuristics: using MDL. We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compare the two lengths: original morphology + compressed data versus revised morphology + compressed data.

