Unsupervised Learning of Natural Language Morphology using MDL John Goldsmith November 9, 2001

Unsupervised learning  Input: untagged text in orthographic or phonetic form  with spaces (or punctuation) separating words.  But no tagging or text preparation.

Output  List of stems, suffixes, and prefixes  List of signatures. A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. Hence, a stem in a corpus has a unique signature, and a signature has a unique set of stems associated with it.  …

(example of signature in English)

NULL.ed.ing.s: ask, call, point

ask → ask, asked, asking, asks
call → call, called, calling, calls
point → point, pointed, pointing, points

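To make the data structure concrete, here is a minimal sketch (illustrative names, not the actual Linguistica implementation) of signatures as a map from a suffix set to the stems that share it:

```python
from collections import defaultdict

# Illustrative only: each stem listed with the suffixes it appears with.
stem_suffixes = {
    "ask":   {"NULL", "ed", "ing", "s"},
    "call":  {"NULL", "ed", "ing", "s"},
    "point": {"NULL", "ed", "ing", "s"},
}

signatures = defaultdict(set)   # signature name -> set of stems
for stem, suffixes in stem_suffixes.items():
    # Canonical, order-independent name, with NULL listed first.
    sig = ".".join(sorted(suffixes, key=lambda s: (s != "NULL", s)))
    signatures[sig].add(stem)

print(dict(signatures))   # {'NULL.ed.ing.s': {'ask', 'call', 'point'}}
```
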
Minimum Description Length (MDL)  Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989)  Work by Michael Brent and Carl de Marcken on word-discovery using MDL

Essence of MDL We are given 1. a corpus, and 2. a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes. (“Given”? Given by whom? We’ll get back to that.) (Remember: a distribution is a set of non-negative numbers summing to 1.0.)

 The higher the probability that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data.  Better said: −log₂ prob(corpus) is a measure of how well the morphology models the data: the smaller that number, the better the morphology models the data. This is known as the optimal compressed length of the data, given the model. Using base-2 logs, this number is measured in information-theoretic bits.

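As a toy sketch (the model and its probabilities are hypothetical), the compressed length of a corpus under a model that assigns each word token an independent probability:

```python
import math

def compressed_length_bits(corpus, prob):
    """-log2 P(corpus): the optimal compressed length of the data,
    given the model; smaller means the model fits the data better."""
    return -sum(math.log2(prob[word]) for word in corpus)

corpus = ["the", "dog", "jumps", "the", "dog"]
prob = {"the": 0.4, "dog": 0.4, "jumps": 0.2}   # a distribution: sums to 1.0
print(round(compressed_length_bits(corpus, prob), 2))   # about 7.61 bits
```
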
Essence of MDL…  The goodness of the morphology is also measured by how compact the morphology is.  We can measure the compactness of a morphology in information theoretic bits.

How can we measure the compactness of a morphology?  Let’s consider a naïve version of description length: count the number of letters.  This naïve version is nonetheless helpful in seeing the intuition involved.

Naive Minimum Description Length Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs. Total: 61 letters. Analysis: Stems: jump, laugh, sing, sang, dog (20 letters). Suffixes: s, ing, ed (6 letters). Unanalyzed: the (3 letters). Total: 29 letters. Notice that the description length goes UP if we analyze sing into s+ing.

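The letter-count computation from the slide, as a short sketch:

```python
# Naive description length: just count letters in each list.
stems      = ["jump", "laugh", "sing", "sang", "dog"]
suffixes   = ["s", "ing", "ed"]
unanalyzed = ["the"]

def letters(items):
    return sum(len(item) for item in items)

print(letters(stems), letters(suffixes), letters(unanalyzed))     # 20 6 3
print(letters(stems) + letters(suffixes) + letters(unanalyzed))   # 29, down from 61
```
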
Essence of MDL… The best overall theory of a corpus is the one for which the sum of  −log prob (corpus) + length of the morphology (that’s the description length) is the smallest.

Overall logic  Search through morphology space for the morphology which provides the smallest description length.

[Flowchart built up across the following slides: Corpus → Bootstrap heuristic → Morphology → incremental heuristics → modified morphology → MDL check]

Pick a large corpus from a language -- 5,000 to 1,000,000 words.

Feed it into the “bootstrapping” heuristic...

...out of which comes a preliminary morphology, which need not be superb.

Feed that to the incremental heuristics...

...out comes a modified morphology.

Is the modification an improvement? Ask MDL!

If it is an improvement, replace the old morphology with the modified one (the old one is discarded)...

...and send the result back to the incremental heuristics again.

Continue until there are no improvements to try.

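A sketch of that loop in code. The helper functions passed in (bootstrap, propose_modifications, description_length) are hypothetical stand-ins for the heuristics described next:

```python
def learn_morphology(corpus, bootstrap, propose_modifications, description_length):
    """Greedy MDL search (sketch): a proposed change to the morphology
    is kept only if it lowers the total description length."""
    morphology = bootstrap(corpus)                    # preliminary analysis
    improved = True
    while improved:                                   # until nothing improves
        improved = False
        for candidate in propose_modifications(morphology, corpus):
            if (description_length(candidate, corpus)
                    < description_length(morphology, corpus)):
                morphology = candidate                # keep; discard the old one
                improved = True
    return morphology
```
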
1. Bootstrap heuristic  A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes.  In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so the search space has at least ∏w |w| members (the product of the word lengths over the whole lexicon).

Better bootstrap heuristics Heuristic, not perfection! There are several good heuristics; the best is a modification of a good idea of Zellig Harris (1955). Current variant: cut words at certain peaks of successor frequency. Problems: it can over-cut; it can under-cut.

Successor frequency g o v e r n Empirically, only one letter follows “gover”: “n”

Successor frequency g o v e r n m Empirically, 6 letters follow “govern”: “m”, “i”, “o”, “s”, “e”, “#”.

Successor frequency g o v e r n m e Empirically, 1 letter follows “governm”: “e”. Putting the counts together: g o v e r (1) n (6) m (1) e, with the peak of successor frequency at “govern”.

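A minimal sketch of successor frequency (the word list is invented for illustration): for each prefix seen in the corpus, count the distinct letters that can follow it, with “#” marking end-of-word:

```python
from collections import defaultdict

def successor_frequencies(words):
    """Map each word prefix to the number of distinct letters
    (or '#' for end-of-word) that follow it in the corpus."""
    followers = defaultdict(set)
    for word in words:
        for i in range(1, len(word) + 1):
            followers[word[:i]].add(word[i] if i < len(word) else "#")
    return {prefix: len(s) for prefix, s in followers.items()}

words = ["govern", "government", "governments", "governing",
         "governor", "governed", "governs"]
sf = successor_frequencies(words)
print(sf["gover"], sf["govern"], sf["governm"])   # 1 6 1: a peak at "govern"
```
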
Lots of errors… c o n s e r v a t i v e s: the successor-frequency peaks propose three cuts here, and only the middle one is right (wrong, right, wrong).

Even so… We set conditions: accept cuts only if the stem is at least 5 letters in length; demand that successor frequency show a clear peak: 1 … N … 1 (e.g. govern-ment). Then, for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems.

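Those conditions, sketched in code on top of successor_frequencies above (thresholds from the slide; details are simplified, e.g. the NULL suffix is ignored):

```python
from collections import defaultdict

def bootstrap_signatures(words, sf, min_stem=5, min_stems=5):
    """Sketch: cut words at clear 1...N...1 successor-frequency peaks,
    then keep only signatures shared by at least `min_stems` stems."""
    stem_to_suffixes = defaultdict(set)
    for word in words:
        for i in range(min_stem, len(word)):              # stem >= 5 letters
            stem, suffix = word[:i], word[i:]
            clear_peak = (sf.get(stem[:-1]) == 1          # 1 ...
                          and sf.get(stem, 0) > 1         # ... N ...
                          and sf.get(stem + suffix[0]) == 1)  # ... 1
            if clear_peak:
                stem_to_suffixes[stem].add(suffix)
    signatures = defaultdict(set)                         # signature -> stems
    for stem, suffixes in stem_to_suffixes.items():
        signatures[".".join(sorted(suffixes))].add(stem)
    return {sig: stems for sig, stems in signatures.items()
            if len(stems) >= min_stems}
```
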
2. Incremental heuristics Coarse-grained to fine-grained  1. Stems and suffixes to split: accept any analysis of a word if it consists of a known stem and a known suffix (sketched below).  2. Loose fit: suffixes and signatures to split: collect any string that precedes a known suffix; find all of its apparent suffixes, and use MDL to decide whether the analysis is worth doing. We’ll return to this in a moment.

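Heuristic 1 as a sketch, with toy stem and suffix sets:

```python
def split_known(word, stems, suffixes):
    """Heuristic 1 (sketch): accept an analysis of a word if it is
    a known stem followed by a known suffix."""
    for i in range(1, len(word)):
        if word[:i] in stems and word[i:] in suffixes:
            return word[:i], word[i:]
    return None

print(split_known("jumped", {"jump", "laugh"}, {"s", "ed", "ing"}))  # ('jump', 'ed')
```
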
Incremental heuristic  3. Slide the stem-suffix boundary to the left: again, use MDL to decide. How do we use MDL to decide?

Using MDL to judge a potential stem act, acted, action, acts, acting. We have the suffixes NULL, ed, ion, ing, and s, but no signature NULL.ed.ion.ing.s. Let’s compute the cost versus the savings of the signature NULL.ed.ion.ing.s. Savings: stem savings: 4 redundant copies of the stem act, that’s 3 × 4 = 12 letters, or almost 60 bits.

Cost of NULL.ed.ion.ing.s  A pointer to each suffix f costs log₂([W]/[f]) bits, where [W] is the number of word tokens in the corpus and [f] is the number of tokens containing f. To give a feel for this: the total cost of the suffix list is about 30 bits. Cost of the pointer to the signature itself: computed the same way; the total cost is the sum of these pointer lengths.

 Cost of signature: about 45 bits.  Savings: about 60 bits. So MDL says: do it! Analyze the words as stem + suffix. Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.

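A sketch of the cost/savings comparison, assuming the pointer length log₂([W]/[f]) above and roughly 5 bits per letter; all token counts here are invented for illustration:

```python
import math

BITS_PER_LETTER = 5     # rough: 12 letters came to "almost 60 bits" above

def pointer_bits(total_tokens, morpheme_tokens):
    """Length of a pointer to a morpheme: log2([W]/[f]) bits."""
    return math.log2(total_tokens / morpheme_tokens)

W = 100_000   # word tokens in a hypothetical corpus
suffix_tokens = {"NULL": 40_000, "ed": 8_000, "ion": 2_000,
                 "ing": 10_000, "s": 30_000}

cost = sum(pointer_bits(W, n) for n in suffix_tokens.values())
savings = 4 * len("act") * BITS_PER_LETTER    # 4 redundant copies of "act"

print(f"cost {cost:.1f} bits, savings {savings} bits")
if savings > cost:
    print("MDL says: do it -- analyze act- with NULL.ed.ion.ing.s")
```
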
Repair heuristics: using MDL We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compare the two lengths: original morphology + compressed data versus revised morphology + compressed data, keeping whichever is smaller.