
Variant definitions of pointer length in MDL
Aris Xanthos, Yu Hu, and John Goldsmith
University of Chicago

Degrees of freedom in MDL modeling
- MDL does not specify the form of the grammar being inferred (cf. Carl de Marcken 1996).
- There are alternatives to pointers for representing connections.
- Different representations may lead to different grammars.

Linguistica (Goldsmith 2001)
- Website: linguistica.uchicago.edu
- Data: a corpus segmented into words
- Model: a list of stems, a list of suffixes, and a list of signatures
- A sample signature: { walk, jump, ... } { ed, ing, ... }
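To make the model concrete, here is a minimal sketch of a signature as a data structure (the names are hypothetical, not Linguistica's actual internals):

```python
from dataclasses import dataclass

@dataclass
class Signature:
    stems: list[str]     # e.g. ["walk", "jump"]
    suffixes: list[str]  # e.g. ["ed", "ing"]

sig = Signature(stems=["walk", "jump"], suffixes=["ed", "ing"])

# Every stem in the signature combines with every suffix:
words = [t + f for t in sig.stems for f in sig.suffixes]
# -> ['walked', 'walking', 'jumped', 'jumping']
```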

Reminder: MDL analysis
- Corpus C; two or more competing models describing C.
- Model M assigns a probability to C: pr(C | M).
- Compressed length of C given M: L(C | M) = -log2 pr(C | M).
- Length of model M: L(M).
- Description length of C given M: DL(C | M) = L(C | M) + L(M).
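In code, these quantities are a pair of one-liners; the sketch below assumes the corpus probability and the model length are already computed:

```python
import math

def compressed_length(pr_corpus_given_model: float) -> float:
    """L(C | M) = -log2 pr(C | M), in bits."""
    return -math.log2(pr_corpus_given_model)

def description_length(pr_corpus_given_model: float, model_length: float) -> float:
    """DL(C | M) = L(C | M) + L(M)."""
    return compressed_length(pr_corpus_given_model) + model_length

# A 5,000-bit model that assigns the corpus probability 2**-1000:
description_length(2.0 ** -1000, 5000.0)  # -> 6000.0 bits
```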

Learning process
- Bootstrapping heuristic: word = stem + suffix.
- Successive heuristics propose modifications; MDL sanctions them.
- Compute L(corpus | model) + L(model) before and after each modification.
- If the modification decreases the description length, retain it; otherwise discard it (a schematic version of this loop is sketched below).
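A schematic version of the accept/reject step (function names are illustrative, not Linguistica's actual API):

```python
def revise(model, corpus, propose, dl):
    """Apply a proposed modification only if it lowers the description length.

    `propose` maps a model to a candidate model; `dl` computes
    L(corpus | model) + L(model) in bits.
    """
    candidate = propose(model)
    if dl(corpus, candidate) < dl(corpus, model):
        return candidate  # modification sanctioned by MDL
    return model          # otherwise discarded
```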

Length of the morphology
- L(morphology) = sum of the lengths of the lists (stems, suffixes, signatures).
- Length of a list = sum of the lengths of the elements in it + a small cost for the list structure.
- The length of a stem or suffix is proportional to the number of symbols in it.
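As a sketch, with an assumed 26-letter alphabet and a placeholder value for the list-structure cost:

```python
import math

BITS_PER_SYMBOL = math.log2(26)  # assuming a 26-letter alphabet
LIST_OVERHEAD = 1.0              # placeholder for the small list-structure cost

def morpheme_length(m: str) -> float:
    """Length of a stem or suffix: proportional to its number of symbols."""
    return len(m) * BITS_PER_SYMBOL

def list_length(morphemes: list[str]) -> float:
    return LIST_OVERHEAD + sum(morpheme_length(m) for m in morphemes)

# L(morphology) = list_length(stems) + list_length(suffixes)
#                 + the length of the signature list (next slides).
```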

Length of the morphology (2)
A signature specifies that a set of stems associates with a set of suffixes:
{ walk, jump, great, ... }  (list of stems)
{ ed, ing, est, ... }       (list of suffixes)

Length of the morphology (3)
- A pointer is a symbol that stands for a given morpheme.
- The information content of a pointer to a morpheme m is -log2 pr(m).
- The more probable the morpheme, the smaller the cost of a pointer to it: for example, a morpheme with probability 2^-10 (about 0.001) costs 10 bits.
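In code (the example probabilities are illustrative):

```python
import math

def pointer_cost(pr_m: float) -> float:
    """Information content of a pointer to morpheme m: -log2 pr(m) bits."""
    return -math.log2(pr_m)

pointer_cost(0.5)      # 1.0 bit: a very frequent morpheme is cheap to point to
pointer_cost(2**-10)   # 10.0 bits: a morpheme with probability ~0.001
```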

Length of the morphology (4)
- Length of a signature = sum of the lengths of its two lists of pointers (to stems and to suffixes).
- Length of each list = sum of the information costs of the pointers in it + a small cost for the list structure.

Compressed length of the corpus
Corpus: walking in the ...
Morphology: { walk, jump, great, ... } { ed, ing, est, ... }

Compressed length of the corpus (2)
Compressed length of a word w
= information content of the pointer to its signature σ
+ information content of the pointer to its stem t given σ
+ information content of the pointer to its suffix f given σ
= -log2 pr(σ) - log2 pr(t | σ) - log2 pr(f | σ)
L(corpus | morphology) = sum of the lengths of the individual words.
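The same formula as a sketch:

```python
import math

def word_length(pr_sig: float, pr_stem_given_sig: float,
                pr_suffix_given_sig: float) -> float:
    """-log2 pr(σ) - log2 pr(t | σ) - log2 pr(f | σ), in bits."""
    return -(math.log2(pr_sig)
             + math.log2(pr_stem_given_sig)
             + math.log2(pr_suffix_given_sig))

# L(corpus | morphology) is the sum of word_length over all tokens.
```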

Alternatives to pointers
There are alternatives to pointers for representing connections in the morphology. A signature σ can instead be represented as a binary string over the list of all stems, one bit per stem, marking which stems belong to σ:
{ walk, jump, great, ..., chin }  (list of all stems)
110...0                           (binary string for signature σ)

List of pointers vs. binary strings
- The number of symbols in a binary string is constant and equal to the total number of stems.
- The information content of the string depends on the distribution of 0's and 1's in it: the total number of stems times the entropy of the string.
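The two encodings can be compared directly; the sketch below also previews the claim on the next slide (all probabilities are illustrative):

```python
import math

def binary_string_cost(n_stems_total: int, n_stems_in_sig: int) -> float:
    """Total number of stems times the entropy of the 0/1 distribution."""
    p = n_stems_in_sig / n_stems_total
    if p in (0.0, 1.0):
        return 0.0  # a constant string carries no information
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return n_stems_total * entropy

def pointer_list_cost(stem_probs: list[float]) -> float:
    """Sum of -log2 pr(t) over the stems the signature points to."""
    return sum(-math.log2(p) for p in stem_probs)

# 1,000 stems, a signature pointing to 5 of them:
binary_string_cost(1000, 5)        # ~45.4 bits
pointer_list_cost([1 / 1000] * 5)  # ~49.8 bits: binary string wins (uniform stems)
pointer_list_cost([0.1] * 5)       # ~16.6 bits: pointers win (frequent stems)
```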

Expected difference in DL
Theoretical inference (see details in the paper):
1. Binary strings are shorter when:
   - the distribution of stems tends to be uniform;
   - the distribution of the number of stems being pointed to tends to be uniform.
2. Lists of pointers are shorter when:
   - the distribution of stems departs from uniformity;
   - the average number of stems being pointed to is small.

A specific example
Current state of the morphology:
{ walk, jump, ... } { ed, ing }
{ walks, broke, ... }  (unanalyzed words)
Proposed modification: walks = walk + s

A specific example (2)
State of the morphology after the modification:
{ walk } { ed, ing, s }
{ jump, ... } { ed, ing }
{ broke, ... }  (walks no longer listed)
Savings: the string walks, and a pointer to it.
Cost: pointers to ed, ing and s.
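The bookkeeping can be sketched as follows (all bit values are hypothetical, chosen only to show the shape of the computation):

```python
import math

BITS_PER_SYMBOL = math.log2(26)  # assuming a 26-letter alphabet, as before

def net_dl_change(saved_string_bits: float, saved_pointer_bits: float,
                  new_pointer_costs: list[float]) -> float:
    """Negative result = the description length decreases, so MDL accepts."""
    return sum(new_pointer_costs) - (saved_string_bits + saved_pointer_bits)

# Dropping the 5-symbol string "walks" (~23.5 bits) and a pointer to it
# (say 10 bits), against three new suffix pointers (say 6 bits each):
net_dl_change(5 * BITS_PER_SYMBOL, 10.0, [6.0, 6.0, 6.0])  # ~ -15.5 -> accept
```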

Crucial difference
The compressed length of binary strings is independent of the frequency of the items being pointed to. This encoding does not favor the creation of pointers to frequent items (or the deletion of pointers to rare items).

Conclusion
- There is more than one way of representing the connections between items in a grammar.
- The choice of a representation can have important consequences for the grammar being induced.
- Mathematical details can be found in the paper.