
Linguistica

PowerPoint? This presentation borrows heavily from slides written by John Goldsmith, who has graciously given me permission to use them. Thanks, John. He also says I should enjoy my trip, and one way to do that is to not have to write as many slides while I’m here!

Linguistica A C++ program that runs under Windows, Mac OS X, and Linux, available at: faculty/goldsmith/ Explanations, papers, and other downloadable tools are available there.

References (for the 1st part) Goldsmith (2001) “Unsupervised Learning of the Morphology of a Natural Language” Computational Linguistics

Overview Look at Linguistica in action: English, French Theoretical foundations Underlying heuristics Further work

Linguistica A program that takes in a text in an “unknown” language… …and produces a morphological analysis: a list of stems, prefixes, and suffixes; more deeply embedded morphological structure; regular allomorphy.

Linguistica

Actions and outlines of information. One pane: lists of stems, affixes, signatures, etc. Another: messages from the analyzer to the user.

Read a corpus The Brown corpus (1,200,000 words of typical English), French Encarta, or anything else you like, in a text file. Set the number of words you want read, then select the file.

A stem’s signature is the list of suffixes it appears with in the corpus, in alphabetical order.

List of stems:
  abilit      ies.y     abilities, ability
  abolition             abolition
  absen       ce.t      absence, absent
  absolute    NULL.ly   absolute, absolutely

List of signatures

Signature: NULL.ed.ing.s For example: account, accounted, accounting, accounts; add, added, adding, adds.
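As a concrete illustration (a minimal sketch, not Linguistica's actual code), signatures can be collected by grouping suffixes per stem once candidate cuts exist; the `splits` toy data below is invented:

```python
from collections import defaultdict

def signatures(splits):
    """Group suffixes by stem; a stem's signature is the sorted,
    dot-joined list of the suffixes it occurs with (NULL = bare stem)."""
    by_stem = defaultdict(set)
    for stem, suffix in splits:
        by_stem[stem].add(suffix if suffix else "NULL")
    sigs = defaultdict(list)
    for stem, sufs in by_stem.items():
        sigs[".".join(sorted(sufs))].append(stem)
    return dict(sigs)

# Toy input, as if a bootstrap heuristic had already cut each word:
splits = [("account", ""), ("account", "ed"), ("account", "ing"), ("account", "s"),
          ("add", ""), ("add", "ed"), ("add", "ing"), ("add", "s")]
print(signatures(splits))  # {'NULL.ed.ing.s': ['account', 'add']}
```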

Signature ion.NULL composite, concentrate, corporate, détente, discriminate, evacuate, inflate, opposite, participate, probate, prosecute, tense. What is this? composite and composition: composite → composit; composition → composit + ion. It infers that ion deletes a stem-final ‘e’ before attaching. We’ll see how we can find a more sophisticated signature…

Top signatures in English

Over-arching theory The selection of a grammar, given the data, is an optimization problem. Optimization means finding a maximum or minimum of some objective function. Minimum Description Length provides us with a means for understanding grammar selection as minimizing a function. (We’ll get to MDL in a moment.)

What’s being minimized by writing a good morphology? The number of letters is part of it. Compare:

Naive Minimum Description Length Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs (total: 61 letters). Analysis: Stems: jump, laugh, sing, sang, dog (20 letters). Suffixes: s, ing, ed (6 letters). Unanalyzed: the (3 letters). Total: 29 letters. Notice that the description length goes UP if we analyze sing into s+ing.
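The arithmetic on this slide is easy to reproduce; here is a toy sketch that just re-counts the letters:

```python
def naive_dl(*word_lists):
    """Naive description length: the total number of letters."""
    return sum(len(w) for words in word_lists for w in words)

corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]
print(naive_dl(corpus))                        # 61
print(naive_dl(stems, suffixes, unanalyzed))   # 29
```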

Minimum Description Length (MDL) Rissanen (1989) (not a CL paper). The best “theory” of a set of data is the one which, simultaneously: 1. is most compact or concise, and 2. provides the best modeling of the data. “Most compact” can be measured in bits, using information theory. “Best modeling” can also be measured in bits…

Essence of MDL

Description Length = 1. Conciseness: length of the morphology. It’s almost as if you count up the number of symbols in the morphology (in the stems, the affixes, and the rules). Plus 2. Length of the modeling of the data: we want a measure which gets bigger as the morphology is a worse description of the data. Add these two lengths together = Description Length.

Conciseness of the morphology Sum all the letters, plus all the structure inherent in the description, using information theory.

Remember Entropy? Entropy is the weighted (by p(x)) sum of the information content, or optimal compressed length -log₂ p(x), of each symbol x: H = -Σ_x p(x) log₂ p(x). It’s called that because it is always possible to develop a compression scheme by which a symbol x, emitted with probability p(x), is represented by a codeword of length -log₂ p(x) bits.
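By way of illustration (not from the slides), the definition translates directly into code:

```python
import math

def entropy(p):
    """H = -sum over x of p(x) * log2 p(x): the average optimal
    compressed length, weighting each symbol's codeword length
    -log2 p(x) by its probability p(x)."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

print(entropy({"a": 0.5, "b": 0.25, "c": 0.25}))  # 1.5 bits per symbol
```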

Optimal Compressed Length The reason this is mentioned is that we will have lots of pieces of information in our model, and we’d like to figure out how much “space” they take up. Remember, we want the smallest model possible, so we are going to want the best compression for anything in our model. Also, remember this: a pointer to an element x can be compressed to -log₂ p(x) bits.

Conciseness of stem list and suffix list Number of letters in each stem × (number of bits per letter < 5), plus the cost of setting up each entity: the length of a pointer to it, in bits. Likewise for the number of letters in each suffix.

Signature list length A list of pointers to the signatures; ⟨X⟩ indicates the number of distinct elements in X.

Length of the modeling of the data Probabilistic morphology: the measure is -log₂ prob(data), where the morphology assigns a probability to any data set. This is known in information theory as the optimal compressed length of the data (given the model).
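A minimal sketch of that measure, where `prob` is a hypothetical stand-in for whatever probability the morphology assigns to each word:

```python
import math

def compressed_length(corpus, prob):
    """-log2 P(corpus): the optimal compressed length of the data,
    in bits, assuming words are generated independently."""
    return -sum(math.log2(prob(w)) for w in corpus)

# Toy usage with a unigram model estimated from the corpus itself:
corpus = ["the", "dog", "the", "dogs"]
counts = {"the": 2, "dog": 1, "dogs": 1}
print(compressed_length(corpus, lambda w: counts[w] / len(corpus)))  # 6.0 bits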

Probability of a data set? A grammar can be used not (just) to specify what is grammatical and what is not, but to assign a probability to each string (or structure). If we have two grammars that assign different probabilities, then the one that assigns a higher probability to the observed data is the better one.

This follows from the basic principle of rationality in the Universe: Maximize the probability of the observed data.

From all this, it follows: There is an objective answer to the question: which of two analyses of a given set of data is better? However, there is no general, practical guarantee of being able to find the best analysis of a given set of data. Hence, we need to think of (this sort of) linguistics as being divided into two parts:

1. An evaluator (which computes the Description Length); and 2. A set of heuristics, which create grammars from data, and which propose modifications of grammars, in the hopes of improving the grammar. (Remember, these “things” are mathematical things: algorithms.)

Let’s step back for a minute Why is this problem so hard at first? Because figuring out the best analysis of any given word generally requires having figured out the rough outlines of the whole overall morphology. (The same is true for other parts of the grammar!) How do we start?

You all know the answer to this question already… We start with Zellig Harris’ successor frequency! Although we got some good answers, we also saw that it made lots of mistakes. So…

As a boot-strapping method to construct a first approximation of the signatures, Harris’ method is pretty good. We accept only stems of 5 letters or more, and only cuts where the SuccFreq is > 1 and where the neighboring SuccFreq is 1. (This setup was experiment 16 from the lab on Monday.)
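A minimal sketch of that bootstrap follows. It assumes "neighboring SuccFreq" means the successor frequency one letter further along; the function names and word list are illustrative, not Linguistica's:

```python
from collections import defaultdict

def successor_freq(words):
    """Harris' successor frequency: for each prefix, the number of
    distinct letters that can follow it ('#' marks end of word)."""
    following = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            following[w[:i]].add(w[i])
        following[w].add("#")
    return {prefix: len(s) for prefix, s in following.items()}

def bootstrap_cuts(words, min_stem=5):
    """Cut after prefixes of >= min_stem letters where SuccFreq > 1
    and the next prefix's SuccFreq is 1 (the slide's thresholds)."""
    sf = successor_freq(words)
    cuts = set()
    for w in words:
        for i in range(min_stem, len(w)):
            if sf[w[:i]] > 1 and sf[w[:i + 1]] == 1:
                cuts.add((w[:i], w[i:]))
    return cuts

words = ["account", "accounted", "accounting", "accounts"]
print(sorted(bootstrap_cuts(words)))
# [('account', 'ed'), ('account', 'ing'), ('account', 's')]
```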

Let’s look at how the work is done (in the abstract), step by step...

Corpus Pick a large corpus from a language -- 5,000 to 1,000,000 words.

Corpus Bootstrap heuristic Feed it into the “bootstrapping” heuristic...

Corpus Bootstrap heuristic Morphology Out of which comes a preliminary morphology, which need not be superb.

Corpus Morphology Bootstrap heuristic incremental heuristics Feed it to the incremental heuristics (…which we haven’t seen yet)

Corpus Morphology Bootstrap heuristic incremental heuristics modified morphology Out comes a modified morphology.

Corpus Morphology Bootstrap heuristic incremental heuristics modified morphology Is the modification an improvement? Ask MDL!

Corpus Morphology Bootstrap heuristic modified morphology If it is an improvement, replace the morphology (the old one goes to the Garbage)...

Corpus Bootstrap heuristic incremental heuristics modified morphology Send it back to the incremental heuristics again...

Morphology incremental heuristics modified morphology Continue until there are no improvements to try.
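In schematic code, the whole loop from these slides might look like this; `bootstrap`, `heuristics`, and `description_length` are hypothetical stand-ins for the components just described:

```python
def learn_morphology(corpus, bootstrap, heuristics, description_length):
    """Bootstrap a morphology, then keep any heuristic modification
    that lowers the description length, until none helps."""
    morphology = bootstrap(corpus)
    improved = True
    while improved:
        improved = False
        for heuristic in heuristics:
            candidate = heuristic(morphology, corpus)
            if description_length(candidate, corpus) < description_length(morphology, corpus):
                morphology = candidate   # keep the improvement (the old one is garbage)
                improved = True          # ...and send it around again
    return morphology
```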

The details of learning morphology There is nothing sacred about the particular choice of heuristic steps

Steps 1. Successor Frequency: strict. 2. Extend signatures to cases where a word is composed of a known stem and a known suffix. 3. Loose fit: Look at all unanalyzed words. Look to see if they can be cut as stem + suffix, where the suffix already exists. Do this in all possible ways. See if any of these lead to stems with signatures that already exist. If so, take the “best” one. If not, compute the utility of the signature using MDL.

Check existing signatures: Using MDL to find best stem/suffix cut. Examples…

Check signatures (English): on/ve → ion/ive; an/en → man/men; l/tion → al/ation; m/t → alism/alist; etc. How?

Check signatures Signature l/tion with stems: federa, inaugura, orienta, substantia. We need to compute the Description Length of the analysis as it stands versus as it would be if we shifted varying parts of the stems to the suffixes.

“Check signatures” French: NULL.nt.r >> a.ant.ar; NULL.nt >> i.int; ent.t >> oient.oit; NULL.r >> i.ir; f.on.ve >> sif.sion.sive; eur.ion >> seur.sion; ce.t >> ruce.rut; se.x >> ouse.oux; l.ux >> al.aux; me.te >> ume.ute; eurs.ion >> teurs.tion; f.ve >> dif.dive; it.nt >> ait.ant; que.sme >> ïque.ïsme; NULL.s.ur >> e.es.eur; ient.nt >> aient.ant; f.on >> sif.sion; nt.r >> ent.er

100,000 tokens, 12,208 types
  Zellig redux:      1,403 stems, 140 signatures, 68 suffixes
  Extend signatures: 226 signatures
  Loose fit:         2,… stems, … signatures, 68 suffixes
  Check signatures:  2,… stems
  Smooth stems:      2,… stems

Allomorphy Find relations among stems: find principles of allomorphy, like “delete stem-final e before –ing” on the grounds that this simplifies the collection of Signatures. Compare the signatures NULL.ing and e.ing.

NULL.ing and e.ing NULL.ing: its stems do not end in –e; -ing (almost) never appears after stem-final e (exception: singeing). So e.ing and NULL.ing can both be subsumed under ⟨e⟩ing.NULL, where ⟨e⟩ing means a suffix ing which deletes a preceding e.

Find layers of affixation Find roots (from among the Stem collection). In other words, recursively look through our list of Stems and see if we could (or should) be analyzing them again: readings = reading + s = read + ing + s, etc.

What’s the future work? 1. Identifying suffixes through syntactic behavior (→ syntax) 2. Better allomorphy (→ phonology) 3. Languages with more morphemes per word (“rich” morphology)

“Using eigenvectors of the bigram graph to infer grammatical features and categories” (Belkin & Goldsmith 2002)

Method Build a graph in which “similar” words are adjacent; compute the normalized Laplacian of that graph (linear algebra -- it just sounds fancy!); compute the eigenvectors with the lowest non-zero eigenvalues (more linear algebra); plot them.
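A sketch of the method with NumPy, assuming a symmetric adjacency matrix with no isolated words; this is illustrative, not Belkin and Goldsmith's code:

```python
import numpy as np

def laplacian_coordinates(A):
    """Eigenvectors of the normalized Laplacian L = I - D^(-1/2) A D^(-1/2)
    with the lowest non-zero eigenvalues, used as plotting coordinates."""
    d = A.sum(axis=1)                        # degrees (assumed all > 0)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    return eigvecs[:, 1:3]                   # skip the trivial zero eigenvalue

# Plot the first returned column against the second to map the words.
```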

Left-side similarities: A word’s left neighbors are the words that appear immediately to its left in a corpus. That gives a vector v_L in a space of size V = size of the vocabulary. For any given word w*, what are the N words whose left-neighbors are most similar to w*’s left-neighbors? The cosine of the angle between v_i and v_{w*} is (v_i · v_{w*}) / (|v_i| |v_{w*}|).
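In NumPy, that similarity is a one-liner:

```python
import numpy as np

def cosine(v, w):
    """Cosine of the angle between two left-neighbor count vectors."""
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

print(cosine(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.0])))  # 1.0
```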

Map 1,000 English words by left-hand neighbors
  non-finite verbs: be, do, go, make, see, get, take, say, put, find, give, provide, keep, run…
  finite verbs: was, had, has, would, said, could, did, might, went, thought, told, knew, took, asked…
  world, way, same, united, right, system, city, case, church, problem, company, past, field, cost, department, university, rate, door…
  ?: and, to, in, that, for, he, as, with, on, by, at, or, from…

Map 1,000 English words by right-hand neighbors
  adjectives: social, national, white, local, political, personal, private, strong, medical, final, black, French, technical, nuclear, british
  prepositions: of, in, for, on, by, at, from, into, after, through, under, since, during, against, among, within, along, across, including, near

End