Automatic Morphology and Minimum Description Length John Goldsmith Department of Linguistics.

1 Automatic Morphology and Minimum Description Length John Goldsmith Department of Linguistics

2 Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context... Comparison with Early Generative Grammar
4 Situate MDL within a broader intellectual context
5 More substantive description of Automorphology’s design
6 The broader perspective

3 Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context...
4 Comparison with Early Generative Grammar
5 Situate MDL within a broader intellectual context
6 More substantive description of Automorphology’s design
7 Consequences for anthropology

4 WinAutomorphology 1
• A version available on the web at http://humanities.uchicago.edu/faculty/goldsmith
• A C++ Windows program that accepts data as input and provides a morphological analysis....

5 Automorphology

6 • What do you have to put into a program like that? How much do you have to put into a program like that?
• That is, does it have to have a lot of innate knowledge? Does it help for it to have a lot of innate knowledge?
• If you build such a program, how do you know if it does it the same way as a child?

7 What do we want? If you give the program a computer file containing Tom Sawyer, it should tell you that the language has a category of words that take the suffixes ing, s, ed, and NULL, and another category that takes the suffixes 's, s, and NULL. If you give it Jules Verne, it should tell you that there is a category with the suffixes a, aient, ait, ant (chanta, chantaient, chantait, chantant).

8 • And it should tell you about irregular stem allomorphy if your language contains it.

9 That's what AutoMorphology does. How much data do you need?
• You get reasonable results fast with 5,000 words, but results are much better with 50,000 words, and much better still with 500,000 words (length of corpus).

10 Unsupervised learning...
• No prepared corpus; no tagging; just the facts.
• The goal is to reconstruct the logic of linguistics in a quantitative fashion (to the extent that is necessary).

11 Unsupervised learning
• A fully explicit linguistic hypothesis.
• A device (an algorithm) with immediate practical uses.
• Arguably the embodiment of linguistic theory: the explicit and quantifiable specification of the relationship between data and analysis (grammar).

13 • For the purposes of version 1 of AutoMorphology, I will restrict myself to Indo-European languages, and in general languages in which the average number of suffixes per word is not greater than 2. (We drop this requirement in AutoMorphology 2.)

14 Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context... Comparison with Early Generative Grammar
4 Situate MDL within a broader intellectual context
5 More substantive description of Automorphology’s design
6 The broader perspective

15 Minimum Description Length (Jorma Rissanen, 1989)
[Diagram: Data, Analyzer, Analysis]
Select the analyzer and analysis such that the sum of their lengths is a minimum.
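A minimal sketch of this selection rule, assuming each candidate comes with an analyzer length and an analysis length measured in bits. The candidate names and all numbers are invented for illustration; nothing here is taken from the program itself.

```python
# Hypothetical candidates: (name, analyzer length in bits, analysis length in bits).
# Every number is made up for illustration.
candidates = [
    ("no morphology", 0.0, 120_000.0),
    ("stems + suffixes", 9_000.0, 95_000.0),
    ("one rule per word", 150_000.0, 10_000.0),
]

def description_length(analyzer_bits, analysis_bits):
    """Total description length: analyzer plus analysis (compressed data)."""
    return analyzer_bits + analysis_bits

def select_mdl(cands):
    """Return the candidate whose total description length is smallest."""
    return min(cands, key=lambda c: description_length(c[1], c[2]))

best = select_mdl(candidates)
print(best[0])
```

Neither the smallest analyzer nor the smallest analysis wins on its own; the rule rewards the best trade-off between the two.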

16 [Diagram: Data, followed by a series of candidate Analyzer/Analysis pairs, etc...]

17 The challenge is to find a means of quantifying
• the length of an analyzer, and
• the length of an analysis

18 “Compressed form of data?”
Think of data as a dense, rich, detailed description (evidence), and think of the compressed form as
• a description in a high-level language, plus
• a description of the particulars of the event in question (a.k.a. boundary conditions, etc.)...

19 “Analyzer” is the set of statements that allows translation between high-level and low-level descriptions.

20 Minimizing sum of length of Analyzer + Compressed form of data = Aim for conciseness in high-level description + Principles of analysis

21 Don’t overlook the fact... …that the goal of MDL analysis is nothing less than the solution of the problem of induction. How do we justify generalization, given evidence?

22 The problem of induction
Speech   | child/linguistic theory | grammar
Data     | scientist               | theory
Sense    | brain                   | thought/percept
Evidence | mind                    | belief

23 Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context...
4 Comparison with Early Generative Grammar
5 Situate MDL within a broader intellectual context
6 More substantive description of Automorphology’s design
7 Consequences for anthropology

24 [Diagram: Data; Morphological analyser; Morphological analysis of that corpus (“signature”)]

25 Very simply put...
• Just state “ed” “s” “ing” “heit” “ité” once in the grammar;
• pay for its occurrence (how many bits does it take to pay for those few letters) just once;
• then make repeated reference (use pointers) to those entries.

26 References, pointers...
• Are not free.
• Information theory tells us exactly what they cost. The fundamental measure is Shannon’s: a pointer to an item of reference frequency P out of a universe of N possibilities is of length log (N/P).
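A small sketch of the pointer cost, assuming the logarithm is taken base 2 so that lengths come out in bits (the slide does not state the base). The function name is illustrative.

```python
import math

def pointer_length(universe, frequency):
    """Length in bits of a pointer to an item referenced `frequency` times
    out of `universe` references in total: log2(N / P)."""
    return math.log2(universe / frequency)

# An item that accounts for half of all references costs one bit to point to;
# an item referenced once out of 1024 costs ten bits.
print(pointer_length(1024, 512))  # 1.0
print(pointer_length(1024, 1))    # 10.0
```

Frequent items get short pointers and rare items get long ones, which is why stating a common suffix once and pointing to it repeatedly pays off.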

27 Summing over all items, and weighting by count gives us the famous formula:
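The formula on this slide was an image and is not part of the transcript. One reading consistent with the pointer cost on the previous slide, offered as an assumption, sums the pointer length over all items i, weighting each by its count P_i:

$$\sum_{i} P_i \,\log\frac{N}{P_i}$$

Dividing by N gives the per-reference average, which is Shannon's entropy with p_i = P_i / N:

$$H = -\sum_{i} p_i \log p_i$$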

28 A probabilistic morphology:
• Assigns a probability to all words that it can generate; and these probabilities must add up to 1.0.
• A word is three choices:
  – choice of signature
  – choice of stem within signature
  – choice of suffix within signature

29 • Each of those is assigned a probability, based on counts.
• Probability of a signature

30 Similarly, the probability of a stem is the number of its occurrences divided by the number of occurrences of that signature in the corpus.

31 Likewise for the suffixes… If the analysis is wrong, the numbers will be much worse than if it’s right. “The numbers”: a model of the frequencies of words.
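The three choices can be sketched with plain counts. This is a toy illustration under stated assumptions: the corpus is invented, and the signature probability (whose formula is missing from the transcript) is assumed to be the signature's occurrences divided by the total number of word tokens, by analogy with the stem and suffix estimates.

```python
from collections import Counter

# Toy analysed corpus (invented): each token is (signature, stem, suffix).
tokens = [
    ("NULL.ed.ing", "jump", "NULL"),
    ("NULL.ed.ing", "jump", "ed"),
    ("NULL.ed.ing", "walk", "ing"),
    ("NULL.ed.ing", "walk", "NULL"),
    ("NULL.s", "dog", "s"),
    ("NULL.s", "dog", "NULL"),
    ("NULL.s", "cat", "s"),
    ("NULL.s", "cat", "s"),
]

sig_count = Counter(sig for sig, _, _ in tokens)
stem_count = Counter((sig, stem) for sig, stem, _ in tokens)
suffix_count = Counter((sig, suf) for sig, _, suf in tokens)

def word_probability(sig, stem, suffix):
    """P(word) = P(signature) * P(stem | signature) * P(suffix | signature)."""
    p_sig = sig_count[sig] / len(tokens)          # assumed estimate
    p_stem = stem_count[(sig, stem)] / sig_count[sig]
    p_suffix = suffix_count[(sig, suffix)] / sig_count[sig]
    return p_sig * p_stem * p_suffix

# "cats": (4/8) * (2/4) * (3/4)
print(word_probability("NULL.s", "cat", "s"))  # 0.1875
```

Because each factor is a proper distribution, the probabilities of all stem and suffix combinations the morphology can generate add up to 1.0.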

32 Maximum Likelihood
• The best morphology is the one that assigns the highest probability to the observed data.
• …known in the biz as Maximum Likelihood.
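To make the comparison concrete, here is a toy sketch that scores two analyses of the same six word tokens (jump, jumps, walk, walks, the, the). Both analyses and the count-based estimates are assumptions for illustration. The "bad" analysis splits "the" as th+e and lumps everything into one signature, so it wastes probability on words that never occur, such as "jumpe".

```python
import math
from collections import Counter

def corpus_log2_probability(tokens):
    """Log2 probability of a corpus of (signature, stem, suffix) tokens,
    using count-based estimates taken from the same tokens."""
    n = len(tokens)
    sig = Counter(s for s, _, _ in tokens)
    stem = Counter((s, t) for s, t, _ in tokens)
    suf = Counter((s, f) for s, _, f in tokens)
    total = 0.0
    for s, t, f in tokens:
        p = (sig[s] / n) * (stem[(s, t)] / sig[s]) * (suf[(s, f)] / sig[s])
        total += math.log2(p)
    return total

good = [
    ("NULL.s", "jump", "NULL"), ("NULL.s", "jump", "s"),
    ("NULL.s", "walk", "NULL"), ("NULL.s", "walk", "s"),
    ("NULL", "the", "NULL"), ("NULL", "the", "NULL"),
]
bad = [
    ("NULL.e.s", "jump", "NULL"), ("NULL.e.s", "jump", "s"),
    ("NULL.e.s", "walk", "NULL"), ("NULL.e.s", "walk", "s"),
    ("NULL.e.s", "th", "e"), ("NULL.e.s", "th", "e"),
]

# The better analysis assigns the observed corpus a higher probability.
print(corpus_log2_probability(good) > corpus_log2_probability(bad))  # True
```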

33 Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context...
4 Comparison with Early Generative Grammar
5 Situate MDL within a broader intellectual context
6 More substantive description of Automorphology’s design
7 Consequences for anthropology

34 Compare with Early Generative Grammar (EGG)
[Diagram: Data, Analysis 1 and Analysis 2 go into Linguistic Theory, which returns a preference: A1/A2]

35 [Three diagrams]
Data → Linguistic theory → Analysis
Data, Analysis → Linguistic theory → Yes/No
Data, Analysis 1, Analysis 2 → Linguistic theory → 1 is better / 2 is better

36 Implicit in EGG was the notion... that the best Linguistic Theory (LT) could be selected by:
• getting a set of n candidate LTs;
• submitting to each a set of corpora;
• searching (using unknown heuristics) for the best analyses of each corpus within each LT.
The LT wins for which the sum total of all of the analyses is the smallest.

37 No cost to UG
• In EGG, there was no cost associated with the size of UG -- in effect, no plausibility measure.

38 In MDL, in contrast…
• We can argue for a grammar for a given corpus.
• We can also argue at the Linguistic Theory level if we so choose...

39 • Select n corpora, and select the LT on the basis of the LT’s length plus the lengths of all of the grammars derived from it, plus the lengths of the compressed corpora derived from those grammars.
• Pick the LT with the shortest sum total length.
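A toy sketch of the contrast, with invented bit lengths and hypothetical theory names. The EGG-style score is read here as the sum of grammar lengths only, which is an interpretation of "the sum total of all of the analyses"; the MDL score also charges for the theory itself and for the compressed corpora.

```python
# Hypothetical linguistic theories; all lengths (in bits) are invented.
# Each corpus entry is (grammar length, compressed corpus length).
theories = {
    "LT-A": {"theory": 50_000, "corpora": [(2_000, 40_000), (2_500, 38_000)]},
    "LT-B": {"theory": 1_000, "corpora": [(6_000, 41_000), (7_000, 39_000)]},
}

def egg_score(lt):
    """EGG-style score (assumed reading): grammar lengths only, no cost for the theory."""
    return sum(grammar for grammar, _ in lt["corpora"])

def mdl_score(lt):
    """MDL score: theory length + grammar lengths + compressed corpus lengths."""
    return lt["theory"] + sum(grammar + corpus for grammar, corpus in lt["corpora"])

best_egg = min(theories, key=lambda name: egg_score(theories[name]))
best_mdl = min(theories, key=lambda name: mdl_score(theories[name]))
print(best_egg, best_mdl)  # LT-A LT-B
```

With these numbers the large theory wins when its own size is free, and loses once its length is counted.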

40 Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context...
4 Comparison with Early Generative Grammar
5 Situate MDL within a broader intellectual context
6 More substantive description of Automorphology’s design
7 Consequences for anthropology

42 Distinction between heuristics and “theory”
• In the context of MDL, the heuristics are extratheoretical, but from the point of view of the (psycho-)linguist, they are very important.
• The heuristics propose; the theory disposes.

43 Stems with their signatures
abrupt NULL ly ness.
abs ence ent.
absent -minded NULL ia ly.
absent-minded NULL ly
absentee NULL ism
absolu NULL e ment.
absorb ait ant e er é ée
abus ait er
abîm e es ée.

44 Now build up signature collection... Top 10, 100K words
 1. NULL.ed.ing.     65  1214
 2. NULL.ed.ing.s.   27  1464
 3. NULL.s.         290  8184
 4. 's.NULL.s.       27  2645
 5. NULL.ed.s.       26   541
 6. NULL.ly.        128  2124
 7. NULL.ed.         87   767
 8. 's.NULL.         75  3655
 9. NULL.d.s.        14   510
10. NULL.ing.        62   983
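Building such a collection amounts to grouping stems by the set of suffixes they take. A minimal sketch, assuming the words have already been split into (stem, suffix) pairs; the data and the function name are illustrative, not the program's own.

```python
from collections import defaultdict

# Toy analysed words as (stem, suffix); "NULL" marks the empty suffix.
analysed = [
    ("jump", "NULL"), ("jump", "ed"), ("jump", "ing"),
    ("walk", "NULL"), ("walk", "ed"), ("walk", "ing"),
    ("dog", "NULL"), ("dog", "s"),
]

def build_signatures(pairs):
    """Map each signature (sorted suffixes joined by '.') to its list of stems."""
    suffixes_of = defaultdict(set)
    for stem, suffix in pairs:
        suffixes_of[stem].add(suffix)
    signatures = defaultdict(list)
    for stem, suffixes in suffixes_of.items():
        signatures[".".join(sorted(suffixes))].append(stem)
    return dict(signatures)

signatures = build_signatures(analysed)
for name, stems in signatures.items():
    print(name, len(stems), stems)
```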

45 Verbose signature.... NULL.ed.ing. 58
heap     check     revolt
plunder  look      obtain
escort   proclaim  arrest
gain     destroy   stay
suspect  kill      consent
knock    track     succeed
answer   frighten  glitter
....

46 Stem allomorphy
In a corpus of French, we find pairs of stems:
ç:c/_# 10
commenç\commenc
menaç\menac
renonç\renonc
avanç\avanc
annonç\annonc
s'effaç\s'effac
enfonç\enfonc
recommenç\recommenc
perç\perc
forç\forc
lanç\lanc
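Pairs like these can be found mechanically once the stems are known. A minimal sketch, assuming a flat list of stems and a single stem-final alternation to test; the function name and the sample list are illustrative.

```python
def final_allomorph_pairs(stems, a="ç", b="c"):
    """Return sorted (x+a, x+b) pairs where both variants occur as stems."""
    stem_set = set(stems)
    return sorted(
        (s, s[:-1] + b)
        for s in stem_set
        if s.endswith(a) and s[:-1] + b in stem_set
    )

stems = ["commenç", "commenc", "menaç", "menac", "lanç", "chant"]
# "lanç" is skipped here because this toy list has no matching "lanc".
print(final_allomorph_pairs(stems))
```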

48 compressed length of corpus:
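The formula on this slide is missing from the transcript. A form consistent with the probabilistic morphology described earlier, offered as an assumption, charges each word token w of the corpus C the negative log of its probability:

$$\text{compressed length}(C) = \sum_{w \in C} -\log_2 P(w), \qquad P(w) = P(\text{signature})\; P(\text{stem} \mid \text{signature})\; P(\text{suffix} \mid \text{signature})$$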

49 Heuristics Find more than one stem that commutes with more than one suffix
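One simple way to realize this heuristic, as a sketch rather than the program's own procedure: try every split of every word, then keep only the suffix sets with more than one suffix that are shared by more than one stem. The minimum stem length is an assumed parameter.

```python
from collections import defaultdict

def candidate_signatures(words, min_stem_len=2):
    """Return {suffix set: stems} where 2+ stems each take the same 2+ suffixes."""
    vocab = set(words)
    suffixes_of = defaultdict(set)
    for w in vocab:
        # Try every stem/suffix split; the empty suffix is written "NULL".
        for i in range(min_stem_len, len(w) + 1):
            suffixes_of[w[:i]].add(w[i:] or "NULL")
    stems_of = defaultdict(set)
    for stem, sufs in suffixes_of.items():
        if len(sufs) > 1:
            stems_of[frozenset(sufs)].add(stem)
    return {sig: stems for sig, stems in stems_of.items() if len(stems) > 1}

words = ["jump", "jumps", "jumped", "walk", "walks", "walked", "the"]
result = candidate_signatures(words)
for sig, stems in result.items():
    print(sorted(sig), sorted(stems))
```

Splits such as jum+p/ps/ped are proposed too, but only one stem supports each of them, so they are dropped.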

50 • Negotiate for where the stem/suffix break should be: the stems mea, christia, roma, reig, rui, saxo and tow all take the suffixes n/ns.
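One plausible repair for a case like this, sketched under the assumption that a letter shared by every suffix belongs to the stem: move it across the boundary, so that mea + n/ns becomes mean + NULL/s. In the MDL setting the shift would be accepted only if it shortens the total description; that check is omitted here.

```python
def shift_boundary(stems, suffixes):
    """If every suffix starts with the same letter, move that letter onto the stems."""
    first_letters = {s[0] for s in suffixes if s != "NULL"}
    if "NULL" in suffixes or len(first_letters) != 1:
        return stems, suffixes  # nothing shared to move
    letter = first_letters.pop()
    new_stems = [stem + letter for stem in stems]
    new_suffixes = [s[1:] or "NULL" for s in suffixes]
    return new_stems, new_suffixes

stems = ["mea", "christia", "roma", "reig", "rui", "saxo", "tow"]
print(shift_boundary(stems, ["n", "ns"]))
```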

51 Today’s plan
1 A computer program -- what it looks like, what it does.
2 The framework -- Minimum Description Length
3 Situate MDL within a linguistic context...
4 Comparison with Early Generative Grammar
5 Situate MDL within a broader intellectual context
6 More substantive description of Automorphology’s design
7 Consequences for the broader context

53 [Diagram: Data, Mind/head/brain/nervous system, Analysis]
But are the contributions of these two of equal magnitude, in the case of language? Otherwise put, to what extent does the structure here reside in the data -- and to what extent in the analyzer?

54 A rich, deductive structure?
• No shadow of a rich, deductive structure in the learner casting its image on the form of the learned morphology.
• A pure structuralism -- a structuralism without Jakobsonian dualities (but see my paper on Jakobson….)
