Day 4: Classic OT
Although we've seen most of the ingredients of OT, there's one more big thing you need to know to be able to read OT papers and listen to OT talks: constraints interact through strict ranking instead of through weighting.

Analogy: alphabetical order
Constraints:
– HaveEarly1stLetter
– HaveEarly2ndLetter
– HaveEarly3rdLetter
– HaveEarly4thLetter
– HaveEarly5thLetter
– ...

Harmonic grammar
Cabana wins because it does much better on less-important constraints.

             1st w=5   2nd w=4   3rd w=3   4th w=2   5th w=1   harmony
banana       1                   13                  13        -57
azalea                 25                  11        4         -126
azote                  25        14        19        4         -184
cabana       2                   1                   13        -26
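To make the arithmetic concrete, here is a small sketch (not from the slides) of the harmony computation behind this tableau, assuming each HaveEarlyNthLetter constraint assigns one violation per alphabet step that the word's Nth letter sits away from "a"; the weights and candidates are the ones above.

    # Harmonic Grammar: harmony = negative weighted sum of violation counts.
    # Assumed violation convention: HaveEarlyNthLetter is violated once per
    # alphabet step that the word's Nth letter is away from 'a'.
    weights = [5, 4, 3, 2, 1]                      # 1st ... 5th letter constraints
    candidates = ["banana", "azalea", "azote", "cabana"]

    def violations(word):
        return [ord(letter) - ord("a") for letter in word[:5]]

    def harmony(word):
        return -sum(w * v for w, v in zip(weights, violations(word)))

    for word in candidates:
        print(word, violations(word), harmony(word))

    # The candidate with the highest (least negative) harmony wins:
    # cabana at -26, despite losing to azalea and azote on the top constraint.
    print("winner:", max(candidates, key=harmony))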

Classic Optimality Theory
Strict ranking: all the candidates that aren't the best on the top constraint are eliminated.
– "!" means "eliminated here"
– Shading on the rest of the row indicates that it doesn't matter how well or poorly the candidate does on subsequent constraints

             1st    2nd    3rd    4th    5th
banana       1!            13            13
azalea              25            11     4
azote               25     14!    19     4
cabana       2!            1             13

Classic Optimality Theory
Repeat the elimination for subsequent constraints. Here, the two remaining candidates tie (both are the best), so we move to the next constraint. Winner(s) = the candidates that remain.

             1st    2nd    3rd    4th    5th
banana       1!            13            13
☞ azalea            25            11     4
azote               25     14!    19     4
cabana       2!            1             13
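For comparison, here is a sketch of the strict-ranking evaluation on the same candidates and violation profiles; it implements "eliminate everything that isn't best on the current constraint, then move on":

    # Classic OT: go through the constraints in ranked order; at each one,
    # keep only the candidates with the fewest violations so far.
    candidates = ["banana", "azalea", "azote", "cabana"]

    def violations(word):
        # same convention as before: each letter's distance from 'a'
        return [ord(letter) - ord("a") for letter in word[:5]]

    def ot_winners(candidates, n_constraints=5):
        remaining = list(candidates)
        for i in range(n_constraints):
            best = min(violations(w)[i] for w in remaining)
            remaining = [w for w in remaining if violations(w)[i] == best]
            if len(remaining) == 1:
                break
        return remaining

    print(ot_winners(candidates))    # ['azalea']

Note that this just reproduces alphabetical sorting: the strict-ranking winner is the alphabetically earliest candidate, which is the point of the analogy.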

Example tableaux: find the winner
Constraint 1, C2, C3, C4; candidates: a. **, b. **, c. **

Example tableaux: find the winner
C1, C2, C3, C4; candidates: a. ***, b. **, c. **

Example tableaux: find the winner
C1, C2, C3, C4; candidates: a. *, b. ****, c. **

Example tableaux: find the winner
C1, C2, C3, C4; candidates: a. ***, b. ****, c. ***

"Harmonically bounded" candidates
A fancy term for candidates that can't win under any ranking.
Simple harmonic bounding: why can't (c) win under any ranking?
C2, C3, C4; candidates: a. **, b. **, c. ***

"Harmonically bounded" candidates
Joint harmonic bounding: why can't (c) win under any ranking?

        C1     C2
a.      **
b.             **
c.      *      *
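A concrete way to check harmonic bounding is to enumerate every total ranking and record the winners. Here is a sketch using the joint-bounding configuration assumed in the tableau above (a violates only C1, b violates only C2, c violates each once):

    from itertools import permutations

    # violation profiles: candidate -> [C1, C2]
    profiles = {"a": [2, 0], "b": [0, 2], "c": [1, 1]}

    def ot_winners(profiles, ranking):
        remaining = list(profiles)
        for c in ranking:
            best = min(profiles[w][c] for w in remaining)
            remaining = [w for w in remaining if profiles[w][c] == best]
        return remaining

    possible_winners = set()
    for ranking in permutations(range(2)):      # C1 >> C2 and C2 >> C1
        possible_winners.update(ot_winners(profiles, ranking))

    print(sorted(possible_winners))   # ['a', 'b']: c can never win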

Why this matters for variation
"Multi-site" variation: more than one place in the word that can vary.
Which candidates can win under some ranking?

/akitamiso/        *i    Max-V
a. [akitamiso]     **
b. [aktamiso]      *     *
c. [akitamso]      *     *
d. [aktamso]             **

/akitamiso/        Max-V    *i
a. [akitamiso]              **
b. [aktamiso]      *        *
c. [akitamso]      *        *
d. [aktamso]       **

Why this matters for variation
Even if the ranking is allowed to vary, candidates like (b) and (c) can never occur.

/akitamiso/        *i    Max-V
a. [akitamiso]     **
b. [aktamiso]      *     *
c. [akitamso]      *     *
d. [aktamso]             **

/akitamiso/        Max-V    *i
a. [akitamiso]              **
b. [aktamiso]      *        *
c. [akitamso]      *        *
d. [aktamso]       **

How about in MaxEnt?
Can (b) and (c) ever occur?

/akitamiso/        *i    Max-V
a. [akitamiso]     **
b. [aktamiso]      *     *
c. [akitamso]      *     *
d. [aktamso]             **
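Here is a sketch of the MaxEnt calculation for this tableau, using the violation profiles above and purely illustrative weights. Each candidate gets probability exp(harmony)/Z, so (b) and (c) always receive some probability:

    import math

    # violation profiles for /akitamiso/: candidate -> [*i, Max-V]
    profiles = {
        "a. [akitamiso]": [2, 0],
        "b. [aktamiso]":  [1, 1],
        "c. [akitamso]":  [1, 1],
        "d. [aktamso]":   [0, 2],
    }

    def maxent_probs(profiles, weights):
        harmonies = {cand: -sum(w * v for w, v in zip(weights, viol))
                     for cand, viol in profiles.items()}
        z = sum(math.exp(h) for h in harmonies.values())
        return {cand: math.exp(h) / z for cand, h in harmonies.items()}

    for cand, p in maxent_probs(profiles, weights=[1.0, 1.0]).items():
        print(cand, round(p, 3))    # all four candidates at 0.25 with equal weights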

How about in Noisy Harmonic Grammar?
Suppose the two constraints have the same weight.

/akitamiso/        *i w=1   Max-V w=1
a. [akitamiso]     **
b. [aktamiso]      *        *
c. [akitamso]      *        *
d. [aktamso]                **

Special case in Noisy HG

/apataka/        *aCa w=a   Ident(lo) w=b   harmony    wins (or ties) if
a. [apataka]     ***                        -3a        a < ½b
b. [epataka]     **         *               -2a - b    --
c. [apetaka]     *          *               -a - b     a < b < 2a
d. [apateka]     *          *               -a - b     a < b < 2a
e. [apatake]     **         *               -2a - b    --
f. [epateka]                **              -2b        b < a
g. [epatake]     *          **              -a - 2b    --
h. [apetake]                **              -2b        b < a
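A simulation sketch of Noisy HG for this /apataka/ tableau: on every evaluation, add independent Gaussian noise to each base weight and let the highest-harmony candidate win. The base weights and noise level below are just illustrative; with them the sampled weights often land in the region a < b < 2a, so the compromise candidates (c) and (d) win a real share of the time even though they can never win under strict ranking:

    import random
    from collections import Counter

    # violation profiles for /apataka/: candidate -> [*aCa, Ident(lo)]
    profiles = {
        "a. [apataka]": [3, 0],
        "b. [epataka]": [2, 1],
        "c. [apetaka]": [1, 1],
        "d. [apateka]": [1, 1],
        "e. [apatake]": [2, 1],
        "f. [epateka]": [0, 2],
        "g. [epatake]": [1, 2],
        "h. [apetake]": [0, 2],
    }

    def noisy_hg_winner(profiles, base_weights, noise_sd=1.0):
        # sample evaluation-time weights, then pick the highest-harmony candidate
        w = [bw + random.gauss(0, noise_sd) for bw in base_weights]
        harmonies = {cand: -sum(wi * vi for wi, vi in zip(w, viol))
                     for cand, viol in profiles.items()}
        best = max(harmonies.values())
        return random.choice([c for c, h in harmonies.items() if h == best])

    counts = Counter(noisy_hg_winner(profiles, base_weights=[2.0, 3.0])
                     for _ in range(10000))
    for cand, n in counts.most_common():
        print(cand, n / 10000)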

Summary for harmonic bounding
In OT, harmonically bounded candidates can never win under any ranking.
– This means that applying a change to one part of a word but not another is impossible.
In MaxEnt, all candidates have some probability of winning.
In Noisy HG, harmonically bounded candidates can win only in special cases.
See Jesney 2007 for a nice discussion of harmonic bounding in weighted models.

Is it good or bad that (b) and (c) can't win in OT?
In my opinion, probably bad, because there are several cases where candidates like (b) and (c) do win...

/akitamiso/        *i    Max-V
a. [akitamiso]     **
b. [aktamiso]      *     *
c. [akitamso]      *     *
d. [aktamso]             **

French optional schwa deletion
There's a long literature on this. See Riggle & Wilson 2005, Kaplan 2011, Kimper 2011 for references.

La queue de ce renard      no deletion
La queue d' ce renard      some deletion
La queue de c' renard      some deletion
La queue de ce r'nard      some deletion
La queue d' ce r'nard      as much deletion as possible, without violating *CCC

Pima plural marking
Munro & Riggle 2004; Uto-Aztecan language of Mexico, about 650 speakers [Lewis 2009]. Infixing reduplication marks the plural. In compounds, any combination of members can reduplicate, as long as at least one does:

Singular: [ʔus-kàlit-váinom], lit. tree-car-knife, 'wagon-knife'

Plural options ('wagon-knives'):
ʔuʔus-kàklit-vápainom
ʔuʔus-kàklit-váinom
ʔuʔus-kàlit-vápainom
ʔus-kàklit-vápainom
ʔuʔus-kàlit-váinom
ʔus-kàklit-váinom
ʔus-kàlit-vápainom

Simplest theory of variation in OT: Anttila's partial ranking (Anttila 1997)
Some constraints' rankings are fixed; others vary.
I'm using the red line here to indicate varying ranking.

/θɪk/          Max-C   Ident(place)   *θ    Ident(cont)   *Dental
☞ a. [θɪk]                            *                   *
☞ b. [t̪ɪk]                                  *             *
c. [ɪk]        *!
d. [sɪk]               *!

Anttilan partial ranking
Max-C >> Ident(place) >> { *θ, Ident(continuant) } >> *Dental
(*θ and Ident(continuant) are not ranked with respect to each other)

Linearization
In order to generate a form, the constraints have to be put into a linear order. Each linear order consistent with the grammar's partial order is equally probable.

grammar                        linearization 1 (50%)   linearization 2 (50%)
Max-C                          Max-C                   Max-C
Ident(place)                   Ident(place)            Ident(place)
*θ   Ident(cont) (unranked)    *θ                      Ident(cont)
*Dental                        Ident(cont)             *θ
                               *Dental                 *Dental
                               → [t̪ɪk]                 → [θɪk]
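Here is a sketch of that procedure in code: list every total order consistent with the partial order, treat each as equally probable, and run each through strict-ranking evaluation. The constraint names and violation profiles follow the /θɪk/ tableau above:

    from itertools import permutations

    constraints = ["Max-C", "Ident(place)", "*theta", "Ident(cont)", "*Dental"]
    # ranking requirements fixed by the partial order: (dominator, dominated)
    fixed = [("Max-C", "Ident(place)"), ("Ident(place)", "*theta"),
             ("Ident(place)", "Ident(cont)"), ("*theta", "*Dental"),
             ("Ident(cont)", "*Dental")]

    # violation profiles for /θɪk/
    profiles = {
        "[θɪk]":  {"*theta": 1, "*Dental": 1},
        "[t̪ɪk]": {"Ident(cont)": 1, "*Dental": 1},
        "[ɪk]":   {"Max-C": 1},
        "[sɪk]":  {"Ident(place)": 1},
    }

    def ot_winner(ranking):
        remaining = list(profiles)
        for c in ranking:
            best = min(profiles[w].get(c, 0) for w in remaining)
            remaining = [w for w in remaining if profiles[w].get(c, 0) == best]
        return remaining[0]

    def consistent(ranking):
        return all(ranking.index(hi) < ranking.index(lo) for hi, lo in fixed)

    for ranking in (r for r in permutations(constraints) if consistent(r)):
        print(" >> ".join(ranking), "->", ot_winner(ranking))
    # two linearizations, differing only in *theta vs. Ident(cont),
    # so [θɪk] and [t̪ɪk] are each predicted 50% of the time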

Properties of this theory
No learning algorithm, unfortunately.
Makes strong predictions about variation numbers:
– If there are 2 constraints, what are the possible Anttilan grammars?
– What variation pattern does each one predict?

Finnish example (Anttila 1997)
The genitive suffix has two forms:
– "strong": -iden/-iten (with additional changes)
– "weak": -(j)en
(data from p. 3)

Factors affecting variation
Anttila shows that the choice is governed by...
– avoiding sequences of heavy or light syllables (*HH, *LL)
– avoiding high vowels in heavy syllables (*H/I) or low vowels in light syllables (*L/A)

Anttila’s grammar (p. 21) (Without going through the whole analysis)

Sample of the results (p. 23)

Day 4 summary
We've seen Classic OT, and a simple way to capture variation in that theory.
But there's no learning algorithm available for this theory, so its usefulness is limited.
Also, its predictions may be too restrictive:
– E.g., if there are 2 constraints, the candidates must be distributed 100%-0%, 50%-50%, or 0%-100%

Next time (our final day)
A theory of variation in OT that permits finer-grained predictions, and has a learning algorithm.
Ways to deal with lexical variation.

Day 4 references
Anttila, A. (1997). Deriving variation from grammar. In F. Hinskens, R. van Hout, & W. L. Wetzels (Eds.), Variation, Change, and Phonological Theory (pp. 35–68). Amsterdam: John Benjamins.
Jesney, K. (2007). The locus of variation in weighted constraint grammars. Paper presented at the Workshop on Variation, Gradience and Frequency in Phonology, Stanford University.
Kaplan, A. F. (2011). Variation through markedness suppression. Phonology, 28(3), 331–370.
Kimper, W. A. (2011). Locality and globality in phonological variation. Natural Language & Linguistic Theory, 29(2), 423–465.
Lewis, M. P. (Ed.). (2009). Ethnologue: Languages of the World (16th ed.). Dallas, TX: SIL International.
Munro, P., & Riggle, J. (2004). Productivity and lexicalization in Pima compounds. In Proceedings of BLS.
Riggle, J., & Wilson, C. (2005). Local optionality. In L. Bateman & C. Ussery (Eds.), NELS 35.

Day 5: Before we start
Last time I promised to show you numbers for multi-site variation in MaxEnt.
If the weights are equal:

/akitamiso/        *i w=1   Max-V w=1   harmony   prob.
a. [akitamiso]     **                   -2        e^-2 / Z = 0.25
b. [aktamiso]      *        *           -2        e^-2 / Z = 0.25
c. [akitamso]      *        *           -2        e^-2 / Z = 0.25
d. [aktamso]                **          -2        e^-2 / Z = 0.25

Day 5: Before we start
As the weights move apart, the "compromise" candidates remain more frequent than the all-deletion candidate:

/akitamiso/        *i w=1   Max-V w=2   harmony   prob.
a. [akitamiso]     **                   -2        e^-2 / Z ≈ 0.53
b. [aktamiso]      *        *           -3        e^-3 / Z ≈ 0.20
c. [akitamso]      *        *           -3        e^-3 / Z ≈ 0.20
d. [aktamso]                **          -4        e^-4 / Z ≈ 0.07
(Z = e^-2 + e^-3 + e^-3 + e^-4 ≈ 0.25)

Stochastic OT Today we’ll see a richer model of variation in Classic (strict-ranking) OT. But first, we need to discuss the concept of a probability distribution

What is a probability distribution?
It's a function from possible outcomes (of some random variable) to probabilities. A simple example: flipping a fair coin.

which side lands up    probability
heads                  0.5
tails                  0.5

Rolling 2 dice

sum of 2 dice                          probability
2  (1+1)                               1/36
3  (1+2, 2+1)                          2/36
4  (1+3, 2+2, 3+1)                     3/36
5  (1+4, 2+3, 3+2, 4+1)                4/36
6  (1+5, 2+4, 3+3, 4+2, 5+1)           5/36
7  (1+6, 2+5, 3+4, 4+3, 5+2, 6+1)      6/36
8  (2+6, 3+5, 4+4, 5+3, 6+2)           5/36
9  (3+6, 4+5, 5+4, 6+3)                4/36
10 (4+6, 5+5, 6+4)                     3/36
11 (5+6, 6+5)                          2/36
12 (6+6)                               1/36
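The same distribution can be built by brute-force enumeration of the 36 equally likely outcomes; a quick sketch:

    from collections import Counter
    from fractions import Fraction

    counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
    for total in sorted(counts):
        print(total, Fraction(counts[total], 36))    # e.g. 7 -> 1/6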

Probability distributions over grammars One way to think about within-speaker variation is that, at each moment, the speaker has multiple grammars to choose between. This idea is often invoked in syntactic variation (e.g., Yang 2010) – E.g., SVO order vs. verb-second order

Probability distributions over Classic OT grammars
We could have a theory that allows any probability distribution:
– Max-C >> *θ >> Ident(continuant): 0.10 (t̪ɪn)
– Max-C >> Ident(continuant) >> *θ: 0.50 (θɪn)
– *θ >> Max-C >> Ident(continuant): 0.05 (t̪ɪn)
– *θ >> Ident(continuant) >> Max-C: 0.20 (ɪn)
– Ident(continuant) >> Max-C >> *θ: 0.05 (θɪn)
– Ident(continuant) >> *θ >> Max-C: 0 (ɪn)
The child has to learn a number for each ranking (except one).

Probability distributions over Classic OT grammars
But I haven't seen any proposal like that in phonology. Instead, the probability distributions are usually constrained somehow.

Anttilan partial ranking as a probability distribution over Classic OT grammars
Id(place) >> { *θ, Id(cont) } means:
Id(place) >> *θ >> Id(cont): 50%
Id(place) >> Id(cont) >> *θ: 50%
*θ >> Id(place) >> Id(cont): 0%
*θ >> Id(cont) >> Id(place): 0%
Id(cont) >> *θ >> Id(place): 0%
Id(cont) >> Id(place) >> *θ: 0%

A less-restrictive theory: Stochastic OT
Early version of the idea from Hayes & MacEachern (1998):
– Each constraint is associated with a range, and those ranges also have fringes (margem), indicated by "?" or "??" (p. 43)

Stochastic OT Each time you want to generate an output, choose one point from each constraint’s range, then use a total ranking according to those points. This approach defines (though without precise quantification) a probability distribution over constraint rankings.

Making it quantitative
Boersma 1997: the first theory to quantify ranking preference. In the grammar, each constraint has a "ranking value":
*θ: 101
Ident(cont): 99
Every time a person speaks, they add a little noise to each of these numbers, then rank the constraints according to the new numbers.
⇒ Go to demo [Day5_StochOT_Materials.xls]
Once again, this defines a probability distribution over constraint rankings.
An Anttilan grammar is a special case of a Stochastic OT grammar.
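Here is a sketch of that sampling procedure, using the ranking values above; the evaluation-noise standard deviation of 2.0 is just illustrative, and the candidate set is reduced to the two outputs of /θɪk/ at issue:

    import random
    from collections import Counter

    ranking_values = {"*theta": 101.0, "Ident(cont)": 99.0}

    # /θɪk/: [θɪk] violates *theta, [t̪ɪk] violates Ident(cont)
    def sample_output(ranking_values, noise_sd=2.0):
        noised = {c: v + random.gauss(0, noise_sd) for c, v in ranking_values.items()}
        top = max(noised, key=noised.get)    # highest noised value outranks the other
        return "[t̪ɪk]" if top == "*theta" else "[θɪk]"

    counts = Counter(sample_output(ranking_values) for _ in range(10000))
    print(counts)   # *theta usually ends up on top, so [t̪ɪk] is the more frequent output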

Boersma's Gradual Learning Algorithm for Stochastic OT
1. Start out with both constraints at the same ranking value.
2. You hear an adult say something; suppose /θɪk/ → [θɪk].
3. You use your current ranking values to produce an output. Suppose it's /θɪk/ → [t̪ɪk].
4. Your grammar produced the wrong result! (If the result was right, repeat from Step 2.)
5. Constraints that [θɪk] violates are ranked too high; constraints that [t̪ɪk] violates are ranked too low.
6. So, promote and demote them, by some fixed amount (say 0.33 points).

/θɪk/                                  *θ                     Ident(cont)
the adult said this: [θɪk]             *  (demote by 0.33)
your grammar produced this: [t̪ɪk]                             *  (promote by 0.33)
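A sketch of this update rule, with the plasticity of 0.33 from the slide; the starting ranking values and the reduced candidate set are simplifications for illustration:

    import random

    ranking_values = {"*theta": 100.0, "Ident(cont)": 100.0}   # assumed starting point
    violations = {"[θɪk]": {"*theta": 1}, "[t̪ɪk]": {"Ident(cont)": 1}}

    def produce(ranking_values, noise_sd=2.0):
        # stochastic OT production: noise the ranking values, then evaluate;
        # with these two candidates, the higher-ranked constraint's victim loses
        noised = {c: v + random.gauss(0, noise_sd) for c, v in ranking_values.items()}
        return "[t̪ɪk]" if max(noised, key=noised.get) == "*theta" else "[θɪk]"

    def gla_update(ranking_values, adult_form, plasticity=0.33):
        learner_form = produce(ranking_values)
        if learner_form == adult_form:
            return                                    # no error, no change
        for c in violations[adult_form]:              # ranked too high: demote
            ranking_values[c] -= plasticity
        for c in violations[learner_form]:            # ranked too low: promote
            ranking_values[c] += plasticity

    for _ in range(1000):                             # the adult always says [θɪk]
        gla_update(ranking_values, "[θɪk]")
    print(ranking_values)    # Ident(cont) ends up well above *theta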

Gradual Learning Algorithm demo (same Excel file, different worksheet)

Problems with the GLA for Stochastic OT
Unlike with MaxEnt grammars, the space is not convex: there's no guarantee that there isn't a better set of ranking values far away from the current ones.
And in any case, the GLA isn't a "hill-climbing" algorithm: it doesn't have a function it's trying to optimize, just a procedure for changing in response to data.

Problems with the GLA for Stochastic OT
Pater 2008: constructed cases where some constraints never stop getting promoted (or demoted).
– This means the grammar isn't even converging to a wrong solution; it's not converging at all!
I've experienced this in applying the algorithm myself.

Still, in many cases Stochastic OT works well
E.g., Boersma & Hayes 2001:
– Variation in Ilokano reduplication and metathesis
– Variation in English light/dark /l/
– Variation in Finnish genitives (as we saw last time)

Type variation
All the theories of variation we've used so far predict token variation. In this case, every theory wrongly predicts that both words vary:

/mão+s/     Ident(round)   *ãos
mãos                       *
mães        *

/pão+s/     Ident(round)   *ãos
pãos                       *
pães        *

Indexed constraints
Pater 2009, Becker 2009: some constraints apply only to certain words.

/mão+s/ (Type A)    Ident(round)-TypeA   *ãos   Ident(round)-TypeB
mãos                                     *
mães                *!

/pão+s/ (Type B)    Ident(round)-TypeA   *ãos   Ident(round)-TypeB
pãos                                     *!
pães                                            *

Indexed constraints
If the grammar is itself variable, we can have some words whose behavior is variable (example from Huback 2011):

/sidadão+s/ (Type C)    Ident(round)-TypeC (weight: 100)   *ãos (weight: 98)
sidadãos                                                   *
sidadães                *

Where to go from here: R and regression
Download R
Download Harald Baayen's book Analyzing Linguistic Data: A Practical Introduction to Statistics Using R
– Work through the analyses in the book
– Baayen gives all the R commands and lets you download the data sets, so you can do the analyses in the book as you read about them

Where to go: Optimality Theory
Read John McCarthy's book Doing Optimality Theory: Applying Theory to Data
– A practical guide for actually doing OT
If you enjoy that, read John McCarthy's book Optimality Theory: A Thematic Guide
– Goes into more theoretical depth
There is a book in Portuguese, João Costa's 2001 Gramática, conflitos e violações. Introdução à Teoria da Optimidade.
Download OTSoft
– If you give it the candidates, constraints, and violations, it will tell you the ranking

Where to go: Stochastic OT and the Gradual Learning Algorithm
Read Boersma & Hayes's 2001 article "Empirical tests of the Gradual Learning Algorithm"
Download the data sets for the article (under part 3) and play with them in OTSoft
– Try different GLA options
– Try learning algorithms other than GLA

Where to go: Harmonic Grammar and Noisy HG
Unfortunately, I don't know of any friendly introductions to these.
Download OT-Help and try the examples
– people.umass.edu/othelp/
– The OT-Help manual might be the easiest-to-read summary of Harmonic Grammar that exists!
– Try the sample files

Where to go: MaxEnt
The original proposal to use MaxEnt for phonology was Goldwater & Johnson 2003, but it's difficult to read.
Andy Martin's 2007 UCLA dissertation has an easier-to-read introduction (chapter 4)
– artin_dissertationUCLA2007.pdf
You could try using OTSoft to fit a MaxEnt model to the Boersma/Hayes data.

Where to go: MaxEnt's Gaussian prior
To use the prior (a bias against changing weights from their default values), download the MaxEnt Grammar Tool
– In addition to the usual OTSoft input file, you need to make a file with mu and sigma^2 for each constraint (there is a sample file)
Good examples to read of using the prior:
– Chapter 4 of Andy Martin's dissertation
– Hayes & White 2013 article, "Phonological naturalness and phonotactic learning"
– / WhitePhonologicalNaturalnessAndPhonotacticLearning.pdf

Where to go: lexical variation
Becker's 2009 UMass dissertation, "Phonological Trends in the Lexicon: The Role of Constraints", develops the lexical-indexing approach.
Hayes & Londe's 2006 paper "Stochastic phonological knowledge: the case of Hungarian vowel harmony" uses another approach (Zuraw's UseListed).

Thanks for attending! Stay in touch: Working on a phonology project (with or without variation)? I’d be interested to read it.

Day 5 references
Becker, M. (2009). Phonological trends in the lexicon: the role of constraints (Ph.D. dissertation). University of Massachusetts Amherst.
Boersma, P. (1997). How we learn variation, optionality, and probability. Proceedings of the Institute of Phonetic Sciences of the University of Amsterdam, 21, 43–58.
Boersma, P., & Hayes, B. (2001). Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry, 32, 45–86.
Goldwater, S., & Johnson, M. (2003). Learning OT constraint rankings using a Maximum Entropy model. In J. Spenader, A. Eriksson, & Ö. Dahl (Eds.), Proceedings of the Stockholm Workshop on Variation within Optimality Theory (pp. 111–120). Stockholm: Stockholm University.
Hayes, B., & Londe, Z. C. (2006). Stochastic phonological knowledge: the case of Hungarian vowel harmony. Phonology, 23(1), 59–104.

Day 5 references Hayes, B., & MacEachern, M. (1998). Quatrain form in English folk verse. Language, 64, 473–507. Hayes, B., & White, J. (2013). Phonological Naturalness and Phonotactic Learning. Linguistic Inquiry, 44(1), 45–75. doi: /LING_a_00119 Huback, A. P. (2011). Irregular plurals in Brazilian Portuguese: An exemplar model approach. Language Variation and Change, 23(02), 245–256. doi: /S Martin, A. (2007). The evolving lexicon (Ph.D. Dissertation). University of California, Los Angeles. Pater, J. (2008). Gradual Learning and Convergence. Linguistic Inquiry. Pater, J. (2009). Morpheme-specific phonology: constraint indexation and inconsistency resolution. In S. Parker (Ed.), Phonological argumentation: essays on evidence and motivation. Equinox. Yang, C. (2010). Three factors in language variation. Lingua, 120(5), 1160–1177. doi: /j.lingua