1 Unsupervised Discovery of Morphemes Presented by: Miri Vilkhov & Daniel Feinstein linja-autonautonkuljettajallakaan linja-auton auto kuljettajallakaan.

[Figure: segmentation tree for the Finnish example word. Extracted fragments: linja-auton, auto, n, kuljettajallakaan, kuljettaja, lla]

2 Aim: To find the optimal segmentation of the input text into morpheme-like units (morphs) using unsupervised algorithms. Two segmentation techniques: 1. Method 1 is based on the Minimum Description Length (MDL) principle. 2. Method 2 is based on the Maximum Likelihood (ML) principle.

3 Segmentation method

4 Definitions
- Input text: flat text that contains only letters of the language's alphabet and spaces.
- Word: a sequence of letters bounded by spaces or by the start/end of the input text.
- Output: a vocabulary of morphs (codebook).
- Morph type: the definition of a morph in the codebook.
- Morph token: an instance of a morph type in the input text.

5 Method 1:

6 MDL cost function
C = Cost(Input text) + Cost(Codebook) = Σ_{morph tokens} -log p(m_i) + Σ_{morph types} k · l(m_j)
- m_1, …, m_n: the sequence of morph tokens that makes up the input text.
- l(m_i): the length of morph m_i in characters.
- k: the number of bits needed to code a character.
- p(m_i): the token count of m_i divided by the total count of morph tokens.
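The cost on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation. It assumes the text cost is the sum of -log p(m_i) over all morph tokens and the codebook cost is k bits per character of each distinct morph. The function name `mdl_cost`, the base-2 logarithm, and the default of 5 bits per character are assumptions made for the example.

```python
import math
from collections import Counter

def mdl_cost(morph_tokens, bits_per_char=5):
    """Return Cost(input text) + Cost(codebook) in bits.

    morph_tokens: the input text as a flat list of morph tokens.
    bits_per_char: k, bits needed to code one character (assumed value).
    """
    counts = Counter(morph_tokens)   # token count per morph type
    total = len(morph_tokens)        # total number of morph tokens
    # Cost(input text): each token of type m costs -log2 p(m)
    text_cost = -sum(c * math.log2(c / total) for c in counts.values())
    # Cost(codebook): every morph type is spelled out once, k bits per letter
    codebook_cost = sum(bits_per_char * len(m) for m in counts)
    return text_cost + codebook_cost
```

For instance, `mdl_cost(["affect", "ion", "s", "affect", "s"])` charges 50 bits for the three codebook entries (10 characters at 5 bits each) plus the cost of coding the five tokens.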

7 Search Algorithm
For each word in the input text do {
    If the word has been observed before then {
        1. Remove the word from the data structure
        2. Remove the word's morphs from the codebook
    }
    Segmentation(word)
}

8 Recursive segmentation
Segmentation(string = c_1, …, c_n) {
    1. Evaluate every possible split of the string into 2 parts.
    2. Select the split (or no split) with the minimum MDL cost; let i be the split index.
    3. If "no split" (i = 0) is selected:
           Codebook = Codebook ∪ {string}
       Else:
           Segmentation(c_1, …, c_i)
           Segmentation(c_{i+1}, …, c_n)
}
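The recursive procedure above might look roughly like this in Python. It is a sketch under stated assumptions: the codebook is represented as a `Counter` of morph token counts, a candidate split is scored by tentatively adding its parts to a copy of that counter, and ties go to "no split". The names `cost` and `segment`, the base-2 logarithm, and k = 5 are illustrative choices, not details given on the slides.

```python
import math
from collections import Counter

def cost(counts, k=5):
    """MDL cost of a codebook given as {morph: token count}."""
    total = sum(counts.values())
    text = -sum(c * math.log2(c / total) for c in counts.values())
    book = k * sum(len(m) for m in counts)
    return text + book

def segment(string, counts, k=5):
    """Recursively split `string`; chosen morphs are added to `counts`."""
    # Candidate 0 is "no split"; the rest are all two-part splits.
    candidates = [(string,)]
    candidates += [(string[:i], string[i:]) for i in range(1, len(string))]
    best_parts, best_cost = None, math.inf
    for parts in candidates:
        trial = counts.copy()
        trial.update(parts)          # tentatively add the parts as morph tokens
        c = cost(trial, k)
        if c < best_cost:            # strict '<' keeps "no split" on ties
            best_parts, best_cost = parts, c
    if len(best_parts) == 1:         # no split: the string itself is a morph
        counts[string] += 1
        return [string]
    left, right = best_parts
    return segment(left, counts, k) + segment(right, counts, k)
```

Calling `segment("affections", counts)` with a counter that already contains morphs from earlier words returns the leaves of the split tree in left-to-right order; which splits are chosen depends entirely on the counts seen so far.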

9 The order of splits can be represented as a binary tree. Example: affections → affect + ions; ions → ion + s. The morphs are: affect, ion, s. Codebook: 1. affect 2. s 3. …

10 Problem: Words encountered early and not observed since may have a "wrong" segmentation, because more suitable morphs have entered the codebook in the meantime. Solution: a "dreaming" stage.

11 “Dreaming” At regular intervals do: 1. Stop reading words from the input 2. Go over the words already encountered in random order. 3. Resegment these words.
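One way to picture how the search loop and the dreaming stage fit together is the sketch below. It assumes a caller-supplied `segment_fn(word, counts)` that adds the chosen morphs to `counts` and returns them as a list. The names `resegment`, `train`, `analyses`, and `dream_every`, and the interval of 1000 words, are hypothetical and only serve the illustration.

```python
import random
from collections import Counter

def resegment(word, analyses, counts, segment_fn):
    """Drop the word's old morphs from the codebook, then segment it again."""
    old = analyses.pop(word, None)
    if old is not None:
        counts.subtract(old)
        for m in set(old):
            if counts[m] <= 0:       # morph no longer used by any word
                del counts[m]
    analyses[word] = segment_fn(word, counts)

def train(words, segment_fn, dream_every=1000, seed=0):
    """Process words in order, 'dreaming' every `dream_every` words."""
    rng = random.Random(seed)
    counts = Counter()               # codebook: morph -> token count
    analyses = {}                    # word -> its current list of morphs
    for n, word in enumerate(words, start=1):
        resegment(word, analyses, counts, segment_fn)
        if n % dream_every == 0:     # dreaming: pause reading input
            seen = list(analyses)
            rng.shuffle(seen)        # revisit known words in random order
            for w in seen:
                resegment(w, analyses, counts, segment_fn)
    return analyses, counts
```

Passing the segmentation routine in as a parameter keeps the loop independent of how a single word is split, so the same loop can be tried with different cost functions.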

12 Method 2:

13 Method 2: Pre-processing: build a list of words and the frequency of each word in the corpus. The total cost consists of the cost of the input text only: Cost(Input text) = Σ_{morph tokens} -log p(m_i)
- m_i: the morph tokens that make up the input text.
- p(m_i): the token count of m_i divided by the total count of morph tokens.

14 Search Algorithm – Sequential Segmentation
1. Initialize: split words into morphs at random intervals (using a Poisson distribution).
2. Repeat for a number of iterations:
   a) Estimate the morph probabilities.
   b) Re-segment the text using the Viterbi algorithm to find the segmentation with the lowest cost.
   c) If this is not the last iteration: evaluate each word's segmentation against the rejection criteria. If it is not accepted, segment the word randomly (as in step 1).
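Step 2b can be illustrated with a small dynamic-programming routine. This is a sketch, assuming morph probabilities are given as a plain dictionary and that any substring missing from it cannot be used as a morph. The function name `viterbi_segment` and the use of the natural logarithm are choices made for the example.

```python
import math

def viterbi_segment(word, morph_prob):
    """Find the segmentation of `word` with the lowest sum of -log p(morph).

    morph_prob: dict mapping morph -> probability (estimated in step 2a).
    Returns (morphs, cost), or (None, inf) if no segmentation is possible.
    """
    n = len(word)
    best = [math.inf] * (n + 1)      # best[i]: lowest cost of word[:i]
    back = [0] * (n + 1)             # back[i]: start of the last morph in word[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            p = morph_prob.get(word[start:end], 0.0)
            if p > 0.0:
                c = best[start] - math.log(p)
                if c < best[end]:
                    best[end], back[end] = c, start
    if math.isinf(best[n]):
        return None, math.inf
    morphs, end = [], n
    while end > 0:                   # follow back-pointers from the end
        morphs.append(word[back[end]:end])
        end = back[end]
    return morphs[::-1], best[n]
```

The double loop considers every substring once, so the running time is quadratic in the word length, which is cheap for individual words.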

15 Rejection criteria
Reject the segmentation of a word if it contains one of the following:
1. A rare morph: a morph that was used in only one word type in the previous iteration.
2. A sequence of one-letter morphs, for example: carefu + l + l + y
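The two criteria could be checked along these lines. The sketch assumes the previous iteration's result is available as a dictionary from word type to morph list, and it reads "sequence of one-letter morphs" as two or more adjacent one-letter morphs, which is an interpretation rather than something the slide states. The names `rare_morphs` and `is_rejected` are hypothetical.

```python
from collections import defaultdict

def rare_morphs(prev_segmentations):
    """Morphs used in only one word type in the previous iteration.

    prev_segmentations: dict mapping word type -> list of morphs.
    """
    words_using = defaultdict(set)
    for word, morphs in prev_segmentations.items():
        for m in morphs:
            words_using[m].add(word)
    return {m for m, ws in words_using.items() if len(ws) == 1}

def is_rejected(morphs, rare):
    """Apply both rejection criteria to one word's segmentation."""
    # Criterion 1: the segmentation contains a rare morph.
    has_rare = any(m in rare for m in morphs)
    # Criterion 2 (assumed reading): two adjacent one-letter morphs.
    has_letter_run = any(
        len(a) == 1 and len(b) == 1 for a, b in zip(morphs, morphs[1:])
    )
    return has_rare or has_letter_run
```

With this reading, `is_rejected(["carefu", "l", "l", "y"], set())` is true because of the run `l + l + y`, while a single one-letter suffix such as the `s` in `["care", "s"]` is not rejected by the second criterion.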

16 Open issues – Method 2
- Why is the cost function defined the way it is?
- What is the iteration stage?
- How does the resegmentation work?
- How does this method give us the right morphs?

17 Evaluation Measures 1. Correspondence with linguistic morphemes. Using Goldsmith’s program Linguistica. 2. Efficiency of compression of the data. Can be evaluated using MDL cost function. 3. Computational efficiency. Can be estimated from the running time of the program.

19 1. Correct and complete segmentation (i.e. all relevant morphemes were identified). 2. Correct but incomplete segmentation (i.e. not all morphemes were identified). 3. Incorrect segmentation (i.e. some proposed boundaries did not correspond to actual morpheme boundaries).

21 Conclusions: Recursive splitting with the MDL cost (Method 1) performed better than Method 2 in the reported results.
