Published by Joella Bennett. Modified over 9 years ago.
1
ParaMor: Finding Paradigms Across Morphology
Christian Monson, Carnegie Mellon
2
Turkish Morphology – Beads on a String
take | passive | negative | present progressive | 2nd person singular
"You are not being taken"
3
Turkish Morphology – Beads on a String
götür | ül | m | üyor | sun
take | passive | negative | present progressive | 2nd person singular
"You are not being taken"
4
Applications of Computational Morphology
Machine Translation
– Turkish-English (Oflazer, 2007)
– Czech-English (Goldwater and McClosky, 2005)
Speech Recognition
– Finnish (Creutz, 2006)
Information Retrieval
5
Challenges of Computational Morphology
Time Consuming for a New Language
– Kemal Oflazer estimates 3-4 months to build a basic Turkish analyzer, plus lexicon development and maintenance
Expertise Needed
– Greenlandic: official language of Greenland, agglutinative Inuit language, 50,000 speakers (Per Langaard)
6
The Solution
Raw Text
Unsupervised Morphology Induction
7
ParaMor – Paradigm Morphology
ParaMor
– Unsupervised morphology induction system
Paradigm
– The natural structure of morphology
Outline: ParaMor | Identify (Search, Cluster, Filter) | Segment | Evaluation | Results
8
Paradigms – The Structure of Morphology
Stem | Voice | Polarity | Tense & Mood | Person & Number
götür | ül | m | üyor | sun
take | passive | negative | present progressive | 2nd person singular
9
Paradigms – The Structure of Morphology
Stem | Voice | Polarity | Tense & Mood | Person & Number
götür | ül | m | üyor | um
take | passive | negative | present progressive | 1st person singular
10
Paradigms – The Structure of Morphology
Stem | Voice | Polarity | Tense & Mood | Person & Number
götür | ül | m | üyor | um, Ø
take | passive | negative | present progressive | 3rd person singular (Ø)
11
Paradigms – The Structure of Morphology
Stem | Voice | Polarity | Tense & Mood | Person & Number
götür | ül | m | üyor | um, Ø, uz
take | passive | negative | present progressive | 1st person plural (uz)
12
Paradigms – The Structure of Morphology
Stem | Voice | Polarity | Tense & Mood | Person & Number
götür | ül | m | üyor | um, Ø, uz
take | passive | negative | present progressive |
13
Paradigms – The Structure of Morphology
Stem | Voice | Polarity | Tense & Mood | Person & Number
götür | ül | m | üyor, yecek | um, Ø, uz
take | passive | negative | future (yecek) |
14
Paradigms – The Structure of Morphology
Stem | Voice | Polarity | Tense & Mood | Person & Number
götür | ül | m | üyor, yecek | um, Ø, uz
take | passive | negative | |
15
Paradigms – The Structure of Morphology
Stem | Voice | Polarity | Tense & Mood | Person & Number
| ül | m | üyor, yecek | um, Ø, uz
16
Paradigms – The Structure of Morphology
Paradigms: ül | m | üyor, yecek | um, Ø, uz
17
Paradigms – The Structure of Morphology
Paradigms: ül | m | üyor, yecek | um, Ø, uz
Paradigm
– Set of mutually replaceable strings
18
Paradigms – The Structure of Morphology
ül | m | üyor, yecek | um, Ø, uz
Paradigm
– Set of mutually replaceable strings
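Note: "mutually replaceable" can be made concrete by swapping one slot while holding the others fixed. The sketch below is only an illustration: it concatenates the slot fillers shown on the slides for the present progressive and varies the Person & Number slot. The names `surface_forms`, `STEM`, and the slot lists are hypothetical. Plain concatenation ignores Turkish vowel harmony and buffer sounds, so it should not be expected to generalize to other tenses.

```python
from itertools import product

# Slot fillers as they appear on the slides; "" stands for the null suffix Ø.
STEM = "götür"
VOICE = ["ül"]
POLARITY = ["m"]
TENSE_MOOD = ["üyor"]
PERSON_NUMBER = ["um", "sun", "", "uz"]


def surface_forms():
    """Concatenate one filler per slot; only the last slot varies here."""
    slots = [[STEM], VOICE, POLARITY, TENSE_MOOD, PERSON_NUMBER]
    return ["".join(parts) for parts in product(*slots)]


if __name__ == "__main__":
    for form in surface_forms():
        print(form)
```

Each printed word differs from the others only in its final suffix, which is what makes um, sun, Ø, and uz members of one paradigm.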
19
The ParaMor Algorithm
Identify suffix paradigms in 3 steps
20
The ParaMor Algorithm
Identify suffix paradigms in 3 steps
1. Search for candidate paradigms
21
The ParaMor Algorithm
Identify suffix paradigms in 3 steps
1. Search for candidate paradigms
2. Cluster candidates modeling the same paradigm
22
The ParaMor Algorithm
Identify suffix paradigms in 3 steps
1. Search for candidate paradigms
2. Cluster candidates modeling the same paradigm
3. Filter
23
The ParaMor Algorithm
Identify suffix paradigms in 3 steps
1. Search for candidate paradigms
2. Cluster candidates modeling the same paradigm
3. Filter
Segment words
– Using the discovered paradigms
24
Search for Candidate Paradigms
All character boundaries are candidate morpheme boundaries
25
Search for Candidate Paradigms (Spanish)
Begin search with the most frequent word-final string
s: 10662
Example words: autorizaciones, buscabamos, costas, importadoras, vallas, …
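Note: a minimal sketch of the starting point described on the last two slides, assuming the input is a set of word types. Every internal character boundary is treated as a candidate morpheme boundary, and each word-final string is counted once per word type it ends. The function name `word_final_counts` and the toy vocabulary are hypothetical, not part of ParaMor.

```python
from collections import Counter


def word_final_counts(vocabulary):
    """Count, for each word-final string, how many word types end in it.

    Cuts start at position 1 so that every candidate stem is non-empty.
    """
    counts = Counter()
    for word in vocabulary:
        for cut in range(1, len(word)):
            counts[word[cut:]] += 1
    return counts


if __name__ == "__main__":
    toy_vocabulary = {"costas", "costa", "vallas", "valla", "buscar"}
    for suffix, count in word_final_counts(toy_vocabulary).most_common(5):
        print(suffix, count)
```

The search would then start from the highest-count string, which is `s` in the Spanish data on the slide.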
26
Search for Candidate Paradigms (Spanish)
Identify the most frequent mutually replaceable string
– Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm
s: 10662 → Ø s: 5501
Example words: autorizaciones, buscabamos, costas, importadoras, vallas, …
27
Search for Candidate Paradigms
Stop adding suffixes
– When the most frequent mutually replaceable string severely decreases the stem count
s: 10662 → Ø s: 5501 → Ø r s: 287
Example words: autorizaciones, buscabamos, costas, importadoras, vallas, …
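Note: the greedy growth and stopping rule above can be sketched as follows. This is a simplified illustration, not ParaMor's implementation. The slides do not give the threshold for a "severe" decrease, so `min_ratio` is a placeholder assumption, and `stems_of` and `grow_paradigm` are hypothetical names.

```python
def stems_of(suffixes, vocabulary):
    """Stems that form a vocabulary word with every suffix in the set.

    The empty string "" stands for the null suffix Ø.
    """
    first, *rest = sorted(suffixes, key=len, reverse=True)
    stems = set()
    for word in vocabulary:
        if word.endswith(first) and len(word) > len(first):
            stem = word[: len(word) - len(first)]
            if all(stem + other in vocabulary for other in rest):
                stems.add(stem)
    return stems


def grow_paradigm(seed, vocabulary, candidate_suffixes, min_ratio=0.25):
    """Greedily add the suffix that keeps the most stems.

    Stops when the best addition would shrink the stem count below
    min_ratio times the current count (an assumed threshold).
    """
    paradigm = {seed}
    stems = stems_of(paradigm, vocabulary)
    while True:
        best, best_stems = None, set()
        for suffix in candidate_suffixes - paradigm:
            kept = stems_of(paradigm | {suffix}, vocabulary)
            if len(kept) > len(best_stems):
                best, best_stems = suffix, kept
        if best is None or len(best_stems) < min_ratio * len(stems):
            return paradigm, stems
        paradigm.add(best)
        stems = best_stems
```

On the slide's counts, adding Ø to `s` keeps roughly half of the stems (5501 of 10662), while adding `r` next would keep about 5% (287 of 5501), the kind of drop the stopping rule is meant to catch.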
28
Search for Candidate Paradigms
Move on to the next most frequent word-final string
s: 10662 → Ø s: 5501 → Ø r s: 287
a: 8981
29
Search for Candidate Paradigms
s: 10662 → Ø s: 5501 → Ø r s: 287
a: 8981 → a o: 2304 → a o os: 1410 → a as o os: 892
30
Search for Candidate Paradigms
s: 10662 → Ø s: 5501 → Ø r s: 287
a: 8981 → a o: 2304 → a o os: 1410 → a as o os: 892
n: 6051 → Ø n: 1874 → Ø n r: 509 → Ø do n r: 354 → Ø da das do dos n ndo r ron: 118
31
Search for Candidate Paradigms
s: 10662 → Ø s: 5501 → Ø r s: 287
a: 8981 → a o: 2304 → a o os: 1410 → a as o os: 892
n: 6051 → Ø n: 1874 → Ø n r: 509 → Ø do n r: 354 → Ø da das do dos n ndo r ron: 118
es: 2751 → Ø es: 874
32
Search for Candidate Paradigms
s: 10662 → Ø s: 5501 → Ø r s: 287
a: 8981 → a o: 2304 → a o os: 1410 → a as o os: 892
n: 6051 → Ø n: 1874 → Ø n r: 509 → Ø do n r: 354 → Ø da das do dos n ndo r ron: 118
es: 2751 → Ø es: 874
an: 1786 → a an: 1049 → a an ar: 413 → a an ar ó: 353 → a ada adas ado ados an ar aron ó: 149
33
Search for Candidate Paradigms
s: 10662 → Ø s: 5501 → Ø r s: 287
a: 8981 → a o: 2304 → a o os: 1410 → a as o os: 892
n: 6051 → Ø n: 1874 → Ø n r: 509 → Ø do n r: 354 → Ø da das do dos n ndo r ron: 118
es: 2751 → Ø es: 874
an: 1786 → a an: 1049 → a an ar: 413 → a an ar ó: 353 → a ada adas ado ados an ar aron ó: 149
rado: 167 → rada rado: 89 → rada rado rados: 67 → rada radas rado rados: 53 → ra rada radas rado rados ran rar raron ró: 23
strado: 15 → strada strado: 12 → strada strado stró: 9 → strada strado strar stró: 8 → strada stradas strado strar stró: 7
...
34
Cluster Candidates per Paradigm
15 suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
– 22 stems: anunci, aplic, apoy, celebr, concentr, …
– 330 covered types
15 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
– 23 stems: anunci, apoy, confirm, consider, declar, …
– 345 covered types
35
Cluster Candidates per Paradigm
15 suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
– 22 stems: anunci, aplic, apoy, celebr, concentr, …
– 330 covered types
15 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
– 23 stems: anunci, apoy, confirm, consider, declar, …
– 345 covered types
Cluster of 16 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
– Cosine similarity: 0.664
– 451 covered types
36
Cluster Candidates per Paradigm
15 suffixes: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
– 25 stems: anunci, aplic, apoy, celebr, consider, …
– 375 covered types
15 suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
– 22 stems: anunci, aplic, apoy, celebr, concentr, …
– 330 covered types
15 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
– 23 stems: anunci, apoy, confirm, consider, declar, …
– 345 covered types
Cluster of 16 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
– Cosine similarity: 0.664
– 451 covered types
37
Cluster Candidates per Paradigm
15 suffixes: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
– 25 stems: anunci, aplic, apoy, celebr, consider, …
– 375 covered types
15 suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
– 22 stems: anunci, aplic, apoy, celebr, concentr, …
– 330 covered types
15 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
– 23 stems: anunci, apoy, confirm, consider, declar, …
– 345 covered types
Cluster of 16 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
– Cosine similarity: 0.664
– 451 covered types
Cluster of 17 suffixes: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
– Cosine similarity: 0.715
– 532 covered types
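Note: the slides do not say which vectors the cosine similarity is computed over. One reading that reproduces the numbers shown is to treat each candidate as the set of word types it covers and to assume a merged cluster covers the union of its members' types. Under that assumption the overlap follows from the set sizes alone. The function names below are hypothetical.

```python
from math import sqrt


def cosine_of_sets(a, b):
    """Cosine similarity of two sets viewed as binary vectors."""
    return len(a & b) / sqrt(len(a) * len(b))


def cosine_from_sizes(size_a, size_b, size_union):
    """Same quantity computed from set sizes via inclusion-exclusion."""
    overlap = size_a + size_b - size_union
    return overlap / sqrt(size_a * size_b)


if __name__ == "__main__":
    # 330 and 345 covered types merging into 451 covered types.
    print(round(cosine_from_sizes(330, 345, 451), 3))  # 0.664
    # The 451-type cluster merging with the 375-type candidate into 532.
    print(round(cosine_from_sizes(451, 375, 532), 3))  # 0.715
```

Both values match the similarities on the slide, which supports this reading but does not confirm that it is how ParaMor computes them.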
38
Filter Candidate Paradigms
2 types of filtering
1. Remove small unclustered candidate paradigms
2. Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)
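Note: Harris (1955) is the usual reference for letter successor variety, the number of distinct letters that can follow a given word prefix. The slides do not describe how ParaMor applies the idea, so the sketch below only illustrates the basic measure on a toy vocabulary. `successor_variety` is a hypothetical name.

```python
def successor_variety(prefix, vocabulary):
    """Number of distinct characters that follow `prefix` in the vocabulary."""
    return len(
        {
            word[len(prefix)]
            for word in vocabulary
            if word.startswith(prefix) and len(word) > len(prefix)
        }
    )


if __name__ == "__main__":
    toy_vocabulary = {
        "administrada",
        "administradas",
        "administrado",
        "administrar",
        "administró",
    }
    for prefix in ("admin", "administr"):
        print(prefix, successor_variety(prefix, toy_vocabulary))
```

A prefix followed by many different letters is a more plausible stem boundary than one followed by a single letter. A filter built on this idea would discard candidates whose stem-suffix boundary shows little variety.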
39
Segment Words Using Paradigms
Word: administradas
40
Segment Words Using Paradigms
Word: administradas
Paradigm: a ada adas ado ados an ar aron ó ...
41
Segment Words Using Paradigms
Word: administradas
Paradigm: a ada adas ado ados an ar aron ó ...
Related word: administrada
42
Segment Words Using Paradigms
Word: administradas
Paradigm: a ada adas ado ados an ar aron ó ...
Related word: administrada
Segmentation: administr +adas
43
Segment Words Using Paradigms
Word: administradas
Paradigm: a as o os
Related word: administrada
Segmentation so far: administr +adas
44
Segment Words Using Paradigms
Word: administradas
Paradigm: a as o os
Related word: administrada
Old way: Separate alternative analysis
– administr +adas, administrad +as
45
Segment Words Using Paradigms
Word: administradas
Paradigm: a as o os
Related word: administrada
Old way: Separate alternative analysis
– administr +adas, administrad +as
New way: Augment the current segmentation
– administr +ad +as
46
Segment Words Using Paradigms
Word: administradas
Paradigm: Ø s
Related word: administrada (administrada +Ø)
Separate analyses: administr +adas, administrad +as, administrada +s
Augmented segmentation: administr +ad +a +s
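Note: the "augment the current segmentation" idea can be sketched as collecting one boundary per supporting paradigm and then cutting the word at all of them. This is a simplified illustration under an assumed rule: a boundary is placed before a word-final suffix when the same stem occurs in the corpus with a different suffix of the same paradigm. The `segment` function and the two-word corpus are hypothetical.

```python
def segment(word, paradigms, vocabulary):
    """Cut `word` at every boundary supported by some paradigm.

    The empty string "" stands for the null suffix Ø.
    """
    boundaries = set()
    for paradigm in paradigms:
        for suffix in paradigm:
            if suffix == "" or not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)]
            if not stem:
                continue
            # Evidence: the stem also occurs with another suffix of this paradigm.
            if any(stem + other in vocabulary for other in paradigm if other != suffix):
                boundaries.add(len(stem))
    pieces, start = [], 0
    for cut in sorted(boundaries):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return " +".join(pieces)


PARADIGMS = [
    {"a", "ada", "adas", "ado", "ados", "an", "ar", "aron", "ó"},
    {"a", "as", "o", "os"},
    {"", "s"},
]
CORPUS = {"administradas", "administrada"}

if __name__ == "__main__":
    print(segment("administradas", PARADIGMS, CORPUS))  # administr +ad +a +s
    print(segment("administrada", PARADIGMS, CORPUS))   # administr +ad +a
```

Each paradigm contributes one boundary for administradas (before adas, before as, before s), and the union of the three gives the single augmented analysis shown on the slide.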
47
Morpho Challenge 2007
Peer operated competition
– For unsupervised morphology induction algorithms
4 languages
– English
– German
– Finnish
– Turkish
48
ParaMor in Morpho Challenge 2007
Developed on Spanish
– ParaMor's free parameters were frozen
49
2 Methods of Evaluation
1. Linguistic
– Segmentations compared to a morphologically analyzed lexicon
Word | Analysis | Answer
administradas | administr +ad +a +s | administrar +Adj +Fem +Pl
administrada | administr +ad +a | administrar +Adj +Fem
51
2 Methods of Evaluation
2. Task based
Information retrieval
– Short two-sentence queries
– About international news topics
– Binary relevance assessments
– About 50 queries and 20K relevance judgements for each language
52
Linguistic Evaluation (slides 52-60, one chart built up incrementally)
[Bar chart of F1 scores. Systems: Bernhard 2, Morfessor, ParaMor, ParaMor & Morfessor. Values shown across the builds, in order of first appearance: 47.2, 50.6, 50.7, 60.8, 56.3, 52.9, 53.4, 48.2, 48.5, 24.7, 52.0. The pairing of values with systems and languages is not recoverable from the transcript.]
61
IR Evaluation (TF/IDF) (slides 61-65, one chart built up incrementally)
[Bar chart of Average Precision. Systems: Morfessor, ParaMor, McNamee, Morfessor Baseline, ParaMor & Morfessor. Reference line "No Morphological Analysis" labeled 27.0, 30.7, and 32.0 on successive builds. Bar values shown across the builds, in order of first appearance: 28.9, 26.4, 29.3, 38.3, 32.1, 38.2, 38.8, 41.2, 37.2. The pairing of values with systems and languages is not recoverable from the transcript.]
66
ParaMor: State-of-the-Art Unsupervised Morphology Induction System
Combined system among the best in Morpho Challenge 2007
Consistent across languages
Better than no morphology
– Task based (IR) measure
67
Many Future Directions
Improve Performance
– F1 of 50-60% is state-of-the-art!
– Inflection classes
– Morphophonology
Beyond beads-on-a-string
68
Thank You!