
1 Unsupervised Knowledge-Free Morpheme Boundary Detection
Stefan Bordag, University of Leipzig
Outline: Example; Related work; Part One: Generating training data; Part Two: Training and applying a classifier; Preliminary results; Further research

2 Example: clearly vs. early
The examples used throughout this presentation are clearly and early. In one case the stem is clear, in the other early.
Other word forms of the same lemmas: clearly: clearest, clear, clearer, clearing; early: earlier, earliest
Semantically related words: clearly: logically, really, totally, weakly, …; early: morning, noon, day, month, time, …
Correct morpheme boundary analysis: clearly → clear-ly but not *clearl-y or *clea-rly; early → early or earl-y but not *ear-ly

3 1. Three approaches to morpheme boundary detection
1. Genetic algorithms and the Minimum Description Length model: (Kazakov 97 & 01), (Goldsmith 01), (Creutz 03 & 05). This approach utilizes only a word list, not the context information available for each word in the corpus. This possibly results in an upper limit on achievable performance (especially with regard to irregularities). One advantage is that smaller corpora are sufficient.
2. Semantics based: (Schone & Jurafsky 01), (Baroni 03). A general problem of this approach arises with examples like deeply and deepness, where semantic similarity is unlikely.
3. Letter Successor Variety (LSV) based: (Harris 55); (Hafer & Weiss 74) first application, but low performance. Also applied only to a word list. Further hampered by noise in the data.

4 2. New solution in two parts
[Pipeline diagram: sentences ("The talk was very informative") → neighbor cooccurrences ("the talk 1", "talk was 1", …) → similar words ("talk speech 20", "was is 15", …) → compute LSV with score s = LSV * freq * multiletter * bigram → training data (clear-ly, late-ly, early, …) → train trie classifier → apply classifier]

5 2.1. First part: Generating training data with LSV and distributional semantics
Overview: Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
The frequencies of words A and B are n_A and n_B, the frequency of the cooccurrence of A with B is n_AB, and the corpus size is n.
The significance computation is the Poisson approximation of the log-likelihood measure (Dunning 93), (Quasthoff & Wolff 02).
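The slides do not spell out the significance formula; the sketch below uses one common form of the Poisson approximation of the log-likelihood measure described by Quasthoff & Wolff (02), with the variable names n_A, n_B, n_AB and n from the slide above.

```python
import math

def poisson_significance(n_a: int, n_b: int, n_ab: int, n: int) -> float:
    """Poisson approximation of the log-likelihood collocation measure.

    One common form: sig(A, B) = (lam - k*ln(lam) + ln(k!)) / ln(n),
    where lam = n_a * n_b / n is the expected number of cooccurrences
    under independence and k = n_ab is the observed count.
    """
    lam = n_a * n_b / n
    k = n_ab
    # math.lgamma(k + 1) == ln(k!) without overflow for large k
    return (lam - k * math.log(lam) + math.lgamma(k + 1)) / math.log(n)

# Example: two words each occurring 1000 times in a 1,000,000-token corpus,
# cooccurring 50 times as direct neighbours.
print(poisson_significance(1000, 1000, 50, 1_000_000))
```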

6 Neighbors of "clearly"
Most significant left neighbors: very, quite, so, It's, most, it's, shows, results, that's, stated, Quite
Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood
Examples: "It's clearly labeled", "very clearly shows"

7 2.2. New solution as combination of two existing approaches
Overview: Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences bear mostly the same grammatical marker.
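The slides do not name the measure used to compare cooccurrence profiles; purely as an illustration, the sketch below compares two significance-weighted neighbor profiles with cosine similarity (both profiles are hypothetical).

```python
from collections import Counter

def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity of two sparse cooccurrence profiles."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = (sum(v * v for v in p.values()) ** 0.5) * (sum(v * v for v in q.values()) ** 0.5)
    return dot / norm if norm else 0.0

# Hypothetical neighbour profiles: neighbour word -> cooccurrence significance.
profile_clearly = Counter({"very": 12.0, "shows": 9.5, "defined": 8.2, "marked": 7.1})
profile_weakly = Counter({"very": 10.3, "defined": 6.0, "correlated": 5.5})

print(cosine(profile_clearly, profile_weakly))
```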

8 Similar words to "clearly"
Most significant left neighbors: very, quite, so, It's, most, it's, shows, results, that's, stated, Quite
Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood
Similar words: …, weakly, legally, closely, clearly, greatly, linearly, really, …

9 2.3. New solution as combination of two existing approaches
Overview: Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences bear mostly the same grammatical marker.
Sort those words by edit distance and keep the 150 most similar → further words only add random noise.
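A sketch of the filtering step, assuming plain Levenshtein distance (the slides do not specify an edit-distance variant) and the cutoff of 150 mentioned above.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

def closest_by_edit_distance(word: str, candidates: list[str], k: int = 150) -> list[str]:
    """Keep the k candidates closest to `word` in edit distance."""
    return sorted(candidates, key=lambda c: levenshtein(word, c))[:k]

print(closest_by_edit_distance("clearly",
      ["closely", "greatly", "legally", "linearly", "really", "weakly"], k=3))
```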

10 Similar words to "clearly" sorted by edit distance
Most significant left neighbors: very, quite, so, It's, most, it's, shows, results, that's, stated, Quite
Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood
Sorted list: clearly, closely, greatly, legally, linearly, really, weakly, …

11 2.4. New solution as combination of two existing approaches
Overview: Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences bear mostly the same grammatical marker.
Sort those words by edit distance and keep the 150 most similar → further words only add random noise.
Compute the letter successor variety for each transition between two characters of the input word.
Report boundaries where the LSV is above a threshold.

12 2.5. Letter successor variety
Letter successor variety: Harris (55), where a word split occurs if the number of distinct letters that follow a given sequence of characters surpasses a threshold.
Input are the 150 most similar words.
Observe how many different letters occur after a part of the string:
after #c-: 5 different letters in the given list
after #cl-: only 3 letters
after #cle-: only 1 letter
…and reversed: before -ly#, 16 different letters (16 different stems precede the suffix -ly#)
   #  c  l  e  a  r  l  y  #
  28  5  3  1  1  1  1  1      from left (thus 5 different letters after #c-)
   1  1  2  1  3 16 10 14      from right (thus 10 different letters before -y#)
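A minimal sketch of the LSV computation over the list of similar words; it counts only word-internal transitions and leaves out the # boundary markers shown on the slide.

```python
def lsv_profile(word: str, similar_words: list[str]):
    """Left and right letter successor variety for each boundary inside `word`.

    For every prefix of `word`, count how many distinct letters follow that
    prefix in the similar-word list (left LSV); the right LSV does the same
    on the reversed strings and is flipped back so both lists align on the
    same boundary positions.
    """
    def successors(target: str, words: list[str]) -> list[int]:
        counts = []
        for i in range(1, len(target)):          # candidate boundary after i characters
            prefix = target[:i]
            following = {w[i] for w in words if w.startswith(prefix) and len(w) > i}
            counts.append(len(following))
        return counts

    left = successors(word, similar_words)
    right = successors(word[::-1], [w[::-1] for w in similar_words])[::-1]
    return left, right

words = ["clearly", "closely", "greatly", "legally", "linearly", "really", "weakly"]
print(lsv_profile("clearly", words))
```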

13 2.5.1. Balancing factors
The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise:
freq: frequency differences between the beginning and the middle of a word
multiletter: representation of single phonemes with several letters
bigram: certain fixed combinations of letters
The final score s for each possible boundary is then: s = LSV * freq * multiletter * bigram

14 2.5.2. Balancing factors: Frequency
LSV is not normalized against frequency:
28 different first letters within 150 words
5 different second letters within the 11 words beginning with c
3 different third letters within the 4 words beginning with cl
Computing the frequency weight freq: 4 out of the 11 words begin with #cl-, so the weight is 4/11.
    #  c  l  e  a  r  l  y  #
  150 11  4  1  1  1  1  1      (of the 11 words beginning with c, 4 begin with cl)
  0.1 0.4 0.3  1  1  1  1  1    weights from left
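A sketch of the freq factor in the left-to-right direction, following the 4/11 example above: for each boundary it takes the share of words that continue with the longer prefix among the words carrying the shorter prefix (the right-to-left direction is analogous on reversed strings).

```python
def frequency_weights(word: str, similar_words: list[str]) -> list[float]:
    """freq factor from the left: fraction of words sharing the prefix up to
    the previous position that also share the prefix up to the current one."""
    weights = []
    prev_count = len(similar_words)              # the empty prefix matches every word
    for i in range(1, len(word) + 1):
        cur_count = sum(1 for w in similar_words if w.startswith(word[:i]))
        weights.append(cur_count / prev_count if prev_count else 0.0)
        prev_count = cur_count
    return weights

words = ["clearly", "closely", "greatly", "legally", "linearly", "really", "weakly"]
print(frequency_weights("clearly", words))
```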

15 2.5.3. Balancing factors: Multiletter phonemes
Problem: two or more letters which together represent one phoneme "carry away" the numerator of the overlap-factor quotient.
Letter successor variety for the German word schlimme:
   #  s  c  h  l  i  m  m  e
   7  1  7  2  1  1  2          from left
   2  1  1  1  2  4 15          from right
Computing the overlap factor (prefix/suffix frequencies):
 150 27 18 18  6  5  5  5
   2  2  2  2  3  7 105 150
At the boundary after sch, the LSV of 7 is weighted 1 (18/18), but since sch is one phoneme it should have been 18/150.
Solution: rank bi- and trigrams; the highest-ranked receives a weight of 1.0.
The overlap factor is recomputed as a weighted average; in this case that means 1.0 * 27/150, since 'sch' is the highest-ranked trigram and has a weight of 1.0.

16 2.5.4. Balancing factors: Bigrams
It is obvious that -th- in English should almost never be divided.
Compute a bigram ranking over all words in the word list and give a weight of 0.1 to the highest-ranked and 1.0 to the lowest-ranked bigram.
The LSV score is then multiplied with the resulting weight.
Thus the German -ch-, which is the highest-ranked bigram, receives a penalty of 0.1, making it nearly impossible for it to become a morpheme boundary.
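A sketch of the bigram penalty; the slide only fixes the endpoints (0.1 for the highest-ranked, 1.0 for the lowest-ranked bigram), so the linear interpolation over the ranks in between is an assumption of this sketch.

```python
from collections import Counter

def bigram_weights(word_list: list[str]) -> dict[str, float]:
    """Map each bigram to a weight in [0.1, 1.0]: the most frequent bigram in
    the word list gets 0.1 (strong penalty), the least frequent gets 1.0."""
    counts = Counter(w[i:i + 2] for w in word_list for i in range(len(w) - 1))
    ranked = [bg for bg, _ in counts.most_common()]          # most frequent first
    if len(ranked) < 2:
        return {bg: 0.1 for bg in ranked}
    step = (1.0 - 0.1) / (len(ranked) - 1)
    return {bg: 0.1 + i * step for i, bg in enumerate(ranked)}

weights = bigram_weights(["clearly", "closely", "greatly", "really", "the", "then", "there"])
print(weights.get("th"), weights.get("ly"))
```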

17 2.5.5. Sample computation
Compute letter successor variety:
               # c l e a r - l y #        # e a r l y #
  from left:   28  5  3  1  1  1  1  1    40  5  1  1  2  1
  from right:   1  1  2  1  3 16 10 10     1  2  1  4  6 19
Balancing, frequencies:
  from left:  150 11  4  1  1  1  1  1   150  9  2  2  2  1
  from right:   1  1  2  2  5 76 90 150    1  2  2  6 19 150
Balancing, multiletter weights:
  Bi (left):   0.4 0.1 0.5 0.2 0.5 0.0 0.2 0.2 0.5 0.0
  Tri (right): 0.1 0.1 0.1 0.1 0.0 0.0 0.1 0.0
  Bi (left):   0.5 0.2 0.5 0.0 0.1 0.3 0.5 0.0 0.1 0.3
  Tri (right): 0.1 0.1 0.0 0.0 0.2 0.0 0.0 0.2
Balancing, bigram weight:
  0.1 0.5 0.2 0.5 0.0 0.1 0.2 0.5 0.0 0.1
Left and right LSV scores:
  left:  0.1 0.3 0.0 0.4  1.0 0.9 0.0 0.0 0.5 1.7
  right: 0.3 0.9 0.1 0.0 12.4 3.7 1.0 0.0 0.7 0.2
Computing the right score for clear-ly: 16*(76/90 + 0.1*76/150)/(1.0+0.1)*(1-0.0) = 12.4
Sum of left and right scores:
  0.4 1.2 0.1 0.4 13.4 4.6 1.0 0.1 1.2 2.0
Threshold 5 → clear-ly, early

18 3. Second part: Training and applying a classifier
Any word list can be stored in a trie (Fredkin 60) or in a more efficient version of a trie, a PATRICIA compact tree (PCT) (Morrison 68).
Example: [trie diagram over the words clearly, early, lately, clear, late; ¤ = end or beginning of word]

19 3.1. PCT as a classifier
[Diagram: the training analyses clear-ly, late-ly, early, clear, late are added to a suffix PCT whose nodes store boundary counts such as ly=2, ly=1, ¤=1]
Add known information; retrieve known information; apply the deepest node found.
Examples: amazing?ly → amazing-ly; dear?ly → dearly (¤=1)
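A minimal sketch of the second stage using an uncompressed suffix trie instead of a PATRICIA tree; node counts such as ly=2 and ¤ (no boundary) follow the slide's labels, while the majority vote at the deepest matching node is an assumption of this sketch.

```python
from collections import defaultdict

class SuffixTrieClassifier:
    """Store LSV-segmented training words in a suffix trie and segment unseen
    words with the information stored at the deepest matching node."""

    def __init__(self):
        self.root = {"children": {}, "splits": defaultdict(int)}

    def add(self, segmented: str) -> None:
        word = segmented.replace("-", "")
        suffix_len = len(segmented) - segmented.rfind("-") - 1 if "-" in segmented else 0
        label = segmented[-suffix_len:] if suffix_len else "¤"   # "¤" = no boundary
        node = self.root
        for ch in reversed(word):                                # index words by suffix
            node = node["children"].setdefault(ch, {"children": {}, "splits": defaultdict(int)})
            node["splits"][label] += 1

    def segment(self, word: str) -> str:
        node, deepest = self.root, None
        for ch in reversed(word):
            if ch not in node["children"]:
                break
            node = node["children"][ch]
            if node["splits"]:
                deepest = node["splits"]
        if not deepest:
            return word
        label = max(deepest, key=deepest.get)                    # majority vote
        return word if label == "¤" else word[:-len(label)] + "-" + label

clf = SuffixTrieClassifier()
for w in ["clear-ly", "late-ly", "early", "clear", "late"]:
    clf.add(w)
print(clf.segment("amazingly"), clf.segment("quickly"))
```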

20 4. Evaluation
Boundary measuring: each detected boundary can be correct or wrong (precision), or boundaries can go undetected (recall).
The first evaluation is global LSV with the proposed improvements.
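A small sketch of the boundary scores with the standard definitions of precision, recall and F-measure over sets of boundary positions.

```python
def boundary_scores(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    """Precision, recall and F-measure over sets of boundary positions
    (character offsets at which a morpheme boundary is placed)."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# "clear-ly" has its gold boundary after character 5; a system predicting the
# same single boundary scores 1.0 on all three measures.
print(boundary_scores({5}, {5}))
```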

21 Evaluating LSV: Precision vs. Recall [chart]

22 Evaluating LSV: F-measure [chart]

23 Evaluating the combination: Precision vs. Recall [chart]

24 Evaluating the combination: F-measure [chart]

25 Comparing the combination with global LSV [chart]

26 4.1. Results
German newspaper corpus with 35 million sentences; English newspaper corpus with 13 million sentences.
t=5                   German   English
lsv Precision          80.20     70.35
lsv Recall             34.52     10.86
lsv F-measure          48.27     18.82
combined Precision     68.77     52.87
combined Recall        72.11     52.56
combined F-measure     70.40     55.09

27 4.2. Statistics
                             en lsv    en comb    tr lsv    tr comb    fi lsv    fi comb
Corpus size                  13 million           1 million            4 million
Number of word forms         167,377              582,923              1,636,336
Analysed words               49,159    94,237     26,307    460,791    68,840    1,380,841
Boundaries                   70,106    131,465    31,569    812,454    84,193    3,138,039
Morpheme length              2.60      2.56       2.29      3.03       2.32      3.73
Length of analysed words     8.97      8.91       9.75      10.62      11.94     13.34
Length of unanalysed words   7.56      6.77       10.12     8.15       12.91     10.47
Morphemes per word           2.43      2.40       2.20      2.76       2.22      3.27

28 4.3. Assessing the true error rate
Typical sample list of words considered wrong according to CELEX (each pair shows the two competing segmentations):
Tau-sende / Tausend-e
senegales-isch-e / senegalesisch-e
sensibelst-en / sens-ibel-sten
separat-ist-isch-e / separ-at-istisch-e
tris-t / trist
triump-hal / triumph-al
trock-en / trocken
unueber-troff-en / un-uebertroffen
trop-f-en / tropf-en
trotz-t-en / trotz-ten
ver-traeum-t-e / vertraeumt-e
Reasons:
Gender -e (in (Creutz & Lagus 05), for example, counted as correct)
Compounds (sometimes separated, sometimes not)
-t-en error
With proper names, -isch is often not analyzed
Connecting elements

29 4.4. Real example
Orien-tal, Orien-tal-ische, Orien-tal-ist, Orien-tal-ist-en, Orien-tal-ist-ik, Orien-tal-ist-in, Orient-ier-ung, Orient-ier-ungen, Orient-ier-ungs-hilf-e, Orient-ier-ungs-hilf-en, Orient-ier-ungs-los-igkeit, Orient-ier-ungs-punkt, Orient-ier-ungs-punkt-e, Orient-ier-ungs-stuf-e, Ver-trau-enskrise, Ver-trau-ensleute, Ver-trau-ens-mann, Ver-trau-ens-sache, Ver-trau-ensvorschuß, Ver-trau-ensvo-tum, Ver-trau-ens-würd-igkeit, Ver-traut-es, Ver-trieb-en, Ver-trieb-spartn-er, Ver-triebene, Ver-triebenenverbände, Ver-triebs-beleg-e

30 5. Further research
Examine quality on various language types
Improve the trie-based classifier
Possibly combine with other existing algorithms
Find out how to acquire the morphology of non-concatenative languages
Deeper analysis: find deletions, alternations, insertions, morpheme classes, etc.

31 6. References
(Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of Coling 2004, Geneva, Switzerland, 2004.
(Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213-248, 2003.
(Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In Publications in Computer and Information Science, Report A81. Helsinki University of Technology, March 2005.
(Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295-299, Adelaide, January 1998.
(Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.

32 6. References II
(Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198, 2001.
(Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.
(Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190-222, 1955.
(Kazakov 97) Dimitar Kazakov. Unsupervised learning of naïve morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, Czech Republic, April 1997.
(Quasthoff & Wolff 02) Uwe Quasthoff and Christian Wolff. The Poisson collocation measure and its applications. In Second International Workshop on Computational Approaches to Collocations, 2002.
(Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language-independent induction of part of speech class labels using only language universals. In Workshop at IJCAI-2001, Machine Learning: Beyond Supervision, Seattle, WA, August 2001.

33 E. Gender-e vs. Frequency-e
Gender-e: Schule 8.4, Devise 7.8, Sonne 4.5, Abendsonne 5.3, Abende 5.5, Liste 6.5
vs. other -e: andere 8.4, keine 6.8, rote 11.6, stolze 8.0, drehte 10.8, winzige 9.7, lustige 13.2, rufe 4.4, Dumme 12.6
Frequency-e: Affe 2.7, Junge 5.3, Knabe 4.6, Bursche 2.4, Backstage 3.0

