
Slide 1: Induction of a Simple Morphology for Highly-Inflecting Languages
Mathias Creutz and Krista Lagus ({Mathias.Creutz, Krista.Lagus}@hut.fi), Helsinki University of Technology, Neural Networks Research Centre.
Current Themes in Computational Phonology and Morphology, 7th Meeting of the ACL Special Interest Group in Computational Phonology, ACL-2004, Barcelona, 26 July 2004.
Example segmentations: kahvi + n + juo + ja + lle + kin; nyky + ratkaisu + i + sta + mme; tietä + isi + mme + kö + hän; open + mind + ed + ness; un + believ + able

Slide 2: Goals and challenges
Learn representations of
– the smallest meaningful units of language (morphemes)
– and their interaction,
– in an unsupervised manner from raw text,
– making assumptions that are as general and language-independent as possible.
Evaluate
– against a given gold-standard morphological analysis of word forms (first step: learn and evaluate a morpheme segmentation of word forms),
– integrated in NLP applications (e.g., speech recognition).

Slide 3: Focus: Agglutinative morphology
Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):
– kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
– nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
– tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)
Consequences: there is a huge number of different possible word forms, so knowing the inner structure of words is important in NLP, and the number of morphemes per word varies greatly.

Slide 4: 1. MDL model (Creutz & Lagus, 2002), inspired by work of, e.g., J. Goldsmith
"Invent" a set of distinct strings (morphs) that form a morph lexicon, then pick morphs from the lexicon and place them in a sequence to produce the corpus / word list. Learning from data aims at the most concise representation possible.
(Illustration: lexicon morphs tä, ssä, pala, peli, on, tuhat, a; corpus segmented as tä ssä pala peli ssä on tuhat pala a.)
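To make the idea concrete, here is a minimal sketch of a two-part description length in Python. It is not the authors' exact cost function: the per-character lexicon cost (bits_per_char) and the toy segmentations below are assumptions made purely for illustration.

```python
import math
from collections import Counter

def description_length(segmented_corpus, bits_per_char=5.0):
    """Rough two-part MDL cost: lexicon cost plus corpus cost.

    segmented_corpus: a list of words, each given as a list of morph strings.
    bits_per_char: assumed cost of spelling out one character of a lexicon entry.
    """
    tokens = [m for word in segmented_corpus for m in word]
    counts = Counter(tokens)
    total = sum(counts.values())

    # Lexicon cost: each distinct morph is spelled out once.
    lexicon_cost = sum(bits_per_char * len(m) for m in counts)
    # Corpus cost: each morph token coded with -log2 of its maximum-likelihood probability.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_cost + corpus_cost

# Compare two segmentations of the same tiny corpus; the more concise overall
# representation gets the lower cost.
seg_a = [["tä", "ssä"], ["pala", "peli", "ssä"], ["on"], ["tuhat"], ["pala", "a"]]
seg_b = [["tässä"], ["palapelissä"], ["on"], ["tuhat"], ["palaa"]]
print(description_length(seg_a), description_length(seg_b))
```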

Slide 5: 2. Probabilistic formulation (Creutz, 2003), inspired by work of, e.g., M. R. Brent and M. G. Snover
The generative setup is the same as in the MDL model ("invent" a set of distinct strings, i.e., morphs, for the morph lexicon; pick morphs from the lexicon and place them in a sequence to produce the corpus / word list), but the cost is expressed probabilistically, with explicit priors on morph frequency and morph length.
(Illustration: the same lexicon and corpus example as on the previous slide.)
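A sketch of what frequency and length priors might look like, under assumed functional forms (a geometric prior on morph length and an unnormalised Zipf-like power-law prior on morph frequency); the priors actually used in Creutz (2003) may differ.

```python
import math

def log_length_prior(morph, p_end=1/6):
    """Geometric prior on morph length: each additional character continues the
    morph with probability (1 - p_end). An assumed form, not the paper's exact prior."""
    return (len(morph) - 1) * math.log(1 - p_end) + math.log(p_end)

def log_frequency_prior(freq, alpha=1.5):
    """Unnormalised Zipf-like power-law prior on morph frequency, p(f) ~ f**(-alpha).
    Also an assumed form chosen for illustration."""
    return -alpha * math.log(freq)

def log_lexicon_prior(morph_counts):
    """Sum both priors over all distinct morphs in the lexicon."""
    return sum(log_length_prior(m) + log_frequency_prior(c)
               for m, c in morph_counts.items())

# The quantity to optimise also includes the corpus likelihood (the morph-token
# code of the MDL sketch above), so the full cost combines both terms.
counts = {"tä": 1, "ssä": 2, "pala": 2, "peli": 1, "on": 1, "tuhat": 1, "a": 1}
print(log_lexicon_prior(counts))
```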

Slide 6: Reflections on solutions 1 and 2
Both behave like "dumb" text compression algorithms:
– Common substrings of words appear as one segment, even when there is compositional structure, e.g., keskustelussa (keskustel + u + ssa; "discuss + -ion + in"), biggest (bigg + est).
– Rare substrings of words are split, even when there is no compositional structure, e.g., a + den + auer (Adenauer; German politician), in + s + an + e (insane).
– Structural constraints are too weak, e.g., suffixes are recognized at the beginning of words: s + can (scan).

Slide 7: 3. Category-learning probabilistic model
Word structure is captured by a regular expression: word = ( prefix* stem suffix* )+
Morph sequences (words) are generated by a Hidden Markov model over the categories, with transition probabilities between categories (e.g., p(STM | PRE), p(SUF | SUF)) and emission probabilities of morphs given categories (e.g., p('nyky' | PRE), p('mme' | SUF)).
(Illustration: the word # nyky + ratkaisu + i + sta + mme # generated by the HMM, with # marking word boundaries.)
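A minimal sketch of how such an HMM scores one tagged segmentation. The transition and emission values below are made up for illustration; only the structure (a boundary-to-boundary product of transition and emission probabilities) reflects the slide.

```python
import math

transition = {   # p(next_category | previous_category); '#' is the word boundary
    "#":   {"PRE": 0.3, "STM": 0.7},
    "PRE": {"PRE": 0.1, "STM": 0.9},
    "STM": {"STM": 0.2, "SUF": 0.5, "#": 0.3},
    "SUF": {"SUF": 0.6, "#": 0.4},
}
emission = {     # p(morph | category); toy values only
    "PRE": {"nyky": 0.05},
    "STM": {"ratkaisu": 0.01},
    "SUF": {"i": 0.10, "sta": 0.08, "mme": 0.06},
}

def log_prob(morphs, tags):
    """Log-probability of one tagged segmentation: product of transition and
    emission probabilities from word boundary to word boundary."""
    logp = 0.0
    prev = "#"
    for morph, tag in zip(morphs, tags):
        logp += math.log(transition[prev][tag]) + math.log(emission[tag][morph])
        prev = tag
    return logp + math.log(transition[prev]["#"])

print(log_prob(["nyky", "ratkaisu", "i", "sta", "mme"],
               ["PRE", "STM", "SUF", "SUF", "SUF"]))
```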

Slide 8: Category algorithm
1. Start with an existing baseline morph segmentation (Creutz, 2003): nyky + rat + kaisu + ista + mme
2. Initialize category membership probabilities for each morph, e.g., p(PRE | 'nyky'). Assume asymmetries between the categories: prefixes are typically followed by many different morphs (high right perplexity), suffixes are preceded by many different morphs (high left perplexity), and stems tend to be long.

Slide 9: Initialization of category membership probabilities
Introduce a noise category for cases where none of the proper classes (prefix, stem, suffix) is likely. Distribute the remaining probability mass proportionally among the proper categories.
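A sketch of such an initialisation. The cue functions (sigmoids over left/right perplexity and morph length), their thresholds, and the product form of the noise probability are assumptions made for illustration, not the exact formulas from the slides.

```python
import math

def sigmoid(x, threshold, steepness=1.0):
    """Graded cue in [0, 1] that rises past the threshold."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))

def init_category_probs(morph, right_perplexity, left_perplexity,
                        ppl_threshold=10.0, len_threshold=4.0):
    """Illustrative initialisation of p(category | morph).

    Assumed cues, following the general idea of the slides:
    - prefix-like: high perplexity of the morphs that follow (right perplexity)
    - suffix-like: high perplexity of the morphs that precede (left perplexity)
    - stem-like:   the morph is long
    A noise category absorbs morphs for which no proper class is likely.
    """
    prefix_like = sigmoid(right_perplexity, ppl_threshold)
    suffix_like = sigmoid(left_perplexity, ppl_threshold)
    stem_like = sigmoid(len(morph), len_threshold)

    p_noise = (1 - prefix_like) * (1 - suffix_like) * (1 - stem_like)
    # Distribute the remaining mass proportionally to the three cues.
    total = prefix_like + suffix_like + stem_like
    rest = 1.0 - p_noise
    return {
        "PRE": rest * prefix_like / total,
        "STM": rest * stem_like / total,
        "SUF": rest * suffix_like / total,
        "NON": p_noise,
    }

print(init_category_probs("nyky", right_perplexity=25.0, left_perplexity=3.0))
```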

Slide 10: Category algorithm (continued)
1. Start with an existing baseline morph segmentation: nyky + rat + kaisu + ista + mme
2. Initialize category membership probabilities for each morph.
3. Tag morphs as prefix, stem, suffix, or noise; then run EM on the taggings: nyky + rat + kaisu + ista + mme
4. Split morphs that consist of other known morphs; then EM: nyky + rat + kaisu + i + sta + mme
5. Join noise morphs with their neighbours; then EM: nyky + ratkaisu + i + sta + mme
(A sketch of the split and join operations follows below.)
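Hypothetical helper functions sketching steps 4 and 5; they illustrate the operations on the slide's example, not the authors' actual search procedure.

```python
def split_into_known(morph, lexicon, min_len=1):
    """Step 4 sketch: recursively split a morph into substrings that are already
    in the morph lexicon (a hypothetical helper, not the authors' exact search)."""
    if len(morph) < 2 * min_len:
        return [morph]
    for i in range(min_len, len(morph) - min_len + 1):
        left, right = morph[:i], morph[i:]
        if left in lexicon and right in lexicon:
            return (split_into_known(left, lexicon, min_len)
                    + split_into_known(right, lexicon, min_len))
    return [morph]

def join_noise(morphs, tags):
    """Step 5 sketch: merge morphs tagged as noise ('NON') into the preceding
    morph; word-initial noise is simply kept and retagged as a stem here."""
    out_m, out_t = [], []
    for m, t in zip(morphs, tags):
        if t == "NON" and out_m:
            out_m[-1] += m                       # attach noise to its left neighbour
        else:
            out_m.append(m)
            out_t.append("STM" if t == "NON" else t)
    return out_m, out_t

lexicon = {"nyky", "rat", "kaisu", "ratkaisu", "i", "sta", "mme"}
print(split_into_known("ista", lexicon))         # -> ['i', 'sta']
print(join_noise(["nyky", "rat", "kaisu", "i", "sta", "mme"],
                 ["PRE", "STM", "NON", "SUF", "SUF", "SUF"]))
```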

Slide 11: Experiments
Algorithms:
– Baseline model (Bayesian formulation)
– Category-learning model
– Goldsmith's "Linguistica" (MDL formulation)
Data:
– Finnish data sets (CSC + STT): 10 000, 50 000, 250 000, and 16 million words
– English data sets (Brown corpus): 10 000, 50 000, and 250 000 words
(Illustration: stems believ, hop, liv, mov, us combined with suffixes e, ed, es, ing.)

Slide 12: "Gold standard" used in evaluation
Morpheme segmentations for Finnish and English words were obtained by processing the output of two-level morphology analyzers (FINTWOL and ENGTWOL by Lingsoft, Inc.).
Some "fuzzy morpheme boundaries" are allowed, mainly where stem-final alternation forms a seam or joint that may belong to either the stem or the suffix, e.g.:
– Windsori + n or Windsor + in; Windsore + i + lla or Windsor + ei + lla (cf. Windsor)
– invite + s or invit + es; invite or invit + e (cf. invit + ing)
Evaluation computes the precision and recall of correctly discovered morpheme boundaries (a small sketch of this computation follows below).
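A minimal sketch of boundary precision and recall. It uses exact boundary matching and ignores the "fuzzy" boundaries mentioned above, which would accept either of two adjacent positions; the example words are illustrative.

```python
def boundary_set(segmentation):
    """Positions of morph boundaries inside a word, e.g. ['invit', 'es'] -> {5}."""
    positions, pos = set(), 0
    for morph in segmentation[:-1]:
        pos += len(morph)
        positions.add(pos)
    return positions

def precision_recall(predicted, gold):
    """Boundary precision and recall over parallel lists of segmentations."""
    hits = pred_total = gold_total = 0
    for p, g in zip(predicted, gold):
        pb, gb = boundary_set(p), boundary_set(g)
        hits += len(pb & gb)
        pred_total += len(pb)
        gold_total += len(gb)
    return hits / pred_total, hits / gold_total

pred = [["invit", "es"], ["re", "open", "ed"]]
gold = [["invite", "s"], ["re", "opened"]]
print(precision_recall(pred, gold))
```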

Slide 13: Results (evaluated against the gold standard)
(The slide showed plots of the results for the Baseline, Categories, and Linguistica algorithms on the 10k, 250k, and 16M-word data sets.)

Slide 14: Discussion
The Category algorithm
– overcomes many of the shortcomings of the Baseline algorithm (excessive or insufficient segmentation, suffixes at the beginning of words),
– generalizes more than Linguistica, e.g., allus + ion + s (Categories) vs. allusions (Linguistica); Dem + i (Categories) vs. Demi (Linguistica),
– invents its own solutions, e.g., aihe + e + sta vs. aihe + i + sta ("about [the] topic / topics"); phrase, phrase + s, phrase + d.

Slide 15: Future directions
– The Category algorithm could be expressed more elegantly, rather than as a post-processing procedure that makes use of a baseline segmentation.
– Segmentation into morphs is useful, e.g., for n-gram language modeling in speech recognition.
– Detection of allomorphy, i.e., segmentation into morphemes, would be even more useful, e.g., for information retrieval (?).

Slide 16: Public demo
A demo of the baseline and category-learning algorithms is available on the Internet at http://www.cis.hut.fi/projects/morpho/. Test it on your own Finnish or English input!

Slide 17: Search for the optimal segmentation of the words in a corpus
Each word is analysed by recursive binary splitting into morphs, e.g., reopened → reopen + ed → re + open + ed and openminded → open + minded → open + mind + ed. The words of the corpus (e.g., conferences, opening, words, openminded, reopened) are processed in random order; the words are reshuffled and reprocessed until the description length converges. (A sketch of this loop follows below.)
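A compact sketch of the search loop, reusing the toy cost from the slide-4 sketch: resegment each word by greedy recursive binary splitting, shuffle, and repeat until the description length stops improving. The function names, the per-character lexicon cost, and the convergence tolerance are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import Counter

def description_length(counts):
    """Toy two-part cost: corpus code length plus a crude per-character lexicon cost."""
    total = sum(counts.values())
    code = -sum(c * math.log2(c / total) for c in counts.values())
    return code + sum(5.0 * len(m) for m in counts)

def resplit(word, counts):
    """Greedy recursive binary splitting: keep the word whole, or take the single
    best split point and recurse into both parts. Commits the chosen morphs to counts."""
    def cost_with(morphs):
        for m in morphs:
            counts[m] += 1
        cost = description_length(counts)
        for m in morphs:
            counts[m] -= 1
            if counts[m] == 0:
                del counts[m]
        return cost

    best_i, best_cost = None, cost_with([word])
    for i in range(1, len(word)):
        cost = cost_with([word[:i], word[i:]])
        if cost < best_cost:
            best_i, best_cost = i, cost
    if best_i is None:
        counts[word] += 1
        return [word]
    return resplit(word[:best_i], counts) + resplit(word[best_i:], counts)

def segment_corpus(word_list, max_epochs=20, tol=0.01):
    """Outer loop from the flowchart: shuffle the words, resegment each one,
    and stop when the description length has converged."""
    word_list = list(dict.fromkeys(word_list))       # treat input as word types
    counts = Counter(word_list)                      # start with whole words as morphs
    segmentation = {w: [w] for w in word_list}
    prev_cost = description_length(counts)
    for _ in range(max_epochs):
        order = list(word_list)
        random.shuffle(order)
        for w in order:
            for m in segmentation[w]:                # remove the word's old analysis
                counts[m] -= 1
                if counts[m] == 0:
                    del counts[m]
            segmentation[w] = resplit(w, counts)     # adds the new morphs to counts
        cost = description_length(counts)
        if prev_cost - cost < tol:                   # convergence of descr. length?
            break
        prev_cost = cost
    return segmentation

print(segment_corpus(["conferences", "opening", "words", "openminded", "reopened"]))
```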

