HELSINKI UNIVERSITY OF TECHNOLOGY, ADAPTIVE INFORMATICS RESEARCH CENTRE
Morpho Challenge in Pascal Challenges Workshop, Venice, 12 April 2006

Morfessor in the Morpho Challenge
Mathias Creutz and Krista Lagus
Helsinki University of Technology (HUT), Adaptive Informatics Research Centre
HUT 2: Challenge for NLP: too many words

Finnish words, for example, often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):
– kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
– nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
– tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)

This yields a huge number of word forms, with few examples of each. By splitting words we get fewer basic units, each with more examples, so it is important to know the inner structure of words. (Source: Creutz & Lagus, 2005 tech. rep.)
HUT 3: Solution approaches

Hand-made morphological analyzers (e.g., based on Koskenniemi's TWOL = two-level morphology):
+ accurate
– labour-intensive construction, commercial, limited coverage, need updating when languages change, costly to extend to new languages

Data-driven methods, preferably minimally supervised (e.g., John Goldsmith's Linguistica):
+ adaptive, language-independent
– lower accuracy
– many existing algorithms assume few morphemes per word, making them unsuitable for compounds and multiple affixes
HUT 4: Goal: segmentation

Learn representations of
– the smallest individually meaningful units of language (morphemes)
– and their interaction
in an unsupervised, data-driven manner from raw text, making assumptions as general and language-independent as possible (Morfessor).

Evaluate
– against a gold-standard morphological analysis of word forms (Hutmegs)
– integrated in NLP applications (e.g., speech recognition)
HUT 5: Further challenges in morphology learning

– Beyond segmentation: allomorphy ("foot – feet, goose – geese")
– Detection of semantic similarity ("sing – sings – singe – singed")
– Learning of paradigms (e.g., John Goldsmith's Linguistica), such as the stems {believ, hop, liv, mov, us} combining with the suffixes {e, ed, es, ing}
HUT 6: Linguistic evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)

Hutmegs contains gold-standard segmentations obtained by processing the morphological analyses of FinTWOL and CELEX:
– 1.4 million Finnish word forms (FinTWOL, from Lingsoft Inc.)
  Input: ahvenfileerullia (perch filet rolls)
  FinTWOL: ahven#filee#rulla N PTV PL
  Hutmegs: ahven + filee + rull + i + a
– 120 000 English word forms (CELEX, from LDC)
  Input: housewives
  CELEX: house wife, NNx, P
  Hutmegs: house + wive + s

Publicly available; see M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English.
HUT 7: Morfessor models in the Challenge

M1: Morfessor Baseline (2002)
– program code available since 2002
– provided as a baseline model for the Morpho Challenge
– improves speech recognition; experiments since 2003
– no model of morphotactics

M2: Morfessor Categories-ML (2004)
– category-based modeling (HMM) of morphotactics
– no speech recognition experiments before this challenge
– no public software yet

M3: Morfessor Categories-MAP (2005)
– more elegant mathematically
HUT 8: Avoiding overlearning by controlling model complexity

When using powerful machine learning methods, overlearning is always a problem. Occam's razor: given two equally accurate theories, choose the one that is less complex. We have used:
– heuristic control affecting the size of the lexicon (M2)
– a cost function that incorporates a measure of model size, derived using MDL (Minimum Description Length) (M1) or MAP learning (Maximum A Posteriori) (M3)
HUT 9: Morfessor Baseline (M1)

Originally called the "Recursive MDL method". Optimizes roughly:

  P(M | corpus) ∝ P(M) P(corpus | M), where M = (lexicon, grammar)

Since the Baseline has no model of morphotactics (no grammar), this reduces to

  P(lexicon) P(corpus | lexicon) = ∏ P(letter) · ∏ P(morph)

where the first product runs over the letters spelling out the lexicon and the second over the morph tokens of the corpus.

+ The MDL-based cost function optimizes the size of the model
– Morph contextual information is not utilized:
  – undersegmentation of frequent strings ("forthepurposeof")
  – oversegmentation of rare strings ("in + s + an + e")
  – syntactic/morphotactic violations ("s + can")
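The two-part cost above can be sketched in a few lines. This is a simplification, not the actual Morfessor Baseline code: it assumes a unigram model over morph tokens for the corpus cost and a letter-by-letter spell-out of the lexicon for the model cost.

```python
import math
from collections import Counter

def baseline_cost(segmented_corpus):
    """Two-part MDL-style cost: -log P(lexicon) - log P(corpus | lexicon).

    segmented_corpus is a list of words, each word a list of morphs.
    The lexicon (the set of distinct morphs) is coded letter by letter;
    the corpus is coded as a sequence of morph tokens under a unigram model.
    """
    morph_counts = Counter(m for word in segmented_corpus for m in word)
    total_tokens = sum(morph_counts.values())

    # Lexicon cost: each distinct morph is spelled out letter by letter.
    letter_counts = Counter(c for m in morph_counts for c in m)
    total_letters = sum(letter_counts.values())
    lexicon_cost = -sum(
        n * math.log2(n / total_letters) for n in letter_counts.values()
    )

    # Corpus cost: negative log-likelihood of the morph token sequence.
    corpus_cost = -sum(
        n * math.log2(n / total_tokens) for n in morph_counts.values()
    )
    return lexicon_cost + corpus_cost
```

Even this toy cost captures the key trade-off: a corpus with repeated stems and suffixes is cheaper to code segmented (small lexicon, frequent morphs) than unsegmented (large lexicon of whole word forms).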
HUT 10: Search for the optimal model (M1)

Recursive binary splitting, e.g.:
  reopenminded → reopen + minded → re + open + mind + ed

[Flowchart: randomly shuffle the words (e.g., conferences, opening, words, openminded, reopened); recursively split each word in two wherever this lowers the cost; if the model probability has not converged, shuffle and repeat; otherwise done.]
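The recursive binary splitting step can be sketched as follows. The cost function here is a stand-in, not the Morfessor cost: known morphs are priced by hypothetical unigram frequencies, and unseen morphs pay roughly one letter-code per character, which is what makes splitting into known morphs attractive.

```python
import math
from functools import lru_cache

def segment(word, morph_freq, total):
    """Recursive binary splitting: try every split point of every
    substring and keep the segmentation with the lowest coding cost."""

    def cost(morphs):
        # -log2 probability of the morph sequence; an unseen morph is
        # coded letter by letter at about log2(27) bits per character.
        c = 0.0
        for m in morphs:
            if m in morph_freq:
                c += -math.log2(morph_freq[m] / total)
            else:
                c += len(m) * math.log2(27)
        return c

    @lru_cache(maxsize=None)
    def split(w):
        best, best_cost = (w,), cost([w])
        for i in range(1, len(w)):
            cand = split(w[:i]) + split(w[i:])
            cand_cost = cost(list(cand))
            if cand_cost < best_cost:
                best, best_cost = cand, cand_cost
        return best

    return list(split(word))
```

For example, with the (made-up) frequencies {re: 10, open: 10, mind: 10, ed: 10} out of 40 tokens, `segment("reopenminded", ...)` recovers re + open + mind + ed, while a word that is already a single cheap morph is left whole.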
HUT 11: Challenge results: comparison to gold-standard splitting (F-measures)

[Figure: F-measures of the challenge participants, with the Morfessor Baseline (M1) and the winners highlighted.]
HUT 12: Morfessor Categories-ML & -MAP (M2, M3)

Lexicon/grammar dualism:
– word structure is captured by a regular expression: word = ( prefix* stem suffix* )+
– morph sequences (words) are generated by a hidden Markov model (HMM), e.g., for "oversimplifications":
    # over simpl ific ation s #
  with transition probabilities such as P(STM | PRE) and P(SUF | SUF), and emission probabilities such as P('over' | PRE) and P('s' | SUF)
– the lexicon stores morpheme properties and contextual properties
– the morph segmentation is initialized using M1
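Scoring a tagged morph sequence under such a category HMM is a product of transitions and emissions. A minimal sketch, using made-up toy probability tables (the real model estimates these from data) and '#' as the word-boundary category:

```python
import math

def word_logprob(tagged_morphs, trans, emit):
    """log P of a tagged morph sequence under the category HMM:
    one transition P(cat_i | cat_{i-1}) and one emission
    P(morph_i | cat_i) per morph, plus the final transition back
    to the word boundary '#'."""
    logp, prev = 0.0, "#"
    for morph, cat in tagged_morphs:
        logp += math.log(trans[(prev, cat)]) + math.log(emit[(morph, cat)])
        prev = cat
    return logp + math.log(trans[(prev, "#")])
```

For the example on the slide, the tagged sequence would be over/PRE simpl/STM ific/SUF ation/SUF s/SUF, scored between two boundary states.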
HUT 13: Morph lexicon (M2, M3)

For each morph (e.g., "over", "simpl", "s") the lexicon stores its form (the string) together with distributional features: frequency, length, left perplexity and right perplexity.
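The perplexity features measure how varied a morph's neighbours are. A sketch of the right-hand version over a toy segmented corpus (high right perplexity suggests prefix- or stem-like behaviour, since many different morphs can follow):

```python
import math
from collections import Counter

def right_perplexity(morph, segmented_corpus):
    """Perplexity of the distribution of morphs immediately following
    `morph` in the segmented corpus; '#' stands for the word boundary."""
    followers = Counter()
    for word in segmented_corpus:
        for i, m in enumerate(word):
            if m == morph:
                followers[word[i + 1] if i + 1 < len(word) else "#"] += 1
    total = sum(followers.values())
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in followers.values()
    )
    return 2.0 ** entropy
```

Left perplexity is computed the same way over the preceding morphs; a morph that is always word-final (like a suffix) has right perplexity 1.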
HUT 14: How morph distributional features affect morph categories

Prior probability distributions for the category membership of a morph, e.g., P(PRE | 'over'), are derived from its distributional features, assuming asymmetries between the categories.
HUT 15: How distributional features affect categories (2)

There is an additional non-morpheme category for cases where none of the proper classes is likely; the remaining probability mass is distributed proportionally among the categories.
HUT 16: ML vs. MAP optimization

Morfessor Categories-ML (M2):
  arg max over Lexicon of P(Corpus | Lexicon),
  with the lexicon size controlled heuristically.

Morfessor Categories-MAP (M3):
  arg max over Lexicon of P(Lexicon) P(Corpus | Lexicon),
  so the prior P(Lexicon) itself penalizes model size. The lexicon stores the morph strings and their distributional features, and words are generated by the category HMM (transition probabilities such as P(STM | PRE) and P(SUF | SUF); emission probabilities such as P('over' | PRE) and P('s' | SUF)).
HUT 17: Hierarchical structures in lexicon (M3)

Maintain the hierarchy of splittings for each word: for example, straightforwardness contains the stem straight and a suffix, while forward is further split into the substrings for + ward, which are marked as non-morphemes. This makes it possible to code efficiently also common substrings that are not morphemes (e.g., syllables in foreign names), and to produce bracketed output.
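Such a hierarchy of splittings can be represented as nested tuples and flattened on demand. The tree below for "straightforwardness" is an illustrative assumption about its structure, not a dump of the actual Morfessor lexicon:

```python
def leaves(node):
    """Flatten a hierarchical analysis (nested tuples of strings)
    into the flat list of its leaf substrings."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

# Assumed hierarchy: the stem "straight", then "forward" split into
# the non-morphemes "for" + "ward", then the suffix "ness".
tree = ("straight", ("for", "ward"), "ness")
```

Keeping the tree instead of only the flat split lets the model reuse an inner node like ("for", "ward") as a single coding unit wherever it recurs.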
HUT 18: Example segmentations (M3)

Finnish:
– [ aarre kammio ] issa
– [ aarre kammio ] on
– bahama laiset
– bahama [ saari en ]
– [ epä [ [ tasa paino ] inen ] ]
– maclare n
– [ nais [ autoili ja ] ] a
– [ sano ttiin ] ko
– töhri ( mis istä )
– [ [ voi mme ] ko ]

English:
– [ accomplish es ]
– [ accomplish ment ]
– [ beautiful ly ]
– [ insur ed ]
– [ insure s ]
– [ insur ing ]
– [ [ [ photo graph ] er ] s ]
– [ present ly ]
– found
– [ re siding ]
– [ [ un [ expect ed ] ] ly ]
HUT 19: Challenge results: comparison to gold-standard splitting (F-measures)

[Figure: F-measures of the challenge participants, with the Morfessor Categories models (M2, M3), the Morfessor Baseline (M1), the committees and the winner highlighted.]
HUT 23 A reason for differences? Source: Creutz & Lagus, 2005 tech.rep.
HUT 24: Discussion

This was the first time our Categories methods were evaluated in speech recognition, with nice results! A comparison between the Morfessor models and the challenge participants is not entirely fair. Possibilities to extend the M3 model:
– add word-contextual features for "meaning"
– more fine-grained categories
– beyond concatenative phenomena (e.g., goose – geese)
– allomorphy (e.g., beauty, beauty + 's, beauti + es, beauti + ful)
HUT 25: Questions for the Morpho Challenge

– How language-general are the methods in fact? (Norwegian, French, German, Arabic, ...)
– Did we, or can we, succeed in inducing "basic units of meaning"?
  – evaluation in other NLP problems: MT, IR, QA, TE, ...
  – application of morphs to non-NLP problems? (machine vision, image analysis, video analysis, ...)
– Will there be another Morpho Challenge?
HUT 26 See you in another challenge! best wishes, Krista (and Sade)
HUT 27: My notes
– describe our own methods briefly
– discuss the differences between our methods in relation to their properties
– be humble; point out why the comparison is unfair (prior experience; our own data; and speech recognition is a familiar application to us, so that group's earlier research may have indirectly influenced our method development)
+ discussion of the differences between our methods?
+ example segmentations from all of our methods?
+ discussion material from Mikko's paper and our paper
+ the first results figure is currently confusing
+ the colors differ from the other figures: change the colors and duplicate the figure, highlight the winner more clearly
HUT 28: Discussion

Possibilities to extend the model:
– only rudimentary features are currently used for "meaning"
– more fine-grained categories
– beyond concatenative phenomena (e.g., goose – geese)
– allomorphy (e.g., beauty, beauty + 's, beauti + es, beauti + ful)

Already useful in applications:
– automatic speech recognition (Finnish, Turkish)