HELSINKI UNIVERSITY OF TECHNOLOGY, NEURAL NETWORKS RESEARCH CENTRE

Induction of a Simple Morphology for Highly-Inflecting Languages
Mathias Creutz
Current Themes in Computational Phonology and Morphology, 7th Meeting of the ACL Special Interest Group in Computational Phonology, ACL, Barcelona, 26 July 2004

kahvi + n + juo + ja + lle + kin
nyky + ratkaisu + i + sta + mme
tietä + isi + mme + kö + hän
open + mind + ed + ness
un + believ + able

Slide 2: Goals and challenges

Learn representations of
– the smallest meaningful units of language (morphemes)
– and their interaction,
– in an unsupervised manner from raw text,
– making assumptions as general and language-independent as possible.

Evaluate
– against a given gold-standard morphological analysis of word forms (first step: learn and evaluate a morpheme segmentation of word forms),
– integrated in NLP applications (speech recognition).

Slide 3: Focus: agglutinative morphology

Finnish words often consist of lengthy sequences of morphemes — stems, suffixes and prefixes:
– kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
– nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
– tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)

⇒ Huge number of different possible word forms
⇒ Important to know the inner structure of words in NLP
⇒ The number of morphemes per word varies greatly

Slide 4: 1. MDL model (Creutz & Lagus, 2002)
(inspired by work of, e.g., J. Goldsmith)

"Invent" a set of distinct strings, the morphs, forming a morph lexicon: tä, ssä, pala, peli, on, tuhat, a.
Pick morphs from the lexicon and place them in a sequence, producing the corpus / word list: tä ssä pala peli ssä on tuhat pala a.
Learning from data: aim at the most concise representation possible.
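The "most concise representation" can be made concrete as a two-part code length: the cost of spelling out the lexicon plus the cost of coding the corpus with it. Below is a minimal sketch in Python, assuming a simple coding scheme (fixed per-letter cost for lexicon entries, maximum-likelihood code lengths for corpus tokens); the function name and constants are illustrative, not the published Morfessor cost function.

```python
import math
from collections import Counter

def description_length(segmented_corpus, alphabet_size=28):
    """Total cost = cost of the morph lexicon + cost of coding the corpus.

    segmented_corpus: list of morph tokens, e.g.
        ["tä", "ssä", "pala", "peli", "ssä", "on", "tuhat", "pala", "a"]
    """
    counts = Counter(segmented_corpus)
    n_tokens = sum(counts.values())

    # Lexicon cost: spell out each distinct morph, letter by letter.
    lexicon_cost = sum(
        (len(morph) + 1) * math.log2(alphabet_size)  # +1 for an end marker
        for morph in counts
    )

    # Corpus cost: code each token with -log2 of its ML probability.
    corpus_cost = sum(
        count * -math.log2(count / n_tokens)
        for count in counts.values()
    )
    return lexicon_cost + corpus_cost
```

A search procedure then prefers whichever segmentation of the corpus yields the smaller total.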

Slide 5: 2. Probabilistic formulation (Creutz, 2003)
(inspired by work of, e.g., M. R. Brent and M. G. Snover)

The same generative setup as in the MDL model — "invent" a morph lexicon (tä, ssä, pala, peli, on, tuhat, a), then pick morphs from it and place them in a sequence to produce the corpus / word list (tä ssä pala peli ssä on tuhat pala a) — now with explicit priors on the lexicon: a frequency prior and a length prior.
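A hedged sketch of what such priors might look like: the exponential length density and Zipf-style frequency term below are assumptions standing in for the paper's actual distributions and parameters.

```python
import math

def log_length_prior(morph, mean_length=6.0):
    """Assumed exponential density on morph length: prefers shortish morphs.

    Stand-in for the length prior named on the slide; the true model may
    use a different family (e.g., gamma) and different parameters.
    """
    rate = 1.0 / mean_length
    return math.log(rate) - rate * len(morph)

def log_frequency_prior(freq, exponent=1.5):
    """Assumed power-law (Zipf-style) term on morph frequency, unnormalized."""
    return -exponent * math.log(freq)

def log_morph_prior(morph, freq):
    """A morph's contribution to the lexicon prior combines both terms."""
    return log_length_prior(morph) + log_frequency_prior(freq)
```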

Slide 6: Reflections on solutions 1 and 2

These are "dumb" text-compression algorithms:
– Common substrings of words appear as one segment, even when there is compositional structure, e.g., keskustelussa (keskustel + u + ssa; "discuss+ion in"), biggest (bigg + est).
– Rare substrings of words are split, even when there is no compositional structure, e.g., a + den + auer (Adenauer; German politician), in + s + an + e (in + sane).
– Structural constraints are too weak, e.g., suffixes are recognized at the beginning of words: s + can (scan).

Slide 7: 3. Category-learning probabilistic model

Word structure is captured by a regular expression: word = ( prefix* stem suffix* )+

Morph sequences (words) are generated by a hidden Markov model. For the word # nyky + ratkaisu + i + sta + mme #, the model multiplies transition probabilities between categories, e.g., p(STM | PRE) and p(SUF | SUF), with emission probabilities of morphs given their categories, e.g., p('nyky' | PRE) and p('mme' | SUF).
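A sketch of how this HMM scores one tagged morph sequence; the probability tables are toy values chosen for illustration, not learned ones.

```python
import math

# Toy transition and emission tables (assumed values for illustration).
trans = {("#", "PRE"): 0.3, ("PRE", "STM"): 0.9, ("STM", "SUF"): 0.6,
         ("SUF", "SUF"): 0.4, ("SUF", "#"): 0.5}
emit = {("PRE", "nyky"): 0.01, ("STM", "ratkaisu"): 0.002,
        ("SUF", "i"): 0.05, ("SUF", "sta"): 0.03, ("SUF", "mme"): 0.02}

def log_prob(tagged_word):
    """tagged_word: [("nyky", "PRE"), ("ratkaisu", "STM"), ...]."""
    logp = 0.0
    prev = "#"  # word-boundary state
    for morph, cat in tagged_word:
        logp += math.log(trans[(prev, cat)]) + math.log(emit[(cat, morph)])
        prev = cat
    return logp + math.log(trans[(prev, "#")])  # close at the word boundary

word = [("nyky", "PRE"), ("ratkaisu", "STM"),
        ("i", "SUF"), ("sta", "SUF"), ("mme", "SUF")]
print(log_prob(word))
```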

Slide 8: Category algorithm

1. Start with an existing baseline morph segmentation (Creutz, 2003): nyky + rat + kaisu + ista + mme
2. Initialize category membership probabilities for each morph, e.g., p(PRE | 'nyky'), assuming asymmetries between the categories.

Slide 9: Initialization of category membership probabilities

Introduce a noise category for cases where none of the proper classes (prefix, stem, suffix) is likely. Distribute the remaining probability mass proportionally among the proper classes.
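A minimal sketch of one way to realize this initialization, assuming sigmoid-shaped "likeness" scores for the three proper classes (the exact scoring functions and constants on the original slide are not reproduced here and are assumptions):

```python
import math

def sigmoid(x, steepness=1.0, threshold=2.0):
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))

def init_category_probs(prefix_score, stem_score, suffix_score):
    """Map raw 'likeness' scores to category membership probabilities.

    Noise gets the probability that none of the proper classes applies;
    the remaining mass is shared proportionally among PRE/STM/SUF.
    """
    pre = sigmoid(prefix_score)
    stm = sigmoid(stem_score)
    suf = sigmoid(suffix_score)
    p_noise = (1 - pre) * (1 - stm) * (1 - suf)
    total = pre + stm + suf
    scale = (1 - p_noise) / total if total > 0 else 0.0
    return {"PRE": pre * scale, "STM": stm * scale,
            "SUF": suf * scale, "NOI": p_noise}
```

By construction the four probabilities sum to one, and a morph that scores low on all three proper classes ends up mostly in the noise category.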

Slide 10: Category algorithm (continued)

1. Start with an existing baseline morph segmentation: nyky + rat + kaisu + ista + mme
2. Initialize category membership probabilities for each morph.
3. Tag morphs as prefix, stem, suffix, or noise; then run EM on the taggings: nyky + rat + kaisu + ista + mme
4. Split morphs that consist of other known morphs; then EM: nyky + rat + kaisu + i + sta + mme
5. Join noise morphs with their neighbours; then EM: nyky + ratkaisu + i + sta + mme
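The tagging in step 3 can be done with Viterbi decoding over the category HMM of slide 7 (inside EM, the taggings and probability tables would then be re-estimated iteratively). A minimal sketch, reusing the toy `trans`/`emit` table format from the earlier HMM example; the smoothing constant is an assumption.

```python
import math

CATS = ("PRE", "STM", "SUF", "NOI")

def viterbi_tag(morphs, trans, emit):
    """Most probable category sequence for one segmented word.

    `trans` and `emit` are tables of the same shape as in the HMM sketch
    above; unseen pairs get a small smoothing probability.
    """
    def lp(table, key):
        return math.log(table.get(key, 1e-9))

    # chart[i][c] = (best log prob of morphs[:i+1] ending in c, backpointer)
    chart = [{c: (lp(trans, ("#", c)) + lp(emit, (c, morphs[0])), None)
              for c in CATS}]
    for morph in morphs[1:]:
        prev_row = chart[-1]
        chart.append({
            c: max((prev_row[p][0] + lp(trans, (p, c)) + lp(emit, (c, morph)), p)
                   for p in CATS)
            for c in CATS
        })
    # Close with the transition back to the word boundary, then trace back.
    last = max(CATS, key=lambda c: chart[-1][c][0] + lp(trans, (c, "#")))
    tags = [last]
    for row in reversed(chart[1:]):
        tags.append(row[tags[-1]][1])
    return list(reversed(tags))
```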

Slide 11: Experiments

Algorithms:
– Baseline model (Bayesian formulation)
– Category-learning model
– Goldsmith's "Linguistica" (MDL formulation)

Data:
– Finnish data sets (CSC + STT): several sizes, up to 16 million words
– English data sets (Brown corpus): several sizes

[Figure: a Linguistica-style signature — stems believ, hop, liv, mov, us combined with suffixes e, ed, es, ing.]

Slide 12: "Gold standard" used in evaluation

A morpheme segmentation was obtained for Finnish and English words by processing the output of two-level morphology analyzers (FINTWOL and ENGTWOL by Lingsoft, Inc.).

Some "fuzzy morpheme boundaries" are allowed, mainly where stem-final alternation makes a seam or joint letter attributable to either the stem or the suffix, e.g.:
– Windsori + n or Windsor + in; Windsore + i + lla or Windsor + ei + lla (cf. Windsor)
– invite + s or invit + es; invite or invit + e (cf. invit + ing)

Compute precision and recall of correctly discovered morpheme boundaries (a sketch follows below).
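A sketch of this boundary-based evaluation: precision is the fraction of proposed boundaries present in the gold standard, recall the fraction of gold boundaries that were found. Handling of the allowed "fuzzy" alternatives is omitted here for brevity.

```python
def boundary_positions(segmentation):
    """Character offsets of the boundaries in a segmented word.

    boundary_positions(["tietä", "isi", "mme"]) -> {5, 8}
    """
    positions, offset = set(), 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions

def precision_recall(predicted, gold):
    """predicted, gold: parallel lists of segmented words (lists of morphs)."""
    tp = fp = fn = 0
    for pred_word, gold_word in zip(predicted, gold):
        p = boundary_positions(pred_word)
        g = boundary_positions(gold_word)
        tp += len(p & g)   # boundaries proposed and in the gold standard
        fp += len(p - g)   # proposed but not in the gold standard
        fn += len(g - p)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```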

Slide 13: Results (evaluated against the gold standard)

[Figure: precision and recall curves for the Baseline, Categories, and Linguistica algorithms on data sets of 10k, 250k, and 16M words.]

Slide 14: Discussion

The Category algorithm
– overcomes many of the shortcomings of the Baseline algorithm (over- and under-segmentation; suffixes at the beginning of words);
– generalizes more than Linguistica, e.g., allus + ion + s (Categories) vs. allusions (Linguistica), Dem + i (Categories) vs. Demi (Linguistica);
– invents its own solutions, e.g., aihe + e + sta vs. aihe + i + sta ("about [the] topic/-s"); phrase, phrase + s, phrase + d.

Slide 15: Future directions

– The Category algorithm could be expressed more elegantly, not as a post-processing procedure making use of a baseline segmentation.
– Segmentation into morphs is useful, e.g., for n-gram language modeling in speech recognition.
– Detection of allomorphy, i.e., segmentation into morphemes, would be even more useful, e.g., for information retrieval (?).

Slide 16: Public demo

A demo of the baseline and category-learning algorithms is available on the Internet. Test it on your own Finnish or English input!

Slide 17: Search for the optimal segmentation of the words in a corpus

Recursive binary splitting: each word (e.g., reopened, openminded) is split into two parts whenever this improves the model, and the parts are split further recursively, e.g., reopened → re + open + ed and openminded → open + mind + ed, yielding the morphs.

The outer loop over the words (e.g., conferences, opening, words, openminded, reopened): randomly shuffle the words, re-split each one, and check whether the description length has converged; if not, repeat; if yes, done. A sketch follows below.
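A sketch of this search, under the assumption that a `cost_with` callback exposes the description length of the model with one word represented as given parts (in the real algorithm, splitting a word updates shared morph counts, and a cost function like the one sketched on slide 4 is evaluated globally); a fixed epoch count stands in for the convergence test.

```python
import random

def split_word(word, cost_with):
    """Recursive binary splitting: try every split point of `word`, keep
    the one that lowers the cost most, and recurse into both halves.

    `cost_with(parts)` is an assumed callback returning the total
    description length with `word` represented as `parts`.
    """
    best_parts, best_cost = [word], cost_with([word])
    for i in range(1, len(word)):
        candidate = [word[:i], word[i:]]
        c = cost_with(candidate)
        if c < best_cost:
            best_parts, best_cost = candidate, c
    if len(best_parts) == 1:
        return best_parts  # no split helped: keep the word whole
    left, right = best_parts
    return split_word(left, cost_with) + split_word(right, cost_with)

def learn_segmentation(words, cost_with, epochs=10):
    """Outer loop of the flowchart: shuffle the words, re-split each one,
    and repeat; the fixed epoch count replaces the convergence check on
    the description length."""
    segmentation = {}
    for _ in range(epochs):
        order = list(words)
        random.shuffle(order)
        for w in order:
            segmentation[w] = split_word(w, cost_with)
    return segmentation
```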