Unsupervised and Knowledge-free Morpheme Segmentation and Analysis. Stefan Bordag, University of Leipzig.

Presentation transcript:

1 Unsupervised and Knowledge-free Morpheme Segmentation and Analysis
Stefan Bordag, University of Leipzig
Outline: Components, Detailing, Compound splitting, Iterated LSV, Split trie training, Morpheme analysis, Results, Discussion

2 1. Components
The main components of the current LSV-based segmentation algorithm:
- Compound splitter (new)
- LSV component (new: iterated)
- Trie classifier (new: split into two phases)
Morpheme analysis (entirely new) is based on:
- Morpheme segmentation (see above)
- Clustering of morphs into morphemes
- Contextual similarity of morphemes
The main focus is on modularity, so that each module has a specific function and could be replaced by a better algorithm by someone else.

Compound Splitter
Based on the observation that especially long words pose a problem for LSV.
Simple heuristic: split whenever a word is decomposable into several words which have
- a minimum length of 4 and
- a minimum frequency of 10 (or some other arbitrary figures).
This misses many divisions but finds at least some correct ones (precision being more important than recall at this point): P=88% R=10% F=18%.
In cases where several decompositions are possible, the decomposition with more words of higher frequencies wins; a sketch follows below.
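
A minimal sketch of such a splitter, assuming a frequency dictionary and one plausible reading of the tie-breaking rule (all names are hypothetical):

    # Hypothetical sketch of the compound-splitting heuristic described above.
    def split_compound(word, freq, min_len=4, min_freq=10):
        """Return the best decomposition of `word` into known words, or None.
        `freq` maps words to corpus frequencies."""
        best = None
        def search(rest, parts):
            nonlocal best
            if not rest:
                # Tie-breaking: more parts first, then higher total frequency.
                key = (len(parts), sum(freq[p] for p in parts))
                if best is None or key > (len(best), sum(freq[p] for p in best)):
                    best = list(parts)
                return
            for i in range(min_len, len(rest) + 1):
                head = rest[:i]
                if freq.get(head, 0) >= min_freq:
                    search(rest[i:], parts + [head])
        search(word, [])
        return best if best and len(best) > 1 else None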

Original solution in two parts
Pipeline: sentences (e.g. "The talk was very informative") -> word co-occurrences (the/talk: 1, talk/was: 1, ...) -> contextually similar words (talk/speech: 20, was/is: 15, ...) -> compute LSV with the combined score s = LSV * freq * multiletter * bigram -> splits such as clear-ly, late-ly, early -> train trie classifier (root with branches cl -> ear -> ¤/ly, late -> ¤/ly) -> apply classifier.

Original Letter Successor Variety
Letter successor variety: Harris (1955). A word split occurs where the number of distinct letters that follow a given sequence of characters surpasses a threshold.
Input: the 150 contextually most similar words.
Observe how many different letters occur after a part of the string:
- after #cle- only 1 letter
- but reversed, before -ly# 16 different letters (16 different stems preceding the suffix -ly#)
Example: # c l e a r l y #
- from the left: after #cl, 5 various letters
- from the right: before -y#, 10 various letters
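
A compact sketch of the counting step under this setup (the contextually most similar words are assumed to be given as a list; names hypothetical):

    # Hypothetical sketch: count distinct successor letters after each prefix
    # of `word` among the contextually similar words; reverse all strings to
    # get the right-to-left (predecessor) variety instead.
    def lsv_counts(word, similar):
        counts = []
        for i in range(1, len(word)):
            prefix = word[:i]
            successors = {w[i] for w in similar if len(w) > i and w.startswith(prefix)}
            counts.append((i, len(successors)))
        return counts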

Balancing factors
The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise:
- freq: frequency differences between the beginning and the middle of a word
- multiletter: representation of single phonemes with several letters
- bigram: certain fixed combinations of letters
The final score s for each possible boundary is then:
s = LSV * freq * multiletter * bigram
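
Putting the two previous slides together, one plausible way to apply the weights (freq_w, multi_w and bigram_w are hypothetical placeholders for the three heuristics; lsv_counts is the function from the earlier sketch):

    # Hypothetical sketch: weight each raw LSV count by the three factors.
    def weighted_boundary_scores(word, similar, freq_w, multi_w, bigram_w):
        # Returns (position, s) pairs, s = LSV * freq * multiletter * bigram.
        return [(i, lsv * freq_w(word, i) * multi_w(word, i) * bigram_w(word, i))
                for i, lsv in lsv_counts(word, similar)]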

Iterated LSV
The iteration of LSV is based on previously found information.
For example, when computing ignited with the most similar words already analysed into caus-ed, struck, injur-ed, blazed, fire, ..., there is more evidence for ignit-ed, because most words ending in -ed were found to have -ed as a morpheme.
Implementation in the form of a weight iterLSV:
iterLSV = #wordsEndingIsMorph / #wordsSameEnding
hence:
s = LSV * freq * multiletter * bigram * iterLSV
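
A sketch of that weight under the stated reading (names hypothetical):

    # Hypothetical sketch of the iterLSV weight.
    def iter_lsv(ending, analysed):
        # `analysed` maps word -> list of morphs found in earlier iterations.
        # Returns #wordsEndingIsMorph / #wordsSameEnding for `ending`.
        same_ending = [w for w in analysed if w.endswith(ending)]
        if not same_ending:
            return 1.0  # no evidence either way; leave the score unchanged
        confirmed = sum(1 for w in same_ending
                        if analysed[w] and analysed[w][-1] == ending)
        return confirmed / len(same_ending)

For the slide's example, most analysed words ending in -ed have -ed as their last morph, so iter_lsv('ed', analysed) is close to 1 and the ignit-ed boundary is preserved, while an ending that never surfaced as a morph pulls the score towards 0.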

Patricia (Compressed) Trie as Classifier
Train: add known information by inserting segmentations (clear-ly, late-ly, early, clear, late) into a compressed trie over the reversed words, storing at each node how often which suffix class was seen (e.g. ly=2 at one node, ly=1 or ¤=1 at others, where ¤ marks "no split").
Apply: retrieve known information via the deepest found node, e.g. amazing?ly -> amazing-ly, but dear?ly -> dearly from a node with ¤=1.
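
A minimal sketch of such a classifier, using a plain dict-based trie over reversed words instead of a true Patricia trie (a simplification; all names hypothetical):

    # Hypothetical sketch: train a suffix trie on known segmentations and
    # classify new words by the deepest matching node.
    from collections import Counter

    def train(trie, word, suffix):
        # Insert `word` reversed; `suffix` is its known ending morph,
        # or '' for words that should not be split (stored as '¤').
        node = trie
        for ch in reversed(word):
            node = node.setdefault(ch, {})
            node.setdefault('classes', Counter())[suffix or '¤'] += 1

    def classify(trie, word):
        # Walk the reversed word; remember the class counts of the
        # deepest node reached and return the majority class.
        node, deepest = trie, None
        for ch in reversed(word):
            if ch not in node:
                break
            node = node[ch]
            deepest = node.get('classes', deepest)
        return deepest.most_common(1)[0][0] if deepest else '¤'

    trie = {}
    for w, s in [('clearly', 'ly'), ('lately', 'ly'), ('early', ''),
                 ('clear', ''), ('late', '')]:
        train(trie, w, s)
    print(classify(trie, 'amazingly'))  # -> 'ly', i.e. amazing-ly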

Splitting the trie application
The trie classifier could decide for ignit-ed based on the top node in the trie from the back, -d, with classes -ed:50; -d:10; -ted:5; ..., hence not taking any context within the word into account.
The new version save_trie (as opposed to rec_trie) trains one trie from LSV data and decides only if at least one more letter, in addition to the letters of the proposed morpheme, matches in the word.
save_trie and rec_trie are then trained and applied consecutively.
Example: with injur-ed and caus-ed as training data, save_trie leaves ignited unsplit (no extra matching letter), and rec_trie then yields ignit-ed.
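
The extra-letter condition can be expressed as a small guard on top of the dict trie from the previous sketch (hypothetical):

    # Hypothetical sketch of the save_trie rule: accept a suffix decision
    # only if at least one letter beyond the proposed morpheme also matches.
    def save_trie_decide(trie, word, suffix):
        node, depth = trie, 0
        for ch in reversed(word):
            if ch not in node:
                break
            node = node[ch]
            depth += 1
        return suffix if depth > len(suffix) else None

    # With only injur-ed and caus-ed as training data, 'ignited' matches just
    # 'e' and 'd', so save_trie_decide(trie, 'ignited', 'ed') returns None and
    # the word is left for rec_trie to decide.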

Effect of the improvements
- compounds: P=88% R=10% F=18%
- compounds + recTrie: P=66% R=28% F=39%
- compounds + lsv_0 + recTrie: P=71% R=58% F=64%
- compounds + lsv_2 + recTrie: P=69% R=63% F=66%
- compounds + lsv_2 + saveTrie + recTrie: P=69% R=66% F=67%
Most notably, these changes reach the same performance level as the original lsv_0 + recTrie (F=70%) on a corpus three times smaller.
However, applying it to a corpus three times bigger only increases the number of split words, not the quality of those splits!

11 3. Morpheme Analysis
Assumes visible morphs (i.e. the output of a segmentation algorithm).
This makes it possible to compute co-occurrences of morphs,
which enables computing the contextual similarity of morphs,
which enables clustering morphs into morphemes.
Traditional representation of morphemes:
barefooted  BARE FOOT +PAST
flying      FLY_V +PCP1
footprints  FOOT PRINT +PL
For processing, an equivalent representation of morphemes:
barefooted  bare 5foot.6foot.foot ed
flying      fly inag.ing.ingu.iong
footprints  5foot.6foot.foot prints

Computing alternation

for each morph m:
    for each contextually similar morph s of m:
        if LD_Similar(s, m):
            r = makeRule(s, m)
            store(r -> s, m)

for each word w:
    for each morph m of w:
        if in_store(m):
            sig = createSignature(m)
            write sig
        else:
            write m

Worked example: for m = foot with similar morphs s = {feet, 5foot, ...}, LD(foot, 5foot) = 1 yields the rule _-5 -> foot,5foot. In barefooted = {bare, foot, ed}, foot has the rules _-5 and _-6, so its signature is foot.5foot.6foot.
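
A hedged, runnable rendering of that loop, with LD_Similar taken as edit distance <= 1 and makeRule collapsed into simply grouping the alternating forms (these simplifications are assumptions, not the exact original rules):

    # Hypothetical sketch of the alternation/signature computation.
    from collections import defaultdict

    def edit_distance(a, b):
        # Plain Levenshtein distance, computed with one rolling row.
        d = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, cb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
        return d[-1]

    def signatures(morphs, similar):
        # `similar` maps a morph to its contextually similar morphs.
        # Morphs connected by edit distance <= 1 share a signature.
        groups = defaultdict(set)
        for m in morphs:
            for s in similar.get(m, []):
                if edit_distance(m, s) <= 1:   # LD_Similar(s, m)
                    groups[m].update({m, s})   # store(r -> s, m)
        return {m: '.'.join(sorted(g)) for m, g in groups.items()}

With similar = {'foot': ['feet', '5foot', '6foot']} this yields 5foot.6foot.foot for foot, matching the signature in the example (feet is rejected at edit distance 2).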

Real examples
Rules:
m-s : 49.0  barem,bares blum,blus erem,eres estem,estes etem,etes eurem,eures ifm,ifs igem,iges ihrem,ihres jedem,jedes lme,lse losem,loses mache,sache mai,sai
_-u : 46.0  bahn,ubahn bdi,bdiu boot,uboot bootes,ubootes cor,coru dejan,dejuan dem,demu dem,deum die,dieu em,eum en,eun en,uen erin,eurin
m-r : 44.0  barem,barer dem,der demselb,derselb einem,einer ertem,erter estem,ester eurem,eurer igem,iger ihm,ihr ihme,ihre ihrem,ihrer jedem,jeder
Signatures:
muessen  muess.muesst.muss en
ihrer    ihre.ihrem.ihren.ihrer.ihres
werde    werd.wird.wuerd e
Ihren    ihre.ihrem.ihren.ihrer.ihres.ihrn

More examples
kabinettsaufteilung     kabinet.kabinett.kabinetts aauf.aeuf.auf.aufs.dauf.hauf tail.teil.teile.teils.teilt bung.dung.kung.rung.tung.ung.ungs
entwaffnungsbericht     enkt.ent.entf.entp waff.waffn.waffne.waffnet lungs.rungs.tungs.ung.ungn.ungs berich.bericht
grundstuecksverwaltung  gruend.grund stuecks nver.sver.veer.ver walt bung.dung.kung.rung.tung.ung.ungs
grundt                  gruend.grund t

15 4. Results (competition 1)

GERMAN
AUTHOR     METHOD             PRECISION  RECALL  F-MEASURE
Bernhard                                 37.69%  47.22%
Bernhard                                 57.35%  52.89%
Bordag                                   40.58%  48.64%
Bordag     5a                 60.45%     41.57%  49.27%
McNamee                                  9.28%   15.43%
Zeman                                    28.46%  36.98%
Monson&co  Morfessor          67.16%     36.83%  47.57%
Monson&co  ParaMor            59.05%     32.81%  42.19%
Monson&co  Paramor&Morfessor  51.45%     55.55%  53.42%
Morfessor  MAP                67.56%     36.92%  47.75%
(method labels and precision figures in the blank cells were lost in the transcript)

ENGLISH
AUTHOR     METHOD             PRECISION  RECALL  F-MEASURE
Bernhard                                 52.47%  60.72%
Bernhard                                 60.01%  60.81%
Bordag                                   31.50%  41.27%
Bordag     5a                 59.69%     32.12%  41.77%
McNamee                                  17.55%  25.01%
Zeman                                    42.07%  46.90%
Monson&co  Morfessor          77.22%     33.95%  47.16%
Monson&co  ParaMor            48.46%     52.95%  50.61%
Monson&co  Paramor&Morfessor  41.58%     65.08%  50.74%
Morfessor  MAP                82.17%     33.08%  47.17%

Results (competition 1)

TURKISH
AUTHOR     METHOD  PRECISION  RECALL  F-MEASURE
Bernhard                      10.93%  19.18%
Bernhard                      14.80%  24.65%
Bordag                        17.45%  28.75%
Bordag     5a      81.31%     17.58%  28.91%
McNamee                       10.83%  18.57%
McNamee                       6.59%   12.24%
McNamee                       3.31%   6.39%
Zeman                         18.79%  29.23%
Morfessor  MAP     76.36%     24.50%  37.10%

FINNISH
AUTHOR     METHOD  PRECISION  RECALL  F-MEASURE
Bernhard                      25.01%  37.63%
Bernhard                      40.44%  48.20%
Bordag                        23.61%  35.52%
Bordag     5a      71.32%     24.40%  36.36%
McNamee                       8.56%   14.41%
McNamee                       5.68%   10.49%
McNamee                       3.35%   6.45%
Zeman                         20.92%  30.87%
Morfessor  MAP     76.83%     27.54%  40.55%

Problems of Morpheme Analysis
Surprise #1: nearly no effect on the evaluation results! Possible reasons:
- rules: type frequency is not taken into account (hence errors are overvalued)
- rules: context is not taken into account (instead of _-5, better _5f- or _fo)
- segmentation: produces many errors, so the analysis has to put up with a lot of noise

Problems of Segmentation
Surprise #2: the size of the corpus has no large influence on the quality of the segmentations.
- It influences only how many nearly perfect segmentations are found by LSV,
- but that is by far outweighed by the errors of the trie.
The strength of LSV is to segment irregular words properly, because they have high frequency and are usually short.
The strength of most other proposed methods lies in segmenting long and infrequent words.
A combination is evidently desirable.

Further avenues?
The most notable problem currently is the assumption that the phonemes representing a morph/morpheme cluster together, that is, AAA + BBB usually becomes AAABBB, not ABABAB.
For languages that merge morphemes this is inappropriate.
A better solution would perhaps be similar to U-DOP by Rens Bod:
- generate all possible parsing trees for each token,
- then collate them for the type and generate possible optimal parses,
- possibly generating tries not just for the type but also for some context, for example with the relevant context highlighted: Yesterday we arrived by plane.

20 THANK YOU!