A Splitter for German Compound Words
Pasquale Imbemba
Free University of Bozen-Bolzano
Supervisor: Dr. Raffaella Bernardi
2 Scenario
User input: compound words in German
● Problem for IR
  - Retrieval of German books
  - No direct keyword matching
● Problem for CLIR
  - Retrieval of IT & EN books
  - No direct translation in the dictionary
3 Problem: German compound words
Compounding is productive: pre-existing morphemes are combined to form a new word (aka Univerbierung). Noun compounds are the most frequent case.
● User input may not be in the lexicon used by CLIR search engines
  Donau + Dampf + Schiff + Fahrt (tr.: steam navigation on the Danube)
● User input may be a lexicalized "compound" word
  Malerei (tr.: painting), not Maler + Ei (tr.: painter and egg)
● Hence the need for a splitter that handles both cases
● Furthermore, language is in continuous evolution (neologisms), so lexical resources must be kept constantly up to date
4 State of the art
● TAGH (Berlin-Brandenburg Academy of Sciences / University of Potsdam)
  - Weighted FSA: choose the combination with the least cost
● MORPHY (University of Paderborn)
  - Reduce to base form and affixes, then look them up
● MORPA (Tilburg University)
  - Probabilistic calculus to determine the segmentation
● De Rijke/Monz (University of Amsterdam)
  - Shallow approach: given a word, if a substring is in the lexicon, subtract it; repeat until no substring is left
5 Tools
● Splitter: mechanism to segment nouns
  - Implemented, evaluated and improved the De Rijke/Monz algorithm in Java
● Lexicon
  - Morphy (57,000 nouns), dated (Lezius)
  - deWaC (440,000 nouns), recent (Baroni & Kilgarriff)
● Lexical resource to run lookups against
  - Nouns extracted from Morphy & deWaC
  - Regular-expression filtering on deWaC
  - Resources indexed with Lucene
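The regular-expression filtering step on deWaC could look roughly like the sketch below. The slides do not give the actual pattern, so `NOUN_PATTERN` and `filter_noun_candidates` are hypothetical names, and the rule itself is an assumption: a noun candidate is taken to be a single capitalised token made up only of German letters. The sketch is in Python for brevity, whereas the splitter described here was written in Java.

```python
import re

# Hypothetical pattern (not from the slides): one capital letter followed by
# one or more lowercase German letters, e.g. "Preis", "Öl", "Straße".
NOUN_PATTERN = re.compile(r"[A-ZÄÖÜ][a-zäöüß]+")


def filter_noun_candidates(tokens):
    """Keep only tokens that fully match the assumed noun pattern."""
    return [t for t in tokens if NOUN_PATTERN.fullmatch(t)]


candidates = filter_noun_candidates(
    ["Preis", "preis", "Öl", "Schiff2", "www.example.org", "Straße"]
)
print(candidates)
```

A filter of this kind only removes obvious noise (lowercase words, tokens with digits or punctuation); it cannot tell nouns from other capitalised words such as sentence-initial tokens.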
6 De Rijke/Monz algorithm
split(word):
  for i := 1 to length(word) - 1 do
    if isInNounLex(substr(word, 1, i)) and split(substr(word, i+1, length(word))) != "" then
      r := split(substr(word, i+1, length(word)))
      return concat(substr(word, 1, i), "+", r)
  if isInNounLex(word) then return word
  else return ""
Example: Ölpreis
  "Öl" is in the noun lexicon; r = split("preis") = "preis"
  Result: Öl + Preis
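The pseudocode above can be sketched in runnable form as follows. This is a minimal Python illustration, not the Java implementation from the talk; the lexicon is assumed to be a plain set of lowercase nouns, and the recursive call is made once per prefix rather than twice as in the pseudocode.

```python
def split(word, lexicon):
    """Return the parts joined by '+', or '' if no segmentation is found."""
    # Try every proper prefix, shortest first.
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split(tail, lexicon)
            if rest:  # the remainder could be segmented as well
                return head + "+" + rest
    # No prefix worked: accept the word only if it is itself a known noun.
    return word if word in lexicon else ""


# Toy lexicon (assumption): lowercase noun forms only.
toy_lexicon = {"öl", "preis"}
result = split("Ölpreis".lower(), toy_lexicon)
print(result)
```

Because prefixes are tried before the whole word is looked up, this basic version splits a word whenever a split exists, even if the unsplit word is itself in the lexicon.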
7 Enhanced Splitter workflow
● Cascading lexical resources
  - Increases split correctness
  - Improves overall correctness
● Look up lexicalized elements first
  - Reduces the number of incorrect splits
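One way to read these two ideas is sketched below, in Python rather than the Java used for the talk. The details are assumptions, since the slides give no code: "lookup first" is taken to mean that a word found whole in any lexicon is returned unsplit, and "cascading" is taken to mean that the lexicons are tried in order until one yields a segmentation. The names `split_parts` and `enhanced_split` and the toy lexicons are hypothetical.

```python
def split_parts(word, lexicon):
    """Basic recursive split; returns a list of parts, or [] on failure."""
    for i in range(1, len(word)):
        head = word[:i]
        if head in lexicon:
            rest = split_parts(word[i:], lexicon)
            if rest:
                return [head] + rest
    return [word] if word in lexicon else []


def enhanced_split(word, lexicons):
    """Assumed workflow: whole-word lookup first, then a cascade of lexicons."""
    w = word.lower()
    # Step 1: a lexicalized word is never split.
    if any(w in lex for lex in lexicons):
        return [w]
    # Step 2: fall through the lexicons until one produces a segmentation.
    for lex in lexicons:
        parts = split_parts(w, lex)
        if parts:
            return parts
    return []


# Toy stand-ins for the two resources (assumption, not real Morphy/deWaC data).
first_lexicon = {"malerei", "maler", "ei"}
second_lexicon = {"öl", "preis"}

print(split_parts("malerei", first_lexicon))
print(enhanced_split("Malerei", [first_lexicon, second_lexicon]))
print(enhanced_split("Ölpreis", [first_lexicon, second_lexicon]))
```

With the toy data, the basic split turns Malerei into maler + ei, while the lookup-first variant leaves it whole and still segments Ölpreis through the second lexicon.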
8 Splitter diagram
9 MuSiL Integration
● Query input: Donaudampfschifffahrt
● Name recognition: Donau + Dampfschifffahrt
● Morphological analysis: Dampfschifffahrt_N
● Multilingual dictionary, multiword recognition: Dampfschifffahrt_N
● Split and translate (splitter):
  1. EN: vapour_N | steam_N (...)  IT: vapore_nm | (...)
  2. EN: ship_N | (...)  IT: nave_nf | (...)
  3. EN: drive_N | navigation_N (...)  IT: guida_nf | navigazione_nf (...)
● Multilingual thesaurus
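The split-and-translate step can be illustrated with a small Python sketch. The dictionary below contains only the entries visible on the slide (the elided "(...)" alternatives are left out), and `TOY_DICTIONARY` and `translate_parts` are hypothetical names; the real MuSiL dictionary interface is not shown in the slides.

```python
# Toy bilingual entries copied from the slide; part-of-speech tags omitted.
TOY_DICTIONARY = {
    "dampf": {"EN": ["vapour", "steam"], "IT": ["vapore"]},
    "schiff": {"EN": ["ship"], "IT": ["nave"]},
    "fahrt": {"EN": ["drive", "navigation"], "IT": ["guida", "navigazione"]},
}


def translate_parts(parts, dictionary, language):
    """Look up each compound part; unknown parts yield an empty list."""
    return [dictionary.get(p.lower(), {}).get(language, []) for p in parts]


# Parts as the splitter would return them for Dampfschifffahrt.
parts = ["Dampf", "Schiff", "Fahrt"]
english = translate_parts(parts, TOY_DICTIONARY, "EN")
italian = translate_parts(parts, TOY_DICTIONARY, "IT")
print(english)
print(italian)
```

Each part keeps its own list of candidate translations, so the choice among alternatives (for example drive versus navigation) is left to a later stage.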
10 Evaluation
● Total correctness improved
● The improvement comes from a larger number of non-splits when using deWaC and Morphy
11 Complexity of the split function
● De Rijke/Monz
  - Best case: the input word is scanned once, from first to last position
  - Worst case: the number of calls to split grows exponentially
● Our splitter
  - Best case: the word is found immediately in the lexical resources of nouns
  - Worst case: the function is executed recursively every time a prefix is in the lexicon and the remaining substring is not empty (as in De Rijke/Monz)
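The exponential worst case can be made concrete by counting recursive calls on an adversarial input. The Python sketch below uses an artificial lexicon (an assumption for illustration only) that contains every run of the letter "a" up to length n, and a word made of n times "a" followed by "b", so every prefix matches but no segmentation ever succeeds. The call count C(n) then satisfies C(0) = 1 and C(n) = 1 + C(0) + ... + C(n-1), which gives C(n) = 2^n.

```python
def count_split_calls(word, lexicon):
    """Run the basic recursive split and report (result, number of calls)."""
    calls = 0

    def split(w):
        nonlocal calls
        calls += 1
        for i in range(1, len(w)):
            if w[:i] in lexicon:
                rest = split(w[i:])
                if rest:
                    return w[:i] + "+" + rest
        return w if w in lexicon else ""

    return split(word), calls


n = 10
# Artificial worst case: every prefix of a's is a "noun", the final b is not.
adversarial_lexicon = {"a" * k for k in range(1, n + 1)}
result, calls = count_split_calls("a" * n + "b", adversarial_lexicon)
print(result, calls)

# Best case for the lookup-first variant: the whole word is in the lexicon,
# so a single membership test is enough and no recursion is needed.
print("malerei" in {"malerei", "maler", "ei"})
```

Real German lexicons are far from this artificial case, but the example shows why each matching prefix with a failing remainder is costly.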
12 Performance on MuSiL
● Increased number of retrieved documents
● More relevant documents are top-ranked
Query | Without splitter component (DE IT EN Precision) | With splitter component (DE IT EN Precision)
Abenteuer+Geschichten | 100100% | 1057251%
Beruf+Orientierung | 1300100% | 133922443%
Kommunikation+Politik | ---- | 286931747%
Wert+Papier+Handel+Gesetz | 400100% | 0369017%
Doppel+Besteuerung+Abkommen | 402100% | 0015100%
Aufmerksamkeit+Defizit+Syndrom | 4402536% | 30873%
Hirn+Leistung+Training | 1500100% | 44818946%
Kunst+Erziehung+Bewegung | 100100% | 14447825136%
Emotion+Regulierung | 100100% | 036671%
Unternehmen+Netzwerke | 802561% | 736167727%
13 Conclusion and future work
● Good:
  - Cascade method
  - Deals with lexicalized elements
● Open topics:
  - Choosing the correct segmentation among alternatives
  - Metrics for correctness of segmentation
● Weights, probability …