A Splitter for German compound words Pasquale Imbemba Free University of Bozen-Bolzano Supervisor: Dr. Raffaella Bernardi.


1 A Splitter for German compound words Pasquale Imbemba Free University of Bozen-Bolzano Supervisor: Dr. Raffaella Bernardi

2 Scenario
User input: compound words in German
● Problem for IR (information retrieval)
  - Retrieval of German books
  - No direct keyword matching
● Problem for CLIR (cross-language information retrieval)
  - Retrieval of IT & EN books
  - No direct translation in the dictionary

3 Problem: German compound words
Compounding is productive:
  - Combine pre-existing morphemes to form a new word (aka Univerbierung)
  - Compounds of nouns are the most frequent case
● User input may not be in the lexicon used by CLIR search engines
  - Donau + Dampf + Schiff + Fahrt (tr.: steam navigation on the Danube)
● User input may be a lexicalized “compound” word
  - Malerei (tr.: painting), not Maler + Ei (tr.: painter and egg)
● Hence, a splitter is needed that handles both cases
● Furthermore, language evolves continuously (neologisms), so lexical resources must be kept up to date

4 State of the art
● TAGH (Berlin-Brandenburg Academy of Sciences / University of Potsdam)
  - Weighted FSA: choose the combination with the least cost
● MORPHY (University of Paderborn)
  - Reduce to base form and affixes, then look them up
● MORPA (Tilburg University)
  - Probabilistic calculus to determine the segmentation
● De Rijke/Monz (University of Amsterdam)
  - Shallow approach: given a word, if a substring is in the lexicon, subtract it; repeat until no substring is left

5 Tools
● Splitter
  - Mechanism to segment nouns
  - Implemented, evaluated and improved the De Rijke/Monz algorithm in Java
● Lexicon
  - Morphy (57,000 nouns), dated (Lezius)
  - deWaC (440,000 nouns), recent (Baroni & Kilgarriff)
● Lexical resources for lookup
  - Nouns extracted from Morphy & deWaC
  - Regular expression filtering on deWaC
  - Resources indexed with Lucene

6 De Rijke/Monz algorithm
split(word):
    for i := 1 to length(word) - 1 do
        if isInNounLex(substr(word, 1, i)) and split(substr(word, i+1, length(word))) != "" then
            r := split(substr(word, i+1, length(word)))
            return concat(substr(word, 1, i), "+", r)
    if isInNounLex(word) then return word
    else return ""
Example from the slide's diagram: Ölpreis is split into Öl + Preis (the recursive call on the remainder "preis" returns "preis").
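The recursion on this slide can be sketched in a few lines of Python. This is only an illustration, not the Java implementation from the talk: the toy lexicon `NOUNS` and the choice to lowercase everything are assumptions made here.

```python
def split(word, lexicon):
    """Return the parts joined by '+', or '' if no full segmentation exists."""
    word = word.lower()  # assumption: the lexicon stores lowercase forms
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split(tail, lexicon)  # segment the remainder recursively
            if rest:
                return head + "+" + rest
    # No split worked: accept the word only if it is itself a known noun.
    return word if word in lexicon else ""


# Hypothetical toy lexicon, for illustration only.
NOUNS = {"öl", "preis", "maler", "ei"}

print(split("Ölpreis", NOUNS))   # öl+preis
print(split("Malerei", NOUNS))   # maler+ei
print(split("Xyz", NOUNS))       # prints an empty line
```

The second call reproduces the problem from slide 3: with only a split step, the lexicalized word Malerei is wrongly segmented into Maler + Ei.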

7 Enhanced Splitter workflow
● Cascading lexical resources
  - Increases split correctness
  - Improves overall correctness
● Lookup first
  - Lexicalized elements
  - Reduces the number of incorrect splits
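One possible reading of "lookup first" and "cascading lexical resources" is sketched below in Python. The transcript does not include the diagram, so the order of the steps, the function name `enhanced_split` and the two toy lexicons are assumptions made for illustration.

```python
def split(word, lexicon):
    """Basic recursive splitter: parts joined by '+', or '' if none found."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split(tail, lexicon)
            if rest:
                return head + "+" + rest
    return word if word in lexicon else ""


def enhanced_split(word, resources):
    """Assumed workflow: look the whole word up first, then split in cascade."""
    word = word.lower()
    # Lookup first: a lexicalized word found in any resource is not split.
    for lexicon in resources:
        if word in lexicon:
            return word
    # Cascade: try each resource in order and keep the first successful split.
    for lexicon in resources:
        result = split(word, lexicon)
        if result:
            return result
    return ""


# Hypothetical toy stand-ins for the two noun lists named in the talk.
MORPHY_TOY = {"maler", "ei", "öl"}
DEWAC_TOY = {"malerei", "preis", "öl"}

print(enhanced_split("Malerei", [MORPHY_TOY, DEWAC_TOY]))  # malerei
print(enhanced_split("Ölpreis", [MORPHY_TOY, DEWAC_TOY]))  # öl+preis
```

In this sketch the lookup step keeps Malerei whole, and the second resource supplies the split of Ölpreis that the first one cannot produce.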

8 Splitter diagram

9 MuSiL Integration
[Diagram: query processing pipeline, steps labelled 1 2 3]
Query Input: Donaudampfschifffahrt
Name Recognition: Donau | Dampfschifffahrt
Morphological Analysis: Dampfschifffahrt_N
Multilingual Dictionary, Multiword recognition: Dampfschifffahrt_N
Split and Translate (Splitter):
  EN: vapour_N | steam_N (...)       IT: vapore_nm | (...)
  EN: ship_N | (...)                 IT: nave_nf | (...)
  EN: drive_N | navigation_N (...)   IT: guida_nf | navigazione_nf (...)
Multilingual Thesaurus
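The "Split and Translate" step can be illustrated with a short Python sketch. The dictionary below contains only the entries visible on the slide. Assigning each entry to Dampf, Schiff or Fahrt is this sketch's reading of the diagram, and `translate_parts` is a hypothetical helper, not part of MuSiL.

```python
# Toy dictionary limited to the entries shown on the slide.
# Which German part each entry belongs to is an assumption of this sketch.
DICTIONARY = {
    "dampf": {"EN": ["vapour", "steam"], "IT": ["vapore"]},
    "schiff": {"EN": ["ship"], "IT": ["nave"]},
    "fahrt": {"EN": ["drive", "navigation"], "IT": ["guida", "navigazione"]},
}


def translate_parts(split_result, target, dictionary=DICTIONARY):
    """Translate each part of a 'a+b+c' splitter result into the target language."""
    parts = split_result.split("+")
    # Unknown parts yield an empty list instead of raising an error.
    return [dictionary.get(part.lower(), {}).get(target, []) for part in parts]


print(translate_parts("Dampf+Schiff+Fahrt", "EN"))
print(translate_parts("Dampf+Schiff+Fahrt", "IT"))
```

Each part gets its own list of candidate translations, which a search engine could then combine into a query in the target language.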

10 Evaluation
● Total correctness improved
● Achieved by increasing the number of non-splits with deWaC and Morphy

11 Complexity of the split function
De Rijke/Monz:
– Best case: we scan the input word from first to last position
– Worst case: the number of calls to split grows exponentially
Our splitter:
– Best case: the word is found immediately in the lexical resources of nouns
– Worst case: the function is executed recursively every time we encounter a word in the lexicon and the remaining substring is not empty (see De Rijke/Monz)
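The exponential worst case can be made visible by counting calls. The Python sketch below uses an artificial lexicon in which every prefix is a "word" but no complete segmentation exists. The input and the helper `count_split_calls` are constructed for illustration and do not come from the talk.

```python
def count_split_calls(word, lexicon):
    """Run the basic recursive splitter and count how often split is called."""
    calls = 0

    def split(w):
        nonlocal calls
        calls += 1
        for i in range(1, len(w)):
            head, tail = w[:i], w[i:]
            if head in lexicon:
                rest = split(tail)
                if rest:
                    return head + "+" + rest
        return w if w in lexicon else ""

    return split(word), calls


# Artificial worst case: "a", "aa", "aaa", ... are all in the lexicon,
# but the trailing "b" prevents any segmentation from succeeding.
for m in (4, 8, 12):
    lexicon = {"a" * k for k in range(1, m + 1)}
    result, calls = count_split_calls("a" * m + "b", lexicon)
    print(m, repr(result), calls)  # calls doubles with every extra "a"
```

For this input the call count is 2 to the power of m, because every prefix triggers a full recursive search of the remainder and none of them succeeds.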

12 Performance on MuSiL
● Increased number of retrieved documents
● More relevant documents are top ranked
Columns in each group: DE, IT, EN, Precision (cell values are run together in this transcript)
Query | Without splitter component | With splitter component
Abenteuer+Geschichten | 100100% | 1057251%
Beruf+Orientierung | 1300100% | 133922443%
Kommunikation+Politik | ---- | 286931747%
Wert+Papier+Handel+Gesetz | 400100% | 0369017%
Doppel+Besteuerung+Abkommen | 402100% | 0015100%
Aufmerksamkeit+Defizit+Syndrom | 4402536% | 30873%
Hirn+Leistung+Training | 1500100% | 44818946%
Kunst+Erziehung+Bewegung | 100100% | 14447825136%
Emotion+Regulierung | 100100% | 036671%
Unternehmen+Netzwerke | 802561% | 736167727%

13 Conclusion and future work
● Good:
  - Cascade method
  - Deals with lexicalized elements
● Open topics:
  - Choose the correct segmentation among alternatives
  - Metrics for correctness of segmentation (weights, probability …)
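The open topic of choosing among alternative segmentations can be illustrated by enumerating all of them and ranking them with a cost function. The Python sketch below is not a method from the talk: the "fewest parts" cost, the function names and the toy lexicon are assumptions chosen to keep the example small.

```python
def all_segmentations(word, lexicon):
    """Return every way to write word as a sequence of lexicon entries."""
    word = word.lower()
    results = [[word]] if word in lexicon else []
    for i in range(1, len(word)):
        head = word[:i]
        if head in lexicon:
            for rest in all_segmentations(word[i:], lexicon):
                results.append([head] + rest)
    return results


def best_segmentation(word, lexicon, cost=len):
    """Pick the candidate with the lowest cost (default: fewest parts)."""
    candidates = all_segmentations(word, lexicon)
    # min() keeps the first candidate on ties; a real weight would break them.
    return min(candidates, key=cost) if candidates else []


# Hypothetical toy lexicon containing both simple and compound entries.
NOUNS = {"dampf", "schiff", "fahrt", "dampfschiff", "schifffahrt"}

for candidate in all_segmentations("Dampfschifffahrt", NOUNS):
    print("+".join(candidate))
print(best_segmentation("Dampfschifffahrt", NOUNS))
```

Swapping the `cost` argument for a corpus-based weight or probability is the kind of metric the slide lists as future work.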

