# Unsupervised and Knowledge-free Morpheme Segmentation and Analysis (Stefan Bordag, University of Leipzig)


1 Unsupervised and Knowledge-free Morpheme Segmentation and Analysis. Stefan Bordag, University of Leipzig.

Outline: Components; Detailing (Compound splitting, Iterated LSV, Split trie training, Morpheme Analysis); Results; Discussion.

2 1. Components

The main components of the current LSV-based segmentation algorithm:

- Compound splitter (new)
- LSV component (new: iterated)
- Trie classifier (new: split into two phases)

Morpheme analysis (entirely new) is based on:

- morpheme segmentation (see above),
- clustering of morphs into morphemes,
- contextual similarity of morphemes.

The main focus is on modularity, so that each module has a specific function and could be replaced by a better algorithm by someone else.

3 2.1. Compound Splitter

Based on the observation that especially long words pose a problem for LSV.

Simple heuristic: split whenever a word is decomposable into several words that have

- a minimum length of 4, and
- a minimum frequency of 10 (or some other arbitrary figures).

This results in many missed but at least some correct divisions (precision being more important than recall at this point): P=88% R=10% F=18%.

Where several decompositions are possible, the decomposition with more words of higher frequency wins.
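The heuristic above can be sketched as follows. This is a minimal illustration, not the paper's implementation: function names, the scoring (sum of part frequencies), and the toy frequency table are all assumptions; the length and frequency thresholds are the ones stated on the slide.

```python
# Hypothetical sketch of the compound-splitting heuristic: a word is
# split into known words of length >= 4 and frequency >= 10; among
# competing multi-part decompositions, the one whose parts have the
# highest frequencies wins.

def split_compound(word, freq, min_len=4, min_freq=10):
    """Return the best decomposition of `word` into known words, or None."""
    best = None  # (score, parts)

    def search(rest, parts):
        nonlocal best
        if not rest:
            # a complete decomposition; score it by summed part frequency
            score = sum(freq[p] for p in parts)
            if len(parts) > 1 and (best is None or score > best[0]):
                best = (score, parts[:])
            return
        for i in range(min_len, len(rest) + 1):
            prefix = rest[:i]
            if freq.get(prefix, 0) >= min_freq:
                parts.append(prefix)
                search(rest[i:], parts)
                parts.pop()

    search(word, [])
    return best[1] if best else None

freq = {"book": 50, "shop": 40}          # toy corpus frequencies
print(split_compound("bookshop", freq))  # -> ['book', 'shop']
```

Words with no qualifying decomposition are left whole, which matches the high-precision, low-recall behaviour reported on the slide.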

4 2.2. Original solution in two parts

(Diagram.) Part one: from sentences (e.g. "The talk was very informative"), compute word co-occurrences (The/talk: 1, talk/was: 1, ...) and, from these, contextually similar words (talk/speech: 20, was/is: 15, ...); for each word, compute the LSV score over its similar words and combine it as s = LSV * freq * multiletter * bigram. Part two: train a trie classifier on the resulting segmentations (clear-ly, late-ly, early, ...) and apply the classifier to the remaining words.

5 2.3. Original Letter Successor Variety

Letter successor variety (Harris 1955): a word split occurs where the number of distinct letters that follow a given sequence of characters surpasses a threshold.

- Input: the 150 contextually most similar words.
- Observe how many different letters occur after each part of the string:
  - #cle- : only 1 letter follows
  - -ly# : reversed, 16 different letters occur before -ly# (16 different stems preceding the suffix -ly#)

Example for "clearly":

    #  c  l  e  a  r  l  y  #
    28 5  3  1  1  1  1  1      from left (thus after #cl-, 5 various letters)
    1  1  2  1  3  16 10 14     from right (thus before -y#, 10 various letters)
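The forward counting can be sketched as below; a toy vocabulary stands in for the 150 contextually most similar words the slide describes (the reversed, right-to-left pass works the same way on reversed strings). Names and the example vocabulary are illustrative.

```python
# Minimal sketch of letter successor variety (Harris 1955): for each
# prefix of a word, count how many distinct letters follow that prefix
# across a vocabulary. A sudden jump in variety suggests a morpheme
# boundary.

def successor_variety(word, vocab):
    """For each split point i, count distinct letters following word[:i]."""
    varieties = []
    for i in range(1, len(word)):
        prefix = word[:i]
        followers = {w[i] for w in vocab if w.startswith(prefix) and len(w) > i}
        varieties.append((prefix, len(followers)))
    return varieties

vocab = ["clearly", "clear", "cleans", "climb", "close"]
for prefix, v in successor_variety("clearly", vocab):
    print(prefix, v)   # e.g. 'cl' is followed by e, i, o -> variety 3
```

In this toy vocabulary the variety after "cl" (3) exceeds that after "cle" (1), mirroring the kind of contrast the slide's #cle- / -ly# example shows on real data.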

6 2.4. Balancing factors

The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that would otherwise add noise:

- freq: frequency differences between the beginning and the middle of a word
- multiletter: representation of single phonemes by several letters
- bigram: certain fixed combinations of letters

The final score s for each possible boundary is then:

s = LSV * freq * multiletter * bigram

7 2.5. Iterated LSV

The iteration of LSV is based on previously found information. For example, when computing "ignited" with the most similar words already analysed into

caus-ed, struck, injur-ed, blazed, fire, ...

there is more evidence for ignit-ed, because most words ending with -ed were found to have -ed as a morpheme.

This is implemented in the form of a weight iterLSV:

iterLSV = #wordsEndingIsMorph / #wordsSameEnding

hence:

s = LSV * freq * multiletter * bigram * iterLSV
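The iterLSV ratio can be sketched directly from its definition. This is a hedged illustration: the function name, the dictionary representation of analyses, and the neutral fallback of 1.0 when no words share the ending are assumptions, not details from the slides.

```python
# Sketch of the iterLSV weight: among previously analysed words with the
# same ending, the fraction in which that ending was segmented off as a
# morpheme (#wordsEndingIsMorph / #wordsSameEnding).

def iter_lsv_weight(ending, analysed):
    """analysed maps word -> list of morphs, e.g. 'caused' -> ['caus', 'ed']."""
    same_ending = [w for w in analysed if w.endswith(ending)]
    if not same_ending:
        return 1.0  # no evidence either way; leave the score unchanged
    ending_is_morph = [w for w in same_ending if analysed[w][-1] == ending]
    return len(ending_is_morph) / len(same_ending)

analysed = {
    "caused": ["caus", "ed"],
    "injured": ["injur", "ed"],
    "blazed": ["blazed"],   # not segmented in the previous pass
}
print(iter_lsv_weight("ed", analysed))  # 2 of 3 words ending in -ed
```

Multiplying this weight into s raises the score of boundaries like ignit-ed whenever -ed was segmented off in most previously analysed words.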

8 2.6. Patricia Compact Trie as Classifier

(Diagram.) Training: known segmentations (clear-ly, late-ly, early, clear, late) are stored in a trie over reversed words, and each node records how often each split class was seen (e.g. ly=2 for the -ly suffix class). Application: for an unseen word such as "amazing?ly", the deepest matching node is applied, yielding amazing-ly; for "dear?ly", the retrieved information (¤=1) yields dearly without a split.
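The classifier idea can be sketched with a plain (uncompacted) trie; the slides use a Patricia compact trie, but the train/apply logic is the same. Everything below is an illustrative simplification: node layout, class labels, and the majority-vote decision are assumptions.

```python
# Illustrative sketch of the trie classifier: a trie over reversed words
# stores, at each node, counts of the suffix split seen in training; an
# unseen word is classified by the deepest matching node.

from collections import defaultdict

def train(segmentations):
    """segmentations: (word, suffix) pairs; suffix is '' when the word is unsplit."""
    trie = {}
    for word, suffix in segmentations:
        node = trie
        for ch in reversed(word):
            node = node.setdefault(ch, {})
            node.setdefault("_counts", defaultdict(int))[suffix or "NONE"] += 1
    return trie

def classify(trie, word):
    """Walk the reversed word; return the majority class at the deepest node."""
    node, counts = trie, None
    for ch in reversed(word):
        if ch not in node:
            break
        node = node[ch]
        counts = node.get("_counts", counts)
    if not counts:
        return None
    return max(counts, key=counts.get)

trie = train([("clearly", "ly"), ("lately", "ly"), ("early", "")])
print(classify(trie, "amazingly"))  # -> 'ly', i.e. amazing-ly
```

The deepest-node rule is what makes the classifier generalize: "amazingly" shares only the -ly tail with the training words, yet the ly=2 majority at that node is enough to propose the split.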

9 2.7. Splitting trie application

The trie classifier could decide for ignit-ed based on the top node in the trie from the back:

-d with classes -ed:50; -d:10; -ted:5; ...

hence not taking any context within the word into account.

The new version save_trie (as opposed to rec_trie) trains one trie from LSV data and decides only if at least one more letter, in addition to the letters of the proposed morpheme, matches in the word. save_trie and rec_trie are then trained and applied consecutively.

Example: trained on injur-ed and caus-ed, save_trie leaves "ignited" unsplit, while rec_trie yields ignit-ed.

10 2.8. Effect of the improvements

- compounds: P=88% R=10% F=18%
- compounds + recTrie: P=66% R=28% F=39%
- compounds + lsv_0 + recTrie: P=71% R=58% F=64%
- compounds + lsv_2 + recTrie: P=69% R=63% F=66%
- compounds + lsv_2 + saveTrie + recTrie: P=69% R=66% F=67%

Most notably, these changes reach the same performance level as the original lsv_0 + recTrie (F=70) on a corpus three times smaller. However, applying them to a corpus three times bigger only increases the number of split words, not the quality of the splits!

11 3. Morpheme Analysis

Assumes visible morphs (i.e. the output of a segmentation algorithm). This makes it possible to compute co-occurrences of morphs, which enables computing the contextual similarity of morphs, which in turn enables clustering morphs into morphemes.

Traditional representation of morphemes:

- barefooted: BARE FOOT +PAST
- flying: FLY_V +PCP1
- footprints: FOOT PRINT +PL

Equivalent representation of morphemes for processing:

- barefooted: bare 5foot.6foot.foot ed
- flying: fly inag.ing.ingu.iong
- footprints: 5foot.6foot.foot prints

12 3.1. Computing alternations

    for each morph m
      for each contextually similar morph s of m
        if LD_Similar(s, m)
          r = makeRule(s, m)
          store(r -> s, m)

    for each word w
      for each morph m of w
        if in_store(m)
          sig = createSignature(m)
          write sig
        else
          write m

Example: m = foot, s = {feet, 5foot, ...}; LD(foot, 5foot) = 1 gives the rule _-5 -> foot,5foot. For barefooted = {bare, foot, ed}: foot has the rules _-5 and _-6, so its signature is foot.5foot.6foot.
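The rule-extraction step of the pseudocode above can be made runnable as follows. This is a narrow sketch under stated assumptions: only prefix alternations like _-5 (empty prefix alternating with "5", as in foot/5foot) are handled, LD_Similar is taken to mean edit distance 1, and all function names are illustrative.

```python
# Sketch of alternation-rule extraction: for morph pairs within
# Levenshtein distance 1, record a prefix-alternation rule such as
# ('_', '5'), meaning the empty prefix alternates with '5'.

def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def make_prefix_rule(m, s):
    """Return (prefix_m, prefix_s) if one morph is the other plus a prefix."""
    if s.endswith(m):
        return ("_", s[: len(s) - len(m)] or "_")
    if m.endswith(s):
        return (m[: len(m) - len(s)] or "_", "_")
    return None

pairs = [("foot", "5foot"), ("foot", "6foot"), ("bahn", "ubahn")]
rules = {}
for m, s in pairs:
    if levenshtein(m, s) == 1:            # LD_Similar
        rule = make_prefix_rule(m, s)     # makeRule
        if rule:
            rules.setdefault(rule, []).append((m, s))  # store
print(rules)
```

A morph's signature is then built by collecting all variants reachable through its stored rules (foot plus _-5 and _-6 gives foot.5foot.6foot); the real rule set also covers suffix and internal alternations such as m-s and m-r on the next slide.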

13 3.2. Real examples

Rules:

- m-s (49.0): barem,bares blum,blus erem,eres estem,estes etem,etes eurem,eures ifm,ifs igem,iges ihrem,ihres jedem,jedes lme,lse losem,loses mache,sache mai,sai
- _-u (46.0): bahn,ubahn bdi,bdiu boot,uboot bootes,ubootes cor,coru dejan,dejuan dem,demu dem,deum die,dieu em,eum en,eun en,uen erin,eurin
- m-r (44.0): barem,barer dem,der demselb,derselb einem,einer ertem,erter estem,ester eurem,eurer igem,iger ihm,ihr ihme,ihre ihrem,ihrer jedem,jeder

Signatures:

- muessen: muess.muesst.muss en
- ihrer: ihre.ihrem.ihren.ihrer.ihres
- werde: werd.wird.wuerd e
- Ihren: ihre.ihrem.ihren.ihrer.ihres.ihrn

14 3.3. More examples

- kabinettsaufteilung: kabinet.kabinett.kabinetts aauf.aeuf.auf.aufs.dauf.hauf tail.teil.teile.teils.teilt bung.dung.kung.rung.tung.ung.ungs
- entwaffnungsbericht: enkt.ent.entf.entp waff.waffn.waffne.waffnet lungs.rungs.tungs.ung.ungn.ungs berich.bericht
- grundstuecksverwaltung: gruend.grund stuecks nver.sver.veer.ver walt bung.dung.kung.rung.tung.ung.ungs
- grundt: gruend.grund t

15 4. Results (competition 1)

GERMAN

| Author    | Method            | Precision | Recall | F-measure |
|-----------|-------------------|-----------|--------|-----------|
| Bernhard  | 1                 | 63.20%    | 37.69% | 47.22%    |
| Bernhard  | 2                 | 49.08%    | 57.35% | 52.89%    |
| Bordag    | 5                 | 60.71%    | 40.58% | 48.64%    |
| Bordag    | 5a                | 60.45%    | 41.57% | 49.27%    |
| McNamee   | 3                 | 45.78%    | 9.28%  | 15.43%    |
| Zeman     | -                 | 52.79%    | 28.46% | 36.98%    |
| Monson&co | Morfessor         | 67.16%    | 36.83% | 47.57%    |
| Monson&co | ParaMor           | 59.05%    | 32.81% | 42.19%    |
| Monson&co | Paramor&Morfessor | 51.45%    | 55.55% | 53.42%    |
| Morfessor | MAP               | 67.56%    | 36.92% | 47.75%    |

ENGLISH

| Author    | Method            | Precision | Recall | F-measure |
|-----------|-------------------|-----------|--------|-----------|
| Bernhard  | 1                 | 72.05%    | 52.47% | 60.72%    |
| Bernhard  | 2                 | 61.63%    | 60.01% | 60.81%    |
| Bordag    | 5                 | 59.80%    | 31.50% | 41.27%    |
| Bordag    | 5a                | 59.69%    | 32.12% | 41.77%    |
| McNamee   | 3                 | 43.47%    | 17.55% | 25.01%    |
| Zeman     | -                 | 52.98%    | 42.07% | 46.90%    |
| Monson&co | Morfessor         | 77.22%    | 33.95% | 47.16%    |
| Monson&co | ParaMor           | 48.46%    | 52.95% | 50.61%    |
| Monson&co | Paramor&Morfessor | 41.58%    | 65.08% | 50.74%    |
| Morfessor | MAP               | 82.17%    | 33.08% | 47.17%    |

16 4.1. Results (competition 1)

TURKISH

| Author    | Method | Precision | Recall | F-measure |
|-----------|--------|-----------|--------|-----------|
| Bernhard  | 1      | 78.22%    | 10.93% | 19.18%    |
| Bernhard  | 2      | 73.69%    | 14.80% | 24.65%    |
| Bordag    | 5      | 81.44%    | 17.45% | 28.75%    |
| Bordag    | 5a     | 81.31%    | 17.58% | 28.91%    |
| McNamee   | 3      | 65.00%    | 10.83% | 18.57%    |
| McNamee   | 4      | 85.49%    | 6.59%  | 12.24%    |
| McNamee   | 5      | 94.80%    | 3.31%  | 6.39%     |
| Zeman     | -      | 65.81%    | 18.79% | 29.23%    |
| Morfessor | MAP    | 76.36%    | 24.50% | 37.10%    |

FINNISH

| Author    | Method | Precision | Recall | F-measure |
|-----------|--------|-----------|--------|-----------|
| Bernhard  | 1      | 75.99%    | 25.01% | 37.63%    |
| Bernhard  | 2      | 59.65%    | 40.44% | 48.20%    |
| Bordag    | 5      | 71.72%    | 23.61% | 35.52%    |
| Bordag    | 5a     | 71.32%    | 24.40% | 36.36%    |
| McNamee   | 3      | 45.53%    | 8.56%  | 14.41%    |
| McNamee   | 4      | 68.09%    | 5.68%  | 10.49%    |
| McNamee   | 5      | 86.69%    | 3.35%  | 6.45%     |
| Zeman     | -      | 58.84%    | 20.92% | 30.87%    |
| Morfessor | MAP    | 76.83%    | 27.54% | 40.55%    |

17 5.1. Problems of Morpheme Analysis

Surprise #1: nearly no effect on the evaluation results! Possible reasons:

- rules: type frequency is not taken into account (hence errors are overvalued)
- rules: context is not taken into account (instead of _-5, better _5f- or _fo)
- segmentation: produces many errors, so the analysis has to put up with a lot of noise

18 5.2. Problems of Segmentation

Surprise #2: the size of the corpus has no large influence on the quality of segmentations. It influences only how many nearly perfect segmentations are found by LSV, but that is by far outweighed by the errors of the trie.

The strength of LSV is to segment irregular words properly, because they have high frequency and are usually short. The strength of most other proposed methods lies in segmenting long and infrequent words. A combination is evidently desirable.

19 5.3. Further avenues?

The most notable current problem is the assumption that the phonemes representing a morph / morpheme are contiguous, i.e. that AAA + BBB usually becomes AAABBB, not ABABAB. For languages that merge morphemes this is inappropriate.

A better solution might be similar to U-DOP by Rens Bod:

- generate all possible parsing trees for each token,
- then collate them for the type and generate possible optimal parses,
- possibly generating tries not just for the type, but also for some context, for example (relevant context highlighted): "Yesterday we arrived by plane."

20 THANK YOU!

