N-gram Tokenization for Indian Language Text Retrieval
Paul McNamee
13 December 2008

2 Talk Outline
- Introduction
- Monolingual experiments from CLEF 2000-2007
  - Words
  - Stemmed words (Snowball)
  - Character n-grams (n=4,5)
  - N-gram stems
  - Automatically segmented words (Morfessor algorithm)
  - Skipgrams (n-grams with skips)
- Why are n-grams effective?
- Bilingual experiments (CLEF)
- FIRE results
- Summary

3 Morphological Processes
- Inflection: box, boxes (plural); actor (male), actress (female)
- Conjugation: write, written, writing; swim, swam, swum
- Derivation: sleep, sleepy; play (verb), player (noun), playful (adjective)
- Word formation:
  - Compounding: news + paper = newspaper; air + port = airport
  - Clipping: professor -> prof; facsimile -> fax
  - Acronyms: GOI = Government of India

4 Why Do We Normalize Text?
It seems desirable to group related words together for query/document processing. Why?
- To make lexicographers happy?
- To improve system performance?
If performance is the goal, then it ought not to matter whether the indexing terms look like morphemes or not.

5 Rule-Based Stemming: Snowball
- Applicable to alphabetic languages
- An approximation to lemmatization: identify a root morpheme by chopping off prefixes and suffixes
- The Snowball project provides high-quality, rule-based stemmers for many European languages: used for Dutch, English, Finnish, French, German, Italian, Spanish, and Swedish; rulesets also exist for Hungarian and Portuguese
- No Indian language support
Most stemmers are rule-based:
- -ing => (removed): juggling => juggl
- -es => (removed): juggles => juggl
- -le => -l: juggle => juggl
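The rule style above can be sketched in a few lines. This toy stemmer implements only the slide's three example rules; the real Snowball rulesets are far larger and add region conditions per language:

```python
# Toy suffix-stripping stemmer in the style of the slide's examples.
# NOT the real Snowball algorithm: just the three rules shown above.
RULES = [
    ("ing", ""),   # juggling -> juggl
    ("es", ""),    # juggles  -> juggl
    ("le", "l"),   # juggle   -> juggl
]

def toy_stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word  # no rule applied

print(toy_stem("juggling"), toy_stem("juggles"), toy_stem("juggle"))
# -> juggl juggl juggl
```

All three inflected forms conflate to the same pseudo-stem, which is the point of stemming for retrieval.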

6 N-gram Tokenization
- Represent text as overlapping substrings
- A fixed length n of 4 or 5 is effective in alphabetic languages
- For text of length m, there are m-n+1 n-grams
- Advantages: simple; addresses morphology; surrogate for short phrases; robust against spelling and diacritical errors; language independence
- Disadvantages: conflation (e.g., simmer, slimmer, glimmer, immerse); n-grams incur both speed and disk-usage penalties
Example: swimmers => _swim, swimm, wimme, immer, mmers, mers_
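The tokenization itself is tiny. This sketch pads words with a '_' boundary marker, matching the swimmers example above:

```python
def char_ngrams(word, n=5):
    """Overlapping character n-grams, with '_' marking the word
    boundaries as in the slide's swimmers example."""
    padded = f"_{word}_"
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("swimmers"))
# -> ['_swim', 'swimm', 'wimme', 'immer', 'mmers', 'mers_']
```

For unpadded text of length m this yields the m-n+1 grams cited on the slide; padding adds the boundary-anchored grams.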

7 Single N-gram Stemming
- Traditional (rule-based) stemming attempts to remove the morphologically variable portion of words, with negative effects from over- and under-conflation
- Short n-grams covering affixes occur frequently; those around the root morpheme tend to occur less often. This motivates the following approach:
  1. For each word, choose its least frequently occurring character 4-gram (using a 4-gram index)
  2. This gives the benefits of n-grams with the run-time efficiency of stemming
Example 4-gram frequencies:
  hungarian: _hun (20547), hung (4329), unga (1773), ngar (1194), gari (2477), aria (11036), rian (18485), ian_ (49777)
  bulgarian: _bul (10222), bulg (963), ulga (1955), lgar (1480), gari (2477), aria (11036), rian (18485), ian_ (49777)
Continues work in Mayfield and McNamee, 'Single N-gram Stemming', SIGIR 2003
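A minimal sketch of the two-step procedure, using an assumed toy corpus in place of a real collection-wide 4-gram index:

```python
from collections import Counter

def gram_index(words, n=4):
    """Build a 4-gram frequency index over the collection."""
    counts = Counter()
    for w in words:
        padded = f"_{w}_"
        counts.update(padded[i : i + n] for i in range(len(padded) - n + 1))
    return counts

def ngram_stem(word, counts, n=4):
    """A word's pseudo-stem is its least frequent n-gram, which
    tends to sit on the root morpheme rather than on an affix."""
    padded = f"_{word}_"
    grams = [padded[i : i + n] for i in range(len(padded) - n + 1)]
    return min(grams, key=lambda g: counts[g])

# Toy corpus (an assumption for illustration; real counts come
# from the full index, as in the hungarian/bulgarian figures above).
corpus = ["golfing", "golfer", "golfed", "walking", "walker", "walked"]
idx = gram_index(corpus)
print(ngram_stem("golfing", idx))
# -> olfi
```

The affix grams ('_gol', 'ing_') are shared across many forms and therefore frequent; the rarest gram lands on or near the root, so a single gram can serve as an index term.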

8 Statistical Segmentation
Morfessor algorithm:
- Given a dictionary list, learns to split words into segments; multiple segments per word are generated
- A form of statistical stemming based on Minimum Description Length (MDL)
- Over 70% of world languages have concatenative morphology
- Creutz & Lagus, ACL 2002; 2007 Morphology Challenge
- Successful on an IR task
Examples: affect+ion+ate, author+ized, juggle+d, juggle+r+s, sea+gull+s
See McNamee, Nicholas, & Mayfield, 'Don't Have a Stemmer? Be un+concern+ed', SIGIR 2008

9 Character Skipgrams
- Character n-grams are a robust matching technique; skipgrams are even more robust
- Some letters are omitted (essentially a wildcard match): sw*m matches swim / swam / swum; f**t matches foot / feet
- Skip bigrams for fuzzy matching:
  - Pirkola et al. (2002): learning cross-lingual translation mappings in related languages
  - Mustafa (2004): monolingual Arabic retrieval
- Example: (4,2) skipgrams for "hopkins" (4 letters, 2 skips):
  hkin, hpin, hpkn, hoin, hokn, hopn, oins, okns, okis, opns, opis, opks
- Note: more skipgrams than plain n-grams
- Slight gains in Czech, Hungarian, Persian
- Possible application to OCR'd documents?
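A sketch of (n, k) skipgram generation. The keep-the-window-endpoints convention is inferred from the Hopkins example rather than stated explicitly on the slide:

```python
from itertools import combinations

def skipgrams(word, n=4, k=2):
    """(n, k) character skipgrams: slide a window of n + k letters
    over the word, keep the window's first and last letters, and
    choose the remaining n - 2 letters (in order) from the middle."""
    width = n + k
    grams = []
    for i in range(len(word) - width + 1):
        window = word[i : i + width]
        first, middle, last = window[0], window[1:-1], window[-1]
        for chosen in combinations(middle, n - 2):
            grams.append(first + "".join(chosen) + last)
    return grams

print(skipgrams("hopkins"))
```

Each 6-letter window yields C(4,2) = 6 grams, reproducing the twelve forms listed above and illustrating why skipgram indexes are larger than plain n-gram indexes.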

10 Generating Indexing Terms
Word            | Snowball | Morfessor         | 5-grams
authored        | author   | author+ed         | _auth, autho, uthor, thore, hored, ored_
authorized      | author   | author+ized       | _auth, autho, uthor, thori, horiz, orize, rized, ized_
authorship      |          | author+ship       | _auth, autho, uthor, thors, horsh, orshi, rship, ship_
reauthorization | reauthor | re+author+ization | _reau, reaut, eauth, autho, uthor, thori, horiz, oriza, rizat, izati, zatio, ation, tion_
afoot           |          | a+foot            | _afoo, afoot, foot_
footballs       | footbal  | football+s        | _foot, footb, ootba, otbal, tball, balls, alls_
footloose       | footloos | foot+loose        | _foot, footl, ootlo, otloo, tloos, loose, oose_
footprint       |          | foot+print        | _foot, footp, ootpr, otpri, tprin, print, rint_
feet            |          |                   | _feet, feet_
juggle          | juggl    | juggle            | _jugg, juggl, uggle, ggle_
juggled         | juggl    | juggle+d          | _jugg, juggl, uggle, ggled, gled_
jugglers        | juggler  | juggle+r+s        | _jugg, juggl, uggle, ggler, glers, lers_

11 JHU/APL HAIRCUT System
The Hopkins Automatic Information Retriever for Combing Unstructured Text (HAIRCUT):
- Uses a state-of-the-art statistical language model
  - Ponte & Croft, 'A Language Modeling Approach to Information Retrieval', SIGIR 1998
  - Miller, Leek, and Schwartz, 'A Hidden Markov Model Information Retrieval System', SIGIR 1999
  - Typically sets the smoothing parameter λ to 0.5
- Language-neutral
- Supports large dictionaries
- Used at TREC (10x), CLEF (9x), NTCIR (2x)
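The scoring model can be illustrated with a linearly smoothed unigram query-likelihood score in the Ponte & Croft / Miller et al. tradition. This is a sketch of that general approach with λ = 0.5, not HAIRCUT's actual implementation:

```python
import math
from collections import Counter

LAM = 0.5  # document/collection mixing weight, as on the slide

def lm_score(query, doc, collection):
    """log P(query | doc) under a linearly smoothed unigram model:
    each query term contributes log(λ·P(t|D) + (1-λ)·P(t|C))."""
    d, c = Counter(doc), Counter(collection)
    score = 0.0
    for t in query:
        p_doc = d[t] / len(doc)          # maximum-likelihood P(t|D)
        p_coll = c[t] / len(collection)  # collection model P(t|C)
        score += math.log(LAM * p_doc + (1 - LAM) * p_coll)
    return score

# Illustrative toy documents (assumptions, not CLEF data).
doc1 = ["oil", "price", "rose"]
doc2 = ["match", "team", "goal"]
coll = doc1 + doc2
print(lm_score(["oil", "price"], doc1, coll))
```

The collection term keeps the score finite when a query term is missing from a document, which is what makes the model usable for ranking; the same scoring works whether the "terms" are words or character n-grams, which is why the system is language-neutral.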

12 CLEF Ad Hoc Test Sets (2000-2007)
Language        | #docs | Size    | Topics per campaign            | Total
Bulgarian (BG)  | 69 k  | 213 MB  | 49, 50, 50                     | 149
Czech (CS)      | 82 k  | 178 MB  | 50                             | 50
Dutch (NL)      | 190 k | 540 MB  | 50, 50, 56                     | 156
English (EN)    | 170 k | 580 MB  | 33, 47, 42, 54, 42, 50, 49, 50 | 367
Finnish (FI)    | 55 k  | 137 MB  | 30, 45, 45                     | 120
French (FR)     | 178 k | 470 MB  | 34, 49, 50, 52, 49, 50, 49     | 333
German (DE)     | 295 k | 660 MB  | 37, 49, 50, 56                 | 192
Hungarian (HU)  | 50 k  | 105 MB  | 50, 48, 50                     | 148
Italian (IT)    | 157 k | 363 MB  | 34, 47, 49, 51                 | 181
Portuguese (PT) | 107 k | 340 MB  | 46, 50, 50                     | 146
Russian (RU)    | 17 k  | 68 MB   | 28, 34                         | 62
Spanish (ES)    | 453 k | 1086 MB | 49, 50, 57                     | 156
Swedish (SV)    | 143 k | 352 MB  | 49, 53                         | 102

13 Tokenization Alternatives
- Stemming: effective in Romance languages, but not always available
- N-grams: language-neutral; large gains in complex languages
- Other techniques: statistical stemming beats words
  - Segmentation
  - Single n-gram stems: no run-time penalty

14 Monolingual Tokenization

15 IR & Language Family
5-gram gains:
- Tied to morphological complexity
- Small improvements in the Romance family
Estimating complexity:
- Mean word length: Spearman rho = 0.77
- Information-theoretic approach: Spearman rho = 0.67
- Kettunen et al., Juola

16 Why are N-grams Effective?
(1) Spelling
- N-grams localize single-letter spelling errors
- In news text, about 1 in 2000 words is misspelled
(2) Phrasal clues
- Word-spanning n-grams hint at phrases
- Only slight differences observed

17 (3) Because of Morphological Variation?
- N-grams might gain their power by controlling for morphological variation: n-grams focused on root morphemes tend to match across inflected forms
- Juola (1998) and Kettunen (2006) did experiments 'removing' morphology from a language, such as replacing each surface form with a 6-digit number
- I compared words and 5-grams under normal and permuted-letter conditions:
  golfer: legfro; golfed: dofegl; golfing: ligfron

18 Source of N-gram Power
- Idea: remove morphology from a language
- The letter order of words was randomly permuted: golfer -> legfro, team -> eamt
- golfing, golfer, golfed no longer share a morpheme
- 4 conditions: {words, 5-grams} x {normal, shuffled}
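The shuffled condition can be simulated by giving each distinct surface form its own random letter order. This is a hypothetical sketch of the setup, not the experiment's actual code:

```python
import random

def permute_lexicon(words, seed=42):
    """Assign every distinct surface form an independent random
    letter order, so related forms (golfing, golfer, golfed) no
    longer share any written morpheme. Each word type maps to the
    same scrambled form every time, so word-level retrieval is
    unaffected while n-gram overlap between related forms is lost."""
    rng = random.Random(seed)
    mapping = {}
    for w in sorted(set(words)):
        letters = list(w)
        rng.shuffle(letters)
        mapping[w] = "".join(letters)
    return mapping

lex = permute_lexicon(["golfing", "golfer", "golfed"])
print(lex)
```

Because whole-word matching survives the scramble but shared substrings do not, comparing the four conditions isolates how much of the n-gram advantage comes from morphology.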

19 Corpus-Based Translation
Given aligned parallel texts and a particular term to translate:
- Find the set of documents (sentences) in the source language containing the term
- Examine the corresponding foreign documents
- Extract 'good' candidate(s); goodness can be based on term-similarity measures (Dice, MI, IBM Model 1, etc.)
Historical aside: The Rosetta Stone was discovered in 1799 by Napoleonic forces in Egypt. British physicist Thomas Young determined that cartouches were names of royalty. In 1821 Jean-François Champollion began deciphering hieroglyphics using parallel data in Demotic and Greek.
Example pair: "El precio del petróleo aumentó ayer. La economía reaccionó agudamente …" / "The price of oil increased yesterday. The economy reacted sharply …"
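The steps above can be sketched with the Dice coefficient over aligned document pairs; the tiny parallel "corpus" here is an illustrative assumption:

```python
def dice_translations(term, src_docs, tgt_docs, top_k=3):
    """Rank target-side terms as translation candidates for `term`
    by the Dice coefficient over aligned document pairs:
    Dice = 2 * |D(term) & D(cand)| / (|D(term)| + |D(cand)|)."""
    src_hits = {i for i, d in enumerate(src_docs) if term in d}
    tgt_df = {}   # document frequency of each target term
    joint = {}    # co-occurrence count with the source term
    for i, d in enumerate(tgt_docs):
        for cand in set(d):
            tgt_df[cand] = tgt_df.get(cand, 0) + 1
            if i in src_hits:
                joint[cand] = joint.get(cand, 0) + 1
    scored = sorted(
        ((2 * j / (len(src_hits) + tgt_df[c]), c) for c, j in joint.items()),
        reverse=True,
    )
    return [c for _, c in scored[:top_k]]

# Toy aligned sentence pairs (illustrative, not real CLEF data).
src = [["price", "oil"], ["oil", "spill"], ["team", "won"]]
tgt = [["precio", "petroleo"], ["petroleo", "derrame"], ["equipo", "gano"]]
print(dice_translations("oil", src, tgt, top_k=1))
# -> ['petroleo']
```

Swapping the Dice formula for mutual information or IBM Model 1 probabilities changes only the scoring line; the candidate-gathering loop is the same.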

20 N-gram Translations
Character n-grams can be statistically translated, just like words. N-grams (such as n=4,5) are smaller than words:
- May capture affixes and morphological roots
  - 'work' (from working) maps to 'abaj' (as in trabajaba)
  - 'yrup' (from syrup) maps to 'rabe' (as in jarabe)
- Suitable for proper nouns
  - 'therl' (from Netherlands) maps to 'ses b' (as in Países Bajos)

         German               Italian
word     milch                latte
stem     milch                latt
4-grams  milc, ilch           latt
5-grams  _milc, milch, ilch_  _latt, latte

         French               Dutch
word     lait                 melk
stem     lait                 melk
4-grams  lait                 melk
5-grams  _lait, lait_         _melk, melk_

21 Parallel Sources
Corpus     | Size | Genre                | CLEF languages
Bible      | 785k | Religious            | CZ, DE, EN, ES, FI, FR, IT, NL, PT, RU, SV
JRC/Acquis | 32M  | EU law               | BG, CZ, DE, EN, ES, FI, FR, HU, IT, NL, PT, RU, SV
Europarl   | 33M  | Parliamentary debate | DE, EN, ES, FI, FR, IT, NL, PT, SV
OJEU       | 84M  | Governmental affairs | DE, EN, ES, FI, FR, IT, NL, PT, SV
Examples:
- Bible: "Therefore was the name of it called Babel; because Jehovah did there confound the language of all the earth: and from thence did Jehovah scatter them abroad upon the face of all the earth."
- Acquis: "(24) In order to contribute to the conservation of octopus and in particular to protect the juveniles, it is necessary to establish, in 2006, a minimum size of octopus from the maritime waters under the sovereignty or jurisdiction of third countries and situated in the CECAF region pending the adoption of a regulation amending Regulation (EC) No 850/98."
- Europarl: "Mr President, the tsunami tragedy should be no less significant to the world's leaders and to Europe than 11 September."
- OJEU: "11. Trafficking in women for sexual exploitation. A4-0372/97. Resolution on the Communication from the Commission to the Council and the European Parliament on trafficking in women for the purpose of sexual exploitation (COM(96)0567 - C4-0638/96). The European Parliament,"

22 Effectiveness & Corpus Size
- English queries translated using Europarl
- Corpus sub-sampled from 1% to 100%

23 Effectiveness by Size (2)

24 FIRE Index Characteristics
- Vocabulary size in the Indian languages seems abnormally small
- Possibly a bug in my pre-processing or tokenization, perhaps related to Unicode (e.g., continuation or modification characters)

25 Tokenization for FIRE 2008
Language     | words  | 4-grams | 5-grams | sk41   | Top @ FIRE
Bengali (BN) | 0.1231 | 0.3280  | 0.3582  | 0.3352 | 0.4719
English (EN) | 0.5495 | 0.5241  | 0.5415  | 0.5264 | 0.5572
Hindi (HI)   | 0.0672 | 0.2820  | 0.3487  | 0.2746 | 0.3487
Marathi (MR) | 0.1735 | 0.3740  | 0.3675  | 0.3478 | 0.4483
Average      | 0.2283 | 0.3834  | 0.4040  | 0.3710 | 0.4565
Observations:
- Difficult to interpret results given the anomalous vocabulary; failure analysis is needed
- Performance using words in the Indian languages seems quite depressed
- The Hindi 5-gram run had good relative performance; its difference vs. 4-grams is much larger than typically seen

26 Relative Gains with Relevance Feedback
- Query expansion using the top 10 documents
- 50 terms (words), 150 terms (4/5-grams), 400 terms (sk41)
- Fairly effective: 20-40% gains
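The feedback loop can be sketched with simple frequency-based term selection; the actual runs presumably weight expansion terms rather than merely appending them, so treat this as an assumption-laden illustration:

```python
from collections import Counter

def expand_query(query, ranked_docs, n_docs=10, n_terms=50):
    """Blind (pseudo-) relevance feedback: pool the terms of the
    top-ranked documents and append the most frequent terms not
    already in the query. `n_terms` matches the slide's settings
    (50 for words, 150 for n-grams, 400 for sk41)."""
    counts = Counter()
    for doc in ranked_docs[:n_docs]:
        counts.update(doc)
    original = set(query)
    expansion = [t for t, _ in counts.most_common() if t not in original]
    return list(query) + expansion[:n_terms]

# Toy top-ranked documents from an assumed first-pass retrieval.
docs = [["oil", "price", "price", "barrel"], ["price", "market"]]
print(expand_query(["oil"], docs, n_docs=2, n_terms=2))
# -> ['oil', 'price', 'barrel']
```

The expanded query is then rerun against the index; the 20-40% gains above come from this second pass.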

27 In Conclusion
Compared several forms of representing text:
- In European languages, n-grams obtain a 20% gain over words
- Rule-based stemming is good in Romance languages
- Morfessor segments and n-gram stems are better than words, but not as good as the Snowball stemmer
N-gram gains:
- Greatest in morphologically richer languages
- Lost when morphology is 'removed' from a language
FIRE:
- N-grams and relevance feedback are also effective in the Indian languages
- Must resolve the vocabulary issue
- Difficulty finding parallel text, but would like to investigate bilingual retrieval
