DCU meets MET: Bengali and Hindi Morpheme Extraction Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones CNGL, School of Computing, Dublin City University,


1 DCU meets MET: Bengali and Hindi Morpheme Extraction Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones CNGL, School of Computing, Dublin City University, Ireland

2 Outline
- Motivation
- Task Description
- Bengali Stemming Approach
- Hindi Stemming Approach
- Results
- Conclusions and Future Work

3 Motivation
- Some languages have complex inflectional and derivational morphology, i.e. the same base form can correspond to multiple surface word forms
  - Example: company, companies → company; hopeful → hope
- For information retrieval, indexing surface forms would lead to many mismatches between query terms and index terms extracted from documents
- Index base forms/stems: reduce different surface forms to the same index form (stem, lemma) to increase the chance of matching query terms with document terms

4 Task Description
- Morpheme Extraction Task: investigate the effect of morphological analysis/lemmatization/stemming on information retrieval (IR) performance (for Indian languages)
- Subtasks:
  - Subtask 1: manual evaluation of morpheme extraction
  - Subtask 2: IR evaluation using the proposed morpheme representation as index terms. The evaluation metric is mean average precision (MAP)
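A minimal sketch of how MAP, the Subtask 2 metric, is usually computed, assuming the standard definition (precision averaged over the ranks of the relevant documents, then averaged over queries). The function names and the toy document IDs are illustrative and do not come from the task's evaluation tooling.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list (document IDs assumed unique)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision values."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Two hypothetical queries with made-up rankings and relevance judgements.
    toy_runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["a", "b"], {"b"}),
    ]
    print(mean_average_precision(toy_runs))
```

A stemmer affects this score indirectly: it changes which index terms a query matches, and therefore which documents appear at which rank.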

5 Stemming Approaches
- Light vs. aggressive stemming
- Rule-based vs. corpus-based stemming
  - manually created rules vs. clusters of related words
- Iteratively remove word suffixes
- Problems:
  - overstemming, i.e. the removed suffix is too long, e.g. international/intern; news/new
  - understemming, i.e. the removed suffix is too short, e.g. forgetfulness/forgetful
  - irregular forms, e.g. feet/foot; women/woman
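The trade-off on this slide can be reproduced with a toy English suffix stripper. This is only an illustrative sketch: the suffix list, the minimum stem length, and the function name are assumptions chosen to trigger the slide's examples, not a real stemmer.

```python
# Toy suffix list chosen only to reproduce the slide's examples (assumption).
TOY_SUFFIXES = ("ational", "ness", "ful", "s")


def strip_suffixes(word: str, iterative: bool, min_stem: int = 3) -> str:
    """Remove the longest matching suffix; repeat only if `iterative` is True."""
    changed = True
    while changed:
        changed = False
        for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = iterative  # a single pass acts as a "light" stemmer
                break
    return word


if __name__ == "__main__":
    for w in ("international", "news", "forgetfulness", "feet"):
        light = strip_suffixes(w, iterative=False)
        aggressive = strip_suffixes(w, iterative=True)
        print(w, light, aggressive)
```

In this sketch the single pass leaves "forgetfulness" at "forgetful" (understemming), both variants cut "international" to "intern" and "news" to "new" (overstemming), and neither touches the irregular form "feet".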

6 Our Bengali Stemming Approach
- Rule-based stemmer created by a native speaker
- Focus on nouns (most important for IR)
- Four categories [Bhattacharya et al. 2005]:
  - Title markers added as suffixes to proper nouns, e.g. “দেবী” (Mrs.), “বাবু” (sir)
  - Classifiers for plurality and specificity/gender of a noun, e.g. ছবিগুলো (pictures), ছবিটা (the picture), ছাত্রী (female student)
  - Case markers for possessive or accusative relations, e.g. পরিবারের (family’s)
  - Emphasizers to emphasize the current word, e.g. ছবিই (only a picture), ছবিটাই (only this picture)

7 Bengali Stemmer
- Drop emphasizers (iteratively), e.g. আধিক্যই → আধিক্য
- Drop classifiers and case markers, e.g. মন্ত্রীরাও → মন্ত্রী, ভারতের → ভারত
- Drop title markers, e.g. মমতাদেবী → মমতা
- Drop plural suffixes, e.g. ভারতীয়দের → ভারতীয়
- Drop derivational suffixes, e.g. স্থিতীশীল → স্থিতী
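The steps on this slide can be sketched in Python as follows. This is not the authors' implementation (which was written in C): the suffix lists contain only suffixes that can be read off the examples on slides 6 and 7, the minimum stem length is an assumption, and the sketch tries the plural suffixes before the case marker so that দের is not cut to ের first, which differs from the order in which the slide lists the steps.

```python
MIN_STEM_LEN = 2  # assumption: never strip a word down to fewer than 2 code points

# Suffixes taken from the slide examples only; the real rule set is larger.
EMPHASIZERS = ("ই", "ও")
PLURALS = ("গুলো", "দের", "রা")
CLASSIFIERS_AND_CASE = ("টা", "ের")
TITLES = ("দেবী", "বাবু")
DERIVATIONAL = ("শীল",)


def strip_one(word: str, suffixes: tuple) -> str:
    """Strip the longest matching suffix once, if enough of the word remains."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def stem_bengali(word: str) -> str:
    # Step 1: drop emphasizers repeatedly until none is left.
    while True:
        shorter = strip_one(word, EMPHASIZERS)
        if shorter == word:
            break
        word = shorter
    # Remaining steps: one pass per suffix group (order is an assumption, see above).
    for suffixes in (PLURALS, CLASSIFIERS_AND_CASE, TITLES, DERIVATIONAL):
        word = strip_one(word, suffixes)
    return word


if __name__ == "__main__":
    for w in ("আধিক্যই", "মন্ত্রীরাও", "ভারতের", "মমতাদেবী", "ভারতীয়দের", "স্থিতীশীল"):
        print(w, "->", stem_bengali(w))
```

Matching the longest suffix first within each group is a common design choice for rule-based stemmers; whether the original stemmer resolves overlapping suffixes this way is not stated on the slides.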

8 Our Hindi Stemming Approach
- Hindi has less complex inflectional morphology, hence fewer stemming rules
- Rule-based stemmer
- Stemming rules manually created by a native Hindi speaker

9 Hindi Stemmer
- Iteratively remove Hindi vowels, matras, anusvara, and “य” (character ya) from the right of a string until the first consonant is encountered
- Drop derivational suffixes, e.g.
  - लड़कों (to boys) → लड़का (boy)
  - लड़कियों (to girls) → लड़की (girl)
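A minimal sketch of the first rule on this slide, assuming that "vowels" means the Devanagari independent vowels U+0905 to U+0914 and "matras" the dependent vowel signs U+093E to U+094C. The function name and the guard that keeps at least one character are assumptions. The second rule (derivational suffixes) is not sketched because the slide does not list the suffixes.

```python
ANUSVARA = "\u0902"  # Devanagari sign anusvara
YA = "\u092f"        # Devanagari letter ya

# Assumed character classes, expressed as Unicode code point ranges.
INDEPENDENT_VOWELS = {chr(c) for c in range(0x0905, 0x0915)}  # U+0905..U+0914
MATRAS = {chr(c) for c in range(0x093E, 0x094D)}              # U+093E..U+094C

STRIPPABLE = INDEPENDENT_VOWELS | MATRAS | {ANUSVARA, YA}


def strip_trailing_vowels(word: str) -> str:
    """Remove strippable characters from the right until another character is hit."""
    end = len(word)
    # Keep at least one character so a word is never reduced to nothing.
    while end > 1 and word[end - 1] in STRIPPABLE:
        end -= 1
    return word[:end]


if __name__ == "__main__":
    for w in ("लड़कों", "लड़कियों"):
        print(w, "->", strip_trailing_vowels(w))
```

Applied on its own, this rule reduces both example words to the same string ending in the consonant क, so the two surface forms would share one index term even though that string is not the dictionary form shown on the slide.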

10 MET Experiments
- Experiments for Bengali and Hindi
- Stemmers implemented in C
- Submission as source code
- Stemmed forms are used for retrieval with Terrier

11 Results
Team        Language   MAP      Change vs. baseline
Baseline    Bengali    0.2740
JU          Bengali    0.3307   (+20.69%)
DCU         Bengali    0.3300   (+20.44%)
IIT-KGP     Bengali    0.3225   (+17.70%)
CVPR-Team   Bengali    0.3159   (+15.29%)
ISM         Bengali    0.3103   (+13.25%)
Baseline    Hindi      0.2821
DCU         Hindi      0.2963   (+5.03%)
ISM         Hindi      0.2793   (-0.99%)
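The percentages in parentheses are consistent with the relative change in MAP against the baseline for the same language. A short sketch that recomputes them from the MAP values on this slide (the function and variable names are illustrative):

```python
def relative_change_percent(score: float, baseline: float) -> float:
    """Relative change of `score` against `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


# MAP values copied from the results slide.
BASELINES = {"Bengali": 0.2740, "Hindi": 0.2821}
RUNS = [
    ("JU", "Bengali", 0.3307),
    ("DCU", "Bengali", 0.3300),
    ("IIT-KGP", "Bengali", 0.3225),
    ("CVPR-Team", "Bengali", 0.3159),
    ("ISM", "Bengali", 0.3103),
    ("DCU", "Hindi", 0.2963),
    ("ISM", "Hindi", 0.2793),
]

if __name__ == "__main__":
    for team, language, score in RUNS:
        change = relative_change_percent(score, BASELINES[language])
        print(f"{team:10s} {language:8s} {score:.4f} ({change:+.2f}%)")
```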

12 Conclusions
- Bengali stemmer: 2nd best performance
- Hindi stemmer: best performance
- Both have also been used successfully in previous ad-hoc IR experiments for FIRE

13 Future work
- Explore use of exclusion lists for irregular cases
- Extend rule set (e.g. handle verbs)
- Compare to other stemmers for Bengali/Hindi, e.g. Indian language support in version 4 of Lucene; stemmers from Jacques Savoy’s web page on cross-language IR
- Investigate morphology of named entities

14 Thank+s for your attention Any question+s ?

