MET-2013 Amit Jain Nitish Gupta Sukomal Pal Indian School of Mines, Dhanbad.


1 ISM@FIRE MET-2013
Amit Jain, Nitish Gupta, Sukomal Pal
Indian School of Mines, Dhanbad

2 Contents
- Introduction to Morpheme
- ISMStemmer
- Result of MET at FIRE-2013
- Problems in ISMStemmer
- Conclusion

3 Morpheme
In linguistics, a morpheme is the smallest grammatical unit in a language. Every word comprises one or more morphemes. Morphological analysis is the process of segmenting a word into its component morphemes.
e.g. "Unbreakable" comprises three morphemes:
- un- (a morpheme signifying "not")
- -break- (the stem, a free morpheme)
- -able (a morpheme signifying "can be done")

4 Stemmer
Attempts to reduce word variants to their stem or root form.
Example: education, educating, educative will all reduce to educat.
Reasons:
- search engines are based on string matching
- similarity of a document with respect to a query is mostly determined by exact term overlap
- vocabulary mismatch arises because natural-language documents use different forms of a word for the same content

5 Why stemming? (contd…)
Example: suppose we have to search for some information about “education”.
Query: education
Documents (doc 1 to doc 4):
- For children education is very important
- What is the reason we educate children
- Educating young minds is the job of a teacher
- Government aims to make people educated

6 Why stemming? (contd…)
Query: education
Documents (doc 1 to doc 4):
- For children education is very important
- Government aims to make people educated
- What is the reason we educate children
- Educating young minds is the job of a teacher
By stemming:
Original words: education, educate
Stemmed word: educat
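The effect shown on these two slides can be sketched in a few lines of Python. This is only an illustration: `toy_stem` is a hypothetical stand-in that handles just the "educat" family and is not ISMStemmer, and the tokenizer and matching functions are assumptions made for the example.

```python
import re

# The four example documents from the slides.
DOCS = [
    "For children education is very important",
    "What is the reason we educate children",
    "Educating young minds is the job of a teacher",
    "Government aims to make people educated",
]


def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Hypothetical stand-in stemmer: maps every 'educat*' word to 'educat'."""
    word = word.lower()
    return "educat" if word.startswith("educat") else word


def exact_matches(query, docs):
    """Documents that contain the query term exactly as typed."""
    return [d for d in docs if query.lower() in tokens(d)]


def stemmed_matches(query, docs):
    """Documents that share a stem with the query term."""
    q = toy_stem(query)
    return [d for d in docs if q in {toy_stem(t) for t in tokens(d)}]


print(len(exact_matches("education", DOCS)), "document(s) match exactly")
print(len(stemmed_matches("education", DOCS)), "document(s) match after stemming")
```

With exact term matching only the document containing the literal word "education" is retrieved; once query and documents are both reduced to "educat", all four documents match.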

7 ISMStemmer
Approaches for stemming:
- Language-based approach
- Statistical approach
ISMStemmer is statistical:
- Based on suffix extraction
- Suffixes identified by applying the Apriori algorithm (Agrawal and Srikant, 1994)

8 ISMStemmer algorithm
Steps:
- Single-column refined file
- Generate valid suffixes (Apriori algorithm)
- Strip off valid suffixes to get stems
Input words: aborning, absolution, absorption, abuilding, acquisition, activation, added, addition, admiration, admitted, admitting, agreed, agreeing, allotted, allotting, ambling, angling
Resulting stems: aborn, absolu, absorp, abuild, acquisi, activa, add, admira, admitt, agre, agree, allott, ambl, angl
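A minimal sketch of the final stripping step, assuming the valid suffixes are already known (the list below is the one shown on the Suffix Generation slide). Trying longer suffixes first and the `min_stem_len` guard are assumptions made for this illustration; the slides do not say how ISMStemmer resolves competing suffixes.

```python
# Valid suffixes as listed on the Suffix Generation slide.
VALID_SUFFIXES = ["ing", "ed", "tion", "er", "ment"]


def strip_suffix(word, suffixes=VALID_SUFFIXES, min_stem_len=2):
    """Remove the longest matching suffix, keeping at least min_stem_len characters.

    Longest-first matching and min_stem_len are illustrative assumptions.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no valid suffix found: the word is its own stem


for w in ["aborning", "absolution", "added", "agreed", "agreeing"]:
    print(w, "->", strip_suffix(w))
```

The same function also reproduces the named-entity problem discussed later in the talk: "Beijing" ends in "ing", so it is reduced to "Beij".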

9 Suffix Generation
Input is the single-column, sorted, refined file.
- Reverse each word in the unique sorted word file.
- Generate frequent suffixes (of length 1 character, 2 characters, and so on).
- Find valid suffixes, i.e. those whose frequency is above a pre-decided threshold value α.
Input words: aborning, absolution, absorption, abuilding, acquisition, activation, added, addition, admiration, admitted, admitting, agreed, agreeing, allotted, allotting, ambling, angling
Reversed and sorted: dedda, dettolla, …, noitidda, noitulosba, …, gnidliuba, gnieerga, gnilgna, …
Valid suffixes: ing, ed, tion, er, ment
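The suffix-generation steps can be sketched as a level-wise, Apriori-style count over the reversed words. This is an assumption-laden illustration rather than the authors' implementation: the function name, the `max_len` cap, and the requirement that a non-empty stem remains are all hypothetical choices, and the slides do not describe how the final valid suffixes are picked from the frequent ones.

```python
from collections import Counter

# The example words from the slide.
WORDS = [
    "aborning", "absolution", "absorption", "abuilding", "acquisition",
    "activation", "added", "addition", "admiration", "admitted", "admitting",
    "agreed", "agreeing", "allotted", "allotting", "ambling", "angling",
]


def frequent_suffixes(words, alpha, max_len=5):
    """Return {suffix: frequency} for suffixes occurring more than alpha times."""
    # Reversing each word turns suffixes into prefixes; sorting groups them.
    reversed_words = sorted({w[::-1] for w in words})
    frequent = {}
    survivors = set()  # frequent reversed suffixes from the previous level
    for k in range(1, max_len + 1):
        # Apriori-style pruning: a suffix of length k is counted only if its
        # last k-1 characters were already frequent at the previous level.
        counts = Counter(
            rw[:k]
            for rw in reversed_words
            if len(rw) > k and (k == 1 or rw[: k - 1] in survivors)
        )
        survivors = {p for p, c in counts.items() if c > alpha}
        if not survivors:
            break
        for p in survivors:
            frequent[p[::-1]] = counts[p]
    return frequent


for suffix, freq in sorted(frequent_suffixes(WORDS, alpha=3).items()):
    print(suffix, freq)
```

On this small word list with α = 3, the frequent suffixes include "ing", "ed" and "tion", along with their shorter tails such as "g" and "on". Choosing which of these count as valid suffixes is a further step that this sketch leaves open.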

10 Evaluation of ISMStemmer
To evaluate ISMStemmer, we participated in the Morpheme Extraction Task (MET) of FIRE-2013.
The submitted ISMStemmer was:
- evaluated at IR Labs: DAIICT, Gujarat
- tested on 5 languages of South Asian origin
- found to give effective results for 3 of the languages

11 MET Results (IR Evaluation)
Language   Baseline MAP   Obtained MAP   % improvement
Bengali    0.2740         0.3158         15.25%
Hindi      0.2821         0.2793         -0.99%
Gujarati   0.2677         0.2824         5.49%
Marathi    0.2320         0.2797         20.56%
Odia       0.1537         0.1583         2.99%
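The last column appears to be the relative change of the obtained MAP over the baseline MAP. A small sketch, assuming that definition (the function name is hypothetical), reproduces the reported percentages to within rounding:

```python
# (baseline MAP, obtained MAP) per language, copied from the results table.
MAP_SCORES = {
    "Bengali": (0.2740, 0.3158),
    "Hindi": (0.2821, 0.2793),
    "Gujarati": (0.2677, 0.2824),
    "Marathi": (0.2320, 0.2797),
    "Odia": (0.1537, 0.1583),
}


def percent_improvement(baseline, obtained):
    """Relative change of the obtained score over the baseline, in percent."""
    return (obtained - baseline) / baseline * 100


for language, (baseline, obtained) in MAP_SCORES.items():
    print(f"{language}: {percent_improvement(baseline, obtained):.2f}%")
```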

12 Results (Linguistic Evaluation)
Tamil:
Precision: 80.22%; non-affixes: 80.22%
Recall: 18.86%; non-affixes: 18.86%
F-measure: 30.54%; non-affixes: 30.54%
Bengali:
Precision: 60.64%; non-affixes: 60.64%
Recall: 32.15%; non-affixes: 32.15%
F-measure: 42.02%; non-affixes: 42.02%
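The reported F-measures are consistent with the usual harmonic mean of precision and recall. The sketch below assumes that standard definition; the function name is chosen for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, as reported on the slide.
print(f"Tamil:   {f_measure(80.22, 18.86):.2f}%")
print(f"Bengali: {f_measure(60.64, 32.15):.2f}%")
```

The harmonic mean is pulled toward the lower of the two values, which is why Tamil's high precision does not compensate for its low recall.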

13 Post-hoc Analysis
Overstemming:
1. accent, accentual, accentuate (correct stem: accent)
2. accept, acceptant, acceptor (correct stem: accept)
3. access, accessible, accession (correct stem: access)
Due to overstemming, all three groups are reduced to: acce
Stemming of named entities:
1. Beijing -> Beij

14 Analysis

15 Future plan
- Need to consider the prefix as well (clustering based on prefix)
- Identification of NEs (use of NERs)
- ….

16 Thank you! Questions?

