
1 Mono & Cross Language Experiments on Persian Text
Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Database Research Group, School of Electrical and Computer Engineering, University of Tehran
18 Sep

2 Outline
Persian Language
Persian Test Collections
Hamshahri in CLEF 2008
UT Participants
Using Part of Speech Tagging in Persian Information Retrieval
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Local Cluster Analysis Using Part of Speech Tagging
Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
Cross Language Experiments at 2008
Next Year

3 The Persian Language
A branch of the Indo-European languages
Official language of Iran, Afghanistan and Tajikistan
Its morphological analysis is comparatively difficult
The word "خبر" has two plural forms: "خبرها" (Persian rules) and "اخبار" (Arabic rules)

4 Some Processing Issues
Writing style issues: "می شود" and "میشود" are the same word; "کتابها" and "کتاب ها" are the same word
KASRE: the short vowel kasre is usually not written, so the sentence "چراغ علی خانه را سوزاند" has two different meanings: "CheraghAli burned the house" and "Ali's lantern burned the house"
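The two writing-style cases on this slide can be conflated before indexing. The following is only a minimal sketch, not the authors' normalizer: the function name `conflate_spacing` is hypothetical, it handles just the prefix "می" and the plural suffix "ها", and it assumes the text uses the Persian yeh (U+06CC) rather than the Arabic yeh.

```python
import re

ZWNJ = "\u200c"            # zero-width non-joiner
MI = "\u0645\u06cc"        # verbal prefix "می" (assumes Persian yeh, U+06CC)
HA = "\u0647\u0627"        # plural suffix "ها"
GAP = "[ " + ZWNJ + "]+"   # one or more spaces or ZWNJ characters

# Prefix at the start of a word, followed by a gap and then another letter.
_PREFIX = re.compile(r"\b" + MI + GAP + r"(?=\w)")
# Suffix at the end of a word, preceded by a letter and a gap.
_SUFFIX = re.compile(r"(?<=\w)" + GAP + HA + r"\b")


def conflate_spacing(text: str) -> str:
    """Map spaced, ZWNJ-joined and closed spellings to the closed form."""
    text = _PREFIX.sub(MI, text)
    text = _SUFFIX.sub(HA, text)
    return text


if __name__ == "__main__":
    spaced = "\u0645\u06cc \u0634\u0648\u062f"   # "می شود"
    closed = "\u0645\u06cc\u0634\u0648\u062f"    # "میشود"
    print(conflate_spacing(spaced) == closed)
```

A real system would need a longer affix list and would have to avoid joining cases where "می" is a separate word; this sketch ignores both.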

5 Encoding

6 Persian in the Middle East
User Population Growth on the Web ( )
Source: Internet World Statistics, December 31, 2007

7 Persian Test Collections
IR domain: Ghavanin (domain specific), Hamshahri (news). WEB:
NLP domain: Bijankhan (2 million words). WEB:

8 Hamshahri in CLEF
News articles of the Hamshahri newspaper from 1996 to 2002
Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB)
22 assessors
Evaluation based on the DIRECT system

9 Hamshahri in CLEF
Collection size: 564 MB (Unicode text)
No. of documents: 166,774
No. of unique terms: 417,339
Average length of documents: 380 terms
No. of categories: 9
No. of topics: 50 (bilingual)

10 Implementation of our methods
We submitted the top 100 results for each run

11 Using Part of Speech Tagging in Persian Information Retrieval
Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian

12 Using Part of Speech Tagging in Persian Information Retrieval
Config. 1. Corpus: tagged. Query: title with equal weighting for all POS tags
Config. 2. Corpus: stemmed and tagged. Query: stemmed title with equal weighting for all POS tags
Config. 3. Corpus: stemmed. Query: stemmed title without POS tagging
Config. 4. Corpus: stemmed. Query: stemmed title plus description
Config. 5. Corpus: stemmed (stop words removed). Query: stemmed title plus description (stop words removed)
Config. 6. Corpus: tagged. Query: title plus description with equal weighting for all POS tags
Config. 7. Corpus: tagged. Query: title with various weighting schemes for different POS tags
Config. 8. Corpus: normal. Query: title (neither stemmed nor tagged)

13 Using Part of Speech Tagging in Persian Information Retrieval
Average precision and R-Precision were reported for each of these POS weighting schemes:
20 less used tags omitted, others equal weight
Noun=3, Verb=2, Adj=1, Adv=1
Noun=3, Verb=0, Adj=3, Adv=0
Noun=0, Verb=2, Adj=0, Adv=0
Noun=0, Verb=0, Adj=1, Adv=0
Noun=0, Verb=0, Adj=0, Adv=1
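A scheme such as Noun=3, Verb=2, Adj=1, Adv=1 can be read as a per-tag multiplier on query terms. The sketch below is an illustration under assumptions, not the authors' code: the tag labels, the function name `weight_query`, and the choice to drop terms whose tag has weight 0 or is not listed are all hypothetical.

```python
from collections import defaultdict

# One of the schemes listed on the slide; the tag labels are assumed.
SCHEME = {"Noun": 3, "Verb": 2, "Adj": 1, "Adv": 1}


def weight_query(tagged_terms, scheme):
    """Turn (term, pos_tag) pairs into a term -> weight mapping.

    Terms whose tag is missing from the scheme, or has weight 0,
    are left out of the query (an assumption of this sketch).
    """
    weights = defaultdict(float)
    for term, tag in tagged_terms:
        w = scheme.get(tag, 0)
        if w > 0:
            weights[term] += w  # repeated terms accumulate weight
    return dict(weights)


if __name__ == "__main__":
    # Hypothetical tagged title, written in English for readability.
    title = [("lantern", "Noun"), ("burned", "Verb"), ("the", "Det")]
    print(weight_query(title, SCHEME))
```

The resulting weights would then be passed to whatever weighted-query syntax the retrieval engine supports.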

14 Using Part of Speech Tagging in Persian Information Retrieval

15 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Weighting models and their descriptions (Terrier Open Source Retrieval Engine: ir.dcs.gla.ac.uk/terrier/):
BB2: Bose-Einstein model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
BM25: The BM25 probabilistic model
DFR_BM25: The DFR version of BM25
IFB2: Inverse Term Frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expB2: Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expC2: Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
InL2: Inverse document frequency model for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
PL2: Poisson estimation for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
TF_IDF: The tf*idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones' idf

16 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Average precision and R-Precision were reported for each weighting model: BB2, BM25, DFR_BM25, IFB2, In_expB2, In_expC2, InL2, PL2, TF_IDF

17 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
And two other variations of this operator: IOWA and NOWA
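The definition of the operator on this slide did not survive in the transcript. Assuming it is the standard ordered weighted averaging (OWA) operator, which the names IOWA and NOWA suggest, a minimal sketch of fusing one document's scores from several retrieval models could look like this. The function name, the example scores and the weight vectors are hypothetical, and the IOWA and NOWA variants are not shown.

```python
def owa(scores, weights):
    """Ordered weighted average: weights apply to ranks, not to sources.

    The scores are sorted in descending order, then the i-th weight is
    applied to the i-th largest score. Weights must sum to 1.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))


if __name__ == "__main__":
    # Hypothetical normalized scores for one document from three models.
    doc_scores = [0.2, 0.9, 0.5]
    print(owa(doc_scores, [0.5, 0.3, 0.2]))  # emphasizes the best score
    print(owa(doc_scores, [1.0, 0.0, 0.0]))  # reduces to the maximum
```

Because the weights attach to positions in the sorted list, the same operator covers the maximum, the minimum and the plain mean as special cases of the weight vector.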

18 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track

19 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track: Post hoc Results
Columns: Retrieval Method, Toolkit, Average Precision, R-Precision, Dif
TF_IDF with unstemmed single terms (Terrier)
PL2 with 4gram terms (Terrier)
Indri with stemmed terms (Lemur)
IOWA
NOWA

20 Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri

21 Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
But the result was not good on the test set

22 Cross Language Experiments at 2008
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
Columns: Run, tot-ret, rel-ret, MAP, Retrieval Model, Tool
Using Light Stemmer: Vector Space, Lucene
Without Stemmer: Vector Space, Lucene
3Grams: Language Modeling, Lemur
4Grams: Language Modeling, Lemur
5Grams: Language Modeling, Lemur
Term-Based: Language Modeling, Lemur
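The 3Grams, 4Grams and 5Grams runs index character n-grams instead of whole terms. The slide does not say how the n-grams were cut, so the sketch below assumes overlapping n-grams taken inside each whitespace-separated token, with shorter tokens kept whole. The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n):
    """Split each whitespace-separated token into overlapping n-grams.

    Tokens with n or fewer characters are kept as a single unit
    (an assumption of this sketch; other conventions pad or drop them).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    grams = []
    for token in text.split():
        if len(token) <= n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


if __name__ == "__main__":
    print(char_ngrams("persian text", 4))
```

The same function works on Persian strings, since Python slices by Unicode code point; the queries would need to be split the same way as the documents.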

23 Cross Language Experiments at 2008: Query Translation
Probabilistic Structured Queries (PSQ)
Combinatorial Translation Probability (CTP)
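The slide only names the two query translation methods. As commonly described, Probabilistic Structured Queries estimate the term frequency and document frequency of a source-language query term as translation-probability-weighted sums over its candidate translations. The sketch below follows that general description and is not taken from the authors' system. The function names, the placeholder terms and the probabilities are all hypothetical, and CTP is not covered.

```python
def psq_tf(translations, doc_tf):
    """Estimated term frequency of a query term in one document.

    translations: {document-language term: p(term | query term)}
    doc_tf: {document-language term: count in the document}
    """
    return sum(p * doc_tf.get(term, 0) for term, p in translations.items())


def psq_df(translations, collection_df):
    """Estimated document frequency of a query term in the collection."""
    return sum(p * collection_df.get(term, 0) for term, p in translations.items())


if __name__ == "__main__":
    # Hypothetical translation table for one query term; "t1".."t3"
    # stand in for document-language terms.
    translations = {"t1": 0.6, "t2": 0.3, "t3": 0.1}
    doc_tf = {"t1": 2, "t3": 5}
    collection_df = {"t1": 100, "t2": 50, "t3": 10}
    print(psq_tf(translations, doc_tf))
    print(psq_df(translations, collection_df))
```

The estimated tf and df values can then be fed into any standard ranking formula in place of the counts of a single monolingual term.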

24 Cross Language Experiments at 2008: Query Translation Results

25 Cross Language Experiments at 2008: Document Translation
Using the Shiraz machine translation system from CRL of NMSU
Took 10 days to translate 130,000+ docs from Persian to English

26 Cross Language Experiments at 2008: Document Translation & Hybrid Results

27 Next Year: Ham2 for the Next Year
Extended version of the Hamshahri collection, 2 times larger (~1.5 GB)
Sample document: HAM /1385/851011/news/_adabh.htm دوشنبه 11 دي سال چهاردهم - شماره Jan 1, ادب و هنر Literature and Art <![CDATA[ مديركل كتاب و كتابخواني وزارت فرهنگ و ارشاد اسلامي خبر داد آيين نامه خريد كتاب اصلاح شد ]]> /1385/851011/news/ jpg

28 Questions?
Thanks for your attention

