University of Tehran FuFaIR: a Fuzzy Farsi Information Retrieval System Amir Nayyeri School of Electrical and Computer Engineering University of Tehran.


1 FuFaIR: a Fuzzy Farsi Information Retrieval System
Amir Nayyeri, School of Electrical and Computer Engineering, University of Tehran
Farhad Oroumchian, University of Wollongong in Dubai

2 AICCSA06 2 Overview
- Persian Language
- Related Work
- Fuzzy IR
- Farsi IR
- FuFaIR Explanation
- Experimental Results
- Conclusion and Future Work

3 Persian Language
- Spoken in several countries (Iran, Afghanistan, Tajikistan, …)
- The language has evolved over the years and has been influenced by many languages
- Contains foreign words from many languages, such as Arabic, Turkish, French, and English
- In some cases these words still follow the grammatical rules of their original language, for example:
  "Maktab" مكتب (singular) → "MAKATEB" مكاتب (plural)
- In some cases these words can use the grammatical rules of both languages, e.g.:
  "Khabar" خبر (singular) → "AKHBAR" اخبار (Arabic plural) or "KHABAR-HA" خبرها (Persian plural)
- Morphological analyzers for this language need to deal with many forms of words

4 Information Retrieval and Natural Language Processing for Persian (Farsi)
- The Faculty of Engineering of the University of Tehran started working on the processing of Persian about 7 years ago.
- For the past 3 years, it has been a joint cooperation between UT and UOWD.
- Since then, several thousand experiments on processing and retrieval of Persian text have been performed.

5 Test Collections
1. Qavanin Collection
   - Documents: Iranian Law Collection
   - 177089 passages
   - 41 queries and relevance judgments
2. Hamshahri Collection
   - Documents: 300 MB of news from the Hamshahri newspaper
3. Part of Speech Tagging Collection
   - A tag set of 40 tags
   - 2590000+ tagged words

6 Natural Language Processing
- Investigating automatic part of speech tagging based on machine learning approaches:
  - Probabilistic (Hidden Markov Model)
  - Rule based
  - Entropy based
  - Neural Networks
- The best so far has reached 96% accuracy.

7 Information Retrieval Experiments
All major retrieval models of English text retrieval, and their combinations, have been tested:
- Fuzzy Logic: MMM, Paice
- Vector Space
- Probabilistic: BM25
- N-Grams: N=2, N=3, N=4
- Combinational
With many different term weighting schemes.

8 List of weights that produced the best results
Name: Weighting
tf.idf: tf*log(N/n) / (sqrt(sum(tf^2)) * sqrt(sum(qtf^2)))
lnc.ltc: (1+log(tf))*(1+log(qtf))*log((1+N)/n) / (sqrt(sum(tf^2)) * sqrt(sum(qtf^2)))
nxx.bpx: (0.5+0.5*tf/max tf) + log((N-n)/n)
tfc.nfc: tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / (sqrt(sum(tf^2)) * sqrt(sum(qtf^2)))
tfc.nfx1: tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / sqrt(sum((tf*log(N/n))^2))
tfc.nfx2: tf*log(N/n)*(0.5+0.5*qtf/max qtf)*log(N/n) / sqrt(sum(tf^2))
Lnu.ltu: ((1+log(tf))*(1+log(qtf))*log((1+N)/n)) / ((1+log(average tf)) * ((1-s) + s*N.U.W/average N.U.W)^2)
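As a rough illustration of how one row of this table turns into a score, the tf.idf row can be sketched in Python as follows. The function name and the dictionary inputs are hypothetical. Summing the per-term weights over the query terms, and reading the denominator as the cosine norms of the raw tf and qtf vectors, are assumptions, since the transcript does not preserve the normalisation symbols.

```python
import math

def tfidf_score(doc_tf, query_tf, doc_freq, n_docs):
    """Score one document against one query with the tf.idf row.

    doc_tf / query_tf map terms to raw frequencies (tf, qtf),
    doc_freq maps terms to n (documents containing the term),
    n_docs is N (collection size). All names are illustrative.
    """
    # Assumed cosine normalisation: sqrt of the sum of squared frequencies.
    doc_norm = math.sqrt(sum(tf * tf for tf in doc_tf.values()))
    query_norm = math.sqrt(sum(qtf * qtf for qtf in query_tf.values()))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    total = 0.0
    for term in query_tf:
        tf = doc_tf.get(term, 0)
        if tf > 0:
            # Per-term weight from the table: tf * log(N / n)
            total += tf * math.log(n_docs / doc_freq[term])
    return total / (doc_norm * query_norm)
```

The other rows differ only in the per-term weight and in the normalisation factor, so they could be swapped in by changing those two expressions.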

9
No.  System
Fuzzy Logic
1  paice-tf.idf
2  paice-lnc.ltc
3  paice-Lnu.ltu
4  paice-nxx.bpx
5  paice-tfc.nfx1
6  paice-tfc.nfc
11  mmm-tf.idf
12  mmm-lnc.ltc
13  mmm-Lnu.ltu
14  mmm-nxx.bpx
15  mmm-tfc.nfx1
16  mmm-tfc.nfc
Probabilistic
7  BM25
8  2gram-BM25
9  3gram-BM25
10  4gram-BM25
Vector Space
17  vector-Lnu.ltu
18  vector-tfc.nfx2
19  vector-lnc.ltc
20  2gram-Lnu.ltu
21  2gram-tfc.nfx
22  2gram-lnc.ltc
23  3gram-Lnu.ltu
24  3gram-tfc.nfx
25  3gram-lnc.ltc
26  4gram-Lnu.ltu
27  4gram-tfc.nfx
28  4gram-lnc.ltc

10 The context of the current work
- Improving the quality of Persian retrieval
- Improving IR systems that used Fuzzy Logic as their retrieval model

11 Related Work – Fuzzy IR
- Fuzzy logic has been used in IR from the early days.
- But only a few fuzzy systems have shown superiority over classical approaches such as vector space.
- This has also been confirmed for the Persian language.
- The current work has been mostly inspired by one of them:
  D.E. Losada, F.D. Hermida, A. Bugarin, S. Barro. Experiments on using fuzzy quantified sentences in adhoc retrieval. ACM Symposium on Applied Computing, 2004.

12 Mixed Min & Max – MMM
Calculates the degree of membership of a document to the fuzzy set of the terms in the query, as below.
OR query: (قيموميت يا حضانت) → (Guardian OR God Parent)
Q_or = (A_1 OR A_2 OR A_3 OR …)
SIM(Q_or, D) = C_or1 * max(d_A1, d_A2, …) + C_or2 * min(d_A1, d_A2, …)
AND query: (املاك و ثبت) → (Registration AND Properties)
Q_and = (A_1 AND A_2 AND A_3 AND …)
SIM(Q_and, D) = C_and1 * min(d_A1, d_A2, …) + C_and2 * max(d_A1, d_A2, …)
C_and, C_or: softness coefficients
C_and1 in [0.5, 0.8], C_and2 = 1 - C_and1
C_or1 > 0.2, C_or2 = 1 - C_or1
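The two MMM formulas can be sketched in Python as below. The function names are hypothetical, and the default coefficient of 0.7 is only an illustrative value chosen inside the ranges given on the slide, not a setting reported by the authors.

```python
def mmm_or(memberships, c_or1=0.7):
    """MMM similarity for an OR query.

    memberships: the degrees d_A1, d_A2, ... of one document in each query term.
    c_or1: softness coefficient (slide: C_or1 > 0.2); 0.7 is an assumed default.
    """
    c_or2 = 1.0 - c_or1
    return c_or1 * max(memberships) + c_or2 * min(memberships)

def mmm_and(memberships, c_and1=0.7):
    """MMM similarity for an AND query.

    c_and1: softness coefficient (slide: in [0.5, 0.8]); 0.7 is an assumed default.
    """
    c_and2 = 1.0 - c_and1
    return c_and1 * min(memberships) + c_and2 * max(memberships)
```

For a document with term memberships 0.2 and 0.8, the OR score leans toward the maximum and the AND score toward the minimum, which is the softening of strict max/min that the coefficients are meant to provide.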

13 Paice Model
Calculates the degree of membership of a document to the fuzzy set of terms in the query, as below.
AND query: (املاك و ثبت) → (Registration AND Properties)
Q_and = (A_1 and A_2 and A_3 and …)
OR query: (قيموميت يا حضانت) → (Guardian OR God Parent)
Q_or = (A_1 or A_2 or A_3 or …)
SIM(Q, D) = sum_i(r^(i-1) * td_i) / sum_i(r^(i-1))
r = 1.0 for AND queries (td_i in ascending order)
r = 0.7 for OR queries (td_i in descending order)
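A minimal Python sketch of the Paice similarity follows, using the r values and sort orders stated on the slide. The function names are hypothetical, and raising an error on an empty term list is an added assumption.

```python
def paice_similarity(memberships, r, descending):
    """Weighted average of term memberships td_i with weights r**(i-1)."""
    if not memberships:
        raise ValueError("at least one term membership is required")
    ordered = sorted(memberships, reverse=descending)
    # enumerate() starts at 0, which matches the exponent i-1 for i = 1, 2, ...
    weights = [r ** i for i in range(len(ordered))]
    weighted_sum = sum(w * td for w, td in zip(weights, ordered))
    return weighted_sum / sum(weights)

def paice_and(memberships, r=1.0):
    """AND query: memberships in ascending order, r = 1.0 as on the slide."""
    return paice_similarity(memberships, r, descending=False)

def paice_or(memberships, r=0.7):
    """OR query: memberships in descending order, r = 0.7 as on the slide."""
    return paice_similarity(memberships, r, descending=True)
```

With r = 1.0 every weight equals 1, so the AND score reduces to the plain mean of the memberships; with r = 0.7 the largest memberships dominate the OR score.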

14 Comparison of Fuzzy Systems: Experiments on Qavanin Collection

15 Probabilistic Systems (BM25): Experiments on Qavanin Collection

16 Comparison of Vector Space Systems With BM25: Experiments on Qavanin Collection

17 Comparison of Best Vector Space With Best N-grams: Experiments on Qavanin Collection

18 FuFaIR
- The query is considered as a fuzzy set of relevant documents in the database
- The documents are sent to the client sorted by their degree of membership in the query's fuzzy set
- The larger the value of µ_i, the more relevant the document is to the query

19 FuFaIR (Cont.)
- Each term is assigned a membership degree to a document, based on the importance of that term for representing the document's content.
- The membership degree can be computed with classical IR parameters such as tf/idf
- The input query is considered as an algebraic sentence whose elements are:
  - Terms
  - Fuzzy operators such as AND, OR, and NOT
- Applying the operators to the terms yields the final fuzzy set
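The idea of treating a query as an algebraic sentence over terms and fuzzy operators can be sketched as follows. This is not the FuFaIR implementation: the nested-tuple query representation and the function names are hypothetical, and the operators use the standard min, max, and 1 - x definitions, which is an assumption about how AND, OR, and NOT are interpreted.

```python
def evaluate(query, term_membership):
    """Membership degree of one document in the fuzzy set defined by a query.

    query: a term (str) or a tuple ("AND" | "OR" | "NOT", sub-query, ...).
    term_membership: maps each term to the document's membership degree.
    """
    if isinstance(query, str):
        # A term the document does not contain has membership 0.
        return term_membership.get(query, 0.0)
    op, *args = query
    values = [evaluate(arg, term_membership) for arg in args]
    if op == "AND":
        return min(values)
    if op == "OR":
        return max(values)
    if op == "NOT":
        return 1.0 - values[0]
    raise ValueError(f"unknown operator: {op}")

def rank(query, docs):
    """Return (doc_id, degree) pairs sorted by decreasing membership degree."""
    scored = [(doc_id, evaluate(query, m)) for doc_id, m in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, `rank(("AND", "iran", "usa"), docs)` orders documents by the smaller of their two term memberships, so a document has to be strongly about both terms to rank highly.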

20 FuFaIR (Cont.)
The membership degree of a document to an individual term is defined as follows in our method:
[formula not preserved in the transcript]
f_t,d = frequency of term t in document d
idf(t) = inverse document frequency of term t

21 Overview
- Persian Language
- Related Work
- Fuzzy IR
- Farsi IR
- Fuzzy Logic Overview
- FuFaIR Explanation
- Experimental Results
- Conclusion and Future Work

22 Experimental Results
Parameters:
- The Hamshahri corpus has been used
- Total size of the collection: 300+ MB
- Indexing has been performed after stop word elimination
- No stemming has been applied
- 30 queries have been used for these experiments
- Precision has been computed for the top 20 retrieved documents.

23 Experimental Results (Cont.)
Some sample queries:
- The Bidel music group concert: کنسرت موسيقي گروه بيدل
- Iran AND USA relations: روابط ايران و امريکا
- Economic benefit of Iran's agriculture: سود اقتصادي کشاورزي ايران
- The punishment of doping in swimming: مجازات دوپينگ در شنا
- Cancer treatment methods: روشهاي درمان سرطان
- Classic music in Iran: موسيقي کلاسيک در ايران

24 Experimental Results (Cont.)
- As a benchmark, the best Persian retrieval model so far has been selected: the Vector Space model with the Lnu-ltu weighting scheme.
- The pivot and slope parameters have been set to 13.36 and 0.75, respectively
- The effectiveness of these values had been shown by previous work (see paper).
- To calculate the performance of each run, the precision at the 5, 10, 15 and 20 document cut-offs has been calculated and averaged over all 30 queries.
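The evaluation measure described here, precision at fixed document cut-offs averaged over all queries, can be sketched in Python as below. The function names and the input layout (one ranked list and one set of relevant document ids per query) are assumptions made for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not by len(top_k)) penalises runs that return fewer than k documents.
    return hits / k

def mean_precision_at_cutoffs(runs, cutoffs=(5, 10, 15, 20)):
    """Average precision at each cut-off over all queries.

    runs: list of (ranked_ids, relevant_ids) pairs, one pair per query.
    Returns a dict mapping each cut-off to its mean precision.
    """
    return {
        k: sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
        for k in cutoffs
    }
```

Calling `mean_precision_at_cutoffs` once per system on the same 30 queries gives one row of figures per system, which is the form in which two retrieval models can be compared.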

25 Experimental Results (Cont.)
Comparison Results:

26 Conclusion & Future Work
Conclusion
- Main contribution of this paper:
  - Design, implementation and testing of FuFaIR, a fuzzy retrieval system for the Persian language.
  - Fuzzy quantifiers are also added to the original model to provide more flexibility
- In comparison with Vector Space, FuFaIR shows significantly better performance
Future Work
- Testing different interpretations of the fuzzy operators on the Persian corpora
- Examining the true value and contribution of a Persian stemmer in retrieval.

27 Questions?

28 Conception of Fuzzy Logic
- Many decision-making and problem-solving tasks are too complex to be defined precisely
- However, people succeed by using imprecise knowledge
- Fuzzy logic resembles human reasoning in its use of approximate information and uncertainty to generate decisions.

29 Natural Language
- Consider:
  - Joe is tall -- what is tall?
  - Joe is very tall -- how does this differ from tall?
- Natural language (like most other activities in life and indeed the universe) is not easily translated into the absolute terms of 0 ("false") and 1 ("true").

30 Fuzzy Logic
- An approach to uncertainty that combines real values [0…1] and logic operations
- Fuzzy logic is based on the ideas of fuzzy set theory and fuzzy set membership often found in natural (e.g., spoken) language.

31 Example: "Young"
- Example:
  - Ann is 28, 0.8 in set "Young"
  - Bob is 35, 0.1 in set "Young"
  - Charlie is 23, 1.0 in set "Young"
- Unlike statistics and probabilities, the degree does not describe the probability that the item is in the set, but instead describes to what extent the item is in the set.

32 Membership function of fuzzy logic
[Figure: degree of membership (DOM, from 0 to 1, with 0.5 marked) plotted against age, showing the fuzzy sets Young, Middle, and Old, with age ticks at 25, 40, and 55]
Fuzzy values have associated degrees of membership in the set.
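Membership functions of this kind can be sketched in Python as piecewise-linear functions of age. The exact shapes are not preserved in the transcript, so the breakpoints below (full membership in Young up to 25, Middle peaking at 40, full membership in Old from 55) are assumptions based only on the age ticks of the figure, and the function names are hypothetical.

```python
def young(age):
    """Assumed shape: 1 up to age 25, falling linearly to 0 at age 40."""
    if age <= 25:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 15

def middle(age):
    """Assumed shape: triangle rising from 25, peaking at 40, back to 0 at 55."""
    if age <= 25 or age >= 55:
        return 0.0
    if age <= 40:
        return (age - 25) / 15
    return (55 - age) / 15

def old(age):
    """Assumed shape: 0 up to age 40, rising linearly to 1 at age 55."""
    if age <= 40:
        return 0.0
    if age >= 55:
        return 1.0
    return (age - 40) / 15
```

Under these assumed shapes a single age can belong to two sets at once, for instance partly Young and partly Middle between 25 and 40, which is the gradual transition that fuzzy membership is meant to capture.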

33 Benefits of fuzzy logic
- You want the value to switch gradually as Young becomes Middle and Middle becomes Old. This is the idea of fuzzy logic.

34 Fuzzy Set Operations
- Fuzzy OR (∪): the union of two fuzzy sets is the maximum (MAX) of each element from the two sets.
- E.g.
  A = {1.0, 0.20, 0.75}
  B = {0.2, 0.45, 0.50}
  A ∪ B = {MAX(1.0, 0.2), MAX(0.20, 0.45), MAX(0.75, 0.50)} = {1.0, 0.45, 0.75}

35 Fuzzy Set Operations
- Fuzzy AND (∩): the intersection of two fuzzy sets is just the MIN of each element from the two sets.
- E.g.
  A ∩ B = {MIN(1.0, 0.2), MIN(0.20, 0.45), MIN(0.75, 0.50)} = {0.2, 0.20, 0.50}

36 Fuzzy Set Operations
- The complement of a fuzzy variable with DOM x is (1-x).
- Complement: the complement of a fuzzy set is composed of the complements of all its elements.
- Example:
  A^c = {1 – 1.0, 1 – 0.2, 1 – 0.75} = {0.0, 0.8, 0.25}
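The three fuzzy set operations can be sketched in Python as element-wise functions over lists of membership degrees, using the sets A and B from the slides' own examples. The function names are hypothetical, and representing a fuzzy set as a list aligned by element position is an assumption.

```python
def fuzzy_union(a, b):
    """Fuzzy OR: element-wise maximum of two membership lists."""
    return [max(x, y) for x, y in zip(a, b)]

def fuzzy_intersection(a, b):
    """Fuzzy AND: element-wise minimum of two membership lists."""
    return [min(x, y) for x, y in zip(a, b)]

def fuzzy_complement(a):
    """Fuzzy NOT: 1 - x for every membership degree x."""
    return [1.0 - x for x in a]

A = [1.0, 0.20, 0.75]
B = [0.2, 0.45, 0.50]

print(fuzzy_union(A, B))         # [1.0, 0.45, 0.75]
print(fuzzy_intersection(A, B))  # [0.2, 0.2, 0.5]
print(fuzzy_complement(A))       # [0.0, 0.8, 0.25]
```

The printed values agree with the worked examples on the three Fuzzy Set Operations slides.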

