ISSPA 2007, 12 January. N-Gram and Local Context Analysis for Persian Text Retrieval. Tehran University. Abolfazl AleAhmad, Parsia Hakimian, Farzad Mahdikhani.


1 ISSPA 2007, 12 January
N-Gram and Local Context Analysis for Persian Text Retrieval
Abolfazl AleAhmad, Parsia Hakimian, Farzad Mahdikhani (School of Electrical and Computer Engineering, University of Tehran)
Farhad Oroumchian (University of Wollongong in Dubai)

2 University of Tehran - Database Research Group
Outline
- The Persian Language
- Used Methods: pivoted normalization, N-gram approach, Local Context Analysis
- The test collections
- Our experiments and the results
- Conclusion

4 The Persian Language
Persian is spoken in countries such as Iran, Tajikistan and Afghanistan. It is written in an Arabic-like script of 32 characters, written continuously from right to left. Persian morphological analyzers need to deal with many word forms that are not actually Farsi. For example, the word "کافر" (singular) becomes "کفار" (plural), and "عادت" has two plural forms in Farsi: the Farsi form "عادت ها" and the Arabic form "عادات". N-grams are one solution to this problem.

5 Our Study
We investigated the vector space model on the Persian language with three indexing approaches: unstemmed single terms, N-grams, and Local Context Analysis, using the HAMSHAHRI collection, which contains 160,000+ news articles.

7 Best Vector Space Model
Weighting schemes that produced the best results (s: slope; N.U.W: number of unique words):

Name      Weighting
tf.idf    tf*log(N/n) / (sqrt(Σ tf²) * sqrt(Σ qtf²))
lnc.ltc   (1+log(tf)) * (1+log(qtf)) * log((1+N)/n) / (sqrt(Σ tf²) * sqrt(Σ qtf²))
nxx.bpx   (0.5 + 0.5*tf/max tf) + log((N-n)/n)
tfc.nfc   tf*log(N/n) * (0.5 + 0.5*qtf/max qtf) * log(N/n) / (sqrt(Σ tf²) * sqrt(Σ qtf²))
tfc.nfx1  tf*log(N/n) * (0.5 + 0.5*qtf/max qtf) * log(N/n) / sqrt(Σ (tf*log(N/n))²)
tfc.nfx2  tf*log(N/n) * (0.5 + 0.5*qtf/max qtf) * log(N/n) / sqrt(Σ tf²)
Lnu.ltu   ((1+log(tf)) * (1+log(qtf)) * log((1+N)/n)) / ((1+log(average tf)) * ((1-s) + s*N.U.W/average N.U.W)²)
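As a rough illustration, the lnc.ltc scheme from the table can be sketched in Python. This is a minimal sketch, not the system used in the experiments, and the function names are our own:

```python
import math

def lnc_weights(term_counts):
    # Document vector under "lnc": logarithmic tf, no idf, cosine normalization.
    w = {t: 1 + math.log(tf) for t, tf in term_counts.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

def ltc_weights(term_counts, N, df):
    # Query vector under "ltc": logarithmic tf times idf = log((1+N)/n),
    # cosine-normalized; N is the collection size, df maps term -> n.
    w = {t: (1 + math.log(tf)) * math.log((1 + N) / df[t])
         for t, tf in term_counts.items() if t in df}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm else {}

def cosine_score(doc_vec, query_vec):
    # Inner product of two already cosine-normalized vectors.
    return sum(doc_vec.get(t, 0.0) * w for t, w in query_vec.items())
```

Under this scheme a document scores between 0 and 1 against a query, and document length is handled only by the cosine norm, which is exactly the weakness the following slides address.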

8 Problem with Document Length Normalization
Normalization is supposed to remove the differences between document lengths. Under cosine normalization, however, shorter documents get higher weights even though they are less likely to be relevant.
[Figure: average probability of relevance/retrieval plotted against the average median bin length]

9 Lnu.ltu Weighting Scheme
A good weighting scheme proposed by Amit Singhal et al. and tested on TREC collections, based on reducing the gap between relevance and retrieval:
Lnu = ((1+log(tf)) / (1+log(average tf))) / ((1-s) + s*N.U.W/average N.U.W)
ltu = ((1+log(tf)) * log((1+N)/n)) / ((1-s) + s*N.U.W/average N.U.W)

10 Pivoted Normalization
[Figure: probability vs. document length, showing the old normalization factor and the final normalization factor obtained by pivoting it]
Source: A. Singhal, et al., "Pivoted Document Length Normalization".

12 N-Gram Approach (Cont.)
N-grams are strings of length n. In this approach the whole text is treated as a stream of characters and broken down into substrings of length n. The method is remarkably resistant to textual errors (e.g. OCR output) and needs no linguistic knowledge.
Example: "مخابرات" for n = 4: مخاب خابر ابرا برات رات
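The sliding-window extraction described above can be sketched in a few lines of Python. This is a minimal sketch; the slide's trailing 3-character fragment "رات" suggests the original system may also have kept shorter boundary grams or padded the text, which this sketch does not do:

```python
def char_ngrams(text, n):
    # Slide a window of width n one character at a time over the raw text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# The 7-letter word from the slide yields four full 4-grams:
print(char_ngrams("مخابرات", 4))  # ['مخاب', 'خابر', 'ابرا', 'برات']
```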

14 Word Mismatch Problem
Automatic query expansion is a good solution to the word mismatch problem in IR:
Local analysis
+ Expansion based on high-ranking documents
- Needs an extra search
- Some queries may retrieve few relevant documents
Global analysis
+ Robust average performance
- Expensive in terms of disk space and CPU
- Individual queries can be significantly degraded

15 Local Context Analysis
Local Context Analysis is an automatic query expansion method that combines global analysis (use of context and phrase structure) with local feedback (top-ranked documents). LCA is fully automatic: no information is needed from the user other than the initial query.
+ Computationally practical
- Requires an extra search to retrieve the top-ranked documents

16 Local Context Analysis (Cont.)
LCA has three main steps:
1. Run the user's query, break the top N retrieved documents into passages, and rank the passages.
2. Compute the similarity of each concept in the top-ranked passages to the entire original query using a similarity function.
3. Add the top M ranked concepts to the original query and rerun the initial retrieval method with the expanded query.
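The three steps above can be sketched as follows. This is a simplified sketch, not the exact algorithm: the slide's similarity function did not survive transcription, so the concept score here is a loose, idf-weighted co-occurrence product in the spirit of Xu and Croft's LCA formula, and all names are our own:

```python
import math
from collections import Counter

def lca_expand(query_terms, top_passages, idf, m=5, delta=0.1):
    # Step 1 is assumed done: top_passages are the ranked passages
    # (each a list of tokens) from the initial retrieval run.
    n = len(top_passages)
    # Step 2: passage-level co-occurrence of each candidate concept
    # with every query term.
    co = {}
    for passage in top_passages:
        counts = Counter(passage)
        for c in counts:
            if c in query_terms:
                continue
            row = co.setdefault(c, Counter())
            for t in query_terms:
                row[t] += counts[c] * counts.get(t, 0)
    # Score each concept: every query term contributes a smoothed,
    # idf-weighted factor, so a concept co-occurring with ALL query
    # terms beats one tied to a single term.
    scores = {}
    for c, row in co.items():
        s = 1.0
        for t in query_terms:
            s *= (delta + math.log1p(row[t]) * idf.get(c, 1.0)
                  / math.log(n + 1)) ** idf.get(t, 1.0)
        scores[c] = s
    # Step 3: append the m best concepts to the original query.
    return list(query_terms) + sorted(scores, key=scores.get, reverse=True)[:m]
```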

18 Test Collections
Qavanin Collection
- Documents: the Iranian law collection, broken into passages
- 41 queries with relevance judgments
Hamshahri Collection
- Documents: 600+ MB of news from the Hamshahri newspaper, 160,000+ news articles
- 60 queries with relevance judgments
BijanKhan Tagged Collection
- Documents: 100+ MB from different sources, tagged with a tag set of 41 tags

19 Hamshahri Collection
We used HAMSHAHRI, a test collection for Persian text prepared and distributed by DBRG (the IR team) of the University of Tehran. The 3rd version:
- contains about 160,000 distinct textual news articles in Farsi
- has 60 queries, with relevance judgments for the top 20 relevant documents of each query

20 Some Example Queries
Women's rights law: قانون حقوق زنان
Contamination in the Persian Gulf: آلودگی خلیج فارس
Bird migration: کوچ پرندگان
Increase of gasoline price: افزایش قیمت بنزین
Iranian Greco-Roman wrestling: کشتی فرنگی ایران

22 Term-Based Vector Space Model
A. Singhal et al., in their paper "Pivoted Document Length Normalization", report that the following two configurations perform best:
- Slope = 0.25 with pivoted unique normalization (P.U.N.); pivot = average number of unique terms in a document
- Slope = 0.75 with pivoted cosine normalization (P.C.N.); pivot = average cosine factor for 1+log(tf)
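A minimal sketch of the first configuration (pivoted unique normalization combined with the Lnu document weight from slide 9); the function names are our own and the defaults follow the slide, not a verified implementation:

```python
import math

def pivoted_unique_norm(n_unique, pivot, slope=0.25):
    # P.U.N.: pivot is the average number of unique terms per document.
    return (1.0 - slope) + slope * (n_unique / pivot)

def lnu_weight(tf, avg_tf, n_unique, pivot, slope=0.25):
    # Lnu document term weight: log-averaged tf divided by the
    # pivoted unique-term normalizer.
    return ((1 + math.log(tf)) / (1 + math.log(avg_tf))) \
        / pivoted_unique_norm(n_unique, pivot, slope)
```

A document of exactly average length (n_unique equal to the pivot) keeps a normalizer of 1, while longer documents are penalized less steeply than under cosine normalization.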

23 Our Experiment Results
Comparison of the vector space model with slope = 0.25 and with slope = 0.75. [results chart]

24 Our Experiment Results
Comparison of the vector space model and LCA; for LCA we used Lnu.ltu (slope = 0.25 and P.U.N.). [results chart]

25 N-Gram Experiments
Next, we assessed the N-gram-based vector space model for N = 3, 4, 5 on the HAMSHAHRI collection. In addition to Lnu.ltu we assessed atc.atc, in which both the query and the documents are weighted as follows:
atc = (0.5 + 0.5*tf/max tf) * log(N/n), cosine-normalized
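The atc weighting just described can be sketched in Python (a minimal sketch under the standard SMART reading of atc, with our own function name, since the slide's formula image was lost):

```python
import math

def atc_vector(term_counts, N, df):
    # Augmented tf (0.5 + 0.5*tf/max_tf) times idf log(N/n),
    # then cosine-normalize. In atc.atc the same function is
    # applied to both the query and each document.
    max_tf = max(term_counts.values())
    w = {t: (0.5 + 0.5 * tf / max_tf) * math.log(N / df[t])
         for t, tf in term_counts.items() if t in df and df[t] < N}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm else {}
```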

26 N-Gram Experiment Results
N-grams using the atc.atc and Lnu.ltu (slope = 0.25) weighting schemes. [results chart]

27 Conclusion: Previous Work, Comparison of the Vector Space System with FuFaIR
They used the first version of the HAMSHAHRI collection (300+ MB) in their experiments; it has 30 queries. In the vector space model the Slope was set to 0.75 and the Pivot set to

28 Comparison of Vector Space Systems with BM25 [results chart]

29 Experiments on the Qavanin Collection
Comparison of the best vector space model with the best N-grams. [results chart]
Source: F. Oroumchian, F. Mazhar Garamaleki, "An Evaluation of Retrieval Performance Using Farsi Text", First Eurasia Conference on Advances in Information and Communication Technology, Tehran, Iran, October

30 Our Best Experiment Results
Experiments using the atc.atc and Lnu.ltu (slope = 0.25) weighting schemes. [results chart]

31 Results Analysis (N-Gram)
As shown, the 4-gram-based vector space model with the Lnu.ltu weighting scheme performs better than FuFaIR and the other vector space models. This contradicts the relative performance of these methods on English. The rationale is that most Farsi word roots are about 4 characters long. Our results are more reliable than previous work because we used a better collection.

32 Results Analysis (LCA)
Local Context Analysis only marginally improved the results over the Lnu.ltu method. The Lnu.ltu weighting method performs very well on the Farsi language. It would be better to tune the LCA parameters for the HAMSHAHRI collection.

33 Thanks. Questions?