Automatically Building a Stopword List for an Information Retrieval System
Rachel Tsz-Wai Lo, Ben He and Iadh Ounis (University of Glasgow)


Outline
- Stopwords
- Investigation of two approaches
  - Baseline approach based on Zipf's law
  - New term-based random sampling approach
- Experimental setup
- Results and analysis
- Conclusion

What is a Stopword?
- A common word in a document, e.g. "the", "is", "and", "am", "to", "it"
- Carries no information about the document
- Has low discrimination value in terms of IR: meaningless terms contribute nothing to retrieval
- Searching with stopwords usually results in retrieving irrelevant documents

Objective
- Different collections contain different content and word patterns
- Different collections may therefore require different sets of stopwords
- Given a collection of documents, investigate ways to automatically create a stopword list

Objective (cont.)
1. Baseline approach (benchmark): 4 variants inspired by Zipf's law
   - TF
   - Normalised TF
   - IDF
   - Normalised IDF
2. New proposed approach: based on how informative a term is

Fox's Classical Stopword List and Its Weaknesses
- Contains 733 stopwords
- More than 20 years old, so it lacks potentially new words
- Defined for general purposes, while different collections require different stopword lists
- Outdated

Zipf's Law
- Rank terms by their term frequencies: the term with the highest TF gets rank 1, the next highest rank 2, and so on
- Zipf's law relates a term's frequency to its rank
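The formula itself did not survive the transcript; in its usual form, Zipf's law states that a term's frequency is inversely proportional to its rank:

```latex
f(t_r) \;\propto\; \frac{1}{r}
\qquad\Longleftrightarrow\qquad
f(t_r) \cdot r \;\approx\; C \quad \text{(a collection-dependent constant)}
```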

Baseline Approach Algorithm
1. Generate a list of frequencies vs. terms from the corpus
2. Sort the frequencies in descending order
3. Rank the terms by frequency: the highest frequency gets rank 1, the next highest rank 2, and so on
4. Draw a graph of frequency vs. rank

Baseline Approach Algorithm (cont.)
5. Choose a threshold; any term whose frequency lies above the threshold is treated as a stopword
6. Run the queries with this stopword list, removing all stopwords from the queries
7. Evaluate the system by average precision
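As a sketch (not the authors' code), the frequency-ranking steps plus the threshold cut can be written in a few lines; the threshold value is an assumption to be tuned per collection:

```python
from collections import Counter

def baseline_stopwords(documents, threshold):
    """Rank terms by raw term frequency (the TF variant) and treat
    every term whose frequency exceeds the threshold as a stopword."""
    freqs = Counter(term for doc in documents for term in doc.split())
    # Rank 1 = most frequent term, rank 2 = next most frequent, ...
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, freq in ranked if freq > threshold]

docs = ["the cat sat on the mat", "the dog sat on the log"]
print(baseline_stopwords(docs, 2))  # → ['the']
```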

Baseline Approach - Variants
- Term frequency (TF)
- Normalised TF
- Inverse document frequency (IDF)
- Normalised IDF
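The IDF variants can be sketched as follows; the slide does not say how the normalisation is done, so dividing by the maximum value is an assumption here:

```python
import math

def idf_stopword_ranking(df, N):
    """IDF variant: idf(t) = log2(N / df(t)).  Terms appearing in the
    most documents have the LOWEST idf and so rank first as stopword
    candidates.  df maps term -> document frequency; N = number of docs."""
    idf = {t: math.log2(N / df[t]) for t in df}
    max_idf = max(idf.values())
    norm_idf = {t: v / max_idf for t, v in idf.items()}  # normalised IDF
    return sorted(idf, key=idf.get), norm_idf

order, norm = idf_stopword_ranking({"the": 99, "cat": 5, "zebra": 1}, 100)
print(order)  # → ['the', 'cat', 'zebra']
```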

Baseline Approach – Choosing a Threshold
- To produce the best set of stopwords, more than 50 stopword lists were generated for each variant
- Investigate the frequency difference between two consecutive ranks: a big difference (a sudden jump) suggests a cut-off point
- Choosing an appropriate threshold is important
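One way to operationalise the "sudden jump" heuristic (a sketch, not the authors' exact procedure):

```python
def jump_threshold(freqs_desc):
    """Given term frequencies sorted in descending order, return the
    frequency just below the largest drop between consecutive ranks;
    terms whose frequency lies above this value become stopwords."""
    drops = [freqs_desc[i] - freqs_desc[i + 1]
             for i in range(len(freqs_desc) - 1)]
    biggest = max(range(len(drops)), key=drops.__getitem__)
    return freqs_desc[biggest + 1]

# The largest gap falls between 850 and 400, so the cut lands at 400:
print(jump_threshold([900, 850, 400, 35, 30, 28]))  # → 400
```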

Term-Based Random Sampling Approach (TBRSA)
- Our proposed new approach
- Depends on how informative a term is
- Based on the Kullback-Leibler divergence measure
- Similar in spirit to query expansion

Kullback-Leibler Divergence Measure
- Used to measure the distance between two distributions; in our case, the distributions of two terms, one of which is a random term
- The weight of a term t in the sampled document set is given by
  w(t) = Px(t) · log2( Px(t) / Pc(t) )
  where Px(t) is the relative frequency of t in the sampled document set and Pc(t) is its relative frequency in the whole collection
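The slide's formula did not survive transcription; the sketch below assumes the standard KL-based term weight used in DFR-style query expansion, with Px and Pc as relative frequencies in the sample and collection respectively:

```python
import math

def kl_weight(tf_sample, len_sample, tf_coll, len_coll):
    """w(t) = Px(t) * log2(Px(t) / Pc(t)): Px is the term's relative
    frequency in the sampled document set, Pc its relative frequency
    in the whole collection."""
    px = tf_sample / len_sample
    pc = tf_coll / len_coll
    return px * math.log2(px / pc)

# A term no more frequent in the sample than in the collection gets
# weight 0 -- it tells us nothing about the sample.
print(kl_weight(10, 1000, 10_000, 1_000_000))  # → 0.0
```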

TBRSA Algorithm
Repeat Y times:
1. Pick a random term and retrieve the documents containing it
2. Weight every term in the retrieved set with the KL divergence measure
3. Normalise the weights by the maximum weight
4. Rank the terms in ascending order of weight
5. Keep the top X ranked (least informative) terms

TBRSA Algorithm (cont.)
- Merge the Y resulting arrays and sort them
- Extract the top L ranked terms as the stopword list
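Putting the two slides together, a minimal sketch of TBRSA; the data structures, tie-breaking, and the guard for degenerate samples are my assumptions, not the authors':

```python
import math
import random
from collections import Counter

def tbrsa(inverted_index, coll_freq, tokens_c, Y=3, X=5, L=8, seed=0):
    """Term-based random sampling: Y times, pick a random term, treat
    the documents containing it as a sample, weight each sampled term
    by KL divergence from the collection, normalise by the maximum
    weight, and keep the X lowest-weighted (least informative) terms.
    The Y partial lists are merged and the top L kept as stopwords."""
    rng = random.Random(seed)
    merged = Counter()
    for _ in range(Y):
        seed_term = rng.choice(list(inverted_index))
        sample = [t for doc in inverted_index[seed_term] for t in doc]
        tf = Counter(sample)
        length = len(sample)
        weight = {t: (tf[t] / length)
                     * math.log2((tf[t] / length) / (coll_freq[t] / tokens_c))
                  for t in tf}
        w_max = max(weight.values())
        if w_max <= 0:            # guard against degenerate samples
            w_max = 1.0
        norm = {t: w / w_max for t, w in weight.items()}
        merged.update(sorted(norm, key=norm.get)[:X])  # least informative
    return [t for t, _ in merged.most_common(L)]
```

Here `inverted_index` maps a term to the tokenised documents containing it, `coll_freq` is a Counter over the whole collection, and `tokens_c` is its total token count.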

Advantages / Disadvantages
Advantages:
- Based on how informative a term is
- Minimal computational effort compared to the baselines
- Better coverage of the collection
- No need to monitor progress
Disadvantages:
- The first term is generated randomly, so it could retrieve only a small document set
- The experiments must be repeated Y times

Experimental Setup
- Four TREC collections
- Each collection indexed and stemmed, with no predefined stopwords removed: no assumption about stopwords at the outset
- Long queries were used (Title, Description and Narrative) to maximise the chances of exercising the new stopword lists

Experimental Platform
- Terrier (TERabyte RetrIEveR), developed by the IR group at the University of Glasgow
- Based on the Divergence From Randomness (DFR) framework for deriving parameter-free probabilistic models
- The PL2 model was used

PL2 Model
- One of the DFR document-weighting models
- The relevance score of a document d for a query Q is
  score(d, Q) = Σ_{t ∈ Q} qtw · (1 / (tfn + 1)) · ( tfn · log2(tfn / λ) + (λ + 1/(12 · tfn) − tfn) · log2 e + 0.5 · log2(2π · tfn) )
  where tfn = tf · log2(1 + c · avg_l / l) is the normalised term frequency, λ = F/N is the mean of the assumed Poisson distribution (collection frequency F over number of documents N), and qtw is the query term weight
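A sketch of the PL2 scoring function as usually given in the DFR literature; the parameter names are mine, and the c value is collection-dependent:

```python
import math

def pl2(tf, doc_len, avg_len, F, N, qtw=1.0, c=1.0):
    """PL2 weight of one query term in one document.
    tf: term frequency in the document, F: frequency in the collection,
    N: number of documents, c: term frequency normalisation parameter."""
    tfn = tf * math.log2(1.0 + c * avg_len / doc_len)  # normalised tf
    lam = F / N                                        # Poisson mean
    log2e = math.log2(math.e)
    return (qtw / (tfn + 1.0)) * (
        tfn * math.log2(tfn / lam)
        + (lam + 1.0 / (12.0 * tfn) - tfn) * log2e
        + 0.5 * math.log2(2.0 * math.pi * tfn))

# A collection-rare term scores higher than a ubiquitous one:
rare = pl2(tf=5, doc_len=100, avg_len=100, F=50, N=10_000)
common = pl2(tf=5, doc_len=100, avg_len=100, F=50_000, N=10_000)
```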

Collections
- disk45, WT2G, WT10G and DOTGOV

Collection | Size | # Docs | # Tokens | c value
disk45     | 2GB  |        |          |
WT2G       | 2GB  |        |          |
WT10G      | 10GB |        |          |
DOTGOV     | 18GB |        |          |

Queries

Collection | Query Sets                   | # Queries
disk45     | TREC7 and TREC8 ad-hoc tasks | 100
WT2G       | TREC8                        | 50
WT10G      | TREC10                       | 50
DOTGOV     | TREC11 and TREC12 merged     | 100

Merging Stopword Lists
- Merge the classical list with the best list generated by the baseline approach and by the novel approach, respectively
- Add the two lists together and remove duplicates
- The merged list might be stronger in terms of effectiveness
- Follows the classical IR technique of combining evidence
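The merge itself is just a duplicate-free union of the two lists; as a one-line sketch:

```python
def merge_stopword_lists(classical, generated):
    """Concatenate two stopword lists and drop duplicates,
    keeping the order of first appearance."""
    return list(dict.fromkeys(classical + generated))

print(merge_stopword_lists(["the", "is", "and"], ["the", "http", "html"]))
# → ['the', 'is', 'and', 'http', 'html']
```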

Results and Analysis
- Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach)
- Compare the results to Fox's classical stopword list, based on average precision

Baseline Approach – Overall Results
- Normalised IDF was the best-performing variant for every collection
- * indicates a significant difference at the 0.05 level

Collection | Classical | TF | Norm TF | IDF | Norm IDF | p-value
disk45     |           |    |         |     |          |
WT2G       |           |    |         |     |          | *
WT10G      |           |    |         |     |          |
DOTGOV     |           |    |         |     |          |

Baseline Approach – Additional Terms Produced

disk45    | WT2G     | WT10G     | DOTGOV
financial | html     | able      | content
company   | http     | copyright | gov
president | htm      | ok        | define
people    | internet | http      | year
market    | web      | html      | administrate
london    | today    | january   | http
national  | policy   | history   | web
structure | content  | facil     | economic
january   | document | html      | year

TBRSA – Overall Results
- disk45 and WT2G both show improvements
- * indicates a significant difference at the 0.05 level

Collection | Classical | Best Obtained | p-value
disk45     |           |               |
WT2G       |           |               |
WT10G      |           |               |
DOTGOV     |           |               | *

TBRSA – Additional Terms Produced

disk45     | WT2G        | WT10G     | DOTGOV
column     | advance     | copyright | server
general    | beach       | friend    | modify
california | company     | memory    | length
industry   | environment | mountain  | content
month      | garden      | problem   | accept
director   | industry    | science   | inform
desk       | material    | special   | connect
economic   | pollution   | internet  | gov
business   | school      | document  | byte

Refinement - Merging
- The new approach (TBRSA) gives comparable results with less computational effort
- Fox's classical stopword list was still very effective, despite its age, so it is worth using
- The queries were quite conservative

Merging – Baseline Approach
- Produced a more effective stopword list
- * indicates a significant difference at the 0.05 level

Collection | Classical | Norm IDF | Merged | p-value
disk45     |           |          |        |
WT2G       |           |          |        | *
WT10G      |           |          |        | *
DOTGOV     |           |          |        |

Merging – TBRSA
- Produced an improved stopword list with less computational effort
- * indicates a significant difference at the 0.05 level

Collection | Classical | Best Obtained | Merged | p-value
disk45     |           |               |        |
WT2G       |           |               |        | *
WT10G      |           |               |        |
DOTGOV     |           |               |        |

Conclusion & Future Work
- Proposed a novel approach for automatically generating a stopword list, and showed its effectiveness and robustness
- Compared it to 4 baseline variants based on Zipf's law
- Merging the classical stopword list with the best generated list produces a more effective stopword list

Conclusion & Future Work (cont.)
- Investigate other divergence metrics, e.g. a Poisson-based approach
- Handle terms whose role depends on context, e.g. verb vs. noun: "I can open a can of tuna with a can opener", "to be or not to be"
- Detect the nature of the context: some occurrences of a term may have to be kept while others are removed

Thank you for your attention! Any questions?