Automatically Building a Stopword List for an Information Retrieval System
Rachel Tsz-Wai Lo, Ben He, Iadh Ounis
University of Glasgow
Outline
- Stopwords
- Investigation of two approaches:
  - Approach based on Zipf's Law
  - New term-based random sampling approach
- Experimental setup
- Results and analysis
- Conclusion
What is a Stopword?
- A common word in a document, e.g. "the", "is", "and", "am", "to", "it"
- Carries no information about the documents
- Low discrimination value in terms of IR: meaningless, no contribution
- Searching with stopwords will usually result in retrieving irrelevant documents
Objective
- Different collections contain different content and word patterns
- Different collections may therefore require different sets of stopwords
- Given a collection of documents, investigate ways to automatically create a stopword list
Objective (cont.)
1. Baseline approach (benchmark): 4 variants inspired by Zipf's Law
   - TF
   - Normalised TF
   - IDF
   - Normalised IDF
2. New proposed approach, based on how informative a term is
Fox's Classical Stopword List and Its Weaknesses
- Contains 733 stopwords
- More than 20 years old, hence outdated
- Lacks potentially new words
- Defined for general purpose, but different collections require different stopword lists
Zipf's Law
- Rank the terms of the collection by their term frequencies: the term with the highest TF gets rank 1, the next highest rank 2, etc.
- Zipf's Law: a term's frequency is inversely proportional to its rank, i.e. f(r) ∝ 1/r
Baseline Approach Algorithm
1. Generate a list of frequencies vs. terms based on the corpus
2. Sort the frequencies in descending order
3. Rank the terms according to their frequencies: the highest frequency gets rank 1, the next highest rank 2, etc.
4. Draw a graph of frequency vs. rank
Baseline Approach Algorithm (cont.)
5. Choose a threshold; any word that appears above the threshold is treated as a stopword
6. Run the queries with the above stopword list: all stopwords in the queries are removed
7. Evaluate the system with average precision
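The baseline algorithm above can be sketched in Python (the toy corpus and threshold value are illustrative, not from the experiments):

```python
from collections import Counter

def zipf_stopwords(documents, threshold):
    """Rank terms by descending collection frequency (Zipf's law) and
    treat every term whose frequency exceeds the threshold as a stopword."""
    freqs = Counter(term for doc in documents for term in doc.split())
    # most_common() gives rank 1 = highest-frequency term, rank 2 next, etc.
    return [term for term, f in freqs.most_common() if f > threshold]

docs = ["the cat sat on the mat", "the dog sat on the log"]
print(zipf_stopwords(docs, threshold=2))  # ['the'] -- occurs 4 times
```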
Baseline Approach - Variants
- Term Frequency (TF)
- Normalised TF
- Inverse Document Frequency (IDF)
- Normalised IDF
Baseline Approach – Choosing a Threshold
- Producing the best set of stopwords required more than 50 stopword lists per variant
- Investigate the frequency difference between two consecutive ranks: a big difference (i.e. a sudden jump) suggests a cut-off point
- Choosing an appropriate threshold is important
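One way to spot the "sudden jump" mechanically is to look for the largest drop between two consecutive ranked frequencies (a minimal sketch; the thresholds in the experiments were chosen empirically, not by this exact rule):

```python
def biggest_gap_rank(freqs_desc):
    """Given term frequencies sorted in descending order, return the
    1-based rank after which the largest drop between two consecutive
    frequencies occurs -- a candidate threshold position."""
    gaps = [freqs_desc[i] - freqs_desc[i + 1]
            for i in range(len(freqs_desc) - 1)]
    return gaps.index(max(gaps)) + 1

freqs = [1000, 950, 400, 390, 50, 10]
print(biggest_gap_rank(freqs))  # 2: the jump from 950 down to 400
```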
Term-Based Random Sampling Approach (TBRSA)
- Our proposed new approach
- Depends on how informative a term is
- Based on the Kullback-Leibler divergence measure
- Similar to the idea of query expansion
Kullback-Leibler Divergence Measure
- Used to measure the distance between two distributions; in our case, the distributions of two terms, one of which is a random term
- The weight of a term t in the sampled document set is given by:

  w(t) = P_x * log2(P_x / P_c)

  where P_x = tf_x / token_x is the relative frequency of t in the sampled document set, and P_c = F / token_c is its relative frequency in the whole collection
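The weight can be computed directly (a minimal sketch; the argument names are illustrative, not the paper's implementation):

```python
import math

def kl_weight(tf_x, len_x, F, token_c):
    """w(t) = P_x * log2(P_x / P_c), where P_x is the term's relative
    frequency in the sampled document set and P_c its relative
    frequency in the whole collection."""
    p_x = tf_x / len_x
    p_c = F / token_c
    return p_x * math.log2(p_x / p_c)

# A term whose sample frequency matches its collection frequency gets
# weight 0 -- it is uninformative, exactly the stopword profile.
print(kl_weight(tf_x=10, len_x=1000, F=10_000, token_c=1_000_000))  # 0.0
```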
TBRSA Algorithm
- Repeat Y times:
  1. Choose a term at random
  2. Retrieve the set of documents containing it
  3. Weight every term in this sample using the KL divergence measure
  4. Normalise the weights by the maximum weight
  5. Rank the terms in ascending order of weight and keep the top X ranked terms
TBRSA Algorithm (cont.)
- Sort and merge the Y arrays of candidate terms
- Extract the top L ranked terms as the stopword list
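The whole loop can be sketched as follows (a toy in-memory version under assumed data structures, not the paper's implementation: `inverted_index` maps each term to the documents containing it, `collection_tf` holds collection-wide term frequencies):

```python
import math
import random
from collections import Counter

def tbrsa(inverted_index, collection_tf, token_c, Y, X, L, seed=0):
    """Term-based random sampling sketch: Y times, pick a random seed
    term, take the documents containing it as a sample, weight every
    sample term by KL divergence from the collection model, normalise
    by the maximum weight, and keep the X lowest-weighted (least
    informative) terms; finally rank the selected terms by how often
    they were chosen and return the top L as stopwords."""
    rng = random.Random(seed)
    selected = Counter()
    for _ in range(Y):
        seed_term = rng.choice(list(inverted_index))
        sample = Counter(t for d in inverted_index[seed_term] for t in d.split())
        len_x = sum(sample.values())
        weights = {t: (tf / len_x) * math.log2((tf / len_x)
                                               / (collection_tf[t] / token_c))
                   for t, tf in sample.items()}
        max_w = max(weights.values())
        norm = max_w if max_w > 0 else 1.0  # avoid dividing by zero
        ranked = sorted(weights, key=lambda t: weights[t] / norm)
        selected.update(ranked[:X])  # least informative terms first
    return [t for t, _ in selected.most_common(L)]

docs = ["the cat sat", "the dog ran", "the cat ran"]
index = {}
for d in docs:
    for t in dict.fromkeys(d.split()):
        index.setdefault(t, []).append(d)
ctf = Counter(t for d in docs for t in d.split())
print(tbrsa(index, ctf, token_c=9, Y=2, X=3, L=4))  # 'the' always survives
```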
Advantages / Disadvantages
- Advantages:
  - Based on how informative a term is
  - Minimal computational effort compared to the baselines
  - Better coverage of the collection
  - No need to monitor progress
- Disadvantages:
  - The first term is generated randomly and could retrieve a small document set
  - The experiments must be repeated Y times
Experimental Setup
- Four TREC collections
- Each collection is indexed and stemmed, with no pre-defined stopwords removed: no assumption of stopwords at the outset
- Long queries (Title, Description and Narrative) were used to maximise our chances of using the new stopword lists
Experimental Platform
- Terrier (TERabyte RetrIEveR), developed by the IR Group, University of Glasgow
- Based on the Divergence From Randomness (DFR) framework for deriving parameter-free probabilistic models
- We use the PL2 model
PL2 Model
- One of the DFR document weighting models
- The relevance score of a document d for a query Q is:

  score(d, Q) = sum over t in Q of qtf * (1 / (tfn + 1)) * [ tfn * log2(tfn / lambda) + (lambda + 1/(12*tfn) - tfn) * log2(e) + 0.5 * log2(2*pi*tfn) ]

  where tfn = tf * log2(1 + c * avg_l / l) is the normalised term frequency (Normalisation 2, with document length l, average document length avg_l and hyper-parameter c), and lambda = F/N is the mean of the assumed Poisson distribution (collection frequency F over N documents)
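A direct transcription of the standard PL2 formula for a single query term (symbol names are assumptions; Terrier's actual implementation differs in engineering details):

```python
import math

def pl2(tf, qtf, doc_len, avg_len, F, N, c=1.0):
    """Score contribution of one query term under PL2 (DFR):
    tf  - term frequency in the document   F - term frequency in the collection
    qtf - term frequency in the query      N - number of documents
    c   - term-frequency normalisation hyper-parameter."""
    tfn = tf * math.log2(1 + c * avg_len / doc_len)  # Normalisation 2
    lam = F / N                                      # Poisson mean
    return (qtf / (tfn + 1)) * (
        tfn * math.log2(tfn / lam)
        + (lam + 1 / (12 * tfn) - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2 * math.pi * tfn)
    )

# A term frequent in the document but rare in the collection scores highly.
print(pl2(tf=5, qtf=1, doc_len=100, avg_len=100, F=100, N=10_000))
```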
Collections
- disk45, WT2G, WT10G and DOTGOV

Collection | Size | # Docs | # Tokens | c value
disk45     | 2GB  |        |          |
WT2G       | 2GB  |        |          |
WT10G      | 10GB |        |          |
DOTGOV     | 18GB |        |          |
Queries

Collection | Query Sets                         | # Queries
disk45     | TREC7 and TREC8 ad-hoc tasks       | 100
WT2G       | TREC8                              | 50
WT10G      | TREC10                             | 50
DOTGOV     | TREC11 and TREC12 merged           | 100
Merging Stopword Lists
- Merge the classical list with the best list generated by the baseline and by the novel approach, respectively
- Add the 2 lists together, removing duplicates
- The merged list might be stronger in terms of effectiveness
- Follows from the classical IR technique of combining evidence
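The merge itself is just a duplicate-free union, e.g.:

```python
def merge_stopword_lists(classical, generated):
    """Combine two stopword lists, removing duplicates while
    preserving first-seen order."""
    return list(dict.fromkeys(classical + generated))

print(merge_stopword_lists(["the", "is", "and"], ["the", "http", "html"]))
# ['the', 'is', 'and', 'http', 'html']
```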
Results and Analysis
- Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach)
- Compare the results obtained to Fox's classical stopword list, based on average precision
Baseline Approach – Overall Results
- * indicates a significant difference at the 0.05 level
- Normalised IDF performed best for every collection

Collection | Classical | TF | Norm TF | IDF | Norm IDF | p-value
disk45     |           |    |         |     |          |
WT2G       |           |    |         |     |          | *
WT10G      |           |    |         |     |          |
DOTGOV     |           |    |         |     |          |
Baseline Approach – Additional Terms Produced

disk45    | WT2G     | WT10G     | DOTGOV
financial | html     | able      | content
company   | http     | copyright | gov
president | htm      | ok        | define
people    | internet | http      | year
market    | web      | html      | administrate
london    | today    | january   | http
national  | policy   | history   | web
structure | content  | facil     | economic
january   | document | html      | year
TBRSA – Overall Results
- * indicates a significant difference at the 0.05 level
- disk45 and WT2G both show improvements

Collection | Classical | Best Obtained | p-value
disk45     |           |               |
WT2G       |           |               |
WT10G      |           |               |
DOTGOV     |           |               | *
TBRSA – Additional Terms Produced

disk45    | WT2G        | WT10G    | DOTGOV
column    | advance     | copyright| server
general   | beach       | friend   | modify
califonia | company     | memory   | length
industry  | environment | mountain | content
month     | garden      | problem  | accept
director  | industry    | science  | inform
desk      | material    | special  | connect
economic  | pollution   | internet | gov
business  | school      | document | byte
Refinement - Merging
- The new approach (TBRSA) gives comparable results with less computational effort
- Fox's classical stopword list was very effective despite its old age, so it is worth using
- The queries were quite conservative
Merging – Baseline Approach
- * indicates a significant difference at the 0.05 level
- Produced a more effective stopword list

Collection | Classical | Norm IDF | Merged | p-value
disk45     |           |          |        |
WT2G       |           |          |        | *
WT10G      |           |          |        | *
DOTGOV     |           |          |        |
Merging – TBRSA
- * indicates a significant difference at the 0.05 level
- Produced an improved stopword list with less computational effort

Collection | Classical | Best Obtained | Merged | p-value
disk45     |           |               |        |
WT2G       |           |               |        | *
WT10G      |           |               |        |
DOTGOV     |           |               |        |
Conclusion & Future Work
- Proposed a novel approach for automatically generating a stopword list: effectiveness and robustness
- Compared it to 4 baseline variants based on Zipf's Law
- Merged the classical stopword list with the best found result to produce a more effective stopword list
Conclusion & Future Work (cont.)
- Investigate other divergence metrics, e.g. a Poisson-based approach
- Distinguish verb vs. noun usage: "I can open a can of tuna with a can opener", "to be or not to be"
- Detect the nature of the context: we might have to keep some occurrences of a term but remove others
Thank you for your attention! Any questions?