
1
Automatically Building a Stopword List for an Information Retrieval System University of Glasgow Rachel Tsz-Wai Lo, Ben He, Iadh Ounis

2
Outline
- Stopwords
- Investigation of two approaches
  - Approach based on Zipf's Law
  - New term-based random sampling approach
- Experimental Setup
- Results and Analysis
- Conclusion

3
What is a Stopword?
- A common word in a document, e.g. the, is, and, am, to, it
- Carries no information about the documents it appears in
- Low discrimination value in terms of IR: meaningless, contributes nothing
- Searching with stopwords will usually retrieve irrelevant documents

4
Objective
- Different collections contain different content and word patterns
- Different collections may therefore require different sets of stopwords
- Given a collection of documents, investigate ways to automatically create a stopword list

5
Objective (cont.)
1. Baseline approach (benchmark): 4 variants inspired by Zipf's Law
   - TF
   - Normalised TF
   - IDF
   - Normalised IDF
2. New proposed approach: based on how informative a term is

6
Fox's Classical Stopword List and Its Weaknesses
- Contains 733 stopwords
- More than 20 years old, so it lacks potentially new words
- Defined for general purpose, but different collections require different stopword lists
- Outdated

7
Zipf's Law
- Based on the term frequencies of terms, rank the terms accordingly
- The term with the highest TF gets rank 1, the next highest rank 2, etc.
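The rank/frequency relation above can be made concrete with a tiny sketch. The function name and the exponent parameter s are our own illustration, not from the slides; classic Zipf uses s ≈ 1 for natural language.

```python
def zipf_frequency(top_frequency, rank, s=1.0):
    """Zipf's law: the frequency of the rank-r term is roughly
    proportional to 1 / r**s, so f(r) ~ f(1) / r**s.
    Illustrative helper; name and parameters are ours."""
    return top_frequency / rank ** s

# If the top-ranked term occurs 1000 times, Zipf predicts roughly
# 500 occurrences at rank 2 and 250 at rank 4.
```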

8
[Figure: frequency vs. rank graph illustrating Zipf's Law]

9
Baseline Approach Algorithm
- Generate a list of term frequencies from the corpus
- Sort the frequencies in descending order
- Rank the terms by frequency: the highest frequency gets rank 1, the next highest rank 2, etc.
- Draw a graph of frequency vs. rank
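The counting-and-ranking steps above can be sketched in a few lines. This is a minimal illustration of the procedure, not the authors' code; the helper name is ours.

```python
from collections import Counter

def rank_terms_by_frequency(corpus_tokens):
    """Count term frequencies and rank terms in descending order of
    frequency (rank 1 = most frequent). Returns (term, freq, rank)
    tuples, covering the first three steps of the baseline algorithm."""
    freqs = Counter(corpus_tokens)
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [(term, freq, rank) for rank, (term, freq) in enumerate(ranked, start=1)]

tokens = "the cat sat on the mat and the dog sat".split()
ranking = rank_terms_by_frequency(tokens)
# 'the' occurs 3 times, so it takes rank 1
```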

10
Baseline Approach Algorithm (cont.)

11
- Choose a threshold; any word that appears above the threshold is treated as a stopword
- Run the queries with this stopword list, removing all stopwords from the queries
- Evaluate the system with average precision

12
Baseline Approach - Variants
- Term Frequency (TF)
- Normalised TF
- Inverse Document Frequency (IDF)
- Normalised IDF
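The slide lists the variants without formulas; as one plausible reading, IDF is the textbook definition and "normalised" divides scores by the maximum so they fall in [0, 1]. Both helpers below are our own sketch under that assumption.

```python
import math

def idf(n_docs, doc_freq):
    """Textbook inverse document frequency for a term occurring in
    doc_freq of the n_docs documents. A term in every document gets
    idf 0 -- a natural stopword signal."""
    return math.log2(n_docs / doc_freq)

def normalise(scores):
    """Divide by the maximum so every score falls in [0, 1]; our
    assumed meaning of the slide's 'normalised' variants."""
    m = max(scores)
    return [s / m for s in scores]

scores = [idf(1000, df) for df in (1000, 100, 10)]
# a term in every document gets idf 0; rarer terms get higher idf
```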

13
Baseline Approach – Choosing a Threshold
- To produce the best set of stopwords, more than 50 stopword lists were generated for each variant
- Investigate the frequency difference between two consecutive ranks: a big difference (i.e. a sudden jump) marks a natural cut-off
- Choosing an appropriate threshold is important
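The "sudden jump" idea can be sketched as picking the rank with the largest drop between consecutive frequencies. This is only one hedged interpretation of the heuristic; the helper name is ours.

```python
def find_threshold_rank(frequencies):
    """Given frequencies sorted in descending order, return the
    1-based rank after which the largest drop between consecutive
    frequencies occurs; terms at or above that rank are candidate
    stopwords."""
    best_rank, best_drop = 1, -1
    for i in range(len(frequencies) - 1):
        drop = frequencies[i] - frequencies[i + 1]
        if drop > best_drop:
            best_drop, best_rank = drop, i + 1
    return best_rank

freqs = [1000, 950, 900, 300, 280, 270]  # big jump after rank 3
```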

14
Term-Based Random Sampling Approach (TBRSA)
- Our proposed new approach
- Depends on how informative a term is
- Based on the Kullback-Leibler divergence measure
- Similar in spirit to query expansion

15
Kullback-Leibler Divergence Measure
- Used to measure the distance between two distributions; in our case, the distributions of two terms, one of which is a random term
- The weight of a term t in the sampled document set is given by:
  w(t) = Px(t) · log2( Px(t) / P(t) )
  where Px(t) is the relative frequency of t in the sampled document set and P(t) is its relative frequency in the whole collection
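The weight w(t) translates directly into code. This is a sketch assuming the relative-frequency reading of Px(t) and P(t) above; the parameter names are ours.

```python
import math

def kl_term_weight(tf_sample, len_sample, tf_collection, len_collection):
    """Weight of term t in the sampled document set:
        w(t) = Px(t) * log2(Px(t) / P(t))
    where Px(t) = tf_sample / len_sample is the term's relative
    frequency in the sample and P(t) = tf_collection / len_collection
    its relative frequency in the whole collection."""
    p_x = tf_sample / len_sample
    p = tf_collection / len_collection
    return p_x * math.log2(p_x / p)

# A term twice as frequent in the sample as in the collection:
w = kl_term_weight(tf_sample=10, len_sample=1000,
                   tf_collection=5000, len_collection=1_000_000)
```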

16
TBRSA Algorithm
- Repeat Y times:
  - Pick a random term and retrieve the documents containing it
  - Weight every term in the sample with the KL divergence measure
  - Normalise the weights by the maximum weight
  - Rank the terms in ascending order of weight
  - Keep the top X ranked terms

17
TBRSA Algorithm (cont.)
- Merge the Y lists, sort by weight, and extract the top L ranked terms as stopwords
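Putting the two slides together, the whole procedure can be sketched end to end. The parameter names Y, X and L follow the slides; everything else (data layout, tie-breaking, the merge rule) is our own illustration, not the authors' implementation.

```python
import math
import random
from collections import Counter

def tbrsa_stopwords(docs, Y=5, X=50, L=30, seed=0):
    """Term-based random sampling sketch: repeat Y times -- pick a
    random term, retrieve the documents containing it, weight every
    term of that sample with the KL divergence measure against the
    collection, normalise by the maximum weight, rank ascending
    (least informative first) and keep the top X; finally merge the
    Y lists, sort, and extract the top L terms as stopwords."""
    rng = random.Random(seed)
    coll = Counter(t for d in docs for t in d)
    coll_len = sum(coll.values())
    vocab = sorted(coll)
    merged = {}
    for _ in range(Y):
        pivot = rng.choice(vocab)                 # random term
        sample = [d for d in docs if pivot in d]  # retrieve its documents
        s = Counter(t for d in sample for t in d)
        s_len = sum(s.values())
        weights = {t: (tf / s_len) * math.log2((tf / s_len) / (coll[t] / coll_len))
                   for t, tf in s.items()}
        max_w = max(abs(w) for w in weights.values()) or 1.0
        ranked = sorted(weights.items(), key=lambda kv: kv[1] / max_w)
        for t, w in ranked[:X]:                   # least informative first
            merged[t] = min(w / max_w, merged.get(t, float("inf")))
    final = sorted(merged.items(), key=lambda kv: kv[1])
    return [t for t, _ in final[:L]]
```

On a toy corpus where "the" appears in every document, its sample distribution matches the collection distribution, its KL weight is near zero, and it surfaces as a stopword.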

18
Advantages / Disadvantages
Advantages:
- Based on how informative a term is
- Minimal computational effort compared to the baselines
- Better coverage of the collection
- No need to monitor progress
Disadvantages:
- The first term is generated randomly, so it could retrieve a small document set
- Experiments must be repeated Y times

19
Experimental Setup
- Four TREC collections
- Each collection is indexed and stemmed, with no pre-defined stopwords removed: no assumption of stopwords at the start
- Long queries were used (Title, Description and Narrative) to maximise our chances of exercising the new stopword lists

20
Experimental Platform
- Terrier (TERabyte RetrIEveR), developed by the IR Group, University of Glasgow
- Based on the Divergence From Randomness (DFR) framework for deriving parameter-free probabilistic models
- PL2 model

21
PL2 Model
- One of the DFR document weighting models
- The relevance score of a document d for a query Q is:
  score(d, Q) = Σ_{t∈Q} qtw · (1 / (tfn + 1)) · ( tfn · log2(tfn / λ) + (λ − tfn) · log2 e + 0.5 · log2(2π · tfn) )
  where tfn = tf · log2(1 + c · avg_l / l) is the normalised term frequency (c a hyper-parameter, l the document length, avg_l the average document length), λ = F/N is the mean of the assumed Poisson distribution (F the term's collection frequency, N the number of documents), and qtw is the query term weight
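The per-term contribution can be written out directly. This follows the standard DFR formulation of PL2 rather than anything shown on the slide, and the parameter names are ours.

```python
import math

def pl2_score(tf, doc_len, avg_doc_len, term_coll_freq, n_docs, c=1.0, qtw=1.0):
    """PL2 relevance contribution of a single query term:
        tfn = tf * log2(1 + c * avg_l / l)
        lam = F / N
        qtw / (tfn + 1) * ( tfn * log2(tfn / lam)
              + (lam - tfn) * log2(e) + 0.5 * log2(2 * pi * tfn) )
    c is the hyper-parameter of the term frequency normalisation."""
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    lam = term_coll_freq / n_docs
    return qtw / (tfn + 1.0) * (
        tfn * math.log2(tfn / lam)
        + (lam - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2.0 * math.pi * tfn)
    )
```

At equal within-document frequency, a rare term scores higher than a very common one, which is exactly why low PL2-style informativeness flags stopword candidates.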

22
Collections
- disk45, WT2G, WT10G and DOTGOV

Collection | Size | # Docs | # Tokens | c value
disk45     | 2GB  | —      | —        | —
WT2G       | 2GB  | —      | —        | —
WT10G      | 10GB | —      | —        | —
DOTGOV     | 18GB | —      | —        | —

23
Queries

Collection | Query Sets                    | # Queries
disk45     | TREC7 and TREC8 ad-hoc tasks  | 100
WT2G       | TREC8                         | 50
WT10G      | TREC10                        | 50
DOTGOV     | TREC11 and TREC12 merged      | 100

24
Merging Stopword Lists
- Merge the classical list with the best list generated by the baseline and the novel approach, respectively
- Add the two lists together and remove duplicates
- The merged list might be stronger in terms of effectiveness
- Follows the classical IR technique of combining evidence
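The merging step is a simple order-preserving union. A minimal sketch, with the helper name ours:

```python
def merge_stopword_lists(classical, generated):
    """Add two stopword lists together and remove duplicates,
    keeping first-seen order (classical terms first)."""
    seen, merged = set(), []
    for term in classical + generated:
        if term not in seen:
            seen.add(term)
            merged.append(term)
    return merged

merged = merge_stopword_lists(["the", "is", "and"], ["http", "the", "html"])
# -> ["the", "is", "and", "http", "html"]
```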

25
Results and Analysis
- Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach)
- Compare the results obtained against Fox's classical stopword list, based on average precision

26
Baseline Approach – Overall Results
- * indicates a significant difference at the 0.05 level
- Normalised IDF performed best, for every collection

Collection | Classical | TF | Norm TF | IDF | Norm IDF | p-value
disk45     | —         | —  | —       | —   | —        | —
WT2G       | —         | —  | —       | —   | —        | *
WT10G      | —         | —  | —       | —   | —        | —
DOTGOV     | —         | —  | —       | —   | —        | —

27
Baseline Approach – Additional Terms Produced

disk45    | WT2G     | WT10G     | DOTGOV
financial | html     | able      | content
company   | http     | copyright | gov
president | htm      | ok        | define
people    | internet | http      | year
market    | web      | html      | administrate
london    | today    | january   | http
national  | policy   | history   | web
structure | content  | facil     | economic
january   | document | html      | year

28
TBRSA – Overall Results
- * indicates a significant difference at the 0.05 level
- disk45 and WT2G both show improvements

Collection | Classical | Best Obtained | p-value
disk45     | —         | —             | —
WT2G       | —         | —             | —
WT10G      | —         | —             | —
DOTGOV     | —         | —             | *

29
TBRSA – Additional Terms Produced

disk45     | WT2G        | WT10G    | DOTGOV
column     | advance     | copyright| server
general    | beach       | friend   | modify
california | company     | memory   | length
industry   | environment | mountain | content
month      | garden      | problem  | accept
director   | industry    | science  | inform
desk       | material    | special  | connect
economic   | pollution   | internet | gov
business   | school      | document | byte

30
Refinement - Merging
- The new approach (TBRSA) gives comparable results with less computational effort
- Fox's classical stopword list was very effective, despite its age, so it is worth using
- The queries were quite conservative

31
Merging – Baseline Approach
- * indicates a significant difference at the 0.05 level
- Produced a more effective stopword list

Collection | Classical | Norm IDF | Merged | p-value
disk45     | —         | —        | —      | —
WT2G       | —         | —        | —      | *
WT10G      | —         | —        | —      | *
DOTGOV     | —         | —        | —      | —

32
Merging – TBRSA
- * indicates a significant difference at the 0.05 level
- Produced an improved stopword list with less computational effort

Collection | Classical | Best Obtained | Merged | p-value
disk45     | —         | —             | —      | —
WT2G       | —         | —             | —      | *
WT10G      | —         | —             | —      | —
DOTGOV     | —         | —             | —      | —

33
Conclusion & Future Work
- Proposed a novel approach for automatically generating a stopword list: effective and robust
- Compared against 4 baseline variants based on Zipf's Law
- Merging the classical stopword list with the best result found produces a more effective stopword list

34
Conclusion & Future Work (cont.)
- Investigate other divergence metrics, e.g. a Poisson-based approach
- Verb vs. noun ambiguity: "I can open a can of tuna with a can opener", "to be or not to be"
- Detect the nature of the context: we might have to keep some occurrences of a term but remove others

35
Thank you for your attention! Any questions?
