Automatically Building a Stopword List for an Information Retrieval System University of Glasgow Rachel Tsz-Wai Lo, Ben He, Iadh Ounis.



1 Automatically Building a Stopword List for an Information Retrieval System University of Glasgow Rachel Tsz-Wai Lo, Ben He, Iadh Ounis

2 Outline: Stopwords. Investigation of two approaches: an approach based on Zipf's Law, and a new term-based random sampling approach. Experimental Setup. Results and Analysis. Conclusion.

3 What is a Stopword? Common words in a document, e.g. the, is, and, am, to, it. They contain no information about documents and have a low discrimination value in terms of IR (meaningless, no contribution). Searching with stopwords will usually result in retrieving irrelevant documents.

4 Objective: Different collections contain different contents and word patterns, so different collections may require different sets of stopwords. Given a collection of documents, investigate ways to automatically create a stopword list.

5 Objective (cont.): 1. Baseline approach (benchmark): 4 variants inspired by Zipf's Law (TF, Normalised TF, IDF, Normalised IDF). 2. How informative a term is (new proposed approach).

6 Fox's Classical Stopword List and Its Weakness: Contains 733 stopwords. More than 20 years old, so it is outdated and lacks potentially new words. Defined for general purpose, whereas different collections require different stopword lists.

7 Zipf's Law: Based on the term frequencies (TF) of terms, rank the terms accordingly: the term with the highest TF has rank = 1, the term with the next highest TF has rank = 2, etc.
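The slide names Zipf's Law without stating it in the transcript. In its usual textbook form, given here as a reminder rather than as the slide's exact wording, the frequency of a term is roughly inversely proportional to its rank. The symbols $f$, $r$, $\alpha$ and $C$ are notation chosen for this note:

```latex
f(r) \propto \frac{1}{r^{\alpha}}, \qquad \alpha \approx 1
\quad\Longleftrightarrow\quad
f(r) \cdot r^{\alpha} \approx C
```

Here $f(r)$ is the frequency of the term at rank $r$ and $C$ is a collection-dependent constant.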

8

9 Baseline Approach Algorithm: Generate a list of term frequencies based on the corpus. Sort the frequencies in descending order. Rank the terms according to their frequencies: the highest frequency has rank = 1, the next highest has rank = 2, etc. Draw a graph of frequency vs rank.

10 Baseline Approach Algorithm (cont.)

11 Choose a threshold; any words that appear above the threshold are treated as stopwords. Run the queries with this stopword list, so that all stopwords in the queries are removed. Evaluate the system with Average Precision.
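The baseline steps (count, sort, rank, threshold, strip stopwords from queries) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: whitespace tokenisation, the helper names and the `min_frequency` parameter are all assumptions made for the example.

```python
from collections import Counter


def rank_terms(documents):
    """Return (term, frequency) pairs, highest frequency first (rank 1 first)."""
    counts = Counter(term for doc in documents for term in doc.lower().split())
    # Sort by descending frequency; break ties alphabetically for determinism.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def stopwords_above_threshold(ranked, min_frequency):
    """Treat every term whose frequency is at least min_frequency as a stopword."""
    return {term for term, freq in ranked if freq >= min_frequency}


def strip_stopwords(query, stopwords):
    """Remove stopwords from a whitespace-tokenised query."""
    return [term for term in query.lower().split() if term not in stopwords]


# Toy corpus (hypothetical data, for illustration only).
docs = ["the cat sat on the mat", "the dog chased the cat", "a dog is a pet"]
ranked = rank_terms(docs)
stopwords = stopwords_above_threshold(ranked, min_frequency=3)
query_terms = strip_stopwords("the cat and the dog", stopwords)
```

In a real run the stripped queries would then be submitted to the retrieval system and scored with average precision.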

12 Baseline Approach - Variants: Term Frequency (TF); Normalised Term Frequency; Inverse Document Frequency (IDF); Normalised IDF.
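The slide lists the four variants without their definitions. The sketch below computes one plausible version of each; the specific formulas (negative log of relative frequency for normalised TF, and the `+0.5` smoothed form for normalised IDF) are assumptions based on common IR practice, not taken from the slides.

```python
import math
from collections import Counter


def variant_scores(documents):
    """Compute TF, normalised TF, IDF and normalised IDF for every term."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    total_tokens = sum(len(doc) for doc in tokenised)
    tf = Counter(term for doc in tokenised for term in doc)
    df = Counter(term for doc in tokenised for term in set(doc))
    scores = {}
    for term in tf:
        scores[term] = {
            "tf": tf[term],
            # Assumed form: negative log of the term's share of all tokens.
            "norm_tf": -math.log(tf[term] / total_tokens),
            # Standard IDF: log of (number of documents / document frequency).
            "idf": math.log(n_docs / df[term]),
            # Assumed smoothed form, as used in probabilistic weighting.
            "norm_idf": math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5)),
        }
    return scores


scores = variant_scores(["the cat", "the dog", "the bird sings"])
```

Under these definitions a stopword candidate has a high TF but a low normalised TF, IDF and normalised IDF, so the ranking direction differs between the variants.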

13 Baseline Approach – Choosing Threshold: The aim is to produce the best set of stopwords. More than 50 stopword lists were produced for each variant. Investigate the frequency difference between two consecutive ranks, looking for a big difference (i.e. a sudden jump). It is important to choose an appropriate threshold.
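One way to look for a "sudden jump" between consecutive ranks is to cut at the largest frequency drop. This is only a heuristic sketch under that assumption; the slides indicate that many candidate thresholds were tried, not that a single rule was applied. The function name and the sample frequencies are hypothetical.

```python
def threshold_by_largest_drop(frequencies):
    """Given frequencies sorted in descending order, return how many
    top-ranked terms lie above the largest drop between consecutive ranks."""
    if len(frequencies) < 2:
        return len(frequencies)
    # Difference between each rank and the next one.
    drops = [frequencies[i] - frequencies[i + 1] for i in range(len(frequencies) - 1)]
    # Index of the largest drop; terms up to and including it are stopwords.
    cut = max(range(len(drops)), key=drops.__getitem__)
    return cut + 1


freqs = [950, 900, 870, 120, 100, 90]
n_stopwords = threshold_by_largest_drop(freqs)
```

With these sample frequencies the largest drop is between ranks 3 and 4, so the three top-ranked terms would be treated as stopwords.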

14 Term-Based Random Sampling Approach (TBRSA): Our proposed new approach. Depends on how informative a term is. Based on the Kullback-Leibler divergence measure. Similar to the idea of query expansion.

15 Kullback-Leibler Divergence Measure: Used to measure the distance between two distributions; in our case, the distributions of two terms, one of which is a random term. The weight of a term t in the sampled document set is given by: [formula not preserved in the transcript]
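The weighting formula on this slide is missing from the transcript. A KL-style term weight of the following form is common in query expansion and is a plausible reading of the slide; the exact form and the symbol names are assumptions:

```latex
w(t) = P_x \cdot \log_2 \frac{P_x}{P_c},
\qquad
P_x = \frac{tf_x}{l_x},
\qquad
P_c = \frac{F}{token_c}
```

Here $tf_x$ is the frequency of term $t$ in the sampled document set, $l_x$ is the total length of the sampled documents, $F$ is the frequency of $t$ in the whole collection, and $token_c$ is the total number of tokens in the collection. A term that is distributed in the sample much as it is in the collection gets a weight near zero, which marks it as uninformative.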

16 TBRSA Algorithm [Diagram: pick a random term; retrieve the documents containing it; weight the terms with the KL divergence measure; normalise weights by the max weight; rank in ascending order; keep the top X ranked; repeat Y times.]

17 TBRSA Algorithm (cont.) [Diagram: the top-X lists from the Y repetitions are merged and sorted; the top L ranked terms are extracted as stopwords.]
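The TBRSA steps named on these two slides (random term, retrieve, KL weighting, normalise, rank ascending, top X, repeat Y times, merge, sort, top L) can be sketched as below. This is an illustrative reading, not the authors' code: whitespace tokenisation, the KL-style weight, averaging duplicate weights when merging, and all function and parameter names are assumptions.

```python
import math
import random
from collections import Counter


def kl_weights(sample_docs, coll_tf, coll_tokens):
    """KL-style weight of each term in the sample against the whole collection."""
    sample_tf = Counter(term for doc in sample_docs for term in doc)
    sample_len = sum(sample_tf.values())
    weights = {}
    for term, tf_x in sample_tf.items():
        p_x = tf_x / sample_len              # probability in the sample
        p_c = coll_tf[term] / coll_tokens    # probability in the collection
        weights[term] = p_x * math.log2(p_x / p_c)
    return weights


def tbrsa(documents, x, y, l, seed=0):
    """Return up to l stopword candidates from y random samples of top-x terms."""
    rng = random.Random(seed)
    docs = [doc.lower().split() for doc in documents]
    coll_tf = Counter(term for doc in docs for term in doc)
    coll_tokens = sum(coll_tf.values())
    vocabulary = sorted(coll_tf)
    collected = {}  # term -> normalised weights seen across repetitions
    for _ in range(y):
        seed_term = rng.choice(vocabulary)                  # random term
        sample = [doc for doc in docs if seed_term in doc]  # retrieve
        weights = kl_weights(sample, coll_tf, coll_tokens)
        max_weight = max(weights.values())
        if max_weight <= 0:
            continue  # sample mirrors the collection; nothing to learn
        normalised = {t: w / max_weight for t, w in weights.items()}
        # Ascending order: least informative terms first; keep the top x.
        least = sorted(normalised.items(), key=lambda kv: (kv[1], kv[0]))[:x]
        for term, weight in least:
            collected.setdefault(term, []).append(weight)
    # Merge: average the weights of terms seen in several repetitions (assumed).
    merged = {t: sum(ws) / len(ws) for t, ws in collected.items()}
    ranked = sorted(merged.items(), key=lambda kv: (kv[1], kv[0]))
    return [term for term, _ in ranked[:l]]


corpus = ["the cat sat on the mat", "the dog chased the cat", "a dog is a pet"]
candidates = tbrsa(corpus, x=3, y=20, l=2)
```

Fixing the seed makes a run repeatable, which helps when comparing different values of X, Y and L.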

18 Advantages / Disadvantages. Advantages: based on how informative a term is; minimal computational effort compared to the baselines; better coverage of the collection; no need to monitor progress. Disadvantages: the first term is generated randomly, which could retrieve a small data set; experiments must be repeated Y times.

19 Experimental Setup: Four TREC collections (http://trec.nist.gov/data/docs_eng.html). Each collection is indexed and stemmed with no pre-defined stopwords removed, i.e. no assumption of stopwords in the beginning. Long queries were used (Title, Description and Narrative) to maximise our chances of using the new stopword lists.

20 Experimental Platform: Terrier (TERabyte RetrIEveR), IR Group, University of Glasgow. Based on the Divergence From Randomness (DFR) framework for deriving parameter-free probabilistic models. PL2 model. http://ir.dcs.gla.ac.uk/terrier/

21 PL2 Model: One of the DFR document weighting models. The relevance score of a document d for query Q is: [formula not preserved in the transcript]
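The PL2 formula is missing from the transcript. The form below is the one commonly given for PL2 in the DFR literature and Terrier documentation; that the slide used exactly this form is an assumption, as are the symbol names:

```latex
score(d, Q) = \sum_{t \in Q} qtw \cdot \frac{1}{tfn + 1}
\left( tfn \cdot \log_2 \frac{tfn}{\lambda}
     + \left( \lambda + \frac{1}{12 \cdot tfn} - tfn \right) \log_2 e
     + 0.5 \cdot \log_2 (2 \pi \cdot tfn) \right)

tfn = tf \cdot \log_2 \left( 1 + c \cdot \frac{avg\_l}{l} \right),
\qquad
\lambda = \frac{F}{N}
```

Here $qtw$ is the query term weight, $tf$ the within-document frequency of $t$, $l$ the document length, $avg\_l$ the average document length, $F$ the frequency of $t$ in the collection, $N$ the number of documents, and $c$ the term frequency normalisation parameter, which is the "c value" reported per collection.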

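22 Collections: disk45, WT2G, WT10G and DOTGOV

| Collection | Size | # Docs | # Tokens | c value |
|---|---|---|---|---|
| disk45 | 2GB | 528155 | 801397 | 2.13 |
| WT2G | 2GB | 247491 | 1020277 | 2.75 |
| WT10G | 10GB | 1692096 | 3206346 | 2.43 |
| DOTGOV | 18GB | 1247753 | 2821821 | 2.00 |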

23 Queries

| Collection | Query Sets | # Queries |
|---|---|---|
| disk45 | TREC7 and TREC8 of ad-hoc tasks | 100 |
| WT2G | TREC8 | 50 |
| WT10G | TREC10 | 50 |
| DOTGOV | TREC11 and TREC12 merged | 100 |

24 Merging Stopword Lists: Merge the classical list with the best list generated by the baseline approach and by the novel approach respectively. Add the 2 lists together and remove duplicates. The merged list might be stronger in terms of effectiveness. This follows from the classical IR technique of combining evidence.
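Merging amounts to a union of the two lists with duplicates removed. A minimal sketch, assuming lowercase comparison and that the order of first appearance should be kept (neither detail is stated on the slide; the function name and sample terms are hypothetical):

```python
def merge_stopword_lists(classical, generated):
    """Concatenate two stopword lists, dropping duplicates and keeping
    the order in which terms first appear."""
    seen = set()
    merged = []
    for term in list(classical) + list(generated):
        key = term.lower()
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


merged = merge_stopword_lists(["the", "is", "and"], ["http", "the", "html"])
```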

25 Results and Analysis: Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach). Compare the results obtained to Fox's classical stopword list, based on average precision.
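For reference, average precision for a single query can be computed as in the sketch below, using the standard non-interpolated definition. The function name and document identifiers are hypothetical, and the actual evaluation tool used in the experiments is not named in the slides.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision of one ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Averaging this value over all queries of a collection gives the per-collection figure that candidate stopword lists are compared on.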

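26 Baseline Approach – Overall Results: * indicates a significant difference at the 0.05 level. Normalised IDF is the best-performing variant for every collection.

| Collection | Classical | TF | Norm TF | IDF | Norm IDF | p-value |
|---|---|---|---|---|---|---|
| disk45 | 0.2123 | 0.2130 | 0.2123 | 0.2113 | 0.2130 | 0.8845 |
| WT2G | 0.2569 | 0.2650 | 0.2676 | 0.2682 | 0.2700 | 0.001508* |
| WT10G | 0.2000 | 0.2049 | 0.2076 | | 0.2079 | 0.1231 |
| DOTGOV | 0.1223 | 0.1212 | 0.1208 | | 0.1227 | 0.55255 |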

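27 Baseline Approach – Additional Terms Produced

| disk45 | WT2G | WT10G | DOTGOV |
|---|---|---|---|
| financial | html | able | content |
| company | http | copyright | gov |
| president | htm | ok | define |
| people | internet | http | year |
| market | web | html | administrate |
| london | today | january | http |
| national | policy | history | web |
| structure | content | facil | economic |
| january | document | html | year |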

28 TBRSA – Overall Results: * indicates a significant difference at the 0.05 level. disk45 and WT2G both show improvements.

| Collection | Classical | Best Obtained | p-value |
|---|---|---|---|
| disk45 | 0.2123 | 0.2129 | 0.868 |
| WT2G | 0.2569 | 0.2668 | 0.07544 |
| WT10G | 0.2000 | 0.1900 | 0.4493 |
| DOTGOV | 0.1223 | 0.1180 | 0.002555* |

29 TBRSA – Additional Terms Produced

| disk45 | WT2G | WT10G | DOTGOV |
|---|---|---|---|
| column | advance | copyright | server |
| general | beach | friend | modify |
| califonia | company | memory | length |
| industry | environment | mountain | content |
| month | garden | problem | accept |
| director | industry | science | inform |
| desk | material | special | connect |
| economic | pollution | internet | gov |
| business | school | document | byte |

30 Refinement - Merging: The new approach (TBRSA) gives comparable results with less computational effort. Fox's classical stopword list was very effective, despite its old age, so it is worth using. Queries were quite conservative.

31 Merging – Baseline Approach: * indicates a significant difference at the 0.05 level. Produced a more effective stopword list.

| Collection | Classical | Norm IDF | Merged | p-value |
|---|---|---|---|---|
| disk45 | 0.2123 | 0.2130 | | 0.8845 |
| WT2G | 0.2569 | 0.2700 | 0.2712 | 0.00746* |
| WT10G | 0.2000 | 0.2079 | 0.2109 | 0.03854* |
| DOTGOV | 0.1223 | 0.1227 | 0.1241 | 0.6775 |

32 Merging – TBRSA: * indicates a significant difference at the 0.05 level. Produced an improved stopword list with less computational effort.

| Collection | Classical | Best Obtained | Merged | p-value |
|---|---|---|---|---|
| disk45 | 0.2123 | 0.2129 | | 0.868 |
| WT2G | 0.2569 | 0.2668 | 0.2703 | 0.008547* |
| WT10G | 0.2000 | 0.1900 | 0.2066 | 0.4451 |
| DOTGOV | 0.1223 | 0.1180 | 0.1228 | 0.5085 |

33 Conclusion & Future Work: Proposed a novel approach for automatically generating a stopword list. Effectiveness and robustness. Compared to 4 baseline variants based on Zipf's Law. Merge the classical stopword list with the best found result to produce a more effective stopword list.

34 Conclusion & Future Work (cont.): Investigate other divergence metrics, e.g. a Poisson-based approach. Verb vs noun: "I can open a can of tuna with a can opener"; "to be or not to be". Detect the nature of the context: might have to keep some of the terms but remove others.

35 Thank you! Any questions? Thank you for your attention

