
1
Automatically Building a Stopword List for an Information Retrieval System University of Glasgow Rachel Tsz-Wai Lo, Ben He, Iadh Ounis

2
Outline
- Stopwords
- Investigation of two approaches
  - Approach based on Zipf's Law
  - New term-based random sampling approach
- Experimental Setup
- Results and Analysis
- Conclusion

3
What is a Stopword?
- A common word in a document, e.g. the, is, and, am, to, it
- Carries no information about the documents it appears in
- Low discrimination value in terms of IR: meaningless, contributes nothing
- Searching with stopwords will usually retrieve irrelevant documents

4
Objective
- Different collections contain different content and word patterns
- Different collections may therefore require different sets of stopwords
- Given a collection of documents, investigate ways to automatically create a stopword list

5
Objective (cont.)
1. Baseline approach (benchmark): 4 variants inspired by Zipf's Law
   - TF
   - Normalised TF
   - IDF
   - Normalised IDF
2. New proposed approach: based on how informative a term is

6
Fox's Classical Stopword List and Its Weaknesses
- Contains 733 stopwords
- More than 20 years old, so it lacks potentially new words
- Defined for general purpose, but different collections require different stopword lists
- Outdated

7
Zipf's Law
- Based on the term frequencies of terms, rank the terms accordingly
- The term with the highest TF gets rank 1, the next highest rank 2, etc.
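The rank/frequency relation above can be made concrete with a tiny sketch. The function name and the exponent parameter s are our own illustration, not from the slides; classic Zipf uses s ≈ 1 for natural language.

```python
def zipf_frequency(top_frequency, rank, s=1.0):
    """Zipf's law: the frequency of the rank-r term is roughly
    proportional to 1 / r**s, so f(r) ~ f(1) / r**s.
    Illustrative helper; name and parameters are ours."""
    return top_frequency / rank ** s

# If the top-ranked term occurs 1000 times, Zipf predicts roughly
# 500 occurrences at rank 2 and 250 at rank 4.
```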

8
[Figure: frequency vs. rank graph illustrating Zipf's Law]

9
Baseline Approach Algorithm
- Generate a list of term frequencies from the corpus
- Sort the frequencies in descending order
- Rank the terms by frequency: the highest frequency gets rank 1, the next highest rank 2, etc.
- Draw a graph of frequency vs. rank
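The counting-and-ranking steps above can be sketched in a few lines. This is a minimal illustration of the procedure, not the authors' code; the helper name is ours.

```python
from collections import Counter

def rank_terms_by_frequency(corpus_tokens):
    """Count term frequencies and rank terms in descending order of
    frequency (rank 1 = most frequent). Returns (term, freq, rank)
    tuples, covering the first three steps of the baseline algorithm."""
    freqs = Counter(corpus_tokens)
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [(term, freq, rank) for rank, (term, freq) in enumerate(ranked, start=1)]

tokens = "the cat sat on the mat and the dog sat".split()
ranking = rank_terms_by_frequency(tokens)
# 'the' occurs 3 times, so it takes rank 1
```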

10
Baseline Approach Algorithm (cont.)

11
- Choose a threshold; any word that appears above the threshold is treated as a stopword
- Run the queries with this stopword list, removing all stopwords from the queries
- Evaluate the system with average precision

12
Baseline Approach - Variants
- Term Frequency (TF)
- Normalised TF
- Inverse Document Frequency (IDF)
- Normalised IDF
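The slide lists the variants without formulas; as one plausible reading, IDF is the textbook definition and "normalised" divides scores by the maximum so they fall in [0, 1]. Both helpers below are our own sketch under that assumption.

```python
import math

def idf(n_docs, doc_freq):
    """Textbook inverse document frequency for a term occurring in
    doc_freq of the n_docs documents. A term in every document gets
    idf 0 -- a natural stopword signal."""
    return math.log2(n_docs / doc_freq)

def normalise(scores):
    """Divide by the maximum so every score falls in [0, 1]; our
    assumed meaning of the slide's 'normalised' variants."""
    m = max(scores)
    return [s / m for s in scores]

scores = [idf(1000, df) for df in (1000, 100, 10)]
# a term in every document gets idf 0; rarer terms get higher idf
```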

13
Baseline Approach – Choosing a Threshold
- To produce the best set of stopwords, more than 50 stopword lists were generated for each variant
- Investigate the frequency difference between two consecutive ranks: a big difference (i.e. a sudden jump) marks a natural cut-off
- Choosing an appropriate threshold is important
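The "sudden jump" idea can be sketched as picking the rank with the largest drop between consecutive frequencies. This is only one hedged interpretation of the heuristic; the helper name is ours.

```python
def find_threshold_rank(frequencies):
    """Given frequencies sorted in descending order, return the
    1-based rank after which the largest drop between consecutive
    frequencies occurs; terms at or above that rank are candidate
    stopwords."""
    best_rank, best_drop = 1, -1
    for i in range(len(frequencies) - 1):
        drop = frequencies[i] - frequencies[i + 1]
        if drop > best_drop:
            best_drop, best_rank = drop, i + 1
    return best_rank

freqs = [1000, 950, 900, 300, 280, 270]  # big jump after rank 3
```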

14
Term-Based Random Sampling Approach (TBRSA)
- Our proposed new approach
- Depends on how informative a term is
- Based on the Kullback-Leibler divergence measure
- Similar in spirit to query expansion

15
Kullback-Leibler Divergence Measure
- Used to measure the distance between two distributions; in our case, the distributions of two terms, one of which is a random term
- The weight of a term t in the sampled document set is given by:
  w(t) = Px(t) · log2( Px(t) / P(t) )
  where Px(t) is the relative frequency of t in the sampled document set and P(t) is its relative frequency in the whole collection
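The weight w(t) translates directly into code. This is a sketch assuming the relative-frequency reading of Px(t) and P(t) above; the parameter names are ours.

```python
import math

def kl_term_weight(tf_sample, len_sample, tf_collection, len_collection):
    """Weight of term t in the sampled document set:
        w(t) = Px(t) * log2(Px(t) / P(t))
    where Px(t) = tf_sample / len_sample is the term's relative
    frequency in the sample and P(t) = tf_collection / len_collection
    its relative frequency in the whole collection."""
    p_x = tf_sample / len_sample
    p = tf_collection / len_collection
    return p_x * math.log2(p_x / p)

# A term twice as frequent in the sample as in the collection:
w = kl_term_weight(tf_sample=10, len_sample=1000,
                   tf_collection=5000, len_collection=1_000_000)
```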

16
TBRSA Algorithm
- Repeat Y times:
  - Pick a random term and retrieve the documents containing it
  - Weight every term in the sample with the KL divergence measure
  - Normalise the weights by the maximum weight
  - Rank the terms in ascending order of weight
  - Keep the top X ranked terms

17
TBRSA Algorithm (cont.)
- Merge the Y lists, sort by weight, and extract the top L ranked terms as stopwords
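Putting the two slides together, the whole procedure can be sketched end to end. The parameter names Y, X and L follow the slides; everything else (data layout, tie-breaking, the merge rule) is our own illustration, not the authors' implementation.

```python
import math
import random
from collections import Counter

def tbrsa_stopwords(docs, Y=5, X=50, L=30, seed=0):
    """Term-based random sampling sketch: repeat Y times -- pick a
    random term, retrieve the documents containing it, weight every
    term of that sample with the KL divergence measure against the
    collection, normalise by the maximum weight, rank ascending
    (least informative first) and keep the top X; finally merge the
    Y lists, sort, and extract the top L terms as stopwords."""
    rng = random.Random(seed)
    coll = Counter(t for d in docs for t in d)
    coll_len = sum(coll.values())
    vocab = sorted(coll)
    merged = {}
    for _ in range(Y):
        pivot = rng.choice(vocab)                 # random term
        sample = [d for d in docs if pivot in d]  # retrieve its documents
        s = Counter(t for d in sample for t in d)
        s_len = sum(s.values())
        weights = {t: (tf / s_len) * math.log2((tf / s_len) / (coll[t] / coll_len))
                   for t, tf in s.items()}
        max_w = max(abs(w) for w in weights.values()) or 1.0
        ranked = sorted(weights.items(), key=lambda kv: kv[1] / max_w)
        for t, w in ranked[:X]:                   # least informative first
            merged[t] = min(w / max_w, merged.get(t, float("inf")))
    final = sorted(merged.items(), key=lambda kv: kv[1])
    return [t for t, _ in final[:L]]
```

On a toy corpus where "the" appears in every document, its sample distribution matches the collection distribution, its KL weight is near zero, and it surfaces as a stopword.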

18
Advantages / Disadvantages
Advantages:
- Based on how informative a term is
- Minimal computational effort compared to the baselines
- Better coverage of the collection
- No need to monitor progress
Disadvantages:
- The first term is generated randomly, so it could retrieve a small document set
- Experiments must be repeated Y times

19
Experimental Setup
- Four TREC collections
- Each collection is indexed and stemmed, with no pre-defined stopwords removed: no assumption of stopwords at the start
- Long queries were used (Title, Description and Narrative) to maximise our chances of exercising the new stopword lists

20
Experimental Platform
- Terrier (TERabyte RetrIEveR), developed by the IR Group, University of Glasgow
- Based on the Divergence From Randomness (DFR) framework for deriving parameter-free probabilistic models
- PL2 model

21
PL2 Model
- One of the DFR document weighting models
- The relevance score of a document d for a query Q is:
  score(d, Q) = Σ_{t∈Q} qtw · (1 / (tfn + 1)) · ( tfn · log2(tfn / λ) + (λ − tfn) · log2 e + 0.5 · log2(2π · tfn) )
  where tfn = tf · log2(1 + c · avg_l / l) is the normalised term frequency (c a hyper-parameter, l the document length, avg_l the average document length), λ = F/N is the mean of the assumed Poisson distribution (F the term's collection frequency, N the number of documents), and qtw is the query term weight
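The per-term contribution can be written out directly. This follows the standard DFR formulation of PL2 rather than anything shown on the slide, and the parameter names are ours.

```python
import math

def pl2_score(tf, doc_len, avg_doc_len, term_coll_freq, n_docs, c=1.0, qtw=1.0):
    """PL2 relevance contribution of a single query term:
        tfn = tf * log2(1 + c * avg_l / l)
        lam = F / N
        qtw / (tfn + 1) * ( tfn * log2(tfn / lam)
              + (lam - tfn) * log2(e) + 0.5 * log2(2 * pi * tfn) )
    c is the hyper-parameter of the term frequency normalisation."""
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    lam = term_coll_freq / n_docs
    return qtw / (tfn + 1.0) * (
        tfn * math.log2(tfn / lam)
        + (lam - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2.0 * math.pi * tfn)
    )
```

At equal within-document frequency, a rare term scores higher than a very common one, which is exactly why low PL2-style informativeness flags stopword candidates.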

22
Collections
- disk45, WT2G, WT10G and DOTGOV

Collection | Size | # Docs | # Tokens | c value
disk45     | 2GB  | —      | —        | —
WT2G       | 2GB  | —      | —        | —
WT10G      | 10GB | —      | —        | —
DOTGOV     | 18GB | —      | —        | —

23
Queries

Collection | Query Sets                    | # Queries
disk45     | TREC7 and TREC8 ad-hoc tasks  | 100
WT2G       | TREC8                         | 50
WT10G      | TREC10                        | 50
DOTGOV     | TREC11 and TREC12 merged      | 100

24
Merging Stopword Lists
- Merge the classical list with the best list generated by the baseline and the novel approach, respectively
- Add the two lists together and remove duplicates
- The merged list might be stronger in terms of effectiveness
- Follows the classical IR technique of combining evidence
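The merging step is a simple order-preserving union. A minimal sketch, with the helper name ours:

```python
def merge_stopword_lists(classical, generated):
    """Add two stopword lists together and remove duplicates,
    keeping first-seen order (classical terms first)."""
    seen, merged = set(), []
    for term in classical + generated:
        if term not in seen:
            seen.add(term)
            merged.append(term)
    return merged

merged = merge_stopword_lists(["the", "is", "and"], ["http", "the", "html"])
# -> ["the", "is", "and", "http", "html"]
```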

25
Results and Analysis
- Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach)
- Compare the results obtained against Fox's classical stopword list, based on average precision

26
Baseline Approach – Overall Results
- * indicates a significant difference at the 0.05 level
- Normalised IDF performed best, for every collection

Collection | Classical | TF | Norm TF | IDF | Norm IDF | p-value
disk45     | —         | —  | —       | —   | —        | —
WT2G       | —         | —  | —       | —   | —        | *
WT10G      | —         | —  | —       | —   | —        | —
DOTGOV     | —         | —  | —       | —   | —        | —

27
Baseline Approach – Additional Terms Produced

disk45    | WT2G     | WT10G     | DOTGOV
financial | html     | able      | content
company   | http     | copyright | gov
president | htm      | ok        | define
people    | internet | http      | year
market    | web      | html      | administrate
london    | today    | january   | http
national  | policy   | history   | web
structure | content  | facil     | economic
january   | document | html      | year

28
TBRSA – Overall Results
- * indicates a significant difference at the 0.05 level
- disk45 and WT2G both show improvements

Collection | Classical | Best Obtained | p-value
disk45     | —         | —             | —
WT2G       | —         | —             | —
WT10G      | —         | —             | —
DOTGOV     | —         | —             | *

29
TBRSA – Additional Terms Produced

disk45     | WT2G        | WT10G    | DOTGOV
column     | advance     | copyright| server
general    | beach       | friend   | modify
california | company     | memory   | length
industry   | environment | mountain | content
month      | garden      | problem  | accept
director   | industry    | science  | inform
desk       | material    | special  | connect
economic   | pollution   | internet | gov
business   | school      | document | byte

30
Refinement - Merging
- The new approach (TBRSA) gives comparable results with less computational effort
- Fox's classical stopword list was very effective, despite its age, so it is worth using
- The queries were quite conservative

31
Merging – Baseline Approach
- * indicates a significant difference at the 0.05 level
- Produced a more effective stopword list

Collection | Classical | Norm IDF | Merged | p-value
disk45     | —         | —        | —      | —
WT2G       | —         | —        | —      | *
WT10G      | —         | —        | —      | *
DOTGOV     | —         | —        | —      | —

32
Merging – TBRSA
- * indicates a significant difference at the 0.05 level
- Produced an improved stopword list with less computational effort

Collection | Classical | Best Obtained | Merged | p-value
disk45     | —         | —             | —      | —
WT2G       | —         | —             | —      | *
WT10G      | —         | —             | —      | —
DOTGOV     | —         | —             | —      | —

33
Conclusion & Future Work
- Proposed a novel approach for automatically generating a stopword list: effective and robust
- Compared against 4 baseline variants based on Zipf's Law
- Merging the classical stopword list with the best result found produces a more effective stopword list

34
Conclusion & Future Work (cont.)
- Investigate other divergence metrics, e.g. a Poisson-based approach
- Verb vs. noun ambiguity: "I can open a can of tuna with a can opener", "to be or not to be"
- Detect the nature of the context: we might have to keep some occurrences of a term but remove others

35
Thank you for your attention! Any questions?
