An Efficient Algorithm for Incremental Update of Concept Space Presented by Felix Cheung
Overview Background Introduction to Concept Space The Problem of Concept Space The Idea of the Solution Performance Evaluation Conclusion
Background Vocabulary Problem Retrieval failure is caused by the variety of terms used for the same concept Such as HIV vs. AIDS Two people choose the same word with a probability of less than 20% One solution: thesauri
Thesauri A thesaurus is a book of words grouped according to the connections between their meanings It addresses the vocabulary problem If a search retrieves too few documents, a user can expand the query The problem with thesauri Manual construction is very complex
Introduction to Concept Space It is an automatic approach to thesaurus construction Given terms j & k, a concept space has associations Wjk and Wkj Wjk and Wkj are asymmetric An association is a value between 0 and 1
Concept Space Construction The construction of concept space consists of two phases An automatic indexing phase A document collection is processed to build inverted lists
Inverted Lists [Figure: inverted lists for terms a and b; each entry holds a doc. id and a term frequency (tf)]
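As an illustration of the automatic indexing phase, here is a minimal sketch of building inverted lists. It assumes documents are already tokenized (stop-word removal and stemming done elsewhere); the function name `build_inverted_lists` and the toy documents are hypothetical and not taken from the slides.

```python
from collections import Counter, defaultdict


def build_inverted_lists(docs):
    """Map each term to a {doc_id: term_frequency} dictionary."""
    inverted = defaultdict(dict)
    for doc_id, tokens in docs.items():
        # Counter gives the term frequency (tf) of each term in this document.
        for term, tf in Counter(tokens).items():
            inverted[term][doc_id] = tf
    return dict(inverted)


# Toy collection (hypothetical): doc id -> list of index terms.
docs = {
    1: ["a", "b", "a"],
    2: ["b", "c"],
    3: ["a", "c", "c"],
}
inverted = build_inverted_lists(docs)
```

Each inverted list then holds (doc. id, tf) pairs for one term, which is the only input the co-occurrence analysis phase needs.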
Concept Space Construction The construction of concept space consists of two phases A co-occurrence analysis phase The associations of every term pair are computed based on the following equation Wjk = (sum of co-occurrence TFIDF scores of j and k / sum of TFIDF scores of j) × WeightingFactor(k)
The sum of TFIDF scores To compute the sum of all TFIDF scores of keyword j in all the documents: sum over all docs i of dij, with dij = tfij × log(N / dfj) where tfij = term frequency of j in doc i dfj = number of docs with j N = number of docs in db
Weighting Factor The Weighting Factor is used to penalize the general terms
The sum of co-occurrence TFIDF scores To find the sum of all co-occurrence TFIDF scores of keywords j and k in all the documents: sum over all docs i of dijk, with dijk = tfijk × log(N / dfjk) where dfjk = number of docs with both j and k tfijk = min(tfij, tfik) N = number of docs in db
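The three quantities above can be combined into one association value. The slide equations did not survive extraction, so the sketch below assumes the commonly used concept-space form: dij = tfij × log(N / dfj), dijk = min(tfij, tfik) × log(N / dfjk), WeightingFactor(k) = log(N / dfk) / log N, and Wjk = (Σ dijk / Σ dij) × WeightingFactor(k). The log base, the function names, and the toy inverted lists are assumptions.

```python
import math


def tfidf_sum(inverted, j, n_docs):
    """Sum of TFIDF scores of term j over all documents (assumed form)."""
    postings = inverted[j]
    idf = math.log(n_docs / len(postings))  # len(postings) is df_j
    return sum(tf * idf for tf in postings.values())


def cooccurrence_tfidf_sum(inverted, j, k, n_docs):
    """Sum of co-occurrence TFIDF scores of terms j and k (assumed form)."""
    pj, pk = inverted[j], inverted[k]
    shared = pj.keys() & pk.keys()  # docs containing both j and k
    if not shared:
        return 0.0
    idf = math.log(n_docs / len(shared))  # len(shared) is df_jk
    return sum(min(pj[d], pk[d]) * idf for d in shared)


def weighting_factor(inverted, k, n_docs):
    """Penalize general terms: close to 0 when k occurs in most documents."""
    return math.log(n_docs / len(inverted[k])) / math.log(n_docs)


def association(inverted, j, k, n_docs):
    """Asymmetric association W_jk under the assumptions stated above."""
    denom = tfidf_sum(inverted, j, n_docs)
    if denom == 0.0:
        return 0.0
    ratio = cooccurrence_tfidf_sum(inverted, j, k, n_docs) / denom
    return ratio * weighting_factor(inverted, k, n_docs)


# Toy inverted lists (hypothetical): term -> {doc_id: tf}, in a db of 4 docs.
inverted = {"a": {1: 2, 3: 1}, "b": {1: 1, 2: 1}, "c": {2: 1, 3: 2}}
w_ab = association(inverted, "a", "b", n_docs=4)
w_ba = association(inverted, "b", "a", n_docs=4)
```

On this toy data the two directions differ (w_ab is 1/3, w_ba is 1/2), which illustrates why Wjk and Wkj are stored separately.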
A Complete Concept Space A complete concept space is gigantic Each term may have a few thousand related terms => overwhelm searchers Only highly related terms are suggested
Highly related terms There are 1,708,551 co-occurrence pairs The max no. of related terms = 100 If no. of related terms > 100, only the 100 terms with the highest association values are retained (strong associations) Only highly ranked associations are kept: this is called a partial concept space
The Problem of Concept Space In a dynamic environment, the collection changes with time => concept space update The simplest approach => reconstruct from scratch Disadvantage: time consuming Goal: study the incremental update problem of partial concept spaces
The Definition A set of documents is added to a document collection (D), giving a new document collection (D’) The constructed concept space (CSD) is updated to an updated concept space (CSD’) Only n strong associations are kept
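Keeping only the strongest associations of a term can be sketched as a top-n selection. The function name `strongest_associations` and the sample weights are hypothetical; the default of 100 follows the slide.

```python
import heapq


def strongest_associations(weights, n=100):
    """Keep the n related terms with the highest association values.

    weights maps each related term k to W_jk for one fixed term j.
    """
    top = heapq.nlargest(n, weights.items(), key=lambda kv: kv[1])
    return dict(top)


# Hypothetical association values for one term j.
weights_j = {"aids": 0.8, "virus": 0.5, "cell": 0.1}
partial = strongest_associations(weights_j, n=2)
```

Applying this to every term yields the partial concept space: at most n entries per term instead of a few thousand.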
The Idea of the Pruning Algorithm Avoid scanning inverted lists directly Calculate an easily computed upper bound of W’jk Compare it with a threshold τj The property of τj If the upper bound is below τj, W’jk cannot be a strong association
The upper bound
How to determine the threshold τj Compute the n associations W’jki for which Wjki is strong w.r.t. the document collection D (1 ≤ i ≤ n) Set τj = min(W’jki) Given p, if τj > the upper bound of W’jp, then W’jp < all n W’jki’s
Pruning Algorithm For each term j Compute the association W’jki w.r.t. D’ if Wjki is strong w.r.t. D Determine the threshold τj among the n such associations of term j Compute the upper bound of W’jp if Wjp is weak w.r.t. D Compute W’jp only if its upper bound ≥ τj Keep only the n largest associations of j
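The steps above can be sketched for a single term j as follows. The upper-bound formula is not recoverable from the slides, so it is passed in as a callable; `compute_weight`, `upper_bound`, `update_term`, and the toy values are all hypothetical names introduced for illustration, not the authors' implementation.

```python
import heapq


def update_term(j, old_strong, old_weak, compute_weight, upper_bound, n):
    """Return the n strongest associations of term j w.r.t. the new collection.

    old_strong / old_weak: terms whose association with j was strong / weak
    w.r.t. the old collection D.
    compute_weight(j, k): exact (expensive) new association value.
    upper_bound(j, p): cheap value guaranteed to be >= compute_weight(j, p).
    """
    # Step 1: recompute exactly the associations that were strong before.
    new_weights = {k: compute_weight(j, k) for k in old_strong}
    # Step 2: the threshold is the smallest of those n values. With fewer
    # than n values nothing can be pruned safely, so the threshold is -inf.
    if new_weights and len(new_weights) >= n:
        threshold = min(new_weights.values())
    else:
        threshold = float("-inf")
    # Step 3: a formerly weak term is evaluated exactly only if its upper
    # bound reaches the threshold; otherwise it cannot enter the top n.
    for p in old_weak:
        if upper_bound(j, p) >= threshold:
            new_weights[p] = compute_weight(j, p)
    # Step 4: keep only the n largest associations.
    return dict(heapq.nlargest(n, new_weights.items(), key=lambda kv: kv[1]))


# Toy setup (hypothetical): exact new weights and a loose but valid bound.
exact = {"a": 0.9, "b": 0.6, "c": 0.7, "d": 0.1}
computed = []  # records which exact computations were actually performed


def compute_weight(j, k):
    computed.append(k)
    return exact[k]


def upper_bound(j, p):
    return exact[p] + 0.05


result = update_term("j", ["a", "b"], ["c", "d"], compute_weight, upper_bound, n=2)
```

In this toy run the exact value for "d" is never computed, because its upper bound (0.15) is below the threshold (0.6); "c" passes the test and displaces "b" from the top 2.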
Quantization is in term of The amount of storage is very large High precision is not needed Some quantization techniques can be applied to reduce the storage requirement
Performance Evaluation “The Ohsumed Test Collection” is used 348,566 abstracts with 240,247 terms 169 MB in size (after stop-word removal and stemming) The algorithm is run on a 700 MHz Pentium III Xeon machine
Experiment I Half of the documents are picked as the original collection D The other half of the documents are partitioned into 10 equal parts These parts are added to D successively and cumulatively
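The slides do not say which quantization technique is used. As one possible illustration, the sketch below applies uniform quantization, which maps an association value in [0, 1] to an 8-bit integer; the choice of technique, the bit width, and the function names are assumptions.

```python
def quantize(w, bits=8):
    """Map an association value in [0, 1] to an integer code of `bits` bits."""
    levels = (1 << bits) - 1  # 255 for 8 bits
    return round(w * levels)


def dequantize(q, bits=8):
    """Recover an approximate association value from its integer code."""
    return q / ((1 << bits) - 1)


code = quantize(0.337)
approx = dequantize(code)
```

With 8 bits the rounding error is at most 1/510 per value, which is usually acceptable when associations are only used to rank related terms.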
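The update schedule of this experiment can be sketched as below. How the original half is picked is not stated on the slide, so the sketch simply takes the first half of the list; the function name `incremental_schedule` and the toy document ids are hypothetical.

```python
def incremental_schedule(doc_ids, parts=10):
    """Yield (current collection, added batch) pairs for successive updates."""
    half = len(doc_ids) // 2
    collection = list(doc_ids[:half])  # original collection D
    rest = doc_ids[half:]
    # Cut the remaining documents into `parts` near-equal consecutive parts.
    bounds = [i * len(rest) // parts for i in range(parts + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        batch = list(rest[lo:hi])
        yield list(collection), batch
        collection.extend(batch)  # batches are added cumulatively


schedule = list(incremental_schedule(list(range(40))))
```

Each step represents one incremental update: the concept space built on the current collection is updated with the next batch, and the grown collection becomes the starting point of the following step.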
Experiment I Result (I)
Experiment I Result (II)
Experiment I Result (III)
Experiment I Result (IV)
Experiment I Result (V)
Experiment II Another factor affects the performance: the size of the added document set The number of added documents varies from 17,400 to 174,000
Experiment II Result
Storage requirement
Conclusion The concept space approach is a very useful tool for information retrieval The construction and incremental update are very time consuming In many applications, only a partial concept space is needed To reduce the storage requirement, some quantization methods are proposed
Conclusion (Cont’d) The pruning algorithms are effective in avoiding the computation of weak associations A 9-fold speedup can be achieved