An Efficient Algorithm for Incremental Update of Concept space

1 An Efficient Algorithm for Incremental Update of Concept space
Presented by Felix Cheung

2 Overview
Background
Introduction to Concept Space
The Problem of Concept Space
The Idea of the Solution
Performance Evaluation
Conclusion

3 Background
The vocabulary problem: search failures are caused by the variety of terms that can describe the same concept, e.g., HIV vs. AIDS
Two people choose the same words for a concept less than 20% of the time
One solution: thesauri

4 Thesauri
A thesaurus is a book of words grouped together according to the connections between their meanings
Thesauri help solve the vocabulary problem: if a search retrieves too few documents, a user can expand his query with related terms
The problem with thesauri: manual construction is very complex

5 Introduction to Concept Space
A concept space is an automatic approach to thesaurus construction
Given terms j and k, a concept space holds the associations Wjk and Wkj
Wjk and Wkj are asymmetric (in general Wjk ≠ Wkj)
An association is a value between 0 and 1
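A minimal sketch of the data structure this implies: a nested dict keyed by term, so that W["HIV"]["AIDS"] need not equal W["AIDS"]["HIV"]. All term names and values here are illustrative, not from the paper.

```python
# Toy concept space as a nested dict; associations are asymmetric.
# Terms and values are illustrative only.
W = {
    "HIV":  {"AIDS": 0.82, "retrovirus": 0.41},
    "AIDS": {"HIV": 0.67, "epidemic": 0.35},
}

def association(j, k):
    """Return W_jk, or 0.0 if the pair has no recorded association."""
    return W.get(j, {}).get(k, 0.0)

print(association("HIV", "AIDS"))   # 0.82
print(association("AIDS", "HIV"))   # 0.67
```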

6 Concept Space Construction
The construction of a concept space consists of two phases
First, an automatic indexing phase: the document collection is processed to build inverted lists

7 Inverted Lists
[Table: each keyword's inverted list records, for every document containing it, the doc. id and the term frequency (tf)]
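The indexing phase can be sketched in a few lines: for each keyword, collect a map from doc. id to term frequency. The two toy documents are illustrative only.

```python
from collections import Counter, defaultdict

# Automatic indexing phase (sketch): build an inverted list per keyword,
# mapping doc id -> term frequency. Toy documents, illustrative only.
docs = {
    1: ["hiv", "aids", "virus"],
    2: ["aids", "epidemic", "aids"],
}

inverted = defaultdict(dict)            # keyword -> {doc id: tf}
for doc_id, terms in docs.items():
    for term, tf in Counter(terms).items():
        inverted[term][doc_id] = tf

print(dict(inverted["aids"]))           # {1: 1, 2: 2}
```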

8 Concept Space Construction
The construction of a concept space consists of two phases
Second, a co-occurrence analysis phase: the associations of every term pair are computed based on the equations on the following slides

9 The sum of TFIDF scores
To compute the sum of all TFIDF scores of keyword j in all the documents:
d_j = \sum_{i=1}^{N} tf_{ij} \times \log\left(\frac{N}{df_j}\right)
where tf_{ij} is the term frequency of j in doc i, df_j is the number of docs with j, and N is the number of docs in the db

10 Weighting Factor
The weighting factor is used to penalize general terms: a term that appears in many documents receives a smaller weight

11 The sum of co-occurrence TFIDF scores
To find the sum of all co-occurrence TFIDF scores of keywords j and k in all the documents:
d_{jk} = \sum_{i=1}^{N} tf_{ijk} \times \log\left(\frac{N}{df_{jk}}\right)
where df_{jk} is the number of docs with both j and k, tf_{ijk} = \min(tf_{ij}, tf_{ik}), and N is the number of docs in the db
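The two sums above combine into the association value. The combination W_jk = (d_jk / d_j) × w_k, with weighting factor w_k = log(N / df_k) / log(N), follows the classic Chen-and-Lynch concept-space formulation; this is an assumption, since the slide's own combining equation did not survive the transcript. The toy data is illustrative only.

```python
import math

# Co-occurrence analysis sketch. inverted maps keyword -> {doc id: tf};
# N is the number of docs in the db. The final combination
# W_jk = (d_jk / d_j) * w_k with w_k = log(N/df_k)/log(N) is an
# ASSUMED (Chen-Lynch-style) form, not taken from the slides.
inverted = {                      # toy inverted lists (illustrative)
    "a": {1: 2, 2: 1, 4: 1},
    "b": {1: 1, 3: 2},
}
N = 4                             # number of docs in the toy db

def d_single(j):
    """Sum of TFIDF scores of keyword j over all documents (slide 9)."""
    df_j = len(inverted[j])
    return sum(tf * math.log(N / df_j) for tf in inverted[j].values())

def d_pair(j, k):
    """Sum of co-occurrence TFIDF scores of j and k (slide 11)."""
    common = inverted[j].keys() & inverted[k].keys()
    df_jk = len(common)
    if df_jk == 0:
        return 0.0
    return sum(min(inverted[j][i], inverted[k][i]) * math.log(N / df_jk)
               for i in common)

def weight(k):
    """Weighting factor penalizing general terms (assumed form)."""
    return math.log(N / len(inverted[k])) / math.log(N)

def association(j, k):
    """W_jk -- note W_jk != W_kj in general (asymmetric)."""
    return d_pair(j, k) / d_single(j) * weight(k)

print(association("a", "b"), association("b", "a"))
```

Because d_j differs per term and the weighting factor depends only on the second term, the two directions give different values, which is exactly the asymmetry noted on slide 5.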

12 A Complete Concept Space
A complete concept space is gigantic: each term may have a few thousand related terms, which would overwhelm searchers
Therefore only highly related terms are suggested

13 Highly related terms
There are 1,708,551 co-occurrence pairs in the collection
The maximum number of related terms kept per term is 100
If a term has more than 100 related terms, only the 100 with the highest association values are retained (the strong associations)
A concept space containing only these highly-ranked associations is called a partial concept space
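The trimming step above is a top-n selection per term, which can be sketched with `heapq.nlargest` (n = 100 in the slides; n = 3 here so the toy example stays readable; terms and weights are illustrative).

```python
import heapq

# Keep only a term's n strongest associations (sketch).
# Illustrative data; the slides use n = 100.
assocs = {"t1": 0.9, "t2": 0.1, "t3": 0.5, "t4": 0.7, "t5": 0.3}
n = 3

strong = dict(heapq.nlargest(n, assocs.items(), key=lambda kv: kv[1]))
print(strong)   # {'t1': 0.9, 't4': 0.7, 't3': 0.5}
```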

14 The Problem of Concept Space
In a dynamic environment, the collection changes with time, so the concept space must be updated
The simplest approach is to reconstruct it from scratch; its disadvantage is that it is very time-consuming
This work studies the incremental update problem of partial concept spaces

15 The Definition
Given a document collection D and its constructed concept space CS_D
A new document collection D' is added to D
The concept space must be updated to CS_{D'}
Only the n strongest associations of each term are kept

16 The Idea of the Pruning Algorithm
Avoid scanning the inverted lists directly
Calculate an easily computed upper bound of W'_jk
Compare the upper bound with a threshold θ_j
The property of θ_j: if the upper bound ≤ θ_j, then W'_jk cannot be a strong association

17 The upper bound

18 How to determine θ_j
Compute the n associations W'_jk_i for which W_jk_i is strong w.r.t. the document collection D (1 ≤ i ≤ n)
Set θ_j = min(W'_jk_i)
Given a term p, if θ_j > the upper bound of W'_jp, then W'_jp is smaller than all n of the W'_jk_i's, so W'_jp cannot be strong

19 Pruning Algorithm
For each term j:
Compute the association W'_jk_i w.r.t. D' for every term k_i whose W_jk_i is strong w.r.t. D
Determine θ_j among these n associations of term j
Compute the upper bound of W'_jp for every term p whose W_jp is weak w.r.t. D
Compute W'_jp exactly only if its upper bound > θ_j
Keep only the n largest associations of j
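The steps above can be sketched for a single term j. The exact upper-bound formula of slide 17 did not survive the transcript, so here it is an arbitrary callable `upper_bound(j, p)`; all terms and values in the usage example are made up for illustration.

```python
# Pruning-algorithm sketch for one term j. `upper_bound` stands in for
# the (lost) formula of slide 17 -- an assumption.

def prune_term(j, strong_terms, weak_terms, exact_assoc, upper_bound, n):
    # Step 1: recompute W'_jk exactly for the old strong associations
    new_assoc = {k: exact_assoc(j, k) for k in strong_terms}
    # Step 2: theta_j is the smallest of these n associations
    theta_j = min(new_assoc.values())
    # Steps 3-4: for old weak associations, check the cheap upper bound
    # first; do the expensive exact computation only if it exceeds theta_j
    for p in weak_terms:
        if upper_bound(j, p) > theta_j:
            new_assoc[p] = exact_assoc(j, p)
        # else: W'_jp cannot be strong, so the computation is skipped
    # Step 5: keep only the n largest associations of j
    return dict(sorted(new_assoc.items(), key=lambda kv: -kv[1])[:n])

# Toy usage: exact values and bounds are illustrative (bound >= exact).
exact = {("j", "a"): 0.8, ("j", "b"): 0.6, ("j", "c"): 0.2, ("j", "d"): 0.7}
bound = {("j", "c"): 0.3, ("j", "d"): 0.9}
result = prune_term("j", ["a", "b"], ["c", "d"],
                    lambda j, k: exact[(j, k)],
                    lambda j, p: bound[(j, p)], n=2)
print(result)   # {'a': 0.8, 'd': 0.7}
```

Note that ("j", "c") is never computed exactly: its bound (0.3) is below θ_j (0.6), which is where the speedup comes from.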

20 Quantization
The amount of storage for the association values is very big
High precision is not needed
Quantization techniques can be applied to reduce the storage requirement
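A minimal sketch of such a technique: quantize each association (a value in [0, 1]) to an 8-bit integer, so one byte replaces a 4- or 8-byte float at the cost of an error of at most 1/(2 × 255) ≈ 0.2%. The slides do not fix a particular scheme; the number of levels is an illustrative choice.

```python
# Quantize associations in [0, 1] to one byte (sketch; the 255-level
# scheme is an illustrative assumption, not from the slides).
LEVELS = 255

def quantize(w):
    return round(w * LEVELS)            # fits in one byte

def dequantize(q):
    return q / LEVELS

q = quantize(0.7316)
print(q, dequantize(q))                 # 187 and roughly 0.7333
```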

21 Performance Evaluation
“The Ohsumed Test Collection” is used: 348,566 abstracts, 169 MB large (after stop-word removal and stemming)
The algorithm is run on a 700 MHz Pentium III Xeon machine

22 Experiment I
Half of the documents are picked as the original collection D
The other half are partitioned into 10 equal parts
These parts are added to D successively and cumulatively

23 Experiment I Result (I)

24 Experiment I Result (II)

25 Experiment I Result (III)

26 Experiment I Result (IV)

27 Experiment I Result (V)

28 Experiment II
Another factor affects the performance: the size of the added document batch
The size of the added batch is varied from 17,400 to 174,000 documents

29 Experiment II Result

30 Storage requirement

31 Conclusion
The concept space approach is a very useful tool for information retrieval
Its construction and incremental update are very time-consuming
In many applications, only a partial concept space is needed
To reduce the storage requirement, some quantization methods are proposed

32 Conclusion (Cont'd)
The pruning algorithms are effective in avoiding the computation of weak associations
A 9-times speedup can be achieved

