Set-Based Model: A New Approach for Information Retrieval
Bruno Pôssas, Nivio Ziviani, Wagner Meira Jr., Berthier Ribeiro-Neto
Department of Computer Science, Federal University of Minas Gerais, Brazil

LATIN - Lab for Treating Information -- Federal University of Minas Gerais, Brazil

Introduction
Vector space model (VSM)
- Query terms and documents are represented as weighted vectors in a vector space
- Query answers are documents whose representative vectors have high similarity to the query vector
- Term weighting scheme: TF x IDF
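As a point of reference for the rest of the talk, the VSM ranking described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes raw term frequency times log(N/df) as the TF x IDF weight and cosine similarity as the ranking function, and the names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return ({term: weight} per document, document frequencies).

    Assumed weighting: raw term frequency times log(N / df).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, df

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy collection (the same three documents used in the termset examples).
docs = [["A", "C", "T"], ["C", "D", "C", "D"], ["C", "D", "T"]]
vectors, df = tfidf_vectors(docs)

# Query {A, T}; each query term is assumed to occur once.
query = {t: math.log(len(docs) / df[t]) for t in ["A", "T"]}
scores = [cosine(query, v) for v in vectors]
print(scores)
```

Each term contributes to the score independently of the others, which is the independence assumption the next slide questions.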

Motivation
In VSM, index terms are assumed to be mutually independent
- Linear weighting function
- Not realistic but easy to compute
Our hypothesis: exploiting correlation among index terms might improve retrieval effectiveness

Our Goal
Propose a new model for computing index term weights, based on set theory
- Terms -> sets of terms (termsets)
- Correlation among index terms
- High retrieval effectiveness while keeping computational costs small
Exploit the intuition that occurrences of related terms are often close to each other

Related Work
Correlation among index terms
- Raghavan and Yu (1979)
- van Rijsbergen (1977), Harper and van Rijsbergen (1978)
- Wong et al. (1985 and 1987)
- Common limitations:
  - Dependency factors are expensive to compute
  - Exhaustive application of term co-occurrences hurts overall effectiveness and performance
Association rule mining
- Zaki (2000)

Termsets
- $T = \{t_1, t_2, \ldots, t_t\}$ is the set of $t$ unique terms of a collection of documents $D$.
- An n-termset $s$ is an ordered set of $n$ terms, such that $s \subseteq T$.
- $ds_s$ is the frequency of a termset $s$.
- $S$ is the set of $2^t$ unique termsets that may appear in a document (power set of $T$).

Termsets: Example
$D = \{d_1, d_2, d_3\}$
$T = \{A, C, D, T\}$
$S = \{s_A, s_C, \ldots, s_{AC}, s_{AD}, \ldots, s_{ACDT}\}$

Collection D:
  $d_1$: A C T
  $d_2$: C D C D
  $d_3$: C D T

$s_A = \{A\}$ (1-termset), $ds_A = 1$
$s_{CD} = \{C, D\}$ (2-termset), $ds_{CD} = 2$
$s_{CDT} = \{C, D, T\}$ (3-termset), $ds_{CDT} = 1$
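The frequencies in this example can be reproduced with a short script. It is a brute-force sketch that assumes, as the example values suggest, that the frequency of a termset is the number of documents containing all of its terms; `termset_frequencies` is a hypothetical helper name.

```python
from itertools import combinations

# The example collection from the slide.
collection = {
    "d1": ["A", "C", "T"],
    "d2": ["C", "D", "C", "D"],
    "d3": ["C", "D", "T"],
}

def termset_frequencies(collection):
    """Map each termset (a sorted tuple of terms) to the number of
    documents that contain every term of the termset."""
    freq = {}
    for terms in collection.values():
        unique = sorted(set(terms))
        # Every non-empty subset of a document's terms occurs in that document.
        for n in range(1, len(unique) + 1):
            for termset in combinations(unique, n):
                freq[termset] = freq.get(termset, 0) + 1
    return freq

ds = termset_frequencies(collection)
print(ds[("A",)], ds[("C", "D")], ds[("C", "D", "T")])  # 1 2 1
```

Only termsets that occur in at least one document are generated here, so the result is much smaller than the full power set of T.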

Termsets: Definitions
Frequent termset
- A termset whose frequency is greater than or equal to a given minimal frequency.
Closed termset
- A frequent termset that is the largest among the termsets occurring in exactly the same set of documents, so no proper superset has the same frequency.
The use of closed termsets significantly reduces the number of termsets taken into consideration.

Termsets: Example
Collection D:
  $d_1$: A C T
  $d_2$: C D C D
  $d_3$: C D T

[Figure: lattice of termsets with their frequencies; the legend distinguishes the empty set, frequent termsets, and closed termsets]
  { }
  A: 1    C: 3    D: 2    T: 2
  AC: 1   AT: 1   CD: 2   CT: 2   DT: 1
  ACT: 1  CDT: 1
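The two definitions can be checked against this lattice with a brute-force sketch. It assumes the standard reading of "closed" (no proper superset occurs in exactly the same documents) and is meant only for toy inputs; `closed_termsets` is a hypothetical name, not the authors' algorithm.

```python
from itertools import combinations

collection = {
    "d1": ["A", "C", "T"],
    "d2": ["C", "D", "C", "D"],
    "d3": ["C", "D", "T"],
}

def closed_termsets(collection, min_freq=1):
    """Return {termset: frequency} for the closed frequent termsets."""
    # Document set of every termset that occurs somewhere in the collection.
    doc_sets = {}
    for doc_id, terms in collection.items():
        unique = sorted(set(terms))
        for n in range(1, len(unique) + 1):
            for termset in combinations(unique, n):
                doc_sets.setdefault(frozenset(termset), set()).add(doc_id)

    # Frequent: occurs in at least min_freq documents.
    frequent = {s: d for s, d in doc_sets.items() if len(d) >= min_freq}

    # Closed: no proper superset shares the same document set.
    return {
        s: len(d)
        for s, d in frequent.items()
        if not any(s < other and d == other_docs
                   for other, other_docs in frequent.items())
    }

for termset, freq in sorted(closed_termsets(collection).items(),
                            key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(termset)), freq)
```

With a minimal frequency of 1, this keeps 5 of the 11 termsets in the lattice (C, CD, CT, ACT, CDT), which illustrates the reduction claimed for closed termsets.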

Set-Based Model
- Documents and queries are described by sets of closed termsets, instead of terms.
- Closed termsets provide all elements of the TF x IDF scheme.
- Computational cost is linear in the number of documents in the collection.

Set-Based Model: Termset Weights
Extension of the TF x IDF scheme
- $sf_{i,j}$: number of occurrences of $s_i$ in $d_j$
- $ds_i$: number of occurrences of $s_i$ in $D$
- $ids_i$: inverted frequency of occurrence of $s_i$ in $D$
SBM is equivalent to VSM if only 1-termsets are considered
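The weight formula itself is not preserved in this transcript. The sketch below is therefore an assumption made by direct analogy with TF x IDF: the weight is taken as sf times log(N/ds), and the number of occurrences of a multi-term termset in a document is assumed to be the smallest count among its terms. Both choices, and the name `termset_weight`, are illustrative only.

```python
import math
from collections import Counter

def termset_weight(termset, doc_terms, ds, n_docs):
    """Assumed weight of a termset in one document: sf * log(N / ds).

    termset   -- iterable of terms
    doc_terms -- list of the document's term occurrences
    ds        -- number of documents containing the termset
    n_docs    -- number of documents in the collection (N)
    """
    counts = Counter(doc_terms)
    # Assumption: a termset occurs as often as its rarest term in the document.
    sf = min(counts[t] for t in termset)
    if sf == 0 or ds == 0:
        return 0.0
    return sf * math.log(n_docs / ds)

# Termset {C, D} in d2 = "C D C D", with ds = 2 and N = 3.
print(termset_weight(("C", "D"), ["C", "D", "C", "D"], ds=2, n_docs=3))
```

For a 1-termset this reduces to the usual term weight, which matches the slide's remark about 1-termsets.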

Set-Based Model: Similarity Calculation
[Figure: documents $d_1$, $d_2$ and query $Q$ drawn as vectors in the space whose axes are the termsets $s_A$, $s_T$, and $s_{AT}$]
- Normalization uses just terms instead of termsets

Set-Based Model: Query Mechanism
SBM algorithm:
1. Obtain the 1-termsets from the query terms;
2. Enumerate all closed termsets from the 1-termsets;
3. Calculate similarities between the query and the documents using the closed termsets;
4. Normalize document similarities;
5. Select the k largest document similarities.
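The five steps above can be sketched end to end on a toy collection. This is a simplified illustration, not the authors' implementation. It rests on several assumptions: termsets are enumerated by brute force over the query terms, the weight is sf times log(N/ds) with sf taken as the smallest term count, every query termset has frequency 1, and the document norm is computed from single-term weights. The function name `sbm_rank` is hypothetical.

```python
import heapq
import math
from collections import Counter
from itertools import combinations

def sbm_rank(collection, query_terms, k=10, min_freq=1):
    """Rank documents for a query; returns up to k (doc_id, score) pairs."""
    n_docs = len(collection)
    counts = {d: Counter(terms) for d, terms in collection.items()}

    # Step 1: the 1-termsets are the query terms that occur in the collection.
    vocab = set().union(*(c.keys() for c in counts.values()))
    terms = sorted(set(query_terms) & vocab)

    # Step 2: enumerate frequent termsets over the query terms (brute force)
    # and keep those with no proper superset in the same documents.
    doc_sets = {}
    for n in range(1, len(terms) + 1):
        for ts in combinations(terms, n):
            docs = frozenset(d for d, c in counts.items() if all(c[t] for t in ts))
            if len(docs) >= max(min_freq, 1):
                doc_sets[frozenset(ts)] = docs
    closed = {s: d for s, d in doc_sets.items()
              if not any(s < o and d == od for o, od in doc_sets.items())}

    # Step 3: accumulate query-document similarities over closed termsets.
    scores = {}
    for s, docs in closed.items():
        ids = math.log(n_docs / len(docs))
        for d in docs:
            sf = min(counts[d][t] for t in s)      # assumed termset frequency
            scores[d] = scores.get(d, 0.0) + (sf * ids) * ids

    # Step 4: normalize by a document norm built from single terms only.
    df = Counter(t for c in counts.values() for t in c)
    for d in scores:
        norm = math.sqrt(sum((tf * math.log(n_docs / df[t])) ** 2
                             for t, tf in counts[d].items()))
        if norm:
            scores[d] /= norm

    # Step 5: select the k largest similarities.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

collection = {
    "d1": ["A", "C", "T"],
    "d2": ["C", "D", "C", "D"],
    "d3": ["C", "D", "T"],
}
print(sbm_rank(collection, ["A", "T"], k=2))
```

For the query {A, T}, the closed termsets are {T} and {A, T}; d1 contains both and is ranked ahead of d3, which only contains T.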

Experimental Results

                    Reference Collection
                    CFC        WSJ        TReC-3
# Documents         1,240      173,252    1,078,166
# Distinct Terms    2,105      230,902    1,016,709
# Queries           100300
Avg. Query Size     3.82       18.88      22.43
Size (MB)           1.9        509        3,225

TReC-3: Recall x Precision

Average Precision

              Average Precision (%)          SBM Gain (%)
Collection    VSM      GVSM     SBM          over VSM   over GVSM
CFC           22.42    24.47    26.56        18.47      8.54
WSJ           31.76    34.27    41.78        31.55      21.91
TReC-3        32.58    *        44.59        36.86      *

* GVSM could not be evaluated for the TReC-3 collection due to the exponential cost of the minterm build phase.
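The gain columns are consistent with a plain relative improvement of SBM over each baseline. The snippet below recomputes them from the precision values in the table; the formula is inferred from the numbers rather than stated on the slide, and `relative_gain` is a hypothetical helper.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Average precision (%) from the table; None marks the missing GVSM run.
avg_precision = {
    "CFC":    {"VSM": 22.42, "GVSM": 24.47, "SBM": 26.56},
    "WSJ":    {"VSM": 31.76, "GVSM": 34.27, "SBM": 41.78},
    "TReC-3": {"VSM": 32.58, "GVSM": None,  "SBM": 44.59},
}

for name, p in avg_precision.items():
    gains = {
        base: round(relative_gain(p["SBM"], p[base]), 2)
        for base in ("VSM", "GVSM")
        if p[base] is not None
    }
    print(name, gains)
```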

Average Precision at 10

              Average Precision at 10 (%)    SBM Gain (%)
Collection    VSM      GVSM     SBM          over VSM   over GVSM
CFC           10.97    12.93    16.02        46.03      23.90
WSJ           12.71    16.58    19.17        50.82      15.62
TReC-3        13.66    *        21.42        56.80      *

* GVSM could not be evaluated for the TReC-3 collection due to the exponential cost of the minterm build phase.

Computational Efficiency

              Avg. Response Time (s)         Increase (%)
Collection    VSM       GVSM      SBM        GVSM     SBM
CFC           0.0023    0.0056    0.0025     243.5    8.7
WSJ           0.4286    2.0143    0.6296     469.9    46.9
TReC-3        1.2732    *         2.2930     *        80.1

* GVSM could not be evaluated for the TReC-3 collection due to the exponential cost of the minterm build phase.

Conclusions and Future Work
SBM exploits correlations among index terms, improving retrieval effectiveness efficiently.
Future work:
- Investigate the behavior of SBM when applied to larger collections.
- Extend SBM to take into account the proximity information of index terms.

Termsets: Complexity
Time complexity:
- Worst case: $O(2^{|q|} \cdot N)$
- Average case: $O(c \cdot N)$
Space complexity:
- Worst case: $O(r \cdot 2^{l} \cdot N)$
where $|q|$ = query size, $c$ = number of closed termsets, $N$ = number of documents, $r$ = number of maximal termsets, $l$ = length of the largest termset.

TReC-3: Number of Closed Termsets

Collection    Worst Case      Average Case
CFC           14.12           3.14
WSJ           456,419.21      3,217.28
TReC-3        5,650,707.18    4,081.25

The average case scenario is significantly smaller than the worst case scenario.

TReC-3: Minimal Frequency
Trade-off between precision, the number of termsets taken into consideration, and performance

Termsets: Enumeration
An incremental algorithm that employs a very powerful pruning strategy.
1. Enumeration of (n+1)-termsets from n-termsets: union of all pairs $(s_i, s_j)$ that have the same prefix.
2. Evaluation of whether a frequent termset s being verified is closed: check the current termsets for one that is the closure of s (a superset occurring in the same documents); s is discarded if such a termset exists.
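A simplified sketch of the two steps above follows. It joins n-termsets that share a prefix and intersects their document lists, then filters out termsets that are not closed. Unlike the algorithm on the slide, the closure check is done once at the end instead of during enumeration, so the pruning is weaker. The function `enumerate_closed` and the data layout are assumptions for illustration.

```python
def enumerate_closed(inverted_lists, min_freq=1):
    """Enumerate closed frequent termsets level by level.

    inverted_lists -- {term: set of ids of documents containing the term}
    Returns {termset (sorted tuple of terms): frozenset of document ids}.
    """
    # Level 1: frequent 1-termsets with their document lists.
    level = {(t,): frozenset(docs)
             for t, docs in inverted_lists.items()
             if len(docs) >= min_freq}
    frequent = dict(level)

    while level:
        next_level = {}
        keys = sorted(level)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a[:-1] != b[:-1]:
                    break  # keys are sorted, so no later key shares a's prefix
                # Step 1: union of two termsets with the same prefix; its
                # document list is the intersection of the two lists.
                docs = level[a] & level[b]
                if len(docs) >= min_freq:
                    next_level[a + (b[-1],)] = docs
        frequent.update(next_level)
        level = next_level

    # Step 2 (simplified): discard s when a proper superset has the same list.
    return {s: d for s, d in frequent.items()
            if not any(set(s) < set(o) and d == od
                       for o, od in frequent.items())}

lists = {
    "A": {"d1"},
    "C": {"d1", "d2", "d3"},
    "D": {"d2", "d3"},
    "T": {"d1", "d3"},
}
for termset, docs in sorted(enumerate_closed(lists).items()):
    print("".join(termset), sorted(docs))
```

On these inverted lists, A, AC, and AT all share the document list {d1} and are absorbed by ACT, so only ACT is kept from that branch.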

Termsets: Example
Collection D:
  $d_1$: A C T
  $d_2$: C D C D
  $d_3$: C D T

1-termsets: $ls_A = \{d_1\}$, $ls_C = \{d_1, d_2, d_3\}$, $ls_D = \{d_2, d_3\}$, $ls_T = \{d_1, d_3\}$
2-termsets: $ls_{AC} = \{d_1\}$, $ls_{AT} = \{d_1\}$
3-termsets: $ls_{ACT} = \{d_1\}$
Closed termset: $ls_{ACT} = \{d_1\}$
