Using TF-IDF to Determine Word Relevance in Document Queries


1 Using TF-IDF to Determine Word Relevance in Document Queries
Juan Ramos Department of Computer Science, Rutgers University, BPO Way, Piscataway, NJ, 08855

2 Information Retrieval Problem
Given a corpus D and a query q = w_1, w_2, …, w_n, return the documents d that maximize Pr(d | q, D). The problem is easy to dismiss given how widespread query retrieval is today (web searches, database management, etc.).

3 Approaches to Ad Hoc Retrieval
Probability and statistics: Naïve Bayes. Approaches include the user’s mindset.
Vector models: Latent Semantic Indexing.
  Reduce the n-dimensional vector space of documents.
  Return documents whose distance to the query is small.
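To make the vector-model idea concrete, here is a minimal sketch that represents the query and each document as term-count vectors and returns the documents at the smallest cosine distance from the query. It is only an illustration: it skips the dimensionality-reduction step that Latent Semantic Indexing performs, and the names `cosine_distance` and `nearest_documents`, the whitespace tokenization, and the choice of cosine distance are assumptions, not taken from the slides.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest_documents(corpus, query, k=1):
    """Return indices of the k documents closest to the query (hypothetical helper)."""
    query_vec = Counter(query.split())
    doc_vecs = [Counter(doc.split()) for doc in corpus]
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine_distance(query_vec, doc_vecs[i]))
    return ranked[:k]
```

A full LSI system would first project these vectors into a lower-dimensional space and measure distance there; the ranking step would stay the same.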

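4 TF-IDF Weighting Scheme
Given corpus D, word w, and document d, calculate w_d = f_{w,d} * log(|D| / f_{w,D}), where f_{w,d} is the number of times w appears in d, |D| is the number of documents in the corpus, and f_{w,D} is the number of documents in D that contain w. There are many varieties of this basic mathematical scheme.
Procedure: scan each d, compute w_{i,d} for each query word w_i, and return the set D’ that maximizes Σ_i w_{i,d}.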
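The procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: documents are plain strings split on whitespace, f_{w,D} is taken to be the number of documents containing w, and the function names `tf_idf_scores` and `top_documents` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(corpus, query):
    """Score each document by the sum over query words of f_{w,d} * log(|D| / f_{w,D})."""
    term_counts = [Counter(doc.split()) for doc in corpus]  # f_{w,d} per document
    n_docs = len(corpus)                                    # |D|

    # f_{w,D}: number of documents in which each word appears (assumed reading)
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    scores = []
    for counts in term_counts:
        total = 0.0
        for w in query.split():
            if doc_freq[w] == 0:
                continue  # word absent from the corpus contributes nothing
            total += counts[w] * math.log(n_docs / doc_freq[w])
        scores.append(total)
    return scores


def top_documents(corpus, query, k=1):
    """Return indices of the k highest-scoring documents."""
    scores = tf_idf_scores(corpus, query)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Note that a word appearing in every document gets weight log(1) = 0, so very common words do not influence the ranking however often they occur.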

5 Experiment
Documents from the Linguistic Data Consortium’s United Nations Parallel Text Corpus.
Support noise by enforcing case sensitivity and not parsing SGML symbols.
Brute-force approach: consider only f_{w,d}.
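A brute-force scorer that considers only f_{w,d} might look like the sketch below. It mirrors the noisy setup described on this slide by keeping tokens case-sensitive and leaving any markup symbols in place as ordinary tokens. The function name `raw_frequency_scores` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def raw_frequency_scores(corpus, query):
    """Score each document by summing raw counts f_{w,d} of the query words."""
    query_words = query.split()
    scores = []
    for doc in corpus:
        # Case-sensitive split; markup such as "<p>" is kept as a plain token.
        counts = Counter(doc.split())
        scores.append(sum(counts[w] for w in query_words))
    return scores
```

Because there is no inverse-document-frequency factor, frequent words such as "the" dominate these scores, which is the weakness the log(|D| / f_{w,D}) term is meant to correct.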

6 Results

7 Extensions and Further Research
Genetic TF-IDF: evolve weighting schemes that compete with TF-IDF. Hill-climbing and gradient-descent TF-IDF. Cross-language settings: return documents in a different language than the query.


