Term-Weighting Approaches in Automatic Text Retrieval. Presented by Ehsan.


References
- Modern Information Retrieval (textbook)
- Slides on the Vectorial Model by Dr. Rada Mihalcea
- The paper itself: Salton & Buckley, "Term-Weighting Approaches in Automatic Text Retrieval" (1988)

The main idea
A text indexing system based on weighted single terms performs better than one based on more complex text representations. Effective term weighting is therefore of crucial importance.

Basic IR
Attach content identifiers to both stored texts and user queries. A content identifier (term) is a word or a group of words extracted from the documents or queries.
Underlying assumption: the semantics of the documents and queries can be expressed by these terms.

Two things to consider
1. What is an appropriate content identifier?
2. Are all identifiers of the same importance? If not, how can we discriminate one term from the others?

Choosing a content identifier
- Use single terms/words as individual identifiers, or
- Use a more complex text representation as the identifier.
An example: "Industry is the mother of good luck" versus: Mother said, "Good luck". The two share the single terms ("mother", "good", "luck") but mean very different things, which is the distinction a more complex representation tries to capture.

Complex text representations
1. Sets of related terms based on statistical co-occurrence
2. Term phrases consisting of one or more governing terms (the head of the phrase) together with the corresponding dependent terms
3. Grouping words under a common heading, as in a thesaurus
4. Constructing a knowledge base to represent the content of the subject area

Which is better: single or complex terms?
Constructing complex text representations is inherently difficult and requires sophisticated syntactic/statistical analysis programs.
For example, term phrases give a 20% improvement in some cases but are quite discouraging in others, and effective knowledge-base vocabulary tools covering subject areas of reasonable scope are still under development.
Conclusion: using single terms as content identifiers is preferable.

The second issue
How do we discriminate terms? With term weights, of course!
Effectiveness of an IR system: documents with relevant items must be retrieved, and documents with irrelevant/extraneous items must be rejected.

Precision and Recall
Recall: the number of relevant documents retrieved divided by the total number of relevant documents.
Precision: out of the documents retrieved, how many are relevant.
Our goal: high recall, to retrieve as many relevant documents as possible, and high precision, to reject extraneous documents. Basically, it is a trade-off.
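To make the definitions concrete, here is a minimal sketch in Python (the document IDs and relevance judgments are invented for illustration):

```python
# Minimal precision/recall computation; document IDs are hypothetical.
retrieved = {"d1", "d2", "d3", "d5"}        # documents the system returned
relevant = {"d1", "d3", "d4", "d6", "d7"}   # ground-truth relevant documents

hits = retrieved & relevant                 # relevant documents actually retrieved

recall = len(hits) / len(relevant)          # fraction of all relevant docs found
precision = len(hits) / len(retrieved)      # fraction of retrieved docs that are relevant

print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.40, precision=0.50
```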

Weighting mechanism
To get high recall: term frequency, tf. But when high-frequency terms are prevalent across the whole document collection, a high tf alone will retrieve every single document.
To get high precision: inverse document frequency, idf. It varies inversely with the number of documents n in which the term appears, and is given by log2(N/n), where N is the total number of documents.
To discriminate terms, we use the product tf × idf.
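As a rough illustration of the tf × idf product with the log2(N/n) idf above (a sketch over an invented toy collection, not code from the paper):

```python
import math
from collections import Counter

# Toy collection, invented for illustration only.
docs = [
    "term weighting in automatic text retrieval",
    "automatic indexing of text",
    "vector space retrieval model",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def n_containing(term):
    """Number of documents n in which the term appears."""
    return sum(term in doc for doc in tokenized)

def tf_idf(term, doc_tokens):
    tf = Counter(doc_tokens)[term]        # raw term frequency in this document
    n = n_containing(term)
    idf = math.log2(N / n) if n else 0.0  # idf = log2(N / n)
    return tf * idf

print(tf_idf("retrieval", tokenized[0]))  # in 2 of 3 docs: 1 * log2(3/2) ~ 0.58
print(tf_idf("weighting", tokenized[0]))  # in 1 of 3 docs: 1 * log2(3)   ~ 1.58
```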

Two more things to consider
The current tf × idf mechanism favors longer documents, so we introduce a normalizing factor into the weight to equalize document lengths.
Probabilistic model: the term weight is the proportion of relevant documents in which a term occurs divided by the proportion of irrelevant items in which it occurs, and is given by log((N − n)/n).
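A sketch of both refinements under the definitions above, i.e. cosine length normalization and the probabilistic weight (the numbers and the choice of log base 2 are illustrative):

```python
import math

def cosine_normalize(weights):
    """Divide each weight by the vector length so that longer
    documents do not dominate shorter ones."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

def probabilistic_weight(N, n):
    """log((N - n) / n): documents without the term over documents with it."""
    return math.log2((N - n) / n)

vec = {"term": 2.0, "weighting": 1.0}
print(cosine_normalize(vec))           # vector scaled to unit length
print(probabilistic_weight(1000, 50))  # log2(950/50) ~ 4.25
```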

Term weighting components
- Term frequency components: b, t, n
- Collection frequency components: x, f, p
- Normalization components: x, c
What would be the weighting system given by tfc.nfx? (See the sketch below.)
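To answer the slide's question concretely: under the usual reading of the letters (t = raw tf, n = augmented tf 0.5 + 0.5·tf/max_tf, b = binary, f = idf, p = probabilistic, x = none, c = cosine normalization), tfc.nfx means cosine-normalized tf × idf document vectors matched against unnormalized augmented-tf × idf query vectors. A sketch, with my own helper names:

```python
import math

def idf(N, n):
    return math.log2(N / n)

def tfc(doc_tf, df, N):
    """Document side: raw tf * idf (t, f), cosine-normalized (c)."""
    w = {t: tf * idf(N, df[t]) for t, tf in doc_tf.items()}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w

def nfx(query_tf, df, N):
    """Query side: augmented tf (n) * idf (f), no normalization (x)."""
    max_tf = max(query_tf.values())
    return {t: (0.5 + 0.5 * tf / max_tf) * idf(N, df[t])
            for t, tf in query_tf.items()}

def score(doc_vec, query_vec):
    """Rank documents by the inner product of the two weighted vectors."""
    return sum(w * query_vec.get(t, 0.0) for t, w in doc_vec.items())
```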

Experimental evidence: query vectors
- For tf: short queries, use n; long queries, use t.
- For idf: use f.
- For normalization: use x.

Experimental evidence: document vectors
- For tf: technical vocabulary, use n; more varied vocabulary, use t.
- For idf: use f in general; for documents from different domains, use x.
- For normalization: documents of heterogeneous lengths, use c; homogeneous documents, use x.

Conclusion
- Best document weighting: tfc, nfc (or tpc, npc)
- Best query weighting: nfx, tfx, bfx (or npx, tpx, bpx)
Questions?