(C) 2003, The University of Michigan. Information Retrieval, Handout #2, February 3, 2003.

Presentation transcript:

Slide 1: Information Retrieval, Handout #2, February 3, 2003

Slide 2: Course Information
- Instructor: Dragomir R. Radev
- Office: 3080, West Hall Connector
- Phone: (734)
- Office hours: M&F
- Course page:
- Class meets on Mondays, 1-4 PM in 409 West Hall

Slide 3: Queries and documents

Slide 4: Queries
- Single-word queries
- Context queries
  - Phrases
  - Proximity
- Boolean queries
- Natural language queries

Slide 5: Pattern matching
- Words, prefixes, suffixes, substrings, ranges, regular expressions
- Structured queries (e.g., XML)

Slide 6: Relevance feedback
- Query expansion
- Term reweighting
- Pseudo-relevance feedback
- Latent semantic indexing
- Distributional clustering

Slide 7: Document processing
- Lexical analysis
- Stopword elimination
- Stemming
- Index term identification
- Thesauri

Slide 8: Porter’s algorithm
1. The measure, m, of a stem is a function of sequences of vowels followed by a consonant. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form [C](VC)^m[V], where the initial C and final V are optional and m is the number of VC repeats.
   - m=0: free, why
   - m=1: frees, whose
   - m=2: prologue, compute
2. *X - stem ends with letter X
3. *v* - stem contains a vowel
4. *d - stem ends in a double consonant
5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
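The measure m can be checked mechanically. The following is a minimal Python sketch, not the course's own `stem.pl`; the function names are made up for illustration, and it assumes Porter's usual convention that "y" counts as a vowel only when it follows a consonant.

```python
VOWELS = set("aeiou")

def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant under the assumed convention for 'y'."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at the start of the word or after a vowel,
        # and a vowel after a consonant (as in "why").
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Count the VC repeats, i.e. the vowel-to-consonant transitions."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)

# The six example words from the slide, grouped by their measure.
for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(word, measure(word))
```

Counting transitions from a vowel to a consonant is equivalent to counting VC groups, because each run of vowels that is followed by a run of consonants contributes exactly one such transition.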

Slide 9: Porter’s algorithm
Suffix conditions take the form: current_suffix == pattern
Actions are in the form: old_suffix -> new_suffix
Rules are divided into steps to define the order of applying the rules. The following are some examples of the rules:

STEP  CONDITION          SUFFIX  REPLACEMENT    EXAMPLE
1a    NULL               sses    ss             stresses -> stress
1b    *v*                ing     NULL           making -> mak
1b1   NULL               at      ate            inflat(ed) -> inflate
1c    *v*                y       i              happy -> happi
2     m>0                aliti   al             formaliti -> formal
3     m>0                icate   ic             duplicate -> duplic
4     m>1                able    NULL           adjustable -> adjust
5a    m>1                e       NULL           inflate -> inflat
5b    m>1 and *d and *L  NULL    single letter  controll -> control

Slide 10: Porter’s algorithm
Example: the word “duplicatable”
- duplicat (rule 4)
- duplicate (rule 1b1)
- duplic (rule 3)
The rule in step 4 that removes “ic” cannot then be applied, since only one rule from each step is allowed to be applied.
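The sample rules and the “duplicatable” example can be replayed with a small rule table. This is a sketch of the handful of rules listed on the slides only, not a complete Porter stemmer; `RULES`, `apply_rule` and `step_5b` are hypothetical names, the condition for step 5b is taken from Porter's published algorithm (m>1, double consonant, ending in "l"), and "y" is assumed to be a vowel only after a consonant.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":  # assumed convention: 'y' is a vowel only after a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m: number of VC repeats (vowel-to-consonant transitions)."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)

def contains_vowel(stem):  # condition *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):  # condition *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def always(stem):  # condition NULL
    return True

# step -> (suffix, replacement, condition tested on the stem before the suffix)
RULES = {
    "1a":  ("sses",  "ss",  always),
    "1b":  ("ing",   "",    contains_vowel),
    "1b1": ("at",    "ate", always),
    "1c":  ("y",     "i",   contains_vowel),
    "2":   ("aliti", "al",  lambda s: measure(s) > 0),
    "3":   ("icate", "ic",  lambda s: measure(s) > 0),
    "4":   ("able",  "",    lambda s: measure(s) > 1),
    "5a":  ("e",     "",    lambda s: measure(s) > 1),
}

def apply_rule(word, suffix, replacement, condition):
    """Return the rewritten word, or None if the rule does not fire."""
    if not word.endswith(suffix):
        return None
    stem = word[:len(word) - len(suffix)]
    return stem + replacement if condition(stem) else None

def step_5b(word):
    """Reduce a final double 'l' to a single letter when m > 1."""
    if measure(word) > 1 and ends_double_consonant(word) and word.endswith("l"):
        return word[:-1]
    return None

# Replay the slide's example in the order it lists the rules.
word = "duplicatable"
for step in ("4", "1b1", "3"):
    word = apply_rule(word, *RULES[step])
    print(step, word)  # 4 duplicat, 1b1 duplicate, 3 duplic
```

Each rule returns `None` when it does not fire, which makes it straightforward to enforce "one rule per step" in a fuller implementation by stopping at the first rule in a step that returns a word.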

Slide 11: Porter’s algorithm

Slide 12: Relevance feedback
- Automatic
- Manual
- Method: identifying feedback terms
  Q' = a_1 Q + a_2 R - a_3 N
  Often a_1 = 1, a_2 = 1/|R| and a_3 = 1/|N|

Slide 13: Example
Q = “safety minivans”
D1 = “car safety minivans tests injury statistics” - relevant
D2 = “liability tests safety” - relevant
D3 = “car passengers injury reviews” - non-relevant
R = ?
S = ?
Q’ = ?
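One way to work this example is to apply the reweighting formula from the relevance feedback slide directly. The sketch below rests on several assumptions that the slides do not state: R and N are read as the sums of the raw term-count vectors of the relevant and non-relevant documents, |R| and |N| as the numbers of such documents, and negative weights are left unclipped. The slide's S is not defined in the transcript, so it is not computed. The name `rocchio` is only a label for this illustration.

```python
from collections import Counter

def term_vector(text):
    """Raw term counts for a whitespace-tokenized string."""
    return Counter(text.split())

def rocchio(query, relevant, nonrelevant, a1=1.0, a2=None, a3=None):
    """Q' = a1*Q + a2*R - a3*N, with R and N as summed document vectors."""
    q = term_vector(query)
    r = Counter()
    for doc in relevant:
        r.update(term_vector(doc))
    n = Counter()
    for doc in nonrelevant:
        n.update(term_vector(doc))
    # Defaults from the slide: a2 = 1/|R| and a3 = 1/|N|.
    if a2 is None:
        a2 = 1.0 / len(relevant) if relevant else 0.0
    if a3 is None:
        a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(q) | set(r) | set(n)
    return {t: a1 * q[t] + a2 * r[t] - a3 * n[t] for t in terms}

new_query = rocchio(
    "safety minivans",
    relevant=["car safety minivans tests injury statistics",
              "liability tests safety"],
    nonrelevant=["car passengers injury reviews"],
)
for term, weight in sorted(new_query.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" and "minivans" keep the highest weights, "tests" enters the query as a new positive term, and terms that occur only in the non-relevant document ("passengers", "reviews") receive negative weights.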

Slide 14: Automatic query expansion
- Thesaurus-based expansion
- Distributional similarity-based expansion

Slide 15: WordNet and DistSim
- wn reason -hypen : hypernyms
- wn reason -synsn : synsets
- wn reason -simsn : synonyms
- wn reason -over : overview of senses
- wn reason -famln : familiarity/polysemy
- wn reason -grepn : compound nouns
- /clair3/tools/relatedwords/relate reason

Slide 16: Related (substitutable) words
WordNet:
- Book: publication, product, fact, dramatic composition, record
- Computer: machine, expert, calculator, reckoner, figurer
- Fruit: reproductive structure, consequence, product, bear
- Politician: leader, schemer
- Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
- Book: autobiography, essay, biography, memoirs, novels
- Computer: adobe, computing, computers, developed, hardware
- Fruit: leafy, canned, fruits, flowers, grapes
- Politician: activist, campaigner, politicians, intellectuals, journalist
- Newspaper: daily, globe, newspapers, newsday, paper

Slide 17: Indexing and searching

Slide 18: Computing term salience
- Term frequency (TF)
- Document frequency (DF)
- Inverse document frequency (IDF)
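The three quantities can be computed in a few lines. This is a minimal sketch rather than the course scripts: it assumes whitespace tokenization, raw counts for TF, and the common definition idf(t) = log(N / df(t)) with N the number of documents, none of which the slide specifies. The documents are the three from the relevance feedback example.

```python
import math
from collections import Counter

def term_frequencies(doc):
    """TF: how often each term occurs in one document."""
    return Counter(doc.split())

def document_frequencies(docs):
    """DF: in how many documents each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return df

def inverse_document_frequencies(docs):
    """IDF under the assumed definition log(N / df)."""
    n = len(docs)
    return {t: math.log(n / d) for t, d in document_frequencies(docs).items()}

DOCS = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]

tf = term_frequencies(DOCS[0])
df = document_frequencies(DOCS)
idf = inverse_document_frequencies(DOCS)
for term in sorted(df):
    print(f"{term:12s} df={df[term]} idf={idf[term]:.3f}")
```

Terms that appear in fewer documents get a larger IDF, so "liability" (one document) outweighs "safety" (two documents) under this definition.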

Slide 19: Scripts to compute tf and idf
cd /clair4/class/ir-w03/hw2
./tf.pl 053.txt | sort -nr +1 | more
./tfs.pl 053.txt | sort -nr +1 | more
./stem.pl reasonableness
./build-idf.pl
./idf.pl | sort -n +2 | more

Slide 20: Applications of TFIDF
- Cosine similarity
- Indexing
- Clustering
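As an illustration of the first application, the sketch below builds TF-IDF vectors and compares documents by cosine similarity. The weighting scheme (raw term count times log(N / df)) and the function names are assumptions made for this example; the slides do not say which variant the course uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse {term: weight} vector per document, weight = tf * log(N/df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

DOCS = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]
vectors = tfidf_vectors(DOCS)
print(cosine(vectors[0], vectors[1]))  # share "safety" and "tests"
print(cosine(vectors[1], vectors[2]))  # no terms in common
```

The same vectors could serve the other two applications listed: they are the weights an index would store per term and document, and the pairwise cosine values are a typical input to a clustering algorithm.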