WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.



INTRODUCTION
Searching for a basic query can be done in two ways:
- Scan the text sequentially (sequential or online searching): find the occurrences of a pattern in a text that has not been preprocessed.
  Good when the text is small, when the text collection is volatile (modified frequently), or when no indexing space is available.
- Build data structures over the text (indexes) to speed up the search.
  Worth building and maintaining when the text collection is large and semi-static (updated at reasonably regular intervals).
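As a rough illustration of the first option, the sketch below scans an unpreprocessed text for every occurrence of a pattern. The function name `sequential_search` is hypothetical and not from the slides; it relies only on Python's built-in `str.find`.

```python
def sequential_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start offset of every occurrence of pattern in text."""
    positions: list[int] = []
    if not pattern:
        return positions  # an empty pattern is treated as "no match" in this sketch
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are also reported.
        start = text.find(pattern, start + 1)
    return positions


if __name__ == "__main__":
    print(sequential_search("abracadabra", "abra"))  # [0, 7]
```

Every query rescans the whole text, which is why this approach suits small or frequently changing collections.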

INDEXING
- Key weight: frequency dependent; determines the ranking (best match)
- tf*idf weighting
  - tf: frequency of the key in a document
  - idf: the inverse of the number of documents containing the key
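The weighting above can be sketched as follows. The slide only says that idf is "the inverse" of the document count; this sketch assumes the common logarithmic variant, idf = log(N / df), where N is the number of documents and df the number of documents containing the key. The function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts at most once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


if __name__ == "__main__":
    docs = [["cat", "sat", "cat"], ["dog", "sat"]]
    for doc_weights in tf_idf(docs):
        print(doc_weights)
```

Under this assumed formula, a key that occurs in every document gets weight 0, so it does not help to rank one document above another.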

AUTOMATIC INDEXING PROCESS
Text representation:
- Recognize strings
- Delete stopwords
- Identify stems
- Replace stems by identifiers
- Count postings
- Weight
- Use thesaurus and phrases

AUTOMATIC INDEXING PROCESS
In the process:
- Stem identification: word normalization, NLP
- Short codes are used as identifiers
- Thesaurus: rare stems are clustered
- Phrases: frequent stems are combined into less frequent phrases
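The first steps of the process (recognizing strings, deleting stopwords, identifying stems, replacing stems by identifiers, and counting postings) might look like the sketch below. The stopword list and the suffix-stripping `crude_stem` are toy assumptions standing in for a real stopword list and a real stemmer; weighting, thesaurus, and phrase handling are left out.

```python
import re
from collections import Counter

# Assumed toy stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to"}


def crude_stem(word: str) -> str:
    """Strip a few common English suffixes; a placeholder, not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_document(text: str, stem_ids: dict[str, int]) -> Counter:
    """Return {stem identifier: count} for one document, extending stem_ids as needed."""
    tokens = re.findall(r"[a-z]+", text.lower())          # recognize strings
    content = [t for t in tokens if t not in STOPWORDS]   # delete stopwords
    stems = [crude_stem(t) for t in content]              # identify stems
    ids = []
    for stem in stems:                                    # replace stems by identifiers
        if stem not in stem_ids:
            stem_ids[stem] = len(stem_ids)                # next unused short code
        ids.append(stem_ids[stem])
    return Counter(ids)                                   # count postings


if __name__ == "__main__":
    stem_ids: dict[str, int] = {}
    print(index_document("The dogs are barking and the dog barked", stem_ids))
    print(stem_ids)
```

Sharing one `stem_ids` dictionary across documents keeps the identifiers consistent over the whole collection.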

Nowadays, medium-size databases (200 MB) combine online and indexed searching.
Three main indexing techniques:
- Inverted files: the best choice for most applications
- Suffix trees and arrays: faster for phrase searching but harder to build and maintain
- Signature files: popular in the 1980s but outperformed by inverted files
We will concentrate on inverted files only.

INVERTED FILE
- Inverted file = inverted index = a word-oriented mechanism for indexing a text collection in order to speed up the searching task
- Composed of two elements: vocabulary and occurrences
- Vocabulary = the set of all different words in the text
- For each word, a list of all the text positions where the word appears is stored
- Occurrences = the set of all those lists

Example
A sample text and an inverted index built on it:
- The words are converted to lower case, and some are not indexed
- The occurrences point to character positions in the text
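Since the slide's figure is not reproduced in this transcript, here is a minimal sketch of the same idea. The sample sentence, the set of unindexed words, and the function name `build_inverted_index` are all assumptions chosen for illustration. Positions are 1-based character offsets.

```python
import re
from collections import defaultdict

# Assumed set of words left unindexed, for this example only.
STOPWORDS = {"this", "is", "a", "has", "are", "from"}


def build_inverted_index(text: str) -> dict[str, list[int]]:
    """Map each indexed lower-cased word to the 1-based character positions where it starts."""
    index: defaultdict = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()               # words are converted to lower case
        if word not in STOPWORDS:                  # some words are not indexed
            index[word].append(match.start() + 1)  # occurrences are character positions
    return dict(index)


if __name__ == "__main__":
    sample = "This is a text. A text has many words. Words are made from letters."
    for word, positions in sorted(build_inverted_index(sample).items()):
        print(word, positions)
    # letters [60]
    # made [50]
    # many [28]
    # text [11, 19]
    # words [33, 40]
```

The dictionary keys form the vocabulary, and the position lists together form the occurrences.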

INVERTED FILE
- Positions can refer to words or characters
- Word positions (e.g. position i refers to the i-th word) simplify phrase and proximity queries
- Character positions (e.g. position i refers to the i-th character) facilitate direct access to matching text positions
- The space required for the vocabulary is small: e.g. the vocabulary for 1 GB of the TREC-2 collection occupies 5 MB, and it can be further reduced by stemming and other techniques

INVERTED FILE
- Occurrences require more space: each word in the text is referenced once in the structure
- Building an inverted index from the sample text: refer to the attached Word document

Searching on an inverted file
Done in three basic steps:
1. Vocabulary search: the words and patterns present in the query are isolated and searched in the vocabulary
2. Retrieval of occurrences: the lists of occurrences of all the words found are retrieved
3. Manipulation of occurrences: the occurrences are processed to solve phrase, proximity, or Boolean operations
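The three steps can be sketched for a phrase query as follows. This example assumes an index that stores word positions rather than character positions, since those make phrase matching simple. `build_word_index` and `phrase_search` are hypothetical names, and a plain dictionary stands in for the vocabulary structure.

```python
import re
from collections import defaultdict


def build_word_index(text: str) -> dict[str, list[int]]:
    """Map each lower-cased word to the 0-based word positions where it occurs."""
    index: defaultdict = defaultdict(list)
    for i, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        index[word].append(i)
    return dict(index)


def phrase_search(index: dict[str, list[int]], phrase: str) -> list[int]:
    """Return the word positions at which the whole phrase starts."""
    words = phrase.lower().split()
    if not words:
        return []
    # Step 1, vocabulary search: every query word must be in the vocabulary.
    if any(word not in index for word in words):
        return []
    # Step 2, retrieval of occurrences: fetch one position list per query word.
    lists = [index[word] for word in words]
    # Step 3, manipulation of occurrences: the k-th word must sit k positions
    # after the first one, so shift each list back by k and intersect.
    starts = set(lists[0])
    for offset, positions in enumerate(lists[1:], start=1):
        starts &= {p - offset for p in positions}
    return sorted(starts)


if __name__ == "__main__":
    idx = build_word_index("to be or not to be that is the question")
    print(phrase_search(idx, "to be"))  # [0, 4]
```

Proximity and Boolean queries change only the third step: the retrieved lists are compared within a distance window, or combined with union and intersection.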

TRIES
Tries, or digital search trees, are multiway trees that store sets of strings. Every edge of the tree is labelled with a letter. To search for a string in a trie, one starts at the root and scans the string character by character, descending by the appropriate edge of the trie. This continues until a leaf is found, or until no edge matches the next character, in which case the string is not in the set.
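A minimal sketch of this structure is given below. The class and method names are hypothetical. One assumption differs slightly from the description above: instead of relying on leaves alone, each node carries an end-of-string flag, so that a string which is a prefix of another stored string can still be represented.

```python
class TrieNode:
    """One node of the trie; outgoing edges are keyed by a single character."""

    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        self.is_end = False  # True if a stored string ends at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        """Add word to the set, creating edges as needed."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True

    def contains(self, word: str) -> bool:
        """Start at the root and descend by the edge labelled with each character."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                return False  # no matching edge: the string is not stored
            node = child
        return node.is_end


if __name__ == "__main__":
    trie = Trie()
    for w in ["bell", "bear", "bid"]:
        trie.insert(w)
    print(trie.contains("bell"), trie.contains("be"))  # True False
```

A lookup visits at most one node per character of the query string, regardless of how many strings are stored.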