Data Structure. A data structure serves two roles: storage and retrieval.

Similar presentations
Information Retrieval in Practice

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Chapter 5: Introduction to Information Retrieval
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Space-for-Time Tradeoffs
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
File Processing : Hash 2015, Spring Pusan National University Ki-Joune Li.
©Silberschatz, Korth and Sudarshan, Database System Concepts, Chapter 12: Indexing and Hashing. Basic Concepts, Ordered Indices, B+-Tree Index Files, B-Tree.
Stemming Algorithms. Midterm report for Information Retrieval and Recommendation Techniques. Advisor: Prof. 黃三益. Students: 黃哲修, 張家豪.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
Tools for Text Review. Algorithms The heart of computer science Definition: A finite sequence of instructions with the properties that –Each instruction.
Suffix Trees and Suffix Arrays. Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, Addison-Wesley (Chapter 8).
Tries Standard Tries Compressed Tries Suffix Tries.
Modern Information Retrieval Chapter 8 Indexing and Searching.
1 Foundations of Software Design Fall 2002 Marti Hearst Lecture 18: Hash Tables.
Modern Information Retrieval
1 Indexing and Searching (File Structures) Modern Information Retrieval (C hapter 8) With G. Navarro.
Dictionaries and Hash Tables
WMES3103 : INFORMATION RETRIEVAL
Query Languages: Patterns & Structures. Pattern Matching Pattern –a set of syntactic features that must occur in a text segment Types of patterns –Words:
Sets and Maps, Chapter 9. Chapter Objectives: To understand the Java Map and Set interfaces and how to use them. To learn about.
IR Data Structures Making Matching Queries and Documents Effective and Efficient.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
Part E: Hash Tables
Information Retrieval in Text Part I Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval.
Document and Query Forms Chapter 2. 2 Document & Query Forms Q 1. What is a document? A document is a stored data record in any form A document is a stored.
1 Basic Text Processing and Indexing. 2 Document Processing Steps Lexical analysis (tokenizing) Stopwords removal Stemming Selection of indexing terms.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Indexing and Searching
File Management.
Chapter 8 Physical Database Design. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Overview of Physical Database.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Overview of Search Engines
Tutorial 3: Adding and Formatting Text. 2 Objectives Session 3.1 Type text into a page Copy text from a document and paste it into a page Check for spelling.
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Query Processing Presented by Aung S. Win.
TM 7-1 Copyright © 1999 Addison Wesley Longman, Inc. Physical Database Design.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Recap Preprocessing to form the term vocabulary Documents Tokenization token and term Normalization Case-folding Lemmatization Stemming Thesauri Stop words.
Chapter 6 1 © Prentice Hall, 2002 The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited) Project Identification and Selection Project Initiation.
CS 430: Information Discovery
University of Palestine, Topics In CIS, ITBS 3202, Ms. Eman Alajrami, 2nd Semester
Hash Tables © 2010 Goodrich, Tamassia.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
© 2004 Goodrich, Tamassia. Hash Tables.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Basic Implementation and Evaluations Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Introduction to Digital Libraries Information Retrieval.
Web- and Multimedia-based Information Systems Lecture 2.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.
It consists of two parts: collection of files – stores related data directory structure – organizes & provides information Some file systems may have.
Sets and Maps Chapter 9. Chapter Objectives  To understand the Java Map and Set interfaces and how to use them  To learn about hash coding and its use.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Em Spatiotemporal Database Laboratory Pusan National University File Processing : Hash 2004, Spring Pusan National University Ki-Joune Li.
Why indexing? For efficient searching of a document
Module 11: File Structure
Search Engine Architecture
Indexing Structures for Files and Physical Database Design
Physical Database Design and Performance
CS 430: Information Discovery
CHAPTER 5: PHYSICAL DATABASE DESIGN AND PERFORMANCE
Physical Database Design
The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited)
Presentation transcript:

Data Structure

A data structure serves two roles:
–Storage
–Retrieval

[Figure: document processing flow. Item normalization feeds document file creation; the Document Manager maintains the original document file, while the Document Search Manager maintains the processed-token search file.]

–Stemming
–Inverted file system
–N-gram
–PAT trees and arrays
–Signature
–Hypertext

–Inverted file system: the most common data structure
Minimizes secondary storage access when using multiple search terms
Components: document file, inversion lists (posting files), and the dictionary
Stores an inversion of the documents: each word points to the documents that contain it
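As a sketch of the idea (illustrative code, not from the text), the dictionary can map each term to its inversion list, the sorted list of IDs of documents containing that term:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build the dictionary and inversion (posting) lists for a document set.

    docs maps document ID -> text; the result maps each term to the sorted
    list of IDs of documents containing that term.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "computer bit", 2: "byte", 3: "bit byte"}
index = build_inverted_index(docs)
# index["bit"] == [1, 3]: only documents 1 and 3 contain "bit"
```

A multi-term query then touches only the (usually short) lists for its terms, which is why secondary storage access stays low.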

–N-gram: a fixed-length consecutive series of n characters
Algorithmically based upon a fixed number of characters
The searchable text is transformed into overlapping n-grams, which form the searchable database (fig. 4.7)
Does not involve semantics or concepts

–N-gram
The symbol # represents an inter-word symbol (fig. 4.7): blank, period, semicolon, colon, etc.
N-grams are word fragments
Uses: spelling error detection and correction (fig. 4.8), text compression
Ignores words and treats the input as continuous data
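A minimal sketch of n-gram generation with the inter-word symbol (the use of # follows the text; the function name and the rule for what counts as an inter-word character are ours):

```python
def ngrams(text, n):
    """Slide an n-character window over the text; inter-word symbols
    (blank, period, semicolon, colon, etc.) are replaced by '#'."""
    padded = "".join(ch if ch.isalnum() else "#" for ch in text.lower())
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Overlapping trigrams of "sea colony":
# ['sea', 'ea#', 'a#c', '#co', 'col', 'olo', 'lon', 'ony']
```

Note that the trigrams spanning the blank keep the # marker, so a search for "col" cannot falsely match across a word boundary.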

–N-gram
False hits can occur when # is not used
The longer the n-gram, the less likely the error
Problems: increased size of inversion lists; no semantic meaning or concept relationships
Can achieve high recall

–PAT trees (PATRICIA trees: Practical Algorithm To Retrieve Information Coded In Alphanumerics)
Each position in the input string is the anchor point for a sub-string that starts at that point and includes all new text up to the end of the input
These sub-strings are termed sistrings (Figure )
Best for string searching, but not widely used commercially
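The sistrings of an input are easy to enumerate (a sketch; the PAT tree itself is a Patricia trie built over these sistrings, which is not shown here):

```python
def sistrings(text):
    """Each position in the input string anchors a sub-string (sistring)
    that runs from that point to the end of the input."""
    return [text[i:] for i in range(len(text))]

# sistrings("home") == ['home', 'ome', 'me', 'e']
```

Because every sistring is a suffix of the input, any sub-string search reduces to a prefix search over the set of sistrings.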

Signature
Provides a fast test to eliminate the majority of items that are not related to a query
Search is a linear scan of compressed versions of the items
Coding is based upon the words in the item
Each word is mapped onto a word signature: a fixed-length code with a fixed number of bits set to 1, the positions determined by hash functions
Word signatures are ORed together to create the signature of an item (Fig 4.13)
Words in the query are mapped to word signatures in the same way; search is via template matching
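A sketch of this superimposed coding (the code length, number of bits per word, and hash scheme below are illustrative choices, not the book's):

```python
def word_signature(word, code_length=16, bits_set=3):
    """Map a word onto a fixed-length code with a fixed number of 1 bits,
    the positions chosen by (illustrative) hash functions."""
    sig = 0
    for k in range(bits_set):
        sig |= 1 << (hash((word, k)) % code_length)
    return sig

def item_signature(words):
    """OR the word signatures together to form the item's signature."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def may_contain(item_sig, query_words):
    """Template match: every bit set in the query signature must also be
    set in the item signature. False drops are possible; misses are not."""
    q = item_signature(query_words)
    return item_sig & q == q

sig = item_signature(["computer", "memory", "disk"])
# may_contain(sig, ["memory"]) is always True; unrelated words usually fail
```

The test can say "maybe" for an item that lacks the word (a false drop), but never "no" for an item that has it, which is exactly what a fast elimination filter needs.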

Signature
A longer code length reduces the probability that different words hash to the same signature (collision)
Fewer bits set per code word reduce the chance that a word's code pattern appears in the final signature block when the word is not actually in the item

Hypertext (HTML and XML)
Allows one item to reference another item via an embedded pointer
A node is a separate item; a link is a reference pointer, which may point to the same or a different data type than the original
Users navigate the links, which helps manage loosely structured information
Issue: linkage integrity (links are not updated when referenced items are moved or deleted)

Hypertext (HTML and XML)
Dynamic HTML: a combination of the latest HTML tags and options, style sheets, and programming; enables animated Web pages that respond to user interaction
Dynamic HTML Object Model: an object-oriented view of a Web page and its elements, with cascading style sheets and programmatic addressing of page elements, including dynamic fonts

[Example inverted file with three columns: DOCUMENTS (e.g. Doc #1 containing "computer, bit"), DICTIONARY (each term with its document count, e.g. "bit (2)"), and INVERSION LISTS (the documents containing each term, e.g. "bit - 1, 3").]

Inversion lists may also store:
–Weights
–Words with special characteristics, e.g. dates
Searching:
–Locate the inversion lists for the search terms
–Apply the appropriate logic between the lists
–The final list of items (hits) is the result
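For an AND query, the "appropriate logic between lists" is a two-pointer merge of sorted inversion lists (a sketch):

```python
def intersect(list_a, list_b):
    """AND two sorted inversion lists: advance a pointer in each list,
    keeping only the document IDs present in both."""
    i = j = 0
    hits = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            hits.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return hits

# intersect([1, 3, 5, 9], [3, 4, 9, 10]) == [3, 9]
```

The merge reads each list once, so the work is proportional to the lengths of the two lists rather than to the size of the collection.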

–B-trees, e.g. of order m
The root node has between 2 and 2m keys
All other internal nodes have between m and 2m keys
All keys are kept in order, from smaller to larger
All leaves are at the same level, or differ by at most one level
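The key-count rule above can be stated as a small check (a sketch that follows the slide's bounds exactly; many texts also allow a root with a single key):

```python
def key_count_ok(keys, m, is_root):
    """Check one node of a B-tree of order m against the slide's rules:
    2..2m keys in the root, m..2m keys elsewhere, keys in ascending order."""
    lo = 2 if is_root else m
    return lo <= len(keys) <= 2 * m and keys == sorted(keys)

# key_count_ok([10, 20, 30], m=2, is_root=True)  -> True
# key_count_ok([10], m=2, is_root=False)         -> False (too few keys)
```

Keeping every node at least half full is what bounds the tree's height, and hence the number of disk accesses per lookup.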

–Inversion list structures
Provide optimum performance in searching large databases: data flow is minimized, and only directly related data are involved
Good for storing concepts and their relationships: each list represents a concept and is a concordance of all items containing that concept, including its locations
Do not, by themselves, support natural language processing

Stemming algorithm
Goal: improve performance and require fewer system resources by reducing the number of unique words the system has to contain
Currently reviewed for potential improvements in recall and the associated decline in precision
Trade-off: increased overhead when processing tokens at index time vs. reduced search-time overhead of expanding query terms with trailing don't-cares to include all variants
Stemming creates a large index for the stem; the alternative is term masking (ORing the variants)

Stemming algorithm
Conflation: mapping multiple morphological variants to a single representation (the stem)
The stem carries the meaning of the concept associated with the word; affixes (endings) introduce subtle modifications to the concept or serve syntactic purposes
Languages have grammars that define usage, and they evolve with human usage
Exceptions and inconsistent variants exist, so exception look-up tables are required beside the normal reduction rules

Stemming algorithm
Does this compression yield savings in storage and processing?
Savings appear in the dictionary, but weighted positional information is still required in both the stemmed and un-stemmed inversion lists
The size of the inversion lists dominates, so the compression does not significantly reduce storage requirements, for small or large collections

Stemming algorithm
Does it improve recall? Yes, as long as a semantically consistent stem can be identified for a set of words: stemming is a generalization process
Does it improve precision? Only if the expansion guarantees that every item retrieved by the expansion is relevant

Stemming algorithm
The system must recognize a word before it can stem it
Proper names and acronyms are not stemmed, since they share no common core concept
Problem for natural language processing systems: stemming loses information needed at aggregate levels of processing, e.g. tenses needed to determine a particular time; time is important in natural language processing

Stemming algorithm
Approaches: removal of suffixes and prefixes
Table look-up: requires a large data structure (e.g. RetrievalWare, due to its large thesaurus/concept network)
Successor stemming: determines the break point from prefix overlap as the length of a stem is increased

Stemming algorithms covered:
–Porter stemming algorithm
–Dictionary look-up stemmers
–Successor stemmers

Porter stemming algorithm
Based upon a set of conditions on the stem, suffix, and prefix, with an associated action for each condition
The measure m of a stem is a function of sequences of vowels followed by a consonant: if V is a sequence of vowels and C is a sequence of consonants, a word has the form C(VC)^m V, where the leading C and trailing V are optional and m is the number of VC repetitions
Other conditions include *S (stem ends with S), *v* (stem contains a vowel), *d (stem ends with a double consonant), and *o (stem ends consonant-vowel-consonant, where the second consonant is not w, x, or y)
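The measure can be computed by collapsing vowel and consonant runs (a sketch; it treats only a, e, i, o, u as vowels and omits Porter's special handling of 'y'):

```python
import re

def measure(stem):
    """Porter's measure m: collapse vowel runs to V and consonant runs to C,
    giving the form [C](VC)^m[V]; m is the number of VC repetitions."""
    collapsed = re.sub(r"[aeiou]+", "V", stem.lower())
    collapsed = re.sub(r"[^V]+", "C", collapsed)
    return collapsed.count("VC")

# measure("tree") == 0, measure("trouble") == 1, measure("troubles") == 2
```

Porter's rules use m to decide when a suffix may be removed, e.g. a rule that strips "-ement" only when the remaining stem has m > 1.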

Dictionary Look-Up Stemmers
Apply simple stemming rules with the fewest exceptions (e.g. plurals)
The original term, or a stemmed version of it, is looked up in a dictionary and replaced by the stem that best represents it
Example: Kstem, a morphological analyzer that conflates word variants to a root form while avoiding collapsing words with different meanings into the same root
Kstem uses six major data files: a dictionary of words, a supplemental list of words, an exception list for words that should retain an "e" at the end, direct-conflation word pairs, country-nationality relations, and proper nouns
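A toy dictionary look-up stemmer in this spirit (the word lists and suffix rules below are invented for illustration and are far simpler than Kstem's):

```python
DICTIONARY = {"compute", "memory", "retrieve"}   # hypothetical root words
EXCEPTIONS = {"news": "news"}                    # hypothetical exception list

def lookup_stem(word):
    """Strip a simple ending, then accept the candidate only if the
    dictionary contains it; the exception list overrides everything."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and word[: -len(suffix)] in DICTIONARY:
            return word[: -len(suffix)]
    return word

# lookup_stem("computes") == "compute"; lookup_stem("news") == "news"
```

Requiring the candidate stem to appear in the dictionary is what prevents unrelated words from collapsing into the same root.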

Successor Stemmers
Based upon the lengths of the prefixes that optimally stem expansions of additional suffixes
Based upon an analogy with structural linguistics, which investigated word and morpheme boundaries using the distribution of phonemes
e.g. bag, barn, bring, both, box, bottle (Fig. 4.2)

Successor Stemmers
Methods for choosing the break point: cut-off, peak and plateau, complete word, and entropy
–Cut-off method: a cut-off value defines the stem length; the value varies for each possible set of words
–Peak and plateau: a segment break is made after a character whose successor variety exceeds that of the character immediately preceding it and of the character immediately following it (no cut-off needed)
–Complete word method: break on the boundaries of complete words (no cut-off needed)
–Entropy method: uses the distribution of successor variety letters (Figure 4.3)
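Successor variety and the peak-and-plateau break rule can be sketched as follows (illustrative code; the corpus echoes the example words above):

```python
def successor_variety(corpus, prefix):
    """Count the distinct letters that follow `prefix` among corpus words."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

def peak_and_plateau_breaks(word, corpus):
    """Break after positions whose successor variety exceeds that of both
    the preceding and the following position (no cut-off value needed)."""
    sv = [successor_variety(corpus, word[:i]) for i in range(1, len(word) + 1)]
    return [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]

corpus = ["bag", "barn", "bring", "both", "box", "bottle"]
# successor_variety(corpus, "b") == 3: 'a', 'r', 'o' follow "b"
# successor_variety(corpus, "bo") == 2: 't', 'x' follow "bo"
```

The variety is high at morpheme boundaries (many continuations are possible) and drops inside a morpheme, which is what the break rules exploit.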

Stemming Algorithm
Stemming affected recall positively in one study, though this has not been proven in many others; it reduces precision, which can be minimized by ranking items, categorizing terms, and selectively excluding some terms from stemming
Stemming effectiveness depends upon the nature of the vocabulary
Performance measures:
–Error rate relative to truncation: the distance from the origin to the coordinate of the stemmer being evaluated vs. the distance from the origin to the worst-case intersection of the line generated by pure truncation (Fig. 4.4)
–The ability to partition terms that are semantically and morphologically related into "concept groups"
–Understemming index: a concept group's terms map to multiple stems
–Overstemming index: the same stem is found in multiple concept groups