
1 Information Retrieval in Text Part I Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM 1999. Reading Assignment: Chapters 1 and 2.

2 Outline Introduction Basic Process of Information Retrieval Content Representation –Document Purification and Analysis –Item Normalization –Index Construction Manual Indexing Automatic Indexing –Inverted File Structures –Signature Files

3 Introduction Expectations from our search engines –Type principal, where one meant principle –Type Lanzcos, where one meant Lanczos –Type right and left, where one meant Party associations Traffic laws Chaos –Find what we want from a gigantic collection of documents (handle the tsunami of data) We are asking the computer to supply the information we want, rather than the information we asked for –Reference librarians are already good at that, asking the patron a few questions before directing him or her to the results

4 Introduction An information retrieval system consists of –a database of documents –a search engine –an interface –search results

5 Basic Process Of IR The basic process of information retrieval can be described as: –Representing the content of documents Document Purification and Analysis Item Normalization Index Construction –Representing the user's information need Query Representation User Interface –Ranking and Relevance Feedback The main objective of an IR system is to increase precision and recall, efficiently.

6 Precision and Recall Precision: how many of the documents retrieved by an algorithm are relevant (correct) Recall: how many of the documents that should have been retrieved by an algorithm were in fact retrieved Average Precision: precision averaged over the ranks at which relevant documents appear
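These three measures can be sketched in a few lines of Python. The document identifiers below are illustrative, and the non-interpolated definition of average precision is one common convention, assumed here:

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

For example, retrieving documents [1, 2, 3, 4] when the relevant set is {2, 4, 5} gives precision 2/4 and recall 2/3.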

7 Document Purification and Analysis Unless documents are cleaned up (making sure every document has a title and begin/end markers, and handling non-textual portions like images), the wrong documents, or wrong portions of documents, may be retrieved

8 Document Purification and Analysis Taking HTML documents, for example, one needs to decide which “tags” to index According to references published in 1997 and 1998, the following features are ignored in building a search engine index –certain tags and attributes –image maps, frames, and some URLs

9 Document Purification and Analysis Usually, search engines extract –text, excluding punctuation, from title tags, header tags, and the first characters of an HTML file. This may include The first 100 significant words The first 20 lines per record Search engines would ignore –invisible text –text with smaller fonts –words containing numbers

10 Document Purification and Analysis Text formatting –Use standard ASCII/Unicode May need to convert certain formats to text or extract text information from them (e.g., PostScript, PDF) What about OCRed documents?

11 Item Normalization Words must be sliced and diced before being considered for index construction. This may include –Identification of processing tokens (words) –Characterizations of tokens –Stemming of tokens

12 Item Normalization Applying stop lists to the collection of processing tokens –ftp://ftp.cs.cornell.edu/pub/smart/english.stop –E.g. able, about, after, allow, became, been, before, certainly, clearly, enough, everywhere, etc. –What to do with singletons (words appearing only once in a collection of documents)?
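A minimal sketch of stop-list filtering and singleton detection. The stop list here is just a tiny illustrative subset of the SMART English stop list referenced above, plus a few function words:

```python
from collections import Counter

# Illustrative subset of a real stop list (e.g. the SMART list), not complete.
STOP_WORDS = {"able", "about", "after", "allow", "became", "been", "before",
              "certainly", "clearly", "enough", "everywhere",
              "the", "a", "of", "in", "is"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def singletons(documents):
    """Words appearing exactly once across the whole collection."""
    counts = Counter(t for doc in documents for t in doc)
    return {t for t, c in counts.items() if c == 1}
```

Whether singletons should be indexed at all is the open question the slide raises; the function above merely identifies them.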

13 Item Normalization Stemming: removing suffixes, and sometimes prefixes, to reduce a word to its root form. –E.g. reformation, reformative, reformatory, reformed, and reformism can all be stemmed to reform (or even further, to form?) –This saves a considerable amount of space –However, one may lose the context of the search E.g. someone looking for reformation may get results that refer to reformatories (reform schools) –Syntactic stemmers vs. dictionary-based stemmers
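A toy syntactic stemmer illustrating the idea (this is not the Porter algorithm; the suffix list and minimum-stem-length rule are assumptions chosen so the slide's example works):

```python
# Suffixes tried in order, longest first; purely illustrative.
SUFFIXES = ["ation", "ative", "atory", "ism", "ed", "ing", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With these rules, reformation, reformative, reformatory, reformed, and reformism all reduce to reform, while nation survives intact because stripping "ation" would leave too short a stem. The conflation of reformation with reformatory is exactly the loss of context the slide warns about.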

14 Item Normalization Stemming Advantages –Reduces diversity of word representations Misspelled words are recognized Handles plurals and common suffixes –Increases recall Stemming Disadvantages –Retrieval of irrelevant documents (reduces precision) –Cannot be applied to proper nouns Currently available stemmers –Al Stem: http://tides.umiacs.umd.edu/software.html –http://www.nongnu.org/aramorph/javadoc/gpl/pierrick/brihaye/aramorph/lucene/ArabicStemmer.html –Porter Stemmer: http://maya.cs.depaul.edu/~classes/ds575/porter.html –http://webscripts.softpedia.com/scriptDownload/Porter-Stemmer-Download-42859.html

15 Index Construction Manual Indexing Automatic Indexing –Inverted File Structure –Signature Files –Vector Space Models

16 Manual Indexing Every document is catalogued based on some individual’s or group’s assessment of what that document is about, and an appropriate list of descriptive entries is generated. Advantage –Human indexers can establish relationships and concepts between seemingly different topics that can be very useful to future readers Broader, narrower and related subjects

17 Manual Indexing Disadvantages –Expensive –Time consuming (think of manually indexing the Web) –Can be subject to the background and personality of the indexer Cleverdon reported that if two groups of people construct thesauri in a particular subject area, the overlap of index terms is about 60% Moreover, if two indexers use the same thesaurus on the same document, the overlap of assigned index terms is only about 30% –May not be reproducible in case of modification or loss of information

18 Manual Indexing Manual indexing has shifted its focus toward “the abstraction of concepts and judgments on the value of the information” G. Kowalski, 1997

19 Manual Indexing Yahoo! (up to 1999) –Instead of using a web crawler, web masters submit URLs for Yahoo! to pursue. If Yahoo! thinks a site is appropriate, it is included in the index; otherwise it is not. Around a 30% acceptance rate. What about sites fitting in more than one category? However, precision increases because the index size is small

20 Manual Indexing EMBASE (Elsevier Science’s Bibliographic Database) Excerpta Medica DataBASE –Covers pharmacology and biomedicine –Uses machine-aided indexing to work hand in hand with manual indexing National Library of Medicine –Publishes MeSH (Medical Subject Headings) –Uses indexers to assign as many headings as necessary to characterize accurately the content of a journal article. H. W. Wilson Company (similar to the MeSH approach)

21 Automatic Indexing Using algorithms/software to extract terms for indexing is the predominant method for processing documents from large repositories. It consists of huge computerized robots crawling the Web day and night, collecting documents and indexing every word in the text. Concepts may result from the index construction stage (as with vector space models), or may feed the index construction (as with inverted file structures and signature files), which is similar to manual indexing.

22 Inverted File Structure Consists of a document file, inversion lists, and a dictionary. Document File –Each document is given a unique identifier –Processing tokens within the document are identified Dictionary –A sorted list of all unique words or processing tokens in the system, with a pointer to the location of each token's inversion list. –May also include the frequency of each term in the collection (global frequency) –N-grams and PAT trees are well-known data structures for processing dictionaries Inversion List –Contains, for each term, pointers to the documents that contain that term [and the positions within those documents].

23 Figure 1: Inverted File Structure DOCUMENTS –DOC#1: computer, bit, byte –DOC#2: memory, byte –DOC#3: computer, bit, memory –DOC#4: byte, computer DICTIONARY –bit (2), byte (3), computer (3), memory (2) INVERSION LISTS –bit: 1, 3 –byte: 1, 2, 4 –computer: 1, 3, 4 –memory: 2, 3
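The dictionary and inversion lists of Figure 1 can be built with a short sketch like the following (document contents taken from the figure; here each term's global frequency equals its document frequency, since no term repeats within a document):

```python
from collections import defaultdict

# Document file: unique identifier -> processing tokens (from Figure 1).
docs = {1: ["computer", "bit", "byte"],
        2: ["memory", "byte"],
        3: ["computer", "bit", "memory"],
        4: ["byte", "computer"]}

# Inversion lists: term -> sorted list of documents containing it.
inversion_lists = defaultdict(list)
for doc_id, tokens in sorted(docs.items()):
    for token in set(tokens):          # list each document once per term
        inversion_lists[token].append(doc_id)
for postings in inversion_lists.values():
    postings.sort()

# Dictionary: sorted terms with their global frequency.
dictionary = {term: len(postings)
              for term, postings in sorted(inversion_lists.items())}
```

Running this reproduces the lists in the figure, e.g. bit → 1, 3 and byte → 1, 2, 4.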

24 Inversion lists may also include the position within the document –May help in supporting queries of Phrases (consecutive keywords) Words within specified proximity
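A sketch of how recording positions enables phrase queries. The helper names and the example documents are illustrative, not from the source:

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id -> [positions]}, from doc_id -> token list."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Documents in which the words of `phrase` occur consecutively."""
    words = phrase.split()
    if not words or any(w not in index for w in words):
        return set()
    # Only documents containing every word can match.
    candidates = set(index[words[0]]).intersection(*(index[w] for w in words[1:]))
    result = set()
    for doc_id in candidates:
        for start in index[words[0]][doc_id]:
            # The i-th word must appear exactly i positions after the start.
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                result.add(doc_id)
                break
    return result
```

Proximity queries ("words within k positions of each other") follow the same pattern, relaxing the exact-offset check to a distance bound.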

25 Pros –When queries are interested only in recent information, only the latest databases need to be searched –Provides optimum performance –Concepts and their relationships can be stored

26 Cons –Space requirements, e.g. for a personal file system –Needs exact spelling

27 Signature File A signature file search is a linear scan of a compressed version of the items, producing a response time linear in the file size. In signature file indexing, each record is allocated a fixed-width signature, or bitstring, of w bits. Each word that appears in the record is hashed a number of times to determine which bits in the signature should be set

28 Signature File Any record whose signature has a 1-bit corresponding to every 1-bit in the query signature is a potential answer Each such record must be fetched and checked directly against the query to determine whether it is a false match or a true match. Many variants of signature files are available
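The scheme on these two slides can be sketched as follows. The signature width (w = 64), the number of hash functions per word (3), and the use of SHA-256 to derive bit positions are all assumptions for illustration:

```python
import hashlib

W = 64   # signature width w in bits (assumed for this sketch)
K = 3    # hashes per word (assumed)

def word_bits(word):
    """Hash a word K times to pick bit positions in a W-bit signature."""
    return {int.from_bytes(hashlib.sha256(f"{i}:{word}".encode()).digest()[:4],
                           "big") % W
            for i in range(K)}

def signature(words):
    """OR together the bits of every word in the record (or query)."""
    sig = 0
    for word in words:
        for b in word_bits(word):
            sig |= 1 << b
    return sig

def candidate_match(record_sig, query_sig):
    """True if every 1-bit of the query signature is set in the record's."""
    return record_sig & query_sig == query_sig
```

Because unrelated words can hash onto overlapping bits, a candidate match may be a false match, which is why each candidate record must still be fetched and checked against the actual query.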

29 Signature Files

30 Pros & Cons Pros –Supports ranked queries Cons –A variety of parameters must be fixed in advance –Expensive for disjunctive queries –Response time is unpredictable –Not scalable

