Documents and Indexing Readings Overview Topic Discussions Schedule Set Projects and Papers Ideas.

Documents and Indexing Readings Overview Topic Discussions Schedule Set Projects and Papers Ideas

Indexing and Searching Queries models work against the index - Find words, word counts, phrases - Sequential search, indexed search Inverted Files & Other Indices Boolean Queries Sequential Searching Pattern Matching Structural Queries Data structures - The infrastructure of search - Varied per data set and query contexts

Inverted Files - Vocabulary with counts for occurrences - Positions where each word (or char) is stored Character positions – exact – full inverted indexes Words referenced – local – block addressing – then words are found - Pointers can be made efficient, compressed - Fixed size or content based - Dependent on text location and access

Searching through Inverted Files 1.Vocabulary search - Query is parsed and words/phrases are determined 2.Retrieval of occurrences - List of words and locations 3.Manipulation of occurrences Occurrences are processed for IR operation Block search if needed Vocabulary is small(er) and starting point Sequence queries use more processing Fast in general Easy to update

Other Indices for Text - Suffix Trees Helps with phrases, patterns (ACGT) File, text is like a huge string Blocks of text seen as suffixes (from X to EOT) Each virtual block is different Great at finding word beginnings - Suffix Arrays Pointers to text (matches) Comparison to each pointer - Look, stop, look to EOF, EOT - Sorting and Merging can occur

Signature Files A different method (or is it?) Vocabulary indexed hashes (a signature) - Cut blocks of text of b words each to bit masks - Bit masks are just bins of bits - Signature is sequence of bit masks for all the blocks and a pointer to each block Search by converting (hashing) the query to bit mask and then comparing Finds sets of words easier (phrases) Block tweaking can go on forever Easy to update… add new bit masks

Boolean Queries A set of operations to manipulate the indices and produce results Algorithm based - Find document qualifiers - Make relevance judgment (value) of document - Retrieve query matches for review Mulit-phase, levels of processing (merging) Narrows scope of the query Parsing leaves from the syntax tree (AND/OR) Optimization is in the details & situations

Sequential Searching Search through a free-form data set (no indexed specific structure) Find where the query pattern occurs - Query shorter than text - Number of potential matches for query in text Simple case: - Start at first character - Proceed linearly through the text Huge variety of methods

String Search Methods Brute Force – search over all position in the text - Match? No, Next Yes, Note location - Next - Sequential Windows (framed) text scanning - Knuth-Morris-Pratt: slides frame over text the size of the search string, starting on the next character from each confirmed miss - Boyer-Moore: slides frame over text, moving ahead the length of the whole search string

More String Search Shift-Or - Uses binary representation, bit masks for searching with a frame - Strengths and weaknesses of hashes Phrase search - Look for longest word first using Window scan - Look for less frequent term first Which method is best? - It depends on the document and IR model for queries

Putting it all together… 1.Query 2.Index 3.Search Method - Vocabulary is matched - Results of matches are merged into a list of documents and text locations in documents You can see how the IR model, the query and the index all combine to enable the IR system

Structural Queries Working with a document that has explicit or implicit structure - Explicit – HTML, tags, spaces, numbers… - Implicit – Genre, Language, Formats… Tags count as vocabulary words (mostly) Index represents structure IR models allow acting on structure in index Do we ever not have some kind of structure? - Should structure be ignored or removed? - Do structures scale and have consistency?

Compression Search is possible directly in compressed text This is a huge win for disk access (and CPU) Language compression: - Heap - vocabulary - Zipf – document scaling Indexes compress well, sequence and differences can be small overall sizes

Evaluating Indices & Search Fixed indices are best for static or predictable text One, single search method may be tuned for static text Hybrid solutions are best for varied documents Hybrid solutions are best as you want to be prepared for document and query type diversity

Users and Searching Document sets and therefore indices are growing rapidly Users have increasing expectations for both accuracy and speed Compression methods help please users Structured information will transform search selection and IR model preferences Computation power solves some problems (mostly)

Keeping Up with the Changing Web Building Indices is difficult enough in theory What about a continuously changing huge volume of information? Is old information good? What does up-to-date mean anymore? Is Knowledge a depreciating commodity? - Correctness + Value over time Different information changes at different rates - Really it’s new information How do you update an index with constantly changing information?

Changing Web Properties Known distributions for information change Sites and pages may have easily identifiable patterns of update - 4% change on every observation - Some don’t ever change (links too) If you check and a page hasn’t changed, what is the probability it will ever change? Rate of change is related to rate of attention - Machines vs. Users - Measures can be compared along with information

Dynamic Maint. of Indexes w/Landmarks Web Crawlers do the work in gathering pages Incremental crawling means incremented indices - Rebuild the whole index more frequently - Devise a scheme for updates (and deletions) - Use supplementary indices (i.e. date) New documents Changed documents 404 documents

Landmarks for Indexing Difference-based method Documents that don’t change are landmarks - Relative addressing - Clarke: block-based - Glimpse: chunking Only update pointers to pages Tags and document properties are landmarked Broader pointers mean less updates Faster indexing – Faster access?

Yahoo! Cataloging the Web How do information professionals build an “index” of the Web? Cataloging applies to the Web Indexing with synonyms Browsing indexes vs searching them Comprehensive index not the goal - Quality - Information Density Yahoo’s own ontology – points to site for full info Subject Trees with aliases (@) to other locations “More like this” comparisons as checksums

Yahoo uses tools for indexing

Investigation of Documents from the WWW What properties do Web documents have? What structure and formats do Web documents use? What properties do Web documents have? - Size – 4K avg. - Tags – ratio and popular tags - MIME types (file extensions) - URL properties and formats - Links – internal and external - Graphics - Readability

WWW Documents Investigation How do you collect data like this? - Web Crawler URL identifier, link follower - Index-like processing Markup parser, keyword identifier Domain name translation (and caching) How do these facts help with indexing? Have general characteristics changed? (This would be a great project to update.)

Properties of Highly-Rated Web Sites What about whole Web sites? What is a Web site? - Sub-sites? - Specific contextual, subject-based parts of a Web site? - Links from other Web pages: on the site and off - Web site navigation effects Will experts (like Yahoo catalogers) like a site?

Properties Links & formatting Graphics – one, but not too many Text formatting – 9 pt. with normal style Page (layout) formatting – min. colors Page performance (size and acess) Site architecture (pages, nav elements) - More links within and external - Interactive (search boxes, menus) Consistency within a site is key How would a user or index builder make use of these?

Extra Discussion Little Words, Big Difference - The difference that makes a difference - Singular and plural noun identification can change indices and retrieval results - Language use differences Decay and Failures - Dead links - Types of errors - Huge amount of dead links (PageRank effective) 28% in 1995-1999 Computer & CACM 41% in 2002 articles Better than the average Web page?

Topic Discussions Set Leading WIRED Topic Discussions - About 20 minutes reviewing issues from the week’s readings Key ideas from the readings Questions you have about the readings Concepts from readings to expand on - PowerPoint slides - Handouts - Extra readings (at least a few days before class) – send to wired listserv

Web IR Evaluation - 5 page written evaluation of a Web IR System - technology overview (how it works) Not an eval of a standard search engine Only main determinable diff is content - a brief overview of the development of this type of system (why it works better) - intended uses for the system (who, when, why) - (your) examples or case studies of the system in use and its overall effectiveness

How can (Web) IR be better? - Better IR models - Better User Interfaces More to find vs. easier to find Scriptable applications Projects and/or Papers Overview

Project Ideas Searchable Personal Digital Library Browser hacks for searching Mozilla keeps all the pages you surf so you can search through them later - Mozilla hack - Local search engin

Paper Ideas New datasets for IR - Search on the Desktop – issues, previous research and ideas - Collaborative searching – advantages and potential, but what about privacy?

Documents and Indexing Readings Overview Topic Discussions Schedule Set Projects and Papers Ideas.

Similar presentations

Presentation on theme: "Documents and Indexing Readings Overview Topic Discussions Schedule Set Projects and Papers Ideas."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Documents and Indexing Readings Overview Topic Discussions Schedule Set Projects and Papers Ideas.

Similar presentations

Presentation on theme: "Documents and Indexing Readings Overview Topic Discussions Schedule Set Projects and Papers Ideas."— Presentation transcript:

Similar presentations

About project

Feedback