IR Implementation Issues, Web Crawlers and Web Search Engines
University of California, Berkeley, School of Information Management and Systems
SIMS 202: Information Organization and Retrieval (8/28/97)

Review
Boolean Retrieval
Ranked Retrieval
Vector Space Model

[Figure: overview of the retrieval process — an information need is expressed as a query (text input), which is parsed; document collections are pre-processed into an index; retrieval ranks or matches documents against the query.]

Boolean Model
[Figure: Venn diagram of three index terms t1, t2, t3 over documents D1–D11; the eight regions m1–m8 correspond to the eight possible conjunctions of the terms and their negations.]

Boolean Searching
Information need: "Measurement of the width of cracks in prestressed concrete beams"
Concepts: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P)
Formal query: cracks AND beams AND width_measurement AND prestressed_concrete
Relaxed query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)

Boolean Problems
Disjunctive (OR) queries lead to information overload
Conjunctive (AND) queries lead to reduced, and commonly zero, results
Conjunctive queries imply a reduction in recall

Advantages and Disadvantages of the Boolean Model
Advantages:
– Complete expressiveness for any identifiable subset of the collection
– Exact and simple to program
– The whole panoply of Boolean algebra is available
Disadvantages:
– Complex query syntax is often misunderstood (if understood at all)
– Problems of null output and information overload
– Output is not ordered in any useful fashion

Boolean Extensions
Fuzzy logic
– Adds weights to each term/concept
– ta AND tb is interpreted as MIN(w(ta), w(tb))
– ta OR tb is interpreted as MAX(w(ta), w(tb))
Proximity/adjacency operators
– Interpreted as additional constraints on Boolean AND
TOPIC system
– Uses various weighted forms of Boolean logic and proximity information in calculating RSVs (retrieval status values)
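As a concrete illustration of the fuzzy interpretation above, here is a minimal sketch; the document identifiers and term weights are hypothetical, and weights are assumed to be normalized to [0, 1]:

```python
# Fuzzy-Boolean scoring: AND -> MIN of term weights, OR -> MAX of term weights.
# Per-document term weights are assumed to be normalized to [0, 1] (hypothetical values).

def fuzzy_and(*weights):
    return min(weights)

def fuzzy_or(*weights):
    return max(weights)

docs = {
    "d1": {"cracks": 0.8, "beams": 0.3},
    "d2": {"cracks": 0.2, "beams": 0.7},
}

# Score each document for the query "cracks AND beams".
for doc_id, w in docs.items():
    score = fuzzy_and(w.get("cracks", 0.0), w.get("beams", 0.0))
    print(doc_id, score)   # d1 -> 0.3, d2 -> 0.2
```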

Vector Space Model
Documents are represented as vectors in term space
– Terms are usually stems
– Documents are represented by binary vectors of terms
Queries are represented the same way as documents
Query and document weights are based on the length and direction of their vectors
A vector distance measure between the query and documents is used to rank retrieved documents

Documents in Vector Space
[Figure: documents D1–D11 plotted as vectors in a three-dimensional term space with axes t1, t2, t3.]

Vector Space Documents and Queries
[Figure: documents D1–D11 and a query represented over the terms t1, t2, t3.]

Similarity Measures
Simple matching (coordination level match)
Dice's coefficient
Jaccard's coefficient
Cosine coefficient
Overlap coefficient
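The slide names these measures without giving their formulas; below is a minimal sketch of the usual set-based (binary-vector) definitions, which are the standard textbook forms rather than something taken from this deck, applied to a hypothetical query and document:

```python
# Set-based similarity coefficients between a query Q and a document D,
# both represented as sets of terms (binary vectors).

def simple_matching(q, d):   # coordination level match: |Q ∩ D|
    return len(q & d)

def dice(q, d):              # 2|Q ∩ D| / (|Q| + |D|)
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):           # |Q ∩ D| / |Q ∪ D|
    return len(q & d) / len(q | d)

def cosine(q, d):            # |Q ∩ D| / sqrt(|Q| * |D|)
    return len(q & d) / (len(q) * len(d)) ** 0.5

def overlap(q, d):           # |Q ∩ D| / min(|Q|, |D|)
    return len(q & d) / min(len(q), len(d))

q = {"cracks", "beams", "width"}
d = {"cracks", "beams", "concrete", "prestressed"}
print(simple_matching(q, d), round(dice(q, d), 2), round(jaccard(q, d), 2))
```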

Vector Space with Term Weights and Cosine Matching
Di = (di1, wdi1; di2, wdi2; …; dit, wdit)
Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit)
Example in a two-term space (Term A, Term B): Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
[Figure: Q, D1 and D2 plotted in the two-dimensional term space; documents are ranked by the cosine of the angle between each document vector and the query vector.]
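A worked version of the slide's two-term example, using the conventional cosine formula (dot product divided by the product of the vector lengths); only the vectors Q, D1 and D2 come from the slide:

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    return dot / (math.sqrt(sum(w * w for w in q)) * math.sqrt(sum(w * w for w in d)))

Q  = (0.4, 0.8)   # query weights for (Term A, Term B), from the slide
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

for name, d in (("D1", D1), ("D2", D2)):
    print(name, round(cosine(Q, d), 3))
# D2 scores higher than D1: its vector points in nearly the same direction as Q.
```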

Problems with Vector Space
There is no real theoretical basis for the assumption of a term space
– it is more for visualization than having any real basis
– most similarity measures work about the same regardless of model
Terms are not really orthogonal dimensions
– Terms are not independent of all other terms

Today
Probabilistic Retrieval (Introduction)
Processing Ranked Queries (the role of inverted files)
Web Crawlers – Distributed indexing of the WWW
Probabilistic Retrieval (Details)

Probabilistic Retrieval
Goes back to the 1960s (Maron and Kuhns)
Robertson's "Probabilistic Ranking Principle"
– Retrieved documents should be ranked in decreasing order of the probability that they are relevant to the user's query.
– How do we estimate these probabilities?
– Several methods (Model 1, Model 2, Model 3) with different emphases on how the estimates are done.

Probabilistic Models: Some Notation
D = all present and future documents
Q = all present and future queries
(Di, Qj) = a document-query pair
x = a class of similar documents; y = a class of similar queries
Relevance is a relation R over D × Q: the set of document-query pairs (Di, Qj) for which Di is judged relevant to Qj.

Probabilistic Models
Model 1 – Probabilistic Indexing, P(R | y, Di)
Model 2 – Probabilistic Querying, P(R | Qj, x)
Model 3 – Merged Model, P(R | Qj, Di)
Model 0 – P(R | y, x)
Probabilities are estimated based on prior usage or relevance estimation

Probabilistic Models
Rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query
Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
Relies on accurate estimates of probabilities for accurate results

Vector and Probabilistic Models
Both support "natural language" queries
Both treat documents and queries the same
Both support relevance feedback searching
Both support ranked retrieval
They differ primarily in theoretical basis and in how the ranking is calculated
– The vector model assumes relevance corresponds to similarity
– The probabilistic model relies on relevance judgments or estimates

Web Search Engines
Most include some version of the vector space model or extended Boolean
Some offer both "ranked" and Boolean searching, but not together
Some engines (such as those based on the original WAIS) are little more than coordination-level matching for ranked retrieval

Web Search Engines
Some engines use added natural language processing techniques to identify concepts
– Lycos is based on work by Michael Mauldin at CMU
– Excite's "concept-based" search may be a development of Latent Semantic Indexing
Some search engines use probabilistic methods (with proprietary extensions)
– Inktomi/HotBot uses a form of SLR

Web Search Engines
Exact algorithms are not available for commercial WWW search engines
Many search engines appear to be hybrids offering both ranked and Boolean elements

Web Search Conclusions
Web search engines are stretching the performance limits of ranked retrieval algorithms
Most web search engines today attempt to combine the best features of ranked and Boolean searching
There is still a long way to go before all and only the relevant web pages are retrieved in response to your query

Web Crawlers
How do web search engines get all of the items they index?
How do you store millions of words from hundreds of sites so that you can find them quickly (and efficiently)?

Depth-First Crawling
[Figure: a small web of pages linked across Sites 1–6, traversed depth-first: the crawler follows each chain of links as far as it goes before backtracking to unexplored pages.]

Breadth-First Crawling
[Figure: the same set of linked pages across Sites 1–6, traversed breadth-first: the crawler visits all pages at one link depth before following links to the next level.]
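A minimal sketch contrasting the two crawl orders over a toy link graph; the graph, page names, and helper function are hypothetical, and a real crawler would also fetch pages over HTTP, obey robots.txt, and bound its frontier:

```python
from collections import deque

# Toy link graph: page -> pages it links to (hypothetical structure).
links = {
    "site1/page1": ["site1/page2", "site2/page1"],
    "site1/page2": ["site3/page1"],
    "site2/page1": ["site1/page1"],
    "site3/page1": [],
}

def crawl(start, breadth_first=True):
    """Return pages in the order a crawler would visit them."""
    frontier = deque([start])
    seen, order = {start}, []
    while frontier:
        # Breadth-first takes from the front of the queue; depth-first from the back.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for out in links.get(page, []):
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return order

print(crawl("site1/page1", breadth_first=True))   # BFS order
print(crawl("site1/page1", breadth_first=False))  # DFS order
```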

Inverted Files
We have already seen "vector files". Conceptually, an inverted file is a vector file "inverted" so that rows become columns and columns become rows.

How Are Inverted Files Created?
Documents are parsed to extract words (or stems), and these are saved with the document ID.
Doc 1: "Now is the time for all good men to come to the aid of their country"
Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"

How Inverted Files Are Created
After all documents have been parsed, the inverted file is sorted.

How Inverted Files Are Created
Multiple term entries for a single document are merged, and frequency information is added.

How Inverted Files Are Created
The file is split into a Dictionary and a Postings file.
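A minimal end-to-end sketch of the four steps just described (parse, sort, merge with frequencies, split into dictionary and postings), applied to the two example documents above; the simple lowercase tokenizer is an assumption:

```python
from collections import Counter, defaultdict
import re

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# 1. Parse: extract (term, doc_id) pairs.
pairs = [(w, doc_id)
         for doc_id, text in docs.items()
         for w in re.findall(r"[a-z]+", text.lower())]

# 2. Sort the pairs by term (then by document ID).
pairs.sort()

# 3. Merge duplicate entries and add within-document frequencies.
freqs = Counter(pairs)          # (term, doc_id) -> frequency

# 4. Split into a dictionary (term -> document frequency) and postings (term -> [(doc_id, tf)]).
postings = defaultdict(list)
for (term, doc_id), tf in sorted(freqs.items()):
    postings[term].append((doc_id, tf))
dictionary = {term: len(plist) for term, plist in postings.items()}

print(dictionary["country"], postings["country"])   # 2 [(1, 1), (2, 1)]
print(dictionary["time"], postings["time"])         # 2 [(1, 1), (2, 1)]
```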

Inverted Files
Permit fast search for individual terms
The search result for each term is a list of document IDs (and, optionally, frequency and/or positional information)
These lists can be used to solve Boolean queries:
– country: d1, d2
– manor: d2
– country AND manor: d2
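The Boolean example above can be answered by intersecting sorted postings lists; a minimal sketch (the toy postings mirror the country/manor example):

```python
# Solving the slide's Boolean example by merging sorted postings lists.
# Postings here are just sorted document-ID lists (a hypothetical toy index).
country = [1, 2]
manor   = [2]

def intersect(p1, p2):
    """AND: walk both sorted lists in step, keeping doc IDs present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(country, manor))   # [2]  -> "country AND manor" matches d2 only
```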

Inverted Files
Lots of alternative implementations
– E.g., Cheshire builds within-document frequencies using a hash table during parsing
– Document IDs and frequency information are then stored in a B-tree index keyed by the term
See the chapter on inverted files in the reader for other implementations.

Probabilistic Models (Again)
Rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query
Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
Relies on accurate estimates of probabilities for accurate results

Probabilistic Models: Logistic Regression
Estimates of relevance are based on a log-linear model with various statistical measures of document content as independent variables.
The log odds of relevance is a linear function of the attributes, with the term contributions summed:
log O(R | Q, D) = c0 + Σi ci Xi
The probability of relevance is obtained by inverting the log odds (the logistic function):
P(R | Q, D) = e^(log O(R|Q,D)) / (1 + e^(log O(R|Q,D)))

Probabilistic Models: Logistic Regression Attributes
Average absolute query frequency
Query length
Average absolute document frequency
Document length
Average inverse document frequency
Inverse document frequency
Number of terms in common between query and document (logged)

Probabilistic Models: Logistic Regression
The probability of relevance is based on a logistic regression fitted over a sample set of documents to determine the values of the coefficients.
At retrieval time the probability estimate is obtained by:
P(R | Q, D) = e^(c0 + Σi ci Xi) / (1 + e^(c0 + Σi ci Xi))
for the six X attribute measures shown previously.
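A minimal sketch of how such a model might be applied at retrieval time; the coefficient and attribute values below are invented for illustration and are not the coefficients of any actual system:

```python
import math

def relevance_probability(attributes, coefficients, intercept):
    """Logistic-regression estimate: linear log odds -> probability of relevance."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, attributes))
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

# Hypothetical intercept c0 and coefficients c1..c6 (illustrative only).
c0 = -3.5
c  = [0.4, -0.2, 0.3, -0.1, 0.5, 1.2]

# Hypothetical attribute values X1..X6 for one query-document pair
# (query frequency, query length, document frequency, document length,
#  inverse document frequency, log of the number of matching terms).
x  = [1.8, 3.0, 2.1, 5.5, 4.2, 1.1]

print(round(relevance_probability(x, c, c0), 3))
# Documents are then ranked in decreasing order of this probability.
```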

Probabilistic Models
Advantages:
– Strong theoretical basis
– In principle should supply the best predictions of relevance given the available information
– Can be implemented similarly to the vector model
Disadvantages:
– Relevance information is required, or has to be "guestimated"
– Important indicators of relevance may not be terms, though usually only terms are used
– Optimally requires ongoing collection of relevance information