
Information Retrieval: Implementation issues Djoerd Hiemstra & Vojkan Mihajlovic University of Twente {d.hiemstra,v.mihajlovic}@utwente.nl

The lecture Ian H. Witten, Alistair Moffat, Timothy C. Bell, “Managing Gigabytes”, Morgan Kaufmann, Section 3. (For the exam, the compression methods in Section 3.3, i.e., the part with the grey bar left of the text, do not have to be studied in detail.) Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Computer Networks and ISDN Systems, 1998.

Overview Brute force implementation Text analysis Indexing Index coding and query processing Web search engines Wrap-up

Architecture 2000: the FAST search engine (Knut Risvik)

Architecture today 1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book - it tells which pages contain the words that match the query. 2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result. 3. The search results are returned to the user in a fraction of a second.

Storing the web More than 10 billion sites Assume each site contains 1000 terms Each term consists of 5 characters on average Each character is UTF-encoded and takes >= 2 bytes To store the web you need: –10^10 x 10^3 x 5 x 2B ~= 100TB What about: term statistics, hypertext info, pointers, search indexes, etc.? ~= 1PB Do we really need all this data?

Counting the web Text statistics: –Term frequency –Collection frequency –Inverse document frequency … Hypertext statistics: –Ingoing and outgoing links –Anchor text –Term positions, proximities, sizes, and characteristics …

Searching the web 100TB of data to be searched We would need to find a hard disk that large (currently the biggest are 250GB) Hard disk transfer time: 100MB/s Time needed to sequentially scan the data: 1 million seconds We would have to wait roughly 10 days to get the answer to a query That is not all …
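A quick back-of-the-envelope check of that scan time (a minimal sketch in Python; the collection size and transfer rate are the assumptions from the slide above):

    # Sequential scan time for the whole stored web.
    collection_bytes = 100 * 10**12      # 100 TB, assumed above
    bytes_per_second = 100 * 10**6       # 100 MB/s disk transfer rate
    seconds = collection_bytes / bytes_per_second
    print(f"{seconds:,.0f} seconds")     # 1,000,000 seconds
    print(f"{seconds / 86400:.1f} days") # ~11.6 days, i.e. on the order of 10 days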

Problems in web search Web crawling –Deal with limits, freshness, duplicates, missing links, loops, server problems, virtual hosts, etc. Maintain large cluster of servers –Page servers: store and deliver the results of the queries –Index servers: resolve the queries Answer 250 million user queries per day –Caching, replicating, parallel processing, etc. –Indexing, compression, coding, fast access, etc.

Implementation issues Analyze the collection –Avoid non-informative data for indexing –Decision on relevant statistics and info Index the collection –Which index type to use? –How to organize the index? Compress the data –Data compression –Index compression

Overview Brute force implementation Text analysis Indexing Index coding and query processing Web search engines Wrap-up

Term frequency Count how many times a term occurs in the collection (of size N terms) => frequency (f) Order the terms by descending frequency => rank (r) The product of the frequency of a word and its rank is approximately constant: f x r = C, C ~= N/10
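A minimal sketch of checking Zipf's law empirically (the corpus file name is a placeholder; any large text collection will do):

    from collections import Counter

    def zipf_table(tokens, top=10):
        # Term frequencies in descending order; Zipf predicts f * r roughly
        # constant, in the neighbourhood of N/10.
        freqs = sorted(Counter(tokens).values(), reverse=True)
        print("N/10 =", len(tokens) // 10)
        for rank, f in enumerate(freqs[:top], start=1):
            print(f"rank {rank}: f = {f}, f * r = {f * rank}")

    # Usage, with a placeholder corpus file:
    zipf_table(open("corpus.txt").read().lower().split())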

Zipf distribution (figure: term count against terms in rank order, shown on a linear scale and on a logarithmic scale)

Consequences Few terms occur very frequently: a, an, the, … => non-informative (stop) words Many terms occur very infrequently: spelling mistakes, foreign names, … => noise Medium number of terms occur with medium frequency => useful

Word resolving power (van Rijsbergen 79)

Heaps’ law for dictionary size (figure: number of unique terms as a function of collection size)
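Heaps' law is commonly written V(n) = k · n^β for a collection of n tokens. A minimal sketch (the constants k and β are typical textbook values, not numbers from this lecture):

    # Heaps' law: vocabulary size grows sublinearly with collection size.
    def vocabulary_size(n_tokens, k=50.0, beta=0.5):
        # k and beta are illustrative; real collections are fitted empirically.
        return k * n_tokens ** beta

    for n in (10**6, 10**9, 10**12):
        print(f"{n:>15,} tokens -> ~{vocabulary_size(n):,.0f} unique terms")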

Let’s store the web Let’s remove: –Stop words: N/10 + N/20 + … –Noise words ~ N/1000 –UTF => ASCII (1 byte per character) To store the web you then need: –only ~ 4/5 of the terms –4/5 x 10^10 x 10^3 x 5 x 1B ~= 40TB How to search this vast amount of data?

Overview Brute force implementation Text analysis Indexing Index coding and query processing Web search engines Wrap-up

Indexing How would you index the web? Document index Inverted index Postings Statistical information Evaluating a query Can we really search the web index? Bitmaps and signature files

Example

Document number  Text
1                Pease porridge hot, pease porridge cold
2                Pease porridge in the pot
3                Nine days old
4                Some like it hot, some like it cold
5                Some like it in the pot
6                Nine days old

Stop words: in, the, it.

Document index

Doc. id  cold  days  hot  like  nine  old  pease  porridge  pot  some
1        1     0     1    0     0     0    2      2         0    0
2        0     0     0    0     0     0    1      1         1    0
3        0     1     0    0     1     1    0      0         0    0
4        1     0     1    2     0     0    0      0         0    2
5        0     0     0    1     0     0    0      0         1    1
6        0     1     0    0     1     1    0      0         0    0

Storage: #docs x [log2 #docs] + #u_terms x #docs x 8b + #u_terms x (5 x 8b + [log2 #u_terms])
10^10 x 5B + 10^6 x 10^10 x 1B + 10^6 x (5 x 1B + 4B) ~= 10PB

Inverted index (1)

term      doc. id    term      doc. id
cold      1          cold      4
hot       1          hot       4
pease     1          like      4
pease     1          like      4
porridge  1          some      4
porridge  1          some      4
pease     2          like      5
porridge  2          pot       5
pot       2          some      5
days      3          days      6
nine      3          nine      6
old       3          old       6

One entry per term occurrence (term 4B, doc id 5B):
10^10 x 10^3 x (4B + 5B) + 10^6 x (5 x 1B + 4B) = 90TB

Inverted index (2)

term      tf  doc. id    term  tf  doc. id
cold      1   1          cold  1   4
hot       1   1          hot   1   4
pease     2   1          like  2   4
porridge  2   1          some  2   4
pease     1   2          like  1   5
porridge  1   2          pot   1   5
pot       1   2          some  1   5
days      1   3          days  1   6
nine      1   3          nine  1   6
old       1   3          old   1   6

One entry per (term, document) pair (term 4B, tf 1B, doc id 5B):
500 x 10^10 x (4B + 1B + 5B) + 10^6 x (5 x 1B + 4B) = 50TB
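A minimal sketch of building this kind of tf-carrying inverted index for the pease-porridge example (stop words as on the example slide):

    from collections import Counter, defaultdict

    DOCS = {
        1: "pease porridge hot pease porridge cold",
        2: "pease porridge in the pot",
        3: "nine days old",
        4: "some like it hot some like it cold",
        5: "some like it in the pot",
        6: "nine days old",
    }
    STOP = {"in", "the", "it"}

    def build_index(docs):
        index = defaultdict(list)        # term -> [(doc_id, tf), ...]
        for doc_id in sorted(docs):      # ascending doc ids keep lists sorted
            tf = Counter(t for t in docs[doc_id].split() if t not in STOP)
            for term, f in tf.items():
                index[term].append((doc_id, f))
        return index

    index = build_index(DOCS)
    print(index["pease"])                # [(1, 2), (2, 1)]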

Inverted index - Postings

Lexicon (term 5 x 1B, num. docs 5B, pointer 5B) and postings (doc. num 5B, tf 1B):

term      num. docs  pointer
cold      2          -> (1,1) (4,1)
days      2          -> (3,1) (6,1)
hot       2          -> (1,1) (4,1)
like      2          -> (4,2) (5,1)
nine      2          -> (3,1) (6,1)
old       2          -> (3,1) (6,1)
pease     2          -> (1,2) (2,1)
porridge  2          -> (1,2) (2,1)
pot       2          -> (2,1) (5,1)
some      2          -> (4,2) (5,1)

500 x 10^10 x (5B + 1B) + 10^6 x (5 x 1B + 5B + 5B) = 30TB + 15MB < 40TB

Inverted index - Statistics

The lexicon now also stores the collection frequency cf (term 5 x 1B, cf 5B, num. docs 5B, pointer 5B):

term      cf  num. docs  pointer
cold      2   2          ->
days      2   2          ->
hot       2   2          ->
like      3   2          ->
nine      2   2          ->
old       2   2          ->
pease     3   2          ->
porridge  3   2          ->
pot       2   2          ->
some      3   2          ->

500 x 10^10 x (5B + 1B) + 10^6 x (5 x 1B + 5B + 5B + 5B) = 30TB + 20MB

Inverted index querying Example: cold AND hot => doc1, doc4; score = 1/6 x 1/2 x 1/6 x 1/2 = 1/144 (using the cf values and postings from the index of the previous slide)
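Continuing the sketch above, a conjunctive query intersects the doc-id sets of the query terms' posting lists (the tf-based output below is purely illustrative, not the exact formula behind the 1/144 on the slide):

    def and_query(index, terms):
        # Intersect the document ids of every query term's posting list.
        postings = [dict(index[t]) for t in terms]   # doc_id -> tf
        docs = set(postings[0])
        for p in postings[1:]:
            docs &= set(p)
        # Illustrative result: the within-document term frequencies per match.
        return {d: [p[d] for p in postings] for d in sorted(docs)}

    print(and_query(index, ["cold", "hot"]))         # {1: [1, 1], 4: [1, 1]}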

Break: can we search the web? Number of postings (term-document pairs): –Number of documents: ~10^10 –Average number of unique terms per document (document size ~1000 terms): ~500 Number of unique terms: ~10^6 Formula: #docs x avg_tpd x ([log2 #docs] + [log2 max(tf)]) + #u_terms x (5 x [log2 #char_size] + [log2 N/10] + [log2 #docs/10] + [log2 (#docs x avg_tpd)]) With the byte-rounded sizes used above: 500 x 10^10 x (5B + 1B) + 10^6 x (5 x 1B + 5B + 5B + 5B) = 3 x 10^13 B + 2 x 10^7 B ~= 30TB Can we still make the search more efficient? –Yes, but let’s first take a look at other indexing techniques

Bitmaps For every term in the dictionary a bitvector is stored Each bit represents the presence or absence of the term in a document Example: cold AND pease => 100100 & 110000 = 100000 => doc 1

term      bitvector
cold      100100
days      001001
hot       100100
like      000110
nine      001001
old       001001
pease     110000
porridge  110000
pot       010010
some      000110

Per term: a bitvector of 10^10 bits ~= 1GB, plus the term itself (5 x 1B):
10^6 x (1GB + 5 x 1B) = 1PB
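A minimal sketch of bitmap retrieval, using Python integers as bitvectors (bit i, counting from the left, marks presence in document i+1, matching the table above):

    N_DOCS = 6
    BITMAP = {                       # from the table above
        "cold":  0b100100,
        "hot":   0b100100,
        "pease": 0b110000,
    }

    def bitmap_and(*terms):
        v = (1 << N_DOCS) - 1        # start with all documents
        for t in terms:
            v &= BITMAP[t]           # Boolean AND is one bitwise operation
        return [i + 1 for i in range(N_DOCS) if (v >> (N_DOCS - 1 - i)) & 1]

    print(bitmap_and("cold", "pease"))   # [1]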

Signature files A text index that stores a signature for each text block, to be able to filter out some blocks quickly A probabilistic method for indexing text: k hash functions generate n-bit values Signatures of two different words can be identical

Signature file example (table: each term is hashed to an n-bit string; each document’s signature is the bitwise OR of the hash strings of its terms)

nr.  Text
1    Pease porridge hot, pease porridge cold
2    Pease porridge in the pot
3    Nine days old
4    Some like it hot, some like it cold
5    Some like it in the pot
6    Nine days old

Signature file searching If the corresponding word signature bits are set in the document signature, there is a high probability that the document contains the word. cold: signature matches documents 1 & 4 => OK old: signature matches documents 2, 3, 5 & 6 => not OK: 2 & 5 are false matches => fetch the document at query time and check whether the word really occurs cold AND hot: signature matches documents 1 & 4 => OK Reduce the false hits by increasing the number of bits per term signature 10^10 x (5B + 1KB) = 10PB
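A minimal sketch of a signature file filter (the signature width, the number of hash functions, and the use of md5 are illustrative choices, not parameters from the lecture):

    import hashlib

    N_BITS, K = 16, 2                    # illustrative signature parameters

    def word_signature(word):
        # k hash functions, simulated by salting one hash k ways.
        sig = 0
        for i in range(K):
            h = hashlib.md5(f"{i}:{word}".encode()).digest()
            sig |= 1 << (int.from_bytes(h, "big") % N_BITS)
        return sig

    def block_signature(words):
        sig = 0
        for w in words:
            sig |= word_signature(w)     # OR the word signatures together
        return sig

    def maybe_contains(block_sig, word):
        w = word_signature(word)
        return block_sig & w == w        # candidate match; may be false

    blocks = ["pease porridge hot pease porridge cold".split(),
              "nine days old".split()]
    sigs = [block_signature(b) for b in blocks]
    print([maybe_contains(s, "cold") for s in sigs])   # e.g. [True, False]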

Indexing - Recap Inverted files –require less storage than the other two –more robust for ranked retrieval –can be extended for phrase/proximity search –numerous techniques exist for speed & storage space reduction Bitmaps –an order of magnitude more storage than inverted files –efficient for Boolean queries

Indexing – Recap 2 Signature files –an order (or two) of magnitude more storage than inverted files –require unnecessary accesses to the main text because of false matches –no in-memory lexicon needed –insertions can be handled easily Coded (compressed) inverted files are the state-of-the-art index structure used by most search engines

Overview Brute force implementation Text analysis Indexing Index coding and query processing Web search engines Wrap-up

Inverted file coding The inverted file entries are usually stored in order of increasing document number –[7; 2, 23, 81, 98, …] (the term “retrieval” occurs in 7 documents, with document identifiers 2, 23, 81, 98, etc.)

Query processing (1) Each inverted file entry is an ascending sequence of integers –this allows merging (joining) of two lists in time linear in the size of the lists –cf. Advanced Database Applications (211090): a merge join

Query processing (2) Usually queries are assumed to be conjunctive queries –query: information retrieval –is processed as information AND retrieval –intersecting the posting lists of information and retrieval gives: [23, 98] (see the sketch below)
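A minimal sketch of that linear-time list intersection (the two abbreviated posting lists are illustrative, chosen to be consistent with the intersection and union shown on these slides):

    def intersect(p1, p2):
        # Merge two ascending doc-id lists in O(len(p1) + len(p2)).
        i = j = 0
        out = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                out.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return out

    information = [1, 14, 23, 45, 84, 98, 111]   # abbreviated, illustrative
    retrieval = [2, 23, 81, 98]                  # abbreviated, illustrative
    print(intersect(information, retrieval))     # [23, 98]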

Query processing (3) Remember the Boolean model? –intersection, union and complement are done on posting lists –so, information OR retrieval –the union of the posting lists gives: [1, 2, 14, 23, 45, 46, 81, 84, 98, 111, 120, 121, 126, 139]

Query processing (4) Estimate the selectivity of the terms: –Suppose information occurs on 1 billion pages –Suppose retrieval occurs on 10 million pages Size of the postings (5 bytes per doc id): –1 billion x 5B = 5GB for information –10 million x 5B = 50MB for retrieval Hard disk transfer time: –50 sec. for information, 0.5 sec. for retrieval –(ignoring CPU time and disk latency)

Query processing (6) We just brought query processing down from 10 days to just 50.5 seconds (!) :-) Still... way too slow... :-(

Inverted file compression (1) Trick 1: store the sequence of doc ids –[2, 23, 81, 98, …] as a sequence of gaps –[2, 21, 58, 17, …] No information is lost. Posting lists are always processed from the beginning, so the gaps are easily decoded into the original sequence
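A minimal sketch of gap (delta) encoding and its decoding:

    def to_gaps(doc_ids):
        # Store each doc id as the difference from its predecessor.
        out, prev = [], 0
        for d in doc_ids:
            out.append(d - prev)
            prev = d
        return out

    def from_gaps(gaps):
        # Decode with a running sum while scanning from the start.
        out, total = [], 0
        for g in gaps:
            total += g
            out.append(total)
        return out

    print(to_gaps([2, 23, 81, 98]))     # [2, 21, 58, 17]
    print(from_gaps([2, 21, 58, 17]))   # [2, 23, 81, 98]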

Inverted file compression (2) Does it help? –the maximum gap is determined by the number of indexed web pages... –infrequent terms are coded as a few large gaps –frequent terms are coded as many small gaps Trick 2: use a variable-length encoding, such as the γ code.

Variable-length encoding (1) γ code: represent a number x as: –first the unary code for 1 + floor(log2 x) –then floor(log2 x) remainder bits: the binary code for x − 2^floor(log2 x) –the unary part specifies how many bits are required to code the remainder part For example x = 5 (floor(log2 5) = 2): –first bits: 110 –remainder: 01 –so x = 5 is coded as 11001
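A minimal sketch of the γ code as just defined (encoder and decoder working on bit strings):

    from math import floor, log2

    def gamma_encode(x):
        # Elias gamma code for x >= 1.
        n = floor(log2(x))
        unary = "1" * n + "0"                           # unary code for n + 1
        remainder = format(x - 2**n, f"0{n}b") if n else ""
        return unary + remainder

    def gamma_decode(bits):
        # Read the unary prefix, then n remainder bits.
        n = bits.index("0")
        return 2**n + (int(bits[n + 1:2 * n + 1], 2) if n else 0)

    print(gamma_encode(5))           # 11001: unary 110, remainder 01
    print(gamma_decode("11001"))     # 5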

Variable-length encoding (2) Examples of the γ code: x = 1 => 0; x = 2 => 100; x = 3 => 101; x = 4 => 11000; x = 5 => 11001; x = 6 => 11010; x = 7 => 11011; x = 8 => 1110000

Index sizes

Index size of “our Google” Number of postings (term-document pairs): –10 billion documents –500 unique terms per document on average –Assume on average 6 bits per (gap-encoded) doc id 500 x 10^10 x 6 bits ~= 4TB –about 15% of the uncompressed inverted file.

Query processing on compressed index Size of the postings (6 bits per doc id): –1 billion x 6 bits = 750MB for information –10 million x 6 bits = 7.5MB for retrieval Hard disk transfer time: –7.5 sec. for information, 0.075 sec. for retrieval –(ignoring CPU time, disk latency, and decompression time)

Query processing – Continued (1) We just brought query processing down from 10 days to just 50.5 seconds... and have now brought that down to 7.58 seconds :-) but that is still too slow... :-(

Early termination (1) Suppose we re-sort the document ids in each posting list such that the best documents come first –e.g., sort the document identifiers for retrieval by their tf.idf values –then the top 10 documents for retrieval can be retrieved very quickly: stop after processing the first 10 document ids from the posting list! –but compression and merging (multi-word queries) of postings are no longer possible...

Early termination (2) Trick 3: define a static (or global) ranking of all documents –such as Google’s PageRank (!) –re-assign document identifiers by PageRank, so that documents with a high PageRank get the small identifiers –for every term, documents with a high PageRank are then in the initial part of the posting list –estimate the selectivity of the query and only process part of the posting files
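A minimal sketch of early termination under that trick (it assumes doc ids were already re-assigned so that a smaller id means a higher static rank):

    def top_k_conjunctive(posting_lists, k):
        # Scan ascending doc ids; with ids assigned by static rank,
        # the first k matches are the k best-ranked matches.
        shortest, *rest = sorted(posting_lists, key=len)
        rest = [set(p) for p in rest]
        hits = []
        for doc in shortest:
            if all(doc in p for p in rest):
                hits.append(doc)
                if len(hits) == k:
                    break                 # early termination
        return hits

    print(top_k_conjunctive([[1, 3, 7, 9, 12], [3, 4, 9, 12]], k=2))  # [3, 9]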

Early termination (3) Probability that a document contains a term: –1 billion / 10 billion = 0.1 for information –10 million / 10 billion = 0.001 for retrieval Assume independence between terms: –0.1 x 0.001 = 0.0001 of the documents contain both terms –so, on average 1 in every 1 / 0.0001 = 10,000 documents contains information AND retrieval –for the top 30, process about 30 x 10,000 = 300,000 documents –300,000 / 10 billion = 0.00003 of the posting files

Query processing on compressed index with early termination Process about 0.00003 of the postings: –0.00003 x 750MB = 22.5kB for information –0.00003 x 7.5MB = 225 bytes for retrieval Hard disk transfer time: –0.2 msec. for information, 0.002 msec. for retrieval –(NB: ignoring CPU time, disk latency, and decompression time is no longer reasonable now)

Query processing – Continued (2) We just brought query processing down from 10 days to less than 1 ms! :-) “This engine is incredibly, amazingly, ridiculously fast!” (from “Top Gear”, every Thursday on BBC2)

Overview Brute force implementation Text analysis Indexing Compression Web search engines Wrap-up

Web page ranking Varies by search engine –Pretty messy in many cases –Details usually proprietary and fluctuating Combining subsets of: –Term frequencies –Term proximities –Term position (title, top of page, etc) –Term characteristics (boldface, capitalized, etc) –Link analysis information –Category information –Popularity information

What about Google? Google maintains the world’s largest Linux cluster (10,000 servers) These are partitioned between index servers and page servers –Index servers resolve the queries (massively parallel processing) –Page servers deliver the results of the queries Over 8 billion web pages are indexed and served by Google

Google: Architecture (Brin & Page 1997)

Google: Zlib compression A variant of LZ77 (gzip)

Google: Forward & Inverted Index

Google: Query evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
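A minimal sketch of this control flow (the barrel dictionaries and the rank function are illustrative stand-ins, not Google’s actual data structures):

    def evaluate(query, lexicon, short_barrels, full_barrels, rank, k=10):
        # Steps 1-2: parse the query and convert words into wordIDs.
        word_ids = [lexicon[w] for w in query.split()]
        matches = set()
        # Steps 3-4 and 6: try the short barrels first, fall back to full.
        for barrels in (short_barrels, full_barrels):
            doclists = [set(barrels.get(w, ())) for w in word_ids]
            matches = set.intersection(*doclists)
            if matches:
                break
        # Steps 5 and 8: rank the matching documents, return the top k.
        return sorted(matches, key=lambda d: rank(d, word_ids), reverse=True)[:k]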

Google: Storage numbers

Total Size of Fetched Pages               147.8 GB
Compressed Repository                      53.5 GB
Short Inverted Index                        4.1 GB
Full Inverted Index                        37.2 GB
Lexicon                                     293 MB
Temporary Anchor Data (not in total)        6.6 GB
Document Index Incl. Variable Width Data    9.7 GB
Links Database                              3.9 GB
Total Without Repository                   55.2 GB
Total With Repository                     108.7 GB

Google: Page search Web Page Statistics

Number of Web Pages Fetched   24 million
Number of URLs Seen           76.5 million
Number of Email Addresses      1.7 million
Number of 404's                1.6 million

Google: Search speed (table: CPU time and total time in seconds for the initial query and for the same query repeated with IO mostly cached, measured for the sample queries al gore, vice president, hard disks, and search engines)

Q’s What about web search today? How many pages? How many searches per second? Who is the best?

Web search November 2004

Search Engine  Reported Size
Google         8.1 billion
MSN            5.0 billion
Yahoo          4.2 billion (estimate)
Ask Jeeves     2.5 billion

Web search February 2003

Service           Searches per day
Google            250 million
Overture (Yahoo)  167 million
Inktomi (Yahoo)    80 million
LookSmart (MSN)    45 million
FindWhat           33 million
Ask Jeeves         20 million
AltaVista          18 million
FAST               12 million

Web search July 2005 (US)

Overview Brute force implementation Text analysis Indexing Compression Web search engines Wrap-up

Summary Term distribution and statistics –What is useful and what is not Indexing techniques (inverted files) –How to store the web Compression, coding, and querying –How to squeeze the web for efficient search Search engines –Google: first steps and now