National & Kapodistrian University of Athens, Dept. of Informatics & Telecommunications, MSc in Computer Systems Technology, Distributed Systems: Searching the Web.

Presentation transcript:

National & Kapodistrian University of Athens, Dept. of Informatics & Telecommunications, MSc in Computer Systems Technology, Distributed Systems. Searching the Web, by A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan. Presenter: Giorgos Matrozos (M 414).

This paper is about… search engines: their generic architecture, each component's architecture, and each component's design and implementation techniques: Crawling, Page Storage, Indexing, Link Analysis.

A Quick Look. Why use search engines, and why is their work hard? Ans: over a billion pages, a great growth rate, about 23% of pages update daily, and linking between pages is very complicated. What about Information Retrieval? Ans: IR techniques are used, but by themselves they are unsuitable, because they were designed for small, coherent collections; the Web, on the other hand, is massive, incoherent, distributed and rapidly changing.

Search Engine Components. A search engine consists of: a Crawler module, a Crawler Control module, a Page Repository, an Indexer module, a Collection Analysis module, a Utility Index, a Query Engine module, and a Ranking module.

General Search Engine Architecture

The Crawler module. The crawler starts with an initial set of URLs S0, held in a prioritized queue from which it retrieves URLs. It downloads the corresponding pages, extracts any new URLs and places them in the queue, and repeats until it decides to stop. Two questions arise: What pages should the Crawler download? Ans: Page Selection methods. How should the Crawler refresh pages? Ans: Page Refresh methods.
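To illustrate the loop just described, here is a minimal Python sketch of a crawler built around a prioritized URL queue. The `score` parameter stands in for whatever ordering metric is chosen (it and the crude link extraction are assumptions for illustration, not the paper's implementation).

```python
import heapq
import re
import urllib.request

def crawl(seed_urls, max_pages, score):
    """Minimal crawl loop: URLs wait in a priority queue ordered by an estimated importance score."""
    queue = [(-score(u), u) for u in seed_urls]   # negate scores: heapq is a min-heap
    heapq.heapify(queue)
    seen = set(seed_urls)
    downloaded = {}
    while queue and len(downloaded) < max_pages:
        _, url = heapq.heappop(queue)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                               # skip pages that fail to download
        downloaded[url] = html
        for link in re.findall(r'href="(http[^"]+)"', html):   # crude link extraction
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-score(link), link))
    return downloaded
```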

Page Selection. The Crawler may want to download important pages first, so that the collection is of good quality. But: What is important? How does the Crawler operate? How does the Crawler guess good pages? Hints: Importance Metrics, Crawler Models, Ordering Metrics.

Importance Metrics I. Interest Driven. Given a query Q, the importance IS(P) of a page P is defined as the textual similarity between P and Q. P and Q are treated as vectors in which the i-th component corresponds to the i-th word of the vocabulary; the weight w_i is the number of appearances of the word in the document times its idf (inverse document frequency), where idf is inversely proportional to the number of appearances of the word in the whole collection. IS(P) is then the cosine product of the P and Q vectors. Idf was not used here because it relies on global information; if idf factors are wanted, they must be estimated using reference idf values from earlier crawls. The resulting similarity is an estimate IS'(P), because the crawler has not yet seen the entire collection needed to compute the actual IS(P).
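A minimal sketch of the interest-driven similarity, assuming idf values estimated from an earlier crawl are available as a dictionary (the `idf` argument and tokenization are illustrative assumptions):

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Weight each term by (count of the term in the document) * idf; idf values are assumed precomputed."""
    counts = Counter(tokens)
    return {t: c * idf.get(t, 1.0) for t, c in counts.items()}

def cosine_similarity(p_tokens, q_tokens, idf):
    """IS(P): cosine of the angle between the tf-idf vectors of page P and query Q."""
    p, q = tfidf_vector(p_tokens, idf), tfidf_vector(q_tokens, idf)
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0
```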

Importance Metrics II. Popularity Driven. One way to define popularity is a page's backlink count, i.e. the number of links that point to the page; this count gives IB(P). Note that the Crawler can only compute an estimate IB'(P), because the actual metric needs information about the whole Web, and the estimate may be inaccurate early in the crawl. A more sophisticated but similar technique is PageRank.

Importance Metrics III. Location Driven. IL(P) is a function of the page's location (its URL), not of its contents: if URL u leads to P, then IL(P) is a function of u. This evaluates a page's importance through its address. Another signal used is the number of slashes appearing in the address: fewer slashes are considered more useful. Finally, the metrics can be combined: IC(P) = k1*IS(P) + k2*IB(P) + k3*IL(P).

Crawler Models I. For a given importance metric, the crawler's guesses must be evaluated with a quality metric. Crawl and Stop: the crawler starts with an initial page P0 and stops after K pages, where K is fixed (the number of pages downloaded in one crawl). A perfect crawler would have visited the pages R1…RK that rank highest according to the importance metric (the hot pages). The real crawler visits only M ≤ K of these hot pages, so its performance is P_CS(C) = M*100/K. A crawler with random visits downloads K of the T pages in the entire Web; each visited page is a hot page with probability K/T, so the expected number of hot pages when it stops is K²/T, and its performance is therefore K*100/T.

Crawler Models II. Crawl and Stop with Threshold. In this model there is an importance target G, and only pages with importance higher than G are considered hot; assume their number is H. The performance P_ST(C) is the percentage of the H hot pages that the crawler visits. If K < H, the ideal crawler reaches K*100/H; if K ≥ H, the ideal crawler reaches 100%. A random crawler is expected to visit (H/T)*K hot pages by the time it stops, so its performance is K*100/T.
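A small sketch of how the two performance measures could be computed after the fact, assuming the crawl log and the (normally unknown) global importance scores are available for evaluation; names and arguments are illustrative:

```python
def crawl_and_stop_performance(visited, hot_pages):
    """P_CS(C): percentage of the K crawled pages that are among the K hottest pages R_1..R_K."""
    k = len(visited)
    m = sum(1 for p in visited if p in hot_pages)   # hot_pages = the top-K pages by importance
    return 100.0 * m / k

def crawl_and_stop_with_threshold_performance(visited, importance, threshold):
    """P_ST(C): percentage of the H pages whose importance exceeds the target G that were crawled."""
    hot = {p for p, score in importance.items() if score > threshold}
    return 100.0 * sum(1 for p in visited if p in hot) / len(hot) if hot else 100.0
```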

Ordering Metrics. The ordering metric determines which URL the Crawler selects next from the queue. It can only use information the crawler has already seen, and it should be designed with the importance metric in mind: for example, if the crawler searches for high-popularity pages, the ordering metric is IB'(P); location metrics can also be used directly. It is hard to devise an ordering metric from the similarity metric, since we have not seen P yet.

Page Refresh. After downloading, the Crawler has to periodically refresh the pages. Two strategies: Uniform Refresh Policy: revisit all pages at the same frequency f, regardless of how often they change. Proportional Refresh Policy: assume λi is the change frequency of page ei and fi is the crawler's revisit frequency for ei; then the frequency ratio λi/fi is kept the same for all i.
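A toy sketch contrasting the two policies under a fixed daily revisit budget; page names, the budget, and the function signature are illustrative assumptions:

```python
def revisit_frequencies(change_freqs, total_visits_per_day, policy="uniform"):
    """Allocate a fixed daily revisit budget over pages under the two refresh policies."""
    n = len(change_freqs)
    if policy == "uniform":
        # every page is revisited equally often, regardless of how fast it changes
        return {page: total_visits_per_day / n for page in change_freqs}
    # proportional: f_i proportional to the page's change frequency lambda_i
    total_change = sum(change_freqs.values())
    return {page: total_visits_per_day * lam / total_change
            for page, lam in change_freqs.items()}

# e.g. revisit_frequencies({"e1": 9, "e2": 1}, total_visits_per_day=1, policy="proportional")
```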

Freshness and Age Metrics. Some definitions: the freshness of a local page ei at time t; the freshness of the local collection S at time t; the age of a local page ei at time t; the age of the local collection; and the time averages of the freshness and the age of ei and S, defined similarly. All of the above are approximations.
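The formulas on the original slide were images. For reference, the standard definitions from Cho and Garcia-Molina's freshness work, which this slide summarizes, are as follows, where m(ei) denotes the time of the first modification of the real page not yet reflected in the local copy:

```latex
F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}
\qquad
F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)

A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - m(e_i) & \text{otherwise} \end{cases}
\qquad
A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)

\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; \tau)\, d\tau
\qquad
\bar{A}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t A(e_i; \tau)\, d\tau
```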

Refresh Strategy I. Note that crawlers can download/update only a limited number of pages within a period, because they have limited resources. Consider a simple example: a collection of two pages e1 and e2, where e1 changes 9 times per day and e2 once a day. For e1 the day is split into 9 intervals, and e1 changes once and only once in each interval, but we do not know precisely when; e2 changes once and only once per day, again at an unknown time. Assume our crawler can refresh one page per day. But which page? If e2 changes in the middle of the day and we refresh right after, e2 will be up to date for the remaining half day. The probability that the change occurs before the middle of the day is 1/2, so the expected benefit of that refresh is 1/4 of a day of freshness, and so on for the other choices.

Refresh Strategy II. It can be mathematically proved that the uniform refresh policy is always superior or equal to the proportional policy, for any number of pages, change frequencies and refresh rates, and for both the freshness and the age metrics. The best solution assumes that pages change following a Poisson process and that their change frequencies are static. The mathematical proof and the idea behind this statement are described in "Cho, Garcia-Molina: Synchronizing a Database to Improve Freshness, International Conference on Management of Data, 2000".

Storage. The page repository must manage a large collection of web pages. There are four challenges. Scalability: it must be possible to distribute the repository across a cluster of computers and disks to cope with the size of the Web. Dual access modes: random access is used to quickly retrieve a specific web page, while streaming access is used to read out the entire collection; the first is used by the Query Engine and the second by the Indexer and Analysis modules. Large bulk updates: as new versions of pages are stored, the space occupied by the old versions must be reclaimed through compaction and reorganization. Obsolete pages: a mechanism is needed for detecting and removing obsolete pages.

Page Distribution Policies. Assumption: the repository is designed to function over a cluster of interconnected storage nodes. Uniform distribution: a page can be stored at any node, independently of its identifier. Hash distribution: the page identifier is hashed to yield a node identifier, and the page is stored at the corresponding node.
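A tiny sketch of the hash-distribution policy; the hash function and node count are illustrative choices, not the paper's:

```python
import hashlib

def node_for_page(page_id: str, num_nodes: int) -> int:
    """Hash distribution: hash the page identifier and map it to one of the storage nodes."""
    digest = hashlib.sha1(page_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# e.g. node_for_page("http://example.com/index.html", num_nodes=8)
```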

Physical Page Organization Methods. Within a node, three operations must be supported: page addition/insertion, high-speed streaming, and random page access. Methods: hash-based, log-structured, and hashed-log organizations.

Update Strategies. Batch-mode or Steady Crawler: a batch-mode crawler is a periodic crawler that crawls for a certain amount of time, so the repository receives updates only on a certain number of days each month; in contrast, a steady crawler crawls without any pause and updates the repository continuously. Partial or Complete crawls: depending on the crawl, the update can be In-place, meaning the pages are directly integrated into the repository's existing collection, possibly replacing older versions, or use Shadowing, meaning the pages are stored separately and the update is done in a separate step.

The Stanford WebBase repository. It is a distributed storage system that works with the Stanford WebCrawler. The repository employs a node manager to monitor the nodes and collect status information. Since the Stanford crawler is a batch crawler, the repository applies the shadowing technique. URLs are first normalized to yield a canonical representation, and the page identifier is computed as a signature of this normalized URL.
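A sketch of the normalize-then-sign idea; the particular normalization rules and hash used here are illustrative assumptions, not WebBase's actual ones:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def page_id(url: str) -> str:
    """Normalize a URL to a canonical form, then compute its signature as the page id."""
    scheme, netloc, path, query, _ = urlsplit(url.strip())
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):   # drop the default port
        netloc = netloc[:-3]
    path = path or "/"
    canonical = urlunsplit((scheme.lower(), netloc, path, query, ""))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# e.g. page_id("HTTP://Example.COM:80/index.html") == page_id("http://example.com/index.html")
```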

Indexing. Structure (or link) index: the Web is modeled as a graph whose nodes are pages and whose edges are the hyperlinks from one page to another. It provides neighborhood information: given a page P, retrieve the pages pointed to by P, or the pages pointing to P. Text (or content) index: text-based retrieval continues to be the primary method for identifying pages relevant to a query; indices to support this retrieval can be implemented with suffix arrays, inverted files/inverted indices, and signature files. Utility indices: special-purpose indices, such as site indices for searching within one domain only.
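For concreteness, a toy in-memory inverted index mapping each term to a postings list; real systems such as WebBase build this on disk in sorted runs, as the next slides describe (tokenization here is a simplification):

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to a postings list of (page_id, position) pairs."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((page_id, position))
    return index

# e.g. build_inverted_index({"p1": "web search engines", "p2": "web crawling"})["web"]
# -> [("p1", 0), ("p2", 0)]
```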

WebBase text-indexing system I. Three types of nodes: Distributors, which store the pages to be indexed; Indexers, which execute the core of the index-building engine; and Query servers, across which the final inverted index is partitioned. The inverted index is built in two stages. First, each distributor runs a process that disseminates the pages to the indexers, so that each indexer receives a mutually disjoint subset; the indexers extract postings, sort them and flush them to intermediate structures on disk. Second, these intermediate structures are merged to create an inverted file and its lexicon, and these (inverted file, lexicon) pairs are transferred to the query servers.

WebBase text-indexing system II. The core of the indexing is the index-builder process, which is parallelized as a pipeline of three phases: loading, processing and flushing. Loading: pages are read and stored in memory. Processing: pages are parsed and stored as a set of postings in a memory buffer; the postings are then sorted by term and then by location. Flushing: the sorted postings are saved to disk as a sorted run.
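A compressed sketch of the processing and flushing phases, assuming the pages are already loaded in memory; the posting layout and on-disk format are illustrative, not WebBase's actual structures:

```python
import json

def process_and_flush(pages, run_path):
    """Parse pages into postings, sort them, and flush one sorted run to disk."""
    postings = []
    for page_id, text in pages.items():                      # loading phase already done: pages in memory
        for location, term in enumerate(text.lower().split()):
            postings.append((term, page_id, location))        # processing: extract postings
    postings.sort()                                           # sort by term, then page, then location
    with open(run_path, "w") as run:                          # flushing: write one sorted run
        for term, page_id, location in postings:
            run.write(json.dumps([term, page_id, location]) + "\n")
```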

WebBase Indexing System Statistics I. One of the most commonly used statistics is idf. The idf of a term w is log(N/df_w), where N is the total number of pages in the collection and df_w is the number of pages that contain at least one occurrence of w. To avoid query-time overhead, WebBase computes and stores statistics as part of index creation. Avoiding explicit I/O for statistics: local data are sent to the statistician only when they are already available in memory; two strategies, ME and FL, send the local information during merging or during flushing, respectively. Local aggregation: multiple postings for a term pass through memory in groups; e.g. for 1000 postings for "cat", only the pair ("cat", 1000) needs to be sent to the statistician.
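A minimal sketch of the idf statistic defined above, computed directly over an in-memory collection (WebBase instead gathers the df counts incrementally via the statistician, as described):

```python
import math
from collections import Counter

def compute_idf(pages):
    """idf(w) = log(N / df_w), where df_w counts pages containing at least one occurrence of w."""
    n = len(pages)
    df = Counter()
    for text in pages.values():
        df.update(set(text.lower().split()))   # each page contributes at most once per term
    return {term: math.log(n / count) for term, count in df.items()}
```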

PageRank I. PageRank extends the basic idea of citation counting by taking into consideration the importance of the pages pointing to a given page: a page receives more importance if YAHOO points to it than if an unknown page does. Note that the definition of PageRank is recursive. Simple PageRank: let 1…m be the pages of the Web, N(i) the number of outgoing links from page i, and B(i) the set of pages that point to i; the rank of each page is then given by the formula below. This definition leads to the idea of random walks, the so-called Random Surfer Model: it can be proved that the PageRank of a page is proportional to the frequency with which a random surfer would visit it.
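The formula on the original slide was an image; with the definitions just given (B(i) the pages pointing to i, N(j) the out-degree of j), simple PageRank is:

```latex
r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}
```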

PageRank II. Practical PageRank: simple PageRank is well defined only if the link graph is strongly connected, which is not the case for the Web. A rank sink is a connected cluster of pages with no links leaving the cluster; a rank leak is a single page with no outgoing links. Two fixes are therefore applied: removal of all leak nodes (nodes with out-degree 0), and introduction of a decay factor d to solve the problem of sinks, giving the modified PageRank formula below, where m is the number of nodes in the graph.
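The modified formula on the original slide was also an image. One common way to write the damped PageRank over m nodes is the following; note that conventions differ between papers on whether d denotes the jump probability or the follow-a-link probability, so treat this as one standard variant:

```latex
r(i) = \frac{d}{m} + (1 - d) \sum_{j \in B(i)} \frac{r(j)}{N(j)}
```

And a plain power-iteration sketch of that variant (graph representation, default d, and iteration count are illustrative assumptions):

```python
def pagerank(links, d=0.15, iterations=50):
    """Power iteration for r(i) = d/m + (1-d) * sum over j in B(i) of r(j)/N(j).
    links: page -> list of pages it points to; leak nodes (out-degree 0) assumed already removed."""
    pages = list(links)
    m = len(pages)
    rank = {p: 1.0 / m for p in pages}
    for _ in range(iterations):
        new_rank = {p: d / m for p in pages}
        for j, outlinks in links.items():
            if not outlinks:
                continue
            share = (1 - d) * rank[j] / len(outlinks)
            for i in outlinks:
                if i in new_rank:          # ignore links leaving the crawled graph
                    new_rank[i] += share
        rank = new_rank
    return rank
```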

HITS I. A link-based search algorithm: Hyperlink-Induced Topic Search. Instead of producing a single ranking score, HITS produces two: an Authority score and a Hub score. Authority pages are those most likely to be relevant to a query, while Hub pages are not necessarily authorities themselves but point to several of them. The HITS algorithm: the basic idea is to identify a small subgraph of the Web and apply link analysis to it in order to locate the Authorities and the Hubs for a given query.

HITS II. Identifying the focused subgraph; then link analysis on it. Two kinds of operations are applied in each step, I and O, given below.
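The I and O operations shown as images on the original slide are, in Kleinberg's formulation, the following updates of the authority score a(p) and hub score h(p):

```latex
I:\quad a(p) = \sum_{q \,:\, q \rightarrow p} h(q)
\qquad\qquad
O:\quad h(p) = \sum_{q \,:\, p \rightarrow q} a(q)
```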

HITS III. The algorithm iteratively repeats the I and O steps, with normalization, until the hub and authority scores converge.
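A compact sketch of that iteration over the focused subgraph; the adjacency-dict representation and fixed iteration count (rather than a convergence test) are simplifying assumptions:

```python
import math

def hits(links, iterations=50):
    """Repeat the I and O operations with normalization on the focused subgraph.
    links: dict mapping each page in the subgraph to the list of pages it points to."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # I operation: authority score = sum of hub scores of pages pointing here
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # O operation: hub score = sum of authority scores of pages pointed to
        hub = {p: sum(auth[q] for q in links[p] if q in auth) for p in pages}
        # normalize both score vectors
        na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub
```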

Other Link-Based Techniques. Identifying Communities: an interesting problem is identifying communities on the Web; see refs [30] and [40]. Finding Related Pages: the Companion and Cocitation algorithms; see refs [22], [32] and [38]. Classification and Resource Compilation: the problem of automatically classifying documents; see refs [13], [14], [15].