Crawling, Ranking and Indexing

Organizing the Web
The Web is big. Really big.
– Over 3 billion pages, just in the indexable Web
The Web is dynamic
Problems:
– How to store a database of links?
– How to crawl the Web?
– How to recommend pages that match a query?

Architecture of a Search Engine
1. A web crawler gathers a snapshot of the Web
2. The gathered pages are indexed for easy retrieval
3. A user submits a search query
4. The search engine ranks pages that match the query and returns an ordered list

Indexing the Web
Once a crawl has collected pages, the full text is compressed and stored in a repository
Each URL is mapped to a unique ID
A document index is created
– For each document, contains a pointer into the repository, status, checksum, and pointers to URL & title
A hit list is created for each word in the lexicon
– Occurrences of a word in a particular document, including position, font, capitalization, "plain or fancy"
– Fancy: occurs in a title, tag, or URL
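The repository, document index, and hit lists described above can be sketched in a few lines of Python. This is a toy illustration, not the actual data layout; the names (DocEntry, add_document, the dict-based "repository") are all hypothetical, and every hit here is recorded as "plain" with only its position.

```python
import zlib
from dataclasses import dataclass

@dataclass
class DocEntry:
    url: str
    checksum: int          # used to detect duplicate or changed pages
    compressed: bytes      # stands in for a pointer into the repository

repository = {}            # docID -> DocEntry
hit_lists = {}             # (word, docID) -> list of hits

def add_document(doc_id, url, text):
    repository[doc_id] = DocEntry(url, zlib.crc32(text.encode()),
                                  zlib.compress(text.encode()))
    for pos, word in enumerate(text.lower().split()):
        # a real engine would also mark "fancy" hits (title, tag, URL);
        # here every hit is plain, with only its position recorded
        hit_lists.setdefault((word, doc_id), []).append({"pos": pos, "fancy": False})

add_document(1, "http://example.com", "the web is big really big")
print(hit_lists[("big", 1)])   # two hits, at positions 3 and 5
```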

Indexing the Web
Each word in the hit list has a wordID
A forward index is created
– 64 barrels; each contains a range of wordIDs
– If a document contains words for a particular barrel, the docID is added, along with a list of wordIDs and hit lists
– Maps words to documents
Wrinkle: can use TF-IDF to map only "significant" keywords
– Term Frequency * Inverse Document Frequency
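The TF-IDF weighting mentioned above can be computed directly from its definition. The tiny three-document corpus below is made up for illustration; note how words that occur in every document get an IDF (and hence TF-IDF) of zero, which is exactly why they are not "significant" keywords.

```python
import math
from collections import Counter

docs = ["the web is big", "the web is dynamic", "ranking the web pages"]
tokenized = [d.split() for d in docs]
N = len(tokenized)
# document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf(word, doc):
    tf = doc.count(word) / len(doc)        # term frequency in this document
    idf = math.log(N / df[word])           # inverse document frequency
    return tf * idf

# "the" occurs in every document, so its TF-IDF is zero
print(tfidf("the", tokenized[0]))   # 0.0
print(tfidf("big", tokenized[0]))   # positive: "big" is distinctive
```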

Indexing the Web
An inverted index is created
– The forward index sorted according to word
– For every valid wordID in the lexicon, create a pointer to the appropriate barrel
– Points to a list of docIDs and hit lists
– Maps keywords to URLs
Some wrinkles:
– Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding
– Semantic similarity: words with similar meanings share an index
– Issue: trading coverage (number of hits) for precision (how closely hits match the request)
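A minimal sketch of building a forward index and inverting it, with a crude suffix-stripping function standing in for real stemming (this is not Porter's algorithm; the documents and the stem rules are illustrative only):

```python
def stem(word):
    # toy stemmer: strip a couple of common suffixes, nothing more
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

docs = {1: "crawling the web", 2: "web crawlers index pages", 3: "page ranking"}

# forward index: docID -> list of (stemmed) words
forward = {doc_id: [stem(w) for w in text.lower().split()]
           for doc_id, text in docs.items()}

# inverted index: word -> set of docIDs containing it
inverted = {}
for doc_id, words in forward.items():
    for w in words:
        inverted.setdefault(w, set()).add(doc_id)

print(sorted(inverted["page"]))   # [2, 3] — "pages" and "page" now share an entry
```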

Indexing Issues
Indexing techniques were designed for static collections
How to deal with pages that change?
– Periodic crawls, rebuild the index
– Varied-frequency crawls
– Records need a way to be "purged"; a hash of each page is stored
The text of a link to a page can be used to help label that page
– Helps eliminate the addition of spurious keywords

Indexing Issues
Availability and speed
– Most search engines will cache the page being referenced
Multiple search terms
– OR: separate searches concatenated
– AND: intersection of searches computed
– Regular expressions not typically handled
Parsing
– Must be able to handle malformed HTML and partial documents
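The AND case above amounts to intersecting posting lists. A standard way to do this is the two-pointer walk over sorted docID lists, sketched below (the posting lists are made up for illustration):

```python
postings = {
    "web":    [1, 2, 4, 7, 9],   # sorted docIDs containing "web"
    "search": [2, 3, 7, 8],
}

def and_query(a, b):
    # classic two-pointer intersection of two sorted posting lists
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def or_query(a, b):
    # union of the two result sets, kept sorted
    return sorted(set(a) | set(b))

print(and_query(postings["web"], postings["search"]))  # [2, 7]
print(or_query(postings["web"], postings["search"]))   # [1, 2, 3, 4, 7, 8, 9]
```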

Ranking
The primary challenge of a search engine is to return results that match a user's needs
A word will potentially map to millions of documents
How to order them?

PageRank
Google uses PageRank to determine relevance
Based on the "quality" of a page's inward links
A simplified version:
– Let N_v be the number of outward links of page v
– R(u) = c * Σ_{v ∈ inlinks(u)} R(v) / N_v
– c is a normalizing factor

PageRank
Sum, over each page that points to a given page, its PageRank divided by its outdegree
Let p be a page, with T_1 … T_n linking to p
PR(p) = (1 - d) + d * Σ_{i=1..n} PR(T_i) / C(T_i)
– C(T_i) is the outdegree of T_i; d is a "damping" factor
PR "propagates" through a graph
– Defined recursively, but can be computed iteratively
– Repeat until PR does not change by more than some delta
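The iterative computation described above can be sketched directly: start every page at rank 1, apply the formula repeatedly, and stop when no rank moves by more than some delta. The three-page link graph and the parameter values are illustrative only.

```python
# page -> list of outward links
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
d, delta = 0.85, 1e-6

pr = {p: 1.0 for p in links}
while True:
    new = {}
    for p in links:
        # sum PR(T)/outdegree(T) over every page T that links to p
        incoming = sum(pr[t] / len(links[t]) for t in links if p in links[t])
        new[p] = (1 - d) + d * incoming
    done = max(abs(new[p] - pr[p]) for p in pr) < delta
    pr = new
    if done:
        break

print({p: round(v, 3) for p, v in sorted(pr.items())})
```

Page c ends up with the highest rank: it receives b's entire rank plus half of a's, while b gets only half of a's.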

PageRank
Intuition: a page is useful if many popular sites link to it
Justification:
– Imagine a random surfer who keeps clicking through links; (1 - d) is the probability she starts a new search
Pros: difficult to game the system
Cons: creates a "rich get richer" web structure where highly popular sites grow in popularity

HITS
HITS is also commonly used for document ranking
Gives each page a hub score and an authority score
– A good authority is pointed to by many good hubs
– A good hub points to many good authorities
– Users want good authorities
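The mutually recursive definition above becomes an iteration: repeatedly set each page's authority score to the sum of the hub scores pointing at it, set each hub score to the sum of the authority scores it points to, and normalize. The four-page graph below is made up for illustration (two pure hubs pointing at two candidate authorities).

```python
import math

links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "a1": [], "a2": ["a1"]}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):
    # authority(p) = sum of hub scores of pages linking to p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # hub(p) = sum of authority scores of pages p links to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # normalize each vector so the scores don't blow up
    na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
    nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

# a1 is pointed to by every hub, so it ends up the top authority
print(max(auth, key=auth.get))
```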

Hubs and Authorities
Common community structure
– Hubs: many outward links; lists of resources
– Authorities: many inward links; provide resources, content

Hubs and Authorities Hubs Authorities Link structure estimates over 100,000 Web communities Often not categorized by portals

Issues with Ranking Algorithms
Spurious keywords and META tags
Users reinforcing each other
– Increases the "authority" measure
Link similarity vs. content similarity
Topic drift
– Many hubs link to more than one topic

Crawling the Web
How to collect Web data in the first place?
Spiders are used to crawl the web and collect pages
– A page is downloaded and its outward links are found
– Each outward link is then downloaded
– Exceptions: links from CGI interfaces; the Robot Exclusion Standard
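The Robot Exclusion Standard mentioned above is the robots.txt convention, and Python's standard library ships a parser for it. The robots.txt content below is made up for illustration; a real crawler would fetch it from the target site.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# a polite spider checks before downloading each page
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))    # False
```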

Crawling the Web
We may want to be a bit smarter about selecting documents to crawl
– The Web is too big
– Building a special-purpose search engine
– Indexing a particular site
Choosing where to go first is a hard problem

Crawling the Web
Basic algorithm:
– Let Q be a queue, and S a starting node
– Enqueue(Q, S)
– While (not empty(Q))
–   W = dequeue(Q)
–   V_1, …, V_n = outward_links(W)  <- this is called the frontier
–   Enqueue(Q, V_1, …, V_n)
The Enqueue function is the tricky part
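The queue-based loop above can be sketched as a breadth-first crawl. Here fetch_links is a stub standing in for downloading a page and extracting its outward links, and the toy link graph is made up for illustration; a seen-set is added so pages are not crawled twice.

```python
from collections import deque

# toy link graph standing in for the real Web
web = {
    "S": ["p1", "p2"], "p1": ["p3"], "p2": ["p3", "p4"], "p3": [], "p4": [],
}

def fetch_links(url):
    # stub: a real crawler would download the page and parse out its links
    return web.get(url, [])

def crawl(start):
    q = deque([start])
    seen = {start}
    order = []
    while q:
        w = q.popleft()
        order.append(w)
        for v in fetch_links(w):      # the frontier
            if v not in seen:         # don't enqueue a page twice
                seen.add(v)
                q.append(v)
    return order

print(crawl("S"))  # ['S', 'p1', 'p2', 'p3', 'p4']
```

Swapping the FIFO deque for a priority queue is exactly where the "tricky" Enqueue choices (best-first, PageRank-guided) plug in.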

Crawling the Web
BestFirst
– Sorts the queue according to cosine similarity
– Sim(S, V) numerator: Σ_{w ∈ S ∩ V} f_wS * f_wV
– Sim(S, V) denominator: sqrt(Σ_{w ∈ S} f_wS^2 * Σ_{w ∈ V} f_wV^2)
– This is the cosine of the angle between the term-frequency vectors
Expand the documents most similar to the starting document
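The Sim(S, V) formula above, computed from term-frequency vectors (the example documents are illustrative):

```python
import math
from collections import Counter

def sim(s, v):
    # f_wS, f_wV: term frequencies of word w in documents S and V
    fs, fv = Counter(s.split()), Counter(v.split())
    shared = set(fs) & set(fv)
    num = sum(fs[w] * fv[w] for w in shared)
    den = math.sqrt(sum(c * c for c in fs.values()) *
                    sum(c * c for c in fv.values()))
    return num / den

start = "web search engines rank pages"
close = "search engines index web pages"
far = "cooking pasta at home"
print(sim(start, close))  # 0.8: four of five words shared
print(sim(start, far))    # 0.0: no words shared
```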

Crawling the Web
PageRank can also be used to guide a crawl
– PageRank was designed to model a random walk through a web graph
– Select pages probabilistically based on their PageRank
– One issue: PageRank must be recomputed frequently
Leads to a crawl of the most "valuable" sites

Web structure
Structure is important for:
– Predicting traffic patterns: who will visit a site? Where will visitors arrive from? How many visitors can you expect?
– Estimating coverage: is a site likely to be indexed?

Core
Compact
– Short paths between sites
– "Small world" phenomenon: distances are small relative to the size of the network
– The number of inward and outward links follows a power law
Mechanism: preferential attachment
– As new sites arrive, the probability of gaining an inward link is proportional to in-degree
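Preferential attachment is easy to simulate: each new site links to an existing site chosen with probability proportional to its in-degree (plus one, so that sites with no links yet can still be chosen). All parameters below are illustrative; the point is the heavy-tailed result.

```python
import random

random.seed(0)
indeg = [0]                      # start with a single site
for _ in range(2000):
    # pick a link target weighted by (in-degree + 1): rich get richer
    target = random.choices(range(len(indeg)),
                            weights=[d + 1 for d in indeg])[0]
    indeg[target] += 1
    indeg.append(0)              # the newly arrived site

# early sites accumulate far more links than typical late arrivals
print(max(indeg), sorted(indeg)[len(indeg) // 2])
```

After 2000 arrivals the maximum in-degree dwarfs the median, the qualitative signature of a power-law degree distribution.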

Power laws and small worlds
Power laws occur everywhere in nature
– Distribution of site sizes, city sizes, incomes, word frequencies, business sizes, earthquake magnitudes, spread of disease
– Random networks tend to evolve according to a power law
Small-world phenomenon
– "Neighborhoods" will be joined by a common member
– Hubs serve to connect neighborhoods
– Linkage is closer than one might expect
– Application: construction of networks and protocols that produce maximal flow/efficiency

Local structure
More diverse than a power law
Pages with similar topics self-organize into communities
– Short average path length
– High link density
– Webrings
Inverse: does a high link density imply the existence of a community?
Can this be used to study the emergence and growth of web communities?

Web Communities
Alternate definition
– Each member has more links to community members than to non-community members
– An extension of a clique
– Can be discovered with network-flow algorithms
Can be used to discover new "categories"
Help people interested in a topic find each other
Focused crawling, filtering, recommender systems