CSCI 5417 Information Retrieval Systems Jim Martin Lecture 19 11/1/2011

Today: Crawling; then a start on link-based ranking

Updated crawling picture (diagram): seed pages feed the URL frontier; crawling threads pick URLs from the frontier, and the URLs crawled and parsed yield links that extend the crawl into the unseen Web.

URL frontier: contains URLs that have been discovered but not yet explored (retrieved and analyzed for content and more URLs). It can include multiple URLs from the same host; we must avoid trying to fetch them all at the same time, even from different crawling threads, while still trying to keep all crawling threads busy.

Robots.txt: a protocol for giving spiders ("robots") limited access to a website, originally from 1994. A website announces its requests about what can (and cannot) be crawled by placing a file named robots.txt at the site root; this file specifies the access restrictions.

Robots.txt example: no robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:
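
To check rules like these programmatically, here is a minimal sketch using Python's standard urllib.robotparser; the host name and crawler name "somebot" are placeholders, not anything from the slides:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # hypothetical site
    rp.read()  # fetch and parse the robots.txt file

    # A generic robot is barred from /yoursite/temp/ ...
    print(rp.can_fetch("somebot", "http://www.example.com/yoursite/temp/x.html"))
    # ... but "searchengine" (empty Disallow) may fetch anything.
    print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/x.html"))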

Processing steps in crawling:
Pick a URL from the frontier (which one? -- a policy question)
Fetch the document at that URL
Parse the document and extract links from it to other docs (URLs)
Check if the document's content has already been seen; if not, add it to the indexes
For each extracted URL: ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.), and check whether it is already in the frontier (duplicate URL elimination)
A sketch that puts these steps together follows.
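
A minimal single-threaded sketch of this loop, purely illustrative: no politeness, DNS caching, robots.txt handling, or robustness, and example.com is a placeholder seed:

    import hashlib, re, urllib.request
    from urllib.parse import urljoin

    frontier = ["http://example.com/"]   # hypothetical seed
    seen_urls = set(frontier)            # for duplicate-URL elimination
    fingerprints = set()                 # for the content-seen test

    while frontier:
        url = frontier.pop(0)                                      # 1. pick a URL
        try:
            html = urllib.request.urlopen(url, timeout=5).read()   # 2. fetch
        except OSError:
            continue
        fp = hashlib.sha1(html).hexdigest()                        # 4. content seen?
        if fp not in fingerprints:
            fingerprints.add(fp)
            pass  # ...add the document to the indexes here...
        for href in re.findall(rb'href="([^"]+)"', html):          # 3. extract links
            link = urljoin(url, href.decode("utf-8", "ignore"))
            if link.startswith("http") and link not in seen_urls:  # 5. filter + dup elim
                seen_urls.add(link)
                frontier.append(link)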

Basic crawl architecture (diagram): URL frontier -> DNS -> fetch (WWW) -> parse -> content-seen check (against doc fingerprints) -> URL filter (with robots filters) -> duplicate-URL elimination (against the URL set) -> back into the frontier.

DNS (Domain Name System): a lookup service on the internet. Given a URL, retrieve the IP address of its host. The service is provided by a distributed set of servers, so lookup latencies can be high (even seconds). Common implementations of DNS lookup are blocking: only one outstanding request at a time. Solutions: DNS caching, and a batch DNS resolver that collects requests and sends them out together.
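
A small sketch of both ideas, assuming nothing beyond the Python standard library: a cache in front of the (blocking) resolver, with a thread pool standing in for a true batch resolver so one slow lookup doesn't stall the rest:

    import socket
    from concurrent.futures import ThreadPoolExecutor

    dns_cache = {}   # host -> IP, so repeat lookups are instant

    def resolve(host):
        if host not in dns_cache:
            dns_cache[host] = socket.gethostbyname(host)   # blocking lookup
        return dns_cache[host]

    # Many lookups in flight at once, standing in for a batch resolver.
    hosts = ["example.com", "example.org", "example.net"]
    with ThreadPoolExecutor(max_workers=10) as pool:
        print(dict(zip(hosts, pool.map(resolve, hosts))))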

Parsing: URL normalization. When a fetched document is parsed, some of the extracted links are relative URLs. For example, en.wikipedia.org/wiki/Main_Page has a relative link /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL en.wikipedia.org/wiki/Wikipedia:General_disclaimer. We must expand such relative URLs. URL shorteners (bit.ly, etc.) are a new problem.
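
In Python this expansion is exactly what the standard library's urljoin does; a one-line check of the slide's example:

    from urllib.parse import urljoin

    base = "http://en.wikipedia.org/wiki/Main_Page"
    print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
    # -> http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer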

Content seen? Duplication is widespread on the web. If the page just fetched is already in the index, do not process it further. This is verified using document fingerprints or shingles.
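
A minimal content-seen test, as a sketch: hash the page body and check the digest against the fingerprints seen so far. Note this exact-hash version only catches exact duplicates; shingling, which the slide mentions, is what catches near-duplicates.

    import hashlib

    fingerprints = set()   # digests of every page body indexed so far

    def content_seen(body: bytes) -> bool:
        fp = hashlib.sha1(body).hexdigest()   # the document fingerprint
        if fp in fingerprints:
            return True
        fingerprints.add(fp)
        return False

    print(content_seen(b"<html>hello</html>"))   # False: first sighting
    print(content_seen(b"<html>hello</html>"))   # True: exact duplicate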

Filters and robots.txt. Filters: regular expressions specifying which URLs are to be crawled (or not). Once a robots.txt file has been fetched from a site, we need not fetch it repeatedly; doing so burns bandwidth and hits the web server. So cache robots.txt files.

Duplicate URL elimination: check whether an extracted-and-filtered URL has already been put into the URL frontier. This may or may not be needed, depending on the crawling architecture.

Distributing the crawler: run multiple crawl threads, under different processes, potentially at different (geographically distributed) nodes, and partition the hosts being crawled among the nodes.

Overview: the frontier (diagram). The URL frontier holds URLs that have been discovered but not yet crawled. Crawler threads request URLs to crawl from it, and provide the URLs they discover back to it (subject to filtering).

URL frontier: two main considerations. Politeness: do not hit a web server too frequently. Freshness: crawl some sites/pages more often than others, e.g., pages (such as news sites) whose content changes often. These goals may conflict with each other.

Politeness: challenges. Even if we restrict each host to a single fetching thread, that thread can still hit the host repeatedly. A common heuristic: insert a time gap between successive requests to a host that is much greater (>>) than the time taken by the most recent fetch from that host.

URL frontier: the Mercator scheme (diagram). URLs enter at a prioritizer, which appends them to one of K front queues. A biased front queue selector feeds a back queue router, which distributes URLs across B back queues, each holding unexplored URLs from a single host. A back queue selector hands URLs to crawl threads that request them.

Mercator URL frontier: URLs flow in from the top into the frontier. Front queues manage prioritization; back queues enforce politeness. Each queue is FIFO.

Explicit and implicit politeness. Explicit politeness: specifications from webmasters on what portions of a site can be crawled (robots.txt). Implicit politeness: even with no specification, avoid hitting any site too often.

Front queues: the prioritizer assigns each URL an integer priority between 1 and K and appends the URL to the corresponding queue. Heuristics for assigning priority: refresh rate sampled from previous crawls, or application-specific criteria (e.g., "crawl news sites more often").

Front queues (diagram): the prioritizer feeds queues 1 through K; the biased front queue selector drains them into the back queue router.

Biased front queue selector: when a back queue requests URLs (in a sequence to be described), the selector picks a front queue from which to pull a URL. This choice can be round robin biased toward queues of higher priority, or some more sophisticated variant.

Back queues (diagram): the back queue router, fed by the biased front queue selector, fills back queues 1 through B; a back queue selector, driven by a heap, serves the requesting crawl threads.

Back queue invariants: each back queue is kept non-empty while the crawl is in progress, and each back queue contains URLs from only a single host. Maintain a table from hosts to back queues.

Back queue heap: one entry for each back queue. The entry is the earliest time t_e at which the host corresponding to that back queue can be hit again. This earliest time is determined from the last access to that host plus any time-gap heuristic we choose.

Back queue processing. A crawler thread seeking a URL to crawl: extracts the root of the heap; fetches the URL at the head of the corresponding back queue q (looked up from the table); then checks whether queue q is now empty. If so, it pulls a URL v from the front queues. If there is already a back queue for v's host, it appends v to that host's queue and pulls another URL from the front queues, repeating until it gets a v from a host with no back queue; it then adds that v to q. When q is non-empty again, it creates a heap entry for it.
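
A toy sketch of the heap mechanics in Python; the host names, queue contents, and the tenfold gap heuristic are made-up values, and the refill-from-front-queues step is elided:

    import heapq, time

    GAP_FACTOR = 10                     # gap >> last fetch time (heuristic)
    back_queues = {"a.example": ["http://a.example/1", "http://a.example/2"],
                   "b.example": ["http://b.example/1"]}
    heap = [(0.0, host) for host in back_queues]   # (earliest next-hit time t_e, host)
    heapq.heapify(heap)

    def next_url():
        t_e, host = heapq.heappop(heap)       # root = host we may hit soonest
        delay = t_e - time.time()
        if delay > 0:
            time.sleep(delay)                 # politeness: wait until t_e
        url = back_queues[host].pop(0)        # head of that host's back queue
        fetch_time = 0.1                      # stand-in for the real fetch duration
        if back_queues[host]:                 # re-insert a heap entry for this host
            heapq.heappush(heap, (time.time() + GAP_FACTOR * fetch_time, host))
        # else: a real crawler would now refill this queue from the front queues
        return url

    print(next_url(), next_url())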

Back queue management: each back queue corresponds to a given host. When a crawler thread wants a URL to process, we want to get it from the host that has been bothered least recently. So we need a way to quickly get at the least recently visited host, and a way to update that structure as hosts are added, deleted, and accessed. Hence the heap.

Back queue management (continued): when a thread retrieves the last URL from a back queue, it has to do additional work. It visits the front queues according to the priority scheme and retrieves a URL. If that URL is from a host already assigned to a back queue, it adds the URL to the right queue AND returns to the front queues for another. If it corresponds to a host not currently in the back queues, it adds the URL to the queue that it had just emptied.

Missing from the pictures: we had better be indexing the documents, not just storing the URLs. If we're caching pages à la Google, then we need to store the docs as well. And we had better be constructing a connectivity graph as we go along, to facilitate what we're going to do next.

Break: nearly done with the readings from the book. The current material is from Chapters 20 and 21. Then we're going back to Ch. 15, Section 15.4. To support the material in Ch. 15, I'm going to add some videos from the Stanford ML class, basically 2 or 3 days' worth, and I'll post the reading for the topic models material.

Break: for those of you who want to get started, go to ml-class.org -> Video Lectures -> watch the videos in Sections II, IV, and VI (basically equivalent to 3 of our classes).

The Web as a directed graph. Assumption 1: a hyperlink between pages denotes author-perceived relevance (a quality signal). Assumption 2: the anchor text of a hyperlink describes the target page (textual context). (Diagram: Page A, with anchor text on a hyperlink, pointing to Page B.)

Anchor text: for a query like tesla, we would like to return the Tesla home page first. But Tesla's home page is mostly graphical (i.e., low term count). So use anchor text to augment it: "tesla roadster", "teslamotors.com", "Tesla info". Likewise, a million pieces of anchor text containing "ibm" send a strong signal.

Indexing anchor text can sometimes have unexpected side effects, e.g., the "french military" spoof. We can index anchor text with less (or more) weight, either globally across all incoming links, or as a function of the quality of the page the link is coming from.

Traditional citation analysis: citation frequency; citation impact ratings (of journals and authors), used in tenure decisions; bibliographic coupling frequency (articles that co-cite the same articles are related); and citation indexing (who is this author cited by? Garfield [Garf72]).

Web link-based versions: two methods based on the citation analysis literature. PageRank (Page and Brin -> Google) and HITS (Kleinberg).

PageRank sketch: the PageRank of a page is based on the PageRank of the pages that point at it. Roughly: PR(p) = sum over pages q linking to p of PR(q) / outdegree(q).

PageRank scoring: imagine a browser doing a random walk on web pages. Start at a random page; at each step, go out of the current page along one of its links, equiprobably (e.g., probability 1/3 on each of three outlinks). "In the steady state," each page has a long-term visit rate; use this as the page's score. Pages with low rank are pages rarely visited during the random walk.

Not quite enough: the web is full of dead ends, pages that are pointed to but have no outgoing links. A random walk can get stuck in such dead ends, and it makes no sense to talk about long-term visit rates in their presence.

Teleporting: at a dead end, jump to a random web page. At any non-dead-end, with probability 10%, jump to a random web page; with the remaining probability (90%), go out on a random link. The 10% is a parameter (call it alpha).

Result of teleporting: now you can't get stuck locally, and there is a long-term rate at which any page is visited. How do we compute this visit rate? We can't directly use the random-walk metaphor.

State transition probabilities: we're going to use the notion of a transition probability: if we're in some particular state, what is the probability of going to some other particular state from there? If there are n states (pages), then we need an n x n table of probabilities.
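
A sketch of building that n x n table from each page's outlinks, following the teleporting rules above (the adjacency list, alpha value, and function name are illustrative assumptions):

    def transition_matrix(links, n, alpha):
        P = [[0.0] * n for _ in range(n)]
        for i in range(n):
            out = links.get(i, [])
            if not out:                    # dead end: jump uniformly at random
                P[i] = [1.0 / n] * n
            else:
                for j in range(n):
                    P[i][j] = alpha / n    # teleport share, spread over all pages
                for j in out:
                    P[i][j] += (1 - alpha) / len(out)   # random-link share
        return P

    # The 3-page example used below: 1->2, 2->{1,3}, 3->2 (0-indexed here).
    print(transition_matrix({0: [1], 1: [0, 2], 2: [1]}, 3, 0.5))
    # rows: [1/6, 2/3, 1/6], [5/12, 1/6, 5/12], [1/6, 2/3, 1/6]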

Markov chains: so if I'm in a particular state (say, the start of a random walk), and I know the whole n x n table, then I can compute the probability distribution over all the states I might be in at the next step of the walk... and at the step after that, and the step after that.

Example: say alpha = 0.5, with three pages: page 1 links to page 2, page 2 links to pages 1 and 3, and page 3 links to page 2. Then, for instance:

    P(3 -> 2) = (1 - alpha) * 1 + alpha * (1/3) = 1/2 + 1/6 = 2/3

and the full row for page 3 is P(3 -> *) = (1/6, 2/3, 1/6). The complete transition matrix:

         1/6   2/3   1/6
    P =  5/12  1/6   5/12
         1/6   2/3   1/6

Assume we start a walk in page 1 at time T0. Then what should we believe about the state of affairs at T1? What should we believe about things at T2?
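
A small sketch of that belief update in code: repeatedly multiplying the state distribution by P (power iteration). The iteration count is an arbitrary choice; after one step from page 1 the distribution is exactly row 1 of P, and the limit does not depend on the start page.

    P = [[1/6, 2/3, 1/6],
         [5/12, 1/6, 5/12],
         [1/6, 2/3, 1/6]]

    def step(x, P):
        # One step of the walk: x_next[j] = sum_i x[i] * P[i][j]
        n = len(x)
        return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

    x = [1.0, 0.0, 0.0]              # start the walk at page 1 (time T0)
    for t in range(50):              # first pass gives the belief at T1, etc.
        x = step(x, P)
    print([round(v, 3) for v in x])  # converges to (5/18, 4/9, 5/18) = [0.278, 0.444, 0.278]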

Example (figure): the resulting long-term visit rates, i.e., the PageRank values for each page.

PageRank summary. Preprocessing: given the graph of links, build the matrix P; from it, compute the PageRank of each page. Query processing: retrieve pages matching the query using the usual methods we've discussed, then rank them by their PageRank. The resulting order is query-independent.