Information Retrieval CSE 8337 (Part II) Spring 2011. Some material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Data Mining: Introductory and Advanced Topics by Margaret H. Dunham; Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze.

CSE 8337 Spring CSE 8337 Outline Introduction Text Processing Indexes Boolean Queries Web Searching/Crawling Vector Space Model Matching Evaluation Feedback/Expansion

CSE 8337 Spring Web Searching TOC Web Overview Searching Ranking Crawling

CSE 8337 Spring Web Overview Size: >11.5 billion pages (2005), growing at more than 1 million pages a day; Google indexes over 3 billion documents. Diverse types of data.

CSE 8337 Spring Web Data Web pages; intra-page structures; inter-page structures; usage data; supplemental data (profiles, registration information, cookies).

CSE 8337 Spring Zipf’s Law Applied to Web Distribution of frequency of occurrence of words in text: “The frequency of the i-th most frequent word is 1/i^θ times that of the most frequent word.”

CSE 8337 Spring Heaps’ Law Applied to Web Measures the size of the vocabulary in a text of n words: V(n) = O(n^β), where β is normally less than 1.
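As a quick illustration (not from the slides), a few lines of Python show how Heaps’ law predicts vocabulary growth; the constants K and β below are illustrative values, not figures from the course.

```python
# Heaps' law: V(n) ~ K * n**beta. K and beta are illustrative assumptions
# (typical reported ranges are roughly K in [30, 100], beta around 0.5).
def heaps_vocabulary(n_tokens, K=44.0, beta=0.49):
    """Estimated number of distinct terms in a text of n_tokens words."""
    return K * n_tokens ** beta

if __name__ == "__main__":
    for n in (10_000, 1_000_000, 100_000_000):
        print(f"{n:>11,} tokens -> ~{heaps_vocabulary(n):,.0f} distinct terms")
```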

CSE 8337 Spring Web search basics (diagram): the user issues a query to the search engine; a web spider crawls the Web; the indexer builds the indexes (and ad indexes) against which queries are evaluated.

CSE 8337 Spring How far do people look for results? (Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf)

CSE 8337 Spring Users’ empirical evaluation of results
Quality of pages varies widely; relevance alone is not enough. Other desirable qualities (non IR!!):
– Content: trustworthy, diverse, non-duplicated, well maintained
– Web readability: displays correctly and fast
– No annoyances: pop-ups, etc.
Precision vs. recall: on the web, recall seldom matters. What matters:
– Precision at 1? Precision above the fold?
– Comprehensiveness – must be able to deal with obscure queries; recall matters when the number of matches is very small
User perceptions may be unscientific, but are significant over a large aggregate.

CSE 8337 Spring Users’ empirical evaluation of engines
– Relevance and validity of results
– UI: simple, no clutter, error tolerant
– Trust: results are objective
– Coverage of topics for polysemic queries
– Pre/post-process tools provided: mitigate user errors (auto spell check, search assist, …); explicit tools (search within results, more like this, refine…); anticipative tools (related searches)
– Deal with idiosyncrasies: web-specific vocabulary and its impact on stemming, spell-check, etc.; web addresses typed in the search box; …

CSE 8337 Spring Simplest forms First-generation engines relied heavily on tf-idf: the top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s. SEOs (search engine optimizers) responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort. Often the repetitions would be in the same color as the background of the web page: repeated terms got indexed by crawlers but were not visible to humans in browsers. Pure word density cannot be trusted as an IR signal.

CSE 8337 Spring Term frequency tf The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d. Raw term frequency is not what we want: a document with 10 occurrences of the term is more relevant than a document with one occurrence, but not 10 times more relevant. Relevance does not increase proportionally with term frequency.

CSE 8337 Spring Log-frequency weighting The log-frequency weight of term t in d is w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. Score for a document–query pair: sum over terms t in both q and d: score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d}). The score is 0 if none of the query terms is present in the document.
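A minimal sketch of this weighting in Python (the function names are mine, not from the slides); it reproduces the 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4 mapping above.

```python
import math

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query_terms, doc_term_counts):
    """Sum of log-tf weights over terms shared by the query and the document."""
    return sum(log_tf_weight(doc_term_counts.get(t, 0)) for t in query_terms)

# Reproduces the mapping on the slide: 0, 1, 1.3, 2, 4
print([round(log_tf_weight(tf), 1) for tf in (0, 1, 2, 10, 1000)])
print(overlap_score(["maui", "resort"], {"maui": 2, "beach": 1}))
```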

CSE 8337 Spring Document frequency Rare terms are more informative than frequent terms (recall stop words). Consider a term in the query that is rare in the collection (e.g., arachnocentric): a document containing this term is very likely to be relevant to the query, so we want a high weight for rare terms like arachnocentric.

CSE 8337 Spring Document frequency, continued Consider a query term that is frequent in the collection (e.g., high, increase, line). For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms. We will use document frequency (df) to capture this in the score. df_t (≤ N) is the number of documents that contain the term.

CSE 8337 Spring idf weight df_t is the document frequency of t: the number of documents that contain t. df_t is an inverse measure of the informativeness of t. We define the idf (inverse document frequency) of t by idf_t = log10(N / df_t). We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf. It will turn out that the base of the log is immaterial.

CSE 8337 Spring idf example, suppose N = 1 million
term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0
There is one idf value for each term t in a collection.

CSE 8337 Spring Collection vs. Document frequency The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences. Example: which word is a better search term (and should get a higher weight)?
word        collection frequency    document frequency
insurance   10,440                  3,997
try         10,422                  8,760

CSE 8337 Spring tf-idf weighting The tf-idf weight of a term is the product of its tf weight and its idf weight: w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t). Best-known weighting scheme in information retrieval. Note: the “-” in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf, tfidf, tf/idf. The weight increases with the number of occurrences within a document and increases with the rarity of the term in the collection.
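A small, self-contained sketch of tf-idf scoring over a toy collection; the documents and query are made up for illustration, and the weighting follows the (1 + log10 tf) × log10(N/df) form given above.

```python
import math
from collections import Counter

def tf_idf(tf, df, N):
    """(1 + log10 tf) * log10(N / df); zero if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# Hypothetical toy collection, used only for illustration.
docs = ["maui resort maui beach", "cheap resort deals", "maui weather today"]
tokenized = [d.split() for d in docs]
N = len(docs)
df = Counter(t for doc in tokenized for t in set(doc))  # document frequencies

query = ["maui", "resort"]
for i, doc in enumerate(tokenized):
    tf_counts = Counter(doc)
    score = sum(tf_idf(tf_counts[t], df[t], N) for t in query)
    print(f"doc {i}: score {score:.3f}")
```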

CSE 8337 Spring Search engine optimization (Spam) Motives: commercial, political, religious, lobbies; promotion funded by advertising budget. Operators: search engine optimizers working for lobbies and companies, web masters, hosting services. Forums (e.g., WebmasterWorld): search-engine-specific tricks, discussions about academic papers.

CSE 8337 Spring Cloaking Serve fake content to the search engine spider. DNS cloaking: switch IP address to impersonate. How do you identify a spider? (Diagram: “Is this a search engine spider?” – if yes, serve the spam page; if no, serve the real document.)

CSE 8337 Spring More spam techniques Doorway pages: pages optimized for a single keyword that redirect to the real target page. Link spamming: mutual admiration societies, hidden links, awards – more on these later; domain flooding: numerous domains that point or redirect to a target page. Robots: fake query streams – rank-checking programs.

CSE 8337 Spring The war against spam
– Quality signals: prefer authoritative pages based on votes from authors (linkage signals) and votes from users (usage signals)
– Policing of URL submissions: anti-robot tests, limits on meta-keywords
– Robust link analysis: ignore statistically implausible linkage (or text); use link analysis to detect spammers (guilt by association)
– Spam recognition by machine learning: training set based on known spam
– Family-friendly filters: linguistic analysis, general classification techniques, etc.; for images: flesh-tone detectors, source text analysis, etc.
– Editorial intervention: blacklists, top queries audited, complaints addressed, suspect pattern detection

CSE 8337 Spring More on spam Web search engines have policies on which SEO practices they tolerate or block. Adversarial IR: the unending (technical) battle between SEOs and web search engines.

CSE 8337 Spring Ranking Order documents based on relevance to the query (similarity measure). Ranking has to be performed without accessing the text, just the index. Details of commercial ranking algorithms are “top secret”; it is almost impossible to measure recall, as the number of relevant pages can be quite large even for simple queries.

CSE 8337 Spring Ranking Some newer ranking algorithms also use hyperlink information. An important difference between the Web and classical IR collections: the number of hyperlinks that point to a page provides a measure of its popularity and quality, and links in common between pages often indicate a relationship between those pages.

CSE 8337 Spring Ranking Three examples of ranking techniques based on link analysis: WebQuery, HITS (hub/authority pages), PageRank.

CSE 8337 Spring WebQuery WebQuery takes a set of Web pages (for example, the answer to a query) and ranks them based on how connected each Web page is.

CSE 8337 Spring HITS Kleinberg’s ranking scheme depends on the query and considers the set S of pages that point to, or are pointed to by, pages in the answer. Pages in S that have many links pointing to them are called authorities; pages that have many outgoing links are called hubs. Better authority pages receive incoming edges from good hubs, and better hub pages have outgoing edges to good authorities.
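A compact sketch of the HITS iteration on a tiny made-up link graph; the iteration count and normalization are implementation choices, not prescribed by the slides.

```python
def hits(out_links, iterations=50):
    """Hub/authority scores for a graph given as {page: [pages it links to]}."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of the pages that link to it
        auth = {p: sum(hub[q] for q in pages if p in out_links.get(q, ())) for p in pages}
        # hub score: sum of authority scores of the pages it links to
        hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in pages}
        # normalize so the scores stay bounded
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"A": ["B", "C"], "B": ["C"], "D": ["C"]})
print(sorted(auth.items(), key=lambda kv: -kv[1]))   # C is the strongest authority
```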

CSE 8337 Spring Ranking

CSE 8337 Spring PageRank Used in Google. PageRank simulates a user navigating the Web at random, who jumps to a random page with probability q or follows a random hyperlink on the current page with probability 1 − q. This process can be modeled with a Markov chain, from which the stationary probability of being in each page can be computed. Let C(a) be the number of outgoing links of page a and suppose page a is pointed to by pages p_1 to p_n.

CSE 8337 Spring PageRank (cont’d) PR(p) = c (PR(p_1)/N_1 + … + PR(p_n)/N_n), where PR(p_i) is the PageRank of page p_i (which points to the target page p) and N_i = C(p_i) is the number of links coming out of page p_i.
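A minimal PageRank power-iteration sketch; it uses a damping factor d = 1 − q (assumed 0.85 here, a conventional but illustrative value) and a toy link graph.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """PageRank by power iteration; d = 1 - q is the probability of following a link."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # sum PR(p_i)/C(p_i) over the pages p_i that link to p
            incoming = sum(pr[q] / len(out_links[q])
                           for q in pages if p in out_links.get(q, ()))
            new_pr[p] = (1 - d) / N + d * incoming
        pr = new_pr
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}))
```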

CSE 8337 Spring Conclusion Nowadays search engines use, basically, Boolean or vector models and their variations. Link-analysis techniques seem to be the “next generation” of search engines. For indexes, compression and distributed architecture are key.

CSE 8337 Spring Crawlers A robot (spider) traverses the hypertext structure of the Web, collecting information from visited pages that is used to construct indexes for search engines. Traditional crawler – visits the entire Web (?) and replaces the index. Periodic crawler – visits portions of the Web and updates a subset of the index. Incremental crawler – selectively searches the Web and incrementally modifies the index. Focused crawler – visits pages related to a particular subject.

CSE 8337 Spring Crawling the Web The order in which URLs are traversed is important. Using a breadth-first policy, we first look at all the pages linked from the current page, and so on; this matches well with Web sites that are structured by related topics, but the coverage will be wide and shallow, and a Web server can be bombarded with many rapid requests. In the depth-first case, we follow the first link of a page, do the same on that page until we cannot go deeper, and return recursively. Good ordering schemes can make a difference if they crawl better pages first (e.g., by PageRank).

CSE 8337 Spring Crawling the Web Because robots can overwhelm a server with rapid requests and can use significant Internet bandwidth, a set of guidelines for robot behavior has been developed. Crawlers can also have problems with HTML pages that use frames or image maps; in addition, dynamically generated pages and password-protected pages cannot be indexed well.

CSE 8337 Spring Focused Crawler Only visit links from a page if that page is determined to be relevant. Components: a classifier, which assigns a relevance score to each page based on the crawl topic, and a distiller, which identifies hub pages. The crawler visits pages based on classifier and distiller scores. The classifier also determines how useful outgoing links are. Hub pages contain links to many relevant pages and must be visited even if they do not have a high relevance score.

CSE 8337 Spring Focused Crawler

CSE 8337 Spring Basic crawler operation Begin with known “seed” pages Fetch and parse them Extract URLs they point to Place the extracted URLs on a queue Fetch each URL on the queue and repeat
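A bare-bones sketch of this loop using only the Python standard library; the seed URL is a placeholder, and a real crawler would add politeness, robots.txt checks, and the duplicate elimination covered in the following slides.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, max_pages=20):
    """Breadth-first crawl starting from the seed URLs; yields pages as fetched."""
    frontier, seen, fetched = deque(seeds), set(seeds), 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                      # skip pages that fail to fetch
        fetched += 1
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)     # expand relative URLs
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)     # place extracted URLs on the queue
        yield url

for page in crawl(["https://example.com/"]):   # placeholder seed
    print(page)
```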

CSE 8337 Spring Crawling picture (diagram): seed pages lead to URLs that are crawled and parsed, a frontier of discovered but not-yet-fetched URLs, and the still-unseen Web beyond.

CSE 8337 Spring Simple picture – complications
Web crawling isn’t feasible with one machine: all of the above steps must be distributed.
Even non-malicious pages pose challenges: latency/bandwidth to remote servers vary; webmasters’ stipulations (how “deep” should you crawl a site’s URL hierarchy?); site mirrors and duplicate pages.
Malicious pages: spam pages, spider traps.
Politeness – don’t hit a server too often.

CSE 8337 Spring What any crawler must do Be Polite: Respect implicit and explicit politeness considerations Only crawl allowed pages Respect robots.txt (more on this shortly) Be Robust: Be immune to spider traps and other malicious behavior from web servers

CSE 8337 Spring What any crawler should do Be capable of distributed operation: designed to run on multiple distributed machines Be scalable: designed to increase the crawl rate by adding more machines Performance/efficiency: permit full use of available processing and network resources

CSE 8337 Spring What any crawler should do Fetch pages of “higher quality” first Continuous operation: Continue fetching fresh copies of a previously fetched page Extensible: Adapt to new data formats, protocols

CSE 8337 Spring Updated crawling picture (diagram): crawling threads pull from the URL frontier; URLs move from seed pages through crawled-and-parsed into the frontier, with the unseen Web beyond.

CSE 8337 Spring URL frontier Can include multiple pages from the same host Must avoid trying to fetch them all at the same time Must try to keep all crawling threads busy

CSE 8337 Spring Explicit and implicit politeness Explicit politeness: specifications from webmasters on what portions of site can be crawled robots.txt Implicit politeness: even with no specification, avoid hitting any site too often

CSE 8337 Spring Robots.txt Protocol for giving spiders (“robots”) limited access to a website, originating in 1994. The website announces its requests on what can(not) be crawled: the server places a robots.txt file at the site root, and this file specifies access restrictions.

CSE 8337 Spring Robots.txt example No robot should visit any URL starting with "/yoursite/temp/", except the robot called “searchengine”:
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
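For reference, the Python standard library can evaluate exactly this policy; the user-agent names below other than “searchengine” are illustrative.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines())

# An arbitrary robot is blocked from /yoursite/temp/, "searchengine" is not.
print(rp.can_fetch("SomeBot", "http://example.com/yoursite/temp/page.html"))       # False
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html"))  # True
```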

CSE 8337 Spring Processing steps in crawling
Pick a URL from the frontier (which one? – see the frontier discussion below).
Fetch the document at the URL.
Parse the document and extract links from it to other docs (URLs).
Check if the URL’s content has already been seen; if not, add it to the indexes.
For each extracted URL: ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.) and check if it is already in the frontier (duplicate URL elimination).

CSE 8337 Spring Basic crawl architecture (diagram): fetch from the WWW (via DNS) → parse → “content seen?” check against doc fingerprints → URL filter and robots filters → duplicate URL elimination against the URL set → URL frontier.

CSE 8337 Spring Parsing: URL normalization When a fetched document is parsed, some of the extracted links are relative URLs. E.g., a page on en.wikipedia.org may have a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL en.wikipedia.org/wiki/Wikipedia:General_disclaimer. During parsing, such relative URLs must be normalized (expanded).
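For example, with Python’s urljoin (the base page shown is an assumed Wikipedia page, matching the relative link in the slide):

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Main_Page"   # assumed base page
print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
# -> https://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```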

CSE 8337 Spring Content seen? Duplication is widespread on the web. If the page just fetched is already in the index, do not process it further. This is verified using document fingerprints or shingles.
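A toy sketch of shingle-based near-duplicate detection; the shingle length k, the use of MD5, and the example texts are illustrative choices, not specified by the slides.

```python
import hashlib

def shingles(text, k=5):
    """Set of hashed k-word shingles of the text."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

def jaccard(a, b):
    """Overlap between two shingle sets; 1.0 means identical."""
    return len(a & b) / len(a | b) if a | b else 0.0

page1 = "the quick brown fox jumps over the lazy dog near the river bank"
page2 = "the quick brown fox jumps over the lazy dog near a river bank"
print(jaccard(shingles(page1), shingles(page2)))  # 0.5 here; unrelated pages score near 0
```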

CSE 8337 Spring Filters and robots.txt Filters – regular expressions for URLs to be crawled or not. Once a robots.txt file is fetched from a site, it need not be fetched repeatedly; doing so burns bandwidth and hits the web server. Cache robots.txt files.

CSE 8337 Spring Distributing the crawler Run multiple crawl threads, under different processes – potentially at different nodes Geographically distributed nodes Partition hosts being crawled into nodes Hash used for partition How do these nodes communicate?

CSE 8337 Spring URL frontier: two main considerations Politeness: do not hit a web server too frequently. Freshness: crawl some pages more often than others, e.g., pages (such as news sites) whose content changes often. These goals may conflict with each other. (E.g., a simple priority queue fails – many links out of a page go to its own site, creating a burst of accesses to that site.)

CSE 8337 Spring Politeness – challenges Even if we restrict each host to a single fetching thread, that thread can still hit the host repeatedly. Common heuristic: insert a time gap between successive requests to a host that is much greater (>>) than the time taken by the most recent fetch from that host.
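A sketch of that heuristic (the gap factor of 10 is an assumed multiple, and the helper names are mine, not from the slides):

```python
import time
from urllib.parse import urlparse

GAP_FACTOR = 10        # assumed multiple of the last fetch duration
next_allowed = {}      # host -> earliest wall-clock time we may contact it again

def polite_wait(url):
    """Sleep until the host of `url` may be contacted again."""
    host = urlparse(url).netloc
    delay = next_allowed.get(host, 0.0) - time.time()
    if delay > 0:
        time.sleep(delay)

def record_fetch(url, fetch_seconds):
    """After a fetch, push back the next allowed time for this host."""
    host = urlparse(url).netloc
    next_allowed[host] = time.time() + GAP_FACTOR * fetch_seconds
```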

CSE 8337 Spring URL frontier: Mercator scheme (diagram): incoming URLs pass through a prioritizer into K front queues; a biased front queue selector and back queue router move them into B back queues, with a single host on each; a back queue selector serves each crawl thread requesting a URL.

CSE 8337 Spring Mercator URL frontier URLs flow in from the top into the frontier. Front queues manage prioritization; back queues enforce politeness. Each queue is FIFO.

CSE 8337 Spring Front queues (diagram): the prioritizer feeds front queues 1 through K, which feed the biased front queue selector / back queue router.

CSE 8337 Spring Front queues The prioritizer assigns each URL an integer priority between 1 and K and appends the URL to the corresponding queue. Heuristics for assigning priority: refresh rate sampled from previous crawls; application-specific rules (e.g., “crawl news sites more often”).

CSE 8337 Spring Biased front queue selector When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL This choice can be round robin biased to queues of higher priority, or some more sophisticated variant Can be randomized

CSE 8337 Spring Back queues (diagram): the biased front queue selector / back queue router feeds back queues 1 through B, which feed the back queue selector.

CSE 8337 Spring Back queue invariants Each back queue is kept non-empty while the crawl is in progress. Each back queue only contains URLs from a single host. Maintain a table mapping host names to back queue numbers (1 … B).

CSE 8337 Spring Back queue heap One entry for each back queue. The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again. This earliest time is determined from the last access to that host and any time buffer heuristic we choose.

CSE 8337 Spring Back queue processing A crawler thread seeking a URL to crawl: extracts the root of the heap; fetches the URL at the head of the corresponding back queue q (looked up from the table); checks if queue q is now empty – if so, pulls a URL v from the front queues. If there is already a back queue for v’s host, it appends v to that queue and pulls another URL from the front queues, repeating; else it adds v to q. When q is non-empty again, it creates a new heap entry for it.
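A condensed sketch of this back-queue logic with a heap keyed on the earliest allowed hit time; data structures are simplified and the politeness gap is an assumed constant rather than Mercator’s fetch-time-based buffer.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

GAP = 2.0            # assumed politeness gap (seconds) between hits to one host
back_queues = {}     # host -> deque of URLs waiting for that host
heap = []            # entries of (earliest time the host may be hit again, host)

def add_url(url):
    host = urlparse(url).netloc
    if host not in back_queues:
        back_queues[host] = deque()
        heapq.heappush(heap, (time.time(), host))   # new back queue, eligible now
    back_queues[host].append(url)

def next_url():
    """Pop the host whose politeness timer expires first and return one of its URLs."""
    t, host = heapq.heappop(heap)                   # extract the root of the heap
    time.sleep(max(0.0, t - time.time()))           # wait out the politeness gap
    url = back_queues[host].popleft()
    if back_queues[host]:
        heapq.heappush(heap, (time.time() + GAP, host))   # host still has work: re-arm
    else:
        del back_queues[host]    # in Mercator, this queue would be refilled from the front queues
    return url

for u in ("http://a.com/1", "http://a.com/2", "http://b.com/1"):
    add_url(u)
print(next_url(), next_url(), next_url())
```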

CSE 8337 Spring Number of back queues B Keep all threads busy while respecting politeness Mercator recommendation: three times as many back queues as crawler threads