Information Retrieval, CSE 8337 (Part II), Spring 2011. Some material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Data Mining: Introductory and Advanced Topics by Margaret H. Dunham; Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
CSE 8337 Outline: Introduction; Text Processing; Indexes; Boolean Queries; Web Searching/Crawling; Vector Space Model; Matching; Evaluation; Feedback/Expansion
Web Searching TOC: Web Overview; Searching; Ranking; Crawling
Web Overview: Size >11.5 billion pages (2005); grows at more than 1 million pages a day; Google indexes over 3 billion documents; diverse types of data
Web Data: Web pages; intra-page structures; inter-page structures; usage data; supplemental data (profiles, registration information, cookies)
Zipf's Law Applied to Web: Distribution of frequency of occurrence of words in text. "The frequency of the i-th most frequent word is 1/i times that of the most frequent word."
Heaps' Law Applied to Web: Measures the size of the vocabulary in a text of size n: V = O(n^β), where β is normally less than 1.
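As a quick illustration, Heaps' law says the vocabulary grows as V(n) = K·n^β with β < 1. A minimal sketch; the constants K and beta below are illustrative placeholders, not values measured from any corpus:

```python
def heaps_vocabulary(n, K=44.0, beta=0.49):
    """Heaps'-law estimate of vocabulary size for a text of n tokens.
    K and beta are hypothetical constants chosen for illustration."""
    return K * n ** beta

# Vocabulary grows sublinearly: doubling the text size
# far less than doubles the number of distinct words.
v1 = heaps_vocabulary(1_000_000)
v2 = heaps_vocabulary(2_000_000)
```

Because β < 1, the estimate for 2 million tokens is larger than for 1 million, but less than twice as large.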
Web search basics (diagram): the Web; web spider; indexer; indexes; ad indexes; search; user
How far do people look for results? (Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf)
Users' empirical evaluation of results: Quality of pages varies widely; relevance is not enough. Other desirable qualities (non-IR!): content that is trustworthy, diverse, non-duplicated, and well maintained; web readability (displays correctly and fast); no annoyances (pop-ups, etc.). Precision vs. recall: on the web, recall seldom matters. What matters: precision at 1? precision above the fold? Comprehensiveness: must be able to deal with obscure queries; recall matters when the number of matches is very small. User perceptions may be unscientific, but are significant over a large aggregate.
Users' empirical evaluation of engines: Relevance and validity of results; UI (simple, no clutter, error tolerant); trust (results are objective); coverage of topics for polysemic queries; pre/post-process tools provided. Mitigate user errors (auto spell check, search assist, ...). Explicit: search within results, more like this, refine... Anticipative: related searches. Deal with idiosyncrasies: web-specific vocabulary (impact on stemming, spell-check, etc.); web addresses typed in the search box; ...
Simplest forms: First-generation engines relied heavily on tf-idf. The top-ranked pages for the query "maui resort" were the ones containing the most occurrences of "maui" and "resort". SEOs (search engine optimizers) responded with dense repetitions of chosen terms, e.g., "maui resort maui resort maui resort". Often the repetitions would be in the same color as the background of the web page: repeated terms got indexed by crawlers but were not visible to humans in browsers. Pure word density cannot be trusted as an IR signal.
Term frequency tf: The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d. Raw term frequency is not what we want: a document with 10 occurrences of the term is more relevant than a document with one occurrence, but not 10 times more relevant. Relevance does not increase proportionally with term frequency.
Log-frequency weighting: The log-frequency weight of term t in d is w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise. So 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. Score for a document-query pair: sum over terms t in both q and d: score(q,d) = Σ_{t ∈ q∩d} (1 + log10 tf_{t,d}). The score is 0 if none of the query terms is present in the document.
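The log-frequency weight w = 1 + log10(tf) (0 when tf = 0) and the resulting query-document score can be written down directly; a minimal sketch, with illustrative function names:

```python
import math

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

def score(query_terms, doc_tf):
    """Sum of log-frequency weights over terms in both query and document.
    doc_tf maps each term to its raw count in the document."""
    return sum(log_tf(doc_tf.get(t, 0)) for t in query_terms)
```

With these definitions, log_tf reproduces the examples (1 → 1, 10 → 2, 1000 → 4), and the score of a document sharing no terms with the query is 0.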
Document frequency: Rare terms are more informative than frequent terms (recall stop words). Consider a term in the query that is rare in the collection (e.g., arachnocentric): a document containing this term is very likely to be relevant to the query "arachnocentric". → We want a high weight for rare terms like arachnocentric.
Document frequency, continued: Consider a query term that is frequent in the collection (e.g., high, increase, line). For frequent terms we want positive weights for words like high, increase, and line, but lower weights than for rare terms. We will use document frequency (df) to capture this in the score; df (≤ N) is the number of documents that contain the term.
idf weight: df_t is the document frequency of t: the number of documents that contain t. df is an inverse measure of the informativeness of t. We define the idf (inverse document frequency) of t by idf_t = log10(N/df_t). We use log N/df_t instead of N/df_t to "dampen" the effect of idf. It will turn out that the base of the log is immaterial.
idf example, suppose N = 1 million:

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

There is one idf value for each term t in a collection.
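The table's values follow directly from idf_t = log10(N/df_t); a quick check (the helper name is illustrative):

```python
import math

def idf(N, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

# Reproduces the table above for N = 1,000,000:
N = 1_000_000
idfs = [idf(N, df) for df in (1, 100, 1_000, 10_000, 100_000, 1_000_000)]
```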
Collection vs. Document frequency: The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences; the document frequency counts only the documents containing t. Example: of the words "insurance" and "try", which is a better search term (and should get a higher weight)?
tf-idf weighting: The tf-idf weight of a term is the product of its tf weight and its idf weight. It is the best known weighting scheme in information retrieval. Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf, tfidf, tf/idf. The weight increases with the number of occurrences within a document and increases with the rarity of the term in the collection.
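Putting the two pieces together gives the weight w_{t,d} = (1 + log10 tf_{t,d}) × log10(N/df_t); a minimal sketch with illustrative names:

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight: log-frequency tf weight times idf; 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)
```

As the slide states, the weight grows both with the count in the document and with the term's rarity in the collection.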
Search engine optimization (Spam): Motives: commercial, political, religious, lobbies; promotion funded by advertising budget. Operators: search engine optimizers for lobbies and companies; web masters; hosting services. Forums (e.g., Webmaster World): search-engine-specific tricks; discussions about academic papers.
Cloaking: Serve fake content to the search engine spider. DNS cloaking: switch IP address; impersonate. How do you identify a spider? (Diagram: if the requester is a search engine spider, serve the spam page; otherwise serve the real document.)
More spam techniques: Doorway pages: pages optimized for a single keyword that redirect to the real target page. Link spamming: mutual admiration societies, hidden links, awards (more on these later); domain flooding: numerous domains that point or redirect to a target page. Robots: fake query streams, rank-checking programs.
The war against spam: Quality signals: prefer authoritative pages based on votes from authors (linkage signals) and votes from users (usage signals); policing of URL submissions (anti-robot tests); limits on meta-keywords. Robust link analysis: ignore statistically implausible linkage (or text); use link analysis to detect spammers (guilt by association). Spam recognition by machine learning: training set based on known spam. Family-friendly filters: linguistic analysis, general classification techniques, etc.; for images: flesh-tone detectors, source text analysis, etc. Editorial intervention: blacklists; top queries audited; complaints addressed; suspect pattern detection.
More on spam: Web search engines have policies on the SEO practices they tolerate/block (18.html). Adversarial IR: the unending (technical) battle between SEOs and web search engines. Research.
Ranking: Order documents based on relevance to the query (similarity measure). Ranking has to be performed without accessing the text, just the index. About ranking algorithms, all information is "top secret"; it is almost impossible to measure recall, as the number of relevant pages can be quite large for simple queries.
Ranking: Some of the new ranking algorithms also use hyperlink information. An important difference between the Web and normal IR databases: the number of hyperlinks that point to a page provides a measure of its popularity and quality, and links in common between pages often indicate a relationship between those pages.
Ranking: Three examples of ranking techniques based on link analysis: WebQuery, HITS (hub/authority pages), PageRank.
WebQuery: WebQuery takes a set of Web pages (for example, the answer to a query) and ranks them based on how connected each Web page is. (y-1.html)
HITS: Kleinberg's ranking scheme depends on the query and considers the set S of pages that point to or are pointed to by pages in the answer. Pages that have many links pointing to them in S are called authorities; pages that have many outgoing links are called hubs. Better authority pages come from incoming edges from good hubs, and better hub pages come from outgoing edges to good authorities.
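The mutual reinforcement between hubs and authorities is computed iteratively; below is a sketch of the standard iterate-and-normalize scheme on a hypothetical link graph (a simplification of Kleinberg's algorithm, not its exact formulation):

```python
def hits(graph, iterations=50):
    """HITS sketch. graph maps each page to the list of pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        # Hub score: sum of authority scores of the pages p links to.
        hub = {p: sum(auth[t] for t in graph.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```

On a toy graph where pages a and b both link to c, the iteration makes c the top authority and a, b the top hubs, matching the slide's intuition.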
PageRank: Used in Google. PageRank simulates a user navigating randomly on the Web who jumps to a random page with probability q, or follows a random hyperlink on the current page with probability 1 − q. This process can be modeled with a Markov chain, from which the stationary probability of being on each page can be computed. Let C(a) be the number of outgoing links of page a, and suppose that page a is pointed to by pages p_1 to p_n.
PageRank (cont'd): PR(p) = c (PR(1)/N_1 + ... + PR(n)/N_n), where PR(i) is the PageRank of page i (which points to the target page p) and N_i is the number of links going out of page i.
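The recurrence can be solved by power iteration over the Markov chain just described; a sketch, using q as the random-jump probability so that a link is followed with probability 1 − q:

```python
def pagerank(links, q=0.15, iterations=100):
    """PageRank power-iteration sketch.
    links maps each page to the list of pages it links to."""
    pages = set(links) | {t for out in links.values() for t in out}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}        # start from the uniform distribution
    for _ in range(iterations):
        new = {p: q / N for p in pages}     # random-jump share
        for p in pages:
            out = links.get(p, [])
            if out:
                share = (1 - q) * pr[p] / len(out)
                for t in out:
                    new[t] += share
            else:
                # Dangling page: spread its mass uniformly.
                for t in pages:
                    new[t] += (1 - q) * pr[p] / N
        pr = new
    return pr
```

On a symmetric 3-page cycle the stationary distribution is uniform (each page gets rank 1/3), and the ranks always sum to 1.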
Conclusion: Nowadays search engines basically use Boolean or vector models and their variations. Link analysis techniques seem to be the "next generation" of search engines. Indexes: compression and distributed architecture are key.
Crawlers: A robot (spider) traverses the hypertext structure of the Web, collecting information from visited pages that is used to construct indexes for search engines. Traditional crawler: visits the entire Web (?) and replaces the index. Periodic crawler: visits portions of the Web and updates a subset of the index. Incremental crawler: selectively searches the Web and incrementally modifies the index. Focused crawler: visits pages related to a particular subject.
Crawling the Web: The order in which URLs are traversed is important. Using a breadth-first policy, we first look at all the pages linked from the current page, and so on; this matches well Web sites that are structured by related topics. On the other hand, the coverage will be wide but shallow, and a Web server can be bombarded with many rapid requests. In the depth-first case, we follow the first link of a page, do the same on the page it leads to until we cannot go deeper, and return recursively. Good ordering schemes can make a difference if crawling better pages first (PageRank).
Crawling the Web: Because robots can overwhelm a server with rapid requests and can use significant Internet bandwidth, a set of guidelines for robot behavior has been developed. Crawlers can also have problems with HTML pages that use frames or image maps; in addition, dynamically generated pages and password-protected pages cannot be indexed.
Focused Crawler: Only visit links from a page if that page is determined to be relevant. Components: a classifier, which assigns a relevance score to each page based on the crawl topic, and a distiller, which identifies hub pages. The crawler visits pages based on classifier and distiller scores; the classifier also determines how useful outgoing links are. Hub pages contain links to many relevant pages and must be visited even if they do not have a high relevance score.
Basic crawler operation: Begin with known "seed" pages; fetch and parse them; extract the URLs they point to; place the extracted URLs on a queue; fetch each URL from the queue and repeat.
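The steps above amount to a breadth-first traversal with duplicate URL elimination; a sketch in which fetch_links is a stand-in for "fetch the page and parse out its URLs":

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl sketch.
    fetch_links(url) must return the URLs the fetched page points to."""
    frontier = deque(seeds)      # queue of URLs still to fetch
    seen = set(seeds)            # duplicate URL elimination
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)      # fetching and parsing would happen here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled
```

With a dictionary standing in for the Web, the traversal visits pages in breadth-first order and never revisits a URL.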
Crawling picture (diagram): seed pages; URLs crawled and parsed; URL frontier; the unseen Web.
Simple picture – complications: Web crawling isn't feasible with one machine; all of the above steps must be distributed. Even non-malicious pages pose challenges: latency/bandwidth to remote servers vary; webmasters' stipulations (how "deep" should you crawl a site's URL hierarchy?); site mirrors and duplicate pages. Malicious pages: spam pages; spider traps. Politeness: don't hit a server too often.
What any crawler must do: Be polite: respect implicit and explicit politeness considerations; only crawl allowed pages; respect robots.txt (more on this shortly). Be robust: be immune to spider traps and other malicious behavior from web servers.
What any crawler should do: Be capable of distributed operation: designed to run on multiple distributed machines. Be scalable: designed to increase the crawl rate by adding more machines. Performance/efficiency: permit full use of available processing and network resources.
What any crawler should do: Fetch pages of "higher quality" first. Continuous operation: continue fetching fresh copies of previously fetched pages. Extensible: adapt to new data formats and protocols.
Updated crawling picture (diagram): seed pages; URLs crawled and parsed; URL frontier; the unseen Web; crawling threads.
URL frontier: Can include multiple pages from the same host; must avoid trying to fetch them all at the same time; must try to keep all crawling threads busy.
Explicit and implicit politeness: Explicit politeness: specifications from webmasters on what portions of a site can be crawled (robots.txt). Implicit politeness: even with no specification, avoid hitting any site too often.
Robots.txt: Protocol for giving spiders ("robots") limited access to a website. A website announces its requests about what can(not) be crawled by placing a robots.txt file at the root of the URL hierarchy; this file specifies the access restrictions.
Robots.txt example: No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
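The example can be checked with Python's standard urllib.robotparser; the host example.com below is only a placeholder for exercising the rules:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Any other robot is barred from /yoursite/temp/, but "searchengine"
# (whose Disallow line is empty) may fetch anything.
blocked = rp.can_fetch("othercrawler", "http://example.com/yoursite/temp/x.html")
allowed = rp.can_fetch("searchengine", "http://example.com/yoursite/temp/x.html")
```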
Processing steps in crawling: Pick a URL from the frontier (which one?); fetch the document at that URL; parse the document and extract links from it to other docs (URLs); check if the URL has content already seen, and if not, add it to the indexes. For each extracted URL: ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.); check if it is already in the frontier (duplicate URL elimination).
Basic crawl architecture (diagram): WWW → Fetch (DNS) → Parse → Content seen? (doc FPs) → URL filter (robots filters) → Dup URL elim (URL set) → URL frontier.
Parsing: URL normalization: When a fetched document is parsed, some of the extracted links are relative URLs, e.g., a relative link to /wiki/Wikipedia:General_disclaimer, which stands for the corresponding absolute URL on the page's host. During parsing, we must normalize (expand) such relative URLs.
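In Python, this expansion is what urllib.parse.urljoin does; here the base page is assumed, for illustration, to be Wikipedia's Main_Page (a plausible base for the /wiki/... link above):

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Main_Page"   # assumed base page
absolute = urljoin(base, "/wiki/Wikipedia:General_disclaimer")
# absolute is now the full URL on the same scheme and host as base.
```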
Content seen?: Duplication is widespread on the web. If the page just fetched is already in the index, do not process it further. This is verified using document fingerprints or shingles (see ers/sigmod03.pdf and /cos435/Class_notes/duplicateDocs_corrected.pdf).
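One way to make "content seen?" concrete is to fingerprint word k-grams (shingles) and compare the sets; a sketch using MD5 purely as a stand-in for whatever fingerprint function a real crawler would use:

```python
import hashlib

def shingles(text, k=4):
    """Set of hashed k-word shingles of a document."""
    words = text.split()
    return {hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(max(len(words) - k + 1, 1))}

def resemblance(a, b):
    """Jaccard overlap of two shingle sets; near 1.0 means near-duplicate."""
    return len(a & b) / len(a | b) if a | b else 1.0
```

Identical documents score 1.0, unrelated documents score near 0, and near-duplicates land in between, which is the signal used to skip reprocessing.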
Filters and robots.txt: Filters: regular expressions for URLs to be crawled or not. Once a robots.txt file has been fetched from a site, it need not be fetched repeatedly; doing so burns bandwidth and hits the web server. Cache robots.txt files.
Distributing the crawler: Run multiple crawl threads, under different processes, potentially at different (geographically distributed) nodes. Partition the hosts being crawled among the nodes, using a hash for the partition. How do these nodes communicate?
URL frontier: two main considerations. Politeness: do not hit a web server too frequently. Freshness: crawl some pages more often than others, e.g., pages (such as news sites) whose content changes often. These goals may conflict with each other. (E.g., a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site.)
Politeness – challenges: Even if we restrict only one thread to fetch from a host, it can hit that host repeatedly. Common heuristic: insert a time gap between successive requests to a host that is >> the time for the most recent fetch from that host.
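The heuristic can be kept in a small per-host table; a sketch in which gap_factor is an illustrative multiplier standing in for ">> the most recent fetch time":

```python
import time

class PolitenessGate:
    """Per-host politeness: after a fetch taking d seconds, wait
    gap_factor * d before hitting the same host again."""
    def __init__(self, gap_factor=10.0):
        self.gap_factor = gap_factor
        self.next_ok = {}                    # host -> earliest next-fetch time

    def wait_needed(self, host, now=None):
        """Seconds the caller should still wait before fetching from host."""
        if now is None:
            now = time.monotonic()
        return max(self.next_ok.get(host, 0.0) - now, 0.0)

    def record_fetch(self, host, finished_at, duration):
        """Record a fetch that finished at finished_at and took duration seconds."""
        self.next_ok[host] = finished_at + self.gap_factor * duration
```

A slow host thus earns itself a proportionally longer cool-down, while hosts never seen before can be fetched immediately.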
URL frontier: Mercator scheme (diagram): URLs → prioritizer → K front queues → biased front queue selector / back queue router → B back queues (a single host on each) → back queue selector → crawl thread requesting a URL.
Mercator URL frontier: URLs flow in from the top into the frontier; front queues manage prioritization; back queues enforce politeness; each queue is FIFO. (ations/mercator_join.pdf)
Front queues (diagram): prioritizer → front queues 1 to K → biased front queue selector / back queue router.
Front queues: The prioritizer assigns each URL an integer priority between 1 and K and appends the URL to the corresponding queue. Heuristics for assigning priority: refresh rate sampled from previous crawls; application-specific criteria (e.g., "crawl news sites more often").
Biased front queue selector: When a back queue requests a URL (in a sequence to be described), the selector picks a front queue from which to pull a URL. This choice can be round robin biased toward queues of higher priority, or some more sophisticated variant; it can be randomized.
Back queues (diagram): biased front queue selector → back queue router → back queues 1 to B → back queue selector.
Back queue invariants: Each back queue is kept non-empty while the crawl is in progress, and each back queue contains URLs from only a single host. Maintain a table mapping host names to back queues.
Back queue heap: One entry for each back queue; the entry is the earliest time t_e at which the host corresponding to that back queue can be hit again. This earliest time is determined from the last access to that host and any time-buffer heuristic we choose.
Back queue processing: A crawler thread seeking a URL to crawl extracts the root of the heap and fetches the URL at the head of the corresponding back queue q (looked up from the table). It then checks whether queue q is now empty; if so, it pulls a URL v from the front queues. If there is already a back queue for v's host, it appends v to that queue and pulls another URL from the front queues, repeating; otherwise it adds v to q. When q is non-empty again, the thread creates a heap entry for it.
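The heap-plus-FIFO machinery can be sketched compactly. This toy version skips the front queues and simply drops a host's queue when it empties (where Mercator would refill it from the front queues), and gap is an illustrative politeness delay:

```python
import heapq

class BackQueues:
    """Sketch of Mercator back queues: one FIFO of URLs per host, plus a
    heap of (earliest allowed fetch time, host) entries."""
    def __init__(self, gap=1.0):
        self.queues = {}        # host -> FIFO list of URLs
        self.heap = []          # (earliest fetch time, host)
        self.gap = gap          # politeness gap between hits on one host

    def add(self, host, url, now=0.0):
        if host not in self.queues:
            self.queues[host] = []
            heapq.heappush(self.heap, (now, host))
        self.queues[host].append(url)

    def pop(self, now):
        """Next crawlable URL plus how long to wait before fetching it."""
        t, host = heapq.heappop(self.heap)
        url = self.queues[host].pop(0)
        if self.queues[host]:
            # Host may be hit again only after the politeness gap.
            heapq.heappush(self.heap, (max(t, now) + self.gap, host))
        else:
            del self.queues[host]   # Mercator would refill from the front queues
        return url, max(t - now, 0.0)
```

Two URLs from the same host are thus separated by the gap, while a URL from a different host can be fetched in between with no wait.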
Number of back queues B: Keep all threads busy while respecting politeness. Mercator recommendation: three times as many back queues as crawler threads.