Information Retrieval CSE 8337 (Part II) Spring 2011 Some material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ Data Mining Introductory and Advanced Topics by Margaret H. Dunham http://www.engr.smu.edu/~mhd/book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze http://informationretrieval.org
CSE 8337 Spring 2011 2 CSE 8337 Outline Introduction Text Processing Indexes Boolean Queries Web Searching/Crawling Vector Space Model Matching Evaluation Feedback/Expansion
CSE 8337 Spring 2011 3 Web Searching TOC Web Overview Searching Ranking Crawling
CSE 8337 Spring 2011 4 Web Overview Size >11.5 billion pages (2005) Grows at more than 1 million pages a day Google indexes over 3 billion documents Diverse types of data http://www.worldwidewebsize.com/
CSE 8337 Spring 2011 5 Web Data Web pages Intra-page structures Inter-page structures Usage data Supplemental data Profiles Registration information Cookies
CSE 8337 Spring 2011 6 Zipf’s Law Applied to Web Distribution of frequency of occurrence of words in text. “Frequency of i-th most frequent word is 1/i times that of the most frequent word”
CSE 8337 Spring 2011 7 Heaps’ Law Applied to Web Measures size of vocabulary in a text of size n: $O(n^{\beta})$, with $\beta$ normally less than 1
CSE 8337 Spring 2011 8 Web search basics The Web Ad indexes Web spider Indexer Indexes Search User
CSE 8337 Spring 2011 9 How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
CSE 8337 Spring 2011 10 Users’ empirical evaluation of results Quality of pages varies widely Relevance is not enough Other desirable qualities (non IR!!) Content: Trustworthy, diverse, non-duplicated, well maintained Web readability: display correctly & fast No annoyances: pop-ups, etc Precision vs. recall On the web, recall seldom matters What matters Precision at 1? Precision above the fold? Comprehensiveness – must be able to deal with obscure queries Recall matters when the number of matches is very small User perceptions may be unscientific, but are significant over a large aggregate
CSE 8337 Spring 2011 11 Users’ empirical evaluation of engines Relevance and validity of results UI – Simple, no clutter, error tolerant Trust – Results are objective Coverage of topics for polysemic queries Pre/Post process tools provided Mitigate user errors (auto spell check, search assist,…) Explicit: Search within results, more like this, refine... Anticipative: related searches Deal with idiosyncrasies Web specific vocabulary Impact on stemming, spell-check, etc Web addresses typed in the search box …
CSE 8337 Spring 2011 12 Simplest forms First generation engines relied heavily on tf/idf. The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s. SEOs (search engine optimizers) responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort. Often, the repetitions would be in the same color as the background of the web page: repeated terms got indexed by crawlers but were not visible to humans in browsers. Pure word density cannot be trusted as an IR signal.
CSE 8337 Spring 2011 13 Term frequency tf The term frequency $\mathrm{tf}_{t,d}$ of term t in document d is defined as the number of times that t occurs in d. Raw term frequency is not what we want: a document with 10 occurrences of the term is more relevant than a document with one occurrence of the term, but not 10 times more relevant. Relevance does not increase proportionally with term frequency.
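CSE 8337 Spring 2011 14 Log-frequency weighting The log frequency weight of term t in d is $w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, and 0 otherwise. 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. Score for a document-query pair: sum over terms t in both q and d: $\mathrm{score}(q,d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$. The score is 0 if none of the query terms is present in the document.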
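The log-frequency weighting on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the weight is 1 + log10(tf) for tf > 0 and 0 otherwise, which agrees with the example mappings listed on the slide. The function names and the toy document are hypothetical.

```python
import math

def log_tf_weight(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0 (assumed form)."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query_terms, doc_term_counts):
    """Sum the log-frequency weights of terms that occur in both query and document."""
    return sum(log_tf_weight(doc_term_counts.get(t, 0)) for t in set(query_terms))

# Hypothetical document, given as term -> raw count
doc = {"maui": 10, "resort": 1, "beach": 3}

print(round(log_tf_weight(2), 1))               # 1.3
print(overlap_score(["maui", "resort"], doc))   # (1 + 1) + (1 + 0)
print(overlap_score(["hotel"], doc))            # no query term present
```

Ten occurrences of a term contribute a weight of 2, not 10, which is the dampening the slide asks for.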
CSE 8337 Spring 2011 15 Document frequency Rare terms are more informative than frequent terms Recall stop words Consider a term in the query that is rare in the collection (e.g., arachnocentric) A document containing this term is very likely to be relevant to the query arachnocentric → We want a high weight for rare terms like arachnocentric.
CSE 8337 Spring 2011 16 Document frequency, continued Consider a query term that is frequent in the collection (e.g., high, increase, line). For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms. We will use document frequency (df) to capture this in the score. df (≤ N) is the number of documents that contain the term.
CSE 8337 Spring 2011 17 idf weight $\mathrm{df}_t$ is the document frequency of t: the number of documents that contain t. $\mathrm{df}_t$ is an inverse measure of the informativeness of t. We define the idf (inverse document frequency) of t by $\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$. We use $\log(N/\mathrm{df}_t)$ instead of $N/\mathrm{df}_t$ to “dampen” the effect of idf. It will turn out that the base of the log is immaterial.
CSE 8337 Spring 2011 18 idf example, suppose N = 1 million
term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0
There is one idf value for each term t in a collection.
CSE 8337 Spring 2011 19 Collection vs. Document frequency The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences. Example:
Word        Collection frequency   Document frequency
insurance   10440                  3997
try         10422                  8760
Which word is a better search term (and should get a higher weight)?
CSE 8337 Spring 2011 20 tf-idf weighting The tf-idf weight of a term is the product of its tf weight and its idf weight. Best known weighting scheme in information retrieval Note: the “-” in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf, tfidf, tf/idf Increases with the number of occurrences within a document Increases with the rarity of the term in the collection
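A small sketch of tf-idf scoring, assuming the tf weight is 1 + log10(tf) and the idf weight is log10(N/df), as in the Manning et al. text these slides draw on. The function name, whitespace tokenization, and the four toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document as the sum, over query terms it contains,
    of (1 + log10 tf) * log10(N / df)."""
    n_docs = len(docs)
    counts = [Counter(d.lower().split()) for d in docs]  # naive whitespace tokens
    df = Counter()
    for c in counts:
        df.update(c.keys())          # each document counts once per term
    scores = []
    for c in counts:
        s = 0.0
        for t in set(query.lower().split()):
            if c[t] > 0:
                s += (1 + math.log10(c[t])) * math.log10(n_docs / df[t])
        scores.append(s)
    return scores

# Hypothetical toy collection; the first document repeats the query terms
docs = [
    "maui resort maui resort maui resort",
    "maui beach hotel",
    "city hotel guide",
    "the city guide",
]
scores = tf_idf_scores("maui resort", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The first document scores highest purely through repetition, which is the weakness that keyword-stuffing spam exploited.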
CSE 8337 Spring 2011 21 Search engine optimization (Spam) Motives Commercial, political, religious, lobbies Promotion funded by advertising budget Operators Search Engine Optimizers for lobbies, companies Web masters Hosting services Forums E.g., Web master world (www.webmasterworld.com) Search engine specific tricks Discussions about academic papers
CSE 8337 Spring 2011 22 Cloaking Serve fake content to search engine spider DNS cloaking: Switch IP address. Impersonate How do you identify a spider? [Figure: cloaking. Is this a search engine spider? Yes: serve SPAM; No: serve the real document]
CSE 8337 Spring 2011 23 More spam techniques Doorway pages Pages optimized for a single keyword that re-direct to the real target page Link spamming Mutual admiration societies, hidden links, awards – more on these later Domain flooding: numerous domains that point or re-direct to a target page Robots Fake query stream – rank checking programs
CSE 8337 Spring 2011 24 The war against spam Quality signals - Prefer authoritative pages based on: Votes from authors (linkage signals) Votes from users (usage signals) Policing of URL submissions Anti robot test Limits on meta-keywords Robust link analysis Ignore statistically implausible linkage (or text) Use link analysis to detect spammers (guilt by association) Spam recognition by machine learning Training set based on known spam Family friendly filters Linguistic analysis, general classification techniques, etc. For images: flesh tone detectors, source text analysis, etc. Editorial intervention Blacklists Top queries audited Complaints addressed Suspect pattern detection
CSE 8337 Spring 2011 25 More on spam Web search engines have policies on SEO practices they tolerate/block http://help.yahoo.com/l/us/yahoo/search/basics/basics-18.html http://www.google.com/intl/en/webmasters/ Adversarial IR: the unending (technical) battle between SEOs and web search engines Research http://airweb.cse.lehigh.edu
CSE 8337 Spring 2011 26 Ranking Order documents based on relevance to query (similarity measure). Ranking has to be performed without accessing the text, just the index. All information about commercial ranking algorithms is “top secret”. It is almost impossible to measure recall, as the number of relevant pages can be quite large for simple queries.
CSE 8337 Spring 2011 27 Ranking Some of the new ranking algorithms also use hyperlink information. This is an important difference between the Web and normal IR databases: the number of hyperlinks that point to a page provides a measure of its popularity and quality. Links in common between pages often indicate a relationship between those pages.
CSE 8337 Spring 2011 28 Ranking Three examples of ranking techniques based on link analysis: WebQuery, HITS (Hub/Authority pages), PageRank
CSE 8337 Spring 2011 29 WebQuery WebQuery takes a set of Web pages (for example, the answer to a query) and ranks them based on how connected each Web page is http://www.cgl.uwaterloo.ca/Projects/Vanish/webquery-1.html
CSE 8337 Spring 2011 30 HITS Kleinberg’s ranking scheme depends on the query and considers the set of pages S that point to or are pointed to by pages in the answer. Pages that have many links pointing to them in S are called authorities. Pages that have many outgoing links are called hubs. Better authority pages come from incoming edges from good hubs, and better hub pages come from outgoing edges to good authorities.
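The mutual reinforcement between hubs and authorities can be sketched as an iterative computation. This is a minimal illustration under stated assumptions, not Kleinberg's exact formulation: the function name, the fixed iteration count, the L2 normalization step, and the toy link graph are all choices made here.

```python
import math

def hits(links, iterations=50):
    """links: dict mapping a page to the list of pages it points to.
    Returns (hub, authority) score dicts after a fixed number of iterations."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages that point to it
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum of authority scores of the pages it points to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded (assumed L2 normalization)
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical subgraph S: three hub-like pages pointing at two candidate authorities
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))   # page with the highest authority score
print(max(hub, key=hub.get))     # one of the best hubs
```

In this toy graph, a1 is pointed to by all three hubs and ends up with the highest authority score, while h3, which links to only one authority, is a weaker hub than h1 or h2.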
CSE 8337 Spring 2011 31 Ranking
CSE 8337 Spring 2011 32 PageRank Used in Google. PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q. This process can be modeled with a Markov chain, from which the stationary probability of being in each page can be computed. Let C(a) be the number of outgoing links of page a and suppose that page a is pointed to by pages $p_1$ to $p_n$.
CSE 8337 Spring 2011 33 PageRank (cont’d) $PR(p) = c\,\left(PR(1)/N_1 + \dots + PR(n)/N_n\right)$ PR(i): PageRank for a page i which points to target page p. $N_i$: number of links coming out of page i.
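The random-surfer description can be turned into a short power-iteration sketch. Several details here are assumptions rather than anything stated on the slides: the random-jump term is written as q/T (T = total number of pages) so that the scores form a probability distribution, pages without outgoing links are treated as jumping to a random page, and q = 0.15 is only a commonly used illustrative value. The function name and toy graph are hypothetical.

```python
def pagerank(links, q=0.15, iterations=100):
    """links: dict mapping a page to the list of pages it points to.
    q: probability of jumping to a random page (assumed default)."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: q / n for p in pages}           # random-jump contribution
        for p in pages:
            out = links.get(p, [])
            if out:
                share = (1 - q) * pr[p] / len(out)  # PR(p) split over its out-links
                for t in out:
                    new[t] += share
            else:
                # Assumption: a surfer on a page with no out-links jumps anywhere
                for t in pages:
                    new[t] += (1 - q) * pr[p] / n
        pr = new
    return pr

# Hypothetical four-page web
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pr = pagerank(links)
for page, score in sorted(pr.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Page c, which receives links from a, b, and d, ends up with the largest stationary probability; page d, which nothing links to, keeps only the random-jump share q/T.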
CSE 8337 Spring 2011 34 Conclusion Nowadays search engines use, basically, Boolean or Vector models and their variations Link Analysis Techniques seem to be the “next generation” of the search engines Indexes: Compression and distributed architecture are keys
CSE 8337 Spring 2011 35 Crawlers Robot (spider) traverses the hypertext structure in the Web. Collects information from visited pages. Used to construct indexes for search engines. Traditional Crawler – visits entire Web (?) and replaces index. Periodic Crawler – visits portions of the Web and updates subset of index. Incremental Crawler – selectively searches the Web and incrementally modifies index. Focused Crawler – visits pages related to a particular subject.
CSE 8337 Spring 2011 36 Crawling the Web The order in which the URLs are traversed is important. Using a breadth first policy, we first look at all the pages linked by the current page, and so on. This matches well Web sites that are structured by related topics. On the other hand, the coverage will be wide but shallow, and a Web server can be bombarded with many rapid requests. In the depth first case, we follow the first link of a page and we do the same on that page until we cannot go deeper, returning recursively. Good ordering schemes can make a difference if they crawl better pages first (PageRank).
CSE 8337 Spring 2011 37 Crawling the Web Because robots can overwhelm a server with rapid requests and can use significant Internet bandwidth, a set of guidelines for robot behavior has been developed. Crawlers can also have problems with HTML pages that use frames or image maps. In addition, dynamically generated pages cannot be indexed, nor can password-protected pages.
CSE 8337 Spring 2011 38 Focused Crawler Only visit links from a page if that page is determined to be relevant. Components: Classifier, which assigns a relevance score to each page based on the crawl topic; Distiller, to identify hub pages; Crawler, which visits pages based on classifier and distiller scores. The classifier also determines how useful outgoing links are. Hub pages contain links to many relevant pages and must be visited even if they do not have a high relevance score.
CSE 8337 Spring 2011 39 Focused Crawler
CSE 8337 Spring 2011 40 Basic crawler operation Begin with known “seed” pages Fetch and parse them Extract URLs they point to Place the extracted URLs on a queue Fetch each URL on the queue and repeat
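The steps on this slide can be sketched as a loop over a FIFO queue. To keep the example runnable without network access, fetching and parsing are replaced by a lookup in a hypothetical in-memory link table (`TOY_WEB`); the function names and URLs are illustrative assumptions.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs that the page links to
TOY_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/", "http://a.example/1"],
    "http://c.example/": [],
}

def fetch_links(url):
    """Stand-in for 'fetch and parse': return the URLs the page points to."""
    return TOY_WEB.get(url, [])

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # queue of URLs waiting to be fetched
    seen = set(seeds)         # URLs already placed on the queue
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()          # fetch the next URL on the queue
        crawled.append(url)
        for link in fetch_links(url):     # extract the URLs it points to
            if link not in seen:          # enqueue each URL only once
                seen.add(link)
                frontier.append(link)
    return crawled

order = crawl(["http://a.example/"])
print(order)
```

Because the queue is FIFO, this visits pages in breadth-first order from the seed; politeness, robots.txt, and duplicate-content checks are deliberately left out.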
CSE 8337 Spring 2011 41 Crawling picture Web URLs crawled and parsed URLs frontier Unseen Web Seed pages
CSE 8337 Spring 2011 42 Simple picture – complications Web crawling isn’t feasible with one machine All of the above steps distributed Even non-malicious pages pose challenges Latency/bandwidth to remote servers vary Webmasters’ stipulations How “deep” should you crawl a site’s URL hierarchy? Site mirrors and duplicate pages Malicious pages Spam pages Spider traps Politeness – don’t hit a server too often
CSE 8337 Spring 2011 43 What any crawler must do Be Polite: Respect implicit and explicit politeness considerations Only crawl allowed pages Respect robots.txt (more on this shortly) Be Robust: Be immune to spider traps and other malicious behavior from web servers
CSE 8337 Spring 2011 44 What any crawler should do Be capable of distributed operation: designed to run on multiple distributed machines Be scalable: designed to increase the crawl rate by adding more machines Performance/efficiency: permit full use of available processing and network resources
CSE 8337 Spring 2011 45 What any crawler should do Fetch pages of “higher quality” first Continuous operation: Continue fetching fresh copies of a previously fetched page Extensible: Adapt to new data formats, protocols
CSE 8337 Spring 2011 46 Updated crawling picture URLs crawled and parsed Unseen Web Seed Pages URL frontier Crawling thread
CSE 8337 Spring 2011 47 URL frontier Can include multiple pages from the same host Must avoid trying to fetch them all at the same time Must try to keep all crawling threads busy
CSE 8337 Spring 2011 48 Explicit and implicit politeness Explicit politeness: specifications from webmasters on what portions of site can be crawled robots.txt Implicit politeness: even with no specification, avoid hitting any site too often
CSE 8337 Spring 2011 49 Robots.txt Protocol for giving spiders (“robots”) limited access to a website, originally from 1994 www.robotstxt.org/wc/norobots.html Website announces its request on what can(not) be crawled For a URL, create a file URL/robots.txt This file specifies access restrictions
CSE 8337 Spring 2011 50 Robots.txt example No robot should visit any URL starting with "/yoursite/temp/", except the robot called “searchengine": User-agent: * Disallow: /yoursite/temp/ User-agent: searchengine Disallow:
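One way to check the example rules is with Python's standard `urllib.robotparser` module. The snippet below feeds the slide's robots.txt text directly to the parser rather than fetching it; the host `www.example.com` and the robot name `somebot` are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide, one directive per line
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

temp_url = "http://www.example.com/yoursite/temp/page.html"
other_url = "http://www.example.com/yoursite/index.html"

print(parser.can_fetch("somebot", temp_url))        # False: blocked by the * rule
print(parser.can_fetch("searchengine", temp_url))   # True: empty Disallow allows all
print(parser.can_fetch("somebot", other_url))       # True: outside the blocked prefix
```

A crawler would typically call `can_fetch` with its own user-agent string before fetching each URL.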
CSE 8337 Spring 2011 51 Processing steps in crawling Pick a URL from the frontier (which one?). Fetch the document at the URL. Parse the document: extract links from it to other docs (URLs). Check if the document has content already seen; if not, add to indexes. For each extracted URL: ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.); check if it is already in the frontier (duplicate URL elimination).
CSE 8337 Spring 2011 52 Basic crawl architecture WWW DNS Parse Content seen? Doc FP’s Dup URL elim URL set URL Frontier URL filter robots filters Fetch
CSE 8337 Spring 2011 53 Parsing: URL normalization When a fetched document is parsed, some of the extracted links are relative URLs. E.g., at http://en.wikipedia.org/wiki/Main_Page we have a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer. During parsing, must normalize (expand) such relative URLs.
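Expanding relative links against the URL of the page they were found on can be done with `urljoin` from Python's standard library. The helper name `normalize_links` is hypothetical; the Wikipedia URLs are the ones from the slide.

```python
from urllib.parse import urljoin

def normalize_links(page_url, hrefs):
    """Expand each extracted link against the URL of the page it came from."""
    return [urljoin(page_url, href) for href in hrefs]

page_url = "http://en.wikipedia.org/wiki/Main_Page"
hrefs = [
    "/wiki/Wikipedia:General_disclaimer",    # relative link from the slide
    "http://informationretrieval.org/",      # already absolute: left unchanged
]
absolute = normalize_links(page_url, hrefs)
print(absolute[0])   # http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
print(absolute[1])   # http://informationretrieval.org/
```

Normalizing before the duplicate-URL check matters, since the same target can otherwise appear in the frontier under several spellings.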
CSE 8337 Spring 2011 54 Content seen? Duplication is widespread on the web. If the page just fetched is already in the index, do not further process it. This is verified using document fingerprints or shingles http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf http://www.cs.princeton.edu/courses/archive/spr08/cos435/Class_notes/duplicateDocs_corrected.pdf
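A rough sketch of both ideas, under stated assumptions: the fingerprint is a SHA-1 hash of whitespace-normalized, lowercased text (catches exact duplicates only), and near-duplicates are measured by the Jaccard overlap of word 3-shingles. The function names, the shingle size, and the sample sentences are illustrative choices, not taken from the linked papers.

```python
import hashlib

def fingerprint(text):
    """Exact-duplicate check: hash of the text after trivial normalization."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of all runs of k consecutive words (k = 3 is an assumed choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumps over the lazy cat"
d3 = "web crawlers fetch pages and extract links"

print(fingerprint(d1) == fingerprint("The  quick brown fox jumps over the lazy dog"))
print(jaccard(shingles(d1), shingles(d2)))   # 6 shared shingles out of 8 distinct
print(jaccard(shingles(d1), shingles(d3)))   # no shingles in common
```

A crawler could skip a page whose fingerprint is already stored, or whose shingle overlap with a stored page exceeds some threshold; at web scale the shingle sets are usually summarized rather than compared directly.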
CSE 8337 Spring 2011 55 Filters and robots.txt Filters – regular expressions for URL’s to be crawled/not Once a robots.txt file is fetched from a site, need not fetch it repeatedly Doing so burns bandwidth, hits web server Cache robots.txt files
CSE 8337 Spring 2011 56 Distributing the crawler Run multiple crawl threads, under different processes – potentially at different nodes Geographically distributed nodes Partition hosts being crawled into nodes Hash used for partition How do these nodes communicate?
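Hash-based partitioning of hosts across crawler nodes might look like the sketch below. It assumes that a URL's host name is hashed with a stable hash (MD5 here, chosen only because Python's built-in `hash` varies between runs) and reduced modulo the number of nodes; the function name and URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def node_for_url(url, num_nodes):
    """Map a URL to a crawler node by hashing its host name,
    so that all URLs of one host land on the same node."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

urls = [
    "http://www.example.com/a.html",
    "http://www.example.com/b/c.html",
    "http://news.example.org/today",
]
for u in urls:
    print(node_for_url(u, num_nodes=4), u)
```

Keeping each host on a single node lets that node enforce politeness for the host locally; a node that extracts a URL belonging to another node's partition has to forward it, which is the communication question the slide raises.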
CSE 8337 Spring 2011 57 URL frontier: two main considerations Politeness: do not hit a web server too frequently. Freshness: crawl some pages more often than others, e.g., pages (such as news sites) whose content changes often. These goals may conflict with each other. (E.g., a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site.)
CSE 8337 Spring 2011 58 Politeness – challenges Even if we restrict fetching from a host to a single thread, that thread can hit the host repeatedly. Common heuristic: insert a time gap between successive requests to a host that is >> the time taken by the most recent fetch from that host
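The time-gap heuristic can be written as a one-line rule. The multiplier of 10 below is an assumed illustrative value for "much larger than"; the slide does not fix a number, and the function name is hypothetical.

```python
def next_allowed_time(fetch_start, fetch_end, gap_factor=10.0):
    """Earliest time the same host may be contacted again: the gap after the
    fetch is gap_factor times as long as the fetch itself took."""
    return fetch_end + gap_factor * (fetch_end - fetch_start)

# A fetch that started at t = 100.0 s and finished at t = 100.5 s
print(next_allowed_time(100.0, 100.5))   # 105.5
```

Tying the gap to the observed fetch time means slow or heavily loaded servers are automatically contacted less often.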
CSE 8337 Spring 2011 59 URL frontier: Mercator scheme Prioritizer Biased front queue selector Back queue router Back queue selector K front queues B back queues Single host on each URLs Crawl thread requesting URL
CSE 8337 Spring 2011 60 Mercator URL frontier URLs flow in from the top into the frontier. Front queues manage prioritization. Back queues enforce politeness. Each queue is FIFO http://users.cis.fiu.edu/~lusec001/presentations/mercator_join.pdf
CSE 8337 Spring 2011 61 Front queues Prioritizer 1K Biased front queue selector Back queue router
CSE 8337 Spring 2011 62 Front queues Prioritizer assigns to URL an integer priority between 1 and K Appends URL to corresponding queue Heuristics for assigning priority Refresh rate sampled from previous crawls Application-specific (e.g., “crawl news sites more often”)
CSE 8337 Spring 2011 63 Biased front queue selector When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL This choice can be round robin biased to queues of higher priority, or some more sophisticated variant Can be randomized
CSE 8337 Spring 2011 64 Back queues Biased front queue selector Back queue router Back queue selector 1B
CSE 8337 Spring 2011 65 Back queue invariants Each back queue is kept non-empty while the crawl is in progress. Each back queue only contains URLs from a single host. Maintain a table from hosts to back queues (columns: Host name, Back queue; back queue entries shown: 3, 1, B)
CSE 8337 Spring 2011 66 Back queue heap One entry for each back queue. The entry is the earliest time $t_e$ at which the host corresponding to the back queue can be hit again. This earliest time is determined from: last access to that host; any time buffer heuristic we choose.
CSE 8337 Spring 2011 67 Back queue processing A crawler thread seeking a URL to crawl: Extracts the root of the heap. Fetches the URL at the head of the corresponding back queue q (look up from table). Checks if queue q is now empty; if so, pulls a URL v from the front queues. If there’s already a back queue for v’s host, appends v to that host’s back queue and pulls another URL from the front queues, repeating. Else adds v to q. When q is non-empty, creates a heap entry for it.
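The back queue processing can be sketched as a small class. This is a simplified illustration, not the Mercator implementation: the class and method names are hypothetical, the front queue selector always takes the highest-priority non-empty queue instead of a biased choice, the politeness gap is a fixed `delay`, and `next_url` returns the time at which the URL may be fetched instead of making the thread wait.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit

class MercatorFrontier:
    def __init__(self, num_front, num_back, delay=1.0):
        self.front = [deque() for _ in range(num_front)]  # index 0 = highest priority
        self.back = [deque() for _ in range(num_back)]    # one host per back queue
        self.table = {}                    # host -> index of its back queue
        self.heap = []                     # (earliest allowed hit time, back queue index)
        self.idle = set(range(num_back))   # back queues with no host assigned
        self.next_ok = {}                  # host -> earliest time it may be hit again
        self.delay = delay

    def add(self, url, priority):
        """Prioritizer output: append the URL to the given front queue."""
        self.front[priority].append(url)

    def _pull_front(self):
        """Simplified selector: highest-priority non-empty front queue first."""
        for queue in self.front:
            if queue:
                return queue.popleft()
        return None

    def _refill(self, idx):
        """Give empty back queue idx a new host, routing other URLs on the way."""
        while True:
            url = self._pull_front()
            if url is None:
                self.idle.add(idx)
                return False
            host = urlsplit(url).hostname
            if host in self.table:         # host already owns a back queue
                self.back[self.table[host]].append(url)
                continue
            self.table[host] = idx
            self.back[idx].append(url)
            self.idle.discard(idx)
            heapq.heappush(self.heap, (self.next_ok.get(host, 0.0), idx))
            return True

    def next_url(self, now):
        """Return (url, time at which it may be fetched), or None if empty."""
        for idx in sorted(self.idle):
            if not self._refill(idx):
                break
        if not self.heap:
            return None
        ready, idx = heapq.heappop(self.heap)   # root of the heap
        url = self.back[idx].popleft()
        host = urlsplit(url).hostname
        fetch_time = max(now, ready)
        self.next_ok[host] = fetch_time + self.delay
        if self.back[idx]:
            heapq.heappush(self.heap, (self.next_ok[host], idx))
        else:
            del self.table[host]
            self._refill(idx)
        return url, fetch_time

frontier = MercatorFrontier(num_front=2, num_back=2, delay=5.0)
frontier.add("http://a.example/1", 0)
frontier.add("http://a.example/2", 0)
frontier.add("http://b.example/1", 1)
results = [frontier.next_url(now) for now in (0.0, 0.1, 0.2, 6.0)]
print(results)
```

In this run the second URL for a.example is held back until time 5.0, while the URL for b.example is released immediately, so the two requests to the same host are separated by the chosen delay.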
CSE 8337 Spring 2011 68 Number of back queues B Keep all threads busy while respecting politeness Mercator recommendation: three times as many back queues as crawler threads