1 Green Island (Coral Cay; Great Barrier Reef; Australia; 9/18)

3 Agincourt Reef (Outer Barrier Reef; Australia; 9/19)

4 Search Engine A search engine is essentially a text retrieval system for web pages plus a Web interface. So what’s new???

5 Some Characteristics of the Web
Web pages are
– voluminous and diversified
– widely distributed over many servers
– extremely dynamic/volatile
Web pages
– have more structure (extensively tagged)
– are extensively linked
– often have other associated metadata
Web search is
– noisy (pages with high similarity to the query may still differ in relevance)
– adversarial! (a page can advertise itself falsely just so it will be retrieved)
Web users are
– ordinary folks (“dolts”?) without special training; they tend to submit short queries
– a very large community
Takeaways: use the links, tags, and metadata; use the social structure of the web; we need to crawl and maintain an index; users are easily impressed.

6 Crawlers: Main Issues
General-purpose crawling
Context-specific crawling
– Building topic-specific search engines…

10 SPIDER CASE STUDY

11 Web Crawling (Search) Strategy
Starting location(s)
Traversal order
– Depth-first
– Breadth-first
– Or ???
Cycles? Coverage? Load?
[Figure: example web graph with nodes b through j]
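
A minimal sketch of the breadth-first option above, with a visited set to break cycles. The fetch_links helper and the toy graph are hypothetical stand-ins for real page fetching:

```python
from collections import deque

def crawl_bfs(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl from seed URLs; the visited set breaks cycles."""
    frontier = deque(seeds)          # FIFO queue gives breadth-first order
    visited = set(seeds)
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):    # fetch the page, extract its out-links
            if link not in visited:      # skip already-seen URLs (cycle check)
                visited.add(link)
                frontier.append(link)
    return crawled

# Toy usage: a hand-built link graph standing in for real fetches.
graph = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": []}
print(crawl_bfs(["a"], lambda u: graph.get(u, [])))  # ['a', 'b', 'c', 'd']
```

Swapping popleft() for pop() turns the queue into a stack and gives depth-first traversal instead.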

12 Robot (2)
Some specific issues:
1. What initial URLs to use?
Choice depends on the type of search engine to be built.
– For general-purpose search engines, use URLs that are likely to reach a large portion of the Web, such as the Yahoo home page.
– For local search engines covering one or several organizations, use the URLs of the home pages of these organizations. In addition, use an appropriate domain constraint.

13 Robot (7)
Several research issues about robots:
Fetching more important pages first with limited resources.
– Can use measures of page importance
Fetching web pages in a specified subject area, such as movies or sports, for creating domain-specific search engines.
– Focused crawling
Efficient re-fetching of web pages to keep the index up-to-date.
– Keep track of the change rate of each page

14 Storing Summaries
Can’t store complete page text
– The whole Web doesn’t fit on any server
Stop words, stemming
What (compact) summary should be stored?
– Per URL: title, snippet
– Per word: URL, word number
But look at Google’s “Cache” copy… and its “privacy violations”…
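
To make the per-URL and per-word summaries above concrete, a minimal sketch; the field names and example data are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class UrlSummary:
    title: str     # page <title> text
    snippet: str   # short excerpt shown with the result

@dataclass
class Posting:
    url_id: int    # which document the word occurred in
    position: int  # word number within that document

# Per-URL store: url_id -> UrlSummary
summaries = {42: UrlSummary("Travelocity", "airplane ticket and hotel...")}
# Per-word store (inverted index): term -> list of postings
index = {"hotel": [Posting(url_id=42, position=4)]}
```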

17 Robot (4)
2. How to extract URLs from a web page?
Need to identify all possible tags and attributes that hold URLs.
– Anchor tag: <a href="url">…</a>
– Option tag: <option value="url">…</option>
– Map: <area href="url">
– Frame: <frame src="url">
– Link to an image: <img src="url">
– Relative path vs. absolute path: href="page.html" vs. href="http://host/page.html" (relative paths must be resolved against the page’s base URL)
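
A sketch of this extraction step using Python’s standard html.parser; the tag/attribute table follows the slide’s list, and the example URLs are made up:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly hold URLs, per tag (from the slide's list).
URL_ATTRS = {"a": "href", "area": "href", "frame": "src",
             "img": "src", "option": "value", "link": "href"}

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # urljoin resolves relative paths against the base URL
                self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("http://example.com/dir/")
extractor.feed('<a href="page.html">x</a> <img src="/pic.gif">')
print(extractor.links)
# ['http://example.com/dir/page.html', 'http://example.com/pic.gif']
```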

22 Mercator’s way of maintaining the URL frontier
– Extracted URLs enter a front queue
– Each URL goes into a front queue based on its priority (priority is assigned based on page importance and change rate)
– URLs are shifted from front to back queues. Each back queue corresponds to a single host, and each has a time t_e at which the host can be hit again
– URLs are removed from a back queue when the crawler wants a page to crawl
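
A simplified sketch of the front/back-queue discipline described above; the data structures and the scan-for-a-ready-host policy are assumptions for illustration, not Mercator’s actual code:

```python
import time
from collections import deque
from urllib.parse import urlparse

class Frontier:
    def __init__(self, num_priorities=3, delay=2.0):
        # Front queues: one FIFO per priority level.
        self.front = [deque() for _ in range(num_priorities)]
        self.back = {}        # host -> deque of URLs (one back queue per host)
        self.ready_at = {}    # host -> earliest time t_e the host may be hit again
        self.delay = delay

    def add(self, url, priority=0):
        self.front[priority].append(url)

    def _refill(self):
        # Shift URLs from front queues (highest priority first) to back queues.
        for q in self.front:
            while q:
                url = q.popleft()
                host = urlparse(url).netloc
                self.back.setdefault(host, deque()).append(url)

    def next_url(self):
        self._refill()
        now = time.time()
        for host, q in self.back.items():
            if q and self.ready_at.get(host, 0.0) <= now:
                self.ready_at[host] = now + self.delay  # update t_e for this host
                return q.popleft()
        return None  # no host is ready to be hit again yet
```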

23 Focused Crawling
Classifier: is crawled page P relevant to the topic?
– An algorithm that maps a page to relevant/irrelevant
– Semi-automatic
– Based on page vicinity…
Distiller: is crawled page P likely to lead to relevant pages?
– An algorithm that maps a page to likely/unlikely
– Could be just A/H (authorities/hubs) computation, taking HUBS
– The distiller determines the priority of following links off of P
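
A sketch of how the classifier and distiller might drive a best-first frontier; fetch, classify, and distill are hypothetical placeholders for the components the slide names:

```python
import heapq

def focused_crawl(seeds, fetch, classify, distill, max_pages=100):
    # Max-heap via negated distiller scores: follow the most promising links first.
    frontier = [(0.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen, relevant = set(seeds), []
    while frontier and len(relevant) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if classify(page):            # relevant/irrelevant decision
            relevant.append(url)
        priority = distill(page)      # how likely P leads to relevant pages
        for link in page.links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority, link))
    return relevant
```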

29 Map-Reduce Parallelism
Named after the Lisp constructs map and reduce:
– (reduce #’fn2 (map #’fn1 list)) runs fn1 on every item of the list, then reduces the resulting list using fn2
– (reduce #’* (map #’1+ ‘(4 5 6 7 8 9)))
  = (reduce #’* ‘(5 6 7 8 9 10))
  = 151200 (= 5*6*7*8*9*10)
– (reduce #’+ (map #’primality-test ‘(num1 num2 …)))
So where is the parallelism?
– All the map operations can be done in parallel (e.g., you can test the primality of each of the numbers in parallel).
– The overall reduce operation has to be done after the map operation, but it can also be parallelized: e.g., assuming the primality test returns 0 or 1, the reduce can partition the list into k smaller lists, add the elements of each list in parallel, and then add the results.
– Note that the parallelism in both examples depends on the length of the input: the larger the input list, the more parallel operations you can do in theory.
Map-reduce on clusters of computers involves writing your task in map-reduce form; the cluster computing infrastructure will then “parallelize” the map and reduce parts using the available pool of machines (while writing the program, you don’t need to think about how many machines, or which specific machines, do the parallel tasks).
An open-source environment that provides such an infrastructure is Hadoop: http://hadoop.apache.org/core/
Qn: Can we bring map-reduce parallelism to indexing? (added after class, based on class discussion)
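
For concreteness, the same pattern in Python, with the map stage parallelized across processes. This is a sketch; is_prime is a naive stand-in for the slide’s primality-test:

```python
from functools import reduce
from multiprocessing import Pool

def is_prime(n):
    """Naive primality test; returns 1 if n is prime, else 0."""
    if n < 2:
        return 0
    return int(all(n % d for d in range(2, int(n ** 0.5) + 1)))

if __name__ == "__main__":
    nums = [4, 5, 6, 7, 8, 9]
    # Sequential, mirroring (reduce #'* (map #'1+ '(4 5 6 7 8 9))):
    print(reduce(lambda x, y: x * y, map(lambda n: n + 1, nums)))  # 151200

    # Parallel map stage: each primality test runs in a worker process;
    # the reduce (a sum here) happens after all the maps complete.
    with Pool() as pool:
        flags = pool.map(is_prime, range(10, 30))
    print(sum(flags))  # count of primes in [10, 30)
```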

30 Partition the set of documents into “blocks”
– Construct an index for each block separately
– Merge the indexes
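
A minimal sketch of this block-then-merge scheme; the document representation and whitespace tokenization are simplifying assumptions:

```python
from collections import defaultdict

def index_block(block):
    """Build an inverted index (term -> doc-id list) for one block of documents."""
    idx = defaultdict(list)
    for doc_id, text in block:
        for term in set(text.lower().split()):
            idx[term].append(doc_id)
    return idx

def merge_indexes(indexes):
    """Merge per-block indexes by concatenating the postings for each term."""
    merged = defaultdict(list)
    for idx in indexes:
        for term, postings in idx.items():
            merged[term].extend(postings)
    for postings in merged.values():
        postings.sort()
    return merged

blocks = [[(1, "coral reef"), (2, "barrier reef")], [(3, "coral cay")]]
full_index = merge_indexes(index_block(b) for b in blocks)
print(full_index["coral"])  # [1, 3]
```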

36 Distributing indexes over hosts
At web scale, the entire inverted index can’t be held on a single host. How to distribute?
– Split the index by terms
– Split the index by documents
The preferred method is to split it by docs (!)
– Each index only points to docs in a specific barrel
– Different strategies for assigning docs to barrels
At retrieval time:
– Compute the top-k docs from each barrel
– Merge the per-barrel top-k lists to generate the final top-k
Result merging can be tricky… so try to punt it
Idea:
– Consider putting the most “important” docs in the top few barrels; this way, we can ignore the other barrels unless the top barrels don’t return enough results
Another idea:
– Split the top 20% and bottom 80% of the doc occurrences into different indexes: short vs. long barrels. Search the short ones first, then go to the long ones as needed.
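
A sketch of the retrieval-time step above: compute a top-k list per barrel, then merge the lists into the global top-k. The (doc_id, score) pairs are made-up placeholders:

```python
import heapq

def topk_per_barrel(barrel_scores, k):
    """barrel_scores: iterable of (doc_id, score) pairs for one barrel."""
    return heapq.nlargest(k, barrel_scores, key=lambda p: p[1])

def merge_topk(per_barrel_lists, k):
    """Merge the per-barrel top-k lists into the final global top-k."""
    merged = [p for lst in per_barrel_lists for p in lst]
    return heapq.nlargest(k, merged, key=lambda p: p[1])

barrel1 = [(1, 0.9), (2, 0.4), (3, 0.7)]
barrel2 = [(4, 0.8), (5, 0.2)]
lists = [topk_per_barrel(b, 2) for b in (barrel1, barrel2)]
print(merge_topk(lists, 2))  # [(1, 0.9), (4, 0.8)]
```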

37 Dynamic Indexing “simplest” approach

38 IR for Web Pages

39 Use of Tag Information (1)
Web pages are mostly HTML documents (for now).
HTML tags allow the author of a web page to
– control the display of page contents on the Web
– express emphasis on different parts of the page
HTML tags provide additional information about the contents of a web page.
Can we make use of tag information to improve the effectiveness of a search engine?

40 Use of Tag Information (2)
Two main ideas for using tags:
– Associate different importance to term occurrences in different tags
– Use anchor text to index referenced documents
[Figure: Page 1 links to Page 2 (http://travelocity.com/) with the anchor text “airplane ticket and hotel”; that anchor text is used to index Page 2]
A document is indexed not just with its contents, but with others’ descriptions of it.
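
A toy sketch of the anchor-text idea: the target page is indexed under the words its in-links use to describe it. The link triple mirrors the slide’s figure; the data layout is an assumption:

```python
from collections import defaultdict

# (source_page, anchor_text, target_url) triples collected while crawling.
links = [("page1.html", "airplane ticket and hotel", "http://travelocity.com/")]

anchor_index = defaultdict(set)
for _source, anchor_text, target in links:
    for term in anchor_text.lower().split():
        # Index the *target* page under the anchor words, not the source.
        anchor_index[term].add(target)

print(anchor_index["hotel"])  # {'http://travelocity.com/'}
```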

41 Google Bombs: The Other Side of Anchor Text
You can “tar” someone’s page just by linking to it with some damning anchor text.
– If the anchor text is unique enough, then even a few pages linking with that keyword will make sure the page comes up high.
– E.g., link your SO’s page with “my cuddlybubbly woogums” (“Shmoopie”, unfortunately, is already taken by Seinfeld).
– For more commonplace keywords (such as “unelectable” or “my sweet heart”) you need a lot more links, which, in the case of the latter, may defeat the purpose.
A document is indexed not just with its contents, but with others’ descriptions of it.

42 Use of Tag Information (3)
Many search engines use tags to improve retrieval effectiveness.
– Associating different importance to term occurrences is used in AltaVista, HotBot, Yahoo, Lycos, LASER, and SIBRIS.
– WWWW and Google use terms in anchor tags to index the referenced page.
Qn: what should the exact weights for different kinds of terms be?

43 Use of Tag Information (4)
The Webor Method (Cutler 97, Cutler 99)
– Partition HTML tags into six ordered classes: title, header, list, strong, anchor, plain.
– Extend the term frequency value of a term in a document into a term frequency vector (TFV): suppose term t appears in the i-th class tf_i times, i = 1..6; then TFV = (tf_1, tf_2, tf_3, tf_4, tf_5, tf_6).
Example: if, for page p, the term “binghamton” appears 1 time in the title, 2 times in headers, and 8 times in the anchors of hyperlinks pointing to p, then for this term in p: TFV = (1, 2, 0, 0, 8, 0).

44 Use of Tag Information (6)
The Webor Method (continued)
Challenge: how to find the (optimal) CIV = (civ_1, civ_2, civ_3, civ_4, civ_5, civ_6) such that retrieval performance is improved the most?
One solution: find the optimal CIV experimentally, using a hill-climbing search in the space of CIVs.
Details skipped.
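
To make the two vectors concrete: the per-class term frequencies (TFV) are combined with the class importance values (CIV), typically as a dot product. A sketch; the CIV numbers are hypothetical, and the hill-climbing search itself is skipped, as in the slides:

```python
CLASSES = ["title", "header", "list", "strong", "anchor", "plain"]

def tfv(term_counts):
    """Term frequency vector: counts of a term in each of the six tag classes."""
    return [term_counts.get(c, 0) for c in CLASSES]

def weighted_tf(tfv_vec, civ):
    """Combined importance of a term: dot product of TFV and CIV."""
    return sum(tf * w for tf, w in zip(tfv_vec, civ))

# The slide's example: "binghamton" on page p.
v = tfv({"title": 1, "header": 2, "anchor": 8})
print(v)                              # [1, 2, 0, 0, 8, 0]
civ = [4.0, 2.0, 1.0, 1.5, 3.0, 1.0]  # hypothetical class importance values
print(weighted_tf(v, civ))            # 1*4.0 + 2*2.0 + 8*3.0 = 32.0
```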

45 Use of LINK Information: Why?
Pure query similarity will be unable to pinpoint the right pages because of the sheer volume of pages.
– There may be too many pages that have the same keyword similarity to the query: the “even if you are one in a million, there are still 300 more like you” phenomenon.
Web content creators are autonomous/uncontrolled.
– No one stops me from making a page and writing on it “this is the homepage of President Bush”.
– … and adversarial: I may intentionally create pages with keywords just to drive traffic to my page. I might even use spoofing techniques to show one face to the search engine and another to the user.
So we need some metrics for the trustworthiness/importance of a page.
– Let’s look at social networks, since these topics have been investigated there.

46 Connection to Citation Analysis
Mirror, mirror on the wall, who is the biggest computer scientist of them all?
– The one who wrote the most papers that are considered important by most people, by citing them in their own papers (“Science Citation Index”).
– Should I write survey papers or original papers?
Informetrics; bibliometrics

47 What Citation Index says About Rao’s papers

48 Scholar.google

