1 CS/INFO 430 Information Retrieval Lecture 18 Web Search 4.

1 CS/INFO 430 Information Retrieval Lecture 18 Web Search 4

2 Course Administration

3 Search Engine Spam: Objective
The success of a commercial Web site depends on the number of visitors who find the site while searching for a particular product.
85% of searchers look at only the first page of results.
This has created a new business sector: search engine optimization.
M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. International Joint Conference on Artificial Intelligence, 2003.
I. Drost and T. Scheffer. Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam. 16th European Conference on Machine Learning, Porto, 2005.

4 Spam: Meta Tags
Meta tags give the creator of a Web page a place for cataloguing data that describes the page, but they can also be used for advertising, misleading, or otherwise mischievous text.
Example: http://www.georgewbush.com/ (October 2000)

5 Search Engine Spam: Techniques
Invisible text: Add keywords to a page in the hope that search engines will index them, but arrange them so that they are not visible to a user, e.g., by using a special type of format, the background color, etc.
Cloaking: Return a different page to Web crawlers than to ordinary downloads. (Cloaking can also be used to help Web search, e.g., by providing a text version of a highly visual page.)
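One way a search service might look for cloaking is to fetch the same URL twice, once identifying itself as a crawler and once as an ordinary browser, and compare the two responses. The sketch below covers only the comparison step. The function names, the crude tag stripping, and the 0.5 threshold are all assumptions made for illustration, and a low overlap is only a hint, since the slide notes that cloaking also has legitimate uses.

```python
import re


def visible_terms(html):
    """Return the set of lowercase terms left after crudely stripping tags."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_cloaked(crawler_html, browser_html, threshold=0.5):
    """Flag a page whose crawler and browser versions share few terms.

    The threshold is a hypothetical tuning parameter, not a known standard.
    """
    overlap = jaccard(visible_terms(crawler_html), visible_terms(browser_html))
    return overlap < threshold


# Hypothetical responses for the same URL.
crawler_page = "<html><body>cheap flights hotels cars deals</body></html>"
browser_page = "<html><body>casino bonus</body></html>"

suspicious = looks_cloaked(crawler_page, browser_page)
```

A real check would also need to tolerate dynamic content such as dates and advertisements, which change between any two downloads of the same page.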

6 Search Engine Spam: Anchor Text
Search engines assume that anchor text provides helpful terms for indexing the page that is linked to, but anchor text can be deliberately misleading.
Consider the impact if a million pages each contained the anchor text: Cornell University
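To make the risk concrete, here is a minimal sketch of anchor-text indexing, in which the terms in each link's anchor text are credited to the page the link points to. The function names, the simple count-based scoring, and the spam.example URL are assumptions for illustration, not how any particular search engine works. The sketch uses 1,000 misleading links rather than a million to keep it fast.

```python
from collections import Counter, defaultdict


def index_anchor_text(links):
    """Credit each anchor-text term to the link's target page.

    links: iterable of (source_url, target_url, anchor_text) tuples.
    """
    index = defaultdict(Counter)
    for _source, target, anchor in links:
        for term in anchor.lower().split():
            index[target][term] += 1
    return index


def rank_by_anchor(index, query):
    """Rank target pages by how often the query terms occur in their anchors."""
    terms = query.lower().split()
    scores = {
        target: sum(counts[term] for term in terms)
        for target, counts in index.items()
    }
    matches = [target for target, score in scores.items() if score > 0]
    return sorted(matches, key=lambda target: (-scores[target], target))


# A few honest links, then many pages using the same anchor text for another site.
links = [
    (f"http://honest{i}.example/", "http://www.cornell.edu/", "Cornell University")
    for i in range(3)
]
links += [
    (f"http://farm{i}.example/", "http://spam.example/", "Cornell University")
    for i in range(1000)
]

index = index_anchor_text(links)
results = rank_by_anchor(index, "Cornell University")
```

With these inputs the misleading target outranks the genuine one on anchor text alone, which is why anchor text cannot be trusted without other evidence.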

7 Search Engine Spam: Anchor Text
Google bomb: a collective hyperlinking strategy intended to change the search results for a specific term or phrase.
Examples:
- The "miserable failure" Google bomb promoted George W. Bush's page on whitehouse.gov to the number one rank for a search on the phrase "miserable failure."
- The "Jew" Google bomb demoted an anti-Semitic Web site from the number one rank for a search on "Jew," and promoted the wikipedia.org definition of "Jew" to number one.
See: Clifford Tatum, 2005, http://www.firstmonday.org/issues/issue10_10/tatum/

8 Link Spamming: Techniques
Link exchange services: Listings of (often unrelated) hyperlinks. To be listed, businesses have to provide a back link that enhances the PageRank of the exchange service.
Guestbooks, discussion boards, and weblogs: Automatic tools post large numbers of messages to many sites; each message contains a hyperlink to the target website.
Link farms: Densely connected arrays of pages. Farm pages propagate their PageRank to the target, e.g., by a funnel-shaped architecture that points directly or indirectly towards the target page. To camouflage link farms, tools fill in inconspicuous content, e.g., by copying news bulletins.

9 Search Engine Spam: Link Farms
The regular Web, W, with $n_w$ pages.
A link farm, F, with $n_f$ pages.
A link from W to F allows the crawler to find F.

10 Search Engine Spam: Link Farms
Consider the PageRank iteration formula
$w_k = (1-d)\,w_0 + d\,B\,w_{k-1}$
Assuming that all pages are crawled, the effect of the term $(1-d)\,w_0$ is that the random jumps go to W and F in the ratio $n_w : n_f$.
Since there are few links between W and F, the effect of B is to assign PageRank within W and F respectively.
Therefore the total PageRank is divided between W and F in the ratio $n_w : n_f$.
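The argument above can be checked numerically with a small sketch of the iteration. It assumes that the jump vector (the slide's first term) is uniform over all pages, that every page has at least one out-link, and that there are no links at all between W and F, which is the limiting case of "few links". The page names, ring-shaped link structure, and d = 0.85 are illustrative choices only.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate w_k = (1 - d) * w_0 + d * B * w_{k-1} with a uniform w_0.

    links maps each page to the list of pages it links to. Every page must
    appear as a key and have at least one out-link (an assumption of this sketch).
    """
    pages = sorted(links)
    jump = {page: 1.0 / len(pages) for page in pages}  # the vector w_0
    rank = dict(jump)
    for _ in range(iterations):
        nxt = {page: (1 - d) * jump[page] for page in pages}
        for page in pages:
            # B spreads a page's rank evenly over its out-links.
            share = d * rank[page] / len(links[page])
            for target in links[page]:
                nxt[target] += share
        rank = nxt
    return rank


def ring(prefix, size):
    """Build `size` pages, each linking to the next one in a cycle."""
    return {f"{prefix}{i}": [f"{prefix}{(i + 1) % size}"] for i in range(size)}


n_w, n_f = 6, 3
links = {**ring("w", n_w), **ring("f", n_f)}  # W and F are not connected
ranks = pagerank(links)

rank_W = sum(r for page, r in ranks.items() if page.startswith("w"))
rank_F = sum(r for page, r in ranks.items() if page.startswith("f"))
```

Here `rank_W / rank_F` matches `n_w / n_f`: adding pages to the farm increases the farm's total share of PageRank regardless of how the rest of the Web is linked.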

11 Search Engine Spam: Link Farms
The manager of the link farm, F, can organize the links within the farm so that certain pages within the farm, $h_1, h_2, \ldots, h_k$, are highly ranked.
A manager who wants to give high rank to a page $w_0$ in W places links to $w_0$ from several of the pages $h_1, h_2, \ldots, h_k$.
As a result, $w_0$ is linked to from several highly ranked pages and hence becomes highly ranked.
(In addition, $w_0$ could link back to F, thus returning rank to the farm.)
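The boosting effect can be illustrated with a toy graph. The sketch below assumes a uniform random jump, d = 0.85, a regular Web of five pages in a ring, and a farm of four pages that all point to a single hub page h1 (the slide allows several such pages). These sizes and names are hypothetical. The same pages are ranked twice: once with the hub pointing back into the farm, and once with the hub pointing at the target page w0.

```python
def pagerank(links, d=0.85, iterations=100):
    """PageRank by repeated iteration with a uniform random jump.

    links maps each page to its out-links; every page needs at least one.
    """
    pages = sorted(links)
    jump = {page: 1.0 / len(pages) for page in pages}
    rank = dict(jump)
    for _ in range(iterations):
        nxt = {page: (1 - d) * jump[page] for page in pages}
        for page in pages:
            share = d * rank[page] / len(links[page])
            for target in links[page]:
                nxt[target] += share
        rank = nxt
    return rank


def build_web(hub_target):
    """Five regular pages in a ring, plus a farm that funnels rank into h1."""
    links = {f"w{i}": [f"w{(i + 1) % 5}"] for i in range(5)}
    links.update({f"f{i}": ["h1"] for i in range(4)})
    links["h1"] = [hub_target]  # where the farm's collected rank is sent
    return links


baseline = pagerank(build_web("f0"))  # hub keeps its rank inside the farm
boosted = pagerank(build_web("w0"))   # hub passes its rank to the target w0

gain = boosted["w0"] - baseline["w0"]
```

In this toy graph `gain` is positive and w0 becomes the highest-ranked page, even though nothing in the regular Web changed.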

12 Link Spamming: Defenses
Manual identification of spam pages and farms to create a blacklist.
Automatic classification of pages using machine learning techniques.
BadRank algorithm: The "bad rank" is initialized to a high value for blacklisted pages. The algorithm propagates bad rank to all referring pages (with a damping factor), thus penalizing pages that refer to spam.
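The slide does not give a formula for BadRank, so the following is only one plausible formulation, assumed here for illustration: each page's bad rank is a seed value (1 for blacklisted pages, 0 otherwise) plus a damped share of the bad rank of the pages it links to, where each target's bad rank is split over its in-links. The page names and d = 0.85 are hypothetical.

```python
def bad_rank(links, blacklist, d=0.85, iterations=50):
    """Propagate bad rank backwards along links, from targets to referrers.

    links maps each page to the pages it links to; blacklist is a set of
    known spam pages. This update rule is an assumed formulation.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    in_degree = {page: 0 for page in pages}
    for targets in links.values():
        for q in targets:
            in_degree[q] += 1
    seed = {page: 1.0 if page in blacklist else 0.0 for page in pages}
    bad = dict(seed)
    for _ in range(iterations):
        bad = {
            page: (1 - d) * seed[page]
            + d * sum(bad[q] / in_degree[q] for q in links.get(page, []))
            for page in pages
        }
    return bad


links = {
    "spam": ["spam2"],
    "spam2": ["spam"],
    "shop": ["spam"],   # refers directly to a blacklisted page
    "blog": ["shop"],   # refers to a page that refers to spam
    "news": ["wiki"],   # no path to spam
    "wiki": ["news"],
}
scores = bad_rank(links, blacklist={"spam"})
```

In this example the penalty falls off with distance from the blacklisted page: "shop" scores higher than "blog", and pages with no path to spam keep a bad rank of zero.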

13 Search Engine Friendly Pages
Good ways to get your page indexed and ranked highly:
Use straightforward URLs, with simple structure, that do not change with time.
Submit your site to be crawled.
Provide a site map of the pages that you wish to be crawled.
Have the words that you would expect to see in queries:
- in the content of your pages.
- in [...] and [...] tags.
Attempt to have links to your page from appropriate authorities.
Avoid suspicious behavior.
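The slide does not say which site map format it has in mind. Assuming the XML format defined by the sitemaps.org protocol, a site map listing the pages to be crawled could be generated along these lines. The function name and the example.com URLs are hypothetical, and only the required `loc` element is written.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap document listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


page_urls = [
    "http://www.example.com/",
    "http://www.example.com/products/widgets.html",
]
sitemap_xml = build_sitemap(page_urls)
```

The resulting string would typically be saved as a file on the site so that crawlers can fetch it.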

14 Legal Issues in Web Searching
Copyright
In US law, the creator of a Web page (or the employer) owns the copyright, with a few exceptions.
Copyright gives the owner the exclusive right to reproduce, distribute, perform, display, or license others to reproduce, distribute, perform, or display.
Search engines operate under an untested legal concept of an implied license. The concept is to assume that somebody who puts a Web page online expects users to download it, read it, index it, etc., unless the copyright owner explicitly states otherwise.
Historically, Web companies have been cautious, but recently Google has been pushing the legal limits.

15 Economic Models for Content and Services on the Web
Mounting information on the Web or supplying services costs money. Who pays?
Open access:
- Externally funded from other funds (standard model).
- Advertising (e.g., Web search).
Restricted access:
- Subscription (e.g., journal publishers).
- Pay by use (rare).
Note that these same four models are used for television.

17 Information about Individuals
Advertising is most effective if it is tailored to the individual.
Portals, such as Yahoo or Google, have many ways of gaining information about users:
- identity tracked by cookie or login
- search terms used, pages retrieved, advertisements clicked
- use of other services, e.g., travel, shopping, maps
Data mining such information can provide valuable services, but raises serious concerns about privacy.

18 How many of these services collect information about the user?

19 Adding Audience Information to Ranking
Conventional information retrieval: A given query returns the same set of hits, ranked in the same sequence, irrespective of who submitted the query.
If the search service has information about the user: The result set and/or the ranking can be varied to match the user's profile.
Example: In an educational digital library, the order of search results can be varied for:
- instructor v. student
- grade level of course

20 Adding Audience Information to Ranking
Metadata-based methods:
- Label documents with controlled vocabulary to define the intended audience.
- Provide users with a means to specify their needs, through a profile (preferences) or by a query parameter.
Automatic methods:
- Capture persistent information about user behavior by data mining.
- Adjust tf.idf rankings using terms derived from terms previously used by the user.
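As a rough illustration of the last automatic method, the sketch below adds down-weighted terms from a user's earlier queries to the current query before tf.idf scoring. The weighting scheme, the `history_weight` parameter, the function names, and the tiny document collection are all assumptions made for this example, not a method described in the lecture.

```python
import math
from collections import Counter


def tf_idf_scores(query_weights, docs):
    """Score each document as the sum of weight * tf * idf over query terms.

    docs maps a document id to its list of terms.
    """
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        scores[doc_id] = sum(
            weight * tf[term] * math.log(n / df[term])
            for term, weight in query_weights.items()
            if df[term]
        )
    return scores


def personalized_query(query_terms, history_terms, history_weight=0.5):
    """Combine the current query with down-weighted terms from past queries."""
    weights = {term: 1.0 for term in query_terms}
    past = Counter(history_terms)
    for term, count in past.items():
        boost = history_weight * count / len(history_terms)
        weights[term] = weights.get(term, 0.0) + boost
    return weights


docs = {
    "jaguar_cars": ["jaguar", "car", "engine", "speed"],
    "jaguar_cats": ["jaguar", "cat", "jungle", "speed"],
    "python_snakes": ["python", "snake", "jungle"],
}

plain = tf_idf_scores({"jaguar": 1.0}, docs)
history = ["cat", "jungle", "cat"]  # terms from the user's earlier queries
personal = tf_idf_scores(personalized_query(["jaguar"], history), docs)
```

Without the history the two "jaguar" documents tie; with it, the document about cats ranks above the one about cars for this user.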

