Digital Libraries IS479 Ranking

1 Digital Libraries IS479 Ranking

2 Content Based Ranking
http://en.wikipedia.org/wiki/Trip_hop mentions the artist "DJ Shadow" once.
Another page mentions "DJ Shadow" 30+ times.
Intuitively, the latter is more "about" DJ Shadow than the former, and the frequency of the term(s) reflects this "aboutness".
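
The "aboutness by frequency" idea can be sketched in a few lines of Python. This is only an illustration: the two page texts and the helper name `term_frequency` are made up for the example, and real engines use more refined weighting than a raw count.

```python
import re

def term_frequency(text: str, phrase: str) -> int:
    """Count case-insensitive, whole-phrase occurrences of `phrase` in `text`."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical page texts, invented for illustration only.
docs = {
    "page_a": "Trip hop artists include many producers, among them DJ Shadow.",
    "page_b": "DJ Shadow news. DJ Shadow tour dates. Buy DJ Shadow records here.",
}

# Rank pages by how often the phrase occurs (highest count first).
ranked = sorted(docs, key=lambda name: term_frequency(docs[name], "DJ Shadow"),
                reverse=True)
print(ranked)
```

Ranking purely on this count is what the following slides call into question, since anyone can repeat a phrase on their own page.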

3 Content Based Ranking
How desirable is recall at web scale?

4 Content Based Ranking Why is djshadow.rpod.ru
on page 15 when it has the phrase "dj shadow" 20+ times, and Rhapsody.com appears on page 1 (pos #6) when it has the same phrase only 6 times?

5 Is It Really About DJ Shadow?
Fake page about the real DJ Shadow? Or real page about a fake

6 Link-Based Metrics
Content based metrics have an implicit assumption: everyone is telling the truth!
We can mine the collective intelligence of the web community by seeing how they voted with their links.
Assumption: when choosing a target for their web page links, (honest) people do a good job of filtering out spam, poor quality, etc.
Result: in search engine rankings, your document is influenced by the content of the documents of others.

7 But Not All Links Are Equal…
You linking to my LP review is nice, but it's not as nice as it would be if it were linked to by Spin Magazine, Rolling Stone, MTV, etc.
A page’s “importance” is defined by having other important pages link to it:
many links > few links
"important" links > "unimportant" links

8 Random "Surfer" Model
The surfer starts at some random page on the Web and begins following links from page to page.
At each page, there is some probability 1-d that the surfer becomes "bored" and randomly jumps to some other page in the Web.
That is, they type a URL directly, follow on from , etc. -- they just "teleport" to some other place in the Web.
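
A small Monte Carlo sketch of the random surfer, assuming a toy four-page graph invented for this example. The function name `simulate_surfer` and the choice to always teleport from a page with no out links are assumptions, not something the slides specify.

```python
import random

def simulate_surfer(links, d=0.85, steps=100_000, seed=0):
    """Return the fraction of steps the surfer spends on each page.

    links maps each page to the list of pages it links to.
    With probability d the surfer follows a random out link;
    with probability 1-d (or if there are no out links) it teleports.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)              # start at some random page
    for _ in range(steps):
        visits[page] += 1
        out = links[page]
        if out and rng.random() < d:
            page = rng.choice(out)        # follow a link
        else:
            page = rng.choice(pages)      # "bored": teleport anywhere
    return {p: visits[p] / steps for p in pages}

# Hypothetical graph: nobody links to D, many pages link to C.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
freq = simulate_surfer(graph)
print(freq)
```

The visit frequencies approximate the "sums to 1" flavour of PageRank: D, which is reached only by teleporting, ends up with roughly (1-d)/N of the visits.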

9 Computing PageRank
original paper version: sums to N (number of pages in graph)
more common version: sums to 1
d = damping factor
L() = out degree of a page
PR() = PageRank of a page (all nodes start with PR() = 1 or 1/N, respectively)
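
For reference, the two versions are usually written as below. This is the standard textbook form rather than a copy of the slide's figures; the notation B(p_i) for the set of pages linking to p_i is an assumption introduced here, while d, L() and PR() follow the definitions above.

```latex
% Original paper version (PageRank values sum to N):
PR(p_i) = (1 - d) + d \sum_{p_j \in B(p_i)} \frac{PR(p_j)}{L(p_j)}

% More common version (PageRank values sum to 1):
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in B(p_i)} \frac{PR(p_j)}{L(p_j)}
```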

10 Calculating PageRank for a Page, One Iteration
(fig 4-3 needs an extra link from C to match the text)
PR(A) = (1-0.85) + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
      = 0.15 + 0.85 * ( 0.5/4 + 0.7/5 + 0.2/1 )
      = 0.15 + 0.85 * ( 0.125 + 0.14 + 0.2 )
      = 0.15 + 0.85 * 0.465
      = 0.54525
damping factor (d) = 0.85 (probability surfer landed on page by following a link)
1-d = 0.15 (probability surfer landed on page at “random”)
Since this is the original version where PR sums to N and we've only accounted for ~1.95 of total PR (0.54525 + 0.5 + 0.7 + 0.2), pages not shown must be holding the rest.

11 PageRank Sinks
[Figure: graph with nodes A, B, C, D, E and sink S]
"S" doesn't point to anybody else, so it will acquire PageRank but not distribute it.
Examples: .pdf, .jpeg, .html w/ no links, etc.
Solution: pretend S has links to all other nodes (A, B, C, D, E).

12 When To Stop?
Stop computing when the changes are small: |PR_{i+1} - PR_i| < ε
PageRank converges in O(log(N)) iterations
see: "The PageRank Citation Ranking: Bringing Order to the Web" for more information
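
The iteration, the sink fix, and the stopping rule can be put together in a short sketch. It assumes the "sums to 1" version of the formula, measures the change as the summed absolute difference between iterations, and uses a made-up six-node graph whose edges are hypothetical (the slides only name the nodes). The function name `pagerank` and the default `eps` are likewise assumptions.

```python
def pagerank(links, d=0.85, eps=1e-8, max_iter=200):
    """Iterate PageRank until the total change drops below eps.

    links maps each page to the list of pages it links to; every
    link target must also appear as a key. Assumes at least two pages.
    """
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}             # all nodes start at 1/N
    for iteration in range(1, max_iter + 1):
        new = {p: (1 - d) / n for p in pages}    # teleport share
        for p in pages:
            # Sink fix: a page with no out links is treated as if it
            # linked to all other pages.
            targets = links[p] or [q for q in pages if q != p]
            if not targets:
                continue
            share = d * pr[p] / len(targets)
            for q in targets:
                new[q] += share
        change = sum(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if change < eps:                         # changes are small: stop
            break
    return pr, iteration

# Hypothetical edges; S is a sink.
graph = {"A": ["S"], "B": ["S", "A"], "C": ["A"],
         "D": ["S"], "E": ["B"], "S": []}
ranks, iterations = pagerank(graph)
print(iterations, ranks)
```

Because the sink's PageRank is redistributed instead of being lost, the scores keep summing to 1 across iterations.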

13 PageRank is a Cool Name…
…but recalling your linear algebra, you're really just computing the principal eigenvector of the normalized adjacency matrix (each column sums to 1).
The innovation was realizing the Web is a graph and applying eigenvector centrality for "quality".
For more info, see PageRank paper,
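
Written out, the eigenvector view looks like the following. This is the usual formulation for the "sums to 1" version, given here as a sketch; the symbols M, r and the all-ones vector 1 are notation introduced for this example.

```latex
% M is the N x N link matrix with M_{ij} = 1 / L(p_j) if p_j links to p_i, else 0,
% so each column of M sums to 1 (once sinks have been patched).
\mathbf{r} = \left( d\,M + \frac{1 - d}{N}\,\mathbf{1}\mathbf{1}^{\top} \right) \mathbf{r},
\qquad \sum_i r_i = 1
```

The PageRank vector r is the eigenvector with eigenvalue 1, which is what repeated iteration of the update converges to.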

14 PageRank Visualizer http://www.mapequation.org/

15 Check Your PageRank… http://www.prchecker.info/check_page_rank.php
10/10: google.com, cnn.com
7/10:
6/10:
5/10: djshadow.com
4/10: f-measure.blogspot.com

16 But PageRank & Friends Are Not the Only Method…
Kleinberg introduced the "HITS" algorithm at roughly the same time as PageRank.
Constraint: rather than build a full, web-scale search engine like Google, he built what could be described as a real-time, post-query processor for the content-based search engines of the day (e.g., AltaVista), one that exploited the link structure in a manner similar to Google.

17 Motivation
ford.com, toyota.com, etc. don't describe themselves as "automobile manufacturers", though a query for those terms arguably should return those companies.
harvard.edu is clearly canonical for a query of "Harvard", even though it uses the term less frequently than many other pages.
Many search engines of the day could not "find themselves".

18 Idea: Use Initial Search Results as "Root"
include pages that link to the root
include pages the root links to
now you have a subgraph to work with

19 Empirical Values
Start with t=200 URIs in the root set
Allow each page to bring d=50 "back link" (green nodes) pages into S
if adobe.com is in the root set, you don't want all of its back links to be in S
S tended to be
possible optimization: exclude intra-domain links to separate "good" links from navigational links
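
Growing the root set into the subgraph S might look like the sketch below. The function name `build_base_set`, the dictionary-based link tables, and the toy data are all assumptions for illustration. Taking the first d back links is one arbitrary way to pick them.

```python
def build_base_set(root, out_links, in_links, d=50):
    """Expand a root set into the base set S.

    root      : iterable of pages returned by the content-based search
    out_links : page -> pages it links to
    in_links  : page -> pages that link to it ("back links")
    d         : maximum number of back links each root page may bring in
    """
    base = set(root)
    for page in root:
        base.update(out_links.get(page, []))           # pages the root links to
        base.update(list(in_links.get(page, []))[:d])  # at most d back links
    return base

# Hypothetical data; d is lowered to 2 so that the cap is visible.
root = ["r1", "r2"]
out_links = {"r1": ["x"], "r2": ["y", "x"]}
in_links = {"r1": ["b1", "b2", "b3"], "r2": ["b4"]}
S = build_base_set(root, out_links, in_links, d=2)
print(sorted(S))
```

The cap is what keeps a very popular root page (the adobe.com case) from flooding S with its back links.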

20 In Degree is Insufficient
Within S, the "good" pages receive more links, and so ordering by in degree would allow authorities to bubble up.
But "universally popular" links (e.g., yahoo.com, adobe.com, netscape.com) would still have too many in links, and they're (generally) not relevant to the query.
Example (for a "similar page" query for honda.com, but result is comparable):
Honda
Ford Motor Company
The Blue Ribbon Campaign for Online Free Speech
Welcome to Magellan!
Welcome to Netscape
LinkExchange — Welcome
Welcome
PointCom
Welcome to Netscape
Yahoo!

21 Hubs… Insight: authoritative pages relevant to the query not only have high in degree, but also overlap in the pages that point to them: good hubs point to good authorities, and good authorities point to good hubs…

22 Computing Hubs & Authorities
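
A minimal sketch of the hub and authority updates follows. It assumes the usual formulation: a page's authority score is the sum of the hub scores of pages pointing to it, its hub score is the sum of the authority scores of pages it points to, and both are rescaled after each round. Normalizing so that the squared scores sum to 1 is an assumption here (other presentations divide by the plain sum), and the function name `hits` and the toy graph are made up.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links maps a page to the list of pages it points to.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages pointing to each page.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum the authority scores of the pages each page points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so that the squared scores of each kind sum to 1.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: three hub-like pages pointing at two candidate authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1", "a2"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph a1 is pointed to by every hub and gets the top authority score, while h1 and h3, which point to both authorities, outrank h2 as hubs.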

23 HITS Example
[Figure: per-node score pairs on a small example graph over three iterations; panels for each of Iterations 1, 2, and 3: Input, Update Scores, Normalize Scores]
Slide 20 from Chapter 10 of “Search Engines: Information Retrieval in Practice”

24 Passes the "Looks Right" Test
(java) Authorities: Gamelan; JavaSoft Home Page; The Java Developer: How Do I...; The Java Book Pages; comp.lang.java FAQ
(censorship) Authorities: EFFweb - The Electronic Frontier Foundation; The Blue Ribbon Campaign for Online Free Speech; The Center for Democracy and Technology; Voters Telecommunications Watch; ACLU: American Civil Liberties Union
(“search engines”) Authorities: Yahoo!; Excite; Welcome to Magellan!; Lycos Home Page; AltaVista: Main Page
(Gates) Authorities: Bill Gates: The Road Ahead; Welcome to Microsoft; .440

25 Also Can Be Used to Cluster or Discover Communities
(jaguar*) Authorities: principal eigenvector [Atari video game]: .370 .347 .292 dlms/Consoles/jaguar.html Jaguar Page
(jaguar jaguars) Authorities: 2nd non-principal vector, positive end [NFL team]: Official Jacksonville Jaguars NFL Website; Jacksonville Jaguars Home Page; Brett’s Jaguar Page; Jacksonville Jaguars
(jaguar jaguars) Authorities: 3rd non-principal vector, positive end [Expensive car]: Jaguar Cars Global Home Page; The Jaguar Collection - Official Web site .211 .211

