Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn.


1 Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn

2 Searching the Web Goal: find pages relevant to a query. The basic text-based search algorithms retrieve pages that contain the query keywords. Improved searching algorithms can examine the link structure of the web to learn about the contents of web pages. This paper introduces an algorithm for identifying authoritative pages and hub pages.

3 Overview Issues in Searching Algorithm Overview Iterative Algorithm Wrap-up

4 Types of Queries Specific queries: information about the topic is scarce. Broad-topic queries: information about the topic is overabundant. We want to return the most ‘authoritative’ pages. Similar-page queries: find pages that are ‘like’ a given page. This paper examines broad-topic queries.

5 Complications with Text-based Search An authoritative page for a query may not contain the query terms. –Example: www.uh.edu contains neither ‘University’ nor ‘Houston’, and has ‘UH’ only six times. –Text may be in the form of images or Flash animations. A page might not be self-descriptive. –Example: Honda does not describe itself as an automobile manufacturer, and Google does not describe itself as a search engine.

6 Examining Link Structure The creator of a page p, by including a link to a page q, confers some authority on page q. How can we exploit this latent human judgment? Pitfall: Many links, such as navigational links and advertisement links, do not confer authority.

7 Exploiting Link Structure 1 An authoritative page must be popular. So, of all pages that contain the query terms, return those with the highest in-degree. Pitfall: Still misses authoritative pages that do not contain the query terms. Pitfall: Universally popular pages (like www.yahoo.com) will be considered highly authoritative for any query terms they contain.
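The in-degree heuristic on this slide can be sketched as follows. This is a minimal illustration, assuming the links are available as (source, target) pairs; the function name and the page names are hypothetical and do not come from the paper.

```python
from collections import Counter

def rank_by_in_degree(links, matching_pages, top_k=3):
    """Rank pages that contain the query terms by how many pages link to them.

    links: iterable of (source, target) pairs (assumed input format).
    matching_pages: set of pages returned by a text-based search.
    """
    # Count incoming links, ignoring self-links.
    in_degree = Counter(target for source, target in links if source != target)
    # Highest in-degree first; ties are broken by page name for determinism.
    ranked = sorted(matching_pages, key=lambda p: (-in_degree[p], p))
    return ranked[:top_k]

# Hypothetical toy data: "portal" is popular regardless of the query topic.
links = [
    ("a", "portal"), ("b", "portal"), ("c", "portal"),
    ("a", "maker"), ("b", "maker"),
    ("c", "fanpage"),
]
matching = {"portal", "maker", "fanpage"}
print(rank_by_in_degree(links, matching))  # ['portal', 'maker', 'fanpage']
```

In this toy data the universally popular page outranks the topical one, which is the second pitfall named on the slide.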

8 Exploiting Link Structure 2 Authoritative sources often do not link to other authoritative sources. –Examples: Toyota does not link to Honda, and Google does not link to Teoma. Other pages, which we call hub pages, link to multiple authoritative sources. –Example: Auto enthusiast websites linking to multiple manufacturers’ websites. The authoritative pages for a query share many hub pages.

9 Overview Issues in Searching Algorithm Overview Iterative Algorithm Wrap-up

10 Algorithm Overview For a query $\sigma$, start with a text-based search to generate an initial root set $R_\sigma$. Enlarge the root set to a base set $S_\sigma$. Identify authoritative pages and hub pages in $S_\sigma$. Return the most authoritative pages in $S_\sigma$.

11 Desiderata for $S_\sigma$ $S_\sigma$ should: (1) be relatively small, (2) be rich in relevant pages, and (3) contain most (or many) of the strongest authorities. $R_\sigma$ will satisfy 1 and 2, but not 3. Even the set of all pages that contain the query terms may not satisfy 3.

12 Enlarging $R_\sigma$ to $S_\sigma$ Pages in $R_\sigma$ may not be authoritative, but most authoritative pages are probably pointed to by at least one member of $R_\sigma$. Pages in $R_\sigma$ may not point to each other. Let $S_\sigma$ = $R_\sigma$ + all pages pointed to by pages of $R_\sigma$ + some pages that point to pages of $R_\sigma$. Use a heuristic to avoid navigation links. Kleinberg’s experiments had about 200 pages in $R_\sigma$ and about 1000 to 5000 pages in $S_\sigma$.
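The expansion from root set to base set can be sketched as below. This is only an illustration under stated assumptions: the link data is given as two hypothetical dictionaries (`out_links`, `in_links`), the cap `max_in_per_page` stands in for "some pages that point to pages of R", and dropping links between pages on the same host is just one possible heuristic for avoiding navigation links.

```python
from urllib.parse import urlparse

def build_base_set(root_set, out_links, in_links, max_in_per_page=50):
    """Grow a root set into a base set.

    out_links / in_links: dicts mapping a URL to the URLs it points to /
    is pointed to by (hypothetical input format).
    max_in_per_page: assumed cap on how many in-linking pages are added
    per root page, so very popular root pages do not blow up the set.
    """
    base = set(root_set)
    for page in root_set:
        # All pages that a root page points to.
        base.update(out_links.get(page, ()))
        # Only some of the pages that point to a root page.
        base.update(sorted(in_links.get(page, ()))[:max_in_per_page])
    return base

def drop_same_host_links(edges):
    """One possible navigation-link heuristic: keep only cross-host links."""
    return [(s, t) for s, t in edges if urlparse(s).netloc != urlparse(t).netloc]

# Hypothetical toy data.
root = {"http://a.example/1", "http://b.example/1"}
out_links = {"http://a.example/1": ["http://c.example/", "http://a.example/about"]}
in_links = {"http://b.example/1": ["http://d.example/", "http://e.example/", "http://f.example/"]}

base = build_base_set(root, out_links, in_links, max_in_per_page=2)
print(len(base))
```

Taking the first few in-linking pages in sorted order is an arbitrary choice made here for reproducibility; any subset of the allowed size would serve the same purpose.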

13 Identifying Hubs and Authorities Our set $S_\sigma$ still has the problem of non-authoritative pages of high in-degree. The authoritative pages are the popular pages that have a large overlap in the sets of pages that point to them. The hub pages are the pages that point to many of the authoritative pages.

14 Hubs and Authorities Picture [Figure: hubs pointing to authorities, alongside an unrelated page of large in-degree.]

15 Mutually Reinforcing Relationship Good hubs point to many good authorities. Good authorities are pointed to by many good hubs. Because each definition depends on the other, the weights are computed with an iterative algorithm.

16 Overview Issues in Searching Algorithm Overview Iterative Algorithm Wrap-up

17 Iterative Algorithm 1 For each page p, we associate a non-negative authority weight x(p) and a non-negative hub weight y(p). Values are normalized (in the paper, so that the squares of each kind of weight sum to 1). Larger values indicate better pages.

18 Iterative Algorithm 2 If p points to many pages with large x-values, then p receives a large y-value: $y(p) = \sum_{q : p \to q} x(q)$. If p is pointed to by many pages with large y-values, then p receives a large x-value: $x(p) = \sum_{q : q \to p} y(q)$.

19 Iterative Algorithm 3 We iterate and renormalize until values converge. Therefore, we need to prove convergence. The algorithm is a discrete-time evolution and can be written as multiplications of matrices and vectors. A result of linear algebra guarantees convergence of X and Y to the principal eigenvectors of $M^T M$ and $M M^T$, respectively.
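The iterate-and-renormalize loop can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: it assumes the link graph is a list of distinct (source, target) pairs, uses a fixed iteration count instead of a convergence test, and runs on a hypothetical toy graph in which "popular" has high in-degree but shares no hubs with the other authorities.

```python
import math

def normalize(weights):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {p: v / norm for p, v in weights.items()} if norm else dict(weights)

def hits(edges, iterations=50):
    """Return (authority, hub) weights for a list of distinct (source, target) links."""
    pages = sorted({p for edge in edges for p in edge})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: x(p) = sum of y(q) over pages q that point to p.
        new_auth = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: y(p) = sum of x(q) over pages q that p points to,
        # using the freshly updated authority weights.
        new_hub = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Hypothetical toy graph: h1, h2, h3 are topical hubs; u1..u3 are unrelated pages.
edges = [
    ("h1", "a1"), ("h1", "a2"),
    ("h2", "a1"), ("h2", "a2"),
    ("h3", "a2"),
    ("u1", "popular"), ("u2", "popular"), ("u3", "popular"),
]
auth, hub = hits(edges)
for page in sorted(auth, key=auth.get, reverse=True)[:3]:
    print(page, round(auth[page], 4))
```

Here "popular" and "a2" both have in-degree 3, yet the iteration pushes the weight of "popular" toward zero because the pages linking to it are not hubs for anything else.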

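20 Example: Mini Web Three pages X, Y, Z with link matrix $M$ (rows and columns labeled X, Y, Z; matrix entries as extracted from the slide: 011 100 111). With authority vector $A$ and hub vector $H$, the updates are $H_{i+1} = M A_i$ and $A_{i+1} = M^T H_i$. Composing the two updates multiplies the hub vector by $M M^T$ and the authority vector by $M^T M$: $H \leftarrow M M^T H$ and $A \leftarrow M^T M A$.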

21 Example [Table: weights of X, Y, and Z over iterations 0, 1, 2, 3, …] X is the best hub. Z is the most authoritative.

22 Overview Issues in Searching Algorithm Overview Iterative Algorithm Example Wrap-up

23 Notes to Consider In general, we don’t need to iterate to convergence. Paper contains a list of good results for various queries. After initial text-based search, the text was ignored in favor of the link structure.
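The first note can be illustrated with a small experiment: the ranking of pages often settles long before the weights themselves converge. The sketch below is an assumption-laden toy, with a hypothetical graph and function name, and it assumes a non-empty list of distinct links.

```python
import math

def authority_ranking(edges, iterations):
    """Rank pages by authority weight after a fixed number of update rounds."""
    pages = sorted({p for e in edges for p in e})
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # One round: authority update, then hub update, then renormalize both.
        auth = {p: sum(hub[s] for s, d in edges if d == p) for p in pages}
        hub = {p: sum(auth[d] for s, d in edges if s == p) for p in pages}
        scale_a = math.sqrt(sum(v * v for v in auth.values()))
        scale_h = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / scale_a for p, v in auth.items()}
        hub = {p: v / scale_h for p, v in hub.items()}
    # Highest authority first; ties broken by name for determinism.
    return sorted(pages, key=lambda p: (-auth[p], p))

# Hypothetical graph: three hubs each link to t1 and t2; "big" has the
# highest in-degree but is only linked from otherwise isolated pages.
edges = [(m, t) for m in ("m1", "m2", "m3") for t in ("t1", "t2")]
edges += [(s, "big") for s in ("s1", "s2", "s3", "s4")]

print(authority_ranking(edges, 1)[:3])   # ['big', 't1', 't2']
print(authority_ranking(edges, 2)[:3])   # ['t1', 't2', 'big']
print(authority_ranking(edges, 50)[:3])  # ['t1', 't2', 'big']
```

After one round the ranking is just in-degree order; from the second round on, the top three no longer change in this example, even though the weights keep shifting.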

24 Related Areas Similar-page queries. Connections with: –Social networks –Bibliometrics (citations) –Stand-alone hypertext environments –Clustering of link structures –Multiple sets of hubs and authorities –Diffusion and Generalization

25 Conclusion Influential paper with many citations. Published at about the same time as Google's PageRank algorithm. HITS: Hyperlink-Induced Topic Search. Clever (IBM). Basis of the Teoma search engine algorithm.

26 References Kleinberg, Jon. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, Vol. 46, No. 5, September 1999, pp. 604-632. The mini-web example comes from http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt

27 The End

