
1 Hyperlink Analysis: A Survey (In Progress)

2 Overview of This Talk
• Introduction to Hyperlink Analysis
• Classification of Hyperlink Analysis
• Two sub-topics:
  - Measures and Metrics
  - Interesting Web Structures

3 Definition of Hyperlink Analysis
Hyperlink analysis can be defined as the area of Web information retrieval that uses the hyperlink structure of the Web.

4 Motivation
Hyperlinks serve two main purposes:
• Pure navigation.
• Pointing to pages with authority* on the same topic as the page containing the link.
The second purpose can be used to retrieve useful information from the Web.
* authority: a set of ideas or statements supporting a topic

5 What Information Can Be Retrieved?
• Quality of a Web page.
  - The authority of a page on a topic.
  - Ranking of Web pages.
• Interesting Web structures.
  - Graph patterns such as co-citation, social choice, complete bipartite graphs, etc.
• Web page classification.
  - Classifying Web pages according to various topics.

6 What Information Can Be Retrieved? (Cont.)
• Which pages to crawl.
  - Deciding which Web pages to add to the collection of Web pages.
• Finding related pages.
  - Given one relevant page, find all related pages.
• Detection of duplicated pages.
  - Detection of near-mirror sites to eliminate duplication.

7 Classification of Hyperlink Analysis Research
Hyperlink Analysis
• Measures and Metrics
• Interesting Web Structures
• Web Page Classification
• Web Search
(Still needs to be refined. Suggestions welcome.)

8 Measures/Metrics
• Standards for measuring properties of a page or a Web structure.
  - Quality of a page.
  - Distance between pages.
  - Web page reputation.

9 PageRank Citation Ranking [1]
Aim
• Ranking metric for hypertext documents.
Approach
• A page has a high rank if the sum of the ranks of its backlinks is high.

10 Authoritative Sources in a Hyperlinked Environment [3]
Aim
• Determining the relative "authority" of pages.
Approach
• A good authority page is one pointed to by many good hubs.
• A good hub page is one that points to many good authorities.
Results
• Efficient when the query topic is sufficiently "broad".
Benefits
• Locating dense bipartite communities.

11 Does "Authority" Mean Quality? [4]
Aim
• Are any of the metrics we compute for Web documents good predictors of document quality?
Approach
• Do experts agree in their quality judgments?
• Are different link-based metrics different?
  - In-degree, PageRank and Authority.
• Can we predict human quality judgments?
  - Compute correlations between each pair of metrics and compare them with expert judgment.

12 Does "Authority" Mean Quality? [4]
Results
• Experts agree on the nature of quality within a topic.
• No significant difference between link-based metrics.
• In-degree performed as well as PageRank and Authority.

13 Web Page Reputations [5]
Aim
• Input: a URL. Output: a ranked set of topics for which the page has a reputation.
Approach
• A page can acquire a high reputation on a topic because it is pointed to by many pages on that topic, or because it is pointed to by some high-reputation pages on that topic.
• A page is deemed an authority on the topic if it is pointed to by good hubs on the topic, and a good hub is one that points to good authorities.

14 One-Level Influence Propagation
The reputation of page p on topic t is the probability that a random surfer looking for topic t will visit page p.
At each step, the surfer:
• with probability d > 0 jumps to a random page, or
• with probability (1 - d) follows a random link from the current page.
The resulting reputation value is defined piecewise: one case applies if term t appears in page p, the other otherwise.
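
The random-surfer model on this slide can be sketched as a simple iteration. This is a minimal illustration, not the formula from [5]: it assumes that the random jump lands only on pages containing term t (which is what the "if term t appears in page p / otherwise" case split suggests), that each such page receives d / N_t where N_t is the number of pages containing t, and that pages without out-links simply lose their surfer mass. The names `one_level_reputation`, `links` and `term_pages` are hypothetical.

```python
def one_level_reputation(links, term_pages, d=0.15, iterations=50):
    """Sketch of one-level influence propagation for a single term t.

    links maps each page to the list of pages it links to.
    term_pages is the set of pages in which term t appears (assumed non-empty).
    """
    term_pages = set(term_pages)
    if not term_pages:
        raise ValueError("term_pages must not be empty")
    pages = sorted(set(links)
                   | {q for targets in links.values() for q in targets}
                   | term_pages)
    n_t = len(term_pages)
    # Assumption: start with the surfer spread evenly over pages containing t.
    rep = {p: (1.0 / n_t if p in term_pages else 0.0) for p in pages}
    for _ in range(iterations):
        # Case "term t appears in page p": the page receives the jump share d / N_t.
        # Case "otherwise": the page receives only what flows in over links.
        new_rep = {p: (d / n_t if p in term_pages else 0.0) for p in pages}
        for q, targets in links.items():
            if targets:
                share = (1 - d) * rep[q] / len(targets)
                for p in targets:
                    new_rep[p] += share
        rep = new_rep
    return rep


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    print(one_level_reputation(graph, term_pages={"a", "b"}))
```

In the toy graph, page c does not contain the term but still earns a reputation on it, because both pages that do contain the term link to c.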

15 Two-Level Influence Propagation
At each step, the surfer:
• with probability d > 0 jumps to a random page that contains term t, or
• with probability (1 - d) follows a random link forward or backward from the current page, alternating directions.
Authority reputation of a page p on a topic t is the probability that a random surfer looking for topic t makes a forward visit to page p.
Hub reputation of a page p on a topic t is the probability that a random surfer looking for topic t makes a backward visit to page p.

16 Two-Level Influence Propagation
A(p,t) and H(p,t) are each defined piecewise: one case applies if term t appears in page p, the other otherwise.
• A(p,t) = probability of a forward visit to page p when searching for term t = authority rank of page p on term t
• H(p,t) = probability of a backward visit to page p when searching for term t = hub rank of page p on term t

17 Factors Affecting Page Reputation
• How well a topic is represented.
• How well pages on a topic are connected.

18 Link Analysis and Stability [6]
Aim
• When to expect stable rankings under small perturbations to hyperlink patterns.
Approach
• The eigengap directly affects the stability of the eigenvectors in the HITS algorithm.
• Coupled Markov chain theory (?).
• As long as the perturbed Web pages did not have high overall PageRank scores, the perturbed PageRank scores will not be far from the original.
Result
• HITS: unstable; PageRank: stable.

19 Stable Algorithms [7]
Aim
• Stable link analysis methods.
Approach
• Randomized HITS
  - Merging the hubs-and-authorities notion with the "reset" mechanism from PageRank.
• Subspace HITS
  - Combining multiple eigenvectors from HITS to yield aggregate authority scores.
Results
• Both approaches are more stable than HITS; the latter is a little worse than PageRank.

20 Average Clicks [8]
Aim
• A new definition of the distance between two pages.
Approach
• Based on the probability of clicking a link during random surfing.
Benefit
• A good justification for fetching neighboring pages in practical search.
Result
• Distance by average clicks seems to fit well intuitively.
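
The slide gives no formula, so the following is only a sketch of one way a click-probability distance can work. It assumes that the length of a link leaving page p is the negative logarithm of the probability of clicking it, taken here as alpha / outdegree(p), and that the distance between two pages is the length of the shortest path. The damping value `alpha`, the logarithm `base` and all function names are hypothetical choices for illustration.

```python
import heapq
import math


def link_length(out_degree, alpha=0.85, base=7):
    """Length of one link on a page with out_degree links (assumed definition)."""
    return -math.log(alpha / out_degree, base)


def average_clicks(links, source, target, alpha=0.85, base=7):
    """Shortest-path distance from source to target using Dijkstra's algorithm.

    links maps each page to the list of pages it links to.
    Returns math.inf if target cannot be reached.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, page = heapq.heappop(queue)
        if page == target:
            return dist
        if dist > best.get(page, math.inf):
            continue  # stale queue entry
        targets = links.get(page, [])
        if not targets:
            continue
        # Every link on the same page has the same length.
        step = link_length(len(targets), alpha, base)
        for nxt in targets:
            candidate = dist + step
            if candidate < best.get(nxt, math.inf):
                best[nxt] = candidate
                heapq.heappush(queue, (candidate, nxt))
    return math.inf


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e", "f"]}
    print(average_clicks(web, "a", "d"))
```

Under this definition a link on a page crowded with links is "longer" than a link on a sparse page, because any single link is less likely to be clicked. Link lengths are never negative as long as alpha is at most 1, which is what makes Dijkstra's algorithm applicable.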

21 Interesting Web Structures
Analyzing interesting graph patterns or Web structures.
• Helpful in identifying "Web communities".

22 Interesting Web Structures [11]
• Endorsement
• Mutual Reinforcement
• Co-Citation
• Social Choice
• Transitive Endorsement

23 Interesting Web Structures [11]
• Directed complete bipartite graph
• NK-clan (shown with N=2, K=10)
An NK-clan is a set of K nodes in which there is a path of length N or less (ignoring edge directions) between every pair of nodes.
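
The NK-clan definition translates directly into a breadth-first search over the undirected version of the graph. The sketch below is illustrative only; it assumes that connecting paths may pass through any node of the graph, not just members of the candidate set, since the slide does not say otherwise. The function name `is_nk_clan` and its parameters are hypothetical.

```python
from collections import deque


def is_nk_clan(edges, nodes, n):
    """Check whether `nodes` forms an NK-clan with K = len(nodes).

    edges is an iterable of (source, target) pairs; directions are ignored.
    n is the maximum allowed path length between any pair of nodes.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    def reachable_within(start):
        # Breadth-first search that stops expanding at depth n.
        depth = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if depth[u] == n:
                continue
            for v in adjacency.get(u, ()):
                if v not in depth:
                    depth[v] = depth[u] + 1
                    queue.append(v)
        return set(depth)

    nodes = set(nodes)
    return all(nodes <= reachable_within(u) for u in nodes)


if __name__ == "__main__":
    star = [("c", "a"), ("b", "c"), ("c", "d")]
    print(is_nk_clan(star, {"a", "b", "d"}, n=2))  # leaves of a star, two steps apart
    print(is_nk_clan(star, {"a", "b", "d"}, n=1))
```

The leaves of a star are pairwise two steps apart through the center, so they form a clan for N=2 but not for N=1, regardless of which way the edges point.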

24 Interesting Web Structures [11]
• In-Tree
• Out-Tree

25 Interesting Web Structures: Web Communities

26 Friends and Neighbors [9]
Aim
• Techniques to mine information in order to predict relationships between individuals.
Approach
• Similarity measured by analyzing text, in-links, out-links and mailing lists.
Result
• In-links were "good" predictors.

27 References
[1] S. Brin and L. Page (1998), The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, available at http://www-db.stanford.edu/~backrub/pageranksub.ps, January 1998.
[2] T. Haveliwala (1999), Efficient Computation of PageRank, Technical Report, Stanford University, CA.
[3] J. M. Kleinberg (1998), Authoritative Sources in a Hyperlinked Environment.

28 References (contd.)
[4] B. Amento, L. Terveen, and W. Hill (2000), Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents, ACM, 2000.
[5] D. Rafiei and A. O. Mendelzon (2000), What is this Page Known for? Computing Web Page Reputations, Proceedings of the Ninth International WWW Conference.

29 References (contd.)
[6] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Link Analysis, Eigenvectors and Stability, IJCAI-01.
[7] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Stable Algorithms for Link Analysis, Proc. 24th International Conference on Research and Development in Information Retrieval (SIGIR), 2001.
[8] Y. Matsuo, Y. Ohsawa, and M. Ishizuka (2001), Average-Clicks: A New Measure of Distance on the WWW, WI-2001, 2001.

30 References (contd.)
[9] L. A. Adamic and E. Adar (2000), Friends and Neighbors on the Web, Xerox Palo Alto Research Center, Palo Alto, CA 94304.
[10] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas (2000), Finding Authorities and Hubs From Link Structures on the World Wide Web, WWW10 Proceedings.

31 References (contd.)
[11] K. Efe, V. Raghavan, C. H. Chu, A. L. Broadwater, L. Bolelli, and S. Ertekin (2000), The Shape of the Web and Its Implications for Searching the Web, International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, Proceedings at http://www.ssgrr.it/en/ssgrr2000/proceedings.htm, Rome, Italy, Jul.-Aug. 2000.
[12] M. Henzinger, Link Analysis in Web Information Retrieval, ICDE Bulletin, Sept. 2000, Vol. 23, No. 3.

32 PageRank Approach
The PageRank of a page p is
$$PR(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in G} \frac{PR(q)}{\mathrm{outdegree}(q)}$$
where
• d is the damping factor (the probability that a page is chosen uniformly at random from all pages),
• n is the number of nodes in graph G,
• outdegree(q) is the number of edges leaving page q.
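
With d read as the probability of a uniform random jump, each page receives d / n plus a (1 - d) share of the rank of every page linking to it, divided by that page's outdegree. A minimal power-iteration sketch of this follows. The default d = 0.15, the fixed iteration count and the handling of pages without out-links (their rank is spread evenly over all pages) are assumptions made here, not details taken from the slide.

```python
def pagerank(links, d=0.15, iterations=50):
    """Power-iteration sketch of PageRank.

    links maps each page to the list of pages it links to.
    d is the probability of jumping to a page chosen uniformly at random.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts with the random-jump share d / n.
        new_rank = {p: d / n for p in pages}
        for q in pages:
            targets = links.get(q, [])
            if targets:
                # q passes (1 - d) * rank(q) / outdegree(q) along each out-link.
                share = (1 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = (1 - d) * rank[q] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

In the example graph, page c is linked from both a and b, so it ranks above b, which is linked only from a.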

33 HITS Approach
Let z denote the vector (1, 1, 1, …, 1).
Initially set x ← z and y ← z.
For i = 1, 2, 3, …
• Apply the I operation.
• Apply the O operation.
• Normalize x and y.
The sequence of (x, y) pairs produced converges to a limit (x*, y*).
Return (x*, y*) as the authority and hub weights.
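
The slide does not spell out the I and O operations. In Kleinberg's formulation [3], the I operation sets each authority weight x_p to the sum of the hub weights y_q of the pages q that link to p, and the O operation sets each hub weight y_p to the sum of the authority weights x_q of the pages q that p links to. The sketch below follows the loop above under those definitions; the fixed iteration count and the choice of Euclidean normalization are assumptions, and the function names are hypothetical.

```python
import math


def normalize(weights):
    """Scale weights so that their squares sum to 1 (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: v / norm for p, v in weights.items()}


def hits(links, iterations=50):
    """Return (authority, hub) weights for the graph given by links.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    x = {p: 1.0 for p in pages}  # authority weights, initially z
    y = {p: 1.0 for p in pages}  # hub weights, initially z
    for _ in range(iterations):
        # I operation: authority of p = sum of hub weights of pages pointing to p.
        new_x = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_x[p] += y[q]
        # O operation: hub weight of p = sum of (updated) authority weights
        # of the pages that p points to.
        new_y = {p: sum(new_x[q] for q in links.get(p, [])) for p in pages}
        x, y = normalize(new_x), normalize(new_y)
    return x, y


if __name__ == "__main__":
    authority, hub = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
    print(authority)
    print(hub)
```

In the example, a1 is pointed to by both hubs and so outranks a2 as an authority, while h1 points to both authorities and so outranks h2 as a hub.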

34 Friends and Neighbors: Predicting Friendship
Items that are unique to a few users are weighted more than commonly occurring items (log is the natural logarithm):
• 2 people mention the item: weight = 1/log(2) = 1.4
• 5 people mention the item: weight = 1/log(5) = 0.62
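
The two weights above match the natural logarithm, and a similarity score between two people can be sketched by summing these weights over the items they share. The summation over shared items and the names `item_weight`, `similarity` and `frequency` are assumptions for illustration; the slide itself states only the per-item weight.

```python
import math


def item_weight(num_people):
    """Weight of an item mentioned by num_people users (num_people must be >= 2)."""
    return 1.0 / math.log(num_people)


def similarity(items_a, items_b, frequency):
    """Sum the weights of the items two people share.

    frequency maps each item to the number of people who mention it.
    """
    shared = set(items_a) & set(items_b)
    return sum(item_weight(frequency[item]) for item in shared)


if __name__ == "__main__":
    print(round(item_weight(2), 2))
    print(round(item_weight(5), 2))
    frequency = {"chess club": 2, "university homepage": 5}
    alice = ["chess club", "university homepage"]
    bob = ["chess club", "university homepage"]
    print(round(similarity(alice, bob, frequency), 2))
```

An item shared with only one other person contributes more than twice as much as an item mentioned by five people, so rare common interests dominate the score.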

