
1 Hyperlink Analysis: A Survey (In Progress)

2 Overview of This Talk
• Introduction to Hyperlink Analysis
• Classification of Hyperlink Analysis
• Two sub-topics: Measures and Metrics; Interesting Web Structures

3 Definition of Hyperlink Analysis
Hyperlink analysis can be defined as the area of Web information retrieval that uses the hyperlink structure of the Web.

4 Motivation
Hyperlinks serve two main purposes:
• Pure navigation.
• Pointing to pages with authority* on the same topic as the page containing the link.
The second purpose can be used to retrieve useful information from the Web.
* Authority: a set of ideas or statements supporting a topic.

5 What Information Can Be Retrieved?
• Quality of a Web page.
  - The authority of a page on a topic.
  - Ranking of Web pages.
• Interesting Web structures.
  - Graph patterns such as co-citation, social choice, complete bipartite graphs, etc.
• Web page classification.
  - Classifying Web pages according to various topics.

6 What Information Can Be Retrieved? (Cont.)
• Which pages to crawl.
  - Deciding which Web pages to add to the collection of Web pages.
• Finding related pages.
  - Given one relevant page, find all related pages.
• Detection of duplicated pages.
  - Detection of near-mirror sites to eliminate duplication.

7 Classification of Hyperlink Analysis Research
Hyperlink Analysis divides into:
• Measures and Metrics
• Interesting Web Structures
• Web Page Classification
• Web Search
(Still needs to be refined. Suggestions welcome.)

8 Measures/Metrics
Standards for measuring properties of a page or a Web structure:
• Quality of a page.
• Distance between pages.
• Web page reputation.

9 PageRank Citation Ranking [1]
Aim
• A ranking metric for hypertext documents.
Approach
• A page has a high rank if the sum of the ranks of its backlinks is high.
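The backlink idea can be sketched as a simple power iteration. This is a minimal illustration, not the implementation from [1]: it assumes the convention stated on slide 32, where d is the probability of jumping to a uniformly random page, and the function name `pagerank`, the default d = 0.15, the fixed iteration count, and the uniform handling of pages without out-links are all assumptions made for the example.

```python
def pagerank(links, d=0.15, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    d is the random-jump probability (assumed convention, see slide 32).
    """
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share d / n.
        new_rank = {p: d / n for p in pages}
        for q in pages:
            targets = links.get(q, [])
            if targets:
                # q passes an equal share of its rank along each out-link.
                share = (1 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
            else:
                # Assumption: a page with no out-links spreads its rank uniformly.
                share = (1 - d) * rank[q] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(example).items()):
        print(page, round(score, 3))
```

In the toy graph, page c has two backlinks and page b has one, so c ends up ranked above b.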

10 Authoritative Sources in a Hyperlinked Environment [3]
Aim
• Determining the relative "authority" of pages.
Approach
• A good authority page is one pointed to by many good hubs.
• A good hub page is one that points to many good authorities.
Results
• Efficient when the query topic is sufficiently "broad".
Benefits
• Locating dense bipartite communities.

11 Does "Authority" Mean Quality? [4]
Aim
• Are any of the metrics we compute for Web documents good predictors of document quality?
Approach
• Do experts agree in their quality judgments?
• Are different link-based metrics different?
  - In-degree, PageRank and Authority.
• Can we predict human quality judgments?
Compute correlations between each pair of metrics and compare them with expert judgment.

12 Does "Authority" Mean Quality? [4]
Results
• Experts agree on the nature of quality within a topic.
• No significant difference between link-based metrics.
• In-degree performed as well as PageRank and Authority.

13 Web Page Reputations [5]
Aim
• Input: a URL. Output: a ranked set of topics for which the page has a reputation.
Approach
• A page can acquire a high reputation on a topic because it is pointed to by many pages on that topic, or because it is pointed to by some high-reputation pages on that topic.
• A page is deemed an authority on a topic if it is pointed to by good hubs on the topic, and a good hub is one that points to good authorities.

14 One-level Influence Propagation
The reputation of a page p on a topic t is the probability that a random surfer looking for topic t will visit page p.
At each step:
• with probability d > 0, jump to a random page, or
• with probability (1 - d), follow a random link from the current page.
[Equation missing from the transcript: a piecewise definition with one case "if term t appears in page p" and another "otherwise".]
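The equation for this slide did not survive; only its two case labels remain. A plausible reconstruction, assuming the one-level formulation in [5], is given here. The symbols R^n(p,t) (reputation after n steps), N_t (number of pages containing term t) and O(q) (out-degree of q) are assumptions introduced for this sketch, and the case split suggests that the random jump lands only on pages containing t.

```latex
R^{n}(p,t) =
\begin{cases}
  (1-d) \displaystyle\sum_{q \to p} \frac{R^{n-1}(q,t)}{O(q)} + \frac{d}{N_t}
    & \text{if term } t \text{ appears in page } p, \\[2ex]
  (1-d) \displaystyle\sum_{q \to p} \frac{R^{n-1}(q,t)}{O(q)}
    & \text{otherwise.}
\end{cases}
```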

15 Two-Level Influence Propagation
At each step:
• with probability d > 0, jump to a random page that contains term t, or
• with probability (1 - d), follow a random link forward/backward from the current page, alternating directions.
Authority reputation of a page p on a topic t: the probability that a random surfer looking for topic t makes a forward visit to page p.
Hub reputation of a page p on a topic t: the probability that a random surfer looking for topic t makes a backward visit to page p.

16 Two-Level Influence Propagation
[Two equations missing from the transcript: piecewise definitions of A(p,t) and H(p,t), each with one case "if term t appears in page p" and another "otherwise".]
A(p,t) = probability of a forward visit to page p when searching for term t = authority rank of page p on term t.
H(p,t) = probability of a backward visit to page p when searching for term t = hub rank of page p on term t.
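The two missing equations can be sketched as follows, assuming the two-level formulation in [5]. As before, N_t (number of pages containing term t), O(q) (out-degree of q) and I(q) (in-degree of q) are symbols assumed for this sketch; the factor 1/2 reflects the assumption that a jump is equally likely to start a forward or a backward step.

```latex
A^{n}(p,t) =
\begin{cases}
  (1-d) \displaystyle\sum_{q \to p} \frac{H^{n-1}(q,t)}{O(q)} + \frac{d}{2N_t}
    & \text{if term } t \text{ appears in page } p, \\[2ex]
  (1-d) \displaystyle\sum_{q \to p} \frac{H^{n-1}(q,t)}{O(q)}
    & \text{otherwise,}
\end{cases}
\qquad
H^{n}(p,t) =
\begin{cases}
  (1-d) \displaystyle\sum_{p \to q} \frac{A^{n-1}(q,t)}{I(q)} + \frac{d}{2N_t}
    & \text{if term } t \text{ appears in page } p, \\[2ex]
  (1-d) \displaystyle\sum_{p \to q} \frac{A^{n-1}(q,t)}{I(q)}
    & \text{otherwise.}
\end{cases}
```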

17 Factors Affecting Page Reputation
• How well a topic is represented.
• How well pages on a topic are connected.

18 Link Analysis and Stability [6]
Aim
• When to expect stable rankings under small perturbations to hyperlink patterns.
Approach
• The eigengap directly affects the stability of the eigenvectors in the HITS algorithm.
• Coupled Markov chain theory(?).
• As long as the perturbed Web pages did not have high overall PageRank scores, the perturbed PageRank scores will not be far from the original ones.
Result
• HITS: unstable; PageRank: stable.

19 Stable Algorithms [7]
Aim
• Stable link analysis methods.
Approach
• Randomized HITS: merges the hubs-and-authorities notion with the "reset" mechanism from PageRank.
• Subspace HITS: combines multiple eigenvectors from HITS to yield aggregate authority scores.
Results
• Both approaches are more stable than HITS; the latter is a little worse than PageRank.

20 Average Clicks [8]
Aim
• A new definition of distance between two pages.
Approach
• Based on the probability of clicking a link during random surfing.
Benefit
• A good justification, in practical search, for fetching neighboring pages.
Result
• Distance by average clicks seems to fit intuition well.

21 Interesting Web Structures
Analyzing interesting graph patterns or Web structures.
• Helpful in the identification of "Web communities".

22 Interesting Web Structures [11]
[Figure: example graph patterns]
• Endorsement
• Mutual Reinforcement
• Co-Citation
• Social Choice
• Transitive Endorsement

23 Interesting Web Structures [11]
[Figure: a directed complete bipartite graph, and an NK-clan with N = 2, K = 10]
An NK-clan is a set of K nodes in which there is a path of length N or less (ignoring edge directions) between every pair of nodes.
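The NK-clan definition can be checked mechanically with a depth-limited breadth-first search. The sketch below is only an illustration of the definition as worded on the slide: the function name `is_nk_clan` and the edge-list input are hypothetical, and it assumes that connecting paths may pass through any node of the graph, not only through members of the set (the slide does not say which is intended).

```python
from collections import deque


def is_nk_clan(edges, nodes, n):
    """Return True if every pair in `nodes` is joined by a path of length <= n.

    edges is an iterable of (source, target) hyperlinks; directions are ignored.
    Assumption: paths may pass through nodes outside `nodes`.
    """
    # Build an undirected adjacency map from the directed links.
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    def reachable_within(source, limit):
        # Breadth-first search that stops expanding at depth `limit`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if distance[u] == limit:
                continue
            for w in adjacency.get(u, ()):
                if w not in distance:
                    distance[w] = distance[u] + 1
                    queue.append(w)
        return set(distance)

    members = set(nodes)
    return all(members <= reachable_within(u, n) for u in members)


if __name__ == "__main__":
    chain = [("a", "b"), ("c", "b"), ("c", "d")]  # a - b - c - d, ignoring direction
    print(is_nk_clan(chain, {"a", "b", "c"}, 2))
    print(is_nk_clan(chain, {"a", "b", "c", "d"}, 2))
```

K is simply the size of the node set passed in, so the slide's example corresponds to a call with ten nodes and n = 2.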

24 Interesting Web Structures [11]
[Figure: an in-tree and an out-tree]

25 Interesting Web Structures
[Figure: Web communities]

26 Friends and Neighbors [9]
Aim
• Techniques to mine information in order to predict relationships between individuals.
Approach
• Similarity is measured by analyzing text, in-links, out-links and mailing lists.
Result
• In-links were "good" predictors.

27 References
[1] S. Brin and L. Page (1998), The PageRank Citation Ranking: Bringing Order to the Web. Technical report, available at db.stanford.edu/~backrub/pageranksub.ps, January
[2] T. Haveliwala (1999), Efficient Computation of PageRank. Technical report, Stanford University, CA
[3] J. M. Kleinberg (1998), Authoritative Sources in a Hyperlinked Environment

28 References (contd.)
[4] B. Amento, L. Terveen, and W. Hill (2000), Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents (ACM 2000)
[5] D. Rafiei and A. O. Mendelzon (2000), What is this Page Known for? Computing Web Page Reputations, Proceedings of the Ninth International WWW Conference

29 References (contd.)
[6] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Link Analysis, Eigenvectors and Stability, IJCAI-01.
[7] A. Y. Ng, A. X. Zheng, and M. I. Jordan (2001), Stable Algorithms for Link Analysis, Proc. 24th International Conference on Research and Development in Information Retrieval (SIGIR),
[8] Y. Matsuo, Y. Ohsawa, and M. Ishizuka (2001), Average-clicks: A New Measure of Distance on the WWW, WI-2001.

30 References (contd.)
[9] L. A. Adamic and E. Adar (2000), Friends and Neighbors on the Web, Xerox Palo Alto Research Center, Palo Alto, CA
[10] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas (2000), Finding Authorities and Hubs From Link Structures on the World Wide Web, WWW10 Proceedings.

31 References (contd.)
[11] K. Efe, V. Raghavan, C. H. Chu, A. L. Broadwater, L. Bolelli, and S. Ertekin (2000), The Shape of the Web and Its Implications for Searching the Web, International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, Proceedings at Rome, Italy, Jul.-Aug
[12] M. Henzinger (2000), Link Analysis in Web Information Retrieval, ICDE Bulletin, Sept 2000, Vol. 23, No. 3

32 PageRank Approach
[Equation missing from the transcript: the PageRank of a page p.]
• d is the damping factor (the probability that a page is chosen uniformly at random from all pages).
• n is the number of nodes in graph G.
• outdegree(q) is the number of edges leaving page q.
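The equation these symbols belong to is missing from the transcript. A reconstruction consistent with the three definitions on this slide is the usual random-surfer form, shown here as an assumption rather than as the slide's original; the notation PR(p) and the edge-set notation (q, p) ∈ G are introduced for the sketch.

```latex
PR(p) = \frac{d}{n} + (1-d) \sum_{(q,p) \in G} \frac{PR(q)}{\mathrm{outdegree}(q)}
```

With this convention, d/n is the chance of arriving at p by a random jump, and each page q linking to p contributes its own rank divided evenly among its out-links.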

33 HITS Approach
Let z denote the vector (1, 1, 1, ..., 1).
Initially set x ← z and y ← z.
For i = 1, 2, 3, ...:
• Apply the I operation.
• Apply the O operation.
• Normalize x and y.
The sequence of (x, y) pairs produced converges to a limit (x*, y*).
Return (x*, y*) as the authority and hub weights.
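The loop above can be sketched in a few lines. The slide does not define the I and O operations, so this example assumes the definitions from [3]: I sets each authority weight x[p] to the sum of the hub weights of the pages linking to p, and O sets each hub weight y[p] to the sum of the authority weights of the pages p links to. The function names, the fixed iteration count, and the choice of normalizing to unit Euclidean length are assumptions for the illustration.

```python
import math


def _normalize(weights):
    """Scale weights so their squares sum to 1 (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: v / norm for page, v in weights.items()}


def hits(links, iterations=50):
    """Return (authority, hub) weights for a graph given as page -> linked pages."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    x = {p: 1.0 for p in pages}  # authority weights, initialized to z
    y = {p: 1.0 for p in pages}  # hub weights, initialized to z
    for _ in range(iterations):
        # I operation (assumed from [3]): authority = sum of hub weights of in-neighbors.
        x = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                x[p] += y[q]
        # O operation (assumed from [3]): hub = sum of authority weights of out-neighbors.
        y = {p: sum(x[q] for q in links.get(p, [])) for p in pages}
        x, y = _normalize(x), _normalize(y)
    return x, y


if __name__ == "__main__":
    authority, hub = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
    print(sorted(authority.items()))
    print(sorted(hub.items()))
```

In the toy graph, a1 is pointed to by both hubs and so receives the larger authority weight, while h1 points to both authorities and receives the larger hub weight.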

34 Friends and Neighbors: Predicting Friendship
Items that are unique to a few users are weighted more than commonly occurring items:
• 2 people mention an item: weight = 1/log(2) = 1.4
• 5 people mention an item: weight = 1/log(5) = 0.62
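The weighting rule can be written directly from the two examples, which match the natural logarithm. The sketch below is a hedged illustration: the function names `item_weight` and `similarity` and the `frequency` mapping are hypothetical, and summing the weights of shared items into one score is an assumption about how the weights are combined.

```python
import math


def item_weight(num_people):
    """Weight of an item mentioned by num_people users: 1 / ln(num_people)."""
    if num_people < 2:
        # A shared item is mentioned by at least two people; ln(1) = 0 would divide by zero.
        raise ValueError("an item must be mentioned by at least two people")
    return 1.0 / math.log(num_people)


def similarity(items_a, items_b, frequency):
    """Sum the weights of items shared by two users (assumed combination rule).

    frequency maps each item to the number of people who mention it.
    """
    shared = set(items_a) & set(items_b)
    return sum(item_weight(frequency[item]) for item in shared)


if __name__ == "__main__":
    frequency = {"rare-link": 2, "common-link": 5}
    print(round(item_weight(2), 1), round(item_weight(5), 2))
    print(similarity({"rare-link", "common-link"}, {"rare-link"}, frequency))
```

Under this rule a link shared by only two people counts more than twice as much as one shared by five.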