1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.

Slides:



Advertisements
Similar presentations
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Advertisements

Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
CSE 5243 (AU 14) Graph Basics and a Gentle Introduction to PageRank 1.
1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Web Search – Summer Term 2006 VI. Web Search - Ranking (cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01.
The PageRank Citation Ranking “Bringing Order to the Web”
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
Authoritative Sources in a Hyperlinked Environment By: Jon M. Kleinberg Presented by: Yemin Shi CS-572 June
Presented By: Wang Hao March 8 th, 2011 The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Link Structure and Web Mining Shuying Wang
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Presented by Zheng Zhao Originally designed by Soumya Sanyal
Link Analysis HITS Algorithm PageRank Algorithm.
More Algorithms for Trees and Graphs Eric Roberts CS 106B March 11, 2013.
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
PRESENTED BY ASHISH CHAWLA AND VINIT ASHER The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page and Sergey Brin, Stanford University.
The PageRank Citation Ranking: Bringing Order to the Web Larry Page etc. Stanford University, Technical Report 1998 Presented by: Ratiya Komalarachun.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Presented By: - Chandrika B N
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
Google’s Billion Dollar Eigenvector Gerald Kruse, PhD. John ‘54 and Irene ‘58 Dale Professor of MA, CS and I T Interim Assistant Provost Juniata.
1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.
CSM06 Information Retrieval Lecture 4: Web IR part 1 Dr Andrew Salway
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
Presented by, Lokesh Chikkakempanna Authoritative Sources in a Hyperlinked environment.
Web Mining Class Nam Hoai Nguyen Hiep Tuan Nguyen Tri Survey on Web Structure Mining
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
Overview of Web Ranking Algorithms: HITS and PageRank
Adaptive On-Line Page Importance Computation Serge, Mihai, Gregory Presented By Liang Tian 7/13/2010 1Adaptive On-Line Page Importance Computation.
Keyword Search in Databases using PageRank By Michael Sirivianos April 11, 2003.
Search Engines1 Searching the Web Web is vast. Information is scattered around and changing fast. Anyone can publish on the web. Two issues web users have.
Lecture #10 PageRank CS492 Special Topics in Computer Science: Distributed Algorithms and Systems.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
CompSci 100E 4.1 Google’s PageRank web site xxx web site yyyy web site a b c d e f g web site pdq pdq.. web site yyyy web site a b c d e f g web site xxx.
Importance Measures on Nodes Lecture 2 Srinivasan Parthasarathy 1.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
A Sublinear Time Algorithm for PageRank Computations CHRISTIA N BORGS MICHAEL BRAUTBA R JENNIFER CHAYES SHANG- HUA TENG.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
The PageRank Citation Ranking: Bringing Order to the Web
The PageRank Citation Ranking: Bringing Order to the Web
DATA MINING Introductory and Advanced Topics Part III – Web Mining
HITS Hypertext-Induced Topic Selection
Lecture #11 PageRank (II)
Link-Based Ranking Seminar Social Media Mining University UC3M
CSE 454 Advanced Internet Systems University of Washington
CSE 454 Advanced Internet Systems University of Washington
The Anatomy of a Large-Scale Hypertextual Web Search Engine
A Comparative Study of Link Analysis Algorithms
Lecture 22 SVD, Eigenvector, and Web Search
CSE 454 Advanced Internet Systems University of Washington
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CSE 454 Advanced Internet Systems University of Washington
Information retrieval and PageRank
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Lecture 22 SVD, Eigenvector, and Web Search
COMP5331 Web databases Prepared by Raymond Wong
Presentation transcript:

1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg

2 PageRank & HITS: Bring Order to the Web Background and Introduction Approach – PageRank Approach – Authorities & Hubs

Motivation and Introduction Why is Page Importance Rating important? New challenges for information retrieval on the World Wide Web. Huge number of web pages: 150 million by billion by 2008 Diversity of web pages: different topics, different quality, etc. Hard to imagine no ranking algorithms in search engine. December 22, 2015Data Mining: Concepts and Techniques 3

Motivation and Introduction Hard to imagine no ranking algorithms in search engine. December 22, 2015Data Mining: Concepts and Techniques 4

Motivation and Introduction Modern search engines may return millions of pages for a single query. This amount is prohibitive to preview for human users. Ranking algorithms will process the search results and only show the most useful information to the search engine user. December 22, 2015Data Mining: Concepts and Techniques 5

Motivation and Introduction December 22, 2015Data Mining: Concepts and Techniques 6

PageRank: History PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin. It is first as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in Shortly after, Page and Brin founded Google. December 22, 2015Data Mining: Concepts and Techniques 7

Link Structure of the Web 150 million web pages  1.7 billion links December 22, 2015Data Mining: Concepts and Techniques 8 Backlinks and Forward links:  A and B are C’s backlinks  C is A and B’s forward link Intuitively, a webpage is important if it has a lot of backlinks. What if a webpage has only one link off

PageRank: A Simplified Version December 22, 2015Data Mining: Concepts and Techniques 9 u: a web page B u : the set of u’s backlinks N v : the number of forward links of page v c: the normalization factor to make ||R|| L1 = 1 (||R|| L1 = |R 1 + … + R n |)

An example of Simplified PageRank December 22, 2015Data Mining: Concepts and Techniques 10 PageRank Calculation: first iteration

An example of Simplified PageRank December 22, 2015Data Mining: Concepts and Techniques 11 PageRank Calculation: second iteration

An example of Simplified PageRank December 22, 2015Data Mining: Concepts and Techniques 12 Convergence after some iterations

A Problem with Simplified PageRank December 22, 2015Data Mining: Concepts and Techniques 13 A loop: During each iteration, the loop accumulates rank but never distributes rank to other pages!

An example of the Problem

Random Walks in Graphs The Random Surfer Model The simplified model: the standing probability distribution of a random walk on the graph of the web. simply keeps clicking successive links at random The Modified Model The modified model: the “random surfer” simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page based on the distribution of E

Modified Version of PageRank E(u): a distribution of ranks of web pages that “users” jump to when they “gets bored” after successive links at random.

An example of Modified PageRank 19

Dangling Links Links that point to any page with no outgoing links Most are pages that have not been downloaded yet Affect the model since it is not clear where their weight should be distributed Do not affect the ranking of any other page directly Can be simply removed before pagerank calculation and added back afterwards

PageRank Implementation Convert each URL into a unique integer and store each hyperlink in a database using the integer IDs to identify pages Sort the link structure by ID Remove all the dangling links from the database Make an initial assignment of ranks and start iteration Choosing a good initial assignment can speed up the pagerank Adding the dangling links back.

Convergence Property PR (322 Million Links): 52 iterations PR (161 Million Links): 45 iterations Scaling factor is roughly linear in logn

Convergence Property The Web is an expander-like graph Theory of random walk: a random walk on a graph is said to be rapidly-mixing if it quickly converges to a limiting distribution on the set of nodes in the graph. A random walk is rapidly- mixing on a graph if and only if the graph is an expander graph. Expander graph: every subset of nodes S has a neighborhood (set of vertices accessible via outedges emanating from nodes in S) that is larger than some factor α times of |S|. A graph has a good expansion factor if and only if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue.

PageRank vs. Web Traffic Some highly accessed web pages have low page rank possibly because People do not want to link to these pages from their own web pages (the example in their paper is pornographic sites…) Some important backlinks are omitted use usage data as a start vector for PageRank.

Hypertext-Induced Topic Search(HITS) To find a small set of most “authoritative’’ pages relevant to the query. Authority – Most useful/relevant/helpful results of a query. “java’’ – java.com “harvard’’ – harvard.edu “search engine’’ – powerful search engines. December 22, 2015Data Mining: Concepts and Techniques 25

Hypertext-Induced Topic Search(HITS) Or Authorities & Hubs, developed by Jon Kleinberg, while visiting IBM Almaden IBM expanded HITS into Clever. Authorities – pages that are relevant and are linked to by many other pages Hubs – pages that link to many related authorities December 22, 2015Data Mining: Concepts and Techniques 26

Authorities & Hubs Intuitive Idea to find authoritative results using link analysis: Not all hyperlinks related to the conferral of authority. Find the pattern authoritative pages have: Authoritative Pages share considerable overlap in the sets of pages that point to them. December 22, 2015Data Mining: Concepts and Techniques 27

Authorities & Hubs First Step: Constructing a focused subgraph of the WWW based on query Second Step Iteratively calculate authority weight and hub weight for each page in the subgraph December 22, 2015Data Mining: Concepts and Techniques 28

Constructing a focused subgraph December 22, 2015Data Mining: Concepts and Techniques 29

Constructing a focused subgraph December 22, 2015Data Mining: Concepts and Techniques 30

Constructing a focused subgraph December 22, 2015Data Mining: Concepts and Techniques 31

Constructing a focused subgraph December 22, 2015Data Mining: Concepts and Techniques 32

Computing Hubs and Authorities Rules: A good hub points to many good authorities. A good authority is pointed to by many good hubs. Authorities and hubs have a mutual reinforcement relationship. December 22, 2015Data Mining: Concepts and Techniques 33

Computing Hubs and Authorities December 22, 2015Data Mining: Concepts and Techniques 34

Example (no normalization) December 22, 2015Data Mining: Concepts and Techniques 35

Example (no normalization) December 22, 2015Data Mining: Concepts and Techniques 36

Example (no normalization) December 22, 2015Data Mining: Concepts and Techniques 37

Example (no normalization) December 22, 2015Data Mining: Concepts and Techniques 38

Example (no normalization) December 22, 2015Data Mining: Concepts and Techniques 39

The Iterative Algorithm December 22, 2015Data Mining: Concepts and Techniques 40

The Iterative Algorithm December 22, 2015Data Mining: Concepts and Techniques 41

The Iterative Algorithm December 22, 2015Data Mining: Concepts and Techniques 42

The Iterative Algorithm December 22, 2015Data Mining: Concepts and Techniques 43

A Statistical View of HITS December 22, 2015Data Mining: Concepts and Techniques 44

A Statistical View of HITS The weight of authority equals the contribution of transforming the dataset to first principal component. Importance of this vector for the distribution of whole dataset. From the statistical view: HITS can be implemented by PCA HITS is different from clustering using dimensionality reduction. The number of samples of PCA is limited. December 22, 2015Data Mining: Concepts and Techniques 45

PageRank v.s. HITS December 22, 2015Data Mining: Concepts and Techniques 46

47 Summary PageRank is a global ranking of all web pages based on their locations in the web graph structure PageRank uses information which is external to the web pages – backlinks Backlinks from important pages are more significant than backlinks from average pages The structure of the web graph is very useful for information retrieval tasks. HITS – Find authoritative pages; Construct subgraph; Mutually reinforcing relationship; Iterative algorithm