Authoritative Sources in a Hyperlinked Environment By: Jon M. Kleinberg Presented by: Yemin Shi CS-572 June 30 2011.


Ranking search results
- Modern search engines may return millions of pages for a single query. That is far too many for a human user to preview, so we need a method that filters out a small set of the most authoritative results.
- A ranking method processes the query results and puts the most useful information at the top of the list.
- Link-based methods focus on the way pages reference one another, which provides an efficient way to identify the authoritative results.
- Query types:
  - Specific queries, e.g. “What does Dr. Chris Mattmann think of the presentations between 3:30-5:00 PM PDT, June ”: very few pages contain the answer, and it is difficult to determine the identity of these pages.
  - Broad-topic queries, e.g. “java”: too many pages match, and a traditional text-based search engine has difficulty finding the authoritative ones.
  - Similar-page queries, e.g. “find pages similar to java”: these behave much like broad-topic queries.

Relation to class material
- HITS stands for Hypertext-Induced Topic Search.
- HITS was a pioneering link-based ranking method and is one of the major web ranking models mentioned in class.
- This presentation goes into the details of how to compute the “authority” and “hub” pages mentioned in class.
- We will compare HITS with the other link-based algorithm, PageRank.
- We will evaluate the pros and cons of the paper.

Outline
- Link-based algorithms
- HITS algorithm
  - Constructing a Focused Subgraph of the WWW
  - Computing Hubs and Authorities
- Comparison with PageRank
- Expansions
  - Similar-Page Queries (modification)
  - Social Network/Scientific Citation
  - Multiple Sets of Hubs and Authorities
  - Diffusion and generalization
- Evaluation
- Pros and Cons of the paper

Link-based ranking algorithm
- Challenges of text-based ranking:
  - The most authoritative page for the query “harvard” is not necessarily the one that uses the term most often: other pages may contain the keyword “harvard” more frequently.
  - Pages are often not sufficiently self-descriptive. E.g. for the query “search engine”, Google does not use the term on its own pages.
  - The number of matching pages is too large to preview.

Link-based ranking algorithm
- Links encode latent human judgment.
- The creator of page p, by including a link to page q, has in some measure conferred authority on q. The target page does not need to be self-descriptive.
- The authority criterion must balance relevance and popularity (e.g. “automobile” → VW, Benz, BMW web pages; ranking by in-degree alone also surfaces pages with a large in-degree that lack thematic unity).

Link-based ranking algorithm
- Authority: an authority is a page with many in-links.
  - The page may have good or authoritative content on some topic, so many people trust it and link to it.
- Hub: a hub is a page with many out-links.
  - The page serves as an organizer of the information on a particular topic and points to many good authority pages on that topic.

Link-based ranking algorithm
- PageRank (Brin & Page 1998):
  - Computed for all web pages before any query (query independent)
  - Computes authority only
  - Fast to compute
- HITS:
  - Performed on the set of web pages retrieved for each query (query dependent)
  - Computes both authorities and hubs
  - Needs more calculation, so it is slow for real-time queries

HITS Algorithm
- Step 1: Constructing a focused subgraph of the WWW.
Requirements:
1. S_q (the collection of pages with respect to query q) is small.
2. S_q is rich in relevant pages.
3. S_q contains most of the strongest authorities.

Subgraph(q, E, t, d)
  q: a query string
  E: a text-based search engine   /* narrow down: results from AltaVista */
  Let R_q denote the top t results of E on q.
  Set S_q := R_q
  For each page p in R_q:   /* expanding */
    Add all pages that p points to into S_q.
    Add all pages that point to p into S_q. (If there are more than d such pages, randomly select d of them and add only those to S_q.)
    /* Limit: a single page in R_q can bring in at most d of the pages that point to it; otherwise one page could pull in hundreds of thousands of extra pages. */
  /* Remove intrinsic links (used for website navigation) and guard against collusion (allow up to m pages from a single domain to point to any given page). */
  Return S_q
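
The Subgraph procedure above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `search`, `out_links`, and `in_links` are hypothetical callables standing in for the text-based engine E and a link index, the toy link data is invented, and the removal of intrinsic links and the per-domain limit m are left out.

```python
import random


def focused_subgraph(query, search, out_links, in_links, t, d, rng=None):
    """Build the base set S_q for a query (sketch of Subgraph(q, E, t, d)).

    search(query, t) -> the top-t pages from a text-based engine (hypothetical)
    out_links(p)     -> pages that p points to (hypothetical)
    in_links(p)      -> pages that point to p (hypothetical)
    """
    rng = rng or random.Random()
    root_set = list(search(query, t))      # R_q
    base_set = set(root_set)               # S_q := R_q
    for page in root_set:
        # Expansion: every page that `page` points to joins S_q.
        base_set.update(out_links(page))
        # At most d of the pages pointing to `page` join S_q.
        parents = list(in_links(page))
        if len(parents) > d:
            parents = rng.sample(parents, d)
        base_set.update(parents)
    return base_set


# Toy link structure, invented purely for illustration.
TOY_OUT = {"r1": ["a", "b"], "p1": ["r1"], "p2": ["r1"], "p3": ["r1"], "p4": ["r1"]}


def toy_search(query, t):
    return ["r1"][:t]


def toy_out_links(page):
    return TOY_OUT.get(page, [])


def toy_in_links(page):
    return [src for src, targets in TOY_OUT.items() if page in targets]


s_q = focused_subgraph("toy query", toy_search, toy_out_links, toy_in_links, t=1, d=2)
print(sorted(s_q))
```

In the toy data four pages point to the single root page, so with d = 2 only two of them (chosen at random) are added alongside the root page and its two out-links.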

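HITS Algorithm
- Step 2: Computing hubs and authorities
Rules:
1. A good hub points to many good authorities.
2. A good authority is pointed to by many good hubs.
3. Authorities and hubs have a mutually reinforcing relationship.
Let the authority score of page i be x(i) and the hub score of page i be y(i). The mutually reinforcing relationship is expressed by two update steps:
I step: $x(i) = \sum_{j \,:\, j \to i} y(j)$ (sum over the pages j that point to i)
O step: $y(i) = \sum_{j \,:\, i \to j} x(j)$ (sum over the pages j that i points to)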

HITS Algorithm
Example (pages 2, 3, and 4 point to page 1; page 1 points to pages 5, 6, and 7):
x(1) = y(2) + y(3) + y(4)
y(1) = x(5) + x(6) + x(7)

HITS Algorithm

- Recap: if A is a square matrix, a non-zero vector v is an eigenvector of A if there is a scalar λ such that Av = λv.
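
To connect this recap to HITS, the two update steps can be written in matrix form. This is a sketch of the standard argument, assuming A denotes the adjacency matrix of the focused subgraph (A_{ij} = 1 when page i links to page j; the symbol is introduced here and is not defined on the slides) and that the score vectors x and y are rescaled after every round.

```latex
x \leftarrow A^{\mathsf T} y, \qquad y \leftarrow A x
\quad\Longrightarrow\quad
x_{k} \propto (A^{\mathsf T} A)\, x_{k-1}, \qquad
y_{k} \propto (A A^{\mathsf T})\, y_{k-1}
```

Under these assumptions the iteration is a power iteration: provided the largest eigenvalue is not repeated and the starting vector is not orthogonal to the corresponding eigenvector, x tends to the principal eigenvector of A^T A and y to the principal eigenvector of A A^T.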

HITS Algorithm

- The Iterate(G, k) procedure can then be used to select the top c authorities and the top c hubs.
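
A minimal Python sketch of this iterate-then-filter idea is shown below. The function names, the toy graph, and the choice k = 20 are assumptions made for illustration; rescaling the scores after each round so that their squares sum to 1 follows the usual description of HITS rather than anything stated on the slides.

```python
import math


def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {node: v / norm for node, v in scores.items()}


def iterate(edges, k):
    """Run k rounds of the I and O steps on a list of (source, target) links.

    Returns (authority, hub) score dictionaries keyed by page.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    x = {n: 1.0 for n in nodes}   # authority scores x(i)
    y = {n: 1.0 for n in nodes}   # hub scores y(i)
    for _ in range(k):
        # I step: x(i) becomes the sum of y(j) over pages j that link to i.
        new_x = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_x[dst] += y[src]
        # O step: y(i) becomes the sum of the new x(j) over pages j that i links to.
        new_y = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_y[src] += new_x[dst]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


def top_c(scores, c):
    """Return the c pages with the largest scores."""
    return sorted(scores, key=lambda n: scores[n], reverse=True)[:c]


# Toy graph, invented for illustration: three hub-like pages and three targets.
toy_edges = [("h1", "a1"), ("h2", "a1"), ("h3", "a1"),
             ("h1", "a2"), ("h2", "a2"),
             ("h1", "a3")]
authority, hub = iterate(toy_edges, k=20)
print(top_c(authority, 1), top_c(hub, 1))
```

In the toy graph every hub-like page links to a1 and h1 links to all three targets, so a1 ends up with the largest authority score and h1 with the largest hub score.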

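HITS Results
- Example, query “Gates”: a top authority found by HITS was ranked only 123rd by AltaVista.
- Text-based search alone overlooks such authorities.
- Text-based search combined with link analysis works: the authorities it finds need not contain many occurrences of the query string “Gates”.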

Related work
- Similar-page queries:
  - Replace “find t pages containing the string q” with “find t pages pointing to p”.
  - Honda → Ford, Toyota, etc.
- Social networks
  - Measure of standing by path counting (Katz)
- Scientific citations
- Multiple sets of hubs and authorities
  - The same query string can correspond to different meanings.
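
For reference, Katz's path-counting measure of standing is usually written as below. The notation is an assumption for illustration, since the formula itself did not survive on the slide: A is the adjacency matrix of the social network, P^r_{ij} is the number of paths of length exactly r from node i to node j, and b is a constant chosen small enough that the sum converges.

```latex
Q_{ij} = \sum_{r \ge 1} b^{r} P^{r}_{ij}, \qquad
s_j = \sum_{i} Q_{ij}, \qquad
Q = (I - bA)^{-1} - I
```

Here s_j is the standing of node j: every path ending at j contributes, with longer paths discounted by higher powers of b.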

Multiple Sets of Hubs and Authorities

Highlights of the method
- Developed a set of algorithmic tools for extracting information from the link structure of hyperlinked environments.
- Formulated the notion of authority based on the relationship between a set of “authority” pages and a set of “hub” pages.
- Proposed a heuristic algorithm to find these pages.
- Surveyed variants and applications.

Evaluation: HITS vs PageRank
- Eigengap: the difference between the largest and second-largest eigenvalues of the matrix M.
- Ng et al. (2001) compared the stability of the two algorithms.
- Idea: the Cora database is a collection containing the citation (similar to link) information from several thousand papers in AI. If an article is truly authoritative or influential, then adding a few links or citations should not change our minds about that site or article having been very influential.
- Based on this idea, Ng et al. constructed a set of five perturbed databases, in each of which 30% of the papers from the base set were randomly deleted.

Evaluation: HITS vs PageRank
[Figure panels: HITS; PageRank]

Evaluation: HITS vs PageRank
- The eigenvectors of the matrices are indicated by the directions of the principal axes of the ellipses.
- When the eigengap is small, a small perturbation can rotate the principal eigenvector by 45 degrees.
- When the eigengap is large, the same perturbation causes almost no change.
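
The effect can be reproduced on a 2×2 symmetric matrix. The sketch below is only an illustration under assumed values: the matrix [[1, 0], [0, 1 − eigengap]], the off-diagonal perturbation `eps`, and the function names are all made up here, and the closed-form angle θ = ½·atan2(2b, a − c) is the standard principal-axis formula for a symmetric matrix [[a, b], [b, c]].

```python
import math


def principal_angle_deg(a, b, c):
    """Angle in degrees of the principal eigenvector of [[a, b], [b, c]]."""
    return math.degrees(0.5 * math.atan2(2.0 * b, a - c))


def angle_shift_deg(eigengap, eps):
    """Rotation of the principal eigenvector of diag(1, 1 - eigengap)
    after adding eps to both off-diagonal entries."""
    before = principal_angle_deg(1.0, 0.0, 1.0 - eigengap)
    after = principal_angle_deg(1.0, eps, 1.0 - eigengap)
    return abs(after - before)


for gap in (0.0, 0.01, 1.0):
    print(f"eigengap={gap}: shift = {angle_shift_deg(gap, 0.01):.2f} degrees")
```

With a zero eigengap the tiny perturbation alone decides the principal direction, which is the 45-degree case; with an eigengap of 1 the same perturbation moves it by well under one degree.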

Evaluation: Pros
- Creative idea of splitting the authority concept into “authority” and “hub”, especially for 1998.
- Efficient heuristic algorithm to compute the authority weights and hub weights.
- Query-driven dynamic ranking.
- Solid theoretical background.
- Abundant variants and applications.

Evaluation: Cons
- The results are not as robust as PageRank's when the link structure is perturbed.
- Topic drift.
- Inefficient at run time.
- User behavior information is not integrated.

Reference
- J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms. Extended version in Journal of the ACM 46 (1999). Also appears as IBM Research Report RJ 10076, May 1997.
- A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. Proceedings of the 24th International Conference on Research and Development in Information Retrieval (SIGIR). New York, NY: ACM Press, 2001.
- Wikipedia:

Questions?
- Thanks for your time!