Our purpose Given a query on the Web, how can we find the most authoritative (relevant) pages?

The obstacles The quality of a search method necessarily requires human evaluation. What is relevant, really? There are several types of query.

The obstacles Query types: Specific queries: “Does Netscape support the JDK 1.1 codesigning API?” Broad-topic queries: “Find information about the Java programming language.” Similar-page queries: “Find pages ‘similar’ to java.sun.com.”

The obstacles Specific queries raise the Scarcity Problem: there are very few pages that contain the required information, and it is often difficult to determine the identity of these pages. Broad-topic queries raise the Abundance Problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest.

The obstacles Most importantly: How will we determine whether a certain page is relevant? - The Harvard problem. - The search engine problem: lack of self-description.

Using links Idea: use the links within the pages rather than the text. A link gives us someone’s judgment about the relevance of a page. This solves the problem of lack of self-description.

Using links Disadvantage: Links are created for various reasons, for example advertising or navigation (“for home page press here”). We need to find a balance between relevance and popularity.

Using links An idea: 1. Choose all pages that contain the query string. 2. Count the number of times each of them is linked to. 3. Return the C most linked-to pages. Problem: some pages will be popular with respect to any query.
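
The three steps above can be sketched in a few lines of Python. This is only an illustration of the naive in-degree ranking, not the method the presentation ends up recommending; the `pages` and `links` structures and the name `top_by_in_degree` are assumptions made for the example.

```python
from collections import Counter


def top_by_in_degree(pages, links, query, c):
    """Return the c most linked-to pages whose text contains the query.

    pages: dict mapping a URL to its text (hypothetical toy data).
    links: iterable of (source_url, target_url) pairs.
    """
    # Step 1: keep only pages that contain the query string.
    matching = {url for url, text in pages.items() if query.lower() in text.lower()}
    # Step 2: count how many links point at each matching page.
    in_degree = Counter(target for _, target in links if target in matching)
    # Step 3: return the c most linked-to pages (ties broken by URL).
    ranked = sorted(matching, key=lambda url: (-in_degree[url], url))
    return ranked[:c]
```

Because the count ignores who is linking and why, a page that is linked from everywhere ranks highly for any query it happens to match, which is the weakness the slide points out.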

Using Links Hubs: pages that have many links to related authorities. Step 1: Build a base graph that 1. is relatively small, 2. is rich in relevant pages, 3. contains most of the strongest authorities.

Building a subgraph d = 50, t = 200

Building a subgraph |S| ≈

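Building a subgraph |S| ≈ Reducing the graph even further: 1. Classify links as transverse (between pages on different domains) or intrinsic (between pages on the same domain), and delete the intrinsic links. 2. Only allow m pages from the same domain to point to a certain page.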
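
The two pruning rules can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `prune_links` is hypothetical, a page's "domain" is approximated by the host part of its URL, and each (source, target) pair is assumed to appear at most once in the input.

```python
from collections import defaultdict
from urllib.parse import urlparse


def prune_links(links, m):
    """Drop intrinsic links, then cap links per (source domain, target page).

    links: iterable of (source_url, target_url) pairs, each pair unique.
    m: maximum number of pages from one domain allowed to point to a page.
    """
    kept = []
    per_domain = defaultdict(int)  # (source domain, target URL) -> links kept
    for source, target in links:
        source_domain = urlparse(source).netloc
        target_domain = urlparse(target).netloc
        # Intrinsic link: both ends share a domain, so it is dropped.
        if source_domain == target_domain:
            continue
        # Transverse link: keep it unless this domain already used its quota.
        if per_domain[(source_domain, target)] >= m:
            continue
        per_domain[(source_domain, target)] += 1
        kept.append((source, target))
    return kept
```

Comparing full host names is a simplification; a stricter notion of "same domain" would need extra rules that the slides do not specify.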

Computing Hubs and Authorities Now that we have a reasonably sized graph, we can find the highest-quality information using only link structure. This time, if we rank pages according to in-degree we’ll mostly get good results. But… Example search: “java”. Results: java.sun.com, Caribbean vacations commercials, Amazon. Some of these results are good; others are merely popular.

Computing Hubs and Authorities Maybe links are not enough then? Can we overcome the problem without an additional text-based algorithm? Hubs: pages that link to many related authorities.

Computing Hubs and Authorities Solution: Hubs can help discard irrelevant popular pages.

Computing Hubs and Authorities Observation: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs. We need to break the circularity.

Computing Hubs and Authorities Each page p has an authority weight (its x-value) and a hub weight (its y-value). The invariant that the weights maintain: if p points to many pages with large x-values, then it should receive a large y-value; and if p is pointed to by many pages with large y-values, then it should receive a large x-value.

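Computing Hubs and Authorities 1) Authority update: $x^{(p)} \leftarrow \sum_{q : (q,p) \in E} y^{(q)}$ 2) Hub update: $y^{(p)} \leftarrow \sum_{q : (p,q) \in E} x^{(q)}$, where E is the set of links in the subgraph.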

Computing Hubs and Authorities

Now we can filter the c best pages, where 5 <= c <= 10. As the number of iterations k grows arbitrarily large, all values converge to fixed points.
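
The iteration and the final filtering step can be sketched as below. This is a minimal sketch, not the presenter's code: the names `hits`, `normalize` and `top` are hypothetical, `x` holds authority weights and `y` hub weights as on the earlier slides, and weights are rescaled after every round so that their squares sum to 1.

```python
import math


def hits(links, k=20):
    """Run k rounds of the hub/authority updates on a set of links.

    links: iterable of (p, q) pairs meaning "page p points to page q".
    Returns (x, y): authority weights and hub weights, keyed by page.
    """
    links = list(links)
    pages = {page for link in links for page in link}
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(k):
        # Authority update: x[q] is the sum of y[p] over pages p pointing to q.
        new_x = {p: 0.0 for p in pages}
        for p, q in links:
            new_x[q] += y[p]
        # Hub update: y[p] is the sum of x[q] over pages q that p points to.
        new_y = {p: 0.0 for p in pages}
        for p, q in links:
            new_y[p] += new_x[q]
        x = normalize(new_x)
        y = normalize(new_y)
    return x, y


def normalize(weights):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {p: w / norm for p, w in weights.items()}


def top(weights, c):
    """Return the c pages with the largest weights (ties broken by name)."""
    return sorted(weights, key=lambda p: (-weights[p], p))[:c]
```

Calling `top(x, c)` and `top(y, c)` on the result gives the c best authorities and the c best hubs. A fixed number of rounds is used here for simplicity; stopping when the weights change by less than a tolerance would be a reasonable alternative.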

Computing Hubs and Authorities Small reminder from linear algebra: Eigenvalues and eigenvectors: An eigenvector or characteristic vector of a square matrix A is a non-zero vector v that, when multiplied with A, yields a scalar multiple of itself. That is: $A v = \lambda v$. The number λ is called the eigenvalue of A corresponding to v. Transpose: the transpose of a matrix A is another matrix $A^T$ such that $(A^T)_{ij} = A_{ji}$.

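Computing Hubs and Authorities A is the adjacency matrix of the graph: $A_{ij} = 1$ if page i points to page j, and $A_{ij} = 0$ otherwise. The principal eigenvector is the eigenvector whose eigenvalue λ has the largest absolute value |λ|.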
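
To see where eigenvectors enter, the two update rules can be written in matrix form. The following is a sketch, assuming x and y denote the vectors of authority and hub weights, A is the adjacency matrix of the subgraph with $A_{ij} = 1$ when page i points to page j, and the subscript k counts iterations; normalization is left out for readability.

```latex
\begin{aligned}
x &\leftarrow A^{T} y, \qquad y \leftarrow A x \\
x_{k} &= A^{T} y_{k-1} = (A^{T} A)\, x_{k-1} \\
y_{k} &= A x_{k} = (A A^{T})\, y_{k-1}
\end{aligned}
```

Each round therefore multiplies x by $A^T A$ and y by $A A^T$, which is the power method: after normalization, x tends toward the principal eigenvector of $A^T A$ and y toward that of $A A^T$, provided the starting vector is not orthogonal to it.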

Computing Hubs and Authorities “Java”: “Search engines”:

Computing Hubs and Authorities Censorship:

Similar-Page Queries This algorithm can also be used to determine similarity between pages. Such queries can be very hard to answer using text alone. Change the algorithm: instead of starting with “find t pages that contain the string σ”, start with “find t pages pointing to p”.

Similar-Page Queries Why not sort by in-links now? Searching by in-links only:

Similar-Page Queries Using Hubs and authorities:

Multiple Sets of Hubs and Authorities This algorithm is, in a sense, finding the most densely linked collection of hubs and authorities in the subgraph. Sometimes we wish to represent more than just one such collection.

Multiple Sets of Hubs and Authorities The non-principal eigenvectors of $A^T A$ and $A A^T$ provide us with a natural way to extract additional densely linked collections of hubs and authorities from the base set.
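
One way to get at the non-principal eigenvectors is power iteration with deflation, sketched below in plain Python. This is an illustrative technique chosen for the example, not necessarily what the presenter used; the function names are hypothetical, the matrix is assumed symmetric (as $A^T A$ is), and the all-ones starting vector is assumed not to be orthogonal to the eigenvector being sought.

```python
import math


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def power_iteration(m, rounds=200):
    """Approximate the dominant eigenpair of a symmetric matrix m."""
    v = [1.0] * len(m)  # assumed not orthogonal to the dominant eigenvector
    for _ in range(rounds):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v


def top_eigenvectors(m, count):
    """Return `count` eigenpairs of symmetric m, largest eigenvalue first."""
    m = [row[:] for row in m]  # work on a copy
    n = len(m)
    pairs = []
    for _ in range(count):
        eigenvalue, v = power_iteration(m)
        pairs.append((eigenvalue, v))
        # Deflation: remove the component just found so the next one dominates.
        for i in range(n):
            for j in range(n):
                m[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def authority_matrix(a):
    """Compute A^T A for an adjacency matrix given as a list of rows."""
    n = len(a)
    return [[sum(a[k][i] * a[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

On a toy graph with two separate link communities, the pages with the largest absolute entries in the first eigenvector of `authority_matrix(a)` are the authorities of the denser community, and those in the second are the authorities of the other one.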

Multiple Sets of Hubs and Authorities Examples: searching “jaguar*”. Principal eigenvector: Atari Jaguar product.

Multiple Sets of Hubs and Authorities 2nd non-principal eigenvector: National Football League team.

Multiple Sets of Hubs and Authorities 3rd non-principal eigenvector: Jaguar cars.

Multiple Sets of Hubs and Authorities Examples: searching “abortion”

Multiple Sets of Hubs and Authorities Examples: searching “randomized algorithms”

Diffusion and Generalization For specific queries, the answer very often represents a natural generalization of the query string. Searching “www conferences”:

Diffusion and Generalization Searching “sigact.acm.org”:

Diffusion and Generalization Taking the 11th non-principal eigenvector:

Evaluation How can we evaluate the algorithm? Relevance is subjective. Authoring styles are diverse. Maybe it can’t even be assessed. Approach: comparison with other search systems.

Evaluation Testing the CLEVER system against Yahoo! and AltaVista: 26 topics; 10 pages per topic (5 top hubs, 5 top authorities); 37 users. All the results were presented together as one list. Users rated each page “bad”, “fair”, “good” or “fantastic”.

Evaluation The results: 31%: Yahoo! and CLEVER were equivalent. 19%: Yahoo! was more successful. 50%: CLEVER was more successful.

Conclusion - The algorithm produces better results than text-based search. - It seems to serve better as a starting point for improved search methods than as a stand-alone search.