Our purpose Given a query on the Web, how can we find the most authoritative (relevant) pages?
The obstacles Evaluating the quality of a search method ultimately requires human judgment: what is relevant, really? Moreover, there are several types of query.
The obstacles Query types: Specific queries: “Does Netscape support the JDK 1.1 codesigning API?” Broad-topic queries: “Find information about the Java programming language.” Similar-page queries: “Find pages ‘similar’ to java.sun.com.”
The obstacles Specific queries raise the Scarcity Problem: there are very few pages that contain the required information, and it is often difficult to determine which pages they are. Broad-topic queries raise the Abundance Problem: the number of pages that could reasonably be returned as relevant is far too large for a human user to digest.
The obstacles Most importantly: how do we determine whether a given page is relevant? Pages often lack self-description. The Harvard problem: www.harvard.edu is not the page that uses the term "Harvard" most often. The search-engine problem: search engines' home pages rarely describe themselves with the phrase "search engine".
Using links Idea: use the links between pages rather than the text alone. A link encodes someone's judgment about the relevance of the page it points to. This addresses the problem of lacking self-description.
Using links Disadvantage: links are created for many reasons, e.g. advertisements or pure navigation ("for home page press here"). We must find a balance between relevance and popularity.
Using links A first idea: 1. Take all pages that contain the query string. 2. Count how many times each of them is linked to. 3. Return the c most linked-to pages. Problem: universally popular pages will rank highly with respect to any query.
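As a toy illustration (not from the paper), the naive in-degree ranking can be sketched in Python; the page names, the toy link list, and the `rank_by_in_degree` helper are all invented for the example:

```python
from collections import Counter

def rank_by_in_degree(pages, links, c=10):
    """Naive ranking: among the candidate pages, return the c most linked-to.
    `pages` is assumed to be the set of pages containing the query string
    (e.g. from a text search); `links` is a list of (src, dst) edges."""
    in_degree = Counter(dst for _, dst in links if dst in pages)
    return [p for p, _ in in_degree.most_common(c)]

# Toy graph: a universally popular page wins regardless of the query.
pages = {"java.sun.com", "vacations.example", "blog.example"}
links = [("a", "vacations.example"), ("b", "vacations.example"),
         ("c", "vacations.example"), ("d", "java.sun.com")]
print(rank_by_in_degree(pages, links, c=2))
# the merely popular page outranks the relevant one
```

This is exactly the failure mode the slide describes: in-degree alone measures popularity, not relevance.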
Using Links Hubs: pages that have many links to related authorities. Step 1: build a base graph that is 1. relatively small, 2. rich in relevant pages, and 3. contains most of the strongest authorities.
Building a subgraph Start from the root set R: the t = 200 highest-ranked pages returned by a text-based search engine for the query. Expand R into the base set S by adding, for each page p in R, all pages that p points to and up to d = 50 pages that point to p. Typically |S| ≈ 1000–5000. Reducing the graph even further: 1. Delete intrinsic links (links between pages on the same domain), keeping only transverse links (between different domains). 2. Allow at most m pages from the same domain to point to any single page.
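The expansion step above can be sketched as follows. The in-memory `in_links`/`out_links` tables are stand-ins for a real search engine's link index, and the tiny example graph is invented:

```python
def build_base_set(root_set, in_links, out_links, d=50):
    """Expand the root set R into the base set S (a sketch).
    `in_links` and `out_links` map a page to the lists of pages
    linking to it / linked from it."""
    base = set(root_set)
    for p in root_set:
        base.update(out_links.get(p, []))     # all pages p points to
        base.update(in_links.get(p, [])[:d])  # at most d pages pointing to p
    return base

# Toy link index for two root pages r1, r2.
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["c", "d", "e"], "r2": []}
S = build_base_set({"r1", "r2"}, in_links, out_links, d=2)
```

With `d=2`, only the first two of r1's three in-linking pages are admitted, mirroring the cap on back-links in the construction above.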
Computing Hubs and Authorities Now that we have a reasonably sized graph, we want to extract the highest-quality pages using link structure alone. On this subgraph, ranking pages by in-degree will mostly give good results. But not always: searching "java" might return java.sun.com (good and popular) alongside Caribbean-vacation advertisements and Amazon (popular, but not relevant).
Computing Hubs and Authorities Are links not enough, then? Can we overcome the problem without an additional text-based algorithm? Yes, using hubs: pages that link to many related authorities.
Computing Hubs and Authorities Solution: Hubs can help discard irrelevant popular pages.
Computing Hubs and Authorities Observation: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs. We need to break this circularity, which we do with an iterative computation.
Computing Hubs and Authorities Each page p gets an authority weight x_p and a hub weight y_p, maintaining the invariant: if p points to many pages with large x-values, then it should receive a large y-value; and if p is pointed to by many pages with large y-values, then it should receive a large x-value.
Computing Hubs and Authorities The two update operations: 1) the I operation: x_p ← Σ_{q : q→p} y_q; 2) the O operation: y_p ← Σ_{q : p→q} x_q. After each iteration, normalize so that Σ_p x_p² = Σ_p y_p² = 1.
Computing Hubs and Authorities Repeat the I and O operations k times. For arbitrarily large k, the weight vectors converge to fixed points. We can then report the c best authorities and c best hubs, with 5 <= c <= 10.
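A minimal Python sketch of the iteration, assuming the base graph is given as a list of directed edges; the graph and all names here are invented for the example:

```python
from math import sqrt

def hits(edges, k=50, c=5):
    """Iterative hub/authority computation (a sketch).
    `edges` is a list of (p, q) pairs meaning page p links to page q."""
    nodes = {n for e in edges for n in e}
    x = {n: 1.0 for n in nodes}  # authority weights
    y = {n: 1.0 for n in nodes}  # hub weights
    for _ in range(k):
        # I operation: x_p <- sum of y_q over all q -> p
        x = {p: sum(y[q] for q, r in edges if r == p) for p in nodes}
        # O operation: y_p <- sum of x_q over all p -> q
        y = {p: sum(x[q] for r, q in edges if r == p) for p in nodes}
        # normalize so the squared weights sum to 1
        nx = sqrt(sum(v * v for v in x.values())) or 1.0
        ny = sqrt(sum(v * v for v in y.values())) or 1.0
        x = {p: v / nx for p, v in x.items()}
        y = {p: v / ny for p, v in y.items()}
    top_auth = sorted(nodes, key=lambda p: -x[p])[:c]
    top_hubs = sorted(nodes, key=lambda p: -y[p])[:c]
    return top_auth, top_hubs

# Toy graph: h1-h3 are hub-like, a1-a2 are authority-like.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"),
         ("h2", "a2"), ("h3", "a1")]
auth, hubs = hits(edges, c=2)
```

The edge scans here are O(|E|) per node for clarity; a real implementation would precompute adjacency lists.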
Computing Hubs and Authorities A small reminder from linear algebra. Eigenvalues and eigenvectors: an eigenvector (or characteristic vector) of a square matrix A is a non-zero vector v that, when multiplied by A, yields a scalar multiple of itself: Av = λv. The number λ is called the eigenvalue of A corresponding to v. Transpose: the transpose of a matrix A is the matrix A^T such that (A^T)_ij = A_ji.
Computing Hubs and Authorities Let A be the adjacency matrix of the base graph: A_ij = 1 if page i links to page j, and 0 otherwise. The iteration drives x to the principal eigenvector of A^T A and y to the principal eigenvector of A A^T. The principal eigenvector is the one corresponding to the eigenvalue with the largest |λ|.
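To illustrate the eigenvector view on a tiny invented graph: combining one I and one O step gives z ← A^T(Az), which is exactly power iteration on A^T A, so the authority vector converges to its principal eigenvector.

```python
# Adjacency matrix of a 4-page toy graph (rows = source pages).
A = [[0, 0, 1, 1],   # page 0 links to pages 2 and 3
     [0, 0, 1, 1],   # page 1 links to pages 2 and 3
     [0, 0, 0, 1],   # page 2 links to page 3
     [0, 0, 0, 0]]   # page 3 links nowhere
n = len(A)

def matvec(M, v):
    """Multiply an n-by-n matrix M by the vector v."""
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

AT = [[A[j][i] for j in range(n)] for i in range(n)]  # transpose of A

z = [1.0] * n
for _ in range(100):
    z = matvec(AT, matvec(A, z))   # z <- A^T (A z): power iteration on A^T A
    norm = max(abs(v) for v in z)  # rescale to avoid overflow
    z = [v / norm for v in z]

# z is now proportional to the principal eigenvector of A^T A;
# its largest entry identifies the strongest authority.
best_authority = max(range(n), key=lambda i: z[i])
```

Here page 3 has the most in-links from good hubs, so it dominates the principal eigenvector.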
Computing Hubs and Authorities (Example hub/authority result lists were shown for the queries “Java”, “Search engines”, and “Censorship”.)
Similar-Page Queries The algorithm can also be used to determine similarity between pages; such queries can be very hard to answer using text alone. Change only the first step: instead of starting with “find t pages that contain the string σ”, start with “find t pages pointing to p”.
Similar-Page Queries Why not simply sort by in-links now? Ranking by in-link count alone again favors universally popular pages over genuinely similar ones; running the hubs-and-authorities computation on the resulting subgraph gives markedly better results. (Example result lists for both approaches were shown.)
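The only change to the algorithm is the root-set step, which might look like the following sketch; the `in_links` index is again an invented stand-in for a search engine's link index:

```python
def similar_page_root_set(p, in_links, t=200):
    """Root set for a similar-page query: instead of the top t text-search
    results for a string, take up to t pages that point to page p."""
    return set(in_links.get(p, [])[:t])

# Toy index: three pages link to java.sun.com.
in_links = {"java.sun.com": ["h1", "h2", "h3"]}
root = similar_page_root_set("java.sun.com", in_links, t=2)
```

From here the base-set expansion and the hub/authority iteration proceed unchanged.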
Multiple Sets of Hubs and Authorities This algorithm is, in a sense, finding the most densely linked collection of hubs and authorities in the subgraph. Sometimes we wish to represent more than just one such collection.
Multiple Sets of Hubs and Authorities The non-principal eigenvectors of A^T A and A A^T provide us with a natural way to extract additional densely linked collections of hubs and authorities from the base set.
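One way to compute non-principal eigenvectors without a linear-algebra library is power iteration with deflation. This toy sketch (invented graph, simplified numerics) recovers two disjoint link communities as the first two eigenvectors of A^T A:

```python
def eigvec(M, found=(), iters=300):
    """Power iteration with deflation: returns (approximately) the dominant
    eigenvector of the symmetric matrix M orthogonal to the vectors in
    `found`."""
    n = len(M)
    v = [1.0] * n
    for _ in range(iters):
        for u in found:  # deflation: project out already-found directions
            dot = sum(v[i] * u[i] for i in range(n))
            v = [v[i] - dot * u[i] for i in range(n)]
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5 or 1.0
        v = [x / norm for x in v]
    return v

# Two disjoint hub/authority communities:
# hubs 0-2 point to authority 6; hubs 3-4 point to authority 7.
edges = [(0, 6), (1, 6), (2, 6), (3, 7), (4, 7)]
n = 8
A = [[0] * n for _ in range(n)]
for p, q in edges:
    A[p][q] = 1
ATA = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]

v1 = eigvec(ATA)               # principal: the community around authority 6
v2 = eigvec(ATA, found=(v1,))  # non-principal: the community around authority 7
```

Each successive eigenvector concentrates its weight on a different densely linked collection, which is exactly what the slide's examples exploit.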
Multiple Sets of Hubs and Authorities Example: searching “jaguar*”. The principal eigenvector yields the Atari Jaguar game product; the 2nd non-principal eigenvector yields the Jaguars national-football-league team; the 3rd non-principal eigenvector yields Jaguar cars.
Multiple Sets of Hubs and Authorities (Further examples were shown for the queries “abortion” and “randomized algorithms”.)
Diffusion and Generalization For overly specific queries, the answer very often represents a natural generalization of the query string. (Example results were shown for the query “www conferences”.)
Diffusion and Generalization Searching for pages similar to “sigact.acm.org”: the principal results generalize far beyond the original topic, while taking the 11th non-principal eigenvector recovers a much more focused collection. (Result lists were shown.)
Evaluation How can we evaluate the algorithm? Relevance is subjective; authoring styles are diverse; perhaps quality cannot be assessed directly at all. The practical answer: comparison against existing search services.
Evaluation Testing the CLEVER system against Yahoo! and Alta Vista: 26 topics; about 10 pages per topic (the 5 top hubs and the 5 top authorities); 37 users. All the results were presented together, without identifying their source, and each page was rated “bad”, “fair”, “good”, or “fantastic”.
Evaluation The results: in 31% of cases, Yahoo! and CLEVER were equivalent; in 19%, Yahoo! was more successful; in 50%, CLEVER was the more successful.
Conclusion - The algorithm produces better results than purely text-based search. - It seems to serve better as a starting point for improved search methods than as a stand-alone search engine.