Link Structure and Web Mining Shuying Wang 2003.11.

Outline
Part one: Link Structure and Web Mining
Part two: Analysis of Link Structure
Topics covered:
- Web mining methods
- Text-based Web mining
- Web graph: Bow-tie theory
- Eigenvalue and eigenvector
- Authorities & Hubs
- HITS (Hyperlink-Induced Topic Search)
- PageRank

Challenges for Web Search The WWW is a vast collection of information: over 3 billion text pages plus a multitude of multimedia files. Over a million new resources are added every day. The collection is huge, complex, dynamic, and diverse, and it serves many different user groups. How do we find the information we need in such a large collection? Search is the most common activity on the web after email.

Web Mining Methods
- Web content mining: context, keywords, document classification
- Web structure mining: link structure and link text
- Web usage mining: web logs, URLs, timestamps, IP addresses, and web page content

Limitations of Text-Based Analysis [Figure: web pages, Web database, keyword, text-based ranking function] E.g., can www.harvard.edu be recognized as one of the most authoritative pages for “harvard” when many other web pages contain the term “harvard” more often? Pages are not sufficiently self-descriptive: the term “search engine” usually doesn't appear on search engine web pages.

Bow-tie Theory

What are the benefits of link building? Following a link is one of the most popular ways for people to find new sites. Linking to existing material means people don't have to re-invent the wheel. Inbound links help to build trust. Link structure and link text provide a lot of information for making relevance judgments and for quality filtering. The link structure implies an underlying social structure in the way that pages and links are created, and an understanding of this social organization gives us the most leverage.

Queries and Authoritative Sources
Types of queries:
- Specific queries, e.g., “Does Netscape support the JDK 1.3?”
- Broad-topic queries, e.g., “Find information about the Java programming language.”
- Similar-page queries, e.g., “Find pages similar to java.sun.com.”
Authoritative pages are defined relative to a broad-topic query. It is not sufficient to collect a large number of potentially relevant pages with text-based methods, because authorities are often not particularly self-descriptive.

Authorities and Hubs A good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities. This is the mutually reinforcing relationship. Authority pages are those that contain the most definitive, central, and useful information in the context of a particular topic. Hubs are pages that link to a collection of prominent sites on a common topic. [Figure: hubs pointing to authorities]

HITS (Hyperlink-Induced Topic Search) The focused subgraph is created by first taking the highest-ranked pages from a text-based search engine as a root set R. R is then expanded into the base set S by adding every site that points to, or is pointed at by, a site in R. Note that while R may fail to contain some “important” authorities, S will probably contain them. [Figure: the root set R nested inside the larger base set S]
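The expansion from root set to base set can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `build_base_set`, the in-memory link tables `out_links` and `in_links`, and the toy page names are all assumptions made for the example.

```python
def build_base_set(root_set, out_links, in_links):
    """Expand a root set R into the base set S.

    out_links[p] is the set of pages that p points to and in_links[p] is
    the set of pages pointing to p. Both tables are hypothetical stand-ins
    for whatever link index a real system would query.
    """
    base = set(root_set)
    for page in root_set:
        base |= out_links.get(page, set())  # pages pointed at by a page in R
        base |= in_links.get(page, set())   # pages pointing to a page in R
    return base


# Toy link graph: page -> pages it links to.
out_links = {
    "r1": {"auth"},
    "r2": {"auth"},
    "hub": {"r1", "r2", "auth"},
    "auth": set(),
    "far": {"hub"},
}

# Derive the reverse table (page -> pages linking to it) from out_links.
in_links = {}
for src, targets in out_links.items():
    for dst in targets:
        in_links.setdefault(dst, set()).add(src)

root = {"r1", "r2"}
base = build_base_set(root, out_links, in_links)
print(sorted(base))
```

In this toy graph the authority `auth` is not in the root set but enters the base set because both root pages link to it, while `far`, which is two links away, stays out.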

Computing Hubs and Authorities (1) Number the pages {1, 2, …, n} and define their adjacency matrix A to be the n×n matrix whose (i, j)th entry is 1 if page i links to page j, and 0 otherwise. With each page p we associate a non-negative authority weight $a_p$ and a non-negative hub weight $h_p$, and we define $a = (a_1, a_2, \dots, a_n)$ and $h = (h_1, h_2, \dots, h_n)$.
$a_p = \sum_{q:\, q \to p} h_q$ (1)
$h_p = \sum_{q:\, p \to q} a_q$ (2)
$a = A^T h$ (3)
$h = A a$ (4)

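Computing Hubs and Authorities (2) Combining the two matrix updates $a = A^T h$ and $h = A a$ gives, up to normalization:
$a = A^T A a$ (5)
$h = A A^T h$ (6)
Let $B = A^T A$ (7)
In other words, a is an eigenvector of B. B is the co-citation matrix: B(i, j) is the number of sites that jointly point to both i and j. B is symmetric and has n orthogonal unit eigenvectors.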
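As a small worked check of the co-citation interpretation, the sketch below builds B from a toy adjacency matrix with plain Python lists. The helper names `transpose` and `matmul` and the 4-page graph are assumptions chosen for illustration; pages are numbered from 0 here rather than from 1.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    y_cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols]
            for row in x]


# A[i][j] = 1 if page i links to page j, 0 otherwise.
A = [
    [0, 0, 1, 1],  # page 0 links to pages 2 and 3
    [0, 0, 1, 1],  # page 1 links to pages 2 and 3
    [0, 0, 0, 1],  # page 2 links to page 3
    [0, 0, 0, 0],  # page 3 links to nothing
]

# Co-citation matrix: B[i][j] counts pages that link to both i and j.
B = matmul(transpose(A), A)
for row in B:
    print(row)
```

Pages 0 and 1 both link to pages 2 and 3, so B[2][3] is 2; the diagonal entry B[3][3] is 3 because three pages link to page 3.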

Computing Hubs and Authorities (3)
– We initialize a(p) = h(p) = 1 for all p.
– We iterate the following operations: $a(p) \leftarrow \sum_{q:\, q \to p} h(q)$ and $h(p) \leftarrow \sum_{q:\, p \to q} a(q)$.
– And renormalize after each iteration.
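The iteration above can be sketched as follows. This is a minimal version under stated assumptions: the function name `hits` and the toy graph are made up for the example, and normalizing each vector to unit Euclidean length is one common choice, since the slide does not say which normalization is meant.

```python
import math


def hits(adj, iterations=20):
    """Iterate hub and authority weights on an adjacency matrix.

    adj[i][j] is 1 if page i links to page j. Vectors are rescaled to
    unit Euclidean length after every round (an assumed normalization).
    """
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # Authority update: sum the hub weights of pages linking to p.
        a = [sum(h[q] for q in range(n) if adj[q][p]) for p in range(n)]
        # Hub update: sum the (new) authority weights of pages p links to.
        h = [sum(a[q] for q in range(n) if adj[p][q]) for p in range(n)]
        a_norm = math.sqrt(sum(x * x for x in a)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / a_norm for x in a]
        h = [x / h_norm for x in h]
    return a, h


adj = [
    [0, 0, 1, 1],  # page 0 links to pages 2 and 3
    [0, 0, 1, 1],  # page 1 links to pages 2 and 3
    [0, 0, 0, 1],  # page 2 links to page 3
    [0, 0, 0, 0],  # page 3 links to nothing
]
authority, hub = hits(adj)
print("authority:", [round(x, 3) for x in authority])
print("hub:      ", [round(x, 3) for x in hub])
```

On this graph page 3 ends up with the largest authority weight and pages 0 and 1 share the largest hub weight, which matches the mutually reinforcing picture: the best hubs are the pages that point at the best authorities.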

Computing Hubs and Authorities (4) The eigenvectors of B are precisely the stationary points of this process. a is the principal eigenvector of $A^T A$, and h is the principal eigenvector of $A A^T$. The principal eigenvector represents the “densest cluster” within the focused subgraph. By initializing a(p) = h(p) = 1, a will converge to the principal eigenvector of B.
– Initializing differently may lead to convergence to a different eigenvector.
– In practice, convergence is achieved after only 10-20 iterations.
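The stationary-point claim can be checked numerically on a small case. The sketch below is an illustration under assumptions: it uses a hand-picked 2×2 symmetric matrix as B, starts from the all-ones vector, and the helper names `matvec` and `normalize` are invented for the example.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# A small symmetric co-citation-style matrix (assumed for illustration).
B = [[2.0, 2.0],
     [2.0, 3.0]]

a = normalize([1.0, 1.0])          # a(p) = 1 for all p, then rescaled
for _ in range(20):                # 20 rounds, as in the 10-20 rule of thumb
    a = normalize(matvec(B, a))

Ba = matvec(B, a)
lam = sum(x * y for x, y in zip(Ba, a))             # Rayleigh quotient
residual = max(abs(x - lam * y) for x, y in zip(Ba, a))
print("eigenvalue estimate:", lam)
print("residual of B a = lam a:", residual)
```

The eigenvalues of this B are (5 ± √17)/2, so the estimate should settle on the larger one, about 4.56, and the residual should be close to machine precision: applying B to the converged vector only rescales it.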

PageRank (Simple structure of Google search engine) [Figure: offline, the Web is processed by TextIndex() into an inverted text index and by PageRank() into page ranks; at query time, the query processor takes a query and returns ranked results.]

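PageRank Computing (c < 1)
$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$ (1)
u: a web page; v: a page that links to u; $B_u$: the set of pages that link to u; $N_v$: the number of links out of v; c: a factor used for normalization.
Let A be a square matrix with rows and columns corresponding to web pages. Let $A_{u,v} = 1/N_v$ if v links to u, and $A_{u,v} = 0$ otherwise. If we treat R as a vector over web pages, then
$R = cAR$ (2)
so R is an eigenvector of A with eigenvalue 1/c.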
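A minimal sketch of this simplified rank computation follows. It is not Google's production algorithm: there is no damping factor, the function name `simple_pagerank` and the three-page graph are assumptions, and the code assumes every page has at least one outgoing link and that the graph is strongly connected and aperiodic so the iteration settles.

```python
def simple_pagerank(out_links, iterations=100):
    """Iterate R(u) = c * sum of R(v) / N_v over pages v linking to u.

    out_links[v] is the list of pages v links to. The factor c is applied
    implicitly by rescaling the ranks to sum to 1 after each round.
    Assumes every page has at least one outgoing link.
    """
    pages = sorted(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v in pages:
            share = rank[v] / len(out_links[v])  # R(v) / N_v
            for u in out_links[v]:
                new_rank[u] += share
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Toy graph: a links to b and c, b links to c, c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simple_pagerank(graph)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

Solving the rank equations by hand for this graph gives R(a) = R(c) = 0.4 and R(b) = 0.2, which the iteration approaches: b receives only half of a's rank, while c collects the other half plus all of b's.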

HITS and PageRank
PageRank:
- Computed offline
- Focuses on authoritative pages
- Computed over all web pages
HITS:
- Computed at query time
- Seeks good hub pages as well as authorities
- Computed over the pages in the base set

Conclusion We presented a technique for locating high-quality information related to a broad search topic on the WWW, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic. Related work: standing and influence in social networks, scientific citations, etc.; hypertext and WWW rankings …

