Web Mining Issues
Size
– More than 350 million pages
– Grows at about 1 million pages a day
Diverse types of data

Web Mining Taxonomy

Crawlers
Robot (spider) traverses the hypertext structure in the Web.
Collects information from visited pages
Used to construct indexes for search engines
Traditional Crawler – visits entire Web (?) and replaces index
Periodic Crawler – visits portions of the Web and updates subset of index
Incremental Crawler – selectively searches the Web and incrementally modifies index
Focused Crawler – visits pages related to a particular subject

Focused Crawler
Classifier also determines how useful outgoing links are
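A focused crawl can be sketched on a small in-memory link graph. Everything named here is hypothetical: the `web` dictionary stands in for fetching pages, `relevance` stands in for the classifier, and using a page's own score as the priority of its outgoing links is an assumption made for illustration, not something the slides specify.

```python
import heapq

def focused_crawl(web, seeds, relevance, threshold=0.5, limit=10):
    """Visit pages in order of estimated usefulness.

    web: dict mapping url -> (text, list of outgoing urls); a hypothetical
         in-memory stand-in for downloading pages.
    relevance: hypothetical classifier, text -> score in [0, 1].
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        score = relevance(text)
        if score >= threshold:
            relevant.append(url)
            # Assumption: links on a relevant page inherit that page's score.
            for link in links:
                if link not in visited:
                    heapq.heappush(frontier, (-score, link))
    return relevant

web = {
    "s": ("data mining intro", ["a", "b"]),
    "a": ("mining web logs", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("web mining crawlers", []),
    "d": ("mining gold", []),
}
on_topic = lambda text: 1.0 if "mining" in text else 0.0
print(focused_crawl(web, ["s"], on_topic))
```

Page "d" is never visited in this example because the only page linking to it is judged off-topic, so its links are not followed.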

Focused Crawler

Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content based filtering retrieves pages based on similarity between pages and user profiles.
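As a minimal sketch of collaborative filtering, the code below predicts a user's rating for a page as a similarity-weighted average of other users' ratings. The choice of cosine similarity, the function names, and the rating data are all assumptions for illustration; the slides do not prescribe a particular similarity measure.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two rating dicts (page -> rating)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[p] * b[p] for p in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict(ratings, user, page):
    """Weighted average of other users' ratings for `page`.

    Returns None when no similar user has rated the page.
    """
    weighted_sum = weight_total = 0.0
    for other, their in ratings.items():
        if other == user or page not in their:
            continue
        weight = cosine(ratings[user], their)
        weighted_sum += weight * their[page]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"p1": 5, "p2": 1},
    "bob": {"p1": 5, "p2": 1, "p3": 5},
    "cat": {"p1": 1, "p2": 5, "p3": 1},
}
print(predict(ratings, "ann", "p3"))
```

Because "ann" rates pages much like "bob" does, the prediction for "p3" lands closer to bob's rating than to cat's.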

PageRank
Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it – Backlinks.
Weighting is used to provide more importance to backlinks coming from important pages.

PageRank (cont’d)
$PR(p) = c \left( \frac{PR(1)}{N_1} + \dots + \frac{PR(n)}{N_n} \right)$
– $PR(i)$: PageRank for a page $i$ which points to target page $p$.
– $N_i$: number of links coming out of page $i$
Rank source $E$: $R = cAR + cE$
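The recurrence with a rank source can be sketched as a simple iteration. This is an illustrative sketch under stated assumptions, not Google's implementation: the rank source is taken to be uniform with total mass `source_mass` (a hypothetical parameter), $c$ is treated as the factor that rescales the ranks to sum to 1, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, iterations=50, source_mass=0.15):
    """Iterate R = c(AR + E) on a dict mapping page -> list of linked pages.

    Assumptions: E is uniform with total mass `source_mass`; c normalizes
    the ranks to sum to 1; every page in `links` has outgoing links and
    every link target is itself a key of `links`.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Start each page with its share of the rank source E.
        new = {p: source_mass / len(pages) for p in pages}
        # Each page i passes PR(i) / N_i to every page it points to.
        for src, outs in links.items():
            for dst in outs:
                new[dst] += rank[src] / len(outs)
        total = sum(new.values())  # c = 1 / total
        rank = {p: value / total for p, value in new.items()}
    return rank

# Hypothetical three-page Web.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```

In this example page "c" ends up ranked above "b": both receive half of a's rank, but "c" also receives all of b's.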

CLEVER
Identify authoritative and hub pages.
Authoritative Pages:
– Highly important pages.
– Best source for requested information.
Hub Pages:
– Contain links to highly important pages.
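The mutual reinforcement between hubs and authorities can be sketched with a HITS-style iteration, the link-analysis scheme that CLEVER builds on. This is a simplified illustration, not the CLEVER system itself; the function name, the fixed iteration count, and the example graph are assumptions.

```python
def hubs_and_authorities(links, iterations=20):
    """HITS-style scores on a dict mapping page -> list of linked pages.

    A page's authority score sums the hub scores of pages linking to it;
    its hub score sums the authority scores of the pages it links to.
    """
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for src, outs in links.items():
            for dst in outs:
                auth[dst] += hub[src]
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Rescale both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: two pages that only link out, two that are only linked to.
links = {"h1": ["a", "b"], "h2": ["a", "b"], "a": [], "b": []}
hub, auth = hubs_and_authorities(links)
print(hub, auth)
```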

Web Usage Mining Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)

Web Usage Mining Activities
Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize. Session: sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– Pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
Pattern Analysis
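The pattern-discovery step can be sketched as counting, for each ordered run of pages, how many sessions contain it. Restricting patterns to contiguous runs of a fixed length is an assumption made to keep the sketch short; the function name and session data are hypothetical.

```python
from collections import Counter

def count_patterns(sessions, length=2):
    """Count the sessions containing each contiguous page sequence.

    sessions: list of page lists, one list per session.
    Each pattern is counted at most once per session, mirroring how an
    itemset is counted once per transaction in association rules.
    """
    counts = Counter()
    for pages in sessions:
        seen = {tuple(pages[i:i + length])
                for i in range(len(pages) - length + 1)}
        counts.update(seen)
    return counts

# Hypothetical sessions over pages A, B, C.
sessions = [["A", "B", "C"], ["A", "B"], ["B", "C", "A", "B"]]
print(count_patterns(sessions))
```

Order matters here: the pattern (A, B) is counted separately from (B, A), unlike an unordered itemset.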

Web Usage Mining Issues
Identification of exact user not possible.
Exact sequence of pages referenced by a user not possible due to caching.
Session not well defined
Security, privacy, and legal issues

Web Log Cleansing
Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code)
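The three cleansing steps can be sketched as follows. The record layout `(ip, timestamp, url, status)`, the list of non-page file suffixes, and the `U1`/`P1` ID scheme are all hypothetical choices made for illustration.

```python
# Hypothetical suffixes treated as non-page data (figures, styles, scripts).
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")

def cleanse(records):
    """Clean a list of (ip, timestamp, url, status) log records.

    Drops error records and non-page requests, then replaces each IP
    address and URL with a unique but non-identifying ID.
    """
    user_ids, page_ids, cleaned = {}, {}, []
    for ip, timestamp, url, status in records:
        if status >= 400 or url.lower().endswith(NON_PAGE_SUFFIXES):
            continue
        # setdefault keeps the existing ID or assigns the next one.
        user = user_ids.setdefault(ip, f"U{len(user_ids) + 1}")
        page = page_ids.setdefault(url, f"P{len(page_ids) + 1}")
        cleaned.append((user, timestamp, page))
    return cleaned

# Hypothetical log records.
records = [
    ("1.1.1.1", 0, "/a.html", 200),
    ("1.1.1.1", 5, "/logo.gif", 200),
    ("2.2.2.2", 7, "/a.html", 200),
    ("2.2.2.2", 9, "/b.html", 404),
]
print(cleanse(records))
```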

Sessionizing
Divide Web log into sessions.
Two common techniques:
– Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
– All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
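The second technique (interclick threshold) can be sketched as below. The click layout `(source, timestamp_seconds, page)` and the parameter name `max_gap` are assumptions; the threshold value is left to the caller.

```python
def sessionize(clicks, max_gap):
    """Split clicks into sessions using an interclick time threshold.

    clicks: list of (source, timestamp_seconds, page) in any order.
    A click continues the source's current session when it arrives less
    than `max_gap` seconds after that source's previous click; otherwise
    it starts a new session.
    """
    sessions = []
    last_seen = {}  # source -> (timestamp of last click, its session list)
    for source, timestamp, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        previous = last_seen.get(source)
        if previous is None or timestamp - previous[0] >= max_gap:
            current = []
            sessions.append((source, current))
        else:
            current = previous[1]
        current.append(page)
        last_seen[source] = (timestamp, current)
    return sessions

# Hypothetical clicks; timestamps are seconds since the start of the log.
clicks = [("U1", 0, "A"), ("U1", 60, "B"), ("U1", 4000, "C"), ("U2", 30, "A")]
print(sessionize(clicks, max_gap=1800))
```

The first technique differs only in the test applied: it would compare each click's timestamp with the start of the session rather than with the previous click.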

Episodes
Partially ordered set of pages
Serial episode – totally ordered with time constraint
Parallel episode – partially ordered with time constraint
General episode – partially ordered with no time constraint

DAG for Episode

Longest Common Subseries
Find the longest subseries the two series have in common.
Ex:
– X =
– Y =
– Output:
– Sim(X,Y) = l/n = 4/9
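A minimal sketch of this similarity measure follows. It rests on two assumptions that the surviving slide text does not confirm: that a "subseries" is a contiguous run of values, and that n is the length of the longer series. The example series are hypothetical and are not the ones from the slide.

```python
def longest_common_subseries(x, y):
    """Longest contiguous run of values that appears in both series."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common run ending at x[i-2], y[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return x[best_end - best_len:best_end]

def similarity(x, y):
    """Sim(X, Y) = l / n, assuming n is the length of the longer series."""
    return len(longest_common_subseries(x, y)) / max(len(x), len(y))

# Hypothetical series.
x = [1, 2, 3, 4, 5]
y = [9, 2, 3, 4, 8, 7]
print(longest_common_subseries(x, y), similarity(x, y))
```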

Similarity based on Linear Transformation
Linear transformation function f
– Converts a value from one series to a value in the second
– Tolerated difference in results (a tolerance associated with f)
– Time value difference allowed

Distance between Strings
Cost to convert one to the other
Transformations
– Match: Current characters in both strings are the same
– Delete: Delete current character in input string
– Insert: Insert current character in target string into the input string
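The three transformations lead to a standard dynamic-programming computation. In this sketch a match is free and each delete or insert costs 1 by default; those costs and the function name are assumptions, since the slide text does not state them.

```python
def string_distance(source, target, delete_cost=1, insert_cost=1):
    """Minimum cost to convert `source` into `target`.

    Allowed steps: match (free, characters equal), delete a character
    from the input string, insert a character from the target string.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_cost   # delete every remaining input character
    for j in range(1, cols):
        dist[0][j] = j * insert_cost   # insert every remaining target character
    for i in range(1, rows):
        for j in range(1, cols):
            best = min(dist[i - 1][j] + delete_cost,
                       dist[i][j - 1] + insert_cost)
            if source[i - 1] == target[j - 1]:
                best = min(best, dist[i - 1][j - 1])  # match
            dist[i][j] = best
    return dist[-1][-1]

print(string_distance("abc", "abd"))
```

With no substitution step, changing one character costs a delete plus an insert, so "abc" to "abd" has distance 2 under these costs.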

Distance between Strings

Frequent Sequence

Frequent Sequence Example
Purchases made by customers
s( ) = 1/3
s( ) = 2/3
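Support values like these can be computed as the fraction of customers whose purchase sequence contains the pattern in order. The sketch below uses the usual definition of sequence support and treats each purchase as a single item; the customer data is hypothetical and is not the data from the slide.

```python
from fractions import Fraction

def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator past the first occurrence.
    return all(item in remaining for item in pattern)

def support(customers, pattern):
    """Fraction of customers whose purchases contain `pattern`."""
    hits = sum(contains(purchases, pattern) for purchases in customers.values())
    return Fraction(hits, len(customers))

# Hypothetical purchase sequences for three customers.
customers = {
    "C1": ["A", "B", "C"],
    "C2": ["A", "C"],
    "C3": ["B", "A"],
}
print(support(customers, ["A", "C"]), support(customers, ["B", "C"]))
```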

Frequent Sequence Lattice

SPADE
Sequential Pattern Discovery using Equivalence classes
Divides lattice into equivalence classes and searches each separately.

SPADE Example
ID-List for Sequences of length 1:
Count for is 3
Count for is 2
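ID-lists and their counts can be sketched as follows. Each item's ID-list holds (sequence ID, event ID) pairs, the count is the number of distinct sequence IDs, and a temporal join of two ID-lists yields the ID-list of a length-2 sequence. The event data and function names are hypothetical, and this covers only the ID-list idea, not the full SPADE search.

```python
from collections import defaultdict

def build_id_lists(events):
    """events: iterable of (sequence_id, event_id, item) triples."""
    id_lists = defaultdict(list)
    for sid, eid, item in sorted(events):
        id_lists[item].append((sid, eid))
    return id_lists

def support_count(id_list):
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in id_list})

def temporal_join(first, second):
    """ID-list for 'first item, then later second item'.

    Keeps each (sid, eid) of `second` that is preceded, in the same
    sequence, by some occurrence of `first`.
    """
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in second
            if sid in earliest and earliest[sid] < eid]

# Hypothetical events: (sequence ID, event ID, item).
events = [(1, 1, "A"), (1, 2, "B"), (2, 1, "B"),
          (2, 2, "A"), (3, 1, "A"), (3, 3, "B")]
id_lists = build_id_lists(events)
a_then_b = temporal_join(id_lists["A"], id_lists["B"])
print(support_count(id_lists["A"]), support_count(a_then_b))
```

Because counts come from joining ID-lists, longer sequences can be counted without rescanning the original data.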

Equivalence Classes

SPADE Algorithm
