Web Mining Issues: Size (>350 million pages; grows at about 1 million pages a day); diverse types of data.

Presentation transcript:

Web Mining Issues
- Size
  - >350 million pages
  - Grows at about 1 million pages a day
- Diverse types of data

Web Mining Taxonomy

Crawlers
- Robot (spider) traverses the hypertext structure in the Web.
- Collects information from visited pages.
- Used to construct indexes for search engines.
- Traditional crawler: visits entire Web (?) and replaces index.
- Periodic crawler: visits portions of the Web and updates subset of index.
- Incremental crawler: selectively searches the Web and incrementally modifies index.
- Focused crawler: visits pages related to a particular subject.

Focused Crawler
- Classifier also determines how useful outgoing links are.
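A focused crawl can be sketched as a best-first traversal. This is only a minimal illustration under stated assumptions: the Web is a small in-memory dictionary, `relevance` is a hypothetical stand-in for the classifier, and the score of a page is reused as the estimate of how useful its outgoing links are. None of these names come from the slides.

```python
import heapq


def focused_crawl(web, seed, relevance, threshold=0.5, max_pages=10):
    """Visit pages best-first and follow links only out of relevant pages.

    web:       dict mapping a page name to (text, list of outgoing links);
               a toy stand-in for fetching real pages (assumption).
    relevance: hypothetical classifier returning a score in [0, 1] for a text.
    """
    frontier = [(-1.0, seed)]  # max-heap emulated with negated priorities
    seen = {seed}
    collected = []
    while frontier and len(collected) < max_pages:
        _, page = heapq.heappop(frontier)
        text, links = web[page]
        score = relevance(text)
        if score < threshold:
            continue  # off-topic page: not collected, links not followed
        collected.append(page)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                # Assumption: the parent's score estimates the link's usefulness.
                heapq.heappush(frontier, (-score, link))
    return collected


toy_web = {
    "a": ("web mining intro", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("mining web logs", ["d"]),
    "d": ("web usage mining", []),
}
on_topic = lambda text: 1.0 if "mining" in text else 0.0
pages = focused_crawl(toy_web, "a", on_topic)
```

In this toy run the off-topic page `b` is fetched but dropped, and `d` is reached only through the relevant page `c`.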

Focused Crawler

Personalization
- Web access or contents tuned to better fit the desires of each user.
- Manual techniques identify user’s preferences based on profiles or demographics.
- Collaborative filtering identifies preferences based on ratings from similar users.
- Content-based filtering retrieves pages based on similarity between pages and user profiles.
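Collaborative filtering, as described above, can be illustrated with a small user-based sketch. The cosine similarity measure, the `ratings` layout, and the function names are assumptions chosen for the example; the slides do not prescribe a particular similarity function.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (page -> rating)."""
    common = set(u) & set(v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    if not common or norm == 0:
        return 0.0
    return sum(u[p] * v[p] for p in common) / norm


def recommend(ratings, user, k=1):
    """Rank pages the user has not rated, using the k most similar users."""
    neighbours = sorted(
        ((cosine(ratings[user], r), name) for name, r in ratings.items() if name != user),
        reverse=True,
    )[:k]
    scores = {}
    for sim, name in neighbours:
        for page, rating in ratings[name].items():
            if page not in ratings[user]:
                scores[page] = scores.get(page, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


# Invented ratings, for illustration only.
ratings = {
    "ann": {"p1": 5, "p2": 4},
    "bob": {"p1": 5, "p2": 4, "p3": 5},
    "cat": {"p1": 1, "p4": 5},
}
suggestions = recommend(ratings, "ann", k=1)
```

Here `bob` is the user most similar to `ann`, so the only suggestion is the page `bob` rated that `ann` has not seen.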

PageRank
- Used by Google.
- Prioritizes pages returned from search by looking at Web structure.
- Importance of a page is calculated based on the number of pages which point to it (backlinks).
- Weighting is used to give more importance to backlinks coming from important pages.

PageRank (cont’d)
- PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n)
  - PR(i): PageRank for a page i which points to target page p.
  - N_i: number of links coming out of page i.
- Rank source E: R = cAR + cE
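The formula above can be approximated by repeated substitution. The sketch below is not the exact formulation on the slide: it uses the commonly cited damped variant, in which a damping factor `d` plays the role of c and a uniform term (1 - d)/n plays the role of the rank source E. Spreading the rank of pages without outgoing links evenly over all pages is a further assumption.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict page -> list of linked pages."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Uniform rank source: every page receives (1 - d) / n.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # Each backlink contributes PR(p) / N_p, scaled by d.
                share = d * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for q in pages:
                    new_rank[q] += d * rank[p] / n
        rank = new_rank
    return rank


toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_links)
```

In this toy graph `c` has two backlinks and ends up ranked highest, while `b`, which receives only half of `a`'s rank, ends up lowest.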

CLEVER
- Identify authoritative and hub pages.
- Authoritative pages:
  - Highly important pages.
  - Best source for requested information.
- Hub pages:
  - Contain links to highly important pages.
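Hub and authority scores are usually computed with a mutually reinforcing iteration (the HITS scheme that CLEVER is generally described as building on). The sketch below assumes a plain link dictionary and a fixed number of iterations; it illustrates the idea and is not the CLEVER system itself.

```python
from math import sqrt


def hits(links, iterations=20):
    """Return (hub, authority) scores for a dict page -> list of linked pages."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages pointing to a page.
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        # Hub: sum of authority scores of the pages a page points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub_scores, auth_scores = hits(toy_links)
```

In the toy graph `a1` is linked from both hubs and becomes the strongest authority, while `h1` links to both authorities and becomes the strongest hub.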

Web Usage Mining Applications
- Personalization
- Improve structure of a site’s Web pages
- Aid in caching and prediction of future page references
- Improve design of individual pages
- Improve effectiveness of e-commerce (sales and advertising)

Web Usage Mining Activities
- Preprocessing Web log
  - Cleanse
  - Remove extraneous information
  - Sessionize
    » Session: sequence of pages referenced by one user at a sitting.
- Pattern Discovery
  - Count patterns that occur in sessions
  - Pattern is a sequence of page references in a session.
  - Similar to association rules
    » Transaction: session
    » Itemset: pattern (or subset)
    » Order is important
- Pattern Analysis

Web Usage Mining Issues
- Identification of the exact user is not possible.
- The exact sequence of pages referenced by a user cannot be recovered because of caching.
- Session is not well defined.
- Security, privacy, and legal issues.

Web Log Cleansing
- Replace source IP address with unique but non-identifying ID.
- Replace exact URL of pages referenced with unique but non-identifying ID.
- Delete error records and records containing non-page data (such as figures and code).
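The three cleansing steps can be sketched on a simplified log. The `(ip, url, status)` record layout, the list of non-page file extensions, and the rule that any status of 400 or above counts as an error are assumptions made for the example.

```python
def cleanse(records):
    """Anonymize a simplified Web log and drop error and non-page records.

    records: list of (ip, url, status) tuples (hypothetical, simplified format).
    Returns a list of (user_id, page_id) pairs.
    """
    non_page = (".gif", ".jpg", ".png", ".css", ".js")  # assumed extension list
    user_ids, page_ids, cleaned = {}, {}, []
    for ip, url, status in records:
        # Delete error records and records for figures or code.
        if status >= 400 or url.lower().endswith(non_page):
            continue
        # Replace IP and URL with unique but non-identifying IDs.
        uid = user_ids.setdefault(ip, f"U{len(user_ids) + 1}")
        pid = page_ids.setdefault(url, f"P{len(page_ids) + 1}")
        cleaned.append((uid, pid))
    return cleaned


raw_log = [
    ("1.1.1.1", "/index.html", 200),
    ("1.1.1.1", "/logo.gif", 200),
    ("2.2.2.2", "/index.html", 200),
    ("2.2.2.2", "/missing.html", 404),
    ("1.1.1.1", "/about.html", 200),
]
clean_log = cleanse(raw_log)
```

The same IP or URL always maps to the same ID, so later steps such as sessionizing still work on the anonymized log.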

Sessionizing
- Divide Web log into sessions.
- Two common techniques:
  - Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
  - All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
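The second technique (interclick threshold) can be sketched as follows. The `(source, timestamp, page)` record layout and the default threshold of 1500 seconds are assumptions for illustration; the slide does not fix a threshold for this technique.

```python
def sessionize(clicks, max_gap=1500):
    """Split clicks into sessions using an interclick time threshold.

    clicks:  list of (source, timestamp_in_seconds, page) tuples, in any order.
    max_gap: assumed threshold; a gap of max_gap seconds or more starts a new session.
    Returns a list of (source, [pages]) sessions.
    """
    sessions = []
    prev_source, prev_time = None, None
    for source, t, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        # Start a new session on a new source or when the gap reaches the threshold.
        if source != prev_source or t - prev_time >= max_gap:
            sessions.append((source, []))
        sessions[-1][1].append(page)
        prev_source, prev_time = source, t
    return sessions


toy_clicks = [
    ("U1", 0, "A"),
    ("U1", 60, "B"),
    ("U2", 30, "A"),
    ("U1", 4000, "C"),
    ("U2", 90, "D"),
]
toy_sessions = sessionize(toy_clicks)
```

The click by `U1` at 4000 seconds follows a gap of more than 1500 seconds, so it opens a second session for that source.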

Episodes
- Partially ordered set of pages
- Serial episode: totally ordered with time constraint
- Parallel episode: partially ordered with time constraint
- General episode: partially ordered with no time constraint

DAG for Episode

Longest Common Subseries
- Find the longest subseries that two series have in common.
- Ex:
  - X =
  - Y =
  - Output:
  - Sim(X,Y) = l/n = 4/9
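The similarity l/n can be sketched with a small dynamic program. Two assumptions are made here because the slide's example values did not survive: a "subseries" is taken to be a contiguous run of values, and n is taken to be the length of the longer series. The data below is invented for illustration.

```python
def longest_common_subseries(x, y):
    """Return the longest contiguous run of values shared by x and y."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common run that ends at the previous positions.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]


def similarity(x, y):
    """Sim(X, Y) = l / n, with n assumed to be the longer series length."""
    n = max(len(x), len(y))
    return len(longest_common_subseries(x, y)) / n if n else 0.0


x = [1, 2, 3, 4, 5]
y = [9, 2, 3, 4, 8, 5]
common = longest_common_subseries(x, y)
sim = similarity(x, y)
```

If "subseries" is instead read as a non-contiguous subsequence, the recurrence changes to the classic longest common subsequence and the similarity values will generally be higher.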

Similarity based on Linear Transformation
- Linear transformation function f
  - Converts a value from one series to a value in the second
- ε_f: tolerated difference in results
- δ: time value difference allowed

Distance between Strings
- Cost to convert one to the other
- Transformations
  - Match: current characters in both strings are the same
  - Delete: delete current character in input string
  - Insert: insert current character of target string into input string
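With only the three transformations listed above, the conversion cost can be computed by dynamic programming. The unit costs for delete and insert and the zero cost for a match are assumptions; the slide does not state the costs.

```python
def string_distance(source, target, delete_cost=1, insert_cost=1):
    """Minimum cost to convert source into target using match, delete, insert."""
    m, n = len(source), len(target)
    # dist[i][j]: cost to convert the first i chars of source into the first j of target.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * delete_cost
    for j in range(1, n + 1):
        dist[0][j] = j * insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(dist[i - 1][j] + delete_cost, dist[i][j - 1] + insert_cost)
            if source[i - 1] == target[j - 1]:
                best = min(best, dist[i - 1][j - 1])  # match costs nothing
            dist[i][j] = best
    return dist[m][n]


d_same = string_distance("abc", "abc")
d_one_change = string_distance("abc", "adc")
```

Because no substitution operation is listed, changing one character costs a delete plus an insert, so `"abc"` to `"adc"` has distance 2 rather than 1.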

Distance between Strings

Frequent Sequence

Frequent Sequence Example
- Purchases made by customers
- s( ) = 1/3
- s( ) = 2/3
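Support of a sequence is the fraction of customers whose purchase history contains it in order. The sketch below uses an invented three-customer database, since the slide's own sequences did not survive; the function names and data layout are assumptions.

```python
def contains(customer_sequence, pattern):
    """True if pattern (a list of itemsets) occurs in order in the sequence."""
    pos = 0
    for itemset in customer_sequence:
        # Greedily match the next pattern element against this purchase.
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(database, pattern):
    """Fraction of customers whose sequence contains the pattern."""
    return sum(contains(seq, pattern) for seq in database.values()) / len(database)


# Invented purchases: customer -> ordered list of itemsets.
purchases = {
    "c1": [{"A", "B"}, {"C"}],
    "c2": [{"A"}, {"B"}, {"C"}],
    "c3": [{"C"}, {"A"}],
}
s_ab_together = support(purchases, [{"A", "B"}])
s_a_then_c = support(purchases, [{"A"}, {"C"}])
```

Only `c1` buys A and B together, giving support 1/3, while `c1` and `c2` both buy A before C, giving support 2/3. Customer `c3` buys C before A and does not count.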

Frequent Sequence Lattice

SPADE
- Sequential Pattern Discovery using Equivalence classes
- Divides lattice into equivalence classes and searches each separately.

SPADE Example
- ID-List for Sequences of length 1:
- Count for is 3
- Count for is 2
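The ID-list idea can be sketched as a vertical layout: each item keeps the list of (sequence id, event id) pairs where it occurs, and its count is the number of distinct sequence ids. The event data, function names, and the simplified temporal join below are assumptions for illustration and cover only a small part of SPADE.

```python
from collections import defaultdict


def build_id_lists(events):
    """Map each item to its ID-list of (sid, eid) pairs.

    events: iterable of (sid, eid, items), where sid is the customer/sequence id
            and eid is the event time (hypothetical layout).
    """
    id_lists = defaultdict(list)
    for sid, eid, items in events:
        for item in sorted(items):
            id_lists[item].append((sid, eid))
    return dict(id_lists)


def count(id_list):
    """Support count: number of distinct sequences in the ID-list."""
    return len({sid for sid, _ in id_list})


def temporal_join(first, second):
    """ID-list of 'first, later followed by second' within the same sequence."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return sorted(
        {(sid, eid) for sid, eid in second if sid in earliest and eid > earliest[sid]}
    )


# Invented events, for illustration only.
toy_events = [
    (1, 10, {"A", "B"}),
    (1, 20, {"C"}),
    (2, 10, {"A"}),
    (2, 15, {"B"}),
    (3, 5, {"C"}),
    (3, 8, {"A"}),
]
lists = build_id_lists(toy_events)
a_then_c = temporal_join(lists["A"], lists["C"])
```

Joining two ID-lists yields the ID-list of a longer sequence without rescanning the original data, which is what allows each equivalence class to be searched separately.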

θ_k Equivalence Classes

SPADE Algorithm