Characterizing Visitors to a Website Across Multiple Sessions
NGDM Workshop, Nov 2002
Arindam Banerjee, Joydeep Ghosh


1 Characterizing Visitors to a Website Across Multiple Sessions
NGDM Workshop, Nov 2002
Arindam Banerjee, Joydeep Ghosh

2 Motivation
Why characterize or predict web user behavior?
Site-centric view: personalization, sticky websites
User-centric view: personal agents for information acquisition
Universalist approaches: PageRank, web metrics, …

3 Clustering Users from Web Logs
Wide variety of web behavior → segment users based on surfing behavior as a first step to further analysis.
User: a set of sessions
Session: a sequence of (page ID, time spent on that page) tuples
How to cluster sets of sequences?

4 The Approach
Cluster Sessions
–Session Similarity Measure
–Session Similarity Graph
   Outlier Detection
–Graph Partitioning
Create a Cluster Space
Cluster users in this Space

5 A Similarity Measure for Sessions
1. The overlap between two sessions is represented by their longest common subsequence (LCS) of pages.
2. Session similarity is obtained from the LCS and the time information:
session similarity = (time similarity in LCS) x (importance of LCS)
The similarity component:
–Average min-max similarity over the pages in the LCS (for each LCS page, the shorter of the two visit times divided by the longer)
The importance component:
–Average, over the two sessions, of the fraction of overall session time spent in the LCS
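The two components can be written out as formulas. The notation below is assumed for illustration and does not appear on the slides: m is the number of pages in the LCS, t^(1)_j and t^(2)_j are the times the two sessions spend on the j-th LCS page, and T^(1), T^(2) are the total session times.

```latex
\[
\mathrm{sim}_{\mathrm{time}} = \frac{1}{m}\sum_{j=1}^{m}
  \frac{\min\bigl(t^{(1)}_j,\, t^{(2)}_j\bigr)}{\max\bigl(t^{(1)}_j,\, t^{(2)}_j\bigr)},
\qquad
\mathrm{imp} = \frac{1}{2}\left(
  \frac{\sum_{j=1}^{m} t^{(1)}_j}{T^{(1)}} + \frac{\sum_{j=1}^{m} t^{(2)}_j}{T^{(2)}}
\right),
\qquad
S = \mathrm{sim}_{\mathrm{time}} \times \mathrm{imp}.
\]
```

Both factors lie in [0, 1] when all times are positive, so S does as well.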

6 Session Clustering
Find the pairwise similarity values between all pairs of sessions; record only similarities above a threshold
Incrementally construct the similarity graph G
–the vertices are the sessions, the edge weights are the session similarity values
–no isolated vertices (discard “outliers”)
Balanced Graph Partitioning
–we used Metis [Karypis, Kumar]
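A minimal sketch of the graph-construction step, assuming a batch all-pairs loop rather than the incremental construction the slide mentions. The function name `build_similarity_graph`, the toy `closeness` similarity, and the threshold value are all hypothetical. The partitioning step (Metis) is left out.

```python
from itertools import combinations


def build_similarity_graph(items, similarity, threshold):
    """Return (edges, outliers) for a thresholded similarity graph.

    edges maps an index pair (i, j) with i < j to its similarity weight;
    outliers lists the indices of items left with no edge at all.
    """
    edges = {}
    for i, j in combinations(range(len(items)), 2):
        weight = similarity(items[i], items[j])
        if weight > threshold:  # record only similarities above the threshold
            edges[(i, j)] = weight
    connected = {vertex for pair in edges for vertex in pair}
    outliers = [i for i in range(len(items)) if i not in connected]
    return edges, outliers


# Toy usage with a stand-in similarity on numbers (not the session measure).
def closeness(a, b):
    return 1.0 / (1.0 + abs(a - b))


edges, outliers = build_similarity_graph([1.0, 1.5, 9.0], closeness, threshold=0.5)
print(sorted(edges))  # [(0, 1)]
print(outliers)       # [2]
```

The all-pairs loop is quadratic in the number of items.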

7 The Cluster Space
Given: each session assigned to one of k clusters (sets) → the sessions of a user are distributed among the k sets
–vector u = [u_1 u_2 … u_k]^T, where u_i = number of sessions of the user belonging to cluster i
Stage II: User Clustering
–find pairwise similarity values using the extended Jaccard measure
–partition the similarity graph
Gives l user clusters and a set of outlier users
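A short sketch of the cluster-space mapping and of user similarity. It uses the usual form of the extended Jaccard coefficient, u·v / (|u|² + |v|² − u·v); the slides do not spell the formula out. The function names, the 0-based cluster labels, and the choice to return 0.0 for two all-zero vectors are assumptions made for illustration.

```python
from collections import Counter


def cluster_space_vector(session_labels, k):
    """Map a user's session cluster labels (0..k-1) to a count vector of length k."""
    counts = Counter(session_labels)
    return [counts.get(i, 0) for i in range(k)]


def extended_jaccard(u, v):
    """Extended Jaccard similarity: u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    # denom is zero only when both vectors are all zeros (assumed similarity 0.0).
    return dot / denom if denom else 0.0


# Two hypothetical users, k = 4 session clusters.
user_a = cluster_space_vector([3, 3, 0], k=4)  # [1, 0, 0, 2]
user_b = cluster_space_vector([3, 3, 3], k=4)  # [0, 0, 0, 3]
print(extended_jaccard(user_a, user_b))        # 0.75
```

Once every user is a k-dimensional vector, any vector clustering method can be applied to the users.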

8 The Dataset: Sulekha.com

9 Dataset details
Logs over a one-month period
Raw log size: 184 MB
453,953 files accessed
37,753 sessions in all
23,310 sessions after some preprocessing/filtering
2,493 users

10 Results: Session Clusters
Cluster 1 – interest in coffeehouse, contests
-(/,12)(/movies,6)(/contests,178)
-(/contests,142)
-(/coffeehouse,5)(/contests,183)
-(/contests,172)
-(/,10)(/contests,143)
Cluster 2 – glance through home, articles
-(/,22)(/articles,22)
-(/,20)(/articles,20)
-(/,21)(/articles,21)
-(/,19)(/articles,19)
-(/,20)(/articles,19)
Cluster 3 – interest in author, articles
-(/,148)(/authors,6)(/articles,77)
-(/authors,290)(/articles,290)
-(/authors,295)(/articles,295)
-(/,33)(/authors,90)(/articles,475)
-(/,32)(/authors,91)(/articles,425)
Cluster 4 – read articles
-(/,39)(/articles,98)(/misc,17)(/articles,2649)
-(/,9)(/articles,2666)
-(/authors,26)(/articles,2561)
-(/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)(/articles,2373.1)

11 Results: User Clusters
user: [(xxx.xxx)]
–(/authors,3)(/articles,129)
–(/authors,8)(/articles,8)
–(/authors,80)(/articles,2141)
user: [(xxx.xxx)]
–(/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
–(/articles,2627)
user: [(xxx.xxx)]
–(/home,323)(/articles,24)(/authors,45)(/articles,1290)
A user cluster: people who read the articles

12 Results: User Clusters
user: [(xxx.xxx)]
–(/home,21)(/wo-men,1075)(/philosophy,52)
user: [(xxx.xxx)]
–(/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
–(/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
–(/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)
A user cluster: people interested in wo-men, philosophy, coffeehouse

13 Results: User Clusters
user: [(xxx.xxx)]
–(/coffeehouse,12)(/biztech,25)(/books,48)
–(/coffeehouse,13)(/biztech,26)(/books,19)
user: [(xxx.xxx)]
–(/coffeehouse,162)
–(/coffeehouse,40)
user: [(xxx.xxx)]
–(/coffeehouse,12)(/contests,12)
–(/coffeehouse,43)(/contests,44)
A user cluster: people interested in coffeehouse – bookmarked it!

14 Result Visualization using CLUSION [Strehl & Ghosh 01]
[Figure panels: Sessions, Users]

15 Conclusions
Segmentation: a basic pre-processing step for Web Mining
Similarity measure + Cluster Space concept: applicable to clustering sets of any data structure
For certain websites, time spent on the pages matters
–not handled by current commercial tools
Outlier detection before clustering is important
Results QA-ed by human subjects
–Results for clusters & outliers at both levels were subjectively good
No good way to find cluster quality analytically
Formation of the similarity graph is a slow process

16 Future Work
Improve the present method by:
–using cluster seeds for cluster growing
–using alternative clustering algorithms for each stage
–studying the effect of thresholds and the number of clusters on performance
–studying the importance of the order of page visits
–studying the importance of balanced clustering

17 Backup

18 Issues: Choice of Parameters
Number of session clusters, k, should be chosen appropriately
Thresholds for forming session & user similarity graphs:
–threshold value should be chosen after looking at the distribution of edge weights
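One simple way to look at the distribution of edge weights is a text histogram. This is only a sketch: the function names and bin count are hypothetical, the weights are assumed to lie in [0, 1], and the slides do not say how the distribution was actually inspected.

```python
def weight_histogram(weights, bins=10):
    """Count edge weights in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for w in weights:
        index = min(int(w * bins), bins - 1)  # a weight of exactly 1.0 goes in the last bin
        counts[index] += 1
    return counts


def print_histogram(weights, bins=10):
    """Print one row per bin so that gaps or modes in the weights are visible."""
    for b, count in enumerate(weight_histogram(weights, bins)):
        low, high = b / bins, (b + 1) / bins
        print(f"[{low:.1f}, {high:.1f}) {'#' * count} {count}")


# Made-up weights for illustration only.
print_histogram([0.05, 0.15, 0.95, 1.0])
```

A threshold placed in a sparse region between a low-weight mass and a high-weight mass is one reasonable choice, but that heuristic is an assumption rather than something the slides prescribe.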

19 Related Work
Research in Web Mining:
–Extraction of navigational patterns: Spiliopoulou, Faulstich
–Ordering relationships: Mannila, Meek
–Surfing prediction: Pitkow, Pirolli
–Clustering web usage sessions: Fu, Sandhu, Shih

20 Example
Sessions:
–Session 1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]
–Session 2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]
LCS pages = [(b)(d)(c)]
Corresponding index and time sequences (0-based positions):
–Index 1 = [(1)(2)(3)], Time 1 = [(100) (8) (5)]
–Index 2 = [(0)(1)(4)], Time 2 = [(5) (12) (5)]
Similarity over each LCS page: min/max of the two times
–Similarity on page b = 5/100 = 0.05
–Similarity on page d = 8/12 = 0.67
–Similarity on page c = 5/5 = 1.00

21 Example (contd.)
The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57
The importance component:
–Fraction of time spent in the LCS by Session 1 = 113/149 = 0.76
–Fraction of time spent in the LCS by Session 2 = 22/30 = 0.73
–The mean = (0.76 + 0.73)/2 = 0.75
The overall similarity = 0.57 x 0.75 = 0.43
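The worked example can be reproduced with a short sketch. The function names are hypothetical and all visit times are assumed positive. Note that b, d, a is an equally long common subsequence of the two sessions; the tie-breaking rule in the backtracking step is an assumption chosen so that the sketch recovers the b, d, c alignment used on the slides.

```python
def lcs_alignment(pages1, pages2):
    """Return index pairs (i, j) of one longest common subsequence."""
    n, m = len(pages1), len(pages2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if pages1[i - 1] == pages2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if pages1[i - 1] == pages2[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:  # tie-break: shorten session 1 first
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def session_similarity(s1, s2):
    """(time similarity in LCS) x (importance of LCS) for two (page, time) sessions."""
    pairs = lcs_alignment([p for p, _ in s1], [p for p, _ in s2])
    if not pairs:
        return 0.0
    t1 = [s1[i][1] for i, _ in pairs]
    t2 = [s2[j][1] for _, j in pairs]
    # Average min/max ratio of the two visit times over the LCS pages.
    time_sim = sum(min(a, b) / max(a, b) for a, b in zip(t1, t2)) / len(pairs)
    # Mean fraction of each session's total time that falls inside the LCS.
    total1 = sum(t for _, t in s1)
    total2 = sum(t for _, t in s2)
    importance = 0.5 * (sum(t1) / total1 + sum(t2) / total2)
    return time_sim * importance


s1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
s2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
print(lcs_alignment([p for p, _ in s1], [p for p, _ in s2]))  # [(1, 0), (2, 1), (3, 4)]
print(round(session_similarity(s1, s2), 2))                   # 0.43
```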

22 Issues: Session Resolution
Generate coarse-resolution paths using the concept hierarchy of the website
Reduces computation; increases interpretability of results
Original path: (/authors/ramesh_mahadevan.html,3) (/articles/rm_phattas.html,75) (/articles/rm_desidads.html,39)
Concept-level path: (/authors,3) (/articles,114)
Original path: (/authors/arun_sampath.html,109) (/philosophy/messages/1951.html,102) (/philosophy/messages/1953.html,46) (/,3) (/philosophy/messages/1954.html,69)
Concept-level path: (/authors,109) (/philosophy,148) (/,3) (/philosophy,69)
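The coarsening shown above can be sketched as follows. It assumes that the concept of a page is its top-level directory, which matches both examples but is not stated on the slide, and that consecutive visits to the same concept are merged by summing their times. The function names are hypothetical.

```python
from itertools import groupby


def concept_of(path):
    """Map a URL path to its top-level directory, e.g. /authors/x.html -> /authors."""
    directory = path.rsplit("/", 1)[0]      # drop the file name, if any
    top = directory.strip("/").split("/")[0]
    return "/" + top                        # pages directly under the root map to "/"


def coarsen(session):
    """Merge consecutive visits that share a concept, summing the time spent."""
    mapped = [(concept_of(page), seconds) for page, seconds in session]
    return [
        (concept, sum(seconds for _, seconds in group))
        for concept, group in groupby(mapped, key=lambda visit: visit[0])
    ]


session = [
    ("/authors/arun_sampath.html", 109),
    ("/philosophy/messages/1951.html", 102),
    ("/philosophy/messages/1953.html", 46),
    ("/", 3),
    ("/philosophy/messages/1954.html", 69),
]
print(coarsen(session))
# [('/authors', 109), ('/philosophy', 148), ('/', 3), ('/philosophy', 69)]
```

Only adjacent visits are merged, so the two separate stays in /philosophy remain distinct and the order of the session is preserved.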

23 Comments
Results QA-ed by human subjects
–Results for clusters & outliers at both levels were subjectively good
–No good way to find cluster quality analytically
Clustering algorithms for the two stages
–Stage I: graph partitioning works well for large sparse graphs, so it is desirable in this stage
–Stage II: since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
Cluster space
–Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem

