1 A New Reactive Method for Processing Web Usage Data. Murat Ali Bayır, Middle East Technical University, Department of Computer Engineering, Ankara, Turkey

2 OUTLINE (Murat Ali Bayir, June 06)
– Web Mining
– Previous Session Reconstruction Heuristics
– Smart-SRA
– Agent Simulator
– Experimental Results
– Conclusion

3 Data & Web Mining
Data mining: discovery of useful and interesting patterns from a large dataset.
Web mining: the application of data mining techniques to discover and retrieve useful information and patterns from World Wide Web documents and services.
Dimensions:
– Web content mining
– Web structure mining
– Web usage mining

4 Web Mining: Web Usage Mining (WUM)
Web usage mining is the application of data mining techniques to web log data in order to discover user access patterns. The web access log captures the information needed for WUM.
Example user web access log:
IP Address | Request Time | Method | URL | Protocol | Return Code | Bytes Transmitted
144.123.121.23 | [25/Apr/2005:03:04:41–05] | GET | A.html | HTTP/1.0 | 200 | 3290
144.123.121.23 | [25/Apr/2005:03:04:43–05] | GET | B.html | HTTP/1.0 | 200 | 2050
144.123.121.23 | [25/Apr/2005:03:04:48–05] | GET | C.html | HTTP/1.0 | 200 | 4130
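A minimal Python sketch of how such a log could be read, assuming each entry is a single whitespace-separated line with the columns in the order shown above. The regular expression, the name `group_requests_by_ip`, and the sample lines (including the `-0500` offset) are illustrative assumptions, not part of the presentation.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line layout: ip [time offset] method url protocol status bytes
LOG_LINE = re.compile(
    r"(?P<ip>\S+) \[(?P<time>[^\]\s]+)[^\]]*\] (?P<method>\S+) (?P<url>\S+) "
    r"(?P<protocol>\S+) (?P<status>\d{3}) (?P<size>\d+)"
)

def group_requests_by_ip(lines):
    """Return {ip: [(url, datetime), ...]} with each list ordered by request time."""
    by_ip = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S")
        by_ip[match["ip"]].append((match["url"], when))
    for requests in by_ip.values():
        requests.sort(key=lambda request: request[1])
    return dict(by_ip)

sample_lines = [
    "144.123.121.23 [25/Apr/2005:03:04:41 -0500] GET A.html HTTP/1.0 200 3290",
    "144.123.121.23 [25/Apr/2005:03:04:43 -0500] GET B.html HTTP/1.0 200 2050",
    "144.123.121.23 [25/Apr/2005:03:04:48 -0500] GET C.html HTTP/1.0 200 4130",
]
per_user = group_requests_by_ip(sample_lines)
```

Grouping by IP address is only the simplest way to approximate "the same user"; the per-user request lists are the input that the session reconstruction heuristics then split into sessions.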

5 Web Mining: Phases of Web Usage Mining
1. Data Processing: reconstruction of user sessions by using heuristic techniques. This is the most important phase, since it directly affects the quality of the frequent patterns extracted in the final step.
2. Pattern Discovery: discovery of useful patterns from the reconstructed sessions obtained in the first phase. Related work on the pattern discovery phase: [Bayir 06-1].

7 Previous Session Reconstruction Heuristics: Session Reconstruction
Session reconstruction selects and groups the requests belonging to the same user by using heuristic techniques. Types:
– Reactive strategies process requests after the web server has handled them: they process web server logs to obtain sessions. The approach proposed in this thesis is reactive.
– Proactive strategies process requests during the user's interactive browsing of the web site, so session data is gathered during the interaction. They are applied on dynamic server pages.

8 Previous Reactive Heuristics: Session Reconstruction
Proactive strategies require changes to the internal structure of the web site, for example changes to the source code of each dynamic web page. Reactive strategies require no change and are used for web analytics: customers provide the web logs of their site, which are then analyzed with these methods. Reactive methods are applicable to all web sites that use the same log format.

9 Previous Reactive Heuristics
Two types of reactive heuristics have been defined previously:
– Time-oriented heuristics [Spiliopoulou 98, Cooley 99-1]
– Navigation-oriented heuristic [Cooley 99-1, Cooley 99-2]
Smart-SRA [Bayir 06-2] is the new approach proposed in this thesis. It combines these heuristics with web topology information in order to increase the accuracy of the reconstructed sessions.

10 Previous Reactive Heuristics: example data
The topology of a web site can be represented by a directed web graph. The topology information can be extracted by using the crawling module of search engine APIs.
[Figure: example web topology graph used for applying the heuristics]
Example web page request sequence (timestamps in minutes):
Page | P1 | P20 | P13 | P49 | P34 | P23
Timestamp | 0 | 6 | 15 | 29 | 32 | 47

11 Previous Session Reconstruction Heuristics: Time-Oriented Heuristic 1
Two types of time-oriented heuristics are defined. In the first, the total duration of a discovered session is limited by a threshold δ1.
Page | P1 | P20 | P13 | P49 | P34 | P23
Timestamp | 0 | 6 | 15 | 29 | 32 | 47
Example with time threshold δ1 = 30 min:
1. [P1, P20, P13, P49] (t(P49) - t(P1) = 29 < 30)
2. [P34, P23] (t(P23) - t(P34) = 15 < 30)
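The first time-oriented heuristic can be sketched in a few lines of Python. This is an illustrative sketch, not the author's code: the function name `split_by_total_duration` and the choice to treat a duration exactly equal to the threshold as still allowed are assumptions (the example data never hits that boundary).

```python
def split_by_total_duration(requests, max_duration):
    """Split a time-ordered list of (page, timestamp) pairs into sessions whose
    total duration (last timestamp minus first) does not exceed max_duration."""
    sessions = []
    current = []
    for page, timestamp in requests:
        # Close the session once this request is too far from the session's first request.
        if current and timestamp - current[0][1] > max_duration:
            sessions.append([p for p, _ in current])
            current = []
        current.append((page, timestamp))
    if current:
        sessions.append([p for p, _ in current])
    return sessions

# Request sequence from the slide, timestamps in minutes.
requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_total_duration(requests, 30))
# [['P1', 'P20', 'P13', 'P49'], ['P34', 'P23']]
```

P34 opens a new session because 32 minutes have passed since P1, which exceeds the 30 minute limit.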

12 Previous Session Reconstruction Heuristics: Time-Oriented Heuristic 2
The time spent on any page is limited by a threshold δ2. That means t(P_{n+1}) - t(P_n) < δ2 for consecutive pages of a session.
Page | P1 | P20 | P13 | P49 | P34 | P23
Timestamp | 0 | 6 | 15 | 29 | 32 | 47
Example with time threshold δ2 = 10 min:
1. [P1, P20, P13]
2. [P49, P34]
3. [P23]
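The second time-oriented heuristic differs only in what is measured: the gap between consecutive requests rather than the span of the whole session. A hedged Python sketch follows; the name `split_by_page_stay` is hypothetical, and a gap exactly equal to the threshold is kept in the same session here, which is an assumption.

```python
def split_by_page_stay(requests, max_stay):
    """Split a time-ordered list of (page, timestamp) pairs wherever the gap
    between two consecutive requests exceeds max_stay."""
    sessions = []
    current = []
    previous_time = None
    for page, timestamp in requests:
        # A long gap means the user left; the next request opens a new session.
        if current and timestamp - previous_time > max_stay:
            sessions.append(current)
            current = []
        current.append(page)
        previous_time = timestamp
    if current:
        sessions.append(current)
    return sessions

# Request sequence from the slide, timestamps in minutes.
requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_page_stay(requests, 10))
# [['P1', 'P20', 'P13'], ['P49', 'P34'], ['P23']]
```

The splits fall before P49 (a 14 minute gap after P13) and before P23 (a 15 minute gap after P34).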

13 Previous Session Reconstruction Heuristics: Navigation-Oriented Heuristic
When processing the user request sequence, there are two cases for adding a new page WP_{N+1} to a session [WP_1, WP_2, …, WP_N]:
– If WP_N has a hyperlink to WP_{N+1}, append the page: [WP_1, WP_2, …, WP_N, WP_{N+1}]
– If WP_N does not have a hyperlink to WP_{N+1}, let WP_Kmax be the nearest page having a hyperlink to WP_{N+1} and add backward browser moves: [WP_1, WP_2, …, WP_N, WP_{N-1}, WP_{N-2}, ..., WP_Kmax, WP_{N+1}]

14 Previous Session Reconstruction Heuristics: Navigation-Oriented Heuristic
Example, applied to the user request sequence:
Current Session | Condition | New Page
[ ] | (none) | P1
[P1] | Link[P1, P20] = 1 | P20
[P1, P20] | Link[P20, P13] = 0, Link[P1, P13] = 1 | P13
[P1, P20, P1, P13] | Link[P13, P49] = 1 | P49
[P1, P20, P1, P13, P49] | Link[P49, P34] = 0, Link[P13, P34] = 1 | P34
[P1, P20, P1, P13, P49, P13, P34] | Link[P34, P23] = 1 | P23
Final session: [P1, P20, P1, P13, P49, P13, P34, P23]
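The navigation-oriented heuristic traced above can be sketched in Python as follows. The link set contains only the hyperlinks that the example conditions state as present (Link[...] = 1). The function name and the decision to start a new session when no earlier page links to the new one are assumptions, since the slides do not cover that case.

```python
def navigation_oriented(pages, links):
    """Build sessions from a request sequence, inserting backward browser moves
    when the latest page has no hyperlink to the newly requested page."""
    sessions = []
    current = []
    for page in pages:
        # Find the nearest page (searching backwards) with a hyperlink to the new page.
        k = len(current) - 1
        while k >= 0 and (current[k], page) not in links:
            k -= 1
        if k < 0:
            # No referrer found (or empty session): start a new session (assumption).
            if current:
                sessions.append(current)
            current = [page]
        else:
            # Backward moves revisit the pages from position N-1 down to position k.
            backward_moves = current[k:len(current) - 1][::-1]
            current = current + backward_moves + [page]
    if current:
        sessions.append(current)
    return sessions

links = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"), ("P34", "P23")}
print(navigation_oriented(["P1", "P20", "P13", "P49", "P34", "P23"], links))
# [['P1', 'P20', 'P1', 'P13', 'P49', 'P13', 'P34', 'P23']]
```

The inserted P1 and P13 entries are the artificial back moves; they never appeared in the server log.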

16 Smart-SRA
Phase 1: Shorter request sequences (candidate sessions) are constructed by using the overall session duration time and page-stay time criteria.
Phase 2: Candidate sessions are partitioned into maximal sub-sessions such that between each consecutive page pair in a session there is a hyperlink from the previous page to the next page.
Topology rule: ∀i: 1 ≤ i < n, there is a hyperlink from P_i to P_{i+1}
Time rules:
– ∀i: 1 ≤ i < n, Timestamp(P_i) < Timestamp(P_{i+1})
– ∀i: 1 ≤ i < n, Timestamp(P_{i+1}) - Timestamp(P_i) ≤ ρ (page stay time)
– Timestamp(P_n) - Timestamp(P_1) ≤ δ (session duration time)
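The topology and time rules can be read as a validity check on a single session. The Python sketch below is illustrative only: `is_valid_session`, `max_stay` (page stay limit) and `max_duration` (session duration limit) are hypothetical names, the link set is assumed rather than taken from the topology figure, and the 10 and 30 minute values are borrowed from the time-oriented examples.

```python
def is_valid_session(session, links, max_stay, max_duration):
    """Check a list of (page, timestamp) pairs against the topology rule
    and the three time rules."""
    for (page, time), (next_page, next_time) in zip(session, session[1:]):
        if (page, next_page) not in links:
            return False          # topology rule: hyperlink between consecutive pages
        if not time < next_time:
            return False          # timestamps must strictly increase
        if next_time - time > max_stay:
            return False          # page stay time limit
    if session and session[-1][1] - session[0][1] > max_duration:
        return False              # session duration limit
    return True

# Assumed hyperlinks for illustration.
links = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"), ("P34", "P23")}
print(is_valid_session([("P1", 0), ("P13", 9), ("P34", 14), ("P23", 15)], links, 10, 30))  # True
print(is_valid_session([("P1", 0), ("P20", 6), ("P13", 9)], links, 10, 30))  # False: no link P20 -> P13
```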

17 Smart-SRA
Phase 2 of Smart-SRA processes a candidate session from left to right by repeating the following steps until the candidate session is empty:
1. Determine the web pages without any referrer (on their left) and remove them from the candidate session.
2. For each of these pages and for each previously constructed session: if there is a hyperlink from the last page of the session to the web page and the page stay time constraint is satisfied, append the web page to the session.
3. Remove non-maximal sessions.

18 Smart-SRA: example data
[Figure: example web topology used for applying Smart-SRA]
Example candidate session:
Page | P1 | P20 | P13 | P49 | P34 | P23
Timestamp | 0 | 6 | 9 | 12 | 14 | 15

19 Smart-SRA: example run (the temp page set holds the pages without referrers in the candidate session)
Iteration 1
– Candidate session: [P1, P20, P13, P49, P34, P23]
– New session set (before): (empty)
– Temp page set: {P1}
– Temp session set: [P1]
– New session set (after): [P1]
Iteration 2
– Candidate session: [P20, P13, P49, P34, P23]
– New session set (before): [P1]
– Temp page set: {P20, P13}
– Temp session set: [P1, P20], [P1, P13]
– New session set (after): [P1, P20], [P1, P13]
Iteration 3
– Candidate session: [P49, P34, P23]
– New session set (before): [P1, P20], [P1, P13]
– Temp page set: {P49, P34}
– Temp session set: [P1, P13, P34], [P1, P13, P49]
– New session set (after): [P1, P13, P34], [P1, P13, P49], [P1, P20]
Iteration 4
– Candidate session: [P23]
– New session set (before): [P1, P13, P34], [P1, P13, P49], [P1, P20]
– Temp page set: {P23}
– Temp session set: [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
– New session set (after): [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
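A rough Python sketch of Phase 2, written from the step list and the example run rather than from the author's implementation. Several things are assumptions: the hyperlink set is inferred from the sessions in the example (the topology figure itself is not available here), the page stay limit of 10 minutes is borrowed from the time-oriented example, and a page that cannot extend any session after the first iteration is simply dropped.

```python
def smart_sra_phase2(candidate, links, max_stay):
    """candidate: time-ordered list of (page, timestamp).
    Returns the maximal sessions as lists of page names."""
    remaining = list(candidate)
    sessions = []  # each session is a list of (page, timestamp)
    while remaining:
        # Step 1: pages with no referrer to their left in the remaining candidate.
        starters = [
            (page, time) for i, (page, time) in enumerate(remaining)
            if not any((earlier, page) in links for earlier, _ in remaining[:i])
        ]
        remaining = [request for request in remaining if request not in starters]
        if not sessions:
            sessions = [[starter] for starter in starters]
            continue
        # Step 2: append each starter to every session it can legally extend.
        extended, used = [], set()
        for page, time in starters:
            for index, session in enumerate(sessions):
                last_page, last_time = session[-1]
                if (last_page, page) in links and 0 < time - last_time <= max_stay:
                    extended.append(session + [(page, time)])
                    used.add(index)
        # Step 3: a session that was extended is no longer maximal, so drop it.
        sessions = extended + [s for i, s in enumerate(sessions) if i not in used]
    return [[page for page, _ in session] for session in sessions]

# Hyperlinks inferred from the example run (assumption).
links = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
         ("P49", "P23"), ("P34", "P23"), ("P20", "P23")}
candidate = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]
print(smart_sra_phase2(candidate, links, 10))
# [['P1', 'P13', 'P49', 'P23'], ['P1', 'P13', 'P34', 'P23'], ['P1', 'P20', 'P23']]
```

The loop always terminates because the first remaining page never has a page to its left, so at least one page is removed per iteration.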

21 Agent Simulator
– Models the behavior of web users and generates both the web user navigation and the log data kept by the web server.
– Used to compare the performance of alternative session reconstruction heuristics.

22 Agent Simulator
The simulator provides 4 basic behaviors of a web user:
1. A web user can start a session with any one of the possible entry pages of a web site.
2. A web user can select the next page having a link from the most recently accessed page.
3. A web user can press the back button one or more times and thus select as the next page a page having a link from any one of the previously browsed pages (i.e., pages accessed before the most recently accessed one).
4. A web user can terminate his/her session.

23 Agent Simulator: Behavior I
A web user can start a new session with any one of the possible entry pages of the web site.
[Figure: web topology graph (P1, P13, P20, P23, P34, P49) showing two session starts, S1 and S2]

24 Agent Simulator: Behavior II
A web user can select a new page having a link from the most recently accessed page.
[Figure: web topology graph (P1, P13, P20, P23, P34, P49) with navigation steps 1 and 2]

25 Agent Simulator: Behavior III
A web user can select as the next page a page having a link from any one of the previously browsed pages.
[Figure: web topology graph (P1, P13, P20, P23, P34, P49) with navigation steps 1 to 5]

26 Agent Simulator: Behavior IV
A web user can terminate the session. The example session is terminated in P23.
[Figure: web topology graph (P1, P13, P20, P23, P34, P49) with navigation steps 1 to 6]

27 Agent Simulator
3 parameters for simulating the behavior of a web user:
– Session Termination Probability (STP)
– Link from Previous pages Probability (LPP)
– New Initial page Probability (NIP)
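To make the role of the three parameters concrete, here is a small Python sketch of one simulated agent. It is not the author's simulator. The slides do not say how STP, LPP and NIP are combined, so this sketch assumes one random draw per step with mutually exclusive ranges; it also assumes that a back move closes the current ground-truth session and opens a new one made of the path up to the revisited page plus the newly chosen page. The function name, the `max_steps` cap, and the link set are likewise hypothetical.

```python
import random

def simulate_agent(links, entry_pages, stp, lpp, nip, rng, max_steps=1000):
    """Return (ground-truth sessions, requested pages in log order) for one agent."""
    successors = {}
    for source, target in sorted(links):
        successors.setdefault(source, []).append(target)

    sessions = []
    current = [rng.choice(entry_pages)]          # Behavior I: start at an entry page
    requests = list(current)
    for _ in range(max_steps):
        draw = rng.random()
        if draw < stp:                           # Behavior IV: terminate
            break
        if draw < stp + nip:                     # Behavior I again: new initial page
            sessions.append(current)
            current = [rng.choice(entry_pages)]
            requests.append(current[0])
            continue
        if draw < stp + nip + lpp:               # Behavior III: link from an earlier page
            earlier = [i for i, page in enumerate(current[:-1]) if successors.get(page)]
            if earlier:
                i = rng.choice(earlier)
                next_page = rng.choice(successors[current[i]])
                sessions.append(current)
                current = current[:i + 1] + [next_page]
                requests.append(next_page)
                continue
        options = successors.get(current[-1])    # Behavior II: follow a link forward
        if not options:
            break                                # dead end: nothing left to click
        next_page = rng.choice(options)
        current.append(next_page)
        requests.append(next_page)
    sessions.append(current)
    return sessions, requests

# Assumed topology for illustration; fixed values from the experiments: STP 5%, LPP 30%, NIP 30%.
links = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
         ("P49", "P23"), ("P34", "P23"), ("P20", "P23")}
sessions, requests = simulate_agent(links, ["P1"], 0.05, 0.30, 0.30, random.Random(0))
```

Because every ground-truth session is built only from hyperlink steps, the simulator's sessions can be compared directly against what a heuristic reconstructs from `requests`.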

29 Experimental Results: Heuristics Tested
– Time-oriented heuristic (heur1): total time ≤ 30 min
– Time-oriented heuristic (heur2): page stay ≤ 10 min
– Navigation-oriented heuristic (heur3)
– Smart-SRA heuristic (heur4)

30 Experimental Results: Accuracy
Accuracy is determined as follows: a reconstructed session H captures a real session R if R occurs as a contiguous subsequence of H (R ⊏ H). A string-matching relation is needed, so the pages of R must appear in H consecutively and in order.
R = [P1, P3, P5]
H = [P9, P1, P3, P5, P8] => R ⊏ H: Yes
H = [P1, P9, P3, P5, P8] => R ⊏ H: No
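The capture relation is a contiguous-sublist test, which a short Python sketch makes explicit. The `accuracy` helper is an assumption about how the relation is turned into a score (the share of real sessions captured by at least one reconstructed session); the transcript does not preserve the exact formula, and both function names are hypothetical.

```python
def captures(reconstructed, real):
    """True if `real` occurs inside `reconstructed` as a contiguous run of pages."""
    length = len(real)
    return any(
        reconstructed[start:start + length] == real
        for start in range(len(reconstructed) - length + 1)
    )

def accuracy(real_sessions, reconstructed_sessions):
    """Assumed score: fraction of real sessions captured by some reconstructed session."""
    captured = sum(
        any(captures(h, r) for h in reconstructed_sessions) for r in real_sessions
    )
    return captured / len(real_sessions)

real = ["P1", "P3", "P5"]
print(captures(["P9", "P1", "P3", "P5", "P8"], real))  # True
print(captures(["P1", "P9", "P3", "P5", "P8"], real))  # False
```

The second case fails because P9 sits between P1 and P3, so an ordinary (non-contiguous) subsequence test would be too lenient here.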

31 Experimental Results: Parameters for generating user sessions and web topology
Parameter | Value
Number of web pages (nodes) in topology | 300
Average outdegree | 15
Average page stay time | 2.2 min
Deviation of page stay time | 0.5 min
Number of agents | 10000
STP | fixed: 5%, range: 1%-20%
LPP | fixed: 30%, range: 0%-90%
NIP | fixed: 30%, range: 0%-90%

32 Experimental Results: Accuracy vs. STP
Increasing STP leads to sessions with fewer pages, which are easier to predict. In short sessions, the probability that the LPP and NIP behaviors occur is also small.

33 Experimental Results: Accuracy vs. LPP
As LPP increases, the real accuracy decreases. Increasing LPP leads to more complex sessions. Intelligent path completion is needed for discovering more accurate sessions.

34 Experimental Results: Accuracy vs. NIP
Increasing NIP causes more complex sessions, and the accuracy decreases for all heuristics. Path separation is needed for discovering more accurate sessions.

36 Conclusion
New session reconstruction heuristic: Smart-SRA
– Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous page to the next one).
– Does not insert artificial browser (back) requests in order to prevent unrelated consecutive requests.
– Discovers only maximal sessions.
The agent simulator simulates the behaviors of real WWW users, which makes it possible to evaluate the accuracy of heuristics. Experimental results show that Smart-SRA outperforms previous reactive heuristics.

37 References
[Bayir 06-1] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A Performance Comparison of Pattern Discovery Methods on Web Log Data. AICCSA-06, the 4th ACS/IEEE International Conference on Computer Systems and Applications.
[Bayir 06-2] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A New Approach for Reactive Web Usage Data Processing. ICDE Workshops, 44.
[Cooley 99-1] R. Cooley, B. Mobasher, J. Srivastava (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, Vol. 1, No. 1.
[Cooley 99-2] R. Cooley, P. Tan, J. Srivastava (1999). Discovery of Interesting Usage Patterns from Web Data. Advances in Web Usage Analysis and User Profiling, LNAI 1836, Springer, Berlin, Germany, 163-182.
[Spiliopoulou 98] M. Spiliopoulou, L. C. Faulstich (1998). WUM: A Tool for Web Utilization Analysis. Proceedings EDBT Workshop WebDB'98, LNCS 1590, Springer, Berlin, Germany, 184-203.

38 Thank you for listening. Any questions?

