Mining Click-stream Data With Statistical and Rule-based Methods Martin Labský, Vladimír Laš, Petr Berka University of Economics, Prague.

1 Mining Click-stream Data With Statistical and Rule-based Methods Martin Labský, Vladimír Laš, Petr Berka University of Economics, Prague

2 Discovery Challenge 2005: The Clickstream Data
3 617 171 page requests, each containing: unix time; IP address; session ID; page request; referer
522 410 sessions, of which only 203 887 have length > 1; these are split into:
100 000 training set
60 000 testing set
43 887 heldout set

3 Clickstream Data Preprocessing
unix time ;IP address    ;session ID                      ;page request;referer
1074589200;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=124 ;www.google.cz;
1074589201;194.213.35.234;3995b2c0599f1782e2b40582823b1c94;/dp/?id=182 ;
1074589202;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/           ;www.seznam.cz;
1074589233;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=148 ;/dp/?id=124;
1074589245;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/        ;/dp/?id=148;
1074589248;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/contacts/  ;/;
1074589290;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/        ;/sb/;
Sequences of page visits in each session (same session ID) were constructed from the www log data
Sequences of page types: [start, dp, dp, sb, sb, end]
Sequences of products: [start, 124, 148]
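The preprocessing step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`page_type`, `build_sequences`), the label `home` for the root page `/`, and the reading of `id=` on a request as a product identifier are not from the slides, and the log is assumed to be already sorted by time.

```python
import re
from collections import defaultdict

# The seven sample log lines from the slide (fields separated by ';').
LOG = """\
1074589200;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=124 ;www.google.cz;
1074589201;194.213.35.234;3995b2c0599f1782e2b40582823b1c94;/dp/?id=182 ;
1074589202;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/ ;www.seznam.cz;
1074589233;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=148 ;/dp/?id=124;
1074589245;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/ ;/dp/?id=148;
1074589248;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/contacts/ ; /;
1074589290;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/ ;/sb/;
"""

def page_type(request):
    """First path component of the request; 'home' for '/' (assumed label)."""
    first = request.strip().strip("/").split("/")[0]
    return first if first else "home"

def build_sequences(log_text):
    """Group requests by session ID, keeping the order of the log."""
    types = defaultdict(list)
    products = defaultdict(list)
    for line in log_text.splitlines():
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip headers and malformed lines
        session, request = fields[2], fields[3]
        types[session].append(page_type(request))
        match = re.search(r"[?&]id=(\d+)", request)
        if match:  # assumption: 'id=' identifies a product
            products[session].append(match.group(1))
    page_seqs = {s: ["start"] + t + ["end"] for s, t in types.items()}
    product_seqs = {s: ["start"] + p for s, p in products.items()}
    return page_seqs, product_seqs

page_seqs, product_seqs = build_sequences(LOG)
print(page_seqs["1993441e8a0a4d7a4407ed9554b64ed1"])
print(product_seqs["1993441e8a0a4d7a4407ed9554b64ed1"])
```

For the session that appears four times in the sample, this reproduces the two example sequences given on the slide.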

4 Predicting New Page in a Sequence
Problem: observing a sequence of pages A_1 A_2 … A_{n-1}, what will be the next page A_n?
Methods: Markov n-gram models; decision rules

5 Markov N-gram Predictor (1/5)
Probability of a sequence A_1 A_2 … A_n, with each term (an interpolated k-gram distribution) computed as [equation not preserved in the transcript]

6 Markov N-gram Predictor (2/5)
where n(xy) is the number of occurrences of the sequence xy in the data and [equation not preserved in the transcript]
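The formulas on these two slides are not present in the transcript. For orientation only, a standard linearly interpolated k-gram model that is consistent with the surrounding text (the distributions P_i, their weights, and the counts n(xy)) can be written as follows. The exact form, the symbol λ_i for the weights, and N for the total number of page visits are assumptions, not the authors' notation.

```latex
% Chain rule with a history limited to the previous k-1 pages
P(A_1 A_2 \ldots A_n) \approx \prod_{j=1}^{n} P(A_j \mid A_{j-k+1} \ldots A_{j-1})

% Each term as a weighted mixture of 1-gram, ..., k-gram estimates,
% where P_i conditions on the last i-1 pages of the history
P(c \mid a \ldots b) = \sum_{i=1}^{k} \lambda_i \, P_i(c \mid \cdot),
\qquad \sum_{i=1}^{k} \lambda_i = 1, \quad \lambda_i \ge 0

% Relative-frequency estimates from counts
P_1(y) = \frac{n(y)}{N}, \qquad P_2(y \mid x) = \frac{n(xy)}{n(x)}
```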

7 Markov N-gram Predictor (3/5)
Building the model using the EM algorithm
1. compute P_i (i = 1, …, k) from counts of sequences observed in the training set D_TR
2. assign non-zero initial values to the weights λ_i
3. repeat
3.1 compute the probability of the heldout set using the interpolated distribution
3.2 modify the weights
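The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual EM update for linear interpolation weights (each weight becomes its normalized expected share of the heldout events), runs a fixed number of iterations instead of a convergence test, and uses tiny made-up training and heldout sets. All function and variable names are hypothetical.

```python
from collections import Counter

def ngram_counts(sequences, k):
    """Count all contiguous n-grams, n = 1..k (step 1)."""
    counts = Counter()
    for seq in sequences:
        for n in range(1, k + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return counts

def p_i(counts, total, history, c, i):
    """i-gram estimate of c given the last i-1 symbols of history."""
    if i == 1:
        return counts[(c,)] / total
    ctx = tuple(history[-(i - 1):])
    if len(ctx) < i - 1 or counts[ctx] == 0:
        return 0.0  # history too short or context unseen
    return counts[ctx + (c,)] / counts[ctx]

def em_weights(counts, total, heldout, k, iters=20):
    lam = [1.0 / k] * k  # step 2: non-zero initial weights
    for _ in range(iters):  # step 3: fixed iteration count (assumption)
        expected = [0.0] * k
        for seq in heldout:
            for pos in range(1, len(seq)):
                # 3.1: interpolated probability of the heldout event
                parts = [lam[i] * p_i(counts, total, seq[:pos], seq[pos], i + 1)
                         for i in range(k)]
                z = sum(parts)
                if z == 0:
                    continue  # symbol never seen in training
                for i in range(k):
                    expected[i] += parts[i] / z
        s = sum(expected)
        if s == 0:
            break
        lam = [e / s for e in expected]  # 3.2: modify the weights
    return lam

def interpolated(counts, total, lam, history, c):
    return sum(l * p_i(counts, total, history, c, i + 1)
               for i, l in enumerate(lam))

# Toy data (made up for illustration only).
train = [["start", "dp", "dp", "sb", "sb", "end"],
         ["start", "dp", "sb", "end"],
         ["start", "home", "contacts", "end"]]
heldout = [["start", "dp", "sb", "sb", "end"]]

k = 3
counts = ngram_counts(train, k)
total = sum(v for g, v in counts.items() if len(g) == 1)
lam = em_weights(counts, total, heldout, k)
vocab = sorted({g[0] for g in counts if len(g) == 1})
nxt = max(vocab, key=lambda c: interpolated(counts, total, lam, ["start", "dp"], c))
print(lam, nxt)
```

Estimating the weights on heldout data rather than on the training set keeps the highest-order model from receiving all the weight simply because it fits the training counts best.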

8 Markov N-gram Predictor (4/5) Results for page types

9 Markov N-gram Predictor (5/5) Results for product types

10 Rule Induction Algorithms (1/5)
"Classical" decision rules have the form Ant => Class (p), where Ant is a conjunction of values of input attributes and p = n(Ant ∧ Class)/n(Ant)
Decision rules for clickstreams have the form Ant => page (p), where Ant is a sequence of pages and p = n(Ant//page)/n(Ant)
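The validity p of a clickstream rule can be computed by counting contiguous subsequences. The sketch below assumes that Ant//page means Ant immediately followed by page and that n(.) counts occurrences (not sessions); neither is stated explicitly on the slide, and the function names and toy sequences are hypothetical.

```python
def count_occurrences(sequences, pattern):
    """Number of times `pattern` occurs as a contiguous subsequence."""
    pattern = list(pattern)
    m = len(pattern)
    return sum(1
               for seq in sequences
               for i in range(len(seq) - m + 1)
               if seq[i:i + m] == pattern)

def rule_validity(sequences, ant, page):
    """p = n(Ant//page) / n(Ant) for the rule Ant => page."""
    n_ant = count_occurrences(sequences, ant)
    n_ant_page = count_occurrences(sequences, list(ant) + [page])
    return n_ant_page / n_ant if n_ant else 0.0

# Toy sequences (made up for illustration only).
sequences = [["start", "dp", "dp", "sb", "sb", "end"],
             ["start", "dp", "sb", "end"],
             ["start", "home", "contacts", "end"]]

p = rule_validity(sequences, ["dp", "sb"], "sb")
print(p)  # "dp, sb" occurs twice and is followed by "sb" once
```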

11 Rule Induction Algorithms (2/5)
Set-covering algorithm
1. find a rule that covers some positive examples and no negative example of a given class (concept)
2. remove covered examples from the training set D_TR
3. if D_TR still contains some positive examples (not covered so far) go to 1, else end
Compositional algorithm
1. add an empty rule to the rule set KB
2. repeat
2.1 find by rule specialization a rule Ant => Class that fulfils the user-given criteria on length and validity
2.2 if this rule significantly improves the set of rules KB built so far (using the chi-square test, we test the difference between the rule validity and the result of classifying an example covered by Ant), then add the rule to KB
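The set-covering loop above can be sketched for attribute-value examples as follows. This is a generic illustration, not the authors' algorithm: the greedy choice of the condition with the highest precision is an assumption (the slides use a chi-square test to decide on specialization), and the attribute names `prev` and `prev2` and the toy examples are made up.

```python
def covers(rule, x):
    """A rule is a dict of attribute -> required value."""
    return all(x.get(a) == v for a, v in rule.items())

def find_rule(pos, neg):
    """Step 1: specialize until no negative example is covered."""
    rule = {}
    cov_pos, cov_neg = list(pos), list(neg)
    while cov_neg:
        candidates = {(a, v) for x in cov_pos
                      for a, v in x.items() if a not in rule}
        if not candidates:
            return None  # remaining negatives cannot be excluded

        def score(cond):
            a, v = cond
            p = sum(1 for x in cov_pos if x.get(a) == v)
            n = sum(1 for x in cov_neg if x.get(a) == v)
            return (p / (p + n), p)  # precision first, then coverage

        a, v = max(sorted(candidates), key=score)
        rule[a] = v
        cov_pos = [x for x in cov_pos if x.get(a) == v]
        cov_neg = [x for x in cov_neg if x.get(a) == v]
    return rule

def set_covering(examples, target):
    pos = [x for x, y in examples if y == target]
    neg = [x for x, y in examples if y != target]
    rules = []
    while pos:  # step 3: repeat while uncovered positives remain
        rule = find_rule(pos, neg)
        if rule is None:
            break
        rules.append(rule)
        pos = [x for x in pos if not covers(rule, x)]  # step 2
    return rules

# Toy examples: the two previous pages as attributes, next page as class.
examples = [({"prev2": "dp", "prev": "sb"}, "sb"),
            ({"prev2": "sb", "prev": "sb"}, "end"),
            ({"prev2": "start", "prev": "dp"}, "dp"),
            ({"prev2": "dp", "prev": "dp"}, "sb"),
            ({"prev2": "start", "prev": "ct"}, "end")]

rules = set_covering(examples, "sb")
print(rules)
```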

12 Rule Induction Algorithms (3/5)
Set-covering algorithm
rule specialization extends the antecedent sequence by any sequence member from the left
the decision whether the antecedent sequence should be specialized is made by a chi-square test
adding a rule CD => X to the rule D => X changes the rule D => X into a rule meaning (D but not CD) => X
Compositional algorithm
rule specialization extends the antecedent sequence by any sequence member from the left
the decision whether a new rule should be added is made by a chi-square test

13 Rule Induction Algorithms (4/5)
Rule-based classification
Set-covering algorithm: apply a single rule
Compositional algorithm: combine the contributions of all applicable rules using a pseudo-Bayesian formula
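The slide does not give the pseudo-Bayesian formula. One combination function commonly used for this purpose in rule-based systems is x ⊕ y = xy / (xy + (1 − x)(1 − y)); the sketch below assumes that this is the kind of formula meant, which the transcript does not confirm. The function names and the example weights are hypothetical.

```python
from functools import reduce

def combine(x, y):
    """Assumed pseudo-Bayesian combination of two rule weights in (0, 1)."""
    return x * y / (x * y + (1 - x) * (1 - y))

def combined_weight(weights):
    """Fold the weights of all applicable rules; 0.5 is the neutral element."""
    return reduce(combine, weights, 0.5)

# Two applicable rules that both support the same page with weight 0.8
# reinforce each other; a weight of 0.5 leaves the result unchanged.
w = combined_weight([0.8, 0.8])
print(w)
```

With this function, weights above 0.5 push the result up and weights below 0.5 push it down, so the order in which rules are applied does not matter. The weights must stay strictly between 0 and 1 to avoid a zero denominator.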

14 Rule Induction Algorithms (5/5)
Examples of rules
For the page type sequences
dp, sb -> sb (Ant: 5174; AntSuc: 4801; P: 93%)
ct -> end (Ant: 5502; AntSuc: 1759; P: 32%)
faq -> help (Ant: 594; AntSuc: 127; P: 21%)
For the product sequences
loud-speakers -> video (Ant: 14840; AntSuc: 3785; P: 26%)
data cables -> telephones (Ant: 2560; AntSuc: 565; P: 22%)
PC peripheries -> telephones (Ant: 8671; AntSuc: 1823; P: 21%)

15 Results of testing

16 Conclusions
Comparison of methods
N-gram models as exhaustive sets of compositional rules: P_i(c|a…b) ~ a…b => c
Set-covering algorithm for exhaustive non-compositional rules
Compositional algorithm for non-exhaustive compositional rules
Comparison of results
N-gram models are comparable with the set-covering algorithm (slightly better); the compositional algorithm gives the worst results
All algorithms can be applied by web servers to recommend relevant pages to their users, and to identify interesting patterns in their log files.

17 Thank you for your attention.
