Data e Web Mining 825368 Paolo Gobbo Smart Miner: A New Framework for Mining Large Scale Web Usage Data Bayir – Toroslu – Cosar - Fidan.

1 Data e Web Mining Paolo Gobbo Smart Miner: A New Framework for Mining Large Scale Web Usage Data Bayir – Toroslu – Cosar - Fidan

2 Data Mining on Web
Web mining: discover and retrieve useful and interesting patterns from large web datasets.
- Web content mining: text and multimedia documents; the real data in web pages.
- Web structure mining: hyperlink structure; data that describes the organization of the content.
- Web usage mining: web log records; data that describes the pattern of usage of web pages.

3 PreProcessing
[Diagram labels]
INPUT: Site File, Access Log, Referrer Log, Agent Log, Registration
PREPROCESSING: Site Crawler, Data Cleaning, Path Completion, Session Identification, User Identification, Transaction Identification
Other elements: Site Topology, User Session File, Transaction File, SQL Query

4 Session Identification
Session identification: partitioning each user's activities into sequences (sessions) of entries from web request logs.
- Time oriented heuristics: temporal boundaries (session length, page-stay).
- Navigation oriented heuristics: link between web pages.
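The time oriented heuristics above can be sketched as a simple splitter. This is a minimal illustration, not the presentation's own code: the function name, the (page, timestamp) request format, and the thresholds of 30 and 10 minutes are assumptions chosen for the example.

```python
def split_sessions(requests, max_session_length, max_page_stay):
    """Split one user's requests into sessions with time oriented heuristics.

    requests: list of (page, timestamp) pairs sorted by timestamp.
    A new session starts when the total session length or the time spent
    on the previous page exceeds its threshold (both thresholds are
    assumed parameters, in the same unit as the timestamps).
    """
    sessions = []
    current = []
    for page, ts in requests:
        if current:
            too_long = ts - current[0][1] > max_session_length
            idle = ts - current[-1][1] > max_page_stay
            if too_long or idle:
                sessions.append(current)
                current = []
        current.append((page, ts))
    if current:
        sessions.append(current)
    return sessions


# Hypothetical log for one user, timestamps in minutes.
log = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 25), ("P34", 28), ("P23", 45)]
for session in split_sessions(log, max_session_length=30, max_page_stay=10):
    print([page for page, _ in session])
```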

5 Sequential Mining
Association mining with the order of transactions.
- itemset/element: a set of items
- sequence: an ordered list of elements, where each element is an itemset
- sequence size: number of itemsets/elements
- sequence length: number of items
- subsequence
Goal: given a set of data sequences, find all sequences with a user-specified minimum support.
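A short sketch of these definitions, assuming the usual notion of containment from sequential pattern mining: a pattern is a subsequence of a data sequence if its itemsets can be matched, in order, to supersets among the sequence's itemsets. The function names and the sample data are illustrative only.

```python
def is_subsequence(pattern, sequence):
    """True if each itemset of pattern is contained, in order, in sequence."""
    pos = 0
    for element in pattern:
        # Greedily advance to the first itemset that contains this element.
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, sequences):
    """Fraction of data sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)


# Illustrative data sequences: each one is a list of itemsets.
data = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
print(support([{30}, {90}], data))
print(support([{30}, {40, 70}], data))
```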

6 Sequential Mining algorithms
Algorithms: GSP, APrioriAll, APrioriSome.
Phases:
- Sort Phase: transforms customer transactions into customer sequences.
- LargeItemSet Phase: generates the set of large itemsets.
- Transformation Phase: represents customer sequences based on large itemsets.
- Sequence Phase: derives large k-sequences based on large (k-1)-sequences.
- Maximal Phase: prunes non-maximal sequences.

7 Smart-SRA
Notation: session; path (path in the web site).
Rules:
- timestamp ordering (time oriented) rule
- topology (navigation oriented) rule
- maximality rule

8 Smart Miner
[Diagram] DATA STREAM -> SMART-SRA SESSION CONSTRUCTION (Candidate Session -> Smart Session) -> SEQUENTIAL MINING (Sequential AprioriAll) -> FREQUENT ACCESS PATTERN

9 Smart Miner: First Phase
Smart SRA: candidate session construction with time oriented heuristics
- session length
- page-stay
- no backward movement
[Figure: Web Site Graph with pages P1, P13, P20, P49, P34, P23; Candidate Session table with columns Page and TimeStamp]

10 Smart Miner: Second Phase
Smart SRA: smart session construction with time oriented heuristics
- inherited session length
- re-check page-stay
- no backward movement
- maximality
- topology rule
[Figure: Web Site Graph with pages P1, P13, P20, P49, P34, P23; Candidate Session table with columns Page and TimeStamp]
Smart Sessions: [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]

11 Smart Miner: Second Phase
Smart SRA
SMART SESSION RECONSTRUCTION
foreach CandidateSession in CandSessionSet
  NewSessionSet = {}
  while CandidateSession ≠ Ø
    TSessionSet = {}; TPageSet = {};
    foreach Page_i in CandidateSession
      StartPageFlag = TRUE
      foreach Page_j in CandidateSession with j

12 Session Construction Example
Iteration | CandidateSession | TPageSet | NewSessionSet
1 | [P1, P20, P13, P49, P34, P23] | {P1} | [P1]
2 | [P20, P13, P49, P34, P23] | {P20, P13} | [P1, P20], [P1, P13]
3 | [P49, P34, P23] | {P49, P34} | [P1, P13, P34], [P1, P13, P49], [P1, P20]
4 | [P23] | {P23} | [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
[Figure: web site graph with pages P1, P13, P20, P49, P34, P23]
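The iterations in this example can be reproduced with a small sketch. It is one possible reading of the second phase, not the authors' implementation: in each round, pages with no earlier linking page left in the candidate session are removed and appended to every partial session whose last page links to them. The link set is inferred from the resulting sessions, and the timestamps and the 10-minute page-stay limit are hypothetical.

```python
def smart_sessions(candidate, links, times, max_stay):
    """Build maximal link-respecting sessions from one candidate session.

    candidate: pages in timestamp order; links: set of (source, target);
    times: page -> timestamp; max_stay: assumed page-stay limit.
    """
    def reachable(src, dst):
        return (src, dst) in links and 0 <= times[dst] - times[src] <= max_stay

    remaining = list(candidate)
    sessions = []
    while remaining:
        # Pages that no earlier remaining page links to start this round.
        start_pages = [
            page for i, page in enumerate(remaining)
            if not any(reachable(prev, page) for prev in remaining[:i])
        ]
        remaining = [page for page in remaining if page not in start_pages]
        if not sessions:
            sessions = [[page] for page in start_pages]
            continue
        extended = set()
        new_sessions = []
        for page in start_pages:
            for idx, session in enumerate(sessions):
                if reachable(session[-1], page):
                    new_sessions.append(session + [page])
                    extended.add(idx)
        # Sessions that could not be extended are kept as they are.
        new_sessions += [s for idx, s in enumerate(sessions) if idx not in extended]
        sessions = new_sessions
    return sessions


# Links inferred from the example's sessions; timestamps are made up.
links = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
         ("P49", "P23"), ("P34", "P23"), ("P20", "P23")}
times = {"P1": 0, "P20": 6, "P13": 9, "P49": 12, "P34": 14, "P23": 15}
candidate = ["P1", "P20", "P13", "P49", "P34", "P23"]
for session in smart_sessions(candidate, links, times, max_stay=10):
    print(session)
```

Keeping unextended sessions in each round is what preserves [P1, P20] through iteration 3 of the table until P23 can be appended.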

13 Sequential APrioriAll Pruning
- Topological constraint: for every consecutive pair of pages in a sequence, the former must have a hyperlink to the latter.
- String matching constraint: a session S supports a pattern P if and only if P is a subsequence of S not violating string matching.
[Examples on the slide: support / not support]
Candidates are pruned during candidate sequence generation, before calculating their support.
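Both constraints can be expressed as small predicates. The sketch below reads "not violating string matching" as requiring the pattern's pages to appear consecutively in the session; that is an interpretation, since the slide's examples did not survive. Function names and data are illustrative.

```python
def respects_topology(pattern, links):
    """Topological constraint: each page must link to the next one."""
    return all((a, b) in links for a, b in zip(pattern, pattern[1:]))


def supports(session, pattern):
    """String matching constraint, read as: pattern occurs contiguously."""
    n = len(pattern)
    return any(
        list(session[i:i + n]) == list(pattern)
        for i in range(len(session) - n + 1)
    )


links = {("P1", "P13"), ("P13", "P34"), ("P34", "P23")}
session = ["P1", "P13", "P34", "P23"]
print(supports(session, ["P13", "P34"]))          # pages are adjacent
print(supports(session, ["P1", "P34"]))           # P13 sits in between
print(respects_topology(["P1", "P34"], links))    # no P1 -> P34 hyperlink
```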

14 Support
Support(I, S), where I is a pattern and S is the set of user reconstructed sessions.
Computed with one scan through the transaction database by keeping candidate sessions in a hashmap.

15 Sequential Apriori Algorithm
SEQUENTIAL APRIORI
INPUT:
  minimum support frequency: δ
  reconstructed sessions: S
  topology information: Link
  set of all web pages: P
OUTPUT:
  set of maximal frequent patterns: Max

// length-1 candidate pattern generation
L_1 = {}
for i = 1 to |P| do
  if Support([P_i], S) > δ then L_1 = L_1 U {[P_i]}
endfor
// length-(k+1) candidate pattern generation
for k = 1 to N-1 do
  if L_k = Ø then
    Halt                                   // no further generation
  else
    L_{k+1} = {}
    foreach I_i in L_k
      foreach P_j in P
        if Link[Last(I_i), P_j] then       // joining step, topological rule
          T = I_i • P_j                    // append page
          if Support(T, S) > δ then        // pruning step, support rule
            T.maximal = true               // maximality rule
            I_i.maximal = false
            V = [T_2, T_3, …, T_|T|]
            if V in L_k then V.maximal = false
            L_{k+1} = L_{k+1} U {T}
          endif
        endif
      endfor
    endfor
  endif
endfor
// union of the sets of maximal patterns
Max = {}
for k = 1 to N-1 do
  Max = Max U {S | S in L_k and S.maximal = true}
endfor
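A compact Python sketch of this level-wise procedure follows. It is an illustration under assumptions rather than the authors' code: support is taken as the fraction of sessions in which the pattern occurs contiguously, patterns are tuples, and maximality is tracked with a set instead of per-pattern flags.

```python
def occurs_in(session, pattern):
    """Assumed support test: pattern occurs contiguously in the session."""
    n = len(pattern)
    return any(session[i:i + n] == pattern for i in range(len(session) - n + 1))


def support(pattern, sessions):
    return sum(occurs_in(s, pattern) for s in sessions) / len(sessions)


def sequential_apriori(sessions, links, pages, min_support):
    """Return the set of maximal frequent patterns (tuples of pages)."""
    # Length-1 pattern generation.
    level = [(p,) for p in pages if support((p,), sessions) > min_support]
    maximal = set(level)
    while level:  # stop when no pattern was generated at this length
        next_level = []
        for pattern in level:
            for page in pages:
                if (pattern[-1], page) not in links:  # topological rule
                    continue
                candidate = pattern + (page,)  # append page
                if support(candidate, sessions) > min_support:  # support rule
                    next_level.append(candidate)
                    # Maximality rule: the prefix and the suffix of a
                    # frequent pattern are no longer maximal.
                    maximal.add(candidate)
                    maximal.discard(pattern)
                    maximal.discard(candidate[1:])
        level = next_level
    return maximal


sessions = [("P1", "P13", "P34", "P23"), ("P1", "P13", "P49", "P23"), ("P1", "P20", "P23")]
links = {("P1", "P20"), ("P1", "P13"), ("P13", "P49"), ("P13", "P34"),
         ("P49", "P23"), ("P34", "P23"), ("P20", "P23")}
pages = ["P1", "P13", "P20", "P23", "P34", "P49"]
print(sequential_apriori(sessions, links, pages, min_support=0.5))
```

Checking the link before computing support mirrors the ordering in the pseudocode: candidates that violate the topology are never counted.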

16 Accuracy Metric
Compares the frequent maximal patterns of the agent simulator with the frequent maximal patterns of the heuristic.
Measures: recall, precision, accuracy.
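As a rough illustration, recall and precision can be computed over the two pattern sets with the usual set-based definitions. This assumes those standard definitions apply here; how the slide combines them into its accuracy value is not recoverable from the transcript, so only recall and precision are sketched. Names and data are hypothetical.

```python
def recall_precision(simulator_patterns, heuristic_patterns):
    """Set-based recall and precision (standard definitions, assumed here).

    recall: share of the simulator's patterns that the heuristic found.
    precision: share of the heuristic's patterns that are in the simulator's set.
    """
    reference = set(simulator_patterns)
    found = set(heuristic_patterns)
    common = reference & found
    recall = len(common) / len(reference) if reference else 0.0
    precision = len(common) / len(found) if found else 0.0
    return recall, precision


reference = {("P1", "P13"), ("P23",)}
found = {("P1", "P13"), ("P1", "P20")}
print(recall_precision(reference, found))
```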

17 Agent Simulator
Agent Simulator Parameters:
- STP (Session Termination Probability): probability of terminating the session.
- LPP (Link from Previous page Probability): probability of referring the next page from one of the previously accessed pages, except the most recently accessed one.
- LPC (Link from Current page Probability): probability of referring the next page from the most recently visited page.
- NIP (New Initial page Probability): probability of selecting one of the starting pages of the web site during the navigation.
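A toy simulator shows how the four parameters could drive navigation. Much of it is assumed: the parameters are treated as relative weights for one random choice per step, a move from a previous page opens a new path ending at that page, and max_steps is a safety bound. None of these details come from the slides.

```python
import random


def simulate_agent(links, start_pages, stp, lpp, lpc, nip, rng, max_steps=50):
    """Return the paths (true sessions) produced by one simulated user.

    links: page -> list of linked pages. At least one weight must be positive.
    """
    current = [rng.choice(start_pages)]
    sessions = [current]
    for _ in range(max_steps):
        action = rng.choices(
            ["stop", "previous", "current", "new"],
            weights=[stp, lpp, lpc, nip],
        )[0]
        if action == "stop":
            break
        if action == "new":
            current = [rng.choice(start_pages)]
            sessions.append(current)
            continue
        # "current" follows a link from the last page, "previous" from an
        # earlier page of the same path.
        sources = current[-1:] if action == "current" else current[:-1]
        targets = [(src, dst) for src in sources for dst in links.get(src, [])]
        if not targets:
            continue  # nothing to follow; draw another action
        src, dst = rng.choice(targets)
        if action == "previous":
            # Assumed behaviour: branching from src starts a new path.
            current = current[:current.index(src) + 1]
            sessions.append(current)
        current.append(dst)
    return sessions


links = {"P1": ["P13", "P20"], "P13": ["P34", "P49"],
         "P20": ["P23"], "P34": ["P23"], "P49": ["P23"]}
for path in simulate_agent(links, ["P1"], stp=0.2, lpp=0.2, lpc=0.5, nip=0.1,
                           rng=random.Random(7)):
    print(path)
```

Because every generated path follows hyperlinks from a start page, the paths can serve as ground truth when comparing session reconstruction heuristics.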

18 Simulated Data
- Web topology: number of web pages from 10 to 1000
- Number of users: from 1000 to
- Agent simulator parameters:
  NIP/STP: 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0
  LPC/LPP: 0.1, 0.2, 0.5, 1.0, 2.0, 5.0,
  different cases
- Support parameter values: 0.001, , 0.005, 0.0075, 0.01
- Runs of agent simulator: 10 random different runs

19 Results on Simulated Data
[Charts] Legend: NO: navigation oriented; TO: time oriented; SSRA: Smart SRA.
NIP: New Initial Page Probability; STP: Session Termination Probability.

20 Results on Simulated Data
[Charts] Legend: NO: navigation oriented; TO: time oriented; SSRA: Smart SRA.

21 Real Data
AGMLAB's company web site:
- 4 months of user activity
- 3801 users
- 30 minutes session time-out
- 10 web pages
- link graph densely connected
User activity:
- action tracking program
- cookies
- cookie information recorded to a server log file

22 Results on Real Data
[Charts] Legend: NO: navigation oriented; TO: time oriented; SSRA: Smart SRA.

23 Scalability
[Charts] Performance on 100 GB Data; Performance with 50 nodes.
MAP/REDUCE paradigm: each node processes a block of the session database, computing the local frequency of each candidate pattern.
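The map/reduce idea can be sketched in plain Python without any cluster framework. The block below is only a single-process illustration: map_block plays the role of one node counting local frequencies, reduce_counts sums them, and the contiguous-occurrence test is an assumed support check. All names are hypothetical.

```python
from collections import Counter


def occurs_in(session, pattern):
    """Assumed support test: pattern occurs contiguously in the session."""
    n = len(pattern)
    return any(
        tuple(session[i:i + n]) == tuple(pattern)
        for i in range(len(session) - n + 1)
    )


def map_block(block, candidates):
    """Map step: local frequency of each candidate in one block of sessions."""
    return Counter({c: sum(occurs_in(s, c) for s in block) for c in candidates})


def reduce_counts(local_counts):
    """Reduce step: sum the local frequencies into global frequencies."""
    total = Counter()
    for local in local_counts:
        total.update(local)
    return total


# Two blocks stand in for the parts of the session database held by two nodes.
blocks = [
    [("P1", "P13", "P34", "P23"), ("P1", "P20", "P23")],
    [("P1", "P13", "P49", "P23")],
]
candidates = [("P1", "P13"), ("P20", "P23")]
print(reduce_counts(map_block(b, candidates) for b in blocks))
```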

24 Web References/Bibliography
- M.A. Bayir, I.H. Toroslu, A. Cosar, G. Fidan, Smart Miner: A New Framework for Mining Large Scale Web Usage Data
- R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide Web
- J. Srivastava, R. Cooley, M. Deshpande, P.N. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data
- M.G. Da Costa Jr., Z. Gong, Web Structure Mining: An Introduction
- J.J. Jung, Semantic PreProcessing of Web Request Streams for Web Usage Mining
- R. Agrawal, R. Srikant, Mining Sequential Patterns, 1995

25 GSP
GSP: GENERALIZED SEQUENTIAL PATTERN
C_1 = Init_Pass
L_1 = { | f in C_1, with minimum support}
for (k=2; L_{k-1} ≠ Ø; k++) do begin
  C_k = Candidate-gen-SPM(L_{k-1})
  foreach sequence s in the database D do
    foreach candidate c in C_k
      if (c in s) then update candidate c
  L_k = candidates c in C_k with minimum support
end
result = U_k(L_k)

CANDIDATE-GEN-SPM
// join step
foreach p in L_{k-1}
  foreach q in L_{k-1}
    if ( ) then C_k = C_k U {p_1, …, p_{k-1}, q_{k-1}}
// prune step
foreach s in C_k
  if exists(r | ˄ ) then C_k = C_k - {s}

26 GSP Example
[Tables: L3-sequences; Candidate 4-sequences (join step); Candidate 4-sequences (prune step)]

27 APrioriAll
APRIORIALL
L_1 = {large 1-sequences}
for (k=2; L_{k-1} ≠ Ø; k++) do begin
  C_k = Apriori-generate(L_{k-1})
  foreach sequence c in the database D do
    update candidates in C_k that are contained in c
  L_k = candidates in C_k with minimum support
end
result = maximal sequences in U_k(L_k)

APRIORI-GENERATE
// join step
foreach p in L_{k-1}
  foreach q in L_{k-1}
    if (p.x_1 = q.x_1) ˄ (p.x_2 = q.x_2) ˄ … ˄ (p.x_{k-2} = q.x_{k-2}) then C_k = C_k U { }
// prune step
foreach s in C_k
  if exists(r | ˄ ) then C_k = C_k - {s}
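The join and prune steps can be sketched as follows, assuming the candidate generation described for AprioriAll by Agrawal and Srikant: two large (k-1)-sequences that agree on their first k-2 elements are joined, and a candidate is dropped if any of its (k-1)-subsequences is not large. Sequences are tuples of item identifiers, skipping self-joins is an assumption, and the input values are purely illustrative.

```python
def join_step(large_prev):
    """Join sequences that share their first k-2 elements."""
    return {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p != q and p[:-1] == q[:-1]  # assumption: no join of p with itself
    }


def prune_step(candidates, large_prev):
    """Drop candidates with a (k-1)-subsequence that is not large."""
    return {
        c for c in candidates
        if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))
    }


def apriori_generate(large_prev):
    large_prev = set(large_prev)
    return prune_step(join_step(large_prev), large_prev)


# Illustrative large 3-sequences.
large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(sorted(join_step(large_3)))
print(sorted(apriori_generate(large_3)))
```

In this input, (1, 2, 4, 3) is pruned because (2, 4, 3) is not large, and the two candidates ending in 4 5 and 5 4 are pruned for the same kind of reason.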

28 APrioriAll Example
[Tables: L3-sequences; Candidate 4-sequences (join step); Candidate 4-sequences (prune step)]

29 APrioriSome
APRIORISOME
// Forward Phase
L_1 = {large 1-sequences}; C_1 = L_1; last = 1;
for (k=2; C_{k-1} ≠ Ø; k++) do begin
  if (L_{k-1} known) then C_k = Apriori-generate(L_{k-1})
  else C_k = Apriori-generate(C_{k-1})
  if (k = next(last)) then
    foreach sequence c in the database D do
      update candidates in C_k that are contained in c
    L_k = candidates in C_k with minimum support; last = k
end
// Backward Phase
for (k--; k>=1; k--) do begin
  if (L_k not found) then
    delete all sequences in C_k contained in some L_i, i>k
    foreach sequence c in the database D do
      update candidates in C_k that are contained in c
    L_k = candidates in C_k with minimum support
  else
    delete all sequences in L_k contained in some L_i, i>k
end
result = maximal sequences in U_k(L_k)

30 Sequential Mining Algorithm
[Tables: customer transactions (Customer ID, Transaction Time, Items), with transaction times between June 10 '93 and July 25 '93; customer sequences (Customer ID, Customer Sequence); transformed customer sequences (Customer ID, Customer Sequence)]
Large itemset | Mapped to
(30) | 1
(40) | 2
(70) | 3
(40 70) | 4
(90) | 5

