Mining Access Patterns Efficiently from Web Logs. Jian Pei, Jiawei Han, Behzad Mortazavi-asl, and Hua Zhu. May 26, 2000. DE Lab. 윤지영.

1. Introduction
Web mining
- Web content mining
- Web usage mining (Web log mining)
Useful knowledge from Web logs
- Improving the design of Web sites
- Analyzing system performance
- Understanding user reaction and motivation
- Building adaptive Web sites
2018-11-28 DE Lab. 윤지영

Sequential Pattern Mining
- A Web access pattern is a sequential pattern in a large set of pieces of Web logs -> it can be found with sequential pattern mining
- Given: a sequence database, where each sequence is a list of transactions ordered by transaction time and each transaction is a set of items
- Find: all sequential patterns that meet a user-specified minimum support
- The support of a pattern is the number of data sequences that contain the pattern

Contents
- Problem definition
- Design of the WAP-tree structure
- WAP-mine algorithm (WAP: Web Access Pattern)
- Performance study: comparison with Apriori-based mining (GSP)

2. Problem Statement
- Web log: a sequence of pairs of user ID and event
- Goal: find the sequential patterns whose support meets a support threshold
- Let E be a set of events. An access sequence is S = e_1 e_2 … e_n with e_i ∈ E for 1 ≤ i ≤ n
- n: length of the access sequence (S is called an n-sequence)
- Repetition is allowed: e_i ≠ e_j is not required for i ≠ j (so 'aab' and 'ab' are two different access sequences)

Subsequence & super-sequence
- Let S = e_1 e_2 … e_n and S' = e'_1 e'_2 … e'_l
- S' is a subsequence of S, and S is a super-sequence of S' (written S' ⊆ S), if there exist indices 1 ≤ i_1 < i_2 < … < i_l ≤ n such that e'_j = e_{i_j} for 1 ≤ j ≤ l

Suffix & prefix
- Let S = e_1 e_2 … e_k e_{k+1} … e_n and let pattern P = e'_1 e'_2 … e'_l
- If S_suffix = e_{k+1} … e_n is a super-sequence of P and e_{k+1} = e'_1, then:
- S_suffix = e_{k+1} … e_n is the suffix of S with respect to pattern P
- S_prefix = e_1 e_2 … e_k is the prefix of S with respect to pattern P

ξ-pattern
- Web access sequence database: WAS = {S_1, S_2, …, S_m}, where each S_i (1 ≤ i ≤ m) is an access sequence
- Support of S in WAS: sup_WAS(S) = sup(S) = |{S_i | S ⊆ S_i}| / m
- S is a ξ-pattern if sup_WAS(S) ≥ ξ
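The support definition above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, events are assumed to be single characters, and the four sequences are illustrative data chosen to agree with the claims on the slides (the slides' Table 1 itself is not reproduced in this transcript).

```python
from typing import Sequence


def is_subsequence(pattern: str, sequence: str) -> bool:
    """True if the events of `pattern` occur in `sequence` in order (gaps allowed)."""
    events = iter(sequence)
    # `event in events` consumes the iterator, which enforces the ordering.
    return all(event in events for event in pattern)


def support(pattern: str, was: Sequence[str]) -> float:
    """Fraction of access sequences in `was` that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in was) / len(was)


# Illustrative database (an assumption, not taken from the slides).
WAS = ["abdac", "eaebcac", "babfaec", "afbacfc"]

sup_fc = support("fc", WAS)      # "fc" occurs in 2 of the 4 sequences
sup_abac = support("abac", WAS)  # "abac" occurs in all 4 sequences
```

With this data, "fc" is a 50%-pattern but not a 75%-pattern, while "abac" is a ξ-pattern for every ξ up to 1.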

2. Problem Statement
- Given a Web access sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS
<Example 1> Table 1 [table not included in this transcript]: e.g., fc is a 50%-pattern, and an access sequence of five events is a 5-sequence

3. WAP-mine: Property 1
<Property 1> (Sequential pattern Apriori) Let SEQ be a sequence database. If G is not a ξ-pattern of SEQ, no super-sequence of G can be a ξ-pattern of SEQ.
<Example> "f" is not a 75%-pattern of WAS in Example 1, so no access sequence containing "f" can be a 75%-pattern.

3. WAP-mine: Property 2
<Property 2> (Suffix heuristic) If e is a frequent event in the set of prefixes of sequences in WAS with respect to pattern P, then the sequence eP is an access pattern of WAS.
<Example> b is a frequent event within the set of prefixes with respect to the frequent sequence 'ac' in Example 1, so we can claim that 'bac' is an access pattern.
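A rough Python sketch of the suffix heuristic follows. It assumes single-character events and takes, for each sequence, the longest prefix with respect to P (the split whose suffix starts as late as possible); that choice, the helper names, and the sample data are assumptions made for illustration, not details given on the slides.

```python
from collections import Counter


def is_subsequence(pattern, sequence):
    events = iter(sequence)
    return all(e in events for e in pattern)


def longest_prefix(sequence, pattern):
    """Longest e_1..e_k whose remaining suffix starts with pattern[0] and contains pattern.

    Returns None when the sequence does not contain the pattern at all.
    """
    for k in range(len(sequence) - 1, -1, -1):
        if sequence[k] == pattern[0] and is_subsequence(pattern, sequence[k:]):
            return sequence[:k]
    return None


def frequent_prefix_events(was, pattern, xi):
    """Events e that are frequent in the prefixes w.r.t. pattern, so that eP is a xi-pattern."""
    counts = Counter()
    for s in was:
        prefix = longest_prefix(s, pattern)
        if prefix is not None:
            counts.update(set(prefix))  # count each event once per sequence
    return {e for e, c in counts.items() if c / len(was) >= xi}


# Illustrative database (an assumption, not taken from the slides).
WAS = ["abdac", "eaebcac", "babfaec", "afbacfc"]
extensions = frequent_prefix_events(WAS, "ac", 0.75)
```

Here the prefixes with respect to 'ac' are 'abd', 'eaebc', 'babf' and 'afb'; a and b occur in all four, so 'aac' and 'bac' are access patterns at the 75% threshold.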

WAP-tree
- A data structure that stores access sequences together with their corresponding counts
- Tedious support counting is avoided -> no large candidate sets need to be generated
- It holds what is needed to derive the patterns with enough support and is much smaller than the original database
- The WAP-tree is built by scanning the database only twice
- The mining philosophy is conditional search instead of level-wise search

Conditional Search
- Look for patterns that share the same suffix
- Count the frequent events in the set of prefixes that satisfy the same suffix condition
- A partition-based divide-and-conquer method instead of bottom-up generation of combinations
- Avoids generating large candidate sets

Algorithm 1: WAP-mine (mining access patterns in a Web access sequence database)
Input: access sequence database WAS and support threshold ξ (0 ≤ ξ ≤ 1)
Output: the complete set of ξ-patterns in WAS
Method:
1. Scan WAS once and find all frequent events
2. Scan WAS a second time and construct a WAP-tree over the frequent events
3. Recursively mine the WAP-tree using conditional search
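Step 1, together with the filtering that step 2 relies on, can be sketched as follows. The function names and the sample database are illustrative assumptions; events are single characters, and an event is counted at most once per access sequence.

```python
from collections import Counter


def frequent_events(was, xi):
    """First scan: events that occur in at least a fraction xi of the sequences."""
    counts = Counter()
    for s in was:
        counts.update(set(s))  # one vote per sequence, however often the event repeats
    return {e for e, c in counts.items() if c / len(was) >= xi}


def frequent_subsequence(sequence, frequent):
    """Drop every nonfrequent event, keeping the remaining events in order."""
    return "".join(e for e in sequence if e in frequent)


# Illustrative database (an assumption, not taken from the slides).
WAS = ["abdac", "eaebcac", "babfaec", "afbacfc"]
FE = frequent_events(WAS, 0.75)
filtered = [frequent_subsequence(s, FE) for s in WAS]
```

At a 75% threshold only a, b and c survive the first scan, and the filtered sequences are the ones that would be inserted into the WAP-tree during the second scan.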

4. Construction of WAP-tree
<Observations on the WAP-tree>
<1> If an event e is not in the set of frequent 1-sequences, there is no need to include e in the construction of a Web access pattern tree
<2> If two access sequences share a common prefix P, the prefix P can be shared in the WAP-tree

WAP-tree Structure
- Label and count in each node: a node is labeled by an event, and its count is the number of occurrences of the corresponding prefix (the path from the root to that node)
- Construction of the WAP-tree: filter out all nonfrequent events from each access sequence, then insert the resulting frequent subsequence into the WAP-tree
- Auxiliary node-linkage structures are built to assist node traversal: a header table H and, for each event e_i, an event-node queue (the e_i-queue) that links all nodes labeled e_i

Example 2 (refer to Table 1)
- The support threshold is set to 75%
- The set of frequent events (frequent 1-sequences): {a, b, c}
<fig 1> The WAP-tree and conditional WAP-tree for frequent subsequences in Table 1

Algorithm 2 (WAP-tree construction)
Input: database WAS and the set of frequent events FE
Output: a WAP-tree T
Method:
1. Create a root node for T
2. For each access sequence S in WAS do
- Extract the frequent subsequence S' = s_1 s_2 … s_n (1 ≤ i ≤ n) from S by removing every event that is not in FE, and let current_node point to the root of T
- For i = 1 to n:
- if current_node has a child labeled s_i, increase the count of that child by 1 and move current_node to it
- else create a new child node (s_i:1), move current_node to it, and append it to the s_i-queue
3. Return(T)
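A minimal Python sketch of this construction is given below. The `Node` class, the dictionary standing in for the header table, and the sample data are assumptions made for the sketch; the slides do not prescribe a concrete representation.

```python
from collections import defaultdict


class Node:
    """One WAP-tree node: an event label, a count, and links to parent and children."""

    def __init__(self, label, parent=None):
        self.label = label
        self.count = 0
        self.parent = parent
        self.children = {}  # event label -> child Node


def build_wap_tree(was, frequent):
    """Insert the frequent subsequence of every access sequence into a prefix tree.

    Returns the root and a header table mapping each event to its queue of nodes.
    """
    root = Node(None)
    queues = defaultdict(list)
    for sequence in was:
        current = root
        for event in sequence:
            if event not in frequent:
                continue  # nonfrequent events are filtered out on the fly
            child = current.children.get(event)
            if child is None:
                child = Node(event, current)
                current.children[event] = child
                queues[event].append(child)  # link the new node into the event's queue
            child.count += 1
            current = child
    return root, queues


# Illustrative database and frequent events (assumptions, not taken from the slides).
WAS = ["abdac", "eaebcac", "babfaec", "afbacfc"]
root, queues = build_wap_tree(WAS, {"a", "b", "c"})
```

For this data the root gets two children, a with count 3 and b with count 1, and the three queues together link 13 nodes, far fewer than the 24 events stored in the raw sequences.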

Analysis of WAP-tree
- Height of the WAP-tree: 1 + the maximum length of the frequent subsequences
- Width of the WAP-tree: the number of leaves of the tree, which is at most the number of access sequences
=> the tree is much smaller than WAS

Lemma 1
- For any access sequence in WAS, there exists a unique path from the root of the WAP-tree whose node labels, in order, are exactly the frequent events of that sequence
- The number of distinct leaf nodes, and hence of paths, in a WAP-tree cannot exceed the number of distinct frequent subsequences, and the height of the tree is small
=> the tree can be organized like a B+-tree and implemented even in pure SQL

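5. Mining Web Access Patterns from WAP-tree
<Property 3> (Node-link property) All the frequent subsequences that contain e_i can be visited by following the e_i-queue
=> every node labeled e_i, and with it the path in front of that node, can be reached directly
- The path from the root to a node labeled e_i (excluding that node) is a prefix sequence of e_i, and the count of that node is the count of the prefix sequence
- A prefix sequence of e_i may contain another prefix sequence of e_i
<Example> In 'abab', the prefix sequences of b are 'aba' and 'a'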

Concept of unsubsumed count
- Let G and H be two prefix sequences of e_i. If G is formed by a sub-path (from the root) of the path that forms H, then G is a sub-prefix sequence of H and H is a super-prefix sequence of G
- For a prefix sequence of e_i without any super-prefix sequence, the unsubsumed count is the count of that sequence
- For a prefix sequence of e_i with some super-prefix sequences, the unsubsumed count is the count of that sequence minus the unsubsumed counts of all its super-prefix sequences
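The definition can be checked on a small case with the sketch below. It assumes single-character events and represents each prefix sequence of an event by its path string from the root, so that G is a sub-prefix sequence of H exactly when H starts with G followed by the event; the function name and the input format are invented for this illustration.

```python
def unsubsumed_counts(prefix_counts, event):
    """Map each prefix sequence of `event` to its unsubsumed count.

    prefix_counts: dict from prefix sequence (path string from the root)
    to the count of the node labeled `event` that ends that path.
    """
    result = {}
    # Longest prefixes first, so every super-prefix sequence is resolved
    # before the prefix sequences it subsumes.
    for g in sorted(prefix_counts, key=len, reverse=True):
        supers = [h for h in result if h.startswith(g + event)]
        result[g] = prefix_counts[g] - sum(result[h] for h in supers)
    return result


# The sequence 'abab' inserted twice: both b-nodes carry count 2.
twice_abab = unsubsumed_counts({"a": 2, "aba": 2}, "b")

# 'abab' twice plus 'ab' once: the first b-node has count 3, the second count 2.
mixed = unsubsumed_counts({"a": 3, "aba": 2}, "b")
```

In the first case the count of 'a' is entirely absorbed by 'aba', so its unsubsumed count is 0; in the second case one occurrence of 'a' is left over, which corresponds to the single sequence 'ab'.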

What's Conditional Search
<Property 4> (Prefix sequence unsubsumed count property) The count of a sequence G ending with e_i is the sum of the unsubsumed counts of all prefix sequences of e_i that are super-sequences of G
- With this property, all frequent events in the set of sequences sharing the same suffix can be counted
- Instead of searching for all patterns at once, the search is restricted to Web access patterns with the same suffix, which serves as the condition

Conditional Search paradigm
1. The conditional sequence base PS|e_i contains all prefix sequences of e_i together with their counts
2. For each prefix sequence of e_i with count c: when it is inserted into PS|e_i with count c, all of its sub-prefix sequences of e_i are inserted into PS|e_i with count -c
3. As a result, each prefix sequence in PS|e_i holds its unsubsumed count
4. The access patterns with suffix e_i are then obtained by concatenating e_i to the patterns mined from PS|e_i

Algorithm 3 (Mining all Web access patterns in a WAP-tree)
Input: a WAP-tree T and support threshold ξ
Output: the complete set of ξ-patterns
Method:
1. If the WAP-tree T has only one branch, return all the unique combinations of nodes in that branch
2. Initialize the Web access pattern set WAP = ∅. Every event in T is itself a Web access pattern; insert these patterns into WAP

Algorithm 3 (Mining all Web access patterns in a WAP-tree), continued
3. For each event e_i in WAP-tree T:
- Construct the conditional sequence base PS|e_i by following the e_i-queue, counting the conditional frequent events at the same time
- If the set of conditional frequent events is not empty, build a conditional WAP-tree for e_i over PS|e_i and mine it recursively
- For each Web access pattern returned from the conditional WAP-tree, concatenate e_i to it and insert the result into WAP
4. Return WAP
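The recursion in Algorithm 3 can be sketched compactly in Python. This is a simplified illustration rather than the authors' implementation: events are single characters, the threshold is an absolute count, unsubsumed counts are computed directly from the tree instead of inserting entries with count -c, and the single-branch shortcut of step 1 is omitted. All names and the sample data are assumptions.

```python
import math
from collections import defaultdict


class Node:
    def __init__(self, label, parent):
        self.label, self.parent = label, parent
        self.count = 0
        self.children = {}


def frequent_events(weighted_seqs, min_count):
    """Events whose weighted sequence count reaches min_count (one vote per sequence)."""
    counts = defaultdict(int)
    for seq, count in weighted_seqs:
        for e in set(seq):
            counts[e] += count
    return {e: c for e, c in counts.items() if c >= min_count}


def build_tree(weighted_seqs, frequent):
    """Build a WAP-tree over (sequence, count) pairs; return root and event queues."""
    root, queues = Node(None, None), defaultdict(list)
    for seq, count in weighted_seqs:
        node = root
        for e in (x for x in seq if x in frequent):
            child = node.children.get(e)
            if child is None:
                child = node.children[e] = Node(e, node)
                queues[e].append(child)
            child.count += count
            node = child
    return root, queues


def subsumed(node):
    """Total count of the nearest descendants that carry the same label as node."""
    total, stack = 0, list(node.children.values())
    while stack:
        n = stack.pop()
        if n.label == node.label:
            total += n.count
        else:
            stack.extend(n.children.values())
    return total


def conditional_base(queue):
    """Prefix sequences of one event, each paired with its unsubsumed count."""
    base = []
    for node in queue:
        count = node.count - subsumed(node)
        if count > 0:
            path, p = [], node.parent
            while p.label is not None:
                path.append(p.label)
                p = p.parent
            base.append(("".join(reversed(path)), count))
    return base


def wap_mine(weighted_seqs, min_count):
    """Return {pattern: support count} for all patterns reaching min_count."""
    frequent = frequent_events(weighted_seqs, min_count)
    _, queues = build_tree(weighted_seqs, frequent)
    patterns = dict(frequent)  # every frequent event is itself a pattern
    for e, queue in queues.items():
        for prefix, count in wap_mine(conditional_base(queue), min_count).items():
            patterns[prefix + e] = count  # concatenate the suffix event
    return patterns


# Illustrative database (an assumption, not taken from the slides).
WAS = ["abdac", "eaebcac", "babfaec", "afbacfc"]
patterns = wap_mine([(s, 1) for s in WAS], math.ceil(0.75 * len(WAS)))
```

Each conditional base here holds, for every access sequence, the part in front of its last occurrence of the suffix event, which is why concatenating the suffix to the recursively mined patterns produces each pattern exactly once.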

6. Performance Evaluation and Conclusions
<Theorem 1> WAP-mine returns the complete set of access patterns without redundancy
<fig 2> GSP vs. WAP-mine