A Fast Algorithm for Subspace Clustering by Pattern Similarity

A Fast Algorithm for Subspace Clustering by Pattern Similarity. Haixun Wang, Fang Chu, Wei Fan, Philip S. Yu, and Jian Pei. In the 16th International Conference on Scientific and Statistical Database Management, 2004. 2019/1/3. Presenter: 吳建良

Abstract. Clustering by pattern similarity: objects exhibit a coherent pattern of rise and fall in subspaces. SeqClus: a novel model. It is intuitive for capturing subspace pattern similarity, reduces computational complexity dramatically, and discovers pattern similarity embedded in data sequences.

Subspace pattern similarity. (a) Raw data: 3 objects, 10 columns. (b) A shifting pattern in subspace {b, c, h, j, e}. (c) A scaling pattern in subspace {f, d, a, g, i}.

Applications. Analysis of large scientific datasets: gene expression data. Example: VPS8, EFB1, and CYS3 change coherently under {CH1I, CH1D, CH2B}.

Applications (cont.)
Example 1. Counting: How many genes have an expression level in sample CH1I that is about 100±5 units higher than that in CH2B, 280±5 units higher than that in CH1D, and 75±5 units higher than that in CH2I?
Example 2. Clustering: Find clusters of genes that exhibit coherent subspace patterns, given the following constraints: i) the subspace pattern has dimensionality higher than minCols; and ii) the number of objects in the cluster is larger than minRows.
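The counting query in Example 1 can be answered naively by scanning every gene. The following is a minimal sketch of that brute-force baseline, not the SeqClus method; the table `expr`, its values, the gene names, and the helper `count_matching` are all hypothetical and made up for illustration.

```python
def count_matching(expr, base, offsets, tol):
    """Count rows whose level in `base` exceeds each listed column by offset +/- tol."""
    total = 0
    for levels in expr.values():
        if all(abs((levels[base] - levels[col]) - off) <= tol
               for col, off in offsets.items()):
            total += 1
    return total

# Made-up expression levels, for illustration only.
expr = {
    "geneA": {"CH1I": 400, "CH2B": 300, "CH1D": 120, "CH2I": 325},
    "geneB": {"CH1I": 350, "CH2B": 247, "CH1D": 72, "CH2I": 278},
    "geneC": {"CH1I": 350, "CH2B": 340, "CH1D": 70, "CH2I": 275},
}

# CH1I is about 100 higher than CH2B, 280 higher than CH1D, 75 higher than CH2I.
query = {"CH2B": 100, "CH1D": 280, "CH2I": 75}
n_genes = count_matching(expr, "CH1I", query, tol=5)
print(n_genes)
```

A scan like this touches every row for every query, which is the cost the counting structures described later in the talk aim to avoid.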

Applications (cont.) Discovery of Sequential Patterns Network event logs Focus on two attributes: Event and Timestamp

Applications (cont.)
Example 3. Sequential pattern mining as subspace pattern discovery: events serve as the conditions on the X axis, timestamps as the Y axis, and a sliding window moves over the log. Example pattern: event CiscoDCDLinkUp is followed by MLMStatusUp, which is followed in turn by CiscoDCDLinkUp, under the constraint that the interval between the first two events is about 20±2 seconds and the interval between the 1st and 3rd events is about 40±2 seconds.
[Figure: Event1, Event2, Event3, … with timestamps t1, t2, t3, t4]
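The constraint in Example 3 can be checked directly on a small log. This is a naive sketch, not the algorithm from the talk: the log contents, the `Other` event, and the helper `find_pattern` are hypothetical, and the search simply takes the first later event that fits each interval.

```python
def find_pattern(log, events, gaps, tol):
    """Return start indices where `events` occur in order with the given gaps.

    log:   list of (timestamp, event) sorted by timestamp.
    gaps:  gaps[i] is the expected interval between events[0] and events[i + 1].
    """
    hits = []
    for i, (t0, first) in enumerate(log):
        if first != events[0]:
            continue
        pos, ok = i, True
        for name, gap in zip(events[1:], gaps):
            # First later event with the right name inside the tolerated window.
            nxt = next((j for j in range(pos + 1, len(log))
                        if log[j][1] == name
                        and abs((log[j][0] - t0) - gap) <= tol), None)
            if nxt is None:
                ok = False
                break
            pos = nxt
        if ok:
            hits.append(i)
    return hits

# Made-up event log (seconds, event name), for illustration only.
log = [
    (0, "CiscoDCDLinkUp"), (5, "Other"), (21, "MLMStatusUp"),
    (39, "CiscoDCDLinkUp"), (50, "MLMStatusUp"), (100, "CiscoDCDLinkUp"),
]
pattern = ["CiscoDCDLinkUp", "MLMStatusUp", "CiscoDCDLinkUp"]
hits = find_pattern(log, pattern, gaps=[20, 40], tol=2)
print(hits)
```

Both intervals are measured from the first event, which mirrors how the slide states the constraint (1st to 2nd, and 1st to 3rd).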

Object Definition. Use sequences to represent objects in a tabular dataset D. A = {c1, c2, …, cn}: the set of columns, with a total order. Object x: <(c1, x_c1), …, (cn, x_cn)>, where x_ci is the value of x in column ci. Example: EFB1: <(CH1I, 318), (CH1B, 280), (CH1D, 37)…>

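Sequence-based pattern similarity. Distance function: given two objects x and y, a subspace S, and an arbitrary dimension ck, the distance dist_k,S(x, y) is defined by Eq. (1).
[Figure: objects x and y plotted over columns c1 to c4, with reference column ck and the maximum marked; in this example dist_k,S(x, y) = 3]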
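Eq. (1) itself is not in the transcript. One reading consistent with the slide is that the distance is the largest difference between the two objects once each is shifted by its own value in column ck, i.e. the maximum over ci in S of |(x_ci - x_ck) - (y_ci - y_ck)|. The sketch below implements that assumed form; the function name `seq_dist` and the object values are hypothetical and are not the ones in the figure.

```python
def seq_dist(x, y, subspace, k):
    """Assumed form of dist_{k,S}: max over columns in S of the difference
    between x and y after shifting each by its own value in column k."""
    return max(abs((x[c] - x[k]) - (y[c] - y[k])) for c in subspace)

S = ["c1", "c2", "c3", "c4"]
x = {"c1": 1, "c2": 4, "c3": 2, "c4": 5}
y = {"c1": 2, "c2": 3, "c3": 3, "c4": 3}

d = seq_dist(x, y, S, "c1")

# A pure shift of x is at distance 0 from x under this definition.
x_shifted = {c: v + 10 for c, v in x.items()}
d_shift = seq_dist(x, x_shifted, S, "c1")
print(d, d_shift)
```

Under this reading, two objects that differ only by a constant offset are indistinguishable, which is what a shifting pattern requires.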

Sequence-based pattern similarity Property1 Proof:

Pattern Definition. Pattern p: a tuple (T, δ), where T is an ordered sequence of (column, value) pairs. Object x exhibits pattern p in subspace S = {c1, …, ck} if Eq. (2) holds.

Pattern Definition (cont.) Example: Pattern p: <(c1,0), (c2,2), (c3,1)>, δ = 2. Object x: <(c1,2), (c2,3), (c3,1)>. Shifted object x': <(c1,0), (c2,1), (c3,-1)>.
[Figure: p and x' plotted over columns c1, c2, c3, with a band of +δ and -δ around p]
High density pattern: the number of objects that satisfy Eq. (2) reaches a user-defined density threshold.
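The example above can be checked mechanically. The sketch assumes Eq. (2) means that, after shifting x by its value in the pattern's first column, every value lies within δ of the pattern's value; that reading matches x' on the slide but is an assumption, and the helpers `shift` and `exhibits` are hypothetical names.

```python
def shift(obj, columns):
    """Shift an object so that its value in the first column becomes 0."""
    base = obj[columns[0]]
    return [(c, obj[c] - base) for c in columns]

def exhibits(obj, pattern, delta):
    """Assumed reading of Eq. (2): each shifted value is within delta of the pattern."""
    columns = [c for c, _ in pattern]
    shifted = dict(shift(obj, columns))
    return all(abs(shifted[c] - v) <= delta for c, v in pattern)

p = [("c1", 0), ("c2", 2), ("c3", 1)]
x = {"c1": 2, "c2": 3, "c3": 1}

x_prime = shift(x, ["c1", "c2", "c3"])
print(x_prime)
print(exhibits(x, p, delta=2), exhibits(x, p, delta=1))
```

With δ = 2 the object fits the pattern because the largest deviation, on c3, is exactly 2; tightening δ to 1 rejects it.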

Construct a Counting tree. A compact summary structure of density patterns, similar to a suffix trie. To create a counting tree: for each object x, insert its relevant subsequences (length≧) into the tree; at the node t where a subsequence ends, increase t's count by 1. Then, with a depth-first traversal, label each tree node t with a triple (ID1, ID2, Count).

Construct a Counting tree (cont.) Relevant subsequences: the relevant subsequences of an object x in an n-dimensional space are its suffixes in column order, each shifted so that its first value is 0. Node label (ID1, ID2, Count): ID1 is the unique identifier of node t. ID2 is the largest ID1 among t's descendant nodes. Count: if t is a leaf node, Count is the number of objects that end at t; otherwise, it is the sum of the counts of its child nodes.

Example: =2 x c1 c3 c2 c4 z y x1 c1 c3 c2 c4 x3 x2 y1 c1 c3 c2 c4 y3 Relevant subsequences x c1 c3 c2 c4 4 3 2 z 1 y x1 c1 c3 c2 c4 -1 -4 -2 x3 x2 -3 y1 c1 c3 c2 c4 1 -2 y3 y2 -3 -1 z1 c1 c3 c2 c4 1 2 z3 -2 z2 -1 c1 c2 c3 c4 [1, , ] [2, , ] [3, , ] [4, ,1] [1,9, ] [2,4, ] [3,4, ] [4,4,1] [ , ,1] [1,9,3] [2,4,1] [3,4,1] [4,4,1] x -1 -4 -2 1 y -2 x, y -3 -1 [ , ,1] [5,9,2] [6,7,1] [7,7,1] [5, , ] [6, , ] [7, ,1] [5,9, ] [6,7, ] [7,7,1] z 2 [8,9, ] [9,9,1] [8, , ] [9, ,1] [8,9,1] [9,9,1] [ , ,1] [10, , ] [11, , ] [12, ,2] [10,14, ] [11,12, ] [12,12,2] [ , ,2] [ , ,1] [10,14,3] [11,12,2] [12,12,2] z 1 -1 [13,14,1] [14,14,1] [ , ,1] [13, , ] [14, ,1] [13,14, ] [14,14,1]

Counting list: count pattern occurrences during the depth-first traversal.
[Figure: the labelled counting tree from the previous example]
Link head: list of node labels
(c1,c1,0): [1,9,3]
(c1,c2,-1): [2,4,1]
(c1,c3,-4): [3,4,1]
(c1,c4,-2): [4,4,1]
(c1,c2,1): [5,9,2]
(c1,c3,-2): [6,7,1]
(c1,c4,0): [7,7,1] [9,9,1+1]
(c1,c3,2): [8,9,1]
(c2,c2,0): [10,14,3]
(c2,c3,-3): [11,12,2]
(c2,c4,-1): [12,12,2] [14,14,3]
(c2,c3,1): [13,14,1]
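The tree labels and the counting list above can be reproduced with a short sketch. Several things here are assumptions rather than the authors' implementation: the raw values of x, y, z are hypothetical (chosen only so that their differences match the slide), the trie is a plain nested dict, and `min_len=3` (counted in (column, value) pairs) is picked because it yields the 14-node tree shown; how that maps to the slide's own length parameter is not clear from the transcript.

```python
from collections import defaultdict

COLS = ["c1", "c2", "c3", "c4"]

def relevant_subsequences(values, min_len):
    """Suffixes of an object in column order, each shifted to start at 0."""
    n = len(values)
    return [
        [(COLS[j], values[j] - values[i]) for j in range(i, n)]
        for i in range(n)
        if n - i >= min_len
    ]

def build_tree(objects, min_len):
    """Insert every relevant subsequence into a trie; count where each ends."""
    root = {"children": {}, "count": 0}
    for values in objects:
        for seq in relevant_subsequences(values, min_len):
            node = root
            for item in seq:
                node = node["children"].setdefault(
                    item, {"children": {}, "count": 0})
            node["count"] += 1
    return root

def label_tree(root):
    """Depth-first labelling; returns [(link_head, ID1, ID2, Count), ...]."""
    labels = []
    next_id = 0

    def visit(node, item, base_col):
        nonlocal next_id
        next_id += 1
        id1 = next_id
        slot = len(labels)
        labels.append(None)  # reserve the pre-order position
        kids = node["children"]
        count = node["count"] if not kids else sum(
            visit(kid, kid_item, base_col) for kid_item, kid in kids.items())
        # Once the subtree is visited, next_id is the largest ID1 inside it.
        labels[slot] = ((base_col, item[0], item[1]), id1, next_id, count)
        return count

    for item, node in root["children"].items():
        visit(node, item, item[0])
    return labels

def counting_list(labels):
    """Group labels by link head, storing cumulative counts as on the slide."""
    lists = defaultdict(list)
    for head, id1, id2, count in labels:
        running = lists[head][-1][2] if lists[head] else 0
        lists[head].append((id1, id2, running + count))
    return dict(lists)

objects = [[4, 3, 0, 2], [2, 3, 0, 2], [1, 2, 3, 1]]  # hypothetical x, y, z
labels = label_tree(build_tree(objects, min_len=3))
lists = counting_list(labels)
for head, entries in lists.items():
    print(head, entries)
```

The sketch relies on Python dicts preserving insertion order, so children are visited in the order the objects were inserted; a different child order would give different IDs but the same counts.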

Counting Pattern Occurrence
[Figure: counting tree over columns c1 to c5 with node labels [1,12,3], [2,5,1], [3,5,1], [4,5,1], [5,5,1], [6,12,2], [7,9,1], [8,9,1], [9,9,1], [10,12,1], [11,12,1], [12,12,1]]
Link head: list of node labels
(c1,c1,0): [1,12,3]
(c1,c2,-1): [2,5,1]
(c1,c3,1): [3,5,1] [7,9,2]
(c1,c4,1): [4,5,1] [8,9,2] [11,12,3]
(c1,c5,2): [5,5,1] [9,9,2] [12,12,3]
(c1,c2,1): [6,12,2]
(c1,c3,2): [10,12,1]
Number of occurrences of <(c1,0), (c3,1), (c4,1)> = 2, found by following (c1,c1,0) [1,12,3], then (c1,c3,1) [3,5,1] [7,9,2], then (c1,c4,1) [4,5,1] [8,9,2] [11,12,3].
Counting rule: if ID_v is the first element of the list, then there are cnt_w objects; otherwise, there are cnt_w - cnt_u objects.
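The lookup on this slide can be sketched as a walk over the counting list that keeps only nodes whose ID1 falls inside the ID ranges matched so far. This is an illustrative reading, not the authors' code: `count_pattern` is a hypothetical helper, and it recovers each node's own count as the difference between consecutive cumulative counts, which for a contiguous run of matching entries gives the same total as the cnt_w - cnt_u rule.

```python
# Counting list copied from the slide: head -> [(ID1, ID2, cumulative count), ...]
counting_list = {
    ("c1", "c1", 0): [(1, 12, 3)],
    ("c1", "c2", -1): [(2, 5, 1)],
    ("c1", "c3", 1): [(3, 5, 1), (7, 9, 2)],
    ("c1", "c4", 1): [(4, 5, 1), (8, 9, 2), (11, 12, 3)],
    ("c1", "c5", 2): [(5, 5, 1), (9, 9, 2), (12, 12, 3)],
    ("c1", "c2", 1): [(6, 12, 2)],
    ("c1", "c3", 2): [(10, 12, 1)],
}

def count_pattern(counting_list, pattern):
    """Count objects matching `pattern`, a list of (column, value) pairs."""
    base = pattern[0][0]
    ranges = None  # ID ranges of nodes matching the prefix seen so far
    total = 0
    for col, val in pattern:
        kept, total, prev_cum = [], 0, 0
        for id1, id2, cum in counting_list.get((base, col, val), []):
            inside = ranges is None or any(lo <= id1 <= hi for lo, hi in ranges)
            if inside:
                kept.append((id1, id2))
                total += cum - prev_cum  # this node's own count
            prev_cum = cum
        ranges = kept
    return total

n1 = count_pattern(counting_list, [("c1", 0), ("c3", 1), ("c4", 1)])
n2 = count_pattern(counting_list, [("c1", 0), ("c2", -1), ("c5", 2)])
print(n1, n2)
```

Because a node's descendants always have IDs between its ID1 and ID2, the containment test replaces any traversal of the tree itself.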

Clustering. Construct a cluster tree in which each node is a triple (item, count, range-list). Count the occurrences of all 2-column patterns. If a pattern is frequent (count≧minRows), insert it under the root node of the cluster tree. For each node p on the current level, join p with its eligible nodes to derive the nodes on the next level.

Clustering (cont.) Eligible nodes: a node q is an eligible node of node p if q is on the same level as p and, where p denotes a-b=v and q denotes c-d=v', [condition not captured in the transcript]. Join operation: joining p: (a-b=v, count, range-list) with q: (c-d=v', count', range-list') gives the new node (c-b=v', count', range-list').

Clustering (cont.) Example with minRows=2. X marks a pattern whose count is below minRows.
Nodes under Root:
(c2-c1=-1, 1, [2,4]) X
(c2-c1=1, 2, [5,9]), joined with (c4-c1=0, 2, [7,7], [9,9])
(c3-c1=-4, 1, [3,4]) X
(c3-c1=-2, 1, [6,7]) X
(c3-c1=2, 1, [8,9]) X
(c4-c1=-2, 1, [4,4]) X
(c4-c1=0, 2, [7,7], [9,9])
(c3-c2=-3, 2, [11,12]), joined with (c4-c2=-1, 3, [12,12], [14,14])
(c3-c2=1, 1, [13,14]) X
(c4-c2=-1, 3, [12,12], [14,14])
Each path from the root to a leaf represents a frequent pattern: <(c1,0),(c2,1),(c4,0)>, <(c1,0),(c4,0)>, <(c2,0),(c3,-3),(c4,-1)>, <(c2,0),(c4,-1)>.
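One way to picture the join is as a containment test on range lists: the child keeps only those ID ranges of q that are nested inside a range of p. The sketch below is an interpretation, not the slide's definition. The `Node` class, the per-range object counts (which the slide's triples do not carry), and the rule that the child keeps only nested ranges are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    item: tuple   # (col_a, col_b, diff), meaning col_a - col_b = diff
    ranges: dict  # {(ID1, ID2): number of objects in that ID range}

    @property
    def count(self):
        return sum(self.ranges.values())

def join(p, q, min_rows):
    """Child of p derived from q, or None if it is not frequent."""
    nested = {
        (lo, hi): n
        for (lo, hi), n in q.ranges.items()
        if any(plo <= lo and hi <= phi for plo, phi in p.ranges)
    }
    # New node denotes c - b = v', following the slide's (c-b=v', ...) form.
    child = Node((q.item[0], p.item[1], q.item[2]), nested)
    return child if child.count >= min_rows else None

# Nodes taken from the example; per-range counts come from the counting tree.
p1 = Node(("c2", "c1", 1), {(5, 9): 2})
q1 = Node(("c4", "c1", 0), {(7, 7): 1, (9, 9): 1})
p2 = Node(("c3", "c2", -3), {(11, 12): 2})
q2 = Node(("c4", "c2", -1), {(12, 12): 2, (14, 14): 1})

child1 = join(p1, q1, min_rows=2)
child2 = join(p2, q2, min_rows=2)
print(child1, child2)
```

Under this reading, `child1` corresponds to the frequent pattern <(c1,0),(c2,1),(c4,0)> and `child2` to <(c2,0),(c3,-3),(c4,-1)>, each supported by two objects.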

Experiment. Synthetic data. Tabular form: value range 0~300; embedded clusters with δ=0, 2, 4, 6, … Sequential form: (id, timestamp) pairs generated by a probabilistic distribution.

Experiment (cont.) Scalability

Experiment (cont.) Gene expression data: yeast micro-array; 2884 genes, 17 conditions; expression level range 0~600, discretized into 40 bins. Event management data: NETVIEW; EventType: 241; 10 days' worth of event logs.

Experiment (cont.) Yeast micro-array NETVIEW