Data Mining: Concepts and Techniques — Chapter 9 — Graph Mining, Part II: Graph Classification and Clustering. Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign.


Data Mining: Concepts and Techniques — Chapter 9 — Graph mining: Part II Graph Classification and Clustering Jiawei Han and Micheline Kamber Department of Computer Science University of Illinois at Urbana-Champaign ©2006 Jiawei Han and Micheline Kamber. All rights reserved.

January 16, 2016 Mining and Searching Graphs in Graph Databases 2

Graph Mining II
- Graph Classification
  - Graph pattern-based approach: subgraph patterns from data mining (LEAP)
  - Machine learning approaches: kernel-based approach, boosting
- Graph Clustering
  - Link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"
- Summary

Substructure-Based Graph Classification
- Data: graph data with class labels, e.g., chemical compounds, software behavior graphs, social networks
- Basic idea
  - Extract graph substructures g_1, ..., g_n
  - Represent a graph with a feature vector x = (x_1, ..., x_n), where x_i is the frequency of g_i in that graph
  - Build a classification model
- Different features and representative work
  - Fingerprint
  - Maccs keys
  - Tree and cyclic patterns [Horvath et al., KDD'04]
  - Minimal contrast subgraph [Ting and Bailey, SDM'06]
  - Frequent subgraphs [Deshpande et al., TKDE'05; Liu et al., SDM'05]
  - Graph fragments [Wale and Karypis, ICDM'06]
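As a deliberately simplified sketch of this idea, the toy code below encodes each graph as a set of labeled edges and sets a feature to 1 when every edge of a pattern occurs in the graph. Real systems count subgraph-isomorphic embeddings instead; the edge encoding and all names here are illustrative, not from the original slides.

```python
# Sketch of substructure-based feature vectors. Each graph and each pattern
# is a set of (label_u, bond, label_v) edges; a feature fires when every edge
# of the pattern appears in the graph. This keeps the example self-contained
# but is weaker than true subgraph-isomorphism counting.

def feature_vector(graph_edges, patterns):
    """Return x where x[i] = 1 if all edges of pattern i occur in the graph."""
    return [1 if pattern <= graph_edges else 0 for pattern in patterns]

# Toy chemical-like graphs.
g1 = {("C", "-", "O"), ("C", "-", "N"), ("N", "-", "H")}
g2 = {("C", "-", "O"), ("C", "=", "O")}

patterns = [frozenset({("C", "-", "O")}),                    # C-O fragment
            frozenset({("C", "-", "N"), ("N", "-", "H")})]   # C-N-H fragment

print(feature_vector(g1, patterns))  # [1, 1]
print(feature_vector(g2, patterns))  # [1, 0]
```

The resulting vectors can be fed to any standard classifier, which is the "build a classification model" step above.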

Fingerprints (fp-n)
- Enumerate all paths up to length l and certain cycles
- Hash each feature to position(s) in a fixed-length bit-vector
(Figure: chemical compounds mapped to length-n bit-vectors.)
Courtesy of Nikil Wale
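The two steps above (enumerate paths, hash into a bit-vector) can be sketched as follows. The DFS path enumeration and md5-based hashing are illustrative choices, not the scheme of any particular fingerprint implementation, and cycle features are omitted.

```python
import hashlib

def path_fingerprint(adj, labels, max_len, n_bits=64):
    """Hash the label sequence of every simple path with up to max_len
    edges into an n_bits fingerprint bit-vector."""
    bits = [0] * n_bits
    def dfs(node, visited, seq):
        # record the label sequence of the current path
        h = int(hashlib.md5("-".join(seq).encode()).hexdigest(), 16) % n_bits
        bits[h] = 1
        if len(seq) - 1 < max_len:          # seq has len(seq) - 1 edges
            for nxt in adj.get(node, []):
                if nxt not in visited:
                    dfs(nxt, visited | {nxt}, seq + [labels[nxt]])
    for v in adj:
        dfs(v, {v}, [labels[v]])
    return bits

# Tiny molecule-like graph: O - C - N
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "O", 1: "C", 2: "N"}
fp = path_fingerprint(adj, labels, max_len=2)
print(sum(fp))  # number of distinct bit positions set by the hashed paths
```

Because different paths may hash to the same position, the fingerprint is a lossy but compact screen, which is exactly why fp-n is cheap to compare.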

Maccs Keys (MK)
- A domain expert identifies fragments "important" for bioactivity
- Each fragment forms a fixed dimension in the descriptor space
(Figure: expert-selected chemical fragments.)
Courtesy of Nikil Wale

Cycles and Trees (CT) [Horvath et al., KDD'04]
- Identify bi-connected components of the chemical compound
- Delete the bi-connected components from the compound; the left-over pieces are trees
- Fixed number of cycles; bounded cyclicity using bi-connected components
(Figure: a compound split into bi-connected components and left-over trees.)
Courtesy of Nikil Wale

Frequent Subgraphs (FS) [Deshpande et al., TKDE'05]
- Discovering features: run frequent subgraph discovery over the chemical compounds with a minimum support threshold
- Topological features are captured by the graph representation
- Each discovered subgraph carries its support in the positive and negative classes (e.g., +ve: 40%, -ve: 0%)
(Figure: chemical compounds and discovered frequent subgraphs with their supports.)
Courtesy of Nikil Wale

Graph Fragments (GF) [Wale and Karypis, ICDM'06]
- Tree Fragments (TF): at least one node of the fragment has degree greater than 2 (no cycles)
- Path Fragments (PF): all nodes have degree at most 2, and no cycles
- Acyclic Fragments (AF): TF ∪ PF; acyclic fragments are also termed free trees
Courtesy of Nikil Wale
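The path/tree distinction above reduces to a degree check on an acyclic fragment, which can be sketched as:

```python
def fragment_type(adj):
    """Classify a connected acyclic fragment per the TF/PF distinction:
    'path' if every node has degree <= 2, 'tree' if some node has degree
    greater than 2. Assumes the input is already acyclic."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    return "tree" if any(d > 2 for d in degrees) else "path"

# A 4-node star has a centre of degree 3 -> tree fragment
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
# A 4-node chain has maximum degree 2 -> path fragment
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(fragment_type(star), fragment_type(chain))  # tree path
```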

Comparison of Different Features [Wale and Karypis, ICDM’06]

Minimal Contrast Subgraphs [Ting and Bailey, SDM'06]
- A contrast graph is a subgraph appearing in one class of graphs and never in another class
  - Minimal if none of its subgraphs is a contrast
  - May be disconnected: allows a succinct description of the differences, but requires a larger search space
- Mining contrast subgraphs
  - Find the maximal common edge sets (these may be disconnected)
  - Apply a minimal hypergraph transversal operation to derive the minimal contrast edge sets from the maximal common edge sets
  - Must compute minimal contrast vertex sets separately, then take the minimal union with the minimal contrast edge sets
Courtesy of Bailey and Dong

Frequent Subgraph-Based Classification [Deshpande et al., TKDE'05]
- Frequent subgraphs: a graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
- Feature generation: frequent topological subgraphs by FSG; frequent geometric subgraphs with 3D shape information
- Feature selection: sequential covering paradigm
- Classification: use SVM to learn a classifier based on the feature vectors; assign different misclassification costs to different classes to address the skewed class distribution
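The sequential covering step can be sketched as a greedy loop. `coverage` and the pattern names below are hypothetical inputs; in practice coverage comes from subgraph containment tests over the training graphs.

```python
def sequential_covering(patterns, coverage, n_graphs):
    """Greedy sequential covering (a sketch): repeatedly pick the pattern
    covering the most still-uncovered training graphs. coverage[p] is the
    set of graph ids containing pattern p."""
    uncovered = set(range(n_graphs))
    selected = []
    while uncovered:
        best = max(patterns, key=lambda p: len(coverage[p] & uncovered))
        if not coverage[best] & uncovered:
            break  # remaining graphs contain no candidate pattern
        selected.append(best)
        uncovered -= coverage[best]
    return selected

# Hypothetical coverage sets over 4 training graphs.
coverage = {"p1": {0, 1, 2}, "p2": {2, 3}, "p3": {3}}
print(sequential_covering(["p1", "p2", "p3"], coverage, 4))  # ['p1', 'p2']
```

Each selected pattern becomes one dimension of the SVM feature vector; the greedy loop keeps the feature set small while still covering every training graph that any candidate covers.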

Varying Minimum Support

Varying Misclassification Cost

Graph Fragment [Wale and Karypis, ICDM'06]
- All graph substructures up to a given length (size or number of bonds)
  - Determined dynamically → dataset-dependent descriptor space
  - Complete coverage → descriptors for every compound
  - Precise representation → one-to-one mapping
  - Complex fragments → arbitrary topology
- Recurrence relation to generate graph fragments of length l
Courtesy of Nikil Wale

Performance Comparison

Re-examination of Pattern-Based Classification
(Figure: positive and negative training instances pass through pattern-based feature construction (computationally expensive!) and feature space transformation into model learning; test instances are mapped into the same feature space for prediction.)

The Computational Bottleneck
- Two steps, expensive: data → mining → frequent patterns (10^4 ~ 10^6) → filtering → discriminative patterns
- Direct mining, efficient: data → direct mining (with a transformed FP-tree) → discriminative patterns

Challenge: Non-Anti-Monotonic
- The discriminative score is neither anti-monotonic nor monotonic
- Non-monotonic: enumerate all subgraphs and then check their scores?
- Enumerate subgraphs: small-size to large-size

Direct Mining of Discriminative Patterns
- Avoid mining the whole set of patterns
  - Harmony [Wang and Karypis, SDM'05]
  - DDPMine [Cheng et al., ICDE'08]
  - LEAP [Yan et al., SIGMOD'08]
  - MbT [Fan et al., KDD'08]
- Find the most discriminative pattern: a search problem? an optimization problem?
- Extensions
  - Mining top-k discriminative patterns
  - Mining approximate/weighted discriminative patterns

Mining Most Significant Graph with Leap Search [Yan et al., SIGMOD'08]
- Objective functions (e.g., the G-test score) measure how discriminative a graph pattern is between the positive and negative datasets

Upper-Bound

Upper-Bound: Anti-Monotonic
- Rule of thumb: if the frequency difference of a graph pattern between the positive and negative datasets increases, the pattern becomes more interesting
- We can recycle existing graph mining algorithms to accommodate non-monotonic functions
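To make the rule of thumb concrete, the sketch below runs a branch-and-bound search over itemsets (a stand-in for subgraphs, so the example stays self-contained) with the score diff(p) = freq+(p) - freq-(p). Since positive frequency is anti-monotonic and negative frequency is nonnegative, freq+(p) upper-bounds the score of every superpattern of p, which licenses the pruning step.

```python
def freq(pattern, db):
    """Fraction of transactions in db containing pattern."""
    return sum(pattern <= t for t in db) / len(db)

def best_discriminative(items, pos, neg):
    """Branch-and-bound for the most discriminative pattern. Bound: any
    superpattern q of p has diff(q) <= freq+(q) <= freq+(p), so a branch
    is pruned once freq+(p) cannot beat the best score found so far."""
    best_p, best_s = None, float("-inf")
    def expand(pattern, remaining):
        nonlocal best_p, best_s
        if pattern:
            s = freq(pattern, pos) - freq(pattern, neg)
            if s > best_s:
                best_p, best_s = pattern, s
            if freq(pattern, pos) <= best_s:
                return  # bound: no extension can improve on best_s
        for i, item in enumerate(remaining):
            expand(pattern | {item}, remaining[i + 1:])
    expand(frozenset(), items)
    return best_p, best_s

pos = [{"a", "b"}, {"a", "b", "c"}, {"a"}]
neg = [{"c"}, {"b", "c"}]
print(best_discriminative(["a", "b", "c"], pos, neg))
```

Here {"a"} occurs in every positive transaction and no negative one, so it scores 1.0 and the bound prunes all of its extensions immediately.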

Structural Similarity
- Sibling patterns: structural similarity ⇒ significance similarity
- Illustrated on size-4, size-5, and size-6 graphs

Structural Leap Search
- g: a discovered graph; g': a sibling of g
- Leap on the subtree of g' if g' and g are structurally similar and have similar frequencies in both datasets
- σ: leap length, the tolerance of structure/frequency dissimilarity
- The search tree is split into a mining part and a leap part

Frequency Association
- Association between a pattern's frequency and its objective score
- Start with a high frequency threshold, then gradually decrease it

LEAP Algorithm
1. Structural leap search with a frequency threshold
2. Support-descending mining, repeated until F(g*) converges
3. Branch-and-bound search with F(g*)
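The support-descending outer loop can be sketched as follows. `mine_best` is a hypothetical stand-in for the structural leap search at one threshold; the halving schedule and convergence test are illustrative.

```python
def support_descending_search(mine_best, start_threshold, min_threshold=0.01):
    """Outer loop of a LEAP-style search (sketch): mine the best pattern at
    a high support threshold, then keep halving the threshold until the best
    score F(g*) stops improving."""
    threshold, best_score, best_pattern = start_threshold, float("-inf"), None
    while threshold >= min_threshold:
        pattern, score = mine_best(threshold)
        if score <= best_score:
            break  # F(g*) converged: a lower threshold found nothing better
        best_pattern, best_score = pattern, score
        threshold /= 2
    return best_pattern, best_score

# Toy miner: lower thresholds expose better patterns until scores plateau.
scores = {0.4: ("g1", 0.5), 0.2: ("g2", 0.9), 0.1: ("g3", 0.9), 0.05: ("g4", 0.9)}
print(support_descending_search(lambda t: scores[t], 0.4))  # ('g2', 0.9)
```

Starting high keeps each mining pass cheap; descending only while the score improves avoids paying for low-support passes that cannot change the answer.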

Branch-and-Bound vs. LEAP

                     Branch-and-Bound                     LEAP
Pruning base         parent-child bound ("vertical"),     sibling similarity ("horizontal"),
                     strict pruning                       approximate pruning
Feature optimality   guaranteed                           near-optimal
Efficiency           good                                 better

NCI Anti-Cancer Screen Datasets (data description)

Name       Assay ID   Size      Tumor Description
MCF-7      83         27,770    Breast
MOLT-4                …,765     Leukemia
NCI-H23    1          40,353    Non-Small Cell Lung
OVCAR-8               …,516     Ovarian
P388                  …,472     Leukemia
PC-3       41         27,509    Prostate
SF-295                …,271     Central Nervous System
SN12C      145        40,004    Renal
SW-620                …,532     Colon
UACC-257              …,988     Melanoma
YEAST      167        79,601    Yeast anti-cancer

Efficiency Tests
- Search efficiency
- Search quality: G-test

Mining Quality: Graph Classification
(Table: average AUC and runtime of OA Kernel* vs. LEAP, and of both on 6x larger data, over MCF-7, MOLT-4, NCI-H23, OVCAR-8, P388, and PC-3.)
- OA Kernel has a scalability problem!
* OA Kernel: Optimal Assignment Kernel [Fröhlich et al., ICML'05]; LEAP: LEAP search

Graph Mining II
- Graph Classification
  - Graph pattern-based approach: subgraph patterns from data mining (LEAP)
  - Machine learning approaches: kernel-based approach, boosting
- Graph Clustering
  - Link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"
- Summary

Kernel-based Classification
- Random walk: Marginalized Kernels (Gärtner '02; Kashima et al. '02, ICML'03; Mahé et al., ICML'04)
- K(G, G') = Σ_{h,h'} p(h|G) p'(h'|G') K_path(h, h')
  - where h and h' are paths in graphs G and G', p and p' are probability distributions on paths, and K_path is a kernel between paths, e.g., a product of kernels on the labels along the paths
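A simplified, unnormalized cousin of the random-walk kernel counts label-matching common walks via the direct product graph. The sketch below omits the path probability distributions and the path kernel of the marginalized kernel, keeping only the shared walk-counting core.

```python
def walk_kernel(adj1, lab1, adj2, lab2, k=3):
    """Count label-matching common walks of length <= k between two labeled
    graphs by walking on the direct product graph (a sketch of the idea
    behind random-walk graph kernels, not the marginalized kernel itself)."""
    # nodes of the direct product graph: vertex pairs with equal labels
    nodes = [(u, v) for u in adj1 for v in adj2 if lab1[u] == lab2[v]]
    counts = {n: 1.0 for n in nodes}     # one length-0 walk per product node
    total = sum(counts.values())
    for _ in range(k):
        nxt = {n: 0.0 for n in nodes}
        for (u, v), c in counts.items():
            for u2 in adj1[u]:
                for v2 in adj2[v]:
                    if lab1[u2] == lab2[v2]:
                        nxt[(u2, v2)] += c   # extend every walk by one step
        counts = nxt
        total += sum(counts.values())
    return total

adj_a = {0: [1], 1: [0]}
lab_a = {0: "A", 1: "B"}
adj_b = {0: [1], 1: [0]}
lab_b = {0: "A", 1: "C"}
print(walk_kernel(adj_a, lab_a, adj_a, lab_a))  # 8.0 (identical graphs)
print(walk_kernel(adj_a, lab_a, adj_b, lab_b))  # 1.0 (only label A matches)
```

The marginalized kernel additionally weights each walk by its probability under a random-walk model on each graph, which keeps the (infinite) sum convergent.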

Kernel-based Classification
- Optimal local assignment (Fröhlich et al., ICML'05): match each vertex of one graph to a distinct vertex of the other so as to maximize the total vertex similarity
- Can be extended to include neighborhood information, e.g., by adding to the vertex kernel a damped sum of similarities between the neighborhoods of the matched vertices, where the neighborhood similarity could be an RBF kernel and λ is a damping parameter

Boosting in Graph Classification
- Decision stumps: simple classifiers in which the final decision is made by a single feature. A rule is a tuple (t, y): if a molecule contains substructure t, it is classified as y.
- Gain: find the decision stump that maximizes gain on the weighted training data
- Applying boosting: reweight the training graphs and add one stump per round

Boosting an Associative Classifier [Sun et al., TKDE'06]
- Apply AdaBoost to associative classification with low-order rules
- Three weighting strategies for combining classifiers
  - Classifier-based weighting (AdaBoost)
  - Sample-based weighting (evaluated to be the best)
  - Hybrid weighting

Graph Classification with Boosting [Kudo, Maeda and Matsumoto, NIPS'04]
- Decision stump (t, y): if a molecule x contains subgraph t, it is classified as y
- Gain: find a decision stump (subgraph) that maximizes gain
- Boosting with a weight vector over the training graphs
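An AdaBoost-style loop over such stumps can be sketched on binary substructure-presence features. The reweighting scheme below is standard AdaBoost, not necessarily the exact gain definition of Kudo et al., and the feature matrix is a hypothetical input (in practice each column corresponds to containment of a mined subgraph).

```python
import math

def adaboost_stumps(X, y, rounds=3):
    """AdaBoost with decision stumps over binary substructure features.
    X[i][j] = 1 if graph i contains substructure j; y[i] in {-1, +1}.
    Each stump (j, s) predicts s if feature j is present, else -s."""
    n, m = len(X), len(X[0])
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        for j in range(m):
            for s in (+1, -1):
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if (s if xi[j] else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, j, s)
        err, j, s = best
        err = max(err, 1e-10)                       # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, s))
        # reweight: misclassified examples gain weight, then renormalize
        w = [wi * math.exp(-alpha * yi * (s if xi[j] else -s))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (s if x[j] else -s) for alpha, j, s in ensemble)
    return 1 if score >= 0 else -1

# Feature 0 (presence of some subgraph) perfectly separates the toy data.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [+1, +1, -1, -1]
model = adaboost_stumps(X, y)
print([predict(model, x) for x in X])  # [1, 1, -1, -1]
```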

Graph Mining II
- Graph Classification
- Graph Clustering
  - Graph similarity measures
    - Feature-based similarity: each graph is represented as a feature vector; similarity is defined by the distance between the corresponding vectors; frequent subgraphs can be used as features
    - Structure-based similarity: maximal common subgraph; graph edit distance (insertion, deletion, and relabeling); graph alignment distance
  - Graph/network clustering: a link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"

Graph Compression Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes

Graph/Network Clustering Problem
X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", Proc. ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD'07), San Jose, CA, Aug. 2007
- Networks made up of the mutual relationships of data elements usually have an underlying structure
- Because relationships are complex, it is difficult to discover these structures. How can the structure be made clear?
- Given simply the information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?

An Example of Networks
- How many clusters? What size should they be?
- What is the best partitioning?
- Should some points be segregated?

A Social Network Model
- Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group.
- Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups.
- Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group.

The Neighborhood of a Vertex
- Define Γ(v) as the immediate neighborhood of a vertex v (i.e., the set of people that an individual knows; in SCAN, Γ(v) also includes v itself).

Structure Similarity
- The desired features tend to be captured by a measure we call structural similarity:
  σ(v, w) = |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| · |Γ(w)|)
- Structural similarity is large for members of a clique and small for hubs and outliers.
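Under SCAN's definition, with Γ(v) taken as the closed neighborhood (v's neighbors plus v itself), structural similarity can be computed directly:

```python
import math

def neighborhood(adj, v):
    """Closed neighborhood Gamma(v): v's neighbors plus v itself."""
    return set(adj[v]) | {v}

def structural_similarity(adj, v, w):
    """sigma(v, w) = |Gamma(v) & Gamma(w)| / sqrt(|Gamma(v)| * |Gamma(w)|)."""
    nv, nw = neighborhood(adj, v), neighborhood(adj, w)
    return len(nv & nw) / math.sqrt(len(nv) * len(nw))

# Triangle 0-1-2 plus a pendant vertex 3 attached to 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(round(structural_similarity(adj, 1, 2), 3))  # 1.0: clique members
print(round(structural_similarity(adj, 0, 3), 3))  # 0.707: hub-like to pendant
```

As the slide states, the measure peaks inside cliques (identical neighborhoods give σ = 1) and drops for hub or outlier pairs whose neighborhoods barely overlap.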

Structural Connectivity [1]
- ε-Neighborhood: N_ε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}
- Core: a vertex v is a core if |N_ε(v)| ≥ μ
- Direct structure reachable: w is directly structure reachable from a core v if w ∈ N_ε(v)
- Structure reachable: the transitive closure of direct structure reachability
- Structure connected: u and v are structure connected if there is a vertex w such that both u and v are structure reachable from w
[1] M. Ester, H. P. Kriegel, J. Sander, and X. Xu (KDD'96), "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise"

Structure-Connected Clusters
- A structure-connected cluster C satisfies
  - Connectivity: any two vertices in C are structure connected
  - Maximality: if v ∈ C and w is structure reachable from v, then w ∈ C
- Hubs: belong to no cluster, but bridge many clusters
- Outliers: belong to no cluster and connect to fewer clusters
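Putting the definitions together, a minimal SCAN-style clustering sketch (not the optimized published algorithm) is:

```python
import math
from collections import deque

def scan(adj, eps=0.7, mu=2):
    """Minimal SCAN sketch: grow clusters from cores (vertices with at least
    mu eps-similar neighbors); unclustered vertices whose neighbors span
    multiple clusters are hubs, the rest are outliers."""
    def gamma(v):
        return set(adj[v]) | {v}
    def sigma(v, w):
        return len(gamma(v) & gamma(w)) / math.sqrt(len(gamma(v)) * len(gamma(w)))
    def eps_nbrs(v):
        return {w for w in gamma(v) if sigma(v, w) >= eps}

    cluster, cid = {}, 0
    for v in adj:
        if v in cluster or len(eps_nbrs(v)) < mu:
            continue                       # already clustered, or not a core
        cid += 1
        queue = deque([v])
        while queue:                       # BFS over structure reachability
            u = queue.popleft()
            if len(eps_nbrs(u)) < mu:
                continue                   # only cores expand the cluster
            for w in eps_nbrs(u):
                if w not in cluster:
                    cluster[w] = cid
                    queue.append(w)

    hubs, outliers = set(), set()
    for v in adj:
        if v not in cluster:
            cids = {cluster[w] for w in adj[v] if w in cluster}
            (hubs if len(cids) > 1 else outliers).add(v)
    return cluster, hubs, outliers

# Two triangles bridged by vertex 6, with a pendant vertex 7.
adj_ex = {0: [1, 2], 1: [0, 2], 2: [0, 1, 6], 3: [4, 5, 6],
          4: [3, 5], 5: [3, 4], 6: [2, 3, 7], 7: [6]}
clusters, hubs, outliers = scan(adj_ex, eps=0.7, mu=3)
print(hubs, outliers)  # {6} {7}
```

With these (illustrative) parameters the two triangles come out as separate clusters, the bridge vertex 6 is labeled a hub, and the pendant vertex 7 an outlier, matching the hub/outlier definitions above.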

Algorithm (μ = 2, ε = 0.7): illustrated step by step on the example network

Running Time
- Running time = O(|E|); for sparse networks, O(|V|)
[2] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70 (2004).

Summary: Graph Classification and Clustering
- Graph Classification
  - Graph pattern-based approach: subgraph patterns from data mining (LEAP)
  - Machine learning approaches: kernel-based approach, boosting
- Graph Clustering
  - Link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"
- Lots more to be explored

References (1)
- G. Cong, K. Tan, A. Tung, and X. Xu. Mining Top-k Covering Rule Groups for Gene Expression Data, SIGMOD'05.
- M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent Substructure-Based Approaches for Classifying Chemical Compounds, TKDE'05.
- G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences, KDD'99.
- G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns, DS'99.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.), John Wiley & Sons.
- W. Fan, K. Zhang, H. Cheng, J. Gao, X. Yan, J. Han, P. S. Yu, and O. Verscheure. Direct Mining of Discriminative and Essential Graphical and Itemset Features via Model-based Search Tree, KDD'08.
- D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Community Mining from Multi-Relational Networks, PKDD'05.
- H. Fröhlich, J. Wegner, F. Sieker, and A. Zell. Optimal Assignment Kernels for Attributed Molecular Graphs, ICML'05.
- T. Gärtner, P. Flach, and S. Wrobel. On Graph Kernels: Hardness Results and Efficient Alternatives, COLT/Kernel'03.
- H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized Kernels Between Labeled Graphs, ICML'03.
- T. Kudo, E. Maeda, and Y. Matsumoto. An Application of Boosting to Graph Classification, NIPS'04.
- C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining Behavior Graphs for "Backtrace" of Noncrashing Bugs, SDM'05.
- P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert. Extensions of Marginalized Graph Kernels, ICML'04.

References (2)
- T. Horvath, T. Gärtner, and S. Wrobel. Cyclic Pattern Kernels for Predictive Graph Mining, KDD'04.
- T. Kudo, E. Maeda, and Y. Matsumoto. An Application of Boosting to Graph Classification, NIPS'04.
- W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, ICDM'01.
- B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining, KDD'98.
- H. Liu, J. Han, D. Xin, and Z. Shao. Mining Frequent Patterns on Very High Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
- S. Nijssen and J. Kok. A Quickstart in Frequent Structure Mining Can Make a Difference, KDD'04.
- F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki. CARPENTER: Finding Closed Patterns in Long Biological Datasets, KDD'03.
- F. Pan, A. Tung, G. Cong, and X. Xu. COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery, SSDBM'04.
- Y. Sun, Y. Wang, and A. K. C. Wong. Boosting an Associative Classifier, TKDE'06.
- P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns, KDD'02.
- R. Ting and J. Bailey. Mining Minimal Contrast Subgraph Patterns, SDM'06.
- N. Wale and G. Karypis. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification, ICDM'06.
- H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by Pattern Similarity in Large Data Sets, SIGMOD'02.
- J. Wang and G. Karypis. HARMONY: Efficiently Mining the Best Rules for Classification, SDM'05.
- X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A Structural Clustering Algorithm for Networks, KDD'07.
- X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining Significant Graph Patterns by Scalable Leap Search, SIGMOD'08.
