Hierarchical Agglomerative Clustering on graphs


Hierarchical Agglomerative Clustering on graphs Sushmita Roy sroy@biostat.wisc.edu Computational Network Biology Biostatistics & Medical Informatics 826 Computer Sciences 838 https://compnetbiocourse.discovery.wisc.edu Nov 3rd 2016

RECAP
Biological networks are modular. Modules can be topological or functional. Two questions about the modularity of biological systems: Is a graph modular? What are the modules? Graph clustering algorithms are used to identify modules. Clustering algorithms can be "flat" or "hierarchical". Girvan-Newman algorithm: remove edges with high edge betweenness to identify community structure.

RECAP continued
Graph clustering partitions vertices into groups based on their interaction patterns. Such partitions can be indicative of specific properties of the vertices, and the groups can form communities. (Figure: a graph partitioned into Communities 1, 2, and 3.)

Goals for today Hierarchical agglomerative clustering on graphs Stochastic block model Hierarchical random graph model Combining the two models Spectral clustering (if time permits)

Issues with existing graph clustering models
Flat clustering (stochastic block model): the number of clusters must be specified in advance and may be set too high or too low; the true cluster structure might be hierarchical.
Hierarchical clustering: at the top level, only one cluster is found, which might group unrelated clusters; at the bottom level, clusters might be too small.

Hierarchical Agglomerative Clustering (HAC)
HAC is a deterministic, hierarchical "agglomerative" clustering algorithm that uses a different measure to define modules: starting from individual graph nodes, it gradually builds clusters of increasing size. The HAC algorithm was developed to overcome issues of existing graph clustering algorithms. Park and Bader, BMC Bioinformatics 2011.

HAC algorithm main idea Extend the flat stochastic block model to a hierarchical model Use maximum likelihood (or Bayesian) estimation of model parameters We will focus on the ML model

Stochastic block model (SBM)
Let G={V,E} be a graph with vertex set V and edge set E. Suppose each vertex v in V can belong to one of K groups, and let M denote the assignment of the vertices to the K groups. The SBM is a probabilistic model describing such a flat group structure on the graph, P(M). It is defined by parameters for within- and between-group interactions:
θii: probability of an interaction between vertices of the same group i
θij: probability of an interaction between vertices of two different groups i and j

Probability of a grouping structure from the stochastic block model
The probability of the edge and hole pattern between groups i and j is θij^eij (1 − θij)^hij, where
eij: number of edges between groups i and j
tij: number of possible edges between groups i and j
hij: number of holes between groups i and j, hij = tij − eij
The probability of the edge-hole pattern M over all groups is the product over group pairs: P(M) = ∏_{i ≤ j} θij^eij (1 − θij)^hij
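As a concrete illustration, here is a minimal Python sketch (not the paper's code; the function names and data layout are my own) that counts edges and holes between two groups and evaluates the SBM edge-hole probability for a given grouping:

```python
import itertools
import math

# Minimal sketch (not the paper's code): edge/hole counts between two
# groups, and the SBM log-probability of the full edge-hole pattern P(M).
# `groups` maps a group id to a set of vertices; `edges` is a set of
# frozensets {u, v}; `theta[(i, j)]` holds the interaction probability
# for the (unordered, i <= j) group pair.

def block_counts(groups, edges, i, j):
    """Return (e_ij, t_ij, h_ij) for groups i and j."""
    gi, gj = groups[i], groups[j]
    if i == j:
        t = len(gi) * (len(gi) - 1) // 2     # possible within-group edges
    else:
        t = len(gi) * len(gj)                # possible between-group edges
    e = sum(1 for u, v in itertools.product(gi, gj)
            if u != v and frozenset((u, v)) in edges)
    if i == j:
        e //= 2                              # within-group pairs were counted twice
    return e, t, t - e

def log_prob_grouping(groups, edges, theta):
    """log P(M) = sum over group pairs of e*log(theta) + h*log(1 - theta).
    Assumes every needed theta is strictly between 0 and 1."""
    ids = sorted(groups)
    logp = 0.0
    for a in range(len(ids)):
        for b in range(a, len(ids)):
            i, j = ids[a], ids[b]
            e, t, h = block_counts(groups, edges, i, j)
            logp += e * math.log(theta[(i, j)]) + h * math.log(1 - theta[(i, j)])
    return logp
```

Working in log space avoids underflow when the products over many group pairs become very small.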

Stochastic block model for flat clustering with K=3 groups (figure: Group 1, 7 vertices; Group 2, 6 vertices; Group 3, 7 vertices)

Maximum likelihood parameter estimation in HAC
Suppose we know what the cluster memberships are. The parameters of HAC are the θij, and their maximum likelihood estimates are θ̂ij = eij / tij. The probability of edges/holes between groups i and j can then be rewritten as (eij/tij)^eij (hij/tij)^hij. Hence, everything can be expressed in terms of the edge presence and absence pattern.
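Plugging the ML estimate back in gives a "profile" log-likelihood per block that depends only on the counts e and t. A small sketch (the helper names are my own):

```python
import math

# Sketch of the ML estimate for one block and the resulting "profile"
# log-likelihood, which depends only on the counts e and t.

def theta_ml(e, t):
    """ML estimate: the fraction of possible edges that are present."""
    return e / t

def block_loglik(e, t):
    """e*log(e/t) + h*log(h/t), with the convention 0*log(0) = 0."""
    h = t - e
    ll = 0.0
    if e > 0:
        ll += e * math.log(e / t)
    if h > 0:
        ll += h * math.log(h / t)
    return ll
```

Note that a block that is completely full (e = t) or completely empty (e = 0) contributes zero log-likelihood, which is why only the counts matter.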

Hierarchical Random Graph (HRG) model
Each internal tree node r has a left subtree L(r) and a right subtree R(r). The probability of the edges (er) and holes (hr) between the vertices under L(r) and those under R(r) is θr^er (1 − θr)^hr, where
er: number of edges crossing between L(r) and R(r)
hr: number of holes between L(r) and R(r)

Hierarchical Random Graph model continued
The HRG also describes a probabilistic model over the entire hierarchy: the likelihood is the product of θr^er (1 − θr)^hr over all internal tree nodes r.
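The per-node terms can be sketched as follows (a hedged illustration; the nested-tuple tree encoding is my own choice, and the ML value θr = er/tr is plugged into each node's term):

```python
import math

# Hedged sketch of the HRG likelihood. A dendrogram is encoded as a
# nested tuple: a leaf is a vertex label; an internal node r is a pair
# (left_subtree, right_subtree). Uses the convention 0*log(0) = 0.

def leaves(tree):
    """Set of vertex labels under a (sub)tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def hrg_loglik(tree, edges):
    """Sum over internal nodes r of e_r*log(e_r/t_r) + h_r*log(h_r/t_r)."""
    if not isinstance(tree, tuple):
        return 0.0
    L, R = leaves(tree[0]), leaves(tree[1])
    t = len(L) * len(R)                      # possible crossing edges
    e = sum(1 for u in L for v in R if frozenset((u, v)) in edges)
    h = t - e
    ll = (e * math.log(e / t) if e else 0.0) + (h * math.log(h / t) if h else 0.0)
    return ll + hrg_loglik(tree[0], edges) + hrg_loglik(tree[1], edges)
```

Each internal node only "sees" the edges that cross between its two subtrees, so every edge of the graph is accounted for exactly once.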

What if we did not want to merge up to the very top? This is what the HAC model tries to do. The probability of interactions between two top-level groups r and r' is defined in the same way as in the flat model; the probability of interactions within the subtree of a tree node r is defined in the same way as in the hierarchical model.

Combining the HRG and the SB model (figure: two top-level groups, i and j)

Agglomerative clustering in HAC
The idea of the HAC algorithm is to use agglomerative clustering: build the hierarchy bottom up, but stop if a merge of trees is not good enough. Each pair of top-level clusters is considered, and a merge is performed only if the corresponding model score improves. What is a good enough merge?

Define the score of a merge
Assume there are K top-level clusters. If two top-level nodes 1 and 2 are merged into a new node 1', the change in log-likelihood before and after the merge is used as the score of the merge. This score enables us to develop a greedy agglomerative algorithm.
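One way to read this score in counts-only form (my reading, not a quote of the paper): merging top-level clusters i and j pools their blocks against every other top-level cluster k, while the (i, j) block becomes the internal node of the new tree and so contributes the same term before and after. A sketch:

```python
import math

# Hedged sketch of a merge score in counts-only form. `e` and `t` map
# unordered top-cluster pairs to edge counts and possible-edge counts.

def profile_ll(e, t):
    """e*log(e/t) + h*log(h/t), with 0*log(0) = 0."""
    h = t - e
    out = 0.0
    if e:
        out += e * math.log(e / t)
    if h:
        out += h * math.log(h / t)
    return out

def merge_score(i, j, tops, e, t):
    """Change in log-likelihood if top-level clusters i and j are merged."""
    def key(a, b):
        return (a, b) if (a, b) in e else (b, a)
    score = 0.0
    for k in tops:
        if k in (i, j):
            continue
        eik, tik = e[key(i, k)], t[key(i, k)]
        ejk, tjk = e[key(j, k)], t[key(j, k)]
        # pooled block (r, k) after the merge vs. separate blocks before
        score += (profile_ll(eik + ejk, tik + tjk)
                  - profile_ll(eik, tik) - profile_ll(ejk, tjk))
    return score
```

Pooling two blocks with very different densities lowers the likelihood, so such merges score negatively, which is what makes the score usable as a stopping criterion.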

HAC greedy algorithm
Input: graph G=(V,E)
Initialize top-level clusters: top = {{v} : v ∈ V}
Initialize K = |V|
While K > 1 do:
  Find the top-level clusters i and j with the maximum merge score
  Merge i and j into a new top-level cluster r; add r to top, with L(r) = {i} and R(r) = {j}
  Remove i and j from top
  K = K − 1
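The loop above can be sketched in Python (an illustrative skeleton, not the authors' implementation; it assumes integer vertex labels and takes the merge score as a pluggable function):

```python
import itertools

# Illustrative skeleton of the greedy HAC loop. `score(ci, cj, edges)`
# is a pluggable merge score for two clusters given as vertex sets.

def hac(vertices, edges, score):
    """Return the final hierarchy as nested (left, right) tuples."""
    top = {v: frozenset([v]) for v in vertices}   # cluster id -> member vertices
    tree = {v: v for v in vertices}               # cluster id -> subtree
    next_id = max(vertices) + 1                   # fresh ids for merged nodes
    while len(top) > 1:
        # find the pair of top-level clusters i, j with maximum score
        i, j = max(itertools.combinations(top, 2),
                   key=lambda p: score(top[p[0]], top[p[1]], edges))
        tree[next_id] = (tree[i], tree[j])        # L(r) = i, R(r) = j
        top[next_id] = top[i] | top[j]            # merge i and j into r
        del top[i], top[j], tree[i], tree[j]      # remove i and j from top
        next_id += 1                              # K decreases by one
    return tree[next_id - 1]
```

A variant would stop as soon as the best available score turns negative, matching the "good enough merge" idea above; this skeleton follows the slide's pseudocode and merges all the way to K = 1.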

HAC algorithm execution
Let's trace one iteration of HAC on a small example graph with 5 nodes, V={1,2,3,4,5}, and the edges shown in the example figure. Initialization: top = {{1},{2},{3},{4},{5}}, K = 5.

HAC algorithm iteration 1
Compute scores for all potential merges into a new node 6. Merge the pair of nodes with the highest score (the example merges 1 and 2). Update the hierarchy: L(6)=1, R(6)=2. Update top = {3, 4, 5, 6}, K = 4. Repeat with the new top.

Experiments
Compare HAC with other algorithms: Fast Modularity (CNM), Variational Bayes Modularity (VBM), Graph Diffusion Kernel (GDK). Performance is assessed via link prediction:
Randomly remove a set of test edges
Estimate within- and between-cluster interaction probabilities using the remaining edges
Generate confidence scores for all pairs of nodes within and between clusters, using the cluster structure learned from the training data
Rank links by confidence score and compute AUPR and F-score
Note: this assumes that the test nodes do not disappear from the training set.
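The protocol can be sketched as follows (assumed details, with hypothetical helper names: the confidence for a candidate pair is taken to be the training-edge density of its cluster-pair block):

```python
import itertools

# Hedged sketch of the link-prediction protocol: score each non-training
# pair by the training-edge density of its (cluster, cluster) block.
# `membership` maps a vertex to its cluster id; `groups` maps a cluster
# id to its vertex set; edges are frozensets {u, v}.

def pair_confidence(u, v, membership, train_edges, groups):
    """Density of the (cluster(u), cluster(v)) block in the training graph."""
    i, j = sorted((membership[u], membership[v]))
    gi, gj = groups[i], groups[j]
    if i == j:
        pairs = [frozenset(p) for p in itertools.combinations(gi, 2)]
    else:
        pairs = [frozenset((a, b)) for a in gi for b in gj]
    e = sum(1 for p in pairs if p in train_edges)
    return e / len(pairs) if pairs else 0.0

def rank_candidates(vertices, membership, groups, train_edges):
    """Rank all non-training pairs by confidence (highest first)."""
    cands = [frozenset(p) for p in itertools.combinations(sorted(vertices), 2)
             if frozenset(p) not in train_edges]
    return sorted(cands,
                  key=lambda p: pair_confidence(*sorted(p), membership,
                                                train_edges, groups),
                  reverse=True)
```

Precision and recall (and hence AUPR and F-score) would then be computed by comparing the top of this ranking against the held-out test edges.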

Assessing performance based on link prediction: generate test and training data; cluster, then rank all candidate edges (edge thickness in the figure indicates confidence); assess precision and recall on the test set.

HAC-ML outperforms the other algorithms. The first number is the average F-score over multiple experiments; the number following the ± sign is the standard deviation of the last digit (multiplied by 100).

Multi-resolution analysis of yeast PPI network

Take-away points
HAC is a bottom-up hierarchical clustering algorithm that combines the SBM and the Hierarchical Random Graph model.
Strengths: the greedy probabilistic score for merging subtrees is fast; good performance on link prediction; provides both hierarchically organized relationships and flat-level clusters.
Weaknesses: evaluation was based on link prediction, and cluster quality based on edge density might be a better measure; relies on local edge connectivity patterns.