Samir Khuller University of Maryland Joint Work with Barna Saha, Allie Hoch, Louiqa Raschid, Xiao-Ning Zhang RECOMB 2010.

Presentation transcript:

Sitting in a talk on community detection...
- How do we define a community?
- Perhaps we want to capture a group of individuals with strong interactions within the group?

Density
Subgraph density = (sum of weights of edges in the induced subgraph) / (number of nodes in the induced subgraph)
- The density of {1,2,3,4,5,6,7} = 9/7 ≈ 1.29
- The density of {1,2,3,4} = 6/4 = 1.5
- The densest subgraph is {1,2,3,4}.
How do we compute the densest subgraph? Surprisingly, this can be solved optimally in polynomial time! [Goldberg 84, Lawler 76, Queyranne 75, GGT] Extends to weighted graphs.
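The density definition can be checked with a few lines of Python. This is a minimal sketch: the slide's figure is not in the transcript, so the edge list below is a hypothetical graph (a 4-clique on nodes 1 to 4 plus a path 4-5-6-7, unit weights) chosen only because it reproduces the slide's values 6/4 and 9/7. The function name `density` is also an assumption.

```python
from fractions import Fraction

def density(nodes, weighted_edges):
    """Sum of edge weights inside the induced subgraph, divided by its node count."""
    nodes = set(nodes)
    total = sum(w for u, v, w in weighted_edges if u in nodes and v in nodes)
    return Fraction(total, len(nodes))

# Hypothetical graph (not the slide's figure): K4 on 1..4 plus the path 4-5-6-7.
EDGES = [(1, 2, 1), (1, 3, 1), (1, 4, 1), (2, 3, 1), (2, 4, 1), (3, 4, 1),
         (4, 5, 1), (5, 6, 1), (6, 7, 1)]

print(density({1, 2, 3, 4}, EDGES))   # 3/2
print(density(range(1, 8), EDGES))    # 9/7
```

Exact fractions are used so that densities can be compared without rounding.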

Density
Subgraph density = (sum of weights of edges in the induced subgraph) / (number of nodes in the induced subgraph)
Density of entire graph is 13/8 > 1.5

Are Dense Subgraphs Useful?
- Are all dense subgraphs meaningful?
  - How do we allow some control over the kind of dense subgraphs that are found?
  - Putting size constraints makes the problem intractable immediately.
- Densest subgraph of size >= k: NP-hard, 2-approximation [Andersen] [Khuller, Saha]. Greedily union densest subgraphs.
- Densest subgraph of size <= k (or = k): NP-hard, and some approximations are known [Feige, Kortsarz, Peleg] [Charikar et al].

- Goldberg’s algorithm: a new flow network is created with “directed” edges.
  - An s-t min cut is computed in order to find the densest subgraph after guessing the max density. Nodes on the “s” side of the cut are part of a densest subgraph.
  - GGT speeds everything up to a single flow computation!
- Lawler’s algorithm: slightly different flow construction, more intuitive.
- Greedy algorithm: repeatedly delete low-degree nodes. Gives a 2-approximation to density! Fast!
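The greedy algorithm can be sketched as follows. This is an illustrative version for simple unweighted graphs, not the authors' code: it removes a minimum-degree node at each step and remembers the densest intermediate node set. The function name `greedy_peel` and the example graph (a 4-clique plus a path) are assumptions.

```python
from fractions import Fraction

def greedy_peel(nodes, edges):
    """Repeatedly delete a minimum-degree node; return the densest set seen and its density."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(nodes)
    m = len(edges)  # number of edges still inside `remaining`
    best_set, best_density = set(remaining), Fraction(m, len(remaining))
    while len(remaining) > 1:
        # Ties are broken by node label so the run is deterministic.
        v = min(remaining, key=lambda x: (len(adj[x]), x))
        for u in adj[v]:
            adj[u].discard(v)
        m -= len(adj[v])
        del adj[v]
        remaining.remove(v)
        d = Fraction(m, len(remaining))
        if d > best_density:
            best_set, best_density = set(remaining), d
    return best_set, best_density

# Hypothetical graph: K4 on 1..4 plus the path 4-5-6-7.
NODES = range(1, 8)
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 7)]
print(greedy_peel(NODES, EDGES))
```

On this small graph the peeling order is 7, 6, 5, after which the 4-clique remains, so the greedy answer coincides with the optimum. In general it is only guaranteed to be within a factor of 2.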

Background: What is a min cut?
[Figure: a cut separating the source side V1 from the sink side V2.]
We use s-t min cuts.

Background: Finding the Densest Subgraph
- Original graph: [figure]

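Background: Finding the Densest Subgraph
[Figure: flow network with a source and a sink; one edge label survives only as "= 7 + 2*".]
- Edges from source to nodes: m’ = sum of all edges in the graph
- Edge from node i to sink: m’ + 2g - d(i)
- g = guess = 2
- Cut = m’|V| + 2|V1|(g - D1), where V1 is the set of source-side nodes and D1 is the density of the subgraph induced by V1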

Background: Finding the Densest Subgraph
[Figure: flow network with a source and a sink.]
- Edges from source to nodes: m’ = sum of all edges in the graph
- Edge from node i to sink: m’ + 2g - d(i)
- g = guess = 2
- Cut = m’|V| + 2|V1|(g - D1), where V1 is the set of source-side nodes
- Since the cut is < 21, the guess is too low.

Background: Finding the Densest Subgraph
[Figure: flow network with a source and a sink.]
- Edges from source to nodes: m’ = sum of all edges in the graph
- Edge from node i to sink: m’ + 2g - d(i)
- g = guess = 4
- Since the min cut is the trivial cut (and unique), the guess is too high.

Background: Finding the Densest Subgraph
[Figure: flow network with a source and a sink; the fractional edge labels are not recoverable.]
- Edges from source to nodes: m’ = sum of all edges in the graph
- Edge from node i to sink: m’ + 2g - d(i)
- g = guess = 2 1/3
- Since the cut is smaller than 21, the guess is too low.

Background: Finding the Densest Subgraph
[Figure: flow network with a source and a sink.]
- Edges from source to nodes: m’ = sum of all edges in the graph
- Edge from node i to sink: m’ + 2g - d(i)
- g = guess = 2 1/2
- Since the cut has value 21 and V1 is not empty, the guess is correct!
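The guess-and-cut procedure walked through above can be sketched in Python. This is an illustrative version under stated assumptions, not the authors' implementation: it handles unweighted graphs only (every original edge has capacity 1 in both directions, and m’ is the number of edges), it uses a plain Edmonds-Karp max flow written inline, and it binary-searches the guess g until the interval is shorter than 1/(n(n-1)), the stopping rule from Goldberg's analysis. The function names and the node labels "s" and "t" are hypothetical.

```python
from collections import deque
from fractions import Fraction

def source_side_of_min_cut(cap, s, t):
    """Edmonds-Karp on residual capacities `cap`; returns nodes reachable from s at the end."""
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)  # no augmenting path: this is the source side of a min cut
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck

def goldberg_network(nodes, edges, g):
    """Source->i has capacity m, i->sink has capacity m + 2g - d(i), original edges have capacity 1."""
    m = len(edges)
    deg = {v: 0 for v in nodes}
    cap = {v: {} for v in nodes}
    cap["s"], cap["t"] = {}, {}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        cap[u][v] = Fraction(1)
        cap[v][u] = Fraction(1)
    for v in nodes:
        cap["s"][v], cap[v]["s"] = Fraction(m), Fraction(0)
        cap[v]["t"], cap["t"][v] = m + 2 * g - deg[v], Fraction(0)
    return cap

def densest_subgraph(nodes, edges):
    """Binary search on the guess g; a non-empty source side means the guess was too low."""
    nodes = list(nodes)
    n, m = len(nodes), len(edges)
    low, high, best = Fraction(0), Fraction(m), set()
    while high - low >= Fraction(1, n * (n - 1)):
        g = (low + high) / 2
        side = source_side_of_min_cut(goldberg_network(nodes, edges, g), "s", "t") - {"s"}
        if side:
            low, best = g, side
        else:
            high = g
    return best

# Hypothetical graph: K4 on 1..4 plus the path 4-5-6-7.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 7)]
print(densest_subgraph(range(1, 8), EDGES))
```

Exact `Fraction` capacities avoid floating-point ties when the guess is close to the true density. The sketch assumes a graph with at least two nodes and one edge, and node labels other than "s" and "t".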

What can we do with this weapon?
- Consider gene annotation data from TAIR.
- For large networks we can use the fast greedy approximation (it gave us the densest subgraph every time!).

- Biological knowledge can often be represented as graphs
  - Protein interactions, metabolic pathways, gene regulation, gene annotation
[Figure: the exosome, with nodes RRP43, RRP4, RRP42, RRP46, RRP45, SK16, DIS3.]

- Dense regions in graphs may represent useful information
  - Many previous works cluster protein interaction networks, metabolic networks, etc. to find dense regions. And many more...

- Dense regions in graphs may represent useful information
[Figure (B): MIPS complex in the yeast network Y 11k and the matching complex obtained by the densest subgraph algorithm.]
King, A.D., Przulj, N. and Jurisica, I., Protein complex prediction via cost-based clustering. Bioinformatics, 2004

TAIR Annotation Example gene annotations

GO-(gene)-PO tripartite graph for gene AT1G15550 (GA4)
GO terms: gibberellin 3-beta-dioxygenase activity; gibberellin biosynthetic process; response to gibberellin stimulus; response to red or far red light; transcription factor binding; response to red light; gibberellic acid mediated signalling; cytoplasm
PO terms: embryo axis; flower; root; fruit; ovary placenta; shoot apical meristem; cotyledon; receptacle; root vascular system; rosette leaf; sepal vascular system; stem; stem node; embryo; terminal floral bud; leaf; germination; seedling growth; filament

GO Ontology
GO terms: gibberellin biosynthetic process; response to gibberellin stimulus; response to red or far red light; response to red light; gibberellic acid mediated signalling; biological process

PO Ontology
PO terms: embryo axis; flower; root; fruit; ovary placenta; shoot apical meristem; cotyledon; receptacle; root vascular system; rosette leaf; sepal vascular system; stem; stem node; embryo; terminal floral bud; leaf; filament; plant structure

Gene Annotation Graph  Construct graphs for each gene using their GO, PO annotations  Combine the graphs of several genes into one single weighted graph Gene 1 Gene 2 Gene 3 Gene 4 GO 1 GO 2 GO 3 GO 4 PO 1 PO 2 PO 3 PO 4

The Problem
- Scientists like to find patterns in gene annotation graphs, but these are huge!
- Need to allow some control over the kind of patterns that are computed
- Would like to find biologically meaningful patterns
[Figure: genes 1 to 4 linked to GO terms 1 to 4 and PO terms 1 to 4, with a node and an edge labelled.]

GO-(gene)-PO tripartite graph for gene AT1G15550 (GA4)
GO terms: gibberellin 3-beta-dioxygenase activity; gibberellin biosynthetic process; response to gibberellin stimulus; response to red or far red light; transcription factor binding; response to red light; gibberellic acid mediated signalling; cytoplasm
PO terms: embryo axis; flower; root; fruit; ovary placenta; shoot apical meristem; cotyledon; receptacle; root vascular system; rosette leaf; sepal vascular system; stem; stem node; embryo; terminal floral bud; leaf; germination; seedling growth; filament

GO-PO bipartite graph
GO terms: gibberellin 3-beta-dioxygenase activity; gibberellin biosynthetic process; response to gibberellin stimulus; response to red or far red light; transcription factor binding; response to red light; gibberellic acid mediated signalling; cytoplasm
PO terms: embryo axis; flower; root; fruit; ovary placenta; shoot apical meristem; cotyledon; receptacle; root vascular system; rosette leaf; sepal vascular system; stem; stem node; embryo; terminal floral bud; leaf; germination; seedling growth; filament

Gene Annotation Graph  Construct complete bipartite graph for each gene using their GO, PO annotations  Combine the bipartite graphs of several genes into one single weighted graph GO 1 GO 2 GO 3 GO 4 PO 1 PO 2 PO 3 PO

- Cliques: these might give us some biological information, but this is a stringent requirement.
- Moreover, clique finding is well known to be really hard (NP-hard, hard to approximate).
- Why not look for “dense regions”?
- Note that the notion of density could be defined for hyper-edges as well, but for our purposes this does not do as well.

Dense Subgraphs in Gene Annotation Graph
- A collection of GO-PO terms that appear together in the underlying genes.
[Figure: GO terms 1 to 4 linked to PO terms 1 to 4.]
(GO3,PO1), (GO3,PO2), (GO3,PO4), (GO4,PO1), (GO4,PO2), (GO4,PO4) appear frequently in the 4 genes

Restrictions in dense subgraph computation
- Are all dense subgraphs biologically meaningful?
  - How do we allow some control over the kind of dense subgraphs that are computed?
  - In fact we can impose both restrictions at the same time!
- Distance Restricted: GO terms, and similarly PO terms, that appear must be biologically related
- Subset Restricted: certain GO, PO terms must appear in the returned subgraph

Restrictions in dense subgraph computation
- Are all dense subgraphs biologically meaningful?
  - How do we allow some control over the kind of dense subgraphs that are found?
- Distance Restricted: GO terms that appear in the densest subgraph must be close in the GO ontology graph, and similarly for the PO terms
- Subset Restricted

- Distance threshold = 1
- This means that some sets of nodes are not allowed to coexist in the final solution: {GO1,GO2}, {GO1,GO4}, {PO1,PO4}, {PO1,PO2}, {PO2,PO3}.
- The final solution is {GO2, GO3, GO4, PO2, PO4}, which has a density of 0.8.
[Figure: the GO-PO graph on GO 1 to GO 4 and PO 1 to PO 4, with the GO and PO ontology graphs.]

- For arbitrary ontology graph structure
  - NP-hard even to approximate reasonably (reduction from the independent set problem)
  - A factor-2 relaxation of the distance threshold is enough to get a solution with density as high as the optimum
- Trees, interval graphs, graphs in which each edge participates in a small number of cycles
  - Polynomial-time algorithm to compute the optimum

GO-Ontology PO-Ontology Distance Threshold=

Guess two nodes in each ontology that appear in the optimum solution and are at maximum distance from each other. [Figure: GO ontology and PO ontology.]

Distance threshold = 2. Compute all the nodes that are within the distance threshold of both guessed nodes. [Figure: GO ontology and PO ontology.]

Distance threshold = 2. In the gene-annotation bipartite graph, consider only the chosen nodes and compute the densest subgraph. [Figure: GO ontology and PO ontology.]

Distance Threshold= [Figure: GO ontology and PO ontology.]
Proof of optimality:
- Any node not chosen cannot be in the optimum solution
- All the nodes chosen are within the distance threshold

- Guess a small subset of nodes from the optimum
- Choose candidate nodes by considering distance from the guessed nodes
- Compute the densest subgraph by restricting the gene annotation graph to only the chosen nodes
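The three steps above can be sketched for tree-shaped ontologies. This is a rough illustration under several assumptions that are not in the slides: the guessed pair (a, b) is taken to realise the largest distance d in the solution, so candidates are the nodes within d of both a and b (with d at most the threshold); the densest-subgraph step uses weighted greedy peeling as a simple stand-in for an exact flow-based routine; and all names, thresholds, and the toy data are hypothetical.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations_with_replacement

def bfs_dist(adj, src):
    """Hop distances from src in an unweighted ontology graph."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def candidate_sets(adj, threshold):
    """One candidate set per guessed pair (a, b) with dist(a, b) = d <= threshold."""
    dist = {v: bfs_dist(adj, v) for v in adj}
    found = set()
    for a, b in combinations_with_replacement(sorted(adj), 2):
        d = dist[a].get(b)
        if d is not None and d <= threshold:
            found.add(frozenset(v for v in adj
                                if dist[a].get(v, d + 1) <= d and dist[b].get(v, d + 1) <= d))
    return found

def peel(nodes, weights):
    """Weighted greedy peeling (stand-in, an approximation); returns (density, node set)."""
    nodes, alive = set(nodes), dict(weights)
    deg = {v: 0 for v in nodes}
    for (u, v), w in alive.items():
        deg[u] += w
        deg[v] += w
    total = sum(alive.values())
    best = (Fraction(total, len(nodes)), frozenset(nodes))
    while len(nodes) > 1:
        x = min(nodes, key=lambda v: (deg[v], v))
        for (u, v), w in list(alive.items()):
            if x in (u, v):
                deg[u] -= w
                deg[v] -= w
                total -= w
                del alive[(u, v)]
        nodes.remove(x)
        if Fraction(total, len(nodes)) > best[0]:
            best = (Fraction(total, len(nodes)), frozenset(nodes))
    return best

def distance_restricted(go_adj, po_adj, weights, go_thr, po_thr):
    """Try every pair of candidate sets and keep the densest restricted subgraph."""
    best = (Fraction(0), frozenset())
    po_sets = candidate_sets(po_adj, po_thr)
    for go_set in candidate_sets(go_adj, go_thr):
        for po_set in po_sets:
            sub = {e: w for e, w in weights.items() if e[0] in go_set and e[1] in po_set}
            cand = peel(go_set | po_set, sub)
            if cand[0] > best[0]:
                best = cand
    return best

# Hypothetical toy data: both ontologies are paths; weights live on (GO, PO) pairs.
GO_ADJ = {"G1": ["G2"], "G2": ["G1", "G3"], "G3": ["G2", "G4"], "G4": ["G3"]}
PO_ADJ = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2"]}
WEIGHTS = {("G1", "P1"): 1, ("G3", "P2"): 2, ("G4", "P2"): 2, ("G4", "P3"): 1, ("G3", "P3"): 1}
print(distance_restricted(GO_ADJ, PO_ADJ, WEIGHTS, go_thr=1, po_thr=1))
```

With both thresholds set to 1, the sketch returns {G3, G4, P2, P3} with density 6/4, since every pair of GO terms and every pair of PO terms in that set is adjacent in its ontology.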

Restrictions in dense subgraph computation: Subset Restricted Dense Subgraph
- Are all dense subgraphs biologically meaningful?
  - How do we allow some control over the kind of dense subgraphs that are found?
- Distance Restricted
- Subset Restricted: given a subset of GO, PO terms, compute the densest subgraph containing them.

Subset Restricted Dense Subgraph
The set {5,6} must be in the solution.
- Density of {1,2,3,4} = 9/4 = 2.25 (does not contain {5,6})
- Density of {5,6,7,8} = 6/4 = 1.5 (satisfies the subset requirement)
- Density of {1,2,3,4,5,6,7,8} = 16/8 = 2.0 (best answer)
Polynomial-time algorithm to compute the optimum solution
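The subset-restricted objective can be made concrete with a brute-force check. This is explicitly not the polynomial-time flow-based method the slides refer to: it enumerates every superset of the required nodes, so it is only usable on tiny graphs. The function name and the weighted example graph are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def densest_containing(nodes, weights, required):
    """Brute force: the densest induced subgraph that contains every node in `required`.

    `required` must be non-empty. Exponential time, for illustration only.
    """
    required = frozenset(required)
    optional = sorted(set(nodes) - required)
    best = None
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            chosen = required | set(extra)
            total = sum(w for (u, v), w in weights.items() if u in chosen and v in chosen)
            cand = (Fraction(total, len(chosen)), chosen)
            if best is None or cand[0] > best[0]:
                best = cand
    return best

# Hypothetical graph: a heavy triangle 1-2-3 and a pendant node 4 that must be included.
WEIGHTS = {(1, 2): 3, (1, 3): 3, (2, 3): 3, (3, 4): 1}
print(densest_containing([1, 2, 3, 4], WEIGHTS, required={4}))
```

As on the slide, the unconstrained optimum (the triangle, density 3) does not contain the required node, and the best feasible answer turns out to be the whole graph, with density 10/4.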

Specified Set of Nodes in Densest Subgraph
- For this problem we modified Lawler’s method of finding densest subgraphs. Let’s assume that we have a graph in which we want to force {5,6} to be in the final solution.

The guess “g” is iteratively updated, as in Goldberg’s algorithm, until the computed min cut admits more than one solution: one contains just {s’, s}, and the other specifies the densest subgraph.

All Almost Dense Subgraphs
- A graph may contain multiple subgraphs of equal (or close to equal) density
- Computing just one subgraph may not be sufficient
- Compute all subgraphs close to maximum density
- Extension of Picard and Queyranne’s result
  - Polynomial-time algorithm to find almost all dense subgraphs, given that the number of such subgraphs is polynomial in the number of vertices.
  - Their method encodes all possible s-t min cuts.
  - After a max flow is found, we set every residual capacity that is close to zero to exactly zero, and then use the [PQ] method to list all s-t min cuts.
- Can be extended to consider both distance and subset restrictions

All Almost Dense Subgraphs
- Density of {1,2,3,4} = 9/4 = 2.25
- Density of {5,6,7,8,9} = 11/5 = 2.2
- Density of {1,2,3,4,5,6,7,8,9} = 21/9 ≈ 2.33
The entire graph is the densest subgraph, but {1,2,3,4} and {5,6,7,8,9} are “almost” dense subgraphs

Photomorphogenesis Experiment
- 10 photomorphogenesis genes: CIB5, CRY2, HFR1, COP1, PHOT1, PHOT2, HY5, SHB1, CRY1, CIB1
- 66 GO CV terms; 41 PO CV terms; 2230 GO-PO edges.
- Generate distance restricted dense subgraph.
  - GO distance = 2.
  - PO distance = 3.
- Dense subgraph with 3 GO terms and 13 PO terms

[Figure: (partial) dense subgraph; 3 GO CV terms; 13 PO CV terms; set of 10 genes (HFR1, COP1, PHOT1, PHOT2, HY5, CRY2, CIB5, SHB1, CIB1, CRY1); 0 annotation edges.]

Photomorphogenesis Experiment  GO CV Terms PO CV Terms 5634-nucleus:cellular-component 13-cauline leaf:plant structure 9010-seed:plant structure 5794-Golgi apparatus;cellular-comp 37-shoot apex:plant struture 9025-leaf:plant structure 5773-vacuole:cellular-component 8034-leaf whorl:plant structure 9031-sepal:plant structure 9005-root;plant struture 9032-petal-plant structure shhot:plant structure 9047-stem:plant structure 9009-embryo;plant structure cotyledon:plant structure 20038: petiole:plant structure HFR1 (AT1G02340)1000 CRY2 (AT1G04400)1111 CIB5 (AT1G26260)1100 COP1 (AT2G32950)1100 PHOT1 (AT3G45780)0011 CRY1 (AT4G08920)1100 SHB1 (AT4G25350)1000 HY5 (AT5G11260)1100 PHOT2 (AT5G5840)0000 CIB1 (AT4G34530) 0000

Potential Discovery
- Genes CRY2 and PHOT1 are both observed in the dense subgraph with the following two GO and PO combinations: (5773, 13) and (5773, 37)
  - 5773: vacuole; cellular_component
  - 13: cauline leaf; plant_structure
  - 37: shoot apex; plant_structure
- This pattern has not been reported in the literature. Two independent studies [Kang et al. Planta 08, Ohgishi PNAS 04] have suggested that there may be some functional interactions between the members of PHOT1 and CRY2 in the vacuole.

Density of entire graph is 13/8 > 1.5. Can we use distance-based cutoffs to define a subgraph of interest?

Photomorphogenesis Experiment with Control Genes
- Validation: generate subset restricted dense subgraph.
- Add 10 control genes.
- 2 GO terms: 5634 and
- 2 PO terms: 13 cauline leaf; plant_structure and 37 shoot apex.
- Dense subgraph with 2 GO terms, 12 PO terms
- The user validated that the missing PO term and the additional control genes and edges were acceptable changes from the distance restricted dense subgraph to the subset restricted dense subgraph.

Almost Dense Subgraphs
- Density of {1,2,3,4} = 9/4 = 2.25
- Density of {5,6,7,8,9} = 11/5 = 2.2
- Density of {1,2,3,4,5,6,7,8,9} = 21/9 ≈ 2.33
The entire graph is the densest subgraph, but {1,2,3,4} and {5,6,7,8,9} are “almost” dense subgraphs

- Identifying dense subgraphs with distance and subset restrictions may help in identifying interesting patterns
- Potential applications in other domains:
  - Distance restricted dense subgraph for community detection
  - Subset restricted dense subgraph in PPI networks for deriving protein complexes
- Ranking almost dense subgraphs
- Change the notion of density [K, Mukherjee, Saha]?