Samir Khuller University of Maryland Joint Work with Barna Saha, Allie Hoch, Louiqa Raschid, Xiao-Ning Zhang RECOMB 2010
Sitting in a talk on community detection... How do we define a community? Perhaps we want to capture a group of individuals with strong interactions within the group?
Density
Subgraph density = (sum of weights of edges in the induced subgraph) / (number of nodes in the induced subgraph)
The density of {1,2,3,4,5,6,7} = 9/7 ≈ 1.29
The density of {1,2,3,4} = 6/4 = 1.5
The densest subgraph is {1,2,3,4}.
How do we compute the densest subgraph? Surprisingly, this can be solved optimally in polynomial time! [Goldberg 84, Lawler 76, Queyranne 75, GGT] Extends to weighted graphs.
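A minimal sketch of the density definition above, assuming an undirected weighted graph stored as a dict from node pairs to weights. The function name `density` and the 7-node example (a 4-clique on {1,2,3,4} plus a path 4-5-6-7) are illustrative assumptions, chosen only so that the counts agree with the slide (9 edges on 7 nodes, 6 edges on 4 nodes); the actual figure is not recoverable from the text.

```python
from fractions import Fraction


def density(edges, nodes):
    """Weight of the edges with both endpoints in `nodes`, divided by |nodes|."""
    nodes = set(nodes)
    if not nodes:
        raise ValueError("density is undefined for an empty node set")
    weight = sum(w for (u, v), w in edges.items() if u in nodes and v in nodes)
    return Fraction(weight) / len(nodes)


# Hypothetical graph (an assumption, not the slide's figure):
# a 4-clique on {1,2,3,4} plus the path 4-5-6-7, all weights 1.
clique = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
tail = [(4, 5), (5, 6), (6, 7)]
edges = {e: 1 for e in clique + tail}

print(density(edges, range(1, 8)))   # 9/7
print(density(edges, [1, 2, 3, 4]))  # 3/2
```

Exact fractions are used so that densities such as 9/7 can be compared without rounding.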
Density
Subgraph density = (sum of weights of edges in the induced subgraph) / (number of nodes in the induced subgraph)
Density of entire graph is 13/8 > 1.5
Are Dense Subgraphs Useful?
Are all dense subgraphs meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are found?
◦ Putting size constraints makes the problem intractable immediately.
Densest subgraph of size >= k: NP-hard, 2-approximation [Andersen][Khuller, Saha]. Greedily union densest subgraphs.
Densest subgraph of size <= k (or = k): NP-hard, and some approximations are known [Feige, Kortsarz, Peleg] [Charikar et al].
Goldberg’s algorithm: a new flow network is created with “directed” edges.
◦ An s-t min cut is computed in order to find the densest subgraph after guessing the max density. Nodes on the “s” side of the cut are part of a densest subgraph.
◦ GGT speeds everything up to a single flow computation!
Lawler’s algorithm: slightly different flow construction, more intuitive.
Greedy algorithm: repeatedly delete a lowest-degree node and keep the densest of the intermediate subgraphs. Gives a 2-approximation to density! Fast!
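The greedy algorithm mentioned above can be sketched as follows. This is an illustrative version, not the speaker's implementation: the function name `greedy_densest_subgraph`, the dict-of-edges input, and the example graph are assumptions. It peels a minimum-weighted-degree node at each step and remembers the densest intermediate subgraph.

```python
from collections import defaultdict


def greedy_densest_subgraph(edges):
    """Peel a minimum-degree node repeatedly; return the densest subgraph seen."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    remaining = set(adj)
    total = sum(edges.values())  # weight of edges inside `remaining`
    best_density, best_nodes = -1.0, set()
    while remaining:
        current = total / len(remaining)
        if current > best_density:
            best_density, best_nodes = current, set(remaining)
        # Ties are broken by label, so labels are assumed to be comparable.
        u = min(remaining, key=lambda x: (degree[x], x))
        remaining.remove(u)
        for v, w in adj[u].items():
            if v in remaining:
                degree[v] -= w
                total -= w
    return best_nodes, best_density


# Hypothetical graph: a 4-clique on {1,2,3,4} plus the path 4-5-6-7.
clique = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
tail = [(4, 5), (5, 6), (6, 7)]
edges = {e: 1 for e in clique + tail}

nodes, dens = greedy_densest_subgraph(edges)
print(sorted(nodes), dens)  # [1, 2, 3, 4] 1.5
```

The linear scan for the minimum-degree node keeps the sketch short; a bucket queue or heap would be the usual choice for large networks.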
Background: What is a min cut? We use s-t min cuts.
[Figure: an s-t cut between source and sink, with node sets V1 and V2]
Background: Finding the Densest Subgraph
[Figure: original graph]
Background: Finding the Densest Subgraph
Flow network built for each guess g:
◦ Edge from source to each node: m’ = sum of all edge weights in the graph
◦ Edge from node i to sink: m’ + 2g - d(i)
◦ CUT = m’|V| + 2|V1|(g - D1), where V1 is the set of source-side nodes and D1 is the density of the subgraph induced by V1
The trivial cut (V1 empty) has value m’|V|, which is 21 in the example.
g = guess = 2: since the cut is < 21, the guess is too low.
g = guess = 4: since the min cut is the trivial cut (and unique), the guess is too high.
g = guess = 2 1/3: since the cut is smaller than 21, the guess is too low. [Figure: sink edge capacities include 6 2/3 and 4 2/3]
g = guess = 2 1/2: since the cut has value 21 and V1 is not empty, the guess is correct!
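The guesses can be replayed with a small sketch of the min-cut test. Everything here is an assumption made for illustration: the helper names `max_flow` and `goldberg_cut`, the reserved labels "source" and "sink", and the three-node weighted graph (a-b of weight 5, b-c of weight 2), which was chosen so that m’ = 7 and the trivial cut is 21, in agreement with the numbers on the slides. The original figure is not recoverable from the text. Each edge between graph nodes gets its weight as capacity in both directions.

```python
from collections import deque
from fractions import Fraction


def max_flow(cap, s, t):
    """Edmonds-Karp on a dict-of-dicts residual network (modified in place)."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push


def goldberg_cut(edges, g):
    """Return (min cut value, largest source-side node set V1) for guess g."""
    nodes = sorted({u for e in edges for u in e})
    m = sum(edges.values())
    deg = {u: 0 for u in nodes}
    for (u, v), w in edges.items():
        deg[u] += w
        deg[v] += w
    cap = {u: {} for u in nodes + ["source", "sink"]}

    def add(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # residual back edge

    for u in nodes:
        add("source", u, m)
        add(u, "sink", m + 2 * g - deg[u])
    for (u, v), w in edges.items():
        add(u, v, w)
        add(v, u, w)
    value = max_flow(cap, "source", "sink")
    # Nodes that can still reach the sink in the residual graph lie on the
    # sink side of every min cut; the rest form the largest source side.
    reach = {"sink"}
    queue = deque(["sink"])
    while queue:
        v = queue.popleft()
        for u in cap:
            if u not in reach and cap[u].get(v, 0) > 0:
                reach.add(u)
                queue.append(u)
    return value, set(nodes) - reach


edges = {("a", "b"): 5, ("b", "c"): 2}  # hypothetical example, m' = 7
for g in (2, 4, Fraction(7, 3), Fraction(5, 2)):
    print(g, goldberg_cut(edges, g))
```

On this example, a guess of 4 yields the trivial cut with V1 empty, while a guess of 2 1/2 yields a cut of value 21 with V1 = {a, b}, whose density is 5/2.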
What can we do with this weapon? Consider gene annotation data from TAIR. For large networks we can use the fast greedy approximation (gave us the densest subgraph every time!).
Biological knowledge often can be represented as graphs
◦ Protein Interactions, Metabolic Pathways, Gene Regulation, Gene Annotation
[Figure: EXOSOME complex with nodes RRP43, RRP4, RRP42, RRP46, RRP45, SKI6, DIS3]
Dense regions in graphs may represent useful information ◦ Many previous works on clustering protein interaction, metabolic networks etc. to find dense regions. And many more …
Dense regions in graphs may represent useful information
[Figure (B): MIPS complex in the yeast network Y 11k and the matching complex obtained by the densest subgraph algorithm]
King, A.D., Przulj, N. and Jurisica, I., Protein complex prediction via cost-based clustering. Bioinformatics, 2004
TAIR Annotation Example gene annotations
GO-(gene)-PO tri-partite graph for gene AT1G15550 (GA4)
GO terms: gibberellin 3-beta-dioxygenase activity; gibberellin biosynthetic process; response to gibberellin stimulus; response to red or far red light; transcription factor binding; response to red light; gibberellic acid mediated signalling; cytoplasm
PO terms: embryo axis; flower; root; fruit; ovary placenta; shoot apical meristem; cotyledon; receptacle; root vascular system; rosette leaf; sepal vascular system; stem; stem node; embryo; terminal floral bud; leaf; germination; seedling growth; filament
GO Ontology
[Figure: ontology fragment with terms biological process; gibberellin biosynthetic process; response to gibberellin stimulus; response to red or far red light; response to red light; gibberellic acid mediated signalling]
PO Ontology
[Figure: ontology fragment with terms plant structure; embryo axis; flower; root; fruit; ovary placenta; shoot apical meristem; cotyledon; receptacle; root vascular system; rosette leaf; sepal vascular system; stem; stem node; embryo; terminal floral bud; leaf; filament]
Gene Annotation Graph
◦ Construct a graph for each gene using its GO and PO annotations
◦ Combine the graphs of several genes into one single weighted graph
[Figure: Gene 1 to Gene 4 linked to GO 1 to GO 4 and PO 1 to PO 4]
The Problem
◦ Scientists like to find patterns in gene annotation graphs, but these are huge!
◦ Need to allow some control over the kind of patterns that are computed
◦ Would like to find biologically meaningful patterns
[Figure: Gene 1 to Gene 4, GO 1 to GO 4, PO 1 to PO 4; legend: Node, Edge]
GO-PO bipartite graph
GO terms: gibberellin 3-beta-dioxygenase activity; gibberellin biosynthetic process; response to gibberellin stimulus; response to red or far red light; transcription factor binding; response to red light; gibberellic acid mediated signalling; cytoplasm
PO terms: embryo axis; flower; root; fruit; ovary placenta; shoot apical meristem; cotyledon; receptacle; root vascular system; rosette leaf; sepal vascular system; stem; stem node; embryo; terminal floral bud; leaf; germination; seedling growth; filament
Gene Annotation Graph
◦ Construct a complete bipartite graph for each gene using its GO and PO annotations
◦ Combine the bipartite graphs of several genes into one single weighted graph
[Figure: GO 1 to GO 4 and PO 1 to PO 4]
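The construction above might look like the sketch below. The slide does not say how edge weights are combined, so this assumes the weight of a GO-PO edge is the number of genes whose complete bipartite graph contains it. The function name `build_go_po_graph` and the two example genes are hypothetical.

```python
from collections import Counter
from itertools import product


def build_go_po_graph(annotations):
    """Merge per-gene complete bipartite graphs into one weighted GO-PO graph.

    annotations maps a gene to a pair (GO terms, PO terms). The weight of an
    edge (go, po) is the number of genes annotated with both terms; this
    weighting rule is an assumption of the sketch.
    """
    weights = Counter()
    for go_terms, po_terms in annotations.values():
        for pair in product(set(go_terms), set(po_terms)):
            weights[pair] += 1
    return weights


# Hypothetical annotations for two genes.
annotations = {
    "gene1": (["GO3", "GO4"], ["PO1", "PO2"]),
    "gene2": (["GO3"], ["PO2", "PO4"]),
}
graph = build_go_po_graph(annotations)
print(graph[("GO3", "PO2")])  # 2
```

Because the gene nodes are dropped, the result is a bipartite graph on GO and PO terms only, which is the input the dense subgraph routines operate on.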
Cliques: these might give us some biological information, but this is a stringent requirement. Moreover, clique finding is well known to be really hard (NP-hard, hard to approximate). Why not look for “dense regions”? Note that the notion of density could be defined for hyper-edges as well, but for our purposes this does not do as well.
Dense Subgraphs in Gene Annotation Graph
A collection of GO-PO terms that appear together in the underlying genes.
(GO3,PO1), (GO3,PO2), (GO3,PO4), (GO4,PO1), (GO4,PO2), (GO4,PO4) appear frequently in the 4 genes.
[Figure: GO 1 to GO 4 and PO 1 to PO 4]
Restrictions in dense subgraph computation
Are all dense subgraphs biologically meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are computed?
◦ Distance Restricted: GO terms, and similarly PO terms, that appear must be biologically related
◦ Subset Restricted: certain GO, PO terms must appear in the returned subgraph
◦ In fact we can impose both restrictions at the same time!
Restrictions in dense subgraph computation: Distance Restricted, Subset Restricted
Are all dense subgraphs biologically meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are found?
Distance Restricted: GO terms that appear in the densest subgraph must be close in the GO ontology graph, and similarly for the PO terms
Distance threshold = 1
This means that some sets of nodes are not allowed to coexist in the final solution: {GO1,GO2}, {GO1,GO4}, {PO1,PO4}, {PO1,PO2}, {PO2,PO3}.
The final solution is {GO2, GO3, GO4, PO2, PO4}, which has a density of 0.8.
[Figure: GO ontology on GO 1 to GO 4, PO ontology on PO 1 to PO 4, and the GO-PO graph]
For arbitrary ontology graph structure
◦ NP-hard even to approximate reasonably (reduction from the Independent Set problem)
◦ A factor 2 relaxation of the distance threshold is enough to get a solution with density as high as the optimum
Trees, interval graphs, graphs in which each edge participates in a small number of cycles
◦ Polynomial time algorithm to compute the optimum
GO-Ontology PO-Ontology Distance Threshold=
Guess two nodes in each ontology that appear in the optimum solution and are at maximum distance from each other
[Figure: GO-Ontology, PO-Ontology]
Distance Threshold = 2
Compute all the nodes that are within the distance threshold of both guessed nodes
[Figure: GO-Ontology, PO-Ontology]
Distance Threshold = 2
In the gene-annotation bipartite graph, consider only the chosen nodes and compute the densest subgraph
[Figure: GO-Ontology, PO-Ontology]
Distance Threshold= GO-Ontology PO-Ontology
Proof of optimality:
◦ Any node not chosen cannot be in the optimum solution
◦ All the nodes chosen are within the distance threshold
◦ Guess a small subset of nodes from the optimum
◦ Choose candidate nodes by considering distance from the guessed nodes
◦ Compute the densest subgraph by restricting the gene annotation graph to only the chosen nodes
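The three steps can be sketched for tree-shaped ontologies as below. Several things are assumptions of this sketch, not the speaker's code: all function names, the adjacency-list inputs, and the choice to keep nodes within d(u, v) of both guessed nodes u and v (one reading of "maximum distance"; in a tree this keeps all candidates within d(u, v) of each other). The exact flow-based densest subgraph step is replaced by an exhaustive search that is only usable on tiny inputs.

```python
from collections import deque
from itertools import combinations, product


def bfs_dist(adj, src):
    """Hop distances from src in an undirected ontology graph."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def candidate_sets(adj, threshold):
    """For each guessed pair (u, v) with d(u, v) <= threshold,
    collect the nodes within d(u, v) of both u and v."""
    dist = {u: bfs_dist(adj, u) for u in adj}
    out = set()
    for u, v in product(adj, repeat=2):
        d = dist[u].get(v)
        if d is not None and d <= threshold:
            out.add(frozenset(
                w for w in adj
                if dist[u].get(w, d + 1) <= d and dist[v].get(w, d + 1) <= d
            ))
    return out


def brute_force_densest(edges, allowed):
    """Exhaustive stand-in for an exact densest subgraph routine."""
    best = (0, frozenset())
    nodes = sorted(allowed)
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            s = set(subset)
            weight = sum(w for (a, b), w in edges.items() if a in s and b in s)
            best = max(best, (weight / k, frozenset(s)), key=lambda x: x[0])
    return best


def distance_restricted_densest(go_adj, po_adj, edges, go_thr, po_thr):
    """Try every pair of candidate sets and keep the densest result."""
    best = (0, frozenset())
    pairs = product(candidate_sets(go_adj, go_thr), candidate_sets(po_adj, po_thr))
    for go_set, po_set in pairs:
        found = brute_force_densest(edges, go_set | po_set)
        best = max(best, found, key=lambda x: x[0])
    return best


# Hypothetical ontologies: a path G1-G2-G3-G4 and a path P1-P2-P3.
go_adj = {"G1": ["G2"], "G2": ["G1", "G3"], "G3": ["G2", "G4"], "G4": ["G3"]}
po_adj = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2"]}
edges = {
    ("G1", "P1"): 1, ("G1", "P3"): 1, ("G4", "P1"): 1, ("G4", "P3"): 1,
    ("G2", "P2"): 1, ("G3", "P2"): 1, ("G2", "P3"): 1, ("G3", "P3"): 1,
}
result = distance_restricted_densest(go_adj, po_adj, edges, 1, 1)
print(result[0], sorted(result[1]))  # 1.0 ['G2', 'G3', 'P2', 'P3']
```

Without the restriction the whole example graph is denser (8/7), but it mixes GO terms that are three hops apart, which the threshold of 1 rules out.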
Subset Restricted Dense Subgraph
Are all dense subgraphs biologically meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are found?
Restrictions in dense subgraph computation: Distance Restricted, Subset Restricted
Subset Restricted: given a subset of GO, PO terms, compute the densest subgraph containing them.
Subset Restricted Dense Subgraph
The set {5,6} must be in the solution.
◦ Density of {1,2,3,4} = 2.25 (doesn’t contain {5,6})
◦ Density of {5,6,7,8} = 6/4 = 1.5 (satisfies subset requirement)
◦ Density of {1,2,3,4,5,6,7,8} = 2.0 (best answer)
Polynomial time algorithm to compute the optimum solution
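As a way to check small instances of the subset restricted problem, here is a brute-force sketch. It is not the polynomial time flow-based algorithm from the talk: it simply enumerates every superset of the required nodes. The function name and the edge weights are assumptions; the weights were chosen so that the three densities agree with the slide (2.25, 1.5 and 2.0), since the original figure is not recoverable.

```python
from fractions import Fraction
from itertools import combinations


def subset_restricted_densest(edges, required):
    """Densest node set containing `required`, by exhaustive search.

    Exponential in the number of optional nodes; meant only as a
    reference check on tiny graphs.
    """
    nodes = {u for e in edges for u in e}
    required = set(required)
    optional = sorted(nodes - required)
    best_density, best_set = None, None
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            s = required | set(extra)
            if not s:
                continue
            weight = sum(w for (u, v), w in edges.items() if u in s and v in s)
            d = Fraction(weight) / len(s)
            if best_density is None or d > best_density:
                best_density, best_set = d, s
    return best_set, best_density


# Hypothetical weights: a 4-clique on {1,2,3,4} with weight 3/2 per edge
# (total 9), a unit-weight 4-clique on {5,6,7,8} (total 6), and one
# unit-weight edge 4-5 joining them (grand total 16).
edges = {e: Fraction(3, 2) for e in combinations([1, 2, 3, 4], 2)}
edges.update({e: 1 for e in combinations([5, 6, 7, 8], 2)})
edges[(4, 5)] = 1

print(subset_restricted_densest(edges, {5, 6}))
print(subset_restricted_densest(edges, set()))
```

With {5,6} required, the search returns the whole graph at density 2; with no required nodes it returns {1,2,3,4} at density 9/4.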
Specified Set of Nodes in Densest Subgraph
For this problem we modified Lawler’s method of finding densest subgraphs. Let’s assume that we have a graph in which we want to force {5,6} to be in the final solution.
The guess “g” is iteratively updated, as in Goldberg’s algorithm, until the computed min cut is no longer unique: one solution contains just {s’, s} and the other specifies the densest subgraph.
All Almost Dense Subgraphs
◦ A graph may contain multiple subgraphs of equal (or close to equal) density
◦ Computing just one subgraph may not be sufficient
◦ Compute all subgraphs close to maximum density
Extension of Picard and Queyranne’s result
◦ Polynomial time algorithm to find all almost dense subgraphs, given that the number of such subgraphs is polynomial in the number of vertices.
◦ Their method encodes all possible s-t min cuts.
◦ After a max-flow is found, we set residual capacities that are close to zero to exactly zero, and then use the [PQ] method to list all s-t min cuts.
Can be extended to consider both distance and subset restriction
All Almost Dense Subgraphs
◦ Density of {1,2,3,4} = 9/4 = 2.25
◦ Density of {5,6,7,8,9} = 11/5 = 2.2
◦ Density of {1,2,3,4,5,6,7,8,9} = 21/9 ≈ 2.33
The entire graph is the densest subgraph, but {1,2,3,4} and {5,6,7,8,9} are “almost” dense subgraphs
Photomorphogenesis Experiment
◦ 10 photomorphogenesis genes: CIB5, CRY2, HFR1, COP1, PHOT1, PHOT2, HY5, SHB1, CRY1, CIB1
◦ 66 GO CV terms; 41 PO CV terms; 2230 GO-PO edges.
◦ Generate distance restricted dense subgraph. GO distance = 2. PO distance = 3.
◦ Dense subgraph with 3 GO terms & 13 PO terms
[Figure: (partial) dense subgraph; 3 GO CV terms; 13 PO CV terms; set of 10 genes (HFR1, COP1, PHOT1, PHOT2, HY5, CRY2, CIB5, SHB1, CIB1, CRY1); 0 annotation edges]
Photomorphogenesis Experiment
GO CV Terms:
◦ 5634 - nucleus: cellular component
◦ 5794 - Golgi apparatus: cellular component
◦ 5773 - vacuole: cellular component
PO CV Terms:
◦ 13 - cauline leaf: plant structure
◦ 37 - shoot apex: plant structure
◦ 8034 - leaf whorl: plant structure
◦ 9005 - root: plant structure
◦ shoot: plant structure
◦ 9009 - embryo: plant structure
◦ cotyledon: plant structure
◦ 20038 - petiole: plant structure
◦ 9010 - seed: plant structure
◦ 9025 - leaf: plant structure
◦ 9031 - sepal: plant structure
◦ 9032 - petal: plant structure
◦ 9047 - stem: plant structure
Gene (locus), followed by the four 0/1 entries shown on the slide:
HFR1 (AT1G02340): 1000
CRY2 (AT1G04400): 1111
CIB5 (AT1G26260): 1100
COP1 (AT2G32950): 1100
PHOT1 (AT3G45780): 0011
CRY1 (AT4G08920): 1100
SHB1 (AT4G25350): 1000
HY5 (AT5G11260): 1100
PHOT2 (AT5G5840): 0000
CIB1 (AT4G34530): 0000
Potential Discovery
Genes CRY2 and PHOT1 are both observed in the dense subgraph with the following two GO and PO combinations: (5773, 13) and (5773, 37).
◦ 5773: vacuole; cellular_component
◦ 13: cauline leaf; plant_structure
◦ 37: shoot apex; plant_structure
This pattern has not been reported in the literature. Two independent studies [Kang et al. Planta 08, Ohgishi PNAS 04] have suggested that there may be some functional interactions between the members of PHOT1 and CRY2 in the vacuole.
Density of entire graph is 13/8 > 1.5. Can we use distance-based cutoffs to define a subgraph of interest?
Photomorphogenesis Experiment with Control Genes
Validation: generate subset restricted dense subgraph.
◦ Add 10 control genes.
◦ 2 GO terms: 5634 and 2 PO terms: 13 cauline leaf; plant_structure and 37 shoot apex.
◦ Dense subgraph with 2 GO terms, 12 PO terms
◦ User validated that the missing PO term and additional control genes and edges were acceptable changes from the distance restricted dense subgraph to the subset restricted dense subgraph.
Identifying dense subgraphs with distance and subset restriction may help in identifying interesting patterns
Potential applications in other domains:
◦ Distance restricted dense subgraph for community detection
◦ Subset restricted dense subgraph in PPI network for deriving protein complexes
Ranking almost dense subgraphs
Change the notion of density [K,Mukherjee,Saha]?