Samir Khuller University of Maryland Joint Work with Barna Saha, Allie Hoch, Louiqa Raschid, Xiao-Ning Zhang RECOMB 2010.

Sitting in a talk on community detection...
• How do we define a community?
• Perhaps we want to capture a group of individuals with strong interactions within the group?

Density
[Figure: example graph on nodes 1 to 7]
Subgraph density = (sum of weights of edges in the induced subgraph) / (number of nodes in the induced subgraph)
The density of {1,2,3,4,5,6,7} = 9/7 ≈ 1.29
The density of {1,2,3,4} = 6/4 = 1.5
The densest subgraph is {1,2,3,4}.
How do we compute the densest subgraph? Surprisingly, this can be solved optimally in polynomial time! [Goldberg 84, Lawler 76, Queyranne 75, GGT]
Extends to weighted graphs.
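The density definition above can be checked with a few lines of Python. This is a minimal sketch: the edge list is an assumption (a 4-clique on {1,2,3,4} plus a path 4-5-6-7), chosen only because it reproduces the counts on the slide (6 edges inside {1,2,3,4}, 9 edges overall). The figure's actual edges are not recoverable from the transcript.

```python
def density(edges, nodes):
    """Total weight of edges with both endpoints in `nodes`, divided by |nodes|."""
    nodes = set(nodes)
    total = sum(w for (u, v), w in edges.items() if u in nodes and v in nodes)
    return total / len(nodes)

# Hypothetical unit-weight graph: a clique on {1,2,3,4} plus the path 4-5-6-7.
EDGES = {
    (1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 3): 1, (2, 4): 1, (3, 4): 1,
    (4, 5): 1, (5, 6): 1, (6, 7): 1,
}

print(density(EDGES, {1, 2, 3, 4}))   # 6 edges over 4 nodes
print(density(EDGES, range(1, 8)))    # 9 edges over 7 nodes
```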

Density
[Figure: example graph on nodes 1 to 8]
Subgraph density = (sum of weights of edges in the induced subgraph) / (number of nodes in the induced subgraph)
Density of the entire graph is 13/8 = 1.625 > 1.5

Are Dense Subgraphs Useful?
• Are all dense subgraphs meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are found?
◦ Putting size constraints makes the problem intractable immediately.
• Densest subgraph of size >= k: NP-hard, 2-approximation [Andersen] [Khuller, Saha]. Greedily union densest subgraphs.
• Densest subgraph of size <= k (or = k): NP-hard, and some approximations are known [Feige, Kortsarz, Peleg] [Charikar et al].

• Goldberg's algorithm: a new flow network is created with "directed" edges.
◦ An s-t min cut is computed in order to find the densest subgraph after guessing the max density. Nodes on the "s" side of the cut are part of a densest subgraph.
◦ GGT speeds everything up to a single flow computation!
• Lawler's algorithm: slightly different flow construction, more intuitive.
• Greedy algorithm: recursively delete low degree nodes. Gives a 2-approximation to density! Fast!
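The greedy bullet can be sketched as a peeling loop: repeatedly remove a node of minimum weighted degree and remember the densest intermediate node set. This is an illustrative sketch, not the authors' code. The function name and the test graph (a 4-clique plus a path, unit weights) are assumptions.

```python
def greedy_densest_subgraph(edges):
    """Peel minimum-degree nodes; return (best node set, its density)."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    # degree[u] is always the weighted degree of u inside the remaining graph.
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    remaining = set(adj)
    weight = sum(edges.values())
    best_set, best_density = set(remaining), weight / len(remaining)
    while len(remaining) > 1:
        u = min(remaining, key=lambda x: (degree[x], x))  # ties broken by label
        remaining.remove(u)
        weight -= degree[u]
        for v, w in adj[u].items():
            if v in remaining:
                degree[v] -= w
        current = weight / len(remaining)
        if current > best_density:
            best_set, best_density = set(remaining), current
    return best_set, best_density

# Hypothetical graph: clique on {1,2,3,4} plus the path 4-5-6-7.
EDGES = {
    (1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 3): 1, (2, 4): 1, (3, 4): 1,
    (4, 5): 1, (5, 6): 1, (6, 7): 1,
}
print(greedy_densest_subgraph(EDGES))
```

On this small graph the peeling order is 7, 6, 5, after which the clique remains. In general the result is only guaranteed to be within a factor 2 of the optimum density.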

Background: What is a min cut?
[Figure: flow network with source, sink, and nodes 1 to 4; all capacities 1; nodes partitioned into V1 and V2]
We use s-t min cuts

Background: Finding the Densest Subgraph
• Original graph:
[Figure: graph on nodes 1, 2, 3 with edge weights 5 and 2]

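Background: Finding the Densest Subgraph
[Figure: flow network with source, sink, and nodes 1, 2, 3; source-to-node capacities 7, 7, 7; node-to-sink capacities 9, 6, 4]
Edges from source to nodes: m' = sum of all edge weights in the graph (here 7)
Edge from node i to sink: m' + 2g - d(i), where d(i) is the weighted degree of node i (e.g. 6 = 7 + 2*2 - 5)
g = guess = 2
CUT = m'|V| + 2|V1|(g - D1), where V1 is the set of source-side nodes and D1 is the density of V1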
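The cut formula on this slide follows from adding up the three kinds of edges that cross a cut. The derivation below is a sketch under one assumption that the transcript does not state explicitly: each graph edge appears in the network with capacity equal to its weight in both directions. The symbols V_2 (sink-side nodes), w(V_1) (total edge weight inside V_1) and w(V_1, V_2) (total weight between the two sides) are introduced here for the derivation.

```latex
\begin{align*}
\mathrm{CUT}(V_1)
  &= \underbrace{m'\,|V_2|}_{\text{source} \to V_2}
   + \underbrace{\sum_{i \in V_1} \bigl(m' + 2g - d(i)\bigr)}_{V_1 \to \text{sink}}
   + \underbrace{w(V_1, V_2)}_{V_1 \to V_2} \\
  &= m'\,|V| + 2g\,|V_1| - \bigl(2\,w(V_1) + w(V_1, V_2)\bigr) + w(V_1, V_2) \\
  &= m'\,|V| + 2\,|V_1|\,(g - D_1),
  \qquad D_1 = \frac{w(V_1)}{|V_1|}.
\end{align*}
```

The second line uses the fact that the degrees of the nodes in V_1 sum to 2w(V_1) + w(V_1, V_2). The trivial cut (V_1 empty) has value m'|V|, so a min cut below m'|V| exists exactly when some V_1 has density D_1 > g.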

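Background: Finding the Densest Subgraph
[Figure: flow network with source, sink, and nodes 1, 2, 3; source-to-node capacities 7, 7, 7; node-to-sink capacities 9, 6, 4]
Edges from source to nodes: m' = sum of all edge weights in the graph
Edge from node i to sink: m' + 2g - d(i)
g = guess = 2
CUT = m'|V| + 2|V1|(g - D1), where V1 is the set of source-side nodes and D1 is the density of V1
Since the min cut is less than 21 = m'|V|, the guess is too low.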

Background: Finding the Densest Subgraph
[Figure: flow network with source, sink, and nodes 1, 2, 3; source-to-node capacities 7, 7, 7; node-to-sink capacities 13, 10, 8]
Edges from source to nodes: m' = sum of all edge weights in the graph
Edge from node i to sink: m' + 2g - d(i)
g = guess = 4
Since the min cut is the trivial cut (and unique), the guess is too high.

Background: Finding the Densest Subgraph
[Figure: flow network with source, sink, and nodes 1, 2, 3; source-to-node capacities 7, 7, 7; node-to-sink capacities 9 2/3, 6 2/3, 4 2/3]
Edges from source to nodes: m' = sum of all edge weights in the graph
Edge from node i to sink: m' + 2g - d(i)
g = guess = 2 1/3
Since the min cut is smaller than 21, the guess is too low.

Background: Finding the Densest Subgraph
[Figure: flow network with source, sink, and nodes 1, 2, 3; source-to-node capacities 7, 7, 7; node-to-sink capacities 10, 7, 5]
Edges from source to nodes: m' = sum of all edge weights in the graph
Edge from node i to sink: m' + 2g - d(i)
g = guess = 2 1/2
Since the min cut has value 21 and V1 is not empty, the guess is correct!
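The four guesses on these slides can be replayed in code. The sketch below is not a max-flow implementation: it simply enumerates every s-t cut of the network, which is only sensible for a toy instance. The three-node graph is an assumption reconstructed from the capacities shown (weighted degrees 2, 5 and 7, so two edges of weight 2 and 5 sharing a node). The node names `a`, `b`, `c` are placeholders, and graph edges are assumed to carry their weight as capacity in both directions.

```python
from fractions import Fraction
from itertools import combinations

# Reconstructed example (assumption): edges a-c of weight 2 and b-c of weight 5.
EDGES = {("a", "c"): 2, ("b", "c"): 5}
NODES = ["a", "b", "c"]

def min_cut(edges, nodes, g):
    """Return (value, source_side) of a minimum s-t cut by brute force.

    Among cuts of equal value, the one enumerated last (a larger
    source side) is kept, so a non-empty V1 wins ties.
    """
    m = sum(edges.values())
    degree = {u: 0 for u in nodes}
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
    best = None
    for k in range(len(nodes) + 1):
        for subset in combinations(nodes, k):
            v1 = set(subset)
            value = m * (len(nodes) - len(v1))                  # source -> V2
            value += sum(m + 2 * g - degree[u] for u in v1)     # V1 -> sink
            value += sum(w for (u, v), w in edges.items()
                         if (u in v1) != (v in v1))             # V1 <-> V2
            if best is None or value <= best[0]:
                best = (value, v1)
    return best

trivial = sum(EDGES.values()) * len(NODES)   # m'|V| = 21
for g in (Fraction(2), Fraction(4), Fraction(7, 3), Fraction(5, 2)):
    value, v1 = min_cut(EDGES, NODES, g)
    if value < trivial:
        verdict = "too low"
    elif v1:
        verdict = "correct"
    else:
        verdict = "too high"
    print(f"g = {g}: min cut = {value}, V1 = {sorted(v1)}, guess is {verdict}")
```

A practical implementation would replace the enumeration with a max-flow computation and a binary search over g.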

What can we do with this weapon?
• Consider gene annotation data from TAIR.
• For large networks we can use the fast greedy approximation (gave us the densest subgraph every time!).

• Biological knowledge often can be represented as graphs
◦ Protein Interactions, Metabolic Pathways, Gene Regulation, Gene Annotation
[Figure: EXOSOME complex with nodes RRP43, RRP4, RRP42, RRP46, RRP45, SK16, DIS3]

• Dense regions in graphs may represent useful information
◦ Many previous works on clustering protein interaction, metabolic networks etc. to find dense regions.
And many more ...

• Dense regions in graphs may represent useful information
[Figure: (B) MIPS complex in the yeast network Y11k and the matching complex obtained by the densest subgraph algorithm]
King, A.D., Przulj, N. and Jurisica, I., Protein complex prediction via cost-based clustering. Bioinformatics, 2004

TAIR Annotation Example gene annotations

GO-(gene)-PO tri-partite graph
Gene: AT1G15550 (GA4)
GO terms: GO:0016707 gibberellin 3-beta-dioxygenase activity; GO:0009686 gibberellin biosynthetic process; GO:0009739 response to gibberellin stimulus; GO:0009639 response to red or far red light; GO:0008134 transcription factor binding; GO:0010114 response to red light; GO:0009740 gibberellic acid mediated signalling; GO:0005737 cytoplasm
PO terms: PO:0019018 embryo axis; PO:0009046 flower; PO:0009005 root; PO:0009001 fruit; PO:0020001 ovary placenta; PO:0020148 shoot apical meristem; PO:0020030 cotyledon; PO:0009064 receptacle; PO:0003011 root vascular system; PO:0000014 rosette leaf; PO:0004723 sepal vascular system; PO:0009047 stem; PO:0020141 stem node; PO:0009009 embryo; PO:0004714 terminal floral bud; PO:0009025 leaf; PO:0007057 0 germination; PO:0007131 seedling growth; PO:0009067 filament

GO Ontology
[Figure: ontology graph over the following GO terms]
GO:0009686 gibberellin biosynthetic process; GO:0009739 response to gibberellin stimulus; GO:0009639 response to red or far red light; GO:0010114 response to red light; GO:0009740 gibberellic acid mediated signalling; GO:0008135 biological process

PO Ontology
[Figure: ontology graph over the following PO terms, rooted at "Plant structure"]
PO:0019018 embryo axis; PO:0009046 flower; PO:0009005 root; PO:0009001 fruit; PO:0020001 ovary placenta; PO:0020148 shoot apical meristem; PO:0020030 cotyledon; PO:0009064 receptacle; PO:0003011 root vascular system; PO:0000014 rosette leaf; PO:0004723 sepal vascular system; PO:0009047 stem; PO:0020141 stem node; PO:0009009 embryo; PO:0004714 terminal floral bud; PO:0009025 leaf; PO:0009067 filament

Gene Annotation Graph
• Construct graphs for each gene using their GO, PO annotations
• Combine the graphs of several genes into one single weighted graph
[Figure: Gene 1 to Gene 4 linked to GO 1 to GO 4 and PO 1 to PO 4]

The Problem
• Scientists like to find patterns in gene annotation graphs, but these are huge!
• Need to allow some control over the kind of patterns that are computed
• Would like to find biologically meaningful patterns
[Figure: Gene 1 to Gene 4 linked to GO 1 to GO 4 and PO 1 to PO 4, with a node and an edge labelled]

GO-PO bipartite graph
GO terms: GO:0016707 gibberellin 3-beta-dioxygenase activity; GO:0009686 gibberellin biosynthetic process; GO:0009739 response to gibberellin stimulus; GO:0009639 response to red or far red light; GO:0008134 transcription factor binding; GO:0010114 response to red light; GO:0009740 gibberellic acid mediated signalling; GO:0005737 cytoplasm
PO terms: PO:0019018 embryo axis; PO:0009046 flower; PO:0009005 root; PO:0009001 fruit; PO:0020001 ovary placenta; PO:0020148 shoot apical meristem; PO:0020030 cotyledon; PO:0009064 receptacle; PO:0003011 root vascular system; PO:0000014 rosette leaf; PO:0004723 sepal vascular system; PO:0009047 stem; PO:0020141 stem node; PO:0009009 embryo; PO:0004714 terminal floral bud; PO:0009025 leaf; PO:0007057 0 germination; PO:0007131 seedling growth; PO:0009067 filament

Gene Annotation Graph
• Construct a complete bipartite graph for each gene using its GO, PO annotations
• Combine the bipartite graphs of several genes into one single weighted graph
[Figure: weighted bipartite graph between GO 1 to GO 4 and PO 1 to PO 4, with edge weights 1, 2 and 3]
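The two construction steps can be sketched as follows. The sketch assumes that the weight of a GO-PO edge is the number of genes annotated with both terms, which is one natural reading of "combine into one single weighted graph". The function name and the toy annotations are hypothetical.

```python
from collections import Counter
from itertools import product

def build_go_po_graph(annotations):
    """Map each (GO term, PO term) pair to the number of genes carrying both.

    `annotations` maps a gene name to a pair (GO terms, PO terms).
    """
    weights = Counter()
    for go_terms, po_terms in annotations.values():
        # Complete bipartite graph for this gene: every GO term paired with every PO term.
        for go, po in product(set(go_terms), set(po_terms)):
            weights[(go, po)] += 1
    return weights

# Toy annotations (made up for illustration only).
annotations = {
    "gene1": (["GO1", "GO3"], ["PO1"]),
    "gene2": (["GO3"], ["PO1", "PO2"]),
}
graph = build_go_po_graph(annotations)
for (go, po), weight in sorted(graph.items()):
    print(go, po, weight)
```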

• Cliques: these might give us some biological information, but this is a stringent requirement.
• However, clique finding is well known to be really hard (NP-hard, hard to approximate).
• Why not look for "dense regions"?
• Note that the notion of density could be defined for hyper-edges as well, but for our purposes this does not do as well.

Dense Subgraphs in Gene Annotation Graph
• A collection of GO-PO terms that appear together in the underlying genes.
[Figure: weighted bipartite graph between GO 1 to GO 4 and PO 1 to PO 4, with edge weights 1, 2 and 3]
(GO3,PO1), (GO3,PO2), (GO3,PO4), (GO4,PO1), (GO4,PO2), (GO4,PO4) appear frequently in the 4 genes

Restrictions in dense subgraph computation
• Are all dense subgraphs biologically meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are computed?
◦ In fact we can impose both restrictions at the same time!
Distance Restricted: GO terms and similarly PO terms that appear must be biologically related
Subset Restricted: Certain GO, PO terms must appear in the returned subgraph

Restrictions in dense subgraph computation
• Are all dense subgraphs biologically meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are found?
Distance Restricted: GO terms that appear in the densest subgraph must be close in the GO ontology graph, and similarly for the PO terms

• Distance threshold = 1
• This means that some sets of nodes are not allowed to coexist in the final solution: {GO1,GO2}, {GO1,GO4}, {PO1,PO4}, {PO1,PO2}, {PO2,PO3}.
• The final solution is {GO2, GO3, GO4, PO2, PO4}, which has a density of 0.8.
[Figure: GO and PO ontology graphs and the GO-PO annotation graph on GO 1 to GO 4 and PO 1 to PO 4]
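Whether a set of terms respects a distance threshold can be checked with breadth-first search over the ontology graph. The sketch below is illustrative. The PO ontology used (a path PO1 - PO3 - PO4 - PO2) is an assumption, chosen because it is consistent with the forbidden PO pairs listed above. The real figure may differ, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def distances_from(adj, source):
    """Hop distances from `source` to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def within_threshold(adj, terms, threshold):
    """True if every pair of `terms` is at ontology distance <= threshold."""
    for u, v in combinations(list(terms), 2):
        d = distances_from(adj, u).get(v)
        if d is None or d > threshold:
            return False
    return True

# Assumed PO ontology: the path PO1 - PO3 - PO4 - PO2.
PO_ONTOLOGY = {
    "PO1": ["PO3"],
    "PO3": ["PO1", "PO4"],
    "PO4": ["PO3", "PO2"],
    "PO2": ["PO4"],
}
print(within_threshold(PO_ONTOLOGY, {"PO2", "PO4"}, 1))  # adjacent terms
print(within_threshold(PO_ONTOLOGY, {"PO1", "PO2"}, 1))  # endpoints of the path
```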

• For arbitrary ontology graph structure
◦ NP-hard even to approximate reasonably (reduction from the Independent Set problem)
◦ Factor 2 relaxation of the distance threshold is enough to get a solution with density as high as the optimum
• Trees, interval graphs, graphs in which each edge participates in a small number of cycles
◦ Polynomial time algorithm to compute the optimum

Distance Threshold = 2
[Figure: GO-Ontology on nodes 1 to 9, PO-Ontology on nodes 1 to 8, and the gene annotation bipartite graph between them]

Guess two nodes in each ontology that appear in the optimum solution and have maximum distance
[Figure: same ontologies and bipartite graph, with the guessed nodes marked]

Distance Threshold = 2
Compute all the nodes which are within the distance threshold from both the guessed nodes
[Figure: same ontologies and bipartite graph, with the candidate nodes marked]

Distance Threshold = 2
In the gene annotation bipartite graph consider only the chosen nodes and compute the densest subgraph
[Figure: same ontologies; the bipartite graph is restricted to the chosen nodes 5, 6, 9, 7 and 2, 4, 5]

Distance Threshold = 2
Proof of optimality:
• Any node not chosen cannot be in the optimum solution
• All the nodes chosen are within the distance threshold

• Guess a small subset of nodes from the optimum
• Choose candidate nodes by considering distance from the guessed nodes
• Compute the densest subgraph by restricting the gene annotation graph to only the chosen nodes

Restrictions in dense subgraph computation
• Are all dense subgraphs biologically meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are found?
Subset Restricted Dense Subgraph: given a subset of GO, PO terms, compute the densest subgraph containing them.

Subset Restricted Dense Subgraph
[Figure: weighted graph on nodes 1 to 8 with edge weights 1, 2 and 3]
The set {5,6} must be in the solution.
Density of {1,2,3,4} = (3+2+2+2)/4 = 2.25 (does not contain {5,6})
Density of {5,6,7,8} = 6/4 = 1.5 (satisfies the subset requirement)
Density of {1,2,3,4,5,6,7,8} = (2+3+2+2+1*7)/8 = 2.0 (best answer)
Polynomial time algorithm to compute the optimum solution
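The arithmetic on this slide can be reproduced by exhaustive search on a small graph. The sketch below is a brute-force check, not the polynomial-time flow-based method the slides refer to. The edge layout is an assumption that merely matches the sums shown: a 4-cycle on {1,2,3,4} with weights 3, 2, 2, 2, a unit-weight clique on {5,6,7,8}, and one unit-weight edge joining the two parts.

```python
from itertools import combinations

def density(edges, nodes):
    """Total weight of edges inside `nodes`, divided by |nodes|."""
    nodes = set(nodes)
    return sum(w for (u, v), w in edges.items() if u in nodes and v in nodes) / len(nodes)

def densest_containing(edges, required):
    """Brute force: densest node set that includes every node in `required`."""
    all_nodes = sorted({u for edge in edges for u in edge})
    optional = [u for u in all_nodes if u not in required]
    best_set, best_density = None, -1.0
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            candidate = set(required) | set(extra)
            if not candidate:
                continue  # the empty set has no density
            d = density(edges, candidate)
            if d > best_density:
                best_set, best_density = candidate, d
    return best_set, best_density

# Hypothetical layout consistent with the sums on the slide.
EDGES = {
    (1, 2): 3, (2, 3): 2, (3, 4): 2, (1, 4): 2,            # weight 9 inside {1,2,3,4}
    (5, 6): 1, (5, 7): 1, (5, 8): 1, (6, 7): 1, (6, 8): 1, (7, 8): 1,
    (4, 5): 1,                                              # single connecting edge
}
print(densest_containing(EDGES, {5, 6}))
print(densest_containing(EDGES, set()))
```

With {5,6} forced into the solution the whole graph wins at density 2.0. Without the restriction the search returns {1,2,3,4} at 2.25, matching the slide.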

Specified Set of Nodes in Densest Subgraph
• For this problem we modified Lawler's method of finding densest subgraphs. Let's assume that we have a graph in which we want to force {5,6} to be in the final solution.

The guess g is updated iteratively, as in Goldberg's algorithm, until the computed min cut is no longer unique: one min cut has just {s', s} on the source side, and the other specifies the densest subgraph.

All Almost Dense Subgraphs
• A graph may contain multiple subgraphs of equal (or close to equal) density
• Computing just one subgraph may not be sufficient
• Compute all subgraphs close to maximum density
• Extension of Picard and Queyranne's result
◦ Polynomial time algorithm to find all almost dense subgraphs, given that the number of such subgraphs is polynomial in the number of vertices.
◦ Their method encodes all possible s-t min cuts.
◦ After a max-flow is found, we set residual capacities that are close to zero to exactly zero and then use the [PQ] method to list all s-t min cuts.
• Can be extended to consider both distance and subset restriction

All Almost Dense Subgraphs
[Figure: weighted graph on nodes 1 to 9 with edge weights 1, 2 and 3]
Density of {1,2,3,4} = 9/4 = 2.25
Density of {5,6,7,8,9} = 11/5 = 2.2
Density of {1,2,3,4,5,6,7,8,9} = 21/9 = 2.333
The entire graph is the densest subgraph, but {1,2,3,4} and {5,6,7,8,9} are "almost" dense subgraphs

Photomorphogenesis Experiment
• 10 photomorphogenesis genes: CIB5, CRY2, HFR1, COP1, PHOT1, PHOT2, HY5, SHB1, CRY1, CIB1
• 66 GO CV terms; 41 PO CV terms; 2230 GO-PO edges.
• Generate distance restricted dense subgraph.
• GO distance = 2.
• PO distance = 3.
• Dense subgraph with 3 GO terms & 13 PO terms

[Figure: (partial) dense subgraph; 3 GO CV terms; 13 PO CV terms; set of 10 genes (HFR1, COP1, PHOT1, PHOT2, HY5, CRY2, CIB5, SHB1, CIB1, CRY1). Labels in the figure: "0 annotation edges"; 8, 26, 12, 13, 12, 13, 2]

Photomorphogenesis Experiment

GO CV terms:
5634 nucleus: cellular component
5794 Golgi apparatus: cellular component
5773 vacuole: cellular component

PO CV terms:
13 cauline leaf: plant structure
37 shoot apex: plant structure
8034 leaf whorl: plant structure
9005 root: plant structure
9006 shoot: plant structure
9009 embryo: plant structure
9010 seed: plant structure
9025 leaf: plant structure
9031 sepal: plant structure
9032 petal: plant structure
9047 stem: plant structure
20030 cotyledon: plant structure
20038 petiole: plant structure

Gene (locus)        5634-13  5634-37  5773-13  5773-37
HFR1 (AT1G02340)    1        0        0        0
CRY2 (AT1G04400)    1        1        1        1
CIB5 (AT1G26260)    1        1        0        0
COP1 (AT2G32950)    1        1        0        0
PHOT1 (AT3G45780)   0        0        1        1
CRY1 (AT4G08920)    1        1        0        0
SHB1 (AT4G25350)    1        0        0        0
HY5 (AT5G11260)     1        1        0        0
PHOT2 (AT5G5840)    0        0        0        0
CIB1 (AT4G34530)    0        0        0        0

Potential Discovery
• Genes CRY2 and PHOT1 are both observed in the dense subgraph with the following two GO and PO combinations: (5773, 13) and (5773, 37)
◦ 5773: vacuole; cellular_component
◦ 13: cauline leaf; plant_structure
◦ 37: shoot apex; plant_structure
• This pattern has not been reported in the literature. Two independent studies [Kang et al. Planta 08, Ohgishi PNAS 04] have suggested that there may be some functional interactions between the members of PHOT1 and CRY2 in vacuole

[Figure: example graph on nodes 1 to 8]
Density of the entire graph is 13/8 > 1.5
Can we use distance based cutoffs to define a subgraph of interest?

Photomorphogenesis Experiment with Control Genes
• Validation: generate subset restricted dense subgraph.
• Add 10 control genes.
• 2 GO terms: 5634 and 5773.
• 2 PO terms: 13 cauline leaf; plant_structure and 37 shoot apex.
• Dense subgraph with 2 GO terms, 12 PO terms
• User validated that the missing PO term and additional control genes and edges were acceptable changes from the distance restricted dense subgraph to the subset restricted dense subgraph.

Almost Dense Subgraphs
[Figure: weighted graph on nodes 1 to 9 with edge weights 1, 2 and 3]
Density of {1,2,3,4} = 9/4 = 2.25
Density of {5,6,7,8,9} = 11/5 = 2.2
Density of {1,2,3,4,5,6,7,8,9} = 21/9 = 2.333
The entire graph is the densest subgraph, but {1,2,3,4} and {5,6,7,8,9} are "almost" dense subgraphs

• Identifying dense subgraphs with distance and subset restriction may help in identifying interesting patterns
• Potential applications in other domains:
◦ Distance restricted dense subgraph for community detection
◦ Subset restricted dense subgraph in PPI network for deriving protein complexes
• Ranking almost dense subgraphs
• Change the notion of density [K, Mukherjee, Saha]?
