Statistical Inference Using Graphs for Protein Complex Identification Denise Scholtens Robert Gentleman Marc Vidal Workshop on Statistical Inference, Computing, and Visualization for Graphs Stanford University August 1-2, 2003
Graphic from: U.S. Department of Energy Human Genome Program http://www.ornl.gov/hgmis
High-throughput Protein Complex Identification Gavin, et al. (Nature, 2002) –TAP: Tandem Affinity Purification Ho, et al. (Nature, 2002) –HMS-PCI: High-throughput Mass Spectrometric Protein Complex Identification
Protein Complex Identification Using TAP Data Spoke Model Matrix Model Bader, et al. (Nature Biotechnology, 2002)
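As a rough illustration of the two models named on this slide, the sketch below builds the undirected edge set implied by a single purification under each one. The protein names and the helper functions `spoke_edges` and `matrix_edges` are hypothetical and chosen only for this example; the spoke model is taken to link the bait to each hit, and the matrix model to link every pair of proteins in the purification.

```python
from itertools import combinations


def spoke_edges(bait, hits):
    """Spoke model: the bait is linked to each hit; hits are not linked to each other."""
    return {frozenset((bait, hit)) for hit in hits}


def matrix_edges(bait, hits):
    """Matrix model: every pair of proteins in the purification is linked."""
    return {frozenset(pair) for pair in combinations([bait, *hits], 2)}


# Hypothetical purification: one bait that pulls down three hits.
bait, hits = "Bait", ["Hit1", "Hit2", "Hit3"]
spoke = spoke_edges(bait, hits)
matrix = matrix_edges(bait, hits)

print(len(spoke), len(matrix))  # 3 spoke edges versus 6 matrix edges
```

The spoke edges are always a subset of the matrix edges, so the two models differ only in how many hit-to-hit interactions they assume.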
Cohesive vs. Dynamic Protein Complexes Cohesive Complex: a complex of invariable composition whose proteins are associated only with that complex and its particular function
Cohesive Complex Affiliation Network Incidence Matrix
A =
           C1
  Bait      1
  Hit 1     1
  Hit 2     1
  Hit 3     1
  Hit 4     1
  Hit 5     1
Cohesive vs. Dynamic Protein Complexes Dynamic Complex: complex composed of proteins that may also be involved in other complexes
Dynamic Complex Affiliation Network Incidence Matrices
A =
           C1  C2  C3  C4  C5
  Bait      1   1   1   1   1
  Hit 1     1   0   0   0   0
  Hit 2     0   1   0   0   0
  Hit 3     0   0   1   0   0
  Hit 4     0   0   0   1   0
  Hit 5     0   0   0   0   1
A =
           C1  C2
  Bait      1   1
  Hit 1     1   0
  Hit 2     0   1
  Hit 3     1   0
  Hit 4     0   1
  Hit 5     1   0
A =
           C1  C2
  Bait      1   1
  Hit 1     1   1
  Hit 2     1   1
  Hit 3     0   1
  Hit 4     0   1
  Hit 5     0   1
All 5 complexes above would yield the same TAP Data:
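One way to check this claim is to derive the expected pull-down from each incidence matrix. The sketch below uses the four matrices given on the preceding slides and assumes idealized, error-free data in which a bait recovers every protein that shares at least one complex with it. The function `tap_hits` and the variable names are hypothetical, introduced only for this example.

```python
def tap_hits(A, names, bait):
    """Proteins sharing at least one complex with the bait (assumes error-free detection)."""
    b = names.index(bait)
    return {
        names[i]
        for i, row in enumerate(A)
        if i != b and any(x and y for x, y in zip(row, A[b]))
    }


names = ["Bait", "Hit 1", "Hit 2", "Hit 3", "Hit 4", "Hit 5"]

# Rows are proteins (in the order of `names`), columns are complexes.
cohesive = [[1] for _ in range(6)]
dynamic_a = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
dynamic_b = [[1, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
dynamic_c = [[1, 1], [1, 1], [1, 1], [0, 1], [0, 1], [0, 1]]

results = [tap_hits(A, names, "Bait") for A in (cohesive, dynamic_a, dynamic_b, dynamic_c)]
print(all(r == results[0] for r in results))  # every matrix gives the same hit set
```

Under this assumption the bait recovers Hit 1 through Hit 5 in every case, so a single purification cannot distinguish among the matrices.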
Statistical Inference Problem What is A? A captures the cohesive/dynamic distinction. At best, we observe all but the main diagonal of X = AA^T, where A^T is the transpose of A. Current analyses focus on X, not on A.
Subgraph of Bait Proteins from Previous Graphs with Outdegree 7 [Figure panels: Gavin Data; Ho Data]
Examples of Distinct Complexes Identified by Gavin, et al.
Back to Affiliation Networks
A =
        C1
  B1     1
  B2     1
  B3     1
X = AA^T =
        B1  B2  B3
  B1     1   1   1
  B2     1   1   1
  B3     1   1   1
One Three-Way Conversation
Affiliation Networks
A =
        C1  C2  C3
  B1     1   1   0
  B2     1   0   1
  B3     0   1   1
X = AA^T =
        B1  B2  B3
  B1     2   1   1
  B2     1   2   1
  B3     1   1   2
Three Two-Way Conversations
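The two conversation examples can be compared numerically. The sketch below takes X to be A times its transpose, which reproduces the X matrices on these two slides, and then compares only the off-diagonal entries. The helpers `comembership` and `off_diagonal` are hypothetical names used for illustration.

```python
def comembership(A):
    """Return X = A A^T: X[i][j] counts the complexes shared by proteins i and j."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in A] for ri in A]


def off_diagonal(X):
    """Drop the main diagonal, keeping each row's remaining entries in order."""
    return [[x for j, x in enumerate(row) if j != i] for i, row in enumerate(X)]


# Rows are B1, B2, B3; columns are complexes.
A_one = [[1], [1], [1]]                      # one three-way conversation
A_three = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]  # three two-way conversations

X_one = comembership(A_one)
X_three = comembership(A_three)

print(X_one)
print(X_three)
print(off_diagonal(X_one) == off_diagonal(X_three))
```

The two X matrices differ only on the main diagonal (1 versus 2), so their off-diagonal entries are identical even though the underlying A matrices are not.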
Statistical Inference Problem Which A is correct? –A uniquely defines X, but X does not uniquely define the observable part of A. Extra information and directed graph model for the TAP data –Cellular Component Data –Gene Expression Data –Hit Data
Conclusions In the protein complex setting, directed graphs are useful for EDA, as well as framing the correct questions for statistical inference. Statistical inference problem for cohesive and dynamic protein complex identification should focus on A, not X. Digraph model of the TAP data better reflects what we actually observe, and is informative for estimating A.