Partitioning Sequences Based on Association Measures Deborah Weisser Carnegie Mellon University.

1 Partitioning Sequences Based on Association Measures Deborah Weisser Carnegie Mellon University

2 Feature boundaries
Need to know form and function of protein sequences to understand complex biological systems
Not possible to determine features or functions directly
–estimate feature positions by indirect laboratory experiments, e.g. hydrophobicity
Use statistical measures of association to determine feature boundaries

3 Feature boundaries
Proteins are composed of adjacent, non-overlapping features:
–helical, cytoplasmic, periplasmic, extracellular, intracellular, etc.
GPCR proteins have a fixed feature pattern, although feature positions are only known for one member of the family, Rhodopsin (opsd_human)

5 Goal: Statistically determine feature boundaries in sequences of amino acids
S H D E G C L S S E P K P R K Q S D S S T

6 Association measures
S H D E G C L S S E P K P R K Q S D S S T
Example: 2.5 is a measure of the strength of the association between P and R

7 Association measures
S H D E G C L S S E P K P R K Q S D S S T
[Figure: the sequence annotated with an association value for each pair of adjacent amino acids.]

8 Association measures
S H D E G C L S S E P K P R K Q S D S S T
[Figure: the sequence annotated with an association value for each pair of adjacent amino acids.]
Adjacent pairs with low association measures are candidates for partition points.

9 Association measures are used to quantify correlations between adjacent amino acids
Yule’s Q statistic
Mutual information
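Both measures can be computed from counts of adjacent residue pairs. The sketch below uses the standard 2x2 contingency-table form of Yule's Q and the pointwise form of mutual information; the slides do not say exactly how the values were estimated, so the function names and the counting scheme here are assumptions.

```python
import math
from collections import Counter


def pair_counts(sequences):
    """Count adjacent (left, right) residue pairs over all training sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs


def _margins(pairs, a, b):
    """Return (total, n_ab, count of a on the left, count of b on the right)."""
    total = sum(pairs.values())
    left_a = sum(c for (x, _), c in pairs.items() if x == a)
    right_b = sum(c for (_, y), c in pairs.items() if y == b)
    return total, pairs[(a, b)], left_a, right_b


def yules_q(pairs, a, b):
    """Yule's Q for 'a on the left' versus 'b on the right' of an adjacent pair."""
    total, n11, left_a, right_b = _margins(pairs, a, b)
    n10 = left_a - n11            # a followed by something other than b
    n01 = right_b - n11           # b preceded by something other than a
    n00 = total - n11 - n10 - n01
    denom = n11 * n00 + n10 * n01
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def pointwise_mi(pairs, a, b):
    """Pointwise mutual information (in bits) of the adjacent pair (a, b)."""
    total, n11, left_a, right_b = _margins(pairs, a, b)
    if n11 == 0:
        return float("-inf")
    return math.log2(n11 * total / (left_a * right_b))


toy_pairs = pair_counts(["ABAB"])   # toy input, not real protein data
q_ab = yules_q(toy_pairs, "A", "B")
mi_ab = pointwise_mi(toy_pairs, "A", "B")
```

Yule's Q is bounded between -1 and 1, while pointwise mutual information is unbounded, so the two measures are not directly comparable in scale.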

14 [Figure: opsd_human sequence diagram showing the cytoplasmic (cp) domain with loops cp1, cp2 and cp3, the transmembrane (helices) domain with helices I to VII, and the extracellular (ec) domain, annotated with MI breaks and hydropathy breaks.]
MI breaks: 39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301
Hydropathy breaks: 37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309
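One way to compare the two lists of break positions on slide 14 is to pair them in order and look at the offsets. The pairing by rank is an assumption (both lists happen to contain 14 sorted positions); the slide itself does not state which breaks correspond.

```python
# Break positions copied from slide 14.
mi_breaks = [39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301]
hydropathy_breaks = [37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285, 309]

# Assumption: the i-th MI break corresponds to the i-th hydropathy break.
offsets = [m - h for m, h in zip(mi_breaks, hydropathy_breaks)]

mean_abs_offset = sum(abs(d) for d in offsets) / len(offsets)
max_abs_offset = max(abs(d) for d in offsets)

print("offsets (MI - hydropathy):", offsets)
print("mean absolute offset:", round(mean_abs_offset, 2))
print("largest absolute offset:", max_abs_offset)
```

Under this pairing, most MI breaks fall within a few residues of the matching hydropathy break, with the last pair differing the most.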

15 The changes in association measure values correspond to feature boundaries
Goal: automatically detect partition points based on association measures

16 Partitioning algorithm
Cluster adjacent association values
–each group is represented by its mean value
Calculate standard deviation of values over all clusters
Locate partition points in data based on:
–deviation from mean
–[change between adjacent values]

17 Parameters
Cluster adjacent association values
–each group is represented by its mean value
–parameter: window size for computing mean
Calculate standard deviation of values over all clusters
Locate partition points in data based on:
–deviation from mean
–[change between adjacent values]
–parameter: cutoff distance from mean for a value to be considered “extreme”
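The steps and the two parameters above can be sketched roughly as follows. This is not the author's implementation: non-overlapping windows, the population standard deviation, a cutoff expressed in standard deviations, and the names `window_means` and `partition_points` are all assumptions, and the optional "change between adjacent values" criterion is left out.

```python
from statistics import mean, pstdev


def window_means(values, window):
    """Cluster adjacent values into non-overlapping windows; return each window's mean."""
    return [mean(values[i:i + window]) for i in range(0, len(values), window)]


def partition_points(values, window=3, cutoff=1.0):
    """Return start indices of windows whose mean is 'extreme'.

    A window is treated as extreme when its mean lies more than
    cutoff standard deviations from the mean of all window means.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    means = window_means(values, window)
    overall = mean(means)
    spread = pstdev(means)
    if spread == 0:
        return []  # all windows look the same, so nothing stands out
    return [
        k * window
        for k, m in enumerate(means)
        if abs(m - overall) > cutoff * spread
    ]


# Toy association values with one low-scoring stretch starting at index 6.
toy_values = [5, 5, 5, 5, 5, 5, 0, 0, 0, 5, 5, 5]
toy_points = partition_points(toy_values, window=3, cutoff=1.0)
```

A larger `window` smooths the curve and reports fewer, coarser partition points; a smaller `cutoff` flags more windows as extreme.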

18 Effect of cutoff threshold on partitioning in opsd_human using mutual information

19 Effect of window size on partitioning in opsd_human using mutual information

20 GPCR: different subfamilies
Class A Rhodopsin like
  Amine
  Peptide
  Hormone protein
  Rhodopsin
    Rhodopsin Vertebrate
      Rhodopsin Vertebrate type 1
      Rhodopsin Vertebrate type 2
      Rhodopsin Vertebrate type 3
      Rhodopsin Vertebrate type 4
      Rhodopsin Vertebrate type 5
    Rhodopsin Arthropod
    Rhodopsin Mollusc
    Rhodopsin Other
  Olfactory
  Prostanoid
  Nucleotide-like
  Cannabis
  Platelet activating factor
  Gonadotropin-releasing hormone
  Thyrotropin-releasing hormone & Secretagogue
  Melatonin
  Viral
  Lysosphingolipid & LPA (EDG)
  Leukotriene B4 receptor
  Class A Orphan/other
Class B Secretin like
Class C Metabotropic glutamate / pheromone
Class D Fungal pheromone
Class E cAMP receptors (Dictyostelium)
Frizzled/Smoothened family

21 Size:     Hierarchy:
717755    GPCR
371134      Class A
48393         Rhodopsin
33543           Vertebrate
20314             Vertebrate 1
348                 opsd_human
39724       Class B
20930       Class C

23 Structure of curve is preserved even when the dataset is small.

27 In progress / Future work
Set parameters of partition algorithm automatically
Apply to other sources of data, types of features
Group amino acids into sub-classes
Quantify the effect of training set information content and training set size

