Partitioning Sequences Based on Association Measures Deborah Weisser Carnegie Mellon University.

Presentation transcript:

Partitioning Sequences Based on Association Measures Deborah Weisser Carnegie Mellon University

Feature boundaries
Need to know the form and function of protein sequences to understand complex biological systems
Not possible to determine features or functions directly
–estimate feature positions by indirect laboratory experiments, e.g. hydrophobicity
Use statistical measures of association to determine feature boundaries

Feature boundaries
Proteins are composed of adjacent, non-overlapping features:
–helical, cytoplasmic, periplasmic, extracellular, intracellular, etc.
GPCR proteins have a fixed feature pattern, although feature positions are known for only one member of the family, Rhodopsin (opsd_human)

Goal: Statistically determine feature boundaries in sequences of amino acids
S H D E G C L S S E P K P R K Q S D S S T

Association measures
S H D E G C L S S E P K P R K Q S D S S T
The value assigned to the adjacent pair P R is a measure of the strength of the association between P and R

Association measures
S H D E G C L S S E P K P R K Q S D S S T
Adjacent pairs with low association measures are candidates for partition points.

Association measures are used to quantify correlations between adjacent amino acids
–Yule’s Q statistic
–Mutual information
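The slides name the two measures but do not give formulas. The sketch below assumes the standard definitions: Yule's Q computed from the 2x2 contingency table of adjacent pairs, and pointwise mutual information in bits. The function names `pair_counts` and `association` are hypothetical, and the slides do not say whether the author computed the measures this way.

```python
import math
from collections import Counter

def pair_counts(sequences):
    """Count adjacent (left, right) residue pairs over all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs

def association(pairs, a, b):
    """Return (Yule's Q, pointwise MI in bits) for the adjacent pair a, b.

    Assumed definitions (not taken from the slides):
      Q   = (n11*n00 - n10*n01) / (n11*n00 + n10*n01)
      PMI = log2( p(a,b) / (p(a) * p(b)) )
    """
    total = sum(pairs.values())
    n11 = pairs[(a, b)]                                     # a followed by b
    left_a = sum(c for (x, _), c in pairs.items() if x == a)
    right_b = sum(c for (_, y), c in pairs.items() if y == b)
    n10 = left_a - n11                                      # a followed by non-b
    n01 = right_b - n11                                     # non-a followed by b
    n00 = total - n11 - n10 - n01                           # neither
    denom = n11 * n00 + n10 * n01
    q = (n11 * n00 - n10 * n01) / denom if denom else 0.0
    # An unseen pair has probability zero, so its PMI is minus infinity.
    pmi = math.log2(n11 * total / (left_a * right_b)) if n11 else float("-inf")
    return q, pmi

# Usage: score the pair P, R from a (toy) training set of sequences.
counts = pair_counts(["SHDEGCLSSEPKPRKQSDSST"])
q_pr, pmi_pr = association(counts, "P", "R")
```

Both measures are high when the pair occurs together more often than chance would predict, so low values mark the weakly associated neighbours that the slides treat as candidate partition points.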

[Figure: diagram of the sequence laid out by domain, with the cytoplasmic (cp) domain (loops cp1, cp2, cp3), the transmembrane (helices) domain (helices I to VII), and the extracellular (ec) domain labeled, and with MI breaks and hydropathy breaks marked]
MI breaks: 39, 63, 76, 94, 115, 136, 155, 176, 205, 233, 255, 279, 287, 301
Hydropathy breaks: 37, 61, 74, 98, 114, 133, 153, 176, 203, 230, 253, 276, 285,

The changes in association measure values correspond to feature boundaries
Goal: automatically detect partition points based on association measures

Partitioning algorithm
Cluster adjacent association values
–each group is represented by its mean value
Calculate standard deviation of values over all clusters
Locate partition points in data based on:
–deviation from mean
–[change between adjacent values]

Parameters
Cluster adjacent association values
–each group is represented by its mean value
–parameter: window size for computing the mean
Calculate standard deviation of values over all clusters
Locate partition points in data based on:
–deviation from mean
–[change between adjacent values]
–parameter: cutoff distance from the mean for a value to be considered “extreme”
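One possible reading of the steps and parameters above, as a minimal sketch rather than the author's implementation. It assumes a sliding window for the group means, flags only unusually low values (since low association marks candidate partition points), expresses the cutoff in standard deviations, and merges consecutive flagged groups into one point. The optional "change between adjacent values" criterion is not implemented. The name `partition_points` and its defaults are hypothetical.

```python
from statistics import mean, pstdev

def partition_points(values, window=3, cutoff=1.0):
    """Return positions whose windowed mean association is unusually low.

    values[i] is the association value between residues i and i+1.
    window  -- number of adjacent values averaged into one group
    cutoff  -- how many standard deviations below the overall mean
               a group mean must lie to count as "extreme"
    """
    if len(values) < window:
        return []
    # Step 1: represent each group of adjacent values by its mean.
    means = [mean(values[i:i + window])
             for i in range(len(values) - window + 1)]
    # Step 2: standard deviation of the group means.
    centre, spread = mean(means), pstdev(means)
    if spread == 0:
        return []  # flat curve: nothing stands out
    # Step 3: flag groups lying more than `cutoff` deviations below the mean.
    flagged = [i for i, m in enumerate(means)
               if (centre - m) / spread > cutoff]
    # Merge each run of consecutive flagged groups, keeping its lowest group.
    points, run = [], []
    for i in flagged:
        if run and i != run[-1] + 1:
            points.append(min(run, key=means.__getitem__))
            run = []
        run.append(i)
    if run:
        points.append(min(run, key=means.__getitem__))
    # Report the centre of each selected window as the partition position.
    return [p + window // 2 for p in points]
```

A larger window smooths the curve and yields fewer, broader dips, while a larger cutoff keeps only the deepest ones, which is the trade-off the two parameter plots that follow explore.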

Effect of cutoff threshold on partitioning in opsd_human using mutual information

Effect of window size on partitioning in opsd_human using mutual information

GPCR: different subfamilies
Class A Rhodopsin like
–Amine
–Peptide
–Hormone protein
–Rhodopsin
–Rhodopsin Vertebrate
–Rhodopsin Vertebrate type 1
–Rhodopsin Vertebrate type 2
–Rhodopsin Vertebrate type 3
–Rhodopsin Vertebrate type 4
–Rhodopsin Vertebrate type 5
–Rhodopsin Arthropod
–Rhodopsin Mollusc
–Rhodopsin Other
–Olfactory
–Prostanoid
–Nucleotide-like
–Cannabis
–Platelet activating factor
–Gonadotropin-releasing hormone
–Thyrotropin-releasing hormone & Secretagogue
–Melatonin
–Viral
–Lysosphingolipid & LPA (EDG)
–Leukotriene B4 receptor
–Class A Orphan/other
Class B Secretin like
Class C Metabotropic glutamate / pheromone
Class D Fungal pheromone
Class E cAMP receptors (Dictyostelium)
Frizzled/Smoothened family

Size:    Hierarchy:
         GPCR
         Class A
48393    Rhodopsin
33543    Vertebrate
20314    Vertebrate 1
348      opsd_human
39724    Class B
20930    Class C

Structure of curve is preserved even when the dataset is small.

In progress / Future work
Set parameters of partition algorithm automatically
Apply to other sources of data, types of features
Group amino acids into sub-classes
Quantify the effect of training set information content and training set size