Advanced Tools and Algorithms in Bioinformatics Chittibabu Guda Summer, 2004 UCSD Extension, Department of Biosciences.



Today's Topics
- Hidden Markov Models (HMMs)
- Predicting sub-cellular localization of proteins
- Predicting post-translational modification sites
- Using standalone tools
- Current trends in bioinformatics

Hidden Markov Models

HMMs for biological sequences
- A hidden Markov model (HMM) is a statistical model that was developed largely for speech recognition.
- Its most popular use in molecular biology is as a 'probabilistic profile' of a protein family, called a profile HMM.
- HMMs are also used for multiple sequence alignment, gene prediction (ORF finding), and protein structure prediction.
- Advantages: statistically sound; no sequence ordering or gap penalties are required.
- Limitations: a large number of similar sequences is required to build a good model.

Stochastic modeling of biological sequences
- For example, a profile is a position-specific scoring matrix.
- Given this model, the probability of CGGSV is 0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031.
- Because multiplying many small fractions is prone to floating-point errors such as underflow, the calculation is moved into log space: the score is obtained by taking the logs of all amino acid probabilities and adding them up.
- ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
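The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the five probabilities are the ones quoted on the slide, and the variable names are only illustrative.

```python
import math

# Per-position probabilities for the residues C, G, G, S, V, as quoted above.
probs = [0.8, 0.4, 0.8, 0.6, 0.2]

# Direct product of the probabilities.
probability = math.prod(probs)

# Log-space score: the sum of the natural logs of the same probabilities.
log_score = sum(math.log(p) for p in probs)

print(f"probability = {probability:.4f}")  # probability = 0.0307
print(f"log score   = {log_score:.2f}")    # log score   = -3.48
```

Exponentiating the log score recovers the product, so nothing is lost by working with sums.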

Stochastic modeling of biological sequences
- With this expression, however, it is not possible to distinguish between the highly implausible sequence TGCT--AGG and the consensus sequence ACAC--ATC.

The HMM architecture
- S: start; E: end
- m: main state (matches/mismatches)
- i: insert state
- d: delete state
Example alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

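Parameters used in HMM building
- Transition probability: T_ij (average 0.333, i.e. about 1/3 for each of the three transitions leaving a state)
- Emission probability: E_i (average 0.05, i.e. about 1/20 for each of the 20 amino acids)
Example alignment:
M N - F L S
M N K Y L T
M Q - W - T
State labels: m i d m
- Because the probabilities are very small numbers, they are converted to log-odds scores, which are added to give the overall score.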
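Converting to log-odds means dividing each model probability by a background probability before taking the log. The sketch below assumes a uniform background of 1/20 per amino acid, in line with the average emission probability quoted above. The emission values and the function name are hypothetical, and transition probabilities are left out to keep the example short.

```python
import math

BACKGROUND = 1.0 / 20  # assumed uniform background frequency per amino acid

def log_odds_score(emission_probs, background=BACKGROUND):
    """Sum of ln(p / background) over the positions of a sequence."""
    return sum(math.log(p / background) for p in emission_probs)

# Hypothetical emission probabilities for a five-residue sequence.
example = [0.8, 0.4, 0.8, 0.6, 0.2]
score = log_odds_score(example)
print(f"log-odds score = {score:.2f}")
```

A positive score means the sequence is more probable under the model than under the background; a sequence whose residues all have probability 1/20 scores zero.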

Markov modeling of biological sequences
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Markov modeling of biological sequences
Sequence     P(s)*100
ACA---ATG    3.3
TCAACTATC
ACAC--AGC    1.2
AGA---ATC    3.3
ACCG--ATC    0.59
ACAC--ATC    4.7
P(ACACATC) = 0.047, obtained by taking the product of the probabilities for the residues in each state and for the transitions between states.
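The product behind P(ACACATC) can be sketched as follows. The individual emission and transition probabilities are not preserved in this transcript, so the values below are assumptions, chosen so that the product reproduces the 4.7 listed for the consensus sequence. The function name is also hypothetical.

```python
import math

# Assumed probabilities along the consensus path (not given in the transcript).
# Each tuple is (emission probability of the residue, probability of the
# transition leaving that state); the last transition leads to the end state.
path = [
    (0.8, 1.0),  # A
    (0.8, 1.0),  # C
    (0.8, 0.6),  # A, followed by the transition into the insert state
    (0.4, 0.6),  # C emitted by the insert state, followed by leaving it
    (1.0, 1.0),  # A
    (0.8, 1.0),  # T
    (0.8, 1.0),  # C
]

def path_probability(steps):
    """Multiply emission and transition probabilities along one path."""
    return math.prod(emission * transition for emission, transition in steps)

p = path_probability(path)
print(f"P(ACACATC) = {p:.3f}, P*100 = {p * 100:.1f}")  # P(ACACATC) = 0.047, P*100 = 4.7
```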

Sequence Alignment and Database Search using HMMER
- Multiple alignment
- Build a profile HMM
- Database search
- Multiple alignments
- Query against a profile HMM database (Pfam database)

hmmsearch results (on a database of voltage-gated ion channel proteins)

Pfam: Protein Family Database created using HMMs
- Pfam-A contains functionally annotated families (~7,500).
- Pfam-B contains unannotated families (~107,000).
- All protein sequences were clustered into families based on sequence identity.
- For each family, non-redundant, full-domain seed members were selected to represent the family.
- Seed multiple alignments were built using ClustalW and checked manually.
- HMMs were built using hmmbuild (part of the HMMER suite of programs).
- Using these models, more family members were added iteratively: new members are added to the multiple alignment and the HMM is updated, until no new members are found.

How to build and use Profile HMMs
- Get a family of seed sequences in a multiple alignment.
- Build a hidden Markov model using hmmbuild.
- Use the HMM as a query to find remote homologues in the sequence database using hmmsearch.
- Add new sequences to the seed alignment using hmmalign and update the model, iteratively.
- Get the consensus sequence of the model using hmmemit.
- Query the HMM with new sequences, using hmmpfam, to find out whether they are related to the model.
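The steps above map onto command lines roughly as follows. This sketch only assembles the argument lists and does not run anything. It assumes a HMMER 2.x installation (hmmpfam belongs to that series), every file name is a placeholder, and the options shown should be checked against the installed version.

```python
import shlex

def hmmer_workflow(seed_aln, model, seq_db, new_seqs, hmm_db, query):
    """Return the HMMER commands for the workflow as lists of arguments."""
    return [
        # Build a profile HMM from the seed alignment.
        ["hmmbuild", model, seed_aln],
        # Search a sequence database with the HMM.
        ["hmmsearch", model, seq_db],
        # Align newly found members to the model and save the alignment.
        ["hmmalign", "-o", "updated.aln", model, new_seqs],
        # Emit the consensus sequence of the model.
        ["hmmemit", "-c", model],
        # Compare query sequences with a database of HMMs.
        ["hmmpfam", hmm_db, query],
    ]

# All file names below are hypothetical placeholders.
commands = hmmer_workflow("seed.aln", "family.hmm", "seqdb.fasta",
                          "hits.fasta", "Pfam_ls", "query.fasta")
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the tools are installed. Repeating the hmmbuild, hmmsearch and hmmalign steps gives the iterative refinement described above.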

SledgeHMMER web server
- Accessible at [URL missing]
- The Pfam database is the largest database of protein functional domains built with hidden Markov models.
- The server provides quick access to pre-calculated Pfam results for 1.2 million protein sequences (the entire Swiss-Prot + TrEMBL databases).
- Sequences are compared using MD5 hexadecimal digests computed in Perl.
- The web server is implemented with a Perl/CGI interface.
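The idea of comparing sequences by MD5 digest can be sketched as follows. The server itself uses Perl; Python is used here for illustration only. The normalization step, the dictionary of stored results and the function names are all assumptions.

```python
import hashlib

def sequence_digest(sequence):
    """MD5 hex digest of a sequence, ignoring whitespace and letter case."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()

# Hypothetical store of pre-calculated results, keyed by digest.
precalculated = {
    sequence_digest("MKTAYIAKQR"): "stored Pfam hits for this sequence",
}

def lookup(sequence):
    """Return stored results for an identical sequence, or None."""
    return precalculated.get(sequence_digest(sequence))

print(lookup("mktay iakqr"))  # same sequence, so the stored entry is returned
print(lookup("MKTAYIAKQK"))   # one residue differs, so None
```

Because identical sequences always give the same 32-character digest, a lookup of this kind can avoid re-running the HMM search for a sequence that has already been processed.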

Predicting sub-cellular localization of proteins

Different cellular compartments (modified from Voet & Voet, Biochemistry; Weinheim, New York, Basel, Wiley-VCH 1992)

Methods to predict sub-cellular location
- Based on amino acid composition
- Based on signal or target peptides: PSORT, TargetP
- Based on domain occurrence patterns: MITOPRED
- Based on lexical analysis

Amino acid compositional differences in different sub-cellular locations
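Amino acid composition is simply the fraction of each residue type in a sequence. A minimal sketch follows; the function name and the example sequence are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    counts = Counter(sequence.upper())
    # Only standard residues count toward the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

aac = amino_acid_composition("MKKLLAAR")  # hypothetical sequence
print(aac["K"], aac["L"], aac["W"])  # 0.25 0.25 0.0
```

Composition-based predictors compare vectors like this one with the average composition of proteins from each compartment.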

PSORT
- The PSORT program is based on comprehensive knowledge of protein sorting.
- Different parameters, relevant to different groups of species, are determined.
Bacterial sequences
- N-terminal signal sequence (Positive - H region)/cleavage site
- Transmembrane segments
- Lipoprotein analysis
- Amino acid composition

PSORT continued
Eukaryotic sequences (yeast/animal/plant)
- N-terminal signal sequence (Positive - H region)/cleavage site
- Transmembrane segments and membrane topology
- Mitochondrial targeting signals and amino acid composition (AAC) of the N-terminal 20 amino acids
- Nuclear localization signals (NLS)
- Peroxisome matrix targeting sequences (PTSs): (S/A/C)(K/R/H)L
- Chloroplast targeting signals
- Endoplasmic reticulum signals (KDEL, or HDEL in yeast)
- Vesicular, lysosomal, vacuolar proteins, etc.
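Two of the signals listed above are short enough to express as regular expressions. The sketch below assumes that both the (S/A/C)(K/R/H)L tripeptide and the KDEL/HDEL signal must sit at the very C-terminus. That positional requirement and the function name are assumptions; this is not PSORT's actual implementation.

```python
import re

# C-terminal tripeptide (S/A/C)(K/R/H)L, listed above for peroxisomal targeting.
PTS1 = re.compile(r"[SAC][KRH]L$")
# C-terminal ER signal: KDEL, or HDEL in yeast.
ER_SIGNAL = re.compile(r"[KH]DEL$")

def c_terminal_signals(sequence):
    """Names of the C-terminal signals found in a protein sequence."""
    seq = sequence.upper()
    found = []
    if PTS1.search(seq):
        found.append("PTS1")
    if ER_SIGNAL.search(seq):
        found.append("ER signal")
    return found

print(c_terminal_signals("MAAAASKL"))  # ['PTS1']
print(c_terminal_signals("MGGGKDEL"))  # ['ER signal']
print(c_terminal_signals("MSKLAAA"))   # []
```

The third example shows why position matters: the tripeptide SKL is present but not at the C-terminus, so it is not reported.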

MITOPRED
- A new method based on Pfam domain occurrence patterns, amino acid composition (AAC) and pI value differences between mitochondrial and non-mitochondrial proteins.
- Eukaryotic cells have multiple compartments, and a given set of pathways is localized to a specific compartment. A protein family involved in a specific pathway is therefore expected in a specific compartment.
- A knowledge base is developed by studying the occurrence and co-occurrence patterns of different Pfam domains in different cellular compartments.
- The method compares the Pfam domains found in the query sequence against the knowledge base and assigns a score depending on the compartment to which each domain belongs.
- Independent scores are calculated from the AAC and pI value of the query sequence by comparing them with the average values in different locations.
- The final prediction is based on the combined AAC, pI and Pfam scores.

[Figure panels: More in Cytoplasmic; More in Mitochondrial]

pI value differences in different sub-cellular locations

Flowchart showing MITOPRED procedure

MITOPRED Web Server
- Accessible at [URL missing]
- Implemented using a Perl/CGI interface.
- Pre-calculated predictions are available for all eukaryotic proteins from the Swiss-Prot and TrEMBL databases (~500,000).
- Genome-scale predictions can be downloaded for yeast, C. elegans, Drosophila, human, mouse and Arabidopsis.
- Provides data for the Mitoproteome database, accessible at [URL missing]

Prediction of sub-cellular location by lexical analysis
- Separate Swiss-Prot (SP) proteins into different sub-cellular classes based on annotation.
- In each class, extract all unique keywords for each sequence.
- The total number of unique keywords across all classes defines the feature space (N).
- Generate a binary vector of length N for each sequence in each class: 1 if the keyword is present and 0 if it is absent.
- For the unknown protein, generate a binary vector in the same way, based on its keywords.
- From this vector, generate 2^k - 1 sub-vectors (where k is the number of keywords in the unknown) by flipping 1s to 0s.
- Using the sub-vectors, retrieve all proteins with matching binary vectors from all classes.
- The unknown is assigned to the class that contributes the largest number of sequences to the retrieved group.
- The method works better when there are more keywords and the family is larger.
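The procedure above can be sketched in a few functions. This is an illustration under stated assumptions, not the published implementation: a known protein is taken to "match" when its binary vector is identical to one of the sub-vectors, keywords of the unknown that lie outside the feature space are ignored, and the data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def to_vector(keywords, feature_space):
    """Binary vector: 1 if the keyword is present, 0 if it is absent."""
    return tuple(1 if kw in keywords else 0 for kw in feature_space)

def sub_vectors(vector):
    """All 2^k - 1 non-zero vectors obtained by flipping 1s to 0s."""
    ones = [i for i, bit in enumerate(vector) if bit]
    result = set()
    for size in range(1, len(ones) + 1):
        for keep in combinations(ones, size):
            result.add(tuple(1 if i in keep else 0 for i in range(len(vector))))
    return result

def predict_class(unknown_keywords, annotated):
    """annotated: list of (class_label, set_of_keywords) for known proteins."""
    feature_space = sorted({kw for _, kws in annotated for kw in kws})
    candidates = sub_vectors(to_vector(set(unknown_keywords), feature_space))
    # Each retrieved protein contributes one vote to its class.
    votes = Counter(label for label, kws in annotated
                    if to_vector(kws, feature_space) in candidates)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical annotated proteins and an unknown with three keywords.
annotated = [
    ("nuclear", {"DNA-binding", "Transcription"}),
    ("nuclear", {"DNA-binding"}),
    ("mitochondrial", {"Transit peptide"}),
    ("cytoplasmic", {"Glycolysis"}),
]
unknown = {"DNA-binding", "Transcription", "Transit peptide"}
print(predict_class(unknown, annotated))  # nuclear
```

With three keywords the unknown yields 2^3 - 1 = 7 sub-vectors. Two nuclear proteins and one mitochondrial protein match, so the nuclear class wins the vote.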

Flow diagram of lexical analysis method (From Nair R, Rost B, Bioinformatics 18:S78-S86, 2002)

Predicting Post-translational Modification Sites of Proteins

General Method for PTM site Prediction
- PROSITE provides consensus patterns for many PTM sites. In most cases, however, these patterns are very short, and true modifications depend on the structural or environmental context in the protein fold.
- For this reason, methods based on regular expressions or local alignment produce a large number of false positives.
- Almost all methods for PTM site prediction use artificial neural networks (ANNs).
General procedure:
- Prepare a dataset of sequences experimentally known to possess a given type of PTM site.
- Separate the dataset into training and test data.
- Train a network on the training data and test it on the test set. Iterate until the model is well refined.
- A sufficient number of training sequences and good-quality data are important for the success of any neural network method.
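To see why short patterns over-predict, consider a regular-expression scan. The pattern used below, N-{P}-[ST]-{P} for N-glycosylation, is the PROSITE-style pattern assumed here for illustration; it is not quoted on the slide. The test sequence and function name are hypothetical.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead lets overlapping matches be reported.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def candidate_sites(sequence):
    """1-based positions of Asn residues that match the pattern."""
    return [m.start() + 1 for m in N_GLYC.finditer(sequence.upper())]

print(candidate_sites("MNGSANPTLNNSSK"))  # [2, 10, 11]
```

Three candidate sites turn up in a 14-residue sequence. A pattern match alone says nothing about structural context, which is why every hit is only a candidate.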

Different Post-translational modifications (PTMs)
- Glycosylation: Asn (N)-glycosylation (NetNGlyc); O-glycosylation (NetOGlyc)
- Sulfation (Sulfinator)
- Phosphorylation (NetPhos)
- Myristoylation (NMT)

Prediction of Glycosylation Sites (NetNGlyc, NetOGlyc)
- Glycoproteins are formed by covalent attachment of oligosaccharides to certain proteins at Asn (N-glycosylation) or at Ser or Thr (O-glycosylation) residues.
- They are usually exported to extracellular destinations, for example mucin in the alimentary tract or the glycoprotein hormones of the anterior pituitary gland.
- N-glycosylation
- O-glycosylation: no consensus pattern; the SEA domain is associated with it

Prediction of Sulfation Sites
- Protein tyrosine sulfation is an important post-translational modification of proteins that pass through the secretory pathway.
- It regulates several protein-protein interactions and modulates the binding affinity of TM peptide receptors.
- Based on the rules described above, HMMs can be trained to build models that predict protein sequences with patterns that obey these rules.

Sulfinator Algorithm
- Sulfinator employs four different HMMs to recognize sulfated tyrosines located N-terminally (HMM-N), internally (HMM-I), C-terminally (HMM-C), and in Y-clusters (HMM-Y).

Prediction of Phosphorylation Sites (NetPhos)
- Protein kinases, a very large family of enzymes, catalyze phosphorylation.
- NetPhos produces neural network predictions for serine (S), threonine (T) and tyrosine (Y) phosphorylation sites in eukaryotic proteins; these sites affect a multitude of cellular signaling processes.
- Example patterns: Y-kinase phosphorylation; S or T phosphorylation by casein kinase II.
- Because these patterns are very short, the amino acids surrounding a residue are significant in determining whether a particular site is phosphorylated.
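Since the surrounding residues matter, a predictor is typically fed a window of sequence centred on each candidate S, T or Y. The sketch below extracts such windows. The flank size, the padding character and the function name are assumptions, not NetPhos parameters.

```python
def phospho_windows(sequence, flank=4, pad="X"):
    """Return (position, residue, window) for every S, T or Y.

    Positions are 1-based. Windows have length 2 * flank + 1 and are
    padded with `pad` where they run past the ends of the sequence.
    """
    seq = sequence.upper()
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue in "STY":
            windows.append((i + 1, residue, padded[i:i + 2 * flank + 1]))
    return windows

print(phospho_windows("MASEEY", flank=2))
# [(3, 'S', 'MASEE'), (6, 'Y', 'EEYXX')]
```

Each window would then be encoded numerically and passed to the trained network.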

Standalone Tools

Local Installation of tools and databases
- NCBI Toolkit
- Formatting and using BLAST
- CD-HIT
- CLUSTALW
- HMMER package

Current Trends in Bioinformatics

[Diagram: Cell, Structure, Function; Genomics, Transcriptomics, Proteomics, Metabolomics; Components Biology (reductionistic approach) versus Systems Biology (integrative approach)]

Highway network system in San Antonio