Protein Structure and Function Prediction. Predicting 3D structure: comparative modeling (homology) and fold recognition (threading). An outstandingly difficult problem.

Presentation transcript:

Protein Structure and Function Prediction

Predicting 3D Structure
– Comparative modeling (homology)
– Fold recognition (threading)
An outstandingly difficult problem

Comparative Modeling
Comparative structure prediction produces an all-atom model of a sequence, based on its alignment to one or more related protein structures in the database.
Similar sequence suggests similar structure.

Comparative Modeling
Modeling of a sequence based on known structures. Consists of four major steps:
1. Finding one or more known structures related to the sequence to be modeled (templates), using sequence comparison methods such as PSI-BLAST
2. Aligning the sequence with the templates
3. Building a model
4. Assessing the model

Comparative Modeling
Accuracy of the comparative model is related to the sequence identity on which it is based:
>50% sequence identity = high accuracy
30%-50% sequence identity = 90% modeled
<30% sequence identity = low accuracy (many errors)
Similarity is particularly high in the core
– Alpha helices and beta sheets are preserved
– Even near-identical sequences vary in loops
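The identity thresholds above can be turned into a small check. This is a minimal sketch, not a standard tool: the function names are hypothetical, and percent identity is assumed here to be counted over aligned, non-gap columns only (other definitions divide by the full alignment length or by the shorter sequence).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are skipped (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def expected_accuracy(identity: float) -> str:
    """Map percent identity to the accuracy zones listed on the slide."""
    if identity > 50:
        return "high accuracy"
    if identity >= 30:
        return "about 90% modeled"
    return "low accuracy (many errors)"


if __name__ == "__main__":
    identity = percent_identity("ACDE", "ACDF")
    print(identity, expected_accuracy(identity))
```

In practice the alignment would come from step 2 of the modeling procedure; the two short strings above are only placeholders.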

Comparative Modeling Methods
MODELLER (Sali, Rockefeller/UCSF)
SCWRL (Dunbrack, UCSF)
SWISS-MODEL

Protein Folds
A fold is a combination of secondary structural units
– Forms the basic level of classification
Each protein family belongs to a fold
– An estimated 1000–3000 different folds exist
– A fold is shared among close and distant family members
Different sequences can share similar folds

Protein Folds: the sequential and spatial arrangement of secondary structures
[Figure: two example folds, hemoglobin and TIM]

Fold classification (SCOP)
Class: all alpha, all beta, alpha/beta, alpha+beta
Fold
Superfamily
Family

Basic steps in fold recognition:
Compare the query sequence against a library of all known protein folds (a finite number)
Query sequence: MTYGFRIPLNCERWGHKLSTVILKRP...
Goal: find the folding template that the sequence fits best
Find ways to evaluate the sequence-structure fit

Find the best fold for a protein sequence: fold recognition (threading)
Query: MAHFPGFGQSLLFGYPVYVFGD...
[Figure: the query is compared with each potential fold in the library: 1) ... 56) ... n)]
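The idea of scoring a query against every fold in a library can be illustrated with a toy example. Everything here is an assumption made for illustration: each template is reduced to a string of burial states (B = buried, E = exposed), the fit score simply rewards hydrophobic residues in buried positions and other residues in exposed positions, and no gaps are allowed. Real threading programs use far richer structural descriptions and scoring functions.

```python
# Toy hydrophobic set; an assumption of this sketch, not a standard scale.
HYDROPHOBIC = set("AVLIMFWC")


def fit_score(segment: str, environments: str) -> int:
    """+1 when a residue suits its environment, -1 otherwise."""
    score = 0
    for residue, env in zip(segment, environments):
        suits_buried = residue in HYDROPHOBIC and env == "B"
        suits_exposed = residue not in HYDROPHOBIC and env == "E"
        score += 1 if (suits_buried or suits_exposed) else -1
    return score


def thread(sequence: str, environments: str) -> int:
    """Best ungapped placement of one template along the query."""
    n = len(environments)
    if n > len(sequence):
        raise ValueError("template is longer than the query")
    return max(
        fit_score(sequence[i:i + n], environments)
        for i in range(len(sequence) - n + 1)
    )


def recognize_fold(sequence: str, library: dict) -> list:
    """Rank every template in the library by its best fit to the query."""
    scores = {name: thread(sequence, envs) for name, envs in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical two-fold library; names and patterns are made up.
    library = {"fold_1": "BBEEBB", "fold_2": "EEBBEE"}
    print(recognize_fold("LLKKLL", library))
```

Raw scores from templates of different lengths are not directly comparable; a real method would normalize them, for example against scores of shuffled sequences.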

Programs for fold recognition
TOPITS (Rost 1995)
GenTHREADER (Jones 1999)
SAM-T02 (UCSC HMM)
3D-PSSM

Ab Initio Modeling
Compute molecular structure from the laws of physics and chemistry alone
– Ideal solution (theoretically)
Simulate the process of protein folding
– Apply minimum energy considerations
Practically nearly impossible
– Exceptionally complex calculations
– Biophysical understanding is incomplete

Ab Initio Methods
Rosetta (Baker lab, Seattle)
Undertaker (Karplus, UCSC)

Part 2: Predicting Protein Function

Inferring protein function:
Based on the existence of known protein domains
Based on homology

Protein Domains Domains can be considered as building blocks of proteins. Some domains can be found in many proteins with different functions, while others are only found in proteins with a certain function. The presence of a particular domain can be indicative of the function of the protein.

Example: a DNA-binding domain, the zinc finger

A protein domain can be defined by:
A motif
A profile (PSSM)
A hidden Markov model

MOTIF Rxx(F,Y,W)(R,K)SAQ
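A motif written this way maps directly onto a regular expression. The sketch below assumes that x stands for any residue and that a parenthesised list means "one of these residues"; the function name and the test sequence are made up for illustration.

```python
import re

# Rxx(F,Y,W)(R,K)SAQ read as: R, any two residues, one of F/Y/W, one of R/K, then SAQ.
MOTIF = re.compile(r"R..[FYW][RK]SAQ")


def find_motif(sequence: str) -> list:
    """Return (start index, matched text) for each non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]


if __name__ == "__main__":
    print(find_motif("MKRAAFKSAQLL"))  # hypothetical sequence
```

Note that `finditer` reports non-overlapping matches only, which is usually acceptable for a motif of this length.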

Profile Scoring

PROSITE
PROSITE is a database of protein domains that can be searched by either regular expression patterns or sequence profiles.
Example: Zinc_Finger_C2H2
Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H
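Patterns in the shorthand used on this slide can be converted to regular expressions mechanically. The converter below is a sketch for that shorthand only (x for any residue, x3 or x{2,4} for repeats, a parenthesised list for alternatives). PROSITE's own pattern syntax is written differently, with hyphen-separated elements and square brackets, so this function is not a PROSITE parser; its name is hypothetical.

```python
import re

# One token: a parenthesised residue list, an x with optional repeat, or a single residue.
TOKEN = re.compile(
    r"\(([A-Z](?:,[A-Z])*)\)"            # (L,I,V)  -> group 1
    r"|x(?:\{(\d+),(\d+)\}|(\d+))?"      # x, x3, x{2,4} -> groups 2, 3, 4
    r"|([A-Z])"                          # literal residue -> group 5
)


def slide_pattern_to_regex(pattern: str) -> str:
    """Translate the slide's motif shorthand into a Python regular expression."""
    parts = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"cannot parse pattern at position {pos}")
        alternatives, low, high, count, literal = m.groups()
        if alternatives:
            parts.append("[" + alternatives.replace(",", "") + "]")
        elif literal:
            parts.append(literal)
        elif low:
            parts.append(".{%s,%s}" % (low, high))
        elif count:
            parts.append(".{%s}" % count)
        else:
            parts.append(".")  # a bare x
        pos = m.end()
    return "".join(parts)


if __name__ == "__main__":
    regex = slide_pattern_to_regex("Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H")
    print(regex)
```

The resulting string can be passed to `re.search` to scan a protein sequence for the zinc finger pattern.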

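Profile HMM (Hidden Markov Model)
An HMM is a probabilistic model of the MSA consisting of a number of interconnected states: match (M), delete (D) and insert (I).
[Figure: profile HMM with match states M16-M19, delete states D16-D19 and insert states I16-I19; transition labels shown: 100%, 50%]
Match-state emission probabilities, in column order:
D 0.8, S 0.2
P 0.4, R 0.6
T 1.0
R 0.4, S 0.6
Alignment rows shown on the slide:
D R T R
D R T S
S - - S
S P T R
D R T R
D P T S
D - - S
D - - R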
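Scoring a sequence against the match states can be sketched with the emission probabilities given on the slide. This is a deliberately reduced example: it follows only the path through the four match states, ignores transition probabilities and the insert and delete states, and uses no pseudocounts, so any residue not seen in a column gets probability zero. The uniform background of 1/20 and the function names are assumptions.

```python
import math

# Match-state emission probabilities as listed on the slide, in column order.
EMISSIONS = [
    {"D": 0.8, "S": 0.2},
    {"P": 0.4, "R": 0.6},
    {"T": 1.0},
    {"R": 0.4, "S": 0.6},
]

BACKGROUND = 1.0 / 20.0  # assumed uniform background frequency per residue


def match_path_probability(sequence: str) -> float:
    """Probability of emitting the sequence along the match states only."""
    if len(sequence) != len(EMISSIONS):
        raise ValueError("this sketch handles only ungapped, full-length sequences")
    probability = 1.0
    for residue, column in zip(sequence, EMISSIONS):
        probability *= column.get(residue, 0.0)
    return probability


def log_odds(sequence: str) -> float:
    """Log2 ratio of the model probability to the uniform background."""
    probability = match_path_probability(sequence)
    if probability == 0.0:
        return float("-inf")
    return math.log2(probability / BACKGROUND ** len(sequence))


if __name__ == "__main__":
    for candidate in ("DRTR", "SPTS", "DRTA"):
        print(candidate, match_path_probability(candidate), log_odds(candidate))
```

A full profile HMM would also multiply in the transition probabilities and sum or maximize over all state paths, typically with the forward or Viterbi algorithm.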

Pfam
A database that contains a large collection of multiple sequence alignments and profile hidden Markov models (HMMs).
The Pfam database is based on two distinct classes of alignments
– Seed alignments, which are deemed to be accurate and are used to produce Pfam-A
– Alignments derived by automatic clustering of Swiss-Prot, which are less reliable and give rise to Pfam-B
High-quality seed alignments are used to build HMMs, to which sequences are then aligned

InterPro
Was built from protein classification databases such as:
PROSITE
ProDom
SMART
Pfam
PRINTS
Uses UniProt (Swiss-Prot and TrEMBL)

Databases and tools for protein families and domains
InterPro - Integrated Resource of Protein Domains and Functional Sites
PROSITE - A database of protein families and domains
BLOCKS - BLOCKS db
Pfam - Protein families db (HMM derived)
PRINTS - Protein motif fingerprint db
ProDom - Protein domain db (automatically generated)
PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins
SBASE - SBASE domain db
SMART - Simple Modular Architecture Research Tool
TIGRFAMs - TIGR protein families db

Inferring protein function based on sequence homology

Clusters of Orthologous Groups of proteins (COGs)
Classification of conserved genes according to their homologous relationships (Koonin et al., NAR).
Homologs - Proteins with a common evolutionary origin
Paralogs - Proteins encoded within a given species that arose from one or more gene duplication events
Orthologs - Proteins from different species that evolved by vertical descent (speciation)

Clusters of Orthologous Groups of proteins (COGs) Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG.

COGs - Clusters of orthologous groups
* All-against-all sequence comparison of the proteins encoded in completed genomes (paralogs/orthologs)
* For a given protein "a" in genome A, if there are several similar proteins in genome B, the most similar one, "b", is selected
* If, when protein "b" is used as the query, protein "a" in genome A is selected as the best hit, then "a" and "b" can be included in a COG
* Proteins in a COG are more similar to other proteins in the COG than to any other protein in the compared genomes
* A COG is defined when it includes at least three homologous proteins from three distant genomes
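The best-hit rules above can be sketched in a few lines. The code assumes that the all-against-all comparison has already produced a symmetric similarity score for each protein pair; the data layout, function names and example proteins are hypothetical. It stops at the minimal case, a triangle of mutually best hits from three genomes, and does not merge triangles into larger clusters.

```python
from itertools import combinations


def similarity(a: str, b: str, scores: dict) -> float:
    """Symmetric lookup; pairs that were never matched score 0."""
    return scores.get(frozenset((a, b)), 0.0)


def best_hit(query: str, targets: list, scores: dict):
    """Most similar protein among targets, or None if nothing is similar."""
    hits = [(similarity(query, t, scores), t) for t in targets]
    hits = [hit for hit in hits if hit[0] > 0]
    return max(hits)[1] if hits else None


def reciprocal_best_hits(genomes: dict, scores: dict) -> set:
    """Pairs (a, b) from two genomes where each is the other's best hit."""
    pairs = set()
    for g1, g2 in combinations(genomes, 2):
        for a in genomes[g1]:
            b = best_hit(a, genomes[g2], scores)
            if b is not None and best_hit(b, genomes[g1], scores) == a:
                pairs.add(frozenset((a, b)))
    return pairs


def minimal_cogs(genomes: dict, scores: dict) -> list:
    """Triangles of mutually best hits drawn from three different genomes."""
    pairs = reciprocal_best_hits(genomes, scores)
    cogs = []
    for g1, g2, g3 in combinations(genomes, 3):
        for a in genomes[g1]:
            for b in genomes[g2]:
                for c in genomes[g3]:
                    sides = ((a, b), (b, c), (a, c))
                    if all(frozenset(side) in pairs for side in sides):
                        cogs.append((a, b, c))
    return cogs


if __name__ == "__main__":
    # Hypothetical genomes and similarity scores.
    genomes = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1"]}
    scores = {
        frozenset(("a1", "b1")): 90.0,
        frozenset(("a2", "b1")): 40.0,
        frozenset(("a1", "c1")): 85.0,
        frozenset(("a2", "c1")): 30.0,
        frozenset(("b1", "c1")): 80.0,
    }
    print(minimal_cogs(genomes, scores))
```

In this example a2 is left out because b1 and c1 both prefer a1, which is the behaviour the reciprocal best-hit rule is meant to enforce.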