Protein RNA DNA Predicting Protein Function. Biochemical function (molecular function) What does it do? Kinase??? Ligase??? Page 245.

Presentation transcript:

Predicting Protein Function (DNA → RNA → protein)

Biochemical function (molecular function) What does it do? Kinase??? Ligase??? Page 245

Function based on ligand binding specificity What (who) does it bind ?? Page 245

Function based on biological process What is it good for ?? Amino acid metabolism? Page 245

Function based on cellular location: Where is it active? Nucleolus? Cytoplasm? (Page 245)

Function based on cellular location: Where is the RNA/protein expressed? Brain? Testis? Where is it underexpressed? (Page 245)

GO (Gene Ontology): The GO project aims to develop three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated molecular functions (F), biological processes (P) and cellular components (C). An ontology is a description of the concepts and relationships that can exist for an agent or a community of agents.

GO annotations for RIM11, with GO evidence codes (extracted from SGD, the Saccharomyces Genome Database)
Molecular Function: glycogen synthase kinase 3 activity (ISS); protein serine/threonine kinase activity (IDA)
Biological Process: protein amino acid phosphorylation (IGI, ISS); proteolysis (IGI); response to stress (IGI, IMP); sporulation (sensu Fungi) (IMP)
Cellular Component: cytoplasm (IDA)
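As a small illustration of how such annotations can be handled programmatically, the sketch below stores the RIM11 terms from the slide in a nested dictionary and filters them by evidence code. The dictionary layout and the function name `terms_with_evidence` are assumptions made for this example; they are not an SGD or GO API.

```python
# Meanings of the GO evidence codes that appear on the slide.
EVIDENCE = {
    "IDA": "Inferred from Direct Assay",
    "IGI": "Inferred from Genetic Interaction",
    "IMP": "Inferred from Mutant Phenotype",
    "ISS": "Inferred from Sequence or Structural Similarity",
}

# RIM11 annotations transcribed from the slide:
# ontology (F, P, C) -> GO term -> list of evidence codes.
RIM11 = {
    "F": {
        "glycogen synthase kinase 3 activity": ["ISS"],
        "protein serine/threonine kinase activity": ["IDA"],
    },
    "P": {
        "protein amino acid phosphorylation": ["IGI", "ISS"],
        "proteolysis": ["IGI"],
        "response to stress": ["IGI", "IMP"],
        "sporulation (sensu Fungi)": ["IMP"],
    },
    "C": {
        "cytoplasm": ["IDA"],
    },
}


def terms_with_evidence(annotations, codes):
    """Return sorted (ontology, term) pairs supported by any of the given codes."""
    wanted = set(codes)
    return sorted(
        (ontology, term)
        for ontology, terms in annotations.items()
        for term, evidence in terms.items()
        if wanted & set(evidence)
    )


if __name__ == "__main__":
    # Terms backed by a direct assay rather than by similarity or genetics.
    for ontology, term in terms_with_evidence(RIM11, ["IDA"]):
        print(ontology, term)
```

Filtering on the evidence code is useful because a term supported only by ISS was itself inferred from similarity, so transferring it again to another protein stacks one prediction on another.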

Inferring protein function: the bioinformatics approach. Based on homology, or based on the existence of known protein domains (the protein signature).

Homologous proteins  Rule of thumb: two proteins are likely to be homologous if they are at least 25% identical over an alignment longer than 100 residues.
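The rule of thumb can be sketched in a few lines. This assumes the two sequences are already aligned (equal length, `-` for gaps) and that identity is counted over ungapped columns only. Other definitions of percent identity exist, so the function names and the choice of denominator here are illustrative assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, number of compared columns) for two aligned sequences.

    Columns where either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    identity = 100.0 * matches / compared if compared else 0.0
    return identity, compared


def likely_homologous(aligned_a, aligned_b, min_identity=25.0, min_length=100):
    """Apply the slide's rule of thumb: at least 25% identity over more than 100 residues."""
    identity, compared = percent_identity(aligned_a, aligned_b)
    return identity >= min_identity and compared > min_length


if __name__ == "__main__":
    print(percent_identity("ACDE-F", "ACDQGF"))
```

The length condition matters because short alignments reach 25% identity by chance far more easily than long ones.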

Homologous proteins: proteins with a common evolutionary origin.
Paralogs: proteins encoded within a given species that arose from one or more gene duplication events.
Orthologs: proteins from different species that evolved by speciation.
Examples: human hemoglobin vs mouse hemoglobin (orthologs); human hemoglobin vs human myoglobin (paralogs).

COGs: Clusters of Orthologous Groups of proteins. Each COG consists of individual orthologous proteins or orthologous sets of paralogs. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG. Database reference: Classification of conserved genes according to their homologous relationships (Koonin et al., NAR).

Inferring protein function based on the protein signature

The Protein Signature
Signature: the existence of a known protein domain or motif.
Domain: a region of a protein that can adopt a 3D structure.
Motif (or fingerprint): a short, conserved region of a protein, typically 10 to 20 contiguous amino acid residues.
Examples: zinc finger domain, immunoglobulin domain.

DNA Binding domain Zinc-Finger

Protein Domains Domains can be considered as building blocks of proteins. Some domains can be found in many proteins with different functions, while others are only found in proteins with a certain function.

Varieties of protein domains (Page 228): extending along the length of a protein; occupying a subset of a protein sequence; occurring one or more times.

Example of a protein with 2 domains: Methyl CpG binding protein 2 (MeCP2). The protein includes a Methylated DNA Binding Domain (MBD) and a Transcriptional Repression Domain (TRD). MeCP2 is a transcriptional repressor.

Result of an MeCP2 blastp search: A methyl-binding domain shared by several proteins

Are proteins that share only a domain homologous?

PROSITE: PROSITE is a database of protein patterns that can be searched with either regular-expression-like patterns or sequence profiles. Example, the C2H2 zinc finger pattern (Zinc_Finger_C2H2) in PROSITE syntax: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
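A PROSITE-style pattern maps almost directly onto a regular expression. The sketch below converts the C2H2 zinc finger pattern, written in PROSITE syntax as C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, into a Python regex and scans the zif268 fragment that appears later in this presentation. The converter `prosite_to_regex` is a hypothetical helper that handles only the elements used here (`x`, single residues, `[...]`, `{...}` and repeat counts); it is not the PROSITE scanning software.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split an element such as "x(2,4)" into its core and optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":
            regex = "."                      # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"  # any residue except those listed
        else:
            regex = core                     # single residue or [allowed set]
        if low and high:
            regex += "{%s,%s}" % (low, high)
        elif low:
            regex += "{%s}" % low
        parts.append(regex)
    return "".join(parts)


C2H2 = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
ZIF268_FRAGMENT = "MERPYACPVESCDRRFSRSDELTRHIRIHT"

if __name__ == "__main__":
    regex = prosite_to_regex(C2H2)
    hit = re.search(regex, ZIF268_FRAGMENT)
    print(regex)
    print(hit.start(), hit.group())
```

A pattern hit alone says nothing about statistical significance, which is the motivation for combining patterns with alignment, as PHI-BLAST does.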

PHI-BLAST (Pattern-Hit Initiated BLAST): searching for a specific protein sequence pattern together with local alignments surrounding the match (Page 145). PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology. Example: search for a short sequence motif in the lipocalin family.

PHI-BLAST: Given 1) a protein sequence S and 2) a pattern P occurring in S, PHI-BLAST helps answer the question: what other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? (Page 145)

Align three lipocalins (RBP and two bacterial lipocalins):
        1                                                   50
ecblc   MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc      MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp   ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

Concentrate on the conserved region of interest and see which amino acid residues are used:
ecblc   GTWYEI
vc      GKWYEV
hsrbp   GTWYAM

Create a pattern using the appropriate syntax: GXW[YF][EA][IVLM]
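To check that the pattern really covers the conserved region in all three lipocalins, it can be run as a regular expression over the ungapped sequences. This sketch covers only the pattern-occurrence step. PHI-BLAST additionally builds and scores local alignments around each hit, which is not attempted here. Treating `X` as "any residue" and stripping the `.` and `~` gap characters are assumptions about the slide's notation.

```python
import re

# Aligned N-terminal regions as shown on the slide (gap characters: "." and "~").
ALIGNED = {
    "ecblc": "MRLLPLVAAATAAFLVVACSSPTPPRGVTVVNNFDAKRYLGTWYEIARFD",
    "vc":    "MRAIFLILCSV...LLNGCLG..MPESVKPVSDFELNNYLGKWYEVARLD",
    "hsrbp": "~~~MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKD",
}

PATTERN = "GXW[YF][EA][IVLM]"


def pattern_to_regex(pattern):
    """Turn the slide's pattern into a regex: X stands for any residue."""
    return pattern.replace("X", ".")


def find_pattern(aligned_sequences, pattern):
    """Return {name: first matching substring or None} for the ungapped sequences."""
    regex = re.compile(pattern_to_regex(pattern))
    hits = {}
    for name, aligned in aligned_sequences.items():
        ungapped = aligned.replace(".", "").replace("~", "")
        match = regex.search(ungapped)
        hits[name] = match.group() if match else None
    return hits


if __name__ == "__main__":
    for name, hit in find_pattern(ALIGNED, PATTERN).items():
        print(name, hit)
```

Note that the pattern is deliberately more permissive than the three observed sequences (for example `[YF]` at a position where only Y was seen), which trades specificity for the chance of finding more distant family members.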

Results

Pfam: a database containing a large collection of multiple sequence alignments of protein domains, based on profile hidden Markov models (HMMs).

Profile HMM (Hidden Markov Model)
An HMM is a probabilistic model of the MSA consisting of a number of interconnected states: match (M), delete (D) and insert (I).
[Figure: profile HMM with match states M16-M19, insert states I16-I19 and delete states D16-D19. Emission probabilities shown: D 0.8, S 0.2; P 0.4, R 0.6; T 1.0; R 0.4, S 0.6. Transition labels shown: 100%, 50%. Example alignment rows: DRTR, DRTS, S--S, SPTR, DRTR, DPTS, D--S, D--R.]
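The match-state emission probabilities of a profile HMM are, in the simplest case, the residue frequencies in each alignment column. The sketch below computes them without pseudocounts. The eight rows are read off the flattened slide text four characters at a time, which is an assumption about the original figure, and the function name is hypothetical. With this reading, columns 2 and 3 reproduce the figure's values (R 0.6 / P 0.4 and T 1.0), while columns 1 and 4 come out as 0.75/0.25 and 0.5/0.5 rather than the 0.8/0.2 and 0.4/0.6 printed on the slide, so the figure may have used rounding or slightly different data.

```python
from collections import Counter

# Alignment rows as reconstructed from the slide (an assumption, see lead-in).
MSA = ["DRTR", "DRTS", "S--S", "SPTR", "DRTR", "DPTS", "D--S", "D--R"]


def match_emissions(msa, gap="-"):
    """Return one {residue: probability} dict per column, ignoring gap characters.

    This is the maximum-likelihood estimate with no pseudocounts, which real
    profile HMM software would normally add.
    """
    emissions = []
    for column in zip(*msa):
        residues = [c for c in column if c != gap]
        counts = Counter(residues)
        total = len(residues)
        emissions.append({aa: n / total for aa, n in counts.items()})
    return emissions


def gap_fraction(msa, gap="-"):
    """Fraction of sequences with a gap in each column (a rough proxy for using a delete state)."""
    return [sum(c == gap for c in column) / len(msa) for column in zip(*msa)]


if __name__ == "__main__":
    for i, probs in enumerate(match_emissions(MSA), start=1):
        print("column", i, probs)
    print(gap_fraction(MSA))
```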

Pfam: The Pfam database is based on two distinct classes of alignments: seed alignments, which are deemed to be accurate and are used to produce Pfam-A; and alignments derived by automatic clustering of SwissProt, which are less reliable and give rise to Pfam-B.

Physical properties of proteins

DNA binding domains have a relatively high frequency of basic (positively charged) amino acids:
GCN4    MKDPAALKRARNTEAARRSSRARKLQRM
zif268  MERPYACPVESCDRRFSRSDELTRHIRIHT
myoD    SKVNEAFETLKRCTSSNPNQRLPKVEILRNAIR
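The claim is easy to check numerically for the three fragments on the slide. The sketch below counts lysine (K) and arginine (R) as basic. Leaving histidine out is an assumption, since it is only partially protonated at physiological pH.

```python
# DNA-binding fragments as shown on the slide.
FRAGMENTS = {
    "GCN4": "MKDPAALKRARNTEAARRSSRARKLQRM",
    "zif268": "MERPYACPVESCDRRFSRSDELTRHIRIHT",
    "myoD": "SKVNEAFETLKRCTSSNPNQRLPKVEILRNAIR",
}

BASIC = set("KR")  # assumption: histidine is not counted as basic


def basic_fraction(sequence):
    """Fraction of residues in the sequence that are lysine or arginine."""
    if not sequence:
        return 0.0
    return sum(aa in BASIC for aa in sequence) / len(sequence)


if __name__ == "__main__":
    for name, seq in FRAGMENTS.items():
        print(f"{name}: {basic_fraction(seq):.2f}")
```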

Transmembrane proteins have a unique hydrophobicity pattern
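A common way to visualise this pattern is a sliding-window hydropathy profile. The slide does not name a scale, so the sketch below assumes the Kyte-Doolittle scale, with a 19-residue window and a cutoff of 1.6 as an often-quoted convention for membrane-spanning segments. Both parameters are adjustable assumptions, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start positions (0-based) of windows whose average hydropathy reaches the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score >= threshold]


if __name__ == "__main__":
    # Made-up toy sequence: a poly-leucine stretch flanked by charged residues.
    toy = "DEKR" * 3 + "L" * 19 + "DEKR" * 3
    print(candidate_tm_windows(toy))
```

Hydrophobic stretches of roughly this length are long enough to span a lipid bilayer as an alpha helix, which is why the window size is tied to helix length rather than chosen arbitrarily.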

Physical properties of proteins: Many websites are available for the analysis of individual proteins, for example ExPASy, the UCSC Proteome Browser and ProtoNet (HUJI). The accuracy of the analysis programs is variable. Predictions based on primary amino acid sequence (such as molecular weight prediction) are likely to be more trustworthy. For many other properties (such as phosphorylation sites), experimental evidence may be required rather than prediction algorithms. (Page 236)

Knowledge Based Approach IDEA Find the common properties of a protein family (or any group of proteins of interest) which are unique to the group and different from all the other proteins. Generate a model for the group and predict new members of the family which have similar properties.

Knowledge Based Approach: Basic Steps
1. Building a Model
Generate a dataset of proteins with a common function (e.g. DNA binding proteins).
Generate a control dataset.
Calculate the properties that are characteristic of the protein family of interest for all the proteins in the data (the DNA binding proteins and the non-DNA binding proteins).
Represent each protein in a set by a vector of calculated features and build a statistical model to split the groups.

2. Predicting the function of a new protein
Calculate the properties for a new protein and represent them in a vector.
Predict whether the tested protein belongs to the family.
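The two steps can be sketched end to end with a deliberately tiny example. The presentation does not say which features or which statistical model were used, so this sketch assumes two features (fraction of basic and of acidic residues) and swaps in a nearest-centroid classifier as the simplest possible stand-in. The positive examples are DNA-binding fragments shown earlier in this presentation. The control sequences are made up for illustration and are not real proteins.

```python
import math

BASIC = set("KR")
ACIDIC = set("DE")


def features(sequence):
    """Feature vector: (fraction of basic residues, fraction of acidic residues)."""
    n = len(sequence)
    return (
        sum(aa in BASIC for aa in sequence) / n,
        sum(aa in ACIDIC for aa in sequence) / n,
    )


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0])))


def build_model(family, control):
    """Step 1: summarise each group by the centroid of its feature vectors."""
    return {
        "family": centroid([features(s) for s in family]),
        "control": centroid([features(s) for s in control]),
    }


def predict(model, sequence):
    """Step 2: assign a new protein to the group with the nearest centroid."""
    x = features(sequence)
    return min(model, key=lambda label: math.dist(x, model[label]))


# DNA-binding fragments from the presentation (GCN4, myoD).
FAMILY = ["MKDPAALKRARNTEAARRSSRARKLQRM", "SKVNEAFETLKRCTSSNPNQRLPKVEILRNAIR"]
# Hypothetical control sequences, invented for this example.
CONTROL = ["DEDEALLVVIIGG", "AEELLDVVGGSSD"]

if __name__ == "__main__":
    model = build_model(FAMILY, CONTROL)
    print(predict(model, "KRKRKRAAL"))
    print(predict(model, "DEDEDELLV"))
```

A realistic model would use many more proteins and features, hold out data to estimate accuracy, and choose a classifier suited to the data. The structure of the two steps stays the same.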

TEST CASE: Y14, a protein sequence translated from an ORF (Open Reading Frame) obtained from the complete Drosophila genome
>Y14
PQRSVGWILFVTSIHEEAQEDEIQEKFCDYGEIKNIHLNLDRRTGFSKGYALVEYETHKQALAAKEALNGAEIMGQTIQVDWCFVKGG

Result: Y14 DOES NOT BIND RNA

Databases and tools for protein families and domains
InterPro - Integrated Resources of Proteins Domains and Functional Sites
PROSITE - A database of protein families and domains
BLOCKS - BLOCKS db
Pfam - Protein families db (HMM derived)
PRINTS - Protein Motif fingerprint db
ProDom - Protein domain db (automatically generated)
PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins
SBASE - SBASE domain db
SMART - Simple Modular Architecture Research Tool
TIGRFAMs - TIGR protein families db