Pattern databases in protein analysis
Arthur Gruber
Instituto de Ciências Biomédicas, Universidade de São Paulo
AG-ICB-USP



Protein databases
GenPept: a protein sequence database translated from GenBank.
UniProtKB/TrEMBL: a computer-annotated protein sequence database that complements the UniProtKB/Swiss-Prot Protein Knowledgebase.
UniProtKB/Swiss-Prot: a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.

How to assign protein functions?
Similar proteins may share common functions, but proteins that share common domains may have evolved to perform distinct functions.
Proteins that exert similar functions may share common domains, but domain sequences are not always very similar: methods more refined than simple similarity searches are required.
Proteins may share common domains yet have different architectures: no single domain is necessarily responsible for the protein's function. Many proteins use multiple domains to perform their activities.

Some conclusions
Similarity searches may reveal proteins that share very similar sequences and functions, that is, hits with high similarity over the full length of the query sequence.
An output with no significant hits, or with hits only to unannotated proteins, will not reveal the possible function of the query protein.
Similarity searches do not differentiate orthologues from paralogues.
When matching multidomain proteins, it may not be appropriate to transfer the functional annotation: the context is important!

So what do proteins with similar function have in common?

Residues, motifs, domains, architecture…

Pattern databases
Databases that contain patterns of residue conservation within groups of related sequences.
There are several methods to determine patterns.
There are many different pattern databases.


Common protein pattern databases
PROSITE patterns: regular expressions.
PROSITE profiles: weight matrices (profiles).
Pfam: a database of protein domain families. It contains curated multiple sequence alignments for each family and the corresponding HMMs.
PRINTS: a database of groups of motifs that, taken together, are more powerful for assigning protein function than any single motif.
ProDom: an automatically generated database built by recursive PSI-BLAST similarity searches.
InterPro: an integrated database that combines different protein signature recognition methods in a single resource.
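
Because PROSITE patterns are regular expressions, they can be translated almost mechanically into the regex syntax of a programming language. The Python sketch below is an illustration only, not PROSITE's own scanning software: the function names `prosite_to_regex` and `find_sites` are hypothetical, only the most common pattern elements are handled (x, [ ], { }, repetition counts, and the < and > anchors), and the test sequence is invented. The pattern used is the well-known N-glycosylation site, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles x (any residue), [ABC] (allowed), {ABC} (forbidden),
    repetitions such as x(3) or x(2,4), and the < / > terminus anchors.
    """
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # [ABC] and single residues are already valid regex fragments.
        if low and high:
            repeat = "{%s,%s}" % (low, high)
        elif low:
            repeat = "{%s}" % low
        else:
            repeat = ""
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)


def find_sites(pattern, sequence):
    """Return (1-based start, matched segment) for every site, overlaps included."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Invented test sequence: NGSA matches, NPTQ does not (P follows N).
print(prosite_to_regex("N-{P}-[ST]-{P}."))            # N[^P][ST][^P]
print(find_sites("N-{P}-[ST]-{P}.", "MKNGSAANPTQ"))   # [(3, 'NGSA')]
```

The lookahead wrapper is a design choice: it reports overlapping sites, which a plain `re.finditer` would skip.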

How to start building a pattern database?
With multiple sequence alignments of functionally related proteins.
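
As a rough illustration of how an alignment can seed a pattern, the Python sketch below reads an ungapped alignment column by column and writes a PROSITE-like consensus expression. It is a simplification, not the curation procedure of any real database: the function name `alignment_to_pattern`, the `max_alternatives` threshold and the four toy sequences are all assumptions made for this example.

```python
def alignment_to_pattern(alignment, max_alternatives=3):
    """Derive a PROSITE-like pattern from an ungapped alignment.

    Invariant columns become a single residue, columns with a few
    alternatives become [ABC], and more variable columns become x.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    elements = []
    for column in zip(*alignment):       # iterate over alignment columns
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")
        else:
            elements.append("x")
    return "-".join(elements)


# Toy alignment of a hypothetical conserved region (invented data).
toy_alignment = ["CKHLDC", "CRHIEC", "CKHVGC", "CAHLSC"]
print(alignment_to_pattern(toy_alignment))   # C-[AKR]-H-[ILV]-x-C
```

A pattern derived this way only records which residues were observed. It says nothing about how often each occurs, which is the information that frequency matrices and profiles add.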

Some definitions
Protein motif: a single conserved region.
PROSITE pattern: a consensus expression of a conserved region.
Frequency matrices (PRINTS): matrices that contain the frequencies with which residues occur at each position of a given motif.
PSSMs, position-specific scoring (weight) matrices (BLOCKS): frequency matrices with an added scoring scheme.
Profile HMMs: probabilistic models derived from alignment profiles.
Protein domain: a part of the protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain.
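
The step from a frequency matrix to a PSSM can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not the scoring scheme used by PRINTS or BLOCKS: it assumes an ungapped alignment containing only the 20 standard residues, a flat pseudocount of 1, a uniform background distribution, and log2 odds as the scoring scheme. The function names and the toy data are invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_matrix(alignment, pseudocount=1.0):
    """Per-column residue frequencies, smoothed with a flat pseudocount."""
    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        matrix.append({aa: (counts[aa] + pseudocount) / total
                       for aa in AMINO_ACIDS})
    return matrix


def pssm(frequencies, background=None):
    """Turn frequencies into log2-odds scores against a background."""
    if background is None:
        # Assumption: uniform background; real databases use observed
        # amino acid frequencies instead.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return [{aa: math.log2(column[aa] / background[aa]) for aa in AMINO_ACIDS}
            for column in frequencies]


def best_hit(matrix, sequence):
    """Slide the PSSM along the sequence; return (best score, 1-based start)."""
    width = len(matrix)
    return max(
        (sum(col[aa] for col, aa in zip(matrix, sequence[i:i + width])), i + 1)
        for i in range(len(sequence) - width + 1)
    )


# Toy motif alignment and query sequence (invented data).
toy_alignment = ["CKHLDC", "CRHIEC", "CKHVGC", "CAHLSC"]
scores = pssm(frequency_matrix(toy_alignment))
score, start = best_hit(scores, "MAACKHLDCGG")
print(start, round(score, 2))
```

Unlike a regular expression, the matrix gives every window a graded score, so a segment that deviates at one position can still rank highly.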
