PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.

Presentation transcript:

PROTEIN PATTERN DATABASES

PROTEIN SEQUENCES: SUPERFAMILY > FAMILY > DOMAIN > MOTIF > SITE > RESIDUE

BASIC INFORMATION COMES FROM SEQUENCE
- Multiple alignments of related sequences can be used to build up consensus sequences of known families, domains, motifs or sites.
- Representations: pattern, matrix, profile, HMM

COMMON PROTEIN PATTERN DATABASES
- Pattern databases: Prosite patterns, Prosite profiles, Pfam, SMART, Prints, TIGRFAMs, BLOCKS
- Alignment databases: ProDom, PIR-ALN, ProtoMap, Domo, ProClass

PROSITE
Patterns and profiles
Building a pattern:
a) From the literature: test against SP (Swiss-Prot) and update if necessary.
b) New patterns: start with a reviewed protein family and its known functional sites:
   - enzyme catalytic site
   - attachment site, e.g. heme
   - metal ion binding site
   - cysteines for disulphide bonds
   - molecule (e.g. GTP) or protein binding site

PROSITE PATTERNS
1. Start from an alignment and identify the functionally important residues.
2. Find 4-5 conserved residues to form a core pattern.
3. Search SP with the core pattern.
4. If it finds only correct entries, leave it as is. If it finds many false positives, extend the pattern and search again.
The pattern is given as a regular expression, e.g. [AC]-x-V-x(4)-{ED}, which reads: Ala or Cys, any, Val, any, any, any, any, (any except Glu or Asp).
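To make the pattern syntax concrete, here is a minimal Python sketch that translates a PROSITE-style pattern into an ordinary regular expression. The function name `prosite_to_regex` is hypothetical, and the sketch handles only the elements shown on the slide (residue sets, `x`, repeat counts, and `{}` exclusions). Other parts of the pattern syntax, such as the sequence-end anchors, are deliberately left out.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex (simplified)."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split an element into its core and an optional repeat, e.g. x(4) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            token = core                       # single residue "V" or set "[AC]"
        if lo and hi:
            token += "{%s,%s}" % (lo, hi)
        elif lo:
            token += "{%s}" % lo
        parts.append(token)
    return "".join(parts)

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)  # [AC].V.{4}[^ED]

# Made-up test sequence, for illustration only.
hit = re.search(regex, "MKAGVLLLLKQ")
print(hit.group(0) if hit else None)  # AGVLLLLK
```

Scanning a sequence database with the resulting regex and counting true and false hits is what drives the "extend and re-search" loop described above.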

PROSITE PROFILES
- Not confined to small regions: they cover a whole protein or domain and carry more information on the amino acids allowed at each position.
- Start with a multiple sequence alignment; a symbol comparison table is used to convert residue frequency distributions into weights.
- Result: a table of position-specific amino acid weights and gap costs, used to calculate a similarity score for any alignment between a profile and a sequence, or between parts of a profile and a sequence.
- Tested on SP and refined.
- Begin as prefiles, then are integrated.
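The idea of a table of position-specific weights can be illustrated with a much simplified Python sketch. It is not the PROSITE procedure: it assumes an ungapped alignment, uses a uniform background and a flat pseudocount in place of a symbol comparison table, and has no gap costs. The function names and the toy alignment are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds weights per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    profile = []
    for pos in range(len(alignment[0])):
        column = [seq[pos] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        weights = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            weights[aa] = math.log2(freq / background)
        profile.append(weights)
    return profile

def score_window(profile, segment):
    """Sum the position-specific weights along an ungapped segment."""
    return sum(weights[aa] for weights, aa in zip(profile, segment))

def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, start)."""
    width = len(profile)
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Toy alignment and sequence, made up for illustration.
profile = build_profile(["ACVK", "ACVR", "GCVK"])
score, start = best_hit(profile, "MMACVKMM")
print(start, round(score, 2))
```

Unlike a pattern, which either matches or does not, the profile gives every window a graded score, so a threshold has to be chosen to separate true members from the rest.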

Pfam
- Database of HMMs for domains and families.
- HMMs are built with HMMER2 (Bayesian statistical models) and can use two modes, ls or fs; all domains should be matched with ls.
- Uses bit scores; thresholds are chosen manually using E-values from an extreme value distribution fit.
- Two parts to Pfam:
  - PfamA: manually curated
  - PfamB: automatic clustering of the rest of SPTR, taken from ProDom using Domainer
- Use: looking at the domain structure of an SPTR protein or a new sequence.
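The relation between a bit score and an E-value under an extreme value (Gumbel) distribution can be sketched as follows. The location `mu` and scale `lam` are fitted per model in practice; the numbers below are invented purely for illustration, and the function names are hypothetical.

```python
import math

def evd_pvalue(score, mu, lam):
    """P(S >= score) under a Gumbel distribution with location mu, scale lam."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-math.exp(-lam * (score - mu)))

def evalue(score, mu, lam, n_sequences):
    """Expected number of hits scoring at least `score` in n_sequences."""
    return n_sequences * evd_pvalue(score, mu, lam)

# Hypothetical parameters: mu=-10 bits, lam=0.7, database of 100000 sequences.
for bits in (0.0, 10.0, 25.0):
    print(bits, evalue(bits, mu=-10.0, lam=0.7, n_sequences=100_000))
```

Because the E-value depends on the database size as well as the score, a manually chosen bit-score threshold stays stable as the database grows, whereas the corresponding E-value does not.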

Additional features of Pfam
- PfamA has about 65% coverage of SPTR; the rest is covered by PfamB.
- Can search directly with DNA, using the Wise2 package.
- Can view the taxonomic range of each entry.
- Can view proteins with a similar domain structure, and a view of all family members.
- Links to other databases, including 3D structure.
- Note: no two PfamA HMMs should overlap.

SMART - Simple Modular Architecture Research Tool
- Relies on hand-curated multiple sequence alignments of representative family members from PSI-BLAST.
- Builds HMMs from the alignments; these are used to search the database for more sequences to add to the alignment. Searching is iterated until no more homologues are detected.
- Stores Ep (highest per-protein E-value of T) and En (lowest per-protein E-value of N) values.
- Will predict a domain homologue in a sequence if Ep < E-value < En and E-value < 1.0.

Additional features of SMART
- Used for identification of genetically mobile domains and analysis of domain architectures.
- Can search for proteins containing specific combinations of domains in defined taxa.
- Can search for proteins with identical domain architecture.
- Also has information on intrinsic features such as signal sequences, transmembrane helices, coiled-coil regions and compositionally biased regions.

ProDom
- Groups all sequences in SPTR into domains, and domains into families.
- Uses an automatic process to build up domains: DOMAINER.
- For expert-curated families, PfamA alignments are used to build new ProDom families.
- Consistency of a family is measured by its diameter (maximum distance between two domains in the family) and its radius of gyration (root mean square of the distance between each domain and the family consensus), both counted in PAM (percent accepted mutations, i.e. number per 100 aa). The lower these values, the more homogeneous the family.
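The two consistency measures follow directly from their definitions. The Python sketch below assumes the PAM distances have already been computed; the function names and the distance values are hypothetical, and how ProDom itself derives the distances is not shown.

```python
import math

def family_diameter(pairwise_pam):
    """Largest PAM distance between any two domains in the family."""
    return max(pairwise_pam.values())

def radius_of_gyration(pam_to_consensus):
    """Root mean square of the domain-to-consensus PAM distances."""
    n = len(pam_to_consensus)
    return math.sqrt(sum(d * d for d in pam_to_consensus) / n)

# Made-up distances for a three-domain family.
pairwise = {("a", "b"): 30.0, ("a", "c"): 50.0, ("b", "c"): 40.0}
to_consensus = [20.0, 25.0, 35.0]

print(family_diameter(pairwise))
print(radius_of_gyration(to_consensus))
```

The diameter is driven by the single most divergent pair, while the radius of gyration averages over all members, so comparing the two can show whether a family is uniformly spread or has one or two outliers.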

Building of ProDom families
1. DB initialisation: SEG, split modified sequences, sort by size.
2. Query the DB, looking for internal repeats.
3. If a repeat is found, extract the first repeat and use it as the query; otherwise extract the shortest sequence as the query.
4. Run PSI-BLAST with the query; the hits form a new ProDom domain family.
5. Remove the newly found domains from the DB and sort by size.
6. Repeat until the database is empty.

PRINTS - Fingerprint DB
- Fingerprint: a set of motifs used to predict the occurrence of similar motifs in a sequence.
- Built by iterative scanning of the OWL database: multiple sequence alignment -> identify conserved motifs -> scan database with each motif -> correlate hitlists for each -> should have more sequences now -> generate more motifs -> repeat until convergence.
- Recognition of individual elements in a fingerprint is mutually conditional.
- True members match all elements in order; a subfamily may match only part of the fingerprint.
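The "all elements in order" rule can be illustrated with a toy Python sketch. It is a strong simplification: motifs are treated as exact strings rather than the scored motif matrices a real fingerprint scan would use, and the function names, motifs and sequences are made up.

```python
def match_fingerprint(sequence, motifs):
    """Find each motif in order; record its start position or None if absent."""
    positions = []
    cursor = 0  # each motif must start after the end of the previous hit
    for motif in motifs:
        idx = sequence.find(motif, cursor)
        if idx >= 0:
            positions.append(idx)
            cursor = idx + len(motif)
        else:
            positions.append(None)
    return positions

def classify(positions):
    """Label a sequence by how much of the fingerprint it matches."""
    found = sum(p is not None for p in positions)
    if found == len(positions):
        return "full match"
    return "partial match" if found else "no match"

# Hypothetical three-motif fingerprint and two toy sequences.
fingerprint = ["GDSGG", "CWVLT", "HCAGN"]
for seq in ("MKGDSGGPLVCWVLTAAHCAGNK", "MKGDSGGPLVCWVLTAA"):
    hits = match_fingerprint(seq, fingerprint)
    print(hits, classify(hits))
```

A partial match is kept rather than discarded, which mirrors the point above that a subfamily may match only part of the fingerprint.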