BLOCKS: multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins, represented as profiles.

Presentation transcript:

BLOCKS: Multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins, represented as profiles. Blocks are built with PROTOMAT (BLOSUM scoring model) and calibrated against SWISS-PROT; LAMA is used to search blocks against blocks. Starting sequences come from Prosite, PRINTS, Pfam, ProDom and Domo, for a total of 2129 families.
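To make "ungapped segments represented as a profile" concrete, here is a minimal Python sketch that turns one toy block into a position-specific log-odds matrix. It is not PROTOMAT: the function name `block_profile`, the uniform background frequency, the flat pseudocount and the example segments are all assumptions made for illustration, and the real Blocks pipeline uses a BLOSUM-based scoring model as stated above.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def block_profile(segments, pseudocount=1.0, alphabet=AMINO_ACIDS):
    """Build a per-column log-odds profile from an ungapped block.

    Hypothetical helper for illustration only; assumes a uniform
    background and a flat pseudocount, unlike the real PROTOMAT.
    """
    width = len(segments[0])
    if any(len(seg) != width for seg in segments):
        raise ValueError("all segments of an ungapped block must have equal width")
    background = 1.0 / len(alphabet)
    total = len(segments) + pseudocount * len(alphabet)
    profile = []
    for col in range(width):
        # Count the residues observed in this column of the block.
        counts = Counter(seg[col] for seg in segments)
        column = {}
        for aa in alphabet:
            freq = (counts.get(aa, 0) + pseudocount) / total
            column[aa] = math.log2(freq / background)
        profile.append(column)
    return profile

# Toy block: three aligned segments of width 5 (made-up data).
toy_block = ["GDSGG", "GDSGS", "GDAGG"]
toy_profile = block_profile(toy_block)
```

Each entry of `toy_profile` is one column of the block; residues that are conserved in a column receive higher scores than residues never seen there.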

Building of Blocks: Input alignments come from Prosite, ProDom, Pfam-A, PRINTS and Domo. The step "build blocks using PROTOMAT, search for common blocks with LAMA, remove" appears three times in the flow. The output is the Blocks database, annotated (verified, unverified, and changes).

SEARCHING BLOCKS: Compare a protein or DNA sequence (DNA is translated in 1-6 frames) to the database of blocks. The Blocks Searcher is used via the internet or [truncated in source]. The first position of the sequence is aligned to the first position of the first block and scored; the score is summed over the width of the alignment; the block is then aligned with the next sequence position, and so on for all blocks in the database, to obtain the best alignment score. The search is slow (350 aa per 2 min). It is also possible to search a database of PSI-BLAST PSSMs for each Blocks family using IMPALA.
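The sliding comparison described on this slide can be sketched in a few lines of Python. This is a toy illustration, not the Blocks Searcher itself: the function names, the made-up scoring tables and the choice to score unlisted residues as 0 are assumptions.

```python
def best_block_hit(sequence, block_pssm, missing_score=0):
    """Slide one ungapped block along the sequence; return (best_score, start).

    block_pssm is a list of {residue: score} dicts, one per block column.
    Residues absent from a column get missing_score (an assumption).
    """
    width = len(block_pssm)
    best_score, best_start = float("-inf"), None
    for start in range(len(sequence) - width + 1):
        # Sum the per-position scores over the width of the block.
        score = sum(
            block_pssm[i].get(sequence[start + i], missing_score)
            for i in range(width)
        )
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start

def search_blocks(sequence, blocks):
    """Score the sequence against every block; best-scoring block first."""
    hits = []
    for name, pssm in blocks.items():
        score, start = best_block_hit(sequence, pssm)
        if start is not None:  # skip blocks wider than the sequence
            hits.append((name, score, start))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical two-block database with invented scores.
toy_blocks = {
    "toy_A": [{"G": 2, "A": -1}, {"D": 3}, {"S": 2, "A": 1}],
    "toy_B": [{"K": 2}, {"R": 2}],
}
hits = search_blocks("MAGDSKL", toy_blocks)
```

Every block is tried at every sequence offset, so the cost grows with sequence length times total block width, which is consistent with the slide's remark that the search is slow.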

TIGRFAMs: Collection of protein families represented as HMMs, built from curated multiple sequence alignments and accompanied by functional information. Equivalogs are homologous proteins whose function has been conserved since their last common ancestor (other pattern databases concentrate on related sequences, not function). Contains more than 800 non-overlapping families and can be searched by text or sequence. Provides information for automatic annotation of function and is weighted towards microbial genomes.

Text search results

Example entry

Sequence search result

PIR-ALN (search/textpiraln.html): Database of annotated protein sequence alignments derived automatically from the PIR PSD. Includes alignments at the superfamily (whole sequence), family (45% identity) and domain (found in more than one superfamily) levels. Contains 3983 alignments, 1480 superfamilies and 371 domains. Can be searched by protein accession number or text.

PROTOMAP: Automatic classification of all SWISS-PROT proteins (now also including TrEMBL) into groups of related proteins, based on pairwise similarities. Has a hierarchical organisation for sub- and super-family distinctions (clusters, 5869  2 proteins, 1403  10). Keeps the SWISS-PROT annotation, e.g. description and keywords. A query sequence can be searched against the database and classified into the existing clusters.
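As a rough illustration of grouping proteins from pairwise similarities at more than one level, the Python sketch below links any two proteins whose similarity meets a threshold and reports the connected groups. This is plain single-linkage clustering, swapped in for illustration; it is not ProtoMap's actual algorithm, and the function name, scores and thresholds are invented.

```python
def cluster_by_similarity(proteins, similarities, threshold):
    """Group proteins linked by pairwise similarity >= threshold.

    similarities maps (protein_a, protein_b) -> score. Single-linkage
    clustering via union-find; an illustrative stand-in only.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the group representative, compressing paths.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), score in similarities.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), set()).add(p)
    return sorted(groups.values(), key=lambda group: sorted(group))

# Made-up similarity scores between four hypothetical proteins.
toy_proteins = ["A", "B", "C", "D"]
toy_similarities = {("A", "B"): 0.9, ("B", "C"): 0.5, ("C", "D"): 0.85}

strict = cluster_by_similarity(toy_proteins, toy_similarities, 0.8)
relaxed = cluster_by_similarity(toy_proteins, toy_similarities, 0.4)
```

With the strict threshold the toy data splits into two tight groups; relaxing the threshold merges them into one larger group, which is the flavour of sub-family versus super-family distinction the slide refers to.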

DOMO (page+LibInfo+-lib+DOMO, via SRS): Database of gapped multiple sequence alignments from SWISS-PROT and PIR. Domain boundaries are inferred automatically rather than from 3D data. Has 8877 alignments, domains, and repeats. Each entry is one homologous domain, with annotation on related proteins, functional families, evolutionary tree, etc.

ProClass: Non-redundant protein database organized by family relationships defined by ProSite patterns and PIR superfamilies. Facilitates retrieval of protein family information and of domain and family relationships, and classifies multi-domain proteins. Contains 155,868 sequence entries.

SBASE (Agricultural Biotechnology Centre): Protein domain library built by clustering functional and structural domains. SBASE entries are grouped by standard names (SN groups) that designate the various functional and structural domains of protein sequences, so the database relies on good annotation of domains. It also detects subclasses. Similarity searches can be run with BLAST or PSI-BLAST.

Integrating pattern databases: MetaFam, IProClass, CDD, InterPro

METAFAM: Protein family classification built from Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, Prosite, ProDom, SBASE and SYSTERS. Supersets of overlapping families are created automatically, using set theory to compare the databases, with reference domains covering the total area. Uses a non-redundant protein set from SPTR and PIR.
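The idea of building supersets from overlapping families can be sketched with ordinary Python sets. The slide does not state MetaFam's actual overlap criterion, so this sketch assumes the simplest one (two families are merged if they share at least one protein); the function name, family names and protein identifiers are all hypothetical.

```python
def merge_overlapping_families(families):
    """Merge families that share at least one protein into supersets.

    families maps a family name -> set of protein IDs. Returns a list of
    (family_names, protein_ids) pairs. The "share any protein" rule is an
    assumption, not MetaFam's documented criterion.
    """
    supersets = []
    for name, members in families.items():
        merged_names, merged_members = {name}, set(members)
        untouched = []
        for names, proteins in supersets:
            if proteins & merged_members:
                # Non-empty intersection: absorb the existing superset.
                merged_names |= names
                merged_members |= proteins
            else:
                untouched.append((names, proteins))
        supersets = untouched + [(merged_names, merged_members)]
    return supersets

# Invented families from four source databases.
toy_families = {
    "Pfam:toy1": {"P1", "P2", "P3"},
    "PRINTS:toy1": {"P3", "P4"},
    "ProDom:toy1": {"P9"},
    "Blocks:toy1": {"P4", "P5"},
}
supersets = merge_overlapping_families(toy_families)
largest = max(supersets, key=lambda s: len(s[1]))
```

In the toy data the Pfam, PRINTS and Blocks families chain together through shared proteins and end up in one superset, while the ProDom family stays on its own.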

IProClass: Integrated database linking ProClass, PIR-ALN, Prosite, Pfam and Blocks. Contains >20000 non-redundant SP & PIR proteins, superfamilies, 2600 domains, 1300 motifs and 280 PTMs. Can be searched by text or sequence.

CDD (Conserved Domain Database): Database of domains derived from SMART, Pfam and contributions from NCBI (LOAD). Uses reverse position-specific BLAST (RPS-BLAST), which compares the query against a database of position-specific matrices. Links to proteins in Entrez and to 3D structures. A stand-alone version of RPS-BLAST is available at: ftp://ncbi.nlm.nih.gov/toolbox

CDD homepage

CDD Search result

DART

CDD example entry

PIR link from CDD

INTERPRO: Integration of different signature recognition methods (PROSITE, PRINTS, PFAM, ProDom and SMART)

InterPro release 3: Built from PROSITE, PRINTS, Pfam, ProDom, SMART, SWISS-PROT and TrEMBL. Contains 3915 entries encoded by 7714 different regular expressions, profiles, fingerprints, Hidden Markov Models and ProDom domains. Provides more than 1 million InterPro matches against SWISS-PROT + TrEMBL protein sequences (68% coverage). Offers direct access to the underlying Oracle database. An XML flatfile is available at ftp://ftp.ebi.ac.uk/pub/databases/interpro/. There is also an SRS implementation, and both text- and sequence-based searches are supported.

InterProScan: PROSITE patterns: ppsearch; PROSITE profiles: pfscan; PFAM HMMs: hmmpfam; PRINTS fingerprints: fpscan; ProDom; SMART; eMotif-derived PROSITE patterns; TMHMM; SignalP

PRINTS detailed results ANX3_MOUSE Annexin type III

SUMMARY: There are many different protein signature databases, ranging from small patterns to alignments to complex HMMs. They have different strengths and weaknesses and use different database formats. It is therefore best to combine methods, preferably in a database where they are already merged, for simple analysis with a consistent format.

Protein Secondary Structure: CATH (Class, Architecture, Topology, Homology). SCOP (Structural Classification of Proteins): hierarchical database of protein folds. FSSP: fold classification using structure-structure alignment of proteins. TOPS: cartoon representation of topology showing helices and strands.