Protein World SARA 12-12-2002 Amsterdam Tim Hulsen.


Genome sequencing
Since 1995: sequencing of complete ‘genomes’ (DNA), i.e. the order of the bases A/C/G/T:
ACGTCATCGTAGCTAGCTAGTCGTACGTATG
TGCAGTAGCATCGATCGATCAGCATGCATAC
At this moment more than 80 genomes have been sequenced and published, from all kinds of organisms:
–Animals
–Plants
–Fungi
–Bacteria

Genomes  Proteins ‘Transcription’ and ‘translation’ of specific regions of the genome leads to proteins, consisting of twenty types of ‘amino acids’: ATG ACG CTG AGC TGC GGA CGT TGA -> TLSCGR Proteins are responsible for all kinds of life processes All the proteins that can be produced in an organism together are called the ‘proteome’ Sequence comparisons make possible the classification of proteins

Protein families
E.g. the GPCR family
Sequence comparison helps in predicting the function of new proteins

Determining protein functions
The function of 40-50% of the new proteins is unknown
Understanding of protein functions and relationships is important for:
–Study of fundamental biological processes
–Drug design
–Genetic engineering

Sequence comparison
Smith-Waterman dynamic programming algorithm (1981): calculates the similarity/distance between two sequences by local alignment:
Query   ---PLIT-LETRESV-
Subject NEQPKVTMLETRQTAD
(similar residues were shown in bold on the original slide)
Results in an SW-score that measures how similar the two sequences are to each other
Disadvantage: the score depends on sequence length
After the alignments, the proteins are ‘clustered’ (divided into families) according to their similarity
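The dynamic programming idea behind the SW-score can be sketched as follows. This is a simplified illustration, not the scoring used in Protein World: it assumes a flat match/mismatch score and a linear gap penalty, whereas real protein comparisons normally use a substitution matrix and affine gap penalties. The function name and the default parameter values are assumptions chosen for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b.

    Simplified scoring (assumption): flat match/mismatch values and a
    linear gap penalty instead of a substitution matrix.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    for ca in a:
        curr = [0]  # first column is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The two sequences from the slide, with gap characters removed.
print(smith_waterman_score("PLITLETRESV", "NEQPKVTMLETRQTAD"))
```

Because the score is a sum over aligned positions, longer sequences can reach higher scores simply by being longer, which is the length dependence noted on the slide.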

Existing databases
Domain-based clusterings: Prosite, Pfam, ProDom, Prints, Domo, Blocks
Protein-based clusterings: ProtoMap, COGs, Systers, PIR, ClusTr
Structural classifications: SCOP, CATH, FSSP
Why should there be another database?

Another method
Enhanced Smith-Waterman algorithm: Monte Carlo evaluation (Lipman et al., 1984)
How likely is it that two sequences are similar but not related?
One of the two sequences is randomized and the SW-score is recalculated (200 times)
Randomization leads to sequences with the same length and the same composition, but a different order
The method leads to calculation of the Z-value:
Z(A,B) = (S(A,B) - µ) / σ
where S(A,B) is the SW-score of the two original sequences, and µ and σ are the mean and standard deviation of the scores obtained with the randomized sequences
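The Monte Carlo procedure can be sketched in Python as below. It is an illustration under stated assumptions rather than the Protein World implementation: `local_score` is a toy local alignment score with flat match/mismatch/gap values, the second sequence is the one that gets shuffled, σ is computed as the population standard deviation, and a Z-value of 0.0 is returned when all shuffled scores are identical. All names are hypothetical.

```python
import random
import statistics
from typing import Callable, Optional


def local_score(a: str, b: str) -> int:
    """Toy local alignment score (assumed: match +2, mismatch -1, gap -2)."""
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            cell = max(0,
                       prev[j - 1] + (2 if ca == cb else -1),
                       prev[j] - 2,
                       curr[j - 1] - 2)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def z_value(a: str, b: str,
            score: Callable[[str, str], float] = local_score,
            n_shuffles: int = 200,
            rng: Optional[random.Random] = None) -> float:
    """Z(A,B) = (S(A,B) - mu) / sigma over scores against shuffled copies of b."""
    rng = rng or random.Random()
    observed = score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same length and composition, different order
        shuffled_scores.append(score(a, "".join(letters)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:  # all shuffles scored the same; Z is undefined
        return 0.0
    return (observed - mu) / sigma


z = z_value("PLITLETRESV", "NEQPKVTMLETRQTAD", rng=random.Random(1))
print(f"Z = {z:.2f}")
```

Passing a seeded `random.Random` makes a run repeatable; without a seed, the Z-value varies slightly from run to run because it is estimated from a finite number of shuffles.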

Advantages
The obtained Z-value is a very reliable measure of sequence similarity compared to the SW-score:
–The SW-score depends on sequence length, the Z-value does not
–Amino acid bias does not affect the Z-value
Independent of the database size
Easier updating of the database, without a total recalculation

Disadvantage
LOTS of calculation time needed, especially when all proteins in all proteomes are compared to each other (“all-against-all”)!
→ SARA

SARA calculation
Proteomes of 82 organisms compared ‘all-against-all’ using the Monte Carlo algorithm: more than 400,000 proteins!
21,600 CPU days (~520,000 CPU hours) = 21,600 PCs running in parallel for 24 hours, or 1 PC running for ~60 years
Using the supercomputer TERAS (1024-CPU SGI Origin 3800) at SARA: less than two months!
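The figures on this slide can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the last value is a lower bound that assumes all 1024 CPUs are dedicated to the job with perfect scaling, which is an assumption and not something the slide states.

```python
cpu_days = 21_600    # total compute quoted on the slide
teras_cpus = 1024    # CPUs in the TERAS machine

cpu_hours = cpu_days * 24              # 518,400, i.e. roughly 520,000
single_pc_years = cpu_days / 365       # a little over 59 years on one PC
ideal_teras_days = cpu_days / teras_cpus  # about 21 days under ideal scaling (assumption)

print(f"CPU hours:           {cpu_hours:,}")
print(f"One PC, years:       {single_pc_years:.1f}")
print(f"TERAS, days (ideal): {ideal_teras_days:.1f}")
```

The ideal figure of roughly three weeks is consistent with the reported "less than two months", since a shared supercomputer rarely gives one job every CPU for the whole run.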

Parties involved
Gene-IT (Paris, France)
SARA (Amsterdam, the Netherlands)
CMBI (Nijmegen, the Netherlands)
Organon (Oss, the Netherlands)
EBI (Hinxton, UK)

Supporting parties
Financed by NCF, a foundation in support of supercomputing
Under the auspices of BioASP, the new Dutch knowledge and service center for Bioinformatics

Results available through BioASP
Log in and click on the links ‘Research’ and ‘Protein World’
Screenshots on the original slides: organism selection screen, results screen, alignment screen

Conclusions
Currently the most comprehensive and most accurate data set of protein comparisons
A start for a maintainable and unique database of all proteins currently known
A rich data source for clustering, data mining and orthology determination
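As an illustration of how such a data set could feed a clustering step, the sketch below groups proteins by single-linkage clustering: two proteins end up in the same family when they are connected by a chain of pairwise scores at or above a threshold. The presentation does not say which clustering method is used, so single linkage, the threshold value, and the protein identifiers and scores are all assumptions made for this example.

```python
def single_linkage_clusters(pair_scores, threshold):
    """Group items connected by pairwise scores >= threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        root_a, root_b = find(a), find(b)  # also registers both items
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical Z-values between hypothetical proteins p1..p5.
example_scores = {
    ("p1", "p2"): 12.0,
    ("p2", "p3"): 9.5,
    ("p3", "p4"): 2.0,
    ("p4", "p5"): 15.0,
}
print(single_linkage_clusters(example_scores, threshold=8.0))
# [['p1', 'p2', 'p3'], ['p4', 'p5']]
```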

Orthology determination
Orthologs: genes/proteins in different species that derive from a common ancestor
Orthologs often have the same function
Interesting! Information from other species could help in annotating a protein
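One widely used heuristic for proposing orthologs from all-against-all comparison scores is the reciprocal (bidirectional) best hit: a pair of proteins from two species is kept when each is the other's highest-scoring match. The slides do not state which method is applied to Protein World, so the sketch below only illustrates that general heuristic; the function name, the protein identifiers and the scores are hypothetical.

```python
def reciprocal_best_hits(scores: dict[tuple[str, str], float]) -> list[tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit.

    `scores` maps (protein in species A, protein in species B) to a
    similarity score such as a Z-value.
    """
    best_for_a: dict[str, tuple[str, float]] = {}
    best_for_b: dict[str, tuple[str, float]] = {}
    for (a, b), s in scores.items():
        if a not in best_for_a or s > best_for_a[a][1]:
            best_for_a[a] = (b, s)
        if b not in best_for_b or s > best_for_b[b][1]:
            best_for_b[b] = (a, s)
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Hypothetical scores between two proteins of species A and two of species B.
example = {
    ("A1", "B1"): 50.0,
    ("A1", "B2"): 10.0,
    ("A2", "B1"): 30.0,
    ("A2", "B2"): 8.0,
}
print(reciprocal_best_hits(example))  # [('A1', 'B1')]
```

In this toy data A2's best hit is B1, but B1's best hit is A1, so only the pair (A1, B1) is reported as a candidate ortholog pair.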

Thank you for your attention
Any questions?