Large scale protein sequence clustering
Prof. Dr. Antje Krause
Bioinformatics, Wildau University of Applied Sciences
Antje.Krause@tfh-wildau.de

Antje Krause, Poznań, 14.07.2006

Abstract
The concept of protein superfamilies, families and domains is one of the oldest in computational biology. Back in the 1960s, when the first protein sequence database was published in print, Margaret Dayhoff defined the basic principles of this discipline with only a small number of sequences at hand. Nowadays, with more than a million protein sequences available in public databases, a constantly growing number of uncharacterized proteins from completely sequenced genomes and still a comparatively small number of known protein structures, a systematic grouping and characterization of this data is needed more than ever. This tutorial reviews the different approaches developed during the last decades and points out possible challenges waiting in the future.

Margaret O. Dayhoff
“Dr. Margaret Oakley Dayhoff (1925-1983) was a pioneer in the use of computers in chemistry and biology, beginning with her PhD thesis project in 1948. Her work was multi-disciplinary, and used her knowledge of chemistry, mathematics, biology and computer science to develop an entirely new field. She is credited today as a founder of the field of Bioinformatics. This field is defined as the use of computers in solving information problems in the life sciences, mainly involving the creation of extensive electronic databases on protein sequences and genomes. Dr. Dayhoff was the first woman in the field of Bioinformatics.”
http://www.dayhoff.cc/

Margaret O. Dayhoff
Goal: deduce evolutionary connections of the biological kingdoms, phyla, and other taxa from sequence evidence
Collection of all known protein sequences, made available to others in 1965 in a small book
The book contained sequence information on 65 proteins
Several releases followed
This work resulted in the Protein Information Resource (PIR)


Protein sequences
>O54090|O54090_SULAC Hypothetical protein (Fragment).
MKILDYSDLVFFRKLTNKMRDPKTRFDVREFINRGEDYLFNYTNKNVGGVDERRRKFLKS
LIFGMAA
>P70723|P70723_ACIAM Orf-2 (Fragment).
MSKNSLDNLGEKALELLKKYPLCDSCLGRCFAKLGYRFANKERGKAIKTYLVLELDRKIK
DHELEDLNEIKEILFNMGKEYLEYLIYLSNEKFQERT
>sptrembl|Q9V2V9|Q9V2V9_PYRAE Rieske iron sulfur protein (ParR).
MVDENRRNTLKIFLGTTAALGAGMLATPLVASVIGSKAGYIKPEPSGAIPVEICKDVDSC
PKDYGVSLDELRNGPVFKLLKVNTMAIPAVFGIVRAKDGKEYPVAYVAICTHFGCPVNVS
GGKYLIGFNCPCHGSIFAICNDPNGCPDYNAAFLEMYVSGGPAPRSLRAIKVAVKDGVVY
PLVAYI
>O93973|O93973_MALSM Allergen.
MSNVIKKVFNTDKAEAEGSKVADAPQEAGHKGEGFLHDAKDRLQGFAGHGHHNAQNAASG
VAGSAGAGGAPSVPSANVDVTNPVNDASVQGGVEAPRSWSTQLPQSQSVADTTGATSAGR
NNLTQTTSTGSGVNVAAGNVDQDVQHLAPVTRHVHHRHEIEELLREREHHIHQHHIQHHV
QPVVDSEHLAEQIHSRVVPQTTVREVHANTDKDAALMRAVAGNPKDTFTQAAIDRSVIDK
GETVREIVHHHIHNIVQPIIEKETHEYHRIRTTIPTTHITHEAPIVHESTAHQPIRKEDF
LKGGGVLTSTTRSIEEVGLLNLGNNQRTVEGETYTGGLPLSQ
>Q02039|Q02039_RHYSE NIP1 precursor (NIP1 avirulence protein precursor).
MKFLVLPLSLAFLQIGLVFSTPDRCRYTLCCDGALKAVSACLHESESCLVPGDCCRGKSR
LTLCSYGEGGNGFQCPTGYRQC
>Q873M4|Q873M4_MALSM Manganese superoxide dismutase (Fragment).
PFYPIPSALPFPLPIHSLFSRRTRLFRFSRTAARAGTEHTLPPLPYEYNALEPFISADIM
MVHHGKHHQTYVNNLNASTKAYNDAVQAQDVLKQMELLTAVKFNGGGHVNHALFWKTMAP
QSQGGGQLNDGPLKQAIDKEFGDFEKFKAAFTAKALGIQGSGWCWLGLSKTGSLDLVVAK
DQDTLTTHHPIIGWDGWEHAWYLQYKNDKASYLKQWWNVVNWSEAESRYSEGLKASL
>Q2V2P9|Q2V2P9_YEAST Protein YDR119W-A.
MFFSQVLRSSARAAPIKRYTGGRIGESWVITEGRRLIPEIFQWSAVLSVCLGWPGAVYFF
SKARKA

What can a sequence tell us?
MVDENRRNTLKIFLGTTAALGAGMLATPLVASVIGSKAGYIKPEPSGAIPVEICKDVDSC
PKDYGVSLDELRNGPVFKLLKVNTMAIPAVFGIVRAKDGKEYPVAYVAICTHFGCPVNVS
GGKYLIGFNCPCHGSIFAICNDPNGCPDYNAAFLEMYVSGGPAPRSLRAIKVAVKDGVVY
PLVAYI
Structure? Function? Evolutionary history? Interactions? Diseases? Development? Cellular location? Tissue? Regulation?
[Illustration © David S. Goodsell 1999]

Protein structures
Prediction of protein structure is still not possible from sequence alone
Not all mechanisms of protein folding are known
Experimental protein structure determination
– is time consuming
– is very expensive
– is not always possible (the protein must be crystallized)
– results in only one conformation
– does not show flexible regions
– does not show the protein in its natural environment
– works mainly for globular proteins (it is difficult for transmembrane proteins)

Different categories of protein databases
Protein sequence databases:
– Information about single proteins
Protein structure databases:
– Information about single proteins
Protein domain databases:
– Information about functional domains
Protein (sequence) family databases:
– Information about groups of evolutionarily and functionally related proteins
Protein (structure) family databases:
– Information about structural elements
Gene family databases:
– Information about groups of evolutionarily and functionally related proteins or genes, mainly of completely sequenced species

Protein sequence databases
UniProt = Universal Protein Resource
Integration of Swiss-Prot/TrEMBL and PIR
http://www.expasy.uniprot.org
Central repository of protein sequence and function
Maintained by
– European Bioinformatics Institute
– Swiss Institute of Bioinformatics
– Georgetown University

Protein sequence databases (Swiss-Prot, TrEMBL)
They contain experimentally verified entries ...
... and translated entries from DNA databases, namely EMBL
– predicted proteins
– hypothetical proteins
– putative proteins
Problem in the past: no clear difference between experimentally verified entries/annotation and predicted entries/annotation

Protein sequence databases (Swiss-Prot/TrEMBL) → now UniProt!
ExPASy (http://www.expasy.ch): Expert Protein Analysis System
SIB (http://www.isb-sib.ch): Swiss Institute of Bioinformatics, Geneva, CH
Swiss-Prot (http://www.expasy.ch/sprot): Manually curated protein sequence database
TrEMBL (translated EMBL): Computer-annotated supplement to Swiss-Prot; contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot

Protein sequence databases (PIR-PSD)
NBRF (http://pir.georgetown.edu/nbrf): National Biomedical Research Foundation, Georgetown, Washington DC, USA
JIPID: Japan International Protein Information Database
MIPS (http://mips.gsf.de): Munich Information Center for Protein Sequences, GSF, Neuherberg, Munich
PIR (http://pir.georgetown.edu): Protein Information Resource, a collaboration of NBRF, JIPID and MIPS
PSD (http://pir.georgetown.edu/pirwww/search/textpsd.shtml): Protein Sequence Database
First published in the Atlas of Protein Sequence and Structure (1965-1978), the first systematic collection of protein sequences, generated by Margaret Dayhoff


Patterns: from alignment to search
Multiple sequence alignment
Pattern construction, e.g. [SN]-P-x-[LV]-x(2)-H-A-x(3)-F.
Pattern search

Patterns
Standard IUPAC one-letter codes are used for amino acids
The symbol 'x' marks a position where any amino acid is accepted
Ambiguities are indicated by listing the acceptable amino acids in square brackets '[ ]'
Ambiguities can also be indicated by listing the amino acids that are not accepted in curly brackets '{ }'
Elements are separated by '-'
Repetition of an element is indicated by a numerical value or a numerical range in parentheses following that element
Restriction of the pattern to the N- or C-terminus of a sequence is indicated by starting the pattern with a '<' symbol or ending it with a '>' symbol, respectively
A period ends the pattern
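The syntax rules above map almost one-to-one onto regular expressions. The following is a minimal sketch, not the PROSITE reference implementation: the function name `prosite_to_regex` and the `ELEMENT` helper are made up for this illustration, and only the rules listed on this slide are handled (rarer syntax is rejected with an error).

```python
import re

# One pattern element: optional '<', a core (x, a residue, [..] or {..}),
# an optional repetition "(n)" or "(n,m)", and an optional '>'.
ELEMENT = re.compile(
    r"^(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)$"
)

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":                      # any amino acid
            core = "."
        elif core.startswith("{"):           # excluded amino acids
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # repetition count or range
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)

zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
leucine_zipper = prosite_to_regex("L-x(6)-L-x(6)-L-x(6)-L.")
```

A converted pattern can then be used with `re.search(leucine_zipper, sequence)` to scan a protein sequence given as a string of one-letter codes.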

Example: leucine zipper (coiled-coil)
L-x(6)-L-x(6)-L-x(6)-L.
PROSITE Entry PDOC00029

Example: C2H2 zinc finger
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
PROSITE Entry PDOC00028
[Schematic: a zinc ion (Zn) held by two cysteines (C) and two histidines (H), with the remaining positions (x) forming the loop]

Pattern
Advantages:
– easy and intuitive definition
– simple to use in automated processing
Disadvantages:
– yes/no decisions: proteins not complying with a certain pattern will never be found although they may contain the domain
– needs a multiple alignment

Rule
Advantages:
– easy and intuitive notation
– simple to use in manual processing
– able to model long range dependencies
Disadvantages:
– difficult to use in automated processing

Profile
Position specific scoring/weight matrix with N columns and 20+ rows
N is the number of columns in a multiple alignment = length of the multiple alignment = length of the profile
Each row holds the information about one amino acid (IUPAC code), about gap penalties or other properties

Scoring matrices (e.g. BLOSUM62)

Average score method to calculate a profile
Multiple sequence alignment with N = 10 columns and Z = 23 rows
Profile with N = 10 columns (k) and 20 + 1 rows (j)
$C_{ik}$: quantity of amino acid i in column k
$S_{ij}$: score of amino acid i and amino acid j in the scoring matrix (e.g. BLOSUM62)
Example for row L in column 1, where the alignment column holds 4 × V and 19 × I:
$M_{L,1} = (C_{V,1} / Z) \cdot S_{V,L} + (C_{I,1} / Z) \cdot S_{I,L} = (4 / 23) \cdot 1 + (19 / 23) \cdot 2 = 1.83$
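The worked example above can be reproduced in a few lines. This is a sketch under stated assumptions: the function name `profile_score` is hypothetical, and `BLOSUM62_EXCERPT` contains only the two matrix entries that appear in the slide's calculation (V/L = 1 and I/L = 2), not the full matrix.

```python
from collections import Counter

# Only the two BLOSUM62 entries needed for the slide's example.
BLOSUM62_EXCERPT = {("V", "L"): 1, ("I", "L"): 2}

def profile_score(column, residue, matrix):
    """Average score of `residue` against one alignment column.

    Implements M_jk = sum over i of (C_ik / Z) * S_ij, where the column
    holds Z residues and C_ik counts amino acid i in that column.
    """
    counts = Counter(column)
    z = len(column)
    return sum(n / z * matrix[(aa, residue)] for aa, n in counts.items())

# Column 1 of the example alignment: 4 x V and 19 x I (Z = 23 rows).
column_1 = "V" * 4 + "I" * 19
m_l1 = profile_score(column_1, "L", BLOSUM62_EXCERPT)
print(round(m_l1, 2))
```

Repeating this for all 20 amino acids and all N columns fills the complete profile matrix.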

Profile
Advantages:
– captures the degree of conservation at each position in a multiple alignment
– statistical method
– simple to use in automated processing
Disadvantages:
– difficult to use in manual processing
– no formal statistical basis
– needs a multiple alignment

Hidden Markov Model (HMM)
Statistical model where the system being modelled is assumed to be a Markov process (stochastic process)
The probability of being in a state depends only on the previous state
In a regular Markov model, the states are directly visible to the observer, and therefore the state transition probabilities are the only parameters
An HMM adds outputs: each state has a probability distribution over the possible output tokens, and only these outputs are observed
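To make the definition concrete, here is a small sketch of the forward algorithm, which computes the probability of an observed token sequence by summing over all hidden state paths. The two-state model (`conserved`, `variable`), its probabilities and the two-letter alphabet are invented toy values for illustration only; they are not taken from the slides or from any real profile HMM.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Probability of `observations`, summed over all hidden state paths."""
    # alpha[s] = probability of the prefix seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Toy model (hypothetical numbers): a conserved state that mostly emits
# leucine "L" and a variable state that mostly emits "x" (anything else).
STATES = ("conserved", "variable")
START = {"conserved": 0.5, "variable": 0.5}
TRANS = {
    "conserved": {"conserved": 0.8, "variable": 0.2},
    "variable": {"conserved": 0.3, "variable": 0.7},
}
EMIT = {
    "conserved": {"L": 0.9, "x": 0.1},
    "variable": {"L": 0.2, "x": 0.8},
}

p_lll = forward("LLL", STATES, START, TRANS, EMIT)
```

Because the transition and emission rows each sum to 1, the probabilities of all possible sequences of a fixed length add up to 1, which is a convenient sanity check for such a model.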

Profile HMM architecture
From the HMMER User Guide, http://hmmer.wustl.edu
[Figure: profile HMM with the states Start, N-terminal unaligned sequence, Begin, Match, Insertion, Delete, End, C-terminal unaligned sequence, Terminal, and Joining segment of unaligned sequences]

Profile HMM
Advantages:
– same as for profile
– statistical method with well established formal probabilistic basis
– can use unaligned sequences
Disadvantages:
– no manual processing
– needs a higher number of sequences to give a satisfactory result

Domain databases
Describe functional regions of proteins (called domains, motifs, signatures...)
A protein may consist of several and/or different domains (multi-domain protein)
Domains can be described with
– patterns (regular expressions)
– rules
– profiles
– Hidden Markov Models

Domain databases
Sanger Institute (http://www.sanger.ac.uk): The Wellcome Trust Sanger Institute, Hinxton, GB
Pfam (http://www.sanger.ac.uk/Software/Pfam)
– Protein FAMilies database of alignments and HMMs
– sequences from Swiss-Prot and TrEMBL
Prosite (http://www.expasy.ch/prosite)
– database of protein families and domains
– consists of biologically significant sites, patterns and profiles
– sequences from Swiss-Prot and TrEMBL
– manual annotation made by experts

InterPro
Integrated Resources of Protein Families, Domains and Functional Sites
Collaboration of Pfam, PROSITE, PRINTS, ProDom, SMART and TIGR
Used for automatic annotation of entries in TrEMBL

Domain databases
FHCRC: Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
BLOCKS (http://www.blocks.fhcrc.org)
– multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
– automatically derived from InterPro
– originally developed for the creation of scoring matrices (substitution matrices) → BLOSUM62 (BLOcks SUbstitution Matrix)

Domain databases
SMART (http://smart.embl-heidelberg.de/): Simple Modular Architecture Research Tool
Identification and annotation of genetically mobile domains and the analysis of domain architectures
Signalling, extracellular and chromatin-associated proteins

[Figure; labels: intron, length, frame]

Domain and family databases
Suppose we have n homologous protein sequences
What do they have in common?
What are the functional regions of these proteins?
Which regions are conserved, which are not conserved?
How can we characterize these proteins/their functional domains?
What distinguishes these proteins/their functional domains from others?

Similarity: Expressed in score, E-value, % sequence identity, etc.
Homology: Relationship due to common ancestry
Orthology: Genes in the genomes of different species with a common ancestor (resulting from a speciation event)
Paralogy: Genes in the same genome with a common ancestor (resulting from a duplication event)

But! Similarity (mathematics) ≠ Homology (biology)
Similarity is a good indicator for homology
Normally we deduce homology from significant sequence similarity
But we cannot deduce sequence similarity from homology!
Thus we also cannot deduce non-homology from a lack of sequence similarity!

Database Search

Transitivity
Use of intermediate sequences to derive knowledge about homology
If the proteins A and B are homologous and the proteins B and C are homologous, then A and C are homologous, too
This holds even if there is no sequence similarity detectable between A and C!

Transitivity?
... may be limited to domains!
... but often it's difficult to define domain boundaries!

Database Search
[Figure: search results at the cutoffs 1e-30, 1e-20 and 1e-10]

Sequence Clustering: Goals
Biologically meaningful partitioning of the data:
– Functional annotation
– Gain of information
– Reduction of the search space
– Selection of prototypic or representative sequences
– Phylogenetic analyses
– Protein prediction
– etc.

Protein Families
“Protein superfamily” (Dayhoff, 1974): Group of evolutionarily related proteins
Hierarchy of homology domains, families, and superfamilies (Barker, 1996)
Manual classification based on sequence similarity
Most current proteins are thought to be the descendants of no more than 1,000 (structural) ancestors (Chothia, 1994)
But no “definition”!

Protein Families
Following M. Dayhoff we can think of a
Protein superfamily as a group of proteins
– sharing domains
– being evolutionarily related
– showing weak sequence similarity
Protein family as a group of proteins
– being (closely) evolutionarily related
– (showing at least 50% sequence similarity)
Homeomorphic protein family as a group of proteins
– having the same domains in the same order

Single Linkage Clustering and Single Linkage Hierarchy
[Figure: single linkage hierarchy cut at different cutoffs/thresholds; labels: weak, conservative, stringent]

Test data set
Starting with 171,191 redundant sequences from Swiss-Prot
After all-against-all BLAST database searches: 19,407,137 pairwise values
After excluding 27,305 fragments (being 90% identical to another sequence over 95% of their sequence length): 13,083,209 pairwise values
Reminder: 171,191 sequences → 14,653,093,645 possible pairwise values!
Only 0.132% of the sequence pairs result in an E-value < 10!
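The numbers on this slide follow from simple counting, since n sequences give n(n-1)/2 unordered pairs. A short check, with variable names chosen only for this illustration:

```python
n_sequences = 171_191
n_fragments = 27_305
n_blast_pairs = 19_407_137  # pairs reported by the all-against-all search

# Number of unordered sequence pairs: n * (n - 1) / 2
n_possible_pairs = n_sequences * (n_sequences - 1) // 2

# Share of all possible pairs that produced a BLAST hit at all
hit_fraction_percent = 100 * n_blast_pairs / n_possible_pairs

# Sequences left after removing the fragments
n_non_redundant = n_sequences - n_fragments

print(n_possible_pairs, round(hit_fraction_percent, 3), n_non_redundant)
```

The quadratic growth of the pair count is the reason why the sparsity of the hit list (well under 1% of all pairs) matters for clustering at this scale.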

143,886 non-redundant sequences and 13,083,209 pairwise values

[Figure panels: 10% sequence overlap, 50% sequence overlap, 75% sequence overlap, 90% sequence overlap]

Observations
When doing single-linkage clustering with this data, we can vary the cutoff on any of the pairwise BLAST results, i.e., E-value, % identity, length of local alignment, alignment length as % of sequence length, score, and all combinations!
With a choice of at least 50% identity we are on the safe side (this was Margaret Dayhoff’s original value for a protein family!)
Unfortunately (but no surprise) nature does not behave in cutoffs → there are highly conserved protein families (e.g., histones) and fast evolving protein families (e.g., immunoglobulins)
Every protein family needs its own cutoff
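Single-linkage clustering with a static cutoff can be sketched with a union-find structure: every pair that passes the cutoff merges the two clusters it touches. The sequence names and E-values below are invented toy data, and `single_linkage` is a hypothetical helper, not code from SYSTERS or any other tool named in this talk.

```python
def single_linkage(sequences, hits, cutoff):
    """Cluster sequences; a hit with evalue <= cutoff links two clusters."""
    parent = {s: s for s in sequences}

    def find(s):
        # Follow parent links to the cluster representative (path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b, evalue in hits:
        if evalue <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for s in sequences:
        clusters.setdefault(find(s), set()).add(s)
    return sorted(clusters.values(), key=sorted)

# Toy pairwise results: (sequence, sequence, E-value), all hypothetical.
SEQUENCES = ["A", "B", "C", "D", "E", "F"]
HITS = [("A", "B", 1e-40), ("B", "C", 1e-25), ("C", "D", 1e-12), ("E", "F", 1e-35)]

for cutoff in (1e-30, 1e-20, 1e-10):
    print(cutoff, single_linkage(SEQUENCES, HITS, cutoff))
```

With the weakest cutoff, A and D end up in one cluster although no direct hit connects them, which is the transitivity effect described earlier; it also shows how strongly the result depends on the chosen cutoff.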

SYSTERS (SYSTEmatic Re-Searching)
Single linkage hierarchy → Superfamilies
Superfamily distance graph → Family clusters
Superfamilies as well as family clusters are derived from the structure generated by the data itself → no need for a user-defined static cutoff

[Figure labels: 4/1, 13/5, 1/18, 2/19, 15/21, 1/36, 211,975/37, 259/212,012]

Algorithm 1: Superfamily determination
Input: Tree T = (V, E) with n leaves (sequences)
Output: Superfamilies
for all leaves l_i ∈ V, i ∈ {1, ..., n} do
    q ← l_i
    I ← 0
    sf_i ← l_i
    while (q ≠ T_root) do
        p ← parent(q)
        J ← (subtreesize(p) - subtreesize(q)) / subtreesize(q)
        if (J > I) then
            I ← J
            sf_i ← q
        end if
        q ← p
    end while
end for
Resolve inclusions by keeping the largest superfamilies
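A direct transcription of Algorithm 1 into Python might look as follows. This is a sketch under assumptions the slide does not settle: the tree is given as a `children` dictionary, `subtreesize` is taken to be the number of leaves below a node, and the toy tree is invented for illustration.

```python
def leaves_below(children, node):
    """Set of leaves (sequences) in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    result = set()
    for kid in kids:
        result |= leaves_below(children, kid)
    return result

def superfamilies(children, root):
    parent = {c: p for p, kids in children.items() for c in kids}
    nodes = set(parent) | {root}
    members = {n: frozenset(leaves_below(children, n)) for n in nodes}

    chosen = set()
    for leaf in (n for n in nodes if n not in children):
        q, best_jump, sf = leaf, 0.0, leaf
        while q != root:
            p = parent[q]
            # Relative growth when the subtree of q is merged into p
            jump = (len(members[p]) - len(members[q])) / len(members[q])
            if jump > best_jump:
                best_jump, sf = jump, q
            q = p
        chosen.add(sf)

    # Resolve inclusions: drop a candidate contained in a larger one.
    kept = []
    for n in sorted(chosen, key=lambda n: len(members[n]), reverse=True):
        if not any(members[n] <= members[m] for m in kept):
            kept.append(n)
    return [set(members[n]) for n in kept]

# Hypothetical single linkage tree with seven sequences a..g.
TREE = {
    "root": ["P1", "P2", "T"],
    "P1": ["a", "b"],
    "P2": ["c", "d"],
    "T": ["P3", "g"],
    "P3": ["e", "f"],
}
families = superfamilies(TREE, "root")
```

In this toy tree, sequence g first selects itself (joining the pair e, f triples its subtree), but that singleton is contained in the larger candidate chosen by e and f and is therefore absorbed in the final step.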

About 300,000 non-redundant sequences
456 superfamilies with cutoff < 1e-180
64,282 superfamilies in 40,288 separate trees

Stop criterion: $x > |V| / 2$, with minimal cut C
[Figure: two weighted example graphs on the vertices A to G, each with $|V| = 7$ and $|E| = 15$; the individual edge weights are not recoverable]
Example 1: $x = 15 \cdot (6 / 42) = 2.14 < 7 / 2$ → split graph, process subgraphs
Example 2: $x = 15 \cdot (6 / 25) = 3.6 > 7 / 2$ → output graph

Algorithm 2: Highly Connected Subcluster (Hartuv & Shamir, 1999)
Input: Connected graph G = (V, E)
Output: Cluster graphs
(H_1, H_2, C) ← mincut(G)
unweighted: $x \leftarrow |C|$
weighted: $x \leftarrow |E| \cdot \left( \sum_{i \in C} w(i) / \sum_{j \in E} w(j) \right)$
if (x > (|V| / 2)) then
    output G
else
    HCS(H_1), or weighted_HCS(H_1) in the weighted case
    HCS(H_2), or weighted_HCS(H_2) in the weighted case
end if
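The recursion of Algorithm 2 can be sketched as below. Two things are assumptions of this illustration and not part of the slides: the minimum cut is found by brute force over all bipartitions (exponential, so only usable on tiny graphs; a real implementation would use a proper minimum cut algorithm), and the example graph is an invented one that merely matches the slide's first calculation ($|V| = 7$, $|E| = 15$, total weight 42, cut weight 6). Positive edge weights are assumed.

```python
from itertools import combinations

def min_cut(vertices, edges):
    """Brute-force minimum weight cut: returns (side1, side2, cut_edges)."""
    first, rest = vertices[0], vertices[1:]
    best = None
    for r in range(len(rest)):                 # side1 never equals all of V
        for combo in combinations(rest, r):
            side = {first, *combo}
            cut = [e for e in edges if (e[0] in side) != (e[1] in side)]
            weight = sum(edges[e] for e in cut)
            if best is None or weight < best[0]:
                best = (weight, side, cut)
    _, side, cut = best
    return side, set(vertices) - side, cut

def weighted_hcs(vertices, edges):
    """edges maps (u, v) to a positive weight; returns a list of clusters."""
    vertices = sorted(vertices)
    if len(vertices) < 2 or not edges:
        return [{v} for v in vertices]
    side1, side2, cut = min_cut(vertices, edges)
    x = len(edges) * sum(edges[e] for e in cut) / sum(edges.values())
    if x > len(vertices) / 2:                  # highly connected: output G
        return [set(vertices)]
    clusters = []
    for side in (side1, side2):                # otherwise split and recurse
        sub = {e: w for e, w in edges.items() if e[0] in side and e[1] in side}
        clusters += weighted_hcs(side, sub)
    return clusters

# Hypothetical graph: clique A-D and triangle E-G (weight 4) joined by six
# weak edges (weight 1), i.e. 15 edges with total weight 42.
STRONG = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),
          ("C", "D"), ("E", "F"), ("E", "G"), ("F", "G")]
WEAK = [("A", "E"), ("A", "F"), ("B", "F"), ("B", "G"), ("C", "E"), ("D", "G")]
WEIGHTED = {**{e: 4 for e in STRONG}, **{e: 1 for e in WEAK}}
UNWEIGHTED = {e: 1 for e in WEIGHTED}

weighted_clusters = weighted_hcs("ABCDEFG", WEIGHTED)
unweighted_clusters = weighted_hcs("ABCDEFG", UNWEIGHTED)
```

With all weights set to 1 the formula reduces to $x = |C|$, so the same function covers the unweighted variant. On this toy graph the unweighted criterion keeps all seven vertices together, while the weighted criterion separates the two tightly linked groups.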

[Figure; cluster labels: Ephrin type A, Ephrin type B, predicted proteins (C. elegans and Drosophila)]

About 300,000 non-redundant sequences
SLC: Single Linkage Clustering
SF: Superfamilies
SF+SC: Family clusters derived from superfamilies

[Screenshot of systers.molgen.mpg.de with the fields Family, Superfamily and Domains]

SYSTERS
Exploit the self-structuring properties of the data:
– Determine an individual cutoff for each superfamily based on the single linkage hierarchy
– Split each superfamily into family clusters based on the superfamily distance graph
Automated and independent of static user-defined cutoffs
Results accessible on the Internet

Protein family databases - ProtoNet
http://www.protonet.cs.huji.ac.il/
N. Kaplan et al., NAR, 2005, 33(DB)
Global classification of proteins into hierarchical clusters
Based on Swiss-Prot sequences, with TrEMBL sequences added after clustering
3 different hierarchical clustering methods available, depending on the similarity measure (harmonic, geometric or arithmetic average) based on the BLAST E-value

Protein family databases - CluSTr
http://www.ebi.ac.uk/clustr/index.html
R. Petryszak et al., Bioinformatics, 2005, 21(18)
Automatic hierarchical classification of all sequences in UniProt
Uses a Z-score based on Smith-Waterman comparison: $Z\text{-score} = \min(Z(A,B), Z(B,A))$ with $Z(A,B) = (\mathrm{Score}(A,B) - M) / \sigma$, where $M$ is the arithmetic mean and $\sigma$ the standard deviation of all results
Constructs a single-linkage hierarchy
Provides a subset of clusters at several different cutoff values

Protein family detection - TribeMCL
http://www.ebi.ac.uk/research/cgg/tribe/
A.J. Enright et al., NAR, 2002, 30(7)
Uses a Markov Clustering method based on BLAST E-values
Primarily used for comparing protein sequence sets of completely sequenced genomes, e.g. in ENSEMBL
Clustering software available
Provides one set of protein families
More specific than other methods, but less sensitive

            Related          Not related
Found       True positive    False positive
Not found   False negative   True negative
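The terms "specific" and "sensitive" refer to the four cells of the table above. A minimal sketch of the two measures, using the common definitions (sensitivity as the share of related pairs that are found, specificity as the share of unrelated pairs that are correctly left out); the counts are invented and do not describe TribeMCL or any other method:

```python
def sensitivity(true_pos, false_neg):
    """Share of truly related pairs that the method finds."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of unrelated pairs that the method correctly leaves out."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts for a conservative clustering method.
tp, fn, tn, fp = 80, 20, 990, 10
print(sensitivity(tp, fn), specificity(tn, fp))
```

A method that is "more specific, but less sensitive" produces few false positives at the price of missing some true family members.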

But wait a moment...
Why so many databases? Which one is “right”, which one is “wrong”? How can we prove that the results are correct?
We want to answer biological questions with these databases
Different databases are needed to answer different questions
There is no “right” or “wrong”
The benefit highly depends on the question
The more concise the question, the more beneficial the answer

Gene family databases
Suppose we have the gene/protein sequences of 2 completely sequenced species
Which genes/proteins do these species have in common?
Which genes/proteins are orthologous?
Where are the differences?
Which genes/proteins have paralogs in one or the other species?

What happens to a duplicated gene?
Duplication-Degeneration-Complementation Model (DDC)
Lynch & Force (Genetics, 1999/2000)

Pairwise-best-hit method
Given: all proteins (genes) of species A and all proteins (genes) of species B
1. Search with all protein sequences of species A against all protein sequences of species B
2. Remember only the best hits
3. Search with all protein sequences of species B against all protein sequences of species A
4. Remember only the best hits
5. All pairwise best hits are assumed to be orthologs
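The five steps above can be sketched as follows. The search itself is not performed here: the hit lists stand in for the output of the two database searches, with invented protein names and scores, and the helper names `best_hits` and `pairwise_best_hits` are hypothetical.

```python
def best_hits(hits):
    """Keep only the best-scoring subject for every query."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def pairwise_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)   # steps 1 and 2
    best_ba = best_hits(b_vs_a)   # steps 3 and 4
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Toy search results: (query, subject, score); higher scores are better.
A_VS_B = [("a1", "b1", 250), ("a1", "b2", 90), ("a2", "b2", 180), ("a3", "b1", 120)]
B_VS_A = [("b1", "a1", 245), ("b2", "a2", 175), ("b2", "a1", 95), ("b3", "a3", 60)]

orthologs = pairwise_best_hits(A_VS_B, B_VS_A)   # step 5
print(orthologs)
```

In the toy data, a3 has b1 as its best hit, but b1 prefers a1, so a3 is not reported; such one-sided hits are typical for paralogs or for genes that were lost in one species.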

Gene family databases - InParanoid
http://inparanoid.cgb.ki.se/
K. O’Brien et al., NAR, 2005, 33(DB)
Clustering software available
After determination of main orthologs, inparalogs are added to the groups
Inparalogs: duplicated after the speciation event
Outparalogs: speciation event after the duplication
Uses BLAST

Gene family databases - COGs
http://www.ncbi.nlm.nih.gov/COG/
R.L. Tatusov et al., 1997, Science, 278
Clusters of Orthologous Groups of proteins
Based on all-against-all sequence search
Proteins form a COG if pairwise best hits exist between at least 3 species
Manual post-processing (alignments, trees) of COGs to split COGs of multi-domain proteins
[Figure: triangle of best hits between Species A, Species B and Species C]

Biological databases in general
The first issue of Nucleic Acids Research (NAR) every year is the database issue
In 2006 this database collection covered 858 databases
