Computational analysis of membrane proteins implicated in metal transport in Arabidopsis thaliana. Stefanie Hartmann, Max Planck Institute for Molecular Plant Physiology.

Computational analysis of membrane proteins implicated in metal transport in Arabidopsis thaliana
Stefanie Hartmann, Max Planck Institute for Molecular Plant Physiology
Supervisors: Joachim Selbig, Ute Krämer

Alignment excerpt shown on the title slide:
CIAVVLCLVFMSVEVVGGIKANSLAILTDAAHLLSDVAAFAISLFSLWAAGWEATPRQTYGFFRIEILGALVSIQLIWLLT
ALFLLINTAYMVVEFVAGFMSNSLGLISDACHMLFDCAALAIGLYASYISRLPANHQYNYGRGRFEVLSGYVNAVFLVLVG
CFVVVLCLLFMSIEVVCGIKANSLAILADAAHLLTDVGAFAISMLSLWASSWEANPRQSYGFFRIEILGTLVSIQLIWLLT
LIAVLLCAIFIVVEVVGGIKANSLAILTDAAHLLSDVAAFAISLFSLWASGWKANPQQSYGFFRIEILGALVSIQMIWLLA
---IFLYLIVMSVQIVGGFKANSLAVMTDAAHLLSDVAGLCVSLLAIKVSSWEANPRNSFGFKRLEVLAAFLSVQLIWLVS

12 membrane proteins involved in metal transport in Arabidopsis

Metal transporters are of great importance because
- they provide an adequate supply of essential trace metals
- they prevent an excess of these potentially toxic ions
In silico analyses may help design further experiments for
- basic research on metal homeostasis
- the development of new approaches to phytoremediation

Cation Diffusion Facilitator (CDF) proteins
- also referred to as cation efflux (CE) proteins
- occur in archaea, bacteria, and eukaryotes
- are involved in transporting heavy metals (Co2+, Cd2+, Zn2+, Ni2+)
- the CDF family of proteins had 13 members in 1997
- the CE Pfam family had 348 members in July 2003 and 426 in January 2004
- CDF signature sequence: S X (ASG) (LIVMT)2 (SAT) (DA) (SGAL) (LIVFYA) (HDN) X3 D X2 (AS)
  (X = any residue; parentheses list the residues allowed at a position; a trailing number is a repeat count)

The Arabidopsis thaliana CDF protein family: match to the CDF signature sequence

Protein  Locus      Signature region     Match
CDF1     At2g46800  SLAILTDAAHLLSDVAA    exact match
CDF2     At3g61940  SLAILADAAHLLTDVGA    exact match
CDF3     At3g58810  SLAILTDAAHLLSDVAA    exact match
CDF4     At2g29410  SLAVMTDAAHLLSDVAG    1 mismatch
CDF5     At2g04620  SLGLISDACHMLFDCAA    1 mismatch
CDF6     At2g47830  STAIIADAAHSVSDVVL    1 mismatch
CDF7     At2g39450  SLAIIASTLDSLLDLLS    2 mismatches
CDF8     At1g16310  SMAVIASTLDSLLDLLS    2 mismatches
CDF9     At1g79520  SMAVIASTLDSLLDLLS    2 mismatches
CDF10    At3g58060  SIAIAASTLDSLLDLMA    3 mismatches
CDF11    At3g12100  RVGLVSDAFHLTFGCGL    3 mismatches
CDF12    At1g51610  SHVIMAEVVHSVADFAN    4 mismatches
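The mismatch counts on this slide can be checked with a few lines of Python. The sketch below is only an illustration, not the method used in the talk: it encodes the signature from the previous slide position by position and counts the residues in each 17-residue region that fall outside the allowed set. The names `SIGNATURE`, `SIGNATURE_REGIONS`, and `count_mismatches` are made up for this example.

```python
# CDF signature from the slide:
# S X (ASG) (LIVMT)2 (SAT) (DA) (SGAL) (LIVFYA) (HDN) X3 D X2 (AS)
# One entry per position; None stands for X (any residue is allowed).
SIGNATURE = [
    "S", None, "ASG", "LIVMT", "LIVMT", "SAT", "DA", "SGAL",
    "LIVFYA", "HDN", None, None, None, "D", None, None, "AS",
]

# Signature regions of the 12 Arabidopsis proteins, copied from the slide.
SIGNATURE_REGIONS = {
    "CDF1": "SLAILTDAAHLLSDVAA",
    "CDF2": "SLAILADAAHLLTDVGA",
    "CDF3": "SLAILTDAAHLLSDVAA",
    "CDF4": "SLAVMTDAAHLLSDVAG",
    "CDF5": "SLGLISDACHMLFDCAA",
    "CDF6": "STAIIADAAHSVSDVVL",
    "CDF7": "SLAIIASTLDSLLDLLS",
    "CDF8": "SMAVIASTLDSLLDLLS",
    "CDF9": "SMAVIASTLDSLLDLLS",
    "CDF10": "SIAIAASTLDSLLDLMA",
    "CDF11": "RVGLVSDAFHLTFGCGL",
    "CDF12": "SHVIMAEVVHSVADFAN",
}


def count_mismatches(region):
    """Count positions where the residue is not allowed by the signature."""
    if len(region) != len(SIGNATURE):
        raise ValueError("region must have the same length as the signature")
    return sum(
        1
        for allowed, residue in zip(SIGNATURE, region)
        if allowed is not None and residue not in allowed
    )


mismatches = {name: count_mismatches(seq) for name, seq in SIGNATURE_REGIONS.items()}
for name, n in mismatches.items():
    print(f"{name}: {n} mismatch(es)")
```

A position-by-position list is used instead of a regular expression because a regular expression only reports match or no match, whereas the slide ranks the proteins by number of mismatches.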

Research questions
1. Can all 12 proteins be classified as CDF proteins, i.e., are there predicted structural and functional similarities among these 12 Arabidopsis proteins?
   Methods: secondary structure prediction, inclusion in membrane and transporter databases, evaluation of common motifs, etc.
2. What are the relationships of the 12 Arabidopsis proteins among each other and to other published sequences?
   Methods: intron/exon structure, phylogenetic reconstructions
3. Is it possible to predict the 3D structure of these proteins?
   Method: fold recognition by threading

Sequence retrieval: four ambiguous sequences
- TIGR Arabidopsis thaliana database
- TAIR: The Arabidopsis Information Resource
- MIPS Arabidopsis thaliana genome database
Differences: different assignment of introns, use of alternative start codons

Sequence analysis: three additional ambiguous sequences
- SWALL
- Pfam vs. TIGR/TAIR/MIPS
Differences: insertions and deletions, different amino acid sequences

Cloning and RT-PCR revealed correct sequences for six of the seven ambiguous CDFs

Inclusion in membrane and transport databases (table; the per-protein marks for CDF1 to CDF12 are not preserved in this transcript)
Databases checked: cation efflux Pfam entry PF01545, Arabidopsis Membrane Protein Library (AMPL), ARAMEMNON, Transport Protein Database, PlantsT

Hidden Markov models used for secondary structure prediction
- states (loops, transmembrane domains, etc.) are defined
- states are connected in a biologically reasonable way (transitions)
- each state has a specific probability distribution over the 20 amino acids
- each transition has a specific transition probability
- amino acid probabilities and transition probabilities are learned
- models are first taught using a training set; the trained model is then used for the prediction
(Figure labels: cytoplasmic side, membrane, non-cytoplasmic side)
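To make the idea of states, emission probabilities, and transition probabilities concrete, here is a toy sketch in Python. It is not TMHMM, HMMTOP, or Memsat: it uses only two states (loop and membrane) instead of the richer architectures of those tools, and all probabilities are hand-set assumptions rather than values learned from a training set. The most probable state path is found with the Viterbi algorithm.

```python
import math

# Two-state toy model: "-" = loop, "M" = membrane-spanning segment.
STATES = ("-", "M")
HYDROPHOBIC = set("AILMFVWY")  # assumed hydrophobic residue set (8 of 20)

# Hand-set (not learned) parameters, chosen only for illustration.
START = {"-": 0.5, "M": 0.5}
TRANSITION = {("-", "-"): 0.9, ("-", "M"): 0.1,
              ("M", "M"): 0.9, ("M", "-"): 0.1}


def emission(state, residue):
    """Probability of emitting a residue in a state (sums to 1 over 20 residues)."""
    if state == "M":
        return 0.8 / 8 if residue in HYDROPHOBIC else 0.2 / 12
    return 0.3 / 8 if residue in HYDROPHOBIC else 0.7 / 12


def viterbi(sequence):
    """Return the most probable state path for the sequence (log space)."""
    score = {s: math.log(START[s]) + math.log(emission(s, sequence[0]))
             for s in STATES}
    backpointers = []
    for residue in sequence[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this residue.
            prev = max(STATES,
                       key=lambda p: score[p] + math.log(TRANSITION[(p, s)]))
            new_score[s] = (score[prev] + math.log(TRANSITION[(prev, s)])
                            + math.log(emission(s, residue)))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("KKKKK" + "L" * 10 + "KKKKK"))
```

For a charged stretch, ten leucines, and another charged stretch, the path labels the leucine run as membrane and the flanks as loop. Real predictors add separate states for the cytoplasmic and non-cytoplasmic sides, which is what allows them to predict topology as well as the number of transmembrane domains.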

Results of secondary structure predictions
Tools: TMHMM v2 (Sonnhammer et al. 1998), HMMTOP v2 (Tusnady and Simon, 1998, 2001), Memsat2 (Jones et al. 1994, McGuffin et al. 2000)

Protein  Number of TMD  N-terminus within cytoplasm (predictions out of 3)
CDF1     6              2 / 3
CDF2     6              3 / 3
CDF3     6              2 / 3
CDF4     5-6            2 / 3
CDF5     6              3 / 3
CDF6     0-6            1 / 3
CDF7     4-6            2 / 3
CDF8     5-6            3 / 3
CDF9     5-6            3 / 3
CDF10    ?              ? / 3
CDF11    6              3 / 3
CDF12    ?              ? / 3
(? = value not preserved in this transcript)

CDF signature CE signature

Prediction of subcellular localization: methods
- N-terminal sorting signals display characteristic amino acid compositions
- sequence-based methods for predicting N-terminal sorting signals are based on this observation

mTP: mitochondrial targeting peptide; cTP: chloroplast transit peptide; SP: signal peptide (ER/secretory pathway)

Tool          Predicts       Approach
TargetP       mTP, cTP, SP   neural network-based
iPSORT        mTP, cTP, SP   decision list
Predotar      mTP, cTP       neural network-based
SignalP NN    SP             neural network-based
SignalP HMM   SP             based on hidden Markov models

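Prediction of subcellular localization: results
mTP: mitochondrial targeting peptide; cTP: chloroplast transit peptide; SP: signal peptide (ER/secretory pathway)
Tools compared: TargetP, iPSORT, Predotar, SignalP NN, SignalP HMM
Entries per protein, in slide order (the empty table cells are not preserved, so entries cannot be assigned to tool columns with certainty):
- CDF2: 3/4
- CDF5: cTP, 1/4
- CDF6: mTP, cTP, mTP
- CDF8: cTP*, mTP*, 2/4*, Y*
- CDF12: mTP
- CDF1, CDF3, CDF4, CDF7, CDF9, CDF10, CDF11: no entries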

Exon structure of the CDF proteins (figure; axis: number of exons)

Gene organization of the CDF proteins (figure; order shown: CDF1, CDF2, CDF3, CDF4, CDF5, CDF11, CDF6, CDF12, CDF7, CDF8, CDF9, CDF10)

Published tree: Phylogenetic Relationships within Cation Transporter Families of Arabidopsis. Plant Physiology 2001; 126 (4): 1646–1667
Included: CDF4, CDF3, CDF2, CDF1, CDF12, CDF10, CDF11, CDF6
Omitted: CDFs 5, 7, 8, 9

Phylogenetic analysis of the Arabidopsis CDF proteins

Phylogenetic analysis of sequences containing the CE signature Arabidopsis group I sequences, monocot and dicot sequences, mammalian metal transporters Arabidopsis group II sequences, monocot and dicot sequences, prokaryotic and eukaryotic seqs several two-domain proteins outgroup

Working model: topology of Arabidopsis CDF proteins (figure labels: N-terminus, C-terminus, CDF signature sequence, cytoplasm, cell exterior/organelle)

Information derived from the 3D structure of a protein
- assignment of function
- guide for mutagenesis experiments
- ligand and functional sites
- evolutionary relationships
- residue solvent exposure
- putative interaction sites

Structure determination
1. Classical approaches
   - X-ray crystallography
   - NMR spectroscopy
2. Computational approaches
   - comparative (“homology”) modeling
   - fold recognition (“threading”)
   - ab initio methods

The basis of fold recognition (“threading”)
- The number of folds occurring in nature is limited
- There are many sequences with no significant sequence identity but with the same or similar folds, for example:
  …HEAIDHKPKLTGMKTGRVVSSMKSNFFADLP…
  …HDGRSSMTRFSRYFRKTGRVSEYYKKQERLLE…
(Figure: PDB statistics)

Fold recognition methods
Aim: to find an optimal sequence-structure alignment
1. “Threading” of an unknown target sequence into the backbone structure of template proteins of known structure
   Example target sequence: …CLVFMSVEVVGGIKANSLAILTD…

Fold recognition methods (continued)
2. Evaluation of the compatibility between target sequence and proposed 3D structure, using environment-based mean force potentials or knowledge-based mean force potentials
3. Output: a list of folds (sorted or unsorted) and their “compatibility score”; sometimes other information such as SCOP descriptors, alignment, rudimentary 3D model of the query protein, raw scores, solvation energy for the model, links

No new insights regarding the structure of CDF proteins
- Membrane proteins are significantly under-represented in structural databases, and therefore also in fold libraries
- If there is no fold similar to the native fold of the target protein, this approach cannot succeed
- Threading methods cannot be used for modeling of transmembrane proteins

Will the 3D structure of CDFs be available soon?
- For fold recognition methods to be used successfully, significantly more 3D structures of membrane proteins are needed
- Fold recognition methods specifically for integral membrane proteins may eventually be developed
- Crystallization of bacterial homologs and subsequent extrapolation of structural features as an alternative?
- Approach for globular proteins: predicting a protein’s solubility and propensity to crystallize, based on results from high-throughput structure determination

Can threading results be used as an independent way to verify group assignment? Were some structural hits specific for any of the CDF groups?
1. Which hits were common to which of the CDF sequences?
2. “Phylothreading”

Which hits were common to which of the CDF sequences?
Structural hits predicted
- for most CDF sequences
- for group I sequences
- for group II sequences
- for CDF5 and CDF11
- for CDF6 and CDF12
Results were unable to provide evidence to verify group assignments based on other methods
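One way to ask which hits are common to which sequences is to index the hit lists by fold and then group folds by the exact set of sequences that returned them. The sketch below assumes each threading run has already been reduced to a set of fold identifiers; the sequence names and fold names are hypothetical placeholders, not results from the talk.

```python
from collections import defaultdict


def folds_by_sequence_set(hits):
    """Group folds by the exact set of query sequences that returned them.

    hits maps a sequence name to the set of fold identifiers among its hits.
    Returns a dict mapping frozenset(sequence names) to a set of folds.
    """
    sequences_per_fold = defaultdict(set)
    for sequence, folds in hits.items():
        for fold in folds:
            sequences_per_fold[fold].add(sequence)

    grouped = defaultdict(set)
    for fold, sequences in sequences_per_fold.items():
        grouped[frozenset(sequences)].add(fold)
    return dict(grouped)


# Hypothetical example data: four query sequences and three folds.
example_hits = {
    "seqA": {"fold1", "fold2"},
    "seqB": {"fold1"},
    "seqC": {"fold2", "fold3"},
    "seqD": {"fold3"},
}

shared = folds_by_sequence_set(example_hits)
for sequences, folds in shared.items():
    print(sorted(sequences), "->", sorted(folds))
```

Folds whose sequence set coincides with a proposed group would count as group-specific hits; folds shared across groups, such as `fold2` in the example, would not.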

“Phylothreading” Phylothreading results can neither verify nor refute group assignments based on other methods

Threading: non-transmembrane CDF fragments
- N-terminus
- histidine-rich loop between TMD 4 and 5
- C-terminus
(Figure labels: N, C, cytoplasm, cell exterior/organelle)

“Phylothreading”: CDF C-terminal fragments “phylothreading” results confirm the assignment of CDF sequences to groups that were based on independent methods

Conclusions
1. Sequence retrieval revealed conflicting information for 7 of the 12 proteins
2. The 12 Arabidopsis protein sequences reveal striking structural and therefore probably functional conservation
3. My results support the classification of these proteins as CDF metal transporters
4. I propose that the CDF protein family of A. thaliana contains two groups, each containing four proteins that are structurally and functionally closely related
5. I was able to evaluate and compare a variety of online tools available for the analysis of sequence data
6. Threading methods cannot be used for transmembrane proteins or for their non-transmembrane domains (yet)
7. Threading results for multiple sequences can be used to confirm (or find?) relationships among these sequences (“phylothreading”)

METHODS

Phylogenetic analysis: tree-building methods
- distance-based methods: overall distances between all pairs of sequences are calculated and then used to calculate a tree (Neighbor Joining)
- character-based methods: the individual substitutions among the sequences are used to determine the most likely ancestral relationships (Maximum Parsimony, Maximum Likelihood)
- Bayesian inference of phylogenies

Example alignment excerpt:
...CLVFMSVEVVGGIKANSLAILTD...
...NTAYMVVEFVAGFMSNSLGLISD...
...CLLFMSIEVVCGIKANSLAILAD...
...CAIFIVVEVVGGIKANSLAILTD...
...YLIVMSVQIVGGFKANSLAVMTD...
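The first step of a distance-based method, calculating a distance for every pair of sequences, can be sketched as follows. The example uses the simplest possible measure, the p-distance (fraction of aligned positions that differ), on the five alignment fragments shown on the slide. The labels `seq1` to `seq5` are placeholders, and the choice of p-distance is an assumption made for illustration; the talk does not state which distance measure was used.

```python
from itertools import combinations

# The five aligned fragments from the slide; labels are placeholders.
ALIGNMENT = {
    "seq1": "CLVFMSVEVVGGIKANSLAILTD",
    "seq2": "NTAYMVVEFVAGFMSNSLGLISD",
    "seq3": "CLLFMSIEVVCGIKANSLAILAD",
    "seq4": "CAIFIVVEVVGGIKANSLAILTD",
    "seq5": "YLIVMSVQIVGGFKANSLAVMTD",
}


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Pairwise distance matrix, stored as a dict keyed by name pairs.
distances = {
    (i, j): p_distance(ALIGNMENT[i], ALIGNMENT[j])
    for i, j in combinations(ALIGNMENT, 2)
}

for (i, j), d in distances.items():
    print(f"{i} vs {j}: {d:.3f}")
```

A tree-building algorithm such as Neighbor Joining would take this matrix as its input. In practice, distances are usually corrected for multiple substitutions with a substitution model rather than used as raw p-distances.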

Phylogenetic analysis: statistical evaluation of trees
Bootstrap analysis: how much support exists for particular branches in a phylogeny?
1. tree construction, determination of the “best” tree
2. bootstrap datasets (pseudosamples) are created from the original dataset by random sampling with replacement
3. tree construction using the bootstrap datasets
4. comparison of the bootstrap tree with the inferred tree
5. this is repeated several hundred times
6. bootstrap value: percentage of times an interior branch in the bootstrap tree was the same as the one in the inferred tree
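The resampling in step 2 and the percentage in step 6 can be sketched in Python. Building a tree for every pseudosample is beyond a short example, so the sketch substitutes a deliberately crude stand-in for "the branch is present": the first sequence is closer to the fourth than to the second by p-distance. That criterion, the function names, and the number of replicates are all assumptions for illustration; a real analysis would rebuild a tree from each pseudosample.

```python
import random

# Alignment fragments as shown on the tree-building slides.
ALIGNMENT = [
    "CLVFMSVEVVGGIKANSLAILTD",
    "NTAYMVVEFVAGFMSNSLGLISD",
    "CLLFMSIEVVCGIKANSLAILAD",
    "CAIFIVVEVVGGIKANSLAILTD",
    "YLIVMSVQIVGGFKANSLAVMTD",
]


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def bootstrap_alignment(alignment, rng):
    """Create one pseudosample by drawing columns with replacement."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def groups_first_with_fourth(alignment):
    """Stand-in for 'branch present in the tree' (not a real tree test)."""
    return (p_distance(alignment[0], alignment[3])
            < p_distance(alignment[0], alignment[1]))


def bootstrap_support(alignment, grouping_holds, replicates=200, seed=0):
    """Percentage of pseudosamples in which the grouping criterion holds."""
    rng = random.Random(seed)
    hits = sum(
        grouping_holds(bootstrap_alignment(alignment, rng))
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


support = bootstrap_support(ALIGNMENT, groups_first_with_fourth)
print(f"bootstrap support: {support:.1f}%")
```

Sampling whole columns keeps each pseudosample a valid alignment of the same size, which is what allows the same tree-building procedure to be rerun on it unchanged.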

Fold recognition methods
2. Evaluation of the compatibility between target sequence and proposed 3D structure
Using environment-based mean force potentials (Bowie, Fischer, Eisenberg):
- residue positions are categorized into environment classes
- the 3D protein structure is converted into a 1D sequence
- an alignment of this 1D string to the target sequence is generated
Using knowledge-based mean force potentials (Sippl):
- information is automatically learned from databases of protein structures
- pairwise interactions between structurally adjacent residues are calculated
- transformation of mean force potentials as a function of distance
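Knowledge-based potentials of this kind are commonly derived with the inverse Boltzmann relation, E(r) = -kT ln(f_obs(r) / f_ref(r)), where f_obs is the observed frequency of a residue pair at distance r in known structures and f_ref is a reference frequency. The sketch below applies that relation to made-up counts in three distance bins. The counts, the bin layout, and the use of kT = 1 are assumptions for illustration, not values from the talk or from any published potential.

```python
import math


def mean_force_potential(observed_counts, reference_counts, kT=1.0):
    """Inverse Boltzmann energies per distance bin.

    observed_counts: counts of a given residue pair per distance bin
    reference_counts: counts of all residue pairs per distance bin
    Returns one energy per bin; negative means the pair is seen more often
    at that distance than expected from the reference.
    """
    if any(c <= 0 for c in list(observed_counts) + list(reference_counts)):
        raise ValueError("all counts must be positive (add pseudocounts first)")
    obs_total = sum(observed_counts)
    ref_total = sum(reference_counts)
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        f_obs = obs / obs_total
        f_ref = ref / ref_total
        energies.append(-kT * math.log(f_obs / f_ref))
    return energies


# Made-up counts for three distance bins (short, medium, long range).
energies = mean_force_potential([30, 50, 20], [20, 50, 30])
print([round(e, 3) for e in energies])
```

A threading program would sum such energies over all residue pairs implied by a sequence-structure alignment and compare the total across templates.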

Fold recognition methods
2. Evaluation of the compatibility between target sequence and proposed 3D structure, using environment-based mean force potentials* or knowledge-based mean force potentials*
* distance-dependent forces that act between atoms/residues (electrostatic and van der Waals interactions, influences of the surrounding medium on these interactions, contacts between two or three amino acids, angles between residue pairs, …)

Threading methods used
- UCLA-DOE Fold Server: P. Mallick et al., 2002 (BLAST, PSI-BLAST, SDP, DASEY)
- Threader: D.T. Jones et al., 1992
- mGenThreader: L.J. McGuffin & D.T. Jones
- 3D-PSSM: L.A. Kelley et al., 2000
- Arby: I. Sommer et al., unpublished (PSI-BLAST, 123D, Jprop)

Selection of structural hits for further analysis
Tools: UCLA-DOE, Threader, mGenThreader, 3D-PSSM, Arby
Selection rules, in slide order (the pairing of tools with rules is not preserved in this transcript):
- top 10 structural hits are returned; all were kept
- compatibility of the target sequence with all 2000 available templates is evaluated; lists were sorted by Z-value, and approximately the best hits were kept
- top 20 structural hits are returned; all were kept
- a list of the best scores is returned; the corresponding hits were extracted from a large table

Evaluation of the top score for each CDF sequence
- UCLA-DOE: scores; no guidelines
- Threader: very poor score, poor score, borderline significant, significant, very significant
- mGenThreader: guess, low confidence, medium confidence, high confidence, certain
- 3D-PSSM: highly confident, worthy of attention

There is no consensus on the top fold predicted by different methods
Example: top two structural hits for CDF1
- Threader: 1ONE (phosphopyruvate hydrolase), 1C3Q (thiazole kinase)
- mGenThreader: 1L8M (his-rich protein, model), 1QGR (importin beta)
- UCLA-DOE: 1B8F (histidine ammonia-lyase), 1HFA (clathrin assembly protein)
- 3D-PSSM: 1PW4 (glycerol-3-phosphate transporter), 1KPW (green cone pigment)
- Arby: 1HZX (bovine rhodopsin), 1EZV (yeast cytochrome bc1)

Threading results: C-termini
1. Structural information
   - no information on domains for metal transport is available
   - but: several of the returned hits are proteins in which bound metals have structural or catalytic roles
2. Verification of group assignment
   i. Hits predicted for more than one C-terminus: 48 folds
      - specific for group I: 3
      - specific for group II: 2
      - specific for CDF5 and CDF11: 2
   ii. “Phylothreading”

Positions of conserved domains and signature sequences (figure; labels: TMD I to VI, Pfam, CE signature, CDF signature, BLOCKS (eMOTIF))

Arabidopsis CDF proteins
group I:
- contain his-rich region between TMD 4 and 5
- one member is confirmed to transport Zn ions
- genome structure conserved (no introns)
group II:
- lack the his-rich region between TMD 4 and 5
- proteins may transport Mn ions
- C-terminal regions differ from group I sequences
no group assignment:
- CDF6, CDF12: possibly distant common ancestry and mitochondrial localization
- CDF5, CDF11: close relationship also in PFAM tree

Phylogenetic analysis: tree-building methods
- maximum parsimony methods: the best tree topology minimizes the total amount of evolutionary change that has occurred
- distance methods: the best tree topology minimizes the total distance among taxa
- maximum likelihood methods: given a particular substitution model and a particular tree, how likely are the observed data?

Inclusion in membrane and transport databases
(The marks for the cation efflux Pfam entry PF01545 are not preserved in this transcript.)

Protein  AMPL                     ARAMEMNON                    Transport Protein Database  PlantsT
CDF1     CDF                      zinc transporter             CDF                         CDF
CDF2     CDF                      putative MTP                 CDF                         CDF
CDF3     CDF                      putative MTP                 CDF                         CDF
CDF4     CDF                      putative MTP                 CDF                         CDF
CDF5     singleton (CDF related)  putative cation transporter  CDF                         -
CDF6     singleton                unknown protein              CDF                         CDF
CDF7     family                   unknown protein              CDF                         -
CDF8     family                   hypothetical protein         -                           -
CDF9     family                   unknown protein              -                           -
CDF10    family                   putative MTP                 -                           CDF
CDF11    singleton                putative MTP                 CDF                         CDF
CDF12    singleton                putative MTP                 -                           CDF

AMPL: Arabidopsis Membrane Protein Library
