Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction based on Chapter 1 Lesk, Introduction to Bioinformatics.


By Michael Schroeder, Biotec, 2 Contents
- Molecular biology primer
- The role of computer science
- Phylogeny
- Sequence searching
- Protein structure
- Clinical implications
- Read chapter 1

By Michael Schroeder, Biotec, 3 26 June 2000: Draft of human genome sequenced!
- 1953: Watson and Crick discover the structure of DNA
- 2000: Draft of the human genome is announced
- “The most wondrous map ever produced by humankind”
- “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”

By Michael Schroeder, Biotec, 4 High-throughput biomedicine
- Microarrays
  - Measure activity of thousands of genes at the same time
  - Example: cancer
    - Compare activity with and without drug treatment
    - Result: hundreds of candidate drug targets
- RNAi (Nobel prize 2006, Fire and Mello)
  - Knock down genes and observe the effect
  - Example: infectious diseases
    - Which proteins orchestrate entry into the cell?
    - Result: hundreds of candidate proteins
- Atomic force microscopes (co-invented by Nobel laureate Binnig)
  - Pull protein out of membrane and measure force
  - Example: eye diseases resulting from misfolding
    - Result: hundreds of candidate residues

By Michael Schroeder, Biotec, 5 Drug Discovery
- Challenge: Longer time to market, fewer drugs, exploding costs
- Approach: Use of compound libraries and high-throughput screening

By Michael Schroeder, Biotec, 6 HTS and Bioinformatics
- High-throughput technologies have completely changed the work of biomedical researchers
- Challenge: Interpret the (often large) results of screens
- Approach: Before running secondary assays, use bioinformatics and IT to assemble all available information

By Michael Schroeder, Biotec, 7 Good News
- >30,000 3D structures
- >1,000,000 sequences
- >16,000,000 articles
- >700 DBs/tools

By Michael Schroeder, Biotec, 8 Bad News: Data != Knowledge
- How to analyse data, how to integrate data?
- Computer science to the rescue…

By Michael Schroeder, Biotec, 9 Example: computer science is key for sequencing
- Human genome is a string of length 3,200,000,000
- Shotgun sequencing: Break multiple copies of the string into shorter substrings
- Example:
  - Copies: shotgunsequencing shotgunsequencing shotgunsequencing
  - Fragments: cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un
- Computing problem: Assemble the strings
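The assembly problem can be sketched with a greedy strategy: repeatedly merge the two fragments with the longest suffix/prefix overlap. This is only an illustration, not the method used by real assemblers; the function names and the four example fragments are assumptions chosen so that the overlaps are unambiguous. With very short fragments such as "en" or "ns" from the slide, several merges tie and a greedy choice can go wrong, which is exactly the difficulty with repetitive sequences.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str]) -> str:
    """Greedy sketch: keep merging the pair with the largest overlap."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = sorted(
        f for f in set(fragments)
        if not any(f != g and f in g for g in fragments)
    )
    if not frags:
        return ""
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0]


# Hypothetical fragments with clear 3-letter overlaps.
print(greedy_assemble(["sequenc", "shotgun", "encing", "gunseq"]))
```

The quadratic search for the best pair is fine for a toy example; genome-scale assembly needs indexed data structures instead.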

By Michael Schroeder, Biotec, 10 Computer science key for sequencing
shotgunsequencing
sh
sho
shot
  otgu
   tg
    gun
     un
      ns
       seq
       sequ
        equ
          uenc
           encing
           en
             cing
              ing
QUESTION: How can you handle long repetitive sequences? (e.g. Heeeeelllllllllllooooooo)
QUESTION: Why was a draft announced? When was the final version ready?

By Michael Schroeder, Biotec, 11 Arabidopsis thaliana, mouse, rat, Caenorhabditis elegans, Drosophila melanogaster, Mycobacterium leprae, Vibrio cholerae, Plasmodium falciparum, Mycobacterium tuberculosis, Neisseria meningitidis Z2491, Helicobacter pylori, Xylella fastidiosa, Borrelia burgdorferi, Rickettsia prowazekii, Bacillus subtilis, Archaeoglobus fulgidus, Campylobacter jejuni, Aquifex aeolicus, Thermotoga maritima, Chlamydia pneumoniae, Pseudomonas aeruginosa, Ureaplasma urealyticum, Buchnera sp. APS, Escherichia coli, Saccharomyces cerevisiae, Yersinia pestis, Salmonella enterica, Thermoplasma acidophilum

By Michael Schroeder, Biotec, 12 Breakthrough of the year 2000
Next quest: Sequencing a genome for $1000

By Michael Schroeder, Biotec, 13 Quantity and quality of data lead to ambitious goals
- Understand integrative aspects of the biology of organisms
- Interrelate sequence, three-dimensional structure, interactions, and function of proteins, nucleic acids and protein-nucleic acid complexes
- Travel in time
  - backward (deduce events in evolutionary history) and
  - forward (deliberate modification of biological systems)
- Applications in medicine, agriculture, and other scientific fields

By Michael Schroeder, Biotec, 14 Scenario
- New virus (e.g. SARS) and the goal to develop a treatment
- Scientists isolate the genetic material of the virus
- Screen the genome for relationships with previously studied viruses [10]
- From the virus’ DNA they compute the proteins it produces [1]
- Compute the proteins’ three-dimensional structure and thereby obtain clues about their functions
  - Screen for similar protein sequences with known structure [15]
  - If any are found
    - Then interpret the differences (homology modelling) [25]
    - Else predict structure from sequence [55]
- Identify or design a small molecule blocking relevant active sites of the protein [50]
- Design antibodies to neutralize the virus [50]
- Index of problem difficulty (numbers in brackets):
  - <30: solution exists already,
  - >30: we cannot solve this (yet)

By Michael Schroeder, Biotec, 15 Life in Time and Space
- Life
  - A biological organism is a naturally-occurring, self-reproducing device that effects controlled manipulations of matter, energy and information
- Time
  - Species evolve through
    - natural mutation,
    - recombination of genes in sexual reproduction, or
    - direct gene transfer
  - Read the past in contemporary genomes
- Space
  - Species occupy local ecosystems
  - Species are composed of organisms
  - Organisms are composed of cells
  - Cells are composed of molecules

By Michael Schroeder, Biotec, 16 DNA – the molecule of life http://www.ornl.gov/hgmis

By Michael Schroeder, Biotec, 17 Proteins
- 20 naturally occurring amino acids in proteins
- Non-polar
  - G glycine, A alanine, P proline, V valine
  - I isoleucine, L leucine, F phenylalanine, M methionine
- Polar
  - S serine, C cysteine, T threonine, N asparagine
  - Q glutamine, H histidine, Y tyrosine, W tryptophan
- Charged
  - D aspartic acid, E glutamic acid, K lysine, R arginine
- Other classification
  - H, F, Y, W are aromatic and play a role in membrane proteins
- Distinguish
  - atg = adenine-thymine-guanine (lower case: nucleotides) and
  - ATG = Alanine-Threonine-Glycine (upper case: amino acids)

By Michael Schroeder, Biotec, 18 The genetic code

By Michael Schroeder, Biotec, 19 Protein Structure
- DNA:
  - Nucleotides are very similar and hence the structure of DNA is very uniform
- Proteins:
  - Great variety in three-dimensional conformation to support diverse structures and functions
  - If heated, a protein “unfolds” to a biologically inactive structure; under normal conditions the protein folds

By Michael Schroeder, Biotec, 20 Paradox
- Translation from DNA sequence to amino acid sequence
  - is very simple to describe,
  - but requires immensely complicated machinery (ribosome, tRNA)
- The folding of the protein sequence into its three-dimensional structure
  - is very difficult to describe,
  - but occurs spontaneously
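How simple translation is to describe can be seen in a few lines of Python. The sketch below uses the standard genetic code written in the usual compact form (bases in the order T, C, A, G); the names `CODON_TABLE` and `translate` are illustrative choices, and the function assumes a clean coding sequence that starts in frame. It stops at the first stop codon and raises `KeyError` on characters other than A, C, G, T.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence into one-letter amino acid codes."""
    dna = dna.upper()  # nucleotides are often written in lower case
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("atggccaagtaa"))  # nucleotides atg... become the protein M...
```

Note how the lower-case nucleotide triplet atg is translated to the single amino acid M (methionine), which is unrelated to the three-residue peptide written ATG in upper case.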

By Michael Schroeder, Biotec, 21 Central Dogma
- DNA sequence determines protein sequence
- Protein sequence determines protein structure
- Protein structure determines protein function

By Michael Schroeder, Biotec, 22 Observables and Data Archives
- Databases in molecular biology cover
  - Nucleic acid and protein sequences,
  - Macromolecular structures and functions
- Archival databanks of biological information
  - DNA and protein sequences including annotations
  - Nucleic acid and protein structures including annotations
  - Protein expression patterns
- Derived databases
  - Sequence motifs (“signatures” of protein families)
  - Mutations and variants in DNA and protein sequences
  - Classification or relationships (e.g. hierarchy of structures)
- Bibliographic databases (PubMed with 17M abstracts)
- Collections
  - of links to web sites
  - of databases

By Michael Schroeder, Biotec, 23 What is Bioinformatics
- Bioinformatics is the marriage of biology and information technology
- Bioinformatics is an integrated multidisciplinary field
- Covers computational tools and methods for managing, analysing and manipulating sets of biological data
- Disciplines include:
  - biochemistry, genetics, structural biology, artificial intelligence, machine learning, software engineering, statistics, database theory, information visualisation, algorithm design

By Michael Schroeder, Biotec, 24 Bioinformatics
- Has three components
  - Creation of databases
  - Development of algorithms to analyse data
  - Use of these tools for analysing biological data

By Michael Schroeder, Biotec, 25 Databases: Types of Queries 1/2
1. Given a sequence (fragment), find sequences in the database that are similar to it
2. Given a protein structure (or fragment), find protein structures in the database that are similar to it
3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures
4. Given a protein structure, find sequences in the database that correspond to similar structures

By Michael Schroeder, Biotec, 26 Databases: Given sequence, find structure
3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures. But how?
  - Easy: Find similar sequences with known structure!
  - But: There might be similar structures whose sequences are not similar!
4. Given a protein structure, find sequences in the database that correspond to similar structures. But how?
  - Easy: Find similar structures and hence sequences
  - But: There are so many more sequences with unknown structure that this method will have only very limited success
- Queries 1 and 2 are solved; 3 and 4 are active fields of research

By Michael Schroeder, Biotec, 27 Databases: Types of Queries 2/2
- E.g. for which proteins of known structure that are involved in diseases of purine biosynthesis in humans are there related proteins in yeast?
- Solution: Virtual databases that provide transparent access to a number of underlying data sources and to query and analysis tools

By Michael Schroeder, Biotec, 28 Databases: Curation and Quality
- Problems:
  - Given that there are primary and secondary databases,
    - how to control updates,
    - how to propagate change,
    - how to maintain consistency?
  - Contents (experimental results, annotations, supplementary information) all have their own sources of error
  - Older data were limited by older techniques

By Michael Schroeder, Biotec, 29 Databases: Annotation
- Experimental data (e.g. raw DNA sequence) needs to be enriched with annotations
  - Source of data
  - Investigators responsible
  - Relevant publication
  - Feature tables (e.g. coding regions)
- Problems:
  - (often) lack of a controlled and coherent vocabulary
  - Annotations must be computer-parseable
  - Automated annotation needed
    - SwissProt = ca. 130,000 annotated sequences
    - TrEMBL = ca. 850,000 unannotated sequences
  - Maintenance of annotations (what if an error is detected?)

By Michael Schroeder, Biotec, 30 Computers and Computer Science
- Relevant areas:
  - Artificial Intelligence
  - Machine Learning
    - Neural networks, rule-based learning
  - Data mining
    - Association rules
  - Software Engineering
    - Design, implementation, testing of software
  - Programming
    - Object-oriented: C++, Java
    - Imperative: C, Modula, Pascal, Cobol, Fortran
    - Logic: Prolog
    - Functional: ML
    - Scripting: Perl, Python
  - Statistics
  - Database theory
    - Design and maintenance of databases
    - How to index sequences, time series, 3D structures
  - Information Visualisation
    - Graph drawing, diagrams, cartoons, 3D graphics
  - Algorithm design
    - Complexity of algorithms
    - Efficient data structures

By Michael Schroeder, Biotec, 31 Programming
- We will use Python
  - Scripting language
  - Supports string processing well
  - Widely used in bioinformatics

By Michael Schroeder, Biotec, 32 Biological Classification and Nomenclature
- Back in the 18th century, Linnaeus, a Swedish naturalist, classified living things according to a hierarchy: Kingdom, Phylum, Class, Order, Family, Genus, Species
- Generally only genus and species are used for identification
  - Homo sapiens
  - Drosophila melanogaster
  - Bos taurus
- Linnaeus’ classification is based on observed similarity
- It largely reflects biological ancestry

By Michael Schroeder, Biotec, 33 Classification of Humans and Fruit Flies
Rank      Human       Fruit fly
Kingdom   Animalia    Animalia
Phylum    Chordata    Arthropoda
Class     Mammalia    Insecta
Order     Primates    Diptera
Family    Hominidae   Drosophilidae
Genus     Homo        Drosophila
Species   sapiens     melanogaster

By Michael Schroeder, Biotec, 34 Homology = derived from common ancestor
- Characteristics derived from a common ancestor are called homologous
  - E.g. eagle’s wing and human’s arm
- Other apparently similar characteristics may have arisen independently by convergent evolution
  - E.g. eagle’s wing and bee’s wing. The most recent common ancestor of eagles and bees did not have wings
- Homologous characters may diverge functionally
  - E.g. bones in the human middle ear and jaws of primitive fish

By Michael Schroeder, Biotec, 35 Sequence analysis and Homology
- Sequence analysis gives unambiguous evidence for the relationship of species
- For higher organisms, sequence analysis and the classical tools of comparative anatomy, palaeontology, and embryology are often consistent
- For microorganisms there are problems
  - Classical methods: how to describe features
  - Sequence analysis: lateral gene transfer

By Michael Schroeder, Biotec, 36 Domains of Life
- Ribosomal RNA is present in all organisms
- Based on 16S ribosomal RNAs, life is divided into
  - Bacteria
    - No nucleus (prokaryote)
    - E.g. Mycobacterium tuberculosis and E. coli
  - Archaea
    - No nucleus (prokaryote)
    - Few organisms, living in hostile environments (thermophiles, halophiles, sulphur reducers, methanogens)
  - Eukarya
    - Has a nucleus enclosed in a membrane
    - Nucleus contains chromosomes
    - Internal compartments called organelles for specialised biological processes
    - Area outside nucleus and organelles is called cytoplasm
    - E.g. yeast and human beings

By Michael Schroeder, Biotec, 37 Eukaryotic cell

By Michael Schroeder, Biotec, 38 Domains of Life

By Michael Schroeder, Biotec, 39 Example: Use of sequences to determine phylogenetic relationships
- Use ExPASy (www.expasy.ch/cgi-bin/sprot-search-ful) to search for pancreatic ribonuclease for
  - horse (Equus caballus),
  - minke whale (Balaenoptera acutorostrata),
  - red kangaroo (Macropus rufus)
- sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTF
  VHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKY
  PNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST
- Use sequence alignment to determine evolutionary relationship

By Michael Schroeder, Biotec, 40 Sequence alignment
Global match: align all of one sequence with all of the other (mismatches, insertions, deletions)
And.--so,.from.hour.to.hour.we.ripe.and.ripe
||||    ||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour.we.rot-.and.rot-
Local match: find a region in one sequence that matches the other (mismatches, insertions, deletions; ends can be ignored)
  My.care.is.loss.of.care,.by.old.care.done,
    |||||||||    |||||||||||||   |||||| ||
Your.care.is.gain.of.care,.by.new.care.won
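A global match can be computed with dynamic programming. The sketch below follows the classic Needleman-Wunsch scheme; the scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions for illustration, not values from the slides. Real protein alignments use substitution matrices and more refined gap penalties.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = global_align("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

A local match in the sense of the slide can be obtained from the same table by never letting a cell drop below zero and starting the traceback at the highest-scoring cell (the Smith-Waterman variant), which is not shown here.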

By Michael Schroeder, Biotec, 41 Sequence alignment
- Motif search:
  - find matches of a short sequence in a long sequence
- Options:
  - perfect,
  - 1 mismatch,
  - mismatches + gaps + insertions + deletions
        match
         ||||
for the watch to babble and to talk is most tolerable
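The first two options (perfect match, or a fixed number of mismatches) can be sketched as a sliding-window comparison. The function name `find_motif` and its `max_mismatches` parameter are illustrative; gaps, insertions and deletions are not handled and would need an alignment algorithm instead.

```python
def find_motif(text: str, motif: str, max_mismatches: int = 0) -> list[int]:
    """Return start positions where motif matches text with at most max_mismatches."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        window = text[start:start + len(motif)]
        # Count positions where window and motif disagree (Hamming distance).
        mismatches = sum(1 for x, y in zip(window, motif) if x != y)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


line = "for the watch to babble and to talk is most tolerable"
print(find_motif(line, "match"))                    # perfect matches only
print(find_motif(line, "match", max_mismatches=1))  # allows "watch"
```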

By Michael Schroeder, Biotec, 42 Sequence alignment
Multiple sequence alignment
No.sooner.---met.--------.but.they.look’d
No.sooner.look’d.--------.but.they.lo-v’d
No.sooner.lo-v’d.--------.but.they.sigh’d
No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
No.sooner.               .but.they.

By Michael Schroeder, Biotec, 43 Example: Multiple alignment
- Use sequence alignment to determine evolutionary relationship…
- Example: horse, whale and kangaroo
- Expected: horse and whale are placental mammals, kangaroo is a marsupial
- Multiple alignment with CLUSTAL-W (www.ebi.ac.uk/clustalw)

By Michael Schroeder, Biotec, 44 FASTA format
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
DNSV
>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
YV
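The FASTA layout is simple enough to read with a few lines of Python: a header line starts with ">", and all following lines up to the next header belong to that record's sequence. The function name `parse_fasta` and the two short demo records are hypothetical; libraries such as Biopython offer more robust parsers.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record input; real entries look like the ribonuclease records above.
demo = """>seq1 demo protein
KESPAMKFER
QHMDSG
>seq2
RESPAMKFQR
"""
for name, sequence in parse_fasta(demo).items():
    print(name, len(sequence))
```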

By Michael Schroeder, Biotec, 45 Multiple Alignment with ClustalW (www.ebi.ac.uk/clustalw)
CLUSTAL W (1.82) multiple sequence alignment

sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
*:** **:*****: :......*** ** *.**.* ***:***:**. *.*:* *

sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
:*: ****::***:*.* : **:** *..****** *:**: :::******* ******

sp|P00674|RNP_HORSE  DASVEVST 128
sp|P00673|RNP_BALAC  DNSV---- 124
sp|P00686|RNP_MACRU  DAYV---- 122
*

By Michael Schroeder, Biotec, 46 Example: Number of Identical Aligned Residues
- Horse and minke whale: 95
- Minke whale and red kangaroo: 82
- Horse and red kangaroo: 75
- Conclusion: Horse and whale share the most identical residues
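Counts like these come from comparing two rows of the alignment column by column. A minimal sketch, assuming both rows have already been padded with "-" to the same length (the name `count_identities` is illustrative): a column counts as identical when both rows show the same residue and neither is a gap.

```python
def count_identities(row_a: str, row_b: str, gap: str = "-") -> int:
    """Count alignment columns in which both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)


# First ten columns of the horse and minke whale rows from the ClustalW output.
print(count_identities("KESPAMKFER", "RESPAMKFQR"))
# Last block of the same two rows, where the whale sequence ends early.
print(count_identities("DASVEVST", "DNSV----"))
```

Applying the function to the complete rows, block by block or after joining the blocks, gives the pairwise totals for the whole alignment.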

By Michael Schroeder, Biotec, 47 Example: Elephant and Mammoth
- Mitochondrial cytochrome b from
  - Siberian woolly mammoth (Mammuthus primigenius) preserved in arctic permafrost
  - African elephant (Loxodonta africana)
  - Indian elephant (Elephas maximus)

By Michael Schroeder, Biotec, 48
African elephant: sp|P24958|CYB_LOXAF
Mammoth: sp|P92658|CYB_MAMPR
Indian elephant: sp|O47885|CYB_ELEMA

MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
*** ** ***:**:**********************************************

TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
************************************************************

LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
**************************************:*********************

LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
:********:***********************************************:**

LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
******************************************************:*****

LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
**:*************************: *** **********:***************

IILAFLPIAGVIENYLIK 378
IILAFLPIAGMIENYLIK 378
**********:*******

By Michael Schroeder, Biotec, 49 Example: Elephant and Mammoth
- Mammoth and African elephant have 10 mismatches,
- mammoth and Indian elephant have 14.
- Significant?

By Michael Schroeder, Biotec, 50 Similarity and Homology
- Important difference:
  - Similarity is the measurement of resemblance of sequences
  - Homology: common ancestor
- Similarity is gradual, homology is either true or false
- Similarity = now, homology = past events
- Homology is only very rarely directly observed (e.g. lab population, clinical study of viral infection)
- Homology is inferred from sequence similarity

By Michael Schroeder, Biotec, 51 Example: Homology/Similarity
- The assertion that the cytochrome b sequences are homologues means that there is a common ancestor
- BUT:
  1. Maybe cytochrome b functionally requires so many conserved residues that similar sequences will occur in many species (in fact, this is not the case here)
  2. Maybe cytochrome b has to function this way in elephant-like species, but in fact started out from different ancestors (i.e. convergent evolution)
  3. Maybe mammoth and African elephant have fewer mismatches only because the Indian elephant’s DNA mutated faster
  4. Maybe all of them acquired cytochrome b through a virus (horizontal gene transfer)

By Michael Schroeder, Biotec, 52 Example: Conclusion
- Classical methods confirm that for pancreatic ribonuclease inferring homology from similarity is justified
- But whether mammoths are closer to African or to Indian elephants is too close to call
- Problems with inferring phylogeny from gene and protein sequences
  - Wide range of variation (possibly below statistical significance)
  - Different rates of evolution for different branches of the evolutionary tree

By Michael Schroeder, Biotec, 53 Inferring Phylogenies with SINEs and LINEs
- Requirements:
  - ‘All-or-none’ character
  - Irreversible appearance
- Solution:
  - SINEs and LINEs (Short and Long Interspersed Nuclear Elements)
  - Repetitive, non-coding sequences in eukaryotic genomes
  - >30% of the human genome, >50% in some plants
  - SINEs = 70-500 base pairs long, up to 10^6 copies
  - LINEs = up to 7000 base pairs long, up to 10^5 copies
  - They enter the genome by reverse transcription of RNA

By Michael Schroeder, Biotec, 54 A practical example: Fatherhood
- The picture shows a Southern blot of DNA from different family members, probed using a mini-satellite.
- You can work out which of F1 and F2 is the father of child C by observing which bands they have in common.
- (Reproduced from "Essential Medical Genetics" by M. Connor and M. Ferguson-Smith, with permission from Blackwell Science.)

By Michael Schroeder, Biotec, 55 Why SINEs are useful in phylogeny
- Either present or absent
- Inserted at random in the non-coding portion of the genome
  - i.e. a SINE has no important function, so that convergent evolution can be excluded
  - Presence of a SINE in two species and absence in a third implies that the first two species are more closely related
- SINE insertion appears to be irreversible
- Temporal order
  - Presence of a SINE in two species and absence in a third implies that the ancestor of the first two species is younger than the ancestor of all three

By Michael Schroeder, Biotec, 56 Example revisited
- What is the closest land-based relative of the whales?
- Classical palaeontology
  - links Cetacea (whales, dolphins, porpoises) with Artiodactyla (including e.g. cattle)
  - Belief that Cetaceans diverged before Artiodactyla split into the suborders
    - Suiformes (e.g. pigs),
    - Tylopoda (e.g. camels, llamas),
    - Ruminantia (e.g. deer, cattle, goats, sheep, antelopes, giraffe)
- Sequence comparison results
  - Based on mitochondrial DNA, pancreatic ribonuclease, fibrinogen, and others
  - Closest relatives of whales are hippopotamuses (they share 4 SINEs)
  - These two are closest to Ruminantia

By Michael Schroeder, Biotec, 57 Searching for Similar Sequences with PSI-BLAST
- Any search method for sequences should be
  - Sensitive: also pick up distant relationships
  - Selective: reported relationships are true
- Example: database with (among others) 1000 globin sequences
  - The globin family (oxygen transport) of proteins occurs in many species
  - The proteins have the same function and structure,
  - but there are pairs of members of the family sharing less than 10% identical residues
- Figure: a sequence database containing 1000 globin sequences; a search returns 900 results
  - True positives: 700 out of 900 are really globins
  - False positives: 200 out of 900 are not globins
  - False negatives: 300 out of 1000 globins are not found
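The two quality criteria can be turned into numbers from the counts in the globin example. In the sketch below, sensitivity is taken as the fraction of true family members that the search finds, and selectivity as the fraction of reported hits that are correct; the function name `search_quality` is an illustrative choice, and other texts call the second quantity precision.

```python
def search_quality(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (sensitivity, selectivity) of a database search."""
    sensitivity = true_pos / (true_pos + false_neg)  # found globins / all globins
    selectivity = true_pos / (true_pos + false_pos)  # correct hits / all reported hits
    return sensitivity, selectivity


# Counts from the globin example: 700 true positives, 200 false positives, 300 false negatives.
sensitivity, selectivity = search_quality(700, 200, 300)
print(f"sensitivity = {sensitivity:.2f}, selectivity = {selectivity:.2f}")
```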

By Michael Schroeder, Biotec, 58 Searching for Distant Relationships with PSI-BLAST
- How can we find distant relationships without increasing the false positives?
- PSI-BLAST:
  - Position-Specific Iterated BLAST (Basic Local Alignment Search Tool)
  - Identifies patterns within the sequences
  - Score via intermediaries may be better than score from direct comparison
  - Figure: A and B share 50%, B and C share 50%, but A and C share only 10%

By Michael Schroeder, Biotec, 59 PSI-BLAST Example
- Human PAX-6 gene (SwissProt ID P26367) has homologues in many different species
- PSI-BLAST at the NCBI site www.ncbi.nlm.nih.gov

By Michael Schroeder, Biotec, 60 Result
BLASTP 2.2.6 [Apr-09-2003]
RID: 1062117117-16602-2157828.BLASTQ3
Query= gi|6174889|sp|P26367|PAX6_HUMAN Paired box protein Pax-6 (Oculorhombin) (Aniridia, type II protein). (422 letters)
Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
1,509,571 sequences; 486,132,453 total letters
Results of PSI-Blast iteration 1
Sequences with E-value BETTER than threshold

Sequences producing significant alignments:                                  Score (bits)  E Value
gi|4505615|ref|NP_000271.1| paired box gene 6 isoform a; Paired box h...     781           0.0
gi|189353|gb|AAA59962.1| oculorhombin >gi|189354|gb|AAA59963.1| oculo...     780           0.0
gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg...     778           0.0
gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus]           776           0.0
gi|7305369|ref|NP_038655.1| paired box gene 6; small eye; Dickie's sm...     776           0.0
gi|383296|prf||1902328A PAX6 gene                                            775           0.0
gi|4580424|ref|NP_001595.2| paired box gene 6 isoform b; Paired box h...     775           0.0
gi|18138028|emb|CAC80516.1| paired box protein [Mus musculus]                773           0.0
gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus]                      770           0.0
gi|27469846|gb|AAH41712.1| Similar to paired box gene 6 [Xenopus laevis]     768           0.0
…

By Michael Schroeder, Biotec, 61 Introduction to Protein Structure
- Proteins play a variety of roles:
  - Structural (viral coat proteins, horny outer layer of human and animal skin, cytoskeleton)
  - Catalysis of chemical reactions (enzymes)
  - Transport and storage (e.g. haemoglobin)
  - Regulation (e.g. hormones)
  - Receptor and signal transduction
  - Genetic transcription
  - Recognition (cell adhesion molecules)
  - Antibodies and other proteins of the immune system

By Michael Schroeder, Biotec, 62 Proteins
- Are large molecules
- Only a small part (the active site) is functional
- Evolve by structural changes produced by mutations in the amino acid sequence
- Ca. 40,000 protein structures are now known
- Structures can be obtained by X-ray crystallography or nuclear magnetic resonance (NMR)

By Michael Schroeder, Biotec, 63 Structure of Proteins
- Backbone and sidechain (residues i-1, i, i+1):

   S(i-1) S(i)   S(i+1)   Sidechain (variable)
   |      |      |
…N-Cα-C-N-Cα-C-N-Cα-C-…   Mainchain (constant)
      ||     ||     ||
      O      O      O

- Polypeptide chain folds into a curve in space
- Common structural features
  - Alpha-helix
  - Beta-sheet

By Michael Schroeder, Biotec, 64 Hierarchy of Architecture
- Primary structure: Amino acid sequence
- Secondary structure: Helices, sheets, loops, hydrogen-bonding pattern of main chain
- Tertiary structure: Assembly and interactions of helices, sheets, etc.
- Quaternary structure: Assembly of monomers
- Evolution can merge proteins
  - Five enzymes in E. coli that catalyze successive steps in the biosynthesis of aromatic amino acids correspond to one protein in Aspergillus nidulans
  - Globins form tetramers in mammalian haemoglobin and dimers in the ark clam Scapharca inaequivalvis

By Michael Schroeder, Biotec, 65 Protein Structure Triosephosphate isomerase from Bacillus stearothermophilus Highly efficient enzyme appearing in most species

By Michael Schroeder, Biotec, 66 Hierarchy of Architecture: supersecondary structure
- Alpha-helix hairpin
- Beta hairpin
- Beta-alpha-beta unit

By Michael Schroeder, Biotec, 67 Hierarchy of Architecture
- Supersecondary structures:
  - Alpha-helix hairpin
  - Beta hairpin
  - Beta-alpha-beta unit
- Domains:
  - Compact unit, single chain, independent stability
- Modular proteins:
  - Multi-domain
  - Copies of related domains or “mix-and-match”

By Michael Schroeder, Biotec, 68 Classification of Protein Structure
- All Alpha: mostly alpha helices
- All Beta: mostly beta sheets
- Alpha+Beta: Helices and sheets in different parts of the molecule, no beta-alpha-beta units
- Alpha/Beta: Helices and sheets assembled from beta-alpha-beta units
  - Alpha/Beta linear
  - Alpha/Beta barrel
- Little or no secondary structure

By Michael Schroeder, Biotec, 69 SCOP: Structural Classification of Proteins
- Hierarchy levels: top, CLASS, FOLD, SUPERFAMILY, FAMILY
  - SUPERFAMILY = evolutionarily related, similar structure, not necessarily similar sequence
  - FAMILY = set of domains with similar sequence
- Entries shown in the figure:
  - All alpha (218), All beta (144), Alpha/Beta (136), Alpha+Beta (279)
  - Trypsin-like serine proteases (1), Immunoglobulin-like (23), Transglutaminase (1)
  - Immunoglobulin (6)
  - C1 set domains (antibody constant), V set domains (antibody variable)

By Michael Schroeder, Biotec, 70 Pymol

By Michael Schroeder, Biotec, 71
- Engrailed homeodomain (1enh): transcription factor important in development; used to study protein folding
- Utrophin calmodulin homology domain (1bhd): actin binding; closely related to dystrophin, whose lack causes muscular dystrophies (weak muscles)
- Cytochrome c, rice (1ccr): electron transport across the mitochondrial membrane
- DNA-binding domain of HIN recombinase (1hcr)

By Michael Schroeder, Biotec, 72
- Fibronectin III domain (1fna): found on the cell surface
- Mannose-binding protein (1npl)
- Barnase (1brn): cleaves RNA and is lethal if intracellular and not inhibited by barstar
- TATA-box-binding protein (1cdw)

By Michael Schroeder, Biotec, 73
- OB-domain from Lys-tRNA synthetase (1bbw)
- Scytalone dehydratase (3std)
- Alcohol dehydrogenase, NAD-binding domain (1ee2): breakdown of alcohol into simpler compounds
- Adenylate kinase (3adk): energy production

By Michael Schroeder, Biotec, 74
- Chemotaxis receptor methyltransferase (1af7)
- Thiamine phosphate synthase (2tps)
- Pancreatic spasmolytic polypeptide (2psp)

By Michael Schroeder, Biotec, 75 Protein Structure Prediction and Engineering
- If the sequence of amino acids contains enough information to specify the three-dimensional structure of proteins, it should be possible to devise an algorithm for prediction
- Secondary structure prediction: Which segments of the sequence are helices, which are strands?
- Fold recognition: Given
  - a library of known structures with their sequences and
  - a sequence with unknown structure,
  - can we find the structure that is most similar?
- Homology modelling
  - Given two homologous sequences, one with and one without a known structure: if more than 50% of the residues are identical, the known structure can serve as a model

By Michael Schroeder, Biotec, 76 Critical Assessment of Structure Prediction (CASP)
Chicken lysozyme          KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGS
Baboon alpha-lactalbumin  KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ES

Chicken lysozyme          TDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVS
Baboon alpha-lactalbumin  TEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILD

Chicken lysozyme          DGN-GMNAWVAWRNRCKGTDVQA-WIRGCRL-
Baboon alpha-lactalbumin  I--KGIDYWIAHKALC-TEKL-EQWL--CE-K

By Michael Schroeder, Biotec, 77 Clinical Implications
- Fast and reliable diagnosis of disease and risk:
  - With symptoms
  - In advance of appearance (e.g. Huntington)
  - In utero (e.g. cystic fibrosis: mutation in the cystic fibrosis transmembrane conductance regulator (CFTR), which is a chloride ion channel)
- Genetic counselling
- Customized treatment
  - E.g. childhood leukaemia is treated with the toxic drug 6-mercaptopurine. A small fraction of patients used to die because they lack the enzyme thiopurine methyltransferase.
- Identify drug targets
  - ½ are receptors, ¼ are enzymes, ¼ are hormones
  - 7% have unknown targets
- Gene therapy
  - Replace defective genes or supply gene products (insulin for diabetes and Blood Factor VIII for haemophilia)
- However: Most diseases do not have a single genetic cause!

By Michael Schroeder, Biotec, 78 Quick check
- By now you should
  - Have read chapter 1
  - Know the main data sources (sequence and structure)
  - Know the role that bioinformatics plays
  - Understand the difference between homology and similarity
  - Understand what sequence comparison and alignment are
  - Understand how they can be useful for phylogenetic studies
  - Understand primary, secondary, tertiary structure
  - Be able to assess the assumptions made and the quality of data
