Introduction based on Chapter 1 Lesk, Introduction to Bioinformatics.


Contents
Molecular biology primer
The role of computer science
Phylogeny
Sequence searching
Protein structure
Clinical implications
Read chapter 1

26 June 2000: Draft of the human genome announced!
1953: Watson and Crick discover the structure of DNA
2000: Draft of the human genome is announced
“The most wondrous map ever produced by humankind”
“One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”

High-throughput biomedicine
Microarrays
  Measure activity of thousands of genes at the same time
  Example: Cancer. Compare activity with and without drug treatment
  Result: Hundreds of candidate drug targets
RNAi (Nobel Prize 2006, Fire and Mello)
  Knock down genes and observe the effect
  Example: Infectious diseases. Which proteins orchestrate entry into the cell?
  Result: Hundreds of candidate proteins
Atomic force microscopes (co-invented by Nobel laureate Binnig)
  Pull a protein out of the membrane and measure the force
  Example: Eye diseases resulting from misfolding
  Result: Hundreds of candidate residues

Drug Discovery
Challenge: Longer time to market, fewer drugs, exploding costs
Approach: Use of compound libraries and high-throughput screening

HTS and Bioinformatics
High-throughput technologies have completely changed the work of biomedical researchers
Challenge: Interpret the (often large) results of screens
Approach: Before running secondary assays, use bioinformatics and IT to assemble all available information

Good News
>1,000,000 sequences
>16,000,000 articles
>700 DBs/tools
>30,000 3D structures

Bad News: Data != Knowledge
How to analyse data, how to integrate data? Computer science to the rescue…

Example: computer science is key for sequencing
Human genome is a string of length …
Shotgun sequencing: Break multiple copies of the string into shorter substrings
Example:
  shotgunsequencing shotgunsequencing shotgunsequencing
  cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un
Computing problem: Assemble the strings

Computer science key for sequencing
sh sho shot otgu tg gun un ns seq sequ equ uenc encing en cing ing
QUESTION: How can you handle long repetitive sequences? Heeeeelllllllllllooooooo
QUESTION: Why was a draft announced? When was the final version ready?
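The assembly problem above can be sketched as a greedy merge of the fragments with the longest suffix/prefix overlap. This is only a minimal illustration, not how production assemblers work: the function names and the four longer fragments in the usage note are made up for the example, and the greedy strategy is known to go wrong on repeats, which is exactly the difficulty the first question points at.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Fragments fully contained in another fragment add no information.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: several contigs remain
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


# Hypothetical reads, longer than the ones on the slide so that the
# overlaps are unambiguous.
reads = ["shotguns", "gunsequ", "sequenc", "uencing"]
print(greedy_assemble(reads))
```

With very short reads such as "ns" or "un", many overlaps of length 1 tie with each other and the result depends on the order in which ties are broken, so longer reads and higher coverage make the problem easier.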

Sequenced genomes (examples): Arabidopsis thaliana, mouse, rat, Caenorhabditis elegans, Drosophila melanogaster, Mycobacterium leprae, Vibrio cholerae, Plasmodium falciparum, Mycobacterium tuberculosis, Neisseria meningitidis Z2491, Helicobacter pylori, Xylella fastidiosa, Borrelia burgdorferi, Rickettsia prowazekii, Bacillus subtilis, Archaeoglobus fulgidus, Campylobacter jejuni, Aquifex aeolicus, Thermotoga maritima, Chlamydia pneumoniae, Pseudomonas aeruginosa, Ureaplasma urealyticum, Buchnera sp. APS, Escherichia coli, Saccharomyces cerevisiae, Yersinia pestis, Salmonella enterica, Thermoplasma acidophilum

Breakthrough of the year 2000
Next quest: Sequencing a genome for $1000

Quantity and quality of data lead to ambitious goals
Understand integrative aspects of the biology of organisms
Interrelate sequence, three-dimensional structure, interactions, and function of proteins, nucleic acids and protein-nucleic acid complexes
Travel in time: backward (deduce events in evolutionary history) and forward (deliberate modification of biological systems)
Applications in medicine, agriculture, and other scientific fields

Scenario
New virus (e.g. SARS) and goal to develop treatment. Numbers in brackets give the index of problem difficulty.
Scientists isolate genetic material of the virus
Screen genome for relationships with previously studied viruses [10]
From the virus' genome sequence they compute the proteins it produces [1]
Compute the proteins' three-dimensional structure and thereby obtain clues about their functions
  Screen for similar protein sequences with known structure [15]
  If any are found, then interpret the differences (homology modelling) [25]
  Else predict structure from sequence [55]
Identify or design a small molecule blocking relevant active sites of the protein [50]
Design antibodies to neutralize the virus [50]
Index of problem difficulty: <30: solution exists already, >30: we cannot solve this (yet)

Life in Time and Space
Life
  A biological organism is a naturally-occurring, self-reproducing device that effects controlled manipulations of matter, energy and information
Time
  Species evolve through natural mutation, recombination of genes in sexual reproduction, or direct gene transfer
  Read the past in contemporary genomes
Space
  Species occupy local ecosystems
  Species are composed of organisms
  Organisms are composed of cells
  Cells are composed of molecules

DNA – the molecule of life

Proteins
20 naturally occurring amino acids in proteins
Non-polar: G glycine, A alanine, P proline, V valine, I isoleucine, L leucine, F phenylalanine, M methionine
Polar: S serine, C cysteine, T threonine, N asparagine, Q glutamine, H histidine, Y tyrosine, W tryptophan
Charged: D aspartic acid, E glutamic acid, K lysine, R arginine
Other classification: H, F, Y, W are aromatic and play a role in membrane proteins
Distinguish atg = adenine-thymine-guanine and ATG = Alanine-Threonine-Glycine

The genetic code

Protein Structure
DNA: Nucleotides are very similar and hence the structure of DNA is very uniform
Proteins: Great variety in three-dimensional conformation to support diverse structures and functions
If heated, a protein “unfolds” to a biologically inactive structure; under normal conditions the protein folds

Paradox
Translation from DNA sequence to amino acid sequence is very simple to describe, but requires immensely complicated machinery (ribosome, tRNA)
The folding of the protein sequence into its three-dimensional structure is very difficult to describe, but occurs spontaneously
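How simple the description of translation is can be seen from a short sketch. It assumes the standard genetic code, written in the usual compact form (64 amino acid letters for codons enumerated in T, C, A, G order); the function name `translate` is just chosen for this example. Lower-case input is treated as nucleotides, in line with the atg/ATG convention above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a coding DNA sequence, reading frame 0, up to the first stop."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("atggcctaa"))  # start codon, one alanine codon, stop codon
```

The hard part in the cell is the machinery, not the mapping: the whole rule set fits into a 64-entry lookup table.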

Central Dogma
DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function

Observables and Data Archives
Databases in molecular biology cover nucleic acid and protein sequences, macromolecular structures and functions
Archival databanks of biological information
  DNA and protein sequences including annotations
  Nucleic acid and protein structures including annotations
  Protein expression patterns
Derived databases
  Sequence motifs (“signatures” of protein families)
  Mutations and variants in DNA and protein sequences
  Classification or relationships (e.g. hierarchy of structures)
Bibliographic databases (PubMed with 17M abstracts)
Collections of links to web sites of databases

What is Bioinformatics
Bioinformatics is the marriage of biology and information technology
Bioinformatics is an integrated multidisciplinary field
Covers computational tools and methods for managing, analysing and manipulating sets of biological data
Disciplines include: biochemistry, genetics, structural biology, artificial intelligence, machine learning, software engineering, statistics, database theory, information visualisation, algorithm design

Bioinformatics has three components
Creation of databases
Development of algorithms to analyse data
Use of these tools for analysing biological data

Databases: Types of Queries 1/2
1. Given a sequence (or fragment), find sequences in the database that are similar to it
2. Given a protein structure (or fragment), find protein structures in the database that are similar to it
3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures
4. Given a protein structure, find sequences in the database that correspond to similar structures

Databases: Given sequence, find structure
3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures.
  But how? Easy: Find similar sequences with known structure!
  But: There might be similar structures whose sequences are not similar!
4. Given a protein structure, find sequences in the database that correspond to similar structures.
  But how? Easy: Find similar structures and hence their sequences
  But: There are so many more sequences with unknown structure that this method will have only very limited success
1 and 2 are solved, 3 and 4 are active fields of research

Databases: Types of Queries 2/2
E.g. for which proteins of known structure that are involved in diseases of disrupted purine biosynthesis in humans are there related proteins in yeast?
Solution: Virtual databases that provide transparent access to a number of underlying data sources and to query and analysis tools

Databases: Curation and Quality
Problems:
Given that there are primary and secondary databases, how to control updates, how to propagate changes, how to maintain consistency?
Contents (experimental results, annotations, supplementary information) all have their own sources of error
Older data were limited by older techniques

Databases: Annotation
Experimental data (e.g. raw DNA sequence) need to be enriched with annotations
  Source of data
  Investigators responsible
  Relevant publication
  Feature tables (e.g. coding regions)
Problems:
  (Often) lack of a controlled and coherent vocabulary
  Annotations must be computer parseable
  Automated annotation needed
    SwissProt = ca. … annotated sequences
    TrEMBL = ca. 40 million unannotated sequences
  Maintenance of annotations (what if an error is detected?)

Computers and Computer Science
Relevant areas:
Artificial Intelligence
Machine Learning: neural networks, rule-based learning
Data mining: association rules
Software Engineering: design, implementation, testing of software
Programming
  Object-oriented: C++, Java
  Imperative: C, Modula, Pascal, Cobol, Fortran
  Logic: Prolog
  Functional: ML
  Scripting: Perl, Python
Statistics
Database theory: design and maintenance of databases; how to index sequences, time series, 3D structures
Information Visualisation: graph drawing, diagrams, cartoons, 3D graphics
Algorithm design: complexity of algorithms, efficient data structures

Programming
We will use Python
Scripting language
Supports string processing well
Widely used in bioinformatics

Biological Classification and Nomenclature
Back in the 18th century, Linnaeus, a Swedish naturalist, classified living things according to a hierarchy: Kingdom, Phylum, Class, Order, Family, Genus, Species
Generally only genus and species are used for identification
  Homo sapiens
  Drosophila melanogaster
  Bos taurus
Linnaeus’ classification is based on observed similarity
It largely reflects biological ancestry

Classification of Humans and Fruit Flies
            Human        Fruit fly
Kingdom:    Animalia     Animalia
Phylum:     Chordata     Arthropoda
Class:      Mammalia     Insecta
Order:      Primates     Diptera
Family:     Hominidae    Drosophilidae
Genus:      Homo         Drosophila
Species:    sapiens      melanogaster

Homology = derived from common ancestor
Characteristics derived from a common ancestor are called homologous
  E.g. eagle’s wing and human’s arm
Other apparently similar characteristics may have arisen independently by convergent evolution
  E.g. eagle’s wing and bee’s wing. The most recent common ancestor of eagles and bees did not have wings
Homologous characters may diverge functionally
  E.g. bones in the human middle ear and jaws of primitive fish

Sequence analysis and Homology
Sequence analysis gives the most unambiguous evidence for the relationships among species
For higher organisms, sequence analysis and the classical tools of comparative anatomy, palaeontology, and embryology are often consistent
For microorganisms there are problems
  Classical methods: how to describe features?
  Sequence analysis: lateral gene transfer

Domains of Life
Ribosomal RNA is present in all organisms
Based on 16S (small-subunit) ribosomal RNAs, life is divided into three domains
Bacteria
  No nucleus (prokaryote)
  E.g. Mycobacterium tuberculosis and E. coli
Archaea
  Few organisms, living in hostile environments (thermophiles, halophiles, sulphur reducers, methanogens)
Eukarya
  Has a nucleus enclosed in a membrane
  Nucleus contains chromosomes
  Internal compartments called organelles for specialised biological processes
  Area outside nucleus and organelles is called cytoplasm
  E.g. yeast and human beings

Eukaryotic cell

Domains of Life

Example: Use of sequences to determine phylogenetic relationships
Use ExPASy to search for pancreatic ribonuclease for horse (Equus caballus), minke whale (Balaenoptera acutorostrata), red kangaroo (Macropus rufus)
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC ) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST
Use sequence alignment to determine evolutionary relationship

Sequence alignment
1. Global match: align all of one sequence with all of the other (mismatches, insertions, deletions)
And.--so,.from.hour.to.hour.we.ripe.and.ripe
||||    ||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour.we.rot-.and.rot-
2. Local match: find a region in one sequence that matches the other (mismatches, insertions, deletions; ends can be ignored)
  My.care.is.loss.of.care,.by.old.care.done,
    |||||||||    |||||||||||||   |||||| ||
Your.care.is.gain.of.care,.by.new.care.won
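A global match of this kind is classically computed by dynamic programming (the Needleman-Wunsch algorithm). The sketch below assumes a deliberately simple scoring scheme, +1 for a match and -1 for a mismatch or a gap position; these values and the function name `global_align` are illustrative choices, not parameters taken from the slides.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment: returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


s, x, y = global_align("we.ripe.and.ripe", "we.rot.and.rot")
print(s)
print(x)
print(y)
```

A local match uses the same table with two changes (scores are not allowed to drop below zero, and the traceback starts at the highest-scoring cell), which is the Smith-Waterman variant.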

Sequence alignment
3. Motif search: find matches of a short sequence in a long sequence
Options: perfect match, 1 mismatch, mismatches + gaps + insertions + deletions
        match
         ||||
for the watch to babble and to talk is most tolerable
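The "perfect" and "1 mismatch" options of the motif search can be sketched with a sliding window that counts mismatching positions. Gaps, insertions and deletions are not handled here; the function name and the return format are assumptions made for the example.

```python
def find_motif(text, motif, max_mismatches=0):
    """Return (start, window, mismatches) for every window close enough to motif."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        window = text[start:start + len(motif)]
        mismatches = sum(1 for x, y in zip(window, motif) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


line = "for the watch to babble and to talk is most tolerable"
print(find_motif(line, "match"))                    # perfect matches only
print(find_motif(line, "match", max_mismatches=1))  # allow one mismatch
```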

Sequence alignment
4. Multiple sequence alignment
No.sooner.---met but.they.look’d
No.sooner.look’d but.they.lo-v’d
No.sooner.lo-v’d but.they.sigh’d
No.sooner.sigh’d but.they.--asked.one.another.the.reason
No.sooner.knew.the.reason.but.they sought.the.remedy
Conserved in all lines: No.sooner but.they.

Example: Multiple alignment
Use sequence alignment to determine evolutionary relationship…
Example: horse, whale and kangaroo
Expected: horse and whale are placental mammals, kangaroo is a marsupial
Multiple alignment with CLUSTAL-W
  Multiple sequence alignment computer program
  Main parameters: gap opening/extension penalty

FASTA format
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC ) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC ) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
DNSV
>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC ) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
YV
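Reading this format takes only a few lines of Python: a line starting with ">" opens a new record, and all following lines up to the next ">" are concatenated into one sequence. The sketch below uses only the standard library; the function name `parse_fasta` and the two short toy records are made up for the example (libraries such as Biopython offer more complete parsers).

```python
def parse_fasta(text):
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# Toy input, not real database entries.
example = """>seq1 toy record
KESPAMKF
ERQHMDSG
>seq2 another toy record
RESPAMKF
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```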

Multiple Alignment with ClustalW (http://www.genome.jp/tools/clustalw)
CLUSTAL W (1.82) multiple sequence alignment

sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                      *:** **:*****:  :......*** **  *.**.* ***:***:**.   *.*:* *

sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                     :*: ****::***:*.* : **:** *..****** *:**: :::*******  ******

sp|P00674|RNP_HORSE  DASVEVST 128
sp|P00673|RNP_BALAC  DNSV
sp|P00686|RNP_MACRU  DAYV
                     *  *

Example: Number of Identical Aligned Residues
Horse and minke whale: 95
Minke whale and red kangaroo: 82
Horse and red kangaroo: 75
Conclusion: Horse and whale share the most identical residues
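Counts like these come from walking along two rows of the multiple alignment and counting the columns in which both rows carry the same residue. A minimal sketch, assuming both rows are already aligned to equal length and use "-" for gaps (the function name is an assumption for this example):

```python
def count_identities(row_a, row_b):
    """Number of alignment columns where both rows have the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("rows of an alignment must have equal length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")


# First ten columns of the horse and kangaroo rows of the alignment above.
print(count_identities("KESPAMKFER", "-ETPAEKFQR"))
```

Applied to the complete rows of each pair, the same function gives the pairwise totals that the slide compares.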

New Example: Elephant and Mammoth
Mitochondrial cytochrome b from
  Siberian woolly mammoth (Mammuthus primigenius), preserved in arctic permafrost
  African elephant (Loxodonta africana)
  Indian elephant (Elephas maximus)
Q: To which one is the mammoth more closely related?

African elephant: sp|P24958|CYB_LOXAF
Mammoth: sp|P92658|CYB_MAMPR
Indian elephant: sp|O47885|CYB_ELEMA

MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
*** ** ***:**:**********************************************

TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
************************************************************

LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
**************************************:*********************

LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
:********:***********************************************:**

LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
******************************************************:*****

LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
**:*************************: *** **********:***************

IILAFLPIAGVIENYLIK 378
IILAFLPIAGMIENYLIK 378
**********:*******

Example: Elephant and Mammoth
Mammoth and African elephant have 10 mismatches, mammoth and Indian elephant 14. Significant?
Q1: Can we tell from these sequences alone that they are closely related?
Q2: The differences are small. Do they come from selection, random noise or drift?
Strategies are needed for judging whether differences and similarities are significant

Excursion: Similarity and Homology
Important difference:
Similarity is the measurement of resemblance of sequences
Homology: common ancestor
Similarity is gradual, homology is either true or false
Similarity = now, homology = past events
Homology is only very rarely directly observed (e.g. lab population, clinical study of viral infection)
Homology is inferred from sequence similarity

Example: Homology/Similarity
The assertion that the cytochrome b sequences are homologues means that there is a common ancestor
BUT:
1. Maybe cytochrome b functionally requires so many conserved residues that similar sequences would occur in many species (in fact, this is not the case here)
2. Maybe cytochrome b has to function this way in elephant-like species, but in fact started out from different ancestors (i.e. convergent evolution)
3. Maybe mammoth and African elephant have fewer mismatches only because the Indian elephant’s DNA mutated faster
4. Maybe all of them acquired cytochrome b through a virus (horizontal gene transfer)
If the mammoth and elephant sequences are homologues, are the ribonuclease sequences homologues as well? There the differences are much bigger

Examples: Conclusion
Classical methods confirm that for pancreatic ribonuclease (horse, whale, kangaroo) inferring homology from similarity is justified
But whether mammoths are closer to African or Indian elephants is too close to call (non-significant)
Problems with inferring phylogeny from gene and protein sequence comparison
  Wide range of variation (possibly below statistical significance)
  Different rates of evolution for different branches of the evolutionary tree
  Even if there is a relationship: which sequence came first?

Inferring Phylogenies with SINES and LINES
Phylogeneticist’s dream of features:
  ‘All-or-none’ character
  Irreversible appearance
Solution: SINES and LINES (Short and Long Interspersed Nuclear Elements)
  Repetitive, non-coding sequences in eukaryotic genomes
  >30% in human genome, >50% in some plants
  SINES: … base pairs long, up to 10^6 copies
  LINES: up to 7000 base pairs, up to 10^5 copies
  They enter the genome by reverse transcription of RNA

A practical example: Fatherhood
The picture shows a Southern blot of DNA from different family members, probed using a mini-satellite. You can work out which of F1 and F2 is the father of child C, by observing which bands they have in common. (Reproduced from "Essential Medical Genetics" by M.Connor and M.Ferguson-Smith, with permission from Blackwell Science.)

Why SINES are useful in phylogeny
Either present or absent
Inserted at random in the non-coding portion of the genome, i.e. a SINE has no important function, so that convergent evolution can be excluded
  Presence of a SINE in two species and absence in a third implies that the first two species are more closely related
SINE insertion appears to be irreversible
  Temporal order: presence of a SINE in two species and absence in a third implies that the ancestor of the first two species is younger than the ancestor of all three
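The presence/absence argument above can be sketched in a few lines: record for each species the set of SINE insertion sites it carries, then count the shared sites for every pair. The species names, locus labels and counts below are invented toy data for illustration, not measurements.

```python
from itertools import combinations


def shared_insertions(presence):
    """Count shared SINE loci for every pair of species."""
    return {(a, b): len(presence[a] & presence[b])
            for a, b in combinations(sorted(presence), 2)}


def closest_pair(presence):
    """Pair of species sharing the most insertions (first such pair on ties)."""
    counts = shared_insertions(presence)
    return max(counts, key=counts.get)


# Hypothetical presence/absence data: loci L1-L4 were inserted in a common
# ancestor of hippo and whale only, L5 in an older ancestor shared with cow.
toy = {
    "whale": {"L1", "L2", "L3", "L4", "L5"},
    "hippo": {"L1", "L2", "L3", "L4", "L5"},
    "cow":   {"L5"},
    "camel": set(),
}
print(shared_insertions(toy))
print(closest_pair(toy))
```

Because insertions are assumed to be irreversible and to occur at random sites, a locus present in two species but absent in a third also orders the branching events in time, as stated above.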

Example revisited
Q: What is the closest land-based relative of the whales?
Classical palaeontology links Cetacea (whales, dolphins, porpoises) with Artiodactyla (including e.g. cattle)
Belief that Cetaceans diverged before Artiodactyla split into the suborders Suiformes (e.g. pigs), Tylopoda (e.g. camels, llamas) and Ruminantia (e.g. deer, cattle, goats, sheep, antelopes, giraffes)

Example revisited Sequence comparison results
Based on mitochondrial DNA, pancreatic ribonuclease, fibrinogen, and others
Closest relatives of whales are hippopotamuses (they share 4 SINES)
These two are closest to Ruminantia

Searching for Similar Sequences with PSI-Blast
Any search method for sequences should be
  Sensitive: pick up distant relationships
  Selective: reported relationships are true
Example: database with (among others) 1000 globin sequences
  Globin family (oxygen transport) of proteins occurs in many species
  Proteins have the same function and structure
  But there are pairs of members of the family sharing less than 10% identical residues
Sequence database: 1000 globin sequences; search results: 900
  True positives: 700 out of 900 are really globins
  False positives: 200 out of 900 are not globins
  False negatives: 300 out of 1000 are not found
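The numbers of the globin example translate directly into the two quality measures. In the sketch below, "sensitive" is computed as the fraction of true globins that were found, and "selective" as the fraction of reported hits that are true (often called precision); the function names are chosen for this example.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of the true family members that the search found."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Fraction of the reported hits that really belong to the family."""
    return true_pos / (true_pos + false_pos)


# Counts from the globin example: 1000 globins in the database, 900 hits.
tp, fp, fn = 700, 200, 300
print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"precision   = {precision(tp, fp):.2f}")
```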

Searching for Distant Relationships with PSI-BLAST
How can we find distant relationships without increasing the false positives?
PSI-BLAST: Position-Specific Iterated Basic Local Alignment Search Tool
  Identifies conserved patterns within the sequences
  Improves sensitivity and specificity
Score via intermediaries may be better than score from direct comparison
  Figure: sequences A, B and C; 50% identity via the intermediate, only 10% between the ends in direct comparison

PSI-BLAST Example
Human PAX-6 gene (SwissProt ID P26367) has homologues in many different species (human, Drosophila, etc.)
Transcription factor for eye development
Mutations in:
  Human: no or deformed iris
  Drosophila: no eyes; if expressed in wing or leg: ectopic eyes
PSI-BLAST at the NCBI site

Result

Result
Description of sequence
Max score: linked to data that show where the sequences match
Total score: includes scores from non-contiguous portions of the subject sequence that match the query
Query coverage
Identity: percentage of identical residues in the aligned region
E-value
Accession number: linked to the GenBank record

Result
BLASTP 2.2.28+
RID: 6D2U321501N
Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
33,121,465 sequences; 11,555,699,950 total letters
Query= gi| |sp|P |PAX6_HUMAN RecName: Full=Paired box protein Pax-6; AltName: Full=Aniridia type II protein; AltName: Full=Oculorhombin
Length=422

Sequences producing significant alignments:                Score (Bits)   E Value
ref|NP_ | paired box protein Pax-6 isoform a [Homo sap
ref|XP_ | PREDICTED: paired box protein Pax-6 isofo
ref|XP_ | PREDICTED: paired box protein Pax-6 isofo
ref|XP_ | PREDICTED: paired box protein Pax-6 isofo
ref|XP_ | PREDICTED: paired box protein Pax-6 isofo
ref|NP_ | paired box protein Pax-6 [Bos taurus] >re
gb|AAA | oculorhombin [Homo sapiens]
ref|NP_ | paired box protein Pax-6 [Rattus norvegicus]
gb|EAW | paired box gene 6 (aniridia, keratitis), isofo
...

Introduction to Protein Structure
Proteins play a variety of roles:
Structural (viral coat proteins, horny outer layer of human and animal skin, cytoskeleton)
Catalysis of chemical reactions (enzymes)
Transport and storage (e.g. haemoglobin)
Regulation (e.g. hormones)
Receptor and signal transduction
Genetic transcription
Recognition (cell adhesion molecules)
Antibodies and other proteins of the immune system

Proteins
Are large molecules
Only a small part, the active site, is functional
Evolve by structural changes produced by mutations in the amino acid sequence
Ca. … human protein structures are now known
Overall … protein structures in PDB
Structures can be obtained by X-ray crystallography or nuclear magnetic resonance (NMR)

Structure of Proteins
Backbone and side chain
(S = side chain of residue i-1, residue i, residue i+1, from left to right)

     S      S      S          Side chain (variable)
     |      |      |
...N-Cα-C-N-Cα-C-N-Cα-C-...   Main chain (constant)
        ||     ||     ||
        O      O      O

Polypeptide chain folds into a curve in space
Common structural features
  Alpha-helix
  Beta-sheet
  Turns and loops

Hierarchy of Architecture
Primary structure: Amino acid sequence
Secondary structure: Helices, sheets, loops, hydrogen-bonding pattern of main chain
Tertiary structure: Assembly and interactions of helices, sheets, etc.
Quaternary structure: Assembly of monomers
Evolution can merge proteins
  E.g.: 5 enzymes in E. coli = 1 protein in the fungus Aspergillus nidulans; they catalyze successive steps in the biosynthesis of aromatic amino acids
  E.g.: Globins form tetramers in mammalian haemoglobin and dimers in the ark clam Scapharca inaequivalvis

Protein Structure
Triosephosphate isomerase from Bacillus stearothermophilus
Converts DHAP to GAP in glycolysis
Highly efficient enzyme appearing in most species

Extra layer of Architecture: supersecondary structure
Supersecondary structure = patterns of interaction between helices and sheets
  Alpha-helix hairpin
  Beta hairpin
  Beta-alpha-beta unit

Hierarchy of Architecture
Supersecondary structures: alpha-helix hairpin, beta hairpin, beta-alpha-beta unit
Domains: compact unit, single chain, independent stability
Modular proteins: multi-domain; copies of related domains or “mix-and-match”

Classification of Protein Structure
All Alpha: mostly alpha helices
All Beta: mostly beta sheets
Alpha+Beta: helices and sheets in different parts of the molecule, no beta-alpha-beta units
Alpha/Beta: helices and sheets assembled from beta-alpha-beta units
  Alpha/Beta linear
  Alpha/Beta barrel
Little or no secondary structure

SCOP: Structural Classification of Proteins
top
  CLASS: All alpha (284), All Beta (174), Alpha/Beta (147), Alpha+Beta (376)
    FOLD: Immunoglobulin-like (23), Trypsin-like serine proteases (1)
      SUPERFAMILY: Immunoglobulin (6), Transglutaminase (1)
        FAMILY: C1 set domains (antibody constant), V set domains (antibody variable)
SUPERFAMILY = evolutionarily related, similar structure, not necessarily similar sequence
FAMILY = set of domains with similar sequence

Engrailed homeodomain (1enh)
  Transcription factor important in development
  Used to study protein folding
Utrophin calmodulin homology domain (1bhd)
  Actin binding
  Closely related to dystrophin, whose lack causes muscular dystrophies (weak muscles)
Cytochrome c, rice (1ccr)
  Electron transport across the mitochondrial membrane
DNA-binding domain of HIN recombinase (1hcr)

Engrailed homeodomain (1enh)

Fibronectin III domain (1fna)
  Found on the cell surface
Mannose-binding protein (1npl)
Barnase (1brn)
  Cleaves RNA and is lethal if intracellular and not inhibited by barstar
TATA-box-binding protein (1cdw)

Scytalone dehydratase (3std)
OB-domain from Lys-tRNA synthetase (1bbw)
Alcohol dehydrogenase, NAD-binding domain (1ee2)
  Breakdown of alcohol into simpler compounds
Adenylate kinase (3adk)
  Energy production

Chemotaxis receptor methyltransferase (1af7)
Thiamine phosphate synthase (2tps)
Pancreatic spasmolytic polypeptide (2psp)

Protein Structure Prediction and Engineering
If the sequence of amino acids contains enough information to specify the three-dimensional structure of proteins, it should be possible to devise an algorithm for prediction
Secondary structure prediction: Which segments of the sequence are helices, which are strands?
Fold recognition: Given a library of known structures with their sequences and a sequence with unknown structure, can we find the structure that is most similar?
Homology modelling: Given two homologous sequences, one with and one without known structure. If between 30 and 50% of the residues are identical, the known structure can serve as a model

Critical Assessment of Structure Prediction (CASP)
Chicken lysozyme          KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGS
Baboon alpha-lactalbumin  KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ES

Chicken lysozyme          TDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVS
Baboon alpha-lactalbumin  TEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILD

Chicken lysozyme          DGN-GMNAWVAWRNRCKGTDVQA-WIRGCRL-
Baboon alpha-lactalbumin  I--KGIDYWIAHKALC-TEKL-EQWL--CE-K

Clinical Implications of Sequencing
Fast and reliable diagnosis of disease and risk:
  Easy diagnosis (with symptoms)
  In advance of appearance (e.g. Huntington)
  In utero diagnosis (e.g. cystic fibrosis: thick secretions in lung)
  Genetic counselling
Customized treatment (predict response to therapy/side effects)
  E.g. childhood leukaemia is treated with the toxic drug 6-mercaptopurine. A small fraction of patients used to die as they lack the enzyme thiopurine methyltransferase.
Identify drug targets
  Nowadays targets are: about ½ receptors, about ¼ enzymes, about ¼ hormones; about 7% of drugs have unknown targets
Gene therapy
  Replace defective genes or supply gene products (insulin for diabetes and Blood Factor VIII for haemophilia)
However: Most diseases do not have a single genetic cause!

Quick check
By now you should
  Have read chapter 1
  Know the main data sources (sequence and structure)
  Know the role that bioinformatics plays
  Understand the difference between homology and similarity
  Understand what sequence comparison and alignment are
  Understand how they can be useful for phylogenetic studies
  Understand primary, secondary, tertiary structure
  Be able to assess the assumptions made and the quality of data
