Introduction to computational biology

Craig A. Stewart Indiana University 1 July 2005 Höchstleistungsrechenzentrum Stuttgart

License terms Please cite as: Stewart, C.A. Introduction to computational biology. Tutorial presented at High Performance Computing Center, Stuttgart. 1 July 2005, Stuttgart, Germany. Some figures shown here are taken from the web, under an interpretation of fair use that seemed reasonable at the time and within reasonable readings of copyright interpretations. Such diagrams are indicated here with a source url. In several cases these web sites are no longer available, so the diagrams are included here for historical value. Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license. This license includes the following terms: You are free to share – to copy, distribute and transmit the work – and to remix – to adapt the work – under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Planned schedule for the day
9:00-9:15 Introduction and objectives
9:15-10:00 An introduction to the biological basis
10:00-10:15 Biological data sources
10:15-10:30 Similarity searching and alignment
10:30-10:45 Coffee break
10:45-11:30 Similarity searching and alignment, continued
11:30-12:00 Multiple alignment
12:00-13:00 Lunch
13:00-13:30 Phylogenetics
13:30-14:00 RNA and protein structure
14:00-14:30 Systems biology
14:30-14:45 Miscellaneous semi-random topics

Plan & Objectives Materials focus on open source software (generally not the presenter's own work). Objectives: at the end of the class, participants should understand enough biology to understand key computational biology problems; be familiar with some strategies for collaborating with biologists and biomedical scientists; be conversant with key open source applications in computational biology and bioinformatics, and with current problems in these areas; and be ready to download some code and start making it better!

Motivation The “-omics” trend
Finding press pieces about huge computing problems is easy How many bio codes really scale to hundreds of processors? What are the coming high performance needs of biologists? Importance of computational biology and bioinformatics to the HPC community The challenges and promise are real Successes and failures so far Successes: Protein structure, Genome assembly, Surgical assistance, Phylogenetics Mismatched priorities: Ab initio protein folding Not yet successful: Drug discovery

What has changed recently?
Bioinformatics not new Protein structure Phylogenetics What is new is high-throughput sequencing: Lots more data The possibility of going from a knowledge of the DNA sequence to an understanding of diseases and health

Genome Projects Timeline
First virus (SV40) sequenced (5224 base pairs)
DOE announces Human Genome Initiative
First complete map of all human chromosomes
First living organism sequenced (H. influenzae) - 2 Mb
Yeast (S. cerevisiae) - 12 Mb
Intestinal bacterium (E. coli) - 5 Mb
Nematode worm (C. elegans) Mb
Celera announcement; public effort regroups
Human Chromosome 22 – 34 Mb
Joint announcement by NHGRI – Celera
“As good as it gets” human genome
This slide based on slide by Manfred D. Zorn

Definitions Computational Biology: any use of advanced information technology in the study of biological problems. “Bioinformatics applies the principles of information sciences and technologies to make the vast, diverse and complex life sciences data more understandable and useful” (NIH BISTIC Committee, grants1.nih.gov/grants/bistic/CompuBioDef.pdf). Genomics – study of genomes and gene function. Proteomics – study of proteins and protein function. ___omics –

Challenges Different types of biological data at different scales
Data of varying quality. Much of the underlying biology is not well understood. Prior to the availability of high-throughput sequencing, scientists could only study small pieces of the genetic information of any organism. Now the entire genomes of several organisms have been sequenced, but knowing the genome is different from knowing how it works!

Comparison of Complexity
Physics & Chemistry 2 elementary particles 4 forces 112 elements When random events occur it is often possible to study average behavior Typically ahistoric (astrophysics an exception) Biology 3B base pairs in humans Min. 30,000 genes in humans ~1.5M species Individual random events important; no law of large numbers Intensely historic, heavily contingent

Complexity, Con't Chip design Cells All components known
Device physics for individual components known Itanium has 3 x 10^8 connections and 2 x 10^8 devices Unified basic currency (electrons) Computer program required to understand (e.g. SPICE) Cells Components not known Function of individual components not known # components ~10^13 No unified basic currency Ecell, Karyote, etc. attempting to model cells Computer programs do not yet permit full understanding

A rapid introduction to key elements of biology

Why is it important to know some biology?
Would you study numerical methods without knowing some mathematics? Much current biological knowledge is very specific to particular organisms, genes, or diseases. If you just wade into the available data online you can do some very silly things.
[Photo: Anopheles gambiae. From mosquito/mtm/index.html. Source Library: Centers for Disease Control. Photo Credit: Jim Gathany]

Central dogma of biology
The central dogma of biology (first put forward by Crick in 1958) is that genes act to create phenotypes through a flow of information from DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population.

Cell Structure
Eukaryotes: Chromosomes linear; Introns, exons, postprocessing; Nucleus & nuclear envelope; Mitochondria and (in plants) chloroplasts. Prokaryotes: Chromosome circular; Location is everything; No nucleus; No plastids

Four (or Five) Bases DNA consists of four nucleotides: Cytosine, Thymine, Adenine, and Guanine. In the double helix, A&T are always bound, and C&G are always bound to each other RNA consists of four nucleotides as well: Cytosine, Uracil, Adenine, and Guanine RNA may loop back on itself but it does not form a double helix

Transcribing DNA to RNA and Translating RNA to Proteins
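DNA: AAAAAGGAGCAAATT
RNA: UUUUUCCUCGUUUAA
[Figure: positions numbered 1 to 6 marked along the two sequences]
One possible amino acid string, reading the RNA from its first base with the standard genetic code: Phe Phe Leu Val (UUU UUC CUC GUU), followed by the stop codon UAA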
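The two steps on this slide can be tried out in a few lines of Python. This is a minimal sketch, not production code: the function names `transcribe` and `translate` are made up for illustration, the DNA string is treated as the template strand exactly as drawn on the slide, and the codon table is the standard genetic code written with one-letter amino acid symbols (F = Phe, L = Leu, V = Val, E = Glu, `*` = stop).

```python
from itertools import product

# Template-strand base -> RNA base (illustrative helper, not a library API).
COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}

# Standard genetic code, codons ordered U, C, A, G at each position.
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), _AMINO)}


def transcribe(dna_template):
    """Complement each base of the DNA template, writing U instead of T."""
    return "".join(COMPLEMENT[base] for base in dna_template)


def translate(rna, frame=0):
    """Translate codon by codon from the given offset until a stop codon."""
    peptide = []
    for i in range(frame, len(rna) - 2, 3):
        amino_acid = CODONS[rna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


rna = transcribe("AAAAAGGAGCAAATT")
print(rna)             # UUUUUCCUCGUUUAA
print(translate(rna))  # FFLV
```

Passing `frame=1` or `frame=2` reads the same RNA in the other two forward reading frames, which is why a single sequence can encode more than one possible amino acid string.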

Genetic Code
Ala Alanine; Arg Arginine; Asn Asparagine; Asp Aspartic acid; Cys Cysteine; Glu Glutamic acid; Gln Glutamine; Gly Glycine; His Histidine; Ile Isoleucine; Leu Leucine; Lys Lysine; Met Methionine; Phe Phenylalanine; Pro Proline; Ser Serine; Thr Threonine; Trp Tryptophan; Tyr Tyrosine; Val Valine
Source: Original8Hour/Genetics/geneticcode.html

Sickle Cell
Normal RBC: GAG codes for glutamic acid; disc-shaped, soft; easily flows through small blood vessels; lives for 120 days. Sickle RBC: GTG codes for valine; sickle-shaped, hard; often gets stuck in small blood vessels; lives for 20 days or less. Malaria vs. anaemia! ency/imagepages/1223.htm

What is a Gene? An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule which in turn has an influence on some characteristic phenotype of the organism. Early views: genes lined up on the chromosome like beads on a string; one gene => one protein Examples of genes: color blindness, sickle-cell anaemia Mendelian genes, Sex-linked genes, Quantitative traits Annotation: Extraction, definition, and interpretation of features on the genome sequence Annotations vs. genes: Many annotations describe features that constitute a gene. Others may not always directly correspond in this way An annotation is what we think… nature may disagree! Inheritance problem with annotations

Human Chromosomes Prokaryotes, and eukaryotes that reproduce asexually, just get a copy of all of their genes from their ‘parent’. Eukaryotes that reproduce sexually (that includes us) receive one copy of each chromosome from each parent. This means we have two copies of the same information – one copy from each parent. Genes can be dominant (brown eyes) or recessive (blue eyes). Human_Genome/graphics/slides/ elsikaryotype.html

Mendelian Genetics Round peas are “dominant” to wrinkled peas, so that a pea with the genotype RR or Rr is going to look round. Only peas with the rr genotype will look wrinkled. For very simple Mendelian genes, you can use something called a Punnett square. Many genes are inherited in ways OTHER than simple Mendelian genetics, which is why sequencing the genome has been very important!
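A Punnett square is small enough to enumerate by program. The sketch below assumes a single gene with two alleles written `R` (round, dominant) and `r` (wrinkled, recessive); the helper names `punnett` and `phenotype` are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import product


def punnett(parent1, parent2):
    """Count offspring genotypes for a one-gene cross, e.g. punnett('Rr', 'Rr')."""
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sorting puts the uppercase (dominant) allele first, so 'rR' becomes 'Rr'.
        genotype = "".join(sorted(allele1 + allele2))
        offspring[genotype] += 1
    return offspring


def phenotype(genotype, dominant="R"):
    """A pea looks round if it carries at least one dominant allele."""
    return "round" if dominant in genotype else "wrinkled"


cross = punnett("Rr", "Rr")
print(dict(cross))  # {'RR': 1, 'Rr': 2, 'rr': 1}

looks = Counter()
for genotype, count in cross.items():
    looks[phenotype(genotype)] += count
print(dict(looks))  # {'round': 3, 'wrinkled': 1}
```

The 3:1 ratio of round to wrinkled peas is the classic result for a cross of two heterozygous parents.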

Gene Components Prokaryotes Location is everything
Essentially all of the DNA is transcribed (few mitochondrial diseases). Eukaryotes: Non-contiguous pieces of DNA may comprise one gene. Start sequence (complicated and long). Stop codons – end translation. Exons – portions of sequence that are transcribed and used. Introns – portions of sequence that are not used.

Alternate splicing

Population genetics & evolution
Mutations create the raw material for evolution Natural selection and chance affect the frequency with which particular genes or DNA sequences are present in populations Given enough time and enough change, evolution, speciation, and so forth happen Genes can be ‘fixed’ or ‘maintained in an equilibrium’ in a population by chance or through natural selection

Key points (so far) Biological processes are complicated; the historicity and complexity of biological processes, and our lack of understanding of many matters, make biology an interesting topic! The central dogma of molecular biology is that genes act to create phenotypes through a flow of information from DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes constitute a population. DNA consists of four bases (A, T, C, G). A is always paired with T; C is always paired with G. DNA is transcribed into RNA. RNA consists of four bases as well (A, U, C, G). The linear sequence of DNA is transcribed into RNA and then translated into proteins. The 3D configuration of a protein is the basis for its function.

Bioinformatics data sources

MedLine & PubMed Medline:
U.S. National Library of Medicine – NLM. Medline: ~12 million references on life sciences/biomedicine (since 1966). PubMed: standard search tool for Medline. Structured language: Medical Subject Headings

Genomic, Proteomic, etc. data sources
A tremendous amount of data is available through public data sources via the Web, ftp, or by other means. To analyze biological data, we first have to get it. Several ways to organize presentation of material – by site, by type, etc. We will organize by data type. Types of databases: Chromosomal; DNA/Genes; Protein; Biochemistry and metabolic pathways; Microarray; Web collections

DNA databases GenBank. Operated by NCBI (National Center for Biotechnology Information). European Molecular Biology Laboratory – Nucleotide Sequence Database. DNA Database of Japan (DDBJ). All share data daily. Update conflicts avoided by policy. Differences are in “value added” and interfaces

Data Structures Current
Primary DNA repository data based on ASN.1. Makes possible linkages among many types of biomedical info. The software libraries now often handle XML as well. Software toolkits and docs are available.
GenBank Flat File format
FASTA
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
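FASTA is simple enough to read without any library: a line starting with `>` is a header, and every following line up to the next header is sequence. The sketch below is a minimal reader under that assumption; `parse_fasta` is a name invented for this example, and real projects would more likely use an existing toolkit.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

for header, sequence in parse_fasta(example):
    identifiers = header.split(" ", 1)[0].split("|")
    print(identifiers, len(sequence))
```

The header of this record packs several identifiers into one `|`-separated field, which the loop splits apart for display.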

Protein Structure NCBI Swiss-Prot/TrEMBL at http://www.expasy.org/
Note: 125,744 chemically determined vs 861,482 inferred from automated translation of DNA sequences! Protein Data Bank (PDB) - one of the first online bioinformatics databases!

Biochemistry and pathways
ENZYME (part of the ExPASy system); BIND (part of the NCBI system). Pathways: PathDB, KEGG, WIT

Web Resources - General NCBI http://www.ncbi.nlm.nih.gov/
EBI Biocatalog IUBio Archive

Similarity matching & Sequence Alignment

Why pattern matching (and what are the problems)
US! and… Bonobo

Problems! For proteins, 95% similarity is ~ identical, 80% similarity is a lot. Even less similarity than that needed for DNA Database techniques inadequate – they are too precise! Datasets very large to search Homology Common ancestry Sequence (and usually structure) conservation Homology is inferred rather than measured Identity Objective and well defined Can be quantified easily, but not very useful! Similarity Most common method used, but not as easily defined

How do sequences differ?
Differences in individual bases:
CGTACCGTTAATAT
CGTACCGATAATAT
Bases may be added to a sequence:
CGTACCCCGTAATAT
CGTACC..GTAATAT
Bases may be deleted from a sequence:
CGTACCGTTAATAT
CGTACCG...ATAT

Alignment An alignment is an arrangement of two sequences opposite one another It shows where they are different and where they are similar We want to find the optimal alignment - the most similarity and the least differences Alignments have two aspects: Quantity: To what degree are the sequences similar (percentage, other scoring method) Quality: Regions of similarity in a given sequence

Scoring Alignments
Matches are good: they get a positive value. Mismatches are bad: they get a negative value. Gaps are bad: they get a negative value, made up of a gap opening penalty and a gap extension penalty.
$$\text{Score} = \text{Matches} - \text{Mismatches} - \sum_{\text{gaps}} \left(\text{gap opening penalty} + \text{gap length} \times \text{gap extension penalty}\right)$$
[Figure: GCTAAATTC aligned against GC AAGTT, with matches marked + and mismatches marked x]
One long gap:
CGTACCGTTAATAT
CGTACCG...ATAT
Several short gaps:
CGTACCGTTAATAT
CGT.C.GTT.ATAT
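The scoring rule can be applied mechanically to two already-aligned strings. The sketch below is illustrative only: the function name `score_alignment` and the default values (match 1, mismatch 1, gap opening 2, gap extension 1) are assumptions picked for the example, not values from any particular program, and `.` is used as the gap character as on the slide.

```python
def score_alignment(a, b, match=1, mismatch=1, gap_open=2, gap_extend=1, gap="."):
    """Score two aligned strings of equal length with affine gap penalties."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in = None  # which sequence the current gap run is in, if any
    for x, y in zip(a, b):
        current = "a" if x == gap else "b" if y == gap else None
        if current is not None:
            if current != gap_in:
                score -= gap_open   # a new gap run starts here
            score -= gap_extend     # every gap position pays the extension cost
        else:
            score += match if x == y else -mismatch
        gap_in = current
    return score


# One gap of length 3: 11 matches - (2 + 3 * 1) = 6
print(score_alignment("CGTACCGTTAATAT", "CGTACCG...ATAT"))
# Three gaps of length 1: 11 matches - 3 * (2 + 1) = 2
print(score_alignment("CGTACCGTTAATAT", "CGT.C.GTT.ATAT"))
```

With these assumed penalties, the alignment with one long gap scores higher than the one with three short gaps, even though both leave the same number of bases unmatched. That is the point of separating the opening penalty from the extension penalty.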

Dotter Simple way to get a feel for how sequences compare to each other. Used both with DNA and Protein sequences "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin Gene 167(2):GC1-10 (1995) Modular nature of proteins

Local Alignments with BLAST
Basic Local Alignment Search Tool. First a quick demo (hopefully). So, what did we do? BLAST – Basic Local Alignment Search Tool. In particular, BLASTn (for nucleotides). Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. Basic local alignment search tool. Journal of Molecular Biology 215:

(Original) BLAST Algorithm
Original algorithm does not permit gaps The original BLAST algorithm is a local (heuristic) alignment tool Given a search sequence, e.g. ACGTAGGCATGAA BLAST first makes a list of all “words” of a given length that would possibly have a score of at least T against the search string. In the case of this example there would be (at least) the following: ACGTAGGCATG CGTAGGCATGA GTAGGCATGAA
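The word list for this example can be generated directly. The sketch below is a simplification: it only enumerates the exact words of length 11 contained in the query and looks them up in a subject string, whereas BLAST's word list can also include similar words scoring at least T. The names `query_words` and `find_seeds`, and the subject string, are made up for illustration.

```python
def query_words(query, w=11):
    """All overlapping words of length w contained in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def find_seeds(query, subject, w=11):
    """Return (query_pos, subject_pos) for every exact word match."""
    seeds = []
    for q_pos, word in enumerate(query_words(query, w)):
        s_pos = subject.find(word)
        while s_pos != -1:
            seeds.append((q_pos, s_pos))
            s_pos = subject.find(word, s_pos + 1)
    return seeds


query = "ACGTAGGCATGAA"
print(query_words(query))
# ['ACGTAGGCATG', 'CGTAGGCATGA', 'GTAGGCATGAA']

subject = "TTACGTAGGCATGAACC"  # hypothetical database sequence
print(find_seeds(query, subject))
# [(0, 2), (1, 3), (2, 4)]
```

All three seeds lie on the same diagonal (subject position minus query position is 2 in each case), which signals that they belong to one longer match.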

(Original) BLAST Algorithm, 2
BLAST takes the list of all words with a score of at least T against the string one is trying to match, and then searches a database for any matches to these words. So if one were using the example and the NR database, BLAST would search NR for all occurrences of the words: ACGTAGGCATG CGTAGGCATGA GTAGGCATGAA. Suppose BLAST finds in the NR database an exact match to one of these words. BLAST then attempts to extend the match in both directions: ACGTAGGCATGA. So now we have an exact match of 12 letters.

(Original) BLAST algorithm,3
So BLAST keeps going, and in this case would stop at an exact match of 13 letters (if one existed), since 13 letters was the entire initial search string: ACGTAGGCATGAA BLAST has a stopping algorithm for dropping particular search directions, or stopping altogether

BLAST algorithm in more detail
The BLAST algorithm searches for MSPs – Maximal Segment Pairs – pairs of segments whose score cannot be improved either by lengthening or by shortening them. “Pairs” here refers to a string – or a substring – of the initial string used as the search string – and one or more strings or substrings found in a database.
The search starts with the creation of all possible subwords of a given length (default typically 11 for DNA sequences, 3 amino acids for protein sequences) that would score at least T when matched against the original search string. (T is short for Threshold.) BLAST searches the database for any occurrence of each of these words. Each occurrence is a “hit”, the starting point of a “High-scoring Segment Pair (HSP)”.
The search then continues by trying to extend these HSPs. Suppose “S” is the best score found so far for an extension. BLAST stops trying to extend when the score drops a certain amount below the best value S. BLAST continues on and on until it is no longer possible to improve the score of HSPs by making them longer. Then it generates a list of the best HSPs. Default is a cutoff E-value of 10. BLAST (original) has an infinite gap penalty.
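The extension step can be sketched as an ungapped walk outward from a seed that stops once the running score falls too far below the best score seen. Everything specific here is an assumption made for illustration: the function name `extend_hit`, the scores (match +1, mismatch -3) and the drop-off limit `x_drop=5` are not BLAST's actual defaults.

```python
def extend_hit(query, subject, q_start, s_start, w,
               match=1, mismatch=-3, x_drop=5):
    """Extend an exact seed of length w in both directions without gaps.

    Returns (score, query_begin, query_end) with query_end exclusive.
    """
    def extend(q_pos, s_pos, step):
        best = score = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            score += match if query[q_pos] == subject[s_pos] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # fell too far below the best score: give up this direction
            q_pos += step
            s_pos += step
        return best, best_len

    right_score, right_len = extend(q_start + w, s_start + w, +1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)
    total = w * match + left_score + right_score
    return total, q_start - left_len, q_start + w + right_len


query = "ACGTAGGCATGAA"
subject = "TTACGTAGGCATGAACC"  # hypothetical database sequence
print(extend_hit(query, subject, q_start=0, s_start=2, w=11))
# (13, 0, 13)
```

Starting from the 11-letter seed, the extension grows to the full 13-letter query and then stops because the query has no more letters, matching the walk-through on the preceding slides. Because no gaps are ever inserted, this corresponds to the original algorithm's infinite gap penalty.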

PSI BLAST Position Specific Iterative BLAST
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997 Sep 1;25(17): Requires two non-overlapping similarities with the search term to occur within a certain distance (A) on the genome. Permits gaps in the alignments. Can be iterated to allow for user-specified scoring matrices. By default, uses the BLOSUM-62 matrix.

PSI BLAST Two hits, T=11 A=40 vs One hit, T=13 In the original BLAST, the step of extending the length of the ‘hits’ took ~90% of execution time. The initial threshold value T must be lower than with the original BLAST, but far fewer hits are pursued, meaning that the extension time is lower vol25/issue17/images/gka56202.gif

Example Problem (hopefully)

mpiBLAST http://mpiblast.lanl.gov/

mpiBLAST Algorithm Darling, A.E., L. Carey, W.-C. Feng. The design, implementation, and evaluation of mpiBLAST. Presented at ClusterWorld.
Algorithm: Database is segmented. Portions of database are placed on data storage devices on multiple nodes in a HPC system. mpiformatdb is a wrapper for the BLAST formatdb program. Number of subdivisions specified by user. Foreman/worker algorithm. Portions of the database are assigned to workers, using a greedy algorithm.
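To give a feel for greedy assignment of database fragments, here is a generic largest-first load-balancing sketch. It is not mpiBLAST's scheduler, whose details are not given on the slide; the function name `assign_fragments`, the fragment names, and the sizes are all hypothetical.

```python
import heapq


def assign_fragments(fragment_sizes, n_workers):
    """Greedily give each fragment (largest first) to the least-loaded worker."""
    heap = [(0, worker) for worker in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(n_workers)}
    by_size = sorted(fragment_sizes.items(), key=lambda item: -item[1])
    for fragment, size in by_size:
        load, worker = heapq.heappop(heap)  # worker with the smallest load
        assignment[worker].append(fragment)
        heapq.heappush(heap, (load + size, worker))
    return assignment


# Hypothetical fragment sizes in arbitrary units
sizes = {"frag0": 5, "frag1": 4, "frag2": 3, "frag3": 2}
print(assign_fragments(sizes, n_workers=2))
# {0: ['frag0', 'frag3'], 1: ['frag1', 'frag2']}
```

Both workers end up with a total load of 7 in this toy case. A greedy choice is made one fragment at a time and never revisited, so the result is fast to compute but not guaranteed to be optimal.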

mpiBLAST performance Scaling can be super linear when pieces are small enough that they fit into memory. Scalability limitations due to communication, implicit barrier before assembly of results. If pieces of data distributed out to workers are larger than available RAM, then scaling is still good but not super linear. BLAST is the most heavily used bioinformatics tool in existence. Parallelization of BLAST has huge payoff for practicing biologists.

BLAST superlinear scaling as memory phenomenon
Standard BLAST running on a system with 128 MB of memory. Conclusion: Performance degrades due to extra disk I/O when the database is larger than core memory. Slide courtesy of Wu-chun Feng Los Alamos National Laboratory

Multiple Alignment

Multiple Alignment Sequences lined up so that homologous residues are next to one another Color reflects residue type (e.g., green = hydrophobic) This slide based on a slide created by Dr. Richard Repasky, Indiana University

Uses Alignments reveal the degree to which sequences have been conserved Most functional sequences are conserved Multiple alignment is used to locate them Functional groups of enzymes Predict protein structure Gene promoters Unknown functional units of non-coding regions of DNA Alignments necessary to estimate evolutionary trees This slide based on a slide created by Dr. Richard Repasky, Indiana University

Progressive alignments
Pairwise dynamic programming alignment algorithms can be extended to multiple sequences but scale poorly to large numbers of sequences (or to long sequences) Heuristic algorithms are employed. Commonly used heuristic methods are progressive - build multiple alignment by aggregating from paired alignments Order in which sequences are added determined by a guide-tree that reflects similarity/distance Guide tree constructed from sequences Closely related sequences aggregated/added first Errors in early additions tend to propagate Algorithms differ in strategy for minimizing error propagation Algorithms also differ in guide tree construction & scoring This slide based on a slide created by Dr. Richard Repasky, Indiana University

Progressive alignment steps
1 - align B & C 2 - align D & E 3 - align (D & E) & A 4 -align (D & E & A) & (B & C) This slide based on a slide created by Dr. Richard Repasky, Indiana University

Three algorithms CLUSTAL W T-COFFEE ProbCons
Oldest of three & most widely used Initial alignment error not addressed Good performance by adding realistic details T-COFFEE Initial alignment error addressed by using consistency methods Uses CLUSTAL W, improves performance ProbCons New this year Initial alignment error also addressed by consistency methods Uses hidden Markov models This slide based on a slide created by Dr. Richard Repasky, Indiana University

CLUSTAL W Thompson et al Nucleic Acids Res. 22: Uses dynamic programming with distance matrices and gap penalties for alignments Selective use of scoring matrices Strict matrices for closely related sequences Permissive matrices for distantly related sequences Relatedness determined by branch lengths in guide tree Uses residue-specific gap penalties from reference alignments This slide based on a slide created by Dr. Richard Repasky, Indiana University

CLUSTAL W Gap penalties reduced in short stretches of hydrophilic residues (usually associated with bends and are gap-prone) Gap penalties increased in areas within 8 residues of existing gaps because such gaps are rare in reference alignments Sequences weighted by relatedness Attempt to correct for unbalanced sampling across guide tree Closely related sequences discounted in importance Progression Leaves of tree joined by dynamic programming Leaves joined with internal nodes by sequence-profile alignment Internal nodes joined by profile-profile alignment This slide based on a slide created by Dr. Richard Repasky, Indiana University

Example output
FOS_RAT     MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
FOS_MOUSE   MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNT
FOS_CHICK   MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNS
FOSB_MOUSE  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
FOSB_HUMAN  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAAS
            *:..* .:*:: .***** **:.:* * *..***.* :.. :*:

FOS_RAT     IPTVTAISTSPDLQWLVQPTLVSSVAPSQ       TRAPHPYGLP
FOS_MOUSE   IPTVTAISTSPDLQWLVQPTLVSSVAPSQ       TRAPHPYGLP
FOS_CHICK   VPTVTAISTSPDLQWLVQPTLISSVAPSQ       NRG-HPYGVP
FOSB_MOUSE  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMP
FOSB_HUMAN  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMP
            :******:** **********:**:* **... ::. .**.:* :

ClustalW-MPI Li, K.-B ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19: Initial pairwise alignment process is parallelized and scales very well Multiple alignment process is parallelized and scales modestly Scaling tests published thus far up to 16 processors, reduces time from hours to minutes

Consistency Methods Make estimates based on more information - “averaging” Lazy teacher analogy In progressive multiple alignment, use as much information as possible when adding sequences to the alignment T-COFFEE: each position in one alignment is weighted by consistency in all alignments of all pairs of sequences that include the sequences being aligned ProbCons: posterior probabilities in pairwise HMM alignments weighted by posterior probabilities of same positions in other alignments This slide based on a slide created by Dr. Richard Repasky, Indiana University

T-COFFEE Notredame, et al. (2000, J. Mol. Biol. 302: ) Gives better alignments than CLUSTAL W on benchmark datasets Avoids problem of early bad gaps using consistency methods Progressive alignment based on weights pooled from all pairwise alignments rather than currently accumulated sequences This slide based on a slide created by Dr. Richard Repasky, Indiana University

T-COFFEE steps Calculation of weights Progression
All pairwise alignments using CLUSTAL W, local alignments using FASTA Lalign For each aligned base pair in each pair of sequences calculate weight Aggregate weights for aligned base pairs using triplets of sequences Progression Align all sequences pairwise using weights Build guide tree from pairwise alignments Progressively build multiple alignment using tree and weights This slide based on a slide created by Dr. Richard Repasky, Indiana University

ProbCons http://probcons.stanford.edu
ISMB 2004, Bioinformatics 20:Supplement 1. Consistency methods & HMM to align. Gives better alignments than CLUSTAL W & T-COFFEE on alignment benchmarks. Use HMM to align all pairs of sequences. Keep posterior probability matrices & update each value by averaging over all alignments in which the sequence position occurs. Do twice. Create a guide tree from expected accuracies (sums of posterior probabilities of highest summing path in matrix). Progressive alignment objective function is sum of posterior probabilities for all aligned residues. This slide based on a slide created by Dr. Richard Repasky, Indiana University

Multiple alignment viewers
CLUSTAL X - X windows ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/ Jalview - Java Variable color schemes Editing Front end to aligners This slide based on a slide created by Dr. Richard Repasky, Indiana University

Abstracting Multiple Alignments
Hidden Markov models can be used to describe alignments. Called profile HMMs. Think of them as definitions of proteins or averages. Useful for aligning newly discovered sequences. Search sequence databases for sequences that match the alignment profile (Consider the alternative!). Build databases of profiles and search for profiles that match query sequences. This slide based on a slide created by Dr. Richard Repasky, Indiana University

HMMER http://hmmer.wustl.edu/
Profile HMMs for protein sequence analysis Builds profiles from existing alignments Searches sequence databases for molecules that match profiles Can be used to construct db’s of profiles and to search for profiles that match sequences Generates random sequences from profiles Also available as a parallel code using PVM Has been ported to vector supercomputers Scales reasonably well as regards number of processors. Does not scale as well as regards size of the biological problem

Phylogenetic Inference

Building Phylogenetic Trees
Goal: an objective means by which phylogenetic trees can be estimated in tolerable amounts of wall-clock time, producing phylogenetic trees with measures of their uncertainty All evolutionary changes are described as bifurcating trees -genes or gene products -organisms

Why is tree-building a HPC problem?
The number of bifurcating unrooted trees for n taxa is $(2n-5)! / [(n-3)!\,2^{n-3}]$. For 50 taxa the number of possible trees is ~$10^{74}$; most scientists are interested in much larger problems. NP-hard problem. The number of rooted trees is $(2n-3)! / [(n-2)!\,2^{n-2}]$.
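The growth in the number of trees is easy to check numerically. The count of unrooted bifurcating trees for n taxa equals the product of the odd numbers 1 × 3 × 5 × ... × (2n-5), which is the same quantity as the factorial expression for unrooted trees. The sketch below computes it both ways; the function names are chosen only for this example.

```python
import math


def unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees: 1 * 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


def unrooted_trees_factorial(n_taxa):
    """Same count via (2n-5)! / ((n-3)! * 2^(n-3))."""
    return math.factorial(2 * n_taxa - 5) // (
        math.factorial(n_taxa - 3) * 2 ** (n_taxa - 3))


for n in (4, 5, 10, 50):
    print(n, unrooted_trees(n))

# Order of magnitude for 50 taxa
print(len(str(unrooted_trees(50))) - 1)  # 74
```

Python integers have arbitrary precision, so the exact count for 50 taxa can be printed without overflow. Even at a billion trees per second, enumerating that many trees is out of the question, which is why heuristic searches are used.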

Stochastic change of DNA
Markov process, independent for each site: 4 x 4 matrix for DNA, 20 x 20 for amino acids
     A        C        G        T
A    p(A->A)  p(A->C)  p(A->G)  …
C    p(C->A)  p(C->C)  p(C->G)  …
G    .
T    .
Transitions more probable than transversions. Must account for heterogeneity in substitution rates among sites (DNArates – Olsen)
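A small numerical example may make the 4 x 4 matrix concrete. The sketch below builds a one-step substitution matrix with just two parameters, one probability for transitions (A<->G, C<->T) and one for transversions. This two-parameter form and the numbers 0.02 and 0.005 are assumptions made for illustration; they are not the model or values used by fastDNAml or DNArates.

```python
BASES = "ACGT"
PURINES = {"A", "G"}  # C and T are pyrimidines


def is_transition(x, y):
    """Purine <-> purine or pyrimidine <-> pyrimidine change."""
    return x != y and ((x in PURINES) == (y in PURINES))


def substitution_matrix(p_transition, p_transversion):
    """Return matrix[x][y] = probability that base x is replaced by base y."""
    matrix = {}
    for x in BASES:
        row = {}
        for y in BASES:
            if x != y:
                row[y] = p_transition if is_transition(x, y) else p_transversion
        row[x] = 1.0 - sum(row.values())  # probability of no change
        matrix[x] = row
    return matrix


P = substitution_matrix(p_transition=0.02, p_transversion=0.005)
for x in BASES:
    print(x, [round(P[x][y], 3) for y in BASES])
```

Each row sums to 1, since a site must end up as some base. Each base has one transition partner and two transversion partners, so the diagonal entry is 1 minus the transition probability minus twice the transversion probability.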

fastDNAml Developed by Gary Olsen
Derived from Felsenstein's PHYLIP programs. One of the more commonly used ML methods. The first phylogenetic software implemented in a parallel program (at Argonne National Laboratory, using P4 libraries). Olsen, G.J., et al. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Computer Applications in Biosciences 10: 41-48. MPI version produced by Indiana University in collaboration with Gary Olsen, available from

fastDNAml algorithm – adding taxa
Optimize tree for 3 (randomly chosen) taxa - only one topology possible Randomly pick another taxon – (2i-5) trees possible Keep the best (maximum likelihood tree)

Basic fastDNAml algorithm - Branch rearrangement
Move any subtree crossing n vertices (if n=1 there are 2i-6 possibilities) Keep best resulting tree Repeat this step until local swapping no longer improves likelihood value

fastDNAml algorithm con’t: Iterate
Get sequence data for next taxon Add new taxa (2i-5) Keep best Local rearrangements (2i-6) Keep going…. When all taxa have been added, perform a full tree check
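The payoff of stepwise addition shows up when counting how many trees are actually scored. The sketch below adds up, for each taxon i from 4 to n, the (2i-5) placements and one pass of (2i-6) local rearrangements. It is a lower bound under that assumption, since rearrangement is repeated while the likelihood improves and a full tree check follows at the end; the function name is invented for this example.

```python
def stepwise_tree_evaluations(n_taxa):
    """Minimum number of trees scored by stepwise addition with one local pass."""
    total = 0
    for i in range(4, n_taxa + 1):
        total += 2 * i - 5  # places where taxon i can be attached
        total += 2 * i - 6  # one pass of local branch rearrangements
    return total


for n in (4, 10, 50):
    print(n, stepwise_tree_evaluations(n))
# 4 5
# 10 119
# 50 4559
```

A few thousand likelihood evaluations for 50 taxa is a vanishing fraction of the full tree space. The price is that the search is heuristic, so the result depends on the order in which taxa are added.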

Overview of parallel program flow
Program modules Master (generates trees, receives back from Foreman best tree at each step) Foreman (dispatches trees to workers, determines best tree, tracks activity of workers) Worker Monitor (instrumentation) Parallel versions include fault tolerance features (useful in large clusters and grid computing)

Performance of fastDNAml

SC2003 HPC Challenge (“It seemed like a good idea at the time”)
Are Hexapods a single evolutionary group? Are ecdysozoans a single evolutionary group?

A partial bestiary All organism illustrations copyright Jennifer Fairman, Used by agreement

Software and data analysis
Non-grid preparatory work: Download sequences from NCBI (67 taxa, 12,162 bp, mitochondrial genes for 12 proteins); Align sequences with Multi-Clustal; Determine rate parameters with TreePuzzle. Grid preparatory work: Analyze performance of fastDNAml with Vampir; Meetings via Access Grid & COVISE. The grid software: PACX-MPI – Grid/MPI middleware; COVISE – Collaboration and visualization; Application Framework – Matthias Hess; fastDNAml – Maximum Likelihood phylogenetics

Application framework, COVISE, FastDNAml
ML analysis of phylogenetic trees based on DNA sequences. Foreman/worker MPI program. Heuristic search for best trees. For 67 taxa: ~10^109 trees. Goal: 300 bootstraps, 10 jumbles per – 3000 executions (more than 3x typical!)

Why this project on a grid?
Important & time-sensitive biological question requiring massive computer resources A biologically-oriented code that scales well Grid middleware environment & collaboration tool well suited to the task at hand Opportunity to create a grid spanning every continent on earth (except Antarctica) It seemed like a good idea at the time

The results Hundreds of trees were analyzed during the course of the week The biological results are still being analyzed Our HPC challenge project was awarded the prize for “Most geographically distributed application” We have not yet become rich or famous

RNA & Protein Structure

RNA & Protein Structure
Want to know functions Function dictated by structure Need structure to understand function Empirical determination of structure difficult/expensive Shortcut: predict structure from sequence Algorithms & software for predicting structure

RNA Nucleotide sequence Composition differs from DNA
Thymine replaced by uracil Alphabet: C, G, A, U Types of RNA Messenger RNA - template from DNA - decoded to produce protein Transfer RNA - interface attached to amino acids - identifies amino acid to protein producing machinery Ribosomal RNA - protein producing machinery Regulatory RNA - small polynucleotides that bind to other molecules and alter behavior Catalytic RNA - most catalyze reactions of DNA

RNA secondary structure
Single stranded RNA folds on itself Complementary bases join A – U, G - C Forms loops & hairpins In nature, structure nearly minimizes energy Energy - more or less bending/stress on bond angles Zuker algorithm minimizes calculated energy tRNA
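Energy minimization as done by the Zuker algorithm needs a table of measured energies, so the sketch below swaps in a simpler, related technique: a Nussinov-style dynamic program that just maximizes the number of complementary pairs. It uses only the A-U and G-C pairs named on the slide and assumes a hairpin loop needs at least three unpaired bases; the function name `max_base_pairs` and that `min_loop` default are choices made for this example.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested A-U / G-C pairs (Nussinov-style recursion)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))  # 3
```

For this short hairpin the three G-C pairs form the stem and the four bases AAAU stay in the loop. Counting pairs ignores stacking and loop energies, so it is only a rough stand-in for a true minimum-energy fold.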

RNA Structure – Vienna RNA
Package consists of several parts (from the web site): RNAfold - predict minimum energy secondary structures and pair probabilities RNAeval - evaluate energy of RNA secondary structures RNAheat - calculate the specific heat (melting curve) of an RNA sequence RNAinverse - inverse fold (design) sequences with predefined structure RNAdistance - compare secondary structures RNApdist - compare base pair probabilities RNAsubopt - complete suboptimal folding

Protein Enzymes - catalysts
Regulatory - bind with molecules to alter behavior Transport - move substances from place to place, e.g., oxygen carried by hemoglobin Storage - e.g., caches of nitrogen or metal ions Mobility - contractile & motile proteins (muscle, flagella) Structural proteins - fill space, provide support Scaffold - supports for construction of macromolecules Defense/Attack - immune system proteins, venom

Structure and function of proteins
Enzymes receive most attention Enzymes catalyze reactions = lower energy required Place reactants in favorable positions for reaction Location is everything Example: enolase

Primary structure - amino acids
Share nitrogen group (amino) Share acid group (carboxyl) Differ in side chains 20 common amino acids Polymerize amino-to-carboxyl Side chains determine secondary & tertiary structure

Secondary structure & Bond rotation
Reflects angles in amino acid chain Shape of the peptide chain over short sequences Determined by amino acid composition

Angles of rotation & secondary structure
course/3_geometry/rama.html

Visualization: ways to view molecules
Wireframe Often used by crystallographers while interpreting data Frames fit nicely in electron density mesh Space filled (Van der Waals radii) Often used in docking applications Van der Waals radii useful for thinking about hydrogen bonds

Visualization: ways to view molecules
Richardson-type (also ribbon or noodle) Depict secondary and tertiary structure Omit details of atoms and polymerization Ball and stick Usually more useful for ligands than for whole proteins Atoms & covalent bonds

Molecular viewing software
Most programs do several types of visualizations VRML – Cosmo Player; RASMOL; RasTop; CHIME; Swiss Pdb Viewer; MICE Many tend to be touchy about browsers and plugins

Empirical structure X-ray crystallography
Yields 3-D plot of electron density in space Create model of molecule that matches electron density ~127,863 entries in SwissProt ~857,950 entries in TrEMBL

Secondary structure prediction
Induction: assign structure based on sequence similarity to proteins of known structure Consensus methods do best Search database for similar sequences Align sequences Apply several algorithms (e.g., neural network, nearest-neighbor) to predict structure type Take consensus of predictions JPRED:
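The final "take consensus of predictions" step can be sketched as a per-residue majority vote. This is a minimal illustration, not JPRED's actual procedure; the three-state labels (H = helix, E = strand, C = coil), the function name, and the rule that ties fall back to coil are all assumptions made for the example.

```python
from collections import Counter


def consensus_structure(predictions):
    """Majority vote per residue over equal-length strings of H/E/C labels."""
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same residues")
    consensus = []
    for column in zip(*predictions):  # one column per residue
        counts = Counter(column)
        top = max(counts.values())
        winners = [label for label, c in counts.items() if c == top]
        # Assumed tie-break: an undecided residue is called coil ("C")
        consensus.append(winners[0] if len(winners) == 1 else "C")
    return "".join(consensus)


if __name__ == "__main__":
    print(consensus_structure(["HHHEC", "HHEEC", "HCEEC"]))
```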

Tertiary structure prediction
Long segments fold Folds are held in place by molecular forces (e.g., electrostatic, hydrogen bonds, some covalent bonds) Proteins fold to minimize energy Folding algorithms seek conformation with minimum energy Two main methods Ab initio Fold recognition

Criteria Goal: prediction of molecule position within 1 angstrom
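Remember, location is everything in enzymes Measuring quality of fit Root mean square of atom distances from correct position: $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d_i^2}$, where $d_i$ is the distance of atom $i$ from its correct position and $N$ is the number of atoms Q3 = (true positives + true negatives)/total residues Better than 70% right is really good!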
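Both quality measures are short to compute. The sketch below assumes the predicted and reference coordinates have already been superimposed (no optimal rotation is attempted) and treats Q3 simply as the fraction of residues whose three-state label matches; the function names are illustrative.

```python
import math


def rmsd(predicted, reference):
    """Root mean square deviation between two equal-length lists of (x, y, z)."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for p, r in zip(predicted, reference):
        # squared distance d_i^2 for atom i
        total += sum((a - b) ** 2 for a, b in zip(p, r))
    return math.sqrt(total / len(predicted))


def q3(predicted, observed):
    """Fraction of residues whose secondary structure label is correct."""
    if not observed or len(predicted) != len(observed):
        raise ValueError("label strings must be non-empty and equal in length")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return correct / len(observed)


if __name__ == "__main__":
    print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
    print(q3("HHHH", "HHCC"))
```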

Fold Recognition (Threading)
Impose known folds on molecule; evaluate fit Dissimilar sequences may fold similarly Number of possible folds is finite Many methods of fitting (e.g., dynamic programming, Gibbs sampling, hidden Markov models) Calculate energy or distances Web services - many methods General recommendation: use many methods and build a consensus

Ab Initio methods Latin: "from the beginning" (O.E.D.)
A real ab initio chemist would complain about use of the term A scoring function is used to judge conformations Search function used to explore conformational space Criterion: usually minimize free energy Scoring function types Molecular mechanics calculations Use empirically derived scoring functions based on probability distributions of data in Protein Data Bank Search function may be coarse-grained or fine-grained, usually matches granularity of scoring function
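The split between a scoring function and a search function can be shown with a toy problem. In the sketch below the "conformation" is just a list of torsion angles, the scoring function `toy_energy` is a made-up stand-in (not molecular mechanics and not a Protein Data Bank potential), and the search function is simulated annealing with Metropolis acceptance. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def toy_energy(angles, preferred=-1.0):
    """Hypothetical scoring function: zero when every angle equals `preferred`."""
    return sum(1.0 - math.cos(a - preferred) for a in angles)


def anneal(n_angles=10, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Search function: simulated annealing over the toy conformational space."""
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    energy = toy_energy(angles)
    start_energy = energy
    best_energy, best_angles = energy, list(angles)
    for step in range(steps):
        # geometric cooling schedule from t_start down to t_end
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n_angles)
        old = angles[i]
        angles[i] = old + rng.gauss(0.0, 0.5)  # perturb one angle
        new_energy = toy_energy(angles)
        delta = new_energy - energy
        # Metropolis rule: always accept downhill, sometimes accept uphill
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_energy, best_angles = energy, list(angles)
        else:
            angles[i] = old  # reject the move
    return start_energy, best_energy, best_angles


if __name__ == "__main__":
    start, best, _ = anneal()
    print(start, best)
```

Real packages differ mainly in how expensive and how realistic the scoring function is; the search loop keeps this general shape.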

Amber Assisted Model Building with Energy Refinement
Not open source – modest license cost D.A. Pearlman, D.A. Case, J.W. Caldwell, W.R. Ross, T.E. Cheatham, III, S. DeBolt, D. Ferguson, G. Seibel and P. Kollman. AMBER, a computer program for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to elucidate the structures and energies of molecules. Comp. Phys. Commun. 91, 1-41 (1995) Simulated annealing approach with energy refinements

GAMESS General Atomic and Molecular Electronic Structure System
M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery. General Atomic and Molecular Electronic Structure System. J. Comput. Chem. 14. NPACI/SDSC Web portal for GAMESS available No-cost academic license It’s parallel

CHARMM (Chemistry at HARvard Molecular Mechanics)
For prediction of macromolecular structure Not open source - modest license cost B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J. Comp. Chem. 4 (1983). A. D. MacKerell, Jr., B. Brooks, C. L. Brooks, III, L. Nilsson, B. Roux, Y. Won, and M. Karplus. CHARMM: The Energy Function and Its Parameterization with an Overview of the Program. In The Encyclopedia of Computational Chemistry, 1, P. v. R. Schleyer et al., editors (John Wiley & Sons: Chichester, 1998).

Rosetta Work with fragments 3-9 amino acids in length
Restrict conformations of individual fragments to the distribution of conformations exhibited in real proteins Fragment conformations modeled stochastically using distributions of observed conformations Seek array of conformations that minimize energy Local minima likely Run many searches Cluster results & pick large clusters as likely conformations Typically licensed at no cost for academic purposes

Molecular Docking Will two molecules bind?
Usually interested in docking of small molecules (e.g., drug candidates) to proteins Small molecule called ligand (from Latin ligare - to bind) Specific question: will ligand bind to a receptor in a protein? Receptor usually the largest pocket in the surface of a protein Steps Characterize receptor site Orient ligand(s) & evaluate

Molecular Docking Process: usually create negative image of receptor site and ask whether ligands take that conformation Rigid models use a grid search Models with flexible surfaces usually assume receptor fixed and ligand flexible, and explore conformation of ligand (simulated annealing) Models with flexible surfaces do better than those with fixed surfaces AutoDock is a commonly used package Flexible model Can do simulated annealing and genetic algorithms

Systems Biology

Systems Biology Special issue of Science: 295, Mar. 2002
Special issue of Nature: 420, Nov. 2002 “Systems biology is a new field in biology that aims at a systems-level understanding of biological systems.” Nobody’s quite sure what it is, but it sure is hot! graphics/slides/images/ _web.gif

Historical approach to biological experiments
From Lazebnik, Y. Cancer Cell 2:179: Traditional biological experimentation is much like the process of trying to fix a broken radio (or if you are or were a 12-year old boy…) Some typical steps: Cataloguing components and their attributes Perturbing the system Knock-out experiments Drawing diagrams Eventually may find a component that, when replaced, repairs the radio In a very complex system, knowing what all of the parts are, and knowing the function of individual pathways, may still not tell you how the system works. It may simply be impossible to deduce this from first-order interactions Interactions, multiple changes Power supply and other components (well-known PC repair example!) Change everything all at once so that we’ll never know what worked!

Systems Biology Systems biology emphasizes close integration of experiment, theory and computational modeling Goal: understanding the structure and dynamics of biological systems, placing the parts in the context of the dynamic whole Studies the complex interactions of many levels of biological information Quantitative, predictive models are central Computational modeling in particular is a key tool Why model? You are forced to state exactly what you are hypothesizing Allows you to understand an *approximation* of reality in great detail References: Fall et al. (eds). Computational Cell Biology. Springer Verlag. Kitano (ed). Foundations of Systems Biology. MIT Press.

A small sampling BALSA BASIS BIOCHAM BioCharon biocyc2SBML BioGrid
BioNetGen BioPathways Explorer Bio Sketch Pad BioSPICE Dashboard BioSpreadsheet BioUML Cellware Cytoscape DBsolve Dizzy E-CELL FluxAnalyzer Gepasi INSILICO discovery Jarnac JDesigner JSIM JWS Karyote

Example - MCell MCell is: A General Monte Carlo Simulator of Cellular Microphysiology. MCell focuses on simulations using a Brownian dynamics random walk algorithm. MCell's use to date has been focused on the microphysiology of synaptic transmission. Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute.
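To give a feel for what a Brownian dynamics random walk means, the sketch below diffuses point particles in free 3-D space, drawing each step from a Gaussian with standard deviation sqrt(2 D dt) per axis. It is not MCell code and does not use MCell's API or geometry handling; the function name, parameters, and values are assumptions. The mean squared displacement should approach 6 D t, which is the check printed at the end.

```python
import math
import random


def mean_squared_displacement(n_particles=2000, n_steps=50,
                              diffusion=1.0, dt=0.01, seed=1):
    """Monte Carlo random walk of free particles released at the origin."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * diffusion * dt)  # per-axis step size
    total = 0.0
    for _ in range(n_particles):
        x = y = z = 0.0
        for _ in range(n_steps):
            x += rng.gauss(0.0, sigma)
            y += rng.gauss(0.0, sigma)
            z += rng.gauss(0.0, sigma)
        total += x * x + y * y + z * z
    return total / n_particles


if __name__ == "__main__":
    msd = mean_squared_displacement()
    expected = 6.0 * 1.0 * (50 * 0.01)  # 6 * D * t
    print(msd, expected)
```

A cellular simulator adds what this sketch leaves out: membranes and other surfaces, and reaction probabilities when a molecule meets a receptor.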

MCell Scalability Images and MCell-related material courtesy of Joel R. Stiles, Pittsburgh Supercomputing Center and Carnegie Mellon University, and Thomas M. Bartol, Computational Neurobiology Laboratory, The Salk Institute.

CompuCell CompuCell currently uses a combination of "extended Potts model" for cell sorting and clustering, and "Schnakenberg Reaction Diffusion" equations to establish the underlying chemical field to which cells respond and form typical patterns found in such biological systems as a growing chicken limb. Image courtesy of James Glazier

SBML Systems Biology Markup Language
There is currently a proliferation of software, but no single package answers all needs Systems Biology Markup Language Purpose: develop software and standards to Enable sharing of simulation & analysis software Enable sharing of models Goal: make it easier to share than to reimplement An XML-based markup language Active and functional leadership and reasonable funding stream SBML is focused on biochemical networks, but of all of the biology-oriented markup languages, it seems to be the one with the most traction Permits storage, transmission, and reuse of models Consists of “levels”

What does an SBML model look like?
<?xml version="1.0" encoding="UTF-8" ?>
<sbml xmlns="..." version="2" level="1" xmlns:celldesigner="...">
  <model name="ban00010">
    <annotation>
      <celldesigner:modelVersion>2.2</celldesigner:modelVersion>
      <celldesigner:modelDisplay sizeX="876" sizeY="1177" />
      <celldesigner:listOfCompartmentAliases>
        <celldesigner:compartmentAlias id="ca1" compartment="uVol">
          <celldesigner:class>SQUARE</celldesigner:class>
          <celldesigner:bounds x="10.0" y="10.0" w="856" h="1157" />
        </celldesigner:compartmentAlias>
      </celldesigner:listOfCompartmentAliases>
      <celldesigner:listOfSpeciesAliases>
        ...

SBML Levels Level 1 – Biochemical networks. Frozen.
Level 2 – enhancements and extensions to level 1. Frozen June 2003. Uses MathML for equation specifications Uses same metadata scheme as CellML (exp named function defs), catalysts, time delays Fixes minor issues in Level 1 specification Any Level 1 model can be run within software that supports Level 2 Level 3 – current development effort

Components of a Level 1 or 2 model
Compartment: a well-stirred container Species: chemical compounds Reaction: transformation, transport, or binding process involving a species. May have a rate parameter Parameter: a quantity that has a symbolic name (global and local) Unit definition Rule: added to set constraints, initial conditions, bounds, etc on the reactions Everything in SBML is one of the above!
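The components listed above map naturally onto a few small data structures. The sketch below is not libSBML and does not read or write SBML; it only mirrors the vocabulary (compartment, species, reaction, rate parameter) in plain Python and runs a single mass-action reaction A -> B with forward Euler steps. Class names, fields, and the integration scheme are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Compartment:
    name: str
    volume: float  # a well-stirred container


@dataclass
class Species:
    name: str
    compartment: str
    initial_amount: float


@dataclass
class Reaction:
    name: str
    reactant: str
    product: str
    rate_constant: float  # local rate parameter k


@dataclass
class Model:
    compartments: list
    species: list
    reactions: list


def simulate(model, t_end, dt=0.001):
    """Forward Euler integration of first-order mass-action reactions."""
    amounts = {s.name: s.initial_amount for s in model.species}
    for _ in range(int(round(t_end / dt))):
        changes = {name: 0.0 for name in amounts}
        for r in model.reactions:
            flux = r.rate_constant * amounts[r.reactant] * dt
            changes[r.reactant] -= flux
            changes[r.product] += flux
        for name in amounts:
            amounts[name] += changes[name]
    return amounts


if __name__ == "__main__":
    model = Model(
        compartments=[Compartment("cell", 1.0)],
        species=[Species("A", "cell", 10.0), Species("B", "cell", 0.0)],
        reactions=[Reaction("decay", "A", "B", 0.5)],
    )
    print(simulate(model, t_end=2.0))
```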

So you actually want to run one…
MANY programs will handle a model written in SBML libSBML provides a C/C++ API if you want to write your own MathSBML – an open source toolbox for running SBML models within Mathematica SBML Toolbox – the equivalent for MATLAB While an open source toolkit for a proprietary software package seems odd at first blush… There is a KEGG to SBML converter!

JWS Online From

CellML Originally designed to describe and exchange models of cellular and subcellular processes. XML-based specification of interchange of cell model information Includes: Information about model structure Math, based on MathML Metadata about the model Project of Bioengineering Institute of University of Auckland with support from Physiome Sciences Inc.

BioSpice Led by Adam Arkin – a DARPA-backed effort
Described in some detail in two recent issues of “-Omics” More licensing term details than many open source efforts The BioSpice Dashboard may be one of the better “integrative” tools under development at present Uses SBML for model specification

Systems biology URLs SBW & SBML www.sbw-sbml.org
NetBuilder strc.herts.ac.uk/bio/Maria/NetBuilder CellML Jarnac + JDesigner Gepasi Virtual Cell (NIH-supported) E-CELL (based in Japan) JigCell gnida.cs.vt.edu/~cellcyclepse/ DARPA BioSPICE Karyote

Some Good Books Winter, P.C., G.I. Hickey, H.L. Fletcher. Instant notes in genetics. Springer-Verlag, NY. ISBN Durbin, R., S. Eddy, A. Krogh, G. Mitchison. Biological sequence analysis. Cambridge University Press. Gibas, C., and P. Jambeck. Developing bioinformatics computer skills. O’Reilly. Tisdall, J. Beginning Perl for bioinformatics. O’Reilly. Gusfield, D. Algorithms on strings, trees, and sequences. Cambridge University Press. Berman, F., G.C. Fox, A.J.G. Hey (eds). Grid computing: making the grid infrastructure a reality. Wiley, Sussex

And a few semi-random other matters

GeneIndex Location of initiators, promoters, etc. a key question in genomics First step in this is creating a dictionary of words of various lengths (many possible next steps) To be useful, analysis must be performed on entire genomes at once GeneIndex finds frequencies and positions of all words of a given length in a DNA sequence. Visualization with Tcl/Tk. Genome is broken up into n sections, where n = number of processors After each segment is analyzed, linked lists are joined
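The word-dictionary step can be sketched in a few lines. The code below is not GeneIndex itself: it runs the sections one after another in a single process and merges ordinary Python lists instead of linked lists, and the function names are assumptions. It does show the one detail that matters when splitting a genome, namely that a word starting near the end of one section must be allowed to read into the next.

```python
from collections import defaultdict


def index_segment(sequence, start, end, k):
    """Positions of every k-letter word whose first base lies in [start, end)."""
    index = defaultdict(list)
    # A word may extend past `end`, but never past the end of the sequence
    last_start = min(end, len(sequence) - k + 1)
    for pos in range(start, last_start):
        index[sequence[pos:pos + k]].append(pos)
    return index


def build_index(sequence, k, n_sections):
    """Split the sequence into n_sections, index each, then join the results."""
    size = -(-len(sequence) // n_sections)  # ceiling division
    merged = defaultdict(list)
    for section in range(n_sections):  # each pass stands in for one processor
        start = section * size
        end = min(start + size, len(sequence))
        for word, positions in index_segment(sequence, start, end, k).items():
            merged[word].extend(positions)
    return dict(merged)


if __name__ == "__main__":
    idx = build_index("ACGTACGT", k=2, n_sections=3)
    for word in sorted(idx):
        print(word, len(idx[word]), idx[word])
```

The frequency of a word is just the length of its position list, so one pass yields both quantities.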

GeneIndex Scalability: Speedup

BioPerl Duct tape for DNA/protein sequence bioinformatics Perl API for
Reading many data formats/data sources Writing many data formats/data sources Manipulating data objects in well known ways Object oriented Perl module(s) As with all frameworks, it can be extremely painful for quick and dirty applications Siblings: BioPython, BioJava

Apple bioclusters Apple Xserve clusters: head node plus compute nodes
Batch & queuing with Platform LSF, Sun Grid Engine, Open PBS, or PBS Pro Parallel APIs: MPICH, MPI Pro, LAM/MPI Globus toolkit available iNquiry Bioinformatics tools available Many open source bioinformatics packages pre-compiled All available through Pise web interface “Easy” system/cluster administration tools

BIRN Biomedical Informatics Research Network http://www.nbirn.net/
NIH-sponsored attempt to create health-oriented cyberinfrastructure Function BIRN – brain function and disorders, e.g. schizophrenia Morphometry BIRN – brain structural disorders, e.g. Alzheimer's Mouse BIRN – studying mouse brain and mouse models of human brain disorders Grid technology, using federated data system approach, based on Globus, SRB, etc.

Grid.org Volunteer cycles to handle various problems
Uses United Devices software Human proteome project – uses Rosetta Cancer research – does screening against potential projects

Computational biology, biomedical research, and HPC
Two challenges: Scalability of applications Wall-clock time sensitivity Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done. Traditional biomedical researchers must take advantage of new possibilities Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers

Future directions & needs
Drug Discovery Target generation – so what Target verification – that’s important! Toxicity prediction – VERY important!! (Cholesterol example) Counterintuitive problem: the more personalized a therapy is, the smaller its target audience! Many biologists are unfamiliar with the real possibilities Useful applications may require straightforward application of well known principles, but writing a parallel application that can be used to treat people is a very difficult challenge Attacks on all fronts simultaneously are needed Interactive Tera-scale applications might for many biologists be more valuable right now than Peta-scale applications (even if we had them!) Portals and the TeraGrid –> solutions to problems that biologists care about Lots of open source codes are out there waiting for you

Acknowledgments Some of the research described herein was supported by the following: The Indiana Genomics Initiative of Indiana University, supported in part by Lilly Endowment Inc. The Indiana METACyt Initiative of Indiana University, supported in part by Lilly Endowment Inc. Shared University Research grants from IBM, Inc. to Indiana University. National Science Foundation under Grant No and Grant No. CDA Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Some of the ideas presented here were developed while the senior author was a visiting scientist at Höchstleistungsrechenzentrum Universität Stuttgart. Thanks to HLRS and everyone I have worked with there, especially Michael Resch, Matthias Müller, Peggy Lindner, Matthias Hess, Rainer Keller, and Edgar Gabriel. John Herrin, Malinda Lingwall, & W. Leslie Teach assisted with graphics
