Some Independent Study on Sequence Alignment — Lan Lin prepared for theory group meeting on July 16, 2003.


Biological Background (1)
Genetic information is stored in DNA
  – used to make identical copies
  – transferred from DNA to RNA to protein
DNA is a linear polymer of 4 nucleotides (A, T, G, C)
RNA is a similar polymer (A, U, G, C)
Both can pair with one another, forming a "double helix"
  – pairing is sequence specific (G-C, A-T/U)
  – templating results in DNA replication and in an RNA copy of a DNA sequence

Biological Background (2)
Proteins are variable-length, linear, mixed polymers of 20 different amino acids
  – amino acid polymers are called peptides and polypeptides
  – functional properties are largely determined by the amino acid sequence
RNA is converted to protein by translating a code in which 3 nucleotides specify 1 amino acid
  – each amino acid is encoded by 1 to 6 different triplet codes
  – 3 stop codons specify "end of peptide sequence"
  – a DNA sequence has 3 reading frames, or 6 when its (inferred) complementary strand is included
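
The reading-frame count on this slide can be made concrete with a small sketch. The function names below are illustrative and not from the slides; the code only assumes the standard base pairing (A-T, G-C) and the three standard DNA stop codons.

```python
# Illustrative sketch: enumerate the 6 reading frames of a DNA sequence.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
STOP_CODONS = {"TAA", "TAG", "TGA"}  # the 3 stop codons, written as DNA


def reverse_complement(dna):
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def codons(dna, frame):
    """Split dna into complete triplets, starting at offset frame (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]


def six_frames(dna):
    """3 frames on the given strand followed by 3 on the complementary strand."""
    strands = (dna, reverse_complement(dna))
    return [codons(strand, frame) for strand in strands for frame in range(3)]


frames = six_frames("ATGGCCTAA")
print(frames[0])  # ['ATG', 'GCC', 'TAA']
```

Only the first frame of this toy sequence ends in a stop codon, which illustrates why locating the correct frame is part of finding where a protein code starts and stops.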

Sequence Analysis (1)
Some difficulties
  – Where does the code for a protein start and stop?
  – Coding DNA is frequently scattered over separate "exons" rather than being continuous
  – RNAs include regions up- and downstream of the coding region; non-coding regions can be quite large, and not all RNAs encode proteins
Inferring structure and function from a protein sequence is even harder!
  – 3 levels of protein structure
      primary structure: the sequence of amino acids in the protein
      secondary structure: polypeptide chains folding into regular structures (e.g., alpha helix or beta sheet)
      tertiary structure: the 3D structure of the protein, which determines biological function
  – a homology-based approach determines the tertiary structure by primary sequence analysis of related proteins

Sequence Analysis (2)
What can be done?
  – Identification of the protein primary sequence from the DNA sequence
  – Searching databases for similar sequences
      DNA sequences: DDBJ, EMBL, GenBank
      rapid search for a query sequence: BLAST and FASTA
      protein sequences: SwissProt, PIR
  – Calculation of sequence alignments for evolutionary inferences and to aid in structural and functional analysis

Pairwise Sequence Alignment
Two quantitative measures
  – similarity (the larger the better)
  – distance (the smaller the better)
Edit operations, written by introducing a gap character "-"
  – match, replacement, insertion/deletion ("indel")
The unit cost model (a match costs 0; each replacement or indel costs 1)
  – The cost of an alignment of two sequences s and t is the sum of the costs of all the edit operations that lead from s to t.
  – An optimal alignment is one with the minimum cost.
  – The edit distance of s and t is the cost of an optimal alignment of s and t under a cost function w, denoted by $d_w(s, t)$.
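
As a minimal sketch of the unit cost model, the function below scores an alignment that is already given as two gapped rows of equal length. The name `alignment_cost` and the row representation are assumptions made for illustration.

```python
def alignment_cost(row_s, row_t, gap="-"):
    """Unit cost of a given alignment: matches cost 0, replacements and indels cost 1."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    cost = 0
    for a, b in zip(row_s, row_t):
        if a == gap and b == gap:
            raise ValueError("a column may not consist of gap characters only")
        if a != b:  # replacement (two different letters) or indel (one gap)
            cost += 1
    return cost


# Two replacements (k/s, e/i) and one insertion (g)
print(alignment_cost("kitten-", "sitting"))  # 3
```

This only evaluates one candidate alignment; finding the alignment with minimum cost is what the dynamic programming slides address.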

Pairwise Alignment via Dynamic Programming (1)
Recursion step, for $i, j \ge 1$:
$$
d_w(0{:}s{:}i,\ 0{:}t{:}j) = \min \left\{
\begin{array}{l}
d_w(0{:}s{:}(i-1),\ 0{:}t{:}(j-1)) + w(s_i, t_j), \\
d_w(0{:}s{:}(i-1),\ 0{:}t{:}j) + w(s_i, -), \\
d_w(0{:}s{:}i,\ 0{:}t{:}(j-1)) + w(-, t_j)
\end{array}
\right\}
$$
Base:
$$d_w(0{:}s{:}0,\ 0{:}t{:}0) = 0$$
$$d_w(0{:}s{:}i,\ 0{:}t{:}0) = d_w(0{:}s{:}(i-1),\ 0{:}t{:}0) + w(s_i, -) \quad \text{for } i = 1, \dots, m$$
$$d_w(0{:}s{:}0,\ 0{:}t{:}j) = d_w(0{:}s{:}0,\ 0{:}t{:}(j-1)) + w(-, t_j) \quad \text{for } j = 1, \dots, n$$
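
The recursion and its base cases translate almost line by line into code. The sketch below is one possible rendering, not the presenter's implementation; the default cost function `unit_cost` and the use of "-" as the gap symbol are assumptions.

```python
def unit_cost(a, b):
    """Assumed default cost function w: 0 for a match, 1 for a replacement or indel."""
    return 0 if a == b else 1


def edit_distance_matrix(s, t, w=unit_cost, gap="-"):
    """Fill the (m+1) x (n+1) matrix of prefix edit distances."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: a prefix aligned against the empty prefix costs only indels.
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w(s[i - 1], gap)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w(gap, t[j - 1])
    # Recursion step, row by row and left to right.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),  # match / replacement
                d[i - 1][j] + w(s[i - 1], gap),           # deletion of s_i
                d[i][j - 1] + w(gap, t[j - 1]),           # insertion of t_j
            )
    return d


d = edit_distance_matrix("kitten", "sitting")
print(d[-1][-1])  # 3
```

Passing a different `w` changes the cost model without touching the recursion itself.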

Pairwise Alignment via Dynamic Programming (2)
The edit distances of all prefixes define an $(m+1) \times (n+1)$ distance matrix $D = (d_{i,j})$ with $d_{i,j} = d_w(0{:}s{:}i,\ 0{:}t{:}j)$.
Pattern of dependencies between matrix elements: $d_{i,j}$ depends on $d_{i-1,j-1}$, $d_{i-1,j}$ and $d_{i,j-1}$.
The bottom right corner contains the desired result: $d_{m,n} = d_w(0{:}s{:}m,\ 0{:}t{:}n) = d_w(s, t)$.
A path through the distance matrix indicates how to align
  – A diagonal line means replacement/match
  – A vertical line means deletion
  – A horizontal line means insertion
The most common order of calculation is line by line (each line from left to right), or column by column (each column from top to bottom).
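
Reading the alignment off the matrix can be sketched as a traceback from the bottom right corner. The code below assumes unit costs and breaks ties in favour of the diagonal; other tie-breaking rules yield other, equally optimal alignments. The function name `align` is illustrative.

```python
def align(s, t):
    """Return (cost, gapped s, gapped t) for one optimal unit-cost alignment."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub, d[i - 1][j] + 1, d[i][j - 1] + 1)

    # Traceback: walk from d[m][n] back to d[0][0].
    i, j = m, n
    row_s, row_t = [], []
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and s[i - 1] == t[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1])  # diagonal step
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            row_s.append(s[i - 1]); row_t.append("-")       # vertical step: deletion
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])       # horizontal step: insertion
            j -= 1
    return d[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


cost, a, b = align("kitten", "sitting")
print(cost)
print(a)
print(b)
```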

On Scoring Functions
Different words all attribute a numeric value to a pair of sequences
  – "distance" values are never negative; should be minimized
  – "cost" implies positive values, with the overall cost to be minimized
  – "weights" and "scores" can be positive or negative; the optimal alignments maximize scores
  – "similarity" implies large values are good; should be maximized
If relating sequences of different lengths, length-relative scores make sense.

Realistic Gap Models
No-gap alignment
  – uses matches/replacements only in some regions (e.g., sites of protein-protein interaction)
  – the DP algorithm is geared to do this by setting the cost of an indel to infinity (or something close to it)
Block-indel
  – charges a certain set-up cost for introducing a gap, whereas extending the gap is less expensive
  – the DP algorithm can be adapted without much effect on its efficiency
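
One common way to adapt the DP algorithm to block-indels is to keep three matrices, one per "last column type" of the alignment (often attributed to Gotoh). The slides do not say which adaptation is meant, so the sketch below is an assumption: a gap of length k costs `gap_open + k * gap_extend`, and all parameter names and default values are illustrative.

```python
INF = float("inf")


def affine_gap_distance(s, t, mismatch=1, gap_open=2, gap_extend=1):
    """Edit distance where a gap of length k costs gap_open + k * gap_extend."""
    m, n = len(s), len(t)
    # M: last column is a match/replacement; X: s_i against a gap; Y: a gap against t_j.
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    X = [[INF] * (n + 1) for _ in range(m + 1)]
    Y = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, n + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening a new gap pays the set-up cost; extending an open one does not.
            X[i][j] = min(M[i - 1][j] + gap_open, Y[i - 1][j] + gap_open, X[i - 1][j]) + gap_extend
            Y[i][j] = min(M[i][j - 1] + gap_open, X[i][j - 1] + gap_open, Y[i][j - 1]) + gap_extend
    return min(M[m][n], X[m][n], Y[m][n])


print(affine_gap_distance("AAAA", "AA"))  # 4: one gap of length 2
```

The three matrices still have (m+1) x (n+1) entries each, so the running time stays proportional to m x n. Setting `gap_open=0` recovers the plain per-character indel cost.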

Variations of Pairwise Alignment (1)
Local alignment (approximate pattern matching)
  – where s is relatively short with respect to t and we seek the subunit of t with which s aligns best
  – Given $0{:}s{:}m$ and $0{:}t{:}n$, find $i{:}t{:}j$ such that $d_w(s,\ i{:}t{:}j)$ is minimal among all choices of $0 \le i \le j \le n$.
Local alignment recursion
  – no cost for deletion of a prefix $0{:}t{:}i$
  – no cost for deletion of a suffix $j{:}t{:}n$
  – $d_{m,n}$ gives the cost of the optimal local alignment; $i{:}t{:}j$ is found by:
      $j = \min \{ k \mid d_{m,k} = d_{m,n} \}$
      $i$ is the point where the optimal path leading to $d_{m,j}$ starts from the 1st row
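
A possible sketch of this modified recursion under unit costs: the first row is all zeroes (free prefix deletion), and horizontal steps in the last row are free (free suffix deletion). Instead of a separate traceback, the code carries along the first-row column at which each optimal path started; that bookkeeping and the function name are choices made for this example.

```python
def best_substring_match(s, t):
    """Return (cost, i, j) such that t[i:j] is a substring of t that s aligns best with."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    start = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        start[0][j] = j  # d[0][j] stays 0: skipping a prefix of t is free
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + 1
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            horizontal = 0 if i == m else 1  # skipping a suffix of t is free
            candidates = [
                (d[i - 1][j - 1] + sub, start[i - 1][j - 1]),
                (d[i - 1][j] + 1, start[i - 1][j]),
                (d[i][j - 1] + horizontal, start[i][j - 1]),
            ]
            d[i][j], start[i][j] = min(candidates, key=lambda c: c[0])
    cost = d[m][n]
    j = min(k for k in range(n + 1) if d[m][k] == cost)
    return cost, start[m][j], j


cost, i, j = best_substring_match("GATT", "CCGATTACA")
print(cost, "CCGATTACA"[i:j])  # 0 GATT
```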

Variations of Pairwise Alignment (2)
Local similarity
  – asks for those subunits of s and t that exhibit the most similarity
  – uses a similarity rather than a distance measure
      $w(a, b) > 0$ if a, b are similar
      $w(a, b) < 0$ if a, b are not similar
      in particular, $w(a, -) < 0$ and $w(-, b) < 0$
  – score 0 serves as a cut-off value between subsequences with/without similarity
      long stretches of dissimilarity show as regions of zeroes in the matrix
      stretches of local similarity rise as islands of positive values
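
The zero cut-off is the defining feature of the Smith-Waterman style of local similarity search, which the sketch below follows. The scoring values (match +2, mismatch -1, gap -1) are assumptions chosen only to satisfy the sign conditions on w.

```python
def local_similarity(s, t, match=2, mismatch=-1, gap=-1):
    """Return (best score, gapped piece of s, gapped piece of t)."""
    m, n = len(s), len(t)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            # Clamping at 0 lets a new "island" start anywhere in the matrix.
            h[i][j] = max(0, h[i - 1][j - 1] + w, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the highest cell until a zero is reached.
    i, j = best_pos
    row_s, row_t = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        w = match if s[i - 1] == t[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + w:
            row_s.append(s[i - 1]); row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-")
            i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


print(local_similarity("GGGACGTCCC", "TTTACGTAAA"))  # (8, 'ACGT', 'ACGT')
```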

Heuristic Methods
Edit distance calculation complexity
  – for input sequences $0{:}s{:}m$ and $0{:}t{:}n$, DP calculates $m \times n$ matrix entries; time complexity is $O(m \times n)$
  – to get only the edit distance, only one column (or one row) of the matrix needs to be stored; space complexity is $O(m)$ or $O(n)$
  – to retrace the optimal path, the whole matrix needs to be stored; space complexity is then also $O(m \times n)$
Heuristic methods approximate the optimal alignment with a time complexity close to $O(m+n)$
  – trading precision for speed
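
The linear-space observation can be sketched as follows: each row of the matrix depends only on the row above it, so two rows suffice when only the distance is needed. The sketch assumes unit costs, which are symmetric and therefore allow swapping the arguments so that the shorter sequence determines the row length.

```python
def edit_distance_linear_space(s, t):
    """Unit-cost edit distance keeping only two matrix rows in memory."""
    if len(t) > len(s):
        s, t = t, s  # rows have len(t) + 1 entries, so make t the shorter one
    prev = list(range(len(t) + 1))  # row 0: j insertions
    for i, a in enumerate(s, 1):
        curr = [i]  # column 0: i deletions
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j - 1] + (a != b),  # match / replacement
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
            ))
        prev = curr
    return prev[-1]


print(edit_distance_linear_space("kitten", "sitting"))  # 3
```

The running time is unchanged at m x n steps; only the memory drops, and the optimal path can no longer be retraced from what is stored.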

Multiple Alignment
Helpful for protein structure prediction and evolutionary history inference
A multiple alignment of k sequences is a rectangular array of k rows which resemble the corresponding sequences when the gap character is ignored, with each column containing at least one character different from "-".
Two ways to formulate a cost/weight function
  – columns-first
  – pairs-first
An optimal multiple alignment is one with minimum overall cost, or maximal overall similarity
  – based on SP-cost ("sum-of-pairs")
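
A small sketch of the sum-of-pairs cost computed both ways. It assumes unit costs and that a gap aligned with a gap costs 0; under that assumption the columns-first and pairs-first totals coincide. Function names are illustrative.

```python
from itertools import combinations


def pair_cost(a, b):
    """Assumed unit cost; two equal symbols (including gap against gap) cost 0."""
    return 0 if a == b else 1


def sp_cost_columns_first(rows):
    """Score each column over all pairs of rows, then add up the columns."""
    return sum(
        sum(pair_cost(a, b) for a, b in combinations(column, 2))
        for column in zip(*rows)
    )


def sp_cost_pairs_first(rows):
    """Score each induced pairwise alignment, then add up the pairs."""
    return sum(
        sum(pair_cost(a, b) for a, b in zip(r1, r2))
        for r1, r2 in combinations(rows, 2)
    )


alignment = ["AC-GT", "ACAGT", "A--GT"]
print(sp_cost_columns_first(alignment), sp_cost_pairs_first(alignment))  # 4 4
```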

MSA by Standard DP and Heuristics
DP matrix → DP hyperlattice
  – taking time in $O(2^k \prod_{i=1,\dots,k} |s_i|)$ and space in $O(\prod_{i=1,\dots,k} |s_i|)$
  – NP-hard with regard to the number of sequences under the SP measure
Alignment along a phylogenetic tree
  – tree generation through all optimal pairwise alignments
  – most similar pairs aligned first, before aligning alignments
  – not necessarily optimal due to error accumulation
  – "sequences" → "profiles"
  – "sum-of-pairs" → "scoring along a tree"
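
To get a feel for why the exact DP approach does not scale, the following sketch counts the cells of the hyperlattice and the predecessor cells each one consults. It assumes one lattice axis of length |s_i| + 1 per sequence, mirroring the (m+1) x (n+1) matrix of the pairwise case; the function names are illustrative.

```python
from math import prod


def hyperlattice_size(lengths):
    """Number of cells: one axis of length |s_i| + 1 for each of the k sequences."""
    return prod(n + 1 for n in lengths)


def predecessors_per_cell(k):
    """Each interior cell looks back along every non-empty subset of the k axes."""
    return 2 ** k - 1


for k in (2, 3, 4):
    lengths = [100] * k
    print(k, hyperlattice_size(lengths), predecessors_per_cell(k))
```

For k = 2 this reduces to the familiar matrix with 3 predecessors per cell (diagonal, vertical, horizontal); each additional sequence of length 100 multiplies the number of cells by about 100.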

More Interesting Topics
  – Phylogenetic trees
  – Genetic algorithms and protein folding
  – RNA secondary structure prediction
  – Protein structures
  – Finding instances of known/unknown sites
  – etc.

References
  – Online Lectures on Bioinformatics
  – Biocomputing Hypertext Coursebook, bielefeld.de/bcd/Curric/welcome.html
  – Lecture Notes on Biological Sequence Analysis, res/tompa00lecture.pdf