Aligning Genomes Genome Analysis, 12 Nov 2007 Several slides shamelessly stolen from Chr. Storm.

Motivation To compare sequences we typically first need to identify homologous segments. This essentially means constructing an alignment of the sequences.

“Classical” local and global alignment

a c g t - g t c a a c g - t
a c g t c g t a - - g c t a

Gap / indel columns contain a “-”; the remaining columns are match / sub columns.

Pairwise global alignment A global pairwise alignment is a path in the dynamic programming table. Essentially similar for local alignments. Heuristics needed for multiple alignments.

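Pairwise global alignment

$$
\mathrm{Cost}(i, j) = \max
\begin{cases}
0 & \text{if } i = 0 \text{ and } j = 0 \\
\mathrm{Cost}(i-1, j-1) + \mathrm{subcost}(A[i], B[j]) \\
\mathrm{Cost}(i-1, j) + \mathrm{gapcost} \\
\mathrm{Cost}(i, j-1) + \mathrm{gapcost}
\end{cases}
$$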

Pairwise global alignment Backtracking to obtain alignment from DP table. Space and runtime complexity O(nm) with n length of first sequence and m length of second sequence.

Pairwise global alignment For the biologists: O(nm) means “order of” or proportional to n times m

Pairwise global alignment Say we want to align: Human Chr 22 ( ≈ bp) vs. Mouse Chr 19 ( ≈ bp) Running time: x op / op/sec = 347 days
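The recurrence and the backtracking step can be sketched in a few lines of Python. This is only an illustration: the function name `global_align` and the scores (match +1, mismatch -1, gap -2) are assumptions, not values from the slides. Because the recurrence takes a max, the gap cost is negative here. Both time and space are O(nm).

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment by dynamic programming (scores are assumed values)."""
    n, m = len(a), len(b)
    # cost[i][j] = best score for aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = max(cost[i - 1][j - 1] + sub,   # match / sub column
                             cost[i - 1][j] + gap,       # gap in b
                             cost[i][j - 1] + gap)       # gap in a

    # Backtrack from the bottom-right cell to recover one optimal path.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_align("acgt", "agt")
print(score)
print(top)
print(bottom)
```

The full table is kept in memory because backtracking needs it, which is exactly what makes the method impractical for whole chromosomes.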

Heuristics... the art of the possible... Try to detect and report as many biologically reasonable similarities as possible within reasonable time (and space)...

Banding Observation: Realistic alignments do not stray much from the diagonal of the DP table.

Banding

Time and space are now O(b*max{n,m}), where b is the band width.

Banding Consider again Human Chr 22 ( ≈ bp) vs. Mouse Chr 19 ( ≈ bp) with, say, band width 5000 Running time: 5000 x op / op/sec = 3000 sec = 50 min

Banding But beware: Banding is very sensitive to large (or a high number of) indels.
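A minimal sketch of banding, assuming the same illustrative scores as in a plain global alignment (match +1, mismatch -1, gap -2) and a hypothetical function name `banded_score`. Only cells with |i - j| <= band are filled in. To keep it short, the sketch returns the score only and keeps two rows at a time; recovering the alignment itself would need the whole band stored, which gives the O(b*max{n,m}) space bound.

```python
NEG_INF = float("-inf")


def banded_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score restricted to cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        # The final cell (n, m) lies outside the band, so no path can reach it.
        raise ValueError("band too narrow for these sequence lengths")

    # Row 0: only columns 0..band are inside the band.
    prev = {j: j * gap for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            best = NEG_INF
            if j > 0 and (j - 1) in prev:          # diagonal move
                sub = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, prev[j - 1] + sub)
            if j in prev:                          # move from the row above
                best = max(best, prev[j] + gap)
            if j > 0 and (j - 1) in cur:           # move from the left
                best = max(best, cur[j - 1] + gap)
            cur[j] = best
        prev = cur
    return prev[m]


print(banded_score("acgt", "agt", band=1))
```

If the best alignment needs more than `band` net insertions or deletions at some point, its path leaves the band and the sketch reports a worse score (or raises an error), which is the sensitivity to indels noted above.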

Anchors Observation: Highly similar sub-sequences will be part of a realistic alignment.

Anchors

General idea: Find “large scale” similarities. Decide on a “good” selection of these to align. Deal with the much smaller subproblems of aligning/examining “the rest” independently...

Finding anchors Finding the anchors is a local alignment problem, but the Smith-Waterman algorithm, which also takes O(nm) time, will of course not work at this scale. There are several approaches to solving this problem, with either exact or approximate anchor matching. The next few slides show one of these approaches, one that finds exactly matching anchors.

Maximal Unique Match MUMs are sequences in genomes A and B that:
● Occur exactly once in A and in B
● Are not contained in any larger matching sequence
(Figure labels: the match occurs only here in A and in B, with a mismatch at both ends.)

M. tuberculosis CDC1551 vs H37Rv

Finding MUMs MUMs can efficiently be found using suffix trees...

Suffix trees

MUMs in a (generalized) suffix tree Consider cgacta and agacag, which contain gac as a MUM. (Suffix tree generated using an online tool.) A MUM is the path-label of an internal node with exactly two children, both of them leaves, one from each genome. This implies uniqueness and right-maximality. Left-maximality can be checked by lookup in the genomes. Total time is O(n+m), the size of the suffix tree...
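To make the MUM definition concrete, here is a brute-force sketch. It is deliberately not the suffix tree method: it tries every pair of start positions, so it is far slower than O(n+m) and only suitable for toy inputs. The names `find_mums` and `count_occurrences` are made up for this illustration.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=1):
    """Return (pos_a, pos_b, length) for every maximal unique match of a and b."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Left-maximality: skip if the match could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Right-maximality: extend to the right as far as possible.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Uniqueness: the matched string occurs exactly once in each genome.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


for pos_a, pos_b, length in find_mums("cgacta", "agacag"):
    print(pos_a, pos_b, "cgacta"[pos_a:pos_a + length])
```

On the two example strings from the slide this reports the single MUM gac, at position 1 (0-based) in both sequences.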

Putting the MUMs together Sort the MUMs by position in genome A and find the Longest Increasing Subsequence of their positions in genome B. Easy to solve in time O(k^2) using dynamic programming (can be done in time O(k log k)), where k is the number of MUMs.
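The O(k log k) variant can be sketched with binary search over the smallest possible end position for each chain length. The sketch assumes each MUM is given as a hypothetical `(pos_a, pos_b)` pair of start positions and ignores MUM lengths and possible overlaps between neighbouring MUMs.

```python
from bisect import bisect_left


def longest_colinear_chain(mums):
    """Longest chain of MUMs whose positions increase in both genomes.

    mums: iterable of (pos_a, pos_b) start positions (an assumed representation).
    """
    ordered = sorted(mums)              # order by position in genome A
    tail_values = []                    # tail_values[l]: smallest pos_b ending a chain of length l + 1
    tail_index = []                     # index into `ordered` of that chain's last MUM
    previous = [-1] * len(ordered)      # back-pointers for recovering the chain

    for idx, (_, pos_b) in enumerate(ordered):
        l = bisect_left(tail_values, pos_b)
        if l == len(tail_values):
            tail_values.append(pos_b)
            tail_index.append(idx)
        else:
            tail_values[l] = pos_b
            tail_index[l] = idx
        previous[idx] = tail_index[l - 1] if l > 0 else -1

    # Walk the back-pointers from the end of the longest chain.
    chain = []
    idx = tail_index[-1] if tail_index else -1
    while idx != -1:
        chain.append(ordered[idx])
        idx = previous[idx]
    return chain[::-1]


example = [(1, 10), (5, 2), (8, 4), (12, 6), (15, 3), (20, 8)]
print(longest_colinear_chain(example))
```

MUMs left out of the chain are the ones that would break co-linearity; the gaps between consecutive chained MUMs are the smaller subproblems that remain to be aligned.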

Co-linearity All these algorithms consider only co-linear sequences. But only very closely related species have co-linear genomes!

Helicobacter pylori strain vs J99

Dealing with rearrangements Need an initial step for recognizing co-linear blocks that can then be aligned (heuristically, or exactly with the dynamic programming method).

Dealing with rearrangements Locate “high scoring” local alignments (anchors).

Dealing with rearrangements Identify larger, likely co-linear blocks by combining local alignments, possibly excluding alignment of the same nucleotide more than once in one or both sequences.

Dealing with rearrangements Non-anchor parts of co-linear blocks can be handled recursively or through dynamic programming alignments.

Example

Multiple sequence alignments The dynamic programming approach doesn't scale to multiple sequences (and several heuristics have the same problem).

Multiple sequence alignments Yet again heuristics are needed!

Multiple sequence alignments A common approach is progressive alignment where sequences are aligned pairwise along a “guide tree”

Summary Structural rearrangements and the large size of genomes complicate aligning them. Heuristics are needed to
● Identify co-linear segments
● Align large segments
● Handle multiple sequences