
Alignment Problem (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one. Sub-optimal (heuristic) alignment algorithms are also very important: e.g. BLAST

Key Issues Types of alignments (local vs. global) The scoring system The alignment algorithm Measuring alignment significance

Types of Alignment Global: sequences aligned from end to end. Local: alignments may start in the middle of either sequence. Ungapped: no insertions or deletions are allowed. Other types: overlap alignments, repeated match alignments.

Local vs. Global Pairwise Alignments A global alignment includes all elements of the sequences and includes gaps. A global alignment may or may not include "end gap" penalties. Global alignments are better indicators of homology and take longer to compute. A local alignment includes only subsequences, and sometimes is computed without gaps. Local alignments can find shared domains in divergent proteins and are fast to compute
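
As a rough illustration of the global/local distinction, the sketch below computes the optimal score with the standard dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). The function name and the match/mismatch/gap values are assumptions chosen for the example, not values from the slides, and only a simple linear gap penalty is modelled.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score (illustrative scoring values).

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            dp[i][0] = i * gap
        for j in range(1, cols):
            dp[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            dp[i][j] = cell
            best = max(best, cell)
    return best if local else dp[-1][-1]


print(alignment_score("ACGT", "ACT"))                        # global
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))   # local
```

Both variants fill the same table, so in this simple form they have the same asymptotic cost; the practical speed advantage of local methods comes mostly from heuristics such as BLAST.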

How do you compare alignments? Scoring scheme What events do we score? Matches Mismatches Gaps What scores will you give these events? What assumptions are you making? Score your alignment

Scoring Matrices How do you determine scores? What is out there already for your use? DNA versus Amino Acids? TTACGGAGCTTC CTGAGATCC
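
To make the scoring questions concrete, here is a minimal sketch that scores one fixed alignment of the two DNA strings shown on this slide. The gapped layout of the second sequence and the score values (+1 match, -1 mismatch, -2 gap) are assumptions made for illustration; a real scoring matrix would assign different values to different substitutions.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Score two equal-length alignment rows, with '-' marking a gap."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap        # insertion or deletion
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# One hypothetical way to align the slide's sequences (not necessarily optimal).
top = "TTACGGAGCTTC"
bottom = "CT---GAGATCC"
print(score_alignment(top, bottom))  # 6 matches, 3 mismatches, 3 gaps -> -3
```

Changing the three parameters changes which alignment looks best, which is exactly the assumption the slide asks you to examine.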

Multiple Sequence Alignment Global versus Local Alignments Progressive alignment Estimate guide tree Do pairwise alignment on subtrees ClustalX

Improvements Consistency-based Algorithms T-Coffee: uses a consistency-based objective function to minimize potential errors. Generates pairwise global (Clustal) and local (Lalign) alignments, then combines and reweights them before progressive alignment.

Iterative Algorithms Estimate draft progressive alignment (uncorrected distances) Improved progressive (reestimate guide tree using Kimura 2-parameter) Refinement - divide into 2 subtrees, estimate two profiles, then re-align 2 profiles Continue refinement until convergence
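
The difference between an uncorrected distance and a Kimura 2-parameter distance can be sketched as follows. The formula is the standard K2P correction, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions; the function names are hypothetical, and this is not taken from any particular alignment program's source.

```python
import math

PURINES = {"A", "G"}


def site_differences(s1, s2):
    """Return (compared sites, transitions, transversions), skipping gapped columns."""
    sites = transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == "-" or y == "-":
            continue
        sites += 1
        if x != y:
            # Purine<->purine or pyrimidine<->pyrimidine is a transition.
            if (x in PURINES) == (y in PURINES):
                transitions += 1
            else:
                transversions += 1
    if sites == 0:
        raise ValueError("no ungapped sites to compare")
    return sites, transitions, transversions


def p_distance(s1, s2):
    """Uncorrected distance: fraction of compared sites that differ."""
    sites, ts, tv = site_differences(s1, s2)
    return (ts + tv) / sites


def k2p_distance(s1, s2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    sites, ts, tv = site_differences(s1, s2)
    p, q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

The corrected distance is always at least as large as the uncorrected one, because it accounts for multiple substitutions at the same site.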

Software Clustal T-Coffee MUSCLE (limited models) MAFFT (wide variety of models)

Comparisons Speed: MUSCLE > MAFFT > CLUSTALW > T-COFFEE. Accuracy: MAFFT > MUSCLE > T-COFFEE > CLUSTALW. Lots more work to do here!

Why Genome Sequencing?

Modern Sequencing Methods Sanger (1977) introduced a sequencing method amenable to automation. Whole-genome sequencing: Clone-By-Clone vs. Shotgun Assembly. Drosophila melanogaster sequenced (Myers et al. 2000). Homo sapiens sequenced (Venter et al. 2001).

Main idea: Obtain fragments of all possible lengths, ending in A, C, T, G. Using gel electrophoresis, we can separate fragments of differing lengths, and then assemble them. Sanger (1977) introduced chain-termination sequencing.
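
The read-out step can be sketched as a toy simulation: every fragment is represented only by its length and its terminal base, and sorting by length (the role played by the gel) recovers the sequence. This is a deliberate simplification with hypothetical function names; it ignores strand complementarity, missing fragment lengths, and base-calling errors.

```python
def sanger_fragments(seq):
    """One (length, terminal base) pair for every prefix of the sequence."""
    return [(length, seq[length - 1]) for length in range(1, len(seq) + 1)]


def read_gel(fragments):
    """Order fragments by length, as a gel does, and read off the terminal bases."""
    return "".join(base for _, base in sorted(fragments))


fragments = sanger_fragments("GATTACA")
fragments.reverse()            # fragment order in the tube carries no information
print(read_gel(fragments))     # GATTACA
```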

Automated Sequencing Perkin-Elmer 3700: Can sequence ~500bp with 98.5% accuracy

Reads and Contigs Sequencing machines are limited to about ~ bp, so we must break up DNA into short and long fragments, with reads on either end. Reads are then assembled into contigs, then scaffolds.

Clone-by-Clone vs. Shotgun Traditionally, long fragments are mapped, and then assembled by finding a minimum tiling path. Then, shotgun assembly is used to sequence long fragments. Shotgun assembly is cheaper, but requires more computational resources. Drosophila was successfully sequenced using shotgun assembly.

In a Perfect World

Difficulties? Good coverage does not guarantee that we can “see” repeats. Read coverage is generally not “truly” random, due to complications in fragmentation and cloning. Any automated approach requires extensive post-processing. Phrap

The Fruit Fly Drosophila melanogaster was sequenced in 2000 using whole genome shotgun assembly. Genome size is ~120Mbp for euchromatic (coding) portion, with roughly 13,600 genes. The genome is still being refined.

NIH used a Clone-By-Clone strategy; Celera used shotgun assembly. Celera used 300 sequencing machines in parallel to obtain 175,000 reads per day. Efforts were combined, resulting in 8x coverage of the human genome; consensus sequence is 2.91 billion base pairs.

Abstraction The basic question is: given a set of fragments from a long string, can we reconstruct the string? What is the shortest common superstring of the given fragments?

Overlap-Layout-Consensus Construct a (directed) overlap graph, where nodes represent reads and edges represent overlaps. Paths in this graph are contigs. Problem: Find the consensus sequence by finding a path that visits all nodes in the layout graph. Note: This is an idealization, since we must handle errors!
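
A minimal sketch of the overlap step, assuming error-free reads and exact suffix-prefix matches: nodes are reads, and a directed edge from a to b is labelled with the length of the overlap. The function names and the `min_overlap` threshold are assumptions for illustration; real assemblers use approximate matching and indexing rather than this all-pairs comparison.

```python
def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_overlap=3):
    """Directed overlap graph as {read: {next_read: overlap_length}}."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                k = suffix_prefix_overlap(a, b)
                if k >= min_overlap:
                    graph[a][b] = k
    return graph


reads = ["ACGTAC", "TACGGA", "GGATTC"]
for read, edges in overlap_graph(reads).items():
    print(read, "->", edges)
```

In this toy example the graph is a single path through all three reads, which is the layout the slide describes.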

Approximation Algorithms The shortest common superstring problem is NP-complete. Greedily choosing edges is a 4-approximation, conjectured to be a 2-approximation. Another idea: TSP has a 2-approximation if the edge weights are metric (Waterman et al. give such metrics).
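
The greedy strategy can be sketched as repeatedly merging the pair of fragments with the largest exact overlap. This is an illustrative implementation under the assumption of error-free fragments; the function names are hypothetical, and the quadratic search per merge is kept for clarity rather than speed.

```python
def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy approximation to the shortest common superstring."""
    # Drop duplicates and fragments contained in another fragment.
    unique = sorted(set(fragments))
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = suffix_prefix_overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


print(greedy_superstring(["ACGTAC", "TACGGA", "GGATTC"]))  # ACGTACGGATTC
```

The result always contains every input fragment, but it is only guaranteed to be within a constant factor of the optimum, not optimal.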

Handling Repeats Based on the overall coverage, we can estimate how many reads should overlap a given region. Repeats will “seem” to have unusually good coverage. Celera’s algorithms are proprietary, but there is no explicit way to handle repeats in the overlap-layout-consensus paradigm.

The Big Picture

Hybridization Suppose we had a way to probe fragments of length k that were present in our sequence, from a hybridization assay. Commercial products: Affymetrix GeneChip, Agilent, Amersham, etc.

Sequencing-By-Hybridization Then instead of reads, we have regularly sized fragments, k-mers. Construct a multigraph G with (k-1)-mers as nodes, with edges representing k-mers. G is a de Bruijn graph. Idea: An Eulerian path in G corresponds to the assembled sequence, and we don’t lose repeats (Pevzner 1989).
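
The construction above can be sketched as follows, assuming an error-free and complete k-mer spectrum with multiplicities, and a graph that actually contains an Eulerian path. The function names are hypothetical; the path is found with Hierholzer's algorithm.

```python
from collections import defaultdict


def kmers_of(seq, k):
    """All k-mers of seq, with repeats, in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(kmers):
    """Multigraph with (k-1)-mers as nodes and one edge per k-mer."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = sorted(set(graph) | set(in_deg))
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in nodes if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
                 nodes[0])
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(kmers):
    """Spell out the sequence along an Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn(kmers))
    return path[0] + "".join(node[-1] for node in path[1:])


spectrum = kmers_of("ATGGCGTGCA", 3)
print(assemble(spectrum))
```

When the graph has more than one Eulerian path, as it does for this spectrum, the assembled string uses exactly the same k-mers as the original but need not be identical to it.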

Bridges of Königsberg Theorem (Euler 1736): A graph has a path visiting every edge exactly once if and only if it is connected and has 2 or fewer vertices of odd degree.
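
Euler's condition is easy to check directly. The sketch below applies it to an undirected multigraph given as a list of edges; the labelling of the Königsberg land masses as A to D is an assumption made for the example.

```python
from collections import Counter, defaultdict


def has_eulerian_path(edges):
    """True if the undirected multigraph has a path using every edge exactly once."""
    degree = Counter()
    adjacent = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacent[u].add(v)
        adjacent[v].add(u)
    if not degree:
        return True
    # Connectivity check by depth-first search over vertices that have edges.
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adjacent[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    if len(seen) != len(degree):
        return False
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return odd <= 2


# Seven bridges between four land masses: every vertex has odd degree.
konigsberg = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]
print(has_eulerian_path(konigsberg))  # False
```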

Pros and Cons An Eulerian path in a graph can be found in linear time, if one exists. Errors in the hybridization experiments may prevent us from finding a solution. Can we just use reads as “virtual” hybridization data?

Graph Preprocessing Read errors mean up to k missing/erroneous edges. But we cannot correct this until we are done assembling! Greedily mutate reads to minimize size of set of k-mers. We also need to deal with repeats, which requires contracting certain paths to single edges…

Sizes of genomes and numbers of genes

Sequencing parameters Difficulty and cost of large-scale sequencing projects depend on the following parameters: Accuracy: how many errors are tolerated. Coverage: how many times the same region is sequenced. The two parameters are related: more coverage usually means higher accuracy. Accuracy is also dependent on the finishing effort.

Sequence accuracy Highly accurate sequences are needed for the following: diagnostics (e.g., forensics, identifying disease alleles in a patient) and protein-coding prediction (one insertion or deletion changes the reading frame). Lower accuracy is sufficient for homology searches, since differences in sequence are tolerated by search programs.
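
The effect of a single insertion on the reading frame can be seen with a small sketch that splits a sequence into codons. The example sequence and the inserted base are made up for illustration.

```python
def codons(seq):
    """Split a sequence into complete codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


original = "ATGGCCAAGTAA"
mutated = original[:3] + "T" + original[3:]   # one erroneous inserted base

print(codons(original))  # ['ATG', 'GCC', 'AAG', 'TAA']
print(codons(mutated))   # ['ATG', 'TGC', 'CAA', 'GTA']
```

Every codon after the insertion changes, and the stop codon TAA is no longer read in frame, which is why a single sequencing error of this kind can derail protein-coding prediction.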

Sequence accuracy and sequencing cost Level of accuracy determines cost of project Increasing accuracy from one error in 100 to one error in 10,000 increases costs three to fivefold Need to determine appropriate level of accuracy for each project If reference sequence already exists, then a lower level of accuracy should suffice Can find genes in genome, but not their position

Sequencing coverage Coverage is the number of times the same region is sequenced Ideally, one wants an equal number of sequences in each direction To obtain accuracy of one error in 10,000 bases, one needs the following: 10 x coverage Stringent finishing Complete sequence Base-perfect sequencing
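
Coverage can be estimated with simple arithmetic: the total number of sequenced bases divided by the genome length. The sketch below also includes the usual Poisson (Lander-Waterman style) approximation for the fraction of bases left unsequenced, which assumes reads are placed uniformly at random. The function names and the numbers in the example are hypothetical.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_length


def reads_needed(genome_length, read_length, target_coverage):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


def uncovered_fraction(cov):
    """Expected fraction of bases hit by no read, under a Poisson model."""
    return math.exp(-cov)


# Hypothetical project: a 50 kbp clone sequenced with 500 bp reads.
print(coverage(1000, 500, 50_000))        # 10.0
print(reads_needed(50_000, 500, 10))      # 1000
print(uncovered_fraction(10))             # about 4.5e-05
```

In practice read placement is not truly random, so the uncovered fraction from this model should be read as an optimistic estimate.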

NCBI Genome Summary NCBI