Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Slides:



Advertisements
Similar presentations
Locating conserved genes in whole genome scale Prudence Wong University of Liverpool June 2005 joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU),
Advertisements

Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Multiple Sequence Alignment
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
JM - 1 Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming Jarek Meller Jarek Meller Division.
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Efficient Clustering of Large EST Data Sets on Parallel Computers CECS Bioinformatics Journal Club September 17, 2003 Nucleic Acids Research, 2003,
Next Generation Sequencing, Assembly, and Alignment Methods
DNA Sequencing.
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
Heuristic Local Alignerers 1.The basic indexing & extension technique 2.Indexing: techniques to improve sensitivity Pairs of Words, Patterns 3.Systems.
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Find Gene Structures in DNA Intergene State First Exon State Intron State.
DNA Sequencing Lecture 9, Tuesday April 29, 2003.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
Linear-Space Alignment. Linear-space alignment Using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, N – k) PLUS the backpointers.
DNA Sequencing. The Walking Method 1.Build a very redundant library of BACs with sequenced clone- ends (cheap to build) 2.Sequence some “seed” clones.
DNA Sequencing Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host)
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Assembly.
DNA Sequencing and Assembly
CS273a Lecture 4, Autumn 08, Batzoglou Fragment Assembly (in whole-genome shotgun sequencing) CS273a Lecture 5.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS262 Lecture 9, Win07, Batzoglou Real-world protein aligners MUSCLE  High throughput  One of the best in accuracy ProbCons  High accuracy  Reasonable.
DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA.
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
Genome sequencing and assembling
Phylogenetic Tree Construction and Related Problems Bioinformatics.
Sequence alignment, E-value & Extreme value distribution
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson
Chapter 5 Multiple Sequence Alignment.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Sequence Alignment.
Mouse Genome Sequencing
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina
Genome sequencing Haixu Tang School of Informatics.
A Sequenciação em Análises Clínicas Polymerase Chain Reaction.
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Human Genome.
1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.
Pairwise Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 4, 2004 ChengXiang Zhai Department of Computer Science University.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Chapter 5 Sequence Assembly: Assembling the Human Genome.
OPERA highthroughput paired-end sequences Reconstructing optimal genomic scaffolds with.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Genome Research 12:1 (2002), Assembly algorithm outline ● Input and trimming ● Overlap detection ● Error correction ● Evaluation of alignments.
CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly
DNA Sequencing Project
Fragment Assembly (in whole-genome shotgun sequencing)
Removing Erroneous Connections
Alignment of Long Sequences
Sequence Alignment 11/24/2018.
CSE 589 Applied Algorithms Spring 1999
Presentation transcript:

Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003

ARACHNE: Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT.. 2. Merge good pairs of reads into longer contigs 3. Link contigs to form supercontigs

Lecture 10, Thursday May 1, Link Contigs into Supercontigs Too dense: Overcollapsed? (Myers et al. 2000) Inconsistent links: Overcollapsed? Normal density

Lecture 10, Thursday May 1, 2003 Find all links between unique contigs 3. Link Contigs into Supercontigs (cont’d) Connect contigs incrementally, if  2 links

Lecture 10, Thursday May 1, 2003 Fill gaps in supercontigs with paths of overcollapsed contigs 3. Link Contigs into Supercontigs (cont’d)

Lecture 10, Thursday May 1, 2003 Define G = ( V, E ) V := contigs E := ( A, B ) such that d( A, B ) < C Reason to do so: Efficiency; full shortest paths cannot be computed 3. Link Contigs into Supercontigs (cont’d) d ( A, B ) Contig A Contig B

Lecture 10, Thursday May 1, Link Contigs into Supercontigs (cont’d) Contig A Contig B Define T: contigs linked to either A or B Fill gap between A and B if there is a path in G passing only from contigs in T

Lecture 10, Thursday May 1, Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting

Lecture 10, Thursday May 1, 2003 Simulated Whole Genome Shotgun Known genomes Flu, yeast, fly, Human chromosomes 21, 22 Make “realistic” shotgun reads Run ARACHNE Align output with genome and compare

Lecture 10, Thursday May 1, 2003 Making a Simulated Read Simulated reads have error patterns taken from random real reads ERRORIZER Simulated read artificial shotgun read real read

Lecture 10, Thursday May 1, 2003 Human 22, Results of Simulations Plasmid/ Cosmid cov 10 X / 0.5 X 5 X / 0.5 X 3 X/ 0 X N50 contig353 Kb15 Kb2.7 Kb Mean contig142 Kb10.6 Kb2.0 Kb N50 scaffold3 Mb 4.1 Kb Avg base qual % > 2 kb

Lecture 10, Thursday May 1, 2003 Neurospora crassa Genome (Real Data) 40 Mb genome, shotgun sequencing complete (WI-CGR) Coverage: 1705 contigs 368 supercontigs 1% uncovered (of finished BACs) Evaluated assembly using 1.5Mb of finished BACs Efficiency: Time: 20 hr Memory: 9 Gb Accuracy: < 3 misassemblies compared with 1 Gb of finished sequence Errors/10 6 letters: Subst. 260 Indel: 164

Lecture 10, Thursday May 1, 2003 Mouse Genome Improved version of ARACHNE assembled the mouse genome Several heuristics of iteratively: Breaking supercontigs that are suspicious Rejoining supercontigs Size of problem: 32,000,000 reads Time: 15 days, 1 processor Memory: 28 Gb N50 Contig size: 16.3 Kb  24.8 Kb N50 Supercontig size:.265 Mb  16.9 Mb

Lecture 10, Thursday May 1, 2003 Next few lectures More on alignments Large-scale global alignment – Comparing entire genomes Suffix trees, sparse dynamic programming MumMer, Avid, LAGAN, Shuffle-LAGAN Multiple alignment – Comparing proteins, many genomes Scoring, Multidimensional-DP, Center-Star, Progressive alignment CLUSTALW, TCOFFEE, MLAGAN Gene recognition Gene recognition on a single genome GENSCAN – A HMM for gene recognition Cross-species comparison-based gene recognition TWINSCAN – A HMM SLAM – A pair-HMM

Rapid Global Alignments How to align genomic sequences in (more or less!) linear time

Lecture 10, Thursday May 1, 2003 Motivation Genomic sequences are very long: –Human genome = 3 x 10 9 –long –Mouse genome = 2.7 x 10 9 –long Aligning genomic regions is useful for revealing common gene structure –Useful to compare regions > 1,000,000-long

Lecture 10, Thursday May 1, 2003 Main Idea Genomic regions of interest contain ordered islands of similarity –E.g. genes 1.Find local alignments 2.Chain an optimal subset of them

Lecture 10, Thursday May 1, 2003 Outline Methods to FIND Local Alignments –Sorting k-long words –Suffix Trees Methods to CHAIN Local Alignments –Dynamic Programming –Sparse Dynamic Programming

Methods to FIND Local Alignments 1.Sorting K-long words BLAST, BLAT, and the like 2.Suffix Trees

Lecture 10, Thursday May 1, 2003 Finding Local Alignments: Sorting k-long words Given sequences x, y: 1.Write down all (w, 0, i):w = x i+1 …x i+k (z, 1, j): z = y j+1 …y j+k 2.Sort them lexicographically 3.Deduce all k-long matches between x and y 4.Extend to local alignments

Lecture 10, Thursday May 1, 2003 Sorting k-long words: example Let x, y be matched with 3-long words: x = caggc:(cag,0,0), (agg,0,1), (ggc,0,2) y = ggcag: (ggc,1,0), (gca,1,1), (cag,1,2) Sorted: (agg,0,1),(cag,0,0),(cag,1,2),(ggc,0,2),(ggc,1,0),(gca,1,1) Matches: 1. cag: x 1 x 2 x 3 = y 3 y 4 y 5 2. ggc: x 3 x 4 x 5 = y 1 y 2 y 3

Lecture 10, Thursday May 1, 2003 Running time Worst case: O(NxM) In practice: a large value of k results in a short list of matches Tradeoff: Low k: worse running time High k: significant alignments missed PatternHunter: Sampling non-consecutive positions increases the likelihood to detect a conserved region, for a fixed value of k – refer to Lecture 3

Lecture 10, Thursday May 1, 2003 Suffix Trees Suffix trees are a method to find all maximal matches between two strings (and much more) Example: x = dabdac d ab d a c c a b d a c c c c a d b

Lecture 10, Thursday May 1, 2003 Definition of a Suffix Tree Definition: For string x = x 1 …x m, a suffix tree is: –A rooted tree with m leaves Leaf i: x i …x m –Each edge is a substring –No two edges out of a node, start with same letter It follows, every substring corresponds to an initial part of a path from root to a leaf

Lecture 10, Thursday May 1, 2003 Constructing a Suffix Tree Naïve algorithm: O( N 2 ) time Better algorithms: O( N ) time (outside the scope of this class – too technical and not so interesting) Memory: O( N ) but with a sizeable constant

Lecture 10, Thursday May 1, 2003 Naïve Algorithm to Construct a Suffix Tree 1.Initialize tree T: a single root node r 2.Insert special symbol $ at end of x 3.For j = 1 to m Find longest match of x i …x m to T, starting from r Split edge where match stops: new node w Create edge (w, j), and label with unmatched portion of x i …x m

Lecture 10, Thursday May 1, 2003 Example of Suffix Tree Construction 1 x = d a b d a $ d ab d a $ 1. Insert d a b d a $ a b d a $ 2 2. Insert a b d a $ $ a d b 3 3. Insert b d a $ $ 4 4. Insert d a $ $ 5 5. Insert a $ $ 6 6. Insert $

Lecture 10, Thursday May 1, 2003 Faster Construction Several algorithms O( N ) time, O( N ) memory with a big constant Technical but not deep, outside the scope of this course Optional: Gusfield, chapter 6

Lecture 10, Thursday May 1, 2003 Memory to Store Suffix Tree Can store in O( N ) memory! Every edge is labeled with (i, j): (i,j) denotes x i …x j Tree has O( N ) nodes Proof: 1.# leafs  # nodes – 1 2.# leafs = |x|

Lecture 10, Thursday May 1, 2003 Application: Find all Matches Between x and y 1.Build suffix tree for x, mark nodes with x 2.Insert y in suffix tree, mark all nodes y “passes from” with y –The path label of every node marked both 0 and 1, is a common substring

Lecture 10, Thursday May 1, 2003 Example of Suffix Tree Construction for x, y 1 x = d a b d a $ y = a b a d a $ d ab d a $ 1. Construct tree for x a b d a $ 2 $ a d b 3 $ 4 $ 5 $ 6 x x x 6. Insert a $ Insert $ 4. Insert a d a $ d a $ 3 5. Insert d a $ y 4 2. Insert a b a d a $ a y d a $ 1 y y x 3. Insert b a d a $ a d y 2 a $ x

Lecture 10, Thursday May 1, 2003 Application: Online Search of Strings on a Database Say a database D = { s 1, s 2, …s n } (eg. proteins) Question: given new string x, find all matches of x to database 1.Build suffix tree for {s 1,…, s n } 2.All new queries x take O( |x| ) time (somewhat like BLAST)

Lecture 10, Thursday May 1, 2003 Application: Common Substrings of k Strings Say we want to find the longest common substring of s 1, s 2, …s n 1.Build suffix tree for s 1,…, s n 2.All nodes labeled {s i1, …, s ik } represent a match between s i1, …, s ik